# Enhanced PKL Processing Test Notebook

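This notebook tests the integrated enhanced PKL processing module. It loads the aethalometer PKL data, runs `process_pkl_data_enhanced` with the optional Ethiopia pneumatic pump fix, and validates the corrected output against the original uncleaned data.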

In [1]:
# %%
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Import the new enhanced PKL processing module
from data.qc.enhanced_pkl_processing import process_pkl_data_enhanced, EnhancedPKLProcessor
from config.notebook_config import NotebookConfig
from notebook_utils.pkl_cleaning_integration import create_enhanced_setup

# Your existing configuration
config = NotebookConfig(
    site_code='ETAD',
    wavelength='Red',
    quality_threshold=10,
    output_format='jpl',
    min_samples_for_analysis=30,
    confidence_level=0.95,
    outlier_threshold=3.0,
    figure_size=(12, 8),
    font_size=10,
    dpi=300
)

# Set your data paths (same as before)
base_data_path = "/Users/ahzs645/Library/CloudStorage/GoogleDrive-ahzs645@gmail.com/My Drive/University/Research/Grad/UC Davis Ann/NASA MAIA/Data"

config.aethalometer_files = {
    'pkl_data': os.path.join(
        base_data_path,
        "Aethelometry Data/Kyan Data/Mergedcleaned and uncleaned MA350 data20250707030704",
        "df_uncleaned_Jacros_API_and_OG.pkl"
    ),
    'csv_data': os.path.join(
        base_data_path,
        "Aethelometry Data/Raw",
        "Jacros_MA350_1-min_2022-2024_Cleaned.csv"
    )
}

config.ftir_db_path = os.path.join(
    base_data_path,
    "EC-HIPS-Aeth Comparison/Data/Original Data/Combined Database",
    "spartan_ftir_hips.db"
)

# Create enhanced setup
setup = create_enhanced_setup(config)

✅ Advanced plotting style configured
🚀 Aethalometer-FTIR/HIPS Pipeline with Simplified Setup
📊 Configuration Summary:
   Site: ETAD
   Wavelength: Red
   Output format: jpl
   Quality threshold: 10 minutes
   Output directory: outputs

📁 File paths:
   pkl_data: ✅ df_uncleaned_Jacros_API_and_OG.pkl
   csv_data: ✅ Jacros_MA350_1-min_2022-2024_Cleaned.csv
   FTIR DB: ✅ spartan_ftir_hips.db
🧹 Enhanced setup with PKL cleaning capabilities loaded
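
The setup cell hard-codes an absolute base path, and the second validation cell repeats it. A minimal sketch of one alternative follows: read the base path from an environment variable and report which configured files exist before loading. The variable name `MAIA_DATA_DIR` and both helper functions are hypothetical; they are not part of `NotebookConfig` or `create_enhanced_setup`.

```python
import os
from pathlib import Path


def resolve_base_path(env_var="MAIA_DATA_DIR", default="."):
    """Return the data root from an environment variable (hypothetical name)."""
    return Path(os.environ.get(env_var, default)).expanduser()


def check_files(base, relative_paths):
    """Map each label to (full_path, exists) so missing inputs surface early."""
    result = {}
    for label, rel in relative_paths.items():
        full = base / rel
        result[label] = (full, full.is_file())
    return result


if __name__ == "__main__":
    base = resolve_base_path()
    # File names taken from the notebook; the relative layout is an assumption.
    report = check_files(base, {
        "pkl_data": "df_uncleaned_Jacros_API_and_OG.pkl",
        "csv_data": "Jacros_MA350_1-min_2022-2024_Cleaned.csv",
    })
    for label, (path, exists) in report.items():
        print(f"{label}: {'found' if exists else 'missing'} ({path})")
```

With a helper like this, both cells could build their paths from the same `base` value instead of repeating the literal string.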


In [2]:
# %%
print("📁 Loading datasets...")
datasets = setup.load_all_data()

# Get PKL data
pkl_data_original = setup.get_dataset('pkl_data')

# Quick fix for datetime_local issue (same as before)
if 'datetime_local' not in pkl_data_original.columns:
    if pkl_data_original.index.name == 'datetime_local':
        print("✅ Converting datetime_local from index to column...")
        pkl_data_original = pkl_data_original.reset_index()
    elif hasattr(pkl_data_original.index, 'tz'):
        print("✅ Creating datetime_local column from datetime index...")
        pkl_data_original['datetime_local'] = pkl_data_original.index
        pkl_data_original = pkl_data_original.reset_index(drop=True)

print(f"📊 PKL data ready: {pkl_data_original.shape}")
print(f"📅 Date range: {pkl_data_original['datetime_local'].min()} to {pkl_data_original['datetime_local'].max()}")

📁 Loading datasets...
📦 Setting up modular system...
✅ Aethalometer loaders imported
✅ Database loader imported
✅ Plotting utilities imported
✅ Plotting style configured
✅ Successfully imported 5 modular components

📁 LOADING DATASETS
📁 Loading all datasets...

📊 Loading pkl_data
📁 Loading pkl_data: df_uncleaned_Jacros_API_and_OG.pkl
Detected format: standard
Set 'datetime_local' as DatetimeIndex for time series operations
Converted 17 columns to JPL format
✅ Modular load: 1,665,156 rows × 238 columns
📊 Method: modular
📊 Format: jpl
📊 Memory: 7443.05 MB
🧮 BC columns: 30
📈 ATN columns: 25
📅 Time range: 2021-01-09 16:38:00 to 2025-06-26 23:18:00
✅ pkl_data loaded successfully

📊 Loading csv_data
📁 Loading csv_data: Jacros_MA350_1-min_2022-2024_Cleaned.csv
Set 'Time (Local)' as DatetimeIndex for time series operations
Converted 5 columns to JPL format
✅ Modular load: 1,095,086 rows × 77 columns
📊 Method: modular
📊 Format: j

In [9]:
# %%
# ENHANCED PKL PROCESSING with optional Ethiopia fix

# 🎛️ Configuration: Toggle Ethiopia fix here
APPLY_ETHIOPIA_FIX = True  # Set to True to enable Ethiopia pneumatic pump fix

print(f"🚀 Enhanced PKL Processing {'WITH' if APPLY_ETHIOPIA_FIX else 'WITHOUT'} Ethiopia Fix")
print("=" * 60)

pkl_data_cleaned = process_pkl_data_enhanced(
    pkl_data_original,
    wavelengths_to_filter=['IR', 'Blue'],
    export_path=f'pkl_data_cleaned_{"ethiopia" if APPLY_ETHIOPIA_FIX else "standard"}',
    apply_ethiopia_fix=APPLY_ETHIOPIA_FIX,  # 🔧 Ethiopia fix toggle
    site_code='ETAD' if APPLY_ETHIOPIA_FIX else None,
    verbose=True
)

print(f"\n✅ Processing complete: {pkl_data_cleaned.shape}")

# Show what Ethiopia corrections were added (if any)
if APPLY_ETHIOPIA_FIX:
    ethiopia_cols = [col for col in pkl_data_cleaned.columns if any(x in col for x in ['corrected', 'manual', 'optimized', 'denominator'])]
    if ethiopia_cols:
        print(f"\n🔧 Ethiopia correction columns added ({len(ethiopia_cols)}):")
        for col in sorted(ethiopia_cols)[:10]:  # Show first 10
            print(f"  • {col}")
        if len(ethiopia_cols) > 10:
            print(f"  ... and {len(ethiopia_cols)-10} more")
    else:
        print("\n⚠️ No Ethiopia correction columns found")
else:
    print("\n📊 Standard processing - no Ethiopia corrections applied")

🚀 Enhanced PKL Processing WITH Ethiopia Fix
🚀 Enhanced PKL Data Processing Pipeline
🔧 Comprehensive Preprocessing Pipeline
Step 1: Processing datetime...

Step 2: Fixing column names...
✅ Renamed 16 columns

Step 3: Converting data types...
Converted IR ATN1 to float.
Converted UV ATN1 to float.
Converted Blue ATN1 to float.
Converted Green ATN1 to float.
Converted Red ATN1 to float.
✅ Applied calibration.convert_to_float()

Step 4: Adding Session ID...

Step 5: Adding delta calculations...
✅ Applied calibration.add_deltas()

Step 6: Final adjustments...
✅ Filtered to 2022+: 1,665,156 -> 1,627,058 rows
🔄 Applying DEMA Smoothing...

Processing IR wavelength...
  Available BC columns: ['IR BC1', 'IR BC2', 'IR BCc']
  ✅ Created IR BC1 smoothed
  ✅ Created IR BC2 smoothed
  ✅ Created IR BCc smoothed

Processing Blue wavelength...
  Available BC columns: ['Blue BC1', 'Blue BC2', 'Blue BCc']
  ✅ Created Blue BC1 smoothed
  ✅ Created Blue BC2 smoothed
  ✅ Creat
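
The log above reports a "DEMA Smoothing" step that creates a smoothed column for each BC series. The notebook does not show how that step is implemented, so the following is only a sketch of the textbook technique, assuming DEMA here means the double exponential moving average, `2 * EMA(x) - EMA(EMA(x))`. The function names and the `span` parameter are illustrative and are not taken from `enhanced_pkl_processing`.

```python
def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1).

    The average is seeded with the first value of the series.
    """
    alpha = 2.0 / (span + 1)
    out = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out


def dema(values, span):
    """Double EMA: 2 * EMA(x) - EMA(EMA(x)).

    Subtracting the twice-smoothed series cancels most of the lag
    that a single EMA introduces on trending data.
    """
    first = ema(values, span)
    second = ema(first, span)
    return [2 * a - b for a, b in zip(first, second)]


if __name__ == "__main__":
    ramp = [float(i) for i in range(50)]
    print("last raw value :", ramp[-1])
    print("last EMA value :", round(ema(ramp, 10)[-1], 3))
    print("last DEMA value:", round(dema(ramp, 10)[-1], 3))
```

On a steadily rising series the plain EMA trails the input by a roughly constant offset, while the DEMA tracks it much more closely, which is the usual reason for choosing it over a single EMA.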

In [14]:
# %%
# VALIDATION: Ethiopia fix validation (only runs if fix was applied)

if APPLY_ETHIOPIA_FIX:
    print("📊 Ethiopia Fix Validation:")
    print("=" * 50)
    
    # Use the validation functions from the site corrections module
    from src.data.processors.site_corrections import SiteCorrections, apply_ethiopia_fix
    
    # Create a small sample for comparison (to demonstrate the fix)
    sample_size = 10000
    sample_original = pkl_data_original.head(sample_size)
    
    print(f"\n🔬 Running validation on sample data ({sample_size:,} rows)...")
    
    # Apply just the Ethiopia fix (without full processing) for comparison
    sample_with_fix = apply_ethiopia_fix(sample_original, verbose=True)
    
    # Create corrector for validation
    corrector = SiteCorrections(site_code='ETAD', verbose=False)
    
    # Validate the fix
    validation_results = corrector.validate_corrections(
        sample_original,  # Original
        sample_with_fix,  # With Ethiopia fix only
        wavelength='IR'
    )
    
    print("\n📈 Validation Results:")
    for key, value in validation_results.items():
        if isinstance(value, dict):
            print(f"  {key}:")
            for subkey, subval in value.items():
                if isinstance(subval, float):
                    print(f"    {subkey}: {subval:.6f}")
                else:
                    print(f"    {subkey}: {subval}")
        else:
            if isinstance(value, float):
                print(f"  {key}: {value:.6f}")
            else:
                print(f"  {key}: {value}")
    
    # Check correlation improvements
    if 'original_atn_correlation' in validation_results and 'corrected_atn_correlation' in validation_results:
        orig_corr = validation_results['original_atn_correlation'] 
        corr_corr = validation_results['corrected_atn_correlation']
        improvement = abs(orig_corr) - abs(corr_corr)
        
        print(f"\n🎯 Key Improvement Metric:")
        print(f"  Original BCc-ATN1 correlation: {orig_corr:.6f}")
        print(f"  Corrected BCc-ATN1 correlation: {corr_corr:.6f}")
        print(f"  Improvement: {improvement:.6f} ({'✅ Better!' if improvement > 0 else '⚠️ Check data'})")
        
        if improvement > 0:
            print(f"  🎉 Ethiopia fix successfully reduced correlation by {improvement:.6f}")
else:
    print("📊 Ethiopia Fix Validation: SKIPPED")
    print("=" * 50)
    print("Set APPLY_ETHIOPIA_FIX = True to run validation")

print(f"\n📊 Final Data Summary:")
print("=" * 50)
print(f"Shape: {pkl_data_cleaned.shape}")
print(f"Date range: {pkl_data_cleaned['datetime_local'].min()} to {pkl_data_cleaned['datetime_local'].max()}")

# Check key columns
key_cols = ['datetime_local', 'IR ATN1', 'IR BCc', 'Blue ATN1', 'Blue BCc', 'Flow total (mL/min)']
for col in key_cols:
    status = "✅" if col in pkl_data_cleaned.columns else "❌"
    print(f"  {status} {col}")

# Show Ethiopia-specific columns if present
ethiopia_specific = [col for col in pkl_data_cleaned.columns if any(x in col for x in ['corrected', 'manual', 'optimized', 'denominator'])]
if ethiopia_specific:
    print(f"  🔧 Ethiopia corrections: {len(ethiopia_specific)} columns")

memory_mb = pkl_data_cleaned.memory_usage(deep=True).sum() / 1024 / 1024
print(f"  💾 Memory usage: {memory_mb:.1f} MB")

print(f"\n✅ {'Ethiopia-enhanced' if APPLY_ETHIOPIA_FIX else 'Standard'} processing complete!")

📊 Ethiopia Fix Validation:


ModuleNotFoundError: No module named 'src'
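
The likely cause of this error is the import prefix. The first cell puts the `src` directory itself on `sys.path`, so packages inside it are importable by their own names (as in `from data.qc.enhanced_pkl_processing import ...`), while `src.` is not a resolvable top-level package unless its parent directory is also on the path. The sketch below reproduces that behavior with a throwaway directory; the names `srcdemo_root`, `demo_processors`, and `site_fix` are hypothetical stand-ins, not the project's modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(name):
    """Return True if `name` imports cleanly, False on ModuleNotFoundError."""
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


# Hypothetical layout mirroring <project>/src/<package>/<module>.py
project = Path(tempfile.mkdtemp())
pkg_dir = project / "srcdemo_root" / "demo_processors"
pkg_dir.mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "site_fix.py").write_text("VALUE = 1\n")

# Same move as the setup cell: put the source root itself on sys.path.
sys.path.insert(0, str(project / "srcdemo_root"))
importlib.invalidate_caches()

# The source root is on the path, but its parent is not, so the
# prefixed name cannot be resolved while the bare package name can.
with_prefix = try_import("srcdemo_root.demo_processors.site_fix")
without_prefix = try_import("demo_processors.site_fix")
print("import with root prefix works:   ", with_prefix)
print("import without root prefix works:", without_prefix)
```

If the project follows the same layout, dropping the `src.` prefix from the failing import would be the matching change, assuming `data/processors/site_corrections.py` exists under `src`.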

In [14]:
# %%
# VALIDATION: Compare original uncleaned data vs Ethiopia-enhanced processed data

print("📊 Ethiopia Fix Validation:")
print("=" * 50)

# Load both original uncleaned data and Ethiopia-enhanced processed data for comparison
try:
    import pandas as pd
    import os
    
    # Use the same paths as your config
    base_data_path = "/Users/ahzs645/Library/CloudStorage/GoogleDrive-ahzs645@gmail.com/My Drive/University/Research/Grad/UC Davis Ann/NASA MAIA/Data"
    
    original_file = os.path.join(
        base_data_path,
        "Aethelometry Data/Kyan Data/Mergedcleaned and uncleaned MA350 data20250707030704",
        "df_uncleaned_Jacros_API_and_OG.pkl"
    )
    ethiopia_file = 'pkl_data_cleaned_ethiopia.pkl'
    
    print("📁 Loading datasets for comparison...")
    print(f"  🔍 Original uncleaned: {os.path.basename(original_file)}")
    print(f"  🔧 Ethiopia processed: {os.path.basename(ethiopia_file)}")
    
    # Load datasets
    datasets = {}
    
    # Load original uncleaned data
    try:
        datasets["Original"] = pd.read_pickle(original_file)
        print(f"  ✅ Loaded Original: {datasets['Original'].shape}")
    except FileNotFoundError:
        print(f"  ❌ Original file not found: {original_file}")
    except Exception as e:
        print(f"  ⚠️ Error loading original: {e}")
    
    # Load Ethiopia-enhanced processed data
    try:
        datasets["Ethiopia"] = pd.read_pickle(ethiopia_file)
        print(f"  ✅ Loaded Ethiopia: {datasets['Ethiopia'].shape}")
    except FileNotFoundError:
        print(f"  ❌ Ethiopia file not found: {ethiopia_file}")
        print("  💡 Make sure you've run processing with APPLY_ETHIOPIA_FIX = True")
    except Exception as e:
        print(f"  ⚠️ Error loading Ethiopia data: {e}")
    
    # Validation if we have both datasets
    if "Original" in datasets and "Ethiopia" in datasets:
        print(f"\n🔬 Comparing Original Uncleaned vs Ethiopia-Enhanced Processed:")
        print("=" * 70)
        
        original_data = datasets["Original"]
        ethiopia_data = datasets["Ethiopia"]
        
        # Basic comparison
        print(f"📊 Dataset transformation:")
        print(f"  Original uncleaned: {original_data.shape} ({original_data.shape[1]} columns)")
        print(f"  Ethiopia processed: {ethiopia_data.shape} ({ethiopia_data.shape[1]} columns)")
        
        # Check Ethiopia-specific columns in processed data
        ethiopia_cols = [col for col in ethiopia_data.columns if any(x in col for x in ['corrected', 'manual', 'optimized', 'denominator'])]
        if ethiopia_cols:
            print(f"  🔧 Ethiopia corrections added: {len(ethiopia_cols)} columns")
            
            # Group by wavelength
            wavelengths = ['IR', 'Blue', 'Red', 'Green', 'UV']
            correction_summary = []
            for wl in wavelengths:
                wl_cols = [col for col in ethiopia_cols if wl in col]
                if wl_cols:
                    correction_summary.append(f"{wl}({len(wl_cols)})")
            
            print(f"  📈 Corrections by wavelength: {', '.join(correction_summary)}")
        
        # Detailed validation for key wavelengths
        print(f"\n🎯 Ethiopia Fix Impact Analysis:")
        print("-" * 50)
        
        for wl in ['IR', 'Blue']:
            print(f"\n{wl} Wavelength Analysis:")
            
            # Column names
            original_bcc = f'{wl} BCc'  # or might be BC1 in original
            if original_bcc not in original_data.columns and f'{wl} BC1' in original_data.columns:
                original_bcc = f'{wl} BC1'
            
            corrected_bcc = f'{wl} BCc_corrected'
            atn_col = f'{wl} ATN1'
            
            # Check if required columns exist
            has_original_bcc = original_bcc in original_data.columns
            has_corrected_bcc = corrected_bcc in ethiopia_data.columns
            has_atn_original = atn_col in original_data.columns
            has_atn_ethiopia = atn_col in ethiopia_data.columns
            
            print(f"  📋 Column availability:")
            print(f"    Original {original_bcc}: {'✅' if has_original_bcc else '❌'}")
            print(f"    Ethiopia {corrected_bcc}: {'✅' if has_corrected_bcc else '❌'}")
            print(f"    {atn_col}: {'✅' if has_atn_original and has_atn_ethiopia else '❌'}")
            
            if has_original_bcc and has_corrected_bcc and has_atn_original and has_atn_ethiopia:
                try:
                    # Get sample of data for analysis (use smaller sample for speed)
                    sample_size = min(50000, len(original_data), len(ethiopia_data))
                    
                    # Original data analysis
                    orig_sample = original_data.head(sample_size)
                    orig_bcc_data = orig_sample[original_bcc].dropna()
                    orig_atn_data = orig_sample[atn_col].dropna()
                    common_orig = orig_bcc_data.index.intersection(orig_atn_data.index)
                    
                    # Ethiopia data analysis
                    eth_sample = ethiopia_data.head(sample_size)
                    eth_bcc_data = eth_sample[corrected_bcc].dropna()
                    eth_atn_data = eth_sample[atn_col].dropna()
                    common_eth = eth_bcc_data.index.intersection(eth_atn_data.index)
                    
                    if len(common_orig) > 100 and len(common_eth) > 100:
                        # Calculate correlations
                        orig_corr = orig_sample.loc[common_orig, original_bcc].corr(orig_sample.loc[common_orig, atn_col])
                        corr_corr = eth_sample.loc[common_eth, corrected_bcc].corr(eth_sample.loc[common_eth, atn_col])
                        
                        # Calculate basic statistics
                        orig_mean = orig_sample[original_bcc].mean()
                        corr_mean = eth_sample[corrected_bcc].mean()
                        
                        print(f"  📊 Statistical comparison:")
                        print(f"    Original {original_bcc} mean: {orig_mean:.3f}")
                        print(f"    Corrected BCc mean: {corr_mean:.3f}")
                        print(f"    Mean difference: {abs(corr_mean - orig_mean):.3f}")
                        
                        print(f"  🎯 Correlation with {atn_col}:")
                        print(f"    Original correlation: {orig_corr:.6f}")
                        print(f"    Ethiopia corrected: {corr_corr:.6f}")
                        
                        if not (pd.isna(orig_corr) or pd.isna(corr_corr)):
                            improvement = abs(orig_corr) - abs(corr_corr)
                            print(f"    Improvement: {improvement:.6f} ({'✅ Better!' if improvement > 0 else '⚠️ Check data'})")
                            
                            if improvement > 0:
                                improvement_pct = (improvement / abs(orig_corr)) * 100
                                print(f"    🎉 {improvement_pct:.1f}% correlation reduction!")
                                
                                # Ethiopia fix effectiveness
                                if abs(corr_corr) < 0.1:  # Near zero correlation
                                    print(f"    ✨ Excellent: Near-zero correlation achieved!")
                                elif improvement > 0.1:  # Significant improvement
                                    print(f"    👍 Good: Significant correlation reduction")
                                else:
                                    print(f"    📈 Moderate improvement")
                        else:
                            print(f"    ⚠️ Could not calculate correlation improvement")
                    else:
                        print(f"    ⚠️ Insufficient data for analysis (orig: {len(common_orig)}, eth: {len(common_eth)})")
                        
                except Exception as e:
                    print(f"    ⚠️ Error in analysis: {e}")
            else:
                missing_cols = []
                if not has_original_bcc:
                    missing_cols.append(f"Original: {original_bcc}")
                if not has_corrected_bcc:
                    missing_cols.append(f"Ethiopia: {corrected_bcc}")
                if not (has_atn_original and has_atn_ethiopia):
                    missing_cols.append(f"ATN: {atn_col}")
                print(f"    ❌ Missing columns: {', '.join(missing_cols)}")
        
        # Summary of Ethiopia fix benefits
        print(f"\n✅ Ethiopia Fix Summary:")
        print("=" * 30)
        print(f"📈 Data processing: {original_data.shape[0]:,} → {ethiopia_data.shape[0]:,} rows")
        # The net column difference also includes smoothing and other derived columns,
        # so report the correction columns separately.
        print(f"🔧 Columns added: {ethiopia_data.shape[1] - original_data.shape[1]} total ({len(ethiopia_cols)} correction columns)")
        print(f"🎯 Primary benefit: Reduced BCc-ATN1 correlation (pneumatic pump fix)")
        print(f"📊 Quality control: Applied comprehensive cleaning pipeline")
        print(f"🧹 DEMA smoothing: Added for noise reduction")
        
    elif "Ethiopia" in datasets:
        # Only Ethiopia data available - validate it has corrections
        ethiopia_data = datasets["Ethiopia"]
        ethiopia_cols = [col for col in ethiopia_data.columns if any(x in col for x in ['corrected', 'manual', 'optimized', 'denominator'])]
        
        print(f"\n✅ Ethiopia-enhanced data loaded successfully!")
        print(f"🔧 Found {len(ethiopia_cols)} correction columns")
        print("⚠️ Original uncleaned data not available for comparison")
        
        if ethiopia_cols:
            wavelengths = ['IR', 'Blue', 'Red', 'Green', 'UV']
            for wl in wavelengths:
                wl_cols = [col for col in ethiopia_cols if wl in col]
                if wl_cols:
                    print(f"  {wl}: {len(wl_cols)} corrections")
    
    else:
        print(f"\n⚠️ Could not load required files for comparison")
        if "Original" not in datasets:
            print("❌ Original uncleaned data not found")
        if "Ethiopia" not in datasets:
            print("❌ Ethiopia processed data not found - run processing with APPLY_ETHIOPIA_FIX = True")

except Exception as e:
    print(f"❌ Error during validation: {e}")

print(f"\n📊 Current Session Data:")
print("=" * 50)
if 'pkl_data_cleaned' in locals():
    print(f"Shape: {pkl_data_cleaned.shape}")
    print(f"Date range: {pkl_data_cleaned['datetime_local'].min()} to {pkl_data_cleaned['datetime_local'].max()}")
    
    # Show Ethiopia-specific columns if present
    ethiopia_specific = [col for col in pkl_data_cleaned.columns if any(x in col for x in ['corrected', 'manual', 'optimized', 'denominator'])]
    if ethiopia_specific:
        print(f"🔧 Ethiopia corrections in current data: {len(ethiopia_specific)} columns")
        has_ethiopia_fix = True
    else:
        has_ethiopia_fix = False
    
    print(f"✅ Current data: {'Ethiopia-enhanced' if has_ethiopia_fix else 'Standard'} processing")
else:
    print("No current data available - run the processing cell first")

📊 Ethiopia Fix Validation:
📁 Loading datasets for comparison...
  🔍 Original uncleaned: df_uncleaned_Jacros_API_and_OG.pkl
  🔧 Ethiopia processed: pkl_data_cleaned_ethiopia.pkl
  ✅ Loaded Original: (1665156, 239)
  ✅ Loaded Ethiopia: (1477783, 318)

🔬 Comparing Original Uncleaned vs Ethiopia-Enhanced Processed:
📊 Dataset transformation:
  Original uncleaned: (1665156, 239) (239 columns)
  Ethiopia processed: (1477783, 318) (318 columns)
  🔧 Ethiopia corrections added: 27 columns
  📈 Corrections by wavelength: IR(6), Blue(6), Red(5), Green(5), UV(5)

🎯 Ethiopia Fix Impact Analysis:
--------------------------------------------------

IR Wavelength Analysis:
  📋 Column availability:
    Original IR BCc: ✅
    Ethiopia IR BCc_corrected: ✅
    IR ATN1: ✅
  📊 Statistical comparison:
    Original IR BCc mean: 2790.828
    Corrected BCc mean: 6093.255
    Mean difference: 3302.426
  🎯 Correlation with IR ATN1:
    Original correlation: 0.417843
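
The validation cell measures the fix as `abs(orig_corr) - abs(corr_corr)`, the drop in the absolute BCc-ATN1 correlation. It takes the first `sample_size` rows of each dataset, and because the processed data is filtered to 2022+ and has fewer rows, the two samples may cover different time periods. The sketch below shows the same metric computed only on timestamps present in all three series. It is a plain-Python illustration using dictionaries keyed by timestamp; the function names are hypothetical and this is not the logic of `validate_corrections`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return math.nan
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def is_valid(value):
    """Treat None and NaN as missing."""
    return value is not None and not math.isnan(value)


def correlation_reduction(orig_bc, corr_bc, atn):
    """Return (r_orig, r_corr, |r_orig| - |r_corr|) on shared timestamps.

    Each argument maps timestamp -> value. Only timestamps where all
    three series have a valid value are used, so both correlations
    describe the same rows.
    """
    shared = sorted(
        t for t in orig_bc.keys() & corr_bc.keys() & atn.keys()
        if is_valid(orig_bc[t]) and is_valid(corr_bc[t]) and is_valid(atn[t])
    )
    atn_vals = [atn[t] for t in shared]
    r_orig = pearson([orig_bc[t] for t in shared], atn_vals)
    r_corr = pearson([corr_bc[t] for t in shared], atn_vals)
    return r_orig, r_corr, abs(r_orig) - abs(r_corr)


if __name__ == "__main__":
    # Toy values for illustration only, not measurements from the notebook.
    atn = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0}
    orig_bc = {1: 2.0, 2: 4.0, 3: 6.0, 4: 8.0, 5: math.nan}
    corr_bc = {1: 1.0, 2: -1.0, 3: -1.0, 4: 1.0, 5: 3.0}
    print(correlation_reduction(orig_bc, corr_bc, atn))
```

A positive third value means the corrected series is less correlated with attenuation than the original over the same timestamps.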
   