# PKL Data Cleaning Pipeline Demo

This notebook demonstrates how to use the PKL data cleaning pipeline with a configurable data directory path.

The pipeline cleans aethalometer data stored in PKL format. Its steps include:
- Status-based cleaning using external calibration
- Optical saturation removal
- Flow validation and range checking
- Temperature change detection
- Roughness-based quality control
- DEMA smoothing
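
The last item, DEMA smoothing, usually refers to a double exponential moving average. A minimal sketch of the standard formula, `2 * EMA(x) - EMA(EMA(x))`, is shown here with pandas. The `span` value and the sample numbers are illustrative assumptions, and the pipeline's own implementation and parameters may differ.

```python
import pandas as pd

def dema(series: pd.Series, span: int = 10) -> pd.Series:
    """Double exponential moving average: 2 * EMA(x) - EMA(EMA(x))."""
    ema = series.ewm(span=span, adjust=False).mean()
    ema_of_ema = ema.ewm(span=span, adjust=False).mean()
    return 2 * ema - ema_of_ema

# Illustrative values only, not real measurements
raw = pd.Series([100.0, 140.0, 90.0, 130.0, 95.0, 125.0])
smoothed = dema(raw, span=3)
print(smoothed.round(1))
```

Compared with a single EMA, the correction term reduces the lag that smoothing introduces, at the cost of some overshoot on sharp changes.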

## Key Features
- **Configurable data directory**: Set your own path instead of relying on a hardcoded one
- **Modular design**: Use individual cleaning methods or the complete pipeline
- **External calibration**: The external calibration script is kept as-is
- **Comprehensive reporting**: Track how much data each step removes

## Setup and Configuration

First, let's configure the data directory path. **Change this to match your actual data location!**

In [2]:
import os
import pandas as pd
import numpy as np
from datetime import datetime

# =============================================================================
# CONFIGURE YOUR DATA DIRECTORY HERE
# =============================================================================

# Option 1: Absolute path on the author's machine (active setting; replace it with your own)
data_directory = "/Users/ahzs645/Library/CloudStorage/GoogleDrive-ahzs645@gmail.com/My Drive/University/Research/Grad/UC Davis Ann/NASA MAIA/Data/Aethelometry Data/Kyan Data/Mergedcleaned and uncleaned MA350 data20250707030704/"

# Option 2: Use absolute path (recommended for production)
# data_directory = "/Users/your-username/path/to/your/pkl/data/"

# Option 3: Use environment variable
# data_directory = os.getenv('PKL_DATA_PATH', '../JPL_aeth/')

# Option 4: Interactive input
# data_directory = input("Enter path to PKL data directory: ")

print(f"📁 Data directory configured: {data_directory}")
print(f"📍 Directory exists: {os.path.exists(data_directory)}")

if not os.path.exists(data_directory):
    print("⚠️  Warning: Directory does not exist. Please update the path above.")
else:
    print(f"✅ Found directory with {len(os.listdir(data_directory))} items")

📁 Data directory configured: /Users/ahzs645/Library/CloudStorage/GoogleDrive-ahzs645@gmail.com/My Drive/University/Research/Grad/UC Davis Ann/NASA MAIA/Data/Aethelometry Data/Kyan Data/Mergedcleaned and uncleaned MA350 data20250707030704/
📍 Directory exists: True
✅ Found directory with 2 items
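
`os.listdir` counts every entry, including subfolders and unrelated files. A possible refinement, assuming the pickle files use a `.pkl` extension (the helper name `list_pkl_files` is hypothetical and not part of the package), is to list only those files:

```python
from pathlib import Path

def list_pkl_files(directory, recursive=False):
    """Return sorted paths of .pkl files in `directory` (empty list if it is missing)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    pattern = "**/*.pkl" if recursive else "*.pkl"
    return sorted(p for p in root.glob(pattern) if p.is_file())

# "." stands in for the data_directory configured in the cell above
pkl_files = list_pkl_files(".")
print(f"Found {len(pkl_files)} .pkl file(s)")
for path in pkl_files[:5]:
    print(f"   {path.name}")
```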


## Import PKL Cleaning Modules

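Import the PKL cleaning functionality from the aethmodular package. The import assumes that the folder containing `src` is on Python's import path; if it is not, the cell fails with `ModuleNotFoundError`, as in the output recorded here.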

In [3]:
# Import PKL cleaning functionality
from src.data.qc import PKLDataCleaner, load_and_clean_pkl_data

print("✅ PKL cleaning modules imported successfully")

ModuleNotFoundError: No module named 'src'
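
The `ModuleNotFoundError` above typically means that the folder containing `src` is not on Python's import path. One way to address this, sketched under the assumption that the notebook runs from somewhere inside the repository (the helper `find_repo_root` is hypothetical), is to walk up from the working directory until a `src` folder is found and add its parent to `sys.path`:

```python
import sys
from pathlib import Path

def find_repo_root(start, marker="src"):
    """Return the nearest ancestor of `start` (inclusive) containing `marker`, or None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None

repo_root = find_repo_root(Path.cwd())
if repo_root is None:
    print("Could not find a 'src' folder above the current directory")
else:
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    print(f"Added to sys.path: {repo_root}")
```

Running this before the import cell should let `from src.data.qc import ...` resolve, provided the repository layout matches the assumption.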

## Method 1: Using the PKLDataCleaner Class (Recommended)

Create a PKLDataCleaner instance with your configured data directory. This is the recommended approach as it encapsulates the configuration.

In [None]:
# Create PKL data cleaner with custom data directory
cleaner = PKLDataCleaner(
    data_directory=data_directory,
    wavelengths_to_filter=['IR', 'Blue']  # Optional: customize wavelengths
)

print(f"🔧 PKL cleaner initialized with data directory: {cleaner.data_directory}")
print(f"📊 Wavelengths to filter: {cleaner.wls_to_filter}")

### Load and Clean Data Using the Instance Method

In [None]:
# Load and clean data using the instance method
# This will use the data_directory specified when creating the cleaner
try:
    df_cleaned = cleaner.load_and_clean_data(
        # Optional parameters for data loading
        verbose=True,
        summary=True,
        file_number_printout=True
    )
    
    print("\n📋 Cleaned data summary:")
    print(f"   Shape: {df_cleaned.shape}")
    print(f"   Date range: {df_cleaned['datetime_local'].min()} to {df_cleaned['datetime_local'].max()}")
    print(f"   Columns: {list(df_cleaned.columns[:10])}{'...' if len(df_cleaned.columns) > 10 else ''}")
    
except Exception as e:
    print(f"❌ Error loading data: {e}")
    print("💡 Make sure your data directory path is correct and contains PKL files")

## Method 2: Using the Standalone Function

Alternatively, call the standalone `load_and_clean_pkl_data` function and pass the directory path directly.

In [None]:
# Load and clean data using standalone function
try:
    df_cleaned_v2 = load_and_clean_pkl_data(
        directory_path=data_directory,
        verbose=False,
        summary=False
    )
    
    print("✅ Data loaded using standalone function")
    print(f"📋 Shape: {df_cleaned_v2.shape}")
    
except Exception as e:
    print(f"❌ Error with standalone function: {e}")

## Individual Cleaning Steps

You can also apply individual cleaning steps if you want more control over the process.

In [None]:
# Example: Apply individual cleaning steps
if 'df_cleaned' in locals():
    # Start with a subset for demonstration
    df_sample = df_cleaned.head(1000).copy()
    
    print("🔧 Applying individual cleaning steps:")
    print(f"   Original sample size: {len(df_sample)}")
    
    # Apply status cleaning
    df_step1 = cleaner.clean_by_status(df_sample)
    
    # Apply optical saturation cleaning
    df_step2 = cleaner.clean_optical_saturation(df_step1)
    
    # Apply flow range cleaning
    df_step3 = cleaner.clean_flow_range(df_step2)
    
    print(f"   Final sample size: {len(df_step3)}")
    print(f"   Total removed: {len(df_sample) - len(df_step3)} rows")
else:
    print("⏭️  Skipping individual steps demo (no data loaded)")
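
To see how many rows each step removes, the steps can be run through a small loop that records row counts. The sketch below is self-contained: `run_steps` and the two toy filters are hypothetical stand-ins, and the column names `status` and `flow` are assumptions for illustration. With a real cleaner, the list could instead hold methods such as `cleaner.clean_by_status`.

```python
import pandas as pd

def run_steps(df, steps):
    """Apply (name, function) steps in order and record rows removed by each."""
    report = []
    for name, func in steps:
        before = len(df)
        df = func(df)
        report.append({"step": name, "rows_before": before, "rows_removed": before - len(df)})
    return df, pd.DataFrame(report)

# Toy stand-ins for the real cleaning methods
def drop_bad_status(df):
    return df[df["status"] == 0]

def drop_flow_out_of_range(df):
    return df[df["flow"].between(90, 110)]

demo = pd.DataFrame({
    "status": [0, 0, 1, 0, 0],
    "flow": [100, 85, 100, 105, 120],
})
cleaned, report = run_steps(demo, [
    ("status", drop_bad_status),
    ("flow range", drop_flow_out_of_range),
])
print(report)
```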

## Data Quality Assessment

Check the cleaned data for its size, time span, missing values, and per-instrument row counts.

In [None]:
if 'df_cleaned' in locals():
    print("📊 Data Quality Assessment")
    print("=" * 50)
    
    # Basic statistics
    print(f"Total data points: {len(df_cleaned):,}")
    print(f"Time span: {(df_cleaned['datetime_local'].max() - df_cleaned['datetime_local'].min()).days} days")
    
    # Check for missing values
    missing_cols = df_cleaned.isnull().sum()
    missing_cols = missing_cols[missing_cols > 0]
    
    if len(missing_cols) > 0:
        print("\n❗ Columns with missing values:")
        for col, count in missing_cols.head(5).items():
            print(f"   {col}: {count:,} ({count/len(df_cleaned)*100:.2f}%)")
    else:
        print("\n✅ No missing values found")
    
    # Check unique instruments
    if 'Serial number' in df_cleaned.columns:
        instruments = df_cleaned['Serial number'].unique()
        print(f"\n📱 Instruments: {len(instruments)} unique")
        for inst in instruments[:5]:
            count = (df_cleaned['Serial number'] == inst).sum()
            print(f"   {inst}: {count:,} data points")
    
    # Display sample of cleaned data
    print("\n📋 Sample of cleaned data:")
    display_cols = ['datetime_local', 'Serial number'] if 'Serial number' in df_cleaned.columns else ['datetime_local']
    display_cols.extend([col for col in df_cleaned.columns if 'BC' in col][:3])
    print(df_cleaned[display_cols].head())
    
else:
    print("⏭️  No data available for quality assessment")

## Configuration for Different Environments

Here are examples of how to configure the data directory for different environments:

In [None]:
print("🔧 Configuration Examples")
print("=" * 40)

print("\n1️⃣  Development Environment:")
print('   cleaner = PKLDataCleaner(data_directory="../data/pkl/")')

print("\n2️⃣  Production Environment:")
print('   cleaner = PKLDataCleaner(data_directory="/opt/data/aethalometer/pkl/")')

print("\n3️⃣  Using Environment Variable:")
print('   import os')
print('   data_dir = os.getenv("AETH_DATA_PATH", "../JPL_aeth/")')
print('   cleaner = PKLDataCleaner(data_directory=data_dir)')

print("\n4️⃣  User Configuration File:")
print('   import json')
print('   with open("config.json") as f:')
print('       config = json.load(f)')
print('   cleaner = PKLDataCleaner(data_directory=config["pkl_data_path"])')

print("\n5️⃣  Command Line Argument:")
print('   import sys')
print('   data_dir = sys.argv[1] if len(sys.argv) > 1 else "../JPL_aeth/"')
print('   cleaner = PKLDataCleaner(data_directory=data_dir)')
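
These options can also be combined into a single lookup with a fixed precedence. The sketch below is one possible arrangement, not part of the package: the helper `resolve_data_dir` is hypothetical, while the `AETH_DATA_PATH` variable, the `pkl_data_path` key, and the `../JPL_aeth/` fallback are taken from the examples above.

```python
import json
import os
from pathlib import Path

def resolve_data_dir(cli_arg=None, env_var="AETH_DATA_PATH",
                     config_file="config.json", default="../JPL_aeth/"):
    """Pick the data directory: CLI argument, then env variable, then config file, then default."""
    if cli_arg:
        return cli_arg
    env_value = os.getenv(env_var)
    if env_value:
        return env_value
    config_path = Path(config_file)
    if config_path.is_file():
        with config_path.open() as f:
            config = json.load(f)
        if config.get("pkl_data_path"):
            return config["pkl_data_path"]
    return default

data_dir = resolve_data_dir()
print(f"Resolved data directory: {data_dir}")
```

The returned value can then be passed as `data_directory` when creating the cleaner, so the same notebook runs unchanged on different machines.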

## Next Steps

After cleaning your PKL data, you can:

1. **Quality Control Analysis**: Use other QC modules for comprehensive quality assessment
2. **Visualization**: Create plots and visualizations of the cleaned data
3. **Export Results**: Save cleaned data to various formats
4. **Integration**: Integrate with other analysis pipelines

Example integration with other QC modules:

In [None]:
# Example: Integrate with other QC modules
try:
    from src.data.qc import quick_quality_check
    
    if 'df_cleaned' in locals():
        print("🔍 Running quick quality check on cleaned PKL data:")
        quick_quality_check(df_cleaned.set_index('datetime_local'), freq='min')
    else:
        print("⏭️  No cleaned data available for quality check")
        
except ImportError as e:
    print(f"⚠️  QC modules not available: {e}")
except Exception as e:
    print(f"❌ Error running quality check: {e}")
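
For the export step listed above, pandas can write the cleaned frame back to disk. This is a minimal sketch using a small stand-in DataFrame; the helper `export_cleaned`, the `cleaned_output` folder name, and the `BC` column with its values are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_cleaned(df, out_dir, stem="cleaned_data"):
    """Write `df` as both pickle and CSV into `out_dir`; return the two paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pkl_path = out_dir / f"{stem}.pkl"
    csv_path = out_dir / f"{stem}.csv"
    df.to_pickle(pkl_path)            # keeps dtypes such as datetimes
    df.to_csv(csv_path, index=False)  # portable, but dtypes are re-parsed on load
    return pkl_path, csv_path

# Stand-in for df_cleaned
demo = pd.DataFrame({
    "datetime_local": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:01"]),
    "BC": [512.0, 498.5],
})

# Temporary location for the demo; point this at your own folder
out_dir = Path(tempfile.gettempdir()) / "cleaned_output"
pkl_path, csv_path = export_cleaned(demo, out_dir)
print(f"Wrote {pkl_path} and {csv_path}")
```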

## Summary

This notebook demonstrated:

✅ **Configurable Data Paths**: How to set custom data directory paths instead of hardcoded ones

✅ **Multiple Usage Patterns**: Class-based and function-based approaches

✅ **Individual Cleaning Steps**: Fine-grained control over the cleaning process

✅ **Quality Assessment**: Basic data quality checks after cleaning

✅ **Environment Configuration**: Examples for different deployment scenarios

The PKL cleaning pipeline is integrated into the aethmodular package structure, and the external calibration script is kept separate so that it can be updated easily.