# Simplified Aethalometer Data Analysis

This notebook demonstrates a **simplified** approach to aethalometer data analysis using the new modular system: configuration, data loading, and quality assessment are each a single call.

## 🎯 What's New?

- **One-line setup**: `setup, datasets = load_etad_data()`
- **Automatic quality assessment**: Built into the loading process
- **Automatic fallbacks**: Tries the modular loaders first and falls back to direct loading when they are unavailable
- **Clean data access**: Simple methods to get exactly what you need
- **Easy customization**: Configuration-driven approach

## 📋 Comparison

**OLD**: 200+ lines of complex setup code  
**NEW**: 2 lines to load everything (one import plus one call to `load_etad_data()`)

Let's see it in action! 🚀

## 1. 🚀 Configurable Data Loading

You now have full control over the configuration. The setup below shows:
- **Explicit configuration parameters** - easily change site code, wavelength, quality thresholds
- **Clear file paths** - see exactly what files are being loaded
- **Multiple options** - use default configs, create custom ones, or modify parameters

This replaces 200+ lines of complex setup code from the original notebook:

In [None]:
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Import the necessary modules for more explicit configuration
from notebook_utils.setup import NotebookSetup, create_custom_config
from config.notebook_config import NotebookConfig, ConfigurationManager

# Option 1: Use default ETAD configuration with explicit parameters
config = NotebookConfig(
    site_code='ETAD',
    wavelength='Red',  # Choose from: 'Red', 'Blue', 'Green', 'UV', 'IR'
    quality_threshold=10,  # Maximum missing minutes for "excellent" quality
    output_format='jpl',  # 'jpl' or 'standard' format
    min_samples_for_analysis=30,
    confidence_level=0.95,
    outlier_threshold=3.0,
    figure_size=(12, 8),
    font_size=10,
    dpi=300
)

# Set ETAD-specific file paths (machine-specific: point base_data_path at your own data directory)
base_data_path = "/Users/ahzs645/Library/CloudStorage/GoogleDrive-ahzs645@gmail.com/My Drive/University/Research/Grad/UC Davis Ann/NASA MAIA/Data"

config.aethalometer_files = {
    'pkl_data': os.path.join(
        base_data_path,
        "Aethelometry Data/Kyan Data/Mergedcleaned and uncleaned MA350 data20250707030704",
        "df_uncleaned_Jacros_API_and_OG.pkl"
    ),
    'csv_data': os.path.join(
        base_data_path,
        "Aethelometry Data/Raw",
        "Jacros_MA350_1-min_2022-2024_Cleaned.csv"
    )
}

config.ftir_db_path = os.path.join(
    base_data_path,
    "EC-HIPS-Aeth Comparison/Data/Original Data/Combined Database",
    "spartan_ftir_hips.db"
)

# Option 2: Or use the configuration manager for ETAD defaults
# config = ConfigurationManager.create_etad_config(base_data_path)

# Option 3: Or create a custom configuration for a different site
# config = create_custom_config(
#     site_code='MYSITE',
#     aethalometer_files={'data': '/path/to/data.pkl'},
#     ftir_db_path='/path/to/database.db',
#     wavelength='Blue',
#     quality_threshold=15,
#     output_format='standard'
# )

# Create setup with the explicit configuration
setup = NotebookSetup(config)

# Load all datasets
print("📁 Loading datasets...")
datasets = setup.load_all_data()

# Assess data quality
print("\n🔍 Assessing data quality...")
quality_results = setup.assess_data_quality()

# Print comprehensive summary
setup.print_summary()

# Access specific datasets
pkl_data = setup.get_dataset('pkl_data')
csv_data = setup.get_dataset('csv_data')
ftir_data = setup.get_ftir_data()

# Get BC data for the configured wavelength (read from config instead of hard-coding 'Red')
red_bc = setup.get_bc_data_for_wavelength('pkl_data', config.wavelength)

# Get excellent quality periods
excellent_periods = setup.get_excellent_periods('pkl_data')

print("\n✅ Complete! All data loaded, modular system configured, quality assessed.")
print(f"\n📚 Available datasets: {list(datasets.keys())}")
print(f"🔧 Current configuration: Site={config.site_code}, Wavelength={config.wavelength}")

✅ Advanced plotting style configured
🚀 Aethalometer-FTIR/HIPS Pipeline with Simplified Setup
📊 Configuration Summary:
   Site: ETAD
   Wavelength: Red
   Output format: jpl
   Quality threshold: 10 minutes
   Output directory: outputs

📁 File paths:
   pkl_data: ✅ df_uncleaned_Jacros_API_and_OG.pkl
   csv_data: ✅ Jacros_MA350_1-min_2022-2024_Cleaned.csv
   FTIR DB: ✅ spartan_ftir_hips.db
📁 Loading datasets...
📦 Setting up modular system...
✅ Aethalometer loaders imported
✅ Database loader imported
✅ Plotting utilities imported
✅ Plotting style configured
✅ Successfully imported 5 modular components

📁 LOADING DATASETS
📁 Loading all datasets...

📊 Loading pkl_data
📁 Loading pkl_data: df_uncleaned_Jacros_API_and_OG.pkl
Detected format: standard
Set 'datetime_local' as DatetimeIndex for time series operations
Converted 17 columns to JPL format
✅ Modular load: 1,665,156 rows × 238 columns
📊 Method: modular
📊 Format: jpl
📊 Memory: 7443.05 MB
🧮 BC columns: 30
📈 ATN columns: 25
📅 Time range: 
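
The loader reports roughly 7.4 GB in memory for `pkl_data`. If that is too much for your machine, one common option is to store float columns as 32-bit instead of 64-bit. The sketch below shows the idea on a synthetic frame. `downcast_floats` is a hypothetical helper, not part of the modular system, and it assumes that float32 precision (about 7 significant digits) is enough for your analysis.

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with float64 columns stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})

# Synthetic stand-in for a wide numeric table (values are random, not real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(10_000, 5)), columns=[f"col{i}" for i in range(5)])

before_mb = demo.memory_usage(deep=True).sum() / 1e6
small = downcast_floats(demo)
after_mb = small.memory_usage(deep=True).sum() / 1e6
print(f"{before_mb:.2f} MB -> {after_mb:.2f} MB")
```

For all-float data this roughly halves the footprint. Object and datetime columns are left untouched.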

## 2. 📊 Explore What Was Loaded

Let's see what we got with that simple command:

In [2]:
# Show comprehensive summary
setup.print_summary()

print("\n🔧 Quick access to configuration:")
print(f"Site: {setup.config.site_code}")
print(f"Wavelength: {setup.config.wavelength}")
print(f"Quality threshold: {setup.config.quality_threshold} minutes")
print(f"Output format: {setup.config.output_format}")


📊 COMPREHENSIVE DATA SUMMARY

🔧 Configuration:
   Site: ETAD
   Wavelength: Red
   Output format: jpl
   Quality threshold: 10 minutes

📁 Loaded datasets: 3
   - pkl_data: 1,665,156 rows × 238 columns
     📅 Time range: 2021-01-09 16:38:00 to 2025-06-26 23:18:00
   - csv_data: 1,095,086 rows × 77 columns
     📅 Time range: 2022-04-12 12:46:01+03:00 to 2024-08-20 12:01:00+03:00
   - ftir_hips: 168 rows × 12 columns

🔍 Quality assessment:
   - pkl_data: 1036/1630 excellent periods (63.6%)
   - csv_data: 712/862 excellent periods (82.6%)

🧮 Red BC columns:
   - pkl_data: ['Red BC1', 'Red BC2', 'Red.BCc']
   - csv_data: ['Red BC1', 'Red BC2', 'Red.BCc']

🔧 Quick access to configuration:
Site: ETAD
Wavelength: Red
Quality threshold: 10 minutes
Output format: jpl
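
Note that the summary prints `pkl_data` timestamps without a UTC offset, while `csv_data` timestamps carry `+03:00`. pandas will not compare timezone-naive and timezone-aware timestamps, so the two need a common footing before you slice one dataset with times taken from the other. The sketch below assumes the naive `pkl_data` timestamps are local time at UTC+03:00 (the loader names the index `datetime_local`). Verify that assumption against your data before relying on it. The toy indexes are made up for illustration.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumed site offset, taken from the csv_data timestamps in the summary
LOCAL_TZ = timezone(timedelta(hours=3))

# Toy stand-ins for the two indexes
naive_idx = pd.date_range("2022-04-12 12:45", periods=4, freq="min")
aware_idx = pd.date_range("2022-04-12 12:46", periods=4, freq="min", tz=LOCAL_TZ)

# Attach the offset to the naive index; the clock times stay the same
localized_idx = naive_idx.tz_localize(LOCAL_TZ)

# Both indexes are now timezone-aware, so they can be compared
common = localized_idx.intersection(aware_idx)
print(f"{len(common)} overlapping timestamps")
```

Use `tz_localize` to attach an offset to naive times and `tz_convert` to move already-aware times to another zone.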


## 3. 🎯 Get Specific Data for Analysis

Clean, simple access to exactly what you need:

In [3]:
# Get specific datasets with simple method calls
pkl_data = setup.get_dataset('pkl_data')
csv_data = setup.get_dataset('csv_data') 
ftir_data = setup.get_ftir_data()

# Get BC data for configured wavelength (automatic column detection)
red_bc = setup.get_bc_data_for_wavelength('pkl_data')

# Get quality assessment results  
excellent_periods = setup.get_excellent_periods('pkl_data')

print(f"📊 Data Summary:")
print(f"   PKL data: {pkl_data.shape if pkl_data is not None else 'Not available'}")
print(f"   CSV data: {csv_data.shape if csv_data is not None else 'Not available'}")
print(f"   FTIR data: {ftir_data.shape if ftir_data is not None else 'Not available'}")
print(f"   Red BC data: {red_bc.shape if red_bc is not None else 'Not available'}")
print(f"   Excellent periods: {len(excellent_periods) if excellent_periods is not None else 0}")

📊 Using BC column: Red BC1
📊 Data Summary:
   PKL data: (1665156, 238)
   CSV data: (1095086, 77)
   FTIR data: (168, 12)
   Red BC data: (1665156,)
   Excellent periods: 1036
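
The output above shows that `get_bc_data_for_wavelength` picked `Red BC1` out of the three red BC columns. The library's selection rule is not shown in this notebook, so the following is only a guess at how such detection might work: match columns that start with the wavelength name and contain `BC`. `find_bc_columns` is a hypothetical helper, and the `Red ATN1` column name is made up for the demo.

```python
import pandas as pd

def find_bc_columns(df: pd.DataFrame, wavelength: str) -> list:
    """Hypothetical helper: columns that start with the wavelength name and contain 'BC'."""
    prefix = wavelength.lower()
    return [
        col for col in df.columns
        if col.lower().startswith(prefix) and "bc" in col.lower()
    ]

# Empty frame with a mix of BC and non-BC column names
demo = pd.DataFrame(
    columns=["Red BC1", "Red BC2", "Red.BCc", "Red ATN1", "Blue BC1", "IR BCc"]
)

red_cols = find_bc_columns(demo, "Red")
print(red_cols)  # ['Red BC1', 'Red BC2', 'Red.BCc']
```

A check like this is handy when a dataset uses a different naming scheme and the built-in detection returns nothing.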


## 4. 🔍 Quality Assessment Results

Quality assessment already ran in section 1 via `setup.assess_data_quality()`, and the results are stored on `setup.quality_results`. Let's examine them:

In [None]:
# Access the quality results computed in section 1
if setup.quality_results:
    print("📊 Quality Assessment Results:")
    print("=" * 50)
    
    for dataset_name, result in setup.quality_results.items():
        print(f"\n📋 {dataset_name}:")
        print(f"   Total 24h periods: {result.total_periods}")
        print(f"   Excellent periods: {result.excellent_periods}")
        print(f"   Excellence rate: {result.excellent_percentage:.1f}%")
        print(f"   Data completeness: {result.data_completeness:.1f}%")
        print(f"   Missing data points: {result.missing_points:,}")
        
        if len(result.excellent_periods_df) > 0:
            print(f"   First excellent period: {result.excellent_periods_df.iloc[0]['start_time']}")
            print(f"   Last excellent period: {result.excellent_periods_df.iloc[-1]['start_time']}")
    
    # Find the best quality dataset
    best_dataset = max(setup.quality_results.items(), key=lambda x: x[1].excellent_percentage)
    print(f"\n🏆 Best quality dataset: {best_dataset[0]} ({best_dataset[1].excellent_percentage:.1f}% excellent)")
else:
    print("⚠️ No quality results available")
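
To make the "excellent period" numbers concrete: the configuration defines `quality_threshold` as the maximum number of missing minutes a period may have. The sketch below reproduces that idea on synthetic 1-minute data. It assumes periods are midnight-to-midnight calendar days, which may differ from the boundaries `assess_data_quality` actually uses. `summarize_daily_quality` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

MINUTES_PER_DAY = 24 * 60

def summarize_daily_quality(series: pd.Series, max_missing_minutes: int = 10) -> pd.DataFrame:
    """Hypothetical helper: flag calendar days with few missing 1-minute samples."""
    valid_minutes = series.resample("D").count()  # non-NaN samples per day
    summary = pd.DataFrame({
        "valid_minutes": valid_minutes,
        "missing_minutes": MINUTES_PER_DAY - valid_minutes,
    })
    summary["excellent"] = summary["missing_minutes"] <= max_missing_minutes
    return summary

# Synthetic three-day record: 5 gaps on day 1, 30 on day 2, none on day 3
index = pd.date_range("2023-01-01", periods=3 * MINUTES_PER_DAY, freq="min")
bc = pd.Series(1.0, index=index)
bc.iloc[0:5] = np.nan
bc.iloc[MINUTES_PER_DAY:MINUTES_PER_DAY + 30] = np.nan

daily = summarize_daily_quality(bc, max_missing_minutes=10)
print(daily)
print(f"{daily['excellent'].sum()}/{len(daily)} excellent days")
```

With a 10-minute threshold, days 1 and 3 qualify and day 2 does not. Raising the threshold to 30 would admit all three.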

## 5. 📈 Quick Visualization

Let's visualize some data from an excellent quality period:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Plot BC data for an excellent quality period
if excellent_periods is not None and len(excellent_periods) > 0 and red_bc is not None:
    
    # Get the first excellent period
    first_period = excellent_periods.iloc[0]
    period_start = first_period['start_time']
    period_end = first_period['end_time']
    
    # Get BC data for this period
    period_bc = red_bc.loc[period_start:period_end]
    
    # Create the plot
    plt.figure(figsize=(15, 8))
    
    # Main time series plot
    plt.subplot(2, 1, 1)
    plt.plot(period_bc.index, period_bc.values, linewidth=1, alpha=0.8)
    plt.title(f'Black Carbon Time Series - Excellent Quality Period\n'
              f'{period_start.strftime("%Y-%m-%d %H:%M")} to {period_end.strftime("%Y-%m-%d %H:%M")}\n'
              f'Missing minutes: {first_period["missing_minutes"]}/{setup.config.quality_threshold} threshold')
    plt.ylabel(f'{setup.config.wavelength} BC (μg/m³)')
    plt.grid(True, alpha=0.3)
    
    # Histogram
    plt.subplot(2, 1, 2)
    plt.hist(period_bc.dropna().values, bins=50, alpha=0.7, edgecolor='black')
    plt.title('BC Distribution for This Period')
    plt.xlabel(f'{setup.config.wavelength} BC (μg/m³)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print some statistics
    print(f"📊 Period Statistics:")
    print(f"   Data points: {len(period_bc):,}")
    print(f"   Mean BC: {period_bc.mean():.2f} μg/m³")
    print(f"   Std BC: {period_bc.std():.2f} μg/m³")
    print(f"   Min BC: {period_bc.min():.2f} μg/m³")
    print(f"   Max BC: {period_bc.max():.2f} μg/m³")
    print(f"   Data completeness: {first_period['completeness_pct']:.1f}%")

else:
    print("⚠️ No excellent periods or BC data available for plotting")

## 6. 🔬 Advanced Analysis with Modular System

The cell below tries to import the modular analyzers and falls back to basic statistics if the import fails:

In [None]:
# Try to use advanced modular analyzers
try:
    from analysis.bc.black_carbon_analyzer import BlackCarbonAnalyzer
    
    print("✅ Modular analyzers available!")
    
    if pkl_data is not None:
        # Run sophisticated analysis using the modular system
        analyzer = BlackCarbonAnalyzer()
        results = analyzer.analyze(pkl_data)
        
        print(f"\n📊 Advanced Analysis Results:")
        print(f"   Analysis type: {results.get('analysis_type', 'Unknown')}")
        print(f"   Results keys: {list(results.keys())}")
        
        # Show some results if available
        if 'summary_statistics' in results:
            stats = results['summary_statistics']
            print(f"\n📈 Summary Statistics:")
            for key, value in stats.items():
                if isinstance(value, (int, float)):
                    print(f"   {key}: {value:.3f}")
                else:
                    print(f"   {key}: {value}")
    else:
        print("⚠️ No PKL data available for advanced analysis")
        
except ImportError:
    print("⚠️ Advanced modular analyzers not available")
    print("💡 Using basic analysis instead...")
    
    # Fallback to basic analysis
    if red_bc is not None:
        print(f"\n📊 Basic BC Statistics:")
        print(f"   Data points: {len(red_bc.dropna()):,}")
        print(f"   Mean: {red_bc.mean():.3f} μg/m³")
        print(f"   Std: {red_bc.std():.3f} μg/m³")
        print(f"   Median: {red_bc.median():.3f} μg/m³")
        print(f"   Min: {red_bc.min():.3f} μg/m³")
        print(f"   Max: {red_bc.max():.3f} μg/m³")
        
        # Data quality check using configuration
        min_samples = setup.config.min_samples_for_analysis
        valid_data_points = len(red_bc.dropna())
        
        if valid_data_points >= min_samples:
            print(f"   ✅ Sufficient data for analysis ({valid_data_points:,} >= {min_samples:,})")
        else:
            print(f"   ⚠️ Insufficient data for reliable analysis ({valid_data_points:,} < {min_samples:,})")
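
The configuration also defines `outlier_threshold=3.0` and `confidence_level=0.95`, which the basic fallback above never uses. This notebook does not show how the modular system interprets them, so the sketch below is one plausible reading, not the library's method: drop points more than `outlier_threshold` standard deviations from the mean, then report a normal-approximation confidence interval for the mean. `mean_with_ci` is a hypothetical helper and the data are synthetic.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

def mean_with_ci(series: pd.Series, outlier_threshold: float = 3.0,
                 confidence_level: float = 0.95):
    """Hypothetical helper: z-score outlier filter, then a normal-approximation CI."""
    valid = series.dropna()
    z_scores = (valid - valid.mean()) / valid.std()
    kept = valid[z_scores.abs() <= outlier_threshold]

    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z_crit * kept.std() / np.sqrt(len(kept))
    n_dropped = len(valid) - len(kept)
    return kept.mean(), half_width, n_dropped

# Synthetic BC-like values around 1.0 with one spike and one gap
rng = np.random.default_rng(0)
values = rng.normal(loc=1.0, scale=0.2, size=1000)
demo = pd.Series(np.r_[values, 50.0, np.nan])

mean, half_width, n_dropped = mean_with_ci(demo)
print(f"mean = {mean:.3f} ± {half_width:.3f} ({n_dropped} outlier(s) dropped)")
```

One-minute aethalometer readings are strongly autocorrelated, so an interval computed this way will be narrower than the true uncertainty. Treat it as a rough indicator.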

## 7. 🎨 Multiple Wavelength Analysis

`get_bc_data_for_wavelength` accepts a wavelength argument, so the same call works across all five channels:

In [None]:
# Analyze multiple wavelengths easily
wavelengths = ['Red', 'Blue', 'Green', 'UV', 'IR']
bc_data = {}
bc_stats = {}

print("🌈 Multi-wavelength BC Analysis:")
print("=" * 50)

for wavelength in wavelengths:
    bc_series = setup.get_bc_data_for_wavelength('pkl_data', wavelength)
    
    if bc_series is not None and len(bc_series.dropna()) > 0:
        bc_data[wavelength] = bc_series
        
        # Calculate statistics
        valid_data = bc_series.dropna()
        bc_stats[wavelength] = {
            'count': len(valid_data),
            'mean': valid_data.mean(),
            'std': valid_data.std(),
            'median': valid_data.median(),
            'min': valid_data.min(),
            'max': valid_data.max()
        }
        
        print(f"\n📊 {wavelength} BC:")
        print(f"   Points: {bc_stats[wavelength]['count']:,}")
        print(f"   Mean: {bc_stats[wavelength]['mean']:.3f} μg/m³")
        print(f"   Std: {bc_stats[wavelength]['std']:.3f} μg/m³")
        print(f"   Range: {bc_stats[wavelength]['min']:.3f} - {bc_stats[wavelength]['max']:.3f} μg/m³")
    else:
        print(f"\n❌ {wavelength} BC: No data available")

print(f"\n✅ Available wavelengths: {list(bc_data.keys())}")

# Quick comparison plot if we have multiple wavelengths
if len(bc_data) > 1:
    plt.figure(figsize=(12, 6))
    
    means = [bc_stats[w]['mean'] for w in bc_data.keys()]
    stds = [bc_stats[w]['std'] for w in bc_data.keys()]
    
    plt.bar(range(len(bc_data)), means, yerr=stds, alpha=0.7, capsize=5)
    plt.xlabel('Wavelength')
    plt.ylabel('BC Concentration (μg/m³)')
    plt.title('Mean BC Concentrations by Wavelength')
    plt.xticks(range(len(bc_data)), list(bc_data.keys()))
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
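
If the per-wavelength series are collected into one DataFrame, the statistics loop above can also be written as a single `agg` call. The sketch below shows the pattern on toy values. The column names mirror the ones in this notebook, but the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy table: one column per wavelength, NaN marks missing samples
demo = pd.DataFrame({
    "Red BC1": [1.0, 2.0, np.nan, 3.0],
    "Blue BC1": [2.0, 4.0, 6.0, np.nan],
})

# One row per wavelength, one column per statistic; NaN values are skipped
summary = demo.agg(["count", "mean", "std", "median", "min", "max"]).T
print(summary)
```

The resulting table can be passed straight to `to_csv`, or used to drive the bar plot with `summary["mean"]` and `summary["std"]`.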

## 8. 🔧 Easy Customization Examples

The new system makes customization much easier:

In [None]:
# Example 1: Custom quality threshold analysis
from analysis.quality.data_quality_assessment import assess_single_dataset

if pkl_data is not None:
    print("🔍 Custom Quality Analysis:")
    print("=" * 40)
    
    # Compare different quality thresholds
    thresholds = [5, 10, 15, 20]
    
    for threshold in thresholds:
        custom_quality = assess_single_dataset(
            df=pkl_data, 
            dataset_name='pkl_data', 
            quality_threshold=threshold
        )
        
        print(f"Threshold {threshold:2d} min: {custom_quality.excellent_periods:3d} excellent periods "
              f"({custom_quality.excellent_percentage:5.1f}%)")

# Example 2: FTIR data exploration (if available)
if ftir_data is not None:
    print(f"\n🧪 FTIR Data Summary:")
    print("=" * 30)
    print(f"Samples: {len(ftir_data)}")
    print(f"Date range: {ftir_data['sample_date'].min()} to {ftir_data['sample_date'].max()}")
    
    # Show available measurements
    measurement_cols = [col for col in ftir_data.columns if any(x in col.lower() for x in ['ec', 'oc', 'fabs'])]
    print(f"Measurements: {measurement_cols}")
    
    # Basic statistics for key measurements
    if 'ec_ftir' in ftir_data.columns:
        ec_data = ftir_data['ec_ftir'].dropna()
        print(f"\nEC FTIR statistics:")
        print(f"  Valid samples: {len(ec_data)}")
        print(f"  Mean: {ec_data.mean():.3f}")
        print(f"  Std: {ec_data.std():.3f}")
        print(f"  Range: {ec_data.min():.3f} - {ec_data.max():.3f}")
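
A natural next step is to pair each FTIR filter sample with the aethalometer average for the same day. The sketch below assumes each `sample_date` identifies a midnight-to-midnight calendar day. Real filter sampling windows may differ, so check this before comparing values. The column names `sample_date` and `ec_ftir` come from the cell above; `bc_daily_mean` and all numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy 1-minute BC series covering two days
idx = pd.date_range("2023-01-01", periods=2 * 1440, freq="min")
bc = pd.Series(np.r_[np.full(1440, 1.0), np.full(1440, 3.0)],
               index=idx, name="bc_daily_mean")

# Toy filter table; the third sample has no matching aethalometer data
ftir = pd.DataFrame({
    "sample_date": pd.date_range("2023-01-01", periods=3, freq="D"),
    "ec_ftir": [0.8, 2.5, 1.1],
})

# Daily means keyed by date, then an inner join keeps only days present in both
daily_bc = bc.resample("D").mean().rename_axis("sample_date").reset_index()
paired = ftir.merge(daily_bc, on="sample_date", how="inner")
print(paired)
```

In practice you would probably restrict `daily_bc` to excellent-quality days before merging, so that sparse days do not bias the comparison.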

## 9. 💾 Save Results

Easy saving using the configuration system:

In [None]:
import os

# Output directory from configuration
output_dir = setup.config.output_dir
site_code = setup.config.site_code

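# Create the output directory if it does not exist yet, so to_csv cannot fail on a missing folder
os.makedirs(output_dir, exist_ok=True)
print(f"💾 Saving results to: {output_dir}")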

# Save excellent periods
if excellent_periods is not None and len(excellent_periods) > 0:
    excellent_periods_file = os.path.join(output_dir, f'{site_code}_excellent_periods.csv')
    excellent_periods.to_csv(excellent_periods_file, index=False)
    print(f"✅ Saved excellent periods: {excellent_periods_file}")

# Save quality summary for all datasets
if setup.quality_results:
    quality_summary = []
    
    for dataset_name, result in setup.quality_results.items():
        quality_summary.append({
            'dataset': dataset_name,
            'site_code': site_code,
            'total_periods': result.total_periods,
            'excellent_periods': result.excellent_periods,
            'excellence_rate_pct': result.excellent_percentage,
            'data_completeness_pct': result.data_completeness,
            'missing_points': result.missing_points,
            'quality_threshold_min': result.quality_threshold,
            'start_date': result.time_range[0].strftime('%Y-%m-%d'),
            'end_date': result.time_range[1].strftime('%Y-%m-%d')
        })
    
    quality_summary_df = pd.DataFrame(quality_summary)
    quality_file = os.path.join(output_dir, f'{site_code}_quality_summary.csv')
    quality_summary_df.to_csv(quality_file, index=False)
    print(f"✅ Saved quality summary: {quality_file}")

# Save BC statistics for multiple wavelengths
if bc_stats:
    bc_summary = []
    
    for wavelength, stats in bc_stats.items():
        bc_summary.append({
            'site_code': site_code,
            'wavelength': wavelength,
            'count': stats['count'],
            'mean_ugm3': stats['mean'],
            'std_ugm3': stats['std'],
            'median_ugm3': stats['median'],
            'min_ugm3': stats['min'],
            'max_ugm3': stats['max']
        })
    
    bc_summary_df = pd.DataFrame(bc_summary)
    bc_file = os.path.join(output_dir, f'{site_code}_bc_wavelength_summary.csv')
    bc_summary_df.to_csv(bc_file, index=False)
    print(f"✅ Saved BC summary: {bc_file}")

# Save configuration for reproducibility
config_summary = {
    'site_code': setup.config.site_code,
    'wavelength': setup.config.wavelength,
    'quality_threshold_min': setup.config.quality_threshold,
    'output_format': setup.config.output_format,
    'min_samples_for_analysis': setup.config.min_samples_for_analysis,
    'confidence_level': setup.config.confidence_level,
    'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
}

config_df = pd.DataFrame([config_summary])
config_file = os.path.join(output_dir, f'{site_code}_analysis_config.csv')
config_df.to_csv(config_file, index=False)
print(f"✅ Saved configuration: {config_file}")

print(f"\n🎉 Analysis complete! All results saved to: {output_dir}")

## 🎯 Summary: What We Accomplished

### ✅ With Just 2 Lines of Code (Default ETAD Setup):
```python
from notebook_utils.setup import load_etad_data
setup, datasets = load_etad_data()
```

### 🚀 We Automatically Got:
1. **All data loaded** with intelligent fallbacks (modular system + direct loading)
2. **Quality assessment completed** for all aethalometer datasets  
3. **Configuration validated** and accessible throughout analysis
4. **Clean data access methods** for any wavelength or dataset
5. **Error handling** that tells us exactly what's available vs missing
6. **Modular system integration** with graceful fallbacks

### 📊 Compare This To Original Notebook:
- **OLD**: 200+ lines of complex setup, scattered configuration, manual quality assessment
- **NEW**: 2 lines for complete setup, everything automated and organized

### 🔧 Easy Customization:
- Different sites: Change configuration object
- Different wavelengths: Use `get_bc_data_for_wavelength(dataset, wavelength)`
- Different quality thresholds: Adjust in configuration or run custom assessment
- Custom file paths: Create custom configuration

### 🎉 Result:
**More time analyzing data, less time fighting setup code!**

## 🚀 Next Steps

### For Different Sites:
```python
from notebook_utils.setup import create_custom_config, quick_setup

custom_config = create_custom_config(
    site_code='BEIJING',
    aethalometer_files={'data': '/path/to/beijing_data.pkl'},
    ftir_db_path='/path/to/beijing_db.db',
    wavelength='Blue'
)

beijing_setup = quick_setup(custom_config)
beijing_datasets = beijing_setup.load_all_data()
```

### For Advanced Analysis:
- Use the modular analyzers if available
- Build custom analysis functions using the clean data access methods
- Extend the configuration for your specific analysis needs

### For Production Use:
- Create site-specific configuration files
- Build automated analysis pipelines using this simplified approach
- Share notebooks easily since setup is standardized

**Happy analyzing! 🎊**