# Multi-Site Aethalometer and Filter Data Analysis

This notebook analyzes aethalometer and filter data from four global sites:
- Beijing, China (CHTS)
- Delhi, India (INDH)
- JPL, California (USPA)
- Addis Ababa, Ethiopia (ETAD)

Each visualization section produces:
1. A combined plot with all sites
2. Individual plots for each site

---

## Available Data Variables

### **Aethalometer Data** (`aethalometer_data` dictionary)
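Access via: `aethalometer_data['Beijing']`, `aethalometer_data['Delhi']`, `aethalometer_data['JPL']`, `aethalometer_data['Addis_Ababa']` (the keys match the `SITES` configuration, so Addis Ababa uses an underscore).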

**Temporal:**
- `datetime_local` - Local timestamp (timezone-aware)
- `day_9am` - Date of 9 AM measurement
- `data_completeness_pct` - % of 1440 minutes with data (Beijing, Delhi, JPL only)
- `minutes_with_data` - Number of minutes with BC data

**Black Carbon Measurements (ng/m³):**
- `IR BCc`, `Blue BCc`, `Green BCc`, `Red BCc`, `UV BCc` - Raw BC concentrations
- `IR BCc smoothed`, `Blue BCc smoothed`, `UV BCc smoothed` - Smoothed BC (site-dependent)
- `BC1`, `BC2` variants - Dual-spot measurements
- `Biomass BCc (ng/m^3)`, `Fossil fuel BCc (ng/m^3)` - Source apportionment

**Attenuation (ATN):**
- `[Wavelength] ATN1`, `[Wavelength] ATN2` - Spot 1 and 2 attenuation
- `delta [Wavelength] ATN1/2` - ATN changes
- `[Wavelength] ATN1/2 rolling mean` - Smoothed ATN

**Environmental Sensors:**
- `Internal temp (C)`, `Sample temp (C)` - Temperature
- `Sample RH (%)` - Relative humidity
- `Sample dewpoint (C)` - Dewpoint
- `Internal pressure (Pa)` - Pressure

**Flow Measurements (mL/min):**
- `Flow total (mL/min)` - Total flow (about 100 mL/min)
- `Flow1 (mL/min)`, `Flow2 (mL/min)` - Dual-spot flows
- `Flow setpoint (mL/min)` - Target flow
- `ratio_flow` - Flow ratio

**Particulate Matter:**
- `opc.bins.i0` through `opc.bins.i23` - OPC size bins
- `opc.pms.i0/i1/i2` - PM mass concentrations
- `particulate.counts.i0-i23` - Particle counts by size
- `particulate.masses.i0/i1/i2` - Mass by size

**Additional Sensors:**
- `Accel X/Y/Z` - Accelerometer (device movement)
- `co2.co2` - CO2 concentration
- `co2.temperature`, `co2.relativeHumidity` - CO2 sensor

**Absorption Metrics:**
- `AAE` - Absorption Ångström Exponent
- `AAE biomass`, `AAE fossil fuel` - Source-specific AAE
- `BB (%)` - Biomass burning percentage
- `Delta-C (ng/m^3)` - Delta-C, a biomass-burning marker (typically UV BC minus IR BC)

**Quality Flags:**
- `high_rough_period` - Boolean flag for high roughness
- `[Wavelength] ATN1/2_roughness` - Roughness metrics
- `Status` - Instrument status code

**Metadata:**
- `Site_Code` - Site code (CHTS, INDH, USPA, ETAD)
- `Site_Name` - Full site name
- `Device_ID` - Aethalometer serial number
- `Serial number` - Device serial
- `Firmware version` - Firmware version
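
Because the available columns differ by site (for example, the smoothed BC columns are site-dependent), it can help to check which BC columns exist before using them. The sketch below uses a small synthetic frame as a stand-in for one entry of `aethalometer_data`; the helper name `bc_columns` and the demo values are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one entry of `aethalometer_data` (real frames have many more columns)
demo = pd.DataFrame({
    "day_9am": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "IR BCc": [1200.0, np.nan, 950.0],
    "UV BCc": [1500.0, 1400.0, np.nan],
    "Sample RH (%)": [35.0, 40.0, 38.0],
})
aethalometer_demo = {"Beijing": demo}


def bc_columns(df, wavelengths=("UV", "Blue", "Green", "Red", "IR")):
    """Return the raw BC column names that are actually present in df."""
    return [f"{w} BCc" for w in wavelengths if f"{w} BCc" in df.columns]


cols = bc_columns(aethalometer_demo["Beijing"])
print(cols)
# Count and mean of the non-missing values in each available BC column
print(aethalometer_demo["Beijing"][cols].describe().loc[["count", "mean"]])
```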

---

### **Filter Data** (`filter_data` DataFrame)
A single long-format DataFrame covering all sites and parameters: each row is one `Parameter` value for one filter.

**Identifiers:**
- `Site` - Site code (CHTS, ETAD, INDH, USPA)
- `FilterId` - Unique filter ID (e.g., ETAD-0017)
- `Barcode` - Filter barcode
- `SampleDate` - Date sample collected

**Measurement Info:**
- `Parameter` - What was measured (see categories below)
- `Concentration` - Measured value (µg/m³)
- `Concentration_Units` - Units (typically µg/m³)
- `MDL` - Minimum Detection Limit
- `MDL_Units` - MDL units
- `Uncertainty` - Measurement uncertainty

**Sample Details:**
- `FilterType` - Type of filter used
- `LotId` - Filter lot number
- `DepositArea_cm2` - Deposition area
- `Volume_m3` - Air volume sampled
- `MassLoading_ug` - Total mass on filter

**Analysis Info:**
- `AnalysisDate` - When analyzed
- `AnalysisTime` - Time analyzed
- `DataSource` - Source (ChemSpec, FTIR, HIPS)
- `CalibrationSetId` - Calibration used
- `FilterComments` - Comments (e.g., "SGP")

**Parameter Categories:**

**ChemSpec EC/OC:**
- `ChemSpec_EC_PM2.5` - Elemental Carbon
- `ChemSpec_OC_PM2.5` - Organic Carbon  
- `ChemSpec_OM_PM2.5` - Organic Matter
- `ChemSpec_BC_PM2.5` - Black Carbon

**FTIR EC/OC:**
- `EC_ftir` - Elemental Carbon (FTIR)
- `OC_ftir` - Organic Carbon (FTIR)
- `OM` - Organic Matter

**FTIR Functional Groups:**
- `alcoholCOH` - Alcohol groups
- `alkaneCH` - Alkane groups
- `carboxylicCOOH` - Carboxylic acid groups
- `naCO` - Non-acid carbonyl groups

**HIPS (Absorption):**
- `HIPS_T1`, `HIPS_Slope`, `HIPS_Intercept`
- `HIPS_R1`, `HIPS_t`, `HIPS_tau`, `HIPS_r`
- `HIPS_Fabs`, `HIPS_Uncertainty`, `HIPS_MDL`

**ChemSpec Ions:**
- `ChemSpec_Sulfate_Ion_PM2.5`, `ChemSpec_Nitrate_Ion_PM2.5`
- `ChemSpec_Ammonium_Ion_PM2.5`, `ChemSpec_Chloride_Ion_PM2.5`
- `ChemSpec_Sodium_Ion_PM2.5`, `ChemSpec_Potassium_Ion_PM2.5`
- `ChemSpec_Magnesium_Ion_PM2.5`, `ChemSpec_Calcium_Ion_PM2.5`

**ChemSpec Metals (XRF):**
- Major: Iron, Aluminum, Silicon, Sulfur, Calcium, Potassium
- Trace: Zinc, Lead, Copper, Manganese, Chromium, Nickel
- Ultra-trace: Arsenic, Cadmium, Selenium, Vanadium, etc.
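
Since `filter_data` stores one row per filter and parameter, comparing two parameters on the same filter usually means reshaping to one row per filter. The following is a minimal sketch of that reshape on synthetic rows; the helper name `to_wide`, the `INDH-0001` ID, and all concentrations are made up for illustration.

```python
import pandas as pd

# Synthetic long-format rows mimicking the documented `filter_data` columns
filter_demo = pd.DataFrame({
    "Site": ["ETAD", "ETAD", "ETAD", "ETAD", "INDH"],
    "FilterId": ["ETAD-0017", "ETAD-0017", "ETAD-0018", "ETAD-0018", "INDH-0001"],
    "SampleDate": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-01-04", "2023-01-04", "2023-01-02"]
    ),
    "Parameter": ["ChemSpec_EC_PM2.5", "EC_ftir", "ChemSpec_EC_PM2.5", "EC_ftir",
                  "ChemSpec_EC_PM2.5"],
    "Concentration": [4.2, 3.9, 5.1, 4.7, 8.0],
})


def to_wide(df, site_code):
    """Pivot one site's rows so each parameter becomes its own column."""
    site = df[df["Site"] == site_code]
    wide = site.pivot_table(
        index=["FilterId", "SampleDate"],
        columns="Parameter",
        values="Concentration",
        aggfunc="mean",  # averages duplicates if a filter/parameter pair repeats
    )
    return wide.reset_index()


wide_etad = to_wide(filter_demo, "ETAD")
print(wide_etad)
```

Averaging duplicates with `aggfunc="mean"` is a design choice of this sketch; if repeated measurements should be kept separate, a different aggregation would be needed.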

---

## Data Quality Notes

**Aethalometer:**
- Beijing: 37.6% of days have BC data (gaps in record)
- Delhi: 39.1% of days have BC data
- JPL: 63.9% of days have BC data (best coverage)
- Addis Ababa: 100% of days have BC data

**Filter:**
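- EC values < 0.5 µg/m³ (500 ng/m³) are excluded from the scatter/correlation analysis as likely blanks or below-MDL values; the time-series comparison plots all EC values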
- Each filter typically has multiple parameters measured
- Sampling frequency varies by site (typically every 3 days)
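
One way to arrive at a "percent of days with BC data" figure is to take the share of rows in a daily frame with a non-missing BC value. The sketch below assumes that definition (rows as the denominator) and uses synthetic data; how the percentages quoted above were actually computed is not shown here, and `bc_coverage_pct` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def bc_coverage_pct(df, col="IR BCc"):
    """Percentage of rows (days) in df with a non-missing value in col."""
    if col not in df.columns or len(df) == 0:
        return 0.0
    return float(df[col].notna().mean() * 100)


# Synthetic daily record: 3 of 8 days have a BC value
demo = pd.DataFrame(
    {"IR BCc": [1000.0, np.nan, np.nan, 800.0, np.nan, 900.0, np.nan, np.nan]}
)
print(f"{bc_coverage_pct(demo):.1f}% of days have BC data")
```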

---

## Site Information

| Site | Code | Location | Timezone | Device |
|------|------|----------|----------|--------|
| Beijing | CHTS | Beijing, China | Asia/Shanghai | WF0010 |
| Delhi | INDH | Delhi, India | Asia/Kolkata | MA350-0216 |
| JPL | USPA | Pasadena, USA | America/Los_Angeles | MA350-0229 |
| Addis Ababa | ETAD | Addis Ababa, Ethiopia | Africa/Addis_Ababa | MA350-0238 |
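
The timezones in the table matter when raw UTC timestamps have to be lined up with the local 9 AM day boundary. A minimal sketch of the conversion with pandas follows; the `SITE_TIMEZONES` mapping simply restates the table, while the helper name and the example timestamps are assumptions for illustration.

```python
import pandas as pd

# Timezones copied from the site table above
SITE_TIMEZONES = {
    "Beijing": "Asia/Shanghai",
    "Delhi": "Asia/Kolkata",
    "JPL": "America/Los_Angeles",
    "Addis_Ababa": "Africa/Addis_Ababa",
}


def utc_to_site_local(timestamp_utc, site_name):
    """Convert a UTC timestamp (naive values are treated as UTC) to site-local time."""
    ts = pd.Timestamp(timestamp_utc)
    if ts.tzinfo is None:
        ts = ts.tz_localize("UTC")
    return ts.tz_convert(SITE_TIMEZONES[site_name])


local = utc_to_site_local("2023-01-15 03:30", "Delhi")
print(local)  # 03:30 UTC corresponds to 09:00 in Asia/Kolkata (UTC+05:30)
```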

## Setup and Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Configure matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")

In [None]:
# Define paths
PROCESSED_SITES_DIR = Path('/Users/ahmadjalil/Github/aethmodular/FTIR_HIPS_Chem/processed_sites')
FILTER_DATA_PATH = Path('/Users/ahmadjalil/Github/aethmodular/FTIR_HIPS_Chem/Filter Data/unified_filter_dataset.pkl')

# Site configurations
SITES = {
    'Beijing': {
        'file': 'df_Beijing_9am_resampled.pkl',
        'code': 'CHTS',
        'color': '#E74C3C',  # Red
        'location': 'Beijing, China'
    },
    'Delhi': {
        'file': 'df_Delhi_9am_resampled.pkl',
        'code': 'INDH',
        'color': '#3498DB',  # Blue
        'location': 'Delhi, India'
    },
    'JPL': {
        'file': 'df_JPL_9am_resampled.pkl',
        'code': 'USPA',
        'color': '#2ECC71',  # Green
        'location': 'Pasadena, USA'
    },
    'Addis_Ababa': {
        'file': 'df_Addis_Ababa_9am_resampled.pkl',
        'code': 'ETAD',
        'color': '#F39C12',  # Orange
        'location': 'Addis Ababa, Ethiopia'
    }
}

print("Paths and configurations defined!")

In [None]:
# Load all aethalometer datasets
aethalometer_data = {}

for site_name, config in SITES.items():
    file_path = PROCESSED_SITES_DIR / config['file']
    
    if file_path.exists():
        with open(file_path, 'rb') as f:
            df = pickle.load(f)
        
        # Ensure day_9am is datetime
        df['day_9am'] = pd.to_datetime(df['day_9am'])
        
        aethalometer_data[site_name] = df
        print(f"✓ Loaded {site_name}: {len(df)} records, {df['day_9am'].min().date()} to {df['day_9am'].max().date()}")
    else:
        print(f"✗ File not found: {file_path}")

print(f"\nTotal sites loaded: {len(aethalometer_data)}")

In [None]:
# Load filter dataset
with open(FILTER_DATA_PATH, 'rb') as f:
    filter_data = pickle.load(f)

# Convert SampleDate to datetime
filter_data['SampleDate'] = pd.to_datetime(filter_data['SampleDate'])

print(f"Filter dataset loaded: {len(filter_data)} measurements")
print(f"Sites in filter data: {filter_data['Site'].unique()}")
print(f"Date range: {filter_data['SampleDate'].min().date()} to {filter_data['SampleDate'].max().date()}")
print(f"\nTotal unique parameters: {filter_data['Parameter'].nunique()}")

# Categorize parameters by measurement method/type
print("\n" + "="*80)
print("FILTER DATA BREAKDOWN BY MEASUREMENT TYPE")
print("="*80)

# Define categories
categories = {
    'ChemSpec EC/OC': ['ChemSpec_EC_PM2.5', 'ChemSpec_OC_PM2.5', 'ChemSpec_OM_PM2.5', 'ChemSpec_BC_PM2.5'],
    'FTIR EC/OC': ['EC_ftir', 'OC_ftir', 'OM'],
    'FTIR Functional Groups': ['alcoholCOH', 'alkaneCH', 'carboxylicCOOH', 'naCO'],
    'HIPS': ['HIPS_T1', 'HIPS_Slope', 'HIPS_Intercept', 'HIPS_R1', 'HIPS_t', 'HIPS_tau',
             'HIPS_r', 'HIPS_Fabs', 'HIPS_Uncertainty', 'HIPS_MDL'],
    'ChemSpec Ions': ['ChemSpec_Sulfate_Ion_PM2.5', 'ChemSpec_Nitrate_Ion_PM2.5',
                      'ChemSpec_Ammonium_Ion_PM2.5', 'ChemSpec_Chloride_Ion_PM2.5',
                      'ChemSpec_Sodium_Ion_PM2.5', 'ChemSpec_Potassium_Ion_PM2.5',
                      'ChemSpec_Magnesium_Ion_PM2.5', 'ChemSpec_Calcium_Ion_PM2.5'],
    'ChemSpec Metals': ['ChemSpec_Iron_PM2.5', 'ChemSpec_Aluminum_PM2.5',
                        'ChemSpec_Silicon_PM2.5', 'ChemSpec_Sulfur_PM2.5',
                        'ChemSpec_Calcium_PM2.5', 'ChemSpec_Potassium_PM2.5',
                        'ChemSpec_Zinc_PM2.5', 'ChemSpec_Lead_PM2.5',
                        'ChemSpec_Copper_PM2.5', 'ChemSpec_Manganese_PM2.5']
}

# Print breakdown for each category
for category, params in categories.items():
    # Get parameters that exist in the dataset
    existing_params = [p for p in params if p in filter_data['Parameter'].values]
    
    if len(existing_params) > 0:
        category_data = filter_data[filter_data['Parameter'].isin(existing_params)]
        
        print(f"\n{category}:")
        print(f"  Total measurements: {len(category_data)}")
        print(f"  Parameters ({len(existing_params)}):")
        
        param_counts = category_data['Parameter'].value_counts()
        for param in existing_params[:10]:  # Show first 10
            if param in param_counts.index:
                count = param_counts[param]
                print(f"    - {param}: {count}")
        
        if len(existing_params) > 10:
            print(f"    ... and {len(existing_params) - 10} more parameters")

# Show uncategorized parameters
all_categorized = []
for params in categories.values():
    all_categorized.extend(params)

uncategorized = filter_data[~filter_data['Parameter'].isin(all_categorized)]['Parameter'].unique()
if len(uncategorized) > 0:
    print(f"\nOther Parameters ({len(uncategorized)}):")
    print(f"  {list(uncategorized[:10])}")
    if len(uncategorized) > 10:
        print(f"  ... and {len(uncategorized) - 10} more")

print("\n" + "="*80)
print("BREAKDOWN BY SITE")
print("="*80)

for site in sorted(filter_data['Site'].unique()):
    site_data = filter_data[filter_data['Site'] == site]
    print(f"\n{site}:")
    print(f"  Total measurements: {len(site_data)}")
    print(f"  Unique filters: {site_data['FilterId'].nunique()}")
    print(f"  Date range: {site_data['SampleDate'].min().date()} to {site_data['SampleDate'].max().date()}")
    
    # Show measurement counts for every category present at this site
    print("  Measurements by category:")
    for category, params in categories.items():
        category_count = len(site_data[site_data['Parameter'].isin(params)])
        if category_count > 0:
            print(f"    - {category}: {category_count} measurements")
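
The aethalometer dictionary is keyed by site name while the filter data is keyed by site code, so a reverse lookup is handy when joining the two. The sketch below restates the codes from `SITES`; it also assumes that the part of a `FilterId` before the first hyphen is the site code (as in `ETAD-0017`), which has not been verified against the full dataset.

```python
# Site name -> site code, restated from the SITES configuration
SITE_CODES = {"Beijing": "CHTS", "Delhi": "INDH", "JPL": "USPA", "Addis_Ababa": "ETAD"}

# Reverse lookup: site code -> key used in `aethalometer_data`
CODE_TO_SITE = {code: name for name, code in SITE_CODES.items()}


def site_name_for_filter(filter_id):
    """Return the site name for a filter ID such as 'ETAD-0017', or None if unknown.

    Assumes the prefix before the first hyphen is the site code.
    """
    code = filter_id.split("-", 1)[0]
    return CODE_TO_SITE.get(code)


print(site_name_for_filter("ETAD-0017"))
```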

## Helper Functions for Plotting

In [None]:
def plot_combined_and_individual(plot_function, title_base, **kwargs):
    """
    Create combined plot and individual plots for each site.
    
    Parameters:
    -----------
    plot_function : callable
        Function that takes (ax, site_name, df, config, **kwargs) and creates a plot
    title_base : str
        Base title for the plots
    **kwargs : dict
        Additional arguments to pass to plot_function
    """
    
    # 1. Combined plot
    fig, ax = plt.subplots(figsize=(14, 7))
    
    for site_name, df in aethalometer_data.items():
        config = SITES[site_name]
        plot_function(ax, site_name, df, config, **kwargs)
    
    ax.set_title(f"{title_base} - All Sites", fontsize=14, fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # 2. Individual plots
    for site_name, df in aethalometer_data.items():
        config = SITES[site_name]
        
        fig, ax = plt.subplots(figsize=(12, 6))
        plot_function(ax, site_name, df, config, **kwargs)
        
        ax.set_title(f"{title_base} - {site_name}", fontsize=14, fontweight='bold')
        ax.legend(loc='best')
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

print("Helper functions defined!")

## 1. Time Series: Black Carbon Concentrations

In [None]:
def plot_bc_timeseries(ax, site_name, df, config, wavelength='IR'):
    """Plot BC time series for a specific wavelength"""
    col_name = f'{wavelength} BCc'
    
    if col_name in df.columns:
        # Filter valid data
        valid_data = df[df[col_name].notna()].copy()
        
        if len(valid_data) > 0:
            ax.plot(valid_data['day_9am'], valid_data[col_name], 
                   color=config['color'], label=f"{site_name}", 
                   alpha=0.7, linewidth=1.5)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel(f'{wavelength} BC (ng/m³)', fontsize=12)
    ax.tick_params(axis='x', rotation=45)

# Create plots
plot_combined_and_individual(
    plot_bc_timeseries,
    "IR Black Carbon Time Series",
    wavelength='IR'
)

## 2. Multi-Wavelength BC Comparison

In [None]:
def plot_multiwavelength_bc(ax, site_name, df, config):
    """Plot BC for multiple wavelengths"""
    wavelengths = ['UV', 'Blue', 'Green', 'Red', 'IR']
    
    for wavelength in wavelengths:
        col_name = f'{wavelength} BCc'
        
        if col_name in df.columns:
            valid_data = df[df[col_name].notna()].copy()
            
            if len(valid_data) > 0:
                ax.plot(valid_data['day_9am'], valid_data[col_name], 
                       label=f"{wavelength}", alpha=0.6, linewidth=1)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('BC (ng/m³)', fontsize=12)
    ax.tick_params(axis='x', rotation=45)

# Create plots
plot_combined_and_individual(
    plot_multiwavelength_bc,
    "Multi-Wavelength BC Comparison"
)
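
The gap between the UV and IR channels is what the Delta-C metric summarizes. Delta-C is commonly defined as UV-channel BC minus IR-channel BC; the sketch below computes it that way on synthetic values. Whether the instrument's own `Delta-C (ng/m^3)` column is derived identically is an assumption, so the result is stored under a separate, hypothetical column name.

```python
import pandas as pd


def delta_c(df, uv_col="UV BCc", ir_col="IR BCc"):
    """UV BC minus IR BC (ng/m³); positive values suggest extra UV-absorbing material."""
    return df[uv_col] - df[ir_col]


# Synthetic daily values in ng/m³
demo = pd.DataFrame({"UV BCc": [1200.0, 900.0], "IR BCc": [1000.0, 950.0]})
demo["Delta-C demo (ng/m^3)"] = delta_c(demo)
print(demo)
```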

## 3. BC Distribution (Box Plots)

In [None]:
# Combined box plot
fig, ax = plt.subplots(figsize=(12, 6))

bc_data_combined = []
site_labels = []

for site_name, df in aethalometer_data.items():
    if 'IR BCc' in df.columns:
        valid_data = df['IR BCc'].dropna()
        if len(valid_data) > 0:
            bc_data_combined.append(valid_data)
            site_labels.append(site_name)

bp = ax.boxplot(bc_data_combined, patch_artist=True)
# Set tick labels separately: the `labels` keyword is deprecated in Matplotlib 3.9+
ax.set_xticklabels(site_labels)

# Color boxes
for patch, site_name in zip(bp['boxes'], site_labels):
    patch.set_facecolor(SITES[site_name]['color'])
    patch.set_alpha(0.6)

ax.set_title('IR BC Distribution - All Sites', fontsize=14, fontweight='bold')
ax.set_ylabel('IR BC (ng/m³)', fontsize=12)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# Individual box plots (multi-wavelength)
for site_name, df in aethalometer_data.items():
    fig, ax = plt.subplots(figsize=(10, 6))
    
    wavelengths = ['UV', 'Blue', 'Green', 'Red', 'IR']
    bc_data = []
    labels = []
    
    for wavelength in wavelengths:
        col_name = f'{wavelength} BCc'
        if col_name in df.columns:
            valid_data = df[col_name].dropna()
            if len(valid_data) > 0:
                bc_data.append(valid_data)
                labels.append(wavelength)
    
    if len(bc_data) > 0:
        bp = ax.boxplot(bc_data, patch_artist=True)
        # Set tick labels separately: the `labels` keyword is deprecated in Matplotlib 3.9+
        ax.set_xticklabels(labels)
        
        for patch in bp['boxes']:
            patch.set_facecolor(SITES[site_name]['color'])
            patch.set_alpha(0.6)
        
        ax.set_title(f'BC Distribution by Wavelength - {site_name}', fontsize=14, fontweight='bold')
        ax.set_ylabel('BC (ng/m³)', fontsize=12)
        ax.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.show()
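
To read the box plots numerically, the quantities they draw can be computed directly. The sketch below reproduces the usual box-plot statistics with NumPy, assuming Matplotlib's default whisker rule (whiskers reach the most extreme points within 1.5 × IQR of the quartiles); `box_stats` is a hypothetical helper and the input values are synthetic.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outlier count as drawn by a standard box plot."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = x[(x >= low_fence) & (x <= high_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "n_outliers": int(((x < low_fence) | (x > high_fence)).sum()),
    }


# Synthetic sample with one obvious outlier
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```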

## 4. Data Completeness Analysis

In [None]:
def plot_data_completeness(ax, site_name, df, config):
    """Plot data completeness over time"""
    if 'data_completeness_pct' in df.columns:
        ax.plot(df['day_9am'], df['data_completeness_pct'], 
               color=config['color'], label=site_name, 
               alpha=0.7, linewidth=1.5)
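        # Add the reference lines only once per axes; otherwise the combined
        # plot repeats them in the legend for every site
        existing_labels = [line.get_label() for line in ax.get_lines()]
        if 'High quality (80%)' not in existing_labels:
            ax.axhline(y=80, color='green', linestyle='--', alpha=0.5, label='High quality (80%)')
            ax.axhline(y=50, color='orange', linestyle='--', alpha=0.5, label='Medium quality (50%)')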
    else:
        # No completeness column (pre-resampled data): show BC availability as 0% or 100% per day
        if 'IR BCc' in df.columns:
            completeness = (df['IR BCc'].notna()).astype(int) * 100
            ax.scatter(df['day_9am'], completeness, 
                      color=config['color'], label=site_name, 
                      alpha=0.5, s=20)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Data Completeness (%)', fontsize=12)
    ax.set_ylim(-5, 105)
    ax.tick_params(axis='x', rotation=45)

# Create plots
plot_combined_and_individual(
    plot_data_completeness,
    "Data Completeness Over Time"
)
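
The completeness percentage is described above as the share of 1440 daily minutes with data, so it can be rebuilt from `minutes_with_data`. The sketch below does that and labels each day using the 80% and 50% reference levels from the plot; the band name "low" and the helper names are assumptions added for illustration.

```python
import pandas as pd

MINUTES_PER_DAY = 1440


def completeness_pct(minutes_with_data):
    """Share of the day's 1440 minutes that have data, in percent."""
    return minutes_with_data / MINUTES_PER_DAY * 100


def quality_band(pct):
    """Label a day using the 80% / 50% reference levels drawn on the plot."""
    if pct >= 80:
        return "high"
    if pct >= 50:
        return "medium"
    return "low"  # hypothetical label; the plot itself only marks high and medium


# Synthetic minute counts for four days
demo = pd.DataFrame({"minutes_with_data": [1440, 1200, 720, 300]})
demo["pct"] = completeness_pct(demo["minutes_with_data"])
demo["band"] = demo["pct"].apply(quality_band)
print(demo)
```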

## 5. Filter vs Aethalometer Comparison

In [None]:
def plot_filter_vs_aethalometer(ax, site_name, df, config, individual=False):
    """Compare aethalometer BC with filter EC measurements"""
    site_code = config['code']
    
    # Get filter data for this site
    site_filters = filter_data[filter_data['Site'] == site_code].copy()
    
    # Get EC measurements
    ec_filters = site_filters[site_filters['Parameter'] == 'ChemSpec_EC_PM2.5'].copy()
    
    if len(ec_filters) > 0 and 'IR BCc' in df.columns:
        # Plot aethalometer BC (solid line)
        valid_aeth = df[df['IR BCc'].notna()].copy()
        ax.plot(valid_aeth['day_9am'], valid_aeth['IR BCc'], 
               color=config['color'], label=f'{site_name} - Aethalometer IR BC', 
               alpha=0.7, linewidth=2, linestyle='-')
        
        # Plot filter EC
        ec_filters = ec_filters.sort_values('SampleDate')
        ec_filters['Concentration_ng'] = ec_filters['Concentration'] * 1000
        
        if individual:
            # For individual plots: use contrasting color and solid line
            # Define contrasting colors for each site
            contrast_colors = {
                'Beijing': '#2ECC71',    # Green (opposite of Red)
                'Delhi': '#F39C12',      # Orange (opposite of Blue)
                'JPL': '#E74C3C',        # Red (opposite of Green)
                'Addis_Ababa': '#3498DB' # Blue (opposite of Orange)
            }
            
            filter_color = contrast_colors.get(site_name, '#95A5A6')  # Gray as fallback
            
            ax.plot(ec_filters['SampleDate'], ec_filters['Concentration_ng'], 
                   color=filter_color, linestyle='-', linewidth=2,
                   label=f'{site_name} - Filter EC', alpha=0.7)
        else:
            # For combined plots: use same color but dotted line
            ax.plot(ec_filters['SampleDate'], ec_filters['Concentration_ng'], 
                   color=config['color'], linestyle=':', linewidth=2.5,
                   label=f'{site_name} - Filter EC', alpha=0.8)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Concentration (ng/m³)', fontsize=12)
    ax.tick_params(axis='x', rotation=45)

# Combined plot
fig, ax = plt.subplots(figsize=(14, 7))

for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    plot_filter_vs_aethalometer(ax, site_name, df, config, individual=False)

ax.set_title("Aethalometer BC vs Filter EC - All Sites", fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Individual plots
for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_filter_vs_aethalometer(ax, site_name, df, config, individual=True)
    
    ax.set_title(f"Aethalometer BC vs Filter EC - {site_name}", fontsize=14, fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 6. Scatter: Aethalometer vs Filter EC Correlation

In [None]:
# Combined scatter plot
fig, ax = plt.subplots(figsize=(10, 10))

print("="*80)
print("MATCHING SUMMARY - Combined Plot")
print("="*80)

for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    site_code = config['code']
    
    # Get filter EC data
    site_filters = filter_data[
        (filter_data['Site'] == site_code) & 
        (filter_data['Parameter'] == 'ChemSpec_EC_PM2.5')
    ].copy()
    
    # Filter out values below or near MDL (exclude values < 0.5 ug/m3 = 500 ng/m3)
    # These are typically blanks or below detection limit
    site_filters = site_filters[site_filters['Concentration'] >= 0.5].copy()
    
    if len(site_filters) > 0 and 'IR BCc' in df.columns:
        # Merge on date (±1 day tolerance)
        matched_data = []
        
        for _, filter_row in site_filters.iterrows():
            filter_date = filter_row['SampleDate']
            
            # Find aethalometer measurement within ±1 day
            date_match = df[
                (df['day_9am'] >= filter_date - pd.Timedelta(days=1)) &
                (df['day_9am'] <= filter_date + pd.Timedelta(days=1))
            ]
            
            if len(date_match) > 0 and date_match['IR BCc'].notna().any():
                aeth_bc = date_match['IR BCc'].mean()
                filter_ec = filter_row['Concentration'] * 1000  # ug/m3 to ng/m3
                matched_data.append({'aeth_bc': aeth_bc, 'filter_ec': filter_ec})
        
        print(f"{site_name}: {len(site_filters)} filter dates → {len(matched_data)} matched pairs")
        
        if len(matched_data) > 0:
            matched_df = pd.DataFrame(matched_data)
            ax.scatter(matched_df['aeth_bc'], matched_df['filter_ec'], 
                      color=config['color'], label=f'{site_name} (n={len(matched_data)})', 
                      alpha=0.6, s=80, edgecolors='black', linewidth=1)

# Set axis limits to start from 0,0
ax.set_xlim(left=0)
ax.set_ylim(bottom=0)

# Add 1:1 line
max_val = max(ax.get_xlim()[1], ax.get_ylim()[1])
ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')

ax.set_xlabel('Aethalometer IR BC (ng/m³)', fontsize=12)
ax.set_ylabel('Filter EC (ng/m³)', fontsize=12)
ax.set_title('Aethalometer BC vs Filter EC - All Sites (EC ≥ 0.5 µg/m³)', fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("INDIVIDUAL SCATTER PLOTS")
print("="*80 + "\n")

# Individual scatter plots
for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    site_code = config['code']
    
    # Get filter EC data
    site_filters = filter_data[
        (filter_data['Site'] == site_code) & 
        (filter_data['Parameter'] == 'ChemSpec_EC_PM2.5')
    ].copy()
    
    # Filter out values below MDL
    site_filters = site_filters[site_filters['Concentration'] >= 0.5].copy()
    
    if len(site_filters) > 0 and 'IR BCc' in df.columns:
        # Merge on date
        matched_data = []
        
        for _, filter_row in site_filters.iterrows():
            filter_date = filter_row['SampleDate']
            
            date_match = df[
                (df['day_9am'] >= filter_date - pd.Timedelta(days=1)) &
                (df['day_9am'] <= filter_date + pd.Timedelta(days=1))
            ]
            
            if len(date_match) > 0 and date_match['IR BCc'].notna().any():
                aeth_bc = date_match['IR BCc'].mean()
                filter_ec = filter_row['Concentration'] * 1000
                matched_data.append({'aeth_bc': aeth_bc, 'filter_ec': filter_ec})
        
        if len(matched_data) > 0:
            matched_df = pd.DataFrame(matched_data)
            
            print(f"{site_name}:")
            print(f"  Filter dates available (EC ≥ 0.5 µg/m³): {len(site_filters)}")
            print(f"  Matched pairs: {len(matched_df)}")
            print(f"  Aethalometer BC range: {matched_df['aeth_bc'].min():.1f} - {matched_df['aeth_bc'].max():.1f} ng/m³")
            print(f"  Filter EC range: {matched_df['filter_ec'].min():.1f} - {matched_df['filter_ec'].max():.1f} ng/m³")
            
            fig, ax = plt.subplots(figsize=(8, 8))
            
            ax.scatter(matched_df['aeth_bc'], matched_df['filter_ec'], 
                      color=config['color'], alpha=0.6, s=100, 
                      edgecolors='black', linewidth=1.5)
            
            # Calculate linear regression (line of best fit)
            if len(matched_df) > 1:
                # Linear regression: y = mx + b
                coefficients = np.polyfit(matched_df['aeth_bc'], matched_df['filter_ec'], 1)
                slope = coefficients[0]
                intercept = coefficients[1]
                
                # Create line of best fit
                x_fit = np.array([0, matched_df['aeth_bc'].max()])
                y_fit = slope * x_fit + intercept
                
                # Plot line of best fit
                ax.plot(x_fit, y_fit, color=config['color'], linestyle='-', 
                       linewidth=2.5, alpha=0.8, label='Best fit')
                
                # Calculate R²
                correlation = np.corrcoef(matched_df['aeth_bc'], matched_df['filter_ec'])[0, 1]
                r_squared = correlation ** 2
            else:
                # A single matched pair cannot support a regression
                slope = intercept = r_squared = None
            
            # Set axis limits to start from 0,0
            ax.set_xlim(left=0)
            ax.set_ylim(bottom=0)
            
            # Add 1:1 line
            max_val = max(ax.get_xlim()[1], ax.get_ylim()[1])
            ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
            
            # Add text with stats, plus the fitted equation when a fit exists
            stats_text = f'n = {len(matched_df)}'
            if slope is not None:
                sign = '+' if intercept >= 0 else '-'
                equation = f'y = {slope:.3f}x {sign} {abs(intercept):.1f}'
                stats_text += f'\nR² = {r_squared:.3f}\n{equation}'
            ax.text(0.05, 0.95, stats_text, 
                   transform=ax.transAxes, fontsize=11, 
                   verticalalignment='top', 
                   bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
            
            ax.set_xlabel('Aethalometer IR BC (ng/m³)', fontsize=12)
            ax.set_ylabel('Filter EC (ng/m³)', fontsize=12)
            ax.set_title(f'Aethalometer BC vs Filter EC - {site_name}', fontsize=14, fontweight='bold')
            ax.legend(loc='lower right')
            ax.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.show()
            print()
        else:
            print(f"{site_name}: No matching data")
    else:
        print(f"{site_name}: No filter EC data available")
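
The matching loop appears twice in the cell above (once for the combined plot and once per site), so it could be factored into a helper. The sketch below mirrors the same logic on synthetic data: a ±1 day window, the mean of all aethalometer values in the window, and µg/m³ converted to ng/m³. The function names `match_filters_to_aeth` and `fit_line` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def match_filters_to_aeth(aeth_df, filters_df, bc_col="IR BCc", tolerance_days=1):
    """Pair each filter sample with the mean aethalometer BC within ±tolerance_days."""
    tol = pd.Timedelta(days=tolerance_days)
    rows = []
    for _, f in filters_df.iterrows():
        in_window = (aeth_df["day_9am"] >= f["SampleDate"] - tol) & (
            aeth_df["day_9am"] <= f["SampleDate"] + tol
        )
        window = aeth_df.loc[in_window, bc_col]
        if window.notna().any():
            rows.append({
                "SampleDate": f["SampleDate"],
                "aeth_bc": window.mean(),               # NaN days are skipped by mean()
                "filter_ec": f["Concentration"] * 1000,  # µg/m³ -> ng/m³
            })
    return pd.DataFrame(rows, columns=["SampleDate", "aeth_bc", "filter_ec"])


def fit_line(x, y):
    """Least-squares line and R²; returns None with fewer than two points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if len(x) < 2:
        return None
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return {"slope": float(slope), "intercept": float(intercept),
            "r2": float(r ** 2), "n": len(x)}


# Synthetic daily aethalometer record and three filter samples
aeth = pd.DataFrame({
    "day_9am": pd.date_range("2023-01-01", periods=5, freq="D"),
    "IR BCc": [1000.0, 1200.0, np.nan, 2000.0, 2200.0],
})
filters = pd.DataFrame({
    "SampleDate": pd.to_datetime(["2023-01-02", "2023-01-05", "2023-02-01"]),
    "Concentration": [1.0, 2.0, 3.0],
})

matched = match_filters_to_aeth(aeth, filters)
fit = fit_line(matched["aeth_bc"], matched["filter_ec"])
print(matched)
print(fit)
```

Note that averaging over the window means one filter can be paired with up to three aethalometer days; matching on the exact date only would be a stricter alternative.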

## 7. Flow1 and Flow2 Time Series and Ratio

In [None]:
# Create combined plot with all sites showing Flow1, Flow2, and their ratio

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    
    # Check if flow columns exist
    has_flow1 = 'Flow1 (mL/min)' in df.columns
    has_flow2 = 'Flow2 (mL/min)' in df.columns
    
    if has_flow1 and has_flow2:
        # Plot Flow1 and Flow2
        valid_data = df[(df['Flow1 (mL/min)'].notna()) & (df['Flow2 (mL/min)'].notna())].copy()
        
        if len(valid_data) > 0:
            ax1.plot(valid_data['day_9am'], valid_data['Flow1 (mL/min)'], 
                    color=config['color'], label=f"{site_name} - Flow1", 
                    alpha=0.7, linewidth=1.5, linestyle='-')
            ax1.plot(valid_data['day_9am'], valid_data['Flow2 (mL/min)'], 
                    color=config['color'], label=f"{site_name} - Flow2", 
                    alpha=0.7, linewidth=1.5, linestyle='--')
            
            # Calculate and plot Flow1/Flow2 ratio
            valid_data['flow_ratio'] = valid_data['Flow1 (mL/min)'] / valid_data['Flow2 (mL/min)']
            ax2.plot(valid_data['day_9am'], valid_data['flow_ratio'], 
                    color=config['color'], label=f"{site_name}", 
                    alpha=0.7, linewidth=2)

ax1.set_ylabel('Flow (mL/min)', fontsize=12)
ax1.set_title("Flow1 and Flow2 Over Time - All Sites", fontsize=14, fontweight='bold')
ax1.legend(loc='best', fontsize=9, ncol=2)
ax1.grid(True, alpha=0.3)

ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5, linewidth=1.5, label='Ratio = 1 (equal flows)')
ax2.set_xlabel('Date', fontsize=12)
ax2.set_ylabel('Flow1 / Flow2 Ratio', fontsize=12)
ax2.set_title("Flow1/Flow2 Ratio Over Time - All Sites", fontsize=14, fontweight='bold')
ax2.legend(loc='best', fontsize=9)
ax2.grid(True, alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Create individual plots for each site
for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    
    # Check if flow data exists
    has_flow1 = 'Flow1 (mL/min)' in df.columns
    has_flow2 = 'Flow2 (mL/min)' in df.columns
    
    if has_flow1 and has_flow2:
        valid_data = df[(df['Flow1 (mL/min)'].notna()) & (df['Flow2 (mL/min)'].notna())].copy()
        
        if len(valid_data) > 0:
            valid_data['flow_ratio'] = valid_data['Flow1 (mL/min)'] / valid_data['Flow2 (mL/min)']
            
            fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), sharex=True)
            
            # Plot Flow1 and Flow2
            ax1.plot(valid_data['day_9am'], valid_data['Flow1 (mL/min)'], 
                    color=config['color'], label='Flow1', 
                    alpha=0.8, linewidth=2, linestyle='-')
            ax1.plot(valid_data['day_9am'], valid_data['Flow2 (mL/min)'], 
                    color=config['color'], label='Flow2', 
                    alpha=0.6, linewidth=2, linestyle='--')
            
            ax1.set_ylabel('Flow (mL/min)', fontsize=12)
            ax1.set_title(f"Flow1 and Flow2 Over Time - {site_name}", fontsize=14, fontweight='bold')
            ax1.legend(loc='best')
            ax1.grid(True, alpha=0.3)
            
            # Plot Flow1/Flow2 ratio
            ax2.plot(valid_data['day_9am'], valid_data['flow_ratio'], 
                    color=config['color'], label='Flow1/Flow2 Ratio', 
                    alpha=0.8, linewidth=2)
            ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5, linewidth=1.5, 
                       label='Ratio = 1 (equal flows)')
            
            # Add mean ratio line
            mean_ratio = valid_data['flow_ratio'].mean()
            ax2.axhline(y=mean_ratio, color='red', linestyle='--', alpha=0.7, linewidth=2, 
                       label=f'Mean Ratio = {mean_ratio:.3f}')
            
            ax2.set_xlabel('Date', fontsize=12)
            ax2.set_ylabel('Flow1 / Flow2 Ratio', fontsize=12)
            ax2.set_title(f"Flow1/Flow2 Ratio Over Time - {site_name}", fontsize=14, fontweight='bold')
            ax2.legend(loc='best')
            ax2.grid(True, alpha=0.3)
            ax2.tick_params(axis='x', rotation=45)
            
            plt.tight_layout()
            plt.show()
            
            # Print statistics
            print(f"\n{site_name} Flow Statistics:")
            print(f"  Flow1 - Mean: {valid_data['Flow1 (mL/min)'].mean():.2f} mL/min, "
                  f"Median: {valid_data['Flow1 (mL/min)'].median():.2f} mL/min, "
                  f"Std: {valid_data['Flow1 (mL/min)'].std():.2f} mL/min")
            print(f"  Flow2 - Mean: {valid_data['Flow2 (mL/min)'].mean():.2f} mL/min, "
                  f"Median: {valid_data['Flow2 (mL/min)'].median():.2f} mL/min, "
                  f"Std: {valid_data['Flow2 (mL/min)'].std():.2f} mL/min")
            print(f"  Flow1/Flow2 Ratio - Mean: {valid_data['flow_ratio'].mean():.3f}, "
                  f"Median: {valid_data['flow_ratio'].median():.3f}, "
                  f"Std: {valid_data['flow_ratio'].std():.3f}")
            print(f"  Flow1/Flow2 Ratio - Min: {valid_data['flow_ratio'].min():.3f}, "
                  f"Max: {valid_data['flow_ratio'].max():.3f}")
    else:
        print(f"\n{site_name}: Flow data not available")

## 8. Flow Ratio Analysis (Flow1/Flow2)

Analysis of dual-spot flow measurements and their ratio over time.
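
The ratio in this section is `Flow1 (mL/min)` divided by `Flow2 (mL/min)`, row by row. Here is a minimal sketch of that calculation on made-up numbers. The `demo` frame and the `flow_ratio_summary` helper are illustrative assumptions, not part of the notebook's data or code. Unlike the plotting cells, the sketch treats a zero Flow2 reading as missing, so it cannot produce an infinite ratio.

```python
import numpy as np
import pandas as pd

def flow_ratio_summary(df, col1='Flow1 (mL/min)', col2='Flow2 (mL/min)'):
    """Return (ratio series, summary dict) for rows where both flows are present."""
    valid = df.dropna(subset=[col1, col2])
    # Treat a zero denominator as missing so the ratio is NaN rather than inf
    ratio = (valid[col1] / valid[col2].replace(0, np.nan)).dropna()
    summary = {
        'mean': ratio.mean(),
        'median': ratio.median(),
        'std': ratio.std(),
        'min': ratio.min(),
        'max': ratio.max(),
        'n': int(ratio.size),
    }
    return ratio, summary

# Hypothetical daily values: one missing Flow1 reading and one zero Flow2 reading
demo = pd.DataFrame({
    'day_9am': pd.date_range('2023-01-01', periods=5, freq='D'),
    'Flow1 (mL/min)': [60.0, 62.0, np.nan, 58.0, 61.0],
    'Flow2 (mL/min)': [40.0, 0.0, 41.0, 42.0, 39.0],
})
ratio, summary = flow_ratio_summary(demo)
print(summary)
```

Only three of the five demo rows give a usable ratio. A mean ratio far from 1 means the two spots are sampling at consistently different flow rates.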

In [None]:
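def create_correlation_scatter(ax, x_data, y_data, x_label, y_label, color='blue', site_name='', equal_axes=False):
    """
    Create a scatter plot with a least-squares linear fit, R², and a statistics box
    
    Parameters:
    -----------
    ax : matplotlib axis to draw on
    x_data : array-like of x values
    y_data : array-like of y values (same length as x_data)
    x_label : str - x-axis label
    y_label : str - y-axis label
    color : str - marker color
    site_name : str - currently unused; kept for call compatibility
    equal_axes : bool - if True, set equal limits for x and y axes (1:1 aspect) and draw a 1:1 line
    
    Returns:
    --------
    dict with statistics (r2, slope, intercept, n, corr), or None if fewer
    than 3 paired non-NaN points are available
    """
    # Coerce to float arrays so masking also works for lists and object-dtype input
    x_data = np.asarray(x_data, dtype=float)
    y_data = np.asarray(y_data, dtype=float)
    
    # Remove NaN values
    mask = (~np.isnan(x_data)) & (~np.isnan(y_data))
    x = x_data[mask]
    y = y_data[mask]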
    
    if len(x) < 3:
        ax.text(0.5, 0.5, 'Insufficient data\nfor correlation', 
                transform=ax.transAxes, ha='center', va='center', fontsize=12)
        return None
    
    # Scatter plot
    ax.scatter(x, y, alpha=0.6, s=60, color=color, edgecolors='white', linewidth=0.5)
    
    # Linear regression
    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    slope, intercept = z[0], z[1]
    
    # Plot regression line
    x_line = np.linspace(x.min(), x.max(), 100)
    ax.plot(x_line, p(x_line), 'r--', linewidth=2, alpha=0.8, label='Best fit')
    
    # Add 1:1 line if axes should be equal
    if equal_axes:
        # Set equal limits
        all_vals = np.concatenate([x, y])
        min_val = max(0, all_vals.min() * 0.95)  # Start from 0 or slightly below min
        max_val = all_vals.max() * 1.05
        
        ax.set_xlim(min_val, max_val)
        ax.set_ylim(min_val, max_val)
        ax.set_aspect('equal', adjustable='box')
        
        # Plot 1:1 line
        ax.plot([min_val, max_val], [min_val, max_val], 'k:', linewidth=1.5, alpha=0.5, label='1:1 line')
    
    # Calculate R² and correlation
    corr = np.corrcoef(x, y)[0, 1]
    r2 = corr**2
    n = len(x)
    
    # Add statistics text box
    stats_text = f'n = {n}\nR² = {r2:.3f}\nSlope = {slope:.3f}\ny = {slope:.3f}x + {intercept:.2f}'
    ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, fontsize=10,
            verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    ax.set_xlabel(x_label, fontsize=11)
    ax.set_ylabel(y_label, fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='lower right', fontsize=8)
    
    return {'r2': r2, 'slope': slope, 'intercept': intercept, 'n': n, 'corr': corr}


def match_filter_to_aeth_data(site_name, site_code, df_aeth, filter_params):
    """
    Match filter data to aethalometer data by date (±1 day tolerance)
    
    Relies on the notebook-level `filter_data` DataFrame, which is read in long
    format ('Site', 'SampleDate', 'Parameter', 'Concentration' columns).
    Because of the ±1 day window, one filter sample can match up to three
    consecutive aethalometer days; multiple samples in a window are averaged.
    
    Parameters:
    -----------
    site_name : str - currently unused; kept for call compatibility
    site_code : str - value to match in filter_data['Site']
    df_aeth : DataFrame with aethalometer data (must contain 'day_9am')
    filter_params : list of parameter names to extract from filter data
    
    Returns:
    --------
    DataFrame with one row per matched aethalometer day
    """
    # Output column for each supported filter parameter
    # (Iron in ng/m³, EC_ftir in ug/m³, HIPS_Fabs in Mm^-1); other parameters are ignored
    param_columns = {
        'ChemSpec_Iron_PM2.5': 'Iron',
        'EC_ftir': 'EC_ftir',
        'HIPS_Fabs': 'HIPS_Fabs',
    }
    one_day = pd.Timedelta(days=1)
    
    # Get filter data for this site
    site_filters = filter_data[filter_data['Site'] == site_code].copy()
    
    # Build one record per aethalometer day that has a filter sample nearby
    matched_records = []
    
    for _, aeth_row in df_aeth.iterrows():
        aeth_date = aeth_row['day_9am']
        
        # Find filter samples within ±1 day
        date_match = site_filters[
            (site_filters['SampleDate'] >= aeth_date - one_day) &
            (site_filters['SampleDate'] <= aeth_date + one_day)
        ]
        
        if len(date_match) > 0:
            # Create a record combining aethalometer and filter data
            record = {
                'date': aeth_date,
                'Red BCc': aeth_row.get('Red BCc', np.nan),
                'IR BCc': aeth_row.get('IR BCc', np.nan),
                'Red BCc smoothed': aeth_row.get('Red BCc smoothed', np.nan),
                'IR BCc smoothed': aeth_row.get('IR BCc smoothed', np.nan)
            }
            
            # Add filter parameters, using the mean if there are multiple measurements
            for param in filter_params:
                param_data = date_match[date_match['Parameter'] == param]
                if len(param_data) > 0 and param in param_columns:
                    record[param_columns[param]] = param_data['Concentration'].mean()
            
            matched_records.append(record)
    
    return pd.DataFrame(matched_records)


print("Correlation analysis functions defined!")

## 9. Correlation Analysis: BCc, FTIR EC, HIPS Fabs, and Iron

Scatter plot analysis comparing aethalometer measurements with filter-based measurements for each site.
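
The comparison depends on pairing each aethalometer day with a filter sample taken within about one day of it. As a rough illustration of that pairing, here is a vectorized sketch using `pd.merge_asof` on made-up data. The `aeth` and `filters` frames are hypothetical and already in wide form (one column per parameter). Unlike the row-by-row matching used in this notebook, this variant keeps only the single nearest filter sample instead of averaging every sample inside the window.

```python
import pandas as pd

# Hypothetical aethalometer days (9 AM timestamps) and filter samples
aeth = pd.DataFrame({
    'day_9am': pd.to_datetime(['2023-01-02 09:00', '2023-01-05 09:00', '2023-01-09 09:00']),
    'IR BCc': [1200.0, 900.0, 1500.0],
})
filters = pd.DataFrame({
    'SampleDate': pd.to_datetime(['2023-01-02', '2023-01-06', '2023-01-20']),
    'EC_ftir': [1.1, 1.6, 0.5],
})

# merge_asof requires both frames to be sorted on their key columns
matched = pd.merge_asof(
    aeth.sort_values('day_9am'),
    filters.sort_values('SampleDate'),
    left_on='day_9am',
    right_on='SampleDate',
    direction='nearest',             # closest sample before or after
    tolerance=pd.Timedelta(days=1),  # ignore samples more than 1 day away
)

# Rows with no filter sample inside the tolerance have NaN filter columns
matched = matched.dropna(subset=['EC_ftir'])
print(matched)
```

In this toy data the first two aethalometer days each find a filter sample within a day, and the third has none and is dropped. Whether nearest-sample or window-average matching is more appropriate depends on how the filter sampling periods line up with the 9 AM day boundaries.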

In [None]:
# Create correlation plots for each site
filter_params_needed = ['EC_ftir', 'HIPS_Fabs', 'ChemSpec_Iron_PM2.5']

# Define the correlations to analyze
# equal_axes=True for plots where x and y have same units (BCc vs EC)
correlations = [
    {'x': 'Red BCc smoothed', 'y': 'EC_ftir', 'xlabel': 'Red BCc (μg/m³)', 'ylabel': 'FTIR EC (μg/m³)', 'title': 'Red BCc vs FTIR EC', 'equal_axes': True},
    {'x': 'Red BCc smoothed', 'y': 'Iron', 'xlabel': 'Red BCc (μg/m³)', 'ylabel': 'Iron (ng/m³)', 'title': 'Red BCc vs Iron', 'equal_axes': False},
    {'x': 'HIPS_Fabs', 'y': 'EC_ftir', 'xlabel': 'HIPS Fabs (Mm⁻¹)', 'ylabel': 'FTIR EC (μg/m³)', 'title': 'HIPS Fabs vs FTIR EC', 'equal_axes': False},
    {'x': 'HIPS_Fabs', 'y': 'Iron', 'xlabel': 'HIPS Fabs (Mm⁻¹)', 'ylabel': 'Iron (ng/m³)', 'title': 'HIPS Fabs vs Iron', 'equal_axes': False},
    {'x': 'EC_ftir', 'y': 'Iron', 'xlabel': 'FTIR EC (μg/m³)', 'ylabel': 'Iron (ng/m³)', 'title': 'FTIR EC vs Iron', 'equal_axes': False}
]

# Analyze each site
for site_name, df_aeth in aethalometer_data.items():
    config = SITES[site_name]
    site_code = config['code']
    
    print(f"\n{'='*80}")
    print(f"CORRELATION ANALYSIS - {site_name}")
    print(f"{'='*80}")
    
    # Match filter data to aethalometer data
    matched_df = match_filter_to_aeth_data(site_name, site_code, df_aeth, filter_params_needed)
    
    if len(matched_df) == 0:
        print(f"❌ No matched data available for {site_name}")
        continue
    
    print(f"✓ Matched {len(matched_df)} days with overlapping measurements")
    
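    # Convert BCc from ng/m³ to μg/m³ so it shares units with FTIR EC.
    # Heuristic: a column mean above 100 is assumed to be in ng/m³, so a
    # site averaging below 100 ng/m³ would be left unconverted.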
    for col in ['Red BCc', 'IR BCc', 'Red BCc smoothed', 'IR BCc smoothed']:
        if col in matched_df.columns:
            mean_val = matched_df[col].mean()
            if pd.notna(mean_val) and mean_val > 100:  # Likely in ng/m³
                matched_df[col] = matched_df[col] / 1000.0
    
    # Create 2x3 subplot grid (5 correlations + 1 summary)
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()
    
    # Store statistics for summary
    stats_summary = []
    
    # Create each correlation plot
    for idx, corr_config in enumerate(correlations):
        ax = axes[idx]
        
        # Check if data columns exist
        if corr_config['x'] not in matched_df.columns or corr_config['y'] not in matched_df.columns:
            ax.text(0.5, 0.5, f"Data not available\n{corr_config['title']}", 
                   transform=ax.transAxes, ha='center', va='center', fontsize=12)
            ax.set_title(corr_config['title'], fontsize=12, fontweight='bold')
            continue
        
        # Get data
        x_data = matched_df[corr_config['x']].values
        y_data = matched_df[corr_config['y']].values
        
        # Create scatter plot
        # (named fit_stats so it cannot shadow a module imported as `stats`, e.g. scipy.stats)
        fit_stats = create_correlation_scatter(
            ax, x_data, y_data,
            corr_config['xlabel'], corr_config['ylabel'],
            color=config['color'], site_name=site_name,
            equal_axes=corr_config.get('equal_axes', False)
        )
        
        ax.set_title(corr_config['title'], fontsize=12, fontweight='bold')
        
        # Store stats (None means fewer than 3 valid points)
        if fit_stats is not None:
            stats_summary.append({
                'comparison': corr_config['title'],
                **fit_stats
            })
            print(f"  {corr_config['title']:30s}: R² = {fit_stats['r2']:.3f}, Slope = {fit_stats['slope']:8.3f}, n = {fit_stats['n']}")
    
    # Create summary table in last subplot
    ax_summary = axes[5]
    ax_summary.axis('off')
    
    if len(stats_summary) > 0:
        # Create summary text
        summary_text = f"{site_name} - Correlation Summary\n\n"
        summary_text += f"{'Comparison':<25s} {'R²':>8s} {'Slope':>10s} {'n':>5s}\n"
        summary_text += "-" * 50 + "\n"
        
        for stat in stats_summary:
            summary_text += f"{stat['comparison']:<25s} {stat['r2']:>8.3f} {stat['slope']:>10.3f} {stat['n']:>5d}\n"
        
        ax_summary.text(0.1, 0.9, summary_text, transform=ax_summary.transAxes,
                       fontsize=10, verticalalignment='top', family='monospace',
                       bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.5))
    
    fig.suptitle(f'Correlation Analysis - {site_name} ({config["location"]})', 
                fontsize=16, fontweight='bold', y=0.995)
    plt.tight_layout(rect=[0, 0, 1, 0.99])
    plt.show()
    
    print(f"✓ Analysis complete for {site_name}\n")

In [None]:
# Example: Your custom analysis here
# Access data via:
# - aethalometer_data['Beijing'], aethalometer_data['Delhi'], etc.
# - filter_data

# Example: Plot temperature vs BC for one site
site_name = 'JPL'  # Change this to any site
df = aethalometer_data[site_name]

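import matplotlib.dates as mdates

if 'Sample temp (C)' in df.columns and 'IR BCc' in df.columns:
    valid = df[(df['Sample temp (C)'].notna()) & (df['IR BCc'].notna())]
    
    # Color points by date: convert to Matplotlib date numbers so the
    # colorbar shows readable dates instead of raw integer timestamps
    date_nums = mdates.date2num(pd.to_datetime(valid['day_9am']).to_numpy())
    
    fig, ax = plt.subplots(figsize=(10, 6))
    scatter = ax.scatter(valid['Sample temp (C)'], valid['IR BCc'], 
                        c=date_nums, cmap='viridis', 
                        alpha=0.6, s=50)
    ax.set_xlabel('Temperature (°C)', fontsize=12)
    ax.set_ylabel('IR BC (ng/m³)', fontsize=12)
    ax.set_title(f'Temperature vs BC - {site_name}', fontsize=14, fontweight='bold')
    date_locator = mdates.AutoDateLocator()
    plt.colorbar(scatter, ax=ax, label='Date', ticks=date_locator,
                 format=mdates.AutoDateFormatter(date_locator))
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()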

In [None]:
# Example: Export summary table to CSV
# summary_df.to_csv('site_summary.csv', index=False)
# print("Summary exported to site_summary.csv")