# Multi-Site Aethalometer Analysis: Follow-Up Analyses

This notebook contains follow-up analyses building on the main Multi_Site_Analysis_Modular notebook.

**Analyses included:**
1. Iron Color Gradient Cross-Plots
2. Multi-Site Comparison Plots (All Sites Including Addis Ababa)
3. HIPS vs Aethalometer Smooth/Raw Threshold Analysis
4. Before/After Flow Fix Separation Analysis
5. Wavelength Dependence Analysis (All Sites)
6. Research Questions Summary

---

## 1. Setup and Imports

In [None]:
# Add scripts folder to path
import sys
sys.path.insert(0, './scripts')

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import from our modular scripts
from config import (
    SITES, PROCESSED_SITES_DIR, FILTER_DATA_PATH,
    MAC_VALUE, FLOW_FIX_PERIODS, MIN_EC_THRESHOLD,
    SMOOTH_RAW_THRESHOLDS, CROSS_COMPARISONS
)

from outliers import (
    EXCLUDED_SAMPLES, MANUAL_OUTLIERS,
    apply_exclusion_flags, apply_threshold_flags,
    get_clean_data, print_exclusion_summary,
    identify_outlier_dates
)

from data_matching import (
    load_aethalometer_data, load_filter_data,
    match_aeth_filter_data, match_all_parameters,
    match_with_smooth_raw_info, add_flow_period_column,
    get_site_code, get_site_color, print_data_summary
)

from plotting import (
    calculate_regression_stats,
    plot_crossplot, plot_before_after_comparison,
    create_tiled_threshold_plots, plot_smooth_raw_distribution,
    plot_bc_timeseries, plot_multiwavelength_bc,
    print_comparison_table
)

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Modules loaded successfully!")
print(f"\nSites configured: {list(SITES.keys())}")
print(f"MAC value: {MAC_VALUE} m^2/g")

## 2. Load Data

In [None]:
# Load all aethalometer datasets
aethalometer_data = load_aethalometer_data()

In [None]:
# Load filter dataset
filter_data = load_filter_data()

In [None]:
# Match data for each site
all_matched_data = {}

for site_name in SITES:
    if site_name not in aethalometer_data:
        continue
    
    config = SITES[site_name]
    df_aeth = aethalometer_data[site_name]
    
    # Match aethalometer and filter data
    matched = match_aeth_filter_data(
        site_name, df_aeth, filter_data, config['code']
    )
    
    if matched is not None and len(matched) >= 3:
        all_matched_data[site_name] = matched
        print(f"{site_name}: {len(matched)} matched pairs")
    else:
        print(f"{site_name}: Insufficient matched data")

In [None]:
# Match all parameters for each site
all_params_data = {}

for site_name in SITES:
    if site_name not in aethalometer_data:
        continue
    
    config = SITES[site_name]
    df_aeth = aethalometer_data[site_name]
    
    matched = match_all_parameters(
        site_name, config['code'], df_aeth, filter_data
    )
    
    if matched is not None and len(matched) >= 3:
        all_params_data[site_name] = matched
        
        # Show available parameters
        available = [col for col in ['ir_bcc', 'hips_fabs', 'ftir_ec', 'iron'] 
                     if col in matched.columns and matched[col].notna().any()]
        print(f"{site_name}: {len(matched)} days, params: {', '.join(available)}")
    else:
        print(f"{site_name}: Insufficient data")

---

## 3. Iron Color Gradient Cross-Plots

Color each point by iron concentration to show whether high-iron samples deviate from the regression line, which would suggest iron interference in the measurement.
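
The visual check can also be expressed as a number. The sketch below, which assumes an ordinary least-squares fit is appropriate, computes each sample's residual from the fitted line and correlates the residuals with iron. The helper name `iron_residual_correlation` and the sample values are hypothetical; this is not part of the project's `plotting` module.

```python
import numpy as np

def iron_residual_correlation(x, y, iron):
    """Correlate OLS residuals with iron concentration (hypothetical helper)."""
    x, y, iron = (np.asarray(a, dtype=float) for a in (x, y, iron))

    # Keep only rows where all three values are present
    valid = ~(np.isnan(x) | np.isnan(y) | np.isnan(iron))
    x, y, iron = x[valid], y[valid], iron[valid]
    if len(x) < 3:
        return None

    # Straight-line fit, then residual of each sample from that line
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)

    # Pearson correlation between iron and the residuals
    r = np.corrcoef(iron, residuals)[0, 1]
    return {'n': int(len(x)), 'slope': slope, 'intercept': intercept,
            'residual_iron_r': r}

# Made-up values for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, np.nan]
y = [1.1, 2.0, 3.2, 4.9, 6.5, 3.0]
iron = [0.1, 0.1, 0.2, 0.8, 1.5, 0.3]

result = iron_residual_correlation(x, y, iron)
print(result)
```

A `residual_iron_r` far from zero would be consistent with iron-related deviation, although with few samples it should be read as a hint rather than evidence.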

In [None]:
# Helper function for cross-plots with iron color gradient
def plot_crossplot_iron_gradient(ax, x_data, y_data, iron_data, x_label, y_label,
                                   equal_axes=True, outlier_mask=None, cmap='viridis'):
    """
    Cross-plot with iron concentration as color gradient.
    
    Parameters:
    -----------
    ax : matplotlib axis
    x_data, y_data, iron_data : arrays
    x_label, y_label : str
    equal_axes : bool
    outlier_mask : boolean array (True = outlier)
    cmap : colormap name
    
    Returns:
    --------
    dict with regression stats, scatter object
    """
    x_data = np.asarray(x_data)
    y_data = np.asarray(y_data)
    iron_data = np.asarray(iron_data)
    
    # Valid data mask (all three must be valid)
    valid_mask = (~np.isnan(x_data)) & (~np.isnan(y_data)) & (~np.isnan(iron_data))
    
    if outlier_mask is not None:
        # Force a boolean dtype so that ~outlier_mask is a logical NOT
        outlier_mask = np.asarray(outlier_mask, dtype=bool)
        clean_mask = valid_mask & ~outlier_mask
        outlier_plot_mask = valid_mask & outlier_mask
    else:
        clean_mask = valid_mask
        outlier_plot_mask = np.zeros(len(x_data), dtype=bool)
    
    x_clean = x_data[clean_mask]
    y_clean = y_data[clean_mask]
    iron_clean = iron_data[clean_mask]
    
    if len(x_clean) < 3:
        ax.text(0.5, 0.5, 'Insufficient data with iron values', 
                transform=ax.transAxes, ha='center', va='center')
        return None, None
    
    # Plot with iron color gradient
    scatter = ax.scatter(x_clean, y_clean, c=iron_clean, cmap=cmap,
                        alpha=0.7, s=100, edgecolors='black', linewidth=0.5)
    
    # Add colorbar
    cbar = plt.colorbar(scatter, ax=ax, shrink=0.8)
    cbar.set_label('Iron (ug/m3)', fontsize=10)
    
    # Plot outliers as red X
    if outlier_plot_mask.any():
        ax.scatter(x_data[outlier_plot_mask], y_data[outlier_plot_mask],
                   color='red', alpha=0.9, s=200, marker='X', linewidths=3,
                   label=f'Excluded (n={outlier_plot_mask.sum()})')
    
    # Calculate regression
    stats = calculate_regression_stats(x_clean, y_clean)
    
    if stats:
        # Set axis limits
        if equal_axes:
            all_vals = np.concatenate([x_clean, y_clean])
            max_val = all_vals.max() * 1.1
            ax.set_xlim(0, max_val)
            ax.set_ylim(0, max_val)
            ax.set_aspect('equal', adjustable='box')
            x_line = np.array([0, max_val])
        else:
            ax.set_xlim(left=0)
            ax.set_ylim(bottom=0)
            x_line = np.array([0, x_clean.max() * 1.1])
        
        # Plot regression line
        y_line = stats['slope'] * x_line + stats['intercept']
        ax.plot(x_line, y_line, 'r-', linewidth=2, alpha=0.8, label='Best fit')
        
        # 1:1 line
        if equal_axes:
            ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
        
        # Stats text
        sign = '+' if stats['intercept'] >= 0 else '-'
        eq = f"y = {stats['slope']:.3f}x {sign} {abs(stats['intercept']):.2f}"
        text = f"n = {stats['n']}\nR^2 = {stats['r_squared']:.3f}\n{eq}"
        ax.text(0.05, 0.95, text, transform=ax.transAxes, fontsize=10,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
    
    ax.set_xlabel(x_label, fontsize=11)
    ax.set_ylabel(y_label, fontsize=11)
    ax.legend(loc='lower right', fontsize=8)
    ax.grid(True, alpha=0.3)
    
    return stats, scatter

print("Iron gradient plotting function defined.")

In [None]:
# Create iron gradient cross-plots for each site
# Comparisons: FTIR EC vs Aeth, HIPS vs Aeth, HIPS vs FTIR EC

iron_gradient_comparisons = [
    {
        'name': 'FTIR EC vs Aethalometer IR BCc',
        'x_col': 'ir_bcc',
        'y_col': 'ftir_ec',
        'x_label': 'Aethalometer IR BCc (ug/m3)',
        'y_label': 'FTIR EC (ug/m3)',
        'equal_axes': True
    },
    {
        'name': 'HIPS Fabs vs Aethalometer IR BCc',
        'x_col': 'ir_bcc',
        'y_col': 'hips_fabs',
        'x_label': 'Aethalometer IR BCc (ug/m3)',
        'y_label': 'HIPS Fabs / MAC (ug/m3)',
        'equal_axes': True
    },
    {
        'name': 'HIPS Fabs vs FTIR EC',
        'x_col': 'ftir_ec',
        'y_col': 'hips_fabs',
        'x_label': 'FTIR EC (ug/m3)',
        'y_label': 'HIPS Fabs / MAC (ug/m3)',
        'equal_axes': True
    }
]

for site_name, matched_df in all_params_data.items():
    config = SITES[site_name]
    
    print(f"\n{'='*60}")
    print(f"{site_name}: Iron Gradient Cross-Plots")
    print(f"{'='*60}")
    
    # Apply exclusion flags
    matched_df = apply_exclusion_flags(matched_df.copy(), site_name)
    outlier_mask = matched_df['is_excluded'].values if 'is_excluded' in matched_df.columns else None
    
    for comp in iron_gradient_comparisons:
        x_col = comp['x_col']
        y_col = comp['y_col']
        
        # Check if required columns exist
        if x_col not in matched_df.columns or y_col not in matched_df.columns:
            print(f"  {comp['name']}: Missing columns")
            continue
        
        if 'iron' not in matched_df.columns:
            print(f"  {comp['name']}: No iron data")
            continue
        
        # Check for sufficient data
        valid_count = ((~np.isnan(matched_df[x_col].values)) & 
                       (~np.isnan(matched_df[y_col].values)) &
                       (~np.isnan(matched_df['iron'].values))).sum()
        
        if valid_count < 3:
            print(f"  {comp['name']}: Insufficient data with iron (n={valid_count})")
            continue
        
        fig, ax = plt.subplots(figsize=(10, 9))
        
        stats, _ = plot_crossplot_iron_gradient(
            ax,
            matched_df[x_col].values,
            matched_df[y_col].values,
            matched_df['iron'].values,
            comp['x_label'],
            comp['y_label'],
            equal_axes=comp['equal_axes'],
            outlier_mask=outlier_mask,
            cmap='plasma'  # Good for showing iron concentration
        )
        
        ax.set_title(f"{site_name}: {comp['name']}\n(Color = Iron Concentration)", 
                     fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        if stats:
            print(f"  {comp['name']}: R^2 = {stats['r_squared']:.3f}, n = {stats['n']}")

---

## 4. Multi-Site Comparison Plots (All Sites Including Addis Ababa)

Combine all sites on single plots for direct comparison. Each site has its own color, and outliers are shown with consistent red X markers.
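
A pooled fit can hide site-specific behavior, which is why per-site statistics are worth reading next to the overall line. The sketch below illustrates this with two made-up sites. `fit_line` is a hypothetical stand-in for the project's `calculate_regression_stats`, whose implementation is not shown in this notebook.

```python
import numpy as np

def fit_line(x, y):
    """OLS slope, intercept, R^2, and n; None with fewer than 3 points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) < 3:
        return None
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return {'slope': slope, 'intercept': intercept,
            'r_squared': r ** 2, 'n': len(x)}

# Made-up data: each site is perfectly linear, but with a different slope
site_xy = {
    'Site A': ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
    'Site B': ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),
}

# Fit each site separately
per_site = {name: fit_line(x, y) for name, (x, y) in site_xy.items()}

# Fit all sites pooled together
pooled_x = np.concatenate([x for x, _ in site_xy.values()])
pooled_y = np.concatenate([y for _, y in site_xy.values()])
pooled = fit_line(pooled_x, pooled_y)

for name, stats in per_site.items():
    print(f"{name}: slope={stats['slope']:.2f}, R^2={stats['r_squared']:.2f}")
print(f"Pooled: slope={pooled['slope']:.2f}, R^2={pooled['r_squared']:.2f}")
```

Here both sites fit their own line exactly, while the pooled fit has an intermediate slope and a lower R^2, so a weak overall fit does not by itself mean weak agreement at any single site.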

In [None]:
# Multi-site comparison plots
# Show all sites on same axes for direct comparison

multisite_comparisons = [
    {
        'name': 'Aethalometer IR BCc vs FTIR EC',
        'x_col': 'ir_bcc',
        'y_col': 'ftir_ec',
        'x_label': 'Aethalometer IR BCc (ug/m3)',
        'y_label': 'FTIR EC (ug/m3)',
        'equal_axes': True
    },
    {
        'name': 'HIPS Fabs vs FTIR EC',
        'x_col': 'hips_fabs',
        'y_col': 'ftir_ec',
        'x_label': 'HIPS Fabs / MAC (ug/m3)',
        'y_label': 'FTIR EC (ug/m3)',
        'equal_axes': True
    },
    {
        'name': 'HIPS Fabs vs Aethalometer IR BCc',
        'x_col': 'hips_fabs',
        'y_col': 'ir_bcc',
        'x_label': 'HIPS Fabs / MAC (ug/m3)',
        'y_label': 'Aethalometer IR BCc (ug/m3)',
        'equal_axes': True
    }
]

for comp in multisite_comparisons:
    fig, ax = plt.subplots(figsize=(12, 10))
    
    all_x = []
    all_y = []
    site_stats = {}
    
    for site_name, matched_df in all_params_data.items():
        config = SITES[site_name]
        
        x_col = comp['x_col']
        y_col = comp['y_col']
        
        # Check if columns exist
        if x_col not in matched_df.columns or y_col not in matched_df.columns:
            continue
        
        # Apply exclusion flags
        df_flagged = apply_exclusion_flags(matched_df.copy(), site_name)
        outlier_mask = df_flagged['is_excluded'].values if 'is_excluded' in df_flagged.columns else None
        
        x_data = df_flagged[x_col].values
        y_data = df_flagged[y_col].values
        
        # Valid data mask
        valid_mask = (~np.isnan(x_data)) & (~np.isnan(y_data))
        
        if outlier_mask is not None:
            clean_mask = valid_mask & ~outlier_mask
            outlier_plot_mask = valid_mask & outlier_mask
        else:
            clean_mask = valid_mask
            outlier_plot_mask = np.zeros(len(x_data), dtype=bool)
        
        x_clean = x_data[clean_mask]
        y_clean = y_data[clean_mask]
        
        if len(x_clean) < 3:
            continue
        
        all_x.extend(x_clean)
        all_y.extend(y_clean)
        
        # Plot clean data for this site
        ax.scatter(x_clean, y_clean, color=config['color'], alpha=0.6, s=80,
                   edgecolors='black', linewidth=0.5,
                   label=f"{site_name} (n={len(x_clean)})")
        
        # Plot outliers as red X (no legend entry to avoid duplicates)
        if outlier_plot_mask.any():
            ax.scatter(x_data[outlier_plot_mask], y_data[outlier_plot_mask],
                       color='red', alpha=0.9, s=200, marker='X', linewidths=3)
        
        # Calculate site stats
        stats = calculate_regression_stats(x_clean, y_clean)
        if stats:
            site_stats[site_name] = stats
    
    if len(all_x) < 3:
        print(f"{comp['name']}: Insufficient data")
        plt.close()
        continue
    
    # Set axis limits
    if comp['equal_axes']:
        max_val = max(max(all_x), max(all_y)) * 1.1
        ax.set_xlim(0, max_val)
        ax.set_ylim(0, max_val)
        ax.set_aspect('equal', adjustable='box')
        ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
    else:
        ax.set_xlim(left=0)
        ax.set_ylim(bottom=0)
    
    # Add overall regression (convert the pooled lists to arrays, matching the per-site calls)
    overall_stats = calculate_regression_stats(np.asarray(all_x), np.asarray(all_y))
    if overall_stats:
        x_line = np.array([0, max(all_x) * 1.1])
        y_line = overall_stats['slope'] * x_line + overall_stats['intercept']
        ax.plot(x_line, y_line, 'k-', linewidth=2.5, alpha=0.6, 
                label=f'Overall fit (R^2={overall_stats["r_squared"]:.3f})')
        
        sign = '+' if overall_stats['intercept'] >= 0 else '-'
        eq = f"y = {overall_stats['slope']:.3f}x {sign} {abs(overall_stats['intercept']):.2f}"
        text = f"OVERALL (all sites)\nn = {overall_stats['n']}\nR^2 = {overall_stats['r_squared']:.3f}\n{eq}"
        ax.text(0.05, 0.95, text, transform=ax.transAxes, fontsize=10,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
    
    # Add red X to legend (once)
    ax.scatter([], [], color='red', marker='X', s=200, linewidths=3, label='Excluded outliers')
    
    ax.set_xlabel(comp['x_label'], fontsize=12)
    ax.set_ylabel(comp['y_label'], fontsize=12)
    ax.set_title(f"Multi-Site Comparison: {comp['name']}", fontsize=14, fontweight='bold')
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print stats table
    print(f"\n{comp['name']} - Site Statistics:")
    print(f"{'Site':<15s} {'n':>8s} {'R^2':>10s} {'Slope':>10s}")
    print("-" * 45)
    for site_name, stats in site_stats.items():
        print(f"{site_name:<15s} {stats['n']:>8d} {stats['r_squared']:>10.3f} {stats['slope']:>10.3f}")
    if overall_stats:
        print("-" * 45)
        print(f"{'OVERALL':<15s} {overall_stats['n']:>8d} {overall_stats['r_squared']:>10.3f} {overall_stats['slope']:>10.3f}")
    print()

---

## 5. HIPS vs Aethalometer Smooth/Raw Threshold Analysis

Repeat the smooth/raw threshold analysis for the HIPS vs Aethalometer comparison (absorption vs absorption) instead of FTIR EC vs Aethalometer.

This comparison is more direct because HIPS and the aethalometer both measure light absorption.
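
The quantities involved can be sketched as follows: the absolute percent difference between smoothed and raw BC, a threshold filter on that difference, and the conversion of HIPS Fabs to a BC-equivalent mass by dividing by MAC. The helper name, the sample values, and `MAC_ASSUMED = 10.0` are placeholders for illustration; the notebook itself reads `MAC_VALUE` and `SMOOTH_RAW_THRESHOLDS` from `config`.

```python
import numpy as np

MAC_ASSUMED = 10.0  # m^2/g, placeholder only; the real value comes from config.MAC_VALUE

def smooth_raw_abs_pct(bc_raw, bc_smooth):
    """Absolute % difference between smoothed and raw BC.

    Returns NaN where raw BC is zero or either value is missing.
    """
    bc_raw = np.asarray(bc_raw, dtype=float)
    bc_smooth = np.asarray(bc_smooth, dtype=float)
    pct = np.full(bc_raw.shape, np.nan)
    ok = (bc_raw != 0) & ~np.isnan(bc_raw) & ~np.isnan(bc_smooth)
    pct[ok] = np.abs((bc_smooth[ok] - bc_raw[ok]) / bc_raw[ok]) * 100
    return pct

# Made-up daily values
bc_raw = np.array([1000.0, 2000.0, 0.0, 500.0])        # ng/m3
bc_smooth = np.array([1020.0, 2300.0, 10.0, np.nan])   # ng/m3
fabs = np.array([12.0, 25.0, 1.0, 6.0])                # Mm^-1

abs_pct = smooth_raw_abs_pct(bc_raw, bc_smooth)

# Keep samples at or below the threshold; NaN compares as False and is dropped
threshold = 4.0
keep = abs_pct <= threshold

# Unit conversions used for the cross-plot axes
aeth_bc = bc_raw / 1000             # ng/m3 -> ug/m3
hips_bc_equiv = fabs / MAC_ASSUMED  # Mm^-1 / (m^2/g) -> ug/m3

print(abs_pct, keep)
```

Samples with no smoothed value fall into neither the kept nor the removed group, so the two counts may not add up to the number of matched pairs.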

In [None]:
# Match HIPS data with aethalometer smooth/raw info
def match_hips_with_smooth_raw_info(site_name, df_aeth, filter_data, site_code, wavelength='IR'):
    """
    Match HIPS Fabs and aethalometer data, including smooth/raw difference info.
    
    Returns a DataFrame (or None if nothing matches) with:
    - date
    - aeth_bc (raw BC, ug/m3)
    - aeth_bc_smooth (ug/m3; NaN if no smoothed column)
    - hips_fabs (ug/m3 equivalent: Fabs divided by MAC)
    - smooth_raw_pct: signed % difference, (smooth - raw) / raw * 100
    - smooth_raw_abs_pct: absolute % difference
    - filter_id
    """
    raw_col = f'{wavelength} BCc'
    smooth_col = f'{wavelength} BCc smoothed'
    
    # Check if columns exist
    if raw_col not in df_aeth.columns:
        print(f"  {site_name}: {raw_col} not found")
        return None
    
    has_smooth = smooth_col in df_aeth.columns
    
    # Get HIPS Fabs data for this site
    site_hips = filter_data[
        (filter_data['Site'] == site_code) &
        (filter_data['Parameter'] == 'HIPS_Fabs')
    ].copy()
    
    if len(site_hips) == 0:
        print(f"  {site_name}: No HIPS data available")
        return None
    
    matched_records = []
    
    for _, filter_row in site_hips.iterrows():
        filter_date = filter_row['SampleDate']
        
        date_match = df_aeth[
            (df_aeth['day_9am'] >= filter_date - pd.Timedelta(days=1)) &
            (df_aeth['day_9am'] <= filter_date + pd.Timedelta(days=1))
        ]
        
        if len(date_match) > 0:
            bc_raw = date_match[raw_col].mean()  # ng/m3
            bc_smooth = date_match[smooth_col].mean() if has_smooth else np.nan
            
            if pd.notna(bc_raw) and pd.notna(filter_row['Concentration']):
                hips_fabs = filter_row['Concentration'] / MAC_VALUE  # Mm^-1 to ug/m3
                
                # Calculate % difference
                if bc_raw != 0 and pd.notna(bc_smooth):
                    pct_diff = ((bc_smooth - bc_raw) / bc_raw) * 100
                else:
                    pct_diff = np.nan
                
                matched_records.append({
                    'date': filter_date,
                    'aeth_bc': bc_raw / 1000,  # ng to ug/m3
                    'aeth_bc_smooth': bc_smooth / 1000 if pd.notna(bc_smooth) else np.nan,
                    'hips_fabs': hips_fabs,
                    'smooth_raw_pct': pct_diff,
                    'smooth_raw_abs_pct': abs(pct_diff) if pd.notna(pct_diff) else np.nan,
                    'filter_id': filter_row.get('FilterId', 'unknown')
                })
    
    return pd.DataFrame(matched_records) if matched_records else None

print("HIPS smooth/raw matching function defined.")

In [None]:
# Match HIPS with smooth/raw info for each site
hips_smooth_raw_data = {}

for site_name in SITES:
    if site_name not in aethalometer_data:
        continue
    
    config = SITES[site_name]
    df_aeth = aethalometer_data[site_name]
    
    matched = match_hips_with_smooth_raw_info(
        site_name, df_aeth, filter_data, config['code']
    )
    
    if matched is not None and len(matched) >= 3:
        n_with_smooth = matched['smooth_raw_abs_pct'].notna().sum()
        hips_smooth_raw_data[site_name] = matched
        print(f"{site_name}: {len(matched)} matched HIPS-Aeth pairs, {n_with_smooth} with smooth data")
    else:
        print(f"{site_name}: Insufficient HIPS-Aeth data")

In [None]:
# HIPS vs Aethalometer threshold analysis (individual plots per threshold)
hips_threshold_results = {}

for site_name, matched_df in hips_smooth_raw_data.items():
    config = SITES[site_name]
    
    if matched_df['smooth_raw_abs_pct'].notna().sum() < 3:
        print(f"{site_name}: Insufficient smooth data for HIPS analysis")
        continue
    
    print(f"\n{'='*60}")
    print(f"{site_name}: HIPS vs Aethalometer Smooth/Raw Threshold Analysis")
    print(f"{'='*60}")
    
    site_results = {}
    
    # Calculate axis limits (same for all plots for this site)
    all_vals = np.concatenate([
        matched_df['aeth_bc'].dropna().values,
        matched_df['hips_fabs'].dropna().values
    ])
    max_val = all_vals.max() * 1.1 if len(all_vals) > 0 else 100
    
    for threshold in SMOOTH_RAW_THRESHOLDS:
        # Separate by threshold
        below_threshold = matched_df[matched_df['smooth_raw_abs_pct'] <= threshold].copy()
        above_threshold = matched_df[matched_df['smooth_raw_abs_pct'] > threshold].copy()
        
        n_kept = len(below_threshold)
        n_removed = len(above_threshold)
        
        if n_kept < 3:
            print(f"  Threshold <={threshold}%: Insufficient data (n={n_kept})")
            continue
        
        # Create individual figure for this threshold
        fig, ax = plt.subplots(figsize=(9, 9))
        
        # Plot kept points
        ax.scatter(below_threshold['aeth_bc'], below_threshold['hips_fabs'],
                   color=config['color'], alpha=0.6, s=80,
                   edgecolors='black', linewidth=1,
                   label=f'Kept (n={n_kept})')
        
        # Plot removed points as red X
        if len(above_threshold) > 0:
            ax.scatter(above_threshold['aeth_bc'], above_threshold['hips_fabs'],
                       color='red', alpha=0.9, s=200, marker='X',
                       linewidths=3, label=f'Removed (n={n_removed})')
        
        # Calculate regression on kept points
        stats = calculate_regression_stats(
            below_threshold['aeth_bc'].values,
            below_threshold['hips_fabs'].values
        )
        
        if stats:
            # Plot regression line
            x_line = np.array([0, max_val])
            y_line = stats['slope'] * x_line + stats['intercept']
            ax.plot(x_line, y_line, 'g-', linewidth=2, alpha=0.8, label='Best fit')
            
            # Stats text
            sign = '+' if stats['intercept'] >= 0 else '-'
            eq = f"y = {stats['slope']:.3f}x {sign} {abs(stats['intercept']):.2f}"
            stats_text = f"n = {stats['n']}\nR^2 = {stats['r_squared']:.3f}\n{eq}"
            
            ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, fontsize=10,
                    verticalalignment='top',
                    bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.9))
            
            site_results[threshold] = {
                'n_kept': n_kept,
                'n_removed': n_removed,
                **stats
            }
        
        # Set axes
        ax.set_xlim(0, max_val)
        ax.set_ylim(0, max_val)
        ax.set_aspect('equal', adjustable='box')
        ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
        
        ax.set_xlabel('Aethalometer IR BCc (ug/m3)', fontsize=11)
        ax.set_ylabel('HIPS Fabs / MAC (ug/m3)', fontsize=11)
        ax.set_title(f'{site_name}: HIPS vs Aeth, Threshold <={threshold}% Smooth-Raw Diff\n(Absorption vs Absorption)',
                     fontsize=13, fontweight='bold')
        ax.legend(loc='lower right', fontsize=9)
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        if stats:
            print(f"  Threshold <={threshold}%: R^2 = {stats['r_squared']:.3f}, n = {stats['n']}")
    
    hips_threshold_results[site_name] = site_results

In [None]:
# Summary table for HIPS vs Aethalometer threshold analysis
print("\n" + "="*80)
print("SUMMARY: HIPS vs Aethalometer - Smooth/Raw Threshold Impact")
print("="*80)
print("(Absorption vs Absorption comparison)")

print_comparison_table(hips_threshold_results, metric_name='HIPS-Aeth Threshold Impact')

---

## 6. Before/After Flow Fix Separation Analysis

Separate the data by flow fix period and compare regression statistics. This is particularly relevant for Beijing and JPL, which had flow issues early in deployment.
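
The period assignment can be sketched in vectorized form as below. The helper name `classify_flow_period` and the dates are hypothetical; the labels `before_fix`, `after_fix`, and `gap_period` match the ones used in this section, and the real boundaries come from `FLOW_FIX_PERIODS` in `config`.

```python
import numpy as np
import pandas as pd

def classify_flow_period(dates, before_end=None, after_start=None):
    """Label each date 'before_fix', 'after_fix', or 'gap_period' (hypothetical helper)."""
    dates = pd.to_datetime(pd.Series(dates))

    # Start with all-False masks so a missing boundary matches nothing
    is_before = pd.Series(False, index=dates.index)
    is_after = pd.Series(False, index=dates.index)
    if before_end is not None:
        is_before = dates <= pd.Timestamp(before_end)
    if after_start is not None:
        is_after = dates >= pd.Timestamp(after_start)

    # The first matching condition wins; everything else is the gap period
    labels = np.select([is_before, is_after],
                       ['before_fix', 'after_fix'],
                       default='gap_period')
    return pd.Series(labels, index=dates.index)

# Made-up dates and boundaries
dates = ['2022-01-10', '2022-03-01', '2022-03-15', '2022-06-01']
periods = classify_flow_period(dates,
                               before_end='2022-03-01',
                               after_start='2022-04-01')
print(periods.tolist())
```

Both boundaries are inclusive, so a sample dated exactly on `before_end` counts as before the fix, and dates between the two boundaries are set aside rather than assigned to either period.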

In [None]:
# Show flow fix period configuration
print("Flow Fix Period Configuration:")
print("="*60)
for site_name, config in FLOW_FIX_PERIODS.items():
    print(f"\n{site_name}:")
    print(f"  Description: {config['description']}")
    print(f"  Before end: {config['before_end']}")
    print(f"  After start: {config['after_start']}")
    print(f"  Notes: {config['notes']}")

In [None]:
# Before/After Flow Fix Analysis
# Compare Aethalometer vs FTIR EC for before/after periods

flow_fix_results = {}

for site_name, matched_df in all_matched_data.items():
    config = SITES[site_name]
    flow_config = FLOW_FIX_PERIODS.get(site_name, {})
    
    # Check if this site has flow fix periods defined
    before_end = flow_config.get('before_end')
    after_start = flow_config.get('after_start')
    
    if before_end is None and after_start is None:
        print(f"\n{site_name}: No flow fix periods defined (all data treated as single period)")
        continue
    
    print(f"\n{'='*60}")
    print(f"{site_name}: Before/After Flow Fix Analysis")
    print(f"{'='*60}")
    
    # Add flow period column
    matched_with_period = matched_df.copy()
    matched_with_period['date'] = pd.to_datetime(matched_with_period['date'])
    
    before_end_dt = pd.to_datetime(before_end) if before_end else None
    after_start_dt = pd.to_datetime(after_start) if after_start else None
    
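    def classify_period(date):
        # Compare against None explicitly rather than relying on Timestamp truthiness
        if before_end_dt is not None and date <= before_end_dt:
            return 'before_fix'
        elif after_start_dt is not None and date >= after_start_dt:
            return 'after_fix'
        else:
            return 'gap_period'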
    
    matched_with_period['flow_period'] = matched_with_period['date'].apply(classify_period)
    
    # Count samples in each period
    period_counts = matched_with_period['flow_period'].value_counts()
    print(f"  Sample counts by period: {period_counts.to_dict()}")
    
    # Apply exclusion flags
    matched_with_period = apply_exclusion_flags(matched_with_period, site_name)
    
    site_results = {}
    
    # Calculate axis limits
    all_vals = np.concatenate([
        matched_with_period['aeth_bc'].dropna().values,
        matched_with_period['filter_ec'].dropna().values
    ])
    max_val = all_vals.max() * 1.1 if len(all_vals) > 0 else 100
    
    # Create separate plots for before/after
    for period, period_label, period_color in [
        ('before_fix', 'Before Flow Fix', '#E74C3C'),  # Red
        ('after_fix', 'After Flow Fix', '#2ECC71')     # Green
    ]:
        period_data = matched_with_period[matched_with_period['flow_period'] == period].copy()
        
        if len(period_data) < 3:
            print(f"  {period_label}: Insufficient data (n={len(period_data)})")
            continue
        
        # Get outlier mask for this period
        outlier_mask = period_data['is_excluded'].values if 'is_excluded' in period_data.columns else None
        
        fig, ax = plt.subplots(figsize=(9, 9))
        
        x_data = period_data['aeth_bc'].values
        y_data = period_data['filter_ec'].values
        
        # Valid data mask
        valid_mask = (~np.isnan(x_data)) & (~np.isnan(y_data))
        
        if outlier_mask is not None:
            clean_mask = valid_mask & ~outlier_mask
            outlier_plot_mask = valid_mask & outlier_mask
        else:
            clean_mask = valid_mask
            outlier_plot_mask = np.zeros(len(x_data), dtype=bool)
        
        x_clean = x_data[clean_mask]
        y_clean = y_data[clean_mask]
        
        # Plot clean data
        ax.scatter(x_clean, y_clean, color=period_color, alpha=0.6, s=80,
                   edgecolors='black', linewidth=1, label=f'Data (n={len(x_clean)})')
        
        # Plot outliers as red X
        if outlier_plot_mask.any():
            ax.scatter(x_data[outlier_plot_mask], y_data[outlier_plot_mask],
                       color='red', alpha=0.9, s=200, marker='X', linewidths=3,
                       label=f'Excluded (n={outlier_plot_mask.sum()})')
        
        # Calculate regression
        stats = calculate_regression_stats(x_clean, y_clean)
        
        if stats:
            # Regression line
            x_line = np.array([0, max_val])
            y_line = stats['slope'] * x_line + stats['intercept']
            ax.plot(x_line, y_line, 'g-', linewidth=2, alpha=0.8, label='Best fit')
            
            # Stats text
            sign = '+' if stats['intercept'] >= 0 else '-'
            eq = f"y = {stats['slope']:.3f}x {sign} {abs(stats['intercept']):.2f}"
            stats_text = f"{period_label}\nn = {stats['n']}\nR^2 = {stats['r_squared']:.3f}\n{eq}"
            ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, fontsize=10,
                    verticalalignment='top',
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
            
            site_results[period] = stats
            print(f"  {period_label}: R^2 = {stats['r_squared']:.3f}, Slope = {stats['slope']:.3f}, n = {stats['n']}")
        
        # Set axes
        ax.set_xlim(0, max_val)
        ax.set_ylim(0, max_val)
        ax.set_aspect('equal', adjustable='box')
        ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
        
        ax.set_xlabel('Aethalometer IR BC (ng/m3)', fontsize=11)
        ax.set_ylabel('FTIR EC (ng/m3)', fontsize=11)
        ax.set_title(f'{site_name}: {period_label}\nAethalometer BC vs FTIR EC',
                     fontsize=14, fontweight='bold')
        ax.legend(loc='lower right', fontsize=9)
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    flow_fix_results[site_name] = site_results

In [None]:
# Summary table for flow fix analysis
print("\n" + "="*80)
print("SUMMARY: Before/After Flow Fix Comparison")
print("="*80)

for site_name, results in flow_fix_results.items():
    if len(results) == 0:
        continue
    
    print(f"\n{site_name}:")
    print(f"{'Period':<20s} {'n':>8s} {'R^2':>10s} {'Slope':>10s}")
    print("-" * 50)
    
    for period, stats in results.items():
        period_label = 'Before Fix' if period == 'before_fix' else 'After Fix'
        print(f"{period_label:<20s} {stats['n']:>8d} {stats['r_squared']:>10.3f} {stats['slope']:>10.3f}")
    
    # Show improvement
    if 'before_fix' in results and 'after_fix' in results:
        r2_before = results['before_fix']['r_squared']
        r2_after = results['after_fix']['r_squared']
        slope_before = results['before_fix']['slope']
        slope_after = results['after_fix']['slope']
        
        print(f"\n  R^2 change: {r2_before:.3f} -> {r2_after:.3f} (delta = {r2_after - r2_before:+.3f})")
        print(f"  Slope change: {slope_before:.3f} -> {slope_after:.3f} (delta = {slope_after - slope_before:+.3f})")

---

## 7. Wavelength Dependence Analysis (All Sites)

Compare BC measurements at different wavelengths across all sites. The wavelength dependence gives a rough indication of source type (biomass burning vs fossil fuel) and can expose measurement artifacts.
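
A minimal sketch of the UV/IR ratio summary follows, assuming that samples with missing values or a zero IR reading should be skipped so the ratio never becomes infinite. The helper name `uv_ir_ratio_summary` and the sample values are hypothetical.

```python
import numpy as np

def uv_ir_ratio_summary(uv_bc, ir_bc):
    """Mean, median, and std of UV/IR BC ratios, skipping unusable rows."""
    uv = np.asarray(uv_bc, dtype=float)
    ir = np.asarray(ir_bc, dtype=float)

    # Usable rows: both values present and IR nonzero (avoids division by zero)
    usable = ~np.isnan(uv) & ~np.isnan(ir) & (ir != 0)
    if not usable.any():
        return None

    ratio = uv[usable] / ir[usable]
    return {
        'n': int(usable.sum()),
        'mean_ratio': float(ratio.mean()),
        'median_ratio': float(np.median(ratio)),
        'std_ratio': float(ratio.std()),
    }

# Made-up values in ng/m3; the last two rows are unusable
uv = [1200.0, 900.0, 1500.0, 400.0, np.nan]
ir = [1000.0, 1000.0, 1000.0, 0.0, 800.0]

summary = uv_ir_ratio_summary(uv, ir)
print(summary)
```

Ratios computed from very small IR values are noisy even when nonzero, so the median is usually a steadier summary than the mean.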

In [None]:
# Check which wavelengths are available for each site
wavelengths = ['UV', 'Blue', 'Green', 'Red', 'IR']
wavelength_colors = {
    'UV': '#9B59B6',    # Purple
    'Blue': '#3498DB',  # Blue
    'Green': '#2ECC71', # Green
    'Red': '#E74C3C',   # Red
    'IR': '#34495E'     # Dark gray
}

print("Available BC wavelengths by site:")
print("="*60)

for site_name, df in aethalometer_data.items():
    available = []
    for wl in wavelengths:
        col = f'{wl} BCc'
        if col in df.columns:
            n_valid = df[col].notna().sum()
            if n_valid > 0:
                available.append(f"{wl} (n={n_valid})")
    print(f"\n{site_name}:")
    print(f"  {', '.join(available)}")

In [None]:
# Wavelength cross-plots: Compare UV BCc vs IR BCc (ratio indicates source type)
# Higher UV/IR ratio suggests biomass burning, lower ratio suggests fossil fuel

print("\n" + "="*80)
print("UV vs IR BC Comparison (Source Apportionment)")
print("="*80)
print("Higher UV/IR ratio suggests biomass burning contribution")
print("Lower UV/IR ratio suggests fossil fuel dominance")
print()

wavelength_results = {}

for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    
    # Check if both UV and IR are available
    if 'UV BCc' not in df.columns or 'IR BCc' not in df.columns:
        print(f"{site_name}: Missing UV or IR wavelength data")
        continue
    
    # Get valid data
    valid_mask = df['UV BCc'].notna() & df['IR BCc'].notna()
    valid_df = df[valid_mask].copy()
    
    if len(valid_df) < 3:
        print(f"{site_name}: Insufficient data (n={len(valid_df)})")
        continue
    
    uv_bc = valid_df['UV BCc'].values
    ir_bc = valid_df['IR BCc'].values
    
    # Calculate UV/IR ratio, skipping zero IR values to avoid division by zero
    nonzero_ir = ir_bc != 0
    ratio = uv_bc[nonzero_ir] / ir_bc[nonzero_ir]
    
    fig, ax = plt.subplots(figsize=(9, 9))
    
    # Scatter plot
    ax.scatter(ir_bc, uv_bc, color=config['color'], alpha=0.6, s=80,
               edgecolors='black', linewidth=0.5, label=f'{site_name} (n={len(ir_bc)})')
    
    # Calculate regression
    stats = calculate_regression_stats(ir_bc, uv_bc)
    
    if stats:
        # Regression line
        max_val = max(ir_bc.max(), uv_bc.max()) * 1.1
        x_line = np.array([0, max_val])
        y_line = stats['slope'] * x_line + stats['intercept']
        ax.plot(x_line, y_line, 'g-', linewidth=2, alpha=0.8, label='Best fit')
        
        # 1:1 line
        ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, linewidth=1.5, label='1:1 line')
        
        ax.set_xlim(0, max_val)
        ax.set_ylim(0, max_val)
        ax.set_aspect('equal', adjustable='box')
        
        # Stats text
        sign = '+' if stats['intercept'] >= 0 else '-'
        eq = f"y = {stats['slope']:.3f}x {sign} {abs(stats['intercept']):.2f}"
        stats_text = (
            f"n = {stats['n']}\nR^2 = {stats['r_squared']:.3f}\n{eq}\n\n"
            f"Mean UV/IR: {ratio.mean():.3f}\nMedian UV/IR: {np.median(ratio):.3f}"
        )
        ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, fontsize=10,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
        
        wavelength_results[site_name] = {
            **stats,
            'mean_ratio': ratio.mean(),
            'median_ratio': np.median(ratio),
            'std_ratio': ratio.std()
        }
    
    ax.set_xlabel('IR BCc (ng/m3)', fontsize=11)
    ax.set_ylabel('UV BCc (ng/m3)', fontsize=11)
    ax.set_title(f'{site_name}: UV BCc vs IR BCc\n(Wavelength Dependence)', 
                 fontsize=14, fontweight='bold')
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"{site_name}:")
    print(f"  Mean UV/IR ratio: {ratio.mean():.3f}")
    print(f"  Median UV/IR ratio: {np.median(ratio):.3f}")
    print(f"  Std UV/IR ratio: {ratio.std():.3f}")
    print()

In [None]:
# Summary comparison of UV/IR ratios across sites
print("\n" + "="*80)
print("SUMMARY: Wavelength Dependence (UV/IR Ratio)")
print("="*80)
print("\nInterpretation:")
print("  - UV/IR > 1.0: UV BCc exceeds IR BCc (enhanced UV absorption), suggests biomass burning")
print("  - UV/IR ~ 1.0: UV and IR BCc comparable, mixed sources")
print("  - UV/IR < 1.0: UV BCc below IR BCc, suggests fossil fuel")
print()

print(f"{'Site':<15s} {'n':>8s} {'Mean UV/IR':>12s} {'Median UV/IR':>14s} {'Std':>8s} {'Slope':>8s}")
print("-" * 70)

for site_name, results in wavelength_results.items():
    print(f"{site_name:<15s} {results['n']:>8d} {results['mean_ratio']:>12.3f} "
          f"{results['median_ratio']:>14.3f} {results['std_ratio']:>8.3f} {results['slope']:>8.3f}")

In [None]:
# Multi-wavelength time series for each site
print("\n" + "="*80)
print("Multi-Wavelength BC Time Series by Site")
print("="*80)

for site_name, df in aethalometer_data.items():
    config = SITES[site_name]
    
    fig, ax = plt.subplots(figsize=(14, 6))
    
    has_data = False
    for wl in wavelengths:
        col = f'{wl} BCc'
        if col in df.columns:
            valid_data = df[df[col].notna()].copy()
            if len(valid_data) > 0:
                ax.plot(valid_data['day_9am'], valid_data[col],
                        color=wavelength_colors[wl], alpha=0.7, linewidth=1.5,
                        label=f'{wl}')
                has_data = True
    
    if not has_data:
        print(f"{site_name}: No wavelength data available")
        plt.close(fig)  # Close the empty figure explicitly
        continue
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('BC (ng/m3)', fontsize=12)
    ax.set_title(f'{site_name}: Multi-Wavelength BC Time Series', fontsize=14, fontweight='bold')
    ax.legend(loc='upper right', fontsize=10, ncol=5)
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

---

## 8. Research Questions Summary

### Key Research Questions

1. **How well do aethalometer BC measurements correlate with filter-based methods?**
   - FTIR EC vs Aethalometer IR BCc
   - HIPS Fabs vs Aethalometer IR BCc (absorption vs absorption)
   - Site-specific variations

2. **Does iron contamination affect HIPS or FTIR EC measurements?**
   - Iron gradient plots show if high-iron samples deviate from regression
   - Potential interference mechanism

3. **How does the smooth/raw BC difference affect data quality?**
   - Threshold analysis for FTIR EC comparison
   - Threshold analysis for HIPS comparison (absorption-absorption)
   - Optimal threshold selection

4. **How does the flow fix impact aethalometer-filter agreement?**
   - Before/after analysis for Beijing and JPL
   - R^2 and slope changes

5. **What does wavelength dependence tell us about BC sources?**
   - UV/IR ratio across sites
   - Seasonal/temporal variations
   - Biomass burning vs fossil fuel contributions
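
For the seasonal/temporal part of question 5, one possible starting point is a monthly median of the UV/IR ratio. The sketch below assumes a `day_9am` date column alongside `UV BCc` and `IR BCc`, as used in the time-series cell above. The function name `monthly_uv_ir_ratio` and the demo values are hypothetical.

```python
import pandas as pd


def monthly_uv_ir_ratio(df, date_col='day_9am', uv_col='UV BCc', ir_col='IR BCc'):
    """Median UV/IR ratio per calendar month, ignoring missing or zero IR."""
    valid = df[uv_col].notna() & df[ir_col].notna() & (df[ir_col] != 0)
    sub = df.loc[valid, [date_col, uv_col, ir_col]].copy()
    sub['ratio'] = sub[uv_col] / sub[ir_col]
    sub['month'] = pd.to_datetime(sub[date_col]).dt.to_period('M')
    return sub.groupby('month')['ratio'].median()


# Synthetic example values, not real measurements
demo = pd.DataFrame({
    'day_9am': pd.to_datetime(['2023-01-05', '2023-01-20',
                               '2023-02-03', '2023-02-17']),
    'UV BCc': [1100.0, 1300.0, 800.0, 1000.0],
    'IR BCc': [1000.0, 1000.0, 1000.0, 1000.0],
})

monthly = monthly_uv_ir_ratio(demo)
print(monthly)
```

The median is used rather than the mean so that a few days with very small IR values do not dominate a month. With real data the same function could be applied per site inside the existing `aethalometer_data` loop.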

In [None]:
# Comprehensive Summary Statistics
print("="*100)
print("COMPREHENSIVE SUMMARY: MULTI-SITE AETHALOMETER VS FILTER ANALYSIS")
print("="*100)

# Calculate stats for summary
site_stats = {}
for site_name, matched_df in all_matched_data.items():
    stats = calculate_regression_stats(
        matched_df['aeth_bc'].values,
        matched_df['filter_ec'].values
    )
    if stats:
        site_stats[site_name] = stats

print("\n" + "-"*80)
print("1. AETHALOMETER VS FTIR EC CORRELATION")
print("-"*80)
print(f"{'Site':<15s} {'n':>8s} {'R^2':>10s} {'Slope':>10s} {'Intercept':>12s}")
print("-" * 60)
for site_name, stats in site_stats.items():
    print(f"{site_name:<15s} {stats['n']:>8d} {stats['r_squared']:>10.3f} "
          f"{stats['slope']:>10.3f} {stats['intercept']:>12.2f}")

print("\n" + "-"*80)
print("2. WAVELENGTH DEPENDENCE (UV/IR Ratio)")
print("-"*80)
print(f"{'Site':<15s} {'n':>8s} {'Mean UV/IR':>12s} {'Interpretation':>25s}")
print("-" * 65)
for site_name, results in wavelength_results.items():
    ratio = results['mean_ratio']
    if ratio > 1.1:
        interp = "Biomass burning"
    elif ratio < 0.9:
        interp = "Fossil fuel"
    else:
        interp = "Mixed sources"
    print(f"{site_name:<15s} {results['n']:>8d} {ratio:>12.3f} {interp:>25s}")

print("\n" + "-"*80)
print("3. KEY FINDINGS")
print("-"*80)
print("""
- Addis Ababa shows highest R^2 for FTIR EC comparison (likely cleanest conditions)
- Iron gradient plots may reveal potential interference in dusty conditions
- Smooth/raw threshold filtering can improve correlation for sites with noisy data
- Flow fix periods show clear improvement in agreement at Beijing and JPL
- UV/IR ratios vary significantly across sites, indicating different source mixtures
""")

print("\n" + "-"*80)
print("4. OPEN QUESTIONS / NEXT STEPS")
print("-"*80)
print("""
- Investigate why HIPS slopes differ significantly from 1:1 at some sites
- Explore seasonal patterns in UV/IR ratios
- Consider MAC value optimization for each site
- Investigate outlier mechanisms (contamination, instrument issues)
- Add more sites as data becomes available
""")

---

## Notes

### Section 3: Iron Gradient Plots
- Uses plasma colormap to visualize iron concentration
- Look for systematic deviations at high iron values
- May indicate dust interference

### Section 4: Multi-Site Comparisons
- All sites plotted together for direct comparison
- Overall regression line shows combined trend
- Individual site colors maintained

### Section 5: HIPS vs Aethalometer (Absorption-Absorption)
- More direct comparison than FTIR EC
- Both methods measure light absorption
- Threshold analysis similar to FTIR comparison
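
The threshold idea can be sketched as follows. The exact smooth/raw metric used by the notebook's helper functions is not shown in this section, so the percent-difference definition, the column names `bc_smooth` and `bc_raw`, and the function name `filter_by_smooth_raw` are all assumptions made for illustration.

```python
import pandas as pd


def filter_by_smooth_raw(df, threshold_pct, smooth_col='bc_smooth', raw_col='bc_raw'):
    """Keep rows whose smooth/raw percent difference is within threshold_pct.

    Assumed metric: |smooth - raw| / |raw| * 100. Rows with a zero raw
    value give inf or NaN and are dropped by the comparison.
    """
    pct_diff = (df[smooth_col] - df[raw_col]).abs() / df[raw_col].abs() * 100
    return df[pct_diff <= threshold_pct]


# Synthetic example values, not real measurements
demo = pd.DataFrame({
    'bc_raw': [1000.0, 1000.0, 1000.0],
    'bc_smooth': [1010.0, 1080.0, 1300.0],
})

kept_counts = {}
for threshold in (2.5, 10, 50):
    kept = filter_by_smooth_raw(demo, threshold)
    kept_counts[threshold] = len(kept)
    print(f"threshold {threshold}%: kept {len(kept)} of {len(demo)}")
```

Tighter thresholds keep fewer samples, so any gain in correlation has to be weighed against the loss of sample size at each site.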

### Section 6: Flow Fix Analysis
- Beijing and JPL had known flow issues early on
- Before/after separation shows improvement
- Useful for understanding data quality evolution
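
A minimal sketch of the before/after separation is shown below. It uses the `aeth_bc` and `filter_ec` column names and the `before_fix`/`after_fix` result keys seen in this notebook, but the `date` column, the single `cutoff` argument, and the helper names are assumptions. The real `FLOW_FIX_PERIODS` structure and `calculate_regression_stats` implementation may differ.

```python
import numpy as np
import pandas as pd


def fit_stats(x, y):
    """Least-squares slope, intercept, and R^2 for y against x."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return {'n': len(x), 'slope': slope, 'intercept': intercept,
            'r_squared': r ** 2}


def before_after_stats(df, cutoff, date_col='date',
                       x_col='aeth_bc', y_col='filter_ec'):
    """Split on a hypothetical flow-fix date and fit each period separately."""
    cutoff = pd.Timestamp(cutoff)
    before = df[df[date_col] < cutoff]
    after = df[df[date_col] >= cutoff]
    return {
        'before_fix': fit_stats(before[x_col].values, before[y_col].values),
        'after_fix': fit_stats(after[x_col].values, after[y_col].values),
    }


# Synthetic example values, not real measurements
demo = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
                            '2022-07-01', '2022-08-01', '2022-09-01', '2022-10-01']),
    'aeth_bc': [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'filter_ec': [1.5, 1.0, 3.5, 3.0, 1.0, 2.0, 3.0, 4.0],
})

results = before_after_stats(demo, cutoff='2022-06-01')
r2_before = results['before_fix']['r_squared']
r2_after = results['after_fix']['r_squared']
print(f"R^2 change: {r2_before:.3f} -> {r2_after:.3f} (delta = {r2_after - r2_before:+.3f})")
```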

### Section 7: Wavelength Dependence
- UV/IR ratio indicates source type
- Addis Ababa expected to show biomass burning signature
- Delhi and Beijing may show fossil fuel dominance
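
The labelling rule applied in the summary cell can be written as a small helper. The 1.1 and 0.9 cut-offs are the ones used in that cell. The function name `classify_uv_ir` and the "No data" label for NaN input are additions for illustration, and the labels themselves are a rule of thumb rather than a quantitative source apportionment.

```python
import math


def classify_uv_ir(mean_ratio, high=1.1, low=0.9):
    """Map a mean UV/IR ratio to the coarse source label used in the summary."""
    if mean_ratio is None or math.isnan(mean_ratio):
        return "No data"  # assumption: NaN is reported rather than labelled
    if mean_ratio > high:
        return "Biomass burning"
    if mean_ratio < low:
        return "Fossil fuel"
    return "Mixed sources"


# Example ratios are illustrative, not measured values
for value in (1.25, 1.0, 0.8, float('nan')):
    print(f"{value:>5.2f} -> {classify_uv_ir(value)}")
```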

### How to Update This Notebook
1. Add new outliers to `scripts/outliers.py`
2. Modify thresholds in `scripts/config.py`
3. Restart kernel and re-run all cells