# Addis Ababa: BC/EC Method Comparison by Source Apportionment

This notebook compares three black/elemental carbon measurements (HIPS Fabs converted to mass with the MAC,
FTIR EC, and Aethalometer IR BCc), stratified by the dominant aerosol source from the PMF source apportionment analysis.

## Method Pairs (all analyses applied to each)
1. **FTIR EC vs HIPS Fabs/MAC**
2. **FTIR EC vs Aethalometer IR BCc**
3. **HIPS Fabs/MAC vs Aethalometer IR BCc**

## Analysis for Each Pair
1. **Baseline Regression** — All data scatter with regression statistics
2. **Source-Separated Regressions** — Filter by dominant source type
3. **Threshold-Filtered Analysis** — Exclude "mixed days" using source contribution thresholds

## Additional Visualizations
4. **Source Contribution Visualization** — Daily source fraction bar charts

## Source Categories
- Charcoal burning
- Wood burning
- Fossil fuel
- Polluted marine
- Sea salt

---

## Setup and Imports

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from matplotlib.dates import MonthLocator, DateFormatter
import warnings
warnings.filterwarnings('ignore')

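# Add scripts folder to path. `__file__` is undefined inside a notebook, so use the
# current working directory (assumes the notebook is run from its own folder).
notebook_dir = os.getcwd()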
scripts_path = os.path.join(notebook_dir, 'scripts')
if scripts_path not in sys.path:
    sys.path.insert(0, scripts_path)

from config import SITES, MAC_VALUE
from data_matching import (
    load_aethalometer_data,
    load_filter_data,
    add_base_filter_id,
    match_all_parameters,
    load_etad_factors_with_filter_ids,
)
print("Loaded config and data_matching modules")

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 13

# Create output directories
def setup_directories():
    dirs = {
        'plots': 'output/plots/addis_ababa/source_regression',
        'data': 'output/data/addis_ababa'
    }
    for dir_path in dirs.values():
        os.makedirs(dir_path, exist_ok=True)
    return dirs

dirs = setup_directories()
print("Setup complete!")
print(f"MAC value: {MAC_VALUE} m²/g")

## Configuration

In [None]:
# Site configuration
ADDIS_CONFIG = {
    'name': 'Addis_Ababa',
    'code': 'ETAD',
    'timezone': 'Africa/Addis_Ababa',
}

# Ethiopian seasons
SEASONS = {
    'Dry Season': [10, 11, 12, 1, 2],
    'Belg Rainy Season': [3, 4, 5],
    'Kiremt Rainy Season': [6, 7, 8, 9]
}
SEASONS_ORDER = ['Dry Season', 'Belg Rainy Season', 'Kiremt Rainy Season']
SEASON_COLORS = {'Dry Season': '#E67E22', 'Belg Rainy Season': '#27AE60', 'Kiremt Rainy Season': '#3498DB'}

# Source apportionment categories and their colors
SOURCE_CATEGORIES = {
    'charcoal': {'label': 'Charcoal Burning', 'color': '#2C3E50', 'marker': 'o'},
    'wood': {'label': 'Wood Burning', 'color': '#8B4513', 'marker': 's'},
    'fossil_fuel': {'label': 'Fossil Fuel', 'color': '#7D3C98', 'marker': '^'},
    'polluted_marine': {'label': 'Polluted Marine', 'color': '#2980B9', 'marker': 'D'},
    'sea_salt': {'label': 'Sea Salt', 'color': '#1ABC9C', 'marker': 'v'},
}
SOURCE_ORDER = ['charcoal', 'wood', 'fossil_fuel', 'polluted_marine', 'sea_salt']

# Fraction column names (for stacking order in bar charts)
FRAC_COLS = ['charcoal_frac', 'wood_frac', 'fossil_fuel_frac', 'polluted_marine_frac', 'sea_salt_frac']

# Stack order for bar charts: (column_name, label, color)
STACK_ORDER = [
    ('charcoal_frac', 'Charcoal Burning', '#2C3E50'),
    ('wood_frac', 'Wood Burning', '#8B4513'),
    ('fossil_fuel_frac', 'Fossil Fuel', '#7D3C98'),
    ('polluted_marine_frac', 'Polluted Marine', '#2980B9'),
    ('sea_salt_frac', 'Sea Salt', '#1ABC9C'),
]

# BC/EC measurement methods
METHODS = {
    'hips_fabs': {'label': 'HIPS Fabs/MAC', 'color': '#2ca02c', 'unit': 'µg/m³'},
    'ftir_ec': {'label': 'FTIR EC', 'color': '#d62728', 'unit': 'µg/m³'},
    'ir_bcc': {'label': 'Aeth IR BCc', 'color': '#1f77b4', 'unit': 'µg/m³'},
    'uv_bcc': {'label': 'Aeth UV BCc', 'color': '#ff7f0e', 'unit': 'µg/m³'},
}

# All method pairs to analyze: (x_col, y_col, x_label, y_label, file_prefix)
METHOD_PAIRS = [
    ('ftir_ec', 'hips_fabs', 'FTIR EC (µg/m³)', 'HIPS Fabs/MAC (µg/m³)', 'hips_vs_ec'),
    ('ftir_ec', 'ir_bcc', 'FTIR EC (µg/m³)', 'Aeth IR BCc (µg/m³)', 'aeth_ir_vs_ec'),
    ('hips_fabs', 'ir_bcc', 'HIPS Fabs/MAC (µg/m³)', 'Aeth IR BCc (µg/m³)', 'aeth_ir_vs_hips'),
]

# Thresholds to test for dominant source filtering
DOMINANCE_THRESHOLDS = [0.30, 0.40, 0.50, 0.60]  # 30%, 40%, 50%, 60%

print(f"Site: {ADDIS_CONFIG['name']}")
print(f"Source categories: {', '.join(SOURCE_CATEGORIES.keys())}")
print(f"Method pairs to analyze: {len(METHOD_PAIRS)}")
for x_col, y_col, x_lab, y_lab, prefix in METHOD_PAIRS:
    print(f"  {y_lab} vs {x_lab}")
print(f"Dominance thresholds to test: {DOMINANCE_THRESHOLDS}")

## Data Loading

Load factor contributions (joined to Filter IDs via `oldDate`), then merge with
FTIR EC, HIPS Fabs, and Aethalometer BC measurements via `base_filter_id`.
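
Before trusting a key-based join, it can help to check how many samples actually match on each side. The sketch below illustrates one way to do that with pandas' `indicator=True` option. The frames `bc_toy` and `factors_toy`, along with their IDs and values, are invented for illustration and are not the notebook's real data.

```python
import pandas as pd

# Hypothetical toy frames; IDs and values are invented for illustration only.
bc_toy = pd.DataFrame({
    'base_filter_id': ['F001', 'F002', 'F003', 'F004'],
    'ftir_ec': [2.1, 3.4, 1.8, 2.9],
})
factors_toy = pd.DataFrame({
    'base_filter_id': ['F002', 'F003', 'F005'],
    'charcoal_frac': [0.40, 0.25, 0.10],
})

# An outer merge with indicator=True labels each row by where its key was found.
check = pd.merge(bc_toy, factors_toy, on='base_filter_id',
                 how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)

# An inner merge keeps only keys present in both frames; validate= raises
# an error if a key is unexpectedly duplicated on either side.
matched = pd.merge(bc_toy, factors_toy, on='base_filter_id',
                   how='inner', validate='one_to_one')
print(f"Matched samples: {len(matched)}")
```

A large `left_only` or `right_only` count would suggest that the filter IDs on the two sides are formatted differently or cover different date ranges.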

In [None]:
# =============================================================================
# Load factor contributions with Filter IDs (joined via oldDate)
# =============================================================================
factors_df = load_etad_factors_with_filter_ids()

# Map GF columns to the _frac names used by the rest of the notebook
FACTOR_TO_FRAC = {
    'GF3 (Charcoal)':              'charcoal_frac',
    'GF2 (Wood Burning)':          'wood_frac',
    'GF5 (Fossil Fuel Combustion)':'fossil_fuel_frac',
    'GF4 (Polluted Marine)':       'polluted_marine_frac',
    'GF1 (Sea Salt Mixed)':        'sea_salt_frac',
}
factors_df = factors_df.rename(columns=FACTOR_TO_FRAC)
frac_cols = list(FACTOR_TO_FRAC.values())

print("=" * 80)
print("RAW GF FRACTIONS (before normalization)")
print("=" * 80)
print("\nThese are the raw PM2.5 mass fractions from PMF analysis.")
print("They do NOT sum to 1.0 — they represent the fraction of total PM2.5 from each source.\n")

# Show raw fraction statistics
raw_sums = factors_df[frac_cols].sum(axis=1)
print(f"Raw GF row sums: min={raw_sums.min():.3f}, max={raw_sums.max():.3f}, mean={raw_sums.mean():.3f}")
print("\nRaw GF statistics by source:")
for col in frac_cols:
    vals = factors_df[col].dropna()
    print(f"  {col}: mean={vals.mean():.4f}, min={vals.min():.4f}, max={vals.max():.4f}")

---

## Step 1: Normalization of GF Fractions

### Why Normalize?

The raw GF fractions don't sum to 1.0 — they represent the fraction of **total PM₂.₅ mass**
attributable to each source. The sum varies by day (typically 0.03–0.46) depending on how
much of the aerosol mass is "explained" by the PMF factors.

For visualization and source comparison, we normalize so that:
- Each source's contribution is relative to the **explained OM** (not total PM₂.₅)
- The normalized fractions sum to exactly 1.0 (100%)

### Normalization Formula

For each day:
```
normalized_fraction[source] = raw_GF[source] / sum(all raw_GF values)
```

### Example

| Source | Raw GF | Normalized |
|--------|--------|------------|
| Fossil Fuel | 0.05 | 0.05/0.25 = 0.20 |
| Polluted Marine | 0.03 | 0.03/0.25 = 0.12 |
| Sea Salt | 0.02 | 0.02/0.25 = 0.08 |
| Wood Burning | 0.07 | 0.07/0.25 = 0.28 |
| Charcoal | 0.08 | 0.08/0.25 = 0.32 |
| **Total** | **0.25** | **1.00** |

### Key Point

The normalization changes the interpretation:
- **Raw GF**: Fraction of total PM₂.₅ mass from each source
- **Normalized**: Fraction of **explained OM** from each source (always sums to 100%)
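
As a minimal sketch, the example table can be reproduced with a one-row DataFrame. The frame `raw` below is built only from the illustrative values in the table, not from the real PMF output.

```python
import pandas as pd

# Raw GF values from the example table (one illustrative day).
raw = pd.DataFrame([{
    'fossil_fuel_frac': 0.05,
    'polluted_marine_frac': 0.03,
    'sea_salt_frac': 0.02,
    'wood_frac': 0.07,
    'charcoal_frac': 0.08,
}])

row_total = raw.sum(axis=1)               # sum of all raw GF values for the day
normalized = raw.div(row_total, axis=0)   # divide each row by its own total

print(normalized.round(2).to_dict('records')[0])
print(f"Row sum after normalization: {normalized.sum(axis=1).iloc[0]:.2f}")
```

Using `DataFrame.div(..., axis=0)` divides every column by the matching row total in one step, which is equivalent to looping over the fraction columns.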

In [None]:
# =============================================================================
# NORMALIZATION: Divide each fraction by the row total so they sum to 1.0
# =============================================================================

print("=" * 80)
print("STEP 1: NORMALIZATION")
print("=" * 80)

# Calculate row totals (sum of all GF fractions for each day)
row_totals = factors_df[frac_cols].sum(axis=1)

print(f"\nRow totals before normalization:")
print(f"  Min: {row_totals.min():.4f}")
print(f"  Max: {row_totals.max():.4f}")
print(f"  Mean: {row_totals.mean():.4f}")

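# Normalize: divide each fraction by the row total.
# Rows whose total is 0 become NaN rather than triggering a divide-by-zero.
safe_totals = row_totals.replace(0, np.nan)
factors_df[frac_cols] = factors_df[frac_cols].div(safe_totals, axis=0)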

# Verify normalization worked
normalized_sums = factors_df[frac_cols].sum(axis=1)
print(f"\nRow totals after normalization:")
print(f"  Min: {normalized_sums.min():.4f}")
print(f"  Max: {normalized_sums.max():.4f}")
print(f"  Mean: {normalized_sums.mean():.4f}")
print(f"  (Should all be ~1.0)")

print("\nNormalized fraction statistics by source:")
for col in frac_cols:
    vals = factors_df[col].dropna()
    print(f"  {col}: mean={vals.mean():.3f}, min={vals.min():.3f}, max={vals.max():.3f}")

---

## Load and Merge BC/EC Measurements

In [None]:
# =============================================================================
# Load aethalometer + filter measurements and match by date
# =============================================================================
aethalometer_data = load_aethalometer_data()
filter_data = load_filter_data()
filter_data = add_base_filter_id(filter_data)

df_aeth = aethalometer_data.get('Addis_Ababa')
bc_df = match_all_parameters('Addis_Ababa', 'ETAD', df_aeth, filter_data)

# =============================================================================
# Merge BC/EC measurements with factor contributions via base_filter_id
# =============================================================================
# Get the base_filter_id for each bc_df date by looking up in the unified dataset
etad_filters = filter_data[filter_data['Site'] == 'ETAD'][['SampleDate', 'FilterId']].drop_duplicates()
etad_filters = etad_filters.rename(columns={'SampleDate': 'date', 'FilterId': 'base_filter_id'})
bc_df['date'] = pd.to_datetime(bc_df['date'])
etad_filters['date'] = pd.to_datetime(etad_filters['date'])

bc_with_id = pd.merge(bc_df, etad_filters, on='date', how='left')

# Now merge with factor contributions on base_filter_id
factor_merge_cols = ['base_filter_id'] + frac_cols
df = pd.merge(bc_with_id, factors_df[factor_merge_cols].drop_duplicates(),
              on='base_filter_id', how='inner')

# =============================================================================
# Add temporal features
# =============================================================================
df['Month'] = df['date'].dt.month
df['Ethiopian_Season'] = df['Month'].map(lambda m:
    'Dry Season' if m in SEASONS['Dry Season'] else
    'Belg Rainy Season' if m in SEASONS['Belg Rainy Season'] else
    'Kiremt Rainy Season'
)

# Determine dominant source for each sample (using normalized fractions)
df['dominant_source'] = df[frac_cols].idxmax(axis=1).str.replace('_frac', '', regex=False)
df['dominant_fraction'] = df[frac_cols].max(axis=1)

print(f"\nFinal dataset: {len(df)} samples")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"\nBC/EC availability:")
for col in ['ftir_ec', 'hips_fabs', 'ir_bcc']:
    if col in df.columns:
        n = df[col].notna().sum()
        print(f"  {col}: {n} samples")

print(f"\nDominant source distribution:")
print(df['dominant_source'].value_counts().to_string())
print(f"\nDominant fraction stats: mean={df['dominant_fraction'].mean():.1%}, "
      f"min={df['dominant_fraction'].min():.1%}, max={df['dominant_fraction'].max():.1%}")
print(f"Samples with ≥50% dominant: {(df['dominant_fraction'] >= 0.50).sum()}")
print(f"Samples with ≥30% dominant: {(df['dominant_fraction'] >= 0.30).sum()}")

---

# Task 1: Baseline Regression Plots (All Method Pairs)

**Goal**: Create baseline scatter plots with regression statistics for all three method comparisons.
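
The regression helper in this section supports both an ordinary least-squares fit and a fit forced through the origin. The sketch below compares the two on synthetic data; the slope, intercept, and noise level are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumed for illustration): y is roughly 1.2*x + 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y = 1.2 * x + 0.5 + rng.normal(0, 0.3, size=50)

# Ordinary least squares with a free intercept.
ols = stats.linregress(x, y)

# Least squares forced through the origin: slope = sum(x*y) / sum(x*x).
slope0 = np.sum(x * y) / np.sum(x * x)
resid = y - slope0 * x
# Standard error of the zero-intercept slope (n - 1 degrees of freedom).
se0 = np.sqrt(np.sum(resid ** 2) / (len(x) - 1) / np.sum(x * x))

print(f"OLS:            y = {ols.slope:.3f}x + {ols.intercept:.3f}, R^2 = {ols.rvalue**2:.3f}")
print(f"Through origin: y = {slope0:.3f}x (slope SE = {se0:.4f})")
```

When the data have a positive offset, forcing the line through the origin absorbs that offset into the slope, so the two slopes should not be compared directly.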

In [None]:
def plot_regression(df, x_col, y_col, x_label, y_label, title, color_by=None,
                    color_dict=None, ax=None, show_stats=True, force_through_origin=False):
    """
    Create a regression scatter plot with statistics.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 7))
    else:
        fig = ax.figure
    
    valid = df[[x_col, y_col]].dropna()
    if color_by and color_by in df.columns:
        valid = pd.merge(valid, df[[color_by]], left_index=True, right_index=True)
    
    if len(valid) < 3:
        ax.text(0.5, 0.5, f'Insufficient data\n(n={len(valid)})', 
                transform=ax.transAxes, ha='center', va='center', fontsize=14)
        ax.set_title(title)
        return fig, None
    
    x = valid[x_col].values
    y = valid[y_col].values
    
    # Plot points
    if color_by and color_by in valid.columns and color_dict:
        for category in valid[color_by].unique():
            mask = valid[color_by] == category
            cat_info = color_dict.get(category, {'color': 'gray', 'label': category, 'marker': 'o'})
            ax.scatter(valid.loc[mask, x_col], valid.loc[mask, y_col],
                      s=60, alpha=0.7, color=cat_info.get('color', 'gray'),
                      marker=cat_info.get('marker', 'o'),
                      edgecolors='black', linewidth=0.3,
                      label=f"{cat_info.get('label', category)} (n={mask.sum()})")
    else:
        ax.scatter(x, y, s=60, alpha=0.6, color='#3498DB', edgecolors='black', linewidth=0.3)
    
    # Regression
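    if force_through_origin:
        # Zero-intercept least squares: slope = sum(x*y) / sum(x*x)
        slope = np.sum(x * y) / np.sum(x * x)
        intercept = 0
        y_pred = slope * x
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        # R² against the centered total can be negative for a poor zero-intercept
        # fit, so clip at 0 before the square root to avoid NaN
        r_squared = 1 - (ss_res / ss_tot)
        r = np.sqrt(max(r_squared, 0.0)) * np.sign(slope)
        p = stats.pearsonr(x, y)[1]
        # Standard error of the slope (n - 1 degrees of freedom), matching the
        # meaning of `se` returned by stats.linregress in the other branch
        se = np.sqrt(ss_res / (len(x) - 1) / np.sum(x * x))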
    else:
        slope, intercept, r, p, se = stats.linregress(x, y)
        r_squared = r ** 2
    
    # Plot regression line and 1:1
    ax_max = max(x.max(), y.max()) * 1.1
    x_fit = np.linspace(0, ax_max, 100)
    ax.plot(x_fit, slope * x_fit + intercept, 'k-', linewidth=2, alpha=0.7, label='Regression')
    ax.plot([0, ax_max], [0, ax_max], 'k--', linewidth=1.5, alpha=0.4, label='1:1 line')
    
    # Statistics annotation
    if show_stats:
        sig = '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else 'ns'
        if force_through_origin:
            stats_text = f'y = {slope:.3f}x\nR² = {r_squared:.3f} ({sig})\nn = {len(valid)}'
        else:
            stats_text = f'y = {slope:.3f}x + {intercept:.3f}\nR² = {r_squared:.3f} ({sig})\nn = {len(valid)}'
        ax.text(0.03, 0.97, stats_text, transform=ax.transAxes, fontsize=10, va='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
    
    # Formatting
    ax.set_xlim(0, ax_max)
    ax.set_ylim(0, ax_max)
    ax.set_xlabel(x_label, fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    
    if color_by:
        ax.legend(fontsize=8, loc='lower right')
    
    results = {
        'slope': slope, 'intercept': intercept, 'r': r, 'r_squared': r_squared,
        'p_value': p, 'se': se, 'n': len(valid)
    }
    
    return fig, results


print("="*80)
print("TASK 1: BASELINE REGRESSIONS — ALL METHOD PAIRS")
print("="*80)

baseline_results = {}

for x_col, y_col, x_label, y_label, prefix in METHOD_PAIRS:
    pair_label = f'{y_label} vs {x_label}'
    print(f"\n--- {pair_label} ---")
    
    fig, results = plot_regression(
        df, x_col, y_col, x_label, y_label,
        f'{y_label.split(" (")[0]} vs {x_label.split(" (")[0]} — All Data'
    )
    plt.tight_layout()
    plt.savefig(os.path.join(dirs['plots'], f'{prefix}_baseline.png'), dpi=150, bbox_inches='tight')
    plt.show()
    
    if results:
        baseline_results[prefix] = results
        print(f"  Slope: {results['slope']:.4f}")
        print(f"  Intercept: {results['intercept']:.4f}")
        print(f"  R²: {results['r_squared']:.4f}")
        print(f"  p-value: {results['p_value']:.2e}")
        print(f"  n: {results['n']}")

---

# Task 2: Source-Separated Regression Plots (All Method Pairs)

**Goal**: Create regression plots filtered by dominant source type for all three method comparisons.
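
The "dominant source" of a sample is the source with the largest normalized fraction. A minimal sketch of that assignment is shown below; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical normalized fractions for four days (each row sums to 1.0).
toy = pd.DataFrame({
    'charcoal_frac':        [0.55, 0.20, 0.30, 0.10],
    'wood_frac':            [0.20, 0.50, 0.25, 0.15],
    'fossil_fuel_frac':     [0.15, 0.20, 0.35, 0.65],
    'polluted_marine_frac': [0.05, 0.05, 0.05, 0.05],
    'sea_salt_frac':        [0.05, 0.05, 0.05, 0.05],
})
frac_cols = list(toy.columns)

# The dominant source is the column holding the largest fraction in each row.
toy['dominant_source'] = (
    toy[frac_cols].idxmax(axis=1).str.replace('_frac', '', regex=False)
)
toy['dominant_fraction'] = toy[frac_cols].max(axis=1)

print(toy[['dominant_source', 'dominant_fraction']])
print(toy['dominant_source'].value_counts().to_dict())
```

Note that a source can be "dominant" with well under half of the mass (0.35 on the third day here), which is the motivation for the threshold filtering in Task 3.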

In [None]:
def plot_source_separated_regressions(df, x_col, y_col, x_label, y_label, 
                                       sources=SOURCE_ORDER, source_info=SOURCE_CATEGORIES):
    """
    Create panel of regression plots, one per dominant source.
    """
    n_sources = len(sources)
    n_cols = 3
    n_rows = int(np.ceil(n_sources / n_cols))
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6*n_cols, 5.5*n_rows))
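    # plt.subplots returns an array here because n_cols is always 3, even for a
    # single source, so flatten unconditionally
    axes = np.atleast_1d(axes).flatten()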
    
    results_all = {}
    
    for idx, source in enumerate(sources):
        ax = axes[idx]
        
        # Filter to dominant source
        source_mask = df['dominant_source'] == source
        source_data = df[source_mask].copy()
        
        if len(source_data) < 3:
            ax.text(0.5, 0.5, f'{source_info[source]["label"]}\n(n={len(source_data)})\nInsufficient data',
                   transform=ax.transAxes, ha='center', va='center', fontsize=11)
            ax.set_title(source_info[source]['label'], fontsize=11, fontweight='bold',
                        color=source_info[source]['color'])
            ax.grid(True, alpha=0.3)
            continue
        
        valid = source_data[[x_col, y_col]].dropna()
        
        if len(valid) < 3:
            ax.text(0.5, 0.5, f'{source_info[source]["label"]}\n(n={len(valid)})\nInsufficient data',
                   transform=ax.transAxes, ha='center', va='center', fontsize=11)
            ax.set_title(source_info[source]['label'], fontsize=11, fontweight='bold',
                        color=source_info[source]['color'])
            ax.grid(True, alpha=0.3)
            continue
        
        x = valid[x_col].values
        y = valid[y_col].values
        
        # Scatter
        ax.scatter(x, y, s=50, alpha=0.6, color=source_info[source]['color'],
                  marker=source_info[source]['marker'], edgecolors='black', linewidth=0.3)
        
        # Regression
        slope, intercept, r, p, se = stats.linregress(x, y)
        
        ax_max = max(x.max(), y.max()) * 1.1 if len(x) > 0 else 10
        x_fit = np.linspace(0, ax_max, 100)
        ax.plot(x_fit, slope * x_fit + intercept, 'k-', linewidth=1.5, alpha=0.7)
        ax.plot([0, ax_max], [0, ax_max], 'k--', linewidth=1, alpha=0.3)
        
        # Stats
        sig = '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else 'ns'
        ax.text(0.03, 0.97, f'y = {slope:.3f}x + {intercept:.2f}\nR² = {r**2:.3f} ({sig})\nn = {len(valid)}',
                transform=ax.transAxes, fontsize=9, va='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
        
        ax.set_xlim(0, ax_max)
        ax.set_ylim(0, ax_max)
        ax.set_xlabel(x_label, fontsize=10)
        ax.set_ylabel(y_label, fontsize=10)
        ax.set_title(source_info[source]['label'], fontsize=11, fontweight='bold',
                    color=source_info[source]['color'])
        ax.set_aspect('equal')
        ax.grid(True, alpha=0.3)
        
        results_all[source] = {
            'slope': slope, 'intercept': intercept, 'r': r, 'r_squared': r**2,
            'p_value': p, 'n': len(valid)
        }
    
    # Hide unused axes
    for idx in range(n_sources, len(axes)):
        axes[idx].set_visible(False)
    
    plt.suptitle(f'{y_label} vs {x_label} — By Dominant Source',
                fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    
    return fig, results_all


print("="*80)
print("TASK 2: SOURCE-SEPARATED REGRESSIONS — ALL METHOD PAIRS")
print("="*80)

all_source_results = {}

for x_col, y_col, x_label, y_label, prefix in METHOD_PAIRS:
    pair_label = f'{y_label.split(" (")[0]} vs {x_label.split(" (")[0]}'
    print(f"\n{'='*60}")
    print(f"  {pair_label}")
    print(f"{'='*60}")
    
    fig, source_results = plot_source_separated_regressions(
        df, x_col, y_col, x_label, y_label
    )
    plt.savefig(os.path.join(dirs['plots'], f'{prefix}_by_source.png'), dpi=150, bbox_inches='tight')
    plt.show()
    
    all_source_results[prefix] = source_results
    
    # Summary table
    print(f"\n{'Source':<20s} {'n':>5s} {'Slope':>8s} {'Intercept':>10s} {'R²':>8s} {'p-value':>12s}")
    print("-" * 70)
    for source in SOURCE_ORDER:
        if source in source_results:
            r = source_results[source]
            sig = '*' if r['p_value'] < 0.05 else ''
            print(f"{SOURCE_CATEGORIES[source]['label']:<20s} {r['n']:>5d} {r['slope']:>8.3f} "
                  f"{r['intercept']:>10.3f} {r['r_squared']:>8.3f} {r['p_value']:>11.2e}{sig}")

---

# Task 3: Threshold-Filtered Regressions (All Method Pairs)

**Goal**: Filter out "mixed days" for all three method comparisons by keeping only days where the dominant source's fraction meets or exceeds the threshold.
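
The filter itself is a single boolean mask on the dominant fraction. The sketch below counts how many days survive each threshold; the frame `toy` is hypothetical, while the threshold values mirror `DOMINANCE_THRESHOLDS` from the configuration.

```python
import pandas as pd

DOMINANCE_THRESHOLDS = [0.30, 0.40, 0.50, 0.60]  # same values as the configuration

# Hypothetical per-day dominant source and fraction.
toy = pd.DataFrame({
    'dominant_source': ['charcoal', 'wood', 'fossil_fuel', 'fossil_fuel', 'charcoal'],
    'dominant_fraction': [0.55, 0.50, 0.35, 0.65, 0.28],
})

kept = {}
for thresh in DOMINANCE_THRESHOLDS:
    # A day is kept when its dominant fraction meets or exceeds the threshold.
    mask = toy['dominant_fraction'] >= thresh
    kept[thresh] = int(mask.sum())
    print(f">={thresh:.0%}: {kept[thresh]} of {len(toy)} days kept")
```

Higher thresholds give "purer" source days but fewer samples, so each per-source regression needs to be read alongside its sample count.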

In [None]:
def analyze_threshold_filtering(df, thresholds=DOMINANCE_THRESHOLDS):
    """
    Analyze how threshold filtering affects sample counts.
    """
    print("\nThreshold Filtering Analysis:")
    print("=" * 70)
    print(f"{'Threshold':<12s}", end='')
    for source in SOURCE_ORDER:
        print(f" {SOURCE_CATEGORIES[source]['label'][:10]:>10s}", end='')
    print(f" {'Total':>10s}")
    print("-" * 70)
    
    for thresh in thresholds:
        filtered = df[df['dominant_fraction'] >= thresh]
        print(f"{thresh*100:.0f}%{'':<9s}", end='')
        for source in SOURCE_ORDER:
            n = (filtered['dominant_source'] == source).sum()
            print(f" {n:>10d}", end='')
        print(f" {len(filtered):>10d}")


def plot_threshold_comparison(df, x_col, y_col, x_label, y_label, 
                               thresholds=DOMINANCE_THRESHOLDS):
    """
    Create comparison plots at different thresholds.
    """
    n_thresh = len(thresholds)
    fig, axes = plt.subplots(1, n_thresh + 1, figsize=(5*(n_thresh+1), 5))
    
    results = {}
    
    # Plot 0: All data
    ax = axes[0]
    valid = df[[x_col, y_col]].dropna()
    if len(valid) >= 3:
        x, y = valid[x_col].values, valid[y_col].values
        ax.scatter(x, y, s=40, alpha=0.5, color='gray', edgecolors='black', linewidth=0.2)
        slope, intercept, r, p, se = stats.linregress(x, y)
        ax_max = max(x.max(), y.max()) * 1.1
        x_fit = np.linspace(0, ax_max, 100)
        ax.plot(x_fit, slope * x_fit + intercept, 'k-', linewidth=1.5)
        ax.plot([0, ax_max], [0, ax_max], 'k--', linewidth=1, alpha=0.3)
        ax.text(0.03, 0.97, f'R² = {r**2:.3f}\nn = {len(valid)}',
                transform=ax.transAxes, fontsize=9, va='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
        ax.set_xlim(0, ax_max)
        ax.set_ylim(0, ax_max)
        results['all'] = {'r_squared': r**2, 'n': len(valid), 'slope': slope}
    ax.set_xlabel(x_label, fontsize=10)
    ax.set_ylabel(y_label, fontsize=10)
    ax.set_title('All Data', fontsize=11, fontweight='bold')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    
    # Threshold-filtered plots
    for idx, thresh in enumerate(thresholds):
        ax = axes[idx + 1]
        filtered = df[df['dominant_fraction'] >= thresh]
        valid = filtered[[x_col, y_col, 'dominant_source']].dropna()
        
        if len(valid) >= 3:
            # Color by source
            for source in SOURCE_ORDER:
                mask = valid['dominant_source'] == source
                if mask.sum() > 0:
                    ax.scatter(valid.loc[mask, x_col], valid.loc[mask, y_col],
                              s=40, alpha=0.6, color=SOURCE_CATEGORIES[source]['color'],
                              marker=SOURCE_CATEGORIES[source]['marker'],
                              edgecolors='black', linewidth=0.2,
                              label=SOURCE_CATEGORIES[source]['label'][:8])
            
            x, y = valid[x_col].values, valid[y_col].values
            slope, intercept, r, p, se = stats.linregress(x, y)
            ax_max = max(x.max(), y.max()) * 1.1
            x_fit = np.linspace(0, ax_max, 100)
            ax.plot(x_fit, slope * x_fit + intercept, 'k-', linewidth=1.5)
            ax.plot([0, ax_max], [0, ax_max], 'k--', linewidth=1, alpha=0.3)
            ax.text(0.03, 0.97, f'R² = {r**2:.3f}\nn = {len(valid)}',
                    transform=ax.transAxes, fontsize=9, va='top',
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
            ax.set_xlim(0, ax_max)
            ax.set_ylim(0, ax_max)
            results[f'{thresh*100:.0f}%'] = {'r_squared': r**2, 'n': len(valid), 'slope': slope}
        
        ax.set_xlabel(x_label, fontsize=10)
        ax.set_title(f'≥{thresh*100:.0f}% Dominant', fontsize=11, fontweight='bold')
        ax.set_aspect('equal')
        ax.grid(True, alpha=0.3)
        if idx == len(thresholds) - 1:
            ax.legend(fontsize=6, loc='lower right')
    
    plt.suptitle(f'{y_label} vs {x_label} — Threshold Comparison',
                fontsize=13, fontweight='bold', y=1.02)
    plt.tight_layout()
    
    return fig, results


def plot_source_regressions_threshold_filtered(df, x_col, y_col, x_label, y_label, threshold=0.50):
    """
    Plot source-separated regressions with threshold filter applied.
    """
    filtered = df[df['dominant_fraction'] >= threshold].copy()
    
    print(f"\nFiltered to {threshold*100:.0f}%+ dominant: {len(filtered)} samples")
    print(f"Source breakdown: {filtered['dominant_source'].value_counts().to_dict()}")
    
    fig, results = plot_source_separated_regressions(
        filtered, x_col, y_col, x_label, y_label
    )
    
    plt.suptitle(f'{y_label} vs {x_label} — By Source (≥{threshold*100:.0f}% Dominant)',
                fontsize=14, fontweight='bold', y=1.02)
    
    return fig, results


print("="*80)
print("TASK 3: THRESHOLD-FILTERED REGRESSIONS — ALL METHOD PAIRS")
print("="*80)

# Show threshold sample counts (same for all pairs)
analyze_threshold_filtering(df)

all_thresh_results = {}

for x_col, y_col, x_label, y_label, prefix in METHOD_PAIRS:
    pair_label = f'{y_label.split(" (")[0]} vs {x_label.split(" (")[0]}'
    print(f"\n{'='*60}")
    print(f"  {pair_label} — Threshold Comparison")
    print(f"{'='*60}")
    
    # Threshold comparison strip plot
    fig, thresh_results = plot_threshold_comparison(
        df, x_col, y_col, x_label, y_label
    )
    plt.savefig(os.path.join(dirs['plots'], f'{prefix}_threshold_comparison.png'), dpi=150, bbox_inches='tight')
    plt.show()
    
    all_thresh_results[prefix] = thresh_results
    
    print(f"\nR² by Threshold:")
    for key, val in thresh_results.items():
        print(f"  {key}: R² = {val['r_squared']:.3f}, n = {val['n']}")
    
    # Source-separated at 50% threshold
    print(f"\n--- {pair_label} — By Source (≥50% Dominant) ---")
    fig, thresh_source_results = plot_source_regressions_threshold_filtered(
        df, x_col, y_col, x_label, y_label, threshold=0.50
    )
    plt.savefig(os.path.join(dirs['plots'], f'{prefix}_by_source_50pct.png'), dpi=150, bbox_inches='tight')
    plt.show()

---

# Task 4: Source Contribution Bar Charts

**Goal**: Visualize daily source contributions as stacked bar charts.

## Stacking Method

The bars are stacked using a running `bottom` accumulator:

```python
bottom = np.zeros(len(valid_norm))  # Start at 0

for col, label, color in STACK_ORDER:
    values = valid_norm[col].values
    ax.bar(x, values, bar_width, bottom=bottom, ...)
    bottom += values  # Next source starts where this one ended
```

This produces bars that all reach exactly 1.0 (100%), with each color segment 
showing that source's relative contribution to the explained OM.
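
The accumulator logic can also be checked numerically without drawing anything. In the sketch below, the `fractions` array is hypothetical: three days by five sources, with each row already normalized.

```python
import numpy as np

# Hypothetical normalized fractions: 3 days x 5 sources, each row sums to 1.0.
fractions = np.array([
    [0.32, 0.28, 0.20, 0.12, 0.08],
    [0.10, 0.15, 0.65, 0.05, 0.05],
    [0.55, 0.20, 0.15, 0.05, 0.05],
])

bottom = np.zeros(fractions.shape[0])
segment_bases = []
for j in range(fractions.shape[1]):
    segment_bases.append(bottom.copy())  # where this source's segment starts
    bottom += fractions[:, j]            # next source starts where this one ended

print("Segment bases for day 0:", [round(float(b[0]), 2) for b in segment_bases])
print("Bar tops:", bottom)
```

If any bar top differs noticeably from 1.0, the fractions passed to the plot were not normalized.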

In [None]:
def plot_source_contributions_stacked_bars(df, date_col='date', max_samples=60):
    """
    Create stacked bar chart of daily source contributions (normalized fractions).
    
    The fractions have already been normalized to sum to 1.0 (100%).
    Uses a `bottom` accumulator to stack bars so they reach exactly 100%.
    
    Parameters:
    -----------
    df : DataFrame with normalized fraction columns
    date_col : name of the date column
    max_samples : maximum number of samples to show (for readability)
    """
    # Filter to samples with source data
    valid_norm = df.dropna(subset=FRAC_COLS).copy()
    valid_norm = valid_norm.sort_values(date_col)
    
    if len(valid_norm) == 0:
        print("No valid source apportionment data")
        return None
    
    # Limit to manageable number for visualization
    if len(valid_norm) > max_samples:
        valid_norm = valid_norm.tail(max_samples)
        print(f"Showing last {max_samples} samples")
    
    # Verify normalization
    row_sums = valid_norm[FRAC_COLS].sum(axis=1)
    print(f"\nVerifying normalization for bar chart:")
    print(f"  Row sums: min={row_sums.min():.4f}, max={row_sums.max():.4f}, mean={row_sums.mean():.4f}")
    print(f"  (Should all be ~1.0)")
    
    # Create figure
    fig, ax = plt.subplots(figsize=(16, 6))
    
    # X-axis: sample indices
    x = np.arange(len(valid_norm))
    bar_width = 0.8
    
    # ==========================================================================
    # STACKING: Use a running `bottom` accumulator
    # ==========================================================================
    bottom = np.zeros(len(valid_norm))  # Start at 0
    
    for col, label, color in STACK_ORDER:
        # Get the normalized values for this source
        values = valid_norm[col].values
        
        # Create bar starting at current `bottom`
        ax.bar(x, values, bar_width, bottom=bottom, 
               color=color, label=label,
               edgecolor='white', linewidth=0.5)
        
        # Update bottom for next source: this source's bars now form the base
        bottom += values
    
    # ==========================================================================
    # Formatting
    # ==========================================================================
    
    # X-axis labels (show every 3rd date for readability)
    date_labels = valid_norm[date_col].dt.strftime('%m/%d')
    ax.set_xticks(x[::3])
    ax.set_xticklabels(date_labels.iloc[::3], rotation=45, ha='right', fontsize=8)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Source Fraction (Normalized)', fontsize=12)
    ax.set_title('Daily Source Contributions — Normalized to 100%', fontsize=14, fontweight='bold')
    
    # Legend outside plot
    ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=9)
    
    # Y-axis should go from 0 to 1.0 (or slightly above for visibility)
    ax.set_ylim(0, 1.05)
    ax.set_yticks([0, 0.25, 0.5, 0.75, 1.0])
    ax.set_yticklabels(['0%', '25%', '50%', '75%', '100%'])
    
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    return fig


def plot_source_contributions_summary(df):
    """
    Summary visualization: overall average and seasonal breakdown.
    """
    valid = df.dropna(subset=FRAC_COLS + ['Ethiopian_Season'])
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Panel 1: Overall average (pie chart)
    ax = axes[0]
    avg_fracs = [valid[col].mean() for col in FRAC_COLS]
    colors = [SOURCE_CATEGORIES[c.replace('_frac', '')]['color'] for c in FRAC_COLS]
    labels = [SOURCE_CATEGORIES[c.replace('_frac', '')]['label'] for c in FRAC_COLS]
    
    wedges, texts, autotexts = ax.pie(avg_fracs, labels=None, autopct='%1.1f%%',
                                       colors=colors, startangle=90,
                                       wedgeprops={'edgecolor': 'white', 'linewidth': 1})
    ax.legend(wedges, labels, loc='center left', bbox_to_anchor=(0.85, 0.5), fontsize=9)
    ax.set_title(f'Overall Source Contributions\n(n={len(valid)} samples)', fontsize=12, fontweight='bold')
    
    # Panel 2: By season (grouped bars)
    ax = axes[1]
    
    x = np.arange(len(SEASONS_ORDER))
    n_sources = len(FRAC_COLS)
    width = 0.15
    
    for i, col in enumerate(FRAC_COLS):
        source_key = col.replace('_frac', '')
        means = []
        stds = []
        for season in SEASONS_ORDER:
            season_data = valid[valid['Ethiopian_Season'] == season][col]
            means.append(season_data.mean() if len(season_data) > 0 else 0)
            stds.append(season_data.std() if len(season_data) > 1 else 0)
        
        offset = (i - n_sources/2 + 0.5) * width
        ax.bar(x + offset, means, width, yerr=stds, capsize=2,
               color=SOURCE_CATEGORIES[source_key]['color'],
               label=SOURCE_CATEGORIES[source_key]['label'],
               edgecolor='black', linewidth=0.5)
    
    ax.set_xlabel('Ethiopian Season', fontsize=11)
    ax.set_ylabel('Mean Fraction', fontsize=11)
    ax.set_title('Source Contributions by Season', fontsize=12, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels([s.replace(' Season', '') for s in SEASONS_ORDER], rotation=15)
    ax.legend(fontsize=8, loc='upper right')
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    return fig


def plot_dominant_source_timeseries(df):
    """
    Time series showing which source is dominant each day and its fraction.
    """
    valid = df.dropna(subset=['dominant_source', 'dominant_fraction']).copy()
    valid = valid.sort_values('date')
    
    fig, axes = plt.subplots(2, 1, figsize=(16, 8), sharex=True,
                              gridspec_kw={'height_ratios': [2, 1]})
    
    # Panel 1: Dominant fraction colored by source
    ax = axes[0]
    for source in SOURCE_ORDER:
        mask = valid['dominant_source'] == source
        if mask.sum() > 0:
            ax.scatter(valid.loc[mask, 'date'], valid.loc[mask, 'dominant_fraction'],
                      s=50, alpha=0.7, color=SOURCE_CATEGORIES[source]['color'],
                      marker=SOURCE_CATEGORIES[source]['marker'],
                      label=SOURCE_CATEGORIES[source]['label'],
                      edgecolors='black', linewidth=0.3)
    
    ax.axhline(0.5, color='red', linestyle='--', linewidth=1, alpha=0.5, label='50% threshold')
    ax.set_ylabel('Dominant Source Fraction', fontsize=11)
    ax.set_title('Dominant Source Time Series', fontsize=13, fontweight='bold')
    ax.legend(fontsize=8, loc='upper right', ncol=2)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0, 1.05)  # small headroom so markers at a fraction of 1.0 are not clipped
    
    # Panel 2: Source category (categorical)
    ax = axes[1]
    source_map = {s: i for i, s in enumerate(SOURCE_ORDER)}
    valid['source_idx'] = valid['dominant_source'].map(source_map)
    
    for source in SOURCE_ORDER:
        mask = valid['dominant_source'] == source
        if mask.sum() > 0:
            ax.scatter(valid.loc[mask, 'date'], valid.loc[mask, 'source_idx'],
                      s=40, alpha=0.8, color=SOURCE_CATEGORIES[source]['color'],
                      marker='|')
    
    ax.set_yticks(range(len(SOURCE_ORDER)))
    ax.set_yticklabels([SOURCE_CATEGORIES[s]['label'] for s in SOURCE_ORDER], fontsize=9)
    ax.set_xlabel('Date', fontsize=11)
    ax.set_ylabel('Dominant Source', fontsize=11)
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    return fig


print("="*80)
print("TASK 5: SOURCE CONTRIBUTION VISUALIZATION")
print("="*80)

# Daily stacked bar chart (main visualization)
print("\n--- Stacked Bar Chart (Normalized to 100%) ---")
fig = plot_source_contributions_stacked_bars(df)
if fig:
    plt.savefig(os.path.join(dirs['plots'], 'source_contributions_stacked_bars.png'), dpi=150, bbox_inches='tight')
    plt.show()

# Summary plots
print("\n--- Summary Visualizations ---")
fig = plot_source_contributions_summary(df)
plt.savefig(os.path.join(dirs['plots'], 'source_contributions_summary.png'), dpi=150, bbox_inches='tight')
plt.show()

# Dominant source time series
print("\n--- Dominant Source Time Series ---")
fig = plot_dominant_source_timeseries(df)
plt.savefig(os.path.join(dirs['plots'], 'dominant_source_timeseries.png'), dpi=150, bbox_inches='tight')
plt.show()

# Print verification stats
print("\n" + "="*80)
print("SOURCE CONTRIBUTION VERIFICATION")
print("="*80)
print("\nNormalized fraction statistics (should sum to ~1.0 for each row):")
for col in FRAC_COLS:
    if col in df.columns:
        vals = df[col].dropna()
        print(f"  {col}: mean={vals.mean():.3f}, median={vals.median():.3f}, "
              f"min={vals.min():.3f}, max={vals.max():.3f}")

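# min_count=1 makes rows with no factor data sum to NaN (not 0), so dropna() removes them
row_sums = df[FRAC_COLS].sum(axis=1, min_count=1).dropna()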
print(f"\nRow sums: mean={row_sums.mean():.4f}, min={row_sums.min():.4f}, max={row_sums.max():.4f}")

---

# Summary Table: All Regression Results

In [None]:
print("="*80)
print("SUMMARY: ALL REGRESSION RESULTS")
print("="*80)

print("\nBaseline Method Comparison Summary:")
print("-" * 70)

for x_col, y_col, x_label, y_label, prefix in METHOD_PAIRS:
    valid = df[[x_col, y_col]].dropna()
    if len(valid) >= 3:
        slope, intercept, r, p, se = stats.linregress(valid[x_col], valid[y_col])
        pair_label = f'{y_label.split(" (")[0]} vs {x_label.split(" (")[0]}'
        print(f"\n{pair_label}:")
        print(f"  n = {len(valid)}")
        print(f"  Slope = {slope:.3f}")
        print(f"  Intercept = {intercept:.3f}")
        print(f"  R² = {r**2:.3f}")
        print(f"  p = {p:.2e}")

print("\n\nThreshold R² Comparison Across Method Pairs:")
print("-" * 70)
print(f"{'Method Pair':<30s}", end='')
print(f" {'All':>8s}", end='')
for t in DOMINANCE_THRESHOLDS:
    header = f'≥{t*100:.0f}%'  # same rounding as the lookup key used in the table rows
    print(f" {header:>8s}", end='')
print()
print("-" * 70)
for x_col, y_col, x_label, y_label, prefix in METHOD_PAIRS:
    pair_label = f'{y_label.split(" (")[0]} vs {x_label.split(" (")[0]}'
    if prefix in all_thresh_results:
        print(f"{pair_label:<30s}", end='')
        tr = all_thresh_results[prefix]
        if 'all' in tr:
            print(f" {tr['all']['r_squared']:>8.3f}", end='')
        else:
            print(f" {'N/A':>8s}", end='')
        for t in DOMINANCE_THRESHOLDS:
            key = f'{t*100:.0f}%'
            if key in tr:
                print(f" {tr[key]['r_squared']:>8.3f}", end='')
            else:
                print(f" {'N/A':>8s}", end='')
        print()

---

# Export Data

In [None]:
# Save merged dataset
output_path = os.path.join(dirs['data'], 'bc_ec_source_merged.csv')
df.to_csv(output_path, index=False)
print(f"\nSaved merged dataset to: {output_path}")

# Save summary statistics
summary_stats = []
for col in ['ftir_ec', 'hips_fabs', 'ir_bcc', 'uv_bcc']:
    if col in df.columns:
        vals = df[col].dropna()
        if len(vals) > 0:
            summary_stats.append({
                'Variable': col,
                'n': len(vals),
                'Mean': vals.mean(),
                'Std': vals.std(),
                'Median': vals.median(),
                'Min': vals.min(),
                'Max': vals.max()
            })

stats_df = pd.DataFrame(summary_stats)
stats_path = os.path.join(dirs['data'], 'bc_ec_summary_stats.csv')
stats_df.to_csv(stats_path, index=False)
print(f"Saved summary statistics to: {stats_path}")

print("\n" + "="*80)
print("ANALYSIS COMPLETE")
print("="*80)
print(f"\nPlots saved to: {dirs['plots']}")
print(f"Data saved to: {dirs['data']}")

---

# Appendix: Data Flow & Normalization Explained

## How data is loaded and merged:

1. **`load_etad_factors_with_filter_ids()`** loads two CSVs and joins them on `oldDate`:
   - `ETAD Factor Contributions .csv` (PMF fractions GF1-GF5 + concentrations K_F1-K_F5)
   - `ETAD Filter ID.csv` (maps dates to FilterId like `ETAD-0035-3`)
   - Produces `base_filter_id` (e.g., `ETAD-0035`) for joining to the unified dataset

2. **`match_all_parameters()`** loads FTIR EC, HIPS Fabs, and Aethalometer BC from the
   unified filter dataset, matched by date

3. **Final merge** joins BC/EC measurements to factor contributions on `base_filter_id`
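
The join key used in steps 1 and 3 can be illustrated with a small sketch. It assumes `base_filter_id` is the `FilterId` with its trailing numeric suffix removed (consistent with the `ETAD-0035-3` and `ETAD-0035` examples above). The toy frames, the measurement values, the regex, and the choice of an inner join are illustrative assumptions, not the actual code in `data_matching`.

```python
import pandas as pd

# Toy factor table keyed by the full FilterId (values are made up)
factors = pd.DataFrame({
    "FilterId": ["ETAD-0035-3", "ETAD-0036-1"],
    "GF1": [0.05, 0.10],
})

# Toy measurement table already keyed by base_filter_id (values are made up)
measurements = pd.DataFrame({
    "base_filter_id": ["ETAD-0035", "ETAD-0037"],
    "ftir_ec": [4.2, 3.1],
})

# Assumption: the base ID is the FilterId minus its final "-<digits>" suffix
factors["base_filter_id"] = factors["FilterId"].str.replace(r"-\d+$", "", regex=True)

# Inner join keeps only filters present in both tables
merged = measurements.merge(factors, on="base_filter_id", how="inner")

# An outer join with indicator=True shows which rows failed to match on either side
audit = measurements.merge(factors, on="base_filter_id", how="outer", indicator=True)

print(merged)
print(audit["_merge"].value_counts())
```

Checking the `_merge` counts before the regressions is a cheap way to confirm that unmatched filters are not silently shrinking the sample size.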

## Normalization Process:

### Step 1: Calculate Row Totals
The raw GF fractions do not sum to 1.0; row totals range from roughly 0.03 to 0.46. Dividing each fraction by its row total rescales every row to sum to 1.0:

```python
# Calculate row totals
row_totals = factors_df[frac_cols].sum(axis=1)

# Divide each fraction by the total so they sum to 1.0
for col in frac_cols:
    factors_df[col] = factors_df[col] / row_totals
```

### Example for one day:

| Source | Raw GF | Normalized |
|--------|--------|------------|
| Fossil Fuel | 0.05 | 0.05/0.25 = 0.20 |
| Polluted Marine | 0.03 | 0.03/0.25 = 0.12 |
| Sea Salt | 0.02 | 0.02/0.25 = 0.08 |
| Wood Burning | 0.07 | 0.07/0.25 = 0.28 |
| Charcoal | 0.08 | 0.08/0.25 = 0.32 |
| **Total** | **0.25** | **1.00** |
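
The table can be reproduced with a short, self-contained sketch. The column names here are illustrative stand-ins for the notebook's fraction columns, and the zero-total guard is an addition that the loop shown earlier does not include.

```python
import numpy as np
import pandas as pd

# Raw GF values from the worked example (one day, five sources);
# column names are illustrative
raw = pd.DataFrame({
    "fossil_fuel": [0.05],
    "polluted_marine": [0.03],
    "sea_salt": [0.02],
    "wood_burning": [0.07],
    "charcoal": [0.08],
})

row_totals = raw.sum(axis=1)

# Guard (an addition): a zero total would otherwise produce inf or NaN silently
row_totals = row_totals.replace(0, np.nan)

# Vectorized equivalent of the per-column loop: divide each row by its own total
normalized = raw.div(row_totals, axis=0)

print(normalized.round(2))
print(normalized.sum(axis=1))
```

`DataFrame.div(..., axis=0)` aligns the totals with the row index, so each row is divided by its own total without looping over columns.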

### Step 2: Stacking for Bar Charts
The bars are stacked using a running `bottom` accumulator:

```python
bottom = np.zeros(len(valid_norm))  # Start at 0

for col, label, color in STACK_ORDER:
    values = valid_norm[col].values
    ax.bar(x_dates, values, width=bar_width, bottom=bottom, ...)
    bottom += values  # Next source starts where this one ended
```

This produces bars that all reach exactly 1.0 (100%), with each color segment showing 
that source's relative contribution to the explained OM.
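
As a plotting-free illustration of the accumulator, the sketch below computes the bottom and top edge of every segment for three made-up days and three sources. Each `(bottom, top)` pair corresponds to what a call like `ax.bar(..., bottom=bottom)` would draw; the array values are invented for the example.

```python
import numpy as np

# Three illustrative days x three sources; each row is already normalized to 1.0
fractions = np.array([
    [0.20, 0.30, 0.50],
    [0.60, 0.10, 0.30],
    [0.25, 0.25, 0.50],
])

bottom = np.zeros(fractions.shape[0])
segments = []  # one (bottom, top) pair of arrays per source, in stacking order
for j in range(fractions.shape[1]):
    values = fractions[:, j]
    segments.append((bottom.copy(), bottom + values))
    bottom += values  # next source starts where this one ended

# The same edges without a loop: cumulative sum along the source axis
tops = np.cumsum(fractions, axis=1)
bottoms = tops - fractions

print(bottom)  # final bar heights, one per day
```

If any final height differs noticeably from 1.0, the input rows were not normalized (or contain NaN), which is worth checking before interpreting the chart.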

### Key Point
The normalization changes the interpretation:
- **Raw GF**: Fraction of total PM₂.₅ mass from each source
- **Normalized**: Fraction of **explained OM** from each source (always sums to 100%)

## Key parameters:
- `DOMINANCE_THRESHOLDS`: List of thresholds to test (default: 30%, 40%, 50%, 60%)
- `MAC_VALUE`: Mass absorption coefficient for HIPS conversion (default: 10 m²/g)
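
A minimal sketch of how these two parameters might be applied, using made-up data. It assumes the dominant source is the column with the largest normalized fraction, that days below a threshold are treated as "mixed" and excluded, and that HIPS Fabs is in Mm⁻¹ so that dividing by the MAC in m²/g yields µg/m³. The frame contents and the `hips_bc_equiv` column name are hypothetical.

```python
import pandas as pd

DOMINANCE_THRESHOLDS = [0.30, 0.40, 0.50, 0.60]  # defaults listed above
MAC_VALUE = 10.0  # m²/g, default listed above

# Illustrative normalized fractions and HIPS Fabs (values are made up)
df = pd.DataFrame({
    "charcoal_frac": [0.65, 0.35, 0.20],
    "wood_frac": [0.20, 0.33, 0.45],
    "fossil_frac": [0.15, 0.32, 0.35],
    "hips_fabs": [50.0, 42.0, 61.0],  # assumed units: Mm⁻¹
})
frac_cols = ["charcoal_frac", "wood_frac", "fossil_frac"]

# Dominant source = column with the largest fraction; its value is the dominance
df["dominant_source"] = (
    df[frac_cols].idxmax(axis=1).str.replace("_frac", "", regex=False)
)
df["dominant_fraction"] = df[frac_cols].max(axis=1)

# Number of days retained at each threshold; days below it count as "mixed"
kept = {
    f"{t * 100:.0f}%": int((df["dominant_fraction"] >= t).sum())
    for t in DOMINANCE_THRESHOLDS
}

# Fabs [Mm⁻¹] / MAC [m²/g] gives an equivalent BC mass concentration [µg/m³]
df["hips_bc_equiv"] = df["hips_fabs"] / MAC_VALUE

print(kept)
print(df[["dominant_source", "dominant_fraction", "hips_bc_equiv"]])
```

Raising the threshold trades sample size for source purity, so it helps to report the retained n next to each R² when comparing thresholds.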