# Comprehensive Air Quality Data EDA - Multan AQI Features

This notebook completes the exploratory analysis of the engineered air quality and weather features stored in the Hopsworks feature store.

## Dataset Overview
- **Source**: Hopsworks Feature Store (multan_aqi_features)
- **Records**: 905 observations 
- **Features**: 127 engineered features
- **Time Range**: June 2025 - July 2025

## Modeling Approach
- **🎯 Goal**: Accurate US AQI prediction for Multan
- **🔧 Method**: Train ML model to predict PM2.5 & PM10 → Calculate AQI via EPA formula
- **📊 ML Targets**: pm2_5, pm10 concentrations (µg/m³)
- **✅ Success Metric**: How well calculated AQI matches actual AQI values
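
The "PM concentration to AQI" step above can be sketched as a piecewise-linear interpolation, `I = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo`. The breakpoint tables below are the pre-2024 US EPA values and are an assumption here: EPA revised the PM2.5 table in 2024, so they should be checked against whatever table produced `pm2_5_aqi` and `pm10_aqi`. The function names are hypothetical, and the sketch ignores EPA's concentration truncation and its use of 24-hour averages.

```python
# Pre-2024 US EPA breakpoints as (C_lo, C_hi, I_lo, I_hi).
# Assumption: verify against the table used to build pm2_5_aqi / pm10_aqi.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]
PM10_BREAKPOINTS = [
    (0, 54, 0, 50),
    (55, 154, 51, 100),
    (155, 254, 101, 150),
    (255, 354, 151, 200),
    (355, 424, 201, 300),
    (425, 504, 301, 400),
    (505, 604, 401, 500),
]


def sub_index(conc, breakpoints):
    """Interpolate one pollutant concentration (µg/m³) onto the AQI scale."""
    if conc is None or conc != conc or conc < 0:  # None, NaN, or negative
        return None
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if conc <= c_hi:
            # Values that fall in the small gap between two rows snap to c_lo
            c = max(conc, c_lo)
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return None  # concentration is above the top of the table


def us_aqi_from_pm(pm2_5, pm10):
    """PM-only AQI: the larger of the PM2.5 and PM10 sub-indices."""
    parts = [
        s
        for s in (sub_index(pm2_5, PM25_BREAKPOINTS), sub_index(pm10, PM10_BREAKPOINTS))
        if s is not None
    ]
    return max(parts) if parts else None
```

If the stored `us_aqi` also takes gaseous pollutants into account, a PM-only value like this can come out lower than the stored one, which is worth keeping in mind when scoring the success metric.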

## Feature Categories
1. **Raw Air Quality**: pm2_5, pm10, co, no, no2, so2, o3, nh3
2. **AQI Calculations**: pm2_5_aqi, pm10_aqi, us_aqi
3. **Weather Data**: temperature, humidity, pressure, wind_speed, wind_direction
4. **Time Features**: Cyclical encodings (hour, day, month, etc.)
5. **Lag Features**: 1h-72h historical values
6. **Rolling Statistics**: 3h-24h windows (mean, std, min, max)
7. **Engineered Features**: Interactions, squared terms, categorical flags
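
As an illustration of the cyclical encodings in item 4, here is a minimal sketch of a sine/cosine transform. The helper `add_cyclical` is hypothetical and only shows the idea; the feature pipeline that filled the feature group may differ in naming and period choices.

```python
import numpy as np
import pandas as pd


def add_cyclical(frame, column, period):
    """Add <column>_sin and <column>_cos so that period-1 and 0 end up adjacent."""
    angle = 2 * np.pi * frame[column] / period
    out = frame.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


demo = pd.DataFrame({"hour": [0, 6, 12, 23]})
demo = add_cyclical(demo, "hour", period=24)
print(demo.round(3))
```

In the encoded space, hour 23 sits next to hour 0, whereas the raw integers put them 23 units apart.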


## 1. Data Overview
Loading and examining the basic structure of our modeling dataset from Hopsworks.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import hopsworks
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Import configuration
from config import HOPSWORKS_CONFIG

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)


In [None]:
# Connect to Hopsworks and load data
print("Connecting to Hopsworks...")
project = hopsworks.login(api_key_value=HOPSWORKS_CONFIG["api_key"], project=HOPSWORKS_CONFIG["project_name"])
fs = project.get_feature_store()

print("Loading feature group data...")
fg = fs.get_feature_group(HOPSWORKS_CONFIG["feature_group_name"], version=1)
df = fg.read()

print(f"Successfully loaded {len(df)} records from Hopsworks")
print(f"Date range: {df['time'].min()} to {df['time'].max()}")


In [None]:
# Fix column references and prepare data
print("Data preparation and column check...")
print(f"Time column: {'time' if 'time' in df.columns else 'timestamp not found'}")
print(f"Date range: {df['time'].min()} to {df['time'].max()}")

# Ensure time is datetime
if df['time'].dtype == 'object':
    df['time'] = pd.to_datetime(df['time'])

# Sort by time
df = df.sort_values('time').reset_index(drop=True)
print("✓ Data sorted by time")


## 9. Rolling Features Analysis

**Focus**: Analyzing 32 rolling statistics features (16 for PM2.5 + 16 for PM10) to determine their predictive value and optimal usage.
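
For orientation, the 4 windows × 4 statistics layout could be produced along the following lines. This is a sketch under assumptions: the helper `add_rolling` and the column pattern `<target>_rolling_<stat>_<window>h` are hypothetical, and rows are assumed to be hourly. The `shift(1)` keeps the current hour out of its own window; without it, a rolling statistic correlates with current PM partly by construction.

```python
import numpy as np
import pandas as pd


def add_rolling(frame, column, windows=(3, 6, 12, 24), exclude_current=True):
    """Rolling mean/std/min/max over the previous `w` rows (assumed hourly)."""
    out = frame.copy()
    # shift(1) keeps the current observation out of its own window
    base = out[column].shift(1) if exclude_current else out[column]
    for w in windows:
        roll = base.rolling(window=w, min_periods=w)
        for stat in ("mean", "std", "min", "max"):
            out[f"{column}_rolling_{stat}_{w}h"] = getattr(roll, stat)()
    return out


demo = pd.DataFrame({"pm2_5": np.arange(30, dtype=float)})
demo = add_rolling(demo, "pm2_5")
rolling_cols = [c for c in demo.columns if "rolling" in c]
print(len(rolling_cols), "rolling columns")
```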

### 9.1 Rolling Features Overview


In [None]:
# 9.1 Rolling Features Overview
print("=" * 60)
print("ROLLING FEATURES ANALYSIS")
print("=" * 60)

# Identify all rolling features
pm25_rolling = [col for col in df.columns if 'rolling' in col and 'pm2_5' in col]
pm10_rolling = [col for col in df.columns if 'rolling' in col and 'pm10' in col]
all_rolling = pm25_rolling + pm10_rolling

print(f"Total Rolling Features: {len(all_rolling)}")
print(f"PM2.5 Rolling Features: {len(pm25_rolling)}")
print(f"PM10 Rolling Features: {len(pm10_rolling)}")

# Categorize by window size and statistic
windows = ['3h', '6h', '12h', '24h']
stats = ['mean', 'std', 'min', 'max']

print(f"\nRolling Feature Structure:")
print(f"Windows: {windows}")
print(f"Statistics: {stats}")
print(f"Total combinations per target: {len(windows)} windows × {len(stats)} stats = {len(windows) * len(stats)} features")

# Show sample features
print(f"\nSample PM2.5 Rolling Features:")
for i, feature in enumerate(pm25_rolling[:8]):
    print(f"  {i+1:2d}. {feature}")
    
print(f"\nSample PM10 Rolling Features:")
for i, feature in enumerate(pm10_rolling[:8]):
    print(f"  {i+1:2d}. {feature}")


### 9.2 Rolling Features vs Current PM Correlation


In [None]:
# 9.2 Rolling Features vs Current PM Correlation
print("\n" + "=" * 60)
print("ROLLING FEATURES vs CURRENT PM CORRELATION")
print("=" * 60)

# PM2.5 Rolling Features Analysis
print(f"\nPM2.5 ROLLING FEATURES CORRELATION WITH CURRENT PM2.5:")
print("-" * 55)

pm25_correlations = {}
for stat_type in stats:
    stat_features = [col for col in pm25_rolling if stat_type in col]
    if stat_features:
        print(f"\n{stat_type.upper()} features:")
        correlations = df[['pm2_5'] + stat_features].corr()['pm2_5'].drop('pm2_5')
        pm25_correlations[stat_type] = correlations
        for feature, corr in correlations.sort_values(ascending=False).items():
            window = feature.split('_')[-1]
            print(f"    {window:<4} window: {corr:.3f}")

# PM10 Rolling Features Analysis  
print(f"\nPM10 ROLLING FEATURES CORRELATION WITH CURRENT PM10:")
print("-" * 54)

pm10_correlations = {}
for stat_type in stats:
    stat_features = [col for col in pm10_rolling if stat_type in col]
    if stat_features:
        print(f"\n{stat_type.upper()} features:")
        correlations = df[['pm10'] + stat_features].corr()['pm10'].drop('pm10')
        pm10_correlations[stat_type] = correlations
        for feature, corr in correlations.sort_values(ascending=False).items():
            window = feature.split('_')[-1]
            print(f"    {window:<4} window: {corr:.3f}")

# Summary insights
print(f"\n📊 KEY INSIGHTS:")
print(f"• Rolling features show how current PM relates to recent historical patterns")
print(f"• Higher correlations mean the statistic tracks current PM closely; if the window includes the current hour, part of that is by construction")
print(f"• Different statistics capture different aspects of PM behavior")


### 9.3 Rolling Features vs Future PM Correlation
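
The cell below builds future targets with `shift(-horizon)`, which equals "h hours ahead" only when the rows are strictly hourly with no gaps. A possible time-aware alternative is sketched here; `future_by_time` is a hypothetical helper, and the three-row frame is made-up demo data.

```python
import pandas as pd


def future_by_time(frame, column, hours, time_col="time"):
    """Value of `column` exactly `hours` later; NaN where that timestamp is absent."""
    lookup = frame.drop_duplicates(time_col).set_index(time_col)[column]
    target_times = frame[time_col] + pd.Timedelta(hours=hours)
    return pd.Series(lookup.reindex(target_times).to_numpy(), index=frame.index)


# Demo data with a missing 02:00 observation
times = pd.to_datetime(["2025-06-01 00:00", "2025-06-01 01:00", "2025-06-01 03:00"])
demo = pd.DataFrame({"time": times, "pm2_5": [10.0, 20.0, 40.0]})
demo["by_row"] = demo["pm2_5"].shift(-1)            # next row, whatever its timestamp
demo["by_time"] = future_by_time(demo, "pm2_5", 1)  # exactly one hour later
print(demo)
```

For the 01:00 row, the row-based shift picks up the 03:00 reading (two hours ahead), while the time-based lookup leaves a gap instead.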


In [None]:
# 9.3 Rolling Features vs Future PM Correlation
print("\n" + "=" * 60)
print("ROLLING FEATURES vs FUTURE PM CORRELATION")
print("=" * 60)

# Create future PM values for correlation analysis
future_horizons = [1, 6, 12, 24, 48, 72]
future_pm_data = {}

for horizon in future_horizons:
    future_pm_data[f'pm2_5_future_{horizon}h'] = df['pm2_5'].shift(-horizon)
    future_pm_data[f'pm10_future_{horizon}h'] = df['pm10'].shift(-horizon)

future_df = pd.DataFrame(future_pm_data)
combined_df = pd.concat([df[all_rolling + ['pm2_5', 'pm10']], future_df], axis=1)

# Analyze PM2.5 rolling features vs future PM2.5
print(f"\nPM2.5 ROLLING FEATURES vs FUTURE PM2.5:")
print("-" * 45)

for horizon in future_horizons:
    future_col = f'pm2_5_future_{horizon}h'
    print(f"\n{horizon}h ahead predictions:")
    
    # Best rolling feature for this horizon
    rolling_corrs = combined_df[pm25_rolling + [future_col]].corr()[future_col].drop(future_col)
    best_rolling = rolling_corrs.abs().idxmax()
    best_corr = rolling_corrs[best_rolling]
    
    print(f"  Best rolling feature: {best_rolling} (corr: {best_corr:.3f})")
    
    # Best window per statistic type
    for stat_type in stats:
        stat_features = [col for col in pm25_rolling if stat_type in col]
        if stat_features:
            stat_corrs = rolling_corrs[stat_features]
            if len(stat_corrs) > 0:
                best_stat_feature = stat_corrs.abs().idxmax()
                best_stat_corr = stat_corrs[best_stat_feature]
                window = best_stat_feature.split('_')[-1]
                print(f"    {stat_type:<4}: {window} window ({best_stat_corr:.3f})")

# Analyze PM10 rolling features vs future PM10
print(f"\nPM10 ROLLING FEATURES vs FUTURE PM10:")
print("-" * 44)

for horizon in future_horizons:
    future_col = f'pm10_future_{horizon}h'
    print(f"\n{horizon}h ahead predictions:")
    
    # Best rolling feature for this horizon
    rolling_corrs = combined_df[pm10_rolling + [future_col]].corr()[future_col].drop(future_col)
    best_rolling = rolling_corrs.abs().idxmax()
    best_corr = rolling_corrs[best_rolling]
    
    print(f"  Best rolling feature: {best_rolling} (corr: {best_corr:.3f})")
    
    # Top by statistic type
    for stat_type in stats:
        stat_features = [col for col in pm10_rolling if stat_type in col]
        if stat_features:
            stat_corrs = rolling_corrs[stat_features]
            if len(stat_corrs) > 0:
                best_stat_feature = stat_corrs.abs().idxmax()
                best_stat_corr = stat_corrs[best_stat_feature]
                window = best_stat_feature.split('_')[-1]
                print(f"    {stat_type:<4}: {window} window ({best_stat_corr:.3f})")

print(f"\n🎯 FORECASTING INSIGHTS:")
print(f"• The correlations above show how much signal each rolling feature carries for future PM")
print(f"• The best window can differ by horizon; compare the per-horizon winners listed above")
print(f"• Whether short windows suit short horizons and long windows suit long horizons should be read off the output, not assumed")


## 10. Missing Pollutants Analysis

**Focus**: Analyzing NO (Nitric Oxide) and NH3 (Ammonia) - pollutants we haven't analyzed yet.

### 10.1 NO and NH3 Overview


In [None]:
# 10.1 NO and NH3 Overview
print("=" * 60)
print("MISSING POLLUTANTS ANALYSIS: NO & NH3")
print("=" * 60)

# Check if these pollutants exist in our data
missing_pollutants = ['no', 'nh3']
available_pollutants = [col for col in missing_pollutants if col in df.columns]
missing_from_data = [col for col in missing_pollutants if col not in df.columns]

print(f"Not-yet-analyzed pollutants present in dataset: {available_pollutants}")
print(f"Not in dataset: {missing_from_data}")

if available_pollutants:
    print(f"\nBASIC STATISTICS:")
    for pollutant in available_pollutants:
        data = df[pollutant]
        print(f"\n{pollutant.upper()} (Nitric Oxide):" if pollutant == 'no' else f"\n{pollutant.upper()} (Ammonia):")
        print(f"  Range: {data.min():.2f} - {data.max():.2f}")
        print(f"  Mean: {data.mean():.2f}")
        print(f"  Std: {data.std():.2f}")
        print(f"  Missing: {data.isnull().sum()} ({(data.isnull().sum()/len(data)*100):.1f}%)")
        
        # Check for zeros
        zero_count = (data == 0).sum()
        if zero_count > 0:
            print(f"  Zero values: {zero_count} ({(zero_count/len(data)*100):.1f}%)")

# Visualize missing pollutants over time
if available_pollutants:
    fig, axes = plt.subplots(len(available_pollutants), 1, figsize=(15, 4*len(available_pollutants)))
    if len(available_pollutants) == 1:
        axes = [axes]
    
    for i, pollutant in enumerate(available_pollutants):
        axes[i].plot(df['time'], df[pollutant], alpha=0.7, color='green' if pollutant == 'no' else 'purple')
        axes[i].set_title(f'{pollutant.upper()} Concentration Over Time')
        axes[i].set_ylabel(f'{pollutant.upper()} (µg/m³)')
        axes[i].grid(True, alpha=0.3)
        
    plt.xlabel('Date')
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Neither NO nor NH3 is present in the dataset")


### 10.2 NO and NH3 Correlation with PM2.5/PM10


In [None]:
# 10.2 NO and NH3 Correlation with PM2.5/PM10
if available_pollutants:
    print(f"\nCORRELATION WITH ML TARGETS:")
    print("-" * 40)
    
    targets = ['pm2_5', 'pm10']
    for target in targets:
        print(f"\n{target.upper()} correlations:")
        target_data = df[target]
        
        for pollutant in available_pollutants:
            pollutant_data = df[pollutant]
            corr = target_data.corr(pollutant_data)
            # Label the strength of the correlation (descriptive thresholds, not a significance test)
            if abs(corr) > 0.5:
                strength = " (STRONG)"
            elif abs(corr) > 0.3:
                strength = " (MODERATE)"
            elif abs(corr) > 0.1:
                strength = " (WEAK)"
            else:
                strength = " (NEGLIGIBLE)"
                
            print(f"  {pollutant.upper():<4}: {corr:>6.3f}{strength}")
    
    # Compare with other pollutants we analyzed
    print(f"\nCOMPARISON WITH ANALYZED POLLUTANTS:")
    print("-" * 45)
    
    analyzed_pollutants = ['carbon_monoxide', 'nitrogen_dioxide', 'ozone', 'sulphur_dioxide']
    available_analyzed = [col for col in analyzed_pollutants if col in df.columns]
    
    all_pollutants = available_pollutants + available_analyzed
    
    if len(all_pollutants) >= 2:
        # Create correlation matrix
        corr_matrix = df[all_pollutants + targets].corr()
        
        # Extract correlations with targets
        for target in targets:
            print(f"\n{target.upper()} correlation ranking:")
            target_corrs = corr_matrix[target].drop(target).abs().sort_values(ascending=False)
            for i, (pollutant, corr) in enumerate(target_corrs.items(), 1):
                status = "★ NEW" if pollutant in available_pollutants else "  OLD"
                print(f"  {i:2d}. {status} {pollutant:<20}: {corr:.3f}")
    
    print(f"\n🔬 MISSING POLLUTANTS INSIGHTS:")
    print(f"• NO and NH3 provide additional chemical information")
    print(f"• Agricultural emissions (NH3) and combustion processes (NO)")
    print(f"• May capture different pollution sources than analyzed pollutants")
else:
    print("⚠️ Cannot analyze correlations - pollutants not in dataset")


## 11. Wind Direction Analysis

**Focus**: Analyzing wind direction patterns and their relationship with PM concentrations to understand pollution dispersion.
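
Because direction is a circular quantity, an arithmetic mean of degrees can mislead: 350° and 10° average to 180°, although both point roughly north. A minimal sketch of a circular mean and a concentration measure follows; both helper names are hypothetical.

```python
import numpy as np


def circular_mean_deg(degrees):
    """Mean direction in [0, 360), computed from unit vectors; NaNs are ignored."""
    rad = np.radians(np.asarray(degrees, dtype=float))
    rad = rad[~np.isnan(rad)]
    mean_angle = np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())
    return float(np.degrees(mean_angle) % 360)


def mean_resultant_length(degrees):
    """Between 0 and 1: near 1 means directions cluster, near 0 means they spread out."""
    rad = np.radians(np.asarray(degrees, dtype=float))
    rad = rad[~np.isnan(rad)]
    return float(np.hypot(np.sin(rad).mean(), np.cos(rad).mean()))


sample = [350.0, 10.0]
print("arithmetic mean:", np.mean(sample))
print("circular mean:  ", circular_mean_deg(sample))
```

On the notebook's data this could be applied as `circular_mean_deg(df['wind_direction'])`, assuming the column holds degrees.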

### 11.1 Wind Direction Patterns


In [None]:
# 11.1 Wind Direction Patterns
print("=" * 60)
print("WIND DIRECTION ANALYSIS")
print("=" * 60)

if 'wind_direction' in df.columns:
    wind_dir = df['wind_direction']
    
    print(f"WIND DIRECTION STATISTICS:")
    print(f"  Range: {wind_dir.min():.1f}° - {wind_dir.max():.1f}°")
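    # Arithmetic mean/std treat 359° and 1° as far apart, so read them with care
    print(f"  Mean (arithmetic): {wind_dir.mean():.1f}°")
    print(f"  Std (arithmetic): {wind_dir.std():.1f}°")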
    print(f"  Missing: {wind_dir.isnull().sum()} ({(wind_dir.isnull().sum()/len(wind_dir)*100):.1f}%)")
    
    # Convert degrees to cardinal directions for better understanding
    def degrees_to_cardinal(degrees):
        """Convert wind direction degrees to cardinal directions"""
        if pd.isna(degrees):
            return 'Unknown'
        
        directions = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
                     'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']
        
        # Each direction covers 22.5 degrees (360/16)
        index = int((degrees + 11.25) / 22.5) % 16
        return directions[index]
    
    # Add cardinal directions
    df['wind_cardinal'] = wind_dir.apply(degrees_to_cardinal)
    
    # Analyze wind direction distribution
    print(f"\nWIND DIRECTION DISTRIBUTION:")
    direction_counts = df['wind_cardinal'].value_counts()
    print(direction_counts)
    
    # Visualize wind direction distribution
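    fig = plt.figure(figsize=(15, 6))
    
    # Wind rose plot (polar); create the polar axes directly rather than
    # stacking them on top of a Cartesian subplot
    theta = np.radians(wind_dir.dropna() % 360)
    ax1 = fig.add_subplot(121, projection='polar')
    # 16 equal sectors of 22.5° covering the full circle
    ax1.hist(theta, bins=np.linspace(0, 2 * np.pi, 17), alpha=0.7, color='skyblue')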
    ax1.set_title('Wind Direction Distribution (Wind Rose)')
    ax1.set_theta_zero_location('N')
    ax1.set_theta_direction(-1)
    
    # Cardinal direction bar plot
    ax2 = plt.subplot(122)
    direction_counts.plot(kind='bar', ax=ax2, color='lightcoral')
    ax2.set_title('Wind Direction by Cardinal Points')
    ax2.set_xlabel('Direction')
    ax2.set_ylabel('Frequency')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("⚠️ Wind direction data not available in dataset")


### 11.2 Wind Direction vs PM Correlation


In [None]:
# 11.2 Wind Direction vs PM Correlation
if 'wind_direction' in df.columns and 'wind_cardinal' in df.columns:
    print(f"\nWIND DIRECTION vs PM CONCENTRATIONS:")
    print("-" * 45)
    
    # Analyze PM levels by wind direction
    targets = ['pm2_5', 'pm10']
    
    for target in targets:
        print(f"\n{target.upper()} by Wind Direction:")
        pm_by_direction = df.groupby('wind_cardinal')[target].agg(['mean', 'std', 'count']).round(2)
        pm_by_direction = pm_by_direction.sort_values('mean', ascending=False)
        
        print(pm_by_direction)
        
        # Find pollution source directions
        max_direction = pm_by_direction['mean'].idxmax()
        min_direction = pm_by_direction['mean'].idxmin()
        
        print(f"\n  🔥 Highest {target.upper()}: {max_direction} direction ({pm_by_direction.loc[max_direction, 'mean']:.1f} µg/m³)")
        print(f"  🌿 Lowest {target.upper()}: {min_direction} direction ({pm_by_direction.loc[min_direction, 'mean']:.1f} µg/m³)")
        
        # Visualize PM by wind direction
        plt.figure(figsize=(12, 6))
        
        plt.subplot(1, 2, 1)
        pm_by_direction['mean'].plot(kind='bar', color='orange' if target == 'pm2_5' else 'red')
        plt.title(f'{target.upper()} Mean Concentration by Wind Direction')
        plt.xlabel('Wind Direction')
        plt.ylabel(f'{target.upper()} (µg/m³)')
        plt.xticks(rotation=45)
        
        # Polar plot showing PM levels by direction
        ax_polar = plt.subplot(1, 2, 2, projection='polar')
        ax_polar.set_theta_zero_location('N')  # 0° (north) at the top
        ax_polar.set_theta_direction(-1)       # angles increase clockwise, like a compass
        
        # Convert cardinal directions back to angles (listed in compass order)
        direction_angles = {
            'N': 0, 'NNE': 22.5, 'NE': 45, 'ENE': 67.5,
            'E': 90, 'ESE': 112.5, 'SE': 135, 'SSE': 157.5,
            'S': 180, 'SSW': 202.5, 'SW': 225, 'WSW': 247.5,
            'W': 270, 'WNW': 292.5, 'NW': 315, 'NNW': 337.5
        }
        
        # Plot in compass order (not PM ranking) and skip 'Unknown',
        # so the outline traces the directions in sequence
        known = [d for d in direction_angles if d in pm_by_direction.index]
        angles = [np.radians(direction_angles[d]) for d in known]
        values = pm_by_direction.loc[known, 'mean'].values
        
        color = 'red' if target == 'pm10' else 'orange'
        ax_polar.plot(angles, values, 'o-', color=color)
        ax_polar.fill(angles, values, alpha=0.3, color=color)
        ax_polar.set_title(f'{target.upper()} by Wind Direction (Polar)')
        ax_polar.set_thetagrids(range(0, 360, 45), ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW'])
        
        plt.tight_layout()
        plt.show()
    
    print(f"\n🌬️ WIND DIRECTION INSIGHTS:")
    print(f"• Wind direction reveals potential pollution source locations")
    print(f"• Higher PM from certain directions = pollution sources upwind")
    print(f"• Lower PM from certain directions = cleaner air sources")
    print(f"• Important for source apportionment and prediction modeling")
    
else:
    print("⚠️ Cannot analyze wind direction - data not available")


## 12. Interaction Features Analysis

**Focus**: Analyzing 5 interaction features to determine if combined variables provide better predictive power than individual features.
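
A single correlation coefficient cannot show whether an interaction term adds information beyond its components. One complementary check, sketched here on synthetic data, is the gain in R² when the product term is added to a linear fit that already contains both components. The helper `r_squared` and the demo variables are hypothetical and do not come from the dataset.

```python
import numpy as np


def r_squared(X, y):
    """R² of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


# Synthetic demo data with a genuine temperature × humidity effect
rng = np.random.default_rng(0)
temp = rng.normal(30, 5, 500)
hum = rng.normal(50, 10, 500)
y = 0.02 * temp * hum + rng.normal(0, 1, 500)

base = r_squared(np.column_stack([temp, hum]), y)
full = r_squared(np.column_stack([temp, hum, temp * hum]), y)
gain = full - base
print(f"R² components only: {base:.3f}, with interaction: {full:.3f}, gain: {gain:.3f}")
```

A gain close to zero would suggest the interaction is redundant with its components for a linear model; tree-based models can pick up such products on their own.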

### 12.1 Interaction Features Overview


In [None]:
# 12.1 Interaction Features Overview
print("=" * 60)
print("INTERACTION FEATURES ANALYSIS")
print("=" * 60)

# Identify interaction features
interaction_features = [col for col in df.columns if 'interaction' in col]
print(f"Total Interaction Features: {len(interaction_features)}")

if interaction_features:
    print(f"\nInteraction Features:")
    for i, feature in enumerate(interaction_features, 1):
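        # Parse the feature name to understand what it combines;
        # protect the underscore inside 'pm2_5' so it is not split in two
        parts = feature.replace('_interaction', '').replace('pm2_5', 'pm2.5').split('_')
        parts = [p.replace('pm2.5', 'pm2_5') for p in parts]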
        if len(parts) >= 2:
            var1 = '_'.join(parts[:-1])
            var2 = parts[-1]
            print(f"  {i}. {feature:<30} = {var1} × {var2}")
        else:
            print(f"  {i}. {feature}")
    
    # Basic statistics for interaction features
    print(f"\nINTERACTION FEATURES STATISTICS:")
    interaction_stats = df[interaction_features].describe()
    print(interaction_stats.round(2))
    
else:
    print("⚠️ No interaction features found in dataset")


### 12.2 Interaction Features vs Individual Features Comparison


In [None]:
# 12.2 Interaction Features vs Individual Features Comparison
if interaction_features:
    print(f"\nINTERACTION vs INDIVIDUAL FEATURES CORRELATION:")
    print("-" * 55)
    
    targets = ['pm2_5', 'pm10']
    
    # Analysis for each interaction feature
    interaction_analysis = {}
    
    for interaction in interaction_features:
        print(f"\n{interaction.upper()}:")
        
        # Parse individual components
        if 'temp_humidity' in interaction:
            components = ['temperature', 'humidity']
        elif 'temp_wind' in interaction:
            components = ['temperature', 'wind_speed']
        elif 'pm2_5_temp' in interaction:
            components = ['pm2_5', 'temperature']
        elif 'pm2_5_humidity' in interaction:
            components = ['pm2_5', 'humidity']
        elif 'wind_pm2_5' in interaction:
            components = ['wind_speed', 'pm2_5']
        else:
            components = []
        
        # Calculate correlations with targets
        interaction_analysis[interaction] = {'components': components}
        
        for target in targets:
            if target in df.columns and interaction in df.columns:
                interaction_corr = df[target].corr(df[interaction])
                
                # Compare with individual component correlations
                # (skip the target itself, whose self-correlation is trivially 1.0)
                component_corrs = []
                for comp in components:
                    if comp in df.columns and comp != target:
                        comp_corr = df[target].corr(df[comp])
                        component_corrs.append((comp, comp_corr))
                
                # Store results
                interaction_analysis[interaction][target] = {
                    'interaction_corr': interaction_corr,
                    'component_corrs': component_corrs
                }
                
                print(f"  {target.upper()} correlation:")
                print(f"    Interaction feature: {interaction_corr:>6.3f}")
                for comp, corr in component_corrs:
                    print(f"    {comp:<15}: {corr:>6.3f}")
                
                # Determine if interaction adds value
                max_individual = max([abs(corr) for _, corr in component_corrs]) if component_corrs else 0
                if abs(interaction_corr) > max_individual:
                    print(f"    ✅ Interaction IMPROVES correlation (+{abs(interaction_corr) - max_individual:.3f})")
                else:
                    print(f"    ❌ Interaction REDUCES correlation (-{max_individual - abs(interaction_corr):.3f})")
    
    # Summary of interaction feature value
    print(f"\n🔗 INTERACTION FEATURES SUMMARY:")
    useful_interactions = 0
    total_comparisons = 0
    
    for interaction, analysis in interaction_analysis.items():
        for target in targets:
            if target in analysis:
                total_comparisons += 1
                interaction_corr = abs(analysis[target]['interaction_corr'])
                max_individual = max([abs(corr) for _, corr in analysis[target]['component_corrs']]) if analysis[target]['component_corrs'] else 0
                
                if interaction_corr > max_individual:
                    useful_interactions += 1
    
    if total_comparisons > 0:
        improvement_rate = (useful_interactions / total_comparisons) * 100
        print(f"• {useful_interactions}/{total_comparisons} interactions improve over individual features ({improvement_rate:.1f}%)")
        
        if improvement_rate > 50:
            print(f"• ✅ RECOMMENDATION: Include interaction features in model")
        else:
            print(f"• ❌ RECOMMENDATION: Individual features may be sufficient")
    
else:
    print("⚠️ Cannot analyze interactions - no interaction features found")


## 13. Binary Indicators Analysis

**Focus**: Analyzing binary indicator features (is_hot, is_rush_hour, etc.) to determine their categorical predictive value.
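
Beyond counting how often each flag is set, its predictive value can be gauged by comparing the target when the flag is on versus off. The sketch below uses a hypothetical helper `flag_effect` and a made-up six-row frame; the difference is scaled by the overall standard deviation of the target so that flags can be compared with each other.

```python
import pandas as pd


def flag_effect(frame, flag, target):
    """Mean of `target` with the flag on vs off, plus the difference in overall-SD units."""
    on = frame.loc[frame[flag] == 1, target]
    off = frame.loc[frame[flag] == 0, target]
    overall_sd = frame[target].std()
    return {
        "n_on": int(on.size),
        "mean_on": on.mean(),
        "mean_off": off.mean(),
        "std_diff": (on.mean() - off.mean()) / overall_sd if overall_sd else float("nan"),
    }


# Made-up demo data
demo = pd.DataFrame({
    "is_rush_hour": [1, 1, 0, 0, 0, 0],
    "pm2_5": [60.0, 70.0, 40.0, 50.0, 45.0, 35.0],
})
result = flag_effect(demo, "is_rush_hour", "pm2_5")
print(result)
```

Flags that are on for only a handful of rows give unstable estimates, so `n_on` is worth reading alongside the difference.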

### 13.1 Binary Indicators Overview


In [None]:
# 13.1 Binary Indicators Overview
print("=" * 60)
print("BINARY INDICATORS ANALYSIS")
print("=" * 60)

# Identify binary indicator features
binary_features = [col for col in df.columns if col.startswith('is_')]
print(f"Total Binary Indicator Features: {len(binary_features)}")

if binary_features:
    print(f"\nBinary Indicator Features:")
    
    # Categorize binary features
    categories = {
        'Weather': [col for col in binary_features if any(x in col for x in ['hot', 'cold', 'humidity', 'wind', 'pressure'])],
        'Time': [col for col in binary_features if any(x in col for x in ['spring', 'summer', 'autumn', 'winter', 'night', 'rush'])],
        # Match whole underscore-delimited tokens so that 'co' does not also match 'is_cold'
        'Pollution': [col for col in binary_features if any(f'_{x}_' in f'{col}_' for x in ['pm2_5', 'pm10', 'no2', 'o3', 'co', 'so2'])]
    }
    
    for category, features in categories.items():
        if features:
            print(f"\n{category} Indicators ({len(features)}):")
            for feature in features:
                # Calculate distribution
                if feature in df.columns:
                    true_count = df[feature].sum()
                    total_count = len(df)
                    true_pct = (true_count / total_count) * 100
                    print(f"  {feature:<25}: {true_count:>4}/{total_count} ({true_pct:>5.1f}%)")
    
    # Show overall distribution
    print(f"\nBINARY FEATURES DISTRIBUTION:")
    binary_stats = df[binary_features].sum().sort_values(ascending=False)
    total_records = len(df)
    
    for feature, count in binary_stats.items():
        percentage = (count / total_records) * 100
        print(f"  {feature:<25}: {count:>4} ({percentage:>5.1f}%)")
        
else:
    print("⚠️ No binary indicator features found in dataset")


## 14. Comprehensive Feature Ranking

**Focus**: Final ranking of ALL features for model selection based on comprehensive analysis.
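
A ranking by absolute correlation says nothing about redundancy among the top features. One simple complement is a greedy filter that keeps a feature only if it is not nearly collinear with a feature already kept. This is a sketch: `drop_redundant`, the 0.95 threshold, and the demo columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def drop_redundant(frame, ranked_features, threshold=0.95):
    """Walk features from best to worst rank; keep one only if its |corr| with
    every already-kept feature is below `threshold`. Features whose correlation
    is NaN (e.g. constant columns) are dropped."""
    corr = frame[ranked_features].corr().abs()
    kept = []
    for feature in ranked_features:
        if all(corr.loc[feature, k] < threshold for k in kept):
            kept.append(feature)
    return kept


# Demo data: 'a_copy' is almost identical to 'a', 'b' is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})
kept = drop_redundant(demo, ["a", "a_copy", "b"])
print(kept)
```

Fed with the ordered names from `feature_rankings[target]`, a filter like this would retain the strongest member of each highly correlated group.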

### 14.1 Feature Importance Ranking


In [None]:
# 14.1 Comprehensive Feature Importance Ranking
print("=" * 60)
print("COMPREHENSIVE FEATURE RANKING FOR MODEL SELECTION")
print("=" * 60)

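# Get all feature categories
# Note: the AQI columns are computed from the PM targets, so they will rank near the top by construction
all_features = [col for col in df.columns if col not in ['time', 'time_str', 'wind_cardinal']]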
targets = ['pm2_5', 'pm10']

feature_rankings = {}

print(f"Analyzing {len(all_features)} features for model selection...")

for target in targets:
    print(f"\n{target.upper()} FEATURE RANKING:")
    print("-" * 30)
    
    # Calculate correlations for all features
    correlations = {}
    
    for feature in all_features:
        if feature != target and feature in df.columns:
            try:
                corr = df[target].corr(df[feature])
                if not pd.isna(corr):
                    correlations[feature] = abs(corr)
            except (TypeError, ValueError):  # non-numeric column: no correlation to compute
                continue
    
    # Sort by correlation strength
    sorted_correlations = sorted(correlations.items(), key=lambda x: x[1], reverse=True)
    
    # Categorize features for better understanding
    feature_categories = {
        'Raw Pollutants': [f for f in all_features if f in ['carbon_monoxide', 'nitrogen_dioxide', 'ozone', 'sulphur_dioxide', 'no', 'nh3']],
        'Weather': [f for f in all_features if f in ['temperature', 'humidity', 'pressure', 'wind_speed', 'wind_direction']],
        'Time Features': [f for f in all_features if any(x in f for x in ['hour_', 'day_', 'month_', 'is_spring', 'is_summer', 'is_autumn', 'is_winter'])],
        'Lag Features': [f for f in all_features if 'lag_' in f],
        'Rolling Features': [f for f in all_features if 'rolling_' in f],
        'Change Rate': [f for f in all_features if 'change_rate' in f],
        'Derived Features': [f for f in all_features if any(x in f for x in ['squared', 'cubed', 'is_hot', 'is_cold', 'is_high', 'is_low'])],
        'Interactions': [f for f in all_features if 'interaction' in f]
    }
    
    # Show top features overall
    print(f"\nTOP 15 FEATURES by Correlation:")
    for i, (feature, corr) in enumerate(sorted_correlations[:15], 1):
        # Identify category
        category = "Other"
        for cat, features in feature_categories.items():
            if feature in features:
                category = cat
                break
        
        print(f"  {i:2d}. {feature:<35} {corr:.3f} ({category})")
    
    # Show top features by category
    print(f"\nTOP FEATURES BY CATEGORY:")
    for category, category_features in feature_categories.items():
        category_corrs = [(f, correlations.get(f, 0)) for f in category_features if f in correlations]
        if category_corrs:
            top_feature = max(category_corrs, key=lambda x: x[1])
            print(f"  {category:<18}: {top_feature[0]:<30} ({top_feature[1]:.3f})")
    
    # Store rankings
    feature_rankings[target] = sorted_correlations

# Feature selection recommendations
print(f"\n🎯 FEATURE SELECTION RECOMMENDATIONS:")
print("=" * 50)

# Based on our EDA findings
print(f"\nHIGH PRIORITY FEATURES (Strong correlations + EDA insights):")
high_priority = []

# Add features based on EDA analysis
weather_features = ['temperature', 'humidity', 'pressure', 'wind_speed']
high_priority.extend([f for f in weather_features if f in all_features])

# Key pollutants from Section 3 analysis
key_pollutants = ['carbon_monoxide', 'ozone', 'sulphur_dioxide']
high_priority.extend([f for f in key_pollutants if f in all_features])

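# Short-term lags from Section 8.2 analysis
# Require an underscore before the hour token: a bare substring test for '2h'
# would also match 12h and 72h lags
short_lags = [f for f in all_features if 'lag_' in f and any(f'_{x}' in f for x in ['1h', '2h', '3h'])]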
high_priority.extend(short_lags)

# Time features
time_features = [f for f in all_features if any(x in f for x in ['hour_sin', 'hour_cos', 'day_sin', 'day_cos'])]
high_priority.extend(time_features)

print(f"Total high priority features: {len(set(high_priority))}")
for feature in sorted(set(high_priority)):
    if feature in df.columns:
        pm25_corr = df['pm2_5'].corr(df[feature]) if 'pm2_5' in df.columns else 0
        pm10_corr = df['pm10'].corr(df[feature]) if 'pm10' in df.columns else 0
        print(f"  {feature:<30}: PM2.5({pm25_corr:>6.3f}) PM10({pm10_corr:>6.3f})")

print(f"\nMEDIUM PRIORITY FEATURES (Based on EDA analysis):")
medium_features = []

# Rolling features (keep some based on analysis)
rolling_means = [f for f in all_features if 'rolling_mean' in f and any(x in f for x in ['3h', '6h'])]
medium_features.extend(rolling_means)

# Change rate features
change_rates = [f for f in all_features if 'change_rate' in f]
medium_features.extend(change_rates)

# Binary indicators that showed promise
useful_binary = [f for f in all_features if f.startswith('is_') and any(x in f for x in ['rush', 'night'])]
medium_features.extend(useful_binary)

print(f"Total medium priority features: {len(set(medium_features))}")

print(f"\nLOW PRIORITY FEATURES (Based on EDA analysis):")
print(f"• Long-term lags (6h+) - Section 8.2 showed 0% utility")
print(f"• Most squared/cubed features - likely redundant with originals")
print(f"• Nitrogen dioxide - high correlation with CO (multicollinearity)")
print(f"• Most binary indicators - may not add significant value")

print(f"\n📋 FINAL FEATURE SELECTION STRATEGY:")
print(f"1. Start with HIGH PRIORITY features ({len(set(high_priority))} features)")
print(f"2. Add MEDIUM PRIORITY features if model performance improves")
print(f"3. Use feature selection algorithms to fine-tune")
print(f"4. Monitor for multicollinearity and remove redundant features")
print(f"5. Validate feature importance through model training")


## 15. Secondary Pollutants → Future PM Analysis

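**Focus**: A previously missing analysis: how current values of the other pollutants (CO, NO2, O3, SO2, NO, NH3) predict future PM2.5/PM10 values. "Secondary" here means non-target pollutants, not secondary pollutants in the atmospheric-chemistry sense.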

### 15.1 Current Pollutants → Future PM Lead-Lag Analysis


In [16]:
# 15.1 Current Pollutants → Future PM Lead-Lag Analysis
print("=" * 60)
print("SECONDARY POLLUTANTS → FUTURE PM PREDICTION ANALYSIS")
print("=" * 60)

# Define secondary pollutants to analyze
secondary_pollutants = ['carbon_monoxide', 'nitrogen_dioxide', 'ozone', 'sulphur_dioxide', 'no', 'nh3']
available_pollutants = [col for col in secondary_pollutants if col in df.columns]
missing_pollutants = [col for col in secondary_pollutants if col not in df.columns]

print(f"Available secondary pollutants: {available_pollutants}")
print(f"Missing from dataset: {missing_pollutants}")

if not available_pollutants:
    print("⚠️ No secondary pollutants available for analysis")
else:
    # Define prediction horizons (same as Section 8.3 weather analysis)
    prediction_horizons = [1, 6, 12, 24, 48, 72]
    targets = ['pm2_5', 'pm10']
    
    print(f"\nAnalyzing {len(available_pollutants)} pollutants across {len(prediction_horizons)} prediction horizons...")
    
    # Create future PM values for correlation analysis
    print(f"\nCreating future PM targets for lead-lag analysis...")
    future_pm_data = {}
    
    for horizon in prediction_horizons:
        for target in targets:
            future_col = f'{target}_future_{horizon}h'
            future_pm_data[future_col] = df[target].shift(-horizon)
    
    # Combine current pollutants with future PM data
    pollutant_future_df = pd.concat([df[available_pollutants], pd.DataFrame(future_pm_data)], axis=1)
    
    print(f"✓ Created future PM targets for prediction horizons: {prediction_horizons}")
    print(f"✓ Total correlation pairs to analyze: {len(available_pollutants)} × {len(targets)} × {len(prediction_horizons)} = {len(available_pollutants) * len(targets) * len(prediction_horizons)}")


SECONDARY POLLUTANTS → FUTURE PM PREDICTION ANALYSIS
Available secondary pollutants: ['carbon_monoxide', 'nitrogen_dioxide', 'ozone', 'sulphur_dioxide', 'no', 'nh3']
Missing from dataset: []

Analyzing 6 pollutants across 6 prediction horizons...

Creating future PM targets for lead-lag analysis...
✓ Created future PM targets for prediction horizons: [1, 6, 12, 24, 48, 72]
✓ Total correlation pairs to analyze: 6 × 2 × 6 = 72
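
The future targets above rely on `shift(-horizon)`, which moves values up by `horizon` rows. A minimal sketch of that alignment on a made-up series follows. It assumes the rows are sorted by time and spaced exactly one hour apart, which this section does not check. `shift` is positional, so gaps or unsorted rows would break the row-to-hour mapping.

```python
import pandas as pd

# Toy hourly series standing in for df['pm2_5'] (values are made up).
pm = pd.Series([10, 20, 30, 40, 50], name="pm2_5")

horizon = 2
# After a negative shift, row t holds the value observed at t + horizon.
future = pm.shift(-horizon)

aligned = pd.DataFrame({"pm2_5_now": pm, f"pm2_5_future_{horizon}h": future})
print(aligned)

# The last `horizon` rows have no future value and become NaN.
# Series.corr ignores rows where either side is NaN.
n_missing = int(future.isna().sum())
print(f"Rows without a future value: {n_missing}")
```

With a 72-hour horizon, the last 72 rows drop out of each correlation, so longer horizons are estimated from slightly fewer observations than shorter ones.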


### 15.2 Pollutant Lead-Lag Correlation Analysis


In [17]:
# 15.2 Detailed Pollutant → Future PM Correlation Analysis
if available_pollutants:
    print("\n" + "=" * 60)
    print("POLLUTANT → FUTURE PM CORRELATION MATRIX")
    print("=" * 60)
    
    # Store all correlation results for comprehensive analysis
    pollutant_predictive_power = {}
    
    for pollutant in available_pollutants:
        print(f"\n{pollutant.upper()} → FUTURE PM PREDICTION POWER:")
        print("-" * 55)
        
        pollutant_predictive_power[pollutant] = {}
        
        for target in targets:
            print(f"\n{pollutant.upper()} → {target.upper()}:")
            print(f"{'Horizon':<10} {'Correlation':<12} {'Predictive Power':<18} {'Assessment'}")
            print("-" * 50)
            
            horizon_correlations = {}
            
            for horizon in prediction_horizons:
                future_col = f'{target}_future_{horizon}h'
                
                if future_col in pollutant_future_df.columns and pollutant in pollutant_future_df.columns:
                    # Calculate correlation between current pollutant and future PM
                    corr = pollutant_future_df[pollutant].corr(pollutant_future_df[future_col])
                    
                    if not pd.isna(corr):
                        horizon_correlations[horizon] = corr
                        
                        # Assess predictive power
                        abs_corr = abs(corr)
                        if abs_corr > 0.5:
                            assessment = "🔥 STRONG predictor"
                        elif abs_corr > 0.3:
                            assessment = "🟡 MODERATE predictor"
                        elif abs_corr > 0.2:
                            assessment = "🟢 WEAK predictor"
                        elif abs_corr > 0.1:
                            assessment = "⚪ MINIMAL predictor"
                        else:
                            assessment = "❌ NEGLIGIBLE predictor"
                        
                        # Format horizon display
                        if horizon < 24:
                            horizon_str = f"{horizon}h"
                        else:
                            days = horizon // 24
                            remaining_hours = horizon % 24
                            if remaining_hours == 0:
                                horizon_str = f"{days}d"
                            else:
                                horizon_str = f"{days}d{remaining_hours}h"
                        
                        print(f"{horizon_str:<10} {corr:>8.3f}    {abs_corr:>8.3f}          {assessment}")
                    else:
                        horizon_correlations[horizon] = 0
                        # Build the label first so the width spec applies to "<N>h"
                        print(f"{str(horizon) + 'h':<10} {'N/A':<12} {'N/A':<18} ❌ Data issue")
                else:
                    horizon_correlations[horizon] = 0
                    print(f"{str(horizon) + 'h':<10} {'N/A':<12} {'N/A':<18} ❌ Missing data")
            
            # Store results for this pollutant-target combination
            pollutant_predictive_power[pollutant][target] = horizon_correlations
            
            # Summary for this pollutant-target combination
            valid_corrs = [abs(corr) for corr in horizon_correlations.values() if corr != 0]
            if valid_corrs:
                max_corr = max(valid_corrs)
                avg_corr = np.mean(valid_corrs)
                
                # Find best prediction horizon
                best_horizon = max(horizon_correlations.items(), key=lambda x: abs(x[1]))[0]
                best_corr = horizon_correlations[best_horizon]
                
                print(f"\n  📊 SUMMARY for {pollutant.upper()} → {target.upper()}:")
                print(f"    Best prediction horizon: {best_horizon}h (corr: {best_corr:.3f})")
                print(f"    Max correlation: {max_corr:.3f}")
                print(f"    Average correlation: {avg_corr:.3f}")
                
                # Early warning capability assessment
                if best_horizon <= 6 and max_corr > 0.3:
                    print(f"    ✅ EXCELLENT early warning indicator (best horizon ≤6h, |corr| > 0.3)")
                elif best_horizon <= 12 and max_corr > 0.2:
                    print(f"    🟡 GOOD early warning indicator (best horizon ≤12h, |corr| > 0.2)")
                elif best_horizon <= 24 and max_corr > 0.1:
                    print(f"    🟢 FAIR early warning indicator (best horizon ≤24h, |corr| > 0.1)")
                else:
                    print(f"    ❌ LIMITED early warning value")
    
    print(f"\n🎯 POLLUTANT PREDICTIVE POWER SUMMARY:")
    print("=" * 50)
    
    # Rank pollutants by overall predictive power
    pollutant_rankings = {}
    
    for pollutant in available_pollutants:
        all_correlations = []
        for target in targets:
            if target in pollutant_predictive_power[pollutant]:
                correlations = [abs(corr) for corr in pollutant_predictive_power[pollutant][target].values() if corr != 0]
                all_correlations.extend(correlations)
        
        if all_correlations:
            avg_predictive_power = np.mean(all_correlations)
            max_predictive_power = max(all_correlations)
            pollutant_rankings[pollutant] = {
                'avg': avg_predictive_power,
                'max': max_predictive_power
            }
    
    # Sort by average predictive power
    ranked_pollutants = sorted(pollutant_rankings.items(), key=lambda x: x[1]['avg'], reverse=True)
    
    print(f"\nPOLLUTANT RANKING by Future PM Prediction Power:")
    print("-" * 50)
    
    for i, (pollutant, scores) in enumerate(ranked_pollutants, 1):
        avg_score = scores['avg']
        max_score = scores['max']
        
        if avg_score > 0.3:
            tier = "🔥 TIER 1 (Strong)"
        elif avg_score > 0.2:
            tier = "🟡 TIER 2 (Moderate)"
        elif avg_score > 0.1:
            tier = "🟢 TIER 3 (Weak)"
        else:
            tier = "❌ TIER 4 (Negligible)"
        
        print(f"  {i}. {pollutant.upper():<20}: Avg={avg_score:.3f}, Max={max_score:.3f} - {tier}")

else:
    print("⚠️ Cannot perform pollutant → future PM analysis - no pollutants available")



POLLUTANT → FUTURE PM CORRELATION MATRIX

CARBON_MONOXIDE → FUTURE PM PREDICTION POWER:
-------------------------------------------------------

CARBON_MONOXIDE → PM2_5:
Horizon    Correlation  Predictive Power   Assessment
--------------------------------------------------
1h            0.496       0.496          🟡 MODERATE predictor
6h            0.426       0.426          🟡 MODERATE predictor
12h           0.404       0.404          🟡 MODERATE predictor
1d            0.418       0.418          🟡 MODERATE predictor
2d            0.312       0.312          🟡 MODERATE predictor
3d            0.158       0.158          ⚪ MINIMAL predictor

  📊 SUMMARY for CARBON_MONOXIDE → PM2_5:
    Best prediction horizon: 1h (corr: 0.496)
    Max correlation: 0.496
    Average correlation: 0.369

CARBON_MONOXIDE → PM10:
Horizon    Correlation  Predictive Power   Assessment
--------------------------------------------------
1h           -0.176       0.176          ⚪ MINIMAL predictor
6h           -0.
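
The per-horizon loop above can also be collected into a single predictor × horizon table, which is easier to sort and plot than nested dictionaries. The sketch below is an illustrative variant, not the notebook's code. The helper name `lead_correlations` and the synthetic columns `x`, `noise`, and `y` are assumptions. Unlike the cell above, missing correlations stay as `NaN` instead of being stored as 0.

```python
import numpy as np
import pandas as pd


def lead_correlations(frame, predictors, target, horizons):
    """Pearson correlation of each predictor at time t with target at t + h."""
    table = {}
    for h in horizons:
        future = frame[target].shift(-h)
        table[f"{h}h"] = {p: frame[p].corr(future) for p in predictors}
    # Outer keys become columns (horizons), inner keys become rows (predictors).
    return pd.DataFrame(table)


# Synthetic data (made up): 'x' leads 'y' by exactly 3 steps; 'noise' is unrelated.
rng = np.random.default_rng(42)
x = rng.normal(size=300)
toy = pd.DataFrame({
    "x": x,
    "noise": rng.normal(size=300),
    "y": pd.Series(x).shift(3),
})
result = lead_correlations(toy, ["x", "noise"], "y", horizons=[1, 3, 6])
print(result.round(3))
```

In the synthetic data the correlation for `x` peaks at the 3h column, which is the lead built into `y`. Applied to the real data, the same table would hold the values printed per pollutant above.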

### 15.3 Pollutant vs Weather Predictive Power Comparison


In [19]:
# 15.3 Compare Pollutants vs Weather for Future PM Prediction
if available_pollutants:
    print("\n" + "=" * 60)
    print("POLLUTANTS vs WEATHER: FUTURE PM PREDICTION COMPARISON")
    print("=" * 60)
    
    # Weather features for comparison (same as Section 8.3)
    weather_features = ['temperature', 'humidity', 'pressure', 'wind_speed']
    available_weather = [col for col in weather_features if col in df.columns]
    
    print(f"Comparing:")
    print(f"• {len(available_pollutants)} pollutants: {available_pollutants}")
    print(f"• {len(available_weather)} weather features: {available_weather}")
    
    if available_weather:
        # Calculate weather → future PM correlations for comparison
        weather_predictive_power = {}
        
        for weather_feature in available_weather:
            weather_predictive_power[weather_feature] = {}
            
            for target in targets:
                horizon_correlations = {}
                
                for horizon in prediction_horizons:
                    future_col = f'{target}_future_{horizon}h'
                    
                    if future_col in pollutant_future_df.columns:
                        # Calculate correlation between current weather and future PM
                        corr = df[weather_feature].corr(pollutant_future_df[future_col])
                        horizon_correlations[horizon] = corr if not pd.isna(corr) else 0
                    else:
                        horizon_correlations[horizon] = 0
                
                weather_predictive_power[weather_feature][target] = horizon_correlations
        
        # Compare predictive power: Pollutants vs Weather
        print(f"\n📊 PREDICTIVE POWER COMPARISON:")
        print("=" * 50)
        
        for target in targets:
            print(f"\n{target.upper()} PREDICTION - TOP PREDICTORS by Average Correlation:")
            print("-" * 60)
            
            # Collect all predictors (pollutants + weather)
            all_predictors = {}
            
            # Add pollutant scores
            for pollutant in available_pollutants:
                if target in pollutant_predictive_power[pollutant]:
                    correlations = [abs(corr) for corr in pollutant_predictive_power[pollutant][target].values() if corr != 0]
                    if correlations:
                        all_predictors[f"{pollutant} (POLLUTANT)"] = np.mean(correlations)
            
            # Add weather scores
            for weather_feature in available_weather:
                if target in weather_predictive_power[weather_feature]:
                    correlations = [abs(corr) for corr in weather_predictive_power[weather_feature][target].values() if corr != 0]
                    if correlations:
                        all_predictors[f"{weather_feature} (WEATHER)"] = np.mean(correlations)
            
            # Sort by predictive power
            sorted_predictors = sorted(all_predictors.items(), key=lambda x: x[1], reverse=True)
            
            print(f"{'Rank':<4} {'Predictor':<35} {'Avg Correlation':<15} {'Strength'}")
            print("-" * 70)
            
            for i, (predictor, avg_corr) in enumerate(sorted_predictors, 1):
                predictor_type = predictor.split('(')[1].replace(')', '')
                predictor_name = predictor.split('(')[0].strip()
                
                if avg_corr > 0.3:
                    strength = "🔥 STRONG"
                elif avg_corr > 0.2:
                    strength = "🟡 MODERATE"
                elif avg_corr > 0.1:
                    strength = "🟢 WEAK"
                else:
                    strength = "❌ NEGLIGIBLE"
                
                print(f"{i:<4} {predictor_name:<35} {avg_corr:<15.3f} {strength}")
        
        # Summary insights
        print(f"\n🎯 KEY INSIGHTS:")
        print("=" * 30)
        
        # Count pollutants vs weather in top predictors
        strong_pollutants = []
        strong_weather = []
        
        for target in targets:
            all_predictors = {}
            
            # Add pollutant scores
            for pollutant in available_pollutants:
                if target in pollutant_predictive_power[pollutant]:
                    correlations = [abs(corr) for corr in pollutant_predictive_power[pollutant][target].values() if corr != 0]
                    if correlations:
                        avg_corr = np.mean(correlations)
                        if avg_corr > 0.2:  # Same cutoff as the MODERATE tier above
                            strong_pollutants.append(pollutant)
            
            # Add weather scores
            for weather_feature in available_weather:
                if target in weather_predictive_power[weather_feature]:
                    correlations = [abs(corr) for corr in weather_predictive_power[weather_feature][target].values() if corr != 0]
                    if correlations:
                        avg_corr = np.mean(correlations)
                        if avg_corr > 0.2:  # Same cutoff as the MODERATE tier above
                            strong_weather.append(weather_feature)
        
        strong_pollutants = list(set(strong_pollutants))
        strong_weather = list(set(strong_weather))
        
        print(f"• Pollutant predictors rated MODERATE or better (>0.2 avg |correlation|): {len(strong_pollutants)}")
        if strong_pollutants:
            print(f"  {strong_pollutants}")
        
        print(f"• Weather predictors rated MODERATE or better (>0.2 avg |correlation|): {len(strong_weather)}")
        if strong_weather:
            print(f"  {strong_weather}")
        
        # Recommendation (count-based; the two groups differ in size, so treat it as a rough guide)
        if len(strong_pollutants) > len(strong_weather):
            print(f"\n✅ RECOMMENDATION: More pollutants than weather features pass the 0.2 threshold; prioritize pollutants for future PM prediction")
        elif len(strong_weather) > len(strong_pollutants):
            print(f"\n✅ RECOMMENDATION: More weather features than pollutants pass the 0.2 threshold; prioritize weather for future PM prediction")
        else:
            print(f"\n✅ RECOMMENDATION: Pollutants and weather pass the 0.2 threshold equally often; treat both as equally important for future PM prediction")
        
        print(f"\n📋 MODELING IMPLICATIONS:")
        print(f"• Include both current pollutants AND weather for future PM prediction")
        print(f"• Pollutants provide chemical precursor information")
        print(f"• Weather provides atmospheric dispersion information")
        print(f"• Combined approach likely optimal for 72-hour forecasting")
    
    else:
        print("⚠️ No weather features available for comparison")

else:
    print("⚠️ Cannot perform comparison - no pollutants available")



POLLUTANTS vs WEATHER: FUTURE PM PREDICTION COMPARISON
Comparing:
• 6 pollutants: ['carbon_monoxide', 'nitrogen_dioxide', 'ozone', 'sulphur_dioxide', 'no', 'nh3']
• 4 weather features: ['temperature', 'humidity', 'pressure', 'wind_speed']

📊 PREDICTIVE POWER COMPARISON:

PM2_5 PREDICTION - TOP PREDICTORS by Average Correlation:
------------------------------------------------------------
Rank Predictor                           Avg Correlation Type
----------------------------------------------------------------------
1    pressure                            0.370           🔥 STRONG
2    carbon_monoxide                     0.369           🔥 STRONG
3    nh3                                 0.291           🟡 MODERATE
4    ozone                               0.286           🟡 MODERATE
5    sulphur_dioxide                     0.235           🟡 MODERATE
6    nitrogen_dioxide                    0.229           🟡 MODERATE
7    wind_speed                          0.218           🟡 MODERATE
8  