# Urban Air Quality Analysis for Sustainable Cities

## Project Overview
This notebook performs a comprehensive analysis of urban air quality data from cities across India to identify trends, patterns, and actionable insights for sustainable urban development.

### Key Objectives:
- Analyze air quality trends across major Indian cities
- Identify pollution patterns and seasonal variations
- Examine relationships between different pollutants
- Provide data-driven insights for policy recommendations

### Dataset Information:
- **Time Period**: 2022-2024 (3 years)
- **Cities Covered**: 10 major Indian cities
- **Pollutants Measured**: PM2.5, PM10, NO2, SO2, CO, O3
- **Metrics**: Air Quality Index (AQI)

## 1. Setup and Data Loading

Import necessary libraries and load the dataset.

In [None]:
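import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

# Plot styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# DataFrame display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Make sure the directory used by the plt.savefig calls below exists.
Path('../visualizations').mkdir(parents=True, exist_ok=True)

print("Libraries imported successfully!")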

In [None]:
df = pd.read_csv('../data/air_quality_data.csv')
print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
df.head(10)
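
Before exploring further, it can help to confirm that the file contains the columns the later cells rely on. The sketch below is illustrative only: the column list is inferred from the names used elsewhere in this notebook, `check_columns` is a hypothetical helper, and the small in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Column names inferred from how later cells use the data (an assumption).
EXPECTED_COLUMNS = ['Date', 'City', 'State', 'PM2.5', 'PM10',
                    'NO2', 'SO2', 'CO', 'O3', 'AQI']


def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are missing from `frame`."""
    return [col for col in expected if col not in frame.columns]


# Tiny in-memory stand-in for the real CSV, deliberately missing 'O3'.
sample_csv = io.StringIO(
    "Date,City,State,PM2.5,PM10,NO2,SO2,CO,AQI\n"
    "2022-01-01,Delhi,Delhi,180.5,290.1,60.2,12.3,1.8,350\n"
)
sample = pd.read_csv(sample_csv, parse_dates=['Date'])
missing_columns = check_columns(sample)
print(f"Missing columns: {missing_columns}")
```

Running `check_columns(df)` right after loading would surface a renamed or absent column early, rather than as a `KeyError` in a later cell.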

## 2. Initial Data Exploration

Understanding the structure and basic statistics of the dataset.

In [None]:
print("Dataset Information:")
print("="*50)
df.info()

In [None]:
print("Statistical Summary:")
print("="*50)
df.describe()

In [None]:
print("Cities in Dataset:")
print(df['City'].unique())
print(f"\nTotal Cities: {df['City'].nunique()}")
print(f"\nStates Covered:")
print(df['State'].unique())
print(f"\nTotal States: {df['State'].nunique()}")

In [None]:
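# Parse dates first so that min/max and the .dt accessors used later work.
df['Date'] = pd.to_datetime(df['Date'])
print("Date Range:")
print(f"Start Date: {df['Date'].min()}")
print(f"End Date: {df['Date'].max()}")
# Add 1 so that both the first and the last day are counted.
print(f"Total Days: {(df['Date'].max() - df['Date'].min()).days + 1}")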

## 3. Data Cleaning and Preprocessing

Handle missing values, outliers, and prepare data for analysis.

In [None]:
print("Missing Values Analysis:")
print("="*50)
missing_data = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percent
})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

In [None]:
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0]
if len(missing_data) > 0:
    # Create the figure only when there is something to plot.
    plt.figure(figsize=(12, 6))
    missing_data.plot(kind='bar', color='coral')
    plt.title('Missing Values by Column', fontsize=16, fontweight='bold')
    plt.xlabel('Columns', fontsize=12)
    plt.ylabel('Number of Missing Values', fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('../visualizations/missing_values.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No missing values found in the dataset!")

In [None]:
df_clean = df.copy()

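numeric_columns = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']
for col in numeric_columns:
    if df_clean[col].isnull().any():
        # Assign the result back. Calling fillna(inplace=True) on a column
        # selection is deprecated in recent pandas and may not update df_clean.
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())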

print(f"Original dataset shape: {df.shape}")
print(f"Cleaned dataset shape: {df_clean.shape}")
print(f"\nMissing values after cleaning: {df_clean.isnull().sum().sum()}")
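
The cell above fills gaps with a single median per pollutant across all cities. Because pollution levels differ widely between cities, one possible refinement is to fill each gap with the median of the same city. The sketch below shows that idea on a toy frame; the data and the `PM2.5_filled` column are made up for illustration and are not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two cities with very different typical PM2.5 levels.
toy = pd.DataFrame({
    'City': ['A', 'A', 'A', 'B', 'B', 'B'],
    'PM2.5': [100.0, np.nan, 120.0, 20.0, 30.0, np.nan],
})

# Median of each city's own readings, aligned to the original rows.
city_median = toy.groupby('City')['PM2.5'].transform('median')

# Fill with the city median first, then fall back to the overall median
# in case a city has no valid readings at all.
toy['PM2.5_filled'] = (
    toy['PM2.5'].fillna(city_median).fillna(toy['PM2.5'].median())
)
print(toy)
```

Here the gap in city A is filled with 110.0 and the gap in city B with 25.0, whereas a single global median would assign both the same value.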

In [None]:
df_clean['Year'] = df_clean['Date'].dt.year
df_clean['Month'] = df_clean['Date'].dt.month
df_clean['Month_Name'] = df_clean['Date'].dt.month_name()
df_clean['Day'] = df_clean['Date'].dt.day
df_clean['DayOfWeek'] = df_clean['Date'].dt.day_name()
df_clean['Quarter'] = df_clean['Date'].dt.quarter
df_clean['Season'] = df_clean['Month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn'
})

print("New features created:")
print(df_clean[['Date', 'Year', 'Month', 'Month_Name', 'Season', 'DayOfWeek']].head())
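
The section introduction mentions outliers, but the cells above do not treat them. A minimal sketch of one common approach, flagging values outside the interquartile-range fences, is shown below. The helper `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are assumptions for illustration. Flagging is used rather than dropping because extreme AQI readings can be genuine pollution episodes.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy AQI readings with one implausibly large value.
toy_aqi = pd.Series([80, 95, 100, 105, 110, 120, 900], name='AQI')
outliers = iqr_outlier_mask(toy_aqi)
print(toy_aqi[outliers])
```

Applied per pollutant (and ideally per city), the mask could be stored as an extra column so that later summaries can be computed with and without the flagged rows.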

## 4. Exploratory Data Analysis (EDA)

### 4.1 Overall Distribution of Pollutants

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
pollutants = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']

for idx, pollutant in enumerate(pollutants):
    row = idx // 3
    col = idx % 3
    
    axes[row, col].hist(df_clean[pollutant], bins=50, color='skyblue', edgecolor='black', alpha=0.7)
    axes[row, col].set_title(f'Distribution of {pollutant}', fontsize=14, fontweight='bold')
    axes[row, col].set_xlabel(pollutant, fontsize=11)
    axes[row, col].set_ylabel('Frequency', fontsize=11)
    axes[row, col].grid(alpha=0.3)

axes[2, 1].axis('off')
axes[2, 2].axis('off')

plt.tight_layout()
plt.savefig('../visualizations/pollutant_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.2 City-wise Air Quality Comparison

In [None]:
city_avg = df_clean.groupby('City')[['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']].mean().round(2)
city_avg_sorted = city_avg.sort_values('AQI', ascending=False)
print("Average Pollutant Levels by City:")
print("="*80)
city_avg_sorted

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

city_avg_sorted['AQI'].plot(kind='barh', ax=axes[0, 0], color='crimson', alpha=0.7)
axes[0, 0].set_title('Average AQI by City', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Average AQI', fontsize=11)
axes[0, 0].grid(alpha=0.3)

city_avg_sorted['PM2.5'].plot(kind='barh', ax=axes[0, 1], color='orange', alpha=0.7)
axes[0, 1].set_title('Average PM2.5 by City', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Average PM2.5 (μg/m³)', fontsize=11)
axes[0, 1].grid(alpha=0.3)

city_avg_sorted['NO2'].plot(kind='barh', ax=axes[1, 0], color='steelblue', alpha=0.7)
axes[1, 0].set_title('Average NO2 by City', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Average NO2 (μg/m³)', fontsize=11)
axes[1, 0].grid(alpha=0.3)

city_avg_sorted['PM10'].plot(kind='barh', ax=axes[1, 1], color='forestgreen', alpha=0.7)
axes[1, 1].set_title('Average PM10 by City', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Average PM10 (μg/m³)', fontsize=11)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../visualizations/city_wise_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.3 Temporal Analysis - Trends Over Time

In [None]:
monthly_avg = df_clean.groupby(['Year', 'Month'])['AQI'].mean().reset_index()
monthly_avg['Date'] = pd.to_datetime(monthly_avg[['Year', 'Month']].assign(day=1))

plt.figure(figsize=(16, 6))
plt.plot(monthly_avg['Date'], monthly_avg['AQI'], marker='o', linewidth=2, markersize=6, color='darkred')
plt.title('Monthly Average AQI Trend (2022-2024)', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('../visualizations/monthly_aqi_trend.png', dpi=300, bbox_inches='tight')
plt.show()
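
The same monthly aggregation can also be expressed with a datetime index and `resample`, which avoids rebuilding a date column from `Year` and `Month`. The sketch below uses a synthetic daily series, not the project data, purely to show the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for illustration only.
dates = pd.date_range('2022-01-01', '2022-03-31', freq='D')
toy = pd.DataFrame({'Date': dates, 'AQI': np.arange(len(dates), dtype=float)})

# 'MS' labels each monthly bin with the first day of that month.
monthly = toy.set_index('Date')['AQI'].resample('MS').mean()
print(monthly)
```

The resulting index is already a `DatetimeIndex`, so it can be passed straight to `plt.plot`.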

In [None]:
top_cities = city_avg_sorted.head(5).index
df_top_cities = df_clean[df_clean['City'].isin(top_cities)]

monthly_city = df_top_cities.groupby(['Year', 'Month', 'City'])['AQI'].mean().reset_index()
monthly_city['Date'] = pd.to_datetime(monthly_city[['Year', 'Month']].assign(day=1))

plt.figure(figsize=(16, 8))
for city in top_cities:
    city_data = monthly_city[monthly_city['City'] == city]
    plt.plot(city_data['Date'], city_data['AQI'], marker='o', linewidth=2, label=city)

plt.title('Monthly AQI Trends - Top 5 Most Polluted Cities', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('../visualizations/top_cities_aqi_trend.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.4 Seasonal Analysis

In [None]:
seasonal_avg = df_clean.groupby('Season')[['PM2.5', 'PM10', 'NO2', 'AQI']].mean().round(2)
season_order = ['Winter', 'Spring', 'Summer', 'Autumn']
seasonal_avg = seasonal_avg.reindex(season_order)
print("Average Pollutant Levels by Season:")
print("="*60)
seasonal_avg
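
The `Season` feature used here follows a Northern Hemisphere meteorological calendar, in which "Summer" (June to August) overlaps with the southwest monsoon over much of India. As a possible alternative, not the mapping this notebook uses, the sketch below groups months into the seasons conventionally used by the India Meteorological Department. `IMD_SEASONS` and `imd_season` are hypothetical names introduced for this example.

```python
import pandas as pd

# Month-to-season mapping following the conventional IMD seasons
# (winter, pre-monsoon, southwest monsoon, post-monsoon).
IMD_SEASONS = {
    1: 'Winter', 2: 'Winter',
    3: 'Pre-monsoon', 4: 'Pre-monsoon', 5: 'Pre-monsoon',
    6: 'Monsoon', 7: 'Monsoon', 8: 'Monsoon', 9: 'Monsoon',
    10: 'Post-monsoon', 11: 'Post-monsoon', 12: 'Post-monsoon',
}


def imd_season(dates):
    """Map a datetime Series to IMD-style season labels."""
    return dates.dt.month.map(IMD_SEASONS)


toy_dates = pd.Series(
    pd.to_datetime(['2023-01-15', '2023-04-10', '2023-07-20', '2023-11-05'])
)
labels = imd_season(toy_dates)
print(labels.tolist())
```

Grouping by such labels would separate monsoon washout from pre-monsoon heat, which the four-season mapping folds into "Spring" and "Summer".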

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

seasonal_avg['PM2.5'].plot(kind='bar', ax=axes[0, 0], color=['skyblue', 'lightgreen', 'coral', 'gold'])
axes[0, 0].set_title('Average PM2.5 by Season', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('PM2.5 (μg/m³)', fontsize=11)
axes[0, 0].set_xticklabels(season_order, rotation=0)
axes[0, 0].grid(alpha=0.3)

seasonal_avg['PM10'].plot(kind='bar', ax=axes[0, 1], color=['skyblue', 'lightgreen', 'coral', 'gold'])
axes[0, 1].set_title('Average PM10 by Season', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('PM10 (μg/m³)', fontsize=11)
axes[0, 1].set_xticklabels(season_order, rotation=0)
axes[0, 1].grid(alpha=0.3)

seasonal_avg['NO2'].plot(kind='bar', ax=axes[1, 0], color=['skyblue', 'lightgreen', 'coral', 'gold'])
axes[1, 0].set_title('Average NO2 by Season', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('NO2 (μg/m³)', fontsize=11)
axes[1, 0].set_xticklabels(season_order, rotation=0)
axes[1, 0].grid(alpha=0.3)

seasonal_avg['AQI'].plot(kind='bar', ax=axes[1, 1], color=['skyblue', 'lightgreen', 'coral', 'gold'])
axes[1, 1].set_title('Average AQI by Season', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('AQI', fontsize=11)
axes[1, 1].set_xticklabels(season_order, rotation=0)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../visualizations/seasonal_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
seasonal_city = df_clean.groupby(['City', 'Season'])['AQI'].mean().reset_index()
seasonal_pivot = seasonal_city.pivot(index='City', columns='Season', values='AQI')[season_order]
sns.heatmap(seasonal_pivot, annot=True, fmt='.1f', cmap='YlOrRd', cbar_kws={'label': 'Average AQI'})
plt.title('City-wise Seasonal AQI Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Season', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.tight_layout()
plt.savefig('../visualizations/seasonal_city_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.5 Day of Week Analysis

In [None]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_avg = df_clean.groupby('DayOfWeek')[['PM2.5', 'NO2', 'AQI']].mean().reindex(day_order).round(2)
print("Average Pollutant Levels by Day of Week:")
print("="*60)
dow_avg

In [None]:
plt.figure(figsize=(14, 6))
x_pos = np.arange(len(day_order))
width = 0.25

plt.bar(x_pos - width, dow_avg['PM2.5'], width, label='PM2.5', color='coral')
plt.bar(x_pos, dow_avg['NO2'], width, label='NO2', color='steelblue')
plt.bar(x_pos + width, dow_avg['AQI'] / 2, width, label='AQI / 2 (scaled to share the axis)', color='forestgreen')

plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Mean value (μg/m³ for PM2.5 and NO2; AQI halved)', fontsize=12)
plt.title('Weekday vs Weekend Pollution Patterns', fontsize=16, fontweight='bold')
plt.xticks(x_pos, day_order, rotation=45)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('../visualizations/weekday_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
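
A bar chart alone cannot show whether a weekday/weekend gap is larger than chance variation. One way to check, sketched below under stated assumptions, is a permutation test on the difference in means. The function `permutation_test_mean_diff` and the synthetic samples are hypothetical. Daily readings are also autocorrelated, so the result should be read as a rough indication rather than a rigorous test.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic samples standing in for weekday and weekend AQI values.
rng = np.random.default_rng(42)
weekday = rng.normal(loc=160, scale=20, size=200)
weekend = rng.normal(loc=150, scale=20, size=80)
diff, p = permutation_test_mean_diff(weekday, weekend)
print(f"Observed difference: {diff:.2f}, permutation p-value: {p:.4f}")
```

With the notebook's data, the two inputs would be the `AQI` values for rows whose `DayOfWeek` is Monday to Friday and for rows whose `DayOfWeek` is Saturday or Sunday.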

## 5. Correlation Analysis

Examining relationships between different pollutants.

In [None]:
correlation_matrix = df_clean[['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']].corr()
print("Correlation Matrix:")
print("="*80)
correlation_matrix
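
`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association and can understate a relationship that is monotonic but curved. As a complementary check, the sketch below compares Pearson and Spearman (rank) correlation on a toy example; the data are synthetic and chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# A perfectly monotonic but nonlinear relationship.
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({'PM2.5': x, 'AQI': x ** 3})

pearson = toy.corr(method='pearson').loc['PM2.5', 'AQI']
spearman = toy.corr(method='spearman').loc['PM2.5', 'AQI']
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman is exactly 1 here because the ranks agree perfectly, while Pearson is lower. Computing both on the pollutant columns is a quick way to spot skewed or nonlinear relationships.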

In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Air Pollutants', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('../visualizations/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

axes[0, 0].scatter(df_clean['PM2.5'], df_clean['PM10'], alpha=0.3, color='coral')
axes[0, 0].set_xlabel('PM2.5 (μg/m³)', fontsize=11)
axes[0, 0].set_ylabel('PM10 (μg/m³)', fontsize=11)
axes[0, 0].set_title('PM2.5 vs PM10', fontsize=14, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

axes[0, 1].scatter(df_clean['PM2.5'], df_clean['AQI'], alpha=0.3, color='steelblue')
axes[0, 1].set_xlabel('PM2.5 (μg/m³)', fontsize=11)
axes[0, 1].set_ylabel('AQI', fontsize=11)
axes[0, 1].set_title('PM2.5 vs AQI', fontsize=14, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

axes[1, 0].scatter(df_clean['NO2'], df_clean['AQI'], alpha=0.3, color='forestgreen')
axes[1, 0].set_xlabel('NO2 (μg/m³)', fontsize=11)
axes[1, 0].set_ylabel('AQI', fontsize=11)
axes[1, 0].set_title('NO2 vs AQI', fontsize=14, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

axes[1, 1].scatter(df_clean['CO'], df_clean['NO2'], alpha=0.3, color='purple')
axes[1, 1].set_xlabel('CO (mg/m³)', fontsize=11)
axes[1, 1].set_ylabel('NO2 (μg/m³)', fontsize=11)
axes[1, 1].set_title('CO vs NO2', fontsize=14, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../visualizations/scatter_plots.png', dpi=300, bbox_inches='tight')
plt.show()

## 6. Advanced Analysis

### 6.1 Year-over-Year Comparison

In [None]:
yearly_avg = df_clean.groupby('Year')[['PM2.5', 'PM10', 'NO2', 'AQI']].mean().round(2)
print("Year-over-Year Average Pollutant Levels:")
print("="*60)
yearly_avg

In [None]:
yearly_change = yearly_avg.pct_change() * 100
print("\nYear-over-Year Percentage Change:")
print("="*60)
yearly_change.round(2)
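
For reference, `pct_change()` computes `(current - previous) / previous` for each row, so the first year has no value. A small worked example with made-up yearly averages:

```python
import pandas as pd

# Made-up yearly AQI averages, for illustration only.
toy_yearly = pd.Series([200.0, 180.0, 189.0], index=[2022, 2023, 2024], name='AQI')

# (180 - 200) / 200 = -10%, then (189 - 180) / 180 = +5%.
change = toy_yearly.pct_change() * 100
print(change.round(2))
```

A negative value means the average fell relative to the previous year, which for AQI indicates an improvement.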

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

yearly_avg['PM2.5'].plot(kind='bar', ax=axes[0, 0], color=['skyblue', 'lightgreen', 'coral'])
axes[0, 0].set_title('Average PM2.5 by Year', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('PM2.5 (μg/m³)', fontsize=11)
axes[0, 0].set_xticklabels(yearly_avg.index, rotation=0)
axes[0, 0].grid(alpha=0.3)

yearly_avg['PM10'].plot(kind='bar', ax=axes[0, 1], color=['skyblue', 'lightgreen', 'coral'])
axes[0, 1].set_title('Average PM10 by Year', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('PM10 (μg/m³)', fontsize=11)
axes[0, 1].set_xticklabels(yearly_avg.index, rotation=0)
axes[0, 1].grid(alpha=0.3)

yearly_avg['NO2'].plot(kind='bar', ax=axes[1, 0], color=['skyblue', 'lightgreen', 'coral'])
axes[1, 0].set_title('Average NO2 by Year', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('NO2 (μg/m³)', fontsize=11)
axes[1, 0].set_xticklabels(yearly_avg.index, rotation=0)
axes[1, 0].grid(alpha=0.3)

yearly_avg['AQI'].plot(kind='bar', ax=axes[1, 1], color=['skyblue', 'lightgreen', 'coral'])
axes[1, 1].set_title('Average AQI by Year', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('AQI', fontsize=11)
axes[1, 1].set_xticklabels(yearly_avg.index, rotation=0)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../visualizations/yearly_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.2 AQI Category Analysis

In [None]:
def aqi_category(aqi):
    if aqi <= 50:
        return 'Good'
    elif aqi <= 100:
        return 'Satisfactory'
    elif aqi <= 200:
        return 'Moderate'
    elif aqi <= 300:
        return 'Poor'
    elif aqi <= 400:
        return 'Very Poor'
    else:
        return 'Severe'

df_clean['AQI_Category'] = df_clean['AQI'].apply(aqi_category)

aqi_category_counts = df_clean['AQI_Category'].value_counts()
print("Distribution of AQI Categories:")
print("="*60)
print(aqi_category_counts)
print(f"\nPercentage Distribution:")
print((aqi_category_counts / len(df_clean) * 100).round(2))
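
The same banding can be written without a row-by-row `apply` by using `pd.cut`. The sketch below is an alternative under the same thresholds as `aqi_category`; `AQI_BINS` and `AQI_LABELS` are names introduced for this example, and the toy values sit on the band edges to show how boundaries are handled.

```python
import numpy as np
import pandas as pd

AQI_BINS = [-np.inf, 50, 100, 200, 300, 400, np.inf]
AQI_LABELS = ['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe']

toy_aqi = pd.Series([35, 50, 51, 200, 201, 400, 401])

# right=True (the default) makes each bin include its upper edge,
# matching the `<=` comparisons in aqi_category.
categories = pd.cut(toy_aqi, bins=AQI_BINS, labels=AQI_LABELS)
print(categories.tolist())
```

The result is an ordered categorical, so sorting follows severity rather than alphabetical order.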

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

category_order = ['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe']
aqi_counts = df_clean['AQI_Category'].value_counts().reindex(category_order, fill_value=0)
colors = ['green', 'yellowgreen', 'yellow', 'orange', 'red', 'darkred']

aqi_counts.plot(kind='bar', ax=axes[0], color=colors)
axes[0].set_title('Distribution of AQI Categories', fontsize=14, fontweight='bold')
axes[0].set_xlabel('AQI Category', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_xticklabels(category_order, rotation=45)
axes[0].grid(alpha=0.3)

aqi_counts.plot(kind='pie', ax=axes[1], colors=colors, autopct='%1.1f%%', startangle=90)
axes[1].set_title('AQI Category Distribution (%)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.savefig('../visualizations/aqi_categories.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
city_aqi_category = pd.crosstab(df_clean['City'], df_clean['AQI_Category'])
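# reindex (rather than plain column selection) avoids a KeyError when a
# category never occurs in the data; missing categories become zero counts.
city_aqi_category = city_aqi_category.reindex(columns=category_order, fill_value=0)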
city_aqi_category = city_aqi_category.loc[city_avg_sorted.index]

city_aqi_category.plot(kind='barh', stacked=True, figsize=(14, 8), color=colors)
plt.title('AQI Category Distribution by City', fontsize=16, fontweight='bold')
plt.xlabel('Number of Days', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.legend(title='AQI Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('../visualizations/city_aqi_categories.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.3 State-wise Analysis

In [None]:
state_avg = df_clean.groupby('State')[['PM2.5', 'PM10', 'NO2', 'AQI']].mean().round(2)
state_avg_sorted = state_avg.sort_values('AQI', ascending=False)
print("Average Pollutant Levels by State:")
print("="*80)
state_avg_sorted

In [None]:
plt.figure(figsize=(14, 8))
state_avg_sorted['AQI'].plot(kind='barh', color='crimson', alpha=0.7)
plt.title('Average AQI by State', fontsize=16, fontweight='bold')
plt.xlabel('Average AQI', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('../visualizations/state_wise_aqi.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Key Insights and Findings

### Summary Statistics

In [None]:
print("="*80)
print("KEY FINDINGS FROM URBAN AIR QUALITY ANALYSIS")
print("="*80)

print("\n1. MOST POLLUTED CITIES:")
print("-" * 40)
for i, (city, aqi) in enumerate(city_avg_sorted['AQI'].head(5).items(), 1):
    print(f"   {i}. {city}: AQI {aqi:.2f}")

print("\n2. LEAST POLLUTED CITIES:")
print("-" * 40)
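# city_avg_sorted is in descending AQI order, so reverse the tail to list the cleanest city first.
for i, (city, aqi) in enumerate(city_avg_sorted['AQI'].tail(5)[::-1].items(), 1):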
    print(f"   {i}. {city}: AQI {aqi:.2f}")

print("\n3. SEASONAL PATTERNS:")
print("-" * 40)
worst_season = seasonal_avg['AQI'].idxmax()
best_season = seasonal_avg['AQI'].idxmin()
print(f"   Worst Air Quality: {worst_season} (AQI: {seasonal_avg.loc[worst_season, 'AQI']:.2f})")
print(f"   Best Air Quality: {best_season} (AQI: {seasonal_avg.loc[best_season, 'AQI']:.2f})")

print("\n4. POLLUTANT CORRELATIONS:")
print("-" * 40)
print(f"   PM2.5 vs PM10: {correlation_matrix.loc['PM2.5', 'PM10']:.3f}")
print(f"   PM2.5 vs AQI: {correlation_matrix.loc['PM2.5', 'AQI']:.3f}")
print(f"   NO2 vs AQI: {correlation_matrix.loc['NO2', 'AQI']:.3f}")

print("\n5. WEEKDAY VS WEEKEND:")
print("-" * 40)
weekday_aqi = df_clean[df_clean['DayOfWeek'].isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])]['AQI'].mean()
weekend_aqi = df_clean[df_clean['DayOfWeek'].isin(['Saturday', 'Sunday'])]['AQI'].mean()
print(f"   Average Weekday AQI: {weekday_aqi:.2f}")
print(f"   Average Weekend AQI: {weekend_aqi:.2f}")
# Keep the sign so the output shows which group is higher.
print(f"   Difference (weekday - weekend): {weekday_aqi - weekend_aqi:+.2f}")

print("\n6. YEAR-OVER-YEAR TRENDS:")
print("-" * 40)
for year in yearly_avg.index:
    print(f"   {year}: AQI {yearly_avg.loc[year, 'AQI']:.2f}")

print("\n7. AQI CATEGORY DISTRIBUTION:")
print("-" * 40)
for category in category_order:
    if category in aqi_category_counts.index:
        count = aqi_category_counts[category]
        percentage = (count / len(df_clean) * 100)
        print(f"   {category}: {count} days ({percentage:.1f}%)")

print("\n" + "="*80)

## 8. Conclusions and Recommendations

This analysis of air quality data from 10 major Indian cities over three years (2022-2024) supports the following conclusions:

### Key Conclusions:

1. **Geographic Disparities**: Northern Indian cities show consistently higher pollution levels than southern cities; Delhi, Lucknow, and Jaipur are among the most polluted.

2. **Seasonal Patterns**: Winter months (December to February) show markedly higher pollution levels, a pattern consistent with temperature inversions and reduced dispersion of pollutants, although this notebook includes no meteorological data to confirm the cause. Summer (June to August in the season mapping used here, which overlaps with the monsoon in much of India) shows relatively better air quality.

3. **Particulate Matter Dominance**: PM2.5 and PM10 correlate strongly with the overall AQI, which suggests that they are the primary contributors to poor AQI scores.

4. **Weekday vs Weekend**: Slightly lower pollution levels on weekends suggest that vehicular and industrial emissions play a role in urban air quality. The difference is small and was not tested for statistical significance.

5. **Pollutant Relationships**: PM2.5 and PM10 are strongly positively correlated, which is consistent with shared sources such as vehicular emissions, construction dust, and industrial activity.

### Recommendations for Policy Makers:

1. **Seasonal Action Plans**:
   - Implement stricter emission controls during winter months
   - Enhance monitoring and rapid-response systems for high-pollution days

2. **City-Specific Interventions**:
   - Focus resources on highly polluted cities (Delhi, Lucknow, Jaipur)
   - Study and replicate best practices from cleaner cities (Chennai, Bangalore)

3. **Traffic Management**:
   - Promote public transportation and electric vehicles
   - Implement congestion pricing in high-pollution zones

4. **Industrial Regulations**:
   - Enforce stricter emission norms for industries
   - Incentivize adoption of cleaner technologies

5. **Public Awareness**:
   - Provide real-time AQI monitoring and public advisories
   - Run health awareness campaigns during high-pollution periods

6. **Green Infrastructure**:
   - Increase urban green spaces and tree cover
   - Promote rooftop gardens and vertical forests

### Future Research Directions:

1. Integration of meteorological data to understand weather's impact on air quality
2. Analysis of specific emission sources and their contribution to pollution
3. Health impact assessment linking AQI data with respiratory disease incidence
4. Cost-benefit analysis of various intervention strategies
5. Development of predictive models for air quality forecasting

## 9. Export Summary Report

In [None]:
summary_report = {
    'Analysis Period': f"{df_clean['Date'].min().strftime('%Y-%m-%d')} to {df_clean['Date'].max().strftime('%Y-%m-%d')}",
    'Total Records': len(df_clean),
    'Cities Analyzed': df_clean['City'].nunique(),
    'States Covered': df_clean['State'].nunique(),
    'Top 3 Most Polluted Cities': ', '.join(city_avg_sorted.head(3).index),
    'Top 3 Cleanest Cities': ', '.join(city_avg_sorted.tail(3).index[::-1]),
    'Overall Average AQI': f"{df_clean['AQI'].mean():.2f}",
    'Overall Average PM2.5': f"{df_clean['PM2.5'].mean():.2f}",
    'Overall Average PM10': f"{df_clean['PM10'].mean():.2f}",
    'Worst Season': worst_season,
    'Best Season': best_season
}

summary_df = pd.DataFrame(list(summary_report.items()), columns=['Metric', 'Value'])
summary_df.to_csv('../data/analysis_summary.csv', index=False)
print("\nAnalysis Summary Report:")
print("="*80)
print(summary_df.to_string(index=False))
print("\nSummary report exported to: ../data/analysis_summary.csv")

## Analysis Complete!

This comprehensive analysis has covered:
- Data loading and preprocessing
- Exploratory data analysis
- Temporal and seasonal analysis
- City and state-wise comparisons
- Correlation analysis
- AQI categorization and distribution
- Key insights and recommendations

All visualizations have been saved to the `visualizations/` directory.

For questions or further analysis, please refer to the project README.