In [None]:
# Source-Driven Air Pollution Analysis and Future Risk Mapping for Pune City

Introduction: Air pollution has rapidly become a part of everyday life in many Indian cities. **Pune,** which was once known for its clean air and pleasant weather, is now facing **increasing air quality issues due to urban development, rising vehicle numbers, ongoing construction work, and changing climate conditions.** Many cities face air pollution, but this project focuses on Pune.


Aim: The aim of this project is to perform a source-driven analysis of air pollution in Pune by integrating air quality, weather, and traffic data, and to use machine learning to predict future pollution risk if current trends continue.

Import all Libraries

In [None]:
import pandas as pd
import numpy as np
import random
random.seed(4)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')




Load Datasets

In [None]:
#Pune Air Quality data
AQI_df = pd.read_csv("aqi_data_pune_2017_to_2024.csv")

#Weather Data
weather_df = pd.read_csv("india_2000_2024_daily_weather.csv")

#Traffic Data
traffic_df = pd.read_csv("Traffic.csv")


In [None]:
print("AQI Shape:", AQI_df.shape)
print("Weather Shape:", weather_df.shape)
print("Traffic Shape:", traffic_df.shape)

AQI_df.head()

In [None]:
weather_df.head()

Extract Pune weather data from above

In [None]:
weather_df['city'].unique()

In [None]:
pune_weather_df = weather_df[
    weather_df['city'].str.lower() == 'pune'
].copy()

In [None]:
print("Pune Weather Shape:", pune_weather_df.shape)
pune_weather_df.head()

Due to the lack of publicly available city-level traffic datasets for Pune, a generalized traffic prediction dataset was used to represent traffic intensity patterns typically observed in urban environments.

In [None]:
traffic_df.head()
traffic_df.columns

In [None]:
#I'll merge traffic data with AQI by date, not city.
traffic_df['Date']= pd.to_datetime(traffic_df['Date'])

In [None]:
# creating traffic intensity index
traffic_df['traffic_index']= traffic_df['Total']/ traffic_df['Total'].max()

In [None]:
traffic_df['Traffic Situation']= traffic_df['Traffic Situation'].astype('category').cat.codes

In [None]:
traffic_df[['Date', 'Total', 'traffic_index']].head()

As the output above shows, the `Date` column was not parsed into usable dates, so the traffic data needs cleaning.

In [None]:
#check raw values first
traffic_df[['Date', 'Time']].head(10)

In [None]:
traffic_df['Date'].dtype

In [None]:
traffic_df = traffic_df.drop(columns=['Date'])

In [None]:
traffic_df['Time'] = pd.to_datetime(traffic_df['Time'], format='%I:%M:%S %p')
traffic_df['hour'] = traffic_df['Time'].dt.hour

In [None]:
# Average the normalized index (0-1 scale) rather than the raw vehicle totals
daily_traffic_index = traffic_df['traffic_index'].mean()
daily_traffic_index

In [None]:
#Handle missing data from AQI dataset
AQI_df.isna().sum()


In [None]:
AQI_df['AQI'] = AQI_df['AQI'].interpolate()

In [None]:
AQI_df.isna().sum()

In [None]:
# fillna(method=...) is deprecated in recent pandas; bfill()/ffill() do the same job
AQI_df['AQI'] = AQI_df['AQI'].bfill().ffill()

In [None]:
AQI_df.isna().sum()

Merging by date aligns each day's weather features with that day's target value (AQI), which is critical for accurate modeling and for interpreting how weather relates to AQI.
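
A minimal sketch of how the date alignment could be checked before relying on the merged table. The two small frames below are made-up stand-ins for the AQI and weather data. `validate="one_to_one"` makes pandas raise an error if either side has duplicate dates, and `indicator=True` adds a `_merge` column showing which dates found no partner.

```python
import pandas as pd

# Toy frames standing in for the AQI and weather tables (values are made up)
aqi = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "AQI": [120, 135, 98],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-04"]),
    "wind_speed_10m_max": [12.5, 9.8, 14.1],
})

# Outer merge keeps every date from both sides so mismatches stay visible
check = pd.merge(
    aqi, weather,
    left_on="Date", right_on="date",
    how="outer", validate="one_to_one", indicator=True,
)
matched = check[check["_merge"] == "both"]
unmatched = check[check["_merge"] != "both"]
print(len(matched), "matched days,", len(unmatched), "unmatched days")
```

An inner merge silently drops the unmatched days, so a check like this shows how many rows are lost in the join.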

In [None]:
# now merge AQI + Pune Weather by Date
#Ensure Pune weather date is datetime
AQI_df['Date'] = pd.to_datetime(AQI_df['Date'])
pune_weather_df['date']=pd.to_datetime(pune_weather_df['date'])

#merge AQI with pune weather
merged_df = pd.merge(AQI_df, pune_weather_df, left_on='Date', right_on='date', how='inner')
merged_df.head()



In [None]:
#AQI numeric values are already cleaned, we can recompute the AQI category:
def categorize_aqi(aqi):
    if aqi <= 50:
        return 'Good'
    elif aqi <= 100:
        return 'Satisfactory'
    elif aqi <= 200:
        return 'Moderate'
    elif aqi <= 300:
        return 'Poor'
    elif aqi <= 400:
        return 'Very Poor'
    else:
        return 'Severe'

merged_df['AQI_Category_Indian_GOV'] = merged_df['AQI'].apply(categorize_aqi)


In [None]:
merged_df['AQI_Category_Indian_GOV'].value_counts()
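
The same banding can also be expressed without a Python-level function. The sketch below is an alternative, not the method used above: it uses `pd.cut` with the same thresholds on a few made-up AQI values. Because `pd.cut` closes each bin on the right by default, the bins match the `<=` comparisons in `categorize_aqi`.

```python
import numpy as np
import pandas as pd

# Same thresholds and labels as categorize_aqi
bins = [-np.inf, 50, 100, 200, 300, 400, np.inf]
labels = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

sample_aqi = pd.Series([42, 50, 51, 180, 250, 399, 401])  # made-up values
categories = pd.cut(sample_aqi, bins=bins, labels=labels)  # right-closed bins
print(categories.tolist())
```

The result is an ordered categorical, so the categories sort from Good to Severe rather than alphabetically.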


Add Traffic Index

In [None]:
#assuming daily traffic index is a constant value calculated earlier
merged_df['traffic_index']= daily_traffic_index
merged_df.head()

In [None]:
merged_df.info()

In [None]:
merged_df.describe()

EDA

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(merged_df['AQI'], bins=30, kde=True, color='crimson')
plt.title('Distribution of AQI in Pune(2017-2024)')
plt.xlabel('AQI')
plt.ylabel('Frequency')
plt.show()

In [None]:
#AQI trend by month
plt.figure(figsize=(12,6))
sns.boxplot(x='Month', y='AQI', data=merged_df, palette='coolwarm')
plt.title('Monthly AQI Distribution in Pune (2017-2024)')
plt.xlabel('Month')
plt.ylabel('AQI')
plt.show()


Shows which months are most polluted

Detects seasonal trends (summer, winter, post-monsoon pollution spikes)
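
The boxplot can be backed by a small numeric summary. The sketch below ranks months by median AQI. It runs on synthetic data with an artificial winter peak, so the output only illustrates the technique and says nothing about Pune.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()

# Synthetic AQI with a peak around the turn of the year (not real data)
aqi = 120 + 60 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25)
demo = pd.DataFrame({"Month": dates.month, "AQI": aqi})

# Median AQI per month, most polluted months first
monthly_median = demo.groupby("Month")["AQI"].median().sort_values(ascending=False)
print(monthly_median.head(3))
```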


Correlation Heatmap

In [None]:
#AQI correlates with weather and traffic
plt.figure(figsize=(12,8))
sns.heatmap(merged_df[['AQI','temperature_2m_max','temperature_2m_min',
                       'apparent_temperature_max','apparent_temperature_min',
                       'precipitation_sum','rain_sum','wind_speed_10m_max',
                       'wind_gusts_10m_max','wind_direction_10m_dominant',
                       'traffic_index']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap: AQI vs Weather & Traffic')
plt.show()


Key Observations:
1) Strongest negative correlations with AQI:

temperature_2m_min (-0.47) → lower minimum temperatures coincide with higher pollution

apparent_temperature_min (-0.51) → colder days coincide with higher AQI

2) Moderate negative correlations:

wind_speed_10m_max (-0.37), wind_gusts_10m_max (-0.35) → stronger winds coincide with lower AQI, consistent with pollutant dispersion

3) Weaker negative correlations:

precipitation_sum (-0.28), rain_sum (-0.28) → rain slightly reduces pollution

traffic_index (-0.33) → treat with caution: the traffic proxy is a single constant value for every day, and a column with no variation cannot have a meaningful correlation with AQI

4) Positive correlations among weather features:

temperature_2m_max vs apparent_temperature_max (0.89) → expected, since apparent temperature is derived from air temperature

rain_sum vs precipitation_sum (1.0) → effectively the same measurement in this dataset
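
Because the traffic proxy is one constant applied to every row, it helps to see how Pearson correlation behaves for a zero-variance column. The sketch below uses synthetic AQI values and an arbitrary constant of 0.5. Both are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AQI": rng.normal(120, 30, size=100),  # synthetic values
    "traffic_index": 0.5,                  # same value on every row
})

corr = demo.corr()
print(corr)
print("traffic_index std:", demo["traffic_index"].std())
```

The correlation with the constant column comes out as NaN, since the formula divides by a standard deviation of zero. A numeric value for `traffic_index` in the heatmap would therefore suggest the plotted column was not actually constant, which is worth verifying.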



In [None]:
#Visualize AQI over time
plt.figure(figsize=(15,6))
sns.lineplot(x='Date', y='AQI', data=merged_df, color='crimson')
plt.title('Daily AQI Trend in Pune (2017-2024)')
plt.xlabel('Date')
plt.ylabel('AQI')
plt.show()


Time based AQI trend

In [None]:
plt.figure(figsize=(14,5))
sns.lineplot(x='Date', y='AQI', data=merged_df)
plt.title('AQI Trend in Pune Over Time (2017–2024)')
plt.xlabel('Year')
plt.ylabel('AQI')
plt.show()

From 2017 to 2024:
AQI does not stay constant.
There are seasonal waves (rising and falling every year).
There are sharp spikes (sudden severe pollution days).
The overall pattern is continuous, numeric, and time-dependent, which suits a regression approach.

**Regression**

In [None]:
#define target(y) and features(x)
# Target variable
y = merged_df['AQI']

# Features (drop target + non-useful columns)
X = merged_df.drop(columns=[
    'AQI',
    'AQI_Category_Indian_GOV',  # category not needed for regression
    'Date',
    'date',
    'city'
])

print(X.shape)
print(y.shape)
X.head()


In [None]:
#Time Based Train-Test split
#training: 2017-2022
#Testing: 2023-2024

# Create a copy to avoid modifying original
data_ml = merged_df.copy()

# Split based on year
train_df = data_ml[data_ml['Year'] <= 2022]
test_df  = data_ml[data_ml['Year'] > 2022]

# Define X and y
X_train = train_df[X.columns]
y_train = train_df['AQI']

X_test = test_df[X.columns]
y_test = test_df['AQI']

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)



**Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Train model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred = lr.predict(X_test)

# Evaluation
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Linear Regression Results:")
print("MAE :", mae)
print("RMSE:", rmse)
print("R²  :", r2)


We first trained a Linear Regression model to establish a baseline performance.
This helps us understand whether complex models actually add value or not.
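
The same idea can be pushed one step further with an even simpler reference point: a model that always predicts the training mean. The sketch below uses synthetic data and scikit-learn's `DummyRegressor`. The variable names and data are illustrative and not part of the project pipeline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic stand-in data: one informative feature plus noise
X_demo = rng.normal(size=(300, 1))
y_demo = 100 + 25 * X_demo[:, 0] + rng.normal(scale=5, size=300)

# Keep row order: first 200 rows for training, last 100 for testing
X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

naive = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

mae_naive = mean_absolute_error(y_te, naive.predict(X_te))
mae_linear = mean_absolute_error(y_te, linear.predict(X_te))
print(f"Naive mean MAE: {mae_naive:.2f} | Linear MAE: {mae_linear:.2f}")
```

If a model cannot beat the naive mean on the held-out years, its features are adding little predictive value.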



**Random Forest Regression**

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Train Random Forest
rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Evaluation
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Results:")
print("MAE :", mae_rf)
print("RMSE:", rmse_rf)
print("R²  :", r2_rf)


Explainable AQI Increment

In [None]:
# Feature importance

feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf.feature_importances_
}).sort_values(by='Importance', ascending=False)

feature_importance


Explainable Factors Behind AQI Increase in Pune
1) Night-time apparent temperature plays the largest role, which is consistent with temperature inversions trapping pollutants near the ground
2) Seasonal variations significantly impact AQI levels
3) Wind characteristics regulate pollutant dispersion
4) Long-term trends may reflect urbanization effects
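
Impurity-based importances from a random forest are computed on the training data, so it can help to cross-check the ranking on held-out rows. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. The feature names are borrowed from the weather table for readability, while `noise_feature` and the target formula are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
demo_X = pd.DataFrame({
    "apparent_temperature_min": rng.normal(18, 5, n),
    "wind_speed_10m_max": rng.normal(12, 3, n),
    "noise_feature": rng.normal(0, 1, n),  # hypothetical irrelevant column
})
# Synthetic target: colder nights and calmer winds raise AQI
demo_y = (
    200
    - 4 * demo_X["apparent_temperature_min"]
    - 2 * demo_X["wind_speed_10m_max"]
    + rng.normal(0, 3, n)
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(demo_X.iloc[:300], demo_y.iloc[:300])

# Shuffle each feature on held-out rows and measure the drop in score
result = permutation_importance(
    model, demo_X.iloc[300:], demo_y.iloc[300:], n_repeats=10, random_state=42
)
ranking = pd.Series(result.importances_mean, index=demo_X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Either kind of importance describes what the model relies on. A physical explanation such as inversion still needs support from domain knowledge.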

In [None]:

df = merged_df.copy()

# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Sort by date
df = df.sort_values('Date')

# Plot AQI trend
plt.figure(figsize=(12,5))
plt.plot(df['Date'], df['AQI'])
plt.title("AQI Trend in Pune Over Time")
plt.xlabel("Date")
plt.ylabel("AQI")
plt.tight_layout()
plt.show()


In [None]:
#Smooth the AQI trend using Rolling Average
# 7-day rolling average
df['AQI_7day_avg'] = df['AQI'].rolling(window=7).mean()

# Plot original + rolling average
plt.figure(figsize=(12,5))
plt.plot(df['Date'], df['AQI'], alpha=0.4, label='Daily AQI')
plt.plot(df['Date'], df['AQI_7day_avg'], linewidth=2, label='7-Day Rolling Avg')
plt.title("AQI Trend in Pune with Rolling Average")
plt.xlabel("Date")
plt.ylabel("AQI")
plt.legend()
plt.tight_layout()
plt.show()


1) The graph shows that daily AQI is highly volatile
2) The 7-day rolling average reveals clear seasonal cycles


**Future AQI Prediction**

A future AQI risk projection was performed using the trained regression model, with weather and traffic features held at their historical averages.
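
When future feature values are filled in from history, a single overall mean removes the seasonal cycle from those inputs. The sketch below shows one possible alternative, month-of-year averages, on synthetic data. It is an illustration under that assumption, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

hist_dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
day_of_year = hist_dates.dayofyear.to_numpy()

# Synthetic history: cooler nights in winter, warmer mid-year (not real data)
hist = pd.DataFrame({
    "Month": hist_dates.month,
    "temperature_2m_min": 18 - 6 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25),
})

# Month-of-year averages instead of one overall mean
monthly_means = hist.groupby("Month")["temperature_2m_min"].mean()

future_dates = pd.date_range("2026-01-01", "2026-12-31", freq="D")
future = pd.DataFrame({"Month": future_dates.month})
future["temperature_2m_min"] = future["Month"].map(monthly_means)
print(future.groupby("Month")["temperature_2m_min"].first().round(1))
```

With this fill, the model sees a typical January value in January and a typical July value in July, so the weather features can still contribute to the seasonal shape of the projection.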

In [None]:
# STEP 1: Create future dates (2026–2028)
future_dates = pd.date_range(start="2026-01-01", end="2028-12-31", freq="D")

future_df = pd.DataFrame({
    "Year": future_dates.year,
    "Month": future_dates.month,
    "Day": future_dates.day,
})

# Use historical averages for other features
for col in X.columns:
    if col not in ["Year", "Month", "Day"]:
        future_df[col] = X[col].mean()

print("Future Data Shape:", future_df.shape)
future_df.head(10)


**Predict AQI for 2026–2028**

**Future Risk Mapping**

In [None]:
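# Refit with the same settings as the evaluated model so the reported metrics still apply
rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)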


In [None]:
# Match the exact training features; .copy() avoids assigning into a slice
future_df = future_df[X_train.columns].copy()
future_df["Predicted_AQI"] = rf.predict(future_df)
future_df.head()


In [None]:
# Create date column
future_df["Date"] = pd.to_datetime(
    future_df["Year"].astype(str) + "-" +
    future_df["Month"].astype(str) + "-" +
    future_df["Day"].astype(str)
)

# Plot future AQI trend
plt.figure(figsize=(12,5))
plt.plot(future_df["Date"], future_df["Predicted_AQI"])
plt.title("Future AQI Prediction for Pune (2026–2028)")
plt.xlabel("Year")
plt.ylabel("Predicted AQI")
plt.show()


The graph shows (2026-2028):
1) Predicted AQI follows a repeating annual pattern rather than fluctuating randomly.
2) Predicted AQI in Pune remains mostly in the Moderate to Poor range, indicating persistent air pollution.
3) Winter and post-monsoon months show higher predicted AQI, consistent with the low wind speeds and temperature inversions seen in the historical data.
4) Monsoon months show lower predicted AQI, but the improvement is short-lived.
5) Traffic enters the model as a constant value, so it acts as a fixed baseline rather than a varying driver of the predictions.
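
One caveat when reading multi-year projections from a tree-based model: decision trees cannot extrapolate beyond the feature values seen in training. The sketch below demonstrates this on synthetic data with a steady upward trend by year. The numbers are made up and the setup is deliberately reduced to a single `Year` feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a steady upward trend by year (not real AQI)
years = np.repeat(np.arange(2017, 2023), 50)
rng = np.random.default_rng(0)
aqi = 100 + 5 * (years - 2017) + rng.normal(0, 2, size=years.size)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(years.reshape(-1, 1), aqi)

# Years after the last training year fall into the same leaves as 2022
preds = model.predict(np.array([[2022], [2026], [2028]]))
print(preds)
```

All three predictions are identical, so a random forest carries the last observed level forward instead of continuing the trend. Any year-on-year growth in a projection would need to come from the input features or from a model that can extrapolate.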



**Conclusion**:
This project presents a source-driven air pollution analysis for Pune by integrating air quality, weather, and traffic data. The model successfully identifies seasonal and meteorological drivers of pollution and predicts future AQI levels for 2026-2028.

Results indicate that Pune's air quality is expected to remain in the Moderate to Poor category, with recurring pollution peaks driven by weather conditions, sustained traffic emissions, and rapid urban development. Without effective pollution control policies, the city may experience long-term exposure to unhealthy air levels.

The future risk mapping highlights the urgent need for traffic regulation, construction dust control, and urban planning interventions to prevent further degradation of air quality as Pune continues to grow as a metro city.