# PM2.5 Prediction using MLP Neural Network

This notebook demonstrates the prediction of PM2.5 concentrations using a Multi-Layer Perceptron (MLP) neural network. The model uses meteorological variables, atmospheric energy parameters, and webcam-derived RGB features to estimate particulate matter concentrations.



Import necessary libraries for data processing, visualization, and deep learning.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import tensorflow as tf

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

## Reproducibility Settings

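Set random seeds for NumPy and TensorFlow so that repeated runs start from the same random state. Some GPU operations are non-deterministic, so results can still differ slightly between runs or machines.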


In [None]:
SEED = 1
np.random.seed(SEED)
tf.random.set_seed(SEED)
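
As a quick check of what seeding does, the sketch below reseeds the generators twice and confirms that the same draws come back. It uses only NumPy and Python's built-in `random` module; seeding `random` is an addition that the cell above does not make, and `seeded_draws` is an illustrative helper name.

```python
import random
import numpy as np

SEED = 1

def seeded_draws(seed, size=3):
    """Reseed the global generators, then draw `size` uniform numbers."""
    random.seed(seed)      # Python's built-in generator (not seeded in the cell above)
    np.random.seed(seed)   # NumPy's legacy global generator, as in the cell above
    return np.random.rand(size)

first = seeded_draws(SEED)
second = seeded_draws(SEED)
print("Identical draws after reseeding:", np.array_equal(first, second))
```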

## Data Loading

Load the input dataset containing meteorological variables, RGB features from webcam images, and PM2.5 measurements.


In [None]:
path = "Model Input/ML_DL_input.csv"
df = pd.read_csv(path)
df.shape

## Datetime Processing

Create a datetime column by combining the Date and Hour fields for temporal analysis. Values that cannot be parsed become `NaT` because of `errors='coerce'`.


In [None]:
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Hour'].astype(str), errors='coerce')
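
The sketch below shows one alternative way to build the timestamp and to count the rows that failed to parse. It assumes that `Hour` is an integer hour of day (0 to 23) and that `Date` is an ISO-style date string. Both are assumptions about the CSV, and the `demo` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the real Date/Hour formats in the CSV may differ.
demo = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "not a date"],
    "Hour": [0, 13, 5],
})

# Parse the date part (bad values become NaT), then add the hour as an offset.
dates = pd.to_datetime(demo["Date"], errors="coerce")
demo["Datetime"] = dates + pd.to_timedelta(demo["Hour"], unit="h")

n_bad = int(demo["Datetime"].isna().sum())
print(demo)
print(f"Unparseable rows: {n_bad}")
```

Checking the `NaT` count before modeling shows how many rows the coercion silently invalidated.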

## Feature Engineering: Atmospheric Energy Parameters

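Calculate atmospheric energy features at two pressure levels (500 hPa and 850 hPa):
- **Density (ρ)**: Air density from the ideal gas law, ρ = pM/(RT), with p the level pressure in Pa, M = 29 kg/kmol and R = 8314 J/(kmol·K). The temperature columns must be in kelvin for this to hold.
- **Kinetic Energy (KE)**: Kinetic energy per unit volume from the wind components, ½ρ(U² + V²)
- **Geopotential Energy (GE)**: Gravitational potential energy per unit volume, ρ × geopotential

These features capture atmospheric dynamics relevant to pollutant dispersion.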


In [None]:
T_500 = df['T_500']
T_850 = df['T_850']
# Ideal gas law: rho = p*M / (R*T), with p in Pa, M = 29 kg/kmol, R = 8314 J/(kmol*K)
Ro_500 = 50000 * 29 / (8314 * T_500)
Ro_850 = 85000 * 29 / (8314 * T_850)

U_500=df['U_500']
V_500=df['V_500']
U_850=df['U_850']
V_850=df['V_850']

KE_500=0.5*Ro_500*(U_500**2+V_500**2)
KE_850=0.5*Ro_850*(U_850**2+V_850**2)

GP_500=df['GP_500']
GP_850=df['GP_850']
GE_500=Ro_500*GP_500
GE_850=Ro_850*GP_850

df['GE_500']=GE_500
df['KE_500']=KE_500
df['GE_850']=GE_850
df['KE_850']=KE_850
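
To make the formulas concrete, here is a small worked sketch with scalar inputs. The input values are illustrative and not taken from the dataset. The function names are hypothetical, and the units (temperature in K, wind in m/s, geopotential in m²/s²) are assumptions about the input columns.

```python
M_AIR = 29.0     # molar mass of air, kg/kmol (value used in the cell above)
R_GAS = 8314.0   # universal gas constant, J/(kmol*K)

def air_density(pressure_pa, temp_k):
    """Ideal gas law: rho = p*M / (R*T), in kg/m^3."""
    return pressure_pa * M_AIR / (R_GAS * temp_k)

def kinetic_energy(rho, u, v):
    """Kinetic energy per unit volume, 0.5*rho*(u^2 + v^2), in J/m^3."""
    return 0.5 * rho * (u**2 + v**2)

def geopotential_energy(rho, gp):
    """Potential energy per unit volume, rho * geopotential, in J/m^3."""
    return rho * gp

# Illustrative inputs for the 500 hPa level
rho_500 = air_density(50000.0, 250.0)
ke_500 = kinetic_energy(rho_500, 10.0, 0.0)
ge_500 = geopotential_energy(rho_500, 55000.0)

print(f"rho_500 = {rho_500:.4f} kg/m^3")
print(f"KE_500  = {ke_500:.2f} J/m^3")
print(f"GE_500  = {ge_500:.0f} J/m^3")
```

With these inputs the density comes out near 0.70 kg/m³, a plausible mid-troposphere value. That gives a quick sanity check on the units.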

## Feature Engineering: RGB Color Ratios

Calculate color ratios from webcam imagery (sky and ground regions):
- **R/G and R/B ratios** (sky and ground) and **B/R ratio** (sky only): Color channel relationships
- **RGB sum** (sky and ground): Total brightness indicator

These features capture atmospheric turbidity and visibility conditions that correlate with PM2.5 concentrations.


In [None]:
R_G_Sky= df['R_S_M']/df['G_S_M']
R_B_Sky = df['R_S_M']/df['B_S_M']
RGB_Sky = df['R_S_M']+df['G_S_M']+df['B_S_M']
G_R_Sky= df['G_S_M']/df['R_S_M']
B_R_Sky= df['B_S_M']/df['R_S_M']

R_G_Ground= df['R_G_M']/df['G_G_M']
R_B_Ground = df['R_G_M']/df['B_G_M']
RGB_Ground = df['R_G_M']+df['G_G_M']+df['B_G_M']

df['R_G_Sky']=R_G_Sky
df['R_B_Sky']=R_B_Sky
df['B_R_Sky']=B_R_Sky
df['RGB_Sky']=RGB_Sky
df['R_G_Ground']=R_G_Ground
df['R_B_Ground']=R_B_Ground
df['RGB_Ground']=RGB_Ground

PM25 = df['PM25']

GE_500=df['GE_500']
BLH=df['BLH']
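
A channel mean of zero turns a ratio into `inf`, and `dropna()` does not remove infinite values by default. The sketch below shows one way to guard against that by converting infinities to `NaN`. The `demo` values are hypothetical, and this guard is a suggestion, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical channel means; the zero in G mimics a fully dark region.
demo = pd.DataFrame({
    "R_S_M": [120.0, 80.0, 60.0],
    "G_S_M": [100.0, 0.0, 50.0],
})

ratio = demo["R_S_M"] / demo["G_S_M"]                   # second row becomes inf
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan)   # inf -> NaN so dropna() sees it

print("Rows kept without the guard:", len(ratio.dropna()))
print("Rows kept with the guard:   ", len(safe_ratio.dropna()))
```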

## Feature Engineering: Temporal Cyclical Encoding

Encode hour and month as cyclical features using sine and cosine transformations. This preserves the circular nature of time (e.g., hour 23 is close to hour 0).


In [None]:
df['hr']=df['Datetime'].dt.hour
df['mnth']=df['Datetime'].dt.month
print(df['hr'].unique(),df['mnth'].unique())

df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
df['mnth_sin'] = np.sin((df.mnth-1)*(2.*np.pi/12))
df['mnth_cos'] = np.cos((df.mnth-1)*(2.*np.pi/12))
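
The benefit of the encoding can be checked numerically. The sketch below measures the distance between hours in the (sin, cos) plane, using the same 24-hour period as the cell above; the helper names are illustrative.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day onto the unit circle as (sin, cos)."""
    angle = hour * (2.0 * np.pi / 24)
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_0 = encoded_distance(23, 0)   # adjacent across midnight
d_0_1 = encoded_distance(0, 1)     # adjacent within the day
d_0_12 = encoded_distance(0, 12)   # opposite sides of the clock

print(f"23 -> 0: {d_23_0:.3f}")
print(f"0 -> 1:  {d_0_1:.3f}")
print(f"0 -> 12: {d_0_12:.3f}")
```

Hours 23 and 0 end up exactly as far apart as hours 0 and 1, whereas the raw integer values differ by 23.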

## Data Preparation

Prepare the dataset for modeling:
1. Define the target variable (PM2.5) and the columns to drop
2. Remove non-numeric and temporal columns
3. Drop rows with missing values
4. Select the feature columns (X) and the target column (y)


In [None]:
y_cols = ['PM25']
drop_cols = ['Datetime', 'Date', 'Hour', 'hr', 'mnth']

df2 = df.drop(columns=[c for c in drop_cols if c in df.columns]).copy()
df2 = df2.select_dtypes(include=[np.number]).copy()
df2 = df2.dropna().reset_index(drop=True)

# Feature and target selection
x_cols = [c for c in df2.columns if c not in y_cols]
print(f"Features: {len(x_cols)}")
print(f"Target: {y_cols}")
print(f"Total samples: {len(df2)}")

## Chronological Train-Test Split

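Split data chronologically (not randomly) to simulate real-world forecasting. The split is positional, so it assumes the rows are already ordered by time:
- Training: First 80% of temporal data
- Testing: Last 20% of temporal data

This keeps future observations out of the training set and gives a more realistic performance estimate than a random split.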


In [None]:
X = df2[x_cols].to_numpy()
y = df2[y_cols].to_numpy()

test_frac = 0.2
n = len(df2)
split_idx = int((1 - test_frac) * n)

X_train, y_train = X[:split_idx], y[:split_idx]
X_test, y_test = X[split_idx:], y[split_idx:]

print(f"Total rows: {n}")
print(f"Train: {X_train.shape}")
print(f"Test: {X_test.shape}")
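
Because the split relies on row order, it can help to sort by timestamp first and verify that no test timestamp precedes the training period. The sketch below does this on a made-up, deliberately shuffled frame; the explicit sort is an assumption about what the data may need and is not a step in the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly records, shuffled to mimic an unordered file.
times = pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(10), unit="h")
demo = pd.DataFrame({"Datetime": times, "PM25": np.arange(10, dtype=float)})
demo = demo.sample(frac=1.0, random_state=0)

# Restore chronological order before splitting by position.
demo = demo.sort_values("Datetime", kind="stable").reset_index(drop=True)

test_frac = 0.2
split_idx = int((1 - test_frac) * len(demo))
train, test = demo.iloc[:split_idx], demo.iloc[split_idx:]

print("Train ends:  ", train["Datetime"].max())
print("Test starts: ", test["Datetime"].min())
```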

## Data Normalization

Apply Min-Max scaling to normalize features and target to [0, 1] range:
- Fit scalers on **training data only** to prevent data leakage
- Transform both train and test sets using training statistics
- Keep scalers for inverse transformation of predictions


In [None]:
sx = MinMaxScaler(feature_range=(0, 1))
sy = MinMaxScaler(feature_range=(0, 1))

X_train_s = sx.fit_transform(X_train)
X_test_s = sx.transform(X_test)

y_train_s = sy.fit_transform(y_train)
y_test_s = sy.transform(y_test)

print("Scaling complete")
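
One consequence of fitting on the training data only is that test values outside the training range fall outside [0, 1]. The sketch below reproduces the default min-max formula, (x - min) / (max - min), by hand with NumPy on toy numbers to show this.

```python
import numpy as np

# Toy single-feature data; the last test value exceeds the training maximum.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0], [40.0]])

# Statistics come from the training set only.
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

def min_max(x):
    """Scale with the training minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

X_train_s = min_max(X_train)
X_test_s = min_max(X_test)

print("Train scaled:", X_train_s.ravel())
print("Test scaled: ", X_test_s.ravel())
```

The scaled test value of 1.5 is expected behavior, not an error: the model is simply asked to extrapolate beyond the range it was trained on.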

## MLP Model Architecture

Build a Multi-Layer Perceptron (MLP) for regression:
- **Input layer**: Number of features
- **Hidden layers**: 3 dense layers (32, 32, 16 neurons) with ReLU activation
- **Output layer**: Single neuron (PM2.5 prediction) with linear activation
- **Optimizer**: Adam with learning rate = 0.001
- **Loss function**: Mean Squared Error (MSE)


In [None]:
n_input = X_train_s.shape[1]   # number of input features
n_output = y_train_s.shape[1]  # number of targets (1: PM2.5)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_input,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(n_output)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse'
)

model.summary()
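
The parameter total reported by `model.summary()` can be cross-checked by hand: each dense layer has `n_in * n_out` weights plus `n_out` biases. The sketch below does the arithmetic for the 32-32-16-1 architecture; the feature count of 20 is a hypothetical placeholder, since the real value depends on the dataset.

```python
def dense_param_count(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_param_count(n_input, hidden=(32, 32, 16), n_output=1):
    """Total trainable parameters of a plain MLP with the given layer sizes."""
    sizes = [n_input, *hidden, n_output]
    return sum(dense_param_count(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

n_features = 20  # hypothetical; use X_train_s.shape[1] in the notebook
total = mlp_param_count(n_features)
print(f"Trainable parameters for {n_features} features: {total}")
```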

## Model Training

Train the MLP with early stopping:
- **Epochs**: Maximum 300
- **Batch size**: 256
- **Early stopping**: Monitor validation loss with patience=20
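- **Validation**: The test set is passed as `validation_data`. Because early stopping selects the best epoch on this set, the test metrics reported later are optimistically biased; a separate validation slice taken from the training period avoids this.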


In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True
)

history = model.fit(
    X_train_s, y_train_s,
    epochs=300,
    batch_size=256,
    validation_data=(X_test_s, y_test_s),
    callbacks=[early_stop],
    verbose=1
)
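
One way to keep the test set untouched during training is to hold out the most recent part of the training period for early stopping. The sketch below shows only the slicing, on toy arrays. `chronological_val_split` and `val_frac` are hypothetical names, and the approach is a suggested alternative to what the cell above does.

```python
import numpy as np

def chronological_val_split(X, y, val_frac=0.1):
    """Split time-ordered arrays into an earlier fit part and a later validation part."""
    cut = int((1 - val_frac) * len(X))
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])

# Toy time-ordered data: 10 samples, 2 features.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

(X_fit, y_fit), (X_val, y_val) = chronological_val_split(X_demo, y_demo, val_frac=0.2)
print("Fit:", X_fit.shape, "Validation:", X_val.shape)
```

The resulting arrays could then be passed as `model.fit(X_fit, y_fit, validation_data=(X_val, y_val), ...)`, so the test set is seen only at evaluation time.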

## Training History Visualization

Plot training and validation loss curves to assess model convergence and potential overfitting.


In [None]:
plt.figure(figsize=(8, 5), dpi=150)
plt.plot(history.history['loss'], label='Train MSE (scaled)')
plt.plot(history.history['val_loss'], label='Test MSE (scaled)')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.title('MLP Training History')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
#plt.savefig('loss.png')
plt.show()

## Model Evaluation

Evaluate model performance on test set using original units (µg/m³):
- **RMSE**: Root Mean Squared Error
- **MAE**: Mean Absolute Error
- **R²**: Coefficient of determination

Predictions are inverse-transformed from normalized space back to original scale.


In [None]:
yhat_test_s = model.predict(X_test_s, verbose=0)
yhat_test = sy.inverse_transform(yhat_test_s)

rmse = np.sqrt(mean_squared_error(y_test, yhat_test))
mae = mean_absolute_error(y_test, yhat_test)
r2 = r2_score(y_test, yhat_test)

print(f"Test RMSE: {rmse:.3f} µg/m³")
print(f"Test MAE:  {mae:.3f} µg/m³")
print(f"Test R²:   {r2:.3f}")
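
For reference, the three metrics reduce to a few lines of NumPy. The sketch below applies their standard definitions to three made-up observation/prediction pairs, small enough to verify by hand.

```python
import numpy as np

# Made-up values in µg/m³, for illustration only.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err**2))                    # root of the mean squared error
mae = np.mean(np.abs(err))                         # mean absolute error
ss_res = np.sum(err**2)                            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean())**2)       # total sum of squares
r2 = 1.0 - ss_res / ss_tot                         # coefficient of determination

print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R²: {r2:.3f}")
```

RMSE penalizes the single 3 µg/m³ miss more heavily than MAE does, which is why it comes out slightly larger here.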

## Prediction Visualization

Scatter plot comparing observed vs. predicted PM2.5 values. Points on the 1:1 diagonal are perfect predictions; points above it are overestimates and points below it are underestimates.


In [None]:
plt.figure(figsize=(6, 6), dpi=150)
plt.scatter(y_test, yhat_test, s=8, alpha=0.5, edgecolors='none')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
         'r--', lw=2, label='Perfect prediction')
plt.xlabel("Observed PM2.5 (µg/m³)")
plt.ylabel("Predicted PM2.5 (µg/m³)")
plt.title("Observed vs Predicted PM2.5 (Test Set)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
#plt.savefig('comparison_mlp.png')
plt.show()