# Deep Learning model for traffic and air quality prediction in time

This notebook explains and performs the training of two models that predict a state a configurable number of hours in the future.
It defines a clear and simple way of setting up the experimental environment for a machine learning experiment (train/test datasets, evaluation metrics, ...).
I chose to use sklearn for its ease of use.

## Deep Learning Model - GRU Network

This notebook builds a GRU (Gated Recurrent Unit) neural network, which is similar to an LSTM but simpler and often faster, while still capturing temporal relations in the data.
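
To make the "simpler than LSTM" claim concrete, here is a minimal NumPy sketch of a single GRU time step. It follows a common textbook formulation (two gates, no separate cell state); the names `gru_step` and `params` are mine, the weights are random, and this is not claimed to match the exact internals of the Keras `GRU` layer used later.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step for a single sample.

    x: input vector (n_features,), h_prev: previous hidden state (n_units,).
    params: weights W_* (n_units, n_features), U_* (n_units, n_units), biases b_* (n_units,).
    """
    # Update gate: how much of the previous state is kept
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the previous state feeds the candidate
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])
    # Candidate state computed from the input and the reset-scaled previous state
    h_cand = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"])
    # New state: gate-weighted blend of previous state and candidate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_features, n_units = 4, 3
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_units, n_features))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_units, n_units))
    params[f"b_{gate}"] = np.zeros(n_units)

h = np.zeros(n_units)
for x_t in rng.normal(size=(5, n_features)):  # 5 toy time steps
    h = gru_step(x_t, h, params)
print(h.shape)
```

The hidden state `h` is the only memory carried from one step to the next, which is what keeps the unit smaller than an LSTM cell.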

In [74]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from lightgbm import LGBMRegressor
import plotly.graph_objects as go

Loading the dataset

In [75]:
df=pd.read_pickle("created_dataset.pkl")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16081 entries, 0 to 16080
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           16081 non-null  object        
 1   hour           16081 non-null  int32         
 2   traffic_z0     16081 non-null  float64       
 3   traffic_z1     16081 non-null  float64       
 4   traffic_z3     16081 non-null  float64       
 5   traffic_z4     16081 non-null  float64       
 6   traffic_z5     16081 non-null  float64       
 7   traffic_z6     16081 non-null  float64       
 8   traffic_z7     16081 non-null  float64       
 9   traffic_z8     16081 non-null  float64       
 10  station_4      16081 non-null  float64       
 11  station_43     16081 non-null  float64       
 12  station_44     16081 non-null  float64       
 13  station_54     16081 non-null  float64       
 14  station_57     16081 non-null  float64       
 15  station_58     1608

We will predict both traffic and air quality in the future. The cell below sets how many hours ahead the predictions are made.

In [76]:
forecast_h = 12 # Prediction will be made for forecast_h hours in the future

# Predicting traffic

This part is not Deep Learning, and since it is independent of the air quality prediction with neural networks, I commented it out. Only the lines that later cells need (`traffic_cols` and `dayofweek`) stay active.

In [77]:
# # Identifying the 'traffic_zN' columns
traffic_cols = [c for c in df.columns if c.startswith("traffic_")]

# # Creating the target value (value forecast_h hours after) for every traffic area
# for col in traffic_cols:
#     df[f"target_{col}"] = df[col].shift(-forecast_h)

# # This function adds lags (values from previous hours) and rolls (rolling means over previous time windows)
# def add_lags_and_rolls(df, cols, lags=[1, 2, 3, 6, 12], rolls=[3, 6, 12]):
#     for col in cols:
#         for lag in lags:
#             df[f"{col}_lag{lag}"] = df[col].shift(lag)
#         for w in rolls:
#             df[f"{col}_roll{w}"] = df[col].rolling(window=w).mean()
#     return df

# df = add_lags_and_rolls(df, traffic_cols)
df["dayofweek"] = df["datetime_hour"].dt.dayofweek
# df = df.dropna().reset_index(drop=True)
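
As an illustration of what the commented-out feature code would produce, here is a small sketch on a toy series. The data and the horizon of 2 are made up for the example; only the `shift` / `rolling` pattern comes from the cell above.

```python
import pandas as pd

# Toy hourly series for a single zone (values are invented)
toy = pd.DataFrame({"traffic_z0": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]})
toy_forecast_h = 2  # toy horizon, much shorter than the notebook's forecast_h

# Target: the value observed toy_forecast_h rows later
toy["target_traffic_z0"] = toy["traffic_z0"].shift(-toy_forecast_h)
# Lag: the value observed one row earlier
toy["traffic_z0_lag1"] = toy["traffic_z0"].shift(1)
# Roll: mean over the current row and the two previous ones
toy["traffic_z0_roll3"] = toy["traffic_z0"].rolling(window=3).mean()

# Rows at both ends have NaN (no past for the lags, no future for the target)
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

Only the rows that have both enough past (for lags and rolls) and enough future (for the target) survive the `dropna()`.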

Training of one model per zone

In [78]:
# models_traffic = {}
# scalers_traffic = {}
# for col in traffic_cols:
#     # Features: lag/rolling features of the zone + time features
#     feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
#     X = df[feature_cols]
#     y = df[f"target_{col}"]

#     # Split train/test
#     split_idx = int(len(df) * 0.8)
#     X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
#     y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

#     # Normalisation
#     scaler = StandardScaler()
#     X_train_scaled = scaler.fit_transform(X_train)
#     X_test_scaled = scaler.transform(X_test)
#     scalers_traffic[col] = scaler

#     # Model
#     model = LGBMRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
#     model.fit(X_train_scaled, y_train)
#     models_traffic[col] = model

#     # Metrics
#     y_pred = model.predict(X_test_scaled)
#     mae = mean_absolute_error(y_test, y_pred)
#     rmse = np.sqrt(mean_squared_error(y_test, y_pred))
#     r2 = r2_score(y_test, y_pred)
#     print(f"[TRAFFIC] {col}: MAE = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")

#     fig = go.Figure()
#     fig.add_trace(go.Scatter(
#         x=df["datetime_hour"].iloc[split_idx:],
#         y=y_test,
#         mode='lines',
#         name='Real values',
#         line=dict(color='blue')
#     ))
#     fig.add_trace(go.Scatter(
#         x=df["datetime_hour"].iloc[split_idx:],
#         y=y_pred,
#         mode='lines',
#         name='Predictions',
#         line=dict(color='red', dash='dash')
#     ))
#     fig.update_layout(
#         title=f"Traffic prediction in ({col}) for +{forecast_h}h",
#         xaxis_title="Date",
#         yaxis_title="Traffic",
#         hovermode="x unified",
#         template="plotly_white"
#     )
#     fig.show()


# Predicting air quality
First with the real traffic values (later I will use the predicted ones).

In [79]:
# Identify the station cols
station_cols = [c for c in df.columns if c.startswith("station_")]
weather_cols = ["temperature", "wind_u", "wind_v", "precipitation", "is_raining", "humidity", "pressure", "cloud_cover"]
weather_cols = [c for c in weather_cols if c in df.columns]

Creation of the target column: this is the value we will try to predict for a given hour. It is either the air quality measured by station 4 `forecast_h` hours later, or the mean of all the stations' measurements `forecast_h` hours later. The following cell lets you choose between the two.

In [80]:
# Target: either the mean of all the stations, or one specific station.

# For the mean, uncomment this
#target, target_label = df[station_cols].mean(axis=1), "Average PM10 levels"

# For station 4, uncomment this (a Series, like the mean above)
target, target_label = df["station_4"], "PM10 levels in station 4"

In [81]:
df["air_quality_selected"] = target
df["target_air_quality"] = df["air_quality_selected"].shift(-forecast_h)

df = df.dropna().reset_index(drop=True)

traffic_features = traffic_cols
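
To show what the two target options and the shift do, here is a small sketch on invented station values (two stations, horizon of 2). The column names mimic the dataset, but the numbers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "station_4": [20.0, 22.0, 18.0, 25.0, 30.0],
    "station_43": [10.0, 12.0, 14.0, 15.0, 20.0],
})
toy_station_cols = [c for c in toy.columns if c.startswith("station_")]
toy_forecast_h = 2  # toy horizon

# Option 1: mean over all stations, row by row
mean_target = toy[toy_station_cols].mean(axis=1)
# Option 2: a single station
single_target = toy["station_4"]

toy["air_quality_selected"] = single_target
# Each row now holds the value measured toy_forecast_h rows later
toy["target_air_quality"] = toy["air_quality_selected"].shift(-toy_forecast_h)

n_before = len(toy)
toy = toy.dropna().reset_index(drop=True)
print(f"{n_before - len(toy)} rows dropped (no future value available)")
```

The last `forecast_h` rows have no future measurement, so their target is NaN and they are dropped.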

In [82]:
# Get the created features names
feature_cols_air = (
    traffic_features +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and "air_quality" in c] +
    [c for c in df.columns if c in weather_cols] +
    ["hour", "dayofweek"]
)

Splitting the dataset into a training set and a test set.
Since I need continuous time, I cannot pick rows at random to fill these two sets. I chose to take the first 80% of the timeframe for training and the remaining part for testing. This has some flaws (for example, the seasons change between the two periods).

In [83]:
# Split train/test
X = df[feature_cols_air]
y = df["target_air_quality"]
split_idx = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]


# Important: we removed some rows from X, we have to remove the corresponding ones from Y
y_test = y_test.loc[X_test.index]  # Have the same indexes as X_test
y_train = y_train.loc[X_train.index]
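
The chronological split can be checked on a toy frame. This is only a sketch: `chronological_split` is a hypothetical helper name, and the ten hourly rows are invented.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_frac=0.8):
    """Split a time-ordered frame into (train, test) without shuffling."""
    split_idx = int(len(frame) * train_frac)
    return frame.iloc[:split_idx], frame.iloc[split_idx:]

# Ten consecutive hourly timestamps (toy data)
toy = pd.DataFrame({
    "datetime_hour": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(10), unit="h"),
    "value": np.arange(10.0),
})
train, test = chronological_split(toy)

# Every training timestamp is earlier than every test timestamp
print(train["datetime_hour"].max() < test["datetime_hour"].min())
```

Because nothing is shuffled, the model never sees data from the test period during training, at the cost of the train and test periods possibly covering different seasons.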

In [107]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import StandardScaler

# Normalization: fit the scaler on the training set only, then apply it to both sets
scaler_air = StandardScaler()
X_train_scaled = scaler_air.fit_transform(X_train)
X_test_scaled = scaler_air.transform(X_test)

# Reshape for the GRU layers: (samples, timesteps=1, features)
X_train_lstm = X_train_scaled.reshape((X_train_scaled.shape[0], 1, X_train_scaled.shape[1]))
X_test_lstm = X_test_scaled.reshape((X_test_scaled.shape[0], 1, X_test_scaled.shape[1]))

def build_gru_model():
    # Two stacked GRU layers followed by a small dense regression head
    model = Sequential([
        GRU(128, activation='tanh', return_sequences=True, input_shape=(1, X_train_scaled.shape[1])),
        #Dropout(0.2),
        GRU(64, activation='tanh'),
        #Dropout(0.2),
        Dense(32, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

model_dl = KerasRegressor(model=build_gru_model, epochs=100, batch_size=32, verbose=1)
# Caveat: 'val_loss' only exists when validation data is passed to fit();
# none is passed here, so this callback has no metric to monitor.
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model_dl.fit(X_train_lstm, y_train, callbacks=[early_stop])
y_pred = model_dl.predict(X_test_lstm)
model_air = model_dl

Epoch 1/100

Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.

402/402 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 276.4978 - mae: 11.8341
Epoch 2/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 169.2892 - mae: 9.1085
Epoch 3/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 166.5851 - mae: 9.0148
Epoch 4/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 163.1221 - mae: 8.8979
Epoch 5/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 159.7235 - mae: 8.8091
Epoch 6/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 157.2796 - mae: 8.7085
Epoch 7/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 154.5207 - mae: 8.6198
Epoch 8/100
402/402 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 152.0462 - mae: 8.5175
Epoch 9/100
402/402 ━━━━━━━━━━━━━━━━━━━━
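
The training cell feeds the network with `timesteps=1`, so each sample contains a single time step and the recurrent layers have no history to process. One possible alternative, which this notebook does not use, is to stack the last few rows into a window per sample. The sketch below uses a hypothetical helper `make_windows` on invented arrays.

```python
import numpy as np

def make_windows(features, targets, timesteps):
    """Stack the last `timesteps` rows of features for each target.

    features: (n_samples, n_features), targets: (n_samples,).
    Returns X of shape (n_samples - timesteps + 1, timesteps, n_features)
    and y aligned with the last row of each window.
    """
    n = features.shape[0] - timesteps + 1
    X = np.stack([features[i:i + timesteps] for i in range(n)])
    y = targets[timesteps - 1:]
    return X, y

# Toy data: 10 rows, 2 features
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10, dtype=float)

X_seq, y_seq = make_windows(features, targets, timesteps=4)
print(X_seq.shape, y_seq.shape)
```

With such windows, the first `timesteps - 1` rows are lost, and the input shape of the first layer would become `(timesteps, n_features)` instead of `(1, n_features)`.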

# Metrics and plots

In [None]:
# Metrics and plots
# This part of the code is not clean, some measures are computed twice

y_pred = model_air.predict(X_test_lstm)
print(y_pred)
# print("y test", y_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE (air) = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")


# Check for overfitting by comparing train and test performance
y_train_pred = model_air.predict(X_train_lstm)
train_mae = mean_absolute_error(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

print("\n[Overfitting Check]")
print(f"Train - MAE: {train_mae:.2f}, RMSE: {train_rmse:.2f}, R²: {train_r2:.3f}")
print(f"Test  - MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.3f}")
  

# Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_test, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"{target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()


# Training Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[:split_idx], y=y_train, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[:split_idx], y=y_train_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"[TRAINING PHASE] {target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()

101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 628us/step
[18.941256  9.415594 20.452778 ...  7.360422 14.177489 13.898617]
MAE (air) = 12.23, RMSE = 16.62, R² = -0.527
402/402 ━━━━━━━━━━━━━━━━━━━━ 0s 544us/step

[Overfitting Check]
Train - MAE = 3.45, R² = 0.887
Test  - MAE = 12.23, R² = -0.527
Difference - MAE = 8.78, R² = -1.414
Train - MAE: 3.45, RMSE: 4.80, R²: 0.887
Test  - MAE: 12.23, RMSE: 16.62, R²: -0.527
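
A negative test R² means the model does worse on the test period than a constant prediction equal to the test mean. The sketch below recomputes the three metrics by hand with NumPy on invented values to show how R² can go below zero; `regression_metrics` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                       # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

y_true = np.array([10.0, 12.0, 14.0, 16.0])
close_pred = np.array([11.0, 12.0, 13.0, 16.0])     # close to the truth
reversed_pred = np.array([16.0, 14.0, 12.0, 10.0])  # trend predicted backwards
mean_pred = np.full(4, y_true.mean())               # constant baseline

for name, pred in [("close", close_pred), ("reversed", reversed_pred), ("mean", mean_pred)]:
    mae, rmse, r2 = regression_metrics(y_true, pred)
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")
```

Here the train R² of 0.887 against a test R² of -0.527 is the kind of gap the overfitting check is meant to reveal, although the seasonal change between the two periods may also contribute.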
