# Regression model for traffic and air quality prediction in time

This notebook explains and performs the training of two models that predict a state H hours in the future (H is configurable).
It sets up a clear and simple experimental environment for machine learning experiments (train/test datasets, evaluation metrics...).
I chose the sklearn API for its ease of use; the regressors themselves are LightGBM's sklearn-compatible `LGBMRegressor`.

## Regression Models for Traffic and Air Quality

This notebook contains regression models that predict continuous values for traffic density and air pollution levels using historical data and feature engineering.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from lightgbm import LGBMRegressor
import plotly.graph_objects as go

Loading the dataset (created by `dataset_creation.ipynb`)

In [2]:
df = pd.read_pickle("created_dataset.pkl")
df.head()

Unnamed: 0,date,hour,traffic_z0,traffic_z1,traffic_z3,traffic_z4,traffic_z5,traffic_z6,traffic_z7,traffic_z8,...,station_58,datetime_hour,temperature,precipitation,humidity,pressure,cloud_cover,is_raining,wind_u,wind_v
0,2023-01-01,0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,...,10.0,2023-01-01 00:00:00,9.7,0.0,86,1023.2,100,0,-6.464392,0.679435
1,2023-01-01,1,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,10.0,2023-01-01 01:00:00,8.7,0.0,91,1023.3,99,0,-5.724745,1.427339
2,2023-01-01,2,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,7.0,2023-01-01 02:00:00,8.2,0.0,89,1023.4,100,0,-4.664265,2.914556
3,2023-01-01,3,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,...,8.0,2023-01-01 03:00:00,7.7,0.0,89,1023.1,100,0,-4.289849,3.599611
4,2023-01-01,4,3.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,...,8.0,2023-01-01 04:00:00,6.9,0.0,91,1022.9,100,0,-3.985826,2.490621


We will predict both traffic and air quality. The following cell sets the forecast horizon, i.e. how many hours ahead the predictions are made.

In [3]:
forecast_h = 12 # Prediction will be made for forecast_h hours in the future

# Predicting traffic

In [4]:
# Identifying the 'traffic_zN' columns
traffic_cols = [c for c in df.columns if c.startswith("traffic_")]

# Creating the target value (value forecast_h hours after) for every traffic area
for col in traffic_cols:
    df[f"target_{col}"] = df[col].shift(-forecast_h)

# This function adds lags (values from previous hours) and rolls (rolling means over the last w hours, current hour included)
def add_lags_and_rolls(df, cols, lags=(1, 2, 3, 6, 12), rolls=(3, 6, 12)):
    for col in cols:
        for lag in lags:
            df[f"{col}_lag{lag}"] = df[col].shift(lag)
        for w in rolls:
            df[f"{col}_roll{w}"] = df[col].rolling(window=w).mean()
    return df

df = add_lags_and_rolls(df, traffic_cols)
df["dayofweek"] = df["datetime_hour"].dt.dayofweek
df = df.dropna().reset_index(drop=True)
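
To make the feature construction above concrete, here is a small sketch on a toy series. The data are made up and the horizon is shortened to 2 hours so the effect is visible on six rows; only the column naming mirrors the notebook.

```python
import pandas as pd

forecast_h = 2  # toy horizon, much shorter than the 12 hours used in the notebook

# Hypothetical hourly series for a single zone
toy = pd.DataFrame({"traffic_z0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Target: the value observed forecast_h rows later
toy["target_traffic_z0"] = toy["traffic_z0"].shift(-forecast_h)
# Lag: the value observed 1 row earlier
toy["traffic_z0_lag1"] = toy["traffic_z0"].shift(1)
# Rolling mean over the current row and the 2 previous ones
toy["traffic_z0_roll3"] = toy["traffic_z0"].rolling(window=3).mean()

print(toy)

# Rows with any NaN are unusable: the first rows have no history yet,
# and the last forecast_h rows have no future value yet
usable = toy.dropna().reset_index(drop=True)
print(usable)
```

Only the rows that have both enough history and a known future value survive the `dropna()`, which is why the real dataset loses a few rows at each end.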

Training of one model per zone

In [None]:
models_traffic = {}
scalers_traffic = {}
for col in traffic_cols:
    # Features : retrieve lags, rolls, hour and day of week
    feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
    X = df[feature_cols]
    y = df[f"target_{col}"]

    # Split train/test
    split_idx = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Normalisation
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    scalers_traffic[col] = scaler

    # Model
    model = LGBMRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
    # Training of the model
    model.fit(X_train_scaled, y_train)
    models_traffic[col] = model # Save the trained model for this traffic area

    # Metrics
    y_pred = model.predict(X_test_scaled)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"[TRAFFIC] {col}: MAE = {mae:.2f}, R² = {r2:.3f}")

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_test,
        mode='lines',
        name='Real values',
        line=dict(color='blue')
    ))
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_pred,
        mode='lines',
        name='Predictions',
        line=dict(color='red', dash='dash')
    ))
    fig.update_layout(
        title=f"Traffic prediction in ({col}) for +{forecast_h}h",
        xaxis_title="Date",
        yaxis_title="Traffic",
        hovermode="x unified",
        template="plotly_white"
    )
    fig.show()


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000280 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1272
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.492577




[TRAFFIC] traffic_z0: MAE = 0.37, R² = 0.324


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000309 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 971
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.163819
[TRAFFIC] traffic_z1: MAE = 0.28, R² = 0.658



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000193 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1033
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.286975
[TRAFFIC] traffic_z3: MAE = 0.35, R² = 0.584



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000208 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1045
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.870782
[TRAFFIC] traffic_z4: MAE = 0.49, R² = -0.340



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000197 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 949
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.452476
[TRAFFIC] traffic_z5: MAE = 0.23, R² = 0.513



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000201 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1173
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.288416
[TRAFFIC] traffic_z6: MAE = 0.96, R² = 0.407



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000213 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1023
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 1.986092
[TRAFFIC] traffic_z7: MAE = 0.23, R² = 0.730



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000189 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1054
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.027489
[TRAFFIC] traffic_z8: MAE = 0.24, R² = 0.696



X does not have valid feature names, but LGBMRegressor was fitted with feature names
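
A negative R², as for `traffic_z4`, means the model does worse on the test period than a constant prediction equal to the test mean. Naive baselines help put such scores in context. The sketch below is not part of the original pipeline: it uses a synthetic signal and a hypothetical persistence forecast (the value `forecast_h` hours ahead is assumed equal to the current value), next to a constant-mean baseline.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

forecast_h = 12
rng = np.random.default_rng(0)

# Synthetic hourly signal with a 24-hour cycle plus noise (illustrative only)
hours = np.arange(24 * 30)
series = 2.5 + np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.1, hours.size)

# Persistence baseline: the forecast for hour t + forecast_h is the value at hour t
y_true = series[forecast_h:]
y_persist = series[:-forecast_h]

# Constant baseline: always predict the mean of the evaluated period
y_const = np.full_like(y_true, y_true.mean())

mae_persist = mean_absolute_error(y_true, y_persist)
r2_persist = r2_score(y_true, y_persist)
r2_const = r2_score(y_true, y_const)

print(f"persistence:   MAE = {mae_persist:.2f}, R² = {r2_persist:.3f}")
print(f"constant mean: R² = {r2_const:.3f}")
```

In this toy case the 12-hour horizon is half the daily cycle, so persistence is in anti-phase with the truth and its R² is negative, while the constant mean sits at zero. Comparing each zone's model against baselines like these could show whether a low score comes from the model or from the zone being hard to predict.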



# Predicting air quality
First, train the model with the real traffic values from the dataset. A production model would instead use the values predicted by the traffic model; however, training the air quality model on predicted (biased) traffic values would lead to poorer results.

In [6]:
# Identify the station cols
station_cols = [c for c in df.columns if c.startswith("station_")]
weather_cols = ["temperature", "wind_u", "wind_v", "precipitation", "is_raining", "humidity", "pressure", "cloud_cover"]
weather_cols = [c for c in weather_cols if c in df.columns]

Creation of the target column: this is the value we will try to predict for a given hour. It is either the air quality measured by station 4 `forecast_h` hours later, or the mean of all station measurements `forecast_h` hours later. The following cell selects which one to use.

In [7]:
# Target: Can be either the mean of all the stations, or one specific station.

# For the mean, uncomment this
#target, target_label = df[station_cols].mean(axis=1), "Average PM10 levels"

# For station 4, uncomment this (single brackets select a Series, like the mean above)
target, target_label = df["station_4"], "PM10 levels in station 4"
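
As a quick illustration of the two target options, here is a toy sketch with made-up PM10 readings. The station names and values are hypothetical; the point is only that the row-wise mean skips missing stations by default, whereas a single station keeps its own gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical PM10 readings; station_7 is missing for the second hour
toy = pd.DataFrame({
    "station_4": [10.0, 20.0, 30.0],
    "station_7": [30.0, np.nan, 50.0],
})
station_cols = [c for c in toy.columns if c.startswith("station_")]

# Option 1: mean over all stations (NaN values are skipped by default)
target_mean = toy[station_cols].mean(axis=1)

# Option 2: a single station, selected as a Series
target_single = toy["station_4"]

print(target_mean.tolist())
print(target_single.tolist())
```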

In [None]:
df["air_quality_selected"] = target
df["target_air_quality"] = df["air_quality_selected"].shift(-forecast_h)


# Create lags/rolling means for air quality and weather
df = add_lags_and_rolls(df, ["air_quality_selected"] + weather_cols)
# dropna removes the first rows (no lag history yet) and the last forecast_h rows (no future target yet)
df = df.dropna().reset_index(drop=True)


traffic_features = traffic_cols



In [9]:
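# Get the created feature names
# Note: c.split("_")[0] only matches one-word weather names (temperature, precipitation,
# humidity, pressure), so lags/rolls of wind_u, wind_v, is_raining and cloud_cover are
# not selected. This matches the 50 features reported in the training log.
feature_cols_air = (
    traffic_features +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and "air_quality" in c] +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and c.split("_")[0] in weather_cols] +
    ["hour", "dayofweek"]
)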
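
The weather filter above keeps a column only when the text before its first underscore is a weather name, so multi-word names such as `wind_u` or `cloud_cover` never match. The sketch below reproduces that behaviour on a few column names and shows one possible alternative based on prefix matching. The helper `is_derived_from` is hypothetical and not part of the notebook, and switching to it would change the feature set and therefore the results.

```python
weather_cols = ["temperature", "wind_u", "wind_v", "precipitation",
                "is_raining", "humidity", "pressure", "cloud_cover"]

# A few column names of the kind produced by add_lags_and_rolls
columns = ["temperature_lag1", "wind_u_lag1", "cloud_cover_roll3",
           "is_raining_lag2", "pressure_roll12", "hour"]

# Filter used in the notebook: compares only the text before the first "_"
first_token_match = [
    c for c in columns
    if any(x in c for x in ["_lag", "_roll"]) and c.split("_")[0] in weather_cols
]


def is_derived_from(column, bases):
    """Return True if column starts with '<base>_lag' or '<base>_roll' for some base."""
    return any(
        column.startswith(f"{b}_lag") or column.startswith(f"{b}_roll")
        for b in bases
    )


# Alternative: match on the full base name, so multi-word names are kept
prefix_match = [c for c in columns if is_derived_from(c, weather_cols)]

print(first_token_match)
print(prefix_match)
```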

Splitting the dataset into a training set and a testing set.
Since the data must stay in chronological order, I cannot fill the two sets by random sampling. I chose the first 80% of the timeframe for training and the remaining 20% for testing. This has some flaws (for example, the seasons change between the two periods); I will try other approaches later.

In [10]:
# Split train/test
X = df[feature_cols_air]
y = df["target_air_quality"]
split_idx = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]


# Align y with X on the index (a no-op here: both come from the same rows of df, so nothing is removed)
y_test = y_test.loc[X_test.index]  # Have the same indexes as X_test
y_train = y_train.loc[X_train.index]

X_train.head()

Unnamed: 0,traffic_z0,traffic_z1,traffic_z3,traffic_z4,traffic_z5,traffic_z6,traffic_z7,traffic_z8,air_quality_selected_lag1,air_quality_selected_lag2,...,pressure_lag1,pressure_lag2,pressure_lag3,pressure_lag6,pressure_lag12,pressure_roll3,pressure_roll6,pressure_roll12,hour,dayofweek
0,2.4,1.0,1.0,2.0,2.0,1.0,1.0,1.0,41.0,44.0,...,1021.9,1022.1,1022.0,1021.6,1023.6,1021.933333,1021.933333,1022.016667,0,0
1,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,34.0,41.0,...,1021.8,1021.9,1022.1,1021.8,1023.1,1021.833333,1021.933333,1021.908333,1,0
2,2.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,28.0,34.0,...,1021.8,1021.8,1021.9,1022.0,1022.7,1021.766667,1021.883333,1021.825,2,0
3,2.0,1.0,1.0,6.0,1.0,1.0,2.0,1.0,31.0,28.0,...,1021.7,1021.8,1021.8,1022.0,1021.9,1021.666667,1021.8,1021.791667,3,0
4,2.0,1.0,1.0,6.0,1.0,1.0,1.6,1.0,33.0,31.0,...,1021.5,1021.7,1021.8,1022.1,1021.7,1021.4,1021.616667,1021.733333,4,0
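
The single chronological 80/20 split evaluates the model on one period only. One possible way to test on several periods, while still keeping every training row strictly before the test rows, is scikit-learn's `TimeSeriesSplit`. The sketch below runs on a toy array standing in for the ordered rows of the dataset; `n_splits=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for the chronologically ordered rows of the dataset
X_toy = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds):
    # The training window grows with each fold; the test window always comes after it
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Averaging the metrics over the folds would give a less season-dependent estimate than a single split, at the cost of training the model several times.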


In [None]:
# Normalisation
scaler_air = StandardScaler()
X_train_scaled = scaler_air.fit_transform(X_train)
X_test_scaled = scaler_air.transform(X_test)





# from sklearn.model_selection import RandomizedSearchCV
# import numpy as np

# # Define the range of parameters to test
# param_dist = {
#     'n_estimators': [500, 1000, 2000],
#     'learning_rate': [0.01, 0.05, 0.1],
#     'num_leaves': [20, 31, 50, 100],
#     'max_depth': [-1, 10, 20],
#     'alpha': [0.7, 0.8, 0.9] # Testing different "levels" of boldness
# }

# random_search = RandomizedSearchCV(
#     estimator=LGBMRegressor(objective='quantile'),
#     param_distributions=param_dist,
#     n_iter=20, # Try 20 random combinations
#     cv=3,      # 3-fold cross-validation
#     scoring='neg_mean_absolute_error',
#     verbose=1,
#     n_jobs=-1
# )

# random_search.fit(X_train_scaled, y_train)
# print(f"Best parameters: {random_search.best_params_}")




# # Actual model for air quality prediction
# model_air = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
# model_air.fit(X_train_scaled, y_train)




# Quantile approach
model_air = LGBMRegressor(
    objective='quantile',
    alpha=0.7,  # Quantile level: predict the 70th percentile
    n_estimators=2000, 
    learning_rate=0.01,
    num_leaves=1000, 
    max_depth=10,
)
import lightgbm as lgb
# Fit with callback
model_air.fit(
    X_train_scaled, y_train,
    eval_set=[(X_test_scaled, y_test)],  # Early stopping needs a validation set; reusing the test set makes the test metrics optimistic
    eval_metric='quantile',
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=10) # Optional: prints progress every 10 trees
    ]
)




# from sklearn.svm import SVR
# from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# # 1. Initialize SVR
# # 'rbf' (Radial Basis Function) is the most common kernel for non-linear data
# svr_model = SVR(kernel='rbf')

# # 2. Define Hyperparameter Space
# # C: Regularization (high C = low tolerance for errors, risk of overfitting)
# # epsilon: The width of the "tube" where no penalty is given
# # gamma: Defines how far the influence of a single training example reaches
# param_dist_svr = {
#     'C': [0.1, 0.5, 1, 2, 3, 10],
#     'epsilon': [0.01, 0.1, 0.05, 0.2, 1],
#     'gamma': [0.01, 0.02, 0.005]
# }

# # 3. TimeSeries aware search
# tscv = TimeSeriesSplit(n_splits=5)

# random_search_svr = RandomizedSearchCV(
#     estimator=svr_model,
#     param_distributions=param_dist_svr,
#     n_iter=15, # SVR is slower, so we try fewer combinations
#     cv=tscv,
#     scoring='neg_mean_absolute_error',
#     verbose=2,
#     n_jobs=-1
# )

# # 4. Fit
# random_search_svr.fit(X_train_scaled, y_train)

# print(f"Best SVR params: {random_search_svr.best_params_}")
# model_air = random_search_svr.best_estimator_








# Metrics and plots
y_pred = model_air.predict(X_test_scaled) # test predictions
# print("X test scaled", X_test_scaled)
# print("y test", y_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE (air) = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")


# Check for overfitting by comparing train and test performance
print("\n[Overfitting Check]")

y_train_pred = model_air.predict(X_train_scaled)
y_test_pred = y_pred  # test predictions already computed above

train_mae = mean_absolute_error(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

test_mae = mean_absolute_error(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)

mae_diff = test_mae - train_mae
r2_diff = train_r2 - test_r2  # Positive = overfitting

print(f"Train - MAE: {train_mae:.2f}, RMSE: {train_rmse:.2f}, R²: {train_r2:.3f}")
print(f"Test  - MAE: {test_mae:.2f}, RMSE: {test_rmse:.2f}, R²: {test_r2:.3f}")
print(f"Difference - MAE: {mae_diff:.2f}, R²: {r2_diff:.3f}")

# Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_test, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"{target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()


# Training Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[:split_idx], y=y_train, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[:split_idx], y=y_train_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"[TRAINING PHASE] {target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001622 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8070
[LightGBM] [Info] Number of data points in the train set: 12826, number of used features: 50
[LightGBM] [Info] Start training from score 28.000000
Training until validation scores don't improve for 50 rounds
[10]	valid_0's quantile: 4.6508
[20]	valid_0's quantile: 4.50507
[30]	valid_0's quantile: 4.38774
[40]	valid_0's quantile: 4.29003
[50]	valid_0's quantile: 4.20615
[60]	valid_0's quantile: 4.1321
[70]	valid_0's quantile: 4.07359
[80]	valid_0's quantile: 4.02258
[90]	valid_0's quantile: 3.97847
[100]	valid_0's quantile: 3.94331
[110]	valid_0's quantile: 3.91232
[120]	valid_0's quantile: 3.88527
[130]	valid_0's quantile: 3.86293
[140]	valid_0's quantile: 3.8443
[150]	valid_0's quantile: 3.82709
[160]	valid_0's quantile: 3.81151
[170]	valid_0's quantile: 3.79641
[180]	valid_0's quantile: 3.78


X does not have valid feature names, but LGBMRegressor was fitted with feature names


X does not have valid feature names, but LGBMRegressor was fitted with feature names


X does not have valid feature names, but LGBMRegressor was fitted with feature names




[Overfitting Check]
Train - MAE = 5.28, R² = 0.615
Test  - MAE = 8.95, R² = 0.223
Difference - MAE = 3.67, R² = -0.393
Train - MAE: 5.28, RMSE: 8.85, R²: 0.615
Test  - MAE: 8.95, RMSE: 11.82, R²: 0.223



X does not have valid feature names, but LGBMRegressor was fitted with feature names
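
The quantile objective used for the air quality model minimises the pinball loss, which is the `quantile` metric shown in the training log. With `alpha=0.7`, an under-prediction is weighted by 0.7 and an over-prediction by 0.3, so the fitted model aims at the 70th percentile rather than the mean. Here is a minimal NumPy sketch on made-up values; the function name `pinball_loss` is mine, not a LightGBM API.

```python
import numpy as np


def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for a quantile level alpha in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha * diff,
    # over-prediction (diff < 0) costs (1 - alpha) * |diff|
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))


y_true = np.array([30.0, 30.0])
under = pinball_loss(y_true, y_true - 10, alpha=0.7)  # predictions 10 too low
over = pinball_loss(y_true, y_true + 10, alpha=0.7)   # predictions 10 too high

print(f"under-prediction loss: {under:.2f}")
print(f"over-prediction loss:  {over:.2f}")
```

Because the model is deliberately biased upwards, its MAE and R² are expected to be somewhat worse than those of a model trained on a symmetric loss, which is worth keeping in mind when reading the scores above.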

