# Regression model for traffic and air quality prediction in time

This notebook explains and performs the training of two models for predicting a state hours (the number is configurable) in the future.
It defines a clear and simple way of setting up the experimental environnement for machine learning experiment (train/test datasets, evaluation metrics...).
I chose to use sklearn for its ease of use.

In [34]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
from lightgbm import LGBMRegressor
import plotly.graph_objects as go

Loading the dataset

In [35]:
df=pd.read_pickle("final_dataset.pkl")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7321 entries, 0 to 7320
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             7321 non-null   object        
 1   hour             7321 non-null   int32         
 2   traffic_z0       7321 non-null   float64       
 3   traffic_z1       7321 non-null   float64       
 4   traffic_z3       7321 non-null   float64       
 5   traffic_z4       7321 non-null   float64       
 6   traffic_z5       7321 non-null   float64       
 7   traffic_z6       7321 non-null   float64       
 8   traffic_z7       7321 non-null   float64       
 9   traffic_z8       7321 non-null   float64       
 10  datetime_hour_x  7321 non-null   datetime64[ns]
 11  station_4        7321 non-null   float64       
 12  station_43       7321 non-null   float64       
 13  station_54       7321 non-null   float64       
 14  station_58       7321 non-null   float64

We will predict both traffic and air quality in the future. Here is the number of hours in the future the prediction will be made.

In [36]:
forecast_h = 12 # Prediction will be made for forecast_h hours in the future

# Predicting traffic

In [37]:
# Identifying the 'traffic_zN' columns
traffic_cols = [c for c in df.columns if c.startswith("traffic_")]

# Creating the target value (value forecast_h hours after) for every traffic area
for col in traffic_cols:
    df[f"target_{col}"] = df[col].shift(-forecast_h)

# This functions add lags (values from previous hours) and rolls (previous rolling means over time windows)
def add_lags_and_rolls(df, cols, lags=[1, 2, 3, 6, 12], rolls=[3, 6, 12]):
    for col in cols:
        for lag in lags:
            df[f"{col}_lag{lag}"] = df[col].shift(lag)
        for w in rolls:
            df[f"{col}_roll{w}"] = df[col].rolling(window=w).mean()
    return df

df = add_lags_and_rolls(df, traffic_cols)
df["dayofweek"] = df["datetime_hour"].dt.dayofweek
df = df.dropna().reset_index(drop=True)

Training of one model per zone

In [38]:
models_traffic = {}
scalers_traffic = {}
for col in traffic_cols:
    # Features : lags/rolling de la zone + temporel
    feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
    X = df[feature_cols]
    y = df[f"target_{col}"]

    # Split train/test
    split_idx = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Normalisation
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    scalers_traffic[col] = scaler

    # Model
    model = LGBMRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
    model.fit(X_train_scaled, y_train)
    models_traffic[col] = model

    # Metrics
    y_pred = model.predict(X_test_scaled)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"[TRAFFIC] {col}: MAE = {mae:.2f}, R² = {r2:.3f}")

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_test,
        mode='lines',
        name='VReal values',
        line=dict(color='blue')
    ))
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_pred,
        mode='lines',
        name='Prédictions',
        line=dict(color='red', dash='dash')
    ))
    fig.update_layout(
        title=f"Traffic prediction in ({col}) for +{forecast_h}h",
        xaxis_title="Date",
        yaxis_title="Traffic",
        hovermode="x unified",
        template="plotly_white"
    )
    fig.show()


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000103 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1133
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.516661
[TRAFFIC] traffic_z0: MAE = 0.39, R² = 0.273



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000196 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.099529
[TRAFFIC] traffic_z1: MAE = 0.28, R² = 0.641



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000164 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 776
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.241922
[TRAFFIC] traffic_z3: MAE = 0.32, R² = 0.716



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000152 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 823
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.799486
[TRAFFIC] traffic_z4: MAE = 0.36, R² = 0.041



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000258 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 819
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.489430
[TRAFFIC] traffic_z5: MAE = 0.20, R² = 0.540



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000160 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1167
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 2.640372
[TRAFFIC] traffic_z6: MAE = 0.61, R² = 0.694



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000096 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 863
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 1.970764
[TRAFFIC] traffic_z7: MAE = 0.23, R² = 0.749



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000156 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 742
[LightGBM] [Info] Number of data points in the train set: 5837, number of used features: 10
[LightGBM] [Info] Start training from score 1.943927
[TRAFFIC] traffic_z8: MAE = 0.25, R² = 0.727



X does not have valid feature names, but LGBMRegressor was fitted with feature names



# Predicting air quality
First with real traffic values. (Later I will use the predicted ones)

In [39]:
# Identify the station cols
station_cols = [c for c in df.columns if c.startswith("station_")]
weather_cols = ["temperature", "wind_u", "wind_v", "precipitation", "is_raining", "humidity", "pressure", "cloud_cover"]
weather_cols = [c for c in weather_cols if c in df.columns]

Creation of the `target column`: this is the value we will try to predict for a given hour. It is either the air quality measured by station 4 in `forecast_h` hours, or the mean of all the measures in `forecast_h` hours. the following cell allows to choose.

In [40]:
# Target: Can be either the mean of all the stations, or one specific station.

# For the mean, uncomment this
#target, target_label = df[station_cols].mean(axis=1), "Average PM10 levels"

# For the station 4, uncomment this
target, target_label = df[['station_4']], "PM10 levels in station 4"

In [41]:
df["air_quality_selected"] = target
df["target_air_quality"] = df["air_quality_selected"].shift(-forecast_h)


# Create Lags/rolling pour la qualité de l'air et la météo
df = add_lags_and_rolls(df, ["air_quality_selected"] + weather_cols)
df = df.dropna().reset_index(drop=True) # By creating lags, we will loose forecast_h rows

"""# Ajout des prédictions de trafic (ou des données réelles)
use_predicted_traffic = True  # Option true/False
 if use_predicted_traffic:
    for col in traffic_cols:
        # Prédire le trafic pour toute la période (exemple simplifié..)
        feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
        X_traffic = df[feature_cols]
        X_traffic_scaled = scalers_traffic[col].transform(X_traffic)
        df[f"predicted_{col}"] = models_traffic[col].predict(X_traffic_scaled)
    traffic_features = [f"predicted_{col}" for col in traffic_cols]
else:
    traffic_features = traffic_cols """
traffic_features = traffic_cols

In [42]:
# Get the created features names
feature_cols_air = (
    traffic_features +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and "air_quality" in c] +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and c.split("_")[0] in weather_cols] +
    ["hour", "dayofweek"]
)

Splitting the dataset into a training dataset and testing other.
Since I need continous time, I can not pick randomly to fill these two new datasets. I chose to take the first 80% of the timeframe as training, and the remaining part as testing. This has some flaws (for example, the seasons change), I will try other ways later.

In [43]:
# Split train/test
X = df[feature_cols_air]
y = df["target_air_quality"]
split_idx = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]


# Important: we removed some rows from X, we have to remove the corresponding ones from Y
y_test = y_test.loc[X_test.index]  # Have the same indexes as X_test
y_train = y_train.loc[X_train.index]

In [44]:
# Normalisation
scaler_air = StandardScaler()
X_train_scaled = scaler_air.fit_transform(X_train)
X_test_scaled = scaler_air.transform(X_test)

# Actual model for air quality prediction
model_air = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model_air.fit(X_train_scaled, y_train)

# Metrics and plots
y_pred = model_air.predict(X_test_scaled) # test predictions
# print("X test scaled", X_test_scaled)
# print("y test", y_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_absolute_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE (air) = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")

# Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_test, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"{target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000926 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7681
[LightGBM] [Info] Number of data points in the train set: 5818, number of used features: 50
[LightGBM] [Info] Start training from score 22.861121
MAE (air) = 8.02, RMSE = 2.83, R² = 0.064



X does not have valid feature names, but LGBMRegressor was fitted with feature names

