# Regression model for traffic and air quality prediction in time

<span style="color:red">**This notebook is mostly the same as `regression_main.ipynb`, except that it adds a data-balancing cell: the training set keeps only the pollution spikes and the 24 hours before each of them, plus a sample of "normal" data.**</span>

This notebook explains and performs the training of two models that predict a state H hours in the future (H is configurable).
It defines a clear and simple way of setting up the experimental environment for a machine learning experiment (train/test datasets, evaluation metrics...).
I chose to use sklearn for its ease of use.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from lightgbm import LGBMRegressor
import plotly.graph_objects as go

Loading the dataset (created by `dataset_creation.ipynb`)

In [2]:
df=pd.read_pickle("created_dataset.pkl")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16081 entries, 0 to 16080
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           16081 non-null  object        
 1   hour           16081 non-null  int32         
 2   traffic_z0     16081 non-null  float64       
 3   traffic_z1     16081 non-null  float64       
 4   traffic_z3     16081 non-null  float64       
 5   traffic_z4     16081 non-null  float64       
 6   traffic_z5     16081 non-null  float64       
 7   traffic_z6     16081 non-null  float64       
 8   traffic_z7     16081 non-null  float64       
 9   traffic_z8     16081 non-null  float64       
 10  station_4      16081 non-null  float64       
 11  station_43     16081 non-null  float64       
 12  station_44     16081 non-null  float64       
 13  station_54     16081 non-null  float64       
 14  station_57     16081 non-null  float64       
 15  station_58     1608

We will predict both traffic and air quality in the future. `forecast_h` sets how many hours ahead each prediction is made.

In [3]:
forecast_h = 12 # Prediction will be made for forecast_h hours in the future
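
As a minimal sketch of what the horizon means in practice, the toy example below (hypothetical values, with `horizon` standing in for `forecast_h`) shows how a negative shift aligns each row with the value observed `horizon` rows later, and why the last rows end up without a target.

```python
import pandas as pd

# Toy hourly series (hypothetical values), one row per hour
toy = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15]})
horizon = 2  # stands in for forecast_h

# The target at row t is the value observed at row t + horizon
toy["target"] = toy["value"].shift(-horizon)

# The last `horizon` rows have no future value, so they cannot be used for training
usable = toy.dropna()
print(toy)
print(f"{len(usable)} usable rows out of {len(toy)}")
```

With `forecast_h = 12`, the same mechanism leaves the last 12 hourly rows of the dataset without a target.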

# Predicting traffic

In [4]:
# Identifying the 'traffic_zN' columns
traffic_cols = [c for c in df.columns if c.startswith("traffic_")]

# Creating the target value (value forecast_h hours after) for every traffic area
for col in traffic_cols:
    df[f"target_{col}"] = df[col].shift(-forecast_h)

# This function adds lags (values from previous hours) and rolls (rolling means over a window ending at the current hour)
def add_lags_and_rolls(df, cols, lags=(1, 2, 3, 6, 12), rolls=(3, 6, 12)):
    for col in cols:
        for lag in lags:
            df[f"{col}_lag{lag}"] = df[col].shift(lag)
        for w in rolls:
            df[f"{col}_roll{w}"] = df[col].rolling(window=w).mean()
    return df

df = add_lags_and_rolls(df, traffic_cols)
df["dayofweek"] = df["datetime_hour"].dt.dayofweek
df = df.dropna().reset_index(drop=True)
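
The lag and rolling features can be checked on a toy column. This is only an illustrative sketch: the column `x` and the extra `x_roll3_past` variant are assumptions, not part of the notebook. It shows that `rolling(window=w).mean()` includes the current row, which is acceptable here because the current value is known at prediction time, and how a shifted variant would restrict the window to strictly past rows.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Lag: the value observed 2 rows earlier
toy["x_lag2"] = toy["x"].shift(2)

# Rolling mean over rows t-2..t (the current row is part of the window)
toy["x_roll3"] = toy["x"].rolling(window=3).mean()

# Hypothetical variant: rolling mean over rows t-3..t-1 (strictly past values)
toy["x_roll3_past"] = toy["x"].shift(1).rolling(window=3).mean()

print(toy)
```

The first rows contain NaN because there is not enough history yet, which is why the notebook calls `dropna()` after building the features.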

Training one model per traffic zone

In [5]:
models_traffic = {}
scalers_traffic = {}
for col in traffic_cols:
    # Features : retrieve lags, rolls, hour and day of week
    feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
    X = df[feature_cols]
    y = df[f"target_{col}"]

    # Split train/test
    split_idx = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Normalisation
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    scalers_traffic[col] = scaler

    # Model
    model = LGBMRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
    # Training of the model
    model.fit(X_train_scaled, y_train)
    models_traffic[col] = model # Save the trained model for this traffic area

    # Metrics
    y_pred = model.predict(X_test_scaled)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"[TRAFFIC] {col}: MAE = {mae:.2f}, R² = {r2:.3f}")

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_test,
        mode='lines',
        name='Real values',
        line=dict(color='blue')
    ))
    fig.add_trace(go.Scatter(
        x=df["datetime_hour"].iloc[split_idx:],
        y=y_pred,
        mode='lines',
        name='Predictions',
        line=dict(color='red', dash='dash')
    ))
    fig.update_layout(
        title=f"Traffic prediction in ({col}) for +{forecast_h}h",
        xaxis_title="Date",
        yaxis_title="Traffic",
        hovermode="x unified",
        template="plotly_white"
    )
    fig.show()


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000179 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1272
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.492577




[TRAFFIC] traffic_z0: MAE = 0.37, R² = 0.324


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000119 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 971
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.163819
[TRAFFIC] traffic_z1: MAE = 0.28, R² = 0.658



X does not have valid feature names, but LGBMRegressor was fitted with feature names



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000186 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1033
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.286975
[TRAFFIC] traffic_z3: MAE = 0.35, R² = 0.584






[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000199 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1045
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.870782
[TRAFFIC] traffic_z4: MAE = 0.49, R² = -0.340






[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000319 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 949
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.452476
[TRAFFIC] traffic_z5: MAE = 0.23, R² = 0.513






[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000260 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1173
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.288416
[TRAFFIC] traffic_z6: MAE = 0.96, R² = 0.407






[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000130 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1023
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 1.986092
[TRAFFIC] traffic_z7: MAE = 0.23, R² = 0.730






[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1054
[LightGBM] [Info] Number of data points in the train set: 12845, number of used features: 10
[LightGBM] [Info] Start training from score 2.027489
[TRAFFIC] traffic_z8: MAE = 0.24, R² = 0.696
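
A negative R², such as the one reported for `traffic_z4`, means the model does worse on the test period than a constant prediction equal to the test mean. One way to put these scores in context is to compare them with a naive persistence forecast ("the value in `horizon` hours equals the current value"). The sketch below is an assumption-laden illustration: `persistence_scores` is a hypothetical helper and the signal is synthetic, not taken from the dataset.

```python
import numpy as np

def persistence_scores(values, horizon):
    """Score the naive forecast 'value at t + horizon equals value at t'."""
    values = np.asarray(values, dtype=float)
    y_true = values[horizon:]    # what was actually observed at t + horizon
    y_pred = values[:-horizon]   # last known value at time t
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, r2

# Synthetic signal with a 24-hour cycle (hypothetical data)
hours = np.arange(24 * 14)
signal = 2.0 + np.sin(2 * np.pi * hours / 24)

mae_12, r2_12 = persistence_scores(signal, horizon=12)
mae_24, r2_24 = persistence_scores(signal, horizon=24)
print(f"+12h: MAE = {mae_12:.2f}, R² = {r2_12:.2f}")
print(f"+24h: MAE = {mae_24:.2f}, R² = {r2_24:.2f}")
```

On a strongly daily signal, persistence at +12h is poor because it predicts the opposite phase of the cycle, while persistence at +24h is nearly perfect. Applying the same helper to the test part of each traffic column would give a reference value for the MAE and R² printed above.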






# Predicting air quality
First, train the model with real traffic values from the dataset. A production model would use the values predicted by the traffic prediction model instead; however, training the air quality model on predicted (biased) traffic values would lead to poorer results.

In [6]:
# Identify the station cols
station_cols = [c for c in df.columns if c.startswith("station_")]
weather_cols = ["temperature", "wind_u", "wind_v", "precipitation", "is_raining", "humidity", "pressure", "cloud_cover"]
weather_cols = [c for c in weather_cols if c in df.columns]

Creation of the target column: this is the value we will try to predict for a given hour. It is either the air quality measured by station 4 `forecast_h` hours later, or the mean of all station measurements `forecast_h` hours later. The following cell selects which one is used.

In [7]:
# Target: Can be either the mean of all the stations, or one specific station.

# For the mean, uncomment this
#target, target_label = df[station_cols].mean(axis=1), "Average PM10 levels"

# For the station 4, uncomment this
target, target_label = df['station_4'], "PM10 levels in station 4"

In [8]:
df["air_quality_selected"] = target
df["target_air_quality"] = df["air_quality_selected"].shift(-forecast_h)


# Create lags/rolling means for air quality and weather
df = add_lags_and_rolls(df, ["air_quality_selected"] + weather_cols)
df = df.dropna().reset_index(drop=True) # Drops the first rows (no lag history yet) and the last forecast_h rows (no future target)

"""# Adding the traffic predictions (or the real data)
use_predicted_traffic = True  # Option True/False
if use_predicted_traffic:
    for col in traffic_cols:
        # Predict traffic over the whole period (simplified example)
        feature_cols = [c for c in df.columns if c.startswith(f"{col}_") or c in ["hour", "dayofweek"]]
        X_traffic = df[feature_cols]
        X_traffic_scaled = scalers_traffic[col].transform(X_traffic)
        df[f"predicted_{col}"] = models_traffic[col].predict(X_traffic_scaled)
    traffic_features = [f"predicted_{col}" for col in traffic_cols]
else:
    traffic_features = traffic_cols """
traffic_features = traffic_cols



In [9]:
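# Get the created feature names
# Note: c.split("_")[0] keeps only the text before the first underscore, so the lag/roll
# features of weather columns with an underscore in their name (e.g. wind_u) are not selected below.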
feature_cols_air = (
    traffic_features +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and "air_quality" in c] +
    [c for c in df.columns if any(x in c for x in ["_lag", "_roll"]) and c.split("_")[0] in weather_cols] +
    ["hour", "dayofweek"]
)
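
If the lag and rolling features of weather columns such as `wind_u` or `is_raining` should also be selected, one option is to split each column name at its last underscore instead of its first. The helper `select_derived_features` below is hypothetical and the column list is a toy example; this is a sketch of the idea, not the selection used for the results in this notebook.

```python
def select_derived_features(columns, base_names):
    """Return the lag/roll columns whose base name is one of `base_names`."""
    selected = []
    for c in columns:
        # "wind_u_lag1" -> base "wind_u", suffix "lag1"
        base, sep, suffix = c.rpartition("_")
        if sep and base in base_names and suffix.startswith(("lag", "roll")):
            selected.append(c)
    return selected

# Toy column names (hypothetical)
cols = ["temperature_lag1", "wind_u_lag1", "wind_u_roll3",
        "is_raining_lag12", "traffic_z0_lag1", "hour", "wind_u"]
weather = ["temperature", "wind_u", "is_raining"]

picked = select_derived_features(cols, weather)

# For comparison: matching on the text before the first underscore
by_first_token = [
    c for c in cols
    if ("_lag" in c or "_roll" in c) and c.split("_")[0] in weather
]

print(picked)
print(by_first_token)
```

Using such a selection would change the number of features given to the model, so the metrics would need to be recomputed.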

Splitting the dataset into a training set and a testing set.
Since the data must stay continuous in time, I cannot pick rows at random to fill these two sets. I chose the first 80% of the timeframe for training and the remaining 20% for testing. This has some flaws (for example, the seasons differ between the two periods).
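
The chronological split can be sketched as a small function. The optional `gap` parameter is an assumption added for illustration: because the target of a row is the value observed `forecast_h` hours later, the targets of the last training rows are values measured inside the test period, and leaving a gap of `forecast_h` rows is one possible way to avoid that overlap. The function name and the toy frame are hypothetical.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8, gap=0):
    """Split a time-ordered frame without shuffling, leaving `gap` rows unused before the test part."""
    split_idx = int(len(frame) * train_fraction)
    train = frame.iloc[: max(split_idx - gap, 0)]
    test = frame.iloc[split_idx:]
    return train, test

# Toy frame with 100 consecutive hours (hypothetical)
toy = pd.DataFrame({"t": range(100)})
train, test = chronological_split(toy, train_fraction=0.8, gap=12)

print(len(train), len(test))
print(train["t"].max(), test["t"].min())
```

With `gap=0` this reduces to the plain 80/20 split used in this notebook.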



<span style="color:red">In the training dataset, keep only the 24 hours before every pollution spike (I considered any value above 50 µg/m³ to be a spike), plus a random sample of non-spike blocks; the testing dataset is untouched.</span> Spikes are detected on `station_54`.
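
The "rows before a spike" selection relies on a rolling maximum computed on the reversed series. The sketch below illustrates the mechanism on a toy series with a shorter window; the function name `window_before_spikes` and the values are hypothetical.

```python
import pandas as pd

def window_before_spikes(series, threshold, hours_before):
    """Boolean mask: True on each spike row and on the `hours_before` rows preceding it."""
    is_spike = (series > threshold).astype(float)
    # Reversing the series turns "rows before a spike" into an ordinary trailing window
    reversed_max = is_spike.iloc[::-1].rolling(window=hours_before + 1, min_periods=1).max()
    return reversed_max.iloc[::-1].astype(bool)

# Toy PM10 series with a single spike at position 4 (hypothetical values)
pm10 = pd.Series([10, 12, 11, 15, 60, 14, 13, 12])
mask = window_before_spikes(pm10, threshold=50, hours_before=2)
print(mask.tolist())
```

Here the spike row and the two rows before it are selected, while the rows after the spike are not. With `hours_before=24` the window covers 25 rows per spike.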

In [10]:
import pandas as pd
import numpy as np

# 1. Split into train/test first
split_idx = int(len(df) * 0.8)
df_train = df.iloc[:split_idx].copy()
df_test = df.iloc[split_idx:].copy()

# 2. Identify Spike Windows (24h before + spike hour = 25 rows)
is_spike = (df_train['station_54'] > 50)
spike_mask = is_spike.iloc[::-1].rolling(window=25, min_periods=1).max().iloc[::-1].astype(bool)

# 3. Identify "Available" indices (those not in a spike window)
available_indices = df_train.index[~spike_mask].tolist()

# 4. Sample 3000 rows in 30h blocks (100 blocks total)
block_size = 30
num_blocks = 3000 // block_size
sampled_indices = []

# Safety check: ensure we have enough non-spike data
if len(available_indices) > 3000:
    for _ in range(num_blocks):
        # Pick a random starting point from available indices
        # We subtract block_size to ensure the full window fits
        start_idx = np.random.choice(available_indices[:-block_size])
        
        # Create a range of 30 consecutive indices
        # Note: In a real-world scenario, you'd check if these overlap 
        # with spikes, but for simplicity, we grab the sequence.
        block = list(range(start_idx, start_idx + block_size))
        sampled_indices.extend(block)
        
        # Remove these rows from the pool so they are not picked again as a starting point
        # (a later block can still overlap this one or a spike window)
        available_indices = [i for i in available_indices if i not in block]
else:
    print("Warning: Not enough non-spike data to pull 3000 rows.")
    sampled_indices = available_indices # Just take what's left

# 5. Combine Spike windows and Sampled blocks
# We use set() to handle any accidental overlaps
final_train_indices = sorted(list(set(df_train.index[spike_mask]) | set(sampled_indices)))
df_train_filtered = df_train.loc[final_train_indices]

# 6. Final Assignment
X_train = df_train_filtered[feature_cols_air]
y_train = df_train_filtered["target_air_quality"]

X_test = df_test[feature_cols_air]
y_test = df_test["target_air_quality"]

print(f"Training set: {len(df_train_filtered)} rows ({sum(spike_mask)} from spikes, {len(sampled_indices)} from background)")

Training set: 5475 rows (2770 from spikes, 3000 from background)
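
The two counts in this message add up to 5770, not 5475: the final set union removes rows that were counted twice, that is, background blocks that overlap a spike window or each other. If disjoint, reproducible background blocks are preferred, one possible approach is sketched below. It is an assumption, not the method used for the results in this notebook; `sample_clean_blocks`, its parameters, and the toy mask are hypothetical.

```python
import numpy as np

def sample_clean_blocks(mask, block_size, num_blocks, seed=0):
    """Pick up to `num_blocks` non-overlapping blocks of consecutive positions where `mask` is False."""
    rng = np.random.default_rng(seed)      # fixed seed so the sample is reproducible
    free = ~np.asarray(mask, dtype=bool)   # True where a row may still be sampled
    chosen = []
    for _ in range(num_blocks):
        # Number of free rows in every window of length block_size, via a cumulative sum
        csum = np.concatenate(([0], np.cumsum(free)))
        window_free = csum[block_size:] - csum[:-block_size]
        starts = np.flatnonzero(window_free == block_size)
        if starts.size == 0:
            break  # no fully free block is left
        start = int(rng.choice(starts))
        chosen.extend(range(start, start + block_size))
        free[start:start + block_size] = False  # these rows cannot be reused
    return sorted(chosen)

# Toy mask: positions 40..49 belong to a spike window (hypothetical)
toy_mask = np.zeros(100, dtype=bool)
toy_mask[40:50] = True

picked = sample_clean_blocks(toy_mask, block_size=10, num_blocks=3, seed=42)
print(len(picked), "rows sampled, none inside the spike window")
```

The function returns positions, so on the real data they would have to be applied with `iloc` on `df_train` and `spike_mask`.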


In [None]:
# Normalisation
scaler_air = StandardScaler()
X_train_scaled = scaler_air.fit_transform(X_train)
X_test_scaled = scaler_air.transform(X_test)





# from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# from sklearn.metrics import mean_pinball_loss, make_scorer
# import lightgbm as lgb
# import numpy as np

# # Define the range of parameters to test
# param_dist = {
#     'n_estimators': [500, 1000, 2000],
#     'learning_rate': [0.01, 0.05, 0.1],
#     'num_leaves': [20, 31, 50, 100],
#     'max_depth': [-1, 10, 20],
#     'alpha': [0.7, 0.8, 0.9] # Testing different "levels" of boldness
# }

# random_search = RandomizedSearchCV(
#     estimator=LGBMRegressor(objective='quantile'),
#     param_distributions=param_dist,
#     n_iter=20, # Try 20 random combinations
#     cv=3,      # 3-fold cross-validation
#     scoring='neg_mean_absolute_error',
#     verbose=1,
#     n_jobs=-1
# )

# random_search.fit(X_train_scaled, y_train)
# model_air = random_search.best_estimator_
# print(f"Best parameters: {random_search.best_params_}")

#Best parameters: {'num_leaves': 50, 'n_estimators': 2000, 'max_depth': 10, 'learning_rate': 0.05, 'alpha': 0.7}
model_air = LGBMRegressor(num_leaves=30, n_estimators=2000, max_depth=5, learning_rate=0.05, random_state=42, min_child_samples=300, reg_lambda=10)
model_air.fit(X_train_scaled, y_train)


# Actual model for air quality prediction
# model_air = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
# model_air.fit(X_train_scaled, y_train)


# from scipy.stats import randint, uniform

# param_dist = {
#     'n_estimators': randint(100, 3000),
#     'learning_rate': uniform(0.01, 0.3),
#     'num_leaves': randint(20, 150),
#     'max_depth': randint(3, 12),
#     'min_child_samples': randint(10, 100),
#     'subsample': uniform(0.6, 0.4),
#     'colsample_bytree': uniform(0.6, 0.4)
# }

# lgbm = LGBMRegressor(random_state=42)

# # 3. Configuration of RandomSearch, will search n_iter combinations
# random_search = RandomizedSearchCV(
#     estimator=lgbm,
#     param_distributions=param_dist,
#     n_iter=50,
#     scoring='neg_mean_absolute_error',
#     cv=5,
#     verbose=1,
#     n_jobs=-1,
#     random_state=42
# )

# # 4. Training
# random_search.fit(X_train_scaled, y_train)

# model_air = random_search.best_estimator_

# print(f"Best parameters: {random_search.best_params_}")
#Best parameters: {'colsample_bytree': np.float64(0.7173952698872152), 'learning_rate': np.float64(0.014223946814525337), 'max_depth': 5, 'min_child_samples': 90, 'n_estimators': 491, 'num_leaves': 52, 'subsample': np.float64(0.8423839899124046)}

#just the resulting model:
# model_air = LGBMRegressor(colsample_bytree=0.7173952698872152, learning_rate=0.014223946814525337, max_depth=5, min_child_samples=90, n_estimators=491, num_leaves=52, subsample=0.8423839899124046, random_state=42)
# model_air.fit(X_train_scaled, y_train)




# # OTHER POSSIBILITY
# model_air = LGBMRegressor(
#     objective='quantile',
#     alpha=0.7, # Predict the 70th percentile
#     n_estimators=2000, 
#     learning_rate=0.01,
#     num_leaves=1000, 
#     max_depth=10,
# )
# import lightgbm as lgb
# # Fit with callback
# model_air.fit(
#     X_train_scaled, y_train,
#     eval_set=[(X_test_scaled, y_test)], # Must provide a validation set
#     eval_metric='quantile',
#     callbacks=[
#         lgb.early_stopping(stopping_rounds=50),
#         lgb.log_evaluation(period=10) # Optional: prints progress every 10 trees
#     ]
# )





# from sklearn.svm import SVR
# from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# # 1. Initialize SVR
# # 'rbf' (Radial Basis Function) is the most common kernel for non-linear data
# svr_model = SVR(kernel='rbf')

# # 2. Define Hyperparameter Space
# # C: Regularization (high C = low tolerance for errors, risk of overfitting)
# # epsilon: The width of the "tube" where no penalty is given
# # gamma: Defines how far the influence of a single training example reaches
# param_dist_svr = {
#     'C': [0.1, 0.5, 1, 2, 3, 10],
#     'epsilon': [0.01, 0.1, 0.05, 0.2, 1],
#     'gamma': [0.01, 0.02, 0.005]
# }

# # 3. TimeSeries aware search
# tscv = TimeSeriesSplit(n_splits=5)

# random_search_svr = RandomizedSearchCV(
#     estimator=svr_model,
#     param_distributions=param_dist_svr,
#     n_iter=15, # SVR is slower, so we try fewer combinations
#     cv=tscv,
#     scoring='neg_mean_absolute_error',
#     verbose=2,
#     n_jobs=-1
# )

# # 4. Fit
# random_search_svr.fit(X_train_scaled, y_train)

# print(f"Best SVR params: {random_search_svr.best_params_}")
# model_air = random_search_svr.best_estimator_








# Metrics and plots
y_pred = model_air.predict(X_test_scaled) # test predictions
# print("X test scaled", X_test_scaled)
# print("y test", y_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE (air) = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")


# Check for overfitting by comparing train and test performance
print("\n[Overfitting Check]")

y_train_pred = model_air.predict(X_train_scaled)
y_test_pred = y_pred  # test predictions were already computed above

train_mae = mean_absolute_error(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

# The test metrics were already computed above; reuse them
test_mae, test_rmse, test_r2 = mae, rmse, r2

mae_diff = test_mae - train_mae
r2_diff = train_r2 - test_r2

print(f"Train - MAE: {train_mae:.2f}, RMSE: {train_rmse:.2f}, R²: {train_r2:.3f}")
print(f"Test  - MAE: {test_mae:.2f}, RMSE: {test_rmse:.2f}, R²: {test_r2:.3f}")

# Graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_test, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df["datetime_hour"].iloc[split_idx:], y=y_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"{target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()


# Training Graph
# y_train comes from df_train_filtered, so the x-axis must use the same filtered rows
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_train_filtered["datetime_hour"], y=y_train, mode='lines', name='Real'))
fig.add_trace(go.Scatter(x=df_train_filtered["datetime_hour"], y=y_train_pred, mode='lines', name='Prediction', line=dict(dash='dash')))
fig.update_layout(title=f"[TRAINING PHASE] {target_label} predicted for +{forecast_h}h", xaxis_title="Date", yaxis_title="PM10 concentration")
fig.show()

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000972 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7512
[LightGBM] [Info] Number of data points in the train set: 5475, number of used features: 50
[LightGBM] [Info] Start training from score 26.892055



X does not have valid feature names, but LGBMRegressor was fitted with feature names







MAE (air) = 9.92, RMSE = 13.23, R² = 0.027

[Overfitting Check]
Train - MAE = 5.29, R² = 0.776
Test  - MAE = 9.92, R² = 0.027
Difference - MAE = 4.63, R² = -0.749
Train - MAE: 5.29, RMSE: 7.74, R²: 0.776
Test  - MAE: 9.92, RMSE: 13.23, R²: 0.027
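
Since the training set was rebalanced toward pollution spikes, global MAE and R² may not show whether spikes themselves are predicted any better. A complementary check, sketched below under the assumption that the 50 µg/m³ threshold is also the relevant one for evaluation, measures how often a predicted exceedance matches an observed one. The helper `spike_detection_scores` and the toy values are hypothetical.

```python
import numpy as np

def spike_detection_scores(y_true, y_pred, threshold=50.0):
    """Precision and recall of 'prediction exceeds threshold', plus the MAE on observed spike hours."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    actual = y_true > threshold
    predicted = y_pred > threshold
    true_pos = np.sum(actual & predicted)
    precision = true_pos / predicted.sum() if predicted.any() else float("nan")
    recall = true_pos / actual.sum() if actual.any() else float("nan")
    # Error restricted to the hours where a spike was actually observed
    mae_on_spikes = np.mean(np.abs(y_true[actual] - y_pred[actual])) if actual.any() else float("nan")
    return precision, recall, mae_on_spikes

# Toy values (hypothetical), not taken from the dataset
toy_true = [20, 55, 70, 30, 52]
toy_pred = [25, 60, 40, 51, 45]

precision, recall, mae_on_spikes = spike_detection_scores(toy_true, toy_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, MAE on spikes = {mae_on_spikes:.1f}")
```

On the real data, the function could be called with `y_test` and `y_pred` from the cell above.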




