# Forecasting Delays in the Swiss Transportation System

## Establishing Baseline Models with Engineered Data Sets

Copyright © 2025, 2026 Yvan Richard.  
All rights reserved.

## Foreword

In this notebook, I quickly try a few baseline models on the engineered data set. Because missing values are rare (about 2.9% of the rows, as the first cell shows), I use a complete-case analysis for this baseline, i.e. every row with at least one missing value is dropped. This choice will be assessed further in the error analysis part.
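
Before dropping incomplete rows, it can help to check where the missing values sit and how many rows a complete-case analysis would discard. The sketch below shows one way to do this on a small made-up frame; the values are invented for illustration and only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the engineered features (values are made up).
df = pd.DataFrame({
    "arrival_delay_minutes": [0.5, np.nan, 1.2, 3.4, np.nan, 0.0],
    "running_trip_delay": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6],
    "hour_of_day": [5, 6, 6, 7, 7, 8],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Share of rows that a complete-case analysis (dropna) would discard.
dropped_share = 1 - len(df.dropna()) / len(df)

print(missing_share)
print(f"Rows dropped by dropna(): {dropped_share:.1%}")
```

If most of the missingness comes from one or two columns, dropping or imputing those columns may cost less data than dropping every incomplete row.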

## 1. Load the Data

In [6]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the data frame from january features
df_jan = pd.read_parquet("../../data/features/ist_features_2025_01.parquet")

# drop rows with missing values
n1 = len(df_jan)
df_jan = df_jan.dropna()
n2 = len(df_jan)
print(f"Dropped {(n1 - n2) / n1 * 100:.2f}% rows with missing values.")

# head
df_jan.head()

Dropped 2.89% rows with missing values.


| | op_date | trip_id | stop_id | stop_name | line_name | vehicle_type | additional_trip | arrival_scheduled_dt | arrival_observed_dt | arrival_delay_minutes | ... | vt_RJX | vt_S | vt_SN | vt_TER | vt_TGV | latitude | longitude | connection_density | running_trip_delay | mean_stop_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2025-01-03 | ch:1:sjyid:100001:19220-001 | 8506013 | Aadorf | S12 | S | False | 2025-01-03 05:31:00 | 2025-01-03 05:30:55 | -0.083333 | ... | False | True | False | False | False | 47.488118 | 8.903301 | 35 | 0.15 | 1.216667 |
| 2 | 2025-01-03 | ch:1:sjyid:100001:19219-002 | 8506013 | Aadorf | S12 | S | False | 2025-01-03 06:28:00 | 2025-01-03 06:28:41 | 0.683333 | ... | False | True | False | False | False | 47.488118 | 8.903301 | 35 | 0.577778 | 0.566667 |
| 3 | 2025-01-03 | ch:1:sjyid:100001:19224-001 | 8506013 | Aadorf | S12 | S | False | 2025-01-03 06:31:00 | 2025-01-03 06:31:04 | 0.066667 | ... | False | True | False | False | False | 47.488118 | 8.903301 | 35 | 0.994444 | 0.605556 |
| 4 | 2025-01-03 | ch:1:sjyid:100001:19223-002 | 8506013 | Aadorf | S12 | S | False | 2025-01-03 07:28:00 | 2025-01-03 07:29:28 | 1.466667 | ... | False | True | False | False | False | 47.488118 | 8.903301 | 35 | 2.272222 | 0.375 |
| 5 | 2025-01-03 | ch:1:sjyid:100001:19228-001 | 8506013 | Aadorf | S12 | S | False | 2025-01-03 07:31:00 | 2025-01-03 07:31:08 | 0.133333 | ... | False | True | False | False | False | 47.488118 | 8.903301 | 35 | 0.866667 | 0.738889 |


In [7]:
df_jan.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1744304 entries, 1 to 1796286
Data columns (total 37 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   op_date                datetime64[ns]
 1   trip_id                object        
 2   stop_id                int64         
 3   stop_name              object        
 4   line_name              object        
 5   vehicle_type           object        
 6   additional_trip        object        
 7   arrival_scheduled_dt   datetime64[ns]
 8   arrival_observed_dt    datetime64[ns]
 9   arrival_delay_minutes  float64       
 10  is_delayed             int8          
 11  hour_of_day            int32         
 12  day_of_week            int32         
 13  is_weekend             bool          
 14  is_peak                bool          
 15  vt_AG                  bool          
 16  vt_ATZ                 bool          
 17  vt_EC                  bool          
 18  vt_EXT                 bool

## 2. Train-Test Split

For these first baseline models, I do not apply any particular preprocessing. I simply select the usable numerical features as predictors.

In [13]:
# convert bool columns to int
bool_columns = df_jan.select_dtypes(include=['bool']).columns
df_jan[bool_columns] = df_jan[bool_columns].astype(int)

# X: only numerical columns except target: `arrival_delay_minutes`, `is_delayed`
numerical_columns = df_jan.select_dtypes(include=[np.number]).columns.tolist()
target_columns = ['arrival_delay_minutes', 'is_delayed']
feature_columns = [col for col in numerical_columns if col not in target_columns]

As stated above, I use the most minimalistic approach and do not use a validation window. I simply split the January data set into a **training set that accounts for roughly 70%** of the rows and a test set made of the remaining 30%. To prevent data leakage, the **split is based on chronological order**.

In [None]:
# sort by arrival_scheduled_dt
df_jan = df_jan.sort_values(by=['arrival_scheduled_dt'])

# first 70% for training, last 30% for testing
split_index = int(0.7 * len(df_jan))

# training set
X_train = df_jan[feature_columns].iloc[:split_index]
y_train = df_jan['is_delayed'].iloc[:split_index]

# testing set
X_test = df_jan[feature_columns].iloc[split_index:]
y_test = df_jan['is_delayed'].iloc[split_index:]



# print shapes
print("SHAPES OF DATASETS")
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
print("\nTARGET VARIABLE STATISTICS")
print("Training set target mean:", f"{100 * y_train.mean():.2f}%")
print("Testing set target mean:", f"{100 * y_test.mean():.2f}%")

SHAPES OF DATASETS
Training set shape: (1221012, 27) (1221012,)
Testing set shape: (523292, 27) (523292,)

TARGET VARIABLE STATISTICS
Training set target mean: 4.99%
Testing set target mean: 6.29%


## 3. Baseline Models

In this section, I run a few basic models to see what kind of predictive performance I obtain.

### 3.0. Dummy Classifier

This dummy classifier simply predicts the most common class, i.e., 0 / "not delayed". It sets the floor that any real model has to beat.

In [20]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

dummy = DummyClassifier(strategy="most_frequent")  # always predicts 0
dummy.fit(X_train, y_train)
print(classification_report(y_test, dummy.predict(X_test)))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97    490381
           1       0.00      0.00      0.00     32911

    accuracy                           0.94    523292
   macro avg       0.47      0.50      0.48    523292
weighted avg       0.88      0.94      0.91    523292
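
The 0.94 accuracy of the dummy classifier is simply the share of the majority class in the test set, which is why accuracy alone says little on this problem. As a quick check, the arithmetic can be reproduced from the support counts in the report above (the variable names are only illustrative):

```python
# Support counts copied from the classification report above.
n_not_delayed = 490381
n_delayed = 32911
n_total = n_not_delayed + n_delayed

# Accuracy obtained by always predicting the majority class (0).
baseline_accuracy = n_not_delayed / n_total

# Share of delayed arrivals in the test set.
delayed_share = n_delayed / n_total

print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
print(f"Delayed share: {delayed_share:.2%}")
```

The delayed share matches the test set target mean printed after the train-test split, so any model with an accuracy near 0.94 has not necessarily learned anything about delays.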





### 3.1. Logistic Regression

In [18]:
# logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# create and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

# evaluate the model
print("CLASSIFICATION REPORT")
print(classification_report(y_test, y_pred))
print("CONFUSION MATRIX")
print(confusion_matrix(y_test, y_pred))

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0       0.95      0.99      0.97    490381
           1       0.79      0.28      0.42     32911

    accuracy                           0.95    523292
   macro avg       0.87      0.64      0.69    523292
weighted avg       0.94      0.95      0.94    523292

CONFUSION MATRIX
[[487881   2500]
 [ 23636   9275]]


**Class 1 (delayed)**

+ Precision 0.79: when the model predicts “delayed”, it’s right 79% of the time (good).
+ Recall 0.28: it only catches 28% of actual delays (bad). It misses 72% of delays (lots of false negatives).
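
Both figures follow directly from the confusion matrix printed above. A minimal sketch of the arithmetic, using scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[487881, 2500],
               [23636, 9275]])
tn, fp, fn, tp = cm.ravel()

# Precision: share of predicted delays that are real delays.
precision = tp / (tp + fp)
# Recall: share of real delays that the model catches.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```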

Now I need to choose which metric to favor. Ultimately, I have to strike a good balance between precision and recall, so the F1 score (their harmonic mean) is a decent metric.
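
One way to move along the precision-recall trade-off without changing the model is to shift the decision threshold applied to the predicted probabilities, instead of the default 0.5 used by `predict`. The sketch below illustrates the idea on synthetic imbalanced data, not on the notebook's data; the sample sizes, the split, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic imbalanced data (about 6% positives); NOT the notebook's data.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.94, 0.06], random_state=42
)

# Simple ordered split: first 70% to fit, last 30% to evaluate.
X_fit, X_eval = X[:3500], X[3500:]
y_fit, y_eval = y[:3500], y[3500:]

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba = clf.predict_proba(X_eval)[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_eval, proba)
precision, recall = precision[:-1], recall[:-1]
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# Threshold with the highest F1 on the evaluation part.
best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"threshold={best_threshold:.3f} precision={precision[best]:.2f} "
      f"recall={recall[best]:.2f} f1={f1[best]:.2f}")
```

In a real setting the threshold should be picked on a separate validation window rather than on the test set, otherwise the reported test score is optimistic.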

### 3.2. Random Forest

Now I run a tree-based model: the random forest.

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1,
    class_weight="balanced_subsample"
)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98    490381
           1       0.77      0.42      0.54     32911

    accuracy                           0.96    523292
   macro avg       0.86      0.71      0.76    523292
weighted avg       0.95      0.96      0.95    523292



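The random forest is a clear improvement over the logistic regression: for class 1, recall rises from 0.28 to 0.42 and the F1 score from 0.42 to 0.54, at nearly the same precision (0.77 vs. 0.79). I must note, however, that the comparison is not fully like-for-like. The features have not been scaled yet, and while a random forest is insensitive to feature scaling, logistic regression is not. The random forest also uses class weighting (`class_weight="balanced_subsample"`), which the logistic regression does not.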
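
To give the logistic regression a fairer chance, the features could be standardized inside a pipeline, so that the scaler is fitted on the training part only. The following is a minimal sketch on synthetic data, not the notebook's data; the inflated column only mimics a feature on a very different scale (for example an ID-like integer next to delays in minutes).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; NOT the notebook's data.
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)

# Inflate one column to mimic features measured on very different scales.
X[:, 0] *= 1e6

# Simple ordered split: first 70% to fit, last 30% to evaluate.
X_fit, X_eval = X[:1400], X[1400:]
y_fit, y_eval = y[:1400], y[1400:]

# The scaler learns its mean and variance from the training part only
# and reuses them when transforming the evaluation part.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_fit, y_fit)

accuracy = scaled_logreg.score(X_eval, y_eval)
print(f"Accuracy with scaling: {accuracy:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the test period into training, which matters for a chronological split like the one used here.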

### 3.3. Gradient Boosting

In [22]:
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=6,
    max_iter=300,
    random_state=42
)
hgb.fit(X_train, y_train)
print(classification_report(y_test, hgb.predict(X_test)))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98    490381
           1       0.75      0.46      0.57     32911

    accuracy                           0.96    523292
   macro avg       0.86      0.73      0.77    523292
weighted avg       0.95      0.96      0.95    523292



This model performs roughly on par with the random forest, with a slightly better F1 score for class 1 (0.57 vs. 0.54).

## 4. Error-Driven EDA

In this section, I conduct an error-driven EDA to understand where my models fail.