# Forecasting Delays in the Swiss Transportation System

## Tests on the Data Pipeline for Transportation Data: `IstData`

Copyright © 2025, 2026 Yvan Richard.  
All rights reserved.

## Foreword

In this notebook, I quickly check that the data I prepared from [opentransportdata](https://opentransportdata.swiss/en/) load correctly into memory.

## Load Data

First, I load some useful libraries and the data for the first day of September 2025. The data are stored in `parquet` format.

In [3]:
# load libraries
from pathlib import Path
import pandas as pd
import numpy as np


# load data with parquet
df_1 = pd.read_parquet("../../data/interim/2025_09/2025-09-01_IstDaten.parquet")

print(df_1.shape)


(1853483, 16)
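Beyond the shape, a quick way to confirm that the load went as expected is to compare the column names against a reference list. The snippet below is a minimal sketch of such a check, not part of the pipeline itself: the helper `check_schema` and the toy frame are hypothetical, and the reference list simply reuses the 16 column names visible in the preview of the filtered data in this notebook.

```python
import pandas as pd

# Column names as they appear in the data preview of this notebook (16 columns).
EXPECTED_COLUMNS = [
    "op_date", "trip_id", "stop_id", "stop_name",
    "operator_id", "operator_code", "operator_name", "transport_type",
    "line_id", "line_name", "vehicle_type", "additional_trip",
    "arrival_scheduled_dt", "arrival_observed_dt",
    "arrival_delay_minutes", "is_delayed",
]


def check_schema(frame: pd.DataFrame, expected: list) -> dict:
    """Hypothetical helper: report missing and unexpected column names."""
    actual = set(frame.columns)
    return {
        "missing": sorted(set(expected) - actual),
        "unexpected": sorted(actual - set(expected)),
    }


# Toy frame for illustration only: one column dropped, one extra column added.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["extra_column"])
report = check_schema(toy, EXPECTED_COLUMNS)
print(report)
```

On the real data, one would call `check_schema(df_1, EXPECTED_COLUMNS)` and expect both lists to be empty.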


In [5]:
df_2 = df_1[df_1["operator_code"] == "SBB"].copy()
print(df_2.shape)
df_2.head()

(57766, 16)


index,op_date,trip_id,stop_id,stop_name,operator_id,operator_code,operator_name,transport_type,line_id,line_name,vehicle_type,additional_trip,arrival_scheduled_dt,arrival_observed_dt,arrival_delay_minutes,is_delayed
1,01.09.2025,ch:1:sjyid:100001:14391-002,8504181,Givisiez,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,14391,S30,S,False,2025-09-01 00:06:00,2025-09-01 00:06:19,0.316667,0
3,01.09.2025,ch:1:sjyid:100001:14394-002,8504130,Yvonand,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,14394,S30,S,False,2025-09-01 00:10:00,2025-09-01 00:11:38,1.633333,0
5,01.09.2025,ch:1:sjyid:100001:14391-002,8504138,Grolley,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,14391,S30,S,False,2025-09-01 00:14:00,2025-09-01 00:13:18,-0.7,0
7,01.09.2025,ch:1:sjyid:100001:14394-002,8504131,Cheyres,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,14394,S30,S,False,2025-09-01 00:14:00,2025-09-01 00:15:19,1.316667,0
10,01.09.2025,ch:1:sjyid:100001:14394-002,8504132,Estavayer-le-Lac,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,14394,S30,S,False,2025-09-01 00:19:00,2025-09-01 00:20:13,1.216667,0
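The preview above suggests that `arrival_delay_minutes` is the observed arrival time minus the scheduled arrival time, expressed in minutes. The following sketch recomputes the delay for the five rows shown and compares it with the stored values. The helper name `recompute_delay_minutes` and the tolerance of `1e-4` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# The five rows from the head() output above, restricted to the relevant columns.
sample = pd.DataFrame(
    {
        "arrival_scheduled_dt": pd.to_datetime(
            [
                "2025-09-01 00:06:00",
                "2025-09-01 00:10:00",
                "2025-09-01 00:14:00",
                "2025-09-01 00:14:00",
                "2025-09-01 00:19:00",
            ]
        ),
        "arrival_observed_dt": pd.to_datetime(
            [
                "2025-09-01 00:06:19",
                "2025-09-01 00:11:38",
                "2025-09-01 00:13:18",
                "2025-09-01 00:15:19",
                "2025-09-01 00:20:13",
            ]
        ),
        "arrival_delay_minutes": [0.316667, 1.633333, -0.7, 1.316667, 1.216667],
    }
)


def recompute_delay_minutes(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: observed minus scheduled arrival, in minutes."""
    delta = frame["arrival_observed_dt"] - frame["arrival_scheduled_dt"]
    return delta.dt.total_seconds() / 60.0


recomputed = recompute_delay_minutes(sample)

# The stored values are rounded to six decimals, hence the absolute tolerance.
delays_match = bool(
    np.allclose(recomputed, sample["arrival_delay_minutes"], atol=1e-4)
)
print(delays_match)
```

A negative value, as in the third row, corresponds to an early arrival. The same comparison could be run on the full frame by passing `df_2` instead of `sample`.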


In [7]:
df_2['vehicle_type'].value_counts()

vehicle_type
S      37397
R       9039
IR      3804
RE      3768
IC      2626
TER      590
EC       289
EXT       93
ICE       82
TGV       32
RJX       30
NJ        15
RB         1
Name: count, dtype: int64
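Since `value_counts()` drops missing values by default, comparing the sum of these counts with the number of rows gives a cheap check for missing `vehicle_type` entries. The sketch below does this with the numbers printed above, copied by hand. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
vehicle_counts = pd.Series(
    {
        "S": 37397, "R": 9039, "IR": 3804, "RE": 3768, "IC": 2626,
        "TER": 590, "EC": 289, "EXT": 93, "ICE": 82, "TGV": 32,
        "RJX": 30, "NJ": 15, "RB": 1,
    },
    name="count",
)

# Number of rows reported by df_2.shape earlier in the notebook.
n_sbb_rows = 57766

# If the counts add up to the row count, no vehicle_type value is missing.
counts_cover_all_rows = int(vehicle_counts.sum()) == n_sbb_rows

# Share of each vehicle type, in percent.
shares_pct = (vehicle_counts / vehicle_counts.sum() * 100).round(2)

print(counts_cover_all_rows)
print(shares_pct.head(3))
```

On the live frame, `df_2['vehicle_type'].value_counts(normalize=True)` yields the shares directly, as fractions rather than percentages.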

In [11]:
# summary statistics (not displayed here: a notebook cell only echoes its last expression)
df_2['arrival_delay_minutes'].describe()

# percentage of arrivals delayed by more than 3 minutes
pct_above_3 = (df_2['arrival_delay_minutes'] > 3).mean() * 100
print(f"Percentage of arrival delays above 3 minutes: {pct_above_3:.2f}%")

# mean delay among arrivals delayed by more than 3 minutes
mean_delay_above_3 = df_2.loc[df_2['arrival_delay_minutes'] > 3, 'arrival_delay_minutes'].mean()
print(f"Mean arrival delay when above 3 minutes: {mean_delay_above_3:.2f} minutes")

Percentage of arrival delays above 3 minutes: 9.29%
Mean arrival delay when above 3 minutes: 5.54 minutes
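The two statistics above depend on the 3-minute cut-off. To see how sensitive they are to that choice, the same computation can be wrapped in a small function and evaluated for several thresholds. This is a sketch under stated assumptions: `summarize_delays` is a hypothetical helper, and the toy series is made up for illustration, not taken from the dataset.

```python
import pandas as pd


def summarize_delays(delays: pd.Series, threshold: float) -> dict:
    """Hypothetical helper: share of delays above a threshold and their mean."""
    above = delays > threshold  # comparisons with NaN evaluate to False
    return {
        "threshold_minutes": threshold,
        "pct_above": above.mean() * 100,
        "mean_when_above": delays[above].mean(),
    }


# Made-up delays in minutes, for illustration only.
toy_delays = pd.Series([-0.5, 0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0])

summary = pd.DataFrame(
    [summarize_delays(toy_delays, threshold) for threshold in (1, 3, 5)]
)
print(summary)
```

Note that rows with a missing delay count as "not above" in `pct_above`, so the percentage is taken over all rows rather than over rows with a known delay. On the real data, the call would be `summarize_delays(df_2['arrival_delay_minutes'], 3)`.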


In [13]:
# inspect data
df_2.groupby("vehicle_type")["arrival_delay_minutes"].describe()

vehicle_type,count,mean,std,min,25%,50%,75%,max
EC,289.0,5.672722,10.596819,-4.433333,-0.133333,1.5,6.25,56.933333
EXT,93.0,2.115412,3.957777,-12.85,0.116667,1.366667,2.866667,15.4
IC,2626.0,1.210491,2.671305,-4.25,-0.166667,0.616667,1.816667,30.95
ICE,82.0,2.839837,6.221477,-2.133333,-0.304167,1.025,2.75,35.45
IR,3804.0,0.875162,2.004719,-3.666667,-0.266667,0.433333,1.470833,20.766667
NJ,15.0,3.744444,7.587643,-2.75,0.908333,1.783333,3.4,29.55
R,9039.0,1.106887,2.012064,-3.05,-0.083333,0.616667,1.666667,26.35
RB,1.0,-0.4,,-0.4,-0.4,-0.4,-0.4,-0.4
RE,3768.0,1.07205,1.967302,-2.416667,-0.016667,0.616667,1.566667,15.95
RJX,30.0,3.945,9.620772,-1.75,-0.208333,1.025,1.679167,34.716667
