# Matching trains with traffic lines

The main goal here is to develop a function that makes it possible to identify the line of a specific delayed train. The reason why we need such a function is because the passenger ridership estimation is given per line where the delay data is per specific train.

## Importing datasets

First, let us import the dataset for all the traffic lines (used in the ridership estimation data).

In [21]:
# import excel file static_pass_all_2024.xlsx

import pandas as pd
import os


# read by default 1st sheet of an excel file
df_line = pd.read_excel('static_pass_all_2024.xlsx')
# drop all the columns except the first 3 (no need for ridership data, only the line number, name and stopping patterns are of interest)
df_line = df_line.iloc[:, :5]

Let us now import the train data, more specifically the trains that are affected by delays. Of interest here are particularly Tågnr	and Tåguppdrag.
The goal is to match all of them to a specific line number in df_line.

In [22]:
# read by default 1st sheet of an excel file
df_train = pd.read_excel('metatraindata_2023.xlsx')

Let us also read the Lupp data where we have more attributes for each train, more particularly the stopping pattern. There are four different files in the data folder named as follows Rapport_T23_vX.csv where X is 11, 19, 28 and 37, we will read all of these and combine them in one dataframe, note the first row of each file is the header.

In [23]:
import pandas as pd
import glob

# Define the folder path and file pattern
folder_path = 'data/'  # Adjust the folder path if needed
file_pattern = 'Rapport_T23_v*.csv'

# Use glob to find all matching files
file_paths = glob.glob(folder_path + file_pattern)

# Read all files into a list of DataFrames
dfs = [pd.read_csv(file, header=0) for file in file_paths]

# Combine all DataFrames into one
df_lupp = pd.concat(dfs, ignore_index=True)

We need to clean up (make this df a bit smaller), e.g., by removing unnecessary data.

In [24]:
# from combined_df remove the following columns
# År (PAU)
# Veckonr (PAU)
# Datum (PAU)
# Tågslag, but before remove all raws where Tågslag is not RST

df_lupp_rst = df_lupp[df_lupp['Tågslag'] == 'RST']
df_lupp_rst_clean = df_lupp_rst.drop(columns=['År (PAU)', 'Veckonr (PAU)', 'Datum (PAU)', 'Tågslag'])

In [25]:
# remove all rows where both Uppehållstypavgång is Passage and Uppehållstypankomst is Passage
df_lupp_rst_clean = df_lupp_rst_clean[(df_lupp_rst_clean['Uppehållstypavgång'] != 'Passage') | (df_lupp_rst_clean['Uppehållstypankomst'] != 'Passage')]

In [26]:
# check how many trains from df_train that are in df_lupp_rst_clean
# for that search using the column Tågnr and Tåguppdrag from df_train
# and use similar columns Tåguppdrag and Tågnr from df_lupp_rst_clean
# to find the matching trains

# make sure these are int in both dataframes
df_train['Tågnr'] = df_train['Tågnr'].astype('Int64')
df_train['Tåguppdrag'] = df_train['Tåguppdrag'].astype('Int64')
df_lupp_rst_clean['Tåguppdrag'] = df_lupp_rst_clean['Tåguppdrag'].astype('Int64')

# in df_lupp_rst_clean, remove spaces between numbers first in Tågnr
# Remove spaces between numbers in the Tågnr column
df_lupp_rst_clean['Tågnr'] = df_lupp_rst_clean['Tågnr'].astype(str).str.replace(r'\s+', '', regex=True)
df_lupp_rst_clean['Tågnr'] = df_lupp_rst_clean['Tågnr'].astype('Int64')

In [27]:
# for each Tågnr, print how many possible Tåguppdrag there are
# this is to see if there are any duplicates in the data
x = df_lupp_rst_clean.groupby('Tågnr')['Tåguppdrag'].nunique()
y = df_train.groupby('Tågnr')['Tåguppdrag'].nunique()
# print the max and min for each dataframe
print(x.max(), x.min())
print(y.max(), y.min())

1 1
3 1


## Matching train delay and Lupp data

Before trying to find the closest line (line number/name) to a certain train (tågnr/uppdrag). Let us first look att how many delayed trains can we identify in the sample of Lupp data that we have.

In [28]:
# Remove duplicates from df_train and combined_df based on ('Tågnr', 'Tåguppdrag')
df_train_test = df_train.drop_duplicates(subset=['Tågnr', 'Tåguppdrag'])
combined_df_test = df_lupp_rst_clean.drop_duplicates(subset=['Tågnr', 'Tåguppdrag'])

# Perform an inner merge to find matching trains
matching_trains = pd.merge(
    df_train_test, 
    combined_df_test, 
    how='inner', 
    left_on=['Tågnr', 'Tåguppdrag'], 
    right_on=['Tågnr', 'Tåguppdrag']
)

# Count the number of matching trains
num_matching_trains = matching_trains.shape[0]
print(f"Number of matching trains: {num_matching_trains}")

# Count the number of unique trains in df_train
num_unique_trains = len(df_train_test[['Tåguppdrag']])
print(f"Out of {num_unique_trains} unique trains")

# Calculate the percentage of matching trains
matching_percentage = num_matching_trains / num_unique_trains * 100
print(f"Percentage of matching trains: {matching_percentage:.2f}%")

Number of matching trains: 5833
Out of 14474 unique trains
Percentage of matching trains: 40.30%


We now know that we have stopping pattern information (from Lupp data T23) for around 40% of the delayed trains (in metatraindata_2023). From now on, we focus on matching these 40% delayed trains to their line numbers.

First, we append the stopping pattern information to our delayed trains.

In [31]:
filtered_stops = df_lupp_rst_clean[
    ((df_lupp_rst_clean['Uppehållstypavgång'].isin(['Uppehåll', 'Första']))) |
    ((df_lupp_rst_clean['Uppehållstypankomst'].isin(['Sista'])))
]

first_dates = filtered_stops.groupby(['Tågnr', 'Tåguppdrag'])['Datum'].min().reset_index()
filtered_stops = pd.merge(filtered_stops, first_dates, on=['Tågnr', 'Tåguppdrag', 'Datum'])

stops_per_train = (
    filtered_stops.groupby(['Tågnr', 'Tåguppdrag'], as_index=False)
    .agg({'Delsträckanummer': list, 'Avgångplatssignatur': list, 'Uppehållstypankomst': list, 'AnkomstplatsPlatssignatur': list})
    .apply(lambda x: pd.Series({
        'Tågnr': x['Tågnr'],
        'Tåguppdrag': x['Tåguppdrag'],
        'Stopps': (
            [stop for i, stop in zip(x['Delsträckanummer'], x['Avgångplatssignatur']) 
             if pd.notna(stop)] +
            [x['AnkomstplatsPlatssignatur'][i] for i, type_a in enumerate(x['Uppehållstypankomst']) 
             if type_a == 'Sista' and pd.notna(x['AnkomstplatsPlatssignatur'][i])]
        )
    }), axis=1)
)

train_stops = pd.merge(
    matching_trains, 
    stops_per_train, 
    how='inner', 
    on=['Tågnr', 'Tåguppdrag']
)[['Tågnr', 'Tåguppdrag', 'Stopps']]

# make sure all the stops are uppercase
train_stops['Stopps'] = train_stops['Stopps'].apply(lambda x: [stop.upper() for stop in x])

In [10]:
# Create line_stops by grouping stations ('från' and 'till') for each line ('Linje')
line_stops = df_line.groupby('Linje').apply(lambda x: list(x['från']) + [x.iloc[-1]['till']]).reset_index()
# Rename columns for clarity
line_stops.columns = ['Linje', 'Stopps']

  line_stops = df_line.groupby('Linje').apply(lambda x: list(x['från']) + [x.iloc[-1]['till']]).reset_index()


## Matching delayed trains to traffic lines

Once we have the stopping patter for each train (nr+uppdrag), we look at df_line for the closest line with similar stopping patterns.

To do that, we can use one of the classification algorithms such as K-Nearest Neighbors (KNN) which we will use here.

In [54]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

# Binarize stopping patterns for both lines and trains
mlb = MultiLabelBinarizer()

# Fit and transform stopping patterns for lines
line_patterns = mlb.fit_transform(line_stops['Stopps'])

# Transform stopping patterns for trains using the same binarizer
train_patterns = mlb.transform(train_stops['Stopps'])

# Assign line IDs as labels
line_labels = line_stops['Linje']

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(line_patterns, line_labels)

# Predict closest line for each train
predicted_lines = knn.predict(train_patterns)

# Add predictions and corresponding line stops to the result
result = train_stops.copy()
result['Predicted_Line'] = predicted_lines

# Map the stops from line_stops to the predicted lines
line_stops_dict = line_stops.set_index('Linje')['Stopps'].to_dict()
result['Line_Stopps'] = result['Predicted_Line'].map(line_stops_dict)

# Display the result
print(result[['Tågnr', 'Tåguppdrag', 'Predicted_Line', 'Line_Stopps']])




      Tågnr  Tåguppdrag Predicted_Line  \
0       822        5407           5101   
1       838        5335           5101   
2       862        5317           5101   
3       810        5309           5101   
4       806        5289           5101   
...     ...         ...            ...   
5826  11295        1080          10701   
5827  13582        1085           5101   
5828  32580       15816           5101   
5829  29741       29741         10501R   
5830  23609        2635           5101   

                                            Line_Stopps  
0                                  [Stockholms Central]  
1                                  [Stockholms Central]  
2                                  [Stockholms Central]  
3                                  [Stockholms Central]  
4                                  [Stockholms Central]  
...                                                 ...  
5826                   [Malmö central, Svedala, Skurup]  
5827                           