# Matching trains with traffic lines

The goal here is to develop a function that identifies the traffic line of a specific delayed train. We need such a function because the passenger ridership estimates are given per line, whereas the delay data are recorded per train.

## Importing datasets

First, let us import the dataset for all the traffic lines (used in the ridership estimation data).

In [69]:
# import excel file static_pass_all_2024.xlsx
import pandas as pd

# read by default 1st sheet of an excel file
df_line = pd.read_excel('../data/output_data/static_pass_all_2024.xlsx')

In [70]:
# keep only the first 9 columns (no need for ridership data; only the line number, name and stopping patterns are of interest)
df_line = df_line.iloc[:, :9]

Let us now import the train data, more specifically the trains that are affected by delays. Of particular interest here are the columns Tågnr (train number) and Tåguppdrag (train assignment).
The goal is to match each of them to a specific line number in df_line.

In [71]:
# New data file with all RST trains that have had infrastructure events registered on Södra stambanan (the Southern Main Line)
df_train = pd.read_csv('../data/train_data_2023/traindata_2023_passenger_SSBevents_012025.csv')

We need to clean up this dataframe and make it smaller, e.g., by removing unnecessary rows and columns.

In [72]:
# keep only RST trains; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df_train_rst = df_train[df_train['Tågslag'] == 'RST'].copy()
# upper case for columns 'Plats', 'StartStation_resa', 'SlutStation_resa', 'StartStation_uppdrag', 'SlutStation_uppdrag'
for col in ['Plats', 'StartStation_resa', 'SlutStation_resa', 'StartStation_uppdrag', 'SlutStation_uppdrag']:
    df_train_rst[col] = df_train_rst[col].str.upper()

In [73]:
# keep only the following columns:
# 'resa', 'Tågnr', 'Tåguppdrag', 'Plats', 'Tågslag', 'Tågsort', 'Aktivitetskod', 'Aktivitetskodbeskrivning',
# 'PlanDatum', 'PlanTidpunkt', 'Datum', 'Tågläge', 'StartStation_resa', 'SlutStation_resa', 'StartStation_uppdrag', 'SlutStation_uppdrag'
df_train_rst_clean = df_train_rst[['resa', 'Tågnr', 'Tåguppdrag', 'Plats', 'Tågslag', 'Tågsort', 'Aktivitetskod', 'Aktivitetskodbeskrivning', 'PlanDatum', 'PlanTidpunkt', 'Datum',  'Tågläge', 'StartStation_resa', 'SlutStation_resa','StartStation_uppdrag', 'SlutStation_uppdrag']]
df_train_rst_clean = df_train_rst_clean.reset_index(drop=True)

In [74]:
# remove all rows where Aktivitetskodbeskrivning is not 'Påstigande av resande', or 'Av- och påstigande av resande' or 'Avstigande av resande'
df_train_rst_clean = df_train_rst_clean[df_train_rst_clean['Aktivitetskodbeskrivning'].isin([
    'Påstigande av resande', 
    'Av- och påstigande av resande', 
    'Avstigande av resande'
])]

In [75]:
# read ../data/useful_data/Plats_sign.csv
df_plats_sign = pd.read_csv('../data/useful_data/Plats_sign.csv')

# # First, clean up df_plats_sign
# df_plats_sign = pd.read_excel('../data/useful_data/Förbindelselinje_2023_alla.xlsx')
# df_plats_sign = df_plats_sign.dropna(subset=['Plats'])
# df_plats_sign = df_plats_sign[['Plats', 'Plats_sign']].drop_duplicates()

# # Read the additional station mapping file
# df_plats_sign_pos = pd.read_excel('../data/useful_data/Plats_sign_pos.xlsx')
# # keep only columns Signatur and Plats_sign
# df_plats_sign_pos = df_plats_sign_pos[['Plats', 'Signatur']]
# # rename column Signatur to Plats_sign
# df_plats_sign_pos.rename(columns={'Signatur': 'Plats_sign'}, inplace=True)

# # augment df_plats_sign with df_plats_sign_pos
# df_plats_sign = pd.concat([df_plats_sign, df_plats_sign_pos], ignore_index=True)

# # Read the additional station mapping file
# df_trafikplats = pd.read_csv('../data/raw_data/Trafikplats_jvg_förenklad.csv')
# # keep only columns trafikplatsnamn and signatur
# df_trafikplats = df_trafikplats[['trafikplatsnamn', 'signatur']]
# # rename columns
# df_trafikplats.rename(columns={'trafikplatsnamn': 'Plats', 'signatur': 'Plats_sign'}, inplace=True)

# # augment df_plats_sign with df_trafikplats
# df_plats_sign = pd.concat([df_plats_sign, df_trafikplats], ignore_index=True)

# # Find the Plats_sign for 'Morastrand'
# mora_strand_sign = df_plats_sign[df_plats_sign['Plats'] == 'Morastrand']['Plats_sign'].values[0]

# # Create a new row for 'Mora Strand' mapping
# new_row = pd.DataFrame({'Plats': ['Mora Strand'], 'Plats_sign': [mora_strand_sign]})

# # Concatenate the new row with df_plats_sign
# df_plats_sign = pd.concat([df_plats_sign, new_row], ignore_index=True)

# df_plats_sign['Plats_sign'] = df_plats_sign['Plats_sign'].str.upper()

# # remove duplicates
# df_plats_sign = df_plats_sign.drop_duplicates(subset=['Plats_sign', 'Plats'])

In [76]:
# Create initial mapping for station signs (left join on the station name)
df_train_rst_clean = df_train_rst_clean.merge(
    df_plats_sign[['Plats', 'Plats_sign']],
    on='Plats',
    how='left'
)
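
The left join above can be illustrated on toy data. The following is a minimal sketch, not the notebook's actual data: the station names and signatures are invented for illustration, and the `validate="many_to_one"` argument is an optional safeguard (an assumption, not used in the original cell) that raises an error if a station name appears more than once in the mapping table, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy stand-ins for df_train_rst_clean and df_plats_sign (values are illustrative only)
stops = pd.DataFrame({"Plats": ["LUND C", "MALMÖ C", "HÄSSLEHOLM"]})
signs = pd.DataFrame({"Plats": ["LUND C", "MALMÖ C"], "Plats_sign": ["LU", "M"]})

# Left join: every stop row is kept; stations without a mapping get NaN in Plats_sign
merged = stops.merge(signs, on="Plats", how="left", validate="many_to_one")

# Stations that still need to be handled by a fallback matching step
unmatched = merged.loc[merged["Plats_sign"].isna(), "Plats"].unique().tolist()
print(unmatched)
```

Stations left in `unmatched` are the ones a fuzzy or regex-based fallback would then have to resolve.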

In [77]:
import re 

# Handle unmatched stations
unmatched_stations = df_train_rst_clean[df_train_rst_clean['Plats_sign'].isna()]['Plats'].unique()

# Define the regex matching function with df_plats_sign as parameter
def match_station(station, df_plats_sign):
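    """
    Match station names using regex and handle special cases.
    Args:
        station: station name to match
        df_plats_sign: dataframe containing station mappings
    Returns:
        tuple (plats_sign, df_plats_sign): the matched Plats_sign (None if no
        match was found) and the mapping dataframe, extended with the original
        station name when a match was found under a variant spelling
    """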
    station_variants = [
        station,
        re.sub(r'central', 'c', station, flags=re.IGNORECASE),
        re.sub(r'(\w+)( central)', r'\1s central', station, flags=re.IGNORECASE),
        re.sub(r'(\w+)( c)', r'\1s c', station, flags=re.IGNORECASE),
        re.sub(r'(\w+)s central', r'\1 central', station, flags=re.IGNORECASE),
        re.sub(r'(\w+)s central', r'\1 c', station, flags=re.IGNORECASE),
    ]
    # upper case for all elements in station_variants
    station_variants = [x.upper() for x in station_variants]

    # Special cases
    special_cases = {
        'marieholm': 'Göteborg Marieholm',
        'helsingborg godsbangård': 'Helsingborgs godsbangård',
        'hallsbergs pbg': 'Hallsbergs personbangård',
        'stockholm södra': 'Stockholms Södra',
        'falkenbergs personstation': 'Falkenberg personstation',
        'köpingebro': 'f.d. Köpingebro',
        'Mora strand': 'Morastrand'
    }

    # upper case for all keys in special_cases
    special_cases = {k.upper(): v for k, v in special_cases.items()}

    
    if station.upper() in special_cases:
        station_variants.append(special_cases[station.upper()])

    # keep only unique elements in station_variants (order preserved)
    station_variants = list(dict.fromkeys(station_variants))

    for variant in station_variants:
        matches = df_plats_sign[df_plats_sign['Plats'].str.match(re.escape(variant), case=False, na=False)]
        if not matches.empty:
            # Add the original station name to df_plats_sign if it's not already there
            if not df_plats_sign[df_plats_sign['Plats'] == station].shape[0]:
                plats_sign = matches['Plats_sign'].iloc[0]
                new_row = pd.DataFrame({'Plats': [station], 'Plats_sign': [plats_sign]})
                df_plats_sign = pd.concat([df_plats_sign, new_row], ignore_index=True)
            return matches['Plats_sign'].iloc[0], df_plats_sign
    return None, df_plats_sign

# Apply regex matching for unmatched stations
for station in unmatched_stations:
    match, df_plats_sign = match_station(station, df_plats_sign)
    if match:
        df_train_rst_clean.loc[df_train_rst_clean['Plats'] == station, 'Plats_sign'] = match
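
To make the variant generation in `match_station` easier to follow, here is a small standalone sketch. The helper name `name_variants` is hypothetical and it uses only a subset of the substitutions from the cell above; it shows how a single station name expands into a few candidate spellings that are then de-duplicated.

```python
import re

def name_variants(station):
    """Return de-duplicated, upper-cased spelling variants of a station name."""
    variants = [
        station,
        # "Central" abbreviated to "C"
        re.sub(r"central", "c", station, flags=re.IGNORECASE),
        # genitive "s" dropped before "Central"
        re.sub(r"(\w+)s central", r"\1 central", station, flags=re.IGNORECASE),
    ]
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(v.upper() for v in variants))

variants = name_variants("Helsingborgs Central")
print(variants)
```

A name without "Central" simply yields one variant, since every substitution leaves it unchanged.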

In [78]:
df_plats_sign = df_plats_sign[['Plats', 'Plats_sign']].drop_duplicates()
df_plats_sign.to_csv('../data/useful_data/Plats_sign_augmented_v2.csv', index=False)

In [79]:
# Third attempt: Regex matching using additional file for remaining unmatched stations
#unmatched_stations = df_train_rst_clean[df_train_rst_clean['Plats_sign'].isna()]['Plats'].unique()

In [80]:
# for station in unmatched_stations:
#     match = match_station(station, df_plats_sign_pos)
#     if match:
#         df_train_rst_clean.loc[df_train_rst_clean['Plats'] == station, 'Plats_sign'] = match

In [81]:
# keep the full name for unmatched stations
df_train_rst_clean.loc[df_train_rst_clean['Plats_sign'].isna(), 'Plats_sign'] = df_train_rst_clean.loc[df_train_rst_clean['Plats_sign'].isna(), 'Plats']

In [82]:
# Merge to get Start_uppdrag_sign
df_train_rst_clean = df_train_rst_clean.merge(
    df_plats_sign[['Plats', 'Plats_sign']].rename(columns={'Plats': 'StartStation_uppdrag', 'Plats_sign': 'Start_uppdrag_sign'}),
    on='StartStation_uppdrag',
    how='left'
)

# Merge to get Slut_uppdrag_sign
df_train_rst_clean = df_train_rst_clean.merge(
    df_plats_sign[['Plats', 'Plats_sign']].rename(columns={'Plats': 'SlutStation_uppdrag', 'Plats_sign': 'Slut_uppdrag_sign'}),
    on='SlutStation_uppdrag',
    how='left'
)

# Merge to get Start_resa_sign
df_train_rst_clean = df_train_rst_clean.merge(
    df_plats_sign[['Plats', 'Plats_sign']].rename(columns={'Plats': 'StartStation_resa', 'Plats_sign': 'Start_resa_sign'}),
    on='StartStation_resa',
    how='left'
)

# Merge to get Slut_resa_sign
df_train_rst_clean = df_train_rst_clean.merge(
    df_plats_sign[['Plats', 'Plats_sign']].rename(columns={'Plats': 'SlutStation_resa', 'Plats_sign': 'Slut_resa_sign'}),
    on='SlutStation_resa',
    how='left'
)

In [83]:
# Number of distinct Tåguppdrag per Tågnr
x_tgnr = df_train_rst_clean.groupby('Tågnr')['Tåguppdrag'].nunique()
print("Number of Tåguppdrag per Tågnr:")
print(f"df_train_rst_clean - max: {x_tgnr.max()}, min: {x_tgnr.min()}")

# Number of distinct Tågnr per Tåguppdrag
x_tgupp = df_train_rst_clean.groupby('Tåguppdrag')['Tågnr'].nunique()
print("\nNumber of Tågnr per Tåguppdrag:")
print(f"df_train_rst_clean - max: {x_tgupp.max()}, min: {x_tgupp.min()}")

# Number of distinct Tåguppdrag per (Tågnr, resa)
x_tgnr_resa = df_train_rst_clean.groupby(['Tågnr','resa'])['Tåguppdrag'].nunique()
print("\nNumber of Tåguppdrag per (Tågnr, resa):")
print(f"df_train_rst_clean - max: {x_tgnr_resa.max()}, min: {x_tgnr_resa.min()}")

# Number of distinct Tågnr per (Tåguppdrag, resa)
x_tgupp_resa = df_train_rst_clean.groupby(['Tåguppdrag','resa'])['Tågnr'].nunique()
print("\nNumber of Tågnr per (Tåguppdrag, resa):")
print(f"df_train_rst_clean - max: {x_tgupp_resa.max()}, min: {x_tgupp_resa.min()}")

Number of Tåguppdrag per Tågnr:
df_train_rst_clean - max: 2, min: 1

Number of Tågnr per Tåguppdrag:
df_train_rst_clean - max: 19, min: 1

Number of Tåguppdrag per (Tågnr, resa):
df_train_rst_clean - max: 1, min: 1

Number of Tågnr per (Tåguppdrag, resa):
df_train_rst_clean - max: 5, min: 1


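This means that neither identifier is unique on its own: a Tågnr can be shared by up to 2 Tåguppdrag, and a Tåguppdrag can cover up to 19 Tågnr. Unique identification is therefore only possible using a combination of Tågnr and Tåguppdrag.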
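
The reasoning can be reproduced on a toy table. This is only a sketch with invented values (one row per train, unlike the real data, which has one row per stop): neither column identifies a row on its own, while the pair does.

```python
import pandas as pd

# Invented example: train 101 runs under two assignments, assignment "B" covers three trains
toy = pd.DataFrame({
    "Tågnr": [101, 101, 102, 103],
    "Tåguppdrag": ["A", "B", "B", "B"],
})

per_tagnr = toy.groupby("Tågnr")["Tåguppdrag"].nunique()    # assignments per train number
per_uppdrag = toy.groupby("Tåguppdrag")["Tågnr"].nunique()  # train numbers per assignment

# The combination of both columns has no repeated rows in this toy table
pair_is_unique = not toy.duplicated(subset=["Tågnr", "Tåguppdrag"]).any()
print(per_tagnr.max(), per_uppdrag.max(), pair_is_unique)
```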

## Extracting stops from train data

Before trying to find the closest line (line number/name) to a certain train (resa, i.e., tågnr-uppdrag-datum), let us extract the stops.
First, we append the stopping pattern information to our delayed trains.

In [84]:
# Create a new dataframe with the list of stops (from the 'Plats_sign' column) for each train
train_resa_stops = df_train_rst_clean.groupby(
    ['resa', 'Tågnr', 'Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign', 'Start_resa_sign', 'Slut_resa_sign']
)['Plats_sign'].agg(list).reset_index()

In [85]:
# If there are duplicates in the 'Plats_sign' lists, remove them while preserving order
train_resa_stops['Plats_sign'] = train_resa_stops['Plats_sign'].apply(lambda x: list(dict.fromkeys(x)))
train_resa_stops['Plats_len'] = train_resa_stops['Plats_sign'].apply(len)

# rename column Plats_sign to Stopps
train_resa_stops.rename(columns={'Plats_sign': 'Stopps'}, inplace=True)

# Convert lists to tuples to make them hashable
train_resa_stops['Stopps'] = train_resa_stops['Stopps'].apply(tuple)
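
The three steps above (collect stops per train, drop repeated stops while keeping order, convert to a hashable tuple) can be sketched in one chain on toy data. The column values are invented and only `resa` is used as the grouping key to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "resa": ["r1", "r1", "r1", "r2", "r2"],
    "Plats_sign": ["M", "LU", "LU", "HM", "LU"],  # "LU" is registered twice for r1
})

stops = (
    toy.groupby("resa")["Plats_sign"]
    .agg(list)                                   # ordered list of stops per trip
    .apply(lambda s: tuple(dict.fromkeys(s)))    # de-duplicate, keep order, make hashable
    .reset_index(name="Stopps")
)
stops["Plats_len"] = stops["Stopps"].apply(len)
print(stops)
```

Tuples are used because later steps call `drop_duplicates` and `merge` on the stop pattern, which requires hashable values.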

In [86]:
# sort so that, for each train, the row with the longest list of stops comes first
train_resa_stops = train_resa_stops.sort_values(by='Plats_len', ascending=False)

# when several rows share the same 'Tågnr', 'Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign' and 'Slut_uppdrag_sign', keep the row with the highest 'Plats_len'
train_stops_no_duplicates = train_resa_stops.drop_duplicates(subset=['Tågnr','Tåguppdrag','Tågsort','Start_uppdrag_sign','Slut_uppdrag_sign'], keep='first')

#train_stops_no_duplicates = train_stops_no_duplicates[['resa','Tåguppdrag', 'Stopps']]

In [87]:
# Create a mapping DataFrame with unique combinations
mapping_df = train_stops_no_duplicates[['Tågnr','Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign', 'Stopps']].drop_duplicates()

# Merge the original DataFrame with the mapping DataFrame
train_resa_stops_same = train_resa_stops.merge(
    mapping_df[['Tågnr','Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign', 'Stopps']],
    on=['Tågnr','Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign'],
    how='left'
)

In [88]:
# Use the merged Stopps column to replace the original one
train_resa_stops_same = train_resa_stops_same.drop(columns=['Stopps_x']).rename(columns={'Stopps_y': 'Stopps'})

In [89]:
# # there are rows in train_resa_stops with same Tåguppdrag but have different Plats_len
# # how many rows are there?

# # Count how many times each Tåguppdrag appears
# duplicates = train_resa_stops.groupby('Tåguppdrag').size().reset_index(name='count')

# # Filter to only show Tåguppdrag that appear more than once
# multiple_entries = duplicates[duplicates['count'] > 1]

# # Get the number of Tåguppdrag with multiple entries
# num_duplicates = len(multiple_entries)

# # To see the actual duplicate rows with their Plats_len:
# duplicate_details = train_resa_stops[train_resa_stops['Tåguppdrag'].isin(multiple_entries['Tåguppdrag'])]
# duplicate_details = duplicate_details[['Tåguppdrag', 'Plats_len']].sort_values('Tåguppdrag')

# # Get the number of unique Plats_len for each Tåguppdrag
# unique_plats_len = duplicate_details.groupby('Tåguppdrag')['Plats_len'].nunique()

In [90]:
train_stops = train_stops_no_duplicates.drop_duplicates(subset=['Stopps'], keep='first')

In [91]:
# Group by 'Linje' and combine the 'från_sign' and 'till_sign' for each line
line_stops = df_line.groupby('Linje').apply(
    lambda x: list(x['från_sign']) + [x['till_sign'].iloc[-1]]
).reset_index()

# Rename columns for clarity
line_stops.columns = ['Linje', 'Stopps']
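
As an illustration of how a line's stop sequence is rebuilt from its segments, here is a sketch on invented data. It assumes, as the cell above does, that each row is one segment (`från_sign` to `till_sign`) and that the rows of a line are already in travel order. It iterates over the groups instead of using `groupby.apply`, which is just one possible alternative.

```python
import pandas as pd

# Invented segments: line L1 runs M -> LU -> HM, line L2 runs HM -> AV
segments = pd.DataFrame({
    "Linje": ["L1", "L1", "L2"],
    "från_sign": ["M", "LU", "HM"],
    "till_sign": ["LU", "HM", "AV"],
})

# All segment origins, followed by the destination of the last segment
line_stops_toy = {
    linje: list(grp["från_sign"]) + [grp["till_sign"].iloc[-1]]
    for linje, grp in segments.groupby("Linje")
}
print(line_stops_toy)
```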


## Matching delayed trains to traffic lines

We now match delayed trains (subset with unique stop patterns) to the most likely traffic line. The most likely line is chosen as the one with the highest similarity score.

In [92]:
import pandas as pd # type: ignore
from difflib import SequenceMatcher

def calculate_score(train_stops, line_stops):
    """
    Calculate a similarity score between train stops and line stops.
    """
    # Match first and last stop
    score = 0

    if train_stops[0] == line_stops[0]:
        score += 2  # Higher weight for matching first stop
    if train_stops[-1] == line_stops[-1]:
        score += 2  # Higher weight for matching last stop
    
    # Calculate sequence similarity for intermediate stops
    sequence_similarity = SequenceMatcher(None, train_stops, line_stops).ratio()
    score += sequence_similarity * 10  # Adjust weight for sequence similarity
    
    return score

def get_inverted_line(line_id):
    """
    Get the inverted line ID.
    """
    return line_id[:-1] if line_id.endswith('R') else f"{line_id}R"

def match_trains_to_lines(train_stops_df, line_stops_df):
    """
    Match trains to lines based on similarity scores, including inverted stops.
    """
    matches = []
    for _, train_row in train_stops_df.iterrows():
        best_score = -1
        best_match = None
        best_direction = 'Normal'

        # only score trains with at least one stop; otherwise the defaults (no match, score -1) are kept
        if len(train_row['Stopps']) > 0:
            for _, line_row in line_stops_df.iterrows():

                # Calculate score for normal stops
                normal_score = calculate_score(train_row['Stopps'], line_row['Stopps'])
                
                # Calculate score for inverted stops
                inverted_stops = line_row['Stopps'][::-1]
                inverted_score = calculate_score(train_row['Stopps'], inverted_stops)
                
                # Determine better match (normal or inverted)
                if inverted_score > normal_score:
                    current_score = inverted_score
                    current_match = get_inverted_line(line_row['Linje'])
                    current_direction = 'Inverted'
                else:
                    current_score = normal_score
                    current_match = line_row['Linje']
                    current_direction = 'Normal'
                
                # Update best match
                if current_score > best_score:
                    best_score = current_score
                    best_match = current_match
                    best_direction = current_direction
        
        matches.append({
            'Stopps': train_row['Stopps'],
            'Predicted_Line': best_match,
            'Score': best_score,
            'Direction': best_direction
        })
    
    return pd.DataFrame(matches)


matching_result = match_trains_to_lines(train_stops, line_stops)

# rename column Stopps to Stopps_train
matching_result.rename(columns={'Stopps': 'Stopps_train'}, inplace=True)
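
To make the scoring concrete, the following standalone sketch recomputes the score for one invented train against one invented line, in both directions. The function `score` mirrors the weights used in `calculate_score` above (2 points each for matching first and last stop, plus 10 times the `SequenceMatcher` ratio); the stop signatures are made up.

```python
from difflib import SequenceMatcher

def score(train, line):
    """Endpoint bonuses plus weighted sequence similarity, as in calculate_score."""
    s = 0.0
    if train[0] == line[0]:
        s += 2
    if train[-1] == line[-1]:
        s += 2
    return s + 10 * SequenceMatcher(None, train, line).ratio()

train = ("M", "LU", "HM")       # stops observed for the train
line = ("M", "LU", "E", "HM")   # stops of a candidate line

forward = score(train, line)          # same direction as the line
backward = score(train, line[::-1])   # line traversed in reverse
print(round(forward, 3), round(backward, 3))
```

Here three of the seven elements in total match in order, so the ratio is 2 * 3 / 7 and both endpoint bonuses apply; the reversed line shares only one in-order element and no endpoints, so the forward direction wins.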

In [93]:
# Add a column to the results with the stops of the predicted line
matching_result_stops = pd.merge(matching_result, line_stops, left_on='Predicted_Line', right_on='Linje', how='left').rename(columns={'Stopps': 'Stopps_line'})
matching_result_stops.drop(columns=['Linje'], inplace=True)

Now, we can construct the final table where all the trains are identified with a specific traffic line.

In [94]:
# export matching_result_stops to Excel, keep only the columns 'Predicted_Line', 'Score', 'Stopps_line', 'Stopps_train'
columns_to_keep = ['Predicted_Line', 'Score', 'Stopps_line', 'Stopps_train']
matching_result_stops = matching_result_stops[columns_to_keep]
matching_result_stops.to_excel('../data/output_data/matching_result_stops.xlsx', index=False)

## Match back with Lupp data

We now have the mapping between train stop patterns and predicted lines. We use it to add a predicted-line column to our original train data.
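
The idea of propagating a predicted line from unique stop patterns back to every trip can be sketched as a left join on the stop pattern. The data below are invented; the sketch relies on the stop patterns being stored as tuples so that they can serve as join keys.

```python
import pandas as pd

# Invented trips: r1 and r2 share a stop pattern, r3 has a pattern that was never matched
trips = pd.DataFrame({
    "resa": ["r1", "r2", "r3"],
    "Stopps": [("M", "LU"), ("M", "LU"), ("HM", "AV")],
})
pattern_to_line = pd.DataFrame({
    "Stopps_train": [("M", "LU")],
    "Predicted_Line": ["L1"],
})

# Left join keeps all trips; trips with an unknown pattern get NaN as predicted line
labelled = trips.merge(
    pattern_to_line, left_on="Stopps", right_on="Stopps_train", how="left"
)
print(labelled[["resa", "Predicted_Line"]])
```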

In [95]:
# Merge train_resa_stops_same (on Stopps) with matching_result_stops (on Stopps_train)
train_resa_stops_predicted_lines = train_resa_stops_same.merge(matching_result_stops, left_on='Stopps', right_on='Stopps_train', how='left')


In [None]:
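# remove column Stopps (identical to Stopps_train after the merge) and list the remaining columns
train_resa_stops_predicted_lines.drop(columns=['Stopps'], inplace=True)
train_resa_stops_predicted_lines.columns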

Index(['resa', 'Tågnr', 'Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign',
       'Slut_uppdrag_sign', 'Start_resa_sign', 'Slut_resa_sign', 'Plats_len',
       'Predicted_Line', 'Score', 'Stopps_line', 'Stopps_train'],
      dtype='object')

In [97]:
# Merge train_resa_stops with train_resa_stops_predicted_lines to get predicted lines for all trains
df_merged = train_resa_stops.merge(
    train_resa_stops_predicted_lines[['resa','Tågnr','Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign', 'Predicted_Line', 'Score', 'Stopps_line']],
    on=['resa','Tågnr','Tåguppdrag', 'Tågsort', 'Start_uppdrag_sign', 'Slut_uppdrag_sign'],
    how='left'
)

In [98]:
df_merged.to_excel('../data/output_data/train_data_matched_lines.xlsx', index=False)