# Cleaning contract data (2011-2023)

Assumptions:
- No contracts that start in 2024 or later, or that end before 2010.
- Only main tracks (nhsp)

In the following, we clean the raw compilation of maintenance contracts (produced by running the R code on the Excel file with BAS contracts).

## Import raw data

We first read the raw data from the raw_data folder.

In [1]:
import pandas as pd # type: ignore

# Step 1: Load the Excel file containing service contracts for each bandel
excel_file_path = "../Python matching/raw_data/more_servicekontrakt_per_bandel_regression.xlsx"

#sheet_name = "uppdaterad"
sheet_name = "tid per bandel"

# Read the selected sheet (sheet_name) into a DataFrame
servicekontrakt_df = pd.read_excel(excel_file_path, sheet_name=sheet_name)

## Basic clean up

Remove rows with no information. These rows normally just indicate the name of the contract region.

In [2]:
# remove rows where the third column is missing 
servicekontrakt_df = servicekontrakt_df[servicekontrakt_df.iloc[:, 2].notna()]

We parse the column Tidsperiod (e.g., 2024-2030) to extract start_year (2024) and end_year (2030).

In [3]:
import re

# Parse the column Tidsperiod ("20XX - 20YY") into two new columns 'Start_year' and 'End_year'
def parse_tidsperiod(tidsperiod):
    # Values that are not strings (e.g. NaN from an empty cell) cannot be parsed
    if not isinstance(tidsperiod, str):
        return pd.Series([None, None])
    # First remove spaces in tidsperiod
    tidsperiod = tidsperiod.replace(' ', '')
    tidsperiod_match = re.match(r'^(\d{4})-(\d{4})$', tidsperiod)
    if tidsperiod_match:
        start_year = int(tidsperiod_match.group(1))
        end_year = int(tidsperiod_match.group(2))
    else:
        start_year = None
        end_year = None
    return pd.Series([start_year, end_year])

# Apply the parsing function to create two new columns
servicekontrakt_df[['Start_year', 'End_year']] = servicekontrakt_df['Tidsperiod'].apply(parse_tidsperiod)
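
As a quick check of how this kind of parser treats typical and malformed values, here is a standalone sketch. `parse_period` is a hypothetical stand-in for `parse_tidsperiod` that returns a plain tuple, and the input strings are made up. Note that a period written with an en dash is not matched by the pattern.

```python
import re

PERIOD_PATTERN = re.compile(r"^(\d{4})-(\d{4})$")

def parse_period(value):
    """Return (start_year, end_year), or (None, None) if the value does not parse."""
    if not isinstance(value, str):
        return (None, None)  # e.g. NaN from an empty Excel cell
    match = PERIOD_PATTERN.match(value.replace(" ", ""))
    if match is None:
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))

print(parse_period("2024-2030"))
print(parse_period("2011 - 2016"))
print(parse_period("2011–2016"))  # en dash instead of a hyphen
print(parse_period(float("nan")))
```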

Since we focus on the period up to 2023, we remove contracts that start in 2024 or later, as well as contracts that end before 2010.

In [4]:
# remove rows where start_year is 2024 or later
servicekontrakt_df_cleaned = servicekontrakt_df[servicekontrakt_df['Start_year'] < 2024]
# remove all contracts where end_year is before 2010
servicekontrakt_df_cleaned = servicekontrakt_df_cleaned[servicekontrakt_df_cleaned['End_year'] >= 2010]
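
One side effect of these comparisons is worth keeping in mind: rows whose Tidsperiod could not be parsed have missing years, and a comparison with a missing value evaluates to False, so those rows are dropped as well. The sketch below uses made-up rows to show which contracts survive the two conditions. A contract that starts before 2024 but ends later is kept.

```python
import pandas as pd

# Made-up contracts for illustration only
contracts = pd.DataFrame({
    "Tidsperiod": ["2008-2009", "2009-2012", "2020-2026", "2024-2030", "okänd"],
    "Start_year": [2008, 2009, 2020, 2024, None],
    "End_year": [2009, 2012, 2026, 2030, None],
})

# Same two conditions as in the cell above, combined into one mask
keep = (contracts["Start_year"] < 2024) & (contracts["End_year"] >= 2010)
kept = contracts[keep]
print(kept["Tidsperiod"].tolist())
```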

## Process cleaned data

We parse the name of the bandel (e.g., 306 (Borlänge)-Repbäcken) to extract the bandel number (e.g., 306), which is sometimes missing, and the bandel name, which is always present (e.g., (Borlänge)-Repbäcken).

In [5]:
# Parse the 'Bandel' column into two new columns 'Bandelnr' and 'Bandelnamn'
def parse_bandel(bandel):
    import re
    bandelnr_match = re.match(r'^(\d+(?:/\d+)*)', bandel)
    if bandelnr_match:
        bandelnr = bandelnr_match.group(0).replace('/', ', ')
        bandelnamn = bandel[len(bandelnr_match.group(0)):].strip()
    else:
        bandelnr = ''
        bandelnamn = bandel.strip()
    return pd.Series([bandelnr, bandelnamn])

# Apply the parsing function to create two new columns
servicekontrakt_df_processed = servicekontrakt_df_cleaned.copy()
# in bandel, replace '–' with '-'
servicekontrakt_df_processed['Bandel'] = servicekontrakt_df_processed['Bandel'].str.replace('–', '-')
servicekontrakt_df_processed[['Bandelnr', 'Bandelnamn']] = servicekontrakt_df_processed['Bandel'].apply(parse_bandel)
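
To make the splitting rule concrete, the following sketch applies the same regular expression to a few strings. `split_bandel` is a hypothetical stand-in for `parse_bandel` that returns a tuple, and the third input is made up to show how several numbers separated by `/` are handled.

```python
import re

def split_bandel(bandel):
    """Split a Bandel string into (number part, name part)."""
    match = re.match(r"^(\d+(?:/\d+)*)", bandel)
    if match:
        number = match.group(0).replace("/", ", ")
        name = bandel[len(match.group(0)):].strip()
    else:
        number = ""
        name = bandel.strip()
    return number, name

print(split_bandel("306 (Borlänge)-Repbäcken"))
print(split_bandel("(Kävlinge) - (Arlöv)"))
print(split_bandel("421/422 A-B"))  # made-up example with two numbers
```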

### Reading and cleaning BIS dictionary

We need a BIS dictionary to find missing Bandelnr for some contracts (and later to construct a graph and find the length of shortest path). We first read the BIS dictionary.

In [6]:
# File and sheet details
excel_file_path = "../Python matching/raw_data/BIS-data 2024-01-09 - Bandel, plats och förbindelselinje, alla spår.xlsx"

# Load the Excel file
bis_df = pd.read_excel(excel_file_path)

We then clean up the dictionary by keeping only main tracks (nhsp) and removing duplicates, which speeds up the matching/search.

In [7]:
# for finding the code of the stations
bis_df_no_duplicates = bis_df[['BdlNr', 'Bandel', 'Plats_sign', 'Plats']].drop_duplicates()
# remove all rows with missing values of 'Plats_sign'
bis_df_no_duplicates = bis_df_no_duplicates[bis_df_no_duplicates['Plats_sign'].notna()]

# Focus on the main tracks
# remove all rows where BdlNr is equal to 1 (Ingår ej i bandelsindelning) but not rows where Forbind is not empty (helpful for finding shortest path)
bis_df = bis_df[(bis_df['BdlNr'] != 1) | (bis_df['Forbind'].notna())]

# keep only rows where column Spår_huvud_sido is nhsp
bis_df_nhsp = bis_df[bis_df["Spår_huvud_sido"] == "nhsp"]

# Step 1: Remove duplicates from the mapping
bis_df_nhsp_no_duplicates = bis_df_nhsp[['BdlNr', 'Bandel', 'Plats_sign', 'Plats', 'Forbind']].drop_duplicates()

We also reorganize the same dataframe in order to easily get the length of the track section (Banlangd).

In [8]:
# Step 1: Group by 'BdlNr', 'Bandel', 'Plats_sign', 'Plats' and sum 'Banlangd'
grouped_by_plats = bis_df_nhsp.groupby(['BdlNr', 'Bandel', 'Plats_sign', 'Plats'])['Banlangd'].sum().reset_index()

# Step 2: Group by 'BdlNr', 'Bandel', 'Forbind' and sum 'Banlangd'
grouped_by_forbind = bis_df_nhsp.groupby(['BdlNr', 'Bandel', 'Forbind'])['Banlangd'].sum().reset_index()

# Step 3: Add 'Plats_sign' and 'Plats' columns with NaN to 'grouped_by_forbind' for consistency
grouped_by_forbind['Plats_sign'] = pd.NA
grouped_by_forbind['Plats'] = pd.NA

# Step 4: Add 'Forbind' column with NaN to 'grouped_by_plats' for consistency
grouped_by_plats['Forbind'] = pd.NA

# Step 5: Combine the two DataFrames using outer concatenation
combined_bis_df_nhsp_langd = pd.concat([grouped_by_plats, grouped_by_forbind], ignore_index=True, sort=False)
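
The sketch below reproduces this reshaping on a made-up table with two station rows and two link rows. It relies on the default pandas behaviour that `groupby` skips rows whose key is missing, which is what separates station rows from link rows here. All values are invented.

```python
import pandas as pd

# Made-up BIS-like rows: two station rows and two link rows (lengths in metres)
toy = pd.DataFrame({
    "BdlNr": [306, 306, 306, 306],
    "Bandel": ["A-B"] * 4,
    "Plats_sign": ["A", "A", None, None],
    "Plats": ["Astad", "Astad", None, None],
    "Forbind": [None, None, "A-B", "A-B"],
    "Banlangd": [100, 250, 4000, 1500],
})

# Rows with a missing key are left out of each grouping
by_plats = toy.groupby(["BdlNr", "Bandel", "Plats_sign", "Plats"])["Banlangd"].sum().reset_index()
by_forbind = toy.groupby(["BdlNr", "Bandel", "Forbind"])["Banlangd"].sum().reset_index()

# Columns that exist in only one of the frames are filled with missing values
combined = pd.concat([by_plats, by_forbind], ignore_index=True, sort=False)
print(combined)
```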

### Finding missing Bandelnr (using dictionary)

As mentioned earlier, some rows do not have a Bandelnr (e.g., (Kävlinge) - (Arlöv)). We use the dictionary (from BIS) that we just prepared to find the missing Bandelnr, first by exact match on the name and then by fuzzy match. The Bandelnr is needed later to match each row to a common contract region.

In [9]:
import difflib

# add column Bandelnamn_from_fuzzy_match to servicekontrakt_df_processed
servicekontrakt_df_processed['Bandelnamn_from_fuzzy_match'] = None

# Iterate over rows to find and update missing 'Bandelnr'
for index, row in servicekontrakt_df_processed.iterrows():
    if row['Bandelnr'] == "" and "-" in row['Bandelnamn']: # single stations are treated in the code just after
        # Direct match in dictionary
        bandel_nr = bis_df_no_duplicates[bis_df_no_duplicates['Bandel'] == row['Bandelnamn']]['BdlNr']
        if len(bandel_nr) > 0:
            servicekontrakt_df_processed.at[index, 'Bandelnr'] = bandel_nr.values[0]
            # put in a new column called Bandelnr_from_exact_match
            servicekontrakt_df_processed.at[index, 'Bandelnr_from_exact_match'] = bandel_nr.values[0]
        else:
            # Flexible matching using difflib
            all_bandels = bis_df_no_duplicates['Bandel'].tolist()
            closest_match = difflib.get_close_matches(row['Bandelnamn'], all_bandels, n=1, cutoff=0.8)
            
            if closest_match:
                # Find the closest match's Bandelnr
                bandel_nr = bis_df_no_duplicates[bis_df_no_duplicates['Bandel'] == closest_match[0]]['BdlNr']
                if len(bandel_nr) > 0:
                    servicekontrakt_df_processed.at[index, 'Bandelnr'] = bandel_nr.values[0]
                    # put in a new column called Bandelnr_from_fuzzy_match
                    servicekontrakt_df_processed.at[index, 'Bandelnr_from_fuzzy_match'] = bandel_nr.values[0]
                    # put in a new column called Bandelnamn_from_fuzzy_match
                    servicekontrakt_df_processed.at[index, 'Bandelnamn_from_fuzzy_match'] = closest_match[0]
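
The fuzzy step can be tried in isolation. The sketch below uses `difflib.get_close_matches` with the same `n=1` and `cutoff=0.8` on a made-up list of dictionary names. `closest_bandel` is a hypothetical helper, not part of the notebook.

```python
import difflib

# Made-up dictionary names for illustration
bis_names = ["(Kävlinge)-(Arlöv)", "(Borlänge)-Repbäcken", "Malmö-Lund"]

def closest_bandel(name, candidates, cutoff=0.8):
    """Return the best candidate at or above the cutoff, or None."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Extra spaces around the dash still give a close match
print(closest_bandel("(Kävlinge) - (Arlöv)", bis_names))
# A name with no similar entry gives no match
print(closest_bandel("Stockholm-Uppsala", bis_names))
```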

Some Bandelnamn values are the name of a single place. For these, we can identify the Bandelnr directly from the Plats column of the dictionary.

In [10]:
# for the rows where Bandelnr is missing and Bandelnamn has no "-" in it, we find the bandelnr using bis_df ["Plats"]
for index, row in servicekontrakt_df_processed.iterrows():
    if row['Bandelnr'] == "" and "-" not in row['Bandelnamn']:
        # Direct match in dictionary
        bandel_nr = bis_df_nhsp_no_duplicates[bis_df_nhsp_no_duplicates['Plats'] == row['Bandelnamn']]['BdlNr']
        if len(bandel_nr) == 1:
            servicekontrakt_df_processed.at[index, 'Bandelnr'] = bandel_nr.values[0]
            # put in a new column called Bandelnr_from_exact_Plats_match
            servicekontrakt_df_processed.at[index, 'Bandelnr_from_exact_Plats_match'] = bandel_nr.values[0]
        elif len(bandel_nr) > 1:
            # save the first and print that there are multiple matches
            servicekontrakt_df_processed.at[index, 'Bandelnr'] = bandel_nr.values[0]
            # put in a new column called Bandelnr_from_exact_Plats_match
            servicekontrakt_df_processed.at[index, 'Bandelnr_from_exact_Plats_match'] = bandel_nr.values[0]
            print("Multiple matches for Bandelnamn: ", row['Bandelnamn'])
        else:
            # print that there is no match
            print("No match for Bandelnamn: ", row['Bandelnamn'])

No match for Bandelnamn:  Helsingborg c
No match for Bandelnamn:  Helsingborg c/ Helsingborg gbg
No match for Bandelnamn:  Helsingborg gbg
No match for Bandelnamn:  Landskrona Ö


### Parsing Bandelnamn to Plats_sign

In [11]:
# Map Plats_sign (short code) to Plats (full name); the function below inverts it to look up codes by name
name_to_code_mapping = bis_df_no_duplicates.set_index('Plats_sign')['Plats'].to_dict()

def convert_bandelnamn_to_codes(bandelnamn):
    # Split by dash and preserve parentheses
    stations = bandelnamn.split('-')
    
    # Create a case-insensitive mapping of full names to codes
    name_to_code_mapping_lower = {
        str(v).lower(): str(k) for k, v in name_to_code_mapping.items()
    }
    
    # Detailed conversion with original names preserved
    station_details = []
    station_codes = []

    for name in stations:
        stripped_name = str(name).strip()
        has_parentheses = stripped_name.startswith('(') and stripped_name.endswith(')')
        
        # Remove parentheses temporarily for lookup
        name_without_parentheses = stripped_name[1:-1] if has_parentheses else stripped_name
        
        # Try case-insensitive matching with original name
        code = name_to_code_mapping_lower.get(name_without_parentheses.lower())
        

        # If no code found, try appending " central"
        central_name = ""
        if code is None:
            central_name = f"{name_without_parentheses} central"
            code = name_to_code_mapping_lower.get(central_name.lower(), None)
        
        if code is None:
            central_name = f"{name_without_parentheses}s central"
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # trying appending " c"
        if code is None:
            central_name = f"{name_without_parentheses} c"
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if name has " c", replace it with "s central"
        if code is None and " c" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" c", "s central").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if name has "/", keep only the part before it
        if code is None and "/" in name_without_parentheses:
            central_name = name_without_parentheses.split('/')[0].strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)
            if code is None:
                central_name = central_name.replace(" c", "s central").strip()
                code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if name has " Ö", replace it with " Östra"
        if code is None and " Ö" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" Ö", " Östra").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if word has "gbg" replace with godsbangård
        if code is None and "gbg" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" gbg", "s godsbangård").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        if code is None and "Åby" in name_without_parentheses:
            central_name = name_without_parentheses.replace("Åby", "Åby södra").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        if code is None and " (Mssb)" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" (Mssb)", "").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        if code is None and " V" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" V", "s västra").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)


        # if word has "rbg" replace with rangerbangård
        if code is None and "rbg" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" rbg", "s rangerbangård").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if word has "rbg" replace with s central
        if code is None and "rbg" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" rbg", "s central").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # if word has "N" replace with norra
        if code is None and " N" in name_without_parentheses:
            central_name = name_without_parentheses.replace(" N", " norra").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)
        

        # try removing "driftsplats" from the name
        if code is None and "driftsplats" in name_without_parentheses:
            central_name = name_without_parentheses.replace("driftsplats", "").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)

        # try replacing "C driftsplats" with "central"
        if code is None and "C driftsplats" in name_without_parentheses:
            central_name = name_without_parentheses.replace("C driftsplats", "central").strip()
            code = name_to_code_mapping_lower.get(central_name.lower(), None)


        # if no code found, check if name_without_parentheses is composed of two names separated by a space and replace the space with a hyphen
        if code is None:
            names = name_without_parentheses.split(' ')
            if len(names) == 2:
                name_without_parentheses = names[0] + '-' + names[1]
                code = name_to_code_mapping_lower.get(name_without_parentheses.lower(), None)

        # if not found, try fuzzy matching
        if code is None:
            all_plats = bis_df_no_duplicates['Plats'].tolist()
            closest_match = difflib.get_close_matches(name_without_parentheses, all_plats, n=1, cutoff=0.9)
            if closest_match:
                code = name_to_code_mapping_lower.get(closest_match[0].lower())
        
        station_details.append({
            'original_name': stripped_name,
            'tried_name': central_name if code and code != name_without_parentheses else None,
            'station_code': code
        })
        
        if code:
            # Add parentheses back if they were present
            formatted_code = f"({code})" if has_parentheses else code
            station_codes.append(formatted_code)
    
    return {
        'station_details': station_details,
        'station_codes': station_codes,
        'short_path': '-'.join(station_codes) if station_codes else None
    }

# Step 2: Prepare the dataframe with detailed conversion results
def prepare_bandelnamn_conversion(servicekontrakt_df):
    # Apply the conversion function
    conversion_results = servicekontrakt_df['Bandelnamn'].apply(convert_bandelnamn_to_codes)
    
    # Extract details into separate columns
    servicekontrakt_df = servicekontrakt_df.copy()
    servicekontrakt_df['station_details'] = conversion_results.apply(lambda x: x['station_details'])
    servicekontrakt_df['original_station_names'] = servicekontrakt_df['station_details'].apply(
        lambda x: [detail['original_name'] for detail in x]
    )
    servicekontrakt_df['station_codes'] = conversion_results.apply(lambda x: x['station_codes'])
    servicekontrakt_df['short_path'] = conversion_results.apply(lambda x: x['short_path'])
    
    return servicekontrakt_df
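
The chain of fallbacks in `convert_bandelnamn_to_codes` follows one pattern: rewrite an abbreviation, then retry the lookup. A table-driven sketch of that pattern is shown below. The station codes in `NAME_TO_CODE`, the `REWRITES` list, and `lookup_code` are all illustrative assumptions, and only three of the rewrites are covered.

```python
# Made-up lookup from lower-cased full name to station code
NAME_TO_CODE = {
    "helsingborgs central": "Hb",
    "helsingborgs godsbangård": "Hbgb",
    "landskrona östra": "Lkö",
}

# (marker, replacement) pairs mirroring some of the fallbacks above
REWRITES = [
    (" c", "s central"),
    (" gbg", "s godsbangård"),
    (" Ö", " östra"),
]

def lookup_code(name):
    """Try the name as written, then each rewrite in order."""
    code = NAME_TO_CODE.get(name.lower())
    if code is not None:
        return code
    for marker, replacement in REWRITES:
        if marker in name:
            candidate = name.replace(marker, replacement).strip()
            code = NAME_TO_CODE.get(candidate.lower())
            if code is not None:
                return code
    return None

for name in ["Helsingborg c", "Helsingborg gbg", "Landskrona Ö", "Okänd plats"]:
    print(name, "->", lookup_code(name))
```

Keeping the rewrites in a list makes the order of attempts explicit and keeps each comment next to the rule it describes.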

In [12]:
# Apply the steps
# 1. First, prepare the conversion
servicekontrakt_df_processed = prepare_bandelnamn_conversion(servicekontrakt_df_processed)

# # add a column named diff_len where you put the difference between the length of the list in original_station_names and station_codes
# servicekontrakt_df_processed['diff_len'] = servicekontrakt_df_processed.apply(lambda x: len(x['original_station_names']) - len(x['station_codes']), axis=1)

# # add another column with length of original_station_names
# servicekontrakt_df_processed['len_original_station_names'] = servicekontrakt_df_processed.apply(lambda x: len(x['original_station_names']), axis=1)

Some rows have the same values in all columns because they belong to different tracks. We remove these duplicates since we focus only on the main track.

In [13]:
# drop columns station_details, original_station_names and station_codes
servicekontrakt_df_processed = servicekontrakt_df_processed.drop(columns=['station_details', 'original_station_names', 'station_codes'])

# make sure Bandelnr is numeric (values that cannot be parsed become NaN)
servicekontrakt_df_processed['Bandelnr'] = pd.to_numeric(servicekontrakt_df_processed['Bandelnr'], errors='coerce')

# remove duplicate rows from servicekontrakt_df_processed
servicekontrakt_df_processed_no_duplicates = servicekontrakt_df_processed.drop_duplicates()
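
A point to be aware of with `errors='coerce'`: any value that is not a single number becomes NaN. This includes the empty string used for a missing Bandelnr and a combined value such as "306, 307". The values below are made up to illustrate this.

```python
import pandas as pd

# Made-up Bandelnr values: a string, an integer, an empty string, a combined value
bandelnr = pd.Series(["306", 421, "", "306, 307"])
as_numbers = pd.to_numeric(bandelnr, errors="coerce")
print(as_numbers.tolist())
print(as_numbers.isna().tolist())
```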

## Calculate Banlangd

The first and easiest rows are the ones where we already have found exact/fuzzy matches (with bandelnamn) from the dictionary.

In [14]:
servicekontrakt_df_langd = servicekontrakt_df_processed_no_duplicates.copy()

# add a new column named 'Banlangd' to servicekontrakt_df_langd
servicekontrakt_df_langd['Banlangd'] = pd.NA

# iterate over rows in servicekontrakt_df_langd and find the corresponding Banlangd in combined_bis_df_nhsp_langd
# use the columns 'Bandelnr' and 'Bandelnamn_from_fuzzy_match' to find the corresponding Banlangd by summing up the Banlangd values in combined_bis_df_nhsp_langd
for index, row in servicekontrakt_df_langd.iterrows():
    # if bandelnamn_from_fuzzy_match column is in row 
    if 'Bandelnamn_from_fuzzy_match' in row and not pd.isna(row['Bandelnamn_from_fuzzy_match']):    
        bandel_namn_exakt = row['Bandelnamn_from_fuzzy_match']
        bandel_nr = row['Bandelnr']
        # find the corresponding Banlangd in combined_bis_df_nhsp_langd
        banlangd = combined_bis_df_nhsp_langd[(combined_bis_df_nhsp_langd['BdlNr'] == bandel_nr) & (combined_bis_df_nhsp_langd['Bandel'] == bandel_namn_exakt)]['Banlangd']
        if len(banlangd) > 0:
            # sum up the Banlangd values
            servicekontrakt_df_langd.at[index, 'Banlangd'] = banlangd.sum()/1000
        else:
            # put zero if no Banlangd found
            servicekontrakt_df_langd.at[index, 'Banlangd'] = 0
    # do the same for Bandelnr_from_exact_match
    elif 'Bandelnr_from_exact_match' in row and not pd.isna(row['Bandelnr_from_exact_match']):
        bandel_namn_exakt = row['Bandelnamn']
        bandel_nr = row['Bandelnr']
        # find the corresponding Banlangd in combined_bis_df_nhsp_langd
        banlangd = combined_bis_df_nhsp_langd[(combined_bis_df_nhsp_langd['BdlNr'] == bandel_nr) & (combined_bis_df_nhsp_langd['Bandel'] == bandel_namn_exakt)]['Banlangd']
        if len(banlangd) > 0:
            # sum up the Banlangd values
            servicekontrakt_df_langd.at[index, 'Banlangd'] = banlangd.sum()/1000
        else:
            # put zero if no Banlangd found
            servicekontrakt_df_langd.at[index, 'Banlangd'] = 0
    # do the same for Bandelnr_from_exact_Plats_match
    elif 'Bandelnr_from_exact_Plats_match' in row and not pd.isna(row['Bandelnr_from_exact_Plats_match']):
        bandel_namn_exakt = row['Bandelnamn']
        bandel_nr = row['Bandelnr']
        # find the corresponding Banlangd in combined_bis_df_nhsp_langd
        banlangd = combined_bis_df_nhsp_langd[(combined_bis_df_nhsp_langd['BdlNr'] == bandel_nr) & (combined_bis_df_nhsp_langd['Plats'] == bandel_namn_exakt)]['Banlangd']
        if len(banlangd) > 0:
            # sum up the Banlangd values
            servicekontrakt_df_langd.at[index, 'Banlangd'] = banlangd.sum()/1000
        else:
            # put zero if no Banlangd found
            servicekontrakt_df_langd.at[index, 'Banlangd'] = 0
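
The three branches above differ only in which column is compared. One way to express the shared step is a small helper, sketched below. `length_km` is hypothetical, the table is made up, and lengths are assumed to be in metres as implied by the division by 1000. Since the sum of an empty selection is 0, the helper returns 0 when nothing matches.

```python
import pandas as pd

# Made-up length table in metres
lengths = pd.DataFrame({
    "BdlNr": [306, 306, 410],
    "Bandel": ["A-B", "A-B", "C-D"],
    "Banlangd": [4000, 1500, 900],
})

def length_km(table, bdl_nr, column, value):
    """Sum Banlangd (metres) over matching rows and return kilometres; 0 if no rows match."""
    rows = table[(table["BdlNr"] == bdl_nr) & (table[column] == value)]
    return rows["Banlangd"].sum() / 1000

print(length_km(lengths, 306, "Bandel", "A-B"))
print(length_km(lengths, 999, "Bandel", "A-B"))
```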

In [15]:
# get rid of the column Bandelnamn_from_fuzzy_match, Bandelnr_from_exact_match and Bandelnr_from_exact_Plats_match if they exist
if 'Bandelnamn_from_fuzzy_match' in servicekontrakt_df_langd:
    servicekontrakt_df_langd = servicekontrakt_df_langd.drop(columns=['Bandelnamn_from_fuzzy_match'])
if 'Bandelnr_from_exact_match' in servicekontrakt_df_langd:
    servicekontrakt_df_langd = servicekontrakt_df_langd.drop(columns=['Bandelnr_from_exact_match'])
if 'Bandelnr_from_exact_Plats_match' in servicekontrakt_df_langd:
    servicekontrakt_df_langd = servicekontrakt_df_langd.drop(columns=['Bandelnr_from_exact_Plats_match'])

The next easiest are the rows which correspond to a single station.

In [16]:
# iterate over rows in servicekontrakt_df_langd (with missing values of langd and Bandelnamn without "-")
# and find the corresponding Banlangd using short_path = combined_bis_df_nhsp_langd['Plats']
for index, row in servicekontrakt_df_langd.iterrows():
    if pd.isna(row['Banlangd']):
        # if Bandelnamn has "-" just continue
        if "-" in row['Bandelnamn']:
            continue
        # if not find the corresponding Banlangd using short_path
        bandel_nr = row['Bandelnr']
        driftplats_sign = row['short_path']
        # find the corresponding Banlangd in combined_bis_df_nhsp_langd
        banlangd = combined_bis_df_nhsp_langd[(combined_bis_df_nhsp_langd['BdlNr'] == bandel_nr) & (combined_bis_df_nhsp_langd['Plats_sign'] == driftplats_sign)]['Banlangd']
        if len(banlangd) > 0:
            # sum up the Banlangd values
            servicekontrakt_df_langd.at[index, 'Banlangd'] = banlangd.sum()/1000 
        else:
            # put zero if no Banlangd found
            servicekontrakt_df_langd.at[index, 'Banlangd'] = 0

### Construction of the Graph

To find the length for the remaining rows, we construct a graph of the network, find the shortest path between the end stations, and accumulate the length of the main tracks along it, if any.

In [17]:
# Step 1: Group by 'BdlNr', 'Bandel', 'Plats_sign', 'Plats' and sum 'Banlangd' where 'Spår_huvud_sido' is 'nhsp'
grouped_by_plats_nhsp = bis_df[bis_df['Spår_huvud_sido'] == 'nhsp'].groupby(['BdlNr', 'Bandel', 'Plats_sign', 'Plats'])['Banlangd'].sum().reset_index()

# Step 2: Group by 'BdlNr', 'Bandel', 'Forbind' and sum 'Banlangd' where 'Spår_huvud_sido' is 'nhsp'
grouped_by_forbind_nhsp = bis_df[bis_df['Spår_huvud_sido'] == 'nhsp'].groupby(['BdlNr', 'Bandel', 'Forbind'])['Banlangd'].sum().reset_index()

# Step 3: Group by 'BdlNr', 'Bandel', 'Plats_sign', 'Plats' where 'Spår_huvud_sido' is not 'nhsp' and set 'Banlangd' to 0
grouped_by_plats_non_nhsp = bis_df[bis_df['Spår_huvud_sido'] != 'nhsp'].groupby(['BdlNr', 'Bandel', 'Plats_sign', 'Plats']).size().reset_index(name='Banlangd')
grouped_by_plats_non_nhsp['Banlangd'] = 0

# Step 4: Group by 'BdlNr', 'Bandel', 'Forbind' where 'Spår_huvud_sido' is not 'nhsp' and set 'Banlangd' to 0
grouped_by_forbind_non_nhsp = bis_df[bis_df['Spår_huvud_sido'] != 'nhsp'].groupby(['BdlNr', 'Bandel', 'Forbind']).size().reset_index(name='Banlangd')
grouped_by_forbind_non_nhsp['Banlangd'] = 0

# Combine nhsp and non-nhsp dataframes
grouped_by_plats = pd.concat([grouped_by_plats_nhsp, grouped_by_plats_non_nhsp], ignore_index=True)
grouped_by_forbind = pd.concat([grouped_by_forbind_nhsp, grouped_by_forbind_non_nhsp], ignore_index=True)

# when duplicate rows are present, keep the one with the highest Banlangd
grouped_by_plats = grouped_by_plats.sort_values('Banlangd', ascending=False).drop_duplicates(['BdlNr', 'Bandel', 'Plats_sign', 'Plats']).sort_index()
grouped_by_forbind = grouped_by_forbind.sort_values('Banlangd', ascending=False).drop_duplicates(['BdlNr', 'Bandel', 'Forbind']).sort_index()

# Step 5: Add 'Plats_sign' and 'Plats' columns with NaN to 'grouped_by_forbind' for consistency
grouped_by_forbind['Plats_sign'] = pd.NA
grouped_by_forbind['Plats'] = pd.NA

# Step 6: Add 'Forbind' column with NaN to 'grouped_by_plats' for consistency
grouped_by_plats['Forbind'] = pd.NA

# Step 7: Combine the two DataFrames using outer concatenation
combined_bis_df_langd = pd.concat([grouped_by_plats, grouped_by_forbind], ignore_index=True, sort=False)
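
The "keep the row with the highest Banlangd" step can be checked on a small made-up table. Sorting in descending order puts the nhsp length before the zero placeholder, so `drop_duplicates` keeps the former.

```python
import pandas as pd

rows = pd.DataFrame({
    "BdlNr": [306, 306, 410],
    "Plats_sign": ["A", "A", "C"],
    "Banlangd": [350, 0, 0],  # the 0 rows stand for non-nhsp track
})

best = (
    rows.sort_values("Banlangd", ascending=False)
        .drop_duplicates(["BdlNr", "Plats_sign"])
        .sort_index()
)
print(best)
```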

In [18]:
import networkx as nx
import pandas as pd

# Global cache for lengths and the graph
langd_cache = {}
GLOBAL_GRAPH = None

# Step 1: Create a mapping from Plats_sign (short code) to Banlangd
station_length_lookup = combined_bis_df_langd.set_index('Plats_sign')['Banlangd'].to_dict()

### Utility Functions ###

def initialize_global_graph(dictionary_df):
    """Initialize the global graph once"""
    global GLOBAL_GRAPH
    if GLOBAL_GRAPH is None:
        #bdl_df = dictionary_df[(dictionary_df['BdlNr'] >= 2) & (dictionary_df['BdlNr'] <= 990)]
        bdl_df = dictionary_df
        GLOBAL_GRAPH = nx.Graph()  # Undirected graph to simulate bidirectional connections
        for _, row in bdl_df.iterrows():
            if pd.notna(row['Forbind']):
                start, end = row['Forbind'].split('-')
                length = row['Banlangd']
                GLOBAL_GRAPH.add_edge(start.strip(), end.strip(), length=length)

def calculate_sum_langd(forbind_list, dictionary_df):
    if not forbind_list or forbind_list == '':
        # print that the forbind_list is empty
        print("forbind_list is empty")
        return None
    
    # Check cache
    cache_key = (forbind_list)
    if cache_key in langd_cache:
        return langd_cache[cache_key]
    
    # Initialize global graph if not already done
    if GLOBAL_GRAPH is None:
        initialize_global_graph(dictionary_df)
    
    # Split and clean the forbind_list
    forbinds = [f.strip() for f in forbind_list.split(',')]
    stations = [station for forbind in forbinds for station in forbind.split('-')]
    first_station = stations[0]
    last_station = stations[-1]
    
    # Check if first and last stations are enclosed in parentheses
    include_first_station = not (first_station.startswith('(') and first_station.endswith(')'))
    include_last_station = not (last_station.startswith('(') and last_station.endswith(')'))
    
    # Remove parentheses for lookup in the graph
    first_station_cleaned = first_station.strip('()')
    last_station_cleaned = last_station.strip('()')
    
    # Check if stations exist in graph
    if first_station_cleaned not in GLOBAL_GRAPH or last_station_cleaned not in GLOBAL_GRAPH:
        langd_cache[cache_key] = None
        print("Stations not found in graph")
        return None
    
    try:
        path_length = nx.shortest_path_length(
            GLOBAL_GRAPH, 
            source=first_station_cleaned, 
            target=last_station_cleaned, 
            weight='length'
        )
        
        # Calculate length of intermediate stations along the same weighted shortest path
        shortest_path_stations = nx.shortest_path(
            GLOBAL_GRAPH, 
            source=first_station_cleaned, 
            target=last_station_cleaned,
            weight='length'
        )
        intermediate_stations = shortest_path_stations[1:-1]  # Exclude first and last station
        station_length_sum = sum(station_length_lookup.get(station, 0) for station in intermediate_stations)
        
        # Add lengths of first and last stations based on inclusion rules
        if include_first_station:
            station_length_sum += station_length_lookup.get(first_station_cleaned, 0)
        if include_last_station:
            station_length_sum += station_length_lookup.get(last_station_cleaned, 0)

        total_length = path_length + station_length_sum
        langd_cache[cache_key] = total_length
        return total_length
        
    except nx.NetworkXNoPath:
        langd_cache[cache_key] = None
        return None  # No path found
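
The following sketch shows the length rule on a made-up three-station network: link lengths along the weighted shortest path, plus the lengths of the stations on it, where an end station written in parentheses is excluded. The graph, `station_length` and `path_length` are illustrative assumptions. The sketch also shows that `nx.shortest_path` without `weight` minimises the number of links rather than the length.

```python
import networkx as nx

# Made-up network: link lengths in metres
graph = nx.Graph()
graph.add_edge("A", "B", length=1000)
graph.add_edge("B", "C", length=1000)
graph.add_edge("A", "C", length=5000)  # direct but longer link

# Made-up main-track length inside each station, in metres
station_length = {"A": 200, "B": 300, "C": 400}

def path_length(short_path):
    """Length in metres for a path such as 'A-C' or '(A)-(C)'."""
    parts = short_path.split("-")
    first, last = parts[0], parts[-1]
    source, target = first.strip("()"), last.strip("()")
    path = nx.shortest_path(graph, source, target, weight="length")
    total = nx.shortest_path_length(graph, source, target, weight="length")
    # Intermediate stations are always included
    total += sum(station_length.get(s, 0) for s in path[1:-1])
    # End stations are included only when not written in parentheses
    if not first.startswith("("):
        total += station_length.get(source, 0)
    if not last.startswith("("):
        total += station_length.get(target, 0)
    return total

print(nx.shortest_path(graph, "A", "C"))                   # fewest links
print(nx.shortest_path(graph, "A", "C", weight="length"))  # shortest length
print(path_length("A-C"), path_length("(A)-(C)"))
```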

### Banlangd (from the shortest path)

In [19]:
def calculate_sum_langd_for_bandelnamn(row, dictionary_df):

    row_bandel = None
    row_forbind = None
    if pd.notna(row['Bandelnr']):
        row_bandel = int(row['Bandelnr'])
    row_forbind = row['short_path']

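    # No station codes could be derived for this row (short_path is None)
    if not isinstance(row_forbind, str):
        return None

    # Case 1: Single station
    if '-' not in row_forbind: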
        single_station = row_forbind
        
        # Find the row in dictionary_df where Plats_sign matches the single station
        matching_station = dictionary_df[dictionary_df['Plats_sign'] == single_station]
        
    
        # keep only rows where BdlNr is same as row['Bandelnr']
        matching_station = matching_station[matching_station['BdlNr'] == row_bandel]

        # If no matching station is found, return None
        if matching_station.empty:
            print(f"No matching station found for {single_station}")
            return None
        
        # Get the station's length as the sum of all values in matching_station['Banlangd']
        # station_length = matching_station['Banlangd'].iloc[0]
        station_length = matching_station['Banlangd'].sum()
        
        return station_length/1000  # Convert to kilometers

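    # Case 2: Multiple stations (shortest path in the graph)
    total_length = calculate_sum_langd(row_forbind, dictionary_df)
    # calculate_sum_langd returns None when a station or path is not found
    return None if total_length is None else total_length / 1000  # Convert to kilometers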

# remove rows from dictionary_df where BdlNr is 1
#dictionary_df = dictionary_df[dictionary_df['BdlNr'] != 1]

# 2. Then calculate sum_langd
# call calculate_sum_langd_for_bandelnamn to calculate the langd for rows where Langd is missing
# servicekontrakt_df_langd['sum_langd'] = servicekontrakt_df_langd.apply(
#     lambda row: calculate_sum_langd_for_bandelnamn(row, combined_bis_df_langd),
#     axis=1
# )
servicekontrakt_df_langd.loc[servicekontrakt_df_langd['Banlangd'].isna(), 'Banlangd'] = servicekontrakt_df_langd[servicekontrakt_df_langd['Banlangd'].isna()].apply(
    lambda row: calculate_sum_langd_for_bandelnamn(row, combined_bis_df_langd),
    axis=1
)

# rename the column 'Banlangd' to 'sum_langd'
servicekontrakt_df_langd = servicekontrakt_df_langd.rename(columns={'Banlangd': 'sum_langd'})

# make sure column sum_langd is a real number
servicekontrakt_df_langd['sum_langd'] = pd.to_numeric(servicekontrakt_df_langd['sum_langd'], errors='coerce')
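
The pattern used above, computing a value only for rows where it is still missing, can be seen in isolation in this sketch. The data and `fallback_length` are made up. The latter is a hypothetical stand-in for the graph-based calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "short_path": ["A-B", "C", "D-E"],
    "Banlangd": [5.5, None, None],  # km; the first row was already matched
})

def fallback_length(row):
    # Hypothetical stand-in for the graph-based calculation
    return {"C": 0.4, "D-E": 2.0}.get(row["short_path"])

# Compute the fallback only for rows that are still missing a length
missing = df["Banlangd"].isna()
df.loc[missing, "Banlangd"] = df[missing].apply(fallback_length, axis=1)
print(df)
```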

## Add track times in km-hours

After we have calculated the length of each track segment, we can add track access times in km-hours, with one new column for each of the following columns:
- 'TPA timmar per år'
- 'TPA timmar natt per år'
- 'TPA timmar helg per år'
- 'EJ TPA timmar per år'
- 'EJ TPA timmar natt per år'
- 'EJ TPA timmar helg per år'

In [20]:
# Define the columns to calculate track access times (km-hours)
track_time_columns = [
    'TPA timmar per år', 'TPA timmar natt per år', 'TPA timmar helg per år', 
    'EJ TPA timmar per år', 'EJ TPA timmar natt per år', 'EJ TPA timmar helg per år'
]

# Create new columns for km-hours by multiplying each track time column by 'sum_langd'
for col in track_time_columns:
    km_hour_col = col.replace('timmar', 'km-timmar') # Naming the new column
    servicekontrakt_df_langd[km_hour_col] = servicekontrakt_df_langd[col] * servicekontrakt_df_langd['sum_langd']
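
A small worked example with made-up numbers: 100 hours per year on a 5.5 km segment gives 550 km-hours. A missing hour value stays missing in the new column.

```python
import pandas as pd

df = pd.DataFrame({
    "TPA timmar per år": [100.0, 40.0],
    "EJ TPA timmar per år": [10.0, None],
    "sum_langd": [5.5, 2.0],  # km
})

for col in ["TPA timmar per år", "EJ TPA timmar per år"]:
    df[col.replace("timmar", "km-timmar")] = df[col] * df["sum_langd"]

print(df[["TPA km-timmar per år", "EJ TPA km-timmar per år"]])
```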

## Matching to contracts

### Reading BIS-contract file

Load the Excel file containing BIS information for mapping the bandel number to the contract name. The file name is BIS_24_kontrakt_bandel_plats.xlsx and it has the sheet BIS 2024-01-09 with columns such as Bandel_nummer and UH_kontraktsområde.

In [21]:
# File and sheet details
excel_file_path = "../Python matching/raw_data/BIS_24_kontrakt_bandel_plats.xlsx"
sheet_name = "BIS 2024-01-09"

# Load the Excel file
bis_kontrakt_df = pd.read_excel(excel_file_path, sheet_name=sheet_name)

### Preparing mapping function

We prepare the mapping from (Bandel_nummer, Plats_sign) to UH_kontraktsområde.

In [22]:
# Step 1: Remove duplicates from the mapping
bandel_plats_to_contract_map = bis_kontrakt_df[['Bandel_nummer', 'Plats_sign','UH_kontraktsområde']].drop_duplicates()

# Step 2: Filter out rows where UH_kontraktsområde is NaN or 'Ingår inte i något kontrakt'
bandel_plats_to_contract_map = bandel_plats_to_contract_map[
    bandel_plats_to_contract_map['UH_kontraktsområde'].notna() & 
    (bandel_plats_to_contract_map['UH_kontraktsområde'] != 'Ingår inte i något kontrakt')
]
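
To make the effect of the two steps concrete, here is a minimal sketch on a toy table (all values are made up for illustration). It also counts how many distinct contracts each (Bandel_nummer, Plats_sign) pair maps to, which is one way to see in advance whether the "multiple matches" branches further down can be triggered; that check is an addition for illustration and not part of the original workflow.

```python
import pandas as pd

# Toy stand-in for bis_kontrakt_df; all values are made up for illustration
toy_bis = pd.DataFrame({
    'Bandel_nummer': [101, 101, 102, 103],
    'Plats_sign': ['A', 'A', 'B', 'C'],
    'UH_kontraktsområde': ['Kontrakt X', 'Kontrakt X',
                           'Ingår inte i något kontrakt', None],
})

# Step 1: remove duplicate rows
toy_map = toy_bis[['Bandel_nummer', 'Plats_sign', 'UH_kontraktsområde']].drop_duplicates()

# Step 2: drop rows without a real contract
toy_map = toy_map[
    toy_map['UH_kontraktsområde'].notna()
    & (toy_map['UH_kontraktsområde'] != 'Ingår inte i något kontrakt')
]

# Number of distinct contracts per (bandel, station) key
contracts_per_key = toy_map.groupby(['Bandel_nummer', 'Plats_sign'])['UH_kontraktsområde'].nunique()

print(toy_map)
print(contracts_per_key.max())
```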

### Map to contracts

We first create a new, empty column and then fill it step by step with matches. We start with the rows where both a Bandelnr and a single station are present.

In [23]:
# create an empty column kontrakt_från_bandel in servicekontrakt_df_langd
servicekontrakt_df_langd['kontrakt_från_bandel'] = pd.NA

# for rows with non-empty Bandelnr, and short_path not containing "-" find the corresponding kontrakt_från_bandel
for index, row in servicekontrakt_df_langd.iterrows():
    if pd.notna(row['Bandelnr']) and '-' not in row['short_path']:
        bandel_nr = int(row['Bandelnr'])
        driftplats_sign = row['short_path']
        # remove parentheses
        driftplats_sign = driftplats_sign.strip('()')
        # find the corresponding kontrakt_från_bandel in bandel_plats_to_contract_map
        kontrakt = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Bandel_nummer'] == bandel_nr) & (bandel_plats_to_contract_map['Plats_sign'] == driftplats_sign)]['UH_kontraktsområde']
        if len(kontrakt) == 1:
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt.values[0]
        elif len(kontrakt) > 1:
            # save the first and print that there are multiple matches
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt.values[0]
            print("Multiple matches for Bandelnr: ", bandel_nr)
        else:
            # mark the row as 'Inget kontrakt' if no contract was found
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = 'Inget kontrakt'
            print("No match for Bandelnr: ", bandel_nr)
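
The row-by-row lookup above could also be written as a left merge on the pair (Bandelnr, station sign). The sketch below shows the idea on toy frames whose column names mirror the notebook but whose values are invented. It is an alternative formulation rather than the method used here, and it assumes each key maps to at most one contract; the validate argument of merge raises an error if that assumption does not hold.

```python
import pandas as pd

# Toy stand-ins; column names mirror the notebook, values are illustrative
rows = pd.DataFrame({
    'Bandelnr': [101.0, 102.0, None],
    'short_path': ['(A)', 'B', 'C-D'],
})
mapping = pd.DataFrame({
    'Bandel_nummer': [101],
    'Plats_sign': ['A'],
    'UH_kontraktsområde': ['Kontrakt X'],
})

# Select single-station rows that have a bandel number
single = rows['Bandelnr'].notna() & ~rows['short_path'].str.contains('-', regex=False)
keys = pd.DataFrame({
    'Bandel_nummer': rows.loc[single, 'Bandelnr'].astype(int),
    'Plats_sign': rows.loc[single, 'short_path'].str.strip('()'),
})

# Left merge keeps every selected row; validate raises if a key is not unique in the mapping
merged = keys.merge(mapping, on=['Bandel_nummer', 'Plats_sign'],
                    how='left', validate='many_to_one')
merged.index = keys.index  # restore the original row labels (a left merge keeps the order)

# Rows outside the selection are left as missing
rows['kontrakt_från_bandel'] = merged['UH_kontraktsområde'].fillna('Inget kontrakt')
print(rows)
```

Unlike the loop, this variant does not silently pick the first of several matches, so ambiguous keys would have to be resolved before merging.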

We move on to rows with two stations where exactly one of them is in parentheses.

In [24]:
# map rows whose short_path contains "-" and exactly one "(":
# split short_path on "-", take the station without parentheses and look up its contract
for index, row in servicekontrakt_df_langd.iterrows():
    if '-' in row['short_path'] and row['short_path'].count('(') == 1:
        # split the short_path by "-"
        stations = row['short_path'].split('-')
        if stations[0].startswith('('):
            station_sign = stations[1]
            kontrakt = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == station_sign)]['UH_kontraktsområde'].unique()
        else:
            station_sign = stations[0]
            kontrakt = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == station_sign)]['UH_kontraktsområde'].unique()
        if len(kontrakt) == 1:
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
        elif len(kontrakt) > 1:
            if pd.notna(row['Bandelnr']):
                bandel_nr = int(row['Bandelnr'])
                kontrakt = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Bandel_nummer'] == bandel_nr) & (bandel_plats_to_contract_map['Plats_sign'] == station_sign)]['UH_kontraktsområde'].unique()
                if len(kontrakt) == 1:
                    servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
                elif len(kontrakt) > 1:
                    # save the first and print that there are multiple matches
                    servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
                    print("Multiple matches  (even after fixing bandelnr) for Bandelnr: ", bandel_nr)
            else:
                # save the first and print that there are multiple matches
                servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
                print("Multiple matches for Bandelnamn: ", station_sign)
        else:
            # mark the row as 'Inget kontrakt' if no contract was found
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = 'Inget kontrakt'
            print("No match for Bandelnamn: ", station_sign)
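
The matching cells distinguish their cases by counting "-" and "(" in short_path, where a station in parentheses marks an end that is not included. The sketch below makes that convention explicit with a hypothetical helper, parse_short_path, which is not part of the notebook. It assumes that station signatures never contain "-"; the example strings are illustrative.

```python
def parse_short_path(short_path):
    """Split a short_path such as '(Vns)-Thö' into (station, included) pairs.

    Hypothetical helper, not part of the notebook. It assumes that station
    signatures never contain '-' and that parentheses mark an excluded end.
    """
    parts = short_path.split('-')
    return [(part.strip('()'), not part.startswith('(')) for part in parts]


print(parse_short_path('(Vns)-Thö'))  # one end excluded
print(parse_short_path('Ap-Lsl'))     # both ends included
print(parse_short_path('(Ap)'))       # single station in parentheses
```

With such a helper, the number of included ends (0, 1 or 2) could select the matching strategy instead of repeated string counting.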

Next are the rows where the track section contains no parentheses (i.e., both ends are included).

In [25]:
for index, row in servicekontrakt_df_langd.iterrows():
    if '-' in row['short_path'] and row['short_path'].count('(') == 0:
        # split the short_path by "-"
        stations = row['short_path'].split('-')
        kontrakt_1 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == stations[0])]['UH_kontraktsområde'].unique()
        kontrakt_2 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == stations[1])]['UH_kontraktsområde'].unique()
        if pd.notna(row['Bandelnr']):
            bandel_nr = int(row['Bandelnr'])
            if len(kontrakt_1) > 1:
                kontrakt_1 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Bandel_nummer'] == bandel_nr) & (bandel_plats_to_contract_map['Plats_sign'] == stations[0])]['UH_kontraktsområde'].unique()
            if len(kontrakt_2) > 1:
                kontrakt_2 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Bandel_nummer'] == bandel_nr) & (bandel_plats_to_contract_map['Plats_sign'] == stations[1])]['UH_kontraktsområde'].unique()
        # kontrakt is the union of kontrakt_1 and kontrakt_2 (the set removes duplicates)
        kontrakt = list(set(kontrakt_1) | set(kontrakt_2))
        if len(kontrakt) == 1:
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
        elif len(kontrakt) > 1:
            # several candidates: pick the one closest to row['Kontraktsområdesnamn'] by fuzzy matching
            closest_match = difflib.get_close_matches(row['Kontraktsområdesnamn'], kontrakt, n=1, cutoff=0.6)
            # get_close_matches returns an empty list if no candidate reaches the cutoff
            if closest_match:
                servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = closest_match[0]
            # print the closest match together with row['Kontraktsområdesnamn']
            print(f"Multiple matches for Bandelnamn: {stations[0]} and {stations[1]} with Kontraktsområdesnamn: {row['Kontraktsområdesnamn']}. Closest match: {closest_match}")
        else:
            # mark the row as 'Inget kontrakt' if no contract was found
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = 'Inget kontrakt'
            print("No match for Bandelnamn: ", row['short_path'])

Multiple matches for Bandelnamn: Vns and Thö with Kontraktsområdesnamn: Långsele-Vännäs, Botniabanan. Closest match: ['Långsele-Vännäs inkl Botniabanan o Forsmo-Hoting']
Multiple matches for Bandelnamn: Ap and Lsl with Kontraktsområdesnamn: Långsele-Vännäs, Botniabanan. Closest match: ['Långsele-Vännäs inkl Botniabanan o Forsmo-Hoting']
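
Note that difflib.get_close_matches returns an empty list when no candidate reaches the cutoff, so taking element [0] of its result is only safe when a match exists. A minimal sketch of a guarded lookup follows; pick_closest is a hypothetical helper name, and the fallback label is an assumption chosen to match the label used for unmatched rows.

```python
import difflib


def pick_closest(name, candidates, cutoff=0.6, fallback='Inget kontrakt'):
    """Return the candidate most similar to name, or fallback if none passes the cutoff.

    Hypothetical helper, not part of the notebook.
    """
    matches = difflib.get_close_matches(name, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else fallback


candidates = ['Stockholm Nord', 'Mälarbanan']
print(pick_closest('Mälarbanan', candidates))  # the exact name is among the candidates
print(pick_closest('xyz', candidates))         # nothing is similar enough
```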


Lastly, we match track sections where both ends are in parentheses (i.e., neither end is included).

In [26]:
for index, row in servicekontrakt_df_langd.iterrows():
    if '-' in row['short_path'] and row['short_path'].count('(') == 2:
        kontrakt = []
        if pd.notna(row['Bandelnr']):
            bandel_nr = int(row['Bandelnr'])
            kontrakt = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Bandel_nummer'] == bandel_nr)]['UH_kontraktsområde'].unique()
            if len(kontrakt) == 1:
                servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
                continue
            elif len(kontrakt) > 1:
                # several candidates: first check whether row['Kontraktsområdesnamn'] is a
                # substring of exactly one of them and, if so, choose that one
                kontrakt_substring = [x for x in kontrakt if row['Kontraktsområdesnamn'] in x]
                if len(kontrakt_substring) == 1:
                    servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt_substring[0]
                    # print the chosen match together with row['Kontraktsområdesnamn']
                    print(f"Multiple matches for Bandelnamn: {kontrakt} with Kontraktsområdesnamn: {row['Kontraktsområdesnamn']}. Closest match: {kontrakt_substring}")
                    continue
                # otherwise fall back to fuzzy matching against row['Kontraktsområdesnamn']
                closest_match = difflib.get_close_matches(row['Kontraktsområdesnamn'], kontrakt, n=1, cutoff=0.7)
                # get_close_matches returns an empty list if no candidate reaches the cutoff
                if closest_match:
                    servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = closest_match[0]
                # print the closest match together with row['Kontraktsområdesnamn']
                print(f"Multiple matches for Bandelnamn: {kontrakt} with Kontraktsområdesnamn: {row['Kontraktsområdesnamn']}. Closest match: {closest_match}")
                continue
        # split the short_path by "-"
        stations = row['short_path'].split('-')
        # remove parentheses
        stations = [station.strip('()') for station in stations]
        kontrakt_1 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == stations[0])]['UH_kontraktsområde'].unique()
        kontrakt_2 = bandel_plats_to_contract_map[(bandel_plats_to_contract_map['Plats_sign'] == stations[1])]['UH_kontraktsområde'].unique()
        # at this point Bandelnr is missing or matched no contract, so the station
        # lookups above cannot be narrowed down any further
        # kontrakt is the union of kontrakt_1 and kontrakt_2 (the set removes duplicates)
        kontrakt = list(set(kontrakt_1) | set(kontrakt_2))
        if len(kontrakt) == 1:
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt[0]
        elif len(kontrakt) > 1:
            # check whether row['Kontraktsområdesnamn'] is a substring of exactly one candidate; if so, choose it
            kontrakt_substring = [x for x in kontrakt if row['Kontraktsområdesnamn'] in x]
            if len(kontrakt_substring) == 1:
                servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = kontrakt_substring[0]
                # print the chosen match together with row['Kontraktsområdesnamn']
                print(f"Multiple matches for Bandelnamn: {kontrakt} with Kontraktsområdesnamn: {row['Kontraktsområdesnamn']}. Closest match: {kontrakt_substring}")
                continue
            # otherwise fall back to fuzzy matching against row['Kontraktsområdesnamn']
            closest_match = difflib.get_close_matches(row['Kontraktsområdesnamn'], kontrakt, n=1, cutoff=0.7)
            # get_close_matches returns an empty list if no candidate reaches the cutoff
            if closest_match:
                servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = closest_match[0]
            # print the closest match together with row['Kontraktsområdesnamn']
            print(f"Multiple matches for Bandelnamn: {kontrakt} with Kontraktsområdesnamn: {row['Kontraktsområdesnamn']}. Closest match: {closest_match}")
        else:
            # mark the row as 'Inget kontrakt' if no contract was found
            servicekontrakt_df_langd.at[index, 'kontrakt_från_bandel'] = 'Inget kontrakt'
            print("No match for Bandelnamn: ", row['short_path'])

Multiple matches for Bandelnamn: ['Stockholm Nord' 'Mälarbanan'] with Kontraktsområdesnamn: Mälarbanan. Closest match: ['Mälarbanan']
Multiple matches for Bandelnamn: ['Norra stambanan', 'Ostkustbanan'] with Kontraktsområdesnamn: Norra Stambanan. Closest match: ['Norra stambanan']
Multiple matches for Bandelnamn: ['Norra stambanan', 'Ostkustbanan'] with Kontraktsområdesnamn: Norra Stambanan. Closest match: ['Norra stambanan']
Multiple matches for Bandelnamn: ['Ostkustbanan' 'Stockholm Nord'] with Kontraktsområdesnamn: Ostkustbanan. Closest match: ['Ostkustbanan']
Multiple matches for Bandelnamn: ['Stockholm Nord', 'Stockholm Mitt Drift o avhjälpande/Förebyggand'] with Kontraktsområdesnamn: Stockholm Mitt. Closest match: ['Stockholm Mitt Drift o avhjälpande/Förebyggand']
Multiple matches for Bandelnamn: ['Ostkustbanan' 'Stockholm Nord'] with Kontraktsområdesnamn: Stockholm Nord. Closest match: ['Stockholm Nord']
Multiple matches for Bandelnamn: ['Västra Södra Stambanan' 'Stockholm Syd']
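
The substring test in the cell above is case-sensitive, so a name such as 'Norra Stambanan' is not found inside 'Norra stambanan' and is instead resolved by the fuzzy step. If a case-insensitive substring test is preferred, it could look like the sketch below; substring_candidates is a hypothetical helper name, not part of the notebook.

```python
def substring_candidates(name, candidates):
    """Return the candidates that contain name, ignoring case.

    Hypothetical helper, not part of the notebook.
    """
    needle = name.casefold()
    return [c for c in candidates if needle in c.casefold()]


candidates = ['Norra stambanan', 'Ostkustbanan']
case_sensitive = [c for c in candidates if 'Norra Stambanan' in c]
case_insensitive = substring_candidates('Norra Stambanan', candidates)

print(case_sensitive)    # no hit because of the capital S
print(case_insensitive)  # one hit once case is ignored
```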

## Export processed cleaned data

Export the relevant columns for visualisation in Power BI.

In [27]:
# add column Total timmar per år by adding TPA timmar per år and EJ TPA timmar per år
servicekontrakt_df_langd['Total timmar per år'] = servicekontrakt_df_langd['TPA timmar per år'] + servicekontrakt_df_langd['EJ TPA timmar per år']
servicekontrakt_df_langd['Total km-timmar per år'] = servicekontrakt_df_langd['TPA km-timmar per år'] + servicekontrakt_df_langd['EJ TPA km-timmar per år']

#  for excel
excel_file_path = "./exported_data_regression/Servicekontrakt_per_bandel_matched_all_2011_2023.xlsx"
# servicekontrakt_df_to_export = servicekontrakt_df_langd[['Kontraktsområdesnamn', 'kontrakt_från_bandel', 'Tidsperiod', 'Bandel', 'TPA timmar per år',
#        'TPA dagar per år', 'TPA veckor per år', 'TPA timmar natt per år',
#        'TPA timmar helg per år', 'EJ TPA timmar per år', 'EJ TPA dagar per år',
#        'EJ TPA veckor per år', 'EJ TPA timmar natt per år',
#        'EJ TPA timmar helg per år', 'Total timmar per år', 'Bandelnr',
#        'Bandelnamn', 'sum_langd']]
servicekontrakt_df_langd.to_excel(excel_file_path, index=False)
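
One detail to be aware of when computing the totals above: with +, a missing value in either column makes the total missing. If a missing value should instead count as zero hours (an assumption that may or may not suit this data), Series.add with fill_value=0 is one way to express that. A toy sketch with invented numbers:

```python
import pandas as pd

# Toy values for illustration only
tpa = pd.Series([10.0, 5.0, float('nan')])
ej_tpa = pd.Series([2.0, float('nan'), float('nan')])

total_plain = tpa + ej_tpa                    # NaN if either side is NaN
total_filled = tpa.add(ej_tpa, fill_value=0)  # NaN only if both sides are NaN

print(total_plain.tolist())
print(total_filled.tolist())
```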