# Exporting track times for regressions (2011-2023)

Assumptions:
- Only contracts between 2011 and 2023
- Only main tracks (nhsp)
- There are no track times for Långsele-Vännäs, Botniabanan in 2023, because the contract ends in 2022. We therefore assume the same track times as in 2022.

The regression data needs the following structure:
- Years from 2011 until 2023
- Contract regions (all contract regions)
- Whether a servicefönster (maintenance window) was applied during that year
- If yes, which track times are provided (aggregated as far as possible)


## Import data

We start by reading the data that was exported after cleaning and matching the track access times from the contracts.

In [9]:
import pandas as pd # type: ignore

# Step 1: Load the Excel file containing service contracts for each bandel
excel_file_path = "./exported_data_regression/Servicekontrakt_per_bandel_matched_all_2011_2023.xlsx"


# Read the workbook into a DataFrame (no sheet_name is passed, so pandas reads the first sheet)
servicekontrakt_df = pd.read_excel(excel_file_path)
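
Since the later cells depend on a few specific columns, a quick check right after loading can catch a wrong sheet or a renamed column early. The sketch below is only an illustration: `missing_columns` is a hypothetical helper and `toy_df` is made-up data standing in for `servicekontrakt_df`; the column names are the ones used further down in this notebook.

```python
import pandas as pd

# Columns that later cells read from the contract table (names taken from this notebook).
REQUIRED_COLUMNS = ["Kontraktsområdesnamn", "Start_year", "End_year"]


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Hypothetical helper: return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for servicekontrakt_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Kontraktsområdesnamn": ["Region A"],
    "Start_year": [2015],
})

print(missing_columns(toy_df, REQUIRED_COLUMNS))  # ['End_year']
```

In the notebook itself, the same call could be made on `servicekontrakt_df`, raising an error if the returned list is not empty.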

## Construct regression data

We first prepare the years and contract regions, then initialize the regression dataframe.

In [10]:
# Step 3: Create a base DataFrame with all combinations of years and contract regions
years = list(range(2011, 2024))  # 2011 to 2023
contract_regions = servicekontrakt_df['Kontraktsområdesnamn'].unique()
regression_data = pd.MultiIndex.from_product([years, contract_regions], names=['Year', 'Kontraktsområdesnamn']).to_frame(index=False)
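
To make the shape of this base table concrete, here is a minimal sketch with made-up inputs (`toy_years` and `toy_regions` are illustrative names, not part of the real data). It shows that `pd.MultiIndex.from_product` yields one row per (year, region) pair, with the years as the outer loop.

```python
import pandas as pd

# Made-up inputs for illustration only.
toy_years = [2011, 2012, 2013]
toy_regions = ["Region A", "Region B"]

# Cartesian product of years and regions, flattened into ordinary columns.
toy_grid = pd.MultiIndex.from_product(
    [toy_years, toy_regions], names=["Year", "Kontraktsområdesnamn"]
).to_frame(index=False)

print(len(toy_grid))  # 6 rows: 3 years x 2 regions
print(toy_grid.head(3))
```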

Fill in whether maintenance windows were applied or not.

In [11]:
# Step 2: Identify the first year a servicefönster was implemented for each contract region
servicefönster_start_year = servicekontrakt_df.groupby('Kontraktsområdesnamn')['Start_year'].min().reset_index()
servicefönster_start_year.rename(columns={'Start_year': 'First_Servicefönster_Year'}, inplace=True)

# Step 4: Merge with the servicefönster start year data
regression_data = regression_data.merge(servicefönster_start_year, on='Kontraktsområdesnamn', how='left')

# Step 5: Determine if servicefönster was applied (True if Year >= First_Servicefönster_Year)
regression_data['Servicefönster_applied'] = regression_data['Year'] >= regression_data['First_Servicefönster_Year']
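
One detail of the comparison above is worth spelling out: if a region has no start year (NaN), the comparison evaluates to False, so the region counts as having no servicefönster. The toy frame below uses made-up values and is only a sketch of that behaviour.

```python
import pandas as pd

# Made-up rows: three with a known first year, one where it is missing (NaN).
toy = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2016],
    "First_Servicefönster_Year": [2015, 2015, 2015, float("nan")],
})

# Any comparison against NaN is False, so the last row is flagged as "not applied".
toy["Servicefönster_applied"] = toy["Year"] >= toy["First_Servicefönster_Year"]

print(toy["Servicefönster_applied"].tolist())  # [False, True, True, False]
```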

Expand the dataframe so that each contract row is repeated once for every year from Start_year to End_year.

In [12]:
# Step 6: Expand data to include all years between Start_year and End_year
expanded_data = []

for _, row in servicekontrakt_df.iterrows():
    for year in range(row['Start_year'], row['End_year'] + 1):  # Include End_year
        new_row = row.copy()
        new_row['Year'] = year
        expanded_data.append(new_row)

expanded_servicekontrakt_df = pd.DataFrame(expanded_data)
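
The row-by-row loop is easy to read but can be slow on large tables. As an alternative sketch (not the method used in this notebook), the same expansion can be written with `DataFrame.explode`. The data below is made up, and the approach assumes `Start_year` and `End_year` are integers without missing values.

```python
import pandas as pd

# Made-up contracts for illustration.
toy_contracts = pd.DataFrame({
    "Kontraktsområdesnamn": ["Region A", "Region B"],
    "Start_year": [2015, 2020],
    "End_year": [2017, 2020],
})

# Build one list of years per contract row (End_year inclusive) ...
year_lists = pd.Series(
    [list(range(start, end + 1))
     for start, end in zip(toy_contracts["Start_year"], toy_contracts["End_year"])],
    index=toy_contracts.index,
)

# ... then turn every list element into its own row.
expanded = toy_contracts.assign(Year=year_lists).explode("Year", ignore_index=True)
expanded["Year"] = expanded["Year"].astype(int)

print(expanded[["Kontraktsområdesnamn", "Year"]])
```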

Fill in the promised track times.

In [13]:
# Step 7: Define columns for aggregation
track_time_columns = [
    'TPA timmar per år', 'TPA timmar natt per år', 'TPA timmar helg per år',
    'EJ TPA timmar per år', 'EJ TPA timmar natt per år', 'EJ TPA timmar helg per år', 'Total timmar per år'
]

track_distance_time_columns = [
    'TPA km-timmar per år', 'TPA km-timmar natt per år', 'TPA km-timmar helg per år',
    'EJ TPA km-timmar per år', 'EJ TPA km-timmar natt per år', 'EJ TPA km-timmar helg per år', 'Total km-timmar per år'
]

all_columns_to_aggregate = track_time_columns + track_distance_time_columns

# Step 8: Aggregate service contract track times per year and contract region
aggregated_times = expanded_servicekontrakt_df.groupby(['Year', 'Kontraktsområdesnamn'])[all_columns_to_aggregate].sum().reset_index()

# Step 9: Ensure all contract regions are represented for all years (2011-2023)
full_regression_data = pd.MultiIndex.from_product([years, contract_regions], names=['Year', 'Kontraktsområdesnamn']).to_frame(index=False)
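
Because the source table has one row per bandel, several rows can share the same year and contract region, and the aggregation sums them. A minimal sketch with made-up numbers and a single track-time column:

```python
import pandas as pd

# Made-up expanded rows: two bandel rows for Region A in 2015, one in 2016.
toy_expanded = pd.DataFrame({
    "Year": [2015, 2015, 2016],
    "Kontraktsområdesnamn": ["Region A", "Region A", "Region A"],
    "Total timmar per år": [100.0, 50.0, 80.0],
})

# Sum the track times per (year, region), as in Step 8.
toy_aggregated = (
    toy_expanded
    .groupby(["Year", "Kontraktsområdesnamn"])[["Total timmar per år"]]
    .sum()
    .reset_index()
)

print(toy_aggregated["Total timmar per år"].tolist())  # [150.0, 80.0]
```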

Merge the different dataframes.

In [14]:
# Step 10: Merge with aggregated data
final_regression_data = full_regression_data.merge(aggregated_times, on=['Year', 'Kontraktsområdesnamn'], how='left')
# also merge with servicefönster applied data
final_regression_data = final_regression_data.merge(regression_data[['Year', 'Kontraktsområdesnamn', 'Servicefönster_applied']], on=['Year', 'Kontraktsområdesnamn'], how='left')
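
A left merge silently duplicates rows if the right-hand table contains the same key twice. One way to guard against that, shown here as a sketch on made-up data, is the `validate` argument of `DataFrame.merge`, which raises an error when the keys are not unique on both sides.

```python
import pandas as pd

# Made-up base grid and aggregated table.
grid = pd.DataFrame({
    "Year": [2015, 2016],
    "Kontraktsområdesnamn": ["Region A", "Region A"],
})
agg = pd.DataFrame({
    "Year": [2015],
    "Kontraktsområdesnamn": ["Region A"],
    "Total timmar per år": [150.0],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key is duplicated.
merged = grid.merge(
    agg, on=["Year", "Kontraktsområdesnamn"], how="left", validate="one_to_one"
)

# Years without a match keep their row and get NaN in the track-time column.
print(merged)
```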

## Export regression data

In [15]:
# There are no track times for Långsele-Vännäs, Botniabanan in 2023 (the contract ends in 2022).
# We therefore forward-fill, which carries the 2022 values into 2023.
botniabanan_mask = final_regression_data['Kontraktsområdesnamn'] == 'Långsele-Vännäs, Botniabanan'
final_regression_data.loc[botniabanan_mask, all_columns_to_aggregate] = final_regression_data.loc[botniabanan_mask, all_columns_to_aggregate].ffill()

# Step 11: Fill missing values with 0 (Assume no track time assigned)
for col in all_columns_to_aggregate:
    final_regression_data[col] = final_regression_data[col].fillna(0)
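
The order of the two steps in this cell matters: the forward fill has to run before the zero fill, otherwise the 2023 gap would already be 0 and nothing would be carried forward. The sketch below uses made-up values for a single region to illustrate this. Years before the first contract stay NaN after the forward fill and only then become 0.

```python
import pandas as pd

# Made-up rows, sorted by year: no contract in 2021, a value in 2022, a gap in 2023.
toy = pd.DataFrame({
    "Year": [2021, 2022, 2023],
    "Kontraktsområdesnamn": ["Region A"] * 3,
    "Total timmar per år": [None, 120.0, None],
})

mask = toy["Kontraktsområdesnamn"] == "Region A"
cols = ["Total timmar per år"]

# 1) Carry the last known value forward (2022 -> 2023).
toy.loc[mask, cols] = toy.loc[mask, cols].ffill()
# 2) Everything still missing is treated as "no track time assigned".
toy[cols] = toy[cols].fillna(0)

print(toy["Total timmar per år"].tolist())  # [0.0, 120.0, 120.0]
```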

In [16]:
# Step 12: Export the structured regression data
output_csv = "./exported_data_regression/regression_data_utlovade_tider_2011_2023.csv"
output_excel = "./exported_data_regression/regression_data_utlovade_tider_2011_2023.xlsx"

final_regression_data.to_csv(output_csv, index=False)
final_regression_data.to_excel(output_excel, index=False)

print(f"✅ Regression data successfully exported:\n- CSV: {output_csv}\n- Excel: {output_excel}")


✅ Regression data successfully exported:
- CSV: ./exported_data_regression/regression_data_utlovade_tider_2011_2023.csv
- Excel: ./exported_data_regression/regression_data_utlovade_tider_2011_2023.xlsx
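
Before relying on the exported file, it can be useful to confirm that the panel is balanced, that is, that every (year, contract region) pair occurs exactly once. The following is a sketch only: `is_balanced_panel` is a hypothetical helper and the frames are made-up data.

```python
from itertools import product

import pandas as pd


def is_balanced_panel(df: pd.DataFrame, years, regions) -> bool:
    """Hypothetical helper: True if each (Year, region) pair occurs exactly once."""
    actual = sorted(zip(df["Year"], df["Kontraktsområdesnamn"]))
    expected = sorted(product(years, regions))
    return actual == expected


# Made-up example: one region observed in two years.
balanced = pd.DataFrame({
    "Year": [2022, 2023],
    "Kontraktsområdesnamn": ["Region A", "Region A"],
})
unbalanced = balanced.iloc[:1]  # the 2023 row is missing

print(is_balanced_panel(balanced, [2022, 2023], ["Region A"]))    # True
print(is_balanced_panel(unbalanced, [2022, 2023], ["Region A"]))  # False
```

In this notebook, the check would correspond to calling the helper with `final_regression_data`, `years` and `contract_regions`.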
