# Equivalence between trips and paths
TOC:

[Part 1: Trips](#part-1-trips)
- [1.1 Preformat trips](#1-1-preformat-trips)

[Part 2: Paths](#part-2-paths)
- [2.1 Preformat paths](#2-1-preformat-paths)

[Part 3: Remove un-localisable trips](#part-3-remove-un-localisable-trips)

[Part 4: Construct an equivalence between stations in MobA and stations in MMX](#part-4-construct-an-equivalence-between-stations-in-moba-and-stations-in-mmx)
- [4.1 Build the station_to_NUTS dictionary](#4-1-build-the-station_to_nuts-dictionary)
- [4.2 Build the NUTS_to_MMX_train_station dictionary](#4-2-build-the-nuts_to_mmx_train_station-dictionary)
- [4.3 Join the two dictionaries](#4-3-join-the-two-dictionaries)

[Part 5: To how many trips can I assign a path?](#part-5-to-how-many-trips-can-i-assign-a-path)

[Part 6: Assign paths to trips](#part-6-assign-paths-to-trips)
- [6.1 Read and format the necessary files](#6-1-read-and-format-the-necessary-files)
- [6.2 Separate itineraries in unique and repeated ones](#6-2-separate-itineraries-in-unique-and-repeated-ones)
- [6.3 Merge trips with itineraries](#6-3-merge-trips-with-itineraries)

[Part 7: A bit of analysis of the assignation](#part-7-a-bit-of-analysis-of-the-assignation)

[Part 8: Logit Model calibration](#part-8-logit-model-calibration)

[Part 9: Analysis of the calibration](#part-9-analysis-of-the-calibration)


In [332]:
# libraries to import
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from ast import literal_eval
from typing import Dict
import os
print(os.getcwd())
os.chdir(r"C:\Users\LMENENDEZ\GitHub\MultiModX")
print(os.getcwd())
pd.set_option('display.max_columns', None)

C:\Users\LMENENDEZ\GitHub\MultiModX
C:\Users\LMENENDEZ\GitHub\MultiModX


In [333]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [334]:
%autoreload
from strategic_evaluator.logit_model import *

## Part 1: Trips

In [335]:
# Trips during the week 22/09/2022 to 28/09/2022 (Thursday to Wednesday)
# The day of study selected was Friday 23/09/2022, to put the air layer under pressure
all_trips = pd.read_csv(
    r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP4 Performance Assessment Solution\Demand data\Matrices MITMA\with_archetypes\20220922_28_trip_matrix_arc_pt_processed.csv.gz",
    compression="gzip",
    sep="|"
)

In [336]:
# Note: there is no trip id column here, although a trip id is used repeatedly later on
trips = all_trips[all_trips["date"] == 20220923].reset_index(drop=True).rename(columns={"origin_nut": "origin", "destination_nut": "destination"})
trips.head()

Unnamed: 0,date,trip_period,origin_zone,origin,origin_name,destination_zone,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,distance,route_distance,duration,mode,service,legs,trip_vehicle_type,nationality,home_census,home_zone,overnight_census,income,age,sex,vehicle_type,short_professional_driver,trips,trips_km,sample_trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced
0,20220923,P00,01002,ES211,Álava,4802006,ES213,Vizcaya,,,NF,O,D04_[10000-50000),D04_[10000-50000),01-02,train,conv_unknown,P00*01002*01002*None*train_13121*00-01*road*No...,other,ES,2_48,4801303,2_48,I01_[10000-15000),A01_[25-45),male,passenger,False,4.135,139.516,1.0,0.0,0.0,2.0675,2.0675,0.0,0.0,3,road-train-road,train_13121-train_13200,train_13121,train_13200,national,2,1,0,train_13121-train_13200,train_13106,train_13200
1,20220923,P00,01009_AM,ES211,Álava,31010_AM,ES220,Navarra,,,O,H,D04_[10000-50000),D04_[10000-50000),00-01,train,conv_unknown,P00*01009_AM*01009_AM*None*train_11213*00-01*r...,other,ES,2_31,31010_AM,2_31,I01_[10000-15000),A01_[25-45),male,passenger,False,5.191,69.312,1.0,3.707857,0.0,0.0,0.494381,0.0,0.988762,3,road-train-road,train_11213-train_11300,train_11213,train_11300,national,2,1,0,train_11213-train_11300,train_13106,train_11300
2,20220923,P00,01009_AM,ES211,Álava,abroad_208,abroad,abroad,,ground_Fra_08,NF,NF,abroad,D05_[50000-inf),01-02,train,conv_unknown,P00*01009_AM*01009_AM*None*train_11213*00-01*r...,other,FR,,,,,,,passenger,False,1.599,309.588,1.0,0.888333,0.39975,0.142133,0.071067,0.062183,0.035533,3,road-train-road,train_11213-train_11600,train_11213,train_11600,international_D,2,1,0,train_11213-train_11600,train_11208,train_11511
3,20220923,P00,01036,ES211,Álava,4802006,ES213,Vizcaya,,,H,O,D04_[10000-50000),D04_[10000-50000),00-01,train,conv_unknown,P00*01036*01036*None*train_13106*00-01*road*No...,other,ES,2_01,01036,2_48,I02_[15000-inf),A02_[45-65),female,passenger,False,6.236,139.477,1.0,3.118,0.0,0.7795,1.559,0.7795,0.0,3,road-train-road,train_13106-train_13200,train_13106,train_13200,national,2,1,0,train_13106-train_13200,train_13106,train_13200
4,20220923,P00,0105902,ES211,Álava,09219,ES412,Burgos,,,O,NF,D04_[10000-50000),D04_[10000-50000),01-02,train,conv_unknown,P00*0105902*0105901*None*train_11208*00-01*roa...,other,ES,2_01,0105904,2_01,I02_[15000-inf),A02_[45-65),male,passenger,False,4.215,155.813,1.0,3.417568,0.113919,0.227838,0.227838,0.227838,0.0,3,road-train-road,train_11208-train_11200,train_11208,train_11200,national,2,1,0,train_11208-train_11200,train_11208,train_11200


In [337]:
# Maps each island airport to its disaggregated NUTS code and name, replacing the deprecated aggregated island NUTS
airports_to_NUTS={"airport_LPA":("ES705","Gran Canaria"),
                 "airport_FUE":("ES704","Fuerteventura"),
                 "airport_ACE":("ES708","Lanzarote"),
                 "airport_TFS":("ES709","Tenerife"),
                 "airport_TFN":("ES709","Tenerife"),
                 "airport_SPC":("ES707","La Palma"),
                 "airport_VDE":("ES703","El Hierro"),
                 "airport_PMI":("ES532","Mallorca"),
                 "airport_IBZ":("ES531","Eivissa i Formentera"),
                 "airport_MAH":("ES533","Menorca")
                 }

In [338]:
def format_trips(trips: pd.DataFrame, airports_to_NUTS: dict):
    # Filter trips (only those without "abroad" as origin or destination)
    trips = trips[~((trips["origin"] == "abroad") | (trips["destination"] == "abroad"))].copy()

    # Create the 'mode_tp' column, renaming modes to match the offer data terminology
    trips.loc[:, "mode_tp"] = (
        trips["mode_sequence"]
        .str.replace("bus", "road")  # treat bus as road (some people reach the infrastructure by bus)
        .str.replace("plane", "air")  # use nomenclature of the offer data
        .str.replace("train", "rail")
    )

    # Remove "road" from the 'mode_tp' column
    trips.loc[:, "mode_tp"] = trips["mode_tp"].apply(
        lambda row: [mode for mode in row.split("-") if mode != "road"]
    )  # remove "road" (it will be considered like access time)

    # Only consider trips that do not contain "ship"
    trips = trips[~trips["mode_tp"].apply(lambda x: "ship" in x)]

    # Change aggregated island NUTS to dis-aggregated NUTS
    for key in airports_to_NUTS.keys():
        trips.loc[trips["start_node"] == key, ["origin", "origin_name"]] = [
            airports_to_NUTS[key][0],
            airports_to_NUTS[key][1]
        ]  # change start node
        trips.loc[trips["end_node"] == key, ["destination", "destination_name"]] = [
            airports_to_NUTS[key][0],
            airports_to_NUTS[key][1]
        ]  # change destination node

    return trips

### 1.1 Preformat trips
Trips now use the new (disaggregated) NUTS codes and have an extra column, `mode_tp`, that lists the trains and planes taken, in order.
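
As a minimal sketch of the `mode_tp` rule (not the pandas implementation in `format_trips`), the helper below renames each leg and drops the road legs in plain Python. The name `to_mode_tp` and the token-by-token mapping are assumptions introduced only for illustration; `format_trips` does the renaming with substring replacements on the whole column.

```python
# Hypothetical helper (not part of the notebook): mirrors the renaming rule of format_trips
RENAME = {"bus": "road", "plane": "air", "train": "rail"}

def to_mode_tp(mode_sequence: str) -> list:
    """Turn a MobA mode sequence such as 'road-train-road' into the list of non-road legs."""
    legs = [RENAME.get(leg, leg) for leg in mode_sequence.split("-")]
    # Road (and bus) legs are treated as access/egress time, so they are dropped
    return [leg for leg in legs if leg != "road"]

print(to_mode_tp("road-train-road"))        # ['rail']
print(to_mode_tp("road-plane-train-road"))  # ['air', 'rail']
```

A trip that only uses road therefore ends up with an empty list, which is one way to spot trips that have no rail or air leg at all.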

In [339]:
trips=format_trips(trips,airports_to_NUTS)

In [340]:
trips

Unnamed: 0,date,trip_period,origin_zone,origin,origin_name,destination_zone,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,distance,route_distance,duration,mode,service,legs,trip_vehicle_type,nationality,home_census,home_zone,overnight_census,income,age,sex,vehicle_type,short_professional_driver,trips,trips_km,sample_trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp
0,20220923,P00,01002,ES211,Álava,4802006,ES213,Vizcaya,,,NF,O,D04_[10000-50000),D04_[10000-50000),01-02,train,conv_unknown,P00*01002*01002*None*train_13121*00-01*road*No...,other,ES,2_48,4801303,2_48,I01_[10000-15000),A01_[25-45),male,passenger,False,4.135,139.516,1.0,0.000000,0.000000,2.067500,2.067500,0.000000,0.000000,3,road-train-road,train_13121-train_13200,train_13121,train_13200,national,2,1,0,train_13121-train_13200,train_13106,train_13200,[rail]
1,20220923,P00,01009_AM,ES211,Álava,31010_AM,ES220,Navarra,,,O,H,D04_[10000-50000),D04_[10000-50000),00-01,train,conv_unknown,P00*01009_AM*01009_AM*None*train_11213*00-01*r...,other,ES,2_31,31010_AM,2_31,I01_[10000-15000),A01_[25-45),male,passenger,False,5.191,69.312,1.0,3.707857,0.000000,0.000000,0.494381,0.000000,0.988762,3,road-train-road,train_11213-train_11300,train_11213,train_11300,national,2,1,0,train_11213-train_11300,train_13106,train_11300,[rail]
3,20220923,P00,01036,ES211,Álava,4802006,ES213,Vizcaya,,,H,O,D04_[10000-50000),D04_[10000-50000),00-01,train,conv_unknown,P00*01036*01036*None*train_13106*00-01*road*No...,other,ES,2_01,01036,2_48,I02_[15000-inf),A02_[45-65),female,passenger,False,6.236,139.477,1.0,3.118000,0.000000,0.779500,1.559000,0.779500,0.000000,3,road-train-road,train_13106-train_13200,train_13106,train_13200,national,2,1,0,train_13106-train_13200,train_13106,train_13200,[rail]
4,20220923,P00,0105902,ES211,Álava,09219,ES412,Burgos,,,O,NF,D04_[10000-50000),D04_[10000-50000),01-02,train,conv_unknown,P00*0105902*0105901*None*train_11208*00-01*roa...,other,ES,2_01,0105904,2_01,I02_[15000-inf),A02_[45-65),male,passenger,False,4.215,155.813,1.0,3.417568,0.113919,0.227838,0.227838,0.227838,0.000000,3,road-train-road,train_11208-train_11200,train_11208,train_11200,national,2,1,0,train_11208-train_11200,train_11208,train_11200,[rail]
5,20220923,P00,0105902,ES211,Álava,09219,ES412,Burgos,,,W,H,D04_[10000-50000),D04_[10000-50000),00-01,train,conv_unknown,P00*0105902*0105901*None*train_11208*00-01*roa...,other,ES,2_09,09219,2_09,I01_[10000-15000),A02_[45-65),male,passenger,False,3.574,133.590,1.0,2.757086,0.000000,0.204229,0.102114,0.000000,0.510571,3,road-train-road,train_11208-train_11200,train_11208,train_11200,national,2,1,0,train_11208-train_11200,train_11208,train_11200,[rail]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227596,20220923,P23,5029709,ES243,Zaragoza,44056_AM,ES242,Teruel,,,NF,NF,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P23*5029709*5029703*None*train_70806*00-01*roa...,other,ES,2_22,22048,2_22,I01_[10000-15000),A02_[45-65),female,passenger,False,3.468,630.518,1.0,2.813660,0.007270,0.029082,0.152679,0.247195,0.218113,3,road-train-road,train_70806-train_67103,train_70806,train_67103,national,2,1,0,train_70806-train_67103,train_4040,train_71201,[rail]
227597,20220923,P23,5029709,ES243,Zaragoza,44056_AM,ES242,Teruel,,,NF,NF,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P23*5029709*5029703*None*train_70806*00-01*roa...,other,ES,2_22,22048,2_22,I01_[10000-15000),A03_[65-100),female,passenger,False,4.735,839.483,1.0,3.809162,0.026453,0.052905,0.238073,0.317430,0.290978,3,road-train-road,train_70806-train_67103,train_70806,train_67103,national,2,1,0,train_70806-train_67103,train_70600,train_67105,[rail]
227600,20220923,P23,5029712,ES243,Zaragoza,26059_AM,ES230,La Rioja,,,NF,H,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P23*5029712*5029707*None*train_04040*00-01*roa...,other,ES,2_26,26059_AM,2_26,I01_[10000-15000),A01_[25-45),male,passenger,False,4.190,767.304,1.0,3.412670,0.056878,0.000000,0.170633,0.189593,0.360226,3,road-train-road,train_04040-train_81100,train_04040,train_81100,national,2,1,0,train_04040-train_81100,train_4040,train_81108,[rail]
227601,20220923,P23,5029712,ES243,Zaragoza,26059_AM,ES230,La Rioja,,,NF,H,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P23*5029712*5029707*None*train_04040*00-01*roa...,other,ES,2_26,26059_AM,2_26,I02_[15000-inf),A03_[65-100),male,passenger,False,2.333,428.481,1.0,2.008972,0.064806,0.000000,0.064806,0.129611,0.064806,3,road-train-road,train_04040-train_81100,train_04040,train_81100,national,2,1,0,train_04040-train_81100,train_4040,train_81110,[rail]


In [341]:
trips["mode_tp"].value_counts()

mode_tp
[rail]                102692
[air]                  25613
[rail, rail]            2284
[air, air]               711
[air, rail]              188
[rail, air]              114
[rail, rail, rail]        19
[air, air, air]            8
[rail, air, rail]          2
[air, air, rail]           1
[air, rail, rail]          1
Name: count, dtype: int64

To assign the correct path to each trip we need a more precise handle on the trip duration. The following function adds two columns holding the minimum and the maximum duration of the trip, in minutes.

In [342]:
# Function to parse the 'duration' band (e.g. "01-02", in hours) and compute min and max durations
def parse_duration(row):
    # The band has no brackets to strip, so just split the string on '-'
    duration_str = row['duration']
    min_val, max_val = duration_str.split('-')

    # Convert the values to minutes, handling 'inf'
    if min_val == 'inf':
        min_duration = float('inf')
    else:
        min_duration = int(min_val) * 60  # Convert to minutes

    if max_val == 'inf':
        max_duration = float('inf')
    else:
        max_duration = int(max_val) * 60  # Convert to minutes

    return pd.Series([min_duration, max_duration])


In [343]:
# create two new columns named duration_min and duration_max
trips[['duration_min', 'duration_max']] = trips.apply(parse_duration, axis=1)
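
The duration parsing can be checked on single values without a DataFrame. The sketch below is a plain-Python equivalent under the same assumptions as `parse_duration`: bands are written as `"HH-HH"` in hours, and an open-ended band uses `inf`. The function name `duration_bounds` and the band `"05-inf"` are made up for the example.

```python
def duration_bounds(band: str) -> tuple:
    """Convert an hour band such as '01-02' into (min, max) minutes."""
    def to_minutes(value: str) -> float:
        # An open-ended band is assumed to be written with 'inf'
        return float("inf") if value == "inf" else int(value) * 60

    low, high = band.split("-")
    return to_minutes(low), to_minutes(high)

print(duration_bounds("01-02"))   # (60, 120)
print(duration_bounds("05-inf"))  # (300, inf)
```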

## Part 2: Paths

**This part could be skipped now since I did not use paths in the end**

There are two types of paths: those limited to one connection and those limited to two connections. Two connections might be more realistic, but in practice only a few people make this kind of trip.

In [344]:
potential_paths_1_connection = pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline\paths_itineraries\potential_paths_0.csv")
potential_paths_1_connection.head()

Unnamed: 0,origin,destination,option,nservices,path,total_travel_time,total_cost,total_emissions,total_waiting_time,nmodes,journey_type,access_time,egress_time,origin_0,destination_0,provider_0,alliance_0,mode_0,travel_time_0,cost_0,emissions_0,origin_1,destination_1,provider_1,alliance_1,mode_1,travel_time_1,mct_time_0_1,cost_1,emissions_1
0,ES111,ES112,0,1,"['007131412', '007131400']",170.0,5.07,1.59,,1,rail,43.0,101.0,007131412,007131400,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,26.0,5.07,1.59,,,,,,,,,
1,ES111,ES112,1,1,"['007131412', '007120300']",285.0,10.89,3.41,,1,rail,43.0,74.0,007131412,007120300,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,168.0,10.89,3.41,,,,,,,,,
2,ES111,ES112,2,0,['007131400'],162.0,,,,0,none,61.0,101.0,007131400,007131400,,,rail,,,,,,,,,,,,
3,ES111,ES112,3,2,"['007131412', '007122100', '007120300']",231.0,14.45,4.52,,1,rail,43.0,74.0,007131412,007122100,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,64.0,11.09,3.47,7122100.0,7120300.0,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,40.0,10.0,3.36,1.05
4,ES111,ES112,4,0,['LEST'],265.0,,,,0,none,144.0,121.0,LEST,LEST,,,air,,,,,,,,,,,,


In [345]:
potential_paths_2_connections = pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline_2_connections\paths_itineraries\potential_paths_0.csv")
potential_paths_2_connections.head()

Unnamed: 0,origin,destination,option,nservices,path,total_travel_time,total_cost,total_emissions,total_waiting_time,nmodes,journey_type,access_time,egress_time,origin_0,destination_0,provider_0,alliance_0,mode_0,travel_time_0,cost_0,emissions_0,origin_1,destination_1,provider_1,alliance_1,mode_1,travel_time_1,mct_time_0_1,cost_1,emissions_1,origin_2,destination_2,provider_2,alliance_2,mode_2,travel_time_2,mct_time_1_2,cost_2,emissions_2
0,ES111,ES112,0,1,"['007131412', '007131400']",170.0,5.07,1.59,,1,rail,43.0,101.0,007131412,007131400,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,26.0,5.07,1.59,,,,,,,,,,,,,,,,,,
1,ES111,ES112,1,1,"['007131412', '007120300']",285.0,10.89,3.41,,1,rail,43.0,74.0,007131412,007120300,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,168.0,10.89,3.41,,,,,,,,,,,,,,,,,,
2,ES111,ES112,2,0,['007131400'],162.0,,,,0,none,61.0,101.0,007131400,007131400,,,rail,,,,,,,,,,,,,,,,,,,,,
3,ES111,ES112,3,2,"['007131412', '007122100', '007120300']",231.0,14.45,4.52,,1,rail,43.0,74.0,007131412,007122100,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,64.0,11.09,3.47,7122100.0,7120300.0,"RENFE VIAJEROS, S.A","RENFE VIAJEROS, S.A",rail,40.0,10.0,3.36,1.05,,,,,,,,,
4,ES111,ES112,4,0,['LEST'],265.0,,,,0,none,144.0,121.0,LEST,LEST,,,air,,,,,,,,,,,,,,,,,,,,,


### 2.1 Preformat paths
Main tasks: remove the paths that have 0 modes (access or egress only) and add the corresponding `mode_tp` column.

In [346]:
def format_paths(paths: pd.DataFrame):
    # Filter and copy so that the assignment below does not act on a view
    paths = paths[paths["nmodes"] != 0].copy()  # removes paths that are only access and egress
    
    # Select the columns to iterate through
    mode_columns = [col for col in paths.columns if col.startswith("mode_")]
    
    # Safely assign the new 'mode_tp' column using .loc
    paths.loc[:, "mode_tp"] = paths.apply(
        lambda row: [row[col] for col in mode_columns if str(row[col]) != "nan"],
        axis=1
    )
    return paths

In [347]:
potential_paths_1_connection=format_paths(potential_paths_1_connection)
potential_paths_2_connections=format_paths(potential_paths_2_connections)

In [348]:
potential_paths_2_connections["mode_tp"].value_counts()

mode_tp
[rail, rail, rail]    85958
[rail, rail, air]     28608
[air, rail, rail]     23781
[air, air, air]       20351
[rail, rail]          18264
[air, air]            16343
[rail, air]           11034
[rail, air, air]       9061
[air, rail, air]       8602
[air, rail]            6542
[air, air, rail]       4937
[air]                  4671
[rail]                 2392
[rail, air, rail]      1549
Name: count, dtype: int64

## Part 3: Remove un-localisable trips

The trips start, end and go through many more stations (and perhaps airports) than the stations considered in MultiModX. Hence we have to assign a MMX station to all the stations that appear in the trips. But to do so we have to change the format of the stations first. 

In [349]:
# Location of "all" train stops as given by UIC
# However, this list is still incomplete
stops_loc=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\infrastructure\rail_info\stops.txt").astype(str) # everything is a string here to match other formatting
stops_loc["stop_id"] = stops_loc["stop_id"].apply(lambda x: "00" + x)  # prefix every id with "00"

In [350]:
# Returns the stations of a node sequence that are non-numeric or absent from stops_loc ("weird" stations)
def find_weird_stations(node_sequence,stops_loc):
    weird_stations=[]
    nodes = node_sequence.split("-")
    for node in nodes:
        if node.startswith("train_"):
            station_id=node.split("_")[1]
            if not station_id.isdigit():
                weird_stations.append(station_id)
            else:
                station_id_modified_1 = f"0071{int(station_id):05d}"
                station_id_modified_2 = f"0087{int(station_id):05d}"
                station_id_modified_3 = f"0094{int(station_id):05d}"
                if any(station_id in stops_loc["stop_id"].values for station_id in [station_id_modified_1, station_id_modified_2, station_id_modified_3]):
                    pass
                else:
                    weird_stations.append(station_id)
    return weird_stations
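
The identifier conversion inside `find_weird_stations` can be isolated as follows. This is a sketch only: `candidate_stop_ids` is a hypothetical helper, and the reading of 71, 87 and 94 as the UIC country codes for Spain, France and Portugal is my interpretation of the three prefixes, not something stated in the notebook.

```python
# Prefixes tried by find_weird_stations: "00" + UIC country code
# (71, 87, 94: assumed here to stand for Spain, France and Portugal)
PREFIXES = ("0071", "0087", "0094")

def candidate_stop_ids(node: str) -> list:
    """Return the stop_id candidates for a MobA node such as 'train_4040'."""
    station_id = node.split("_")[1]
    if not station_id.isdigit():
        return []  # non-numeric identifiers cannot be converted
    # Zero-pad the station number to five digits, as in the notebook
    return [f"{prefix}{int(station_id):05d}" for prefix in PREFIXES]

print(candidate_stop_ids("train_4040"))   # ['007104040', '008704040', '009404040']
print(candidate_stop_ids("train_04040"))  # same candidates: the padding is normalised
```

Because of the zero-padding, `train_4040` and `train_04040` resolve to the same stop, which matters here since both spellings appear in the trips table.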

In [351]:
trips["weird_stations"] = trips["node_sequence_reduced"].apply(
    lambda x: find_weird_stations(x, stops_loc))

In [352]:
# Assuming the "weird_stations" column contains lists
# Flatten the lists into one combined list
all_weird_stations = trips["weird_stations"].explode().dropna()

# Extract unique values
unique_weird_stations = all_weird_stations.unique()

# Convert back to a list if needed
unique_weird_stations = list(unique_weird_stations)
# Convert this list into MobA format
unique_weird_stations_modified=[ "train_"+ station for station in unique_weird_stations] 

In [353]:
len(unique_weird_stations)

385

In [354]:
# This is the total number of trips that contain at least one weird station
weird_trips_num= trips[trips["weird_stations"].apply(lambda x: len(x) > 0)]["trips"].sum()
weird_trips_num

np.float64(41639.716)

In [355]:
# This is the number of trips that do not contain a weird station
normal_trips_num= trips[trips["weird_stations"].apply(lambda x: len(x) == 0)]["trips"].sum()
normal_trips_num

np.float64(479942.63099999994)

In [356]:
percent_normal_trips=normal_trips_num/trips["trips"].sum()*100
percent_normal_trips

np.float64(92.01665542564842)

Only about 8% of all trips contain one of these non-localisable stations, but that is still a non-negligible part of the total.
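
The shares are weighted by the `trips` column (the expanded number of trips), not by the number of rows. As a quick arithmetic check, the sketch below recomputes them from the two totals printed above, assuming the two groups partition the data set (every trip has either zero or at least one weird station).

```python
# Expanded trip totals printed by the cells above
weird_trips = 41639.716
normal_trips = 479942.631

total = weird_trips + normal_trips
share_normal = normal_trips / total * 100
share_weird = weird_trips / total * 100

print(f"{share_normal:.2f}% of trips have no weird station")
print(f"{share_weird:.2f}% of trips have at least one weird station")
```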

In [357]:
# read files with information about ALL stations considered by MobA
MobA_stations_coord=gpd.read_file(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\oferta_transporte\train_stations\train_stations.shp")

In [358]:
# identifies all the un-localisable stations
nowhere_stations=set(unique_weird_stations_modified)-set(MobA_stations_coord["ID"])
print(f"there are {len(nowhere_stations)} stations that are not in the data provided by MobA but appear in the trips dataframe")

there are 197 stations that are not in the data provided by MobA but appear in the trips dataframe


In [359]:
weird_trips_num= trips[trips["node_sequence_reduced"].apply(lambda x: any(station in x for station in nowhere_stations))]["trips"].sum()
print(f"there are in total {weird_trips_num} unlocalisable trips, i.e., {weird_trips_num/trips['trips'].sum()*100:.2f}% of the total")

there are in total 8681.27 unlocalisable trips, i.e., 1.66% of the total


We remove these un-localisable trips from the data set. Most of the removed trips depart from Vizcaya (ES213) or arrive in Burgos (ES412); keep this in mind when calibrating the model.

In [360]:
trips = trips[~trips["node_sequence_reduced"].apply(lambda x: any(station in x for station in nowhere_stations))]

## Part 4: Construct an equivalence between stations in MobA and stations in MMX 

In [361]:
# coordinates, geometry and other properties of all NUTS (in Europe?)
NUTS_coord=gpd.read_file(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP4 Performance Assessment Solution\Demand data\nuts3_2003_geom_10.gpkg")

### 4.1 Build the station_to_NUTS dictionary
Contains all stations from MobA and the NUTS region in which each one lies.

In [362]:
# align coordinate reference system (crs)
NUTS_coord = NUTS_coord.to_crs(MobA_stations_coord.crs)

# Perform a spatial join to find which NUTS region each station belongs to
spatial_join = gpd.sjoin(MobA_stations_coord, NUTS_coord, how="left", predicate="within")

# Construct the dictionary
station_to_NUTS = dict(zip(spatial_join["ID"], spatial_join["geocode"]))

In [363]:
# list of stations considered in MMX
train_stations_considered=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\infrastructure\rail_info\rail_stations_considered_GTFS_2022.csv").astype(str)
train_stations_considered["stop_id"] = train_stations_considered["stop_id"].apply(lambda x: "00" + x)  # prefix every id with "00"

In [364]:
# Add the MobA-style id and the NUTS region to the MMX stations
def format_train_stations_considered(train_stations_considered: pd.DataFrame,station_to_nuts: Dict):
    train_stations_considered["modified_id"]=train_stations_considered["stop_id"].apply(lambda x : "train_" + str(x)[4:])
    train_stations_considered['NUTS'] = train_stations_considered['modified_id'].map(station_to_nuts)
    return train_stations_considered

In [365]:
train_stations_considered=format_train_stations_considered(train_stations_considered,station_to_NUTS)

In [366]:
train_stations_considered

Unnamed: 0,stop_id,modified_id,NUTS
0,007102002,train_02002,ES613
1,007102003,train_02003,ES617
2,007102030,train_02030,ES617
3,007103100,train_03100,ES616
4,007103208,train_03208,ES423
...,...,...,...
83,007181110,train_81110,ES230
84,007181200,train_81200,ES230
85,007181202,train_81202,ES220
86,007182100,train_82100,ES417


### 4.2 Build the NUTS_to_MMX_train_station dictionary
Maps each NUTS region to the list of MMX stations located in it.

In [367]:
# build second dictionary
NUTS_to_MMX_train_station={}
for nuts, station in zip(train_stations_considered["NUTS"], train_stations_considered["stop_id"]):
    if nuts in NUTS_to_MMX_train_station:
        NUTS_to_MMX_train_station[nuts].append(station)
    else:
        NUTS_to_MMX_train_station[nuts]=[station]

### 4.3 Join the two dictionaries
This yields an equivalence between MobA stations and MMX stations: each MobA station maps to the MMX stations in the same NUTS region.

In [368]:
station_to_station_MMX = {}

for station, nuts_code in station_to_NUTS.items():
    # Look up the MMX stations of this NUTS region in NUTS_to_MMX_train_station
    if nuts_code in NUTS_to_MMX_train_station:
        station_to_station_MMX[station] = NUTS_to_MMX_train_station[nuts_code]

## Part 5: To how many trips can I assign a path?

Info: I checked how many trips have a reduced node sequence that corresponds to a path in the potential_paths files. The goal is to assess how many trips are directly usable to calibrate the logit model.

In the future I will look into how to maximise this number.

In [369]:
# construct a dictionary with the equivalence between IATA and ICAO codes
airports=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\infrastructure\airports_info\IATA_ICAO_Airport_codes_v1.3.csv")
airports["ICAO"] = airports["ICAO"].fillna(airports["IATA"])
IATA_to_ICAO = airports.set_index("IATA")["ICAO"].to_dict()

In [370]:
# list of all MMX stations
train_station_MMX=train_stations_considered["stop_id"].tolist()

In [371]:
# Build a node sequence made of MMX nodes
def process_node_sequence_MMX(trips: pd.DataFrame, train_stations_MMX: list, IATA_to_ICAO: dict):
    # Finds the node sequence in MMX (first attempt)

    def process_row(row):
        # checks whether the corresponding column is a string
        node_sequence = row['node_sequence_reduced']
        if not isinstance(node_sequence, str):
            return np.nan

        result = []
        for split in node_sequence.split("-"):
            if split.startswith("train"):
                split=split.replace("train_", "")
                if not split.isdigit():  # int() below would fail on any non-numeric id
                    return np.nan
                elif f"0071{int(split):05d}" in train_stations_MMX:
                    result.append(f"0071{int(split):05d}")
                else:
                    return np.nan
            elif split.startswith("airport_"):
                iata_code = split.replace("airport_", "")
                icao_code = IATA_to_ICAO.get(iata_code)
                if icao_code:
                    result.append(icao_code)
                else:
                    return np.nan
        return result

    trips = trips.copy()
    # Apply the processing function to each row in the dataframe
    trips.loc[:,'node_sequence_MMX'] = trips.apply(process_row, axis=1)

    return trips


In [372]:
trips=process_node_sequence_MMX(trips,train_station_MMX,IATA_to_ICAO)
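
To make the translation rule concrete, here is a plain-Python sketch of what happens to a single node sequence. It is not the notebook's function: `to_mmx_nodes` is a hypothetical name, it returns `None` instead of `np.nan`, and the station set is a toy subset assumed for the example.

```python
def to_mmx_nodes(node_sequence, mmx_stations, iata_to_icao):
    """Translate 'train_13121-airport_MAD' into MMX identifiers, or None if any node is unknown."""
    result = []
    for node in node_sequence.split("-"):
        if node.startswith("train_"):
            station = node[len("train_"):]
            if not station.isdigit():
                return None
            stop_id = f"0071{int(station):05d}"
            if stop_id not in mmx_stations:
                return None  # station exists in MobA but is not modelled in MMX
            result.append(stop_id)
        elif node.startswith("airport_"):
            icao = iata_to_icao.get(node[len("airport_"):])
            if not icao:
                return None
            result.append(icao)
    return result

mmx_stations = {"007113121", "007113200"}  # toy subset, assumed for the example
iata_to_icao = {"MAD": "LEMD"}             # Madrid-Barajas

print(to_mmx_nodes("train_13121-train_13200", mmx_stations, iata_to_icao))  # ['007113121', '007113200']
print(to_mmx_nodes("train_13121-airport_MAD", mmx_stations, iata_to_icao))  # ['007113121', 'LEMD']
print(to_mmx_nodes("train_11213-train_13200", mmx_stations, iata_to_icao))  # None
```

A single unknown node discards the whole trip, which is why the share of trips without an MMX node sequence can be large even when most individual stations are known.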

In [373]:
trips_with_no_path = trips[trips["node_sequence_MMX"].isna()]["trips"].sum()
print(f"There are {trips_with_no_path:.2f} trips with no path, i.e, {trips_with_no_path/trips['trips'].sum()*100:.2f}% of the total")

There are 228810.92 trips with no path, i.e, 44.61% of the total


In [374]:
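trips = trips.copy()
# Build the set of known paths once; a set lookup avoids rebuilding the full list for every row
known_paths = set(potential_paths_2_connections['path'])
trips.loc[:,'is_in_paths'] = trips['node_sequence_MMX'].apply(lambda x: 1 if str(x) in known_paths else 0)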

In [375]:
total_trips=trips["trips"].sum()

In [376]:
total_trips

np.float64(512901.077)

In [377]:
(trips['trips'] * trips['is_in_paths']).sum()

np.float64(261689.19099999996)

In [378]:
print(f"There are {(trips['trips'] * trips['is_in_paths']).sum():.2f} trips that already have a reduced node sequence in paths, i.e. {(trips['trips'] * trips['is_in_paths']).sum()/trips['trips'].sum()*100:.2f}% of the total numbers of trips")

There are 261689.19 trips that already have a reduced node sequence in paths, i.e. 51.02% of the total numbers of trips


In conclusion, approximately 55% of the trips have a node sequence composed only of stations considered in MMX, and 51% of the trips have a node sequence that appears among the MMX potential paths.
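
The `is_in_paths` test compares the string form of each trip's node list with the `path` column, which `read_csv` loads as text. The sketch below shows why that works and where it is fragile. The `compact` value is a made-up variant used only to illustrate the fragility, and parsing with `literal_eval` is an alternative suggested here, not what the notebook does.

```python
from ast import literal_eval

csv_path = "['007131412', '007131400']"      # value as stored in the 'path' column
node_sequence = ["007131412", "007131400"]   # list as built for a trip

# Option 1 (what the notebook does): compare string representations
print(str(node_sequence) == csv_path)           # True

# Option 2: parse the CSV string back into a list and compare the lists
print(literal_eval(csv_path) == node_sequence)  # True

# The string comparison depends on the exact formatting; the parsed comparison does not
compact = "['007131412','007131400']"  # same path without the space after the comma
print(str(node_sequence) == compact)            # False
print(literal_eval(compact) == node_sequence)   # True
```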

## Part 6: Assign paths to trips

### 6.1 Read and format the necessary files

I will use the `*_2_connections` files for now, but this can be changed.

**THERE WAS A PROBLEM WITH THIS ASSIGNATION SO I WILL RE-DO IT**

In [379]:
possible_itineraries_clustered_pareto_1_connection=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline\paths_itineraries\possible_itineraries_clustered_pareto_0.csv")
possible_itineraries_clustered_pareto_2_connections=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline_2_connections\paths_itineraries\possible_itineraries_clustered_pareto_0.csv")

In [380]:
possible_itineraries_1_connection=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline\paths_itineraries\possible_itineraries_0.csv")
possible_itineraries_2_connections=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline_2_connections\paths_itineraries\possible_itineraries_0.csv")

There are a few itineraries that should be removed. In particular, those that are only access/egress.

In [381]:
def format_itineraries(itineraries: pd.DataFrame):
    # Filter and copy so that the assignment below does not act on a view
    itineraries = itineraries[itineraries["nservices"] != 0].copy()  # removes itineraries that are only access and egress
    
    # Select the columns to iterate through
    mode_columns = [col for col in itineraries.columns if col.startswith("mode_")]
    
    # Safely assign the new 'mode_tp' column using .loc
    if mode_columns:
        itineraries.loc[:, "mode_tp"] = itineraries.apply(
            lambda row: [row[col] for col in mode_columns if str(row[col]) != "nan"],
            axis=1
        )
    return itineraries

In [382]:
possible_itineraries_clustered_pareto_2_connections=format_itineraries(possible_itineraries_clustered_pareto_2_connections)
possible_itineraries_clustered_pareto_1_connection=format_itineraries(possible_itineraries_clustered_pareto_1_connection)
possible_itineraries_2_connections=format_itineraries(possible_itineraries_2_connections)
possible_itineraries_1_connection=format_itineraries(possible_itineraries_1_connection)

In [383]:
common_columns = set(possible_itineraries_2_connections.columns).intersection(set(possible_itineraries_clustered_pareto_2_connections.columns))

We will merge the itineraries and the itineraries_clustered_pareto files to obtain the cost of each leg of the trip, which is needed to calibrate the Logit Model. To do so, I first remove the columns that appear in both files.
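As a minimal sketch of this merge, assuming two toy tables with invented values (the names `clustered`, `detailed` and `keys` are illustrative and not part of the project files), the repeated columns can be dropped from one side before a left merge so that no `_x`/`_y` suffixes appear:

```python
import pandas as pd

# Toy stand-ins for the clustered and the detailed itinerary tables (invented values).
clustered = pd.DataFrame({
    "origin": ["A", "A"],
    "destination": ["B", "B"],
    "option": [0, 1],
    "total_cost": [10.0, 12.0],
})
detailed = pd.DataFrame({
    "origin": ["A", "A"],
    "destination": ["B", "B"],
    "option": [0, 1],
    "total_cost": [10.0, 12.0],  # repeated column, also present in `clustered`
    "cost_0": [10.0, 7.0],
    "cost_1": [None, 5.0],
})

keys = ["origin", "destination", "option"]
# Columns present in both tables, other than the merge keys.
repeated = [c for c in detailed.columns if c in clustered.columns and c not in keys]
merged = clustered.merge(detailed.drop(columns=repeated), on=keys, how="left")

# With unique keys on the right-hand side, a left merge keeps the left row count.
print(merged.shape, clustered.shape)
```

Comparing the row counts before and after is a quick way to confirm that the merge did not duplicate or lose rows.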

In [384]:
# # Rename `cluster_id` to `option` in `itineraries_clustered` for merging
# possible_itineraries_clustered_pareto_2_connections = possible_itineraries_clustered_pareto_2_connections.rename(columns={'cluster_id': 'option'})

# #drop repeated columns
# possible_itineraries_2_connections=possible_itineraries_2_connections.drop(columns=["total_cost","total_emissions","total_travel_time","total_waiting_time","journey_type","provider_0","provider_1","provider_2","aliance_0","aliance_1","aliance_2","departure_time_0","departure_time_1","departure_time_2","arrival_time_0","arrival_time_1","arrival_time_2"],errors="ignore")

# # Merge the two DataFrames on "origin", "destination", and "option"
# merged = pd.merge(possible_itineraries_clustered_pareto_2_connections, possible_itineraries_2_connections, on=['origin', 'destination', "nservices", 'option'], how='left')

# # Rename the columns back if needed
# possible_itineraries_clustered_pareto_2_connections = merged.rename(columns={'option': 'cluster_id'})

The shape is the same as before, so the merge went smoothly.

In [385]:
# possible_itineraries_clustered_pareto_2_connections["mode_tp"].value_counts()

In [386]:
trips["node_sequence_MMX"]=trips["node_sequence_MMX"].astype(str)
trips=trips.rename(columns={"node_sequence_MMX":"path"})

In [387]:
# We consider only the trips that have an MMX path (this assumption may need revisiting)
trips_final=trips[trips["is_in_paths"]==1].copy() # copy to avoid assigning to a view of trips
trips_final["mode_tp"]=trips_final["mode_tp"].astype(str)
trips_final_grouped=trips_final.groupby(["origin","destination","path","duration_min","duration_max"], as_index = False)[["trips","archetype_0","archetype_1","archetype_2","archetype_3","archetype_4","archetype_5"]].sum()

In [388]:
total_trips_final=trips_final["trips"].sum()
print(f"we have {total_trips_final/total_trips*100:.2f}% trips that are assignable")

we have 51.02% trips that are assignable


In [389]:
# possible_itineraries_clustered_pareto_2_connections.loc[:,"mode_tp"]=possible_itineraries_clustered_pareto_2_connections["mode_tp"].astype(str)

Next, we separate distinct itineraries that share the same path. This can happen when several services follow exactly the same path but differ in price depending on the schedule, or when they call at the same stations but their time, price or CO2 emissions differ enough that they are not grouped together.

In [390]:
# # count how many itineraries have the same origin, destination, mode_tp and path
# grouped_paths_count = possible_itineraries_clustered_pareto_2_connections.groupby(['path', 'origin', 'destination', 'mode_tp']).size().reset_index(name='count')

In [391]:
# grouped_paths_count.head()

In [392]:
# possible_itineraries_clustered_pareto_2_connections = possible_itineraries_clustered_pareto_2_connections.merge(grouped_paths_count, on=['path', 'origin', 'destination', 'mode_tp'], how='left')

### 6.2 Assign path with costs to trips (best assignation so far)

In [393]:
possible_itineraries_clustered_1_connection=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline\paths_itineraries\possible_itineraries_clustered_0.csv")
possible_itineraries_clustered_2_connections=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline_2_connections\paths_itineraries\possible_itineraries_clustered_0.csv")

In [394]:
possible_itineraries_2_connections["mode_tp"]=possible_itineraries_2_connections["mode_tp"].astype(str)

In [395]:
possible_itineraries_grouped=possible_itineraries_2_connections.groupby(["origin","destination","path","nmodes","mode_tp"],as_index=False)[["access_time","egress_time","travel_time_0","cost_0","emissions_0","mct_time_0_1","travel_time_1","cost_1","emissions_1","mct_time_1_2","travel_time_2","cost_2","emissions_2"]].mean()

In [396]:
possible_itineraries_grouped=possible_itineraries_grouped[possible_itineraries_grouped["nmodes"]!=0]

possible_itineraries_grouped should now hold the costs per path.

In [397]:
# Missing legs (NaN) count as zero in the totals
time_columns=["access_time","egress_time","travel_time_0","travel_time_1","travel_time_2","mct_time_0_1","mct_time_1_2"]
possible_itineraries_grouped["total_travel_time"]=possible_itineraries_grouped[time_columns].fillna(0).sum(axis=1)
possible_itineraries_grouped["total_cost"]=possible_itineraries_grouped[["cost_0","cost_1","cost_2"]].fillna(0).sum(axis=1)
possible_itineraries_grouped["total_emissions"]=possible_itineraries_grouped[["emissions_0","emissions_1","emissions_2"]].fillna(0).sum(axis=1)

In [398]:
trips_final_grouped=trips_final_grouped.merge(possible_itineraries_grouped, on=["path","origin","destination"],how="left")

In [399]:
trips_final_grouped.head()

Unnamed: 0,origin,destination,path,duration_min,duration_max,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,mode_tp,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,emissions_2,total_travel_time,total_cost,total_emissions
0,ES111,ES112,"['007122100', '007120300']",240.0,360.0,6.551,5.193586,0.118036,0.059018,0.472144,0.59018,0.118036,,,,,,,,,,,,,,,,,,
1,ES111,ES112,"['007131400', '007122100']",180.0,240.0,1.975,1.342233,0.03835,0.03835,0.172573,0.095874,0.287621,,,,,,,,,,,,,,,,,,
2,ES111,ES112,"['007131400', '007122100']",240.0,360.0,11.573,9.086613,0.090414,0.316449,0.813727,0.632898,0.632898,,,,,,,,,,,,,,,,,,
3,ES111,ES112,"['007131412', '007120300']",60.0,120.0,3.714,2.921553,0.03819,0.105023,0.315069,0.190951,0.143213,1.0,['rail'],43.0,74.0,179.5,10.89,3.41,,,,,,,,,296.5,10.89,3.41
4,ES111,ES112,"['007131412', '007120300']",180.0,240.0,34.983,27.184959,0.570618,0.614143,3.067649,1.330074,2.215557,1.0,['rail'],43.0,74.0,179.5,10.89,3.41,,,,,,,,,296.5,10.89,3.41


In [400]:
assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]["trips"].sum()
print(f"{assigned/total_trips*100:.2f}% of the original trips and {assigned/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

46.59% of the original trips and 91.32% of the trips with a potential path were assigned.
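
The assigned share can also be cross-checked with the `indicator` option of `DataFrame.merge`, which additionally shows which paths found no itinerary. The following is only a sketch on invented toy data; `trips_toy` and `itineraries_toy` are hypothetical stand-ins for the grouped trips and the grouped itineraries:

```python
import pandas as pd

# Toy stand-ins with invented values.
trips_toy = pd.DataFrame({
    "origin": ["A", "A", "C"],
    "destination": ["B", "B", "D"],
    "path": ["p1", "p2", "p3"],
    "trips": [10.0, 5.0, 5.0],
})
itineraries_toy = pd.DataFrame({
    "origin": ["A", "C"],
    "destination": ["B", "D"],
    "path": ["p1", "p3"],
    "total_cost": [20.0, 35.0],
})

# indicator=True adds a `_merge` column: "both" for matched rows, "left_only" otherwise.
checked = trips_toy.merge(
    itineraries_toy, on=["path", "origin", "destination"], how="left", indicator=True
)
matched = checked["_merge"] == "both"
share_assigned = checked.loc[matched, "trips"].sum() / checked["trips"].sum() * 100
unmatched_paths = checked.loc[~matched, "path"].tolist()
print(f"{share_assigned:.2f}% assigned; unmatched paths: {unmatched_paths}")
```

Listing the unmatched paths makes it easier to see whether the missing assignments concentrate on a few stations.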


Luis provided me with another file, so I will repeat the assignation and see what changes.

### 6.2 Bis re-assignation 

(Uncomment to rerun, but I find the result slightly worse (??))

The only uncommented lines are needed later.

In [401]:
possible_itineraries_avg_2_connections=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\processed_baseline_2_connections\paths_itineraries\possible_paths_avg_from_filtered_it_0.csv")

In [402]:
# possible_itineraries_avg_unique=possible_itineraries_avg_2_connections[(possible_itineraries_avg_2_connections["n_alternative_id"]==1)&(possible_itineraries_avg_2_connections["nmodes"]!=0.0)]

First, I tested what percentage of trips are of the kind "one path, one itinerary".

In [403]:
# trips_final_grouped=trips_final_grouped.merge(possible_itineraries_avg_unique, on=["path","origin","destination"],how="left")

In [404]:
#assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]["trips"].sum()
# print(f"{assigned/total_trips*100:.2f}% of the original trips and {assigned/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

Then I tested how many trips are of the kind "one path, several itineraries". For these trips it is harder to assign a true cost, because the same sequence of stops may be served by both a "slow train" and a "fast train". It is then very hard for me to tell which service the traveller took.
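A rough way to see how ambiguous each path is, sketched here on invented data, is to count the alternatives per path and measure the spread of their travel times. The column names `n_alternatives`, `tt_min`, `tt_max` and `tt_spread` are introduced only for this illustration:

```python
import pandas as pd

# Toy itineraries with invented values: path p1 has a fast and a slow alternative.
itins = pd.DataFrame({
    "origin": ["A", "A", "A"],
    "destination": ["B", "B", "B"],
    "path": ["p1", "p1", "p2"],
    "total_travel_time": [120.0, 200.0, 150.0],
})

keys = ["origin", "destination", "path"]
summary = itins.groupby(keys, as_index=False).agg(
    n_alternatives=("total_travel_time", "size"),
    tt_min=("total_travel_time", "min"),
    tt_max=("total_travel_time", "max"),
)
summary["tt_spread"] = summary["tt_max"] - summary["tt_min"]

# Paths with more than one alternative are the ambiguous ones.
ambiguous = summary[summary["n_alternatives"] > 1]
print(ambiguous)
```

A large spread suggests that averaging the alternatives would misrepresent the cost actually paid, while a small spread suggests the average is a reasonable proxy.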

In [405]:
# possible_itineraries_avg_repeated=possible_itineraries_avg_2_connections[(possible_itineraries_avg_2_connections["n_alternative_id"]>1)&(possible_itineraries_avg_2_connections["nmodes"]!=0.0)]

In [406]:
# trips_final_grouped=trips_final_grouped.merge(possible_itineraries_avg_repeated, on=["path","origin","destination"],how="left")

In [407]:
# assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]["trips"].sum()
# print(f"{assigned/total_trips*100:.2f}% of the original trips and {assigned/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

In [408]:
possible_itineraries_avg=possible_itineraries_avg_2_connections[(possible_itineraries_avg_2_connections["nmodes"]!=0.0)]

In [409]:
# trips_final_grouped=trips_final_grouped.merge(possible_itineraries_avg, on=["path","origin","destination"],how="left")

In [410]:
# assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]["trips"].sum()
# print(f"{assigned/total_trips*100:.2f}% of the original trips and {assigned/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

In [411]:
# possible_itineraries_avg.columns

There is a problem with this assignation: I should obtain the same percentage of assigned trips as when using possible_itineraries_clustered, but I do not, and I lose some information about multimodal trips. This is something I have to look into.


### 6.4 Separate the itineraries in unique and repeated ones (SKIP!)

In [412]:
# possible_itineraries_clustered_pareto_2_connections_unique=possible_itineraries_clustered_pareto_2_connections[possible_itineraries_clustered_pareto_2_connections["count"]==1]

In [413]:
# possible_itineraries_clustered_pareto_2_connections_unique.shape[0]

In [414]:
# possible_itineraries_clustered_pareto_2_connections_repeated=possible_itineraries_clustered_pareto_2_connections[possible_itineraries_clustered_pareto_2_connections["count"]>1]

In [415]:
# possible_itineraries_clustered_pareto_2_connections_repeated.shape[0]

### 6.5 Merge trips with itineraries (SKIP!)

First, we merge the unambiguous itineraries.

In [416]:
# trips_final_grouped=trips_final_grouped.merge(possible_itineraries_clustered_pareto_2_connections_unique, on=["path","origin","destination","mode_tp"],how="left")

In [417]:
# trips_final_grouped.head()

In [418]:
# trips_final_grouped.shape[0]

In [419]:
# total_trips_final

In [420]:
# first_assign=trips_final_grouped[trips_final_grouped["journey_type"].notna()]["trips"].sum()
# print(f"{first_assign/total_trips*100:.2f}% of the original trips and {first_assign/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

Then, among the repeated itineraries (those that share the same origin, destination, path and mode_tp), we choose which one to assign. If the total_travel_time of an itinerary falls within the duration range of the trip, we select that one. Otherwise, we assign one at random.
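The selection rule can be sketched for a single trip as follows. This is an illustrative helper with invented data, not the notebook's own function; `pick_itinerary`, the `service` column and the `seed` argument are assumptions made for the example:

```python
import pandas as pd

def pick_itinerary(trip: pd.Series, candidates: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Pick one candidate itinerary for a trip (hypothetical helper)."""
    # Candidates whose travel time lies inside the trip's duration range (bounds included).
    in_window = candidates[
        candidates["total_travel_time"].between(trip["duration_min"], trip["duration_max"])
    ]
    # Fall back to all candidates when none matches the duration range.
    pool = in_window if len(in_window) > 0 else candidates
    # A fixed seed keeps the random choice reproducible between runs.
    return pool.sample(1, random_state=seed).iloc[0]

# Two alternatives on the same path (invented values).
candidates = pd.DataFrame({
    "service": ["fast", "slow"],
    "total_travel_time": [224.0, 630.0],
    "total_cost": [52.96, 52.71],
})
trip = pd.Series({"duration_min": 180.0, "duration_max": 240.0})
chosen = pick_itinerary(trip, candidates)
print(chosen["service"])
```

Fixing the seed is a design choice for the sketch: without it, reruns of the notebook could assign different itineraries to the same trip.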

In [421]:
# def assign_itinerary(trips: pd.DataFrame,itineraries: pd.DataFrame):
#     # Assuming 'trips' and 'itineraries' are already defined DataFrames

# # Iterate over rows of trips where 'total_travel_time' is NaN
#     for idx, trip_row in trips[trips['total_travel_time'].isna()].iterrows():
#         # Extract values from the current row of 'trips'
#         origin = trip_row['origin']
#         destination = trip_row['destination']
#         path = trip_row['path']
#         mode_tp = trip_row['mode_tp']
        
#         # Find matching rows in 'itineraries' based on 'origin', 'destination', 'path', and 'mode_tp'
#         matching_rows = itineraries[
#             (itineraries['origin'] == origin) & 
#             (itineraries['destination'] == destination) & 
#             (itineraries['path'] == path) & 
#             (itineraries['mode_tp'] == mode_tp)
#         ]
        
#         # Check the number of matching rows
#         if len(matching_rows) == 0:
#             print(f"There was no matching itinerary for row {idx} in trips.")
#         elif len(matching_rows) == 1:
#             # There is exactly one matching row
#             # This line should never run if everything is done correctly
#             print(f"There was only one matching itinerary for row {idx} in trips.")
#             # Assign the values from the matching row directly to the corresponding columns in 'trips'
#             selected_row = matching_rows.iloc[0]
#             for column in selected_row.index:
#                 if column not in ['origin', 'destination', 'path', 'mode_tp', 'total_travel_time']:
#                     trips.at[idx, column] = selected_row[column]
#         elif len(matching_rows) >= 2:
#             # Check if total_travel_time in 'itineraries' is between duration_min and duration_max in 'trips'
#             total_travel_time_matches = matching_rows[
#                 (matching_rows['total_travel_time'] >= trip_row['duration_min']) &
#                 (matching_rows['total_travel_time'] <= trip_row['duration_max'])
#             ]
            
#             # If there are matching total_travel_time values, assign one of the matching rows
#             if len(total_travel_time_matches) > 0:
#                 selected_row = total_travel_time_matches.sample(1).iloc[0]
#                 # Assign values from selected_row to the corresponding columns in 'trips'
#                 for column in selected_row.index:
#                     if column not in ['origin', 'destination', 'path', 'mode_tp', 'total_travel_time']:
#                         trips.at[idx, column] = selected_row[column]
#             else:
#                 # If there is no match, assign values from a random row
#                 selected_row = matching_rows.sample(1).iloc[0]
#                 for column in selected_row.index:
#                     if column not in ['origin', 'destination', 'path', 'mode_tp', 'total_travel_time']:
#                         trips.at[idx, column] = selected_row[column]
#     return trips

In [422]:
# trips_final_grouped=assign_itinerary(trips_final_grouped,possible_itineraries_clustered_pareto_2_connections_repeated)

In [423]:
# second_assign=trips_final_grouped[trips_final_grouped["journey_type"].notna()]["trips"].sum()
# print(f"{second_assign/total_trips*100:.2f}% of the original trips and {second_assign/total_trips_final*100:.2f}% of the trips with a potential path were assigned.")

In [424]:
# # Find the unique paths in each DataFrame
# potential_paths_set = set(potential_paths_2_connections["path"].unique())
# itineraries_set = set(possible_itineraries_clustered_pareto_2_connections["path"].unique())

# # Find the paths in potential_paths but not in itineraries
# paths_in_potential_not_in_itineraries = potential_paths_set - itineraries_set

# # Count the number of such paths
# count_paths_not_in_itineraries = len(paths_in_potential_not_in_itineraries)

# print(f"There are {count_paths_not_in_itineraries} paths in potential_paths not in itineraries.")

In conclusion, as a first iteration I can assign an itinerary to close to 40% of my original trips, but there are paths that appear in MobA and in potential_paths but not in itineraries! Does this call for a revision of how itineraries are calculated?

## Part 7: A bit of analysis of the assignation

In [425]:
percentage_of_assigned_od=pd.DataFrame(columns=["origin","destination","num_trips","num_assigned","percent_assigned"])
percentage_of_assigned_od[["origin","destination"]]=trips_final_grouped[["origin","destination"]].drop_duplicates().reset_index(drop=True)

In [426]:
for idx, row in percentage_of_assigned_od.iterrows():
    percentage_of_assigned_od.loc[idx,"num_trips"]=trips[(trips["origin"]==row["origin"])&(trips["destination"]==row["destination"])]["trips"].sum()
    percentage_of_assigned_od.loc[idx,"num_assigned"]=trips_final_grouped[(trips_final_grouped["origin"]==row["origin"])&(trips_final_grouped["destination"]==row["destination"])&(trips_final_grouped["nmodes"].notna())]["trips"].sum()
percentage_of_assigned_od["percent_assigned"]=percentage_of_assigned_od["num_assigned"]/percentage_of_assigned_od["num_trips"]*100
percentage_of_assigned_od[["num_trips","num_assigned","percent_assigned"]]=percentage_of_assigned_od[["num_trips","num_assigned","percent_assigned"]].apply(pd.to_numeric)
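
The row-by-row loop can become slow with many origin-destination pairs. A possible vectorised alternative, sketched on invented toy data (`all_trips` and `merged_trips` are hypothetical stand-ins for `trips` and `trips_final_grouped`), uses `groupby` sums and merges them back onto the list of pairs:

```python
import pandas as pd

# Toy stand-ins with invented values.
all_trips = pd.DataFrame({
    "origin": ["A", "A", "A", "C"],
    "destination": ["B", "B", "D", "D"],
    "trips": [10.0, 30.0, 5.0, 8.0],
})
merged_trips = pd.DataFrame({
    "origin": ["A", "A", "C"],
    "destination": ["B", "D", "D"],
    "trips": [30.0, 5.0, 8.0],
    "nmodes": [1.0, None, 1.0],  # missing nmodes means no itinerary was assigned
})

od = ["origin", "destination"]
num_trips = all_trips.groupby(od)["trips"].sum().rename("num_trips")
num_assigned = (
    merged_trips[merged_trips["nmodes"].notna()]
    .groupby(od)["trips"].sum().rename("num_assigned")
)

# Keep the same set of pairs as the loop: those present in the merged trips.
od_pairs = merged_trips[od].drop_duplicates()
share = (
    od_pairs.merge(num_trips.reset_index(), on=od, how="left")
    .merge(num_assigned.reset_index(), on=od, how="left")
    .fillna({"num_assigned": 0.0})  # pairs without any assigned trip
)
share["percent_assigned"] = share["num_assigned"] / share["num_trips"] * 100
print(share)
```

The same pattern works per origin by grouping on `"origin"` alone.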

In [427]:
percentage_of_assigned_od

Unnamed: 0,origin,destination,num_trips,num_assigned,percent_assigned
0,ES111,ES112,662.214,58.504,8.834606
1,ES111,ES113,1453.070,1135.259,78.128308
2,ES111,ES114,6193.712,770.474,12.439616
3,ES111,ES130,22.537,22.537,100.000000
4,ES111,ES211,14.076,13.794,97.996590
...,...,...,...,...,...
1985,ES709,ES618,386.431,358.922,92.881265
1986,ES709,ES620,90.372,3.882,4.295578
1987,ES709,ES704,563.746,563.746,100.000000
1988,ES709,ES705,1026.464,1026.464,100.000000


In [428]:
percentage_of_assigned_o=pd.DataFrame(columns=["origin","num_trips","num_assigned","percent_assigned"])
percentage_of_assigned_o[["origin"]]=trips_final_grouped[["origin"]].drop_duplicates().reset_index(drop=True)

In [429]:
for idx, row in percentage_of_assigned_o.iterrows():
    percentage_of_assigned_o.loc[idx,"num_trips"]=trips[(trips["origin"]==row["origin"])]["trips"].sum()
    percentage_of_assigned_o.loc[idx,"num_assigned"]=trips_final_grouped[(trips_final_grouped["origin"]==row["origin"])&(trips_final_grouped["nmodes"].notna())]["trips"].sum()
percentage_of_assigned_o["percent_assigned"]=percentage_of_assigned_o["num_assigned"]/percentage_of_assigned_o["num_trips"]*100
percentage_of_assigned_o[["num_trips","num_assigned","percent_assigned"]]=percentage_of_assigned_o[["num_trips","num_assigned","percent_assigned"]].apply(pd.to_numeric)

In [430]:
percentage_of_assigned_o

Unnamed: 0,origin,num_trips,num_assigned,percent_assigned
0,ES111,13616.777,6058.638,44.493921
1,ES112,1771.322,331.534,18.716755
2,ES113,4156.491,1296.843,31.200428
3,ES114,10697.797,3262.525,30.497167
4,ES120,3423.856,2146.117,62.681287
5,ES130,2997.357,1139.172,38.005883
6,ES211,8453.605,4330.621,51.228097
7,ES212,5981.31,1879.069,31.415676
8,ES213,10500.535,3939.061,37.512955
9,ES220,7385.389,3597.019,48.70453


In [431]:
percentage_of_assigned_o["percent_assigned"].describe()

count     57.000000
mean      49.611364
std       23.765440
min        2.301625
25%       33.570071
50%       46.062403
75%       62.829669
max      100.000000
Name: percent_assigned, dtype: float64

In [432]:
percentage_of_assigned_o.sort_values(by="percent_assigned")

Unnamed: 0,origin,num_trips,num_assigned,percent_assigned
12,ES242,2410.427,55.479,2.301625
27,ES424,16229.361,1011.703,6.233782
11,ES241,2374.004,186.398,7.85163
34,ES514,22863.999,3506.193,15.334995
1,ES112,1771.322,331.534,18.716755
15,ES411,2554.468,503.938,19.727708
32,ES512,15355.262,3033.905,19.75808
10,ES230,3026.691,711.97,23.523049
49,ES620,8004.729,1884.251,23.539223
36,ES522,11897.013,3133.105,26.335224


The NUTS regions with the lowest share of assigned trips are:

| NUTS | Name | % |
|:-----|:-----|:---|
| ES242 | Teruel | 2.3 |
| ES424 | Guadalajara | 6.2 |
| ES241 | Huesca | 7.9 |
| ES514 | Tarragona | 15.3 |
| ES112 | Lugo | 18.7 |

The NUTS regions with the highest share of assigned trips are:

| NUTS | Name | % |
|:-----|:-----|:---|
| ES703 | El Hierro | 100 |
| ES704 | Fuerteventura | 93.4 |
| ES532 | Mallorca | 88.6 |
| ES640 | Melilla | 88.2 |
| ES531 | Eivissa i Formentera | 86.8 |


Comparison with previous assignation

In [433]:
percentage_of_assigned_od[(percentage_of_assigned_od["origin"]=="ES300")&(percentage_of_assigned_od["destination"]=="ES521")]

Unnamed: 0,origin,destination,num_trips,num_assigned,percent_assigned
522,ES300,ES521,2767.701,1967.45,71.086075


In [434]:
percentage_of_assigned_od[(percentage_of_assigned_od["origin"]=="ES300")&(percentage_of_assigned_od["destination"]=="ES523")]

Unnamed: 0,origin,destination,num_trips,num_assigned,percent_assigned
524,ES300,ES523,3553.443,2974.975,83.720915


As a second assignation, it seems slightly better than the previous one.

In [435]:
trips[trips["origin"]=="ES424"]["node_sequence_reduced"].value_counts()

node_sequence_reduced
train_70105-train_70103    402
train_70200-train_70103    392
train_70200-train_70111    183
train_70105-train_70107    157
train_70200-train_18000    147
                          ... 
train_70200-train_17000      1
train_70202-train_18000      1
train_70202-train_70108      1
train_22100-train_08223      1
train_70208-train_70103      1
Name: count, Length: 194, dtype: int64

The first three destination train stations are in Alcalá de Henares and connect to Guadalajara by Cercanías.

In [436]:
NUTS_to_MMX_train_station["ES424"]

['007104007', '007170200']

In [437]:
trips[trips['node_sequence'].str.contains('train_04007', na=False)]

Unnamed: 0,date,trip_period,origin_zone,origin,origin_name,destination_zone,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,distance,route_distance,duration,mode,service,legs,trip_vehicle_type,nationality,home_census,home_zone,overnight_census,income,age,sex,vehicle_type,short_professional_driver,trips,trips_km,sample_trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,duration_min,duration_max,weird_stations,path,is_in_paths
2101,20220923,P01,1707906,ES512,Girona,2816101,ES300,Madrid,,,NF,NF,D05_[50000-inf),D05_[50000-inf),08-10,train,conv_unknown,P01*1707906*0801903*None*train_71801*00-01*roa...,other,ES,2_17,1707902,2_17,I02_[15000-inf),A01_[25-45),male,passenger,False,3.752,2836.558,1.0,2.746309,0.096701,0.290103,0.096701,0.386804,0.135381,3,road-train-road,train_71801-train_04007,train_71801,train_04007,national,2,1,0,train_71801-train_04007,train_71801,train_4007,[rail],480.0,600.0,[],"['007171801', '007104007']",1
3044,20220923,P02,0809604,ES511,Barcelona,19046,ES424,Guadalajara,,,NF,NF,D05_[50000-inf),D05_[50000-inf),06-08,train,conv_unknown,P02*0809604*0801903*None*train_71801*00-01*roa...,other,IT,,,,,,,passenger,False,3.405,2110.840,1.0,2.403529,0.100147,0.200294,0.200294,0.200294,0.300441,3,road-train-road,train_71801-train_04007,train_71801,train_04007,national,2,1,0,train_71801-train_04007,train_71801,train_4007,[rail],360.0,480.0,[],"['007171801', '007104007']",1
3328,20220923,P02,2804904,ES300,Madrid,1715502,ES512,Girona,,,O,H,D05_[50000-inf),D05_[50000-inf),04-06,road,,P02*2804904*19326*None*train_04007*00-01*road*...,other,ES,2_17,1715502,2_28,I00_[0-10000),A01_[25-45),male,passenger,False,3.717,2447.950,1.0,1.645230,0.670279,0.182803,0.670279,0.182803,0.365607,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_70200,train_70600,[rail],240.0,360.0,[],"['007104007', '007104040']",1
3329,20220923,P02,2804904,ES300,Madrid,22061,ES241,Huesca,,,O,H,D05_[50000-inf),D05_[50000-inf),04-06,train,conv_unknown,P02*2804904*19326*None*train_04007*00-01*road*...,other,ES,2_22,22061,2_28,I01_[10000-15000),A01_[25-45),male,long,False,3.316,1599.490,1.0,2.302376,0.057921,0.086882,0.318568,0.246166,0.304087,3,road-train-road,train_04007-train_78400,train_04007,train_78400,national,2,1,0,train_04007-train_78400,train_70200,train_78400,[rail],240.0,360.0,[],"['007104007', '007178400']",1
3330,20220923,P02,2804904,ES300,Madrid,22117_AM,ES241,Huesca,,,NF,H,D05_[50000-inf),D05_[50000-inf),06-08,train,conv_unknown,P02*2804904*19326*None*train_04007*00-01*road*...,other,ES,2_22,22117_AM,2_28,I01_[10000-15000),A02_[45-65),female,long,False,5.316,2736.052,1.0,3.997465,0.125575,0.104646,0.418583,0.376724,0.293008,3,road-train-road,train_04007-train_78400,train_04007,train_78400,national,2,1,0,train_04007-train_78400,train_70200,train_78400,[rail],360.0,480.0,[],"['007104007', '007178400']",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209365,20220923,P20,31901,ES220,Navarra,2804904,ES300,Madrid,,,O,O,D05_[50000-inf),D05_[50000-inf),04-06,train,conv_unknown,P20*31901*3120107*None*train_80100*00-01*road*...,other,ES,2_31,31086,2_28,I01_[10000-15000),A02_[45-65),female,long,False,4.222,1991.492,1.0,3.069407,0.187923,0.087697,0.187923,0.400902,0.288148,4,road-train-train-road,train_80100-train_04040-train_04007,train_80100,train_04007,national,2,2,0,train_80100-train_04040-train_04007,train_80100,train_70200,"[rail, rail]",240.0,360.0,[],"['007180100', '007104040', '007104007']",1
215498,20220923,P21,08054,ES511,Barcelona,28106,ES300,Madrid,,,O,H,D05_[50000-inf),D05_[50000-inf),08-10,road,,P21*08054*5029707*None*train_04040*03-04*road*...,other,ES,2_28,28106,2_45,I00_[0-10000),A01_[25-45),female,long,False,5.237,3246.490,1.0,3.323481,0.226601,0.251779,0.478380,0.579091,0.377668,3,road-train-road,train_04040-train_04007,train_04040,train_04007,national,2,1,0,train_04040-train_04007,train_70600,train_70200,[rail],480.0,600.0,[],"['007104040', '007104007']",1
220014,20220923,P21,5029704,ES243,Zaragoza,19024,ES424,Guadalajara,,,NF,NF,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P21*5029704*5029707*None*train_04040*00-01*roa...,other,ES,2_50,5029702,2_50,I02_[15000-inf),A02_[45-65),female,passenger,False,3.541,917.459,1.0,3.186900,0.000000,0.000000,0.000000,0.354100,0.000000,3,road-train-road,train_04040-train_04007,train_04040,train_04007,national,2,1,0,train_04040-train_04007,train_4040,train_4007,[rail],60.0,120.0,[],"['007104040', '007104007']",1
220043,20220923,P21,5029707,ES243,Zaragoza,1913002,ES424,Guadalajara,,,O,NF,D05_[50000-inf),D05_[50000-inf),01-02,train,conv_unknown,P21*5029707*5029707*None*train_04040*None*road...,other,ES,2_50,50074,2_50,I01_[10000-15000),A02_[45-65),male,passenger,False,3.403,852.541,1.0,2.818794,0.043815,0.073026,0.146052,0.189867,0.131446,3,road-train-road,train_04040-train_04007,train_04040,train_04007,national,2,1,0,train_04040-train_04007,train_4040,train_4007,[rail],60.0,120.0,[],"['007104040', '007104007']",1


For example, in Guadalajara we are picking up trips made on "cercanías" (suburban railway), but we are unable to assign them a path. This may explain why we can assign only a fraction of the trips.

Let us check the Madrid-BCN train.

In [438]:
trips_final[trips_final["path"]=="['007104007', '007104040']"].head()

Unnamed: 0,date,trip_period,origin_zone,origin,origin_name,destination_zone,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,distance,route_distance,duration,mode,service,legs,trip_vehicle_type,nationality,home_census,home_zone,overnight_census,income,age,sex,vehicle_type,short_professional_driver,trips,trips_km,sample_trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,duration_min,duration_max,weird_stations,path,is_in_paths
3328,20220923,P02,2804904,ES300,Madrid,1715502,ES512,Girona,,,O,H,D05_[50000-inf),D05_[50000-inf),04-06,road,,P02*2804904*19326*None*train_04007*00-01*road*...,other,ES,2_17,1715502,2_28,I00_[0-10000),A01_[25-45),male,passenger,False,3.717,2447.95,1.0,1.64523,0.670279,0.182803,0.670279,0.182803,0.365607,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_70200,train_70600,['rail'],240.0,360.0,[],"['007104007', '007104040']",1
31372,20220923,P07,16023_AM,ES423,Cuenca,816902,ES511,Barcelona,,,NF,H,D05_[50000-inf),D05_[50000-inf),08-10,road,,P07*16023_AM*19326*None*train_04007*00-01*road...,other,ES,2_08,816902,2_16,I01_[10000-15000),A03_[65-100),female,passenger,False,3.011,1871.702,1.0,1.372101,0.076228,0.0,0.076228,1.295873,0.19057,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_70200,train_70600,['rail'],480.0,600.0,[],"['007104007', '007104040']",1
34014,20220923,P07,28083,ES300,Madrid,5029712,ES243,Zaragoza,,,O,O,D05_[50000-inf),D05_[50000-inf),04-06,train,conv_unknown,P07*28083*19326*None*train_04007*00-01*road*No...,other,ES,2_11,1100604,2_28,I00_[0-10000),A02_[45-65),male,long,False,1.82,503.885,1.0,1.408066,0.022469,0.029959,0.059918,0.134815,0.164774,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_4007,train_4040,['rail'],240.0,360.0,[],"['007104007', '007104040']",1
45684,20220923,P08,28047,ES300,Madrid,1706602,ES512,Girona,,,NF,H,D05_[50000-inf),D05_[50000-inf),10-inf,road,,P08*28047*19326*None*train_04007*00-01*road*No...,other,ES,2_17,1706602,2_28,I01_[10000-15000),A02_[45-65),male,passenger,False,2.193,1640.877,1.0,1.254758,0.135649,0.19217,0.237387,0.158258,0.214778,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_70200,train_70600,['rail'],600.0,inf,[],"['007104007', '007104040']",1
58675,20220923,P09,2807401,ES300,Madrid,1707901,ES512,Girona,,,NF,H,D05_[50000-inf),D05_[50000-inf),06-08,road,,P09*2807401*19326*None*train_04007*00-01*road*...,other,ES,2_17,1707901,2_28,I01_[10000-15000),A02_[45-65),female,passenger,False,2.004,1423.627,1.0,1.233231,0.077077,0.077077,0.218385,0.167,0.231231,3,road-train-road,train_04007-train_04040,train_04007,train_04040,national,2,1,0,train_04007-train_04040,train_4007,train_70600,['rail'],360.0,480.0,[],"['007104007', '007104040']",1


In [439]:
trips_final_grouped[(trips_final_grouped["origin"]=="ES300")&(trips_final_grouped["destination"]=="ES511")&(trips_final_grouped["mode_tp"]=="['rail']")].head()

Unnamed: 0,origin,destination,path,duration_min,duration_max,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,mode_tp,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,emissions_2,total_travel_time,total_cost,total_emissions
2739,ES300,ES511,"['007117000', '007171801']",180.0,240.0,5.421,3.651587,0.262007,0.417448,0.333093,0.351197,0.405668,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2740,ES300,ES511,"['007117000', '007171801']",240.0,360.0,4.338,3.215205,0.225822,0.353736,0.135809,0.126334,0.281094,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2741,ES300,ES511,"['007117000', '007171801']",480.0,600.0,3.209,2.386403,0.104641,0.199109,0.101735,0.162775,0.254337,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2742,ES300,ES511,"['007117000', '007171801']",600.0,inf,4.24,2.638009,0.237421,0.249412,0.309367,0.323756,0.482036,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2752,ES300,ES511,"['007160000', '007171801']",120.0,180.0,1152.482,802.015893,51.063613,77.002567,66.37729,68.842899,87.179738,1.0,['rail'],23.0,36.0,165.172414,52.96,14.56,,,,,,,,,224.172414,52.96,14.56


There are trips with the path "['007104007', '007104040']", which is Guadalajara-BCN, that are classified as Madrid-BCN trips and hence cannot be assigned costs.
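One way to find such cases systematically, sketched here under the assumption that a station-to-NUTS lookup is available, is to compare the NUTS region of the first station of each path with the trip's origin. The dictionary `station_to_nuts` and all codes below are placeholders invented for the example, not real station assignments:

```python
from ast import literal_eval

import pandas as pd

# Hypothetical station -> NUTS lookup with placeholder codes.
station_to_nuts = {"S1": "R1", "S2": "R2", "S3": "R3"}

# Toy trips: the path is stored as a string, as in the notebook.
trips_toy = pd.DataFrame({
    "origin": ["R1", "R9"],
    "destination": ["R2", "R3"],
    "path": ["['S1', 'S2']", "['S1', 'S3']"],
})

# Parse the path string back into a list and look up the first station's region.
nodes = trips_toy["path"].apply(literal_eval)
trips_toy["first_station_nuts"] = nodes.apply(lambda p: p[0]).map(station_to_nuts)

# Flag trips whose first station lies outside the declared origin region.
trips_toy["origin_mismatch"] = trips_toy["first_station_nuts"] != trips_toy["origin"]
print(trips_toy[["origin", "first_station_nuts", "origin_mismatch"]])
```

Summing `trips` over the flagged rows would give an estimate of how much demand is lost to this kind of mismatch.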

In [440]:
possible_itineraries_grouped[(possible_itineraries_grouped["origin"]=="ES300")&(possible_itineraries_grouped["destination"]=="ES511")]

Unnamed: 0,origin,destination,path,nmodes,mode_tp,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,emissions_2,total_travel_time,total_cost,total_emissions
11359,ES300,ES511,"['007117000', '007171801']",1,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
11360,ES300,ES511,"['007160000', '007104040', '007171801']",1,"['rail', 'rail']",23.0,36.0,81.666667,25.15,7.88,15.0,109.333333,23.71,7.43,,,,,265.0,48.86,15.31
11361,ES300,ES511,"['007160000', '007104104', '007171801']",1,"['rail', 'rail']",23.0,36.0,134.0,45.04,12.38,20.0,34.0,6.99,2.19,,,,,247.0,52.03,14.57
11362,ES300,ES511,"['007160000', '007171801']",1,['rail'],23.0,36.0,165.172414,52.96,14.56,,,,,,,,,224.172414,52.96,14.56
11363,ES300,ES511,"['007160000', '007178400', '007171801']",1,"['rail', 'rail']",23.0,36.0,129.0,40.84,11.23,15.0,66.0,11.89,3.72,,,,,269.0,52.73,14.95
11364,ES300,ES511,"['LEMD', 'LEBL']",1,['air'],123.0,63.0,77.619048,136.23,43.547619,,,,,,,,,263.619048,136.23,43.547619


In [441]:
trips_final_grouped[trips_final_grouped["path"]=="['007117000', '007171801']"]

Unnamed: 0,origin,destination,path,duration_min,duration_max,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,mode_tp,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,emissions_2,total_travel_time,total_cost,total_emissions
2739,ES300,ES511,"['007117000', '007171801']",180.0,240.0,5.421,3.651587,0.262007,0.417448,0.333093,0.351197,0.405668,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2740,ES300,ES511,"['007117000', '007171801']",240.0,360.0,4.338,3.215205,0.225822,0.353736,0.135809,0.126334,0.281094,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2741,ES300,ES511,"['007117000', '007171801']",480.0,600.0,3.209,2.386403,0.104641,0.199109,0.101735,0.162775,0.254337,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2742,ES300,ES511,"['007117000', '007171801']",600.0,inf,4.24,2.638009,0.237421,0.249412,0.309367,0.323756,0.482036,1.0,['rail'],32.0,36.0,562.0,52.71,14.49,,,,,,,,,630.0,52.71,14.49
2815,ES300,ES514,"['007117000', '007171801']",240.0,360.0,2.558,1.377385,0.196769,0.02811,0.309209,0.281099,0.365429,,,,,,,,,,,,,,,,,,


In most cases, trips made on the slow train are correctly assigned to the slow-train path.

In [442]:
trips_final_grouped["nmodes"].value_counts()

nmodes
1.0    6791
2.0      33
Name: count, dtype: int64

In [443]:
assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]["trips"].sum()
print(assigned)

238963.84599999996


In [444]:
assigned_multimodal=trips_final_grouped[trips_final_grouped["nmodes"]>1]["trips"].sum()
print(assigned_multimodal)

185.26500000000001


In [445]:
trips_final_grouped.columns

Index(['origin', 'destination', 'path', 'duration_min', 'duration_max',
       'trips', 'archetype_0', 'archetype_1', 'archetype_2', 'archetype_3',
       'archetype_4', 'archetype_5', 'nmodes', 'mode_tp', 'access_time',
       'egress_time', 'travel_time_0', 'cost_0', 'emissions_0', 'mct_time_0_1',
       'travel_time_1', 'cost_1', 'emissions_1', 'mct_time_1_2',
       'travel_time_2', 'cost_2', 'emissions_2', 'total_travel_time',
       'total_cost', 'total_emissions'],
      dtype='object')

In [446]:
assigned_rail_only=trips_final_grouped[(trips_final_grouped["mode_tp"].astype(str).str.contains("rail"))&(~trips_final_grouped["mode_tp"].astype(str).str.contains("air"))&(trips_final_grouped["nmodes"].notna())]["trips"].sum()

In [447]:
assigned_air_only=trips_final_grouped[(trips_final_grouped["mode_tp"].astype(str).str.contains("air"))&(~trips_final_grouped["mode_tp"].astype(str).str.contains("rail"))&(trips_final_grouped["nmodes"].notna())]["trips"].sum()

In [448]:
assigned_air_only+assigned_rail_only+assigned_multimodal

np.float64(238963.84600000002)

In [449]:
print(f"of all the assigned trips {assigned_multimodal/assigned*100:.3f}% are multimodal")

of all the assigned trips 0.078% are multimodal


In [450]:
trips["mode_tp"]=trips["mode_tp"].astype(str)

In [451]:
total_multimodal=trips[(trips["mode_tp"].str.contains("air"))&(trips["mode_tp"].str.contains("rail"))]["trips"].sum()

In [452]:
total_multimodal

np.float64(1298.385)

In [453]:
print(f"of all the trips {total_multimodal/total_trips*100:.3f}% are multimodal. To maintain the same proportions we should have assigned {total_multimodal*assigned/(total_trips)**2*100:.3f}%")
print(f"the final proportion is {assigned_multimodal*total_trips**2/(assigned**2*total_multimodal)*100:.3f}% of the original proportion")

of all the trips 0.253% are multimodal. To maintain the same proportions we should have assigned 0.118%
the final proportion is 65.734% of the original proportion


In [454]:
trips[(trips["mode_tp"].str.contains("air"))&(trips["mode_tp"].str.contains("rail"))&(trips["path"]=="['LEAS', 'LEAL']")]

Unnamed: 0,date,trip_period,origin_zone,origin,origin_name,destination_zone,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,distance,route_distance,duration,mode,service,legs,trip_vehicle_type,nationality,home_census,home_zone,overnight_census,income,age,sex,vehicle_type,short_professional_driver,trips,trips_km,sample_trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,duration_min,duration_max,weird_stations,path,is_in_paths
23152,20220923,P06,33042,ES120,Asturias,303102,ES521,Alicante / Alacant,,,NF,NF,D05_[50000-inf),D05_[50000-inf),04-06,plane,,P06*33042*33016*None*airport_OVD*00-01*road*No...,other,ES,2_27,27066,2_33,I01_[10000-15000),A02_[45-65),male,passenger,False,3.651,3501.145,1.0,1.445941,0.072297,0.0,1.771277,0.144594,0.216891,5,road-plane-road-train-road,airport_OVD-airport_ALC-train_60911-train_03309,airport_OVD,train_03309,national,3,1,1,airport_OVD-airport_ALC,airport_OVD,airport_ALC,"['air', 'rail']",240.0,360.0,[],"['LEAS', 'LEAL']",1
109044,20220923,P13,33004,ES120,Asturias,3043,ES521,Alicante / Alacant,,,O,NF,D05_[50000-inf),D05_[50000-inf),06-08,plane,,P13*33004*33016*None*airport_OVD*00-01*road*No...,other,ES,2_33,33032,2_33,I01_[10000-15000),A02_[45-65),female,passenger,False,3.871,3253.756,1.0,1.817687,0.067322,0.033661,1.514739,0.336609,0.100983,5,road-plane-road-train-road,airport_OVD-airport_ALC-train_60911-train_03309,airport_OVD,train_03309,national,3,1,1,airport_OVD-airport_ALC,airport_OVD,airport_ALC,"['air', 'rail']",360.0,480.0,[],"['LEAS', 'LEAL']",1


## Part 8: Logit Model calibration

In [455]:
import biogeme

First, I need to limit the number of alternatives (paths) per origin-destination pair.
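A minimal sketch of this idea on toy data, assuming a long table with one row per (origin, destination, path) and a `trips` count. The helper name `keep_top_k` and the toy values are illustrative and not part of the notebook; the ranking uses the same sort-then-`cumcount` approach as the cells further down.

```python
import pandas as pd

def keep_top_k(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Rank paths within each O-D pair by trips (descending) and keep the k most used."""
    ranked = df.sort_values(["origin", "destination", "trips"],
                            ascending=[True, True, False]).copy()
    # cumcount numbers the rows of each group 0, 1, 2, ... in the sorted order
    ranked["noption"] = ranked.groupby(["origin", "destination"]).cumcount() + 1
    return ranked[ranked["noption"] <= k]

# Toy data: four candidate paths for A->X, one for B->Y (hypothetical values)
toy = pd.DataFrame({
    "origin": ["A", "A", "A", "A", "B"],
    "destination": ["X", "X", "X", "X", "Y"],
    "path": ["p1", "p2", "p3", "p4", "q1"],
    "trips": [10.0, 50.0, 5.0, 20.0, 7.0],
})
top = keep_top_k(toy, k=3)
print(top)
```

The sort has to happen before `cumcount`, otherwise the option number reflects row order rather than usage.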

In [456]:
possible_itineraries_avg.groupby(["origin","destination"]).size().sort_values()

origin  destination
ES708   ES705           1
        ES707           1
        ES709           1
ES415   ES411           1
        ES417           1
                       ..
ES617   ES411          18
ES616   ES521          18
ES617   ES220          19
ES616   ES511          22
ES220   ES614          24
Length: 2198, dtype: int64

In [457]:
trips_final_grouped_assigned=trips_final_grouped[trips_final_grouped["nmodes"].notna()]

In [458]:
# Count unique paths for each origin-destination pair
unique_paths_count = trips_final_grouped_assigned.groupby(['origin', 'destination'])['path'].nunique().reset_index()

# Rename columns for clarity
unique_paths_count.columns = ['origin', 'destination', 'unique_paths_count']

print(unique_paths_count)

     origin destination  unique_paths_count
0     ES111       ES112                   1
1     ES111       ES113                   3
2     ES111       ES114                   5
3     ES111       ES130                   2
4     ES111       ES211                   2
...     ...         ...                 ...
1497  ES709       ES618                   2
1498  ES709       ES620                   1
1499  ES709       ES704                   3
1500  ES709       ES705                   2
1501  ES709       ES708                   1

[1502 rows x 3 columns]


In [459]:
unique_paths_count["unique_paths_count"].describe()

count    1502.000000
mean        1.767643
std         1.136256
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         9.000000
Name: unique_paths_count, dtype: float64

75% of the origin-destination pairs with assigned paths have 2 paths or fewer, but one pair has as many as 9.

In [460]:
trips_final_grouped_assigned.drop("mode_tp",axis=1)

Unnamed: 0,origin,destination,path,duration_min,duration_max,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,emissions_2,total_travel_time,total_cost,total_emissions
3,ES111,ES112,"['007131412', '007120300']",60.0,120.0,3.714,2.921553,0.038190,0.105023,0.315069,0.190951,0.143213,1.0,43.0,74.0,179.500000,10.89,3.410000,,,,,,,,,296.500000,10.89,3.410000
4,ES111,ES112,"['007131412', '007120300']",180.0,240.0,34.983,27.184959,0.570618,0.614143,3.067649,1.330074,2.215557,1.0,43.0,74.0,179.500000,10.89,3.410000,,,,,,,,,296.500000,10.89,3.410000
5,ES111,ES112,"['007131412', '007120300']",240.0,360.0,14.030,11.266289,0.163634,0.283968,0.944361,0.726839,0.644909,1.0,43.0,74.0,179.500000,10.89,3.410000,,,,,,,,,296.500000,10.89,3.410000
6,ES111,ES112,"['007131412', '007120300']",360.0,480.0,5.777,4.594501,0.123177,0.061588,0.332578,0.394166,0.270989,1.0,43.0,74.0,179.500000,10.89,3.410000,,,,,,,,,296.500000,10.89,3.410000
7,ES111,ES113,"['007131400', '007122100']",0.0,60.0,227.581,177.312158,4.708263,4.007774,17.478449,9.441835,14.632521,1.0,61.0,28.0,42.733333,7.40,2.320000,,,,,,,,,131.733333,7.40,2.320000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10448,ES709,ES708,"['GCXO', 'GCRR']",120.0,180.0,142.247,80.287251,5.251482,4.024842,30.415937,8.901823,13.365665,1.0,120.0,43.0,50.000000,111.37,51.553077,,,,,,,,,213.000000,111.37,51.553077
10449,ES709,ES708,"['GCXO', 'GCRR']",180.0,240.0,140.999,82.928076,4.882438,3.632251,28.192395,10.386914,10.976926,1.0,120.0,43.0,50.000000,111.37,51.553077,,,,,,,,,213.000000,111.37,51.553077
10450,ES709,ES708,"['GCXO', 'GCRR']",240.0,360.0,211.289,114.230318,7.176302,5.533047,47.580615,16.312375,20.456343,1.0,120.0,43.0,50.000000,111.37,51.553077,,,,,,,,,213.000000,111.37,51.553077
10451,ES709,ES708,"['GCXO', 'GCRR']",360.0,480.0,54.408,29.870243,2.314012,0.772183,12.412165,3.985797,5.053602,1.0,120.0,43.0,50.000000,111.37,51.553077,,,,,,,,,213.000000,111.37,51.553077


In [461]:
inconsistent_groups = trips_final_grouped_assigned.groupby(['origin', 'destination', 'path'])['access_time'].nunique()
inconsistent_groups = inconsistent_groups[inconsistent_groups > 1]

In [462]:
inconsistent_groups

Series([], Name: access_time, dtype: int64)

In [463]:
trips_final_grouped_assigned.columns

Index(['origin', 'destination', 'path', 'duration_min', 'duration_max',
       'trips', 'archetype_0', 'archetype_1', 'archetype_2', 'archetype_3',
       'archetype_4', 'archetype_5', 'nmodes', 'mode_tp', 'access_time',
       'egress_time', 'travel_time_0', 'cost_0', 'emissions_0', 'mct_time_0_1',
       'travel_time_1', 'cost_1', 'emissions_1', 'mct_time_1_2',
       'travel_time_2', 'cost_2', 'emissions_2', 'total_travel_time',
       'total_cost', 'total_emissions'],
      dtype='object')

In [464]:
trips_logit =trips_final_grouped_assigned.groupby(['origin', 'destination', 'path']).agg({  
    'trips': 'sum',    # Sum num_of_trips
    'archetype_0': 'sum',     # Sum archetype_0
    'archetype_1': 'sum',     # Sum archetype_1
    'archetype_2': 'sum',     # Sum archetype_2
    'archetype_3': 'sum',     # Sum archetype_3
    'archetype_4': 'sum',     # Sum archetype_4
    'archetype_5': 'sum',     # Sum archetype_5
    'nmodes': 'first',        # Keep the first value for nmodes
    'access_time': 'first',   # Keep the first value for access_time
    'egress_time': 'first',    # Keep the first value for egress_time
    'travel_time_0': 'first',
    'cost_0': 'first',
    'emissions_0':'first',
    'mct_time_0_1': 'first',
    'travel_time_1': 'first',
    'cost_1': 'first',
    'emissions_1':'first',
    'mct_time_1_2': 'first',
    'travel_time_2': 'first',
    'cost_2': 'first',
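    'emissions_1':'first',    # NOTE: duplicate key (probably meant 'emissions_2'); as written, emissions_2 is not carried into trips_logit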
    'total_travel_time': 'first',
    'total_cost':'first',
    'total_emissions':'first'
}).reset_index()

In [465]:
trips_logit.head()

Unnamed: 0,origin,destination,path,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,total_travel_time,total_cost,total_emissions
0,ES111,ES112,"['007131412', '007120300']",58.504,45.967302,0.895619,1.064722,4.659657,2.64203,3.27467,1.0,43.0,74.0,179.5,10.89,3.41,,,,,,,,296.5,10.89,3.41
1,ES111,ES113,"['007131400', '007122100']",831.82,638.068954,17.071497,16.35168,68.127097,37.168519,55.032253,1.0,61.0,28.0,42.733333,7.4,2.32,,,,,,,,131.733333,7.4,2.32
2,ES111,ES113,"['007131412', '007120300']",5.003,3.764162,0.087354,0.135002,0.524124,0.19059,0.301768,1.0,43.0,69.0,179.5,10.89,3.41,,,,,,,,291.5,10.89,3.41
3,ES111,ES113,"['007131412', '007122100']",298.436,228.508198,6.682714,5.53096,23.093871,14.035849,20.584408,1.0,43.0,28.0,78.571429,11.09,3.47,,,,,,,,149.571429,11.09,3.47
4,ES111,ES114,"['007131400', '007122100']",19.007,14.148943,0.596662,0.475672,1.545168,0.858479,1.382076,1.0,61.0,90.0,42.733333,7.4,2.32,,,,,,,,193.733333,7.4,2.32


In [466]:
# IMPORTANT TO RUN THIS LINE TO ENSURE THAT WE STAY WITH THE MOST USED ALTERNATIVES
trips_logit=trips_logit.sort_values(by=["origin","destination","trips"],ascending=[True,True,False])

In [467]:
trips_logit['noption'] = trips_logit.groupby(['origin', 'destination']).cumcount() + 1

In [468]:
trips_logit["noption"].value_counts()

noption
1    1502
2     660
3     283
4     125
5      53
6      21
7       8
8       2
9       1
Name: count, dtype: int64

In [469]:
less_than_5=trips_logit[trips_logit["noption"]<5]["trips"].sum()
total_trips=trips_logit["trips"].sum()
print(f"there are {less_than_5/total_trips*100:.2f}% of trips with 4 or less options")

there are 99.64% of trips with 4 or less options


In [470]:
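less_than_4=trips_logit[trips_logit["noption"]<4]["trips"].sum()
total_trips=trips_logit["trips"].sum()
# NOTE: this prints less_than_5 rather than less_than_4, so the figure below repeats the share of the previous cell
print(f"there are {less_than_5/total_trips*100:.2f}% of trips with 3 or less options")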

there are 99.64% of trips with 3 or less options


In [471]:
trips_logit[trips_logit["noption"]<4]["nmodes"].value_counts()

nmodes
1.0    2413
2.0      32
Name: count, dtype: int64

In [472]:
trips_logit["nmodes"].value_counts()

nmodes
1.0    2622
2.0      33
Name: count, dtype: int64

In [473]:
trips_logit[(trips_logit["nmodes"]>1)&(trips_logit["noption"]<4)]["trips"].sum()

np.float64(181.704)

In [474]:
trips_logit[(trips_logit["nmodes"]>1)]["trips"].sum()

np.float64(185.26500000000001)

Keeping at most 3 options per O-D pair still retains most of the multimodal trips (181.7 of 185.3, about 98%).

In [475]:
# FOR A FIRST RUN I DECIDED TO ONLY CONSIDER THE 3 MOST USED ALTERNATIVES PER O-D PAIR
trips_logit=trips_logit[trips_logit["noption"]<=3]

In [476]:
trips_logit.head()

Unnamed: 0,origin,destination,path,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,total_travel_time,total_cost,total_emissions,noption
0,ES111,ES112,"['007131412', '007120300']",58.504,45.967302,0.895619,1.064722,4.659657,2.64203,3.27467,1.0,43.0,74.0,179.5,10.89,3.41,,,,,,,,296.5,10.89,3.41,1
1,ES111,ES113,"['007131400', '007122100']",831.82,638.068954,17.071497,16.35168,68.127097,37.168519,55.032253,1.0,61.0,28.0,42.733333,7.4,2.32,,,,,,,,131.733333,7.4,2.32,1
3,ES111,ES113,"['007131412', '007122100']",298.436,228.508198,6.682714,5.53096,23.093871,14.035849,20.584408,1.0,43.0,28.0,78.571429,11.09,3.47,,,,,,,,149.571429,11.09,3.47,2
2,ES111,ES113,"['007131412', '007120300']",5.003,3.764162,0.087354,0.135002,0.524124,0.19059,0.301768,1.0,43.0,69.0,179.5,10.89,3.41,,,,,,,,291.5,10.89,3.41,3
8,ES111,ES114,"['007131412', '007123004']",470.356,331.036817,15.922206,12.295786,51.267123,22.336955,37.497113,1.0,43.0,31.0,77.266667,9.74,3.05,,,,,,,,151.266667,9.74,3.05,1


In [477]:
cols = [f"{item}_{i}" for i in range(1, 4) for item in ["travel_time", "travel_cost", "co2", "train", "plane", "multimodal", "av"]]
print(cols)

['travel_time_1', 'travel_cost_1', 'co2_1', 'train_1', 'plane_1', 'multimodal_1', 'av_1', 'travel_time_2', 'travel_cost_2', 'co2_2', 'train_2', 'plane_2', 'multimodal_2', 'av_2', 'travel_time_3', 'travel_cost_3', 'co2_3', 'train_3', 'plane_3', 'multimodal_3', 'av_3']


In [478]:
# the following lines create the paths_w_costs table needed for the calibration of the logit model
import re  # used below to classify each path as train, plane or multimodal

cols = [f"{item}_{i}" for i in range(1, 4) for item in ["travel_time", "cost", "emissions", "train", "plane", "multimodal", "av"]]
paths_w_costs=pd.DataFrame(columns=cols)
# select the origin-destination pairs
unique_combinations = trips_logit[['origin', 'destination']].drop_duplicates()

# copy them
paths_w_costs.insert(0,"origin", unique_combinations["origin"])
paths_w_costs.insert(1,"destination", unique_combinations["destination"])

# assign default values for the rest of the columns
for col in cols:
    if col.startswith("av") or col.startswith("plane") or col.startswith("train") or col.startswith("multimodal"):
        paths_w_costs[col]= 0
    else:
        paths_w_costs[col]= float(-1)


# assign values from trips_logit
for idx, row in paths_w_costs.iterrows():
    for i in range(1,4):
        # retrieve the trips_logit row for this O-D pair and option i (at most one match is expected)
        option = trips_logit[(trips_logit["origin"] == row["origin"]) &
                             (trips_logit["destination"] == row["destination"]) &
                             (trips_logit["noption"] == i)]

        if option.empty:
            continue  # no option i for this O-D pair: keep the defaults (-1 costs, av = 0)

        # copy travel time, cost and emissions
        paths_w_costs.loc[idx, f"travel_time_{i}"] = option["total_travel_time"].iloc[0]
        paths_w_costs.loc[idx, f"cost_{i}"] = option["total_cost"].iloc[0]
        paths_w_costs.loc[idx, f"emissions_{i}"] = option["total_emissions"].iloc[0]

        # flag the path as multimodal, train or plane and mark it as available
        path = option["path"].iloc[0]
        if bool(re.search(r'(?=.*[A-Z])(?=.*\d)', path)):  # capital letters and digits -> airports and train stations
            paths_w_costs.loc[idx, f"multimodal_{i}"] = 1
            paths_w_costs.loc[idx, f"av_{i}"] = 1
        elif bool(re.search(r'^[^A-Z]*$', path)):  # no capital letters -> no airports
            paths_w_costs.loc[idx, f"train_{i}"] = 1
            paths_w_costs.loc[idx, f"av_{i}"] = 1
        elif bool(re.search(r'^\D*$', path)):  # no digits -> no train stations
            paths_w_costs.loc[idx, f"plane_{i}"] = 1
            paths_w_costs.loc[idx, f"av_{i}"] = 1

In [479]:
paths_w_costs

Unnamed: 0,origin,destination,travel_time_1,cost_1,emissions_1,train_1,plane_1,multimodal_1,av_1,travel_time_2,cost_2,emissions_2,train_2,plane_2,multimodal_2,av_2,travel_time_3,cost_3,emissions_3,train_3,plane_3,multimodal_3,av_3
0,ES111,ES112,296.500000,10.89,3.410000,1,0,0,1,-1.000000,-1.00,-1.000000,0,0,0,0,-1.000,-1.00,-1.00,0,0,0,0
1,ES111,ES113,131.733333,7.40,2.320000,1,0,0,1,149.571429,11.09,3.470000,1,0,0,1,291.500,10.89,3.41,1,0,0,1
8,ES111,ES114,151.266667,9.74,3.050000,1,0,0,1,138.947368,4.67,1.460000,1,0,0,1,174.375,6.05,1.89,1,0,0,1
10,ES111,ES130,318.000000,132.79,69.960000,0,1,0,1,289.000000,132.17,42.060000,0,1,0,1,-1.000,-1.00,-1.00,0,0,0,0
12,ES111,ES211,294.000000,132.79,69.960000,0,1,0,1,265.000000,132.17,42.060000,0,1,0,1,-1.000,-1.00,-1.00,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2647,ES709,ES618,294.000000,197.30,88.156667,0,1,0,1,348.000000,199.85,87.470000,0,1,0,1,-1.000,-1.00,-1.00,0,0,0,0
2648,ES709,ES620,577.000000,359.27,170.445000,0,1,0,1,-1.000000,-1.00,-1.000000,0,0,0,0,-1.000,-1.00,-1.00,0,0,0,0
2650,ES709,ES704,253.000000,107.04,48.438182,0,1,0,1,298.000000,174.64,87.027097,0,1,0,1,347.000,175.91,87.56,0,1,0,1
2653,ES709,ES705,207.000000,81.66,43.637586,0,1,0,1,256.000000,82.93,44.370000,0,1,0,1,-1.000,-1.00,-1.00,0,0,0,0
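The wide table shown above could also be built without a row-by-row loop. The following is only a sketch under stated assumptions: the input is a long table shaped like `trips_logit` (one row per O-D pair and `noption`), airports are letter codes and stations are numeric codes, and the names `classify_mode`, `build_wide` and `ITEMS` are hypothetical helpers that do not exist in the notebook or in `strategic_evaluator`.

```python
import pandas as pd

ITEMS = ["travel_time", "cost", "emissions", "train", "plane", "multimodal", "av"]

def classify_mode(path: str) -> str:
    """Assumes airports are capital-letter codes (e.g. 'LEMD') and stations are digits."""
    has_letters = any(c.isupper() for c in path)
    has_digits = any(c.isdigit() for c in path)
    if has_letters and has_digits:
        return "multimodal"
    return "plane" if has_letters else "train"

def build_wide(long: pd.DataFrame, n_alt: int = 3) -> pd.DataFrame:
    long = long[long["noption"] <= n_alt].rename(columns={
        "total_travel_time": "travel_time",
        "total_cost": "cost",
        "total_emissions": "emissions"})
    mode = long["path"].map(classify_mode)
    long = long.assign(train=(mode == "train").astype(int),
                       plane=(mode == "plane").astype(int),
                       multimodal=(mode == "multimodal").astype(int),
                       av=1)  # every row that exists is an available option
    wide = long.pivot(index=["origin", "destination"], columns="noption", values=ITEMS)
    # make sure every (item, option) column exists, even if no O-D pair has that option
    wide = wide.reindex(columns=pd.MultiIndex.from_product([ITEMS, range(1, n_alt + 1)]))
    wide.columns = [f"{item}_{i}" for item, i in wide.columns]
    cost_cols = [c for c in wide.columns if c.startswith(("travel_time", "cost", "emissions"))]
    flag_cols = [c for c in wide.columns if c not in cost_cols]
    wide[cost_cols] = wide[cost_cols].fillna(-1.0)           # same -1 default as the loop
    wide[flag_cols] = wide[flag_cols].fillna(0).astype(int)  # missing option -> flags at 0
    ordered = [f"{item}_{i}" for i in range(1, n_alt + 1) for item in ITEMS]
    return wide[ordered].reset_index()

# Toy input (values rounded from the tables in this notebook)
toy = pd.DataFrame({
    "origin": ["ES300", "ES300", "ES111"],
    "destination": ["ES511", "ES511", "ES112"],
    "path": ["['007160000', '007171801']", "['LEMD', 'LEBL']", "['007131412', '007120300']"],
    "total_travel_time": [224.17, 263.62, 296.5],
    "total_cost": [52.96, 136.23, 10.89],
    "total_emissions": [14.56, 43.55, 3.41],
    "noption": [1, 2, 1],
})
wide = build_wide(toy, n_alt=3)
print(wide.T)
```

`pivot` raises an error if an O-D pair has two rows with the same `noption`, which doubles as a check that the option ranking is unique.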


Let us do some checks to ensure that paths_w_costs is correctly calculated

In [480]:
# check every option (1 to 3); otherwise `i` is just the value left over from the previous loop
for i in range(1, 4):
    mask = ((paths_w_costs[f"travel_time_{i}"] == -1) &
            (paths_w_costs[f"cost_{i}"] == -1) &
            (paths_w_costs[f"emissions_{i}"] == -1) &
            (paths_w_costs[f"av_{i}"] == 1))

    # Use the mask to filter rows where the condition is True
    matching_rows = paths_w_costs[mask]

    # Report every available option whose costs are still at the -1 default
    for idx, row in matching_rows.iterrows():
        print(f"Condition met for row {idx}, option {i}")

No available path has -1 costs.

In [481]:
paths_w_costs[(paths_w_costs["train_1"]+paths_w_costs["plane_1"]+paths_w_costs["multimodal_1"])>1]

Unnamed: 0,origin,destination,travel_time_1,cost_1,emissions_1,train_1,plane_1,multimodal_1,av_1,travel_time_2,cost_2,emissions_2,train_2,plane_2,multimodal_2,av_2,travel_time_3,cost_3,emissions_3,train_3,plane_3,multimodal_3,av_3


No first-option path is flagged as more than one of train, plane or multimodal.
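The two checks above can be generalised to every option at once. This is a sketch, not the notebook's method: `find_inconsistencies` is a hypothetical helper, and the rules it encodes (an available option has exactly one mode flag and no `-1` placeholder, an unavailable option has no mode flag) are my reading of how `paths_w_costs` is built.

```python
import pandas as pd

def find_inconsistencies(df: pd.DataFrame, n_alt: int = 3) -> list:
    """Return (row index, option) pairs that break the availability rules."""
    problems = []
    for i in range(1, n_alt + 1):
        n_flags = df[[f"train_{i}", f"plane_{i}", f"multimodal_{i}"]].sum(axis=1)
        any_missing = (df[[f"travel_time_{i}", f"cost_{i}", f"emissions_{i}"]] == -1).any(axis=1)
        available = df[f"av_{i}"] == 1
        # available option: exactly one mode flag and no -1 placeholder
        bad_available = available & ((n_flags != 1) | any_missing)
        # unavailable option: no mode flag at all
        bad_unavailable = ~available & (n_flags != 0)
        for idx in df.index[bad_available | bad_unavailable]:
            problems.append((idx, i))
    return problems

# Toy table with two options per O-D pair (illustrative values)
toy = pd.DataFrame({
    "travel_time_1": [296.5, 131.7], "cost_1": [10.89, 7.40], "emissions_1": [3.41, 2.32],
    "train_1": [1, 1], "plane_1": [0, 0], "multimodal_1": [0, 0], "av_1": [1, 1],
    "travel_time_2": [-1.0, 149.6], "cost_2": [-1.0, 11.09], "emissions_2": [-1.0, 3.47],
    "train_2": [0, 1], "plane_2": [0, 0], "multimodal_2": [0, 0], "av_2": [0, 1],
})
print(find_inconsistencies(toy, n_alt=2))

# Same table with one deliberate error: option 2 of row 0 marked available with -1 costs
broken = toy.copy()
broken.loc[0, "av_2"] = 1
print(find_inconsistencies(broken, n_alt=2))
```

On the real table the call would be `find_inconsistencies(paths_w_costs, n_alt=3)`; an empty list means all three options pass.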

In [482]:
calibration_matrix=trips_logit.drop(columns=["path","nmodes","access_time","egress_time","travel_time_0","cost_0","emissions_0","mct_time_0_1","travel_time_1","cost_1","emissions_1","mct_time_1_2","travel_time_2","cost_2","total_cost","total_emissions"])
calibration_matrix=calibration_matrix.merge(paths_w_costs, on=["origin","destination"],how="left")
calibration_matrix=calibration_matrix.rename(columns={"noption":"observed_choice"})
calibration_matrix=calibration_matrix.drop(columns=["origin","destination"])

In [483]:
calibration_matrix.head()

Unnamed: 0,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,total_travel_time,observed_choice,travel_time_1,cost_1,emissions_1,train_1,plane_1,multimodal_1,av_1,travel_time_2,cost_2,emissions_2,train_2,plane_2,multimodal_2,av_2,travel_time_3,cost_3,emissions_3,train_3,plane_3,multimodal_3,av_3
0,58.504,45.967302,0.895619,1.064722,4.659657,2.64203,3.27467,296.5,1,296.5,10.89,3.41,1,0,0,1,-1.0,-1.0,-1.0,0,0,0,0,-1.0,-1.0,-1.0,0,0,0,0
1,831.82,638.068954,17.071497,16.35168,68.127097,37.168519,55.032253,131.733333,1,131.733333,7.4,2.32,1,0,0,1,149.571429,11.09,3.47,1,0,0,1,291.5,10.89,3.41,1,0,0,1
2,298.436,228.508198,6.682714,5.53096,23.093871,14.035849,20.584408,149.571429,2,131.733333,7.4,2.32,1,0,0,1,149.571429,11.09,3.47,1,0,0,1,291.5,10.89,3.41,1,0,0,1
3,5.003,3.764162,0.087354,0.135002,0.524124,0.19059,0.301768,291.5,3,131.733333,7.4,2.32,1,0,0,1,149.571429,11.09,3.47,1,0,0,1,291.5,10.89,3.41,1,0,0,1
4,470.356,331.036817,15.922206,12.295786,51.267123,22.336955,37.497113,151.266667,1,151.266667,9.74,3.05,1,0,0,1,138.947368,4.67,1.46,1,0,0,1,174.375,6.05,1.89,1,0,0,1


In [484]:
pax_demand=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\demand\demand_ES_MD_intra_v0.2.csv")

In [485]:
for i in range(1,4):
    calibration_matrix=calibration_matrix.rename(columns={f"travel_cost_{i}":f"cost_{i}"})
    paths_w_costs=paths_w_costs.rename(columns={f"travel_cost_{i}":f"cost_{i}"})
    calibration_matrix=calibration_matrix.rename(columns={f"co2_{i}":f"emissions_{i}"})
    paths_w_costs=paths_w_costs.rename(columns={f"co2_{i}":f"emissions_{i}"})

In [486]:
pax_demand.to_csv("pax_demand.csv",index=False)
calibration_matrix.to_csv("calibration_matrix.csv",index=False)
paths_w_costs.to_csv("potential_paths_w_costs.csv",index=False)


In [487]:
database_path = r"C:\Users\LMENENDEZ\GitHub\MultiModX\calibration_matrix.csv"
n_archetypes = 6
n_alternatives = 3
calibrate_main(database_path, n_archetypes, n_alternatives)

Results for model archetype_0
Nbr of parameters:		5
Sample size:			1956
Excluded data:			0
Final log likelihood:		-1233.034
Akaike Information Criterion:	2476.068
Bayesian Information Criterion:	2503.961

              Value  Rob. Std err  Rob. t-test  Rob. p-value
ASC_PLANE  0.396035      1.246847     0.317629      0.750766
ASC_TRAIN -0.452849      1.265356    -0.357883      0.720431
B_CO2     -0.046671      0.010232    -4.561501      0.000005
B_COST     0.004104      0.004991     0.822335      0.410886
B_TIME    -0.014836      0.001035   -14.336538      0.000000
{'ASC_PLANE': np.float64(0.3960347967706549), 'ASC_TRAIN': np.float64(-0.45284940711309096), 'B_CO2': np.float64(-0.046671462331746766), 'B_COST': np.float64(0.004104118019709926), 'B_TIME': np.float64(-0.014836319897036139)}
Results for model archetype_1
Nbr of parameters:		5
Sample size:			1956
Excluded data:			0
Final log likelihood:		-1090.722
Akaike Information Criterion:	2191.445
Bayesian Information Criterion:	2219.338

In [488]:
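# probability matrix
from pathlib import Path  # needed for Path.cwd() unless the star import at the top already provides it
sensitivities = {"sensitivities": str(Path.cwd())}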
paths_prob = predict_main(paths_w_costs, n_archetypes, n_alternatives,sensitivities)
paths_prob.to_csv('potential_paths_w_probabilities.csv', index = False)

The chosen alternative [`2.0`] is not available for the following observations (rownumber[choice]): 0[2.0]-18[2.0]-20[2.0]-22[2.0]-26[2.0]-27[2.0]-30[2.0]-33[2.0]-34[2.0]-36[2.0]-38[2.0]-39[2.0]-41[2....
The chosen alternative [`3.0`] is not available for the following observations (rownumber[choice]): 0[3.0]-3[3.0]-4[3.0]-5[3.0]-8[3.0]-9[3.0]-10[3.0]-11[3.0]-14[3.0]-15[3.0]-16[3.0]-18[3.0]-20[3.0]-22...
The chosen alternative [`2.0`] is not available for the following observations (rownumber[choice]): 0[2.0]-18[2.0]-20[2.0]-22[2.0]-26[2.0]-27[2.0]-30[2.0]-33[2.0]-34[2.0]-36[2.0]-38[2.0]-39[2.0]-41[2....
The chosen alternative [`3.0`] is not available for the following observations (rownumber[choice]): 0[3.0]-3[3.0]-4[3.0]-5[3.0]-8[3.0]-9[3.0]-10[3.0]-11[3.0]-14[3.0]-15[3.0]-16[3.0]-18[3.0]-20[3.0]-22...
The chosen alternative [`2.0`] is not available for the following observations (rownumber[choice]): 0[2.0]-18[2.0]-20[2.0]-22[2.0]-26[2.0]-27[2.0]-30[2.0]-33[2.0]-34[2.0]-36[2.0]-38[2.

In [489]:
# paths with probabilities
pax_demand_path = r"C:\Users\LMENENDEZ\GitHub\MultiModX\pax_demand.csv"
potential_demand = assign_passengers_main(paths_prob, n_alternatives, pax_demand_path)
potential_demand.to_csv('potential_demand_flows.csv', index = False)

In [490]:
potential_demand

Unnamed: 0,date,origin,destination,archetype,trips,alternative_1,alternative_prob_1,alternative_2,alternative_prob_2,alternative_3,alternative_prob_3
0,20220923,ES111,ES112,archetype_0,874.728102,874.728102,1.0,0.0,0.0,0.0,0.0
1,20220923,ES111,ES112,archetype_1,15.985248,15.985248,1.0,0.0,0.0,0.0,0.0
2,20220923,ES111,ES112,archetype_2,23.717297,23.717297,1.0,0.0,0.0,0.0,0.0
3,20220923,ES111,ES112,archetype_3,84.776928,84.776928,1.0,0.0,0.0,0.0,0.0
4,20220923,ES111,ES112,archetype_4,52.702152,52.702152,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
11835,20220923,abroad,ES709,archetype_1,70.255272,,,,,,
11836,20220923,abroad,ES709,archetype_2,34.998342,,,,,,
11837,20220923,abroad,ES709,archetype_3,3.028744,,,,,,
11838,20220923,abroad,ES709,archetype_4,5.973891,,,,,,


## Part 9: Analysis of the calibration

In [491]:
import biogeme.results as res

In [492]:
paths_prob[(paths_prob["origin"]=="ES300")&(paths_prob["destination"]=="ES511")]

Unnamed: 0,origin,destination,archetype_0_prob_1,archetype_0_prob_2,archetype_0_prob_3,archetype_1_prob_1,archetype_1_prob_2,archetype_1_prob_3,archetype_2_prob_1,archetype_2_prob_2,archetype_2_prob_3,archetype_3_prob_1,archetype_3_prob_2,archetype_3_prob_3,archetype_4_prob_1,archetype_4_prob_2,archetype_4_prob_3,archetype_5_prob_1,archetype_5_prob_2,archetype_5_prob_3
691,ES300,ES511,0.502094,0.237765,0.260141,0.473344,0.244256,0.2824,0.472281,0.254433,0.273286,0.499386,0.205774,0.294841,0.491251,0.216547,0.292202,0.480492,0.242611,0.276897


In [493]:
paths_w_costs[(paths_w_costs["origin"]=="ES300")&(paths_w_costs["destination"]=="ES511")]

Unnamed: 0,origin,destination,travel_time_1,cost_1,emissions_1,train_1,plane_1,multimodal_1,av_1,travel_time_2,cost_2,emissions_2,train_2,plane_2,multimodal_2,av_2,travel_time_3,cost_3,emissions_3,train_3,plane_3,multimodal_3,av_3
691,ES300,ES511,224.172414,52.96,14.56,1,0,0,1,263.619048,136.23,43.547619,0,1,0,1,265.0,48.86,15.31,1,0,0,1


In [494]:
trips_logit[(trips_logit["origin"]=="ES300")&(trips_logit["destination"]=="ES511")]

Unnamed: 0,origin,destination,path,trips,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,nmodes,access_time,egress_time,travel_time_0,cost_0,emissions_0,mct_time_0_1,travel_time_1,cost_1,emissions_1,mct_time_1_2,travel_time_2,cost_2,total_travel_time,total_cost,total_emissions,noption
691,ES300,ES511,"['007160000', '007171801']",6361.68,4335.096848,281.565693,410.473284,411.870545,421.807648,500.865983,1.0,23.0,36.0,165.172414,52.96,14.56,,,,,,,,224.172414,52.96,14.56,1
692,ES300,ES511,"['LEMD', 'LEBL']",1586.1,1076.792895,68.339302,99.441904,105.405557,107.107534,129.012807,1.0,123.0,63.0,77.619048,136.23,43.547619,,,,,,,,263.619048,136.23,43.547619,2
690,ES300,ES511,"['007160000', '007104040', '007171801']",309.198,215.56292,13.016417,19.068899,19.523005,19.755275,22.271484,1.0,23.0,36.0,81.666667,25.15,7.88,15.0,109.333333,23.71,7.43,,,,265.0,48.86,15.31,3


The three alternatives with the most travelers from Madrid to Barcelona were:
- the high-speed train from Atocha to Sants
- the plane
- the high-speed train from Atocha to Delicias (Zaragoza), followed by the high-speed train from Delicias to Sants

Most people took the direct train (about 77% of the trips across these three alternatives).

When calibrating the model, we decided to consider only the 3 alternatives with the most travelers per O-D pair. Hence, the slow train from Chamartín to Sants is not considered as a calibration alternative (very few people took it).
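To see how the calibrated coefficients turn into the probabilities in `paths_prob`, here is a small worked sketch for the Madrid-Barcelona pair. It assumes a plain multinomial logit with a linear utility, `V = ASC_mode + B_TIME*time + B_COST*cost + B_CO2*emissions`, and a multimodal constant fixed at 0; that functional form is an assumption on my part, since the actual specification lives in `strategic_evaluator.logit_model`. The coefficients are the rounded `archetype_5` values printed in the biogeme results of the next cell, and the attributes come from the `paths_w_costs` row above.

```python
import math

# Rounded archetype_5 estimates as printed by biogeme in this notebook
beta = {"ASC_TRAIN": 0.769, "ASC_PLANE": 1.82,
        "B_TIME": -0.0129, "B_COST": -0.00123, "B_CO2": -0.0387}

# ES300 -> ES511 alternatives from paths_w_costs: (mode, travel time, cost, emissions)
alternatives = [
    ("train", 224.172414, 52.96, 14.56),       # option 1: direct high-speed train
    ("plane", 263.619048, 136.23, 43.547619),  # option 2: plane
    ("train", 265.0, 48.86, 15.31),            # option 3: train via Zaragoza
]

def utility(mode: str, time: float, cost: float, co2: float) -> float:
    # Assumed linear utility; a mode without its own constant (multimodal) gets ASC = 0
    asc = {"train": beta["ASC_TRAIN"], "plane": beta["ASC_PLANE"]}.get(mode, 0.0)
    return asc + beta["B_TIME"] * time + beta["B_COST"] * cost + beta["B_CO2"] * co2

def mnl_probabilities(alts):
    v = [utility(*a) for a in alts]
    v_max = max(v)                            # subtract the max for numerical stability
    exp_v = [math.exp(x - v_max) for x in v]
    total = sum(exp_v)
    return [e / total for e in exp_v]

probs = mnl_probabilities(alternatives)
print([round(p, 3) for p in probs])
```

With these rounded inputs the result lands close to the `archetype_5_prob_1` to `archetype_5_prob_3` values reported in `paths_prob` for this pair (0.480492, 0.242611, 0.276897), which suggests the assumed utility form is at least consistent with the model used here.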

In [495]:
read_results = res.bioResults(pickle_file=r"archetype_5.pickle")
print(read_results)


Results for model archetype_5
Nbr of parameters:		5
Sample size:			1956
Excluded data:			0
Init log likelihood:		-1581.744
Final log likelihood:		-1230.663
Likelihood ratio test (init):		702.163
Rho square (init):			0.222
Rho bar square (init):			0.219
Akaike Information Criterion:	2471.325
Bayesian Information Criterion:	2499.218
Final gradient norm:		2.369064e-06
ASC_PLANE      : 1.82[1.47 1.24 0.215][1.63 1.12 0.264]
ASC_TRAIN      : 0.769[1.47 0.521 0.602][1.63 0.472 0.637]
B_CO2          : -0.0387[0.00912 -4.24 2.23e-05][0.00934 -4.14 3.48e-05]
B_COST         : -0.00123[0.00472 -0.261 0.794][0.00468 -0.263 0.793]
B_TIME         : -0.0129[0.000867 -14.9 0][0.000964 -13.4 0]
('ASC_TRAIN', 'ASC_PLANE'):	2.13	0.986	-4.27	1.93e-05	2.62	0.991	-4.68	2.89e-06
('B_CO2', 'ASC_PLANE'):	-0.000249	-0.0186	-1.27	0.205	-0.00025	-0.0164	-1.14	0.254
('B_CO2', 'ASC_TRAIN'):	-0.000309	-0.023	-0.547	0.584	-0.000331	-0.0218	-0.496	0.62
('B_COST', 'ASC_PLANE'):	-7.8e-05	-0.0113	-1.24	0.214	-0.000127	-