# Adjustment of international trips from MND to real data

*Problem*: MND data does not reflect exactly how many people travel from Spain to international countries and vice-versa. 

*Solution*: Use data from aena to calculate the proportionality coefficient to be applied for each selected country and type of trip (incoming or outgoing)

What have I done in this notebook:
- See the internationl trips for the whole week disaggregated by country (France, UK, etc, etc,)
- Multiply that by 30/7 to infer the number of people that travel in the entire month
- Compare that with the data from aena
- get the coefficient predicted_data/real_data per country
- apply this coefficient per country to my data to infer a more accurate number of people that travel
- analyse the coefficients to see how we are detecting trips


-------------------------------------------------------------------

## Removing the weird station trips

In [1]:
import pandas as pd
import geopandas as gpd
import os
import re
import matplotlib.pyplot as plt
import unicodedata
print(os.getcwd())
os.chdir(r"C:\Users\LMENENDEZ\GitHub\MultiModX")
print(os.getcwd())
pd.set_option('display.max_columns', None)

c:\Users\LMENENDEZ\GitHub\MultiModX\notebooks\CS11
C:\Users\LMENENDEZ\GitHub\MultiModX


In [2]:
# Trips during the week 22/09/2022 28/09/2022 (thursday to thursday)
# the day of study selected was Friday to put the air layer under pressure
all_trips = pd.read_csv(
    r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP4 Performance Assessment Solution\Demand data\Matrices MITMA\with_archetypes\20220922_28_trip_matrix_arc_pt_processed.csv.gz",
    compression="gzip",
    sep="|"
)

In [3]:
all_trips=all_trips.rename(columns={"origin_nut": "origin", "destination_nut": "destination"})

In [4]:
%load_ext autoreload


In [5]:
%autoreload
from script.trips_format import *

In [6]:
#associates each airport to the corresponding new NUTS
airports_to_NUTS={"airport_LPA":("ES705","Gran Canaria"),
                 "airport_FUE":("ES704","Fuerteventura"),
                 "airport_ACE":("ES708","Lanzarote"),
                 "airport_TFS":("ES709","Tenerife"),
                 "airport_TFN":("ES709","Tenerife"),
                 "airport_GMZ":("ES709","Tenerife"),
                 "airport_SPC":("ES707","La Palma"),
                 "airport_VDE":("ES703","El Hierro"),
                 "airport_PMI":("ES532","Mallorca"),
                 "airport_IBZ":("ES531","Eivissa i Formentera"),
                 "airport_MAH":("ES533","Menorca")}

In [7]:
all_trips=format_trips(all_trips,airports_to_NUTS)

17 columns were removed


In [8]:
# remove cercanías
all_trips=all_trips[~(((all_trips["origin"]=="ES424")&(all_trips["destination"]=="ES300"))|((all_trips["origin"]=="ES300")&(all_trips["destination"]=="ES424")))]

In [9]:
# location of "ALL" train stops given by UiC
# However this list is still incomplete
stops_loc=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain+abroad\v=0.1\infrastructure\rail_info\stops.txt").astype(str) # everything is a string here to match other formatting
stops_loc["stop_id"] = stops_loc["stop_id"].apply(lambda x: "00" + x) #to make they start with 00

In [10]:
all_trips.loc[:,"weird_stations"] = all_trips["node_sequence_reduced"].apply(
    lambda x: find_weird_stations(x, stops_loc))

In [11]:
unique_weird_stations=get_weird_stations(all_trips["weird_stations"])

In [12]:
len(unique_weird_stations)

495

In [13]:
MobA_stations_coord=gpd.read_file(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain\v=0.7\oferta_transporte\train_stations\train_stations.shp")

In [14]:
# identifies all the un-localisable stations
nowhere_stations=set(unique_weird_stations)-set(MobA_stations_coord["ID"])
print(f"there are {len(nowhere_stations)} stations that are not in the data provided by MobA but appear in the trips dataframe")

there are 255 stations that are not in the data provided by MobA but appear in the trips dataframe


In [15]:
all_trips = all_trips[~all_trips["node_sequence_reduced"].apply(lambda x: any(station in x for station in nowhere_stations))]

---------------------------------------------------------------------

## Analysing the international trips

In [16]:
international_codes=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain+abroad\v=0.1\infrastructure\countries mcc\mcc_to_nationality.txt", sep="|")


In [17]:
mcc_to_country=international_codes.set_index("mcc")["country"].to_dict()

In [18]:
all_trips_abroad=all_trips[(all_trips["origin"]=="abroad")|(all_trips["destination"]=="abroad")]
all_trips_national=all_trips[~((all_trips["origin"]=="abroad")|(all_trips["destination"]=="abroad"))]

In [19]:
all_trips_abroad=format_trips_abroad(all_trips_abroad,mcc_to_country)

In [20]:
all_trips_abroad.head()

Unnamed: 0,date,origin,origin_name,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,legs,nationality,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,trips
0,20220922,AD,AD,ES511,Barcelona,airport_BCN,,NF,NF,P21*abroad_213*0816904*None*airport_BCN*00-01*...,AD,0.605341,0.453429,0.430342,0.02124,0.046174,0.030475,4,plane-road-train-road,airport_BCN-train_71706-train_71802,airport_BCN,train_71802,international_O,2,1,1,airport_BCN,airport_BCN,airport_BCN,"['air', 'rail']",1.587
1,20220922,AD,AD,ES511,Barcelona,airport_BCN,,NF,NF,P22*abroad_213*0816904*None*airport_BCN*00-01*...,AD,0.605341,0.453429,0.430342,0.02124,0.046174,0.030475,2,plane-road,airport_BCN,airport_BCN,airport_BCN,international_O,1,0,1,airport_BCN,airport_BCN,airport_BCN,['air'],1.587
2,20220922,AE,AE,ES243,Zaragoza,airport_BCN,,NF,NF,P08*abroad_424*0816904*None*airport_BCN*00-01*...,AR,2.441427,0.378842,0.52617,0.0,0.210468,0.042094,4,plane-road-train-road,airport_BCN-train_51003-train_71801,airport_BCN,train_71801,international_O,2,1,1,airport_BCN-train_51003-train_71801,airport_BCN,train_71801,"['air', 'rail']",3.599
3,20220922,AE,AE,ES300,Madrid,airport_MAD,,NF,NF,P10*abroad_424*2807921*None*airport_MAD*00-01*...,AE,2.200508,0.555165,0.660959,0.027204,0.12393,0.031234,2,plane-road,airport_MAD,airport_MAD,airport_MAD,international_O,1,0,1,airport_MAD,airport_MAD,airport_MAD,['air'],3.599
4,20220922,AE,AE,ES300,Madrid,airport_MAD,,NF,NF,P13*abroad_424*2807921*None*airport_MAD*00-01*...,AE,2.200508,0.555165,0.660959,0.027204,0.12393,0.031234,2,plane-road,airport_MAD,airport_MAD,airport_MAD,international_O,1,0,1,airport_MAD,airport_MAD,airport_MAD,['air'],3.599


In [21]:
all_trips_national=format_trips_national(all_trips_national)

In [22]:
all_trips_abroad_incoming=all_trips_abroad[~(all_trips_abroad["origin"].str.startswith("ES"))]

In [23]:
all_trips_abroad_outgoing=all_trips_abroad[~(all_trips_abroad["destination"].str.startswith("ES")).fillna(False)]

  all_trips_abroad_outgoing=all_trips_abroad[~(all_trips_abroad["destination"].str.startswith("ES")).fillna(False)]


In [24]:
all_trips_abroad_incoming[all_trips_abroad_incoming["entry_point"].str.startswith("ground")]

Unnamed: 0,date,origin,origin_name,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,legs,nationality,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,trips


In [25]:
all_trips_abroad_outgoing[all_trips_abroad_outgoing["exit_point"].str.startswith("ground")]

Unnamed: 0,date,origin,origin_name,destination,destination_name,entry_point,exit_point,origin_purpose,destination_purpose,legs,nationality,archetype_0,archetype_1,archetype_2,archetype_3,archetype_4,archetype_5,n_legs,mode_sequence,node_sequence,start_node,end_node,type,road_legs,train_legs,plane_legs,node_sequence_reduced,start_node_reduced,end_node_reduced,mode_tp,trips


The past lines were checking that we have correctly removed the ground trips

In [26]:
all_outgoing_grouped=all_trips_abroad_outgoing.groupby("destination")["trips"].sum().reset_index()

In [27]:
all_incoming_grouped=all_trips_abroad_incoming.groupby("origin")["trips"].sum().reset_index()

In [28]:
all_outgoing_grouped=all_outgoing_grouped.sort_values(by="trips",ascending=False)

In [29]:
all_incoming_grouped=all_incoming_grouped.sort_values(by="trips", ascending=False)

In [30]:
all_incoming_grouped

Unnamed: 0,origin,trips
46,FR,335276.906
106,PT,160897.501
47,GB,85563.751
129,US,50804.423
62,IT,44530.054
...,...,...
75,LK,3.599
125,TT,3.599
121,TD,3.599
128,UG,3.599


In [31]:
all_outgoing_grouped

Unnamed: 0,destination,trips
44,FR,326522.792
100,PT,152885.181
45,GB,96104.179
122,US,34603.196
59,IT,33681.919
...,...,...
22,CF,3.599
10,BD,3.599
65,KH,3.599
71,LK,3.599


In [32]:
all_incoming_grouped["total_trips_month"]=all_incoming_grouped["trips"].apply(lambda x: x*30/7)

In [33]:
all_incoming_grouped

Unnamed: 0,origin,trips,total_trips_month
46,FR,335276.906,1.436901e+06
106,PT,160897.501,6.895607e+05
47,GB,85563.751,3.667018e+05
129,US,50804.423,2.177332e+05
62,IT,44530.054,1.908431e+05
...,...,...,...
75,LK,3.599,1.542429e+01
125,TT,3.599,1.542429e+01
121,TD,3.599,1.542429e+01
128,UG,3.599,1.542429e+01


In [34]:
all_outgoing_grouped["total_trips_month"]=all_outgoing_grouped["trips"].apply(lambda x: x*30/7)

In [35]:
all_outgoing_grouped

Unnamed: 0,destination,trips,total_trips_month
44,FR,326522.792,1.399383e+06
100,PT,152885.181,6.552222e+05
45,GB,96104.179,4.118751e+05
122,US,34603.196,1.482994e+05
59,IT,33681.919,1.443511e+05
...,...,...,...
22,CF,3.599,1.542429e+01
10,BD,3.599,1.542429e+01
65,KH,3.599,1.542429e+01
71,LK,3.599,1.542429e+01


In [36]:
aena_incoming_data=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain+abroad\v=0.1\demand\aena_incoming_demand.csv", encoding="utf-16")

In [37]:
aena_outgoing_data=pd.read_csv(r"G:\Unidades compartidas\04_PROYECTOS I+D+i\2023 MultiModX\iii) Project\WP3 Scenario definition\Case study input data\Spain+abroad\v=0.1\demand\aena_outgoing_demand.csv", encoding="utf-16")

-------------------------------------------------------------------------------------------

## Get a list of all country mcc codes in Spanish

In [38]:
url= "https://es.wikipedia.org/wiki/ISO_3166-1_alfa-2"

In [39]:
tables = pd.read_html(url)

In [40]:
country_codes_spanish=tables[2]

In [41]:
country_codes_spanish.head()

Unnamed: 0,Código,Nombre del país,Año,ccTLD,ISO 3166-2,Notas
0,AD,Andorra,1974,.ad,ISO 3166-2:AD,
1,AE,Emiratos Árabes Unidos (los),1974,.ae,ISO 3166-2:AE,
2,AF,Afganistán,1974,.af,ISO 3166-2:AF,
3,AG,Antigua y Barbuda,1974,.ag,ISO 3166-2:AG,
4,AI,Anguila,1985,.ai,ISO 3166-2:AI,AI antes representaba al Territorio Francés de...


In [42]:
country_codes_spanish=country_codes_spanish.drop(["Año","ccTLD","ISO 3166-2","Notas"],axis=1)

In [43]:
def normalize_text(text):
    # Normalize the text to decompose accented characters
    text = unicodedata.normalize('NFKD', text)
    # Replace specific characters (e.g., ñ) before encoding
    text = text.replace('ñ', 'N').replace('Ñ', 'N')
    # Encode to ASCII, ignore non-ASCII characters, and decode back to string
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Convert to uppercase
    return text.upper()

In [44]:
country_codes_spanish["Nombre del país"]=country_codes_spanish["Nombre del país"].apply(normalize_text)

In [45]:
country_codes_spanish

Unnamed: 0,Código,Nombre del país
0,AD,ANDORRA
1,AE,EMIRATOS ARABES UNIDOS (LOS)
2,AF,AFGANISTAN
3,AG,ANTIGUA Y BARBUDA
4,AI,ANGUILA
...,...,...
244,YE,YEMEN
245,YT,MAYOTTE
246,ZA,SUDAFRICA
247,ZM,ZAMBIA


In [46]:
def remove_parentheses(text):
    # Use regex to remove all text within parentheses
    return text.replace(r"\(.*?\)", "", regex=True).strip()

In [47]:
country_codes_spanish["Nombre del país"] = country_codes_spanish["Nombre del país"].str.replace(r"\(.*?\)", "", regex=True).str.strip()

In [48]:
country_codes_spanish

Unnamed: 0,Código,Nombre del país
0,AD,ANDORRA
1,AE,EMIRATOS ARABES UNIDOS
2,AF,AFGANISTAN
3,AG,ANTIGUA Y BARBUDA
4,AI,ANGUILA
...,...,...
244,YE,YEMEN
245,YT,MAYOTTE
246,ZA,SUDAFRICA
247,ZM,ZAMBIA


In [49]:
country_spanish_to_mcc=country_codes_spanish.set_index("Nombre del país")["Código"].to_dict()

In [50]:
# I added by hand a few necessary values
country_spanish_to_mcc["ESPAÑA"]="ES"
country_spanish_to_mcc["REINO UNIDO"]="GB"
country_spanish_to_mcc["ESTADOS UNIDOS"]="US"
country_spanish_to_mcc["HOLANDA"]="NL"

In [51]:
country_spanish_to_mcc["PAISES BAJOS"]

'NL'

In [52]:
aena_incoming_data["country_code"]=aena_incoming_data["País"].map(country_spanish_to_mcc)
aena_outgoing_data["country_code"]=aena_outgoing_data["País"].map(country_spanish_to_mcc)

In [53]:
aena_incoming_data[aena_incoming_data.isna().any(axis=1)]

Unnamed: 0,País,Pasajeros Totales,country_code
22,REPUBLICA CHECA,53.798,
27,REPUBLICA DOMINICANA,42.748,
31,QATAR,33.303,
47,REPUBLICA DE SERBIA,9.231,
54,REPUBLICA DE COREA,5.208,
72,REPUBLICA DE MONTENEGRO,1.606,
74,BAHRAIN,1.491,
76,ISLAS MAURICIO,1.259,
78,FAROE ISLANDS,697.0,
90,KAZAKSTAN,7.0,


I will manually add the missing country codes:

In [54]:
aena_incoming_data.loc[aena_incoming_data["País"]=="REPUBLICA CHECA","country_code"]="CZ"
aena_incoming_data.loc[aena_incoming_data["País"]=="REPUBLICA DOMINICANA","country_code"]="DO"
aena_incoming_data.loc[aena_incoming_data["País"]=="QATAR","country_code"]="QA"
aena_incoming_data.loc[aena_incoming_data["País"]=="REPUBLICA DE SERBIA","country_code"]="RS"
aena_incoming_data.loc[aena_incoming_data["País"]=="REPUBLICA DE COREA","country_code"]="KR"
aena_incoming_data.loc[aena_incoming_data["País"]=="REPUBLICA DE MONTENEGRO","country_code"]="ME"
aena_incoming_data.loc[aena_incoming_data["País"]=="BAHRAIN","country_code"]="BH"
aena_incoming_data.loc[aena_incoming_data["País"]=="ISLAS MAURICIO","country_code"]="MU"
aena_incoming_data.loc[aena_incoming_data["País"]=="FAROE ISLANDS","country_code"]="FO"
aena_incoming_data.loc[aena_incoming_data["País"]=="KAZAKSTAN","country_code"]="KZ"
aena_incoming_data.loc[aena_incoming_data["País"]=="IRAQ","country_code"]="IQ"
aena_incoming_data.loc[aena_incoming_data["País"]=="COSTA DE MARFIL","country_code"]="CI"


In [55]:
aena_incoming_data=aena_incoming_data.dropna()

In [56]:
aena_outgoing_data[aena_outgoing_data.isna().any(axis=1)]

Unnamed: 0,País,Pasajeros Totales,country_code
21,REPUBLICA CHECA,53.504,
29,QATAR,31.61,
32,REPUBLICA DOMINICANA,29.351,
47,REPUBLICA DE SERBIA,8.462,
58,REPUBLICA DE COREA,4.306,
71,REPUBLICA DE MONTENEGRO,1.468,
76,ISLAS MAURICIO,664.0,
77,BAHRAIN,563.0,
79,FAROE ISLANDS,434.0,
82,KAZAKSTAN,22.0,


In [57]:
aena_outgoing_data.loc[aena_outgoing_data["País"]=="REPUBLICA CHECA","country_code"]="CZ"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="REPUBLICA DOMINICANA","country_code"]="DO"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="QATAR","country_code"]="QA"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="REPUBLICA DE SERBIA","country_code"]="RS"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="REPUBLICA DE COREA","country_code"]="KR"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="REPUBLICA DE MONTENEGRO","country_code"]="ME"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="BAHRAIN","country_code"]="BH"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="ISLAS MAURICIO","country_code"]="MU"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="FAROE ISLANDS","country_code"]="FO"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="KAZAKSTAN","country_code"]="KZ"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="GUINEA BISSAU","country_code"]="GW"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="COSTA DE MARFIL","country_code"]="CI"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="BERMUDAS","country_code"]="BM"
aena_outgoing_data.loc[aena_outgoing_data["País"]=="BOTSWANA","country_code"]="BW"


In [58]:
aena_outgoing_data=aena_outgoing_data.dropna()

---------------------------------------------------------------------------------------------------------------

## Merge the aena data with the obtained data and calculate the coefficients

In [79]:
incoming_trips_comparison=pd.merge(all_incoming_grouped,aena_incoming_data,left_on="origin",right_on="country_code",how="left")

In [80]:
incoming_trips_comparison.head(15)

Unnamed: 0,origin,trips,total_trips_month,País,Pasajeros Totales,country_code
0,FR,335276.906,1436901.0,FRANCIA,638.919,FR
1,PT,160897.501,689560.7,PORTUGAL,282.016,PT
2,GB,85563.751,366701.8,REINO UNIDO,2.050.000,GB
3,US,50804.423,217733.2,ESTADOS UNIDOS,217.239,US
4,IT,44530.054,190843.1,ITALIA,740.767,IT
5,DE,36631.458,156992.0,ALEMANIA,1.311.725,DE
6,BE,24969.161,107010.7,BELGICA,287.256,BE
7,IE,22835.126,97864.83,IRLANDA,243.492,IE
8,NL,19954.94,85521.17,HOLANDA,418.007,NL
9,SE,14631.437,62706.16,SUECIA,122.735,SE


In [81]:
outgoing_trips_comparison=pd.merge(all_outgoing_grouped,aena_outgoing_data,left_on="destination",right_on="country_code",how="left")

In [82]:
outgoing_trips_comparison.head(15)

Unnamed: 0,destination,trips,total_trips_month,País,Pasajeros Totales,country_code
0,FR,326522.792,1399383.0,FRANCIA,652.108,FR
1,PT,152885.181,655222.2,PORTUGAL,276.524,PT
2,GB,96104.179,411875.1,REINO UNIDO,2.154.546,GB
3,US,34603.196,148299.4,ESTADOS UNIDOS,207.645,US
4,IT,33681.919,144351.1,ITALIA,763.137,IT
5,DE,27025.216,115822.4,ALEMANIA,1.302.409,DE
6,IE,24383.725,104501.7,IRLANDA,249.830,IE
7,BE,22226.452,95256.22,BELGICA,281.557,BE
8,NL,17716.143,75926.33,HOLANDA,410.734,NL
9,SE,14157.484,60674.93,SUECIA,109.544,SE


In [83]:
incoming_trips_comparison=incoming_trips_comparison.drop(["country_code"],axis=1)
outgoing_trips_comparison=outgoing_trips_comparison.drop(["country_code"],axis=1)

In [84]:
incoming_trips_comparison=incoming_trips_comparison.rename(columns={"trips":"trips_registered_weekly","total_trips_month":"trips_predicted_month","País":"country_name_es","Pasajeros Totales":"trips_real"})
outgoing_trips_comparison=outgoing_trips_comparison.rename(columns={"trips":"trips_registered_weekly","total_trips_month":"trips_predicted_month","País":"country_name_es","Pasajeros Totales":"trips_real"})

In [85]:
incoming_trips_comparison=incoming_trips_comparison[["origin","country_name_es","trips_registered_weekly","trips_predicted_month","trips_real"]]
outgoing_trips_comparison=outgoing_trips_comparison[["destination","country_name_es","trips_registered_weekly","trips_predicted_month","trips_real"]]

In [86]:
incoming_trips_comparison["trips_real"]=incoming_trips_comparison["trips_real"].str.replace(".","",regex=False)
incoming_trips_comparison["trips_real"]=pd.to_numeric(incoming_trips_comparison["trips_real"],errors="coerce")

In [87]:
incoming_lost_trips=incoming_trips_comparison[incoming_trips_comparison["trips_real"].isna()]["trips_predicted_month"].sum()
incoming_trips=incoming_trips_comparison["trips_predicted_month"].sum()
print(f"We are predicting {incoming_trips:.2f} incoming trips monthly and we lose {incoming_lost_trips:.2f} trips.")
print(f"This represents a {incoming_lost_trips/incoming_trips*100:.2f}% of the total incoming trips")

We are predicting 4020093.40 incoming trips monthly and we lose 92953.79 trips.
This represents a 2.31% of the total incoming trips


In [88]:
outgoing_lost_trips=outgoing_trips_comparison[outgoing_trips_comparison["trips_real"].isna()]["trips_predicted_month"].sum()
outgoing_trips=incoming_trips_comparison["trips_predicted_month"].sum()
print(f"We are predicting {outgoing_trips:.2f} incoming trips monthly and we lose {outgoing_lost_trips:.2f} trips.")
print(f"This represents a {outgoing_lost_trips/outgoing_trips*100:.2f}% of the total incoming trips")

We are predicting 4020093.40 incoming trips monthly and we lose 73065.31 trips.
This represents a 1.82% of the total incoming trips


In [89]:
incoming_trips_comparison=incoming_trips_comparison.dropna()

In [96]:
outgoing_trips_comparison["trips_real"]=outgoing_trips_comparison["trips_real"].str.replace(".","",regex=False)
outgoing_trips_comparison["trips_real"]=pd.to_numeric(outgoing_trips_comparison["trips_real"],errors="coerce")

In [97]:
outgoing_trips_comparison

Unnamed: 0,destination,country_name_es,trips_registered_weekly,trips_predicted_month,trips_real
0,FR,FRANCIA,326522.792,1.399383e+06,652108
1,PT,PORTUGAL,152885.181,6.552222e+05,276524
2,GB,REINO UNIDO,96104.179,4.118751e+05,2154546
3,US,ESTADOS UNIDOS,34603.196,1.482994e+05,207645
4,IT,ITALIA,33681.919,1.443511e+05,763137
...,...,...,...,...,...
117,BF,BURKINA FASO,7.198,3.084857e+01,0
118,BO,BOLIVIA,7.198,3.084857e+01,8178
119,AZ,AZERBAIYAN,7.198,3.084857e+01,0
122,GQ,GUINEA ECUATORIAL,3.599,1.542429e+01,1070


In [98]:
outgoing_trips_comparison=outgoing_trips_comparison.dropna()

In [99]:
incoming_trips_comparison["real_vs_predicted_coeff"]=incoming_trips_comparison["trips_real"]/incoming_trips_comparison["trips_predicted_month"]

In [100]:
incoming_trips_comparison.head(15)

Unnamed: 0,origin,country_name_es,trips_registered_weekly,trips_predicted_month,trips_real,real_vs_predicted_coeff
0,FR,FRANCIA,335276.906,1436901.0,638919.0,0.444651
1,PT,PORTUGAL,160897.501,689560.7,282016.0,0.408979
2,GB,REINO UNIDO,85563.751,366701.8,2050000.0,5.590374
3,US,ESTADOS UNIDOS,50804.423,217733.2,217239.0,0.99773
4,IT,ITALIA,44530.054,190843.1,740767.0,3.88155
5,DE,ALEMANIA,36631.458,156992.0,1311725.0,8.355364
6,BE,BELGICA,24969.161,107010.7,287256.0,2.684367
7,IE,IRLANDA,22835.126,97864.83,243492.0,2.488044
8,NL,HOLANDA,19954.94,85521.17,418007.0,4.88776
9,SE,SUECIA,14631.437,62706.16,122735.0,1.957304


In [101]:
outgoing_trips_comparison["real_vs_predicted_coeff"]=outgoing_trips_comparison["trips_real"]/outgoing_trips_comparison["trips_predicted_month"]

In [102]:
incoming_trips_comparison=incoming_trips_comparison.sort_values(["trips_real"],ascending=False)

In [103]:
outgoing_trips_comparison=outgoing_trips_comparison.sort_values(["trips_real"],ascending=False)

In [104]:
outgoing_trips_comparison[["country_name_es","real_vs_predicted_coeff"]].head(15)

Unnamed: 0,country_name_es,real_vs_predicted_coeff
2,REINO UNIDO,5.231067
5,ALEMANIA,11.244885
4,ITALIA,5.286673
0,FRANCIA,0.465997
8,HOLANDA,5.409639
13,SUIZA,9.739776
7,BELGICA,2.955786
1,PORTUGAL,0.422031
6,IRLANDA,2.390679
3,ESTADOS UNIDOS,1.400174


In [105]:
incoming_trips_comparison[["country_name_es","real_vs_predicted_coeff"]].head(15)

Unnamed: 0,country_name_es,real_vs_predicted_coeff
2,REINO UNIDO,5.590374
5,ALEMANIA,8.355364
4,ITALIA,3.88155
0,FRANCIA,0.444651
8,HOLANDA,4.88776
13,SUIZA,8.230474
6,BELGICA,2.684367
1,PORTUGAL,0.408979
7,IRLANDA,2.488044
3,ESTADOS UNIDOS,0.99773


In [106]:
incoming_trips_comparison_small=incoming_trips_comparison.head(15)

In [107]:
outgoing_trips_comparison_small=outgoing_trips_comparison.head(15)

In [78]:
outgoing_trips_comparison.columns

Index(['destination', 'country_name_es', 'trips_registered_weekly',
       'trips_predicted_month', 'trips_real', 'real_vs_predicted_coeff'],
      dtype='object')

Some conclusion of the analysis:

The coefficient is very different from country to country.
There is a clear difference between france and portugal and the rest of the countries. For france and portugal we are over-estimating the demand (two to one), whereas for the rest of the countries we are under-estimating the demand. This difference could be due to the fact that the data coming from France and Portugal is estimated in another way.

--------------------------------------------------------------------------------------------------

## Export results as a csv

In [77]:
#outgoing_trips_comparison_small[["destination","country_name_es","real_vs_predicted_coeff"]].to_csv("outgoing_trips_coefficients.csv",index=False)

In [78]:
# incoming_trips_comparison_small[["origin","country_name_es","real_vs_predicted_coeff"]].to_csv("incoming_trips_coefficients.csv",index=False)

In [108]:
outgoing_trips_comparison[["destination","country_name_es","real_vs_predicted_coeff"]].to_csv("outgoing_trips_coefficients_all.csv",index=False)

In [109]:
incoming_trips_comparison[["origin","country_name_es","real_vs_predicted_coeff"]].to_csv("incoming_trips_coefficients_all.csv",index=False)