# Preprocessing part 2: preparing the arrays
In this notebook we take two datasets prepared in Spark, stop_times and transfers, and convert them into the array format needed to run RAPTOR.

## Outline
In this notebook the following actions are performed:
- Transform stop_ids with platform information into the parent station stop_id
- Keep only trips whose departures all fall within the daytime window (departure hour from 7 to 19)
- Delete trips which only have 1 stop
- Create integer IDs for routes, trips and stops in stop_times, following the definitions used by the RAPTOR algorithm
- Add integer IDs to transfers and keep only stops that are inside the stop_times dataset

## Import packages

In [2]:
import pandas as pd
import numpy as np
import pickle
import itertools

## Read files
Before running, make sure the .csv files are in /data. If they are not, run the notebook "transfer_to_local" first.

In [4]:
#stop_times
stop_times_curated = pd.read_csv("../data/stop_times_final_cyril.csv")
stop_times_curated.head(5)

Unnamed: 0.1,Unnamed: 0,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc
0,0,26-66-j19-1,8591205,17.TA.26-66-j19-1.1.H,8591205,17:00:00,17:00:00,3,"Zürich, Hürlimannplatz",47.365066,8.526539,"Zürich, Neubühl",3870,0,16:55:00,1225,12,1317,Bus
1,1,26-66-j19-1,8591415,17.TA.26-66-j19-1.1.H,8591415,17:02:00,17:02:00,4,"Zürich, Waffenplatzstrasse",47.361482,8.525749,"Zürich, Neubühl",3870,0,16:55:00,1225,12,1267,Bus
2,2,26-66-j19-1,8591204,17.TA.26-66-j19-1.1.H,8591204,17:03:00,17:03:00,5,"Zürich, Hügelstrasse",47.358543,8.526997,"Zürich, Neubühl",3870,0,16:55:00,1225,12,67,Bus
3,3,26-66-j19-1,8591098,17.TA.26-66-j19-1.1.H,8591098,17:04:00,17:04:00,6,"Zürich, Brunau/Mutschellenstr.",47.355147,8.527141,"Zürich, Neubühl",3870,0,16:55:00,1225,12,512,Bus
4,4,26-66-j19-1,8591392,17.TA.26-66-j19-1.1.H,8591392,17:05:00,17:05:00,7,"Zürich, Thujastrasse",47.350187,8.527806,"Zürich, Neubühl",3870,0,16:55:00,1225,12,403,Bus


In [59]:
#stop_times
stop_times_curated = pd.read_csv("../data/stop_times_curated.csv")
stop_times_curated.head(5)

Unnamed: 0.1,Unnamed: 0,route_id,stop_id,trip_id,arrival_time,departure_time,stop_sequence,direction_id,stop_name,route_desc
0,0,26-759-j19-1,8573205:0:K,1330.TA.26-759-j19-1.7.R,05:28:00,05:28:00,1,1,"Zürich Flughafen, Bahnhof",Bus
1,1,26-67-j19-1,8591341,46.TA.26-67-j19-1.1.R,05:33:00,05:33:00,1,1,"Zürich, Schmiede Wiedikon",Bus
2,2,26-325-j19-1,8587020:0:D,265.TA.26-325-j19-1.2.H,05:34:00,05:34:00,1,0,"Dietikon, Bahnhof",Bus
3,3,26-11-A-j19-1,8591382,1266.TA.26-11-A-j19-1.21.H,05:37:00,05:37:00,1,0,"Zürich, Sternen Oerlikon",Tram
4,4,26-302-j19-1,8590844,162.TA.26-302-j19-1.4.R,05:49:00,05:49:00,1,1,"Urdorf, Oberurdorf",Bus


We drop columns not useful to us

In [60]:
stop_times_curated = stop_times_curated.drop(columns=["Unnamed: 0"])

In [6]:
#transfers
transfers = pd.read_csv("../data/transfers.csv")
transfers.head(5)

Unnamed: 0.1,Unnamed: 0,stop_id,stop_id2,distance,Transfer_time_sec,stop_name,stop_name2
0,0,8500926,8590616,0.12243,146,"Oetwil a.d.L., Schweizäcker","Geroldswil, Schweizäcker"
1,1,8500926,8590737,0.300175,360,"Oetwil a.d.L., Schweizäcker","Oetwil an der Limmat, Halde"
2,2,8502186,8502186:0:1,0.006762,8,Dietikon Stoffelbach,Dietikon Stoffelbach
3,3,8502186,8502186:0:2,0.013524,16,Dietikon Stoffelbach,Dietikon Stoffelbach
4,4,8502186,8502186P,0.0,0,Dietikon Stoffelbach,Dietikon Stoffelbach


## Create stop_id same for all platforms
In the algorithm we make the simplifying assumption that every change within the same station takes 2 minutes. Thanks to this assumption we can keep only the parent station.
The parent ID is contained in the first 7 characters, so we take that substring to create the parent stop_id.

In [62]:
#copy information stop_id with platform in stop_id_raw
stop_times_curated["stop_id_raw"] = stop_times_curated["stop_id"]

In [63]:
#Use only first 7 characters for stop_id
stop_times_curated["stop_id"] = stop_times_curated["stop_id_raw"].str.slice(0, 7)
stop_times_curated["stop_id"] = pd.to_numeric(stop_times_curated["stop_id"])
stop_times_curated.head(5)

Unnamed: 0,route_id,stop_id,trip_id,arrival_time,departure_time,stop_sequence,direction_id,stop_name,route_desc,stop_id_raw
0,26-759-j19-1,8573205,1330.TA.26-759-j19-1.7.R,05:28:00,05:28:00,1,1,"Zürich Flughafen, Bahnhof",Bus,8573205:0:K
1,26-67-j19-1,8591341,46.TA.26-67-j19-1.1.R,05:33:00,05:33:00,1,1,"Zürich, Schmiede Wiedikon",Bus,8591341
2,26-325-j19-1,8587020,265.TA.26-325-j19-1.2.H,05:34:00,05:34:00,1,0,"Dietikon, Bahnhof",Bus,8587020:0:D
3,26-11-A-j19-1,8591382,1266.TA.26-11-A-j19-1.21.H,05:37:00,05:37:00,1,0,"Zürich, Sternen Oerlikon",Tram,8591382
4,26-302-j19-1,8590844,162.TA.26-302-j19-1.4.R,05:49:00,05:49:00,1,1,"Urdorf, Oberurdorf",Bus,8590844
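
As a minimal sketch of the parent-ID step on toy values, the slice-and-convert can be wrapped in a small helper. The name `to_parent_stop_id` is hypothetical and not part of the notebook. The sketch assumes, as stated above, that every ID starts with the 7-character parent code, and it casts to string first so that it also tolerates columns pandas has already parsed as integers.

```python
import pandas as pd

def to_parent_stop_id(stop_ids: pd.Series) -> pd.Series:
    """Keep the first 7 characters of each stop_id and return them as integers."""
    # astype(str) makes the slice safe even if some values are already integers
    return pd.to_numeric(stop_ids.astype(str).str.slice(0, 7))

# Toy IDs in the shapes seen in the previews above (platform suffix, "P" suffix, plain)
raw_ids = pd.Series(["8573205:0:K", "8591341", "8502186P", 8587020])
parent_ids = to_parent_stop_id(raw_ids)
print(parent_ids.tolist())
```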


In [64]:
#copy information stop_id with platform in stop_id_raw
transfers["stop_id_raw"] = transfers["stop_id"]
transfers["stop_id2_raw"] = transfers["stop_id2"]

We apply the same operation to the transfers dataset.

In [65]:
#Use only first 7 characters for stop_id
transfers["stop_id"] = transfers["stop_id_raw"].str.slice(0, 7)
transfers["stop_id2"] = transfers["stop_id2_raw"].str.slice(0, 7)
transfers["stop_id"] = pd.to_numeric(transfers["stop_id"])
transfers["stop_id2"] = pd.to_numeric(transfers["stop_id2"])
transfers.head(5)

Unnamed: 0.1,Unnamed: 0,stop_id,stop_id2,distance,Transfer_time_sec,stop_name,stop_name2,stop_id_raw,stop_id2_raw
0,0,8500926,8590616,0.12243,146,"Oetwil a.d.L., Schweizäcker","Geroldswil, Schweizäcker",8500926,8590616
1,1,8500926,8590737,0.300175,360,"Oetwil a.d.L., Schweizäcker","Oetwil an der Limmat, Halde",8500926,8590737
2,2,8502186,8502186,0.006762,8,Dietikon Stoffelbach,Dietikon Stoffelbach,8502186,8502186:0:1
3,3,8502186,8502186,0.013524,16,Dietikon Stoffelbach,Dietikon Stoffelbach,8502186,8502186:0:2
4,4,8502186,8502186,0.0,0,Dietikon Stoffelbach,Dietikon Stoffelbach,8502186,8502186P


## Keep only trips during the day
Our model only considers trips on business days during normal hours, so we can delete all trips with departures before 7 am or after 7 pm.

We can get the hour of departure using str.slice and explore the hours present in the dataset. We then convert these hours to integers so that we can filter on them.

In [66]:
stop_times_curated.departure_time.str.slice(0,2).unique()

array(['05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15',
       '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '00',
       '04', '01'], dtype=object)

In [67]:
stop_times_curated["hour_departure"] = pd.to_numeric(stop_times_curated.departure_time.str.slice(0,2))

We check that the hours were converted to integers correctly.

In [68]:
stop_times_curated["hour_departure"].unique()

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25,  0,  4,  1])

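We drop the trips with a departure before 7 am or after 7 pm, using np.where to collect their trip_ids. Because the filter compares the departure hour with 7 and 19, departures up to 19:59 are kept.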

In [69]:
stop_times_curated.trip_id.count()

331751

In [70]:
trip_id_drop = np.where(((stop_times_curated.hour_departure > 19) |\
                                                (stop_times_curated.hour_departure < 7)),\
                                               stop_times_curated["trip_id"] , None)

In [71]:
stop_times_curated = stop_times_curated[~stop_times_curated["trip_id"].isin(trip_id_drop)]

In [72]:
stop_times_curated.trip_id.count()

246576

With this operation we have reduced stop_times by about 85k rows (from 331751 to 246576).
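
The same filter can be sketched with a plain boolean mask instead of np.where. This is an illustrative rewrite on toy data, not the notebook's own cell; the helper name `keep_daytime_trips` and its parameters are hypothetical. A trip is removed as soon as one of its departures falls outside the window.

```python
import pandas as pd

def keep_daytime_trips(df, first_hour=7, last_hour=19):
    """Drop every trip that has at least one departure hour outside [first_hour, last_hour]."""
    hours = pd.to_numeric(df["departure_time"].str.slice(0, 2))
    outside = (hours < first_hour) | (hours > last_hour)
    trips_to_drop = df.loc[outside, "trip_id"].unique()
    return df[~df["trip_id"].isin(trips_to_drop)]

toy = pd.DataFrame({
    "trip_id": ["a", "a", "b", "b", "c"],
    "departure_time": ["06:55:00", "07:05:00", "12:00:00", "12:10:00", "20:15:00"],
})
daytime = keep_daytime_trips(toy)
# Trip "a" starts too early and trip "c" runs too late, so only "b" remains
print(daytime.trip_id.unique())
```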

## Delete trips with 1 stop
Trips with only 1 stop are useless in our dataset and would only pollute the algorithm. For this reason we delete them.

We start by counting the stops of each trip

In [73]:
number_stop = stop_times_curated.groupby('trip_id').nunique()
number_stop.head(5)

Unnamed: 0_level_0,route_id,stop_id,trip_id,arrival_time,departure_time,stop_sequence,direction_id,stop_name,route_desc,stop_id_raw,hour_departure
trip_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1.TA.1-231-j19-1.1.H,1,15,1,17,17,18,1,15,1,15,2
1.TA.1-44-j19-1.1.R,1,3,1,3,3,3,1,3,1,3,1
1.TA.1-444-j19-1.1.H,1,9,1,9,9,9,1,9,1,9,1
1.TA.12-E03-j19-1.1.H,1,2,1,2,2,2,1,2,1,2,2
1.TA.18-46-j19-1.1.H,1,1,1,1,1,1,1,1,1,1,1


In [74]:
#get trips with 1 stop
trip_with_1_stop = np.where((number_stop.stop_id == 1), number_stop.index, None)

Check number of trips before cleaning

In [75]:
stop_times_curated.trip_id.nunique()

20261

We drop the rows of trips that have a single stop

In [76]:
#drop trips with only 1 stop
stop_times_curated = stop_times_curated[~stop_times_curated["trip_id"].isin(trip_with_1_stop)]

We then check how many trips remain. 871 trips with only 1 stop have been deleted

In [77]:
stop_times_curated.trip_id.nunique()

19390
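
A compact alternative, sketched here on toy data under the assumption that only the number of distinct stop_id values per trip matters, is to broadcast that count back to every row with `transform` and filter directly:

```python
import pandas as pd

toy = pd.DataFrame({
    "trip_id": ["a", "a", "b", "c", "c", "c"],
    "stop_id": [1, 2, 5, 7, 8, 9],
})
# Number of distinct stops of the trip, repeated on each of its rows
stops_per_trip = toy.groupby("trip_id")["stop_id"].transform("nunique")
multi_stop = toy[stops_per_trip > 1]
# Trip "b" has a single stop and is removed
print(multi_stop.trip_id.unique())
```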

### Create route_int, trip_int and stop_int as consecutive integer IDs
This operation is needed to sort the routes, trips and stops in the right order. Additionally, integers are lighter than strings, so the algorithm needs less memory to work with the arrays.

#### Route_int
The route_int ID is assigned in an arbitrary order.

We start by creating a tuple with all the stops of each trip.

In [78]:
stop_times_curated = stop_times_curated.sort_values(["trip_id", "stop_sequence"])

In [79]:
#group stops into a sequence
tuple_stops = stop_times_curated.groupby('trip_id')['stop_id'].apply(tuple).to_frame()
tuple_stops.head(5)

Unnamed: 0_level_0,stop_id
trip_id,Unnamed: 1_level_1
1.TA.1-231-j19-1.1.H,"(8572747, 8582462, 8572600, 8572601, 8502553, ..."
1.TA.1-44-j19-1.1.R,"(8590275, 8591891, 8590279)"
1.TA.1-444-j19-1.1.H,"(8572747, 8580847, 8581346, 8502894, 8502979, ..."
1.TA.12-E03-j19-1.1.H,"(8573205, 8596126)"
1.TA.21-23-j19-1.1.R,"(8503000, 8503003)"


In [80]:
tuple_stops.index.nunique()

19390

And we can group all these sequences in unique groups

In [81]:
#group to get unique stop sequences
unique_stop_sequence = tuple_stops.groupby("stop_id").count()
unique_stop_sequence.head(5)

"(8502208, 8502209, 8503201, 8503010, 8503011, 8503000, 8503006, 8503016)"
"(8502208, 8502209, 8503201, 8503200, 8503010, 8503011, 8503016)"
"(8502208, 8502209, 8503202)"
"(8502208, 8502209, 8503202, 8503009, 8503010, 8503011, 8503000, 8503006, 8503016, 8503307)"
"(8502208, 8502209, 8503202, 8503200, 8503009, 8503000, 8503015, 8503016, 8503307, 8503305)"


In [82]:
unique_stop_sequence.index.nunique()

2555

These unique sequences of stops are our routes. We can create a unique ID, an integer, for each route

In [83]:
#create dataframe and route_int
df_unique_stop_sequence = unique_stop_sequence.reset_index()
df_unique_stop_sequence["route_int"] = df_unique_stop_sequence.index
df_unique_stop_sequence.head(5)

Unnamed: 0,stop_id,route_int
0,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
1,"(8502208, 8502209, 8503201, 8503200, 8503010, ...",1
2,"(8502208, 8502209, 8503202)",2
3,"(8502208, 8502209, 8503202, 8503009, 8503010, ...",3
4,"(8502208, 8502209, 8503202, 8503200, 8503009, ...",4
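
The idea that a route is a unique ordered sequence of stops can also be sketched with `groupby(...).ngroup()`, which numbers the groups directly. This is an illustration on toy data, not the notebook's own cell; the column name `all_stops` is borrowed from a later rename in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2", "t2", "t3", "t3", "t3"],
    "stop_id": [10, 20, 10, 20, 10, 20, 30],
    "stop_sequence": [1, 2, 1, 2, 1, 2, 3],
})
toy = toy.sort_values(["trip_id", "stop_sequence"])
# One tuple of stops per trip
trip_stops = toy.groupby("trip_id")["stop_id"].apply(tuple).to_frame("all_stops")
# Trips sharing the same tuple get the same integer
trip_stops["route_int"] = trip_stops.groupby("all_stops").ngroup()
print(trip_stops)
```

Here t1 and t2 visit the same stops in the same order and therefore share a route_int, while t3 gets its own.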


We add the route information to the trip

In [84]:
#join with trip information
trip_with_routes = tuple_stops.join(df_unique_stop_sequence.set_index("stop_id"), on="stop_id", how="left").sort_values("route_int")
trip_with_routes.head(5)

Unnamed: 0_level_0,stop_id,route_int
trip_id,Unnamed: 1_level_1,Unnamed: 2_level_1
403.TA.26-24-j19-1.220.R,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
425.TA.26-24-j19-1.220.R,"(8502208, 8502209, 8503201, 8503200, 8503010, ...",1
22.TA.30-57-Y-j19-1.1.H,"(8502208, 8502209, 8503202)",2
11.TA.30-57-Y-j19-1.1.H,"(8502208, 8502209, 8503202)",2
14.TA.30-57-Y-j19-1.1.H,"(8502208, 8502209, 8503202)",2


In [85]:
trip_with_routes = trip_with_routes.rename(columns={"stop_id" : "all_stops"})

We check that no faulty manipulation left us with as many routes as trips, or more. This is not the case.

In [86]:
#check if routes and trips do not have the same number
trip_with_routes.index.nunique()

19390

In [87]:
trip_with_routes.route_int.nunique()

2555

We add the route_int column to the stop_times dataframe

In [88]:
stop_times_curated.trip_id.count()

245705

In [89]:
#join to get route_int in stop_times
stop_times_routes = stop_times_curated.join(trip_with_routes, how="left", on="trip_id" , lsuffix='_left', rsuffix='_right').drop_duplicates()


In [90]:
stop_times_routes.trip_id.count()

245705

In [91]:
stop_times_routes.head(5)

Unnamed: 0,route_id,stop_id,trip_id,arrival_time,departure_time,stop_sequence,direction_id,stop_name,route_desc,stop_id_raw,hour_departure,all_stops,route_int
81914,1-231-j19-1,8572747,1.TA.1-231-j19-1.1.H,09:37:00,09:37:00,1,0,"Bremgarten AG, Bahnhof",Bus,8572747,9,"(8572747, 8582462, 8572600, 8572601, 8502553, ...",618
181281,1-231-j19-1,8582462,1.TA.1-231-j19-1.1.H,09:38:00,09:38:00,3,0,"Bremgarten AG, Zelgli",Bus,8582462,9,"(8572747, 8582462, 8572600, 8572601, 8502553, ...",618
42460,1-231-j19-1,8572600,1.TA.1-231-j19-1.1.H,09:39:00,09:39:00,4,0,"Zufikon, Emaus",Bus,8572600,9,"(8572747, 8582462, 8572600, 8572601, 8502553, ...",618
224454,1-231-j19-1,8572601,1.TA.1-231-j19-1.1.H,09:39:00,09:39:00,5,0,"Zufikon, Algier",Bus,8572601,9,"(8572747, 8582462, 8572600, 8572601, 8502553, ...",618
11836,1-231-j19-1,8502553,1.TA.1-231-j19-1.1.H,09:43:00,09:43:00,6,0,"Unterlunkhofen, Breitenäcker",Bus,8502553,9,"(8572747, 8582462, 8572600, 8572601, 8502553, ...",618


In [92]:
#check if route_int is correct
stop_times_routes.route_int.max()

2554

#### Trip_int
The trip_int number needs to be ordered by route_int and time

In [93]:
#check number trips in stop_times
stop_times_routes.trip_id.nunique()

19390

In [94]:
stop_times_routes.sort_values(["route_int", "arrival_time"]).head(5)

Unnamed: 0,route_id,stop_id,trip_id,arrival_time,departure_time,stop_sequence,direction_id,stop_name,route_desc,stop_id_raw,hour_departure,all_stops,route_int
181290,26-24-j19-1,8502208,403.TA.26-24-j19-1.220.R,10:44:00,10:45:00,3,1,Horgen Oberdorf,S-Bahn,8502208:0:4,10,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
261974,26-24-j19-1,8502209,403.TA.26-24-j19-1.220.R,10:47:00,10:47:00,4,1,Oberrieden Dorf,S-Bahn,8502209:0:1,10,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
130162,26-24-j19-1,8503201,403.TA.26-24-j19-1.220.R,10:53:00,10:53:00,6,1,Rüschlikon,S-Bahn,8503201:0:2,10,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
173670,26-24-j19-1,8503010,403.TA.26-24-j19-1.220.R,11:02:00,11:03:00,9,1,Zürich Enge,S-Bahn,8503010:0:2,11,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0
238129,26-24-j19-1,8503011,403.TA.26-24-j19-1.220.R,11:04:00,11:04:00,10,1,Zürich Wiedikon,S-Bahn,8503011:0:2,11,"(8502208, 8502209, 8503201, 8503010, 8503011, ...",0


Generate sequential trip_int, ordered by route and by time

In [95]:
trip_df = pd.DataFrame(stop_times_routes.sort_values(["route_int", "arrival_time"]).trip_id.unique())
trip_df["trip_int"] = trip_df.index
trip_df["trip_id"] = trip_df.iloc[:,0]
trip_df.head(5)

Unnamed: 0,0,trip_int,trip_id
0,403.TA.26-24-j19-1.220.R,0,403.TA.26-24-j19-1.220.R
1,425.TA.26-24-j19-1.220.R,1,425.TA.26-24-j19-1.220.R
2,4.TA.30-57-Y-j19-1.1.H,2,4.TA.30-57-Y-j19-1.1.H
3,5.TA.30-57-Y-j19-1.1.H,3,5.TA.30-57-Y-j19-1.1.H
4,6.TA.30-57-Y-j19-1.1.H,4,6.TA.30-57-Y-j19-1.1.H


In [96]:
#check number trip_id
trip_df.trip_id.nunique()

19390
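
As a hedged variation on the ordering step, the sketch below ranks trips within each route by their earliest departure rather than by sorting every row on arrival_time. That choice is an assumption made for illustration, and the toy values are invented. Zero-padded HH:MM:SS strings compare chronologically, so no time conversion is needed here.

```python
import pandas as pd

toy = pd.DataFrame({
    "route_int": [0, 0, 0, 0, 1, 1],
    "trip_id": ["late", "late", "early", "early", "only", "only"],
    "departure_time": ["09:00:00", "09:05:00", "08:00:00", "08:05:00", "07:30:00", "07:35:00"],
})
# Earliest departure of each trip, then order by route and by that time
first_departure = (
    toy.groupby(["route_int", "trip_id"], as_index=False)["departure_time"].min()
       .sort_values(["route_int", "departure_time", "trip_id"])
       .reset_index(drop=True)
)
first_departure["trip_int"] = first_departure.index
trip_int_by_id = dict(zip(first_departure["trip_id"], first_departure["trip_int"]))
print(trip_int_by_id)
```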

We join trip_int to the stop_times dataframe

In [None]:
#join to get trip_int in stop_times
stop_times_routes_trip = stop_times_routes.join(trip_df.set_index("trip_id"), how="inner", on="trip_id" , lsuffix='_left', rsuffix='_right').drop_duplicates()


In [None]:
#sort stop_times by route, trip and stop sequence
stop_times_routes_trip = stop_times_routes_trip.sort_values(["route_int", "trip_int", "stop_sequence"])
stop_times_routes_trip.head(5)

In [None]:
#check if manipulations did not destroy trips
stop_times_routes_trip.trip_id.nunique()

#### Stop_int
The stop_int ID needs to be ordered by route, trip and stop sequence

In [None]:
#check number stops at entry
stop_times_routes_trip.stop_id.nunique()

stop_times_routes_trip is already in the right order. We create a dataframe of unique stops to build stop_int

In [None]:
stops_df = pd.DataFrame(stop_times_routes_trip.stop_id.unique())
stops_df["stop_int"] = stops_df.index
stops_df["stop_id"] = stops_df.iloc[:,0]
stops_df.head(5)

In [None]:
#check if number stop_int correct
stops_df.stop_int.nunique()
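
Numbering values in order of first appearance is what `pd.factorize` does, so the stop_int assignment can be sketched in one call. The stop IDs below are toy values standing in for the already ordered stop_id column.

```python
import pandas as pd

ordered_stop_ids = pd.Series([8502208, 8502209, 8503201, 8502208, 8503010, 8502209])
# Codes follow the order of first appearance; unique_stops maps each code back to its stop_id
stop_int, unique_stops = pd.factorize(ordered_stop_ids)
print(stop_int)
print(list(unique_stops))
```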

We add stop_int information to stop_times

In [None]:
#join to get stop_int
stop_times_routes_trip_stop = stop_times_routes_trip.join(stops_df.set_index("stop_id"), how="inner", on="stop_id",  lsuffix='_left', rsuffix='_right').drop_duplicates()


In [None]:
stop_times_routes_trip_stop.head(5)

In [None]:
#check if no stops deleted during manipulation
stop_times_routes_trip_stop.stop_id.nunique()

In [None]:
stop_times_routes_trip_stop.stop_int.max()

In [None]:
#keep only useful columns 
stop_times_int = stop_times_routes_trip_stop[["route_int", "trip_int", "stop_int", "stop_sequence", "arrival_time", "departure_time",\
                                          "route_id", "trip_id", "stop_id", \
                                             "route_desc", "stop_id_raw", "stop_name"]].sort_values(["route_int", "trip_int", "stop_sequence"])

In [None]:
stop_times_int = stop_times_int.reset_index(drop=True)

In [None]:
stop_times_int.loc[100:150].head(5)

An overview of number of routes, trips and stops

In [None]:
stop_times_int.route_int.nunique()

In [None]:
stop_times_int.trip_int.nunique()

In [None]:
stop_times_int.stop_int.nunique()

In [None]:
stop_times_int.stop_int.count()

### Transfer: delete transfer to same stop & get stop_int & stop_int2


In [7]:
#check number stops transfers
transfers.stop_id.count()

12564

In [8]:
transfers.head(5)

Unnamed: 0.1,Unnamed: 0,stop_id,stop_id2,distance,Transfer_time_sec,stop_name,stop_name2
0,0,8500926,8590616,0.12243,146,"Oetwil a.d.L., Schweizäcker","Geroldswil, Schweizäcker"
1,1,8500926,8590737,0.300175,360,"Oetwil a.d.L., Schweizäcker","Oetwil an der Limmat, Halde"
2,2,8502186,8502186:0:1,0.006762,8,Dietikon Stoffelbach,Dietikon Stoffelbach
3,3,8502186,8502186:0:2,0.013524,16,Dietikon Stoffelbach,Dietikon Stoffelbach
4,4,8502186,8502186P,0.0,0,Dietikon Stoffelbach,Dietikon Stoffelbach


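We delete transfers to the same stop. In the run shown here the row count is unchanged (12564 before and after): the preview above still shows raw platform IDs in stop_id2, so the parent-ID conversion had not been applied in this session.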

In [9]:
transfers_df = transfers[transfers['stop_id'] != transfers['stop_id2']]

In [10]:
transfers_df.stop_id.count()

12564

We create the stop_int column in transfers. This action eliminates stops not in stop_times

In [13]:
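#only run this cell when resuming from the saved CSV read in In [4], which already has a stop_int column
#when running top to bottom, skip it: stop_times_curated built above has no stop_int column
stop_times_int = stop_times_curated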

In [None]:
#deduplicate the stop_id -> stop_int mapping first so the merge does not multiply rows
transfers_df = transfers_df.merge(stop_times_int[["stop_id", "stop_int"]].drop_duplicates(), how="inner", on="stop_id").drop_duplicates()

In [None]:
transfers_df.stop_id.count()

In [None]:
transfers_df.head(5)

In [None]:
#create dataframe with stops
df_stop_int2 = stop_times_int[["stop_id", "stop_int"]].drop_duplicates().rename(columns={"stop_id": "stop_id2", "stop_int" : "stop_int_2"})
df_stop_int2.head(5)

We add the integer ID of the arrival stop, stop_int_2

In [None]:
transfers_df_int = transfers_df.merge(df_stop_int2.set_index("stop_id2"), how="inner", on = "stop_id2").drop_duplicates()

In [None]:
transfers_df_int.head(5)

In [None]:
transfers_df_int.stop_id.count()
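
The two inner merges amount to a lookup from stop_id to stop_int on both ends of each transfer. A minimal sketch of that lookup with `Series.map`, on invented toy values, could look like this. Rows whose origin or destination is missing from the lookup are dropped, which mirrors the effect of the inner merges.

```python
import pandas as pd

# Toy stop_id -> stop_int lookup (values are illustrative only)
stop_lookup = pd.Series({8500926: 0, 8590616: 1, 8502186: 2})
toy_transfers = pd.DataFrame({
    "stop_id": [8500926, 8500926, 8502186],
    "stop_id2": [8590616, 8590737, 8500926],
    "Transfer_time_sec": [146, 360, 200],
})
toy_transfers["stop_int"] = toy_transfers["stop_id"].map(stop_lookup)
toy_transfers["stop_int_2"] = toy_transfers["stop_id2"].map(stop_lookup)
# Unknown stops map to NaN; drop those rows and restore integer dtypes
mapped = (
    toy_transfers.dropna(subset=["stop_int", "stop_int_2"])
                 .astype({"stop_int": int, "stop_int_2": int})
)
print(mapped[["stop_int", "stop_int_2", "Transfer_time_sec"]])
```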

In [None]:
transfers = transfers_df_int

In [None]:
#check number unique stops2 in transfers
transfers.stop_id2.nunique()

In [None]:
transfers.stop_id.nunique()

In [5]:
stop_times_ordered = stop_times_curated
stop_times_ordered.head(5)

Unnamed: 0.1,Unnamed: 0,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc
0,0,26-66-j19-1,8591205,17.TA.26-66-j19-1.1.H,8591205,17:00:00,17:00:00,3,"Zürich, Hürlimannplatz",47.365066,8.526539,"Zürich, Neubühl",3870,0,16:55:00,1225,12,1317,Bus
1,1,26-66-j19-1,8591415,17.TA.26-66-j19-1.1.H,8591415,17:02:00,17:02:00,4,"Zürich, Waffenplatzstrasse",47.361482,8.525749,"Zürich, Neubühl",3870,0,16:55:00,1225,12,1267,Bus
2,2,26-66-j19-1,8591204,17.TA.26-66-j19-1.1.H,8591204,17:03:00,17:03:00,5,"Zürich, Hügelstrasse",47.358543,8.526997,"Zürich, Neubühl",3870,0,16:55:00,1225,12,67,Bus
3,3,26-66-j19-1,8591098,17.TA.26-66-j19-1.1.H,8591098,17:04:00,17:04:00,6,"Zürich, Brunau/Mutschellenstr.",47.355147,8.527141,"Zürich, Neubühl",3870,0,16:55:00,1225,12,512,Bus
4,4,26-66-j19-1,8591392,17.TA.26-66-j19-1.1.H,8591392,17:05:00,17:05:00,7,"Zürich, Thujastrasse",47.350187,8.527806,"Zürich, Neubühl",3870,0,16:55:00,1225,12,403,Bus


We start by making sure the order is correct

In [None]:
stop_times_ordered = stop_times_int.sort_values(by=["route_int", "trip_int", "stop_sequence"])
stop_times_ordered.head(5)

In [None]:
stop_times_ordered[["arrival_time", "departure_time"]].head(5)

We set the arrival time of each trip's first stop and the departure time of each trip's last stop to None.

In [None]:
#adding a shift
stop_times_ordered["sequence_shift_1"] = stop_times_ordered["stop_sequence"].shift(-1, fill_value=0)
stop_times_ordered.head(5)

In [None]:
stop_times_ordered['departure_time'] = np.where((stop_times_ordered["stop_sequence"] > stop_times_ordered["sequence_shift_1"]), None, stop_times_ordered['departure_time'])

In [None]:
stop_times_ordered["arrival_time"] = np.where((stop_times_ordered["stop_sequence"] == 1), None, stop_times_ordered['arrival_time'])

In [None]:
stop_times_ordered[["arrival_time","departure_time", "stop_sequence", "sequence_shift_1"]].head(5)
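
The preview of stop_times_routes above shows a trip whose first remaining stop has stop_sequence 3, so a check against stop_sequence == 1 may miss some first stops. As a hedged alternative, the sketch below detects trip boundaries from changes in trip_int instead. It assumes the rows are already sorted by trip and stop sequence; the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "trip_int": [0, 0, 0, 1, 1],
    "stop_sequence": [3, 4, 6, 7, 9],  # sequences need not start at 1
    "arrival_time": ["10:44:00", "10:47:00", "10:53:00", "11:00:00", "11:10:00"],
    "departure_time": ["10:45:00", "10:47:00", "10:53:00", "11:01:00", "11:10:00"],
})
# A row is a first (last) stop when the previous (next) row belongs to another trip
is_first_stop = toy["trip_int"] != toy["trip_int"].shift(1)
is_last_stop = toy["trip_int"] != toy["trip_int"].shift(-1)
toy.loc[is_first_stop, "arrival_time"] = None
toy.loc[is_last_stop, "departure_time"] = None
print(toy)
```

In this toy frame the second trip starts with a higher stop_sequence than the first trip ends with, a case that a comparison of consecutive stop_sequence values would not flag.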

## Array structure preparation

#### StopTimes: 
[[[[departure_route0_trip0_stop0, arrival_route0_trip0_stop0], [departure_route0_trip0_stop1, arrival_route0_trip0_stop1], …], [[departure_route0_trip1_stop0, arrival_route0_trip1_stop0], …], …], [[[departure_route1_trip0_stop0, arrival_route1_trip0_stop0], …], [[departure_route1_trip1_stop0, arrival_route1_trip1_stop0], …], …], …]

We convert the times to datetime, as required by the RAPTOR algorithm

In [None]:
stop_times_ordered['arrival_time'] = pd.to_datetime(stop_times_ordered['arrival_time'])
stop_times_ordered['departure_time'] = pd.to_datetime(stop_times_ordered['departure_time'])

In [None]:
stop_times_ordered[["arrival_time", "departure_time"]].head(5)
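
A small sketch of the conversion on toy strings, with an explicit format so the parsing rule is visible. The None values set earlier become NaT. The second part is a side note rather than what the notebook does: the hour exploration above showed values such as 24 and 25 before filtering, and `pd.to_timedelta` is one way to represent such times as offsets from midnight.

```python
import pandas as pd

times = pd.Series(["07:05:00", "18:59:00", None])
# Explicit format; missing values are converted to NaT
as_datetime = pd.to_datetime(times, format="%H:%M:%S")
print(as_datetime.dt.hour.tolist())

# Times past midnight cannot be parsed as a clock time, but work as durations
late_times = pd.Series(["07:05:00", "25:10:00"])
as_timedelta = pd.to_timedelta(late_times)
print(as_timedelta.dt.total_seconds().tolist())
```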

In [None]:
with open('../data/stop_times_df.pkl','wb') as f: pickle.dump(stop_times_ordered, f)

In [None]:
stop_times_ordered = stop_times_ordered.sort_values(by=["route_int", "trip_int", "stop_sequence"])
stop_times_ordered.head(5)

We then convert it to an array, ready to be used by RAPTOR. Each row holds the arrival time followed by the departure time.

In [None]:
stop_times_array = stop_times_ordered[["arrival_time", "departure_time"]].to_numpy()
stop_times_array

In [None]:
np.size(stop_times_array,0)

In [None]:
with open('../data/stop_times_array.pkl','wb') as f: pickle.dump(stop_times_array, f)

#### Routes: 
[[route0_nr.Trips, route0_nr.Stops, route0_pointerRoutes, route0_pointerStops_times], [route1_nr.Trips, route1_nr.Stops, route1_pointerRoutes, route1_pointerStops_times], …]

We start by getting the number of trips and stops for each route

In [None]:
distinct_trips_stops = stop_times_ordered.groupby(["route_int"]).nunique()[["trip_int","stop_int"]].sort_index().rename(columns={"trip_int": "n_Trips", "stop_int": "n_stops"})
distinct_trips_stops.head(5)

In [None]:
distinct_trips_stops.shape

We create the pointer for the route stops, by adding the unique stops for each route

In [None]:
distinct_trips_stops['pointer_routes_stops'] = distinct_trips_stops.n_stops.cumsum().shift(1, fill_value=0)
distinct_trips_stops.head(5)

We create the pointer for stop_times by adding the number of stops in each route, counting duplicates (due to several trips)

In [None]:
distinct_trips_stops["pointer_stop_times"] = (stop_times_ordered.groupby(["route_int"]).count().stop_id).cumsum().shift(1, fill_value=0)

In [None]:
distinct_trips_stops["pointer_routes_stops_shift"] = distinct_trips_stops['pointer_routes_stops'].shift(-1, fill_value=0)
distinct_trips_stops["pointer_stop_times_shift"] = distinct_trips_stops['pointer_stop_times'].shift(-1, fill_value=0)
distinct_trips_stops.head(5)
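
The pointer construction is a shifted cumulative sum: each route starts where the previous routes end. The sketch below shows this on toy counts and, as an assumption about how a consumer might use the pointer, reads the stops of one route as a slice of the flat array. The helper `stops_of_route` is hypothetical.

```python
import numpy as np
import pandas as pd

stops_per_route = pd.Series([3, 2, 4], name="n_stops")
# Start offset of each route in the flat array
pointer = stops_per_route.cumsum().shift(1, fill_value=0)
route_stops_flat = np.array([10, 11, 12, 20, 21, 30, 31, 32, 33])

def stops_of_route(r):
    """Return the slice of the flat array that belongs to route r."""
    start = pointer.iloc[r]
    return route_stops_flat[start:start + stops_per_route.iloc[r]]

print(pointer.tolist())
print(stops_of_route(1))
```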

In [None]:
distinct_trips_stops['pointer_routes_stops'] = np.where((distinct_trips_stops["pointer_routes_stops"] == distinct_trips_stops["pointer_routes_stops_shift"]), None, distinct_trips_stops['pointer_routes_stops'])
distinct_trips_stops['pointer_stop_times'] = np.where((distinct_trips_stops["pointer_stop_times"] == distinct_trips_stops["pointer_stop_times_shift"]), None, distinct_trips_stops['pointer_stop_times'])


In [None]:
distinct_trips_stops.isna().any()

In [None]:
with open('../data/routes_array_df.pkl','wb') as f: pickle.dump(distinct_trips_stops[['n_Trips', 'n_stops', 'pointer_routes_stops', 'pointer_stop_times']], f)

In [None]:
distinct_trips_stops.info()

In [None]:
routes_array = distinct_trips_stops[['n_Trips', 'n_stops', 'pointer_routes_stops', 'pointer_stop_times']].to_numpy()
routes_array

In [None]:
np.size(routes_array, 0)

In [None]:
with open('../data/routes_array.pkl','wb') as f: pickle.dump(routes_array, f)

#### RouteStops:
[route0_stop0, route0_stop1, …, route1_stop0, route1_stop1, …, …]


In [None]:
route_stops = stop_times_ordered.sort_values(["route_int", "stop_sequence"])
route_stops = route_stops[['route_int', 'stop_int']].drop_duplicates().reset_index()
route_stops.head(5)

In [None]:
route_stops.info()

In [None]:
route_stops.route_int.nunique()

In [None]:
with open('../data/route_stops_df.pkl','wb') as f: pickle.dump(route_stops, f)

In [None]:
route_stops_array = route_stops.stop_int.to_numpy()
route_stops_array

In [None]:
np.size(np.unique(route_stops_array))

In [None]:
np.size(route_stops_array, 0)

In [None]:
route_stops_array.shape

In [None]:
with open('../data/route_stops_array.pkl','wb') as f: pickle.dump(route_stops_array, f)

#### Check if pointers are correct
It is essential that the indexes in Routes, which serve as pointers, are correct.

We start by looking at where the pointers for stop_times and route_stops diverge, which lets us test each one separately. According to the Routes table, route_stops should start a new route at index 3, while stop_times should start one at index 78, so we check those positions.

In [None]:
distinct_trips_stops.head(5)

We can check that each pointer matches the index where a route starts: pointer_routes_stops should indicate the first stop of a new route. We try index 3 to see whether route_stops has a new route there. It does, so the pointer is correct.

In [None]:
route_stops.head(5)

We then check whether stop_times starts a new route at index 78. It does, so the pointer is correct.

In [None]:
stop_times_ordered.loc[75:80].head(5)
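
The spot checks above can also be done for every route at once. The following sketch, on a toy route_int column, compares the positions where the route actually changes with the offsets obtained from the shifted cumulative sum. It assumes a frame sorted by route and with a 0-based positional index.

```python
import numpy as np
import pandas as pd

route_col = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 2])
# Positions where route_int differs from the previous row
actual_starts = np.flatnonzero(route_col.ne(route_col.shift()).to_numpy())
# Offsets predicted by the pointer construction
rows_per_route = route_col.value_counts(sort=False).sort_index()
expected_starts = rows_per_route.cumsum().shift(1, fill_value=0).to_numpy()
pointers_ok = np.array_equal(actual_starts, expected_starts)
print(actual_starts, expected_starts, pointers_ok)
```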

#### Stops:
[[stop0_pointerRoutes, stop0_pointerTransfer], [stop1_pointerRoutes, stop1_pointerTransfer], …]

In [None]:
stops_join = route_stops.join(transfers.set_index("stop_int"), how="left", on="stop_int").drop_duplicates()
stops_join.head(5)

In [None]:
stops_join.stop_int.nunique()

In [None]:
distinct_route_transfers = stops_join.sort_values("stop_int").groupby(["stop_int"]).nunique().rename(columns={"route_int": "n_Routes", "stop_int_2": "n_Transfers"})
distinct_route_transfers = distinct_route_transfers[["n_Routes", "n_Transfers"]].sort_index()
distinct_route_transfers.head(5)

In [None]:
distinct_route_transfers['pointer_stop_routes'] = distinct_route_transfers.n_Routes.cumsum().shift(1, fill_value=0)
distinct_route_transfers['pointer_transfers'] = distinct_route_transfers.n_Transfers.cumsum().shift(1, fill_value=0)
distinct_route_transfers.head(5)

In [None]:
distinct_route_transfers["pointer_stop_routes_shift"] = distinct_route_transfers['pointer_stop_routes'].shift(-1, fill_value=0)
distinct_route_transfers["pointer_transfers_shift"] = distinct_route_transfers['pointer_transfers'].shift(-1, fill_value=0)
distinct_route_transfers.head(5)

In [None]:
distinct_route_transfers['pointer_stop_routes'] = np.where((distinct_route_transfers["pointer_stop_routes"] == distinct_route_transfers["pointer_stop_routes_shift"]), None, distinct_route_transfers['pointer_stop_routes'])
distinct_route_transfers['pointer_transfers'] = np.where((distinct_route_transfers["pointer_transfers"] == distinct_route_transfers["pointer_transfers_shift"]), None, distinct_route_transfers['pointer_transfers'])


In [None]:
distinct_route_transfers.isna().any()
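
To illustrate why some pointers become None: when an entry has a count of zero, its pointer equals the next one, and the comparison with the shifted column replaces it with None. A toy sketch of that pattern follows; the counts are invented.

```python
import numpy as np
import pandas as pd

# Toy number of transfers per stop_int; the second stop has no transfers
n_transfers = pd.Series([2, 0, 3])
pointer = n_transfers.cumsum().shift(1, fill_value=0)
next_pointer = pointer.shift(-1, fill_value=0)
# An entry that does not advance the pointer gets None
pointer_or_none = np.where(pointer == next_pointer, None, pointer)
print(pointer_or_none)
```

In this sketch the last entry is compared against the fill value 0, so a final stop without transfers would keep its pointer rather than receive None.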

In [None]:
stops_df = distinct_route_transfers[['pointer_stop_routes', 'pointer_transfers']]

In [None]:
with open('../data/stops_df.pkl','wb') as f: pickle.dump(stops_df, f)

In [None]:
stops_array = stops_df.to_numpy()
stops_array

In [None]:
np.size(stops_array, 0)

In [None]:
stops_array.shape

In [None]:
with open('../data/stops_array.pkl','wb') as f: pickle.dump(stops_array, f)

#### StopRoutes:
[stop0_route1, stop0_route3, stop1_route1, stop2_route1, stop1_route4, …]

In [None]:
stop_routes = stop_times_ordered[["route_int", "stop_int", "stop_id"]].drop_duplicates().sort_values(["stop_int", "route_int"])
stop_routes = stop_routes.reset_index()
stop_routes.head(5)

In [None]:
stop_routes.shape

In [None]:
stop_times_curated.route_id.nunique()

In [None]:
stop_routes.route_int.nunique()

In [None]:
with open('../data/stop_routes_df.pkl','wb') as f: pickle.dump(stop_routes, f)

In [None]:
stop_routes_array = stop_routes["route_int"].to_numpy()
stop_routes_array

In [None]:
np.size(stop_routes_array, 0)

In [None]:
stop_routes_array.shape

In [None]:
with open('../data/stop_routes_array.pkl','wb') as f: pickle.dump(stop_routes_array, f)

#### Transfer:
[[[stop0_nameTargetStop1, transferTime1], [stop0_nameTargetStop2, transferTime2], …], [[stop1_nameTargetStop1, transferTime1], [stop1_nameTargetStop2, transferTime2], …], …]

In [None]:
transfers.stop_id.count()

In [None]:
transfer_pandas = transfers[["stop_int","stop_int_2", "Transfer_time_sec", "stop_id_raw"]].sort_values(["stop_int", "stop_int_2", "stop_id_raw"]).drop_duplicates(["stop_int", "stop_int_2"])
transfer_pandas = transfer_pandas.reset_index(drop=True)
transfer_pandas.head()

In [None]:
transfer_pandas.stop_int_2.nunique()

In [None]:
with open('../data/transfer_df.pkl','wb') as f: pickle.dump(transfers.sort_values("stop_id"), f)

In [None]:
transfer_array = transfer_pandas[["stop_int_2", "Transfer_time_sec"]].to_numpy()
transfer_array

In [None]:
with open('../data/transfer_array.pkl','wb') as f: pickle.dump(transfer_array, f)

In [None]:
np.size(transfer_array, 0)

#### Check if indexes in stops are correct

We first look at the pointers

In [None]:
stops_df.head(5)

We see that at index 8 there should be a new stop. We check, and it is false.

In [None]:
transfer_pandas.loc[5:10].head(5)

We see that at index 4 we should have a new stop. We check, and it is true.

In [None]:
stop_routes.head(5)

In [None]:
stop_routes.loc[stop_routes['stop_int'] == 172]

In [None]:
route_stops.loc[route_stops['stop_int'] == 172]

Read the files back from the pickles

In [None]:
with open('../data/stop_times_array.pkl','rb') as f: arrayname1 = pickle.load(f)

In [None]:
with open('../data/routes_array.pkl','rb') as f: arrayname2 = pickle.load(f)

In [None]:
with open('../data/route_stops_array.pkl','rb') as f: arrayname3 = pickle.load(f)

In [None]:
arrayname1

In [None]:
arrayname2

In [None]:
arrayname3
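
Rather than inspecting the reloaded arrays by eye, a round trip can be verified programmatically. This is a minimal sketch with a toy array and a temporary directory, so it does not touch the files in ../data; the array values are invented.

```python
import os
import pickle
import tempfile

import numpy as np

original = np.array([[2, 3, 0, 0], [1, 2, 3, 6]])
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "routes_array.pkl")
    # Write and read back with the same pattern as the notebook cells
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = np.array_equal(original, restored)
print(round_trip_ok)
```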