# Transforming the GTFS-like table `stop_times` and the footpath table `transfers` to the pointer-array file structure of RAPTOR

*Authors: Cyril Pulver, Tomas Turner*

Most of the data cleaning was performed in `notebooks/data_wrangling_before_RAPTOR`. The two resulting tables (`stop_times` and `transfers`) are already indexed exactly as RAPTOR's StopTimes and Transfers arrays require. However, we still need to:
 - generate those arrays from the tables
 - generate the accompanying pointer arrays (Routes, Stops, RouteStops, StopRoutes) required for RAPTOR

**This notebook generates the pickled Python objects required to run RAPTOR from `notebooks/MC_RAPTOR`.**

Below is an outline of the operations

- StopTimes
    - For each trip, change the arrival time at the first stop and the departure time from the last stop to None
    - Create arrays by subsetting on arrival and departure times.
- Routes
    - Compute the route length and number of trips for each route (defined by our in-house `route_int` identifier)
    - Create pointer for Route_stops by cumulative sum of the route lengths
    - Create pointer for Stop_times by cumulative sum of the route lengths multiplied by the number of trips per route
    - Convert everything to integers
    - Create array
- Route_stop
    - Get the sequence of stops visited for each route
    - Create array
- Stop_routes
    - Gather the unique `route_int` serving each stop
    - Create array and pickle
- Transfers
    - Create the array directly from the table `transfers` by subsetting on `stop_int_2` and `walking_time`
- Stops
    - Count unique transfers and unique routes per stop
    - Order by stop_int
    - Create array and pickle
- Check
    - Indexes of stops correspond to stop_routes and transfers
    - Check coherence between stop_routes, routes_stop and stop_times
    - Check if files can be read correctly from pickle
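
The two pointer computations listed under Routes can be sketched on a toy example. The route lengths, trip counts and the helper name `exclusive_cumsum` below are made up for illustration and are not taken from the data:

```python
import numpy as np

# Hypothetical toy network: three routes with these lengths and trip counts.
route_length = np.array([3, 2, 4])
n_trips = np.array([2, 1, 3])


def exclusive_cumsum(values):
    """Start offset of each block when blocks are stored back to back."""
    return np.concatenate(([0], np.cumsum(values)[:-1]))


# Each route occupies route_length entries in RouteStops ...
pointer_route_stops = exclusive_cumsum(route_length)
# ... and route_length * n_trips rows in StopTimes.
pointer_stop_times = exclusive_cumsum(route_length * n_trips)

print(pointer_route_stops)  # [0 3 5]
print(pointer_stop_times)   # [0 6 8]
```

Each pointer is the sum of the block sizes of all preceding routes, so the first route always starts at 0.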

## Import packages

In [1]:
import pandas as pd
import numpy as np
import pickle
import itertools

## Read files
Before running, make sure the .csv files are in /data. If not, run the notebook "transfer_to_local".

In [2]:
#stop_times
stop_times_curated = pd.read_csv("../data/stop_times_final_cyril.csv")
stop_times_curated.head(5)

Unnamed: 0.1,Unnamed: 0,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
0,0,26-46-j19-1,8591371,742.TA.26-46-j19-1.8.R,8591371,18:06:00,18:06:00,13,"Zürich, Singlistrasse",47.405111,8.49349,"Zürich, Rütihof",2524,1,17:48:00,1363,18,879,Bus,1632087572480
1,1,26-46-j19-1,8591358,742.TA.26-46-j19-1.8.R,8591358,18:07:00,18:07:00,14,"Zürich, Segantinistrasse",47.407446,8.489969,"Zürich, Rütihof",2524,1,17:48:00,1363,18,1025,Bus,1632087572481
2,2,26-46-j19-1,8591158,742.TA.26-46-j19-1.8.R,8591158,18:08:00,18:08:00,15,"Zürich, Giblenstrasse",47.410728,8.485953,"Zürich, Rütihof",2524,1,17:48:00,1363,18,704,Bus,1632087572482
3,3,26-46-j19-1,8576241,742.TA.26-46-j19-1.8.R,8576241,18:09:00,18:09:00,16,"Zürich, Heizenholz",47.412297,8.483905,"Zürich, Rütihof",2524,1,17:48:00,1363,18,57,Bus,1632087572483
4,4,26-46-j19-1,8591155,742.TA.26-46-j19-1.8.R,8591155,18:10:00,18:10:00,17,"Zürich, Geeringstrasse",47.414443,8.480438,"Zürich, Rütihof",2524,1,17:48:00,1363,18,187,Bus,1632087572484


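We restore the row order given by `monotonically_increasing_id` and drop the leftover index column `Unnamed: 0`, which is of no use to us.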

In [3]:
stop_times_curated = stop_times_curated.sort_values("monotonically_increasing_id").reset_index().drop(columns=["Unnamed: 0"])
stop_times_curated.head(5)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
0,201860,26-13-j19-1,8576240,2064.TA.26-13-j19-1.24.H,8576240,07:00:00,07:00:00,5,"Zürich, Meierhofplatz",47.40201,8.499374,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1221,Tram,0
1,201861,26-13-j19-1,8591353,2064.TA.26-13-j19-1.24.H,8591353,07:01:00,07:01:00,6,"Zürich, Schwert",47.39973,8.504611,"Zürich, Albisgütli",1831,0,07:00:00,0,26,816,Tram,1
2,201862,26-13-j19-1,8591039,2064.TA.26-13-j19-1.24.H,8591039,07:02:00,07:02:00,7,"Zürich, Alte Trotte",47.397766,8.507252,"Zürich, Albisgütli",1831,0,07:00:00,0,26,776,Tram,2
3,201863,26-13-j19-1,8591121,2064.TA.26-13-j19-1.24.H,8591121,07:03:00,07:03:00,8,"Zürich, Eschergutweg",47.39627,8.51204,"Zürich, Albisgütli",1831,0,07:00:00,0,26,307,Tram,3
4,201864,26-13-j19-1,8591417,2064.TA.26-13-j19-1.24.H,8591417,07:05:00,07:05:00,9,"Zürich, Waidfussweg",47.395498,8.5184,"Zürich, Albisgütli",1831,0,07:00:00,0,26,347,Tram,4


In [4]:
#transfers
transfers = pd.read_csv("../data/transfers_cyril.csv")
transfers.head(5)

Unnamed: 0.1,Unnamed: 0,stop_id_general,stop_int,stop_lat_first,stop_lon_first,stop_name_first,stop_id_general_2,stop_int_2,stop_lat_first_2,stop_lon_first_2,stop_name_first_2,distance,walking_time,monotonically_increasing_id
0,0,8577912,248,47.259187,8.598099,"Horgen, untere Mühle",8503204,537,47.261607,8.596904,Horgen,0.283831,340,317827579904
1,1,8577912,248,47.259187,8.598099,"Horgen, untere Mühle",8573555,755,47.257662,8.593212,"Horgen, Bergli",0.405876,487,317827579905
2,2,8577912,248,47.259187,8.598099,"Horgen, untere Mühle",8503672,1323,47.262351,8.597596,Horgen ZSG,0.353874,424,317827579906
3,3,8577912,248,47.259187,8.598099,"Horgen, untere Mühle",8590663,1335,47.256547,8.602321,"Horgen, Wannenthal",0.433241,519,317827579907
4,4,8587970,249,47.307085,8.607855,"Erlenbach ZH, Föhrenstrasse",8587980,406,47.307182,8.613505,"Erlenbach ZH, Schützenhaus",0.426166,511,317827579908


In [5]:
transfers = transfers.sort_values("monotonically_increasing_id").reset_index().drop(columns=["Unnamed: 0"])
transfers.head(5)

Unnamed: 0,index,stop_id_general,stop_int,stop_lat_first,stop_lon_first,stop_name_first,stop_id_general_2,stop_int_2,stop_lat_first_2,stop_lon_first_2,stop_name_first_2,distance,walking_time,monotonically_increasing_id
0,6232,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590268,815,47.414212,8.379521,"Spreitenbach, ASP",0.222963,267,0
1,6233,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590270,1350,47.41795,8.372083,"Spreitenbach, Brüel",0.474276,569,1
2,6234,8503078,1,47.345476,8.593023,Waldburg,8591903,63,47.348349,8.596042,"Zollikerberg, Spital",0.392124,470,2
3,6235,8503078,1,47.345476,8.593023,Waldburg,8591023,242,47.346821,8.598153,"Zollikerb., Langägerten/Spital",0.414395,497,3
4,6236,8503078,1,47.345476,8.593023,Waldburg,8590879,551,47.345391,8.593302,"Waldburg, Station",0.023022,27,4


## Ordering

We make sure the rows are ordered by route, departure time from the first stop, trip and stop sequence.

In [6]:
stop_times_ordered = stop_times_curated.sort_values(by=["route_int", "departure_first_stop", "trip_id", "stop_sequence"])
stop_times_ordered.head(5)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
0,201860,26-13-j19-1,8576240,2064.TA.26-13-j19-1.24.H,8576240,07:00:00,07:00:00,5,"Zürich, Meierhofplatz",47.40201,8.499374,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1221,Tram,0
1,201861,26-13-j19-1,8591353,2064.TA.26-13-j19-1.24.H,8591353,07:01:00,07:01:00,6,"Zürich, Schwert",47.39973,8.504611,"Zürich, Albisgütli",1831,0,07:00:00,0,26,816,Tram,1
2,201862,26-13-j19-1,8591039,2064.TA.26-13-j19-1.24.H,8591039,07:02:00,07:02:00,7,"Zürich, Alte Trotte",47.397766,8.507252,"Zürich, Albisgütli",1831,0,07:00:00,0,26,776,Tram,2
3,201863,26-13-j19-1,8591121,2064.TA.26-13-j19-1.24.H,8591121,07:03:00,07:03:00,8,"Zürich, Eschergutweg",47.39627,8.51204,"Zürich, Albisgütli",1831,0,07:00:00,0,26,307,Tram,3
4,201864,26-13-j19-1,8591417,2064.TA.26-13-j19-1.24.H,8591417,07:05:00,07:05:00,9,"Zürich, Waidfussweg",47.395498,8.5184,"Zürich, Albisgütli",1831,0,07:00:00,0,26,347,Tram,4


In [7]:
stop_times_curated.equals(stop_times_ordered)

True

We also make sure that, within each trip, the order given by `stop_sequence` never contradicts the order given by `departure_time`.

In [8]:
stop_times_ordered_2 = stop_times_ordered.sort_values(by=['route_int', 'departure_first_stop', 'trip_id', 'departure_time'])

In [9]:
stop_times_curated.equals(stop_times_ordered_2)

True
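
The equality check above compares two global sort orders. As a complementary sketch, the same property can be tested trip by trip. The toy table and the function name `trips_with_decreasing_departures` are assumptions made for illustration; only the column names come from `stop_times`:

```python
import pandas as pd

# Toy stand-in for stop_times_ordered; only the columns used here.
toy = pd.DataFrame({
    "trip_id": ["a", "a", "a", "b", "b"],
    "stop_sequence": [1, 2, 3, 4, 5],
    "departure_time": ["07:00:00", "07:02:00", "07:05:00", "07:10:00", "07:09:00"],
})


def trips_with_decreasing_departures(df):
    """Return the trip_ids whose departure times decrease along stop_sequence."""
    ordered = df.sort_values(["trip_id", "stop_sequence"])
    seconds = pd.to_timedelta(ordered["departure_time"]).dt.total_seconds()
    # Difference to the previous stop of the same trip (NaN at each first stop).
    step = seconds.groupby(ordered["trip_id"]).diff()
    return sorted(ordered.loc[step < 0, "trip_id"].unique())


print(trips_with_decreasing_departures(toy))  # ['b']
```

An empty list would mean that no trip departs from a stop earlier than from the stop preceding it.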

## Fix arrival and departure time
We set the arrival time at the first stop and the departure time at the last stop of each trip to `None`.

It is possible that two departure times for two consecutive trips in stop_times are the same. Therefore, the solution below may break.

In [10]:
#adding a shift
stop_times_ordered["trip_id_1"] = stop_times_ordered["trip_id"].shift(-1, fill_value=0)
stop_times_ordered.head(50)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,...,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id,trip_id_1
0,201860,26-13-j19-1,8576240,2064.TA.26-13-j19-1.24.H,8576240,07:00:00,07:00:00,5,"Zürich, Meierhofplatz",47.40201,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1221,Tram,0,2064.TA.26-13-j19-1.24.H
1,201861,26-13-j19-1,8591353,2064.TA.26-13-j19-1.24.H,8591353,07:01:00,07:01:00,6,"Zürich, Schwert",47.39973,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,816,Tram,1,2064.TA.26-13-j19-1.24.H
2,201862,26-13-j19-1,8591039,2064.TA.26-13-j19-1.24.H,8591039,07:02:00,07:02:00,7,"Zürich, Alte Trotte",47.397766,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,776,Tram,2,2064.TA.26-13-j19-1.24.H
3,201863,26-13-j19-1,8591121,2064.TA.26-13-j19-1.24.H,8591121,07:03:00,07:03:00,8,"Zürich, Eschergutweg",47.39627,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,307,Tram,3,2064.TA.26-13-j19-1.24.H
4,201864,26-13-j19-1,8591417,2064.TA.26-13-j19-1.24.H,8591417,07:05:00,07:05:00,9,"Zürich, Waidfussweg",47.395498,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,347,Tram,4,2064.TA.26-13-j19-1.24.H
5,201865,26-13-j19-1,8591437,2064.TA.26-13-j19-1.24.H,8591437,07:06:00,07:06:00,10,"Zürich, Wipkingerplatz",47.392591,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1015,Tram,5,2064.TA.26-13-j19-1.24.H
6,201866,26-13-j19-1,8580522,2064.TA.26-13-j19-1.24.H,8580522,07:08:00,07:08:00,11,"Zürich, Escher-Wyss-Platz",47.390797,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,455,Tram,6,2064.TA.26-13-j19-1.24.H
7,201867,26-13-j19-1,8591110,2064.TA.26-13-j19-1.24.H,8591110,07:09:00,07:09:00,12,"Zürich, Dammweg",47.388492,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1102,Tram,7,2064.TA.26-13-j19-1.24.H
8,201868,26-13-j19-1,8591306,2064.TA.26-13-j19-1.24.H,8591306,07:10:00,07:10:00,13,"Zürich, Quellenstrasse",47.38674,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,786,Tram,8,2064.TA.26-13-j19-1.24.H
9,201869,26-13-j19-1,8591257,2064.TA.26-13-j19-1.24.H,8591257,07:11:00,07:11:00,14,"Zürich, Limmatplatz",47.384599,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,388,Tram,9,2064.TA.26-13-j19-1.24.H


In [11]:
stop_times_ordered['departure_time'] = np.where((stop_times_ordered["trip_id"] != stop_times_ordered["trip_id_1"]), None, stop_times_ordered['departure_time'])

In [12]:
stop_times_ordered.head(50)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,...,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id,trip_id_1
0,201860,26-13-j19-1,8576240,2064.TA.26-13-j19-1.24.H,8576240,07:00:00,07:00:00,5,"Zürich, Meierhofplatz",47.40201,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1221,Tram,0,2064.TA.26-13-j19-1.24.H
1,201861,26-13-j19-1,8591353,2064.TA.26-13-j19-1.24.H,8591353,07:01:00,07:01:00,6,"Zürich, Schwert",47.39973,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,816,Tram,1,2064.TA.26-13-j19-1.24.H
2,201862,26-13-j19-1,8591039,2064.TA.26-13-j19-1.24.H,8591039,07:02:00,07:02:00,7,"Zürich, Alte Trotte",47.397766,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,776,Tram,2,2064.TA.26-13-j19-1.24.H
3,201863,26-13-j19-1,8591121,2064.TA.26-13-j19-1.24.H,8591121,07:03:00,07:03:00,8,"Zürich, Eschergutweg",47.39627,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,307,Tram,3,2064.TA.26-13-j19-1.24.H
4,201864,26-13-j19-1,8591417,2064.TA.26-13-j19-1.24.H,8591417,07:05:00,07:05:00,9,"Zürich, Waidfussweg",47.395498,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,347,Tram,4,2064.TA.26-13-j19-1.24.H
5,201865,26-13-j19-1,8591437,2064.TA.26-13-j19-1.24.H,8591437,07:06:00,07:06:00,10,"Zürich, Wipkingerplatz",47.392591,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1015,Tram,5,2064.TA.26-13-j19-1.24.H
6,201866,26-13-j19-1,8580522,2064.TA.26-13-j19-1.24.H,8580522,07:08:00,07:08:00,11,"Zürich, Escher-Wyss-Platz",47.390797,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,455,Tram,6,2064.TA.26-13-j19-1.24.H
7,201867,26-13-j19-1,8591110,2064.TA.26-13-j19-1.24.H,8591110,07:09:00,07:09:00,12,"Zürich, Dammweg",47.388492,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,1102,Tram,7,2064.TA.26-13-j19-1.24.H
8,201868,26-13-j19-1,8591306,2064.TA.26-13-j19-1.24.H,8591306,07:10:00,07:10:00,13,"Zürich, Quellenstrasse",47.38674,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,786,Tram,8,2064.TA.26-13-j19-1.24.H
9,201869,26-13-j19-1,8591257,2064.TA.26-13-j19-1.24.H,8591257,07:11:00,07:11:00,14,"Zürich, Limmatplatz",47.384599,...,"Zürich, Albisgütli",1831,0,07:00:00,0,26,388,Tram,9,2064.TA.26-13-j19-1.24.H


In [13]:
# positive shift for setting arrival times to none:
stop_times_ordered["trip_id_2"] = stop_times_ordered["trip_id"].shift(1, fill_value=0)
stop_times_ordered.head(50)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,...,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id,trip_id_1,trip_id_2
0,201860,26-13-j19-1,8576240,2064.TA.26-13-j19-1.24.H,8576240,07:00:00,07:00:00,5,"Zürich, Meierhofplatz",47.40201,...,1831,0,07:00:00,0,26,1221,Tram,0,2064.TA.26-13-j19-1.24.H,0
1,201861,26-13-j19-1,8591353,2064.TA.26-13-j19-1.24.H,8591353,07:01:00,07:01:00,6,"Zürich, Schwert",47.39973,...,1831,0,07:00:00,0,26,816,Tram,1,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
2,201862,26-13-j19-1,8591039,2064.TA.26-13-j19-1.24.H,8591039,07:02:00,07:02:00,7,"Zürich, Alte Trotte",47.397766,...,1831,0,07:00:00,0,26,776,Tram,2,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
3,201863,26-13-j19-1,8591121,2064.TA.26-13-j19-1.24.H,8591121,07:03:00,07:03:00,8,"Zürich, Eschergutweg",47.39627,...,1831,0,07:00:00,0,26,307,Tram,3,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
4,201864,26-13-j19-1,8591417,2064.TA.26-13-j19-1.24.H,8591417,07:05:00,07:05:00,9,"Zürich, Waidfussweg",47.395498,...,1831,0,07:00:00,0,26,347,Tram,4,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
5,201865,26-13-j19-1,8591437,2064.TA.26-13-j19-1.24.H,8591437,07:06:00,07:06:00,10,"Zürich, Wipkingerplatz",47.392591,...,1831,0,07:00:00,0,26,1015,Tram,5,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
6,201866,26-13-j19-1,8580522,2064.TA.26-13-j19-1.24.H,8580522,07:08:00,07:08:00,11,"Zürich, Escher-Wyss-Platz",47.390797,...,1831,0,07:00:00,0,26,455,Tram,6,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
7,201867,26-13-j19-1,8591110,2064.TA.26-13-j19-1.24.H,8591110,07:09:00,07:09:00,12,"Zürich, Dammweg",47.388492,...,1831,0,07:00:00,0,26,1102,Tram,7,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
8,201868,26-13-j19-1,8591306,2064.TA.26-13-j19-1.24.H,8591306,07:10:00,07:10:00,13,"Zürich, Quellenstrasse",47.38674,...,1831,0,07:00:00,0,26,786,Tram,8,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H
9,201869,26-13-j19-1,8591257,2064.TA.26-13-j19-1.24.H,8591257,07:11:00,07:11:00,14,"Zürich, Limmatplatz",47.384599,...,1831,0,07:00:00,0,26,388,Tram,9,2064.TA.26-13-j19-1.24.H,2064.TA.26-13-j19-1.24.H


In [14]:
stop_times_ordered["arrival_time"] = np.where((stop_times_ordered["trip_id"] != stop_times_ordered["trip_id_2"]), None, stop_times_ordered['arrival_time'])

In [15]:
stop_times_ordered[["arrival_time","departure_time"]].head(50)

Unnamed: 0,arrival_time,departure_time
0,,07:00:00
1,07:01:00,07:01:00
2,07:02:00,07:02:00
3,07:03:00,07:03:00
4,07:05:00,07:05:00
5,07:06:00,07:06:00
6,07:08:00,07:08:00
7,07:09:00,07:09:00
8,07:10:00,07:10:00
9,07:11:00,07:11:00


In [16]:
stop_times_ordered.tail()

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,...,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id,trip_id_1,trip_id_2
260454,62764,26-7-B-j19-1,8591081,2674.TA.26-7-B-j19-1.17.H,8591081,07:33:00,07:33:00,27,"Zürich Wollishofen, Bhf (Tram)",47.347034,...,1961,0,07:00:00,1460,28,250,Tram,1709396985177,2674.TA.26-7-B-j19-1.17.H,2674.TA.26-7-B-j19-1.17.H
260455,62765,26-7-B-j19-1,8591304,2674.TA.26-7-B-j19-1.17.H,8591304,07:34:00,07:34:00,28,"Zürich, Post Wollishofen",47.344472,...,1961,0,07:00:00,1460,28,982,Tram,1709396985178,2674.TA.26-7-B-j19-1.17.H,2674.TA.26-7-B-j19-1.17.H
260456,62766,26-7-B-j19-1,8591279,2674.TA.26-7-B-j19-1.17.H,8591279,07:35:00,07:35:00,29,"Zürich, Morgental",47.343948,...,1961,0,07:00:00,1460,28,1349,Tram,1709396985179,2674.TA.26-7-B-j19-1.17.H,2674.TA.26-7-B-j19-1.17.H
260457,62767,26-7-B-j19-1,8591106,2674.TA.26-7-B-j19-1.17.H,8591106,07:36:00,07:36:00,30,"Zürich, Butzenstrasse",47.34141,...,1961,0,07:00:00,1460,28,1037,Tram,1709396985180,2674.TA.26-7-B-j19-1.17.H,2674.TA.26-7-B-j19-1.17.H
260458,62768,26-7-B-j19-1,8591439,2674.TA.26-7-B-j19-1.17.H,8591439,07:37:00,,31,"Zürich, Wollishofen",47.338439,...,1961,0,07:00:00,1460,28,552,Tram,1709396985181,0,2674.TA.26-7-B-j19-1.17.H


In [17]:
# removing the shifted columns which we won't need anymore
stop_times_ordered = stop_times_ordered.drop(columns=['trip_id_1', 'trip_id_2'])
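
The shift-based comparison above relies on all rows of a trip being contiguous. A possible alternative that does not depend on row order is sketched below, assuming `stop_sequence` is unique within a trip. The toy table and the function name `blank_trip_ends` are hypothetical:

```python
import pandas as pd

# Toy stand-in for stop_times_ordered with two trips.
toy = pd.DataFrame({
    "trip_id": ["a", "a", "a", "b", "b"],
    "stop_sequence": [5, 6, 7, 1, 2],
    "arrival_time": ["07:00:00", "07:02:00", "07:05:00", "07:10:00", "07:12:00"],
    "departure_time": ["07:00:00", "07:02:00", "07:05:00", "07:10:00", "07:12:00"],
})


def blank_trip_ends(df):
    """Blank the arrival at each trip's first stop and the departure at its last."""
    out = df.copy()
    by_trip = out.groupby("trip_id")["stop_sequence"]
    is_first = out["stop_sequence"] == by_trip.transform("min")
    is_last = out["stop_sequence"] == by_trip.transform("max")
    out.loc[is_first, "arrival_time"] = None
    out.loc[is_last, "departure_time"] = None
    return out


result = blank_trip_ends(toy)
print(result["arrival_time"].isna().tolist())    # [True, False, False, True, False]
print(result["departure_time"].isna().tolist())  # [False, False, True, False, True]
```

Because the first and last stops are identified per `trip_id`, the result does not change if the rows are shuffled.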

## Array structure preparation

#### StopTimes: 
Structure (conceptually nested by route, then trip, then stop; stored as a flat array with one `[arrival, departure]` row per stop): [[[[arrival_route0_trip0_stop0, departure_route0_trip0_stop0], [arrival_route0_trip0_stop1, departure_route0_trip0_stop1], …], [[arrival_route0_trip1_stop0, departure_route0_trip1_stop0], …], …], [[[arrival_route1_trip0_stop0, departure_route1_trip0_stop0], …], [[arrival_route1_trip1_stop0, departure_route1_trip1_stop0], …], …], …]

We convert the times to datetime, as required by the RAPTOR algorithm.

In [18]:
stop_times_ordered['arrival_time'] = pd.to_datetime(stop_times_ordered['arrival_time'])
stop_times_ordered['departure_time'] = pd.to_datetime(stop_times_ordered['departure_time'])

In [19]:
stop_times_ordered[["arrival_time", "departure_time"]].head(50)

Unnamed: 0,arrival_time,departure_time
0,NaT,2020-05-24 07:00:00
1,2020-05-24 07:01:00,2020-05-24 07:01:00
2,2020-05-24 07:02:00,2020-05-24 07:02:00
3,2020-05-24 07:03:00,2020-05-24 07:03:00
4,2020-05-24 07:05:00,2020-05-24 07:05:00
5,2020-05-24 07:06:00,2020-05-24 07:06:00
6,2020-05-24 07:08:00,2020-05-24 07:08:00
7,2020-05-24 07:09:00,2020-05-24 07:09:00
8,2020-05-24 07:10:00,2020-05-24 07:10:00
9,2020-05-24 07:11:00,2020-05-24 07:11:00
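
The date part in the output above (2020-05-24) does not come from the data: the strings only hold a time of day, and the parser fills in the rest. As a hedged sketch, the date can be pinned explicitly by adding a timedelta to a fixed day, which also copes with GTFS times past 24:00:00. The constant `SERVICE_DAY` and the function `to_timestamp` are assumptions, not part of the notebook:

```python
import pandas as pd

# Hypothetical reference day; chosen to match the date shown in the output above.
SERVICE_DAY = pd.Timestamp("2020-05-24")


def to_timestamp(times, day=SERVICE_DAY):
    """Convert 'HH:MM:SS' strings (or None) to timestamps on a fixed day."""
    return day + pd.to_timedelta(times)


times = pd.Series([None, "07:01:00", "25:10:00"])
converted = to_timestamp(times)
print(converted)
```

Missing values stay `NaT`, and a time such as `25:10:00` lands on the following calendar day.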


In [20]:
with open('../data/stop_times_df_cyril.pkl','wb') as f: pickle.dump(stop_times_ordered, f)

We then convert it to an array, ready to be used by RAPTOR.

In [21]:
stop_times_array = stop_times_ordered[["arrival_time", "departure_time"]].to_numpy()
stop_times_array

array([[                          'NaT', '2020-05-24T07:00:00.000000000'],
       ['2020-05-24T07:01:00.000000000', '2020-05-24T07:01:00.000000000'],
       ['2020-05-24T07:02:00.000000000', '2020-05-24T07:02:00.000000000'],
       ...,
       ['2020-05-24T07:35:00.000000000', '2020-05-24T07:35:00.000000000'],
       ['2020-05-24T07:36:00.000000000', '2020-05-24T07:36:00.000000000'],
       ['2020-05-24T07:37:00.000000000',                           'NaT']],
      dtype='datetime64[ns]')

In [22]:
np.size(stop_times_array,0)

260459

In [23]:
with open('../data/stop_times_array_cyril.pkl','wb') as f: pickle.dump(stop_times_array, f)
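
A pickle round trip for an array of this kind can be checked as sketched below. Since `NaT` never compares equal to itself, the comparison treats positions where both arrays hold `NaT` as equal. The toy array, the file name and the helper `same_datetime_array` are hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def same_datetime_array(a, b):
    """Element-wise equality that treats NaT at the same position as equal."""
    if a.shape != b.shape or a.dtype != b.dtype:
        return False
    both_nat = np.isnat(a) & np.isnat(b)
    return bool(((a == b) | both_nat).all())


# Toy array shaped like stop_times_array: one [arrival, departure] row per stop.
toy = np.array(
    [["NaT", "2020-05-24T07:00:00"],
     ["2020-05-24T07:01:00", "NaT"]],
    dtype="datetime64[ns]",
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_stop_times.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(same_datetime_array(toy, restored))  # True
```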

#### Routes: 
Structure: [[route0_nTrips, route0_nStops, route0_pointerRouteStops, route0_pointerStopTimes], [route1_nTrips, route1_nStops, route1_pointerRouteStops, route1_pointerStopTimes], …]

We start by computing the number of trips and stops for each route.

**IMPORTANT**: the number of stops stored in `routes[1]` is not the number of unique stops in the route, but rather the **length** of the route, i.e. the number of stop events in each trip of the route (a stop visited twice counts twice).

We examine this below. First, we try to count the number of stops per route by simply counting the unique `stop_int` in each route (or trip). Compared with `stop_count`, which was computed during the pyspark processing, this does not work.

In [24]:
distinct_trips_stops = stop_times_ordered.groupby(["route_int"]).nunique()[["trip_id","stop_int"]].sort_index().rename(columns={"trip_id": "n_Trips", "stop_int": "n_stops"})
distinct_trips_stops.head(5)

Unnamed: 0_level_0,n_Trips,n_stops
route_int,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1,26
1,1,8
2,1,17
3,1,2
4,55,9


In [25]:
#Sanity checking on n_stops per route from stop_times
np.where(distinct_trips_stops[['n_stops']].where(distinct_trips_stops[['n_stops']].values!=stop_times_ordered[['route_int', 'stop_count']].drop_duplicates().sort_values(by='route_int')[['stop_count']].values).notna())[0]

array([  36,   69,   74,  181,  228,  241,  254,  262,  274,  300,  302,
        331,  362,  383,  458,  471,  494,  535,  549,  585,  599,  740,
        785,  856,  866,  890,  898,  929,  943,  977, 1006, 1007, 1016,
       1058, 1103, 1157, 1223, 1245, 1253, 1266, 1314, 1331, 1368, 1416,
       1443])

What went wrong?

In [26]:
stop_times_ordered[stop_times_ordered['route_int']==36].head(50)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
3187,41421,26-302-j19-1,8587020,533.TA.26-302-j19-1.9.H,8587020:0:E,NaT,2020-05-24 07:03:00,1,"Dietikon, Bahnhof",47.406199,8.404521,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,130,Bus,17179869810
3188,41422,26-302-j19-1,8595511,533.TA.26-302-j19-1.9.H,8595511,2020-05-24 07:05:00,2020-05-24 07:05:00,2,"Dietikon, Heimstrasse",47.40854,8.405868,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,892,Bus,17179869811
3189,41423,26-302-j19-1,8590603,533.TA.26-302-j19-1.9.H,8590603,2020-05-24 07:07:00,2020-05-24 07:07:00,3,"Fahrweid, Limmatbrücke",47.407598,8.411887,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,1273,Bus,17179869812
3190,41424,26-302-j19-1,8591829,533.TA.26-302-j19-1.9.H,8591829,2020-05-24 07:08:00,2020-05-24 07:08:00,4,"Fahrweid, Au",47.411646,8.414295,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,109,Bus,17179869813
3191,41425,26-302-j19-1,8590602,533.TA.26-302-j19-1.9.H,8590602,2020-05-24 07:09:00,2020-05-24 07:09:00,5,"Fahrweid, Brunaustrasse",47.414418,8.414043,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,1159,Bus,17179869814
3192,41426,26-302-j19-1,8590615,533.TA.26-302-j19-1.9.H,8590615,2020-05-24 07:10:00,2020-05-24 07:10:00,6,"Geroldswil, Grindlen",47.416728,8.414394,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,1096,Bus,17179869815
3193,41427,26-302-j19-1,8590614,533.TA.26-302-j19-1.9.H,8590614,2020-05-24 07:11:00,2020-05-24 07:11:00,7,"Geroldswil, Dorfstrasse",47.41984,8.413136,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,724,Bus,17179869816
3194,41428,26-302-j19-1,8590618,533.TA.26-302-j19-1.9.H,8590618,2020-05-24 07:12:00,2020-05-24 07:12:00,8,"Geroldswil, Zentrum",47.421457,8.409444,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,589,Bus,17179869817
3195,41429,26-302-j19-1,8590614,533.TA.26-302-j19-1.9.H,8590614,2020-05-24 07:13:00,2020-05-24 07:13:00,9,"Geroldswil, Dorfstrasse",47.41984,8.413136,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,724,Bus,17179869818
3196,41430,26-302-j19-1,8590617,533.TA.26-302-j19-1.9.H,8590617,2020-05-24 07:15:00,2020-05-24 07:15:00,10,"Geroldswil, Welbrig",47.418072,8.419065,"Urdorf, Oberurdorf",4880,0,07:03:00,36,30,859,Bus,17179869819


This route passes twice through the same stop ('Geroldswil, Dorfstrasse') at different times!
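
The effect can be reproduced on a toy route with a single trip that visits one stop twice. The table below is made up for illustration; only the column names and the stop numbers mirror route 36:

```python
import pandas as pd

# Toy route with one trip that visits stop 724 twice (a loop).
toy = pd.DataFrame({
    "route_int": [0, 0, 0, 0],
    "trip_id": ["t0"] * 4,
    "stop_int": [130, 724, 589, 724],
})

per_route = toy.groupby("route_int")
# Counting distinct stops undercounts the route length ...
n_unique_stops = per_route["stop_int"].nunique()
# ... whereas rows per trip counts every stop event.
route_length = per_route.size() // per_route["trip_id"].nunique()

print(n_unique_stops.loc[0])  # 3
print(route_length.loc[0])    # 4
```

The trip has four stop events but only three distinct stops, so the unique count falls one short of the route length.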

The second approach would be to count the unique `trip_id` and `stop_sequence` values for each route. But that also fails.

In [27]:
distinct_trips_stops_2 = stop_times_ordered.groupby(["route_int"]).nunique()[["trip_id","stop_sequence"]].sort_index().rename(columns={"trip_id": "n_Trips", "stop_sequence": "n_stops"})
distinct_trips_stops_2.head(5)

Unnamed: 0_level_0,n_Trips,n_stops
route_int,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1,26
1,1,8
2,1,17
3,1,2
4,55,9


In [28]:
np.where(distinct_trips_stops_2[['n_stops']].where(distinct_trips_stops_2[['n_stops']].values!=stop_times_ordered[['route_int', 'stop_count']].drop_duplicates().sort_values(by='route_int')[['stop_count']].values).notna())[0]

array([  55,   60,   68,   82,   91,  106,  130,  156,  178,  201,  214,
        282,  310,  416,  435,  436,  446,  520,  602,  610,  611,  637,
        660,  685,  699,  744,  764,  854,  861,  914,  928,  938,  978,
       1028, 1041, 1071, 1087, 1144, 1149, 1151, 1283, 1285, 1287, 1297,
       1325, 1327, 1328, 1356, 1403, 1426, 1442])

What went wrong this time?

In [29]:
stop_times_ordered[stop_times_ordered['route_int']==55].head(50)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
5151,51596,1-445-j19-1,8572561,6.TA.1-445-j19-1.2.H,8572561,NaT,2020-05-24 07:01:00,8,"Bellikon, Rehaklinik",47.388535,8.344037,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,867,Bus,25769804942
5152,51597,1-445-j19-1,8503879,6.TA.1-445-j19-1.2.H,8503879,2020-05-24 07:03:00,2020-05-24 07:03:00,9,"Widen, Imbismatt",47.37244,8.356524,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,102,Bus,25769804943
5153,51598,1-445-j19-1,8502575,6.TA.1-445-j19-1.2.H,8502575,2020-05-24 07:04:00,2020-05-24 07:04:00,10,"Widen, Dorf",47.367572,8.363594,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,868,Bus,25769804944
5154,51599,1-445-j19-1,8580729,6.TA.1-445-j19-1.2.H,8580729,2020-05-24 07:04:00,2020-05-24 07:04:00,11,"Berikon, Kesslernmatt",47.364579,8.366082,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,1111,Bus,25769804945
5155,51600,1-445-j19-1,8572560,6.TA.1-445-j19-1.2.H,8572560,2020-05-24 07:09:00,2020-05-24 07:09:00,12,"Berikon-Widen, Bahnhof",47.361798,8.366756,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,882,Bus,25769804946
5156,51601,1-445-j19-1,8572599,6.TA.1-445-j19-1.2.H,8572599,2020-05-24 07:10:00,2020-05-24 07:10:00,13,"Berikon, Kreisschule",47.356279,8.366298,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,884,Bus,25769804947
5157,51602,1-445-j19-1,8502560,6.TA.1-445-j19-1.2.H,8502560,2020-05-24 07:12:00,2020-05-24 07:12:00,14,"Berikon, Kirche",47.351051,8.371418,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,312,Bus,25769804948
5158,51603,1-445-j19-1,8580317,6.TA.1-445-j19-1.2.H,8580317,2020-05-24 07:13:00,2020-05-24 07:13:00,15,"Berikon, Stalden",47.347643,8.376503,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,1296,Bus,25769804949
5159,51604,1-445-j19-1,8572598,6.TA.1-445-j19-1.2.H,8572598,2020-05-24 07:14:00,2020-05-24 07:14:00,16,"Berikon, Mattenhof",47.344125,8.380069,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,499,Bus,25769804950
5160,51605,1-445-j19-1,8591365,6.TA.1-445-j19-1.2.H,8591365,2020-05-24 07:32:00,2020-05-24 07:32:00,17,"Zürich, Sihlcity",47.357971,8.522263,"Zürich Enge, Bahnhof/Bederstr.",44507,0,07:01:00,55,11,1017,Bus,25769804951


`stop_sequence` starts at 8 for the first trip, but at 3 for the second trip.

The reason for this is unknown; both trips could be verified on sbb.ch.

The correct way to compute the length of each route is to divide the number of `stop_times` rows belonging to each `route_int` by the number of unique `trip_id` found for that route.

In [30]:
n_route_int_df = stop_times_ordered.groupby('route_int').count()['route_id'].values

In [31]:
n_unique_trips_per_route_int = stop_times_ordered.groupby('route_int').agg('nunique')['trip_id'].values

In [32]:
temp1 = np.rint((n_route_int_df/n_unique_trips_per_route_int))

In [33]:
temp2 = stop_times_ordered.groupby('route_int')['stop_count'].agg('unique').values

In [34]:
all(temp1 == temp2)

True

In [35]:
temp3 = stop_times_ordered.groupby('route_int')['stop_count'].agg('first').values

all(temp1 == temp3)

True

We have confirmed that dividing the number of `stop_times` rows belonging to each `route_int` by the number of unique `trip_id` in that route yields the length of each route, as computed during the pyspark processing. Therefore, we can directly use the `stop_count` column to get the length of each route.

In [36]:
route_lengths = stop_times_ordered.groupby('route_int')\
.agg({'trip_id': 'nunique',
      'stop_count': 'first'})\
.reset_index()\
.rename(columns={'trip_id': 'n_trips',
                'stop_count': 'route_length'})
route_lengths.head()

Unnamed: 0,route_int,n_trips,route_length
0,0,1,26
1,1,1,8
2,2,1,17
3,3,1,2
4,4,55,9


In [37]:
route_lengths.shape

(1461, 3)

We create the pointer into RouteStops as the cumulative sum of the lengths of all preceding routes.

In [38]:
route_lengths['pointer_route_stops'] = route_lengths.route_length.cumsum().shift(1, fill_value=0)
route_lengths.head(5)

Unnamed: 0,route_int,n_trips,route_length,pointer_route_stops
0,0,1,26,0
1,1,1,8,26
2,2,1,17,34
3,3,1,2,51
4,4,55,9,53


We create the pointer into StopTimes as the cumulative number of `stop_times` rows of all preceding routes; each route contributes its length once per trip.

In [39]:
route_lengths["pointer_stop_times"] = (stop_times_ordered.groupby(["route_int"]).count().stop_id).cumsum().shift(1, fill_value=0)
route_lengths.head(20)

Unnamed: 0,route_int,n_trips,route_length,pointer_route_stops,pointer_stop_times
0,0,1,26,0,0
1,1,1,8,26,26
2,2,1,17,34,34
3,3,1,2,51,51
4,4,55,9,53,53
5,5,1,2,62,548
6,6,1,5,64,550
7,7,1,6,69,555
8,8,7,5,75,561
9,9,1,4,80,596


In [40]:
# sanity check: the pointer in stop_times should be the cumulative sum of the number of trips times the length of the route for each route
route_lengths['pointer_stop_times_2'] = (route_lengths['n_trips'] * route_lengths['route_length'])\
.cumsum()\
.shift(1, fill_value=0)
route_lengths.head(20)

Unnamed: 0,route_int,n_trips,route_length,pointer_route_stops,pointer_stop_times,pointer_stop_times_2
0,0,1,26,0,0,0
1,1,1,8,26,26,26
2,2,1,17,34,34,34
3,3,1,2,51,51,51
4,4,55,9,53,53,53
5,5,1,2,62,548,548
6,6,1,5,64,550,550
7,7,1,6,69,555,555
8,8,7,5,75,561,561
9,9,1,4,80,596,596


In [41]:
# sanity check part 2
all(route_lengths['pointer_stop_times'].values == route_lengths['pointer_stop_times_2'].values)

True

In [42]:
route_lengths = route_lengths.drop(columns=['pointer_stop_times_2'])
route_lengths.head()

Unnamed: 0,route_int,n_trips,route_length,pointer_route_stops,pointer_stop_times
0,0,1,26,0,0
1,1,1,8,26,26
2,2,1,17,34,34
3,3,1,2,51,51
4,4,55,9,53,53


We make sure that no route has zero trips or zero stops; if one does, we set its pointer to None.

In [43]:
route_lengths['pointer_route_stops'] = np.where((route_lengths["n_trips"] == 0), None, route_lengths['pointer_route_stops'])
route_lengths['pointer_stop_times'] = np.where((route_lengths["route_length"] == 0), None, route_lengths['pointer_stop_times'])

In [44]:
route_lengths.isnull().any()

route_int              False
n_trips                False
route_length           False
pointer_route_stops    False
pointer_stop_times     False
dtype: bool

In [45]:
# last check on the ordering by route_int
route_lengths2 = route_lengths.sort_values(by=['route_int', 'pointer_route_stops'])
route_lengths.equals(route_lengths2)

True

We make sure all columns are stored as integers.

In [46]:
route_lengths["pointer_route_stops"] = pd.to_numeric(route_lengths["pointer_route_stops"])
route_lengths["pointer_stop_times"] = pd.to_numeric(route_lengths["pointer_stop_times"])
route_lengths['route_int'] = pd.to_numeric(route_lengths["route_int"])
route_lengths['n_trips'] = pd.to_numeric(route_lengths["n_trips"])
route_lengths['route_length'] = pd.to_numeric(route_lengths["route_length"])


In [47]:
route_lengths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   route_int            1461 non-null   int64
 1   n_trips              1461 non-null   int64
 2   route_length         1461 non-null   int64
 3   pointer_route_stops  1461 non-null   int64
 4   pointer_stop_times   1461 non-null   int64
dtypes: int64(5)
memory usage: 57.2 KB


In [48]:
with open('../data/routes_array_df_cyril.pkl','wb') as f: pickle.dump(route_lengths[['n_trips', 'route_length', 'pointer_route_stops', 'pointer_stop_times']], f)

In [49]:
routes_array = route_lengths[['n_trips', 'route_length', 'pointer_route_stops', 'pointer_stop_times']].to_numpy()
routes_array

array([[     1,     26,      0,      0],
       [     1,      8,     26,     26],
       [     1,     17,     34,     34],
       ...,
       [     1,      3,  15362, 260396],
       [     2,     16,  15365, 260399],
       [     1,     28,  15381, 260431]])

In [50]:
np.size(routes_array, 0)

1461

In [51]:
with open('../data/routes_array_cyril.pkl','wb') as f: pickle.dump(routes_array, f)
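
To illustrate how a row of the Routes array can be used, here is a hedged sketch that locates one trip inside StopTimes. It assumes that the trips of a route are stored one after the other, each occupying `route_length` consecutive rows, which is consistent with how the pointers are built above. The helper `trip_slice` and the toy arrays are hypothetical and are not part of the notebook.

```python
import numpy as np

# Toy Routes rows: [n_trips, route_length, pointer_route_stops, pointer_stop_times]
routes_toy = np.array([
    [1, 3, 0, 0],   # route 0: 1 trip over 3 stops
    [2, 2, 3, 3],   # route 1: 2 trips over 2 stops
])
# Toy StopTimes with one entry per (trip, stop); the values are placeholders
stop_times_toy = np.arange(7)

def trip_slice(routes, route, trip):
    """Hypothetical helper: slice of StopTimes holding the `trip`-th trip of `route`."""
    n_trips, route_length, _, pointer_stop_times = routes[route]
    if not 0 <= trip < n_trips:
        raise IndexError("trip out of range for this route")
    start = pointer_stop_times + trip * route_length
    return slice(int(start), int(start + route_length))

first_trip_route_0 = stop_times_toy[trip_slice(routes_toy, 0, 0)]
second_trip_route_1 = stop_times_toy[trip_slice(routes_toy, 1, 1)]
print(first_trip_route_0, second_trip_route_1)
```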

#### RouteStops: 
Structure: [route0_stop0, route0_stop1, …, route1_stop0, route1_stop1, …, …]

Note that `RouteStops` as defined in RAPTOR takes **the sequence of stops of the route (of length `route_length`)**, not the unique stops visited along the route.

We use the pointers generated above to pick the sequence of stops of the first trip of each route.

In [52]:
first_trip_indices_series = route_lengths[['pointer_stop_times', 'route_length']].apply(lambda x: list(range(x['pointer_stop_times'], x['pointer_stop_times'] + x['route_length'])), axis=1)
first_trip_indices_series

0       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1                        [26, 27, 28, 29, 30, 31, 32, 33]
2       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 4...
3                                                [51, 52]
4                    [53, 54, 55, 56, 57, 58, 59, 60, 61]
                              ...                        
1456    [260240, 260241, 260242, 260243, 260244, 26024...
1457    [260384, 260385, 260386, 260387, 260388, 26038...
1458                             [260396, 260397, 260398]
1459    [260399, 260400, 260401, 260402, 260403, 26040...
1460    [260431, 260432, 260433, 260434, 260435, 26043...
Length: 1461, dtype: object

In [53]:
first_trip_indices = []
for x in first_trip_indices_series:
    first_trip_indices.extend(x)
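
The same flat list of first-trip row positions can also be built in a single step with NumPy. This is only a sketch on toy data: the dataframe `toy_routes` is hypothetical, and its middle route is assumed to have two trips, so the third pointer jumps from 3 to 7.

```python
import numpy as np
import pandas as pd

# Toy pointers: route 1 has 2 trips of 2 stops, so route 2 starts at 3 + 2 * 2 = 7
toy_routes = pd.DataFrame({
    "pointer_stop_times": [0, 3, 7],
    "route_length":       [3, 2, 4],
})

# One arange per route covering its first trip, concatenated into a flat index
first_trip_idx = np.concatenate([
    np.arange(start, start + length)
    for start, length in zip(toy_routes["pointer_stop_times"],
                             toy_routes["route_length"])
])
print(first_trip_idx)
```

Rows 5 and 6, which belong to the second trip of route 1, are skipped as intended.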

In [54]:
route_stops = stop_times_ordered.iloc[first_trip_indices, ]
route_stops.iloc[50:, ].head(20)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
50,201910,26-304-j19-1,8590617,159.TA.26-304-j19-1.4.R,8590617,2020-05-24 19:59:00,NaT,17,"Geroldswil, Welbrig",47.418072,8.419065,"Dietikon, Bahnhof",5481,1,19:39:00,2,17,859,Bus,50
51,201911,26-61-j19-1,8591281,269.TA.26-61-j19-1.1.H,8591281,NaT,2020-05-24 19:57:00,1,"Zürich, Mühlacker",47.425633,8.498,"Zürich, Schwamendingerplatz",2076,0,19:57:00,3,2,212,Bus,51
52,201912,26-61-j19-1,8591046,269.TA.26-61-j19-1.1.H,8591046,2020-05-24 19:58:00,NaT,2,"Zürich, Aspholz",47.425086,8.500587,"Zürich, Schwamendingerplatz",2076,0,19:57:00,3,2,1003,Bus,52
53,201913,26-703-j19-1,8591825,179.TA.26-703-j19-1.2.R,8591825,NaT,2020-05-24 07:10:00,1,"Benglen, Bodenacher",47.361129,8.638613,"Zürich, Klusplatz",9385,1,07:10:00,4,9,580,Bus,53
54,201914,26-703-j19-1,8590504,179.TA.26-703-j19-1.2.R,8590504,2020-05-24 07:11:00,2020-05-24 07:11:00,2,"Benglen, Gerlisbrunnen",47.361086,8.633609,"Zürich, Klusplatz",9385,1,07:10:00,4,9,861,Bus,54
55,201915,26-703-j19-1,8596005,179.TA.26-703-j19-1.2.R,8596005,2020-05-24 07:14:00,2020-05-24 07:14:00,3,"Binz bei Maur, Twäracher",47.360892,8.623476,"Zürich, Klusplatz",9385,1,07:10:00,4,9,1366,Bus,55
56,201916,26-703-j19-1,8591832,179.TA.26-703-j19-1.2.R,8591832,2020-05-24 07:14:00,2020-05-24 07:14:00,4,"Pfaffhausen, Müseren",47.362699,8.617548,"Zürich, Klusplatz",9385,1,07:10:00,4,9,1023,Bus,56
57,201917,26-703-j19-1,8591147,179.TA.26-703-j19-1.2.R,8591147,2020-05-24 07:16:00,2020-05-24 07:16:00,5,"Zürich, Friedhof Witikon",47.361342,8.602824,"Zürich, Klusplatz",9385,1,07:10:00,4,9,1260,Bus,57
58,201918,26-703-j19-1,8591162,179.TA.26-703-j19-1.2.R,8591162,2020-05-24 07:17:00,2020-05-24 07:17:00,6,"Zürich, Glockenacker",47.360977,8.599303,"Zürich, Klusplatz",9385,1,07:10:00,4,9,146,Bus,58
59,201919,26-703-j19-1,8591261,179.TA.26-703-j19-1.2.R,8591261,2020-05-24 07:18:00,2020-05-24 07:18:00,7,"Zürich, Loorenstrasse",47.359863,8.594524,"Zürich, Klusplatz",9385,1,07:10:00,4,9,1197,Bus,59


In [55]:
route_stops.shape

(15409, 20)

Let's compare it to the "wrong way" of doing that operation: looking only at unique stops per route:

In [56]:
stop_times_ordered[['route_int', 'stop_int']].drop_duplicates().shape

(15344, 2)

We would have missed some entries for routes passing by the same stop more than once.
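
A small made-up example shows why deduplicating loses information. The loop route below is hypothetical: it starts and ends at stop 7.

```python
import pandas as pd

# Hypothetical loop route: stop 7 is both the first and the last stop
loop_trip = pd.DataFrame({
    "route_int": [0, 0, 0, 0],
    "stop_int":  [7, 3, 5, 7],
})

sequence_length = len(loop_trip)                  # what RouteStops needs
unique_length = len(loop_trip.drop_duplicates())  # what the "wrong way" yields
print(sequence_length, unique_length)
```

With the deduplicated version the route would be one stop shorter than its `route_length`, and every pointer after it would be shifted.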

We keep only the columns we need.

In [57]:
route_stops = route_stops[['route_int', 'stop_int']]
route_stops.head()

Unnamed: 0,route_int,stop_int
0,0,1221
1,0,816
2,0,776
3,0,307
4,0,347


Check that all values are stored as integers.

In [58]:
route_stops.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15409 entries, 0 to 260458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   route_int  15409 non-null  int64
 1   stop_int   15409 non-null  int64
dtypes: int64(2)
memory usage: 361.1 KB


Quick check that the number of unique routes equals the number of rows in the Routes array (1461). It does.

In [59]:
route_stops.route_int.nunique()

1461

In [60]:
with open('../data/route_stops_df_cyril.pkl','wb') as f: pickle.dump(route_stops, f)

In [61]:
route_stops_array = route_stops.stop_int.to_numpy()
route_stops_array

array([1221,  816,  776, ..., 1349, 1037,  552])

Check that the number of unique stops matches the number of stops in `stop_times_ordered`. It does.

In [62]:
np.size(np.unique(route_stops_array))

1407

In [63]:
stop_times_ordered.stop_id_general.nunique()

1407

In [64]:
with open('../data/route_stops_array_cyril.pkl','wb') as f: pickle.dump(route_stops_array, f)

#### Check if pointers are correct
It is essential that the indexes in Routes, which serve as pointers, are correct.

We start by looking at where the pointers into stop_times and route_stops diverge. Route 5 should start at index 62 in route_stops and at index 548 in stop_times. Let's verify that.

In [65]:
route_lengths.head(6)

Unnamed: 0,route_int,n_trips,route_length,pointer_route_stops,pointer_stop_times
0,0,1,26,0,0
1,1,1,8,26,26
2,2,1,17,34,34
3,3,1,2,51,51
4,4,55,9,53,53
5,5,1,2,62,548


We verify on an example that `pointer_route_stops` points at the first stop of a new route. Route 5 has pointer 62, so position 62 of `route_stops` should be the first row of route 5. The row labels shown below are inherited from `stop_times_ordered`, which is why that row is labelled 548.

In [66]:
route_stops.iloc[60:65]

Unnamed: 0,route_int,stop_int
60,4,1311
61,4,1133
548,5,298
549,5,1294
550,6,1202


Next, we check that stop_times has a new route starting at index 548.

In [67]:
stop_times_ordered.loc[545:550].head(5)

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
545,202405,26-703-j19-1,8591261,32.TA.26-703-j19-1.2.R,8591261,2020-05-24 19:27:00,2020-05-24 19:27:00,7,"Zürich, Loorenstrasse",47.359863,8.594524,"Zürich, Klusplatz",6842,1,19:19:00,4,9,1197,Bus,545
546,202406,26-703-j19-1,8591107,32.TA.26-703-j19-1.2.R,8591107,2020-05-24 19:29:00,2020-05-24 19:29:00,8,"Zürich, Carl-Spitteler-Strasse",47.358324,8.586592,"Zürich, Klusplatz",6842,1,19:19:00,4,9,1311,Bus,546
547,202407,26-703-j19-1,8591233,32.TA.26-703-j19-1.2.R,8591233,2020-05-24 19:33:00,NaT,9,"Zürich, Klusplatz",47.364037,8.566496,"Zürich, Klusplatz",6842,1,19:19:00,4,9,1133,Bus,547
548,202408,26-10-j19-1,8573205,1672.TA.26-10-j19-1.11.R,8573205,NaT,2020-05-24 07:01:00,27,"Zürich Flughafen, Bahnhof",47.450441,8.563729,"Zürich Flughafen, Fracht",4096,1,07:01:00,5,2,298,Tram,548
549,202409,26-10-j19-1,8588553,1672.TA.26-10-j19-1.11.R,8588553,2020-05-24 07:02:00,NaT,28,"Zürich Flughafen, Fracht",47.452494,8.572057,"Zürich Flughafen, Fracht",4096,1,07:01:00,5,2,1294,Tram,549
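
The spot check above can also be automated for every route. The sketch below is an assumption about how such a check could look, run on toy data; `pointers_are_consistent` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import pandas as pd

# Toy RouteStops table and matching Routes table
toy_route_stops = pd.DataFrame({
    "route_int": [0, 0, 0, 1, 1, 2],
    "stop_int":  [4, 9, 2, 9, 4, 7],
})
toy_routes = pd.DataFrame({
    "route_int":           [0, 1, 2],
    "route_length":        [3, 2, 1],
    "pointer_route_stops": [0, 3, 5],
})

def pointers_are_consistent(routes, route_stops):
    """Hypothetical check: each pointer opens a block of `route_length` rows of its own route."""
    for row in routes.itertuples(index=False):
        start = row.pointer_route_stops
        block = route_stops["route_int"].iloc[start:start + row.route_length]
        if len(block) != row.route_length or not (block == row.route_int).all():
            return False
    return True

ok = pointers_are_consistent(toy_routes, toy_route_stops)
# Deliberately wrong pointer for route 1 to show that the check catches it
broken = pointers_are_consistent(
    toy_routes.assign(pointer_route_stops=[0, 2, 5]), toy_route_stops
)
print(ok, broken)
```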


### Stops: 
[[stop0_pointerRoutes, stop0_pointerTransfer], [stop1_pointerRoutes, stop1_pointerTransfer], …]

We count the number of unique routes passing by each stop.

In [68]:
# select route_int explicitly so the result does not depend on whether pandas keeps the grouping column
stops_join = route_stops.groupby("stop_int")[["route_int"]].nunique().rename(columns={"route_int": "n_Routes"})
stops_join.head(5)

Unnamed: 0_level_0,n_Routes
stop_int,Unnamed: 1_level_1
0,11
1,9
2,18
3,6
4,23


In [69]:
stop_times_ordered[['stop_int', 'route_int']].sort_values(by='stop_int').groupby('stop_int').agg('nunique')['route_int'].head()

stop_int
0    11
1     9
2    18
3     6
4    23
Name: route_int, dtype: int64

Check that both computations give the same number of routes for every stop.

In [70]:
#sanity check
all(stops_join[['n_Routes']].values.flatten()== stop_times_ordered[['stop_int', 'route_int']].sort_values(by='stop_int').groupby('stop_int').agg('nunique')['route_int'].values)

True

In [71]:
stops_join.count()

n_Routes    1407
dtype: int64

We count the number of transfers for each stop

In [72]:
distinct_transfers = transfers\
.groupby("stop_int")[["stop_int_2"]]\
.nunique()\
.rename(columns={"stop_int_2": "n_Transfers"})
distinct_transfers.head(5)

Unnamed: 0_level_0,n_Transfers
stop_int,Unnamed: 1_level_1
0,2
1,5
2,15
3,6
4,4


In [73]:
# verifying on the first few rows of transfers
transfers.head(30)

Unnamed: 0,index,stop_id_general,stop_int,stop_lat_first,stop_lon_first,stop_name_first,stop_id_general_2,stop_int_2,stop_lat_first_2,stop_lon_first_2,stop_name_first_2,distance,walking_time,monotonically_increasing_id
0,6232,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590268,815,47.414212,8.379521,"Spreitenbach, ASP",0.222963,267,0
1,6233,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590270,1350,47.41795,8.372083,"Spreitenbach, Brüel",0.474276,569,1
2,6234,8503078,1,47.345476,8.593023,Waldburg,8591903,63,47.348349,8.596042,"Zollikerberg, Spital",0.392124,470,2
3,6235,8503078,1,47.345476,8.593023,Waldburg,8591023,242,47.346821,8.598153,"Zollikerb., Langägerten/Spital",0.414395,497,3
4,6236,8503078,1,47.345476,8.593023,Waldburg,8590879,551,47.345391,8.593302,"Waldburg, Station",0.023022,27,4
5,6237,8503078,1,47.345476,8.593023,Waldburg,8503077,705,47.34732,8.596796,Spital Zollikerberg,0.350511,420,5
6,6238,8503078,1,47.345476,8.593023,Waldburg,8576189,1001,47.345701,8.588487,"Zollikon, Rebwiesstrasse",0.342709,411,6
7,6239,8503088,2,47.377495,8.539169,Zürich HB SZU,8591327,48,47.37435,8.543239,"Zürich, Rudolf-Brun-Brücke",0.464967,557,7
8,6240,8503088,2,47.377495,8.539169,Zürich HB SZU,8588078,272,47.376844,8.54394,"Zürich, Central",0.366395,439,8
9,6241,8503088,2,47.377495,8.539169,Zürich HB SZU,8591316,373,47.373066,8.53846,"Zürich, Rennweg",0.495337,594,9


In [74]:
distinct_transfers.count()

n_Transfers    1337
dtype: int64

As `stops_join` contains all stops while `distinct_transfers` does not, we use a right join to keep every stop of `stops_join`.

In [75]:
stops_info = distinct_transfers.join(stops_join, how="right")
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.0,11
1,5.0,9
2,15.0,18
3,6.0,6
4,4.0,23
5,12.0,3
6,4.0,2
7,6.0,1
8,,6
9,4.0,6


The `NaN` values are due to the fact that not all stops are within 500m of other stops.

In [76]:
stops_info.count()

n_Transfers    1337
n_Routes       1407
dtype: int64

In [77]:
stops_info['pointer_stop_routes'] = stops_info.n_Routes.cumsum().shift(1, fill_value=0)
stops_info['pointer_transfers'] = stops_info.n_Transfers.cumsum().shift(1, fill_value=0)
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2.0,11,0,0.0
1,5.0,9,11,2.0
2,15.0,18,20,7.0
3,6.0,6,38,22.0
4,4.0,23,44,28.0
5,12.0,3,67,32.0
6,4.0,2,70,44.0
7,6.0,1,72,48.0
8,,6,73,54.0
9,4.0,6,79,


In [78]:
stops_info.tail()

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1402,,2,15296,6237.0
1403,5.0,5,15298,
1404,8.0,31,15303,6242.0
1405,7.0,5,15334,6250.0
1406,7.0,5,15339,6257.0


In [79]:
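# set the pointer to None for stops with a count of 0 routes or 0 transfers
# (stops without transfers have a NaN count, not 0, so they are not affected here)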

stops_info['pointer_stop_routes'] = np.where((stops_info["n_Routes"] == 0), None, stops_info['pointer_stop_routes'])
stops_info['pointer_transfers'] = np.where((stops_info["n_Transfers"] == 0), None, stops_info['pointer_transfers'])

In [80]:
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2.0,11,0,0.0
1,5.0,9,11,2.0
2,15.0,18,20,7.0
3,6.0,6,38,22.0
4,4.0,23,44,28.0
5,12.0,3,67,32.0
6,4.0,2,70,44.0
7,6.0,1,72,48.0
8,,6,73,54.0
9,4.0,6,79,


There is an issue here: a stop_int with no transfers (stop 8) should have no pointer to transfers. However, `cumsum` leaves a NaN at stop 8 and `shift` then moves it one row down, so it is the next stop (stop 9) that loses its pointer.

In [81]:
stops_info['pointer_transfers_not_shifted']= stops_info.n_Transfers.cumsum()
stops_info['pointer_transfers_shifted_2'] = stops_info.pointer_transfers.shift(1, fill_value = 0)
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers,pointer_transfers_not_shifted,pointer_transfers_shifted_2
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,2.0,11,0,0.0,2.0,0.0
1,5.0,9,11,2.0,7.0,0.0
2,15.0,18,20,7.0,22.0,2.0
3,6.0,6,38,22.0,28.0,7.0
4,4.0,23,44,28.0,32.0,22.0
5,12.0,3,67,32.0,44.0,28.0
6,4.0,2,70,44.0,48.0,32.0
7,6.0,1,72,48.0,54.0,44.0
8,,6,73,54.0,,48.0
9,4.0,6,79,,58.0,54.0


Now, we can apply the following logic to repair the pointer to transfers:
- When `pointer_transfers_not_shifted` is NaN, `pointer_transfers_corrected` takes the value NaN
- When `pointer_transfers` is NaN, `pointer_transfers_corrected` takes the value from `pointer_transfers_shifted_2`
- Otherwise, `pointer_transfers_corrected` keeps the value of `pointer_transfers`

In [82]:
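def correct_transfer_pointer(x):
    # stop without transfers: no pointer
    if pd.isna(x['pointer_transfers_not_shifted']):
        return np.nan
    # stop that inherited the NaN through the shift: take the previous pointer
    elif pd.isna(x['pointer_transfers']):
        return x['pointer_transfers_shifted_2']
    else:
        return x['pointer_transfers']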
        
stops_info['pointer_transfers_corrected'] = stops_info[['pointer_transfers', 
                                                    'pointer_transfers_not_shifted', 
                                                    'pointer_transfers_shifted_2']]\
.apply(correct_transfer_pointer, axis=1)
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers,pointer_transfers_not_shifted,pointer_transfers_shifted_2,pointer_transfers_corrected
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,2.0,11,0,0.0,2.0,0.0,0.0
1,5.0,9,11,2.0,7.0,0.0,2.0
2,15.0,18,20,7.0,22.0,2.0,7.0
3,6.0,6,38,22.0,28.0,7.0,22.0
4,4.0,23,44,28.0,32.0,22.0,28.0
5,12.0,3,67,32.0,44.0,28.0,32.0
6,4.0,2,70,44.0,48.0,32.0,44.0
7,6.0,1,72,48.0,54.0,44.0,48.0
8,,6,73,54.0,,48.0,
9,4.0,6,79,,58.0,54.0,54.0


In [83]:
stops_info.tail()

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers,pointer_transfers_not_shifted,pointer_transfers_shifted_2,pointer_transfers_corrected
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1402,,2,15296,6237.0,,6234.0,
1403,5.0,5,15298,,6242.0,6237.0,6237.0
1404,8.0,31,15303,6242.0,6250.0,,6242.0
1405,7.0,5,15334,6250.0,6257.0,6242.0,6250.0
1406,7.0,5,15339,6257.0,6264.0,6250.0,6257.0
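
As an alternative to the repair above, the pointer can be computed in one pass by treating missing counts as zero before the cumulative sum and masking afterwards. This is a sketch on toy counts, not the method used in this notebook; it does not look at neighbouring rows, so runs of several stops without transfers are handled as well.

```python
import numpy as np
import pandas as pd

# Toy counts; NaN marks a stop without any transfer (two in a row on purpose)
n_transfers = pd.Series([2, 5, np.nan, np.nan, 4], name="n_Transfers")

# Treat missing counts as 0 so the running total is never interrupted
pointer = n_transfers.fillna(0).cumsum().shift(1, fill_value=0)

# Blank the pointer of the stops that have no transfer at all
pointer_transfers = pointer.where(n_transfers.notna())
print(pointer_transfers)
```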


In [84]:
# dropping intermediate columns
stops_info = stops_info.drop(columns=['pointer_transfers', 'pointer_transfers_not_shifted', 'pointer_transfers_shifted_2'])
stops_info.head(20)

Unnamed: 0_level_0,n_Transfers,n_Routes,pointer_stop_routes,pointer_transfers_corrected
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2.0,11,0,0.0
1,5.0,9,11,2.0
2,15.0,18,20,7.0
3,6.0,6,38,22.0
4,4.0,23,44,28.0
5,12.0,3,67,32.0
6,4.0,2,70,44.0
7,6.0,1,72,48.0
8,,6,73,
9,4.0,6,79,54.0


In [85]:
stops_info.isna().any()

n_Transfers                     True
n_Routes                       False
pointer_stop_routes            False
pointer_transfers_corrected     True
dtype: bool

We sort by `stop_int` and keep the relevant columns in the right order.

In [86]:
stops_df = stops_info[['pointer_stop_routes', 'pointer_transfers_corrected']].sort_index()
stops_df.head(20)

Unnamed: 0_level_0,pointer_stop_routes,pointer_transfers_corrected
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,0.0
1,11,2.0
2,20,7.0
3,38,22.0
4,44,28.0
5,67,32.0
6,70,44.0
7,72,48.0
8,73,
9,79,54.0


We transform to array and pickle

In [88]:
with open('../data/stops_df.pkl','wb') as f: pickle.dump(stops_df, f)

In [90]:
stops_array = stops_df.to_numpy()
stops_array

array([[0, 0.0],
       [11, 2.0],
       [20, 7.0],
       ...,
       [15303, 6242.0],
       [15334, 6250.0],
       [15339, 6257.0]], dtype=object)

In [91]:
np.size(stops_array, 0)

1407

In [92]:
stops_array.shape

(1407, 2)

In [93]:
with open('../data/stops_array_cyril.pkl','wb') as f: pickle.dump(stops_array, f)

### StopRoutes:
Structure: [stop0_route1, stop0_route3, stop1_route1, stop1_route4, stop2_route1, …]

This time we consider the unique routes passing by each stop. A route is counted only once, even if it passes by the same stop twice.

In [94]:
stop_routes = stop_times_ordered[["route_int", "stop_int"]].drop_duplicates().sort_values(["stop_int", "route_int"])
stop_routes = stop_routes.reset_index(drop=True)
stop_routes.head(5)

Unnamed: 0,route_int,stop_int
0,17,0
1,116,0
2,126,0
3,144,0
4,169,0


We check that we have the right number of routes and stops. Both match.

In [95]:
stop_times_curated.route_int.nunique()

1461

In [96]:
stop_routes.route_int.nunique()

1461

In [97]:
stop_routes.stop_int.nunique()

1407

In [59]:
with open('../data/stop_routes_df_cyril.pkl','wb') as f: pickle.dump(stop_routes, f)

In [98]:
stop_routes_array = stop_routes["route_int"].to_numpy()
stop_routes_array

array([  17,  116,  126, ...,  861,  982, 1087])

In [99]:
np.size(stop_routes_array, 0)

15344

In [100]:
stop_routes_array.shape

(15344,)

In [101]:
with open('../data/stop_routes_array_cyril.pkl','wb') as f: pickle.dump(stop_routes_array, f)
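
To show how the first column of Stops and the StopRoutes array work together, here is a hedged sketch of a lookup. It assumes that the routes of a stop run from its pointer up to the pointer of the next stop, or to the end of the array for the last stop. The helper `routes_serving` and the toy arrays are hypothetical.

```python
import numpy as np

# Toy StopRoutes: routes of stop 0, then of stop 1, then of stop 2
stop_routes_toy = np.array([17, 116, 41, 150, 432, 5])
# Toy pointers into StopRoutes, one per stop (first column of Stops)
pointer_stop_routes_toy = np.array([0, 2, 5])

def routes_serving(stop, pointers, stop_routes):
    """Hypothetical helper: routes stored between this stop's pointer and the next one."""
    start = pointers[stop]
    end = pointers[stop + 1] if stop + 1 < len(pointers) else len(stop_routes)
    return stop_routes[start:end]

routes_stop_1 = routes_serving(1, pointer_stop_routes_toy, stop_routes_toy)
routes_stop_2 = routes_serving(2, pointer_stop_routes_toy, stop_routes_toy)
print(routes_stop_1, routes_stop_2)
```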

### Transfer:
Structure: [[stop0_targetStop1, transferTime1], [stop0_targetStop2, transferTime2], …, [stop1_targetStop1, transferTime1], [stop1_targetStop2, transferTime2], …]

We order by stop_int and make sure there are no duplicates

In [111]:
# sanity checking

transfer_test = transfers.sort_values(["stop_int", "stop_int_2"]).drop_duplicates(["stop_int", "stop_int_2"])
transfers.equals(transfer_test)

True

We can also see here that not all stops have transfers

In [112]:
transfers.stop_int.nunique()

1337

In [110]:
with open('../data/transfer_df_cyril.pkl','wb') as f: pickle.dump(transfers, f)

In [113]:
transfer_array = transfers[["stop_int_2", "walking_time"]].to_numpy()
transfer_array

array([[ 815,  267],
       [1350,  569],
       [  63,  470],
       ...,
       [1113,  382],
       [1122,  338],
       [1270,  553]])

In [114]:
with open('../data/transfer_array_cyril.pkl','wb') as f: pickle.dump(transfer_array, f)

In [115]:
transfer_array.shape

(6264, 2)
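
Because some stops have a NaN pointer into Transfers, a lookup has to skip them when searching for the end of a block. The sketch below is one possible way to do this, under the assumption that the pointers are held in a float array. `footpaths_from` and the toy arrays are hypothetical; the five transfer rows are copied from the first rows of the table above, but their grouping by stop is made up.

```python
import numpy as np

# Toy Transfers rows: [target stop_int, walking time in seconds]
transfers_toy = np.array([[815, 267], [1350, 569], [63, 470], [242, 497], [551, 27]])
# Toy pointers into Transfers; NaN marks a stop without footpaths
pointer_transfers_toy = np.array([0.0, np.nan, 2.0])

def footpaths_from(stop, pointers, transfers):
    """Hypothetical helper: rows of Transfers that start at `stop`."""
    start = pointers[stop]
    if np.isnan(start):
        return transfers[:0]  # empty result with the same number of columns
    # The block ends at the next valid pointer, or at the end of the array
    later = pointers[stop + 1:]
    later = later[~np.isnan(later)]
    end = later[0] if later.size else len(transfers)
    return transfers[int(start):int(end)]

from_stop_0 = footpaths_from(0, pointer_transfers_toy, transfers_toy)
from_stop_1 = footpaths_from(1, pointer_transfers_toy, transfers_toy)
from_stop_2 = footpaths_from(2, pointer_transfers_toy, transfers_toy)
print(from_stop_0, from_stop_1, from_stop_2, sep="\n")
```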

#### Check if indexes in stops are correct

In [121]:
stops_df.head(5)

Unnamed: 0_level_0,pointer_stop_routes,pointer_transfers_corrected
stop_int,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,0.0
1,11,2.0
2,20,7.0
3,38,22.0
4,44,28.0


According to `pointer_transfers_corrected`, the transfers of a new stop (stop 1) should start at index 2 of `transfers`.

In [119]:
transfers.head(5)

Unnamed: 0,index,stop_id_general,stop_int,stop_lat_first,stop_lon_first,stop_name_first,stop_id_general_2,stop_int_2,stop_lat_first_2,stop_lon_first_2,stop_name_first_2,distance,walking_time,monotonically_increasing_id
0,6232,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590268,815,47.414212,8.379521,"Spreitenbach, ASP",0.222963,267,0
1,6233,8502508,0,47.415446,8.377185,"Spreitenbach, Raiacker",8590270,1350,47.41795,8.372083,"Spreitenbach, Brüel",0.474276,569,1
2,6234,8503078,1,47.345476,8.593023,Waldburg,8591903,63,47.348349,8.596042,"Zollikerberg, Spital",0.392124,470,2
3,6235,8503078,1,47.345476,8.593023,Waldburg,8591023,242,47.346821,8.598153,"Zollikerb., Langägerten/Spital",0.414395,497,3
4,6236,8503078,1,47.345476,8.593023,Waldburg,8590879,551,47.345391,8.593302,"Waldburg, Station",0.023022,27,4


According to `pointer_stop_routes`, the routes of a new stop (stop 1) should start at index 11 of `stop_routes`.

In [125]:
stop_routes.iloc[8:15]

Unnamed: 0,route_int,stop_int
8,573,0
9,617,0
10,1054,0
11,41,1
12,150,1
13,432,1
14,872,1


We also spot-check that route_stops and stop_routes are coherent, meaning that for a given route both dataframes must contain the same stops.

In [126]:
stop_routes.loc[stop_routes['route_int'] == 200]

Unnamed: 0,route_int,stop_int
1880,200,168
4982,200,434
5608,200,504
8038,200,715
15000,200,1365


In [127]:
route_stops.loc[route_stops['route_int'] == 200]

Unnamed: 0,route_int,stop_int
32867,200,168
32868,200,1365
32869,200,504
32870,200,434
32871,200,715


We also check if it is coherent with our original stop_times.

In [129]:
stop_times_curated.loc[stop_times_curated['route_int'] == 200]

Unnamed: 0,index,route_id,stop_id_general,trip_id,stop_id,arrival_time,departure_time,stop_sequence,stop_name,stop_lat,stop_lon,trip_headsign,trip_short_name,direction_id,departure_first_stop,route_int,stop_count,stop_int,route_desc,monotonically_increasing_id
32867,167649,26-303-j19-1,8590805,532.TA.26-303-j19-1.7.R,8590805,19:53:00,19:53:00,1,"Schlieren, Zentrum/Bahnhof",47.39824,8.447101,"Killwangen, Bahnhof",3555,1,19:53:00,200,5,168,Bus,206158431453
32868,167650,26-303-j19-1,8590794,532.TA.26-303-j19-1.7.R,8590794,19:55:00,19:55:00,2,"Schlieren, Kesslerstrasse",47.396379,8.438873,"Killwangen, Bahnhof",3555,1,19:53:00,200,5,1365,Bus,206158431454
32869,167651,26-303-j19-1,8590797,532.TA.26-303-j19-1.7.R,8590797,19:56:00,19:56:00,3,"Schlieren, Reitmen",47.397109,8.433384,"Killwangen, Bahnhof",3555,1,19:53:00,200,5,504,Bus,206158431455
32870,167652,26-303-j19-1,8590801,532.TA.26-303-j19-1.7.R,8590801,19:58:00,19:58:00,4,"Schlieren, Steinacker",47.396179,8.426377,"Killwangen, Bahnhof",3555,1,19:53:00,200,5,434,Bus,206158431456
32871,167653,26-303-j19-1,8590840,532.TA.26-303-j19-1.7.R,8590840,19:59:00,19:59:00,5,"Urdorf, Luberzen",47.393813,8.423377,"Killwangen, Bahnhof",3555,1,19:53:00,200,5,715,Bus,206158431457
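
The manual comparison for route 200 can be generalised to all routes by comparing the set of stops per route in both tables. This is a sketch on toy data; `stops_per_route` is a hypothetical helper and the toy tables are made up.

```python
import pandas as pd

# Toy RouteStops: route 0 is a loop that visits stop 7 twice
toy_route_stops = pd.DataFrame({
    "route_int": [0, 0, 0, 0, 1, 1],
    "stop_int":  [7, 3, 5, 7, 3, 9],
})
# Toy StopRoutes, built like in the notebook: unique pairs sorted by stop, then route
toy_stop_routes = (
    toy_route_stops.drop_duplicates()
    .sort_values(["stop_int", "route_int"])
    .reset_index(drop=True)
)

def stops_per_route(df):
    """Hypothetical helper: map each route_int to the set of stop_int it visits."""
    return {route: set(group["stop_int"]) for route, group in df.groupby("route_int")}

coherent = stops_per_route(toy_route_stops) == stops_per_route(toy_stop_routes)
# Dropping one row from StopRoutes must be detected as an incoherence
incoherent = stops_per_route(toy_route_stops) == stops_per_route(toy_stop_routes.iloc[1:])
print(coherent, incoherent)
```

Sets are used on purpose: RouteStops keeps repeated stops and the visiting order, while StopRoutes keeps only unique pairs.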


#### Pickle Check
We check that the files can be read back from pickle.

In [158]:
with open('../data/stop_times_array_cyril.pkl','rb') as f: arrayname1 = pickle.load(f)

In [159]:
with open('../data/routes_array_cyril.pkl','rb') as f: arrayname2 = pickle.load(f)

In [160]:
with open('../data/route_stops_array_cyril.pkl','rb') as f: arrayname3 = pickle.load(f)

In [161]:
arrayname1

array([[                          'NaT', '2020-05-23T07:00:00.000000000'],
       ['2020-05-23T07:01:00.000000000', '2020-05-23T07:01:00.000000000'],
       ['2020-05-23T07:02:00.000000000', '2020-05-23T07:02:00.000000000'],
       ...,
       ['2020-05-23T07:35:00.000000000', '2020-05-23T07:35:00.000000000'],
       ['2020-05-23T07:36:00.000000000', '2020-05-23T07:36:00.000000000'],
       ['2020-05-23T07:37:00.000000000',                           'NaT']],
      dtype='datetime64[ns]')

In [162]:
arrayname2

array([[     1,     26,      0,      0],
       [     1,      8,     26,     26],
       [     1,     17,     34,     34],
       ...,
       [     1,      3,  15297, 260396],
       [     2,     16,  15300, 260399],
       [     1,     28,  15316, 260431]])

In [163]:
arrayname3

array([1221,  816,  776, ..., 1349, 1037,  552])
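
Beyond inspecting the loaded arrays by eye, a round trip can be verified programmatically. The sketch below writes a small toy array to a temporary directory, so it does not touch the files in `../data`; the file name is arbitrary.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy array with the same layout as the first rows of the Routes array
array_out = np.array([[1, 26, 0, 0], [1, 8, 26, 26]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "routes_array_example.pkl")
    with open(path, "wb") as f:
        pickle.dump(array_out, f)
    with open(path, "rb") as f:
        array_in = pickle.load(f)

# Values and dtype must both survive the round trip
round_trip_ok = bool(np.array_equal(array_out, array_in)) and array_out.dtype == array_in.dtype
print(round_trip_ok)
```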