## GTFS
To compute accessibility from GTFS data (General Transit Feed Specification; the feed can be downloaded from ftp://199.203.58.18/), we first perform some basic processing of the data.
We use the pandas library.

The code is based on the following structure of GTFS tables:
![GTFS Tables](../../input_data/GTFS_tables.PNG)

In [97]:
import pandas as pd
import datetime as dt
import pickle
from tqdm.auto import tqdm
tqdm.pandas()

DATA_PATH = '../../../input_data/synthetic_examples/input_data/Test0_b_TwoLines/'

In [98]:
OUTPUT_PATH = '../../../output_data/validation/two_lines/'

In [99]:
DAY = dt.datetime(2015, 12, 14)

## Process Calendar - Get trips for a single day

In [100]:
# Load calendar
calendar_df = pd.read_csv(DATA_PATH + 'calendar.txt')
# Convert dates to python's datetime type
calendar_df['start_date'] = calendar_df['start_date'].apply(lambda x: dt.datetime.strptime(str(x), '%Y%m%d'))
calendar_df['end_date'] = calendar_df['end_date'].apply(lambda x: dt.datetime.strptime(str(x), '%Y%m%d'))

calendar_df.columns = ['service_id',
 'sunday',
 'monday',
 'tuesday',
 'wednesday',
 'thursday',
 'friday',
 'saturday',
 'start_date',
 'end_date']

calendar_df[:3]

Unnamed: 0,service_id,sunday,monday,tuesday,wednesday,thursday,friday,saturday,start_date,end_date
0,31360372,1,1,1,1,1,1,1,2015-01-01,2016-12-31
1,31349568,1,1,1,1,1,1,1,2015-01-01,2016-12-31
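
The per-row `strptime` calls above can also be written with pandas' vectorized parser. A minimal sketch on a toy frame (the two rows mirror the output above; the name `toy_calendar_df` is made up for illustration):

```python
import pandas as pd

# Toy calendar with GTFS-style integer dates (YYYYMMDD), as pd.read_csv would load them
toy_calendar_df = pd.DataFrame({
    'service_id': [31360372, 31349568],
    'start_date': [20150101, 20150101],
    'end_date': [20161231, 20161231],
})

# Parse each date column in one vectorized call instead of a per-row lambda
for col in ['start_date', 'end_date']:
    toy_calendar_df[col] = pd.to_datetime(toy_calendar_df[col].astype(str), format='%Y%m%d')

print(toy_calendar_df.dtypes)
```

The explicit `format` avoids any guessing about day/month order.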


In [101]:
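# We want all trips that run on the selected DAY.

# Filter so we only keep services that are active on DAY's weekday.
# (2015-12-14 is a Monday; the variables below keep the 'sunday_' prefix used throughout.)
weekday_columns = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
day_column = weekday_columns[DAY.weekday()]  # datetime.weekday(): Monday == 0
sunday_services_df = calendar_df[calendar_df[day_column] == 1][['service_id', 'start_date', 'end_date']]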

# Keep only services that start during/before selected date
sunday_services_df = sunday_services_df[sunday_services_df['start_date'] <= DAY]

# Keep only services that end during/after selected date
sunday_services_df = sunday_services_df[sunday_services_df['end_date'] >= DAY]

In [102]:
sunday_services_df

Unnamed: 0,service_id,start_date,end_date
0,31360372,2015-01-01,2016-12-31
1,31349568,2015-01-01,2016-12-31


## Process Trips

In [103]:
# Load trips
trips_df = pd.read_csv(DATA_PATH + 'trips.txt')

In [104]:
trips_df.head(2)

Unnamed: 0,route_id,service_id,trip_id,direction_id,shape_id
0,9823,31360372,18854222_161215,0,64333
1,9823,31360372,18854225_161215,0,64333


In [105]:


# Fix column names (some columns have special 'hidden' characters that we want to remove)
trips_df.columns = ['route_id', 'service_id', 'trip_id', 'direction_is', 'shape_id']

trips_calendar_df = sunday_services_df.merge(trips_df, on='service_id', suffixes=('_calendar', '_trips'))
sunday_trips_df = trips_calendar_df.drop(['start_date', 'end_date'], axis=1)
sunday_trips_df[:3]

Unnamed: 0,service_id,route_id,trip_id,direction_is,shape_id
0,31360372,9823,18854222_161215,0,64333
1,31360372,9823,18854225_161215,0,64333
2,31360372,9823,18854226_161215,0,64333


In [106]:
sunday_trips_df.nunique()

service_id        2
route_id          2
trip_id         161
direction_is      1
shape_id          2
dtype: int64

### We now have all trips that occurred on the selected date

## Process Stop Times
Note: `stop_times.txt` is much larger than the other tables, so this step is slow compared to the rest.

In [107]:
# Load stop times
stop_times_df = pd.read_csv(DATA_PATH + 'stop_times.txt')

# Get all trips departures by getting the minimal departure time for each trip
trips_start_times_df = stop_times_df.groupby('trip_id').agg({'departure_time': 'min'})

# Join the last two tables to get the departure times of all trips on the selected day
sunday_departures_df = sunday_trips_df.merge(trips_start_times_df, on='trip_id', suffixes=('_departures', '_trips'))

In [108]:
sunday_departures_df.head(2)

Unnamed: 0,service_id,route_id,trip_id,direction_is,shape_id,departure_time
0,31360372,9823,18854222_161215,0,64333,12:05:00
1,31360372,9823,18854225_161215,0,64333,13:35:00
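
Taking the minimum of `departure_time` works here because the values are zero-padded `HH:MM:SS` strings, so string order matches time order. As a hedged alternative, assuming `stop_sequence` is present in `stop_times.txt` (it appears in the nodes table further down), the first departure can be read from the lowest `stop_sequence` of each trip. The toy data and names below are illustrative only:

```python
import pandas as pd

toy_stop_times_df = pd.DataFrame({
    'trip_id': ['a', 'a', 'a', 'b', 'b'],
    'stop_sequence': [2, 1, 3, 1, 2],
    'departure_time': ['12:10:00', '12:05:00', '12:20:00', '23:55:00', '24:05:00'],
})

# Order stops within each trip, then take the first row per trip
first_departures_df = (
    toy_stop_times_df
    .sort_values(['trip_id', 'stop_sequence'])
    .groupby('trip_id', as_index=False)
    .first()[['trip_id', 'departure_time']]
)
print(first_departures_df)
```

This variant does not depend on how the time strings are formatted, only on the stop order.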


## Process Stops

In [109]:
def convert_gtfs_time_to_datetime(gtfs_time):
    # GTFS times are 'HH:MM:SS' strings relative to the service day (DAY); the hour may be 24 or more
    h, m, s = [int(x) for x in gtfs_time.split(':')]
    if h < 24:
        # This is the 'normal' situation: we can simply add the time to the date we defined before
        return DAY + dt.timedelta(hours=h, minutes=m, seconds=s)
    # Otherwise the time is 24:00:00 or later, i.e. it falls on the next calendar day
    new_date = DAY + dt.timedelta(days=1)
    return new_date + dt.timedelta(hours=h-24, minutes=m, seconds=s)
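
GTFS allows times of `24:00:00` and later for trips that run past midnight of the service day. Below is a small sketch of the same conversion with the service day passed in explicitly. The name `gtfs_time_to_datetime` and its `service_day` parameter are illustrative, not part of the notebook. It relies on `datetime.timedelta` accepting hour values of 24 or more:

```python
import datetime as dt

def gtfs_time_to_datetime(gtfs_time, service_day):
    """Convert a GTFS 'HH:MM:SS' string to a datetime on (or after) service_day."""
    h, m, s = (int(part) for part in gtfs_time.split(':'))
    # timedelta normalizes hours >= 24 into days, so no special case is needed
    return service_day + dt.timedelta(hours=h, minutes=m, seconds=s)

day = dt.datetime(2015, 12, 14)
print(gtfs_time_to_datetime('07:35:48', day))  # 2015-12-14 07:35:48
print(gtfs_time_to_datetime('25:30:00', day))  # 2015-12-15 01:30:00
```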

In [110]:
# Load stops
stops_df = pd.read_csv(DATA_PATH + 'stops.txt')

stops_df.head(2)

Unnamed: 0,stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type,parent_station
0,775,39739,ההסתדרות/שנקר,רחוב:ההסתדרות 54 עיר: חולון רציף: קומה:,32.017504,34.774177,0,
1,12882,21128,אבן גבירול/שד' נורדאו,רחוב:אבן גבירול 155 עיר: תל אביב יפו רציף: ...,32.091212,34.782786,0,


In [111]:
# Add stop code to stop times
stop_times_with_stop_codes_df = stop_times_df.merge(
    stops_df[['stop_id', 'stop_code']], on='stop_id')

stop_times_with_stop_codes_df['departure_time'] = stop_times_with_stop_codes_df[
    'departure_time'].apply(convert_gtfs_time_to_datetime)

## Construct Nodes

In [112]:
# Right-join stop_times with the selected day's departures: this keeps only the stop times of trips
# that run on that day, together with each trip's departure time.
sunday_nodes_df = stop_times_df.merge(sunday_departures_df, how='right', on='trip_id', suffixes=('_stop', '_trip_departure'))

# Remove some columns to clear the data
sunday_nodes_df = sunday_nodes_df.drop(['pickup_type',
                                        'drop_off_type', 'service_id', 'direction_is', 'shape_id'], 
                                       axis=1)

# Add stops data to nodes
nodes_df = sunday_nodes_df.merge(stops_df, on='stop_id', suffixes=('_node', '_stop'))
nodes_df = nodes_df.drop(['stop_desc', 'stop_name', 'parent_station', 'location_type'],axis=1)


nodes_df[:3]

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon
0,18854219_161215,11:02:48,11:02:48,13257,30,9823,10:20:00,22557,32.040172,34.775255
1,18854222_161215,12:47:48,12:47:48,13257,30,9823,12:05:00,22557,32.040172,34.775255
2,18854225_161215,14:17:48,14:17:48,13257,30,9823,13:35:00,22557,32.040172,34.775255


In [113]:
# Convert GTFS times to match "real-world time".
nodes_df['arrival'] = nodes_df['arrival_time'].apply(convert_gtfs_time_to_datetime)
nodes_df['departure'] = nodes_df['departure_time_stop'].apply(convert_gtfs_time_to_datetime)

In [114]:
nodes_df.head()

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
0,18854219_161215,11:02:48,11:02:48,13257,30,9823,10:20:00,22557,32.040172,34.775255,2015-12-14 11:02:48,2015-12-14 11:02:48
1,18854222_161215,12:47:48,12:47:48,13257,30,9823,12:05:00,22557,32.040172,34.775255,2015-12-14 12:47:48,2015-12-14 12:47:48
2,18854225_161215,14:17:48,14:17:48,13257,30,9823,13:35:00,22557,32.040172,34.775255,2015-12-14 14:17:48,2015-12-14 14:17:48
3,18854226_161215,14:44:48,14:44:48,13257,30,9823,14:02:00,22557,32.040172,34.775255,2015-12-14 14:44:48,2015-12-14 14:44:48
4,18854227_161215,14:56:48,14:56:48,13257,30,9823,14:14:00,22557,32.040172,34.775255,2015-12-14 14:56:48,2015-12-14 14:56:48


In [115]:
nodes_df.shape

(6207, 12)

## Some Stats on the Overall Nodes For the Day 

In [116]:
nodes_df.nunique()

trip_id                           161
arrival_time                     5975
departure_time_stop              5975
stop_id                            80
stop_sequence                      50
route_id                            2
departure_time_trip_departure     155
stop_code                          80
stop_lat                           80
stop_lon                           80
arrival                          5975
departure                        5975
dtype: int64

In [117]:
# TODO: add node_id according to index, and save in pkl

In [118]:
nodes_df.to_pickle(OUTPUT_PATH + 'all_nodes.pkl')

In [119]:
# Nodes arriving after 07:00 on the selected day
start_time = DAY + dt.timedelta(hours=7)
nodes_df[nodes_df['arrival'] > start_time].sort_values('arrival')

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
1494,18854212_161215,07:00:01,07:00:01,29568,7,9823,06:53:00,20482,32.117333,34.806280,2015-12-14 07:00:01,2015-12-14 07:00:01
2303,18854384_161215,07:00:13,07:00:13,29488,19,9823,06:35:00,20361,32.075101,34.781570,2015-12-14 07:00:13,2015-12-14 07:00:13
1558,18854212_161215,07:00:32,07:00:32,13229,8,9823,06:53:00,21734,32.115503,34.807217,2015-12-14 07:00:32,2015-12-14 07:00:32
5527,19283213_161215,07:00:34,07:00:34,13821,15,2517,06:40:00,26099,32.073809,34.800651,2015-12-14 07:00:34,2015-12-14 07:00:34
85,18854211_161215,07:00:37,07:00:37,13428,31,9823,06:15:00,25273,32.037780,34.777806,2015-12-14 07:00:37,2015-12-14 07:00:37
4840,19283214_161215,07:00:46,07:00:46,13584,7,2517,06:50:00,25612,32.072437,34.778119,2015-12-14 07:00:46,2015-12-14 07:00:46
1622,18854212_161215,07:01:00,07:01:00,29398,9,9823,06:53:00,20137,32.113547,34.807402,2015-12-14 07:01:00,2015-12-14 07:01:00
3442,19283212_161215,07:01:06,07:01:06,13896,24,2517,06:30:00,26207,32.072195,34.822999,2015-12-14 07:01:06,2015-12-14 07:01:06
2367,18854384_161215,07:01:20,07:01:20,14530,20,9823,06:35:00,28561,32.071828,34.781736,2015-12-14 07:01:20,2015-12-14 07:01:20
4937,19283214_161215,07:01:24,07:01:24,13575,8,2517,06:50:00,25592,32.071428,34.778882,2015-12-14 07:01:24,2015-12-14 07:01:24


In [120]:
start_time = DAY + dt.timedelta(hours=7)
end_time = start_time + dt.timedelta(hours=1, minutes=30)

# Combine both conditions in a single mask (chained indexing reindexes the second mask and raises a warning)
morning_nodes_df = nodes_df[(nodes_df['arrival'] > start_time) & (nodes_df['arrival'] < end_time)]
morning_nodes_df.head(3)


Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
22,18854212_161215,07:35:48,07:35:48,13257,30,9823,06:53:00,22557,32.040172,34.775255,2015-12-14 07:35:48,2015-12-14 07:35:48
40,18854213_161215,07:52:48,07:52:48,13257,30,9823,07:10:00,22557,32.040172,34.775255,2015-12-14 07:52:48,2015-12-14 07:52:48
41,18854214_161215,08:22:48,08:22:48,13257,30,9823,07:40:00,22557,32.040172,34.775255,2015-12-14 08:22:48,2015-12-14 08:22:48


In [121]:
morning_nodes_df.nunique()

trip_id                           22
arrival_time                     533
departure_time_stop              533
stop_id                           80
stop_sequence                     50
route_id                           2
departure_time_trip_departure     21
stop_code                         80
stop_lat                          80
stop_lon                          80
arrival                          533
departure                        533
dtype: int64

In [122]:
morning_nodes_df = morning_nodes_df.reset_index()

In [123]:
morning_nodes_df['node_id'] = morning_nodes_df.index

In [124]:
morning_nodes_df.head(3)

Unnamed: 0,index,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure,node_id
0,22,18854212_161215,07:35:48,07:35:48,13257,30,9823,06:53:00,22557,32.040172,34.775255,2015-12-14 07:35:48,2015-12-14 07:35:48,0
1,40,18854213_161215,07:52:48,07:52:48,13257,30,9823,07:10:00,22557,32.040172,34.775255,2015-12-14 07:52:48,2015-12-14 07:52:48,1
2,41,18854214_161215,08:22:48,08:22:48,13257,30,9823,07:40:00,22557,32.040172,34.775255,2015-12-14 08:22:48,2015-12-14 08:22:48,2


In [125]:
morning_nodes_df.to_pickle(OUTPUT_PATH + 'morning_nodes.pkl')
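
The index reset and `node_id` assignment above can be condensed. Below is a sketch on toy data (frame and file names are made up) that assigns `node_id` from a fresh 0-based index and checks that the pickle round-trips. Whether to keep the old labels as an `index` column, as the notebook does, depends on whether downstream code needs them:

```python
import os
import tempfile

import pandas as pd

toy_nodes_df = pd.DataFrame(
    {'trip_id': ['a', 'b', 'c'], 'stop_id': [13257, 13257, 29568]},
    index=[22, 40, 41],  # leftover labels from an earlier filter
)

# drop=True discards the old labels instead of keeping them as an 'index' column
toy_nodes_df = toy_nodes_df.reset_index(drop=True)
toy_nodes_df['node_id'] = toy_nodes_df.index

# Write to a temporary directory and read back to confirm nothing is lost
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_nodes.pkl')
    toy_nodes_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

print(restored_df.equals(toy_nodes_df))
```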

## Filter only Tel Aviv Metropolitan stops

Let's get all Tel Aviv (TLV) stops 

In [None]:
for zone in stops_df['zone_id'].unique():
    print(zone)

In [None]:
test_stops = stops_df[stops_df['zone_id'] == 6900]
test_stops.to_csv(OUTPUT_PATH + 'test_stops.csv')

In [None]:
test_stops.shape

In [None]:
stops_df.shape

In [None]:
# Let's see how many NaN zones we have:
stops_df['zone_id'].isna().sum()

In [None]:
# Zone 210 appears to be the Tel Aviv metropolitan area (only its small core; it would need to be extended for real computations).
# We need to keep only trips that contain stops in this zone (and then only the nodes of those trips).
tlv_stops_df = stops_df[stops_df['zone_id'] == 210]

In [None]:
tlv_stops_df.shape

Now we need to get the TLV stop times, in order to find all trips that include TLV stops

In [None]:
# stop_times_df
tlv_stop_times_df = stop_times_df.merge(tlv_stops_df[['stop_id']], on='stop_id', how='inner')

In [None]:
tlv_stop_times_df.shape

In [None]:
stop_times_df.shape

The TLV stop times contain the ids of all trips that have TLV stops. Let's find the relevant trips among the trips of the day we're examining.

In [None]:
# We want the unique trips that pass through TLV; we will then use these trips
# to keep only the nodes that belong to them.
# Note: the default keep='first' keeps one row per trip. keep=False would instead drop
# every trip that has more than one TLV stop.
tlv_trips_df = tlv_stop_times_df[['trip_id']].drop_duplicates()

In [None]:
tlv_trips_df.head()

In [None]:
tlv_trips_df.shape
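
When collecting unique trip ids, the `keep` argument of `drop_duplicates` matters: the default keeps one row per trip, while `keep=False` discards every trip that appears more than once, leaving only trips with a single matching stop. A toy illustration (data and names are made up):

```python
import pandas as pd

toy_tlv_stop_times_df = pd.DataFrame({
    'trip_id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'stop_id': [1, 2, 3, 4, 5, 6],
})

# Default keep='first': one row per distinct trip
unique_trips = toy_tlv_stop_times_df[['trip_id']].drop_duplicates()
# keep=False: rows whose trip_id occurs more than once are all removed
only_singletons = toy_tlv_stop_times_df[['trip_id']].drop_duplicates(keep=False)

print(unique_trips['trip_id'].tolist())     # ['a', 'b', 'c']
print(only_singletons['trip_id'].tolist())  # ['b']
```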

Finally, let's keep only the nodes that belong to TLV trips

In [None]:
tlv_nodes_df = nodes_df.merge(tlv_trips_df[['trip_id']], on='trip_id', how='inner')

In [None]:
nodes_df.shape

In [None]:
tlv_nodes_df.shape
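
An inner merge on `trip_id` filters the nodes, but it multiplies rows if the right-hand table contains repeated trip ids. A possible alternative, sketched on toy data with made-up names, is a boolean mask with `Series.isin`, which never duplicates rows:

```python
import pandas as pd

toy_nodes_df = pd.DataFrame({
    'trip_id': ['a', 'a', 'b', 'c'],
    'stop_id': [1, 2, 3, 4],
})
toy_tlv_trip_ids = pd.Series(['a', 'c', 'c'])  # duplicates are harmless with isin

# Keep only the nodes whose trip_id is in the set of TLV trips
toy_tlv_nodes_df = toy_nodes_df[toy_nodes_df['trip_id'].isin(toy_tlv_trip_ids)]
print(toy_tlv_nodes_df)
```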

In [None]:
tlv_nodes_df.to_csv(OUTPUT_PATH + 'tlv_nodes.csv')