## GTFS
In order to compute accessibility based on GTFS (General Transit Feed Specification) data, which can be downloaded from ftp://199.203.58.18/, we will first perform basic processing of the data.
We are using the pandas library.

The code is based on the following structure of GTFS tables:
![GTFS Tables](../../input_data/GTFS_tables.PNG)

In [126]:
import pandas as pd
import datetime as dt
import pickle
from tqdm.auto import tqdm
tqdm.pandas()

DATA_PATH = '../../../input_data/synthetic_examples/input_data/Test1/'

In [127]:
OUTPUT_PATH = '../../../output_data/validation/two_lines/'

In [128]:
DAY = dt.datetime(2015, 12, 14)

## Process Calendar - Get trips for a single day

In [129]:
# Load calendar
calendar_df = pd.read_csv(DATA_PATH + 'calendar.txt')
# Convert the YYYYMMDD integers to datetimes
calendar_df['start_date'] = pd.to_datetime(calendar_df['start_date'].astype(str), format='%Y%m%d')
calendar_df['end_date'] = pd.to_datetime(calendar_df['end_date'].astype(str), format='%Y%m%d')

calendar_df.columns = ['service_id',
 'sunday',
 'monday',
 'tuesday',
 'wednesday',
 'thursday',
 'friday',
 'saturday',
 'start_date',
 'end_date']

calendar_df[:3]

Unnamed: 0,service_id,sunday,monday,tuesday,wednesday,thursday,friday,saturday,start_date,end_date
0,31349658,1,1,1,1,1,0,0,2015-12-06,2015-12-07
1,31349659,1,1,1,1,1,0,0,2015-12-15,2016-02-04
2,31349660,0,0,0,0,0,1,0,2015-12-15,2016-02-04


In [130]:
# Let's say we want all trips that occurred on the selected date (DAY).
# Note: the filter below checks the 'sunday' flag, but DAY (2015-12-14) falls on a Monday.

# Filter so we only keep services that are active on Sunday.
sunday_services_df = calendar_df[calendar_df['sunday'] == 1][['service_id', 'start_date', 'end_date']]

# Keep only services that start during/before selected date
sunday_services_df = sunday_services_df[sunday_services_df['start_date'] <= DAY]

# Keep only services that end during/after selected date
sunday_services_df = sunday_services_df[sunday_services_df['end_date'] >= DAY]

In [131]:
sunday_services_df

Unnamed: 0,service_id,start_date,end_date
4,31349669,2015-12-08,2015-12-14
8,31360267,2015-12-09,2015-12-14


## Process Trips

In [132]:
# Load trips
trips_df = pd.read_csv(DATA_PATH + 'trips.txt')

In [133]:
trips_df.head(2)

Unnamed: 0,route_id,service_id,trip_id,direction_id,shape_id
0,9807,31360272,18673606_161215,1,64069
1,9807,31360272,18673610_161215,1,64069


In [134]:


# Fix column names (some columns have special 'hidden' characters that we want to remove)
trips_df.columns = ['route_id', 'service_id', 'trip_id', 'direction_is', 'shape_id']

trips_calendar_df = sunday_services_df.merge(trips_df, on='service_id', suffixes=('_calendar', '_trips'))
sunday_trips_df = trips_calendar_df.drop(['start_date', 'end_date'], axis=1)
sunday_trips_df[:3]

Unnamed: 0,service_id,route_id,trip_id,direction_is,shape_id
0,31349669,2544,19407596_081215,1,62838
1,31349669,2544,19407597_081215,1,62838
2,31349669,2544,19407617_081215,1,62838


In [135]:
sunday_trips_df.nunique()

service_id        2
route_id          2
trip_id         143
direction_is      1
shape_id          2
dtype: int64

### We now have all trips that occurred on the selected date
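
The calendar filter and the trips join above can be folded into one helper that derives the weekday column from the date itself, so the weekday flag and the date cannot drift apart. This is only a minimal sketch, not the notebook's code: `trips_on_day`, `WEEKDAY_COLUMNS` and the sample frames are hypothetical. It assumes the calendar table has lowercase weekday columns and datetime `start_date`/`end_date` columns, and it ignores the exceptions that a feed may list in `calendar_dates.txt`.

```python
import datetime as dt
import pandas as pd

# Column names in the order of datetime.weekday() (Monday == 0)
WEEKDAY_COLUMNS = ['monday', 'tuesday', 'wednesday', 'thursday',
                   'friday', 'saturday', 'sunday']

def trips_on_day(calendar_df, trips_df, day):
    """Return the trips whose service is active on `day` (hypothetical helper)."""
    weekday_col = WEEKDAY_COLUMNS[day.weekday()]
    active = calendar_df[
        (calendar_df[weekday_col] == 1)
        & (calendar_df['start_date'] <= day)
        & (calendar_df['end_date'] >= day)
    ]
    return trips_df.merge(active[['service_id']], on='service_id')

# Made-up sample data, only to illustrate the call
sample_calendar = pd.DataFrame({
    'service_id': [1, 2, 3],
    'monday': [1, 1, 0],
    'start_date': pd.to_datetime(['2015-12-08', '2015-12-01', '2015-12-08']),
    'end_date': pd.to_datetime(['2015-12-14', '2015-12-10', '2015-12-20']),
})
sample_trips = pd.DataFrame({'trip_id': ['a', 'b', 'c', 'd'],
                             'service_id': [1, 2, 3, 1]})

sample_day_trips = trips_on_day(sample_calendar, sample_trips,
                                dt.datetime(2015, 12, 14))
```

In the sample, service 2 has already ended and service 3 does not run on Mondays, so only the trips of service 1 remain.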

## Process Stop Times
Note: `stop_times.txt` is considerably heavier to load and process than the rest of the tables
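
If the full table does not fit comfortably in memory, one option is to read it in chunks and keep only the rows of the trips selected earlier. The sketch below is an assumption about how that could look, not part of the original notebook: `load_stop_times` is a hypothetical helper and the CSV content is made up. Note that it also drops `pickup_type` and `drop_off_type` at read time, so a later `drop` of those columns would no longer apply.

```python
import io
import pandas as pd

def load_stop_times(source, trip_ids, chunksize=100_000):
    """Read stop_times in chunks, keeping only rows of the given trips (hypothetical helper)."""
    wanted = set(trip_ids)
    cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence']
    chunks = pd.read_csv(source, usecols=cols, dtype={'trip_id': str},
                         chunksize=chunksize)
    # Filter each chunk before concatenating, so unwanted rows are never accumulated
    kept = [chunk[chunk['trip_id'].isin(wanted)] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)

# Made-up file content, only to illustrate the call
sample_csv = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type\n"
    "t1,07:00:00,07:00:00,10,1,0,1\n"
    "t1,07:05:00,07:05:00,11,2,0,0\n"
    "t2,08:00:00,08:00:00,10,1,0,1\n"
)
sample_stop_times = load_stop_times(sample_csv, ['t1'], chunksize=2)
```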

In [136]:
# Load stop times
stop_times_df = pd.read_csv(DATA_PATH + 'stop_times.txt')

# Get all trips departures by getting the minimal departure time for each trip
trips_start_times_df = stop_times_df.groupby('trip_id').agg({'departure_time': 'min'})

# Let's join the last two tables to get the departure times of all Sunday trips
sunday_departures_df = sunday_trips_df.merge(trips_start_times_df, on='trip_id', suffixes=('_departures', '_trips'))

In [137]:
sunday_departures_df.head(2)

Unnamed: 0,service_id,route_id,trip_id,direction_is,shape_id,departure_time
0,31349669,2544,19407596_081215,1,62838,06:59:00
1,31349669,2544,19407597_081215,1,62838,07:08:00


## Process Stops

In [138]:
def convert_gtfs_time_to_datetime(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' string to a datetime relative to DAY."""
    h, m, s = [int(x) for x in gtfs_time.split(':')]
    if h < 24:
        # Normal case: the time falls on DAY itself
        return DAY + dt.timedelta(hours=h, minutes=m, seconds=s)
    # GTFS writes times past midnight of the service day with hours >= 24,
    # so we shift them to the next calendar day
    new_date = DAY + dt.timedelta(days=1)
    return new_date + dt.timedelta(hours=h-24, minutes=m, seconds=s)
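
The same conversion can also be done on a whole column at once, which may help on the large stop times table. This is a sketch under the assumption that every value is a well-formed `HH:MM:SS` string; `gtfs_times_to_datetimes` is a hypothetical name, and `DAY` is repeated here only to keep the example self-contained. It relies on `pd.to_timedelta` accepting hour values above 23, so no special case for times after midnight is needed.

```python
import datetime as dt
import pandas as pd

DAY = dt.datetime(2015, 12, 14)  # same date as the notebook's DAY

def gtfs_times_to_datetimes(times, day=DAY):
    """Convert a Series of 'HH:MM:SS' strings (hours may exceed 23) to datetimes."""
    # Each string becomes an offset from midnight of the service day
    return pd.Timestamp(day) + pd.to_timedelta(times)

# Made-up sample values, including one time after midnight
sample_times = pd.Series(['06:59:00', '23:59:59', '25:10:00'])
converted = gtfs_times_to_datetimes(sample_times)
```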

In [139]:
# Load stops
stops_df = pd.read_csv(DATA_PATH + 'stops.txt')

stops_df.head(2)

Unnamed: 0,stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type,parent_station
0,12834,21038,הועד הפועל/ארלוזורוב,רחוב:ארלוזורוב 93 עיר: תל אביב יפו רציף: קו...,32.084755,34.784677,0,
1,12841,21068,ארלוזורוב/משה שרת,רחוב:ארלוזורוב 117 עיר: תל אביב יפו רציף: ק...,32.084074,34.787875,0,


In [140]:
# Add stop code to stop times
stop_times_with_stop_codes_df = stop_times_df.merge(
    stops_df[['stop_id', 'stop_code']], on='stop_id')

stop_times_with_stop_codes_df['departure_time'] = stop_times_with_stop_codes_df[
    'departure_time'].apply(convert_gtfs_time_to_datetime)

## Construct Nodes

In [141]:
# Right-join stop_times with the Sunday departures to get every stop time of the Sunday trips, together with each trip's departure time.
sunday_nodes_df = stop_times_df.merge(sunday_departures_df, how='right', on='trip_id', suffixes=('_stop', '_trip_departure'))

# Remove some columns to clean up the data
sunday_nodes_df = sunday_nodes_df.drop(['pickup_type',
                                        'drop_off_type', 'service_id', 'direction_is', 'shape_id'], 
                                       axis=1)

# Add stops data to nodes
nodes_df = sunday_nodes_df.merge(stops_df, on='stop_id', suffixes=('_node', '_stop'))
nodes_df = nodes_df.drop(['stop_desc', 'stop_name', 'parent_station', 'location_type'],axis=1)


nodes_df[:3]

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon
0,19407595_081215,06:57:06,06:57:06,12962,6,2544,06:50:00,21257,32.076767,34.844303
1,19407596_081215,07:06:06,07:06:06,12962,6,2544,06:59:00,21257,32.076767,34.844303
2,19407597_081215,07:15:06,07:15:06,12962,6,2544,07:08:00,21257,32.076767,34.844303
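
The right join above silently assumes that each `trip_id` appears once in the departures table and that every Sunday trip has stop times. The sketch below shows one way to check both assumptions with the `validate` and `indicator` parameters of `merge`. The sample frames are made up and this check is not part of the original notebook.

```python
import pandas as pd

# Made-up sample data: trip t3 has a departure but no stop times
sample_stop_times = pd.DataFrame({
    'trip_id': ['t1', 't1', 't2'],
    'stop_id': [10, 11, 10],
    'departure_time': ['07:00:00', '07:05:00', '08:00:00'],
})
sample_departures = pd.DataFrame({
    'trip_id': ['t1', 't3'],
    'departure_time': ['07:00:00', '09:00:00'],
})

merged = sample_stop_times.merge(
    sample_departures,
    how='right',
    on='trip_id',
    suffixes=('_stop', '_trip_departure'),
    validate='many_to_one',  # raises MergeError if a trip_id repeats on the right side
    indicator=True,          # adds a '_merge' column: 'both' or 'right_only'
)

# Trips that were selected for the day but have no rows in stop_times
unmatched_trips = merged.loc[merged['_merge'] == 'right_only', 'trip_id'].tolist()
```

The `suffixes` argument only affects columns that exist on both sides, which is why `departure_time` is split into `departure_time_stop` and `departure_time_trip_departure`.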


In [142]:
# Convert GTFS times to match "real-world time".
nodes_df['arrival'] = nodes_df['arrival_time'].apply(convert_gtfs_time_to_datetime)
nodes_df['departure'] = nodes_df['departure_time_stop'].apply(convert_gtfs_time_to_datetime)

In [143]:
nodes_df.head()

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
0,19407595_081215,06:57:06,06:57:06,12962,6,2544,06:50:00,21257,32.076767,34.844303,2015-12-14 06:57:06,2015-12-14 06:57:06
1,19407596_081215,07:06:06,07:06:06,12962,6,2544,06:59:00,21257,32.076767,34.844303,2015-12-14 07:06:06,2015-12-14 07:06:06
2,19407597_081215,07:15:06,07:15:06,12962,6,2544,07:08:00,21257,32.076767,34.844303,2015-12-14 07:15:06,2015-12-14 07:15:06
3,19407598_081215,07:24:06,07:24:06,12962,6,2544,07:17:00,21257,32.076767,34.844303,2015-12-14 07:24:06,2015-12-14 07:24:06
4,19407599_081215,07:33:06,07:33:06,12962,6,2544,07:26:00,21257,32.076767,34.844303,2015-12-14 07:33:06,2015-12-14 07:33:06


In [144]:
nodes_df.shape

(4670, 12)

## Some Stats on the Overall Nodes For the Day 

In [145]:
nodes_df.nunique()

trip_id                           143
arrival_time                     4506
departure_time_stop              4506
stop_id                            64
stop_sequence                      37
route_id                            2
departure_time_trip_departure     137
stop_code                          64
stop_lat                           64
stop_lon                           64
arrival                          4506
departure                        4506
dtype: int64

In [146]:
# TODO: add node_id according to index, and save in pkl

In [147]:
nodes_df.to_pickle(OUTPUT_PATH + 'all_nodes.pkl')
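
One possible way to address the TODO above is to renumber the rows and store that position as an explicit `node_id` column before pickling. This is a sketch on made-up data with a temporary output directory, not the notebook's final approach.

```python
import os
import tempfile
import pandas as pd

sample_nodes = pd.DataFrame(
    {'trip_id': ['t1', 't1', 't2'], 'stop_id': [10, 11, 10]},
    index=[5, 9, 12],  # a non-contiguous index, as left behind by row filters
)

# Renumber rows 0..n-1 and expose that position as an explicit node_id column
sample_nodes = sample_nodes.reset_index(drop=True)
sample_nodes['node_id'] = sample_nodes.index

# Round-trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'sample_nodes.pkl')
    sample_nodes.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)
```

Using `drop=True` discards the old index; leaving it out keeps the old index as an extra `index` column.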

In [148]:
start_time = DAY + dt.timedelta(hours=7)  # start of the morning window
nodes_df[nodes_df['arrival'] > start_time].sort_values('arrival')

Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
3293,19404969_091215,07:00:01,07:00:01,29474,7,9807,06:52:00,20330,32.080507,34.834548,2015-12-14 07:00:01,2015-12-14 07:00:01
2033,19404967_091215,07:00:27,07:00:27,12894,27,9807,06:30:00,21155,32.086040,34.778794,2015-12-14 07:00:27,2015-12-14 07:00:27
1277,19407593_081215,07:00:30,07:00:30,13065,23,2544,06:29:00,21413,32.069303,34.780756,2015-12-14 07:00:30,2015-12-14 07:00:30
652,19407594_081215,07:00:35,07:00:35,13199,15,2544,06:40:00,21654,32.085293,34.810136,2015-12-14 07:00:35,2015-12-14 07:00:35
4106,19404968_091215,07:00:35,07:00:35,13842,18,9807,06:42:00,26138,32.079270,34.814187,2015-12-14 07:00:35,2015-12-14 07:00:35
3367,19404969_091215,07:00:37,07:00:37,29477,8,9807,06:52:00,20340,32.082834,34.835564,2015-12-14 07:00:37,2015-12-14 07:00:37
1731,19407596_081215,07:00:54,07:00:54,15909,2,2544,06:59:00,31902,32.054576,34.847617,2015-12-14 07:00:54,2015-12-14 07:00:54
721,19407594_081215,07:01:03,07:01:03,13123,16,2544,06:40:00,21503,32.084288,34.808631,2015-12-14 07:01:03,2015-12-14 07:01:03
3441,19404969_091215,07:01:05,07:01:05,14045,9,9807,06:52:00,26520,32.084589,34.836367,2015-12-14 07:01:05,2015-12-14 07:01:05
4180,19404968_091215,07:01:09,07:01:09,13927,19,9807,06:42:00,26272,32.080908,34.812014,2015-12-14 07:01:09,2015-12-14 07:01:09


In [149]:
start_time = DAY + dt.timedelta(hours=7)
end_time = start_time + dt.timedelta(hours=1, minutes=30)

morning_nodes_df = nodes_df[(nodes_df['arrival'] > start_time) & (nodes_df['arrival'] < end_time)]
morning_nodes_df.head(3)


Unnamed: 0,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure
1,19407596_081215,07:06:06,07:06:06,12962,6,2544,06:59:00,21257,32.076767,34.844303,2015-12-14 07:06:06,2015-12-14 07:06:06
2,19407597_081215,07:15:06,07:15:06,12962,6,2544,07:08:00,21257,32.076767,34.844303,2015-12-14 07:15:06,2015-12-14 07:15:06
3,19407598_081215,07:24:06,07:24:06,12962,6,2544,07:17:00,21257,32.076767,34.844303,2015-12-14 07:24:06,2015-12-14 07:24:06


In [150]:
morning_nodes_df.nunique()

trip_id                           26
arrival_time                     573
departure_time_stop              573
stop_id                           64
stop_sequence                     37
route_id                           2
departure_time_trip_departure     23
stop_code                         64
stop_lat                          64
stop_lon                          64
arrival                          573
departure                        573
dtype: int64

In [151]:
morning_nodes_df = morning_nodes_df.reset_index()

In [152]:
morning_nodes_df['node_id'] = morning_nodes_df.index

In [153]:
morning_nodes_df.head(3)

Unnamed: 0,index,trip_id,arrival_time,departure_time_stop,stop_id,stop_sequence,route_id,departure_time_trip_departure,stop_code,stop_lat,stop_lon,arrival,departure,node_id
0,1,19407596_081215,07:06:06,07:06:06,12962,6,2544,06:59:00,21257,32.076767,34.844303,2015-12-14 07:06:06,2015-12-14 07:06:06,0
1,2,19407597_081215,07:15:06,07:15:06,12962,6,2544,07:08:00,21257,32.076767,34.844303,2015-12-14 07:15:06,2015-12-14 07:15:06,1
2,3,19407598_081215,07:24:06,07:24:06,12962,6,2544,07:17:00,21257,32.076767,34.844303,2015-12-14 07:24:06,2015-12-14 07:24:06,2


In [154]:
morning_nodes_df.to_pickle(OUTPUT_PATH + 'morning_nodes.pkl')

## Filter only Tel Aviv Metropolitan stops

Let's get all Tel Aviv (TLV) stops 

In [None]:
for zone in stops_df['zone_id'].unique():
    print(zone)

In [None]:
test_stops = stops_df[stops_df['zone_id'] == 6900]
test_stops.to_csv(OUTPUT_PATH + 'test_stops.csv')

In [None]:
test_stops.shape

In [None]:
stops_df.shape

In [None]:
# Let's see how many NaN zones we have:
stops_df['zone_id'].isna().sum()

In [None]:
# I think zone 210 is the Tel Aviv metropolitan area (only the small surrounding part; we would need to extend it for our real computations).
# We need to keep only trips that contain stops in this zone (and then only the nodes of those trips).
tlv_stops_df = stops_df[stops_df['zone_id'] == 210]

In [None]:
tlv_stops_df.shape

Now we need to get the TLV stop times, in order to find all trips that include TLV stops

In [None]:
# stop_times_df
tlv_stop_times_df = stop_times_df.merge(tlv_stops_df[['stop_id']], on='stop_id', how='inner')

In [None]:
tlv_stop_times_df.shape

In [None]:
stop_times_df.shape

Next, the TLV stop times contain the trip ids of all trips with TLV stops in them. Let's find the relevant trips among the Sunday trips we're examining

In [None]:
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#
# We want to get unique trips which pass through TLV, then we will use these trips to filter 
# only nodes that pass through TLV.
#
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# One row per trip_id (the default keep='first' retains a single copy of each duplicate)
tlv_trips_df = tlv_stop_times_df[['trip_id']].drop_duplicates(subset='trip_id')
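
The `keep` argument of `drop_duplicates` decides what happens to repeated `trip_id` values, which matters here because a trip usually has more than one stop in the zone. A small sketch on made-up data shows the difference:

```python
import pandas as pd

# Made-up sample: trip t1 has two stops in the zone, trip t2 has one
sample_tlv_stop_times = pd.DataFrame({'trip_id': ['t1', 't1', 't2']})

# Default keep='first': one row per trip_id
unique_trips = sample_tlv_stop_times.drop_duplicates(subset='trip_id')

# keep=False: every trip_id that occurs more than once is removed entirely
only_singletons = sample_tlv_stop_times.drop_duplicates(subset='trip_id', keep=False)
```

With `keep=False`, only trips that have exactly one stop in the zone survive, so it does not produce a list of unique trips.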

In [None]:
tlv_trips_df.head()

In [None]:
tlv_trips_df.shape

Finally, let's keep and save only the nodes that belong to TLV trips

In [None]:
tlv_nodes_df = nodes_df.merge(tlv_trips_df[['trip_id']], on='trip_id', how='inner')

In [None]:
nodes_df.shape

In [None]:
tlv_nodes_df.shape

In [None]:
tlv_nodes_df.to_csv('tlv_nodes.csv')
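
The zone filtering steps above can also be written as a single helper based on `isin` instead of intermediate merges. This is a minimal sketch, assuming `stops_df` really has a `zone_id` column; `nodes_for_zone` is a hypothetical name and the sample frames are made up.

```python
import pandas as pd

def nodes_for_zone(nodes_df, stops_df, stop_times_df, zone_id):
    """Keep the nodes of every trip that serves at least one stop in `zone_id` (hypothetical helper)."""
    # Stops that belong to the zone
    zone_stop_ids = stops_df.loc[stops_df['zone_id'] == zone_id, 'stop_id']
    # Unique trips that call at any of those stops
    in_zone = stop_times_df['stop_id'].isin(zone_stop_ids)
    zone_trip_ids = stop_times_df.loc[in_zone, 'trip_id'].unique()
    # All nodes of those trips, including their stops outside the zone
    return nodes_df[nodes_df['trip_id'].isin(zone_trip_ids)]

# Made-up sample data: stop 12 lies outside zone 210
sample_stops = pd.DataFrame({'stop_id': [10, 11, 12], 'zone_id': [210, 210, 300]})
sample_stop_times = pd.DataFrame({'trip_id': ['t1', 't1', 't2'],
                                  'stop_id': [10, 12, 12]})
sample_nodes = pd.DataFrame({'trip_id': ['t1', 't1', 't2'],
                             'stop_id': [10, 12, 12]})

sample_zone_nodes = nodes_for_zone(sample_nodes, sample_stops, sample_stop_times, 210)
```

Trip `t1` is kept in full because it has one stop in the zone, while `t2` is dropped because it never enters it.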