## GTFS
To compute accessibility from GTFS data (General Transit Feed Specification; the feed can be downloaded from ftp://199.203.58.18/), we first perform some basic processing of the data.
We use the pandas library.

The code is based on the following structure of GTFS tables:
![GTFS Tables](../../input_data/GTFS_tables.PNG)

In [4]:
# Imports
import pandas as pd
import datetime as dt
import numpy as np
import pickle
import warnings
from tqdm.auto import tqdm

# Set up notebook
warnings.filterwarnings('ignore')
tqdm.pandas()

# Code constants
DATA_PATH = '../../input_data/GTFS-28-Oct-19/'
OUTPUT_PATH = '../../output_data/'
DAY = dt.datetime(2019, 11, 3)


## Create Nodes
### Utility Functions
GTFS times are recorded so that trips running past midnight (00:00) keep counting from 24:00 onwards. For example, an entry in the stop_times file can have the time 26:50.
To handle these times with Python's native datetime package, we must convert all "GTFS times" to regular times.

In [None]:
def convert_gtfs_time_to_datetime(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' string (hours may be 24 or more) to a datetime relative to DAY."""
    h, m, s = [int(x) for x in gtfs_time.split(':')]
    # timedelta carries hours >= 24 over into the following day,
    # so a time such as 26:50:00 needs no special handling
    return DAY + dt.timedelta(hours=h, minutes=m, seconds=s)
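
To make the conversion rule concrete, here is a minimal standalone sketch using only the standard library. The names `SERVICE_DAY` and `gtfs_time_to_datetime` are hypothetical stand-ins introduced for illustration, and the service day is assumed to be the same date as `DAY`.

```python
import datetime as dt

# Assumed service day for illustration (same date as DAY in the notebook)
SERVICE_DAY = dt.datetime(2019, 11, 3)

def gtfs_time_to_datetime(gtfs_time, day=SERVICE_DAY):
    """Hypothetical stand-in: convert 'HH:MM:SS' (hours may exceed 23) to a datetime."""
    h, m, s = (int(part) for part in gtfs_time.split(':'))
    # timedelta normalises hours >= 24 into whole days automatically
    return day + dt.timedelta(hours=h, minutes=m, seconds=s)

examples = {t: gtfs_time_to_datetime(t) for t in ('08:15:00', '24:00:00', '26:50:00')}
for gtfs_time, converted in examples.items():
    print(gtfs_time, '->', converted)
```

A time before 24:00 stays on the service day, while 24:00:00 and 26:50:00 land on the following calendar day.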

### Process Calendar - Get trips for a single day
Note that the chosen day, as set by the DAY constant above, must match the code below: if the date is a Sunday, we filter by 'sunday'. See the comments in the code for specifics.

In [None]:
# Load calendar
calendar_df = pd.read_csv(DATA_PATH + 'calendar.txt')
# Convert dates to datetimes (vectorized)
calendar_df['start_date'] = pd.to_datetime(calendar_df['start_date'].astype(str), format='%Y%m%d')
calendar_df['end_date'] = pd.to_datetime(calendar_df['end_date'].astype(str), format='%Y%m%d')

calendar_df.columns = ['service_id',
 'sunday',
 'monday',
 'tuesday',
 'wednesday',
 'thursday',
 'friday',
 'saturday',
 'start_date',
 'end_date']

# Let's say we want all trips that occurred on the first Sunday after the feed was published 

# Filter so we only keep services that are active on Sunday.
sunday_services_df = calendar_df[calendar_df['sunday'] == 1][['service_id', 'start_date', 'end_date']]

# Keep only services that start during/before selected date
sunday_services_df = sunday_services_df[sunday_services_df['start_date'] <= DAY]

# Keep only services that end during/after selected date
sunday_services_df = sunday_services_df[sunday_services_df['end_date'] >= DAY]
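
One way to avoid keeping the weekday column and the chosen date in sync by hand is to derive the column name from the date itself. The following is a minimal sketch on a toy calendar table; `WEEKDAY_COLUMNS`, `active_services` and `toy_calendar` are hypothetical names introduced here, and the toy table only carries the columns the example needs.

```python
import datetime as dt
import pandas as pd

# datetime.weekday() returns 0 for Monday ... 6 for Sunday
WEEKDAY_COLUMNS = ['monday', 'tuesday', 'wednesday', 'thursday',
                   'friday', 'saturday', 'sunday']

def active_services(calendar, day):
    """Return the services that run on `day` and whose date range covers it."""
    column = WEEKDAY_COLUMNS[day.weekday()]
    mask = (
        (calendar[column] == 1)
        & (calendar['start_date'] <= day)
        & (calendar['end_date'] >= day)
    )
    return calendar.loc[mask, ['service_id', 'start_date', 'end_date']]

# Toy data: service 2 does not run on Sundays, service 3 starts too late
toy_calendar = pd.DataFrame({
    'service_id': [1, 2, 3],
    'sunday': [1, 0, 1],
    'monday': [1, 1, 1],
    'start_date': pd.to_datetime(['20191028', '20191028', '20191110'], format='%Y%m%d'),
    'end_date': pd.to_datetime(['20191231', '20191231', '20191231'], format='%Y%m%d'),
})

result = active_services(toy_calendar, dt.datetime(2019, 11, 3))  # a Sunday
print(result)
```

With this approach, changing the date to another weekday would select the matching column without further edits.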

### Process Trips

In [None]:
# Load trips
trips_df = pd.read_csv(DATA_PATH + 'trips.txt')

# Fix column names (some columns have special 'hidden' characters that we want to remove)
trips_df.columns = ['route_id', 'service_id', 'trip_id','trip_headsign', 'direction_is', 'shape_id']

trips_calendar_df = sunday_services_df.merge(trips_df, on='service_id', suffixes=('_calendar', '_trips'))
sunday_trips_df = trips_calendar_df.drop(['start_date', 'end_date'], axis=1)

We should now have all **trips** that occurred on the selected date.
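
The 'hidden' characters in the column names mentioned in the trips cell are often a UTF-8 byte-order mark (BOM) at the start of the file. Whether that is the cause in this feed is an assumption. If it is, the sketch below shows a way to clean the names without overwriting them by position; it reads a small in-memory sample rather than the real `trips.txt`.

```python
import io
import pandas as pd

# In-memory sample that starts with a UTF-8 BOM and has a padded column name
raw = '\ufeffroute_id,service_id, trip_id\n10,1,100\n'.encode('utf-8')

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present
trips = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')

# Defensive clean-up: remove any leftover BOM and surrounding whitespace
trips.columns = trips.columns.str.replace('\ufeff', '', regex=False).str.strip()
print(list(trips.columns))
```

Cleaning the names this way keeps the columns tied to the names in the file, so a change in column order in a later feed would not silently mislabel the data.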
### Process Stop Times
**Note:** The stop_times table is much larger than the other tables, so this step is slow.

In [None]:
# Load stop times
stop_times_df = pd.read_csv(DATA_PATH + 'stop_times.txt')

# Get all trips departures by getting the minimal departure time for each trip
trips_start_times_df = stop_times_df.groupby('trip_id').agg({'departure_time': 'min'})

# Join the last two tables to get the departure times of all Sunday trips
sunday_departures_df = sunday_trips_df.merge(trips_start_times_df, on='trip_id', suffixes=('_departures', '_trips'))
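
Taking the minimum of `departure_time` while it is still a string compares the values lexicographically. That gives the earliest time only when every value is zero-padded to `HH:MM:SS`; the GTFS reference also accepts `H:MM:SS`. The sketch below uses toy data (an assumption, not taken from the feed) to show the difference and a safer variant based on `pd.to_timedelta`.

```python
import pandas as pd

# Toy stop times; trip A uses an unpadded hour, trip B runs past midnight
stop_times = pd.DataFrame({
    'trip_id': ['A', 'A', 'B', 'B'],
    'departure_time': ['9:55:00', '10:05:00', '23:50:00', '24:10:00'],
})

# Lexicographic minimum: '10:05:00' sorts before '9:55:00'
string_min = stop_times.groupby('trip_id')['departure_time'].min()

# Numeric minimum: parse to timedeltas first, then aggregate
stop_times['departure_td'] = pd.to_timedelta(stop_times['departure_time'])
timedelta_min = stop_times.groupby('trip_id')['departure_td'].min()

print(string_min)
print(timedelta_min)
```

If the feed always pads hours to two digits, both approaches agree and the string minimum is sufficient.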

### Process Stops

In [None]:
# Load stops
stops_df = pd.read_csv(DATA_PATH + 'stops.txt')

# Add stop code and zone id to stop times
stop_times_with_stop_codes_df = stop_times_df.merge(
    stops_df[['stop_id', 'stop_code', 'zone_id']], on='stop_id')

# Add stop code to stop times - Yulia's example
# stop_times_with_stop_codes_df = stop_times_df.merge(
#     stops_df[['stop_id', 'stop_code']], on='stop_id')


stop_times_with_stop_codes_df['departure_time'] = stop_times_with_stop_codes_df[
    'departure_time'].apply(convert_gtfs_time_to_datetime)
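
Because the stop times table is large, a row-by-row `apply` can be slow. A vectorized alternative is sketched below under the assumption that all times are well-formed `HH:MM:SS` strings; `day` and `gtfs_times` are illustrative names, with `day` set to the same date as `DAY`.

```python
import datetime as dt
import pandas as pd

day = dt.datetime(2019, 11, 3)  # assumed service day
gtfs_times = pd.Series(['08:15:00', '26:50:00'])

# Parse the whole column at once; hours >= 24 become '1 days ...' timedeltas,
# and adding them to the service day yields real-world datetimes
converted = pd.Timestamp(day) + pd.to_timedelta(gtfs_times)
print(converted)
```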

### Construct the nodes

In [None]:
# We want to (right) join this table with stop_times in order to get the sunday stop times with trip departure time.
sunday_nodes_df = stop_times_df.merge(sunday_departures_df, how='right', on='trip_id', suffixes=('_stop', '_trip_departure'))

# Remove columns we do not need, to clean up the data
sunday_nodes_df = sunday_nodes_df.drop(['pickup_type', 'shape_dist_traveled', 
                                        'drop_off_type', 'service_id', 'direction_is', 'shape_id'], 
                                       axis=1)

# Add stops data to nodes
nodes_df = sunday_nodes_df.merge(stops_df, on='stop_id', suffixes=('_node', '_stop'))
nodes_df = nodes_df.drop(['stop_desc', 'stop_name', 'zone_id', 'parent_station', 'location_type'],axis=1)

# Convert GTFS times to match "real-world time".
nodes_df['arrival'] = nodes_df['arrival_time'].apply(convert_gtfs_time_to_datetime)
nodes_df['departure'] = nodes_df['departure_time_stop'].apply(convert_gtfs_time_to_datetime)
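
To check what the right join keeps and drops, the `indicator=True` option of `DataFrame.merge` can be used. The sketch below runs on toy tables (hypothetical data, not from the feed): trip `b` appears only in the stop times and is dropped, while trip `c` appears only in the departures and is kept with missing stop columns.

```python
import pandas as pd

stop_times = pd.DataFrame({
    'trip_id': ['a', 'a', 'b'],
    'stop_id': [1, 2, 1],
    'departure_time': ['08:00:00', '08:10:00', '07:00:00'],
})
departures = pd.DataFrame({
    'trip_id': ['a', 'c'],
    'departure_time': ['08:00:00', '09:00:00'],
})

joined = stop_times.merge(
    departures, how='right', on='trip_id',
    suffixes=('_stop', '_trip_departure'),
    indicator=True,  # adds a '_merge' column: 'both', 'left_only' or 'right_only'
)

# Trips selected for the day that have no stop times at all
unmatched = joined[joined['_merge'] == 'right_only']
print(joined)
```

The overlapping `departure_time` column is split into `departure_time_stop` (per stop) and `departure_time_trip_departure` (first departure of the trip), which is why the node construction reads `departure_time_stop`.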


## Create Edges