# Estimate car vs bus travel time

* Pull out parallel routes.
* Make car travel down same route as the bus.
* `osmx` snaps to nodes, but even for every 5th bus stop, it's snapping to same node.
* `osrm` wasn't able to be installed in Hub
* `valhalla`? Kuan Butt's blog?

#### Quick and Dirty Approach
* Pull parallel routes
* New query that grabs stop sequences for each trip
* **Pare down** trips, keep only 1 trip for the route (pick longest one), ignore short trips
* Do above step outside of query; query returning distinct is keeping it across trips and mixing up stop sequences, and weird results come out
* Add `service_hours` for that trip, in `view.gtfs_fact_daily_trips`, this is GTFS scheduled 
* Add `service_hours_rt` for that trip...this should come from Eric's RT work
* At least be able to pull that same trip, or if it's a different day, pull it for the same time?
* Based on distance traveled, estimate car travel time with some assumptions (35, 40 mph?)
* For now, estimate car travel with lower mph assumption, so that some viable routes can be pulled. Don't want bus to look worse than it is (mid-day, free-flowing), and compare it to car travel (which is probably estimated during free-flowing too)

Later, swap out car travel time estimation with other approaches. Maybe use Google API to do requests.

In [1]:
#https://stackoverflow.com/questions/55162077/how-to-get-the-driving-distance-between-two-geographical-coordinates-using-pytho
import geopandas as gpd
import os
import pandas as pd

os.environ["CALITP_BQ_MAX_BYTES"] = str(130_000_000_000)

from calitp.tables import tbl
from calitp import query_sql
from siuba import *

import shared_utils
import utils



In [2]:
'''
SELECTED_DATE = "2022-2-8"

tbl_stop_times = (
    tbl.views.gtfs_schedule_dim_stop_times()
    >> filter(_.calitp_extracted_at <= SELECTED_DATE, 
              _.calitp_deleted_at > SELECTED_DATE, 
             )
)


daily_stop_times = (
    tbl.views.gtfs_schedule_fact_daily_trips()
    >> filter(_.service_date == SELECTED_DATE, 
          _.is_in_service == True)
    >> filter(_.calitp_itp_id==182)
    >> left_join(_, tbl_stop_times,
              # also added url number to the join keys ----
             ["calitp_itp_id", "calitp_url_number", "trip_id"])
    >> select(_.calitp_itp_id,
           _.trip_id, _.route_id, _.stop_id, _.stop_sequence, 
           _.service_hours, _.trip_first_departure_ts, _.trip_last_arrival_ts
          )    
    >> inner_join(_, 
                  (tbl.views.gtfs_schedule_dim_stops()
                   >> select(_.calitp_itp_id,
                            _.stop_id, _.stop_lon, _.stop_lat,
                            )
                  ), on = ["calitp_itp_id", "stop_id"]
    )
    >> distinct()
    >> collect()
)
'''

'\nSELECTED_DATE = "2022-2-8"\n\ntbl_stop_times = (\n    tbl.views.gtfs_schedule_dim_stop_times()\n    >> filter(_.calitp_extracted_at <= SELECTED_DATE, \n              _.calitp_deleted_at > SELECTED_DATE, \n             )\n)\n\n\ndaily_stop_times = (\n    tbl.views.gtfs_schedule_fact_daily_trips()\n    >> filter(_.service_date == SELECTED_DATE, \n          _.is_in_service == True)\n    >> filter(_.calitp_itp_id==182)\n    >> left_join(_, tbl_stop_times,\n              # also added url number to the join keys ----\n             ["calitp_itp_id", "calitp_url_number", "trip_id"])\n    >> select(_.calitp_itp_id,\n           _.trip_id, _.route_id, _.stop_id, _.stop_sequence, \n           _.service_hours, _.trip_first_departure_ts, _.trip_last_arrival_ts\n          )    \n    >> inner_join(_, \n                  (tbl.views.gtfs_schedule_dim_stops()\n                   >> select(_.calitp_itp_id,\n                            _.stop_id, _.stop_lon, _.stop_lat,\n                            )\n 

In [3]:
#daily_stop_times.to_parquet("./data/metro_routes.parquet")

In [4]:
routes_with_stops = pd.read_parquet("./data/metro_routes.parquet")

In [5]:
gdf = shared_utils.utils.download_geoparquet(utils.GCS_FILE_PATH, 
                                             "parallel_or_intersecting")

gdf = gdf[gdf.parallel==1].reset_index(drop=True)

# Start with LA Metro
gdf = gdf[gdf.itp_id==182].reset_index(drop=True)

In [6]:
def select_parallel_routes(df, parallel_info):
    df = df.rename(columns = {"calitp_itp_id": "itp_id"})
    
    gdf = (df[df.route_id.isin(parallel_info.route_id)]
            .sort_values(["itp_id", "route_id", "stop_sequence"])
            .drop_duplicates(subset=["itp_id", "route_id", "stop_sequence"])
            .reset_index(drop=True)
           )
    
    gdf = shared_utils.geography_utils.create_point_geometry(
        gdf, longitude_col = "stop_lon", latitude_col = "stop_lat",
    )
    
    return gdf

parallel = select_parallel_routes(routes_with_stops, gdf)

In [7]:
#https://stackoverflow.com/questions/25055712/pandas-every-nth-row
# Maybe not use every bus stop, since bus stops are spaced fairly closely
# Maybe every other, every 3rd? want to mimic the bus route, do not want
# to stray too far
#df = df.iloc[::3]

Don't like how `osmx` is returning the same nodes for bus stops, even at every 5th bus stop.

`osrm` doesn't install bc of some `GDAL` dependencies.

Can Google API be used? But need to check terms and conditions if we can make requests to calculate travel time or even grab speed limits through the
[Python package](https://github.com/googlemaps/google-maps-services-python)

At minimum, can calculate distance between stops, sum it up, and for cars, set an assumption of 30 mph or 45 mph. If we can't use Google API to grab speed limit, then we will hard code it.

In [8]:
def calculate_distance_traveled(df):
    group_cols = ["itp_id", "route_id"]
    sort_cols = group_cols + ["stop_sequence"]
    
    df = df.to_crs(shared_utils.geography_utils.CA_StatePlane)
    
    # Distance traveled
    df = df.assign(
        # Previous geometry
        start = (df.sort_values(sort_cols)
                 .groupby(group_cols)["geometry"]
                 .apply(lambda x: x.shift(1))),
        end = (df.sort_values(sort_cols)
               .groupby(group_cols)["geometry"]
               .apply(lambda x: x.shift(0))
              )
    )
    
    df = df.assign(
        feet_traveled = df.end.distance(df.start) 
    ).drop(columns = ["start", "end"])
        
    return df
            

In [9]:
df = calculate_distance_traveled(parallel)

In [10]:
def calculate_time_traveled(df):
    # Use a set of assumptions
    
    AVG_SPEED = 40
    
    df = df.assign(
        max_stop = (df.groupby(["itp_id", "route_id", "trip_id"])
                    ["stop_sequence"].transform("max"))
    )
    
    df2 = shared_utils.geography_utils.aggregate_by_geography(
        df,
        group_cols = ["itp_id", "route_id", "trip_id", 
                     "trip_first_departure_ts", "trip_last_arrival_ts"],
        sum_cols = ["feet_traveled"], 
        mean_cols = ["service_hours", "max_stop"]
    )
    
    df2 = df2.assign(
        miles_traveled = df2.feet_traveled.divide(
            shared_utils.geography_utils.FEET_PER_MI)
    
    )
    
    # speed = distance / time
    # time = distance / speed
    df2 = df2.assign(
        car_trip_time_hr = df2.miles_traveled.divide(AVG_SPEED),
        departure_hr = pd.to_datetime(df2.trip_first_departure_ts, unit='s').dt.hour                                        
    ).drop(columns = "feet_traveled")
        
    return df2

In [11]:
df2 = calculate_time_traveled(df)

Which trip should be selected?

It does appear that `max_stop` differs even for the same route. Not so clear what short vs long trips are. Should a trip with of average `service_hours` be selected? or average `miles_traveled` to represent a typical trip? 

But, typical trip is probably combination of mid-day service with one that ran near the average of `service_hours` or `miles_traveled`? Don't want to pull short trips because those may take place during peak.

In [12]:
df2[df2.route_id=="10-13153"].sort_values("departure_hr")

Unnamed: 0,itp_id,route_id,trip_id,trip_first_departure_ts,trip_last_arrival_ts,max_stop,service_hours,miles_traveled,car_trip_time_hr,departure_hr
15,182,10-13153,10010007430446-DEC21,17160,21720,91,1.266667,11.036106,0.275903,4
16,182,10-13153,10010007330511-DEC21,18660,23460,92,1.333333,20.905529,0.522638,5
5,182,10-13153,10010007430513-DEC21,18780,23880,96,1.416667,34.311041,0.857776,5
9,182,10-13153,10010007350627-DEC21,23220,28140,20,1.366667,5.930319,0.148258,6
14,182,10-13153,10010007350657-DEC21,25020,30000,79,1.383333,15.537939,0.388448,6
7,182,10-13153,10010007340739-DEC21,27540,31260,51,1.033333,8.452163,0.211304,7
18,182,10-13153,10010007330741-DEC21,27660,33660,99,1.666667,25.151418,0.628785,7
1,182,10-13153,10010007430703-DEC21,25380,31080,100,1.583333,42.227576,1.055689,7
8,182,10-13153,10010007320807-DEC21,29220,34320,81,1.416667,18.838597,0.470965,8
3,182,10-13153,10010007371209-DEC21,43740,48000,50,1.183333,23.619934,0.590498,12


In [13]:
group_cols = ["itp_id", "route_id"]

# Should there be a check that there are mid-day trips for that route_id?
df2['midday'] = df2.apply(lambda x: 
                            1 if ((x.departure_hr >= 10) & 
                                  (x.departure_hr <= 14))
                            else 0, axis=1)

df2['has_midday'] = df2.groupby(group_cols)["midday"].transform("max")

df2['midday2'] = df2.apply(lambda x: 
                            1 if ((x.departure_hr >= 10) & 
                                  (x.departure_hr <= 16) #& (x.midday==0)
                                 )
                            else 0, axis=1)

df2['has_midday2'] = df2.groupby(group_cols)["midday2"].transform("max")


In [14]:
df2.has_midday2.value_counts()

1    983
0      6
Name: has_midday2, dtype: int64

In [15]:
df2[df2.has_midday==0].route_id.value_counts()

240-13153    14
150-13153    11
177-13153     9
803           6
Name: route_id, dtype: int64

In [16]:
df2[df2.route_id=="240-13153"]

Unnamed: 0,itp_id,route_id,trip_id,trip_first_departure_ts,trip_last_arrival_ts,max_stop,service_hours,miles_traveled,car_trip_time_hr,departure_hr,midday,has_midday,midday2,has_midday2
402,182,240-13153,10240000010504-DEC21,18240,21600,62,0.933333,13.504534,0.337613,5,0,0,0,1
403,182,240-13153,10240000032808-DEC21,101280,104340,2,0.85,12.575274,0.314382,4,0,0,0,1
404,182,240-13153,10240000021531-DEC21,55860,61500,78,1.566667,48.087853,1.202196,15,0,0,1,1
405,182,240-13153,10240000010837-DEC21,31020,35400,65,1.216667,19.477158,0.486929,8,0,0,0,1
406,182,240-13153,10240000042329-DEC21,84540,87720,72,0.883333,12.121433,0.303036,23,0,0,0,1
407,182,240-13153,10240000010454-DEC21,17640,20940,7,0.916667,11.301492,0.282537,4,0,0,0,1
408,182,240-13153,10240000010847-DEC21,31620,36000,71,1.216667,37.969144,0.949229,8,0,0,0,1
409,182,240-13153,10240000032708-DEC21,97680,100740,68,0.85,36.063828,0.901596,3,0,0,0,1
410,182,240-13153,10240000021830-DEC21,66600,71400,77,1.333333,19.652186,0.491305,18,0,0,0,1
411,182,240-13153,10240000020754-DEC21,28440,33060,73,1.283333,12.751569,0.318789,7,0,0,0,1


In [17]:
def select_one_trip(df):
    # Not sure why across trip_ids, 
    # for the same route_id, there are differing max_stop_sequence
    # Use longest route (max stop sequence)?
    # Use median or mean service hours or miles traveled?
    group_cols = ["itp_id", "route_id"]
    
    # Should there be a check that there are mid-day trips for that route_id?
    df['midday'] = df.apply(lambda x: 
                                1 if ((x.departure_hr >= 10) & 
                                      (x.departure_hr <= 14))
                                else 0, axis=1)
    
    
    df['has_midday'] = df.groupby(group_cols)["midday"].transform("max")
    
    # Subset to mid-day first
    df = df[(df.departure_hr >= 10) & (df.departure_hr <= 14)]
    
    # Maybe pick one where service hours and miles_traveled is somewhere in 40th-60th percentile 
    df = df[(df.service_hours == df.service_hours.quantile(0.4)) & 
            (df.service_hours <= df.service_hours.quantile(0.6)) & 
            (df.miles_traveled >= df.miles_traveled.quantile(0.4)) & 
            (df.miles_traveled <= df.miles_traveled.quantile(0.6))
           ].reset_index(drop=True)
    
    '''
    df = df.assign(
        max_stop = (df.groupby(group_cols + ["trip_id"])
                    ["stop_sequence"].transform("max")
                   ),
    )
    
    df = df.assign(
        longest_trip = (df.groupby(group_cols)
                        ["max_stop"].transform("max")
                       )
    )
    
    df2 = (df[df.max_stop == df.longest_trip]
           .reset_index(drop=True)
           .drop(columns = ["max_stop", "longest_trip"])
           .rename(columns = {"calitp_itp_id": "itp_id"})
          )
    '''
    
    return df

In [18]:
df3 = select_one_trip(df2)

In [19]:
df3

Unnamed: 0,itp_id,route_id,trip_id,trip_first_departure_ts,trip_last_arrival_ts,max_stop,service_hours,miles_traveled,car_trip_time_hr,departure_hr,midday,has_midday,midday2,has_midday2
0,182,10-13153,10010007371437-DEC21,52620,56940,23,1.2,10.575006,0.264375,14,1,1,1,1
1,182,211-13153,10211000571455-DEC21,53700,58020,58,1.2,7.81012,0.195253,14,1,1,1,1


Comparison should be against bus's travel time along that route.

Can we pick one that is midday, one of the faster trips? Should be probably around 75th or 80th percentile.

Then see how long it takes for the bus to make that trip.

Actually, that travel time is in the data warehouse. Do another query, grab all the travel times, see if one can be selected for 75th or 80th percentile and if it's still less than 2x car trip time, then it can be selected as "viable parallel" route.

`views.gtfs_schedule_fact_daily_trips` has the `service_hours` column...should grab that in original query because later I drop a bunch of trips to get down to unique route, and select longest trip.