# PMAC presentation

**PO-1: increase total amount of service on the SHN and reliability of that service by 2024**

Typical weekday: 2/8/22

1. Routes on SHN
a. parallel routes (1 mi corridor) - how many routes, agencies, share of all CA transit agency-routes?
b. intersecting routes (everything not parallel) - how many routes, agencies, share of all CA transit agency-routes?
c. intersecting routes that actually travel on the SHN (within a 50 ft buffer) for some portion of the route - subset of (b), how many routes, agencies, share of all CA transit agency-routes?

2. How many routes on SHN, breakdown by district
3. How many service hours are scheduled for a typical weekday for (1)?
4. How many of the agencies with parallel routes on the SHN also have GTFS RT?
Use `isin` on `itp_id` rather than matching by route, because most agencies that provide GTFS RT do so for the majority of their routes.
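
A minimal sketch of the agency-level `isin` check described in (4). The dataframe contents and the `rt_itp_ids` list are made up for illustration; the real list of GTFS RT agencies would come from elsewhere.

```python
import pandas as pd

# Hypothetical stand-in for the parallel routes table (one row per agency-route)
parallel_routes = pd.DataFrame({
    "itp_id": [4, 4, 6, 11],
    "route_id": ["10", "12", "1", "A"],
})

# Hypothetical list of itp_ids known to publish GTFS RT
rt_itp_ids = [4, 11]

# Flag at the agency level, not per route
parallel_routes = parallel_routes.assign(
    has_rt=parallel_routes.itp_id.isin(rt_itp_ids)
)

# Collapse to one row per agency before counting
agencies = parallel_routes.drop_duplicates(subset="itp_id")
n_agencies_with_rt = int(agencies.has_rt.sum())
share_with_rt = n_agencies_with_rt / len(agencies)
```

Deduplicating on `itp_id` before counting keeps an agency with many parallel routes from being counted more than once.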

In [1]:
#https://stackoverflow.com/questions/55162077/how-to-get-the-driving-distance-between-two-geographical-coordinates-using-pytho
import geopandas as gpd
import os
import pandas as pd

os.environ["CALITP_BQ_MAX_BYTES"] = str(130_000_000_000)

from calitp.tables import tbl
from calitp import query_sql
from siuba import *

import shared_utils
import utils



In [None]:
'''
SELECTED_DATE = "2022-2-8"

trip_cols = ["calitp_itp_id", "calitp_url_number", "route_id", "shape_id"]


dim_trips = (tbl.views.gtfs_schedule_dim_trips()
             >> select(*trip_cols, _.trip_key)
             >> distinct()
            )

trips = (tbl.views.gtfs_schedule_fact_daily_trips()
         >> filter(_.service_date == SELECTED_DATE, 
                   _.is_in_service==True)
         >> select(_.trip_key, _.service_date, _.service_hours)
         >> inner_join(_, dim_trips, on = "trip_key")
         >> group_by(_.calitp_itp_id, _.calitp_url_number, _.shape_id, _.route_id)
         >> mutate(total_service_hours = _.service_hours.sum())
         >> select(*trip_cols, _.total_service_hours)
         >> distinct()
         >> rename(itp_id = "calitp_itp_id")
         >> collect()
        )
'''

In [2]:
#trips.to_parquet("./data/trips_service_hours.parquet")
trips = pd.read_parquet("./data/trips_service_hours.parquet")
trips.head(2)

Unnamed: 0,itp_id,calitp_url_number,route_id,shape_id,total_service_hours
0,4,1,650,shp-650-08,0.816667
1,16,0,50,5607_shp,7.4


In [3]:
trips2 = utils.include_exclude_multiple_feeds(trips, id_col = "itp_id",
                                     include_ids = [182],
                                     exclude_ids = [200]
                                    )

# obs in original df: 10211
# obs in new df: 7887
These operators have multiple calitp_url_number values: [56, 61, 106, 110, 127, 167, 182, 194, 246, 247, 264, 279, 280, 282, 290, 294, 314, 315, 350, 356, 368]


In [4]:
# From previous step, some itp_ids had multiple urls
# But, there was some difference in shape_id, route_id, so it's ok
# Keep all those obs
stats = shared_utils.geography_utils.aggregate_by_geography(
    trips2, 
    group_cols = ["itp_id", "route_id"],
    sum_cols = ["total_service_hours"],
)

stats.head()

Unnamed: 0,itp_id,route_id,total_service_hours
0,4,10,72.016667
1,4,12,71.233333
2,4,14,108.183333
3,4,18,87.516667
4,4,19,20.216667


In [5]:
stats.total_service_hours.sum()

98889.00416666665

Now pull out service hours for parallel routes using `shape_id`.

Caveat:
* Some `shape_id` values probably can't be found in `routes_assembled`.
* One option: exclude them from the parallel set, i.e. treat them as not parallel and count them under "all intersecting".
* Another option: go back to `gtfs_schedule.shapes` to see whether the shape can be found there, then determine whether it's parallel.
* Looking shapes up in `gtfs_schedule.shapes` can be iffy because that table is always current, while our typical weekday is in the past, so we still may not find every shape. It may give a closer estimate, but excluding the missing shapes from the parallel set is probably the cleaner approach, and it treats all edge cases the same way.
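
One way to size the caveat is to check how many `shape_id` values, and how many service hours, have no match in `routes_assembled`. A rough sketch, using small made-up dataframes in place of the real `trips2` and `routes_assembled`:

```python
import pandas as pd

# Hypothetical stand-ins for trips2 and routes_assembled
trips = pd.DataFrame({
    "itp_id": [4, 4, 16],
    "shape_id": ["shp-10-09", "shp-10-10", "5607_shp"],
    "total_service_hours": [38.98, 33.03, 7.4],
})
routes_assembled = pd.DataFrame({
    "itp_id": [4, 4],
    "shape_id": ["shp-10-09", "shp-10-10"],
})

# Left merge with indicator=True marks rows that have no match on the right
merged = pd.merge(
    trips,
    routes_assembled[["itp_id", "shape_id"]].drop_duplicates(),
    on=["itp_id", "shape_id"],
    how="left",
    indicator=True,
)

# Shapes scheduled on the selected date but absent from routes_assembled
missing = merged[merged["_merge"] == "left_only"]
missing_hours = missing.total_service_hours.sum()
```

If `missing_hours` is a small share of the total, excluding those shapes from the parallel set costs little.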


### Rerun routes_assembled

Do it with `gtfs_schedule.shapes` because we're losing quite a few routes otherwise, including ones that are parallel.

1. Make new `routes_assembled`
1. Run the new `routes_assembled` through `create_parallel_corridors` to tag parallel routes with a 1 mile buffer
1. Run the new `routes_assembled` through again with a 50 ft buffer to grab the intersecting subset
1. Grab service hours for parallel, intersecting subset, and other intersecting (not parallel). Together, these should add up to 100%
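
The last step can be sketched as a three-way categorization plus a check that the shares add up to 100%. The `on_shn` column (result of the 50 ft buffer step) and all values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical route-level table after both buffer steps
routes = pd.DataFrame({
    "itp_id": [4, 4, 6, 11],
    "route_id": ["10", "12", "1", "A"],
    "parallel": [1, 0, 0, 0],   # tagged with the 1 mile buffer
    "on_shn": [1, 1, 0, 0],     # tagged with the 50 ft buffer (assumed column name)
    "total_service_hours": [70.0, 20.0, 5.0, 5.0],
})

# Parallel takes priority; the 50 ft subset only applies to non-parallel routes
routes["category"] = np.select(
    [routes.parallel == 1, routes.on_shn == 1],
    ["parallel", "intersecting_on_shn"],
    default="other_intersecting",
)

hours = routes.groupby("category").total_service_hours.sum()
shares = hours / hours.sum()

# Every route falls in exactly one category, so the shares sum to 1
assert np.isclose(shares.sum(), 1.0)
```

Because the categories are mutually exclusive, no route's service hours are counted twice.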

In [6]:
gdf = gpd.read_parquet(
    "./data/parallel_or_intersecting_2022-03-02.parquet")

In [7]:
parallel = gdf[gdf.parallel==1].reset_index(drop=True)

In [8]:
parallel.head(2)

Unnamed: 0,itp_id,shape_id,route_id,geometry,route_length,total_routes,Route,County,District,RouteType,NB,SB,EB,WB,highway_length,pct_route,pct_highway,parallel
0,4,shp-10-09,10,"LINESTRING (5377530.006 3182846.533, 5377530.0...",40538.084415,129.0,185.0,ALA,4.0,State,1.0,1.0,0.0,0.0,38599.895707,0.661,0.694,1
1,4,shp-10-09,10,"LINESTRING (5377530.006 3182846.533, 5377530.0...",40538.084415,129.0,238.0,ALA,4.0,State,1.0,1.0,0.0,0.0,54249.258347,0.588,0.44,1


In [14]:
trips2.head()

Unnamed: 0,itp_id,calitp_url_number,route_id,shape_id,total_service_hours
0,4,0,10,shp-10-09,38.983333
1,4,0,10,shp-10-10,33.033333
2,4,0,12,shp-12-14,34.483333
3,4,0,12,shp-12-56,36.75
4,4,0,14,shp-14-14,0.266667


In [23]:
trips3 = pd.merge(
    parallel.drop(columns = "shape_id").drop_duplicates(),
    stats,
    on = ["itp_id", "route_id"],
    how = "left", 
    # m:1 because the transit route can intersect with multiple highways
    validate = "m:1",
)

In [30]:
keep_cols = ["itp_id", "route_id", "total_service_hours"]
trips4 = trips3[keep_cols].drop_duplicates()

shared_utils.geography_utils.aggregate_by_geography(
    trips4,
    group_cols = ["itp_id"],
    sum_cols = ["total_service_hours"],
    nunique_cols = ["route_id"]
)

Unnamed: 0,itp_id,total_service_hours,route_id
0,4,3607.150000,121
1,6,14.600000,1
2,11,36.933333,6
3,13,114.066667,1
4,15,0.000000,1
...,...,...,...
173,394,0.000000,1
174,473,26.133333,4
175,474,260.316667,19
176,482,34.066667,17


In [33]:
shared_utils.geography_utils.aggregate_by_geography(
    trips4.assign(group="All"),
    group_cols = ["group"],
    sum_cols = ["total_service_hours"],
    nunique_cols = ["route_id", "itp_id"]
)

Unnamed: 0,group,total_service_hours,itp_id,route_id
0,All,54480.443611,178,1997
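
Taking the two totals printed above at face value, the share of scheduled service hours on parallel routes works out as follows. This only restates the arithmetic; it assumes both totals cover the same date and the same set of feeds.

```python
# Totals copied from the outputs above
parallel_hours = 54480.443611    # "All" row for parallel routes
total_hours = 98889.00416666665  # stats.total_service_hours.sum()

share_parallel = parallel_hours / total_hours
print(f"{share_parallel:.1%}")   # roughly 55%
```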


In [49]:
# How many parallel routes for each district?
shared_utils.geography_utils.aggregate_by_geography(
    parallel,
    group_cols = ["District"],
    nunique_cols = ["itp_id", "route_id"]
).sort_values("District").astype(int)

Unnamed: 0,District,itp_id,route_id
6,1,7,33
9,2,9,41
2,3,18,155
0,4,48,1139
10,5,14,125
7,6,14,89
1,7,50,394
8,8,10,88
5,9,5,21
3,10,14,100
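
A possible caveat on the route counts: if `nunique_cols` counts distinct values of each column independently (an assumption about `aggregate_by_geography`, not confirmed here), then the same `route_id` string used by two agencies is counted once. Since the questions are framed as agency-routes, counting distinct (`itp_id`, `route_id`) pairs may be closer to what's wanted. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical rows: two agencies both use route_id "10";
# one pair repeats because a route can match several highways
parallel = pd.DataFrame({
    "itp_id": [4, 4, 6, 6],
    "route_id": ["10", "12", "10", "10"],
    "District": [4, 4, 4, 4],
})

# Distinct route_id strings per district: "10" is counted once
n_route_ids = parallel.groupby("District").route_id.nunique()

# Distinct agency-route pairs per district
n_agency_routes = (
    parallel.drop_duplicates(subset=["District", "itp_id", "route_id"])
    .groupby("District")
    .size()
)
```

Here `n_route_ids` gives 2 for District 4 while `n_agency_routes` gives 3.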


In [None]:
'''
SELECTED_DATE = "2022-2-8"

tbl_stop_times = (
    tbl.views.gtfs_schedule_dim_stop_times()
    >> filter(_.calitp_extracted_at <= SELECTED_DATE, 
              _.calitp_deleted_at > SELECTED_DATE, 
             )
)


daily_trips = (
    tbl.views.gtfs_schedule_fact_daily_trips()
    >> filter(_.service_date == SELECTED_DATE, 
          _.is_in_service == True)
    >> left_join(_, tbl_stop_times,
              # also added url number to the join keys ----
             ["calitp_itp_id", "calitp_url_number", "trip_id"])
    >> select(_.calitp_itp_id,
           _.trip_id, _.route_id,  
           _.service_hours
          )    
    >> distinct()
    >> collect()
)
'''