# Transit Service Density

## Conceptually, these all sound similar
1. Transit service density
1. Find and assess key transfers
1. Regional connections
1. Parallel routes

Parallel routes are corridors where transit routes from multiple operators run along the same road. A spatial join against road segments gives us a "service density" for each road corridor.

Within these corridors, we might also want to focus on stops that are co-located (e.g., near rail). These stops are key transfers, either regionally or to other local destinations.

## Stops (point geometry)

* Use GTFS `stops` and count trips per hour
* Consider whether `stops` (point geometry) would be the best for pairing this analysis with accessibility. 
   * Is accessibility also number of jobs reachable from a given stop? 
* This aggregation is the easiest to get, entirely derived from GTFS schedule tables. We'd only want to use this aggregation if there were other analyses using point data.
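
As a rough illustration of the stop-level count, here is a minimal sketch on made-up `stop_times` rows. The column names follow the GTFS spec, but the data is invented, and wrapping hours of 24 and above back into 0-23 is an assumption about how we would want to treat trips that run past midnight.

```python
import pandas as pd

# Toy stop_times rows (invented). GTFS arrival_time values can exceed
# 24:00:00 for trips that run past midnight.
stop_times = pd.DataFrame({
    "stop_id": ["A", "A", "A", "B", "B"],
    "trip_id": ["t1", "t2", "t3", "t1", "t4"],
    "arrival_time": ["07:05:00", "07:40:00", "08:10:00",
                     "07:15:00", "25:30:00"],
})

# Take the hour from the HH:MM:SS string and wrap hours >= 24 into 0-23
# (assumption: we want a clock hour rather than a service-day hour)
stop_times = stop_times.assign(
    arrival_hour=(
        stop_times.arrival_time.str.split(":").str[0].astype(int) % 24
    )
)

# Count distinct trips arriving at each stop in each hour
trips_per_hour = (
    stop_times.groupby(["stop_id", "arrival_hour"])
    .agg(n_trips=("trip_id", "nunique"))
    .reset_index()
)

print(trips_per_hour)
```

The result is one row per stop and hour, which is the shape we would need before attaching point geometry.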

## Tracts (polygon geometry)

* The transit service increase analysis aggregated stops per hour to census tracts. Census tracts gave us CalEnviroScreen designations and let us categorize transit routes as urban/suburban/rural.
* The aggregation is simple enough, but it's the visualization that comes after that's more challenging to interpret.
   * Are we saying something about tracts at the urban/suburban/rural or CalEnviroScreen high/moderate/low equity groups? 
   * If we're not, this aggregation is not ideal, because we almost always have to figure out how to combine tract-level stats with points, lines, or even larger polygons (Caltrans districts).

## Road segments (line geometry)
* We have roads now, but haven't really put them through heavy use yet, including the road segments cut into 1 km lengths.
* If we can aggregate arrivals to road segments, this gives us the strongest tie to the other issues:
   * Key transfer points - lots of operators and routes and stops served along the road segment
   * Parallel routes - multiple operators and multiple routes within same corridor
   * Regional connections - zero in on where these road segments come within a certain buffer of a freeway exit or rail station?
   * Transit service density - how many daily / peak trips travel along this corridor, how many arrivals are taking place along this corridor, how many operators / routes along this corridor
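
The segment-level rollup described above could look roughly like the sketch below. It assumes stops have already been spatially joined to segments. The `segment_id` and `operator` columns and all values are hypothetical stand-ins, not the real road or GTFS identifiers.

```python
import pandas as pd

# Toy output of a stops-to-segments spatial join:
# one row per (segment, operator, stop). All values are invented.
joined = pd.DataFrame({
    "segment_id": ["s1", "s1", "s1", "s2"],
    "operator": ["op_a", "op_a", "op_b", "op_a"],
    "stop_id": ["1", "2", "9", "3"],
    "all_arrivals": [40, 25, 60, 10],
    "peak_arrivals": [15, 10, 30, 4],
})

# Roll stop-level arrivals up to the segment
by_segment = (
    joined.groupby("segment_id")
    .agg(
        all_arrivals=("all_arrivals", "sum"),
        peak_arrivals=("peak_arrivals", "sum"),
        n_stops=("stop_id", "count"),
        n_operators=("operator", "nunique"),
    )
    .reset_index()
)

# Segments served by more than one operator are candidates for
# parallel routes or key transfer points
by_segment = by_segment.assign(
    multi_operator=by_segment.n_operators > 1
)

print(by_segment)
```

One table then carries the service density measures (arrivals) and the parallel route / transfer signals (operators and stops per segment) side by side.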


In [None]:
import geopandas as gpd
import intake
import pandas as pd

from shared_utils import rt_dates, rt_utils
from segment_speed_utils import helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SHARED_GCS

catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

analysis_date = rt_dates.DATES["sep2023"]

In [None]:
stop_times_with_dir = gpd.read_parquet(
    f"{RT_SCHED_GCS}stop_times_direction_{analysis_date}.parquet"
)

In [None]:
# need trip_instance_key to merge to stop_times
# grab arrival_hour from stop_times...categorize as peak/offpeak
trips = helpers.import_scheduled_trips(
    analysis_date,
    columns = ["trip_instance_key", "trip_id", "feed_key"]
)

stop_times = helpers.import_scheduled_stop_times(
    analysis_date,
    columns = ["feed_key", "trip_id", "stop_id", "stop_sequence", 
               "arrival_hour"]
).merge(
    trips,
    on = ["feed_key", "trip_id"],
    how = "inner"
)[["trip_instance_key", "stop_id", 
   "stop_sequence", "arrival_hour"]].query('arrival_hour.notnull()').compute()

In [None]:
gdf = pd.merge(
    stop_times_with_dir,
    stop_times,
    on = ["trip_instance_key", "stop_id", "stop_sequence"],
    how = "inner"
).astype({"arrival_hour": "int64"})

In [None]:
# Apply to the column directly instead of row-wise over the whole frame
gdf = gdf.assign(
    time_of_day = gdf.arrival_hour.apply(rt_utils.categorize_time_of_day)
)

gdf.time_of_day.value_counts()

In [None]:
# AM Peak / PM Peak are peak; everything else is offpeak
gdf = gdf.assign(
    peak_category = gdf.time_of_day.isin(["AM Peak", "PM Peak"]).map(
        {True: "peak", False: "offpeak"})
)

In [None]:
gdf.peak_category.value_counts()

In [None]:
stop_cols = ["schedule_gtfs_dataset_key", "stop_id"]

peak_st = gdf[gdf.peak_category=="peak"]

arrivals_by_stop = (gdf.groupby(stop_cols, 
                                observed=True, group_keys=False)
                    .agg({"arrival_hour": "count"})
                    .reset_index()
                    .rename(columns = {"arrival_hour": "all_arrivals"})
                   )

peak_arrivals_by_stop = (peak_st.groupby(stop_cols, 
                                         observed=True, group_keys=False)
                    .agg({"arrival_hour": "count"})
                    .reset_index()
                    .rename(columns = {"arrival_hour": "peak_arrivals"})
                   )

In [None]:
stop_arrivals_gdf = pd.merge(
    stop_times_with_dir[stop_cols + ["geometry"]].drop_duplicates(),
    arrivals_by_stop,
    on = stop_cols,
    how = "inner"
).merge(
    peak_arrivals_by_stop,
    on = stop_cols,
    how = "left"
).astype({
    "all_arrivals": "int64",
    "peak_arrivals": "Int64"
})

In [None]:
# Disneyland shuttle in Toy Story lot has 6_000 arrivals a day
stop_arrivals_gdf.describe()

## Census Tracts

We would use the CalEnviroScreen 4.0 + LEHD stats by census tract dataset.

In [None]:
tracts = catalog.calenviroscreen_lehd_by_tract.read()
tracts.head(2)

## Road Segments

* We should focus on primary / secondary roads only. If we skip most of the local roads, then we don't even need date-specific road segments, since primary / secondary roads are cut into 1 km segments already.
* Maybe we should include local roads and see what comes up in the sjoin? Primary / secondary roads are looking a little sparse.
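
What comes up in the sjoin depends heavily on the buffer distance. The sketch below uses plain planar geometry rather than geopandas: a stop falls inside a round-capped buffer of width `d` exactly when its distance to the segment is at most `d`. The segment, stop names, and coordinates are all hypothetical and assumed to be in meters.

```python
import math


def distance_to_segment(point, start, end):
    """Shortest planar distance from point to the segment start-end."""
    px, py = point
    x1, y1 = start
    x2, y2 = end
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: start and end are the same point
        return math.hypot(px - x1, py - y1)
    # Project the point onto the line, then clamp to the endpoints
    t = ((px - x1) * dx + (py - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


# Hypothetical 1 km east-west segment and three stops (meters)
segment = ((0.0, 0.0), (1000.0, 0.0))
stops = {
    "near_side": (200.0, 12.0),      # sidewalk on one side of the road
    "far_side": (210.0, -14.0),      # sidewalk on the opposite side
    "cross_street": (500.0, 80.0),   # stop on an intersecting street
}


def stops_within(buffer_m):
    """Names of the stops that a buffer of this width would capture."""
    return sorted(
        name for name, xy in stops.items()
        if distance_to_segment(xy, *segment) <= buffer_m
    )


print(stops_within(35))
print(stops_within(100))
```

In this toy setup a 35 m buffer captures the stops on both sides of the road, so a segment drawn once per direction would count both. A much wider buffer would also start pulling in stops from cross streets.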

In [None]:
road_segments = gpd.read_parquet(
    f"{SHARED_GCS}segmented_roads_2020_primarysecondary.parquet"
)

In [None]:
road_segments.head(2)

In [None]:
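# Buffer distance is in CRS units, so 35 is only 35 meters
# if the roads are in a projected CRS measured in meters
road_segments_buff = road_segments.assign(
    geometry = road_segments.geometry.buffer(35)
)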

In [None]:
stop_arrivals_by_segment_sjoin = gpd.sjoin(
    road_segments_buff,
    stop_arrivals_gdf,    
    how = "inner",
    predicate = "intersects"
).drop(columns = "index_right")

In [None]:
road_cols = ["linearid", "mtfcc", "fullname", 
             "segment_sequence", "primary_direction"]

# We might be overcounting here, since
# roads show up in both directions, and while a 35m buffer is intended
# to catch stops on the sidewalk, we might catch the other direction too
stop_arrivals_by_segment = (stop_arrivals_by_segment_sjoin
                            .groupby(road_cols, 
                                     observed=True, group_keys=False)
                            .agg({
                                "all_arrivals": "sum",
                                "peak_arrivals": "sum",
                                "stop_id": "count",
                                "schedule_gtfs_dataset_key": "nunique",
                            }).reset_index()
                            .rename(columns = {
                                "stop_id": "n_stops",
                                "schedule_gtfs_dataset_key": "n_operators",
                            })
                           )
                                       

In [None]:
# Attach road segment geometry
stop_arrivals_by_segment = (road_segments[road_cols + ["geometry"]]
                            .merge(
                                stop_arrivals_by_segment,
                                on = road_cols,
                                how = "inner"
                            )
                           )

In [None]:
stop_arrivals_by_segment.all_arrivals.describe()

In [None]:
stop_arrivals_by_segment = stop_arrivals_by_segment.assign(
    all_arrivals_quartile = pd.qcut(
        stop_arrivals_by_segment.all_arrivals,
        4, labels=False) + 1
)

In [None]:
stop_arrivals_by_segment.explore(
    "all_arrivals_quartile", 
    tiles = "CartoDB Positron",
    categorical = True
)