## Transit Bunching
* I tried converting `stop_times` to actual dates, but seconds seem easier to manipulate.
* `10_transit_bunching.ipynb` contains the timestamp attempts.
* Setup: `cd data-analyses/rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ../gtfs_digest`
* [Issue](https://github.com/cal-itp/data-analyses/issues/1099)

In [1]:
import geopandas as gpd
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from shared_utils import catalog_utils, rt_dates, rt_utils
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS

# https://github.com/cal-itp/data-analyses/blob/main/_shared_utils/shared_utils/gtfs_analytics_data.yml
GTFS_DATA_DICT = catalog_utils.get_catalog("gtfs_analytics_data")

from segment_speed_utils.project_vars import (
    COMPILED_CACHED_VIEWS,
    GTFS_DATA_DICT,
    PROJECT_CRS,
    RT_SCHED_GCS,
    SCHED_GCS,
    SEGMENT_GCS,
)

In [2]:
import merge_data

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
may_date = "2024-05-22"

In [5]:
drop_for_preview = [
    "schedule_gtfs_dataset_key",
    "trip_instance_key",
    "shape_array_key",
    "feed_key",
    "trip_id",
]

### Get routes with short headways
* Katrina: <i>but want to understand how the original column is calculated (over what time period). I would also count the agencies/organizations represented in that subset to see if it fits our preconceptions about which agencies run frequent routes. Also check mix of buses/trains.</i>
* Eric: <i>Once you do the 60 / frequency calculation, it’s not really a frequency any more but rather a headway. headway_minutes might be a better way to label it than frequency_in_minutes.</i>
* Amanda: to-do is to merge this with an agency-organization early on...Forgot which dataset is the best for this.

In [6]:
subset = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "route_primary_direction",
    "service_date",
    "frequency",
]

In [7]:
GTFS_DATA_DICT.rt_vs_schedule_tables.sched_route_direction_metrics

'schedule_route_dir/schedule_route_direction_metrics'

In [8]:
route_dir = merge_data.concatenate_schedule_by_route_direction([may_date])[subset]

In [9]:
route_dir["headway_minutes"] = 60 / route_dir.frequency

In [10]:
route_dir.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,route_primary_direction,service_date,frequency,headway_minutes
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0.0,Northbound,2024-05-22,0.92,65.22


In [11]:
route_freq_groupby = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "route_primary_direction",
]

In [13]:
high_frequency_routes = (
    route_dir.groupby(route_freq_groupby)
    .agg({"headway_minutes": "mean"})
    .reset_index()
)

#### Grab routes with the shortest headways for now (the 5th percentile is 17.65 min; the filter below uses 15 min).
* Eric: <i>Taking the 5%ile (17.65min headway) is reasonable, but I suspect the worst bunching issues might be on routes with headways at/below the 10min mark? Maybe try 15 and 10 as well?</i>

In [15]:
high_frequency_routes["headway_minutes"].describe(
    percentiles=[0.05, 0.1, 0.9, 0.95]
)

count   3417.00
mean     234.64
std      312.42
min        4.00
5%        17.65
10%       23.40
50%       97.71
90%      750.00
95%     1000.00
max     1250.00
Name: headway_minutes, dtype: float64

In [18]:
high_frequency_routes2 = high_frequency_routes.loc[
    high_frequency_routes.headway_minutes <= 15
]

In [19]:
high_frequency_routes2.route_id.nunique()

71
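To follow up on Eric's suggestion of trying several cutoffs, here is a minimal sketch that counts routes under each candidate headway. The `toy_routes` frame and its values are made up for illustration and only stand in for `high_frequency_routes`.

```python
import pandas as pd

# Toy stand-in for high_frequency_routes; values are invented for illustration.
toy_routes = pd.DataFrame(
    {
        "route_id": ["a", "b", "c", "d", "e"],
        "headway_minutes": [4.0, 9.5, 12.0, 16.0, 60.0],
    }
)

# Candidate cutoffs: the 5th percentile reported above, plus 15 and 10 minutes.
cutoffs = [17.65, 15, 10]

# Number of distinct routes at or below each cutoff.
routes_per_cutoff = {
    cutoff: int(
        toy_routes.loc[toy_routes.headway_minutes <= cutoff, "route_id"].nunique()
    )
    for cutoff in cutoffs
}
print(routes_per_cutoff)
```

Running the same dictionary comprehension against the real frame would show how quickly the sample shrinks as the cutoff tightens.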

### Get trips of high frequency routes

In [20]:
TABLE = GTFS_DATA_DICT.schedule_downloads.trips

In [21]:
FILE = f"{COMPILED_CACHED_VIEWS}{TABLE}_{may_date}.parquet"

In [22]:
trips_subset = [
    "gtfs_dataset_key",
    "route_id",
    "trip_instance_key",
    "shape_array_key",
    "feed_key",
    "route_long_name",
    "direction_id",
]

In [23]:
trips = pd.read_parquet(FILE)[trips_subset].rename(
    columns={"gtfs_dataset_key": "schedule_gtfs_dataset_key"}
)

In [24]:
trips.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,trip_instance_key,shape_array_key,feed_key,route_long_name,direction_id
0,1770249a5a2e770ca90628434d4934b1,3408,c256553e28c4bba693e3136240b35419,8f644f847e987de68e0cb6fcd339cf41,926867fdee73d5fbfe4f011871bcd830,Route 21,0.0
1,1770249a5a2e770ca90628434d4934b1,3408,488e9e227288606249d0508961c0fa15,8f644f847e987de68e0cb6fcd339cf41,926867fdee73d5fbfe4f011871bcd830,Route 21,0.0


In [25]:
trips_freq_routes = pd.merge(
    trips,
    high_frequency_routes2,
    on=["schedule_gtfs_dataset_key", "route_id", "direction_id"],
    how="inner",
)

In [26]:
trips_freq_routes.shape

(16205, 9)

In [27]:
trips_freq_routes.trip_instance_key.nunique()

16205

In [30]:
trips.trip_instance_key.nunique()

96391

### `rt_stop_times2`: Get Stop Times of High Frequency Routes/Trips
* What's the difference btwn `trip_id` and `trip_instance_key`?
* Eric: <i>trip_instance_key is created by our warehouse (see Columns section), and is a composite including trip_id, service date, and feed URL in order to uniquely identify a specific trip while allowing for joins across schedule+RT. It’s probably the one to use here, but personally I sometimes like keeping trip_id around for context.</i>

In [31]:
rt_stop_times = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/schedule_rt_stop_times_2024-05-22.parquet"
)

In [32]:
# Find only stop times of trips that belong to high frequency trips
rt_stop_times2 = pd.merge(
    rt_stop_times,
    trips_freq_routes,
    on=[
        "schedule_gtfs_dataset_key",
        "trip_instance_key",
    ],
    how="inner",
)

In [36]:
pd.merge(
    rt_stop_times,
    trips_freq_routes,
    on=[
        "schedule_gtfs_dataset_key",
        "trip_instance_key",
    ],
    how="outer",
    indicator = True
)[["_merge"]].value_counts()

_merge    
left_only     2030399
both           570863
right_only       2274
dtype: int64
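A small sketch of how to read the `_merge` counts above, using toy frames whose names and values are invented. `left_only` rows are stop times whose trips are not in the high-frequency subset, and `right_only` rows are high-frequency trips with no stop times at all. The 2274 `right_only` rows line up with 16205 trips minus the 13931 trips that made it into `rt_stop_times2`.

```python
import pandas as pd

# Toy stand-ins: `left` plays the role of rt_stop_times, `right` of trips_freq_routes.
left = pd.DataFrame(
    {"trip_instance_key": ["t1", "t2", "t3"], "stop_id": ["s1", "s1", "s2"]}
)
right = pd.DataFrame(
    {"trip_instance_key": ["t2", "t3", "t4"], "route_id": ["r1", "r1", "r2"]}
)

merged = pd.merge(left, right, on="trip_instance_key", how="outer", indicator=True)
merge_counts = merged["_merge"].value_counts()

# Trips present in the trips table that never show up in stop times.
unmatched_trips = merged.loc[merged["_merge"] == "right_only", "trip_instance_key"]
print(merge_counts)
print(unmatched_trips.tolist())
```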

In [33]:
len(rt_stop_times) - len(rt_stop_times2)

2030399

In [34]:
rt_stop_times2.shape

(570863, 14)

In [35]:
rt_stop_times2.trip_id.nunique(), rt_stop_times2.trip_instance_key.nunique()

(13931, 13931)

###  `rt_stop_times3`: Some scheduled arrival seconds span longer than a day: filter them out
* Katrina: <i>I assume the scheduled arrival sec > 86400 are after midnight, don't need to throw these out. Does rt arrival sec behave the same way, or do you need to create a datetime?</i>
* Eric: <i>agree w/ Katrina’s comments on handling seconds around midnight, I don’t know the actual answer but if rt_arrival_sec does in fact always go to 0 at midnight instead of sometimes going >86400 when schedule does you could use the % operator on the scheduled value like scheduled_arrival_sec % 86400</i>
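Eric's modulo idea can be sketched as below. This is only appropriate if `rt_arrival_sec` really does reset to 0 at midnight, which is not confirmed here. The toy values are illustrative (108431 is the maximum shown in the `describe()` output below), and the `_wrapped` and `is_after_midnight` column names are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 86_400

# Toy scheduled arrivals: before midnight, exactly midnight, and after midnight.
toy = pd.DataFrame({"scheduled_arrival_sec": [85_000.0, 86_400.0, 108_431.0]})

# Wrap times past midnight back into the 0-86399 range instead of dropping them.
toy["scheduled_arrival_sec_wrapped"] = toy.scheduled_arrival_sec % SECONDS_PER_DAY

# Keep a flag so the after-midnight rows stay identifiable after wrapping.
toy["is_after_midnight"] = toy.scheduled_arrival_sec >= SECONDS_PER_DAY
print(toy)
```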

In [37]:
rt_stop_times2.scheduled_arrival_sec.describe()

count   570863.00
mean     50485.85
std      19482.84
min       9420.00
25%      34140.00
50%      49680.00
75%      64260.00
max     108431.00
Name: scheduled_arrival_sec, dtype: float64

In [38]:
rt_stop_times3 = rt_stop_times2.loc[
    rt_stop_times2.scheduled_arrival_sec < 86400
].reset_index(drop=True)

### `rt_stop_times4`: Sort so `stop sequence` for the `operator-stop_id-route-id_direction_id` will be in order.
* Comparing bunching by STOP, so we have to look at the `stop sequence-stop_id.`
* Katrina: <i>Maybe you want to sort  by rt arrival seconds instead of scheduled?</i>
    * Amanda: Done.

In [39]:
rt_stop_times3.head(1)

Unnamed: 0,trip_id,stop_id,stop_sequence,scheduled_arrival_sec,schedule_gtfs_dataset_key,trip_instance_key,rt_arrival_sec,route_id,shape_array_key,feed_key,route_long_name,direction_id,route_primary_direction,headway_minutes
0,183-63mc4zr4r,4896831,1,21780.0,cc53a0dbf5df90e3009b9cb5d89d80ba,fab32bd349f15ec26794b00fba264631,21704,4443,8eb6571f567f3cc3b1e34a5118fe1587,2cfdf0e33e9229d6b0ad124d956f5856,DASH B,0.0,Northbound,12.33


In [40]:
# Rearrange: I want the stop sequence to be 1,2,3,4.
# Stop IDs can differ between trips of the same route even when the stop sequence is the same
rt_stop_times4 = rt_stop_times3.sort_values(
    by=[
        "schedule_gtfs_dataset_key",
        "route_id",
        "shape_array_key",
        "direction_id",
        "stop_sequence",
        "rt_arrival_sec",
    ]
).reset_index(drop=True)

In [41]:
# Make sure sorting is right
fillmore_stop_seq_13 = rt_stop_times4.loc[
    (rt_stop_times4.shape_array_key == "1b678a66d0009c55bc573cfc37aa1029")
    & (rt_stop_times4.stop_id == "13086")
    & (rt_stop_times4.direction_id == 0)
]

In [42]:
fillmore_stop_seq_13

Unnamed: 0,trip_id,stop_id,stop_sequence,scheduled_arrival_sec,schedule_gtfs_dataset_key,trip_instance_key,rt_arrival_sec,route_id,shape_array_key,feed_key,route_long_name,direction_id,route_primary_direction,headway_minutes
396954,11489969_M31,13086,13,67199.0,7cc0cb1871dfd558f11a2885c145d144,b73ff68241fdcb9ff5a3f3be424b2268,67051,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396955,11489975_M31,13086,13,69106.0,7cc0cb1871dfd558f11a2885c145d144,d30242b374225ed75a4aadd78fa8d7be,69048,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396956,11489815_M31,13086,13,69466.0,7cc0cb1871dfd558f11a2885c145d144,186fd89b59a49ddc1e84cb4b89c066d8,69723,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396957,11489816_M31,13086,13,70006.0,7cc0cb1871dfd558f11a2885c145d144,5cd2523ccd8c33e277aaae0ac9af35c8,70421,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396958,11489817_M31,13086,13,72992.0,7cc0cb1871dfd558f11a2885c145d144,45830206e5f4a07c06e520968f4b789f,73064,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396959,11489861_M31,13086,13,74432.0,7cc0cb1871dfd558f11a2885c145d144,a28a9fc884812bbb9e404de1dd970ccd,75434,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61
396960,11489818_M31,13086,13,75872.0,7cc0cb1871dfd558f11a2885c145d144,c20c26a42e6277dd327fe1280cead6a8,75943,22,1b678a66d0009c55bc573cfc37aa1029,7f69c2fdaa134642f14064a0b64d1495,FILLMORE,0.0,Southbound,7.61


### Calculate the difference btwn actual vs scheduled arrival.

In [43]:
def check_delay(df):
    df = df.assign(delay=df.rt_arrival_sec - df.scheduled_arrival_sec)

    print(df.delay.describe(percentiles=[0.05, 0.1, 0.9, 0.95]))

    max_delay_min = df.delay.max() / 60
    p95_delay_min = df.delay.quantile(q=0.95) / 60

    min_delay_min = df.delay.min() / 60
    p5_delay_min = df.delay.quantile(q=0.05) / 60

    print(f'min / max delay (minutes):{min_delay_min:.2f},{max_delay_min:.2f}')
    print(f"5th / 95th delay (minutes):{p5_delay_min:.2f}, {p95_delay_min:.2f}")

    return df

In [44]:
rt_stop_times4 = check_delay(rt_stop_times4)

count   546356.00
mean        66.05
std       2805.41
min     -86381.00
5%        -163.00
10%       -106.00
50%         86.00
90%        508.00
95%        710.00
max      35879.00
Name: delay, dtype: float64
min / max delay (minutes):-1439.68,597.98
5th / 95th delay (minutes):-2.72, 11.83
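The minimum delay of -86381 seconds is only 19 seconds short of a full day, which hints that some real-time arrivals may have wrapped past midnight while the scheduled value did not. Whether that is what happens in the real rows still needs checking. Under that assumption, a sketch of folding the delay into a half-day window looks like this (toy values constructed to reproduce that number; `delay_wrapped` is a hypothetical column name):

```python
import pandas as pd

SECONDS_PER_DAY = 86_400
HALF_DAY = SECONDS_PER_DAY // 2

# Row 0: scheduled just before midnight, observed just after (clock reset to 0).
# Row 1: an ordinary daytime arrival, 76 seconds early.
toy = pd.DataFrame(
    {
        "scheduled_arrival_sec": [86_390.0, 21_780.0],
        "rt_arrival_sec": [9, 21_704],
    }
)

raw_delay = toy.rt_arrival_sec - toy.scheduled_arrival_sec  # -86381 and -76

# Fold the delay into [-43200, 43200) so a midnight wrap becomes a small delay.
toy["delay_wrapped"] = (raw_delay + HALF_DAY) % SECONDS_PER_DAY - HALF_DAY
print(toy)
```

Ordinary delays pass through unchanged, so this could be applied to every row before any one-hour filter.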


#### `rt_stop_times5`: Filter out values in `delay` that are in the 1 hour zone
* Actual arrival times should not be more than an hour later or more than an hour earlier than scheduled.
* Katrina: <i>I am not sure if you need to throw out ">1 hour delay" trips, the customer experience we're interested in is actual wait times between stop arrivals</i>
    * Amanda: forgot why Tiffany does this but she generally throws out delays that are ~one hour.

In [45]:
# Filter to only delays that are an hour or less
rt_stop_times5 = rt_stop_times4[rt_stop_times4["delay"] <= 60 * 60].reset_index(
    drop=True
)

In [46]:
# Filter to only delays that are no less than -1 hour (at most one hour early)
rt_stop_times5 = rt_stop_times5[rt_stop_times5["delay"] >= -3600].reset_index(drop=True)

In [47]:
len(rt_stop_times4) - len(rt_stop_times5)

835

In [48]:
len(rt_stop_times) - len(rt_stop_times5)

2055741

### Calculate the actual headway at the `operator-route-direction_id-stop_sequence-stop_id` grain
* Do I need to include feed key and shape array key? Amanda: still need help

In [49]:
groupby_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "shape_array_key",
    "direction_id",
    "route_primary_direction",
    "stop_sequence",
    "stop_id",
]

In [50]:
# Within each group, subtract the previous row's rt_arrival_sec
# from the current row's rt_arrival_sec
rt_stop_times4["actual_headway"] = rt_stop_times4.groupby(groupby_cols)[
    "rt_arrival_sec"
].diff()

### Calculate scheduled headway
* Using the same grain.

In [51]:
rt_stop_times4["schd_headway"] = rt_stop_times4.groupby(groupby_cols)[
    "scheduled_arrival_sec"
].diff()
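To make the `groupby(...).diff()` step concrete, here is a toy version using the first three Fillmore arrivals shown earlier, plus an invented second stop. Only `stop_id` is used as the grain here to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "stop_id": ["13086", "13086", "13086", "99", "99"],
        "rt_arrival_sec": [67_051, 69_048, 69_723, 100, 700],
        "scheduled_arrival_sec": [67_199.0, 69_106.0, 69_466.0, 0.0, 600.0],
    }
).sort_values(["stop_id", "rt_arrival_sec"])

grouped = toy.groupby("stop_id")

# diff() restarts in every group, so the first arrival at each stop is NaN.
toy["actual_headway"] = grouped["rt_arrival_sec"].diff()
toy["schd_headway"] = grouped["scheduled_arrival_sec"].diff()
print(toy)
```

The sort must happen before the `diff()`, because the difference is taken between adjacent rows in whatever order they appear within each group.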

### Delete out rows that are `nan`??
* I am not sure if `nans` impact calculations of the mean scheduled headway and whatnot?
* These `nans` are because the first `operator-route-stop_id-stop_sequence` combo won't have anything to compare it to.
* Katrina: <i>I would fill in the actual/schedule headway columns with 0 rather than dropping the first row  in each grouping. I wonder if it makes sense to use a more descriptive column name than headway, such as "minutes since last vehicle"</i>
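On the question of whether `nan` values affect the mean: pandas skips them by default (`skipna=True`), while filling with 0 adds a zero-length headway to every group and pulls the mean down. A toy comparison with made-up headways:

```python
import pandas as pd

# First arrival in a group has no predecessor, hence the NaN.
headway = pd.Series([float("nan"), 600.0, 480.0, 720.0], name="actual_headway")

mean_skipping_nan = headway.mean()           # NaN ignored: (600 + 480 + 720) / 3
mean_filled_zero = headway.fillna(0).mean()  # zero counted: 1800 / 4

print(mean_skipping_nan, mean_filled_zero)
```

Filling with 0 would also make the first arrival look like a vehicle arriving zero seconds after the previous one, so any threshold-based bunching flag would catch it.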

### `rt_stop_times6`: Delete out the rows in which `actual_headway` and `schd_headway` are `nan`: this is basically the first row of each grain

In [61]:
transit_matters_df1 = rt_stop_times4.copy()

In [62]:
transit_matters_df1["pct_actual_schd_headway"] = (
    transit_matters_df1.actual_headway / transit_matters_df1.schd_headway
)

In [63]:
import numpy as np

transit_matters_df1["bunched_y_n"] = np.where(
    transit_matters_df1["pct_actual_schd_headway"] < 0.25, "bunched", "not bunched"
)

#### There are some very extreme values: how to deal with this?


In [64]:
transit_matters_df1.pct_actual_schd_headway.describe()

count   528794.00
mean         0.95
std          0.52
min        -50.30
25%          0.77
50%          0.98
75%          1.16
max         10.62
Name: pct_actual_schd_headway, dtype: float64

In [65]:
transit_matters_df1.bunched_y_n.value_counts(dropna=True)

not bunched    513890
bunched         32466
Name: bunched_y_n, dtype: int64
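One thing to note about the counts above: the two labels sum to 546356, while `describe()` reports only 528794 non-null ratios, so 17562 rows with a `nan` ratio are being labeled "not bunched". Negative ratios, such as the -50.30 minimum, fall below 0.25 and are labeled "bunched". A possible way to separate those cases is sketched below. The "unknown" label and the choice to treat negative ratios as unknown are assumptions, not a settled rule.

```python
import numpy as np
import pandas as pd

# Toy ratios: missing, negative, clearly bunched, normal, and very large.
ratio = pd.Series([np.nan, -50.30, 0.10, 0.98, 10.62], name="pct_actual_schd_headway")

# Current approach: NaN compares False, negatives compare True.
two_way = np.where(ratio < 0.25, "bunched", "not bunched")

# Alternative: set aside missing or negative ratios before applying the threshold.
conditions = [ratio.isna() | (ratio < 0), ratio < 0.25]
labels = ["unknown", "bunched"]
three_way = np.select(conditions, labels, default="not bunched")

print(pd.DataFrame({"ratio": ratio, "two_way": two_way, "three_way": three_way}))
```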

#### Groupby grain and see how many trips for that grain are considered "bunched" or not.

In [66]:
transit_matters_df2 = (
    transit_matters_df1.groupby(
        [
            "schedule_gtfs_dataset_key",
            "route_long_name",
            "shape_array_key",
            "route_id",
            "stop_id",
            "direction_id",
            "route_primary_direction",
            "bunched_y_n",
        ]
    )
    .agg({"trip_instance_key": "nunique"})
    .reset_index()
)

In [67]:
# Keep only rows that are bunched.
bunched_only = transit_matters_df2.loc[
    transit_matters_df2.bunched_y_n == "bunched"
].reset_index(drop=True)

In [68]:
transit_matters_agg = [
    "schedule_gtfs_dataset_key",
    "route_long_name",
    "shape_array_key",
    "route_id",
    "stop_id",
    "direction_id",
    "route_primary_direction",
]

In [69]:
# Aggregate all trips on the grain
transit_matters_all_trips = (
    transit_matters_df1.groupby(transit_matters_agg)
    .agg({"trip_instance_key": "nunique"})
    .reset_index()
    .rename(columns={"trip_instance_key": "all_trips"})
)

In [70]:
# Merge back, using left merge to keep bunching
bunched_only = pd.merge(
    bunched_only, transit_matters_all_trips, on=transit_matters_agg, how="left"
)

In [71]:
bunched_only["pct_trips_bunched"] = (
    bunched_only.trip_instance_key / bunched_only.all_trips * 100
)

In [72]:
bunched_only = bunched_only.drop(columns=["all_trips"])

In [73]:
# Merge back all rows that don't have bunching trips.
transit_matters_m1 = pd.merge(
    transit_matters_all_trips,
    bunched_only,
    on=transit_matters_agg,
    how="left",
)

In [75]:
transit_matters_m1 = transit_matters_m1.drop(
    columns=["bunched_y_n", "trip_instance_key"]
)

In [76]:
transit_matters_m1.pct_trips_bunched = transit_matters_m1.pct_trips_bunched.fillna(0)

In [77]:
transit_matters_m1.pct_trips_bunched.describe()

count   17305.00
mean        3.14
std         6.35
min         0.00
25%         0.00
50%         0.00
75%         4.00
max        50.00
Name: pct_trips_bunched, dtype: float64

In [78]:
transit_matters_m1.loc[transit_matters_m1.pct_trips_bunched >= 10].shape

(1892, 9)
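The filter, merge, and fill steps above could possibly be collapsed into a single grouped mean of a boolean flag. This is a sketch on invented data with only `stop_id` as the grain. It matches the `nunique` ratio only if each trip appears once per group, which is assumed here rather than verified.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "stop_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "trip_instance_key": ["t1", "t2", "t3", "t4", "t1", "t2"],
        "bunched_y_n": [
            "bunched",
            "not bunched",
            "not bunched",
            "not bunched",
            "not bunched",
            "not bunched",
        ],
    }
)

# The mean of a boolean column is the share of True values in each group.
pct_trips_bunched = (
    toy.assign(is_bunched=toy.bunched_y_n.eq("bunched"))
    .groupby("stop_id")["is_bunched"]
    .mean()
    .mul(100)
    .rename("pct_trips_bunched")
    .reset_index()
)
print(pct_trips_bunched)
```

Stops with no bunched trips come out as 0 directly, so no `fillna(0)` step is needed.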

### Use 2 minute benchmark
* [Source](https://static1.squarespace.com/static/533b9a24e4b01d79d0ae4376/t/645e82de1f570b31497c44dc/1683915486889/TransitMatters-Headwaymanagement.pdf)
* Justifying the use of headway maintenance. For example, in April 2022 the 66 bus significantly bunched around several stops. When bunching is defined as buses that run within two minutes or less of each other, inbound buses towards Nubian Square bunched 10% of the time at Brigham Circle, 9% at Brookline Village and Roxbury Crossing, and 8% of the time at Coolidge Corner. Bunching is even more dramatic outbound towards Harvard Square where buses bunched over 35% of the time at Winship St, 13% at Coolidge Corner and Harvard Ave at Commonwealth Ave, and 12% at North Harvard St at Western Ave. View more data about bus bunching through the TransitMatters Data Dashboard here.

In [81]:
two_minutess_df = rt_stop_times5.copy()

In [82]:
two_minutess_df["actual_headway_min"] = two_minutess_df.actual_headway / 60

AttributeError: 'DataFrame' object has no attribute 'actual_headway'
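The `AttributeError` is consistent with the order of the cells: `actual_headway` was added to `rt_stop_times4` after `rt_stop_times5` had already been created from it, so the filtered frame never received the column. One option is to compute headways on the filtered frame itself. Below is a sketch with a hypothetical helper, `add_headways`, run on toy data standing in for `rt_stop_times5`. The `bunched_2min` column name is also made up.

```python
import pandas as pd


def add_headways(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Hypothetical helper: add actual and scheduled headways (seconds) per group."""
    df = df.sort_values(group_cols + ["rt_arrival_sec"]).reset_index(drop=True)
    grouped = df.groupby(group_cols)
    return df.assign(
        actual_headway=grouped["rt_arrival_sec"].diff(),
        schd_headway=grouped["scheduled_arrival_sec"].diff(),
    )


# Toy stand-in for the filtered frame; values are invented.
toy_filtered = pd.DataFrame(
    {
        "stop_id": ["s1", "s1", "s1"],
        "rt_arrival_sec": [1_000, 1_090, 1_700],
        "scheduled_arrival_sec": [1_000.0, 1_600.0, 2_200.0],
    }
)

two_min = add_headways(toy_filtered, ["stop_id"])
two_min["actual_headway_min"] = two_min.actual_headway / 60

# NaN <= 2 evaluates to False, so the first arrival is never flagged.
two_min["bunched_2min"] = two_min.actual_headway_min <= 2
print(two_min)
```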

In [None]:
two_minutess_df["bunched_y_n"] = np.where(
    two_minutess_df["actual_headway_min"] <= 2, "bunched", "not bunched"
)

In [None]:
two_minutess_df.info()

In [None]:
two_minutess_df.bunched_y_n.value_counts()

#### Same code as Transit Matters Approach

In [None]:
two_minutes_agg1 = (
    two_minutess_df.groupby(
        [
            "schedule_gtfs_dataset_key",
            "route_long_name",
            "shape_array_key",
            "route_id",
            "stop_id",
            "direction_id",
            "route_primary_direction",
            "bunched_y_n",
        ]
    )
    .agg({"trip_instance_key": "nunique"})
    .reset_index()
)

In [None]:
bunched_only_two_min = (
    two_minutes_agg1.loc[two_minutes_agg1.bunched_y_n == "bunched"]
    .reset_index(drop=True)
    .rename(columns={"trip_instance_key": "bunched_trips"})
)

In [None]:
# I want to do a left merge because I'm only interested in trips that bunched.
bunched_only_two_min = pd.merge(
    bunched_only_two_min,
    transit_matters_all_trips,
    on=[
        "schedule_gtfs_dataset_key",
        "route_long_name",
        "shape_array_key",
        "route_id",
        "stop_id",
        "direction_id",
        "route_primary_direction",
    ],
    how="left",
)

In [None]:
bunched_only_two_min["pct_trips_bunched"] = (
    bunched_only_two_min.bunched_trips / bunched_only_two_min.all_trips * 100
)

In [None]:
bunched_only_two_min = bunched_only_two_min.drop(columns=["all_trips"])

In [None]:
# Need to do a left merge on all trips for the stops that don't have bunching.
final_two_minute = pd.merge(
    transit_matters_all_trips,
    bunched_only_two_min,
    on=[
        "schedule_gtfs_dataset_key",
        "route_long_name",
        "shape_array_key",
        "route_id",
        "stop_id",
        "direction_id",
        "route_primary_direction",
    ],
    how="left",
)

In [None]:
final_two_minute.shape

In [None]:
final_two_minute = final_two_minute.drop(columns=["bunched_y_n", "bunched_trips"])

### Checkout all 3 using a stop_sequence/direction_id for Fillmore again
* Very different results between the 3 approaches. 
* The coefficient one says frequent bunching lol, but the other methods say there isn't any bunching...

In [None]:
transit_matters_m2.shape

In [None]:
bunching_by_stops.shape

In [None]:
fillmore.loc[
    (fillmore.shape_array_key == "1b678a66d0009c55bc573cfc37aa1029")
    & (fillmore.stop_id == "13086")
    & (fillmore.direction_id == 0)
]

In [None]:
transit_matters_m2.loc[
    (transit_matters_m2.shape_array_key == "1b678a66d0009c55bc573cfc37aa1029")
    & (transit_matters_m2.stop_id == "13086")
    & (transit_matters_m2.direction_id == 0)
]

In [None]:
final_two_minute.loc[
    (final_two_minute.shape_array_key == "1b678a66d0009c55bc573cfc37aa1029")
    & (final_two_minute.stop_id == "13086")
    & (final_two_minute.direction_id == 0)
]

In [None]:
# Convert seconds to a timedelta
transit_matters_fillmore_test["rt_arrival_time"] = pd.to_timedelta(
    transit_matters_fillmore_test["rt_arrival_sec"], unit="s"
)

In [None]:
transit_matters_fillmore_test
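If an actual timestamp is wanted rather than a timedelta, the timedelta can be added to the service date. This is a sketch that assumes the seconds are counted from midnight at the start of the service date. The `rt_arrival_datetime` column name is hypothetical, and 67051 is the first Fillmore arrival shown earlier.

```python
import pandas as pd

service_date = pd.Timestamp("2024-05-22")

# Second value is one minute past midnight on the following calendar day.
toy = pd.DataFrame({"rt_arrival_sec": [67_051, 86_400 + 60]})

toy["rt_arrival_time"] = pd.to_timedelta(toy.rt_arrival_sec, unit="s")
toy["rt_arrival_datetime"] = service_date + toy.rt_arrival_time
print(toy)
```

Values above 86400 roll over to the next calendar day on their own, which is one argument for datetimes when trips run past midnight.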