## Transit Bunching 
* `cd data-analyses/rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ../gtfs_digest`
* [Issue](https://github.com/cal-itp/data-analyses/issues/1099)


In [1]:
import datetime as dt

import altair as alt
import geopandas as gpd
import merge_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers, time_series_utils
from shared_utils import catalog_utils, rt_dates, rt_utils
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS

# https://github.com/cal-itp/data-analyses/blob/main/_shared_utils/shared_utils/gtfs_analytics_data.yml
GTFS_DATA_DICT = catalog_utils.get_catalog("gtfs_analytics_data")

from segment_speed_utils.project_vars import (
    COMPILED_CACHED_VIEWS,
    GTFS_DATA_DICT,
    PROJECT_CRS,
    RT_SCHED_GCS,
    SCHED_GCS,
    SEGMENT_GCS,
)

In [2]:
import yaml

with open("readable.yml") as f:
    readable_dict = yaml.safe_load(f)
with open("color_palettes.yml") as f:
    color_dict = yaml.safe_load(f)

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
may_date = "2024-05-22"

In [5]:
drop_for_preview = [
    "schedule_gtfs_dataset_key",
    "trip_instance_key",
    "shape_array_key",
    "feed_key",
    "trip_id",
]

## Step 1: Grab Routes from `GTFS_DATA_DICT.digest_tables.route_schedule_vp `

In [6]:
subset = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "route_primary_direction",
    "service_date",
    "frequency",
]

In [7]:
route_dir_columns = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "time_period",
    "route_primary_direction",
    "frequency",
    "service_date",
]

In [8]:
route_dir = merge_data.concatenate_schedule_by_route_direction([may_date])[
    route_dir_columns
]

### Calculate headways

In [9]:
route_dir["headway_minutes"] = 60 / route_dir.frequency

### Per Eric and Katrina's suggestion, retain only rows that hold `peak` hours in `time_period`

In [10]:
# Filter for only peak
route_dir = route_dir.loc[route_dir.time_period == "peak"].reset_index(drop=True)

In [11]:
len(route_dir)

3563

In [12]:
route_dir.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes
2060,baeeb157e85a901e47b828ef9fe75091,237,1.0,peak,Westbound,0.96,2024-05-22,62.5


### Merge in operators and districts (maybe this should go at the end of all the aggregating)

In [13]:
# Grab Crosswalk
CROSSWALK = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk

In [14]:
crosswalk_cols = [
    "schedule_gtfs_dataset_key",
    "organization_name",
    "name",
    "caltrans_district",
]

In [15]:
crosswalk_df = (
    time_series_utils.concatenate_datasets_across_dates(
        SCHED_GCS, CROSSWALK, [may_date], data_type="df", columns=crosswalk_cols
    )
    .sort_values(["service_date"])
    .reset_index(drop=True)
)

In [16]:
crosswalk_df.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,name,caltrans_district,service_date
0,1770249a5a2e770ca90628434d4934b1,Ventura County Transportation Commission,VCTC GMV Schedule,07 - Los Angeles,2024-05-22
1,d19b822237403aa9d44ab2e923afc7b9,City of Delano,Delano Schedule,06 - Fresno,2024-05-22


In [17]:
routes = pd.merge(
    route_dir,
    crosswalk_df,
    on=["schedule_gtfs_dataset_key", "service_date"],
    how="left",
)

In [18]:
len(routes)

4614

### Observation: Some headway minutes seem off which is skewing the "bunching" calculations
* How come route_id 30 for City of LA has a headway of 60 minutes for direction 1, but  157 minute headway for direction 0? Shouldn't the `frequency` and `headway_minutes` be similar?

In [19]:
routes.loc[
    (routes.schedule_gtfs_dataset_key == "cc53a0dbf5df90e3009b9cb5d89d80ba")
    & (routes.route_id == "30")
    & (routes.direction_id == 0)
]

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes,organization_name,name,caltrans_district
3521,cc53a0dbf5df90e3009b9cb5d89d80ba,30,0.0,peak,Westbound,0.12,2024-05-22,500.0,City of Los Angeles,LA DOT Schedule,07 - Los Angeles


In [20]:
routes.loc[
    (routes.schedule_gtfs_dataset_key == "cc53a0dbf5df90e3009b9cb5d89d80ba")
    & (routes.route_id == "30")
    & (routes.direction_id == 1)
]

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes,organization_name,name,caltrans_district
3522,cc53a0dbf5df90e3009b9cb5d89d80ba,30,1.0,peak,Eastbound,0.33,2024-05-22,181.82,City of Los Angeles,LA DOT Schedule,07 - Los Angeles


In [21]:
display(
    routes.loc[
        (routes.schedule_gtfs_dataset_key == "55a01ef72af21906934ae8ffb4786e86")
        & (routes.route_id == "390")
        & (routes.direction_id == 1)
    ]
)
display(
    routes.loc[
        (routes.schedule_gtfs_dataset_key == "55a01ef72af21906934ae8ffb4786e86")
        & (routes.route_id == "390")
        & (routes.direction_id == 0)
    ]
)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes,organization_name,name,caltrans_district
1547,55a01ef72af21906934ae8ffb4786e86,390,1.0,peak,Westbound,0.04,2024-05-22,1500.0,Eastern Contra Costa Transit Authority,Bay Area 511 Tri Delta Schedule,04 - Oakland


Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes,organization_name,name,caltrans_district
1546,55a01ef72af21906934ae8ffb4786e86,390,0.0,peak,Eastbound,0.33,2024-05-22,181.82,Eastern Contra Costa Transit Authority,Bay Area 511 Tri Delta Schedule,04 - Oakland


## Step 2: Load in Trips
* I feel like I am grabbing datasets from multiple areas of the pipeline, when I should just stick to one area.

In [22]:
TABLE = GTFS_DATA_DICT.schedule_downloads.trips

In [23]:
FILE = f"{COMPILED_CACHED_VIEWS}{TABLE}_{may_date}.parquet"

In [24]:
trips_subset = [
    "gtfs_dataset_key",
    "route_id",
    "trip_instance_key",
    "shape_array_key",
    "feed_key",
    "route_long_name",
    "direction_id",
    "route_type",
]

In [25]:
trips = pd.read_parquet(FILE)[trips_subset].rename(
    columns={"gtfs_dataset_key": "schedule_gtfs_dataset_key"}
)

In [26]:
trips_routes = pd.merge(
    trips,
    routes,
    on=["schedule_gtfs_dataset_key", "route_id", "direction_id"],
    how="inner",
)

In [27]:
trips_routes.route_id.nunique()

1374

### Is there a more sophisticated way to merge in the actual `route_type`? 

In [28]:
# https://gtfs.org/documentation/schedule/reference/#
route_type_crosswalk = {
    "route_type": ["0", "1", "2", "3", "4", "5", "6", "7", "11", "12"],
    "route_type_str": [
        "Tram, Streetcar, Light rail",
        "Subway, Metro",
        "Rail",
        "Bus",
        "Ferry.",
        "Cable tram.",
        "Aerial lift, suspended cable car (e.g., gondola lift, aerial tramway).",
        "Funicular.",
        "Trolleybus.",
        "Monorail.",
    ],
}

In [29]:
route_type_crosswalk_df = pd.DataFrame(route_type_crosswalk)

In [30]:
# Merge for route_type
trips_routes = pd.merge(
    trips_routes, route_type_crosswalk_df, on=["route_type"], how="left"
)

In [31]:
trips_routes = trips_routes.drop(columns=["route_type"]).rename(
    columns={"route_type_str": "route_type"}
)

## Step 3: Load Stop Times 

In [32]:
rt_stop_times = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/schedule_rt_stop_times_2024-05-22.parquet"
)

In [33]:
trips_routes.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,trip_instance_key,shape_array_key,feed_key,route_long_name,direction_id,time_period,route_primary_direction,frequency,service_date,headway_minutes,organization_name,name,caltrans_district,route_type
0,1770249a5a2e770ca90628434d4934b1,3408,c256553e28c4bba693e3136240b35419,8f644f847e987de68e0cb6fcd339cf41,926867fdee73d5fbfe4f011871bcd830,Route 21,0.0,peak,Westbound,0.62,2024-05-22,96.77,Ventura County Transportation Commission,VCTC GMV Schedule,07 - Los Angeles,Bus


In [34]:
rt_stop_times.shape

(2601262, 7)

In [35]:
trips_routes_times = pd.merge(
    rt_stop_times,
    trips_routes,
    on=[
        "schedule_gtfs_dataset_key",
        "trip_instance_key",
    ],
    how="inner",
)

In [36]:
(trips_routes_times.scheduled_arrival_sec.isna().sum())

15029

### Observation: about 126 rows are duplicated.

In [37]:
len(trips_routes_times)

3169136

In [38]:
trips_routes_times2 = trips_routes_times.drop_duplicates().reset_index(drop=True)

In [39]:
len(trips_routes_times) - len(trips_routes_times2)

126

## Step 4: Sorting & Subsetting

In [40]:
subset = [
    "service_date",
    "caltrans_district",
    "schedule_gtfs_dataset_key",
    "feed_key",
    "organization_name",
    "route_long_name",
    "route_type",
    "route_id",
    "direction_id",
    "stop_id",
    "stop_sequence",
    "trip_instance_key",
    "rt_arrival_sec",
    "scheduled_arrival_sec",
    "headway_minutes",
]

#### Review sorting order. 
* Exclude or include `stop_sequence`?

In [41]:
trips_routes_times3 = trips_routes_times2[subset]

In [42]:
len(trips_routes_times3)

3169010

In [43]:
trips_routes_times4 = trips_routes_times3.sort_values(
    by=[
        "schedule_gtfs_dataset_key",
        "route_id",
        "direction_id",
        "stop_id",
        "stop_sequence",
        "rt_arrival_sec",
    ],
).reset_index(drop=True)

## Step 5: Fixing Time Stamps (this portion is where I am having a lot of trouble) 
* The data is downloaded from the service_date May 22, 2024. 
* However,  the difference between `scheduled_arrival_sec` versus `rt_arrival_sec` times can look extreme.
    * Example 1: some trips that begin on May 21, 2024 late in the evening (around 11pm) run until early morning of May 22, 2024.

In [44]:
# Convert time to seconds
trips_routes_times4["converted_rt_arrival"] = pd.to_datetime(
    trips_routes_times4["service_date"]
) + pd.to_timedelta(trips_routes_times4["rt_arrival_sec"] % 86400, unit="s")

In [45]:
# Convert time to seconds
trips_routes_times4["converted_schd_arrival"] = pd.to_datetime(
    trips_routes_times4["service_date"]
) + pd.to_timedelta(trips_routes_times4["scheduled_arrival_sec"] % 86400, unit="s")

  base = data.astype(np.int64)
  data = (base * m + (frac * m).astype(np.int64)).view("timedelta64[ns]")


### Subtracting `converted_rt_arrival` from `converted_schd_arrival` then using `describe` to find the "extreme" rows.

In [46]:
percentiles = [0.01, 0.02, 0.05, 0.1, 0.9, 0.95, 0.98, 0.99]

In [47]:
trips_routes_times4["delay_min"] = (
    trips_routes_times4["converted_rt_arrival"]
    - trips_routes_times4["converted_schd_arrival"]
).dt.total_seconds() / 60

In [48]:
print(trips_routes_times4.delay_min.describe(percentiles))

count   3153981.00
mean          2.01
std          33.30
min       -1439.78
1%           -5.38
2%           -3.92
5%           -2.48
10%          -1.53
50%           1.45
90%           7.65
95%          10.83
98%          15.80
99%          20.27
max        1439.98
Name: delay_min, dtype: float64


### Manually fixing these rows but this wasn't "scientific" at all -> I want to do this a more "automated" way.
* These were based on my own observations.
* Using 600 (10 hours) as a benchmark, but not sure if this is the best.

In [49]:
preview_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "scheduled_arrival_sec",
    "converted_schd_arrival",
    "rt_arrival_sec",
    "converted_rt_arrival",
    "delay_min",
]

### These rows need a day subtracted off of `converted_rt_arrival` because the trip actually began the night prior to the service_date.

In [50]:
trips_routes_times4.loc[trips_routes_times4.delay_min >= 600][preview_cols].head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min
23568,0666caf3ec1ecc96b74f4477ee4bc939,102-13172,86400.0,2024-05-22,86253,2024-05-22 23:57:33,1437.55
23705,0666caf3ec1ecc96b74f4477ee4bc939,102-13172,86400.0,2024-05-22,86275,2024-05-22 23:57:55,1437.92


In [51]:
trips_routes_times4["converted_rt_arrival"] = np.where(
    trips_routes_times4["delay_min"] >= 600,
    trips_routes_times4["converted_rt_arrival"] - pd.Timedelta(days=1),
    trips_routes_times4["converted_rt_arrival"],
)

In [52]:
trips_routes_times4.loc[trips_routes_times4.delay_min >= 600][preview_cols].head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min
23568,0666caf3ec1ecc96b74f4477ee4bc939,102-13172,86400.0,2024-05-22,86253,2024-05-21 23:57:33,1437.55
23705,0666caf3ec1ecc96b74f4477ee4bc939,102-13172,86400.0,2024-05-22,86275,2024-05-21 23:57:55,1437.92


### These trips need a day added because the trip began on the service_date late at night and continued onto the day after. 

In [53]:
trips_routes_times4.loc[trips_routes_times4.delay_min <= -600][preview_cols].head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min
12686,0666caf3ec1ecc96b74f4477ee4bc939,10-13172,86220.0,2024-05-22 23:57:00,266,2024-05-22 00:04:26,-1432.57
14512,0666caf3ec1ecc96b74f4477ee4bc939,10-13172,86340.0,2024-05-22 23:59:00,389,2024-05-22 00:06:29,-1432.52


In [54]:
trips_routes_times4["converted_rt_arrival"] = np.where(
    trips_routes_times4["delay_min"] <= -600,
    trips_routes_times4["converted_rt_arrival"] + pd.Timedelta(days=1),
    trips_routes_times4["converted_rt_arrival"],
)

In [55]:
trips_routes_times4.loc[trips_routes_times4.delay_min <= -600][preview_cols].head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min
12686,0666caf3ec1ecc96b74f4477ee4bc939,10-13172,86220.0,2024-05-22 23:57:00,266,2024-05-23 00:04:26,-1432.57
14512,0666caf3ec1ecc96b74f4477ee4bc939,10-13172,86340.0,2024-05-22 23:59:00,389,2024-05-23 00:06:29,-1432.52


### Recalculate `delay_min` and throw away any extreme values that couldn't be fixed

In [56]:
trips_routes_times4["delay_min2"] = (
    trips_routes_times4["converted_rt_arrival"]
    - trips_routes_times4["converted_schd_arrival"]
).dt.total_seconds() / 60

In [57]:
print(trips_routes_times4.delay_min.describe(percentiles))

count   3153981.00
mean          2.01
std          33.30
min       -1439.78
1%           -5.38
2%           -3.92
5%           -2.48
10%          -1.53
50%           1.45
90%           7.65
95%          10.83
98%          15.80
99%          20.27
max        1439.98
Name: delay_min, dtype: float64


In [58]:
print(trips_routes_times4.delay_min2.describe(percentiles))

count   3153981.00
mean          2.54
std          10.46
min        -839.98
1%           -5.28
2%           -3.88
5%           -2.47
10%          -1.52
50%           1.45
90%           7.67
95%          10.85
98%          15.82
99%          20.27
max         837.53
Name: delay_min2, dtype: float64


### About 2% of rows are thrown away if I use the 1% and 99% perentile cutoff, but this also seems overly harsh. That means if a bus arrives more than 5 minutes ahead of schedule, that row is thrown away. Likewise, if the bus is more than 20 minutes behind its scheduled time, the row is also cut.

In [59]:
(
    (
        len(trips_routes_times4.loc[trips_routes_times4.delay_min2 >= 20.27])
        + len(trips_routes_times4.loc[trips_routes_times4.delay_min2 <= -5.28])
    )
    / len(trips_routes_times4)
) * 100

1.9894541197408653

In [60]:
extreme_values = trips_routes_times4.loc[
    (trips_routes_times4.delay_min2 <= -5.28)
    | (trips_routes_times4.delay_min2 >= 20.27)
]

In [61]:
print(extreme_values.delay_min2.describe(percentiles))

count   63046.00
mean       12.87
std        68.12
min      -839.98
1%        -81.54
2%        -53.32
5%        -24.63
10%       -13.95
50%        -5.28
90%        39.64
95%        56.43
98%       133.15
99%       307.30
max       837.53
Name: delay_min2, dtype: float64


In [62]:
preview_cols.append("delay_min2")

In [63]:
extreme_values[preview_cols].sample(10)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min,delay_min2
262207,0666caf3ec1ecc96b74f4477ee4bc939,2-13172,62940.0,2024-05-22 17:29:00,64382,2024-05-22 17:53:02,24.03,24.03
1124056,5456c80d420043e15c8eb7368a8a4d89,292,24772.0,2024-05-22 06:52:52,26151,2024-05-22 07:15:51,22.98,22.98
2878322,f74424acf8c41e4c1e9fd42838c4875c,178,75910.0,2024-05-22 21:05:10,77281,2024-05-22 21:28:01,22.85,22.85
837369,1770249a5a2e770ca90628434d4934b1,3395,36000.0,2024-05-22 10:00:00,39078,2024-05-22 10:51:18,51.3,51.3
2776234,efbbd5293be71f7a5de0cf82b59febe1,3730,38427.0,2024-05-22 10:40:27,40010,2024-05-22 11:06:50,26.38,26.38
1552315,aea4108997c66a74fbdae27b34b69fde,130,28800.0,2024-05-22 08:00:00,28436,2024-05-22 07:53:56,-6.07,-6.07
1438927,7cc0cb1871dfd558f11a2885c145d144,PH,56386.0,2024-05-22 15:39:46,57927,2024-05-22 16:05:27,25.68,25.68
1068138,43d8d305ee692724a532f30ea63a1cbe,90X,63113.0,2024-05-22 17:31:53,62764,2024-05-22 17:26:04,-5.82,-5.82
1240580,7cc0cb1871dfd558f11a2885c145d144,2,46353.0,2024-05-22 12:52:33,47964,2024-05-22 13:19:24,26.85,26.85
953507,1770249a5a2e770ca90628434d4934b1,4778,49080.0,2024-05-22 13:38:00,57640,2024-05-22 16:00:40,142.67,142.67


In [64]:
trips_routes_times4.loc[trips_routes_times4.delay_min <= -600][preview_cols].sample(10)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min,delay_min2
570090,0666caf3ec1ecc96b74f4477ee4bc939,40-13172,86280.0,2024-05-22 23:58:00,208,2024-05-23 00:03:28,-1434.53,5.47
1372299,7cc0cb1871dfd558f11a2885c145d144,5,85482.0,2024-05-22 23:44:42,4,2024-05-23 00:00:04,-1424.63,15.37
1408544,7cc0cb1871dfd558f11a2885c145d144,9,86004.0,2024-05-22 23:53:24,61,2024-05-23 00:01:01,-1432.38,7.62
1431550,7cc0cb1871dfd558f11a2885c145d144,M,86315.0,2024-05-22 23:58:35,595,2024-05-23 00:09:55,-1428.67,11.33
681290,0666caf3ec1ecc96b74f4477ee4bc939,660-13172,86220.0,2024-05-22 23:57:00,94,2024-05-23 00:01:34,-1435.43,4.57
278186,0666caf3ec1ecc96b74f4477ee4bc939,20-13172,85980.0,2024-05-22 23:53:00,296,2024-05-23 00:04:56,-1428.07,11.93
438484,0666caf3ec1ecc96b74f4477ee4bc939,246-13172,86340.0,2024-05-22 23:59:00,23,2024-05-23 00:00:23,-1438.62,1.38
2036292,baeeb157e85a901e47b828ef9fe75091,929,86280.0,2024-05-22 23:58:00,16146,2024-05-23 04:29:06,-1168.9,271.1
1219219,7cc0cb1871dfd558f11a2885c145d144,14,85444.0,2024-05-22 23:44:04,207,2024-05-23 00:03:27,-1420.62,19.38
270994,0666caf3ec1ecc96b74f4477ee4bc939,2-13172,86100.0,2024-05-22 23:55:00,123,2024-05-23 00:02:03,-1432.95,7.05


In [65]:
1437.63 / 60

23.960500000000003

In [66]:
trips_routes_times4.loc[trips_routes_times4.delay_min >= 600][preview_cols].sample(10)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,scheduled_arrival_sec,converted_schd_arrival,rt_arrival_sec,converted_rt_arrival,delay_min,delay_min2
759728,0666caf3ec1ecc96b74f4477ee4bc939,90-13172,86400.0,2024-05-22 00:00:00,86397,2024-05-21 23:59:57,1439.95,-0.05
1863138,baeeb157e85a901e47b828ef9fe75091,7,86460.0,2024-05-22 00:01:00,49970,2024-05-21 13:52:50,831.83,-608.17
1203762,7cc0cb1871dfd558f11a2885c145d144,1,86471.0,2024-05-22 00:01:11,86289,2024-05-21 23:58:09,1436.97,-3.03
1665131,baeeb157e85a901e47b828ef9fe75091,2,86400.0,2024-05-22 00:00:00,44138,2024-05-21 12:15:38,735.63,-704.37
1917530,baeeb157e85a901e47b828ef9fe75091,8,86400.0,2024-05-22 00:00:00,86390,2024-05-21 23:59:50,1439.83,-0.17
2487075,d9d0325e50e50064e3cc8384b1751d67,1,86417.0,2024-05-22 00:00:17,86158,2024-05-21 23:55:58,1435.68,-4.32
1049771,43d8d305ee692724a532f30ea63a1cbe,1,86421.0,2024-05-22 00:00:21,86346,2024-05-21 23:59:06,1438.75,-1.25
1675356,baeeb157e85a901e47b828ef9fe75091,201,86460.0,2024-05-22 00:01:00,41698,2024-05-21 11:34:58,693.97,-746.03
997943,239f3baf3dd3b9e9464f66a777f9897d,23,37501.0,2024-05-22 10:25:01,79130,2024-05-21 21:58:50,693.82,-746.18
1685183,baeeb157e85a901e47b828ef9fe75091,215,86400.0,2024-05-22 00:00:00,86373,2024-05-21 23:59:33,1439.55,-0.45


### For now, I will just throw away trips that have a `converted_rt_arrival` value are an hour before or after the `converted_schd_arrival`

In [67]:
len(
    trips_routes_times4.loc[
        (trips_routes_times4.delay_min2 >= -60) | (trips_routes_times4.delay_min2 <= 60)
    ]
)

3153981

In [68]:
trips_routes_times5 = trips_routes_times4.loc[
    (trips_routes_times4.delay_min2 >= -60) | (trips_routes_times4.delay_min2 <= 60)
]

In [69]:
(len(trips_routes_times4) - len(trips_routes_times5)) / len(trips_routes_times4)

0.004742490556987829

## Step 7: Calculate the actual headway on the `operator-route-direction_id-stop_sequence-stop_id-` grain

In [70]:
groupby_cols = [
    "caltrans_district",
    "schedule_gtfs_dataset_key",
    "feed_key",
    "organization_name",
    "route_id",
    "route_long_name",
    "route_type",
    "direction_id",
    "stop_id",
    "stop_sequence",
]

In [71]:
trips_routes_times5["actual_arrival_lag_min"] = (
    trips_routes_times5.groupby(groupby_cols)["converted_rt_arrival"]
    .diff()
    .dt.total_seconds()
    / 60
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  trips_routes_times5["actual_arrival_lag_min"] = (


### Check San Diego 

In [72]:
sd_test = trips_routes_times5.loc[
    (trips_routes_times5.organization_name == "San Diego Metropolitan Transit System")
    & (trips_routes_times5.route_id == "834")
]

In [73]:
cols_to_drop = [
    "service_date",
    "caltrans_district",
    "route_type",
    "route_id",
    "delay_min",
    "schedule_gtfs_dataset_key",
    "feed_key",
    "organization_name",
    "route_long_name",
    "trip_instance_key",
]

### Having trouble understanding headway vs. `actual_arrival_lag_min`: why is headway so much higher?

In [74]:
sd_test.drop(columns=cols_to_drop)

Unnamed: 0,direction_id,stop_id,stop_sequence,rt_arrival_sec,scheduled_arrival_sec,headway_minutes,converted_rt_arrival,converted_schd_arrival,delay_min2,actual_arrival_lag_min
1931084,0.0,40173,4,29344,28200.0,206.9,2024-05-22 08:09:04,2024-05-22 07:50:00,19.07,
1931087,0.0,40173,4,32750,31860.0,206.9,2024-05-22 09:05:50,2024-05-22 08:51:00,14.83,56.77
1931090,0.0,40173,4,49842,49920.0,206.9,2024-05-22 13:50:42,2024-05-22 13:52:00,-1.3,284.87
1931093,0.0,40173,4,53523,53520.0,206.9,2024-05-22 14:52:03,2024-05-22 14:52:00,0.05,61.35
1931096,0.0,40259,9,32164,32220.0,206.9,2024-05-22 08:56:04,2024-05-22 08:57:00,-0.93,
1931099,0.0,40259,9,35716,35820.0,206.9,2024-05-22 09:55:16,2024-05-22 09:57:00,-1.73,59.2
1931102,0.0,40382,8,32124,32160.0,206.9,2024-05-22 08:55:24,2024-05-22 08:56:00,-0.6,
1931105,0.0,40382,8,35668,35760.0,206.9,2024-05-22 09:54:28,2024-05-22 09:56:00,-1.53,59.07
1931108,0.0,40400,5,23956,24060.0,206.9,2024-05-22 06:39:16,2024-05-22 06:41:00,-1.73,
1931111,0.0,40400,5,28337,28260.0,206.9,2024-05-22 07:52:17,2024-05-22 07:51:00,1.28,73.02


## Step 8: Try MBTA - Massachusetts Bay Transportation Authority: 25% of scheduled headway 
* [Source](https://transitmatters.org/blog/reveal-mbtas-slowest-most-bunched-bus)
* [2024 Report](https://drive.google.com/file/d/1QFTVg0N3-uQeVoMqlOE6QLPqcoCtifzp/view?pli=1)
    * Taking a data-backed approach by relying on archival bus arrival and departure times from the MBTAʼs Open Data Portal and augmenting the data
      with route information from the MBTAʼs GTFS Feed,we adapted the methodology to reflect Bostonʼs
         unique transit characteristics as well as the post- COVID ridership dynamic to find bus speeds and bus bunching rates.
    * We limited this analysis to routes that had 500 or more daily riders, and only examined trips between 7am and 7pm on weekdays.
    * Adapted from NYC's analysis [here](https://www.nypirg.org/pubs/202311/Top_Ten_Best_Worst_in_NYC_Transit_2010-2019_FINAL.pdf)
    * To calculate the most bunched buses, we first
        defined a "bunch" as a bus that arrives within 25%
        of the scheduled headway of the bus in front of it.
        For example, if a bus is scheduled to arrive every
        10 minutes, a bus that arrives less than 2.5
        minutes a�er the bus in front of it is considered
        "bunched". We then looked at all time point events
        between 7am and 7pm on weekdays for each
        route. We matched each one to that dayʼs GTFS
        schedule to calculate the appropriate scheduled
        headway for that time of day and then calculated
        the total percent of departure events that met our
        bunching criteria. [here](https://static1.squarespace.com/static/533b9a24e4b01d79d0ae4376/t/6617ec40675223398aac12bf/1712843871514/TransitMatters-Bus-Bunching-Reports-Oct-2023)
    * They calculate it on the route level.
    * If a route has a bunching rate of 10% that means that every 1 out of 10 buses are
bunched. For a rider who does a round trip every day of the month, say 60 individual
trips, that means that the rider will experience bunching 6 times. (AH: how did they consider a trip to be bunched??)
    * Bunching typically worsens throughout a trip and
is most severe at the end of its route. However,
poor scheduling, dispatching, and operational
policy result in buses departing in a bunch, which
sets trips up for failure.
* [2023 Report](https://static1.squarespace.com/static/533b9a24e4b01d79d0ae4376/t/6617ec40675223398aac12bf/1712843871514/TransitMatters-Bus-Bunching-Reports-Oct-2023)
    * Analyzing bunching on a stop level: how many trips for a stop is bunched? 
    * Here, bunching is defined as headways < 25% of the scheduled_headway.
* 

In [75]:
transit_matters_df1 = trips_routes_times5.copy()

### Use the scheduled headway min instead of calculating it

In [76]:
transit_matters_df1["pct_actual_schd_headway"] = (
    transit_matters_df1.actual_arrival_lag_min / transit_matters_df1.headway_minutes
)

In [77]:
transit_matters_df1["pct_actual_schd_headway"].describe(percentiles)

count   2957621.00
mean          0.48
std           1.46
min         -83.03
1%            0.04
2%            0.07
5%            0.16
10%           0.22
50%           0.35
90%           0.75
95%           1.04
98%           1.66
99%           2.77
max          83.32
Name: pct_actual_schd_headway, dtype: float64

In [78]:
transit_matters_df1["bunched_y_n"] = np.where(
    transit_matters_df1["pct_actual_schd_headway"] < 0.25, "bunched", "not bunched"
)

In [79]:
transit_matters_df1.bunched_y_n.value_counts() / len(transit_matters_df1)

not bunched   0.86
bunched       0.14
Name: bunched_y_n, dtype: float64

In [80]:
transit_matters_df1.bunched_y_n.value_counts()

not bunched    2698128
bunched         455853
Name: bunched_y_n, dtype: int64

In [81]:
transit_matters_df1.loc[transit_matters_df1.pct_actual_schd_headway < 0.25].sample(3)

Unnamed: 0,service_date,caltrans_district,schedule_gtfs_dataset_key,feed_key,organization_name,route_long_name,route_type,route_id,direction_id,stop_id,stop_sequence,trip_instance_key,rt_arrival_sec,scheduled_arrival_sec,headway_minutes,converted_rt_arrival,converted_schd_arrival,delay_min,delay_min2,actual_arrival_lag_min,pct_actual_schd_headway,bunched_y_n
2929203,2024-05-22,07 - Los Angeles,f74424acf8c41e4c1e9fd42838c4875c,96358f776e5fcd8d2b6066507aed6645,Foothill Transit,Pomona- Industry- El Monte Station via V,Bus,194,1.0,3414,2496,14aad76fd70717b6875ffa8517b1f5cc,66975,66996.0,60.0,2024-05-22 18:36:15,2024-05-22 18:36:36,-0.35,-0.35,10.92,0.18,bunched
2450303,2024-05-22,07 - Los Angeles,cf0f7df88da36cd9ca4248eb1d6a0f39,da2b98d083fdde4f561bea811ad56b45,City of Culver City,Rapid Sepulveda Boulevard,Bus,6R,0.0,617,3,649747368edb9025f0c610f709448531,53996,53841.0,62.5,2024-05-22 14:59:56,2024-05-22 14:57:21,2.58,2.58,8.85,0.14,bunched
357068,2024-05-22,07 - Los Angeles,0666caf3ec1ecc96b74f4477ee4bc939,608992664173210532aa3e6cc573be2f,Los Angeles County Metropolitan Transportation Authority,Metro Local Line,Bus,212-13172,1.0,1564,54,0e0dba1a3d232b90fcb6a26672e27ec6,70789,70740.0,31.91,2024-05-22 19:39:49,2024-05-22 19:39:00,0.82,0.82,5.6,0.18,bunched


In [82]:
preview_cols.append("pct_actual_schd_headway")

### Observation: This bus is scheduled to arrive every 352 minutes (5 hours) but I can see based on the `rt_arrival_sec`, it comes much more frequently?

In [83]:
transit_matters_df1.loc[
    (
        transit_matters_df1.schedule_gtfs_dataset_key
        == "587e730fac4db21d54037e0f12b0dd5d"
    )
    & (transit_matters_df1.route_id == "606")
    & (transit_matters_df1.stop_id == "831414")
    & (transit_matters_df1.direction_id == 1)
    & (transit_matters_df1.stop_sequence == 20)
].drop(columns=cols_to_drop)

Unnamed: 0,direction_id,stop_id,stop_sequence,rt_arrival_sec,scheduled_arrival_sec,headway_minutes,converted_rt_arrival,converted_schd_arrival,delay_min2,actual_arrival_lag_min,pct_actual_schd_headway,bunched_y_n
1161083,1.0,831414,20,27276,27000.0,352.94,2024-05-22 07:34:36,2024-05-22 07:30:00,4.6,,,not bunched
1161084,1.0,831414,20,27298,27000.0,352.94,2024-05-22 07:34:58,2024-05-22 07:30:00,4.97,0.37,0.0,bunched
1161085,1.0,831414,20,28750,28740.0,352.94,2024-05-22 07:59:10,2024-05-22 07:59:00,0.17,24.2,0.07,bunched
1161086,1.0,831414,20,28785,28740.0,352.94,2024-05-22 07:59:45,2024-05-22 07:59:00,0.75,0.58,0.0,bunched


In [84]:
cols_to_drop

['service_date',
 'caltrans_district',
 'route_type',
 'route_id',
 'delay_min',
 'schedule_gtfs_dataset_key',
 'feed_key',
 'organization_name',
 'route_long_name',
 'trip_instance_key']

### Observation: `headway_minutes` is scheduled for every 29 minutes but the `converted_schd_arrival` seems to be a little more frequent?
* 15 minutes between rows 1442133 and 1442134. Likewise, 15 minutes between 1442135 and 1442136.

In [85]:
transit_matters_df1.loc[
    (
        transit_matters_df1.schedule_gtfs_dataset_key
        == "7cc0cb1871dfd558f11a2885c145d144"
    )
    & (transit_matters_df1.route_id == "T")
    & (transit_matters_df1.stop_id == "17346")
    & (transit_matters_df1.direction_id == 1)
    & (transit_matters_df1.stop_sequence == 17)
].drop(columns=cols_to_drop)

Unnamed: 0,direction_id,stop_id,stop_sequence,rt_arrival_sec,scheduled_arrival_sec,headway_minutes,converted_rt_arrival,converted_schd_arrival,delay_min2,actual_arrival_lag_min,pct_actual_schd_headway,bunched_y_n
1442132,1.0,17346,17,1075,86762.0,29.41,2024-05-22 00:17:55,2024-05-22 00:06:02,11.88,,,not bunched
1442133,1.0,17346,17,25262,24962.0,29.41,2024-05-22 07:01:02,2024-05-22 06:56:02,5.0,403.12,13.71,not bunched
1442134,1.0,17346,17,26210,25862.0,29.41,2024-05-22 07:16:50,2024-05-22 07:11:02,5.8,15.8,0.54,not bunched
1442135,1.0,17346,17,27037,26787.0,29.41,2024-05-22 07:30:37,2024-05-22 07:26:27,4.17,13.78,0.47,not bunched
1442136,1.0,17346,17,27434,27687.0,29.41,2024-05-22 07:37:14,2024-05-22 07:41:27,-4.22,6.62,0.22,bunched
1442137,1.0,17346,17,28885,28647.0,29.41,2024-05-22 08:01:25,2024-05-22 07:57:27,3.97,24.18,0.82,not bunched
1442138,1.0,17346,17,29341,29247.0,29.41,2024-05-22 08:09:01,2024-05-22 08:07:27,1.57,7.6,0.26,not bunched
1442139,1.0,17346,17,29850,29847.0,29.41,2024-05-22 08:17:30,2024-05-22 08:17:27,0.05,8.48,0.29,not bunched
1442140,1.0,17346,17,31195,31047.0,29.41,2024-05-22 08:39:55,2024-05-22 08:37:27,2.47,22.42,0.76,not bunched
1442141,1.0,17346,17,32364,31707.0,29.41,2024-05-22 08:59:24,2024-05-22 08:48:27,10.95,19.48,0.66,not bunched


## Step 9 Transit Matters: 2 minute benchmark
* [Source](https://static1.squarespace.com/static/533b9a24e4b01d79d0ae4376/t/645e82de1f570b31497c44dc/1683915486889/TransitMatters-Headwaymanagement.pdf)
* Justifying the use of
headway maintenance. For example, in April
2022 the 66 bus significantly bunched around
several stops. <b>When bunching is defined as
buses that run within two minutes or less of
each other</b>, inbound buses towards Nubian
Square bunched 10% of the time at Brigham
Circle, 9% at Brookline Village and Roxbury
Crossing, and 8% of the time at Coolidge
Corner. Bunching is even more dramatic
outbound towards Harvard Square where
buses bunched over 35% of the time at Winship
St, 13% at Coolidge Corner and Harvard Ave at
Commonwealth Ave, and 12% at North Harvard
St at Western Ave. View more data about bus
bunching through the TransitMatters Data
Dashboard here.


In [86]:
two_minutes_df = trips_routes_times5.copy()

In [89]:
len(two_minutes_df.loc[two_minutes_df["actual_arrival_lag_min"] < 0])

1709

### Added my own condition here to not tag any rows below 0, even though the literature didn't mention this.

In [90]:
two_minutes_df["bunched_y_n"] = np.where(
    (two_minutes_df["actual_arrival_lag_min"] > 0)
    & (two_minutes_df["actual_arrival_lag_min"] <= 2),
    "bunched",
    "not bunched",
)

In [91]:
two_minutes_df.bunched_y_n.value_counts() / len(two_minutes_df)

not bunched   0.99
bunched       0.01
Name: bunched_y_n, dtype: float64