## Update `trips`
* Setup: `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for rt_vs_sched metrics; do not merge in schedule data until the gtfs_digest report. Only bring in the `schedule_gtfs_dataset_key` column.
    * `vp_usable` + route_id-direction_id for trips also present in schedule. If not in schedule, fill it with route_id = Unknown and direction_id as Int64
    * Add a function to concatenate trip files, so we can pass in 1 day or 7 days for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989
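
The fill rule from the PR notes (route_id = Unknown, direction_id as Int64 for trips missing from the schedule) could look roughly like the sketch below. The toy tables and the helper name `attach_route_direction` are assumptions for illustration, not code from the PR.

```python
import pandas as pd

# Toy stand-in for vp_usable trips; values are illustrative only.
vp_trips = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["a", "a", "b"],
        "trip_instance_key": ["t1", "t2", "t3"],
    }
)

# Toy scheduled trips; trip t3 is deliberately absent from the schedule.
sched_trips = pd.DataFrame(
    {
        "trip_instance_key": ["t1", "t2"],
        "route_id": ["51", "51"],
        "direction_id": [0.0, None],
    }
)


def attach_route_direction(vp: pd.DataFrame, sched: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: left-merge schedule columns onto vp trips."""
    merged = vp.merge(sched, on="trip_instance_key", how="left")
    return merged.assign(
        # Trips not found in the schedule get a placeholder route.
        route_id=merged.route_id.fillna("Unknown"),
        # Nullable integer keeps missing directions as <NA> instead of 0.
        direction_id=merged.direction_id.astype("Int64"),
    )


trips_with_route = attach_route_direction(vp_trips, sched_trips)
```

Using the nullable `Int64` dtype keeps a missing direction distinguishable from a real `direction_id` of 0.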

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce direction ID to Int64; don't fill it in with 0. A missing value is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up. Currently we take the average of per-trip averages, which isn't accurate.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to step 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the % 
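
A small worked example of why the route-level notes prefer summing totals over averaging per-trip averages. The two trips below are made-up numbers chosen to make the gap obvious.

```python
import pandas as pd

# Two toy trips on one route: a long trip with dense pings, a short sparse one.
trips = pd.DataFrame(
    {
        "route_id": ["51", "51"],
        "total_pings": [300, 10],
        "rt_service_min": [100, 10],
    }
)
trips["pings_per_min"] = trips.total_pings / trips.rt_service_min

# Average of averages: every trip counts equally, regardless of length.
avg_of_avgs = trips.groupby("route_id").pings_per_min.mean()

# Ratio of sums: total the numerator and denominator first, then divide.
totals = trips.groupby("route_id")[["total_pings", "rt_service_min"]].sum()
ratio_of_sums = totals.total_pings / totals.rt_service_min
```

Here the average of averages is (3.0 + 1.0) / 2 = 2.0 pings per minute, while the ratio of sums is 310 / 110, about 2.82, because the long trip carries more weight.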
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * `add_metrics` looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in `segment_speed_utils`.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. `total_pings_for_trip` is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and the metrics all use generic names that are suitable for passing through aggregation functions.
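
One way the two requests above (integer coercion early, percent scaling only right before charting) might be split into separate helpers. The function names `coerce_int_cols` and `scale_pct_for_display` are hypothetical, and the toy values are illustrative.

```python
import pandas as pd


def coerce_int_cols(df: pd.DataFrame, int_cols: list) -> pd.DataFrame:
    """Hypothetical helper: return a copy with the columns as nullable integers."""
    return df.astype({c: "Int64" for c in int_cols})


def scale_pct_for_display(df: pd.DataFrame, pct_cols: list) -> pd.DataFrame:
    """Hypothetical helper: return a copy with 0-1 proportions scaled to 0-100."""
    out = df.copy()
    out[pct_cols] = out[pct_cols] * 100
    return out


# Toy trip table; counts arrive as floats because of a missing value.
toy = pd.DataFrame(
    {
        "total_vp": [385.0, None],
        "vp_in_shape": [385.0, 12.0],
        "spatial_accuracy_pct": [1.0, 0.5],
    }
)

# Stored table keeps proportions on the 0-1 scale.
stored = coerce_int_cols(toy, ["total_vp", "vp_in_shape"])

# Only the chart-ready copy is scaled to 0-100.
chart_ready = scale_pct_for_display(stored, ["spatial_accuracy_pct"])
```

Both helpers return copies, so the stored 0-1 values stay available for further aggregation.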

In [78]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [79]:
# Times
import datetime

from loguru import logger

In [80]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [81]:
SCHED_GCS

'gs://calitp-analytics-data/data-analyses/gtfs_schedule/'

In [82]:
# analysis_date = rt_dates.DATES["dec2023"]

In [83]:
RT_SCHED_GCS

'gs://calitp-analytics-data/data-analyses/rt_vs_schedule/'

In [84]:
# rt_dates.DATES

### Routes - New Work

In [85]:
routes_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_route_dir/route_direction_metrics/trip_2023_09_13_to_2023_10_11.parquet"
)

In [86]:
routes_df.sample()

Unnamed: 0,time_period,schedule_gtfs_dataset_key,route_id,direction_id,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
2224,all_day,cb3074eb8b423dfc5acfeeb0de95eb82,51,1,6413,6609,12244,5399.0,12244,12217,110,1.85,1.0,0.97,0.22


### Trips - New Work

In [87]:
trips_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_trip/trip_metrics/trip_2023-09-13.parquet"
)

In [88]:
trips_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,time_of_day,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
19877,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,11865eb5cdbfe53529e6dd687a858816,129,128,385,129,385,385,108-13168,0,vp_sched,107.0,PM Peak,peak,2.98,1.0,1.0,0.21


### Check Merges

In [89]:
analysis_date = ["2023-09-13"]

In [90]:
route_time_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "time_period",
]


def concatenate_schedule_by_route_direction(date_list: list) -> pd.DataFrame:
    """
    Concatenate schedule data that's been
    aggregated to route-direction-time_period.
    """
    df = pd.concat(
        [
            pd.read_parquet(
                f"{RT_SCHED_GCS}schedule_route_dir/"
                f"schedule_route_direction_metrics_{d}.parquet",
                columns=route_time_cols
                + [
                    "avg_sched_service_min",
                    "avg_stop_meters",
                    "n_trips",
                    "frequency",
                ],
            )
            .assign(service_date=pd.to_datetime(d))
            .astype({"direction_id": "Int64"})
            for d in date_list
        ],
        axis=0,
        ignore_index=True,
    )

    return df

In [91]:
import geopandas as gpd

In [92]:
def concatenate_speeds_by_route_direction(date_list: list) -> pd.DataFrame:
    """
    Concatenate speeds that have been
    aggregated to route-direction-time_period.
    """
    df = pd.concat(
        [
            pd.read_parquet(
                f"{SEGMENT_GCS}rollup_singleday/speeds_route_dir_{d}.parquet",
                columns=route_time_cols + ["speed_mph"],
            )
            .assign(service_date=pd.to_datetime(d))
            .astype({"direction_id": "Int64"})
            for d in date_list
        ],
        axis=0,
        ignore_index=True,
    )

    return df

#### concatenate_schedule_by_route_direction

In [93]:
df1 = concatenate_schedule_by_route_direction(analysis_date)

In [94]:
df1.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_sched_service_min,avg_stop_meters,n_trips,frequency,service_date
809,e681c3a8dafa2c80e5b8e2cdd01f917a,3,0,peak,47.87,543.23,14,1.75,2023-09-13


In [95]:
routes_df.time_period.value_counts()

all_day    3174
peak       2981
offpeak    2729
Name: time_period, dtype: int64

In [96]:
m_cols = ["schedule_gtfs_dataset_key", "route_id", "direction_id", "time_period"]

In [97]:
m1 = pd.merge(df1, routes_df, on=m_cols, how="outer", indicator=True)

In [98]:
961 / len(m1)

0.08719716904092188

In [99]:
2137 / len(m1)

0.19390254967788767

In [100]:
7923 / len(m1)

0.7189002812811904

In [101]:
m1._merge.value_counts()

both          7923
left_only     2137
right_only     961
Name: _merge, dtype: int64
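
The three manual divisions above can also be produced in one call with `value_counts(normalize=True)`. A minimal sketch on toy frames (keys and rows are made up; only the merge columns mirror `m_cols`):

```python
import pandas as pd

merge_cols = ["schedule_gtfs_dataset_key", "route_id", "direction_id", "time_period"]

# Toy schedule-side rows.
left = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["a", "a", "b"],
        "route_id": ["1", "2", "1"],
        "direction_id": [0, 0, 0],
        "time_period": ["peak", "peak", "peak"],
    }
)

# Toy RT-side rows: one matches the left table, one does not.
right = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["a", "c"],
        "route_id": ["1", "9"],
        "direction_id": [0, 1],
        "time_period": ["peak", "peak"],
    }
)

merged = pd.merge(left, right, on=merge_cols, how="outer", indicator=True)

# Share of rows in each merge category, summing to 1.
shares = merged["_merge"].value_counts(normalize=True)
```

In this toy case `shares` gives 0.5 for `left_only` and 0.25 each for `both` and `right_only`.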

In [102]:
m1.loc[m1._merge == "right_only"].sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_sched_service_min,avg_stop_meters,n_trips_x,frequency,service_date,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips_y,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct,_merge
10643,cc53a0dbf5df90e3009b9cb5d89d80ba,870,1,peak,,,,,NaT,771.0,789,2251.0,676.0,0,0,26.0,2.85,,0.98,0.17,right_only


In [103]:
right_only_m1 = m1.loc[m1._merge == "right_only"]

In [104]:
right_only_m1.time_period.value_counts()

all_day    348
offpeak    310
peak       303
Name: time_period, dtype: int64

#### concatenate_speeds_by_route_direction

In [105]:
df2 = concatenate_speeds_by_route_direction(analysis_date)

In [106]:
df2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,speed_mph,service_date
6406,95cb514215c61ca578b01d885f35ec0a,10933,0,offpeak,12.93,2023-09-13


In [107]:
routes_df.sample()

Unnamed: 0,time_period,schedule_gtfs_dataset_key,route_id,direction_id,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
5190,peak,e8d0fd2f1c4b13707a24909a0f206271,16611,0,390,659,909,218.0,909,905,26,1.38,1.0,0.59,2.02


In [108]:
m2 = pd.merge(df2, routes_df, on=m_cols, how="outer", indicator=True)

In [109]:
m2._merge.value_counts()

both          7637
right_only    1247
left_only        0
Name: _merge, dtype: int64

In [110]:
1247 / len(m2)

0.14036470058532194

In [111]:
right_only_m2 = m2.loc[m2._merge == "right_only"]

In [112]:
right_only_m2.time_period.value_counts()

all_day    426
offpeak    417
peak       404
Name: time_period, dtype: int64

In [113]:
right_only_m2.schedule_gtfs_dataset_key.value_counts().head()

7cc0cb1871dfd558f11a2885c145d144    388
d2b09fbd392b28d767c28ea26529b0cd     50
8fa3380c9291d3694494c34b014642d0     48
1770249a5a2e770ca90628434d4934b1     46
43d8d305ee692724a532f30ea63a1cbe     42
Name: schedule_gtfs_dataset_key, dtype: int64

In [114]:
analysis_date

['2023-09-13']

#### Where's the missing operator?

In [115]:
crosswalk = pd.read_parquet(
    f"gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_{analysis_date[0]}.parquet"
)[["organization_name", "schedule_gtfs_dataset_key"]]

In [121]:
missing_op = "7cc0cb1871dfd558f11a2885c145d144"

In [180]:
crosswalk.loc[crosswalk.schedule_gtfs_dataset_key == missing_op]

Unnamed: 0,organization_name,schedule_gtfs_dataset_key


In [181]:
crosswalk.loc[crosswalk.organization_name.str.contains('Muni')].drop_duplicates()

Unnamed: 0,organization_name,schedule_gtfs_dataset_key


In [182]:
crosswalk.sort_values(by = ['organization_name'])

Unnamed: 0,organization_name,schedule_gtfs_dataset_key
39,Alameda-Contra Costa Transit District,c499f905e33929a641f083dad55c521e
98,Amador Regional Transit System,36b8fbf12e4adc76b21651462b200860
24,Amtrak,b9473e19aebf7ee2ec18623eb35762a1
146,Anaheim Transportation Network,b7a6cd6a1a06406c35fa9abd16ad9754
40,Angel Island-Tiburon Ferry Company,04d1db905ac689e17a97ce414cf393a6
29,Antelope Valley Transit Authority,e681c3a8dafa2c80e5b8e2cdd01f917a
100,Banning Pass Transit,bc039937fdadd173bd3c3edc03b7a9c9
141,Basin Transit,b0760015c9fcd0500c4fddd5b9bb115b
149,Blue Lake Rancheria,6693efa56a541b6276da9b424f78a170
120,Butte County Association of Governments,f1cc580313b37ae0f853b2e469b27228


In [159]:
m2_ops = pd.merge(
    right_only_m2[["schedule_gtfs_dataset_key"]],
    crosswalk,
    on=["schedule_gtfs_dataset_key"],
    how="left",
    indicator=True,
).drop_duplicates()

In [157]:
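# na=False treats keys without an organization name (NaN after the left merge) as non-matches
m2_ops.loc[m2_ops.organization_name.str.contains("Muni", na=False)]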

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,_merge
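
Since the left merge carries `indicator=True`, the keys with no crosswalk entry can be listed directly by filtering on `_merge`. A sketch on toy data (keys and operator names are placeholders, not real feeds):

```python
import pandas as pd

# Toy keys from a metrics table, with duplicates.
metrics_keys = pd.DataFrame({"schedule_gtfs_dataset_key": ["k1", "k1", "k2", "k3"]})

# Toy crosswalk that is missing k3.
crosswalk_toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator B"],
        "schedule_gtfs_dataset_key": ["k1", "k2"],
    }
)

checked = metrics_keys.drop_duplicates().merge(
    crosswalk_toy,
    on="schedule_gtfs_dataset_key",
    how="left",
    indicator=True,
)

# Keys present in the metrics table but absent from the crosswalk.
unmatched = checked.loc[
    checked["_merge"] == "left_only", "schedule_gtfs_dataset_key"
].tolist()

# Name searches need na=False because unmatched keys have NaN names.
has_name = checked.organization_name.str.contains("Operator", na=False)
```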


In [120]:
#m2_ops.sort_values(
#    by=["schedule_gtfs_dataset_key"]
#)

In [125]:
import siuba  # need this to do type hint in functions
from calitp_data_analysis import geography_utils
from calitp_data_analysis.tables import tbls
from shared_utils import schedule_rt_utils
from siuba import *

In [137]:
#Go to RELATION https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.dim_gtfs_service_data
#cal-itp-data-infra.mart_transit_database.dim_gtfs_service_data
# dim = tbls.mart_transit_database.dim_gtfs_service_data()

In [139]:
dim_op = (
        tbls.mart_transit_database.dim_gtfs_service_data()
        >> collect()
    )

In [147]:
dim_op._is_current.unique()

array([ True, False])

In [149]:
dim_op2 = dim_op.loc[dim_op._is_current == True].reset_index()

In [150]:
dim_op2.loc[dim_op2.gtfs_dataset_key == missing_op].name.unique()

array(['Muni Metro Rail – Bay Area 511 Muni Schedule',
       'Muni Bus – Bay Area 511 Muni Schedule'], dtype=object)

In [151]:
dim_op2.loc[dim_op2.name.str.contains('Muni')][['name', 'gtfs_dataset_key']].drop_duplicates()

Unnamed: 0,name,gtfs_dataset_key
22,Muni Bus – Bay Area 511 Muni Alerts,493982509e23e226eb4f05734ea5df96
58,Muni Metro Rail – Bay Area 511 Muni Schedule,7cc0cb1871dfd558f11a2885c145d144
74,Muni Metro Rail – Bay Area 511 Muni TripUpdates,82c194d8ddf7000de7ff3b5ef170eec8
105,Muni Metro Rail – Bay Area 511 Muni VehiclePositions,c0e3039da063db95ebabd3fe4ee611a4
277,Muni Bus – Bay Area 511 Regional Alerts,3a30fa95eb31aeb301ff0530c3291285
294,Muni Bus – Bay Area 511 Muni TripUpdates,82c194d8ddf7000de7ff3b5ef170eec8
332,Muni Bus – Bay Area 511 Regional TripUpdates,33c75a7d2149bcb9d5e6af2bde4d8d96
349,Muni Bus – Bay Area 511 Muni VehiclePositions,c0e3039da063db95ebabd3fe4ee611a4
359,Commerce Municipal Bus Lines – Commerce Alerts,f2e6877d0e3aa3cc7f0156b758171df0
364,Muni Metro Rail – Bay Area 511 Regional Alerts,3a30fa95eb31aeb301ff0530c3291285


#### Missing operator is Muni, but that check used the merged `routes_df`; check the original

In [167]:
routes_df_subset = routes_df[['schedule_gtfs_dataset_key']].drop_duplicates()

In [168]:
crosswalk.sample()

Unnamed: 0,organization_name,schedule_gtfs_dataset_key
118,City of Tracy,8ef0af704a3d9932653cb7a39e74ea28


In [176]:
missing_op

'7cc0cb1871dfd558f11a2885c145d144'

In [174]:
routes_to_op_m1 = pd.merge(routes_df_subset, dim_op2, left_on = ['schedule_gtfs_dataset_key'], right_on = ['gtfs_dataset_key'], how = 'inner')

In [175]:
routes_to_op_m1.loc[routes_to_op_m1.name.str.contains('Muni')]

Unnamed: 0,schedule_gtfs_dataset_key,index,key,name,source_record_id,service_key,gtfs_dataset_key,customer_facing,category,fares_v2_status,manual_check__fixed_route_completeness,manual_check__demand_response_completeness,_is_current,_valid_from,_valid_to
49,7cc0cb1871dfd558f11a2885c145d144,2430,eea564c5cc438b604942c2f5fb38fb8d,Muni Metro Rail – Bay Area 511 Muni Schedule,recuts4YnSouQRYoG,487ba6aecb9c11d2b6c04756bd073147,7cc0cb1871dfd558f11a2885c145d144,False,,[],,,True,2023-10-26 00:00:00+00:00,2098-12-31 23:59:59.999999+00:00
50,7cc0cb1871dfd558f11a2885c145d144,15311,b2bc5ba67fe2697e2c0a4d1be361204e,Muni Bus – Bay Area 511 Muni Schedule,recuPAVqdOHkzMKNJ,9d79cc1f1915029d63c3a25924f673c2,7cc0cb1871dfd558f11a2885c145d144,False,precursor,[],Unknown,Unknown,True,2023-10-26 00:00:00+00:00,2098-12-31 23:59:59.999999+00:00
82,eaabdf2b0bb899b7953ea81047fdd00d,18664,e1a867a65bacc2f35ed7a30cdb964b05,Commerce Municipal Bus Lines – Commerce Schedule,reclEMvBL84Cd9h28,7df880687fa8aef0cddeefecbe9277bd,eaabdf2b0bb899b7953ea81047fdd00d,True,primary,[Blocked - Vendor],Complete,Unknown,True,2023-10-26 00:00:00+00:00,2098-12-31 23:59:59.999999+00:00


In [None]:
# routes_to_op_m1 = pd.merge(routes_df_subset, dim_op2, left_on = ['schedule_gtfs_dataset_key'], right_on = ['gtfs_dataset_key'], how = 'outer')

### Cleaning Function

In [None]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [None]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [None]:
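def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    """
    Prepare a metrics table for display: scale 0-1 proportions to 0-100,
    round the minute columns, and title-case the column names.
    """
    # Work on a copy so the caller's dataframe (e.g. routes_df) is not modified
    df = df.copy()
    for c in pct_cols:
        df[c] = df[c] * 100
    for c in int_cols:
        df[c] = df[c].fillna(0).round()

    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df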

In [None]:
routes_df2 = clean_df(routes_df, pct_cols, int_cols)

In [None]:
routes_df2["Time Period"].value_counts()

### Check time of day 
* https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py#L135-L163

In [None]:
analysis_date = "2023-10-11"

In [None]:
trips = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_trip/trip_metrics/trip_2023-10-11.parquet"
)

In [None]:
trips.head()

In [None]:
trips.peak_offpeak.unique()

In [None]:
trips.time_of_day.value_counts()

In [None]:
trips.peak_offpeak.value_counts()

In [None]:
roll_singleday_route_dir_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_route_dir_2023-10-15.parquet"
)

In [None]:
roll_singleday_route_dir_df.time_period.value_counts()

In [None]:
roll_singleday_route_dir_df.sample()

In [None]:
roll_singleday_speeds_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_trip_2023-10-15.parquet"
)

In [None]:
roll_singleday_speeds_df.sample()

#### Fix Time of Day #1

In [None]:
trips.columns

In [None]:
trips2 = trips.drop(
    columns=[
        "peak_offpeak",
        "pings_per_min",
        "spatial_accuracy_pct",
        "rt_w_gtfs_pct",
        "rt_v_scheduled_time_pct",
    ]
)

In [None]:
trips2.head()

#### Fix Time of Day #2

In [None]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add normalized metrics (all on a 0-1 scale or per minute)
    from the summed numerator / denominator columns.
    """
    df["pings_per_min"] = df.total_pings / df.rt_service_min
    df["spatial_accuracy_pct"] = df.vp_in_shape / df.total_vp
    df["rt_w_gtfs_pct"] = df.total_min_w_gtfs / df.rt_service_min
    df["rt_v_scheduled_time_pct"] = df.rt_service_min / df.service_minutes - 1

    # Cap rt_w_gtfs_pct at 1 for any values above 100%
    df.rt_w_gtfs_pct = df.rt_w_gtfs_pct.mask(df.rt_w_gtfs_pct > 1, 1)

    return df
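
A quick check of the formulas against the sampled trip shown in the `trips_df.sample()` output earlier (385 pings, 129 RT minutes, 107 scheduled minutes). The function is restated here, in a copy-returning form, only so the block runs on its own; the second row is an illustrative value added to show the cap.

```python
import pandas as pd


def add_metrics_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Same formulas as add_metrics, applied to a copy of the input."""
    out = df.copy()
    out["pings_per_min"] = out.total_pings / out.rt_service_min
    out["spatial_accuracy_pct"] = out.vp_in_shape / out.total_vp
    out["rt_w_gtfs_pct"] = out.total_min_w_gtfs / out.rt_service_min
    out["rt_v_scheduled_time_pct"] = out.rt_service_min / out.service_minutes - 1
    # Cap the share of RT minutes with GTFS data at 1.
    out["rt_w_gtfs_pct"] = out.rt_w_gtfs_pct.mask(out.rt_w_gtfs_pct > 1, 1)
    return out


sample = pd.DataFrame(
    {
        # Row 0: the sampled trip from the notebook output.
        # Row 1: illustrative only, with more GTFS minutes than RT minutes.
        "total_pings": [385, 385],
        "rt_service_min": [129, 129],
        "total_min_w_gtfs": [129, 140],
        "total_vp": [385, 385],
        "vp_in_shape": [385, 385],
        "service_minutes": [107.0, 107.0],
    }
)

checked_metrics = add_metrics_copy(sample)
```

For the first row this gives 385 / 129, about 2.98 pings per minute, and 129 / 107 - 1, about 0.21, matching the displayed sample. The second row's `rt_w_gtfs_pct` is capped at 1.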

* Yeah, so actually, I use the column `peak_offpeak` and I do not filter, but pass `peak_offpeak` in the grouping cols (route-direction-peak_offpeak). Then I do it again without `peak_offpeak` in the group cols and set `peak_offpeak = all_day`. Then I concatenate.
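
The two-pass grouping described in the note, sketched on a toy table (one route, made-up ping counts) before it is applied to the real trip files:

```python
import pandas as pd

# Toy trips: two peak trips and one offpeak trip on the same route-direction.
toy_trips = pd.DataFrame(
    {
        "route_id": ["51", "51", "51"],
        "direction_id": [0, 0, 0],
        "peak_offpeak": ["peak", "peak", "offpeak"],
        "total_pings": [100, 50, 30],
    }
)

group_cols = ["route_id", "direction_id"]

# Pass 1: no filtering, peak_offpeak is simply part of the grouping columns.
by_period = (
    toy_trips.groupby(group_cols + ["peak_offpeak"])
    .agg({"total_pings": "sum"})
    .reset_index()
    .rename(columns={"peak_offpeak": "time_period"})
)

# Pass 2: group without peak_offpeak and label the result all_day.
all_day = (
    toy_trips.groupby(group_cols)
    .agg({"total_pings": "sum"})
    .reset_index()
    .assign(time_period="all_day")
)

# Stack the two passes into one route-direction-time_period table.
combined = pd.concat([by_period, all_day], ignore_index=True)
```

The all-day row (180 pings) equals the sum of the peak (150) and offpeak (30) rows, which is a handy sanity check on the real data as well.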

In [None]:
route_months = ["sep", "oct"]

route_analysis_date_list = [rt_dates.DATES[f"{m}2023"] for m in route_months]

In [None]:
def concatenate_trip_segment_speeds(analysis_date_list: list) -> pd.DataFrame:
    """
    Concatenate the trip parquets together,
    whether it's for single day or multi-day averages.
    """
    TRIP_EXPORT = "vp_trip/trip_metrics"
    df = pd.concat(
        [
            pd.read_parquet(
                f"{RT_SCHED_GCS}{TRIP_EXPORT}/trip_{analysis_date}.parquet"
            ).assign(service_date=pd.to_datetime(analysis_date))
            for analysis_date in analysis_date_list
        ],
        axis=0,
        ignore_index=True,
    )
    return df

In [None]:
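def route_metrics(analysis_date_list: list) -> pd.DataFrame:
    """
    Aggregate trip metrics to route-direction-time_period
    (peak / offpeak / all_day) over one or more service dates.
    """
    df = concatenate_trip_segment_speeds(analysis_date_list)

    # Drop trip-level metrics; they are recomputed from the summed totals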
    del_cols = [
        "pings_per_min",
        "spatial_accuracy_pct",
        "rt_w_gtfs_pct",
        "rt_v_scheduled_time_pct",
    ]

    df = df.drop(columns=del_cols)

    # Add weighted metrics
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]

    count_cols = ["trip_instance_key"]

    all_day_groups = [
        "schedule_gtfs_dataset_key",
        "route_id",
        "direction_id",
    ]

    all_day_df = (
        df.groupby(all_day_groups)
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    all_day_df = all_day_df.rename(columns={"trip_instance_key": "n_trips"})
    all_day_df = add_metrics(all_day_df)
    all_day_df["time_period"] = "all_day"

    peak_groups = ["peak_offpeak"] + all_day_groups
    peak_df = (
        df.groupby(peak_groups)
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    peak_df = peak_df.rename(
        columns={"trip_instance_key": "n_trips", "peak_offpeak": "time_period"}
    )
    peak_df = add_metrics(peak_df)

    final_df = pd.concat([peak_df, all_day_df])

    # Save
    # analysis_date_file = generate_date(analysis_date_list)
    # ROUTE_EXPORT = CONFIG_DICT["route_direction_metrics"]
    # final_df.to_parquet(f"{RT_SCHED_GCS}{ROUTE_EXPORT}/trip_{analysis_date_file}.parquet")

    return final_df
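
The commented-out save step relies on a `generate_date` helper that is not shown in this notebook. As a guess at what such a helper might do, the sketch below builds a suffix in the style of the existing file names (`trip_2023-09-13.parquet` for a single day, `trip_2023_09_13_to_2023_10_11.parquet` for a range). The name `date_range_suffix` and its behavior are assumptions.

```python
def date_range_suffix(date_list: list) -> str:
    """
    Hypothetical stand-in for generate_date: build a file-name suffix
    from a list of ISO date strings (YYYY-MM-DD).
    """
    if not date_list:
        raise ValueError("date_list must contain at least one date")

    # ISO date strings sort chronologically when sorted as text.
    dates = sorted(date_list)

    if len(dates) == 1:
        # Single-day files keep the hyphenated date.
        return dates[0]

    # Multi-day files use first_to_last with underscores.
    return f"{dates[0]}_to_{dates[-1]}".replace("-", "_")


multi_day_suffix = date_range_suffix(["2023-10-11", "2023-09-13"])
single_day_suffix = date_range_suffix(["2023-09-13"])
```

Putting the suffix at the end of the file name follows the earlier note about saving files with the analysis date at the end rather than the beginning.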

In [None]:
all_routes2 = route_metrics(route_analysis_date_list)

In [None]:
all_routes2.head()

In [None]:
all_routes2.time_period.value_counts()