## Update `trips`
* `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * vp_usable is the source data for rt_vs_sched metrics; do not merge in schedule data until the gtfs_digest report. Only bring in the schedule_gtfs_dataset_key column.
    * Attach route_id and direction_id to vp_usable for trips that are also present in the schedule. For trips not in the schedule, fill route_id with "Unknown" and keep direction_id as Int64.
    * Add a function to concatenate trip files, so that 1 day or 7 days can be passed in for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
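
The fill rule above for trips that are missing from the schedule could look roughly like the sketch below. The two toy dataframes and the function name `attach_route_direction` are assumptions made for illustration; only the column names come from these notes.

```python
import pandas as pd

# Hypothetical toy data: three trips with vehicle positions,
# only two of which also appear in the schedule.
vp_trips = pd.DataFrame(
    {"trip_instance_key": ["a", "b", "c"], "total_vp": [190, 403, 28]}
)
sched_trips = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b"],
        "route_id": ["5R", "210-13168"],
        "direction_id": [0, 1],
    }
)


def attach_route_direction(vp: pd.DataFrame, sched: pd.DataFrame) -> pd.DataFrame:
    """Left-merge schedule columns onto vp trips (hypothetical helper)."""
    merged = vp.merge(sched, on="trip_instance_key", how="left")
    return merged.assign(
        # Trips not found in the schedule get a placeholder route
        route_id=merged.route_id.fillna("Unknown"),
        # Nullable integer dtype keeps missing directions as <NA>, not 0
        direction_id=merged.direction_id.astype("Int64"),
    )


result = attach_route_direction(vp_trips, sched_trips)
```

Using the nullable `Int64` dtype means the unmatched trip keeps a missing direction instead of being counted as direction 0.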
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce direction id to Int64; don't fill it in with 0. A missing value is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up. Currently, just taking the average of an average isn't really accurate.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do steps up until row 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the % columns
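
To illustrate why the route-level note asks for totals rather than an average of averages, here is a small sketch with made-up numbers. The column names match the trip table, but the two trips are hypothetical.

```python
import pandas as pd

# Two hypothetical trips on the same route: one short, one long
trips = pd.DataFrame(
    {
        "route_id": ["49", "49"],
        "total_pings": [30, 300],
        "rt_service_min": [30, 100],
    }
)
trips["pings_per_min"] = trips.total_pings / trips.rt_service_min

# Average of the trip-level averages: each trip counts equally
avg_of_avg = trips.groupby("route_id").pings_per_min.mean()

# Sum numerator and denominator first, then divide:
# longer trips carry more weight
totals = trips.groupby("route_id")[["total_pings", "rt_service_min"]].sum()
weighted = totals.total_pings / totals.rt_service_min
```

Here `avg_of_avg` is (1.0 + 3.0) / 2, while `weighted` is 330 / 130, so the two approaches disagree whenever trips differ in length.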
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * add_metrics looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then, before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in segment_speed_utils.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. total_pings_for_trip is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and have the metrics all use generic names that are suitable for passing through aggregation functions.
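
One possible shape for the two tweaks in the 2/13 notes (keep proportions at 0-1 until charting, and coerce count columns to integers early) is sketched below. The function names `scale_pct_for_display` and `coerce_to_nullable_int` are hypothetical and are not existing functions in segment_speed_utils.

```python
import pandas as pd


def scale_pct_for_display(df: pd.DataFrame, pct_cols: list) -> pd.DataFrame:
    """Return a copy with 0-1 proportions scaled to 0-100 (hypothetical helper)."""
    out = df.copy()
    out[pct_cols] = out[pct_cols] * 100
    return out


def coerce_to_nullable_int(df: pd.DataFrame, int_cols: list) -> pd.DataFrame:
    """Return a copy with count columns stored as nullable Int64 (hypothetical helper)."""
    out = df.copy()
    # Round first so values like 190.0 convert cleanly; NaN becomes <NA>
    out[int_cols] = out[int_cols].round().astype("Int64")
    return out


# Made-up rows for illustration only
example = pd.DataFrame(
    {"total_vp": [190.0, None], "spatial_accuracy_pct": [0.5, 1.0]}
)
as_int = coerce_to_nullable_int(example, ["total_vp"])
for_chart = scale_pct_for_display(as_int, ["spatial_accuracy_pct"])
```

Both helpers return copies, so the 0-1 table stays available for further aggregation after a display version has been produced.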

In [25]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [2]:
# Times
import datetime

from loguru import logger

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
# analysis_date = rt_dates.DATES["dec2023"]

In [5]:
RT_SCHED_GCS

'gs://calitp-analytics-data/data-analyses/rt_vs_schedule/'

In [6]:
rt_dates.DATES

{'feb2022': '2022-02-08',
 'mar2022': '2022-03-30',
 'may2022': '2022-05-04',
 'jun2022': '2022-06-15',
 'jul2022': '2022-07-13',
 'aug2022': '2022-08-17',
 'sep2022': '2022-09-14',
 'sep2022a': '2022-09-21',
 'oct2022': '2022-10-12',
 'nov2022a': '2022-11-07',
 'nov2022b': '2022-11-08',
 'nov2022c': '2022-11-09',
 'nov2022d': '2022-11-10',
 'nov2022': '2022-11-16',
 'dec2022': '2022-12-14',
 'jan2023': '2023-01-18',
 'feb2023': '2023-02-15',
 'mar2023': '2023-03-15',
 'apr2023a': '2023-04-10',
 'apr2023b': '2023-04-11',
 'apr2023': '2023-04-12',
 'apr2023c': '2023-04-13',
 'apr2023d': '2023-04-14',
 'apr2023e': '2023-04-15',
 'apr2023f': '2023-04-16',
 'may2023': '2023-05-17',
 'jun2023': '2023-06-14',
 'jul2023': '2023-07-12',
 'aug2023': '2023-08-15',
 'aug2023a': '2023-08-23',
 'sep2023': '2023-09-13',
 'oct2023a': '2023-10-09',
 'oct2023b': '2023-10-10',
 'oct2023': '2023-10-11',
 'oct2023c': '2023-10-12',
 'oct2023d': '2023-10-13',
 'oct2023e': '2023-10-14',
 'oct2023f': '2023-10

### Routes add multiple days

In [7]:
months = ["sep", "oct"]

analysis_date_list = [rt_dates.DATES[f"{m}2023"] for m in months]

In [8]:
analysis_date_list

['2023-09-13', '2023-10-11']

In [9]:
def concatenate_trip_segment_speeds(analysis_date_list: list) -> pd.DataFrame:
    """
    Concatenate the trip parquets together,
    whether it's for single day or multi-day averages.
    """
    # TRIP_EXPORT = CONFIG_DICT["trip_metrics"]
    TRIP_EXPORT = "vp_trip/trip_metrics"
    df = pd.concat(
        [
            pd.read_parquet(
                f"{RT_SCHED_GCS}{TRIP_EXPORT}/trip_{analysis_date}.parquet"
            ).assign(service_date=pd.to_datetime(analysis_date))
            for analysis_date in analysis_date_list
        ],
        axis=0,
        ignore_index=True,
    )
    return df

In [10]:
all_routes = concatenate_trip_segment_speeds(analysis_date_list)

In [11]:
all_routes.sched_rt_category.value_counts()

vp_sched    144989
vp_only      27632
Name: sched_rt_category, dtype: int64

In [12]:
all_routes.service_date.value_counts()

2023-10-11    86486
2023-09-13    86135
Name: service_date, dtype: int64

In [13]:
analysis_date_list[1]

'2023-10-11'

In [14]:
routes_just_one = concatenate_trip_segment_speeds(["2023-10-11"])

In [15]:
routes_just_one.service_date.value_counts()

2023-10-11    86486
Name: service_date, dtype: int64

In [16]:
all_routes.sample(3)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,time_of_day,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct,service_date
153683,7cc0cb1871dfd558f11a2885c145d144,6d55479f1eb5f89f63ea0541ce86a34a,64,63,190,64,190,190,5R,0,vp_sched,50.0,PM Peak,peak,2.97,1.0,1.0,0.28,2023-10-11
109452,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,22b0fb02683a3b1250be9943033c3953,135,135,403,135,403,400,210-13168,0,vp_sched,116.0,Midday,offpeak,2.99,0.99,1.0,0.16,2023-10-11
92198,efbbd5293be71f7a5de0cf82b59febe1,f9671ba41ffde9fecc0cb7ecefe991df,30,4,28,24,28,28,3639,1,vp_sched,24.0,AM Peak,peak,0.93,1.0,0.8,0.25,2023-10-11


#### Add back metrics

In [27]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate normalized metrics from the numerator and
    denominator columns. Works on trip-level or route-level rows.
    Note: the columns are added to the dataframe that is passed in.
    """
    df["pings_per_min"] = df.total_pings / df.rt_service_min
    df["spatial_accuracy_pct"] = df.vp_in_shape / df.total_vp
    df["rt_w_gtfs_pct"] = df.total_min_w_gtfs / df.rt_service_min
    df["rt_v_scheduled_time_pct"] = df.rt_service_min / df.service_minutes - 1

    # Cap rt_w_gtfs_pct at 1 (100%) for any values above that
    df.rt_w_gtfs_pct = df.rt_w_gtfs_pct.mask(df.rt_w_gtfs_pct > 1, 1)

    return df

In [28]:
def route_metrics(analysis_date_list:list)->pd.DataFrame:
    
    df = concatenate_trip_segment_speeds(analysis_date_list)
    # Drop the trip-level metrics; they are recalculated after aggregation
    del_cols = [
        "pings_per_min",
        "spatial_accuracy_pct",
        "rt_w_gtfs_pct",
        "rt_v_scheduled_time_pct",
    ]

    df = df.drop(columns=del_cols)
    
    # Add weighted metrics
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]

    count_cols = ["trip_instance_key"]
    
    group_cols = [
        "schedule_gtfs_dataset_key",
        "route_id",
        "direction_id",
        "sched_rt_category",
        "peak_offpeak",
        "time_of_day",
    ]
    df2 = (
        df.groupby(group_cols)
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    df2 = df2.rename(columns={"trip_instance_key": "n_trips"})

    df2 = add_metrics(df2)

    return df2

In [29]:
all_routes2 = route_metrics(
    analysis_date_list,
    
)

In [30]:
all_routes2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,time_of_day,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
1314,2f4b452801393f177e9dbca20cac1a07,50,1,vp_sched,peak,PM Peak,233,235,729,220.0,729,595,5,3.1,0.82,0.99,0.07


In [32]:
all_routes_test = route_metrics(
    ['2023-10-11'],
    
)

#### Why is rt_w_gtfs_pct missing for some rows even though they are vp_sched?

In [33]:
all_routes2.loc[
    (all_routes2["route_id"] == "49")
    & (all_routes2["schedule_gtfs_dataset_key"] == "015d67d5b75b5cf2b710bbadadfb75f5")
]

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,time_of_day,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
85,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,offpeak,Early AM,208,208,620,122.0,620,497,2,2.98,0.8,1.0,0.7
86,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,offpeak,Evening,143,142,425,120.0,425,425,2,2.99,1.0,1.0,0.18
87,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,offpeak,Midday,1295,1292,3845,1080.0,3845,3827,18,2.98,1.0,1.0,0.2
88,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,peak,AM Peak,718,719,2134,625.0,2134,2127,10,2.97,1.0,1.0,0.15
89,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,peak,PM Peak,1192,1195,3537,1016.0,3537,3451,16,2.96,0.98,1.0,0.18
90,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,offpeak,Early AM,503,502,1496,356.0,1496,1088,5,2.98,0.73,1.0,0.41
91,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,offpeak,Evening,196,229,582,138.0,582,435,2,2.54,0.75,0.86,0.66
92,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,offpeak,Midday,1257,1258,3738,1242.0,3738,3700,18,2.97,0.99,1.0,0.01
93,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,peak,AM Peak,795,796,2364,722.0,2364,2294,10,2.97,0.97,1.0,0.1
94,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,peak,PM Peak,1288,1286,3828,1063.0,3828,3463,15,2.98,0.9,1.0,0.21


In [34]:
all_routes.loc[
    (all_routes.route_id == "49")
    & (all_routes.schedule_gtfs_dataset_key == "015d67d5b75b5cf2b710bbadadfb75f5")
][["service_date"]].drop_duplicates()

Unnamed: 0,service_date
85724,2023-09-13
172356,2023-10-11


#### Cleaning Function

In [None]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [None]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [None]:
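def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    """
    Prepare a dataframe for display: scale 0-1 proportions to 0-100,
    round the integer columns, and title-case the column names.
    """
    # Work on a copy so the dataframe passed in (e.g. all_routes2) is not altered
    df = df.copy()
    for c in pct_cols:
        df[c] = df[c] * 100
    for c in int_cols:
        # Missing values are filled with 0 before casting to integer
        df[c] = df[c].fillna(0).round().astype(int)

    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df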

In [None]:
all_routes3 = clean_df(all_routes2, pct_cols, int_cols)

In [None]:
all_routes3.sample(3)