## Update `trips`
* cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * vp_usable is the source data for rt_vs_sched metrics; do not merge in schedule data until the gtfs_digest report. Only bring in the schedule_gtfs_dataset_key column.
    * vp_usable + route_id-direction_id for trips also present in schedule. If not in schedule, fill it with route_id = Unknown and direction_id as Int64
    * Add function to concatenate trip file, enable us to put in 1 day or 7 days for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce `direction_id` to Int64; don't fill it in with 0. A missing direction is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up. Currently, just taking the average of an average isn't really accurate.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to step 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the % 
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * add_metrics looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in segment_speed_utils.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. total_pings_for_trip is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and have the metrics all use generic names that are suitable for passing through aggregation functions.
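
The note about route-level totals (summing minutes and pings instead of averaging trip-level averages) can be illustrated with a toy example. The numbers below are made up; only the column names are taken from the trip table. It is a sketch of why the two approaches disagree, not the aggregation code used in the pipeline.

```python
import pandas as pd

# Hypothetical trip-level rows for a single route: one long trip, one short trip.
toy_trips = pd.DataFrame(
    {
        "route_id": ["A", "A"],
        "total_pings": [300, 20],
        "rt_service_min": [100, 20],
    }
)
toy_trips["pings_per_min"] = toy_trips.total_pings / toy_trips.rt_service_min

# Average of trip-level averages: every trip counts equally.
avg_of_avg = toy_trips.groupby("route_id").pings_per_min.mean()

# Ratio of sums: each trip is weighted by its minutes of service.
totals = toy_trips.groupby("route_id")[["total_pings", "rt_service_min"]].sum()
ratio_of_sums = totals.total_pings / totals.rt_service_min

print(avg_of_avg["A"])     # 2.0
print(ratio_of_sums["A"])  # 320 / 120, roughly 2.67
```

The short trip drags the simple average down to 2.0, while the summed version reflects that most service minutes had 3 pings per minute.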

In [46]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [2]:
# Times
import datetime

from loguru import logger

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [47]:
SCHED_GCS

'gs://calitp-analytics-data/data-analyses/gtfs_schedule/'

In [4]:
# analysis_date = rt_dates.DATES["dec2023"]

In [5]:
RT_SCHED_GCS

'gs://calitp-analytics-data/data-analyses/rt_vs_schedule/'

In [6]:
# rt_dates.DATES

### Routes - New Work

In [7]:
routes_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_route_dir/route_direction_metrics/trip_2023_09_13_to_2023_10_11.parquet"
)

In [8]:
routes_df.sample()

Unnamed: 0,time_period,schedule_gtfs_dataset_key,route_id,direction_id,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
1805,offpeak,c499f905e33929a641f083dad55c521e,650,1,104,107,290,90.0,290,290,2,2.71,1.0,0.97,0.19


### Trips - New Work

In [9]:
trips_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_trip/trip_metrics/trip_2023-09-13.parquet"
)

In [10]:
trips_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,time_of_day,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
30410,a068c9b5e54692d4f729dde66c36cdb8,6053820932ebae7359871a2dc0212f62,69,68,201,69,0,0,Unknown,,vp_only,,AM Peak,peak,2.91,,1.0,


### Check Merges

In [11]:
analysis_date = ["2023-09-13"]

In [12]:
route_time_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id",
    "time_period",
]


def concatenate_schedule_by_route_direction(date_list: list) -> pd.DataFrame:
    """
    Concatenate schedule data that's been
    aggregated to route-direction-time_period.
    """
    df = pd.concat(
        [
            pd.read_parquet(
                f"{RT_SCHED_GCS}schedule_route_dir/"
                f"schedule_route_direction_metrics_{d}.parquet",
                columns=route_time_cols
                + [
                    "avg_sched_service_min",
                    "avg_stop_meters",
                    "n_trips",
                    "frequency",
                ],
            )
            .assign(service_date=pd.to_datetime(d))
            .astype({"direction_id": "Int64"})
            for d in date_list
        ],
        axis=0,
        ignore_index=True,
    )

    return df
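
The `.astype({"direction_id": "Int64"})` step above uses pandas' nullable integer dtype, which keeps a missing direction as missing. A small sketch with made-up values, contrasting it with filling in 0:

```python
import numpy as np
import pandas as pd

# Hypothetical values: the last trip has no matching schedule row.
direction = pd.Series([0.0, 1.0, np.nan], name="direction_id")

# Nullable integer dtype: the missing value stays missing (<NA>).
as_int64 = direction.astype("Int64")

# Filling with 0 first makes the unknown trip look like direction 0.
filled = direction.fillna(0).astype(int)

print(as_int64.isna().sum())  # 1
print((filled == 0).sum())    # 2
```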

In [13]:
import geopandas as gpd

In [14]:
def concatenate_speeds_by_route_direction(date_list: list) -> pd.DataFrame:
    """
    Concatenate speeds data that's been
    aggregated to route-direction-time_period.
    """
    df = pd.concat(
        [
            pd.read_parquet(
                f"{SEGMENT_GCS}rollup_singleday/" f"speeds_route_dir_{d}.parquet",
                columns=route_time_cols + ["speed_mph"],
            )
            .assign(service_date=pd.to_datetime(d))
            .astype({"direction_id": "Int64"})
            for d in date_list
        ],
        axis=0,
        ignore_index=True,
    )

    return df

#### concatenate_schedule_by_route_direction

In [15]:
df1 = concatenate_schedule_by_route_direction(analysis_date)

In [17]:
df1.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_sched_service_min,avg_stop_meters,n_trips,frequency,service_date
5916,c499f905e33929a641f083dad55c521e,36,1,peak,44.81,210.82,16,2.0,2023-09-13


In [20]:
routes_df.time_period.value_counts()

all_day    3174
peak       2981
offpeak    2729
Name: time_period, dtype: int64

In [38]:
m_cols = ["schedule_gtfs_dataset_key", "route_id", "direction_id", "time_period"]

In [27]:
m1 = pd.merge(df1, routes_df, on=m_cols, how="outer", indicator=True)

In [33]:
961 / len(m1)

0.08719716904092188

In [34]:
2137 / len(m1)

0.19390254967788767

In [35]:
7923 / len(m1)

0.7189002812811904

In [28]:
m1._merge.value_counts()

both          7923
left_only     2137
right_only     961
Name: _merge, dtype: int64
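
The shares computed by hand above (count divided by `len(m1)`) can also be read straight off the indicator column with `value_counts(normalize=True)`. A minimal sketch on two made-up frames (the route ids and values are hypothetical):

```python
import pandas as pd

# Hypothetical frames that overlap on two of four route ids.
left = pd.DataFrame({"route_id": ["1", "2", "3"], "n_trips": [10, 20, 30]})
right = pd.DataFrame({"route_id": ["2", "3", "4"], "total_vp": [5, 6, 7]})

merged = pd.merge(left, right, on="route_id", how="outer", indicator=True)

# Share of rows in each merge category, without hard-coding the counts.
shares = merged["_merge"].value_counts(normalize=True)
print(shares)
```

This avoids retyping counts that go stale whenever the inputs change.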

In [29]:
m1.loc[m1._merge == "right_only"].sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_sched_service_min,avg_stop_meters,n_trips_x,frequency,service_date,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips_y,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct,_merge
10720,1fd2f07342d966919b15d5d37fda8cc8,e24126d6-fbad-46b1-a498-75026e763636,0,all_day,,,,,NaT,1061.0,1061,3171.0,331.0,3171,1108,5.0,2.99,0.35,1.0,2.21,right_only


In [41]:
right_only_m1 = m1.loc[m1._merge == "right_only"]

In [43]:
right_only_m1.time_period.value_counts()

all_day    348
offpeak    310
peak       303
Name: time_period, dtype: int64

#### concatenate_speeds_by_route_direction

In [16]:
df2 = concatenate_speeds_by_route_direction(analysis_date)

In [18]:
df2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,speed_mph,service_date
3023,c499f905e33929a641f083dad55c521e,638,0,offpeak,12.67,2023-09-13


In [19]:
routes_df.sample()

Unnamed: 0,time_period,schedule_gtfs_dataset_key,route_id,direction_id,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
1567,offpeak,baeeb157e85a901e47b828ef9fe75091,290,1,1013,1036,1579,356.0,1579,1317,8,1.52,0.83,0.98,1.91


In [30]:
m2 = pd.merge(df2, routes_df, on=m_cols, how="outer", indicator=True)

In [31]:
m2._merge.value_counts()

both          7637
right_only    1247
left_only        0
Name: _merge, dtype: int64

In [36]:
1247 / len(m2)

0.14036470058532194

In [37]:
right_only_m2 = m2.loc[m2._merge == "right_only"]

In [39]:
right_only_m2.time_period.value_counts()

all_day    426
offpeak    417
peak       404
Name: time_period, dtype: int64

In [45]:
right_only_m2.schedule_gtfs_dataset_key.value_counts().head()

7cc0cb1871dfd558f11a2885c145d144    388
d2b09fbd392b28d767c28ea26529b0cd     50
8fa3380c9291d3694494c34b014642d0     48
1770249a5a2e770ca90628434d4934b1     46
43d8d305ee692724a532f30ea63a1cbe     42
Name: schedule_gtfs_dataset_key, dtype: int64

In [None]:
analysis_date

['2023-09-13']

In [69]:
crosswalk = pd.read_parquet(
    f"gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_{analysis_date[0]}.parquet"
)[["organization_name", "schedule_gtfs_dataset_key"]]

In [70]:
crosswalk.loc[crosswalk.schedule_gtfs_dataset_key == "7cc0cb1871dfd558f11a2885c145d144"]

Unnamed: 0,organization_name,schedule_gtfs_dataset_key


In [74]:
m2_ops = pd.merge(
    right_only_m2[["schedule_gtfs_dataset_key"]],
    crosswalk,
    on=["schedule_gtfs_dataset_key"],
    how="outer",
    indicator=True,
).drop_duplicates()

In [75]:
m2_ops.organization_name.value_counts().head()

City of Laguna Beach                2
Palo Verde Valley Transit Agency    2
Placer County                       2
City of Lawndale                    2
Lassen Transit Service Agency       2
Name: organization_name, dtype: int64

In [77]:
m2_ops.sort_values(
    by=["schedule_gtfs_dataset_key"]
)

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,_merge
1334,0139b1253130b33adcd4b3a4490530d2,Tulare County Regional Transit Agency,right_only
0,015d67d5b75b5cf2b710bbadadfb75f5,Marin County Transit District,both
1267,04d1db905ac689e17a97ce414cf393a6,Angel Island-Tiburon Ferry Company,right_only
4,07d3b79f14cec8099119e1eb649f065b,Tahoe Transportation District,both
1278,0881af3822466784992a49f1cc57d38f,Sonoma-Marin Area Rail Transit District,right_only
5,09a703757d1ed14ca9580b1385e39315,,left_only
10,09e16227fc42c4fe90204a9d11581034,Cloverdale Transit,both
26,0a3c0b21c85fb09f8db91599e14dd7f7,Lake Transit Authority,both
1314,0bcba4ddc5c10546f2e957a74f58b8ac,Yuba-Sutter Transit Authority,right_only
1324,0d04ec340550e5a62b031a8e125e6658,POINT,right_only


### Cleaning Function

In [None]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [None]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [None]:
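def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    """
    Prepare a metrics table for display: scale 0-1 percents to 0-100,
    round the minute columns, and title-case the column names.
    """
    # Work on a copy so the input (e.g. routes_df) is not modified in place.
    df = df.copy()
    for i in pct_cols:
        df[i] = df[i] * 100
    for i in int_cols:
        df[i] = df[i].fillna(0).round()

    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df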

In [None]:
routes_df2 = clean_df(routes_df, pct_cols, int_cols)

In [None]:
routes_df2["Time Period"].value_counts()

### Check time of day 
* https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py#L135-L163

In [None]:
analysis_date = "2023-10-11"

In [None]:
trips = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_trip/trip_metrics/trip_2023-10-11.parquet"
)

In [None]:
trips.head()

In [None]:
trips.peak_offpeak.unique()

In [None]:
trips.time_of_day.value_counts()

In [None]:
trips.peak_offpeak.value_counts()

In [None]:
roll_singleday_route_dir_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_route_dir_2023-10-15.parquet"
)

In [None]:
roll_singleday_route_dir_df.time_period.value_counts()

In [None]:
roll_singleday_route_dir_df.sample()

In [None]:
roll_singleday_speeds_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_trip_2023-10-15.parquet"
)

In [None]:
roll_singleday_speeds_df.sample()

#### Fix Time of Day #1

In [None]:
trips.columns

In [None]:
trips2 = trips.drop(
    columns=[
        "peak_offpeak",
        "pings_per_min",
        "spatial_accuracy_pct",
        "rt_w_gtfs_pct",
        "rt_v_scheduled_time_pct",
    ]
)

In [None]:
trips2.head()

#### Fix Time of Day #2

In [None]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add normalized metrics computed from the numerator / denominator columns.
    Ratios are left unscaled (not multiplied by 100).
    """
    df["pings_per_min"] = df.total_pings / df.rt_service_min
    df["spatial_accuracy_pct"] = df.vp_in_shape / df.total_vp
    df["rt_w_gtfs_pct"] = df.total_min_w_gtfs / df.rt_service_min
    df["rt_v_scheduled_time_pct"] = df.rt_service_min / df.service_minutes - 1

    # Cap rt_w_gtfs_pct at 1 (100%)
    df.rt_w_gtfs_pct = df.rt_w_gtfs_pct.mask(df.rt_w_gtfs_pct > 1, 1)

    return df
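
The ratios in `add_metrics` divide by columns that can be 0 (for example, a trip with `total_vp` of 0). In pandas, a nonzero value divided by 0 gives `inf` and 0 divided by 0 gives `NaN`. The sketch below uses a hypothetical helper, `safe_ratio`, which is not part of `segment_speed_utils`, to show one way to turn both cases into `NaN`. Whether that is the desired behavior here is an assumption.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Hypothetical helper: divide, returning NaN wherever the denominator is 0."""
    return numerator / denominator.where(denominator != 0)


# Made-up rows: a normal trip, a trip with no vehicle positions,
# and an inconsistent row with a zero denominator.
toy = pd.DataFrame({"vp_in_shape": [90, 0, 5], "total_vp": [100, 0, 0]})

plain = toy.vp_in_shape / toy.total_vp
safe = safe_ratio(toy.vp_in_shape, toy.total_vp)

print(plain.tolist())  # [0.9, nan, inf]
print(safe.tolist())   # [0.9, nan, nan]
```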

* Yeah, so actually, I use the column `peak_offpeak` and I do not filter, but pass `peak_offpeak` in the grouping cols (route-direction-peak_offpeak). Then I do it again without `peak_offpeak` in the group cols and set `peak_offpeak` = all_day. Then I concatenate.

In [None]:
route_months = ["sep", "oct"]

route_analysis_date_list = [rt_dates.DATES[f"{m}2023"] for m in route_months]

In [None]:
def concatenate_trip_segment_speeds(analysis_date_list: list) -> pd.DataFrame:
    """
    Concatenate the trip parquets together,
    whether it's for single day or multi-day averages.
    """
    TRIP_EXPORT = "vp_trip/trip_metrics"
    df = pd.concat(
        [
            pd.read_parquet(
                f"{RT_SCHED_GCS}{TRIP_EXPORT}/trip_{analysis_date}.parquet"
            ).assign(service_date=pd.to_datetime(analysis_date))
            for analysis_date in analysis_date_list
        ],
        axis=0,
        ignore_index=True,
    )
    return df

In [None]:
def route_metrics(analysis_date_list: list) -> pd.DataFrame:

    df = concatenate_trip_segment_speeds(analysis_date_list)

    # Delete out trip generated metrics
    del_cols = [
        "pings_per_min",
        "spatial_accuracy_pct",
        "rt_w_gtfs_pct",
        "rt_v_scheduled_time_pct",
    ]

    df = df.drop(columns=del_cols)

    # Add weighted metrics
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]

    count_cols = ["trip_instance_key"]

    all_day_groups = [
        "schedule_gtfs_dataset_key",
        "route_id",
        "direction_id",
    ]

    all_day_df = (
        df.groupby(all_day_groups)
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    all_day_df = all_day_df.rename(columns={"trip_instance_key": "n_trips"})
    all_day_df = add_metrics(all_day_df)
    all_day_df["time_period"] = "all_day"

    peak_groups = ["peak_offpeak"] + all_day_groups
    peak_df = (
        df.groupby(peak_groups)
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    peak_df = peak_df.rename(
        columns={"trip_instance_key": "n_trips", "peak_offpeak": "time_period"}
    )
    peak_df = add_metrics(peak_df)

    final_df = pd.concat([peak_df, all_day_df])

    # Save (not enabled yet; generate_date and CONFIG_DICT are not defined in this notebook)
    # analysis_date_file = generate_date(analysis_date_list)
    # ROUTE_EXPORT = CONFIG_DICT["route_direction_metrics"]
    # final_df.to_parquet(f"{RT_SCHED_GCS}{ROUTE_EXPORT}/trip_{analysis_date_file}.parquet")

    return final_df
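
One thing that may be worth checking in `route_metrics`: `groupby` drops rows whose group keys are missing by default, and the trip table contains rows with `route_id` = Unknown and a missing `direction_id`. The toy example below (made-up values) shows the default behavior next to `dropna=False`. Whether those rows should be kept at the route level is a judgment call that the notes above do not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: the last one is not in the schedule, so direction_id is missing.
toy_trips = pd.DataFrame(
    {
        "route_id": ["10", "10", "Unknown"],
        "direction_id": [0.0, 0.0, np.nan],
        "total_pings": [100, 50, 30],
    }
)
group_cols = ["route_id", "direction_id"]

# Default: the group with a missing key is silently dropped.
default = toy_trips.groupby(group_cols).total_pings.sum().reset_index()

# dropna=False keeps the missing-key group as its own row.
keep_na = toy_trips.groupby(group_cols, dropna=False).total_pings.sum().reset_index()

print(len(default), default.total_pings.sum())  # 1 150
print(len(keep_na), keep_na.total_pings.sum())  # 2 180
```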

In [None]:
all_routes2 = route_metrics(route_analysis_date_list)

In [None]:
all_routes2.head()

In [None]:
all_routes2.time_period.value_counts()