## Something is wrong with GTFS Digest
* The Makefile in `gtfs_digest` won't run because a function it depends on changed.
    * Temporary workaround: go to `rt_segment_speeds` -> `segment_speed_utils` -> `time_series_utils` and change back to the old function.


In [None]:
import _section2_utils as section2
import geopandas as gpd
import merge_operator_data
import merge_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers, time_series_utils
from shared_utils import catalog_utils, rt_dates, rt_utils
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [None]:
analysis_date_list = rt_dates.y2024_dates + rt_dates.y2023_dates

### Metrics for All Routes
* March 2023 has two values for some operators.
* Some operators have many repeated rows, which pushes the values in their charts above 100.

#### Look at the metrics dataframes first.
* I think `op_rt_sched_metrics` is the source of the duplicate values.
* Temp fix: in `_section2_utils.load_operator_metrics()`, drop duplicates based on `service_date`.
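
A minimal sketch of that temporary fix on toy data. The frame, the operator name, and the `vp_per_min` column are made up for illustration, and the real `load_operator_metrics()` may filter differently.

```python
import pandas as pd

# Toy stand-in for one operator's metrics; all values are hypothetical.
toy_metrics = pd.DataFrame(
    {
        "organization_name": ["Operator A"] * 3,
        "service_date": pd.to_datetime(["2023-03-15", "2023-03-15", "2023-04-12"]),
        "vp_per_min": [2.1, 2.4, 2.2],
    }
)

# Keep one row per service_date. keep="first" is an arbitrary choice here.
deduped = toy_metrics.drop_duplicates(subset=["service_date"], keep="first").reset_index(
    drop=True
)
```

Dropping on `service_date` alone only makes sense after filtering to a single operator. Which row survives is arbitrary, so this hides the duplication rather than resolving it.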

In [None]:
op_sched_metrics = merge_operator_data.concatenate_schedule_operator_metrics(analysis_date_list)

In [None]:
op_sched_metrics_dec = op_sched_metrics.loc[
    op_sched_metrics.service_date == "2024-12-11T00:00:00.000000000"
]

In [None]:
op_sched_metrics_dec.schedule_gtfs_dataset_key.value_counts().head(10)

In [None]:
op_rt_sched_metrics = merge_operator_data.concatenate_rt_vs_schedule_operator_metrics(analysis_date_list)

In [None]:
op_rt_sched_metrics_dec = op_rt_sched_metrics.loc[
    op_rt_sched_metrics.service_date == "2024-12-11T00:00:00.000000000"
]

In [None]:
op_rt_sched_metrics_dec.organization_name.value_counts().head(15)

* Some organizations appear more than once, for example with a rail schedule and a separate bus schedule.

In [None]:
op_rt_sched_metrics_dec.loc[
    op_rt_sched_metrics_dec.organization_name
    == "Los Angeles County Metropolitan Transportation Authority"
].T

#### How do you know which one is correct?

In [None]:
op_rt_sched_metrics_dec.loc[
    op_rt_sched_metrics_dec.organization_name
    == "Transit Joint Powers Authority for Merced County"
].T

In [None]:
op_rt_sched_metrics_dec.loc[
    op_rt_sched_metrics_dec.organization_name
    == "City of Santa Monica"
].T

In [None]:
op_rt_sched_metrics_dec.loc[
    op_rt_sched_metrics_dec.organization_name
    == "Tahoe Transportation District"
].T

In [None]:
op_rt_sched_metrics_dec.loc[
    op_rt_sched_metrics_dec.organization_name
    == "City of Lawndale"
].T

#### Dataframe from `merge_operator_data.concatenate_rt_vs_schedule_operator_metrics` is created [here at `gtfs_funnel/operator_scheduled_stats.py`](https://github.com/cal-itp/data-analyses/blob/1ba0f544a01f99966a6e210dd11666b4fe4a146e/gtfs_funnel/operator_scheduled_stats.py#L147)
* The data is grouped by `schedule_gtfs_dataset_key`, and a single `organization_name` can map to more than one key, which is why some organizations have multiple entries.
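
One way to list the affected organizations is to count distinct keys per organization and date. This is a sketch on toy data: the operator names and key values are hypothetical, and only the column names come from the dataframes above.

```python
import pandas as pd

# Toy stand-in for the operator metrics; names and keys are made up.
toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator A", "Operator B"],
        "schedule_gtfs_dataset_key": ["key_rail", "key_bus", "key_b"],
        "service_date": pd.to_datetime(["2024-12-11"] * 3),
    }
)

# Count distinct schedule keys per organization on each service date.
keys_per_org = (
    toy.groupby(["service_date", "organization_name"])["schedule_gtfs_dataset_key"]
    .nunique()
    .reset_index(name="n_keys")
)

# Organizations with more than one key will show up as multiple rows.
multi_key_orgs = keys_per_org.loc[keys_per_org.n_keys > 1]
```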

#### Other attempts to look at Operator Profiles

In [None]:
url = "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/digest/operator_profiles.parquet"
operator_profile_df = pd.read_parquet(url)

In [None]:
operator_profile_df.service_date.unique()

In [None]:
march_2023 = operator_profile_df.loc[
    operator_profile_df.service_date == "2023-03-15T00:00:00.000000000"
]

In [None]:
dec_2024 = operator_profile_df.loc[
    operator_profile_df.service_date == "2024-12-11T00:00:00.000000000"
]

In [None]:
march_2023.organization_name.value_counts().head(12)

In [None]:
dec_2024.organization_name.value_counts().head(12)

#### How does Los Angeles County Metropolitan Transportation Authority have two different values?

In [None]:
dec_2024.loc[
    dec_2024.organization_name
    == "Basin Transit"
].T

In [None]:
dec_2024.loc[
    dec_2024.organization_name
    == "Los Angeles County Metropolitan Transportation Authority"
]

In [None]:
dec_2024.loc[
    dec_2024.organization_name == "Transit Joint Powers Authority for Merced County"
]

In [None]:
dec_2024.loc[dec_2024.organization_name == "City of Lawndale"]

In [None]:
dec_2024.loc[dec_2024.organization_name == "Palo Verde Valley Transit Agency"]

In [None]:
dec_2024.loc[dec_2024.organization_name == "City of San Luis Obispo"]

In [None]:
crosswalk_df = merge_operator_data.concatenate_crosswalks(analysis_date_list)

In [None]:
crosswalk_df.head(1)

In [None]:
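# Filter to the March 2023 service date (same date used for `march_2023` above)
march_crosswalk_df = crosswalk_df.loc[
    crosswalk_df.service_date == "2023-03-15T00:00:00.000000000"
]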

### Op Profiles
* The code for `gtfs_digest/merge_operator.py` stopped working because one of the column names changed. I went into `crosswalk_gtfs_dataset_key_to_organization` to fix that. 
* <s>Operator Profiles: are from September 2024 when it's Dec 2024.</s>
    * Fixed: was still referencing one of my old testing profiles.

In [None]:
SCHED_GCS

In [None]:
f"{GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk}"

In [None]:
dec_crosswalk_url = "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-12-11.parquet"

In [None]:
nov_crosswalk_url = "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-11-13.parquet"

In [None]:
dec_crosswalk_df = pd.read_parquet(dec_crosswalk_url)

In [None]:
dec_crosswalk_df.organization_name.value_counts().head(25)

In [None]:
dec_crosswalk_df.loc[
    dec_crosswalk_df.organization_name == "City of South San Francisco"
]

In [None]:
dec_crosswalk_df.loc[
    dec_crosswalk_df.organization_name == "City and County of San Francisco"
]

In [None]:
nov_crosswalk_df = pd.read_parquet(nov_crosswalk_url)

In [None]:
sept_crosswalk_df = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-09-18.parquet"
)

In [None]:
sept_cols = set(sept_crosswalk_df.columns.tolist())
dec_cols = set(dec_crosswalk_df.columns.tolist())
nov_cols = set(nov_crosswalk_df.columns.tolist())

In [None]:
nov_cols - sept_cols

In [None]:
sept_cols - dec_cols

In [None]:
dec_cols - sept_cols

In [None]:
ventura_dec = dec_crosswalk_df.loc[
    dec_crosswalk_df.organization_name == "Ventura County Transportation Commission"
]

In [None]:
ventura_dec[["primary_uza_code", "primary_uza_name"]].drop_duplicates()

In [None]:
ventura_sept = sept_crosswalk_df.loc[
    sept_crosswalk_df.organization_name == "Ventura County Transportation Commission"
]

In [None]:
ventura_sept[["primary_uza_code", "primary_uza_name"]].drop_duplicates()

In [None]:
crosswalk_df = merge_operator_data.concatenate_crosswalks(analysis_date_list)

In [None]:
crosswalk_df.service_date.unique()

In [None]:
import _section1_utils

In [None]:
organization_name = "Monterey-Salinas Transit"

In [None]:
ntd_profile = _section1_utils.load_operator_ntd_profile(organization_name)

In [None]:
ntd_profile

### Timeliness for Dir 0 and 1 has been missing since October.

In [None]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [None]:
# Keep only rows that are found in both schedule and real time data
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            ("organization_name", "==", organization_name),
            ("sched_rt_category", "==", "schedule_and_vp"),
        ]
    ],
)

In [None]:
schd_vp_df_gtfskeys = schd_vp_df[
    ["schedule_gtfs_dataset_key", "service_date"]
].drop_duplicates()

In [None]:
schd_vp_df.head(2)

In [None]:
schedule_by_route = merge_data.concatenate_schedule_by_route_direction(
    analysis_date_list
)

In [None]:
schedule_by_route_gtfskeys = schedule_by_route[
    ["schedule_gtfs_dataset_key", "service_date"]
].drop_duplicates()

In [None]:
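# Compare keys in the schedule + vp table against the schedule-by-route table.
# (`df_avg_speeds_gtfskeys` is not defined until the average speed section below.)
pd.merge(
    schd_vp_df_gtfskeys,
    schedule_by_route_gtfskeys,
    on=["schedule_gtfs_dataset_key", "service_date"],
    how="outer",
    indicator=True,
)[["_merge"]].value_counts()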

In [None]:
import merge_data

In [None]:
from shared_utils import gtfs_utils_v2, publish_utils

### Average Speed Missing for Offpeak and Peak since October
* All Day available 
* GTFS Keys missing? 
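
A quick way to confirm which time periods exist on each date is a crosstab. The sketch below uses toy data and assumes the speeds table has a `time_period` column with values like `all_day`, `peak`, and `offpeak`; that column name and those values are assumptions, not confirmed by this notebook.

```python
import pandas as pd

# Toy stand-in for the speeds table; `time_period` is an assumed column name.
toy_speeds = pd.DataFrame(
    {
        "service_date": pd.to_datetime(["2024-09-18"] * 3 + ["2024-10-16"]),
        "time_period": ["all_day", "peak", "offpeak", "all_day"],
    }
)

# Rows are dates, columns are time periods, cells are row counts.
# A zero means that time period has no rows on that date.
period_counts = pd.crosstab(toy_speeds.service_date, toy_speeds.time_period)
```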

In [None]:
df_avg_speeds = merge_data.concatenate_speeds_by_route_direction(analysis_date_list)

In [None]:
df_avg_speeds.service_date.unique()

In [None]:
df_avg_speeds.head()

In [None]:
df_avg_speeds_gtfskeys = df_avg_speeds[
    ["schedule_gtfs_dataset_key", "service_date"]
].drop_duplicates()

In [None]:
pd.merge(
    df_avg_speeds_gtfskeys,
    schd_vp_df_gtfskeys,
    on=["schedule_gtfs_dataset_key", "service_date"],
    how="outer",
    indicator=True,
)[["_merge"]].value_counts()
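
The merge above only reports counts per `_merge` category. To see which key and date pairs are unmatched, keep the merged frame and filter on the indicator. This is a sketch on toy data; the key values are made up.

```python
import pandas as pd

merge_cols = ["schedule_gtfs_dataset_key", "service_date"]

# Toy stand-ins for the two key tables; key values are hypothetical.
left_keys = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["k1", "k2"],
        "service_date": pd.to_datetime(["2024-10-16", "2024-10-16"]),
    }
)
right_keys = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["k1"],
        "service_date": pd.to_datetime(["2024-10-16"]),
    }
)

merged = pd.merge(left_keys, right_keys, on=merge_cols, how="outer", indicator=True)

# Rows found in only one of the two tables.
unmatched = merged.loc[merged["_merge"] != "both"]
```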