# GTFS to NTD

* Get `schedule_gtfs_dataset_key` to `ntd_id_2022` 
* Use `dim_provider_gtfs_data` and `dim_organizations`
* Anticipate an issue similar to one `schedule_gtfs_dataset_key` linking to multiple `organization_name` rows, since the NTD ID is another column in `dim_organizations`
* Shweta only found 98 pairs of `schedule_gtfs_dataset_key` to `ntd_id_2022`. Should we be expecting more?
* How many operators are there, and for what share of the operators we have GTFS data for can we associate an NTD ID (how close to 100%)?
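The coverage question above could be measured roughly as follows. This is a minimal sketch on made-up data: the `crosswalk` frame and the `ntd_coverage` helper are hypothetical, and only the two column names come from these notes.

```python
import pandas as pd

# Toy crosswalk with made-up keys and IDs (not real data).
crosswalk = pd.DataFrame({
    "schedule_gtfs_dataset_key": ["key_a", "key_b", "key_c", "key_d"],
    "ntd_id_2022": ["ntd_1", None, "ntd_2", "ntd_2"],
})


def ntd_coverage(df: pd.DataFrame) -> float:
    """Share of unique GTFS keys that have at least one non-null NTD ID."""
    # count() ignores nulls, so a key with only null NTD IDs gets 0
    non_null_ids_per_key = df.groupby("schedule_gtfs_dataset_key")["ntd_id_2022"].count()
    return float(non_null_ids_per_key.gt(0).mean())


coverage = ntd_coverage(crosswalk)
print(f"{coverage:.0%} of GTFS keys have an NTD ID")
```

Grouping by key first keeps a key that links to several organizations from being counted more than once.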

In [None]:
import pandas as pd
import yaml

from shared_utils import portfolio_utils, schedule_rt_utils
from gtfs_key_ntd_crosswalk import GCS_FILE_PATH, filter_to_valid_dates

with open(
    "../_shared_utils/shared_utils/portfolio_organization_name.yml", "r"
) as f:
    PORTFOLIO_ORGANIZATIONS_DICT = yaml.safe_load(f)


In [None]:
# It's just 3 dates, so let's just keep all the combos available
operators = pd.read_parquet(
    f"{GCS_FILE_PATH}ahsc_test/trips_2022.parquet",
    columns = ["gtfs_dataset_key", "service_date"]
).drop_duplicates().reset_index(drop=True).rename(
    columns = {"gtfs_dataset_key": "schedule_gtfs_dataset_key"}
).astype({"service_date": "datetime64[ns]"})

date_list = operators.service_date.unique().tolist()

operators.service_date.value_counts()

In [None]:
# There are 208 operators, so why can't we get more NTD IDs to link?
# NTD ID 2022, GTFS is from Dec 2022
operators.schedule_gtfs_dataset_key.nunique()

In [None]:
from segment_speed_utils import helpers
analysis_date_list = ['2022-11-30', '2022-12-03', '2022-12-04']

## GTFS schedule trips

* compare what's downloaded with operator summary
* numbers should be close, though gtfs_utils_v2 will filter out and keep agency subfeeds
* these numbers are already not that close, so do a fresh download. It should mimic `gtfs_utils_v2`, so use `gtfs_funnel`
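One way to make the comparison above concrete is to count unique feeds per service date in each source and put the counts side by side. The sketch below uses toy frames. The column names in `summary` are assumptions, since the schema of the operator summary table is not shown in these notes, and `feeds_per_date` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for the downloaded trips and the operator summary (made-up keys).
downloaded = pd.DataFrame({
    "schedule_gtfs_dataset_key": ["key_a", "key_b", "key_c", "key_a", "key_b"],
    "service_date": pd.to_datetime(["2022-11-30"] * 3 + ["2022-12-03"] * 2),
})
summary = pd.DataFrame({
    "gtfs_dataset_key": ["key_a", "key_b", "key_a"],  # assumed column name
    "service_date": pd.to_datetime(["2022-11-30", "2022-11-30", "2022-12-03"]),
})


def feeds_per_date(df: pd.DataFrame, key_col: str) -> pd.Series:
    """Number of unique feed keys on each service date."""
    return df.groupby("service_date")[key_col].nunique()


comparison = pd.concat(
    [
        feeds_per_date(downloaded, "schedule_gtfs_dataset_key").rename("downloaded"),
        feeds_per_date(summary, "gtfs_dataset_key").rename("summary"),
    ],
    axis=1,
)
comparison["difference"] = comparison["downloaded"] - comparison["summary"]
print(comparison)
```

A large `difference` on one date but not the others would point to a date-specific gap rather than a systematic filter.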

#### Hypothesis 1: every operator in schedule should merge with `dim_provider_gtfs_data` (because it belongs to a quartet)

#### Hypothesis 2: once all operators have an organization_id, all of those should merge with dim_organizations 
* if `organization_source_record_id` is present in both, we should expect this
* but perhaps only most should have an NTD ID (we can't be sure here, since NTD IDs can be nulls).
* possible null NTD IDs are college campus shuttle feeds, those *might* not have NTD ID
* we care about the `schedule_gtfs_dataset_key` to `ntd_id_2022` relationship (even if we have to use `organization_source_record_id` to merge tables).
   * multiple `schedule_gtfs_dataset_key` to 1 `ntd_id_2022` (LA Metro Bus/Rail -> LA Metro)
   * 1 key to multiple NTD IDs? is this even possible? where would this make sense?
   * maybe a regional feed, such as `VCTC GMV Schedule` (`Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)`), could have multiple NTD IDs?
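The cardinality questions in Hypothesis 2 can be checked with two grouped `nunique` counts. This is a sketch on made-up pairs, not the real crosswalk: the keys and IDs are placeholders, and only the two column names come from these notes.

```python
import pandas as pd

# Made-up key/NTD pairs covering each case discussed above.
pairs = pd.DataFrame({
    "schedule_gtfs_dataset_key": ["metro_bus", "metro_rail", "regional", "regional", "campus"],
    "ntd_id_2022": ["ntd_1", "ntd_1", "ntd_2", "ntd_3", None],
})

# NTD IDs per key: nunique() skips nulls, so a key with no NTD ID gets 0.
ids_per_key = pairs.groupby("schedule_gtfs_dataset_key")["ntd_id_2022"].nunique()

# Keys per NTD ID: drop null IDs first so they do not form a group.
keys_per_id = (
    pairs.dropna(subset=["ntd_id_2022"])
    .groupby("ntd_id_2022")["schedule_gtfs_dataset_key"]
    .nunique()
)

keys_with_many_ids = ids_per_key[ids_per_key > 1].index.tolist()  # 1 key : many NTD IDs
ids_with_many_keys = keys_per_id[keys_per_id > 1].index.tolist()  # many keys : 1 NTD ID
keys_without_id = ids_per_key[ids_per_key == 0].index.tolist()    # null NTD ID

print(keys_with_many_ids, ids_with_many_keys, keys_without_id)
```

Run against the merged table, an empty `keys_with_many_ids` would suggest that the 1 key to multiple NTD IDs case does not occur for these dates.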

In [None]:
new_operator_df = pd.concat(
    [helpers.import_scheduled_trips(
        one_date, 
        columns = ["gtfs_dataset_key", "name"],
        get_pandas = True
    ) for one_date in analysis_date_list], 
    ignore_index=True 
)

new_operator_df.schedule_gtfs_dataset_key.nunique(), new_operator_df.name.nunique()

In [None]:
dim_provider_gtfs_data_full = pd.read_parquet(
    f"{GCS_FILE_PATH}ahsc_test/dim_provider_gtfs_data_full.parquet"
).pipe(
    schedule_rt_utils.localize_timestamp_col, 
    ["_valid_from", "_valid_to"]
)

valid_provider_full = filter_to_valid_dates(dim_provider_gtfs_data_full, date_list)

dim_orgs = pd.read_parquet(
    f"{GCS_FILE_PATH}ahsc_test/organizations.parquet"
)

In [None]:
new_operator_df.shape

In [None]:
# 1 gtfs_key can have multiple organizations
# can 1 org have multiple NTD IDs? we know NTD IDs can be null,
# but we think an org only has multiple NTD IDs if its ID changed
df2 = pd.merge(
    new_operator_df.rename(columns = {"name": "schedule_gtfs_dataset_name"}),
    valid_provider_full[
        ["schedule_gtfs_dataset_key", 
         "organization_source_record_id"]].drop_duplicates(),
    on = "schedule_gtfs_dataset_key",
    how = "left",
    #indicator = True
).merge(
    dim_orgs,
    left_on = "organization_source_record_id",
    right_on = "source_record_id",
    how = "left",
    indicator = True
)

df2._merge.value_counts()

In [None]:
#df2[df2.ntd_id_2022.isna()].schedule_gtfs_dataset_name

In [None]:
operator_summary = pd.read_parquet(
    f"{GCS_FILE_PATH}ahsc_test/fct_daily_feed_scheduled_service_summary.parquet"
)

# ok, maybe go back to download full trips table, excluding some feeds
# 194 seems reasonable, but 125 seems low
operator_summary.service_date.value_counts()

In [None]:
new_operator_df.schedule_gtfs_dataset_key.nunique(), new_operator_df.name.nunique()

In [None]:
from shared_utils import rt_dates

dec2024_date = rt_dates.DATES["dec2024"]

dec2024_operators = helpers.import_scheduled_trips(
    dec2024_date,
    columns = ["name", "gtfs_dataset_key"],
    get_pandas = True
)
dec2024_operators.shape, dec2024_operators.name.nunique()

In [None]:
dec2023_date = rt_dates.DATES["dec2023"]

dec2023_operators = helpers.import_scheduled_trips(
    dec2023_date,
    columns = ["name", "gtfs_dataset_key"],
    get_pandas = True
)
dec2023_operators.shape, dec2023_operators.name.nunique()

In [None]:
# operators show up for multiple dates, so dedupe the keys on the left side;
# 1 gtfs_key can still link to several orgs, so the merge is validated as 1:m
# yikes, this is too crazy to deal with
dim_provider_gtfs_data = pd.read_parquet(
    f"{GCS_FILE_PATH}ahsc_test/dim_provider_gtfs_data.parquet"
).pipe(
    schedule_rt_utils.localize_timestamp_col, 
    ["_valid_from", "_valid_to"]
)

valid_gtfs_data = filter_to_valid_dates(dim_provider_gtfs_data, date_list)


operator_to_orgs = pd.merge(
    operators[["schedule_gtfs_dataset_key"]].drop_duplicates(),
    valid_gtfs_data[
        ["schedule_gtfs_dataset_key", "schedule_gtfs_dataset_name",
         "organization_source_record_id",
         "_valid_from_local", "_valid_to_local"]],
    on = "schedule_gtfs_dataset_key",
    # inner join: _merge can only be "both" here, so compare the number of
    # unique keys in the result against operators to see how many dropped
    how = "inner",
    validate = "1:m",
    indicator = True
)

operator_to_orgs._merge.value_counts()


In [None]:
# started with 208, yet only some merge on, weird
operators.schedule_gtfs_dataset_key.nunique()

In [None]:
# schedule_gtfs_dataset_key:org_name (can be 1:m, 1:1 for majority).
# schedule_gtfs_dataset_key:portfolio_org_name is 1:1. But LA Metro Bus / LA Metro Rail both map to LA Metro.