# Confirming the operators under `El Dorado County Transportation Commission` and `Placer County Transportation Planning Agency`

El Dorado County and Placer County asked us where to find the "published data" that Caltrans promised to deliver to RTPAs under SB 125. As of 10/17, the monthly ridership site did not have separate tabs for these RTPAs. Further investigation showed that, although the RTPAs had no tabs of their own, some (if not all) of their associated transit operators appeared under the SACOG tab (City of Placer, El Dorado County Transit Authority).

Before splitting these operators and RTPAs out into separate tabs, we need to confirm whether any other transit operators fall under these RTPAs, then update the RTPA-to-NTD ID crosswalk file.

Per Tiffany
>This crosswalk you use is `ntd_id to RTPA`. You can combine that with several warehouse tables:
>
>- `dim_organizations` and `dim_gtfs_datasets` and `dim_provider_gtfs_data`
>
>- Since the above is a fairly complicated traversal, there's a function in `shared_utils` that wraps this and gets you from a starting point of a `schedule_gtfs_dataset_key` (operator) to `ntd_id`. It is used and created here.
>
>- Operators (based on GTFS schedule) are subject to a date. If you know which date you want, you can use the crosswalk created and saved out here; find the GCS path in `gtfs_analytics_data.yml`. This will get you the operator's `schedule_gtfs_dataset_key` + `ntd_id` + other ntd columns and you can connect that to your `ntd_id - RTPA` crosswalk.
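
The last step described above, connecting the operator-level crosswalk to the `ntd_id`-to-RTPA crosswalk, can be sketched on toy data. This is a minimal illustration, not the warehouse traversal: every row, key, and ID below is made up, and only the column names `schedule_gtfs_dataset_key`, `organization_name`, `ntd_id`, and `RTPA` mirror this notebook. The operator crosswalk built later requests the ID as `ntd_id_2022`; it is called `ntd_id` here for simplicity.

```python
import pandas as pd

# Toy operator crosswalk (hypothetical keys and IDs, for illustration only).
operators = pd.DataFrame({
    "schedule_gtfs_dataset_key": ["key_a", "key_b", "key_c"],
    "organization_name": ["Operator A", "Operator B", "Operator C"],
    "ntd_id": ["00001", "00002", None],
})

# Toy ntd_id -> RTPA crosswalk (hypothetical rows).
ntd_to_rtpa = pd.DataFrame({
    "ntd_id": ["00001", "00002"],
    "RTPA": ["RTPA X", "RTPA Y"],
})

# A left merge keeps every operator; the indicator column
# flags operators that have no RTPA match.
operators_with_rtpa = operators.merge(
    ntd_to_rtpa, on="ntd_id", how="left", indicator=True
)
unmatched = operators_with_rtpa[operators_with_rtpa["_merge"] == "left_only"]
```

Keeping `ntd_id` as a string on both sides avoids losing leading zeros and prevents dtype mismatches on the merge key.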

In [1]:
import pandas as pd
import shutil
import sys
import os
import gcsfs
from calitp_data_analysis.tables import tbls
from calitp_data_analysis.sql import get_engine
from shared_utils.rt_dates import MONTH_DICT
from segment_speed_utils import helpers

sys.path.append("../monthly_ridership_report")  # up one level
from update_vars import NTD_MODES, NTD_TOS, YEAR, MONTH

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [None]:
from shared_utils.schedule_rt_utils import sample_gtfs_dataset_key_to_organization_crosswalk

---

## Testing the `create_gtfs_dataset_key_to_organization_crosswalk` function
This uses the `sample_gtfs_dataset_key_to_organization_crosswalk` function imported above. It should give us the GTFS schedule feed, org name, and NTD ID.



In [None]:
def create_gtfs_dataset_key_to_organization_crosswalk(
    analysis_date: str
) -> pd.DataFrame:
    """
    For every operator that appears in schedule data, 
    create a crosswalk that links to organization_source_record_id.
    For all our downstream outputs, at various aggregations,
    we need to attach these over and over again.
    """
    df = helpers.import_scheduled_trips(
        analysis_date,
        columns = ["gtfs_dataset_key", "name"],
        get_pandas = True
    ).rename(columns = {"schedule_gtfs_dataset_key": "gtfs_dataset_key"})
    # rename columns because we must use simply gtfs_dataset_key in schedule_rt_utils function
    
    # Get base64_url, organization_source_record_id and organization_name
    crosswalk = sample_gtfs_dataset_key_to_organization_crosswalk(
        df,
        analysis_date,
        quartet_data = "schedule",
        dim_gtfs_dataset_cols = ["key", "source_record_id", "base64_url"],
        dim_organization_cols = ["source_record_id", "name", 
                                 "itp_id", "caltrans_district",
                                  "ntd_id_2022"]
    )

    df_with_org = pd.merge(
        df.rename(columns = {"gtfs_dataset_key": "schedule_gtfs_dataset_key"}),
        crosswalk,
        on = "schedule_gtfs_dataset_key",
        how = "inner"
    )
    
    return df_with_org

In [None]:
# available dates can be found in GCS under `calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk`
gtfs_to_org = create_gtfs_dataset_key_to_organization_crosswalk("2024-08-14")

In [None]:
gtfs_to_org.shape

In [None]:
gtfs_to_org.head()

---

## Reading the saved crosswalk for a known date

If you know which date you want, you can read the crosswalk that was already created and saved to GCS; the path is listed in `gtfs_analytics_data.yml`.

This should also give us the GTFS schedule feed, org name, and NTD ID.

In [None]:
aug_crosswalk = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-08-14.parquet"
)

In [None]:
display(
    aug_crosswalk.shape,
    list(aug_crosswalk.columns.sort_values())

)

In [None]:
aug_crosswalk.head()

In [None]:
aug_crosswalk["name"].value_counts()

---

Compare the dataframes from both methods. They may contain the same values, but the initial DF (`gtfs_to_org`) has the NTD ID.

In [None]:
display(
    gtfs_to_org.columns,
    aug_crosswalk.columns
)

In [None]:
col = [
    "schedule_gtfs_dataset_key",
    "name",
    "schedule_source_record_id",
    "organization_source_record_id",
    #"itp_id"
]

# compare as sets so that row order and duplicates do not affect the result
for i in col:
    print(
        i,
        set(gtfs_to_org[i].dropna()) == set(aug_crosswalk[i].dropna())
    )

# looks good! move forward with gtfs_to_org

---

- Analyze the `gtfs_to_org` df
- What agencies are in it?

In [None]:
list(gtfs_to_org.columns)

In [None]:
gtfs_to_org.describe(include=object)

In [None]:
gtfs_to_org[gtfs_to_org["organization_name"].str.contains("Roseville")]

In [None]:
# see which schedule feeds these operators are in
city = [
    "Roseville",
    "Auburn",
    "Placer",
    "El Dorado",
    "Shingle Springs",
    "Placerville",
    "Cameron Park",
    "Pollock Pines",
    "Colfax",
    "Lincoln",
    "Rocklin",
    "alpine"
]

city_2 = "|".join(city)

display(
    gtfs_to_org[gtfs_to_org["organization_name"].str.contains(city_2, case=False)].sort_values(by="organization_name"),

    aug_crosswalk[
        aug_crosswalk["counties_served"].str.contains(city_2, case=False, na=False)].sort_values(by="counties_served"),
    
    aug_crosswalk[
        aug_crosswalk["organization_name"].str.contains("auburn", case=False, na=False)].sort_values(by="organization_name"),
    aug_crosswalk[
        aug_crosswalk["counties_served"].str.contains("none", case=False, na=False)].sort_values(by="counties_served")
)

In [None]:
# see if there are any other operators in these schedule feeds
# looks like this is the only one
sched_list = [
    "Roseville Transit GMV Schedule",
    "Roseville Schedule",
    "Placer Schedule",
    "El Dorado Schedule",
    "Auburn Schedule",
]
sched_list_2 = "|".join(sched_list)

gtfs_to_org[gtfs_to_org["name"].str.contains(sched_list_2, case=False)].sort_values(by="name")

In [None]:
# sanity check
# example of a schedule feed key with multiple operators
display(
    gtfs_to_org[gtfs_to_org["organization_name"].str.contains("Ojai")],
    gtfs_to_org[gtfs_to_org["name"].str.contains("VCTC GMV Schedule")]
)
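
The sanity check above looks up one known multi-operator feed by name. A more general sketch, assuming only the `name` and `organization_name` columns seen in `gtfs_to_org`, counts distinct organizations per schedule feed. The rows below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for gtfs_to_org (hypothetical feeds and organizations).
toy = pd.DataFrame({
    "name": ["Feed 1 Schedule", "Feed 1 Schedule", "Feed 2 Schedule"],
    "organization_name": ["Org A", "Org B", "Org C"],
})

# Number of distinct organizations attached to each schedule feed.
orgs_per_feed = toy.groupby("name")["organization_name"].nunique()

# Feeds shared by more than one organization.
shared_feeds = orgs_per_feed[orgs_per_feed > 1].index.tolist()
```

Running the same two lines on `gtfs_to_org` would list every shared feed at once, without needing to guess feed names in advance.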

## Read in ntd_id_rtpa_crosswalk

In [None]:
# rename the NTD ID col to match the dim table
crosswalk = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv",
    dtype={"NTD ID": "str"},
).rename(columns={"NTD ID": "ntd_id"})

In [None]:
display(
    crosswalk["RTPA"].nunique(),
    crosswalk["RTPA_open_data"].nunique()
)

In [None]:
## do the operators found in the previous step exist in the crosswalk?

In [None]:
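# case-insensitive match ("alpine" is lowercase in the list); na=False skips missing agency names
crosswalk[crosswalk["Agency"].str.contains(city_2, case=False, na=False)]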

--- 

## Read in ntd ridership data

In [7]:
db_engine = get_engine()

with db_engine.connect() as connection:
    query = """
    SELECT * 
    FROM `mart_ntd.dim_monthly_ridership_with_adjustments` AS `mart_ntd.dim_monthly_ridership_with_adjustments_1`
    WHERE regexp_contains(`mart_ntd.dim_monthly_ridership_with_adjustments_1`.`period_year_month`, '2024-')
    """
    full_upt = pd.read_sql(query,connection)
    

In [8]:
# `full_upt_2` is never defined in this notebook; rename the query result directly
full_upt = full_upt.rename(columns={"mode_type_of_service_status": "Status"})

In [None]:
full_upt.columns

In [None]:
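# na=False keeps rows with a missing UZA name or period from breaking the boolean mask
ca = full_upt[
    full_upt["primary_uza_name"].str.contains(", CA", na=False)
    & full_upt["period_year_month"].str.contains("2024-", na=False)
]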

## Do any of the operators found in the previous step exist in the NTD data?

In [None]:
ca[ca["agency"].str.contains(city_2, case=False, na=False)]["agency"].value_counts()

In [None]:
## do all the operators in the crosswalk exist in the ntd data?

crosswalk_agency = crosswalk["Agency"].unique().tolist()
crosswalk_agency_2 = "|".join(crosswalk_agency)

display(
    len(ca[ca["agency"].str.contains(crosswalk_agency_2)]["agency"].unique()),
    len(crosswalk_agency)
)

#yes they do 
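
Joining every agency name into one regex matches substrings and treats characters such as parentheses as regex syntax, so the two counts above can agree or disagree for the wrong reasons. A stricter sketch compares `ntd_id` values as sets. The toy frames below are made up; only the column name `ntd_id` is taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the crosswalk and the NTD ridership table (hypothetical IDs).
crosswalk_toy = pd.DataFrame({"ntd_id": ["00001", "00002", "00003"]})
ntd_toy = pd.DataFrame({"ntd_id": ["00001", "00002", "00004"]})

crosswalk_ids = set(crosswalk_toy["ntd_id"].dropna())
ntd_ids = set(ntd_toy["ntd_id"].dropna())

# IDs that are in the crosswalk but never appear in the NTD data.
missing_from_ntd = sorted(crosswalk_ids - ntd_ids)
all_present = not missing_from_ntd
```

An empty `missing_from_ntd` means every crosswalk entry has ridership rows; a non-empty list names exactly which IDs to investigate.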

---

# Conclusion

**Update the following in the `ntd_id_rtpa` crosswalk**

County of Placer
- RTPA: Placer County Transportation Planning Agency
    
City of Roseville
- RTPA: Placer County Transportation Planning Agency

El Dorado County Transit Authority
- RTPA: El Dorado County Transportation Commission

**Add to the ntd_id_rtpa crosswalk**

City of Auburn
- RTPA: Placer County Transportation Planning Agency
- NTD ID: 91032
- UZA Name: Sacramento, CA

In [None]:
display(
    crosswalk[crosswalk["Agency"].str.contains("Roseville")],
    crosswalk[crosswalk["Agency"].str.contains("El Dorado")]
)

In [None]:
crosswalk.loc[crosswalk["Agency"].str.contains("Placer"), "RTPA"] = "Placer County Transportation Planning Agency"

In [None]:
crosswalk.loc[crosswalk["Agency"].str.contains("Roseville"), "RTPA"] = "Placer County Transportation Planning Agency"
crosswalk.loc[crosswalk["Agency"].str.contains("El Dorado"), "RTPA"] = "El Dorado County Transportation Commission"

In [None]:
display(
    crosswalk[crosswalk["Agency"].str.contains("Placer")],
    crosswalk[crosswalk["Agency"].str.contains("Roseville")],
    crosswalk[crosswalk["Agency"].str.contains("El Dorado")]
)

In [None]:
columns = crosswalk.columns

# values are positional and must follow the crosswalk's column order
auburn = pd.DataFrame(
    [["91032", "91032", "City of Auburn", "Sacramento, CA",
      "Placer County Transportation Planning Agency",
      "Placer County Transportation Planning Agency"]],
    columns=columns,
)

crosswalk_2 = pd.concat([crosswalk, auburn], ignore_index=True)

display(
    crosswalk.shape,
    crosswalk_2.shape
)

In [None]:
crosswalk_2["ntd_id"] = crosswalk_2["ntd_id"].astype("str")

In [None]:
crosswalk_2.info()

In [None]:
# save updated crosswalk back to GCS
#crosswalk_2.to_csv(f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv", index=False)


In [None]:
#crosswalk_2.to_parquet(f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.parquet")

---

In [20]:
with db_engine.connect() as connection:
    query = """
    SELECT * 
    FROM `mart_ntd.dim_monthly_ridership_with_adjustments` AS `mart_ntd.dim_monthly_ridership_with_adjustments_1`
    """
    full_upt = pd.read_sql(query,connection).rename(columns = {"mode_type_of_service_status": "Status"})

In [None]:
full_upt = full_upt[full_upt.agency.notna()].reset_index(drop=True)
    
#full_upt.to_parquet(
#        f"{GCS_FILE_PATH}ntd_monthly_ridership_{year}_{month}.parquet"
#    )
    
ca = full_upt[(full_upt["uza_name"].str.contains(", CA")) & 
            (full_upt.agency.notna())].reset_index(drop=True)
    
crosswalk = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv",
    dtype={"ntd_id": "str"},
)
    
df = pd.merge(
        ca,
        # Merging on too many columns can create problems 
        # because csvs and dtypes aren't stable / consistent 
        # for NTD ID, Legacy NTD ID, and UZA
        crosswalk[["ntd_id", "RTPA"]],
        on = "ntd_id",
        how = "left",
        indicator = True
    )
    
print(df._merge.value_counts())

In [None]:
# na=False: rows without a crosswalk match have a missing RTPA after the left merge
df[df["RTPA"].str.contains("El Dor", na=False)]
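
`_merge.value_counts()` only reports how many rows lack a crosswalk match. To see which agencies they are, filter on the indicator column. This is a minimal sketch on made-up rows, using the same `ntd_id`, `agency`, and `RTPA` column names as the merge above.

```python
import pandas as pd

# Toy stand-ins for `ca` and the crosswalk (hypothetical agencies and IDs).
ca_toy = pd.DataFrame({
    "ntd_id": ["00001", "00001", "00002"],
    "agency": ["Agency A", "Agency A", "Agency B"],
})
crosswalk_toy = pd.DataFrame({"ntd_id": ["00001"], "RTPA": ["RTPA X"]})

merged = ca_toy.merge(crosswalk_toy, on="ntd_id", how="left", indicator=True)

# One row per agency that has ridership data but no RTPA assignment.
no_rtpa = (
    merged.loc[merged["_merge"] == "left_only", ["ntd_id", "agency"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Any agency that shows up in `no_rtpa` is a candidate for a new crosswalk row, in the same way City of Auburn was added above.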