# [Issue #1897: Additional Visuals for PUC Analysis](https://github.com/cal-itp/data-analyses/issues/1897)

Received a list of transit operator cohorts that may be exempt from efficiency reporting, per PUC 99314.11, 99314.6 and 99314.7.
- create visuals based on grouping set by list
- recreate visuals based on previous notebook work

## [99314.6](https://leginfo.legislature.ca.gov/faces/codes_displaySection.xhtml?sectionNum=99314.6.&lawCode=PUC)
>`funds shall be allocated for operating or capital purposes` pursuant to Sections 99313 and 99314 to an operator `if the operator meets either of the following efficiency standards`:
>- (A) `The operator shall receive its entire allocation`, and any or all of this allocation may be used for operating purposes, if the operator’s `total operating cost per revenue vehicle hour` in the latest year for which audited data are available `does not exceed the sum of the preceding year’s total operating cost per revenue vehicle hour and an amount equal to the product of the percentage change in the Consumer Price Index for the same period multiplied by the preceding year’s total operating cost per revenue vehicle hour.`
>- (B) The operator shall receive its entire allocation, and any or all of this allocation may be used for operating purposes, `if the operator’s average total operating cost per revenue vehicle hour` in the latest three years for which audited data are available `does not exceed the sum of the average of the total operating cost per revenue vehicle hour in the three years preceding the latest year for which audited data are available and an amount equal to the product of the average percentage change in the Consumer Price Index for the same period multiplied by the average total operating cost per revenue vehicle hour in the same three years`.
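
Both standards reduce to one inequality: the latest cost per revenue vehicle hour (or its three-year average) may not exceed the comparison value grown by the CPI change. The sketch below is only an illustration of that arithmetic. The function names, argument names, and numbers are hypothetical, and the CPI change is assumed to be passed as a fraction (0.03 for 3%).

```python
def meets_standard_a(cost_per_vrh_latest: float,
                     cost_per_vrh_prior: float,
                     cpi_pct_change: float) -> bool:
    """Standard (A): latest cost/VRH <= prior-year cost/VRH plus CPI growth."""
    ceiling = cost_per_vrh_prior + cpi_pct_change * cost_per_vrh_prior
    return cost_per_vrh_latest <= ceiling


def meets_standard_b(latest_three: list[float],
                     preceding_three: list[float],
                     avg_cpi_pct_change: float) -> bool:
    """Standard (B): same test, applied to three-year averages."""
    avg_latest = sum(latest_three) / len(latest_three)
    avg_preceding = sum(preceding_three) / len(preceding_three)
    ceiling = avg_preceding + avg_cpi_pct_change * avg_preceding
    return avg_latest <= ceiling


# Made-up numbers: prior-year cost of $100/VRH and a 3% CPI change give a $103 ceiling.
result_a_pass = meets_standard_a(102.0, 100.0, 0.03)
result_a_fail = meets_standard_a(104.0, 100.0, 0.03)
result_b_pass = meets_standard_b([100.0, 102.0, 104.0], [98.0, 100.0, 102.0], 0.03)
```
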
## [99314.7 (mainly MTC specific)](https://leginfo.legislature.ca.gov/faces/codes_displaySection.xhtml?lawCode=PUC&sectionNum=99314.7.)
>the `Metropolitan Transportation Commission` shall apply the following eligibility standards to the operators within the region subject to its jurisdiction:

## [99314.11](https://leginfo.legislature.ca.gov/faces/codes_displaySection.xhtml?sectionNum=99314.11.&nodeTreePath=17.11.2.8&lawCode=PUC)
>`Sections 99314.6 and 99314.7 do not apply to an operator for a fiscal year in which the operator expended from local funding an amount for transit operations not less than the amount the operator expended from local funding for transit operations during the 2018–19 fiscal year.` As used in this subdivision, “local funding” means any nonstate grant funds or other revenues generated by, earned by, or distributed to, an operator.

Meaning: if a transit operator spent at least as much local funding in a fiscal year as it did in FY 2018-19, it is exempt from meeting the efficiency standards for that year(?)
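
A minimal sketch of that reading, on made-up numbers. The table layout and the column names `fiscal_year`, `local_funds`, `baseline_2019` and `exempt` are assumptions for illustration, as is treating `fiscal_year == 2019` as FY 2018-19.

```python
import pandas as pd

# Made-up local funding totals (dollars) for two hypothetical operators.
local_funding = pd.DataFrame({
    "ntd_id": ["A", "A", "A", "B", "B", "B"],
    "fiscal_year": [2019, 2023, 2024, 2019, 2023, 2024],
    "local_funds": [100.0, 90.0, 120.0, 50.0, 50.0, 40.0],
})

# Baseline: each operator's local funding in FY 2018-19 (assumed to be fiscal_year 2019).
baseline = (
    local_funding[local_funding["fiscal_year"] == 2019]
    .set_index("ntd_id")["local_funds"]
)
local_funding["baseline_2019"] = local_funding["ntd_id"].map(baseline)

# Exempt when the year's local funding is not less than the baseline.
local_funding["exempt"] = local_funding["local_funds"] >= local_funding["baseline_2019"]
```

Under this reading the baseline year is always exempt, since it is compared with itself.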

## Data Exploration

### Categorical variables
- Underlying metric
  - Farebox Recovery Ratio
  - Local funding expended
- area type
  - urban
  - rural
- cohorts
  - A
  - B
  - C
- NTD metric
  - UPT
  - PMT
  - VRH
- year
  - 2019
  - 2020
  - 2021
  - 2022
  - 2023
  - 2024

## Analyses should be split by underlying metric
Resulting groups are:
1. Farebox Recovery Ratio
    - urban
        - cohorts
        - NTD metric
        - year
    - rural
        - cohorts
        - NTD metric
        - year
2. Local funding expended
    - urban
        - cohorts
        - NTD metric
        - year
    - rural
        - cohorts
        - NTD metric
        - year
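
One way to get the four top-level groups (underlying metric by area type) is a single `groupby`. This is a sketch on a toy frame; it assumes the `metric` and `area_type` column names used in the yes/no data, and the `ntd_id` values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "metric": [
        "Farebox Recovery Ratio", "Farebox Recovery Ratio",
        "Local Funding % Change vs 2019", "Local Funding % Change vs 2019",
    ],
    "area_type": ["Urban", "Rural", "Urban", "Rural"],
    "ntd_id": ["1", "2", "1", "2"],  # invented IDs
})

# One sub-frame per (metric, area_type) pair; the keys are tuples.
groups = {key: frame for key, frame in toy.groupby(["metric", "area_type"])}
```
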




In [1]:
import pandas as pd
import altair as alt
from functools import cache
from calitp_data_analysis.gcs_pandas import GCSPandas
from calitp_data_analysis.sql import get_engine, to_snakecase, query_sql

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:,.2f}'.format

@cache
def gcs_pandas():
    return GCSPandas()

gcs_path = "gs://calitp-analytics-data/data-analyses/ntd/"

# Read in cohort list data

In [2]:
# cohort_data = gcs_pandas().read_csv(f"{gcs_path}fbr_local_funding_by_cohorts_2019-2024_compiled.csv")

# cohort_data.columns = cohort_data.columns.str.lower()
# cohort_data["ntd_id"] = cohort_data["ntd_id"].astype("str")

# display(
#     cohort_data.info(),
#     cohort_data.head(),
#     cohort_data.value_counts(
#     subset=["urban_rural","metric","cohort","year"]
#     )
# )

# Read in yes/no list data

In [75]:
yes_no_data = gcs_pandas().read_csv(f"{gcs_path}cs_sco_yes_no_fbr_funding_2019-2024.csv")

yes_no_data.columns = yes_no_data.columns.str.lower()
yes_no_data[["year","ntd_id"]] = yes_no_data[["year","ntd_id"]].astype("str")
yes_no_data = yes_no_data.rename(columns={"requirement_flag":"requirement_met_flag"})

yes_no_data["requirement_met_flag"] = yes_no_data["requirement_met_flag"].str.lower().map({
    "yes":True,
    "no":False
})
display(
    yes_no_data.info(),
    yes_no_data.head(3)
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1797 entries, 0 to 1796
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  1797 non-null   object
 1   ntd_id                1797 non-null   object
 2   ntd_entity_name       1797 non-null   object
 3   area_type             1797 non-null   object
 4   metric                1797 non-null   object
 5   quartile              1797 non-null   object
 6   metric_short          1797 non-null   object
 7   metric_value          1797 non-null   object
 8   requirement           1797 non-null   object
 9   requirement_met_flag  1797 non-null   bool  
dtypes: bool(1), object(9)
memory usage: 128.2+ KB


None

Unnamed: 0,year,ntd_id,ntd_entity_name,area_type,metric,quartile,metric_short,metric_value,requirement,requirement_met_flag
0,2019,90003,San Francisco Bay Area Rapid Transit District,Urban,Farebox Recovery Ratio,Top 25%,FBR,63.14,Met FBR Min,True
1,2019,90003,San Francisco Bay Area Rapid Transit District,Urban,Local Funding % Change vs 2019,Middle 50%,Pct_Change_vs_2019,0.014763266,Maintained_or_Increased_vs_2019,True
2,2019,90004,Golden Empire Transit District,Urban,Farebox Recovery Ratio,Middle 50%,FBR,20.67,Met FBR Min,True


## Do the yes/no flags actually match the metric value?
- For the local funding metric, does `metric_value` >= 0 always correspond to Yes?
- For FBR, does FBR > 10 (rural) or FBR > 20 (urban) always correspond to Yes?

In [None]:
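# metric_value is stored as text (object dtype), so convert it before comparing
metric_value_num = pd.to_numeric(yes_no_data["metric_value"], errors="coerce")

yes_no_data[
    (yes_no_data["metric_short"] == "FBR")
    & (yes_no_data["area_type"]=="Rural")
    & (metric_value_num >= 10)
]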

In [82]:
yes_no_data.groupby(
    ["area_type",
    "metric_short",
     "requirement_met_flag",
    ]
).agg(
  total = ("metric_value","sum")  
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total
area_type,metric_short,requirement_met_flag,Unnamed: 3_level_1
Rural,FBR,False,6.69.374.488.543.617.527.928.189.442.722.237.1...
Rural,FBR,True,11.7320.9610.913.3721.5912.0710.9110.6819.3510...
Rural,Pct_Change_vs_2019,False,-0.839636463-0.010429901-0.028013381-0.4093289...
Rural,Pct_Change_vs_2019,True,0.0446984890.0224299720.0547503330.0613581260....
Urban,FBR,False,16.9413.4810.647.17.859.8115.5910.879.324.0713...
Urban,FBR,True,63.1420.6720.822.1226.072328.0922.6521.0331.14...
Urban,Pct_Change_vs_2019,False,-0.081994874-0.010603536-0.129803892-0.0006049...
Urban,Pct_Change_vs_2019,True,0.0147632660.0043525760.0312459160.0319358650....
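
The `total` column above is a concatenation of strings, not a sum, because `metric_value` has `object` dtype. A hedged sketch of a numeric check follows, using a toy frame with a few values echoed from the output. The thresholds (FBR >= 10 rural, >= 20 urban, change >= 0) and whether they are inclusive are assumptions taken from the questions above; `metric_value_num`, `expected_flag` and `mismatches` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "area_type": ["Rural", "Rural", "Urban", "Urban", "Urban"],
    "metric_short": ["FBR", "FBR", "FBR", "Pct_Change_vs_2019", "Pct_Change_vs_2019"],
    "metric_value": ["11.73", "6.6", "16.94", "0.014763266", "-0.081994874"],
    "requirement_met_flag": [True, False, False, True, False],
})

# Convert the text values to numbers; anything unparseable becomes NaN.
toy["metric_value_num"] = pd.to_numeric(toy["metric_value"], errors="coerce")

# Assumed thresholds: FBR minimum by area type, and non-negative funding change.
fbr_min = toy["area_type"].map({"Rural": 10, "Urban": 20})
is_fbr = toy["metric_short"] == "FBR"
toy["expected_flag"] = (
    (is_fbr & (toy["metric_value_num"] >= fbr_min))
    | (~is_fbr & (toy["metric_value_num"] >= 0))
)

# Rows where the recomputed flag disagrees with the recorded one.
mismatches = toy[toy["expected_flag"] != toy["requirement_met_flag"]]
```

An empty `mismatches` frame would indicate the recorded flags agree with the assumed thresholds.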


# Read in analysis data from previous notebook

In [4]:

# ntd_name = "puc_analysis_data.parquet"
# ntd_analysis_data = gcs_pandas().read_parquet(f"{gcs_path}{ntd_name}")

# display(
#     ntd_analysis_data.info(),
#     ntd_analysis_data["year"].unique()
# )

# May need to requery this data to include 2024
Is 2024 NTD data in the warehouse now? The query below is copied from the initial PUC analysis notebook.

In [5]:
metric_list = [
    "pmt",
    "upt",
    "vrh",
    # "opexp_total" # not needed for this project
]

# empty list for appending DFs
df_list = []

# loop to query pmt, upt and vrh from 2018 to 2024
for metric in metric_list:
        query = f"""
        SELECT
          ntd_id,
          source_agency,
          agency_status,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles,
          year,
          mode,
          type_of_service,
          reporter_type,
          SUM({metric}) AS total_{metric},
        FROM
          `cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_{metric}`
        WHERE
          source_state = "CA"
          AND year BETWEEN 2018 AND 2024
        GROUP BY
          ntd_id,
          source_agency,
          agency_status,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles,
          year,
          mode,
          type_of_service,
          reporter_type
        """
        # create df (separate name so the loop variable `metric` is not overwritten)
        metric_df = query_sql(query, as_df=True)

        # append df to list
        df_list.append(metric_df)

# unpack list into separate DFs
ntd_pmt, ntd_upt, ntd_vrh = df_list

display( 
    ntd_upt.head(3)
)

Unnamed: 0,ntd_id,source_agency,agency_status,primary_uza_name,uza_population,uza_area_sq_miles,year,mode,type_of_service,reporter_type,total_upt
0,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2018,HR,DO,Full Reporter,127874512.0
1,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2021,YR,DO,Full Reporter,601424.0
2,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2021,HR,DO,Full Reporter,17125273.0


## merge all the metrics together

In [6]:
merge_on_col = [
    "ntd_id",
    "year",
    "source_agency",
    "agency_status",
    "primary_uza_name",
    "uza_population",
    "uza_area_sq_miles",
    "mode",
    "type_of_service",
    "reporter_type",
]

merge_1 = ntd_vrh.merge(ntd_upt, on=merge_on_col, how="inner")
# merge_2 = merge_1.merge(ntd_vrh, on=merge_on_col, how = "inner")

ntd_metrics_merge = merge_1.merge(ntd_pmt, on=merge_on_col, how="inner")

ntd_metrics_merge.head(3)

Unnamed: 0,ntd_id,source_agency,agency_status,primary_uza_name,uza_population,uza_area_sq_miles,year,mode,type_of_service,reporter_type,total_vrh,total_upt,total_pmt
0,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,HR,DO,Full Reporter,2225056.0,125105460.0,1756364558.0
1,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2020,MB,PT,Full Reporter,,,
2,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,MG,PT,Full Reporter,19815.0,886515.0,2819118.0
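
As an alternative sketch of the chained merges, the frames can be folded together with `functools.reduce`, and `validate="one_to_one"` makes pandas raise if a key combination is duplicated on either side. The three toy frames and the shortened key list are assumptions for illustration only.

```python
from functools import reduce

import pandas as pd

keys = ["ntd_id", "year", "mode"]  # shortened key list for the toy example

vrh = pd.DataFrame({"ntd_id": ["1", "1"], "year": [2019, 2020],
                    "mode": ["MB", "MB"], "total_vrh": [10.0, 12.0]})
upt = pd.DataFrame({"ntd_id": ["1", "1"], "year": [2019, 2020],
                    "mode": ["MB", "MB"], "total_upt": [100.0, 80.0]})
pmt = pd.DataFrame({"ntd_id": ["1", "1"], "year": [2019, 2020],
                    "mode": ["MB", "MB"], "total_pmt": [500.0, 400.0]})

# Merge the frames pairwise; validate raises MergeError on duplicated keys.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="inner", validate="one_to_one"),
    [vrh, upt, pmt],
)
```
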


## get districts for NTD ID
- Do I still need district data for this specific analysis?

In [7]:
# This query does not depend on the NTD metric, so run it once instead of once per metric
query = """
        SELECT
          `mart_transit_database.dim_organizations`.`key` AS `key`,
          `mart_transit_database.dim_organizations`.`source_record_id` AS `source_record_id`,
          `mart_transit_database.dim_organizations`.`name` AS `name`,
          `mart_transit_database.dim_organizations`.`ntd_id_2022` AS `ntd_id_2022`,
          `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`county_geography_name` AS `county`,
          `Dim_County_Geography___County_Geography_Key`.`caltrans_district` AS `caltrans_district`
        FROM
          `mart_transit_database.dim_organizations`

        LEFT JOIN `mart_transit_database.bridge_organizations_x_headquarters_county_geography` AS `Bridge_Organizations_X_Headquarters_County_Geography___Key` ON `mart_transit_database.dim_organizations`.`key` = `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`organization_key`
          LEFT JOIN `mart_transit_database.dim_county_geography` AS `Dim_County_Geography___County_Geography_Key` ON `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`county_geography_key` = `Dim_County_Geography___County_Geography_Key`.`key`
        WHERE
          (
            `mart_transit_database.dim_organizations`.`_is_current` = TRUE
          )

           AND (
            `mart_transit_database.dim_organizations`.`ntd_id_2022` IS NOT NULL
          )
          AND (
            (
              `mart_transit_database.dim_organizations`.`ntd_id_2022` <> ''
            )

            OR (
              `mart_transit_database.dim_organizations`.`ntd_id_2022` IS NULL
            )
          )
          AND (
            `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`_is_current` = TRUE
          )
          AND (
            `Dim_County_Geography___County_Geography_Key`.`_is_current` = TRUE
          )
        """
# create df
ntd_id_x_district = query_sql(query, as_df=True)
        
ntd_id_x_district["caltrans_district"] = ntd_id_x_district["caltrans_district"].astype("str")

ntd_id_x_district.head()

Unnamed: 0,key,source_record_id,name,ntd_id_2022,county,caltrans_district
0,d84a961daa618c733f9d9c3bd49c322f,recJtH0Ae8YNo01aj,Access Services,90157,Los Angeles,7
1,9b5971d16d58e4fcafa694ee7fa33b12,rec79AM4tMwdokWhE,Alpine County,91116,Alpine,10
2,e5de5083d68e8c2463a784ceb13e91f2,recUTSH4TT1wB3RSC,Attentive Transportation LLC,90314,Sacramento,3
3,957618c89db2f5e992caa5ca2e6086ab,rec7F6JKLMVrRhQJU,Bishop Paiute Tribe,99268,Inyo,9
4,a024fabd0002f9c9bd636042de30715d,recE6qJFuoREa9EHg,Calaveras County,91063,Calaveras,10


## merge the ntd metrics with Caltrans Districts

In [8]:
ntd_metrics_merge = ntd_metrics_merge.merge(
    ntd_id_x_district[["ntd_id_2022","county","caltrans_district"]],
    left_on = "ntd_id",
    right_on = "ntd_id_2022",
    how="inner",
    indicator=True
)
ntd_metrics_merge["year"] = ntd_metrics_merge["year"].astype("str")
ntd_metrics_merge.head()

Unnamed: 0,ntd_id,source_agency,agency_status,primary_uza_name,uza_population,uza_area_sq_miles,year,mode,type_of_service,reporter_type,total_vrh,total_upt,total_pmt,ntd_id_2022,county,caltrans_district,_merge
0,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,HR,DO,Full Reporter,2225056.0,125105460.0,1756364558.0,90003,San Francisco,4,both
1,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2020,MB,PT,Full Reporter,,,,90003,San Francisco,4,both
2,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,MG,PT,Full Reporter,19815.0,886515.0,2819118.0,90003,San Francisco,4,both
3,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2018,HR,DO,Full Reporter,2189422.0,127874512.0,1784699309.0,90003,San Francisco,4,both
4,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,DR,PT,Full Reporter,,,,90003,San Francisco,4,both


# merge ntd metrics with ~~cohort data~~ yes/no data
- merge on ntd_id
- are there any unmerged rows?

In [9]:
ntd_metrics_merge.dtypes

ntd_id                 object
source_agency          object
agency_status          object
primary_uza_name       object
uza_population          int64
uza_area_sq_miles     float64
year                   object
mode                   object
type_of_service        object
reporter_type          object
total_vrh             float64
total_upt             float64
total_pmt             float64
ntd_id_2022            object
county                 object
caltrans_district      object
_merge               category
dtype: object

In [10]:
ntd_yes_no_merge = ntd_metrics_merge.drop(columns="_merge").merge(
    yes_no_data,
    left_on = ["ntd_id","year"],
    right_on = ["ntd_id","year"],
    indicator= True,
)

# any unmerged rows? None shown, but the default inner join drops unmatched rows, so this count cannot reveal them
ntd_yes_no_merge["_merge"].value_counts()

_merge
both          4965
left_only        0
right_only       0
Name: count, dtype: int64
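
Because `merge` defaults to `how="inner"`, the `_merge` indicator above can only ever report `both`. A sketch of a check that would surface unmatched rows uses `how="outer"`. The two toy frames are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"ntd_id": ["1", "2"], "year": ["2023", "2023"],
                     "total_upt": [5.0, 7.0]})
right = pd.DataFrame({"ntd_id": ["2", "3"], "year": ["2023", "2023"],
                      "requirement_met_flag": [True, False]})

# Inner join: unmatched rows are dropped before the indicator is counted.
inner = left.merge(right, on=["ntd_id", "year"], how="inner", indicator=True)
inner_counts = inner["_merge"].value_counts()

# Outer join: unmatched rows are kept and labelled left_only / right_only.
outer = left.merge(right, on=["ntd_id", "year"], how="outer", indicator=True)
outer_counts = outer["_merge"].value_counts()
```

Applied to the real frames, non-zero `left_only` or `right_only` counts would list NTD ID and year pairs that appear in only one of the two tables.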

In [11]:
# Sanity check
# pick a few NTD IDs and check that the merged data tracks with the cohort (yes/no) data
sample_ids = ntd_yes_no_merge["ntd_id"].sample(3).to_list()
keep_cols=[
    "ntd_id",
    "source_agency",
    "mode",
    "type_of_service",
    "total_vrh",
    "total_pmt",
    "total_upt",
    "area_type",
    "metric",
    "year",
    "requirement",
    "requirement_met_flag"
]

for sample_id in sample_ids:
    display(
        f"Sample NTD ID: {sample_id}",
        "cohort data",
        yes_no_data[
            (yes_no_data["ntd_id"]== sample_id)
            & (yes_no_data["year"].isin(["2023","2024"]))
            ].sort_values(by=["area_type","metric","year"]).head(5),
        "merge table",
        ntd_yes_no_merge[
            (ntd_yes_no_merge["ntd_id"]== sample_id)
            & (ntd_yes_no_merge["year"].isin(["2023","2024"]))
            ][keep_cols].sort_values(by=["area_type","metric","year"]),
        
    )

# The cohort data matches.
# The merge table looks a little odd because the NTD metrics are per mode and type of service (TOS),
# so the cohort columns repeat on every mode/TOS row like categorical labels. Good to go.

'Sample NTD ID: 90147'

'cohort data'

Unnamed: 0,year,ntd_id,ntd_entity_name,area_type,metric,quartile,metric_short,metric_value,requirement,requirement_met_flag
1297,2023,90147,City of Los Angeles,Urban,Farebox Recovery Ratio,Bottom 25%,FBR,1.17,Met FBR Min,NO
1596,2024,90147,City of Los Angeles,Urban,Farebox Recovery Ratio,Bottom 25%,FBR,0.99,Met FBR Min,NO
1298,2023,90147,City of Los Angeles,Urban,Local Funding % Change vs 2019,Top 25%,Pct_Change_vs_2019,0.137190175,Maintained_or_Increased_vs_2019,Yes
1597,2024,90147,City of Los Angeles,Urban,Local Funding % Change vs 2019,Top 25%,Pct_Change_vs_2019,0.324594875,Maintained_or_Increased_vs_2019,Yes


'merge table'

Unnamed: 0,ntd_id,source_agency,mode,type_of_service,total_vrh,total_pmt,total_upt,area_type,metric,year,requirement,requirement_met_flag
3568,90147,City of Los Angeles (LADOT) - City of Los Ange...,CB,PT,94164.0,14334330.0,873176.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
3570,90147,City of Los Angeles (LADOT) - City of Los Ange...,MB,PT,622327.0,24548159.0,14344180.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
3574,90147,City of Los Angeles (LADOT) - City of Los Ange...,DR,TX,13634.0,214995.0,84498.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
3582,90147,City of Los Angeles (LADOT) - City of Los Ange...,DR,PT,106251.0,614408.0,180808.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
3566,90147,City of Los Angeles (LADOT) - City of Los Ange...,CB,PT,95020.0,14995295.0,992531.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
3572,90147,City of Los Angeles (LADOT) - City of Los Ange...,DR,PT,108907.0,671698.0,185863.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
3586,90147,City of Los Angeles (LADOT) - City of Los Ange...,MB,PT,684758.0,22812382.0,14512714.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
3588,90147,City of Los Angeles (LADOT) - City of Los Ange...,DR,TX,13974.0,247255.0,94122.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
3569,90147,City of Los Angeles (LADOT) - City of Los Ange...,CB,PT,94164.0,14334330.0,873176.0,Urban,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,Yes
3571,90147,City of Los Angeles (LADOT) - City of Los Ange...,MB,PT,622327.0,24548159.0,14344180.0,Urban,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,Yes


'Sample NTD ID: 91097'

'cohort data'

Unnamed: 0,year,ntd_id,ntd_entity_name,area_type,metric,quartile,metric_short,metric_value,requirement,requirement_met_flag
1472,2023,91097,Redwood Coast Transit Authority,Rural,Farebox Recovery Ratio,Middle 50%,FBR,4.39,Met FBR Min,NO
1769,2024,91097,Redwood Coast Transit Authority,Rural,Farebox Recovery Ratio,Bottom 25%,FBR,3.81,Met FBR Min,NO
1473,2023,91097,Redwood Coast Transit Authority,Rural,Local Funding % Change vs 2019,Middle 50%,Pct_Change_vs_2019,-0.456445676,Maintained_or_Increased_vs_2019,No
1770,2024,91097,Redwood Coast Transit Authority,Rural,Local Funding % Change vs 2019,Top 25%,Pct_Change_vs_2019,3.493387262,Maintained_or_Increased_vs_2019,Yes


'merge table'

Unnamed: 0,ntd_id,source_agency,mode,type_of_service,total_vrh,total_pmt,total_upt,area_type,metric,year,requirement,requirement_met_flag
4783,91097,Redwood Coast Transit Authority (RCTA),DR,PT,1771.0,,4260.0,Rural,Farebox Recovery Ratio,2023,Met FBR Min,NO
4793,91097,Redwood Coast Transit Authority (RCTA),MB,PT,14154.0,,68091.0,Rural,Farebox Recovery Ratio,2023,Met FBR Min,NO
4785,91097,Redwood Coast Transit Authority (RCTA),DR,PT,2044.0,,4428.0,Rural,Farebox Recovery Ratio,2024,Met FBR Min,NO
4791,91097,Redwood Coast Transit Authority (RCTA),MB,PT,7093.0,,60253.0,Rural,Farebox Recovery Ratio,2024,Met FBR Min,NO
4784,91097,Redwood Coast Transit Authority (RCTA),DR,PT,1771.0,,4260.0,Rural,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,No
4794,91097,Redwood Coast Transit Authority (RCTA),MB,PT,14154.0,,68091.0,Rural,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,No
4786,91097,Redwood Coast Transit Authority (RCTA),DR,PT,2044.0,,4428.0,Rural,Local Funding % Change vs 2019,2024,Maintained_or_Increased_vs_2019,Yes
4792,91097,Redwood Coast Transit Authority (RCTA),MB,PT,7093.0,,60253.0,Rural,Local Funding % Change vs 2019,2024,Maintained_or_Increased_vs_2019,Yes


'Sample NTD ID: 90022'

'cohort data'

Unnamed: 0,year,ntd_id,ntd_entity_name,area_type,metric,quartile,metric_short,metric_value,requirement,requirement_met_flag
1230,2023,90022,City of Norwalk,Urban,Farebox Recovery Ratio,Middle 50%,FBR,5.02,Met FBR Min,NO
1529,2024,90022,City of Norwalk,Urban,Farebox Recovery Ratio,Middle 50%,FBR,6.19,Met FBR Min,NO
1231,2023,90022,City of Norwalk,Urban,Local Funding % Change vs 2019,Middle 50%,Pct_Change_vs_2019,-0.434509989,Maintained_or_Increased_vs_2019,No
1530,2024,90022,City of Norwalk,Urban,Local Funding % Change vs 2019,Middle 50%,Pct_Change_vs_2019,-0.294531834,Maintained_or_Increased_vs_2019,No


'merge table'

Unnamed: 0,ntd_id,source_agency,mode,type_of_service,total_vrh,total_pmt,total_upt,area_type,metric,year,requirement,requirement_met_flag
2937,90022,City of Norwalk (NTS) - Department of Transpor...,DR,TX,,,,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
2939,90022,City of Norwalk (NTS) - Department of Transpor...,DR,PT,8538.0,30386.0,17080.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
2947,90022,City of Norwalk (NTS) - Department of Transpor...,MB,DO,83689.0,3545652.0,1022686.0,Urban,Farebox Recovery Ratio,2023,Met FBR Min,NO
2941,90022,City of Norwalk (NTS) - Department of Transpor...,DR,TX,,,,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
2943,90022,City of Norwalk (NTS) - Department of Transpor...,DR,PT,10581.0,38007.0,21220.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
2949,90022,City of Norwalk (NTS) - Department of Transpor...,MB,DO,82796.0,3954712.0,1140644.0,Urban,Farebox Recovery Ratio,2024,Met FBR Min,NO
2938,90022,City of Norwalk (NTS) - Department of Transpor...,DR,TX,,,,Urban,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,No
2940,90022,City of Norwalk (NTS) - Department of Transpor...,DR,PT,8538.0,30386.0,17080.0,Urban,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,No
2948,90022,City of Norwalk (NTS) - Department of Transpor...,MB,DO,83689.0,3545652.0,1022686.0,Urban,Local Funding % Change vs 2019,2023,Maintained_or_Increased_vs_2019,No
2942,90022,City of Norwalk (NTS) - Department of Transpor...,DR,TX,,,,Urban,Local Funding % Change vs 2019,2024,Maintained_or_Increased_vs_2019,No


# Save merged cohort data

In [12]:
# cort_merge_filname = "ntd_cohort_data_2026-01-26.parquet"
# gcs_pandas().data_frame_to_parquet(ntd_cohort_merge,f"{gcs_path}{cort_merge_filname}")

# Save merged yes/no data

In [13]:
yes_no_merge_filname = "ntd_yes/no_data_2026-01-28.parquet"
# gcs_pandas().data_frame_to_parquet(ntd_yes_no_merge,f"{gcs_path}{yes_no_merge_filname}")

# Read in merged ~~cohort~~ yes/no data from GCS

In [30]:
# ntd_cohort_merge = gcs_pandas().read_parquet(f"{gcs_path}{cort_merge_filname}")

ntd_yes_no_merge = gcs_pandas().read_parquet(f"{gcs_path}{yes_no_merge_filname}")

ntd_yes_no_merge = ntd_yes_no_merge.rename(columns={
    "requirement_flag":"requirement_met_flag"
})

ntd_yes_no_merge["requirement_met_flag"] = ntd_yes_no_merge["requirement_met_flag"].str.lower().map(
    {"yes":True,
     "no":False
    }
)


# separate list by both metrics (farebox and funding change)

In [31]:
merge_farebox = ntd_yes_no_merge[ntd_yes_no_merge["metric"]=="Farebox Recovery Ratio"]
merge_funding = ntd_yes_no_merge[ntd_yes_no_merge["metric"]=="Local Funding % Change vs 2019"]

In [34]:
display(
    merge_farebox.shape,
    merge_funding.shape,
    merge_farebox["metric"].unique(),
    merge_funding["metric"].unique(),
    merge_funding.dtypes
)

(2460, 25)

(2505, 25)

array(['Farebox Recovery Ratio'], dtype=object)

array(['Local Funding % Change vs 2019'], dtype=object)

ntd_id                    object
source_agency             object
agency_status             object
primary_uza_name          object
uza_population             int64
uza_area_sq_miles        float64
year                      object
mode                      object
type_of_service           object
reporter_type             object
total_vrh                float64
total_upt                float64
total_pmt                float64
ntd_id_2022               object
county                    object
caltrans_district         object
ntd_entity_name           object
area_type                 object
metric                    object
quartile                  object
metric_short              object
metric_value              object
requirement               object
requirement_met_flag        bool
_merge                  category
dtype: object

In [33]:
if merge_farebox.columns.equals(merge_funding.columns):
        display(merge_farebox.head(3)) 

Unnamed: 0,ntd_id,source_agency,agency_status,primary_uza_name,uza_population,uza_area_sq_miles,year,mode,type_of_service,reporter_type,total_vrh,total_upt,total_pmt,ntd_id_2022,county,caltrans_district,ntd_entity_name,area_type,metric,quartile,metric_short,metric_value,requirement,requirement_met_flag,_merge
0,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,MG,PT,Full Reporter,19815.0,886515.0,2819118.0,90003,San Francisco,4,San Francisco Bay Area Rapid Transit District,Urban,Farebox Recovery Ratio,Top 25%,FBR,63.14,Met FBR Min,True,both
2,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2019,MB,PT,Full Reporter,,,,90003,San Francisco,4,San Francisco Bay Area Rapid Transit District,Urban,Farebox Recovery Ratio,Top 25%,FBR,63.14,Met FBR Min,True,both
4,90003,San Francisco Bay Area Rapid Transit District ...,Active,"San Francisco--Oakland, CA",3515933,513.8,2021,MG,PT,Full Reporter,17819.0,112981.0,359280.0,90003,San Francisco,4,San Francisco Bay Area Rapid Transit District,Urban,Farebox Recovery Ratio,Middle 50%,FBR,8.5,Met FBR Min,False,both


# Group aggregation

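## Melt the big DF
- so the three NTD metric columns (`total_upt`, `total_vrh`, `total_pmt`) are stacked into one `ntd_metric` / `ntd_metric_value` pair of columns.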

In [35]:
group_list_melt = [
    "source_agency",
    "year",
    "ntd_id",
    "caltrans_district",
    "mode",
    "type_of_service",
    "area_type",
    "reporter_type",
    "quartile",
    "metric",
    "metric_value",
    "requirement",
    "requirement_met_flag"
]

value_cols = ["total_upt", "total_vrh", "total_pmt"]

melt_farebox = pd.melt(
    merge_farebox,
    id_vars=group_list_melt,
    value_vars=value_cols,
    var_name="ntd_metric",
    value_name="ntd_metric_value",
    ignore_index=True,
)

melt_funding = pd.melt(
    merge_funding,
    id_vars=group_list_melt,
    value_vars=value_cols,
    var_name="ntd_metric",
    value_name="ntd_metric_value",
    ignore_index=True,
)

In [36]:
display(
    melt_farebox.shape,
    melt_funding.shape
)

(7380, 15)

(7515, 15)
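
These shapes are what a melt over three value columns should give: the row counts triple (2460 × 3 = 7380 and 2505 × 3 = 7515) and the 13 id columns plus the two new columns make 15. A toy sketch of the same check, with invented values and a shortened id list:

```python
import pandas as pd

wide = pd.DataFrame({
    "ntd_id": ["1", "2"],
    "year": ["2023", "2023"],
    "total_upt": [5.0, 7.0],
    "total_vrh": [1.0, 2.0],
    "total_pmt": [50.0, None],  # missing values are kept as NaN rows
})

id_cols = ["ntd_id", "year"]
value_cols = ["total_upt", "total_vrh", "total_pmt"]

long = pd.melt(
    wide,
    id_vars=id_cols,
    value_vars=value_cols,
    var_name="ntd_metric",
    value_name="ntd_metric_value",
)

# One output row per input row per value column; id columns plus two new columns.
rows_ok = len(long) == len(wide) * len(value_cols)
cols_ok = long.shape[1] == len(id_cols) + 2
```
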

In [37]:
sample_ids = ntd_yes_no_merge["ntd_id"].sample(3).to_list()
melt_farebox[melt_farebox["ntd_id"].isin([sample_ids[1]])].sort_values(by=["year","mode","type_of_service"])

Unnamed: 0,source_agency,year,ntd_id,caltrans_district,mode,type_of_service,area_type,reporter_type,quartile,metric,metric_value,requirement,requirement_met_flag,ntd_metric,ntd_metric_value
691,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_upt,
3151,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_vrh,
5611,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_pmt,
697,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,PT,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_upt,79095.0
3157,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,PT,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_vrh,3176.0
5617,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,CB,PT,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_pmt,2100558.0
686,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,DR,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_upt,1732.0
3146,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,DR,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_vrh,447.0
5606,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,DR,DO,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_pmt,27527.0
696,County of Placer (PCT/TART) - Department of Pu...,2019,90196,3,DR,PT,Urban,Full Reporter,Bottom 25%,Farebox Recovery Ratio,7.59,Met FBR Min,False,total_upt,27381.0


In [26]:
melt_funding[melt_funding["ntd_id"].isin([sample_ids[1]])].sort_values(by=["year","mode","type_of_service"])

Unnamed: 0,source_agency,year,ntd_id,caltrans_district,mode,type_of_service,area_type,reporter_type,quartile,metric,metric_value,requirement,requirement_met_flag,ntd_metric,ntd_metric_value
8,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,DR,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_upt,
2513,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,DR,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_vrh,
5018,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,DR,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_pmt,
12,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,HR,DO,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_upt,125105460.0
2517,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,HR,DO,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_vrh,2225056.0
5022,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,HR,DO,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_pmt,1756364558.0
1,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,MB,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_upt,
2506,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,MB,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_vrh,
5011,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,MB,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_pmt,
0,San Francisco Bay Area Rapid Transit District ...,2019,90003,4,MG,PT,Urban,Full Reporter,Middle 50%,Local Funding % Change vs 2019,0.014763266,Maintained_or_Increased_vs_2019,Yes,total_upt,886515.0


## aggregation group by
- farebox melt
    - PMT, UPT, VRH totals for urban, per year
    - PMT, UPT, VRH totals for rural, per year
    - PMT, UPT, VRH totals for met FBR YES, per year
    - PMT, UPT, VRH totals for met FBR NO, per year

- funding melt
    - PMT, UPT, VRH totals for urban, per year
    - PMT, UPT, VRH totals for rural, per year
    - PMT, UPT, VRH totals for met funding YES, per year
    - PMT, UPT, VRH totals for met funding NO, per year
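
The grouping plan above could be sketched roughly as follows. This uses a small made-up frame with the same column names as `melt_farebox`; the values and the helper name `totals_by` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for melt_farebox; the values are made up for illustration.
toy = pd.DataFrame({
    "year": ["2020", "2020", "2020", "2021"],
    "area_type": ["Urban", "Urban", "Rural", "Urban"],
    "requirement_met_flag": [True, False, True, True],
    "ntd_metric": ["total_upt", "total_upt", "total_upt", "total_upt"],
    "ntd_metric_value": [100.0, 50.0, 25.0, None],
})

def totals_by(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Sum each NTD metric per year for every value of group_col (hypothetical helper)."""
    return (
        df.groupby(["year", group_col, "ntd_metric"], as_index=False)
        .agg(ntd_value_total=("ntd_metric_value", "sum"))
    )

by_area = totals_by(toy, "area_type")             # urban vs. rural totals, per year
by_flag = totals_by(toy, "requirement_met_flag")  # met vs. not met totals, per year
```

One thing to watch: by default `sum` returns 0 for a group whose values are all NaN, so a zero total may mean "no data reported" rather than a true zero.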


# Sanity Check

In [38]:
years = [
    "2020",
    "2021",
    "2022",
    "2023",
    "2024"
]

modes = [
    "MB",
    "CB",
    "RB",
    "TB",
    "DR",
    "VP"
]

melt_farebox[
    (melt_farebox["ntd_id"]=="90004")
    & (melt_farebox["year"].isin(years))
    & (melt_farebox["ntd_metric"]=="total_upt")
    ].sort_values(by="year")

Unnamed: 0,source_agency,year,ntd_id,caltrans_district,mode,type_of_service,area_type,reporter_type,quartile,metric,metric_value,requirement,requirement_met_flag,ntd_metric,ntd_metric_value
17,Golden Empire Transit District (GET),2020,90004,6,DR,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,25.98,Met FBR Min,True,total_upt,78845.0
18,Golden Empire Transit District (GET),2020,90004,6,MB,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,25.98,Met FBR Min,True,total_upt,5245726.0
15,Golden Empire Transit District (GET),2021,90004,6,DR,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,17.03,Met FBR Min,False,total_upt,78556.0
20,Golden Empire Transit District (GET),2021,90004,6,MB,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,17.03,Met FBR Min,False,total_upt,2783880.0
1278,Golden Empire Transit District (GET),2022,90004,6,DR,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,85.95,Met FBR Min,True,total_upt,106797.0
1282,Golden Empire Transit District (GET),2022,90004,6,MB,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,85.95,Met FBR Min,True,total_upt,3094249.0
1277,Golden Empire Transit District (GET),2023,90004,6,MB,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,34.81,Met FBR Min,True,total_upt,3130678.0
1280,Golden Empire Transit District (GET),2023,90004,6,DR,DO,Urban,Full Reporter,Top 25%,Farebox Recovery Ratio,34.81,Met FBR Min,True,total_upt,162915.0
1279,Golden Empire Transit District (GET),2024,90004,6,MB,DO,Urban,Full Reporter,Middle 50%,Farebox Recovery Ratio,10.3,Met FBR Min,False,total_upt,3639198.0
1281,Golden Empire Transit District (GET),2024,90004,6,DR,DO,Urban,Full Reporter,Middle 50%,Farebox Recovery Ratio,10.3,Met FBR Min,False,total_upt,196023.0


In [55]:
melt_farebox[
    (melt_farebox["ntd_id"]=="90012")
    & (melt_farebox["year"].isin(years[:5]))
    & (melt_farebox["mode"].isin(modes))
    & (melt_farebox["ntd_metric"]=="total_upt")
    # & (melt_farebox["requirement_met_flag"].str.lower()=="no")
    ].groupby(["year","ntd_id"]).agg(
        ntd_value_mean = ("ntd_metric_value","mean"),
        ntd_value_total = ("ntd_metric_value","sum"),
        # ntd_value_median = ("ntd_metric_value","median")
    ).sort_values(by="year") 
# Golden Empire Transit met the FBR requirement in some years and not in others; the means and sums still matched.
# Double-checked against the NTD TS2.1 report, and these UPT numbers match.

Unnamed: 0_level_0,Unnamed: 1_level_0,ntd_value_mean,ntd_value_total
year,ntd_id,Unnamed: 2_level_1,Unnamed: 3_level_1
2020,90012,599440.4,2997202.0
2021,90012,280585.4,1402927.0
2022,90012,262821.86,1839753.0
2023,90012,575447.25,2301789.0
2024,90012,639089.5,2556358.0


In [62]:
melt_farebox[
    (melt_farebox["ntd_id"]=="90004")
    & (melt_farebox["year"].isin(years[:3]))
    & (melt_farebox["mode"].isin(modes))
    & (melt_farebox["ntd_metric"]=="total_upt")
    & (melt_farebox["requirement_met_flag"]==True)
    ].groupby(["year","ntd_id","requirement_met_flag"]).agg(
        ntd_value_mean = ("ntd_metric_value","mean"),
        ntd_value_total = ("ntd_metric_value","sum"),
        ntd_value_median = ("ntd_metric_value","median")
    ).sort_values(by="year") # double-checked against the NTD TS2.1 report; these UPT numbers match (the filter above is total_upt)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,ntd_value_mean,ntd_value_total,ntd_value_median
year,ntd_id,requirement_met_flag,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020,90004,True,2662285.5,5324571.0,2662285.5
2022,90004,True,1600523.0,3201046.0,1600523.0


In [None]:
melt_funding[
    (melt_funding["ntd_id"]=="90013")
    & (melt_funding["mode"].isin(modes))
    & (melt_funding["year"].isin(years))
    & (melt_funding["ntd_metric"]=="total_pmt")
    ].groupby([
        "year",
        "ntd_id",
        # "requirement_met_flag"
    ]).agg(
        ntd_value_mean = ("ntd_metric_value","mean"),
        ntd_value_total = ("ntd_metric_value","sum"),
        ntd_value_median = ("ntd_metric_value","median")).sort_values(by="year")

# NaN values are omitted from mean calculation!!!

## I feel very confident that the aggregations, averages and sums are working.
Notes
- The metrics I queried from the warehouse match the NTD TS2.1 report.
- NaN values are not included in the mean calculations.
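
A minimal illustration of the NaN behavior noted above, using made-up numbers rather than warehouse data: pandas skips NaN when computing a mean, so the divisor is the number of non-null values, not the number of rows.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, np.nan])

# NaN is skipped by default: (10 + 20) / 2
mean_skipping_nan = values.mean()

# Treating NaN as zero changes the divisor: (10 + 20 + 0) / 3
mean_counting_nan_as_zero = values.fillna(0).mean()

# count() also ignores NaN, so it gives the divisor used by mean()
non_null_count = values.count()
```

This matters when comparing a mean against a sum: dividing the sum by the mean gives the number of modes that actually reported a value, which may be fewer than the number of rows in the group.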

## How many ntd_id are there, and how many were True vs. False?

In [101]:
melt_farebox.columns

Index(['source_agency', 'year', 'ntd_id', 'caltrans_district', 'mode',
       'type_of_service', 'area_type', 'reporter_type', 'quartile', 'metric',
       'metric_value', 'requirement', 'requirement_met_flag', 'ntd_metric',
       'ntd_metric_value'],
      dtype='object')

In [105]:
melt_farebox.groupby(
    [
        "area_type",
        "ntd_id",
        "requirement_met_flag",
    ]).agg(
    #total_unique_ntd_id =("ntd_id","nunique"),
    # despite the name, true_count is the number of rows for each flag value (True or False)
    true_count = ("requirement_met_flag","count")
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,true_count
area_type,ntd_id,requirement_met_flag,Unnamed: 3_level_1
Rural,90216,False,45
Rural,90216,True,9
Rural,91000,False,54
Rural,91002,False,18
Rural,91005,False,72
Rural,91006,False,30
Rural,91006,True,6
Rural,91007,False,12
Rural,91007,True,60
Rural,91008,False,48
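
The counts above are row counts, which are inflated because the melt repeats each flag once per mode and NTD metric. A possible way to count agencies instead is sketched below on made-up rows. It assumes the flag is constant within an agency and year, which looks true in the samples above but has not been verified for the full data.

```python
import pandas as pd

# Toy rows shaped like melt_farebox; the values are made up for illustration.
toy = pd.DataFrame({
    "area_type": ["Rural", "Rural", "Rural", "Urban", "Urban"],
    "ntd_id": ["91000", "91000", "91007", "90004", "90004"],
    "year": ["2020", "2020", "2020", "2020", "2021"],
    "requirement_met_flag": [False, False, True, True, False],
})

# Keep one row per agency and year (assumes the flag does not vary within that pair).
agency_years = toy.drop_duplicates(subset=["ntd_id", "year"])

# Total number of distinct agencies in the frame.
total_agencies = toy["ntd_id"].nunique()

# Distinct agencies that were ever True / ever False, split by area type.
agencies_by_flag = (
    agency_years.groupby(["area_type", "requirement_met_flag"])
    .agg(unique_ntd_id=("ntd_id", "nunique"))
)
```

An agency that met the requirement in one year and missed it in another is counted once under each flag value, so the two counts can add up to more than `total_agencies`.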


## Do the yes/no flags actually match up with the metric value?
- For the local funding metric, does `metric_value >= 0` always correspond to `Yes`?
- For FBR, does FBR > 10 (rural) or FBR > 20 (urban) always correspond to a met flag?

In [93]:
melt_farebox["metric_value"].astype("int")

ValueError: invalid literal for int() with base 10: '63.14'
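
The error arises because `metric_value` holds strings such as `'63.14'`, which cannot be parsed directly as integers. One option is `pd.to_numeric`, sketched here on a made-up series. Whether the real column contains any non-numeric strings is unknown, so the `"not available"` entry is only an assumption to show what `errors="coerce"` does.

```python
import pandas as pd

# Made-up values in the same string form as metric_value.
raw = pd.Series(["63.14", "10.3", "not available", None])

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float = pd.to_numeric(raw, errors="coerce")

# Comparisons against NaN evaluate to False, so missing values never pass.
meets_ten = as_float >= 10
```

After the conversion, it is worth checking `as_float.isna().sum()` against the number of nulls in the raw column to see whether any values were silently coerced.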

In [71]:
melt_farebox.groupby(
    ["area_type",
    "requirement_met_flag"]
).agg({"ntd_id":"count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,ntd_id
area_type,requirement_met_flag,Unnamed: 2_level_1
Rural,False,1284
Rural,True,765
Urban,False,3816
Urban,True,1515


In [84]:
melt_farebox[
    (melt_farebox["metric"] == "Farebox Recovery Ratio")
    & (melt_farebox["area_type"]=="Rural")
    & (melt_farebox["metric_value"]>=10)
]

TypeError: '>=' not supported between instances of 'str' and 'int'

In [None]:
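import numpy as np
import pandas as pd

# metric_value is stored as strings (see the errors above), so convert it first
fbr = pd.to_numeric(melt_farebox["metric_value"], errors="coerce")
# FBR minimums assumed from the question above: 10 for rural, 20 for urban
fbr_min = melt_farebox["area_type"].map({"Rural": 10, "Urban": 20})
melt_farebox["sanity_check"] = np.where(fbr >= fbr_min, "yes", "no")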
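
To answer the question in this section, a recomputed flag could be compared against the recorded `requirement_met_flag`. The sketch below uses made-up rows (two FBR values are borrowed from the Golden Empire Transit output above) and assumes minimums of 10 for rural and 20 for urban; the column name `recomputed_flag` is hypothetical.

```python
import pandas as pd

# Toy rows shaped like melt_farebox; metric_value is a string, as in the real frame.
toy = pd.DataFrame({
    "area_type": ["Rural", "Rural", "Urban", "Urban"],
    "metric_value": ["12.5", "8.0", "25.98", "17.03"],
    "requirement_met_flag": [True, False, True, False],
})

fbr = pd.to_numeric(toy["metric_value"], errors="coerce")
fbr_min = toy["area_type"].map({"Rural": 10, "Urban": 20})  # assumed minimums
toy["recomputed_flag"] = fbr >= fbr_min

# Rows where the recorded flag disagrees with the recomputed one.
mismatches = toy[toy["recomputed_flag"] != toy["requirement_met_flag"]]

# Rows = recorded flag, columns = recomputed flag; off-diagonal cells are disagreements.
agreement = pd.crosstab(toy["requirement_met_flag"], toy["recomputed_flag"])
```

If `mismatches` is empty on the real data, the flags are consistent with the assumed thresholds. Any rows that do appear are the ones to inspect by hand.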