### Prototyping automatic validation checks for NTD reporting data: form RR-20
  
We decided *not* to use Pandera to validate this form: writing our own functions is more straightforward and easier to customize.  
  
This notebook first imports a CSV of cleaned data, presumably submitted through the RR-20 form, as produced by the report generation function on the BlackCat website. We assume that a future API will provide this data in the **exact same format** as the generated report, which is what we requested from the BlackCat developers. 
  
This notebook shows the development of the functions that are used in the executable file `rr20_check.py`.

In [5]:
from google.cloud import bigquery
import pandas as pd
# import pandera as pa
import numpy as np 
import datetime

#### FOR NOW:
For each organization, filter to the rows with the latest `date_uploaded`. 
* Each revision that an agency submits into BlackCat should be in the `blackcat_raw` data table in BigQuery. By grabbing all rows with the latest date for each agency, we get the most recently submitted report.
* Since 2023 and 2022 data are kept in different tables, doing this for each table gives us the latest submission for each year.
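
The "latest upload per organization" filter described above can also be sketched in pandas. This is only an illustration on made-up rows (the agency names and values are hypothetical, not BlackCat data); it keeps every row that shares an organization's most recent `date_uploaded`, ties included.

```python
import pandas as pd

# Made-up rows standing in for a raw table; values are illustrative only.
raw = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A", "Agency A", "Agency A", "Agency B"],
    "Mode": ["MB", "MB", "DR", "DR"],
    "Annual_VRH": [100.0, 120.0, 50.0, 75.0],
    "date_uploaded": pd.to_datetime(
        ["2023-09-01", "2023-10-15", "2023-10-15", "2023-09-20"]
    ),
})

# Latest upload date per organization, broadcast back onto every row.
latest = raw.groupby("Organization_Legal_Name")["date_uploaded"].transform("max")

# Keep all rows belonging to each organization's latest upload.
latest_rows = raw[raw["date_uploaded"] == latest].reset_index(drop=True)
print(latest_rows)
```

Agency A's September row is dropped, while both of its October rows (one per mode) are kept.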


In [6]:
this_year = datetime.datetime.now().year
last_year = this_year - 1
print(last_year)

2022


In [7]:
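### Query "For Now"
def get_bq_data(client, year, tablename):
    """Return the most recently uploaded rows per organization from a blackcat_raw table."""
    # RANK() assigns 1 to every row that shares an organization's latest date_uploaded
    bq_data_query = f"""SELECT * FROM 
          (select *,
          RANK() OVER(PARTITION BY Organization_Legal_Name ORDER BY date_uploaded DESC) rank_date 
        from `cal-itp-data-infra.blackcat_raw.{year}_{tablename}`) s 
        WHERE rank_date = 1;
        """

    print(bq_data_query)
    
    df = client.query(bq_data_query).to_dataframe()
    # Drop exact duplicates and the helper columns used only for filtering
    df = df.drop_duplicates().drop(['rank_date', 'date_uploaded'], axis=1)
    return df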


In [8]:
client = bigquery.Client()

In [9]:
rr20_service = get_bq_data(client, this_year, "rr20_service_data")
print(rr20_service.shape)
rr20_service.head()

SELECT * FROM 
          (select *,
          RANK() OVER(PARTITION BY Organization_Legal_Name ORDER BY date_uploaded DESC) rank_date 
        from `cal-itp-data-infra.blackcat_raw.2023_rr20_service_data`) s 
        WHERE rank_date = 1;
        
(80, 9)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0
2,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0
3,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0
4,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0


In [10]:
rr20_exp_by_mode = get_bq_data(client, this_year, "rr20_expenses_by_mode")
print(rr20_exp_by_mode.shape)
rr20_exp_by_mode.head()

SELECT * FROM 
          (select *,
          RANK() OVER(PARTITION BY Organization_Legal_Name ORDER BY date_uploaded DESC) rank_date 
        from `cal-itp-data-infra.blackcat_raw.2023_rr20_expenses_by_mode`) s 
        WHERE rank_date = 1;
        
(160, 6)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Mode,Total_Annual_Expenses_By_Mode
0,Alpine County Community Development,,2023,Capital,Demand Response (DR) - (DO),0
1,Alpine County Community Development,,2023,Operating,Demand Response (DR) - (DO),81766
2,City of Auburn,,2023,Capital,Demand Response (DR) - (DO),261721
3,City of Auburn,,2023,Capital,Deviated Fixed Route (DF) - (DO),11705
4,City of Auburn,,2023,Operating,Demand Response (DR) - (DO),394496


In [11]:
orgs_q = """SELECT * FROM `cal-itp-data-infra.blackcat_raw.2023_organizations`"""
orgs = client.query(orgs_q).to_dataframe().drop_duplicates().drop(['date_uploaded'], axis=1)


print(orgs.shape)
orgs.head()

(90, 1)


Unnamed: 0,Organization
0,Alpine County Community Development
1,Amador Transit
2,Butte County Association of Governments/ Butte...
3,Calaveras Transit Agency
4,City of Arvin


In [2]:
# Older way - load from excel files
# rr20_exp_by_mode = pd.read_excel("../data/NTD_Annual_Report_Rural_2022.xlsx", 
#                      sheet_name="Expenses By Mode", index_col=None) 
# rr20_service = pd.read_excel("../data/NTD_Annual_Report_Rural_2022.xlsx", 
#                      sheet_name="Service Data", index_col=None) 
# orgs = pd.read_csv("../data/organizations.csv")

In [5]:
# data1 = rr20_service.merge(orgs, left_on ='Organization Legal Name', right_on = 'Organization', 
#                           indicator=True).query('_merge == "both"').drop(columns=['_merge', 'Organization'])
# print(data1.shape)
# data1.head()

(86, 9)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0
2,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0
3,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0
4,Calaveras Transit Agency,CTA,2022,Demand Response (DR) - (PT),23812.0,1416.0,2104.0,0.0,5.0


In [6]:
# Doubles the rows as expected - because of the 'Operating/Capital' column
# data = data1.merge(rr20_exp_by_mode, on = ['Organization Legal Name', 'Common Name/Acronym/DBA', 'Fiscal Year', 'Mode'])

# print(data.shape)
# data.head()

(172, 11)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0


In [12]:
# try the 2 merges together
data = (rr20_service.merge(orgs, left_on='Organization_Legal_Name', right_on='Organization',
                           indicator=True)
        .query('_merge == "both"')
        .drop(columns=['_merge', 'Organization'])
        .merge(rr20_exp_by_mode, on=['Organization_Legal_Name', 'Common_Name_Acronym_DBA', 'Fiscal_Year', 'Mode']))

In [13]:
print(data.shape)
data.head(20)

(160, 11)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Capital,0
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589
2,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Capital,0
3,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072
4,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Capital,2060201
5,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Operating,3605380
6,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Capital,0
7,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Operating,1432792
8,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Capital,0
9,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Operating,81766
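
Each service row appears twice in the merged table because the expenses table holds one "Operating" and one "Capital" row per organization and mode. A minimal sketch on made-up data shows the effect; the `validate="one_to_many"` argument is an optional guard (not used in the notebook) that makes pandas raise an error if the left-hand keys are not unique.

```python
import pandas as pd

# Hypothetical service table: one row per organization and mode.
service = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A", "Agency A"],
    "Mode": ["MB", "DR"],
    "Annual_VRH": [100.0, 50.0],
})

# Hypothetical expenses table: two rows (Operating, Capital) per organization and mode.
expenses = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A"] * 4,
    "Mode": ["MB", "MB", "DR", "DR"],
    "Operating_Capital": ["Operating", "Capital", "Operating", "Capital"],
    "Total_Annual_Expenses_By_Mode": [1000, 0, 400, 0],
})

# Inner join on the shared keys; each service row matches two expense rows.
merged = service.merge(
    expenses,
    on=["Organization_Legal_Name", "Mode"],
    validate="one_to_many",
)
print(merged.shape)
```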


#### ALSO NEED FARE REVENUE FROM RR-20 FINANCIAL TABLE

In [120]:
rr20_fin = get_bq_data(client, this_year, "rr20_financials__2")
print(rr20_fin.shape)
rr20_fin.head()

SELECT * FROM 
          (select *,
          RANK() OVER(PARTITION BY Organization_Legal_Name ORDER BY date_uploaded DESC) rank_date 
        from `cal-itp-data-infra.blackcat_raw.2023_rr20_financials__2`) s 
        WHERE rank_date = 1;
        
(84, 45)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Total_Annual_Revenues_Expended,Total_Annual_Expenses_by_Mode,Fare_Revenues,Other_Directly_Generated_Funds,Revenues_Accrued_Through_a_PT_Agreement,NonFederal_Funds,...,CRRSA_Act_Public_Transportation_on_Indian_Reservations_Program_Funds_5321,American_Rescue_Plan_Act_of_2021_Rural_Area_Program_Funds_5311,American_Rescue_Plan_Act_of_2021_Public_Transportation_on_Indian_Reservations_Program_Funds_5321,FTA_Job_Access_and_Reverse_Commute_Formula_Program_5316,State_of_Good_Repair_5308,FTA_Bus_and_Bus_Facilities,ARRA_TIGGER_Greenhouse_Gas_and_Energy_Reduction,Other_FTA_Funds,Other_USDOT_Funds,Other_Federal_Funds
0,City of McFarland,,2023,Capital,0,0,,,,0.0,...,,,,,,,,,,
1,City of McFarland,,2023,Operating,138309,138309,,,,61783.0,...,,,,,,,,,,
2,Lassen Transit Service Agency,LTSA,2023,Capital,207136,207136,0.0,,,207136.0,...,,,,,,,,,,
3,Lassen Transit Service Agency,LTSA,2023,Operating,1304748,1304748,113981.0,,,615695.0,...,,,,,,,,,,
4,Fresno County Rural Transit Agency,FCRTA,2023,Capital,4547110,4547110,0.0,,,2806243.0,...,,,,,,,,,,


**Now join this to the other 2023 data**

In [121]:
rr20_fin2 = rr20_fin[['Organization_Legal_Name', 'Common_Name_Acronym_DBA', 'Fiscal_Year', 'Operating_Capital', 'Fare_Revenues']]
print(rr20_fin2.shape)
rr20_fin2.head()

(84, 5)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Fare_Revenues
0,City of McFarland,,2023,Capital,
1,City of McFarland,,2023,Operating,
2,Lassen Transit Service Agency,LTSA,2023,Capital,0.0
3,Lassen Transit Service Agency,LTSA,2023,Operating,113981.0
4,Fresno County Rural Transit Agency,FCRTA,2023,Capital,0.0


In [124]:
data_all = data.merge(rr20_fin2, on =['Organization_Legal_Name', 'Common_Name_Acronym_DBA', 'Fiscal_Year', 'Operating_Capital'],
                          indicator=True).query('_merge == "both"').drop(columns=['_merge'])

In [125]:
data_all.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Capital,0,0.0
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Capital,0,0.0
2,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589,56073.0
3,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072,56073.0
4,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Capital,2060201,0.0


In [143]:
### Check
agency = 'Tuolumne County Transit Agency (TCTA)'
data_all[data_all['Organization_Legal_Name'] == agency]


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
32,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Capital,0,0.0
33,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Capital,0,0.0
34,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Operating,997518,36486.0
35,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Operating,1436041,36486.0
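
In the check above, both modes carry the same `Fare_Revenues` value. That appears to follow from the join: the financial table has one row per organization and `Operating_Capital`, with no `Mode` column, so its value is repeated on every mode row. The sketch below, on made-up rows, shows why summing that column directly would count agency-level fares once per mode, and one possible way to count them once.

```python
import pandas as pd

# Hypothetical joined rows: the agency-level fare value is repeated on each mode row.
joined = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A", "Agency A"],
    "Fiscal_Year": [2023, 2023],
    "Mode": ["MB", "DR"],
    "Operating_Capital": ["Operating", "Operating"],
    "Fare_Revenues": [36486.0, 36486.0],
})

# Summing across mode rows counts the agency-level value twice.
naive_total = joined["Fare_Revenues"].sum()

# Keep one row per organization, year, and Operating/Capital before summing.
agency_level = joined.drop_duplicates(
    ["Organization_Legal_Name", "Fiscal_Year", "Operating_Capital"]
)
agency_total = agency_level["Fare_Revenues"].sum()

print(naive_total, agency_total)
```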


### Get data for "last year" - 2022
The 2022 data was uploaded only once, so its tables have a slightly different schema and we query them directly, without the latest-upload filter.

In [14]:
bq_2022_query = f"""SELECT * FROM `cal-itp-data-infra.blackcat_raw.{last_year}_rr20_service_data`"""
rr20_service_lastyr = client.query(bq_2022_query).to_dataframe().drop_duplicates()
print(rr20_service_lastyr.shape)
rr20_service_lastyr.head()

(89, 9)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX
0,County of Siskiyou,CTA,2022,Bus (MB) (Fixed Route) - (DO),226547.0,10279.0,27970.0,0.0,5.0
1,"County of Nevada Public Works, Transit Service...",CTA,2022,Bus (MB) (Fixed Route) - (DO),322267.0,17926.0,99321.0,0.0,7.0
2,Mountain Area Regional Transit Authority,CTA,2022,Demand Response (DR) - (DO),101401.0,8316.0,12232.0,0.0,8.0
3,County of Mariposa,CTA,2022,Demand Response (DR) - (DO),117623.0,3577.0,8111.0,1904.0,8.0
4,"County of Nevada Public Works, Transit Service...",CTA,2022,Demand Response (DR) - (PT),103930.0,8505.0,18057.0,0.0,10.0


In [15]:
exp_2022_query = f"""SELECT * FROM `cal-itp-data-infra.blackcat_raw.{last_year}_rr20_expenses_by_mode`"""
rr20_exp_by_mode_lastyr = client.query(exp_2022_query).to_dataframe().drop_duplicates()
print(rr20_exp_by_mode_lastyr.shape)
rr20_exp_by_mode_lastyr.head()

(178, 6)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Mode,Total_Annual_Expenses_By_Mode
0,Amador Transit,,2022,Operating,Commuter Bus (CB) - (DO),208171.0
1,Amador Transit,,2022,Capital,Commuter Bus (CB) - (DO),0.0
2,Humboldt Transit Authority,HTA,2022,Operating,Commuter Bus (CB) - (DO),3414537.0
3,Humboldt Transit Authority,HTA,2022,Capital,Commuter Bus (CB) - (DO),138173.0
4,Morongo Basin Transit Authority,,2022,Capital,Commuter Bus (CB) - (DO),100818.0


In [16]:
data_lastyear = (rr20_service_lastyr.merge(orgs, left_on ='Organization_Legal_Name', right_on = 'Organization', 
                            indicator=True).query('_merge == "both"').drop(columns=['_merge', 'Organization'])
                            .merge(rr20_exp_by_mode_lastyr, on = ['Organization_Legal_Name', 'Fiscal_Year', 'Mode'])
                .sort_values(by="Organization_Legal_Name"))
print(data_lastyear.shape)
data_lastyear.head()

(166, 12)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA_x,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Common_Name_Acronym_DBA_y,Operating_Capital,Total_Annual_Expenses_By_Mode
30,Alpine County Community Development,CTA,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,,Operating,75944.0
31,Alpine County Community Development,CTA,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,,Capital,0.0
24,Amador Transit,CTA,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,,Operating,208171.0
25,Amador Transit,CTA,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,,Capital,0.0
26,Amador Transit,eTrans,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,,Capital,0.0


In [17]:
# 2022: use the "Common Name" from the service data; if it is empty, fall back to the expenses table. If both are filled in, the service data value wins
data_lastyear['Common_Name_Acronym_DBA'] = data_lastyear['Common_Name_Acronym_DBA_x'].combine_first(data_lastyear['Common_Name_Acronym_DBA_y'])
data_lastyear.drop(columns=['Common_Name_Acronym_DBA_x', 'Common_Name_Acronym_DBA_y'], inplace=True)
data_lastyear.head()


Unnamed: 0,Organization_Legal_Name,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Common_Name_Acronym_DBA
30,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,CTA
31,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,CTA
24,Amador Transit,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,CTA
25,Amador Transit,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,CTA
26,Amador Transit,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,eTrans
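
The fallback rule used for the common name relies on `Series.combine_first`, which keeps the calling Series' value and fills only its missing entries from the other Series. A small sketch with made-up names:

```python
import pandas as pd

# Hypothetical name columns as they look after a merge with overlapping column names.
names = pd.DataFrame({
    "Common_Name_Acronym_DBA_x": ["CTA", None, "eTrans", None],
    "Common_Name_Acronym_DBA_y": ["Other", "HTA", None, None],
})

# Prefer the _x value; use _y only where _x is missing.
names["Common_Name_Acronym_DBA"] = names["Common_Name_Acronym_DBA_x"].combine_first(
    names["Common_Name_Acronym_DBA_y"]
)
print(names["Common_Name_Acronym_DBA"].tolist())
```

The first row keeps "CTA" even though the other column has a value, and the last row stays missing because neither column has one.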


In [131]:
data_lastyear[data_lastyear['Organization_Legal_Name'].str.contains('Ojai')]

Unnamed: 0,Organization_Legal_Name,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Common_Name_Acronym_DBA
138,City of Ojai,2022,Bus (MB) (Fixed Route) - (DO),62878.0,4649.0,37070.0,0.0,2.0,Operating,733466.0,PVVTA
139,City of Ojai,2022,Bus (MB) (Fixed Route) - (DO),62878.0,4649.0,37070.0,0.0,2.0,Capital,0.0,PVVTA


In [129]:
print(data_lastyear.shape)

(166, 11)


In [126]:
### Now need to add in the Fare Revenues from 2022 also.
fin_2022_query = f"""SELECT * FROM `cal-itp-data-infra.blackcat_raw.{last_year}_rr20_financials__2`"""
fin_lastyr = client.query(fin_2022_query).to_dataframe().drop_duplicates()
print(fin_lastyr.shape)
fin_lastyr.head()

(96, 44)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Total_Annual_Revenues_Expended,Total_Annual_Expenses_by_Mode,Fare_Revenues,Other_Directly_Generated_Funds,Revenues_Accrued_Through_a_PT_Agreement,NonFederal_Funds,...,CRRSA_Act_Public_Transportation_on_Indian_Reservations_Program_Funds_5321,American_Rescue_Plan_Act_of_2021_Rural_Area_Program_Funds_5311,American_Rescue_Plan_Act_of_2021_Public_Transportation_on_Indian_Reservations_Program_Funds_5321,FTA_Job_Access_and_Reverse_Commute_Formula_Program_5316,State_of_Good_Repair_5308,FTA_Bus_and_Bus_Facilities,ARRA_TIGGER_Greenhouse_Gas_and_Energy_Reduction,Other_FTA_Funds,Other_USDOT_Funds,Other_Federal_Funds
0,City of Ojai,,2022,Operating,733466,733466,39335.0,40460.0,,465385.0,...,,,,,,,,,,
1,City of Ojai,,2022,Capital,0,0,0.0,0.0,,0.0,...,,,,,,,,,,
2,City of Taft,,2022,Capital,0,0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,City of Taft,,2022,Operating,394016,394016,55871.0,12.0,0.0,305234.0,...,,,,,,,,,,
4,City of Arvin,,2022,Operating,853504,853504,36386.0,,,548259.0,...,,,,,,,,,,


In [130]:
fin_lastyr2 = fin_lastyr[['Organization_Legal_Name', 'Common_Name_Acronym_DBA', 'Fiscal_Year', 'Operating_Capital', 'Fare_Revenues']]
print(fin_lastyr2.shape)
fin_lastyr2.head()


(96, 5)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Operating_Capital,Fare_Revenues
0,City of Ojai,,2022,Operating,39335.0
1,City of Ojai,,2022,Capital,0.0
2,City of Taft,,2022,Capital,0.0
3,City of Taft,,2022,Operating,55871.0
4,City of Arvin,,2022,Operating,36386.0


In [132]:
# We remove 'Common_Name_Acronym_DBA' from the join keys because so many of them don't match between the
# 2022 financial table and the service and expenses tables

## INNER JOIN
data_all_lastyear = data_lastyear.merge(fin_lastyr2, on =['Organization_Legal_Name', 'Fiscal_Year', 'Operating_Capital'],
                          indicator=True).query('_merge == "both"').drop(columns=['_merge'])
print(data_all_lastyear.shape)
data_all_lastyear.head()

(166, 13)


Unnamed: 0,Organization_Legal_Name,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Common_Name_Acronym_DBA_x,Common_Name_Acronym_DBA_y,Fare_Revenues
0,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,CTA,,3448.0
1,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,CTA,,0.0
2,Amador Transit,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,CTA,,68745.0
3,Amador Transit,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0,eTrans,,68745.0
4,Amador Transit,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,eTrans,,68745.0
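
One way to see which key values fail to match before dropping a column from the join is an outer merge with `indicator=True`. This is a sketch on made-up rows, not the check that was actually run here:

```python
import pandas as pd

# Hypothetical tables whose common names disagree for one agency.
left = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A", "Agency B"],
    "Common_Name_Acronym_DBA": ["CTA", "HTA"],
})
right = pd.DataFrame({
    "Organization_Legal_Name": ["Agency A", "Agency B"],
    "Common_Name_Acronym_DBA": ["A-Transit", "HTA"],
})

# The outer join keeps unmatched rows from both sides; _merge says where each row came from.
check = left.merge(
    right,
    on=["Organization_Legal_Name", "Common_Name_Acronym_DBA"],
    how="outer",
    indicator=True,
)
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Rows marked `left_only` or `right_only` are the ones an inner join on both columns would silently drop.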


In [133]:
## Again, the two 'Common_Name_Acronym_DBA' columns must be combined.
# Use the "Common Name" from the service/expenses data; if it is empty, fall back to the financial table. If both are filled in, the service data value wins
data_all_lastyear['Common_Name_Acronym_DBA'] = data_all_lastyear['Common_Name_Acronym_DBA_x'].combine_first(data_all_lastyear['Common_Name_Acronym_DBA_y'])
data_all_lastyear.drop(columns=['Common_Name_Acronym_DBA_x', 'Common_Name_Acronym_DBA_y'], inplace=True)
data_all_lastyear.head()

Unnamed: 0,Organization_Legal_Name,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,Common_Name_Acronym_DBA
0,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,3448.0,CTA
1,Alpine County Community Development,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,0.0,CTA
2,Amador Transit,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,68745.0,CTA
3,Amador Transit,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0,68745.0,eTrans
4,Amador Transit,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,68745.0,eTrans


In [135]:
# Rearrange columns of 2022 so that it matches the order of columns in 2023
cols = list(data_all_lastyear.columns.values)
cols

['Organization_Legal_Name',
 'Fiscal_Year',
 'Mode',
 'Annual_VRM',
 'Annual_VRH',
 'Annual_UPT',
 'Sponsored_UPT',
 'VOMX',
 'Operating_Capital',
 'Total_Annual_Expenses_By_Mode',
 'Fare_Revenues',
 'Common_Name_Acronym_DBA']

In [138]:
data_all_lastyear = data_all_lastyear[['Organization_Legal_Name','Common_Name_Acronym_DBA',
 'Fiscal_Year',
 'Mode',
 'Annual_VRM',
 'Annual_VRH',
 'Annual_UPT',
 'Sponsored_UPT',
 'VOMX',
 'Operating_Capital',
 'Total_Annual_Expenses_By_Mode',
 'Fare_Revenues']]

data_all_lastyear.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
0,Alpine County Community Development,CTA,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,3448.0
1,Alpine County Community Development,CTA,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,0.0
2,Amador Transit,CTA,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,68745.0
3,Amador Transit,eTrans,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0,68745.0
4,Amador Transit,eTrans,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,68745.0
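
As an alternative to typing the column list by hand, the column order can be borrowed from the other DataFrame. The sketch below uses small hypothetical frames and assumes both have the same set of column names. Note that `pd.concat` aligns on column names anyway, so the reorder mainly helps when comparing the tables by eye.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order.
this_year_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
last_year_df = pd.DataFrame({"c": [30], "a": [10], "b": [20]})

# Reorder last year's columns to match this year's.
last_year_df = last_year_df[this_year_df.columns]

stacked = pd.concat([this_year_df, last_year_df], ignore_index=True)
print(stacked)
```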


In [144]:
allyears = pd.concat([data_all, data_all_lastyear], ignore_index = True)
print(allyears.shape)
allyears.head()

(326, 12)


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Capital,0.0,0.0
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Capital,0.0,0.0
2,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,56073.0
3,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,56073.0
4,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Capital,2060201.0,0.0


In [141]:
allyears['Organization_Legal_Name'].unique()

array(['City of Solvang',
       'County of Nevada Public Works, Transit Services Division',
       'Alpine County Community Development', 'City of Auburn',
       'City of Guadalupe', 'Mountain Area Regional Transit Authority',
       'City of Arvin', 'Town of Truckee',
       'Tuolumne County Transit Agency (TCTA)',
       'Humboldt Transit Authority', 'City of Rio Vista',
       'City of Ridgecrest', 'Modoc Transportation Agency',
       'City of Wasco', 'Morongo Basin Transit Authority',
       'City of California City', 'City of Ojai', 'City of Dixon',
       'City of Tehachapi', 'City of Corcoran - Corcoran Area Transit',
       'Eastern Sierra Transit Authority', 'City of Shafter',
       'City of Taft', 'Sierra County Transportation Commission',
       'Lake Transit Authority', 'Palo Verde Valley Transit Agency',
       'Amador Transit',
       'San Benito County Local Transportation Authority',
       'City of McFarland', 'Lassen Transit Service Agency',
       'Tehama County 

In [145]:
### for testing
agency = 'Tuolumne County Transit Agency (TCTA)'
allyears[allyears['Organization_Legal_Name'] == agency]


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
32,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Capital,0.0,0.0
33,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Capital,0.0,0.0
34,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Operating,997518.0,36486.0
35,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Operating,1436041.0,36486.0
320,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Capital,156040.0,0.0
321,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Capital,58640.0,0.0
322,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Operating,835965.0,4617.0
323,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Operating,1557835.0,4617.0


### Initial development of validation checks before the 2023 NTD reporting season started
This used 2022 and 2021 data.  
These cells are commented out since they are no longer needed, but they show how the checks were developed.  
  
---  
**The 2021 data is FAKE.**  
Clicking "generate report" for 2021 (RR-20, rural) on BlackCat returns an empty report. For the sake of time, I made up 2021 data in order to build the function that compares the prior year to this year.  
  
(scroll to bottom of notebook for code on how I made the fake data)

In [5]:
# exp_by_mode_2021 = pd.read_excel("../data/NTD_Annual_Report_Rural_2021.xlsx", 
#                      sheet_name="Expenses By Mode", index_col=None) 
# service_2021 = pd.read_excel("../data/NTD_Annual_Report_Rural_2021.xlsx", 
#                      sheet_name="Service Data", index_col=None) 

In [6]:
# all2021 = service_2021.merge(exp_by_mode_2021, on = ['Organization Legal Name', 'Common Name/Acronym/DBA', 'Fiscal Year', 'Mode'])

In [7]:
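####------- Combine 2022 (real) and 2021 (fake) data
# Note: DataFrame.append was removed in pandas 2.0; pd.concat([data, all2021], ignore_index=True) is the replacement
# allyears = data.append(all2021, ignore_index = True)
# print(allyears.shape)
# allyears.head()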

(344, 11)



Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0


### Method A. Individual calculations for making ratios

First, we filter the dataset down to only the "Operating" expense rows, dropping the "Capital" rows.

In [146]:
allyears1 = allyears[allyears['Operating_Capital']=="Operating"]
allyears1[allyears1['Organization_Legal_Name']==agency]


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues
34,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Operating,997518.0,36486.0
35,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Operating,1436041.0,36486.0
322,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Operating,835965.0,4617.0
323,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Operating,1557835.0,4617.0


In [147]:
allyears1.dtypes

Organization_Legal_Name           object
Common_Name_Acronym_DBA           object
Fiscal_Year                        Int64
Mode                              object
Annual_VRM                       float64
Annual_VRH                       float64
Annual_UPT                       float64
Sponsored_UPT                    float64
VOMX                             float64
Operating_Capital                 object
Total_Annual_Expenses_By_Mode    Float64
Fare_Revenues                    float64
dtype: object

**Cost per hour (CPH)**: importantly, this check includes a comparison to the previous year's value.  
CPH = `Operating expenses by mode / VRH`
  
Example NTD error message: `The calculated cost per hour for {MB - PT} equals {49.12}. The prior year’s calculated value equals {98.56}. This is a change of {-50.166} Percent.`
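
As a worked example, the sketch below computes CPH for the Tuolumne fixed-route bus rows shown in the table above and the change between years. The percent-change formula, `(current - prior) / prior * 100`, is an assumption; it is roughly consistent with the example message but is not confirmed as NTD's exact calculation. The helper names are hypothetical.

```python
def cost_per_hour(operating_expenses, vrh):
    """Operating expenses for a mode divided by its vehicle revenue hours."""
    return operating_expenses / vrh


def pct_change(current, prior):
    """Percent change from the prior value to the current one (assumed formula)."""
    return (current - prior) / prior * 100


# Tuolumne, Bus (MB) (Fixed Route) - (PT), Operating rows from the table above.
cph_2023 = cost_per_hour(997518.0, 6684.0)
cph_2022 = cost_per_hour(835965.0, 6074.0)
change = pct_change(cph_2023, cph_2022)

print(f"CPH 2023: {cph_2023:.2f}, CPH 2022: {cph_2022:.2f}, change: {change:.2f}%")
```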

In [148]:
allyears2 = (allyears1.groupby(['Organization_Legal_Name', 'Common_Name_Acronym_DBA','Mode', 'Fiscal_Year'], dropna=False)
                       .apply(lambda x: x.assign(cost_per_hr=x['Total_Annual_Expenses_By_Mode']/ x['Annual_VRH']))
                           .reset_index(drop=True))

# allyears1.head()

# test2 = (allyears.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])
#  .apply(lambda x: x.assign(cost_per_hr=lambda x: x['Total Annual Expenses By Mode'].sum() / x['Annual VRH']))
# )

# test2.head()

In [162]:
# Check
agency = 'Lassen'
allyears2[allyears2['Organization_Legal_Name'].str.contains(agency)]

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed,trips_per_hr
54,Lassen Transit Service Agency,LTSA,2023,University Service (US) - (PT),,,,,,Operating,,113981.0,,,,,
55,Lassen Transit Service Agency,LTSA,2023,Bus (MB) (Fixed Route) - (PT),68947.0,4389.0,0.0,0.0,0.0,Operating,534327.0,113981.0,121.74231,inf,inf,15.709045,0.0
56,Lassen Transit Service Agency,LTSA,2023,Commuter Bus (CB) - (PT),121943.0,3331.0,0.0,0.0,0.0,Operating,727536.0,113981.0,218.41369,inf,inf,36.608526,0.0
57,Lassen Transit Service Agency,LTSA,2023,Demand Response (DR) - (PT),2130.0,750.0,0.0,0.0,0.0,Operating,42885.0,113981.0,57.18,inf,inf,2.84,0.0
127,Lassen Transit Service Agency,Kern Transit,2022,Bus (MB) (Fixed Route) - (PT),74554.0,4775.0,36897.0,0.0,2.0,Operating,482043.0,157414.0,100.951414,37277.0,4.266309,15.613403,7.72712
128,Lassen Transit Service Agency,PCTC,2022,Commuter Bus (CB) - (PT),129807.0,3584.0,9352.0,0.0,2.0,Operating,627289.0,157414.0,175.024833,64903.5,16.832121,36.218471,2.609375
129,Lassen Transit Service Agency,LTSA,2022,Demand Response (DR) - (PT),11777.0,613.0,11981.0,0.0,1.0,Operating,137550.0,157414.0,224.388254,11777.0,13.138636,19.212072,19.544861
130,Lassen Transit Service Agency,CTA,2022,University Service (US) - (PT),,,,,,Operating,,157414.0,,,,,


In [99]:
# Eastern Sierra for {DR - DO} still doesn't match 2023 in the NTD-reported error (225.79) but this is the correct calculation
1435603.0/18194.0

78.90529845003847

#### Also add in miles per vehicle: VRM / Vehicles Operated in Annual Maximum Service (VOMS)  
Example NTD error message: `The calculated miles per vehicle for {DR - DO} is {5,221.40}. The prior year’s calculated value is {3,678.33}. This is a {41.95}% {increase} caused by a change in Vehicle Revenue Miles, the number of Vehicles Operated in Annual Maximum Service, or both.`
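
Rows that report a `VOMX` of 0 show up as `inf` in the ratio columns in this notebook. One possible way to avoid that, sketched here with a hypothetical helper `safe_ratio` (not part of the notebook), is to turn zero denominators into `NaN` before dividing, so those rows can be handled by a separate zero check.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, with NaN where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)


# Two Lassen rows from the check output: one with VOMX = 0, one with VOMX = 2
demo = pd.DataFrame({
    "Annual_VRM": [68947.0, 74554.0],
    "VOMX": [0.0, 2.0],
})
demo["miles_per_veh"] = safe_ratio(demo["Annual_VRM"], demo["VOMX"])
print(demo)
```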

In [151]:
allyears2 = (allyears2.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Mode', 'Fiscal_Year'], dropna=False)
 .apply(lambda x: x.assign(miles_per_veh=lambda x: x['Annual_VRM'].sum() / x['VOMX']))
)

allyears2.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,56073.0,87.257654,56794.0
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,56073.0,74.535745,7433.5
2,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Operating,3605380.0,235034.0,202.492558,45802.428571
3,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Operating,1432792.0,235034.0,147.939288,11923.0
4,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Operating,81766.0,2866.0,158.768932,9619.0


In [152]:
### Check
agency = 'Tuolumne'
allyears2[allyears2['Organization_Legal_Name'].str.contains(agency)]

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh
16,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Operating,997518.0,36486.0,149.239677,26844.75
17,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Operating,1436041.0,36486.0,85.606021,29879.25
160,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Operating,835965.0,4617.0,137.630063,20958.5
161,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Operating,1557835.0,4617.0,124.756547,21180.75


**Add in:**
* Fare Revenues per unlinked passenger trip
* Revenue Speed
* Trips per hr

In [155]:
allyears2 = (allyears2.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Fiscal_Year'], dropna=False)
 .apply(lambda x: x.assign(fare_rev_per_trip=lambda x: x['Fare_Revenues'].sum() / x['Annual_UPT']))
)

allyears2.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh,fare_rev_per_trip
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,56073.0,87.257654,56794.0,2.600005
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,56073.0,74.535745,7433.5,46.610973
2,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Operating,3605380.0,235034.0,202.492558,45802.428571,4.084245
3,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Operating,1432792.0,235034.0,147.939288,11923.0,22.437613
4,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Operating,81766.0,2866.0,158.768932,9619.0,8.818462


In [158]:
### Check
agency = 'Lassen'
allyears2[allyears2['Organization_Legal_Name'].str.contains(agency)]

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh,fare_rev_per_trip
54,Lassen Transit Service Agency,LTSA,2023,University Service (US) - (PT),,,,,,Operating,,113981.0,,,
55,Lassen Transit Service Agency,LTSA,2023,Bus (MB) (Fixed Route) - (PT),68947.0,4389.0,0.0,0.0,0.0,Operating,534327.0,113981.0,121.74231,inf,inf
56,Lassen Transit Service Agency,LTSA,2023,Commuter Bus (CB) - (PT),121943.0,3331.0,0.0,0.0,0.0,Operating,727536.0,113981.0,218.41369,inf,inf
57,Lassen Transit Service Agency,LTSA,2023,Demand Response (DR) - (PT),2130.0,750.0,0.0,0.0,0.0,Operating,42885.0,113981.0,57.18,inf,inf
127,Lassen Transit Service Agency,Kern Transit,2022,Bus (MB) (Fixed Route) - (PT),74554.0,4775.0,36897.0,0.0,2.0,Operating,482043.0,157414.0,100.951414,37277.0,4.266309
128,Lassen Transit Service Agency,PCTC,2022,Commuter Bus (CB) - (PT),129807.0,3584.0,9352.0,0.0,2.0,Operating,627289.0,157414.0,175.024833,64903.5,16.832121
129,Lassen Transit Service Agency,LTSA,2022,Demand Response (DR) - (PT),11777.0,613.0,11981.0,0.0,1.0,Operating,137550.0,157414.0,224.388254,11777.0,13.138636
130,Lassen Transit Service Agency,CTA,2022,University Service (US) - (PT),,,,,,Operating,,157414.0,,,


In [119]:
# 1080443.0/35377.0
87736/35377.0

2.4800293976312293

In [159]:
allyears2 = (allyears2.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Fiscal_Year'], dropna=False)
 .apply(lambda x: x.assign(rev_speed=lambda x: x['Annual_VRM'] / x['Annual_VRH']))
)

allyears2.head()


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,56073.0,87.257654,56794.0,2.600005,17.21552
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,56073.0,74.535745,7433.5,46.610973,6.40267
2,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Operating,3605380.0,235034.0,202.492558,45802.428571,4.084245,18.007133
3,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Operating,1432792.0,235034.0,147.939288,11923.0,22.437613,12.31079
4,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Operating,81766.0,2866.0,158.768932,9619.0,8.818462,18.67767


In [160]:
allyears2 = (allyears2.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Fiscal_Year'], dropna=False)
 .apply(lambda x: x.assign(trips_per_hr=lambda x: x['Annual_UPT'] / x['Annual_VRH']))
)

allyears2.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,Fare_Revenues,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed,trips_per_hr
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,56073.0,87.257654,56794.0,2.600005,17.21552,4.358189
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,56073.0,74.535745,7433.5,46.610973,6.40267,1.036176
2,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Operating,3605380.0,235034.0,202.492558,45802.428571,4.084245,18.007133,6.464083
3,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Demand Response (DR) - (PT),119230.0,9685.0,20950.0,0.0,10.0,Operating,1432792.0,235034.0,147.939288,11923.0,22.437613,12.31079,2.163139
4,Alpine County Community Development,,2023,Demand Response (DR) - (DO),9619.0,515.0,325.0,209.0,1.0,Operating,81766.0,2866.0,158.768932,9619.0,8.818462,18.67767,0.631068


### Method B. Function for making ratio columns - DEPRECATED
The generic function approach turned out to be awkward: each ratio needed different enough handling that we are deprecating it.

In [45]:
# https://stackoverflow.com/questions/59865432/can-i-use-pandas-dataframe-assign-with-a-variable-name
# Making df.assign accept variables by passing in a dict

def make_ratio_cols(df, numerator, denominator, col_name, operation="sum"):
    # A column name is required; without one, _col_name would be undefined below.
    if col_name is None:
        raise ValueError("col_name must be provided")
    # Raise an error if the column already exists
    if col_name in df.columns:
        raise ValueError(f"Dataframe already has column '{col_name}'")
    _col_name = col_name
    
    df1 = df[df['Operating_Capital']=="Operating"]
    
    if operation == "sum":    
        df2 = (df1.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Mode', 'Fiscal_Year'])
              .apply(lambda x: x.assign(**{_col_name:
                     lambda x: x[numerator].sum() / x[denominator]}))
                    )
    # ADD 0 DIVISION CHECK
#     elif denominator == 0:
        
    else:
        df2 = (df1.groupby(['Organization_Legal_Name','Common_Name_Acronym_DBA', 'Mode', 'Fiscal_Year'])
              .apply(lambda x: x.assign(**{_col_name:
                     lambda x: x[numerator].mean() / x[denominator].mean()}))
                    )
        
    return df2

### Add in cost per hr to df 

**Cost per hour (CPH)**: importantly, this check compares each mode's CPH with the previous year's value.  
CPH = `Expenses on operations by mode/VRH`
  
Example NTD error message: `The calculated cost per hour for {MB - PT} equals {49.12}. The prior year’s calculated value equals {98.56}. This is a change of {-50.166} Percent.`

In [46]:
### testing
# # 'Annual VRH' in allyears.columns
# allyears = make_ratio_cols(allyears, 'Total_Annual_Expenses_By_Mode',
#                        'Annual_VRH', 
#                        'cost_per_hr')


allyears.head(3)

ValueError: Dataframe already has column 'cost_per_hr'

In [None]:
allyears = make_ratio_cols(allyears, 'Annual_VRM', 'VOMX', 'miles_per_veh')

**Add in:**
* Fare Revenues per unlinked passenger trip
* Revenue Speed
* Trips per hr

In [40]:
allyears = make_ratio_cols(allyears, 'Total_Annual_Expenses_By_Mode', 'Annual_UPT', 
                           'fare_rev_per_trip')  
allyears.head(3)

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,cost_per_hr,miles_per_veh,fare_rev_per_trip
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Capital,0.0,87.257654,113588.0,20.021538
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,87.257654,113588.0,20.021538
2,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Capital,0.0,74.535745,14867.0,71.9335


In [None]:
# This calculation is NOT summing across modes like the ones above that involve total expenses.
# Here we just want to calculate a fraction using the values in one row only,
# so we pass operation="mean" as a function argument. The function then does not sum across the numerator column.
allyears = make_ratio_cols(allyears, 'Annual_VRM', 'Annual_VRH', 'rev_speed', operation = "mean")
allyears.head()

In [55]:
# hand calculate Alpine value to double check it worked as it should have. Confirmed.
10386/643

16.152410575427684

In [29]:
allyears = make_ratio_cols(allyears, 'Annual_UPT', 'Annual_VRH', 'trips_per_hr', operation = "mean")
allyears.head()

Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed,trips_per_hr
0,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Capital,0.0,87.257654,113588.0,20.021538,17.21552,4.358189
1,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Bus (MB) (Fixed Route) - (PT),170382.0,9897.0,43133.0,0.0,3.0,Operating,863589.0,87.257654,113588.0,20.021538,17.21552,4.358189
2,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Capital,0.0,74.535745,14867.0,71.9335,6.40267,1.036176
3,City of Solvang,Santa Ynez Valley Transit/SYVT,2023,Demand Response (DR) - (PT),14867.0,2322.0,2406.0,0.0,2.0,Operating,173072.0,74.535745,14867.0,71.9335,6.40267,1.036176
4,"County of Nevada Public Works, Transit Service...",Nevada County Connects,2023,Bus (MB) (Fixed Route) - (DO),320617.0,17805.0,115093.0,0.0,7.0,Capital,2060201.0,318.201685,91604.857143,49.226113,18.007133,6.464083


In [30]:
### for testing
agency = 'Tuolumne County Transit Agency (TCTA)'
mode = 'Demand Response (DR) - (DO)'
allyears[allyears['Organization_Legal_Name'] == agency]


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed,trips_per_hr
32,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Capital,0.0,149.239677,53689.5,28.567444,16.065081,5.224117
33,Tuolumne County Transit Agency (TCTA),TCT,2023,Bus (MB) (Fixed Route) - (PT),107379.0,6684.0,34918.0,0.0,4.0,Operating,997518.0,149.239677,53689.5,28.567444,16.065081,5.224117
34,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Capital,0.0,85.606021,59758.5,44.641911,14.249419,1.917615
35,Tuolumne County Transit Agency (TCTA),TCT,2023,Demand Response (DR) - (PT),239034.0,16775.0,32168.0,0.0,8.0,Operating,1436041.0,85.606021,59758.5,44.641911,14.249419,1.917615
320,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Capital,156040.0,137.252743,42361.5,53.244121,13.569793,2.577801
321,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Capital,58640.0,147.284327,41917.0,28.091597,13.802107,5.243003
322,Tuolumne County Transit Agency (TCTA),YARTS,2022,Bus (MB) (Fixed Route) - (PT),83834.0,6074.0,31846.0,0.0,4.0,Operating,835965.0,147.284327,41917.0,28.091597,13.802107,5.243003
323,Tuolumne County Transit Agency (TCTA),Santa Ynez Valley Transit/SYVT,2022,Demand Response (DR) - (PT),169446.0,12487.0,32189.0,0.0,8.0,Operating,1557835.0,137.252743,42361.5,53.244121,13.569793,2.577801


In [34]:
# 239034/8
(239034*2)/(8*2)
(239034/8) + (239034/8)

59758.5

### Function development

In [14]:
import datetime

# this_year = datetime.datetime.now().year
# print(this_year)
this_year = 2022 #for testing purposes
last_year = this_year - 1
print(last_year)


allyears['Fiscal Year'].unique()

agencies = data['Organization Legal Name'].unique()

2021


In [20]:
### for testing
agency = 'Tuolomne County'
mode = 'Demand Response (DR) - (DO)'
allyears[allyears['Organization_Legal_Name'] == agency]


Unnamed: 0,Organization_Legal_Name,Common_Name_Acronym_DBA,Fiscal_Year,Mode,Annual_VRM,Annual_VRH,Annual_UPT,Sponsored_UPT,VOMX,Operating_Capital,Total_Annual_Expenses_By_Mode


In [31]:
variable = 'cost_per_hr'
(allyears[(allyears['Organization Legal Name'] == agency) 
          & (allyears['Mode']==mode)
        & (allyears['Fiscal Year'] == last_year)]
 [variable].unique().sum()) #[0])

69.6242059511869

In [26]:
(allyears[(allyears['Organization Legal Name'] == agency) 
        & (allyears['Fiscal Year'] == this_year)]['Mode'].unique()
)

array(['Commuter Bus (CB) - (DO)', 'Demand Response (DR) - (DO)',
       'Deviated Fixed Route (DF) - (DO)'], dtype=object)

In [82]:
df = allyears.copy()

this_year = 2022 #for testing purposes
last_year = this_year - 1

def rr20_ratios(df, variable, threshold, this_year, last_year):
    agencies = df['Organization Legal Name'].unique()
    output = []
    for agency in agencies:

        if len(df[df['Organization Legal Name']==agency]) > 0:
        # Check whether data for both years is present, if so perform prior yr comparison.
            if (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==this_year)]) > 0) \
                & (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==last_year)]) > 0): 

                for mode in df[df['Organization Legal Name'] == agency]['Mode'].unique():
                    value_thisyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == this_year)]
                                  [variable].unique()[0], 2))
                    value_lastyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == last_year)]
                                  [variable].unique()[0], 2))
                    check_name = variable
                    if (value_lastyr == 0) and (abs(value_thisyr - value_lastyr) >= threshold):
                        result = "fail"
                        description = (f"The {variable} for {mode} has changed from last year by >= {threshold*100}%, please provide a narrative justification.")
                    elif abs((value_lastyr - value_thisyr)/value_lastyr) >= threshold:
                        result = "fail"
                        # Report the absolute change so the message matches the pct_change column
                        description = (f"The {variable} for {mode} has changed from last year by {round(abs((value_lastyr - value_thisyr)/value_lastyr)*100, 1)}%, please provide a narrative justification.")
                    else:
                        result = "pass"
                        description = ""

                    output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                                   "mode": mode,
                            "value_checked": f"{this_year} = {value_thisyr}, {last_year} = {value_lastyr}",
                            "pct_change": round(abs((value_lastyr - value_thisyr)/value_lastyr)*100, 1),
                            "check_status": result,
                            "Description": description}
                    output.append(output_line)
        else:
            print(f"There is no data for {agency}")
    checks = pd.DataFrame(output).sort_values(by="Organization")
    return checks
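
The nested loops above could also be expressed as a merge of this year's rows with last year's rows. The following is only a sketch of that alternative, not the implementation used in `rr20_check.py`. The function name `yoy_ratio_checks` and the output column names are hypothetical, and the demo values are the `miles_per_veh` figures shown in the check output of this notebook.

```python
import numpy as np
import pandas as pd


def yoy_ratio_checks(df, variable, threshold, this_year, last_year):
    """Compare `variable` between two fiscal years for every agency/mode pair."""
    keys = ["Organization Legal Name", "Mode"]
    # Operating and Capital rows repeat the same ratio, so keep one row per key and year
    per_year = df.drop_duplicates(subset=keys + ["Fiscal Year"])
    current = per_year.loc[per_year["Fiscal Year"] == this_year, keys + [variable]]
    prior = per_year.loc[per_year["Fiscal Year"] == last_year, keys + [variable]]
    # Inner merge: only agency/mode pairs reported in both years are compared
    merged = current.merge(prior, on=keys, suffixes=("_this", "_last"))
    # A prior-year value of 0 becomes NaN so the division cannot produce inf;
    # such rows come out as "pass" here and need a separate zero check
    last = merged[f"{variable}_last"].replace(0, np.nan)
    change = (merged[f"{variable}_this"] - last).abs() / last.abs()
    merged["pct_change"] = (change * 100).round(1)
    merged["check_status"] = np.where(change >= threshold, "fail", "pass")
    return merged


demo = pd.DataFrame({
    "Organization Legal Name": ["Alpine County Community Development"] * 2
                               + ["Amador Transit"] * 2,
    "Mode": ["Demand Response (DR) - (DO)"] * 4,
    "Fiscal Year": [2022, 2021, 2022, 2021],
    "miles_per_veh": [20772.0, 24772.0, 12534.8, 12534.8],
})
print(yoy_ratio_checks(demo, "miles_per_veh", 0.30, 2022, 2021))
```

Because of the inner merge, a mode that appears in only one of the two years is dropped from the comparison rather than raising an error on an empty selection.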

In [86]:
cph_checks = rr20_ratios(allyears, 'cost_per_hr', .3, this_year, last_year)
mpv_checks = rr20_ratios(allyears, 'miles_per_veh', .3, this_year, last_year)
fare_rev_checks = rr20_ratios(allyears, 'fare_rev_per_trip', .25, this_year, last_year)
rev_speed_checks = rr20_ratios(allyears, 'rev_speed', .15, this_year, last_year)
tph_checks = rr20_ratios(allyears, 'trips_per_hr', .30, this_year, last_year)




In [87]:
##---------------- Uncomment the following line pairs one by one to inspect the results
# print(cph_checks.shape)
# cph_checks.head()

print(mpv_checks.shape)
mpv_checks.head()

# print(fare_rev_checks.shape)
# fare_rev_checks.head()

# print(rev_speed_checks.shape)
# rev_speed_checks.head()

# print(tph_checks.shape)
# tph_checks.head()

(86, 7)


Unnamed: 0,Organization,name_of_check,mode,value_checked,pct_change,check_status,Description
0,Alpine County Community Development,Annual VRM,Demand Response (DR) - (DO),"2022 = 10386.0, 2021 = 12386.0",16.1,pass,
1,Amador Transit,Annual VRM,Commuter Bus (CB) - (DO),"2022 = 45472.0, 2021 = 40000.0",13.7,pass,
2,Amador Transit,Annual VRM,Demand Response (DR) - (DO),"2022 = 31337.0, 2021 = 31337.0",0.0,pass,
3,Amador Transit,Annual VRM,Deviated Fixed Route (DF) - (DO),"2022 = 153757.0, 2021 = 153757.0",0.0,pass,
4,Calaveras Transit Agency,Annual VRM,Demand Response (DR) - (PT),"2022 = 23812.0, 2021 = 40000.0",40.5,fail,The Annual VRM for Demand Response (DR) - (PT)...


In [91]:
rr20_checks = pd.concat([cph_checks, mpv_checks, fare_rev_checks, rev_speed_checks, tph_checks]).sort_values(by="Organization")

In [92]:
print(rr20_checks.shape)
rr20_checks.head(10)

(430, 7)


Unnamed: 0,Organization,name_of_check,mode,value_checked,pct_change,check_status,Description
0,Alpine County Community Development,cost_per_hr,Demand Response (DR) - (DO),"2022 = 118.11, 2021 = 118.11",0.0,pass,
0,Alpine County Community Development,trips_per_hr,Demand Response (DR) - (DO),"2022 = 0.6, 2021 = 0.6",0.0,pass,
0,Alpine County Community Development,rev_speed,Demand Response (DR) - (DO),"2022 = 16.15, 2021 = 19.26",16.1,fail,The rev_speed for Demand Response (DR) - (DO) ...
0,Alpine County Community Development,fare_rev_per_trip,Demand Response (DR) - (DO),"2022 = 197.77, 2021 = 197.77",0.0,pass,
0,Alpine County Community Development,miles_per_veh,Demand Response (DR) - (DO),"2022 = 20772.0, 2021 = 24772.0",16.1,pass,
1,Amador Transit,rev_speed,Commuter Bus (CB) - (DO),"2022 = 27.78, 2021 = 40.0",30.6,fail,The rev_speed for Commuter Bus (CB) - (DO) has...
3,Amador Transit,fare_rev_per_trip,Deviated Fixed Route (DF) - (DO),"2022 = 60.51, 2021 = 60.51",0.0,pass,
2,Amador Transit,fare_rev_per_trip,Demand Response (DR) - (DO),"2022 = 29.63, 2021 = 29.63",0.0,pass,
3,Amador Transit,miles_per_veh,Deviated Fixed Route (DF) - (DO),"2022 = 27955.82, 2021 = 27955.82",0.0,pass,
2,Amador Transit,miles_per_veh,Demand Response (DR) - (DO),"2022 = 12534.8, 2021 = 12534.8",0.0,pass,


In [93]:
rr20_checks.to_excel("../data/test_rr20.xlsx", index=False)

#### Function to check whether the VRM changed significantly.
Unlike the above metrics, VRM is not a ratio - the NTD error is just comparing a single number to its value from the previous year.  
  
Example NTD error message: `For {DR - DO|DR - PT|CB - DO| CB - PT|MB - PT|MB - DO}, Annual Vehicle Revenue Miles {} {} percent compared to last year from {} to {}. This change is high.`
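
The decision logic for one pair of values can be sketched on its own, separate from the dataframe looping. This is a simplified illustration with a hypothetical name (`check_single_value`). It tests for exact zeros rather than rounding first, and it returns only the pass/fail status.

```python
def check_single_value(this_value, last_value, threshold=None):
    """Return "fail" or "pass" for one agency/mode value compared with last year."""
    # A change from zero to non-zero (or the reverse) is always flagged
    if (this_value == 0) != (last_value == 0):
        return "fail"
    # Nothing more to check if both values are zero or no threshold was requested
    if threshold is None or last_value == 0:
        return "pass"
    change = abs(this_value - last_value) / abs(last_value)
    return "fail" if change >= threshold else "pass"


# Annual VRM values taken from the check output in this notebook
print(check_single_value(23812.0, 40000.0, threshold=0.30))  # Calaveras: about a 40.5% drop
print(check_single_value(10386.0, 12386.0, threshold=0.30))  # Alpine: about a 16.1% drop
print(check_single_value(0.0, 2.0))                          # non-zero last year, zero this year
```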

In [107]:
def check_single_number(df, variable, this_year, last_year, threshold=None):
    agencies = df['Organization Legal Name'].unique()
    output = []
    for agency in agencies:

        if len(df[df['Organization Legal Name']==agency]) > 0:
        # Check whether data for both years is present, if so perform prior yr comparison.
            if (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==this_year)]) > 0) \
                & (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==last_year)]) > 0): 

                for mode in df[df['Organization Legal Name'] == agency]['Mode'].unique():
                    value_thisyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == this_year)]
                                  [variable].unique()[0], 2))
                    value_lastyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == last_year)]
                                  [variable].unique()[0], 2))
                    
                    check_name = variable
                    # Flag any change from zero to non-zero, or from non-zero to zero
                    if (round(value_thisyr) == 0 and round(value_lastyr) != 0) or (round(value_thisyr) != 0 and round(value_lastyr) == 0):
                        result = "fail"
                        description = (f"The {variable} for {mode} has changed either from or to zero compared to last year. Please provide a narrative justification.")
                    # run only the above check on whether something changed from zero to non-zero, if no threshold is given
                    elif threshold is None:
                        result = "pass"
                        description = ""
                    # also check for pct change, if a threshold parameter is passed into function
                    elif (value_lastyr == 0) and (abs(value_thisyr - value_lastyr) >= threshold):
                        result = "fail"
                        description = (f"The {variable} for {mode} was 0 last year and has changed by >= {threshold*100}%, please provide a narrative justification.")
                    elif abs((value_lastyr - value_thisyr)/value_lastyr) >= threshold:
                        result = "fail"
                        description = (f"The {variable} for {mode} has changed from last year by {round(abs((value_lastyr - value_thisyr)/value_lastyr)*100, 1)}%; please provide a narrative justification.")
                    else:
                        result = "pass"
                        description = ""

                    output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                                   "mode": mode,
                            "value_checked": f"{this_year} = {value_thisyr}, {last_year} = {value_lastyr}",
                            "check_status": result,
                            "Description": description}
                    output.append(output_line)
        else:
            print(f"There is no data for {agency}")
    checks = pd.DataFrame(output).sort_values(by="Organization")
    return checks

In [109]:
numeric_columns = allyears.select_dtypes(include=['number']).columns
allyears[numeric_columns] = allyears[numeric_columns].fillna(0)

vrm_checks = check_single_number(allyears,  'Annual VRM', this_year, last_year, threshold=.30)
vrm_checks


Unnamed: 0,Organization,name_of_check,mode,value_checked,check_status,Description
0,Alpine County Community Development,Annual VRM,Demand Response (DR) - (DO),"2022 = 10386.0, 2021 = 12386.0",pass,
1,Amador Transit,Annual VRM,Commuter Bus (CB) - (DO),"2022 = 45472.0, 2021 = 40000.0",pass,
2,Amador Transit,Annual VRM,Demand Response (DR) - (DO),"2022 = 31337.0, 2021 = 31337.0",pass,
3,Amador Transit,Annual VRM,Deviated Fixed Route (DF) - (DO),"2022 = 153757.0, 2021 = 153757.0",pass,
4,Calaveras Transit Agency,Annual VRM,Demand Response (DR) - (PT),"2022 = 23812.0, 2021 = 40000.0",fail,The Annual VRM for Demand Response (DR) - (PT)...
...,...,...,...,...,...,...
80,Town of Truckee,Annual VRM,Bus (MB) (Fixed Route) - (PT),"2022 = 101480.0, 2021 = 101480.0",pass,
82,Trinity County Department of Transportation,Annual VRM,Intercity Service (IC) - (DO),"2022 = 116976.0, 2021 = 116976.0",pass,
84,Tuolumne County Transit Agency (TCTA),Annual VRM,Demand Response (DR) - (PT),"2022 = 169446.0, 2021 = 169446.0",pass,
83,Tuolumne County Transit Agency (TCTA),Annual VRM,Bus (MB) (Fixed Route) - (PT),"2022 = 83834.0, 2021 = 83834.0",pass,


In [110]:
fare_rev_checks = check_single_number(allyears, 'fare_rev_per_trip', this_year, last_year)
fare_rev_checks

Unnamed: 0,Organization,name_of_check,mode,value_checked,check_status,Description
0,Alpine County Community Development,fare_rev_per_trip,Demand Response (DR) - (DO),"2022 = 197.77, 2021 = 197.77",pass,
1,Amador Transit,fare_rev_per_trip,Commuter Bus (CB) - (DO),"2022 = 127.79, 2021 = 127.79",pass,
2,Amador Transit,fare_rev_per_trip,Demand Response (DR) - (DO),"2022 = 29.63, 2021 = 29.63",pass,
3,Amador Transit,fare_rev_per_trip,Deviated Fixed Route (DF) - (DO),"2022 = 60.51, 2021 = 60.51",pass,
4,Calaveras Transit Agency,fare_rev_per_trip,Demand Response (DR) - (PT),"2022 = 99.93, 2021 = 99.93",pass,
...,...,...,...,...,...,...
80,Town of Truckee,fare_rev_per_trip,Bus (MB) (Fixed Route) - (PT),"2022 = 26.46, 2021 = 26.46",pass,
82,Trinity County Department of Transportation,fare_rev_per_trip,Intercity Service (IC) - (DO),"2022 = 116.08, 2021 = 116.08",pass,
84,Tuolumne County Transit Agency (TCTA),fare_rev_per_trip,Demand Response (DR) - (PT),"2022 = 53.24, 2021 = 53.24",pass,
83,Tuolumne County Transit Agency (TCTA),fare_rev_per_trip,Bus (MB) (Fixed Route) - (PT),"2022 = 28.09, 2021 = 28.09",pass,


In [111]:
voms0_check = check_single_number(allyears, 'VOMX', this_year, last_year)
voms0_check

Unnamed: 0,Organization,name_of_check,mode,value_checked,check_status,Description
0,Alpine County Community Development,VOMX,Demand Response (DR) - (DO),"2022 = 1.0, 2021 = 1.0",pass,
1,Amador Transit,VOMX,Commuter Bus (CB) - (DO),"2022 = 1.0, 2021 = 1.0",pass,
2,Amador Transit,VOMX,Demand Response (DR) - (DO),"2022 = 5.0, 2021 = 5.0",pass,
3,Amador Transit,VOMX,Deviated Fixed Route (DF) - (DO),"2022 = 11.0, 2021 = 11.0",pass,
4,Calaveras Transit Agency,VOMX,Demand Response (DR) - (PT),"2022 = 5.0, 2021 = 5.0",pass,
...,...,...,...,...,...,...
80,Town of Truckee,VOMX,Bus (MB) (Fixed Route) - (PT),"2022 = 3.0, 2021 = 3.0",pass,
82,Trinity County Department of Transportation,VOMX,Intercity Service (IC) - (DO),"2022 = 4.0, 2021 = 4.0",pass,
84,Tuolumne County Transit Agency (TCTA),VOMX,Demand Response (DR) - (PT),"2022 = 8.0, 2021 = 8.0",pass,
83,Tuolumne County Transit Agency (TCTA),VOMX,Bus (MB) (Fixed Route) - (PT),"2022 = 4.0, 2021 = 4.0",pass,


#### Check on any missing service data
NTD error: `One or more service data fields are missing for {}: {}. You must report miles, hours, trips and vehicles operated in maximum service (VOMS) if the transit unit operated revenue service during the fiscal year.`
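
A minimal sketch of such a check follows. It assumes that blank cells are still `NaN` (that is, it runs before any `fillna(0)` step) and that the four service fields map to the columns `Annual VRM`, `Annual VRH`, `Annual UPT` and `VOMX`. The function name and the `missing_fields` output column are hypothetical. It does not decide whether the mode actually operated revenue service that year.

```python
import numpy as np
import pandas as pd

# Assumed mapping of "miles, hours, trips and VOMS" to column names
SERVICE_FIELDS = ["Annual VRM", "Annual VRH", "Annual UPT", "VOMX"]


def missing_service_fields(df, fields=SERVICE_FIELDS):
    """Return one row per agency/mode that has at least one blank service field."""
    blanks = df[fields].isna()
    mask = blanks.any(axis=1)
    out = df.loc[mask, ["Organization Legal Name", "Mode"]].copy()
    # List the blank fields for each flagged row
    out["missing_fields"] = [
        ", ".join(f for f in fields if blanks.at[i, f]) for i in out.index
    ]
    return out


demo = pd.DataFrame({
    "Organization Legal Name": ["Lassen Transit Service Agency", "Amador Transit"],
    "Mode": ["University Service (US) - (PT)", "Demand Response (DR) - (DO)"],
    "Annual VRM": [np.nan, 31337.0],
    "Annual VRH": [np.nan, 2991.0],
    "Annual UPT": [np.nan, 7028.0],
    "VOMX": [np.nan, 5.0],
})
print(missing_service_fields(demo))
```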

In [112]:
agency = 'Amador Transit'
mode = 'Demand Response (DR) - (DO)'
allyears[allyears['Organization Legal Name'] == agency]

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh,fare_rev_per_trip,rev_speed,trips_per_hr
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0,127.790669,27.777642,0.995113
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158,90944.0,127.790669,27.777642,0.995113
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8,29.630905,10.477098,2.349716
5,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,69.624206,12534.8,29.630905,10.477098,2.349716
6,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Capital,0.0,134.220898,27955.818182,60.513769,21.056834,2.218022
7,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0,134.220898,27955.818182,60.513769,21.056834,2.218022
174,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0,Capital,0.0,208.171,80000.0,127.790669,40.0,1.629
175,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0,Operating,208171.0,208.171,80000.0,127.790669,40.0,1.629
176,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8,29.630905,10.477098,2.349716
177,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,69.624206,12534.8,29.630905,10.477098,2.349716


In [113]:
# first double-check how 0 vs blank cells are imported into pandas
test = pd.read_excel("~/Downloads/NTD_Annual_Report_Rural_2023_2023-09-28.xlsx", 
                     sheet_name="Service Data", index_col=None)
test

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,City of Arvin,,2023,Demand Response (DR) - (DO),12495.0,1988.0,4551.0,0.0,1.0
1,City of Arvin,,2023,Deviated Fixed Route (DF) - (DO),120801.0,5499.0,55125.0,0.0,3.0
2,City of Corcoran - Corcoran Area Transit,,2023,Demand Response (DR) - (DO),31978.0,3691.0,25199.0,0.0,4.0
3,County of Shasta Department of Public Works,,2023,Bus (MB) (Fixed Route) - (PT),,,,,
4,County of Shasta Department of Public Works,,2023,Commuter Bus (CB) - (PT),,,,,
5,Plumas County Transportation Commission,PCTC,2023,Deviated Fixed Route (DF) - (PT),196883.0,5929.0,24699.0,0.0,6.0
6,Tehama County Transit Agency,TRAX,2023,Bus (MB) (Fixed Route) - (PT),463998.0,20838.0,105515.0,0.0,7.0
7,Tehama County Transit Agency,TRAX,2023,Demand Response (DR) - (PT),114389.0,6626.0,14348.0,0.0,7.0


In [119]:
print(test.loc[3, 'Annual VRM'])
# pd.isnull(test.loc[3, 'Annual VRM']) #True
# pd.isna(test.loc[3, 'Annual VRM']) #True

print(test.loc[2, 'Sponsored UPT'])
pd.isnull(test.loc[2, 'Sponsored UPT']) #False for 0 values

nan
0.0


False
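The zero-versus-blank behavior can be reproduced without the Excel file. The sketch below uses an in-memory CSV and `pd.read_csv` instead of `pd.read_excel`. The two sample rows and the agency names are made up for illustration, and the assumption is that blank cells are parsed the same way in both readers.

```python
import io
import pandas as pd

# Hypothetical two-row sample: one row reports a zero, the other is left blank.
csv_text = (
    "Organization Legal Name,Annual VRM,Sponsored UPT\n"
    "Agency A,12495,0\n"
    "Agency B,,\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Blank cells are read as NaN; reported zeros stay numeric zeros.
zero_is_null = pd.isnull(sample.loc[0, "Sponsored UPT"])
blank_is_null = pd.isnull(sample.loc[1, "Sponsored UPT"])
print(zero_is_null, blank_is_null)
```

This is why the missing-data check can rely on `isnull()` without flagging agencies that legitimately report zero.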

In [125]:
service_cols = ['Annual VRM', 'Annual VRH', "Annual UPT",
               "Sponsored UPT", "VOMX"]

mask = [test[x].isnull() for x in service_cols]
mask

[0    False
 1    False
 2    False
 3     True
 4     True
 5    False
 6    False
 7    False
 Name: Annual VRM, dtype: bool,
 0    False
 1    False
 2    False
 3     True
 4     True
 5    False
 6    False
 7    False
 Name: Annual VRH, dtype: bool,
 0    False
 1    False
 2    False
 3     True
 4     True
 5    False
 6    False
 7    False
 Name: Annual UPT, dtype: bool,
 0    False
 1    False
 2    False
 3     True
 4     True
 5    False
 6    False
 7    False
 Name: Sponsored UPT, dtype: bool,
 0    False
 1    False
 2    False
 3     True
 4     True
 5    False
 6    False
 7    False
 Name: VOMX, dtype: bool]

In [126]:
test[test[mask]]

KeyError: "None of [Index([(False, False, False, True, True, False, False, False),\n       (False, False, False, True, True, False, False, False),\n       (False, False, False, True, True, False, False, False),\n       (False, False, False, True, True, False, False, False),\n       (False, False, False, True, True, False, False, False)],\n      dtype='object')] are in the [columns]"
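The `KeyError` above arises because indexing a dataframe with a list of Series is interpreted as a column selection, not as a row filter. One way to collapse the per-column null flags into a single row mask is sketched below. The `demo` dataframe is a hypothetical stand-in for `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `test` dataframe above.
demo = pd.DataFrame({
    "Organization Legal Name": ["Agency A", "Agency B", "Agency C"],
    "Annual VRM": [100.0, np.nan, 300.0],
    "Annual VRH": [10.0, np.nan, 30.0],
    "Annual UPT": [50.0, np.nan, 70.0],
    "Sponsored UPT": [0.0, np.nan, np.nan],
    "VOMX": [1.0, np.nan, 2.0],
})
service_cols = ["Annual VRM", "Annual VRH", "Annual UPT", "Sponsored UPT", "VOMX"]

# One boolean per row: True if any service column is missing in that row.
row_mask = demo[service_cols].isnull().any(axis=1)
rows_missing = demo[row_mask]
print(rows_missing["Organization Legal Name"].tolist())
```

Building the mask from a column list also avoids repeating or omitting a column when the conditions are chained by hand with `|`.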

In [158]:
mask = test['Annual VRM'].isnull() | test['Annual VRH'].isnull() | test['Annual UPT'].isnull() | test['Sponsored UPT'].isnull() | test["VOMX"].isnull()
print(test[mask])

                       Organization Legal Name Common Name/Acronym/DBA  \
3  County of Shasta Department of Public Works                     NaN   
4  County of Shasta Department of Public Works                     NaN   

   Fiscal Year                           Mode  Annual VRM  Annual VRH  \
3         2023  Bus (MB) (Fixed Route) - (PT)         NaN         NaN   
4         2023       Commuter Bus (CB) - (PT)         NaN         NaN   

   Annual UPT  Sponsored UPT  VOMX  
3         NaN            NaN   NaN  
4         NaN            NaN   NaN  


In [162]:
orgs_missing_data = test[mask]['Organization Legal Name'].unique()
orgs_missing_data

array(['County of Shasta Department of Public Works'], dtype=object)

In [None]:
output = []
for x in orgs_missing_data:
    result = "fail"
    check_name = "Missing service data check"
    mode = ""
    # Adjacent string literals are joined into one message
    description = ("One or more service data values is missing in these columns: "
                   "'Annual VRM', 'Annual VRH', 'Annual UPT', 'Sponsored UPT', 'VOMX'. "
                   "Please revise in BlackCat and resubmit.")
    output_line = {"Organization": x,
                   "name_of_check": check_name,
                   "mode": mode,
                   "value_checked": "Service data columns",
                   "check_status": result,
                   "Description": description}
    output.append(output_line)

In [170]:
# Test code to subtract lists from each other
orgs_missing_data = ['a', 'b', 'c']
all_orgs = ['a', 'b', 'c', 'd', 'e']
orgs_not_missing_data = list(set(all_orgs) - set(orgs_missing_data))
orgs_not_missing_data

['d', 'e']

In [184]:
def check_missing_servicedata(df):
    agencies = df['Organization Legal Name'].unique()
    
    mask = df['Annual VRM'].isnull() | df['Annual VRH'].isnull() | df['Annual UPT'].isnull() | df['Sponsored UPT'].isnull() | df["VOMX"].isnull()
    orgs_missing_data = df[mask]['Organization Legal Name'].unique()
    print(f"missing = {orgs_missing_data}")
    orgs_not_missing_data = list(set(agencies) - set(orgs_missing_data))
    print(f"Not missing = {orgs_not_missing_data}")
    
    output = []
    for x in agencies:
        if x in orgs_missing_data:
            result = "fail"
            check_name = "Missing service data check"
            mode = ""
            description = ("One or more service data values is missing in these columns: 'Annual VRM', 'Annual VRH', 'Annual UPT', 'Sponsored UPT', 'VOMX'. Please revise in BlackCat and resubmit.")
        elif x in orgs_not_missing_data:
            result = "pass"
            check_name = "Missing service data check"
            mode = ""
            description = ""
        output_line = {"Organization": x,
                    "name_of_check" : check_name,
                    "mode": mode,
                        "value_checked": "Service data columns",
                        "check_status": result,
                        "Description": description}
        output.append(output_line)
    checks = pd.DataFrame(output).sort_values(by="Organization")
    
    return checks

In [185]:
testcheck = check_missing_servicedata(test)
testcheck

missing = ['County of Shasta Department of Public Works']
Not missing = ['Plumas County Transportation Commission', 'City of Corcoran - Corcoran Area Transit', 'City of Arvin', 'Tehama County Transit Agency']


Unnamed: 0,Organization,name_of_check,mode,value_checked,check_status,Description
0,City of Arvin,Missing service data check,,Service data columns,pass,
1,City of Corcoran - Corcoran Area Transit,Missing service data check,,Service data columns,pass,
2,County of Shasta Department of Public Works,Missing service data check,,Service data columns,fail,One or more service data values is missing in ...
3,Plumas County Transportation Commission,Missing service data check,,Service data columns,pass,
4,Tehama County Transit Agency,Missing service data check,,Service data columns,pass,


In [155]:
agency = 'City of Arvin'
mode = 'Demand Response (DR) - (DO)'

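# .iloc[0] takes the first matching row by position; [0] is a label lookup that only works when that row's index label is 0
test[(test['Organization Legal Name']==agency) & (test['Mode']==mode)]['Annual VRM'].iloc[0]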

12495.0

In [156]:
## Not using this one - much less efficient.

# def check_missing_servicedata(df):
#     service_cols = ['Annual VRM', 'Annual VRH', "Annual UPT",
#                "Sponsored UPT", "VOMX"]
#     agencies = df['Organization Legal Name'].unique()
#     output = []
#     for agency in agencies:
#         if len(df[df['Organization Legal Name']==agency]) > 0:
#                 for mode in df[df['Organization Legal Name'] == agency]['Mode'].unique():
                    
#                     if (df[(df['Organization Legal Name']==agency) & (df['Mode']==mode)]['Annual VRM'][0].isnull()): 
#                     #| (df['Annual VRH'].isnull()) | (df['Annual UPT'].isnull()) | (df['Annual UPT'].isnull()) | (df["VOMX"].isnull()):
#                         nan_cols = [i for i in df.service_cols if df[i].isnull().any()]
#                         result = "fail"
#                         check_name = "Missing service data check"
#                         mode = mode
#                         description = (f"Service data is missing for {mode} in {nan_cols}. Please revise in BlackCat and resubmit.")
#                     else:
#                         result = "pass"
#                         check_name = "Missing service data check"
#                         mode = mode
#                         description = ""
                    
#                     output_line = {"Organization": agency,
#                            "name_of_check" : check_name,
#                                    "mode": mode,
#                             "value_checked": "Service data columns",
#                             "check_status": result,
#                             "Description": description}
#                     output.append(output_line)
#     checks = pd.DataFrame(output).sort_values(by="Organization")
#     return checks
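The abandoned version above aimed to report, per agency and mode, which service columns are missing. A minimal sketch of that idea without the nested loops follows. The helper name `missing_service_cols_by_mode`, the `missing_columns` output column, and the `demo` data are all hypothetical and are not part of `rr20_check.py`.

```python
import numpy as np
import pandas as pd

SERVICE_COLS = ["Annual VRM", "Annual VRH", "Annual UPT", "Sponsored UPT", "VOMX"]

def missing_service_cols_by_mode(df, service_cols=SERVICE_COLS):
    """Hypothetical helper: one row per input row, naming its missing service columns."""
    null_flags = df[service_cols].isnull()
    # For each row, join the names of the columns flagged as null.
    missing = null_flags.apply(
        lambda row: ", ".join(row.index[row.to_numpy()]), axis=1
    )
    out = df[["Organization Legal Name", "Mode"]].copy()
    out["missing_columns"] = missing
    out["check_status"] = np.where(null_flags.any(axis=1), "fail", "pass")
    return out.reset_index(drop=True)

# Made-up sample data for illustration.
demo = pd.DataFrame({
    "Organization Legal Name": ["Agency A", "Agency B"],
    "Mode": ["Demand Response (DR) - (DO)", "Bus (MB) (Fixed Route) - (PT)"],
    "Annual VRM": [100.0, np.nan],
    "Annual VRH": [10.0, 20.0],
    "Annual UPT": [50.0, 60.0],
    "Sponsored UPT": [0.0, 0.0],
    "VOMX": [1.0, np.nan],
})
report = missing_service_cols_by_mode(demo)
print(report)
```

The per-row column names could be folded into the `Description` field if a more specific message than the agency-level check is wanted.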


---
  
The cells below show how I made the fake 2021 data. The code is now commented out, but the outputs are kept for reference.

In [9]:
# fake2021_service = data1.copy()
# fake2021_service.head(3)

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0
2,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0


In [42]:
##----- Tried these ways of automatically generating fake data, but they take too much memory and crash

# import numpy as np

# def vrm(vrm):
#     return np.random.rand(int(vrm*0.9), int(vrm*1.1))


# fake2021_service['VRM'] = fake2021_service['Annual VRM'].apply(lambda x: vrm(x))

# fake2021_service.loc[50, 'Annual VRM'] = vrm(fake2021_service.loc[50, 'Annual VRM'])
# print(fake2021_service.loc[50, 'Annual VRM'] )
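A likely reason the attempt above ran out of memory is that `np.random.rand(a, b)` treats its arguments as array dimensions, so it tries to allocate an `a`-by-`b` array. It does not draw a single value between `a` and `b`. A minimal sketch of one alternative, using NumPy's `Generator.uniform` to scale each value by a random factor, is below. The helper name `jitter` and the 10% range are assumptions chosen to mirror the 0.9 and 1.1 multipliers in the commented code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the fake data is reproducible

def jitter(values, pct=0.1):
    """Hypothetical helper: scale each value by a random factor in [1 - pct, 1 + pct)."""
    factors = rng.uniform(1 - pct, 1 + pct, size=len(values))
    return values * factors

# Sample values taken from the 'Annual VRM' column shown above.
real = pd.Series([10386.0, 45472.0, 31337.0], name="Annual VRM")
fake = jitter(real)
print(fake)
```

This draws one factor per row, so memory use scales with the number of rows and not with the size of the values.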

In [10]:
# fake2021_service.loc[0,'Annual VRM'] = 12386
# print(fake2021_service.loc[0,'Annual VRM'])
# fake2021_service.loc[1,'Annual VRM'] = 40000
# print(fake2021_service.loc[1,'Annual VRM'])
# fake2021_service.loc[4,'Annual VRM'] = 40000
# fake2021_service.loc[1,'Annual VRH'] = 1000
# print(fake2021_service.loc[1,'Annual VRH'])
# print(fake2021_service.loc[4,'Annual VRM'])

# fake2021_service.loc[9,'VOMX'] = 7
# print(fake2021_service.loc[9,'VOMX'])

# fake2021_service.loc[:,'Fiscal Year'] = 2021

12386.0
40000.0
1000.0
40000.0
7.0


In [11]:
# fake2021_service.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2021,Demand Response (DR) - (DO),12386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0
2,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0
3,Amador Transit,,2021,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0
4,Calaveras Transit Agency,CTA,2021,Demand Response (DR) - (PT),40000.0,1416.0,2104.0,0.0,5.0


In [12]:
# fake_exp = rr20_exp_by_mode.copy()
# fake_exp['Fiscal Year'] = 2021

In [157]:
### Export, for code development
## We will assume that the data coming in from the API will be *exactly* in this format

# with pd.ExcelWriter("../data/NTD_Annual_Report_Rural_2021_fake.xlsx") as writer:
    
#     fake2021_service.to_excel(writer, sheet_name="Service Data", index=False)
#     fake_exp.to_excel(writer, sheet_name="Expenses By Mode", index=False)


In [33]:
test2 = allyears.copy()
test2.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,118.108865,20772.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,118.108865,20772.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158,90944.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8


In [40]:
test2.loc[0,'Total Annual Expenses By Mode'] = 2000
test2.head(3)

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,2000.0,118.108865,20772.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,118.108865,20772.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0


In [41]:
test2 = (test2.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])
 .apply(lambda x: x.assign(cost_per_hr=lambda x: x['Total Annual Expenses By Mode'].sum() / x['Annual VRH']))
)

test2.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,2000.0,121.219285,20772.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,121.219285,20772.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158,90944.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8


In [42]:
(75944)/643.0


118.10886469673406

In [43]:
(75944 + 2000)/643.0

121.21928460342146
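The hand calculations above confirm that `cost_per_hr` sums the Capital and Operating expenses within each agency, mode, and year before dividing by `Annual VRH`. As a sketch of an alternative to `groupby(...).apply(...)`, the same ratio can be computed with `groupby(...).transform("sum")`, which keeps the original row index. The `demo` dataframe below reuses the two Alpine County rows shown above and is only for illustration.

```python
import pandas as pd

# Two rows copied from the modified test2 output above.
demo = pd.DataFrame({
    "Organization Legal Name": ["Alpine County Community Development"] * 2,
    "Mode": ["Demand Response (DR) - (DO)"] * 2,
    "Fiscal Year": [2022, 2022],
    "Operating/Capital": ["Capital", "Operating"],
    "Total Annual Expenses By Mode": [2000.0, 75944.0],
    "Annual VRH": [643.0, 643.0],
})

group_cols = ["Organization Legal Name", "Mode", "Fiscal Year"]

# Broadcast each group's total expenses back onto its rows, then divide by hours.
total_expenses = demo.groupby(group_cols)["Total Annual Expenses By Mode"].transform("sum")
demo["cost_per_hr"] = total_expenses / demo["Annual VRH"]
print(demo["cost_per_hr"].tolist())
```

Both rows receive the same value, matching the `(75944 + 2000)/643.0` check.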