### Prototyping automatic validation checks for NTD reporting data: form RR-20
  
Decided *not* to use Pandera to validate this form. It is more straightforward and customizable to write our own functions.  
  
This notebook first imports an Excel export of cleaned data, presumably submitted through the RR-20 form, generated with the report function on the BlackCat website. We assume that a future API will provide this data in the **exact same format** as the generated report, which is what we requested from the BlackCat developers. 
  
This notebook shows the development of the functions that are used in the executable file `rr20_check.py`.
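
Because the checks depend on that format assumption, it may help to confirm the expected columns are present before running anything else. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, not part of `rr20_check.py`, and the column list is copied from the service data output shown later in this notebook.

```python
import pandas as pd

# Column names copied from the service data output shown in this notebook.
SERVICE_COLUMNS = [
    "Organization Legal Name", "Common Name/Acronym/DBA", "Fiscal Year", "Mode",
    "Annual VRM", "Annual VRH", "Annual UPT", "Sponsored UPT", "VOMX",
]

def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Tiny stand-in frame that lacks the VOMX column
sample = pd.DataFrame(columns=SERVICE_COLUMNS[:-1])
print(missing_columns(sample, SERVICE_COLUMNS))  # ['VOMX']
```

An empty list means every expected column was found; anything else could be reported before the checks run.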

In [1]:
import pandas as pd
import pandera as pa
import numpy as np 

In [3]:
rr20_exp_by_mode = pd.read_excel("../data/NTD_Annual_Report_Rural_for2022.xlsx", 
                     sheet_name="Expenses By Mode", index_col=None) 
rr20_service = pd.read_excel("../data/NTD_Annual_Report_Rural_for2022.xlsx", 
                     sheet_name="Service Data", index_col=None) 
orgs = pd.read_csv("../data/organizations.csv")

In [4]:
rr20_exp_by_mode.head()
# print(rr20_service.shape)
# rr20_service.head()
# orgs.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Operating/Capital,Mode,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Capital,Demand Response (DR) - (DO),0.0
1,Alpine County Community Development,,2022,Operating,Demand Response (DR) - (DO),75944.0
2,Amador Transit,,2022,Capital,Commuter Bus (CB) - (DO),0.0
3,Amador Transit,,2022,Capital,Demand Response (DR) - (DO),0.0
4,Amador Transit,,2022,Capital,Deviated Fixed Route (DF) - (DO),0.0


In [5]:
data1 = rr20_service.merge(orgs, left_on ='Organization Legal Name', right_on = 'Organization', 
                          indicator=True).query('_merge == "both"').drop(columns=['_merge', 'Organization'])
print(data1.shape)
data1.head()

(86, 9)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0
2,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0
3,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0
4,Calaveras Transit Agency,CTA,2022,Demand Response (DR) - (PT),23812.0,1416.0,2104.0,0.0,5.0


In [6]:
# Doubles the rows as expected - because of the 'Operating/Capital' column
data = data1.merge(rr20_exp_by_mode, on = ['Organization Legal Name', 'Common Name/Acronym/DBA', 'Fiscal Year', 'Mode'])

print(data.shape)
data.head()

(172, 11)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0


In [7]:
# try the 2 merges together
data = rr20_service.merge(orgs, left_on ='Organization Legal Name', right_on = 'Organization', 
                          indicator=True).query('_merge == "both"').drop(columns=['_merge', 'Organization'])\
.merge(rr20_exp_by_mode, on = ['Organization Legal Name', 'Common Name/Acronym/DBA', 'Fiscal Year', 'Mode'])

In [8]:
print(data.shape)
data.head(20)

(172, 11)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0
5,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0
6,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Capital,0.0
7,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0
8,Calaveras Transit Agency,CTA,2022,Demand Response (DR) - (PT),23812.0,1416.0,2104.0,0.0,5.0,Capital,50189.0
9,Calaveras Transit Agency,CTA,2022,Demand Response (DR) - (PT),23812.0,1416.0,2104.0,0.0,5.0,Operating,160056.0


#### Make fake data for 2021
If we click "generate report" for 2021, RR-20 rural on BlackCat, we get an empty report. For the sake of time I am just making up fake data for 2021 to build the function to compare the prior year to this year.
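
One way to generate such fake values without editing them by hand is to scale each real value by a random factor. This is a minimal sketch, not what the notebook ends up using: `jitter` is a hypothetical helper, and the 10% spread and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

def jitter(values, pct=0.10, seed=0):
    """Scale each value by a random factor in [1 - pct, 1 + pct) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    factors = rng.uniform(1 - pct, 1 + pct, size=len(values))
    return (values * factors).round()

# First three Annual VRM values from the 2022 service data
vrm_2022 = pd.Series([10386.0, 45472.0, 31337.0], name="Annual VRM")
vrm_2021_fake = jitter(vrm_2022)
print(vrm_2021_fake)
```

A spread this small stays below the thresholds used later in the notebook, so a few values would still need a larger change to exercise the "fail" branch.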

In [9]:
fake2021_service = data1.copy()
fake2021_service.head(3)

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0
2,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0


In [42]:
import numpy as np



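def vrm(vrm):
    # Draw a single random integer within about +/-10% of the input value.
    # np.random.rand(a, b) would instead allocate an a-by-b array, which
    # exhausts memory for values this large.
    return np.random.randint(int(vrm*0.9), int(vrm*1.1) + 1)

##----- Not applied here; the fake 2021 values are set by hand in the next cell
# fake2021_service['VRM'] = fake2021_service['Annual VRM'].apply(lambda x: vrm(x))

# fake2021_service.loc[50, 'Annual VRM'] = vrm(fake2021_service.loc[50, 'Annual VRM'])
# print(fake2021_service.loc[50, 'Annual VRM'] )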

In [10]:
fake2021_service.loc[0,'Annual VRM'] = 12386
print(fake2021_service.loc[0,'Annual VRM'])
fake2021_service.loc[1,'Annual VRM'] = 40000
print(fake2021_service.loc[1,'Annual VRM'])
fake2021_service.loc[4,'Annual VRM'] = 40000
fake2021_service.loc[1,'Annual VRH'] = 1000
print(fake2021_service.loc[1,'Annual VRH'])
print(fake2021_service.loc[4,'Annual VRM'])

fake2021_service.loc[9,'VOMX'] = 7
print(fake2021_service.loc[9,'VOMX'])

fake2021_service.loc[:,'Fiscal Year'] = 2021

12386.0
40000.0
1000.0
40000.0
7.0


In [11]:
fake2021_service.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX
0,Alpine County Community Development,,2021,Demand Response (DR) - (DO),12386.0,643.0,384.0,274.0,1.0
1,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0
2,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0
3,Amador Transit,,2021,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0
4,Calaveras Transit Agency,CTA,2021,Demand Response (DR) - (PT),40000.0,1416.0,2104.0,0.0,5.0


In [12]:
fake_exp = rr20_exp_by_mode.copy()
fake_exp['Fiscal Year'] = 2021

In [157]:
### Export, for code development
## We will assume that the data coming in from the API will be *exactly* in this format
with pd.ExcelWriter("../data/NTD_Annual_Report_Rural_2021_fake.xlsx") as writer:
    
    # Sheet names match the ones read from the 2022 workbook
    fake2021_service.to_excel(writer, sheet_name="Service Data", index=False)
    fake_exp.to_excel(writer, sheet_name="Expenses By Mode", index=False)


In [13]:
all2021 = fake2021_service.merge(fake_exp, on = ['Organization Legal Name', 'Common Name/Acronym/DBA', 'Fiscal Year', 'Mode'])

In [14]:
####------- Combine 2022 (real) and 2021 (fake) data
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat replaces it
allyears = pd.concat([data, all2021], ignore_index=True)
print(allyears.shape)
allyears.head()

(344, 11)


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0


In [63]:
75944 / 643

118.10886469673406

#### Add in cost per hr to df 

The **cost per hour (CPH)** check compares this year's value with the previous year's.  
CPH = `Expenses on operations by mode / VRH`
  
Example NTD error message: `The calculated cost per hour for {MB - PT} equals {49.12}. The prior year’s calculated value equals {98.56}. This is a change of {-50.166} Percent.`
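
The percent change in that message can be reproduced with a signed change relative to last year's value. This is a small sketch using the two numbers from the example message; `pct_change` is a hypothetical helper name. The rounded inputs give roughly -50.16, so the figure in the message was presumably computed from unrounded values.

```python
def pct_change(this_year_value, last_year_value):
    """Signed percent change from last year to this year."""
    return (this_year_value - last_year_value) / last_year_value * 100

# Values taken from the example NTD message above
cph_this, cph_last = 49.12, 98.56
change = pct_change(cph_this, cph_last)
print(f"The calculated cost per hour equals {cph_this}. "
      f"The prior year's calculated value equals {cph_last}. "
      f"This is a change of {change:.2f} percent.")
```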

In [32]:
# https://stackoverflow.com/questions/59865432/can-i-use-pandas-dataframe-assign-with-a-variable-name
# Make df.assign accept a variable column name by passing in a dict

def make_ratio_cols(df, numerator, denominator, col_name):
    if col_name is None:
        # A name is required for the new column
        raise ValueError("col_name must be provided")
    # Raise an error if the column already exists
    if col_name in df.columns:
        raise ValueError(f"Dataframe already has column '{col_name}'")

    df = (df.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])
          .apply(lambda x: x.assign(**{col_name:
                 lambda x: x[numerator].sum() / x[denominator]}))
                )
    return df

In [34]:
### testing
# 'Annual VRH' in allyears.columns
test = make_ratio_cols(allyears, 'Total Annual Expenses By Mode', 
                       'Annual VRH', 
                       'cost_per_hr')


# test['cost_per_hr'] = (allyears.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])
#                        .apply(lambda x: x['Total Annual Expenses By Mode'].sum() / x['Annual VRH']))
test.head(3)

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,118.108865
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,118.108865
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158


In [115]:
#without a function, 1-liner to add column

allyears = (allyears.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])
 .apply(lambda x: x.assign(cost_per_hr=lambda x: x['Total Annual Expenses By Mode'].sum() / x['Annual VRH']))
)

allyears.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,118.108865
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,118.108865
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206


#### Also add in miles per vehicle: VRM / Vehicles Operated in Annual Maximum Service (VOMS, computed here from the `VOMX` column)  
Example NTD error message: `The calculated miles per vehicle for {DR - DO} is {5,221.40}. The prior year’s calculated value is {3,678.33}. This is a {41.95}% {increase} caused by a change in Vehicle Revenue Miles, the number of Vehicles Operated in Annual Maximum Service, or both.`
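
One thing to watch when computing this ratio on the merged data: each agency, mode and year has two rows (Capital and Operating) that repeat the same service values, so summing `Annual VRM` within a group counts it twice. The sketch below uses a two-row stand-in frame with values from the 2022 data to show the difference between a grouped sum and a row-wise ratio.

```python
import pandas as pd

# Two expense rows (Capital, Operating) repeat the same service values,
# as in the merged frame used in this notebook.
merged = pd.DataFrame({
    "Organization Legal Name": ["Alpine County Community Development"] * 2,
    "Mode": ["Demand Response (DR) - (DO)"] * 2,
    "Fiscal Year": [2022, 2022],
    "Operating/Capital": ["Capital", "Operating"],
    "Annual VRM": [10386.0, 10386.0],
    "VOMX": [1.0, 1.0],
})

keys = ["Organization Legal Name", "Mode", "Fiscal Year"]

# Summing within the group counts the repeated VRM twice
summed = merged.groupby(keys)["Annual VRM"].transform("sum") / merged["VOMX"]

# A row-wise ratio uses each VRM value once
row_wise = merged["Annual VRM"] / merged["VOMX"]

print(summed.tolist())    # [20772.0, 20772.0]
print(row_wise.tolist())  # [10386.0, 10386.0]
```

The year-over-year percent change comes out the same either way when both years have two rows per group, but the absolute miles-per-vehicle figure differs by a factor of two.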

In [116]:
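# Note: each (agency, mode, year) group has two rows (Capital and Operating) that repeat
# the same 'Annual VRM', so .sum() doubles it and miles_per_veh comes out as twice VRM / VOMX.
# The year-over-year percent change is unaffected as long as both years have two rows.
allyears = (allyears.groupby(['Organization Legal Name','Mode', 'Fiscal Year'])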
 .apply(lambda x: x.assign(miles_per_veh=lambda x: x['Annual VRM'].sum() / x['VOMX']))
)

allyears.head()

Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh
0,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Capital,0.0,118.108865,20772.0
1,Alpine County Community Development,,2022,Demand Response (DR) - (DO),10386.0,643.0,384.0,274.0,1.0,Operating,75944.0,118.108865,20772.0
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158,90944.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8


### Function development

In [74]:
import datetime

# this_year = datetime.datetime.now().year
# print(this_year)
this_year = 2022 #for testing purposes
last_year = this_year - 1  # named last_year to match the cells below
print(last_year)


allyears['Fiscal Year'].unique()

agencies = data['Organization Legal Name'].unique()

2021


In [117]:
### for testing
agency = 'Amador Transit'
mode = 'Demand Response (DR) - (DO)'
allyears[allyears['Organization Legal Name'] == agency]


Unnamed: 0,Organization Legal Name,Common Name/Acronym/DBA,Fiscal Year,Mode,Annual VRM,Annual VRH,Annual UPT,Sponsored UPT,VOMX,Operating/Capital,Total Annual Expenses By Mode,cost_per_hr,miles_per_veh
2,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Capital,0.0,127.166158,90944.0
3,Amador Transit,,2022,Commuter Bus (CB) - (DO),45472.0,1637.0,1629.0,0.0,1.0,Operating,208171.0,127.166158,90944.0
4,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8
5,Amador Transit,,2022,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,69.624206,12534.8
6,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Capital,0.0,134.220898,27955.818182
7,Amador Transit,,2022,Deviated Fixed Route (DF) - (DO),153757.0,7302.0,16196.0,0.0,11.0,Operating,980081.0,134.220898,27955.818182
174,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0,Capital,0.0,208.171,80000.0
175,Amador Transit,,2021,Commuter Bus (CB) - (DO),40000.0,1000.0,1629.0,0.0,1.0,Operating,208171.0,208.171,80000.0
176,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Capital,0.0,69.624206,12534.8
177,Amador Transit,,2021,Demand Response (DR) - (DO),31337.0,2991.0,7028.0,315.0,5.0,Operating,208246.0,69.624206,12534.8


In [141]:
variable = 'cost_per_hr'
(allyears[(allyears['Organization Legal Name'] == agency) 
          & (allyears['Mode']==mode)
        & (allyears['Fiscal Year'] == last_year)]
 [variable].unique()[0])

180.94598554898693

In [172]:
df = allyears.copy()

this_year = 2022 #for testing purposes
last_year = this_year - 1

def rr20_ratios(df, variable, threshold):
    agencies = df['Organization Legal Name'].unique()
    output = []
    for agency in agencies:

        if len(df[df['Organization Legal Name']==agency]) > 0:
        # Check whether data for both years is present, if so perform prior yr comparison.
            if (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==this_year)]) > 0) \
                & (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==last_year)]) > 0): 

                for mode in df[df['Organization Legal Name'] == agency]['Mode'].unique():
                    value_thisyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == this_year)]
                                  [variable].unique()[0], 2))
                    value_lastyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == last_year)]
                                  [variable].unique()[0], 2))
                    # If value_lastyr is 0, this division yields inf or nan (with a RuntimeWarning)
                    if abs((value_lastyr - value_thisyr)/value_lastyr) >= threshold:
                        result = "fail"
                        check_name = f"{variable}"
                        # Signed change: negative for a decrease, positive for an increase
                        description = (f"The {variable} for {mode} has changed from last year by {round((value_thisyr/value_lastyr - 1)*100, 1)}%, please provide a narrative justification.")
                    else:
                        result = "pass"
                        check_name = f"{variable}"
                        description = ""

                    output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                                   "mode": mode,
                            "value_checked": f"{this_year} = {value_thisyr}, {last_year} = {value_lastyr}",
                            "pct_change": round(abs((value_lastyr - value_thisyr)/value_lastyr)*100, 1),
                            "check_status": result,
                            "Description": description}
                    output.append(output_line)
        else:
            print(f"There is no data for {agency}")
    checks = pd.DataFrame(output).sort_values(by="Organization")
    return checks

In [173]:
cph_checks = rr20_ratios(allyears, 'cost_per_hr', .3)
mpv_checks = rr20_ratios(allyears, 'miles_per_veh', .3)
vrm_checks = rr20_ratios(allyears, 'Annual VRM', .25)

print(cph_checks.shape)
cph_checks.head()

print(mpv_checks.shape)
mpv_checks.head()
rr20_checks = pd.concat([cph_checks, mpv_checks, vrm_checks]).sort_values(by="Organization")

(86, 7)
(86, 7)




In [174]:
print(rr20_checks.shape)
rr20_checks.head(10)

(258, 7)


Unnamed: 0,Organization,name_of_check,mode,value_checked,pct_change,check_status,Description
0,Alpine County Community Development,cost_per_hr,Demand Response (DR) - (DO),"2022 = 118.11, 2021 = 118.11",0.0,pass,
0,Alpine County Community Development,Annual VRM,Demand Response (DR) - (DO),"2022 = 10386.0, 2021 = 12386.0",16.1,pass,
0,Alpine County Community Development,miles_per_veh,Demand Response (DR) - (DO),"2022 = 20772.0, 2021 = 24772.0",16.1,pass,
1,Amador Transit,Annual VRM,Commuter Bus (CB) - (DO),"2022 = 45472.0, 2021 = 40000.0",13.7,pass,
1,Amador Transit,miles_per_veh,Commuter Bus (CB) - (DO),"2022 = 90944.0, 2021 = 80000.0",13.7,pass,
2,Amador Transit,miles_per_veh,Demand Response (DR) - (DO),"2022 = 12534.8, 2021 = 12534.8",0.0,pass,
3,Amador Transit,miles_per_veh,Deviated Fixed Route (DF) - (DO),"2022 = 27955.82, 2021 = 27955.82",0.0,pass,
3,Amador Transit,Annual VRM,Deviated Fixed Route (DF) - (DO),"2022 = 153757.0, 2021 = 153757.0",0.0,pass,
2,Amador Transit,Annual VRM,Demand Response (DR) - (DO),"2022 = 31337.0, 2021 = 31337.0",0.0,pass,
3,Amador Transit,cost_per_hr,Deviated Fixed Route (DF) - (DO),"2022 = 134.22, 2021 = 134.22",0.0,pass,


In [175]:
rr20_checks.to_excel("../data/test_rr20.xlsx", index=False)

#### Function to check whether the VRM changed significantly.
Example NTD error message: `For {DR - DO|DR - PT|CB - DO| CB - PT|MB - PT|MB - DO}, Annual Vehicle Revenue Miles {} {} percent compared to last year from {} to {}. This change is high.`
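
The same comparison can also be expressed without nested loops by pivoting the years into columns. This is only a sketch under stated assumptions: `flag_vrm_change` is a hypothetical helper, the sample rows reuse two agencies from the data above, and zero values in the prior year are not handled (they would produce inf or NaN).

```python
import pandas as pd

def flag_vrm_change(df, this_year, last_year, threshold):
    """Compare Annual VRM between two years per agency and mode (hypothetical helper)."""
    keys = ["Organization Legal Name", "Mode"]
    # Keep one row per agency, mode and year, then put the years side by side
    wide = (df.drop_duplicates(keys + ["Fiscal Year"])
              .pivot(index=keys, columns="Fiscal Year", values="Annual VRM"))
    pct = (wide[this_year] - wide[last_year]) / wide[last_year] * 100
    out = pd.DataFrame({"this_year": wide[this_year],
                        "last_year": wide[last_year],
                        "pct_change": pct.round(1)})
    out["check_status"] = (pct.abs() >= threshold * 100).map({True: "fail", False: "pass"})
    return out.reset_index()

sample = pd.DataFrame({
    "Organization Legal Name": ["Alpine County Community Development"] * 2
                               + ["Calaveras Transit Agency"] * 2,
    "Mode": ["Demand Response (DR) - (DO)"] * 2 + ["Demand Response (DR) - (PT)"] * 2,
    "Fiscal Year": [2022, 2021, 2022, 2021],
    "Annual VRM": [10386.0, 12386.0, 23812.0, 40000.0],
})
result = flag_vrm_change(sample, 2022, 2021, 0.30)
print(result)
```

With a 30% threshold, the Alpine row (about -16.1%) passes and the Calaveras row (about -40.5%) fails.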

In [159]:
def check_single_number(df, variable, threshold):
    agencies = df['Organization Legal Name'].unique()
    output = []
    for agency in agencies:

        if len(df[df['Organization Legal Name']==agency]) > 0:
        # Check whether data for both years is present, if so perform prior yr comparison.
            if (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==this_year)]) > 0) \
                & (len(df[(df['Organization Legal Name']==agency) & (df['Fiscal Year']==last_year)]) > 0): 

                for mode in df[df['Organization Legal Name'] == agency]['Mode'].unique():
                    value_thisyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == this_year)]
                                  [variable].unique()[0], 2))
                    value_lastyr = (round(df[(df['Organization Legal Name'] == agency) 
                                          & (df['Mode']==mode)
                                          & (df['Fiscal Year'] == last_year)]
                                  [variable].unique()[0], 2))
                    
                    # Handle zero values first so the ratio below never divides by zero
                    this_is_zero = round(value_thisyr) == 0
                    last_is_zero = round(value_lastyr) == 0
                    if this_is_zero != last_is_zero:
                        result = "fail"
                        check_name = f"{variable}"
                        description = (f"The {variable} for {mode} has changed either from or to zero compared to last year. Please provide a narrative justification.")
                    elif (not last_is_zero) and abs((value_thisyr - value_lastyr)/value_lastyr) >= threshold:
                        result = "fail"
                        check_name = f"{variable}"
                        description = (f"The {variable} for {mode} has changed from last year by {round((value_thisyr/value_lastyr - 1)*100, 1)}%, please provide a narrative justification.")
                    else:
                        result = "pass"
                        check_name = f"{variable}"
                        description = ""

                    output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                                   "mode": mode,
                            "value_checked": f"{this_year} = {value_thisyr}, {last_year} = {value_lastyr}",
                            "check_status": result,
                            "Description": description}
                    output.append(output_line)
        else:
            print(f"There is no data for {agency}")
    checks = pd.DataFrame(output).sort_values(by="Organization")
    return checks

In [161]:
vrm_checks = rr20_ratios(allyears, 'Annual VRM', .30)
vrm_checks



Unnamed: 0,Organization,name_of_check,mode,value_checked,check_status,Description
0,Alpine County Community Development,Annual VRM,Demand Response (DR) - (DO),"2022 = 10386.0, 2021 = 12386.0",pass,
1,Amador Transit,Annual VRM,Commuter Bus (CB) - (DO),"2022 = 45472.0, 2021 = 40000.0",pass,
2,Amador Transit,Annual VRM,Demand Response (DR) - (DO),"2022 = 31337.0, 2021 = 31337.0",pass,
3,Amador Transit,Annual VRM,Deviated Fixed Route (DF) - (DO),"2022 = 153757.0, 2021 = 153757.0",pass,
4,Calaveras Transit Agency,Annual VRM,Demand Response (DR) - (PT),"2022 = 23812.0, 2021 = 40000.0",fail,The Annual VRM for Demand Response (DR) - (PT)...
...,...,...,...,...,...,...
80,Town of Truckee,Annual VRM,Bus (MB) (Fixed Route) - (PT),"2022 = 101480.0, 2021 = 101480.0",pass,
82,Trinity County Department of Transportation,Annual VRM,Intercity Service (IC) - (DO),"2022 = 116976.0, 2021 = 116976.0",pass,
84,Tuolumne County Transit Agency (TCTA),Annual VRM,Demand Response (DR) - (PT),"2022 = 169446.0, 2021 = 169446.0",pass,
83,Tuolumne County Transit Agency (TCTA),Annual VRM,Bus (MB) (Fixed Route) - (PT),"2022 = 83834.0, 2021 = 83834.0",pass,
