### Prototyping automatic validation checks for NTD reporting data: form A-10
  
We decided *not* to use Pandera to validate this form: writing our own functions is more straightforward and more customizable. Details are in the Pandera section below.  
  
This notebook first imports a csv of cleaned data, presumably submitted from the A10 form, from the NTD data files found on their website. The data was transformed in the notebook `2a_clean_format_data_a10.ipynb`. 
  
This notebook shows the development of the functions that are used in the executable file `a10_facilities_check.py`.

In [1]:
import pandas as pd
import pandera as pa
import numpy as np 

In [2]:
# data to validate -

df1 = pd.read_csv("../data/2021_a10_submitted_partialdata.csv", index_col = 0) # 2021
prior_yr = pd.read_csv("../data/2020_a10_submitted_partialdata.csv", index_col = 0) # 2020
df1.head(3)

Unnamed: 0,Agency,City,State,OrganizationType,ReporterType,year,Mode,TOS,ownerships,Under200Vehicles,200to300Vehicles,Over300Vehicles,HeavyMaintenanceFacilities,TotalFacilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0


In [3]:
print(df1.shape)
print(prior_yr.shape)

(325, 14)
(439, 14)


---
### Check validation rules with custom functions instead of Pandera  
We write 4 checks:

**1. Check that sum of total facilities for each agency, across all modes, is a whole number.** Historical error: "The total maintenance facilities {} you reported across all modes does not add up to a whole number."  
  * 6 errors in 2022
  * 27 errors in 2021
  
  Example 2021 issue: 
```
3273011 | 91012 - Mountain Area Regional Transit Authority	| A10-032	| Stations and Maintenance Facilities - DO - (A-10)	Commuter Bus| Directly Operated	| The total maintenance facilities {1.23} you reported across all modes does not add up to a whole number.	
```  
  
**2. Check that the sum of all total facilities is not zero.** Historical error: "You did not report any general purpose maintenance facility. For DO modes, you must report any maintenance facilities owned or leased by you. For PT modes, you must report any maintenance facilities owned or leased by you or your contractors."  

**3. Check whether total general purpose facilities (all but heavy maintenance) is > 1.** If so, throw an error. Historical error: "You have reported {} general purpose maintenance facilities. Please confirm the subrecipient owns/leases multiple maintenance facilities or revise the number."

**4. Check that the total general purpose facilities is the same as the prior year.** If not, throw an error. Historical error: "Number of General Purpose Maintenance Facilities is {}, but was {} last report year."
- Row-bind the 2021 and 2020 data.
- For each agency, pull both years and compare this year's general purpose facility total with last year's.
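
As a rough sketch, the four rules above can be reduced to small predicates on plain numbers. The function names and the tolerance `tol` are illustrative assumptions, not the contents of `a10_facilities_check.py`.

```python
import math

# Hypothetical helper names; `tol` is an assumed tolerance for float noise.
def is_whole(total, tol=1e-6):
    """Rule 1: the summed facilities should be (numerically) a whole number."""
    return math.isclose(total, round(total), abs_tol=tol)

def is_nonzero(total, tol=1e-6):
    """Rule 2: at least some facility should be reported."""
    return abs(total) > tol

def at_most_one(gen_total):
    """Rule 3: more than one general purpose facility needs confirmation."""
    return round(gen_total) <= 1

def same_as_prior(gen_total, prior_gen_total):
    """Rule 4: the general purpose count should match last year's."""
    return round(gen_total) == round(prior_gen_total)

# 1.23 is the value quoted in the 2021 example error above.
print(is_whole(1.23))             # False
print(is_whole(0.7 + 0.3 + 1.0))  # True
```

Comparing with a tolerance, rather than rounding first, keeps a fractional total such as 1.23 from slipping through the whole-number rule.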

In [17]:
allyears = pd.concat([df1, prior_yr])  # DataFrame.append was removed in pandas 2.0
print(allyears.shape)
allyears.head()

(764, 14)


Unnamed: 0,Agency,City,State,OrganizationType,ReporterType,year,Mode,TOS,ownerships,Under200Vehicles,200to300Vehicles,Over300Vehicles,HeavyMaintenanceFacilities,TotalFacilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,HR,DO,Owned,2.0,0.0,0.0,0.0,2.0
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,RB,DO,Owned,0.0,0.48,0.03,0.06,0.57


In [18]:
import datetime

this_year = datetime.datetime.now().year
print(this_year)
last_yr = this_year - 1
last_yr



### for testing
agency = 'Antelope Valley Transit Authority'
this_year = 2021
allyears['year'].unique()

allyears[(allyears['Agency']==agency) & (allyears['year']==this_year)][['Under200Vehicles', 
                            '200to300Vehicles',
                            'Over300Vehicles']].sum()

df = allyears.copy()
df[df['Agency']==agency]

2023


Unnamed: 0,Agency,City,State,OrganizationType,ReporterType,year,Mode,TOS,ownerships,Under200Vehicles,200to300Vehicles,Over300Vehicles,HeavyMaintenanceFacilities,TotalFacilities
69,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,PT,Owned by Public Agency,0.7,0.0,0.0,0.0,0.7
70,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,CB,PT,Owned by Public Agency,0.3,0.0,0.0,0.0,0.3
71,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,DR,PT,Leased by PT Provider,1.0,0.0,0.0,0.0,1.0
111,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,MB,PT,Owned by Public Agency,0.7,0.0,0.0,0.0,0.7
112,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,CB,PT,Owned by Public Agency,0.3,0.0,0.0,0.0,0.3
114,Antelope Valley Transit Authority,Lancaster,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,DR,PT,Leased by PT Provider,1.0,0.0,0.0,0.0,1.0


In [19]:
import datetime

#for testing purposes
df = allyears.copy()

a10_agencies = df['Agency'].unique()
# this_year = datetime.datetime.now().year
this_year = 2021
last_year = this_year - 1

output = []


for agency in a10_agencies:
    # only run the checks for agencies that have at least one row
    if len(df[df['Agency']==agency]) > 0:
        
        total_gen_fac = round(df[(df['Agency']==agency) & (df['year']==this_year)]
                          [['Under200Vehicles', 
                            '200to300Vehicles',
                            'Over300Vehicles']].sum().sum())
                                          
        # check on whether there's >1 gen purpose fac and/or none reported
        if (round(total_gen_fac) <= 1) & (round(total_gen_fac) != 0):
            result = "pass"
            description = ""
            check_name = "Gen Purpose Facilities"
        elif round(total_gen_fac) > 1:
            result = "fail"
            description = "You reported > 1 general purpose facility. Please verify whether this is correct."
            check_name = "Multiple Gen Purpose Facilities"
        elif round(total_gen_fac) == 0:
            result = "fail"
            description = "You reported no general purpose facilities. Please verify whether this is correct."
            check_name = "Non-zero Gen Purpose Facilities"
        else:
            pass
        
        output_line = {"Organization": agency,
                       "name_of_check" : check_name,
                        "value_checked": f"Gen Purpose Facilities: {total_gen_fac}",
                        "check_status": result,
                        "Description": description}
        output.append(output_line)
        
        # Check whether data for both years is present, if so perform prior yr comparison.
        if (len(df[(df['Agency']==agency) & (df['year']==this_year)]) > 0) & (len(df[(df['Agency']==agency) & (df['year']==last_year)]) > 0): 
            
            last_yr_gen_fac = round(df[(df['Agency']==agency) & (df['year']==last_year)]
                                     [['Under200Vehicles', 
                                        '200to300Vehicles',
                                        'Over300Vehicles']].sum().sum())
             
            if round(total_gen_fac) == round(last_yr_gen_fac):
                result = "pass"
                description = ""
                check_name = "Comparison to last yr: Gen Purpose Facilities"
            else:
                result = "fail"
                description = "Num. of general purpose facilities differs from last year - please verify or clarify."
                check_name = "Comparison to last yr: Gen Purpose Facilities"

            output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                            "value_checked": f"{total_gen_fac} in {this_year}, {last_yr_gen_fac} in {last_year} (Gen Purpose Facilities)", 
                            "check_status": result,
                            "Description": description}
            output.append(output_line)

        else:
             pass
                              
    
# Build the results table once, after the loop has finished.
facility_checks = pd.DataFrame(output).sort_values(by="Organization")


In [20]:
print(facility_checks.shape)
facility_checks.head(20)

(394, 5)


Unnamed: 0,Organization,name_of_check,value_checked,check_status,Description
307,Access Services,Non-zero Gen Purpose Facilities,Gen Purpose Facilities: 0,fail,You reported no general purpose facilities. Pl...
7,Access Services,Multiple Gen Purpose Facilities,Gen Purpose Facilities: 6,fail,You reported > 1 general purpose facility. Ple...
12,Alameda-Contra Costa Transit District,Multiple Gen Purpose Facilities,Gen Purpose Facilities: 8,fail,You reported > 1 general purpose facility. Ple...
310,"Alameda-Contra Costa Transit District, dba: AC...",Non-zero Gen Purpose Facilities,Gen Purpose Facilities: 0,fail,You reported no general purpose facilities. Pl...
381,Alpine County Local Transportation Commission,Non-zero Gen Purpose Facilities,Gen Purpose Facilities: 0,fail,You reported no general purpose facilities. Pl...
119,Altamont Corridor Express,Gen Purpose Facilities,Gen Purpose Facilities: 1,pass,
120,Altamont Corridor Express,Comparison to last yr: Gen Purpose Facilities,"1 in 2021, 1 in 2020 (Gen Purpose Facilities)",pass,
143,Amador Regional Transit System,Non-zero Gen Purpose Facilities,Gen Purpose Facilities: 0,fail,You reported no general purpose facilities. Pl...
144,Amador Regional Transit System,Comparison to last yr: Gen Purpose Facilities,"0 in 2021, 0 in 2020 (Gen Purpose Facilities)",pass,
100,Anaheim Transportation Network,Gen Purpose Facilities,Gen Purpose Facilities: 1,pass,


In [21]:
facility_checks.to_csv("../data/test.csv", index=False)

Table to make:  
`Organization | name_of_check | value_checked | check_status | Description`
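
One way to avoid repeating the same dictionary literal for every check is a small row builder. This is only a sketch: `make_check_row` is a hypothetical helper name and the agency is made up; only the five column names come from the table spec above.

```python
def make_check_row(organization, name_of_check, value_checked, passed, fail_description):
    """Build one row of the results table; the description is blank on a pass."""
    return {
        "Organization": organization,
        "name_of_check": name_of_check,
        "value_checked": value_checked,
        "check_status": "pass" if passed else "fail",
        "Description": "" if passed else fail_description,
    }

row = make_check_row(
    "Example Transit Agency",  # made-up agency name
    "Non-zero Facilities",
    "Total Facilities: 0",
    passed=False,
    fail_description="There are no reported facilities. Please verify or explain.",
)
print(row["check_status"])  # fail
```

A list of such rows can be passed straight to `pd.DataFrame(...)`, as the cells in this notebook do with `output`.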

In [22]:
a10_agencies = df['Agency'].unique()

output = []
for agency in a10_agencies:

    ## Total facilities
    if len(df[df['Agency']==agency]) > 0:
        # Keep the unrounded sum: rounding first would make the whole-number check always pass.
        total_fac = df[(df['Agency']==agency) & (df['year']==this_year)]['TotalFacilities'].sum()
        
        # whole number check (np.isclose absorbs floating-point noise in the sum)
        if np.isclose(total_fac, round(total_fac)):
            result = "pass"
            description = ""
            check_name = "Whole Number Facilities"
        else:
            result = "fail"
            description = "The reported total facilities do not add up to a whole number. Please explain."
            check_name = "Whole Number Facilities"
        
        output_line = {"Organization": agency,
                       "name_of_check" : check_name,
                           "value_checked": f"Total Facilities: {total_fac:g}",
                           "check_status": result,
                          "Description": description}
        output.append(output_line)
        
        # Non-zero check
        if total_fac != 0:
            result = "pass"
            description = ""
            check_name = "Non-zero Facilities"
        else:
            result = "fail"
            description = "There are no reported facilities. Please verify or explain."
            check_name = "Non-zero Facilities"
        
        output_line = {"Organization": agency,
                       "name_of_check" : check_name,
                        "value_checked": f"Total Facilities: {total_fac:g}",
                        "check_status": result,
                        "Description": description}
        output.append(output_line)
        
        ## General purpose facilities (all except "heavy maintenance")
        total_gen_fac = round(df[(df['Agency']==agency) & (df['year']==this_year)]
                          [['Under200Vehicles', 
                            '200to300Vehicles',
                            'Over300Vehicles']].sum().sum())
                                          
        # check on whether there's >1 gen purpose fac and/or none reported
        if (round(total_gen_fac) <= 1) & (round(total_gen_fac) != 0):
            result = "pass"
            description = ""
            check_name = "Gen Purpose Facilities"
        elif round(total_gen_fac) > 1:
            result = "fail"
            description = "You reported > 1 general purpose facility. Please verify whether this is correct."
            check_name = "Multiple Gen Purpose Facilities"
        elif round(total_gen_fac) == 0:
            result = "fail"
            description = "You reported no general purpose facilities. Please verify whether this is correct."
            check_name = "Non-zero Gen Purpose Facilities"
        else:
            pass
        
        output_line = {"Organization": agency,
                       "name_of_check" : check_name,
                        "value_checked": f"Gen Purpose Facilities: {total_gen_fac}",
                        "check_status": result,
                        "Description": description}
        output.append(output_line)
        
        # Check whether data for both years is present, if so perform prior yr comparison.
        if (len(df[(df['Agency']==agency) & (df['year']==this_year)]) > 0) & (len(df[(df['Agency']==agency) & (df['year']==last_year)]) > 0): 
            
            last_yr_gen_fac = round(df[(df['Agency']==agency) & (df['year']==last_year)]
                                     [['Under200Vehicles', 
                                        '200to300Vehicles',
                                        'Over300Vehicles']].sum().sum())
             
            if round(total_gen_fac) == round(last_yr_gen_fac):
                result = "pass"
                description = ""
                check_name = "Comparison to last yr: Gen Purpose Facilities"
            else:
                result = "fail"
                description = "Num. of general purpose facilities differs from last year - please verify or clarify."
                check_name = "Comparison to last yr: Gen Purpose Facilities"

            output_line = {"Organization": agency,
                           "name_of_check" : check_name,
                            "value_checked": f"{total_gen_fac} in {this_year}, {last_yr_gen_fac} in {last_year} (Gen Purpose Facilities)", 
                            "check_status": result,
                            "Description": description}
            output.append(output_line)

        else:
             pass
    
# Build the results table once, after the loop has finished.
facility_checks = pd.DataFrame(output).sort_values(by="Organization")


In [23]:
print(facility_checks.shape)
facility_checks.head(20)

(912, 5)


Unnamed: 0,Organization,name_of_check,value_checked,check_status,Description
652,Access Services,Non-zero Facilities,Total Facilities: 0,fail,There are no reported facilities. Please verif...
653,Access Services,Non-zero Gen Purpose Facilities,Gen Purpose Facilities: 0,fail,You reported no general purpose facilities. Pl...
651,Access Services,Whole Number Facilities,Total Facilities: 0,pass,
19,Access Services,Multiple Gen Purpose Facilities,Gen Purpose Facilities: 6,fail,You reported > 1 general purpose facility. Ple...
18,Access Services,Non-zero Facilities,Total Facilities: 6,pass,
17,Access Services,Whole Number Facilities,Total Facilities: 6,pass,
29,Alameda-Contra Costa Transit District,Non-zero Facilities,Total Facilities: 9,pass,
28,Alameda-Contra Costa Transit District,Whole Number Facilities,Total Facilities: 9,pass,
30,Alameda-Contra Costa Transit District,Multiple Gen Purpose Facilities,Gen Purpose Facilities: 8,fail,You reported > 1 general purpose facility. Ple...
660,"Alameda-Contra Costa Transit District, dba: AC...",Whole Number Facilities,Total Facilities: 0,pass,


#### Checking against prior year submissions.
NEXT TO DO:  
Make a `year` column and add rows for the prior year, then compare across years in the `group by` checks.  
We will have to build the step of adding each agency's prior-year data into the data pipeline. Possibly:
* agency submits a form
* we format the form, then row-bind the prior year's form to it from a table that holds all agency info from 2022 for each form (the two must have identical table schemas)
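
The row-bind step could look roughly like the following. This is a sketch under assumptions: `bind_prior_year` is a hypothetical function, and the two tiny frames (made-up rows, a subset of the A-10 columns) stand in for the formatted submission and the stored prior-year table.

```python
import pandas as pd

def bind_prior_year(current, prior):
    """Stack this year's form on last year's, refusing mismatched schemas."""
    mismatched = set(current.columns) ^ set(prior.columns)
    if mismatched:
        raise ValueError(f"Schemas differ on columns: {sorted(mismatched)}")
    # Align column order to the current form before stacking.
    return pd.concat([current, prior[current.columns]], ignore_index=True)

# Made-up rows for illustration only
current = pd.DataFrame({"Agency": ["Agency A"], "year": [2022], "TotalFacilities": [1.0]})
prior = pd.DataFrame({"Agency": ["Agency A"], "year": [2021], "TotalFacilities": [1.0]})

both_years = bind_prior_year(current, prior)
print(both_years.shape)  # (2, 3)
```

Failing loudly on a schema mismatch keeps a renamed or missing column from silently turning into a column of nulls after the bind.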


### Pandera testing. 
https://github.com/unionai-oss/pandera  
(this part included just to show how we checked out Pandera and then decided not to use it)

In [3]:
### Create schema - this is what will define what the data table needs to conform to - and hold all validation checks.
# We can infer a template from the starting data.


facilities_a10_schema = pa.infer_schema(df1).to_script()
print(facilities_a10_schema)

from pandera import DataFrameSchema, Column, Check, Index, MultiIndex

schema = DataFrameSchema(
    columns={
        "Agency": Column(
            dtype="object",
            checks=None,
            nullable=False,
            unique=False,
            coerce=False,
            required=True,
            regex=False,
            description=None,
            title=None,
        ),
        "City": Column(
            dtype="object",
            checks=None,
            nullable=False,
            unique=False,
            coerce=False,
            required=True,
            regex=False,
            description=None,
            title=None,
        ),
        "State": Column(
            dtype="object",
            checks=None,
            nullable=False,
            unique=False,
            coerce=False,
            required=True,
            regex=False,
            description=None,
            title=None,
        ),
        "OrganizationType": Column(
            dtype="object",


Copied and pasted the output above into a file called `a10_inferred_schema`  in a folder I made here called `schemas`.   
  Now will try to validate it against its own data.

In [4]:
from schemas.a10_inferred_schema import schema
try:
    schema.validate(df1, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases_df = exc.failure_cases
    display(exc.failure_cases)

Nothing happens because the data has no errors.  
  
Now we will start customizing the validation schema. Made another copy of it first and called it **`facilities_a10_schema.py`**. THIS IS THE FINAL SCHEMA, USE FROM NOW ON.  
  
To test, first changed one thing - to check if the `State` column equals "CA".  
Now I'll change one state to WA and see what the error report looks like.

In [5]:
df1.loc[0,'State'] = 'WA' #change the value
df1.loc[0,'State'] #check

'WA'

In [6]:
df1.head(3)

Unnamed: 0,Agency,City,State,OrganizationType,ReporterType,year,Mode,TOS,ownerships,Under200Vehicles,200to300Vehicles,Over300Vehicles,HeavyMaintenanceFacilities,TotalFacilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,WA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0


In [7]:
pd.set_option('display.max_rows',100)

from schemas.facilities_a10_schema import a10_schema

try:
    a10_schema.validate(df1, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases_df = exc.failure_cases
    display(exc.failure_cases)

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
0,Column,State,equal_to(CA),0,WA,0


Ta-da! This is an example Pandera auto-generated error.  
  
---  

### Tests to write the schema checks
Below, we test one custom rule at a time. After each rule tested successfully here, I added it to the schema file. 
  
Also at this point I added checks to each of the `total facilities` columns to check whether they are a whole number. See the `facilities_a10_schema.py` file for the code.

In [8]:
# filter by agency first
agency = "Mountain Area Regional Transit Authority, dba: Mountain Transit"

df_agency = df1[df1['Agency']==agency]
df_agency

Unnamed: 0,Agency,City,State,OrganizationType,ReporterType,year,Mode,TOS,ownerships,Under200Vehicles,200to300Vehicles,Over300Vehicles,HeavyMaintenanceFacilities,TotalFacilities
190,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,MB,DO,Owned,0.0,0.0,0.0,0.0,1.23
197,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,CB,DO,Owned,0.0,0.0,0.0,0.0,0.36
198,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,DR,DO,Owned,0.0,0.0,0.0,0.0,0.41


In [10]:
# df[df['Agency']==agency]['Total Facilities'].sum()
round(df1[df1['Agency']==agency][['Under200Vehicles', 
                            '200to300Vehicles',
                            'Over300Vehicles']].sum().sum())

0

This way we also do not use the groupby checks in Pandera
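
The same per-agency sum can be computed for every agency at once with a plain pandas `groupby`. A minimal sketch with made-up rows (the column names match the A-10 data; the agencies and values are invented):

```python
import pandas as pd

gen_cols = ["Under200Vehicles", "200to300Vehicles", "Over300Vehicles"]

toy = pd.DataFrame({
    "Agency": ["Agency A", "Agency A", "Agency B"],
    "Under200Vehicles": [0.7, 0.3, 0.0],
    "200to300Vehicles": [0.0, 0.0, 0.0],
    "Over300Vehicles": [0.0, 0.0, 0.0],
})

# Sum the three general purpose columns per agency, then across the columns.
gen_fac_by_agency = toy.groupby("Agency")[gen_cols].sum().sum(axis=1).round()
print(gen_fac_by_agency)
```

The result is a Series indexed by agency, so each rule becomes a vectorized comparison (for example `gen_fac_by_agency > 1`) instead of a loop.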

In [9]:
# check that sum of all total facilities per agency is a whole number.
pd.set_option('display.max_rows',100)

from schemas.facilities_a10_schema import a10_schema

try:
    a10_schema.validate(df_agency, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases_df = exc.failure_cases
    display(exc.failure_cases)

In [10]:
# Double-check the above check works, by changing the dataset
df_agency.loc[0,'TotalFacilities'] = 1.55 #change the value
df_agency.loc[0,'TotalFacilities'] #check

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_agency.loc[0,'TotalFacilities'] = 1.55 #change the value


1.55

In [11]:
# retry
try:
    a10_schema.validate(df_agency, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases_df = exc.failure_cases
    display(exc.failure_cases)

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
0,Column,year,coerce_dtype('int64'),,,0.0
1,Column,Agency,not_nullable,,,0.0
2,Column,City,not_nullable,,,0.0
3,Column,State,not_nullable,,,0.0
4,Column,OrganizationType,not_nullable,,,0.0
5,Column,ReporterType,not_nullable,,,0.0
6,Column,year,not_nullable,,,0.0
7,Column,year,dtype('int64'),,float64,
8,Column,Mode,not_nullable,,,0.0
9,Column,TOS,not_nullable,,,0.0


#### Result: 
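The report lists *every* cell in row 0 as a separate failure, except the one we changed: `TotalFacilities` is the only column **not** shown, which is confusing. A likely cause is the edit in the previous step rather than Pandera itself: `df_agency` has index labels 190, 197 and 198, so `df_agency.loc[0, 'TotalFacilities'] = 1.55` added a new row labelled 0 whose other columns are all null (which would also explain `year` becoming `float64`).  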
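
The row-enlargement behaviour of `.loc` can be reproduced on a toy frame. In this sketch the values are made up and only the index labels 190, 197, 198 mirror `df_agency`: assigning to a label that is not in the index adds a new row, while `edited.index[0]` targets the first existing row.

```python
import pandas as pd

toy = pd.DataFrame(
    {"year": [2021, 2021, 2021], "TotalFacilities": [1.23, 0.36, 0.41]},
    index=[190, 197, 198],
)

# Working on an explicit copy also avoids the SettingWithCopyWarning.
enlarged = toy.copy()
enlarged.loc[0, "TotalFacilities"] = 1.55   # label 0 is not in the index
print(enlarged.shape)            # (4, 2): a new row was added
print(enlarged["year"].dtype)    # float64, because the new row's year is NaN

edited = toy.copy()
edited.loc[edited.index[0], "TotalFacilities"] = 1.55   # edits row 190 in place
print(edited.shape)              # (3, 2)
```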

#### Conclusion: 
It is faster to switch to custom functions, like the ones I wrote for the A-30, and they also give greater control over how the output is formatted.

In [36]:
pd.set_option('display.max_colwidth',None)


Now testing validation against the schema with the above rules added.
* check that each `total facilities` value is a whole number
* check that the total facilities across all modes is > 0
* check that the total facilities across all modes is a whole number
* check that the State is == 'CA'