The BlackCat API came with no documentation, so this notebook inspects what it returns and in what format. It contains:  
1. The code for inspecting the data.
2. The steps for checking whether the data has changed in 2024 (and in later years).
  
These steps are inferred from the data itself. Contact BlackCat for authoritative instructions.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import pendulum
import re

**NOTE that the URL has the year at the end. Change this to whatever year of data you would like to get.**

In [2]:
api_2024 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2024"

In [3]:
api_2023 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2023"
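
Because only the final path segment changes between years, the URL can be built from the year instead of being pasted once per year. A minimal sketch; `BASE_URL` and `ntd_reports_url` are names introduced here for illustration and are not part of the BlackCat API.

```python
# Hypothetical helper: the base path is copied from the URLs above.
BASE_URL = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA"


def ntd_reports_url(year: int) -> str:
    """Return the report URL for one year by appending the year to the base path."""
    return f"{BASE_URL}/{int(year)}"


# One URL per year of interest.
api_urls = {year: ntd_reports_url(year) for year in (2023, 2024)}
```

`api_urls[2024]` can then be passed to `requests.get` in place of `api_2024`.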

In [4]:
response_2024 = requests.get(api_2024)

In [5]:
response_2023 = requests.get(api_2023)

In [6]:
display(
    response_2024,
    response_2023
)
#looking for response 200

<Response [200]>

<Response [200]>

In [7]:
blob_2023 = response_2023.json()
blob_2024 = response_2024.json()

In [8]:
#blob_2024

**Get table and org list**

In [9]:
# type(blob) #list
display(
    len(blob_2023), # number of reports returned for 2023
    len(blob_2024)
)

# type(blob[0]) #dict

# blob[0]

# blob['Tables']

85

64

In [10]:
tables_2023 = []
tables_2024 = []

# # For listing out ONLY the python dictionary keys in the blob (these are the tables) that start with "NTD"
# for k, v in blob[0].items():
#     if k.startswith("NTD"):
#         tables.append(k)

# For listing out ALL the dict keys:
for k, v in blob_2023[0].items():
    tables_2023.append(k)
    
for k, v in blob_2024[0].items():
    tables_2024.append(k)

# compare both table list
display(
    tables_2023 == tables_2024,
    #list(tables_2023), 
    list(tables_2024)
)

True

['ReportId',
 'Organization',
 'ReportPeriod',
 'ReportStatus',
 'ReportLastModifiedDate',
 'NTDReportingStationsAndMaintenance',
 'NTDTransitAssetManagementA15',
 'NTDAssetAndResourceInfo',
 'NTDReportingP10',
 'NTDReportingP20',
 'NTDReportingP50',
 'NTDReportingA35',
 'NTDReportingRR20_Intercity',
 'NTDReportingRR20_Rural',
 'NTDReportingRR20_Urban_Tribal',
 'NTDReportingTAMNarrative',
 'SS60']
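
`tables_2023 == tables_2024` only says whether the two lists are identical. When they differ, a set comparison shows which keys were added or dropped. A small sketch; `diff_keys` is a hypothetical helper and the sample lists are illustrative, not API output.

```python
def diff_keys(old_keys, new_keys):
    """Return the keys that appear in only one of the two lists."""
    old, new = set(old_keys), set(new_keys)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Illustrative input only; in the notebook, pass tables_2023 and tables_2024.
sample_old = ["ReportId", "Organization", "NTDReportingP10"]
sample_new = ["ReportId", "Organization", "NTDReportingP10", "SS60"]
key_changes = diff_keys(sample_old, sample_new)
print(key_changes)
```

Note that a set comparison ignores ordering, so it complements the list equality check rather than replacing it.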

In [11]:
# get list of all org names
org_list = [report["Organization"] for report in blob_2024]

In [12]:
# get index position of org names from org list
org_list.index("Eastern Sierra Transit Authority")

24

In [13]:
# slicing blob at specific index number, report name and 'data' field

len(blob_2024[24]["NTDReportingRR20_Rural"]["Data"])

56
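
Index positions such as `24` depend on the order of the API response, which may differ between pulls or years. Looking a report up by name avoids hard-coding the index. A sketch under assumptions: `find_report` is a hypothetical helper, and it strips whitespace because at least one organization name in the API output ends with a trailing space.

```python
def find_report(blob, org_name):
    """Return the first report whose Organization matches org_name, ignoring surrounding whitespace."""
    wanted = org_name.strip()
    for report in blob:
        if (report.get("Organization") or "").strip() == wanted:
            return report
    raise KeyError(f"No report found for organization {org_name!r}")


# Illustrative stand-in for blob_2024; the real reports have many more keys.
sample_blob = [
    {"Organization": "Agency A", "NTDReportingRR20_Rural": {"Data": [{"Id": 1}, {"Id": 2}]}},
    {"Organization": "Agency B ", "NTDReportingRR20_Rural": {"Data": []}},
]
report = find_report(sample_blob, "Agency B")
print(len(report["NTDReportingRR20_Rural"]["Data"]))
```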

### Inspect whether any tables have changed from last year (2023)

1. Pull up the external tables yaml at airflow/dags/create_external_tables/ntd_report_validation/external_table_all_ntdreports.yml
2. Pull up the external_blackcat.all_ntdreports table in BigQuery
  
To help with steps 3 and 4 below, paste table names one by one into the cells below and inspect the API data for whichever year you are interested in. If you don't see any data, cycle through the JSON list by changing `blob_2024[0]` to `blob_2024[1]`, `blob_2024[2]`, etc.  
  
3. Compare the table names above with the table list in that schema. NOTE: table names in BigQuery are not *exactly* the same as in the API; they have been made all lowercase, with `_data` added.
4. Compare the individual columns within each of the above tables with the columns in the schema.
  
Change the schema in the yaml as needed to reflect the data. Do not remove any old column names; only add new columns and/or tables.
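
Steps 3 and 4 can be partly automated by collecting every column name that appears in a table across all reports, so that a report with no rows does not hide columns. A minimal sketch: `collect_columns` and `bq_table_name` are hypothetical helpers, and the BigQuery naming rule (lowercase, with `_data` appended at the end) is an assumption based on the note above, so verify it against the yaml.

```python
def collect_columns(blob, table):
    """Return the sorted union of row keys found in report[table]['Data'] across all reports."""
    columns = set()
    for report in blob:
        rows = (report.get(table) or {}).get("Data") or []
        for row in rows:
            columns.update(row.keys())
    return sorted(columns)


def bq_table_name(api_table):
    """Assumed naming rule: lowercase the API table name and append '_data'."""
    return api_table.lower() + "_data"


# Illustrative stand-in for blob_2024; "ExampleTable" is not a real API table.
sample_blob = [
    {"ExampleTable": {"Data": []}},
    {"ExampleTable": {"Data": [{"Id": 1, "Type": "x"}, {"Id": 2, "Note": None}]}},
]
print(bq_table_name("ExampleTable"), collect_columns(sample_blob, "ExampleTable"))
```

The resulting column list can be compared by eye, or with a set difference, against the column names in the yaml schema.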

In [14]:
# A10 report
# check complete
#blob_2024[3]['NTDReportingStationsAndMaintenance']['Data'][0]

In [15]:
# check complete

# new columns: 
    #"Type", 
    #"Note", 
    #"LastModifiedDate"

# added to draft yaml
#blob_2024[14]['NTDTransitAssetManagementA15']["Data"][0]

In [16]:
# A30
# check complete

# new columns:
    #'TotalVehicles',
    #'ActiveVehicles',
    #'DedicatedFleet',
    #'NoCapitalReplacementResponsibility',
    #'AutomatedorAutonomousVehicles',
    #'Manufacturer',
    #'DescribeOtherManufacturer',
    #'Model',
    #'YearRebuilt',
    #'OtherFuelType',
    #'DuelFuelType',
    #'StandingCapacity',
    #'OtherOwnershipType',
    #'EmergencyVehicles',
    #'TypeofLastRenewal',
    #'UsefulLifeBenchmark',
    #'MilesThisYear',
    #'AverageLifetimeMilesPerActiveVehicle'
                        
# added to draft yaml
#blob_2024[15]['NTDAssetAndResourceInfo']['Data'][1]

In [17]:
# check complete
#blob_2024[12]['NTDReportingP10']['Data']

In [18]:
#check complete
#blob_2024[5]['NTDReportingP20']['Data']

In [19]:
#check complete
#blob_2024[5]['NTDReportingP50']['Data']

In [20]:
#check complete
#blob_2024[16]['NTDReportingA35']['Data'][0]

In [138]:
#check complete
blob_2024[51]['NTDReportingRR20_Intercity']['Data'][5]

{'Id': 78,
 'ItemId': 13,
 'ReportId': 1084,
 'Item': '\r\nIntercity Service',
 'Type': 'Service Data',
 'OperationsExpended': None,
 'CapitalExpended': None,
 'Description': None,
 'AnnualVehicleRevMiles': 186651,
 'RegularUnlinkedPassengerTrips': None,
 'LastModifiedDate': '2024-10-08T16:06:56.9'}

In [22]:
# check complete
# new columns:
    # "AnnualVehicleRevMilesComments"
    # "AnnualVehicleRevHoursComments"
    # "AnnualUnlinkedPassTripsComments"
    # "AnnualVehicleMaxServiceComments"
    # "SponsoredServiceUPTComments"
# added to draft yaml
#blob_2024[16]['NTDReportingRR20_Rural']['Data'][2]

In [23]:
#check complete
#blob_2024[6]['NTDReportingRR20_Urban_Tribal']['Data'][0]

In [139]:
# check complete
# NEW COLUMN(S)
    # 'VehiclesToBePurchasesNextYear'
    
# added to draft yaml
blob_2024[2]['NTDReportingTAMNarrative']['Data'][0]

{'Id': 601,
 'ReportId': 1012,
 'Type': 'Revenue Vehicles',
 'Category': 'Light-Duty Mid-Sized Bus',
 'VehiclesInAssetClass': 1,
 'VehiclesExceededULBTAMPlan': 0,
 'TAMPlanGoalsDescription': None,
 'VehiclesToBeRetiredBeyondULB': 0,
 'VehiclesToBePurchasesNextYear': 0,
 'VehiclesPastULBInTAM': 0,
 'LastModifiedDate': None}

In [25]:
# check complete
#blob_2024[16]['SS60']['Data'][0]

## examine warehouse data

In [26]:
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

In [27]:
org = "Eastern Sierra Transit Authority"

The cell below queries the `fct_ntd_rr20_service_checks` model and its associated `int` and `stg` models from the warehouse.

Update the model and column names to whatever you need to check.

In [28]:
# Query the fct model from warehouse
fct_ntd_rr20_service_checks = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org)
    >> collect()
)

# Query the int model from warehouse
int_ntd_rr20_service_3ratios_wide = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org)
    >> collect()
)

stg_ntd_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization == org, 
              _.api_report_period == 2024)
    >> collect()
)



In [46]:
test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[39])
    >> collect()
)

test.head()

# error_list= [9 (-2.3/0), 14 (0/0), 16(0/0), 36(-4/0)]

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Demand Response (DR) - (DO),2024,Pass,"2024 = 83273, 2023 = 72892chg = 12.5%",,,2024-10-09 16:16:42.863005+00:00
1,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Deviated Fixed Route (DF) - (DO),2024,Pass,"2024 = 32307, 2023 = 32184chg = 0.4%",,,2024-10-09 16:16:42.863005+00:00
2,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 720355, 2023 = 559702chg = 22.3%",,,2024-10-09 16:16:42.863005+00:00
3,Mendocino Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Demand Response (DR) - (DO),2024,Pass,"2024 = 4, 2023 = 4chg = 0%",,,2024-10-09 16:16:42.863005+00:00
4,Mendocino Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Deviated Fixed Route (DF) - (DO),2024,Pass,"2024 = 1, 2023 = 1chg = 0%",,,2024-10-09 16:16:42.863005+00:00


In [47]:
test["organization"].value_counts()

Mendocino Transit Authority    27
Name: organization, dtype: int64

In [30]:
error_list= [9, 14, 16, 36]

error_orgs = [org_list[i] for i in error_list]

error_orgs

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'Trinity County Department of Transportation ']

In [31]:
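# can't find the error orgs in the int model
# NOTE: filter on the org names (error_orgs), not the index positions (error_list).
# This dataframe was already filtered to `org` above, so no rows are expected here.
int_ntd_rr20_service_3ratios_wide[int_ntd_rr20_service_3ratios_wide["organization"].isin(error_orgs)]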

Unnamed: 0,organization,mode,cph_last_year,cph_this_year,mpv_last_year,mpv_this_year,frpt_last_year,frpt_this_year,rev_speed_last_year,rev_speed_this_year,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year,vrh_last_year,vrh_this_year,upt_last_year,upt_this_year


In [32]:
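# do the error orgs appear in the stg model? Not in this dataframe.
# NOTE: filter on the org names (error_orgs), not the index positions (error_list).
# This dataframe was already filtered to `org` above, so no rows are expected here.
stg_ntd_rr20_rural[stg_ntd_rr20_rural["organization"].isin(error_orgs)]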

Unnamed: 0,organization,api_report_status,api_report_last_modified_date,api_report_period,id,report_id,item,revenue,type,css_class,operations_expended,capital_expended,description,annual_vehicle_rev_miles,annual_vehicle_rev_hours,annual_unlinked_pass_trips,annual_vehicle_max_service,sponsored_service_upt,quantity,last_modified_date


In [33]:
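# checking blob for error list
# NOTE: this is the length of the dict (its only key is "Data"), not the number of rows;
# use len(blob_2024[i]["NTDReportingRR20_Rural"]["Data"]) for the row count
for i in error_list:
    print(len(blob_2024[i]["NTDReportingRR20_Rural"]))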

1
1
1
1


In [34]:
for i in range(10):
    print(blob_2024[9]["NTDReportingRR20_Rural"]["Data"][i].get("Revenue"))



Passenger-Paid Fares
Organization-Paid Fares
Passenger-Paid Fares
Organization-Paid Fares
None
None
None
None


In [35]:
blob_2024[9]["NTDReportingRR20_Rural"]["Data"][0]

{'Id': 14150,
 'ReportId': 1026,
 'Item': 'Demand Response (DR) - (DO)',
 'Revenue': '',
 'Type': 'Expenses by Mode',
 'CssClass': 'expense',
 'OperationsExpended': None,
 'CapitalExpended': None,
 'Description': None,
 'AnnualVehicleRevMiles': None,
 'AnnualVehicleRevMilesComments': None,
 'AnnualVehicleRevHours': None,
 'AnnualVehicleRevHoursComments': None,
 'AnnualUnlinkedPassTrips': None,
 'AnnualUnlinkedPassTripsComments': None,
 'AnnualVehicleMaxService': None,
 'AnnualVehicleMaxServiceComments': None,
 'SponsoredServiceUPT': None,
 'SponsoredServiceUPTComments': None,
 'Quantity': None,
 'LastModifiedDate': '2024-09-26T17:48:25.27'}

In [36]:
error_orgs

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'Trinity County Department of Transportation ']

In [37]:
error_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization.isin(error_orgs),
              _.api_report_period == 2024)
    >> collect()
)

In [38]:
error_rr20_rural.head(3)

Unnamed: 0,organization,api_report_status,api_report_last_modified_date,api_report_period,id,report_id,item,revenue,type,css_class,operations_expended,capital_expended,description,annual_vehicle_rev_miles,annual_vehicle_rev_hours,annual_unlinked_pass_trips,annual_vehicle_max_service,sponsored_service_upt,quantity,last_modified_date
0,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14150,1026,Demand Response (DR) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
1,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14151,1026,Deviated Fixed Route (DF) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
2,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14152,1026,Demand Response (DR) - (DO),Passenger-Paid Fares,Fare Revenues,revenue,,,,,,,,,,2024-09-26 17:48:25.270000+00:00


In [39]:
error_rr20_rural["capital_expended"].value_counts()

Series([], Name: capital_expended, dtype: int64)

In [40]:
# which report statuses appear for `org` in the stg model?
stg_ntd_rr20_rural["api_report_status"].value_counts()

Submitted    56
Name: api_report_status, dtype: int64

In [41]:
# investigating the rr20 service check errors
fct_ntd_rr20_service_checks[fct_ntd_rr20_service_checks["check_status"] =="Fail"].head()

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
3,Eastern Sierra Transit Authority,RR20F-139: Revenue Speed change,Bus (MB) (Fixed Route) - (DO),2024,Fail,"2024 = 17, 2023 = 16.9chg = -83%","The revenue speed, the avg speed of your vehic...",,2024-10-09 16:13:28.364198+00:00
4,Eastern Sierra Transit Authority,RR20F-139: Revenue Speed change,Demand Response (DR) - (DO),2024,Fail,"2024 = 8.8, 2023 = 8.6chg = -91%","The revenue speed, the avg speed of your vehic...",,2024-10-09 16:13:28.364198+00:00
5,Eastern Sierra Transit Authority,RR20F-139: Revenue Speed change,Commuter Bus (CB) - (DO),2024,Fail,"2024 = 43.5, 2023 = 33.4chg = -57%","The revenue speed, the avg speed of your vehic...",,2024-10-09 16:13:28.364198+00:00
9,Eastern Sierra Transit Authority,RR20F-139: Vehicle Revenue Miles (VRM) % change,Bus (MB) (Fixed Route) - (DO),2024,Fail,"2024 = 606924, 2023 = 585952chg = 3.5%",The annual vehicle revenue miles for this mode...,,2024-10-09 16:13:28.364198+00:00
10,Eastern Sierra Transit Authority,RR20F-139: Vehicle Revenue Miles (VRM) % change,Demand Response (DR) - (DO),2024,Fail,"2024 = 170407, 2023 = 156732chg = 8%",The annual vehicle revenue miles for this mode...,,2024-10-09 16:13:28.364198+00:00


In [42]:
fct_ntd_rr20_service_checks["name_of_check"].unique()

array(['RR20F-179: Missing Service Data check',
       'RR20F-139: Revenue Speed change',
       'RR20F-171: Vehicles of Maximum Service (VOMS) change',
       'RR20F-139: Vehicle Revenue Miles (VRM) % change',
       'RR20F-143: Vehicle Revenue Miles (VRM) change from zero',
       'RR20F-146: Miles per Vehicle change',
       'RR20F-005: Cost per Hour change',
       'RR20F-139: Fare Revenue Per Trip change',
       'RR20F-154: Trips per Hour change'], dtype=object)

In [43]:
fct_ntd_rr20_service_checks["value_checked"].value_counts()

2024 = 606924, 2023 = 585952chg = 3.5%                        2
2024 = 170407, 2023 = 156732chg = 8%                          2
2024 = 189887, 2023 = 115303chg = 39.3%                       2
2024 Service data: VRM=606924 VRH=35605 UPT=883954 VOMS=24    1
2024 = 37977.4, 2023 = 23060.6chg = 39.3%                     1
2024 = 2.9, 2023 = 2.9chg = 0%                                1
2024 = 24.8, 2023 = 24.4chg = 1.6%                            1
2024 = 73.3, 2023 = 245chg = -234.2%                          1
2024 = 21, 2023 = 40.4chg = -92.4%                            1
2024 = 1.3, 2023 = 2.5chg = -92.3%                            1
2024 = 98, 2023 = 124.2chg = -26.7%                           1
2024 = 95.4, 2023 = 78.9chg = 17.3%                           1
2024 = 136.3, 2023 = 118.4chg = 13.1%                         1
2024 = 25288.5, 2023 = 24414.7chg = 3.5%                      1
2024 = 21300.9, 2023 = 19591.5chg = 8%                        1
2024 Service data: VRM=170407 VRH=19459 

In [44]:
# `int_rr20` is not defined in this notebook; use the int model dataframe queried above
display(
    int_ntd_rr20_service_3ratios_wide["vrm_this_year"],
    int_ntd_rr20_service_3ratios_wide["vrm_last_year"]
)

In [None]:
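# sum `operations_expended` over the "expense" rows and over the "revenue" rows for the agency
# NOTE: `stg_rr20`, `int_rr20` and `fct_rr20` are not defined above; assign them to the
# matching stg, int and fct dataframes before running this cell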

op_ex = stg_rr20[stg_rr20["css_class"]=="expense"]["operations_expended"].sum()
op_rev = stg_rr20[stg_rr20["css_class"]=="revenue"]["operations_expended"].sum()

int_op_ex = int_rr20["Total_Annual_Op_Expenses_by_Mode"][0]
int_op_rev = int_rr20["Total_Annual_Op_Revenues_Expended"][0]

print(f"""Org: {org}
Does opX and opRev match? {op_ex == op_rev}
Does warehouse opX and opRev match? {int_op_ex == int_op_rev}
opx: {op_ex}
oprev: {op_rev}
Does opX and warehouse opX match? {op_ex == int_op_ex}
Does opRev and warehouse opRev match? {op_rev == int_op_rev}
warehouse opx: {int_op_ex}
warehouse oprev: {int_op_rev}""")
display(fct_rr20)

In [71]:
display(error_list, error_orgs)

[9, 14, 16, 36]

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'Trinity County Department of Transportation ']

In [184]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

arvin_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[9],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              _.name_of_check == "RR20F-146: Miles per Vehicle change",
             )
    
    >> collect()
)

arvin_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[9])
    >> collect()
)

keep_cols = [
    'organization',
    'tph_last_year',
    'tph_this_year',
    'voms_last_year',
    'voms_this_year',
    'vrm_last_year',
    'vrm_this_year',
]

# failed at:
# RR20F-154: Trips per Hour change
# RR20F-171: Vehicles of Maximum Service (VOMS) change

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have
display(
    #arvin_check_test,
    arvin_int_check[keep_cols]
)

Unnamed: 0,organization,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,10.02455,0.0,3.0,0.0,120801.0,116550.0


In [121]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

truckee_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[14],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              _.name_of_check == "RR20F-146: Miles per Vehicle change",
             )
    
    >> collect()
)

truckee_check_test
# failed at:
# RR20F-139: Fare Revenue Per Trip change

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have


Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Town of Truckee,RR20F-146: Miles per Vehicle change,Bus (MB) (Fixed Route) - (PT),2024,Fail,"2024 = 28366.7, 2023 = 36385chg = -28.3%",The miles per vehicle for this mode has change...,,2024-10-09 16:56:31.198352+00:00
1,Town of Truckee,RR20F-146: Miles per Vehicle change,Demand Response (DR) - (PT),2024,Fail,"2024 = 7380, 2023 = 8124.3chg = -10.1%",The miles per vehicle for this mode has change...,,2024-10-09 16:56:31.198352+00:00


In [151]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

colusa_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[16],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              _.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
             )
    
    >> collect()
)

colusa_check_test
# failed at:
# RR20F-154: Trips per Hour change

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have



Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Colusa County Transit Agency,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Demand Response (DR) - (DO),2024,Pass,"2024 = 7, 2023 = 7chg = 0%",,,2024-10-09 17:43:00.658577+00:00


In [150]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

trinity_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[36],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              _.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
             )
    
    >> collect()
)

trinity_check_test
# failed at:
# RR20F-171: Vehicles of Maximum Service (VOMS) change

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have


Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Trinity County Department of Transportation,RR20F-154: Trips per Hour change,Intercity Service (IC) - (DO),2024,Fail,"2024 = 2, 2023 = 2.1chg = -5%",The calculated trips per hour for this mode ha...,,2024-10-09 17:42:42.963943+00:00


In [149]:
# What does a good check look like?
check_list = [
    "RR20F-154: Trips per Hour change",
    "RR20F-171: Vehicles of Maximum Service (VOMS) change",
    "RR20F-139: Fare Revenue Per Trip change",
]

# 


fct_ntd_rr20_service_checks.shape

(27, 9)

## for RR20F-154: Trips per Hour change: 
- first checks whether `tph_this_year` or `tph_last_year` is `NULL` or `0`. If so, it returns "did not run".
- otherwise, it calculates the absolute value of `(tph_this_year - tph_last_year) / tph_last_year`.
- City of Arvin's error was (-2.3/0), i.e. a division by zero (the `int` model shows `tph_this_year` = 0 for City of Arvin). A zero in either year should have been caught by the first condition and returned "did not run".
- other agencies with this check do have the "did not run" check status.

## for RR20F-171: Vehicles of Maximum Service (VOMS) change:
- first checks whether `voms_this_year` and `voms_last_year` are some combination of 0, not 0, or NULL, and returns "fail",
- then checks whether `voms_this_year` and `voms_last_year` are NULL or != 0, and returns "did not run".
- but City of Arvin's error was (0/0) and Trinity's error was (-4/0), both of which should have been caught.
- other agencies with this check do have the "did not run" status.

## for RR20F-139: Fare Revenue Per Trip change:
- checks whether `vrm_this_year` and `vrm_last_year` are NULL or 0. If so, it returns "did not run",
- then calculates the absolute value of `(vrm_this_year - vrm_last_year) / vrm_last_year`.
- but Truckee's error was 0/0, so `vrm_last_year` must equal 0, which should have been caught earlier?
- other agencies with this check do have the "did not run" status.
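
The guard-then-compare logic described above can be sketched in Python. This is not the warehouse SQL: `pct_change_check`, the `threshold` value and the exact status strings are assumptions for illustration. It shows why a zero or NULL on either side should short-circuit to "Did Not Run" before any division happens.

```python
import math


def _is_missing(value):
    """True for None, NaN, or zero."""
    return value is None or (isinstance(value, float) and math.isnan(value)) or value == 0


def pct_change_check(this_year, last_year, threshold=0.3):
    """Return 'Did Not Run', 'Fail' or 'Pass' for a year-over-year change check.

    threshold is a hypothetical cutoff, not the value used in the warehouse models.
    """
    if _is_missing(this_year) or _is_missing(last_year):
        return "Did Not Run"
    change = abs((this_year - last_year) / last_year)
    return "Fail" if change >= threshold else "Pass"


# A zero on either side never reaches the division.
print(pct_change_check(0.0, 2.289235))
print(pct_change_check(4.0, 0.0))
```

If the warehouse models follow the same order of conditions, a division-by-zero error suggests that the guard is testing different columns from the ones used in the division, which is worth checking in the model SQL.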

In [172]:
# check the rr20 service checks for any rows that have "did not run" in their "check_status"

not_list = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(~_.organization.isin(org_list),
             )
    
    >> collect()
)

not_list[(not_list["name_of_check"].isin(check_list)) & (not_list["check_status"]=="Did Not Run")]["description"].value_counts()

No data but this check was run before the NTD submission due date in 2024.             46
No data but this check was run before the NTD submission due date in 2024 for VOMS.    23
Name: description, dtype: int64

In [187]:
error_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization.isin(error_orgs))
    >> collect()
)

keep_cols = [
    'organization',
    'tph_last_year',
    'tph_this_year',
    'voms_last_year',
    'voms_this_year',
    'vrm_last_year',
    'vrm_this_year',
]

error_int_check[keep_cols]

Unnamed: 0,organization,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,10.02455,0.0,3.0,0.0,120801.0,116550.0
2,Colusa County Transit Agency,4.17158,0.0,7.0,7.0,141531.0,149880.0
3,Town of Truckee,6.302895,5.344347,3.0,3.0,109155.0,85100.0
4,Town of Truckee,2.506347,4.224924,3.0,3.0,24373.0,22140.0
5,Trinity County Department of Transportation,2.102253,2.021668,4.0,0.0,116650.0,128945.0


#### Extract the org name and details from the blob
This is extra code to show how to inspect what organizations are in the API at any given time, in case it is helpful. 

In [None]:
for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod_string = x.get('ReportLastModifiedDate')
    last_mod = pendulum.from_format(last_mod_string, 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    print(f"Report details: ID {report_id}, org {org}, report period {period}, status {status}, last modified on {last_mod_string}.")
#     print(f"New datetime {last_mod}")
#     print(f"iso is {iso}")

#### Quick check on what orgs are in this API, and how many have RR-20 info

In [None]:
org_data = []

for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod = pendulum.from_format(x.get('ReportLastModifiedDate'), 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    
    
    rural = x['NTDReportingRR20_Rural']
    for k,v in rural.items():
        rural_n = len(v)
    city = x['NTDReportingRR20_Intercity']
    for k,v in city.items():
        city_n = len(v)
    urban_tribal = x['NTDReportingRR20_Urban_Tribal']
    for k,v in urban_tribal.items():
        urban_n = len(v)
    
    org_info = pd.DataFrame(data=[[report_id, org, period, status, iso, rural_n, city_n, urban_n]], 
                            columns=['report_id', 'organization', 'report_period', 'report_status', 'last_modified', 
                                     'rr20_rural_rows', 'rr20_intercity_rows', 'rr20_urban-tribal_rows'])
#     whole_df = pd.concat([org_info, raw_df], axis=1).sort_values(by='organization')
    
    org_data.append(org_info)


In [None]:
newapi = pd.concat(org_data)
print(len(newapi))
newapi.head()
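
The loop above builds one single-row DataFrame per organization and concatenates them. An alternative sketch collects plain dictionaries first, which can then be passed to `pd.DataFrame` in one call. `rr20_row_counts` and `RR20_TABLES` are hypothetical names; the sketch assumes each RR-20 table is a dict with a `Data` list, as seen in the cells above.

```python
# Output column name -> API table name.
RR20_TABLES = {
    "rr20_rural_rows": "NTDReportingRR20_Rural",
    "rr20_intercity_rows": "NTDReportingRR20_Intercity",
    "rr20_urban-tribal_rows": "NTDReportingRR20_Urban_Tribal",
}


def rr20_row_counts(blob):
    """Return one dict per report with its organization and the row count of each RR-20 table."""
    records = []
    for report in blob:
        record = {
            "report_id": report.get("ReportId"),
            "organization": report.get("Organization"),
        }
        for column, table in RR20_TABLES.items():
            # Treat a missing table or a missing/None "Data" value as zero rows.
            rows = (report.get(table) or {}).get("Data") or []
            record[column] = len(rows)
        records.append(record)
    return records


# Illustrative stand-in for blob_2024.
sample_blob = [
    {
        "ReportId": 1,
        "Organization": "Agency A",
        "NTDReportingRR20_Rural": {"Data": [{}, {}]},
        "NTDReportingRR20_Intercity": {"Data": []},
        "NTDReportingRR20_Urban_Tribal": {"Data": None},
    },
]
counts = rr20_row_counts(sample_blob)
print(counts)
```

In the notebook, `pd.DataFrame(rr20_row_counts(blob_2024))` would give a table comparable to `newapi`, minus the period, status and date columns.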

In [None]:
newapi.to_csv('../data/newapi_rr20_11-27-23.csv')

## Convert API data to dataframes
Here using the test API to develop a function.

Just shove entire blob into a dataframe - this approach is what's recommended by Cal-ITP. They prefer we then do any transformations and separating of tables on dbt.  
Downsides:
* there are many columns with nested data (converts to lists and dictionaries). Basically each NTD report is in ONE column.
* the column names get changed because of the nesting and of repeated columns

In [None]:
df = pd.json_normalize(blob_2024)
df

In [None]:
df.info()

In [None]:
df['ReportLastModifiedDate'] =  df['ReportLastModifiedDate'].astype('datetime64[ns]')
# df['ReportLastModifiedDate'] = pd.to_datetime(df['ReportLastModifiedDate'], format='%m/%d/%Y %I:%M:%S %p')  # assumes a 12-hour clock with AM/PM

In [None]:
df

In [None]:
user_dict = blob_2024[0]['NTDReportingP50']['Data']
user_dict # a list of dictionaries. Each dict is one row of data.

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

However in several tables, rows have several columns that are nested dictionaries.  
  
The following code explores ways to unnest them and expand the dataframe rows. **NOTE WE DID NOT USE THIS APPROACH IN PRODUCTION. We decided to unnest tables using SQL instead, in the `staging` dbt models.**

In [None]:
pd.json_normalize(user_dict)

# This expands columns instead of expanding rows. Not exactly what we want.

In [None]:
# We only really want the "Text" value in the dictionaries in the "Mode" and "Type" columns.
# user_dict[0]['Mode']
user_dict[0]

In [None]:
# How to replace certain values in a key:value pair of an existing python dictionary.
original = user_dict[0]
copy = {**original, 'Mode': original['Mode']['Text'], 
        'Type': original['Type']['Text']}
copy

----
Done! This worked but is not super ideal because we hard-code the keys that we want to change instead of iterating over them, but it works as long as we know which dictionary items in each table are nested.  

In [None]:
# Trying loop of creating new dict from old dict.
# New dict will not be nested - checks for a nested dict in each value; for each nested dict, 
# we extract only the k,v pair where the key == 'Text' 

copy_test = {**original}
for k,v in copy_test.items():
    if type(v) is dict:
        copy_test[k] = copy_test[k]['Text']
        
copy_test

In [None]:
## Worked! Now try the above loop over an entire JSON data table

for x in user_dict:
    for k,v in x.items():
        if type(v) is dict:
            x[k] = x[k]['Text']

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

#### Table is now one level and in the format desired.
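
The loop above edits `user_dict` in place and raises a `KeyError` if a nested dictionary has no `Text` key. A non-mutating variant is sketched below; `flatten_rows` is a hypothetical helper, and keeping only the `Text` value follows the choice made above. The sample rows, including the `Value` key, are illustrative and not API output.

```python
def flatten_rows(rows, key="Text"):
    """Return new row dicts where each nested dict is replaced by its `key` value (None if absent)."""
    return [
        {k: (v.get(key) if isinstance(v, dict) else v) for k, v in row.items()}
        for row in rows
    ]


# Illustrative rows shaped like the nested "Mode" and "Type" columns.
sample_rows = [
    {"Id": 1, "Mode": {"Text": "Bus", "Value": "MB"}, "Type": {"Text": "DO"}},
    {"Id": 2, "Mode": None, "Type": {"Value": "PT"}},
]
flat = flatten_rows(sample_rows)
print(flat)
```

Because the input is left untouched, the same raw rows can be flattened again with a different `key` if another field of the nested dictionaries turns out to be needed.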