The BlackCat API came with no instructions. Here we just inspect what is in it and its format. This notebook contains:  
1. the code for inspecting the data 
2. What to do to check whether data has changed in 2024 (and years after that). 
  
We are inferring what to do. Contact BlackCat for further instructions.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import pendulum
import re

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_colwidth', None)

**NOTE that the URL has the year at the end. Change this to whatever year of data you would like to get.**

In [3]:
api_2024 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2024"

In [4]:
api_2023 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2023"

In [5]:
response_2024 = requests.get(api_2024)

In [6]:
response_2023 = requests.get(api_2023)

In [7]:
display(
    response_2024,
    response_2023
)
#looking for response 200

<Response [200]>

<Response [200]>

In [8]:
blob_2023 = response_2023.json()
blob_2024 = response_2024.json()

In [9]:
#blob_2024

**Get table and org list**

In [10]:
# type(blob) #list
display(
    len(blob_2023), #35
    len(blob_2024)
)

# type(blob[0]) #dict

# blob[0]

# blob['Tables']

85

71

In [11]:
tables_2023 = []
tables_2024 = []

# # For listing out ONLY the python dictionary keys in the blob (these are the tables) that start with "NTD"
# for k, v in blob[0].items():
#     if k.startswith("NTD"):
#         tables.append(k)

# For listing out ALL the dict keys:
for k, v in blob_2023[0].items():
    tables_2023.append(k)
    
for k, v in blob_2024[0].items():
    tables_2024.append(k)

# compare both table list
display(
    tables_2023 == tables_2024,
    #list(tables_2023), 
    list(tables_2024)
)

True

['ReportId',
 'Organization',
 'ReportPeriod',
 'ReportStatus',
 'ReportLastModifiedDate',
 'NTDReportingStationsAndMaintenance',
 'NTDTransitAssetManagementA15',
 'NTDAssetAndResourceInfo',
 'NTDReportingP10',
 'NTDReportingP20',
 'NTDReportingP50',
 'NTDReportingA35',
 'NTDReportingRR20_Intercity',
 'NTDReportingRR20_Rural',
 'NTDReportingRR20_Urban_Tribal',
 'NTDReportingTAMNarrative',
 'SS60']

In [12]:
# get list all org names 
org_list = [blob_2024[i]["Organization"] for i in range(len(blob_2024))]

In [13]:
# get index position of org names from org list
org_list.index("Eastern Sierra Transit Authority")

24

In [14]:
# slicing blob at specific index number, report name and 'data' field

len(blob_2024[24]["NTDReportingRR20_Rural"]["Data"])

56

### Inspect whether any tables have changed from last year (2023)

1. Pull up the external tables yaml at airflow/dags/create_external_tables/ntd_report_validation/external_table_all_ntdreports.yml
2. Pull up the external_blackcat.all_ntdreports table in BigQuery
  
To help do #3 and #4 below, use the cell below to copy in table names one by one and inspect the API data for whichever year one is interested in. If you don't see any data, then cycle through the JSON list by changing `blob[0]` to `blob[1]`, `blob[3]` etc.  
  
3. Compare the table names above with what is in the table list on the schema there. NOTE table names in BigQuery are not *exactly* the same as the API, they have been made all lowercase with `_data` added.
4. Compare the individual columns within each of the above tables to what is there.
  
Change the schema in the yaml as needed to reflect the data. Do not remove any old column names. Just add new columns and/or tables

In [15]:
# A10 report
# check complete
#blob_2024[3]['NTDReportingStationsAndMaintenance']['Data'][0]

In [16]:
# check complete

# new columns: 
    #"Type", 
    #"Note", 
    #"LastModifiedDate"

# added to draft yaml
#blob_2024[14]['NTDTransitAssetManagementA15']["Data"][0]

In [17]:
# A30
# check complete

# new columns:
					 #'TotalVehicles':,
					 #'ActiveVehicles',
					 #'DedicatedFleet',
					 #'NoCapitalReplacementResponsibility',
					 #'AutomatedorAutonomousVehicles',
					 #'Manufacturer',
					 #'DescribeOtherManufacturer',
					 #'Model',
					 #'YearRebuilt',
					 #'OtherFuelType',
					 #'DuelFuelType',
					 #'StandingCapacity',
					 #'OtherOwnershipType',
					 #'EmergencyVehicles',
					 #'TypeofLastRenewal',
					 #'UsefulLifeBenchmark,
					 #'MilesThisYear',
					 #'AverageLifetimeMilesPerActiveVehicle'
                        
# added to draft yaml
#blob_2024[15]['NTDAssetAndResourceInfo']['Data'][1]

In [18]:
# check complete
#blob_2024[12]['NTDReportingP10']['Data']

In [19]:
#check complete
#blob_2024[5]['NTDReportingP20']['Data']

In [20]:
#check complete
#blob_2024[5]['NTDReportingP50']['Data']

In [21]:
#check complete
#blob_2024[16]['NTDReportingA35']['Data'][0]

In [22]:
#check complete
#blob_2024[50]['NTDReportingRR20_Intercity']['Data'][0]

In [23]:
# check complete
# new columns:
						#◊ "AnnualVehicleRevMilesComments"
						#◊ "AnnualVehicleRevHoursComments"
						#◊ "AnnualUnlinkedPassTripsComments"
						#◊ "AnnualVehicleMaxServiceComments"
                        # "SponsoredServiceUPTComments"
# added to draft yaml
#blob_2024[16]['NTDReportingRR20_Rural']['Data'][2]

In [24]:
#check complete
#blob_2024[6]['NTDReportingRR20_Urban_Tribal']['Data'][0]

In [25]:
# check complete
# NEW COLUMN(S)
    # 'VehiclesToBePurchasesNextYear'
    
# added to draft yaml
#blob_2024[2]['NTDReportingTAMNarrative']['Data'][0]

In [26]:
# check complete
#blob_2024[16]['SS60']['Data'][0]

## examine warehouse data

In [27]:
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

In [28]:
org = "Eastern Sierra Transit Authority"

The cell below checks the values in the `rr20_equal_totals_check` model, and its associated `int` and `stg` model from the warehouse.


Update model and column names with whatever you ned to check 

In [29]:
# Query the fct model from warehouse
fct_ntd_rr20_service_checks = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org)
    >> collect()
)

# Query the int model from warehouse
int_ntd_rr20_service_3ratios_wide = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org)
    >> collect()
)

stg_ntd_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization == org, 
              _.api_report_period == 2024)
    >> collect()
)



In [30]:
# use this to query the fct table for individual orgs in the org list.
# go one-by-one until you hit an error.
test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[39])
    >> collect()
)

test.head()

# errors encounterd at index 9, 14, 16 and 36. What are thse orgs?
# error_list= [9 (-2.3/0), 14 (0/0), 16(0/0), 36(-4/0)]

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


In [31]:
# list of org by index location
error_list= [9, 14, 16, 36]

# populate a new list of org names by using list comprehension to get names out of error_list
error_orgs = [org_list[i] for i in error_list]

error_orgs

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'San Joaquin Regional Transit District']

In [32]:
blob_2024[9]["NTDReportingRR20_Rural"]["Data"][0]

{'Id': 14150,
 'ReportId': 1026,
 'Item': 'Demand Response (DR) - (DO)',
 'Revenue': None,
 'Type': 'Expenses by Mode',
 'CssClass': 'expense',
 'OperationsExpended': 210688.0,
 'CapitalExpended': 0.0,
 'Description': 'Allocation: 1 bus DR and 4 bus DF',
 'AnnualVehicleRevMiles': None,
 'AnnualVehicleRevMilesComments': None,
 'AnnualVehicleRevHours': None,
 'AnnualVehicleRevHoursComments': None,
 'AnnualUnlinkedPassTrips': None,
 'AnnualUnlinkedPassTripsComments': None,
 'AnnualVehicleMaxService': None,
 'AnnualVehicleMaxServiceComments': None,
 'SponsoredServiceUPT': None,
 'SponsoredServiceUPTComments': None,
 'Quantity': None,
 'LastModifiedDate': '2024-10-09T20:50:21.587'}

In [33]:
# query the stg rr20 rural table for only orgs in the error_orgs list
error_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization.isin([error_orgs]),
              _.api_report_period == 2024)
    >> collect()
)

In [34]:
error_rr20_rural.head(3)

Unnamed: 0,organization,api_report_status,api_report_last_modified_date,api_report_period,id,report_id,item,revenue,type,css_class,operations_expended,capital_expended,description,annual_vehicle_rev_miles,annual_vehicle_rev_hours,annual_unlinked_pass_trips,annual_vehicle_max_service,sponsored_service_upt,quantity,last_modified_date
0,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14150,1026,Demand Response (DR) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
1,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14151,1026,Deviated Fixed Route (DF) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
2,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14152,1026,Demand Response (DR) - (DO),Passenger-Paid Fares,Fare Revenues,revenue,,,,,,,,,,2024-09-26 17:48:25.270000+00:00


In [35]:
# see names of checks in fct model
fct_ntd_rr20_service_checks["name_of_check"].unique()

array(['RR20F-005: Cost per Hour change',
       'RR20F-139: Vehicle Revenue Miles (VRM) % change',
       'RR20F-139: Revenue Speed change',
       'RR20F-143: Vehicle Revenue Miles (VRM) change from zero',
       'RR20F-171: Vehicles of Maximum Service (VOMS) change',
       'RR20F-139: Fare Revenue Per Trip change',
       'RR20F-154: Trips per Hour change',
       'RR20F-146: Miles per Vehicle change',
       'RR20F-179: Missing Service Data check'], dtype=object)

In [36]:
# see all modes in fct model
fct_ntd_rr20_service_checks["mode"].value_counts()

Demand Response (DR) - (DO)      9
Bus (MB) (Fixed Route) - (DO)    9
Commuter Bus (CB) - (DO)         9
Name: mode, dtype: int64

In [37]:
display(
    #int_rr20["vrm_this_year"],
    #int_rr20["vrm_last_year"]
)


In [38]:
# sum the "expense" and "operations" revenue/expense rows for the agency

#op_ex = stg_rr20[stg_rr20["css_class"]=="expense"]["operations_expended"].sum()
#op_rev = stg_rr20[stg_rr20["css_class"]=="revenue"]["operations_expended"].sum()

#int_op_ex = int_rr20["Total_Annual_Op_Expenses_by_Mode"][0]
#int_op_rev = int_rr20["Total_Annual_Op_Revenues_Expended"][0]

#print(f"""Org: {org}
#Does opX and opRev match? {op_ex == op_rev}
#Does warehouse opX and opRev match? {int_op_ex == int_op_rev}
#opx: {op_ex}
#oprev: {op_rev}
#Does opX and warehouse opX match? {op_ex == int_op_ex}
#Does opRev and warehouse opRev match? {op_rev == int_op_rev}
#warehouse opx: {int_op_ex}
#warehouse oprev: {int_op_rev}""")
#display(fct_rr20)

In [39]:
display(error_list, error_orgs)

[9, 14, 16, 36]

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'San Joaquin Regional Transit District']

In [41]:
# query the fct model without the error orgs, then see what modes are in it.
just_modes = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(~_.organization.isin(error_orgs),
             )
    >>collect()
)

mode_list = list(just_modes["mode"].unique())

DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 division by zero: -4 / 0

Location: us-west2
Job ID: e99eb887-67ef-4a83-878d-1d91b3f4e979

[SQL: SELECT `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`organization`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`name_of_check`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`mode`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`year_of_data`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`check_status`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`value_checked`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`description`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`Agency_Response`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`date_checked` 
FROM `mart_ntd_validation.fct_ntd_rr20_service_checks` AS `mart_ntd_validation.fct_ntd_rr20_service_checks_1` 
WHERE (`mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`organization` NOT IN UNNEST(%(organization_1:STRING)s))]
[parameters: {'organization_1': ['City of Arvin', 'Town of Truckee', 'Colusa County Transit Agency', 'San Joaquin Regional Transit District']}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

In [42]:
mode_list

NameError: name 'mode_list' is not defined

In [43]:
# check the error orgs to see which specific check and mode is failing in the validation report.
# or see which checks are still being produced

arvin_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[9],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              _.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              _.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)

arvin_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[9])
    >> collect()
)

keep_cols=['organization',
           "mode",
           'tph_last_year',
           'tph_this_year', 
           'voms_last_year', 
           'voms_this_year', 
           'vrm_last_year',
           'vrm_this_year',  
]

display(
arvin_check_test,
# failed at: 
# RR20F-154: Trips per Hour change, 
    # mode: Demand Response (DR) - (DO) (-2.3/0)
    # mode: Deviated Fixed Route (DF) - (DO) (-10/0)
    
# RR20F-171: Vehicles of Maximum Service (VOMS) change
    # mode: Demand Response (DR) - (DO) (-1/0)
    # mode: Deviated Fixed Route (DF) - (DO) (-3/0)

# need to check the `int_ntd_rr20_service_3ratios_wide` for to see what values they have 


arvin_int_check[keep_cols]
       )

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0


In [44]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

truckee_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[14],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              _.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              _.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)
truckee_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[14])
    >> collect()
)

display(
    truckee_check_test,
# failed at:
# RR20F-139: Fare Revenue Per Trip change
    # mode: Bus (MB) (Fixed Route) - (PT) (0/0)


# need to check the `int_ntd_rr20_service_3ratios_wide` for to see what values they have 
    truckee_int_check[keep_cols]
)

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Town of Truckee,RR20F-139: Fare Revenue Per Trip change,Demand Response (DR) - (PT),2024,Fail,"2024 = 0.3, 2023 = 0.7chg = -133.3%","The fare revenues per unlinked passenger trip for this mode has changed from last year by >= 25%, please provide a narrative justification.",,2024-10-11 17:37:28.465823+00:00


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
1,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0


In [45]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

colusa_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[16],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              _.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              _.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)


colusa_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[16])
    >> collect()
)

display(
colusa_check_test,
# failed at:
# RR20F-154: Trips per Hour change
    #mode: Demand Response (DR) - (DO) (-4.2/0)

# need to check the `int_ntd_rr20_service_3ratios_wide` for to see what values they have 
colusa_int_check[keep_cols]
)


Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0


In [46]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

trinity_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[36],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              _.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              _.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)

trinity_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[36])
    >> collect()
)

display(
    trinity_check_test,
# failed at:
# RR20F-171: Vehicles of Maximum Service (VOMS) change
    # mode: Intercity Service (IC) - (DO)

# need to check the `int_ntd_rr20_service_3ratios_wide` for to see what values they have 
    trinity_int_check[keep_cols]
)

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year


In [47]:
# What does a good check look like?
check_list = [
    "RR20F-154: Trips per Hour change",
    "RR20F-171: Vehicles of Maximum Service (VOMS) change",
    "RR20F-139: Fare Revenue Per Trip change",
]


fct_ntd_rr20_service_checks[fct_ntd_rr20_service_checks["name_of_check"].isin(check_list)]

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
12,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) change,Demand Response (DR) - (DO),2024,Pass,"2024 = 8, 2023 = 8chg = 0%",,,2024-10-11 16:52:02.360557+00:00
13,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) change,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 24, 2023 = 24chg = 0%",,,2024-10-11 16:52:02.360557+00:00
14,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) change,Commuter Bus (CB) - (DO),2024,Pass,"2024 = 5, 2023 = 5chg = 0%",,,2024-10-11 16:52:02.360557+00:00
15,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Demand Response (DR) - (DO),2024,Fail,"2024 = 21, 2023 = 40.4chg = -92.4%","The fare revenues per unlinked passenger trip for this mode has changed from last year by >= 25%, please provide a narrative justification.",,2024-10-11 16:52:02.360557+00:00
16,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 1.3, 2023 = 2.5chg = -92.3%",,,2024-10-11 16:52:02.360557+00:00
17,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Commuter Bus (CB) - (DO),2024,Fail,"2024 = 73.3, 2023 = 245chg = -234.2%","The fare revenues per unlinked passenger trip for this mode has changed from last year by >= 25%, please provide a narrative justification.",,2024-10-11 16:52:02.360557+00:00
18,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Demand Response (DR) - (DO),2024,Fail,"2024 = 2.9, 2023 = 2.9chg = 0%","The calculated trips per hour for this mode has changed from last year by >= 30%, please provide a narrative justification.",,2024-10-11 16:52:02.360557+00:00
19,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Bus (MB) (Fixed Route) - (DO),2024,Fail,"2024 = 24.8, 2023 = 24.4chg = 1.6%","The calculated trips per hour for this mode has changed from last year by >= 30%, please provide a narrative justification.",,2024-10-11 16:52:02.360557+00:00
20,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Commuter Bus (CB) - (DO),2024,Fail,"2024 = 3.7, 2023 = 2.5chg = 32.4%","The calculated trips per hour for this mode has changed from last year by >= 30%, please provide a narrative justification.",,2024-10-11 16:52:02.360557+00:00


In [48]:
# checking the int table for this validation report to insepct the tph, voms, and vrm values for these checks of the error orgs

error_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization.isin(error_orgs))
    >> collect()
)

keep_cols=['organization',
           "mode",
    'tph_last_year',
           'tph_this_year', 
           'voms_last_year', 
           'voms_this_year', 
           'vrm_last_year',
           'vrm_this_year',  
]

error_int_check[keep_cols]

Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0
2,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0
3,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
4,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0


# Summary of 2024 Update 
rr20_service_check errors investigation

## orgs that were failing checks
- City of Arvin
	- RR20F-154: Trips per Hour change
		- mode: Demand Response (DR) - (DO), error (-2.3/0)
		- mode: Deviated Fixed Route (DF) - (DO),  error (-10/0)
	- RR20F-171: Vehicles of Maximum Service (VOMS) change
		- mode: Demand Response (DR) - (DO), error (-1/0)
		- mode: Deviated Fixed Route (DF) - error (DO), (-3/0)
<br>
<br>
- Town of Truckee
    - RR20F-139: Fare Revenue Per Trip change
        - Mode: Bus (MB) (Fixed Route) - (PT), error (0/0)
<br>
<br>
- Colusa County Transit Agency
	- RR20F-154: Trips per Hour change 
        - Mode: Demand Response (DR) - (DO), error (-4.2/0)
<br>
<br>
- Trinity County Department of Transportation
	- RR20F-171: Vehicles of Maximum Service (VOMS) change 
        - Mode: Intercity Service (IC) - (DO), error (-4/0)


## for RR20F-154: Trips per Hour change: 
- condition first checks if `tph_this_year` or `tph_last_year` are `NULL` or `0`. if true, then return `did not run`. else, calculates the absolute value of `(tph_this_year - tph_last_year)/ tph_last_year`. 
- City of Arvin, mode: Demand Response (DR) - (DO): 
    - `tph_this year = 0` and `tph_last_year = 2.3`. should have returned `did not run` since tph_this_year = 0. but error was (-2.3/0)
    - error calculated (0-2.3)/0. instead of (0-2.3/2.3)
- Colusa County Transit Agency,Mode: Demand Response (DR) - (DO):
    - `tph_this_year = 0` and `tph_last_year = 4.2`. also should have returned `did not run`. but error returned as (-4.2/0). 
    - meaning the calculation was (0-4.2)/0, instead of (0-4.2)4.2.
<br>
<br>

## for RR20F-171: Vehicles of Maximum Service (VOMS) change:
- condition first checks if (`voms_this_year = 0` and `voms_last_year !=0`) or if (`voms_this_year !=0` and `voms_last_year NOT NULL and = 0`) "fail", then checks if `voms_this_year is NULL or 0`, then returns "did not run". then checks `voms_last_year is NULL or 0`, then returns "did not run". all else, return pass.
- City of Arvin, mode: Demand Response (DR) - (DO):
    - `voms_this_year =  0` and `voms_last_year = 1.0`. should have returned `Did Not Run`.but instead got error for (-1/0).
    - idk how it got that result 
- Trinity County Department of Transportation, Mode: Intercity Service (IC) - (DO):
    - `voms_this_year = 0`, `voms_last_year = 4.0`. should have returned `Did not run`. but did soem calculation to get (-4/0)
<br>
<br>

## for RR20F-139: Fare Revenue Per Trip change:
- condition check `vrm_this_year` and `vrm_last_year` if they are NULL or 0, if so, then return "did not run". then checks if the absolute value of (`vrm_this_year` - `vrm_last_year`)/ `vrm_last_year`. if results >=0.3 then fail, else pass
- Town of Truckee, Mode: Bus (MB) (Fixed Route) - (PT):
    - `vrm_this_year = 85100.0`, `vrm_last_year =109155.0 `. should have calculated `(85100.0-109155.0)/109155.0 = 0.22, pass`. but got some calculation as (0/0)?
    
## conclusion
- something about these values are not getting recognized in the conditions.
- unsure how to solve these errors


In [49]:
error_int_check[keep_cols]

Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0
2,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0
3,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
4,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0


In [50]:
all_int_service_ratio = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(~_.organization.isin(error_orgs))
    >> collect()
)
all_int_service_ratio.shape


(104, 20)

In [51]:
all_fct_service_checks = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(~_.organization.isin(error_orgs))
    >> collect()
)
all_fct_service_checks.shape

DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 division by zero: -4 / 0

Location: us-west2
Job ID: 6d6b6bb1-e664-42af-9831-075bd7a7600b

[SQL: SELECT `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`organization`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`name_of_check`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`mode`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`year_of_data`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`check_status`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`value_checked`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`description`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`Agency_Response`, `mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`date_checked` 
FROM `mart_ntd_validation.fct_ntd_rr20_service_checks` AS `mart_ntd_validation.fct_ntd_rr20_service_checks_1` 
WHERE (`mart_ntd_validation.fct_ntd_rr20_service_checks_1`.`organization` NOT IN UNNEST(%(organization_1:STRING)s))]
[parameters: {'organization_1': ['City of Arvin', 'Town of Truckee', 'Colusa County Transit Agency', 'San Joaquin Regional Transit District']}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

In [None]:
# are there any *_last_year = 0?
all_int_service_ratio.columns

In [52]:
# are there any *_last_year = 0?
display(
    all_service_ratio["tph_last_year"].unique(),
    all_service_ratio["voms_last_year"].unique(),
    all_service_ratio["vrm_last_year"].unique()
)


NameError: name 'all_service_ratio' is not defined

In [53]:
# check all the "value_checked" values. 
# which rows have "None"?

short_col_list =["organization","mode", "tph_this_year", "tph_last_year"]

display(
    #all_fct_service_checks[(all_fct_service_checks["value_checked"].isna()) & (all_fct_service_checks["name_of_check"].str.contains("RR20F-154:"))].head(1),
    all_fct_service_checks[(all_fct_service_checks["organization"].str.contains("Lassen Transit Service Agency"))& (all_fct_service_checks["name_of_check"].str.contains("RR20F-154:"))],
    all_service_ratio[all_service_ratio["organization"].str.contains("Lassen Transit Service Agency")][short_col_list].head(),
    all_service_ratio[all_service_ratio["tph_last_year"]== 0][short_col_list]
)

NameError: name 'all_fct_service_checks' is not defined

## checking for VIN data in A30 reports

In [98]:
org_list.index("City of Arvin")

9

In [99]:
org_num = 9
display(
    blob_2024[org_num]["Organization"],
    len(blob_2024[org_num]["NTDAssetAndResourceInfo"]["Data"]),
    blob_2024[org_num]["NTDAssetAndResourceInfo"]["Data"][1].get("Vin")
)

'City of Arvin'

6

'0114'

---

#### Extract the org name and details from the blob
This is extra code to show how to inspect what organizations are in the API at any given time, in case it is helpful. 

In [None]:
for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod_string = x.get('ReportLastModifiedDate')
    last_mod = pendulum.from_format(last_mod_string, 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    print(f"Report details: ID {report_id}, org {org}, report period {period}, status {status}, last modified on {last_mod_string}.")
#     print(f"New datetime {last_mod}")
#     print(f"iso is {iso}")

#### Quick check on what orgs are in this API, and how many have RR-20 info

In [None]:
org_data = []

for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod = pendulum.from_format(x.get('ReportLastModifiedDate'), 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    
    
    rural = x['NTDReportingRR20_Rural']
    for k,v in rural.items():
        rural_n = len(v)
    city = x['NTDReportingRR20_Intercity']
    for k,v in city.items():
        city_n = len(v)
    urban_tribal = x['NTDReportingRR20_Urban_Tribal']
    for k,v in urban_tribal.items():
        urban_n = len(v)
    
    org_info = pd.DataFrame(data=[[report_id, org, period, status, iso, rural_n, city_n, urban_n]], 
                            columns=['report_id', 'organization', 'report_period', 'report_status', 'last_modified', 
                                     'rr20_rural_rows', 'rr20_intercity_rows', 'rr20_urban-tribal_rows'])
#     whole_df = pd.concat([org_info, raw_df], axis=1).sort_values(by='organization')
    
    org_data.append(org_info)


In [None]:
newapi = pd.concat(org_data)
print(len(newapi))
newapi.head()

In [None]:
newapi.to_csv('../data/newapi_rr20_11-27-23.csv')

## Convert API data to dataframes
Here using the test API to develop a function.

Just shove entire blob into a dataframe - this approach is what's recommended by Cal-ITP. They prefer we then do any transformations and separating of tables on dbt.  
Downsides:
* there are many columns with nested data (converts to lists and dictionaries). Basically each NTD report is in ONE column.
* the column names get changed because of the nesting and of repeated columns

In [None]:
df = pd.json_normalize(blob_2024)
df

In [None]:
df.info()

In [None]:
df['ReportLastModifiedDate'] =  df['ReportLastModifiedDate'].astype('datetime64[ns]')
# df['ReportLastModifiedDate'] = pd.to_datetime(df['ReportLastModifiedDate'], format='%m/%d/YYYY HH:mm:ss %p')

In [None]:
df

In [None]:
user_dict = blob[0]['NTDReportingP50']['Data']
user_dict # a list of dictionaries. Each dict is one row of data.

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

However in several tables, rows have several columns that are nested dictionaries.  
  
The following code explores ways to unnest them and expand the dataframe rows. **NOTE WE DID NOT USE THIS APPROACH IN PRODUCTION. We decided to unnest tables using SQL instead, in the `staging` dbt models.**

In [None]:
pd.json_normalize(user_dict)

# This expands columns instead of expanding rows. Not exactly what we want.

In [None]:
# We only really want the "Text" value in the dictionaries in the "Mode" and "Type" columns.
# user_dict[0]['Mode']
user_dict[0]

In [None]:
# How to replace certain values in a key:value pair of an existing python dictionary.
original = user_dict[0]
copy = {**original, 'Mode': original['Mode']['Text'], 
        'Type': original['Type']['Text']}
copy

----
Done! This worked but is not super ideal because we hard-code the keys that we want to change instead of iterating over them, but it works as long as we know which dictionary items in each table are nested.  

In [None]:
# Trying loop of creating new dict from old dict.
# New dict will not be nested - checks for a nested dict in each value; for each nested dict, 
# we extract only the k,v pair where the key == 'Text' 

copy_test = {**original}
for k,v in copy_test.items():
    if type(v) is dict:
        copy_test[k] = copy_test[k]['Text']
        
copy_test

In [None]:
## Worked! Now try the above loop over an entire JSON data table

for x in user_dict:
    for k,v in x.items():
        if type(v) is dict:
            x[k] = x[k]['Text']

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

#### Table is now one level and in the format desired.