The BlackCat API came with no documentation, so this notebook inspects what the API returns and how the data is structured. It contains:  
1. Code for inspecting the data.
2. Steps for checking whether the data has changed in 2024 (and in later years).
  
The steps here are inferred from the data itself. Contact BlackCat for further instructions.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import pendulum
import re

In [87]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_colwidth', None)

**NOTE that the URL has the year at the end. Change this to whatever year of data you would like to get.**
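
Since only the trailing year changes between requests, the URL can be built from a year value rather than edited by hand. A minimal sketch, assuming the URL pattern used in this notebook also holds for other years; `BASE_URL` and `ntd_reports_url` are names made up for this illustration:

```python
# Assumption: only the trailing year segment of the URL changes between years.
BASE_URL = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA"


def ntd_reports_url(year: int) -> str:
    """Return the NTD reports endpoint for one report year (hypothetical helper)."""
    return f"{BASE_URL}/{year}"


# Build the URLs for the two years compared in this notebook.
urls = {year: ntd_reports_url(year) for year in (2023, 2024)}
print(urls[2024])
```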

In [2]:
api_2024 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2024"

In [3]:
api_2023 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2023"

In [4]:
response_2024 = requests.get(api_2024)

In [5]:
response_2023 = requests.get(api_2023)

In [6]:
display(
    response_2024,
    response_2023
)
#looking for response 200

<Response [200]>

<Response [200]>

In [7]:
blob_2023 = response_2023.json()
blob_2024 = response_2024.json()

In [8]:
#blob_2024

**Get table and org list**

In [9]:
# type(blob) #list
display(
    len(blob_2023),
    len(blob_2024)
)

# type(blob[0]) #dict

# blob[0]

# blob['Tables']

85

70

In [10]:
tables_2023 = []
tables_2024 = []

# # For listing out ONLY the python dictionary keys in the blob (these are the tables) that start with "NTD"
# for k, v in blob[0].items():
#     if k.startswith("NTD"):
#         tables.append(k)

# For listing out ALL the dict keys:
for k, v in blob_2023[0].items():
    tables_2023.append(k)
    
for k, v in blob_2024[0].items():
    tables_2024.append(k)

# compare both table list
display(
    tables_2023 == tables_2024,
    #list(tables_2023), 
    list(tables_2024)
)

True

['ReportId',
 'Organization',
 'ReportPeriod',
 'ReportStatus',
 'ReportLastModifiedDate',
 'NTDReportingStationsAndMaintenance',
 'NTDTransitAssetManagementA15',
 'NTDAssetAndResourceInfo',
 'NTDReportingP10',
 'NTDReportingP20',
 'NTDReportingP50',
 'NTDReportingA35',
 'NTDReportingRR20_Intercity',
 'NTDReportingRR20_Rural',
 'NTDReportingRR20_Urban_Tribal',
 'NTDReportingTAMNarrative',
 'SS60']

In [11]:
# get list all org names 
org_list = [blob_2024[i]["Organization"] for i in range(len(blob_2024))]

In [12]:
# get index position of org names from org list
org_list.index("Eastern Sierra Transit Authority")

24

In [13]:
# slicing blob at specific index number, report name and 'data' field

len(blob_2024[24]["NTDReportingRR20_Rural"]["Data"])

56
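
Index positions such as `24` are specific to one year's response (the 2023 and 2024 lists have different lengths), so a lookup by organization name may be less fragile. This is only a sketch: `rows_for_org` is a hypothetical helper, and `sample_blob` is made-up data that mimics the list-of-reports shape seen above.

```python
def rows_for_org(blob, org_name, table_name):
    """Return the 'Data' rows of one table for the report whose Organization matches."""
    for report in blob:
        if report.get("Organization") == org_name:
            table = report.get(table_name) or {}
            return table.get("Data") or []
    raise KeyError(f"No report found for organization {org_name!r}")


# Made-up stand-in for the API response; the real blob has many more keys per report.
sample_blob = [
    {"Organization": "Agency A", "NTDReportingRR20_Rural": {"Data": []}},
    {"Organization": "Agency B", "NTDReportingRR20_Rural": {"Data": [{"Id": 1}, {"Id": 2}]}},
]

print(len(rows_for_org(sample_blob, "Agency B", "NTDReportingRR20_Rural")))
```

With the real data this would be called as `rows_for_org(blob_2024, "Eastern Sierra Transit Authority", "NTDReportingRR20_Rural")`.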

### Inspect whether any tables have changed from last year (2023)

1. Pull up the external tables yaml at airflow/dags/create_external_tables/ntd_report_validation/external_table_all_ntdreports.yml
2. Pull up the external_blackcat.all_ntdreports table in BigQuery
  
To help with steps 3 and 4 below, use the cells that follow to copy in table names one at a time and inspect the API data for whichever year you are interested in. If you don't see any data, cycle through the JSON list by changing `blob[0]` to `blob[1]`, `blob[2]`, etc.  
  
3. Compare the table names above with the table list in the schema. NOTE: table names in BigQuery are not *exactly* the same as in the API; they have been made all lowercase with `_data` added.
4. Compare the individual columns within each of the above tables to what is in the schema.
  
Change the schema in the yaml as needed to reflect the data. Do not remove any old column names; only add new columns and/or tables.
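
Part of the column comparison can be automated. The sketch below finds the first report that has data for a table and diffs its keys between two years. It assumes each table is a dict with a `Data` list of records (as in the cells above) and that one record's keys stand in for the table's columns. `first_record`, `column_changes`, and the sample blobs are illustrative and not part of the API.

```python
def first_record(blob, table_name):
    """Return the first record found for a table, skipping reports with no data."""
    for report in blob:
        table = report.get(table_name) or {}
        data = table.get("Data") or []
        if data:
            return data[0]
    return None


def column_changes(blob_old, blob_new, table_name):
    """Compare the keys of one sample record per year for a single table."""
    old_cols = set(first_record(blob_old, table_name) or {})
    new_cols = set(first_record(blob_new, table_name) or {})
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
    }


# Made-up sample data shaped like the API response.
old_blob = [{"SS60": {"Data": [{"Id": 1, "Item": "x"}]}}]
new_blob = [
    {"SS60": {"Data": []}},  # first report has no data, so it is skipped
    {"SS60": {"Data": [{"Id": 2, "Item": "y", "LastModifiedDate": "2024-10-09"}]}},
]

print(column_changes(old_blob, new_blob, "SS60"))
```

This only compares one year of API data against another; the result still has to be checked against the yaml schema by hand.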

In [14]:
# A10 report
# check complete
#blob_2024[3]['NTDReportingStationsAndMaintenance']['Data'][0]

In [15]:
# check complete

# new columns: 
    #"Type", 
    #"Note", 
    #"LastModifiedDate"

# added to draft yaml
#blob_2024[14]['NTDTransitAssetManagementA15']["Data"][0]

In [16]:
# A30
# check complete

# new columns:
					 #'TotalVehicles',
					 #'ActiveVehicles',
					 #'DedicatedFleet',
					 #'NoCapitalReplacementResponsibility',
					 #'AutomatedorAutonomousVehicles',
					 #'Manufacturer',
					 #'DescribeOtherManufacturer',
					 #'Model',
					 #'YearRebuilt',
					 #'OtherFuelType',
					 #'DuelFuelType',
					 #'StandingCapacity',
					 #'OtherOwnershipType',
					 #'EmergencyVehicles',
					 #'TypeofLastRenewal',
					 #'UsefulLifeBenchmark',
					 #'MilesThisYear',
					 #'AverageLifetimeMilesPerActiveVehicle'
                        
# added to draft yaml
#blob_2024[15]['NTDAssetAndResourceInfo']['Data'][1]

In [17]:
# check complete
#blob_2024[12]['NTDReportingP10']['Data']

In [18]:
#check complete
#blob_2024[5]['NTDReportingP20']['Data']

In [19]:
#check complete
#blob_2024[5]['NTDReportingP50']['Data']

In [20]:
#check complete
#blob_2024[16]['NTDReportingA35']['Data'][0]

In [21]:
#check complete
#blob_2024[50]['NTDReportingRR20_Intercity']['Data'][0]

In [22]:
# check complete
# new columns:
    # "AnnualVehicleRevMilesComments"
    # "AnnualVehicleRevHoursComments"
    # "AnnualUnlinkedPassTripsComments"
    # "AnnualVehicleMaxServiceComments"
    # "SponsoredServiceUPTComments"
# added to draft yaml
#blob_2024[16]['NTDReportingRR20_Rural']['Data'][2]

In [23]:
#check complete
#blob_2024[6]['NTDReportingRR20_Urban_Tribal']['Data'][0]

In [24]:
# check complete
# NEW COLUMN(S)
    # 'VehiclesToBePurchasesNextYear'
    
# added to draft yaml
#blob_2024[2]['NTDReportingTAMNarrative']['Data'][0]

In [25]:
# check complete
#blob_2024[16]['SS60']['Data'][0]

## examine warehouse data

In [26]:
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

In [27]:
org = "Eastern Sierra Transit Authority"

The cell below checks the values in the `rr20_equal_totals_check` model, and its associated `int` and `stg` model from the warehouse.


Update the model and column names to whatever you need to check.

In [28]:
# Query the fct model from warehouse
fct_ntd_rr20_service_checks = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org)
    >> collect()
)

# Query the int model from warehouse
int_ntd_rr20_service_3ratios_wide = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org)
    >> collect()
)

stg_ntd_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization == org, 
              _.api_report_period == 2024)
    >> collect()
)



In [29]:
# use this to query the fct table for individual orgs in the org list.
# go one-by-one until you hit an error.
test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[39])
    >> collect()
)

test.head()

# errors encountered at index 9, 14, 16 and 36. What are these orgs?
# error_list= [9 (-2.3/0), 14 (0/0), 16(0/0), 36(-4/0)]

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Demand Response (DR) - (DO),2024,Pass,"2024 = 83273, 2023 = 72892chg = 12.5%",,,2024-10-10 20:07:11.253650+00:00
1,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 720355, 2023 = 559702chg = 22.3%",,,2024-10-10 20:07:11.253650+00:00
2,Mendocino Transit Authority,RR20F-143: Vehicle Revenue Miles (VRM) change ...,Deviated Fixed Route (DF) - (DO),2024,Pass,"2024 = 32307, 2023 = 32184chg = 0.4%",,,2024-10-10 20:07:11.253650+00:00
3,Mendocino Transit Authority,RR20F-146: Miles per Vehicle change,Demand Response (DR) - (DO),2024,Fail,"2024 = 20818.3, 2023 = 18223chg = 12.5%",The miles per vehicle for this mode has change...,,2024-10-10 20:07:11.253650+00:00
4,Mendocino Transit Authority,RR20F-146: Miles per Vehicle change,Bus (MB) (Fixed Route) - (DO),2024,Fail,"2024 = 37913.4, 2023 = 29458chg = 22.3%",The miles per vehicle for this mode has change...,,2024-10-10 20:07:11.253650+00:00
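
Rather than changing the index by hand, the one-by-one check could be wrapped in a loop that records which organizations raise an error. This is a sketch: `find_failing_indices` is a hypothetical helper, and `fake_query` stands in for the warehouse query so the example runs without a warehouse connection.

```python
def find_failing_indices(org_names, run_query):
    """Run `run_query(org)` for every org and collect the errors by index."""
    failures = {}
    for i, org in enumerate(org_names):
        try:
            run_query(org)
        except Exception as exc:  # broad on purpose: we only want to log what failed
            failures[i] = f"{type(exc).__name__}: {exc}"
    return failures


def fake_query(org):
    """Stand-in for the warehouse query; fails for one made-up org."""
    if org == "Agency B":
        raise ZeroDivisionError("division by zero")
    return org


failures = find_failing_indices(["Agency A", "Agency B", "Agency C"], fake_query)
print(failures)
```

In the notebook, `run_query` would wrap the `fct_ntd_rr20_service_checks` query filtered on `_.organization == org`.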


In [30]:
# list of org by index location
error_list= [9, 14, 16, 36]

# populate a new list of org names by using list comprehension to get names out of error_list
error_orgs = [org_list[i] for i in error_list]

error_orgs

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'Trinity County Department of Transportation ']

In [31]:
blob_2024[9]["NTDReportingRR20_Rural"]["Data"][0]

{'Id': 14150,
 'ReportId': 1026,
 'Item': 'Demand Response (DR) - (DO)',
 'Revenue': None,
 'Type': 'Expenses by Mode',
 'CssClass': 'expense',
 'OperationsExpended': 210688.0,
 'CapitalExpended': 0.0,
 'Description': 'Allocation: 1 bus DR and 4 bus DF',
 'AnnualVehicleRevMiles': None,
 'AnnualVehicleRevMilesComments': None,
 'AnnualVehicleRevHours': None,
 'AnnualVehicleRevHoursComments': None,
 'AnnualUnlinkedPassTrips': None,
 'AnnualUnlinkedPassTripsComments': None,
 'AnnualVehicleMaxService': None,
 'AnnualVehicleMaxServiceComments': None,
 'SponsoredServiceUPT': None,
 'SponsoredServiceUPTComments': None,
 'Quantity': None,
 'LastModifiedDate': '2024-10-09T20:50:21.587'}

In [32]:
# query the stg rr20 rural table for only orgs in the error_orgs list
error_rr20_rural = (
    tbls.staging.stg_ntd_rr20_rural()
    >> filter(_.organization.isin(error_orgs),
              _.api_report_period == 2024)
    >> collect()
)

In [33]:
error_rr20_rural.head(3)

Unnamed: 0,organization,api_report_status,api_report_last_modified_date,api_report_period,id,report_id,item,revenue,type,css_class,operations_expended,capital_expended,description,annual_vehicle_rev_miles,annual_vehicle_rev_hours,annual_unlinked_pass_trips,annual_vehicle_max_service,sponsored_service_upt,quantity,last_modified_date
0,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14150,1026,Demand Response (DR) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
1,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14151,1026,Deviated Fixed Route (DF) - (DO),,Expenses by Mode,expense,,,,,,,,,,2024-09-26 17:48:25.270000+00:00
2,City of Arvin,Not Submitted,2024-09-25 18:30:48+00:00,2024,14152,1026,Demand Response (DR) - (DO),Passenger-Paid Fares,Fare Revenues,revenue,,,,,,,,,,2024-09-26 17:48:25.270000+00:00


In [34]:
# see names of checks in fct model
fct_ntd_rr20_service_checks["name_of_check"].unique()

array(['RR20F-005: Cost per Hour change',
       'RR20F-171: Vehicles of Maximum Service (VOMS) change',
       'RR20F-146: Miles per Vehicle change',
       'RR20F-179: Missing Service Data check',
       'RR20F-139: Revenue Speed change',
       'RR20F-139: Fare Revenue Per Trip change',
       'RR20F-154: Trips per Hour change',
       'RR20F-139: Vehicle Revenue Miles (VRM) % change',
       'RR20F-143: Vehicle Revenue Miles (VRM) change from zero'],
      dtype=object)

In [35]:
# see all modes in fct model
fct_ntd_rr20_service_checks["mode"].value_counts()

Commuter Bus (CB) - (DO)         9
Bus (MB) (Fixed Route) - (DO)    9
Demand Response (DR) - (DO)      9
Name: mode, dtype: int64

In [36]:
display(
    #int_rr20["vrm_this_year"],
    #int_rr20["vrm_last_year"]
)


In [37]:
# sum the "expense" and "operations" revenue/expense rows for the agency

#op_ex = stg_rr20[stg_rr20["css_class"]=="expense"]["operations_expended"].sum()
#op_rev = stg_rr20[stg_rr20["css_class"]=="revenue"]["operations_expended"].sum()

#int_op_ex = int_rr20["Total_Annual_Op_Expenses_by_Mode"][0]
#int_op_rev = int_rr20["Total_Annual_Op_Revenues_Expended"][0]

#print(f"""Org: {org}
#Does opX and opRev match? {op_ex == op_rev}
#Does warehouse opX and opRev match? {int_op_ex == int_op_rev}
#opx: {op_ex}
#oprev: {op_rev}
#Does opX and warehouse opX match? {op_ex == int_op_ex}
#Does opRev and warehouse opRev match? {op_rev == int_op_rev}
#warehouse opx: {int_op_ex}
#warehouse oprev: {int_op_rev}""")
#display(fct_rr20)

In [38]:
display(error_list, error_orgs)

[9, 14, 16, 36]

['City of Arvin',
 'Town of Truckee',
 'Colusa County Transit Agency',
 'Trinity County Department of Transportation ']

In [39]:
# query the fct model without the error orgs, then see what modes are in it.
just_modes = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(~_.organization.isin(error_orgs),
             )
    >>collect()
)

mode_list = list(just_modes["mode"].unique())

In [40]:
mode_list

['Demand Response (DR) - (DO)',
 'Deviated Fixed Route (DF) - (DO)',
 'Commuter Bus (CB) - (DO)',
 'Deviated Fixed Route (DF) - (PT)',
 'Demand Response (DR) - (PT)',
 'Bus (MB) (Fixed Route) - (DO)',
 'Bus (MB) (Fixed Route) - (PT)',
 'Intercity Service (IC) - (PT)',
 'Commuter Bus (CB) - (PT)',
 'Intercity Service (IC) - (DO)',
 'University Service (US) - (PT)',
 'Vanpool (VP) - (PT)']

In [41]:
# check the error orgs to see which specific check and mode is failing in the validation report.
# or see which checks are still being produced

arvin_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[9],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              _.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              _.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)

arvin_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[9])
    >> collect()
)

keep_cols=['organization',
           "mode",
           'tph_last_year',
           'tph_this_year', 
           'voms_last_year', 
           'voms_this_year', 
           'vrm_last_year',
           'vrm_this_year',  
]

display(
arvin_check_test,
# failed at: 
# RR20F-154: Trips per Hour change, 
    # mode: Demand Response (DR) - (DO) (-2.3/0)
    # mode: Deviated Fixed Route (DF) - (DO) (-10/0)
    
# RR20F-171: Vehicles of Maximum Service (VOMS) change
    # mode: Demand Response (DR) - (DO) (-1/0)
    # mode: Deviated Fixed Route (DF) - (DO) (-3/0)

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have


arvin_int_check[keep_cols]
       )

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0


In [42]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

truckee_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[14],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              _.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              _.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)
truckee_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[14])
    >> collect()
)

display(
    truckee_check_test,
# failed at:
# RR20F-139: Fare Revenue Per Trip change
    # mode: Bus (MB) (Fixed Route) - (PT) (0/0)


# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have
    truckee_int_check[keep_cols]
)

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
0,Town of Truckee,RR20F-139: Fare Revenue Per Trip change,Demand Response (DR) - (PT),2024,Fail,"2024 = 0.3, 2023 = 0.7chg = -133.3%",The fare revenues per unlinked passenger trip ...,,2024-10-10 20:07:30.566300+00:00


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
1,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0


In [43]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

colusa_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[16],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              _.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              #_.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              _.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              #_.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)


colusa_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[16])
    >> collect()
)

display(
colusa_check_test,
# failed at:
# RR20F-154: Trips per Hour change
    #mode: Demand Response (DR) - (DO) (-4.2/0)

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have
colusa_int_check[keep_cols]
)


Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0


In [44]:
# check the error orgs to see which specific check is failing in the validation report.
# or see which checks are still being produced

trinity_check_test = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(_.organization == org_list[36],
              #_.name_of_check == "RR20F-143: Vehicle Revenue Miles (VRM) change from zero",
              #_.name_of_check == "RR20F-139: Fare Revenue Per Trip change",
              #_.name_of_check == "RR20F-154: Trips per Hour change",
              #_.name_of_check == "RR20F-179: Missing Service Data check",
              #_.name_of_check == "RR20F-139: Vehicle Revenue Miles (VRM) % change",
              #_.name_of_check == "RR20F-139: Revenue Speed change",
              #_.name_of_check == "RR20F-005: Cost per Hour change",
              _.name_of_check == "RR20F-171: Vehicles of Maximum Service (VOMS) change",
              #_.name_of_check == "RR20F-146: Miles per Vehicle change",
              #_.mode == "Demand Response (DR) - (DO)",
              #_.mode == "Commuter Bus (CB) - (DO)",
              #_.mode == "Deviated Fixed Route (DF) - (DO)",
              #_.mode == "Demand Response (DR) - (PT)",
              #_.mode == "Deviated Fixed Route (DF) - (PT)",
              #_.mode == "Bus (MB) (Fixed Route) - (DO)",
              #_.mode == "Bus (MB) (Fixed Route) - (PT)",
              #_.mode == "Intercity Service (IC) - (PT)",
              #_.mode == "Commuter Bus (CB) - (PT)",
              #_.mode == "Intercity Service (IC) - (DO)",
              _.mode == "University Service (US) - (PT)",
              #_.mode == "Vanpool (VP) - (PT)",
             )
    
    >> collect()
)

trinity_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization == org_list[36])
    >> collect()
)

display(
    trinity_check_test,
# failed at:
# RR20F-171: Vehicles of Maximum Service (VOMS) change
    # mode: Intercity Service (IC) - (DO)

# need to check `int_ntd_rr20_service_3ratios_wide` to see what values they have
    trinity_int_check[keep_cols]
)

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked


Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,Trinity County Department of Transportation,Intercity Service (IC) - (DO),2.102253,2.021668,4.0,0.0,116650.0,128945.0


In [45]:
# What does a good check look like?
check_list = [
    "RR20F-154: Trips per Hour change",
    "RR20F-171: Vehicles of Maximum Service (VOMS) change",
    "RR20F-139: Fare Revenue Per Trip change",
]


fct_ntd_rr20_service_checks[fct_ntd_rr20_service_checks["name_of_check"].isin(check_list)]

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
3,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Commuter Bus (CB) - (DO),2024,Pass,"2024 = 5, 2023 = 5chg = 0%",,,2024-10-10 20:07:05.469982+00:00
4,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 24, 2023 = 24chg = 0%",,,2024-10-10 20:07:05.469982+00:00
5,Eastern Sierra Transit Authority,RR20F-171: Vehicles of Maximum Service (VOMS) ...,Demand Response (DR) - (DO),2024,Pass,"2024 = 8, 2023 = 8chg = 0%",,,2024-10-10 20:07:05.469982+00:00
15,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Commuter Bus (CB) - (DO),2024,Fail,"2024 = 73.3, 2023 = 245chg = -234.2%",The fare revenues per unlinked passenger trip ...,,2024-10-10 20:07:05.469982+00:00
16,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Bus (MB) (Fixed Route) - (DO),2024,Pass,"2024 = 1.3, 2023 = 2.5chg = -92.3%",,,2024-10-10 20:07:05.469982+00:00
17,Eastern Sierra Transit Authority,RR20F-139: Fare Revenue Per Trip change,Demand Response (DR) - (DO),2024,Fail,"2024 = 21, 2023 = 40.4chg = -92.4%",The fare revenues per unlinked passenger trip ...,,2024-10-10 20:07:05.469982+00:00
18,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Commuter Bus (CB) - (DO),2024,Fail,"2024 = 3.7, 2023 = 2.5chg = 32.4%",The calculated trips per hour for this mode ha...,,2024-10-10 20:07:05.469982+00:00
19,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Bus (MB) (Fixed Route) - (DO),2024,Fail,"2024 = 24.8, 2023 = 24.4chg = 1.6%",The calculated trips per hour for this mode ha...,,2024-10-10 20:07:05.469982+00:00
20,Eastern Sierra Transit Authority,RR20F-154: Trips per Hour change,Demand Response (DR) - (DO),2024,Fail,"2024 = 2.9, 2023 = 2.9chg = 0%",The calculated trips per hour for this mode ha...,,2024-10-10 20:07:05.469982+00:00


In [46]:
# check the int table for this validation report to inspect the tph, voms, and vrm values behind these checks for the error orgs

error_int_check = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(_.organization.isin(error_orgs))
    >> collect()
)

keep_cols=['organization',
           "mode",
           'tph_last_year',
           'tph_this_year', 
           'voms_last_year', 
           'voms_this_year', 
           'vrm_last_year',
           'vrm_this_year',  
]

error_int_check[keep_cols]

Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0
2,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0
3,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
4,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0
5,Trinity County Department of Transportation,Intercity Service (IC) - (DO),2.102253,2.021668,4.0,0.0,116650.0,128945.0


# Summary of 2024 Update 
Investigation of the `rr20_service_check` errors.

## orgs that were failing checks
- City of Arvin
    - RR20F-154: Trips per Hour change
        - Mode: Demand Response (DR) - (DO), error (-2.3/0)
        - Mode: Deviated Fixed Route (DF) - (DO), error (-10/0)
    - RR20F-171: Vehicles of Maximum Service (VOMS) change
        - Mode: Demand Response (DR) - (DO), error (-1/0)
        - Mode: Deviated Fixed Route (DF) - (DO), error (-3/0)
<br>
<br>
- Town of Truckee
    - RR20F-139: Fare Revenue Per Trip change
        - Mode: Bus (MB) (Fixed Route) - (PT), error (0/0)
<br>
<br>
- Colusa County Transit Agency
    - RR20F-154: Trips per Hour change
        - Mode: Demand Response (DR) - (DO), error (-4.2/0)
<br>
<br>
- Trinity County Department of Transportation
    - RR20F-171: Vehicles of Maximum Service (VOMS) change
        - Mode: Intercity Service (IC) - (DO), error (-4/0)


## for RR20F-154: Trips per Hour change: 
- The condition first checks whether `tph_this_year` or `tph_last_year` is `NULL` or `0`. If so, it returns `did not run`. Otherwise, it calculates the absolute value of `(tph_this_year - tph_last_year) / tph_last_year`.
- City of Arvin, Mode: Demand Response (DR) - (DO):
    - `tph_this_year = 0` and `tph_last_year = 2.3`. This should have returned `did not run` since `tph_this_year = 0`, but the error was (-2.3/0).
    - The error implies the calculation was (0-2.3)/0 instead of (0-2.3)/2.3.
- Colusa County Transit Agency, Mode: Demand Response (DR) - (DO):
    - `tph_this_year = 0` and `tph_last_year = 4.2`. This also should have returned `did not run`, but the error was (-4.2/0).
    - The error implies the calculation was (0-4.2)/0 instead of (0-4.2)/4.2.
<br>
<br>
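
The Trips per Hour condition described above can be sketched in Python to show what the two failing rows would be expected to return. This is an interpretation of the prose, not the model's SQL; `is_null_or_zero` and `tph_relative_change` are hypothetical names, and `None` stands in for `did not run`.

```python
import math


def is_null_or_zero(value):
    """True for None, NaN, or 0, mirroring the 'NULL or 0' condition."""
    if value is None:
        return True
    return math.isnan(value) or value == 0


def tph_relative_change(tph_this_year, tph_last_year):
    """Return abs((this - last) / last), or None when the check should not run."""
    if is_null_or_zero(tph_this_year) or is_null_or_zero(tph_last_year):
        return None  # "did not run"
    return abs((tph_this_year - tph_last_year) / tph_last_year)


# City of Arvin, Demand Response (DR) - (DO): tph_this_year is 0.
print(tph_relative_change(0.0, 2.289235))  # None, i.e. "did not run"

# Town of Truckee, Bus (MB) (Fixed Route) - (PT): both values are nonzero.
print(tph_relative_change(5.344347, 6.302895))
```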

## for RR20F-171: Vehicles of Maximum Service (VOMS) change:
- The condition first checks whether (`voms_this_year = 0` and `voms_last_year != 0`) or (`voms_this_year != 0` and `voms_last_year` is not `NULL` and `= 0`); if so, it returns "fail". It then checks whether `voms_this_year` is `NULL` or `0` and, if so, returns "did not run". It then checks whether `voms_last_year` is `NULL` or `0` and, if so, returns "did not run". In all other cases it returns "pass".
- City of Arvin, Mode: Demand Response (DR) - (DO):
    - `voms_this_year = 0` and `voms_last_year = 1.0`. This should have returned `Did Not Run`, but instead produced an error for (-1/0).
    - It is unclear how it got that result.
- Trinity County Department of Transportation, Mode: Intercity Service (IC) - (DO):
    - `voms_this_year = 0` and `voms_last_year = 4.0`. This should have returned `Did Not Run`, but some calculation produced (-4/0).
<br>
<br>
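
Read literally, the VOMS branch order described above can be sketched as follows. This is an interpretation of the prose, not the model's SQL; `voms_check` and its helpers are hypothetical names, and `None` stands in for `NULL`.

```python
def eq_zero(value):
    """SQL-style `value = 0`: false when the value is NULL (None)."""
    return value is not None and value == 0


def ne_zero(value):
    """SQL-style `value != 0`: false when the value is NULL (None)."""
    return value is not None and value != 0


def voms_check(voms_this_year, voms_last_year):
    """Apply the branches in the order the description lists them."""
    # Branch 1: one year is zero while the other is nonzero.
    if (eq_zero(voms_this_year) and ne_zero(voms_last_year)) or (
        ne_zero(voms_this_year) and eq_zero(voms_last_year)
    ):
        return "Fail"
    # Branch 2: this year is NULL or 0.
    if voms_this_year is None or eq_zero(voms_this_year):
        return "Did Not Run"
    # Branch 3: last year is NULL or 0.
    if voms_last_year is None or eq_zero(voms_last_year):
        return "Did Not Run"
    return "Pass"


# City of Arvin, Demand Response (DR) - (DO)
print(voms_check(0.0, 1.0))
# Town of Truckee, Bus (MB) (Fixed Route) - (PT)
print(voms_check(3.0, 3.0))
```

Under this reading, `voms_this_year = 0` with a nonzero `voms_last_year` matches the first branch and returns "Fail" before either "Did Not Run" branch is reached, so it may be worth confirming the branch order against the model's SQL.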

## for RR20F-139: Fare Revenue Per Trip change:
- The condition checks whether `vrm_this_year` or `vrm_last_year` is `NULL` or `0`; if so, it returns "did not run". Otherwise, it calculates the absolute value of (`vrm_this_year` - `vrm_last_year`) / `vrm_last_year`. If the result is >= 0.3 the check fails; otherwise it passes.
- Town of Truckee, Mode: Bus (MB) (Fixed Route) - (PT):
    - `vrm_this_year = 85100.0`, `vrm_last_year = 109155.0`. This should have calculated `abs(85100.0 - 109155.0) / 109155.0 = 0.22, pass`, but the error instead shows a calculation of (0/0).
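
The expected result for the Town of Truckee row can be worked through with a small sketch of the condition as described. `vrm_change_check` is a hypothetical function written from the prose above, with the 0.3 threshold taken from that description; it is not the model's SQL.

```python
import math


def vrm_change_check(vrm_this_year, vrm_last_year, threshold=0.3):
    """Return (status, relative_change) following the described condition."""
    for value in (vrm_this_year, vrm_last_year):
        # NULL (None), NaN, or 0 means the check cannot be computed.
        if value is None or math.isnan(value) or value == 0:
            return "Did Not Run", None
    change = abs((vrm_this_year - vrm_last_year) / vrm_last_year)
    status = "Fail" if change >= threshold else "Pass"
    return status, change


# Town of Truckee, Bus (MB) (Fixed Route) - (PT)
status, change = vrm_change_check(85100.0, 109155.0)
print(status, round(change, 2))  # Pass 0.22
```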
    
## conclusion
- Something about these values is not being recognized by the conditions.
- It is unclear how to resolve these errors.
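
One way to narrow this down is to list which `*_this_year` and `*_last_year` values are zero or missing for the affected organizations. The sketch below does that in plain Python on the six rows from the `int_ntd_rr20_service_3ratios_wide` output above; `zero_or_missing_columns` is a hypothetical helper.

```python
import math

COLUMNS = [
    "tph_last_year", "tph_this_year",
    "voms_last_year", "voms_this_year",
    "vrm_last_year", "vrm_this_year",
]

# Values copied from the int model output for the four error orgs.
ROWS = [
    ("City of Arvin", "Demand Response (DR) - (DO)", [2.289235, 0.0, 1.0, 0.0, 12495.0, 13371.0]),
    ("City of Arvin", "Deviated Fixed Route (DF) - (DO)", [10.02455, 0.0, 3.0, 0.0, 120801.0, 116550.0]),
    ("Colusa County Transit Agency", "Demand Response (DR) - (DO)", [4.17158, 0.0, 7.0, 7.0, 141531.0, 149880.0]),
    ("Town of Truckee", "Bus (MB) (Fixed Route) - (PT)", [6.302895, 5.344347, 3.0, 3.0, 109155.0, 85100.0]),
    ("Town of Truckee", "Demand Response (DR) - (PT)", [2.506347, 4.224924, 3.0, 3.0, 24373.0, 22140.0]),
    ("Trinity County Department of Transportation", "Intercity Service (IC) - (DO)", [2.102253, 2.021668, 4.0, 0.0, 116650.0, 128945.0]),
]


def zero_or_missing_columns(values):
    """Return the names of columns whose value is None, NaN, or 0."""
    flagged = []
    for name, value in zip(COLUMNS, values):
        if value is None or math.isnan(value) or value == 0:
            flagged.append(name)
    return flagged


flagged = {(org, mode): zero_or_missing_columns(vals) for org, mode, vals in ROWS}
for key, cols in flagged.items():
    print(key, cols)
```

On these rows, every flagged value is a zero in a `*_this_year` column, and neither Town of Truckee row is flagged.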


In [47]:
error_int_check[keep_cols]

Unnamed: 0,organization,mode,tph_last_year,tph_this_year,voms_last_year,voms_this_year,vrm_last_year,vrm_this_year
0,City of Arvin,Demand Response (DR) - (DO),2.289235,0.0,1.0,0.0,12495.0,13371.0
1,City of Arvin,Deviated Fixed Route (DF) - (DO),10.02455,0.0,3.0,0.0,120801.0,116550.0
2,Colusa County Transit Agency,Demand Response (DR) - (DO),4.17158,0.0,7.0,7.0,141531.0,149880.0
3,Town of Truckee,Bus (MB) (Fixed Route) - (PT),6.302895,5.344347,3.0,3.0,109155.0,85100.0
4,Town of Truckee,Demand Response (DR) - (PT),2.506347,4.224924,3.0,3.0,24373.0,22140.0
5,Trinity County Department of Transportation,Intercity Service (IC) - (DO),2.102253,2.021668,4.0,0.0,116650.0,128945.0


In [73]:
all_int_service_ratio = (
    tbls.staging.int_ntd_rr20_service_3ratios_wide()
    >> filter(~_.organization.isin(error_orgs))
    >> collect()
)
all_int_service_ratio.shape


(103, 20)

In [74]:
all_fct_service_checks = (
    tbls.mart_ntd_validation.fct_ntd_rr20_service_checks()
    >> filter(~_.organization.isin(error_orgs))
    >> collect()
)
all_fct_service_checks.shape

(927, 9)

In [75]:
# list the columns available in the int model
all_int_service_ratio.columns

Index(['organization', 'mode', 'cph_last_year', 'cph_this_year',
       'mpv_last_year', 'mpv_this_year', 'frpt_last_year', 'frpt_this_year',
       'rev_speed_last_year', 'rev_speed_this_year', 'tph_last_year',
       'tph_this_year', 'voms_last_year', 'voms_this_year', 'vrm_last_year',
       'vrm_this_year', 'vrh_last_year', 'vrh_this_year', 'upt_last_year',
       'upt_this_year'],
      dtype='object')

In [76]:
# are there any *_last_year = 0?
display(
    all_int_service_ratio["tph_last_year"].unique(),
    all_int_service_ratio["voms_last_year"].unique(),
    all_int_service_ratio["vrm_last_year"].unique()
)


array([ 0.63106796,  2.92359094,  2.68361064,  1.15371809,  2.23591062,
        1.21193459, 11.12991363,  5.01879085,  3.12676887,  5.90524968,
        6.21560976,         nan,  6.82714711,  3.99604654,  1.03762663,
        5.5       ,  2.18013468,  7.6052719 ,  3.13515982,  5.57520325,
       11.18952734,  2.48910006,  1.76397516,  1.40467836,  0.9190372 ,
        9.10097888,  4.35818935,  1.03617571,  4.49965193,  1.62750716,
        3.78316498,  1.97675439,  2.16313887,  6.46408312,  1.04297994,
        2.44625881,  2.84239905,  2.7325115 ,  2.53458755, 24.36836341,
        2.9161262 , 11.28001825,  2.02106649,  2.98359373,  2.50954617,
        3.23326572,  4.32623318,  9.29097458,  3.72113833,  1.23403755,
        3.61550117,  0.94996374,  5.4249697 ,  0.        ,  1.57010112,
        2.4102028 ,  2.37218205,  3.89466522,  1.96683938,  0.96795791,
        2.21466769,  8.27583627,  4.48230668,  5.08681744,  2.0626229 ,
        1.68984615,  2.19852192, 20.60853104,  4.79492868,  4.74

array([ 1.,  5., 12.,  6.,  2.,  4., nan, 10.,  3.,  8.,  7., 24., 51.,
       14.,  9., 26., 21.,  0., 19., 18.])

array([9.61900e+03, 5.58580e+04, 1.56417e+05, 4.49330e+04, 3.23277e+05,
       6.55630e+04, 7.62420e+04, 1.78780e+04, 9.21710e+04, 2.51010e+04,
       1.95540e+04,         nan, 3.19780e+04, 7.36720e+04, 1.55650e+04,
       1.24400e+03, 1.19000e+04, 1.37265e+05, 6.51800e+03, 5.07570e+04,
       5.27320e+04, 4.51270e+04, 1.81550e+04, 7.62200e+03, 3.71470e+04,
       3.90560e+04, 1.70382e+05, 1.48670e+04, 5.58720e+04, 1.33870e+04,
       2.64630e+04, 1.34237e+05, 1.19230e+05, 3.20617e+05, 6.59590e+04,
       2.58550e+05, 9.36650e+04, 3.04153e+05, 1.15303e+05, 5.85952e+05,
       1.56732e+05, 1.23629e+05, 1.27474e+05, 3.60930e+05, 3.14382e+05,
       3.03500e+03, 1.36849e+05, 6.18300e+05, 2.27354e+05, 1.84440e+05,
       1.66742e+06, 9.32400e+04, 7.54227e+05, 1.37360e+04, 6.89470e+04,
       1.21943e+05, 4.34890e+04, 2.84458e+05, 7.28920e+04, 5.59702e+05,
       3.21840e+04, 7.59690e+04, 3.04320e+04, 2.22977e+05, 8.19920e+04,
       2.53400e+05, 7.41090e+04, 9.93090e+04, 8.24200e+04, 5.592

In [112]:
# check all the "value_checked" values. 
# which rows have "None"?

short_col_list = ["organization", "mode", "tph_this_year", "tph_last_year"]
lassen = "Lassen Transit Service Agency"

display(
    #all_fct_service_checks[(all_fct_service_checks["value_checked"].isna()) & (all_fct_service_checks["name_of_check"].str.contains("RR20F-154:"))].head(1),
    all_fct_service_checks[
        all_fct_service_checks["organization"].str.contains(lassen)
        & all_fct_service_checks["name_of_check"].str.contains("RR20F-154:")
    ],
    all_service_ratio[all_service_ratio["organization"].str.contains(lassen)][short_col_list].head(),
    all_service_ratio[all_service_ratio["tph_last_year"] == 0][short_col_list],
)

Unnamed: 0,organization,name_of_check,mode,year_of_data,check_status,value_checked,description,Agency_Response,date_checked
579,Lassen Transit Service Agency,RR20F-154: Trips per Hour change,Demand Response (DR) - (PT),2024,Did Not Run,,No data but this check was run before the NTD submission due date in 2024.,,2024-10-10 20:23:45.071370+00:00
580,Lassen Transit Service Agency,RR20F-154: Trips per Hour change,University Service (US) - (PT),2024,Did Not Run,,No data but this check was run before the NTD submission due date in 2024.,,2024-10-10 20:23:45.071370+00:00
581,Lassen Transit Service Agency,RR20F-154: Trips per Hour change,Bus (MB) (Fixed Route) - (PT),2024,Did Not Run,,No data but this check was run before the NTD submission due date in 2024.,,2024-10-10 20:23:45.071370+00:00
582,Lassen Transit Service Agency,RR20F-154: Trips per Hour change,Commuter Bus (CB) - (PT),2024,Did Not Run,,No data but this check was run before the NTD submission due date in 2024.,,2024-10-10 20:23:45.071370+00:00


Unnamed: 0,organization,mode,tph_this_year,tph_last_year
64,Lassen Transit Service Agency,Demand Response (DR) - (PT),,0.0
65,Lassen Transit Service Agency,University Service (US) - (PT),,
66,Lassen Transit Service Agency,Bus (MB) (Fixed Route) - (PT),,0.0
67,Lassen Transit Service Agency,Commuter Bus (CB) - (PT),,0.0


Unnamed: 0,organization,mode,tph_this_year,tph_last_year
64,Lassen Transit Service Agency,Demand Response (DR) - (PT),,0.0
66,Lassen Transit Service Agency,Bus (MB) (Fixed Route) - (PT),,0.0
67,Lassen Transit Service Agency,Commuter Bus (CB) - (PT),,0.0
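
Only three `*_last_year` columns are inspected above. A minimal sketch for scanning every `*_last_year` column at once might look like the following. The small `ratios` frame is made-up illustration data (not API output), and `zero_last_year_rows` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up illustration data shaped like the service ratio table.
ratios = pd.DataFrame(
    {
        "organization": ["Agency A", "Agency A", "Agency B"],
        "mode": ["Demand Response", "Bus", "Bus"],
        "tph_last_year": [0.0, 2.5, np.nan],
        "tph_this_year": [np.nan, 2.7, 3.1],
        "voms_last_year": [1.0, 3.0, 4.0],
        "voms_this_year": [1.0, 3.0, 4.0],
    }
)


def zero_last_year_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where any column ending in '_last_year' equals 0."""
    last_year = df.filter(regex=r"_last_year$")
    # NaN compares unequal to 0, so missing values are not flagged here.
    return df[last_year.eq(0).any(axis=1)]


flagged = zero_last_year_rows(ratios)
print(flagged[["organization", "mode", "tph_last_year"]])
```

Against the notebook's data this would be called as `zero_last_year_rows(all_service_ratio)`. Missing values need a separate `isna()` check.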


#### Extract the org name and details from the blob
This extra code shows how to inspect which organizations are in the API at any given time, in case it is helpful.

In [48]:
for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod_string = x.get('ReportLastModifiedDate')
    last_mod = pendulum.from_format(last_mod_string, 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    print(f"Report details: ID {report_id}, org {org}, report period {period}, status {status}, last modified on {last_mod_string}.")
#     print(f"New datetime {last_mod}")
#     print(f"iso is {iso}")

Report details: ID 993, org Tahoe Transportation District, report period 2024, status Not Submitted, last modified on 1/23/2024 4:34:34 PM.
Report details: ID 1011, org Morongo Basin Transit Authority, report period 2024, status Submitted, last modified on 9/16/2024 2:05:42 PM.
Report details: ID 1012, org Modoc Transportation Agency, report period 2024, status Not Submitted, last modified on 9/17/2024 7:51:02 PM.
Report details: ID 1016, org Palo Verde Valley Transit Agency, report period 2024, status Not Submitted, last modified on 9/19/2024 4:38:19 PM.
Report details: ID 1017, org City of Rio Vista, report period 2024, status Not Submitted, last modified on 9/19/2024 4:48:03 PM.
Report details: ID 1021, org City of Needles, report period 2024, status Submitted, last modified on 9/25/2024 12:19:54 PM.
Report details: ID 1023, org Santa Cruz Metropolitan Transit District, report period 2024, status Not Submitted, last modified on 9/25/2024 3:35:43 PM.
Report details: ID 1024, org Butt

#### Quick check on what orgs are in this API, and how many have RR-20 info

In [49]:
org_data = []

for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod = pendulum.from_format(x.get('ReportLastModifiedDate'), 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    
    
    # each RR-20 table is a dict whose only key, 'Data', holds a list of rows
    rural_n = len(x['NTDReportingRR20_Rural']['Data'])
    city_n = len(x['NTDReportingRR20_Intercity']['Data'])
    urban_n = len(x['NTDReportingRR20_Urban_Tribal']['Data'])
    
    org_info = pd.DataFrame(data=[[report_id, org, period, status, iso, rural_n, city_n, urban_n]], 
                            columns=['report_id', 'organization', 'report_period', 'report_status', 'last_modified', 
                                     'rr20_rural_rows', 'rr20_intercity_rows', 'rr20_urban-tribal_rows'])
#     whole_df = pd.concat([org_info, raw_df], axis=1).sort_values(by='organization')
    
    org_data.append(org_info)


In [50]:
newapi = pd.concat(org_data)
print(len(newapi))
newapi.head()

70


Unnamed: 0,report_id,organization,report_period,report_status,last_modified,rr20_rural_rows,rr20_intercity_rows,rr20_urban-tribal_rows
0,993,Tahoe Transportation District,2024,Not Submitted,2024-01-23T08:34:34-08:00,0,0,0
0,1011,Morongo Basin Transit Authority,2024,Submitted,2024-09-16T07:05:42-07:00,59,0,0
0,1012,Modoc Transportation Agency,2024,Not Submitted,2024-09-17T12:51:02-07:00,51,0,0
0,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,2024-09-19T09:38:19-07:00,55,0,0
0,1017,City of Rio Vista,2024,Not Submitted,2024-09-19T09:48:03-07:00,0,0,0
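
Because each organization gets its own one-row dataframe before concatenation, every row in `newapi` ends up with index 0. A possible alternative, sketched here, collects plain dicts and builds the dataframe once. It assumes each RR-20 table is a dict holding one `Data` list. The `sample_blob` records and the `summarize_rr20` name are hypothetical.

```python
import pandas as pd

# Hypothetical records mimicking the shape of the API response.
sample_blob = [
    {
        "ReportId": 1,
        "Organization": "Agency A",
        "ReportPeriod": "2024",
        "ReportStatus": "Submitted",
        "NTDReportingRR20_Rural": {"Data": [{"Id": 10}, {"Id": 11}]},
        "NTDReportingRR20_Intercity": {"Data": []},
        "NTDReportingRR20_Urban_Tribal": {"Data": []},
    },
    {
        "ReportId": 2,
        "Organization": "Agency B",
        "ReportPeriod": "2024",
        "ReportStatus": "Not Submitted",
        "NTDReportingRR20_Rural": {"Data": []},
        "NTDReportingRR20_Intercity": {"Data": []},
        "NTDReportingRR20_Urban_Tribal": {"Data": []},
    },
]

# Output column name -> table name in the API response.
RR20_TABLES = {
    "rr20_rural_rows": "NTDReportingRR20_Rural",
    "rr20_intercity_rows": "NTDReportingRR20_Intercity",
    "rr20_urban-tribal_rows": "NTDReportingRR20_Urban_Tribal",
}


def summarize_rr20(blob):
    """Build one summary row per report with RR-20 row counts."""
    rows = []
    for report in blob:
        row = {
            "report_id": report.get("ReportId"),
            "organization": report.get("Organization"),
            "report_period": report.get("ReportPeriod"),
            "report_status": report.get("ReportStatus"),
        }
        for column, table in RR20_TABLES.items():
            # A missing table or missing "Data" key counts as zero rows.
            row[column] = len((report.get(table) or {}).get("Data") or [])
        rows.append(row)
    return pd.DataFrame(rows)


summary = summarize_rr20(sample_blob)
print(summary)
```

Building the dataframe from a list of dicts gives a normal 0..n-1 index, so no `reset_index` is needed afterwards.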


In [51]:
newapi.to_csv('../data/newapi_rr20_11-27-23.csv')

## Convert API data to dataframes
Here we use the test API to develop a function.

Load the entire blob into a single dataframe. This is the approach recommended by Cal-ITP, who prefer that any transformations and separating of tables then happen in dbt.  
Downsides:
* Many columns hold nested data (converted to lists and dictionaries). In effect, each NTD report is in ONE column.
* Column names change because of the nesting and because of repeated column names.
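
One way to work around the nesting inside pandas, shown only as a sketch, is the `record_path` and `meta` arguments of `pd.json_normalize`. They expand one nested table into rows while carrying report-level fields along. The `sample_blob` data and its `Total` field are made up for illustration.

```python
import pandas as pd

# Made-up data shaped like the API response (one nested table only).
sample_blob = [
    {
        "ReportId": 1,
        "Organization": "Agency A",
        "NTDReportingP50": {
            "Data": [
                {"Id": 0, "ReportId": 1, "Total": 5},
                {"Id": 0, "ReportId": 1, "Total": 7},
            ]
        },
    },
    {
        "ReportId": 2,
        "Organization": "Agency B",
        "NTDReportingP50": {"Data": []},
    },
]

p50 = pd.json_normalize(
    sample_blob,
    record_path=["NTDReportingP50", "Data"],  # path to the list of rows
    meta=["ReportId", "Organization"],        # report-level fields to repeat
    meta_prefix="report_",                    # avoids clashing with the row-level ReportId
)
print(p50)
```

Reports with an empty `Data` list simply contribute no rows.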

In [52]:
df = pd.json_normalize(blob_2024)
df

Unnamed: 0,ReportId,Organization,ReportPeriod,ReportStatus,ReportLastModifiedDate,NTDReportingStationsAndMaintenance.Data,NTDTransitAssetManagementA15.Data,NTDAssetAndResourceInfo.Data,NTDReportingP10.Data,NTDReportingP20.Data,NTDReportingP50.Data,NTDReportingA35.Data,NTDReportingRR20_Intercity.Data,NTDReportingRR20_Rural.Data,NTDReportingRR20_Urban_Tribal.Data,NTDReportingTAMNarrative.Data,SS60.Data
0,993,Tahoe Transportation District,2024,Not Submitted,1/23/2024 4:34:34 PM,[],[],[],[],[],[],[],[],[],[],[],[]
1,1011,Morongo Basin Transit Authority,2024,Submitted,9/16/2024 2:05:42 PM,"[{'Id': 492, 'ReportId': 1011, 'ServiceMode': ...","[{'Id': 420, 'FacilityId': 33, 'ReportId': 101...","[{'Id': 24088, 'VehicleId': 12705, 'ReportId':...","[{'Id': 259, 'ReportId': 1011, 'OrgId': 3738, ...","[{'Id': 0, 'ReportId': 1011, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1011, 'Mode': {'id': 0,...","[{'Id': 303, 'ReportId': 1011, 'EquipmentName'...",[],"[{'Id': 16273, 'ReportId': 1011, 'Item': 'Comm...",[],[],"[{'Id': 4843, 'ItemId': 1, 'ReportId': 1011, '..."
2,1012,Modoc Transportation Agency,2024,Not Submitted,9/17/2024 7:51:02 PM,"[{'Id': 465, 'ReportId': 1012, 'ServiceMode': ...",[],"[{'Id': 24524, 'VehicleId': 12459, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1012, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1012, 'Mode': {'id': 0,...",[],[],"[{'Id': 16767, 'ReportId': 1012, 'Item': 'Bus ...",[],"[{'Id': 601, 'ReportId': 1012, 'Type': 'Revenu...",[]
3,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,9/19/2024 4:38:19 PM,"[{'Id': 411, 'ReportId': 1016, 'ServiceMode': ...",[],"[{'Id': 24053, 'VehicleId': 15882, 'ReportId':...","[{'Id': 253, 'ReportId': 1016, 'OrgId': 4156, ...","[{'Id': 0, 'ReportId': 1016, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1016, 'Mode': {'id': 0,...","[{'Id': 288, 'ReportId': 1016, 'EquipmentName'...",[],"[{'Id': 15932, 'ReportId': 1016, 'Item': 'Comm...",[],[],"[{'Id': 4747, 'ItemId': 1, 'ReportId': 1016, '..."
4,1017,City of Rio Vista,2024,Not Submitted,9/19/2024 4:48:03 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1017, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1017, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 605, 'ReportId': 1017, 'Type': 'Revenu...",[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,1105,Humboldt Transit Authority,2024,Not Submitted,10/9/2024 6:53:36 PM,"[{'Id': 499, 'ReportId': 1105, 'ServiceMode': ...","[{'Id': 433, 'FacilityId': 25, 'ReportId': 110...","[{'Id': 24474, 'VehicleId': 12434, 'ReportId':...","[{'Id': 263, 'ReportId': 1105, 'OrgId': 3663, ...","[{'Id': 0, 'ReportId': 1105, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1105, 'Mode': {'id': 0,...","[{'Id': 316, 'ReportId': 1105, 'EquipmentName'...",[],"[{'Id': 16473, 'ReportId': 1105, 'Item': 'Bus ...",[],[],"[{'Id': 4891, 'ItemId': 1, 'ReportId': 1105, '..."
66,1106,City of California City,2024,Not Submitted,10/9/2024 8:54:41 PM,"[{'Id': 501, 'ReportId': 1106, 'ServiceMode': ...","[{'Id': 435, 'FacilityId': 62, 'ReportId': 110...","[{'Id': 24393, 'VehicleId': 13829, 'ReportId':...","[{'Id': 266, 'ReportId': 1106, 'OrgId': 3743, ...","[{'Id': 0, 'ReportId': 1106, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1106, 'Mode': {'id': 0,...",[],[],"[{'Id': 16571, 'ReportId': 1106, 'Item': 'Dema...",[],[],"[{'Id': 4986, 'ItemId': 1, 'ReportId': 1106, '..."
67,1107,City of Ridgecrest,2024,Not Submitted,10/10/2024 1:06:40 PM,"[{'Id': 502, 'ReportId': 1107, 'ServiceMode': ...","[{'Id': 438, 'FacilityId': 90, 'ReportId': 110...","[{'Id': 24505, 'VehicleId': 14712, 'ReportId':...","[{'Id': 267, 'ReportId': 1107, 'OrgId': 3744, ...","[{'Id': 0, 'ReportId': 1107, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1107, 'Mode': {'id': 0,...","[{'Id': 327, 'ReportId': 1107, 'EquipmentName'...",[],"[{'Id': 16716, 'ReportId': 1107, 'Item': 'Dema...",[],[],"[{'Id': 4938, 'ItemId': 1, 'ReportId': 1107, '..."
68,1108,Eureka Transit Service,2024,Not Submitted,10/10/2024 1:18:38 PM,"[{'Id': 504, 'ReportId': 1108, 'ServiceMode': ...",[],"[{'Id': 24514, 'VehicleId': 12427, 'ReportId':...","[{'Id': 265, 'ReportId': 1108, 'OrgId': 3662, ...","[{'Id': 0, 'ReportId': 1108, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1108, 'Mode': {'id': 0,...",[],[],"[{'Id': 16669, 'ReportId': 1108, 'Item': 'Bus ...",[],[],"[{'Id': 4962, 'ItemId': 1, 'ReportId': 1108, '..."


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 17 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   ReportId                                 70 non-null     int64 
 1   Organization                             70 non-null     object
 2   ReportPeriod                             70 non-null     object
 3   ReportStatus                             70 non-null     object
 4   ReportLastModifiedDate                   70 non-null     object
 5   NTDReportingStationsAndMaintenance.Data  70 non-null     object
 6   NTDTransitAssetManagementA15.Data        70 non-null     object
 7   NTDAssetAndResourceInfo.Data             70 non-null     object
 8   NTDReportingP10.Data                     70 non-null     object
 9   NTDReportingP20.Data                     70 non-null     object
 10  NTDReportingP50.Data                     70 non-null     object


In [54]:
df['ReportLastModifiedDate'] = df['ReportLastModifiedDate'].astype('datetime64[ns]')
# df['ReportLastModifiedDate'] = pd.to_datetime(df['ReportLastModifiedDate'], format='%m/%d/%Y %I:%M:%S %p')
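
An explicit format avoids relying on pandas' date inference. A minimal sketch, using strftime-style codes and two timestamps copied from the output above. Treating the values as UTC before converting to Pacific time is an assumption that mirrors what the earlier `pendulum` cell does by default.

```python
import pandas as pd

raw = pd.Series(["1/23/2024 4:34:34 PM", "10/10/2024 1:18:38 PM"])

# %I is the 12-hour clock hour and %p is the AM/PM marker.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# Assumption: the API reports times in UTC.
pacific = parsed.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")
print(pacific)
```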

In [55]:
df

Unnamed: 0,ReportId,Organization,ReportPeriod,ReportStatus,ReportLastModifiedDate,NTDReportingStationsAndMaintenance.Data,NTDTransitAssetManagementA15.Data,NTDAssetAndResourceInfo.Data,NTDReportingP10.Data,NTDReportingP20.Data,NTDReportingP50.Data,NTDReportingA35.Data,NTDReportingRR20_Intercity.Data,NTDReportingRR20_Rural.Data,NTDReportingRR20_Urban_Tribal.Data,NTDReportingTAMNarrative.Data,SS60.Data
0,993,Tahoe Transportation District,2024,Not Submitted,2024-01-23 16:34:34,[],[],[],[],[],[],[],[],[],[],[],[]
1,1011,Morongo Basin Transit Authority,2024,Submitted,2024-09-16 14:05:42,"[{'Id': 492, 'ReportId': 1011, 'ServiceMode': ...","[{'Id': 420, 'FacilityId': 33, 'ReportId': 101...","[{'Id': 24088, 'VehicleId': 12705, 'ReportId':...","[{'Id': 259, 'ReportId': 1011, 'OrgId': 3738, ...","[{'Id': 0, 'ReportId': 1011, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1011, 'Mode': {'id': 0,...","[{'Id': 303, 'ReportId': 1011, 'EquipmentName'...",[],"[{'Id': 16273, 'ReportId': 1011, 'Item': 'Comm...",[],[],"[{'Id': 4843, 'ItemId': 1, 'ReportId': 1011, '..."
2,1012,Modoc Transportation Agency,2024,Not Submitted,2024-09-17 19:51:02,"[{'Id': 465, 'ReportId': 1012, 'ServiceMode': ...",[],"[{'Id': 24524, 'VehicleId': 12459, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1012, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1012, 'Mode': {'id': 0,...",[],[],"[{'Id': 16767, 'ReportId': 1012, 'Item': 'Bus ...",[],"[{'Id': 601, 'ReportId': 1012, 'Type': 'Revenu...",[]
3,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,2024-09-19 16:38:19,"[{'Id': 411, 'ReportId': 1016, 'ServiceMode': ...",[],"[{'Id': 24053, 'VehicleId': 15882, 'ReportId':...","[{'Id': 253, 'ReportId': 1016, 'OrgId': 4156, ...","[{'Id': 0, 'ReportId': 1016, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1016, 'Mode': {'id': 0,...","[{'Id': 288, 'ReportId': 1016, 'EquipmentName'...",[],"[{'Id': 15932, 'ReportId': 1016, 'Item': 'Comm...",[],[],"[{'Id': 4747, 'ItemId': 1, 'ReportId': 1016, '..."
4,1017,City of Rio Vista,2024,Not Submitted,2024-09-19 16:48:03,[],[],[],[],"[{'Id': 0, 'ReportId': 1017, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1017, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 605, 'ReportId': 1017, 'Type': 'Revenu...",[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,1105,Humboldt Transit Authority,2024,Not Submitted,2024-10-09 18:53:36,"[{'Id': 499, 'ReportId': 1105, 'ServiceMode': ...","[{'Id': 433, 'FacilityId': 25, 'ReportId': 110...","[{'Id': 24474, 'VehicleId': 12434, 'ReportId':...","[{'Id': 263, 'ReportId': 1105, 'OrgId': 3663, ...","[{'Id': 0, 'ReportId': 1105, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1105, 'Mode': {'id': 0,...","[{'Id': 316, 'ReportId': 1105, 'EquipmentName'...",[],"[{'Id': 16473, 'ReportId': 1105, 'Item': 'Bus ...",[],[],"[{'Id': 4891, 'ItemId': 1, 'ReportId': 1105, '..."
66,1106,City of California City,2024,Not Submitted,2024-10-09 20:54:41,"[{'Id': 501, 'ReportId': 1106, 'ServiceMode': ...","[{'Id': 435, 'FacilityId': 62, 'ReportId': 110...","[{'Id': 24393, 'VehicleId': 13829, 'ReportId':...","[{'Id': 266, 'ReportId': 1106, 'OrgId': 3743, ...","[{'Id': 0, 'ReportId': 1106, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1106, 'Mode': {'id': 0,...",[],[],"[{'Id': 16571, 'ReportId': 1106, 'Item': 'Dema...",[],[],"[{'Id': 4986, 'ItemId': 1, 'ReportId': 1106, '..."
67,1107,City of Ridgecrest,2024,Not Submitted,2024-10-10 13:06:40,"[{'Id': 502, 'ReportId': 1107, 'ServiceMode': ...","[{'Id': 438, 'FacilityId': 90, 'ReportId': 110...","[{'Id': 24505, 'VehicleId': 14712, 'ReportId':...","[{'Id': 267, 'ReportId': 1107, 'OrgId': 3744, ...","[{'Id': 0, 'ReportId': 1107, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1107, 'Mode': {'id': 0,...","[{'Id': 327, 'ReportId': 1107, 'EquipmentName'...",[],"[{'Id': 16716, 'ReportId': 1107, 'Item': 'Dema...",[],[],"[{'Id': 4938, 'ItemId': 1, 'ReportId': 1107, '..."
68,1108,Eureka Transit Service,2024,Not Submitted,2024-10-10 13:18:38,"[{'Id': 504, 'ReportId': 1108, 'ServiceMode': ...",[],"[{'Id': 24514, 'VehicleId': 12427, 'ReportId':...","[{'Id': 265, 'ReportId': 1108, 'OrgId': 3662, ...","[{'Id': 0, 'ReportId': 1108, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1108, 'Mode': {'id': 0,...",[],[],"[{'Id': 16669, 'ReportId': 1108, 'Item': 'Bus ...",[],[],"[{'Id': 4962, 'ItemId': 1, 'ReportId': 1108, '..."


In [56]:
# blob_2024[0] (Tahoe Transportation District) has no data, so use the next report
user_dict = blob_2024[1]['NTDReportingP50']['Data']
user_dict # a list of dictionaries. Each dict is one row of data.

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

However in several tables, rows have several columns that are nested dictionaries.  
  
The following code explores ways to unnest them and expand the dataframe rows. **NOTE WE DID NOT USE THIS APPROACH IN PRODUCTION. We decided to unnest tables using SQL instead, in the `staging` dbt models.**

In [None]:
pd.json_normalize(user_dict)

# This expands columns instead of expanding rows. Not exactly what we want.

In [None]:
# We only really want the "Text" value in the dictionaries in the "Mode" and "Type" columns.
# user_dict[0]['Mode']
user_dict[0]

In [None]:
# How to replace certain values in a key:value pair of an existing python dictionary.
original = user_dict[0]
copy = {**original, 'Mode': original['Mode']['Text'], 
        'Type': original['Type']['Text']}
copy

----
Done! This works, but it is not ideal because the keys to change are hard-coded instead of found by iterating over the dictionary. It is fine as long as we know which dictionary items in each table are nested.  

In [None]:
# Trying loop of creating new dict from old dict.
# New dict will not be nested - checks for a nested dict in each value; for each nested dict, 
# we extract only the k,v pair where the key == 'Text' 

copy_test = {**original}
for k,v in copy_test.items():
    if isinstance(v, dict):
        copy_test[k] = v['Text']
        
copy_test

In [None]:
## Worked! Now try the above loop over an entire JSON data table

for x in user_dict:
    for k,v in x.items():
        if isinstance(v, dict):
            x[k] = v['Text']

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

#### Table is now one level and in the format desired.
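
The in-place loop above could also be wrapped in a reusable function that leaves the original data untouched. This is a sketch only: `flatten_text_fields` is a hypothetical name, the sample rows are made up, and keeping a nested dict unchanged when it has no `Text` key is an assumption.

```python
import pandas as pd


def flatten_text_fields(rows):
    """Return copies of rows with each nested dict replaced by its 'Text' value."""
    flat_rows = []
    for row in rows:
        flat_rows.append(
            {
                key: value["Text"]
                if isinstance(value, dict) and "Text" in value
                else value
                for key, value in row.items()
            }
        )
    return flat_rows


# Made-up rows shaped like one nested table from the API.
sample_rows = [
    {"Id": 0, "Mode": {"id": 1, "Text": "Bus"}, "Type": {"id": 2, "Text": "Revenue"}},
    {"Id": 1, "Mode": {"id": 3, "Text": "Demand Response"}, "Type": None},
]

flat_df = pd.DataFrame(flatten_text_fields(sample_rows))
print(flat_df)
```

Because new dicts are built, `sample_rows` keeps its nested values and the function can be re-run safely.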