The BlackCat API came with no documentation, so this notebook inspects what the API returns and how it is formatted. This notebook contains:  
1. The code for inspecting the data.
2. The steps for checking whether the data has changed in 2024 (and in later years).
  
These steps are inferred from the data itself. Contact BlackCat for authoritative instructions.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import pendulum
import re

**NOTE that the URL has the year at the end. Change this to whatever year of data you would like to get.**
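A minimal sketch of building the URL for any year, assuming the URL pattern stays the same in later years (the helper name `ntd_reports_url` is made up for this example and is not part of the BlackCat API):

```python
# Base of the report URL used in this notebook; the year is the last path segment.
BASE_URL = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA"


def ntd_reports_url(year: int) -> str:
    """Hypothetical helper: return the report URL for the given year."""
    return f"{BASE_URL}/{int(year)}"


# Build the URLs for the years compared in this notebook.
urls = {year: ntd_reports_url(year) for year in (2023, 2024)}
print(urls[2024])
```

The result can be passed to `requests.get` in the same way as the hard-coded URLs in the following cells.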

In [2]:
api_2024 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2024"

In [3]:
api_2023 = "https://services.blackcattransit.com/api/APIModules/GetNTDReportsByYear/BCG_CA/2023"

In [4]:
response_2024 = requests.get(api_2024)

In [5]:
response_2023 = requests.get(api_2023)

In [6]:
display(
    response_2024,
    response_2023
)
#looking for response 200

<Response [200]>

<Response [200]>

In [7]:
blob_2023 = response_2023.json()
blob_2024 = response_2024.json()

In [8]:
#blob_2024

**Get table and org list**

In [9]:
# type(blob) #list
display(
    len(blob_2023),  # number of reports in the 2023 data
    len(blob_2024)   # number of reports in the 2024 data
)

# type(blob[0]) #dict

# blob[0]

# blob['Tables']

85

60

In [10]:
# Each report is a dict. Its keys are the report metadata fields plus one key per table.
# List ALL the dict keys of the first report for each year:
tables_2023 = list(blob_2023[0].keys())
tables_2024 = list(blob_2024[0].keys())

# # To list ONLY the keys that start with "NTD", filter instead:
# tables_2024 = [k for k in blob_2024[0] if k.startswith("NTD")]

# compare both table list
display(
    tables_2023 == tables_2024,
    #list(tables_2023), 
    list(tables_2024)
)

True

['ReportId',
 'Organization',
 'ReportPeriod',
 'ReportStatus',
 'ReportLastModifiedDate',
 'NTDReportingStationsAndMaintenance',
 'NTDTransitAssetManagementA15',
 'NTDAssetAndResourceInfo',
 'NTDReportingP10',
 'NTDReportingP20',
 'NTDReportingP50',
 'NTDReportingA35',
 'NTDReportingRR20_Intercity',
 'NTDReportingRR20_Rural',
 'NTDReportingRR20_Urban_Tribal',
 'NTDReportingTAMNarrative',
 'SS60']

In [11]:
# slicing blob at specific entry number, report name and 'data' field

blob_2024[3]["NTDReportingRR20_Rural"]#["Data"]

{'Data': [{'Id': 15932,
   'ReportId': 1016,
   'Item': 'Commuter Bus (CB) - (PT)',
   'Revenue': '',
   'Type': 'Expenses by Mode',
   'CssClass': 'expense',
   'OperationsExpended': None,
   'CapitalExpended': None,
   'Description': None,
   'AnnualVehicleRevMiles': None,
   'AnnualVehicleRevMilesComments': None,
   'AnnualVehicleRevHours': None,
   'AnnualVehicleRevHoursComments': None,
   'AnnualUnlinkedPassTrips': None,
   'AnnualUnlinkedPassTripsComments': None,
   'AnnualVehicleMaxService': None,
   'AnnualVehicleMaxServiceComments': None,
   'SponsoredServiceUPT': None,
   'SponsoredServiceUPTComments': None,
   'Quantity': None,
   'LastModifiedDate': '2024-10-08T20:20:24.463'},
  {'Id': 15933,
   'ReportId': 1016,
   'Item': 'Demand Response (DR) - (PT)',
   'Revenue': '',
   'Type': 'Expenses by Mode',
   'CssClass': 'expense',
   'OperationsExpended': None,
   'CapitalExpended': None,
   'Description': None,
   'AnnualVehicleRevMiles': None,
   'AnnualVehicleRevMilesCommen

### Inspect whether any tables have changed from last year (2023)

1. Pull up the external tables yaml at airflow/dags/create_external_tables/ntd_report_validation/external_table_all_ntdreports.yml
2. Pull up the external_blackcat.all_ntdreports table in BigQuery
3. Compare the table names above with the table list in that schema. NOTE: table names in BigQuery are not *exactly* the same as in the API; they have been made all lowercase with `_data` added.
4. Compare the individual columns within each of the above tables to the columns in the schema.
  
To help with #3 and #4, use the cells below: copy in the table names one by one and inspect the API data for whichever year you are interested in. If you don't see any data, cycle through the JSON list by changing the report index, e.g. `blob_2024[0]` to `blob_2024[1]`, `blob_2024[2]`, etc.  
  
Change the schema in the yaml as needed to reflect the data. Do not remove any old column names; only add new columns and/or tables.
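The manual cycling and column comparison can be partly automated. The sketch below is one possible approach, not the method used in production: it compares two years of API data with each other (the comparison against the yaml is still manual), and the function names and the miniature blobs are made up for illustration. It assumes every table is a dict with a single `Data` list, as seen in the output above. A column only shows up if at least one report has rows for that table.

```python
def first_nonempty_index(blob, table):
    """Return the index of the first report whose table has rows, or None."""
    for i, report in enumerate(blob):
        if report.get(table, {}).get("Data"):
            return i
    return None


def columns_by_table(blob):
    """Map each table name to the set of column names seen in any report."""
    columns = {}
    for report in blob:
        for name, value in report.items():
            # Tables are dicts holding a 'Data' list; metadata fields are plain values.
            if isinstance(value, dict) and isinstance(value.get("Data"), list):
                cols = columns.setdefault(name, set())
                for row in value["Data"]:
                    cols.update(row.keys())
    return columns


def new_columns(old_blob, new_blob):
    """Return, per table, the columns present in new_blob but not in old_blob."""
    old, new = columns_by_table(old_blob), columns_by_table(new_blob)
    return {
        table: sorted(cols - old.get(table, set()))
        for table, cols in new.items()
        if cols - old.get(table, set())
    }


# Hypothetical miniature blobs shaped like the API response (values invented).
old_blob = [{"ReportId": 1, "SS60": {"Data": [{"Id": 1, "Item": "x"}]}}]
new_blob = [
    {"ReportId": 2, "SS60": {"Data": []}},
    {"ReportId": 3, "SS60": {"Data": [{"Id": 2, "Item": "y", "LastModifiedDate": None}]}},
]

print(first_nonempty_index(new_blob, "SS60"))  # 1
print(new_columns(old_blob, new_blob))         # {'SS60': ['LastModifiedDate']}
```

With the real data, the equivalent calls would be `first_nonempty_index(blob_2024, 'NTDReportingP10')` and `new_columns(blob_2023, blob_2024)`.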

In [12]:
# A10 report
# check complete
blob_2024[3]['NTDReportingStationsAndMaintenance']['Data'][0]

{'Id': 411,
 'ReportId': 1016,
 'ServiceMode': 'Commuter Bus (CB)',
 'PTOwnedByServiceProvider': None,
 'PTOwnedByPublicAgency': None,
 'PTLeasedByPublicAgency': None,
 'PTLeasedByServiceProvider': None,
 'DOOwned': None,
 'DOLeasedByPublicAgency': None,
 'DOLeasedFromPrivateEntity': None,
 'LastModifiedDate': '2024-09-19T20:38:33.59'}

In [13]:
# check complete

# new columns: 
    #"Type", 
    #"Notes", 
    #"LastModifiedDate"

# added to draft yaml
blob_2024[14]['NTDTransitAssetManagementA15']["Data"][0]

{'Id': 353,
 'FacilityId': 52,
 'ReportId': 1033,
 'FacilityName': 'Martis Shelter',
 'PrimaryMode': 'MB - Bus',
 'FacilityClass': 'Passenger Facility',
 'FacilityType': 'Shelter',
 'YearBuilt': 2012,
 'Size': 114,
 'Type': 'Square Feet',
 'DOTCapitalResponsibility': None,
 'OrganizationCapitalResponsibility': None,
 'ConditionAssessment': '4.5',
 'ConditionAssessmentDate': '2024-06-30T21:00:00',
 'SectionOfLargerFacility': 'NO',
 'Latitude': None,
 'LatitudeDirection': None,
 'Longitude': None,
 'LongitudeDirection': None,
 'SecondaryMode': None,
 'PrivateMode': None,
 'Notes': None,
 'LastModifiedDate': '2024-09-26T17:55:56.1'}

In [14]:
# A30
# check complete

# new columns:
#     'TotalVehicles',
#     'ActiveVehicles',
#     'DedicatedFleet',
#     'NoCapitalReplacementResponsibility',
#     'AutomatedorAutonomousVehicles',
#     'Manufacturer',
#     'DescribeOtherManufacturer',
#     'Model',
#     'YearRebuilt',
#     'OtherFuelType',
#     'DuelFuelType',
#     'StandingCapacity',
#     'OtherOwnershipType',
#     'EmergencyVehicles',
#     'TypeofLastRenewal',
#     'UsefulLifeBenchmark',
#     'MilesThisYear',
#     'AverageLifetimeMilesPerActiveVehicle'
                        
# added to draft yaml
blob_2024[15]['NTDAssetAndResourceInfo']['Data'][1]

{'Id': 21569,
 'VehicleId': 16179,
 'ReportId': 1034,
 'VehicleStatus': 'Active',
 'Vin': '8977',
 'NTDID': '',
 'ADAAccess': 'No',
 'VehicleType': 'CU - Cutaway Bus',
 'FuelType': 'GA - Gasoline',
 'FundSource': 'Other Federal Funds: Section 5337- State of Good Repair',
 'AverageEstimatedServiceYearsWhenNew': 10,
 'AverageExpirationYearsWhenNew': 10,
 'VehicleYear': 2019,
 'UsefulLifeYearsRemaining': 0,
 'VehicleLength': 24.0,
 'SeatingCapacity': 12,
 'OwnershipType': '',
 'ModesOperatedDisplayText': '',
 'ModesOperatedFullText': '',
 'LastModifiedDate': '2024-09-26T17:06:03.26',
 'AgencyFleetId': '308',
 'TotalVehicles': 1,
 'ActiveVehicles': 1,
 'DedicatedFleet': 'No',
 'NoCapitalReplacementResponsibility': 'No',
 'AutomatedorAutonomousVehicles': '',
 'Manufacturer': 'Glaval Bus ',
 'DescribeOtherManufacturer': '',
 'Model': '',
 'YearRebuilt': 0,
 'OtherFuelType': '',
 'DuelFuelType': '',
 'StandingCapacity': 0,
 'OtherOwnershipType': '',
 'EmergencyVehicles': 'No',
 'TypeofLastRen

In [15]:
# check complete
blob_2024[12]['NTDReportingP10']['Data']

[{'Id': 205,
  'ReportId': 1031,
  'OrgId': 3677,
  'UserId': '73e964ed-41a1-44a6-8f7b-3c7513689b82',
  'FirstName': 'Mengil',
  'LastName': 'Deane',
  'FullName': {'id': 0,
   'Text': 'Mengil Deane',
   'Value': '73e964ed-41a1-44a6-8f7b-3c7513689b82',
   'Group': None,
   'BoolValue': False},
  'PrimaryPhone': '(530) 823-4211 ext. 145',
  'Email': 'mdeane@auburn.ca.gov',
  'LastModifiedDate': '2024-09-27T18:43:11.91'}]

In [16]:
#check complete
blob_2024[5]['NTDReportingP20']['Data']

[{'Id': 0,
  'ReportId': 1021,
  'ServiceMode': 'Deviated Fixed Route (DF)',
  'TypeOfService': 'PT - Purchased Transportation',
  'CommitmentDate': None,
  'StartDate': None,
  'EndDate': None,
  'LastModifiedDate': None}]

In [17]:
#check complete
blob_2024[5]['NTDReportingP50']['Data']

[{'Id': 0,
  'ReportId': 1021,
  'Mode': {'id': 0,
   'Text': 'Deviated Fixed Route (DF)',
   'Value': '10',
   'Group': None,
   'BoolValue': False},
  'Type': {'id': 0,
   'Text': 'PT - Purchased Transportation',
   'Value': '2',
   'Group': None,
   'BoolValue': False},
  'WebLink': None,
  'FilePath': None,
  'LastModifiedDate': None}]

In [18]:
#check complete
blob_2024[16]['NTDReportingA35']['Data'][0]

{'Id': 247,
 'ReportId': 1035,
 'EquipmentName': 'TS-1',
 'EquipmentId': 180,
 'VehicleType': 'Trucks and Other Rubber Tire Vehicles',
 'PrimaryMode': '',
 'SecondaryMode': '',
 'TotalVehicles': None,
 'UsefulLifeBenchmark': False,
 'YearOfManufacture': 2006,
 'TransitAgencyCapitalResponsibility': '',
 'EstimatedCost': None,
 'YearDollarsEstimatedCost': None,
 'UsefulLifeYearsBenchMark': None,
 'UsefulLifeYearsRemaining': None,
 'LastModifiedDate': '2024-09-26T17:38:34.383'}

In [19]:
blob_2024[36]['NTDReportingRR20_Intercity']['Data']

[]

In [20]:
# check complete
# new columns:
#     "AnnualVehicleRevMilesComments"
#     "AnnualVehicleRevHoursComments"
#     "AnnualUnlinkedPassTripsComments"
#     "AnnualVehicleMaxServiceComments"
#     "SponsoredServiceUPTComments"
# added to draft yaml
blob_2024[16]['NTDReportingRR20_Rural']['Data'][2]

{'Id': 14203,
 'ReportId': 1035,
 'Item': 'Demand Response (DR) - (DO)',
 'Revenue': 'Organization-Paid Fares',
 'Type': 'Fare Revenues',
 'CssClass': 'revenue',
 'OperationsExpended': None,
 'CapitalExpended': None,
 'Description': None,
 'AnnualVehicleRevMiles': None,
 'AnnualVehicleRevMilesComments': None,
 'AnnualVehicleRevHours': None,
 'AnnualVehicleRevHoursComments': None,
 'AnnualUnlinkedPassTrips': None,
 'AnnualUnlinkedPassTripsComments': None,
 'AnnualVehicleMaxService': None,
 'AnnualVehicleMaxServiceComments': None,
 'SponsoredServiceUPT': None,
 'SponsoredServiceUPTComments': None,
 'Quantity': None,
 'LastModifiedDate': '2024-09-26T17:51:47.807'}

In [21]:
#check complete
blob_2024[6]['NTDReportingRR20_Urban_Tribal']['Data'][0]

{'Id': 748,
 'ItemId': 8,
 'ReportId': 1023,
 'Item': 'FTA Rural Area Formula Funds (5311)',
 'OperationsExpended': None,
 'CapitalExpended': None,
 'Description': None,
 'LastModifiedDate': '2024-09-25T19:37:26.243'}

In [22]:
# check complete
# NEW COLUMN(S)
# 'VehiclesToBePurchasesNextYear'
blob_2024[2]['NTDReportingTAMNarrative']['Data'][0]

{'Id': 601,
 'ReportId': 1012,
 'Type': 'Revenue Vehicles',
 'Category': 'Light-Duty Mid-Sized Bus',
 'VehiclesInAssetClass': 1,
 'VehiclesExceededULBTAMPlan': 0,
 'TAMPlanGoalsDescription': None,
 'VehiclesToBeRetiredBeyondULB': 0,
 'VehiclesToBePurchasesNextYear': 0,
 'VehiclesPastULBInTAM': 0,
 'LastModifiedDate': None}

In [23]:
# check complete
blob_2024[16]['SS60']['Data'][0]

{'Id': 3843,
 'ItemId': 1,
 'ReportId': 1035,
 'Item': 'Major Safety and Security Events',
 'Type': None,
 'CssClass': None,
 'TransitVehicleAssualts': 0,
 'RevenueFacilityAssualts': 0,
 'NonRevenueFacilityAssualts': 0,
 'OtherLocationAssualts': 0,
 'MajorEvents': None,
 'Fatalities': None,
 'Injuries': None,
 'Quantity': None,
 'LastModifiedDate': '2024-09-26T18:24:02.33'}

#### Extract the org name and details from the blob
This is extra code to show how to inspect what organizations are in the API at any given time, in case it is helpful. 

In [25]:
for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod_string = x.get('ReportLastModifiedDate')
    last_mod = pendulum.from_format(last_mod_string, 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    print(f"Report details: ID {report_id}, org {org}, report period {period}, status {status}, last modified on {last_mod_string}.")
#     print(f"New datetime {last_mod}")
#     print(f"iso is {iso}")

Report details: ID 993, org Tahoe Transportation District, report period 2024, status Not Submitted, last modified on 1/23/2024 4:34:34 PM.
Report details: ID 1011, org Morongo Basin Transit Authority, report period 2024, status Not Submitted, last modified on 9/16/2024 2:05:42 PM.
Report details: ID 1012, org Modoc Transportation Agency, report period 2024, status Not Submitted, last modified on 9/17/2024 7:51:02 PM.
Report details: ID 1016, org Palo Verde Valley Transit Agency, report period 2024, status Not Submitted, last modified on 9/19/2024 4:38:19 PM.
Report details: ID 1017, org City of Rio Vista, report period 2024, status Not Submitted, last modified on 9/19/2024 4:48:03 PM.
Report details: ID 1021, org City of Needles, report period 2024, status Submitted, last modified on 9/25/2024 12:19:54 PM.
Report details: ID 1023, org Santa Cruz Metropolitan Transit District, report period 2024, status Not Submitted, last modified on 9/25/2024 3:35:43 PM.
Report details: ID 1024, org 

#### Quick check on what orgs are in this API, and how many have RR-20 info

In [26]:
org_data = []

for x in blob_2024:
    report_id = x.get('ReportId')
    org = x.get('Organization')
    period = x.get('ReportPeriod')
    status = x.get('ReportStatus')
    last_mod = pendulum.from_format(x.get('ReportLastModifiedDate'), 'MM/DD/YYYY HH:mm:ss A').in_tz('America/Los_Angeles')
    iso = last_mod.to_iso8601_string()
    
    
    # Each RR-20 table is a dict with a single 'Data' key holding a list of rows.
    rural_n = len(x['NTDReportingRR20_Rural']['Data'])
    city_n = len(x['NTDReportingRR20_Intercity']['Data'])
    urban_n = len(x['NTDReportingRR20_Urban_Tribal']['Data'])
    
    org_info = pd.DataFrame(data=[[report_id, org, period, status, iso, rural_n, city_n, urban_n]], 
                            columns=['report_id', 'organization', 'report_period', 'report_status', 'last_modified', 
                                     'rr20_rural_rows', 'rr20_intercity_rows', 'rr20_urban-tribal_rows'])
#     whole_df = pd.concat([org_info, raw_df], axis=1).sort_values(by='organization')
    
    org_data.append(org_info)


In [27]:
newapi = pd.concat(org_data)
print(len(newapi))
newapi.head()

60


Unnamed: 0,report_id,organization,report_period,report_status,last_modified,rr20_rural_rows,rr20_intercity_rows,rr20_urban-tribal_rows
0,993,Tahoe Transportation District,2024,Not Submitted,2024-01-23T08:34:34-08:00,0,0,0
0,1011,Morongo Basin Transit Authority,2024,Not Submitted,2024-09-16T07:05:42-07:00,0,0,0
0,1012,Modoc Transportation Agency,2024,Not Submitted,2024-09-17T12:51:02-07:00,0,0,0
0,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,2024-09-19T09:38:19-07:00,55,0,0
0,1017,City of Rio Vista,2024,Not Submitted,2024-09-19T09:48:03-07:00,0,0,0


In [28]:
newapi.to_csv('../data/newapi_rr20_11-27-23.csv')

## Convert API data to dataframes
This section uses the test API to develop a function.

Load the entire blob into one dataframe. This is the approach Cal-ITP recommends; they prefer that any transformations and splitting into tables then happen in dbt.  
Downsides:
* Many columns contain nested data (lists and dictionaries). Basically, each NTD report is in ONE column.
* The column names change because of the nesting and because of repeated column names (for example, `NTDReportingP10.Data`).
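To make the first downside concrete, here is a small sketch on invented data. `sample_blob` and its values are hypothetical; only the shape mirrors the API response. It contrasts the whole-blob dataframe with a per-table dataframe built with a plain list comprehension, which is just one of several ways to get one row per record.

```python
import pandas as pd

# Hypothetical two-report sample shaped like the API response (values invented).
sample_blob = [
    {
        "ReportId": 1,
        "Organization": "Agency A",
        "NTDReportingP20": {"Data": [
            {"Id": 0, "ReportId": 1, "ServiceMode": "Bus (MB)"},
            {"Id": 0, "ReportId": 1, "ServiceMode": "Demand Response (DR)"},
        ]},
    },
    {
        "ReportId": 2,
        "Organization": "Agency B",
        "NTDReportingP20": {"Data": []},
    },
]

# Whole-blob approach: one row per report; the table stays nested in one column.
wide = pd.json_normalize(sample_blob)
print(wide.columns.tolist())

# Per-table approach: one row per record, with the organization carried along.
rows = [
    {"Organization": report["Organization"], **record}
    for report in sample_blob
    for record in report["NTDReportingP20"]["Data"]
]
p20 = pd.DataFrame(rows)
print(p20)
```

Reports with an empty `Data` list contribute no rows to the per-table dataframe, so "Agency B" does not appear in `p20`.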

In [29]:
df = pd.json_normalize(blob_2024)
df

Unnamed: 0,ReportId,Organization,ReportPeriod,ReportStatus,ReportLastModifiedDate,NTDReportingStationsAndMaintenance.Data,NTDTransitAssetManagementA15.Data,NTDAssetAndResourceInfo.Data,NTDReportingP10.Data,NTDReportingP20.Data,NTDReportingP50.Data,NTDReportingA35.Data,NTDReportingRR20_Intercity.Data,NTDReportingRR20_Rural.Data,NTDReportingRR20_Urban_Tribal.Data,NTDReportingTAMNarrative.Data,SS60.Data
0,993,Tahoe Transportation District,2024,Not Submitted,1/23/2024 4:34:34 PM,[],[],[],[],[],[],[],[],[],[],[],[]
1,1011,Morongo Basin Transit Authority,2024,Not Submitted,9/16/2024 2:05:42 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1011, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1011, 'Mode': {'id': 0,...",[],[],[],[],[],[]
2,1012,Modoc Transportation Agency,2024,Not Submitted,9/17/2024 7:51:02 PM,"[{'Id': 465, 'ReportId': 1012, 'ServiceMode': ...",[],"[{'Id': 23776, 'VehicleId': 12459, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1012, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1012, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 601, 'ReportId': 1012, 'Type': 'Revenu...",[]
3,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,9/19/2024 4:38:19 PM,"[{'Id': 411, 'ReportId': 1016, 'ServiceMode': ...",[],"[{'Id': 23242, 'VehicleId': 15882, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1016, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1016, 'Mode': {'id': 0,...",[],[],"[{'Id': 15932, 'ReportId': 1016, 'Item': 'Comm...",[],[],[]
4,1017,City of Rio Vista,2024,Not Submitted,9/19/2024 4:48:03 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1017, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1017, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 605, 'ReportId': 1017, 'Type': 'Revenu...",[]
5,1021,City of Needles,2024,Submitted,9/25/2024 12:19:54 PM,"[{'Id': 420, 'ReportId': 1021, 'ServiceMode': ...","[{'Id': 367, 'FacilityId': 72, 'ReportId': 102...","[{'Id': 23644, 'VehicleId': 15310, 'ReportId':...","[{'Id': 214, 'ReportId': 1021, 'OrgId': 3737, ...","[{'Id': 0, 'ReportId': 1021, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1021, 'Mode': {'id': 0,...",[],[],"[{'Id': 14674, 'ReportId': 1021, 'Item': 'Devi...",[],[],"[{'Id': 4105, 'ItemId': 1, 'ReportId': 1021, '..."
6,1023,Santa Cruz Metropolitan Transit District,2024,Not Submitted,9/25/2024 3:35:43 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1023, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1023, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 748, 'ItemId': 8, 'ReportId': 1023, 'I...",[],[]
7,1024,Butte County Association of Governments/ Butte...,2024,Submitted,9/25/2024 4:12:01 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1024, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1024, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 757, 'ItemId': 8, 'ReportId': 1024, 'I...",[],[]
8,1025,San Diego Metropolitan Transit System,2024,Submitted,9/25/2024 5:38:29 PM,[],[],[],[],"[{'Id': 0, 'ReportId': 1025, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1025, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 766, 'ItemId': 8, 'ReportId': 1025, 'I...",[],[]
9,1026,City of Arvin,2024,Not Submitted,9/25/2024 6:30:48 PM,"[{'Id': 475, 'ReportId': 1026, 'ServiceMode': ...","[{'Id': 400, 'FacilityId': 44, 'ReportId': 102...","[{'Id': 23887, 'VehicleId': 12579, 'ReportId':...","[{'Id': 232, 'ReportId': 1026, 'OrgId': 3715, ...","[{'Id': 0, 'ReportId': 1026, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1026, 'Mode': {'id': 0,...",[],[],"[{'Id': 14150, 'ReportId': 1026, 'Item': 'Dema...",[],[],"[{'Id': 4436, 'ItemId': 1, 'ReportId': 1026, '..."


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 17 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   ReportId                                 60 non-null     int64 
 1   Organization                             60 non-null     object
 2   ReportPeriod                             60 non-null     object
 3   ReportStatus                             60 non-null     object
 4   ReportLastModifiedDate                   60 non-null     object
 5   NTDReportingStationsAndMaintenance.Data  60 non-null     object
 6   NTDTransitAssetManagementA15.Data        60 non-null     object
 7   NTDAssetAndResourceInfo.Data             60 non-null     object
 8   NTDReportingP10.Data                     60 non-null     object
 9   NTDReportingP20.Data                     60 non-null     object
 10  NTDReportingP50.Data                     60 non-null     object


In [31]:
df['ReportLastModifiedDate'] = df['ReportLastModifiedDate'].astype('datetime64[ns]')
# Alternative with an explicit format (timestamps look like '1/23/2024 4:34:34 PM'):
# df['ReportLastModifiedDate'] = pd.to_datetime(df['ReportLastModifiedDate'], format='%m/%d/%Y %I:%M:%S %p')

In [32]:
df

Unnamed: 0,ReportId,Organization,ReportPeriod,ReportStatus,ReportLastModifiedDate,NTDReportingStationsAndMaintenance.Data,NTDTransitAssetManagementA15.Data,NTDAssetAndResourceInfo.Data,NTDReportingP10.Data,NTDReportingP20.Data,NTDReportingP50.Data,NTDReportingA35.Data,NTDReportingRR20_Intercity.Data,NTDReportingRR20_Rural.Data,NTDReportingRR20_Urban_Tribal.Data,NTDReportingTAMNarrative.Data,SS60.Data
0,993,Tahoe Transportation District,2024,Not Submitted,2024-01-23 16:34:34,[],[],[],[],[],[],[],[],[],[],[],[]
1,1011,Morongo Basin Transit Authority,2024,Not Submitted,2024-09-16 14:05:42,[],[],[],[],"[{'Id': 0, 'ReportId': 1011, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1011, 'Mode': {'id': 0,...",[],[],[],[],[],[]
2,1012,Modoc Transportation Agency,2024,Not Submitted,2024-09-17 19:51:02,"[{'Id': 465, 'ReportId': 1012, 'ServiceMode': ...",[],"[{'Id': 23776, 'VehicleId': 12459, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1012, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1012, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 601, 'ReportId': 1012, 'Type': 'Revenu...",[]
3,1016,Palo Verde Valley Transit Agency,2024,Not Submitted,2024-09-19 16:38:19,"[{'Id': 411, 'ReportId': 1016, 'ServiceMode': ...",[],"[{'Id': 23242, 'VehicleId': 15882, 'ReportId':...",[],"[{'Id': 0, 'ReportId': 1016, 'ServiceMode': 'C...","[{'Id': 0, 'ReportId': 1016, 'Mode': {'id': 0,...",[],[],"[{'Id': 15932, 'ReportId': 1016, 'Item': 'Comm...",[],[],[]
4,1017,City of Rio Vista,2024,Not Submitted,2024-09-19 16:48:03,[],[],[],[],"[{'Id': 0, 'ReportId': 1017, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1017, 'Mode': {'id': 0,...",[],[],[],[],"[{'Id': 605, 'ReportId': 1017, 'Type': 'Revenu...",[]
5,1021,City of Needles,2024,Submitted,2024-09-25 12:19:54,"[{'Id': 420, 'ReportId': 1021, 'ServiceMode': ...","[{'Id': 367, 'FacilityId': 72, 'ReportId': 102...","[{'Id': 23644, 'VehicleId': 15310, 'ReportId':...","[{'Id': 214, 'ReportId': 1021, 'OrgId': 3737, ...","[{'Id': 0, 'ReportId': 1021, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1021, 'Mode': {'id': 0,...",[],[],"[{'Id': 14674, 'ReportId': 1021, 'Item': 'Devi...",[],[],"[{'Id': 4105, 'ItemId': 1, 'ReportId': 1021, '..."
6,1023,Santa Cruz Metropolitan Transit District,2024,Not Submitted,2024-09-25 15:35:43,[],[],[],[],"[{'Id': 0, 'ReportId': 1023, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1023, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 748, 'ItemId': 8, 'ReportId': 1023, 'I...",[],[]
7,1024,Butte County Association of Governments/ Butte...,2024,Submitted,2024-09-25 16:12:01,[],[],[],[],"[{'Id': 0, 'ReportId': 1024, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1024, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 757, 'ItemId': 8, 'ReportId': 1024, 'I...",[],[]
8,1025,San Diego Metropolitan Transit System,2024,Submitted,2024-09-25 17:38:29,[],[],[],[],"[{'Id': 0, 'ReportId': 1025, 'ServiceMode': 'B...","[{'Id': 0, 'ReportId': 1025, 'Mode': {'id': 0,...",[],[],[],"[{'Id': 766, 'ItemId': 8, 'ReportId': 1025, 'I...",[],[]
9,1026,City of Arvin,2024,Not Submitted,2024-09-25 18:30:48,"[{'Id': 475, 'ReportId': 1026, 'ServiceMode': ...","[{'Id': 400, 'FacilityId': 44, 'ReportId': 102...","[{'Id': 23887, 'VehicleId': 12579, 'ReportId':...","[{'Id': 232, 'ReportId': 1026, 'OrgId': 3715, ...","[{'Id': 0, 'ReportId': 1026, 'ServiceMode': 'D...","[{'Id': 0, 'ReportId': 1026, 'Mode': {'id': 0,...",[],[],"[{'Id': 14150, 'ReportId': 1026, 'Item': 'Dema...",[],[],"[{'Id': 4436, 'ItemId': 1, 'ReportId': 1026, '..."


In [33]:
# Report 5 (City of Needles) has P50 data; report 0 has only empty tables.
user_dict = blob_2024[5]['NTDReportingP50']['Data']
user_dict # a list of dictionaries. Each dict is one row of data.

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

However in several tables, rows have several columns that are nested dictionaries.  
  
The following code explores ways to unnest them and expand the dataframe rows. **NOTE WE DID NOT USE THIS APPROACH IN PRODUCTION. We decided to unnest tables using SQL instead, in the `staging` dbt models.**

In [None]:
pd.json_normalize(user_dict)

# This expands columns instead of expanding rows. Not exactly what we want.

In [None]:
# We only really want the "Text" value in the dictionaries in the "Mode" and "Type" columns.
# user_dict[0]['Mode']
user_dict[0]

In [None]:
# How to replace certain values in a key:value pair of an existing python dictionary.
original = user_dict[0]
copy = {**original, 'Mode': original['Mode']['Text'], 
        'Type': original['Type']['Text']}
copy

----
Done! This worked, but it is not ideal: the keys to change are hard-coded instead of found by iterating over the dictionary. It works as long as we know which dictionary items in each table are nested.  

In [None]:
# Try building a new dict from the old dict with a loop.
# The new dict is not nested: for each value that is a nested dict,
# we keep only the value stored under its 'Text' key.

copy_test = {**original}  # shallow copy, so `original` is left unchanged
for k, v in copy_test.items():
    if isinstance(v, dict):
        copy_test[k] = v['Text']
        
copy_test

In [None]:
## Worked! Now try the above loop over an entire JSON data table.
# NOTE: this modifies the rows of user_dict in place.

for x in user_dict:
    for k, v in x.items():
        if isinstance(v, dict):
            x[k] = v['Text']

In [None]:
raw_df = pd.DataFrame.from_dict(user_dict)
raw_df

#### Table is now one level and in the format desired.
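The loop above changes the rows of `user_dict` in place, which also changes the original API blob. A non-mutating variant could look like the sketch below. The function name `flatten_text_fields` is hypothetical, and the sample row is abridged from the P50 output shown earlier. Like the loop above, it assumes every nested dict carries the value we want under `'Text'`.

```python
def flatten_text_fields(rows):
    """Return new row dicts with each nested dict replaced by its 'Text' value."""
    flat_rows = []
    for row in rows:
        flat_rows.append({
            key: value.get("Text") if isinstance(value, dict) else value
            for key, value in row.items()
        })
    return flat_rows


# Abridged sample row in the shape of the NTDReportingP50 data.
sample_rows = [{
    "Id": 0,
    "Mode": {"id": 0, "Text": "Deviated Fixed Route (DF)", "Value": "10"},
    "Type": {"id": 0, "Text": "PT - Purchased Transportation", "Value": "2"},
    "WebLink": None,
}]

flat = flatten_text_fields(sample_rows)
print(flat[0]["Mode"])         # Deviated Fixed Route (DF)
print(sample_rows[0]["Mode"])  # still the original nested dict
```

The flattened rows can then be passed to `pd.DataFrame` in the same way as `user_dict`.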