Exploring Airtable for the transit data quality issues table

In [1]:
from dotenv import load_dotenv
import os

import numpy as np
import pandas as pd
from pyairtable import Api
load_dotenv()
api = Api(os.getenv('AIRTABLE_TOKEN'))

https://airtable.com/appmBGOFTvsDv4jdJ/api/docs#curl/table:transit%20data%20quality%20issues

In [2]:
# Trying to stay consistent with 
# https://github.com/cal-itp/data-infra/blob/main/airflow/plugins/operators/airtable_to_gcs.py
def all_rows_as_df(base_id, table_name):
    all_rows = api.table(base_id=base_id, table_name=table_name).all(return_fields_by_field_id=False)

    df = pd.DataFrame(
        [
            {"id":row["id"], **row["fields"]}
            for row in all_rows
        ]
    )
    return df

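def takeout_list(x):
    # Linked-record fields arrive as single-item lists; missing values arrive as NaN.
    # Checking for a non-empty list avoids an IndexError on [] and a TypeError on NaN.
    if isinstance(x, list) and len(x) > 0:
        return x[0]
    return None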
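
To check the two helpers without calling the Airtable API, here is a minimal sketch on made-up records. It assumes each record returned by pyairtable is a dict with `id`, `createdTime`, and `fields`, and that empty fields are simply absent from `fields`. The record IDs, descriptions, and the helper name `first_or_none` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up records shaped like the ones flattened by all_rows_as_df.
# The second record has no "Issue Type" key, standing in for an empty field.
fake_rows = [
    {
        "id": "recAAA",
        "createdTime": "2023-05-25T00:16:26.000Z",
        "fields": {"Description": "Feed expired", "Issue Type": ["recT1"]},
    },
    {
        "id": "recBBB",
        "createdTime": "2023-05-26T00:00:00.000Z",
        "fields": {"Description": "Unstable URL"},
    },
]

# Same flattening as all_rows_as_df: one row per record, fields as columns.
df = pd.DataFrame([{"id": row["id"], **row["fields"]} for row in fake_rows])


def first_or_none(x):
    """Return the first element of a non-empty list, otherwise None."""
    if isinstance(x, list) and x:
        return x[0]
    return None


# Unwrap the single-item lists; the missing value stays missing.
df["Issue Type"] = df["Issue Type"].apply(first_or_none)
print(df)
```

The missing field becomes a null in the flattened frame, which is why the unwrapping helper has to handle non-list values.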


Reference for the base and table IDs used below:
https://airtable.com/appmBGOFTvsDv4jdJ/api/docs#curl/table:transit%20data%20quality%20issues

In [3]:
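# Note: despite the variable name, an `app...` value is the Airtable base ID;
# the `tbl...` values passed below are the table IDs.
TABLE_TRANSIT_DATA_QUALITY_ISSUES_ID = 'appmBGOFTvsDv4jdJ'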

In [4]:
issues_df = all_rows_as_df(TABLE_TRANSIT_DATA_QUALITY_ISSUES_ID, 'tblEv7QTfEmypU6gg')
issue_types = all_rows_as_df(TABLE_TRANSIT_DATA_QUALITY_ISSUES_ID, 'tblupkIe04LxEPWSR')
services_df = all_rows_as_df(TABLE_TRANSIT_DATA_QUALITY_ISSUES_ID, 'tblBZtbuntv4D0i1u')

In [5]:
# Unwrap columns whose values are single-item lists of linked-record IDs
issues_df['Issue Type'] = issues_df['Issue Type'].apply(takeout_list)
issues_df['Services'] = issues_df['Services'].apply(takeout_list)

In [6]:
issues_df.head()

Unnamed: 0,id,Description,Issue Type,GTFS Datasets,Status,Issue #,Services,Resolution Date,Assignee,Issue Creation Time,...,Caltrans District (from Operating County Geographies) (from Services),Is Open,Last Modified,Last Update Month,Last Update Year,Last Modified By,Status Notes,Should wait until,Outreach Status,Waiting Since
0,rec0MW4aKpcpJDMXv,Palo Verde Valley Transit Agency: GTFS Schedul...,recXHbaInR8Uebp5D,[recnCtfmZcTSsrJh7],Fixed - with Cal-ITP help,21,recVmX1bV8LrzyGl7,2023-05-19,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",2023-05-25T00:16:26.000Z,...,[8],No,2023-05-30T17:21:48.000Z,5,2023,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",,,,
1,rec0WcOlUWYE7hcpv,City of Auburn: GTFS Schedule Feed has Expired,recXHbaInR8Uebp5D,[recqOFeSKMAP6tcGp],Fixed - with Cal-ITP help,11,recAk8l02ipaRB34a,2023-04-13,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",2023-05-25T00:16:26.000Z,...,[3],No,2023-05-30T17:21:48.000Z,5,2023,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",,,,
2,rec0wKHGW5JrGSQ7Y,The GTFS schedule feed is about to expire on 1...,recEmZkgNkfKgYe6N,[recoOHimkMbZlH3vV],Fixed - on its own,143,reczR85P0abN3zb3n,2024-01-03,"{'id': 'usrnOT3BlbNIbiG9v', 'email': 'md.islam...",2023-12-29T00:28:14.000Z,...,[2],No,2024-01-05T00:06:41.000Z,1,2024,"{'id': 'usrnOT3BlbNIbiG9v', 'email': 'md.islam...",As of 12/28: The GTFS schedule feed is about t...,,,
3,rec21OAguvM0dcDHk,Los Angeles County Metropolitan Transportation...,recXHbaInR8Uebp5D,[recVvXBJDgsRqVLy6],Fixed - with Cal-ITP help,47,recEEI0Hoj2x3HTTJ,2023-06-29,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",2023-05-25T00:16:26.000Z,...,[7],No,2023-09-15T23:35:09.000Z,9,2023,"{'id': 'usrBwrVsyUiZ7jCq8', 'email': 'evan.sir...",I reached out to Nina Kin about various feeds....,,,
4,rec2hy3Jy9BtpJ2J9,City of Bell: Los Angeles Metro feed to transi...,rec9hb7KmcQCvR5BG,[recfPwQA6XD8SN2gR],Outreach,96,recQIV7c2LReySeYp,,"{'id': 'usr2fo9oGEjsugXf0', 'email': 'evan@cal...",2023-09-16T00:18:07.000Z,...,[7],Yes,2024-01-08T23:42:57.000Z,1,2024,"{'id': 'usrnOT3BlbNIbiG9v', 'email': 'md.islam...",As of 9/14: Sent kickoff email to district rep...,2024-01-19,Waiting on District Rep,


In [7]:
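# Resolve linked-record IDs to names. Each lookup table's `id` column is passed
# as an array-like join key, which keeps a second `id` column out of the result.
df = pd.merge(left=issues_df, right=issue_types[['Name']], how="left", left_on=['Issue Type'], right_on=issue_types['id'])

df = pd.merge(df, services_df[['Operator','Name']], how="left", left_on=['Services'], right_on=services_df['id'])
# df.head()
# After the second merge, Name_x is the issue type name and Name_y is the service name.
df = df.rename(columns={"Name_x":"Issue_Type","Name_y":"Service_Name"})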
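
The merges above turn linked-record IDs into readable names. As an alternative sketch (not the approach used in this notebook), the same lookup can be written with `Series.map` against an `id`-indexed lookup, which avoids the `Name_x`/`Name_y` suffixes. All IDs and names below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for issues_df, issue_types, and services_df.
issues = pd.DataFrame(
    {
        "id": ["recI1", "recI2", "recI3"],
        "Issue Type": ["recT1", "recT2", None],
        "Services": ["recS1", None, "recS2"],
    }
)
issue_types = pd.DataFrame(
    {"id": ["recT1", "recT2"], "Name": ["Expired feed", "Unstable URL"]}
)
services = pd.DataFrame(
    {
        "id": ["recS1", "recS2"],
        "Name": ["Service A", "Service B"],
        "Operator": ["Operator A", "Operator B"],
    }
)

# Build id -> value lookups once, then map each linked-record ID through them.
type_names = issue_types.set_index("id")["Name"]
service_lookup = services.set_index("id")

resolved = issues.assign(
    Issue_Type=issues["Issue Type"].map(type_names),
    Service_Name=issues["Services"].map(service_lookup["Name"]),
    Operator=issues["Services"].map(service_lookup["Operator"]),
)
print(resolved)
```

IDs with no match, including missing ones, come back as nulls, matching the behavior of a left merge.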

In [8]:
df = df.drop(axis=1, columns=["Last Modified By","Assignee","Services","GTFS Datasets","Issue Type","Created By"])

In [9]:
# Keep only unresolved issues (rows with no resolution date)
df = df.loc[df['Resolution Date'].isna()]
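
Filtering on a missing `Resolution Date` also keeps rows whose `Status` is something like "False Positive", since those never receive a resolution date. The sketch below, on made-up rows, shows one possible way to narrow the result further. Treating "False Positive" as non-actionable is an assumption for illustration, not something the notebook states.

```python
import pandas as pd

# Toy rows using Status values that appear in the table.
toy = pd.DataFrame(
    {
        "Issue #": [1, 2, 3, 4],
        "Status": ["Outreach", "Fixed - on its own", "False Positive", "On Hold"],
        "Resolution Date": [None, "2024-01-03", None, None],
    }
)

# Same filter as the cell above: no resolution date means unresolved.
unresolved = toy.loc[toy["Resolution Date"].isna()]

# Assumption: false positives need no follow-up, so drop them as well.
actionable = unresolved.loc[unresolved["Status"] != "False Positive"]
print(actionable)
```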

In [10]:
df.loc[df['Issue_Type'] == "GTFS Realtime Completeness Problem"]

Unnamed: 0,id,Description,Status,Issue #,Resolution Date,Issue Creation Time,Waiting over a week?,QC: Num services,QC: Num Issue Types,QC Checks,...,Last Modified,Last Update Month,Last Update Year,Status Notes,Should wait until,Outreach Status,Waiting Since,Issue_Type,Operator,Service_Name
16,rec9Iz2deWxPxWE5B,"In the past 14 days, GTFS-RT Vehicle Positions...",Outreach,87,,2023-09-08T20:20:38.000Z,No,1,1,OK,...,2024-01-25T19:59:45.000Z,1,2024,"As of 9/8, ticket created for Customer Success...",,Waiting on MTC 511,2024-01-25,GTFS Realtime Completeness Problem,City of Union City,Union City Transit
40,recGKVXr5d3F3Hs88,"Since May 4, Sunline's Trip Updates and Vehicl...",Outreach,72,,2023-07-20T19:55:41.000Z,Yes,1,1,OK,...,2024-01-18T01:17:37.000Z,1,2024,As of 7/20: Sent email to transit agency.\n\nA...,,Waiting on Transit Agency,2024-01-17,GTFS Realtime Completeness Problem,SunLine Transit Agency,SunLine Transit
48,recJ7n1lBE4M2gOkb,"As of December 15, Bay Area 511 Emery Go-Round...",Outreach,136,,2023-12-15T20:01:32.000Z,No,1,1,OK,...,2024-01-25T20:02:39.000Z,1,2024,As of 12/19: Customer success sent an email to...,,Waiting on MTC 511,2024-01-25,GTFS Realtime Completeness Problem,Emeryville Transportation Management Agency,Emery Go-Round
57,recLBkzI5I1xPwAVA,Transit Joint Powers Authority for Merced Coun...,Outreach,19,,2023-05-25T00:16:26.000Z,No,1,1,OK,...,2023-12-20T21:04:55.000Z,12,2023,Agency says they're working on resolving issue...,2024-01-18,Waiting on Customer Success,,GTFS Realtime Completeness Problem,Transit Joint Powers Authority for Merced County,Merced The Bus
65,recOuR2jF8O9E2yXS,"As of January 5, Desert Roadrunner GMV Schedul...",Outreach,153,,2024-01-05T20:44:25.000Z,No,1,1,OK,...,2024-01-17T19:06:07.000Z,1,2024,As of 1/9: Laney sent an email and they replie...,2024-01-23,Waiting on Transit Agency,,GTFS Realtime Completeness Problem,Palo Verde Valley Transit Agency,Desert Roadrunner
72,recRH2FM76ubHMf4w,"As of December 6, Bay Area 511 ACE GTFS-RT has...",On Hold,135,,2023-12-15T19:56:43.000Z,,1,1,INVALID: Should Wait Until must be non-empty f...,...,2023-12-29T00:19:59.000Z,12,2023,As of 12/15: Evan said the data might not be p...,,,,GTFS Realtime Completeness Problem,San Joaquin Regional Rail Commission,Altamont Corridor Express
87,recVkEuYMgDH0PNVW,As of 9/29: only 15.74% of trips have trip upd...,False Positive,115,,2023-09-29T15:56:12.000Z,,1,1,OK,...,2023-10-11T20:48:58.000Z,10,2023,,,,,GTFS Realtime Completeness Problem,City of San Luis Obispo,SLO Transit
93,recYF4t5SnYCtQ0pJ,GTFS-Realtime Vehicle Position data production...,Outreach,131,,2023-12-01T18:53:40.000Z,No,1,1,OK,...,2024-01-25T20:00:33.000Z,1,2024,As of 12/1: only 39.86% of trips are producing...,,Waiting on MTC 511,2024-01-25,GTFS Realtime Completeness Problem,Napa Valley Transportation Authority,Vine Transit
99,reca5nOucWflhCnHz,As of 11/16: Only 39% of trips have vehicle po...,Outreach,125,,2023-11-16T18:32:30.000Z,No,1,1,OK,...,2024-01-16T21:53:58.000Z,1,2024,"As of 12/5: Amanda initially forgot, but now c...",2024-01-23,Waiting on Transit Data Quality Team,,GTFS Realtime Completeness Problem,City of Solvang,Santa Ynez Valley Transit
117,reckFE1qICQWONhDN,Since July 3 there have been zero trips accoun...,Outreach,68,,2023-07-20T16:25:32.000Z,No,1,1,OK,...,2024-01-25T19:59:22.000Z,1,2024,As of 7/20: Sent email to Nisar Ahmed\n\nAs of...,,Waiting on MTC 511,2024-01-25,GTFS Realtime Completeness Problem,Mountain View Transportation Management Associ...,MVGO


In [11]:
df['Issue_Type'].value_counts()

Los Angeles Metro feed transition                20
GTFS Realtime Completeness Problem               10
About to Expire Schedule Feed                     6
Trip Planner GTFS Schedule Assistance             4
Missing GTFS Schedule Feed (non-NTD Reporter)     3
New GTFS Realtime System Setup                    3
Unstable URL (GTFS Schedule)                      1
GTFS Realtime API Access                          1
Expiring feed maintained by Cal-ITP               1
Service Accuracy                                  1
Other GTFS Realtime issue                         1
Trip Planner GTFS Realtime Assistance             1
Missing GTFS Schedule Feed (NTD Reporter)         1
Name: Issue_Type, dtype: int64
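
A possible next step is to break these counts down by `Outreach Status` to see who each open issue is waiting on. The sketch below uses made-up rows. The column names match the table, while the `(none)` placeholder for a missing status is an illustrative choice.

```python
import pandas as pd

# Toy stand-in for the unresolved-issues frame.
toy = pd.DataFrame(
    {
        "Issue_Type": [
            "GTFS Realtime Completeness Problem",
            "GTFS Realtime Completeness Problem",
            "About to Expire Schedule Feed",
            "GTFS Realtime Completeness Problem",
        ],
        "Outreach Status": [
            "Waiting on MTC 511",
            None,
            "Waiting on Transit Agency",
            "Waiting on MTC 511",
        ],
    }
)

# Fill missing statuses first so those rows are not dropped by groupby.
counts = (
    toy.assign(**{"Outreach Status": toy["Outreach Status"].fillna("(none)")})
    .groupby(["Issue_Type", "Outreach Status"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```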