# Options for querying `mart_gtfs_rollup` tables

* There are several ways to query these tables; which one feels the most ergonomic?
* Favor simple, pandas-like syntax.
* The options are sqlalchemy, `pd.read_sql`, `pandas_gbq.read_gbq`, and the `google.cloud.bigquery` client.
* Edge cases: tables with array columns, or cases where speed matters more than readable syntax, like `fct_vehicle_locations_path`, need a query method that returns the correct data types. For those, use the `google.cloud.bigquery` client.
* How should we handle the fact that gdfs still need to be created?


## caching data
Do we have to cache things as a backup? See [notes on this](./lightweight_pipeline_notes.md). Maybe we should save a version and make it available to download on the page.

* If we end up caching things anyway, why hit the warehouse for each page? We could download 1 parquet (getting as close as possible to zero lines of transformation code) and filter that parquet instead.
* Can we get to a query -> filter to operator -> visualize step? That could make more effective use of the cluster columns.
   * Prefer this if it feels simple, not if it feels confusing.
   * Warehouse tables have a lot of columns now. If selecting all the columns makes the download too large (>125 MB), `pandas_gbq` will use GCS to store the result anyway, so we might as well download our own parquet. Why query each operator and save out its own tiny GCS file when we can have 1 parquet and filter down (more cost-effective and performant with pyarrow filters)?
   * We do not want the deploy process to take forever simply because the downloaded datasets are larger than what gets visualized and the extra columns crowd the tooltips.

In [1]:
import datetime
import google.auth
import pandas as pd
import pandas_gbq

from update_vars import GTFS_DATA_DICT

credentials, project = google.auth.default()
ROLLUP_DICT = GTFS_DATA_DICT.gtfs_digest_rollup

How should the `project`, `dataset_name`, and `table_name` components be structured in `gtfs_analytics_data.yml`?
* `project` and `dataset_name` move together, especially if we want to pull from staging, so these can't be hard-coded.
* For staging: `cal-itp-data-infra-staging` + `<name>_mart_gtfs_rollup`.
* `table_name` will be more consistent, though not entirely while we're iterating on this.

The SQL query needs backticks around each of `project`, `dataset_name`, and `table_name` to pull the correct path.
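
One possible way to keep `project` and `dataset_name` paired while `table_name` varies is sketched below. The `ENVIRONMENTS` mapping, `TableRef`, and `resolve_table` are hypothetical names, not anything that exists in `gtfs_analytics_data.yml` today; the project and dataset strings are the ones used elsewhere in this notebook.

```python
from dataclasses import dataclass

# (project, dataset_name) pairs move together. The staging dataset name is the
# one used later in this notebook and would differ per developer.
ENVIRONMENTS = {
    "prod": ("cal-itp-data-infra", "mart_gtfs_rollup"),
    "staging": ("cal-itp-data-infra-staging", "tiffany_mart_gtfs_rollup"),
}


@dataclass(frozen=True)
class TableRef:
    project: str
    dataset_name: str
    table_name: str

    @property
    def sql_path(self) -> str:
        """Backtick-quoted path for use in a FROM clause."""
        return f"`{self.project}`.`{self.dataset_name}`.`{self.table_name}`"


def resolve_table(table_name: str, env: str = "prod") -> TableRef:
    """Look up the project/dataset pair for an environment and attach the table."""
    project, dataset_name = ENVIRONMENTS[env]
    return TableRef(project, dataset_name, table_name)


prod_ref = resolve_table("fct_monthly_schedule_route_direction_summary")
staging_ref = resolve_table(
    "fct_monthly_schedule_route_direction_summary", env="staging"
)
print(prod_ref.sql_path)
print(staging_ref.sql_path)
```

With this shape, the yml would only need to store table names, and switching to staging becomes a single argument.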

In [2]:
def basic_sql_query(
    project,
    filename,
    start_date = "2025-01-01"
):
    dataset_name = filename.split(".")[0]
    table_name = filename.split(".")[1]
    
    query_sql = f"""
        SELECT 
            *
        FROM `{project}`.`{dataset_name}`.`{table_name}`
        WHERE month_first_day >=  DATE('{start_date}')
    """
    return query_sql

def download_with_pandas_gbq(
    project = "cal-itp-data-infra",
    filename: str = "",
):
    start = datetime.datetime.now()

    query_sql_statement = basic_sql_query(project, filename)
    print(query_sql_statement)
    
    df = pandas_gbq.read_gbq(
        query_sql_statement, 
        project_id = project,
        dialect = "standard",
        credentials = credentials
    ).astype({"month_first_day": "datetime64[ns]"}) 
    # cast the date column; otherwise its dtype is dbdate, which causes an error when you read the data in
    
    end = datetime.datetime.now()
    
    print(f"download time: {end - start}")
    
    #df.to_parquet(engine='pyarrow') must have engine!
    
    return df

def quick_look(df):
    print(df.shape)
    print(df.dtypes)
    display(df.head(10))
    return

### most pythonic option: `pandas_gbq`, with our credentials injected

In [3]:
# very pythonic, all the components are really easy, and we don't necessarily 
# need the with....as db.connect() followed by indentation.
# pull Jan 2025 - Oct 2025 is 33 seconds
df = download_with_pandas_gbq(
    project = "cal-itp-data-infra",
    filename = ROLLUP_DICT.schedule_route_direction
)


quick_look(df)


        SELECT 
            *
        FROM `cal-itp-data-infra`.`mart_gtfs_rollup`.`fct_monthly_schedule_route_direction_summary`
        WHERE month_first_day >=  DATE('2025-01-01')
    




Downloading: 100%|██████████|
download time: 0:00:34.554615
(153338, 32)
name                                    object
month_first_day                 datetime64[ns]
month                                    Int64
year                                     Int64
day_type                                object
route_name                              object
direction_id                             Int64
route_type                              object
route_color                             object
route_typology                          object
daily_trips_all_day                    float64
daily_stop_arrivals_all_day            float64
daily_distinct_stops_all_day           float64
frequency_all_day                      float64
daily_service_hours                    float64
daily_flex_service_hours               float64
daily_trips_owl                        float64
daily_trips_early_am                   float64
daily_trips_am_peak                    float64
daily_trips_midday       

Unnamed: 0,name,month_first_day,month,year,day_type,route_name,direction_id,route_type,route_color,route_typology,...,daily_trips_peak,daily_trips_offpeak,frequency_owl,frequency_early_am,frequency_am_peak,frequency_midday,frequency_pm_peak,frequency_evening,frequency_peak,frequency_offpeak
0,Sacramento Schedule,2025-09-01,9,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
1,Sacramento Schedule,2025-02-01,2,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
2,Sacramento Schedule,2025-06-01,6,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
3,Sacramento Schedule,2025-01-01,1,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
4,Sacramento Schedule,2025-04-01,4,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
5,Sacramento Schedule,2025-08-01,8,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
6,Sacramento Schedule,2025-07-01,7,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
7,Sacramento Schedule,2025-03-01,3,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
8,Sacramento Schedule,2025-10-01,10,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
9,Sacramento Schedule,2025-05-01,5,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,,0.2,,,


In [4]:
# can pull from staging
# pull Jan 2025 - Oct 2025 is 33 seconds
df = download_with_pandas_gbq(
    project = "cal-itp-data-infra-staging",
    filename = "tiffany_mart_gtfs_rollup.fct_monthly_schedule_route_direction_summary"
)

quick_look(df)


        SELECT 
            *
        FROM `cal-itp-data-infra-staging`.`tiffany_mart_gtfs_rollup`.`fct_monthly_schedule_route_direction_summary`
        WHERE month_first_day >=  DATE('2025-01-01')
    
Downloading: 100%|██████████|
download time: 0:00:31.142201
(153338, 32)
name                                    object
month_first_day                 datetime64[ns]
month                                    Int64
year                                     Int64
day_type                                object
route_name                              object
direction_id                             Int64
route_type                              object
route_color                             object
route_typology                          object
daily_trips_all_day                    float64
daily_stop_arrivals_all_day            float64
daily_distinct_stops_all_day           float64
frequency_all_day                      float64
daily_service_hours                    float64
daily_fl

Unnamed: 0,name,month_first_day,month,year,day_type,route_name,direction_id,route_type,route_color,route_typology,...,daily_trips_peak,daily_trips_offpeak,frequency_owl,frequency_early_am,frequency_am_peak,frequency_midday,frequency_pm_peak,frequency_evening,frequency_peak,frequency_offpeak
0,Sacramento Schedule,2025-08-01,8,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
1,Sacramento Schedule,2025-03-01,3,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
2,Sacramento Schedule,2025-05-01,5,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,,0.2,,,
3,Sacramento Schedule,2025-09-01,9,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
4,Sacramento Schedule,2025-04-01,4,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
5,Sacramento Schedule,2025-07-01,7,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
6,Sacramento Schedule,2025-01-01,1,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
7,Sacramento Schedule,2025-06-01,6,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
8,Sacramento Schedule,2025-10-01,10,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
9,Sacramento Schedule,2025-02-01,2,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,


In [5]:
df = download_with_pandas_gbq(
    project = "cal-itp-data-infra-staging",
    filename = "tiffany_mart_gtfs_rollup.fct_monthly_schedule_route_direction_summary"
)

quick_look(df)


        SELECT 
            *
        FROM `cal-itp-data-infra-staging`.`tiffany_mart_gtfs_rollup`.`fct_monthly_schedule_route_direction_summary`
        WHERE month_first_day >=  DATE('2025-01-01')
    
Downloading: 100%|██████████|
download time: 0:00:32.073343
(153338, 32)
name                                    object
month_first_day                 datetime64[ns]
month                                    Int64
year                                     Int64
day_type                                object
route_name                              object
direction_id                             Int64
route_type                              object
route_color                             object
route_typology                          object
daily_trips_all_day                    float64
daily_stop_arrivals_all_day            float64
daily_distinct_stops_all_day           float64
frequency_all_day                      float64
daily_service_hours                    float64
daily_fl

Unnamed: 0,name,month_first_day,month,year,day_type,route_name,direction_id,route_type,route_color,route_typology,...,daily_trips_peak,daily_trips_offpeak,frequency_owl,frequency_early_am,frequency_am_peak,frequency_midday,frequency_pm_peak,frequency_evening,frequency_peak,frequency_offpeak
0,Sacramento Schedule,2025-08-01,8,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
1,Sacramento Schedule,2025-03-01,3,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
2,Sacramento Schedule,2025-05-01,5,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,,0.2,,,
3,Sacramento Schedule,2025-09-01,9,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
4,Sacramento Schedule,2025-04-01,4,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
5,Sacramento Schedule,2025-07-01,7,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
6,Sacramento Schedule,2025-01-01,1,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
7,Sacramento Schedule,2025-06-01,6,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
8,Sacramento Schedule,2025-10-01,10,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,
9,Sacramento Schedule,2025-02-01,2,2025,Weekday,020__20 FSL Route 20,1,3,0000FF,bus,...,,,,,,0.2,0.2,,,


In [6]:
# download vp path for 1 day, 25 seconds
yesterday = str(datetime.date.today() - datetime.timedelta(days=1))

vp_query_statement = f"""
    SELECT *
    FROM `cal-itp-data-infra.mart_gtfs.fct_vehicle_locations_path`
    WHERE service_date >=  DATE('{yesterday}')
"""
print(vp_query_statement)

start = datetime.datetime.now()

df  = pandas_gbq.read_gbq(
    vp_query_statement, 
    project_id = "cal-itp-data-infra",
    dialect = "standard",
    credentials = credentials
)

end = datetime.datetime.now()
    
print(f"download time: {end - start}")


    SELECT *
    FROM `cal-itp-data-infra.mart_gtfs.fct_vehicle_locations_path`
    WHERE service_date >=  DATE('2025-11-20')

Downloading: 100%|██████████|
download time: 0:00:24.357307


In [7]:
df.dtypes

gtfs_dataset_key              object
gtfs_dataset_name             object
base64_url                    object
service_date                  dbdate
schedule_gtfs_dataset_key     object
schedule_name                 object
schedule_feed_key             object
schedule_base64_url           object
trip_id                       object
trip_instance_key             object
pt_array                      object
location_timestamp_pacific    object
pacific_seconds               object
n_vp                           Int64
dtype: object

### edge case for extra speed

Use the client that underlies `pandas_gbq`; it is much quicker. Use this option if download times ever get too long, though the syntax is a little harder to read.

In [8]:
# pandas_gbq handles arrays just fine
# but if time gets to be an issue, like downloading multiple days, use this method 
# this is what's under-the-hood of pandas_gbq
# 3 seconds
from google.cloud import bigquery

client = bigquery.Client()

start = datetime.datetime.now()

query_job = client.query(vp_query_statement)
df = query_job.result().to_dataframe()

end = datetime.datetime.now()
    
print(f"download time: {end - start}")

download time: 0:00:03.775510
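
The start/end timing pattern repeated in the cells above could be wrapped in a small context manager so the download methods are timed the same way. This is only a sketch: `timed` and the `timings` dict are hypothetical helpers, and the `time.sleep` call stands in for the actual download.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record elapsed wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label} download time: {results[label]:.2f}s")


timings = {}
with timed("placeholder", timings):
    # stand-in for pandas_gbq.read_gbq(...) or client.query(...).result().to_dataframe()
    time.sleep(0.01)
```

This only measures wall-clock time; it does not explain why one path is faster than another.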


### sqlalchemy

Haven't gotten this to work yet. How often does this sync? It seems a bit outdated, since some tables in the production warehouse aren't showing up.

In [9]:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
import os

CALITP_BQ_MAX_BYTES = os.environ.get("CALITP_BQ_MAX_BYTES", 5_000_000_000)
CALITP_BQ_LOCATION = os.environ.get("CALITP_BQ_LOCATION", "us-west2")

CALITP_PROJECT = "cal-itp-data-infra"
ROLLUP_DATASET = "mart_gtfs_rollup"

db_engine = create_engine(
    f"bigquery://{CALITP_PROJECT}/{ROLLUP_DATASET}?maximum_bytes_billed={CALITP_BQ_MAX_BYTES}",  # noqa: E231
    location=CALITP_BQ_LOCATION,
    credentials_path= os.environ.get("CALITP_SERVICE_KEY_PATH"),
)

In [10]:
# why are these the names? there are more tables than this
db_engine.table_names()



['fct_monthly_routes',
 'fct_monthly_schedule_route_direction_summary',
 'fct_monthly_scheduled_stops',
 'fct_monthly_scheduled_trips']
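
`Engine.table_names()` is deprecated as of SQLAlchemy 1.4 and removed in 2.0; the inspector is its replacement. A minimal sketch follows, using an in-memory SQLite engine as a stand-in for the BigQuery engine. The two table names are copied from the output above and the `id` column is made up.

```python
from sqlalchemy import create_engine, inspect, text

# In-memory SQLite stands in for the BigQuery engine in this sketch.
engine = create_engine("sqlite://")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE fct_monthly_routes (id INTEGER)"))
    conn.execute(text("CREATE TABLE fct_monthly_scheduled_trips (id INTEGER)"))

# The inspector replaces the deprecated Engine.table_names().
table_names = sorted(inspect(engine).get_table_names())
print(table_names)
```

This lists whatever the dialect reports for the dataset in the connection URL, so on its own it would not explain why some warehouse tables are missing.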

In [11]:
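# NOTE: this call fails with the BadRequest shown below, because
# pd.read_sql_table expects a table name, not a SQL string.
# pd.read_sql_query (or pd.read_sql) is the function that accepts a query.
with db_engine.connect() as conn, conn.begin():

    data = pd.read_sql_table("SELECT * FROM fct_monthly_scheduled_trips WHERE month_first_day >= DATE('2025-01-01')", conn)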

BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/cal-itp-data-infra/datasets/mart_gtfs_rollup/tables/SELECT%20*%20FROM%20fct_monthly_scheduled_trips%20WHERE%20month_first_day%20%3E=%20DATE('2025-01-01')?prettyPrint=false: Invalid table ID "SELECT * FROM fct_monthly_scheduled_trips WHERE month_first_day >= DATE('2025-01-01')".
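
The 400 error above comes from passing a SQL string to `pd.read_sql_table`, which treats its first argument as a table name. A minimal sketch of the query-based call is below. It uses an in-memory SQLite database as a stand-in for BigQuery, the rows are made up, and the date filter is a bound parameter rather than BigQuery's `DATE('...')`.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the warehouse; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_monthly_scheduled_trips (name TEXT, month_first_day TEXT)"
)
conn.executemany(
    "INSERT INTO fct_monthly_scheduled_trips VALUES (?, ?)",
    [
        ("Operator A Schedule", "2024-12-01"),
        ("Operator A Schedule", "2025-01-01"),
        ("Operator B Schedule", "2025-02-01"),
    ],
)
conn.commit()

# read_sql_query accepts a SQL string; read_sql_table would not.
data = pd.read_sql_query(
    "SELECT * FROM fct_monthly_scheduled_trips WHERE month_first_day >= ?",
    conn,
    params=("2025-01-01",),
    parse_dates=["month_first_day"],
)
print(data)
conn.close()
```

With the BigQuery engine, the same pattern would presumably be `pd.read_sql_query(query, conn)` with the original query string, though that has not been run here.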

In [None]:
def download_with_pandas(
    project = "cal-itp-data-infra",
    filename: str = "",
):
    start = datetime.datetime.now()

    dataset = filename.split(".")[0]
    table_name = filename.split(".")[1]
    
    conn = create_engine(
        f"bigquery://{project}/{dataset}?maximum_bytes_billed={CALITP_BQ_MAX_BYTES}",  # noqa: E231
        location=CALITP_BQ_LOCATION,
        credentials_path= os.environ.get("CALITP_SERVICE_KEY_PATH"),
    )
    
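    # the date must be wrapped in DATE('...'); a bare 2025-01-01 is read as
    # integer subtraction, not a date
    df = pd.read_sql(
        f"SELECT * FROM {table_name} WHERE month_first_day >= DATE('2025-01-01')", conn
    ).astype({"month_first_day": "datetime64[ns]"})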
    
    end = datetime.datetime.now()
    
    print(f"download time: {end - start}")
    
    return df

In [None]:
download_with_pandas(project="cal-itp-data-infra", filename = ROLLUP_DICT.schedule_rt_route_direction)