# Related Issues

## [Research Request - Transit Agency Peer Groups subtask: NTD Characteristics #1442](https://github.com/cal-itp/data-analyses/issues/1442)


## Via Juan Matute, email 5/29/2025
In a more advanced version, operators would be clustered into groups of 10 or more based on: 

- mode of service, 
- vehicles available, 
- population density of service territory, 
- job density of service territory, and, perhaps, 
- service area overlap with other transit operators (a GTFS spatial analysis exercise).  
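A minimal sketch of how such a grouping might look with Ward-linkage `AgglomerativeClustering` (the estimator this notebook imports). Everything specific here is an assumption for illustration: the feature values are randomly generated, `cluster_with_min_size` is a hypothetical helper, and the rule "lower the number of clusters until every group has at least 10 members" is only one way to honor the group-size requirement. Mode of service is categorical and is left out; one option would be to run the clustering separately within each mode.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per operator (all values are made up).
rng = np.random.default_rng(0)
n_operators = 60
features = np.column_stack([
    rng.lognormal(mean=3.0, sigma=1.0, size=n_operators),  # vehicles available
    rng.lognormal(mean=7.0, sigma=0.5, size=n_operators),  # population density
    rng.lognormal(mean=6.0, sigma=0.5, size=n_operators),  # job density
])

# Log-transform the skewed values, then standardize so no feature dominates.
scaled = StandardScaler().fit_transform(np.log1p(features))

MIN_CLUSTER_SIZE = 10


def cluster_with_min_size(X, min_size, max_clusters=6):
    """Try progressively fewer clusters until the smallest has >= min_size members."""
    for k in range(max_clusters, 1, -1):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        if np.bincount(labels).min() >= min_size:
            return labels
    # Fall back to a single group if no split satisfies the size constraint.
    return np.zeros(len(X), dtype=int)


labels = cluster_with_min_size(scaled, MIN_CLUSTER_SIZE)
cluster_sizes = np.bincount(labels)
print(cluster_sizes)
```

Ward linkage tends to produce compact, similarly sized groups, which makes the minimum-size constraint easier to satisfy than with single or complete linkage.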

An agency scoring in the bottom 1 or 2 of its cluster would get some remedial help in their triennial audit. Or face consolidation (FWIW, I like the BC Transit model for consolidation starting in 1979).
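The "bottom 1 or 2 of the cluster" rule can be sketched as follows. The agency names, scores, and cluster assignments are placeholders, `flag_bottom_performers` is a hypothetical helper, and the sketch assumes a single composite score where higher is better; the email does not specify how agencies would be scored.

```python
from collections import defaultdict


def flag_bottom_performers(scores, clusters, n_bottom=2):
    """Return the agencies with the n_bottom lowest scores in each cluster."""
    by_cluster = defaultdict(list)
    for agency, score in scores.items():
        by_cluster[clusters[agency]].append((score, agency))

    flagged = set()
    for members in by_cluster.values():
        # Sort ascending by score so the weakest performers come first.
        flagged.update(agency for _, agency in sorted(members)[:n_bottom])
    return flagged


# Placeholder data: agency -> composite score, agency -> cluster label.
scores = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.2, "E": 0.8, "F": 0.3}
clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

flagged = flag_bottom_performers(scores, clusters, n_bottom=1)
print(sorted(flagged))
```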

Several large transit operators, especially those operating rail, wouldn't be candidates for consolidation and wouldn't fit this clustering method. They would instead rely on a triennial audit, where I would expect trends over time for GTFS-RT quality, customer experience metrics (Transit App surveys or mystery shops), and several of these metrics to be considered holistically.

And perhaps agency costs would be adjusted for the regional consumer price index maintained by the California Department of Industrial Relations. Either that or they'd just be clustered with regional peers.
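One way the cost adjustment could work is to scale each agency's cost by the ratio of a reference index to its regional index. This is a sketch only: the region names and index values below are placeholders, not published figures, and `adjust_for_regional_cpi` is a hypothetical helper.

```python
def adjust_for_regional_cpi(cost, regional_cpi, reference_cpi):
    """Express a cost in reference-region dollars by scaling with the CPI ratio."""
    if regional_cpi <= 0:
        raise ValueError("regional_cpi must be positive")
    return cost * reference_cpi / regional_cpi


# Placeholder index values for illustration only.
hypothetical_cpi = {"Region X": 330.0, "Region Y": 300.0}
reference_cpi = 315.0

# The same nominal cost becomes smaller in a high-cost region
# and larger in a low-cost region after adjustment.
adjusted = {
    region: adjust_for_regional_cpi(1_000_000, cpi, reference_cpi)
    for region, cpi in hypothetical_cpi.items()
}
print(adjusted)
```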


## [Transit Agency Peer Groups literature review #1562](https://github.com/cal-itp/data-analyses/issues/1562)

Link to literature review document (requires SharePoint access): https://caltrans.sharepoint.com/:w:/r/sites/DOTPMPHQ-DDSContractors/_layouts/15/Doc.aspx?sourcedoc=%7B61CE5D08-BDAC-4947-ADE3-59CA472CF679%7D&file=transit_peer_groups_lit_review.docx&action=default&mobileredirect=true

## [Exploratory clustering analysis with NTD data #1580](https://github.com/cal-itp/data-analyses/issues/1580)

This notebook

In [3]:
# scikit-learn imports for clustering
from sklearn.cluster import AgglomerativeClustering # has linkage arg for "ward"
import pandas as pd

In [4]:
# new imports to query warehouse using SQLAlchemy
from calitp_data_analysis.tables import tbls
from calitp_data_analysis.sql import get_engine

db_engine = get_engine()

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [None]:
# example syntax for new query method
# (time_series_by_mode_opexp_columns and year_list are defined in later cells)
with db_engine.connect() as connection:
    query = f"""
        SELECT {','.join(time_series_by_mode_opexp_columns)}
        FROM cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_opexp_total
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND opexp_total IS NOT NULL
    """
    op_total = pd.read_sql(query, connection, params={'uza_match': '%, CA%', 'years': year_list})

op_total.info()

In [15]:
year_list = [2018, 2019, 2020, 2021, 2022, 2023]

fct_service_base_columns = [
    'source_name',
    'agency_status',
    'source_city',
    'mode',
    'service',
    'ntd_id',
    'reporter_type',
    'reporting_module',
    'source_state',
    'primary_uza_name',
    'year',
    "uza_area_sq_miles",
    "uza_population"   
]

In [None]:
time_series_by_mode_opexp_columns = fct_service_base_columns + ['opexp_total']

with db_engine.connect() as connection:
    query = f"""
        SELECT {','.join(time_series_by_mode_opexp_columns)}
        FROM cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_opexp_total
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND opexp_total IS NOT NULL
    """
    op_total = pd.read_sql(query, connection, params={'uza_match': '%, CA%', 'years': year_list})

op_total.info()

In [None]:
time_series_by_mode_upt_columns = fct_service_base_columns + ['upt']
# mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt

with db_engine.connect() as connection:
    query = f"""
        SELECT {','.join(time_series_by_mode_upt_columns)}
        FROM cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND upt IS NOT NULL
    """
    mode_upt = pd.read_sql(query, connection, params={'uza_match': '%, CA%', 'years': year_list})

mode_upt.info()

In [None]:
time_series_by_mode_vrh_columns = fct_service_base_columns + ['vrh']
# mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrh

with db_engine.connect() as connection:
    query = f"""
        SELECT {','.join(time_series_by_mode_vrh_columns)}
        FROM cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrh
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND vrh IS NOT NULL
    """
    mode_vrh = pd.read_sql(query, connection, params={'uza_match': '%, CA%', 'years': year_list})

mode_vrh.info()

In [None]:
time_series_by_mode_vrm_columns = fct_service_base_columns + ['vrm']
# mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrm

with db_engine.connect() as connection:
    query = f"""
        SELECT {','.join(time_series_by_mode_vrm_columns)}
        FROM cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrm
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND vrm IS NOT NULL
    """
    mode_vrm = pd.read_sql(query, connection, params={'uza_match': '%, CA%', 'years': year_list})

mode_vrm.info()
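The four query cells above differ only in the metric column and table suffix, so the SQL could be generated by one helper. This is a sketch, not part of the notebook: `build_metric_query`, `BASE_TABLE`, and `METRIC_TABLE_SUFFIX` are hypothetical names. The metric is checked against a fixed mapping because it is interpolated directly into the SQL text rather than passed as a bound parameter.

```python
BASE_TABLE = (
    "cal-itp-data-infra.mart_ntd_funding_and_expenses."
    "fct_service_data_and_operating_expenses_time_series_by_mode_"
)

# Table suffix for each metric column, taken from the four queries above.
METRIC_TABLE_SUFFIX = {
    "opexp_total": "opexp_total",
    "upt": "upt",
    "vrh": "vrh",
    "vrm": "vrm",
}


def build_metric_query(metric, base_columns):
    """Build the SELECT statement for one metric table."""
    if metric not in METRIC_TABLE_SUFFIX:
        raise ValueError(f"unknown metric: {metric!r}")
    columns = list(base_columns) + [metric]
    return f"""
        SELECT {', '.join(columns)}
        FROM {BASE_TABLE}{METRIC_TABLE_SUFFIX[metric]}
        WHERE source_state = 'CA'
         AND primary_uza_name LIKE %(uza_match)s
         AND year IN UNNEST(%(years)s)
         AND {metric} IS NOT NULL
    """


# Shortened column list for the demonstration.
query = build_metric_query("vrh", ["ntd_id", "year"])
print(query)
```

Each result would then be read as before, for example `pd.read_sql(build_metric_query("upt", fct_service_base_columns), connection, params={'uza_match': '%, CA%', 'years': year_list})`.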

## Reading in cleaned data from `transit_performance_metrics`
This dataset merges the `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_..` tables queried above.
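How the merge was actually performed is not shown in this notebook. One plausible approach, sketched here with toy stand-ins for the four per-metric tables, is to join them on their shared key columns. The key list is shortened for brevity, the IDs and values are made up, and `merge_metric_tables` is a hypothetical helper.

```python
from functools import reduce

import pandas as pd

# Key columns shared by every per-metric table (a subset, for brevity).
KEY_COLUMNS = ["ntd_id", "mode", "service", "year"]

# Toy stand-ins for op_total, mode_upt, mode_vrh and mode_vrm; values are made up.
keys = {"ntd_id": ["00001", "00002"], "mode": ["MB", "MB"],
        "service": ["DO", "DO"], "year": [2022, 2022]}
op_total = pd.DataFrame({**keys, "opexp_total": [500_000, 750_000]})
mode_upt = pd.DataFrame({**keys, "upt": [40_000, 60_000]})
mode_vrh = pd.DataFrame({**keys, "vrh": [8_000, 9_500]})
# The second agency has no vrm record, so an inner join drops it.
mode_vrm = pd.DataFrame({**keys, "vrm": [120_000, 150_000]}).iloc[:1]


def merge_metric_tables(tables, key_columns):
    """Inner-join a list of DataFrames on the shared key columns."""
    return reduce(
        lambda left, right: left.merge(right, on=key_columns, how="inner"),
        tables,
    )


merged = merge_metric_tables([op_total, mode_upt, mode_vrh, mode_vrm], KEY_COLUMNS)
print(merged)
```

An inner join keeps only rows present in all four tables. Passing `indicator=True` to `DataFrame.merge` adds a `_merge` column that records where each row came from, which is useful with `how="outer"` for spotting rows missing from one side.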

In [11]:
df = pd.read_parquet("gs://calitp-analytics-data/data-analyses/ntd/raw_transit_performance_metrics_data.parquet")

In [12]:
display(
    df.info(),
    df.describe(),
    df["year"].value_counts()
)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2091 entries, 0 to 2090
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   agency_name       2091 non-null   object  
 1   agency_status     2091 non-null   object  
 2   city              2091 non-null   object  
 3   mode              2091 non-null   object  
 4   service           2091 non-null   object  
 5   ntd_id            2091 non-null   object  
 6   reporter_type     2091 non-null   object  
 7   reporting_module  2091 non-null   object  
 8   state             2091 non-null   object  
 9   primary_uza_name  2091 non-null   object  
 10  year              2091 non-null   object  
 11  upt               2091 non-null   int64   
 12  vrh               2091 non-null   int64   
 13  vrm               2091 non-null   int64   
 14  opexp_total       2091 non-null   int64   
 15  RTPA              2091 non-null   object  
 16  _merge            2091 n

None

| | upt | vrh | vrm | opexp_total |
|---|---|---|---|---|
| count | 2091.0 | 2091.0 | 2091.0 | 2091.0 |
| mean | 2660662.0 | 116168.0 | 1754771.0 | 22233940.0 |
| std | 14057870.0 | 393480.5 | 5856907.0 | 86221740.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18879.0 | 5847.5 | 65869.5 | 585345.5 |
| 50% | 79613.0 | 17504.0 | 258397.0 | 1841343.0 |
| 75% | 541719.0 | 66069.0 | 1021278.0 | 8832687.0 |
| max | 260902200.0 | 6341989.0 | 83783820.0 | 1355086000.0 |


2019    358
2020    355
2018    348
2021    347
2022    345
2023    338
Name: year, dtype: int64

In [14]:
df[df["city"].str.contains("Sacramento")]["agency_name"].unique()

array(['Paratransit, Inc.', 'Sacramento Regional Transit District',
       'County of Sacramento Municipal Services Agency (SCT Link) - Department of Transportation'],
      dtype=object)