# [Refine Use of Explored NTD Variables #1646](https://github.com/cal-itp/data-analyses/issues/1646)

Question or Goal:

1. Mode & Service: We previously used these to define the grain of the data. These should be used for classifying/model-fitting.
    - There are many modes: Can we group modes so there are fewer of them (e.g. fixed-guideway vs not)? Too many dummy variables for a category leads to overfitting/too many clusters.
    - What's the best way to use Service? As a dummy variable, or as a numeric proportion of service (VRH or VRM) that is directly operated? Explore the impact of both.


2. If we group by only Agency (flattening, fewer rows), how do we aggregate the other classification variables before we normalize them? Is adding sufficient? Do we need to average any?


3. Are there ways to get more interactive/easy-to-read visualizations?
    - If it takes significant time, break this one out

In [1]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

## read in data

In [2]:
transit_metrics = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/ntd/raw_transit_performance_metrics_data.parquet"
)

In [3]:
transit_metrics.head()
# grain: each row is agency, mode, service per year

Unnamed: 0,agency_name,agency_status,city,mode,service,ntd_id,reporter_type,reporting_module,state,primary_uza_name,year,upt,vrh,vrm,opexp_total,RTPA,_merge
0,City of Porterville (COLT) - Transit Department,Active,Porterville,Demand Response,Purchased Transportation,90198,Building Reporter,Urban,CA,"Porterville, CA",2019,13112,2997,43696,572799,Tulare County Association of Governments,both
1,City of Porterville (COLT) - Transit Department,Active,Porterville,Demand Response,Purchased Transportation,90198,Building Reporter,Urban,CA,"Porterville, CA",2020,11523,3669,48138,686165,Tulare County Association of Governments,both
2,City of Porterville (COLT) - Transit Department,Active,Porterville,Bus,Purchased Transportation,90198,Building Reporter,Urban,CA,"Porterville, CA",2018,635648,50140,700127,3460906,Tulare County Association of Governments,both
3,City of Porterville (COLT) - Transit Department,Active,Porterville,Bus,Purchased Transportation,90198,Building Reporter,Urban,CA,"Porterville, CA",2021,145215,21208,244230,2657959,Tulare County Association of Governments,both
4,City of Porterville (COLT) - Transit Department,Active,Porterville,Demand Response,Purchased Transportation,90198,Building Reporter,Urban,CA,"Porterville, CA",2021,29380,9565,126604,952031,Tulare County Association of Governments,both


## Get NTD data from PUC analysis
I queried data from `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_{metric}`


In [4]:
from calitp_data_analysis.sql import get_engine

In [33]:
db_engine = get_engine()

metric_list = [
    # "pmt",
    "upt",
    "vrh",
    "opexp_total"
]

# empty list for appending DFs
df_list = []
with db_engine.connect() as connection:
    for metric in metric_list:
        query = f"""
        SELECT
          ntd_id,
          source_agency,
          agency_status,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles,
          year,
          SUM({metric}) AS total_{metric},
        FROM
          `cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_{metric}`
        WHERE
          source_state = "CA"
          AND year >= 2018
        GROUP BY
          ntd_id,
          source_agency,
          agency_status,
          year,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles
        """
        # run the query for this metric (separate name so the loop variable is not overwritten)
        metric_df = pd.read_sql(query, connection)

        # append df to list
        df_list.append(metric_df)

# unpack list into separate DFs
ntd_upt, ntd_vrh, ntd_opex = df_list

In [34]:
merge_on_col=["ntd_id", "year","source_agency","agency_status","primary_uza_name","uza_population","uza_area_sq_miles"]
merge_1 = ntd_opex.merge(ntd_upt, on=merge_on_col, how = "inner")

ntd_metrics_merge = merge_1.merge(ntd_vrh, on=merge_on_col, how = "inner")
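
The chained merges above could also be written as a single reduction over the list of frames. This is a minimal sketch on made-up data, not the notebook's actual frames: the key list is shortened to `ntd_id` and `year`, and `validate="one_to_one"` is an added assumption that each key combination appears once per frame, so pandas raises an error if it does not.

```python
from functools import reduce

import pandas as pd

# Illustrative stand-ins for ntd_opex, ntd_upt and ntd_vrh (values are made up).
keys = ["ntd_id", "year"]
opex = pd.DataFrame({"ntd_id": ["A", "A", "B"], "year": [2018, 2019, 2018],
                     "total_opexp_total": [1.0, 2.0, 3.0]})
upt = pd.DataFrame({"ntd_id": ["A", "A", "B"], "year": [2018, 2019, 2018],
                    "total_upt": [10.0, 20.0, 30.0]})
vrh = pd.DataFrame({"ntd_id": ["A", "A", "B", "C"], "year": [2018, 2019, 2018, 2018],
                    "total_vrh": [5.0, 6.0, 7.0, 8.0]})

# Inner-merge every frame on the shared keys; validate guards against
# duplicated keys silently multiplying rows.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="inner", validate="one_to_one"),
    [opex, upt, vrh],
)
```

Because the merge is `inner`, agency `C` (present only in the `vrh` stand-in) is dropped from `merged`.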

In [35]:
ntd_metrics_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1614 entries, 0 to 1613
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ntd_id             1524 non-null   object 
 1   source_agency      1614 non-null   object 
 2   agency_status      1614 non-null   object 
 3   primary_uza_name   1614 non-null   object 
 4   uza_population     1614 non-null   int64  
 5   uza_area_sq_miles  1614 non-null   float64
 6   year               1614 non-null   int64  
 7   total_opexp_total  1291 non-null   float64
 8   total_upt          1291 non-null   float64
 9   total_vrh          1291 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 138.7+ KB


## Classify columns

In [6]:
# number of unique values for each column, and that count as a % of total rows
for col in transit_metrics.columns:
    print(f"{col}: {transit_metrics[col].nunique()} unique values. {round((transit_metrics[col].nunique()/len(transit_metrics))*100,2)} % of unique values to total rows")

agency_name: 168 unique values. 8.03 % of unique values to total rows
agency_status: 2 unique values. 0.1 % of unique values to total rows
city: 124 unique values. 5.93 % of unique values to total rows
mode: 15 unique values. 0.72 % of unique values to total rows
service: 4 unique values. 0.19 % of unique values to total rows
ntd_id: 168 unique values. 8.03 % of unique values to total rows
reporter_type: 4 unique values. 0.19 % of unique values to total rows
reporting_module: 2 unique values. 0.1 % of unique values to total rows
state: 1 unique values. 0.05 % of unique values to total rows
primary_uza_name: 49 unique values. 2.34 % of unique values to total rows
year: 6 unique values. 0.29 % of unique values to total rows
upt: 2036 unique values. 97.37 % of unique values to total rows
vrh: 2018 unique values. 96.51 % of unique values to total rows
vrm: 2046 unique values. 97.85 % of unique values to total rows
opexp_total: 2068 unique values. 98.9 % of unique values to total rows
RTPA:

In [7]:
id_cols = [ # exclude from clustering
    "agency_name",
    "city",
    "ntd_id",
    "state",
    "primary_uza_name",
    "RTPA",
    "reporter_type"
]

categorical_cols =[ # include in clustering
    "mode", # 15 unique values
    "service" # 4
]

numerical_cols = [ # include in clustering
    "upt",
    "vrh",
    "vrm",
    "opexp_total"
]

other_cols = [ # exclude from clustering
    "agency_status",
    "year",
    "reporting_module",
    "_merge"
]
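
A quick way to confirm that lists like these cover every column exactly once is to compare them against the frame's columns. This is a minimal sketch with a shortened, hypothetical column list standing in for `transit_metrics.columns`; the `groups` dictionary is illustrative, not part of the notebook.

```python
# Stand-in for list(transit_metrics.columns); shortened for illustration.
columns = ["agency_name", "mode", "upt", "year"]

# Hypothetical, shortened versions of the four classification lists.
groups = {
    "id_cols": ["agency_name"],
    "categorical_cols": ["mode"],
    "numerical_cols": ["upt"],
    "other_cols": ["year"],
}

classified = [col for cols in groups.values() for col in cols]

# Columns in the data that no list mentions.
unclassified = sorted(set(columns) - set(classified))
# Names in the lists that are not real columns (typos, fused strings, etc.).
unknown = sorted(set(classified) - set(columns))
# Names that appear in more than one list.
duplicated = sorted({col for col in classified if classified.count(col) > 1})
```

All three result lists should be empty when the classification is complete and consistent.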

## 2. Mode & Service

### Unique `Mode` sub-categories
Previous research papers categorized modes by whether or not they have a dedicated right-of-way ("fixed guideway").

In [8]:
list(transit_metrics["mode"].unique())

['Demand Response',
 'Bus',
 'Streetcar',
 'Heavy Rail',
 'Demand Response Taxi',
 'Commuter Bus',
 'Hybrid Rail',
 'Commuter Rail',
 'Vanpool',
 'Bus Rapid Transit',
 'Cable Car',
 'Light Rail',
 'Trolleybus',
 'Ferryboats',
 'Monorail / Automated Guideway']

In [9]:
fixed_guideway = [
    "Streetcar",
    "Heavy Rail",
    "Hybrid Rail",
    "Commuter Rail",
    "Cable Car",
    "Light Rail"
]

other =[
    "Trolleybus",
    "Ferryboats"
]

nonfixed_guideway = [
    "Demand Response",
    "Bus",
    "Demand Response Taxi",
    "Commuter Bus",
    "Vanpool",
    "Bus Rapid Transit",
    "Monorail / Automated Guideway"
]
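
One way to apply groupings like these is to invert them into a mode-to-group lookup and map it onto the `mode` column, which reduces 15 dummy variables to 3. This is a minimal sketch on a small made-up frame. The group labels and the `mode_group` column name are assumptions, the assignments mirror the lists above, and mode names are spelled as they appear in the data (for example `Cable Car`).

```python
import pandas as pd

# Group assignments mirroring the three lists above (labels are illustrative).
mode_groups = {
    "fixed_guideway": ["Streetcar", "Heavy Rail", "Hybrid Rail",
                       "Commuter Rail", "Cable Car", "Light Rail"],
    "other": ["Trolleybus", "Ferryboats"],
    "nonfixed_guideway": ["Demand Response", "Bus", "Demand Response Taxi",
                          "Commuter Bus", "Vanpool", "Bus Rapid Transit",
                          "Monorail / Automated Guideway"],
}

# Invert to {mode: group} so it can be used with Series.map.
mode_to_group = {mode: group for group, modes in mode_groups.items() for mode in modes}

# Small made-up sample standing in for transit_metrics.
sample = pd.DataFrame({"mode": ["Bus", "Cable Car", "Ferryboats", "Heavy Rail"]})
sample["mode_group"] = sample["mode"].map(mode_to_group)

# Any mode missing from the lookup maps to NaN; list those so nothing is lost silently.
unmapped = sample.loc[sample["mode_group"].isna(), "mode"].unique().tolist()

# One dummy column per group instead of one per mode.
mode_dummies = pd.get_dummies(sample["mode_group"], prefix="mode")
```

Checking `unmapped` matters because `Series.map` does not raise on a name that is misspelled in the lookup; it just returns `NaN`.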

### Unique `Service` values

In [10]:
transit_metrics["service"].value_counts()

Purchased Transportation                                     1480
Directly Operated                                             523
Purchased Transportation - Taxi                                74
Purchased Transportation - Transportation Network Company      14
Name: service, dtype: int64
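
For the numeric alternative to a service dummy, the share of an agency's VRH that is directly operated could be computed roughly as follows. This is a minimal sketch on made-up data; the `do_vrh` and `pct_vrh_directly_operated` column names are hypothetical, and all three "Purchased Transportation" variants are treated as not directly operated.

```python
import pandas as pd

# Made-up rows standing in for transit_metrics (one row per agency/service).
sample = pd.DataFrame({
    "ntd_id": ["A", "A", "B", "B", "C"],
    "service": [
        "Directly Operated",
        "Purchased Transportation",
        "Purchased Transportation",
        "Purchased Transportation - Taxi",
        "Directly Operated",
    ],
    "vrh": [300, 100, 50, 50, 80],
})

# Keep VRH only on directly operated rows; every other service type counts as 0.
sample["do_vrh"] = sample["vrh"].where(sample["service"] == "Directly Operated", 0)

# Sum per agency, then divide to get the directly operated share of VRH.
by_agency = sample.groupby("ntd_id")[["vrh", "do_vrh"]].sum()
by_agency["pct_vrh_directly_operated"] = by_agency["do_vrh"] / by_agency["vrh"]
```

The result is a single value between 0 and 1 per agency, so it survives flattening to the agency grain, which a row-level service dummy does not.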

## 3. "Flattening" data
If we group by only Agency (flattening, fewer rows), how do we aggregate the other classification variables before we normalize them? Is adding sufficient? Do we need to average any?

In [11]:
# sum the numerical columns; each row of the result is one agency
group_id = transit_metrics.groupby(id_cols).agg(
    {col: "sum" for col in numerical_cols} 
).reset_index()

In [12]:
# sum the numerical columns; each row of the result is one mode/service combination
group_cat = transit_metrics.groupby(categorical_cols).agg(
    {col: "sum" for col in numerical_cols} 
).reset_index()

In [13]:
# double checking aggregation works
(transit_metrics[transit_metrics["ntd_id"]=="90211"]["opexp_total"].sum() == group_id[group_id["ntd_id"]=="90211"]["opexp_total"].sum(),
transit_metrics[transit_metrics["mode"]=="Bus"]["opexp_total"].sum() == group_cat[group_cat["mode"]=="Bus"]["opexp_total"].sum())

(True, True)
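
On whether adding is sufficient: summing works for count-like totals such as `upt`, `vrh`, `vrm` and `opexp_total`, but any ratio derived from them is probably better recomputed from the summed totals than averaged across rows. This is a minimal sketch on made-up numbers showing how the two approaches diverge; `cost_per_trip` is a hypothetical derived metric, not a column in the data.

```python
import pandas as pd

# Made-up rows: agency A has one large and one small mode, agency B has one mode.
sample = pd.DataFrame({
    "ntd_id": ["A", "A", "B"],
    "upt": [1000, 100, 500],
    "opexp_total": [5000, 2000, 2500],
})

# Option 1: sum the totals per agency, then compute the ratio once.
totals = sample.groupby("ntd_id")[["upt", "opexp_total"]].sum()
totals["cost_per_trip"] = totals["opexp_total"] / totals["upt"]

# Option 2: compute the ratio per row, then average (unweighted) per agency.
mean_of_ratios = (
    sample.assign(cost_per_trip=sample["opexp_total"] / sample["upt"])
    .groupby("ntd_id")["cost_per_trip"]
    .mean()
)
```

For agency `A` the ratio of sums is 7000 / 1100, about 6.36, while the mean of row ratios is (5 + 20) / 2 = 12.5, because the unweighted mean gives the small mode the same weight as the large one.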

In [14]:
display("group_id",
    group_id.shape,
    group_id.head(),
    )

'group_id'

(168, 11)

Unnamed: 0,agency_name,city,ntd_id,state,primary_uza_name,RTPA,reporter_type,upt,vrh,vrm,opexp_total
0,Access Services (AS),El Monte,90157,CA,"Los Angeles--Long Beach--Anaheim, CA",Southern California Association of Governments,Full Reporter,21100712,11073631,195345317,992266445
1,Alameda-Contra Costa Transit District,Oakland,90014,CA,"San Francisco--Oakland, CA",Metropolitan Transportation Commission,Full Reporter,238095061,12968803,140963319,2864051248
2,Altamont Corridor Express (ACE),Stockton,90182,CA,"Stockton, CA",San Joaquin Council of Governments,Full Reporter,4923384,141171,5573755,143295615
3,Anaheim Transportation Network (ATN),Anaheim,90211,CA,"Los Angeles--Long Beach--Anaheim, CA",Southern California Association of Governments,Full Reporter,40740395,976178,6528424,82415962
4,Antelope Valley Transit Authority (AVTA),Lancaster,90121,CA,"Palmdale--Lancaster, CA",Southern California Association of Governments,Full Reporter,10230960,1222451,20620573,166805222


In [15]:
display("group_cat",
    group_cat.shape,
    group_cat.head()
)

'group_cat'

(25, 6)

Unnamed: 0,mode,service,upt,vrh,vrm,opexp_total
0,Bus,Directly Operated,2952883383,109698885,1187304414,21064786925
1,Bus,Purchased Transportation,636746541,45722527,581552826,5000145202
2,Bus Rapid Transit,Directly Operated,40599424,828979,11343877,251967151
3,Cable Car,Directly Operated,21089173,551837,1139285,373274268
4,Commuter Bus,Directly Operated,11604823,769809,16510747,195171589


### compared to `ntd_metrics_merge`

In [36]:
ntd_metrics_merge.head()

Unnamed: 0,ntd_id,source_agency,agency_status,primary_uza_name,uza_population,uza_area_sq_miles,year,total_opexp_total,total_upt,total_vrh
0,90198,City of Porterville (COLT) - Transit Department,Active,"Porterville, CA",69862,16.35,2023,,,
1,90198,City of Porterville (COLT) - Transit Department,Active,"Porterville, CA",69862,16.35,2021,3609990.0,174595.0,30773.0
2,90198,City of Porterville (COLT) - Transit Department,Active,"Porterville, CA",69862,16.35,2019,4014118.0,635559.0,52834.0
3,90198,City of Porterville (COLT) - Transit Department,Active,"Porterville, CA",69862,16.35,2020,4212765.0,522056.0,47356.0
4,90198,City of Porterville (COLT) - Transit Department,Active,"Porterville, CA",69862,16.35,2018,4025065.0,648649.0,52799.0


In [37]:
group_id[group_id["ntd_id"]=="90211"]["opexp_total"].sum() == ntd_metrics_merge[ntd_metrics_merge["ntd_id"]=="90211"]["total_opexp_total"].sum()

True

In [38]:
id_cols

['agency_name',
 'city',
 'ntd_id',
 'state',
 'primary_uza_name',
 'RTPA',
 'reporter_type']

In [42]:
merge_group = ntd_metrics_merge.groupby(
    ["ntd_id","source_agency","primary_uza_name","uza_population","uza_area_sq_miles"]
).agg({
    "total_opexp_total":"sum",
    "total_upt":"sum",
    "total_vrh":"sum"
}).reset_index()

In [43]:
merge_group.head()

Unnamed: 0,ntd_id,source_agency,primary_uza_name,uza_population,uza_area_sq_miles,total_opexp_total,total_upt,total_vrh
0,90003,San Francisco Bay Area Rapid Transit District ...,"San Francisco--Oakland, CA",3515933,513.8,4109168000.0,455096497.0,13481404.0
1,90004,Golden Empire Transit District (GET),"Bakersfield, CA",570235,132.12,193895700.0,27369380.0,1871722.0
2,90006,Santa Cruz Metropolitan Transit District (SCMTD),"Santa Cruz, CA",169038,60.45,282837000.0,20980309.0,1318828.0
3,90007,City of Modesto (MAX),"Modesto, CA",357301,70.38,77350150.0,8170544.0,809165.0
4,90008,City of Santa Monica (BBB) - Department of Tra...,"Los Angeles--Long Beach--Anaheim, CA",12237376,1636.83,471552500.0,55239107.0,2879128.0
