# [Refine Use of Explored NTD Variables #1646](https://github.com/cal-itp/data-analyses/issues/1646)

Question or Goal:

1. Mode & Service: We previously used these to define the grain of the data. They should instead be used as features for classifying/model-fitting.
    - There are many modes: Can we group modes so there are fewer of them (e.g. fixed-guideway vs not)? Too many dummy variables for one category leads to overfitting/too many clusters.
    - What's the best way to use Service? Is it as a dummy variable? Is it a numeric "proportion of service (VRH or VRM)" that's directly operated? Explore the impacts of both.


2. If we group by only Agency (flattening, fewer rows), how do we aggregate the other classification variables before we normalize them? Is adding sufficient? Do we need to average any?


3. Are there ways to get more interactive/easy-to-read visualizations?
    - If it takes significant time, break this one out
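
For question 2, one possible starting point is sketched below. It assumes the count columns (`upt`, `vrh`, `vrm`, `opexp_total`) are additive totals, so summing them per agency is safe, and that any ratio should be derived after summing rather than averaged across rows. The toy data and the `opexp_per_upt` column are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows at the agency/mode grain (illustrative values only).
toy = pd.DataFrame(
    {
        "ntd_id": ["A", "A", "B"],
        "mode": ["Bus", "Demand Response", "Bus"],
        "upt": [1000, 100, 500],
        "vrh": [200, 50, 100],
        "opexp_total": [5000, 2000, 2500],
    }
)

# Flatten to one row per agency by summing the additive totals.
agency = toy.groupby("ntd_id", as_index=False)[["upt", "vrh", "opexp_total"]].sum()

# Derive the ratio from the summed totals (hypothetical column name).
agency["opexp_per_upt"] = agency["opexp_total"] / agency["upt"]

# For comparison: averaging row-level ratios weights every mode equally,
# regardless of how much service it represents.
mean_of_ratios = (toy["opexp_total"] / toy["upt"]).groupby(toy["ntd_id"]).mean()

print(agency)
print(mean_of_ratios)
```

For agency `A`, the ratio of sums is 7000 / 1100, while the mean of the two row-level ratios (5 and 20) is 12.5, so the two approaches can differ a lot when modes differ in size.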

In [9]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [10]:
transit_metrics = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/ntd/raw_transit_performance_metrics_data.parquet"
)

In [13]:
transit_metrics.head()
# grain: each row is agency, mode, service per year

| | agency_name | agency_status | city | mode | service | ntd_id | reporter_type | reporting_module | state | primary_uza_name | year | upt | vrh | vrm | opexp_total | RTPA | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | City of Porterville (COLT) - Transit Department | Active | Porterville | Demand Response | Purchased Transportation | 90198 | Building Reporter | Urban | CA | Porterville, CA | 2019 | 13112 | 2997 | 43696 | 572799 | Tulare County Association of Governments | both |
| 1 | City of Porterville (COLT) - Transit Department | Active | Porterville | Demand Response | Purchased Transportation | 90198 | Building Reporter | Urban | CA | Porterville, CA | 2020 | 11523 | 3669 | 48138 | 686165 | Tulare County Association of Governments | both |
| 2 | City of Porterville (COLT) - Transit Department | Active | Porterville | Bus | Purchased Transportation | 90198 | Building Reporter | Urban | CA | Porterville, CA | 2018 | 635648 | 50140 | 700127 | 3460906 | Tulare County Association of Governments | both |
| 3 | City of Porterville (COLT) - Transit Department | Active | Porterville | Bus | Purchased Transportation | 90198 | Building Reporter | Urban | CA | Porterville, CA | 2021 | 145215 | 21208 | 244230 | 2657959 | Tulare County Association of Governments | both |
| 4 | City of Porterville (COLT) - Transit Department | Active | Porterville | Demand Response | Purchased Transportation | 90198 | Building Reporter | Urban | CA | Porterville, CA | 2021 | 29380 | 9565 | 126604 | 952031 | Tulare County Association of Governments | both |


## Classify columns

In [35]:
# number of unique values per column, and that count as a % of total rows
n_rows = len(transit_metrics)
for col in transit_metrics.columns:
    n_unique = transit_metrics[col].nunique()
    pct_unique = round(n_unique / n_rows * 100, 2)
    print(f"{col}: {n_unique} unique values. {pct_unique} % of unique values to total rows")

agency_name: 168 unique values. 8.03 % of unique values to total rows
agency_status: 2 unique values. 0.1 % of unique values to total rows
city: 124 unique values. 5.93 % of unique values to total rows
mode: 15 unique values. 0.72 % of unique values to total rows
service: 4 unique values. 0.19 % of unique values to total rows
ntd_id: 168 unique values. 8.03 % of unique values to total rows
reporter_type: 4 unique values. 0.19 % of unique values to total rows
reporting_module: 2 unique values. 0.1 % of unique values to total rows
state: 1 unique values. 0.05 % of unique values to total rows
primary_uza_name: 49 unique values. 2.34 % of unique values to total rows
year: 6 unique values. 0.29 % of unique values to total rows
upt: 2036 unique values. 97.37 % of unique values to total rows
vrh: 2018 unique values. 96.51 % of unique values to total rows
vrm: 2046 unique values. 97.85 % of unique values to total rows
opexp_total: 2068 unique values. 98.9 % of unique values to total rows
RTPA:
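
The same summary can also be built as a single table, which is easier to sort and scan than printed lines. This is a minimal sketch on a toy frame; `uniqueness_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def uniqueness_summary(df):
    """Return unique-value counts per column and their share of total rows."""
    n_unique = df.nunique()
    summary = pd.DataFrame(
        {
            "n_unique": n_unique,
            "pct_of_rows": (n_unique / len(df) * 100).round(2),
        }
    )
    # Low-cardinality columns (candidate categoricals) sort to the top.
    return summary.sort_values("n_unique")


# Toy frame (illustrative values only).
toy = pd.DataFrame(
    {
        "state": ["CA", "CA", "CA", "CA"],
        "mode": ["Bus", "Bus", "Rail", "Bus"],
        "upt": [1, 2, 3, 4],
    }
)
summary = uniqueness_summary(toy)
print(summary)
```

Calling `uniqueness_summary(transit_metrics)` should give the same counts as the loop above, with one row per column.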

In [17]:
id_cols = [  # exclude from clustering
    "agency_name",
    "city",
    "ntd_id",
    "state",
    "primary_uza_name",
    "RTPA",
    "reporter_type",
]

categorical_cols = [  # include in clustering
    "mode",  # 15 unique values
    "service",  # 4 unique values
]

numerical_cols = [  # include in clustering
    "upt",
    "vrh",
    "vrm",
    "opexp_total",
]

other_cols = [  # exclude from clustering
    "agency_status",
    "year",
    "reporting_module",
    "_merge",
]
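
Because these lists are typed by hand, a quick check against the DataFrame's columns can catch misspelled names and missing commas (Python silently joins adjacent string literals). The sketch below uses a toy frame; `missing_columns` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def missing_columns(df, *col_lists):
    """Return every listed name that is not a column of df, in listed order."""
    listed = [col for cols in col_lists for col in cols]
    return [col for col in listed if col not in df.columns]


# Toy frame carrying a few of the real column names (no data needed).
toy = pd.DataFrame(columns=["agency_name", "primary_uza_name", "RTPA", "opexp_total"])

# A missing comma merges two names into one string; "opex" is a misspelling.
print(missing_columns(toy, ["agency_name", "primary_uza" "RTPA"], ["opex"]))
# prints ['primary_uzaRTPA', 'opex']
```

In the notebook this could be run as `missing_columns(transit_metrics, id_cols, categorical_cols, numerical_cols, other_cols)`, where an empty list means every name matched.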

## Unique `Mode` sub-categories
Previous research papers categorized modes by whether or not they have a "dedicated right of way" / "fixed guideway".

In [37]:
list(transit_metrics["mode"].unique())

['Demand Response',
 'Bus',
 'Streetcar',
 'Heavy Rail',
 'Demand Response Taxi',
 'Commuter Bus',
 'Hybrid Rail',
 'Commuter Rail',
 'Vanpool',
 'Bus Rapid Transit',
 'Cable Car',
 'Light Rail',
 'Trolleybus',
 'Ferryboats',
 'Monorail / Automated Guideway']

In [38]:
fixed_guideway = [
    "Streetcar",
    "Heavy Rail",
    "Hybrid Rail",
    "Commuter Rail",
    "Cable Car",  # spelled as it appears in the data
    "Light Rail",
    "Monorail / Automated Guideway",  # runs on a dedicated guideway
]

other = [
    "Trolleybus",
    "Ferryboats",
]

nonfixed_guideway = [
    "Demand Response",
    "Bus",
    "Demand Response Taxi",
    "Commuter Bus",
    "Vanpool",
    "Bus Rapid Transit",
]
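
One way to turn group lists like these into a single, lower-cardinality feature is to invert them into a lookup and map the `mode` column through it. The sketch below uses a toy subset of modes; `assign_mode_group` and the `"unmapped"` label are assumptions made for illustration. Any mode missing from every list (or spelled differently from the data) ends up as `"unmapped"`, which makes gaps easy to spot.

```python
import pandas as pd


def assign_mode_group(modes, groups):
    """Map each mode to its group name; modes in no group become 'unmapped'."""
    lookup = {mode: name for name, members in groups.items() for mode in members}
    return modes.map(lookup).fillna("unmapped")


# Toy subset of the group lists and of the mode column (illustrative only).
toy_groups = {
    "fixed_guideway": ["Light Rail", "Cable Car"],
    "nonfixed_guideway": ["Bus"],
    "other": ["Ferryboats"],
}
toy_modes = pd.Series(["Bus", "Light Rail", "Cable Car", "Ferryboats", "Vanpool"])

grouped = assign_mode_group(toy_modes, toy_groups)
print(grouped.value_counts())

# One dummy column per group instead of one per mode.
dummies = pd.get_dummies(grouped, prefix="mode")
print(dummies.columns.tolist())
```

With the full lists, this would reduce the 15 mode values to at most three groups before creating dummy variables.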

## Unique `Service` values

In [41]:
transit_metrics["service"].value_counts()

Purchased Transportation                                     1480
Directly Operated                                             523
Purchased Transportation - Taxi                                74
Purchased Transportation - Transportation Network Company      14
Name: service, dtype: int64
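
To explore the numeric alternative to a Service dummy variable, the sketch below computes the share of each agency's `vrh` that is directly operated. It runs on toy data; `directly_operated_share` is a hypothetical helper, and grouping by `ntd_id` alone (rather than by agency and year) is an assumption made to keep the example short.

```python
import pandas as pd


def directly_operated_share(df, value_col="vrh"):
    """Share of value_col per agency that comes from 'Directly Operated' rows."""
    # Keep the value on directly operated rows and zero it elsewhere.
    do_value = df[value_col].where(df["service"] == "Directly Operated", 0)
    do_totals = do_value.groupby(df["ntd_id"]).sum()
    totals = df.groupby("ntd_id")[value_col].sum()
    # Agencies with a zero total would come out as NaN here.
    return (do_totals / totals).rename(f"do_share_{value_col}")


# Toy rows (illustrative values only).
toy = pd.DataFrame(
    {
        "ntd_id": ["A", "A", "B", "C"],
        "service": [
            "Directly Operated",
            "Purchased Transportation",
            "Purchased Transportation",
            "Directly Operated",
        ],
        "vrh": [300, 100, 250, 80],
    }
)

share = directly_operated_share(toy)
print(share)
```

Passing `value_col="vrm"` gives the mileage-based version. A dummy encoding (`pd.get_dummies(toy["service"])`) keeps all four categories but loses how much service each one represents, which is the trade-off worth comparing.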