# [Refine Use of Explored NTD Variables #1646](https://github.com/cal-itp/data-analyses/issues/1646)

Question or Goal:

1. **COMPLETE** Mode & Service: We previously used these to define the grain of the data. These should be used for classifying/model-fitting.
    - There are many modes: Can we group modes so there are fewer of them (e.g. fixed-guideway vs not)? Too many dummy variables for a category leads to overfitting/too many clusters.
    - What's the best way to use Service? Is it as a dummy variable? Is it numeric "proportion of service(VRH or M)" that's directly operated? Explore the impacts of both.

2. ~~If we group by only Agency (flattening, fewer rows), how do we aggregate the other classification variables before we normalize them? Is adding sufficient? Do we need to average any?~~
    - Work continuing on issue #1683

3. Are there ways to get more interactive/easy-to-read visualizations?
    - If it takes significant time, break this one out
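
To make the numeric option in item 1 concrete, here is a minimal sketch of the share of each agency's vehicle revenue hours (VRH) that is directly operated. The column names mirror this notebook's dataset (`ntd_id`, `service`, `total_vrh`); the toy values and the `do_vrh` / `do_share_vrh` names are assumptions made for illustration.

```python
import pandas as pd

# Toy rows (made-up values) shaped like the agency/mode/service data in this notebook
toy = pd.DataFrame(
    {
        "ntd_id": ["A", "A", "A", "B", "B"],
        "service": [
            "Directly Operated",
            "Purchased Transportation",
            "Directly Operated",
            "Purchased Transportation",
            "Purchased Transportation",
        ],
        "total_vrh": [600, 300, 100, 250, 750],
    }
)

# VRH that is directly operated; 0 for every other type of service
toy["do_vrh"] = toy["total_vrh"].where(toy["service"] == "Directly Operated", 0)

# One row per agency: directly operated VRH divided by all VRH
per_agency = toy.groupby("ntd_id")[["do_vrh", "total_vrh"]].sum()
do_share_vrh = (per_agency["do_vrh"] / per_agency["total_vrh"]).rename("do_share_vrh")

print(do_share_vrh)
```

An agency that reports zero VRH would get `NaN` here, so a real version would need to decide how to treat those rows before scaling.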

## Terms
**Example**:
- You could think of an example as analogous to a `single row` in a spreadsheet.

**Feature**:
- Features are the values that a supervised model uses to predict the label.
- In a weather model that predicts rainfall, the features could be `latitude, longitude, temperature, humidity, cloud coverage, wind direction, and atmospheric pressure`.
- An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores: the weather conditions are the features and the test score is the label.

**Label**:
- The label is the "answer," or the value we want the model to predict.
- In a weather model that predicts rainfall, the label would be `rainfall amount`.
- In supervised machine learning, the "answer" or "result" portion of an example. Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

---

In [None]:
import sys
import altair as alt
import pandas as pd

sys.path.append("../ntd/monthly_ridership_report")
from update_vars import GCS_FILE_PATH, NTD_MODES, NTD_TOS

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

## Test querying NTD data from warehouse
I queried data from `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_{metric}`
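
The `{metric}` placeholder stands for one table per metric. As a small sketch of how the five table names used in the query below are formed (the prefix and metric names are copied from that query; the variable names `TABLE_PREFIX`, `METRICS` and `metric_tables` are illustrative only):

```python
# Shared prefix of the warehouse tables, copied from the query in this notebook
TABLE_PREFIX = (
    "cal-itp-data-infra.mart_ntd_funding_and_expenses."
    "fct_service_data_and_operating_expenses_time_series_by_mode_"
)

# The five metrics joined in the query
METRICS = ["upt", "voms", "vrh", "vrm", "opexp_total"]

# Map each metric to its fully qualified table name
metric_tables = {metric: f"{TABLE_PREFIX}{metric}" for metric in METRICS}

for metric, table in metric_tables.items():
    print(f"{metric}: {table}")
```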


In [None]:
# from calitp_data_analysis.sql import get_engine

In [None]:
# db_engine = get_engine()

# with db_engine.connect() as connection:
#     # including mode and service will make the table bigger, and would need another aggregation to get yearly totals
#     query = f"""
#         SELECT
#           upt.ntd_id,
#           upt.source_agency,
#           upt.agency_status,
#           upt.city,
#           upt.primary_uza_name,
#           upt.uza_population,
#           upt.uza_area_sq_miles,
#           upt.year,
#           upt.mode,
#           upt.service,
#           upt.reporter_type,
#           SUM(upt.upt) AS total_upt,
#           SUM(voms.voms) AS total_voms,
#           SUM(vrh.vrh) AS total_vrh,
#           SUM(vrm.vrm) AS total_vrm,
#           SUM(opexp_total.opexp_total) AS opexp_total
#         FROM
#           cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt AS upt
#         INNER JOIN
#           cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_voms AS voms
#         ON
#           upt.ntd_id = voms.ntd_id
#           AND upt.year = voms.year
#           AND upt.source_agency = voms.source_agency
#           AND upt.agency_status = voms.agency_status
#           AND upt.primary_uza_name = voms.primary_uza_name
#           AND upt.uza_population = voms.uza_population
#           AND upt.uza_area_sq_miles = voms.uza_area_sq_miles
#         INNER JOIN
#           cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrh AS vrh
#         ON
#           upt.ntd_id = vrh.ntd_id
#           AND upt.year = vrh.year
#           AND upt.source_agency = vrh.source_agency
#           AND upt.agency_status = vrh.agency_status
#           AND upt.primary_uza_name = vrh.primary_uza_name
#           AND upt.uza_population = vrh.uza_population
#           AND upt.uza_area_sq_miles = vrh.uza_area_sq_miles
#         INNER JOIN
#           cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrm AS vrm
#         ON
#           upt.ntd_id = vrm.ntd_id
#           AND upt.year = vrm.year
#           AND upt.source_agency = vrm.source_agency
#           AND upt.agency_status = vrm.agency_status
#           AND upt.primary_uza_name = vrm.primary_uza_name
#           AND upt.uza_population = vrm.uza_population
#           AND upt.uza_area_sq_miles = vrm.uza_area_sq_miles
#         INNER JOIN
#           cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_opexp_total AS opexp_total
#         ON
#           upt.ntd_id = opexp_total.ntd_id
#           AND upt.year = opexp_total.year
#           AND upt.source_agency = opexp_total.source_agency
#           AND upt.agency_status = opexp_total.agency_status
#           AND upt.primary_uza_name = opexp_total.primary_uza_name
#           AND upt.uza_population = opexp_total.uza_population
#           AND upt.uza_area_sq_miles = opexp_total.uza_area_sq_miles
#         WHERE
#           upt.source_state = "CA"
#           AND upt.year >= 2018
#         GROUP BY
#           upt.ntd_id,
#           upt.source_agency,
#           upt.agency_status,
#           upt.city,
#           upt.year,
#           upt.primary_uza_name,
#           upt.uza_population,
#           upt.uza_area_sq_miles,
#           upt.mode,
#           upt.service,
#           upt.reporter_type
#         """
#     # create df
#     raw_ntd_metrics_merge = pd.read_sql(query, connection)

In [None]:
# reverse_mode = {v: k for k,v in NTD_MODES.items()}
# reverse_tos = {v:k for k,v in NTD_TOS.items()}

# replace_dict = {"mode": NTD_MODES, "service": NTD_TOS}

# for k, v in replace_dict.items():
#     raw_ntd_metrics_merge[k] = raw_ntd_metrics_merge[k].replace(v)
#     # test_query[k] = test_query[k].replace(v)

# display(
#     raw_ntd_metrics_merge["mode"].unique(),
#     # test_query["service"].unique(),
#     # test_query.equals(raw_ntd_metrics_merge)
# )

## save out to GCS

In [None]:
# raw_ntd_metrics_merge.to_parquet(f"{GCS_FILE_PATH}transit_peer_group_data_18-23.parquet")

## read in GCS data

In [None]:
raw_ntd_metrics_merge = pd.read_parquet(
    f"{GCS_FILE_PATH}transit_peer_group_data_18-23.parquet"
)

In [None]:
# without mode/service - 1524 rows
# with mode/service - 4002
raw_ntd_metrics_merge.info()

In [None]:
# aggregating to get yearly totals, but missing mode/service
ntd_metrics_yearly = (
    raw_ntd_metrics_merge.groupby(
        [
            "ntd_id",
            "source_agency",
            "agency_status",
            "reporter_type",
            "primary_uza_name",
            "uza_population",
            "uza_area_sq_miles",
            "year",
        ]
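    )[["total_upt", "total_voms", "total_vrh", "total_vrm", "opexp_total"]]
    .sum()  # sum only the metric columns; summing text columns would concatenate strings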
    .reset_index()
)

# each row is an agency's total for the year, for 6 years (2018-2023). each agency should have 6 rows.
ntd_metrics_yearly.info()

## Check correlation matrix

In [None]:
corr_matrix = (
    ntd_metrics_yearly[
        [
            "opexp_total",
            "total_upt",
            "total_vrh",
            "total_vrm",
            "total_voms",
            "uza_area_sq_miles",
            "uza_population",
        ]
    ]
    .corr()
    .round(4)
)

corr_matrix

In [None]:
corr_melt = corr_matrix.reset_index().melt(id_vars="index")

In [None]:
alt.Chart(corr_melt).mark_rect().encode(
    x="index",
    y="variable",
    color="value"
)

Most variable pairs have a correlation above 0.9.
The pairs with a correlation below 0.9 are:
- opex & voms: 0.87
- opex & uza area: 0.11
- opex & uza pop: 0.09
- upt & vrm: 0.88
- upt & voms: 0.82
- upt & uza area: 0.11 
- upt & uza pop: 0.09
- vrh & uza area: 0.12
- vrh & uza pop: 0.09
- vrm & uza area: 0.12
- vrm & uza pop: 0.10
- voms & uza area: 0.12
- voms & uza pop: 0.10
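
The list above was read off the matrix by hand. A sketch of how the same kind of list could be produced from a correlation matrix is shown below; it runs on a made-up dataframe, and the names `upper`, `pairs` and `low_pairs` are illustrative. It reuses the `reset_index().melt(id_vars="index")` pattern from the heatmap cell.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the yearly NTD metrics
toy = pd.DataFrame(
    {
        "total_vrh": [10, 20, 30, 40, 50],
        "total_vrm": [11, 19, 32, 41, 49],
        "uza_population": [5, 1, 4, 2, 3],
    }
)
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per variable pair; masked cells are NaN and get dropped
pairs = (
    upper.reset_index()
    .melt(id_vars="index", var_name="var_2", value_name="corr")
    .rename(columns={"index": "var_1"})
    .dropna(subset=["corr"])
)

threshold = 0.9  # same cutoff used in the list above
low_pairs = pairs[pairs["corr"].abs() < threshold]

print(low_pairs)
```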



## Classify feature columns

In [None]:
list(raw_ntd_metrics_merge.columns)

In [None]:
# Feature groups

# exclude from clustering
id_cols = [
    "source_agency",
    "city",
    "ntd_id",
    "primary_uza_name",
    # "reporter_type",  # maybe move to categorical
]

# include in clustering
categorical_cols = ["mode", "service", "reporter_type"]

# include in clustering
numerical_cols = [
    "total_upt",
    "total_vrh",
    "total_vrm",
    "opexp_total",
    "total_voms",
    # "uza_population",
    # "uza_area_sq_miles",
]

# exclude in clustering
other_cols = [
    "agency_status",
    "year",
    "reporting_module",
    "_merge",
]

## 2. Mode & Service

### Unique `Mode` sub-categories
Previous research papers categorized modes by whether or not they have a dedicated right-of-way / fixed guideway.

In [None]:
list(raw_ntd_metrics_merge["mode"].unique())


In [None]:
# what are `OT` and `OR`?

# Which agencies report mode OT or OR?
raw_ntd_metrics_merge[raw_ntd_metrics_merge["mode"].isin(["OT", "OR"])][
    "source_agency"
].unique()

In [None]:
fixed_guideway = [
    "Streetcar",
    "Heavy Rail",
    "Hybrid Rail",
    "Commuter Rail",
    "Cablecar",
    "Light Rail",
    "Monorail / Automated Guideway",
]

other = [
    "Trolleybus", 
    "Ferryboats", 
    "OT", 
    "OR"
]

nonfixed_guideway = [
    "Demand Response",
    "Bus",
    "Demand Response Taxi",
    "Commuter Bus",
    "Vanpool",
    "Bus Rapid Transit",
]

### testing what the dataframe will look like if you separate fixed from nonfixed guideway
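
Besides splitting into two dataframes, the grouping could also be kept as a single column on the original rows, which would give the model one low-cardinality feature instead of many mode dummies. This is a sketch on toy data: the mode lists are abbreviated copies of the lists defined earlier, and the `guideway_group` column and its `"other"` fallback label are assumptions.

```python
import pandas as pd

# Mode groupings abbreviated from the lists defined in this notebook
fixed_guideway = ["Heavy Rail", "Light Rail", "Commuter Rail"]
nonfixed_guideway = ["Bus", "Demand Response", "Vanpool"]

# Build a mode -> group lookup
mode_to_group = {m: "fixed_guideway" for m in fixed_guideway}
mode_to_group.update({m: "nonfixed_guideway" for m in nonfixed_guideway})

# Toy rows (made-up) including one mode that is in neither list
toy = pd.DataFrame({"mode": ["Bus", "Heavy Rail", "Ferryboats", "Vanpool"]})

# Modes missing from the lookup fall back to "other" (hypothetical label)
toy["guideway_group"] = toy["mode"].map(mode_to_group).fillna("other")

print(toy)
```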

In [None]:
fixed_guideway_df = raw_ntd_metrics_merge[
    raw_ntd_metrics_merge["mode"].isin(fixed_guideway)
]

nonfixed_guideway_df = raw_ntd_metrics_merge[
    raw_ntd_metrics_merge["mode"].isin(nonfixed_guideway)
]

In [None]:
display(
    fixed_guideway_df.shape,
    fixed_guideway_df["mode"].unique(),
)

In [None]:
display(
    nonfixed_guideway_df.shape,
    nonfixed_guideway_df["mode"].unique(),
)

### Unique `Service` values

In [None]:
list(raw_ntd_metrics_merge["service"].unique())

In [None]:
service_pt_do = [
    "Purchased Transportation",
    "Directly Operated",
]

service_other = [
    "Purchased Transportation - Taxi",
    "Purchased Transportation - Transportation Network Company",
]

In [None]:
display(
    fixed_guideway_df["service"].value_counts(),
    nonfixed_guideway_df["service"].value_counts(),
    nonfixed_guideway_df[nonfixed_guideway_df["service"].str.contains("- T")]["source_agency"].value_counts()
)

**RE: what to do with service columns**

- The majority of rows are either Purchased Transportation or Directly Operated. Because this is close to a binary variable, one-hot encoding is likely the best option.
- Consider removing the Purchased Transportation - Taxi and Purchased Transportation - Transportation Network Company rows, since there are few of them.
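
A minimal sketch of what those two points could look like together: drop the rare service types, then one-hot encode what remains into a single 0/1 column. The row counts are made up, and `main_services`, `binary` and `encoded` are illustrative names rather than anything defined in this notebook.

```python
import pandas as pd

# Toy service column (made-up counts): two dominant values and two rare ones
toy = pd.DataFrame(
    {
        "service": ["Directly Operated"] * 5
        + ["Purchased Transportation"] * 4
        + ["Purchased Transportation - Taxi"]
        + ["Purchased Transportation - Transportation Network Company"]
    }
)

# Keep only the two dominant service types, leaving a binary category
main_services = ["Purchased Transportation", "Directly Operated"]
binary = toy[toy["service"].isin(main_services)]

# drop_first=True leaves one dummy column: 1 = Purchased Transportation, 0 = Directly Operated
encoded = pd.get_dummies(binary["service"], drop_first=True, dtype=int)

print(encoded.value_counts())
```

This mirrors the `OneHotEncoder(drop="first")` step used later in the pipeline, just with pandas so the result is easy to inspect.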

## 3. "Flattening" data
If we group by only Agency (flattening, fewer rows), how do we aggregate the other classification variables before we normalize them? Is adding sufficient? Do we need to average any?

If we aggregate to get 1 row per agency, what happens to the other categorical variables?
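
One possible answer, sketched on made-up rows: sum the additive metrics, and turn the categorical `mode` column into numeric summaries (a count of distinct modes and each mode's share of VRH) so it is not lost. The `n_modes` and `vrh_share_*` column names are assumptions, and this is only one option among several.

```python
import pandas as pd

# Toy agency/mode rows (made-up values) shaped like the data in this notebook
toy = pd.DataFrame(
    {
        "ntd_id": ["A", "A", "A", "B"],
        "mode": ["Bus", "Light Rail", "Bus", "Bus"],
        "total_vrh": [500, 300, 200, 400],
    }
)

# VRH per agency and mode, pivoted wide, then divided by each agency's total
vrh_by_mode = toy.pivot_table(
    index="ntd_id", columns="mode", values="total_vrh", aggfunc="sum", fill_value=0
)
mode_share = vrh_by_mode.div(vrh_by_mode.sum(axis=1), axis=0)

# Additive total plus a simple count of distinct modes, one row per agency
flat = toy.groupby("ntd_id").agg(
    total_vrh=("total_vrh", "sum"),
    n_modes=("mode", "nunique"),
)
flat = flat.join(mode_share.add_prefix("vrh_share_"))

print(flat)
```

Shares are already on a 0 to 1 scale, so they would likely not need the same standardization as the raw totals.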

In [None]:
numerical_cols

In [None]:
# agg the numerical row, end with each row being an agency
group_id_cols = (
    raw_ntd_metrics_merge.groupby(id_cols)
    .agg({col: "sum" for col in numerical_cols})
    .reset_index()
)

# agg the numerical columns, end with each row being a unique agency/mode/service
group_id_mode_service = (
    raw_ntd_metrics_merge.groupby(id_cols + ["mode", "service"])
    .agg({col: "sum" for col in numerical_cols})
    .reset_index()
)

# agg the numerical columns, end with each row being a unique mode/service/reporter_type combination
group_cat_cols = (
    raw_ntd_metrics_merge.groupby(categorical_cols)
    .agg({col: "sum" for col in numerical_cols})
    .reset_index()
)

In [None]:
# double checking aggregation works
(
    raw_ntd_metrics_merge[raw_ntd_metrics_merge["ntd_id"] == "90211"][
        "opexp_total"
    ].sum()
    == group_id_cols[group_id_cols["ntd_id"] == "90211"]["opexp_total"].sum(),
    raw_ntd_metrics_merge[raw_ntd_metrics_merge["mode"] == "Bus"]["opexp_total"].sum()
    == group_cat_cols[group_cat_cols["mode"] == "Bus"]["opexp_total"].sum(),
)

In [None]:
display(
    "group_id",
    group_id_cols.shape,
    # group_id_cols["service"].value_counts(),
    group_id_cols.head(),
    "group_id_mode_service",
    group_id_mode_service.shape,
    group_id_mode_service["service"].value_counts(),
    group_id_mode_service.head(),
)

In [None]:
display("group_cat", group_cat_cols.shape, group_cat_cols.head())

### **This establishes that we can aggregate up to agency-level data.**

However, we still need to decide what to do with the categorical columns that are lost in a simple aggregation.

This work will continue in issue #1683.

---

## What does 1 year of data look like?

Try with only 2023 data

In [None]:
# alt method to query warehouse data, via cal-itp docs

from calitp_data_analysis.sql import query_sql

query = f"""
        SELECT
          upt.ntd_id,
          upt.source_agency,
          upt.agency_status,
          upt.city,
          upt.primary_uza_name,
          upt.uza_population,
          upt.uza_area_sq_miles,
          upt.year,
          upt.mode,
          upt.service,
          upt.reporter_type,
          SUM(upt.upt) AS total_upt,
          SUM(voms.voms) AS total_voms,
          SUM(vrh.vrh) AS total_vrh,
          SUM(vrm.vrm) AS total_vrm,
          SUM(opexp_total.opexp_total) AS opexp_total
        FROM
          cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt AS upt
        INNER JOIN
          cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_voms AS voms
        ON
          upt.ntd_id = voms.ntd_id
          AND upt.year = voms.year
          AND upt.source_agency = voms.source_agency
          AND upt.agency_status = voms.agency_status
          AND upt.primary_uza_name = voms.primary_uza_name
          AND upt.uza_population = voms.uza_population
          AND upt.uza_area_sq_miles = voms.uza_area_sq_miles
        INNER JOIN
          cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrh AS vrh
        ON
          upt.ntd_id = vrh.ntd_id
          AND upt.year = vrh.year
          AND upt.source_agency = vrh.source_agency
          AND upt.agency_status = vrh.agency_status
          AND upt.primary_uza_name = vrh.primary_uza_name
          AND upt.uza_population = vrh.uza_population
          AND upt.uza_area_sq_miles = vrh.uza_area_sq_miles
        INNER JOIN
          cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrm AS vrm
        ON
          upt.ntd_id = vrm.ntd_id
          AND upt.year = vrm.year
          AND upt.source_agency = vrm.source_agency
          AND upt.agency_status = vrm.agency_status
          AND upt.primary_uza_name = vrm.primary_uza_name
          AND upt.uza_population = vrm.uza_population
          AND upt.uza_area_sq_miles = vrm.uza_area_sq_miles
        INNER JOIN
          cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_opexp_total AS opexp_total
        ON
          upt.ntd_id = opexp_total.ntd_id
          AND upt.year = opexp_total.year
          AND upt.source_agency = opexp_total.source_agency
          AND upt.agency_status = opexp_total.agency_status
          AND upt.primary_uza_name = opexp_total.primary_uza_name
          AND upt.uza_population = opexp_total.uza_population
          AND upt.uza_area_sq_miles = opexp_total.uza_area_sq_miles
        WHERE
          upt.source_state = "CA"
          AND upt.year = 2023
        GROUP BY
          upt.ntd_id,
          upt.source_agency,
          upt.agency_status,
          upt.city,
          upt.year,
          upt.primary_uza_name,
          upt.uza_population,
          upt.uza_area_sq_miles,
          upt.mode,
          upt.service,
          upt.reporter_type
        """

ntd_2023_data = query_sql(query).fillna(0)

In [None]:
ntd_2023_data[numerical_cols] = ntd_2023_data[numerical_cols].astype("int64", errors="ignore")

### 1yr - explore data

In [None]:
# info() prints its summary and returns None, so call it outside display()
ntd_2023_data.info()

display(
    ntd_2023_data.head(),
    ntd_2023_data["service"].value_counts(),
    ntd_2023_data["mode"].value_counts(),
)

### 1yr - correlation matrix

In [None]:
ntd_2023_data[numerical_cols].corr()

### 1yr - Test Hierarchical clustering w/ ward

In [None]:
from sklearn.cluster import AgglomerativeClustering 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
display(
    numerical_cols,
    categorical_cols
)

In [None]:
# 1. set up pre-processing steps with column transformer

preprocessor = ColumnTransformer(
    [
        ("ntd_metrics", StandardScaler(), numerical_cols),
        ("categorical", OneHotEncoder(drop="first", sparse_output=False), categorical_cols)
    ]
)
preprocessor

In [None]:
# 2. set up pipeline. First pre-processing, then clustering
pipeline = Pipeline(
    [
        ("preprocessing", preprocessor),
        ("clustering", AgglomerativeClustering(n_clusters=10, linkage="ward"))
    ]
)

pipeline

In [None]:
# 3. use pipeline to fit clustering model. create new column for clustering
ntd_2023_fit = ntd_2023_data.copy()  # copy so the cluster labels are added to a new dataframe, not the original

ntd_2023_fit["cluster_name"] = pipeline.fit_predict(ntd_2023_fit)

In [None]:
display(
    ntd_2023_fit.columns,
    ntd_2023_fit["cluster_name"].value_counts()
)

### 1yr - Dendrogram

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

#### set up feature dataframe for dendrogram

In [None]:
# 1. apply preprocessing steps to initial dataset to get feature array
# this will encode the columns

feature_array = preprocessor.fit_transform(ntd_2023_data)

display(
    type(feature_array),
    feature_array
)

In [None]:
# 2. create a dataframe that pairs the feature names with the feature array

feature_names = preprocessor.get_feature_names_out()
feature_df = pd.DataFrame(feature_array, columns = feature_names)

display(
    list(feature_names), # list of feature names after encoding
    feature_df.head()
)

## Are there ways to get more interactive/easy-to-read visualizations?
If it takes significant time, break this one out.

This might have to be broken out. What other visuals would make sense for hierarchical clustering?
- dendrogram (one for fixed and one for non-fixed guideway modes)
- loss curve?
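
For the "loss curve" idea, one candidate (an assumption, not something settled in this notebook) is an elbow-style view of the Ward merge distance against the number of clusters, read directly from the linkage matrix. The sketch below uses a made-up feature matrix in place of the encoded NTD features; `merge_heights` and its column names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Made-up feature matrix: 8 observations, 3 already-scaled features
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 3))

z = linkage(features, method="ward")

# Column 2 of the linkage matrix is the distance at which each merge happened.
# Reversed, the first entry is the final merge (2 clusters -> 1), the next is 3 -> 2, etc.
merge_heights = pd.DataFrame(
    {
        "n_clusters_before_merge": np.arange(2, len(features) + 1),
        "merge_distance": z[::-1, 2],
    }
)

print(merge_heights)
```

A large drop in `merge_distance` from one row to the next suggests a natural place to cut the tree. The resulting dataframe could be plotted as a line chart in Altair or Plotly to get hover tooltips.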


#### create dendrogram

In [None]:
import matplotlib.pyplot as plt


In [None]:
z = linkage(feature_df, method="ward")
z

### matplotlib dendrogram

In [None]:
plt.figure(figsize=(10, 5))
dendrogram(z, labels=ntd_2023_data["ntd_id"].tolist(), leaf_rotation=90)  # label leaves by ntd_id to match the x-axis title
plt.title("Dendrogram (Ward's Method)")
plt.xlabel("ntd_id")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()

### plotly dendrogram

In [None]:
import plotly.figure_factory as ff


In [None]:
# have to downgrade scipy to work with plotly

!pip install scipy==1.11.4

In [None]:
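# plotly figures have more built-in interactivity
# create_dendrogram takes the observations (not a linkage matrix) and builds the linkage itself
fig = ff.create_dendrogram(
    feature_df.to_numpy(), linkagefun=lambda d: linkage(d, method="ward")
)
fig.update_layout(width=800, height=500)
fig.show()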

## Test util functions

In [None]:
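# import the helpers by name so the unqualified calls below resolve
from utils_transit_peer_groups import make_dendrogram_data, make_hierarchal_clustering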

In [None]:
test_func = make_hierarchal_clustering(
    data = ntd_2023_data,
    num_cols = numerical_cols,
    cat_cols = categorical_cols,
    cluster_num = 10
    
)

In [None]:
test_func["cluster_name"].value_counts() == ntd_2023_fit["cluster_name"].value_counts()

In [None]:
test_z = make_dendrogram_data(
    data= test_func,
    num_cols= numerical_cols,
    cat_cols= categorical_cols
)