# [Research Task - Create visuals for PUC 99314.11 leg report](https://github.com/cal-itp/data-analyses/issues/1656)
1. line graph of each metric (UPT, VRM, PMT) by agency
- x-axis is year
- y-axis is metric
- each line is an agency
- dotted line is average metric for all agencies in the year

2. line graph of each metric, by district
- similar to above
- each line is a district
- dotted line is average metric for all districts in the year

3. line graph of each metric, by mode
- similar to above
- each line is a mode
- dotted line is average metric for all modes in the year

Maybe try a box plot to show min/max/average for each metric?

## NTD Policy Manual for collecting UPT and PMT

### NTD Full Reporting Policy Manual 
However, FTA recognizes that certain statistics are challenging to collect and can drastically increase the reporting burden for transit agencies. To assist reporters who would find conducting 100 percent count burdensome, `transit agencies may estimate Unlinked Passenger Trips (UPT) and PMT through sampling`. The NTD provides a sampling method and sampling guidance on the NTD website.

### NTD Full Reporting Policy Manual & NTD Reduced Reporting Policy Manual
**Collecting Service Consumed Data.** Transit agencies must report actual data on the Annual Report for all service data except UPT and PMT. `Only Full Reporters report PMT data to the NTD.` For these two data points, agencies may provide an estimate but only if the actual 100 percent data are not reliably collected and routinely processed.
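Since only Full Reporters report PMT, PMT values are expected to be missing for other reporter types. The sketch below is one way to check that against the data. The toy rows and the reporter type labels are invented for illustration; only the column names `reporter_type` and `total_pmt` come from this notebook.

```python
import pandas as pd

# Toy data (invented); the reporter type labels are assumptions,
# the column names follow this notebook.
toy = pd.DataFrame(
    {
        "reporter_type": [
            "Full Reporter",
            "Full Reporter",
            "Reduced Reporter",
            "Reduced Reporter",
        ],
        "total_upt": [1200.0, 800.0, 150.0, 90.0],
        "total_pmt": [5400.0, 3100.0, None, None],
    }
)

# Share of rows with a reported (non-null) PMT value, per reporter type.
pmt_coverage = toy.groupby("reporter_type")["total_pmt"].apply(
    lambda s: s.notna().mean()
)
print(pmt_coverage)
```

Running the same grouping on `ntd_metrics_merge` would show whether missing PMT lines up with reporter type.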



In [1]:
import altair as alt
import pandas as pd
from calitp_data_analysis.sql import get_engine, to_snakecase

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

## Data querying, comparing, cleaning

### warehouse query

In [None]:
metric_list = [
    "pmt",
    "upt",
    "vrh",
    # "opexp_total" # not needed for this project
]

# empty list for appending DFs
df_list = []

db_engine = get_engine()

with db_engine.connect() as connection:
    for metric in metric_list:
        query = f"""
        SELECT
          ntd_id,
          source_agency,
          agency_status,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles,
          year,
          mode,
          service,
          reporter_type,
          SUM({metric}) AS total_{metric},
        FROM
          `cal-itp-data-infra.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_{metric}`
        WHERE
          source_state = "CA"
          AND year >= 2018
        GROUP BY
          ntd_id,
          source_agency,
          agency_status,
          primary_uza_name,
          uza_population,
          uza_area_sq_miles,
          year,
          mode,
          service,
          reporter_type
        """
        # create df for this metric (separate name so the loop variable is not overwritten)
        metric_df = pd.read_sql(query, connection)

        # append df to list
        df_list.append(metric_df)

# unpack list into separate DFs
ntd_pmt, ntd_upt, ntd_vrh = df_list

In [None]:
# get districts for ntd ID

# the query does not depend on the metric, so it only needs to run once
query = """
SELECT
  `mart_transit_database.dim_organizations`.`key` AS `key`,
  `mart_transit_database.dim_organizations`.`source_record_id` AS `source_record_id`,
  `mart_transit_database.dim_organizations`.`name` AS `name`,
  `mart_transit_database.dim_organizations`.`ntd_id_2022` AS `ntd_id_2022`,
  `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`county_geography_name` AS `county`,
  `Dim_County_Geography___County_Geography_Key`.`caltrans_district` AS `caltrans_district`
FROM
  `mart_transit_database.dim_organizations`
  LEFT JOIN `mart_transit_database.bridge_organizations_x_headquarters_county_geography` AS `Bridge_Organizations_X_Headquarters_County_Geography___Key` ON `mart_transit_database.dim_organizations`.`key` = `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`organization_key`
  LEFT JOIN `mart_transit_database.dim_county_geography` AS `Dim_County_Geography___County_Geography_Key` ON `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`county_geography_key` = `Dim_County_Geography___County_Geography_Key`.`key`
WHERE
  (
    `mart_transit_database.dim_organizations`.`_is_current` = TRUE
  )
  AND (
    `mart_transit_database.dim_organizations`.`ntd_id_2022` IS NOT NULL
  )
  AND (
    (
      `mart_transit_database.dim_organizations`.`ntd_id_2022` <> ''
    )
    OR (
      `mart_transit_database.dim_organizations`.`ntd_id_2022` IS NULL
    )
  )
  AND (
    `Bridge_Organizations_X_Headquarters_County_Geography___Key`.`_is_current` = TRUE
  )
  AND (
    `Dim_County_Geography___County_Geography_Key`.`_is_current` = TRUE
  )
"""

with db_engine.connect() as connection:
    # create df
    ntd_id_x_district = pd.read_sql(query, connection)
        
ntd_id_x_district["caltrans_district"] = ntd_id_x_district["caltrans_district"].astype("str")

In [None]:
ntd_id_x_district.info()

In [None]:
merge_on_col = [
    "ntd_id",
    "year",
    "source_agency",
    "agency_status",
    "primary_uza_name",
    "uza_population",
    "uza_area_sq_miles",
    "mode",
    "service",
    "reporter_type",
]

merge_1 = ntd_vrh.merge(ntd_upt, on=merge_on_col, how="inner")
# merge_2 = merge_1.merge(ntd_vrh, on=merge_on_col, how = "inner")

ntd_metrics_merge = merge_1.merge(ntd_pmt, on=merge_on_col, how="inner")

In [None]:
ntd_metrics_merge.info()

### data from other report

In [None]:
gcs_path = "gs://calitp-analytics-data/data-analyses/ntd/"
ntd_name = "ntd_operator_data_18_23.parquet"

ntd_all_metrics = pd.read_parquet(f"{gcs_path}{ntd_name}")

### compare datasets

In [None]:
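# mode/service is aggregated up in ntd_all_metrics
# .info() prints its summary and returns None, so call it directly instead of wrapping it in display()
ntd_all_metrics.info()
ntd_metrics_merge.info()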

In [None]:
display(
    ntd_all_metrics["ntd_id"].nunique()
    == ntd_metrics_merge["ntd_id"].nunique(),  # TRUE, same count of unique values
    set(ntd_all_metrics["ntd_id"].unique())
    == set(ntd_metrics_merge["ntd_id"].unique()),  # TRUE, same unique NTD_IDs
)

In [None]:
metric_cols = ["total_upt", "total_vrh", "total_pmt"]

for metric in metric_cols:
    print(
        ntd_all_metrics[metric].sum() == ntd_metrics_merge[metric].sum()
    )  # TRUE sum of each metrics are equal
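The totals are floats, so an exact `==` comparison can report a mismatch that is only rounding noise, for example when the two frames hold the same values in a different row order. A minimal sketch of a tolerance-based check follows. The helper name `sums_match` and the toy frames are hypothetical and stand in for `ntd_all_metrics` and `ntd_metrics_merge`.

```python
import math

import pandas as pd

# Toy frames (invented values) standing in for the two datasets being compared.
left = pd.DataFrame({"total_upt": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"total_upt": [0.3, 0.2, 0.1]})  # same values, different order


def sums_match(a, b, col, rel_tol=1e-9):
    """Return True when the column sums agree within a relative tolerance."""
    return math.isclose(a[col].sum(), b[col].sum(), rel_tol=rel_tol)


print(sums_match(left, right, "total_upt"))
```

The tolerance is a judgment call; `1e-9` is an assumed default, not a value taken from this analysis.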

### merge in the district numbers to `ntd_metrics_merge`

In [None]:
ntd_metrics_merge = ntd_metrics_merge.merge(
    ntd_id_x_district[["ntd_id_2022","county","caltrans_district"]],
    left_on = "ntd_id",
    right_on = "ntd_id_2022",
    how="left",
    indicator=True
)

In [None]:
ntd_metrics_merge.info()

In [None]:
ntd_metrics_merge[ntd_metrics_merge["_merge"]=="left_only"]["source_agency"].value_counts().head()

In [None]:
ntd_metrics_merge[ntd_metrics_merge["caltrans_district"].isna()].head()
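The `_merge` indicator column can also be summarized in one step to see how many rows found a district and which agencies did not. The sketch below uses invented toy frames in place of `ntd_metrics_merge` and `ntd_id_x_district`. The `validate="many_to_one"` argument is an optional extra assumed here: it makes pandas raise an error if an NTD ID appears more than once in the crosswalk, which would otherwise duplicate metric rows.

```python
import pandas as pd

# Toy stand-ins (invented values) for the metrics table and the district crosswalk.
metrics = pd.DataFrame(
    {
        "ntd_id": ["90001", "90002", "90003"],
        "source_agency": ["Agency A", "Agency B", "Agency C"],
        "total_upt": [10.0, 20.0, 30.0],
    }
)
crosswalk = pd.DataFrame(
    {"ntd_id_2022": ["90001", "90002"], "caltrans_district": ["04", "07"]}
)

merged = metrics.merge(
    crosswalk,
    left_on="ntd_id",
    right_on="ntd_id_2022",
    how="left",
    indicator=True,
    validate="many_to_one",  # raise if the crosswalk repeats an NTD ID
)

# Row counts per merge outcome, and the agencies left without a district.
match_counts = merged["_merge"].value_counts()
unmatched_agencies = sorted(
    merged.loc[merged["_merge"] == "left_only", "source_agency"].unique()
)
print(match_counts)
print(unmatched_agencies)
```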

## save out data

In [2]:
gcs_path = "gs://calitp-analytics-data/data-analyses/ntd/"
# ntd_metrics_merge.to_parquet(f"{gcs_path}puc_analysis_data.parquet")

### read in cleaned data

In [3]:
ntd_metrics_merge = pd.read_parquet(f"{gcs_path}puc_analysis_data.parquet")
ntd_metrics_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4152 entries, 0 to 4151
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   ntd_id             4002 non-null   object  
 1   source_agency      4152 non-null   object  
 2   agency_status      4152 non-null   object  
 3   primary_uza_name   4152 non-null   object  
 4   uza_population     4152 non-null   int64   
 5   uza_area_sq_miles  4152 non-null   float64 
 6   year               4152 non-null   int64   
 7   mode               4152 non-null   object  
 8   service            4152 non-null   object  
 9   reporter_type      4152 non-null   object  
 10  total_vrh          2623 non-null   float64 
 11  total_upt          2623 non-null   float64 
 12  total_pmt          2623 non-null   float64 
 13  ntd_id_2022        3798 non-null   object  
 14  county             3798 non-null   object  
 15  caltrans_district  3798 non-null   object  
 16  _merge

**everything matches, moving forward with `ntd_metrics_merge` since it has mode/service**

## Group aggregations

In [4]:
# melt big DF so the metric columns are stacked into one "metric" column (wide to long)
group_list = [
    "source_agency",
    "year",
    "ntd_id",
    "reporter_type",
    "caltrans_district"
]
value_cols = ["total_upt", "total_vrh", "total_pmt"]

melt = pd.melt(
    ntd_metrics_merge,
    id_vars=group_list,
    value_vars=value_cols,
    var_name="metric",
    value_name="metric_value",
    ignore_index=True,
)
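To make the reshape concrete, here is a small wide-to-long example with invented values. It uses the same `pd.melt` arguments as the cell above, just on two metrics and two agencies.

```python
import pandas as pd

# Toy wide table (invented values): one row per agency-year, one column per metric.
wide = pd.DataFrame(
    {
        "source_agency": ["Agency A", "Agency B"],
        "year": [2022, 2022],
        "total_upt": [100.0, 200.0],
        "total_vrh": [10.0, 20.0],
    }
)

# Long table: one row per agency-year-metric.
long = pd.melt(
    wide,
    id_vars=["source_agency", "year"],
    value_vars=["total_upt", "total_vrh"],
    var_name="metric",
    value_name="metric_value",
)
print(long)
```

Each input row becomes one output row per metric, so two rows and two metrics give four rows.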

In [5]:
# What does group/agg the melted DF look like?
vrh_total = (
    melt[melt["metric"] == "total_vrh"]
    .groupby(group_list)["metric_value"]
    .sum()
    .reset_index()
).rename(columns={"metric_value": "total_vrh"})

upt_total = (
    melt[melt["metric"] == "total_upt"]
    .groupby(group_list)["metric_value"]
    .sum()
    .reset_index()
).rename(columns={"metric_value": "total_upt"})

passenger_total = (
    melt[melt["metric"] == "total_pmt"]
    .groupby(group_list)["metric_value"]
    .sum()
    .reset_index()
).rename(columns={"metric_value": "total_pmt"})

yearly_totals = (
    ntd_metrics_merge.groupby(["year"])
    .agg({"total_upt": "sum", "total_vrh": "sum", "total_pmt": "sum"})
    .reset_index()
) 

agency_totals = (
    ntd_metrics_merge.groupby(["year","source_agency"])
    .agg({"total_upt": "sum", "total_vrh": "sum", "total_pmt": "sum"})
    .reset_index()
)

district_totals = (
    ntd_metrics_merge.groupby(["caltrans_district","year"])
    .agg({"total_upt": "sum", "total_vrh": "sum", "total_pmt": "sum"})
    .reset_index()
)

mode_totals = (
    ntd_metrics_merge.groupby(["mode","year"])
    .agg({"total_upt": "sum", "total_vrh": "sum", "total_pmt": "sum"})
    .reset_index()
)
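One thing to watch in these aggregations: by default pandas `groupby` drops rows whose group key is missing. The `.info()` output above shows `caltrans_district` with 3798 non-null values and `ntd_id` with 4002, out of 4152 rows, so the district and agency groupings may leave some rows out. The sketch below shows the effect on invented data. The helper `sum_metrics` and its `keep_missing_keys` flag are hypothetical names, and `dropna=` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy data (invented): the last row has no district assigned.
toy = pd.DataFrame(
    {
        "caltrans_district": ["04", "04", None],
        "total_upt": [100.0, 50.0, 25.0],
    }
)


def sum_metrics(df, group_cols, metric_cols, keep_missing_keys=False):
    """Sum metric columns per group, optionally keeping rows with a missing key."""
    return (
        df.groupby(group_cols, dropna=not keep_missing_keys)[metric_cols]
        .sum()
        .reset_index()
    )


default_totals = sum_metrics(toy, ["caltrans_district"], ["total_upt"])
all_totals = sum_metrics(
    toy, ["caltrans_district"], ["total_upt"], keep_missing_keys=True
)
print(default_totals)
print(all_totals)
```

Comparing the grand total of each result against the ungrouped column sum is a quick way to see whether any rows were dropped.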

In [6]:
# how many rows have zero PMT?
len(passenger_total[passenger_total["total_pmt"] == 0])

877
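This count may mix two different cases. The `.info()` output above shows `total_pmt` with 2623 non-null values out of 4152 rows, and a grouped `.sum()` returns 0 for a group whose values are all missing. A minimal sketch on invented data shows how `min_count=1` keeps those groups as missing instead of zero.

```python
import pandas as pd

# Toy data (invented): Agency A reported zeros, Agency B reported nothing.
toy = pd.DataFrame(
    {
        "source_agency": ["Agency A", "Agency A", "Agency B", "Agency B"],
        "total_pmt": [0.0, 0.0, None, None],
    }
)

# Default: an all-missing group sums to 0, the same as a true zero.
default_sum = toy.groupby("source_agency")["total_pmt"].sum()

# min_count=1: a group needs at least one non-missing value, otherwise NaN.
strict_sum = toy.groupby("source_agency")["total_pmt"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With the stricter sum, rows counted as zero would be limited to agencies that actually reported a zero value.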

### chart function with mean line

In [67]:
def make_chart(data, x_col, y_col, title, color_col=None):
    """Display a line chart of `y_col` over `x_col` with a dashed rule at the mean.

    `color_col` is optional; when given, one line is drawn per value of that column.
    """
    chart = (
        alt.Chart(data)
        .mark_line(point=True)
        .encode(
            x=x_col,
            y=alt.Y(f"{y_col}:Q", title=y_col, axis=alt.Axis(format=",.1f")),
            tooltip=[alt.Tooltip(y_col, format=",.1f")],
            color=color_col if color_col else alt.Undefined,
        )
        .properties(
            title=title,
            height=600,
            width=800,
        )
        .interactive()
    )

    # dashed horizontal rule at the mean of y_col across all rows in `data`
    line = (
        alt.Chart(data)
        .mark_rule(color="red", strokeWidth=1, strokeDash=[10, 5])
        .encode(
            y=alt.Y(f"mean({y_col}):Q"),
            tooltip=[alt.Tooltip(f"mean({y_col}):Q", format=",.1f")],
        )
    )

    # display() returns None, so there is nothing useful to return
    display(chart + line)
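The `mark_rule` layer in `make_chart` encodes only `y`, so it appears to draw one horizontal line at the mean of every row passed in, rather than a separate average for each year as the task list describes. One possible way to get the per-year average is to compute it with pandas first, as sketched below on invented data. The resulting frame could then feed a second, dashed line layer. The output column name `mean_total_upt` is an assumption made for this example.

```python
import pandas as pd

# Toy agency totals (invented values), shaped like agency_totals.
toy = pd.DataFrame(
    {
        "year": [2021, 2021, 2022, 2022],
        "source_agency": ["Agency A", "Agency B", "Agency A", "Agency B"],
        "total_upt": [100.0, 300.0, 200.0, 600.0],
    }
)

# Average across agencies within each year: one row per year.
yearly_mean = (
    toy.groupby("year", as_index=False)["total_upt"]
    .mean()
    .rename(columns={"total_upt": "mean_total_upt"})
)
print(yearly_mean)
```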

## Overall Totals

### Metric grand total per year

In [68]:
for col in yearly_totals.columns[1:]:
    yearly_avg = format(yearly_totals[col].mean(), ",.2f")

    print(f"\nAverage {col} per year: {yearly_avg}")
    make_chart(
        data=yearly_totals,
        y_col=col,
        x_col="year:N",
        title=f"Grand Total {col} per year",
    )

Average total_upt per year: 931,903,599.83

Average total_vrh per year: 41,186,329.17

Average total_pmt per year: 5,381,393,522.33

#### Boxplot of each metric grand total per year

In [14]:
all_totals_dict = {
    "total_vrh": vrh_total,
    "total_upt": upt_total,
    "total_pmt": passenger_total,
}

# Boxplot
# removing zero-values to see what happens
for col, df in all_totals_dict.items():
    box_plot = (
        alt.Chart(df[df[col] != 0])
        .mark_boxplot(extent="min-max")
        .encode(
            x="year:N",
            y=alt.Y(col, axis=alt.Axis(format=",.1f")),
            # row = "reporter_type",
            tooltip=["source_agency", alt.Tooltip(col, format=",.1f"), "year"],
        )
        .interactive()
        .properties(title=col, height=200, width=1000)
    )

    display(
        f"Number of Agencies that reported zero {col}: {df[df[col]==0].ntd_id.nunique()}",
        box_plot.resolve_scale(y="independent"),
    )

'Number of Agencies that reported zero total_vrh: 40'

'Number of Agencies that reported zero total_upt: 40'

'Number of Agencies that reported zero total_pmt: 153'
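The min, average, and max that the box plots are meant to show can also be tabulated directly, which gives exact numbers to quote in the report. A minimal sketch on invented data follows, applying the same zero filter as the box plot cell.

```python
import pandas as pd

# Toy agency-level totals (invented values), shaped like vrh_total.
toy = pd.DataFrame(
    {
        "year": [2021, 2021, 2021, 2022, 2022],
        "total_vrh": [0.0, 10.0, 30.0, 20.0, 40.0],
    }
)

# Drop zero rows (as the box plots do), then summarize per year.
nonzero = toy[toy["total_vrh"] != 0]
summary = nonzero.groupby("year")["total_vrh"].agg(["min", "mean", "max"])
print(summary)
```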

### Metrics grand total by district, per year

In [69]:
for col in district_totals.columns[2:]:
    district_avg = format(district_totals[col].mean(), ",.2f")

    print(f"\nAverage {col} per district by year: {district_avg}")
    make_chart(
        data=district_totals,
        y_col=col,
        x_col="year:N",
        color_col="caltrans_district",
        title=f"{col} by district per year",
    )

Average total_upt per district by year: 77,610,856.25

Average total_vrh per district by year: 3,426,755.53

Average total_pmt per district by year: 448,449,460.19

#### Box Plot of metric per district

In [65]:
# Boxplot
# removing zero-values to see what happens
for col in district_totals.columns[2:]:
    box_plot = (
        alt.Chart(district_totals[district_totals[col] != 0])
        .mark_boxplot(extent="min-max")
        .encode(
            x="caltrans_district:N",
            y=col,
            # row = "reporter_type",
            tooltip=[col, "year"],
        )
        .interactive()
        .properties(title=f"Box Plot of {col} per district", height=200, width=1000)
    )

    display(
        f"Number of rows that reported zero {col}: {district_totals[district_totals[col]==0][col].count()}",
        box_plot.resolve_scale(y="independent"),
    )

'Number of rows that reported zero total_upt: 0'

'Number of rows that reported zero total_vrh: 0'

'Number of rows that reported zero total_pmt: 13'

### Metrics grand total by agency, per year

In [70]:
for col in agency_totals.columns[2:]:
    agency_avg = format(agency_totals[col].mean(), ",.2f")

    print(f"\nAverage {col} per agency by year: {agency_avg}")
    make_chart(
        data = agency_totals,
        y_col = col,
        x_col = "year:N",
        color_col = "source_agency:N",
        title = f"{col} per agency by year"
    )


Average total_upt per agency by year: 3,477,252.24



Average total_vrh per agency by year: 153,680.33



Average total_pmt per agency by year: 20,079,826.58


### Metrics grand total by mode, per year

In [71]:
for col in mode_totals.columns[2:]:
    mode_avg = format(mode_totals[col].mean(),",.2f")
    
    print(f"\nAverage {col} per mode by year: {mode_avg}")
    make_chart(
        data = mode_totals,
        y_col = col,
        x_col = 'year:N',
        color_col = "mode:N",
        title = f"{col} per Mode by year",
    )
    


Average total_upt per mode by year: 51,772,422.21



Average total_vrh per mode by year: 2,288,129.40



Average total_pmt per mode by year: 298,966,306.80
