# Sanity check: v1 to v2 warehouse: HQTA exports

### Summary
1. **Distribution of `hqta_type` and `hqta_details` must be similar**

The distributions look similar; the percentages don't shift much between v1 and v2.

2. **Columns are nearly identical**

We drop 2 columns, `calitp_itp_id_primary` and `calitp_itp_id_secondary`, without providing a replacement. Maybe we should add `base64_url` as the identifier that isn't a string?

3. **Number of operators in final export**

There are fewer operators in the v2 warehouse, and that seems to check out. Pulling all scheduled operators gave only 187, and only 140 remained once `trips` was downloaded. About 130 make it to the final export.
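To see which operators account for the drop, one option is a set difference on `agency_name_primary`. The sketch below is illustrative only: `operators_dropped` is a hypothetical helper, and the toy frames stand in for the December and January exports. It also assumes agency names are spelled the same way in both warehouses, which may not hold, so a renamed agency would show up as dropped.

```python
import pandas as pd


def operators_dropped(v1_df, v2_df, col="agency_name_primary"):
    """Return sorted names found in v1_df[col] but not in v2_df[col]."""
    v1_names = set(v1_df[col].dropna().unique())
    v2_names = set(v2_df[col].dropna().unique())
    return sorted(v1_names - v2_names)


# Toy stand-ins for the v1 and v2 exports (hypothetical agency names).
v1_toy = pd.DataFrame(
    {"agency_name_primary": ["Agency A", "Agency B", "Agency C", "Agency A"]}
)
v2_toy = pd.DataFrame({"agency_name_primary": ["Agency A", "Agency C"]})

dropped = operators_dropped(v1_toy, v2_toy)
print(dropped)
```

With the real data, this could be called as `operators_dropped(dec_stops, jan_stops)` once both exports are loaded.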

In [1]:
import geopandas as gpd
import pandas as pd

from utilities import GCS_FILE_PATH
from shared_utils import rt_dates

dec_date = rt_dates.DATES['dec2022']
jan_date = rt_dates.DATES['jan2023']

V1 = f"{GCS_FILE_PATH}export/{dec_date}/"
V2 = f"{GCS_FILE_PATH}export/{jan_date}/"



In [2]:
import operators_for_hqta

valid_operators_dict = operators_for_hqta.feed_keys_from_json()

print(f"# feeds going into hqta with complete info: {len(valid_operators_dict)}")

# feeds going into hqta with complete info: 140


## HQTA Stops

In [3]:
filename = "ca_hq_transit_stops.parquet"

dec_stops = gpd.read_parquet(f"{V1}{filename}")
jan_stops = gpd.read_parquet(f"{V2}{filename}")

In [4]:
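def compare_stats(df1, df2):
    """Print columns, shapes, operator counts, and HQTA category distributions for the v1 (df1) and v2 (df2) exports."""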
    print("Columns / shapes")
    print("---------------------------*")
    print(f"{df1.columns}, df1 shape: {df1.shape}")
    print(f"{df2.columns}, df2 shape: {df2.shape}")
    print("\n")
    
    print("Column differences")
    v1_cols = list(df1.columns)
    v2_cols = list(df2.columns)
    print(f"columns in v1 not in v2: {set(v1_cols).difference(v2_cols)}")
    print(f"columns in v2 not in v1: {set(v2_cols).difference(v1_cols)}")
    print("\n")
    
    col = "agency_name_primary"
    print(col)
    print("---------------------------*")
    print(df1[col].nunique())
    print(df2[col].nunique())
    print("\n")

    col = "hqta_type"
    print(col)
    print("---------------------------*")
    print(df1[col].value_counts(), df1[col].value_counts(normalize=True))
    print(df2[col].value_counts(), df2[col].value_counts(normalize=True))
    print("\n")  

    col = "hqta_details"
    print(col)
    print("---------------------------*")
    print(df1[col].value_counts(), df1[col].value_counts(normalize=True))
    print(df2[col].value_counts(), df2[col].value_counts(normalize=True))
    print("\n")
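
Since `compare_stats` prints the two distributions one after the other, shifts have to be compared by eye. A possible companion, sketched here under assumptions, puts the shares side by side and reports the change in percentage points. `compare_shares` is a hypothetical helper that is not part of the original notebook, and the toy Series only imitate the `hqta_type` column.

```python
import pandas as pd


def compare_shares(s1, s2):
    """Side-by-side category shares for two Series, plus change in percentage points."""
    shares = pd.concat(
        [
            s1.value_counts(normalize=True).rename("v1_share"),
            s2.value_counts(normalize=True).rename("v2_share"),
        ],
        axis=1,
    ).fillna(0)  # a category missing from one export gets a share of 0
    shares["pp_change"] = (shares["v2_share"] - shares["v1_share"]) * 100
    return shares


# Toy data (hypothetical), not the real exports.
v1_types = pd.Series(["bus", "bus", "bus", "rail"])
v2_types = pd.Series(["bus", "bus", "rail", "ferry"])

table = compare_shares(v1_types, v2_types)
max_shift = table["pp_change"].abs().max()
print(table)
print(f"largest shift: {max_shift:.1f} percentage points")
```

With the real data, `compare_shares(dec_stops["hqta_type"], jan_stops["hqta_type"])` would give one table per column, and the largest absolute `pp_change` gives a single number to back the "not shifting much" judgment.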


In [5]:
compare_stats(dec_stops, jan_stops)

Columns / shapes
---------------------------*
Index(['calitp_itp_id_primary', 'calitp_itp_id_secondary', 'stop_id',
       'geometry', 'hqta_type', 'route_id', 'agency_name_primary',
       'agency_name_secondary', 'hqta_details'],
      dtype='object'), df1 shape: (65729, 9)
Index(['agency_name_primary', 'hqta_type', 'stop_id', 'route_id',
       'hqta_details', 'agency_name_secondary', 'geometry'],
      dtype='object'), df2 shape: (50099, 7)


Column differences
columns in v1 not in v2: {'calitp_itp_id_secondary', 'calitp_itp_id_primary'}
columns in v2 not in v1: set()


agency_name_primary
---------------------------*
163
129


hqta_type
---------------------------*
hq_corridor_bus     45128
major_stop_bus      18988
major_stop_rail      1385
major_stop_brt        207
major_stop_ferry       21
Name: hqta_type, dtype: int64 hq_corridor_bus     0.686577
major_stop_bus      0.288883
major_stop_rail     0.021071
major_stop_brt      0.003149
major_stop_ferry    0.000319
Name: hqta_type,

## HQTA Areas

In [6]:
filename = "ca_hq_transit_areas.parquet"

dec_areas = gpd.read_parquet(f"{V1}{filename}")
jan_areas = gpd.read_parquet(f"{V2}{filename}")

In [7]:
compare_stats(dec_areas, jan_areas)

Columns / shapes
---------------------------*
Index(['calitp_itp_id_primary', 'calitp_itp_id_secondary',
       'agency_name_primary', 'agency_name_secondary', 'hqta_type',
       'hqta_details', 'route_id', 'geometry'],
      dtype='object'), df1 shape: (22818, 8)
Index(['agency_name_primary', 'agency_name_secondary', 'hqta_type',
       'hqta_details', 'route_id', 'geometry'],
      dtype='object'), df2 shape: (18312, 6)


Column differences
columns in v1 not in v2: {'calitp_itp_id_secondary', 'calitp_itp_id_primary'}
columns in v2 not in v1: set()


agency_name_primary
---------------------------*
110
106


hqta_type
---------------------------*
major_stop_bus      18988
hq_corridor_bus      2217
major_stop_rail      1385
major_stop_brt        207
major_stop_ferry       21
Name: hqta_type, dtype: int64 major_stop_bus      0.832150
hq_corridor_bus     0.097160
major_stop_rail     0.060698
major_stop_brt      0.009072
major_stop_ferry    0.000920
Name: hqta_type, dtype: float64
major_