# Quarterly metrics: v1 and v2 parity

* Walk through each stage of the pipeline, from the raw BigQuery (BQ) data to the processed data, and compare v1 to v2.
* `district` is now determined at the route level instead of the `itp_id` level.
* A geospatial overlay finds the district that each route mostly overlaps with.
* The reason for the change: v1 matched `itp_id` to get `caltrans_district`, but warehouse v2 has no corresponding table to go from `feed_key` to `caltrans_district`.
* Determining `district` geospatially makes the missing lookup table matter less, but it does mean that a single operator can have routes in different districts.
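A minimal sketch of the "mostly overlaps" rule, assuming the geospatial overlay (for example an intersection from `geopandas.overlay`) has already produced one row per route-district piece. The `overlay` table, its values, and the `overlap_length` column are hypothetical and only illustrate the selection step, not the actual pipeline.

```python
import pandas as pd

# Hypothetical overlay result: one row per (route, district) piece,
# with the length of the route that falls inside that district.
overlay = pd.DataFrame({
    "feed_key": ["a", "a", "a", "b"],
    "route_id": ["1", "1", "2", "9"],
    "district": [4, 7, 7, 3],
    "overlap_length": [1200.0, 300.0, 800.0, 500.0],
})

route_cols = ["feed_key", "route_id"]

# Sort so the largest overlap comes first within each route,
# then keep only that first row per route.
district_by_route = (
    overlay.sort_values(
        route_cols + ["overlap_length"],
        ascending=[True, True, False],
    )
    .drop_duplicates(subset=route_cols)
    .reset_index(drop=True)
    [route_cols + ["district"]]
)

print(district_by_route)
```

In this toy data, operator `a` ends up with one route in district 4 and another in district 7, which is the within-operator split described above.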

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import rt_dates
from IPython.display import Markdown
from update_vars import BUS_SERVICE_GCS, COMPILED_CACHED_GCS 

date_q2 = rt_dates.PMAC["Q2_2022"]
date_q3 = rt_dates.PMAC["Q3_2022"]
date_q4 = rt_dates.PMAC["Q4_2022"]

dates = [date_q2, date_q3, date_q4]



## Source Data Differences
* Both files contain v1 data for the same service date (`date_q2`), but one was downloaded soon after that date (closer to May 2022) and the other in Jan 2023.

In [2]:
trips0 = pd.read_parquet(f"{COMPILED_CACHED_GCS}trips_{date_q2}_all.parquet")
trips1 = pd.read_parquet(f"{COMPILED_CACHED_GCS}trips_{date_q2}_v1.parquet")

In [3]:
trips1.columns

Index(['calitp_itp_id', 'calitp_url_number', 'service_date', 'trip_key',
       'trip_id', 'route_id', 'direction_id', 'shape_id',
       'calitp_extracted_at', 'calitp_deleted_at', 'trip_first_departure_ts',
       'trip_last_arrival_ts', 'service_hours'],
      dtype='object')

In [4]:
trips0.columns

Index(['calitp_itp_id', 'calitp_url_number', 'service_date', 'trip_key',
       'trip_id', 'route_id', 'direction_id', 'shape_id',
       'calitp_extracted_at', 'calitp_deleted_at', 'route_short_name',
       'route_long_name', 'route_desc', 'route_type'],
      dtype='object')

In [5]:
m1 = pd.merge(
    trips0, 
    trips1, 
    on = ["calitp_itp_id", "calitp_url_number",
          "trip_id", "route_id", "shape_id", "direction_id"],
    how = "outer",
    validate = "m:1", # left side must be "m": trips0 repeats these keys (multiple trip_keys)
    indicator=True
)

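The left side of the merge has to be `m` because `trips0` holds multiple `trip_key` values for the same operator-trip combination, and that duplicates the merge keys.

**Finding**: the differences that come from querying the same date at different times can't be resolved. The `trip_key` values may have gone through different cleaning steps between the two downloads.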
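To illustrate why the left side needs `m`, here is a small sketch with made-up frames (the `left` and `right` data are hypothetical, not the real trips tables). When the left frame repeats a merge key, `validate="1:1"` raises `pandas.errors.MergeError`, while `validate="m:1"` passes and `indicator=True` labels where each row came from.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy frames: trip t1 appears twice on the left with different trip_keys.
left = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2"],
    "trip_key": ["k1", "k2", "k3"],
})
right = pd.DataFrame({
    "trip_id": ["t1", "t3"],
    "service_hours": [1.5, 2.0],
})

# A one-to-one check fails because the left keys are not unique.
try:
    pd.merge(left, right, on="trip_id", how="outer", validate="1:1")
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

# Many-to-one passes: only the right side must have unique keys.
merged = pd.merge(
    left, right, on="trip_id", how="outer",
    validate="m:1", indicator=True,
)
counts = merged["_merge"].value_counts()

print(one_to_one_ok)
print(counts)
```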

In [6]:
m1._merge.value_counts()

both          104721
right_only     26325
left_only        213
Name: _merge, dtype: int64

In [7]:
orig = trips0.shape[0]
drop_dups_all = trips0.drop_duplicates().shape[0]
drop_dups_subset = trips0.drop_duplicates(
    subset=["calitp_itp_id", "calitp_url_number",
            "trip_id", "route_id",
            "shape_id", "direction_id"]).shape[0]

print(f"orig # rows: {orig}")
print(f"drop dups across all columns: {drop_dups_all}")
print(f"drop dups based on operator-trip-ids (not keys): {drop_dups_subset}")

orig # rows: 104934
drop dups across all columns: 104836
drop dups based on operator-trip-ids (not keys): 101888
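The gap between the row counts above suggests that some operator-trip combinations carry more than one `trip_key`. One way to list them, sketched on a toy frame with a shortened key (the data and the two-column key are assumptions for illustration):

```python
import pandas as pd

# Shortened operator-trip key, for illustration only.
trip_cols = ["calitp_itp_id", "trip_id"]

trips = pd.DataFrame({
    "calitp_itp_id": [1, 1, 1, 2],
    "trip_id": ["t1", "t1", "t2", "t1"],
    "trip_key": ["k1", "k2", "k3", "k4"],
})

# Count distinct trip_keys per operator-trip combination.
n_keys = trips.groupby(trip_cols)["trip_key"].nunique()

# Keep only the combinations that have more than one trip_key.
multi_key = n_keys[n_keys > 1].reset_index(name="n_trip_keys")

print(multi_key)
```

Run against `trips0` with the full six-column key, the same pattern would show which trips drive the duplicates.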


## Source Data: Trips

* Source: the BigQuery table for that date.
* Downloading the same `selected_date` months ago and downloading it yesterday gives slightly different results.
* Bug fixes have probably been made in the meantime, so the query is not expected to return exactly the same results, even for the same `selected_date`.

In [8]:
def compare_trips_table(date: str):
    display(Markdown(f"### Date: {date}"))
    
    df1 = pd.read_parquet(f"{COMPILED_CACHED_GCS}trips_{date}_v1.parquet")
    df2 = pd.read_parquet(f"{COMPILED_CACHED_GCS}trips_{date}_v2.parquet")
    
    v1_operator_cols = ["calitp_itp_id"]
    v2_operator_cols = ["feed_key"]
    trip_cols = ["trip_id", "route_id", "service_hours"]
    
    df1_pared = df1[v1_operator_cols + trip_cols].drop_duplicates()
    df2_pared = df2[v2_operator_cols + trip_cols].drop_duplicates()
    
    display(Markdown(f"#### Total service hours"))
    print(f"{df1.service_hours.sum(), df2.service_hours.sum()}")
    
    display(Markdown(f"#### Total service hours for operator (dups dropped)"))
    print(f"{df1_pared.service_hours.sum(), df2_pared.service_hours.sum()}")    
    
    display(Markdown("#### Unique trips"))
    print(f"{len(df1_pared), len(df2_pared)}")
    
    display(Markdown("#### Unique routes"))
    df1_route = df1_pared[v1_operator_cols + ["route_id"]].drop_duplicates()
    df2_route = df2_pared[v2_operator_cols + ["route_id"]].drop_duplicates()
    print(f"{len(df1_route), len(df2_route)}")

In [9]:
for d in dates:
    compare_trips_table(d)

### Date: 2022-05-04

#### Total service hours

(108745.16555555553, 117057.43555555555)


#### Total service hours for operator (dups dropped)

(98052.44027777776, 117054.88555555555)


#### Unique trips

(115945, 143932)


#### Unique routes

(2981, 3481)


### Date: 2022-08-17

#### Total service hours

(109174.03000000001, 101674.71194444444)


#### Total service hours for operator (dups dropped)

(101352.36333333336, 101672.16194444445)


#### Unique trips

(129291, 129955)


#### Unique routes

(2977, 2904)


### Date: 2022-10-12

#### Total service hours

(113752.53305555554, 105073.36277777779)


#### Total service hours for operator (dups dropped)

(98939.19083333333, 105073.36277777779)


#### Unique trips

(126925, 135483)


#### Unique routes

(3135, 3077)
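The raw totals are easier to judge as relative differences. A short sketch that recomputes the v1-to-v2 change in total service hours from the figures printed above (the `totals` table and `pct_diff` column are just a convenience for this comparison):

```python
import pandas as pd

# Total service hours per date, copied from the output above.
totals = pd.DataFrame(
    {
        "v1": [108745.16555555553, 109174.03000000001, 113752.53305555554],
        "v2": [117057.43555555555, 101674.71194444444, 105073.36277777779],
    },
    index=["2022-05-04", "2022-08-17", "2022-10-12"],
)

# Percent difference of v2 relative to v1.
totals["pct_diff"] = (
    (totals["v2"] - totals["v1"]) / totals["v1"] * 100
).round(1)

print(totals)
```

On these figures v2 is higher than v1 for 2022-05-04 and lower for the other two dates.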


## Step 1: create route-level df

In [10]:
def compare_route_df(date: str):
    display(Markdown(f"### Date: {date}"))
    
    df1 = gpd.read_parquet(f"{BUS_SERVICE_GCS}routes_{date}_v1.parquet")
    df2 = gpd.read_parquet(f"{BUS_SERVICE_GCS}routes_{date}_v2.parquet")
    # v2 accidentally includes MTC 511 (Bay Area 511 Schedule), which v1 always excludes; drop it
    df2 = df2[df2.name != "Bay Area 511 Schedule"]
    
    v1_operator_cols = ["calitp_itp_id", "total_routes", "district", "route_id"]
    v2_operator_cols = ["feed_key", "total_routes", "district", "route_id"]
    route_cols = ["service_hours"]
    
    df1_pared = df1[v1_operator_cols+route_cols].drop_duplicates(subset=v1_operator_cols)
    df2_pared = df2[v2_operator_cols+route_cols].drop_duplicates(subset=v2_operator_cols)
    
    display(Markdown(f"#### Total service hours"))
    print(f"{df1.service_hours.sum(), df2.service_hours.sum()}")
    
    display(Markdown(f"#### Total service hours for operator (dups dropped)"))
    print(f"{df1_pared.service_hours.sum(), df2_pared.service_hours.sum()}")    
        
    display(Markdown("#### Unique routes"))
    print(f"{len(df1_pared), len(df2_pared)}")
    
    display(Markdown("#### Total routes"))
    print(f"v1 total routes distribution: {df1_pared.total_routes.describe()}")
    print(f"v2 total routes distribution: {df2_pared.total_routes.describe()}")
    
    display(Markdown("#### District breakdown"))
    print(df1_pared.district.value_counts(), df1_pared.district.value_counts(normalize=True))
    print(df2_pared.district.value_counts(), df2_pared.district.value_counts(normalize=True))

In [11]:
for d in dates:
    compare_route_df(d)

### Date: 2022-05-04

#### Total service hours

(97345.10083333333, 94658.69)


#### Total service hours for operator (dups dropped)

(97345.10083333333, 94658.69)


#### Unique routes

(2739, 2826)


#### Total routes

v1 total routes distribution: count    2739.000000
mean       43.059876
std        36.055190
min         1.000000
25%        16.000000
50%        29.000000
75%        62.000000
max       129.000000
Name: total_routes, dtype: float64
v2 total routes distribution: count    2826.000000
mean       47.418967
std        38.709346
min         1.000000
25%        16.000000
50%        36.000000
75%        67.000000
max       129.000000
Name: total_routes, dtype: float64


#### District breakdown

4.0     774
7.0     736
3.0     221
8.0     164
11.0    142
10.0    133
5.0     123
6.0     121
12.0    119
1.0      75
2.0      61
9.0      22
Name: district, dtype: int64 4.0     0.287625
7.0     0.273504
3.0     0.082126
8.0     0.060944
11.0    0.052768
10.0    0.049424
5.0     0.045708
6.0     0.044965
12.0    0.044221
1.0     0.027871
2.0     0.022668
9.0     0.008175
Name: district, dtype: float64
4.0     1155
7.0      588
3.0      215
8.0      163
11.0     142
10.0     133
5.0      123
6.0      112
12.0      59
2.0       38
1.0       36
9.0       14
Name: district, dtype: int64 4.0     0.415767
7.0     0.211663
3.0     0.077394
8.0     0.058675
11.0    0.051116
10.0    0.047876
5.0     0.044276
6.0     0.040317
12.0    0.021238
2.0     0.013679
1.0     0.012959
9.0     0.005040
Name: district, dtype: float64


### Date: 2022-08-17

#### Total service hours

(95755.14694444445, 97555.44527777779)


#### Total service hours for operator (dups dropped)

(95755.14694444445, 97555.44527777779)


#### Unique routes

(2688, 2832)


#### Total routes

v1 total routes distribution: count    2688.000000
mean       40.773065
std        36.017630
min         1.000000
25%        15.000000
50%        27.000000
75%        58.000000
max       131.000000
Name: total_routes, dtype: float64
v2 total routes distribution: count    2832.000000
mean       46.563559
std        39.441825
min         1.000000
25%        15.000000
50%        36.000000
75%        68.000000
max       131.000000
Name: total_routes, dtype: float64


#### District breakdown

4.0     772
7.0     748
3.0     160
11.0    151
8.0     149
10.0    133
12.0    133
6.0     130
5.0     120
1.0      70
2.0      59
9.0      21
Name: district, dtype: int64 4.0     0.291761
7.0     0.282691
3.0     0.060469
11.0    0.057067
8.0     0.056311
10.0    0.050265
12.0    0.050265
6.0     0.049131
5.0     0.045351
1.0     0.026455
2.0     0.022298
9.0     0.007937
Name: district, dtype: float64
4.0     1217
7.0      592
11.0     151
3.0      149
8.0      148
10.0     133
6.0      120
5.0      120
12.0      73
2.0       37
1.0       37
9.0       13
Name: district, dtype: int64 4.0     0.436201
7.0     0.212186
11.0    0.054122
3.0     0.053405
8.0     0.053047
10.0    0.047670
6.0     0.043011
5.0     0.043011
12.0    0.026165
2.0     0.013262
1.0     0.013262
9.0     0.004659
Name: district, dtype: float64


### Date: 2022-10-12

#### Total service hours

(107837.96916666668, 102686.6461111111)


#### Total service hours for operator (dups dropped)

(107837.96916666668, 102686.6461111111)


#### Unique routes

(2918, 3027)


#### Total routes

v1 total routes distribution: count    2918.000000
mean       43.316655
std        35.680828
min         1.000000
25%        16.000000
50%        30.000000
75%        64.000000
max       131.000000
Name: total_routes, dtype: float64
v2 total routes distribution: count    3027.000000
mean       46.379584
std        38.526996
min         1.000000
25%        16.000000
50%        36.000000
75%        65.000000
max       131.000000
Name: total_routes, dtype: float64


#### District breakdown

7.0     835
4.0     800
3.0     220
8.0     180
11.0    151
5.0     142
12.0    132
10.0    129
6.0     129
1.0      78
2.0      59
9.0      19
Name: district, dtype: int64 7.0     0.290536
4.0     0.278358
3.0     0.076548
8.0     0.062630
11.0    0.052540
5.0     0.049408
12.0    0.045929
10.0    0.044885
6.0     0.044885
1.0     0.027140
2.0     0.020529
9.0     0.006611
Name: district, dtype: float64
4.0     1265
7.0      626
3.0      220
8.0      179
11.0     151
5.0      141
10.0     129
6.0      113
12.0      71
1.0       39
2.0       37
9.0       12
Name: district, dtype: int64 4.0     0.424070
7.0     0.209856
3.0     0.073751
8.0     0.060007
11.0    0.050620
5.0     0.047268
10.0    0.043245
6.0     0.037881
12.0    0.023802
1.0     0.013074
2.0     0.012404
9.0     0.004023
Name: district, dtype: float64
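The district breakdowns are hard to compare when v1 and v2 print one after the other. A sketch that puts the two `value_counts` results side by side, using the 2022-05-04 counts from the output above (the `compare` frame and its column names are assumptions for illustration):

```python
import pandas as pd

# Route counts by district for 2022-05-04, copied from the output above.
v1 = pd.Series({4: 774, 7: 736, 3: 221, 8: 164, 11: 142, 10: 133,
                5: 123, 6: 121, 12: 119, 1: 75, 2: 61, 9: 22})
v2 = pd.Series({4: 1155, 7: 588, 3: 215, 8: 163, 11: 142, 10: 133,
                5: 123, 6: 112, 12: 59, 2: 38, 1: 36, 9: 14})

# Align on district and show counts, change, and shares together.
compare = pd.concat([v1, v2], axis=1, keys=["v1", "v2"]).sort_index()
compare["change"] = compare["v2"] - compare["v1"]
compare["v1_share"] = (compare["v1"] / compare["v1"].sum()).round(3)
compare["v2_share"] = (compare["v2"] / compare["v2"].sum()).round(3)

print(compare)
```

For this date the largest shifts are in district 4, which gains 381 routes in v2, and district 7, which loses 148.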


## Category 1: Routes on SHN

In [12]:
def compare_routes_on_shn(date: str):
    display(Markdown(f"### Date: {date}"))
    
    df1 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}routes_on_shn_{date}_v1.parquet")
    df2 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}routes_on_shn_{date}_v2.parquet")
    df2 = df2[df2.name != "Bay Area 511 Schedule"]
    
    display(Markdown(f"#### Rows"))
    print(f"{len(df1), len(df2)}")

    display(Markdown(f"#### Distribution of Parallel"))
    
    print(df1.parallel.value_counts())
    print(df1.parallel.value_counts(normalize=True))
    print(df2.parallel.value_counts())
    print(df2.parallel.value_counts(normalize=True))

In [13]:
for d in dates:
    compare_routes_on_shn(d)

### Date: 2022-05-04

#### Rows

(6814, 7215)


#### Distribution of Parallel

0    5999
1     815
Name: parallel, dtype: int64
0    0.880393
1    0.119607
Name: parallel, dtype: float64
0    6402
1     813
Name: parallel, dtype: int64
0    0.887318
1    0.112682
Name: parallel, dtype: float64


### Date: 2022-08-17

#### Rows

(6650, 7075)


#### Distribution of Parallel

0    5825
1     825
Name: parallel, dtype: int64
0    0.87594
1    0.12406
Name: parallel, dtype: float64
0    6242
1     833
Name: parallel, dtype: int64
0    0.882261
1    0.117739
Name: parallel, dtype: float64


### Date: 2022-10-12

#### Rows

(7220, 7561)


#### Distribution of Parallel

0    6349
1     871
Name: parallel, dtype: int64
0    0.879363
1    0.120637
Name: parallel, dtype: float64
0    6716
1     845
Name: parallel, dtype: int64
0    0.888242
1    0.111758
Name: parallel, dtype: float64


In [14]:
def compare_parallel_or_intersecting(date: str):
    display(Markdown(f"### Date: {date}"))
    
    df1 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}parallel_or_intersecting_{date}_v1.parquet")
    df2 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}parallel_or_intersecting_{date}_v2.parquet")
    df2 = df2[df2.name != "Bay Area 511 Schedule"]
    
    display(Markdown(f"#### Rows"))
    print(f"{len(df1), len(df2)}")

    display(Markdown(f"#### Distribution of Parallel"))
    print(df1.parallel.value_counts())
    print(df1.parallel.value_counts(normalize=True))
    print(df2.parallel.value_counts())
    print(df2.parallel.value_counts(normalize=True))

In [15]:
for d in dates:
    compare_parallel_or_intersecting(d)

### Date: 2022-05-04

#### Rows

(9269, 10013)


#### Distribution of Parallel

0    5918
1    3351
Name: parallel, dtype: int64
0    0.638472
1    0.361528
Name: parallel, dtype: float64
0    6472
1    3541
Name: parallel, dtype: int64
0    0.64636
1    0.35364
Name: parallel, dtype: float64


### Date: 2022-08-17

#### Rows

(9006, 9768)


#### Distribution of Parallel

0    5694
1    3312
Name: parallel, dtype: int64
0    0.632245
1    0.367755
Name: parallel, dtype: float64
0    6157
1    3611
Name: parallel, dtype: int64
0    0.630324
1    0.369676
Name: parallel, dtype: float64


### Date: 2022-10-12

#### Rows

(9823, 10539)


#### Distribution of Parallel

0    6232
1    3591
Name: parallel, dtype: int64
0    0.634429
1    0.365571
Name: parallel, dtype: float64
0    6723
1    3816
Name: parallel, dtype: int64
0    0.637916
1    0.362084
Name: parallel, dtype: float64


## Processed Data: Routes Categorized

In [16]:
def compare_routes_categorized(date: str):
    display(Markdown(f"### Date: {date}"))
    
    df1 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}routes_categorized_{date}_v1.parquet")
    df2 = gpd.read_parquet(
        f"{BUS_SERVICE_GCS}routes_categorized_{date}_v2.parquet")
    
    df2 = df2[df2.name != "Bay Area 511 Schedule"]
    
    display(Markdown(f"#### Rows"))
    print(f"{len(df1), len(df2)}")

    display(Markdown(f"#### Distribution of Categories"))
    print(df1.category.value_counts())
    print(df1.category.value_counts(normalize=True))
    print(df2.category.value_counts())
    print(df2.category.value_counts(normalize=True))
    
    display(Markdown(f"#### Distribution of District...now done with overlay"))
    print(df1.district.value_counts())
    print(df1.district.value_counts(normalize=True))
    print(df2.district.value_counts())
    print(df2.district.value_counts(normalize=True))

In [19]:
for d in dates:
    compare_routes_categorized(d)

### Date: 2022-05-04

#### Rows

(2739, 2826)


#### Distribution of Categories

intersects_shn    1510
on_shn             669
other              560
Name: category, dtype: int64
intersects_shn    0.551296
on_shn            0.244250
other             0.204454
Name: category, dtype: float64
intersects_shn    1579
on_shn             670
other              577
Name: category, dtype: int64
intersects_shn    0.558740
on_shn            0.237084
other             0.204176
Name: category, dtype: float64


#### Distribution of District...now done with overlay

4.0     774
7.0     736
3.0     221
8.0     164
11.0    142
10.0    133
5.0     123
6.0     121
12.0    119
1.0      75
2.0      61
9.0      22
Name: district, dtype: int64
4.0     0.287625
7.0     0.273504
3.0     0.082126
8.0     0.060944
11.0    0.052768
10.0    0.049424
5.0     0.045708
6.0     0.044965
12.0    0.044221
1.0     0.027871
2.0     0.022668
9.0     0.008175
Name: district, dtype: float64
4.0     1155
7.0      588
3.0      215
8.0      163
11.0     142
10.0     133
5.0      123
6.0      112
12.0      59
2.0       38
1.0       36
9.0       14
Name: district, dtype: int64
4.0     0.415767
7.0     0.211663
3.0     0.077394
8.0     0.058675
11.0    0.051116
10.0    0.047876
5.0     0.044276
6.0     0.040317
12.0    0.021238
2.0     0.013679
1.0     0.012959
9.0     0.005040
Name: district, dtype: float64


### Date: 2022-08-17

#### Rows

(2688, 2832)


#### Distribution of Categories

intersects_shn    1474
on_shn             673
other              541
Name: category, dtype: int64
intersects_shn    0.548363
on_shn            0.250372
other             0.201265
Name: category, dtype: float64
intersects_shn    1589
on_shn             684
other              559
Name: category, dtype: int64
intersects_shn    0.561088
on_shn            0.241525
other             0.197387
Name: category, dtype: float64


#### Distribution of District...now done with overlay

4.0     772
7.0     748
3.0     160
11.0    151
8.0     149
10.0    133
12.0    133
6.0     130
5.0     120
1.0      70
2.0      59
9.0      21
Name: district, dtype: int64
4.0     0.291761
7.0     0.282691
3.0     0.060469
11.0    0.057067
8.0     0.056311
10.0    0.050265
12.0    0.050265
6.0     0.049131
5.0     0.045351
1.0     0.026455
2.0     0.022298
9.0     0.007937
Name: district, dtype: float64
4.0     1217
7.0      592
11.0     151
3.0      149
8.0      148
10.0     133
6.0      120
5.0      120
12.0      73
2.0       37
1.0       37
9.0       13
Name: district, dtype: int64
4.0     0.436201
7.0     0.212186
11.0    0.054122
3.0     0.053405
8.0     0.053047
10.0    0.047670
6.0     0.043011
5.0     0.043011
12.0    0.026165
2.0     0.013262
1.0     0.013262
9.0     0.004659
Name: district, dtype: float64


### Date: 2022-10-12

#### Rows

(2918, 3027)


#### Distribution of Categories

intersects_shn    1617
on_shn             708
other              593
Name: category, dtype: int64
intersects_shn    0.554147
on_shn            0.242632
other             0.203221
Name: category, dtype: float64
intersects_shn    1723
on_shn             698
other              606
Name: category, dtype: int64
intersects_shn    0.569210
on_shn            0.230591
other             0.200198
Name: category, dtype: float64


#### Distribution of District...now done with overlay

7.0     835
4.0     800
3.0     220
8.0     180
11.0    151
5.0     142
12.0    132
10.0    129
6.0     129
1.0      78
2.0      59
9.0      19
Name: district, dtype: int64
7.0     0.290536
4.0     0.278358
3.0     0.076548
8.0     0.062630
11.0    0.052540
5.0     0.049408
12.0    0.045929
10.0    0.044885
6.0     0.044885
1.0     0.027140
2.0     0.020529
9.0     0.006611
Name: district, dtype: float64
4.0     1265
7.0      626
3.0      220
8.0      179
11.0     151
5.0      141
10.0     129
6.0      113
12.0      71
1.0       39
2.0       37
9.0       12
Name: district, dtype: int64
4.0     0.424070
7.0     0.209856
3.0     0.073751
8.0     0.060007
11.0    0.050620
5.0     0.047268
10.0    0.043245
6.0     0.037881
12.0    0.023802
1.0     0.013074
2.0     0.012404
9.0     0.004023
Name: district, dtype: float64
