# `mart_gtfs` swaps: sanity checks

As new tables are added to the warehouse, the analytics pipeline scripts can start further downstream. See https://github.com/cal-itp/data-analyses/issues/1345.

In the Makefile, these steps can be reduced significantly:
```
preprocess_vp:
    python vp_keep_usable.py
    python cleanup.py
    python vp_dwell_time.py    
```

Check that counts from Feb 2025 roughly match the prior handful of months.
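
One way to run this check, as a minimal sketch: compare the new month's row count against the mean of the earlier months and flag a large relative difference. The function name `counts_roughly_match`, the 10% tolerance, the month keys other than `feb2025`, and all of the counts are illustrative assumptions, not values from the warehouse.

```python
from statistics import mean


def counts_roughly_match(
    counts: dict,
    new_month: str,
    tolerance: float = 0.10
) -> bool:
    """
    Return True if the new month's count is within `tolerance`
    (relative difference) of the mean of the other months.
    """
    prior = [n for month, n in counts.items() if month != new_month]
    prior_mean = mean(prior)
    pct_diff = abs(counts[new_month] - prior_mean) / prior_mean
    return pct_diff <= tolerance


# Made-up row counts, for illustration only
vp_counts = {
    "nov2024": 1_000_000,
    "dec2024": 950_000,
    "jan2025": 1_050_000,
    "feb2025": 1_020_000,
}

is_ok = counts_roughly_match(vp_counts, "feb2025")
```

The tolerance is a judgment call; a tighter value would flag ordinary month-to-month variation.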

Three scripts in `gtfs_funnel` are affected:
1. `vp_keep_usable.py` -- paring down to valid trips can be consolidated into `vp_condenser.py`
1. `cleanup.py` -- removes batched download files for `raw_vp`; we still need this
1. `vp_dwell_time.py` -- this script's output is equivalent to `mart_gtfs.fct_vehicle_locations_grouped`
   * need to work backwards to include paring down to valid trips and generating a `vp_idx`
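
To make the third item concrete, here is a minimal sketch of what grouping vehicle positions by dwell could look like: consecutive pings at the same position within a trip collapse into one row that keeps the first timestamp, the last timestamp, and the number of pings (`n_vp`). This is an assumption about the grouping logic, not the actual implementation of `vp_dwell_time.py` or `fct_vehicle_locations_grouped`; the function `group_by_dwell`, the `x`/`y` columns, and the toy data are hypothetical.

```python
import pandas as pd


def group_by_dwell(vp: pd.DataFrame) -> pd.DataFrame:
    """
    Collapse consecutive pings at the same (x, y) within a trip into one row.
    """
    vp = vp.sort_values(
        ["trip_instance_key", "location_timestamp_local_sec"]
    ).reset_index(drop=True)

    # A new group starts whenever the trip or the position
    # differs from the previous row.
    cols = ["trip_instance_key", "x", "y"]
    changed = (vp[cols] != vp[cols].shift()).any(axis=1)
    vp = vp.assign(vp_group=changed.cumsum() - 1)

    grouped = vp.groupby(
        ["trip_instance_key", "vp_group"], as_index=False
    ).agg(
        x=("x", "first"),
        y=("y", "first"),
        location_timestamp_local_sec=("location_timestamp_local_sec", "min"),
        moving_timestamp_local_sec=("location_timestamp_local_sec", "max"),
        n_vp=("location_timestamp_local_sec", "count"),
    )
    return grouped


# Toy pings: the vehicle sits at (1.0, 1.0) for 3 pings, then moves on.
toy_vp = pd.DataFrame({
    "trip_instance_key": ["a"] * 5,
    "x": [0.0, 1.0, 1.0, 1.0, 2.0],
    "y": [0.0, 1.0, 1.0, 1.0, 2.0],
    "location_timestamp_local_sec": [0, 20, 40, 60, 80],
})

toy_grouped = group_by_dwell(toy_vp)
```

Under this logic, the `n_vp` values for a trip should always sum to that trip's raw ping count, which gives a cheap invariant to test against.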

In `gtfs_analytics_data.yml`:
- `raw_vp`: keep the existing filename `vp_{analysis_date}`, because it matches `fct_vehicle_locations`
- we will use `raw_vp` and replace the filename value with `vp_grouped_{analysis_date}`
- remove the `usable_vp` key
- replace 

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import rt_dates
from segment_speed_utils import segment_calcs
from update_vars import GTFS_DATA_DICT, SEGMENT_GCS

analysis_date = rt_dates.DATES["feb2025"]

## compare `vp` and `vp_grouped`

* Look at Long Beach and Culver City trips.
* Full results are produced in `raw_vp_comparison.py`.
* We weren't able to reproduce the same results in the nearest neighbor step.
* Let's check the 2 raw upstream tables to see where the differences are.
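
As a first pass before looking at individual trips, the two tables can be reconciled at the trip level: the raw ping count per trip should equal the sum of `n_vp` per trip in the grouped table. The sketch below is a hedged illustration of that idea; `find_mismatched_trips` and the toy frames are hypothetical, and it assumes `n_vp` counts the raw pings folded into each grouped row.

```python
import pandas as pd


def find_mismatched_trips(
    raw_vp: pd.DataFrame,
    vp_grouped: pd.DataFrame
) -> pd.DataFrame:
    """
    Return trips whose raw ping count differs from the
    sum of n_vp in the grouped table.
    """
    raw_counts = raw_vp.groupby("trip_instance_key").size().rename("n_raw")
    grouped_counts = (
        vp_grouped.groupby("trip_instance_key")["n_vp"]
        .sum()
        .rename("n_grouped")
    )

    # Trips missing from one table get a count of 0
    compare = pd.concat(
        [raw_counts, grouped_counts], axis=1
    ).fillna(0).astype(int)

    return compare[compare.n_raw != compare.n_grouped].reset_index()


# Toy data: trip "a" reconciles (4 vs 1+2+1), trip "b" does not (3 vs 1+1)
toy_raw = pd.DataFrame({"trip_instance_key": ["a"] * 4 + ["b"] * 3})
toy_grouped = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a", "b", "b"],
    "n_vp": [1, 2, 1, 1, 1],
})

mismatched = find_mismatched_trips(toy_raw, toy_grouped)
```

A trip can pass this count check and still differ row by row, so it only narrows down where to look.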

In [2]:
rt_operators = [
    "Long Beach VehiclePositions", "Culver City VehiclePositions"
]

# Find problem trips within these operators
lb_trips = [
    "d7d2e036448be81bd9d07e35cb871240", 
    "7df431a1e025fc3bb39047e968297aed",
    "497c900ed76deb8c6d0364a183f285a8",
]

culver_trips = ["32034bff4a0692bab4e31d303bd5276d"]

filtering = [[
    ("gtfs_dataset_name", "in", rt_operators),
    ("trip_instance_key", "in", lb_trips + culver_trips)
]]

In [3]:
subset_cols = [
    "gtfs_dataset_name",
    "trip_instance_key",
    "location_timestamp_local",
    "geometry"
]

def import_raw_vp(analysis_date: str, one_trip: str) -> gpd.GeoDataFrame:
    """
    Import raw vehicle positions for one trip and
    convert the local timestamp to seconds.
    """
    FILE = GTFS_DATA_DICT.speeds_tables.raw_vp

    gdf = gpd.read_parquet(
        f"{SEGMENT_GCS}{FILE}_{analysis_date}.parquet",
        filters = [[("trip_instance_key", "==", one_trip)]],
        columns = subset_cols
    ).pipe(
        segment_calcs.convert_timestamp_to_seconds,
        ["location_timestamp_local"]
    ).drop(
        columns = ["location_timestamp_local"]
    )
    
    return gdf

def import_vp_grouped(analysis_date: str, one_trip: str) -> gpd.GeoDataFrame:
    """
    Import grouped vehicle positions for one trip and
    convert both local timestamps to seconds.
    """
    FILE = GTFS_DATA_DICT.speeds_tables.raw_vp2
    
    gdf = gpd.read_parquet(
        f"{SEGMENT_GCS}{FILE}_{analysis_date}.parquet",
        filters = [[("trip_instance_key", "==", one_trip)]],
        columns = subset_cols + ["moving_timestamp_local", "n_vp"]
    ).pipe(
        segment_calcs.convert_timestamp_to_seconds,
        ["location_timestamp_local", "moving_timestamp_local"]
    ).drop(
        columns = ["location_timestamp_local", "moving_timestamp_local"]
    )
    
    return gdf

In [4]:
# Trips with at least one row where vp_idx and n_vp disagree
subset_trips = pd.read_parquet(
    f"{SEGMENT_GCS}comparison_{analysis_date}.parquet",
    filters = [[("gtfs_dataset_name", "in", rt_operators)]]
).query('vp_idx != n_vp').trip_instance_key.unique().tolist()

#results[
#    results.gtfs_dataset_name == "Long Beach VehiclePositions"
#].trip_instance_key.value_counts()

In [5]:
results = pd.read_parquet(
    f"{SEGMENT_GCS}comparison_{analysis_date}.parquet",
    filters = [[
        ("gtfs_dataset_name", "in", rt_operators), 
        ("trip_instance_key", "in", subset_trips)
    ]]
)

In [6]:
one_trip = "32034bff4a0692bab4e31d303bd5276d"

In [7]:
raw_vp = import_raw_vp(analysis_date, one_trip)
vp_grouped = import_vp_grouped(analysis_date, one_trip)

In [8]:
results_subset = results[results.trip_instance_key==one_trip]

In [9]:
def which_groups_differ(results_subset):
    """
    Find the vp_groups where vp_idx and n_vp disagree, then return
    the time window from the start of the preceding groups
    to the end of the following groups.
    """
    subset_groups = results_subset[
        results_subset.vp_idx != results_subset.n_vp
    ].vp_group.unique().tolist()
    
    prior_group = [i - 1 for i in subset_groups]
    post_group = [i + 1 for i in subset_groups]
    
    min_timestamp = results_subset[
        results_subset.vp_group.isin(prior_group)
    ].location_timestamp_local_sec.min()
    
    max_timestamp = results_subset[
        results_subset.vp_group.isin(post_group)
    ].moving_timestamp_local_sec.max()
    
    return min_timestamp, max_timestamp

In [10]:
min_time, max_time = which_groups_differ(results_subset)

In [11]:
def subset_gdf(raw_vp, vp_grouped, min_time, max_time):
    """
    Keep rows in both tables whose location timestamp falls
    within [min_time, max_time], sorted by timestamp.
    """
    raw_vp_subset = raw_vp[
        (raw_vp.location_timestamp_local_sec >= min_time) & 
        (raw_vp.location_timestamp_local_sec <= max_time)
    ].sort_values("location_timestamp_local_sec")
    
    vp_grouped_subset = vp_grouped[
        (vp_grouped.location_timestamp_local_sec >= min_time) & 
        (vp_grouped.location_timestamp_local_sec <= max_time)
    ].sort_values("location_timestamp_local_sec")
    
    return raw_vp_subset, vp_grouped_subset


def make_map(gdf):
    """
    Map the points, colored by location timestamp (as categories).
    """
    m = gdf.explore(
        "location_timestamp_local_sec",
        tiles = "CartoDB Positron",
        categorical=True
    )
    return m

In [12]:
raw_vp2, vp_grouped2 = subset_gdf(raw_vp, vp_grouped, min_time, max_time)

In [13]:
raw_vp2

|  | gtfs_dataset_name | trip_instance_key | geometry | location_timestamp_local_sec |
| --- | --- | --- | --- | --- |
| 69 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39922 33.99862) | 23737 |
| 130 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39738 33.99736) | 23756 |
| 118 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39529 33.99049) | 23973 |
| 143 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39549 33.98892) | 24000 |


In [14]:
vp_grouped2

|  | gtfs_dataset_name | trip_instance_key | geometry | n_vp | location_timestamp_local_sec | moving_timestamp_local_sec |
| --- | --- | --- | --- | --- | --- | --- |
| 9 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39922 33.99862) | 1 | 23737 | 23737 |
| 24 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39529 33.99049) | 2 | 23756 | 23973 |
| 84 | Culver City VehiclePositions | 32034bff4a0692bab4e31d303bd5276d | POINT (-118.39549 33.98892) | 1 | 24000 | 24000 |


In [15]:
make_map(raw_vp2)

In [16]:
make_map(vp_grouped2)
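
As a cross-check on the `raw_vp2` and `vp_grouped2` outputs printed for this trip, the sketch below rebuilds them as plain DataFrames and matches rows on position. It is illustrative only: the `x`/`y` columns stand in for `geometry`, the coordinates are the rounded values as printed, and exact float matching works here only because both tables print identical rounded values.

```python
import pandas as pd

# Values copied from the raw_vp2 output (coordinates rounded as printed)
raw = pd.DataFrame({
    "x": [-118.39922, -118.39738, -118.39529, -118.39549],
    "y": [33.99862, 33.99736, 33.99049, 33.98892],
    "location_timestamp_local_sec": [23737, 23756, 23973, 24000],
})

# Values copied from the vp_grouped2 output
grouped = pd.DataFrame({
    "x": [-118.39922, -118.39529, -118.39549],
    "y": [33.99862, 33.99049, 33.98892],
    "n_vp": [1, 2, 1],
    "location_timestamp_local_sec": [23737, 23756, 24000],
    "moving_timestamp_local_sec": [23737, 23973, 24000],
})

# Do the grouped rows account for every raw ping?
counts_match = grouped.n_vp.sum() == len(raw)

# Match each raw ping to a grouped row at the same position
merged = raw.merge(
    grouped,
    on=["x", "y"],
    how="left",
    suffixes=("_raw", "_grouped"),
    indicator=True,
)

# Raw pings whose position does not appear in the grouped table
unmatched = merged[merged["_merge"] == "left_only"]

# Raw pings whose position matches a grouped row,
# but whose timestamp differs from that row's starting timestamp
shifted = merged[
    (merged["_merge"] == "both")
    & (
        merged.location_timestamp_local_sec_raw
        != merged.location_timestamp_local_sec_grouped
    )
]
```

On these values, the `n_vp` total equals the raw row count, the raw ping at 23756 has no row with its position in the grouped table, and the ping at 23973 matches a grouped row whose `location_timestamp_local_sec` is 23756. That suggests the `n_vp = 2` row pairs the earlier ping's timestamp with the later ping's position, which is one place to look for the difference.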