## Oct 2021 Methodology Ideas from Mjumbe

* Current RT work has diverged somewhat from this, but it is helpful background.
* We later found that BigQuery simplifies linestrings in an undesirable way, so we will likely process somewhere in Python and write out tables for warehouse consumption.

## Calculating Level of Delay

1. Get the vehicle position and time (`gtfs_rt.vehicle_positions`)
2. Find the point nearest the vehicle's position along its route
3. Calculate when the vehicle should have been at that nearest point
   * If possible/necessary, interpolate between `stop_times` entries
4. Take the difference between now and when the vehicle should have been at its nearest route position
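
The four steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: it assumes the vehicle's nearest route position has already been expressed as a fraction of the route length, and that each `stop_times` entry carries a matching fraction and a scheduled time in seconds. The function names and the sample values are hypothetical.

```python
from bisect import bisect_right


def scheduled_time_at(fraction, timepoints):
    """Linearly interpolate the scheduled time (seconds) at a route fraction.

    timepoints: list of (fraction_along_route, scheduled_seconds) tuples,
    sorted by fraction. Both fields are assumed inputs for this sketch.
    """
    fractions = [f for f, _ in timepoints]
    # Outside the covered range, clamp to the first or last timepoint.
    if fraction <= fractions[0]:
        return timepoints[0][1]
    if fraction >= fractions[-1]:
        return timepoints[-1][1]
    # Find the pair of timepoints that brackets the vehicle's fraction.
    i = bisect_right(fractions, fraction)
    (f0, t0), (f1, t1) = timepoints[i - 1], timepoints[i]
    return t0 + (t1 - t0) * (fraction - f0) / (f1 - f0)


def delay_seconds(observed_seconds, vehicle_fraction, timepoints):
    """Positive result means the vehicle is behind schedule."""
    return observed_seconds - scheduled_time_at(vehicle_fraction, timepoints)


# Hypothetical trip: three timepoints at 0%, 50% and 100% of the route.
timepoints = [(0.0, 28800), (0.5, 29100), (1.0, 29700)]
delay = delay_seconds(29000, 0.25, timepoints)
print(delay)  # 50.0
```

Here the vehicle is a quarter of the way along the route at 29000 seconds, the schedule puts it there at 28950 seconds, so it is 50 seconds late.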

**Challenges**
* What will/should happen when a route loops on itself? This happens frequently if a route spurs into a business park, say, and then comes back out by the same path. Then the vehicle may be closest to points corresponding to two different time points along the trip route. Maybe for now we just pick one and hope it all averages out.
* BigQuery doesn't have functions like [`ST_LineLocatePoint`](https://postgis.net/docs/ST_LineLocatePoint.html), which would be sooooo ideal. Can we do a custom linear interpolation function? Should we rely on BigQuery [user-defined functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions)? And is the linear interpolation even necessary?
  - Alternative simplification #1:
    - Just find the closest stop-time position to the vehicle position
    - Reasonable as long as the max time distance between stops is small (e.g., if there are 30 minutes between stops, it's not as good an idea; but maybe it will just average out...)
    - What's the max time distance between stops? What's our acceptable threshold?
  - Alternative simplification #2:
    - This is hard in the first place because BigQuery doesn't have linear interpolation built in
    - Pull data into Python and use Shapely? That would not be real-time, though (i.e., we have to do the processing in Python on some schedule and then materialize the results in BigQuery as opposed to having a view ready). Also, BigQuery's good at parallelization.
  - Best guess:
    - Start with simplification #1. If a custom linear interp function is not too onerous, or time distances are long enough that it's necessary, go with the full linear interp method in a UDF.
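
To gauge how onerous a custom function would be, here is a rough pure-Python analogue of `ST_LineLocatePoint`. It is a sketch under stated assumptions: coordinates are planar (already projected), the function name merely mirrors the PostGIS one, and the sample line is made up. The same logic would still need to be ported to a BigQuery UDF.

```python
import math


def line_locate_point(line, point):
    """Return the fraction (0..1) of the line's length nearest to `point`.

    line: list of (x, y) vertices in a planar coordinate system.
    On ties (e.g. an out-and-back spur) the earliest match along the line wins.
    """
    px, py = point
    best_dist2 = math.inf
    best_along = 0.0
    travelled = 0.0
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0:
            continue  # skip repeated vertices
        # Project the point onto the segment and clamp to its endpoints.
        t = ((px - x0) * dx + (py - y0) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        cx, cy = x0 + t * dx, y0 + t * dy
        dist2 = (px - cx) ** 2 + (py - cy) ** 2
        seg_len = math.sqrt(seg_len2)
        if dist2 < best_dist2:
            best_dist2 = dist2
            best_along = travelled + t * seg_len
        travelled += seg_len
    return best_along / travelled if travelled else 0.0


# Hypothetical L-shaped route of total length 20.
route = [(0, 0), (10, 0), (10, 10)]
print(line_locate_point(route, (5, 1)))   # 0.25
print(line_locate_point(route, (11, 5)))  # 0.75
```

The strict `<` comparison is one way of "just picking one" when a route doubles back on itself: for an out-and-back line such as `[(0, 0), (10, 0), (0, 0)]`, a point beside the spur resolves to the outbound pass.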

## Checking time distances between `stop_times`

In [1]:
from calitp import get_engine
import pandas as pd
import pyarrow

In [2]:
engine = get_engine()
connection = engine.connect()

In the results below, we see that 95% of arrival time differences from one stop to the next are under 3 minutes (180 seconds). Even among the gaps larger than 180 seconds, 90% are no more than 10 minutes (600 seconds). So only about 0.5% of stop-to-stop gaps exceed 10 minutes. From that, it seems reasonable to at least _start_ by using the closest stop location as opposed to the closest position along a route. If necessary, we can generalize later.

In [32]:
sql = '''
WITH

stop_pairs AS (
    SELECT
        calitp_itp_id,
        calitp_url_number,
        calitp_extracted_at,
        trip_id,
        stop_id AS from_stop_id,
        LEAD(stop_id) OVER (stop_sequence_window) AS to_stop_id,
        arrival_ts AS from_arrival_ts,
        LEAD(arrival_ts) OVER (stop_sequence_window) AS to_arrival_ts
    FROM views.gtfs_schedule_dim_stop_times
    WHERE calitp_deleted_at = '2099-01-01'
    WINDOW stop_sequence_window AS (
        PARTITION BY calitp_itp_id, calitp_url_number, calitp_extracted_at, trip_id
        ORDER BY stop_sequence
    )
)

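-- APPROX_QUANTILES(x, 20) returns 21 boundaries: the 0th, 5th, ..., 100th percentiles.
-- ROW_NUMBER() OVER () has no ORDER BY, so the labels assume UNNEST preserves array order.
SELECT
    CAST((ROW_NUMBER() OVER () - 1) * 5 AS STRING) || 'th' AS quantile,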
    q_seconds
FROM UNNEST((
    SELECT APPROX_QUANTILES(to_arrival_ts - from_arrival_ts, 20)
    FROM stop_pairs
)) q_seconds
'''

pd.read_sql_query(sql, connection)

| quantile | q_seconds |
| --- | --- |
| 0th | -17100 |
| 5th | 10 |
| 10th | 28 |
| 15th | 34 |
| 20th | 40 |
| 25th | 46 |
| 30th | 51 |
| 35th | 58 |
| 40th | 60 |
| 45th | 60 |


In [3]:
sql = '''
WITH

stop_pairs AS (
    SELECT
        calitp_itp_id,
        calitp_url_number,
        calitp_extracted_at,
        trip_id,
        stop_id AS from_stop_id,
        LEAD(stop_id) OVER (stop_sequence_window) AS to_stop_id,
        arrival_ts AS from_arrival_ts,
        LEAD(arrival_ts) OVER (stop_sequence_window) AS to_arrival_ts
    FROM views.gtfs_schedule_dim_stop_times
    WHERE calitp_deleted_at = '2099-01-01'
    WINDOW stop_sequence_window AS (
        PARTITION BY calitp_itp_id, calitp_url_number, calitp_extracted_at, trip_id
        ORDER BY stop_sequence
    )
)

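-- Deciles of the gaps above 180 s. Those gaps are the top 5% overall,
-- so each decile step is relabeled as 0.5 points of the overall distribution.
SELECT
    CAST((ROW_NUMBER() OVER () - 1) * 0.5 + 95 AS STRING) || 'th' AS quantile,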
    q_seconds_above_3min
FROM UNNEST((
    SELECT APPROX_QUANTILES(to_arrival_ts - from_arrival_ts, 10)
    FROM stop_pairs
    WHERE to_arrival_ts - from_arrival_ts > 180
)) AS q_seconds_above_3min
'''

pd.read_sql_query(sql, connection)

| quantile | q_seconds_above_3min |
| --- | --- |
| 95th | 181 |
| 95.5th | 196 |
| 96th | 215 |
| 96.5th | 240 |
| 97th | 240 |
| 97.5th | 257 |
| 98th | 300 |
| 98.5th | 339 |
| 99th | 420 |
| 99.5th | 600 |
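
The tail estimate for gaps above 10 minutes comes from multiplying two shares: the share of gaps above 180 seconds, and the share of those that also exceed 600 seconds. A quick check of that arithmetic, using only the two percentages quoted in the prose (the variable names are illustrative):

```python
# 95% of stop-to-stop gaps are under 180 s, so 5% are above.
share_above_3min = 1 - 0.95
# Of the gaps above 180 s, 90% are at most 600 s, so 10% are above.
share_of_those_above_10min = 1 - 0.90

share_above_10min = share_above_3min * share_of_those_above_10min
print(f"{share_above_10min:.3%}")  # 0.500%
```

Because `APPROX_QUANTILES` is approximate, this should be read as "about 0.5%" rather than as an exact bound.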


## Plan (with interpolation)

1. Create a dim table for `trip_shapes` (`ST_MAKELINE`)
2. Create trip_segments between stop_times by cutting trip_shapes between stop positions (requires UDFs similar to [ST_LineLocatePoint](https://postgis.net/docs/ST_LineLocatePoint.html) and [ST_LineSubstring](https://postgis.net/docs/ST_LineSubstring.html))
3. Join the vehicle positions to trip segments (through vehicle position -> trip -> trip segment), measure distances, [keep the closest](https://stackoverflow.com/a/57295530/123776)
4. Find fraction of total segment length closest to vehicle position (requires creating a UDF similar to [ST_LineLocatePoint](https://postgis.net/docs/ST_LineLocatePoint.html))
5. Interpolate between stop time departure/arrival times to find scheduled vehicle time
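
Step 2 of this plan needs something like `ST_LineSubstring`. As a rough idea of what such a helper involves, here is a pure-Python sketch that cuts a polyline between two length fractions. It assumes planar coordinates and that the stop fractions have already been computed by a locate-point step; the function name and sample shape are hypothetical, and a BigQuery UDF would need its own implementation.

```python
import math


def line_substring(line, start_fraction, end_fraction):
    """Return the vertices of the part of `line` between two length fractions.

    line: list of (x, y) vertices in a planar coordinate system.
    Fractions are in 0..1 with start_fraction <= end_fraction.
    """
    seg_lengths = [math.dist(a, b) for a, b in zip(line, line[1:])]
    total = sum(seg_lengths)
    start_d, end_d = start_fraction * total, end_fraction * total

    out = []
    travelled = 0.0
    for a, b, seg_len in zip(line, line[1:], seg_lengths):
        seg_start, seg_end = travelled, travelled + seg_len
        travelled = seg_end
        # Skip degenerate segments and those entirely outside the window.
        if seg_len == 0 or seg_end < start_d or seg_start > end_d:
            continue
        # Clip the segment to the requested distance window.
        for d in (max(start_d, seg_start), min(end_d, seg_end)):
            t = (d - seg_start) / seg_len
            pt = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            if not out or out[-1] != pt:
                out.append(pt)
    return out


# Hypothetical trip shape of length 20, with stops at 25% and 75% of its length.
shape = [(0, 0), (10, 0), (10, 10)]
segment = line_substring(shape, 0.25, 0.75)
print(segment)  # [(5.0, 0.0), (10.0, 0.0), (10.0, 5.0)]
```

Running this once per consecutive pair of stops would give the trip segments that steps 3 to 5 then join against.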

## Plan (without interpolation)

1. Add `stop_pt` to `gtfs_schedule_dim_stops`
2. Create a cleaned table and a `gtfs_rt_fact_vehicle_positions` fact view
3. Join the vehicle positions to stops (through vehicle position -> trip -> stop_time -> stop), measure distances, [keep closest](https://stackoverflow.com/a/57295530/123776)
4. Use stop arrival time as scheduled vehicle time
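
Steps 3 and 4 of the simpler plan can be illustrated with a few lines of Python. This is a sketch only: the real join would happen in the warehouse, the dictionary layout and sample coordinates are made up, and `arrival_ts` is taken to be seconds as in the queries above. Distances use the haversine formula on a spherical Earth.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def closest_stop_time(vehicle, stop_times):
    """Return the stop_time whose stop is nearest the vehicle position."""
    return min(
        stop_times,
        key=lambda st: haversine_m(
            vehicle["lat"], vehicle["lon"], st["stop_lat"], st["stop_lon"]
        ),
    )


# Hypothetical vehicle position and the stop_times of its trip.
vehicle = {"lat": 34.0505, "lon": -118.2500, "timestamp": 29130}
stop_times = [
    {"stop_id": "A", "stop_lat": 34.0400, "stop_lon": -118.2500, "arrival_ts": 28800},
    {"stop_id": "B", "stop_lat": 34.0500, "stop_lon": -118.2500, "arrival_ts": 29100},
    {"stop_id": "C", "stop_lat": 34.0600, "stop_lon": -118.2500, "arrival_ts": 29400},
]

nearest = closest_stop_time(vehicle, stop_times)
delay = vehicle["timestamp"] - nearest["arrival_ts"]
print(nearest["stop_id"], delay)  # B 30
```

The error of this shortcut is bounded by the scheduled gap between neighboring stops, which is why the quantile check earlier in these notes matters.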