## RT trip diagnostics: thresholds for usable trips 
Notes/Questions
* What is the calitp url number? What does 0 or 1 mean?
* Do you think that most shape IDs are going to be less than 100% of the length of the longest shape ID?
* What defines a usable trip and a usable segment?
* What's the difference between direction ID and route dir identifier? What do the 0 and 1 mean in direction ID?
* Segments represent a 1,000 m segment cut from the route's shape (the longest shape in each direction). A segment currently reflects something about a route, but in the future, road segments will be used. What's the difference between a segment and a road segment?
* Ask about graph on Slack.


In [1]:
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd
from calitp.sql import to_snakecase
from shared_utils import geography_utils, utils



In [2]:
# Save files to GCS
from calitp.storage import get_fs
fs = get_fs()

In [3]:
# Record start and end time
import datetime
from loguru import logger

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Look at files

In [5]:
GCS_DASK_PATH = "gs://calitp-analytics-data/data-analyses/dask_test/"

In [6]:
GCS_RT_PATH = "gs://calitp-analytics-data/data-analyses/rt_delay/"

In [7]:
analysis_date = "2022-10-12"

* Should I use `get_routelines` from `A1_vehicle_positions` here?

In [8]:
# Read in route lines
routelines = gpd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/routelines_{analysis_date}.parquet"
)

In [82]:
len(routelines), routelines.shape_id.nunique()

(9430, 6353)

In [81]:
# routelines.sample().drop(columns=["geometry"])

* The `route_dir_identifier` is used to cut segments for both directions in which the route runs.
* Why would the same route ID have more segments in the other direction?

In [11]:
# Read in longest_shape of each route
longest_shape = gpd.read_parquet(f"{GCS_DASK_PATH}longest_shape_segments.parquet")

In [12]:
len(
    longest_shape
), longest_shape.longest_shape_id.nunique(), longest_shape.calitp_itp_id.nunique()

(126896, 3960, 175)

In [66]:
# longest_shape.groupby(['calitp_itp_id','route_id','longest_shape_id']).agg({'segment_sequence':'nunique'})

In [64]:
# longest_shape.sort_values(['calitp_itp_id', 'route_id']).head(25).drop(columns=["geometry", "geometry_arrowized"])

In [80]:
# longest_shape.sort_values(['calitp_itp_id', 'route_id']).head(1).drop(columns=["geometry", "geometry_arrowized"])

In [14]:
# Says missing geospatial data
segments_crosswalks = pd.read_parquet(
    f"{GCS_DASK_PATH}segments_route_direction_crosswalk.parquet"
)

In [15]:
len(segments_crosswalks)

5150

In [16]:
segments_crosswalks.sample()

Unnamed: 0,calitp_itp_id,route_id,direction_id,route_dir_identifier
4240,182,4-13157,1,3559130268


In [17]:
# Read in the segments for ONE itp id first.
# gpd.read_parquet() fails with:
# "Missing geo metadata in Parquet/Feather file.
# Use pandas.read_parquet/read_feather() instead."
segment_148 = pd.read_parquet(
    f"{GCS_DASK_PATH}vp_sjoin/vp_segment_148_{analysis_date}.parquet"
)

In [18]:
# segment_148 = gpd.GeoDataFrame(
#    segment_148, geometry=gpd.points_from_xy(segment_148.lon, segment_148.lat))

In [19]:
len(segment_148)

22692

In [20]:
segment_148 = segment_148.sort_values(["calitp_itp_id", "trip_id", "segment_sequence"])

In [21]:
segment_148.sample()

Unnamed: 0,calitp_itp_id,calitp_url_number,vehicle_timestamp,trip_id,route_dir_identifier,segment_sequence,lon,lat
1587,148,0,2022-10-12 13:15:20,933,1342713973,6,27708.54,-186019.68


### Task 1
* Using GTFS schedule data, for each route_id-shape_id pair, calculate the route_length of each shape_id as a proportion of the longest shape_id on that route.
* For each route_id, what is the shortest shape_id's length in proportion to the longest shape_id's length? If it's 100%, then all shape_ids are equal in length for that route. If it's 50%, there is a short trip that only runs 50% of the length and turns around.
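The two bullets above can be sketched on toy data. This is only an illustration under stated assumptions: the column name `pct_of_longest` and all values are hypothetical, and it assumes one row per route_id-shape_id pair with a precomputed `route_length`.

```python
import pandas as pd

# Toy data (hypothetical values): two routes with several shapes each.
shapes = pd.DataFrame(
    {
        "route_id": ["A", "A", "A", "B", "B"],
        "shape_id": ["a1", "a2", "a3", "b1", "b2"],
        "route_length": [10_000.0, 5_000.0, 10_000.0, 8_000.0, 8_000.0],
    }
)

# Length of the longest shape on each route, broadcast back to every row.
longest = shapes.groupby("route_id")["route_length"].transform("max")

# Each shape's length as a percentage of its route's longest shape.
shapes["pct_of_longest"] = shapes["route_length"] / longest * 100

# Shortest shape per route, as a percentage of the longest shape.
shortest_pct = shapes.groupby("route_id")["pct_of_longest"].min()
```

Here route A has a short shape at 50% of the longest, while every shape on route B is at 100%.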

Notes
* Is it correct to join on `shape_id` and `longest_shape_id`?

In [22]:
routelines.crs == longest_shape.crs

True

In [23]:
# Calculate length of geometry
routelines = routelines.assign(
    route_length=(routelines.geometry.to_crs(geography_utils.CA_StatePlane).length)
)

* What do the Cal ITP url numbers signify?

In [24]:
routelines.calitp_url_number.value_counts()

0    7348
1    2011
2      71
Name: calitp_url_number, dtype: int64

In [68]:
routelines.drop(columns = ["geometry"]).sort_values(["calitp_itp_id","shape_id"]).head(6)

Unnamed: 0,calitp_itp_id,calitp_url_number,shape_id,route_length
45,4,0,shp-10-09,40538.08
578,4,1,shp-10-09,40538.08
61,4,0,shp-10-10,38768.87
427,4,1,shp-10-10,38768.87
75,4,0,shp-12-14,57472.16
517,4,1,shp-12-14,57472.16


In [26]:
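# Dissolve so there is only one row for each calitp_itp_id/shape_id.
# Do I want the route length for both directions?
# Note: "sum" adds route_length across calitp_url_number rows, so a shape
# listed under two URL numbers ends up with double its length.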
routelines_diss = routelines.dissolve(by=[
        "calitp_itp_id",
        "shape_id",
    ],
    aggfunc={
        "route_length": "sum",
    },
).reset_index()

In [27]:
routelines_diss.shape_id.nunique(), len(routelines_diss), len(routelines)

(6353, 7685, 9430)

In [70]:
routelines_diss.drop(columns = ["geometry"]).sort_values(["calitp_itp_id","shape_id"]).head()

Unnamed: 0,calitp_itp_id,shape_id,route_length
0,4,shp-10-09,81076.17
1,4,shp-10-10,77537.73
2,4,shp-12-14,114944.32
3,4,shp-12-56,119300.58
4,4,shp-14-14,24129.03
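The dissolved lengths above are twice the per-row lengths (81076.17 vs. 40538.08 for `shp-10-09`) because the same shape appears under `calitp_url_number` 0 and 1 and the lengths are summed. The sketch below contrasts summing with keeping one row per shape. It assumes the rows under different URL numbers describe the same geometry, which the notebook has not confirmed, and it uses plain pandas on values copied from the preview rather than the real GeoDataFrame.

```python
import pandas as pd

# Values copied from the routelines preview: one row per calitp_url_number.
routelines_toy = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4, 4],
        "calitp_url_number": [0, 1, 0, 1],
        "shape_id": ["shp-10-09", "shp-10-09", "shp-10-10", "shp-10-10"],
        "route_length": [40538.08, 40538.08, 38768.87, 38768.87],
    }
)

group_cols = ["calitp_itp_id", "shape_id"]

# What aggfunc "sum" does: lengths are added across URL numbers.
summed = routelines_toy.groupby(group_cols)["route_length"].sum().reset_index()

# Alternative (assumption: the rows are duplicates): keep the lowest
# calitp_url_number for each shape so the length is counted once.
deduped = (
    routelines_toy.sort_values(group_cols + ["calitp_url_number"])
    .drop_duplicates(subset=group_cols, keep="first")
    .drop(columns="calitp_url_number")
    .reset_index(drop=True)
)
```

With the deduplicated frame, `route_length` stays at the single-shape value, so a later ratio against the longest shape is not inflated by the number of feeds.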


In [73]:
# longest_shape.drop(columns = ["geometry", "geometry_arrowized"]).sort_values(["calitp_itp_id","longest_shape_id"]).head(25)

In [72]:
# Dissolve so only one row for each calitp_id/shape_id/route_id
# Don't care about segment necessarily?
longest_shape_diss = longest_shape.dissolve(
    by=["calitp_itp_id", "route_id", "longest_shape_id","route_dir_identifier"],
    aggfunc={
        "route_length": "sum",
    },
).reset_index()

In [31]:
# longest_shape_diss.drop(columns = ["geometry"]).sort_values(["calitp_itp_id", "route_id"]).head(10)

In [32]:
# Do an inner merge? Or should it be left merge?
m1 = routelines_diss.merge(
    longest_shape_diss,
    how="inner",
    left_on=["calitp_itp_id", "shape_id"],
    right_on=["calitp_itp_id", "longest_shape_id"],
    suffixes=("_routelines", "_longest_line"),
)

In [76]:
len(longest_shape_diss), len(routelines_diss), len(m1)

(5150, 7685, 5150)
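One way to explore the inner-versus-left question is to run the merge as a left join with `indicator=True` and count which rows found a match. The sketch below uses hypothetical toy frames with the same key columns as `routelines_diss` and `longest_shape_diss`; it is not run on the real data.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for routelines_diss and longest_shape_diss.
left = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "shape_id": ["s1", "s2", "s3"],
        "route_length": [100.0, 60.0, 40.0],
    }
)
right = pd.DataFrame(
    {
        "calitp_itp_id": [4],
        "longest_shape_id": ["s1"],
        "route_length": [100.0],
    }
)

merged = left.merge(
    right,
    how="left",
    left_on=["calitp_itp_id", "shape_id"],
    right_on=["calitp_itp_id", "longest_shape_id"],
    suffixes=("_routelines", "_longest_line"),
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

n_matched = int((merged["_merge"] == "both").sum())
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

Rows marked `left_only` are shapes that are not the longest shape of any route. An inner merge silently drops them, which is consistent with the merged frame above having the same length as `longest_shape_diss`.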

In [33]:
# Make sure this is a gdf? Is this important?
m1 = m1.set_geometry("geometry_routelines")

In [34]:
# Calculate each shape's length as a percentage of the longest shape's length.
m1["proportion_route_length"] = (
    m1["route_length_routelines"] / m1["route_length_longest_line"]
) * 100

In [35]:
m1.proportion_route_length.describe()

count   5150.00
mean      32.88
std       32.66
min        1.06
25%       12.62
50%       23.43
75%       41.06
max      657.58
Name: proportion_route_length, dtype: float64

In [78]:
m1.drop(columns = ['geometry_routelines','geometry_longest_line']).sort_values(["calitp_itp_id", "route_id"], ascending = False).head()

Unnamed: 0,calitp_itp_id,shape_id,route_length_routelines,longest_shape_id,route_id,route_dir_identifier,route_length_longest_line,proportion_route_length
5148,485,101,9353.86,101,Treasure Island – Yerba Buena Island,2504319577,8544.38,109.47
5149,485,102,9353.86,102,Treasure Island – Yerba Buena Island,3796095695,8544.38,109.47
5129,484,706_shp,300744.53,706_shp,90,1695780991,8414633.84,3.57
5131,484,735_shp,300464.01,735_shp,90,303341801,8406798.34,3.57
5097,484,628_shp,183383.62,628_shp,80,1691482696,3126727.93,5.87


### Task 2
* Testing with 148 Kings County Area Public Transit Agency
* Calculate time of trips?

In [136]:
len(segment_148)

22692

* Why are there so many timestamps within 30 seconds of each other within the same segment?

In [137]:
segment_148.sort_values(["trip_id", "segment_sequence"]).head(2)

Unnamed: 0,calitp_itp_id,calitp_url_number,vehicle_timestamp,trip_id,route_dir_identifier,segment_sequence,lon,lat
287,148,0,2022-10-12 18:26:26,100,4023814891,0,31139.56,-187848.08
288,148,0,2022-10-12 18:26:56,100,4023814891,0,31133.83,-187776.98
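To look into the timestamp question, the gap between consecutive pings and the number of pings per segment can be computed per trip. The first two rows below come from the preview above; the third row and the column name `seconds_since_prev` are hypothetical. One possible reading, not confirmed here, is that the vehicle reports its position at a fixed interval, so a 1,000 m segment naturally collects several pings.

```python
import pandas as pd

pings = pd.DataFrame(
    {
        "trip_id": ["100", "100", "100"],
        "segment_sequence": [0, 0, 1],
        "vehicle_timestamp": pd.to_datetime(
            [
                "2022-10-12 18:26:26",
                "2022-10-12 18:26:56",
                "2022-10-12 18:27:26",  # hypothetical third ping
            ]
        ),
    }
)

pings = pings.sort_values(["trip_id", "vehicle_timestamp"])

# Seconds between each ping and the previous ping of the same trip.
pings["seconds_since_prev"] = (
    pings.groupby("trip_id")["vehicle_timestamp"].diff().dt.total_seconds()
)

# Number of pings recorded in each segment of each trip.
pings_per_segment = pings.groupby(["trip_id", "segment_sequence"]).size()
```

Running `seconds_since_prev.describe()` on the full `segment_148` frame would show whether 30 seconds is the typical reporting interval.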


In [138]:
merge_cols = ["calitp_itp_id", "trip_id", "route_dir_identifier",]

In [139]:
# Get the earliest vehicle timestamp for each trip_id and direction.
segment_148_min = (
    segment_148.groupby(merge_cols)
    .agg({"vehicle_timestamp": "min"})
    .rename(columns={"vehicle_timestamp": "min_time"})
    .reset_index()
)

In [140]:
# Get the latest vehicle timestamp for each trip_id and direction.
segment_148_max = (
    segment_148.groupby(merge_cols)
    .agg({"vehicle_timestamp": "max"})
    .rename(columns={"vehicle_timestamp": "max_time"})
    .reset_index()
)

In [141]:
# Count the number of unique segments for each trip_id and direction.
segment_counts_148 = (
    segment_148.groupby(merge_cols)
    .agg({"segment_sequence": "nunique"})
    .reset_index()
    .rename(columns={"segment_sequence": "number_of_segments"})
)

In [142]:
segment_148_m = segment_148_max.merge(
    segment_148_min,
    how="inner",
    on=merge_cols,
).merge(
    segment_counts_148,
    how="left",
    on=merge_cols)

In [143]:
# segment_148_m.sort_values(["trip_id"]).head(25)

In [144]:
# https://stackoverflow.com/questions/51491724/calculate-difference-of-2-dates-in-minutes-in-pandas
segment_148_m['minutes_elapsed'] = (segment_148_m.max_time - segment_148_m.min_time).dt.total_seconds() / 60

In [146]:
segment_148_m.head(2)

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,max_time,min_time,number_of_segments,minutes_elapsed
0,148,100,4023814891,2022-10-12 18:57:26,2022-10-12 18:26:26,10,31.0
1,148,101,4023814891,2022-10-12 19:23:26,2022-10-12 18:57:56,10,25.5
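The separate min, max, and count frames and their merges could also be produced in a single pass with pandas named aggregation. This is a sketch on toy rows (the timestamps for trip 100 come from the output above; the segment numbers are hypothetical), not a replacement that has been run on the real data.

```python
import pandas as pd

vp = pd.DataFrame(
    {
        "calitp_itp_id": [148, 148],
        "trip_id": ["100", "100"],
        "route_dir_identifier": [4023814891, 4023814891],
        "segment_sequence": [0, 9],  # hypothetical segment numbers
        "vehicle_timestamp": pd.to_datetime(
            ["2022-10-12 18:26:26", "2022-10-12 18:57:26"]
        ),
    }
)

merge_cols = ["calitp_itp_id", "trip_id", "route_dir_identifier"]

# One groupby yields the first ping, last ping, and segment count per trip.
trip_stats = (
    vp.groupby(merge_cols)
    .agg(
        min_time=("vehicle_timestamp", "min"),
        max_time=("vehicle_timestamp", "max"),
        number_of_segments=("segment_sequence", "nunique"),
    )
    .reset_index()
)

trip_stats["minutes_elapsed"] = (
    trip_stats["max_time"] - trip_stats["min_time"]
).dt.total_seconds() / 60
```

Because everything comes from one groupby, there is no merge step in which rows could be dropped or duplicated.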


In [147]:
len(segment_148_m)

232

In [148]:
segment_148_m.trip_id.nunique() == segment_148.trip_id.nunique(), segment_148_m.route_dir_identifier.nunique() == segment_148.route_dir_identifier.nunique()

(True, True)

In [149]:
# Merge to get calitp_itp_id, route_id, direction_id, and route_dir_identifier info.
segment_148_m = segment_148_m.merge(
    segments_crosswalks,
    how="inner",
    on=['calitp_itp_id','route_dir_identifier'],
)

In [150]:
# Merge with the shape ID proportion df.
m2 = segment_148_m.merge(
    m1,
    how="inner",
    on=['calitp_itp_id','route_dir_identifier', "route_id"],
    indicator = True
)

In [151]:
len(m2), len(segment_148_m)

(232, 232)

In [152]:
col_order = ['calitp_itp_id',  'route_id', 'trip_id', 'number_of_segments', 'route_dir_identifier', 'max_time',
       'min_time', 'minutes_elapsed','direction_id', 'shape_id', 'longest_shape_id',
               'proportion_route_length']

In [153]:
# Change col order & drop unnecessary ones.
m2 = m2[col_order]

In [155]:
len(m2)

232

In [161]:
(m2.sort_values(by = ["route_id", "trip_id"])).head()

Unnamed: 0,calitp_itp_id,route_id,trip_id,number_of_segments,route_dir_identifier,max_time,min_time,minutes_elapsed,direction_id,shape_id,longest_shape_id,proportion_route_length
0,148,1,100,10,4023814891,2022-10-12 18:57:26,2022-10-12 18:26:26,31.0,0,42,42,32.79
1,148,1,101,10,4023814891,2022-10-12 19:23:26,2022-10-12 18:57:56,25.5,0,42,42,32.79
2,148,1,102,10,4023814891,2022-10-12 19:55:26,2022-10-12 19:23:56,31.5,0,42,42,32.79
3,148,1,76,10,4023814891,2022-10-12 06:54:55,2022-10-12 06:18:55,36.0,0,42,42,32.79
4,148,1,77,10,4023814891,2022-10-12 07:23:56,2022-10-12 06:55:25,28.52,0,42,42,32.79


### Questions 
* For each operator, what percentage of RT trip_ids would remain after those thresholds are applied? Make a chart function that takes a single operator, then produce charts for all operators. Is it the time coverage or the geographic coverage that drives the exclusion of trips? What is a recommended threshold to use?
* For short trips, do they tend to be 50% of the longest route length? 40%? 30%? Have this handy to inform question 1.
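As a starting point for the first question, a helper can report the share of trips that survive a time threshold and a segment-count threshold. The function name `pct_trips_remaining`, its parameters, and the use of inclusive comparisons are all assumptions made for illustration; the toy rows borrow `minutes_elapsed` values from the output above with hypothetical segment counts.

```python
import pandas as pd


def pct_trips_remaining(
    trips: pd.DataFrame, min_minutes: float, min_segments: int
) -> float:
    """Percentage of trips meeting both thresholds (inclusive, by assumption)."""
    keep = (trips["minutes_elapsed"] >= min_minutes) & (
        trips["number_of_segments"] >= min_segments
    )
    return float(keep.mean() * 100)


# Toy per-trip table shaped like segment_148_m.
trips = pd.DataFrame(
    {
        "trip_id": ["100", "101", "102", "76"],
        "minutes_elapsed": [31.0, 25.5, 31.5, 36.0],
        "number_of_segments": [10, 10, 10, 4],  # hypothetical counts
    }
)

# Sweep a few time thresholds while holding the segment threshold fixed.
sweep = pd.DataFrame(
    {
        "min_minutes": m,
        "pct_remaining": pct_trips_remaining(trips, min_minutes=m, min_segments=5),
    }
    for m in (20, 30, 40)
)
```

Comparing a sweep over `min_minutes` with a sweep over `min_segments` is one way to see whether time or geographic coverage removes more trips, and the `sweep` table can feed a per-operator chart.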


In [145]:
segment_148_m.minutes_elapsed.describe()

count   232.00
mean     40.54
std      27.08
min      21.00
25%      28.48
50%      30.00
75%      32.50
max     189.50
Name: minutes_elapsed, dtype: float64

In [164]:
threshold = 40

In [165]:
usable = (segment_148_m.loc[segment_148_m.minutes_elapsed > threshold]).reset_index(drop = True)

In [166]:
# Use <= so trips exactly at the threshold are not dropped from both groups.
unusable = (segment_148_m.loc[segment_148_m.minutes_elapsed <= threshold]).reset_index(drop = True)

In [167]:
len(unusable), len(usable)

(192, 40)