## RT trip diagnostics: thresholds for usable trips 
Notes/Questions
* What is the `calitp_url_number`? What do 0 and 1 mean? In V1, an operator can have different feeds. 
    * 0 could be the primary feed and 1 the backup. This column will go away. 
* Do you think that most shape IDs are going to be less than 100% of the length of the longest shape ID? 
    * Not necessarily, a shape ID can be a short version of the trip.
* What’s the difference between direction ID and route dir identifier? What do the 0 and 1 mean in direction ID?
    * We don't know where the bus is going, so just use 0 and 1.
    * Route dir identifier: captures the route and the direction it is going, so it covers all the trips. Helps with groupby. 
    * We don't want to stick with trip id, we need to get to route level. 
    * Don't want to lose info on the direction. 
    * Have to distinguish direction or else when plotting, it'll look like the bus is going backwards.
    * RT data comes with direction id. Can get which direction it ran in from schedule data. 
    * Attach route, join coordinate data to segments. 
    * Use segments and average out trips that occurred on that segment. 
* Ask about graph on Slack. 

In [1]:
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd
from calitp.sql import to_snakecase
from shared_utils import geography_utils, utils



In [2]:
# Save files to GCS
from calitp.storage import get_fs

fs = get_fs()

In [3]:
# Record start and end time
import datetime

from loguru import logger

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Load Files

In [5]:
GCS_DASK_PATH = "gs://calitp-analytics-data/data-analyses/dask_test/"

In [6]:
GCS_RT_PATH = "gs://calitp-analytics-data/data-analyses/rt_delay/"

In [7]:
analysis_date = "2022-10-12"

* Should I use `get_routelines` from `A1_vehicle_positions`?
    * Just read it directly from GCS, don't need buffer.

In [8]:
# Read in route lines
routelines = gpd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/routelines_{analysis_date}.parquet"
)

In [9]:
len(routelines), routelines.shape_id.nunique()

(9430, 6353)

In [10]:
# Read in Trips
trips = pd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/trips_{analysis_date}.parquet"
)

In [11]:
len(trips)

120136

* The `route_dir_identifier` is used to cut segments for both directions the route runs.

In [12]:
# Read in longest_shape of each route
longest_shape = gpd.read_parquet(f"{GCS_DASK_PATH}longest_shape_segments.parquet")

* Why would the same route ID for the other direction have more segments? 
   * Can have a layover. 
   * A segment must be 1000 meters or less.

In [13]:
# longest_shape.groupby(['calitp_itp_id','route_id','longest_shape_id']).agg({'segment_sequence':'nunique'}).head()

In [14]:
# longest_shape.sort_values(['calitp_itp_id', 'route_id']).head(25).drop(columns=["geometry", "geometry_arrowized"])

In [15]:
crosswalk = pd.read_parquet(
    f"{GCS_DASK_PATH}segments_route_direction_crosswalk.parquet"
)

In [16]:
# Read operator 4's vehicle positions on segments with plain pandas (pd.read_parquet)
operator_4 = pd.read_parquet(
    f"{GCS_DASK_PATH}vp_sjoin/vp_segment_4_{analysis_date}.parquet"
)

### Task 1
* Using GTFS schedule data, by route_id-shape_id, calculate the route_length of each shape_id as a proportion of the longest shape_id. 
* For each route_id, what's the shortest shape_id length as a proportion of the longest shape_id's length? If it's 100%, then all shape_ids are equal length for that route. If it's 50%, there's a short trip that only runs 50% of the length and turns around.

<b>How</b>
* Take the `trips` table from the compiled cached views (shape ID, route ID, and direction ID) and merge in the segments crosswalk to get the route direction identifier. 
* Take the shapes table, attach the route dir identifier, and merge in the longest shape line using route and direction. 
* Attach the longest shape ID onto this and take the fraction. 
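
The steps above can be sketched on toy data. The rows and lengths below are invented for illustration; only the column names mirror the notebook, and the toy groups by `route_id` alone rather than by route and direction.

```python
import pandas as pd

# Hypothetical shapes: route "A" has a short-turn shape, route "B" does not.
shapes = pd.DataFrame(
    {
        "route_id": ["A", "A", "B", "B"],
        "shape_id": ["a-full", "a-short", "b-1", "b-2"],
        "actual_route_length": [10_000.0, 5_000.0, 8_000.0, 8_000.0],
    }
)

# Length of the longest shape for each route, broadcast back to every row.
shapes["longest_route_length"] = shapes.groupby("route_id")[
    "actual_route_length"
].transform("max")

# Each shape's length as a percentage of the longest shape on its route.
shapes["proportion_route_length"] = (
    shapes["actual_route_length"] / shapes["longest_route_length"] * 100
)

# Shortest shape per route, as a percentage of the longest.
shortest_pct = shapes.groupby("route_id")["proportion_route_length"].min()
print(shortest_pct)
```

Here route "A" has a shape that covers half of the longest one, while both shapes on route "B" are the same length.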

#### Merge trips with crosswalk
* Why do we take away `trip_id` from `trips`?

In [17]:
# Subset
trips2 = trips[
    [
        "calitp_itp_id",
        "route_id",
        "direction_id",
        "shape_id",
    ]
]

In [18]:
len(trips2), len(crosswalk)

(120136, 5150)

In [19]:
crosswalk.head(2)

Unnamed: 0,calitp_itp_id,route_id,direction_id,route_dir_identifier
0,372,4ba918e5-58c0-4d4a-9f55-5cadb8564bff,0,255544
1,293,7,0,1269889


In [20]:
# crosswalk.info(), trips2.info()

In [21]:
# trips.head()

In [22]:
len(trips2.drop_duplicates()), len(trips2)

(8199, 120136)

##### How could there be duplicates? Is it right to drop them?

In [23]:
trips2 = (trips2.drop_duplicates()).reset_index(drop=True)
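
One likely reason for the duplicates, sketched on made-up rows: a trips table has one row per `trip_id`, and many trips share the same route, direction, and shape. Once `trip_id` is left out of the subset, those rows become identical, so dropping them loses no route-level information. The `toy_trips` table is hypothetical.

```python
import pandas as pd

# Hypothetical trips table: one row per trip_id.
toy_trips = pd.DataFrame(
    {
        "trip_id": ["t1", "t2", "t3", "t4"],
        "route_id": ["U", "U", "U", "212"],
        "direction_id": [1, 1, 0, 1],
        "shape_id": ["shp-U-06", "shp-U-06", "shp-U-07", "shp-212-07"],
    }
)

# Without trip_id, trips t1 and t2 are indistinguishable.
subset = toy_trips[["route_id", "direction_id", "shape_id"]]
n_before = len(subset)

# Keep one row per route-direction-shape combination.
deduped = subset.drop_duplicates().reset_index(drop=True)
n_after = len(deduped)

print(n_before, n_after)
```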

* Many more values in `trips` than `crosswalk`.

In [24]:
trips2.merge(
    crosswalk,
    how="outer",
    on=["calitp_itp_id", "route_id", "direction_id"],
    indicator=True,
)[["_merge"]].value_counts()

_merge    
both          7833
left_only      366
right_only      74
dtype: int64
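
To see which rows fall into `left_only` and `right_only` rather than just counting them, the `_merge` column can be used as a filter. This is a minimal sketch on invented frames; `left` and `right` stand in for `trips2` and `crosswalk`.

```python
import pandas as pd

# Hypothetical stand-ins for trips2 (left) and crosswalk (right).
left = pd.DataFrame(
    {
        "route_id": ["1", "2", "3"],
        "direction_id": [0, 0, 1],
        "shape_id": ["s1", "s2", "s3"],
    }
)
right = pd.DataFrame(
    {
        "route_id": ["1", "2", "4"],
        "direction_id": [0, 0, 0],
        "route_dir_identifier": [11, 22, 44],
    }
)

merged = left.merge(
    right, how="outer", on=["route_id", "direction_id"], indicator=True
)

# Rows that exist on only one side of the merge.
left_only = merged.loc[merged["_merge"] == "left_only"]
right_only = merged.loc[merged["_merge"] == "right_only"]
counts = merged["_merge"].value_counts()

print(left_only[["route_id", "direction_id"]])
print(right_only[["route_id", "direction_id"]])
```

Looking at the unmatched keys directly often shows whether the mismatch comes from missing routes, a missing direction, or differing dtypes on the merge columns.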

In [25]:
trips_m_crosswalk = trips2.merge(
    crosswalk, how="inner", on=["calitp_itp_id", "route_id", "direction_id"]
)

In [26]:
trips_m_crosswalk.head()

Unnamed: 0,calitp_itp_id,route_id,direction_id,shape_id,route_dir_identifier
0,4,U,1,shp-U-06,1244740981
1,4,U,0,shp-U-07,1026952675
2,4,212,1,shp-212-07,1369834141
3,4,212,0,shp-212-57,648098315
4,4,67,0,shp-67-57,3358964048


#### Shapes table -> attach route dir identifier 

In [27]:
routelines = routelines.drop(columns=["calitp_url_number"])

In [28]:
# Calculate length of geometry
routelines = routelines.assign(
    actual_route_length=(
        routelines.geometry.to_crs(geography_utils.CA_StatePlane).length
    )
)

In [29]:
routelines.loc[routelines.shape_id == "shp-U-06"].drop(columns=["geometry"])

Unnamed: 0,calitp_itp_id,shape_id,actual_route_length
111,4,shp-U-06,111927.21
622,4,shp-U-06,111927.21


##### How could there be duplicates? Is it right to drop them?

In [30]:
len(routelines), len(routelines.drop_duplicates())

(9430, 8133)

In [31]:
routelines = (routelines.drop_duplicates()).reset_index(drop=True)

In [32]:
# In the outer merge, 363 rows are left_only, i.e. shapes in routelines
# with no match in trips_m_crosswalk. Is this okay?
routelines.merge(
    trips_m_crosswalk, how="outer", on=["calitp_itp_id", "shape_id"], indicator=True
)[["_merge"]].value_counts()

_merge    
both          8529
left_only      363
right_only       0
dtype: int64

In [33]:
routelines_m_trips = routelines.merge(
    trips_m_crosswalk,
    how="inner",
    on=["calitp_itp_id", "shape_id"],
)

In [34]:
len(routelines_m_trips), len(trips_m_crosswalk), len(routelines)

(8529, 7833, 8133)

* Why did my direction ID 0 for "shp-U-06" get dropped?

In [35]:
routelines_m_trips.loc[routelines_m_trips.route_id == "U"].drop(columns=["geometry"])

Unnamed: 0,calitp_itp_id,shape_id,actual_route_length,route_id,direction_id,route_dir_identifier
1,4,shp-U-07,111849.47,U,0,1026952675
111,4,shp-U-06,111927.21,U,1,1244740981


#### Merge in longest shape line on routes and direction.

In [36]:
routelines_m_trips.crs == longest_shape.crs

True

In [37]:
# longest_shape = longest_shape.drop(columns = ["calitp_url_number"])

In [38]:
longest_shape = longest_shape.rename(columns={"route_length": "longest_route_length"})

##### Do I have to aggregate longest_shape because the `longest shape id` is broken down by segment and doesn't total up to the `not_longest_route_length`? 

In [39]:
longest_shape.loc[longest_shape.route_dir_identifier == 1244740981].drop(
    columns=["geometry", "geometry_arrowized"]
)

Unnamed: 0,calitp_itp_id,calitp_url_number,route_id,direction_id,longest_shape_id,route_dir_identifier,longest_route_length,segment_sequence
4012,4,0,U,1,shp-U-06,1244740981,34070.9,0
4013,4,0,U,1,shp-U-06,1244740981,34070.9,1
4014,4,0,U,1,shp-U-06,1244740981,34070.9,2
4015,4,0,U,1,shp-U-06,1244740981,34070.9,3
4016,4,0,U,1,shp-U-06,1244740981,34070.9,4
4017,4,0,U,1,shp-U-06,1244740981,34070.9,5
4018,4,0,U,1,shp-U-06,1244740981,34070.9,6
4019,4,0,U,1,shp-U-06,1244740981,34070.9,7
4020,4,0,U,1,shp-U-06,1244740981,34070.9,8
4021,4,0,U,1,shp-U-06,1244740981,34070.9,9


In [40]:
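# Collapse the per-segment rows to one row per route and direction.
# Caution: longest_route_length looks identical on every segment row above,
# so "sum" scales it by the number of segments; "first" or "max" would keep it once.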
longest_shape_agg = (
    longest_shape.groupby(
        [
            "calitp_itp_id",
            "route_id",
            "direction_id",
            "longest_shape_id",
            "route_dir_identifier",
        ]
    ).agg({"longest_route_length": "sum"})
).reset_index()

In [41]:
longest_shape_agg.loc[longest_shape_agg.route_dir_identifier == 1244740981]

Unnamed: 0,calitp_itp_id,route_id,direction_id,longest_shape_id,route_dir_identifier,longest_route_length
244,4,U,1,shp-U-06,1244740981,1192481.63
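
On the aggregation question: when every segment row repeats the same whole-route value, the choice of aggregation function matters. The sketch below uses an invented three-segment table to contrast `sum` with `first`; it assumes, as the rows for `shp-U-06` suggest, that `longest_route_length` is a route-level value copied onto each segment.

```python
import pandas as pd

# Hypothetical segment table: 3 segments of one route, each row repeating
# the same whole-route length.
segments = pd.DataFrame(
    {
        "route_dir_identifier": [1, 1, 1],
        "segment_sequence": [0, 1, 2],
        "longest_route_length": [3000.0, 3000.0, 3000.0],
    }
)

by_route = segments.groupby("route_dir_identifier")["longest_route_length"]

# Adds the repeated value once per segment: 3 x 3000.
summed = by_route.sum()

# Keeps the whole-route value a single time.
kept = by_route.first()

print(summed.loc[1], kept.loc[1])
```

If the column instead held each segment's own length, `sum` would be the right choice, so it is worth confirming what the column stores before switching.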


In [42]:
len(routelines_m_trips), len(longest_shape), len(longest_shape_agg)

(8529, 126896, 5150)

In [43]:
# Right only has 74 more rows...
routelines_m_trips.merge(
    longest_shape_agg,
    how="outer",
    on=["calitp_itp_id", "direction_id", "route_id", "route_dir_identifier"],
    indicator=True,
)[["_merge"]].value_counts()

_merge    
both          8529
right_only      74
left_only        0
dtype: int64

In [44]:
routelines_final = routelines_m_trips.merge(
    longest_shape_agg,
    how="inner",
    on=["calitp_itp_id", "direction_id", "route_id", "route_dir_identifier"],
)

In [45]:
# Calculate out proportion of route length against longest.
routelines_final["proportion_route_length"] = (
    routelines_final["actual_route_length"] / routelines_final["longest_route_length"]
) * 100

In [46]:
routelines_final.proportion_route_length.describe()

count   8529.00
mean      33.32
std      144.08
min        0.07
25%        9.81
50%       18.23
75%       32.45
max     7015.79
Name: proportion_route_length, dtype: float64

##### I don't really get how every trip_id would have the same `route_length`?

In [47]:
routelines_final.loc[routelines_final.route_id == "U"].drop(columns=["geometry"]).head()

Unnamed: 0,calitp_itp_id,shape_id,actual_route_length,route_id,direction_id,route_dir_identifier,longest_shape_id,longest_route_length,proportion_route_length
1,4,shp-U-07,111849.47,U,0,1026952675,shp-U-07,1191737.11,9.39
137,4,shp-U-06,111927.21,U,1,1244740981,shp-U-06,1192481.63,9.39
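
A quick arithmetic check on the `shp-U-06` row, using only values printed in this notebook. `shp-U-06` is itself the longest shape, so a proportion near 100% would be expected. The unit conversion is an assumption: it supposes `CA_StatePlane` is in feet and `longest_route_length` is in meters, which the notebook does not confirm.

```python
# Values copied from the shp-U-06 rows shown above.
actual_route_length = 111927.21  # from routelines, in CA_StatePlane units
per_segment_value = 34070.9  # longest_route_length on each segment row
summed_value = 1192481.63  # longest_route_length after the "sum" aggregation

# What the notebook computes: about 9.39%.
pct_vs_sum = actual_route_length / summed_value * 100

# Number of segment rows implied by the sum: about 35.
implied_segments = summed_value / per_segment_value

# Assumption: actual_route_length is in feet, longest_route_length in meters.
FEET_TO_METERS = 0.3048
pct_same_units = actual_route_length * FEET_TO_METERS / per_segment_value * 100

print(round(pct_vs_sum, 2), round(implied_segments, 1), round(pct_same_units, 1))
```

Under that assumption the proportion lands close to 100%, which suggests the low percentages in `describe()` may come from the summed length and a unit mismatch rather than from short trips.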


### Task 2
* Testing with 148 Kings County Area Public Transit Agency
* Calculate time of trips?

* How come there are so many different timestamps within 30 seconds of each other in the same segment? GTFS pings every 30 seconds.

In [48]:
def find_operator_info(df):
    """Summarize each trip: first and last timestamp, segment count, minutes elapsed."""
    df = df.sort_values(["calitp_itp_id", "trip_id", "segment_sequence"])

    merge_cols = [
        "calitp_itp_id",
        "trip_id",
        "route_dir_identifier",
    ]

    # Get the minimum time of the vehicle stamp.
    # For the trip_ID and direction.
    start_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "min"})
        .rename(columns={"vehicle_timestamp": "start"})
        .reset_index()
    )

    # Get the max time of the vehicle stamp.
    end_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "max"})
        .rename(columns={"vehicle_timestamp": "end"})
        .reset_index()
    )

    # Count number of segments
    segment_counts = (
        df.groupby(merge_cols)
        .agg({"segment_sequence": "nunique"})
        .reset_index()
        .rename(columns={"segment_sequence": "number_of_segments"})
    )

    # Merge
    m1 = start_time_trip.merge(end_time_trip, how="inner", on=merge_cols).merge(
        segment_counts, how="left", on=merge_cols
    )

    # https://stackoverflow.com/questions/51491724/calculate-difference-of-2-dates-in-minutes-in-pandas
    m1["minutes_elapsed"] = (m1.end - m1.start).dt.total_seconds() / 60

    return m1
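
The same per-trip summary can be sketched on toy data with a single `groupby` and named aggregation. The `vp` table below is invented; note that trip `t1` has two pings inside segment 0, which is expected whenever a vehicle stays in a segment for longer than the ping interval.

```python
import pandas as pd

# Hypothetical vehicle positions for two trips.
vp = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t1", "t2", "t2"],
        "segment_sequence": [0, 0, 1, 0, 1],
        "vehicle_timestamp": pd.to_datetime(
            [
                "2022-10-12 08:00:00",
                "2022-10-12 08:00:30",
                "2022-10-12 08:10:00",
                "2022-10-12 09:00:00",
                "2022-10-12 09:45:00",
            ]
        ),
    }
)

# First ping, last ping, and distinct segments touched, per trip.
summary = (
    vp.groupby("trip_id")
    .agg(
        start=("vehicle_timestamp", "min"),
        end=("vehicle_timestamp", "max"),
        number_of_segments=("segment_sequence", "nunique"),
    )
    .reset_index()
)

# Elapsed time between first and last ping, in minutes.
summary["minutes_elapsed"] = (
    summary["end"] - summary["start"]
).dt.total_seconds() / 60

print(summary)
```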

In [50]:
operator_4_metrics = find_operator_info(operator_4)

In [49]:
operator_4.head(2)

Unnamed: 0,calitp_itp_id,calitp_url_number,vehicle_timestamp,trip_id,route_dir_identifier,segment_sequence,lon,lat
0,4,0,2022-10-12 03:57:57,1002020,2062080730,0,-199410.89,-20669.41
1,4,0,2022-10-12 03:58:12,1002020,2062080730,0,-199423.82,-20695.24


In [51]:
operator_4_metrics.head()

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,start,end,number_of_segments,minutes_elapsed
0,4,10000020,4214183996,2022-10-12 21:55:01,2022-10-12 22:24:54,7,29.88
1,4,1000020,2437991552,2022-10-12 16:21:29,2022-10-12 16:32:23,4,10.9
2,4,100020,1578901637,2022-10-12 09:20:02,2022-10-12 10:35:03,26,75.02
3,4,10002020,3278503533,2022-10-12 08:40:17,2022-10-12 09:44:50,22,64.55
4,4,10003020,4240972089,2022-10-12 17:25:04,2022-10-12 18:45:18,20,80.23


In [52]:
m2 = operator_4_metrics.merge(
    routelines_final,
    how="inner",
    on=["calitp_itp_id", "route_dir_identifier"],
)

##### Lots of route ids missing?

In [57]:
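# Route IDs present in routelines_final but absent from the merged result.
# m2 appears to hold only operator 4, while routelines_final covers every
# operator, so many of the IDs listed below may belong to other operators.
# A separate name avoids overwriting the routelines_final GeoDataFrame.
merged_routeid = set(m2.route_id.unique().tolist())
all_routeid = set(routelines_final.route_id.unique().tolist())
all_routeid - merged_routeid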

{'89edf785-5b0a-499a-b1e8-230486553917',
 '53P',
 '4_AM',
 '3162',
 '3403',
 '617-13157',
 'CLEB',
 '905',
 'L1',
 '4062',
 '4141',
 '1965',
 '114',
 'SCVMC Shuttle',
 '251-13157',
 '3385',
 '3615',
 '11110',
 '120',
 '953',
 '985',
 '16739',
 '13056',
 '632',
 'Route2',
 '51-13157',
 '130',
 '1238',
 '12164',
 '3664',
 'OPF Shuttle',
 '1112',
 '9',
 'HB',
 '901-13157',
 'R',
 '117-13157',
 'Inland Emp.-Orange Co. Line',
 'Hwy99Express',
 '11675',
 '398-198',
 '11690',
 'ACE Gray',
 '3_AM',
 '1047',
 '12874',
 '4e5dc868-28bb-4f99-ab5d-1c12f15f5e55',
 '933',
 '81-13157',
 '645',
 '3212',
 'UGC Shuttle',
 '1203',
 '28',
 'A - Afternoon',
 '205-13157',
 '48',
 '397-198',
 '9e3d73a1-3b92-43c6-9a41-906b72b13034',
 '1',
 'DCW',
 '1368',
 '290',
 '167',
 '566',
 '19479',
 '351',
 '37-198',
 '530',
 '576',
 '12867',
 '4A',
 '155-13157',
 '30-13157',
 '4135',
 '615',
 '0ef611a2-beee-4d28-a863-c4fdf15985b1',
 'LF-237',
 '838',
 '3518',
 '030',
 '268-13157',
 '480_merged_6230',
 '7_PM',
 '8BX',
 

### Question
* For each operator, what percentage of RT trip_ids would remain after these thresholds are applied? Write a chart function that takes a single operator, then produce charts for all operators. Is it the time coverage or the geographic coverage that drives the exclusion of trips? What threshold is recommended?
* Do short trips tend to be 50% of the longest route length? 40%? 30%? Have this handy to inform question 1.
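
One way to approach the first question is a small helper that reports the share of trips surviving each threshold separately and together, which shows whether time or geographic coverage removes more trips. The function name `pct_trips_remaining`, the toy table, and the threshold values are all hypothetical placeholders, not recommendations.

```python
import pandas as pd


def pct_trips_remaining(df, min_minutes, min_segments):
    """Percent of trips kept by a time threshold, a segment threshold, and both."""
    time_ok = df["minutes_elapsed"] >= min_minutes
    space_ok = df["number_of_segments"] >= min_segments
    return {
        "time_only": float(time_ok.mean() * 100),
        "segments_only": float(space_ok.mean() * 100),
        "both": float((time_ok & space_ok).mean() * 100),
    }


# Hypothetical per-trip metrics, shaped like the output of find_operator_info.
toy_metrics = pd.DataFrame(
    {
        "trip_id": ["t1", "t2", "t3", "t4"],
        "minutes_elapsed": [10.0, 45.0, 60.0, 80.0],
        "number_of_segments": [4, 3, 20, 26],
    }
)

result = pct_trips_remaining(toy_metrics, min_minutes=40, min_segments=10)
print(result)
```

In this toy case the segment threshold removes more trips than the time threshold, and the combined filter is no stricter than the segment filter alone. Running the helper per operator would give the inputs for the charts.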


In [None]:
segment_148_m.minutes_elapsed.describe()

In [None]:
threshold = 40

In [None]:
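# Trips at or above the threshold count as usable, so a trip exactly at
# the threshold is not dropped from both groups.
usable = (segment_148_m.loc[segment_148_m.minutes_elapsed >= threshold]).reset_index(
    drop=True
)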

In [None]:
unusable = (segment_148_m.loc[segment_148_m.minutes_elapsed < threshold]).reset_index(
    drop=True
)

In [None]:
len(unusable), len(usable)