## RT trip diagnostics: thresholds for usable trips 
### Other Questions
* Should thresholds be on the operator or the operator-route ID level?
* How to figure out whether a segment is acceptable or not?
* Is the `proportion_route_length` tied to usable segments?

In [1]:
# Charts
import altair as alt
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd
from calitp.sql import to_snakecase
from shared_utils import calitp_color_palette as cp
from shared_utils import geography_utils, styleguide, utils



In [2]:
# Save files to GCS
from calitp.storage import get_fs

fs = get_fs()

In [3]:
# Record start and end time
import datetime

from loguru import logger

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Load Files

In [5]:
GCS_DASK_PATH = "gs://calitp-analytics-data/data-analyses/dask_test/"
GCS_RT_PATH = "gs://calitp-analytics-data/data-analyses/rt_delay/"

In [6]:
analysis_date = "2022-10-12"

In [7]:
# Tells me actual route length for each shape id.
routelines = gpd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/routelines_{analysis_date}.parquet"
)

In [8]:
# Read in RT trips.
# Gives me the trips run on a particular day across all operators.
trips = pd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/trips_{analysis_date}.parquet"
)

In [9]:
# Read in longest_shape of each route
# Schedule data, source of truth.
longest_shape = gpd.read_parquet(f"{GCS_DASK_PATH}longest_shape_segments.parquet")

In [10]:
# longest_shape.groupby(['calitp_itp_id','route_id','longest_shape_id']).agg({'segment_sequence':'nunique'}).head()

In [11]:
# longest_shape.sort_values(['calitp_itp_id', 'route_id']).head(25).drop(columns=["geometry", "geometry_arrowized"])

In [12]:
crosswalk = pd.read_parquet(
    f"{GCS_DASK_PATH}segments_route_direction_crosswalk.parquet"
)

In [13]:
operator_4 = pd.read_parquet(
    f"{GCS_DASK_PATH}vp_sjoin/vp_segment_4_{analysis_date}.parquet"
)

### Task 1
* Using GTFS schedule data, by route_id-shape_id, calculate the route_length of each shape_id as a proportion of the longest shape_id. 
* For <b>each route_id</b>, what's the shortest shape_id length in proportion to the longest shape_id's length? If it's 100%, then all shape_ids are equal length for that route. If it's 50%, a short trip exists that runs only 50% of the length and turns around.
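
The per-route question above can be sketched on toy data. The frame below and its `route_length` values are hypothetical stand-ins for the schedule shapes table, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical shape lengths in meters for two routes.
shapes = pd.DataFrame(
    {
        "route_id": ["A", "A", "A", "B", "B"],
        "shape_id": ["a1", "a2", "a3", "b1", "b2"],
        "route_length": [10_000.0, 10_000.0, 5_000.0, 8_000.0, 8_000.0],
    }
)

# Length of each shape as a percentage of the longest shape on its route.
longest = shapes.groupby("route_id")["route_length"].transform("max")
shapes["pct_of_longest"] = shapes["route_length"] / longest * 100

# Shortest shape per route, relative to the longest one.
shortest_pct = shapes.groupby("route_id")["pct_of_longest"].min()
print(shortest_pct)
```

Route `A` has a short-turn shape at half the length, while every shape on route `B` is full length.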

<b>How</b>
* Need table `trips` from compiled cached views -> shape ID, route ID, and direction ID -> merge in segments crosswalk with route direction identifier.
* Shapes table -> attach route dir identifier.
* Merge in longest shape line using route and direction, then take the fraction.

#### Step 1. Merge `trips` with `crosswalk`
##### Help: Why do we take away `trip_id` from `trips`? 

In [14]:
# Subset
trips2 = trips[
    [
        "calitp_itp_id",
        "route_id",
        "direction_id",
        "shape_id",
    ]
]

In [15]:
len(trips2), len(crosswalk), len(trips)

(120136, 5150, 120136)

In [16]:
trips2.head(2)

Unnamed: 0,calitp_itp_id,route_id,direction_id,shape_id
0,4,U,1,shp-U-06
1,4,U,1,shp-U-06


In [17]:
crosswalk.head(2)

Unnamed: 0,calitp_itp_id,route_id,direction_id,route_dir_identifier
0,372,4ba918e5-58c0-4d4a-9f55-5cadb8564bff,0,255544
1,293,7,0,1269889


In [18]:
trips2 = (trips2.drop_duplicates()).reset_index(drop=True)

In [19]:
len(trips2)

8199

* 366 rows in `trips` have no match in `crosswalk` (and 74 `crosswalk` rows have no match in `trips`), even though `calitp_itp_id.nunique()` yields the same number for both.

In [20]:
trips2.merge(
    crosswalk,
    how="outer",
    on=["calitp_itp_id", "route_id", "direction_id"],
    indicator=True,
)[["_merge"]].value_counts()

_merge    
both          7833
left_only      366
right_only      74
dtype: int64
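
To see which combinations fall out, one option is to keep the outer-merge result and filter on `_merge`. This is a minimal sketch on toy frames; `trips_toy`, `crosswalk_toy`, and the identifier values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the deduplicated trips and the crosswalk.
trips_toy = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "route_id": ["U", "212", "67"],
        "direction_id": [1, 0, 0],
        "shape_id": ["shp-U-06", "shp-212-57", "shp-67-57"],
    }
)
crosswalk_toy = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "route_id": ["U", "212", "99"],
        "direction_id": [1, 0, 0],
        "route_dir_identifier": [1, 2, 3],  # made-up identifiers
    }
)

merged = trips_toy.merge(
    crosswalk_toy,
    how="outer",
    on=["calitp_itp_id", "route_id", "direction_id"],
    indicator=True,
)

# Rows present in trips but missing from the crosswalk.
unmatched_trips = merged.loc[merged["_merge"] == "left_only"]
# Rows present in the crosswalk but never seen in trips.
unmatched_crosswalk = merged.loc[merged["_merge"] == "right_only"]

print(unmatched_trips[["route_id", "direction_id", "shape_id"]])
print(unmatched_crosswalk[["route_id", "direction_id", "route_dir_identifier"]])
```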

In [21]:
trips_m_crosswalk = trips2.merge(
    crosswalk, how="inner", on=["calitp_itp_id", "route_id", "direction_id"]
)

In [22]:
trips_m_crosswalk.head()

Unnamed: 0,calitp_itp_id,route_id,direction_id,shape_id,route_dir_identifier
0,4,U,1,shp-U-06,1244740981
1,4,U,0,shp-U-07,1026952675
2,4,212,1,shp-212-07,1369834141
3,4,212,0,shp-212-57,648098315
4,4,67,0,shp-67-57,3358964048


#### Step 2. Shapes table -> attach route dir identifier 
* Drop duplicates in `routelines` that exist only because of `calitp_url_number`.

In [23]:
routelines.crs == longest_shape.crs

True

In [24]:
# Drop calitp_url_number since it's no longer needed
routelines = routelines.drop(columns=["calitp_url_number"])

In [25]:
routelines = (routelines.drop_duplicates()).reset_index(drop=True)

In [26]:
len(routelines)

8133

In [27]:
routelines.crs

<Derived Projected CRS: EPSG:3310>
Name: NAD83 / California Albers
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: United States (USA) - California.
- bounds: (-124.45, 32.53, -114.12, 42.01)
Coordinate Operation:
- name: California Albers
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [28]:
geography_utils.CA_NAD83Albers

'EPSG:3310'

In [29]:
# Calculate length of geometry
routelines = routelines.assign(
    actual_route_length=(
        routelines.geometry.to_crs(geography_utils.CA_NAD83Albers).length
    )
)

In [30]:
routelines.merge(
    trips_m_crosswalk, how="outer", on=["calitp_itp_id", "shape_id"], indicator=True
)[["_merge"]].value_counts()

_merge    
both          8529
left_only      363
right_only       0
dtype: int64

In [31]:
routelines_m_trips = routelines.merge(
    trips_m_crosswalk,
    how="inner",
    on=["calitp_itp_id", "shape_id"],
)

In [32]:
len(routelines_m_trips), len(trips_m_crosswalk), len(routelines)

(8529, 7833, 8133)
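
The merged frame has more rows than either input, which suggests the join keys are not unique on at least one side. A rough way to check, sketched here on toy frames with made-up values, is to count repeated key combinations before merging.

```python
import pandas as pd

keys = ["calitp_itp_id", "shape_id"]

# Toy frames: shape s1 appears twice on both sides.
left = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "shape_id": ["s1", "s1", "s2"],
        "actual_route_length": [100.0, 101.0, 50.0],
    }
)
right = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "shape_id": ["s1", "s1", "s2"],
        "route_dir_identifier": [10, 11, 12],
    }
)

# Count rows whose key combination appears more than once on each side.
left_dupes = int(left.duplicated(subset=keys, keep=False).sum())
right_dupes = int(right.duplicated(subset=keys, keep=False).sum())
print(left_dupes, right_dupes)

# Repeated keys on both sides multiply: 2 x 2 rows for s1, plus 1 for s2.
merged = left.merge(right, how="inner", on=keys)
print(len(left), len(right), len(merged))
```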

In [33]:
routelines_m_trips.loc[routelines_m_trips.route_id == "U"].drop(columns=["geometry"])

Unnamed: 0,calitp_itp_id,shape_id,actual_route_length,route_id,direction_id,route_dir_identifier
1,4,shp-U-07,34049.63,U,0,1026952675
111,4,shp-U-06,34070.9,U,1,1244740981


#### Step 3. Merge in longest shape line on routes and direction.
* Which geometry to keep?

In [34]:
longest_shape = longest_shape.rename(columns={"route_length": "longest_route_length"})

In [35]:
# route_u.explore("segment_sequence", cmap = "tab10",
#                style_kwds = {'weight': 10}, legend = False, height = 400, width = 800)

In [36]:
routelines_final = routelines_m_trips.merge(
    longest_shape.drop(columns=["geometry"]),
    how="inner",
    on=["calitp_itp_id", "direction_id", "route_id", "route_dir_identifier"],
)

In [37]:
# Calculate out proportion of route length against longest.
routelines_final["proportion_route_length"] = (
    (routelines_final["actual_route_length"] / routelines_final["longest_route_length"])
    * 100
).astype(int)

In [38]:
len(routelines_final)

219200

In [39]:
routelines_final.columns

Index(['calitp_itp_id', 'shape_id', 'geometry', 'actual_route_length',
       'route_id', 'direction_id', 'route_dir_identifier', 'calitp_url_number',
       'longest_shape_id', 'longest_route_length', 'segment_sequence',
       'geometry_arrowized', 'proportion_route_length'],
      dtype='object')

In [40]:
# Count the segments on the longest shape for each route-shape combination
routelines_final_test = (
    routelines_final.groupby(
        [
            "route_id",
            "calitp_itp_id",
            "route_dir_identifier",
            "shape_id",
            "longest_shape_id",
            "proportion_route_length",
        ]
    )
    .agg({"segment_sequence": "count"})
    .reset_index()
)

In [41]:
# routelines_final_test.loc[routelines_final_test.route_dir_identifier == 1244740981].drop(
#    columns=["geometry","geometry_arrowized"]
# ).head()

In [42]:
len(routelines_final_test), len(routelines_final_test.drop_duplicates())

(8022, 8022)

### Task 2
* Testing with Agency 4. 
* Calculate time of trips?


In [43]:
def find_operator_info(df):
    df = df.sort_values(["calitp_itp_id", "trip_id", "segment_sequence"])

    merge_cols = [
        "calitp_itp_id",
        "trip_id",
        "route_dir_identifier",
    ]

    # Get start time.
    start_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "min"})
        .rename(columns={"vehicle_timestamp": "start"})
        .reset_index()
    )

    # Get end time.
    end_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "max"})
        .rename(columns={"vehicle_timestamp": "end"})
        .reset_index()
    )

    # Count number of segments.
    segment_counts = (
        df.groupby(merge_cols)
        .agg({"segment_sequence": "nunique"})
        .reset_index()
        .rename(columns={"segment_sequence": "number_of_segments"})
    )

    # Merge
    m1 = start_time_trip.merge(end_time_trip, how="inner", on=merge_cols).merge(
        segment_counts, how="left", on=merge_cols
    )

    # Calculate time elapsed
    # https://stackoverflow.com/questions/51491724/calculate-difference-of-2-dates-in-minutes-in-pandas
    m1["minutes_elapsed"] = (m1.end - m1.start).dt.total_seconds() / 60

    return m1
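
For comparison, the same per-trip summary can be sketched with a single groupby and named aggregation, which avoids the three intermediate frames and the merges. `summarize_trips` is a hypothetical name, and the toy rows are made up.

```python
import pandas as pd


def summarize_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Start, end, segment count, and elapsed minutes per trip."""
    group_cols = ["calitp_itp_id", "trip_id", "route_dir_identifier"]
    out = (
        df.groupby(group_cols)
        .agg(
            start=("vehicle_timestamp", "min"),
            end=("vehicle_timestamp", "max"),
            number_of_segments=("segment_sequence", "nunique"),
        )
        .reset_index()
    )
    out["minutes_elapsed"] = (out["end"] - out["start"]).dt.total_seconds() / 60
    return out


# Toy vehicle positions for a single trip.
vp = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "trip_id": ["t1", "t1", "t1"],
        "route_dir_identifier": [1, 1, 1],
        "segment_sequence": [0, 0, 1],
        "vehicle_timestamp": pd.to_datetime(
            ["2022-10-12 08:00:00", "2022-10-12 08:00:30", "2022-10-12 08:10:00"]
        ),
    }
)
summary = summarize_trips(vp)
print(summary)
```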

In [44]:
operator_4.head(2)

Unnamed: 0,calitp_itp_id,calitp_url_number,vehicle_timestamp,trip_id,route_dir_identifier,segment_sequence,lon,lat
0,4,0,2022-10-12 03:57:57,1002020,2062080730,0,-199410.89,-20669.41
1,4,0,2022-10-12 03:58:12,1002020,2062080730,0,-199423.82,-20695.24


In [45]:
operator_4_metrics = find_operator_info(operator_4)

In [46]:
operator_4_metrics.head(2)

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,start,end,number_of_segments,minutes_elapsed
0,4,10000020,4214183996,2022-10-12 21:55:01,2022-10-12 22:24:54,7,29.88
1,4,1000020,2437991552,2022-10-12 16:21:29,2022-10-12 16:32:23,4,10.9


In [47]:
routelines_final_test.head(2)

Unnamed: 0,route_id,calitp_itp_id,route_dir_identifier,shape_id,longest_shape_id,proportion_route_length,segment_sequence
0,1,208,58701248,1104,1104,100,8
1,1,208,58701248,1106,1104,59,8


In [48]:
# Merge
m2 = operator_4_metrics[
    [
        "calitp_itp_id",
        "trip_id",
        "route_dir_identifier",
        "number_of_segments",
        "minutes_elapsed",
    ]
].merge(
    routelines_final_test,
    how="inner",
    on=["calitp_itp_id", "route_dir_identifier"],
)

In [49]:
len(operator_4_metrics), len(m2)

(5202, 7217)

In [50]:
# Drop some columns for now to check out
# m2 = m2.drop(columns=[ "actual_route_length", "longest_route_length"])

In [51]:
# Find the total number of segments in the specific operator file
# vs. what was recorded in `longest_shape`
m2["segment_proportion"] = ((m2.number_of_segments / m2.segment_sequence) * 100).astype(
    "int64"
)

In [52]:
m2.sample()

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,number_of_segments,minutes_elapsed,route_id,shape_id,longest_shape_id,proportion_route_length,segment_sequence,segment_proportion
4408,4,10239020,2487841604,10,45.73,51B,shp-51B-16,shp-51B-16,100,10,100


In [53]:
m2.segment_proportion.value_counts().head()

100    5971
94      269
54      144
90      116
50      113
Name: segment_proportion, dtype: int64

##### Help. Why does 1244740981 not yield any results, even in the original dataframe?
* Filtering `routelines_final` for ITP ID 4 gives 2 more route IDs than `vp_sjoin/vp_segment_4`.
* Wondering why that is.

In [54]:
m2.loc[m2.route_id == "U"]

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,number_of_segments,minutes_elapsed,route_id,shape_id,longest_shape_id,proportion_route_length,segment_sequence,segment_proportion


##### Help. Other Questions
* Why are there different shape IDs for the same trip and route_dir_identifier?
* Why would all these segments be missing if the route length is 100%?

In [55]:
m2.loc[m2.route_dir_identifier == 4105021223].sample(5)

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,number_of_segments,minutes_elapsed,route_id,shape_id,longest_shape_id,proportion_route_length,segment_sequence,segment_proportion
4297,4,9281020,4105021223,16,53.9,1T,shp-1T-06,shp-1T-04,94,17,94
4168,4,4370020,4105021223,16,55.33,1T,shp-1T-06,shp-1T-04,94,17,94
4044,4,11976020,4105021223,16,56.87,1T,shp-1T-05,shp-1T-04,94,17,94
4229,4,6714020,4105021223,16,59.13,1T,shp-1T-04,shp-1T-04,100,17,94
4079,4,13484020,4105021223,17,40.32,1T,shp-1T-04,shp-1T-04,100,17,100


In [56]:
m2.loc[m2.trip_id == "6566020"]

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,number_of_segments,minutes_elapsed,route_id,shape_id,longest_shape_id,proportion_route_length,segment_sequence,segment_proportion
4223,4,6566020,4105021223,1,46.17,1T,shp-1T-04,shp-1T-04,100,17,5
4224,4,6566020,4105021223,1,46.17,1T,shp-1T-05,shp-1T-04,94,17,5
4225,4,6566020,4105021223,1,46.17,1T,shp-1T-06,shp-1T-04,94,17,5
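
One likely reason for the repeated rows, judging from the merge keys used to build `m2`: the join is on `calitp_itp_id` and `route_dir_identifier` only, so each trip is paired with every `shape_id` on that route direction. That would also be consistent with `m2` having more rows (7217) than `operator_4_metrics` (5202). A toy reproduction with made-up values:

```python
import pandas as pd

trip_metrics = pd.DataFrame(
    {
        "calitp_itp_id": [4],
        "trip_id": ["t1"],
        "route_dir_identifier": [100],
        "number_of_segments": [1],
    }
)
# Three shapes share the same route direction.
shapes = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "route_dir_identifier": [100, 100, 100],
        "shape_id": ["s-04", "s-05", "s-06"],
    }
)

# Joining without shape_id repeats the trip once per shape.
fanned_out = trip_metrics.merge(
    shapes, how="inner", on=["calitp_itp_id", "route_dir_identifier"]
)
print(fanned_out)

# With the trip's own shape_id available, the join stays one-to-one.
trip_metrics_with_shape = trip_metrics.assign(shape_id="s-04")
one_to_one = trip_metrics_with_shape.merge(
    shapes, how="inner", on=["calitp_itp_id", "route_dir_identifier", "shape_id"]
)
print(one_to_one)
```

In the real data this would presumably mean carrying `shape_id` from `trips` onto each `trip_id` before merging. That step is an assumption, not something this notebook does.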


In [57]:
operator_4.loc[operator_4.trip_id == "6566020"].head()

Unnamed: 0,calitp_itp_id,calitp_url_number,vehicle_timestamp,trip_id,route_dir_identifier,segment_sequence,lon,lat
21089,4,0,2022-10-12 11:38:34,6566020,4105021223,0,-190130.83,-30605.36
21090,4,0,2022-10-12 11:39:04,6566020,4105021223,0,-190130.83,-30605.36
21091,4,0,2022-10-12 11:39:34,6566020,4105021223,0,-190130.83,-30605.36
21092,4,0,2022-10-12 11:39:50,6566020,4105021223,0,-190130.83,-30605.36
21093,4,0,2022-10-12 11:40:05,6566020,4105021223,0,-190130.83,-30605.36


In [58]:
operator_4_metrics.loc[operator_4_metrics.trip_id == "6566020"]

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,start,end,number_of_segments,minutes_elapsed
3852,4,6566020,4105021223,2022-10-12 11:38:34,2022-10-12 12:24:44,1,46.17


In [59]:
# Can't find 1244740981 in this list.
# operator_4.route_dir_identifier.unique().tolist()

In [60]:
# Total route ids using longest_shape/trips/routelines.
routelines_final.loc[routelines_final.calitp_itp_id == 4][["route_id"]].nunique()

route_id    129
dtype: int64

In [61]:
m2.route_id.nunique()

127

In [62]:
merged_routeid = set(m2.route_id.unique().tolist())

In [63]:
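# Limit to operator 4 so the comparison matches the route_id count above.
routelines_routeid = set(
    routelines_final.loc[routelines_final.calitp_itp_id == 4].route_id.unique().tolist()
)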

In [64]:
merged_routeid - routelines_routeid

set()

In [65]:
# routelines_routeid - merged_routeid

### Ask
Github
* For each operator, what's the % of RT trip_ids that would remain after those thresholds are used? Make a chart function that takes a single operator. Produce charts for all operators. Is it the time or the geographic coverage that's driving the exclusion of trips? What is a recommended threshold to use?
* For short trips, do they tend to be 50% of the longest route length? 40%? 30%?

Meeting
* Filter out for trips that provide useful information before attaching segments to it. 
* How many shape IDs for that route are usable?
* What's the typical threshold of the actual length of the route versus the longest length we have on record?
* Example: How many unique 10-minute trip IDs and segments will remain after filtering out the ones that don't provide insights?
* % of segments that actually show up reflects how much of a trip was recorded in GTFS. 
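
The "% of trip_ids that would remain" question can be sketched as a small helper. `pct_trips_remaining` is a hypothetical name, the threshold values are placeholders rather than recommendations, and the toy frame only borrows the `segment_proportion` and `minutes_elapsed` column names from `m2`.

```python
import pandas as pd


def pct_trips_remaining(
    df: pd.DataFrame, min_segment_pct: int, min_minutes: float
) -> float:
    """Percent of unique trip_ids that pass both thresholds."""
    total = df["trip_id"].nunique()
    if total == 0:
        return 0.0
    keep = df.loc[
        (df["segment_proportion"] >= min_segment_pct)
        & (df["minutes_elapsed"] >= min_minutes)
    ]
    return keep["trip_id"].nunique() / total * 100


toy = pd.DataFrame(
    {
        "trip_id": ["a", "b", "c", "d"],
        "segment_proportion": [100, 94, 5, 100],
        "minutes_elapsed": [45.0, 12.0, 46.0, 3.0],
    }
)

# Try a few placeholder thresholds to see how quickly trips drop out.
for pct in (50, 75, 100):
    print(pct, pct_trips_remaining(toy, min_segment_pct=pct, min_minutes=10))
```

Running the helper with only one threshold active at a time (for example `min_minutes=0`) is one way to tell whether time or segment coverage excludes more trips.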

In [66]:
len(m2)

7217

In [67]:
(m2.proportion_route_length.value_counts() / len(m2) * 100).head(15)

100   72.34
94     2.88
51     2.79
121    1.76
33     1.48
111    1.41
48     1.33
83     1.25
206    1.11
507    1.11
77     0.94
81     0.93
204    0.87
147    0.87
466    0.79
Name: proportion_route_length, dtype: float64
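
Several of the values above exceed 100 (for example 121, 206, and 507), which means some shapes are longer than the length recorded for the "longest" shape they are compared against. A small sketch for isolating those rows, using made-up lengths:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "shape_id": ["s1", "s2", "s3"],
        "longest_shape_id": ["s1", "s1", "s1"],
        "actual_route_length": [10_000.0, 5_000.0, 12_500.0],
        "longest_route_length": [10_000.0, 10_000.0, 10_000.0],
    }
)
toy["proportion_route_length"] = (
    toy["actual_route_length"] / toy["longest_route_length"] * 100
).astype(int)

# Shapes that come out longer than the supposed longest shape.
over_100 = toy.loc[toy["proportion_route_length"] > 100]
print(over_100[["shape_id", "longest_shape_id", "proportion_route_length"]])
```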

In [68]:
(m2.segment_proportion.value_counts() / len(m2) * 100).head(15)

100   82.74
94     3.73
54     2.00
90     1.61
50     1.57
95     1.30
93     0.53
80     0.53
91     0.48
88     0.48
87     0.29
73     0.26
96     0.25
75     0.24
86     0.22
Name: segment_proportion, dtype: float64

In [69]:
m2.minutes_elapsed.describe()

count   7217.00
mean      91.35
std      235.41
min        0.17
25%       37.18
50%       52.32
75%       66.35
max     1531.98
Name: minutes_elapsed, dtype: float64

In [70]:
p25_time = m2.minutes_elapsed.quantile(0.25).astype(int)
p50_time = m2.minutes_elapsed.quantile(0.50).astype(int)
p75_time = m2.minutes_elapsed.quantile(0.75).astype(int)

In [71]:
def trip_duration(row):
    if (row.minutes_elapsed > 0) and (row.minutes_elapsed <= p25_time):
        return f"Short Trip <= {p25_time} min"
    elif (row.minutes_elapsed > p25_time) and (row.minutes_elapsed <= p75_time):
        return f"Medium Trip <= {p75_time} min"
    else:
        return f"Long Trip > {p75_time} min"

In [72]:
m2["trip_duration_categories"] = m2.apply(trip_duration, axis=1)

In [73]:
m2.trip_duration_categories.value_counts()

Medium Trip <= 66 min    3595
Long Trip > 66 min       1833
Short Trip <= 37 min     1789
Name: trip_duration_categories, dtype: int64
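
As an alternative to the row-wise `apply`, the same kind of binning can be sketched with `pd.cut`. The cut points below are hard-coded placeholders. Unlike `trip_duration`, values at or below 0 (or missing) come out as `NaN` here instead of falling into the long-trip bucket.

```python
import pandas as pd

minutes = pd.Series([5.0, 37.0, 40.0, 66.0, 90.0], name="minutes_elapsed")

# Placeholder cut points; the notebook derives its own from quantiles.
p25, p75 = 37, 66

# Bins are right-inclusive by default: (0, p25], (p25, p75], (p75, inf].
categories = pd.cut(
    minutes,
    bins=[0, p25, p75, float("inf")],
    labels=[
        f"Short Trip <= {p25} min",
        f"Medium Trip <= {p75} min",
        f"Long Trip > {p75} min",
    ],
)
print(categories.tolist())
```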

In [74]:
test = m2.loc[m2.segment_proportion < 100][["segment_proportion"]]

In [75]:
test.describe()

Unnamed: 0,segment_proportion
count,1246.0
mean,74.98
std,22.53
min,3.0
25%,54.0
50%,88.0
75%,94.0
max,98.0


In [76]:
p25_length = test.segment_proportion.quantile(0.25).astype(int)
p75_length = test.segment_proportion.quantile(0.75).astype(int)

* Flag what's usable.
* Need two aggregations: one for trips that are usable, one for shape_ids.

In [77]:
def shape_id_comparison(row):
    # Buckets use the 25th and 75th percentiles of trips that are missing segments.
    if (row.segment_proportion > 0) and (row.segment_proportion <= p25_length):
        # The leading space is kept because `usable()` matches this exact string.
        return f" <={p25_length}% of segments appear"
    elif (row.segment_proportion > p25_length) and (
        row.segment_proportion <= p75_length
    ):
        return f"<= {p75_length}% of segments appear"
    else:
        # Strictly greater than p75; values equal to p75 land in the bucket above.
        return f"> {p75_length}% of segments appear"

In [78]:
m2["shapeid_vs_longest_shapeid_length"] = m2.apply(shape_id_comparison, axis=1)

In [79]:
m2.shapeid_vs_longest_shapeid_length.value_counts()

> 94% of segments appear     6093
<= 94% of segments appear     740
 <=54% of segments appear     384
Name: shapeid_vs_longest_shapeid_length, dtype: int64

In [80]:
m2.loc[m2.trip_id == "6566020"]

Unnamed: 0,calitp_itp_id,trip_id,route_dir_identifier,number_of_segments,minutes_elapsed,route_id,shape_id,longest_shape_id,proportion_route_length,segment_sequence,segment_proportion,trip_duration_categories,shapeid_vs_longest_shapeid_length
4223,4,6566020,4105021223,1,46.17,1T,shp-1T-04,shp-1T-04,100,17,5,Medium Trip <= 66 min,<=54% of segments appear
4224,4,6566020,4105021223,1,46.17,1T,shp-1T-05,shp-1T-04,94,17,5,Medium Trip <= 66 min,<=54% of segments appear
4225,4,6566020,4105021223,1,46.17,1T,shp-1T-06,shp-1T-04,94,17,5,Medium Trip <= 66 min,<=54% of segments appear


In [81]:
len(m2), len(m2.drop_duplicates())

(7217, 7217)

##### How to incorporate time element?
* Same route_dir_identifier falls into a few different categories? Shouldn't they all be around the same duration in terms of minutes?
* How could the time vary so drastically when the # of segments match up?

In [82]:
m2.loc[m2.route_dir_identifier == 2184919314].minutes_elapsed.describe()

count     50.00
mean     104.93
std      277.94
min       29.53
25%       49.58
50%       49.99
75%       50.23
max     1459.13
Name: minutes_elapsed, dtype: float64

In [83]:
m2.loc[m2.route_dir_identifier == 2184919314][
    ["trip_id", "minutes_elapsed", "trip_duration_categories"]
].head(10)

Unnamed: 0,trip_id,minutes_elapsed,trip_duration_categories
465,10006020,50.23,Medium Trip <= 66 min
466,10072020,49.08,Medium Trip <= 66 min
467,10218020,50.1,Medium Trip <= 66 min
468,10474020,49.68,Medium Trip <= 66 min
469,1069020,49.72,Medium Trip <= 66 min
470,1087020,49.98,Medium Trip <= 66 min
471,11394020,36.1,Short Trip <= 37 min
472,11479020,49.88,Medium Trip <= 66 min
473,12259020,55.05,Medium Trip <= 66 min
474,12460020,49.98,Medium Trip <= 66 min


In [None]:
# `usable_y_n` must exist before grouping on it; running the groupby first
# raised KeyError: 'usable_y_n'.
def usable(row):
    if row.shapeid_vs_longest_shapeid_length == (
        f" <={p25_length}% of segments appear"
    ):
        return "Unusable"
    else:
        return "Usable"

In [None]:
m2["usable_y_n"] = m2.apply(usable, axis=1)

In [None]:
m2.groupby(
    [
        "route_id",
        "trip_duration_categories",
        "route_dir_identifier",
        "usable_y_n",
    ]
).agg({"trip_id": "count"}).head(10)

### Already Answered Notes/Questions
* What is the calitp url number? What does 0 or 1 mean? In V1, an operator can have different feeds.
    * 0 could be primary, 1 is backup. This column will be deleted in V2. 
* Do you think that most shape IDs are going to be less than 100% of the length of the longest shape ID?
    * Not necessarily, shape ID can be a short version of the trip.
* What’s the difference between direction ID and route dir identifier? What does the 0 and 1 mean in direction ID?
    * We don't know where the bus is going, so just do 0 and 1.
    * Route dir identifier: captures route info and direction it is going to capture all the trips. Helps with groupby. 
    * We don't want to stick with trip id, we need to get to route level. 
    * Don't want to lose info on the direction. 
    * Have to distinguish direction or else it'll look like the bus is going backwards when plotting.
    * RT data comes with direction id and can get which direction it ran in from schedule data. 
    * Attach route, join coordinate data to segments. 
    * Use segments and average out trips that occurred on that segment. 
* Ask about graph on Slack. 
* Should I use this `get_routelines` from `A1_vehicle_positions`?
    * Just read it directly from GCS, don't need buffer.
* Why would the same route ID for the other direction have more segments?
    * Can have a layover.
    * A segment must be 1000 meters or less.
* The `route_dir_identifier` is used to cut segments for both directions the route runs.

* How come there are so many different timestamps within 30 seconds of each other in the same segment? GTFS pings every 30 seconds.
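
To look at ping spacing directly, the gap between consecutive timestamps can be computed per trip. This is a minimal sketch; the four timestamps are copied from the `operator_4` rows shown earlier for trip 6566020, and `seconds_since_last_ping` is a made-up column name.

```python
import pandas as pd

vp = pd.DataFrame(
    {
        "trip_id": ["6566020"] * 4,
        "vehicle_timestamp": pd.to_datetime(
            [
                "2022-10-12 11:38:34",
                "2022-10-12 11:39:04",
                "2022-10-12 11:39:34",
                "2022-10-12 11:39:50",
            ]
        ),
    }
).sort_values(["trip_id", "vehicle_timestamp"])

# Seconds since the previous ping of the same trip (NaN for the first ping).
vp["seconds_since_last_ping"] = (
    vp.groupby("trip_id")["vehicle_timestamp"].diff().dt.total_seconds()
)
print(vp)
```

In this sample, most gaps are 30 seconds, but the last one is 16 seconds, so the spacing is not perfectly regular.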