## RT trip diagnostics: thresholds for usable trips 
### Other Questions
* Should thresholds be on the operator or the operator-route ID level?
* How to figure out whether a segment is acceptable or not?
* Is the `proportion_route_length` tied with usable segments?
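
One way to explore the first question is to compute the same percentile at both levels and compare. The sketch below uses made-up values; only the column names come from this notebook, and the 25th percentile is just an illustrative choice of threshold.

```python
import pandas as pd

# Toy data: column names mirror this notebook, values are hypothetical.
toy = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4, 4, 4, 4],
        "route_id": ["U", "U", "U", "212", "212", "212"],
        "minutes_elapsed": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    }
)

# One threshold per operator.
operator_p25 = toy.groupby("calitp_itp_id").minutes_elapsed.quantile(0.25)

# One threshold per operator-route.
route_p25 = toy.groupby(["calitp_itp_id", "route_id"]).minutes_elapsed.quantile(0.25)

print(operator_p25)
print(route_p25)
```

If the route-level thresholds differ a lot from the operator-level one, a single operator threshold would drop most trips on short routes and few on long ones.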

In [76]:
# Charts
import altair as alt
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd
from calitp.sql import to_snakecase
from shared_utils import calitp_color_palette as cp
from shared_utils import geography_utils, styleguide, utils

In [77]:
# Save files to GCS
from calitp.storage import get_fs
fs = get_fs()

In [78]:
# Record start and end time
import datetime
from loguru import logger

In [79]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Load Files

In [80]:
GCS_DASK_PATH = "gs://calitp-analytics-data/data-analyses/dask_test/"
GCS_RT_PATH = "gs://calitp-analytics-data/data-analyses/rt_delay/"

In [81]:
analysis_date = "2022-10-12"

In [82]:
# Tells me actual route length for each shape id.
routelines = gpd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/routelines_{analysis_date}.parquet"
)

In [83]:
# len(routelines), routelines.shape_id.nunique()

In [84]:
# RT data: read in trips.
# Gives the trips run on a particular day across all operators.
trips = pd.read_parquet(
    f"{GCS_RT_PATH}compiled_cached_views/trips_{analysis_date}.parquet"
)

In [85]:
# len(trips)

In [86]:
# Read in longest_shape of each route
# Schedule data, source of truth.
longest_shape = gpd.read_parquet(f"{GCS_DASK_PATH}longest_shape_segments.parquet")

In [87]:
# longest_shape.groupby(['calitp_itp_id','route_id','longest_shape_id']).agg({'segment_sequence':'nunique'}).head()

In [88]:
# longest_shape.sort_values(['calitp_itp_id', 'route_id']).head(25).drop(columns=["geometry", "geometry_arrowized"])

In [89]:
crosswalk = pd.read_parquet(
    f"{GCS_DASK_PATH}segments_route_direction_crosswalk.parquet"
)

In [90]:
# Read with pandas.read_parquet rather than geopandas.
operator_4 = pd.read_parquet(
    f"{GCS_DASK_PATH}vp_sjoin/vp_segment_4_{analysis_date}.parquet"
)

### Task 1
* Using GTFS schedule data, by route_id-shape_id, calculate the route_length of each shape_id as a proportion of the longest shape_id's length.
* For <b>each route_id</b>, what is the shortest shape_id's length as a proportion of the longest shape_id's length? If it's 100%, then all shape_ids are equal in length for that route. If it's 50%, there is a short trip that runs only 50% of the length and turns around.

<b>How</b>
* Need the `trips` table from the compiled cached views -> shape ID, route ID, and direction ID -> merge in the segments crosswalk to get the route direction identifier.
* Shapes table -> attach the route direction identifier.
* Merge in the longest shape line on route and direction, then take the fraction.
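
The calculation described in Task 1 can be sketched on toy data as follows. The shape_ids and lengths are hypothetical, and the longest length is derived here with a grouped `max` as a stand-in for the `longest_shape` file that the notebook merges in.

```python
import pandas as pd

# Hypothetical shape lengths in meters; shape_ids are made up.
shapes = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "route_id": ["U", "U", "212"],
        "direction_id": [1, 1, 0],
        "shape_id": ["full", "short", "only"],
        "actual_route_length": [30000.0, 15000.0, 12000.0],
    }
)

group_cols = ["calitp_itp_id", "route_id", "direction_id"]

# Longest shape length within each route-direction (stand-in for `longest_shape`).
shapes["longest_route_length"] = shapes.groupby(group_cols)[
    "actual_route_length"
].transform("max")

# Each shape's length as a percentage of the longest shape's length.
shapes["proportion_route_length"] = (
    shapes.actual_route_length / shapes.longest_route_length * 100
)

# Shortest shape per route, as a percentage of the longest.
shortest_by_route = (
    shapes.groupby(["calitp_itp_id", "route_id"])
    .proportion_route_length.min()
    .reset_index()
)
print(shortest_by_route)
```

In this toy example route U comes out at 50% (a short-turn shape exists) and route 212 at 100% (a single shape).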

#### Step 1. Merge `trips` with `crosswalk`
##### Help: Why do we drop `trip_id` from `trips`?

In [91]:
# Subset
trips2 = trips[
    [
        "calitp_itp_id",
        "route_id",
        "direction_id",
        "shape_id",
    ]
]

In [92]:
len(trips2), len(crosswalk)

(120136, 5150)

In [93]:
trips2.head(2)

| | calitp_itp_id | route_id | direction_id | shape_id |
|---|---|---|---|---|
| 0 | 4 | U | 1 | shp-U-06 |
| 1 | 4 | U | 1 | shp-U-06 |


In [94]:
crosswalk.head(2)

| | calitp_itp_id | route_id | direction_id | route_dir_identifier |
|---|---|---|---|---|
| 0 | 372 | 4ba918e5-58c0-4d4a-9f55-5cadb8564bff | 0 | 255544 |
| 1 | 293 | 7 | 0 | 1269889 |


In [95]:
trips2 = (trips2.drop_duplicates()).reset_index(drop=True)

* 366 rows in `trips` have no match in `crosswalk` (and 74 rows in `crosswalk` have no match in `trips`), even though `calitp_itp_id.nunique()` yields the same number for both.

In [96]:
trips2.merge(
    crosswalk,
    how="outer",
    on=["calitp_itp_id", "route_id", "direction_id"],
    indicator=True,
)[["_merge"]].value_counts()

_merge    
both          7833
left_only      366
right_only      74
dtype: int64

In [97]:
trips_m_crosswalk = trips2.merge(
    crosswalk, how="inner", on=["calitp_itp_id", "route_id", "direction_id"]
)

In [98]:
trips_m_crosswalk.head()

| | calitp_itp_id | route_id | direction_id | shape_id | route_dir_identifier |
|---|---|---|---|---|---|
| 0 | 4 | U | 1 | shp-U-06 | 1244740981 |
| 1 | 4 | U | 0 | shp-U-07 | 1026952675 |
| 2 | 4 | 212 | 1 | shp-212-07 | 1369834141 |
| 3 | 4 | 212 | 0 | shp-212-57 | 648098315 |
| 4 | 4 | 67 | 0 | shp-67-57 | 3358964048 |


#### Step 2. Shapes table -> attach route dir identifier 
* Drop `calitp_url_number` from `routelines`, then drop the duplicate rows that remain.

In [99]:
routelines.crs == longest_shape.crs

True

In [139]:
routelines.crs

<Derived Projected CRS: EPSG:3310>
Name: NAD83 / California Albers
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: United States (USA) - California.
- bounds: (-124.45, 32.53, -114.12, 42.01)
Coordinate Operation:
- name: California Albers
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [100]:
# Drop calitp_url_number since it's no longer needed
routelines = routelines.drop(columns=["calitp_url_number"])

In [101]:
routelines = (routelines.drop_duplicates()).reset_index(drop=True)

In [102]:
len(routelines)

8133

In [103]:
# Calculate length of geometry
routelines = routelines.assign(
    actual_route_length=(
        routelines.geometry.to_crs(geography_utils.CA_NAD83Albers).length
    )
)

In [106]:
routelines.merge(
    trips_m_crosswalk, how="outer", on=["calitp_itp_id", "shape_id"], indicator=True
)[["_merge"]].value_counts()

_merge    
both          8529
left_only      363
right_only       0
dtype: int64

In [107]:
routelines_m_trips = routelines.merge(
    trips_m_crosswalk,
    how="inner",
    on=["calitp_itp_id", "shape_id"],
)

In [108]:
len(routelines_m_trips), len(trips_m_crosswalk), len(routelines)

(8529, 7833, 8133)
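
The merged frame has more rows than either input. One plausible cause, not confirmed here, is that some `shape_id` values appear under more than one route-direction in `trips_m_crosswalk`, so a single `routelines` row matches several rows. A minimal sketch of how to check for that on toy data (all values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4],
        "shape_id": ["a", "b"],
        "actual_route_length": [100.0, 200.0],
    }
)
right = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "shape_id": ["a", "a", "b"],
        "route_id": ["1", "2", "3"],
    }
)
keys = ["calitp_itp_id", "shape_id"]

# Rows on the right side whose key appears more than once.
dup_keys = right[right.duplicated(subset=keys, keep=False)]

# The plain merge silently repeats the left row for each matching right row.
merged = left.merge(right, how="inner", on=keys)

# `validate` makes pandas raise instead of silently multiplying rows.
try:
    left.merge(right, how="inner", on=keys, validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False

print(len(left), len(right), len(merged), one_to_one)
```

Running the `duplicated` check on the real `trips_m_crosswalk` would show whether shared shape_ids explain the extra rows.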

In [109]:
routelines_m_trips.loc[routelines_m_trips.route_id == "U"].drop(columns=["geometry"])

| | calitp_itp_id | shape_id | actual_route_length | route_id | direction_id | route_dir_identifier |
|---|---|---|---|---|---|---|
| 1 | 4 | shp-U-07 | 34049.63 | U | 0 | 1026952675 |
| 111 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 |


#### Step 3. Merge in longest shape line on routes and direction.
* Which geometry to keep?

In [122]:
longest_shape = longest_shape.rename(columns={"route_length": "longest_route_length"})

In [121]:
#route_u.explore("segment_sequence", cmap = "tab10",
#                style_kwds = {'weight': 10}, legend = False, height = 400, width = 800)

In [125]:
routelines_final = routelines_m_trips.merge(
    longest_shape,
    how="inner",
    on=["calitp_itp_id", "direction_id", "route_id", "route_dir_identifier"],
)

In [135]:
# Calculate each shape's length as a percentage of the longest shape's length.
routelines_final["proportion_route_length"] = ((
    routelines_final["actual_route_length"] / routelines_final["longest_route_length"]
) * 100).astype(int)

In [141]:
len(routelines_final.loc[routelines_final.proportion_route_length > 100])

31835

In [138]:
# routelines_final.proportion_route_length.value_counts()

In [132]:
routelines_final.loc[routelines_final.route_dir_identifier == 1244740981].drop(
    columns=["geometry_x","geometry_y","geometry_arrowized"]
).head()

| | calitp_itp_id | shape_id | actual_route_length | route_id | direction_id | route_dir_identifier | calitp_url_number | longest_shape_id | longest_route_length | segment_sequence | proportion_route_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2170 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 | 0 | shp-U-06 | 34070.9 | 0 | 100.0 |
| 2171 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 | 0 | shp-U-06 | 34070.9 | 1 | 100.0 |
| 2172 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 | 0 | shp-U-06 | 34070.9 | 2 | 100.0 |
| 2173 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 | 0 | shp-U-06 | 34070.9 | 3 | 100.0 |
| 2174 | 4 | shp-U-06 | 34070.9 | U | 1 | 1244740981 | 0 | shp-U-06 | 34070.9 | 4 | 100.0 |


In [133]:
len(routelines_final), len(routelines_final.drop_duplicates())

(219200, 219200)
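
Because `longest_shape` has one row per `segment_sequence`, `routelines_final` repeats each shape once per segment, so row counts such as the number of rows over 100% are counts of segments rather than shapes. A small sketch, on made-up data, of collapsing to one row per shape before counting:

```python
import pandas as pd

# Toy per-segment frame shaped like `routelines_final`; values are hypothetical.
per_segment = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4, 4, 4],
        "route_dir_identifier": [1, 1, 1, 2, 2],
        "shape_id": ["a", "a", "a", "b", "b"],
        "segment_sequence": [0, 1, 2, 0, 1],
        "proportion_route_length": [100, 100, 100, 120, 120],
    }
)

shape_cols = ["calitp_itp_id", "route_dir_identifier", "shape_id"]

# Keep one row per shape; the proportion is constant within a shape.
per_shape = per_segment.drop_duplicates(subset=shape_cols)

rows_over_100 = int((per_segment.proportion_route_length > 100).sum())
shapes_over_100 = int((per_shape.proportion_route_length > 100).sum())

print(rows_over_100, shapes_over_100)
```

Here two segment rows exceed 100% but they belong to a single shape.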

### Task 2
* Testing with Agency 4. 
* Calculate time of trips?


In [None]:
def find_operator_info(df):
    df = df.sort_values(["calitp_itp_id", "trip_id", "segment_sequence"])

    merge_cols = [
        "calitp_itp_id",
        "trip_id",
        "route_dir_identifier",
    ]

    # Get start time.
    start_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "min"})
        .rename(columns={"vehicle_timestamp": "start"})
        .reset_index()
    )

    # Get end time.
    end_time_trip = (
        df.groupby(merge_cols)
        .agg({"vehicle_timestamp": "max"})
        .rename(columns={"vehicle_timestamp": "end"})
        .reset_index()
    )

    # Count number of segments.
    segment_counts = (
        df.groupby(merge_cols)
        .agg({"segment_sequence": "nunique"})
        .reset_index()
        .rename(columns={"segment_sequence": "number_of_segments"})
    )

    # Merge
    m1 = start_time_trip.merge(end_time_trip, how="inner", on=merge_cols).merge(
        segment_counts, how="left", on=merge_cols
    )

    # Calculate time elapsed
    # https://stackoverflow.com/questions/51491724/calculate-difference-of-2-dates-in-minutes-in-pandas
    m1["minutes_elapsed"] = (m1.end - m1.start).dt.total_seconds() / 60

    return m1
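
The same three metrics can also be computed in a single pass with pandas named aggregation, which avoids the two merges. This is an alternative sketch rather than the notebook's method; `summarize_trips` is a hypothetical name and the vehicle positions below are made up.

```python
import pandas as pd


def summarize_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Start, end, segment count, and elapsed minutes per trip."""
    group_cols = ["calitp_itp_id", "trip_id", "route_dir_identifier"]
    out = (
        df.groupby(group_cols)
        .agg(
            start=("vehicle_timestamp", "min"),
            end=("vehicle_timestamp", "max"),
            number_of_segments=("segment_sequence", "nunique"),
        )
        .reset_index()
    )
    out["minutes_elapsed"] = (out["end"] - out["start"]).dt.total_seconds() / 60
    return out


# Hypothetical vehicle positions for one trip.
toy_vp = pd.DataFrame(
    {
        "calitp_itp_id": [4, 4, 4],
        "trip_id": ["t1", "t1", "t1"],
        "route_dir_identifier": [1, 1, 1],
        "segment_sequence": [0, 0, 1],
        "vehicle_timestamp": pd.to_datetime(
            ["2022-10-12 08:00:00", "2022-10-12 08:00:30", "2022-10-12 08:10:00"]
        ),
    }
)

toy_metrics = summarize_trips(toy_vp)
print(toy_metrics)
```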

In [None]:
operator_4.head(2)

In [None]:
operator_4_metrics = find_operator_info(operator_4)

In [None]:
operator_4_metrics.head(2)

In [None]:
# Merge
m2 = operator_4_metrics[
    [
        "calitp_itp_id",
        "trip_id",
        "route_dir_identifier",
        "number_of_segments",
        "minutes_elapsed",
    ]
].merge(
    routelines_final,
    how="inner",
    on=["calitp_itp_id", "route_dir_identifier"],
)

In [None]:
len(operator_4_metrics), len(m2)

In [None]:
# Drop some columns for now to check out (geometry columns were suffixed in the merge)
m2 = m2.drop(
    columns=["geometry_x", "geometry_y", "geometry_arrowized", "actual_route_length", "longest_route_length"]
)

In [None]:
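# Find the total number of segments in the specific operator file
# vs. what was recorded in `longest_shape`.
# Assumption: `total_segments` is the segment count per route_dir_identifier
# in `longest_shape`, which `m2` carries as `segment_sequence`.
m2["total_segments"] = m2.groupby("route_dir_identifier").segment_sequence.transform(
    "nunique"
)
m2["segment_proportion"] = ((m2.number_of_segments / m2.total_segments) * 100).astype(
    "int64"
)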

In [None]:
reorg_cols = [
    "calitp_itp_id",
    "route_id",
    "route_dir_identifier",
    "direction_id",
    "trip_id",
    "shape_id",
    "longest_shape_id",
    "number_of_segments",
    "total_segments",
    "segment_proportion",
    "proportion_route_length",
    "minutes_elapsed",
]

In [None]:
m2 = m2[reorg_cols]

In [None]:
m2.sort_values(["route_id", "shape_id", "minutes_elapsed"]).tail(10)

##### Help: Why is `route_dir_identifier` 1244740981 not yielding any results, even in the original dataframe?
* Filtering `routelines_final` to ITP ID 4 gives 2 more route IDs than `vp_sjoin/vp_segment_4`.
* Wondering why that is.
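
An anti-join is one way to list exactly which identifiers are in the schedule-based table but absent from the vehicle positions. The sketch below uses toy frames: 1244740981 is the identifier in question, and which other identifiers are present is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `routelines_final` (operator 4) and `operator_4`.
schedule_ids = pd.DataFrame(
    {"route_dir_identifier": [1244740981, 1026952675, 648098315]}
)
vp_ids = pd.DataFrame({"route_dir_identifier": [1026952675, 648098315]})

# Left merge with an indicator, then keep rows found only on the schedule side.
check = schedule_ids.drop_duplicates().merge(
    vp_ids.drop_duplicates(),
    how="left",
    on="route_dir_identifier",
    indicator=True,
)
missing_from_vp = check.loc[
    check["_merge"] == "left_only", "route_dir_identifier"
].tolist()

print(missing_from_vp)
```

The same pattern works on `route_id` and returns a dataframe that can be inspected, unlike a bare set difference.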

In [None]:
m2.loc[m2.route_id == "U"]

In [None]:
operator_4.loc[operator_4.route_dir_identifier == 1244740981]

In [None]:
# Can't find 1244740981 in this list.
# operator_4.route_dir_identifier.unique().tolist()

In [None]:
# Total route ids using longest_shape/trips/routelines.
routelines_final.loc[routelines_final.calitp_itp_id == 4][["route_id"]].nunique()

In [None]:
m2.route_id.nunique()

In [None]:
merged_routeid = set(m2.route_id.unique().tolist())

In [None]:
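# Compare against operator 4 only, since `m2` contains only operator 4.
routelines_routeid = set(
    routelines_final.loc[routelines_final.calitp_itp_id == 4].route_id.unique().tolist()
)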

In [None]:
merged_routeid - routelines_routeid

In [None]:
# routelines_routeid - merged_routeid

### Ask
Github
* For each operator, what's the % of RT trip_ids that would remain after those thresholds are used? Make a chart function that takes a single operator. Produce charts for all operators. Is it time or geographic coverage that's driving the exclusion of trips? What is a recommended threshold to use?
* For short trips, do they tend to be 50% of the longest route length? 40%? 30%?

Meeting
* Filter out for trips that provide useful information before attaching segments to it. 
* How many shape IDs for that route are usable?
* What's the typical threshold of the actual length of the route versus the longest length we have on record?
* Example: How many unique 10-minute trip IDs and segments will remain after filtering out the ones that don't provide insights?
* % of segments that actually show up reflects how much of a trip was recorded in GTFS. 
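
A minimal sketch of the "what % of trip_ids remain" question, assuming a trip-level table with `minutes_elapsed` and `segment_proportion` columns. The function name `share_of_trips_kept` and the 10-minute and 50% thresholds are placeholders, not recommendations.

```python
import pandas as pd


def share_of_trips_kept(
    df: pd.DataFrame, min_minutes: float, min_segment_pct: float
) -> float:
    """Percent of unique trip_ids that meet both thresholds."""
    total = df.trip_id.nunique()
    kept = df.loc[
        (df.minutes_elapsed >= min_minutes)
        & (df.segment_proportion >= min_segment_pct)
    ].trip_id.nunique()
    return kept / total * 100


# Hypothetical trips for one operator.
toy_trips = pd.DataFrame(
    {
        "trip_id": ["t1", "t2", "t3", "t4"],
        "minutes_elapsed": [5.0, 12.0, 30.0, 45.0],
        "segment_proportion": [20, 80, 40, 90],
    }
)

kept_both = share_of_trips_kept(toy_trips, 10, 50)
kept_time_only = share_of_trips_kept(toy_trips, 10, 0)
kept_coverage_only = share_of_trips_kept(toy_trips, 0, 50)

print(kept_both, kept_time_only, kept_coverage_only)
```

Comparing the time-only and coverage-only results gives a rough read on which criterion is driving the exclusions.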

In [None]:
m2.proportion_route_length.describe()

In [None]:
m2.minutes_elapsed.describe()

In [None]:
p25_time = m2.minutes_elapsed.quantile(0.25).astype(int)
p50_time = m2.minutes_elapsed.quantile(0.50).astype(int)
p75_time = m2.minutes_elapsed.quantile(0.75).astype(int)

In [None]:
p25_time, p50_time, p75_time

In [None]:
def trip_duration(row):
    # Trips at or below the 25th percentile, including 0-minute trips, are short.
    if row.minutes_elapsed <= p25_time:
        return f"Short Trip <= {p25_time} min"
    elif row.minutes_elapsed <= p75_time:
        return f"Medium Trip <= {p75_time} min"
    else:
        return f"Long Trip > {p75_time} min"

In [None]:
m2["trip_duration_categories"] = m2.apply(lambda x: trip_duration(x), axis=1)

In [None]:
m2.trip_duration_categories.value_counts()
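
The row-wise `apply` can be slow on large frames. A vectorized alternative is `pd.cut`, sketched here with hypothetical percentile values. Note that with these bins a 0-minute trip lands in the short category.

```python
import pandas as pd

# Hypothetical durations and percentile cutoffs.
minutes = pd.Series([0.0, 5.0, 10.0, 25.0, 40.0, 90.0], name="minutes_elapsed")
p25, p75 = 10, 40

# Bins are right-inclusive: (-inf, p25], (p25, p75], (p75, inf).
categories = pd.cut(
    minutes,
    bins=[float("-inf"), p25, p75, float("inf")],
    labels=[
        f"Short Trip <= {p25} min",
        f"Medium Trip <= {p75} min",
        f"Long Trip > {p75} min",
    ],
)

print(categories.value_counts())
```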

In [None]:
for i in [p25_time, p50_time, p75_time]:
    print(len(m2.loc[m2.minutes_elapsed >= i]))

In [None]:
p25_length = m2.proportion_route_length.quantile(0.25).astype(int)
p75_length = m2.proportion_route_length.quantile(0.75).astype(int)

In [None]:
p25_length, p75_length

* Flag what's usable.
* Need two aggregations: one for trips that are usable, one for shape_ids.
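
The two aggregations could look like the sketch below, assuming a table with one row per trip and a `usable_y_n` flag already attached. The trips and shapes are made up.

```python
import pandas as pd

# Hypothetical trip-level table with a usability flag.
toy = pd.DataFrame(
    {
        "trip_id": ["t1", "t2", "t3", "t4"],
        "shape_id": ["a", "a", "b", "c"],
        "usable_y_n": ["Usable", "Usable", "Unusable", "Usable"],
    }
)

# Aggregation 1: unique trips per flag.
trips_by_flag = toy.groupby("usable_y_n").trip_id.nunique()

# Aggregation 2: unique shape_ids per flag.
shapes_by_flag = toy.groupby("usable_y_n").shape_id.nunique()

print(trips_by_flag)
print(shapes_by_flag)
```

Using `nunique` rather than `count` keeps the totals correct even if the table has repeated rows per trip.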

In [None]:
def shape_id_comparison(row):
    # Values at or below the 25th percentile, including 0, fall in the lowest bin.
    # The leading space in the first label is matched by `usable()`.
    if row.proportion_route_length <= p25_length:
        return f" <= {p25_length}%"
    elif row.proportion_route_length <= p75_length:
        return f"<= {p75_length}%"
    else:
        return f"> {p75_length}%"

In [None]:
m2["shapeid_vs_longest_shapeid_length"] = m2.apply(
    lambda x: shape_id_comparison(x), axis=1
)

In [None]:
def usable(row):
    if row.shapeid_vs_longest_shapeid_length == (f" <= {p25_length}%"):
        return "Unusable"
    else:
        return "Usable"

In [None]:
m2["usable_y_n"] = m2.apply(lambda x: usable(x), axis=1)

In [None]:
m2["usable_y_n"].value_counts()

In [None]:
summary = (
    m2.groupby(["trip_duration_categories", "usable_y_n"])
    .agg({"total_segments": "sum", "trip_id": "count"})
    .rename(columns={"total_segments": "total_segments", "trip_id": "total_trips"})
    .reset_index()
)

In [None]:
grand_total = (
    m2.groupby(["trip_duration_categories"])
    .agg({"total_segments": "sum", "trip_id": "count"})
    .rename(
        columns={
            "total_segments": "grand_total_segments",
            "trip_id": "grand_total_trips",
        }
    )
    .reset_index()
)

In [None]:
summary_m = summary.merge(grand_total, on=["trip_duration_categories"])

In [None]:
summary_m = summary_m.assign(
    percent_usable_segments=summary_m.total_segments
    / summary_m.grand_total_segments
    * 100,
    percent_usable_trips=summary_m.total_trips / summary_m.grand_total_trips * 100,
)

In [None]:
summary_m

In [None]:
def chart_with_dropdown(
    df,
    dropdown_list: list,
    dropdown_field: str,
    x_axis_chart1: str,
    y_axis_chart1: str,
    color_col1: str,
    chart1_tooltip_cols: list,
    chart_title: str,
):
    """A bar chart controlled by a dropdown filter.
    Args:
        df: the dataframe
        dropdown_list(list): a list of all the values in the dropdown menu,
        dropdown_field(str): column where the dropdown menu's values are drawn from,
        x_axis_chart1(str): x axis value for chart 1 - encode as Q or N,
        y_axis_chart1(str): y axis value for chart 1 - encode as Q or N,
        color_col1(str): column to color the graphs for chart 1,
        chart1_tooltip_cols(list): list of all the columns to populate the tooltip,
        chart_title(str):chart title,
    """
    # Create drop down menu
    input_dropdown = alt.binding_select(options=dropdown_list, name="Select ")

    # The column tied to the drop down menu
    selection = alt.selection_single(fields=[dropdown_field], bind=input_dropdown)

    chart1 = (
        alt.Chart(df)
        .mark_bar()
        .encode(
            x=x_axis_chart1,
            y=(y_axis_chart1),
            color=alt.Color(
                color_col1,
                scale=alt.Scale(range=cp.CALITP_CATEGORY_BRIGHT_COLORS),
                legend=None,
            ),
            tooltip=chart1_tooltip_cols,
        )
        .properties(title=chart_title)
        .add_selection(selection)
        .transform_filter(selection)
    )

    chart1 = styleguide.preset_chart_config(chart1)

    return chart1

### Already Answered Notes/Questions
* What is the `calitp_url_number`? What does 0 or 1 mean? In V1, an operator can have different feeds.
    * 0 could be primary, 1 is backup. This column will be deleted in V2.
* Do you think that most shape IDS are going to be less than 100% of the length of the longest shape ID? 
    * Not necessarily, shape ID can be a short version of the trip.
* What’s the difference between direction ID and route dir identifier? What do the 0 and 1 mean in direction ID?
    * We don't know where the bus is going, so just do 0 and 1.
    * Route dir identifier: captures route info and direction it is going to capture all the trips. Helps with groupby. 
    * We don't want to stick with trip id, we need to get to route level. 
    * Don't want to lose info on the direction. 
    * Have to distinguish direction or else it'll look like the bus is going backwards when plotting.
    * RT data comes with direction id and can get which direction it ran in from schedule data. 
    * Attach route, join coordinate data to segments. 
    * Use segments and average out trips that occurred on that segment. 
* Ask about graph on Slack. 
* Should I use `get_routelines` from `A1_vehicle_positions`?
    * Just read it directly from GCS, don't need buffer.
* Why would the same route ID for the other direction have more segments? 
   * Can have a layover. 
   * A segment must be 1000 meters or less.
* The `route_dir_identifier` is used to cut segments for both directions the route runs.

* How come there are so many different timestamps within 30 seconds of each other in the same segment? GTFS pings every 30 seconds.
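
To look into the timestamp question, one option is to measure the gap between consecutive pings for each trip and flag the ones under 30 seconds. This sketch only shows how to measure the gaps; it does not explain their cause, and the pings below are made up.

```python
import pandas as pd

# Hypothetical pings for a single trip.
pings = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t1", "t1"],
        "segment_sequence": [0, 0, 0, 1],
        "vehicle_timestamp": pd.to_datetime(
            [
                "2022-10-12 08:00:00",
                "2022-10-12 08:00:10",
                "2022-10-12 08:00:30",
                "2022-10-12 08:01:00",
            ]
        ),
    }
).sort_values(["trip_id", "vehicle_timestamp"])

# Seconds since the previous ping of the same trip (NaN for the first ping).
pings["seconds_since_prev"] = (
    pings.groupby("trip_id").vehicle_timestamp.diff().dt.total_seconds()
)

# Pings that arrived less than 30 seconds after the previous one.
short_gaps = pings.loc[pings.seconds_since_prev < 30]

print(short_gaps)
```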