## RT trip diagnostics: thresholds for usable trips 
### Other Questions
* Should thresholds be on the operator or the operator-route ID level?
* How to figure out whether a segment is acceptable or not?
* Is the `proportion_route_length` tied with usable segments?

In [20]:
import altair as alt
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd
import numpy as np
from calitp.sql import to_snakecase

#from shared_utils import calitp_color_palette as cp
# from shared_utils import geography_utils, styleguide, utils

In [2]:
# Save files to GCS
from calitp.storage import get_fs

fs = get_fs()

In [3]:
# Record start and end time
import datetime

from loguru import logger

In [4]:
import intake

catalog = intake.open_catalog("./catalog_threshold.yml")

In [5]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Load Files

In [6]:
GCS_DASK_PATH = "gs://calitp-analytics-data/data-analyses/dask_test/"
GCS_RT_PATH = "gs://calitp-analytics-data/data-analyses/rt_delay/"

In [7]:
analysis_date = "2022-10-12"

In [8]:
agency = 282

In [9]:
operator = pd.read_parquet(
    f"{GCS_DASK_PATH}vp_sjoin/vp_segment_{agency}_{analysis_date}.parquet"
)

In [10]:
# routelines = catalog.route_lines.read()

In [11]:
# trips = catalog.trips.read()

In [12]:
# longest_shape = catalog.longest_shape.read()

In [13]:
# crosswalk = catalog.crosswalk.read()

### Task 1
* Using GTFS schedule data, by route_id-shape_id, calculate the route_length of each shape_id as a proportion of the longest shape_id. 
* For <b>each route_id</b>, what's the shortest shape_id length, in proportion to the longest shape_id's length. if it's 100%, then all shape_ids are equal length for that route. if it's 50%, there's a short trip that exists that only runs 50% of the length and turns around.

<b>How</b>
* Need table `trips` from compile cached views -> shape ID and route ID and direction ID -> merge in segments crosswalk with route direction identifier 
* Shapes table -> attach route dir identifier
* Merge in longest shape line using  routes and direction take the fraction. 

In [14]:
def clean_trips():
    df = catalog.trips.read()

    subset = [
        "calitp_itp_id",
        "route_id",
        "direction_id",
        "shape_id",
    ]

    df = df[subset]

    df = df.drop_duplicates().reset_index(drop=True)

    return df

In [15]:
def clean_routelines():
    df = catalog.route_lines.read()

    # Drop CalITP since it's no longer needed
    df = df.drop(columns=["calitp_url_number"])

    df = (df.drop_duplicates()).reset_index(drop=True)

    # Calculate length of geometry
    df = df.assign(
        actual_route_length=(df.geometry.to_crs(geography_utils.CA_NAD83Albers).length)
    )

    return df

In [16]:
def clean_longest_shape():
    df = catalog.longest_shape.read()

    df = df.rename(columns={"route_length": "longest_route_length"})

    return df

In [17]:
def merge_trips_routes_longest_shape():
    trips = clean_trips()
    crosswalk = catalog.crosswalk.read()
    routelines = clean_routelines()
    longest_shape = clean_longest_shape()

    m1 = (
        trips.merge(
            crosswalk, how="inner", on=["calitp_itp_id", "route_id", "direction_id"]
        )
        .merge(routelines, how="inner", on=["calitp_itp_id", "shape_id"])
        .merge(
            longest_shape.drop(columns=["geometry"]),
            how="inner",
            on=["calitp_itp_id", "direction_id", "route_id", "route_dir_identifier"],
        )
    )

    # Calculate out proportion of route length against longest.
    m1["proportion_route_length"] = ((m1["actual_route_length"] / m1["longest_route_length"]) * 100).astype(int)

    m1 = (
    m1.groupby(
        [
            "route_id",
            "calitp_itp_id",
            "route_dir_identifier",
            "shape_id",
            "longest_shape_id",
            "proportion_route_length",
        ]
    )
    .agg({"segment_sequence": "count"})
    .rename(columns = {'segment_sequence':'total_segments'})
    .reset_index())
    
    return m1

In [18]:
trips_routes_shape = merge_trips_routes_longest_shape()

NameError: name 'geography_utils' is not defined

### Task 2
* Testing with Agency 4. 
* Calculate time of trips?


In [None]:
def merge_trip_diagnostics_with_total_segments():
    trip_diagnostics = pd.read_parquet("gs://calitp-analytics-data/data-analyses/dask_test/trip_diagnostics_2022-10-12.parquet", 
    )
    
    segments = gpd.read_parquet(
        f"{GCS_DASK_PATH}longest_shape_segments_{analysis_date}.parquet")
    
    total_segments_by_shape = (segments.groupby(
            ["calitp_itp_id", "route_dir_identifier"])
            .segment_sequence.nunique()
            .reset_index()
            .rename(columns = {"segment_sequence": "total_segments"})
           )
    
    df = pd.merge(
        trip_diagnostics,
        total_segments_by_shape,
        on = ["calitp_itp_id", "route_dir_identifier"],
        how = "inner",
        validate = "m:1",
    )
    
    df = df.assign(
        pct_vp_segments = df.num_segments_with_vp.divide(df.total_segments),
        trip_time = (df.trip_end - df.trip_start) / np.timedelta64(1, 's'),
        total_trips = df.groupby("calitp_itp_id").trip_id.transform("nunique"),
    )
    
    return df

In [None]:
operator_282 = merge_trip_diagnostics_with_total_segments()

In [None]:
operator_282 = operator_282.loc[operator_282.calitp_itp_id == 282].reset_index(drop = True)

In [None]:
operator_282.head()

In [None]:
def summary_valid_trips_by_cutoff(
    df, time_cutoffs: list, segment_cutoffs: list): 

    final = pd.DataFrame()

    for t in time_cutoffs:
        for s in segment_cutoffs:
            valid = (df[(df.trip_time >= t) & (df.pct_vp_segments >= s)]
                     .groupby(["calitp_itp_id", "total_trips"])
                     .trip_id.nunique()
                     .reset_index()
                     .rename(columns = {"trip_id": "n_trips"})
                    )

            valid = valid.assign(
                trip_cutoff = t,
                segment_cutoff = s,
                cutoff = f"{t}+ min & {s*100}%+ segments"
            )

            final = pd.concat([final, valid], axis=0)
    
    
    final = final.assign(
        pct_usable_trips = final.n_trips.divide(final.total_trips)
    )
    
    return final

In [None]:
TIME_CUTOFFS = [5, 10, 15]
SEGMENT_CUTOFFS = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75]

In [None]:
valid_stats = summary_valid_trips_by_cutoff(
    operator_282, TIME_CUTOFFS, SEGMENT_CUTOFFS)

In [None]:
valid_stats

In [None]:
# Find the total number of segments in the specific operator file
# vs. what was recorded in `longest_shape`
m2["segment_proportion"] = ((m2.number_of_segments / m2.segment_sequence) * 100).astype(
    "int64"
)

In [None]:
m2.sample()

In [None]:
m2.segment_proportion.value_counts().head()

In [None]:
m2.loc[m2.route_id == "U"]

In [None]:
m2.loc[m2.route_dir_identifier == 4105021223].shape_id.nunique()

In [None]:
m2.loc[m2.route_dir_identifier == 4105021223].longest_shape_id.nunique()

In [None]:
m2.loc[m2.route_dir_identifier == 4105021223].trip_id.nunique()

In [None]:
m2.loc[m2.route_dir_identifier == 4105021223].sample(5)

In [None]:
m2.loc[m2.trip_id == "6566020"]

In [None]:
operator_4.loc[operator_4.trip_id == "6566020"].head()

In [None]:
operator_4_metrics.loc[operator_4_metrics.trip_id == "6566020"]

In [None]:
# Can't find 1244740981 in this list.
# operator_4.route_dir_identifier.unique().tolist()

In [None]:
# Total route ids using longest_shape/trips/routelines.
routelines_final.loc[routelines_final.calitp_itp_id == 4][["route_id"]].nunique()

In [None]:
m2.route_id.nunique()

In [None]:
merged_routeid = set(m2.route_id.unique().tolist())

In [None]:
routelines_routeid = set(routelines_final.route_id.unique().tolist())

In [None]:
merged_routeid - routelines_routeid

In [None]:
# routelines_routeid - merged_routeid

### Ask
Github
* For each operator, what's the % of RT trip_ids that would remain after those thresholds are used? Make a chart function that takes a single operator. Produce charts for all operators. Is the time or geographic coverage that's driving this excluding of trips? What is a recommended threshold to use?
* For short trips, do they tend to be 50% of the longest route length? 40% 30%? 

Meeting
* Filter out for trips that provide useful information before attaching segments to it. 
* How many shape ID's for that route are usable? 
* What's the typical threshold of the actual length of the route versus the longest length we have on record?
* Example: How many 10 minute unique trip IDs will remain and segments will remain after filtering ones that don't provide insights?
* % of segments that actually show up reflects how much of a trip was recorded in GTFS. 

In [None]:
len(m2)

In [None]:
(m2.proportion_route_length.value_counts() / len(m2) * 100).head(15)

In [None]:
(m2.segment_proportion.value_counts() / len(m2) * 100).head(15)

In [None]:
m2.minutes_elapsed.describe()

In [None]:
p25_time = m2.minutes_elapsed.quantile(0.25).astype(int)
p50_time = m2.minutes_elapsed.quantile(0.50).astype(int)
p75_time = m2.minutes_elapsed.quantile(0.75).astype(int)

In [None]:
def trip_duration(row):
    if (row.minutes_elapsed > 0) and (row.minutes_elapsed <= p25_time):
        return f"Short Trip <= {p25_time} min"
    elif (row.minutes_elapsed > p25_time) and (row.minutes_elapsed <= p75_time):
        return f"Medium Trip <= {p75_time} min"
    else:
        return f"Long Trip > {p75_time} min"

In [None]:
m2["trip_duration_categories"] = m2.apply(lambda x: trip_duration(x), axis=1)

In [None]:
m2.trip_duration_categories.value_counts()

In [None]:
test = m2.loc[m2.segment_proportion < 100][["segment_proportion"]]

In [None]:
test.describe()

In [None]:
p25_length = test.segment_proportion.quantile(0.25).astype(int)
p75_length = test.segment_proportion.quantile(0.75).astype(int)

In [None]:
def shape_id_comparison(row):
    if (row.segment_proportion > 0) and (row.segment_proportion <= p25_length):
        return f" <={p25_length}% of segments appear"
    elif (row.segment_proportion > p25_length) and (
        row.segment_proportion <= p75_length
    ):
        return f"<= {p75_length}% of segments appear"
    else:
        return f">= {p75_length}% of segments appear"

In [None]:
m2["shapeid_vs_longest_shapeid_length"] = m2.apply(
    lambda x: shape_id_comparison(x), axis=1
)

In [None]:
m2.shapeid_vs_longest_shapeid_length.value_counts()

In [None]:
m2.loc[m2.trip_id == "6566020"]

In [None]:
len(m2), len(m2.drop_duplicates())

##### How to incorporate time element?
* Same route_dir_identifier falls into a few different categories? Shouldn't they all be around the same duration in terms of minutes?
* How could the time vary so drastically when the # of segments match up?

In [None]:
m2.loc[m2.route_dir_identifier == 2184919314].minutes_elapsed.describe()

In [None]:
m2.loc[m2.route_dir_identifier == 2184919314][
    ["trip_id", "minutes_elapsed", "trip_duration_categories"]
].head(10)

In [None]:
def usable(row):
    if row.shapeid_vs_longest_shapeid_length == (
        f" <={p25_length}% of segments appear"
    ):
        return "Unusable"
    else:
        return "Usable"

In [None]:
m2["usable_y_n"] = m2.apply(lambda x: usable(x), axis=1)

### Already Answered Notes/Questions
* What is the calitp url number? What does 0 or 1 mean? V1, operator has different feeds. 
    * 0 could be primary, 1 is backup. This column will be deleted in V2. 
* Do you think that most shape IDS are going to be less than 100% of the length of the longest shape ID? 
    * Not necessarily, shape ID can be a short version of the trip.
* What’s the difference between direction ID and route dir identifier? What does the 0 and 1 mean in direction ID?
    * We don't know where the bus is going, so just do 0 and 1.
    * Route dir identifier: captures route info and direction it is going to capture all the trips. Helps with groupby. 
    * We don't want to stick with trip id, we need to get to route level. 
    * Don't want to lose info on the direction. 
    * Have to distinguish direction or else it'll look like the bus is going backwards when plotting.
    * RT data comes with direction id and can get which direction it ran in from schedule data. 
    * Attach route, join coordinate data to segments. 
    * Use segments and average out trips that occurred on that segment. 
* Ask about graph on Slack. 
* Should I use this `get_routelines` from `A1_vehicle_positions`. 
    * Just read it directly from GCS, don't need buffer.
* Why would the same route ID for the other direction have more segments? 
   * Can have a layover. 
   * A segment must be 1000 meters or less.
* The `route_dir_identifier` is used for segments to cut segments
for both directions the route runs.

* How come there are so many different timestamps within a 30 second increments of each either within the same segment? GTFS pings every 30 seconds.