# Filter low trip speeds from route averages

Look at the distribution of speeds at the trip level.

**Filter out trips with low `sec_elapsed`**
This doesn't change the distribution much, but we should do it to be consistent. To create `vp_usable`, we exclude trips whose (max - min) timestamp is <= 10 minutes. Trips can pass that timestamp check and still produce less than 10 minutes of vp, so exclude them now.

**Filter out extra long trips**
Notice in the histograms that we have very long tails: very high `meters_elapsed` and `sec_elapsed` values. We should set a maximum trip time threshold, around 3 hours, to get rid of the long tails. These long tails are what contribute to the very low speeds.
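
To make the link between long `sec_elapsed` values and low speeds concrete, here is a minimal sketch. It assumes `speed_mph` is simply distance divided by elapsed time, which is not confirmed by this notebook. The `to_mph` helper and the two toy trips are hypothetical. The 180-minute cap mirrors the roughly 3-hour threshold suggested above.

```python
import pandas as pd

METERS_PER_MILE = 1_609.344
SEC_PER_HOUR = 3_600


def to_mph(meters_elapsed, sec_elapsed):
    """Average speed in mph. Assumed formula, not taken from the pipeline."""
    return (meters_elapsed / METERS_PER_MILE) / (sec_elapsed / SEC_PER_HOUR)


# Hypothetical trips: the same 10-mile distance over very different durations
toy = pd.DataFrame({
    "trip": ["typical", "long_tail"],
    "meters_elapsed": [10 * METERS_PER_MILE, 10 * METERS_PER_MILE],
    "sec_elapsed": [60 * 40, 60 * 60 * 6],  # 40 minutes vs. 6 hours
})
toy["speed_mph"] = to_mph(toy.meters_elapsed, toy.sec_elapsed)

# Cap trip time at 180 minutes; the 6-hour trip is dropped
SEC_MAX = 60 * 180
kept = toy[toy.sec_elapsed <= SEC_MAX]
print(toy[["trip", "speed_mph"]])
print(kept.trip.tolist())
```

The 6-hour trip averages under 2 mph even though it covers the same distance as the 40-minute trip, so a cap on `sec_elapsed` removes it from the low-speed end of the distribution.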

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from segment_speed_utils import helpers
from segment_speed_utils.project_vars import SEGMENT_GCS, analysis_date

dict_inputs = helpers.get_parameters("./scripts/config.yml", "stop_segments")

In [None]:
analysis_date

## Average Trip Speeds

In [None]:
df = pd.read_parquet(
    f"{SEGMENT_GCS}{dict_inputs['trip_speeds_single_summary']}_{analysis_date}.parquet"
)

In [None]:
def make_histogram(df: pd.DataFrame, col: str):
    """Plot a histogram of `col`, with bin widths chosen per column."""
    fig, ax = plt.subplots(figsize=(3, 2))
    if col == "speed_mph":
        bins = range(0, 80, 5)
        title = "Speed"
    elif col == "meters_elapsed":
        step = 1_609 * 5  # increments of 5 miles
        # Pad the upper edge by one step so the largest values still get a bin
        bins = range(0, int(df[col].max()) + step + 1, step)
        title = "Meters"
    elif col == "sec_elapsed":
        step = 60 * 30  # increments of 30 min
        bins = range(0, int(df[col].max()) + step + 1, step)
        title = "Seconds"
    else:
        raise ValueError(f"No histogram bins defined for column: {col}")

    # Draw on the axes created above rather than relying on the current axes
    df[col].hist(bins=bins, ax=ax)
    ax.set_title(title)


def get_stats(df: pd.DataFrame):
    print("----------- Speed -----------")
    col = "speed_mph"
    print(df[col].describe())
    make_histogram(df, col)
    
    
    print("----------- Meters Elapsed -----------")
    col = "meters_elapsed"
    print(df[col].describe())
    make_histogram(df, col)
    
    print("----------- Seconds Elapsed -----------")
    col = "sec_elapsed"
    print(df[col].describe())
    make_histogram(df, col)
    
        

In [None]:
get_stats(df)

In [None]:
METERS_CUTOFF = 0
SEC_CUTOFF = 60 * 10

new_df = df[
    (df.meters_elapsed >= METERS_CUTOFF) & 
    (df.sec_elapsed >= SEC_CUTOFF)
]

new_df.shape, df.shape, len(df) - len(new_df)

In [None]:
get_stats(new_df)

In [None]:
METERS_CUTOFF = 1_609 # at least 1 mile
SEC_CUTOFF = 60 * 10
SEC_MAX = 60 * 180

new_df = df[
    (df.meters_elapsed >= METERS_CUTOFF) & 
    (df.sec_elapsed >= SEC_CUTOFF) & 
    (df.sec_elapsed <= SEC_MAX)
]

new_df.shape, df.shape, len(df) - len(new_df)

In [None]:
get_stats(new_df)

In [None]:
# Low speeds look much better now: we have fewer of them
new_df[new_df.speed_mph <= 5].speed_mph.hist(bins = range(0, 6, 1))