# Flow vs obs flow & speed vs obs speed

Side-by-side comparison of two pairs of seemingly related columns: `flow` vs `obs_flow` and `speed` vs `obs_speed`.

## What we learned
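* `flow` is always >= `obs_flow` in the samples checked below (no rows with `flow < obs_flow`).
* Most of the time, `flow == obs_flow` (roughly 89-99% of rows per lane in these samples).
* Why would the imputed value be more than what's observed? Is this only true when `obs_flow == 0`? Not entirely. In the June 2024 sample, most rows with `flow > obs_flow` have `obs_flow == 0` (108 of 153 for lane 1), but in the other three samples most of them have a nonzero `obs_flow` (e.g. 375 of 471 for lane 1 in March 2024).
* Open question: do we just use `flow` for now, i.e. always use the imputed values?
* In most cases `speed < obs_speed` (about 93-95% of rows for lane 1 and 62-63% for lane 2), so the imputed speed tends to be lower than what the detector reports.
* Looking at the descriptives, these cases occur where observed speed is implausibly high, far above 100 mph: lane 1 `obs_speed` has means around 240-270 and maxima above 300 in the samples below, while the imputed `speed` stays below about 85.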
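
One way to follow up on the last two bullets is to flag implausibly high observed speeds and check how closely they line up with the `speed < obs_speed` rows. This is only a minimal sketch on made-up values: the cutoff of 100 is a hypothetical threshold (not something from the dataset documentation), and `flag_implausible_obs_speed` and `MAX_PLAUSIBLE_SPEED` are illustrative names.

```python
import pandas as pd

MAX_PLAUSIBLE_SPEED = 100  # hypothetical cutoff, not taken from the source data


def flag_implausible_obs_speed(
    df: pd.DataFrame, lane_number: int, cutoff: float = MAX_PLAUSIBLE_SPEED
) -> pd.DataFrame:
    """Add two boolean columns: observed speed above the cutoff,
    and imputed speed below observed speed. Missing values count as False."""
    col = f"lane_{lane_number}_speed"
    obs_col = f"lane_{lane_number}_obs_speed"
    return df.assign(
        obs_too_high=df[obs_col].gt(cutoff).fillna(False).astype(bool),
        imputed_less=df[col].lt(df[obs_col]).fillna(False).astype(bool),
    )


# Toy rows (made up), using the same nullable Float64 dtype as the real tables
toy = pd.DataFrame({
    "lane_1_speed": [62.8, 57.4, 65.0, 70.0],
    "lane_1_obs_speed": [288.6, 266.0, 65.0, None],
}).astype("Float64")

flagged = flag_implausible_obs_speed(toy, 1)

# Cross-tabulate the two flags to see whether they coincide
crosstab = pd.crosstab(flagged.imputed_less, flagged.obs_too_high)
print(crosstab)
```

If the two flags coincide almost everywhere on the real data, that would support treating very high `obs_speed` values as detector noise rather than as real measurements.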

In [1]:
import pandas as pd

from utils import PROCESSED_GCS

In [2]:
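def metric_vs_observed(metric: str, filtering: list) -> pd.DataFrame:
    """
    Read the imputed and observed tables for one metric and
    outer-merge them on station, year, month, weekday and hour.
    `filtering` is passed straight to the parquet `filters` argument.
    """
    PREFIX = "station_weekday_hour"

    # Imputed values, e.g. lane_1_flow
    metric_df = pd.read_parquet(
        f"{PROCESSED_GCS}{PREFIX}_{metric}.parquet/",
        filters = filtering
    )

    # Observed values, e.g. lane_1_obs_flow
    obs_metric_df = pd.read_parquet(
        f"{PROCESSED_GCS}{PREFIX}_obs_{metric}.parquet/",
        filters = filtering
    )

    merge_cols = ["station_uuid", "year", "month", "weekday", "hour"]

    # Outer merge keeps rows that appear in only one of the two tables
    df = pd.merge(
        metric_df,
        obs_metric_df,
        on = merge_cols,
        how = "outer"
    )

    return df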
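
Because the merge uses `how = "outer"`, a station-hour present in only one of the two tables ends up with missing values on the other side and drops out of every comparison. A small sketch of how pandas' `indicator=True` option can surface that, using toy stand-ins for the two parquet tables (all values here are made up):

```python
import pandas as pd

merge_cols = ["station_uuid", "year", "month", "weekday", "hour"]

# Toy stand-ins for the imputed and observed tables (values are made up)
metric_df = pd.DataFrame({
    "station_uuid": ["a", "b", "c"],
    "year": 2024, "month": 6, "weekday": 2, "hour": 18,
    "lane_1_flow": [10.0, 20.0, 30.0],
})
obs_metric_df = pd.DataFrame({
    "station_uuid": ["a", "b", "d"],
    "year": 2024, "month": 6, "weekday": 2, "hour": 18,
    "lane_1_obs_flow": [10.0, 0.0, 5.0],
})

merged = pd.merge(
    metric_df,
    obs_metric_df,
    on=merge_cols,
    how="outer",
    indicator=True,  # adds a `_merge` column: both / left_only / right_only
)

# How many keys matched in both tables vs. only one of them
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Any `left_only` or `right_only` rows are worth a look before interpreting the equal / greater / less percentages.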

In [3]:
def lane_comparisons(df: pd.DataFrame, lane_number: int, metric: str) -> None:
    """
    Print how often the imputed metric equals, exceeds, or falls below
    the observed metric for one lane. Rows with a missing value in either
    column land in none of the three groups, so the counts can add up to
    fewer than the number of rows.
    """
    print(f"\nMetric: {metric}")
    print(f"Month-Year: {df.month.iloc[0]}-{df.year.iloc[0]}")
    print(f"Weekday: {df.weekday.iloc[0]}   Hour: {df.hour.iloc[0]}")
    print(f"******* lane number: {lane_number} ********")

    N_ROWS = len(df)

    def rounded(numerator, denominator):
        return round(numerator / denominator, 3)

    col = f"lane_{lane_number}_{metric}"
    obs_col = f"lane_{lane_number}_obs_{metric}"

    N_EQUAL = df[df[col] == df[obs_col]].shape[0]
    N_MORE = df[df[col] > df[obs_col]].shape[0]
    N_LESS = df[df[col] < df[obs_col]].shape[0]

    print(f"# rows: {N_ROWS}")
    print(f"equal: {N_EQUAL}, imputed > obs: {N_MORE}, imputed < obs: {N_LESS}")
    print(f"% equal {rounded(N_EQUAL, N_ROWS)}")
    print(f"greater: {rounded(N_MORE, N_ROWS)}, less: {rounded(N_LESS, N_ROWS)}")

    if metric == "speed":
        print("**** values when imputed < obs ****")

        less_df = df.loc[df[col] < df[obs_col]]

        print(less_df[col].describe())
        print(less_df[obs_col].describe())

    print("****values when imputed > obs *****")

    more_df = df.loc[df[col] > df[obs_col]]

    # Flag rows where the observed value is exactly zero
    more_df = more_df.assign(
        obs_col_zero = (more_df[obs_col] == 0).astype(bool)
    )

    print(more_df.obs_col_zero.value_counts())
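
The printed output is hard to compare across lanes and slices. As a rough alternative, the same counts could be collected into one table with a row per lane, plus an explicit count of rows that fall into none of the three groups because a value is missing. This is a sketch on toy data, and `lane_summary` is a hypothetical helper rather than part of the notebook:

```python
import pandas as pd


def lane_summary(df: pd.DataFrame, lane_numbers, metric: str) -> pd.DataFrame:
    """Return one row per lane with equal / more / less / missing counts."""
    rows = []
    for lane in lane_numbers:
        col = f"lane_{lane}_{metric}"
        obs_col = f"lane_{lane}_obs_{metric}"
        # Keep only rows where both imputed and observed values are present
        both = df[[col, obs_col]].dropna()
        more = both[col] > both[obs_col]
        rows.append({
            "lane": lane,
            "n_rows": len(df),
            "n_equal": int((both[col] == both[obs_col]).sum()),
            "n_more": int(more.sum()),
            "n_less": int((both[col] < both[obs_col]).sum()),
            "n_missing": len(df) - len(both),
            # Of the imputed > observed rows, how many have observed == 0
            "n_more_obs_zero": int((more & (both[obs_col] == 0)).sum()),
        })
    return pd.DataFrame(rows).set_index("lane")


# Toy data (made up) with two lanes and a few missing values
toy = pd.DataFrame({
    "lane_1_flow": [5.0, 8.0, 3.0, 4.0],
    "lane_1_obs_flow": [5.0, 0.0, 2.0, None],
    "lane_2_flow": [1.0, 1.0, None, 6.0],
    "lane_2_obs_flow": [1.0, 1.0, None, 0.0],
})

summary = lane_summary(toy, [1, 2], "flow")
print(summary)
```

Returning a DataFrame instead of printing also makes it easy to concatenate the summaries from several year / month / weekday / hour slices.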


## Flow vs observed flow diagnostics

In [4]:
filtering = [[
    ("year", "==", 2024), ("month", "==", 6),
    ("weekday", "==", 2), ("hour", "==", 18)
]]

METRIC = "flow"
df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: flow
Month-Year: 6-2024
Weekday: 2   Hour: 18
******* lane number: 1 ********
# rows: 7026
equal: 6873, imputed > obs: 153, imputed < obs: 0
% equal 0.978
greater: 0.022, less: 0.0
****values when imputed > obs *****
True     108
False     45
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 6-2024
Weekday: 2   Hour: 18
******* lane number: 2 ********
# rows: 7026
equal: 6906, imputed > obs: 120, imputed < obs: 0
% equal 0.983
greater: 0.017, less: 0.0
****values when imputed > obs *****
True     84
False    36
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 6-2024
Weekday: 2   Hour: 18
******* lane number: 3 ********
# rows: 7026
equal: 6927, imputed > obs: 99, imputed < obs: 0
% equal 0.986
greater: 0.014, less: 0.0
****values when imputed > obs *****
True     66
False    33
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 6-2024
Weekday: 2   Hour: 18
******* lane number: 4 ********
# rows: 7026
equal: 6978, imputed > obs: 48, imputed < obs: 0
% equal 0.993

In [5]:
filtering = [[
    ("year", "==", 2024), ("month", "==", 3),
    ("weekday", "==", 4), ("hour", "==", 10)
]]

df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: flow
Month-Year: 3-2024
Weekday: 4   Hour: 10
******* lane number: 1 ********
# rows: 7884
equal: 7413, imputed > obs: 471, imputed < obs: 0
% equal 0.94
greater: 0.06, less: 0.0
****values when imputed > obs *****
False    375
True      96
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 3-2024
Weekday: 4   Hour: 10
******* lane number: 2 ********
# rows: 7884
equal: 7590, imputed > obs: 294, imputed < obs: 0
% equal 0.963
greater: 0.037, less: 0.0
****values when imputed > obs *****
False    210
True      84
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 3-2024
Weekday: 4   Hour: 10
******* lane number: 3 ********
# rows: 7884
equal: 7620, imputed > obs: 264, imputed < obs: 0
% equal 0.967
greater: 0.033, less: 0.0
****values when imputed > obs *****
False    174
True      90
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 3-2024
Weekday: 4   Hour: 10
******* lane number: 4 ********
# rows: 7884
equal: 7662, imputed > obs: 222, imputed < obs: 0

In [6]:
filtering = [[
    ("year", "==", 2023), ("month", "==", 7),
    ("weekday", "==", 2), ("hour", "==", 10)
]]

df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: flow
Month-Year: 7-2023
Weekday: 2   Hour: 10
******* lane number: 1 ********
# rows: 7056
equal: 6552, imputed > obs: 504, imputed < obs: 0
% equal 0.929
greater: 0.071, less: 0.0
****values when imputed > obs *****
False    402
True     102
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 7-2023
Weekday: 2   Hour: 10
******* lane number: 2 ********
# rows: 7056
equal: 6723, imputed > obs: 333, imputed < obs: 0
% equal 0.953
greater: 0.047, less: 0.0
****values when imputed > obs *****
False    258
True      75
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 7-2023
Weekday: 2   Hour: 10
******* lane number: 3 ********
# rows: 7056
equal: 6759, imputed > obs: 297, imputed < obs: 0
% equal 0.958
greater: 0.042, less: 0.0
****values when imputed > obs *****
False    204
True      93
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 7-2023
Weekday: 2   Hour: 10
******* lane number: 4 ********
# rows: 7056
equal: 6858, imputed > obs: 198, imputed < obs: 0

In [7]:
filtering = [[
    ("year", "==", 2023), ("month", "==", 9),
    ("weekday", "==", 1), ("hour", "==", 14)
]]

df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: flow
Month-Year: 9-2023
Weekday: 1   Hour: 14
******* lane number: 1 ********
# rows: 7761
equal: 6894, imputed > obs: 867, imputed < obs: 0
% equal 0.888
greater: 0.112, less: 0.0
****values when imputed > obs *****
False    762
True     105
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 9-2023
Weekday: 1   Hour: 14
******* lane number: 2 ********
# rows: 7761
equal: 7245, imputed > obs: 516, imputed < obs: 0
% equal 0.934
greater: 0.066, less: 0.0
****values when imputed > obs *****
False    423
True      93
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 9-2023
Weekday: 1   Hour: 14
******* lane number: 3 ********
# rows: 7761
equal: 7305, imputed > obs: 456, imputed < obs: 0
% equal 0.941
greater: 0.059, less: 0.0
****values when imputed > obs *****
False    381
True      75
Name: obs_col_zero, dtype: int64

Metric: flow
Month-Year: 9-2023
Weekday: 1   Hour: 14
******* lane number: 4 ********
# rows: 7761
equal: 7425, imputed > obs: 336, imputed < obs: 0
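
To see how much the share of `obs_flow == 0` rows varies between slices, the lane 1 counts printed by the four flow cells above can be put side by side. The numbers below are copied from that output; only the short slice labels are made up for readability.

```python
import pandas as pd

# Lane 1 counts copied from the printed output of the four flow cells
lane_1 = pd.DataFrame(
    {
        "slice": [
            "2024-06 wd2 h18",
            "2024-03 wd4 h10",
            "2023-07 wd2 h10",
            "2023-09 wd1 h14",
        ],
        "n_rows": [7026, 7884, 7056, 7761],
        "n_more": [153, 471, 504, 867],            # imputed > obs
        "n_more_obs_zero": [108, 96, 102, 105],    # of those, obs_flow == 0
    }
).set_index("slice")

# Share of rows where imputed flow exceeds observed flow
lane_1["share_more"] = (lane_1.n_more / lane_1.n_rows).round(3)
# Among those rows, share where the observed flow is exactly zero
lane_1["share_obs_zero"] = (lane_1.n_more_obs_zero / lane_1.n_more).round(3)

print(lane_1)
```

For lane 1, the `obs_flow == 0` share is about 0.71 in the June 2024 slice but only 0.12 to 0.20 in the other three, so a zero observed flow does not explain most of the `flow > obs_flow` rows outside that first slice.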

## Speed vs observed speed diagnostics

In [8]:
filtering = [[
    ("year", "==", 2024), ("month", "==", 5),
    ("weekday", "==", 2), ("hour", "==", 8)
]]

METRIC = "speed"
df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: speed
Month-Year: 5-2024
Weekday: 2   Hour: 8
******* lane number: 1 ********
# rows: 2606
equal: 70, imputed > obs: 43, imputed < obs: 2479
% equal 0.027
greater: 0.017, less: 0.951
**** values when imputed < obs ****
count       2479.0
mean      57.44674
std      14.862704
min           5.42
25%        50.2975
50%          62.78
75%          67.87
max          79.48
Name: lane_1_speed, dtype: Float64
count        2479.0
mean     266.002098
std       84.267447
min            22.6
25%          210.65
50%           288.6
75%           326.1
max           397.4
Name: lane_1_obs_speed, dtype: Float64
****values when imputed > obs *****
True     33
False    10
Name: obs_col_zero, dtype: int64

Metric: speed
Month-Year: 5-2024
Weekday: 2   Hour: 8
******* lane number: 2 ********
# rows: 2606
equal: 55, imputed > obs: 35, imputed < obs: 1631
% equal 0.021
greater: 0.013, less: 0.626
**** values when imputed < obs ****
count       1631.0
mean     54.435811
std      15.345102
min     

In [9]:
filtering = [[
    ("year", "==", 2023), ("month", "==", 8),
    ("weekday", "==", 5), ("hour", "==", 7)
]]

df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: speed
Month-Year: 8-2023
Weekday: 5   Hour: 7
******* lane number: 1 ********
# rows: 2311
equal: 98, imputed > obs: 31, imputed < obs: 2169
% equal 0.042
greater: 0.013, less: 0.939
**** values when imputed < obs ****
count       2169.0
mean     69.474558
std       5.819333
min         29.675
25%          64.85
50%           71.4
75%         74.325
max         83.475
Name: lane_1_speed, dtype: Float64
count        2169.0
mean     260.652006
std       47.678904
min            62.9
25%           255.3
50%           278.3
75%           296.5
max           333.9
Name: lane_1_obs_speed, dtype: Float64
****values when imputed > obs *****
True     26
False     5
Name: obs_col_zero, dtype: int64

Metric: speed
Month-Year: 8-2023
Weekday: 5   Hour: 7
******* lane number: 2 ********
# rows: 2311
equal: 60, imputed > obs: 27, imputed < obs: 1459
% equal 0.026
greater: 0.012, less: 0.631
**** values when imputed < obs ****
count       1459.0
mean     68.689011
std       4.427647
min     

In [10]:
filtering = [[
    ("year", "==", 2023), ("month", "==", 10),
    ("weekday", "==", 3), ("hour", "==", 13)
]]

df = metric_vs_observed(METRIC, filtering)

for i in range(1, 9):
    lane_comparisons(df, i, METRIC)


Metric: speed
Month-Year: 10-2023
Weekday: 3   Hour: 13
******* lane number: 1 ********
# rows: 2604
equal: 89, imputed > obs: 67, imputed < obs: 2430
% equal 0.034
greater: 0.026, less: 0.933
**** values when imputed < obs ****
count       2430.0
mean     64.361485
std       9.077721
min         17.875
25%       60.73125
50%           65.0
75%         71.075
max           78.9
Name: lane_1_speed, dtype: Float64
count        2430.0
mean     238.871605
std       55.517956
min            51.3
25%           212.4
50%           258.0
75%           280.4
max           315.6
Name: lane_1_obs_speed, dtype: Float64
****values when imputed > obs *****
True     49
False    18
Name: obs_col_zero, dtype: int64

Metric: speed
Month-Year: 10-2023
Weekday: 3   Hour: 13
******* lane number: 2 ********
# rows: 2604
equal: 67, imputed > obs: 40, imputed < obs: 1622
% equal 0.026
greater: 0.015, less: 0.623
**** values when imputed < obs ****
count       1622.0
mean     61.670176
std       8.343252
min 