# Get PEMS stations to match up with SHN postmiles

## Takeaways
* PEMS user guide section on postmiles vs absolute postmiles
   * `postmiles` reset at county line
   * `abs_postmiles` is distance is from the origin 
* SHN postmiles metadata explains that `odometer` counts the distance from the origin. 
   * `pm` is postmiles
   * `pmc` is postmiles combined from `PmPrefix, PM, PmSuffix`
* Since PEMS stations and SHN postmiles data sources both contain county, district, and freeway information, we'll prefer PEMS for county, district identifiers.
   * For columns in common, we will keep names matching PEMS if it's straight rename
   * Whatever format is the values are in PEMS, we'll clean up SHN postmiles to match
* **Merging in SHN postmiles gets us 99.9% from PEMS**
* 5 rows need to try again...these are all 91 freeway in Orange County.
   * We'll do a second merge that just grabs the postmile closest.
   * Not a big deal because we do round to 1 decimal place, but detectors can be located between 2 different postmiles, and our rounding helps us get to the closest, but no guarantee that the closest postmile actually exists.

### References
* PEMS User Guide
* [SHN postmiles ESRI layer](https://caltrans-gis.dot.ca.gov/arcgis/rest/services/CHhighway/SHN_Postmiles_Tenth/FeatureServer/0)
* [SHN postmiles metadata](https://caltrans-gis.dot.ca.gov/arcgis/rest/services/CHhighway/SHN_Postmiles_Tenth/FeatureServer/0/metadata)

In [1]:
import geopandas as gpd
import pandas as pd

from utils import PROCESSED_GCS
from shared_utils.shared_data import GCS_FILE_PATH as SHARED_GCS

In [2]:
postmiles = gpd.read_parquet(
    f"{SHARED_GCS}state_highway_network_postmiles.parquet"
)

station_crosswalk = pd.read_parquet(
    f"{PROCESSED_GCS}station_crosswalk.parquet",
) 

In [3]:
def clean_postmiles(gdf: gpd.GeoDataFrame):
    """
    Clean SHN postmiles dataset.
    Be explicit about columns we don't want to keep
    
    We'll favor the PEMS data source over this one (for county, district, etc)
    and prefer column names / value formatting that match with PEMS.
    
    Also, postmiles will contain information about where the postmile
    is located, offset, etc, which we probably don't need either
    """
    drop_cols = [
        "district", "county",
        "direction",
        #route+suffix, sometimes it's 5
        # sometimes it's 5S. either way, we have freeway_direction
        "routes", "rtesuffix",
        "pmrouteid", 
        "pm", "pmc",
        "pminterval",
        "pmprefix", "pmsuffix",
        "aligncode", "pmoffset",
    ]
    
    rename_dict = {
        "route": "freeway_id"
    }
    
    gdf = gdf.assign(
        # NB becomes N
        freeway_direction = gdf.direction.str.replace("B", ""),
        # pm gets us better merges than pmc
        # pmc is PM(combined) which includes prefix/suffix
        abs_pm = gdf.odometer.round(1)
    ).drop(columns = drop_cols).rename(columns = rename_dict)
    
    return gdf


def clean_station_freeway_info(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean up PEMS station dataset, so we can 
    merge with SHN postmiles dataset.
    """
    rename_dict = {
        "freeway_dir": "freeway_direction"
    }
    # Stations have abs_postmile, numeric, and we'll round to 1 decimal
    # place to match what SHN postmiles (odometer) would be
    df = df.assign(
        abs_pm = df.abs_postmile.round(1),
    ).rename(columns = rename_dict)
    
    return df

## Merging PEMS stations with postmiles

* We care about figuring out why `left_only` exists.
* `right_only` means there is a postmile and no detector there, and that's expected.
* `left` merge is what we want. Looks good, 99.9% of rows are in both.

In [4]:
merge_cols = ["freeway_id", "freeway_direction", "abs_pm"]

m1 = pd.merge(
    station_crosswalk.pipe(clean_station_freeway_info),
    postmiles.pipe(clean_postmiles),
    on = merge_cols,
    how = "outer",
    indicator = True
)

m1._merge.value_counts()

right_only    308949
both            4334
left_only          5
Name: _merge, dtype: int64

In [5]:
m2 = pd.merge(
    station_crosswalk.pipe(clean_station_freeway_info),
    postmiles.pipe(clean_postmiles),
    on = merge_cols,
    how = "left",
    indicator = True
)

m2._merge.value_counts(normalize=True)

both          0.998848
left_only     0.001152
right_only    0.000000
Name: _merge, dtype: float64

Find out specifically what's going on in the 5 rows (5 stations, but just 2 postmile rows).

Try to find a pair in the right only list.

In [6]:
station_find_me = m1[m1._merge=="left_only"][
    merge_cols + ["district_id"]].drop_duplicates()
postmiles_find_me = m1[m1._merge=="right_only"][
    merge_cols].drop_duplicates()

In [7]:
stations_to_fix = m1[m1._merge=="left_only"].station_uuid.unique()

In [8]:
# All on freeway 91 in D12 (orange county)
station_find_me[
    (station_find_me.freeway_id==91)
].district_id.value_counts()

12.0    2
Name: district_id, dtype: int64

In [9]:
m1[m1._merge=="left_only"].freeway_id.value_counts()

91    5
Name: freeway_id, dtype: int64

In [10]:
m1[m1._merge=="left_only"].abs_pm.value_counts()

36.4    4
35.4    1
Name: abs_pm, dtype: int64

In [11]:
station_find_me[
    (station_find_me.freeway_id==91)
].abs_pm.value_counts()

36.4    1
35.4    1
Name: abs_pm, dtype: int64

In [12]:
station_find_me[
    (station_find_me.freeway_id==91)
]

Unnamed: 0,freeway_id,freeway_direction,abs_pm,district_id
3513,91,E,36.4,12.0
3535,91,E,35.4,12.0


In [13]:
station_find_me[
    (station_find_me.freeway_id==91)
].abs_pm.unique()

array([36.4, 35.4])

In [14]:
postmiles_find_me[
    (postmiles_find_me.freeway_id==91) & 
    (postmiles_find_me.abs_pm == 35.4)  
]

Unnamed: 0,freeway_id,freeway_direction,abs_pm
164463,91,W,35.4


In [15]:
# use the closest one, just 0.1 mile before the detector
# make sure we pick from the same direction (eastbound)
postmile_list = postmiles_find_me[
    (postmiles_find_me.freeway_id==91) & 
    (postmiles_find_me.freeway_direction=="E") & 
    (postmiles_find_me.abs_pm >= 35) & 
    (postmiles_find_me.abs_pm <= 37)
].abs_pm.unique()

sorted(postmile_list)

[35.2, 35.3, 35.5, 35.6, 35.7, 35.8, 36.0, 36.2, 36.3, 36.7, 36.9, 37.0]

## Functions for script

In [16]:
def merge_stations_to_shn_postmiles(
    station_crosswalk: pd.DataFrame,
    postmiles: pd.DataFrame
) -> pd.DataFrame:
    merge_cols = ["freeway_id", "freeway_direction", "abs_pm"]

    station2 = clean_station_freeway_info(station_crosswalk)
    postmiles2 = clean_postmiles(postmiles)
    
    m1 = pd.merge(
        station2,
        postmiles2,
        on = merge_cols,
        how = "left",
        indicator = True
    )
    
    station_cols = station2.columns.tolist()
    print(station_cols)
    ok_df = m1[m1._merge=="both"].drop(columns = "_merge")
    fix_df = m1[m1._merge=="left_only"][station_cols]

    station_fix_me = {
        35.4: 35.3,
        36.4: 36.3
    }
    
    fix_df = fix_df.assign(
        abs_pm = fix_df.abs_pm.map(station_fix_me)
    )
    
    m2 = pd.merge(
        fix_df,
        postmiles2,
        on = merge_cols,
        how = "inner",
    )
    
    df = pd.concat([m1, m2], axis=0, ignore_index=True)
    
    return df

In [17]:
final = merge_stations_to_shn_postmiles(station_crosswalk, postmiles)

['station_id', 'freeway_id', 'freeway_direction', 'city_id', 'county_id', 'district_id', 'station_type', 'param_set', 'length', 'abs_postmile', 'physical_lanes', 'station_uuid', 'abs_pm']


In [18]:
station_crosswalk.station_uuid.nunique()

4242

In [19]:
final.station_uuid.nunique()

4242

In [20]:
type(final)

pandas.core.frame.DataFrame

In [21]:
final.columns

Index(['station_id', 'freeway_id', 'freeway_direction', 'city_id', 'county_id',
       'district_id', 'station_type', 'param_set', 'length', 'abs_postmile',
       'physical_lanes', 'station_uuid', 'abs_pm', 'odometer', 'hwysegment',
       'routetype', 'geometry', '_merge'],
      dtype='object')