This notebook calculates new crashes (NC) from existing crashes (EC) only, so the result is consistent with whatever type or source of existing-crash data the tool uses and does not depend on the current implementation of the crashes model.

This is the equation being implemented:
$NC_{cmojk} = EC_{cmoj} \cdot \left(1 + \sum_{i}\sum_{F} E_{ik} \cdot \frac{N_{i}}{L} \cdot I_{F}\right)^{p} \cdot CRF_{mojk}$
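
As a quick sanity check, the equation can be evaluated by hand for a single row. The sketch below reads the terms the way the code in this notebook does: $E_{ik}$ as the fractional volume change for element $i$, $\frac{N_i}{L}$ as that element's project share, and $I_F$ as the improvement factor. The function name `new_crashes` and all numbers are made up for illustration; none of them come from the tool's data.

```python
def new_crashes(ec, element_terms, crf, p=0.5):
    """Evaluate NC = EC * (1 + sum(E * share * I)) ** p * CRF for one row.

    element_terms is a list of (E, share, I) tuples, one per infrastructure element.
    """
    volume_factor = 1 + sum(e * share * i for e, share, i in element_terms)
    return ec * volume_factor ** p * crf

# Hypothetical project: one new element with a 20% volume increase
# covering half of the project, EC = 10 crashes, CRF = 0.9
nc = new_crashes(ec=10.0, element_terms=[(0.20, 0.5, 1.0)], crf=0.9)
print(round(nc, 3))
```

With no elements at all the volume factor is 1, so NC reduces to EC times CRF.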

In [None]:
import numpy as np
import pandas as pd

In [None]:
existing_crashes = pd.read_csv('output_2023_09_05/reports/safety-4-combined-b-crashes-all.csv')
infrastructure = pd.read_csv('output_2023_09_05/reports/overall-4-infrastructure-safety.csv')
infrastructure_volume_changes = pd.read_csv('output_2023_09_05/lookups/per_element_travel_adjustments.csv')

In [None]:
## Pull out the required variables: element, crashes, volume increase per element, element share, improvement type, CRF
CRFmojk = existing_crashes["CRFmojk"]
element = infrastructure["Infrastructure type"]
## Override with a single hard-coded element for testing
element = "conventional-bike-lane"
share = infrastructure["Project share"]
## Fix later

In [None]:
## testing
volume_change = infrastructure_volume_changes[infrastructure_volume_changes["element"] == element]
volume_change.reset_index().at[0,"mean adjustment (%)"]

In [None]:
## Define the function
## have to iterate through all infrastructure for the one project ID
def new_NC(Project_ID, Estimate, EC, CRF):
    project_elements = infrastructure[infrastructure["Project ID"] == Project_ID]
    volume_change_factor = 1
    for index, element in project_elements.iterrows():
        element_name = element["Infrastructure type"]
        share = element["Project share"]
        improvement_type = element["Improvement type"]
        if improvement_type == "retrofit":
            improvement_factor = 0.1
        else:
            improvement_factor = 1
        sel_volume_change = infrastructure_volume_changes[infrastructure_volume_changes["element"] == element_name]
        if len(sel_volume_change) == 0:
            volume_change = 0
        else:
            volume_change = (sel_volume_change.reset_index().at[0, Estimate + " adjustment (%)"])/100
        volume_change_factor += volume_change * share * improvement_factor
    ## volume change factor raised to the "safety in numbers" power (p = 0.5 in the equation)
    volume_change_factor = (volume_change_factor)**0.5
    NC = EC * volume_change_factor * CRF
    return NC
## Note: iterrows() iterates over rows, so the loop above does run
## (iterating over a DataFrame directly would yield column labels instead), but it is slow and messy
## Fix later - this is not the final implementation for the tool anyway, just testing to see if the equation improves the results
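
One possible way to tidy this up later is to compute the volume change factor for all projects at once with a merge and a groupby, instead of filtering the tables inside a loop. The sketch below is only an illustration on made-up toy tables: the function name `volume_factors` is hypothetical, the column names are the ones used in `new_NC`, and the lookup is assumed to have exactly one row per element.

```python
import pandas as pd

# Toy stand-ins for the infrastructure and per-element adjustment tables (values are made up)
infra = pd.DataFrame({
    "Project ID": ["p1", "p1", "p2"],
    "Infrastructure type": ["conventional-bike-lane", "sidewalk", "unknown-element"],
    "Project share": [0.5, 0.5, 1.0],
    "Improvement type": ["new", "retrofit", "new"],
})
adjustments = pd.DataFrame({
    "element": ["conventional-bike-lane", "sidewalk"],
    "mean adjustment (%)": [20.0, 10.0],
})

def volume_factors(infra, adjustments, estimate="mean", p=0.5):
    """Return (1 + sum of E * share * I) ** p per project, indexed by Project ID."""
    merged = infra.merge(adjustments, how="left",
                         left_on="Infrastructure type", right_on="element")
    # Elements missing from the lookup contribute no volume change, as in new_NC
    change = merged[estimate + " adjustment (%)"].fillna(0) / 100
    # Retrofits count at 0.1 and everything else at 1, as in new_NC
    improvement = merged["Improvement type"].eq("retrofit").map({True: 0.1, False: 1.0})
    term = change * merged["Project share"] * improvement
    return (1 + term.groupby(merged["Project ID"]).sum()) ** p

factors = volume_factors(infra, adjustments)
print(factors)
```

Unlike `new_NC`, which takes the first matching lookup row, a merge would duplicate rows if the lookup listed an element more than once, so that assumption would need checking against the real file.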

In [None]:
def apply_new_NC(element,EC_type):
    ## request the EC_type (eg what was used in the tool) to be added as a column to this table
    ## then use whatever the tool used for the NC_new column
    Project_ID = element["Project ID"]
    EC = element["ECmoj" + EC_type]
    Estimate = element["K estimate"]
    CRF = element["CRFmojk"]
    NC = new_NC(Project_ID,Estimate,EC,CRF)
    return NC

In [None]:
for index, row in existing_crashes.iterrows():
    existing_crashes.at[index,"NC_user"] = apply_new_NC(row," with user input")
    existing_crashes.at[index,"NC_model"] = apply_new_NC(row," model")

### Calculate Crash Change

In [None]:
## Calculate crash change: NC - EC for both model and user input (later add "whatever the tool used" version)
existing_crashes["Change_user"] = existing_crashes["NC_user"] - existing_crashes["ECmoj with user input"]
existing_crashes["Change_model"] = existing_crashes["NC_model"] - existing_crashes["ECmoj model"]

In [None]:
total_crash_change = pd.read_csv('output_2023_09_05/reports/safety-4-combined-c-crashes-volume.csv')

In [None]:
## Sum crash changes across location types - maintain split by project, mode, outcome, estimate
change_model_mok = existing_crashes.groupby(["Project ID","M Mode","O Outcome","K estimate"])["Change_model"].sum()
change_user_mok = existing_crashes.groupby(["Project ID","M Mode","O Outcome","K estimate"])["Change_user"].sum()
change_user_mok.loc[("644adafaab814ec4fdd30fab","bicycling","crash","lower")]

In [None]:
## Try to add Crash_change_model and Crash_change_user to the total_crash_change table
for index, row in total_crash_change.iterrows():
    ## The set of characteristics for this project, mode, outcome, estimate
    row_chars = (row["Project ID"],row["M Mode"],row["O Outcome"],row["K Estimate"])
    ## The corresponding model and user crash changes (summed by location type in the previous cell)
    total_crash_change.at[index,"model_crash_change"] = change_model_mok.loc[row_chars]
    total_crash_change.at[index,"user_crash_change"] = change_user_mok.loc[row_chars]
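
The row-by-row lookup above could also be expressed as a single left merge. The sketch below uses made-up toy tables with the same column names, and assumes the only difference between the key columns of the two tables is the capitalisation of "K estimate" versus "K Estimate", as the loop suggests.

```python
import pandas as pd

# Toy per-location rows (made-up values), mirroring the columns used above
crashes = pd.DataFrame({
    "Project ID": ["p1", "p1", "p2"],
    "M Mode": ["bicycling", "bicycling", "walking"],
    "O Outcome": ["crash", "crash", "crash"],
    "K estimate": ["lower", "lower", "mean"],
    "Change_model": [1.0, 2.0, -0.5],
    "Change_user": [0.5, 0.25, -1.0],
})
# Toy totals table; note the capitalised "K Estimate" column
totals = pd.DataFrame({
    "Project ID": ["p1", "p2", "p3"],
    "M Mode": ["bicycling", "walking", "walking"],
    "O Outcome": ["crash", "crash", "crash"],
    "K Estimate": ["lower", "mean", "mean"],
})

keys = ["Project ID", "M Mode", "O Outcome", "K estimate"]
# Sum over location type, then rename so the columns line up with the totals table
summed = (crashes.groupby(keys)[["Change_model", "Change_user"]].sum()
          .rename(columns={"Change_model": "model_crash_change",
                           "Change_user": "user_crash_change"})
          .reset_index()
          .rename(columns={"K estimate": "K Estimate"}))

# A left merge keeps every totals row; combinations with no match become NaN
# instead of raising a KeyError as .loc would
merged = totals.merge(summed, how="left",
                      on=["Project ID", "M Mode", "O Outcome", "K Estimate"])
print(merged)
```

The NaN rows make it easy to spot project, mode, outcome and estimate combinations that exist in one table but not the other.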

### Graphs of new NC vs old NC results

In [None]:
## Check whether NC_model equals the original NCmojk (the difference should be zero if so)
(existing_crashes["NC_model"]-existing_crashes["NCmojk"]).plot()
## Some rows differ substantially
## Maybe errors in how one or the other was calculated
existing_crashes[abs(existing_crashes["NC_model"]-existing_crashes["NCmojk"]) > 100]
## Look at these projects and try to figure out why they are so far off

In [None]:
(existing_crashes["NC_model"]-existing_crashes["ECmoj with user input"]).plot(figsize = (15,10), ylim=(-25,25))
(existing_crashes["Change_model"]).plot(figsize = (15,10), ylim=(-25,25))
(existing_crashes["Change_user"]).plot(figsize = (15,10), ylim=(-25,25))
## The user-input -> user-input version definitely looks the most reasonable: a change of no more than about 25 crashes

In [None]:
existing_crashes.plot(y=["ECmoj model","NC_model","ECmoj with user input","NC_user"], figsize = (15,10), ylim=(0,50))

In [None]:
total_crash_change.plot(y=["Change in crashes","model_crash_change","user_crash_change"],figsize = (15,10))

In [None]:
## Which projects are now negative? (decrease in crashes)
existing_crashes[existing_crashes["Change_user"] < 0]
## What proportion are now negative? (out of all rows that have a user-input EC)
len(existing_crashes[existing_crashes["Change_user"] < 0])/len(existing_crashes[existing_crashes["ECmoj with user input"].notna()])

In [None]:
existing_crashes[existing_crashes["Change_model"] < 0]
len(existing_crashes[(existing_crashes["Change_model"]) < 0])/len(existing_crashes)

In [None]:
## what proportion was originally negative?
## (just in the original condition where the EC was user input, since that is what this new equation addresses)
existing_crashes[existing_crashes["NCmojk"]-existing_crashes["ECmoj with user input"] < 0]
len(existing_crashes[(existing_crashes["NCmojk"]-existing_crashes["ECmoj with user input"]) < 0])/len(existing_crashes[existing_crashes["ECmoj with user input"].notna()])

How did it get worse?
Maybe this will improve with the relative crash rates, since it probably has to do with the increases in volume, which are now applied to the user-based EC, whereas before they may have been applied to the model-based EC.
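
To compare the "originally negative" and "now negative" proportions side by side, the repeated expression from the cells above could be wrapped in a small helper. This is a sketch on made-up values: the helper name `share_negative` is hypothetical, and the column names are the ones used in this notebook.

```python
import pandas as pd

def share_negative(change, baseline):
    """Share of rows with a negative change, among rows where the baseline EC is present."""
    valid = baseline.notna()
    return (change[valid] < 0).sum() / valid.sum()

# Made-up values; one row has no user-input EC and is excluded from both shares
df = pd.DataFrame({
    "ECmoj with user input": [10.0, None, 5.0, 8.0],
    "NCmojk": [9.0, 4.0, 6.0, 7.0],
    "NC_user": [11.0, None, 6.0, 7.0],
})
ec = df["ECmoj with user input"]
old_share = share_negative(df["NCmojk"] - ec, ec)
new_share = share_negative(df["NC_user"] - ec, ec)
print(f"originally negative: {old_share:.2f}, now negative: {new_share:.2f}")
```

Using the same denominator for both shares keeps the comparison like for like.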

In [None]:
## Graph separated by mode and location type - look into this further later; there seem to be big disparities between e.g. bicycling roadway and walking roadway
existing_crashes["Change_user"].groupby([existing_crashes["J Location"], existing_crashes["M Mode"]]).plot(legend=True, figsize = (15,10))