# Investigate, Test, Reflect 1
#### Ashlynn Wimer
#### 10/17/2024

In this notebook, I investigate, test, and reflect on the application of association rules to transportation safety. In particular, I test the utility of OSM tags as a data source for understanding the severity of traffic crashes.

## Investigate

The past decade has seen extensive application of association rule based methods to transportation safety, with much of the prior literature focusing on traffic crashes in particular. Shahin et al. (2021) apply the method to understand traffic intersection crashes and their causes, Das et al. (2019) use it to understand motivators of hit and run crashes in segments and intersections, and Kho et al. (2021) utilize it to understand fatal road collisions writ large. Overall, authors tend to apply association rule based methods to datasets that consist of environmental factors (e.g. time of day and weather condition), crash report data (e.g. severity and crash description), and road factors (e.g. one way street or two way street) (Kho et al., 2021; Das et al., 2019; Shahin et al., 2021). However, the specific application of the association rules method differs. Some prior research uses association rule mining as the primary approach to understanding the dataset (Das et al., 2019; Das et al., 2022; Xing et al., 2024). Other authors have used the method to extend k-modes analyses (Shahin et al., 2021) or predictive analyses (Kho et al., 2021). Datasets also vary dramatically in size, with some authors using datasets of fewer than 1,000 crashes (Shahin et al., 2021) while others mine datasets of well over 100,000 traffic crashes (Das et al., 2019).

Overall, association rule mining has a clear and consistent usage within transportation safety, and a large variety of implementation approaches. However, standard datasets persist and privilege certain characteristics of the built environment over others (e.g. they consider the lane structure of a road but not the presence or absence of sidewalks and bike lanes). This is potentially due to dataset limitations, as information on road factors is easier to acquire than information on bike lanes and sidewalks.

## Test

In an attempt to circumvent these road descriptor limitations, we test the viability of OpenStreetMap data as a descriptor of the built environment around traffic crashes. OpenStreetMap, or OSM, is a volunteer driven map of the world in which characteristics of the (built) environment, such as buildings, amenities, and roads, are mapped from satellite imagery. These features are categorized into three data types: nodes (points), ways (usually line data), and relations (usually polygons or complex assemblages of nodes and ways). Entities on OSM have additional metadata in the form of tags. For ways, these tags may indicate the type of way (e.g. "highway=motorway" would indicate a high speed road for cars) or may describe features around the road (e.g. the "sidewalk=*" key indicates the presence of sidewalks).

Due to its relevance to a larger research project in this class, we combine OSM way data from New York City with records of the 50,000 most recent traffic crashes in New York City in 2024. Using the market basket as a metaphor, we treat traffic crashes as "transactions" in which one "buys" tag data from OSM and injury metrics from the crash dataset. Thus, one of our transactions could look like {foot|no, bicycle|no, per_injured} if a person was injured at a relatively data sparse way which banned foot and bicycle traffic. We utilize PySpark's FPGrowth function to mine for associations between the OSM tags and crash severity metrics.
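To make the market basket metaphor concrete, the sketch below turns a single crash into a transaction. The `key|value` token format mirrors the items that appear in the results; the function name `build_transaction`, the example tag dictionary, and the `persons_injured` field are hypothetical and are not taken from the project repository.

```python
def build_transaction(way_tags, persons_injured):
    """Return the sorted, de-duplicated items "bought" by one crash.

    way_tags: OSM tags of the way joined to the crash, as {key: value}.
    persons_injured: injury count from the crash record (assumed field name).
    """
    # Each OSM tag becomes one "key|value" token, e.g. "foot|no".
    # A set is used so that no item is repeated within a transaction.
    items = {f"{key}|{value}" for key, value in way_tags.items()}
    # The severity indicator is simply one more item in the basket.
    if persons_injured > 0:
        items.add("per_injured")
    return sorted(items)


# Hypothetical crash on a data sparse way that bans foot and bicycle traffic.
example = build_transaction({"foot": "no", "bicycle": "no"}, persons_injured=1)
print(example)
```

A list of such transactions, one per crash, is the kind of input a frequent pattern miner such as FPGrowth consumes.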

Notably, we opt to _not_ heavily preprocess our OSM tags, instead accepting every tag associated with a way. This is largely due to the semi-standardized and vast nature of OSM tags. As OSM is volunteer driven, not every feature deserving of a tag receives that tag, nor are all tags applied uniformly. The sidewalk tag on OSM, for instance, is expected to have four potential values: "no", "both", "right", or "left". However, some deprecated values, such as "none" or "lane," still see infrequent and ill-advised use (OSM Contributors, n.d.). This lack of standardization makes filtering such tags prior to analysis difficult. Additionally, this analysis is preliminary. Not prefiltering our data therefore allows us to discover which tags are informative, even if we're forced to filter through the discovered association rules or have relevant association rules garner lower confidence values than deserved.

All generating code for this analysis can be found on GitHub at [https://github.com/bucketteOfIvy/macs40123](https://github.com/bucketteOfIvy/macs40123). This entire analysis -- including data acquisition -- can be rerun by cloning the repository to Midway and calling `run_all.src` from within `ITR_1/scripts`.

### Bloopers

The "Test" section of this assignment requests an account of what did and didn't work, so I want to highlight a few bloopers that will not be directly visible in the data below.

1. **Data Pre-processing**. It turns out that wrangling OSM data in this manner is unpleasant. Two factors are at play. First, spatially joining point data to line data typically requires making a small buffer polygon around your lines. Second, all traffic crashes in NYC are snapped to their nearest intersection for reporting. By default, these factors mean that each traffic crash is associated with multiple ways (at least one for each crossroad of the intersection), but in this specific instance it _also_ means that our analysis is shockingly sensitive to the buffer size, with "reasonable but large" buffers (e.g. 30 feet) risking merging individual crashes with unrelated ways. In this case, I played with the data until I eventually settled on a buffer of 5 feet, which was itself scary given that the accuracy of my coordinate reference system is 2 meters (about 6.6 feet).

2. **Whoopsie! Trivial Results**. In a prior run of FPGrowth, my itemsets included all relevant severity indicators. If someone was injured, "per_injured" would be in the basket, and if that person was a cyclist, so would "cyc_injured". That led to the discovery of a very amusing set of trivial association rules which mapped itemsets containing one injury indicator to a separate injury indicator (who would have thought that an accident in which a cyclist was injured would _also_ be an accident where a person was injured?). I fixed this by keeping my injury metrics separate, so that -- in the below exploration -- I have three sets of itemsets: one where "per_injured" is a possible tag, one where "cyc_injured" is a possible tag, and one where "ped_injured" is a possible tag. Unfortunately, this has removed all results where an injury is a consequent of the association rule, but it does mean that the rules are much less trivial.

3. **Too Few Deaths**. In the first approach to this dataset, I used fatalities as my indicator for severity. I quickly discovered that the words indicating "death" in my itemsets became wildly infrequent, making their associations very hard to mine without incredibly low minimum support values. There are likely ways around this issue that still focus on fatalities, but I opted to swap to injuries instead to catch more data points. While this did force me to rewrite a good bit of code and repull my data, this is among the first times that I've been pleased to discover that there _isn't_ enough data to do a given analysis.
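The buffer sensitivity described in the first blooper can be sketched without any GIS library. Testing whether a point falls inside a round buffer around a line is equivalent to testing whether the point lies within the buffer distance of that line. The coordinates, way names, and helper functions below are hypothetical and assume a planar coordinate system measured in feet; this is not the join used in the project code.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all (x, y) tuples, in feet)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def ways_within_buffer(crash, ways, buffer_ft):
    """Names of ways whose buffered line would contain the crash point."""
    return sorted(
        name
        for name, (a, b) in ways.items()
        if point_segment_distance(crash, a, b) <= buffer_ft
    )


# Hypothetical layout: a crash snapped to the intersection of two streets,
# plus an unrelated alley that starts a little over 20 feet away.
ways = {
    "main_st": ((-100.0, 0.0), (100.0, 0.0)),
    "cross_st": ((0.0, -100.0), (0.0, 100.0)),
    "alley": ((10.0, 20.0), (100.0, 20.0)),
}
crash = (0.0, 0.0)

print(ways_within_buffer(crash, ways, buffer_ft=5))
print(ways_within_buffer(crash, ways, buffer_ft=30))
```

With the 5 foot buffer the crash picks up only the two crossing streets, while the 30 foot buffer also pulls in the unrelated alley, which is the failure mode described above.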

And with that, let's get into the actual results!

### Results

As hinted in the Bloopers section, we have three separate sets of itemsets. The first has an indicator tag for crashes in which at least one person was injured; the second has an indicator tag for crashes in which at least one pedestrian was injured; and the third has an indicator tag for crashes in which at least one cyclist was injured. As we're interested in associations between the built environment and injuries, we check the highest and lowest lift rules in which the injury tag is an antecedent or consequent for each dataset.
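For readers less familiar with the metrics used to rank rules, the sketch below computes support, confidence, and lift under their standard definitions for a single rule over a handful of toy transactions. The transactions and the helper `rule_metrics` are hypothetical and are not drawn from the crash data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n            # how often the whole rule appears
    confidence = n_both / n_ante    # how often the consequent follows the antecedent
    lift = confidence / (n_cons / n)  # confidence relative to the consequent's base rate
    return support, confidence, lift


# Four toy transactions (hypothetical).
transactions = [
    {"foot|no", "bicycle|no", "per_injured"},
    {"foot|no", "bicycle|no"},
    {"surface|asphalt", "per_injured"},
    {"surface|asphalt"},
]

support, confidence, lift = rule_metrics(
    transactions,
    antecedent={"foot|no", "per_injured"},
    consequent={"bicycle|no"},
)
print(support, confidence, lift)
```

Here the rule appears in one of four transactions (support 0.25), always holds when its antecedent is present (confidence 1.0), and its consequent is twice as likely as its base rate of 0.5 (lift 2.0). A lift near 1 means the antecedent tells us little about the consequent.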

Amusingly, this really means that we only care about antecedents, as no association rule with a minimum support of 1% was found with an injury as a consequent:

In [25]:
import pandas as pd

# read in all datasets for convenience
assocPer = pd.read_parquet('./data/itemsets_and_association_rules/perInjassociationRules.parquet')
assocPed = pd.read_parquet('./data/itemsets_and_association_rules/pedInjassociationRules.parquet')
assocCyc = pd.read_parquet('./data/itemsets_and_association_rules/cycInjassociationRules.parquet')

# filter down to cases where the consequent is a person being injured
perInj_consequent = assocPer['consequent'].apply(lambda x: "'per_injured'" in x)
pedInj_consequent = assocPed['consequent'].apply(lambda x: "'ped_injured'" in x)
cycInj_consequent = assocCyc['consequent'].apply(lambda x: "'cyc_injured'" in x)

assocPer[perInj_consequent].sort_values('lift', ascending=False).head(20)

Unnamed: 0,antecedent,consequent,confidence,lift,support


In [5]:
assocPed[pedInj_consequent]

Unnamed: 0,antecedent,consequent,confidence,lift,support


In [8]:
assocCyc[cycInj_consequent]

Unnamed: 0,antecedent,consequent,confidence,lift,support


Additionally, we have no association rule with an injury tag as an antecedent and a lift that is much less than 1:

In [26]:
# filter down to cases where the antecedent includes an injury tag
perInj_antecedent = assocPer['antecedent'].apply(lambda x: "'per_injured'" in x)
pedInj_antecedent = assocPed['antecedent'].apply(lambda x: "'ped_injured'" in x)
cycInj_antecedent = assocCyc['antecedent'].apply(lambda x: "'cyc_injured'" in x)

assocPer[perInj_antecedent].sort_values('lift', ascending=True).head(1)

Unnamed: 0,antecedent,consequent,confidence,lift,support
34061,"['lanes:backward|2', 'per_injured']",[NY'],0.80032,0.985905,0.021826


In [11]:
assocPed[pedInj_antecedent].sort_values('lift', ascending=True).head(1)

Unnamed: 0,antecedent,consequent,confidence,lift,support
57380,"['ped_injured', 'lanes|2', 'highway|secondary']",[NY'],0.802529,0.988626,0.012012


In [10]:
assocCyc[cycInj_antecedent].sort_values('lift', ascending=True).head(1)

Unnamed: 0,antecedent,consequent,confidence,lift,support
1578,"['cyc_injured', 'highway|tertiary']",[NY'],0.800353,0.985946,0.013192
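Assuming the standard definition of lift applies to these outputs (confidence divided by the overall support of the consequent), dividing each reported confidence by its lift recovers how common the consequent is across all transactions. The sketch below does this for the three lowest-lift rules, using values copied from the outputs above; the variable names are illustrative.

```python
# (confidence, lift) pairs copied from the three lowest-lift rules above,
# all of which have NY' as their consequent.
rules = {
    "per_injured": (0.80032, 0.985905),
    "ped_injured": (0.802529, 0.988626),
    "cyc_injured": (0.800353, 0.985946),
}

# lift = confidence / support(consequent)
# => support(consequent) = confidence / lift
base_rates = {name: conf / lift for name, (conf, lift) in rules.items()}

for name, rate in base_rates.items():
    print(f"{name}: implied share of transactions carrying NY' = {rate:.3f}")
```

All three rules imply that the tag sits on roughly 81% of transactions, so conditioning on an injury barely moves its frequency, which is what a lift just under 1 expresses.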


Thus, we're primarily interested in, and able to investigate, high lift associations where an injury is an antecedent. In order to screen out "junk tags" that are not informative of the built environment, we create a running list of junk tags below, and ignore any association rule whose antecedent or consequent contains one such tag. We do so iteratively, meaning the actual junk tag discovery is behind the scenes. Still, a few of the tags included in this list are:

- `NY'` -- the tag indicating a road is in New York.
- `"horse|no"` -- the tag indicating that the road bans horses.
- `tiger|[something]` -- these tags are pulled from the TIGER/Line shapefiles, and tend to give information about the road's location or metadata, but not about its built characteristics.

In [55]:
import numpy as np

junk_tags = ["NY'", "'horse|no'", "tiger:zip_right|11235", 
             "'tiger:zip_left|11235'", "'tiger:zip_left|11234'", "tiger:zip_right|11234'",
            "'tiger:reviewed|no'", "'tiger:county|New York", "'tiger:cfcc|A41'", "'tiger:name_type|Ave'"]

# boolean masks: True when neither the antecedent nor the consequent contains a junk tag
per_nojunk = assocPer['consequent'].apply(lambda x: set(x) & set(junk_tags) == set()) \
                & assocPer['antecedent'].apply(lambda x: set(x) & set(junk_tags) == set())
ped_nojunk = assocPed['consequent'].apply(lambda x: set(x) & set(junk_tags) == set()) \
                & assocPed['antecedent'].apply(lambda x: set(x) & set(junk_tags) == set())
cyc_nojunk = assocCyc['consequent'].apply(lambda x: set(x) & set(junk_tags) == set()) \
                & assocCyc['antecedent'].apply(lambda x: set(x) & set(junk_tags) == set())

In [56]:
pd.set_option('display.max_colwidth', None)
assocPer[perInj_antecedent & per_nojunk].sort_values('lift', ascending=False).head(20)

Unnamed: 0,antecedent,consequent,confidence,lift,support
29994,"['foot|no', 'highway|motorway', 'hgv|designated', 'per_injured']",['bicycle|no'],1.0,40.807487,0.010076
30070,"['foot|no', 'highway|motorway', 'per_injured']",['bicycle|no'],1.0,40.807487,0.01009
29900,"['foot|no', 'hgv|designated', 'oneway|yes', 'per_injured']",['bicycle|no'],0.988588,40.341781,0.01009
29919,"['foot|no', 'hgv|designated', 'per_injured']",['bicycle|no'],0.946019,38.604653,0.010207
30171,"['foot|no', 'oneway|yes', 'per_injured']",['bicycle|no'],0.935356,38.169536,0.010323
354,"['bicycle|no', 'hgv|designated', 'per_injured']",['foot|no'],0.998575,35.925179,0.010207
335,"['bicycle|no', 'hgv|designated', 'oneway|yes', 'per_injured']",['foot|no'],0.998559,35.924588,0.01009
430,"['bicycle|no', 'highway|motorway', 'hgv|designated', 'per_injured']",['foot|no'],0.998557,35.924513,0.010076
31655,"['highway|motorway', 'hgv|designated', 'per_injured']",['bicycle|no'],0.808635,32.998353,0.01009
503,"['bicycle|no', 'highway|motorway', 'per_injured']",['foot|no'],0.897668,32.294902,0.01009


In [57]:
assocPed[pedInj_antecedent & ped_nojunk].sort_values('lift', ascending=False).head(20)

Unnamed: 0,antecedent,consequent,confidence,lift,support
57569,"['ped_injured', 'sidewalk:both|separate', 'lit|yes']",['surface|asphalt'],0.990305,1.650684,0.010411
57430,"['ped_injured', 'lit|yes', 'maxspeed|25 mph']",['surface|asphalt'],0.981627,1.63622,0.010891
2507,"['cycleway:both|no', 'ped_injured', 'oneway|no']",['surface|asphalt'],0.971053,1.618594,0.010746
57431,"['ped_injured', 'lit|yes', 'oneway|no']",['surface|asphalt'],0.958998,1.5985,0.01226
57417,"['ped_injured', 'lit|yes', 'hgv|local']",['surface|asphalt'],0.928981,1.548466,0.011809
57416,"['ped_injured', 'lit|yes']",['surface|asphalt'],0.919494,1.532654,0.023282
2506,"['cycleway:both|no', 'ped_injured']",['surface|asphalt'],0.914768,1.524776,0.015784
57545,"['ped_injured', 'oneway|yes', 'maxspeed|25 mph']",['surface|asphalt'],0.907097,1.511989,0.010236
57429,"['ped_injured', 'lit|yes', 'highway|secondary']",['surface|asphalt'],0.888355,1.48075,0.010775
57585,"['ped_injured', 'sidewalk|separate']",['surface|asphalt'],0.854227,1.423864,0.012799


In [58]:
assocCyc[cycInj_antecedent & cyc_nojunk].sort_values('lift', ascending=False).head(20)

Unnamed: 0,antecedent,consequent,confidence,lift,support
1592,"['cyc_injured', 'lit|yes']",['surface|asphalt'],0.925024,1.541872,0.013832
1689,"['cyc_injured', 'sidewalk:both|separate']",['surface|asphalt'],0.871357,1.452416,0.012624
1677,"['cyc_injured', 'oneway|yes']",['surface|asphalt'],0.837569,1.396098,0.011037
1603,"['cyc_injured', 'maxspeed|25 mph']",['surface|asphalt'],0.811811,1.353163,0.02682
1606,"['cyc_injured', 'maxspeed|25 mph', 'hgv|local']",['surface|asphalt'],0.807044,1.345218,0.015347


## Reflect

Our results have a few interesting features. First, it is important to note that we did _not_ find association rules with injury as a consequent, meaning that our results cannot be taken to imply that built environmental factors cause, or even predict, an injury. However, it is possible that such rules are still in the data: we decided not to preprocess our OSM tags in this approach, potentially lowering the support values of otherwise interesting itemsets.

Nonetheless, we did find results in line with prior expectations in all three cases. In the crashes where pedestrians or cyclists were injured, an asphalt road surface was the only consequent to appear. This makes sense on its face. In the United States, asphalt is the most common surfacing material for car bearing roads (where pedestrian-car and cyclist-car mixing can occur). Thus, it makes sense that this would be an incredibly common consequent in cases where pedestrian or cyclist injuries occurred.

Our association rules on the most general injury case, crashes where any injury occurred, contain the most interesting results. We find that bans on foot and bicycle traffic ("foot|no" and "bicycle|no") are common when injuries occur. In fact, 6 of the 13 rules with a foot or bicycle traffic ban as a consequent have "highway|motorway" as an antecedent, indicating that the roadway for the crash was a high speed road.

The remainder of our association rules are less clearly socially interpretable. We find that, if a road has five lanes--two of which are backward lanes--and a person is injured, then the road likely has three forward lanes. We also find numerous association rules that imply there is a one way cycleway on the left. These rules are likely signs that, in future attempts with this method, built environmental factors should be carefully selected and merged in advance of running the association rule algorithm.
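One possible shape for that selection and merging step is sketched below. Every rule in it is an illustrative assumption rather than the cleaning used or planned in this project: dropping `tiger:` keys, treating the directional lane counts as redundant with `lanes`, and folding the deprecated sidewalk value "none" into "no". The function name `clean_tags` is likewise hypothetical.

```python
# Hypothetical preprocessing rules (illustrative assumptions only).
DROP_PREFIXES = ("tiger:",)                      # location metadata, not built form
DROP_KEYS = {"lanes:forward", "lanes:backward"}  # treated here as redundant with "lanes"
VALUE_SYNONYMS = {("sidewalk", "none"): "no"}    # fold a deprecated value into its replacement


def clean_tags(way_tags):
    """Return a reduced {key: value} dict ready to be turned into items."""
    cleaned = {}
    for key, value in way_tags.items():
        # Skip keys that are dropped outright.
        if key.startswith(DROP_PREFIXES) or key in DROP_KEYS:
            continue
        # Replace known synonymous values, otherwise keep the value as-is.
        cleaned[key] = VALUE_SYNONYMS.get((key, value), value)
    return cleaned


raw = {
    "lanes": "5",
    "lanes:backward": "2",
    "lanes:forward": "3",
    "sidewalk": "none",
    "tiger:cfcc": "A41",
}
print(clean_tags(raw))
```

Applied before mining, a step like this would shrink the item vocabulary and remove rules that only restate arithmetic between lane counts.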

## Conclusion

We utilize a novel dataset as a source for association rule mining of built environmental factors in crashes in which at least one injury occurs. Our results indicate that, with refinement, this approach could be fruitful; even though our results are heavily hindered by a lack of data pre-processing, we discover results that broadly agree with prior work on traffic safety. Future approaches utilizing this method for transportation safety studies should feature intensive preprocessing and merging of similar tags, shrinking the total number of tokens available to the model.

## Bibliography
- Das, S., Kong, X., & Tsapakis, I. (2019). Hit and run crash analysis using association rules mining. Journal of Transportation Safety & Security, 13(2). https://doi.org/10.1080/19439962.2019.1611682
- Das, S., Sun, X., Goel, S., Sun, M., Rahman, A., & Dutta, A. (2022). Flooding related traffic crashes: Findings from association rules. Journal of Transportation Safety & Security, 14(1), 111–129. https://doi.org/10.1080/19439962.2020.1734130
- Kho, S. M., Pahlavani, P., & Bigdeli, B. (2021). Classification and association rule mining of road collisions for analyzing the fatal severity, a case study. Journal of Transport & Health, 23, 101278. https://doi.org/10.1016/j.jth.2021.101278
- OSM Contributors. (n.d.). Key:sidewalk—OpenStreetMap Wiki. Retrieved October 17, 2024, from https://wiki.openstreetmap.org/wiki/Key:sidewalk
- Shahin, M., Saeidi, S., Shah, S. A., Kaushik, M., Sharma, R., Peious, S. A., & Draheim, D. (2021). Cluster-Based Association Rule Mining for an Intersection Accident Dataset. 2021 International Conference on Computing, Electronic and Electrical Engineering (ICE Cube), 1–6. https://doi.org/10.1109/ICECube53880.2021.9628206
- Xing, G., Chen, S., Ma, Y., Zhang, C., Xie, Z., & Zhu, Y. (2024). Understanding distracted driving patterns of ride-hailing drivers from multi-source data: Applying association rule mining. Journal of Transportation Safety & Security, 16(4), 390–420. https://doi.org/10.1080/19439962.2023.2221204