# Identifying US Traffic Accident Hotspots Using H3 and Kepler

This notebook does the following:

1. Imports US traffic accident data from Kaggle (link in section 1)
2. Assigns the accident coordinates to H3 bins
3. Clusters H3 bins that are close together to create accident regions
4. Calculates metrics for each of those regions
5. Visualizes the data in Kepler.gl


In [1]:
# import libraries
import pandas as pd
from h3 import h3
import keplergl
from keplergl import KeplerGl
import geopandas as gpd
from shapely.geometry import shape
import numpy as np
import json
import os
import sys
from datetime import date

In [2]:
# import functions that generate h3 clusters and kepler visualization
sys.path.append('../helper_functions/')

from helper_functions import image_processing # used for the smoothing function to create HDWR
from helper_functions import kepler_config # stored Kepler config file for map visualization

In [3]:
# resolution of the H3 hexes; lower values give larger hexes
h3_resolution = 9

# diameter to use for dilation and erosion during image processing.
# higher values will make hot spots larger, and increase runtime
image_processing_diameter = 3 


# percentile threshold that decides which hexes to include in clustering
# lowering this threshold will increase runtime and the number of hotspots visualized
hex_accident_percentile = .999
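
To get a feel for what `h3_resolution = 9` means on the ground, the sketch below estimates the average hexagon area per resolution. It rests on two assumptions that are not stated in this notebook: that the H3 grid has 2 + 120 * 7^r cells at resolution r (the figure given in the H3 documentation), and that Earth's surface is roughly 510.1 million km². The function name is illustrative and is not part of the `h3` package.

```python
# Rough average H3 cell area by resolution (a sketch, not an h3 API call).
EARTH_SURFACE_KM2 = 510.1e6  # assumed approximate surface area of Earth


def approx_avg_cell_area_km2(resolution: int) -> float:
    """Average cell area, assuming 2 + 120 * 7**r cells at resolution r."""
    n_cells = 2 + 120 * 7 ** resolution
    return EARTH_SURFACE_KM2 / n_cells


for res in (7, 8, 9, 10):
    print(f"resolution {res}: ~{approx_avg_cell_area_km2(res):.3f} km^2 per hex")
```

Under these assumptions a resolution 9 hex covers roughly 0.1 km², and each step down in resolution makes the hexes about seven times larger.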

## 1. Import Data

Data was obtained from here: https://www.kaggle.com/sobhanmoosavi/us-accidents

In [4]:
# load data
df = pd.read_csv('data/US_Accidents_Dec20.csv')

In [5]:
df.head()

Unnamed: 0,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,MapQuest,201.0,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,MapQuest,201.0,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,MapQuest,201.0,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,MapQuest,201.0,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,MapQuest,201.0,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,...,False,False,False,False,True,False,Day,Day,Day,Day


In [6]:
df.dtypes

ID                        object
Source                    object
TMC                      float64
Severity                   int64
Start_Time                object
End_Time                  object
Start_Lat                float64
Start_Lng                float64
End_Lat                  float64
End_Lng                  float64
Distance(mi)             float64
Description               object
Number                   float64
Street                    object
Side                      object
City                      object
County                    object
State                     object
Zipcode                   object
Country                   object
Timezone                  object
Airport_Code              object
Weather_Timestamp         object
Temperature(F)           float64
Wind_Chill(F)            float64
Humidity(%)              float64
Pressure(in)             float64
Visibility(mi)           float64
Wind_Direction            object
Wind_Speed(mph)          float64
Precipitat

In [7]:
# working with a total of 4M+ accidents
df.shape

(4232541, 49)

## 2. Assign the Accident Coordinates to H3 Bins

Now add an H3 bin to each accident record so the data can be aggregated geospatially.

In [8]:
# function that takes lat and lon coordinates and creates an H3 column
def coordinates_to_h3(row):
    return h3.geo_to_h3(row['Start_Lat'], row['Start_Lng'], h3_resolution)

In [9]:
# apply function to our dataset
df['h3'] = df.apply(coordinates_to_h3,axis=1)

## 3. Create Clusters in H3
Use image processing to group together the hexes with the highest crash frequency.

In [10]:
# count the number of accidents per h3 hex
accidents_by_hex = df.groupby('h3').count()

In [11]:
accidents_by_hex.reset_index(inplace=True)

In [12]:
top_accident_hexs =  accidents_by_hex[accidents_by_hex.ID > accidents_by_hex.ID.quantile(hex_accident_percentile)] 

In [13]:
top_accident_hexs.shape

(547, 50)
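
The two steps above (counting accidents per hex, then keeping the hexes above a percentile) can be tried on a toy frame. This is a minimal sketch with made-up hex labels. It uses `groupby(...).size()` rather than `count()` so the result is a single count column, which is an alternative to the cells above rather than what they do.

```python
import pandas as pd

# Toy data: hypothetical hex labels, one row per accident.
toy = pd.DataFrame({"h3": ["a"] * 10 + ["b"] * 3 + ["c"] * 2 + ["d"]})

# One count per hex; size() counts rows regardless of missing values.
counts = toy.groupby("h3").size().rename("accidents").reset_index()

# Keep only hexes strictly above the chosen quantile of the counts.
toy_percentile = 0.75  # illustrative value, far lower than the notebook's .999
threshold = counts["accidents"].quantile(toy_percentile)
top = counts[counts["accidents"] > threshold]

print(threshold)
print(top)
```

Because the comparison is strict (`>`), a hex whose count equals the threshold is dropped.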

In [14]:
# create a smaller dataset of only accidents in the most frequent H3 bins
df_filtered = df.merge(top_accident_hexs,how='inner',on='h3',copy = False)

In [15]:
# ~250K of the 4M accidents are in the highest percentile hexs
df_filtered.shape

(252928, 99)
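
The 99 columns in `df_filtered` arise because the merge joins the 50 columns of `df` (49 original columns plus `h3`) with the 49 per-hex count columns, and pandas adds `_x` and `_y` suffixes to the overlapping names. If only the original accident rows are needed, a membership filter is one possible alternative. The sketch below uses toy data with hypothetical hex labels.

```python
import pandas as pd

# Toy stand-ins for df and top_accident_hexs (hex labels are made up).
accidents = pd.DataFrame({"ID": ["A-1", "A-2", "A-3", "A-4"],
                          "h3": ["a", "a", "b", "c"]})
top_hexs = pd.DataFrame({"h3": ["a"], "ID": [2]})  # per-hex counts

# Merge: overlapping column names get _x / _y suffixes.
merged = accidents.merge(top_hexs, how="inner", on="h3")

# Membership filter: same rows, original columns only.
filtered = accidents[accidents["h3"].isin(top_hexs["h3"])]

print(list(merged.columns))
print(list(filtered.columns))
```

Both approaches keep the same accident rows; the filter simply avoids carrying the duplicated count columns along.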

In [16]:
# create grouped hexs
smoothed_work_hexs = image_processing.apply_closing(list(set(df_filtered['h3'])), image_processing_diameter)
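
`image_processing.apply_closing` lives in the helper module and is not shown here. As a rough illustration of the idea (morphological closing, that is, dilation followed by erosion), the sketch below works on plain axial hex coordinates instead of H3 indexes. All names are hypothetical, the helper may be implemented differently, and how it interprets `image_processing_diameter` is not something this sketch tries to reproduce.

```python
# Morphological closing on a hexagonal grid using axial (q, r) coordinates.
# Illustrative only: the notebook's helper works on H3 indexes and may differ.

def hex_disk(center, radius):
    """All cells within `radius` grid steps of `center`, including itself."""
    cq, cr = center
    cells = set()
    for dq in range(-radius, radius + 1):
        for dr in range(max(-radius, -dq - radius), min(radius, -dq + radius) + 1):
            cells.add((cq + dq, cr + dr))
    return cells


def dilate(cells, radius):
    """Grow the set: add every cell within `radius` of a member."""
    grown = set()
    for cell in cells:
        grown |= hex_disk(cell, radius)
    return grown


def erode(cells, radius):
    """Shrink the set: keep cells whose whole disk lies inside the set."""
    return {cell for cell in cells if hex_disk(cell, radius) <= cells}


def closing(cells, radius):
    """Dilation followed by erosion; fills gaps narrower than the disk."""
    return erode(dilate(cells, radius), radius)


# Two hot hexes separated by a one-cell gap along the q axis.
hot = {(0, 0), (2, 0)}
closed = closing(hot, 1)
print(sorted(closed))
```

In this toy case the single gap cell between the two hot hexes is filled, while the extra ring added by the dilation is removed again by the erosion.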


## 4. Calculate Metrics on Clusters
Calculate simple metrics for each hotspot (total accidents, total hexes, and crashes per hex). These populate the tooltips and let us dynamically color the hotspots in Kepler.

In [17]:
# store the smoothed hexs as HDWR polygon features with associated crash metrics
region_polygons = h3.h3_set_to_multi_polygon(smoothed_work_hexs, geo_json=True)
hdwr_geojson = {"type": "FeatureCollection", "features": []}

for region_coordinates in region_polygons:
    this_polygon = {"type": "Polygon", "coordinates": region_coordinates}

    # hexs whose centers fall inside this region's polygon
    these_hexs = h3.polyfill(this_polygon, h3_resolution, geo_json_conformant=True)

    # accidents that fall inside this region
    region_accidents = df_filtered[df_filtered['h3'].isin(these_hexs)]
    total_crashes = region_accidents.shape[0]
    total_hexs = region_accidents['h3'].nunique()
    crashes_per_hex = total_crashes / total_hexs

    hdwr_geojson['features'].append({
        "type": "Feature",
        "geometry": this_polygon,
        "properties": {
            "total_crashes": round(total_crashes, 0),
            "total_hexs": round(total_hexs, 0),
            "crashes_per_hex": round(crashes_per_hex, 0),
        },
    })
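
Once the feature collection is built, its properties can be inspected before handing it to Kepler. The small sketch below ranks regions by crash density. It uses a hand-made feature collection with made-up numbers in place of `hdwr_geojson`, and `rank_hotspots` is a hypothetical helper, not part of this notebook.

```python
# Toy feature collection shaped like hdwr_geojson; geometry and values are made up.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"total_crashes": 900, "total_hexs": 3, "crashes_per_hex": 300.0}},
        {"type": "Feature", "geometry": None,
         "properties": {"total_crashes": 1200, "total_hexs": 2, "crashes_per_hex": 600.0}},
    ],
}


def rank_hotspots(feature_collection, key="crashes_per_hex"):
    """Return the features' properties sorted by `key`, highest first."""
    props = [f["properties"] for f in feature_collection["features"]]
    return sorted(props, key=lambda p: p[key], reverse=True)


for rank, props in enumerate(rank_hotspots(toy_geojson), start=1):
    print(rank, props)
```

Passing `key="total_crashes"` would rank by raw volume instead, which tends to favor larger regions.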

## 5. Create Data Visualization in Kepler

Now visualize the data in Kepler.gl to explore the key accident hotspots across the country.

In [18]:
kepler_map = keplergl.KeplerGl(height=600)
kepler_map.add_data(data = hdwr_geojson, name = 'US Traffic Accidents')
kepler_map.config = kepler_config.kepler_config


kepler_map

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [{'dataId': ['US Traffic Accidents'], 'id…

In [19]:
# keep the current map settings on the config module so they can be reused
# (this assignment is in memory only; it does not write a file)
kepler_config.kepler_config = kepler_map.config

In [20]:
#save map
kepler_map.save_to_html(file_name='US_Accident_Hotspots.html')

Map saved to US_Accident_Hotspots.html!


# Next Steps / Future Ideas

(In no particular order)

1. Analyze crash types by hotspot to understand key causes
2. Analyze the time component of crashes
3. Analyze severity by hotspot and factor it into the coloring and clustering
4. Tune the hyperparameters to settle on reasonable final shapes