# Aaron Ma

### Environmental and road conditions

Environmental and road conditions play a key part in the probability of an accident occurring. Knowing the weather conditions, road type, visibility, and urban or rural setting of accidents can inform road maintenance and infrastructure planning, allowing policymakers to make better-informed decisions that lower the rate of road accidents.

#### Key research questions
- Are certain weather conditions more likely to result in multi-vehicle accidents?
- How does visibility level impact pedestrian or cyclist involvement differently at different levels of traffic volume?
- How do seasonal changes impact the frequency of accidents?

## EDA

### Imports

In [2]:
import os

import altair as alt
import pandas as pd

from toolz.curried import pipe


def json_dir(data, data_dir='altairdata'):
    # Write chart data to JSON files in data_dir instead of embedding it in the notebook
    os.makedirs(data_dir, exist_ok=True)
    return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}'))

# Register and enable the new transformer
alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir')

# Handle large data sets (default shows only 5000)
# See here: https://altair-viz.github.io/user_guide/data_transformers.html
alt.data_transformers.disable_max_rows()

alt.renderers.enable('jupyterlab')

RendererRegistry.enable('jupyterlab')

### Loading in the data

In [3]:
accidents = pd.read_csv('../../data/raw/road_accident_dataset.csv')
accidents.head()

Unnamed: 0,Country,Year,Month,Day of Week,Time of Day,Urban/Rural,Road Type,Weather Conditions,Visibility Level,Number of Vehicles Involved,...,Number of Fatalities,Emergency Response Time,Traffic Volume,Road Condition,Accident Cause,Insurance Claims,Medical Cost,Economic Loss,Region,Population Density
0,USA,2002,October,Tuesday,Evening,Rural,Street,Windy,220.414651,1,...,2,58.62572,7412.75276,Wet,Weather,4,40499.856982,22072.878502,Europe,3866.273014
1,UK,2014,December,Saturday,Evening,Urban,Street,Windy,168.311358,3,...,1,58.04138,4458.62882,Snow-covered,Mechanical Failure,3,6486.600073,9534.399441,North America,2333.916224
2,USA,2012,July,Sunday,Afternoon,Urban,Highway,Snowy,341.286506,4,...,4,42.374452,9856.915064,Wet,Speeding,4,29164.412982,58009.145124,South America,4408.889129
3,UK,2017,May,Saturday,Evening,Urban,Main Road,Clear,489.384536,2,...,3,48.554014,4958.646267,Icy,Distracted Driving,3,25797.212566,20907.151302,Australia,2810.822423
4,Canada,2002,July,Tuesday,Afternoon,Rural,Highway,Rainy,348.34485,1,...,4,18.31825,3843.191463,Icy,Distracted Driving,8,15605.293921,13584.060759,South America,3883.645634


In [4]:
print(f'Dataset shape: \n{accidents.shape}')
print(f'Dataset columns: \n{accidents.columns}')
accidents.info()

Dataset shape: 
(132000, 30)
Dataset columns: 
Index(['Country', 'Year', 'Month', 'Day of Week', 'Time of Day', 'Urban/Rural',
       'Road Type', 'Weather Conditions', 'Visibility Level',
       'Number of Vehicles Involved', 'Speed Limit', 'Driver Age Group',
       'Driver Gender', 'Driver Alcohol Level', 'Driver Fatigue',
       'Vehicle Condition', 'Pedestrians Involved', 'Cyclists Involved',
       'Accident Severity', 'Number of Injuries', 'Number of Fatalities',
       'Emergency Response Time', 'Traffic Volume', 'Road Condition',
       'Accident Cause', 'Insurance Claims', 'Medical Cost', 'Economic Loss',
       'Region', 'Population Density'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132000 entries, 0 to 131999
Data columns (total 30 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Country                      132000 non-null  object 
 1   Year                  

In [5]:
accidents.describe()

Unnamed: 0,Year,Visibility Level,Number of Vehicles Involved,Speed Limit,Driver Alcohol Level,Driver Fatigue,Pedestrians Involved,Cyclists Involved,Number of Injuries,Number of Fatalities,Emergency Response Time,Traffic Volume,Insurance Claims,Medical Cost,Economic Loss,Population Density
count,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0,132000.0
mean,2011.973348,275.038776,2.501227,74.544068,0.125232,0.500576,1.000773,0.998356,9.508205,1.995439,32.491746,5041.929098,4.495621,25198.454901,50437.505615,2506.476223
std,7.198624,129.923625,1.117272,26.001448,0.072225,0.500002,0.816304,0.817764,5.774366,1.412974,15.889537,2860.671611,2.867347,14274.771691,28584.290822,1440.646352
min,2000.0,50.001928,1.0,30.0,2e-06,0.0,0.0,0.0,0.0,0.0,5.000177,100.062626,0.0,500.11009,1000.335085,10.002669
25%,2006.0,162.33886,2.0,52.0,0.06263,0.0,0.0,0.0,5.0,1.0,18.732879,2560.601299,2.0,12836.933596,25692.817343,1258.158299
50%,2012.0,274.67299,3.0,74.0,0.125468,1.0,1.0,1.0,9.0,2.0,32.534944,5037.909855,4.0,25188.202669,50395.499874,2506.203333
75%,2018.0,388.014111,3.0,97.0,0.187876,1.0,2.0,2.0,15.0,3.0,46.289527,7524.638162,7.0,37529.024899,75186.626093,3756.65295
max,2024.0,499.999646,4.0,119.0,0.249999,1.0,2.0,2.0,19.0,4.0,59.999588,9999.997468,9.0,49999.93013,99999.622968,4999.991745


In [6]:
accidents.describe(include=['object']) 

Unnamed: 0,Country,Month,Day of Week,Time of Day,Urban/Rural,Road Type,Weather Conditions,Driver Age Group,Driver Gender,Vehicle Condition,Accident Severity,Road Condition,Accident Cause,Region
count,132000,132000,132000,132000,132000,132000,132000,132000,132000,132000,132000,132000,132000,132000
unique,10,12,7,4,2,3,5,5,2,3,3,4,5,5
top,Canada,May,Tuesday,Night,Rural,Main Road,Windy,<18,Male,Good,Minor,Wet,Drunk Driving,Australia
freq,13349,11158,19061,33231,66502,44197,26626,26524,66098,44094,44063,33356,26506,26625


In [None]:
stacked_bar = alt.Chart(accidents).mark_bar().encode(
    x="count():Q",
    y="Weather Conditions:N",
    color="Number of Vehicles Involved",
    tooltip=["count():Q", "Number of Vehicles Involved"],
).facet("Road Type", columns=1).properties(
    title="Number of Vehicles Involved in Road Accidents per Weather and Road Type"
)

stacked_bar

Based on the chart above, there is little to no difference either between road types or between weather conditions within a road type. The count for each number of vehicles involved stays roughly the same, at around 2000 per weather condition.

In [None]:
# Create new Season column to show seasonal data
accidents.loc[accidents['Month'].isin(['March', 'April', 'May']), 'Season'] = 'Spring'
accidents.loc[accidents['Month'].isin(['June', 'July', 'August']), 'Season'] = 'Summer'
accidents.loc[accidents['Month'].isin(['September', 'October', 'November']), 'Season'] = 'Fall'
accidents.loc[accidents['Month'].isin(['December', 'January', 'February']), 'Season'] = 'Winter'

In [None]:
seasonal_accident_cause = alt.Chart(accidents).mark_bar().encode(
    x="count():Q", y="Accident Cause:N", color="Accident Cause:N", tooltip=["count():Q"]
).facet("Season", columns=1)
# Same chart with the x-axis restricted to 6000-7000 to expose small differences
seasonal_accident_cause_zoom = alt.Chart(accidents).mark_bar().encode(
    x=alt.X("count():Q", scale=alt.Scale(domain=(6000, 7000))),
    y="Accident Cause:N", color="Accident Cause:N", tooltip=["count():Q"],
).facet("Season", columns=1)

alt.hconcat(seasonal_accident_cause, seasonal_accident_cause_zoom).properties(title='Number of Road Accidents by Accident Cause and Season')

In the faceted charts above, accident counts differ only marginally across accident causes and seasons. Within each season, the zoomed chart (x-axis restricted to a domain of 6000 to 7000) shows that the top cause of accidents varies by season: for example, Speeding in Fall versus Drunk Driving in Spring.

In [None]:
pedestrian_heatmap = alt.Chart(accidents).mark_rect().encode(
        x=alt.X("Visibility Level:Q", title="Visibility Level"),
        y=alt.Y("Traffic Volume:Q", title="Traffic Volume"),
        color='sum(Pedestrians Involved):Q').properties(title='Pedestrian Road Accidents Involvement by Visibility Level and Traffic Volume')

cyclist_heatmap = alt.Chart(accidents).mark_rect().encode(
        x=alt.X("Visibility Level:Q", title="Visibility Level"),
        y=alt.Y("Traffic Volume:Q", title="Traffic Volume"),
        color='sum(Cyclists Involved):Q').properties(title='Cyclist Road Accidents Involvement by Visibility Level and Traffic Volume')

pedestrian_heatmap | cyclist_heatmap

Based on the heatmaps above, the pedestrian heatmap shows a clear center of highest pedestrian involvement in accidents, from 250 to 300 visibility and 600 to 9000 traffic volume. There are also outliers: from a visibility level of 450 onwards, regardless of traffic volume, the number of pedestrians involved in an accident is 2.
In the cyclist heatmap, by contrast, at low to medium visibility levels (0-250), regardless of traffic volume, cyclist involvement is at its maximum of 2. There is also an outlier at higher visibility levels, from 400 to 450, where the number of cyclists involved in accidents is likewise at its maximum.

Overall, the pedestrian heatmap shows a much clearer pattern among the three variables, while the cyclist heatmap is far less differentiated, as seen in its large areas of dark blue.

## Task Analysis

### **1. Are certain weather conditions more likely to result in multi-vehicle accidents?**
- **Retrieve Value**: Extract `Weather Conditions`, `Number of Vehicles Involved`, and `Road Type`
- **Group**: Group by `Weather Conditions` and `Road Type`
- **Aggregate**: Calculate the average of `Number of Vehicles Involved` per group
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups
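
The group and aggregate steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual analysis: the `toy` frame holds made-up values, and the helper name `vehicles_by_weather_and_road` and the `multi_vehicle` flag (more than one vehicle involved) are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for the accidents table; the values are made up for illustration.
toy = pd.DataFrame({
    "Weather Conditions": ["Clear", "Clear", "Rainy", "Rainy", "Rainy", "Clear"],
    "Road Type": ["Highway", "Highway", "Highway", "Street", "Street", "Street"],
    "Number of Vehicles Involved": [1, 3, 2, 1, 4, 1],
})


def vehicles_by_weather_and_road(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-group mean vehicles and multi-vehicle share."""
    # Assumption: an accident counts as "multi-vehicle" when more than one vehicle is involved.
    flagged = df.assign(multi_vehicle=df["Number of Vehicles Involved"] > 1)
    return (
        flagged.groupby(["Weather Conditions", "Road Type"])
        .agg(
            mean_vehicles=("Number of Vehicles Involved", "mean"),
            multi_vehicle_share=("multi_vehicle", "mean"),
            accidents=("multi_vehicle", "size"),
        )
        .reset_index()
    )


summary = vehicles_by_weather_and_road(toy)
print(summary)
```

Reporting the multi-vehicle share next to the mean may help here, since the question is about multi-vehicle accidents specifically and a mean alone can hide how the counts are distributed.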

---

### **2. How does visibility level impact pedestrian or cyclist involvement differently at different levels of traffic volume?**
- **Retrieve Value**: Extract `Visibility Level`, `Pedestrians Involved`, `Cyclists Involved`, and `Traffic Volume`
- **Group**: Separate by pedestrian or cyclist involvement with `Visibility Level` and `Traffic Volume`
- **Aggregate**: Calculate the average number of pedestrians/cyclists involved at each level of traffic volume and visibility level
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups and juxtapose pedestrian and cyclist representations
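
Because `Visibility Level` and `Traffic Volume` are continuous, averaging "at each level" presumably requires binning them first. The sketch below shows one way to do that with `pd.cut`. The bin edges, the `low`/`high` labels, the helper name `binned_involvement`, and the `toy` values are all assumptions chosen for illustration, not choices made in the notebook.

```python
import pandas as pd

# Toy stand-in for the accidents table; the values are made up for illustration.
toy = pd.DataFrame({
    "Visibility Level": [60.0, 120.0, 260.0, 270.0, 480.0, 490.0],
    "Traffic Volume": [500.0, 7000.0, 6500.0, 8000.0, 1200.0, 9000.0],
    "Pedestrians Involved": [0, 1, 2, 2, 1, 0],
    "Cyclists Involved": [2, 2, 0, 1, 1, 2],
})


def binned_involvement(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean pedestrian/cyclist involvement per bin pair."""
    binned = df.assign(
        # Assumed edges: visibility split at 250, traffic volume split at 5000.
        visibility_bin=pd.cut(df["Visibility Level"], bins=[0, 250, 500], labels=["low", "high"]),
        traffic_bin=pd.cut(df["Traffic Volume"], bins=[0, 5000, 10000], labels=["low", "high"]),
    )
    return (
        binned.groupby(["visibility_bin", "traffic_bin"], observed=True)[
            ["Pedestrians Involved", "Cyclists Involved"]
        ]
        .mean()
        .reset_index()
    )


involvement = binned_involvement(toy)
print(involvement)
```

Keeping the pedestrian and cyclist means as two columns of the same table makes the side-by-side comparison direct, and the binned result can be passed to a heatmap with one cell per bin pair.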

---

### **3. How do seasonal changes impact the frequency of accidents?**
- **Retrieve Value**: Extract `Month` and `Accident Cause`
- **Create**: Create new data from `Month`, separating it into `Season` by three-month groups
- **Group**: Group by `Season`
- **Aggregate**: Calculate the average number of accidents that occurred per group (season)
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups and facet seasonal representation
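
As a rough sketch of the create, group, and aggregate steps above, the month-to-season assignment can be written as a single dictionary lookup, followed by a count per season and cause. The dictionary mirrors the three-month grouping used earlier in the notebook; the names `MONTH_TO_SEASON` and `seasonal_counts` and the `toy` values are assumptions for illustration only.

```python
import pandas as pd

# Same three-month grouping as the Season column created earlier in the notebook.
MONTH_TO_SEASON = {
    "March": "Spring", "April": "Spring", "May": "Spring",
    "June": "Summer", "July": "Summer", "August": "Summer",
    "September": "Fall", "October": "Fall", "November": "Fall",
    "December": "Winter", "January": "Winter", "February": "Winter",
}

# Toy stand-in for the accidents table; the values are made up for illustration.
toy = pd.DataFrame({
    "Month": ["March", "July", "July", "October", "December", "January"],
    "Accident Cause": ["Speeding", "Weather", "Speeding", "Speeding",
                       "Drunk Driving", "Drunk Driving"],
})


def seasonal_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: number of accidents per (Season, Accident Cause)."""
    with_season = df.assign(Season=df["Month"].map(MONTH_TO_SEASON))
    return (
        with_season.groupby(["Season", "Accident Cause"])
        .size()
        .reset_index(name="accidents")
    )


counts = seasonal_counts(toy)
# Total accidents per season, summed over causes.
per_season = counts.groupby("Season")["accidents"].sum()
print(counts)
print(per_season)
```

A month name missing from the dictionary would map to `NaN` and drop out of the grouped counts, so checking for nulls in `Season` before aggregating is a reasonable safeguard.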

---