In [1]:
import pandas as pd

# Motivation

We have chosen a dataset of countrywide traffic accidents in the US. The dataset is rich enough to explore with many of the tools taught in the course, and it also opens up the possibility of more advanced analysis. Our goal is to show the user, through interactive tools, how dangerous it can be to drive in bad or extreme weather. We believe that interactivity keeps the user more entertained and more invested in the findings of our analysis.

# Basic stats

Before starting the analysis, we removed the less important features of the dataset by dropping 26 attributes. Furthermore, some cleanup was needed to remove n/a values. After cleanup, our dataset is 350MB large and contains 20 attributes. As the data comes from an American source, we convert all units from imperial to metric.

To determine how weather conditions affect car accidents, we focus on the severity of the accidents, an attribute ranging from 1 to 4 (4 being the most catastrophic). Our main question is therefore how weather increases or decreases the severity of accidents.



In [2]:
#Preparing data
data = pd.read_csv("data/US_Accidents_Dec20.csv")
CleanedData = data.drop(['Number','Distance(mi)','Airport_Code','Street','Side','Country','Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight', 'End_Lat','End_Lng', 'Wind_Direction'],axis='columns', inplace=False)
CleanedData.info()
CleanedData = CleanedData.dropna()
# Convert imperial units to metric (vectorised column arithmetic)
CleanedData['Temperature'] = (CleanedData['Temperature(F)'] - 32) * (5/9)          # °F -> °C
CleanedData['Wind_Chill'] = (CleanedData['Wind_Chill(F)'] - 32) * (5/9)            # °F -> °C
CleanedData['Visibility'] = CleanedData['Visibility(mi)'] * 1.609344               # miles -> km
CleanedData['Wind_Speed'] = CleanedData['Wind_Speed(mph)'] * 1.609344 * 1000/3600  # mph -> m/s
CleanedData['Precipitation'] = CleanedData['Precipitation(in)'] * 25.4             # inches -> mm
CleanedData['Pressure'] = CleanedData['Pressure(in)'] * 33.86                      # inHg -> hPa

df = CleanedData

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4232541 entries, 0 to 4232540
Data columns (total 23 columns):
 #   Column             Dtype  
---  ------             -----  
 0   ID                 object 
 1   Source             object 
 2   TMC                float64
 3   Severity           int64  
 4   Start_Time         object 
 5   End_Time           object 
 6   Start_Lat          float64
 7   Start_Lng          float64
 8   Description        object 
 9   City               object 
 10  County             object 
 11  State              object 
 12  Zipcode            object 
 13  Timezone           object 
 14  Weather_Timestamp  object 
 15  Temperature(F)     float64
 16  Wind_Chill(F)      float64
 17  Humidity(%)        float64
 18  Pressure(in)       float64
 19  Visibility(mi)     float64
 20  Wind_Speed(mph)    float64
 21  Precipitation(in)  float64
 22  Weather_Condition  object 
dtypes: float64(10), int64(1), object(12)
memory usage: 742.7+ MB


## Explorative data analysis and creating KPIs

To understand the dataset better, we create some key performance indicators (KPIs) covering the most important numeric attributes of the dataset.

In [3]:
key_attributes = ["Severity","Temperature","Wind_Chill","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[i].mean()

print(KPI)

    KPI  Severity  Temperature  Wind_Chill  Humidity(%)    Pressure  \
0  Mean  2.315307    16.784703   16.022281    67.756379  994.245741   

   Visibility  Wind_Speed  Precipitation  
0   14.248369     3.27704       0.200254  


Now that we have created some basic key performance indicators from our dataset, we would like to gain insight into how the different weather conditions affect the severity of the car accidents. One observation from the mean values is that Temperature and Wind_Chill are quite similar, which suggests that they may be closely correlated. To investigate this, we calculate the correlation between wind chill and temperature.



In [4]:
data_subset = data.drop(['Number','Distance(mi)','Airport_Code','Street','Side','Country','Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight', 'End_Lat','End_Lng', 'Wind_Direction'],axis='columns', inplace=False)
data_subset = data_subset.dropna()
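# Keep only numeric columns so that corr() also works on recent pandas versions
corr_matrix = data_subset.select_dtypes(include='number').corr().abs()
# Look the value up by column name instead of by position
print("Correlation between Wind_Chill and temperature: ", corr_matrix.loc['Wind_Chill(F)', 'Temperature(F)'])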

Correlation between Wind_Chill and temperature:  0.994171131228068


The correlation between wind chill and temperature calculated above makes it clear that the two are highly correlated and most likely describe the same parts of the data. Going forward we therefore leave out wind chill and keep temperature: most cars have a built-in thermometer, so temperature is the information viewers can easily acquire themselves, which makes it the most useful basis for recommendations.

Another observation we made by looking at the data is that many of the precipitation values are zero or missing. We would therefore like to investigate this attribute further:

In [5]:
count = df["Precipitation(in)"]==0
print("Amount of cases where there is no precipation at all in the dataset is:", round(100*(count.sum()/len(count)),2),"%")
print("------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
KPI["Severity"] = df[df["Precipitation(in)"]>0]["Severity"].mean()
print(KPI)

Amount of cases where there is no precipation at all in the dataset is: 89.79 %
------------------------------------------------
    KPI  Severity
0  Mean  2.403472


In 89.79% of all car accidents in the dataset, there is no precipitation at all. In the remaining 10.21% of cases, the mean severity is 2.40, which is only about 0.09 higher than the mean of the whole dataset (2.32). As this is only a slight increase, and it only affects 10.21% of the accidents in the dataset, we decide not to include precipitation in our deeper data analysis.

We would like to further investigate which weather attributes best explain the severity of the car accidents. To do so, we investigate how correlated these attributes are with the severity of the accidents:
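
The size of the gap can be checked directly from the numbers printed in the outputs above. The snippet below is only an arithmetic check; the variable names are ours and the values are copied from those outputs.

```python
# Values copied from the printed outputs earlier in the notebook.
mean_severity_all = 2.315307   # mean severity over the whole cleaned dataset
mean_severity_rain = 2.403472  # mean severity where Precipitation(in) > 0
share_no_rain = 89.79          # percent of accidents with zero precipitation

difference = mean_severity_rain - mean_severity_all
share_rain = 100 - share_no_rain

print(f"Difference in mean severity: {difference:.3f}")
print(f"Share of accidents with precipitation: {share_rain:.2f} %")
```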

In [10]:
# Recompute the correlation matrix without wind chill, precipitation and the non-weather numeric columns:
data_subset2 = data_subset.drop(['Wind_Chill(F)', 'Precipitation(in)','TMC','Start_Lat','Start_Lng'],axis='columns', inplace=False)
corr_matrix = data_subset2.select_dtypes(include='number').corr().abs()
print(corr_matrix['Severity'])

Severity           1.000000
Temperature(F)     0.014100
Humidity(%)        0.014327
Pressure(in)       0.005219
Visibility(mi)     0.029172
Wind_Speed(mph)    0.056656
Name: Severity, dtype: float64


The correlation matrix shows that wind speed is the attribute most strongly correlated with severity, although all of the correlations are weak (below 0.06 in absolute value). The remaining attributes look more alike, except for pressure, which has very little correlation with severity. To assess how well the remaining weather measurements can explain the severity of an accident, we do some further explorative analysis.

We can do some more basic exploratory data analysis by calculating the means of the same attributes for specific values of severity. Specifically, we would like to analyse the extremes, which are severity 1 and severity 4. We use this method to complement the results from the correlation matrix.
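
Because the matrix is passed through `.abs()`, it reports the strength of each relationship but not its direction. The sketch below uses a small made-up frame (the values are illustrative and not taken from the accident data) to show how the signed and unsigned versions differ.

```python
import pandas as pd

# Toy data: illustrative values only, not taken from the accident dataset.
toy = pd.DataFrame({
    "Severity":        [1, 2, 2, 3, 4],
    "Visibility(mi)":  [10.0, 9.0, 8.0, 5.0, 2.0],
    "Wind_Speed(mph)": [3.0, 5.0, 6.0, 9.0, 12.0],
})

signed = toy.corr()["Severity"]  # keeps the sign of each correlation
unsigned = signed.abs()          # what .abs() reports: strength only

print(signed)
print(unsigned)
```

In this toy frame visibility falls and wind speed rises with severity, so the signed values have opposite signs even though both look similar once the sign is dropped.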

In [12]:
key_attributes = ["Severity","Temperature","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[df["Severity"]==1][i].mean()
print(KPI)
print("-----------------------------------------------------------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
for i in key_attributes:
    KPI[i] = df[df["Severity"]==4][i].mean()
print(KPI)

    KPI  Severity  Temperature  Humidity(%)   Pressure  Visibility  \
0  Mean       1.0     15.67138    69.248485  995.11154   13.957158   

   Wind_Speed  Precipitation  
0     3.04949       0.164715  
-----------------------------------------------------------------------------------------------------
    KPI  Severity  Temperature  Humidity(%)    Pressure  Visibility  \
0  Mean       4.0    17.385588    69.437079  986.632452   14.376095   

   Wind_Speed  Precipitation  
0    3.249123       0.275286  


The idea is now to find which attributes differ the most between severity 1 and severity 4. These are the attributes we wish to use when going deeper into our data analysis.

* Temperature is higher on severity 4 (17.4 °C vs. 15.7 °C), so we keep it for further analysis.
* Humidity is marginally higher on severity 4 (69.4% vs. 69.2%); we keep it for further analysis.
* Pressure is about 8.5 hPa lower on severity 4, which is less than 1% of its mean value. Together with its very low correlation with severity, this supports leaving it out of the analysis.
* Visibility is slightly higher on severity 4 (14.4 km vs. 14.0 km); we keep it for further analysis.
* Wind speed is higher on severity 4 (3.25 m/s vs. 3.05 m/s), so we keep it for further analysis. This indicates that higher wind speeds go together with more severe accidents.
* Precipitation is higher on severity 4 (0.28 mm vs. 0.16 mm). But as we discovered earlier, roughly 90% of the records have no precipitation at all, so it is not a good attribute for our purposes anyway.
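
The per-severity comparison can also be written compactly with `groupby`, which makes the size of each difference easier to read off. This is a minimal sketch on a made-up frame that reuses two of the cleaned column names; the values are illustrative only.

```python
import pandas as pd

# Toy frame with the same column names as the cleaned data; values are made up.
toy = pd.DataFrame({
    "Severity":    [1, 1, 2, 4, 4],
    "Temperature": [14.0, 16.0, 18.0, 17.0, 19.0],
    "Wind_Speed":  [3.0, 3.2, 3.1, 3.4, 3.0],
})

# Mean of every attribute per severity level, then keep the two extremes.
means = toy.groupby("Severity").mean().loc[[1, 4]]

# Relative difference (in percent) of severity 4 compared with severity 1.
rel_diff = 100 * (means.loc[4] - means.loc[1]) / means.loc[1]

print(means)
print(rel_diff.round(1))
```

Looking at relative differences alongside the raw means helps when the attributes are on very different scales, as pressure and precipitation are.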

Combining this comparison of means with the correlations between the weather attributes and severity computed earlier, the attributes we decided were best for our data analysis are:
* Temperature
* Humidity
* Visibility
* Wind Speed

# Data Analysis

The section on explorative data analysis above already explains much of what we have learned about the data, and should therefore be thought of as part of the analysis. In this section we explain what we learned from the visualizations that we created on the basis of that exploration.

From the histogram plots we learned that two of our four variables, temperature and humidity, affect severity adversely (making accidents more severe) when they are extreme: severe accidents are more frequent when the temperature is either very high or very low, and the same goes for humidity. The two other variables behave differently. High wind speed makes accidents more severe, which makes a lot of sense, whereas for visibility it is the low values (poor visibility) that make accidents more severe. The fact that both very low and very high temperatures and humidities seem to make accidents more severe may also explain why the correlations between these two attributes and severity, computed in the previous section, were not as high as first anticipated.

After analysing how different weather conditions affect traffic accidents across the US, we thought it would be relevant to create a model that can predict the severity of an accident given different weather conditions. To do so we performed some data engineering, encoding the "weather condition" as a numeric value symbolizing one of the 111 conditions. Next we used a supervised learning model (a random forest regressor) to predict the severity of future accidents. This was possible as we had labeled data, where the severity values were our labels.

Going through what was done step by step: first we load the necessary packages and the data that was cleaned previously:
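
The point about extremes and low correlation can be illustrated with synthetic numbers. In the sketch below (an assumption-laden toy, not the accident data) severity is lowest at a moderate temperature and rises towards both extremes, yet the Pearson correlation is essentially zero because it only measures linear association.

```python
import numpy as np

# Synthetic illustration (not the accident data): severity is lowest at a
# moderate temperature and rises towards both extremes.
temperature = np.linspace(-20.0, 40.0, 61)          # degrees Celsius
severity = 2.0 + 0.002 * (temperature - 10.0) ** 2  # symmetric around 10 °C

# Pearson correlation only measures linear association.
r = np.corrcoef(temperature, severity)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```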



In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
import pickle
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd 
data = pd.read_csv("data/DataForModel.csv")

Next we downsample the number of rows with severity = 2, in order to get a more evenly distributed dataset.

In [14]:
new_severity2_n = int(0.3*len(data['Severity']))
data_subset = data.drop(['Month', 'Source','TMC','Start_Lat','Start_Lng','State','Unnamed: 0','Weather_Condition'] ,axis = 1, inplace = False)
Severity2_down = data_subset[data_subset['Severity']==2][0:new_severity2_n]

Data_set_finished = pd.concat([data_subset[data_subset['Severity']==1],Severity2_down,data_subset[data_subset['Severity']==3],data_subset[data_subset['Severity']==4]])
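
The cell above keeps the first rows with severity 2. If the file happens to be ordered, for example by time or by state, the kept rows may not be representative. One alternative, shown here only as a sketch on a made-up frame and not as what was done in this notebook, is to draw the rows at random with `DataFrame.sample`; the name `target_n` and all values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for data_subset; values are made up.
toy = pd.DataFrame({
    "Severity":    [1, 2, 2, 2, 2, 2, 2, 3, 3, 4],
    "Temperature": [5.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 8.0, 22.0, 30.0],
})

target_n = 3  # hypothetical number of severity-2 rows to keep

# Draw the severity-2 rows at random instead of taking the first rows,
# so the kept rows do not depend on the order of the file.
severity2_down = toy[toy["Severity"] == 2].sample(n=target_n, random_state=42)

balanced = pd.concat([toy[toy["Severity"] != 2], severity2_down])
print(balanced["Severity"].value_counts().sort_index())
```

Fixing `random_state` keeps the draw reproducible between runs.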

Next we define a training set and a test set as well as the model for predicting.

In [15]:
X = Data_set_finished.drop(labels = ['Severity'], axis = 1, inplace = False)
y = Data_set_finished['Severity']
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=1/3, random_state=100)

regr = RandomForestRegressor(max_depth=3, random_state=42)


Next we train the model and create predictions with the trained model. We also print the predictions and the mean squared error associated with these predictions.

In [17]:
regr.fit(X_train, y_train)

yhat = regr.predict(X_test)

print("The predicted severity values: ", yhat, "and the actual severity values :", y_test )

MSE = mean_squared_error(y_test, yhat)
print("the mean squared error for these predictions are: ",MSE)

The predicted severity values:  [2.4465947  3.00325502 2.59055548 ... 2.59055548 3.00309466 2.32716815] and the actual severity values : 226103    2
536325    3
566228    3
353       3
115786    2
         ..
385085    3
168203    3
271976    3
550900    3
18512     2
Name: Severity, Length: 129740, dtype: int64
the mean squared error for these predictions are:  0.2562599780270776


The model reaches an MSE of about 0.26, which corresponds to a root mean squared error of roughly 0.51 on the 1-4 severity scale, so a typical prediction is off by about half a severity level. We have chosen regression instead of classification, as this is more useful for the website we have created. Classification might make more sense when looking at these results alone. On our website, however, the sliders adjust the weather measurements and the KPI we show is the mean severity within an interval, so we no longer work with the four categories (1, 2, 3, 4) but with decimal numbers. To use our predictive model with the sliders, it therefore needed to be a regression model, which can predict values in between these categories.
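
Taking the square root of the MSE puts the error back on the severity scale. The sketch below does this for the printed MSE and also shows, on made-up labels, how to compute the MSE of always predicting the mean, which is a common reference point when judging a regression error. The toy labels are an assumption for illustration and are not the notebook's test set.

```python
import math

mse = 0.2562599780270776  # value printed by the cell above
rmse = math.sqrt(mse)     # same units as severity (levels on the 1-4 scale)
print(f"RMSE: {rmse:.3f} severity levels")

# Reference point on hypothetical labels: always predict the mean severity.
toy_y = [2, 3, 3, 2, 4, 1]
baseline_prediction = sum(toy_y) / len(toy_y)
baseline_mse = sum((y - baseline_prediction) ** 2 for y in toy_y) / len(toy_y)
print(f"Baseline MSE on the toy labels: {baseline_mse:.3f}")
```

Running the same baseline on the real `y_test` would show how much of the error the weather features actually remove.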

# Genre

To tell our story we used several tools. We used zooming from the "Visual Narrative - highlighting" category. Our data initially has a huge scope, and zooming is quite useful for telling stories in a more local and precise manner.

In order not to disorientate the viewer, another visual narrative tool, from the visual structure toolbox, has been used. We created sliders that the viewer can drag to see how they affect the data. This changes the scenery of the data, but as the viewers make these changes themselves, the idea is that it should not feel too disorientating. More specifically, these sliders are of the type timeline sliders/progress bars. From the last type in the visual narrative category, "transitional guidance", none of the tools have been used.

The narrative structure used to lead the reader is twofold. From the "ordering" type, both "linear" and "random access" were used. Random access lets the reader investigate the data and the story it tells on their own, using the tools that we have prepared. To also help the reader reach the conclusion we made and see our entire analysis, we have prepared a "linear" storytelling that leads the reader through our entire thought process.


To let the reader explore the data and learn the story it tells, we have created tools of the interactivity type: "navigation buttons" and "filtering/selection/search".

To get the story across to the reader, narrative structure messaging tools such as headlines/captions and annotations have been used.

# Visualizations

We will create a user-based interface, which enables the user to investigate the data on their own. To also help the user understand the data, we display it in a "cartoon" style that tells the story of our analysis, in order to inspire the user on how the data can be investigated.

We have created some sliders with which the user can apply filters to the data dynamically. The visualizations we have created are primarily three different plots, which are used in a few different ways due to the filters. First, there is a geo-data plot, where all the accidents are plotted on the map of the United States as a heat map. Next, we have two bar plots, which also change dynamically as the filters are changed. The first shows the distribution of the four severities across all accidents within the current filter settings. The other shows much the same data as the heat map, but instead of placing it on the map, it shows the number of accidents per 10,000 citizens for each state in the United States. It thus paints a picture of the chance of being involved in a car accident, given the state one is driving in.

We chose the geo-data plot because we think it gives a good view of where accidents are concentrated. A map is very familiar, so it may be easier to place the accidents on it than in a bar chart that simply lists all the states. To make the areas more comparable, however, we thought a bar chart would be easier to read, as the heights of the bars are very easy to compare. The bar chart is a really good plot when the purpose is to let the reader compare the different states. Finally, we used a bar plot again for the same reason: it is easy to compare the distribution between the four categories of severity, as the heights of the bars are easy for the eye to distinguish.
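
The per-capita normalisation behind the state bar plot can be sketched as follows. The accident records and the population figures below are hypothetical and only illustrate the calculation; a real version would need actual state populations from an external source.

```python
import pandas as pd

# Toy accident records; only the State column matters here.
accidents = pd.DataFrame({"State": ["CA", "CA", "CA", "TX", "TX", "WY"]})

# Hypothetical population figures, for illustration only.
population = pd.Series({"CA": 30000, "TX": 40000, "WY": 5000})

# Count accidents per state and scale to accidents per 10,000 citizens.
counts = accidents["State"].value_counts()
per_10k = (counts / population * 10_000).sort_values(ascending=False)
print(per_10k)
```

In this toy example the state with the fewest accidents ends up with the highest rate, which is exactly the effect the normalisation is meant to reveal.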

# Discussion

One thing we noticed throughout the process of analysing the dataset is that, while we can certainly see that weather affects the severity of accidents, it is by no means a huge factor. When we chose this dataset and determined what we wanted to focus on, we had no real chance of predicting that this would be the outcome. Our creation is therefore not that effective at showing American drivers the dangers of weather when driving, which was our initial idea with the project. What we have found instead is that our initial hypothesis was correct, but that weather is not as important a factor as we thought. While this makes for a more boring website experience, it is still a very interesting discovery. We believe that the reason behind this is how much cars have improved over the years, and how many tools they have to tackle weather effects. Some of these tools are airbags, ABS brakes, tire technology and improvements in the software of cars.

We believe that our visual representation and the speed at which our website updates turned out really well. It shows viewers what we want them to see, without clutter or unnecessary information. We spent quite some time improving this aspect, as we truly believe that it is one of the most important features when showing data to people who are not used to it. If the site seems boring or too advanced on the surface, it might scare potential viewers away, and it was important to us that we did not do that. Keeping the deeper data analysis behind the scenes, while still giving the user tools to reach the same conclusions, was a focus of ours.

The model we trained could have been improved, as in by far the most cases it predicts a severity of two for an accident with the given weather measurements. This is most likely because the data on which the model was trained has a very uneven distribution of the different severity classes (1-4), with a huge overweight of accidents where the severity is noted as 2. To make up for this overweight we downsampled a lot of the accidents with severity 2, which made the training set a lot smaller. Instead, one could have upsampled the other categories and actually created a larger dataset. This is more tricky, however, as one needs to generate variations of the existing samples for the upsampling in order to avoid overfitting.
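
For reference, the simplest form of upsampling draws rows from the smaller classes with replacement until every class is as large as the biggest one. The sketch below shows this on a made-up frame; it is not what was done in this notebook. Note that it only duplicates existing rows, which is precisely where the overfitting risk mentioned above comes from, and that it should be applied to the training split only.

```python
import pandas as pd

# Toy training data; values are made up for illustration.
train = pd.DataFrame({
    "Severity":   [1, 2, 2, 2, 2, 2, 3, 3, 4],
    "Wind_Speed": [2.0, 3.0, 3.1, 2.9, 3.3, 3.2, 4.0, 4.2, 5.0],
})

target_n = train["Severity"].value_counts().max()  # size of the largest class

parts = []
for level, group in train.groupby("Severity"):
    if len(group) < target_n:
        # Draw rows with replacement until the class reaches target_n rows.
        group = group.sample(n=target_n, replace=True, random_state=0)
    parts.append(group)

upsampled = pd.concat(parts, ignore_index=True)
print(upsampled["Severity"].value_counts().sort_index())
```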

# Contributions

Frederik was the lead on the code for the regression model and the bar plot with severity. Anton was the lead for building the website and the geo data plot. Rasmus was the lead on the code for the data processing and the bar plot with severity.

For the explainer notebook, Rasmus was the lead on the Motivation and Basic stats sections.

Anton was the lead on the Visualizations section.

Frederik was lead in the genre section.

Frederik and Anton worked together as leads in the data analysis section.

Anton and Rasmus worked together as leads in the Explorative data analysis and discussion section.
