In [2]:
import pandas as pd

# Motivation

We have chosen a dataset of countrywide traffic accidents in the US. It contains a large amount of data to explore with the many tools we have been taught in the course, and it also opens up the possibility of more advanced analysis. Our goal is to show the user, through interactive tools, how dangerous it can be to drive in bad or extreme weather conditions. We believe this makes the user more entertained by, and more invested in, the findings of our analysis.

# Basic stats

Before starting the analysis, we removed the features of the dataset that are not relevant to our question, 26 attributes in total. Some further cleanup was needed to remove rows with n/a values. After cleanup, our dataset is 350 MB large and contains 20 attributes. As the data comes from an American source, we convert all units from imperial to metric.

To determine how weather conditions affect car accidents, we focus on the severity of each accident, which is an attribute ranging from 1 to 4 (4 being the most catastrophic). Our main question is therefore how weather increases or decreases the severity of accidents.


In [94]:
#Preparing data
data = pd.read_csv("data/US_Accidents_Dec20.csv")
CleanedData = data.drop(['Number','Distance(mi)','Airport_Code','Street','Side','Country','Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight', 'End_Lat','End_Lng', 'Wind_Direction'],axis='columns', inplace=False)
CleanedData.info()
# Remove every row that has at least one missing value
CleanedData = CleanedData.dropna()
# Convert imperial units to metric (vectorized column arithmetic)
CleanedData['Temperature'] = (CleanedData['Temperature(F)'] - 32) * (5/9)          # Fahrenheit -> Celsius
CleanedData['Wind_Chill'] = (CleanedData['Wind_Chill(F)'] - 32) * (5/9)            # Fahrenheit -> Celsius
CleanedData['Visibility'] = CleanedData['Visibility(mi)'] * 1.609344               # miles -> km
CleanedData['Wind_Speed'] = CleanedData['Wind_Speed(mph)'] * 1.609344 * 1000/3600  # mph -> m/s
CleanedData['Precipitation'] = CleanedData['Precipitation(in)'] * 25.4             # inches -> mm
CleanedData['Pressure'] = CleanedData['Pressure(in)'] * 33.86                      # inHg -> hPa

df = CleanedData

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906610 entries, 0 to 2906609
Data columns (total 21 columns):
 #   Column             Dtype  
---  ------             -----  
 0   ID                 object 
 1   Severity           int64  
 2   Start_Time         object 
 3   End_Time           object 
 4   Start_Lat          float64
 5   Start_Lng          float64
 6   Description        object 
 7   City               object 
 8   County             object 
 9   State              object 
 10  Zipcode            object 
 11  Timezone           object 
 12  Weather_Timestamp  object 
 13  Temperature(F)     float64
 14  Wind_Chill(F)      float64
 15  Humidity(%)        float64
 16  Pressure(in)       float64
 17  Visibility(mi)     float64
 18  Wind_Speed(mph)    float64
 19  Precipitation(in)  float64
 20  Weather_Condition  object 
dtypes: float64(9), int64(1), object(11)
memory usage: 465.7+ MB
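
The unit conversions in the cell above can be sanity-checked against a few known reference points. The following is a minimal sketch, not part of the original analysis: the helper functions and the two-row `sample` frame are hypothetical and only mirror the column names of the dataset.

```python
import pandas as pd

KM_PER_MILE = 1.609344  # exact by definition of the international mile


def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


def mph_to_ms(mph):
    """Convert miles per hour to metres per second."""
    return mph * KM_PER_MILE * 1000 / 3600


# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "Temperature(F)": [32.0, 212.0],   # freezing and boiling point of water
    "Wind_Speed(mph)": [0.0, 10.0],
    "Visibility(mi)": [1.0, 10.0],
})

# The helpers work on whole columns, so no .apply() is needed.
converted = pd.DataFrame({
    "Temperature": fahrenheit_to_celsius(sample["Temperature(F)"]),
    "Wind_Speed": mph_to_ms(sample["Wind_Speed(mph)"]),
    "Visibility": sample["Visibility(mi)"] * KM_PER_MILE,
})
print(converted)
```

Checking values such as 32 °F and 212 °F is a cheap way to catch a wrong factor before it propagates into every later KPI.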


## Creating KPIs

To understand the dataset better, we create some key performance indicators (KPIs): the mean values of the most important numerical attributes of the dataset.

In [95]:
key_attributes = ["Severity","Temperature","Wind_Chill","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[i].mean()

print(KPI)

    KPI  Severity  Temperature  Wind_Chill  Humidity(%)    Pressure  \
0  Mean  2.203788    15.471941   14.727518    66.120855  993.686312   

   Visibility  Wind_Speed  Precipitation  
0   14.409621    3.207654        0.15927  
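
The same kind of KPI table can also be built without an explicit loop, by selecting the columns first and calling `mean()` once. This is only a sketch on a hypothetical toy frame (`toy` and its values are made up for illustration), not the output of the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned accident data.
toy = pd.DataFrame({
    "Severity": [2, 2, 3, 4],
    "Temperature": [10.0, 20.0, 15.0, 5.0],
    "Humidity(%)": [50.0, 70.0, 60.0, 80.0],
})

attributes = ["Severity", "Temperature", "Humidity(%)"]

# mean() on the column selection returns one value per attribute;
# to_frame(...).T turns that Series into a single-row table labelled "Mean".
kpi = toy[attributes].mean().to_frame(name="Mean").T
print(kpi)
```

If more statistics are wanted later, `toy[attributes].agg(["mean", "median", "std"])` gives one row per statistic in the same layout.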


Now that we have created some basic key performance indicators from our dataset, we would like to gain insight into how the different weather conditions affect the severity of the car accidents. One insight from the mean values is that Temperature and Wind_Chill are quite similar (15.47 °C and 14.73 °C), and therefore we remove Wind_Chill going forward.

We can do some basic exploratory data analysis by calculating the mean of the same attributes for specific values of severity. Specifically, we would like to analyse the extremes, which are severity 1 and severity 4.

In [96]:
key_attributes = ["Severity","Temperature","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[df["Severity"]==1][i].mean()
print(KPI)
print("-----------------------------------------------------------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
for i in key_attributes:
    KPI[i] = df[df["Severity"]==4][i].mean()
print(KPI)

    KPI  Severity  Temperature  Humidity(%)    Pressure  Visibility  \
0  Mean       1.0    21.421112    51.404737  984.270458    15.29778   

   Wind_Speed  Precipitation  
0    3.728722       0.137604  
-----------------------------------------------------------------------------------------------------
    KPI  Severity  Temperature  Humidity(%)    Pressure  Visibility  \
0  Mean       4.0    14.790214     68.03242  989.773515   14.188859   

   Wind_Speed  Precipitation  
0    3.420655       0.155917  
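
The per-severity means can also be computed for all severity levels at once with `groupby`, which makes the comparison between the extremes a simple row subtraction. The sketch below uses a hypothetical toy frame (`toy`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; the real analysis would use df instead.
toy = pd.DataFrame({
    "Severity": [1, 1, 2, 4, 4],
    "Temperature": [22.0, 20.0, 15.0, 14.0, 16.0],
    "Wind_Speed": [4.0, 3.0, 3.0, 3.5, 3.5],
})

# One row of means per severity level.
by_severity = toy.groupby("Severity")[["Temperature", "Wind_Speed"]].mean()

# Keep only the two extremes and subtract severity 1 from severity 4.
extremes = by_severity.loc[[1, 4]]
difference = extremes.loc[4] - extremes.loc[1]

print(extremes)
print(difference)
```

A negative entry in `difference` means the attribute is lower, on average, for severity 4 than for severity 1.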


The idea is now to find which attributes differ the most between severity 1 and severity 4. These are the attributes we wish to use when going deeper into our data analysis.

* Temperature is clearly lower for severity 4 (14.8 °C versus 21.4 °C), so we keep it for further analysis.
* Humidity is clearly higher for severity 4 (68.0% versus 51.4%), so we keep it for further analysis.
* We see no notable change in the mean pressure (989.8 hPa versus 984.3 hPa), so we do not use it for further analysis.
* Visibility is lower for severity 4 (14.2 km versus 15.3 km), so we keep it for further analysis.
* Wind speed is lower for severity 4 (3.42 m/s versus 3.73 m/s), so we keep it for further analysis. Note that this suggests higher wind speeds go together with lower severity, which we did not expect.
* Precipitation is only slightly higher for severity 4 (0.156 mm versus 0.138 mm). This may be caused by a large number of cases with no precipitation at all, so we make an extra check on this attribute.

In [97]:
count = df["Precipitation(in)"]==0
print("Amount of cases where there is no precipitation at all in the dataset is:", round(100*(count.sum()/len(count)),2),"%")
print("------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
KPI["Severity"] = df[df["Precipitation(in)"]>0]["Severity"].mean()
print(KPI)


Amount of cases where there is no precipitation at all in the dataset is: 91.25 %
------------------------------------------------
    KPI  Severity
0  Mean   2.28682


In 91.25% of all car accidents in the dataset, there is no precipitation at all. In the remaining 8.75% of cases, the mean severity is 2.29, which is only about 0.08 higher than the mean of the whole dataset (2.20). As this is only a slight increase, and it only affects 8.75% of the accidents in the dataset, we decide not to include precipitation in our deeper data analysis.
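
The check above compares accidents with precipitation against the whole dataset. A slightly more direct comparison is wet against dry accidents, which can be sketched as follows. The `toy` frame and the labels `"wet"` and `"dry"` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame; precipitation is in mm, as in the converted data.
toy = pd.DataFrame({
    "Severity": [2, 2, 2, 3, 4, 2, 2, 3],
    "Precipitation": [0.0, 0.0, 0.0, 0.0, 2.5, 0.0, 0.0, 0.3],
})

# Label every accident as wet (any precipitation) or dry (none).
has_rain = toy["Precipitation"] > 0
label = has_rain.map({True: "wet", False: "dry"})

# Share of accidents without precipitation, in percent.
share_dry = 100 * (~has_rain).mean()

# Mean severity per group and the gap between the two groups.
mean_by_rain = toy.groupby(label)["Severity"].mean()
gap = mean_by_rain.loc["wet"] - mean_by_rain.loc["dry"]

print(f"Dry share: {share_dry:.2f} %")
print(mean_by_rain)
print(f"Wet minus dry: {gap:.2f}")
```

Because the dry group dominates the overall mean, the wet-versus-dry gap is always at least as large as the wet-versus-overall gap, so it is the more conservative number to base the decision on.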

The attributes we decided were best for our data analysis are:
* Temperature
* Humidity
* Visibility
* Wind Speed
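
The selection above is based on comparing raw means by eye. Since the attributes have different units and scales, one possible way to make the comparison less subjective is to divide each difference in means by the attribute's overall standard deviation. This is not the method used above, only a hedged sketch on a hypothetical toy frame (`toy`).

```python
import pandas as pd

# Hypothetical toy frame with one informative and one uninformative attribute.
toy = pd.DataFrame({
    "Severity": [1, 1, 1, 4, 4, 4],
    "Temperature": [20.0, 22.0, 24.0, 12.0, 14.0, 16.0],
    "Pressure": [990.0, 980.0, 985.0, 986.0, 984.0, 985.0],
})
attributes = ["Temperature", "Pressure"]

low = toy[toy["Severity"] == 1][attributes]
high = toy[toy["Severity"] == 4][attributes]

# Difference of the group means, expressed in overall standard deviations.
effect = (high.mean() - low.mean()) / toy[attributes].std()

# Rank attributes by the absolute size of the scaled difference.
ranking = effect.abs().sort_values(ascending=False)
print(ranking)
```

Attributes near the top of `ranking` differ most between the two severity levels relative to their own spread, which gives a unit-free basis for deciding what to keep.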

# Data Analysis



# Genre

We will create a user-driven interface that enables users to investigate the data themselves. To help users understand the data, we will also present it in a "cartoon" style that tells the story of our analysis and shows how the data can be investigated.


# Visualizations

We have chosen to use a mix of different plots, of which the most important are the visualizations based on geodata. This type of plot is easy to understand and therefore a good choice, because we try to cater to all types of users.


# Discussion



# Contributions



