In [None]:
import pandas as pd

# Motivation

We have chosen a dataset of countrywide traffic accidents in the US. The dataset is rich enough to explore with the many tools we have been taught in the course, and it also opens up the possibility of more advanced analysis. Our goal is to show the user, through interactive tools, how dangerous it can be to drive in bad or extreme weather conditions. We believe that this makes the user more engaged and more invested in the findings of our analysis.

# Basic stats

Before starting the analysis, we removed the less important features of the dataset, 26 attributes in total. Some further cleanup was needed to remove rows with n/a values. After cleanup, our dataset is 350 MB and contains 20 attributes. As the data comes from an American source, we convert all units from imperial to metric.

To determine how weather conditions affect car accidents, we focus on the severity of the accidents, an attribute ranging from 1 to 4 (4 being the most catastrophic). Our main question is therefore how weather increases or decreases the severity of accidents.


MISSING: "Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis."




In [None]:
# Preparing data: load the raw accident records
data = pd.read_csv("data/US_Accidents_Dec20.csv")

# Drop the attributes we do not need for the weather analysis
CleanedData = data.drop(columns=['Number','Distance(mi)','Airport_Code','Street','Side','Country','Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight', 'End_Lat','End_Lng', 'Wind_Direction'])
CleanedData.info()

# Remove rows with missing values
CleanedData = CleanedData.dropna()

# Convert imperial units to metric (vectorised column arithmetic instead of row-wise apply)
CleanedData['Temperature'] = (CleanedData['Temperature(F)'] - 32) * (5/9)  # °F -> °C
CleanedData['Wind_Chill'] = (CleanedData['Wind_Chill(F)'] - 32) * (5/9)  # °F -> °C
CleanedData['Visibility'] = CleanedData['Visibility(mi)'] * 1.609344  # miles -> km
CleanedData['Wind_Speed'] = CleanedData['Wind_Speed(mph)'] * 1.609344 * 1000 / 3600  # mph -> m/s
CleanedData['Precipitation'] = CleanedData['Precipitation(in)'] * 25.4  # inches -> mm
CleanedData['Pressure'] = CleanedData['Pressure(in)'] * 33.86  # inHg -> hPa

df = CleanedData

## Creating KPIs

To understand the dataset better, we create some key performance indicators (KPIs). These cover the most important continuous attributes of the dataset.

In [None]:
key_attributes = ["Severity","Temperature","Wind_Chill","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[i].mean()

print(KPI)

Now that we have created some basic KPIs from our dataset, we would like to gain insight into how the different weather conditions affect the severity of the car accidents. One insight from the mean values is that Temperature and Wind_Chill are quite similar, and we therefore remove Wind_Chill moving forward.
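Similar means alone do not show that two attributes carry the same information, so one way to back up dropping Wind_Chill is to also look at how closely the two columns follow each other. The sketch below does this on a small made-up frame; the values are illustrative assumptions, and in the notebook the same two calls would be applied to `df`.

```python
import pandas as pd

# Made-up sample values in °C (assumption); the notebook would use `df` instead.
sample = pd.DataFrame({
    "Temperature": [-5.0, 0.0, 10.0, 20.0, 30.0],
    "Wind_Chill": [-9.0, -3.0, 9.0, 20.0, 30.0],
})

# Pearson correlation between the two columns (1.0 means perfectly linear).
correlation = sample["Temperature"].corr(sample["Wind_Chill"])

# Mean absolute gap between the two readings, in °C.
mean_gap = (sample["Temperature"] - sample["Wind_Chill"]).abs().mean()

print(f"correlation: {correlation:.3f}, mean absolute gap: {mean_gap:.2f} °C")
```

A correlation close to 1 together with a small gap would support treating Wind_Chill as redundant; a lower correlation would suggest keeping it.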

We can do some basic exploratory data analysis by calculating the mean of the same attributes for specific values of severity. Specifically, we would like to analyse the extremes: severity 1 and severity 4.

In [None]:
key_attributes = ["Severity","Temperature","Humidity(%)","Pressure","Visibility","Wind_Speed","Precipitation"]

KPI = pd.DataFrame(data={'KPI': ['Mean']})

for i in key_attributes:
    KPI[i] = df[df["Severity"]==1][i].mean()
print(KPI)
print("-----------------------------------------------------------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
for i in key_attributes:
    KPI[i] = df[df["Severity"]==4][i].mean()
print(KPI)
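
The per-severity means computed by the two loops above can also be obtained in a single step with `groupby`, which makes the comparison between the extremes explicit. This is a minimal sketch on made-up data; the sample values and the names `means_by_severity` and `difference` are assumptions for illustration, and in the notebook `df[key_attributes]` would take the place of `sample`.

```python
import pandas as pd

# Made-up sample (assumption); the notebook would use `df[key_attributes]`.
sample = pd.DataFrame({
    "Severity": [1, 1, 2, 4, 4],
    "Temperature": [20.0, 22.0, 15.0, 5.0, 7.0],
    "Visibility": [16.0, 14.0, 12.0, 6.0, 8.0],
})

# Mean of every attribute for each severity level, in one call.
means_by_severity = sample.groupby("Severity").mean()

# Difference between the two extremes (severity 4 minus severity 1).
difference = means_by_severity.loc[4] - means_by_severity.loc[1]

print(means_by_severity)
print(difference)
```

Negative entries in `difference` mean the attribute is lower for severity 4 than for severity 1.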

The idea is now to find which attributes differ the most between severity 1 and severity 4. These are the attributes we wish to use when going deeper into our data analysis.

* Temperature is clearly lower for severity 4, so we keep it for further analysis.
* Humidity is clearly higher for severity 4, so we keep it for further analysis.
* We see no significant change in the mean pressure, so we do not use it for further analysis.
* Visibility is clearly lower for severity 4, so we keep it for further analysis.
* Wind speed is clearly lower for severity 4, so we keep it for further analysis. Note that this suggests higher wind speeds are associated with lower severity, which we did not expect.
* Precipitation is only slightly higher for severity 4. This may be caused by a large number of cases with no precipitation at all, so we would like to make an extra check on this attribute.

In [None]:
# Share of accidents recorded with no precipitation at all
count = df["Precipitation(in)"]==0
print("Share of cases with no precipitation at all in the dataset is:", round(100*(count.sum()/len(count)),2),"%")
print("------------------------------------------------")
KPI = pd.DataFrame(data={'KPI': ['Mean']})
KPI["Severity"] = df[df["Precipitation(in)"]>0]["Severity"].mean()
print(KPI)

In 91.25% of all car accidents in the dataset, there is no precipitation at all. In the remaining 8.75% of cases, the mean severity is 2.29, which is only 0.09 higher than the mean of the whole dataset. As this is only a slight increase, and it only affects 8.75% of the car accidents in the dataset, we decide not to include precipitation in our deeper data analysis.

The attributes we decided were best suited for our data analysis are:
* Temperature
* Humidity
* Visibility
* Wind Speed

# Data Analysis

Describe your data analysis and explain what you've learned about the dataset.
If relevant, talk about your machine-learning.

What we have learned will probably have to wait a bit; maybe we can get going on it tomorrow (Kjær and Mus). I think this will be the primary section of the "explainer_notebook".

After analysing how different weather conditions affect traffic accidents across the US, we thought it would be relevant to create a model that can predict the severity of an accident given the weather conditions. To do so, we performed some data engineering, encoding the "weather condition" attribute as a numeric value representing one of the 111 conditions. Next, we used a supervised learning classifier (logistic regression) to predict the severity of future accidents. This was possible because we had labelled data, with the severity as our labels.
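The modelling steps described above could look roughly like the following. This is a minimal sketch using scikit-learn on made-up data, not the exact model from our analysis: the column name `Weather_Condition`, the choice of `Temperature` as a second feature, the sample values and the train/test split settings are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up sample (assumption); the real data has 111 weather conditions.
sample = pd.DataFrame({
    "Weather_Condition": ["Clear", "Rain", "Snow", "Clear", "Rain", "Snow",
                          "Clear", "Rain", "Snow", "Clear", "Rain", "Snow"],
    "Temperature": [25.0, 12.0, -3.0, 22.0, 10.0, -5.0,
                    27.0, 14.0, -1.0, 24.0, 11.0, -4.0],
    "Severity": [1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4],
})

# Encode each weather condition as an integer code (one code per category).
sample["Weather_Code"] = sample["Weather_Condition"].astype("category").cat.codes

X = sample[["Weather_Code", "Temperature"]]
y = sample["Severity"]  # the severity is the label we want to predict

# Hold out part of the labelled data to estimate how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Predicted severities:", predictions)
print("Accuracy on held-out sample:", model.score(X_test, y_test))
```

Integer codes impose an arbitrary order on the weather conditions, which a linear model such as logistic regression will treat as meaningful; one-hot encoding is a common alternative worth comparing against.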


# Genre

So far I have written a bit about highlighting, ordering, interactivity and messaging (visual structuring and transition guidance are still missing).

To tell our story we used several tools. From the "visual narrative: highlighting" category we used zooming. Our data initially has a huge scope, and zooming is useful for telling stories in a more local and precise manner.

The narrative structure used to lead the reader is twofold. From the "ordering" category we used both "random access" and "linear". We used random access because we would like readers to investigate the data, and the story it tells, by themselves, using the tools we have prepared. To help readers reach the conclusions we made and see our entire analysis, we have also prepared a linear story that leads them through our entire thought process.


To let the reader explore the data and learn the story it tells, we have created tools of the interactivity type: "navigation buttons" and "filtering/selection/search".

To get the story across to the reader, we used narrative structure tools of the messaging type, such as headlines, captions and annotations.

# Visualizations

Explain the visualizations you've chosen.
Why are they right for the story you want to tell?

We have chosen a mix of different plots, the most important of which are based on geodata. This type of plot is easy to understand and therefore a good choice, because we try to cater to all types of users.

We will create a user-based interface that enables users to investigate the data by themselves. To help users understand the data, we will also present it in a "cartoon" style that tells the story of our analysis and inspires users on how the data can be investigated.

# Discussion
Think critically about your creation
What went well?,
What is still missing? What could be improved?, Why?


# Contributions

Who did what?
You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That's what you should explain).
It is not OK simply to write "All group members contributed equally".
