# Reducing the number of high fatality accidents

## 📖 Background
You work for the road safety team within the department of transport and are looking into how they can reduce the number of major incidents. The safety team classes major incidents as fatal accidents involving 3+ casualties. They are trying to learn more about the characteristics of these major incidents so they can brainstorm interventions that could lower the number of deaths. They have asked for your assistance with answering a number of questions.

## 💾 The data
The reporting department have been collecting data on every accident that is reported. They've included this along with a lookup file for 2020's accidents.

*Published by the department for transport. https://data.gov.uk/dataset/road-accidents-safety-data* 
*Contains public sector information licensed under the Open Government Licence v3.0.*

## 💪 Competition challenge

Create a report that covers the following:

1. What time of day and day of the week do most major incidents happen?
2. Are there any patterns in the time of day or day of the week when major incidents occur?
3. What characteristics stand out in major incidents compared with other accidents?
4. On what areas would you recommend the planning team focus their brainstorming efforts to reduce major incidents?

In [3]:
# Importing dependencies
import pandas as pd

In [24]:
# Loading accident dataset
accidents = pd.read_csv('Resources/accident_data.csv')
accidents.head()

| Unnamed: 0.1 | Unnamed: 0 | accident_index | accident_year | accident_reference | longitude | latitude | accident_severity | number_of_vehicles | number_of_casualties | date | ... | second_road_class | second_road_number | pedestrian_crossing_human_control | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020010219808 | 2020 | 10219808 | -0.254001 | 51.462262 | 3 | 1 | 1 | 04/02/2020 | ... | 6 | 0 | 9 | 9 | 1 | 9 | 9 | 0 | 0 | 1 |
| 1 | 1 | 2020010220496 | 2020 | 10220496 | -0.139253 | 51.470327 | 3 | 1 | 2 | 27/04/2020 | ... | 6 | 0 | 0 | 4 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 2 | 2020010228005 | 2020 | 10228005 | -0.178719 | 51.529614 | 3 | 1 | 1 | 01/01/2020 | ... | 6 | 0 | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 1 |
| 3 | 3 | 2020010228006 | 2020 | 10228006 | -0.001683 | 51.54121 | 2 | 1 | 1 | 01/01/2020 | ... | 6 | 0 | 0 | 4 | 4 | 1 | 1 | 0 | 0 | 1 |
| 4 | 4 | 2020010228011 | 2020 | 10228011 | -0.137592 | 51.515704 | 3 | 1 | 2 | 01/01/2020 | ... | 5 | 0 | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 |


In [5]:
# Loading Lookup Dataset
# Lookup is a reference for codes and formats used in the accidents dataset
lookup = pd.read_csv('Resources/lookup_Data.csv')
lookup.head()

| Unnamed: 0.1 | Unnamed: 0 | table | field name | code/format | label | note |
|---|---|---|---|---|---|---|
| 0 | 0 | Accident | accident_index |  |  | unique value for each accident. The accident_i... |
| 1 | 1 | Accident | accident_year |  |  |  |
| 2 | 2 | Accident | accident_reference |  |  | In year id used by the police to reference a c... |
| 3 | 3 | Accident | longitude |  |  | Null if not known |
| 4 | 4 | Accident | Latitude |  |  | Null if not known |
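Since the lookup table pairs each coded field with its labels, it can be turned into a code-to-label mapping. The following is a minimal sketch, not part of the original analysis: `lookup_sample`, `accidents_sample` and the helper `code_labels` are hypothetical names, and the severity rows are illustrative stand-ins that should be checked against the real `lookup` dataframe.

```python
import pandas as pd

# Tiny stand-in for the lookup table. These rows are illustrative assumptions,
# not copied from Resources/lookup_Data.csv.
lookup_sample = pd.DataFrame({
    "table": ["Accident"] * 4,
    "field name": ["accident_severity"] * 3 + ["longitude"],
    "code/format": ["1", "2", "3", None],
    "label": ["Fatal", "Serious", "Slight", None],
})


def code_labels(lookup_df, field):
    """Return a {code: label} dict for one coded field (hypothetical helper)."""
    rows = lookup_df[lookup_df["field name"] == field]
    rows = rows.dropna(subset=["code/format", "label"])
    # Assumes the codes of this field are integers stored as text
    return dict(zip(rows["code/format"].astype(int), rows["label"]))


severity_labels = code_labels(lookup_sample, "accident_severity")

# Decode a coded column into readable labels
accidents_sample = pd.DataFrame({"accident_severity": [3, 3, 2, 1]})
accidents_sample["severity_label"] = accidents_sample["accident_severity"].map(severity_labels)
print(accidents_sample)
```

Fields whose `code/format` entries are format strings rather than integer codes would need different handling than the `astype(int)` step assumed here.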


## Preprocessing
We will look through the data:
* Remove duplicates
* Identify irrelevant columns
* Reformat column names
* Look for outliers
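The steps above might be sketched as follows on a small made-up frame. This is an illustration under stated assumptions rather than the notebook's actual method: the `Unnamed` columns are treated as leftover row counters, the name `raw` and its values are invented, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this sketch
raw = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2, 3],
    "Unnamed: 0": [0, 1, 2, 3],
    "Accident Index": ["a1", "a2", "a3", "a4"],
    "number_of_casualties": [1, 2, 1, 41],
})

# Identify irrelevant columns: drop leftover index columns (assumed to be row counters)
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")].copy()

# Reformat column names: trim, lower-case, and replace spaces with underscores
clean.columns = (
    clean.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Look for outliers: flag values above Q3 + 1.5 * IQR
q1, q3 = clean["number_of_casualties"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = clean[clean["number_of_casualties"] > upper]
print(outliers)
```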


In [32]:
# Record the number of rows before checking for duplicates
initial_length = len(accidents)

# Drop duplicate records, keeping the first occurrence of each
# (keep=False would discard every copy of a duplicated row)
df2 = accidents.drop_duplicates(keep='first')

# Record the number of rows after removing duplicates
final_length = len(df2)

print(initial_length, '/', final_length)

# The counts match, so no fully identical rows were found

91199 / 91199
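One caveat: if the `Unnamed: 0` columns are row counters, as the `head()` output suggests, every row is unique by construction and a full-row comparison cannot find repeated accidents. A hedged alternative is to compare on `accident_index` alone. The frame below is invented for illustration, including its deliberately repeated identifier.

```python
import pandas as pd

# Invented sample: the first and third rows describe the same accident,
# but differ in the leftover row-counter column
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "accident_index": ["A001", "A002", "A001"],
    "number_of_casualties": [1, 2, 1],
})

# Full-row comparison misses the repeat because of the counter column
full_row_dupes = int(sample.duplicated().sum())

# Comparing on the accident identifier alone finds it
key_dupes = int(sample.duplicated(subset="accident_index").sum())

# Keep the first record for each accident
deduped = sample.drop_duplicates(subset="accident_index", keep="first")

print(full_row_dupes, key_dupes, len(deduped))
```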


In [None]:
# List the columns present in accidents
for col in accidents.columns:
    print(col)

# All columns appear to be useful in the context of car crashes

## Time Frequency of Incidents
We will investigate the times of day and days of the week at which accidents occur most frequently.
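A minimal sketch of this kind of count, restricted to major incidents (fatal accidents with 3+ casualties), is shown below. It rests on several assumptions: the rows are invented, `time` is an assumed column name holding `HH:MM` strings (it is hidden behind the `...` in the `head()` output), and severity code 1 is taken to mean fatal, which should be confirmed against the lookup table.

```python
import pandas as pd

# Invented rows; `time` is an assumed column name and format
sample = pd.DataFrame({
    "accident_severity": [1, 1, 3, 1, 2],
    "number_of_casualties": [3, 4, 1, 2, 5],
    "date": ["04/02/2020", "27/04/2020", "01/01/2020", "01/01/2020", "03/01/2020"],
    "time": ["17:30", "17:05", "09:00", "02:15", "23:40"],
})

# Major incident: fatal (code 1, assumed) with at least 3 casualties
is_major = (sample["accident_severity"] == 1) & (sample["number_of_casualties"] >= 3)
major = sample[is_major].copy()

# Combine the day-first date and the time into a single timestamp
major["timestamp"] = pd.to_datetime(
    major["date"] + " " + major["time"], format="%d/%m/%Y %H:%M"
)
major["hour"] = major["timestamp"].dt.hour
major["weekday"] = major["timestamp"].dt.day_name()

# Frequency of major incidents by hour of day and by day of the week
by_hour = major["hour"].value_counts().sort_index()
by_weekday = major["weekday"].value_counts()

print(by_hour)
print(by_weekday)
```

Passing an explicit `format` avoids the day and month being swapped for dates such as `04/02/2020`.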