# 800_Q2_final_prep

## Purpose
In this notebook we will prepare the final dataset to answer our second research question. We will specifically focus on joining the datasets for different years and separating the date column into 3 separate columns for day, month and year.

## Notebook Contents:
* __1:__ Loading the Datasets
   
* __2:__ Question 2(A)
    * __2.1:__ Split Date
    * __2.2:__ Sort by Year
    
* __3:__ Question 2(B)
    * __3.1:__ Split Date
    * __3.2:__ Sort by Year
    
* __4:__ Dropping Years

* __5:__ Convert Data Types

* __6:__ Concatenating Datasets

* __7:__ Saving Data to Pickle Files

* __8:__ Creating Data Dictionaries

## Datasets
__Input:__ The following pickle files contain the data with the missing values removed. The data covers the years 1979-2004, 2005-2014 and 2015-2016. Each pickle file contains only the columns needed to answer the research question it is used for; for example, a file with Q2B in its name is used to answer part B of research question 2.
* 500_prep_missing_values_0514_Q2A.pkl, 500_prep_missing_values_0514_Q2B.pkl, 500_prep_missing_values_7904a_Q2B.pkl, 500_prep_missing_values_df7904b_Q2B.pkl, 600_prep_missing_values_7904c_Q2A.pkl, 600_prep_missing_values_7904c_Q2B.pkl, 600_prep_missing_values_7904d_Q2A.pkl, 600_prep_missing_values_7904d_Q2B.pkl, 600_prep_missing_values_1516_Q2A.pkl, 600_prep_missing_values_1516_Q2B.pkl


__Output:__ The following pickle files contain the final prepped data to be able to answer research question 2.
* 800_Q2A_final_prep_1.pkl, 800_Q2A_final_prep_2.pkl, 800_Q2A_final_prep_3.pkl, 800_Q2B_final_prep_1.pkl, 800_Q2B_final_prep_2.pkl, 800_Q2B_final_prep_3.pkl

In [1]:
import os
import sys

import pandas as pd
import numpy as np

module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.helpers import data_dictionary

%matplotlib inline

# 1. Loading the Datasets

In [2]:
df0514_Q2A = pd.read_pickle('../../data/processed/500_prep_missing_values_0514_Q2A.pkl')
df0514_Q2A.shape

(2036044, 16)

In [3]:
df0514_Q2B = pd.read_pickle('../../data/processed/500_prep_missing_values_0514_Q2B.pkl')
df0514_Q2B.shape

(2759693, 15)

In [4]:
df7904a_Q2B = pd.read_pickle('../../data/processed/500_prep_missing_values_7904a_Q2B.pkl')
df7904a_Q2B.shape

(2411743, 15)

In [5]:
df7904b_Q2B = pd.read_pickle('../../data/processed/500_prep_missing_values_df7904b_Q2B.pkl')
df7904b_Q2B.shape

(2621931, 15)

In [6]:
df7904c_Q2A = pd.read_pickle('../../data/processed/600_prep_missing_values_7904c_Q2A.pkl')
df7904c_Q2A.shape

(2033305, 16)

In [7]:
df7904c_Q2B = pd.read_pickle('../../data/processed/600_prep_missing_values_7904c_Q2B.pkl')
df7904c_Q2B.shape

(2808172, 15)

In [8]:
df7904d_Q2A = pd.read_pickle('../../data/processed/600_prep_missing_values_7904d_Q2A.pkl')
df7904d_Q2A.shape

(1242776, 16)

In [9]:
df7904d_Q2B = pd.read_pickle('../../data/processed/600_prep_missing_values_7904d_Q2B.pkl')
df7904d_Q2B.shape

(1818065, 15)

In [10]:
df1516_Q2A = pd.read_pickle('../../data/processed/600_prep_missing_values_1516_Q2A.pkl')
df1516_Q2A.shape

(319026, 16)

In [11]:
df1516_Q2B = pd.read_pickle('../../data/processed/600_prep_missing_values_1516_Q2B.pkl')
df1516_Q2B.shape

(434637, 15)

# 2. Question 2(A)

####  Here we check how many rows a single dataframe would have if we concatenated all of the Q2A dataframes into one.

In [12]:
# All Q2A dataframe sizes
2036044 + 2033305 + 1242776 + 319026

5631151

We decided that this would be too large a dataframe to work with. Therefore, later in this notebook we will concatenate the dataframes into several smaller dataframes instead of a single one.
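The row total can also be read from the dataframes themselves instead of being typed by hand, which avoids stale numbers if an input file changes. A minimal sketch, using small made-up dataframes in place of the loaded ones (the `frames` dict and its keys are hypothetical names):

```python
import pandas as pd

# Stand-in dataframes; in the notebook these would be the loaded Q2A dataframes.
frames = {
    "df_a": pd.DataFrame({"Date": ["20/10/1993", "21/10/1993"]}),
    "df_b": pd.DataFrame({"Date": ["11/09/2000"]}),
    "df_c": pd.DataFrame({"Date": ["05/01/2005", "06/01/2005", "07/01/2005"]}),
}

# len(df) is the number of rows, the same value as df.shape[0].
rows_per_frame = {name: len(df) for name, df in frames.items()}
total_rows = sum(rows_per_frame.values())

print(rows_per_frame)
print(total_rows)
```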

## 2.1 Split Date
Here we will split the date column into day, month and year so that we can reference data by the year.
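As an illustration of what the split produces, the sketch below applies it to a few made-up dates in the same dd/mm/yyyy format. It also shows an alternative that is not used in this notebook: parsing with `pd.to_datetime` and reading the components through the `.dt` accessor, which gives integers directly. The toy dataframe and the `Year_int` column name are assumptions for the example.

```python
import pandas as pd

# Toy data in the same dd/mm/yyyy format as the Date column.
df = pd.DataFrame({"Date": ["20/10/1993", "11/09/2000", "05/01/2005"]})

# Option 1: string split, as used in this notebook. The parts stay strings.
df[["Date_day", "Month", "Year"]] = df["Date"].str.split(pat="/", expand=True)

# Option 2: parse to datetime and read the year as an integer.
parsed = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year_int"] = parsed.dt.year

print(df)
```

With the string split, values such as `"05"` keep their leading zero until they are converted to integers.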

In [13]:
df7904c_Q2A[['Date_day', 'Month', 'Year']] = df7904c_Q2A['Date'].str.split(pat = '/', n=-1, expand=True)
df7904c_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
6000001,199301NI00616,slight,2,1,20/10/1993,thursday,11:25,3,daylight,fine no high winds,dry,none,car,turning left,male,6.0,20,10,1993


In [14]:
df7904d_Q2A[['Date_day', 'Month', 'Year']] = df7904d_Q2A['Date'].str.split(pat = '/', n=-1, expand=True)
df7904d_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
9000001,200001TX00967,slight,3,2,11/09/2000,tuesday,06:47,25,daylight,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,going ahead other,male,5.0,11,9,2000


In [15]:
df0514_Q2A[['Date_day', 'Month', 'Year']] = df0514_Q2A['Date'].str.split(pat = '/', n=-1, expand=True)
df0514_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
1,200501BS00002,slight,1,1,05/01/2005,thursday,17:36,12,darkness - lights lit,fine no high winds,dry,none,bus or coach,slowing or stopping,male,3.0,5,1,2005


In [16]:
df1516_Q2A[['Date_day', 'Month', 'Year']] = df1516_Q2A['Date'].str.split(pat = '/', n=-1, expand=True)
df1516_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
0,201501BS70001,slight,1,1,12/01/2015,tuesday,18:45,12,darkness - lights lit,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,turning right,male,4.0,12,1,2015


## 2.2 Sort by Year
Here we will sort the data by year so that the data is easier to work with.
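One detail worth keeping in mind is that `DataFrame.sort_values` returns a sorted copy by default and does not modify the dataframe it is called on. A small sketch on made-up data (the column values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "Accident_Index": ["c", "a", "b"],
    "Year": ["2005", "1993", "2000"],
})

# Calling sort_values without assigning the result leaves df unchanged.
df.sort_values(by=["Year"])
unchanged_first = df["Year"].iloc[0]

# Assigning the result keeps the sorted order. kind="mergesort" is a stable
# sort, so rows with the same year keep their original relative order.
df_sorted = df.sort_values(by=["Year"], kind="mergesort")
sorted_first = df_sorted["Year"].iloc[0]

print(unchanged_first, sorted_first)
```

Four-digit year strings sort in the same order as their integer values, so sorting before the type conversion still gives chronological order.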

In [17]:
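# sort_values returns a sorted copy, so assign it back (mergesort is stable)
df7904c_Q2A = df7904c_Q2A.sort_values(by=['Year'], kind='mergesort')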
df7904c_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
6000001,199301NI00616,slight,2,1,20/10/1993,thursday,11:25,3,daylight,fine no high winds,dry,none,car,turning left,male,6.0,20,10,1993


In [18]:
df7904d_Q2A = df7904d_Q2A.sort_values(by=['Year'], kind='mergesort')
df7904d_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
9000001,200001TX00967,slight,3,2,11/09/2000,tuesday,06:47,25,daylight,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,going ahead other,male,5.0,11,9,2000


In [19]:
df0514_Q2A = df0514_Q2A.sort_values(by=['Year'], kind='mergesort')
df0514_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
1,200501BS00002,slight,1,1,05/01/2005,thursday,17:36,12,darkness - lights lit,fine no high winds,dry,none,bus or coach,slowing or stopping,male,3.0,5,1,2005


In [20]:
df1516_Q2A = df1516_Q2A.sort_values(by=['Year'], kind='mergesort')
df1516_Q2A.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Age_of_Vehicle,Date_day,Month,Year
0,201501BS70001,slight,1,1,12/01/2015,tuesday,18:45,12,darkness - lights lit,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,turning right,male,4.0,12,1,2015


# 3. Question 2(B)

####  Here we check how many rows a single dataframe would have if we concatenated all of the Q2B dataframes into one.

In [21]:
# All Q2B dataframe sizes
2759693 + 2411743 + 2621931 + 2808172 + 1818065 + 434637

12854241

We decided that this would be too large a dataframe to work with. Therefore, later in this notebook we will concatenate the dataframes into several smaller dataframes instead of a single one.

## 3.1 Split Date
Here we will split the date column into day, month and year so that we can reference data by the year.

In [22]:
df7904a_Q2B[['Date_day', 'Month', 'Year']] = df7904a_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df7904a_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
85587,197903A102220,serious,2,2,07/04/1979,sunday,12:30,63,daylight,fine no high winds,wet or damp,none,bus or coach,goinf ahead right-hand bend,male,7,4,1979


In [23]:
df7904b_Q2B[['Date_day', 'Month', 'Year']] = df7904b_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df7904b_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
3000004,198601TD00758,slight,2,1,07/08/1986,friday,13:30,25,daylight,fine no high winds,dry,none,car,overtaking moving vehicle - offside,female,7,8,1986


In [24]:
df7904c_Q2B[['Date_day', 'Month', 'Year']] = df7904c_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df7904c_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
6000001,199301NI00616,slight,2,1,20/10/1993,thursday,11:25,3,daylight,fine no high winds,dry,none,car,turning left,male,20,10,1993


In [25]:
df7904d_Q2B[['Date_day', 'Month', 'Year']] = df7904d_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df7904d_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
9000001,200001TX00967,slight,3,2,11/09/2000,tuesday,06:47,25,daylight,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,going ahead other,male,11,9,2000


In [26]:
df0514_Q2B[['Date_day', 'Month', 'Year']] = df0514_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df0514_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
0,200501BS00001,serious,1,1,04/01/2005,wednesday,17:42,12,daylight,raining no high winds,wet or damp,none,car,going ahead other,female,4,1,2005


In [27]:
df1516_Q2B[['Date_day', 'Month', 'Year']] = df1516_Q2B['Date'].str.split(pat = '/', n=-1, expand=True)
df1516_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
0,201501BS70001,slight,1,1,12/01/2015,tuesday,18:45,12,darkness - lights lit,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,turning right,male,12,1,2015


## 3.2 Sort by Year
Here we will sort the data by year so that the data is easier to work with.

In [28]:
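# sort_values returns a sorted copy, so assign it back (mergesort is stable)
df7904a_Q2B = df7904a_Q2B.sort_values(by=['Year'], kind='mergesort')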
df7904a_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
85587,197903A102220,serious,2,2,07/04/1979,sunday,12:30,63,daylight,fine no high winds,wet or damp,none,bus or coach,goinf ahead right-hand bend,male,7,4,1979


In [29]:
df7904b_Q2B = df7904b_Q2B.sort_values(by=['Year'], kind='mergesort')
df7904b_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
3000004,198601TD00758,slight,2,1,07/08/1986,friday,13:30,25,daylight,fine no high winds,dry,none,car,overtaking moving vehicle - offside,female,7,8,1986


In [30]:
df7904c_Q2B = df7904c_Q2B.sort_values(by=['Year'], kind='mergesort')
df7904c_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
6000001,199301NI00616,slight,2,1,20/10/1993,thursday,11:25,3,daylight,fine no high winds,dry,none,car,turning left,male,20,10,1993


In [31]:
df7904d_Q2B = df7904d_Q2B.sort_values(by=['Year'], kind='mergesort')
df7904d_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
9000001,200001TX00967,slight,3,2,11/09/2000,tuesday,06:47,25,daylight,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,going ahead other,male,11,9,2000


In [32]:
df0514_Q2B = df0514_Q2B.sort_values(by=['Year'], kind='mergesort')
df0514_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
0,200501BS00001,serious,1,1,04/01/2005,wednesday,17:42,12,daylight,raining no high winds,wet or damp,none,car,going ahead other,female,4,1,2005


In [33]:
df1516_Q2B = df1516_Q2B.sort_values(by=['Year'], kind='mergesort')
df1516_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
0,201501BS70001,slight,1,1,12/01/2015,tuesday,18:45,12,darkness - lights lit,fine no high winds,dry,none,van / goods 3.5 tonnes mgw or under,turning right,male,12,1,2015


# 4. Dropping Years

Data for Question 2A starts from 20/10/1993 whereas data for Question 2B starts from 07/04/1979. We want to have the same start year for both questions and therefore have decided to drop data in the dataframes for Question 2B from before 1993.

Below you can see that the data in df7904a_Q2B ends before 20/10/1993 (the last rows are from 1986), so we will not include this dataframe in our analysis.

In [34]:
df7904a_Q2B.tail()

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
2999995,198601TD00752,slight,2,1,06/08/1986,thursday,14:25,25,daylight,fine no high winds,dry,none,car,waiting to turn right,male,6,8,1986
2999996,198601TD00752,slight,2,1,06/08/1986,thursday,14:25,25,daylight,fine no high winds,dry,none,car,going ahead other,female,6,8,1986
2999997,198601TD00753,slight,2,1,04/08/1986,tuesday,17:35,25,daylight,fine no high winds,dry,none,motorcycle,waiting to turn right,male,4,8,1986
2999998,198601TD00753,slight,2,1,04/08/1986,tuesday,17:35,25,daylight,fine no high winds,dry,none,motorcycle,going ahead other,male,4,8,1986
2999999,198601TD00754,slight,2,1,07/08/1986,friday,14:30,25,daylight,fine no high winds,dry,none,car,waiting to turn left,male,7,8,1986


The following dataframe contains data up to and including 1993. Therefore, we will drop the data from before this year. However, in order to do this we must first convert the Year column from string to int.
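The reason for the conversion is that a string column cannot be compared with an integer such as 1993. The sketch below shows the convert-then-filter pattern on a few made-up rows; the `df_from_1993` name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["07/08/1986", "18/02/1993", "14/11/1993"],
    "Year": ["1986", "1993", "1993"],
})

# Convert the year strings to integers so numeric comparison works.
df["Year"] = df["Year"].astype("int64")

# Boolean mask: keep only rows from 1993 onwards.
# .copy() makes the result an independent dataframe.
df_from_1993 = df[df["Year"] >= 1993].copy()

print(df_from_1993)
```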

In [35]:
df7904b_Q2B.tail()

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
5999991,199301NI00611,slight,2,1,14/11/1993,monday,11:35,3,daylight,fine + high winds,wet or damp,none,car,waiting to go - held up,female,14,11,1993
5999992,199301NI00612,slight,1,2,13/11/1993,sunday,09:10,3,daylight,raining no high winds,wet or damp,none,car,going ahead other,male,13,11,1993
5999993,199301NI00613,slight,3,1,14/11/1993,monday,03:36,3,darkness - lights lit,raining + high winds,wet or damp,none,car,u-turn,male,14,11,1993
5999994,199301NI00613,slight,3,1,14/11/1993,monday,03:36,3,darkness - lights lit,raining + high winds,wet or damp,none,car,going ahead other,male,14,11,1993
5999999,199301NI00615,slight,2,1,08/11/1993,tuesday,17:15,3,darkness - lights lit,fine no high winds,dry,none,pedal cycle,going ahead other,female,8,11,1993


As the other datasets are from 1993 onwards, we will include them in our final analysis, and no rows in them need to be dropped.

# 5. Convert Data Types
Next we want to convert the Date_day, Month and Year columns we created earlier from type string to int. 
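`astype('int64')` raises an error if any value is not a valid integer string. If malformed dates were a concern, one possible check (not part of this notebook) is to coerce with `pd.to_numeric` first and count the values that fail to parse. A sketch on made-up data, where the `"xx"` value is deliberately invalid:

```python
import pandas as pd

parts = pd.DataFrame({
    "Date_day": ["20", "11", "xx"],
    "Month": ["10", "09", "01"],
    "Year": ["1993", "2000", "2005"],
})

# errors="coerce" turns unparseable values into NaN instead of raising.
numeric = parts.apply(pd.to_numeric, errors="coerce")

# Flag rows where at least one date part could not be parsed.
bad_rows = numeric.isna().any(axis=1)
print(int(bad_rows.sum()))

# Convert only the clean rows to int64.
clean = numeric[~bad_rows].astype("int64")
print(clean.dtypes)
```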

## Question 2A

In [36]:
df7904c_Q2A.dtypes

Accident_Index                 object
Accident_Severity              object
Number_of_Vehicles              int64
Number_of_Casualties            int64
Date                           object
Day_of_Week                    object
Time                           object
Local_Authority_(District)      int64
Light_Conditions               object
Weather_Conditions             object
Road_Surface_Conditions        object
Special_Conditions_at_Site     object
Vehicle_Type                   object
Vehicle_Manoeuvre              object
Sex_of_Driver                  object
Age_of_Vehicle                float64
Date_day                       object
Month                          object
Year                           object
dtype: object

In [37]:
df7904c_Q2A[['Date_day', 'Month', 'Year']] = df7904c_Q2A[['Date_day', 'Month', 'Year']].astype('int64')

Below you can see that the specified columns have been converted from type object -> int.

In [38]:
df7904c_Q2A.dtypes

Accident_Index                 object
Accident_Severity              object
Number_of_Vehicles              int64
Number_of_Casualties            int64
Date                           object
Day_of_Week                    object
Time                           object
Local_Authority_(District)      int64
Light_Conditions               object
Weather_Conditions             object
Road_Surface_Conditions        object
Special_Conditions_at_Site     object
Vehicle_Type                   object
Vehicle_Manoeuvre              object
Sex_of_Driver                  object
Age_of_Vehicle                float64
Date_day                        int64
Month                           int64
Year                            int64
dtype: object

We will now repeat this for the other 3 datasets being used for research question 2A.

In [39]:
df7904d_Q2A[['Date_day', 'Month', 'Year']] = df7904d_Q2A[['Date_day', 'Month', 'Year']].astype('int64')

In [40]:
df0514_Q2A[['Date_day', 'Month', 'Year']] = df0514_Q2A[['Date_day', 'Month', 'Year']].astype('int64')

In [41]:
df1516_Q2A[['Date_day', 'Month', 'Year']] = df1516_Q2A[['Date_day', 'Month', 'Year']].astype('int64')

## Question 2B

In [42]:
df7904b_Q2B.dtypes

Accident_Index                object
Accident_Severity             object
Number_of_Vehicles             int64
Number_of_Casualties           int64
Date                          object
Day_of_Week                   object
Time                          object
Local_Authority_(District)     int64
Light_Conditions              object
Weather_Conditions            object
Road_Surface_Conditions       object
Special_Conditions_at_Site    object
Vehicle_Type                  object
Vehicle_Manoeuvre             object
Sex_of_Driver                 object
Date_day                      object
Month                         object
Year                          object
dtype: object

In [43]:
df7904b_Q2B[['Date_day', 'Month', 'Year']] = df7904b_Q2B[['Date_day', 'Month', 'Year']].astype('int64')

Below you can see that the specified columns have been converted from type object -> int.

In [44]:
df7904b_Q2B.dtypes

Accident_Index                object
Accident_Severity             object
Number_of_Vehicles             int64
Number_of_Casualties           int64
Date                          object
Day_of_Week                   object
Time                          object
Local_Authority_(District)     int64
Light_Conditions              object
Weather_Conditions            object
Road_Surface_Conditions       object
Special_Conditions_at_Site    object
Vehicle_Type                  object
Vehicle_Manoeuvre             object
Sex_of_Driver                 object
Date_day                       int64
Month                          int64
Year                           int64
dtype: object

We will now repeat this for the other 4 datasets being used for research question 2B.

In [45]:
df7904c_Q2B[['Date_day', 'Month', 'Year']] = df7904c_Q2B[['Date_day', 'Month', 'Year']].astype('int64')

In [46]:
df7904d_Q2B[['Date_day', 'Month', 'Year']] = df7904d_Q2B[['Date_day', 'Month', 'Year']].astype('int64')

In [47]:
df0514_Q2B[['Date_day', 'Month', 'Year']] = df0514_Q2B[['Date_day', 'Month', 'Year']].astype('int64')

In [48]:
df1516_Q2B[['Date_day', 'Month', 'Year']] = df1516_Q2B[['Date_day', 'Month', 'Year']].astype('int64')

Now we can drop the data we don't want to keep, i.e. data from before 1993.

In [49]:
df7904b_Q2B = df7904b_Q2B[df7904b_Q2B['Year'] >= 1993]
df7904b_Q2B.head(1)

Unnamed: 0,Accident_Index,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Vehicle_Type,Vehicle_Manoeuvre,Sex_of_Driver,Date_day,Month,Year
5968608,1993010SA0120,slight,3,2,18/02/1993,friday,16:45,33,daylight,fine no high winds,dry,none,car,waiting to go - held up,male,18,2,1993


# 6. Concatenating Datasets

## Question 2A

In [50]:
df7904c_Q2A.shape

(2033305, 19)

In [51]:
df7904d_Q2A.shape

(1242776, 19)

In [52]:
df0514_Q2A.shape

(2036044, 19)

In [53]:
df1516_Q2A.shape

(319026, 19)

We have decided to join the df0514_Q2A and df1516_Q2A dataframes because df1516_Q2A is small. We have decided to leave the first two dataframes as they are.

In [54]:
df0516_Q2A = pd.concat([df0514_Q2A, df1516_Q2A])

In [55]:
df0516_Q2A.shape

(2355070, 19)

As you can see below, no null values were created after concatenating the two dataframes.

In [56]:
df0516_Q2A.isnull().sum()

Accident_Index                0
Accident_Severity             0
Number_of_Vehicles            0
Number_of_Casualties          0
Date                          0
Day_of_Week                   0
Time                          0
Local_Authority_(District)    0
Light_Conditions              0
Weather_Conditions            0
Road_Surface_Conditions       0
Special_Conditions_at_Site    0
Vehicle_Type                  0
Vehicle_Manoeuvre             0
Sex_of_Driver                 0
Age_of_Vehicle                0
Date_day                      0
Month                         0
Year                          0
dtype: int64
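Nulls appear after `pd.concat` only when the inputs do not share the same columns, which is why this check is useful. The sketch below illustrates both cases on made-up dataframes. It also passes `ignore_index=True`, an option not used in this notebook, which gives the result a fresh index instead of repeating the index labels of the inputs.

```python
import pandas as pd

df_a = pd.DataFrame({"Accident_Index": ["200501BS00002"], "Year": [2005]})
df_b = pd.DataFrame({"Accident_Index": ["201501BS70001"], "Year": [2015]})

# ignore_index=True builds a fresh 0..n-1 index, avoiding duplicate labels.
combined = pd.concat([df_a, df_b], ignore_index=True)

# Row counts should add up, and matching columns introduce no nulls.
assert len(combined) == len(df_a) + len(df_b)
null_total = int(combined.isnull().sum().sum())

# A column missing from one input is filled with NaN in the result.
df_c = pd.DataFrame({"Accident_Index": ["x"]})
mismatched = pd.concat([df_a, df_c], ignore_index=True)
mismatched_nulls = int(mismatched["Year"].isnull().sum())

print(null_total, mismatched_nulls)
```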

## Question 2B

In [57]:
df7904b_Q2B.shape

(24290, 18)

In [58]:
df7904c_Q2B.shape

(2808172, 18)

In [59]:
df7904d_Q2B.shape

(1818065, 18)

In [60]:
df0514_Q2B.shape

(2759693, 18)

In [61]:
df1516_Q2B.shape

(434637, 18)

In [62]:
24290 + 2808172 + 1818065 + 2759693 + 434637

7844857

You can see here that after dropping data our total is much smaller. It has gone from roughly 13 million rows to just under 8 million.

We have decided to concatenate the first two dataframes (df7904b_Q2B and df7904c_Q2B) and the last two (df0514_Q2B and df1516_Q2B), because df7904b_Q2B and df1516_Q2B are much smaller than the others.

In [63]:
df7904bc_Q2B = pd.concat([df7904b_Q2B, df7904c_Q2B])

In [64]:
df7904bc_Q2B.shape

(2832462, 18)

In [65]:
df0516_Q2B = pd.concat([df0514_Q2B, df1516_Q2B])

In [66]:
df0516_Q2B.shape

(3194330, 18)

As you can see below, no null values were created after concatenating the dataframes.

In [67]:
df7904bc_Q2B.isnull().sum()

Accident_Index                0
Accident_Severity             0
Number_of_Vehicles            0
Number_of_Casualties          0
Date                          0
Day_of_Week                   0
Time                          0
Local_Authority_(District)    0
Light_Conditions              0
Weather_Conditions            0
Road_Surface_Conditions       0
Special_Conditions_at_Site    0
Vehicle_Type                  0
Vehicle_Manoeuvre             0
Sex_of_Driver                 0
Date_day                      0
Month                         0
Year                          0
dtype: int64

In [68]:
df0516_Q2B.isnull().sum()

Accident_Index                0
Accident_Severity             0
Number_of_Vehicles            0
Number_of_Casualties          0
Date                          0
Day_of_Week                   0
Time                          0
Local_Authority_(District)    0
Light_Conditions              0
Weather_Conditions            0
Road_Surface_Conditions       0
Special_Conditions_at_Site    0
Vehicle_Type                  0
Vehicle_Manoeuvre             0
Sex_of_Driver                 0
Date_day                      0
Month                         0
Year                          0
dtype: int64

# 7. Saving Data to Pickle Files

## Q2A

In [69]:
# df7904c_Q2A
pickle_save_time = %timeit -o df7904c_Q2A.to_pickle("../../data/processed/800_Q2A_final_prep_1.pkl")

pickle_save_time

13.7 s ± 2.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 13.7 s ± 2.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [70]:
# df7904d_Q2A
pickle_save_time = %timeit -o df7904d_Q2A.to_pickle("../../data/processed/800_Q2A_final_prep_2.pkl")

pickle_save_time

10.5 s ± 3.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 10.5 s ± 3.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [71]:
# df0516_Q2A
pickle_save_time = %timeit -o df0516_Q2A.to_pickle("../../data/processed/800_Q2A_final_prep_3.pkl")

pickle_save_time

16.1 s ± 1.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 16.1 s ± 1.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>

## Q2B

In [72]:
# df7904bc_Q2B
pickle_save_time = %timeit -o df7904bc_Q2B.to_pickle("../../data/processed/800_Q2B_final_prep_1.pkl")

pickle_save_time

18.7 s ± 4.64 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 18.7 s ± 4.64 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [73]:
# df7904d_Q2B
pickle_save_time = %timeit -o df7904d_Q2B.to_pickle("../../data/processed/800_Q2B_final_prep_2.pkl")

pickle_save_time

10.8 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 10.8 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [74]:
# df0516_Q2B
pickle_save_time = %timeit -o df0516_Q2B.to_pickle("../../data/processed/800_Q2B_final_prep_3.pkl")

pickle_save_time

20.8 s ± 4.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 20.8 s ± 4.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>

# 8. Creating Data Dictionaries

## Q2A

In [75]:
data_dictionary.save(
    '../../data/processed/800_Q2A_final_prep_1.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 1979 - 2004 (c) Q2A (df7904c_Q2A).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2033305.0,1356471.0,199722AX01148,71.0,,,,,,,,0,0.0
Accident_Severity,2033305.0,3.0,slight,1708165.0,,,,,,,,0,0.0
Number_of_Vehicles,2033300.0,,,,2.11651,1.28066,1.0,2.0,2.0,2.0,88.0,0,0.0
Number_of_Casualties,2033300.0,,,,1.48016,1.15529,1.0,1.0,1.0,2.0,80.0,0,0.0
Date,2033305.0,2922.0,25/04/1997,1801.0,,,,,,,,0,0.0


In [76]:
data_dictionary.save(
    '../../data/processed/800_Q2A_final_prep_2.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 1979 - 2004 (d) Q2A (df7904d_Q2A).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,1242776.0,835645.0,2003070300788,28.0,,,,,,,,0,0.0
Accident_Severity,1242776.0,3.0,slight,1068500.0,,,,,,,,0,0.0
Number_of_Vehicles,1242780.0,,,,2.13035,0.979638,1.0,2.0,2.0,2.0,66.0,0,0.0
Number_of_Casualties,1242780.0,,,,1.49748,1.02673,1.0,1.0,1.0,2.0,90.0,0,0.0
Date,1242776.0,1827.0,26/05/2000,1280.0,,,,,,,,0,0.0


In [77]:
data_dictionary.save(
    '../../data/processed/800_Q2A_final_prep_3.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 2005 - 2016 Q2A (df0516_Q2A).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2355070.0,1561553.0,2013460234852,52.0,,,,,,,,0,0.0
Accident_Severity,2355070.0,3.0,slight,2028868.0,,,,,,,,0,0.0
Number_of_Vehicles,2355070.0,,,,2.11405,0.947346,1.0,2.0,2.0,2.0,67.0,0,0.0
Number_of_Casualties,2355070.0,,,,1.4762,1.0322,1.0,1.0,1.0,2.0,93.0,0,0.0
Date,2355070.0,4383.0,21/10/2005,1011.0,,,,,,,,0,0.0


## Q2B

In [78]:
data_dictionary.save(
    '../../data/processed/800_Q2B_final_prep_1.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 1979 - 2004 (bc) Q2B (df7904bc_Q2B).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2832462.0,1617016.0,199722AX01148,88.0,,,,,,,,0,0.0
Accident_Severity,2832462.0,3.0,slight,2381016.0,,,,,,,,0,0.0
Number_of_Vehicles,2832460.0,,,,2.10844,1.23626,1.0,2.0,2.0,2.0,88.0,0,0.0
Number_of_Casualties,2832460.0,,,,1.44127,1.11126,1.0,1.0,1.0,2.0,80.0,0,0.0
Date,2832462.0,2922.0,25/04/1997,2338.0,,,,,,,,0,0.0


In [79]:
data_dictionary.save(
    '../../data/processed/800_Q2B_final_prep_2.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 1979 - 2004 (d) Q2B (df7904d_Q2B).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,1818065.0,1033652.0,2003070300788,40.0,,,,,,,,0,0.0
Accident_Severity,1818065.0,3.0,slight,1562694.0,,,,,,,,0,0.0
Number_of_Vehicles,1818060.0,,,,2.121,0.95333,1.0,2.0,2.0,2.0,66.0,0,0.0
Number_of_Casualties,1818060.0,,,,1.45772,0.994761,1.0,1.0,1.0,2.0,90.0,0,0.0
Date,1818065.0,1827.0,25/10/2002,1801.0,,,,,,,,0,0.0


In [80]:
data_dictionary.save(
    '../../data/processed/800_Q2B_final_prep_3.pkl', 

"""\
Aggregate raw data for UK Road Safety data for years 2005 - 2016 Q2B (df0516_Q2B).
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,3194330.0,1818307.0,2013460234852,67.0,,,,,,,,0,0.0
Accident_Severity,3194330.0,3.0,slight,2745292.0,,,,,,,,0,0.0
Number_of_Vehicles,3194330.0,,,,2.10637,0.921413,1.0,2.0,2.0,2.0,67.0,0,0.0
Number_of_Casualties,3194330.0,,,,1.43969,0.998722,1.0,1.0,1.0,2.0,93.0,0,0.0
Date,3194330.0,4383.0,21/10/2005,1471.0,,,,,,,,0,0.0
