# 900_Q3_final_prep

## Purpose

The purpose of this notebook is to finalise the preparation of our Road Safety datasets for our third research question. We will focus on the 'Date' column: splitting it into day, month and year, sorting our data by year, and concatenating our datasets to form the final prep pickle files that will enable us to answer research question three: _"Does Location impact Road Safety?"_


### Notebook Contents:

* __1:__ Loading our Datasets

* __2:__ Question Three: (A)
     * __2.1:__ Sort by Year
     * __2.2:__ Creating new Datasets
     * __2.3:__ Check for Null Values

* __3:__ Question Three: (B)     
     * __3.1:__ Sort by Year
     * __3.2:__ Creating new Datasets
     * __3.3:__ Check for Null Values

* __4:__ Saving Datasets to pickle files


* __5:__ Creating Data Dictionaries


## Datasets

* __Input__: 

* 600_prep_missing_values_7904c_Q3A.pkl (Recorded UK Road Accident Data 1979 - 2004 (c) for RQ3 (A), with missing values removed)


* 600_prep_missing_values_7904c_Q3B.pkl (Recorded UK Road Accident Data 1979 - 2004 (c) for RQ3 (B), with missing values removed)


* 600_prep_missing_values_7904d_Q3A.pkl (Recorded UK Road Accident Data 1979 - 2004 (d) for RQ3 (A), with missing values removed)


* 600_prep_missing_values_7904d_Q3B.pkl (Recorded UK Road Accident Data 1979 - 2004 (d) for RQ3 (B), with missing values removed)


* 500_prep_missing_values_0514_Q3A.pkl (Recorded UK Road Accident Data 2005 - 2014 for RQ3 (A), with missing values removed)


* 500_prep_missing_values_0514_Q3B.pkl (Recorded UK Road Accident Data 2005 - 2014 for RQ3 (B), with missing values removed)


* 600_prep_missing_values_1516_Q3A.pkl (Recorded UK Road Accident Data 2015 - 2016 for RQ3 (A), with missing values removed)


* 600_prep_missing_values_1516_Q3B.pkl (Recorded UK Road Accident Data 2015 - 2016 for RQ3 (B), with missing values removed)


* __Output__: 

* 900_Q3A_final_prep_1.pkl (Fully prepared dataset 1 of UK Road Safety Data from 1999 - 2014, for RQ3(A))


* 900_Q3A_final_prep_2.pkl (Fully prepared dataset 2 of UK Road Safety Data from 1999 - 2014, for RQ3(A))


* 900_Q3B_final_prep_1.pkl (Fully prepared dataset 1 of UK Road Safety Data from 1999 - 2014, for RQ3(B))


* 900_Q3B_final_prep_2.pkl (Fully prepared dataset 2 of UK Road Safety Data from 1999 - 2014, for RQ3(B))

In [1]:
import os
import sys

import pandas as pd
import numpy as np

module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.helpers import data_dictionary

%matplotlib inline

## 1.
## Loading the Datasets

First, we will read in our input files using the pd.read_pickle function.

In [5]:
df_7904c_Q3A= pd.read_pickle('../../data/processed/600_prep_missing_values_7904c_Q3A.pkl')
df_7904c_Q3A.shape

(448332, 16)

In [6]:
df_7904c_Q3B= pd.read_pickle('../../data/processed/600_prep_missing_values_7904c_Q3B.pkl')
df_7904c_Q3B.shape

(199269, 13)

In [7]:
df_7904d_Q3A= pd.read_pickle('../../data/processed/600_prep_missing_values_7904d_Q3A.pkl')
df_7904d_Q3A.shape

(1842428, 16)

In [8]:
df_7904d_Q3B= pd.read_pickle('../../data/processed/600_prep_missing_values_7904d_Q3B.pkl')
df_7904d_Q3B.shape

(1047006, 13)

In [9]:
df_0514_Q3A= pd.read_pickle('../../data/processed/500_prep_missing_values_0514_Q3A.pkl')
df_0514_Q3A.shape

(2810328, 16)

In [10]:
df_0514_Q3B= pd.read_pickle('../../data/processed/500_prep_missing_values_0514_Q3B.pkl')
df_0514_Q3B.shape

(2133735, 13)

In [11]:
df_1516_Q3A= pd.read_pickle('../../data/processed/600_prep_missing_values_1516_Q3A.pkl')
df_1516_Q3A.shape

(443646, 16)

In [12]:
df_1516_Q3B= pd.read_pickle('../../data/processed/600_prep_missing_values_1516_Q3B.pkl')
df_1516_Q3B.shape

(183141, 13)
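The eight read cells above follow one pattern, so they could also be written as a loop. The following is a minimal sketch only: `load_frames` is a hypothetical helper (not part of this project), and the two `demo_*.pkl` files are tiny stand-ins written to a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_frames(data_dir, file_names):
    """Read each pickle in file_names from data_dir into a dict keyed by file stem."""
    frames = {}
    for name in file_names:
        path = Path(data_dir) / name
        frames[path.stem] = pd.read_pickle(path)
    return frames


# Stand-in files only; the real inputs live in ../../data/processed/.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Date": ["25/12/1999"]}).to_pickle(Path(tmp) / "demo_Q3A.pkl")
    pd.DataFrame({"Date": ["04/01/2005", "05/01/2005"]}).to_pickle(Path(tmp) / "demo_Q3B.pkl")

    frames = load_frames(tmp, ["demo_Q3A.pkl", "demo_Q3B.pkl"])

# Collect the shape of every loaded frame in one place.
shapes = {key: df.shape for key, df in frames.items()}
print(shapes)
```

Keeping the frames in a dictionary makes it easier to apply the same step to every dataset later on.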

## 2.
# Question Three: 

## (A)

First we will begin by adding up the rows of each of our datasets (seen above), to see how much data we will be working with.

In [13]:
448332 + 1842428 + 2810328 + 443646

5544734

From above, we can see that our datasets for Q3 (A) together add up to just over 5,500,000 rows.

As this is such a large number of rows, we will continue to use multiple datasets rather than concatenating them all for the time being.
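Typing the row counts by hand makes it easy to copy a number incorrectly, so the total could instead be computed from the DataFrames themselves. This is a sketch with small stand-in frames; the dictionary keys are illustrative labels, not names used elsewhere in the project.

```python
import pandas as pd

# Tiny stand-in frames; in this notebook they would be the Q3 (A) DataFrames.
frames = {
    "7904c": pd.DataFrame({"Year": ["1999", "1999"]}),
    "7904d": pd.DataFrame({"Year": ["2000", "2001", "2002"]}),
    "0514": pd.DataFrame({"Year": ["2005"]}),
}

# len(df) is the number of rows, i.e. df.shape[0].
row_counts = {name: len(df) for name, df in frames.items()}
total_rows = sum(row_counts.values())
print(row_counts, total_rows)
```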


## 2.1
## _Sort By Year_

We will split our 'Date' column into three separate columns for Day, Month and Year using the str.split method.

In [14]:
df_7904c_Q3A[['Date_Day', 'Month', 'Year']] = df_7904c_Q3A['Date'].str.split(pat = '/', n=-1, expand=True)

In [15]:
df_7904d_Q3A[['Date_Day', 'Month', 'Year']] = df_7904d_Q3A['Date'].str.split(pat = '/', n=-1, expand=True)

In [16]:
df_0514_Q3A[['Date_Day', 'Month', 'Year']] = df_0514_Q3A['Date'].str.split(pat = '/', n=-1, expand=True)

In [17]:
df_1516_Q3A[['Date_Day', 'Month', 'Year']] = df_1516_Q3A['Date'].str.split(pat = '/', n=-1, expand=True)
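The str.split approach above produces string columns. As an alternative sketch (not the method used in this notebook), the 'Date' column could be parsed once with pd.to_datetime, assuming every date follows the day/month/year pattern seen in the data, and the parts read off as integers:

```python
import pandas as pd

# Stand-in rows using the same dd/mm/yyyy format as the 'Date' column.
df = pd.DataFrame({"Date": ["25/12/1999", "04/01/2005", "29/10/2016"]})

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(df["Date"], format="%d/%m/%Y")

df["Date_Day"] = parsed.dt.day
df["Month"] = parsed.dt.month
df["Year"] = parsed.dt.year
print(df)
```

Integer years sort and compare numerically, whereas the string years from str.split sort correctly only because they all have four digits.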

We will now sort our datasets in ascending order according to the 'Year' Column. 

In [18]:
df_7904c_Q3A = df_7904c_Q3A.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_7904c_Q3A.head(2) #display first 2 rows of dataset to see oldest recorded date.

Unnamed: 0,Accident_Index,Longitude,Latitude,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Speed_limit,Junction_Detail,Urban_or_Rural_Area,Vehicle_Type,Vehicle_Location-Restricted_Lane,Sex_of_Driver,Date_Day,Month,Year
8521268,1999010SU0945,-0.271752,51.715661,slight,1,1,25/12/1999,sunday,09:30,33,70,slip road,rural,car,On main carriageway - not in restricted lane,female,25,12,1999
8521269,1999010SU0946,-0.239977,51.695136,slight,2,1,17/12/1999,saturday,18:38,33,70,not at junction or within 20 metres,rural,goods 7.5 tonnes mgw and over,On main carriageway - not in restricted lane,male,17,12,1999


In [19]:
df_7904d_Q3A = df_7904d_Q3A.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_7904d_Q3A.head(2)

Unnamed: 0,Accident_Index,Longitude,Latitude,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Speed_limit,Junction_Detail,Urban_or_Rural_Area,Vehicle_Type,Vehicle_Location-Restricted_Lane,Sex_of_Driver,Date_Day,Month,Year
9000001,200001TX00967,-0.314746,51.489498,slight,3,2,11/09/2000,tuesday,06:47,25,40,not at junction or within 20 metres,urban,van / goods 3.5 tonnes mgw or under,On main carriageway - not in restricted lane,male,11,9,2000
9000002,200001TX00967,-0.314746,51.489498,slight,3,2,11/09/2000,tuesday,06:47,25,40,not at junction or within 20 metres,urban,car,On main carriageway - not in restricted lane,male,11,9,2000


In [20]:
df_0514_Q3A = df_0514_Q3A.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_0514_Q3A.head(2)

Unnamed: 0,Accident_Index,Longitude,Latitude,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Speed_limit,Junction_Detail,Vehicle_Type,Urban_or_Rural_Area,Vehicle_Location-Restricted_Lane,Sex_of_Driver,Date_Day,Month,Year
0,200501BS00001,-0.19117,51.489096,serious,1,1,04/01/2005,wednesday,17:42,12,30,not at junction or within 20 metres,car,urban,On main carriageway - not in restricted lane,female,4,1,2005
1,200501BS00002,-0.211708,51.520075,slight,1,1,05/01/2005,thursday,17:36,12,30,crossroads,bus or coach,urban,On main carriageway - not in restricted lane,male,5,1,2005


In [21]:
df_1516_Q3A = df_1516_Q3A.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_1516_Q3A.tail(2) #display last 2 rows of dataset to see most recent recorded date.

Unnamed: 0,Accident_Index,Longitude,Latitude,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Speed_limit,Junction_Detail,Urban_or_Rural_Area,Vehicle_Type,Vehicle_Location-Restricted_Lane,Sex_of_Driver,Date_Day,Month,Year
479546,2016984131316,-3.272584,54.989597,slight,1,3,29/10/2016,sunday,20:00,917,40.0,not at junction or within 20 metres,rural,car,On main carriageway - not in restricted lane,male,29,10,2016
479547,2016984133416,-3.448392,55.310151,slight,1,2,25/12/2016,monday,12:30,917,70.0,not at junction or within 20 metres,rural,car,On main carriageway - not in restricted lane,male,25,12,2016


### _Analysis_

From above, we can see that our datasets for Question Three part (A) contain data from 1999 to 2016.
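Reading the year range off the first and last rows relies on the data being in order. A more direct check, sketched here on stand-in data, is to take the minimum and maximum of the 'Year' column and count the rows per year:

```python
import pandas as pd

# Stand-in 'Year' column holding strings, as produced by str.split.
df = pd.DataFrame({"Year": ["2001", "1999", "2004", "1999"]})

years = df["Year"].astype(int)

# Earliest and latest year present, regardless of row order.
year_range = (int(years.min()), int(years.max()))

# Number of rows recorded for each year, in ascending year order.
rows_per_year = years.value_counts().sort_index()

print(year_range)
print(rows_per_year)
```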

## 2.2
## Creating new datasets

Now that we have our final datasets fully prepared, we will concatenate them to reduce the number of datasets we need to analyse when answering research question three (A).

We will do this by creating two datasets (reduced from four), each with no more than 3 million rows (a size our processors can manage easily). We will ensure that the accidents recorded in each final dataset are in ascending order of year.


*We will not be using our df_1516_Q3A dataset because the data for 2015 is missing in Q3 (B) below.*

In [22]:
df_7904_Q3A = pd.concat([df_7904c_Q3A,df_7904d_Q3A])
df_7904_Q3A.shape

(2290760, 19)
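After a concatenation it can be worth confirming that no rows were lost and that the columns are unchanged. The sketch below uses stand-in frames; `ignore_index=True` is an optional choice made here to get a fresh 0..n-1 index and is not what the cell above does.

```python
import pandas as pd

# Stand-ins for the two parts being combined.
part_c = pd.DataFrame({"Accident_Index": ["a", "b"], "Year": ["1999", "1999"]})
part_d = pd.DataFrame({"Accident_Index": ["c"], "Year": ["2000"]})

# Stack the rows of both parts; ignore_index renumbers the rows from 0.
combined = pd.concat([part_c, part_d], ignore_index=True)

# The row count should equal the sum of the parts, with the same columns.
rows_match = len(combined) == len(part_c) + len(part_d)
columns_match = list(combined.columns) == list(part_c.columns)
print(combined.shape, rows_match, columns_match)
```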

## 2.3
## Check for Null Values

Ensure our final datasets do not contain null values.

In [23]:
df_7904_Q3A.isnull().sum()

Accident_Index                      0
Longitude                           0
Latitude                            0
Accident_Severity                   0
Number_of_Vehicles                  0
Number_of_Casualties                0
Date                                0
Day_of_Week                         0
Time                                0
Local_Authority_(District)          0
Speed_limit                         0
Junction_Detail                     0
Urban_or_Rural_Area                 0
Vehicle_Type                        0
Vehicle_Location-Restricted_Lane    0
Sex_of_Driver                       0
Date_Day                            0
Month                               0
Year                                0
dtype: int64

In [24]:
df_0514_Q3A.isnull().sum()

Accident_Index                      0
Longitude                           0
Latitude                            0
Accident_Severity                   0
Number_of_Vehicles                  0
Number_of_Casualties                0
Date                                0
Day_of_Week                         0
Time                                0
Local_Authority_(District)          0
Speed_limit                         0
Junction_Detail                     0
Vehicle_Type                        0
Urban_or_Rural_Area                 0
Vehicle_Location-Restricted_Lane    0
Sex_of_Driver                       0
Date_Day                            0
Month                               0
Year                                0
dtype: int64
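Instead of checking the printed counts by eye, the null check could be made to fail loudly. `assert_no_nulls` below is a hypothetical helper written for this sketch, demonstrated on stand-in data:

```python
import pandas as pd


def assert_no_nulls(df, name):
    """Raise ValueError naming every column of df that contains a null value."""
    null_counts = df.isnull().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        raise ValueError(f"{name} has null values in: {offenders.to_dict()}")


clean = pd.DataFrame({"Year": ["1999", "2000"], "Month": ["12", "9"]})
assert_no_nulls(clean, "clean")  # no nulls, so nothing is raised

dirty = pd.DataFrame({"Year": ["1999", None]})
try:
    assert_no_nulls(dirty, "dirty")
    error_message = ""
except ValueError as err:
    error_message = str(err)

print(error_message)
```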

## 3.
# Question Three: 

## (B)

First we will begin by adding up the rows of each of our datasets (seen above when reading in our datasets), to see how much data we will be working with.

In [25]:
199269 + 1047006 + 2133735 + 183141

3563151

As we can see from above, our datasets for Question 3 (B) total just over 3.5 million rows. For this reason, we will sort each individual dataset below before concatenating them.

## 3.1
## _Sort By Year_

We will split our 'Date' column into three separate columns for Day, Month and Year using the str.split method.

In [26]:
df_7904c_Q3B[['Date_Day', 'Month', 'Year']] = df_7904c_Q3B['Date'].str.split(pat = '/', n=-1, expand=True)  

In [27]:
df_7904d_Q3B[['Date_Day', 'Month', 'Year']] = df_7904d_Q3B['Date'].str.split(pat = '/', n=-1, expand=True) 

In [28]:
df_0514_Q3B[['Date_Day', 'Month', 'Year']] = df_0514_Q3B['Date'].str.split(pat = '/', n=-1, expand=True) 

In [29]:
df_1516_Q3B[['Date_Day', 'Month', 'Year']] = df_1516_Q3B['Date'].str.split(pat = '/', n=-1, expand=True) 

In [30]:
df_7904c_Q3B = df_7904c_Q3B.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_7904c_Q3B.head(2) #display first two rows to see oldest recorded date

Unnamed: 0,Accident_Index,Longitude,Latitude,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Local_Authority_(District),Special_Conditions_at_Site,Urban_or_Rural_Area,Vehicle_Type,Sex_of_Driver,Driver_IMD_Decile,Date_Day,Month,Year
8521268,1999010SU0945,-0.271752,51.715661,1,1,25/12/1999,sunday,33,none,rural,car,female,most deprived 10%,25,12,1999
8521270,1999010SU0946,-0.239977,51.695136,2,1,17/12/1999,saturday,33,none,rural,car,male,most deprived 40-50%,17,12,1999


In [31]:
df_7904d_Q3B = df_7904d_Q3B.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_7904d_Q3B.head(2)

Unnamed: 0,Accident_Index,Longitude,Latitude,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Local_Authority_(District),Special_Conditions_at_Site,Urban_or_Rural_Area,Vehicle_Type,Sex_of_Driver,Driver_IMD_Decile,Date_Day,Month,Year
9000690,200001TX01383,-0.349817,51.471652,2,1,07/09/2000,friday,25,none,urban,van / goods 3.5 tonnes mgw or under,male,most deprived 10-20%,7,9,2000
9000691,200001TX01383,-0.349817,51.471652,2,1,07/09/2000,friday,25,none,urban,car,male,most deprived 40-50%,7,9,2000


In [32]:
df_0514_Q3B = df_0514_Q3B.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_0514_Q3B.head(2)

Unnamed: 0,Accident_Index,Longitude,Latitude,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Local_Authority_(District),Special_Conditions_at_Site,Urban_or_Rural_Area,Vehicle_Type,Sex_of_Driver,Driver_IMD_Decile,Date_Day,Month,Year
0,200501BS00001,-0.19117,51.489096,1,1,04/01/2005,wednesday,12,none,urban,car,female,less deprived 30-40%,4,1,2005
2,200501BS00003,-0.206458,51.525301,2,1,06/01/2005,friday,12,none,urban,bus or coach,male,most deprived 10-20%,6,1,2005


In [33]:
df_1516_Q3B = df_1516_Q3B.sort_values('Year', kind='stable') #sort_values returns a new DataFrame, so assign it back
df_1516_Q3B.head(2)

Unnamed: 0,Accident_Index,Longitude,Latitude,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Local_Authority_(District),Special_Conditions_at_Site,Urban_or_Rural_Area,Vehicle_Type,Sex_of_Driver,Driver_IMD_Decile,Date_Day,Month,Year
227048,2016010000005,-0.279323,51.584754,2,1,01/11/2016,wednesday,28,none,urban,taxi/private hire car,male,most deprived 20-30%,1,11,2016
227049,2016010000005,-0.279323,51.584754,2,1,01/11/2016,wednesday,28,none,urban,motorcycle,male,most deprived 30-40%,1,11,2016


### _Analysis_

From above, we can see that our datasets for Question Three part (B) contain data from 1999 to 2016. However, we are missing data for 2015.

We want to ensure that the datasets used for Question Three parts (A) and (B) are consistent, i.e. that they contain the same years. Unfortunately, as we are missing all data from 2015 for part (B), *we must disregard both df_1516_Q3A and df_1516_Q3B.* This is due to the adverse effect that the missing year would have on our result graphs (e.g. a graph skipping from 2014 to 2016).

Therefore, our resulting datasets for Question Three parts (A) and (B) will both range from the years 1999 - 2014.
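A gap such as the missing year described above can be found programmatically by comparing the years present against the full range between the earliest and latest year. `missing_years` is a hypothetical helper for this sketch, and the series is stand-in data rather than the real Q3 (B) dataset.

```python
import pandas as pd


def missing_years(year_series):
    """Return the years between the min and max of year_series that never occur."""
    years = set(year_series.astype(int))
    full_range = set(range(min(years), max(years) + 1))
    return sorted(full_range - years)


# Stand-in 'Year' values with one year absent from the middle of the range.
years_b = pd.Series(["2013", "2014", "2016", "2016"])
gaps = missing_years(years_b)
print(gaps)
```

Running the same check on the part (A) and part (B) datasets would show whether both cover the same years.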

## 3.2
## Creating new datasets

Now that we have our final datasets fully prepared, we will concatenate our datasets so that we can reduce the number of datasets we will need to perform our analysis on to answer our research question three (B).

We will do this by creating two datasets (reduced from four), each with around 2 million rows (a size our processors can manage easily). We will ensure that the accidents recorded in each final dataset are in ascending order of year.

In [31]:
df_7904_Q3B = pd.concat([df_7904c_Q3B,df_7904d_Q3B])
df_7904_Q3B.shape

(1246275, 16)
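Whether a concatenated dataset really is in ascending order of year can be verified rather than assumed. The sketch below, on stand-in data, tests the order with `is_monotonic_increasing` and re-sorts the combined frame when the test fails:

```python
import pandas as pd

# Stand-in parts whose rows are not in year order.
older = pd.DataFrame({"Year": ["2000", "1999", "2001"]})
newer = pd.DataFrame({"Year": ["2004", "2002"]})

combined = pd.concat([older, newer], ignore_index=True)
ordered_before = bool(combined["Year"].astype(int).is_monotonic_increasing)

# sort_values returns a new DataFrame, so the result is assigned back.
combined = combined.sort_values("Year", kind="stable").reset_index(drop=True)
ordered_after = bool(combined["Year"].astype(int).is_monotonic_increasing)

print(ordered_before, ordered_after)
```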

## 3.3
## Check for Null Values

Ensure our final datasets do not contain null values.

In [33]:
df_7904_Q3B.isnull().sum()

Accident_Index                0
Longitude                     0
Latitude                      0
Number_of_Vehicles            0
Number_of_Casualties          0
Date                          0
Day_of_Week                   0
Local_Authority_(District)    0
Special_Conditions_at_Site    0
Urban_or_Rural_Area           0
Vehicle_Type                  0
Sex_of_Driver                 0
Driver_IMD_Decile             0
Date_Day                      0
Month                         0
Year                          0
dtype: int64

In [13]:
df_0514_Q3B.isnull().sum()

Accident_Index                0
Longitude                     0
Latitude                      0
Number_of_Vehicles            0
Number_of_Casualties          0
Date                          0
Day_of_Week                   0
Local_Authority_(District)    0
Special_Conditions_at_Site    0
Urban_or_Rural_Area           0
Vehicle_Type                  0
Sex_of_Driver                 0
Driver_IMD_Decile             0
Date_Day                      0
Month                         0
Year                          0
dtype: int64

## 4.
## Save to Pickle Files

Now that we have checked for null values, we will save our final datasets to pickle files.

### (A)

In [34]:
pickle_save_time = %timeit -o df_7904_Q3A.to_pickle("../../data/processed/900_Q3A_final_prep_1.pkl") #save dataset into a pickle file and print the save time

pickle_save_time

17 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 17 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [35]:
pickle_save_time = %timeit -o df_0514_Q3A.to_pickle("../../data/processed/900_Q3A_final_prep_2.pkl") #save dataset into a pickle file and print the save time

pickle_save_time

20.5 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 20.5 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

### (B)

In [38]:
pickle_save_time = %timeit -o df_7904_Q3B.to_pickle("../../data/processed/900_Q3B_final_prep_1.pkl") #save dataset into a pickle file and print the save time

pickle_save_time

9.26 s ± 795 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 9.26 s ± 795 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [14]:
pickle_save_time = %timeit -o df_0514_Q3B.to_pickle("../../data/processed/900_Q3B_final_prep_2.pkl") #save dataset into a pickle file and print the save time

pickle_save_time

15.6 s ± 360 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 15.6 s ± 360 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>
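As the outputs above show, %timeit repeats each save over 7 runs. If a single timed write is enough, the save can be timed once with time.perf_counter and then read back to confirm the file is intact. This is a sketch on stand-in data written to a temporary folder, not to the project's data directory.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd

# Stand-in for one of the final datasets.
df = pd.DataFrame({"Accident_Index": ["a", "b"], "Year": ["1999", "2000"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_final_prep.pkl"

    start = time.perf_counter()
    df.to_pickle(out_path)  # the file is written exactly once
    elapsed = time.perf_counter() - start

    # Read the file back while the temporary folder still exists.
    reloaded = pd.read_pickle(out_path)

round_trip_ok = reloaded.equals(df)
print(f"saved in {elapsed:.4f} s, round trip ok: {round_trip_ok}")
```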

## 5.
## Create Data Dictionaries

We will create data dictionaries for each of our pickle files, summarising their contents.

### (A)

In [36]:
data_dictionary.save(
    '../../data/processed/900_Q3A_final_prep_1.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2290760.0,1303909.0,2003070300788,40.0,,,,,,,,0,0.0
Longitude,2290760.0,,,,-1.44765,1.37638,-7.53617,-2.34996,-1.43441,-0.24279,1.76059,0,0.0
Latitude,2290760.0,,,,52.5707,1.40685,49.9128,51.4983,52.3226,53.4615,60.8017,0,0.0
Accident_Severity,2290760.0,3.0,slight,1968731.0,,,,,,,,0,0.0
Number_of_Vehicles,2290760.0,,,,2.11502,0.943349,1.0,2.0,2.0,2.0,66.0,0,0.0


In [37]:
data_dictionary.save(
    '../../data/processed/900_Q3A_final_prep_2.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2810328.0,1601362.0,2013460234852,67.0,,,,,,,,0,0.0
Longitude,2810330.0,,,,-1.42802,1.39369,-7.51623,-2.34529,-1.39056,-0.228997,1.76201,0,0.0
Latitude,2810330.0,,,,52.558,1.42906,49.9129,51.487,52.2685,53.4516,60.7575,0,0.0
Accident_Severity,2810328.0,3.0,slight,2423471.0,,,,,,,,0,0.0
Number_of_Vehicles,2810330.0,,,,2.10489,0.930107,1.0,2.0,2.0,2.0,67.0,0,0.0


### (B)

In [42]:
data_dictionary.save(
    '../../data/processed/900_Q3B_final_prep_1.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,1246275.0,820848.0,2003070300788.0,29.0,,,,,,,,0,0.0
Longitude,1246280.0,,,,-1.13765,1.0968,-7.52928,-2.07314,-1.22489,-0.180311,1.75861,0,0.0
Latitude,1246280.0,,,,52.3971,1.06722,50.3271,51.5116,52.2574,53.3896,60.2478,0,0.0
Number_of_Vehicles,1246280.0,,,,2.13014,0.938863,1.0,2.0,2.0,2.0,40.0,0,0.0
Number_of_Casualties,1246280.0,,,,1.46266,0.967684,1.0,1.0,1.0,2.0,67.0,0,0.0


In [15]:
data_dictionary.save(
    '../../data/processed/900_Q3B_final_prep_2.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2133735.0,1325488.0,2013460234852.0,52.0,,,,,,,,0,0.0
Longitude,2133740.0,,,,-1.19098,1.22116,-7.50412,-2.06828,-1.22207,-0.190866,1.76201,0,0.0
Latitude,2133740.0,,,,52.3433,1.15,49.9129,51.4563,52.1328,53.3776,60.311,0,0.0
Number_of_Vehicles,2133740.0,,,,2.11117,0.921435,1.0,2.0,2.0,2.0,67.0,0,0.0
Number_of_Casualties,2133740.0,,,,1.45142,1.00674,1.0,1.0,1.0,2.0,93.0,0,0.0
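The implementation of our data_dictionary helper is not shown in this notebook. As a rough illustration only, a table with the same column headings as the outputs above could be built from describe plus a missing-value count. `summarise` is a hypothetical function and is an assumption about how such a summary might be produced, not the helper's actual code.

```python
import pandas as pd


def summarise(df):
    """Per-column summary statistics plus missing-value count and percentage."""
    # One row per column of df, with count/unique/top/freq/mean/... as columns.
    summary = df.describe(include="all").T
    summary["Missing"] = df.isnull().sum()
    summary["%Missing"] = 100 * summary["Missing"] / len(df)
    return summary


# Stand-in data with one missing value in a text column.
df = pd.DataFrame({
    "Accident_Severity": ["slight", "slight", "serious", None],
    "Number_of_Vehicles": [1, 2, 2, 3],
})

summary = summarise(df)
print(summary)
```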
