# 100_load_roadsafety_datasets_1979-2014

### Purpose

In this notebook we begin our analysis of UK Road Safety Data 1979 - 2014, focusing on loading and reviewing our datasets. The data comes as four files: a Vehicle dataset and an Accident dataset for all recorded UK road accidents in each of two periods, 1979 - 2004 and 2005 - 2014. We will merge the Vehicle and Accident data for each period and then save the results into files.

### Notebook Contents:

* __1:__ Loading our Datasets

* __2:__ Combining the Datasets

     * __2.1:__ UK Road Accident data 1979 - 2004
     * __2.2:__ UK Road Accident data 2005 - 2014


* __3:__ Analyse Datasets

* __4:__ Saving the Datasets
   
    * __4.1:__ Saving to Pickle Files
    * __4.2:__ Creating Data Dictionaries


### Datasets

* __Input__:


* Vehicles7904.csv (Vehicle data for all Recorded UK Road Accidents from 1979 - 2004)


* Accidents7904.csv (Accident data for all Recorded UK Road Accidents from 1979 - 2004)


* Vehicles0514.csv (Vehicle data for all Recorded UK Road Accidents from 2005 - 2014)


* Accidents0514.csv (Accident data for all Recorded UK Road Accidents from 2005 - 2014)




* __Output__:


* 100_0514_accidents.pkl (Vehicle and Accident Data for all Recorded UK Road Accidents from 2005 - 2014)


* 100_7904a_accidents.pkl (First ~3 million rows of Vehicle and Accident Data from combined input files: Vehicles7904.csv & Accidents7904.csv)


* 100_7904b_accidents.pkl (Second ~3 million rows of Vehicle and Accident Data from combined input files: Vehicles7904.csv & Accidents7904.csv)


* 100_7904c_accidents.pkl (Third ~3 million rows of Vehicle and Accident Data from combined input files: Vehicles7904.csv & Accidents7904.csv)


* 100_7904d_accidents.pkl (Remaining ~2 million rows of Vehicle and Accident Data from combined input files: Vehicles7904.csv & Accidents7904.csv)

In [1]:
import os
import sys
import hashlib
import pandas as pd

module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.helpers import data_dictionary
    
%matplotlib inline

# 1. Loading the Datasets

Our datasets are all in standard CSV format. Because they are very large, we read each CSV file into its own dataframe using the Pandas read_csv method.

In [2]:
Veh_7904 = pd.read_csv('../../data/raw/Vehicles7904.csv') #Vehicle Data 1979 - 2004
Veh_7904.shape

(10981968, 21)

In [3]:
Acc_7904 = pd.read_csv('../../data/raw/Accidents7904.csv') #Accident Data 1979 - 2004
Acc_7904.shape

(6224198, 32)

In [4]:
Veh_0514 = pd.read_csv('../../data/raw/Vehicles0514.csv') #Vehicle Data 2005 - 2014
Veh_0514.shape

(3004425, 22)

In [5]:
Acc_0514 = pd.read_csv('../../data/raw/Accidents0514.csv') # Accident Data 2005 - 2014
Acc_0514.shape

(1640597, 32)

We now have each of our 4 datasets:
* Two datasets containing accident reports and vehicles involved in UK road accidents from 1979 to 2004
* Two datasets containing accident reports and vehicles involved in UK road accidents from 2005 to 2014
* Acc_7904 and Acc_0514 both have 32 columns
* Veh_7904 and Veh_0514 contain 21 and 22 columns respectively. We will look further into this mismatch later in the notebook (__3.__).

We acknowledge that this isn't a great example of DRY (_don't repeat yourself_) coding. However, due to the size of our datasets, we have chosen to load 4 datasets in this notebook and the remaining 4 in the next notebook (200_load_roadsafety_datasets_2015-2016.ipynb). As a result, we read the datasets relevant to this notebook one at a time.
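
One way to reduce the repetition would be to describe the files in a dictionary and read them in a loop. The following is only a minimal sketch under assumptions: the names `sources` and `load_all` are hypothetical, and small in-memory CSV strings stand in for the real files under `../../data/raw/` so that the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the raw CSV files. In the notebook the values
# would be file paths such as '../../data/raw/Vehicles7904.csv'.
sources = {
    "Veh_demo": io.StringIO("Acc_Index,Vehicle_Reference\nA1,1\nA1,2\nA2,1\n"),
    "Acc_demo": io.StringIO("Accident_Index,Accident_Severity\nA1,3\nA2,2\n"),
}

def load_all(sources):
    """Read every CSV in `sources` into a dataframe, keyed by its name."""
    return {name: pd.read_csv(src) for name, src in sources.items()}

frames = load_all(sources)
for name, df in frames.items():
    print(name, df.shape)  # report the dimensions of each dataframe
```

Keeping the dataframes in a dictionary also makes it easy to loop over them later, for example when printing their shapes.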

# 2. Combining the Datasets

We have decided to use the merge method to combine our vehicle and accident datasets for both year groups (1979-2004 and 2005-2014), producing two new datasets containing accident and vehicle data for 1979 - 2004, and 2005 - 2014.

* We have chosen to use the _merge_ method due to the row differences in our datasets. 

* More than one vehicle is often involved in a road accident, and each vehicle involved is represented as a row in our Veh_7904 and Veh_0514 datasets.

* We will do a check to ensure that our merged datasets contain only rows which have a corresponding accident index in both datasets. 

* We will therefore merge based on the common "Accident_Index" column.
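
The check described in the points above can be sketched with two optional parameters of `pd.merge`: `validate`, which raises an error if the accident key is not unique, and `indicator`, which adds a `_merge` column showing where each row came from. This is a sketch on made-up toy data, not the notebook's actual check. The dataframes `accidents` and `vehicles` and their values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical): accident A3 has no vehicle, vehicle A9 has no accident.
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Accident_Severity": [3, 2, 3],
})
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A9"],
    "Vehicle_Reference": [1, 2, 1, 1],
})

# An outer merge with indicator=True keeps every row and labels its origin
# as 'both', 'left_only' or 'right_only'.
checked = pd.merge(accidents, vehicles, on="Accident_Index",
                   how="outer", validate="one_to_many", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Accident_Index", "_merge"]])

# The default inner merge keeps only rows with a key present in both datasets.
merged = pd.merge(accidents, vehicles, on="Accident_Index", validate="one_to_many")
print(merged.shape)
```

If `unmatched` is empty, the inner merge loses no rows from either dataset.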

## 2.1 
### 1979 - 2004


Here, we will merge our Veh_7904 and Acc_7904 Datasets. 


Firstly, we need to rename the column "Acc_Index" to "Accident_Index" in our Veh_7904 dataset before the two datasets can be merged.

In [6]:
New_Veh_7904 = Veh_7904.rename(columns={"Acc_Index": "Accident_Index"}) #create new dataframe for Veh_7904 with updated column name
New_Veh_7904.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,...,1st_Point_of_Impact,Was_Vehicle_Left_Hand_Drive?,Journey_Purpose_of_Driver,Sex_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type
0,197901A11AD14,1,109,0,18,-1,-1,-1,-1,-1,...,-1,-1,-1,1,7,-1,-1,-1,-1,-1
1,197901A11AD14,2,104,0,13,-1,-1,-1,-1,-1,...,-1,-1,-1,1,-1,-1,-1,-1,-1,-1
2,197901A1BAW34,1,109,0,18,-1,-1,-1,-1,-1,...,-1,-1,-1,1,-1,-1,-1,-1,-1,-1
3,197901A1BFD77,1,109,0,18,-1,-1,1,-1,-1,...,-1,-1,-1,1,5,-1,-1,-1,-1,-1
4,197901A1BFD77,2,109,0,18,-1,-1,-1,-1,-1,...,-1,-1,-1,1,7,-1,-1,-1,-1,-1


Now that our Veh_7904 and Acc_7904 Datasets have a column named 'Accident_Index', we will use the merge method to join the datasets based on this column.

By default, merge performs an inner join, so only rows whose Accident_Index appears in both datasets are kept. Rows in the vehicle dataset with no corresponding Accident_Index in the accident dataset will be removed from the resulting dataset.

In [7]:
pd.set_option('display.max_columns', 54) # set the max display columns to 54 so that all columns are shown
All_7904 = pd.merge(Acc_7904, New_Veh_7904, on='Accident_Index') #merge Acc_7904 and New_Veh_7904 datasets based on Accident_Index Column
All_7904.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Local_Authority_(Highway),1st_Road_Class,1st_Road_Number,Road_Type,Speed_limit,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,Hit_Object_off_Carriageway,1st_Point_of_Impact,Was_Vehicle_Left_Hand_Drive?,Journey_Purpose_of_Driver,Sex_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type
0,197901A11AD14,,,,,1,3,2,1,18/01/1979,5,08:00,11,9999,3,4,1,30,1,4,-1,-1,-1,-1,1,8,1,-1,0,-1,-1,,1,109,0,18,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,7,-1,-1,-1,-1,-1
1,197901A11AD14,,,,,1,3,2,1,18/01/1979,5,08:00,11,9999,3,4,1,30,1,4,-1,-1,-1,-1,1,8,1,-1,0,-1,-1,,2,104,0,13,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1
2,197901A1BAW34,198460.0,894000.0,,,1,3,1,1,01/01/1979,2,01:00,23,9999,6,0,9,30,3,4,-1,-1,-1,-1,4,8,3,-1,0,-1,-1,,1,109,0,18,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1
3,197901A1BFD77,406380.0,307000.0,,,1,3,2,3,01/01/1979,2,01:25,17,9999,3,112,9,30,6,4,-1,-1,-1,-1,4,8,3,-1,0,-1,-1,,1,109,0,18,-1,-1,1,-1,-1,-1,-1,-1,-1,1,5,-1,-1,-1,-1,-1
4,197901A1BFD77,406380.0,307000.0,,,1,3,2,3,01/01/1979,2,01:25,17,9999,3,112,9,30,6,4,-1,-1,-1,-1,4,8,3,-1,0,-1,-1,,2,109,0,18,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,7,-1,-1,-1,-1,-1


Now that our datasets have been merged, we will check that the resulting dataframe ("All_7904") contains the correct number of columns and rows, which it does.

In [8]:
print("Acc_7904 dimensions: {}".format(Acc_7904.shape)) #original Accident dataset 1979 - 2004
print("New_Veh_7904 dimensions: {}".format(New_Veh_7904.shape)) #original Vehicle dataset 1979 - 2004
print("All_7904 dimensions: {}".format(All_7904.shape)) #merged Accident and Vehicle dataset 1979 - 2004

Acc_7904 dimensions: (6224198, 32)
New_Veh_7904 dimensions: (10981968, 21)
All_7904 dimensions: (10981968, 52)


## 2.2
### 2005 to 2014

We will use the merge method again to merge the vehicle and accident datasets for 2005 to 2014, based on the Accident_Index column which is common to both datasets.

As above, the merge method will only join rows whose Accident_Index appears in both datasets. For these datasets, the 'Accident_Index' column has the same name in both, so there is no need to rename it before merging.

In [9]:
All_0514 = pd.merge(Acc_0514, Veh_0514, on='Accident_Index') #merge both datasets based on Accident_Index Column
All_0514.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Local_Authority_(Highway),1st_Road_Class,1st_Road_Number,Road_Type,Speed_limit,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,Hit_Object_off_Carriageway,1st_Point_of_Impact,Was_Vehicle_Left_Hand_Drive?,Journey_Purpose_of_Driver,Sex_of_Driver,Age_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type
0,200501BS00001,525680.0,178240.0,-0.19117,51.489096,1,2,1,1,04/01/2005,3,17:42,12,E09000020,3,3218,6,30,0,-1,-1,0,0,1,1,2,2,0,0,1,1,E01002849,1,9,0,18,0,0,0,0,0,0,1,1,15,2,74,10,-1,-1,-1,7,1
1,200501BS00002,524170.0,181650.0,-0.211708,51.520075,1,3,1,1,05/01/2005,4,17:36,12,E09000020,4,450,3,30,6,2,5,0,0,5,4,1,1,0,0,1,1,E01002909,1,11,0,4,0,3,0,0,0,0,4,1,1,1,42,7,8268,2,3,-1,-1
2,200501BS00003,524520.0,182240.0,-0.206458,51.525301,1,3,2,1,06/01/2005,5,00:15,12,E09000020,5,0,6,30,0,-1,-1,0,0,0,4,1,1,0,0,1,1,E01002857,1,11,0,17,0,0,0,4,0,0,4,1,1,1,35,6,8300,2,5,2,1
3,200501BS00003,524520.0,182240.0,-0.206458,51.525301,1,3,2,1,06/01/2005,5,00:15,12,E09000020,5,0,6,30,0,-1,-1,0,0,0,4,1,1,0,0,1,1,E01002857,2,9,0,2,0,0,0,0,0,0,3,1,15,1,62,9,1762,1,6,1,1
4,200501BS00004,526900.0,177530.0,-0.173862,51.482442,1,3,1,1,07/01/2005,6,10:35,12,E09000020,3,3220,6,30,0,-1,-1,0,0,0,1,1,1,0,0,1,1,E01002840,1,9,0,18,0,0,0,0,0,0,1,1,15,2,49,8,1769,1,4,2,1


Our datasets have been merged. Below we will check that the resulting ("All_0514") dataframe contains the correct number of columns and rows after merging, which it does.

In [10]:
print("Acc_0514 dimensions: {}".format(Acc_0514.shape)) #original Accident dataset 2005 - 2014
print("Veh_0514 dimensions: {}".format(Veh_0514.shape)) #original Vehicle dataset 2005 - 2014
print("All_0514 dimensions: {}".format(All_0514.shape)) #merged Accident & Vehicle dataset 2005 - 2014

Acc_0514 dimensions: (1640597, 32)
Veh_0514 dimensions: (3004425, 22)
All_0514 dimensions: (3004425, 53)
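
The dimensions printed for both merges follow from simple arithmetic. The shared key column is counted once, so the merged column count is the sum of the two inputs minus one (32 + 21 - 1 = 52 and 32 + 22 - 1 = 53). The row count equals the vehicle row count when every vehicle row matches exactly one accident. As a minimal sketch on toy data, with `expected_merge_shape` being a hypothetical helper name, the same check can be written as an assertion:

```python
import pandas as pd

def expected_merge_shape(acc, veh):
    """Shape expected from an inner merge on one shared key column, assuming
    every vehicle row matches exactly one accident row."""
    return (veh.shape[0], acc.shape[1] + veh.shape[1] - 1)

# Toy data (hypothetical values) with the same structure as the real datasets.
acc = pd.DataFrame({"Accident_Index": ["A1", "A2"], "Accident_Severity": [3, 2]})
veh = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2"],
    "Vehicle_Reference": [1, 2, 1],
    "Vehicle_Type": [109, 104, 109],
})

merged = pd.merge(acc, veh, on="Accident_Index")
assert merged.shape == expected_merge_shape(acc, veh)
print(merged.shape)  # (3, 4)
```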


# 3. Analyse Datasets 

Once our datasets have been merged, we need to check that our merged datasets "All_7904" and "All_0514" contain the same number of columns.

It is important to perform this check to ensure that we have the same columns in each dataset, so that we can analyse them equally in subsequent notebooks.

In [11]:
All_0514.shape

(3004425, 53)

In [12]:
All_7904.shape

(10981968, 52)

We can see from above All_0514 has 53 columns, while All_7904 has only 52 columns.

To find which column(s) do not match, we will list the columns in both datasets and compare them.

In [13]:
cols_7904 = list(All_7904.columns) #create a list of the columns in the All_7904 dataset
cols_7904_df = pd.DataFrame(cols_7904) #create a dataframe out of this list

In [14]:
cols_0514 = list(All_0514.columns) #create a list of the columns in the All_0514 dataset
cols_0514_df = pd.DataFrame(cols_0514) #create a dataframe out of this list

In [15]:
cols_0514_df.isin(cols_7904_df) #compare whether both dataframes match

Unnamed: 0,0
0,True
1,True
2,True
3,True
4,True
5,True
6,True
7,True
8,True
9,True


From the above output, we can see that the last 7 columns in the All_7904 and All_0514 datasets return False, implying that they do not match.

Below we will list the column names for both datasets, focusing on the last 7 column names in each dataset to spot the issue.

In [16]:
All_0514.columns #print the column names in All_0514 dataset

Index(['Accident_Index', 'Location_Easting_OSGR', 'Location_Northing_OSGR',
       'Longitude', 'Latitude', 'Police_Force', 'Accident_Severity',
       'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week',
       'Time', 'Local_Authority_(District)', 'Local_Authority_(Highway)',
       '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit',
       'Junction_Detail', 'Junction_Control', '2nd_Road_Class',
       '2nd_Road_Number', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions',
       'Weather_Conditions', 'Road_Surface_Conditions',
       'Special_Conditions_at_Site', 'Carriageway_Hazards',
       'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'LSOA_of_Accident_Location', 'Vehicle_Reference', 'Vehicle_Type',
       'Towing_and_Articulation', 'Vehicle_Manoeuvre',
       'Vehicle_Location-Restricted_Lane', 'Junction_Location',
       'Skidding_and_Overturning', 'Hit_Object_in_C

In [17]:
All_7904.columns  #print the column names in All_7904 dataset

Index(['Accident_Index', 'Location_Easting_OSGR', 'Location_Northing_OSGR',
       'Longitude', 'Latitude', 'Police_Force', 'Accident_Severity',
       'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week',
       'Time', 'Local_Authority_(District)', 'Local_Authority_(Highway)',
       '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit',
       'Junction_Detail', 'Junction_Control', '2nd_Road_Class',
       '2nd_Road_Number', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions',
       'Weather_Conditions', 'Road_Surface_Conditions',
       'Special_Conditions_at_Site', 'Carriageway_Hazards',
       'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'LSOA_of_Accident_Location', 'Vehicle_Reference', 'Vehicle_Type',
       'Towing_and_Articulation', 'Vehicle_Manoeuvre',
       'Vehicle_Location-Restricted_Lane', 'Junction_Location',
       'Skidding_and_Overturning', 'Hit_Object_in_C

#### Problem
We can see from above that there is no "Age_of_Driver" column in the All_7904 dataframe. Because our comparison matches column names by position, the extra column in All_0514 shifts every subsequent column name by one place, which explains why the last 7 columns do not match.
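
A comparison by name rather than by position points directly at the extra column. The sketch below uses `Index.difference` on two shortened, made-up column lists. The names `cols_newer` and `cols_older` are assumptions for illustration, and on the real data the same call would be made on `All_0514.columns` and `All_7904.columns`.

```python
import pandas as pd

# Shortened, hypothetical column lists standing in for the real dataframes.
cols_newer = pd.Index(["Accident_Index", "Sex_of_Driver",
                       "Age_of_Driver", "Age_Band_of_Driver"])
cols_older = pd.Index(["Accident_Index", "Sex_of_Driver", "Age_Band_of_Driver"])

# Names present in one index but not in the other, regardless of position.
only_in_newer = cols_newer.difference(cols_older)
only_in_older = cols_older.difference(cols_newer)

print(list(only_in_newer))  # ['Age_of_Driver']
print(list(only_in_older))  # []
```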

#### Solution
Both the All_7904 and All_0514 datasets contain an "Age_Band_of_Driver" column, which lists the age range of drivers.

Due to our All_7904 dataset covering the largest range of accident reports, and contributing over two thirds of our data, we have decided to remove the "Age_of_Driver" column from the All_0514 dataset.

We will use the 'Age_Band_of_Driver' column in future analysis.

Using the drop method, we will remove the 'Age_of_Driver' column from the All_0514 dataset.

In [18]:
new_0514 = All_0514.drop(columns=['Age_of_Driver']) #create a new dataframe with no 'Age_of_Driver' column
new_0514.columns #list the columns in our new dataframe

Index(['Accident_Index', 'Location_Easting_OSGR', 'Location_Northing_OSGR',
       'Longitude', 'Latitude', 'Police_Force', 'Accident_Severity',
       'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week',
       'Time', 'Local_Authority_(District)', 'Local_Authority_(Highway)',
       '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit',
       'Junction_Detail', 'Junction_Control', '2nd_Road_Class',
       '2nd_Road_Number', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions',
       'Weather_Conditions', 'Road_Surface_Conditions',
       'Special_Conditions_at_Site', 'Carriageway_Hazards',
       'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'LSOA_of_Accident_Location', 'Vehicle_Reference', 'Vehicle_Type',
       'Towing_and_Articulation', 'Vehicle_Manoeuvre',
       'Vehicle_Location-Restricted_Lane', 'Junction_Location',
       'Skidding_and_Overturning', 'Hit_Object_in_C

Now we will check that the column names in our new_0514 and All_7904 datasets match.

In [19]:
new_cols_0514 = list(new_0514.columns) #create a list of the columns in the new_0514 dataset
new_cols_0514_df = pd.DataFrame(new_cols_0514)  #create a dataframe out of this list

In [20]:
new_cols_0514_df.isin(cols_7904_df) #compare whether both dataframes match

Unnamed: 0,0
0,True
1,True
2,True
3,True
4,True
5,True
6,True
7,True
8,True
9,True


We can now see that our datasets are matching. We will now save them to files.
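
A compact alternative to comparing the column lists row by row is `Index.equals`, which returns a single True or False and is sensitive to column order. This is a sketch on toy data. The dataframes `older` and `newer` and their values are made up for illustration.

```python
import pandas as pd

# Toy dataframes (hypothetical values) mimicking the column mismatch.
older = pd.DataFrame({"Accident_Index": ["A1"], "Age_Band_of_Driver": [7]})
newer = pd.DataFrame({"Accident_Index": ["B1"], "Age_of_Driver": [74],
                      "Age_Band_of_Driver": [10]})

# Drop the extra column, then confirm names and order now agree.
aligned = newer.drop(columns=["Age_of_Driver"])
same_columns = aligned.columns.equals(older.columns)
print(same_columns)  # True
```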

# 4. Saving the Datasets

Due to the size of our All_7904 dataset (~11 million rows), we have chosen to separate it into 4 smaller datasets, each with at most 3 million rows.

This allows us to query our data without running into performance problems caused by the large size of the dataset.

Before we do this, we will select a subset of columns to save, since not all are needed or useful going forward.

In [26]:
cols_to_save = [
       'Accident_Index', 'Longitude', 'Latitude', 'Accident_Severity',
       'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week', 
       'Time', 'Local_Authority_(District)', 'Road_Type', 'Speed_limit',
       'Junction_Detail', 'Light_Conditions','Weather_Conditions', 
       'Road_Surface_Conditions','Special_Conditions_at_Site', 
       'Urban_or_Rural_Area', 'Vehicle_Type', 'Vehicle_Manoeuvre', 
       'Vehicle_Location-Restricted_Lane', 'Journey_Purpose_of_Driver', 
       'Sex_of_Driver', 'Age_Band_of_Driver', 'Age_of_Vehicle', 
       'Driver_IMD_Decile', 'Driver_Home_Area_Type'
]

## 4.1
### Saving to Pickle files

Because each dataset is large (up to ~3 million rows), we have chosen to save them as pickle files, which offer fast save and read times and a compact footprint.

Firstly, we will save the new_0514 dataset into a pickle file.

### 2005 - 2014

In [50]:
pickle_save_time = %timeit -o new_0514[cols_to_save].to_pickle("../../data/processed/100_0514_accidents.pkl") #save 2005-2014 dataset into a pickle file and print the save time

pickle_save_time

14.4 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 14.4 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>
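
To confirm that a saved pickle can be read back unchanged, a round trip through `to_pickle` and `pd.read_pickle` can be checked with `pd.testing.assert_frame_equal`. The following is a minimal sketch on toy data written to a temporary directory. The file name `demo_accidents.pkl`, the dataframe `df` and the list `cols_to_keep` are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe (hypothetical values) and a subset of columns to save.
df = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Accident_Severity": [3, 2],
    "Longitude": [-0.19, None],
})
cols_to_keep = ["Accident_Index", "Accident_Severity"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_accidents.pkl")
    df[cols_to_keep].to_pickle(path)   # save only the selected columns
    restored = pd.read_pickle(path)    # load them back into a dataframe

# Raises an AssertionError if values, dtypes or labels differ.
pd.testing.assert_frame_equal(restored, df[cols_to_keep])
print(restored.shape)  # (2, 2)
```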

Next, we will split the All_7904 dataset into four smaller datasets of at most 3 million rows each, and save these as separate pickle files.

### 1979 - 2004

__(A)__

In [24]:
a_7904 = All_7904[:3000000] #create a new dataframe containing the first 3 million rows (rows 0 to 2999999) of the All_7904 dataset

In [27]:
pickle_save_time_7904a = %timeit -o a_7904[cols_to_save].to_pickle("../../data/processed/100_7904a_accidents.pkl") #save dataset into a pickle file and print the save time

pickle_save_time_7904a

14.4 s ± 549 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 14.4 s ± 549 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

__(B)__

In [28]:
b_7904 = All_7904[3000001:6000000] #create a new dataframe from rows 3000001 to 5999999 of the All_7904 dataset (row 3000000 falls in neither a_7904 nor b_7904)

In [29]:
pickle_save_time_7904b = %timeit -o b_7904[cols_to_save].to_pickle("../../data/processed/100_7904b_accidents.pkl") #save dataset into a pickle file and print the save time

pickle_save_time_7904b

13.7 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 13.7 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

__(C)__

In [30]:
c_7904 = All_7904[6000001:9000000] #create a new dataframe from rows 6000001 to 8999999 of the All_7904 dataset (row 6000000 falls in neither b_7904 nor c_7904)

In [31]:
pickle_save_time_7904c = %timeit -o c_7904[cols_to_save].to_pickle("../../data/processed/100_7904c_accidents.pkl") #save dataset into a pickle file and print the save time

pickle_save_time_7904c

14.3 s ± 933 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 14.3 s ± 933 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

__(D)__

In [32]:
d_7904 = All_7904[9000001:] #create a new dataframe from row 9000001 onwards of the All_7904 dataset (row 9000000 falls in neither c_7904 nor d_7904)

In [33]:
pickle_save_time_7904d = %timeit -o d_7904[cols_to_save].to_pickle("../../data/processed/100_7904d_accidents.pkl") #save dataset into a pickle file and print the save time

pickle_save_time_7904d

9.19 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 9.19 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)>
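
Python slices include the start position and exclude the stop position, so `[:3000000]` ends at row 2999999 and a following slice that starts at 3000001 leaves row 3000000 out of both pieces. A loop that reuses each stop value as the next start avoids such gaps. The following is a minimal sketch on a toy dataframe. The helper name `split_into_chunks` is hypothetical, and a chunk size of 3 stands in for 3 million.

```python
import pandas as pd

def split_into_chunks(df, chunk_size):
    """Split df into consecutive pieces of at most chunk_size rows, with no
    rows skipped or repeated between pieces."""
    return [df.iloc[start:start + chunk_size]
            for start in range(0, len(df), chunk_size)]

# Toy dataframe with 11 rows (hypothetical).
demo = pd.DataFrame({"row": range(11)})
chunks = split_into_chunks(demo, 3)

print([len(c) for c in chunks])  # [3, 3, 3, 2]

# Every original row appears exactly once across the chunks.
assert sum(len(c) for c in chunks) == len(demo)
```

Each chunk could then be saved with `to_pickle` in the same loop.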

## 4.2
### Create Data Dictionaries

Below we will create dictionary files for each of our above pickle files, to get a summary of their contents.

__2005 - 2014__

In [35]:
data_dictionary.save(
    '../../data/processed/100_0514_accidents.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,3004425.0,1640597.0,2013460234852.0,67.0,,,,,,,,0,0.0
Longitude,3004230.0,,,,-1.42592,1.3926,-7.51623,-2.34207,-1.38969,-0.226342,1.76201,198,0.00659
Latitude,3004230.0,,,,52.557,1.42629,49.9129,51.4873,52.2679,53.4524,60.7575,198,0.00659
Accident_Severity,3004420.0,,,,2.8504,0.390599,1.0,3.0,3.0,3.0,3.0,0,0.0
Number_of_Vehicles,3004420.0,,,,2.11068,0.937462,1.0,2.0,2.0,2.0,67.0,0,0.0


__1979 - 2004 (A)__

In [34]:
data_dictionary.save(
    '../../data/processed/100_7904a_accidents.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,3000000.0,1783184.0,198213Q011682,61.0,,,,,,,,0,0.0
Longitude,0.0,,,,,,,,,,,3000000,100.0
Latitude,0.0,,,,,,,,,,,3000000,100.0
Accident_Severity,3000000.0,,,,2.71464,0.494487,1.0,2.0,3.0,3.0,3.0,0,0.0
Number_of_Vehicles,3000000.0,,,,1.97101,0.949776,1.0,2.0,2.0,2.0,61.0,0,0.0


__1979 - 2004 (B)__

In [36]:
data_dictionary.save(
    '../../data/processed/100_7904b_accidents.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2999999.0,1709440.0,199213MU34592,192.0,,,,,,,,0,0.0
Longitude,0.0,,,,,,,,,,,2999999,100.0
Latitude,0.0,,,,,,,,,,,2999999,100.0
Accident_Severity,3000000.0,,,,2.76869,0.462499,1.0,3.0,3.0,3.0,3.0,0,0.0
Number_of_Vehicles,3000000.0,,,,2.08314,1.98272,1.0,2.0,2.0,2.0,192.0,0,0.0


__1979 - 2004 (C)__

In [37]:
data_dictionary.save(
    '../../data/processed/100_7904c_accidents.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,2999999.0,1653282.0,199722AX01148,88.0,,,,,,,,0,0.0
Longitude,476279.0,,,,-1.29859,1.35915,-7.51329,-2.22435,-1.19521,-0.143788,1.75861,2523720,84.124028
Latitude,476279.0,,,,52.4471,1.35793,49.9143,51.4898,51.8747,53.3991,60.6934,2523720,84.124028
Accident_Severity,3000000.0,,,,2.82871,0.410611,1.0,3.0,3.0,3.0,3.0,0,0.0
Number_of_Vehicles,3000000.0,,,,2.11218,1.2306,1.0,2.0,2.0,2.0,88.0,0,0.0


__1979 - 2004 (D)__

In [38]:
data_dictionary.save(
    '../../data/processed/100_7904d_accidents.pkl', 

"""\
Aggregate raw data for road accidents.
""").head()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,Missing,%Missing
Accident_Index,1981967.0,1078292.0,200346SW73961,66.0,,,,,,,,0,0.0
Longitude,1977320.0,,,,-1.47628,1.37502,-7.53617,-2.37656,-1.46803,-0.288233,1.76059,4648,0.234514
Latitude,1977320.0,,,,52.5957,1.41162,49.9128,51.5014,52.4046,53.4749,60.8017,4648,0.234514
Accident_Severity,1981970.0,,,,2.8474,0.396213,1.0,3.0,3.0,3.0,3.0,0,0.0
Number_of_Vehicles,1981970.0,,,,2.12637,1.00562,1.0,2.0,2.0,2.0,66.0,0,0.0
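
The internals of the `data_dictionary.save` helper are not shown in this notebook. The sketch below is therefore only an approximation of the kind of summary shown in the tables above, built from `DataFrame.describe` plus two missing-value columns. The function name `summarise` and the toy data are assumptions, not the helper's actual implementation.

```python
import pandas as pd

def summarise(df):
    """Per-column summary: describe() statistics plus missing-value counts."""
    summary = df.describe(include="all").T           # one row per column
    summary["Missing"] = df.isna().sum()             # number of missing values
    summary["%Missing"] = 100 * df.isna().mean()     # share of missing values
    return summary

# Toy dataframe (hypothetical values) with one missing longitude.
demo = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A3"],
    "Longitude": [-0.19, None, -1.4, -2.3],
    "Accident_Severity": [3, 3, 2, 3],
})

summary = summarise(demo)
print(summary[["count", "Missing", "%Missing"]])
```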
