# Dataset Pre-Processing

In this notebook, we walk through a series of steps for cleaning a dataset before using it for a pattern-recognition learning task.

For this purpose, we use the **1.6 million UK traffic accidents** dataset, which can be downloaded from the following link: https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales

According to its documentation, this dataset contains the most relevant information for police-reported accidents in the UK between 2005 and 2014, although 2008 is missing. Each of the 1.6 million accidents reported in this dataset is described by 33 features (33 columns per instance).

We will visualize and examine the most relevant contents of the dataset while making changes that render it more suitable for a learning task. Because the appropriate pre-processing depends on the application, we will assume that the data will be used to train a model that predicts the probability of an accident for a given set of features. With that goal in mind, we will suggest removing some features that are unlikely to be relevant for such an application.


## Loading the Dataset

In [1]:
# Start by importing relevant python modules
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [3]:
# Load the three CSV files and merge them into a single DataFrame
# Note: the dtype of one column is set explicitly because it appears to contain mixed data types
df_05_07 = pd.read_csv('dataset/raw/accidents_2005_to_2007.csv', dtype={'LSOA_of_Accident_Location': 'string'})
df_09_11 = pd.read_csv('dataset/raw/accidents_2009_to_2011.csv', dtype={'LSOA_of_Accident_Location': 'string'})
df_12_14 = pd.read_csv('dataset/raw/accidents_2012_to_2014.csv', dtype={'LSOA_of_Accident_Location': 'string'})
df_accidents_05_14 = pd.concat([df_05_07, df_09_11, df_12_14], ignore_index=True)
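
The dtype note above can be illustrated with a minimal sketch on a small in-memory CSV. The sample rows below are made up for illustration (only the column name `LSOA_of_Accident_Location` comes from the dataset), and the sketch assumes a pandas version that supports the `'string'` dtype (1.0 or later).

```python
import io

import pandas as pd

# Hypothetical sample: one row has a missing LSOA code and another is purely
# numeric, so without an explicit dtype pandas would have to guess a type.
sample_csv = io.StringIO(
    "Accident_Index,LSOA_of_Accident_Location\n"
    "A1,E01002849\n"
    "A2,\n"
    "A3,12345\n"
)

sample = pd.read_csv(sample_csv, dtype={"LSOA_of_Accident_Location": "string"})

# Every non-missing value is kept as text, including the numeric-looking one
print(sample["LSOA_of_Accident_Location"].tolist())
print(sample["LSOA_of_Accident_Location"].dtype)
```

Fixing the type up front keeps the column consistent across the three files, which is likely why the loading cell passes the same `dtype` mapping to every `read_csv` call.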

In [4]:
# Confirm the frames were concatenated vertically by printing the DataFrame shape
print("df_accidents_05_14 shape is: {}".format(df_accidents_05_14.shape))

df_accidents_05_14 shape is: (1504150, 33)


The output above shows that the raw dataset has a little more than 1.5 million instances, not quite the 1.6 million that its name suggests.
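
One way to investigate this gap is to check that no rows were lost during concatenation and to count the accidents per year. The sketch below uses tiny stand-in frames so that it runs on its own (the values are made up; only the `Year` column name comes from the dataset). The same calls could be applied to `df_accidents_05_14`.

```python
import pandas as pd

# Made-up stand-ins for the three multi-year files
part_a = pd.DataFrame({"Year": [2005, 2006, 2007, 2007]})
part_b = pd.DataFrame({"Year": [2009, 2010, 2011]})
part_c = pd.DataFrame({"Year": [2012, 2013, 2014, 2014]})
parts = [part_a, part_b, part_c]

merged = pd.concat(parts, ignore_index=True)

# Vertical concatenation should preserve the total number of rows
rows_preserved = len(merged) == sum(len(p) for p in parts)

# Number of rows per year, in chronological order
rows_per_year = merged["Year"].value_counts().sort_index()

# Years in the documented 2005-2014 range that have no rows at all
missing_years = sorted(set(range(2005, 2015)) - set(merged["Year"]))

print(rows_preserved)
print(rows_per_year)
print(missing_years)
```

On the real data, a per-year count of this kind would also make the missing 2008 records mentioned in the documentation directly visible.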

In [5]:
# Print head to visualize initial column values
df_accidents_05_14.head()

| | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200501BS00001 | 525680.0 | 178240.0 | -0.19117 | 51.489096 | 1 | 2 | 1 | 1 | 04/01/2005 | ... | Zebra crossing | Daylight: Street light present | Raining without high winds | Wet/Damp | | | 1 | Yes | E01002849 | 2005 |
| 1 | 200501BS00002 | 524170.0 | 181650.0 | -0.211708 | 51.520075 | 1 | 3 | 1 | 1 | 05/01/2005 | ... | Pedestrian phase at traffic signal junction | Darkness: Street lights present and lit | Fine without high winds | Dry | | | 1 | Yes | E01002909 | 2005 |
| 2 | 200501BS00003 | 524520.0 | 182240.0 | -0.206458 | 51.525301 | 1 | 3 | 2 | 1 | 06/01/2005 | ... | No physical crossing within 50 meters | Darkness: Street lights present and lit | Fine without high winds | Dry | | | 1 | Yes | E01002857 | 2005 |
| 3 | 200501BS00004 | 526900.0 | 177530.0 | -0.173862 | 51.482442 | 1 | 3 | 1 | 1 | 07/01/2005 | ... | No physical crossing within 50 meters | Daylight: Street light present | Fine without high winds | Dry | | | 1 | Yes | E01002840 | 2005 |
| 4 | 200501BS00005 | 528060.0 | 179040.0 | -0.156618 | 51.495752 | 1 | 3 | 1 | 1 | 10/01/2005 | ... | No physical crossing within 50 meters | Darkness: Street lighting unknown | Fine without high winds | Wet/Damp | | | 1 | Yes | E01002863 | 2005 |
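
Because `head()` elides the middle columns (the `...` entry above), it can help to lift the column display limit temporarily or to list the column names directly. The following sketch uses a made-up 30-column frame with hypothetical column names as a stand-in for the accidents data. `pd.option_context` restores the previous display settings when the `with` block exits.

```python
import pandas as pd

# Made-up wide frame standing in for the 33-column accidents data
wide = pd.DataFrame({f"col_{i:02d}": [i, i + 1] for i in range(30)})

# Temporarily remove the limit on the number of displayed columns
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide.head())

print(full_repr)

# Alternatively, list every column name without printing any values
column_names = wide.columns.tolist()
print(column_names)
```

Applied to `df_accidents_05_14`, either approach would show all 33 feature names before deciding which ones to keep.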
