# Dataset Preparation

In the [supervised_learning notebook](supervised_learning.ipynb), we compare and analyze the results for different types of models applied to a supervised learning task. The present notebook outlines the basic steps to prepare the dataset for the execution of this learning task.

For simplicity and familiarity, we will use the **UK Traffic Accidents** dataset that was used in Assignment 1. For more details, see the [Assignment 1 GitHub page](https://github.com/agomez08/patrones_inv1).

Our goal is to use this accident data to **predict the severity of an accident given the relevant features in the dataset**. We start from *'uk_accidents_processed.csv'*, a CSV file exported after the pre-processing applied in the previous assignment, and take that pre-processing further now that there is a specific application in mind.

## Loading the Data

In [1]:
# Start by importing the relevant Python modules
import pandas as pd
import sklearn.utils  # provides resample, used later to balance the classes

In [2]:
# Read exported data from previous investigation
df_accidents_05_14 = pd.read_csv('dataset/uk_accidents_processed.csv')

Below we count the number of instances for each possible severity. According to the dataset documentation and the changes already applied during pre-processing, the codes mean the following:

1 - Slight severity.

2 - Serious severity.

3 - Fatal severity.

In [3]:
# Count by each type
df_accidents_05_14['Accident_Severity'].value_counts()

1    1235802
2      63253
3      17821
Name: Accident_Severity, dtype: int64
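
To make the degree of imbalance explicit, the counts above can be turned into relative frequencies. The sketch below rebuilds the counts by hand instead of loading the CSV (the names `severity_counts`, `severity_share` and `imbalance_ratio` are ours, for illustration only); on the real DataFrame, `value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the output above (1 = Slight, 2 = Serious, 3 = Fatal)
severity_counts = pd.Series({1: 1235802, 2: 63253, 3: 17821}, name='Accident_Severity')

# Share of each severity level in the full dataset
severity_share = severity_counts / severity_counts.sum()
print(severity_share.round(4))

# Ratio between the largest and the smallest class
imbalance_ratio = severity_counts.max() / severity_counts.min()
print(f'Largest-to-smallest class ratio: {imbalance_ratio:.1f}')
```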

## Categories Reduction

To simplify the prediction task and to start balancing the classes, we reduce the scope of this experiment to classifying an accident as Severe or Non-Severe:

0 - Non-Severe (Slight severity).

1 - Severe (Serious or Fatal severity).

We apply these changes below:

In [4]:
# Map Slight (1) to Non-Severe (0); this must run before 2 is relabelled as 1
df_accidents_05_14.loc[df_accidents_05_14['Accident_Severity'] == 1, 'Accident_Severity'] = 0

# Map Serious (2) and Fatal (3) to Severe (1)
df_accidents_05_14.loc[df_accidents_05_14['Accident_Severity'] == 2, 'Accident_Severity'] = 1
df_accidents_05_14.loc[df_accidents_05_14['Accident_Severity'] == 3, 'Accident_Severity'] = 1

In [5]:
# Rename the column to be more representative:
df_accidents_05_14 = df_accidents_05_14.rename(columns={'Accident_Severity': 'Severe_Accident'})

With this change, the class distribution is as follows:

In [6]:
# Count by each type
df_accidents_05_14['Severe_Accident'].value_counts()

0    1235802
1      81074
Name: Severe_Accident, dtype: int64
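
The relabelling and renaming steps can also be written as a single vectorized expression. This is only a sketch on a small hypothetical frame (`toy` and its values are made up for illustration), and it assumes the severity codes are exactly 1, 2 and 3 as described above.

```python
import pandas as pd

# Hypothetical toy data using the same severity coding (1 = Slight, 2 = Serious, 3 = Fatal)
toy = pd.DataFrame({'Accident_Severity': [1, 2, 3, 1, 1, 2]})

# Codes 2 and 3 become 1 (Severe); code 1 becomes 0 (Non-Severe)
toy['Severe_Accident'] = toy['Accident_Severity'].isin([2, 3]).astype(int)
toy = toy.drop(columns='Accident_Severity')

print(toy['Severe_Accident'].value_counts())
```

Using `isin` avoids depending on the order of the assignments, which matters in the step-by-step version: relabelling 2 as 1 before relabelling 1 as 0 would also send the Serious accidents to 0.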

## Balancing Classes

To balance the classes, and since we have a very large number of instances anyway, we down-sample the *Non-Severe* class to the same number of samples as the *Severe* class. For this we use the *resample* function from *sklearn.utils*, which, with *replace=False*, draws a random subset without replacement.

In [7]:
# Split the data by class
df_accidents_severe = df_accidents_05_14[df_accidents_05_14['Severe_Accident'] == 1]
df_accidents_non_severe = df_accidents_05_14[df_accidents_05_14['Severe_Accident'] == 0]

# Downsample the non-severe class, without replacement, to the size of the severe class (81074)
df_accidents_non_severe = sklearn.utils.resample(df_accidents_non_severe, replace=False,
                                                 n_samples=len(df_accidents_severe), random_state=18)

# Combine again
df_accidents = pd.concat([df_accidents_non_severe, df_accidents_severe])

As we can see below, the two classes are now balanced:

In [8]:
# Count by each type
df_accidents['Severe_Accident'].value_counts()

0    81074
1    81074
Name: Severe_Accident, dtype: int64
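
A similar down-sampling can be sketched with pandas alone, using `DataFrame.sample` in place of `sklearn.utils.resample`. The frame below is synthetic and the names `toy`, `minority`, `majority_down` and `balanced` are hypothetical; the rows selected this way will generally differ from those picked by `resample`, even with the same seed.

```python
import pandas as pd

# Synthetic imbalanced data: 90 Non-Severe rows (0) and 10 Severe rows (1)
toy = pd.DataFrame({'Severe_Accident': [0] * 90 + [1] * 10,
                    'Speed_limit': [30] * 100})

minority = toy[toy['Severe_Accident'] == 1]
majority = toy[toy['Severe_Accident'] == 0]

# Draw, without replacement, as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), replace=False, random_state=18)

# Combine, restore the original row order, then reset the index
balanced = pd.concat([majority_down, minority]).sort_index().reset_index(drop=True)
print(balanced['Severe_Accident'].value_counts())
```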

In [9]:
# Restore the original row order, then reset the index
df_accidents.sort_index(inplace=True)
df_accidents.reset_index(drop=True, inplace=True)

## Dropping Irrelevant Features

Below we remove several columns that, by their nature, are not considered relevant for the intended application. Some of them are also consequences of the accident that are highly related to its severity and would not be available at prediction time (*'Number_of_Vehicles'* and *'Number_of_Casualties'*).

In [10]:
# Drop columns that are irrelevant or not available at prediction time
df_accidents.drop(inplace=True, columns=['Police_Force', 'Number_of_Vehicles', 'Number_of_Casualties', '1st_Road_Class',
                                         '1st_Road_Number', '2nd_Road_Class', '2nd_Road_Number', 'Police_Attended',
                                         'Year', 'Local_Authority_Highway'])
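
As a safeguard, one might confirm that none of the post-accident columns survive the drop. The sketch below uses a hypothetical two-row frame and a hypothetical `leakage_columns` list; only the column names are taken from the cell above.

```python
import pandas as pd

# Hypothetical miniature frame reusing a few of the real column names
toy = pd.DataFrame({'Speed_limit': [30, 60],
                    'Number_of_Vehicles': [2, 1],
                    'Number_of_Casualties': [1, 3],
                    'Severe_Accident': [0, 1]})

# Columns that are only known after the accident has happened
leakage_columns = ['Number_of_Vehicles', 'Number_of_Casualties']
toy = toy.drop(columns=leakage_columns)

# Any overlap here would mean a leaking column is still present
remaining = set(leakage_columns) & set(toy.columns)
print(remaining)
```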

## Storing Resultant Dataset

Below is a glance at the resulting dataset, which will be used to train the supervised learning models. It has 162148 instances and 58 features.

In [11]:
df_accidents

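| | Location_Easting_OSGR | Location_Northing_OSGR | Severe_Accident | Day_of_Week | Local_Authority_District | Speed_limit | Urban | Month_Day | Month | Hour | ... | Spe_Cond_Oil_Diesel | Spe_Cond_Sign_Defective_Obscured | Spe_Cond_Road_Surface_Defective | Spe_Cond_Roadworks | Carr_Hazards_Animal | Carr_Hazards_Dislodged_Vehicle_Load | Carr_Hazards_Previous_Accident | Carr_Hazards_None | Carr_Hazards_Other_Object | Carr_Hazards_Pedestrian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.527380 | 0.179280 | 0 | 7 | 12 | 30 | 1 | 29 | 1 | 7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.524470 | 0.180980 | 0 | 1 | 12 | 30 | 1 | 30 | 1 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.524600 | 0.181280 | 0 | 6 | 12 | 30 | 1 | 18 | 2 | 17 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0.526900 | 0.178470 | 0 | 3 | 12 | 30 | 1 | 1 | 3 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.526520 | 0.178020 | 0 | 2 | 12 | 30 | 1 | 14 | 3 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162143 | 0.312122 | 0.605613 | 1 | 4 | 917 | 60 | 0 | 28 | 5 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 162144 | 0.339072 | 0.597409 | 1 | 4 | 917 | 60 | 0 | 30 | 7 | 12 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 162145 | 0.319657 | 0.566553 | 1 | 4 | 917 | 30 | 0 | 5 | 11 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 162146 | 0.310037 | 0.597647 | 1 | 1 | 917 | 70 | 0 | 7 | 12 | 22 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |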


In [12]:
# Save to separate file called uk_accidents_for_sev_prediction.csv
df_accidents.to_csv('dataset/uk_accidents_for_sev_prediction.csv', index=False)
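
A quick round-trip check can confirm that the file reads back with the same shape and without an extra index column. This sketch writes a hypothetical toy frame to a temporary directory rather than to `dataset/`, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_accidents
toy = pd.DataFrame({'Speed_limit': [30, 60, 70], 'Severe_Accident': [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'uk_accidents_for_sev_prediction.csv')
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```

Without `index=False`, `read_csv` would return an additional `Unnamed: 0` column holding the old index.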