<a href="https://colab.research.google.com/github/adolfolh/casuality-classification/blob/main/data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧽 **`DATA CLEANING & PREPARATION`**

## **Steps I take to clean and prepare this data for analysis:**

1. Checking for duplicate rows and removing them
2. Removing unnecessary rows/columns

In [33]:
# Run to download data (if the file already exists, wget saves a numbered copy such as casualties-2020.csv.1)
!wget https://github.com/adolfolh/casuality-classification/raw/main/data/casualties-2020.csv

--2022-08-12 09:46:08--  https://github.com/adolfolh/casuality-classification/raw/main/data/casualties-2020.csv
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/adolfolh/casuality-classification/main/data/casualties-2020.csv [following]
--2022-08-12 09:46:09--  https://raw.githubusercontent.com/adolfolh/casuality-classification/main/data/casualties-2020.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7097456 (6.8M) [text/plain]
Saving to: ‘casualties-2020.csv.1’


2022-08-12 09:46:09 (112 MB/s) - ‘casualties-2020.csv.1’ saved [7097456/7097456]



In [34]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px

In [35]:
# Import data
raw_data = pd.read_csv("casualties-2020.csv", dtype={
    "accident_index" : "string",
    "accident_reference" : "string",
    "casualty_class" : "category",
    "sex_of_casualty" : "category",
    "age_band_of_casualty" : "category",
    "casualty_severity" : "category",
    "pedestrian_location" : "category",
    "pedestrian_movement" : "category",
    "car_passenger" : "category",
    "bus_or_coach_passenger" : "category",
    "pedestrian_road_maintenance_worker" : "category",
    "casualty_type" : "category",
    "casualty_home_area_type" : "category",
    "casualty_imd_decile" : "category"
    })
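
When `read_csv` is given `dtype="category"`, pandas parses the category labels as strings, which is why the later filtering cells compare these columns against values such as `'-1'` rather than `-1`. A minimal sketch of that behaviour, assuming a made-up two-row CSV (`csv_text` and `toy` are hypothetical and not part of the casualties file):

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = "sex_of_casualty,age_of_casualty\n1,31\n-1,2\n"

# Only sex_of_casualty is read as a category; age_of_casualty stays numeric.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"sex_of_casualty": "category"})

# The category labels are strings, so comparisons need string literals.
categories = list(toy["sex_of_casualty"].cat.categories)
matches_string = int((toy["sex_of_casualty"] == "-1").sum())

print(categories, matches_string)
```

Columns left out of the `dtype` mapping, such as `age_of_casualty`, keep their inferred numeric type and are compared against plain numbers.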

In [36]:
raw_data.head()

Unnamed: 0,accident_index,accident_year,accident_reference,vehicle_reference,casualty_reference,casualty_class,sex_of_casualty,age_of_casualty,age_band_of_casualty,casualty_severity,pedestrian_location,pedestrian_movement,car_passenger,bus_or_coach_passenger,pedestrian_road_maintenance_worker,casualty_type,casualty_home_area_type,casualty_imd_decile
0,2020010219808,2020,10219808,1,1,3,1,31,6,3,9,5,0,0,0,0,1,4
1,2020010220496,2020,10220496,1,1,3,2,2,1,3,1,1,0,0,0,0,1,2
2,2020010220496,2020,10220496,1,2,3,2,4,1,3,1,1,0,0,0,0,1,2
3,2020010228005,2020,10228005,1,1,3,1,23,5,3,5,9,0,0,0,0,1,3
4,2020010228006,2020,10228006,1,1,3,1,47,8,2,4,1,0,0,0,0,1,3


### 1. Checking for duplicate rows and removing them

In [37]:
dupl = raw_data.duplicated()
dupl.groupby(dupl).size()

False    115584
dtype: int64

All 115,584 rows are flagged `False`, so the dataset contains no duplicate rows and nothing needs to be removed.
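
Had any duplicates been found, they could be removed with `drop_duplicates`. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the casualties file):

```python
import pandas as pd

# Hypothetical toy data; the second and third rows are identical.
toy = pd.DataFrame({
    "accident_reference": ["a1", "a2", "a2"],
    "age_of_casualty": [31, 2, 2],
})

# Count fully duplicated rows (the first occurrence is not flagged).
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

By default both `duplicated` and `drop_duplicates` compare every column, so two casualties from the same accident are only treated as duplicates if all of their fields match.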

### 2. Removing unnecessary rows/columns

In [38]:
# Delete reference/index, year and age band columns.
small_df = raw_data.drop(columns=[
    'accident_index',
    'accident_year',
    'accident_reference',
    'vehicle_reference',
    'casualty_reference',
    'age_band_of_casualty',
])

In [39]:
# Delete unwanted rows
small_df.drop(small_df[small_df['age_of_casualty'] == -1].index, axis=0, inplace=True) # dropped rows where age was not known
small_df.drop(small_df[(small_df['sex_of_casualty'] == '-1') | (small_df['sex_of_casualty'] == '9') ].index, axis=0, inplace=True) # dropped rows where sex was not known
small_df.drop(small_df[(small_df['pedestrian_location'] == '-1') | (small_df['pedestrian_location'] == '10') ].index, axis=0, inplace=True) # dropped rows where pedestrian location was not known
small_df.drop(small_df[(small_df['pedestrian_movement'] == '-1') | (small_df['pedestrian_movement'] == '9') ].index, axis=0, inplace=True) # dropped rows where pedestrian movement was not known
small_df.drop(small_df[(small_df['car_passenger'] == '-1') | (small_df['car_passenger'] == '9') ].index, axis=0, inplace=True) # dropped rows where car passenger was not known
small_df.drop(small_df[(small_df['bus_or_coach_passenger'] == '-1') | (small_df['bus_or_coach_passenger'] == '9') ].index, axis=0, inplace=True) # dropped rows where bus or coach passenger was not known
small_df.drop(small_df[small_df['pedestrian_road_maintenance_worker'] == '-1'].index, axis=0, inplace=True) # dropped rows where pedestrian_road_maintenance_worker was not known
small_df.drop(small_df[small_df['casualty_imd_decile'] == '-1'].index, axis=0, inplace=True) # dropped rows where casualty_imd_decile was not known
small_df.drop(small_df[small_df['casualty_home_area_type'] == '-1'].index, axis=0, inplace=True) # dropped rows where casualty_home_area_type was not known

# Drop rows where casualty type was not known or not useful
unwanted_types = ['103','104','105','106','108','109','110','113']
small_df.drop(small_df[small_df['casualty_type'].isin(unwanted_types)].index, axis=0, inplace=True)
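
The repeated `drop` calls in the cell above could also be expressed as a single boolean mask built from a mapping of column names to unwanted codes. This is only a sketch of that alternative, not the approach used in this notebook; the `toy` frame and the `unknown_codes` mapping are hypothetical and cover just two of the columns:

```python
import pandas as pd

# Hypothetical toy data mimicking two of the coded columns.
toy = pd.DataFrame({
    "age_of_casualty": [31, -1, 47, 23],
    "sex_of_casualty": pd.Series(["1", "2", "9", "1"], dtype="category"),
})

# Codes treated as "unknown" per column (same codes as the cell above).
unknown_codes = {
    "age_of_casualty": [-1],
    "sex_of_casualty": ["-1", "9"],
}

# Start with an all-False mask and flag any row holding an unknown code.
is_unknown = pd.Series(False, index=toy.index)
for column, codes in unknown_codes.items():
    is_unknown |= toy[column].isin(codes)

# Keep only the rows that were never flagged.
filtered = toy[~is_unknown].copy()

# Dropping rows does not shrink a categorical's label set; do that explicitly.
filtered["sex_of_casualty"] = filtered["sex_of_casualty"].cat.remove_unused_categories()

print(filtered)
```

Keeping the codes in one mapping makes it easier to see which values are excluded for each column, and the filter runs in a single pass over the frame.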

In [40]:
# Summary statistics for the original, uncleaned data (raw_data); the cleaned frame is small_df
raw_data.describe(include='all')

Unnamed: 0,accident_index,accident_year,accident_reference,vehicle_reference,casualty_reference,casualty_class,sex_of_casualty,age_of_casualty,age_band_of_casualty,casualty_severity,pedestrian_location,pedestrian_movement,car_passenger,bus_or_coach_passenger,pedestrian_road_maintenance_worker,casualty_type,casualty_home_area_type,casualty_imd_decile
count,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0,115584.0
unique,91199.0,,91199.0,,,3.0,4.0,,12.0,3.0,12.0,11.0,5.0,7.0,4.0,21.0,4.0,11.0
top,2020440349165.0,,440349165.0,,,1.0,1.0,,6.0,3.0,0.0,0.0,0.0,0.0,0.0,9.0,1.0,2.0
freq,41.0,,41.0,,,79330.0,72335.0,,25511.0,94022.0,100834.0,100833.0,96655.0,114275.0,114672.0,62698.0,85122.0,13604.0
mean,,2020.0,,1.460557,1.34779,,,36.489748,,,,,,,,,,
std,,0.0,,2.991765,4.036721,,,18.985022,,,,,,,,,,
min,,2020.0,,1.0,1.0,,,-1.0,,,,,,,,,,
25%,,2020.0,,1.0,1.0,,,23.0,,,,,,,,,,
50%,,2020.0,,1.0,1.0,,,33.0,,,,,,,,,,
75%,,2020.0,,2.0,1.0,,,50.0,,,,,,,,,,


## **Download dataframe**

In [41]:
# Copy the cleaned frame with a fresh 0..n-1 index, discarding the old row labels
clean_df = small_df.reset_index(drop=True)

In [44]:
# Save the cleaned data (the row index is written as an unnamed first column)
clean_df.to_csv(r'clean-casualties-2020.csv')
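
Because `to_csv` writes the row index by default, the saved file starts with an unnamed column. A minimal sketch of reading such a file back, assuming a made-up frame and an in-memory buffer in place of the real CSV (`toy` and `buffer` are hypothetical):

```python
import io
import pandas as pd

# Hypothetical frame whose index has gaps, as it would after dropping rows.
toy = pd.DataFrame({"age_of_casualty": [31, 47]}, index=[0, 3])

# Default to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
toy.reset_index(drop=True).to_csv(buffer)

# index_col=0 restores that column as the index when reading back.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without `index_col=0`, the same column would appear in the loaded frame as `Unnamed: 0`.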