# First Year Project
## Project 1 - Road Collisions Analysis, ITU Copenhagen

This notebook contains all of the code developed for project 1, which explores the road collisions data set from data.gov.uk with a focus on the city of Birmingham.

Nicola (niccl@itu.dk)<br>
Emma (@itu.dk)<br>
Karlis (@itu.dk)<br>
Kirstine (@itu.dk)<br>
Danielle (ddeq@itu.dk)<br>

Created: 12-02-2021
<br>Last modified: 12-02-2021

## Imports

In [None]:
import numpy as np

## Constants

In [None]:
PATH = {}
PATH["data_raw"] = "../Data/raw/"
PATH["data_interim"] = "../Data/interim/"
PATH["data_processed"] = "../Data/processed/"
PATH["data_external"] = "../Data/external/"

FILENAME = {}
FILENAME["accidents"] = "Road Safety Data - Accidents 2019.csv"
FILENAME["casualties"] = "Road Safety Data - Casualties 2019.csv"
FILENAME["vehicles"] = "Road Safety Data- Vehicles 2019.csv" 

TABLENAMES = ["accidents", "casualties", "vehicles"]

## Load raw data

The data were downloaded from here on Jan 4th: https://data.gov.uk/dataset/road-accidents-safety-data
That page was updated afterwards (Jan 8th), so local and online data may be inconsistent.

In [None]:
dataraw = {}
dataraw["accidents"] = np.genfromtxt(PATH["data_raw"]+FILENAME["accidents"], delimiter=',', dtype=None, names=True, encoding='utf-8-sig')
dataraw["vehicles"] = np.genfromtxt(PATH["data_raw"]+FILENAME["vehicles"], delimiter=',', dtype=None, names=True, encoding='utf-8-sig')
dataraw["casualties"] = np.genfromtxt(PATH["data_raw"]+FILENAME["casualties"], delimiter=',', dtype=None, names=True, encoding='utf-8-sig')

In [None]:
# Get a dictionary of the variable names for each table
variable_names_raw = {}
for variable_name in TABLENAMES:
    variable_names_raw[variable_name] = list(dataraw[variable_name].dtype.names)
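The field names collected above are not always identical to the CSV headers, because `np.genfromtxt` cleans the header when `names=True`. The sketch below uses a made-up three-column CSV (the rows and IDs are hypothetical) to show how a header such as `Local_Authority_(District)` ends up as `Local_Authority_District`.

```python
import io
import numpy as np

# Hypothetical two-row CSV; only the header mimics the raw accidents file.
csv_text = (
    "Accident_Index,Local_Authority_(District),Speed_limit\n"
    "A001,300,30\n"
    "A002,1,60\n"
)

demo = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None,
                     names=True, encoding="utf-8")

# genfromtxt strips characters such as parentheses from header names,
# so the field is addressed without them.
demo_names = list(demo.dtype.names)
print(demo_names)
print(demo["Local_Authority_District"])
```

This is why the code in this notebook uses names without parentheses, while the variable lists further down keep the original header spelling.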

In [None]:
#variable_names_raw["accidents"]

In [None]:
#dataraw["accidents"][:5]

## Report the dimensions of the data (number of tables, rows, fields).

We have 3 tables: Accidents, Casualties, and Vehicles.

### Number of records in each table

In [None]:
print(f"Number of records in accident table: {dataraw['accidents'].shape}")
print(f"Number of records in vehicles table: {dataraw['vehicles'].shape}")
print(f"Number of records in casualties table: {dataraw['casualties'].shape}")

### Number of fields in each table

In [None]:
print(f"Number of fields in the accident table: {len(dataraw['accidents'].dtype)}")
print(f"Number of fields in the vehicles table: {len(dataraw['vehicles'].dtype)}")
print(f"Number of fields in the casualties table: {len(dataraw['casualties'].dtype)}")

In [None]:
#dataraw["accidents"]["Local_Authority_District"]

## Masking for Birmingham

Birmingham is listed as code "300" under the "Local_Authority_District" field. First we narrowed down the accidents table to only show records that match this code in that field.

In [None]:
# Create a dictionary to hold all of the clean data
data_clean = {}

# creating the mask to show the data for Birmingham on the accidents sheet
birmingham_mask = np.where(dataraw["accidents"]["Local_Authority_District"] == 300)
data_clean["accidents"] = dataraw["accidents"][birmingham_mask]
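As a minimal illustration of the masking step, the sketch below builds a tiny structured array (the records are made up) and selects the rows with district code 300. It also shows that indexing with the boolean mask directly and indexing with the positions from `np.where` give the same rows.

```python
import numpy as np

# Hypothetical structured array standing in for the accidents table.
toy_accidents = np.array(
    [("A001", 300), ("A002", 1), ("A003", 300)],
    dtype=[("Accident_Index", "U4"), ("Local_Authority_District", "i8")],
)

# Boolean mask: True where the district code is 300 (Birmingham).
is_birmingham = toy_accidents["Local_Authority_District"] == 300

# np.where converts the mask into row positions; both index forms
# select the same records.
rows_by_mask = toy_accidents[is_birmingham]
rows_by_index = toy_accidents[np.where(is_birmingham)]

print(rows_by_mask["Accident_Index"])
```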

Then we narrowed down the casualties and vehicles tables to the records whose accident ID appears in the filtered accidents table, so that only records pertaining to Birmingham remain.

In [None]:
# Create an array of the accident IDs related to Birmingham (use this to filter the other tables)
accident_ids = np.array(data_clean["accidents"]["Accident_Index"])

# Create a mask to filter casualties table based on the accident ids
casualties_mask = np.where(np.isin(dataraw["casualties"]["Accident_Index"], (accident_ids)))
data_clean["casualties"] = dataraw["casualties"][casualties_mask]

In [None]:
# Create a mask to filter vehicles table based on the accident ids
vehicles_mask = np.where(np.isin(dataraw["vehicles"]["Accident_Index"], (accident_ids)))
data_clean["vehicles"] = dataraw["vehicles"][vehicles_mask]

In [None]:
# An example of how we would save a copy of the processed data to the correct directory
#clean_accidents.to_csv(r'../Data/interim/clean_accidents.csv', index = False, header = True)
#np.savetxt(PATH["data_interim"] + "clean_accidents.csv", clean_accidents, delimiter=",")

### Does every AccidentID in the casualties and vehicles tables have its corresponding AccidentID in the accidents table?

Yes. After filtering, the casualties and vehicles tables only contain records whose accident ID is in the cleaned Birmingham accidents table.
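One way to confirm this in code is to test whether every ID in a child table occurs in the parent table. The helper `all_ids_known` and the toy IDs below are hypothetical and only sketch the idea.

```python
import numpy as np

# Hypothetical accident IDs; "A999" has no matching accident record.
accident_ids_demo = np.array(["A001", "A003"])
casualty_ids_demo = np.array(["A001", "A001", "A003"])
vehicle_ids_demo = np.array(["A001", "A003", "A003", "A999"])

def all_ids_known(child_ids, parent_ids):
    """Return True if every ID in child_ids also occurs in parent_ids."""
    return bool(np.isin(child_ids, parent_ids).all())

casualties_ok = all_ids_known(casualty_ids_demo, accident_ids_demo)
vehicles_ok = all_ids_known(vehicle_ids_demo, accident_ids_demo)

# IDs that appear in the child table but not in the parent table.
orphans = np.setdiff1d(vehicle_ids_demo, accident_ids_demo)

print(casualties_ok, vehicles_ok, orphans)
```

On the notebook's data the call would look like `all_ids_known(data_clean["casualties"]["Accident_Index"], accident_ids)`. Because the tables were filtered with `np.isin`, this holds by construction for the cleaned tables.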

## Identify for each variable whether it is numerical or categorical.

### Insight: Mixed variable types

Accidents have mixed data types, including strings, floats, and integers. Categorical variables are encoded as integers. The meaning of these categories can be looked up in: ../references/variable lookup.xls
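The storage type of each field can be read from the structured array's dtype. The sketch below uses a made-up two-row table with a few field names from the accidents table. It shows why the storage type alone cannot settle the question: integer-coded categories and true counts both appear as integers.

```python
import numpy as np

# Hypothetical records with a handful of the accidents fields.
toy = np.array(
    [("A001", -1.89, 2, 30), ("A002", -1.90, 3, 60)],
    dtype=[("Accident_Index", "U4"), ("Longitude", "f8"),
           ("Accident_Severity", "i8"), ("Speed_limit", "i8")],
)

# Map NumPy dtype "kind" codes to readable labels.
KIND_LABELS = {"U": "string", "f": "float", "i": "integer"}

storage_types = {
    name: KIND_LABELS.get(toy.dtype[name].kind, "other")
    for name in toy.dtype.names
}

for name, label in storage_types.items():
    print(f"{name}: {label}")
```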

In [None]:
# Number of entries in each table after being filtered to Birmingham

print(f"Number of records in Birmingham accident table: {data_clean['accidents'].shape}")
print(f"Number of records in Birmingham casualties table: {data_clean['casualties'].shape}")
print(f"Number of records in Birmingham vehicles table: {data_clean['vehicles'].shape}")

We went through the variable lists manually to decide which variables are categorical and which are numerical.

References<br>
Categorical data - http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm <br>
Lower Layer Super Output Area (LSOA) - https://datadictionary.nhs.uk/nhs_business_definitions/lower_layer_super_output_area.html#:~:text=A%20Lower%20Layer%20Super%20Output,Lower%20Layer%20Super%20Output%20Areas <br>
IMD decile - http://mast.roadsafetyanalysis.org/wiki/index.php?title=Driver_IMD_Decile#:~:text=An%20IMD%20decile%20is%20a,the%2010%25%20least%20deprived%20areas.


Accidents <br><br>
Accident_Index  - Categorical<br>
Location_Easting_OSGR  - Numerical<br>
Location_Northing_OSGR  - Numerical<br>
Longitude  - Numerical<br>
Latitude  - Numerical<br>
Police_Force  - Categorical<br>
Accident_Severity  - Categorical<br>
Number_of_Vehicles  - Numerical<br>
Number_of_Casualties  - Numerical<br>
Date  - Numerical/Categorical<br>
Day_of_Week  - Categorical<br>
Time  - Numerical(but can be categorical)<br>
Local_Authority_(District)  - Categorical<br>
Local_Authority_(Highway)  - Categorical<br>
1st_Road_Class  - Categorical<br>
1st_Road_Number  - Categorical<br>
Road_Type  - Categorical<br>
Speed_limit  - Categorical(we questioned this)<br>
Junction_Detail  - Categorical<br>
Junction_Control  - Categorical<br>
2nd_Road_Class  - Categorical<br>
2nd_Road_Number  - Categorical<br>
Pedestrian_Crossing-Human_Control  - Categorical<br>
Pedestrian_Crossing-Physical_Facilities  - Categorical<br>
Light_Conditions  - Categorical<br>
Weather_Conditions  - Categorical<br>
Road_Surface_Conditions  - Categorical<br>
Special_Conditions_at_Site  - Categorical<br>
Carriageway_Hazards  - Categorical<br>
Urban_or_Rural_Area  - Categorical<br>
Did_Police_Officer_Attend_Scene_of_Accident  - Categorical<br>
LSOA_of_Accident_Location  - Categorical<br>

Vehicles<br><br>
Accident_Index  - Categorical<br>
Vehicle_Reference  - Categorical<br>
Vehicle_Type  - Categorical<br>
Towing_and_Articulation  - Categorical<br>
Vehicle_Manoeuvre  - Categorical<br>
Vehicle_Location-Restricted_Lane  - Categorical<br>
Junction_Location  - Categorical<br>
Skidding_and_Overturning  - Categorical<br>
Hit_Object_in_Carriageway  - Categorical<br>
Vehicle_Leaving_Carriageway  - Categorical<br>
Hit_Object_off_Carriageway  - Categorical<br>
1st_Point_of_Impact  - Categorical<br>
Was_Vehicle_Left_Hand_Drive?  - Categorical<br>
Journey_Purpose_of_Driver  - Categorical<br>
Sex_of_Driver  - Categorical<br>
Age_of_Driver  - Numerical<br>
Age_Band_of_Driver  - Categorical<br>
Engine_Capacity_(CC)  - Numerical<br>
Propulsion_Code  - Categorical(probably)<br>
Age_of_Vehicle  - Numerical<br>
Driver_IMD_Decile  - Categorical<br>
Driver_Home_Area_Type  - Categorical<br>
Vehicle_IMD_Decile  - Categorical<br>

Casualties <br><br>
Accident_Index  - Categorical<br>
Vehicle_Reference  - Categorical<br>
Casualty_Reference  - Categorical<br>
Casualty_Class  - Categorical<br>
Sex_of_Casualty  - Categorical<br>
Age_of_Casualty  - Numerical<br>
Age_Band_of_Casualty  - Categorical<br>
Casualty_Severity  - Categorical<br>
Pedestrian_Location  - Categorical<br>
Pedestrian_Movement  - Categorical<br>
Car_Passenger  - Categorical<br>
Bus_or_Coach_Passenger  - Categorical<br>
Pedestrian_Road_Maintenance_Worker  - Categorical<br>
Casualty_Type  - Categorical<br>
Casualty_Home_Area_Type  - Categorical<br>
Casualty_IMD_Decile  - Categorical<br>

## Dealing with missing data

### First, we go through the missing data in the unfiltered data for the whole UK

In [None]:
dataraw_masked = {}
for variable_name in TABLENAMES:
    dataraw_masked[variable_name] = np.genfromtxt(PATH["data_raw"] + FILENAME[variable_name], delimiter = ',', dtype = None, names = True, encoding='utf-8-sig', usemask = True)
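To see what `usemask=True` does, the sketch below reads a small made-up CSV with two empty fields. The values are hypothetical; the point is that empty fields come back masked, so missing values can be counted per field and per row.

```python
import io
import numpy as np

# Hypothetical CSV: row 2 lacks Longitude, row 3 lacks Speed_limit.
csv_text = (
    "Accident_Index,Longitude,Speed_limit\n"
    "A001,-1.89,30\n"
    "A002,,60\n"
    "A003,-1.91,\n"
)

masked_demo = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None,
                            names=True, encoding="utf-8", usemask=True)

# One boolean mask per field; True marks a missing value.
field_masks = np.ma.getmaskarray(masked_demo)
names = masked_demo.dtype.names

missing_per_field = {name: int(field_masks[name].sum()) for name in names}

# A row is incomplete if any of its fields is masked.
row_has_missing = np.any([field_masks[name] for name in names], axis=0)

print(missing_per_field)
print(f"Incomplete rows: {int(row_has_missing.sum())} of {len(masked_demo)}")
```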

In [None]:
print(f"Number of rows with missing values: {np.count_nonzero(dataraw_masked['accidents'].mask)}")

In [None]:
# Incomplete rows divided by total rows, times 100 to express it as a percentage
print(f"Percentage of rows with missing values: {round(100 * 5776 / 117536, 2)}%")

In [None]:
# Rows where data is missing
row_incomplete = np.where(dataraw_masked["accidents"].mask)[0]

In [None]:
missingpositions = {}
missingvalues = 0
missingconfigurations = set()
for rowpos in row_incomplete:
    missingpositions_thisrow = list(np.where(list(dataraw_masked["accidents"].mask[rowpos]))[0])
    missingpositions[rowpos] = missingpositions_thisrow
    missingvalues += len(missingpositions_thisrow)
    missingconfigurations.add(tuple(missingpositions_thisrow))
    
missingfieldnames = [np.array(variable_names_raw["accidents"])[c] for c in [list(b) for b in missingconfigurations]]

In [None]:
print("Incomplete rows: " + str(np.count_nonzero(dataraw_masked["accidents"].mask)))
print("Missing values: " + str(missingvalues))

print("\nMissing field configurations: " + str(missingconfigurations))
for i in missingfieldnames:
    print(i)

### Then, we will repeat the same steps with the tables that only contain records of Birmingham.

In [None]:
# Create a dictionary for the clean, masked data
data_clean_masked = {}

# creating the mask to show the data for Birmingham on the accidents sheet
birmingham_mask2 = np.where(dataraw_masked["accidents"]["Local_Authority_District"] == 300)
data_clean_masked["accidents"] = dataraw_masked["accidents"][birmingham_mask2]

In [None]:
# Create an array of the accident IDs related to Birmingham (use this to filter the other tables)
accident_ids2 = np.array(data_clean_masked["accidents"]["Accident_Index"])

# Create a mask to filter casualties table based on the accident ids
casualties_mask2 = np.where(np.isin(dataraw_masked["casualties"]["Accident_Index"], (accident_ids2)))
data_clean_masked["casualties"] = dataraw_masked["casualties"][casualties_mask2]

In [None]:
# Create a mask to filter vehicles table based on the accident ids
vehicles_mask2 = np.where(np.isin(dataraw_masked["vehicles"]["Accident_Index"], (accident_ids2)))
data_clean_masked["vehicles"] = dataraw_masked["vehicles"][vehicles_mask2]

In [None]:
print(f"Number of rows with missing values in Birmingham accidents table: {np.count_nonzero(data_clean_masked['accidents'].mask)}")
print(f"Number of rows with missing values in Birmingham casualties table: {np.count_nonzero(data_clean_masked['casualties'].mask)}")
print(f"Number of rows with missing values in Birmingham vehicles table: {np.count_nonzero(data_clean_masked['vehicles'].mask)}")

It seems that none of the three Birmingham tables contains records with missing values.

## Report a five number summary for all numerical variables (where this makes sense)
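A five number summary (minimum, first quartile, median, third quartile, maximum) can be computed with `np.percentile`. The helper `five_number_summary` and the ages below are hypothetical and only sketch the approach. The sketch assumes that any codes used for missing or out-of-range values have been removed beforehand.

```python
import numpy as np

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a 1-D sequence of numbers."""
    values = np.asarray(values, dtype=float)
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return {"min": minimum, "q1": q1, "median": median, "q3": q3, "max": maximum}

# Hypothetical ages, not taken from the data set.
ages_demo = [18, 22, 25, 31, 40, 47, 63]
summary = five_number_summary(ages_demo)

for label, value in summary.items():
    print(f"{label}: {value}")
```

On the notebook's data, a call such as `five_number_summary(data_clean["casualties"]["Age_of_Casualty"])` would apply the same idea to one numerical field.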

## Report a box plot for all numerical variables (where this makes sense)

## Report a frequency histogram for all numerical variables, and a frequency bar plot for all categorical variables (where this makes sense)
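The counts behind a frequency bar plot of a categorical variable can be obtained with `np.unique`. The sketch below uses made-up integer codes and leaves the plotting itself out, since the choice of plotting library is not fixed in this notebook.

```python
import numpy as np

# Hypothetical integer codes of a categorical variable.
severity_demo = np.array([3, 3, 2, 3, 1, 2])

# Unique category codes and how often each one occurs.
categories, counts = np.unique(severity_demo, return_counts=True)

for category, count in zip(categories, counts):
    print(f"category {category}: {count} records")
```

The two arrays can then be passed as the x positions and bar heights of a bar plot.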