# Capstone project - severity of car accidents in Catalonia between 2010 and 2020

## Contents
* [1. Introduction](#introduction)
* [2. Data](#data)
  * [2.1. Data Importing](#di)
  * [2.2. Data Cleaning](#dc)

## 1. Introduction <a name='introduction'></a>

Traffic accidents are brief instants in which a life can change and, on some occasions, end. Every year thousands of lives are lost in these incidents, which can be triggered by seemingly insignificant causes such as an animal on the road or a moment of distraction while driving.
There is no single cause of traffic accidents. The causes are diverse, and so many factors act at the same time that every case needs to be evaluated in detail. The answers may lie in the driver, the speed, the weather, and so on.
In this project we'll look for the correlation between specific factors and the severity of accidents, where severity is understood as the number of human lives lost in the collision. For this we need a dataset that records the conditions present at the moment of each incident.

## 2. Data <a name='data'></a>

For this project we will use the dataset "Accidents de trànsit amb morts o ferits greus a Catalunya" (traffic accidents with deaths or serious injuries in Catalonia), which reports the traffic accidents recorded in Catalonia in the period 2010-2020, with details about weather conditions, road conditions and number of casualties. The dataset is available on the open data portal of the Government of Catalonia: https://analisi.transparenciacatalunya.cat/Transport/Accidents-de-tr-nsit-amb-morts-o-ferits-greus-a-Ca/rmgc-ncpb. This open dataset contains the information related to each accident and the different conditions that affected it.

### 2.1. Data Importing <a name='di'></a>

In [1]:
# Import main libraries
import pandas as pd
import numpy as np

In [2]:
# get the data
!wget -O traffic_accidents.csv https://analisi.transparenciacatalunya.cat/resource/rmgc-ncpb.csv

--2020-09-06 09:41:48--  https://analisi.transparenciacatalunya.cat/resource/rmgc-ncpb.csv
Resolving analisi.transparenciacatalunya.cat (analisi.transparenciacatalunya.cat)... 52.16.175.47, 52.16.222.96, 52.51.221.164
Connecting to analisi.transparenciacatalunya.cat (analisi.transparenciacatalunya.cat)|52.16.175.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘traffic_accidents.csv’

    [   <=>                                 ] 614,365     1.18MB/s   in 0.5s   

2020-09-06 09:41:50 (1.18 MB/s) - ‘traffic_accidents.csv’ saved [614365]



In [3]:
# Define the dataframe
df = pd.read_csv('traffic_accidents.csv')
df.head()

Unnamed: 0,any,zona,dat,via,pk,nommun,nomcom,nomdem,f_morts,f_ferits_greus,...,d_superficie,d_tipus_via,d_titularitat_via,d_tracat_altimetric,d_vent,grupdialab,hor,gruphor,tipacc,tipdia
0,2010,Zona urbana,2010-01-25T23:33:00.000,SE,999999,CANOVES I SAMALUS,Valles Oriental,Barcelona,0,1,...,Sec i net,Via urbana( inclou carrer i carrer residencial),,,"Calma, vent molt suau",Feiners,2333,Nit,Col.lisió de vehicles en marxa,dill-dij
1,2010,Carretera,2010-10-31T01:00:00.000,N-240,999,LLEIDA,Segria,Lleida,0,1,...,Sec i net,Carretera convencional,Estatal,Pla,"Calma, vent molt suau",CapDeSetmana,1,Nit,Sortida de la calcada sense especificar,dg
2,2010,Carretera,2010-05-17T15:27:00.000,N-II,7087,FORNELLS DE LA SELVA,Girones,Girona,1,0,...,Sec i net,Carretera convencional,Estatal,Rampa o pendent,"Calma, vent molt suau",Feiners,1527,Tarda,Col.lisió de vehicles en marxa,dill-dij
3,2010,Zona urbana,2010-08-21T22:30:00.000,SE,999999,BARCELONA,Barcelones,Barcelona,0,2,...,Sec i net,Via urbana( inclou carrer i carrer residencial),,,"Calma, vent molt suau",CapDeSetmana,223,Nit,Col.lisió de vehicles en marxa,dis
4,2010,Zona urbana,2010-05-07T17:45:00.000,SE,999999,BADALONA,Barcelones,Barcelona,0,1,...,Sec i net,Via urbana( inclou carrer i carrer residencial),,,"Calma, vent molt suau",CapDeSetmana,1745,Tarda,Bolcada a la calcada,div


In [4]:
# Display the type of data in the columns
df.dtypes

any                            int64
zona                          object
dat                           object
via                           object
pk                             int64
nommun                        object
nomcom                        object
nomdem                        object
f_morts                        int64
f_ferits_greus                 int64
f_ferits_lleus                 int64
f_victimes                     int64
f_unitats_implicades           int64
f_vianants_implicades          int64
f_bicicletes_implicades        int64
f_ciclomotors_implicades       int64
f_motocicletes_implicades      int64
f_veh_lleugers_implicades      int64
f_veh_pesants_implicades       int64
f_altres_unit_implicades       int64
f_unit_desc_implicades         int64
c_velocitat_via              float64
d_acc_amb_fuga                object
d_boira                       object
d_caract_entorn               object
d_carril_especial             object
d_circulacio_mesures_esp      object
d

In [5]:
# Description of the dataframe
df.describe()

Unnamed: 0,any,pk,f_morts,f_ferits_greus,f_ferits_lleus,f_victimes,f_unitats_implicades,f_vianants_implicades,f_bicicletes_implicades,f_ciclomotors_implicades,f_motocicletes_implicades,f_veh_lleugers_implicades,f_veh_pesants_implicades,f_altres_unit_implicades,f_unit_desc_implicades,c_velocitat_via,hor
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,879.0,1000.0
mean,2010.0,572623.259,0.183,1.024,0.515,1.722,2.037,0.27,0.064,0.13,0.331,1.06,0.163,0.019,0.0,87.042093,840.776
std,0.0,494317.964487,0.440119,0.575982,1.073745,1.328337,0.752461,0.505327,0.264526,0.348168,0.558346,0.816742,0.465454,0.143735,0.0,22.981359,760.323907
min,2010.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
25%,2010.0,318.75,0.0,1.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,80.0,125.0
50%,2010.0,999999.0,0.0,1.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,100.0,749.0
75%,2010.0,999999.0,0.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,100.0,1538.25
max,2010.0,999999.0,4.0,5.0,13.0,14.0,8.0,4.0,2.0,2.0,6.0,5.0,6.0,2.0,0.0,120.0,2359.0
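
The `count` row above reports 879 values for `c_velocitat_via` against 1000 for every other numeric column, so 121 rows have no speed limit recorded. `describe()` only covers numeric columns, so a per-column count of missing values is a useful complement. The following is a minimal sketch on a toy dataframe (its values are illustrative, not taken from the dataset); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; the values are illustrative only.
toy = pd.DataFrame({
    "f_morts": [0, 1, 0, 0],
    "c_velocitat_via": [100.0, np.nan, 80.0, np.nan],
    "d_tracat_altimetric": ["Pla", None, None, "Rampa o pendent"],
})

# Number of missing entries per column, keeping only the columns with gaps.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```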


Size and shape of the dataframe

In [6]:
# Data size
print('Dataset size is', df.size)


# Data shape
print('Dataset shape is', df.shape)

Dataset size is 58000
Dataset shape is (1000, 58)
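
The dataframe has exactly 1000 rows and, according to `describe()`, the `any` (year) column is 2010 in every one of them. This suggests the download returned only the first page of the dataset rather than the full 2010-2020 period. The sketch below builds a URL that asks for more rows. It assumes the portal honours Socrata-style `$limit` and `$offset` query parameters; `build_export_url` and the value 50000 are hypothetical choices for illustration, and the result should be checked against the portal's documentation.

```python
from urllib.parse import urlencode

# Endpoint used in the wget cell above.
BASE_URL = "https://analisi.transparenciacatalunya.cat/resource/rmgc-ncpb.csv"


def build_export_url(limit: int = 50000, offset: int = 0) -> str:
    """Return the CSV URL with explicit paging parameters.

    Assumption: the endpoint accepts `$limit` and `$offset`.
    This helper is hypothetical and not part of the original notebook.
    """
    query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return f"{BASE_URL}?{query}"


url = build_export_url(limit=50000)
print(url)

# To download (needs network access and pandas imported as pd):
# df = pd.read_csv(url)
# print(df.shape, df["any"].min(), df["any"].max())
```

If the assumption holds, the commented lines should report more than 1000 rows and a range of years wider than 2010 alone.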


### 2.2. Data Cleaning <a name='dc'></a>

First we'll remove all the unnecessary columns

In [7]:
# Drop unnecessary columns
df.drop(columns = ['any', 'pk', 'nomdem', 'nomcom', 'dat', 'd_gravetat', 'hor', 'nommun', 'd_subtipus_tram', 'tipacc', 'zona', 'd_titularitat_via', 'via', 'd_acc_amb_fuga', 'd_influit_boira',
                  'd_influit_caract_entorn', 'd_influit_circulacio', 'd_influit_estat_clima', 'd_influit_inten_vent', 'd_influit_lluminositat', 'd_influit_mesu_esp', 'd_influit_visibilitat'], inplace = True)
df.head()

Unnamed: 0,f_morts,f_ferits_greus,f_ferits_lleus,f_victimes,f_unitats_implicades,f_vianants_implicades,f_bicicletes_implicades,f_ciclomotors_implicades,f_motocicletes_implicades,f_veh_lleugers_implicades,...,d_sentits_via,d_subtipus_accident,d_subzona,d_superficie,d_tipus_via,d_tracat_altimetric,d_vent,grupdialab,gruphor,tipdia
0,0,1,0,1,2,0,0,0,0,1,...,Un sol sentit,Encalç,Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",Feiners,Nit,dill-dij
1,0,1,3,4,1,0,0,0,0,1,...,Doble sentit,Resta sortides de via,Carretera,Sec i net,Carretera convencional,Pla,"Calma, vent molt suau",CapDeSetmana,Nit,dg
2,1,0,2,3,4,0,0,0,0,2,...,Doble sentit,Col·lisió frontal,Carretera,Sec i net,Carretera convencional,Rampa o pendent,"Calma, vent molt suau",Feiners,Tarda,dill-dij
3,0,2,7,9,2,0,0,0,0,2,...,Un sol sentit,Envestida (frontal lateral),Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",CapDeSetmana,Nit,dis
4,0,1,0,1,1,0,0,0,1,0,...,Un sol sentit,Caiguda en la via,Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",CapDeSetmana,Tarda,div


Now we'll see the size and shape of the new dataframe

In [8]:
# New data size
print('New dataset size is', df.size)


# New data shape
print('New dataset shape is', df.shape)

New dataset size is 36000
New dataset shape is (1000, 36)
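
The numbers are consistent: the drop list contains 22 column names, and 58 - 22 = 36. A check of this kind can be written directly into the notebook. The sketch below shows the idea on a toy dataframe with five columns standing in for the real 58; the variable names are illustrative.

```python
import pandas as pd

# Toy frame with 5 columns standing in for the 58 of the real dataset.
toy = pd.DataFrame({
    "any": [2010],
    "pk": [999],
    "f_morts": [0],
    "d_vent": ["Calma, vent molt suau"],
    "tipdia": ["dg"],
})
to_drop = ["any", "pk"]

before = toy.shape[1]
toy = toy.drop(columns=to_drop)

# Every requested column was present, so the width shrinks by len(to_drop).
assert toy.shape[1] == before - len(to_drop)
print(toy.shape)
```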


Now we'll translate the column names to English for better understanding

In [9]:
df.rename(columns = {'f_morts':'deaths', 'f_ferits_greus':'major_injuries', 'f_ferits_lleus':'minor_injuries', 'f_victimes':'total_victims', 'f_unitats_implicades':'involved_vehicle(s)',
                     'f_vianants_implicades':'involved_pedestrian(s)', 'f_bicicletes_implicades':'involved_bycicle(s)', 'f_ciclomotors_implicades':'involved_moped(s)',
                     'f_motocicletes_implicades':'involved_motorcycle(s)', 'f_veh_lleugers_implicades':'light_vehicle(s)_involved', 'f_veh_pesants_implicades':'heavy_vehicle(s)_involved', 
                     'f_altres_unit_implicades':'other_type_unit_involved', 'f_unit_desc_implicades':'unknown_unit_involved', 'c_velocitat_via':'maximun_allowed_speed', 'd_boira':'haze',
                     'd_caract_entorn':'terrain_characteristics', 'd_carril_especial':'special_lane', 'd_circulacio_mesures_esp':'special_traffic_measures', 'd_climatologia':'climatology',
                     'd_func_esp_via':'special_function_track', 'd_influit_obj_calcada':'object_on_way', 'd_influit_solcs_rases':'f/d_in_road', 'd_inter_seccio':'intersection',
                     'd_limit_velocitat':'speed_limit_visualization', 'd_lluminositat':'illumination', 'd_regulacio_prioritat':'priority_regulation', 'd_sentits_via':'directions',
                     'd_subtipus_accident':'type of accident', 'd_subzona':'zone', 'd_superficie':'road_condition', 'd_tipus_via':'road_type', 'd_tracat_altimetric':'altimeter_layout',
                     'd_vent':'wind', 'grupdialab':'day_type', 'gruphor':'moment', 'tipdia':'day_of_week'}, inplace = True)
# f/d = furrows and ditches



# Display the first rows with the new column names
df.head()

Unnamed: 0,deaths,major_injuries,minor_injuries,total_victims,involved_vehicle(s),involved_pedestrian(s),involved_bycicle(s),involved_moped(s),involved_motorcycle(s),light_vehicle(s)_involved,...,directions,type of accident,zone,road_condition,road_type,altimeter_layout,wind,day_type,moment,day_of_week
0,0,1,0,1,2,0,0,0,0,1,...,Un sol sentit,Encalç,Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",Feiners,Nit,dill-dij
1,0,1,3,4,1,0,0,0,0,1,...,Doble sentit,Resta sortides de via,Carretera,Sec i net,Carretera convencional,Pla,"Calma, vent molt suau",CapDeSetmana,Nit,dg
2,1,0,2,3,4,0,0,0,0,2,...,Doble sentit,Col·lisió frontal,Carretera,Sec i net,Carretera convencional,Rampa o pendent,"Calma, vent molt suau",Feiners,Tarda,dill-dij
3,0,2,7,9,2,0,0,0,0,2,...,Un sol sentit,Envestida (frontal lateral),Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",CapDeSetmana,Nit,dis
4,0,1,0,1,1,0,0,0,1,0,...,Un sol sentit,Caiguda en la via,Zona urbana,Sec i net,Via urbana( inclou carrer i carrer residencial),,"Calma, vent molt suau",CapDeSetmana,Tarda,div
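
Some of the translated names contain parentheses, a slash or a space (for example `involved_vehicle(s)`, `f/d_in_road` and `type of accident`). They work with bracket indexing, but not with attribute access or in formula strings. As an optional extra step, the names could be normalised to plain snake_case. The helper below is a hypothetical sketch and is not part of the original notebook.

```python
import re


def to_identifier(name: str) -> str:
    """Turn a column label into a lowercase snake_case identifier."""
    name = name.replace("(s)", "s")             # 'vehicle(s)' -> 'vehicles'
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # any other symbol -> '_'
    return name.strip("_").lower()


# Three of the column names produced by the rename step above.
examples = ["involved_vehicle(s)", "f/d_in_road", "type of accident"]
cleaned = [to_identifier(name) for name in examples]
print(cleaned)
```

Since `DataFrame.rename` accepts a function for `columns`, the helper could be applied with `df = df.rename(columns=to_identifier)`.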


In [10]:
# Type of objects on each column
df.dtypes

deaths                         int64
major_injuries                 int64
minor_injuries                 int64
total_victims                  int64
involved_vehicle(s)            int64
involved_pedestrian(s)         int64
involved_bycicle(s)            int64
involved_moped(s)              int64
involved_motorcycle(s)         int64
light_vehicle(s)_involved      int64
heavy_vehicle(s)_involved      int64
other_type_unit_involved       int64
unknown_unit_involved          int64
maximun_allowed_speed        float64
haze                          object
terrain_characteristics       object
special_lane                  object
special_traffic_measures      object
climatology                   object
special_function_track        object
object_on_way                 object
f/d_in_road                   object
intersection                  object
speed_limit_visualization     object
illumination                  object
priority_regulation           object
directions                    object
t

#### The next step is to convert all the categorical values into numerical values.

We'll start by converting the values in column 'haze'. This column has only two values of type object, 'No n'hi ha' and 'Si', which are mapped to the integers 0 and 1

In [11]:
# Replace values in column 'haze'
# "No n'hi ha" = there is none (no), replaced by 0
# "Si" = yes, replaced by 1
df['haze'] = df['haze'].replace(["No n'hi ha", "Si"], [0, 1])


# Transform values in terrain_characteristics to int64
# 'Sense Especificar' = unspecified, treated here as no special characteristics, replaced by 0
# 'A nivell' = at grade level, replaced by 1
# 'Desmunt' = cutting (excavated section), replaced by 2
# 'Mixt' = mixed, replaced by 3
# 'Terraplé' = embankment, replaced by 4
df['terrain_characteristics'] = df['terrain_characteristics'].replace(
    ["Sense Especificar", "A nivell", "Desmunt", "Mixt", "Terraplé"], [0, 1, 2, 3, 4])


# Transform values in special_lane to int64, following the order of the list below
# 'Sense Especificar' (unspecified) = 0, "No n'hi ha" (none) = 1, the remaining lane types = 2 to 12
df['special_lane'] = df['special_lane'].replace(
    ["Sense Especificar", "No n'hi ha", "Altres", "Carril acceleració", "Carril avançament", "Carril bici", "Carril bus", "Carril central", "Carril d'alentiment",
     "Carril habilitat en sentit contrari habitual", "Carril lent", "Carril reversible", "Habilitació voral/carril addicional"],
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

In [12]:
df.dtypes

deaths                         int64
major_injuries                 int64
minor_injuries                 int64
total_victims                  int64
involved_vehicle(s)            int64
involved_pedestrian(s)         int64
involved_bycicle(s)            int64
involved_moped(s)              int64
involved_motorcycle(s)         int64
light_vehicle(s)_involved      int64
heavy_vehicle(s)_involved      int64
other_type_unit_involved       int64
unknown_unit_involved          int64
maximun_allowed_speed        float64
haze                           int64
terrain_characteristics        int64
special_lane                   int64
special_traffic_measures      object
climatology                   object
special_function_track        object
object_on_way                 object
f/d_in_road                   object
intersection                  object
speed_limit_visualization     object
illumination                  object
priority_regulation           object
directions                    object
t
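
The listing above still shows many `object` columns. Writing a replacement list by hand for each one is error-prone, so the remaining columns could be encoded in a loop. The sketch below uses `pd.factorize` on a toy dataframe built from values that appear in the outputs above. It is one possible approach under the assumption that the order of the codes does not matter, not the method used in this notebook. `codebooks` is a hypothetical name for the dictionary that records which integer stands for which label.

```python
import pandas as pd

# Toy stand-in for the real dataframe, using values seen in the outputs above.
toy = pd.DataFrame({
    "moment": ["Nit", "Tarda", "Nit"],
    "day_type": ["Feiners", "CapDeSetmana", "Feiners"],
    "deaths": [0, 1, 0],
})

codebooks = {}
# Encode every non-numeric column; numeric columns are left untouched.
for col in toy.select_dtypes(exclude="number").columns:
    codes, uniques = pd.factorize(toy[col])      # codes follow order of appearance
    toy[col] = codes                             # missing values would become -1
    codebooks[col] = dict(enumerate(uniques))    # code -> original label

print(toy)
print(codebooks)
```

Integer codes impose an arbitrary order on the categories, so a correlation computed on them should be read with care. For unordered categories, one-hot encoding with `pd.get_dummies` may be a better fit.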