# Road accident forecast in Belgium

## I. Introduction/Business Problem
In this section we first discuss the problem we face, then describe the data set used to address it.

### Purpose of the notebook
This notebook presents the results of an investigation into the probability of being involved in a road accident in Belgium and the severity that is likely to result. Severity is defined in terms of fatalities and casualties. The notebook should thus be useful to anyone who owns a car, since it helps to judge whether driving is a good idea under specific circumstances. The model is intended to warn people when they should be more careful than usual, for instance by driving more slowly or by finding an alternative to the car.

### Data set
The open data used for this analysis comes from the website of "Statbel", the Belgian statistical office. The data can be retrieved from the following link: https://statbel.fgov.be/en/open-data?category=162. It contains 655 467 observations (i.e. road accidents) and 35 attributes (among others day of the week, road type, built-up area, type of collision, light conditions, municipality and district, and number of accidents with a death within 30 days).


In [53]:
import pandas as pd
# The CSV sits inside a zip archive, which cannot be traversed like a folder.
# pandas infers zip compression from the extension; this assumes the archive holds only this one CSV.
df = pd.read_csv(r'C:\Users\NJ5866\Desktop\Road_accidents_Belgium.zip')
df.head()
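
A file path cannot reach into a zip archive as if it were a folder, so a CSV stored inside one has to be read through the archive itself. When an archive holds more than one file, the member can be opened explicitly with the standard `zipfile` module and passed to pandas. The following is a minimal sketch under stated assumptions: the archive is built in memory, and the member name and the two sample rows are illustrative stand-ins, not the real Statbel file.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory archive so the sketch runs anywhere.
# The member name and the sample rows are illustrative, not the real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "Road_accidents_Belgium.csv",
        "DT_DAY,DT_HOUR,MS_ACCT\n27/sep/19,18,1\n20/nov/19,12,1\n",
    )
buffer.seek(0)

# Open the archive, then hand the chosen member to pandas as a file object.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("Road_accidents_Belgium.csv") as member:
        sample = pd.read_csv(member)

print(sample.shape)
```

With a real file on disk, the `buffer` object would simply be replaced by the path to the `.zip` file.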

In [16]:
df.tail()

| Unnamed: 0 | DT_DAY | DT_HOUR | CD_DAY_OF_WEEK | TX_DAY_OF_WEEK_DESCR_FR | TX_DAY_OF_WEEK_DESCR_NL | CD_BUILD_UP_AREA | TX_BUILD_UP_AREA_DESCR_NL | TX_BUILD_UP_AREA_DESCR_FR | CD_COLL_TYPE | TX_COLL_TYPE_DESCR_NL | ... | TX_PROV_DESCR_FR | CD_RGN_REFNIS | TX_RGN_DESCR_NL | TX_RGN_DESCR_FR | MS_ACCT | MS_ACCT_WITH_DEAD | MS_ACCT_WITH_DEAD_30_DAYS | MS_ACCT_WITH_MORY_INJ | MS_ACCT_WITH_SERLY_INJ | MS_ACCT_WITH_SLY_INJ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 655463 | 4/okt/05 | 15 | 2 | Mardi | dinsdag | 2.0 | Buiten bebouwde kom | Hors agglomération | 4.0 | Langs opzij | ... | Province de Brabant wallon | 3000 | Waals Gewest | Région wallonne | 1 | 0 | 0 | 0 | 0 | 1 |
| 655464 | 26/nov/05 | 3 | 6 | Samedi | zaterdag | 2.0 | Buiten bebouwde kom | Hors agglomération | 4.0 | Langs opzij | ... | Province de Brabant wallon | 3000 | Waals Gewest | Région wallonne | 1 | 0 | 0 | 0 | 0 | 1 |
| 655465 | 2/okt/05 | 4 | 7 | Dimanche | zondag | 1.0 | Binnen bebouwde kom | En agglomération | 4.0 | Langs opzij | ... | Province de Brabant wallon | 3000 | Waals Gewest | Région wallonne | 1 | 0 | 0 | 0 | 0 | 1 |
| 655466 | 21/nov/05 | 11 | 1 | Lundi | maandag | 2.0 | Buiten bebouwde kom | Hors agglomération | 4.0 | Langs opzij | ... | Province de Brabant wallon | 3000 | Waals Gewest | Région wallonne | 1 | 0 | 0 | 0 | 0 | 1 |
| 655467 | 16/okt/05 | 3 | 7 | Dimanche | zondag | 2.0 | Buiten bebouwde kom | Hors agglomération | 3.0 | Langs achteren (of naast elkaar) | ... | Province de Brabant wallon | 3000 | Waals Gewest | Région wallonne | 1 | 0 | 0 | 0 | 0 | 1 |


As one might notice, some variables are simply translations of others (French and Dutch). To avoid using the same variable twice, we can simplify the data set by removing one of the two columns in each such pair. French has arbitrarily been chosen to be kept. Let's check which variables need to be removed (column names ending with '_NL'). As the column list shows, 9 variables will be withdrawn.
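
The same check can be done programmatically by filtering on the `_NL` suffix instead of reading the column list by eye. The snippet below is a minimal sketch on a toy frame: `toy`, `nl_columns` and `toy_fr` are hypothetical names, and only a handful of the notebook's column names are reproduced.

```python
import pandas as pd

# Toy frame (no rows) with a few of the column names used in this notebook.
toy = pd.DataFrame(columns=[
    "DT_DAY",
    "TX_DAY_OF_WEEK_DESCR_FR",
    "TX_DAY_OF_WEEK_DESCR_NL",
    "TX_RGN_DESCR_NL",
    "TX_RGN_DESCR_FR",
    "MS_ACCT",
])

# Collect every Dutch description column by its suffix.
nl_columns = [name for name in toy.columns if name.endswith("_NL")]
print(len(nl_columns), nl_columns)

# Drop them in one call; drop() returns a new frame and leaves `toy` untouched.
toy_fr = toy.drop(columns=nl_columns)
print(list(toy_fr.columns))
```

Applied to the full data set, the length of the filtered list should match the 9 columns counted by hand, which gives a quick sanity check before dropping anything.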

In [34]:
list(df.columns.values) 

['DT_DAY',
 'DT_HOUR',
 'CD_DAY_OF_WEEK',
 'TX_DAY_OF_WEEK_DESCR_FR',
 'TX_DAY_OF_WEEK_DESCR_NL',
 'CD_BUILD_UP_AREA',
 'TX_BUILD_UP_AREA_DESCR_NL',
 'TX_BUILD_UP_AREA_DESCR_FR',
 'CD_COLL_TYPE',
 'TX_COLL_TYPE_DESCR_NL',
 'TX_COLL_TYPE_DESCR_FR',
 'CD_LIGHT_COND',
 'TX_LIGHT_COND_DESCR_NL',
 'TX_LIGHT_COND_DESCR_FR',
 'CD_ROAD_TYPE',
 'TX_ROAD_TYPE_DESCR_NL',
 'TX_ROAD_TYPE_DESCR_FR',
 'CD_MUNTY_REFNIS',
 'TX_MUNTY_DESCR_NL',
 'TX_MUNTY_DESCR_FR',
 'CD_DSTR_REFNIS',
 'TX_ADM_DSTR_DESCR_NL',
 'TX_ADM_DSTR_DESCR_FR',
 'CD_PROV_REFNIS',
 'TX_PROV_DESCR_NL',
 'TX_PROV_DESCR_FR',
 'CD_RGN_REFNIS',
 'TX_RGN_DESCR_NL',
 'TX_RGN_DESCR_FR',
 'MS_ACCT',
 'MS_ACCT_WITH_DEAD',
 'MS_ACCT_WITH_DEAD_30_DAYS',
 'MS_ACCT_WITH_MORY_INJ',
 'MS_ACCT_WITH_SERLY_INJ',
 'MS_ACCT_WITH_SLY_INJ']

In [32]:
# Drop the nine Dutch description columns; their French counterparts are kept.
nl_columns = ["TX_DAY_OF_WEEK_DESCR_NL", "TX_BUILD_UP_AREA_DESCR_NL", "TX_COLL_TYPE_DESCR_NL", "TX_LIGHT_COND_DESCR_NL", "TX_ROAD_TYPE_DESCR_NL", "TX_MUNTY_DESCR_NL", "TX_ADM_DSTR_DESCR_NL", "TX_PROV_DESCR_NL", "TX_RGN_DESCR_NL"]
df_new = df.drop(columns=nl_columns)
df_new.head()

| Unnamed: 0 | DT_DAY | DT_HOUR | CD_DAY_OF_WEEK | TX_DAY_OF_WEEK_DESCR_FR | CD_BUILD_UP_AREA | TX_BUILD_UP_AREA_DESCR_FR | CD_COLL_TYPE | TX_COLL_TYPE_DESCR_FR | CD_LIGHT_COND | TX_LIGHT_COND_DESCR_FR | ... | CD_PROV_REFNIS | TX_PROV_DESCR_FR | CD_RGN_REFNIS | TX_RGN_DESCR_FR | MS_ACCT | MS_ACCT_WITH_DEAD | MS_ACCT_WITH_DEAD_30_DAYS | MS_ACCT_WITH_MORY_INJ | MS_ACCT_WITH_SERLY_INJ | MS_ACCT_WITH_SLY_INJ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 27/sep/19 | 18 | 5 | Vendredi | 2.0 | Hors agglomération | 1.0 | Collision en chaine (entre 4 conducteurs ou plus) | 1.0 | Plein jour | ... | 10000.0 | Province d’Anvers | 2000 | Région flamande | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 20/nov/19 | 12 | 3 | Mercredi | 1.0 | En agglomération | 4.0 | Par le côté | 1.0 | Plein jour | ... | 10000.0 | Province d’Anvers | 2000 | Région flamande | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 15/jul/19 | 14 | 1 | Lundi | 1.0 | En agglomération |  | Non disponible | 1.0 | Plein jour | ... | 10000.0 | Province d’Anvers | 2000 | Région flamande | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 20/apr/19 | 2 | 6 | Samedi | 2.0 | Hors agglomération | 1.0 | Collision en chaine (entre 4 conducteurs ou plus) | 3.0 | Nuit, éclairage public allumé | ... | 10000.0 | Province d’Anvers | 2000 | Région flamande | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 25/okt/19 | 12 | 5 | Vendredi | 2.0 | Hors agglomération | 5.0 | Avec un piéton | 1.0 | Plein jour | ... | 10000.0 | Province d’Anvers | 2000 | Région flamande | 1 | 0 | 0 | 0 | 0 | 1 |
