# Applied Data Science Capstone - Part 1 - Problem Description and Data Description 

# Problem Description

This capstone project addresses car accidents and the prediction of their severity from historical accident data collected over many years. Predicting the severity of an accident helps dispatch the appropriate emergency services to the scene, which makes more effective use of resources and can help save lives. Beyond prediction, exploring the collected data can help pinpoint other problems to address; for example, road quality and conditions can be improved, which in turn contributes to the well-being and safety of the population.

To be useful for predicting severity, and for developing policies and making decisions that improve safety and use financial and other resources efficiently, the data should include, for each reported accident, among other things: the time (hour, minute) and day of the accident, its location and severity, the number of people involved, and the road and weather conditions at the time. Once an accident is reported, the system would predict its severity using a model learned from the historical data. Appropriate actions can then be taken based on the prediction.

Various government departments and authorities can use this system to deal with the accident itself efficiently and effectively. In addition, improvements, new policies and guidelines can be developed based on analysis of the collected data. Departments and authorities that can be involved and can benefit include the police department, the fire department, hospitals, civil defense, the traffic and transport department and public works.





# Data Description

The car accident data used in this project was obtained from Kaggle (https://www.kaggle.com/ahmedlahlou/accidents-in-france-from-2005-to-2016). It consists of five CSV files recording information collected about car accidents in France from 2005 to 2016.

The five files downloaded from Kaggle are:
1. The caracteristics.csv file (the spelling used when the file is loaded in the code), which contains data related to the characteristics of each accident recorded. The file contains 16 columns and 839985 rows. The columns include the accident number, date and time of the accident, the address, the lighting condition and other information.
2. The places.csv file, which contains data related to the place where each accident occurred. The file contains 18 columns and 839985 rows. The columns include the accident number, the lane and other information.
3. The users.csv file which contains data related to the people involved in each accident recorded. The file contains 12 columns and 1876005 rows. The columns include the accident number, the year of birth, the gravity of the injury, the gender and other information.
4. The vehicles.csv file, which contains data related to the vehicles involved in each accident recorded. The file contains 9 columns and 1433389 rows. The columns include the accident number, the type of vehicle and other information about the vehicles involved.
5. The holidays.csv file which contains data related to the dates of holidays in France during 2005-2016. The file contains 2 columns and 132 rows. The columns include the date of the holiday and the holiday name.

Below we provide the Python code that loads the five files used in our project and reports their first rows, column names and dimensions.

In [33]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [30]:
# load downloaded data files

In [31]:
# First the characteristics file
dataframe_characteristics = pd.read_csv('caracteristics.csv', encoding = 'latin-1', low_memory = False)
dataframe_characteristics.head(5)

Unnamed: 0,Num_Acc,an,mois,jour,hrmn,lum,agg,int,atm,col,com,adr,gps,lat,long,dep
0,201600000001,16,2,1,1445,1,2,1,8.0,3.0,5.0,"46, rue Sonneville",M,0.0,0,590
1,201600000002,16,3,16,1800,1,2,6,1.0,6.0,5.0,1a rue du cimetière,M,0.0,0,590
2,201600000003,16,7,13,1900,1,1,1,1.0,6.0,11.0,,M,0.0,0,590
3,201600000004,16,8,15,1930,2,2,1,7.0,3.0,477.0,52 rue victor hugo,M,0.0,0,590
4,201600000005,16,12,23,1100,1,2,3,1.0,3.0,11.0,rue Joliot curie,M,0.0,0,590


In [32]:
# column names in the data frame for characteristics
dataframe_characteristics.columns

Index(['Num_Acc', 'an', 'mois', 'jour', 'hrmn', 'lum', 'agg', 'int', 'atm',
       'col', 'com', 'adr', 'gps', 'lat', 'long', 'dep'],
      dtype='object')

In [15]:
# number of rows and columns in the data frame for characteristics
dataframe_characteristics.shape

(839985, 16)
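
The problem description calls for the time (hour, minute) and day of each accident. The following is a minimal sketch of how these might be derived from the `an`, `mois`, `jour` and `hrmn` columns previewed above. It assumes, without confirmation from the dataset documentation, that `an` is a two-digit year offset from 2000 and that `hrmn` encodes the time as HHMM. The helper name `add_timestamp` is hypothetical, and the sample rows are copied from the preview above.

```python
import pandas as pd

# A few rows copied from the characteristics preview above.
sample = pd.DataFrame({
    "Num_Acc": [201600000001, 201600000002, 201600000003],
    "an": [16, 16, 16],
    "mois": [2, 3, 7],
    "jour": [1, 16, 13],
    "hrmn": [1445, 1800, 1900],
})

def add_timestamp(df):
    """Return a copy of df with hour, minute and timestamp columns.

    Assumptions (not confirmed by the source): 'an' is a two-digit
    year offset from 2000 and 'hrmn' stores the time as HHMM.
    """
    out = df.copy()
    out["hour"] = out["hrmn"] // 100
    out["minute"] = out["hrmn"] % 100
    # pd.to_datetime can assemble timestamps from named component columns.
    parts = pd.DataFrame({
        "year": 2000 + out["an"],
        "month": out["mois"],
        "day": out["jour"],
        "hour": out["hour"],
        "minute": out["minute"],
    })
    # Invalid combinations become NaT instead of raising an error.
    out["timestamp"] = pd.to_datetime(parts, errors="coerce")
    return out

sample_with_time = add_timestamp(sample)
print(sample_with_time[["Num_Acc", "hour", "minute", "timestamp"]])
```

Using `errors="coerce"` is a design choice for this sketch: malformed dates or times in the full file would surface as missing timestamps that can be counted and inspected rather than stopping the run.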

In [16]:
# Second -  the places file
dataframe_places = pd.read_csv('places.csv', encoding = 'latin-1', low_memory = False)
dataframe_places.head(5)

Unnamed: 0,Num_Acc,catr,voie,v1,v2,circ,nbv,pr,pr1,vosp,prof,plan,lartpc,larrout,surf,infra,situ,env1
0,201600000001,3.0,39,,,2.0,0.0,,,0.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,0.0
1,201600000002,3.0,39,,,1.0,0.0,,,0.0,1.0,2.0,0.0,58.0,1.0,0.0,1.0,0.0
2,201600000003,3.0,1,,,2.0,2.0,,,0.0,1.0,3.0,0.0,68.0,2.0,0.0,3.0,99.0
3,201600000004,4.0,0,,,2.0,0.0,,,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,99.0
4,201600000005,4.0,0,,,0.0,0.0,,,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,3.0


In [17]:
# column names in the data frame for places
dataframe_places.columns

Index(['Num_Acc', 'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'pr', 'pr1',
       'vosp', 'prof', 'plan', 'lartpc', 'larrout', 'surf', 'infra', 'situ',
       'env1'],
      dtype='object')

In [19]:
# number of rows and columns in the data frame for places
dataframe_places.shape

(839985, 18)

In [20]:
# Third -  the users file
dataframe_users = pd.read_csv('users.csv', encoding = 'latin-1', low_memory = False)
dataframe_users.head(5)

Unnamed: 0,Num_Acc,place,catu,grav,sexe,trajet,secu,locp,actp,etatp,an_nais,num_veh
0,201600000001,1.0,1,1,2,0.0,11.0,0.0,0.0,0.0,1983.0,B02
1,201600000001,1.0,1,3,1,9.0,21.0,0.0,0.0,0.0,2001.0,A01
2,201600000002,1.0,1,3,1,5.0,11.0,0.0,0.0,0.0,1960.0,A01
3,201600000002,2.0,2,3,1,0.0,11.0,0.0,0.0,0.0,2000.0,A01
4,201600000002,3.0,2,3,2,0.0,11.0,0.0,0.0,0.0,1962.0,A01


In [21]:
# column names in the data frame for users
dataframe_users.columns

Index(['Num_Acc', 'place', 'catu', 'grav', 'sexe', 'trajet', 'secu', 'locp',
       'actp', 'etatp', 'an_nais', 'num_veh'],
      dtype='object')

In [22]:
# number of rows and columns in the data frame for users
dataframe_users.shape

(1876005, 12)
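
Since the users file has one row per person and shares the `Num_Acc` column with the characteristics file, the number of people involved in each accident can be obtained by counting rows per accident number. The sketch below illustrates the idea on a few rows copied from the previews above; the column name `n_users` and the variable names are hypothetical, and this is one possible approach rather than the method used later in the project.

```python
import pandas as pd

# Rows copied from the characteristics and users previews above.
characteristics = pd.DataFrame({
    "Num_Acc": [201600000001, 201600000002],
    "lum": [1, 1],
})
users = pd.DataFrame({
    "Num_Acc": [201600000001, 201600000001,
                201600000002, 201600000002, 201600000002],
    "grav": [1, 3, 3, 3, 3],
})

# Count the user rows recorded for each accident number.
users_per_accident = (
    users.groupby("Num_Acc")
    .size()
    .rename("n_users")
    .reset_index()
)

# Attach the count to the accident-level table; validate guards
# against accidental row duplication during the join.
merged = characteristics.merge(
    users_per_accident, on="Num_Acc", how="left", validate="one_to_one"
)
print(merged)
```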

In [23]:
# Fourth -  the vehicles file
dataframe_vehicles = pd.read_csv('vehicles.csv', encoding = 'latin-1', low_memory = False)
dataframe_vehicles.head(5)

Unnamed: 0,Num_Acc,senc,catv,occutc,obs,obsm,choc,manv,num_veh
0,201600000001,0.0,7,0,0.0,0.0,1.0,1.0,B02
1,201600000001,0.0,2,0,0.0,0.0,7.0,15.0,A01
2,201600000002,0.0,7,0,6.0,0.0,1.0,1.0,A01
3,201600000003,0.0,7,0,0.0,1.0,6.0,1.0,A01
4,201600000004,0.0,32,0,0.0,0.0,1.0,1.0,B02


In [24]:
# column names in the data frame for vehicles
dataframe_vehicles.columns

Index(['Num_Acc', 'senc', 'catv', 'occutc', 'obs', 'obsm', 'choc', 'manv',
       'num_veh'],
      dtype='object')

In [25]:
# number of rows and columns in the data frame for vehicles
dataframe_vehicles.shape

(1433389, 9)

In [26]:
# Fifth -  the holidays file
dataframe_holidays = pd.read_csv('holidays.csv', encoding = 'latin-1', low_memory = False)
dataframe_holidays.head(5)

Unnamed: 0,ds,holiday
0,2005-01-01,New year
1,2005-03-28,Easter Monday
2,2005-05-01,Labour Day
3,2005-05-05,Ascension Thursday
4,2005-05-08,Victory in Europe Day


In [27]:
# column names in the data frame for holidays
dataframe_holidays.columns

Index(['ds', 'holiday'], dtype='object')

In [28]:
# number of rows and columns in the data frame for holidays
dataframe_holidays.shape

(132, 2)
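
The holidays file can be used to flag whether an accident happened on a public holiday. The following is a small sketch of that idea. The holiday rows are copied from the preview above, while the `accidents` frame, its `date` column and its accident numbers are made up for illustration; in practice the date would first have to be built from the characteristics columns.

```python
import pandas as pd

# Rows copied from the holidays preview above.
holidays = pd.DataFrame({
    "ds": ["2005-01-01", "2005-03-28", "2005-05-01"],
    "holiday": ["New year", "Easter Monday", "Labour Day"],
})

# Hypothetical accident dates, for illustration only.
accidents = pd.DataFrame({
    "Num_Acc": [1, 2, 3],
    "date": pd.to_datetime(["2005-01-01", "2005-01-02", "2005-05-01"]),
})

# Parse the holiday dates, then test membership by calendar day.
holiday_dates = pd.to_datetime(holidays["ds"]).dt.normalize()
accidents["is_holiday"] = accidents["date"].dt.normalize().isin(holiday_dates)
print(accidents)
```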