# Seattle Traffic Accident Severity Prediction (Week 3)
## 1. Introduction
Traffic accidents had reportedly risen to the third leading cause of death across countries by 2020, and they also impose a heavy economic cost on society. People feel unsafe when they drive, cross the street, or even walk along the sidewalk. Vehicle manufacturers have invested heavily in staff and funding to improve vehicle quality, but these efforts have done little to reduce traffic accidents. It is time for governments to take action and identify the main causes of traffic accidents.
## 2. Business Problem
The purpose of this project is to analyze the collision dataset for the city of Seattle, find patterns, and determine the key factors (such as weather, light and road conditions, drug or alcohol influence, and driver inattention) that best predict traffic accident severity. It will use various analytical techniques and machine learning classification algorithms such as logistic regression, decision trees, k-nearest neighbors, and support vector machines.
## 3. Target Audience
This study is mainly intended to help government transportation departments improve traffic policies or install public facilities such as street lamps and speed bumps at appropriate locations. Car rental and insurance companies are also among the target audience, because they can classify potential customers and design different services based on customers' driving habits.

## 4. Data
The Seattle Department of Transportation provides records of traffic collisions so that the reasons behind them can be studied. The dataset contains all kinds of collisions in Seattle from 2004 to 2020. To predict the damage level of road accidents, 'SEVERITYCODE' is chosen as the dependent variable. In this dataset, severity ranges from property-damage-only collision (code 1) to injury collision (code 2). Among dozens of attributes, this project concentrates on both natural and human factors that may lead to car accidents. The natural factors are 'WEATHER', 'ROADCOND' and 'LIGHTCOND', which represent weather, road and lighting conditions respectively. The human factors reflect the state of the driver: 'INATTENTIONIND', 'UNDERINFL' and 'SPEEDING' indicate driver inattention, drug or alcohol influence, and speeding respectively. All attributes used in this project are shown below.

| Attribute | Data type, length | Description |
| :--- | :--- | :--- |
| WEATHER | Text, 300 | A description of the weather conditions at the time of the collision. |
| ROADCOND | Text, 300 | The condition of the road during the collision. |
| LIGHTCOND | Text, 300 | The light conditions during the collision. |
| INATTENTIONIND | Text, 1 | Whether or not the collision was due to inattention. (Y/N) |
| UNDERINFL | Text, 10 | Whether or not a driver involved was under the influence of drugs or alcohol. |
| SPEEDING | Text, 1 | Whether or not speeding was a factor in the collision. (Y/N) |
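
As a rough illustration of how the target and the six attributes in the table above might be pulled out of the full frame, here is a minimal sketch. The names `TARGET`, `FEATURES` and `select_columns`, and the tiny `toy` frame, are hypothetical and only stand in for the real `raw_data`; the notebook itself does not define them.

```python
import pandas as pd

# Column names taken from the attribute table; the constants are illustrative.
TARGET = 'SEVERITYCODE'
FEATURES = ['WEATHER', 'ROADCOND', 'LIGHTCOND',
            'INATTENTIONIND', 'UNDERINFL', 'SPEEDING']


def select_columns(df):
    """Return a copy holding only the target and the six chosen attributes."""
    return df[[TARGET] + FEATURES].copy()


# Toy stand-in for the real dataset (values are made up for the example).
toy = pd.DataFrame({
    'SEVERITYCODE': [2, 1],
    'LOCATION': ['A ST AND B ST', 'C AVE BETWEEN D ST AND E ST'],
    'WEATHER': ['Raining', 'Clear'],
    'ROADCOND': ['Wet', 'Dry'],
    'LIGHTCOND': ['Daylight', 'Daylight'],
    'INATTENTIONIND': [None, 'Y'],
    'UNDERINFL': ['N', '0'],
    'SPEEDING': [None, None],
})

subset = select_columns(toy)
print(subset.columns.tolist())
```

With the real data, the same call would be `select_columns(raw_data)`, assuming the column names match the metadata.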

### 4.1 Data Source
The full dataset can be found [here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv).

### 4.2 Metadata
The metadata can be found [here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf).

### 4.3 Display the data
Load the raw dataset.

In [1]:
!wget -O Data-Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

--2020-09-16 13:35:23--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘Data-Collisions.csv’


2020-09-16 13:35:24 (47.6 MB/s) - ‘Data-Collisions.csv’ saved [73917638/73917638]



In [7]:
import pandas as pd
raw_data = pd.read_csv('Data-Collisions.csv',index_col='OBJECTID')
raw_data.head()


| OBJECTID | SEVERITYCODE | X | Y | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | LOCATION | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 2 | -122.323148 | 47.70314 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | 5TH AVE NE AND NE 103RD ST | ... | Wet | Daylight | NaN | NaN | NaN | 10 | Entering at angle | 0 | 0 | N |
| 2 | 1 | -122.347294 | 47.647172 | 52200 | 52200 | 2607959 | Matched | Block | NaN | AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N | ... | Wet | Dark - Street Lights On | NaN | 6354039.0 | NaN | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 3 | 1 | -122.33454 | 47.607871 | 26700 | 26700 | 1482393 | Matched | Block | NaN | 4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST | ... | Dry | Daylight | NaN | 4323031.0 | NaN | 32 | One parked--one moving | 0 | 0 | N |
| 4 | 1 | -122.334803 | 47.604803 | 1144 | 1144 | 3503937 | Matched | Block | NaN | 2ND AVE BETWEEN MARION ST AND MADISON ST | ... | Dry | Daylight | NaN | NaN | NaN | 23 | From same direction - all others | 0 | 0 | N |
| 5 | 2 | -122.306426 | 47.545739 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | SWIFT AVE S AND SWIFT AV OFF RP | ... | Wet | Daylight | NaN | 4028032.0 | NaN | 10 | Entering at angle | 0 | 0 | N |


The dataset has 194,673 records of car accidents with 38 attributes (37 data columns once 'OBJECTID' is used as the index).

In [3]:
raw_data.shape

(194673, 38)

## 5. Methodology

### 5.1 Data Cleansing
First, we check for missing data in the dataset, as shown below. 'SEVERITYCODE' has no missing data, while 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND' and 'SPEEDING' all have missing values to some extent. 'INATTENTIONIND' and 'SPEEDING' are the most affected, with fewer than 30 thousand non-null entries each in the whole dataset. These indicators need to be examined further.
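
The missing-value check described above can also be summarised per indicator rather than read off the full listing. The following is only a sketch: `missing_summary` and the `toy` frame are hypothetical names introduced for illustration, and the toy values are made up.

```python
import pandas as pd


def missing_summary(df, columns):
    """Count missing values per column and express them as a percentage of rows."""
    missing = df[columns].isna().sum()
    return pd.DataFrame({
        'missing': missing,
        'missing_pct': (missing / len(df) * 100).round(2),
    })


# Toy stand-in for the real dataset (values are made up for the example).
toy = pd.DataFrame({
    'SEVERITYCODE': [1, 2, 1, 2],
    'INATTENTIONIND': ['Y', None, None, 'Y'],
    'SPEEDING': [None, None, None, 'Y'],
})

summary = missing_summary(toy, ['SEVERITYCODE', 'INATTENTIONIND', 'SPEEDING'])
print(summary)
```

Applied to the real frame, the call would be `missing_summary(raw_data, indicators)` with the list of seven core indicators.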

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [13]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194673 entries, 1 to 219547
Data columns (total 37 columns):
SEVERITYCODE      194673 non-null int64
X                 189339 non-null float64
Y                 189339 non-null float64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null obj

In [21]:
# Value counts for each core indicator.
indicators = ['SEVERITYCODE','INATTENTIONIND','UNDERINFL','WEATHER','ROADCOND','LIGHTCOND','SPEEDING']
for indicator in indicators:
    print('Indicator:{}'.format(indicator))
    print(raw_data[indicator].value_counts())
    print('--------------------------------------')

Indicator:SEVERITYCODE
1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
--------------------------------------
Indicator:INATTENTIONIND
Y    29805
Name: INATTENTIONIND, dtype: int64
--------------------------------------
Indicator:UNDERINFL
N    100274
0     80394
Y      5126
1      3995
Name: UNDERINFL, dtype: int64
--------------------------------------
Indicator:WEATHER
Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64
--------------------------------------
Indicator:ROADCOND
Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other               

In [None]:
#processing missing data

#Encoding
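
The two placeholder steps above are not filled in yet. One possible way to approach them, based on the value counts shown earlier ('INATTENTIONIND' only contains `Y`, and 'UNDERINFL' mixes `N`/`0` with `Y`/`1`), is sketched below. Every choice here is an assumption rather than the project's settled method: a missing flag is read as "no", rows with a missing 'UNDERINFL' are dropped, missing categorical values are folded into the existing `Unknown` category, and one-hot encoding is used. The function name `clean_and_encode` and the `toy` frame are hypothetical, and 'UNDERINFL' is assumed to be read as strings.

```python
import pandas as pd

FLAG_COLS = ['INATTENTIONIND', 'SPEEDING']
CATEGORICAL_COLS = ['WEATHER', 'ROADCOND', 'LIGHTCOND']
# Assumption: 'N' and '0' both mean "not under influence", 'Y' and '1' both mean "under influence".
UNDERINFL_MAP = {'N': 0, '0': 0, 'Y': 1, '1': 1}


def clean_and_encode(df):
    """Return a numeric frame built from the target and the six chosen attributes."""
    out = df[['SEVERITYCODE', 'UNDERINFL'] + FLAG_COLS + CATEGORICAL_COLS].copy()

    # Assumption: a missing flag means the factor was not recorded, so treat it as 0.
    for col in FLAG_COLS:
        out[col] = (out[col] == 'Y').astype(int)

    # Unify the mixed coding; values outside the map (including missing) become NaN.
    out['UNDERINFL'] = out['UNDERINFL'].map(UNDERINFL_MAP)

    # Assumption: fold missing categorical values into the existing 'Unknown' category.
    out[CATEGORICAL_COLS] = out[CATEGORICAL_COLS].fillna('Unknown')

    # Assumption: drop the rows where 'UNDERINFL' could not be resolved.
    out = out.dropna(subset=['UNDERINFL'])
    out['UNDERINFL'] = out['UNDERINFL'].astype(int)

    # One-hot encode the three categorical columns.
    return pd.get_dummies(out, columns=CATEGORICAL_COLS)


# Toy stand-in for the real dataset (values are made up for the example).
toy = pd.DataFrame({
    'SEVERITYCODE': [1, 2, 1, 2],
    'INATTENTIONIND': ['Y', None, None, 'Y'],
    'UNDERINFL': ['N', '1', '0', None],
    'SPEEDING': [None, 'Y', None, None],
    'WEATHER': ['Clear', 'Raining', None, 'Clear'],
    'ROADCOND': ['Dry', 'Wet', 'Dry', None],
    'LIGHTCOND': ['Daylight', 'Daylight', None, 'Daylight'],
})

encoded = clean_and_encode(toy)
print(encoded.columns.tolist())
```

Whether to drop or impute the rows with missing values is a modelling decision; dropping is used here only because it keeps the sketch short.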


## 6. Results

## 7. Discussion

## 8. Conclusion