# Data Science Capstone Project
## Vehicle Collision Prediction

## Table of Contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction/Business Problem <a name="introduction"></a>

Vehicular accidents are common on roads across the world, and they vary widely in severity, ranging from property damage, such as minor fender-benders, to loss of life for one or more parties. What if it were possible to predict the likelihood and severity of an accident given current conditions? Drivers would benefit from this information: they could decide whether a trip is worth the risk or should be postponed until conditions improve.

## Data <a name="data"></a>

A dataset from the Seattle Department of Transportation (SDOT) will be used to create and train multiple models, which will then be compared on accuracy. The SDOT dataset includes entries for nearly 195,000 accidents from 2004 to the present. Each accident is assigned a severity category and described by many features that can be used for modeling. A few examples of the features are as follows:
* Collision Address Type (Alley, Block, Intersection)
* Location
* Collision Type
* Number of people involved in the collision
* Number of pedestrians involved in the collision
* Number of cyclists involved in the collision
* Number of vehicles involved in the collision
* Number of fatalities
* Weather conditions
* Road conditions
* Lighting Conditions
* And more

The primary focus of this investigation is the environmental driving conditions at the time of the collision. Therefore, the following features will be investigated:
* Weather conditions (WEATHER)
* Road conditions (ROADCOND)
* Light conditions (LIGHTCOND)

## Methodology <a name="methodology"></a>

### Load the Dataset and Create a Clean Dataframe

In [1]:
# Load the desired columns from a csv into a dataframe
import pandas as pd
df = pd.read_csv('Data-Collisions.csv', usecols = ['WEATHER', 'ROADCOND', 'LIGHTCOND', 'SEVERITYCODE'])
df.head()

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight |
| 1 | 1 | Raining | Wet | Dark - Street Lights On |
| 2 | 1 | Overcast | Dry | Daylight |
| 3 | 1 | Clear | Dry | Daylight |
| 4 | 2 | Raining | Wet | Daylight |

In [2]:
# Count non-null entries per column to spot NaN or missing values
df.count()

SEVERITYCODE    194673
WEATHER         189592
ROADCOND        189661
LIGHTCOND       189503
dtype: int64

In [3]:
# Drop rows with NAN or missing values
clean_df = df.dropna(axis = 0)
clean_df.count()

SEVERITYCODE    189337
WEATHER         189337
ROADCOND        189337
LIGHTCOND       189337
dtype: int64
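
The counts above imply that WEATHER, ROADCOND, and LIGHTCOND are missing 5,081, 5,012, and 5,170 entries respectively, and that `dropna` removed 194,673 - 189,337 = 5,336 rows. The same figures can be read directly rather than inferred from `count()`. The following is a minimal sketch on a toy frame; `toy_df` and its values are illustrative assumptions, not rows from the SDOT file.

```python
import pandas as pd

# Toy frame standing in for the collision data (illustrative values only).
toy_df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", None, "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", None, "Dry"],
    "LIGHTCOND": ["Daylight", "Daylight", "Dusk", None],
})

# Number of missing entries in each column.
missing_per_column = toy_df.isna().sum()

# Number of rows with at least one missing value,
# i.e. the rows that dropna(axis=0) would remove.
rows_with_missing = int(toy_df.isna().any(axis=1).sum())

print(missing_per_column)
print("rows with a missing value:", rows_with_missing)
```

Applied to `df`, the same two expressions would be expected to report the per-column gaps and the dropped-row total noted above.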

In [4]:
clean_df['WEATHER'].value_counts(sort=True)

Clear                       111008
Raining                      33117
Overcast                     27681
Unknown                      15039
Snowing                        901
Other                          824
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               55
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [5]:
clean_df['ROADCOND'].value_counts(sort=True)

Dry               124300
Wet                47417
Unknown            15031
Ice                 1206
Snow/Slush           999
Other                131
Standing Water       115
Sand/Mud/Dirt         74
Oil                   64
Name: ROADCOND, dtype: int64

In [6]:
clean_df['LIGHTCOND'].value_counts(sort=True)

Daylight                    116077
Dark - Street Lights On      48440
Unknown                      13456
Dusk                          5889
Dawn                          2502
Dark - No Street Lights       1535
Dark - Street Lights Off      1192
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [7]:
clean_df['SEVERITYCODE'].value_counts(sort=True)

1    132285
2     57052
Name: SEVERITYCODE, dtype: int64
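
The two severity classes are not balanced, which matters when models are compared on accuracy. As a rough sketch, the class shares and the accuracy of a trivial majority-class predictor can be computed from the counts above. The variable names here are hypothetical, and the counts are copied by hand from the output rather than read from the file.

```python
import pandas as pd

# Counts copied from the SEVERITYCODE output above.
severity_counts = pd.Series({1: 132285, 2: 57052}, name="SEVERITYCODE")

# Fraction of records in each severity class.
severity_share = severity_counts / severity_counts.sum()

# Accuracy of a classifier that always predicts the most common class.
majority_baseline = float(severity_share.max())

print(severity_share.round(3))
print("majority-class baseline accuracy:", round(majority_baseline, 3))
```

Any model trained on these records would need to exceed this baseline (roughly 0.70 before further filtering) for its accuracy to be meaningful.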

In [8]:
# Drop Unknown & Other values from WEATHER, ROADCOND, LIGHTCOND
index_values_to_drop = clean_df[((clean_df['WEATHER'] == 'Unknown') | (clean_df['WEATHER'] == 'Other')) 
                               | ((clean_df['ROADCOND'] == 'Unknown') | (clean_df['ROADCOND'] == 'Other')) 
                               | ((clean_df['LIGHTCOND'] == 'Unknown') | (clean_df['LIGHTCOND'] == 'Other'))].index
clean_df = clean_df.drop(index_values_to_drop)  # reassign instead of passing inplace=True, which triggered pandas warnings
clean_df.count()

SEVERITYCODE    169957
WEATHER         169957
ROADCOND        169957
LIGHTCOND       169957
dtype: int64
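
This filter removes 189,337 - 169,957 = 19,380 rows. The chained comparisons can also be written more compactly with `DataFrame.isin`. The sketch below is an alternative formulation, not the code that produced the counts above, and it runs on a toy frame with made-up rows; `condition_cols`, `uninformative`, and `filtered_df` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for clean_df (illustrative values only).
toy_df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "WEATHER": ["Clear", "Unknown", "Raining", "Overcast", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Other", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Daylight", "Dusk", "Unknown",
                  "Dark - Street Lights On"],
})

condition_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]
uninformative = ["Unknown", "Other"]

# True for rows where any condition column holds an uninformative label.
has_uninformative = toy_df[condition_cols].isin(uninformative).any(axis=1)

# Keep the remaining rows; .copy() gives an independent frame for later edits.
filtered_df = toy_df.loc[~has_uninformative].copy()

print(filtered_df)
```

With this form, adding another label to exclude only requires extending the `uninformative` list.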

## Results <a name="results"></a>

## Discussion <a name="discussion"></a>

## Conclusion <a name="conclusion"></a>