### Loading Libraries

# 0. Introduction

The dataset was acquired from the NHTSA [FTP server](ftp://ftp.nhtsa.dot.gov/FARS/).

The most recent manuals and publications can be accessed via [CrashStats](https://crashstats.nhtsa.dot.gov/#/).

## How to load the dataset

All CSV.zip files should be in the 'data' folder. There's no need for extraction. 
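
Because the archives stay zipped, it can help to check what each one contains before loading it. The following is a minimal sketch using only the standard library; the helper name `list_zip_members` is hypothetical, and it assumes the archives sit directly inside the folder passed in:

```python
import zipfile
from pathlib import Path


def list_zip_members(folder="data"):
    """Return {archive name: sorted member names} for every .zip file in folder."""
    contents = {}
    for path in sorted(Path(folder).glob("*.zip")):
        # Only the archive's directory is read; nothing is extracted to disk.
        with zipfile.ZipFile(path) as zf:
            contents[path.name] = sorted(zf.namelist())
    return contents
```

Calling `list_zip_members('data')` would then show which CSV names (for example `ACCIDENT.csv`) are available to pass to `open` on the archive.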


In [8]:
#loading libraries
# pd.reset_option('max_rows')
# pd.reset_option('max_columns')
#tweaks
import pandas as pd
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))
import warnings
warnings.filterwarnings('ignore')
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.precision', 5)
#ETL and EDA
import zipfile
import time
import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
#Visualization, MPL, SNS, PY
%matplotlib notebook
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.pyplot as plt #sets up plotting under plt
import matplotlib.cm as cm #allows us easy access to colormaps
from matplotlib import rcParams # special matplotlib argument for improved plots

import seaborn as sns #sets up styles and gives us more plotting options
sns.set_style('whitegrid')
sns.set_context("poster")
# Standard plotly imports
import plotly.express as px
import chart_studio.plotly as py
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly / cufflinks in offline mode
init_notebook_mode(connected=True)
##Machine Learning
#statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
#sklearn
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_selection import RFE #recursive feature selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import svm


# ETL 

## Importing Accidents 

In [2]:
# importing df
with zipfile.ZipFile('data/FARS2018NationalCSV.zip') as zf:
    with zf.open('ACCIDENT.csv') as my_csv:
        accidents_18 = pd.read_csv(my_csv)

# converting col names to lowercase
accidents_18.columns = accidents_18.columns.str.lower()

The hour or minute is recorded as 99 (unknown) in 245 rows. To keep track of these rows, a new second column is created and set to 30 for them (0 otherwise); the unknown hour and minute values are then replaced with 0 so that a valid datetime can be built.

In [17]:
accidents_18['second'] = 0
# .loc avoids chained assignment, which may silently modify a copy
accidents_18.loc[(accidents_18.hour == 99) | (accidents_18.minute == 99), 'second'] = 30

In [18]:
# .loc avoids chained assignment, which may silently modify a copy
accidents_18.loc[accidents_18.hour == 99, 'hour'] = 0
accidents_18.loc[accidents_18.minute == 99, 'minute'] = 0

In [19]:
#creating a datetime column from the 6 date and time columns of df
accidents_18['date'] = pd.to_datetime(accidents_18[['day', 'month', 'year', 'hour', 'minute', 'second']])
accidents_18 = accidents_18.sort_values('date').reset_index(drop = True)
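
The marker rule used in the cells above can be shown in isolation. This is a minimal plain-Python sketch rather than the notebook's pandas code; the helper name `crash_timestamp` is hypothetical, and it assumes 99 is the only unknown code for hour and minute:

```python
from datetime import datetime

UNKNOWN = 99  # code used for an unknown hour or minute, as described above


def crash_timestamp(year, month, day, hour, minute):
    """Build a datetime; unknown hour/minute parts become 0 and seconds become 30."""
    time_unknown = hour == UNKNOWN or minute == UNKNOWN
    return datetime(
        year, month, day,
        0 if hour == UNKNOWN else hour,
        0 if minute == UNKNOWN else minute,
        30 if time_unknown else 0,  # 30 seconds marks a partly unknown time
    )
```

Rows whose timestamp has 30 in the seconds field can later be filtered out of any time-of-day analysis, since every known time has 0 seconds.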

### Value dictionaries for categorical values of routes, states, and weather

In [20]:
# creating states/weather/route dictionary, values recorded from: 



routes = {1: 'Interstate', 2: 'U.S. Highway',3: 'State Highway',
          4: 'County Road',5: 'Local Street – Township',
          6: 'Local Street – Municipality',7: 'Local Street – Frontage Road',
          8: 'Other',9: 'Unknown'}


states = {1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR',
          6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE',
          11: 'DC', 12: 'FL', 13: 'GA', 15: 'HI',
          16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA',
          20: 'KS', 21: 'KY', 22: 'LA', 23: 'ME',
          24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN',
          28: 'MS', 29: 'MO', 30: 'MT', 31: 'NE',
          32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM',
          36: 'NY', 37: 'NC', 38: 'ND', 39: 'OH',
          40: 'OK', 41: 'OR', 42: 'PA', 43: 'PR',
          44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN',
          48: 'TX', 49: 'UT', 50: 'VT', 51: 'VA',
          52: 'VI', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY'}

weather = {0: 'No Additional Atmospheric Conditions', 1: 'Clear',
           2: 'Rain', 3: 'Sleet, Hail',
           4: 'Snow', 5: 'Fog, Smog, Smoke', 6: 'Severe Crosswinds',
           7: 'Blowing Sand, Soil, Dirt',
           8: 'Other', 10: 'Cloudy', 11: 'Blowing Snow',
           12: 'Freezing Rain or Drizzle',
           98: 'Not Reported', 99: 'Unknown'}

# replacing values in df
accidents_18['route'] = accidents_18['route'].map(routes)
accidents_18['state'] = accidents_18['state'].map(states)
accidents_18['weather'] = accidents_18['weather'].map(weather)
accidents_18['weather1'] = accidents_18['weather1'].map(weather)
accidents_18['weather2'] = accidents_18['weather2'].map(weather)

## Importing Violations

In [11]:
with zipfile.ZipFile('data/FARS2018NationalCSV.zip') as zf:
    with zf.open('VIOLATN.csv') as my_csv:
        violations_18 = pd.read_csv(my_csv)

# converting col names to lowercase
violations_18.columns = violations_18.columns.str.lower()

### Creating groups for violations
Violations are grouped into three categories: reckless driving, impaired driving, and other.

![](info/mviolatns_1.png)

![](info/mviolatns_2.png)

![](info/mviolatns_3.png)

In [12]:
# creating a dictionary for grouping the violations into three groups
reckless = {j: "Reckless" for j in range(1, 11)}
impaired = {j: "Impaired" for j in range(11, 20)}
other = {j: "Other" for j in range(20, 100)}
violations = {0: "Other", **reckless, **impaired, **other}
violations_18['mviolatn'] = violations_18.mviolatn.map(violations)
violations_18 = pd.DataFrame(violations_18.groupby(['st_case'])['mviolatn'].agg(set)).reset_index()
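# keys are frozensets because the text form of a set has no guaranteed element order
violation_groups = {
    frozenset({'Other'}): "Other",
    frozenset({'Other', 'Reckless'}): "Reckless",
    frozenset({'Reckless'}): "Reckless",
    frozenset({'Impaired', 'Other', 'Reckless'}): "Impaired, Reckless",
    frozenset({'Impaired', 'Other'}): "Impaired",
    frozenset({'Impaired'}): "Impaired",
    frozenset({'Impaired', 'Reckless'}): "Impaired, Reckless"}
violations_18['mviolatn'] = violations_18['mviolatn'].map(lambda groups: violation_groups.get(frozenset(groups)))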

#merging violations groups and accidents per st_case

accidents_18 = accidents_18.merge(violations_18, on='st_case')

## Dealing with Null values:

The dataset encodes missing values as runs of 9s and 8s of varying length.

The length depends on the column. For example, 99 and 98 mean unknown and not reported in the weather columns, yet 99 is also a valid county code.

As a precaution, these values were not passed to the read_csv method as NA values; they are handled column by column to avoid introducing false NAs into the dataset.

na_values = [88, 98, 99, 999, 99997, 8888, 9999, 99998, 99999,
             999999, 9999999, 99999999, 999999999, 9999999999]


In [None]:
#selecting all the columns except county and the newly created date column
#and replacing the values in na_values with NaN; the result is assigned back
#because replace(..., inplace=True) on a column selection only changes a copy
na_values = [88, 98, 99, 999, 99997, 8888, 9999, 99998, 99999, 999999, 9999999, 99999999, 999999999, 9999999999]
cols = accidents_18.columns.difference(['date', 'county'])
accidents_18[cols] = accidents_18[cols].replace(na_values, np.nan)
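
Since the same number can be a missing-value code in one column and a valid code in another, one option is to list the codes per column. The following is a minimal sketch on toy data; the `sentinels` mapping and its contents are assumptions for illustration, not a complete FARS specification:

```python
import numpy as np
import pandas as pd

# Toy frame: 98 and 99 are missing-value codes for weather, but 99 is a valid county code.
df = pd.DataFrame({"weather": [1, 98, 99], "county": [99, 13, 1]})

# Hypothetical per-column code lists; the real ones would come from the FARS manual.
sentinels = {"weather": [98, 99]}

for col, codes in sentinels.items():
    # Keep values that are not listed codes; the listed codes become NaN.
    df[col] = df[col].where(~df[col].isin(codes), np.nan)
```

Only the columns named in `sentinels` are touched, so the county value 99 in the toy frame is left as it is.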

In [None]:
accidents_18.fatals.value_counts()

# EDA & Visualizations 

In [None]:
accidents_18['text'] = accidents_18.state +' - ' + accidents_18.date.dt.month_name(
) + ' ' + accidents_18.date.dt.day.astype(str) + ', ' + accidents_18.fatals.astype(str) + ' Killed'

fig = go.Figure(data=go.Scattergeo(

    lon=accidents_18.longitud,
    lat=accidents_18.latitude,
    text=accidents_18.text,
    marker_size=accidents_18.fatals ** 0.5 * 5,
    marker_opacity=0.75,
    marker_color='rgb(200, 0, 0)'))

fig.update_layout(
    title='Traffic Fatalities by Location in U.S. (2018)<br>''<sub>Collision Date and Deaths</sub>',
    geo=dict(
        scope='usa',
        projection_type='albers usa',
        showland=True,
        landcolor="rgb(250, 250, 250)",
        subunitcolor="rgb(217, 217, 217)",
        countrycolor="rgb(217, 217, 217)",
        countrywidth=0.5,
        subunitwidth=0.5))
fig.show()


#make title bigger
#make size bigger
#change color?
#animation per year when other data frames added
#or animation frame per how many people died?



In [None]:
accidents_18[accidents_18.date.dt.day.isna()]

In [None]:
#total fatalities by state (not yet normalized by 2018 census population)

state_fatals_18_sum = accidents_18.groupby(['state']).agg({'fatals': 'sum'}).sort_values('fatals', ascending = False).reset_index()

fig = px.bar(state_fatals_18_sum, x ='state', y = 'fatals',
             labels={'state':'State', 'fatals':'Fatalities'},
             title = 'Traffic Fatalities by State in U.S. 2018',
             template = 'plotly_white', color_discrete_sequence=px.colors.qualitative.Set1)

fig.show()


#animation per year after adding 4 years of data?
#divided by population?


In [None]:
#fatalities per hour?
#fatalities per day?
#fatalities per month?


In [None]:
# Actual to Nearest Tenth Mile
# (Assume decimal, e.g., 12345 = 1234.5)



## Fatalities by VMT
Fatalities by VMT (vehicle miles traveled) is a unit for assessing road traffic fatalities. The metric is computed by dividing the number of fatalities by the estimated VMT.

Transport risk is usually computed relative to the distance traveled by people, whereas road traffic risk usually takes only the distance traveled by vehicles into account.[6]

In the United States, the unit is used as an aggregate in yearly federal publications, while its usage is more sporadic in other countries. For instance, some publications use it to compare different kinds of roads, computed over a five-year period between 1995 and 2000.[7]

In the United States, it is computed per 100 million miles traveled, while internationally it is computed per 100 million or 1 billion kilometers traveled.
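
The rate described above is a simple ratio. A minimal sketch follows; the function name and the sample figures are hypothetical and only illustrate the arithmetic:

```python
def fatalities_per_100m_vmt(fatalities, vmt_miles):
    """Return fatalities per 100 million vehicle miles traveled."""
    if vmt_miles <= 0:
        raise ValueError("vmt_miles must be positive")
    return fatalities * 100_000_000 / vmt_miles


# Made-up figures: 500 deaths over 50 billion vehicle miles.
rate = fatalities_per_100m_vmt(500, 50_000_000_000)
```

With these made-up figures the rate comes out to 1 fatality per 100 million miles; real inputs would pair state fatality counts with published VMT estimates.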

According to the Minnesota Department of Public Safety, Office of Traffic Safety:

Volume of traffic, or vehicle miles traveled (VMT), is a predictor of crash incidence. All other things being equal, as VMT increases, so will traffic crashes. The relationship may not be simple, however; after a point, increasing congestion leads to reduced speeds, changing the proportion of crashes that occur at different severity levels.[8]

In [None]:
# correlation of VMT, weather temperature, new driver?
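# only numeric columns can be correlated, so the mapped text columns are dropped first
accidents_18_corr = accidents_18.select_dtypes(include='number').corr()
accidents_18_corr.style.background_gradient(cmap='coolwarm').format('{:.2f}')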