# <center> Accident table visualization </center>

## Introduction

In this notebook, we visualize the data in the accident table to find out which variables differ most between accidents that involve a drunk driver and accidents that do not.

## Method
In particular, we plot the distribution of each variable in the accident table as a histogram, normalized by the total number of events in that plot. Each variable therefore has two distributions: one for accidents with a drunk driver involved and one for accidents without. We compare the two and rank the variables by the difference between their distributions, measured as the sum of the absolute differences between the two normalized histograms.
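
As a minimal sketch of this comparison, the snippet below builds two normalized histograms on shared bins and sums their absolute differences. The arrays `drunk_hours` and `sober_hours` are made-up toy values, not data from the accident table, and the 24 hourly bins are an assumption chosen for illustration.

```python
import numpy as np

# Toy data (hypothetical): hour of day for two small groups of accidents.
drunk_hours = np.array([0, 1, 1, 2, 2, 22, 23, 23])
sober_hours = np.array([8, 9, 12, 15, 17, 17, 18, 23])

# Shared unit-width bins: edges 0..24 give one bin per hour 0..23.
edges = np.arange(0, 25)

drunk_counts, _ = np.histogram(drunk_hours, bins=edges)
sober_counts, _ = np.histogram(sober_hours, bins=edges)

# Normalize each histogram by its own total number of events.
p_drunk = drunk_counts / drunk_counts.sum()
p_sober = sober_counts / sober_counts.sum()

# Sum of absolute per-bin differences between the two distributions.
l1_difference = np.abs(p_drunk - p_sober).sum()
print(l1_difference)
```

Because both histograms sum to 1, this difference lies between 0 (identical distributions) and 2 (distributions with no overlapping bins), which makes the values comparable across variables.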

In [243]:
# import libraries and data.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the data
acc_df = pd.read_csv('data/fars_train/accident_train.csv')

In [244]:
def varDiff(varName):
    '''
    Calculate the difference between the normalized distributions of a given
    variable for accidents with a drunk driver involved and for accidents
    without one (sum of absolute per-bin differences).
    '''
    var_df = acc_df[[varName, 'DRUNK_DR']]
    
    drunk = var_df[(var_df.DRUNK_DR==True)][varName].values
    sober = var_df[(var_df.DRUNK_DR==False)][varName].values

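    minBin = int(np.floor(min(drunk.min(), sober.min())))
    maxBin = int(np.ceil(max(drunk.max(), sober.max())))

    # Unit-width bins. np.histogram closes the last bin on the right, so
    # range(minBin, maxBin+1) would merge the two largest integer values into
    # one bin; the extra edge at maxBin+1 gives the largest value its own bin.
    bins = np.arange(minBin, maxBin + 2)
    # with unit-width bins, density=True yields the fraction of events per bin
    drunk_hist = np.histogram(drunk, bins=bins, density=True)[0]
    sober_hist = np.histogram(sober, bins=bins, density=True)[0]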

    diff = np.sum(np.abs(drunk_hist - sober_hist))
    return diff
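
The ranking step described in the Method section could then look like the sketch below. It is self-contained, so it uses a hypothetical helper `var_diff` that takes the DataFrame as an argument instead of reading the global `acc_df`, and it runs on a made-up `toy_df`. The column names `HOUR` and `MONTH`, the toy values, and the use of `astype(bool)` to split the groups are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def var_diff(df, var_name, flag_col="DRUNK_DR"):
    """L1 difference between the two groups' normalized histograms (hypothetical helper)."""
    flag = df[flag_col].astype(bool)  # assumption: any truthy value means "drunk driver involved"
    drunk = df.loc[flag, var_name].to_numpy()
    sober = df.loc[~flag, var_name].to_numpy()

    lo = int(np.floor(min(drunk.min(), sober.min())))
    hi = int(np.ceil(max(drunk.max(), sober.max())))
    edges = np.arange(lo, hi + 2)  # unit-width bins, largest value in its own bin

    drunk_hist = np.histogram(drunk, bins=edges, density=True)[0]
    sober_hist = np.histogram(sober, bins=edges, density=True)[0]
    return float(np.abs(drunk_hist - sober_hist).sum())


# Toy table (made-up values, not taken from the accident data).
toy_df = pd.DataFrame({
    "DRUNK_DR": [True, True, True, False, False, False],
    "HOUR": [1, 2, 23, 9, 12, 17],
    "MONTH": [1, 6, 12, 1, 6, 12],
})

# Score every candidate variable and sort from most to least different.
candidates = [col for col in toy_df.columns if col != "DRUNK_DR"]
ranking = pd.Series({col: var_diff(toy_df, col) for col in candidates})
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Passing the DataFrame explicitly keeps the helper easy to test on small tables; on the real data the same loop would run over the columns of `acc_df`, skipping any that are non-numeric or contain missing values.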