# <center> Accident table visualization </center>

## Introduction

In this notebook, we visualize the data in the accident table to find out which variables are most strongly associated with drunk-driver involvement.

## Method
In particular, we plot the probability distribution of each variable in the accident table. The distributions are plotted as histograms, each normalized by the total number of events in that histogram. Each variable therefore has two distributions: one for accidents with a drunk driver involved and one for accidents without. We then compare the two distributions and rank the variables by the difference between them.
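
One way to make "the difference between the distributions" concrete, and the one the `varDiff` function in this notebook computes, is the sum of absolute differences between the two normalized histograms. Here $p_{\text{drunk}}(x)$ and $p_{\text{sober}}(x)$ are notation introduced only for illustration: they denote the fraction of accidents in each group for which the variable takes the value $x$.

$$
D = \sum_{x} \left| p_{\text{drunk}}(x) - p_{\text{sober}}(x) \right|
$$

$D$ is 0 when the two distributions are identical and reaches its maximum of 2 when the two groups share no values.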

In [178]:
# import libraries and data.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the data
acc_df = pd.read_csv('data/fars_train/accident_train.csv')

In [182]:
def varDiff(varName):
    '''
    Calculate the difference between the distributions of a given variable for
    drunk-driver-involved accidents and sober-driver-involved accidents.
    '''
    var_df = acc_df[[varName, 'DRUNK_DR']]
    
    # get the distribution of drunk driver involved accidents,
    # normalized by the total number of such accidents
    drunk = var_df[(var_df.DRUNK_DR==True)][varName].value_counts(normalize=True)

    # get the distribution of sober driver involved accidents,
    # normalized by the total number of such accidents
    sober = var_df[(var_df.DRUNK_DR==False)][varName].value_counts(normalize=True)

    # calculate the difference; the two Series are aligned on the variable's
    # values, and a value that is missing from one group counts as 0 there
    diff = drunk.sub(sober, fill_value=0).abs().sum()
    
    return diff
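
The ranking step described in the Method section could look roughly like the sketch below. It is only an illustration under stated assumptions: `toy_df`, its `NIGHT` and `WEEKDAY` columns, and the helper `distribution_gap` are hypothetical stand-ins, not part of the accident table or of this notebook's code. The helper takes the DataFrame as an argument so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for the accident table. The NIGHT and WEEKDAY columns and
# all values are made up for illustration.
toy_df = pd.DataFrame({
    'DRUNK_DR': [True, True, True, True, False, False, False, False],
    'NIGHT':    [1, 1, 1, 0, 0, 0, 0, 1],
    'WEEKDAY':  [1, 2, 1, 2, 1, 2, 1, 2],
})

def distribution_gap(df, column, flag='DRUNK_DR'):
    """Sum of absolute differences between the normalized distributions of
    `column` for rows where `flag` is True and rows where it is False."""
    drunk = df.loc[df[flag] == True, column].value_counts(normalize=True)
    sober = df.loc[df[flag] == False, column].value_counts(normalize=True)
    # Align on the variable's values; a value absent from one group counts as 0.
    return drunk.sub(sober, fill_value=0).abs().sum()

# Score every candidate column and sort from largest to smallest difference.
candidates = [c for c in toy_df.columns if c != 'DRUNK_DR']
ranking = pd.Series({c: distribution_gap(toy_df, c) for c in candidates})
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

In this toy data `NIGHT` is distributed differently in the two groups while `WEEKDAY` is not, so `NIGHT` ranks first.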