#### Module 4 - Project

### CRISP-DM Framework

   * Business Problem:   Reduce Motor Vehicle Deaths in the US using Socio Economic Data
   * Data Understanding: What the data set is
   * Data Preparation:   All data cleaning, including treatment of missing data, NaNs, zeros and transforms
   * Modelling:          The modelling workflow, models used and feature transforms / engineering
   * Evaluation:         Evaluation of the Final Model vs Baseline Model
   * Deployment:         Results generated by the model

#### CRISP-DM Framework: Business Problem
Identify factors that could help reduce motor vehicle deaths in the US using socio-economic data.

#### CRISP-DM Framework: Data Understanding
What the data set is:

The dataset is the 2019 County Health Rankings, a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. It includes the following socio-economic data and indicators:

    Child mortality
    Children eligible for free or reduced price lunch
    Demographics
    Diabetes prevalence
    Disconnected youth
    Drug overdose deaths
    Firearm fatalities
    Food insecurity
    Frequent mental distress
    Frequent physical distress
    HIV prevalence
    Homeownership
    Homicides
    Infant mortality
    Insufficient sleep
    Life expectancy
    Limited access to healthy foods
    Median household income
    Motor vehicle crash deaths
    Other primary care providers
    Premature age-adjusted mortality
    Residential segregation
    Severe housing cost burden
    Uninsured adults
    Uninsured children

#### Importing the required Libraries for the Workflow 

In [1]:
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import Petna as pt
matplotlib.rcParams['figure.figsize'] = (50,50)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
pd.set_option('display.max_rows', 2000)
%reload_ext autoreload
%autoreload 2

##### Import Raw Data from Excel File

In [2]:
xls = pd.ExcelFile('2019 County Health Rankings Data - v2.xls')
sheets = xls.sheet_names
sheets

['Introduction',
 'Outcomes & Factors Rankings',
 'Outcomes & Factors SubRankings',
 'Ranked Measure Data',
 'Ranked Measure Sources & Years',
 'Additional Measure Data',
 'Addtl Measure Sources & Years']

In [3]:
# pt.missingvalues(sheets[5],xls)

#### CRISP-DM Framework: Data Preparation
All merging, cleaning and transformations involved in the preprocessing stage.

In [4]:
rmd = xls.parse(sheets[3],header=1)
amd = xls.parse(sheets[5],header=1)
amd = amd.drop(columns=['State','County'])

#### Creating the master dataFrame

In [5]:
df = rmd.merge(amd, on='FIPS')
# Focusing on percentage and rates as opposed to absolute values
for kw in ['95%','Quartile','#','Unreliable']:
    df = df.drop(columns=[x for x in df.columns if kw in x])
df = df.set_index('FIPS')
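
The merge-then-filter step above can be illustrated on a toy example. This is a minimal sketch, not the project data: the frames `rmd_toy` and `amd_toy` and their column names are made up for illustration, and `validate="one_to_one"` is an optional safeguard that is not part of the original cell.

```python
import pandas as pd

# Toy stand-ins for the two sheets; values and column names are illustrative only.
rmd_toy = pd.DataFrame({
    "FIPS": [1001, 1003],
    "State": ["Alabama", "Alabama"],
    "County": ["Autauga", "Baldwin"],
    "% Smokers": [19.0, 17.0],
    "95% CI - Low": [18.0, 16.0],
    "Quartile": [1, 2],
})
amd_toy = pd.DataFrame({
    "FIPS": [1001, 1003],
    "# Deaths": [78, 250],
    "% Rural": [42.0, 42.3],
})

# Inner join on the county identifier; validate raises if FIPS is duplicated
# on either side, which would silently multiply rows otherwise.
merged = rmd_toy.merge(amd_toy, on="FIPS", validate="one_to_one")

# Drop every column whose title contains one of the keywords.
keywords = ["95%", "Quartile", "#", "Unreliable"]
to_drop = [c for c in merged.columns if any(kw in c for kw in keywords)]
slim = merged.drop(columns=to_drop).set_index("FIPS")

print(list(slim.columns))
```

Only the rate and percentage columns (plus `State` and `County`) survive the filter; confidence-interval bounds, quartiles and raw counts are removed.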

In [6]:
# Columns title formatting
subs = [(' ', '_'),('.',''),("'",""),('™', ''), ('®',''),
        ('+','plus'), ('½','half'), ('-','_'), ('<','under'), ('%','percent'), ('/', '_or_') 
       ]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

df.columns = [col_formatting(col) for col in df.columns]
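
To see what the substitutions do, the sketch below applies the same `subs` list to a few sample titles. The raw titles are assumptions inferred from the cleaned names used later in the notebook, not copied from the spreadsheet.

```python
# Same substitution pairs as in the cell above, applied in order.
subs = [(' ', '_'), ('.', ''), ("'", ""), ('™', ''), ('®', ''),
        ('+', 'plus'), ('½', 'half'), ('-', '_'), ('<', 'under'),
        ('%', 'percent'), ('/', '_or_')]

def col_formatting(col):
    """Return a column title that is safe to use as a Python-style identifier."""
    for old, new in subs:
        col = col.replace(old, new)
    return col

# Hypothetical raw titles, for illustration only.
examples = ["% Rural", "Household Income", "Average Daily PM2.5"]
for name in examples:
    print(name, "->", col_formatting(name))
```

Because the pairs are applied sequentially, `% Rural` first becomes `%_Rural` and then `percent_Rural`.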

## Replacing NaN values by state average

#### Dependent / Target Variable

##### MV_Mortality_Rate

The motor vehicle death rate per 100,000 population is our dependent / target variable.
The rate is calculated as number of motor vehicle deaths / population. In the survey, the number of motor vehicle deaths is reported over a seven-year period, 2011 - 2017. To convert this to an annual equivalent we divide by 7.
Missing values are replaced with the state average rate. Interestingly, no county in the dataset reported a single-figure number of MV deaths.

In [7]:
df['MV_Mortality_Rate'] = df['MV_Mortality_Rate']/7
df['MV_Mortality_Rate'] = df['MV_Mortality_Rate'].fillna(df.groupby(by='State')['MV_Mortality_Rate'].transform('mean'))
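
The two steps above (annualising, then filling gaps with the state mean) can be checked on a toy frame. This is a sketch with made-up states and values, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np
import pandas as pd

# Made-up seven-year rates for five counties in two states.
toy = pd.DataFrame({
    "State": ["A", "A", "A", "B", "B"],
    "MV_Mortality_Rate": [14.0, np.nan, 7.0, 21.0, np.nan],
})

# Convert the seven-year figure to an annual equivalent.
toy["MV_Mortality_Rate"] = toy["MV_Mortality_Rate"] / 7

# transform('mean') returns one value per row (the mean of that row's state,
# ignoring NaN), so it aligns with the original index and can feed fillna.
state_mean = toy.groupby("State")["MV_Mortality_Rate"].transform("mean")
toy["MV_Mortality_Rate"] = toy["MV_Mortality_Rate"].fillna(state_mean)

print(toy)
```

State A has annual rates 2.0 and 1.0, so its missing county receives 1.5; state B has a single observed rate of 3.0, which is copied to its missing county.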

#### Independent Variables

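NaN values in Household_Income, percent_Alcohol_Impaired, percent_Rural, percent_Uninsured_x and Years_of_Potential_Life_Lost_Rate (YPLL) have been replaced by the state mean. NaN values in Average_Daily_PM25 are replaced by 0, as the only counties with NaN are those of Alaska and Hawaii, where the value is missing for the whole state.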

In [8]:
df['Average_Daily_PM25'] = df['Average_Daily_PM25'].fillna(0)
# Fill the remaining gaps with the mean of the county's state
state_mean_cols = ['Household_Income', 'percent_Alcohol_Impaired', 'percent_Rural',
                   'percent_Uninsured_x', 'Years_of_Potential_Life_Lost_Rate']
for col in state_mean_cols:
    df[col] = df[col].fillna(df.groupby(by='State')[col].transform('mean'))
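
One property of state-mean imputation is worth checking: when every county in a state is missing a value, the state mean is itself NaN and the fill has no effect. That is presumably why Average_Daily_PM25 needs a different treatment for Alaska and Hawaii. The sketch below uses made-up values to show how such states can be detected.

```python
import numpy as np
import pandas as pd

# Made-up data: every "Alaska" row is missing, "Ohio" has one observed value.
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Ohio", "Ohio"],
    "Average_Daily_PM25": [np.nan, np.nan, 11.0, np.nan],
})

state_mean = toy.groupby("State")["Average_Daily_PM25"].transform("mean")
filled = toy["Average_Daily_PM25"].fillna(state_mean)

# States that still contain NaN after the fill have no observed value at all.
remaining = toy.loc[filled.isna(), "State"].unique().tolist()
print(remaining)
```

A check like this, run before writing the CSV, shows which columns need a fallback other than the state mean.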

In [9]:
df.to_csv('df.csv')