# Seasonal Flu Vaccine Predictive Model

* Student name: Caroline Surratt
* Student pace: Self-Paced
* Scheduled project review date/time: Thursday, September 21st at 1:00 PM
* Instructor name: Morgan Jones

# Business Understanding

This model is designed to predict the likelihood that an individual receives a seasonal flu vaccine. Healthcare providers could use it in both proactive and reactive ways, as described below.

**1. Proactive: Provide targeted interventions to individuals/populations that are unlikely to receive the seasonal flu vaccine. For example:** 
* displaying informative/educational materials
* talking to individuals about flu risks during routine/preventative appointments
* hosting free/reduced cost vaccine drives
* offering small incentives to individuals who get vaccinated (e.g., a coupon to a local business)

**2. Reactive: Prepare for greater numbers of individuals to require treatment for the flu (especially during peak flu season). For example:** 
* securing a sufficient supply of necessary medical supplies (e.g., flu tests, antiviral drugs, PPE)
* ensuring there is an adequate number of beds for severe flu cases

# Data Understanding

The dataset used for this analysis contains the responses of 26,707 individuals to the National 2009 H1N1 Flu Survey. The features cover individuals' behaviors, opinions, and demographics. They are outlined in greater detail below (please note that these feature descriptions are direct quotes from DrivenData's [Dataset Description](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/)).

**Behavioral Features:**
* Has taken antiviral medications
* Has avoided close contact with others with flu-like symptoms
* Has bought a face mask
* Has frequently washed hands or used hand sanitizer
* Has reduced time at large gatherings
* Has reduced contact with people outside of own household
* Has avoided touching eyes, nose, or mouth

**Demographic Features:**
* Age group
* Race
* Sex
* Household annual income (with respect to 2008 Census poverty thresholds)
* Marital status
* Housing situation
* Employment status
* Employment industry
* Employment occupation
* Geographic region (10-region classification defined by the US Department of Health and Human Services)
* Residence within metropolitan statistical areas (MSAs) as defined by US Census
* Number of other adults in household
* Number of children in household
* Has regular close contact with a child under the age of six months
* Is a healthcare worker
* Has health insurance


**Opinion Features:**
* Respondent's opinion about seasonal flu vaccine effectiveness
* Respondent's opinion about risk of getting sick with seasonal flu without vaccine
* Respondent's worry of getting sick from taking seasonal flu vaccine

**Other Features:**
* Seasonal flu vaccine was recommended by a doctor
* Has one of the specified chronic medical conditions

_Note that there are additional features included in the dataset specific to the H1N1 vaccine that are not outlined here._

# Importing and Splitting Data

In [1]:
import pandas as pd

In [2]:
X = pd.read_csv('training_features', index_col='respondent_id')

In [3]:
X

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,,
26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,"<= $75,000, Above Poverty",Not Married,Rent,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea
26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,...,,Not Married,Own,,lzgpxyit,"MSA, Not Principle City",0.0,0.0,,
26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,"<= $75,000, Above Poverty",Married,Rent,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg


In [None]:
y = pd.read_csv('training_labels', index_col='respondent_id')['seasonal_vaccine']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
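The split above relies on scikit-learn's defaults. As a minimal sketch on synthetic stand-in data (not the survey itself), the cell below shows that omitting `test_size` holds out 25% of the rows. The `stratify` argument is an optional addition, not something the original cell uses; it keeps the share of vaccinated respondents the same in both parts. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 40% positive labels.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 40 + [0] * 60, name="seasonal_vaccine")

# With no test_size given, scikit-learn holds out 25% of the rows.
# stratify=y_demo (an assumption, not in the original cell) preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Whether stratification is worthwhile here depends on how balanced `seasonal_vaccine` is in the real labels, which this notebook has not yet inspected.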

In [None]:
X_train.shape

In [None]:
X_train_numerical = X_train.select_dtypes(exclude=object)
X_train_categorical = X_train.select_dtypes(include=object).fillna('Missing')
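To make the effect of this step concrete, here is a small sketch on a toy frame (hypothetical values, with column names borrowed from the dataset). It suggests that `select_dtypes` separates the columns by dtype, and that only the categorical part has its gaps filled with the `'Missing'` label; the numerical part keeps its `NaN` values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one numeric and one object-typed column.
toy = pd.DataFrame({
    "household_adults": [0.0, 2.0, np.nan],
    "rent_or_own": pd.Series(["Own", None, "Rent"], dtype=object),
})

# Same pattern as the cell above: split by dtype, label missing categories.
toy_numerical = toy.select_dtypes(exclude=object)
toy_categorical = toy.select_dtypes(include=object).fillna("Missing")

print(toy_categorical)
print(toy_numerical.isna().sum())
```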

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

ohe.fit(X_train_categorical)

X_train_categorical_ohe = ohe.transform(X_train_categorical).toarray()

ohe_df = pd.DataFrame(X_train_categorical_ohe,
                      # get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
                      columns=ohe.get_feature_names_out(X_train_categorical.columns),
                      index=X_train_categorical.index)

ohe_df

In [None]:
X_train = pd.concat([X_train_numerical, ohe_df], axis=1)

In [None]:
X_train

In [None]:
X_train.isna().sum()
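Any missing values reported by this check come from the numerical columns, since the categorical ones were filled with `'Missing'` before encoding. The notebook does not say how they will be handled; as one possible option (an assumption, not the author's stated method), the sketch below fills each numerical column with its median using `SimpleImputer` on a toy column with hypothetical values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical column with one missing value (hypothetical values).
num_train = pd.DataFrame({"h1n1_concern": [1.0, 3.0, np.nan, 2.0]})

# Median imputation: the median is learned from the training data only.
imputer = SimpleImputer(strategy="median")
num_train_imputed = pd.DataFrame(
    imputer.fit_transform(num_train),
    columns=num_train.columns,
    index=num_train.index,
)
print(num_train_imputed)
```

The fitted `imputer` could later be applied to the test set with `transform`, so that test rows are filled with medians learned from the training rows.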

In [None]:
from sklearn.tree import DecisionTreeClassifier

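# Named tree_clf so that a later `from sklearn import tree` does not overwrite the fitted model
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)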
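A natural next step after fitting is to compare accuracy on the training and held-out rows. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the survey data), using hypothetical names such as `clf` and `X_syn`. It illustrates a general pattern: a decision tree with no depth limit tends to fit its training data perfectly, so the held-out score is the more informative number.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the flu survey.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Unconstrained tree, mirroring the default settings used above.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_tr, y_tr)

# Compare accuracy on rows the tree has seen versus rows it has not.
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two scores would suggest overfitting, which parameters such as `max_depth` or `min_samples_leaf` can help limit.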

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree