# Seasonal Flu Vaccine Predictive Model

* Student name: Caroline Surratt
* Student pace: Self-Paced
* Scheduled project review date/time: Thursday, September 21st at 1:00 PM
* Instructor name: Morgan Jones

# Business Understanding

This model predicts the likelihood that an individual receives the seasonal flu vaccine. Healthcare providers could use it in both proactive and reactive ways, as described below.

**1. Proactive: Provide targeted interventions to individuals/populations that are unlikely to receive the seasonal flu vaccine. For example:** 
* displaying informative/educational materials
* talking to individuals about flu risks during routine/preventative appointments
* hosting free/reduced cost vaccine drives
* offering small incentives to individuals who get vaccinated (e.g., a coupon to a local business)

**2. Reactive: Prepare for greater numbers of individuals to require treatment for the flu (especially during peak flu season). For example:** 
* securing a sufficient supply of necessary medical supplies (e.g., flu tests, antiviral drugs, PPE)
* ensuring there is an adequate number of beds for severe flu cases

# Data Understanding

The dataset used for this analysis contains the responses of 26,707 individuals to the National 2009 H1N1 Flu Survey. The features cover individuals' behaviors, opinions, and demographics. These features are outlined in greater detail below (please note that these feature descriptions are direct quotes from DrivenData's [Dataset Description](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/)).

**Behavioral Features:**
* Has taken antiviral medications
* Has avoided close contact with others with flu-like symptoms
* Has bought a face mask
* Has frequently washed hands or used hand sanitizer
* Has reduced time at large gatherings
* Has reduced contact with people outside of own household
* Has avoided touching eyes, nose, or mouth

**Demographic Features:**
* Age group
* Race
* Sex
* Household annual income (with respect to 2008 Census poverty thresholds)
* Marital status
* Housing situation
* Employment status
* Employment industry
* Employment occupation
* Geographic region (10-region classification defined by the US Department of Health and Human Services)
* Residence within metropolitan statistical areas (MSAs) as defined by US Census
* Number of other adults in household
* Number of children in household
* Has regular close contact with a child under the age of six months
* Is a healthcare worker
* Has health insurance


**Opinion Features:**
* Respondent's opinion about seasonal flu vaccine effectiveness
* Respondent's opinion about risk of getting sick with seasonal flu without vaccine
* Respondent's worry of getting sick from taking seasonal flu vaccine

**Other Features:**
* Seasonal flu vaccine was recommended by a doctor
* Has one of the specified chronic medical conditions

_Note that there are additional features included in the dataset specific to the H1N1 vaccine that are not outlined here._

# Importing and Splitting Data

Before creating the baseline model, I will import the data and split it into a training set and a test set. The split must happen before any data cleaning or model fitting so that no information from the test set leaks into the preprocessing steps. This keeps the test score an honest estimate of how the model will perform on future unseen data.

The features are stored in the file titled "training_features", and the target variable is stored in the file titled "training_labels". Both files are located in the data folder of this repository.

In [1]:
import pandas as pd

In [2]:
X = pd.read_csv('training_features', index_col='respondent_id')
y = pd.read_csv('training_labels', index_col='respondent_id')['seasonal_vaccine']

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [4]:
X_train.shape

(20030, 35)
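The row count above follows from the default behavior of `train_test_split`: when no `test_size` is given, 25% of the rows are held out for testing. The sketch below reproduces that arithmetic on placeholder arrays (`X_demo` and `y_demo` are made-up stand-ins, not the survey data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26707  # number of survey respondents stated in Data Understanding

# placeholder features and target, used only to check the split sizes
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros(n_rows)

# with no test_size given, 25% of rows (rounded up) go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))  # 20030 6677
```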

For the baseline model, I will fill missing numerical values with the mean value and missing categorical values with the most frequently occurring value.
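As a toy illustration of these two strategies, the sketch below fits each imputer on a tiny made-up training frame and then applies it to a made-up test frame. The column names and values are hypothetical. The point is that the fill values are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data (not the survey data)
num_train = pd.DataFrame({"score": [1.0, 3.0, np.nan]})
num_test = pd.DataFrame({"score": [np.nan, 10.0]})
cat_train = pd.DataFrame({"color": ["red", "red", "blue", np.nan]}, dtype=object)
cat_test = pd.DataFrame({"color": [np.nan, "blue"]}, dtype=object)

# fit on the training rows only
num_imputer = SimpleImputer(strategy="mean").fit(num_train)
cat_imputer = SimpleImputer(strategy="most_frequent").fit(cat_train)

# reuse the learned fill values on the test rows (transform, not fit_transform)
num_test_filled = num_imputer.transform(num_test)
cat_test_filled = cat_imputer.transform(cat_test)

print(num_test_filled[0, 0])  # 2.0, the training mean of 1.0 and 3.0
print(cat_test_filled[0, 0])  # red, the most frequent training value
```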

In [5]:
# select only the numerical columns (everything that is not an object dtype)
X_train_numerical = X_train.select_dtypes(exclude=object)

# select only the categorical columns (object dtype)
X_train_categorical = X_train.select_dtypes(include=object)

In [6]:
X_train_numerical.head()

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children
25194,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,0.0,,4.0,2.0,2.0,4.0,2.0,2.0,1.0,1.0
14006,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,2.0,1.0,4.0,5.0,4.0,2.0,1.0
11285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,4.0,1.0,1.0,4.0,2.0,1.0,0.0,1.0
2900,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,4.0,1.0,1.0,4.0,4.0,2.0,0.0,0.0
19083,2.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,,,5.0,1.0,2.0,1.0,2.0,4.0,,


In [7]:
X_train_categorical.head()

respondent_id,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,employment_industry,employment_occupation
25194,18 - 34 Years,12 Years,White,Female,,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,,
14006,45 - 54 Years,Some College,White,Female,,Married,,Employed,lzgpxyit,"MSA, Not Principle City",fcxhlnwr,oijqvulv
11285,45 - 54 Years,College Graduate,White,Female,"<= $75,000, Above Poverty",Not Married,Own,Employed,kbazzjca,"MSA, Principle City",wlfvacwt,hfxkjkmi
2900,55 - 64 Years,College Graduate,White,Male,Below Poverty,Not Married,Own,Employed,mlyzmhmf,"MSA, Not Principle City",mcubkhph,ukymxvdu
19083,18 - 34 Years,,White,Female,,,,,bhuqouqj,"MSA, Not Principle City",,


In [8]:
X_train_numerical.describe()

statistic,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children
count,19963.0,19943.0,19974.0,19873.0,20016.0,19994.0,19960.0,19972.0,19932.0,18395.0,...,19433.0,10797.0,19731.0,19738.0,19729.0,19681.0,19643.0,19623.0,19842.0,19842.0
mean,1.619145,1.265156,0.048914,0.724199,0.069894,0.823447,0.357966,0.337923,0.674343,0.221636,...,0.113827,0.881171,3.848107,2.345628,2.364387,4.029927,2.722191,2.12042,0.88605,0.536135
std,0.909307,0.61725,0.215693,0.446928,0.254975,0.381299,0.479414,0.473014,0.468632,0.415359,...,0.317609,0.323603,1.009378,1.287716,1.364586,1.08213,1.38539,1.334153,0.750497,0.929678
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0
50%,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,4.0,2.0,2.0,4.0,2.0,2.0,1.0,0.0
75%,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,0.0,1.0,5.0,4.0,4.0,5.0,4.0,4.0,1.0,1.0
max,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,3.0


In [9]:
from sklearn.impute import SimpleImputer

# numerical
numerical_imputer = SimpleImputer(strategy='mean')
X_train_numerical = pd.DataFrame(numerical_imputer.fit_transform(X_train_numerical),
                                 columns=X_train_numerical.columns,
                                 index=X_train_numerical.index)

**I used X_train_numerical.head() to confirm that the index was preserved and X_train_numerical.describe() to confirm that the missing values were filled.**

In [13]:
# categorical
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train_categorical = pd.DataFrame(categorical_imputer.fit_transform(X_train_categorical),
                                   columns=X_train_categorical.columns,
                                   index=X_train_categorical.index)

In [14]:
X_train_categorical.head()

respondent_id,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,employment_industry,employment_occupation
25194,18 - 34 Years,12 Years,White,Female,"<= $75,000, Above Poverty",Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,fcxhlnwr,xtkaffoo
14006,45 - 54 Years,Some College,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,lzgpxyit,"MSA, Not Principle City",fcxhlnwr,oijqvulv
11285,45 - 54 Years,College Graduate,White,Female,"<= $75,000, Above Poverty",Not Married,Own,Employed,kbazzjca,"MSA, Principle City",wlfvacwt,hfxkjkmi
2900,55 - 64 Years,College Graduate,White,Male,Below Poverty,Not Married,Own,Employed,mlyzmhmf,"MSA, Not Principle City",mcubkhph,ukymxvdu
19083,18 - 34 Years,College Graduate,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,bhuqouqj,"MSA, Not Principle City",fcxhlnwr,xtkaffoo


**I called X_train_categorical.head() to confirm that the index was preserved and that the missing values were filled.**

In [18]:
from sklearn.preprocessing import OneHotEncoder

# instantiated OneHotEncoder
ohe = OneHotEncoder()

# fit ohe on only the categorical training data
ohe.fit(X_train_categorical)

# transformed the training data and formatted as an array
X_train_categorical_ohe = ohe.transform(X_train_categorical).toarray()

# re-formatted the array as a DataFrame (need column titles and index to concatenate)
ohe_df = pd.DataFrame(X_train_categorical_ohe,
                      columns=ohe.get_feature_names_out(X_train_categorical.columns),
                      index=X_train_categorical.index)

ohe_df

respondent_id,age_group_18 - 34 Years,age_group_35 - 44 Years,age_group_45 - 54 Years,age_group_55 - 64 Years,age_group_65+ Years,education_12 Years,education_< 12 Years,education_College Graduate,education_Some College,race_Black,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
25194,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14006,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11285,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2900,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
19083,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5390,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
860,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15795,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
X_train = pd.concat([X_train_numerical, ohe_df], axis=1)

In [20]:
X_train

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
25194,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14006,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2900,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
19083,2.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5390,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
860,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15795,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
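The manual steps above (impute, encode, concatenate) could also be bundled into a single scikit-learn object. The sketch below uses `ColumnTransformer` and `Pipeline`, which this notebook does not use, so treat it as an alternative under stated assumptions. The toy frame, the `handle_unknown="ignore"` setting, and the step names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data with one numerical and one categorical column
toy_X = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, np.nan, 0.0],
    "sex": pd.Series(["Female", np.nan, "Male", "Female"], dtype=object),
})
toy_y = pd.Series([0, 1, 1, 0])

# categorical columns: fill with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# route columns by dtype, mirroring select_dtypes(exclude/include=object)
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), make_column_selector(dtype_exclude=object)),
    ("cat", categorical_steps, make_column_selector(dtype_include=object)),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(criterion="entropy", random_state=42)),
])

# fitting learns the imputation values, categories, and tree from these rows only
baseline.fit(toy_X, toy_y)
toy_pred = baseline.predict(toy_X)
```

With this design, calling `baseline.predict` on test rows reuses the fill values and categories learned during `fit`, so the same preprocessing does not need to be repeated by hand.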


In [21]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
tree.fit(X_train, y_train)

In [24]:
y_train_hat = tree.predict(X_train)
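A natural follow-up is to score predictions like these with `accuracy_score`. The sketch below does so on synthetic data from `make_classification` (not the survey data), and the names `clf`, `train_acc`, and `holdout_acc` are hypothetical. It illustrates why training accuracy alone is a weak signal: a decision tree with no depth limit can memorize its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, used only to illustrate the evaluation step
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=42)

# same settings as the baseline tree: entropy criterion, no depth limit
clf = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X_a, y_a)

# accuracy on the rows the tree was fitted on versus rows it has never seen
train_acc = accuracy_score(y_a, clf.predict(X_a))
holdout_acc = accuracy_score(y_b, clf.predict(X_b))

print(f"train accuracy:   {train_acc:.3f}")
print(f"holdout accuracy: {holdout_acc:.3f}")
```

On this synthetic data the unpruned tree fits its training rows perfectly, so the holdout score is the more informative of the two.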

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
# note: this import rebinds the name `tree`, replacing the fitted classifier stored in `tree` above
from sklearn import tree