# Seasonal Flu Vaccine Predictive Model

* **Student name:** Caroline Surratt
* **Student pace:** Self-Paced
* **Scheduled project review date/time:** Thursday, September 21st at 1:00 PM
* **Instructor name:** Morgan Jones

# Business Understanding

This model is designed to predict the likelihood that an individual receives the seasonal flu vaccine. Healthcare providers could use it in both proactive and reactive ways, as described below.

**1. Proactive: Provide targeted interventions to individuals/populations that are unlikely to receive the seasonal flu vaccine. For example:** 
* displaying informative/educational materials
* talking to individuals about flu risks during routine/preventative appointments
* hosting free/reduced cost vaccine drives
* offering small incentives to individuals who get vaccinated (e.g., a coupon to a local business)

**2. Reactive: Prepare for greater numbers of individuals to require treatment for the flu (especially during peak flu season). For example:** 
* securing a sufficient supply of necessary medical supplies (e.g., flu tests, antiviral drugs, PPE)
* ensuring there is an adequate number of hospital beds for severe flu cases

# Data Understanding

The dataset used for this analysis contains the responses of 26,707 individuals to the National 2009 H1N1 Flu Survey. The features cover individuals' behaviors, opinions, and demographics. They are outlined in greater detail below (please note that these feature descriptions are direct quotes from DrivenData's [Dataset Description](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/)).

**Behavioral Features:**
* Has taken antiviral medications
* Has avoided close contact with others with flu-like symptoms
* Has bought a face mask
* Has frequently washed hands or used hand sanitizer
* Has reduced time at large gatherings
* Has reduced contact with people outside of own household
* Has avoided touching eyes, nose, or mouth

**Demographic Features:**
* Age group
* Race
* Sex
* Household annual income (with respect to 2008 Census poverty thresholds)
* Marital status
* Housing situation
* Employment status
* Employment industry
* Employment occupation
* Geographic region (10-region classification defined by the US Department of Health and Human Services)
* Residence within metropolitan statistical areas (MSAs) as defined by US Census
* Number of other adults in household
* Number of children in household
* Has regular close contact with a child under the age of six months
* Is a healthcare worker
* Has health insurance


**Opinion Features:**
* Respondent's opinion about seasonal flu vaccine effectiveness
* Respondent's opinion about risk of getting sick with seasonal flu without vaccine
* Respondent's worry of getting sick from taking seasonal flu vaccine

**Other Features:**
* Seasonal flu vaccine was recommended by a doctor
* Has one of the specified chronic medical conditions

_Note that there are additional features included in the dataset specific to the H1N1 vaccine that are not outlined here._

# Importing and Splitting Data

In the cell below, I will import the features and the target variable using Pandas.

The features are stored in the file titled "training_features", and the target variable is stored in the file titled "training_labels". Both files are located in the data folder of this repository.

In [1]:
import pandas as pd

X = pd.read_csv('training_features', index_col='respondent_id')
y = pd.read_csv('training_labels', index_col='respondent_id')['seasonal_vaccine']

Before any exploratory analysis or model creation, I will split the data into a training set and a test set. The split must occur before any data cleaning or model fitting so that no information from the test set leaks into preprocessing and the test set remains a fair stand-in for future unseen data.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [12]:
y_train.value_counts(normalize=True)

seasonal_vaccine
0    0.531103
1    0.468897
Name: proportion, dtype: float64
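
The proportions above imply that always predicting class 0 would be correct about 53% of the time on the training labels, which gives a floor against which any model's accuracy can be judged. The sketch below illustrates this with scikit-learn's `DummyClassifier`. The labels are synthetic and only roughly mimic the class balance shown above; `X_demo`, `y_demo`, and the 0.469 share are illustrative assumptions, not the survey data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class balance reported above (assumed values)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.469).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)

baseline_accuracy = dummy.score(X_demo, y_demo)
majority_share = max(np.mean(y_demo == 0), np.mean(y_demo == 1))
print("Majority-class baseline accuracy: ", baseline_accuracy)
```

The baseline accuracy equals the share of the majority class, so a model is only useful if it clearly exceeds that share.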

# Exploratory Analysis

# Data Preprocessing

### Baseline Decision Tree

I will fill missing numerical values with the mean value and missing categorical values with the most frequently occurring value.

In [3]:
# selects only numerical columns
X_train_numerical = X_train.select_dtypes(exclude=object)

# selects only categorical columns
X_train_categorical = X_train.select_dtypes(include=object)

In [4]:
from sklearn.impute import SimpleImputer

# instantiates SimpleImputer that will fill missing values with the column mean
numerical_imputer = SimpleImputer(strategy='mean')

# fits the SimpleImputer object on the numerical training data and formats as DataFrame
X_train_numerical = pd.DataFrame(numerical_imputer.fit_transform(X_train_numerical),
                                columns = X_train_numerical.columns,
                                index = X_train_numerical.index)

In [5]:
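# instantiates SimpleImputer that will fill missing values with the column mode,
# then fits it on the categorical training data and formats as DataFrame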
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train_categorical = pd.DataFrame(categorical_imputer.fit_transform(X_train_categorical),
                                  columns = X_train_categorical.columns,
                                  index = X_train_categorical.index)

In [6]:
from sklearn.preprocessing import OneHotEncoder

# instantiates a OneHotEncoder that returns a dense array
ohe = OneHotEncoder(sparse_output=False)

# fit and transform ohe on the categorical training data
X_train_categorical_ohe = ohe.fit_transform(X_train_categorical)

# reformats the array as a DataFrame (column names and index are needed to concatenate)
X_train_categorical_ohe = pd.DataFrame(X_train_categorical_ohe, 
                                       columns=ohe.get_feature_names_out(X_train_categorical.columns),
                                       index=X_train_categorical.index)

X_train_categorical_ohe

respondent_id,age_group_18 - 34 Years,age_group_35 - 44 Years,age_group_45 - 54 Years,age_group_55 - 64 Years,age_group_65+ Years,education_12 Years,education_< 12 Years,education_College Graduate,education_Some College,race_Black,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
25194,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14006,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11285,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2900,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
19083,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5390,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
860,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15795,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
X_train = pd.concat([X_train_numerical, X_train_categorical_ohe], axis=1)

In [8]:
X_train

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
25194,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14006,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2900,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
19083,2.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5390,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
860,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15795,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
from sklearn.tree import DecisionTreeClassifier

entropy_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
entropy_tree.fit(X_train, y_train)

In [11]:
gini_tree = DecisionTreeClassifier(random_state=42)
gini_tree.fit(X_train, y_train)

In [17]:
from sklearn.metrics import accuracy_score

# entropy tree
y_train_hat = entropy_tree.predict(X_train)
print('Entropy Train Accuracy: ', accuracy_score(y_train, y_train_hat))

# gini tree
y_train_hat = gini_tree.predict(X_train)
print('Gini Train Accuracy: ', accuracy_score(y_train, y_train_hat))

Entropy Train Accuracy:  1.0
Gini Train Accuracy:  1.0


### Preprocessing Testing Data

In [18]:
# selects only numerical columns
X_test_numerical = X_test.select_dtypes(exclude=object)

# selects only categorical columns
X_test_categorical = X_test.select_dtypes(include=object)

In [19]:
# fills missing values in X_test_numerical with the column mean of training data
X_test_numerical = pd.DataFrame(numerical_imputer.transform(X_test_numerical),
                               columns = X_test_numerical.columns,
                               index = X_test_numerical.index)

# fills missing values in X_test_categorical with the column mode of training data
X_test_categorical = pd.DataFrame(categorical_imputer.transform(X_test_categorical),
                                 columns = X_test_categorical.columns,
                                 index = X_test_categorical.index)

In [20]:
# one-hot encodes testing data using the ohe object fit on the training data
X_test_categorical_ohe = ohe.transform(X_test_categorical)

# reformats the array as a DataFrame
X_test_categorical_ohe = pd.DataFrame(X_test_categorical_ohe,
                                      columns = ohe.get_feature_names_out(X_test_categorical.columns),
                                      index = X_test_categorical.index)

In [21]:
X_test = pd.concat([X_test_numerical, X_test_categorical_ohe], axis = 1)

In [22]:
X_test

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
15772,2.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9407,3.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
16515,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23353,2.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10008,1.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25990,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221636,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14302,2.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3817,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
13912,1.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
y_test_hat_entropy = entropy_tree.predict(X_test)
print("Entropy Test Accuracy: ", accuracy_score(y_test, y_test_hat_entropy))

y_test_hat_gini = gini_tree.predict(X_test)
print("Gini Test Accuracy: ", accuracy_score(y_test, y_test_hat_gini))

Entropy Test Accuracy:  0.6946233338325596
Gini Test Accuracy:  0.6938744945334732
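
Both trees reach a training accuracy of 1.0 but a test accuracy of roughly 0.69, a gap that usually indicates overfitting: an unconstrained tree keeps splitting until it has memorized the training rows. The sketch below illustrates the effect on synthetic data from `make_classification` and compares an unconstrained tree with one limited by `max_depth`. The dataset, the depth of 5, and all variable names are illustrative assumptions rather than tuned choices for the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (not the survey data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, flip_y=0.2,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# unconstrained tree: grows until every training row is classified correctly
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# depth-limited tree: cannot memorize the label noise
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

full_train_accuracy = full_tree.score(X_tr, y_tr)
shallow_train_accuracy = shallow_tree.score(X_tr, y_tr)

for name, model in [("Full", full_tree), ("Depth-limited", shallow_tree)]:
    print(name, "Train Accuracy: ", model.score(X_tr, y_tr))
    print(name, "Test Accuracy: ", model.score(X_te, y_te))
```

Comparing the train and test scores of the two trees gives a quick read on how much of the unconstrained tree's training accuracy comes from memorization.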


In [24]:
from sklearn.metrics import confusion_matrix
cnf_entropy = confusion_matrix(y_test, y_test_hat_entropy)
cnf_gini = confusion_matrix(y_test, y_test_hat_gini)

In [25]:
cnf_entropy

array([[2591, 1043],
       [ 996, 2047]])

In [26]:
cnf_gini

array([[2578, 1056],
       [ 988, 2055]])
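
Accuracy alone hides how the errors are distributed. As a worked example, the sketch below derives accuracy, precision, and recall from the entropy tree's confusion matrix shown above, relying on scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above for the entropy tree
# rows: true class (0, 1); columns: predicted class (0, 1)
cnf = np.array([[2591, 1043],
                [ 996, 2047]])

# for a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cnf.ravel()

accuracy = (tp + tn) / cnf.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted vaccinations that are correct
recall = tp / (tp + fn)            # share of actual vaccinations that are found

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
```

The accuracy computed this way matches the entropy test accuracy printed earlier, which confirms how the matrix is laid out.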

In [27]:
# baseline Logistic Regression model
import statsmodels.api as sm

ModuleNotFoundError: No module named 'statsmodels'

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree

In [None]:
# need to run in terminal: conda install -c conda-forge statsmodels
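
Until statsmodels is installed, a baseline logistic regression could be sketched with scikit-learn, which is already used in this notebook. This swaps in `sklearn.linear_model.LogisticRegression` for the statsmodels model intended above, and it runs on synthetic data from `make_classification`, so the printed accuracy says nothing about the survey data. All variable names and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the preprocessed features
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# max_iter is raised from the default to give the solver room to converge
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_tr, y_tr)

test_accuracy = logreg.score(X_te, y_te)
print("Logistic Regression Test Accuracy: ", test_accuracy)
```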