# Seasonal Flu Vaccine Predictive Model

* **Student name:** Caroline Surratt
* **Student pace:** Self-Paced
* **Scheduled project review date/time:** Tuesday, October 3rd at 10:00 AM
* **Instructor name:** Morgan Jones

# Importing Data and Exploratory Analysis

In the cell below, I will import the features and the target variable using Pandas.

The features are stored in the file titled "training_features", and the target variable is stored in the file titled "training_labels". Both files are located in the data folder of this repository.

In [10]:
import pandas as pd

X = pd.read_csv('training_features', index_col='respondent_id')
y = pd.read_csv('training_labels', index_col='respondent_id')['seasonal_vaccine']

In [11]:
print(X.shape)

(26707, 35)


Again, as noted in the Data Understanding section, this dataset contains 26,707 entries, with each entry containing information about 35 features. These features will be discussed in more detail below.

## Train-Test Split

Before any exploratory analysis or model creation, I will split the data into a training set and a test set. This must occur before any data cleaning or model fitting so that no information from the test set leaks into the preprocessing steps or the model, and so that performance on the test set is a fair estimate of performance on future unseen data.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
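
As a quick check on the split sizes, the sketch below uses a synthetic stand-in with the same number of rows as the dataset (the real files are not loaded here, and the column names are hypothetical). With the default `test_size` of 0.25, `train_test_split` holds out a quarter of the rows, rounding the test set up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the survey data
# (assumption: only the row count matters for this check, so values are random)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"toy_feature": rng.normal(size=26707)})
y_demo = pd.Series(rng.integers(0, 2, size=26707), name="toy_target")

# Default test_size=0.25, same random_state as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

The training and test indices do not overlap, so anything fit on the training rows never sees the held-out rows.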

## Exploration of Features

In [13]:
X_train

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
25194,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,1.0,1.0,,
14006,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,,Married,,Employed,lzgpxyit,"MSA, Not Principle City",2.0,1.0,fcxhlnwr,oijqvulv
11285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,kbazzjca,"MSA, Principle City",0.0,1.0,wlfvacwt,hfxkjkmi
2900,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Own,Employed,mlyzmhmf,"MSA, Not Principle City",0.0,0.0,mcubkhph,ukymxvdu
19083,2.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,,,,,bhuqouqj,"MSA, Not Principle City",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,"> $75,000",Not Married,Own,Not in Labor Force,qufhixun,"MSA, Principle City",0.0,0.0,,
5390,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,,...,"<= $75,000, Above Poverty",Not Married,Own,Unemployed,mlyzmhmf,"MSA, Principle City",0.0,0.0,,
860,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,Non-MSA,1.0,0.0,atmlpfrs,xqwwgdyp
15795,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,"> $75,000",Married,Own,Employed,kbazzjca,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea


In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20030 entries, 25194 to 23654
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 19963 non-null  float64
 1   h1n1_knowledge               19943 non-null  float64
 2   behavioral_antiviral_meds    19974 non-null  float64
 3   behavioral_avoidance         19873 non-null  float64
 4   behavioral_face_mask         20016 non-null  float64
 5   behavioral_wash_hands        19994 non-null  float64
 6   behavioral_large_gatherings  19960 non-null  float64
 7   behavioral_outside_home      19972 non-null  float64
 8   behavioral_touch_face        19932 non-null  float64
 9   doctor_recc_h1n1             18395 non-null  float64
 10  doctor_recc_seasonal         18395 non-null  float64
 11  chronic_med_condition        19313 non-null  float64
 12  child_under_6_months         19425 non-null  float64
 13  health_worker    

In [15]:
X.isna().sum().sort_values(ascending=False)

employment_occupation          13470
employment_industry            13330
health_insurance               12274
income_poverty                  4423
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
rent_or_own                     2042
employment_status               1463
marital_status                  1408
education                       1407
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
opinion_seas_sick_from_vacc      537
opinion_seas_risk                514
opinion_seas_vacc_effective      462
opinion_h1n1_sick_from_vacc      395
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
household_children               249
household_adults                 249
behavioral_avoidance             208
behavioral_touch_face            128
h1n1_knowledge                   116
h1n1_concern                      92
behavioral_large_gatherings       87
behavioral_outside_home           82
b

In [6]:
print("Number of numeric columns: {}".format(len(X_train.select_dtypes(exclude="object").columns)))
print("Number of categorical columns: {}".format(len(X_train.select_dtypes(include="object").columns)))

Number of numeric columns: 23
Number of categorical columns: 12


23 of these features are numerical and 12 are categorical. As can be seen in the above DataFrame, some of the categorical values are descriptive (e.g. the values in the "rent_or_own" column are "Rent" and "Own"), while others are random strings (e.g. the values in the "employment_industry" column).

Several features have a substantial number of missing values. These values will need to be imputed so that the model can be trained and can also make predictions on data that includes missing values. The specific imputation strategy is discussed in the Preprocessing section below.
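
Raw counts of missing values are easier to compare as shares of the rows. The sketch below shows one way to compute them with pandas on a small toy frame; the values are hypothetical and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the features (hypothetical values)
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "rent_or_own": ["Own", "Rent", np.nan, "Own"],
    "household_adults": [1.0, 2.0, 0.0, 1.0],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, the same calculation puts "employment_occupation" (13,470 of 26,707 rows) at roughly half missing.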

In [16]:
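import matplotlib.pyplot as plt
import seaborn as sns

# Create a 7x5 grid of subplots, one per feature
fig, axes = plt.subplots(nrows=7, ncols=5, figsize=(12, 10))
for i, column in enumerate(X_train.columns):
    # the grid has 5 columns, so the row is i // 5 and the column is i % 5
    sns.histplot(X_train[column], ax=axes[i // 5, i % 5]).set_title(column)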

ModuleNotFoundError: No module named 'seaborn'

In [8]:
y_train.value_counts(normalize=True)

seasonal_vaccine
0    0.531103
1    0.468897
Name: proportion, dtype: float64

The target variable is a binary value that indicates whether an individual did (1) or did not (0) receive their seasonal flu vaccine. In the training set, approximately 47% of individuals _did_ get vaccinated, and the remaining 53% _did not_.
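
Because the two classes are close to balanced, a useful reference point for any model is the accuracy obtained by always predicting the more frequent class (class 0 in the proportions above). The sketch below uses toy labels with roughly the same balance; `DummyClassifier` is scikit-learn's built-in baseline, and the arrays are illustrative rather than taken from the survey data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with roughly the same balance as above: 53 zeros, 47 ones (illustrative)
y_toy = np.array([0] * 53 + [1] * 47)
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by DummyClassifier

# Always predicts the most frequent class seen during fit
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, majority.predict(X_toy))
print(baseline_accuracy)
```

Any fitted model should be judged against this kind of floor rather than against 50%.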

## Preprocessing

### Dealing with Missing Data

I will fill missing numerical values with the mean value and missing categorical values with the most frequently occurring value.

First, I will split the features into numerical and categorical features.

In [None]:
# selects only numerical columns
X_train_numerical = X_train.select_dtypes(exclude=object)

# selects only categorical columns
X_train_categorical = X_train.select_dtypes(include=object)

Now, I will use SimpleImputer to fill the missing numerical values with the mean of the column and missing categorical values with the most frequently occurring value in the column.

In [None]:
X_train_numerical.isna().sum()

In [None]:
from sklearn.impute import SimpleImputer

# instantiates SimpleImputer that will fill missing values with the column mean
numerical_imputer = SimpleImputer(strategy='mean')

# fits/transforms the SimpleImputer object with the numerical training data and formats as DataFrame
X_train_numerical = pd.DataFrame(numerical_imputer.fit_transform(X_train_numerical),
                                columns = X_train_numerical.columns,
                                index = X_train_numerical.index)

# instantiates SimpleImputer that will fill missing values with most frequent column value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# fits/transforms the SimpleImputer object with the categorical training data and formats as a DataFrame
X_train_categorical = pd.DataFrame(categorical_imputer.fit_transform(X_train_categorical),
                                  columns = X_train_categorical.columns,
                                  index = X_train_categorical.index)

### One-Hot Encoding

Now, I will one-hot encode the categorical columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# instantiates OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fits and transforms OneHotEncoder object on the categorical training data
X_train_categorical_ohe = ohe.fit_transform(X_train_categorical)

# re-formats the array as a DataFrame (in order to concatenate with numerical training data)
X_train_categorical_ohe = pd.DataFrame(X_train_categorical_ohe, 
                                       columns=ohe.get_feature_names_out(X_train_categorical.columns),
                                       index=X_train_categorical.index)

X_train_categorical_ohe

### Normalizing Numeric Values

Lastly, I will normalize the numerical features with min-max scaling in order to prevent variables with larger scales from having a disproportionate impact.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_numerical = pd.DataFrame(scaler.fit_transform(X_train_numerical),
                                index=X_train_numerical.index,
                                columns=X_train_numerical.columns)
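
To make the scaling step concrete, here is a minimal sketch on made-up numbers. `MinMaxScaler` maps each value to (x - min) / (max - min), where the minimum and maximum are learned from the training data only; `scaler_demo` and the toy arrays are hypothetical names used for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data (hypothetical values)
train_col = np.array([[0.0], [1.0], [3.0]])
test_col = np.array([[2.0], [4.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train_col)  # learns min=0 and max=3
test_scaled = scaler_demo.transform(test_col)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test data is scaled with the training minimum and maximum, scaled test values can fall outside the 0 to 1 range, as the value 4.0 does here.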

### Concatenating Numerical and Categorical Data

Finally, I will concatenate the numerical and categorical training data into a single DataFrame.

In [None]:
X_train = pd.concat([X_train_numerical, X_train_categorical_ohe], axis=1)

X_train

# Using Label Encoder

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


# selects only numerical columns
X_train_numerical = X_train.select_dtypes(exclude=object)

# selects only categorical columns
X_train_categorical = X_train.select_dtypes(include=object)

from sklearn.impute import SimpleImputer

# instantiates SimpleImputer that will fill missing values with the column mean
numerical_imputer = SimpleImputer(strategy='mean')

# fits/transforms the SimpleImputer object with the numerical training data and formats as DataFrame
X_train_numerical = pd.DataFrame(numerical_imputer.fit_transform(X_train_numerical),
                                columns = X_train_numerical.columns,
                                index = X_train_numerical.index)

# instantiates SimpleImputer that will fill missing values with most frequent column value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# fits/transforms the SimpleImputer object with the categorical training data and formats as a DataFrame
X_train_categorical = pd.DataFrame(categorical_imputer.fit_transform(X_train_categorical),
                                  columns = X_train_categorical.columns,
                                  index = X_train_categorical.index)
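
The cell above stops after imputation and does not yet encode the categorical columns. As a hedged sketch of what an integer encoding could look like, the example below uses scikit-learn's `OrdinalEncoder`, which is the feature-side counterpart of `LabelEncoder` (the latter is intended for target labels). This is an assumption about the intended approach, and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical data (hypothetical values)
toy_train = pd.DataFrame({
    "rent_or_own": ["Own", "Rent", "Own"],
    "marital_status": ["Married", "Not Married", "Married"],
})
toy_test = pd.DataFrame({
    "rent_or_own": ["Rent", "Other"],  # "Other" is unseen during fitting
    "marital_status": ["Married", "Married"],
})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)

print(train_codes)
print(test_codes)
```

Integer codes impose an arbitrary order on the categories. A decision tree can still split on them, but a linear model such as logistic regression would treat that order as meaningful.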

# Fitting the Decision Tree

First, I will fit a decision tree classifier on the training data. For the baseline model, I will not tune any hyperparameters; the only non-default settings are the entropy criterion and a fixed random seed.

In [None]:
from sklearn.tree import DecisionTreeClassifier

baseline_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)

baseline_tree.fit(X_train, y_train)

# Model Evaluation

## Performance on Training Data

Now, I will use the baseline model to predict the target variable, first for the training data and then for the testing data.

In [None]:
y_hat_train = baseline_tree.predict(X_train)

In [None]:
import numpy as np

train_residuals = np.abs(y_train - y_hat_train)

print(pd.Series(train_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(train_residuals, name="Residuals (proportions)").value_counts(normalize=True))

In [None]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_train, y_hat_train)

## Performance on Testing Data

### Preprocessing Testing Data

In [None]:
# selects only numerical columns
X_test_numerical = X_test.select_dtypes(exclude=object)

# selects only categorical columns
X_test_categorical = X_test.select_dtypes(include=object)

# transforms the numerical testing data and formats as DataFrame
X_test_numerical = pd.DataFrame(numerical_imputer.transform(X_test_numerical),
                                columns = X_test_numerical.columns,
                                index = X_test_numerical.index)


# transforms the categorical testing data and formats as DataFrame
X_test_categorical = pd.DataFrame(categorical_imputer.transform(X_test_categorical),
                                  columns = X_test_categorical.columns,
                                  index = X_test_categorical.index)


# One-hot encodes categorical testing data 
X_test_categorical_ohe = ohe.transform(X_test_categorical)

# re-formats the array as a DataFrame (in order to concatenate with numerical testing data)
X_test_categorical_ohe = pd.DataFrame(X_test_categorical_ohe, 
                                       columns=ohe.get_feature_names_out(X_test_categorical.columns),
                                       index=X_test_categorical.index)

X_test_numerical = pd.DataFrame(scaler.transform(X_test_numerical),
                                index=X_test_numerical.index,
                                columns=X_test_numerical.columns)

X_test = pd.concat([X_test_numerical, X_test_categorical_ohe], axis=1)

### Predict Testing Targets

In [None]:
y_hat_test = baseline_tree.predict(X_test)

In [None]:
test_residuals = np.abs(y_test - y_hat_test)

In [None]:
print(pd.Series(test_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(test_residuals, name="Residuals (proportions)").value_counts(normalize=True))


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
confusion_matrix(y_test, y_hat_test)

In [None]:
accuracy_score(y_test, y_hat_test)
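
The residual counts, the confusion matrix, and the accuracy score all describe the same predictions. The sketch below shows how they relate on a handful of made-up labels; the arrays are hypothetical and do not come from the model above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# Rows are true classes and columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)          # (tp + tn) / total
error_rate = np.abs(y_true - y_pred).mean()        # share of nonzero residuals
precision = precision_score(y_true, y_pred)        # tp / (tp + fp)
recall = recall_score(y_true, y_pred)              # tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, error_rate, precision, recall)
```

For binary labels, the share of residuals equal to 1 is exactly one minus the accuracy, while precision and recall separate the two kinds of error that accuracy combines.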

### Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# fixed seed so that the grid search results are reproducible
clf = DecisionTreeClassifier(random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 7, 10, 15, 20],
    'min_samples_split': [5, 10, 20, 30, 50],
    'min_samples_leaf': [5, 10, 20, 30, 50]
}

gs_tree = GridSearchCV(clf, param_grid, cv=3)
gs_tree.fit(X_train, y_train)

gs_tree.best_params_

In [None]:
tuned_tree = DecisionTreeClassifier(criterion='entropy', max_depth=10, min_samples_leaf=50, min_samples_split=5)
tuned_tree.fit(X_train, y_train)

In [None]:
y_hat_test = tuned_tree.predict(X_test)

In [None]:
accuracy_score(y_test, y_hat_test)
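
Instead of copying the best parameters into a new classifier by hand, the fitted search object can be used directly. The sketch below illustrates this on synthetic data with a deliberately small grid; `make_classification` stands in for the survey features, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: stands in for the survey features)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Deliberately small grid so that the sketch runs quickly
small_grid = {"max_depth": [2, 5], "min_samples_leaf": [5, 20]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=3)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.score(X_te, y_te))
```

Using `best_estimator_` keeps the tuned model in sync with the search results if the grid or the data changes later.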

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, solver='liblinear')

In [None]:
logreg.fit(X_train, y_train)

In [None]:
y_hat_test = logreg.predict(X_test)

In [None]:
accuracy_score(y_test, y_hat_test)

# Using LabelEncoder instead of OneHotEncoder


In [None]:
gini_tree = DecisionTreeClassifier(random_state=42)
gini_tree.fit(X_train, y_train)

In [None]:
# baseline Logistic Regression model in StatsModels
import statsmodels.api as sm

X_train_for_statsmodels = sm.add_constant(X_train)

model = sm.Logit(y_train, X_train_for_statsmodels)
result = model.fit()

result.summary()

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, 
                            C=1e12, 
                            solver='liblinear')
logreg.fit(X_train, y_train)

In [None]:
y_test_hat_logreg = logreg.predict(X_test)

In [None]:
confusion_matrix(y_test, y_test_hat_logreg)

In [None]:
accuracy_score(y_test, y_test_hat_logreg)

In [None]:
logreg2 = LogisticRegression(fit_intercept=False,
                            C=1e12,
                            solver='liblinear')

logreg2.fit(X_train, y_train)

In [None]:
y_test_hat_logreg2 = logreg2.predict(X_test)

In [None]:
accuracy_score(y_test, y_test_hat_logreg2)

### Dropping irrelevant columns from decision tree

In [None]:
X = pd.read_csv('training_features', index_col='respondent_id')
y = pd.read_csv('training_labels', index_col='respondent_id')['seasonal_vaccine']

In [None]:
X.columns

In [None]:
columns_to_drop = ['h1n1_concern', 'h1n1_knowledge', 'doctor_recc_h1n1', 
                   'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 
                   'opinion_h1n1_sick_from_vacc']
                   

In [None]:
X = X.drop(columns=columns_to_drop)

## New model after dropped columns

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# selects only numerical columns
X_train_numerical = X_train.select_dtypes(exclude=object)

# selects only categorical columns
X_train_categorical = X_train.select_dtypes(include=object)

# instantiates SimpleImputer that will fill missing values with the column mean
numerical_imputer = SimpleImputer(strategy='mean')

# fits the SimpleImputer object on the numerical training data and formats as DataFrame
X_train_numerical = pd.DataFrame(numerical_imputer.fit_transform(X_train_numerical),
                                columns = X_train_numerical.columns,
                                index = X_train_numerical.index)


# categorical
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train_categorical = pd.DataFrame(categorical_imputer.fit_transform(X_train_categorical),
                                  columns = X_train_categorical.columns,
                                  index = X_train_categorical.index)

In [None]:
# instantiates OneHotEncoder (ignoring categories not seen during fitting, as in the earlier encoder)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fit and transform ohe on the categorical training data
X_train_categorical_ohe = ohe.fit_transform(X_train_categorical)

# re-formatted the array as a DataFrame (need column titles and index to concatenate)
X_train_categorical_ohe = pd.DataFrame(X_train_categorical_ohe, 
                                       columns=ohe.get_feature_names_out(X_train_categorical.columns),
                                       index=X_train_categorical.index)

X_train_categorical_ohe


In [None]:
X_train = pd.concat([X_train_numerical, X_train_categorical_ohe], axis=1)

In [None]:
entropy_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
entropy_tree.fit(X_train, y_train)

gini_tree = DecisionTreeClassifier(random_state=42)
gini_tree.fit(X_train, y_train)

In [None]:
# selects only numerical columns
X_test_numerical = X_test.select_dtypes(exclude=object)

# selects only categorical columns
X_test_categorical = X_test.select_dtypes(include=object)

# fills missing values in X_test_numerical with the column mean of training data
X_test_numerical = pd.DataFrame(numerical_imputer.transform(X_test_numerical),
                               columns = X_test_numerical.columns,
                               index = X_test_numerical.index)

# fills missing values in X_test_categorical with the column mode of training data
X_test_categorical = pd.DataFrame(categorical_imputer.transform(X_test_categorical),
                                 columns = X_test_categorical.columns,
                                 index = X_test_categorical.index)

# one-hot encodes testing data using the ohe object fit on the training data
X_test_categorical_ohe = ohe.transform(X_test_categorical)

# reformats the array as a DataFrame
X_test_categorical_ohe = pd.DataFrame(X_test_categorical_ohe,
                                      columns = ohe.get_feature_names_out(X_test_categorical.columns),
                                      index = X_test_categorical.index)

X_test = pd.concat([X_test_numerical, X_test_categorical_ohe], axis = 1)

y_test_hat_entropy = entropy_tree.predict(X_test)
print("Entropy Test Accuracy: ", accuracy_score(y_test, y_test_hat_entropy))

y_test_hat_gini = gini_tree.predict(X_test)
print("Gini Test Accuracy: ", accuracy_score(y_test, y_test_hat_gini))

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree

In [None]:
# need to run in terminal: conda install -c conda-forge statsmodels