<a href="https://colab.research.google.com/github/anujoshi3390/H1N1-Vaccines/blob/main/N1H1_Vaccines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## H1N1 Vaccines

# 1. Project Description & Objective


DESCRIPTION: Beginning in the spring of 2009, the H1N1 influenza virus, commonly referred to as "swine flu," swept across the globe. It is estimated that in the first year, swine flu was responsible for 363,000 deaths worldwide.

OBJECTIVE: The objective of this project is to identify individuals who will not receive a swine flu vaccine. Predicting individuals who will not receive the vaccine will help to inform the CDC’s marketing strategies. To do this, I have analyzed over 26,000 data points collected via phone call through the National 2009 H1N1 Flu Survey. This survey was administered by the National Center for Health Statistics in collaboration with the CDC.

DEFINING KEY TERMS: Throughout this notebook, when the term “unvaccinated” is used, it refers specifically to the annual swine flu vaccine still recommended today by the CDC.

METHODOLOGY: Logistic Regression

TARGET: h1n1_vaccine (1 = no vaccine, 0 = vaccine)

PERFORMANCE METRICS: Performance will be evaluated according to the area under the receiver operating characteristic curve (ROC AUC) along with the accuracy, recall, precision, and F1 scores. For each metric, a higher value indicates stronger performance.
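As a rough illustration of how these scores relate to one another, the sketch below computes accuracy, recall, precision, and F1 from raw counts, treating 1 (no vaccine) as the positive class. The helper `classification_metrics` and the toy labels are hypothetical and only stand in for the scikit-learn functions used later in the notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Toy labels (made up for illustration): 1 = no vaccine, 0 = vaccine.
toy_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
toy_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Recall answers "of the people who truly went unvaccinated, how many did we flag?", while precision answers "of the people we flagged, how many truly went unvaccinated?".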

# 2. Imports and Data

In [None]:
import pandas as pd
import numpy as np
from numpy import ravel
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, PolynomialFeatures,\
Normalizer, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder

from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import ColumnTransformer,  make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import plot_confusion_matrix, recall_score,\
    accuracy_score, precision_score, f1_score, plot_roc_curve, roc_auc_score

from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline
from sklearn.metrics import roc_curve, auc
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import VarianceThreshold

In [None]:
trainfeatures = pd.read_csv('training_set_features.csv')
trainlabels = pd.read_csv('training_set_labels.csv')

In [None]:
trainfeatures.shape, trainlabels.shape

Merging features and labels dataframes to create one comprehensive dataset.

In [None]:
df = trainfeatures.merge(trainlabels,on='respondent_id',how='outer')
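An outer merge keeps rows whose `respondent_id` appears in only one of the two files, which would leave missing features or labels. The sketch below, on made-up toy frames, shows one way to check that every row matched, using the `validate` and `indicator` options of `DataFrame.merge`. The frame names and values are illustrative assumptions, not the survey data.

```python
import pandas as pd

# Toy stand-ins for the features and labels files (values are made up).
toy_features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1, 3, 2]})
toy_labels = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_vaccine": [0, 1, 0]})

merged = toy_features.merge(
    toy_labels,
    on="respondent_id",
    how="outer",
    validate="one_to_one",  # raise if an ID is duplicated on either side
    indicator=True,         # add a _merge column: both / left_only / right_only
)

# Rows that exist in only one of the two frames would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty, the outer merge behaves the same as an inner merge and no label or feature row was lost.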

In [None]:
df.head()

In [None]:
df.h1n1_vaccine.value_counts()

# 3. Exploratory Data Analysis

In [None]:
df.info()

In our question, we want to identify who will not receive an H1N1 vaccine. Therefore, we will switch our target variable so that 0 represents receiving the vaccine and 1 represents not receiving the vaccine.

In [None]:
df.h1n1_vaccine = df.h1n1_vaccine.replace(0, "no vaccine")
df.h1n1_vaccine = df.h1n1_vaccine.replace(1, "vaccine")


In [None]:
df.h1n1_vaccine.value_counts()

In [None]:
df.h1n1_vaccine = df.h1n1_vaccine.replace("no vaccine", 1)
df.h1n1_vaccine = df.h1n1_vaccine.replace("vaccine", 0)

In [None]:
df.h1n1_vaccine.value_counts()

In [None]:
df['h1n1_vaccine'].value_counts(normalize=True)

There may be a class imbalance issue given that 78.75% of respondents did not receive the H1N1 vaccine.

In [None]:
df['h1n1_vaccine'].value_counts(normalize=True).plot(kind='bar');

Let's explore the relationship between several categorical variables and the target variable.

In [None]:
sns.barplot(x=df['sex'], y=df['h1n1_vaccine']);

In [None]:
sns.barplot(x=df['education'], y=df['h1n1_vaccine'] * 100);

In [None]:
sns.barplot(x=df['income_poverty'], y=df['h1n1_vaccine']);


In [None]:
sns.barplot(x=df['employment_status'], y=df['h1n1_vaccine']);


In [None]:
sns.barplot(x=df['rent_or_own'], y=df['h1n1_vaccine']);


In [None]:
sns.barplot(x=df['age_group'], y=df['h1n1_vaccine']);


In [None]:
sns.barplot(x=df['doctor_recc_h1n1'], y=df['h1n1_vaccine']);


# 4. Pipeline Creation - H1N1 Vaccine and All Variables

Creating categorical, binary, and simple imputer only column lists to use in pipelines.

In [None]:
df.sex.value_counts()

In [None]:
df.marital_status.value_counts()

In [None]:
df.rent_or_own.value_counts()

In [None]:
categorical_columns = ['age_group','education','race','income_poverty','employment_status', 'hhs_geo_region','census_msa',
                       'household_adults','household_children','employment_industry','employment_occupation']

#binary columns include columns that are binary but need to be one-hot-encoded because they are objects
binary_columns = ['sex','marital_status','rent_or_own']

#simple impute only includes: (1) binary variables that contain 0 and 1 already and (2) survey data that is on a scale from 0-x.
simple_impute_only = ['opinion_h1n1_vacc_effective','opinion_h1n1_risk','opinion_h1n1_sick_from_vacc',
                       'opinion_seas_vacc_effective','opinion_seas_risk','opinion_seas_sick_from_vacc','h1n1_knowledge',
                       'h1n1_concern', 'behavioral_antiviral_meds','behavioral_avoidance', 'behavioral_face_mask',
                      'behavioral_wash_hands','behavioral_large_gatherings','behavioral_outside_home',
                      'behavioral_touch_face','doctor_recc_h1n1','doctor_recc_seasonal','chronic_med_condition',
                      'child_under_6_months','health_worker','health_insurance']

Our X excludes the respondent ID and both target variables (H1N1 and seasonal vaccine); our y is the H1N1 vaccine target. We are not paying attention to seasonal vaccines in this project.

In [None]:
X = df.iloc[:,1:-2]
y = df['h1n1_vaccine']

Split the data so that we have our training set and testing set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

In [None]:
X_train.shape, X_test.shape
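Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the training and testing sets. The notebook's split above does not stratify; the sketch below is only an illustration on synthetic labels with a made-up 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 80% class 1 (no vaccine), 20% class 0.
toy_y = np.array([1] * 800 + [0] * 200)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    stratify=toy_y,   # preserve the class ratio in both splits
    random_state=42,
)

print(y_tr.mean(), y_te.mean())
```

With the default 25% test size, both splits keep the same share of class 1 as the full synthetic sample.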

Creating subpipes based on the needs of the respective columns.

In [None]:
subpipe_categorical = Pipeline(steps=[
    ('cat_impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

subpipe_binary = Pipeline([
    ('bin_impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

subpipe_simple_impute_only = Pipeline([
    ('sio_impute', SimpleImputer(strategy='most_frequent'))
])


In [None]:
CT = ColumnTransformer(transformers=[
    ('categorical', subpipe_categorical, categorical_columns),
    ('binary', subpipe_binary, binary_columns),
    ('simple_impute_only', subpipe_simple_impute_only, simple_impute_only)
], remainder='passthrough')
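To make the effect of the column transformer concrete, here is a minimal sketch on a four-row toy frame. The column values are made up, and the sub-pipelines are simplified versions of the ones above. `sparse_threshold=0` is used only to force a dense array so the example does not depend on how the encoder's sparse option is named in a given scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with missing values (made-up category labels).
toy = pd.DataFrame({
    "age_group": ["young", np.nan, "old", "young"],
    "sex": ["Female", "Male", np.nan, "Female"],
    "h1n1_concern": [1.0, np.nan, 3.0, 1.0],
})

# Impute the most frequent category, then one-hot encode.
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric survey columns only need imputation.
num_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent"))])

toy_ct = ColumnTransformer(
    transformers=[
        ("categorical", cat_pipe, ["age_group", "sex"]),
        ("simple_impute_only", num_pipe, ["h1n1_concern"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_encoded = toy_ct.fit_transform(toy)
print(toy_encoded.shape)
```

Each categorical column expands into one column per category, the numeric column passes through after imputation, and no missing values remain.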

## 4A. Dummy Model

Creating a first dummy (baseline) model against which to compare the models that follow.

In [None]:
dummy_pipeline = Pipeline(steps=[
    ('CT',CT),
    ('dummy',DummyClassifier(strategy='most_frequent'))
])

In [None]:
dummy_pipeline.fit(X_train, y_train)

In [None]:
print('Training set score: ' + str(dummy_pipeline.score(X_train,y_train)))
print('Test set score: ' + str(dummy_pipeline.score(X_test,y_test)))

In [None]:
plot_roc_curve(dummy_pipeline, X_test, y_test, name='dummy model');

Our model must perform better than this!
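A most-frequent baseline gives every respondent the same score, so its accuracy equals the majority-class share and its ROC curve is the diagonal. The sketch below illustrates this with a hypothetical pairwise `roc_auc` helper on made-up labels; it is a plain-Python illustration, not scikit-learn's implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Pairwise version for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels: 80% class 1 (no vaccine), 20% class 0.
toy_labels = [1] * 8 + [0] * 2

# A most-frequent baseline predicts class 1 for everyone, with a constant score.
constant_scores = [1.0] * len(toy_labels)

baseline_accuracy = sum(toy_labels) / len(toy_labels)
baseline_auc = roc_auc(toy_labels, constant_scores)
print(baseline_accuracy, baseline_auc)
```

This is why accuracy alone is a weak yardstick here: a model has to beat the majority share on accuracy and 0.5 on ROC AUC before it is doing anything useful.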

## 4B. First Simple Model

In [None]:
fsm_pipeline = Pipeline([
    ('CT',CT),
    ('first_simple_model', LogisticRegression())
])

In [None]:
fsm_pipeline.fit(X_train, y_train)

In [None]:
print('Training set score: ' + str(fsm_pipeline.score(X_train,y_train)))
print('Test set score: ' + str(fsm_pipeline.score(X_test,y_test)))

In [None]:
plot_roc_curve(fsm_pipeline, X_test, y_test, name='logistic regression');

This is an improvement from the dummy model. Let's tune it with gridsearching.

## 4C. Tuning First Simple Model with GridSearching


Creating parameters for the grid search. The gridsearch takes 25-40 minutes.

In [None]:
params = {
    'first_simple_model__solver' : ['liblinear','lbfgs'],
    'first_simple_model__max_iter': [10, 100,1000,10_000],
    'first_simple_model__C' : [0.0001, 0.001, 0.01, 0.1, 1],
    'first_simple_model__tol' : [0.0001, 0.001, 0.01, 0.1]
}

In [None]:
gridsearch = GridSearchCV(
    estimator = fsm_pipeline,
    param_grid=params,
    cv=5,
    verbose=1
)
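To see why the search is slow, it helps to count the work involved. The sketch below enumerates the same grid with `itertools.product`; the variable names `grid`, `n_candidates`, and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same values as the grid above, copied here so the sketch is self-contained.
grid = {
    "first_simple_model__solver": ["liblinear", "lbfgs"],
    "first_simple_model__max_iter": [10, 100, 1000, 10_000],
    "first_simple_model__C": [0.0001, 0.001, 0.01, 0.1, 1],
    "first_simple_model__tol": [0.0001, 0.001, 0.01, 0.1],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

cv_folds = 5
n_cv_fits = n_candidates * cv_folds  # each candidate is fit once per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} fits")
```

The `step__parameter` naming (double underscore) is how a grid search addresses a parameter of a named pipeline step, here the `first_simple_model` step. With `refit=True`, the default, the best candidate is fit once more on the full training set.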

In [None]:
gridsearch.fit(X_train,y_train)

In [None]:
best_params = gridsearch.best_params_
best_params

In [None]:
best_pipe = gridsearch.best_estimator_
best_pipe

In [None]:
result_df = pd.DataFrame.from_dict(gridsearch.cv_results_, orient='columns')
result_df

In [None]:
print('Training set score: ' + str(best_pipe.score(X_train,y_train)))
print('Test set score: ' + str(best_pipe.score(X_test,y_test)))

In [None]:
plot_roc_curve(best_pipe, X_train, y_train, name='GridSearch Logistic Regression');


In [None]:
# Score with classification metrics; r2 and MSE are regression metrics
cv = cross_validate(best_pipe, X_train, y_train, scoring=['accuracy','recall','precision','f1','roc_auc'], return_train_score=True)
cv

In [None]:
y_preds4c = best_pipe.predict(X_test)

In [None]:
plot_confusion_matrix(best_pipe, X_test, y_test);

In [None]:
print(f"""
Our 4C model's accuracy on the test set is {accuracy_score(y_test, y_preds4c):.2%}.
Our 4C model's recall on the test set is {recall_score(y_test, y_preds4c):.2%}.
Our 4C model's precision on the test set is {precision_score(y_test, y_preds4c):.2%}.
Our 4C model's F1 score on the test set is {f1_score(y_test, y_preds4c):.2%}.
""")

Tuning our model based on the gridsearch improved our model!

## 4D. Logistic Regression with SMOTE

Earlier we mentioned a possible class imbalance (section 3). Let's see if adding SMOTE to the model improves the performance.

In [None]:
smote_pipeline = ImPipeline(steps=[
    ('ct', CT),
    ('sm', SMOTE(random_state=42)),
    ('best_pipe', LogisticRegression(C=1,max_iter=100, solver='lbfgs',tol=.0001))
])
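SMOTE balances the classes by creating synthetic minority-class rows, each placed somewhere on the line between a real minority row and one of its nearest minority neighbours. The sketch below is a simplified plain-Python illustration of that interpolation idea. The function `smote_like` and the toy points are hypothetical, and this is not the imbalanced-learn implementation.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with 4 points, to be topped up to a majority size of 10.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
majority_size = 10
new_points = smote_like(minority, majority_size - len(minority))
print(len(minority) + len(new_points))
```

Placing SMOTE inside the imbalanced-learn pipeline, as above, means the resampling is applied only when fitting, so the test set keeps its original class balance.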

In [None]:
smote_pipeline.fit(X_train, y_train)

In [None]:
print('Training set score: ' + str(smote_pipeline.score(X_train,y_train)))
print('Test set score: ' + str(smote_pipeline.score(X_test,y_test)))


In [None]:
plot_roc_curve(smote_pipeline, X_test, y_test, name='Smote Model');


In [None]:
# Score with classification metrics; r2 and MSE are regression metrics
cv = cross_validate(smote_pipeline, X_train, y_train, scoring=['accuracy','recall','precision','f1','roc_auc'], return_train_score=True)
cv

In [None]:
y_preds4d = smote_pipeline.predict(X_test)

In [None]:
print(f"""
Our 4D model's accuracy on the test set is {accuracy_score(y_test, y_preds4d):.2%}.
Our 4D model's recall on the test set is {recall_score(y_test, y_preds4d):.2%}.
Our 4D model's precision on the test set is {precision_score(y_test, y_preds4d):.2%}.
Our 4D model's F1 score on the test set is {f1_score(y_test, y_preds4d):.2%}.
""")


Adding SMOTE to the model decreased its performance, so if we move forward with all the columns in our model, we will use the grid-searched logistic regression without SMOTE.

# 5. Pipeline Creation - H1N1 Vaccine and Select Columns

Before choosing our model, let's explore what happens to our model if we decrease the number of columns/variables.



In [None]:
X_select = df[['age_group', 'education', 'race', 'income_poverty', 'employment_status', 'marital_status', 'sex','hhs_geo_region']]
y_select = df['h1n1_vaccine']

In [None]:
categorical_columns = ['age_group', 'education', 'race', 'income_poverty', 'employment_status', 'hhs_geo_region']
binary_columns = ['sex', 'marital_status']


In [None]:
X_train_select, X_test_select, y_train_select, y_test_select = train_test_split(X_select, y_select, random_state=42)

In [None]:
subpipe_categorical = Pipeline(steps=[
    ('cat_impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

subpipe_binary = Pipeline([
    ('bin_impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

In [None]:
CT = ColumnTransformer(transformers=[
    ('categorical', subpipe_categorical, categorical_columns),
    ('binary', subpipe_binary, binary_columns)
], remainder='passthrough')

## 5A. Dummy Model

Creating our dummy model for our new pipelines that include only select columns.



In [None]:
dummy_pipeline = Pipeline(steps=[
    ('CT', CT),
    ('dummy', DummyClassifier(strategy='most_frequent'))
])

In [None]:
dummy_pipeline.fit(X_train_select,y_train_select)

In [None]:
print('Training set score: ' + str(dummy_pipeline.score(X_train_select,y_train_select)))
print('Test set score: ' + str(dummy_pipeline.score(X_test_select,y_test_select)))


In [None]:
plot_roc_curve(dummy_pipeline, X_test_select, y_test_select, name='dummy model');


## 5B. First Simple Model for Select Columns

In [None]:
fsm_pipeline = Pipeline([
    ('CT', CT),
    ('first_simple_model', LogisticRegression())
])

In [None]:
fsm_pipeline.fit(X_train_select,y_train_select)

In [None]:
print('Training set score: ' + str(fsm_pipeline.score(X_train_select,y_train_select)))
print('Test set score: ' + str(fsm_pipeline.score(X_test_select,y_test_select)))

In [None]:
plot_roc_curve(fsm_pipeline, X_test_select, y_test_select, name='first simple model');


This does not look good. Let's gridsearch to see if it improves our model at all.

## 5C. Tuning First Simple Model with GridSearching

In [None]:
params = {
    'first_simple_model__solver' : ['liblinear','lbfgs'],
    'first_simple_model__max_iter': [10, 100,1000,10_000],
    'first_simple_model__C' : [0.0001, 0.001, 0.01, 0.1, 1],
    'first_simple_model__tol' : [0.0001, 0.001, 0.01, 0.1]
}

In [None]:
gridsearch_select = GridSearchCV(
    estimator = fsm_pipeline,
    param_grid = params,
    cv=5,
    verbose=1)

In [None]:
gridsearch_select.fit(X_train_select,y_train_select)


In [None]:
best_params_select = gridsearch_select.best_params_
best_params_select

In [None]:
best_pipe_select = gridsearch_select.best_estimator_
best_pipe_select

In [None]:
result_df = pd.DataFrame.from_dict(gridsearch_select.cv_results_,orient='columns')
result_df

In [None]:
print('Training set score: ' + str(best_pipe_select.score(X_train_select,y_train_select)))
print('Test set score: ' + str(best_pipe_select.score(X_test_select,y_test_select)))

In [None]:
plot_roc_curve(best_pipe_select, X_train_select, y_train_select, name = 'GridSearch Logistic Regression with Select Variables');


Overall, this model does not perform well. Let's go back to using all columns in our model.

# 6. FINAL STAGE - Exploring Prediction

Refer to section 4C, where we selected the best model using all the variables.

In [None]:
final_model = best_pipe
final_model

## 6A. Final Model Evaluation

Let's evaluate our model once more just to be extra careful.

In [None]:
plot_roc_curve(best_pipe, X_train, y_train, name='Final Model');

In [None]:
print('Training set score: ' + str(final_model.score(X_train,y_train)))
print('Test set score: ' + str(final_model.score(X_test,y_test)))

In [None]:
# Score with classification metrics; r2 and MSE are regression metrics
cv = cross_validate(final_model, X_train, y_train, scoring=['accuracy','recall','precision','f1','roc_auc'], return_train_score=True)
cv

In [None]:
plot_confusion_matrix(final_model, X_test, y_test);

In [None]:
final_preds = final_model.predict(X_test)

## 6B. Final Model Interpretation

In [None]:
# ROC AUC is computed from predicted probabilities rather than hard 0/1 predictions
final_probs = final_model.predict_proba(X_test)[:, 1]

print(f"""
The final model's accuracy on the test set is {accuracy_score(y_test, final_preds):.2%}.
The final model's recall on the test set is {recall_score(y_test, final_preds):.2%}.
The final model's precision on the test set is {precision_score(y_test, final_preds):.2%}.
The final model's F1 score on the test set is {f1_score(y_test, final_preds):.2%}.
The final model's ROC AUC score on the test set is {roc_auc_score(y_test, final_probs):.2%}.
""")

INTERPRETATION: While the final model’s overall accuracy score was 84%, the recall rate was 95%. This tells us that, of the individuals who will not receive the vaccine, the model correctly identifies 95%. This is good news for us because, for this business question, we are mostly concerned with identifying unvaccinated individuals. It won’t hurt to market to an individual who will indeed get the vaccine. We would rather err on the side of identifying a larger volume of unvaccinated individuals. Let’s take a sneak peek at what this machine learning model tells us.
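The preference for recall over precision can also be expressed through the decision threshold: flagging anyone whose predicted probability of being unvaccinated exceeds a lower cutoff catches more truly unvaccinated people, at the cost of more false alarms. The sketch below uses made-up probabilities and a hypothetical `recall_at` helper. The notebook itself relies on the default 0.5 cutoff of `predict`.

```python
def recall_at(threshold, y_true, prob_unvaccinated):
    """Recall for class 1 when flagging everyone with probability >= threshold."""
    flagged = [p >= threshold for p in prob_unvaccinated]
    true_positives = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
    positives = sum(1 for t in y_true if t == 1)
    return true_positives / positives


# Made-up labels (1 = no vaccine) and predicted probabilities of class 1.
toy_true = [1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
toy_prob = [0.95, 0.9, 0.8, 0.7, 0.45, 0.6, 0.3, 0.35, 0.1, 0.55]

# Lowering the cutoff can only keep or increase recall.
recalls = {t: recall_at(t, toy_true, toy_prob) for t in (0.7, 0.5, 0.3)}
for threshold, value in recalls.items():
    print(f"threshold {threshold}: recall {value:.2f}")
```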



## 6C. Creating Predictions DataFrame

Let's create a new dataframe that includes our predictions to evaluate what the predictions tell us about people less likely to receive the vaccine.

In [None]:
X_preds = X_test.copy()

In [None]:
X_preds['Predictions'] = final_preds
X_preds['True Values'] = y_test

In [None]:
X_preds

## 6D. Exploring Data Based on Target Predictions

Creating a function to explore the relationship between our dependent variable of h1n1 vaccine predictions and various independent variables in our dataset.

In [None]:
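def explore(x):
    """Plot and print the share of respondents predicted unvaccinated for each level of column x."""
    fig, ax = plt.subplots(figsize=(24,12))
    sns.barplot(x=x, y='Predictions', palette='colorblind', data=X_preds, ax=ax);

    # Select 'Predictions' before aggregating so non-numeric columns do not break .mean()
    x_groupby = (X_preds.groupby(x)[['Predictions']].mean()
                 .sort_values(by='Predictions', ascending=False))
    print(x_groupby)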
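Because the predictions are coded 0/1, the group mean that `explore` prints is the share of each group predicted to go unvaccinated. A minimal sketch with made-up rows (the frame `toy_preds` and its values are illustrative only):

```python
import pandas as pd

# Made-up predictions: 1 = predicted no vaccine, 0 = predicted vaccine.
toy_preds = pd.DataFrame({
    "age_group": ["young", "young", "old", "old", "old", "middle"],
    "Predictions": [1, 1, 0, 1, 0, 1],
})

# Mean of a 0/1 column per group = share of that group predicted unvaccinated.
share = (
    toy_preds.groupby("age_group")["Predictions"]
    .mean()
    .sort_values(ascending=False)
)
print(share)
```

Groups near the top of the sorted output are those the model most often predicts will skip the vaccine.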

In [None]:
explore('employment_status')

In [None]:
explore('age_group')

In [None]:
explore('hhs_geo_region')
#coded in random strings for privacy purposes


In [None]:
explore('rent_or_own')

In [None]:
explore('census_msa')

In [None]:
explore('education')

In [None]:
explore('doctor_recc_h1n1')

In [None]:
explore('marital_status')

In [None]:
explore('child_under_6_months')


In [None]:
explore('household_adults')

In [None]:
explore('household_children')

In [None]:
explore('race')

In [None]:
explore('income_poverty')

In [None]:
explore('employment_occupation')
#coded in random strings for privacy purposes

In [None]:
explore('health_worker')

In [None]:
explore('health_insurance')

Exporting X_preds DataFrame to use in creating Power BI visualizations.

In [None]:
#X_preds.to_csv('h1n1vaccines.csv')