# Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines

## Overview
Can you predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors?

In this challenge, we will take a look at vaccination, a key public health measure used to fight infectious diseases. Vaccines provide immunization for individuals, and enough immunization in a community can further reduce the spread of diseases through "herd immunity."

As of the launch of this competition, vaccines for the COVID-19 virus are still under development and not yet available. The competition will instead revisit the public health response to a different recent major respiratory disease pandemic. Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year, it was responsible for between 151,000 and 575,000 deaths globally.

A vaccine for the H1N1 flu virus became publicly available in October 2009. In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, in conjunction with questions about themselves. These additional questions covered their social, economic, and demographic background, opinions on risks of illness and vaccine effectiveness, and behaviors towards mitigating transmission. A better understanding of how these characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts.

This is a practice competition designed to be accessible to participants at all levels. That makes it a great place to dive into the world of data science competitions. Come on in from the waiting room and try your (hopefully steady) hand at predicting vaccinations.

In [1]:
#Step1 - Load Dataset
import nbconvert
import pandas as pd
train_df = pd.read_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/training_set_features.csv")
train_labels = pd.read_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/training_set_labels.csv")
test_df = pd.read_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/test_set_features.csv")
train_df.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb


In [2]:
#Step2 - Check column types, NA counts, and number of unique values.
df = pd.DataFrame(columns = ['Col', 'Type', 'NA', '%NA', 'UniqLen'])
colList = list(train_df)
for i, value in enumerate(colList):
    na_count = train_df[value].isna().sum()
    #'%NA' is stored as a fraction of rows; 'UniqLen' counts NaN as a value because unique() keeps it.
    df.loc[i] = [value, train_df[value].dtype, na_count, na_count/len(train_df), len(train_df[value].unique())]

df

Unnamed: 0,Col,Type,NA,%NA,UniqLen
0,respondent_id,int64,0,0.0,26707
1,h1n1_concern,float64,92,0.003445,5
2,h1n1_knowledge,float64,116,0.004343,4
3,behavioral_antiviral_meds,float64,71,0.002658,3
4,behavioral_avoidance,float64,208,0.007788,3
5,behavioral_face_mask,float64,19,0.000711,3
6,behavioral_wash_hands,float64,42,0.001573,3
7,behavioral_large_gatherings,float64,87,0.003258,3
8,behavioral_outside_home,float64,82,0.00307,3
9,behavioral_touch_face,float64,128,0.004793,3
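
The same per-column summary can also be built without a Python loop. The following is a minimal sketch on a toy frame; `summarize_columns` and `toy` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, NA count, NA fraction, distinct values."""
    return pd.DataFrame({
        "Col": list(frame.columns),
        "Type": frame.dtypes.astype(str).to_numpy(),
        "NA": frame.isna().sum().to_numpy(),
        "%NA": frame.isna().mean().to_numpy(),
        # dropna=False counts NaN as its own value, matching len(Series.unique())
        "UniqLen": frame.nunique(dropna=False).to_numpy(),
    })

# Toy data standing in for train_df (assumption: any DataFrame works here)
toy = pd.DataFrame({"a": [1.0, None, 3.0, 3.0], "b": ["x", "y", None, None]})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(train_df)` should give the same columns as the table above, with `%NA` still expressed as a fraction rather than a percentage.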


In [3]:
#Step3 - Remove columns with about 50% NA
train_df.drop(['employment_industry', 'employment_occupation'],inplace=True,axis=1)
test_df.drop(['employment_industry', 'employment_occupation'],inplace=True,axis=1)
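
Instead of hard-coding the two column names, the columns to drop could be derived from the NA share. This is a sketch under assumptions: the helper name `high_na_columns` and the 0.4 threshold are hypothetical choices, and the toy frames stand in for `train_df` and `test_df`.

```python
import pandas as pd

def high_na_columns(frame: pd.DataFrame, threshold: float = 0.4) -> list:
    """List the columns whose share of missing values exceeds the threshold."""
    na_share = frame.isna().mean()
    return na_share[na_share > threshold].index.tolist()

# Toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({"keep": [1, 2, 3, 4],
                          "mostly_missing": [None, None, None, 1.0]})
toy_test = pd.DataFrame({"keep": [5, 6], "mostly_missing": [None, 2.0]})

# Decide on the training data only, then apply the same list to both frames
to_drop = high_na_columns(toy_train)
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)
print(to_drop)
```

Choosing the list from the training frame alone keeps the two frames aligned even if the test set has a slightly different NA share.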

In [4]:
#Step4 - Dictionary
dic = {}
objList = []
colList = list(train_df)
#Create list of the unique values found in all object cols
for value in colList:
    #Per each object col, find unique non-NA values and add into list.
    if train_df[value].dtype == 'object':
        objList += list(train_df[value].dropna().unique())

#Remove duplicates. NaN is already excluded by dropna() above, so nothing
#has to be popped; a set has no fixed order, so pop(0) could remove a real category.
#Sorting makes the codes the same on every run.
objList = sorted(set(objList))

#Build dic with values
for i, value in enumerate(objList):
    dic[value] = (i + 3) * 4 - 1

#Replace strings with their numeric codes in one pass
train_df = train_df.replace(dic)
test_df = test_df.replace(dic)
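
A single dictionary shared by all columns gives unrelated categories from different columns arbitrary positions on one numeric scale. One alternative is a separate mapping per column, fitted on the training data and reused on the test data. The sketch below uses toy data; `encode_object_columns` is a hypothetical helper, not the notebook's method.

```python
import pandas as pd

def encode_object_columns(train: pd.DataFrame, test: pd.DataFrame):
    """Map each non-numeric column to integer codes learned from the training frame."""
    train, test = train.copy(), test.copy()
    mappings = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            continue
        categories = sorted(train[col].dropna().unique())
        mapping = {label: code for code, label in enumerate(categories, start=1)}
        mappings[col] = mapping
        # Labels missing from the mapping (including NaN) become NaN
        train[col] = train[col].map(mapping)
        test[col] = test[col].map(mapping)
    return train, test, mappings

# Toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({"sex": ["Male", "Female", None], "h1n1_concern": [1.0, 3.0, 2.0]})
toy_test = pd.DataFrame({"sex": ["Female", "Male"], "h1n1_concern": [0.0, 2.0]})
toy_train_enc, toy_test_enc, mappings = encode_object_columns(toy_train, toy_test)
print(mappings)
```

Keeping `mappings` around also makes it possible to translate the codes back to the original labels later.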


In [5]:
#Step5 - Adding labels features into train_df
train_df["h1n1_vaccine"] = train_labels["h1n1_vaccine"]
train_df["seasonal_vaccine"] = train_labels["seasonal_vaccine"]
#Placeholder label columns so test_df has the same columns as train_df
test_df["h1n1_vaccine"] = 0
test_df["seasonal_vaccine"] = 0

In [6]:
#Step6 - Fill NAs
train_df = train_df.fillna(0)
test_df = test_df.fillna(0)
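
Filling every gap with 0 makes a missing answer indistinguishable from a real 0 (for example, "not at all concerned"). One possible alternative is to fill with the training median and keep a flag that records where the gap was. This is a sketch on toy data; `impute_with_flags` and the `_missing` suffix are assumptions introduced for illustration.

```python
import pandas as pd

def impute_with_flags(train: pd.DataFrame, test: pd.DataFrame, columns):
    """Fill NA with the training median and add a 0/1 indicator column per feature."""
    train, test = train.copy(), test.copy()
    for col in columns:
        fill_value = train[col].median()  # learned from the training data only
        for frame in (train, test):
            frame[col + "_missing"] = frame[col].isna().astype("int8")
            frame[col] = frame[col].fillna(fill_value)
    return train, test

# Toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({"h1n1_concern": [1.0, None, 3.0, 2.0]})
toy_test = pd.DataFrame({"h1n1_concern": [None, 0.0]})
filled_train, filled_test = impute_with_flags(toy_train, toy_test, ["h1n1_concern"])
print(filled_train)
print(filled_test)
```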

In [7]:
#Step7 - Convert all object cols into numeric & 64-bit numeric cols into 32-bit format.
colList = list(train_df)
for i, value in enumerate(colList):
    if train_df[value].dtypes == 'int64':
        train_df[value] = train_df[value].astype('int32')
        test_df[value] = test_df[value].astype('int32')
    elif train_df[value].dtypes == 'float64':
        train_df[value] = train_df[value].astype('float32')
        test_df[value] = test_df[value].astype('float32')
    elif train_df[value].dtypes == 'object':
        train_df[value] = train_df[value].astype('int32')
        test_df[value] = test_df[value].astype('int32')
        
train_df.dtypes
train_df.to_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/training_features1.csv", index=False)
test_df.to_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/test_features1.csv", index=False)
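
The dtype conversion can also be delegated to `pd.to_numeric` with its `downcast` argument, which picks the smallest type that holds the values rather than a fixed 32-bit width. The sketch below is an alternative under that assumption, not the conversion used in this notebook; `downcast_numeric` is a hypothetical helper.

```python
import pandas as pd

def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Shrink integer and float columns to the smallest dtype that fits their values."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame standing in for train_df
toy = pd.DataFrame({"respondent_id": [26707, 26708], "h1n1_concern": [1.0, 2.0]})
small = downcast_numeric(toy)
print(small.dtypes)
```

Because the chosen width depends on the values, the train and test frames can end up with different dtypes for the same column, which is one reason to prefer the fixed 32-bit cast when the frames must match exactly.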

In [8]:
#Step8 - Check final columns type
train_df.dtypes

respondent_id                    int32
h1n1_concern                   float32
h1n1_knowledge                 float32
behavioral_antiviral_meds      float32
behavioral_avoidance           float32
behavioral_face_mask           float32
behavioral_wash_hands          float32
behavioral_large_gatherings    float32
behavioral_outside_home        float32
behavioral_touch_face          float32
doctor_recc_h1n1               float32
doctor_recc_seasonal           float32
chronic_med_condition          float32
child_under_6_months           float32
health_worker                  float32
health_insurance               float32
opinion_h1n1_vacc_effective    float32
opinion_h1n1_risk              float32
opinion_h1n1_sick_from_vacc    float32
opinion_seas_vacc_effective    float32
opinion_seas_risk              float32
opinion_seas_sick_from_vacc    float32
age_group                        int32
education                      float32
race                             int32
sex                      

## Machine Learning - PyCaret
In this part, we will predict the probability of each vaccination.
Let's run two experiments: the first for h1n1_vaccine and the second for seasonal_vaccine.

In [9]:
#Step9 - Load PyCaret
from pycaret.classification import *

exp1 = setup(train_df, target = 'h1n1_vaccine')

 
Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,1137
1,Target Type,Binary
2,Label Encoded,
3,Original Data,"(26707, 36)"
4,Missing Values,False
5,Numeric Features,35
6,Categorical Features,0
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,
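
The setup table reports 35 numeric features out of 36 columns, which suggests that `respondent_id` and the other label column are both being treated as predictors. The other label is set to 0 for every row of `test_df`, so a model that relies on it may behave differently at prediction time. A minimal sketch of building a separate frame per target follows; `frame_for_target` and the toy data are hypothetical.

```python
import pandas as pd

LABELS = ["h1n1_vaccine", "seasonal_vaccine"]

def frame_for_target(frame: pd.DataFrame, target: str,
                     id_col: str = "respondent_id") -> pd.DataFrame:
    """Keep the features plus one target; drop the other label and the ID column."""
    other_labels = [c for c in LABELS if c != target]
    return frame.drop(columns=other_labels + [id_col], errors="ignore")

# Toy frame standing in for train_df after Step 5
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
    "h1n1_vaccine": [0, 0, 1],
    "seasonal_vaccine": [0, 1, 1],
})
h1n1_frame = frame_for_target(toy, "h1n1_vaccine")
seasonal_frame = frame_for_target(toy, "seasonal_vaccine")
print(list(h1n1_frame.columns), list(seasonal_frame.columns))
```

A frame prepared this way could then be passed to `setup` in place of `train_df`, with the matching `target` argument.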


In [10]:
#Step10 - Compare models
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,Extreme Gradient Boosting,0.871,0.8866,0.5823,0.7573,0.6555,0.5783
1,Extra Trees Classifier,0.8678,0.8792,0.5267,0.7865,0.6267,0.5507
2,CatBoost Classifier,0.8678,0.881,0.5774,0.7425,0.6473,0.568
3,Linear Discriminant Analysis,0.864,0.8865,0.6176,0.7055,0.6567,0.5728
4,Ridge Classifier,0.8635,0.0,0.5471,0.7437,0.6276,0.547
5,Gradient Boosting Classifier,0.8635,0.8869,0.5723,0.7295,0.6392,0.5569
6,Light Gradient Boosting Machine,0.8603,0.8688,0.5851,0.7074,0.638,0.5529
7,Ada Boost Classifier,0.8571,0.8801,0.5671,0.705,0.627,0.5402
8,Quadratic Discriminant Analysis,0.8357,0.8422,0.6049,0.6139,0.6089,0.505
9,Random Forest Classifier,0.8314,0.8345,0.3883,0.6799,0.4907,0.3998
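
In every row of the table, accuracy is well above recall, which is the pattern one would expect when vaccinated respondents are the minority class. To make the column definitions concrete, the sketch below computes the same metrics from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from this experiment.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, recall, precision, F1 and Cohen's kappa from raw counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the marginal totals
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision,
            "F1": f1, "Kappa": kappa}

# Hypothetical counts: 100 actual positives, 400 actual negatives
metrics = binary_metrics(tp=60, fp=20, fn=40, tn=380)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.88 while recall is only 0.60, because the 380 correctly rejected negatives dominate the total.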


In [11]:
#Step11 - Stack models to improve performance
lda = create_model('lda')
gbc = create_model('gbc')
xgboost = create_model('xgboost')

# stacking models
stacker = stack_models(estimator_list = [xgboost ,lda,gbc], meta_model = xgboost)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.9037,0.9142,0.6923,0.8182,0.75,0.6909
1,0.861,0.9138,0.625,0.6944,0.6579,0.5709
2,0.877,0.8537,0.575,0.7931,0.6667,0.5936
3,0.8717,0.891,0.575,0.7667,0.6571,0.5802
4,0.8556,0.8753,0.575,0.697,0.6301,0.5415
5,0.8503,0.8446,0.425,0.7727,0.5484,0.4676
6,0.8449,0.877,0.5,0.6897,0.5797,0.4876
7,0.8182,0.816,0.4,0.6154,0.4848,0.3804
8,0.8441,0.8751,0.5641,0.6471,0.6027,0.5063
9,0.8817,0.9185,0.6667,0.7429,0.7027,0.6291
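
The ten rows above are per-fold scores, so a single summary figure requires averaging them. The sketch below does this for the AUC column, with the values copied from the table; the spread across folds gives a rough sense of how much a single fold can vary.

```python
from statistics import mean, stdev

# AUC per fold, copied from the stacking output above
fold_auc = [0.9142, 0.9138, 0.8537, 0.891, 0.8753,
            0.8446, 0.877, 0.816, 0.8751, 0.9185]

auc_mean = mean(fold_auc)
auc_sd = stdev(fold_auc)   # sample standard deviation across folds
print(f"mean AUC = {auc_mean:.4f}, sd = {auc_sd:.4f}")
```

The mean comes out at about 0.878. It can be set against the 0.8866 that `compare_models` reports for Extreme Gradient Boosting alone, with the caveat that the two tables may not have been computed on the same folds.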


In [12]:
#Step12 - Save experiment
save_experiment(experiment_name = 'G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/Exp1')

Experiment Succesfully Saved


Now that the H1N1 model is finished, let's continue with the seasonal vaccine.

In [13]:
#Step13 - Create setup
exp2 = setup(train_df, target = 'seasonal_vaccine')

 
Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,5919
1,Target Type,Binary
2,Label Encoded,
3,Original Data,"(26707, 36)"
4,Missing Values,False
5,Numeric Features,35
6,Categorical Features,0
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,


In [14]:
#Step14 - Compare models
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,Extreme Gradient Boosting,0.8067,0.88,0.7701,0.8079,0.7876,0.6106
1,CatBoost Classifier,0.8062,0.8777,0.7701,0.8068,0.7873,0.6095
2,Gradient Boosting Classifier,0.8057,0.8804,0.7701,0.8062,0.7867,0.6085
3,Light Gradient Boosting Machine,0.8057,0.8727,0.7713,0.8048,0.7869,0.6085
4,Ridge Classifier,0.8035,0.0,0.7264,0.8312,0.7748,0.602
5,Linear Discriminant Analysis,0.8035,0.8695,0.7264,0.8312,0.7748,0.602
6,Extra Trees Classifier,0.8025,0.8728,0.746,0.8159,0.7784,0.6009
7,Ada Boost Classifier,0.8003,0.8739,0.7563,0.8047,0.7786,0.5972
8,Quadratic Discriminant Analysis,0.7751,0.8394,0.7437,0.767,0.7549,0.5473
9,Random Forest Classifier,0.7682,0.8401,0.6908,0.7876,0.7351,0.5307


In [15]:
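#Step15 - Stack models to improve performance
catboost = create_model('catboost')
lda = create_model('lda')
xgboost = create_model('xgboost')

# stacking models
# Note: catboost is created above but is not part of the stack;
# gbc is the object created in Step 11 during the H1N1 experiment.
stacker = stack_models(estimator_list = [xgboost ,lda,gbc], meta_model = xgboost)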

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.8182,0.9018,0.8276,0.7912,0.809,0.6357
1,0.7647,0.8434,0.7356,0.7529,0.7442,0.5264
2,0.7701,0.8362,0.7701,0.7444,0.7571,0.5389
3,0.8021,0.8808,0.7931,0.7841,0.7886,0.6027
4,0.8075,0.8685,0.7241,0.84,0.7778,0.6096
5,0.8128,0.8931,0.7816,0.8095,0.7953,0.623
6,0.7594,0.8457,0.7126,0.7561,0.7337,0.5146
7,0.8342,0.8929,0.7586,0.8684,0.8098,0.6641
8,0.8172,0.8498,0.7586,0.8354,0.7952,0.6308
9,0.7903,0.8839,0.7356,0.8,0.7665,0.5768


In [16]:
#Step16 - Save 2nd experiment
save_experiment(experiment_name = 'G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/Exp2')

Experiment Succesfully Saved


In [17]:
#Step17 - Load exp1 to predict H1N1 probability.
exp1 = load_experiment('G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/Exp1')
# Caution: the name stacker still refers to the model returned by the most recent stack_models call.
prediction1 = predict_model(stacker, data = test_df)

In [18]:
#Step18 - Load exp2 to predict seasonal probability.
exp2 = load_experiment('G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/Exp2')
prediction2 = predict_model(stacker, data = test_df)
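
Steps 11 and 15 both assign their result to the name `stacker`, so by the time Steps 17 and 18 run, that name refers to the seasonal model in both calls. One way to avoid this is to store each trained model under its target name. The sketch below shows the pattern with a stand-in function; `train_stub` is hypothetical and replaces the real `create_model` / `stack_models` calls.

```python
def train_stub(target: str) -> dict:
    """Hypothetical stand-in for the create_model / stack_models calls."""
    return {"trained_on": target}

# One entry per target, so a later assignment cannot overwrite an earlier model
models = {}
for target in ("h1n1_vaccine", "seasonal_vaccine"):
    models[target] = train_stub(target)

# Each prediction step looks up the model that matches its target
h1n1_model = models["h1n1_vaccine"]
seasonal_model = models["seasonal_vaccine"]
print(h1n1_model, seasonal_model)
```

In the notebook, this would amount to keeping the Step 11 result under its own name (or dictionary key) before Step 15 runs, and passing that object to `predict_model` in Step 17.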

In [19]:
#Step19 - Build submission
submission = pd.DataFrame(columns=['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine'])
submission['respondent_id'] = test_df['respondent_id']
submission['h1n1_vaccine'] = prediction1['Score']
submission['seasonal_vaccine'] = prediction2['Score']
submission.head(30)

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,26707,0.0764,0.0764
1,26708,0.0764,0.0764
2,26709,0.0764,0.0764
3,26710,0.0622,0.0622
4,26711,0.0764,0.0764
5,26712,0.0622,0.0622
6,26713,0.0764,0.0764
7,26714,0.0764,0.0764
8,26715,0.0764,0.0764
9,26716,0.0622,0.0622
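
The two probability columns in the table above are identical row for row, which is unlikely for two different targets and is worth catching before a file is submitted. The sketch below runs a few basic checks on a submission frame; `check_submission` is a hypothetical helper, and the toy frame reuses the first four rows shown above.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in a submission frame."""
    problems = []
    for col in ("h1n1_vaccine", "seasonal_vaccine"):
        # between() is False for NaN, so missing scores are reported too
        if not sub[col].between(0, 1).all():
            problems.append(f"{col} has values outside [0, 1]")
    if (sub["h1n1_vaccine"] == sub["seasonal_vaccine"]).all():
        problems.append("both target columns are identical")
    if sub["respondent_id"].duplicated().any():
        problems.append("duplicate respondent_id")
    return problems

# First four rows of the submission table shown above
toy_submission = pd.DataFrame({
    "respondent_id": [26707, 26708, 26709, 26710],
    "h1n1_vaccine": [0.0764, 0.0764, 0.0764, 0.0622],
    "seasonal_vaccine": [0.0764, 0.0764, 0.0764, 0.0622],
})
problems = check_submission(toy_submission)
print(problems)
```

An empty list means none of these checks failed; it does not by itself show that the predictions are good.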


In [20]:
submission.to_csv("G:/DataScienceProject/Drivendata-Predict-H1N1-And-Seasonal-Flu-Vaccines/submit1.csv", index=False)