In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import (BaseEstimator, TransformerMixin)
from sklearn.metrics import accuracy_score

%matplotlib inline

The data used for this analysis comes from a 2009 survey conducted during the H1N1 outbreak. The CDC ran the survey to monitor and evaluate flu vaccination efforts among adults and children in randomly selected US households. Participants were asked about their H1N1 vaccination status, flu-related behaviors, opinions about flu vaccine safety and effectiveness, recent respiratory illness, and pneumococcal vaccination status <a href="#About the National Immunization Survery">[1]</a>.

The survey data can be downloaded <a href="https://www.drivendata.org/competitions/66/flu-shot-learning/data/">here</a><a href="#Source Data Download">[2]</a>, and the feature descriptions can be found <a href="https://github.com/cschneck7/Iterative_Classification_Blog/blob/main/data/H1N1_and_Seasonal_Flu_Vaccines_Feature_Information.txt">here</a>.

In [3]:
# Import survey data into dataframes
# The source dataset is already split into feature and target files
X = pd.read_csv('data/source_data/training_set_features.csv')
y = pd.read_csv('data/source_data/training_set_labels.csv')
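
Because the features and labels arrive in separate files, it can be worth confirming that their rows line up before modelling. The sketch below is a minimal check on toy in-memory frames, not the survey data. It assumes the labels file carries a `respondent_id` column like the feature file does; the helper name `rows_aligned` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values, not survey data)
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 2.0],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],  # assumed to exist in the labels file
    "h1n1_vaccine": [0, 1, 0],
    "seasonal_vaccine": [0, 1, 1],
})


def rows_aligned(features, labels, key="respondent_id"):
    """Return True when both frames list the same key values in the same order."""
    return len(features) == len(labels) and features[key].equals(labels[key])


aligned = rows_aligned(features, labels)
print(aligned)
```

If the check fails on real data, merging the two frames on the key column is safer than relying on row order.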

The source data contains two target variables; this example concentrates only on `h1n1_vaccine`.

In [4]:
# Sets target variable
y = y.h1n1_vaccine

A quick look at the shape of the feature dataframe.

In [7]:
# Returns shape of feature dataframe
X.shape

(26707, 36)

A quick look at the missing values in the feature dataframe.

In [5]:
# Counts NaN values in each column of the feature dataframe
missing_values = X.isna().sum()
missing_values

respondent_id                      0
h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m
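
Raw counts are easier to judge as a share of all rows. For instance, the 12274 missing `health_insurance` values out of 26707 rows come to roughly 46%. The sketch below shows one way to compute such shares, using a small toy frame with hypothetical values rather than the survey data; the variable names `toy` and `missing_share` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the feature dataframe X
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "h1n1_concern": [2.0, 1.0, np.nan, 3.0],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction of missing entries in each column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `X.isna().mean()`, which puts heavily incomplete columns next to nearly complete ones on a common scale.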

## References

[1] <a id='About the National Immunization Survery' href="https://webarchive.loc.gov/all/20140511031000/http://www.cdc.gov/nchs/nis/about_nis.htm#h1n1">https://webarchive.loc.gov/all/20140511031000/http://www.cdc.gov/nchs/nis/about_nis.htm#h1n1</a>

[2] <a id='Source Data Download' href="https://www.drivendata.org/competitions/66/flu-shot-learning/data/">https://www.drivendata.org/competitions/66/flu-shot-learning/data/</a>
