In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import (BaseEstimator, TransformerMixin)
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

%matplotlib inline

The data used for this analysis comes from a 2009 survey conducted during the H1N1 outbreak. The CDC ran the survey to monitor and evaluate flu vaccination efforts among adults and children in randomly selected US households. Participants were asked about their H1N1 vaccination status, flu-related behaviors, opinions about flu vaccine safety and effectiveness, recent respiratory illness, and pneumococcal vaccination status <a href="#About the National Immunization Survery">[1]</a>.

The survey data can be downloaded <a href="https://www.drivendata.org/competitions/66/flu-shot-learning/data/">here</a><a href="#Source Data Download">[2]</a>, and the feature descriptions can be found <a href="https://github.com/cschneck7/Iterative_Classification_Blog/blob/main/data/H1N1_and_Seasonal_Flu_Vaccines_Feature_Information.txt">here</a>.

In [2]:
# Import survey data into dataframes
# The source dataset is already split into feature and target files
X = pd.read_csv('data/source_data/training_set_features.csv')
y = pd.read_csv('data/source_data/training_set_labels.csv')

The source data contains two target variables; for this example we will concentrate only on `h1n1_vaccine`.

In [3]:
# Sets target variable
y = y.h1n1_vaccine

Quick look at feature dataframe shape.

In [4]:
# Returns shape of feature dataframe
X.shape

(26707, 36)

Quick look at missing values in feature dataframe.

In [5]:
# Counts NaN values in each column of the feature dataframe
missing_values = X.isna().sum()
missing_values

respondent_id                      0
h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m
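Raw counts are easier to compare once they are expressed as a fraction of all rows; for instance, the 12274 missing `health_insurance` entries are about 46% of the 26707 rows. The sketch below shows one way to get those fractions with `isna().mean()`. The `toy` frame and its column names are made up for illustration and stand in for `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey features
toy = pd.DataFrame({
    'almost_complete': [0, 1, 0, 0, np.nan, 0, 0, 1],
    'half_missing': [1, np.nan, np.nan, 1, np.nan, 0, np.nan, 1],
    'complete': list('abcdabcd'),
})

# isna() yields booleans, so the column mean is the fraction of missing rows
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to the survey data, the same call would be `X.isna().mean()`.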

### Most Frequent Entry Imputation

Below we look at the normalized distribution of two features: one missing only a few entries and one missing many. We will use them to show that most frequent entry imputation works well for features that are almost complete but introduces a bias for features with many missing entries. This bias is most noticeable when the original distribution is close to evenly split; it is less severe, and possibly acceptable, when the distribution is already heavily skewed toward one value.

In [8]:
# Selects one feature with few missing entries and one with many
X_missing = X[['behavioral_face_mask', 'health_insurance']]

# Displays number of missing values as well as normalized
# value distribution of existing values in percentages
print(f'Number of missing values: {X_missing.behavioral_face_mask.isna().sum()}')
print(X_missing.behavioral_face_mask.value_counts(normalize=True))
print(f'Number of missing values: {X_missing.health_insurance.isna().sum()}')
print(X_missing.health_insurance.value_counts(normalize=True))

Number of missing values: 19


0.0    0.931018
1.0    0.068982
Name: behavioral_face_mask, dtype: float64

Number of missing values: 12274


1.0    0.87972
0.0    0.12028
Name: health_insurance, dtype: float64

Next we impute the most frequent entry using `SimpleImputer` and look at the new distributions.

In [25]:
# Creating a simple imputer object with strategy of most_frequent
si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# creates dataframe of transformed features
X_most_frequent = pd.DataFrame(data=si.fit_transform(X_missing),
                               index=X_missing.index,
                               columns=X_missing.columns)


# Displays new value distributions after imputation
print(f'Number of missing values: {X_most_frequent.behavioral_face_mask.isna().sum()}')
print(X_most_frequent.behavioral_face_mask.value_counts(normalize=True))
print(f'Number of missing values: {X_most_frequent.health_insurance.isna().sum()}')
print(X_most_frequent.health_insurance.value_counts(normalize=True))

Number of missing values: 0
0.0    0.931067
1.0    0.068933
Name: behavioral_face_mask, dtype: float64
Number of missing values: 0
1.0    0.934998
0.0    0.065002
Name: health_insurance, dtype: float64


As can be seen above, this imputation method barely changed the distribution of values for `behavioral_face_mask`, which was missing only 19 entries. On the other hand, `health_insurance`, which was missing nearly half of its values, had the gap between its two classes widen by about 11 percentage points (from roughly 88:12 to 93.5:6.5). Even though the distribution was already imbalanced, most frequent entry imputation skewed it further.
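The size of this shift can be worked out by hand from the counts reported earlier. The following is a minimal sketch of that arithmetic; the helper name `share_after_mode_imputation` is hypothetical, and the inputs are the row count, missing count, and observed share printed in the cells above.

```python
def share_after_mode_imputation(mode_share, n_observed, n_missing):
    """Share of the most frequent value once every missing row is set to it."""
    mode_count = mode_share * n_observed
    return (mode_count + n_missing) / (n_observed + n_missing)

n_total = 26707      # rows in the feature dataframe
n_missing = 12274    # missing health_insurance entries
n_observed = n_total - n_missing

before = 0.87972     # observed share of health_insurance == 1.0
after = share_after_mode_imputation(before, n_observed, n_missing)

# Gap between the two classes is share - (1 - share) = 2 * share - 1
gap_change = (2 * after - 1) - (2 * before - 1)
print(round(after, 4), round(gap_change, 4))
```

The computed share agrees with the roughly 0.935 reported by `value_counts` after imputation, and the gap between the classes grows by about 0.11.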

### Random Imputation

Next we will take a quick look at random imputation. This method fills missing entries by sampling from the distribution of the existing values. This may already be more attractive than the previous method because the distribution stays approximately the same after imputation. We'll use the same features as before to provide an example.

In [34]:
# Creates copy of DataFrame
X_rand_imp = X_missing.copy()

# Iterates through features
for col in X_missing.columns:
    # Finds number of missing values in feature
    number_missing = X_rand_imp[col].isnull().sum()
    # Finds normalized distribution of existing entries
    value_dist = X_rand_imp.loc[X_rand_imp[col].notnull(), col].value_counts(normalize=True)
    # Sets random seed for random.choice
    np.random.seed(0)
    # Randomly imputes observed values, replacing all missing entries
    X_rand_imp.loc[X_rand_imp[col].isnull(), col] = np.random.choice(value_dist.index,
                                                                     number_missing,
                                                                     replace=True,
                                                                     p=value_dist)
    
# Displays before and after imputation distributions
print('behavioral_face_mask\n')
print(f'Original Distribution:\nNumber of missing values: {X_missing.behavioral_face_mask.isna().sum()}')
print(X_missing.behavioral_face_mask.value_counts(normalize=True))
print(f'\nDistribution after random imputation:\nNumber of missing values: {X_rand_imp.behavioral_face_mask.isna().sum()}')
print(X_rand_imp.behavioral_face_mask.value_counts(normalize=True))
print('\n-------------------------------------------------------------\n')
print('health_insurance\n')
print(f'Original Distribution:\nNumber of missing values: {X_missing.health_insurance.isna().sum()}')
print(X_missing.health_insurance.value_counts(normalize=True))
print(f'\nDistribution after random imputation:\nNumber of missing values: {X_rand_imp.health_insurance.isna().sum()}')
print(X_rand_imp.health_insurance.value_counts(normalize=True))

behavioral_face_mask

Original Distribution:
Number of missing values: 19
0.0    0.931018
1.0    0.068982
Name: behavioral_face_mask, dtype: float64

Distribution after random imputation:
Number of missing values: 0
0.0    0.931029
1.0    0.068971
Name: behavioral_face_mask, dtype: float64

-------------------------------------------------------------

health_insurance

Original Distribution:
Number of missing values: 12274
1.0    0.87972
0.0    0.12028
Name: health_insurance, dtype: float64

Distribution after random imputation:
Number of missing values: 0
1.0    0.879957
0.0    0.120043
Name: health_insurance, dtype: float64


As seen above, this method approximately maintains the original distribution of values after imputation. Note, however, that the expected accuracy of the imputed entries drops compared to most frequent entry imputation. If we assume the missing entries follow the same distribution as the observed ones, then imputing the most frequent entry is correct as often as that entry occurs in the observed data. For example, if the existing values are split 80:20, most frequent entry imputation will be 80% accurate, while random imputation will be only 68% accurate. To see why, consider that 80% of the missing entries are imputed with the most frequent value and only 80% of those are correct, while 20% are imputed with the less frequent value and only 20% of those are correct. The math works out as shown below.

In [38]:
100*((.8*.8) + (.2*.2))

68.00000000000001
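The same reasoning generalizes to any binary feature whose most frequent value has share `p`: most frequent entry imputation is right with probability `p`, while random imputation is right with probability `p**2 + (1 - p)**2`. The sketch below checks this against a quick simulation. It assumes, as in the text, that the missing entries follow the observed distribution; the function name and the simulated data are illustrative only.

```python
import numpy as np

def expected_accuracy(p):
    """Expected accuracy of (most frequent, random) imputation for majority share p."""
    most_frequent = max(p, 1 - p)
    random_draw = p ** 2 + (1 - p) ** 2
    return most_frequent, random_draw

rng = np.random.default_rng(0)
p = 0.8
n = 200_000

# Hypothetical true values of the missing entries (True = most frequent value)
truth = rng.random(n) < p
# Random imputation: independent draws from the same distribution
guess = rng.random(n) < p

simulated_random = (truth == guess).mean()
# Most frequent entry imputation always guesses True
simulated_mode = truth.mean()

print(expected_accuracy(p))
print(simulated_mode, simulated_random)
```

The two accuracies only coincide when `p` is 0.5 or 1, so random imputation trades some per-entry accuracy for a preserved distribution in every other case.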

## References

[1] <a id='About the National Immunization Survery' href="https://webarchive.loc.gov/all/20140511031000/http://www.cdc.gov/nchs/nis/about_nis.htm#h1n1">https://webarchive.loc.gov/all/20140511031000/http://www.cdc.gov/nchs/nis/about_nis.htm#h1n1</a>

[2] <a id='Source Data Download' href='https://www.drivendata.org/competitions/66/flu-shot-learning/data/'>https://www.drivendata.org/competitions/66/flu-shot-learning/data/</a>
