In [18]:
# import packages
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import os
from sklearn.impute import SimpleImputer

In [4]:
# change viewing options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated since pandas 1.0)

### Step 1: Import data

In [5]:
# read in the data
train = pd.read_csv("training_set_features.csv")
train_labels = pd.read_csv("training_set_labels.csv")
test = pd.read_csv("test_set_features.csv")

### Step 2: Do data preprocessing, scaling, etc.

During the data exploration phase, we decided which columns were relevant to our analysis, so we'll subset the training and test sets to those features.

In [6]:
# get columns to subset
columns_subset = ['h1n1_concern', 'h1n1_knowledge', 'doctor_recc_h1n1', 'doctor_recc_seasonal', 
                 'chronic_med_condition', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 
                 'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 
                 'opinion_seas_sick_from_vacc', 'age_group', 'education', 'income_poverty']

In [7]:
# subset datasets
train_subset = train.loc[:, columns_subset]
test_subset = test.loc[:, columns_subset]

Now, we have to do something about any missing data that we observe:

In [13]:
# count NaNs in each column
print(f"The number of NaNs, by column, in the training set, is: \n\n{train_subset.isna().sum()} \n\n =========")
print(f"The number of NaNs, by column, in the test set, is: \n\n{test_subset.isna().sum()} \n\n")

The number of NaNs, by column, in the training set, is: 

h1n1_concern                   92  
h1n1_knowledge                 116 
doctor_recc_h1n1               2160
doctor_recc_seasonal           2160
chronic_med_condition          971 
opinion_h1n1_vacc_effective    391 
opinion_h1n1_risk              388 
opinion_h1n1_sick_from_vacc    395 
opinion_seas_vacc_effective    462 
opinion_seas_risk              514 
opinion_seas_sick_from_vacc    537 
age_group                      0   
education                      1407
income_poverty                 4423
dtype: int64 

The number of NaNs, by column, in the test set, is: 

h1n1_concern                   85  
h1n1_knowledge                 122 
doctor_recc_h1n1               2160
doctor_recc_seasonal           2160
chronic_med_condition          932 
opinion_h1n1_vacc_effective    398 
opinion_h1n1_risk              380 
opinion_h1n1_sick_from_vacc    375 
opinion_seas_vacc_effective    452 
opinion_seas_risk              499 
opinion_s

For the survey responses, we can probably impute the NaNs with either a placeholder (0 or 'unknown') or the mean/median. For the demographic columns with missing values (education, income_poverty), we can simply impute 'unknown'. Let's check the dtype of each column: if they're all categorical/object, we can impute every missing value with 'unknown'.

In [17]:
# for training columns (test columns will be the same)
# note: this prints the type of each column's first value, not the column dtype
for col in train_subset.columns:
    print(f"The dtype of the {col} column is: {type(train_subset[col][0])}\n=======")

The dtype of the h1n1_concern column is: <class 'numpy.float64'>
The dtype of the h1n1_knowledge column is: <class 'numpy.float64'>
The dtype of the doctor_recc_h1n1 column is: <class 'numpy.float64'>
The dtype of the doctor_recc_seasonal column is: <class 'numpy.float64'>
The dtype of the chronic_med_condition column is: <class 'numpy.float64'>
The dtype of the opinion_h1n1_vacc_effective column is: <class 'numpy.float64'>
The dtype of the opinion_h1n1_risk column is: <class 'numpy.float64'>
The dtype of the opinion_h1n1_sick_from_vacc column is: <class 'numpy.float64'>
The dtype of the opinion_seas_vacc_effective column is: <class 'numpy.float64'>
The dtype of the opinion_seas_risk column is: <class 'numpy.float64'>
The dtype of the opinion_seas_sick_from_vacc column is: <class 'numpy.float64'>
The dtype of the age_group column is: <class 'str'>
The dtype of the education column is: <class 'str'>
The dtype of the income_poverty column is: <class 'str'>
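
A column-level check is a useful complement here, since the first value of a column can itself be NaN. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the survey data) to show `dtypes` and `select_dtypes`, which also give the numeric/string split without hard-coding column lists.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_subset; the values are made up for illustration.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, np.nan, 3.0],
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years"],
    "education": [np.nan, "College Graduate", "12 Years"],
})

# Column-level dtypes do not depend on what the first row contains.
print(toy.dtypes)

# Split the column names by kind instead of listing them by hand.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
str_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, str_cols)
```

Here `education` starts with a NaN, so checking only its first value would report a float even though the column holds strings.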


Because the columns are a mix of floats and strings, we can split the data, impute each part separately, then rejoin them.

In [19]:
# split columns by type
numeric_cols = ['h1n1_concern', 'h1n1_knowledge', 'doctor_recc_h1n1', 'doctor_recc_seasonal', 
                'chronic_med_condition', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 
                'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 
                'opinion_seas_sick_from_vacc']
str_cols = ['age_group', 'education', 'income_poverty']

train_subset_numeric = train_subset.loc[:, numeric_cols]
train_subset_str = train_subset.loc[:, str_cols]

test_subset_numeric = test_subset.loc[:, numeric_cols]
test_subset_str = test_subset.loc[:, str_cols]


For the numeric values, there are several strategies that we could use:

1. Impute with the mean/median: this is the simplest strategy, but it could distort the class proportions and skew the results by artificially inflating the count of one value.
2. Impute with a predicted value (e.g., fit a logistic regression and use its prediction as the imputed value): this would probably be the most robust method, since it uses the existing data to estimate what the true value should be. It is imperfect, though: the other columns have missing values too (so, to fit a logistic regression, we'd first have to impute those), and we don't know how reliable our imputations would be.
3. Impute randomly, such that the class proportions remain the same: this keeps the proportions consistent, but at the cost of introducing noise by assigning class values to observations at random.
4. Filter out observations with missing data.
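
To make the trade-offs concrete, strategies 1, 3, and 4 can be sketched on a single toy column. The series `s`, its values, and the random seed are made up for illustration; this is a rough sketch, not the method used in the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the draws are repeatable

# Toy binary column with two missing values (made-up data).
s = pd.Series([1.0, 0.0, 0.0, np.nan, 0.0, 1.0, np.nan, 0.0])

# Strategy 1: fill every gap with the median of the observed values.
median_filled = s.fillna(s.median())

# Strategy 3: draw replacements from the observed values with replacement,
# so the class proportions are preserved in expectation (not exactly).
observed = s.dropna().to_numpy()
n_missing = int(s.isna().sum())
random_filled = s.copy()
random_filled.loc[s.isna()] = rng.choice(observed, size=n_missing, replace=True)

# Strategy 4: drop the observations with missing values.
dropped = s.dropna()

print(s.value_counts(normalize=True))
print(median_filled.value_counts(normalize=True))
```

Comparing the two printed distributions shows how filling with the median shifts the share of the majority value.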

Personally, I lean towards using a logistic regression to predict what the missing values should be, since it seems to be the method that will best maintain the existing class proportions while accounting for the other features of the data (so, two people who have the same values for every other column in the dataset should presumably have the same value for the column in question).
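
As a rough sketch of what model-based imputation could look like, the example below fits scikit-learn's `LogisticRegression` on the rows where a binary column is observed and predicts it for the rows where it is missing. The frame `toy`, its values, and the choice of predictors are assumptions made for illustration; the predictors are deliberately complete here, which sidesteps the problem of missing values in the other columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data: two complete predictors and one binary target with gaps.
toy = pd.DataFrame({
    "h1n1_concern":      [0, 1, 2, 3, 3, 0, 2, 1],
    "opinion_h1n1_risk": [1, 2, 4, 5, 4, 1, 5, 2],
    "doctor_recc_h1n1":  [0, 0, 1, 1, np.nan, np.nan, 1, 0],
})
predictors = ["h1n1_concern", "opinion_h1n1_risk"]  # illustrative choice
target = "doctor_recc_h1n1"

# Rows where the target is observed are used for fitting.
known = toy[target].notna()

model = LogisticRegression()
model.fit(toy.loc[known, predictors], toy.loc[known, target].astype(int))

# Predict the target only for the rows where it is missing.
imputed = toy.copy()
imputed.loc[~known, target] = model.predict(toy.loc[~known, predictors])
print(imputed)
```

Two rows with identical predictor values always receive the same imputed value under this approach, which is the property described above.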

But, before doing so, it's worth checking how "similar" the people with missing values are to those for whom we have data. If they're similar, it makes sense to use information from the existing data to impute the missing values. Otherwise, it might be advisable to adapt the imputation method or filter them out entirely.
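
One simple way to run this check is to flag the rows where a column is missing and compare the other features across the two groups. The sketch below does this on made-up data; the frame `toy`, its values, and the `income_status` label are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "age_group": ["18 - 34 Years", "65+ Years", "65+ Years",
                  "18 - 34 Years", "65+ Years", "18 - 34 Years"],
    "h1n1_concern": [1.0, 3.0, 2.0, 0.0, 3.0, 1.0],
    "income_poverty": ["Below Poverty", np.nan, "> $75,000",
                       "Below Poverty", np.nan, "> $75,000"],
})

# Label each row by whether income_poverty is missing or observed.
flag = (
    toy["income_poverty"].isna()
    .map({True: "missing", False: "observed"})
    .rename("income_status")
)

# Numeric feature: compare the group means.
concern_by_status = toy.groupby(flag)["h1n1_concern"].mean()

# Categorical feature: share of each age group within each status (rows sum to 1).
age_by_status = pd.crosstab(flag, toy["age_group"], normalize="index")

print(concern_by_status)
print(age_by_status)
```

Large gaps between the "missing" and "observed" rows would suggest that the data are not missing at random, in which case imputing from the observed rows deserves more caution.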


In [None]:
# fit separate imputer objects on them
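
A minimal sketch of this step, assuming the simpler baseline strategies rather than the logistic regression approach: one `SimpleImputer` with the median for the numeric columns and one with the constant 'unknown' for the string columns. The toy frames below are made-up stand-ins for the numeric and string subsets, and fitting on the training data only before transforming the test data is a design choice assumed here, not something stated above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for the numeric and string subsets.
toy_train_numeric = pd.DataFrame({"h1n1_concern": [1.0, np.nan, 3.0, 2.0]})
toy_test_numeric = pd.DataFrame({"h1n1_concern": [np.nan, 0.0]})
toy_train_str = pd.DataFrame(
    {"education": ["12 Years", np.nan, "College Graduate", "12 Years"]}, dtype=object
)
toy_test_str = pd.DataFrame({"education": [np.nan, "12 Years"]}, dtype=object)

# One imputer per column type.
numeric_imputer = SimpleImputer(strategy="median")
str_imputer = SimpleImputer(strategy="constant", fill_value="unknown")

# Fit on the training data, then apply the same fitted imputer to the test data.
train_num = pd.DataFrame(numeric_imputer.fit_transform(toy_train_numeric),
                         columns=toy_train_numeric.columns)
test_num = pd.DataFrame(numeric_imputer.transform(toy_test_numeric),
                        columns=toy_test_numeric.columns)
train_str = pd.DataFrame(str_imputer.fit_transform(toy_train_str),
                         columns=toy_train_str.columns)
test_str = pd.DataFrame(str_imputer.transform(toy_test_str),
                        columns=toy_test_str.columns)

# Re-append the two halves column-wise.
train_imputed = pd.concat([train_num, train_str], axis=1)
test_imputed = pd.concat([test_num, test_str], axis=1)
print(train_imputed)
```

Reusing the fitted imputer on the test set keeps the fill values consistent between the two sets and avoids using test data to choose them.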
