# EDA notebook

In [1]:
import numpy as np
import pandas as pd

In [2]:
# set options to make data easier to view in Jupyter Notebook
pd.set_option("display.max_columns", 100)

In [3]:
initial_features_df = pd.read_csv('data/training_set_features.csv')
initial_labels_df = pd.read_csv('data/training_set_labels.csv')

## Check target labels for potential issues

Check the target labels for:
- missing rows
- target labels stored as strings instead of numbers
- duplicated survey respondent_id (same person took survey twice, was entered twice, etc.)
- number of classes in target variable
- class imbalance in target variable

In [4]:
# check labels for missing rows and object/string dtypes
initial_labels_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   respondent_id     26707 non-null  int64
 1   h1n1_vaccine      26707 non-null  int64
 2   seasonal_vaccine  26707 non-null  int64
dtypes: int64(3)
memory usage: 626.1 KB


In [5]:
# check for duplicate respondent_id (same person took survey twice)
initial_labels_df['respondent_id'].duplicated().sum()

0

In [6]:
# See how many classes there are in target label
y = initial_labels_df['seasonal_vaccine']
y.unique()

array([0, 1])

In [7]:
# look at proportion of 1's (to check for class imbalance)
y.mean()

0.4656082674954132
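Both label columns can be summarized in one table rather than one `mean()` at a time. The following is a minimal sketch, not part of the original notebook: `class_balance` is a hypothetical helper name, and the toy frame only stands in for `initial_labels_df`.

```python
import pandas as pd

def class_balance(labels_df, target_cols):
    """Return the proportion of each class for every target column (hypothetical helper)."""
    return pd.DataFrame({
        col: labels_df[col].value_counts(normalize=True, dropna=False)
        for col in target_cols
    })

# toy stand-in for initial_labels_df, for illustration only
toy_labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 0, 1],
    "seasonal_vaccine": [0, 1, 0, 1],
})
balance = class_balance(toy_labels, ["h1n1_vaccine", "seasonal_vaccine"])
print(balance)
```

In the notebook this would presumably be called as `class_balance(initial_labels_df, ['h1n1_vaccine', 'seasonal_vaccine'])`.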

## Review features for potential issues

Check features for:
- missing rows
- whether categorical columns are strings or numbers

In [8]:
initial_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

### Drop irrelevant columns

Drop these columns (for these reasons):
- `respondent_id` (unique identifier, randomly assigned)
- `h1n1_concern`, `h1n1_knowledge`, `doctor_recc_h1n1`, `opinion_h1n1_vacc_effective`, `opinion_h1n1_risk`, `opinion_h1n1_sick_from_vacc` (this analysis focuses on the seasonal flu vaccine, so the H1N1-specific questions seem extraneous)

In [11]:
# drop columns that are not useful for training model
initial_features_df.drop(columns=[
    'respondent_id', 'h1n1_concern', 'h1n1_knowledge', 'doctor_recc_h1n1', 
    'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc'
], inplace=True)
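One caveat with `inplace=True`: re-running the cell raises a `KeyError`, because the columns are already gone. A possible alternative, sketched here under the assumption that a non-mutating version is acceptable, returns a new frame and skips absent columns. `drop_irrelevant` and `COLS_TO_DROP` are hypothetical names introduced for illustration.

```python
import pandas as pd

# columns listed in the notebook's drop cell
COLS_TO_DROP = [
    "respondent_id", "h1n1_concern", "h1n1_knowledge", "doctor_recc_h1n1",
    "opinion_h1n1_vacc_effective", "opinion_h1n1_risk", "opinion_h1n1_sick_from_vacc",
]

def drop_irrelevant(df, cols=COLS_TO_DROP):
    """Return a copy of df without cols; columns that are already absent are skipped."""
    return df.drop(columns=cols, errors="ignore")

# toy frame for illustration only
toy = pd.DataFrame({
    "respondent_id": [0, 1],
    "h1n1_concern": [1.0, 2.0],
    "sex": ["Male", "Female"],
})
once = drop_irrelevant(toy)
twice = drop_irrelevant(once)  # safe to repeat, no KeyError
print(list(twice.columns))
```

The trade-off is that `errors="ignore"` also hides misspelled column names, so it suits exploratory re-runs better than a final pipeline.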

### Inspect numeric columns and object (string) columns separately

In [12]:
# function to return lists of numeric and non-numeric columns
def dataframe_info(df):
    '''
    Takes a pandas DataFrame (df).
    Returns a list of numeric columns and a list of non-numeric columns.
    '''
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
    object_cols  = df.select_dtypes(exclude=np.number).columns.tolist()
    
    return numeric_cols, object_cols

In [13]:
numeric_cols, object_cols = dataframe_info(initial_features_df)
print('List of numeric columns: \n', numeric_cols)
print('\n')
print('List of object columns: \n', object_cols)

List of numeric columns: 
 ['behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_seasonal', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults', 'household_children']


List of object columns: 
 ['age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa', 'employment_industry', 'employment_occupation']


#### Use value_counts(dropna=False) to determine number of categories and proportion of NaNs

Observations:
- `health_insurance` has a large proportion of NaNs (~46%)
    - Missing values don't seem to be related to `employment_status`
    - [Interview questions](https://ftp.cdc.gov/pub/health_Statistics/nchs/Dataset_Documentation/NIS/nhfs/NHFSPUF_QUEX.PDF) appear to offer "don't know" and "refused to answer" choices, so missing values might reflect these choices
- "opinion" variables are on a Likert scale (ratings 1-5)
- "household" variables are counts

More info in [Data Dictionary](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/#labels) 

In [14]:
for col in numeric_cols:
    print(f"Column: {col}")
    display(initial_features_df[col].value_counts(normalize=True, dropna=False))
    print("\n")

Column: behavioral_antiviral_meds


0.0    0.948628
1.0    0.048714
NaN    0.002658
Name: behavioral_antiviral_meds, dtype: float64



Column: behavioral_avoidance


1.0    0.719961
0.0    0.272251
NaN    0.007788
Name: behavioral_avoidance, dtype: float64



Column: behavioral_face_mask


0.0    0.930355
1.0    0.068933
NaN    0.000711
Name: behavioral_face_mask, dtype: float64



Column: behavioral_wash_hands


1.0    0.824316
0.0    0.174112
NaN    0.001573
Name: behavioral_wash_hands, dtype: float64



Column: behavioral_large_gatherings


0.0    0.639271
1.0    0.357472
NaN    0.003258
Name: behavioral_large_gatherings, dtype: float64



Column: behavioral_outside_home


0.0    0.660651
1.0    0.336279
NaN    0.003070
Name: behavioral_outside_home, dtype: float64



Column: behavioral_touch_face


1.0    0.674018
0.0    0.321189
NaN    0.004793
Name: behavioral_touch_face, dtype: float64



Column: doctor_recc_seasonal


0.0    0.616056
1.0    0.303067
NaN    0.080878
Name: doctor_recc_seasonal, dtype: float64



Column: chronic_med_condition


0.0    0.690680
1.0    0.272962
NaN    0.036358
Name: chronic_med_condition, dtype: float64



Column: child_under_6_months


0.0    0.889243
1.0    0.080054
NaN    0.030704
Name: child_under_6_months, dtype: float64



Column: health_worker


0.0    0.861347
1.0    0.108548
NaN    0.030104
Name: health_worker, dtype: float64



Column: health_insurance


1.0    0.475418
NaN    0.459580
0.0    0.065002
Name: health_insurance, dtype: float64



Column: opinion_seas_vacc_effective


4.0    0.435429
5.0    0.373423
2.0    0.082600
1.0    0.045718
3.0    0.045531
NaN    0.017299
Name: opinion_seas_vacc_effective, dtype: float64



Column: opinion_seas_risk


2.0    0.335268
4.0    0.285693
1.0    0.223687
5.0    0.110757
3.0    0.025349
NaN    0.019246
Name: opinion_seas_risk, dtype: float64



Column: opinion_seas_sick_from_vacc


1.0    0.444453
2.0    0.285805
4.0    0.181675
5.0    0.064440
NaN    0.020107
3.0    0.003520
Name: opinion_seas_sick_from_vacc, dtype: float64



Column: household_adults


1.0    0.541955
0.0    0.301644
2.0    0.104954
3.0    0.042124
NaN    0.009323
Name: household_adults, dtype: float64



Column: household_children


0.0    0.699143
1.0    0.118883
2.0    0.107238
3.0    0.065414
NaN    0.009323
Name: household_children, dtype: float64
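The per-column listings above can be condensed into one ranked table of NaN fractions, which makes columns such as `health_insurance` stand out at a glance. This is a minimal sketch on a toy frame; `missing_fraction` is a hypothetical helper, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Fraction of NaN values per column, largest first (hypothetical helper)."""
    return df.isna().mean().sort_values(ascending=False)

# toy frame for illustration only
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "health_worker": [0.0, 1.0, 0.0, np.nan],
    "sex": ["Female", "Male", "Female", "Male"],
})
fractions = missing_fraction(toy)
print(fractions)
```

Applied to the real data, this would be `missing_fraction(initial_features_df)`.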





Observations:
- `employment_industry` and `employment_occupation` have a large proportion of NaNs (~50%)
    - Most of these occur when `employment_status` does not equal 'Employed'

More info in [Data Dictionary](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/#labels) 

In [15]:
for col in object_cols:
    print(f"Column: {col}")
    display(initial_features_df[col].value_counts(normalize=True, dropna=False))
    print("\n")

Column: age_group


65+ Years        0.256225
55 - 64 Years    0.208297
45 - 54 Years    0.196128
18 - 34 Years    0.195267
35 - 44 Years    0.144082
Name: age_group, dtype: float64



Column: education


College Graduate    0.378066
Some College        0.263714
12 Years            0.217059
< 12 Years          0.088479
NaN                 0.052683
Name: education, dtype: float64



Column: race


White                0.794623
Black                0.079305
Hispanic             0.065713
Other or Multiple    0.060359
Name: race, dtype: float64



Column: sex


Female    0.593777
Male      0.406223
Name: sex, dtype: float64



Column: income_poverty


<= $75,000, Above Poverty    0.478414
> $75,000                    0.254989
NaN                          0.165612
Below Poverty                0.100985
Name: income_poverty, dtype: float64



Column: marital_status


Married        0.507545
Not Married    0.439735
NaN            0.052720
Name: marital_status, dtype: float64



Column: rent_or_own


Own     0.701539
Rent    0.222002
NaN     0.076459
Name: rent_or_own, dtype: float64



Column: employment_status


Employed              0.507732
Not in Labor Force    0.383083
NaN                   0.054780
Unemployed            0.054405
Name: employment_status, dtype: float64



Column: hhs_geo_region


lzgpxyit    0.160894
fpwskwrf    0.122253
qufhixun    0.116149
oxchjgsf    0.107051
kbazzjca    0.107013
bhuqouqj    0.106564
mlyzmhmf    0.083985
lrircsnp    0.077807
atmpeygn    0.076122
dqpwygqj    0.042161
Name: hhs_geo_region, dtype: float64



Column: census_msa


MSA, Not Principle  City    0.436028
MSA, Principle City         0.294455
Non-MSA                     0.269517
Name: census_msa, dtype: float64



Column: employment_industry


NaN         0.499120
fcxhlnwr    0.092410
wxleyezf    0.067548
ldnlellj    0.046093
pxcmvdjn    0.038829
atmlpfrs    0.034673
arjwrbjb    0.032613
xicduogh    0.031864
mfikgejo    0.022990
vjjrobsf    0.019733
rucpziij    0.019583
xqicxuve    0.019134
saaquncn    0.012656
cfqqtusy    0.012169
nduyfdeo    0.010709
mcubkhph    0.010297
wlfvacwt    0.008050
dotnnunm    0.007526
haxffmxo    0.005542
msuufmds    0.004643
phxvnwax    0.003332
qnlwzans    0.000487
Name: employment_industry, dtype: float64



Column: employment_occupation


NaN         0.504362
xtkaffoo    0.066574
mxkfnird    0.056502
emcorrxb    0.047553
cmhcxjea    0.046692
xgwztkwe    0.040514
hfxkjkmi    0.028682
qxajmpny    0.020519
xqwwgdyp    0.018160
kldqjyjy    0.017561
uqqtjvyb    0.016924
tfqavkke    0.014528
ukymxvdu    0.013929
vlluhbov    0.013255
oijqvulv    0.012881
ccgxvspp    0.012768
bxpfxfdn    0.012394
haliazsg    0.011083
rcertsgn    0.010334
xzmlyyjv    0.009286
dlvbwzss    0.008500
hodpvpew    0.007788
dcjcmpih    0.005542
pvmttkik    0.003669
Name: employment_occupation, dtype: float64





### Are missing values systematic?

***Can missing values in `employment_industry`, `employment_occupation`, and `health_insurance` be explained by current `employment_status`?***

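Most NaNs in `employment_industry` and `employment_occupation` occur when the respondent is not employed, but not all: about 1.4% and 2.4% of those rows, respectively, have `employment_status` equal to 'Employed'.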

In [16]:
# distribution of employment_status among rows where each column is missing
filt = initial_features_df['employment_industry'].isna()
display(initial_features_df.loc[filt, 'employment_status'].value_counts(normalize=True, dropna=False))

filt = initial_features_df['employment_occupation'].isna()
display(initial_features_df.loc[filt, 'employment_status'].value_counts(normalize=True, dropna=False))

Not in Labor Force    0.767517
NaN                   0.109752
Unemployed            0.109002
Employed              0.013728
Name: employment_status, dtype: float64

Not in Labor Force    0.759540
NaN                   0.108612
Unemployed            0.107869
Employed              0.023979
Name: employment_status, dtype: float64

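That pattern does not hold for `health_insurance`. Among rows where it is missing, the `employment_status` mix is close to the overall mix (for example, 48.6% 'Employed' versus 50.8% overall), so employment status does not explain the missingness. The main difference is that rows with missing `employment_status` are overrepresented (11.0% versus 5.5% overall).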

In [17]:
filt = initial_features_df['health_insurance'].isna()
display(initial_features_df.loc[filt, 'employment_status'].value_counts(normalize=True, dropna=False))

Employed              0.485742
Not in Labor Force    0.356037
NaN                   0.109989
Unemployed            0.048232
Name: employment_status, dtype: float64
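The cells above show the `employment_status` mix among rows that are missing a value. The reverse view, the share of missing values within each employment status, answers the same question more directly. The following is a sketch under that assumption, using a toy frame; `missing_rate_by_group` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, target_col, group_col):
    """Share of rows with NaN in target_col within each level of group_col.

    Rows where group_col itself is NaN are grouped under the label 'Missing'.
    """
    groups = df[group_col].fillna("Missing")
    return df[target_col].isna().groupby(groups).mean()

# toy frame for illustration only
toy = pd.DataFrame({
    "employment_status": ["Employed", "Employed", "Unemployed", "Unemployed", np.nan, np.nan],
    "health_insurance":  [1.0, np.nan, np.nan, 0.0, np.nan, np.nan],
})
rates = missing_rate_by_group(toy, "health_insurance", "employment_status")
print(rates)
```

If the rates are similar across groups on the real data, that supports the reading that missing `health_insurance` is not driven by employment status. Large differences would point the other way.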