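In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.model_selection import train_test_split

# Make scikit-learn transformers return pandas DataFrames
sklearn.set_config(transform_output='pandas')

FILE_PATH = "NHANES_hypertension.pkl"  # data file downloaded from Canvas
IT_IMP_SUBSET = 2000  # number of training rows used to fit the IterativeImputer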

# Assignment 3: Handling missing data

In this assignment, you will explore different methods for handling missing values in the development of prediction models. You will use a data set based on the National Health and Nutrition Examination Survey (NHANES) run by the Centers for Disease Control and Prevention (CDC) in the USA: https://www.cdc.gov/nchs/nhanes/index.htm.


## Goal

* Your goal is to predict hypertension (high blood pressure) from a number of subject covariates. 
* You will compare impute-then-regress classifiers to methods that handle missing values natively.

## Data

The NHANES data can be downloaded from the CDC, but it is spread over hundreds of CSV files. For your convenience, we have compiled a .pkl file containing a dataframe for this assignment. We cannot share it publicly on the web, so instead:

* For this assignment, you will need to download a .pkl data file from Canvas 
* Place the file with the name ```NHANES_hypertension.pkl``` in the same directory as this notebook
* The covariates are described in the file ```NHANES_hypertension.codes.txt```, also available on Canvas

In [None]:
D_full = pd.read_pickle(FILE_PATH)
D_full

## Problem 1 — Exploration & imputation

### Data exploration & setup

* The columns 'SEQN' and 'YEAR' represent the subject ID and year of survey, respectively. These should *not* be used as input for prediction.
* The outcome column $Y$ is called 'HYPERT'
* The columns 'BPXSY1', 'BPXDI1', 'BPXOSY1', 'BPXODI1' all measure blood pressure and are used to compute the outcome. These should *not* be used as input for prediction. 


1. Report the frequency of missing values in each input column

2. Remove columns with more than 50% missingness and report which columns you removed and which remain
* Variable types can be found in the property ```dtypes``` of the dataframe. Categorical variables have the type 'category'

3. Perform one-hot encoding of categorical variables such that missing values are encoded as a separate category. For example, if a binary variable ```Test``` has values 0 and 1 and missing values NaN, the categories should be ```Test_0```, ```Test_1``` and ```Test_NaN``` (the exact names are up to you)
4. Split your data into a training portion (80%) and a test portion (20%) with random_state=0
5. Fit a standard scaler to the numeric features in the training portion and apply that to both training and test sets
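
The steps above can be sketched on a small synthetic frame. This is only an illustration under stated assumptions, not the expected solution: the frame `toy` and its columns `age`, `bmi` and `smoker` are hypothetical stand-ins for the NHANES data, and `pd.get_dummies(..., dummy_na=True)` is one of several ways to give missing values their own category.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the NHANES data
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "bmi": rng.normal(27, 4, n),
    "smoker": pd.Categorical(rng.choice(["no", "yes"], n)),
})
toy.loc[rng.random(n) < 0.1, "bmi"] = np.nan
toy.loc[rng.random(n) < 0.2, "smoker"] = np.nan

# Step 1: fraction of missing values per column
missing_frac = toy.isna().mean()

# Step 2: keep only columns with at most 50% missing values
toy = toy[missing_frac[missing_frac <= 0.5].index]

# Step 3: one-hot encode categoricals; dummy_na=True adds a NaN indicator column
cat_cols = toy.select_dtypes("category").columns
num_cols = toy.columns.difference(cat_cols)
X = pd.get_dummies(toy, columns=list(cat_cols), dummy_na=True)

# Step 4: 80/20 split with a fixed seed
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# Step 5: fit the scaler on training rows only, then apply it to both parts.
# StandardScaler ignores NaN when fitting and keeps NaN when transforming.
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = np.asarray(scaler.transform(X_train[num_cols]))
X_test[num_cols] = np.asarray(scaler.transform(X_test[num_cols]))

print(missing_frac)
print(X_train.columns.tolist())
```

Fitting the scaler on the training portion only keeps test-set statistics out of the preprocessing.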

### Imputation

6. Fit a constant imputer for numeric (non-categorical) input variables on the training set, using each variable's *median* as the constant. Categorical variables should not be imputed since they are handled by the one-hot encoding. Impute missing values in both the training and test sets.

7. Fit an IterativeImputer (akin to MICE) using scikit-learn for numeric (non-category) variables. Impute missing values both in the training and test sets. (Don't use posterior sampling here. We do single, not multiple imputation in this assignment.)
* Since IterativeImputer is quite slow for large samples, fit the imputer to a subset of the training set of size ```IT_IMP_SUBSET```

8. Evaluate both imputation strategies. 
* Since the true values of missing entries are unknown, do this by: 
* Copying the test set into a new data frame
* Setting 5% of the *observed values* in the copy, selected uniformly at random, to NaN
* Imputing the copy with each imputer and reporting the MSE on the masked values, compared to the original observations, where each error is normalized by the standard deviation of the corresponding column in the training set.
* Are the results expected?
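
A minimal sketch of the imputation and masking evaluation described above, run on synthetic data. The columns `a`, `b`, `c`, the subset size of 200 and the helper `normalized_mse` are assumptions made for illustration; on the real data the subset size would be ```IT_IMP_SUBSET``` and only the numeric columns would be passed to the imputers.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
# Hypothetical numeric features: a and b are correlated, c is independent
full = pd.DataFrame({
    "a": z + 0.3 * rng.normal(size=n),
    "b": 2 * z + 0.3 * rng.normal(size=n),
    "c": rng.normal(size=n),
})
train, test = full.iloc[:400].copy(), full.iloc[400:].copy()
train = train.mask(rng.random(train.shape) < 0.1)  # missingness in training data

# Median imputer fit on all training rows; iterative imputer fit on a subset
median_imp = SimpleImputer(strategy="median").fit(train)
subset = train.sample(n=200, random_state=0)
iter_imp = IterativeImputer(sample_posterior=False, random_state=0).fit(subset)

# Hide exactly 5% of the observed test entries, chosen uniformly at random
obs_idx = np.flatnonzero(test.notna().to_numpy())
hide_idx = rng.choice(obs_idx, size=int(0.05 * obs_idx.size), replace=False)
hide = np.zeros(test.size, dtype=bool)
hide[hide_idx] = True
hide = hide.reshape(test.shape)
test_masked = test.mask(hide)

train_std = train.std().to_numpy()  # per-column scale from the training set


def normalized_mse(imputer):
    """MSE over the hidden entries, each error divided by the training std."""
    imputed = np.asarray(imputer.transform(test_masked))
    err = (imputed - test.to_numpy()) / train_std
    return float(np.mean(err[hide] ** 2))


mse_median = normalized_mse(median_imp)
mse_iter = normalized_mse(iter_imp)
print(f"median: {mse_median:.3f}  iterative: {mse_iter:.3f}")
```

Only entries that were observed in the original test set are hidden, so every error is measured against a known value.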

## Problem 2 — Predict hypertension

In this problem, you will use the imputed data sets for predicting hypertension with column name ```HYPERT``` using the other variables as input (excluding columns removed previously). You will compare classifiers fit to the imputed data sets, and classifiers that handle missing values natively. 


1. Fit LogisticRegression (LR) models to the two training data sets imputed with the constant and iterative imputer, respectively. 

2. Fit a [HistGradientBoostingClassifier](https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) (HGBC) to the *unimputed* data set with missing values. The HGBC handles missing values natively by learning default rules which are used when a missing value is encountered. 

3. Report the training and test set AUROC for all models. 

4. Is imputation better than native handling of missing values in this case? Can you tell from the results you already have? If not, perform an experiment to gather more evidence. Describe this experiment, run it, and give your conclusions below. 

## Problem 3 — Complete-case analysis

In this problem, you will compare classifiers fit only to complete cases to methods aimed at overcoming missing values. 

1. Fit complete-case classifiers (LR and HGBC) by dropping all rows with missing values in your selected input columns 

2. On the test set restricted to complete cases, compare the classifiers fit to the full data sets (imputed and with missing values) with the complete-case classifiers

3. What can you say about the performance of your complete-case models on the full population? Would the results from the complete-case subset transfer (for example, if the missing values in the full population were observed)? 

4. Under what conditions do results from complete-case analysis generally transfer to a population with missingness?

5. Investigate how the complete cases differ from the overall population. Can you see substantial differences in distribution?
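
As a hedged illustration of the last item, the sketch below compares complete cases with the full sample on synthetic data. The columns `age` and `bmi` and the missingness mechanism (older subjects are more likely to lack `bmi`) are assumptions chosen to make the shift visible; they are not properties of NHANES.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 15, n)
df = pd.DataFrame({"age": age, "bmi": rng.normal(27, 4, n)})

# Hypothetical mechanism: probability that bmi is missing increases with age
p_miss = 1 / (1 + np.exp(-(age - 60) / 5))
df.loc[rng.random(n) < p_miss, "bmi"] = np.nan

# Complete cases: rows without any missing value
complete = df.dropna()

# Per-column means in both groups, and their difference in units of the
# full-sample standard deviation
summary = pd.DataFrame({"all": df.mean(), "complete_cases": complete.mean()})
summary["std_diff"] = (summary["complete_cases"] - summary["all"]) / df.std()

print(f"complete cases: {len(complete)} of {len(df)} rows")
print(summary)
```

In this toy setup the complete cases are younger on average than the full sample, which is the kind of distribution shift the question asks you to look for.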