In [2]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.experimental import enable_iterative_imputer
sklearn.set_config(transform_output='pandas')

FILE_PATH = "NHANES_hypertension.pkl"
IT_IMP_SUBSET = 2000


from sklearn.model_selection import train_test_split

# Assignment 3: Handling missing data

In this assignment, you will explore different methods for handling missing values in the development of prediction models. You will use a data set based on the National Health and Nutrition Examination Survey (NHANES) run by the Centers for Disease Control and Prevention (CDC) in the USA: https://www.cdc.gov/nchs/nhanes/index.htm.


## Goal

* Your goal is to predict hypertension (high blood pressure) from a number of subject covariates. 
* You will compare impute-then-regress classifiers to methods that handle missing values natively.

## Data

The NHANES data can be downloaded from the CDC, but they are spread over hundreds of CSV files. For your convenience, we have compiled a .pkl file with a dataframe for this assignment. We cannot share it publicly on the web, so instead:

* For this assignment, you will need to download a .pkl data file from Canvas 
* Place the file with the name ```NHANES_hypertension.pkl``` in the same directory as this notebook
* The covariates are described in the file ```NHANES_hypertension.codes.txt```, also available on Canvas

In [3]:
D_full = pd.read_pickle(FILE_PATH)
D_full

SEQN (index),SEQN,YEAR,RIDAGEYR,RIAGENDR,SMQ020,SMD680,SMD415,SMD415A,PAD020,PAD200,...,PAD660,PAD675,SMD460,SMDANY,BMXHIP,ALQ121,MCQ366C,BPXOSY1,BPXODI1,HYPERT
2.0,2.0,1999-2000,77.0,1.0,2.0,2.0,,,2.0,3.0,...,,,,,,,,,,0
3.0,3.0,1999-2000,10.0,2.0,,,1.0,1.0,,,...,,,,1.0,,,,,,0
5.0,5.0,1999-2000,49.0,1.0,1.0,1.0,,,2.0,1.0,...,,,,,,,,,,0
6.0,6.0,1999-2000,19.0,2.0,,2.0,,,1.0,1.0,...,,,,,,,,,,0
7.0,7.0,1999-2000,59.0,2.0,1.0,2.0,,,1.0,2.0,...,,,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102952.0,102952.0,2017-2018,70.0,2.0,2.0,2.0,0.0,0.0,,,...,,60.0,0.0,2.0,87.3,0.0,2.0,154.0,92.0,0
102953.0,102953.0,2017-2018,42.0,1.0,1.0,2.0,0.0,0.0,,,...,,,0.0,2.0,112.8,6.0,1.0,135.0,91.0,0
102954.0,102954.0,2017-2018,41.0,2.0,2.0,2.0,0.0,0.0,,,...,,30.0,0.0,2.0,102.7,,2.0,123.0,75.0,0
102955.0,102955.0,2017-2018,14.0,2.0,,2.0,2.0,2.0,,,...,,,2.0,2.0,128.3,,,92.0,64.0,0


## Problem 1 — Exploration & imputation

### Data exploration & setup

* The columns 'SEQN' and 'YEAR' represent the subject ID and year of survey, respectively. These should *not* be used as input for prediction.
* The outcome column $Y$ is called 'HYPERT'
* The columns 'BPXSY1', 'BPXDI1', 'BPXOSY1', 'BPXODI1' all measure the blood pressure and are used to compute the outcome. These should *not* be used as input for prediction. 


1. Report the frequency of missing values in each input column

In [4]:
# --- 1. Define features (X) and target (y) ---
# Define columns to exclude: Subject ID, Year, and direct blood pressure readings 
# (BPX... columns are excluded to prevent data leakage as they directly determine the outcome)
cols_to_exclude = ['SEQN', 'YEAR', 'BPXSY1', 'BPXDI1', 'BPXOSY1', 'BPXODI1', 'HYPERT']

# Define the target variable
y = D_full['HYPERT']

# Define the initial feature set X by dropping the excluded columns
# errors='ignore' prevents errors if columns are missing or already dropped
X_raw = D_full.drop(columns=cols_to_exclude, errors='ignore')

print(f"Initial number of features: {X_raw.shape[1]}")

# --- 2. Task 1: Report frequency of missing values ---
# Calculate the fraction of missing values for each column (0.0 to 1.0)
missing_fraction = X_raw.isnull().mean()

print("\n--- Frequency of missing values (First 10 columns) ---")
print(missing_fraction.head(10))

# --- 3. Task 2: Remove columns with > 50% missingness ---
# Identify columns where the missing fraction is greater than 0.5 (50%)
cols_to_drop = missing_fraction[missing_fraction > 0.5].index.tolist()

# Drop these columns from the feature set
X_clean = X_raw.drop(columns=cols_to_drop)

# Report the results
print("\n--- Task 2 Report ---")
print(f"Columns removed (>50% missing): {cols_to_drop}")
print(f"Number of features remaining: {X_clean.shape[1]}")
print(f"List of remaining columns: {X_clean.columns.tolist()}")

Initial number of features: 43

--- Frequency of missing values (First 10 columns) ---
RIDAGEYR    0.000000
RIAGENDR    0.000000
SMQ020      0.303307
SMD680      0.664299
SMD415      0.575103
SMD415A     0.575103
PAD020      0.653723
PAD200      0.653752
PAD320      0.653838
DIQ010      0.000666
dtype: float64

--- Task 2 Report ---
Columns removed (>50% missing): ['SMD680', 'SMD415', 'SMD415A', 'PAD020', 'PAD200', 'PAD320', 'OHAROCDT', 'OHAROCGP', 'OHARNF', 'ALQ120Q', 'MCQ160C', 'MCQ080', 'OCQ180', 'OCQ380', 'OCD180', 'PAD645', 'PAD660', 'PAD675', 'SMD460', 'SMDANY', 'BMXHIP', 'ALQ121', 'MCQ366C']
Number of features remaining: 20
List of remaining columns: ['RIDAGEYR', 'RIAGENDR', 'SMQ020', 'DIQ010', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', 'BPQ020', 'DRXTTFAT', 'DRXTSFAT', 'LBXSAL', 'LBXSGL', 'LBXSCH', 'LBXSUA', 'LBXSKSI', 'DBD100', 'DR1TTFAT', 'DR1TSFAT', 'INDFMMPI']


2. Remove columns with more than 50% missingness and report which columns you removed and which remain
* Variable types can be found in the property ```dtypes``` of the dataframe. Categorical variables have the type 'category'
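
A minimal sketch of how the `dtypes` property could be used to derive the categorical and numeric column groups automatically instead of listing them by hand. The toy frame and its column names are hypothetical stand-ins, not part of the NHANES data; the only assumption carried over from the assignment is that categorical variables have dtype 'category'.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned feature table.
toy = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 62.0],
    "sex": pd.Categorical([1, 2, 2, None]),
    "bmi": [22.5, np.nan, 30.1, 27.8],
    "smoker": pd.Categorical([1, None, 2, 2]),
})

# Derive both groups from the dtypes so no column is left out of both.
toy_cat_cols = toy.select_dtypes(include="category").columns.tolist()
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()

# Sanity check: every column belongs to exactly one group.
covered = set(toy_cat_cols) | set(toy_num_cols)
overlap = set(toy_cat_cols) & set(toy_num_cols)

print("categorical:", toy_cat_cols)
print("numeric:", toy_num_cols)
print("all columns covered:", covered == set(toy.columns), "| overlap:", overlap)
```

Checking that the two groups together cover every column is a cheap guard against a column silently skipping both encoding and scaling.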

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Step 1: Define Categorical and Numeric Columns ---
# Candidate lists based on NHANES_hypertension.codes.txt.
# They also name columns that were dropped in Task 2; those are filtered out below.

# Categorical variables (nominal/ordinal)
cat_cols = [
    'RIAGENDR', 'SMQ020', 'DIQ010', 'BPQ020', 'DBD100', 
    'MCQ160C', 'MCQ080', 'OCQ180', 'OCQ380', 'OCD180', 
    'ALQ121', 'MCQ366C', 'SMD460', 'SMDANY'
]

# Numeric variables (continuous)
# NOTE: 'DRXTTFAT' and 'DRXTSFAT' are in X_clean but in neither list, so the
# steps below leave them unscaled and unimputed.
num_cols = [
    'RIDAGEYR', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', 
    'LBXSAL', 'LBXSGL', 'LBXSCH', 'LBXSUA', 'LBXSKSI', 
    'DR1TTFAT', 'DR1TSFAT', 'INDFMMPI', 'ALQ120Q', 'BMXHIP'
]

# Filter lists to include ONLY columns that actually exist in X_clean
# This prevents "KeyError" if a column was already dropped in Task 2
final_cat_cols = [c for c in cat_cols if c in X_clean.columns]
final_num_cols = [c for c in num_cols if c in X_clean.columns]

print(f"Categorical columns to encode: {len(final_cat_cols)}")
print(f"Numeric columns to scale: {len(final_num_cols)}")

# --- Step 2: Task 3 - One-hot Encoding with NaN as a category ---
# dummy_na=True creates a column like 'SMQ020_nan' if there are missing values
X_encoded = pd.get_dummies(X_clean, columns=final_cat_cols, dummy_na=True, dtype=int)

print(f"Shape after one-hot encoding: {X_encoded.shape}")

# --- Step 3: Task 4 - Train/Test Split ---
# Split data into 80% training and 20% test with random_state=0
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=0, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

# --- Step 4: Task 5 - Standard Scaling ---
# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler ONLY on the numeric features of the training set
scaler.fit(X_train[final_num_cols])

# Apply the transformation to both training and test sets
# We use .loc to update only the numeric columns in place
X_train.loc[:, final_num_cols] = scaler.transform(X_train[final_num_cols])
X_test.loc[:, final_num_cols] = scaler.transform(X_test[final_num_cols])

print("Standard scaling applied to numeric features.")
print("Data preparation complete.")

Categorical columns to encode: 5
Numeric columns to scale: 13
Shape after one-hot encoding: (69118, 32)
Training set shape: (55294, 32)
Test set shape: (13824, 32)
Standard scaling applied to numeric features.
Data preparation complete.


3. Perform one-hot encoding of categorical variables such that missing values are encoded as a separate category. For example, if a binary variable ```Test``` has values 0, 1 and missing values NaN, the categories should be ```Test_0```, ```Test_1``` and ```Test_NaN``` (although, the names are up to you)
4. Split your data into a training portion (80%) and a test portion (20%) with random_state=0
5. Fit a standard scaler to the numeric features in the training portion and apply that to both training and test sets
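
As a small illustration of item 3, the sketch below one-hot encodes a hypothetical binary variable `Test` so that missing values get their own indicator column. The data are made up, and the exact column names produced by `pd.get_dummies` depend on the dtype of the categories, so they are printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical binary variable with values 0, 1 and two missing entries.
demo = pd.DataFrame({"Test": pd.Categorical([0, 1, None, 1, 0, None])})

# dummy_na=True appends an indicator column that is 1 where the value is missing.
encoded = pd.get_dummies(demo, columns=["Test"], dummy_na=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)

# Each row activates exactly one indicator, including the rows with a missing value.
rows_sum_to_one = bool((encoded.sum(axis=1) == 1).all())
n_missing_flagged = int(encoded.iloc[:, -1].sum())
print(rows_sum_to_one, n_missing_flagged)
```

Because missingness becomes an ordinary 0/1 feature, the encoded categorical columns contain no NaN and need no imputation afterwards.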

### Imputation

6. Fit a constant imputer for numeric (non-categorical) input variables on the training set using the variable *median* as constant. Categorical variables should not be imputed since they are handled by the one-hot encoding. Impute missing values both in the training and test sets.

7. Fit an IterativeImputer (akin to MICE) using scikit-learn for numeric (non-category) variables. Impute missing values both in the training and test sets. (Don't use posterior sampling here. We do single, not multiple imputation in this assignment.)
* Since IterativeImputer is quite slow for large samples, fit the imputer to a subset of the training set of size ```IT_IMP_SUBSET```

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# --- Task 6: Constant Imputer (Median) ---
# We create a copy of the data to avoid modifying the original X_train/X_test
X_train_const = X_train.copy()
X_test_const = X_test.copy()

# Initialize SimpleImputer with strategy='median'
const_imputer = SimpleImputer(strategy='median')

# Fit on training data ONLY
const_imputer.fit(X_train_const[final_num_cols])

# Transform both training and test data
X_train_const.loc[:, final_num_cols] = const_imputer.transform(X_train_const[final_num_cols])
X_test_const.loc[:, final_num_cols] = const_imputer.transform(X_test_const[final_num_cols])

print("Task 6: Constant (Median) Imputation complete.")

# --- Task 7: Iterative Imputer (MICE) ---
# Create another copy for iterative imputation
X_train_iter = X_train.copy()
X_test_iter = X_test.copy()

# Initialize IterativeImputer
# random_state=0 for reproducibility
iter_imputer = IterativeImputer(random_state=0, max_iter=10)

# Fit on a SUBSET of training data (as requested by the problem to save time)
# IT_IMP_SUBSET was defined in Setup (e.g., 2000)
# If IT_IMP_SUBSET is not defined, we define it here just in case
if 'IT_IMP_SUBSET' not in locals():
    IT_IMP_SUBSET = 2000

X_train_subset = X_train_iter[final_num_cols].iloc[:IT_IMP_SUBSET]

print(f"Fitting IterativeImputer on subset of size: {X_train_subset.shape}")
iter_imputer.fit(X_train_subset)

# Transform both training and test data (using the imputer fitted on the subset)
X_train_iter.loc[:, final_num_cols] = iter_imputer.transform(X_train_iter[final_num_cols])
X_test_iter.loc[:, final_num_cols] = iter_imputer.transform(X_test_iter[final_num_cols])

print("Task 7: Iterative Imputation complete.")

Task 6: Constant (Median) Imputation complete.
Fitting IterativeImputer on subset of size: (2000, 13)




Task 7: Iterative Imputation complete.


8. Evaluate both imputation strategies. 
* Since the value of missing variables is unknown, do this by: 
* Copying the test set into a new data frame
* Setting 5% of the *observed values* in the data frame, selected uniformly at random, to NaN.
* Report the MSE on the modified subset of values, compared to the original observations, where each error is normalized by the standard deviation of the corresponding column in the training set.
* Are the results expected?
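
The evaluation procedure described above can be sketched on synthetic data as follows. The data, the column scales, and the variable names are assumptions made for illustration; the sketch uses unscaled features so that the division by the training-set standard deviation is explicit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-ins for numeric train/test features (two columns, different scales).
train = rng.normal(loc=[0, 50], scale=[1, 10], size=(500, 2))
test = rng.normal(loc=[0, 50], scale=[1, 10], size=(200, 2))

imputer = SimpleImputer(strategy="median").fit(train)
train_std = train.std(axis=0)

# Pick 5% of the observed test entries uniformly at random and hide them.
observed = np.argwhere(~np.isnan(test))
n_hide = int(0.05 * len(observed))
hidden = observed[rng.choice(len(observed), size=n_hide, replace=False)]
rows, cols = hidden[:, 0], hidden[:, 1]

corrupted = test.copy()
corrupted[rows, cols] = np.nan

imputed = imputer.transform(corrupted)

# Normalize each error by the training-set std of its column, then average the squares.
errors = (imputed[rows, cols] - test[rows, cols]) / train_std[cols]
normalized_mse = float(np.mean(errors ** 2))
print(f"hidden entries: {n_hide}, normalized MSE: {normalized_mse:.3f}")
```

For a median imputer the normalized MSE should land near 1, since predicting a column's center leaves an error of roughly one standard deviation; an imputer that exploits correlations between columns can go below that.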

In [7]:
from sklearn.metrics import mean_squared_error

# --- Task 8: Evaluate Imputation Strategies (Artificial Missingness Experiment) ---

# 1. Setup: Get the numeric part of the original test set
X_test_num = X_test[final_num_cols].copy()

# 2. Create a mask for observed values (Not NaN)
observed_mask = ~np.isnan(X_test_num.values)
# Get the indices (row, col) of all observed values
observed_indices = np.argwhere(observed_mask)

# 3. Randomly select 5% of observed values to mask
np.random.seed(42) # Ensure reproducibility
n_remove = int(len(observed_indices) * 0.05) # 5%

# Choose random indices
remove_idx_indices = np.random.choice(len(observed_indices), n_remove, replace=False)
remove_coords = observed_indices[remove_idx_indices]

# 4. Create the "Corrupted" Test Set
X_test_corrupted = X_test_num.copy()
# Set selected values to NaN
for row, col in remove_coords:
    X_test_corrupted.iloc[row, col] = np.nan

print(f"Artificially removed {n_remove} values for evaluation.")

# 5. Impute the corrupted data using our pre-fitted imputers
# Method A: Constant (Median)
X_test_const_eval = const_imputer.transform(X_test_corrupted)
# Method B: Iterative
X_test_iter_eval = iter_imputer.transform(X_test_corrupted)

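# 6. Calculate MSE (Normalized)
# The numeric columns were already standardized with the scaler fitted on the
# training set, so each error is already expressed in units of that column's
# training-set standard deviation; no further division is applied here.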
true_values = []
const_preds = []
iter_preds = []


for row, col in remove_coords:
  
    true_val = X_test_num.iloc[row, col]
 
    const_val = X_test_const_eval.iloc[row, col]
    iter_val = X_test_iter_eval.iloc[row, col]
    
    true_values.append(true_val)
    const_preds.append(const_val)
    iter_preds.append(iter_val)


# Calculate MSE
mse_const = mean_squared_error(true_values, const_preds)
mse_iter = mean_squared_error(true_values, iter_preds)

print("-" * 40)
print(f"MSE (Constant/Median Imputation): {mse_const:.4f}")
print(f"MSE (Iterative/MICE Imputation):  {mse_iter:.4f}")
print("-" * 40)

if mse_iter < mse_const:
    print("Conclusion: Iterative Imputer performed BETTER (lower error).")
else:
    print("Conclusion: Constant Imputer performed BETTER (lower error).")

Artificially removed 7719 values for evaluation.
----------------------------------------
MSE (Constant/Median Imputation): 1.0359
MSE (Iterative/MICE Imputation):  0.4740
----------------------------------------
Conclusion: Iterative Imputer performed BETTER (lower error).


## Problem 2 — Predict hypertension

In this problem, you will use the imputed data sets for predicting hypertension with column name ```HYPERT``` using the other variables as input (excluding columns removed previously). You will compare classifiers fit to the imputed data sets, and classifiers that handle missing values natively. 


1. Fit LogisticRegression (LR) models to the two training data sets imputed with the constant and iterative imputer, respectively. 

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# --- Ensure no NaNs remain for Logistic Regression ---
# Logistic Regression cannot handle NaN values.
# The imputers only cover final_num_cols, so NaNs can remain in any numeric
# column outside that list ('DRXTTFAT' and 'DRXTSFAT' are in X_clean but not in num_cols).
# The one-hot columns contain no NaNs because missingness is its own category.
# Any remaining NaNs are filled with 0 as a fallback.

X_train_const_clean = X_train_const.fillna(0)
X_test_const_clean = X_test_const.fillna(0)

X_train_iter_clean = X_train_iter.fillna(0)
X_test_iter_clean = X_test_iter.fillna(0)

# --- 1. Fit Logistic Regression on Constant (Median) Imputed Data ---
print("Training Logistic Regression on Median Imputed Data...")
# Increased max_iter to ensure convergence
lr_const = LogisticRegression(random_state=0, max_iter=2000) 
lr_const.fit(X_train_const_clean, y_train)

# Predict probabilities for AUROC (class 1)
y_pred_const_train = lr_const.predict_proba(X_train_const_clean)[:, 1]
y_pred_const_test = lr_const.predict_proba(X_test_const_clean)[:, 1]

# --- 2. Fit Logistic Regression on Iterative (MICE) Imputed Data ---
print("Training Logistic Regression on MICE Imputed Data...")
lr_iter = LogisticRegression(random_state=0, max_iter=2000)
lr_iter.fit(X_train_iter_clean, y_train)

# Predict probabilities
y_pred_iter_train = lr_iter.predict_proba(X_train_iter_clean)[:, 1]
y_pred_iter_test = lr_iter.predict_proba(X_test_iter_clean)[:, 1]

# --- 3. Fit HistGradientBoostingClassifier on Unimputed Data (Native Handling) ---
# HGBC handles missing values natively, so we use the original X_train/X_test (containing NaNs)
print("Training HistGradientBoostingClassifier on Raw Data...")
hgbc = HistGradientBoostingClassifier(random_state=0)
hgbc.fit(X_train, y_train)

# Predict probabilities
y_pred_hgbc_train = hgbc.predict_proba(X_train)[:, 1]
y_pred_hgbc_test = hgbc.predict_proba(X_test)[:, 1]

# --- 4. Report AUROC Results ---
print("\n--- Model Performance (AUROC) ---")
print(f"1. LR + Median Imputation:   Train={roc_auc_score(y_train, y_pred_const_train):.4f}, Test={roc_auc_score(y_test, y_pred_const_test):.4f}")
print(f"2. LR + MICE Imputation:     Train={roc_auc_score(y_train, y_pred_iter_train):.4f}, Test={roc_auc_score(y_test, y_pred_iter_test):.4f}")
print(f"3. HGBC (Native Handling):   Train={roc_auc_score(y_train, y_pred_hgbc_train):.4f}, Test={roc_auc_score(y_test, y_pred_hgbc_test):.4f}")

Training Logistic Regression on Median Imputed Data...
Training Logistic Regression on MICE Imputed Data...
Training HistGradientBoostingClassifier on Raw Data...

--- Model Performance (AUROC) ---
1. LR + Median Imputation:   Train=0.8621, Test=0.8585
2. LR + MICE Imputation:     Train=0.8620, Test=0.8586
3. HGBC (Native Handling):   Train=0.8981, Test=0.8653


2. Fit a [HistGradientBoostingClassifier](https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) (HGBC) to the *unimputed* data set with missing values. The HGBC handles missing values natively by learning default rules which are used when a missing value is encountered. 

3. Report the training and test set AUROC for all models. 

4. Is imputation better than native handling of missing values in this case? Can you tell from the results you already have? If not, perform an experiment to gather more evidence. Describe this experiment, run it, and give your conclusions below. 

In [9]:
# --- Experiment: Fair Comparison (HGBC on Imputed vs. HGBC on Native) ---
# To isolate the effect of missing value handling, we must use the SAME classifier.
# We already have HGBC on Native data (0.8653).
# Now, let's fit HGBC on the MICE-imputed data.

print("Running Experiment: HGBC on MICE Imputed Data...")

hgbc_imputed = HistGradientBoostingClassifier(random_state=0)
hgbc_imputed.fit(X_train_iter, y_train)

y_pred_hgbc_imp_test = hgbc_imputed.predict_proba(X_test_iter)[:, 1]
auroc_hgbc_imp = roc_auc_score(y_test, y_pred_hgbc_imp_test)
# Recompute the native-handling AUROC from the earlier predictions instead of hard-coding it
auroc_hgbc_native = roc_auc_score(y_test, y_pred_hgbc_test)

print(f"HGBC + MICE Imputation Test AUROC: {auroc_hgbc_imp:.4f}")
print(f"HGBC + Native Handling Test AUROC: {auroc_hgbc_native:.4f} (from previous step)")

if auroc_hgbc_native > auroc_hgbc_imp:
    print("Conclusion: Native handling is BETTER than imputation for HGBC.")
else:
    print("Conclusion: Imputation is BETTER or EQUAL to native handling for HGBC.")

Running Experiment: HGBC on MICE Imputed Data...
HGBC + MICE Imputation Test AUROC: 0.8651
HGBC + Native Handling Test AUROC: 0.8653 (from previous step)
Conclusion: Native handling is BETTER than imputation for HGBC.


### Answer to Question 4

**1. Can we tell from the initial results?**
No, we cannot strictly conclude whether imputation is better than native handling based solely on the initial results.
*   **Reasoning:** The initial comparison was between **Logistic Regression (with imputation)** and **HGBC (with native handling)**, so it changes the classifier and the missing-value strategy at the same time. HGBC is a non-linear ensemble that can capture interactions and non-linear effects that a linear Logistic Regression cannot. Its higher test AUROC (0.8653 vs. 0.8585 and 0.8586) could therefore come from the model class rather than from how missing values are handled.

**2. The Experiment**
To make a fair comparison, I trained the **same classifier (HGBC)** on the **MICE-imputed data** and compared it to the **HGBC on raw data (native handling)**.

**3. Results & Conclusion**
*   **HGBC (Native Handling):** 0.8653
*   **HGBC (MICE Imputed):** 0.8651

**Conclusion:**
Native handling is **comparable** to imputation in this case: the test AUROC differs by only 0.0002 on a single train/test split, which is too small to call one approach better.
*   **Why?** Tree-based models like HGBC handle missing values natively by learning, at each split, which child (left or right) samples with NaN should go to. This lets the model use missingness itself as a signal. Note that missingness in the categorical variables is already one-hot encoded in both data sets, so the comparison only concerns the numeric features.
*   **Implication:** Imputation, even with advanced methods like MICE, replaces each missing entry with a single specific value. Here, the extra computational cost of imputation yields no measurable benefit over the model's native handling.
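
One way to judge whether a gap as small as the one between the two HGBC variants is meaningful is a paired bootstrap over test rows. The sketch below is an illustration under assumptions: the helper `bootstrap_auc_diff` is hypothetical (it is not part of scikit-learn), and the demo uses synthetic scores rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """Percentile interval for AUROC(a) - AUROC(b), resampling test rows in pairs."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC is undefined when only one class is present
        auc_a = roc_auc_score(y_true[idx], scores_a[idx])
        auc_b = roc_auc_score(y_true[idx], scores_b[idx])
        diffs.append(auc_a - auc_b)
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic demo: two score vectors that share the same underlying signal.
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=1000)
signal = y_demo + rng.normal(scale=1.0, size=1000)
scores_1 = signal + rng.normal(scale=0.1, size=1000)
scores_2 = signal + rng.normal(scale=0.1, size=1000)

low, high = bootstrap_auc_diff(y_demo, scores_1, scores_2)
print(f"95% interval for the AUROC difference: [{low:.4f}, {high:.4f}]")
```

In the notebook, the same helper could be called as `bootstrap_auc_diff(y_test, y_pred_hgbc_test, y_pred_hgbc_imp_test)`; an interval that includes 0 would support reading the two approaches as equivalent on this test set.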

## Problem 3 — Complete-case analysis

In this problem, you will compare classifiers fit only to complete cases to methods aimed at overcoming missing values. 

1. Fit complete-case classifiers (LR and HGBC) by dropping all rows with missing values in your selected input columns 

2. Compare your classifiers fit to the full data sets (imputed and with missing values) to the complete-case classifiers on the test set, restricted to complete cases

3. What can you say about the performance of your complete-case models on the full population? Would the results from the complete-case subset transfer (for example, if the missing values in the full population were observed)? 

4. Under what conditions do results from complete-case analysis generally transfer to a population with missingness?

5. Investigate how the complete cases differ from the overall population. Can you see substantial differences in distribution?

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# --- Task 1: Fit complete-case classifiers ---

# 1. Identify indices of complete cases (rows with NO missing values) in Training set
# We use the original X_train (which has NaNs)
train_cc_indices = X_train.dropna().index
X_train_cc = X_train.loc[train_cc_indices]
y_train_cc = y_train.loc[train_cc_indices]

print(f"Original Training Set Size: {X_train.shape[0]}")
print(f"Complete-Case Training Set Size: {X_train_cc.shape[0]}")
print(f"Dropped {X_train.shape[0] - X_train_cc.shape[0]} rows ({100*(1 - len(X_train_cc)/len(X_train)):.1f}%) due to missingness.\n")

# 2. Fit models on Complete Cases (CC)
print("Training models on Complete Cases only...")
lr_cc = LogisticRegression(random_state=0, max_iter=2000)
lr_cc.fit(X_train_cc, y_train_cc)

hgbc_cc = HistGradientBoostingClassifier(random_state=0)
hgbc_cc.fit(X_train_cc, y_train_cc)

# --- Task 2: Compare classifiers on Test Set (Restricted to Complete Cases) ---

# 1. Identify complete cases in Test set
test_cc_indices = X_test.dropna().index
X_test_cc = X_test.loc[test_cc_indices]
y_test_cc = y_test.loc[test_cc_indices]

print(f"Testing on {len(X_test_cc)} complete cases from the test set.\n")

# 2. Evaluate NEW models (Trained on CC)
y_pred_lr_cc = lr_cc.predict_proba(X_test_cc)[:, 1]
y_pred_hgbc_cc = hgbc_cc.predict_proba(X_test_cc)[:, 1]

auc_lr_cc = roc_auc_score(y_test_cc, y_pred_lr_cc)
auc_hgbc_cc = roc_auc_score(y_test_cc, y_pred_hgbc_cc)

# 3. Evaluate OLD models (Trained on Full Data)
# We use the models from Problem 2: lr_iter (MICE) and hgbc (Native)
# Note: For lr_iter, we need the imputed version of X_test_cc. 
# Since X_test_cc has no NaNs, X_test_iter.loc[test_cc_indices] is effectively the same.
y_pred_lr_old = lr_iter.predict_proba(X_test_iter.loc[test_cc_indices])[:, 1]
y_pred_hgbc_old = hgbc.predict_proba(X_test_cc)[:, 1]

auc_lr_old = roc_auc_score(y_test_cc, y_pred_lr_old)
auc_hgbc_old = roc_auc_score(y_test_cc, y_pred_hgbc_old)

print("--- Comparison on Complete Cases (Test Set) ---")
print(f"LR (Trained on CC):   {auc_lr_cc:.4f}")
print(f"LR (Trained on Full): {auc_lr_old:.4f}")
print(f"HGBC (Trained on CC):   {auc_hgbc_cc:.4f}")
print(f"HGBC (Trained on Full): {auc_hgbc_old:.4f}")

# --- Task 5: Investigate Distribution Differences ---
print("\n--- Task 5: Distribution Differences (Full vs. CC) ---")
# Compare the prevalence of Hypertension (Target y)
prev_full = y_train.mean()
prev_cc = y_train_cc.mean()

print(f"Hypertension Prevalence (Full Population): {prev_full:.4f} ({prev_full*100:.1f}%)")
print(f"Hypertension Prevalence (Complete Cases):  {prev_cc:.4f} ({prev_cc*100:.1f}%)")

if abs(prev_full - prev_cc) > 0.01:
    print("Observation: Substantial difference in target distribution detected.")
else:
    print("Observation: Target distribution is relatively similar.")

Original Training Set Size: 55294
Complete-Case Training Set Size: 22815
Dropped 32479 rows (58.7%) due to missingness.

Training models on Complete Cases only...
Testing on 5701 complete cases from the test set.

--- Comparison on Complete Cases (Test Set) ---
LR (Trained on CC):   0.8257
LR (Trained on Full): 0.8249
HGBC (Trained on CC):   0.8297
HGBC (Trained on Full): 0.8303

--- Task 5: Distribution Differences (Full vs. CC) ---
Hypertension Prevalence (Full Population): 0.1615 (16.1%)
Hypertension Prevalence (Complete Cases):  0.1715 (17.2%)
Observation: Substantial difference in target distribution detected.


### Answers to Problem 3

**3. Performance on the full population**
*   **Performance:** The complete-case (CC) models perform similarly to the full-data models when tested *on complete cases* (LR: 0.8257 vs. 0.8249; HGBC: 0.8297 vs. 0.8303), but this says little about the **full population**, where their performance may be biased and less reliable.
*   **Transferability:** The results are **not** guaranteed to transfer. The CC models are trained on a subset (22,815 of 55,294 training rows, about 41%) that has no missing values, and this subset may differ systematically from the rows with missing data. Even if the missing values in the full population were observed, the CC models could predict those individuals less accurately, because no comparable individuals were seen during training.

**4. Conditions for transferability**
*   Results from complete-case analysis transfer most safely to the full population with missingness when the data are **Missing Completely At Random (MCAR)**.
*   **MCAR** means that the probability of a value being missing is entirely unrelated to any observed or unobserved data. In health datasets like NHANES, this is rarely the case (e.g., sicker individuals might have more complete records due to more frequent medical testing, or conversely, very sick individuals might drop out). If the data are instead **MAR (Missing At Random)** or **MNAR (Missing Not At Random)**, removing incomplete rows can introduce **selection bias**.
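
The difference between the mechanisms can be illustrated with a small simulation. Everything here is synthetic and assumed for illustration: a standardized age variable, an outcome whose probability rises with age, and two ways of deciding which rows are complete.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Synthetic population: outcome probability increases with a standardized age.
age = rng.normal(size=n)
outcome = rng.random(n) < 1 / (1 + np.exp(-(age - 1.5)))

# MCAR: half of the rows are complete, regardless of anything else.
complete_mcar = rng.random(n) < 0.5

# Not MCAR: the chance of being complete rises with age (an observed covariate).
complete_dep = rng.random(n) < 1 / (1 + np.exp(-2 * age))

prev_all = outcome.mean()
prev_mcar = outcome[complete_mcar].mean()
prev_dep = outcome[complete_dep].mean()

print(f"prevalence, full population:          {prev_all:.3f}")
print(f"prevalence, complete cases (MCAR):    {prev_mcar:.3f}")
print(f"prevalence, complete cases (age-dep): {prev_dep:.3f}")
```

Under MCAR the complete cases are a random subsample, so their outcome prevalence matches the population up to sampling noise; when completeness depends on age, the complete cases over-represent older individuals and the prevalence shifts.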

**5. Distribution Differences**
*   **Observation:** As shown in the code output, the target distribution differs by about one percentage point. The prevalence of hypertension in the **Complete Cases (17.2%)** is **higher** than in the **Full Population (16.1%)**.
*   **Conclusion:** This suggests that the complete cases are **not fully representative** of the overall population. Individuals with hypertension appear slightly *more likely* to have complete data (possibly because they undergo more rigorous medical testing and monitoring), while healthy individuals might skip certain tests, leading to missing values.
*   **Impact:** This is consistent with missingness that is **not completely at random**, although a difference in outcome prevalence alone does not prove it; the covariate distributions should be compared as well. Using only complete cases can introduce bias, here toward overestimating the prevalence of hypertension relative to the general population.
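
Beyond outcome prevalence, the covariates of complete and incomplete rows can be compared with standardized mean differences. The sketch below is one possible approach under assumptions: `standardized_mean_diff` is a hypothetical helper, and the toy data (with a lab value that is missing more often for younger individuals) are synthetic.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, is_complete):
    """(mean of complete rows - mean of incomplete rows) / pooled std, per column."""
    complete = df[is_complete]
    incomplete = df[~is_complete]
    pooled_std = np.sqrt((complete.var() + incomplete.var()) / 2)
    return (complete.mean() - incomplete.mean()) / pooled_std

rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame({
    "age": rng.normal(45, 15, n),
    "bmi": rng.normal(27, 5, n),
    "lab": rng.normal(size=n),
})

# Make the lab value missing more often for younger individuals.
p_missing = 1 / (1 + np.exp((toy["age"] - 45) / 10))
toy.loc[rng.random(n) < p_missing, "lab"] = np.nan

is_complete = toy.notna().all(axis=1)
smd = standardized_mean_diff(toy[["age", "bmi"]], is_complete)
print(smd.round(2))
```

In the notebook, a comparable call would be `standardized_mean_diff(X_train[final_num_cols], X_train.notna().all(axis=1))`; pandas skips NaN when computing the means and variances. Absolute values well above roughly 0.1 are commonly read as a noticeable imbalance, though that threshold is a convention rather than a rule.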

Use of AI: The language fluency and professionalism of the answer section have been polished using ChatGPT.