# NHANES - Impute missing values

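This notebook imputes missing values in the variables of interest: `household-income` and `income-to-poverty-ratio` are filled with their medians, `transformed_ferritin` and `tfr` are imputed with k-NN, and missing `months-postpartum` values are set to 0.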

**NOTE: Document code to explain what is happening at each level of the KNN, why 5 neighbors, what's happening with scale, fit, transform, etc.**

## Import packages

In [1]:
import pandas as pd
import numpy as np

# For missing value imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

## Load data

In [2]:
nhanes = pd.read_csv('data/nhanes.csv')

## Create model

### Model-based imputation with k-NN

The k-Nearest Neighbors (k-NN) method fills each missing 'transformed_ferritin' or 'tfr' value from the most similar respondents. Similarity is measured across the columns passed to the imputer: 'transformed_ferritin', 'tfr', 'race-ethnicity', 'household-income', and 'income-to-poverty-ratio'.

> - Preparation: Before imputation, all predictor variables used in the k-NN algorithm must be put on a comparable scale. Here each column is min-max scaled to the range [0, 1], so that a variable with a large numeric range does not dominate the distance calculation.
> - Choosing k: The choice of k (the number of neighbors) matters. A smaller k makes the imputation sensitive to noise, while a larger k can smooth out local variation too much. This notebook uses k = 5, which is also the scikit-learn default; cross-validation can help in selecting a better value.
> - Imputation: Perform the k-NN imputation, using the available variables as predictors to fill the missing 'transformed_ferritin' and 'tfr' values.
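
One way to probe the choice of k is to hide some values that are actually known, impute them, and compare the result with the truth. The sketch below does this on synthetic data rather than on the NHANES columns; the data, the helper name `masked_rmse`, and the candidate values of k are all assumptions made for illustration, and a single hold-out mask like this is a simplification of full cross-validation.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the predictors: 200 rows, 3 correlated columns.
base = rng.normal(size=(200, 1))
X = np.hstack([base + rng.normal(scale=0.3, size=(200, 1)) for _ in range(3)])
X_scaled = MinMaxScaler().fit_transform(X)

# Hide 20 known values in column 0 so the imputations can be scored.
mask_rows = rng.choice(200, size=20, replace=False)
X_masked = X_scaled.copy()
X_masked[mask_rows, 0] = np.nan


def masked_rmse(k):
    """RMSE between imputed and true values at the hidden positions (hypothetical helper)."""
    imputed = KNNImputer(n_neighbors=k).fit_transform(X_masked)
    err = imputed[mask_rows, 0] - X_scaled[mask_rows, 0]
    return float(np.sqrt(np.mean(err ** 2)))


scores = {k: masked_rmse(k) for k in (1, 3, 5, 10, 20)}
best_k = min(scores, key=scores.get)
print(scores)
print("lowest masked RMSE at k =", best_k)
```

Repeating the masking with several random seeds and averaging the scores would give a more stable picture than a single run.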

### Selecting columns for the imputation model

In [3]:
# Fill missing household income with the column median so this predictor is complete
nhanes['household-income'] = nhanes['household-income'].fillna(nhanes['household-income'].median())


In [4]:
# Same median fill for the income-to-poverty ratio
nhanes['income-to-poverty-ratio'] = nhanes['income-to-poverty-ratio'].fillna(nhanes['income-to-poverty-ratio'].median())


In [5]:
nhanes.set_index('SEQN', inplace=True)

In [6]:
imputation_features = nhanes[['transformed_ferritin',
                              'tfr',
                              'race-ethnicity', 
                              'household-income', 
                              'income-to-poverty-ratio']]

In [7]:
imputation_features

SEQN,transformed_ferritin,tfr,race-ethnicity,household-income,income-to-poverty-ratio
31131.0,102.546098,4.60,4.0,8.0,4.65
31152.0,24.754231,3.50,1.0,8.0,1.76
31153.0,40.878436,3.60,5.0,8.0,1.03
31156.0,103.683064,3.80,4.0,8.0,1.19
31160.0,29.370646,3.10,3.0,8.0,1.91
...,...,...,...,...,...
102923.0,155.000000,,3.0,4.0,0.95
102933.0,31.000000,2.39,4.0,12.0,1.66
102935.0,,,3.0,14.0,0.85
102948.0,64.900000,2.05,3.0,15.0,5.00


### Scale features

In [8]:
# Feature scaling to prepare for k-NN: MinMaxScaler maps each column to [0, 1]
scaler = MinMaxScaler()

In [9]:
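# fit learns each column's min and max; transform rescales. Keep the SEQN index on the result.
imputation_features_scaled = pd.DataFrame(scaler.fit_transform(imputation_features),
                                          columns=imputation_features.columns,
                                          index=imputation_features.index)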
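
To make the scaling step concrete, here is a minimal sketch on a made-up two-column frame (the column names `a` and `b` and their values are illustrative assumptions, not NHANES data). It shows that `MinMaxScaler` ignores NaN when learning each column's minimum and maximum and leaves NaN in place when transforming, which is why scaling can run before imputation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame({"a": [10.0, 20.0, np.nan, 30.0],
                    "b": [1.0, np.nan, 3.0, 5.0]})

toy_scaler = MinMaxScaler()
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy),
                          columns=toy.columns, index=toy.index)

# Per-column minimum and maximum learned from the non-missing values.
print(toy_scaler.data_min_, toy_scaler.data_max_)

# Each observed value becomes (x - min) / (max - min); NaN stays NaN.
print(toy_scaled)
```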

In [10]:
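# Set up k-NN imputation: each missing value is filled with the mean of that
# column over the 5 nearest respondents (n_neighbors=5 is the scikit-learn default)
knn_imputer = KNNImputer(n_neighbors=5)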
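
The mechanics of `KNNImputer` can be followed by hand on a tiny array. The sketch below uses invented values and `n_neighbors=2` so the arithmetic stays readable; it relies on the default settings (`weights="uniform"`, `metric="nan_euclidean"`), under which distances are computed only over features that both rows have observed, and the missing entry is replaced by the plain mean of its nearest donors.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy, already-scaled data; the last row is missing its second feature.
X_toy = np.array([
    [0.00, 0.0],
    [0.10, 0.2],
    [0.20, 0.4],
    [0.90, 1.0],
    [0.12, np.nan],
])

toy_imputer = KNNImputer(n_neighbors=2)

# fit stores the data; transform finds neighbours and fills the gaps.
filled = toy_imputer.fit_transform(X_toy)

# On the shared first feature, the closest rows to 0.12 are 0.10 and 0.20,
# so the missing entry becomes the mean of their second feature: (0.2 + 0.4) / 2.
print(filled[4, 1])
```

Observed entries pass through unchanged; only the NaN positions are filled.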

In [11]:
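# Snapshot of the unimputed columns for reference; `imputed_data` is
# overwritten by the imputer output further down
imputed_data = nhanes[['transformed_ferritin', 'tfr', 'race-ethnicity',
                       'household-income',
                       'income-to-poverty-ratio']].copy()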


In [12]:
imputed_data

SEQN,transformed_ferritin,tfr,race-ethnicity,household-income,income-to-poverty-ratio
31131.0,102.546098,4.60,4.0,8.0,4.65
31152.0,24.754231,3.50,1.0,8.0,1.76
31153.0,40.878436,3.60,5.0,8.0,1.03
31156.0,103.683064,3.80,4.0,8.0,1.19
31160.0,29.370646,3.10,3.0,8.0,1.91
...,...,...,...,...,...
102923.0,155.000000,,3.0,4.0,0.95
102933.0,31.000000,2.39,4.0,12.0,1.66
102935.0,,,3.0,14.0,0.85
102948.0,64.900000,2.05,3.0,15.0,5.00


In [13]:
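# fit stores the scaled data; transform fills every NaN and returns a NumPy array
# (column names and the SEQN index are restored in the next cell)
imputed_data = knn_imputer.fit_transform(imputation_features_scaled)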

In [14]:
imputed_data_df = pd.DataFrame(imputed_data, columns=imputation_features.columns, 
                               index=imputation_features.index)

In [15]:
imputed_data_df

SEQN,transformed_ferritin,tfr,race-ethnicity,household-income,income-to-poverty-ratio
31131.0,0.054718,0.069030,0.75,0.071429,0.930
31152.0,0.012784,0.048507,0.00,0.071429,0.352
31153.0,0.021476,0.050373,1.00,0.071429,0.206
31156.0,0.055331,0.054104,0.75,0.071429,0.238
31160.0,0.015272,0.041045,0.50,0.071429,0.382
...,...,...,...,...,...
102923.0,0.082994,0.040373,0.50,0.030612,0.190
102933.0,0.016150,0.027799,0.75,0.112245,0.332
102935.0,0.028064,0.029813,0.50,0.132653,0.170
102948.0,0.034425,0.021455,0.50,0.142857,1.000


In [16]:
# Reverse the min-max scaling so every column, including the imputed
# 'transformed_ferritin' and 'tfr' values, is back on its original scale
imputed_data_df = pd.DataFrame(scaler.inverse_transform(imputed_data_df),
                               columns=imputation_features.columns,
                               index=imputation_features.index)
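
The scale, impute, inverse-scale round trip can be checked on a small example. The sketch below uses invented numbers (three columns, the third fully observed) and `n_neighbors=2`; it is an illustration under those assumptions, not the notebook's data. It confirms that observed values come back unchanged and that imputed values fall inside each column's observed range, since they are averages of observed values.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data on very different scales, with one gap in each of the first two columns.
raw = np.array([
    [100.0, 4.0, 1.0],
    [20.0, 3.0, 2.0],
    [70.0, np.nan, 3.0],
    [np.nan, 2.0, 4.0],
    [40.0, 2.5, 5.0],
])

toy_scaler = MinMaxScaler()
scaled = toy_scaler.fit_transform(raw)                  # NaN is preserved
filled_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
restored = toy_scaler.inverse_transform(filled_scaled)  # back to original units

observed = ~np.isnan(raw)
print(np.allclose(restored[observed], raw[observed]))   # observed values survive the round trip
print(restored[2, 1], restored[3, 0])                    # the two imputed entries
```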

In [17]:
imputed_data_df

SEQN,transformed_ferritin,tfr,race-ethnicity,household-income,income-to-poverty-ratio
31131.0,102.546098,4.600,4.0,8.0,4.65
31152.0,24.754231,3.500,1.0,8.0,1.76
31153.0,40.878436,3.600,5.0,8.0,1.03
31156.0,103.683064,3.800,4.0,8.0,1.19
31160.0,29.370646,3.100,3.0,8.0,1.91
...,...,...,...,...,...
102923.0,155.000000,3.064,3.0,4.0,0.95
102933.0,31.000000,2.390,4.0,12.0,1.66
102935.0,53.100900,2.498,3.0,14.0,0.85
102948.0,64.900000,2.050,3.0,15.0,5.00


In [18]:
print(nhanes.index.is_unique)
print(imputed_data_df.index.is_unique)

True
True


In [19]:
imputed_data_df['transformed_ferritin'].isnull().sum()

0

In [20]:
imputed_data_df['tfr'].isnull().sum()

0

In [21]:
print(nhanes.index.equals(imputed_data_df.index))

True


In [22]:
nhanes = nhanes.join(imputed_data_df[['transformed_ferritin','tfr']], 
                     how='left', rsuffix='_imputed')

In [23]:
nhanes['transformed_ferritin_imputed'].isnull().sum()

0

In [24]:
nhanes['tfr_imputed'].isnull().sum()

0

In [25]:
# Displaying a summary to verify the imputation
nhanes[['transformed_ferritin', 'transformed_ferritin_imputed']].describe()

statistic,transformed_ferritin,transformed_ferritin_imputed
count,5711.0,6107.0
mean,56.933053,56.922288
std,62.163773,60.497235
min,1.04,1.04
25%,21.285638,22.442494
50%,40.878436,42.027027
75%,72.5173,72.2
max,1856.1043,1856.1043


In [26]:
nhanes[['tfr', 'tfr_imputed']].describe()

statistic,tfr,tfr_imputed
count,5684.0,6107.0
mean,3.681332,3.692469
std,2.296492,2.238591
min,0.9,0.9
25%,2.5,2.55
50%,3.1,3.174
75%,4.0625,4.1
max,54.5,54.5
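
Beyond comparing summary statistics, a more direct check is that imputation changed only the rows that were missing. The sketch below shows the idea on a four-row toy frame that reuses the `tfr` and `tfr_imputed` column names; the values are invented for illustration, and the variable names `was_missing`, `n_imputed`, and `unchanged` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing 'tfr' value that the imputer has filled in.
toy = pd.DataFrame({
    "tfr": [4.6, 3.5, np.nan, 2.05],
    "tfr_imputed": [4.6, 3.5, 3.064, 2.05],
})

was_missing = toy["tfr"].isna()

# Number of values the imputer had to supply.
n_imputed = int(was_missing.sum())

# Rows that were observed should be identical in both columns.
unchanged = np.allclose(toy.loc[~was_missing, "tfr"],
                        toy.loc[~was_missing, "tfr_imputed"])

print(n_imputed, unchanged)
```

Applied to the real frame, `n_imputed` should equal the difference between the two `count` rows in the summaries above.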


In [27]:
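# Set missing months-postpartum to 0 (assumes a missing value means the respondent is not postpartum)
nhanes['months-postpartum'] = nhanes['months-postpartum'].fillna(0)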

In [28]:
nhanes['months-postpartum'].isnull().sum()

0

## Save csv

In [29]:
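# SEQN was moved into the index earlier, so write the index out;
# index=False would drop the respondent IDs. Note this overwrites the input file.
nhanes.to_csv('data/nhanes.csv')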