# Multivariate Imputation on the Heart Disease Prediction Data Set

#### by Fuat Akal


## Table of Content

[Problem](#problem)  
[Loading Libraries](#loading_libraries)  
[Data Preparation](#data_preparation)   
[Imputation](#imputation)   
[Discussion](#discussion)   
[References](#references)   


## Problem <a class="anchor" id="problem"></a>

For various reasons, many real world datasets from the healthcare domain contain missing values, often encoded as blanks, NaNs or other placeholders (e.g., Not checked). Honestly, this is understandable. Because, such data is collected in a clinic environment, which is very busy and stressful. On the other hand, scikit-learn estimators assume that all values in a dataset ( I mean a numpy array) are numerical, and that all have and hold meaning. 

A simple way to handle missing data is to remove rows or columns with missing values. This approach, however, may result in data lost. 

One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer) [1].

In this small data science project, we will explore our alternatives for completing (impute) missing data.

## Loading Libraries <a class="anchor" id="loading_libraries"></a>

In [126]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import cross_val_score

import pandas as pd
import numpy as np

# Configure Constants
seed = 42 # ultimate answer to everything
pd.options.display.max_columns = None
pd.options.display.max_rows = None # default 60

columnsToImpute = ["Cholesterol", "MaxHR"]

## Data Preparation<a class="anchor" id="data_preparation"></a>

The Heart Disease Prediction Data Set is available at Kaggle [2].

I downloaded and put it under a local folder for convenience.

In [107]:
# Retrieve data from a local folder
df = pd.read_csv("data/heart.csv")

# Display top 5 rows
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [108]:
# Check the dimensions of the data
df.shape

(918, 12)

In [109]:
# I will impute two columns. But, I will use one of them for comparisons.
imputedDF = pd.DataFrame()
imputedDF.insert(0, 'original', df[columnsToImpute[1]])

In [110]:
# The data is complete. So, we have to remove some to demonstrate imputation.
# Let's pick a continuous value: MaxHR
# We will remove 5% of the values on that columns
df.loc[df.sample(frac=0.05).index, columnsToImpute[1]] = np.nan

# I noticed there are too many zeros on the Cholesterol column.
# Missing values were probably encoded as zeros.
# I will also impute them. But, first will convert them to np.NANs.
# Not a necessary step. I just feel like it.
df['Cholesterol'] = df['Cholesterol'].replace({0:np.nan})

In [111]:
# Let's see if it worked. 
# Cholesterol: 172 missing values. Nearly 19% of 918.
# MaxHR: 46 missing values. 5% of 918.
df.isnull().sum()

Age                 0
Sex                 0
ChestPainType       0
RestingBP           0
Cholesterol       172
FastingBS           0
RestingECG          0
MaxHR              46
ExerciseAngina      0
Oldpeak             0
ST_Slope            0
HeartDisease        0
dtype: int64

In [112]:
# There are categorical columns in the data set.
# We should convert them to numbers.
df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol       float64
FastingBS           int64
RestingECG         object
MaxHR             float64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

In [113]:
# Create a list of categorical columns
categoricalColumns = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]

In [114]:
# Create dummies for categorical columns
dfDummied = pd.get_dummies(df, columns=categoricalColumns)

## Imputation<a class="anchor" id="imputation"></a>

In [115]:
# Create a list of strategies.
# Actually, there is another one called "constant" that I do not consider here.
# Because, I do not have any expertise about what that constant might be.
# Iterative means multivariate imputation.
strategies = ['drop', 'mean', 'median', 'most_frequent', 'iterative']

In [116]:
# At this part, we will impute data by using different imputation strategies
# and then, run a classifier on the imputed data to find training scores.
# Our goal is to determine which strategy will work best for the classification.

print("Cros-validation scores for different imputation strategies\n")
print("Strategy          Score   Std.Dev.")
print("-----------------------------------")

results = []

imputedDF.insert(0, 'missing', df[columnsToImpute[1]])

imputedList.append(dfDummied[columnsToImpute[1]])

# Perform imputation for each strategy and 
# apply random forest classification to imputed data.
for s in strategies:
        
    # We randomly picked this classifier
    rf = RandomForestClassifier()
    
    dfTemp = dfDummied.copy()

    if s == 'drop':
        # We will not impute
        # We will remove rows with missing values instead.
        dfTemp = dfTemp.dropna(axis=0)
    elif s == 'iterative':
        # Means multivariate imputation
        imp = IterativeImputer(missing_values=np.NaN, random_state=seed)
        dfTemp = pd.DataFrame(imp.fit_transform(dfTemp), columns=dfTemp.columns)
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])

    else:
        # Set the strategy for unşvariate imputation
        imp = SimpleImputer(missing_values=np.NaN, strategy=s)
        dfTemp[columnsToImpute] = imp.fit_transform(dfTemp[columnsToImpute])
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])
    
    
    # Create independent and dependant variables sets
    X, y = dfTemp.values[:, :11], dfTemp.values[:, 11]
    
    # Let's perform a 10-fold cross validation
    scores = cross_val_score(rf, X, y, cv=10)
    
    # store the scores in a list
    results.append(scores)
    
    print('%-15s %7.3f  %7.3f' % (s, np.mean(scores), np.std(scores)))

Cros-validation scores for different imputation strategies

Strategy          Score   Std.Dev.
-----------------------------------
drop              0.935    0.035
mean              0.950    0.027
median            0.943    0.025
most_frequent     0.945    0.029
iterative         0.949    0.027


There is no surprise for the drop column case. The less data you have, worse results you get.

On the hand, when we compare the RF classification results, we see that the multivariate (iterative) imputing did not achieve better than its univariate rivals, either.

So, I can not say that multivariate is better than univariate, yet. We should dig deeper.

In [117]:
imputedDF.to_csv("data/imputedDF.csv")

In [122]:
# Let us see how imputation worked.
# Remember, these are original and imputed values for the MaxHR column.
# The first line reads as: The original value was 125. It was randomly set to np.NAN and then imputed.
# The iterative imputer returned the closest value, i.e., 123.389351.
imputedDF[imputedDF['missing'].isnull()].head()

Unnamed: 0,iterative,most_frequent,median,mean,missing,original
18,123.389351,150.0,138.0,136.508028,,125
25,159.560506,150.0,138.0,136.508028,,178
32,130.549229,150.0,138.0,136.508028,,122
38,152.985189,150.0,138.0,136.508028,,148
40,149.873344,150.0,138.0,136.508028,,130


In [138]:
# Let us compare how much the imputed values deviate from the original value
# We can use mean squared error for instance.
# Hmmm, the iterative imputing seemed to have done a better job.

print("Strategy         Mean Squared Error (MSE)")
print("-----------------------------------------")

for c in imputedDF.columns:
    if c not in ('missing', 'original'):
        print('%-15s %15.2f' % (c, np.round(mean_squared_error(imputedDF['original'], imputedDF[c]), 2)))

Strategy         Mean Squared Error (MSE)
-----------------------------------------
iterative                 18.12
most_frequent             30.03
median                    28.26
mean                      29.04


## Discussion<a class="anchor" id="discussion"></a>

Honestly, I can not say for sure that which imputation method (univariate vs. multivariate) worked well for this dataset. When I compared the predictions of a randomly picked Rf model, the imputation seems to have no effect. Results are all close. This may due to the Rf algorithm, which is ok with missing data. 

However, when I look at the imputation results, MSE is the lowest for the multivariate imputation. In other words, multivariate imputation suggest values which are more close the original values.

## References<a class="anchor" id="references"></a>

1. [Imputation of missing values, Scikit Learn](https://scikit-learn.org/stable/modules/impute.html#id4)
2. [Heart Failure Prediction Dataset](https://www.kaggle.com/fedesoriano/heart-failure-prediction).
3. [Statistical Imputation for Missing Values in Machine Learning](https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/).



**Disclaimer!** This notebook is available for educational purposes only. There is no guarantee on the correctness of the content provided.

If you think there is any copyright violation, please let me [know](https://forms.gle/BNNRB2kR8ZHVEREq8). 
