# Univariate Imputation on the Heart Disease Prediction Data Set

#### by Fuat Akal


## Table of Content

[Problem](#problem)  
[Loading Libraries](#loading_libraries)  
[Data Preparation](#data_preparation)   
[Imputation](#imputation)   
[Discussion](#discussion)   
[References](#references)   


## Problem <a class="anchor" id="problem"></a>

For various reasons, many real world datasets from the healthcare domain contain missing values, often encoded as blanks, NaNs or other placeholders (e.g., Not checked). Honestly, this is understandable. Because, such data is collected in a clinic environment, which is very busy and stressful. On the other hand, scikit-learn estimators assume that all values in a dataset ( I mean a numpy array) are numerical, and that all have and hold meaning. 

A simple way to handle missing data is to remove rows or columns with missing values. This approach, however, may result in data lost. 

In this small data science project, we will explore our alternatives for completing (impute) missing data.

## Loading Libraries <a class="anchor" id="loading_libraries"></a>

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

import pandas as pd
import numpy as np

# Configure Constants
seed = 42 # ultimate answer to everything
pd.options.display.max_columns = None
pd.options.display.max_rows = None # default 60

columnToImpute = "Cholesterol"

## Data Preparation<a class="anchor" id="data_preparation"></a>

The Heart Disease Prediction Data Set is available at Kaggle [1].

It is downloaded and put under a local folder for convenience.

In [2]:
# Retrieve data from a local folder
df = pd.read_csv("data/heart.csv")

# Display top 5 rows
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [103]:
# Check the dimensions of the data
df.shape

(918, 12)

In [3]:
# Let's see if there is any missing value in the dataset.
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

I was going to create 5% missing value on the Cholesterol column. Actually, I did and then performed my analyses. However, I later discovered that the dataset is not really complete. There are just too many zeros on the Cholesterol column. So, I decided to impute zeros directly.

In [165]:
# The data is complete. So, we have to remove some to demonstrate imputation.
# Let's pick a continuous value: Cholesterol
# We will remove 5% of the values on that column.
# df.loc[df.sample(frac=0.05).index, columnToImpute] = np.nan

In [4]:
# I will consider zero"s pn the Cholesterol column as missing
# Missing value rate = 172/918 = 0.187
df['Cholesterol'].value_counts().head()

0      172
254     11
223     10
220     10
211      9
Name: Cholesterol, dtype: int64

In [6]:
# Create a list of strategies.
# Actually, there is another one called "constant" that I do not consider here.
# Because, I do not have any expertise about what that constant might be.
strategies = ['drop', 'mean', 'median', 'most_frequent']

In [7]:
# There are categorical columns in the data set.
# We should convert them to numbers.
df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

In [8]:
# Create a list of categorical columns
categoricalColumns = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]

In [9]:
# Create dummies for categorical columns
dfDummied = pd.get_dummies(df, columns=categoricalColumns)

## Imputation<a class="anchor" id="imputation"></a>

In [11]:
# At this part, we will impute data by using different imputation strategies
# and then, run a classifier on the imputed data to find training scores.
# Our goal is to determine which strategy will work best for the classification.

print("Cros-validation scores for different imputation strategies\n")
print("Strategy          Score   Std.Dev.")
print("-----------------------------------")

results = []

# Perform imputation for each strategy and 
# apply random forest classification to imputed data.
for s in strategies:
    
    
    # We randomly picked this classifier
    rf = RandomForestClassifier()
    
    dfTemp = dfDummied.copy()

    if s == 'drop':
        # We will not impute
        # We will remove rows with missing values instead.
        dfTemp = dfTemp.dropna(axis=0)
    else:
        # Set the strategy
        imp = SimpleImputer(missing_values=0, strategy=s)
        dfTemp[columnToImpute] = imp.fit_transform(dfTemp[[columnToImpute]])
    
    # Create independent and dependant variables sets
    X, y = dfTemp.values[:, :11], dfTemp.values[:, 11]
    
    # Let's perform a 10-fold cross validation
    scores = cross_val_score(rf, X, y, cv=10)
    
    # store the scores in a list
    results.append(scores)
    
    print('%-15s %7.3f  %7.3f' % (s, np.mean(scores), np.std(scores)))

Cros-validation scores for different imputation strategies

Strategy          Score   Std.Dev.
-----------------------------------
drop              0.941    0.030
mean              0.945    0.027
median            0.943    0.028
most_frequent     0.944    0.024


In [176]:
# Let's do the entire thing by using a bigger missing value rate, e.g. 25%.

## Discussion<a class="anchor" id="discussion"></a>

Honestly, I can not say for sure that the univariate imputation worked well for this dataset and for the selected column when the missing value rate is relatively low (18% in my experiments). 

However, I guess I can say that imputation is likely to work better than dropping rows when the missing value rate is getting higher. Of course, if that value is too much high, it may be better to consider dropping that column.

I do not know. There are just too many possibilities to try. Depending on the dataset's characteristics, mean and most_frequent strategies seem more likely to work for me. 

On the other hand, most of the cases, the univariate imputation is a no go for me. After all, if a patient's cholesterol was not measured, it makes no sense to put a random number instead.

## References<a class="anchor" id="references"></a>

1. [Heart Failure Prediction Dataset](https://www.kaggle.com/fedesoriano/heart-failure-prediction).
2. [Statistical Imputation for Missing Values in Machine Learning](https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/).



**Disclaimer!** This notebook is available for educational purposes only. There is no guarantee on the correctness of the content provided.

If you think there is any copyright violation, please let me [know](https://forms.gle/BNNRB2kR8ZHVEREq8). 
