<a href="https://colab.research.google.com/github/university-of-southampton-ai-society/notebooks/blob/master/Workshop_(24_11_19)_Missing_data_imputation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Some background info**
### **Types of missingness**
There are 3 types of data missingness: MCAR (Missing Completely at Random), MAR (Missing at Random), and NMAR (Not Missing at Random, also written MNAR). More specifically:


1.   **MCAR** - the missingness is completely random. It is related neither to the observed features in the dataset nor to the missing value itself. Think of it as missingness that can be generated with a typical random function, e.g. *np.random.randint()*
2.   **MAR** - here the missingness depends only on values that we can observe in the dataset, not on the missing value itself.
3.   **NMAR** - the missingness is related to the value that is missing, or to some variable that is not recorded in the dataset. For example, imagine a simple dataset with two features, salary and job title. Salary (Feature 1) might sometimes be missing for a CEO or CTO, because they tend to earn more than the average person and don't feel comfortable disclosing their wage. The missingness is driven by the salary itself, which is exactly the value we cannot see.

Most imputation techniques work decently with MAR and MCAR but tend to fail badly with NMAR.
Have a look here for more information about these 3 types of missingness: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
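
To make the three mechanisms concrete, here is a minimal sketch that simulates each of them on a toy dataset. Everything in it is an assumption made for illustration: the `seniority` and `salary` variables, the probabilities, and the thresholds are invented and are not part of the workshop data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: seniority is always observed, salary rises with it.
seniority = rng.uniform(0, 10, size=n)
salary = 30_000 + 5_000 * seniority + rng.normal(0, 5_000, size=n)

# MCAR: every salary has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing salary depends only on the observed seniority.
mar_mask = rng.random(n) < np.where(seniority > 7, 0.8, 0.1)

# NMAR: the chance of a missing salary depends on the salary itself.
nmar_mask = rng.random(n) < np.where(salary > 70_000, 0.8, 0.1)


def observed_mean(values, mask):
    """Mean of the values that remain once the masked entries are removed."""
    return values[~mask].mean()


print(f"True mean salary:    {salary.mean():.0f}")
print(f"Observed under MCAR: {observed_mean(salary, mcar_mask):.0f}")
print(f"Observed under MAR:  {observed_mean(salary, mar_mask):.0f}")
print(f"Observed under NMAR: {observed_mean(salary, nmar_mask):.0f}")
```

Under MCAR the observed mean should stay close to the true one. Under MAR and NMAR it drifts downwards, but only in the MAR case is the cause (seniority) still visible in the data for an imputation model to use.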

### **A few useful imputation algorithms**

*   **Removing the instances with missing values** - it is not really an imputation technique, but it might be a good alternative if only a few instances in the dataset have missing values.

*   **Mean/Mode imputation** - Quite a popular approach. For numerical features, impute the missing value with the mean, and for categorical ones impute it with the mode (i.e. the most frequent label in the feature). The drawback of this method is that it might hurt the model quite badly: it ignores the fact that some features can be correlated (and you can use that information to make more accurate imputations), and it also reduces the variance in the data, which might bias the model and skew the results. In this article I elaborate on it a bit more: https://towardsdatascience.com/why-using-a-mean-for-missing-data-is-a-bad-idea-alternative-imputation-algorithms-837c731c1008?source=friends_link&sk=429f2f54105cf63a4df61de693ea144b


*   **K-NN imputation** - This imputation technique uses the K-Nearest Neighbors algorithm to impute the data. Basically, for a missing value in a Feature F1, the algorithm searches for the K nearest neighbors (the "closeness" of a neighbor is based on some metric, e.g. squared difference). Once it has found K nearest neighbors that don't have Feature F1 missing, it computes the mean (mode for categorical variables) of F1 over those K neighbors and uses it to replace the missing value. It is much better than a plain mean/mode imputation as it doesn't ignore feature correlations. It requires more computation power and time, though.

*   **MICE** (Multiple Imputation by Chained Equations) - Consider it a framework-like algorithm for missing data imputation. It works by filling in the missing data multiple times with some underlying algorithm, e.g. LinearRegression or SVM. More info here: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

*   **MissForest** - uses RandomForest as the underlying algorithm to impute data. The good thing about it is that it innately handles both categorical and continuous features. Actually, you can say that MissForest is MICE with a RandomForest Regressor and a RandomForest Classifier as the underlying algorithms for missing data imputation.

*   **Honorable mentions** - there are many more imputation algorithms, each with different pros and cons. Quite interesting work has been done around Deep Learning architectures for missing data imputation, e.g. GANs: https://towardsdatascience.com/gans-and-missing-data-imputation-815a0cbc4ece?source=friends_link&sk=3c6170ab92c564d6d77fc558879c1b35
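
The K-NN idea described in the list above can be sketched in a few lines of NumPy. This is a simplified illustration for continuous features only, not the implementation used by any of the libraries mentioned in this notebook. The function name `knn_impute` and the choice of a mean squared difference over the features two rows share are assumptions made for the example.

```python
import numpy as np


def knn_impute(X, k=3):
    """Fill NaNs with the mean of the k nearest rows that have the feature observed."""
    X = np.asarray(X, dtype=float)
    X_filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        # Mean squared difference over the features that both rows have observed.
        shared = ~missing[i] & ~missing
        sq_diff = np.where(shared, (X - X[i]) ** 2, 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        valid = n_shared > 0
        dist[valid] = sq_diff.sum(axis=1)[valid] / n_shared[valid]
        dist[i] = np.inf  # a row is not its own neighbour
        for j in np.flatnonzero(missing[i]):
            # Only rows that have feature j observed can donate a value.
            candidates = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            if candidates.size == 0:
                continue  # no donor available, leave the NaN in place
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            X_filled[i, j] = X[nearest, j].mean()
    return X_filled


X_demo = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 1.8],
    [5.0, 9.0],
])
print(knn_impute(X_demo, k=2))
```

In the demo, the second row borrows its missing value from the two rows closest to it on the first feature, so the distant fourth row has no influence. For categorical features the mean would be replaced by the mode of the neighbours.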
# **Some problems that I've encountered**
Often, data is heterogeneous (i.e. it contains both continuous features like temperature or height and categorical features like gender or age group). A lot of the imputation libraries that are available online fail pretty badly if the dataset is heterogeneous. For some reason a lot of people just assumed that data is either continuous or categorical. I wonder why...
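
As a baseline for such mixed data, here is a minimal sketch of mean/mode imputation with pandas. The helper name `impute_mean_mode` and the demo columns are made up for illustration, and the sketch assumes every column has at least one observed value. It carries the usual drawbacks of mean/mode imputation, since it ignores correlations between features.

```python
import numpy as np
import pandas as pd


def impute_mean_mode(df):
    """Fill numeric columns with their mean and all other columns with their mode."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            fill_value = filled[col].mean()  # NaNs are skipped by default
        else:
            fill_value = filled[col].mode().iloc[0]  # most frequent label
        filled[col] = filled[col].fillna(fill_value)
    return filled


# Hypothetical heterogeneous dataset with gaps in every column.
df_demo = pd.DataFrame({
    "temperature": [20.5, np.nan, 23.0, 21.5],
    "height": [170.0, 182.0, np.nan, 165.0],
    "gender": ["F", "M", None, "F"],
})
print(impute_mean_mode(df_demo))
```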

# **Plan of action**
We will test how accurate imputations are for different imputation algorithms.
More specifically, we will take the following approach to evaluate each algorithm:
*    Load dataset (Iris) and randomly remove part of the data (MCAR)
*    Save the removed data to compare it with imputed data later on
*    Use an imputation algorithm to impute data
*    Compute the loss between the true value and imputed value in the dataset
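
The four steps above can be sketched end to end as follows. To keep the example self-contained it uses a randomly generated array as a stand-in for the real dataset, and plain column means as a stand-in for the imputation algorithm. Both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the real dataset: 150 rows, 4 continuous features.
X_true = rng.normal(loc=[5.8, 3.0, 3.8, 1.2], scale=[0.8, 0.4, 1.8, 0.8], size=(150, 4))

# 1. Remove about 30% of the entries completely at random (MCAR).
mask = rng.random(X_true.shape) < 0.3
X_missing = X_true.copy()
X_missing[mask] = np.nan

# 2. X_true and mask together record exactly what was removed.

# 3. Impute: column means stand in for any imputation algorithm here.
col_means = np.nanmean(X_missing, axis=0)
X_imputed = np.where(mask, col_means, X_missing)

# 4. Score only the entries that were actually removed.
mae = np.abs(X_true[mask] - X_imputed[mask]).mean()
print(f"MAE on removed entries: {mae:.3f}")
```

Scoring only the removed entries keeps the untouched values, which trivially have zero error, from diluting the result.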

# **Some imputation libraries in Python**
*    **fancyimpute**
*    **impyute**
*    **missingpy**

In [21]:
import pandas as pd
import numpy as np 
from sklearn.datasets import load_iris
from copy import deepcopy #to create a copy of the dataset

!pip install impyute
!pip install missingpy
import impyute as imp # we'll use it for KNN, and MICE
import missingpy as miss # we'll use it for MissForest

Collecting missingpy
  Downloading https://files.pythonhosted.org/packages/b5/be/998d04d27054b58f0974b5f09f8457778a0a72d4355e0b7ae877b6cfb850/missingpy-0.2.0-py3-none-any.whl (49kB)
Installing collected packages: missingpy
Successfully installed missingpy-0.2.0


In [0]:
# Load iris data
X, y = load_iris(return_X_y=True)
# We will use only X because y is categorical and there are
# not that many libraries supporting heterogeneous data
# Create a copy of X to validate with imputations later on
X_copy = deepcopy(X)

In [0]:
# Function to introduce missingness (MCAR) into the dataset
def introduce_randomness(data, missingness_threshold):
  # Note: modifies `data` in place and returns the same array
  row_cnt, col_cnt = data.shape
  threshold = missingness_threshold * 100
  for i in range(row_cnt):
    for j in range(col_cnt):
      rand_val = np.random.randint(100)  # random integer in [0, 100)
      # Remove the value with probability `missingness_threshold`
      if rand_val < threshold:
        data[i, j] = np.nan
  return data

In [0]:
X_missing = introduce_randomness(X, 0.5) # 50 % of data should be missing

In [35]:
# Let's print it to see if it works
print(X_missing[:10])

[[5.1 nan 1.4 nan]
 [nan 3.  nan 0.2]
 [nan nan nan nan]
 [4.6 3.1 nan nan]
 [nan 3.6 1.4 nan]
 [5.4 nan nan nan]
 [4.6 nan 1.4 nan]
 [5.  nan 1.5 nan]
 [4.4 nan nan nan]
 [nan 3.1 nan 0.1]]


In [0]:
X_imputed = imp.mean(deepcopy(X_missing)) #Imputing with mean

In [53]:
# Let's have a look at quality of imputations
print(f"Imputed data: \n{X_imputed[:10]}\n")
# Let's have a look at statistical properties of true and imputed data
from scipy import stats
print(f"Imputed data: \n{stats.describe(X_imputed)}\n")
print(f"True data: \n{stats.describe(X_copy)}\n")

# We can also measure the error between imputed and true values.
# Note: this MAE covers every entry, including the ones that were never removed.
from sklearn.metrics import mean_absolute_error
for i in range(X_copy.shape[1]):
  print(f"MAE for feature {i}: {mean_absolute_error(X_copy[:, i], X_imputed[:, i]):.2f}")

Imputed data: 
[[5.1        3.06029412 1.4        1.25769231]
 [5.84347826 3.         3.75492958 0.2       ]
 [5.84347826 3.06029412 3.75492958 1.25769231]
 [4.6        3.1        3.75492958 1.25769231]
 [5.84347826 3.6        1.4        1.25769231]
 [5.4        3.06029412 3.75492958 1.25769231]
 [4.6        3.06029412 1.4        1.25769231]
 [5.         3.06029412 1.5        1.25769231]
 [4.4        3.06029412 3.75492958 1.25769231]
 [5.84347826 3.1        3.75492958 0.1       ]]

Imputed data: 
DescribeResult(nobs=150, minmax=(array([4.4, 2. , 1. , 0.1]), array([7.7, 4.4, 6.7, 2.5])), mean=array([5.84347826, 3.06029412, 3.75492958, 1.25769231]), variance=array([0.33805077, 0.09424694, 1.42681728, 0.28624419]), skewness=array([ 0.39592721,  0.38960195, -0.48072615, -0.35491982]), kurtosis=array([1.92947112, 4.28455599, 0.41128983, 0.41339875]))

True data: 
DescribeResult(nobs=150, minmax=(array([4.3, 2. , 1. , 0.1]), array([7.9, 4.4, 6.9, 2.5])), mean=array([5.84333333, 3.05733333, 3

# Your Turn
**Have a look at the different imputation algorithms from impyute and missingpy and try imputing the data as above. Check how MAE changes, together with statistical properties, e.g. mean and variance.**
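
To make the comparison easier, here is a small helper sketch that reports, per feature, the MAE on the removed entries together with the shift in mean and the ratio of variances. The name `compare_imputation` is made up for this example. In the notebook it could be called as `compare_imputation(X_copy, X_missing, X_imputed)`, while the demo below uses random data and mean imputation so that it runs on its own.

```python
import numpy as np


def compare_imputation(X_true, X_missing, X_imputed):
    """Per-feature MAE on the removed entries, plus mean shift and variance ratio."""
    removed = np.isnan(X_missing)
    report = []
    for j in range(X_true.shape[1]):
        rows = removed[:, j]
        mae = np.abs(X_true[rows, j] - X_imputed[rows, j]).mean() if rows.any() else 0.0
        report.append({
            "feature": j,
            "mae_removed": mae,
            "mean_shift": X_imputed[:, j].mean() - X_true[:, j].mean(),
            "variance_ratio": X_imputed[:, j].var(ddof=1) / X_true[:, j].var(ddof=1),
        })
    return report


# Self-contained demo on random data with mean imputation.
rng = np.random.default_rng(1)
X_demo_true = rng.normal(size=(200, 3))
X_demo_missing = X_demo_true.copy()
X_demo_missing[rng.random(X_demo_true.shape) < 0.4] = np.nan
X_demo_imputed = np.where(
    np.isnan(X_demo_missing), np.nanmean(X_demo_missing, axis=0), X_demo_missing
)

for row in compare_imputation(X_demo_true, X_demo_missing, X_demo_imputed):
    print(
        f"feature {row['feature']}: MAE={row['mae_removed']:.3f}, "
        f"mean shift={row['mean_shift']:+.3f}, variance ratio={row['variance_ratio']:.2f}"
    )
```

A variance ratio below 1 means the imputed data is less spread out than the true data, which is the variance reduction that mean imputation is known for.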