---
title: "Dealing with missing values"
author: "Alape Aniruddha"
format:
  html:
    theme: theme.scss
    toc: true
    html-math-method: katex
---


## Introduction

Dealing with missing data is often the first step in data preprocessing. However, not all missing data are the same, so we need a good understanding of the types of missingness. Missingness is usually classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

## Bias

When data is missing, the remaining data may not accurately reflect the true population. Both the type of missingness and the way we handle it determine how much bias enters our results.

For example, in a survey recording the income of individuals, those with low income may choose not to respond. Imputing the missing values with the average observed income then introduces bias, because the average of the available data overestimates the average of the population.
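
The income example can be simulated to make the bias visible. The sketch below is only illustrative: the log-normal income distribution, the 30th-percentile cut-off and the 40% response rate are assumptions chosen for the demonstration, not values from any real survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of incomes (illustrative parameters).
income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Assumed response model: people below the 30th percentile respond
# only 40% of the time, everyone else always responds.
low = income < np.quantile(income, 0.30)
responded = np.where(low, rng.random(income.size) < 0.40, True)

true_mean = income.mean()
observed_mean = income[responded].mean()

# Mean imputation fills every gap with the observed mean, so the
# completed column keeps the same (biased) mean as the observed values.
imputed = np.where(responded, income, observed_mean)

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
print(f"imputed mean:  {imputed.mean():,.0f}")
```

Because low earners are under-represented among the respondents, the observed mean sits above the true mean, and mean imputation simply carries that gap over to the completed data.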


## Missing Completely At Random (MCAR)

In this case, the data is missing purely by chance: the missingness is not related to any variable in the dataset or to the missing values themselves. The probability of missingness is the same for all observations, so there is no underlying pattern to it.

For example, if some responses were not recorded during data collection due to a technical error, then the data is missing completely at random.

Under MCAR, analyses performed on the observed data remain unbiased, although discarding incomplete observations reduces the sample size and therefore the statistical power.
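
A small simulation shows what MCAR looks like in practice. This is a sketch under assumed numbers: the normal distribution and the 20% missing rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature with 20,000 observations.
x = rng.normal(loc=50, scale=10, size=20_000)

# MCAR: every value is dropped with the same probability (20%),
# independently of x and of everything else.
missing = rng.random(x.size) < 0.20
x_obs = x[~missing]

print("full mean:    ", x.mean())
print("observed mean:", x_obs.mean())
print("rows kept:    ", x_obs.size, "of", x.size)
```

The two means differ only by sampling noise; what is lost is sample size, not accuracy.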


## Missing At Random (MAR)

In this case, the probability that a value is missing depends only on other observed variables, not on the missing value itself. The missingness therefore follows a pattern, but that pattern can be fully accounted for by the known data values.

For example, in a survey, women might be less willing to disclose their age. The missingness of the variable "age" can then be explained by another observed variable, "gender".
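
The age and gender example can be mimicked with synthetic data. In this sketch the missing probabilities (40% for "F", 10% for "M") and the column names are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

gender = rng.choice(["F", "M"], size=n)
age = rng.integers(18, 80, size=n).astype(float)

# Assumed MAR mechanism: the chance of withholding age depends only
# on the observed variable gender, never on the age value itself.
p_missing = np.where(gender == "F", 0.40, 0.10)
age_obs = np.where(rng.random(n) < p_missing, np.nan, age)

survey = pd.DataFrame({"gender": gender, "age": age_obs})

# Share of missing ages within each gender group.
missing_rate = survey["age"].isna().groupby(survey["gender"]).mean()
print(missing_rate)
```

The missing rate differs sharply between the two groups, which is the kind of pattern that an observed variable can explain.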



## Missing Not At Random (MNAR)

In this case, the missingness is neither MCAR nor MAR: whether a data point is missing depends on the value of that data point itself.
To correct for the resulting bias, we would have to make modelling assumptions about the missingness mechanism.

For example, in a social survey where individuals are asked about their income, respondents may not disclose it if it is too high or too low. Thus the missingness in the feature "income" is directly related to the values of that feature.
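
The following sketch simulates this two-sided non-response. The 10th and 90th percentile cut-offs and the 70% withholding rate are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical incomes (illustrative parameters).
income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Assumed MNAR mechanism: incomes in either tail are withheld
# 70% of the time, so missingness depends on the value itself.
low_cut, high_cut = np.quantile(income, [0.10, 0.90])
in_tail = (income < low_cut) | (income > high_cut)
missing = in_tail & (rng.random(income.size) < 0.70)
observed = income[~missing]

print("std of all incomes:     ", income.std())
print("std of observed incomes:", observed.std())
```

The observed incomes look less spread out than the population really is. Nothing in the observed column alone reveals this, which is why MNAR requires assumptions about the mechanism.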

## Identifying the type of missingness

*   Understanding the data collection process, the features involved and the research domain can help identify possible reasons why data is missing.
*   Graphical representations of the missing data, such as heatmaps, can reveal relationships between features that can be used to impute the missing data more accurately.
*   We can impute the data using different techniques, each under a different assumption about the nature of the missingness, and then compare the results to see which assumptions lead to consistent conclusions.
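
One simple numerical check that complements these ideas is to build a missingness indicator and compare it with the other features. The sketch below uses a synthetic table; the column names `experience` and `salary` and the missingness model are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 5_000

experience = rng.uniform(0, 40, size=n)
salary = 30_000 + 1_500 * experience + rng.normal(0, 5_000, size=n)

# Assumed mechanism: salary is more likely to be missing for
# people with more experience (probability from 0.05 to 0.45).
p_missing = 0.05 + 0.01 * experience
salary_obs = np.where(rng.random(n) < p_missing, np.nan, salary)

toy = pd.DataFrame({"experience": experience, "salary": salary_obs})

# Indicator column: 1 where salary is missing, 0 otherwise.
toy["salary_missing"] = toy["salary"].isna().astype(int)

# Fraction of missing values per column.
print(toy[["experience", "salary"]].isna().mean())

# Does another feature differ between missing and observed rows?
group_means = toy.groupby("salary_missing")["experience"].mean()
corr = toy["salary_missing"].corr(toy["experience"])
print(group_means)
print("correlation with experience:", corr)
```

A clear difference between the two groups suggests the data is not MCAR. It cannot, however, distinguish MAR from MNAR, since that would require knowing the missing values.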






## Missing Value Imputation

In many cases it is not feasible to discard observations that contain missing values, either because only a small number of samples is available or because each observation is important. In such cases we can use imputation, which replaces the missing values with estimated values.
We will look at two simple imputation techniques that can be implemented using the sklearn package.

*   Simple Imputer
*   K Nearest Neighbours Imputer



## Simple Imputer

The Simple Imputer offers a basic approach to filling in missing data: we replace the missing values with a constant or with a statistic such as the “mean”, “mode”, or “median” of the available values.
The technique is useful for its simplicity and can serve as a baseline. Note, however, that it can lead to biased or unrealistic results, since every missing entry in a feature receives the same value.
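
Besides the mean and mode, `SimpleImputer` also accepts `strategy="median"` and `strategy="constant"` (with `fill_value`). A minimal sketch on a single made-up column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with a single missing entry (illustrative values).
X = np.array([[1.0], [1.0], [np.nan], [2.0], [4.0]])

# Median of the observed values 1, 1, 2, 4.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# A fixed placeholder value chosen by the user (0 here).
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

print(median_filled.ravel())
print(constant_filled.ravel())
```

The median is less sensitive to outliers than the mean, while a constant is mainly useful when a sentinel value has a clear meaning in the data.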


### Simple Imputer - Mean

In [None]:
import numpy as np
import pandas as pd

Consider the following matrix,
$$ A=\begin{bmatrix}
1 & 1 & 3\\
2 & 1 & 3\\
3 & \text{NaN} & 3\\
4 & 2 & 4\\
5 & 4 & 5
\end{bmatrix} $$

We can fill the missing value using the mean strategy. The mean of the observed values of that feature is
$\frac{1+1+2+4}{4} = 2$

The updated matrix would be,
$$ A=\begin{bmatrix}
1 & 1 & 3\\
2 & 1 & 3\\
3 & 2 & 3\\
4 & 2 & 4\\
5 & 4 & 5
\end{bmatrix} $$
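
The same calculation can be checked by hand with NumPy before turning to sklearn. This is a sketch of the arithmetic only, not of how `SimpleImputer` is implemented internally.

```python
import numpy as np

A = np.array([[1, 1, 3],
              [2, 1, 3],
              [3, np.nan, 3],
              [4, 2, 4],
              [5, 4, 5]], dtype=float)

# Column means computed over the observed (non-NaN) entries only.
col_means = np.nanmean(A, axis=0)

# Replace each NaN with the mean of its own column.
A_filled = np.where(np.isnan(A), col_means, A)

print(col_means)
print(A_filled)
```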

In [None]:
data = {"feature_1":[1,2,3,4,5],
        "feature_2":[1,1,np.nan,2,4],
        "feature_3":[3,3,3,4,5]}

In [None]:
df = pd.DataFrame(data)
df

|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 1 | 1.0 | 3 |
| 1 | 2 | 1.0 | 3 |
| 2 | 3 | NaN | 3 |
| 3 | 4 | 2.0 | 4 |
| 4 | 5 | 4.0 | 5 |


In [None]:
from sklearn.impute import SimpleImputer

In [None]:
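# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)

In [None]:
# fit_transform returns a NumPy array, so restore the column names
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
imputed_df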

|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 1.0 | 1.0 | 3.0 |
| 1 | 2.0 | 1.0 | 3.0 |
| 2 | 3.0 | 2.0 | 3.0 |
| 3 | 4.0 | 2.0 | 4.0 |
| 4 | 5.0 | 4.0 | 5.0 |


### Simple Imputer - Mode

Consider the following matrix,
$$ A=\begin{bmatrix}
1 & 1 & 3\\
2 & 1 & 3\\
3 & \text{NaN} & 3\\
4 & 2 & 4\\
5 & 4 & 5
\end{bmatrix} $$

We can fill the missing value using the mode strategy. The value that appears most often in the second column is 1.

The updated matrix would be,
$$ A=\begin{bmatrix}
1 & 1 & 3\\
2 & 1 & 3\\
3 & 1 & 3\\
4 & 2 & 4\\
5 & 4 & 5
\end{bmatrix} $$
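
The mode can be verified with pandas before using sklearn. The sketch below works on the second column only; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

feature_2 = pd.Series([1, 1, np.nan, 2, 4])

# Frequency of each observed value (NaN is excluded by default).
counts = feature_2.value_counts()

# mode() returns all most-frequent values in sorted order; take the first.
mode_value = feature_2.mode().iloc[0]

filled = feature_2.fillna(mode_value)

print(counts)
print(filled.tolist())
```

When several values are tied for most frequent, `SimpleImputer(strategy='most_frequent')` uses the smallest of them, which matches taking the first entry of the sorted `mode()` result here.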

In [None]:
# Replace each NaN with the most frequent value in its column
mode_imputer = SimpleImputer(strategy='most_frequent')
imputed_data = mode_imputer.fit_transform(df)

In [None]:
mode_imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
mode_imputed_df

|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 1.0 | 1.0 | 3.0 |
| 1 | 2.0 | 1.0 | 3.0 |
| 2 | 3.0 | 1.0 | 3.0 |
| 3 | 4.0 | 2.0 | 4.0 |
| 4 | 5.0 | 4.0 | 5.0 |


## K Nearest Neighbours

This technique extends the KNN classifier we have seen in MLT to perform imputation. We identify the points that are most similar to the observation we wish to impute, based on the features that are available, and then use the values of these neighbouring points to fill in the missing values.
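
To see how the neighbours are chosen, the steps can be reproduced by hand on the same matrix used in the earlier examples. This is a simplified sketch with two neighbours and uniform weights, using sklearn's `nan_euclidean_distances`, which computes distances over the coordinates present in both rows; it is meant to illustrate the idea rather than mirror the library's internals.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 1, 3],
              [2, 1, 3],
              [3, np.nan, 3],
              [4, 2, 4],
              [5, 4, 5]], dtype=float)

target = 2  # index of the row with the missing second feature

# Distances from the target row to every row, skipping NaN coordinates.
dist = nan_euclidean_distances(X[[target]], X)[0]
dist[target] = np.inf  # a row cannot be its own neighbour

# Only rows where the second feature is observed can act as donors.
donors = np.where(~np.isnan(X[:, 1]))[0]

# Pick the two closest donors and average their second feature.
nearest = donors[np.argsort(dist[donors])[:2]]
imputed_value = X[nearest, 1].mean()

print("nearest rows:", nearest)
print("imputed value:", imputed_value)
```

The two closest rows are those at index 1 and 3, whose second-feature values 1 and 2 average to 1.5.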


In [None]:
from sklearn.impute import KNNImputer

In [None]:
# Average the feature values of the 2 nearest neighbours, weighted equally
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
knn_imputed_data = knn_imputer.fit_transform(df)

In [None]:
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)
knn_imputed_df

|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 1.0 | 1.0 | 3.0 |
| 1 | 2.0 | 1.0 | 3.0 |
| 2 | 3.0 | 1.5 | 3.0 |
| 3 | 4.0 | 2.0 | 4.0 |
| 4 | 5.0 | 4.0 | 5.0 |
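
`KNNImputer` also accepts `weights="distance"`, which gives closer neighbours more influence than distant ones. The sketch below rebuilds the same data so that it runs on its own and compares the two weighting schemes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"feature_1": [1, 2, 3, 4, 5],
                   "feature_2": [1, 1, np.nan, 2, 4],
                   "feature_3": [3, 3, 3, 4, 5]})

# Equal weight for both neighbours.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(df)

# Weight each neighbour by the inverse of its distance.
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(df)

print("uniform: ", uniform[2, 1])
print("distance:", distance[2, 1])
```

With distance weighting, the imputed value moves towards the value of the closer neighbour, so it falls below the uniform average of 1.5.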
