# Missing Data Imputation for Data Cleaning in a Cross-Sectional Data Set


Before deciding how to handle missing data in a data set, it is wise to understand why the data is missing. Knowing the cause can point us toward a suitable strategy when several options are available. This notebook draws on ideas from multiple online sources. There are three major types of missing values in a cross-sectional data set.

## 1. Missing Completely At Random (MCAR)
Say we have a cross-sectional data set for a population of people, and some values of a feature are missing. If the probability that this feature is missing is the same for every sample (person), then the data is Missing Completely At Random (MCAR). How would we, as data scientists, verify that this is indeed MCAR and that the probability of missingness does not differ across samples? One way is to ask the data collector why the values are missing for certain samples. If the collector says that values go missing because whoever enters the data occasionally forgets to type one in, then it is reasonable to assume that the feature is equally likely to be missing for any sample, which supports treating this as a case of MCAR.
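When the data collector cannot be asked, a rough data-driven check is to compare the missing rate of the feature across groups of another observed feature. The sketch below uses synthetic data, and the column names `continent` and `height` as well as the 10% missing rate are assumptions made for illustration. Similar rates across groups are consistent with MCAR but do not prove it, since missingness could still depend on something that was not observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population (hypothetical columns, for illustration only)
people = pd.DataFrame({
    "continent": rng.choice(["Europe", "Asia"], size=n),
    "height": rng.normal(170, 10, size=n),
})

# MCAR: every sample has the same 10% chance of losing its height value
mcar_mask = rng.random(n) < 0.10
people.loc[mcar_mask, "height"] = np.nan

# Missing rate of height within each continent; under MCAR these should be close
missing_rate_by_group = people["height"].isnull().groupby(people["continent"]).mean()
print(missing_rate_by_group)
```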

## 2. Missing At Random (MAR)
Now suppose that, in the case above, the feature is height and the collector instead replies that people from European countries do not hesitate to give out their heights, while some people from Asia with shorter stature prefer not to disclose theirs, and that this is why the data set has missing values. Then different samples have different probabilities of missing this feature, and that probability depends on another predictor variable (here, the continent feature that exists in the data set). For example, if European respondents give out their heights 99% of the time while Asian respondents do so only 70% of the time, then the two groups in the population have two different probabilities of missingness (1% and 30%). This is a case of MAR: within each group the values are missing at random, but the probability of missingness depends on some other observed factor (which continent the person is from). In such cases, we can impute the missing values using this other feature from the data set, unlike in MCAR, where there is no such feature to relate to.
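The continent example can be sketched on synthetic data as follows. The 99% and 70% disclosure rates come from the text above; the height distributions, the column names, and the choice of a per-group mean as the imputed value are assumptions made for illustration, not the only way to use the related feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

continent = rng.choice(["Europe", "Asia"], size=n)
# Assumed group means for the synthetic heights (illustrative, not real statistics)
true_height = np.where(continent == "Europe",
                       rng.normal(175, 7, size=n),
                       rng.normal(165, 7, size=n))

# MAR: the chance of disclosing height depends only on the observed continent
p_disclose = np.where(continent == "Europe", 0.99, 0.70)
disclosed = rng.random(n) < p_disclose

people = pd.DataFrame({"continent": continent, "height": true_height})
people.loc[~disclosed, "height"] = np.nan

# Missing rate per continent: roughly 1% for Europe and 30% for Asia
missing_rate = people["height"].isnull().groupby(people["continent"]).mean()
print(missing_rate)

# Impute each missing height with the mean observed height of the same continent
group_mean = people.groupby("continent")["height"].transform("mean")
people["height_imputed"] = people["height"].fillna(group_mean)
```

Imputing with the overall mean instead would pull the imputed Asian heights toward the European ones, which is why the related feature is used here.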

## 3. Missing Not At Random (MNAR)
If the missingness of a feature's value depends on the true value of that feature, then the data is Missing Not At Random (MNAR). For example, in a survey that collects the highest level of education an individual has completed, people who have not completed a higher level of education tend to be less willing to disclose it. Say individual A, who attended only elementary school, is unlikely to give out this information. Individual B, who completed only high school, is more likely to disclose it than A but less likely than an individual who holds a PhD. Another example is a surveyor collecting information on how much debt an individual has: the larger the debt, the less likely the individual is to disclose it.
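The debt example can be simulated to show why MNAR is troublesome: the disclosed values alone give a biased picture. This is a minimal sketch on synthetic data, where the exponential debt distribution and the specific missingness function are assumptions chosen only so that larger debts are more likely to be hidden.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

# Synthetic debts (hypothetical distribution)
debt = rng.exponential(scale=20_000, size=n)

# MNAR: the probability of hiding the value grows with the value itself
p_missing = debt / (debt + 20_000)  # in [0, 1); larger debt -> more likely missing
hidden = rng.random(n) < p_missing

survey = pd.DataFrame({"debt": debt})
survey["debt_observed"] = survey["debt"].mask(hidden)  # NaN where hidden

true_mean = survey["debt"].mean()
observed_mean = survey["debt_observed"].mean()  # NaN values are skipped
print(f"true mean: {true_mean:.0f}, mean of disclosed values: {observed_mean:.0f}")
```

The mean of the disclosed values underestimates the true mean, and no other column in the data set can correct for this on its own.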

# Summarizing and Visualizing the Missing Data
The missing information can be summarized in pandas as below:


In [None]:
# Summary statistics per feature (describe() covers numeric columns by default)
features_description = features.describe().T

# Add a new column to the dataframe that stores the percentage of missing data
features_description['missing %'] = 100 * (1 - features_description['count'] / len(features))

# Display the information
features_description
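Since `describe()` summarizes only numeric columns by default, a more direct alternative is to count nulls with `isnull()`, which covers every column. The sketch below uses a small made-up DataFrame standing in for `features`; its columns and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Small hypothetical DataFrame standing in for `features`
features = pd.DataFrame({
    "height": [170.0, np.nan, 182.5, 165.0],
    "continent": ["Europe", "Asia", None, "Asia"],
    "age": [31, 45, 27, 52],
})

# Count and fraction of missing values per column, most affected columns first
missing_summary = pd.DataFrame({
    "missing count": features.isnull().sum(),
    "missing fraction": features.isnull().mean(),
}).sort_values("missing fraction", ascending=False)

print(missing_summary)
```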

Also, it can be visualized as below:

In [None]:
import seaborn as sns

# Each row of the heatmap is a sample; highlighted cells mark missing values
sns.heatmap(features.isnull(), cbar=False)

# Strategies To Handle the Missing Data
There is no perfect method to handle missing data in a data set. However, different strategies can be used depending on the type of missingness. Little's MCAR test is a statistical test of whether the data is missing completely at random. However, the practical significance of such tests is unclear. Hence, the general rules of thumb for handling missing data are as below:

## 1. Drop the samples.
If the number of samples that have some missing value(s) is very low as compared to the total number of training samples, training without these samples should not be much of a problem and thus these samples can be dropped. This comes with a cost of losing some information in the process.
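A minimal sketch of this strategy with pandas `dropna` follows. The DataFrame is made up for illustration, and what counts as a "very low" fraction of affected rows is left to the reader's judgment.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the training samples
features = pd.DataFrame({
    "height": [170.0, np.nan, 182.5, 165.0, 174.0],
    "weight": [65.0, 70.0, np.nan, 58.0, 80.0],
    "age": [31, 45, 27, 52, 38],
})

# Fraction of rows with at least one missing value
rows_with_missing = features.isnull().any(axis=1).mean()

# Drop every row that has at least one missing value
complete_rows = features.dropna(axis=0, how="any")

print(f"rows with a missing value: {rows_with_missing:.0%}")
print(f"rows kept: {len(complete_rows)} of {len(features)}")
```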

## 2. Impute a continuous feature using Zero/ Mean/ Median
Although this method is easy, it gives poor results in general. We can also add an indicator variable that shows whether this feature was missing or not.
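The sketch below shows median imputation together with the indicator variable. The `height` column, its values, and the `height_missing` column name are hypothetical; the mean or zero could be substituted for the median in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two missing values and one outlier (300.0)
features = pd.DataFrame({"height": [170.0, np.nan, 182.0, 165.0, np.nan, 300.0]})

# Indicator: 1 where height was originally missing, 0 otherwise
features["height_missing"] = features["height"].isnull().astype(int)

# The median is less sensitive than the mean to outliers such as 300.0
median_height = features["height"].median()
features["height"] = features["height"].fillna(median_height)

print(features)
```

When the data is split into training and test sets, the statistic is usually computed on the training split only and then reused for the test split.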

## 3. Impute a categorical feature using
### a) The Most Frequent Value of this feature
Since a categorical feature that was converted into an encoded numerical value has no quantitative significance (say, a feature "color_of_eyes" could have the values "black", "brown", or "blue", and these categories were converted into 0, 1, and 2 respectively), it does not make sense to take a central measure such as the mean or median of such a feature. Instead, the missing value can be imputed using the most frequent value of the feature.
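A minimal sketch of most-frequent-value imputation with pandas follows, using the "color_of_eyes" feature from the text with made-up values.

```python
import pandas as pd

# Hypothetical values for the color_of_eyes feature; None marks a missing value
features = pd.DataFrame({
    "color_of_eyes": ["brown", "black", None, "brown", "blue", None],
})

# mode() ignores missing values and returns a Series (there can be ties);
# take the first entry
most_frequent = features["color_of_eyes"].mode().iloc[0]
features["color_of_eyes"] = features["color_of_eyes"].fillna(most_frequent)

print(features["color_of_eyes"].tolist())
```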
### b) A New Category for this Feature
Assign this new category to each sample that is missing this feature.
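This can be sketched with `fillna` as below. The label `"missing"` is a hypothetical choice; any value that is not already a category of the feature would do.

```python
import pandas as pd

# Hypothetical values for the color_of_eyes feature; None marks a missing value
features = pd.DataFrame({
    "color_of_eyes": ["brown", "black", None, "brown", "blue", None],
})

# "missing" is a hypothetical label for the new category
features["color_of_eyes"] = features["color_of_eyes"].fillna("missing")

print(features["color_of_eyes"].value_counts())
```

Unlike most-frequent-value imputation, this keeps the fact that the value was missing visible to a downstream model.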


## 4. Use a Machine Learning Model to Impute Numerical/ Categorical Features
### a. kNN Imputation
