## Types of Missing Values

MCAR (Missing Completely At Random): The missingness is unrelated to any values in the data, whether observed or actually missing.

MAR (Missing At Random): The missingness is related to observed values. For example, suppose men are more likely to tell you their weight than women. If the column 'Weight' has missing values, this is MAR: the missingness in 'Weight' depends on the observed values of the variable 'Gender', i.e., male or female.

MNAR (Missing Not At Random): If the missingness is neither MCAR nor MAR, it is MNAR. This means the missingness depends on the values that are actually missing, possibly in addition to the observed values. As a rough indicator, this notebook also treats two or more columns sharing the same missing-value pattern as a sign of MNAR.
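
To make the three mechanisms concrete, here is a minimal simulation sketch. The data are synthetic, and the frame `toy`, its values, and all the missingness probabilities are assumptions chosen only for illustration. None of it comes from the data set used below.

```python
import numpy as np
import pandas as pd

# Synthetic data; every value and probability here is an illustrative assumption.
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "Gender": rng.choice(["male", "female"], size=n),
    "Weight": rng.normal(loc=70, scale=12, size=n),
})

# MCAR: every row has the same 20% chance of losing its weight.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on the observed Gender column.
mar_mask = rng.random(n) < np.where(toy["Gender"] == "female", 0.4, 0.1)

# MNAR: the chance depends on the weight value that goes missing.
mnar_mask = rng.random(n) < np.where(toy["Weight"] > 80, 0.5, 0.1)

# Missing rate per gender under each mechanism.
rates = pd.DataFrame({
    "MCAR": pd.Series(mcar_mask).groupby(toy["Gender"]).mean(),
    "MAR": pd.Series(mar_mask).groupby(toy["Gender"]).mean(),
    "MNAR": pd.Series(mnar_mask).groupby(toy["Gender"]).mean(),
})
print(rates)

# Under MNAR the mean of the remaining weights is biased downwards,
# because heavy values are removed more often.
true_mean = toy["Weight"].mean()
observed_mean_mnar = toy["Weight"].mask(mnar_mask).mean()
print(true_mean, observed_mean_mnar)
```

In this simulation only the MAR column shows a clear gap between the two genders. The MNAR rates look similar across genders even though the remaining weights are biased, which is why MNAR is hard to spot from the observed columns alone.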

### Loading Data and using missingno

#### Press Ctrl + Enter in each cell below to show the result

To install the missingno module for Python (imported in the cell below):

1) Go to File > New Launcher > Terminal

2) In the terminal, type: pip install missingno

3) Once the installation is complete, close the terminal window and restart the kernel of this notebook (the round arrow at the top)

In [4]:
#importing the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import missingno as mano
%matplotlib inline

In [None]:
#read the CSV file
missingdf = pd.read_csv("Missing.Value.Data.csv")

In [None]:
#before removing missing values, let's see how many each column has
missingdf.isnull().sum()
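
Raw counts are easier to judge next to percentages. The sketch below shows one way to report both. The frame `toy` and its values are hypothetical stand-ins, not the contents of the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, np.nan],
    "Quantity": [1.0, 2.0, np.nan, 4.0],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
})

# isnull() gives True/False per cell; sum() counts the True values,
# and mean() gives their share of all rows.
summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_percent": toy.isnull().mean() * 100,
})
print(summary)
```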

In [None]:
#see the completeness of the data using mano.bar
mano.bar(missingdf)

In [None]:
# visualize the location of the missingness of data using mano.matrix
mano.matrix(missingdf)

The matrix above shows that no two columns have exactly the same missing-value pattern. By the indicator described earlier, none of them looks like an example of MNAR.

If the missingness of one variable is unrelated to the missingness of any other variable, and also unrelated to the observed values of any other variable, then it is MCAR.

If there is a systematic relationship between the missingness and the observed values, then it is MAR.

To tell these apart, we need to plot some more charts.

In [2]:
#plot the heatmap to determine the relationship (correlation) between missingness of columns
mano.heatmap(missingdf, figsize=(12,6))

From the heatmap we can see that missingness in each column that has missing values is correlated with missingness in at least one other column. The missingness is therefore not completely random, and we treat all of these columns as examples of MAR.
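
The numbers behind such a heatmap can be approximated by hand, assuming it reports the pairwise correlation between missing-value indicators (the nullity correlation). The sketch below does this on a hypothetical frame `toy` whose columns `A`, `B` and `C` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A and B are always missing together, C is missing elsewhere.
toy = pd.DataFrame({
    "A": [np.nan, np.nan, 3, 4, 5, 6, 7, 8],
    "B": [np.nan, np.nan, 30, 40, 50, 60, 70, 80],
    "C": [1, 2, 3, 4, 5, 6, np.nan, np.nan],
})

# Turn each cell into 1 (missing) or 0 (present), then correlate the columns.
nullity_corr = toy.isnull().astype(int).corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, a negative value means that when one is missing the other tends to be present, and a value near 0 means no linear relationship between the two patterns.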

In [None]:
#the dendrogram clusters the columns by how similar their missingness patterns are
mano.dendrogram(missingdf)

From the dendrogram we can see that S.No and Date are joined by a straight line, i.e., they do not contain any missing values. The remaining columns are linked through their missingness, which further supports MAR.
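
The charts above relate missingness in one column to missingness in another. The MAR definition is about missingness and observed values, so a complementary check is to compare observed values between rows where a column is missing and rows where it is not. Here is a minimal sketch on a hypothetical frame `toy` with invented `Gender`, `Age` and `Weight` values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Weight is missing mostly for women.
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male",
               "female", "female", "female", "female"],
    "Age": [25, 32, 41, 38, 29, 35, 44, 52],
    "Weight": [80, 75, np.nan, 82, np.nan, np.nan, 60, np.nan],
})

# Share of missing Weight values within each observed Gender category.
missing_rate_by_gender = toy["Weight"].isnull().groupby(toy["Gender"]).mean()
print(missing_rate_by_gender)

# Mean of an observed numeric column, split by whether Weight is missing.
weight_missing = toy["Weight"].isnull()
age_when_missing = toy.loc[weight_missing, "Age"].mean()
age_when_present = toy.loc[~weight_missing, "Age"].mean()
print(age_when_missing, age_when_present)
```

A large gap between the groups suggests that the missingness depends on observed values, which points towards MAR rather than MCAR. On real data the gap should be judged against sampling noise.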

## Handling Missing Values

### Dropping Entire Rows

If only a few rows contain missing values, we can drop them. However, dropping is appropriate for MCAR, which the charts above did not establish.

Still, we demonstrate dropping below as an example.

In [None]:
#drop every row that contains at least one missing value
missingdf = missingdf.dropna(axis=0)

In [None]:
#check
missingdf.isnull().sum()

In [None]:
#read the CSV file again
missingdf = pd.read_csv("Missing.Value.Data.csv")

In [None]:
#to delete only the rows with a missing value in one specific variable, say Customer.Name
missingdf = missingdf.dropna(axis=0, subset=['Customer.Name'])

In [None]:
#check
missingdf.isnull().sum()

In [None]:
#the shape has changed too, because the 2 rows with a missing Customer.Name have been deleted
missingdf.shape
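
The dropping options can be compared side by side on a small example. The frame `toy` below is hypothetical: it reuses three column names from this notebook, but the values are invented. The `thresh` parameter is an extra option of `dropna` that the cells above do not use.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Customer.Name": ["Ann", None, "Cara", "Dev"],
    "Sales": [100.0, 200.0, np.nan, np.nan],
    "Quantity": [1.0, 2.0, 3.0, np.nan],
})

# Keep only the rows with no missing value at all.
all_complete = toy.dropna(axis=0)

# Keep the rows where Customer.Name is present, whatever else is missing.
name_known = toy.dropna(axis=0, subset=["Customer.Name"])

# Keep the rows that have at least 2 non-missing values.
at_least_two = toy.dropna(axis=0, thresh=2)

print(toy.shape, all_complete.shape, name_known.shape, at_least_two.shape)
```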

### Imputation

Now we discuss imputation. We can impute with the mean, median, or mode.

To impute means to fill in. Here we use the mean strategy: each missing value in a column is replaced with the mean of that column.

In [None]:
from sklearn.impute import SimpleImputer

#create a separate data frame for mean imputation
missingdf_mean = missingdf.copy(deep=True)
#use the mean of each column to impute
mean_imputation = SimpleImputer(strategy='mean')

#take only the columns where mean imputation makes sense, i.e., numerical columns
missingdf_mean[['Customer.Age','Sales','Quantity']] = mean_imputation.fit_transform(missingdf_mean[['Customer.Age','Sales','Quantity']])

In [None]:
missingdf

Here we see the data frame without imputation on the selected columns.

In [None]:
missingdf_mean

This data frame has the imputation applied to the selected columns Customer.Age, Sales and Quantity.

Next we check the nulls in all columns of missingdf_mean.

In [None]:

missingdf_mean.isnull().sum()
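
The median and mode strategies mentioned earlier work the same way. The sketch below uses a hypothetical frame `toy` with invented `Age` and `Segment` columns: the median comes from `SimpleImputer(strategy='median')`, and the mode of the text column is filled in with plain pandas, which is one of several possible ways to do it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0, 100.0],
    "Segment": ["A", "B", "A", None, "A"],
})
toy_filled = toy.copy(deep=True)

# Median imputation: less sensitive to outliers such as the age of 100.
median_imputer = SimpleImputer(strategy="median")
toy_filled[["Age"]] = median_imputer.fit_transform(toy_filled[["Age"]])

# Mode imputation for the categorical column, using the most frequent value.
most_frequent = toy_filled["Segment"].mode()[0]
toy_filled["Segment"] = toy_filled["Segment"].fillna(most_frequent)

print(toy_filled)
```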

We now plot to see whether the distribution remains the same after the missing values have been filled in.

First, a scatter plot with Sales on the x-axis and Quantity on the y-axis shows the trend between them.

In [None]:
#comparing sales vs quantity (values filled in by mean)
missingdf_mean.plot(x='Sales',y='Quantity',kind='scatter',alpha=0.5)

Next we tag each row of missingdf according to whether its Sales or Quantity value is missing. True means one or both of them are missing.

In [None]:
#now, let's determine the rows where either sales or quantity is missing
nulls = missingdf['Sales'].isnull() | missingdf['Quantity'].isnull()
nulls
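
The same row-level flag can be built for any number of columns with `any(axis=1)`. Here is a small sketch on a hypothetical frame `toy` with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, np.nan],
    "Quantity": [1.0, 2.0, np.nan, np.nan],
})

# Logical OR of the two per-column flags.
flag_or = toy["Sales"].isnull() | toy["Quantity"].isnull()

# Equivalent form: True where any of the listed columns is missing in that row.
flag_any = toy[["Sales", "Quantity"]].isnull().any(axis=1)

print(flag_any.tolist())
print(flag_any.sum(), "of", len(toy), "rows have a missing value")
```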

Here we draw the scatter plot again. Passing c=nulls colors each point by the True/False flag computed above: red points are True (Sales, Quantity, or both were missing), and purple points are False (neither was missing).

In [None]:
#Now, let's plot them together
missingdf_mean.plot(x='Sales',y='Quantity',kind='scatter',alpha=0.5,c=nulls,cmap='rainbow')

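From the plot, we can conclude that the values have been filled in according to the given distribution, because the red points (rows with filled-in values) follow the same pattern as the purple points (rows with no missing values). Keep in mind that mean imputation gives every missing entry in a column the same value, so the filled-in points line up at the column means.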
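
A numeric check can complement the visual one. Mean imputation leaves the column mean unchanged but reduces the spread, because every filled-in value sits exactly at the mean. The sketch below shows this on synthetic data. The series `sales` and its parameters are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic column with 40 of 200 values removed at random.
rng = np.random.default_rng(42)
sales = pd.Series(rng.normal(loc=500, scale=100, size=200), name="Sales")
missing_positions = rng.choice(200, size=40, replace=False)
sales.iloc[missing_positions] = np.nan

# Fill the gaps with the mean of the observed values.
sales_mean_filled = sales.fillna(sales.mean())

# pandas skips NaN by default, so these describe the observed values only.
mean_observed, std_observed = sales.mean(), sales.std()
mean_filled, std_filled = sales_mean_filled.mean(), sales_mean_filled.std()

print("mean:", mean_observed, "->", mean_filled)
print("std: ", std_observed, "->", std_filled)
```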

Next we tag each row with True if the Sales value, the Customer.Age value, or both are missing. We then draw the scatter plot of Sales against Customer.Age, again passing c=nulls so that red points are True (a value was missing) and purple points are False (neither was missing).

In [None]:
nulls = missingdf['Sales'].isnull() | missingdf['Customer.Age'].isnull()
missingdf_mean.plot(x='Sales',y='Customer.Age',kind='scatter',alpha=0.5,c=nulls,cmap='rainbow')

### K-NN Based Imputation

In [None]:
#read the CSV file again
missingdf = pd.read_csv("Missing.Value.Data.csv")

KNNImputer fills each missing value with the mean of that feature over the n_neighbors nearest neighbors. The last line of the cell below does the same job as with SimpleImputer, filling the missing values in the Customer.Age, Sales and Quantity columns, but using the KNNImputer approach described above.

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
missingdf_knn = missingdf.copy(deep=True)
missingdf_knn[['Customer.Age','Sales','Quantity']] = imputer.fit_transform(missingdf_knn[['Customer.Age','Sales','Quantity']])
missingdf_knn
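
To see what KNNImputer does to a single value, it helps to work through a tiny case by hand. In the hypothetical array `X` below (the values are invented), the third row is missing its second feature. Its two nearest rows by the first feature are the first two rows, so with the default uniform weights the filled-in value should be the mean of their second features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the third row has no value for the second feature.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# Distances are computed on the features that are present in both rows.
filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Manual version: average the second feature of the two closest rows.
manual = np.mean([X[0, 1], X[1, 1]])

print(filled[2, 1], manual)
```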

Here we scatter plot the new data frame with its values filled in, although no clear trend can be read from the graph.

In [None]:
#comparing sales vs quantity (values filled in by knn)
missingdf_knn.plot(x='Sales',y='Quantity',kind='scatter',alpha=0.5)

Again we tag each row with True if Sales, Quantity, or both are null. The scatter plot with c=nulls then colors the True and False rows differently, as above.

In [None]:
nulls = missingdf['Sales'].isnull() | missingdf['Quantity'].isnull()
missingdf_knn.plot(x='Sales',y='Quantity',kind='scatter',alpha=0.5,c=nulls,cmap='rainbow')

You can see from the plot that the red points (rows with filled-in values) again follow the same pattern as the purple points (rows with no missing values), so the filling in of missing values seems reasonable.