

In the real world, the data we receive is imperfect, and some of it will contain missing values. To handle them, we use data imputation techniques that attempt to recover or fill in the missing data points. No single technique solves every missing-data problem, but there are several methods to choose from.

<img src="https://memegenerator.net/img/instances/64277502.jpg"/>

Before discussing the different techniques, let's look at the common types of missing data.

1. **Missing Completely at Random (MCAR)**

This means that the missingness of a data point is unrelated to any value, observed or missing. In other words, there is no systematic reason why a given data point is missing. For example, suppose we manually maintain a file listing the products a company sells, and the dimensions are missing for one product. It's possible that the person who maintains the file simply forgot to fill in that field.

2. **Missing at Random (MAR)**

This type means that the missingness is related to the observed data, but not to the missing values themselves. For example, suppose we have two features called `is_sick` and `symptoms`. If `is_sick` is yes, the `symptoms` field describes why the person is sick; if the person is well, the `symptoms` field is blank.

3. **Missing Not at Random (MNAR)**

This type is the most complex of the three: the missingness is related to an unobserved phenomenon, often the missing value itself. For example, if a field asks for a person's salary, there is a higher chance that they will leave it blank for privacy reasons. The same can happen with email addresses or mobile numbers.
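
To make the three mechanisms concrete, here is a minimal simulation sketch. The `age` and `salary` columns, the thresholds, and the 20% rate are all hypothetical values chosen for illustration. Real MAR and MNAR mechanisms are usually probabilistic; hard cut-offs are used here only to keep the example easy to follow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical complete dataset (no missing values yet)
full = pd.DataFrame({
    'age': rng.integers(20, 65, size=n),
    'salary': rng.normal(50_000, 15_000, size=n).round(),
})

# MCAR: every salary has the same 20% chance of being missing
mcar = full.copy()
mcar_mask = rng.random(n) < 0.2
mcar.loc[mcar_mask, 'salary'] = np.nan

# MAR: salary is missing depending on an observed column (age)
mar = full.copy()
mar_mask = full['age'] >= 50
mar.loc[mar_mask, 'salary'] = np.nan

# MNAR: salary is missing depending on the salary value itself
mnar = full.copy()
mnar_mask = full['salary'] > 65_000
mnar.loc[mnar_mask, 'salary'] = np.nan

# Compare the mean of the observed salaries under each mechanism
for name, frame in [('full', full), ('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(name, round(frame['salary'].mean()))
```

In this sketch the MNAR mean of the observed salaries is always lower than the true mean, because the high salaries are exactly the ones that were removed.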

# Imputation Methods

In [3]:
import pandas as pd 
import numpy as np

1. **Complete Case Analysis** <br>
This method retains only the data points that have a complete set of features. In effect, any row with a missing value is excluded from modeling. This approach can be destructive because it may remove a lot of data points, so it's best to check how many rows would be dropped before using it.

In [4]:
# named `data` rather than `dict` to avoid shadowing the built-in dict type
data = {'Name': ['John', 'Bob', 'Eve', 'Rina'],
        'Address': ['Kyoto Region', 'Jhoto Region', 'Sinnoh Region', np.nan],
        'Mobile Number': [2223995, np.nan, 2242641, 2212696]}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Address,Mobile Number
0,John,Kyoto Region,2223995.0
1,Bob,Jhoto Region,
2,Eve,Sinnoh Region,2242641.0
3,Rina,,2212696.0


In [5]:
# to do a complete case analysis we drop the rows (axis=0) that contain null values
model_data = df.dropna(axis=0)
# since Rina and Bob have missing data points, they are excluded from modeling
model_data

Unnamed: 0,Name,Address,Mobile Number
0,John,Kyoto Region,2223995.0
2,Eve,Sinnoh Region,2242641.0
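
Before dropping anything, it helps to count how much data would be lost. The following is a small sketch that rebuilds the same example DataFrame and reports the missing values per column and the share of rows that `dropna` would remove. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Bob', 'Eve', 'Rina'],
    'Address': ['Kyoto Region', 'Jhoto Region', 'Sinnoh Region', np.nan],
    'Mobile Number': [2223995, np.nan, 2242641, 2212696],
})

# number of missing values in each column
missing_per_column = df.isna().sum()

# rows with at least one missing value are the ones dropna(axis=0) removes
rows_with_missing = int(df.isna().any(axis=1).sum())
share_dropped = rows_with_missing / len(df)

print(missing_per_column)
print(f"{rows_with_missing} of {len(df)} rows ({share_dropped:.0%}) would be dropped")
```

Here half of the rows would be lost, which is a strong hint that another imputation method may be a better fit for this dataset.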


2. **Arbitrary Value Imputation** <br>

This method can be used for both numerical and categorical features. For numerical features, it fills missing data points with an arbitrary value far outside the feature's normal range (for example 9999999 or -999999) so that they stand out as missing. For categorical features, a label such as `Unknown` can be used instead.

In [9]:
# in pandas, the fillna method replaces null values with the value we pass in
model_data = df.copy()
model_data['Mobile Number'] = model_data['Mobile Number'].fillna(-9999999)
model_data['Address'] = model_data['Address'].fillna('Unknown')
model_data

Unnamed: 0,Name,Address,Mobile Number
0,John,Kyoto Region,2223995.0
1,Bob,Jhoto Region,-9999999.0
2,Eve,Sinnoh Region,2242641.0
3,Rina,Unknown,2212696.0
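
As an alternative sketch, `fillna` also accepts a dictionary that maps column names to fill values, so both columns can be filled in one call. The sentinel values are the same arbitrary choices used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Bob', 'Eve', 'Rina'],
    'Address': ['Kyoto Region', 'Jhoto Region', 'Sinnoh Region', np.nan],
    'Mobile Number': [2223995, np.nan, 2242641, 2212696],
})

# one fill value per column; columns not listed are left untouched
fill_values = {'Mobile Number': -9999999, 'Address': 'Unknown'}
filled = df.fillna(value=fill_values)
print(filled)
```

`fillna` returns a new DataFrame here, so the original `df` keeps its null values.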


3. **Using Mean, Median and Mode**

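This method fills null values with the mean, median, or mode, depending on the type and distribution of the feature. The mean suits numerical features that are roughly symmetric, the median is more robust when the distribution is skewed or contains outliers, and the mode is the usual choice for categorical features.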

In [11]:
# named `data` rather than `dict` to avoid shadowing the built-in dict type
data = {'Name': ['John', 'Bob', 'Eve', 'Rina'],
        'Address': ['Kyoto Region', 'Jhoto Region', 'Sinnoh Region', np.nan],
        'Mobile Number': [2223995, np.nan, 2242641, 2212696],
        'Age': [np.nan, 11, 13, 14],
        'Color': ['blue', 'blue', np.nan, 'yellow']}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Address,Mobile Number,Age,Color
0,John,Kyoto Region,2223995.0,,blue
1,Bob,Jhoto Region,,11.0,blue
2,Eve,Sinnoh Region,2242641.0,13.0,
3,Rina,,2212696.0,14.0,yellow


In [16]:
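# numerical feature: fill with the mean of the observed ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
# categorical feature: fill with the most frequent color (mode() returns a Series)
df['Color'] = df['Color'].fillna(df['Color'].mode()[0])
df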

Unnamed: 0,Name,Address,Mobile Number,Age,Color
0,John,Kyoto Region,2223995.0,12.666667,blue
1,Bob,Jhoto Region,,11.0,blue
2,Eve,Sinnoh Region,2242641.0,13.0,blue
3,Rina,,2212696.0,14.0,yellow
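
To show why the distribution matters, here is a short sketch with a hypothetical, made-up `income` column that contains one extreme value. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# hypothetical skewed feature: one very large value and one missing value
income = pd.Series([30, 32, 35, 31, 500, np.nan], name='income')

mean_value = income.mean()      # pulled upward by the outlier
median_value = income.median()  # stays close to the typical values

filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(f"mean fill: {mean_value}, median fill: {median_value}")
```

The mean fill lands far above every typical value in the column, while the median fill stays in the range where most observations sit.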


4. **Using Machine Learning**

It's also possible to use machine learning to predict missing values. In this approach we treat the column with missing values as the target and use the other features as inputs. scikit-learn provides the `KNNImputer` class, which uses k-nearest neighbors (KNN) to impute missing values. `KNNImputer` only works with numeric data, so categorical features must first be converted to numbers, for example with `OneHotEncoder` or `OrdinalEncoder` (`LabelEncoder` is intended for target labels rather than input features).

In [19]:
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
df = pd.DataFrame(X, columns=['feat_1', 'feat_2', 'feat_3'])
df

Unnamed: 0,feat_1,feat_2,feat_3
0,1.0,2,
1,3.0,4,3.0
2,,6,5.0
3,8.0,8,7.0


In [22]:
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(df)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
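
`fit_transform` returns a NumPy array, so the column names are lost. The sketch below wraps the result back into a DataFrame and uses scikit-learn's `nan_euclidean_distances` to show which rows count as nearest neighbors. It assumes the default `KNNImputer` settings (uniform weights and the NaN-aware Euclidean metric).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
df = pd.DataFrame(X, columns=['feat_1', 'feat_2', 'feat_3'])

imputer = KNNImputer(n_neighbors=2)
# keep the original column names and index on the imputed result
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(imputed)

# pairwise distances that skip missing coordinates
dist = nan_euclidean_distances(df)
print(dist.round(2))
```

For row 0, the two closest rows are rows 1 and 2, so the missing `feat_3` becomes the average of their values, (3 + 5) / 2 = 4. For row 2, the closest rows that have `feat_1` are rows 1 and 3, giving (3 + 8) / 2 = 5.5.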

Whatever method you choose to impute missing data, it's important to communicate it to the business or domain experts. That way they are aware of the limitations of the data provided, and they may be able to point to cleaner data sources or suggest better ways to fill in the missing values.

# References

https://www.analyticsvidhya.com/blog/2021/06/defining-analysing-and-implementing-imputation-techniques/