# Feature Imputation

Imputation is a technique commonly used to treat outliers and to fill in missing values in our feature columns.

Since we have already treated outliers, I won't be going back to imputation for outliers. In this class today, I will focus on imputation for filling missing values.

There are many imputation methods, each with its pros and cons. To decide which method to use, one has to understand the data we are working with and the problem we are trying to solve.

Understanding the data includes understanding its missingness (i.e. the type of missing data it has). Let's briefly treat the types of data missingness.

### Types of Data Missingness
There are three types of data missingness:
- **Missing Completely At Random (MCAR)**: The data is missing purely at random, with no relation to any values in the dataset. This is most often handled by dropping the affected rows or columns, which we are not treating in this class today.
- **Missing At Random (MAR)**: The missing values in a feature depend on the values of another feature in the dataset.
- **Missing Not At Random (MNAR)**: The missing values depend on the values of the particular feature itself.

**Let's use this illustration to explain further.** Assume we have a dataset of Zoom calls with the features **duration of electric supply** and **duration of call**. Let's take **duration of call** as our feature in question. If :

- Data is missing for reasons like the participant falling asleep or being sent on an errand, or any other reason unrelated to a feature in the dataset, then the missingness is **MCAR**.

Elif :
- Data is missing because the duration of electric supply was short, so the participant's computer or phone went off and they couldn't attend the class, then the missingness is **MAR**, i.e. the missing value is driven by another column in the dataset.

Elif :
- Data is missing because of the duration of the call itself, say some participants are not patient enough to stay on a call for more than two hours and leave once it runs longer, so no value is recorded for them at the end of the class, then the missingness is **MNAR**, since the feature's own value drives the missingness.

Else :
- No missing value.
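
To make the three cases concrete, here is a small simulation sketch of the Zoom illustration. Everything in it is made up for illustration: the column names, the 20% MCAR rate, the one-hour electricity cut-off and the two-hour call cut-off are assumptions, not values from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1000

# Hypothetical Zoom-call data: hours of electricity and minutes spent on the call
calls = pd.DataFrame({
    "electric_supply_hours": rng.uniform(0, 6, n),
    "call_minutes": rng.uniform(10, 180, n),
})

# MCAR: every row has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on ANOTHER column (short electric supply)
mar_mask = calls["electric_supply_hours"] < 1.0

# MNAR: missingness depends on the feature's OWN value (calls longer than two hours)
mnar_mask = calls["call_minutes"] > 120

# Series.mask sets the value to NaN wherever the condition is True
calls["call_mcar"] = calls["call_minutes"].mask(mcar_mask)
calls["call_mar"] = calls["call_minutes"].mask(mar_mask)
calls["call_mnar"] = calls["call_minutes"].mask(mnar_mask)

# Percentage of missing values under each mechanism
print(calls[["call_mcar", "call_mar", "call_mnar"]].isnull().mean() * 100)
```

Notice that under MNAR every observed call is at most 120 minutes, so the observed distribution no longer looks like the true one. That is why the mechanism matters when choosing a fill method.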

Missing values occur in both categorical and numerical data, and there are different methods for handling them.
One of the methods for handling missing values is **imputation**.

Imputation methods can be categorised into :
- Constant Replacement :
    - Mean Substitution
    - Median Substitution
    - Mode Substitution
    - Zero Imputation
- Random Replacement :
    - Data-based (Hot-deck, Cold-deck)
    - Model-based (MCMC, Maximum Likelihood and so on)

- Non-Random Replacement :
    - One Condition (Group Mean, Group Median, LOCF, NOCB)
    - Multiple Conditions (K-Nearest Neighbor, Regression and so on)

We will be treating a few basic methods in this class for practice.
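
Hot-deck is in the list but is not among the methods practised in this class, so here is only a rough sketch of the idea: each missing value is filled with a value drawn from observed "donor" records in the same data. This version is a simple random hot-deck on a made-up series. The variable names and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the draws are repeatable

values = pd.Series([4.0, np.nan, 16.0, np.nan, 16.0, 36.0, 64.0, 100.0, 36.0, np.nan])

# Donor pool: the values that were actually observed
donors = values.dropna().to_numpy()

# Draw one random donor for every missing position
missing = values.isnull()
hot_deck_fill = values.copy()
hot_deck_fill[missing] = rng.choice(donors, size=int(missing.sum()))

print(hot_deck_fill)
```

Unlike constant replacement, the filled values differ from one another, which keeps more of the original spread in the column.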
 
### Let's Practice
        

The most basic method for imputation is constant replacement, a method in which you choose one value and use it to replace all missing values in your feature. Commonly used values are the mean, median, mode, or an extremely high or low value.

In [1]:
## Let's import the libraries
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer


In [2]:
## Let's generate the data.
feature = [2,3,4,5,4,6,8,10,6,9,12,15,8,12,16,20,10,15,20,25]

framed = pd.DataFrame({'feature':feature})
framed['feature_1'] = framed['feature'] ** 2

## Label squares below 100 as 'low' and the rest as 'high'
framed['categorical'] = np.where(framed['feature_1'] < 10**2, 'low', 'high')

## Keep the even squares and set the odd ones to NaN,
## so the missingness depends on feature_1 (MAR)
framed['MAR_feature'] = framed['feature_1'].where(framed['feature_1'] % 2 == 0)

framed

Unnamed: 0,feature,feature_1,categorical,MAR_feature
0,2,4,low,4.0
1,3,9,low,
2,4,16,low,16.0
3,5,25,low,
4,4,16,low,16.0
5,6,36,low,36.0
6,8,64,low,64.0
7,10,100,high,100.0
8,6,36,low,36.0
9,9,81,low,


The data has 20 rows. The MAR_feature column contains NaN wherever feature_1 is odd, so its missingness is driven by another column in the dataset.

In [3]:
(framed['MAR_feature'].isnull().sum()/len(framed['MAR_feature']))*100

30.0

In [4]:
## Constant Replacement Method.
framed['median_fill'] = framed['MAR_feature'].fillna(framed['MAR_feature'].median())
framed['mean_fill'] = framed['MAR_feature'].fillna(framed['MAR_feature'].mean())
framed['zero_fill'] = framed['MAR_feature'].fillna(0)
framed['mode_fill'] = framed['MAR_feature'].fillna(framed['MAR_feature'].mode()[0])
framed

Unnamed: 0,feature,feature_1,categorical,MAR_feature,median_fill,mean_fill,zero_fill,mode_fill
0,2,4,low,4.0,4.0,4.0,4.0,4.0
1,3,9,low,,82.0,127.142857,0.0,16.0
2,4,16,low,16.0,16.0,16.0,16.0,16.0
3,5,25,low,,82.0,127.142857,0.0,16.0
4,4,16,low,16.0,16.0,16.0,16.0,16.0
5,6,36,low,36.0,36.0,36.0,36.0,36.0
6,8,64,low,64.0,64.0,64.0,64.0,64.0
7,10,100,high,100.0,100.0,100.0,100.0,100.0
8,6,36,low,36.0,36.0,36.0,36.0,36.0
9,9,81,low,,82.0,127.142857,0.0,16.0


Under constant replacement, mode substitution is the most suitable choice for categorical variables, while the mean and median are more suitable for numerical features.
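
The same split can be sketched with SimpleImputer from sklearn.impute, which we imported at the top. The toy frame and its column names are made up for illustration. The strategy values "most_frequent" and "median" are the scikit-learn names for mode and median substitution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: one categorical and one numerical column, both with gaps
toy = pd.DataFrame({
    "size": pd.Series(["Medium", "Medium", np.nan, "High", "Small", np.nan, "Medium"], dtype="object"),
    "weight": [9.3, 5.9, np.nan, 8.9, 10.6, 7.2, np.nan],
})

mode_imputer = SimpleImputer(strategy="most_frequent")  # mode, works on strings
median_imputer = SimpleImputer(strategy="median")       # median, numerical only

# fit_transform returns a 2-D array, so flatten it back into one column
toy["size_filled"] = mode_imputer.fit_transform(toy[["size"]]).ravel()
toy["weight_filled"] = median_imputer.fit_transform(toy[["weight"]]).ravel()

print(toy)
```

Here the missing sizes become Medium (the most frequent label) and the missing weights become 8.9 (the median of the observed weights).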

Constant replacement checks the distribution of the column with missing values, gets its mean, median or mode as the case may be, and uses that value to fill the missing values in that feature. This is one of the reasons why, once the missing values reach a certain percentage of a feature, one might have to drop the feature instead, since its distribution might have become biased by the missing values. I usually choose 30% for this, but it varies.
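
As a rough sketch of that rule of thumb, the check below drops every column whose share of missing values is above a cut-off. The toy frame is made up, and the 30% threshold is just the personal cut-off mentioned above, not a universal rule.

```python
import numpy as np
import pandas as pd

# Made-up data with different amounts of missingness per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],        # 20% missing
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
    "c": ["x", "y", "z", "x", "y"],           # 0% missing
})

threshold = 30  # percent; adjust to the data and the problem

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Columns above the threshold are dropped, the rest are kept for imputation
to_drop = missing_pct[missing_pct > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```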

In [5]:
## Non-random replacement, one condition (group mean, group median)
framed.groupby('categorical')['MAR_feature'].mean()

categorical
high    220.571429
low      33.714286
Name: MAR_feature, dtype: float64

In [6]:
framed['feature_grouped_mean'] = framed['MAR_feature'].fillna(framed.groupby('categorical')['MAR_feature'].transform('mean'))
framed['feature_grouped_median'] = framed['MAR_feature'].fillna(framed.groupby('categorical')['MAR_feature'].transform('median'))

This is another method, under non-random replacement, and it is used for numerical data. The intuition behind the method is **aggregating the values within each group of another categorical feature, using groupby, and filling each missing value with the aggregate of its own group**. For instance, when a dataset has **a feature on Gender and another feature on Age, we can use groupby to get the mean age of each gender separately and fill each missing age with the mean for that person's gender**.
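
The Gender and Age example can be sketched like this. The frame is made up for illustration, and the pattern is the same groupby plus transform used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age in each gender group
people = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "age": [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

# transform('mean') returns the group mean aligned to every row of the frame
group_mean = people.groupby("gender")["age"].transform("mean")

# Each missing age takes the mean of its own gender group
people["age_filled"] = people["age"].fillna(group_mean)

print(people)
```

The missing F age becomes 25.0 (the mean of 20 and 30) and the missing M age becomes 45.0 (the mean of 40 and 50).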

In [7]:
## LOCF carries the last observed value forward, NOCB carries the next observed value backward
framed['feature_LOCF'] = framed['MAR_feature'].ffill()
framed['feature_NOCB'] = framed['MAR_feature'].bfill()
framed['feature_LOCF_group'] = framed.groupby(['categorical'])['MAR_feature'].ffill()
framed['feature_NOCB_group'] = framed.groupby(['categorical'])['MAR_feature'].bfill()

This method is more suitable for time series data. For LOCF (Last Observation Carried Forward) we use the observation before the missing value to fill it, while for NOCB (Next Observation Carried Backward) we use the observation after the missing value to fill it.

Both can also be done within groups, using groupby.
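
Since the method is meant for time series, here is a small sketch on a made-up daily series. The dates and readings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily readings with gaps in the middle and at the end
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

locf = readings.ffill()  # last observation carried forward
nocb = readings.bfill()  # next observation carried backward

print(pd.DataFrame({"raw": readings, "LOCF": locf, "NOCB": nocb}))
```

LOCF gives 10, 10, 10, 13, 13. NOCB gives 10, 13, 13, 13 and leaves the last day as NaN, because there is no later observation to carry back. The same thing happens with LOCF when the first value is missing.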

In [8]:
framed[['feature','categorical','MAR_feature','feature_LOCF','feature_LOCF_group','feature_NOCB','feature_NOCB_group']]

Unnamed: 0,feature,categorical,MAR_feature,feature_LOCF,feature_LOCF_group,feature_NOCB,feature_NOCB_group
0,2,low,4.0,4.0,4.0,4.0,4.0
1,3,low,,4.0,4.0,16.0,16.0
2,4,low,16.0,16.0,16.0,16.0,16.0
3,5,low,,16.0,16.0,16.0,16.0
4,4,low,16.0,16.0,16.0,16.0,16.0
5,6,low,36.0,36.0,36.0,36.0,36.0
6,8,low,64.0,64.0,64.0,64.0,64.0
7,10,high,100.0,100.0,100.0,100.0,100.0
8,6,low,36.0,36.0,36.0,36.0,36.0
9,9,low,,36.0,36.0,144.0,64.0


In [9]:
## Non-random replacement (multiple conditions)
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=7)
framed[['feature_KNN']] = knn_imputer.fit_transform(framed[['MAR_feature']])

In [10]:
framed[['feature_KNN']]

Unnamed: 0,feature_KNN
0,4.0
1,127.142857
2,16.0
3,127.142857
4,16.0
5,36.0
6,64.0
7,100.0
8,36.0
9,127.142857


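In this method, we use an algorithm that relies on a distance measure to decide the values to put in place of the missing ones. For each row with a missing value, it finds the rows at the shortest distance that have the value observed and uses their values to fill the gap. Note that only the MAR_feature column is passed to the imputer here, so there are no other features to measure distance on, and the imputer falls back to the column mean (127.142857), which is the same result as mean_fill. To get values based on neighbours, other columns have to be passed to the imputer as well.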
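
To see the difference between passing one column and passing several, here is a small sketch on made-up data. The toy frame, its column names and n_neighbors=2 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: 'target' has gaps, 'feature' is fully observed
toy = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "target": [10.0, np.nan, 30.0, 100.0, np.nan, 120.0],
})

# One column only: no distances can be computed, so the column mean is used
single = KNNImputer(n_neighbors=2).fit_transform(toy[["target"]])

# Two columns: neighbours are found through 'feature'
multi = KNNImputer(n_neighbors=2).fit_transform(toy[["feature", "target"]])

toy["target_single"] = single[:, 0]
toy["target_multi"] = multi[:, 1]
print(toy)
```

With one column both gaps are filled with 65.0, the mean of the observed targets. With two columns the gap at feature 2 is filled with 20.0 (the average of its neighbours 10 and 30), and the gap at feature 11 is filled with 110.0 (the average of 100 and 120).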

In [11]:
framed.describe()

Unnamed: 0,feature,feature_1,MAR_feature,median_fill,mean_fill,zero_fill,mode_fill,feature_grouped_mean,feature_grouped_median,feature_LOCF,feature_NOCB,feature_LOCF_group,feature_NOCB_group,feature_KNN
count,20.0,20.0,14.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,19.0,20.0,19.0,20.0
mean,10.5,148.5,127.142857,113.6,127.142857,89.0,93.8,127.142857,116.0,124.0,127.368421,124.0,127.368421,127.142857
std,6.345326,164.863101,133.591143,112.522466,110.502611,125.635228,122.235149,122.341004,115.916485,134.38045,136.510566,134.38045,136.510566,110.502611
min,2.0,4.0,4.0,4.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
25%,5.75,33.25,36.0,57.0,57.0,0.0,16.0,33.714286,36.0,31.0,26.0,31.0,26.0,57.0
50%,9.5,90.5,82.0,82.0,127.142857,36.0,36.0,82.0,82.0,82.0,64.0,82.0,64.0,127.142857
75%,15.0,225.0,144.0,111.0,131.357143,111.0,111.0,220.571429,144.0,144.0,144.0,144.0,144.0,131.357143
max,25.0,625.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0


### Using External Datasets

In [12]:
df = pd.read_csv('Big Mart Sales Prediction Train.csv')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [14]:
df['Outlet_Location_Type'].value_counts()
df.groupby('Outlet_Location_Type')['Item_Weight'].mean()

Outlet_Location_Type
Tier 1    12.892124
Tier 2    12.768628
Tier 3    12.933745
Name: Item_Weight, dtype: float64

In [15]:
(df.isnull().sum()/len(df))*100

Item_Identifier               0.000000
Item_Weight                  17.165317
Item_Fat_Content              0.000000
Item_Visibility               0.000000
Item_Type                     0.000000
Item_MRP                      0.000000
Outlet_Identifier             0.000000
Outlet_Establishment_Year     0.000000
Outlet_Size                  28.276428
Outlet_Location_Type          0.000000
Outlet_Type                   0.000000
Item_Outlet_Sales             0.000000
dtype: float64

We have two columns with missing values, and in both the percentage of missing values is less than 30%, so we are going to impute values for them using various methods.

In [16]:
df['median_fill'] = df['Item_Weight'].fillna(df['Item_Weight'].median())
df['mean_fill'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['zero_fill'] = df['Item_Weight'].fillna(0)
df['mode_fill'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])

In [17]:
Constant_filled = df[['Item_Weight','Outlet_Size','median_fill','mean_fill','zero_fill','mode_fill']]
Constant_filled

Unnamed: 0,Item_Weight,Outlet_Size,median_fill,mean_fill,zero_fill,mode_fill
0,9.300,Medium,9.300,9.300,9.300,Medium
1,5.920,Medium,5.920,5.920,5.920,Medium
2,17.500,Medium,17.500,17.500,17.500,Medium
3,19.200,,19.200,19.200,19.200,Medium
4,8.930,High,8.930,8.930,8.930,High
...,...,...,...,...,...,...
8518,6.865,High,6.865,6.865,6.865,High
8519,8.380,,8.380,8.380,8.380,Medium
8520,10.600,Small,10.600,10.600,10.600,Small
8521,7.210,Medium,7.210,7.210,7.210,Medium


In [18]:
(Constant_filled.isnull().sum()/len(Constant_filled))*100

Item_Weight    17.165317
Outlet_Size    28.276428
median_fill     0.000000
mean_fill       0.000000
zero_fill       0.000000
mode_fill       0.000000
dtype: float64

In [19]:
df['feature_grouped_mean'] = df['Item_Weight'].fillna(df.groupby('Outlet_Location_Type')['Item_Weight'].transform('mean'))
df['feature_grouped_median'] = df['Item_Weight'].fillna(df.groupby('Outlet_Location_Type')['Item_Weight'].transform('median'))

In [20]:
## LOCF carries the last observed value forward, NOCB carries the next observed value backward
df['feature_LOCF'] = df['Item_Weight'].ffill()
df['feature_NOCB'] = df['Item_Weight'].bfill()
df['feature_LOCF_group'] = df.groupby(['Outlet_Location_Type'])['Item_Weight'].ffill()
df['feature_NOCB_group'] = df.groupby(['Outlet_Location_Type'])['Item_Weight'].bfill()

In [21]:
non_Randomized_filled = df[['Item_Weight','feature_grouped_mean','feature_grouped_median','feature_LOCF','feature_NOCB','feature_LOCF_group','feature_NOCB_group']]
non_Randomized_filled

Unnamed: 0,Item_Weight,feature_grouped_mean,feature_grouped_median,feature_LOCF,feature_NOCB,feature_LOCF_group,feature_NOCB_group
0,9.300,9.300,9.300,9.300,9.300,9.300,9.300
1,5.920,5.920,5.920,5.920,5.920,5.920,5.920
2,17.500,17.500,17.500,17.500,17.500,17.500,17.500
3,19.200,19.200,19.200,19.200,19.200,19.200,19.200
4,8.930,8.930,8.930,8.930,8.930,8.930,8.930
...,...,...,...,...,...,...,...
8518,6.865,6.865,6.865,6.865,6.865,6.865,6.865
8519,8.380,8.380,8.380,8.380,8.380,8.380,8.380
8520,10.600,10.600,10.600,10.600,10.600,10.600,10.600
8521,7.210,7.210,7.210,7.210,7.210,7.210,7.210


In [22]:
non_Randomized_filled.describe()

Unnamed: 0,Item_Weight,feature_grouped_mean,feature_grouped_median,feature_LOCF,feature_NOCB,feature_LOCF_group,feature_NOCB_group
count,7060.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,12.86813,12.820453,12.849649,12.812795,12.900772,12.853461
std,4.643456,4.226195,4.226916,4.63102,4.641486,4.639897,4.639648
min,4.555,4.555,4.555,4.555,4.555,4.555,4.555
25%,8.77375,9.31,9.31,8.785,8.75,8.85,8.765
50%,12.6,12.933745,12.65,12.6,12.6,12.65,12.6
75%,16.85,16.0,16.0,16.75,16.75,16.85,16.85
max,21.35,21.35,21.35,21.35,21.35,21.35,21.35


In [23]:
(non_Randomized_filled.isnull().sum()/len(non_Randomized_filled))*100

Item_Weight               17.165317
feature_grouped_mean       0.000000
feature_grouped_median     0.000000
feature_LOCF               0.000000
feature_NOCB               0.000000
feature_LOCF_group         0.000000
feature_NOCB_group         0.000000
dtype: float64

In [24]:
knn_imputer = KNNImputer(n_neighbors=3)
df[['feature_KNN']] = knn_imputer.fit_transform(df[['Item_Weight']])
df[['feature_KNN']]

Unnamed: 0,feature_KNN
0,9.300
1,5.920
2,17.500
3,19.200
4,8.930
...,...
8518,6.865
8519,8.380
8520,10.600
8521,7.210


In [25]:
non_randomised_Multiple = df[['Item_Weight','feature_KNN']]
non_randomised_Multiple 

Unnamed: 0,Item_Weight,feature_KNN
0,9.300,9.300
1,5.920,5.920
2,17.500,17.500
3,19.200,19.200
4,8.930,8.930
...,...,...
8518,6.865,6.865
8519,8.380,8.380
8520,10.600,10.600
8521,7.210,7.210


In [26]:
non_randomised_Multiple.describe()

Unnamed: 0,Item_Weight,feature_KNN
count,7060.0,8523.0
mean,12.857645,12.857645
std,4.643456,4.226124
min,4.555,4.555
25%,8.77375,9.31
50%,12.6,12.857645
75%,16.85,16.0
max,21.35,21.35


In [27]:
(non_randomised_Multiple.isnull().sum()/len(non_randomised_Multiple))*100

Item_Weight    17.165317
feature_KNN     0.000000
dtype: float64