# Imputation Methods for Missing Data - Comparison Study

### Comprehensive Comparison of Imputation Methods (Basic to Advanced) for Missing Values.

Missing values are a common challenge in real-world datasets, and they can significantly affect the reliability and accuracy of an analysis or a machine learning model. Gaps in the data make it hard to get a complete picture: they distort calculations, introduce bias into the results, and can lead to incorrect conclusions.

In machine learning, models trained on incomplete data often predict less accurately, and mishandled missing data can lead to wrong decisions based on flawed information. That is why it is crucial to explore different ways to fill in these missing values, so that our analyses and models become more reliable and effective. Imputation, the process of filling in missing values, has spawned a multitude of techniques, each with its own strengths and limitations.

In this blog, we explore various imputation methods, covering both their theoretical foundations and their practical implementation. From traditional approaches like mean imputation to more sophisticated methods such as k-nearest neighbors and multiple imputation, we look at how each one works so that readers can choose a sensible way to handle missing data.

Join us as we compare these methods and see how they help data scientists and analysts build more robust datasets and make informed decisions when information is missing.



#### We will use the `titanic` dataset and look for the best imputation method by filling in its missing values with the following popular methods:

1. Mean / Median / Mode Imputation
2. Forward Fill / Backward Fill
3. K-Nearest Neighbors (KNN) Imputation
4. Iterative Imputer

> There are other techniques for imputing missing data, but they depend on assumptions about the `collinearity` of the features and their `correlation` with the `target variable`. They are applicable only in specific situations and are less popular, which is why we will not discuss them here.

##### **Import Important Libraries**

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

##### **Import the Dataset**

In [11]:
# import the titanic dataset from the seaborn library.
df = sns.load_dataset('titanic')

In [12]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [13]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

##### **Subsetting the Dataset**

> The dataset has many numerical and categorical features, but we will keep only a few of them for a quick comparison of the imputation techniques, instead of performing `feature engineering`.

In [14]:
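# Fill the missing data in the 'embarked' column with the mode.
# Assigning the result back avoids the chained `inplace=True` call, which newer pandas versions warn about.
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])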

In [15]:
titanic = df['embarked']

In [16]:
titanic.head()

0    S
1    C
2    S
3    S
4    S
Name: embarked, dtype: object

In [18]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [20]:
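# Example of one-hot encoding the 'embarked' column.
from sklearn.preprocessing import OneHotEncoder

# Reload the raw dataset, so 'embarked' still contains its missing values here.
titanic = sns.load_dataset('titanic')

# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2.
onehot_encoder = OneHotEncoder(sparse_output=False)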
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']])
embarked_onehot_df = pd.DataFrame(embarked_onehot, columns=onehot_encoder.get_feature_names_out(['embarked']))
embarked_onehot_df
# titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
# titanic.head()

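| | embarked_C | embarked_Q | embarked_S | embarked_nan |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 886 | 0.0 | 0.0 | 1.0 | 0.0 |
| 887 | 0.0 | 0.0 | 1.0 | 0.0 |
| 888 | 0.0 | 0.0 | 1.0 | 0.0 |
| 889 | 1.0 | 0.0 | 0.0 | 0.0 |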


In [62]:
# display the percentage of missing values in the columns where the sum-of-missing-values is greater than 0.
(df_train.isnull().sum()[df_train.isnull().sum() > 0] / df_train.shape[0] * 100).sort_values(ascending=False)

FireplaceQu    47.260274
LotFrontage    17.739726
GarageYrBlt     5.547945
GarageCond      5.547945
BsmtCond        2.534247
MasVnrArea      0.547945
dtype: float64

In [56]:
# Set the option to display all columns in the dataframe.
pd.set_option('display.max_columns', None)

In [57]:
# Display the first 5 rows of the training dataset.
df_train.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,Alley,MSZoning,FireplaceQu,GarageCond,BsmtCond,Utilities,HouseStyle,OverallQual.1,OverallCond.1,SaleCondition
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,208500,,RL,,TA,TA,AllPub,2Story,7,5,Normal
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,181500,,RL,TA,TA,TA,AllPub,1Story,6,8,Normal
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,223500,,RL,TA,TA,TA,AllPub,2Story,7,5,Normal
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,140000,,RL,Gd,TA,Gd,AllPub,2Story,7,5,Abnorml
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,250000,,RL,TA,TA,TA,AllPub,2Story,8,5,Normal


In [58]:
# Display the value counts of the feature named 'Alley'.
df_train['Alley'].value_counts()

Alley
Grvl    50
Pave    41
Name: count, dtype: int64

> After checking the `description` file of the dataset, I found the following details about the `Alley` feature:\
    \
    **Alley:** Type of alley access to property\
            **Grvl:** Gravel\
            **Pave:** Paved\
            **NA:** No alley access\
            \
            Here, `NA` does not mean a missing value; it means the property has no alley access.\
            \
            So, we should replace the `NA` values with the keyword `"No Alley Access"`.\
            \
            We will not use any other method for this feature, because imputing these `NA` values in any other way would misrepresent the information.

In [59]:
# Replace the 'NA' values of the 'Alley' column with the keyword 'No Alley Access'.
df_train['Alley'] = df_train['Alley'].fillna('No Alley Access')

In [60]:
# Display the value counts of the feature named 'FireplaceQu'.
df_train['FireplaceQu'].value_counts()

FireplaceQu
Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: count, dtype: int64

##### **1. Mean / Median / Mode Imputation**
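
A minimal sketch of this approach, using a small made-up frame rather than the notebook's own data (the `toy` frame and its values are assumptions chosen for illustration). Numeric columns are filled with the mean or median of the observed values, and categorical columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic dataset).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "embarked": ["S", "C", None, "S", "Q", "S"],
})

# Numeric column: fill with the mean or the median of the observed values.
age_mean = toy["age"].fillna(toy["age"].mean())
age_median = toy["age"].fillna(toy["age"].median())

# Categorical column: fill with the most frequent value (the mode).
embarked_mode = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(age_mean.tolist())
print(age_median.tolist())
print(embarked_mode.tolist())
```

The median is usually the safer choice for skewed columns, because a few extreme values can pull the mean away from the typical value.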

##### **2. Forward Fill / Backward Fill Imputation**
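
A minimal sketch on a made-up series (the values are assumptions for illustration). Forward fill copies the last observed value downward, and backward fill copies the next observed value upward, so both only make sense when the row order is meaningful, for example in time series.

```python
import numpy as np
import pandas as pd

# Toy series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Forward fill: propagate the last observed value; a leading gap stays empty.
ffilled = s.ffill()

# Backward fill: propagate the next observed value; a trailing gap stays empty.
bfilled = s.bfill()

# Chaining both fills every gap as long as at least one value is observed.
both = s.ffill().bfill()

print(both.tolist())
```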

##### **3. Linear Interpolation**
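
A minimal sketch on a made-up series (the values are assumptions for illustration). With `method="linear"`, pandas treats the rows as equally spaced and draws a straight line between the observed neighbours of each gap.

```python
import numpy as np
import pandas as pd

# Toy series with two gaps between observed values.
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan, 60.0])

# Fill each gap with points on the straight line between its observed neighbours.
interpolated = s.interpolate(method="linear")

print(interpolated.tolist())
```

Like forward and backward fill, this assumes that neighbouring rows are related, so it suits ordered data better than shuffled tabular data.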

##### **4. Regression Imputation**
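
One way this could be sketched is to fit a regression model on the rows where the target column is observed and predict it for the rows where it is missing. The `toy` frame, the choice of `fare` as the only predictor, and the use of `LinearRegression` are assumptions for illustration, not the notebook's own setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): 'age' has gaps, 'fare' is fully observed.
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "age": [21.0, 31.0, np.nan, 51.0, np.nan, 71.0],
})

# Fit the model only on the rows where 'age' is observed.
observed = toy["age"].notna()
model = LinearRegression().fit(toy.loc[observed, ["fare"]], toy.loc[observed, "age"])

# Predict 'age' for the rows where it is missing and fill them in.
age_regression = toy["age"].copy()
age_regression.loc[~observed] = model.predict(toy.loc[~observed, ["fare"]])

print(age_regression.tolist())
```

Because every imputed value lies exactly on the fitted line, this approach tends to understate the natural spread of the imputed column.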

##### **5. Stochastic Imputation**
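
A minimal sketch of one common form of stochastic imputation: a regression prediction plus random noise drawn from the spread of the model's residuals. The `toy` frame, the single predictor, the normal noise model, and the fixed seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Toy data (hypothetical values): 'age' has gaps, 'fare' is fully observed.
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "age": [22.0, 30.0, np.nan, 52.0, 59.0, np.nan, 82.0, 90.0],
})
observed = toy["age"].notna()
X_obs, y_obs = toy.loc[observed, ["fare"]], toy.loc[observed, "age"]

# Fit on the observed rows and estimate the residual standard deviation.
model = LinearRegression().fit(X_obs, y_obs)
residuals = y_obs - model.predict(X_obs)
sigma = residuals.std(ddof=2)  # two fitted parameters: slope and intercept

# Imputed value = regression prediction + random noise with that spread.
predicted = model.predict(toy.loc[~observed, ["fare"]])
noise = rng.normal(loc=0.0, scale=sigma, size=predicted.shape[0])

age_stochastic = toy["age"].copy()
age_stochastic.loc[~observed] = predicted + noise

print(age_stochastic.round(2).tolist())
```

The added noise keeps the imputed values from sitting exactly on the regression line, which helps preserve the variance of the column.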

##### **6. K-Nearest Neighbors (KNN) Imputation**
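
A minimal sketch using scikit-learn's `KNNImputer` on a made-up frame (the values and `n_neighbors=2` are assumptions for illustration). Each missing entry is replaced by the average of that feature over the nearest rows, where distance is computed from the features that are observed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): two cheap-fare rows, two expensive-fare rows,
# and one cheap-fare row whose 'age' is missing.
X = pd.DataFrame({
    "age": [20.0, 22.0, np.nan, 60.0, 62.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0],
})

# The missing 'age' is filled with the mean 'age' of the 2 nearest rows,
# where "nearest" is measured on the observed features (here only 'fare').
imputer = KNNImputer(n_neighbors=2)
X_knn = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(X_knn)
```

In practice the features are usually scaled first, since a column with a large numeric range would otherwise dominate the distance.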

##### **7. Iterative Imputation**
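
A minimal sketch using scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features and repeats this in rounds. The toy frame, `max_iter=10`, and `random_state=0` are assumptions for illustration. The estimator is still marked experimental in scikit-learn, so it needs the explicit enabling import shown below.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the estimator)
from sklearn.impute import IterativeImputer

# Toy data (hypothetical values) with gaps in both columns.
X = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0, 60.0, np.nan],
    "fare": [10.0, 20.0, 30.0, np.nan, 50.0, 60.0],
})

# Each column with gaps is regressed on the other columns, round after round.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_iter = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(X_iter.round(2))
```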

##### **Conclusion:**