# Deep Learning for Data Imputation

![](imputation.png)

## Introduction
* Data imputation is a technique used to fill in missing values in a dataset. 
* It is a common practice in data preprocessing. 
* There are several ways to impute missing values, such as using the mean, median, mode, or a constant value to fill in the missing data. 
* However, these simple methods ignore the relationships between features, so the filled values can be inaccurate and can distort the distribution of a column, especially in datasets with complex structure.
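
As a minimal sketch of the simple strategies listed above, the cell below fills a toy column with the mean, median, mode, and a constant using pandas. The values are made up for illustration and are not taken from `ExampleData.csv`. It also shows one way the distortion appears: filling with the mean shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values only)
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0, np.nan])

filled_mean = s.fillna(s.mean())          # mean of observed values: 3.75
filled_median = s.fillna(s.median())      # median of observed values: 2.0
filled_mode = s.fillna(s.mode().iloc[0])  # most frequent observed value: 2.0
filled_const = s.fillna(0.0)              # fixed constant

# Standard deviation before and after mean filling
# (pandas skips NaN when computing the original std)
print(s.std(), filled_mean.std())
```

Every filled cell receives the same value regardless of what the other columns say about that row, which is the limitation the model-based imputers in the rest of this notebook try to address.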

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

## Load Data with Missing Values

In [2]:
df = pd.read_csv('ExampleData.csv')

## Show Data

In [3]:
df.head()

Unnamed: 0,Height,YOE,Salary
0,175.0,3.0,6.0
1,168.0,4.0,9.0
2,160.0,10.0,18.0
3,,15.0,25.0
4,161.0,,50.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Height  20 non-null     float64
 1   YOE     22 non-null     float64
 2   Salary  22 non-null     float64
dtypes: float64(3)
memory usage: 732.0 bytes


In [5]:
df.describe()

Unnamed: 0,Height,YOE,Salary
count,20.0,22.0,22.0
mean,168.7,7.045455,17.590909
std,7.226414,3.3591,14.147721
min,160.0,3.0,6.0
25%,161.75,5.0,10.0
50%,169.0,6.5,11.5
75%,175.0,8.75,18.0
max,180.0,15.0,50.0


In [6]:
df.isnull().sum()

Height    5
YOE       3
Salary    3
dtype: int64

## Fill Missing Values with Linear Regression

In [7]:
lr = LinearRegression()
imputer = IterativeImputer(estimator=lr)
df_imputed = imputer.fit_transform(df)
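
To make the idea behind the cell above more concrete, here is a small sketch on made-up data in which column `b` is exactly twice column `a`. With a single incomplete column, the imputed value should match what you get by fitting a linear regression on the complete rows by hand. The array `X` and the names `manual` and `manual_value` are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data: second column is 2 * first column, with one missing entry
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

imputed = IterativeImputer(estimator=LinearRegression()).fit_transform(X)

# Manual equivalent for this simple case:
# regress the incomplete column on the other one using only complete rows
observed = ~np.isnan(X[:, 1])
manual = LinearRegression().fit(X[observed, :1], X[observed, 1])
manual_value = manual.predict(X[~observed, :1])[0]

print(imputed[2, 1], manual_value)
```

When several columns have missing values, the imputer cycles through them repeatedly, each time using the current filled-in values of the other columns as predictors.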

## Data Imputation Results

In [8]:
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)
df_imputed.head()

Unnamed: 0,Height,YOE,Salary
0,175.0,3.0,6.0
1,168.0,4.0,9.0
2,160.0,10.0,18.0
3,165.275222,15.0,25.0
4,161.0,31.706376,50.0


In [9]:
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Height  25 non-null     float64
 1   YOE     25 non-null     float64
 2   Salary  25 non-null     float64
dtypes: float64(3)
memory usage: 732.0 bytes


In [11]:
df_imputed.describe()

Unnamed: 0,Height,YOE,Salary
count,25.0,25.0,25.0
mean,168.5485,10.004765,16.968229
std,6.513479,8.761896,13.345414
min,160.0,3.0,6.0
25%,162.0,5.0,10.0
50%,169.720689,7.0,12.0
75%,172.0,10.0,18.0
max,180.0,31.706376,50.0


In [12]:
df_imputed.isnull().sum()

Height    0
YOE       0
Salary    0
dtype: int64
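
Note that the imputed `YOE` of 31.706376 in row 4 is well above the largest observed value (15.0 in the earlier `describe()` output): a linear model extrapolates freely when a row has an unusual predictor such as `Salary` = 50. One possible way to limit this, sketched below on made-up data, is the `min_value` and `max_value` parameters of `IterativeImputer`, which clip imputed values. Bounding by the observed range is an assumption of this sketch, not something the notebook above does.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data: first column is half the second; the last row has an
# outlying second value and a missing first value
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 40.0],
])

# Unbounded: the regression extrapolates far beyond the observed range
unbounded = IterativeImputer(estimator=LinearRegression()).fit_transform(X)

# Bounded: clip each column's imputed values to its observed min and max
bounded = IterativeImputer(
    estimator=LinearRegression(),
    min_value=np.nanmin(X, axis=0),
    max_value=np.nanmax(X, axis=0),
).fit_transform(X)

print(unbounded[4, 0], bounded[4, 0])
```

Whether clipping is appropriate depends on the data; it prevents implausible extrapolations but can also pile imputed values up at the boundary.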

## KNN Imputation
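KNN (k-nearest neighbors) imputation is another technique: each missing value is filled with the average of that feature over the `n_neighbors` rows that are most similar on the observed features.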

In [13]:
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
df_imputed_knn = knn_imputer.fit_transform(df)

df_imputed_knn = pd.DataFrame(df_imputed_knn, columns=df.columns)
df_imputed_knn.head()

Unnamed: 0,Height,YOE,Salary
0,175.0,3.0,6.0
1,168.0,4.0,9.0
2,160.0,10.0,18.0
3,160.0,15.0,25.0
4,161.0,10.0,50.0
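
The neighbor averaging can be checked by hand on a tiny made-up array. In the sketch below, the row with the missing value has only its first feature observed, so similarity reduces to the absolute difference on that feature. The names `dist`, `nearest`, and `manual_value` are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: one missing value in the second column of row 2
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [10.0, 100.0],
])

imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Manual check: rank the other rows by distance on the observed feature
dist = np.abs(X[:, 0] - X[2, 0])
dist[2] = np.inf                     # a row cannot be its own neighbor
nearest = np.argsort(dist)[:2]       # the two closest rows
manual_value = X[nearest, 1].mean()  # unweighted average of their values

print(imputed[2, 1], manual_value)
```

Because the distances depend on the scale of each feature, it is often worth standardizing the columns before using a neighbor-based imputer.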


## Fill Missing Values with Mean

In [14]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # strategy options: 'mean', 'median', 'most_frequent', 'constant'
df_imputed_mean = imputer.fit_transform(df)

df_imputed_mean = pd.DataFrame(df_imputed_mean, columns=df.columns)
df_imputed_mean.head()

Unnamed: 0,Height,YOE,Salary
0,175.0,3.0,6.0
1,168.0,4.0,9.0
2,160.0,10.0,18.0
3,168.7,15.0,25.0
4,161.0,7.045455,50.0


## Fill Missing Values with MICE
miceforest is a library that implements the MICE (Multiple Imputation by Chained Equations) algorithm, using LightGBM models to impute missing values in a dataset.

In [16]:
# !pip install miceforest

In [27]:
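import miceforest as mf

# Build the imputation kernel from the DataFrame that contains missing values
mc_imputer = mf.ImputationKernel(df)
# Note: this cell does not call mc_imputer.mice(...), the method that runs the MICE iterations
mc_imputer.complete_data()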
# df_imputed_mice = pd.DataFrame(mc_imputer.impute_new_data(df).data, columns=df.columns)
# df_imputed_mice.head()


Unnamed: 0,Height,YOE,Salary
0,175.0,3.0,6.0
1,168.0,4.0,9.0
2,160.0,10.0,18.0
3,160.0,15.0,25.0
4,161.0,4.0,50.0
5,162.0,5.0,10.0
6,180.0,6.0,11.0
7,180.0,7.0,18.0
8,172.0,8.0,12.0
9,170.0,9.0,14.0
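
The "multiple" in MICE refers to producing several completed datasets instead of one, so that the uncertainty of the imputed values can be seen. As a rough sketch of that idea, the cell below swaps in scikit-learn's `IterativeImputer` with `sample_posterior=True` and different seeds. This is not miceforest and not the code used above, and the synthetic data is made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic, correlated toy data with a few values removed
rng = np.random.default_rng(0)
a = rng.normal(size=40)
X = np.column_stack([
    a,
    2 * a + rng.normal(scale=0.1, size=40),
    -a + rng.normal(scale=0.1, size=40),
])
X_missing = X.copy()
X_missing[[3, 7, 11], 1] = np.nan
X_missing[[5, 20], 2] = np.nan

# One completed dataset per seed; sample_posterior=True draws each imputed
# value from the predictive distribution of the default BayesianRidge model
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]

stacked = np.stack(completed)         # shape: (5 datasets, 40 rows, 3 columns)
pooled_mean = stacked.mean(axis=0)    # average imputation per cell
between_std = stacked.std(axis=0)     # spread across the completed datasets

print(pooled_mean[3, 1], between_std[3, 1])
```

Observed cells are identical in every completed dataset; only the imputed cells vary, and their spread gives a sense of how uncertain each imputation is.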


## MissingPy

In [None]:
# !pip install missingpy

## FFill and BFill
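* Forward fill (ffill) copies the last valid observation forward into each gap, and backward fill (bfill) copies the next valid observation backward. Both are simple techniques that make the most sense when the row order is meaningful, as in time series.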

In [None]:
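df.ffill()  # forward fill; returns a new DataFrame and leaves df unchanged
df.bfill()  # backward fill; as the last expression, only this result is displayed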
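
A short sketch on a made-up Series shows the difference between the two directions, and why they are sometimes chained: forward fill cannot fill a leading gap, and backward fill cannot fill a trailing one.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 4.0, np.nan])

forward = s.ffill()       # leading NaN stays: nothing before it to copy
backward = s.bfill()      # trailing NaN stays: nothing after it to copy
both = s.ffill().bfill()  # forward fill first, then backward fill the rest

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```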

## Conclusion
