# Five importnat ways for Imputing Missing Values

You can impute missing values using machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and the missing values:

1. **`Simple Imputation Techniques:`** 
   - **Mean/Median Imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.
   - **Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.

2. **`K-Nearest Neighbors (KNN)`:** This algorithm can be used to impute missing values based on the similarity of rows.

3. **`Regression Imputation:`** Use a regression model to predict the missing values based on other variables in your dataset.

4. **`Decision Trees and Random Forests:`** These can handle missing values inherently. They can also be used to predict missing values based on the patterns learned from the other data.

5. **`Advanced Techniques:`**
   - **Multiple Imputation by Chained Equations (MICE):** This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.
   - **Deep Learning Methods:** Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.

6. **`Time Series Specific Methods:`** For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.

It's important to choose the right method based on the type of data, the pattern of missingness (e.g., at random, completely at random, or not at random), and the amount of missing data. Additionally, it's crucial to understand that imputation can introduce bias or affect the distribution of your data, so it should be done with caution and an understanding of the potential implications.

<!-- #Simple impputation technique 

Mean Median replaces missing values with the mean and meadian of the column. This is simple and effective method, but it has some limitations. For example, it reduces variance in the dataset, and it can lead to biased estimates if the missing values are not missing at random. -->


In [105]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

#load the titanic dataset from seaborn 

df = sns.load_dataset('titanic')    
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [107]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [108]:
#check the missing values in the dataset as per ascending order 

df.isnull().sum().sort_values(ascending=False)


deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [109]:
#since the age column has 177 columns so we can replace them with mean 

df['age'] = df['age'].fillna(df['age'].mean())
df['age'].isnull().sum()    

0

In [110]:
#we can also impute the missing values with the median 

df = sns.load_dataset('titanic')
df['age'] = df['age'].fillna(df['age'].median())
df['age'].isnull().sum()

0

In [111]:
#check the missing values in the dataset as per ascending order

df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [112]:
#We can apply mode imputation to the categorcial or object columsn 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [113]:
df.isnull().sum().sort_values(ascending=False)


deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [114]:
#impute the missiing values in the embark and embark_town with the mode 

df['embarked'] =df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

df.isnull().sum().sort_values(ascending=False)

deck           688
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [115]:
# #We can impute the missing values using the KNN imputer

# KNN is the method used to impute the missing values in the dataset using KNN algorithim which fill the missing values with the nearest neighbours

# import the dataset
df = sns.load_dataset('titanic')
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [116]:
# impute missing values with KNN imputer
from sklearn.impute import KNNImputer

# call the KNN class with number of neighbors = 4
imputer = KNNImputer(n_neighbors=4)

#impute missing values with KNN imputer
df['age'] = imputer.fit_transform(df[['age']])

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [117]:
# Regression Imputer 
# Regression imputation uses the regression model to impute the missinng values and based on the other varaiables like predicting the missing values using linear or multi linear regression model..

In [118]:
#import the dataset 

df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [119]:
#impute the missing values using the regression model 
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer 

#call the itterative class with max_iter = 10 

imputer = IterativeImputer(max_iter =10)

#impute the missing valeus with regression model 

df['age'] = imputer.fit_transform(df[['age']])

#check the number of the missing values now 

df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [120]:
#impute the missing values using randomforest

import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor 

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.impute import SimpleImputer
#load sns dataset titanic 

df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [121]:
#check the misisng values in each column as per ascending order

df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [122]:
#remove the deck column from the dataset 

df = df.drop('deck', axis=1)
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [123]:
#encode the data using label encoder 

from sklearn.preprocessing import LabelEncoder

In [124]:
#colum to encode 

column_to_encode = ['sex', 'embarked', 'class', 'embark_town', 'alive' ]

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [125]:
#dictionary to store labelencoder for each column 

label_ecoders = {}

#loop through ech column and apply label encoder 

#ccreate new label encoder for the column
for col in column_to_encode:
    
    le = LabelEncoder()

#fit and transform and then re_transverse the data 

    df[col] = le.fit_transform(df[col])

#store the encoder in the dictionary

    label_ecoders[col] = le


df.head()



Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,man,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,woman,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,woman,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,woman,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,man,True,2,0,True


In [126]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,man,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,woman,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,woman,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,woman,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,man,True,2,0,True


In [127]:
#split the dataset without missing valuese andwith missing values 

df_with_missing = df[df['age'].isna()]


#drop all the values having the na values 

df_without_missing = df.dropna()

In [128]:
print('The shape of the orignal dataset is', df.shape)
print('The shape of the dataset with missing values is', df_with_missing.shape)
print('The shape of the dataset with out missing values is', df_without_missing.shape)

The shape of the orignal dataset is (891, 14)
The shape of the dataset with missing values is (177, 14)
The shape of the dataset with out missing values is (714, 14)


In [129]:
df_without_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,man,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,woman,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,woman,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,woman,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,man,True,2,0,True


In [130]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,man,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,woman,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,woman,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,woman,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,man,True,2,0,True


In [131]:
#check the names of the columns 

df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')

In [132]:
# Regression Imputation
#import mean absolute percentage error
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error, mean_absolute_percentage_error

# split the data into X and y and we will only take the columns with no missing values
X = df_without_missing.drop(['age'], axis=1)
y = df_without_missing['age']

#apply label encoder to the X dataset but to only categorial and object columns

for col in X.columns:
    if X[col].dtype == 'object':
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        label_ecoders[col] = le
        
#  split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# evaluate the model
y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.081260589808045
R2 Score for Random Forest Imputation:  0.33769388288226154
MAE for Random Forest Imputation:  8.666661815622195
MAPE for Random Forest Imputation:  0.40839466096086574


In [133]:
#Echeck the number of missing values in the dataset

df.isnull().sum().sort_values(ascending=False)


age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [134]:
# 5.1. Multiple Imputation by Chained Equations (MICE)
# Multiple Imputation by Chained Equations (MICE) is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion. It works well for both categorical and numerical data.

# To demonstrate Multiple Imputation by Chained Equations (MICE) in Python, we can use the IterativeImputer class from the sklearn.impute module. MICE is a sophisticated method of imputation that models each feature with missing values as a function of other features, and it uses that estimate for imputation. It does this in a round-robin fashion: each feature is modeled in turn. The MICE algorithm is implemented in the IterativeImputer class.

# Let's see how to implement MICE in Python using the Titanic dataset.

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df= sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [136]:
#check the missing values 

df.isnull().sum().sort_values(ascending = False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [137]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [143]:
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object using LabelEncoder() in for loop for categorical columns
# Columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'deck', 'class', 'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column for encoding
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()
    # Fit and transform the data
    df[col] = le.fit_transform(df[col])
    # Store the encoder in the dictionary
    label_encoders[col] = le
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [144]:
#impute the missing values with itterative imputer 
#call the itterativeImputer with max_iter = 10

#columns to impute

columns_to_impute = ['age', 'embark_town', 'embarked', 'deck']

#loop to impute each column 

for col in columns_to_impute:
    df[col] = imputer.fit_transform(df[[col]])
    
    
#check the missing values now 

df.isnull().sum().sort_values(ascending = False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [145]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,2,1,True,7.0,2.0,0,False
1,1,1,0,38.0,1,0,71.2833,0.0,0,2,False,2.0,0.0,1,False
2,1,3,0,26.0,0,0,7.925,2.0,2,2,False,7.0,2.0,1,True
3,1,1,0,35.0,1,0,53.1,2.0,0,2,False,2.0,2.0,1,False
4,0,3,1,35.0,0,0,8.05,2.0,2,1,True,7.0,2.0,0,True


In [149]:
#now inverse transform for each column 

for col in column_to_encode:
    
    #retreive the correspondance labelencoder for the column 
    
    le = label_encoders[col]
    
    #inverse transform the data and convert it into integer type 
    
    df[col] = le.inverse_transform(df[col].astype(int))
    
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,Third,man,True,,Southampton,no,False
1,1,1,0,38.0,1,0,71.2833,0.0,First,woman,False,C,Cherbourg,yes,False
2,1,3,0,26.0,0,0,7.925,2.0,Third,woman,False,,Southampton,yes,True
3,1,1,0,35.0,1,0,53.1,2.0,First,woman,False,C,Southampton,yes,False
4,0,3,1,35.0,0,0,8.05,2.0,Third,man,True,,Southampton,no,True


In [150]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [154]:
df.isnull().sum().sort_values(ascending = False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [155]:
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object using LabelEncoder() in for loop for categorical columns
# Columns to encode
columns_to_encode = ['sex', 'smoker', 'day', 'time']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column for encoding
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()
    # Fit and transform the data
    df[col] = le.fit_transform(df[col])
    # Store the encoder in the dictionary
    label_encoders[col] = le
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,2,0,2
1,10.34,1.66,1,0,2,0,3
2,21.01,3.5,1,0,2,0,3
3,23.68,3.31,1,0,2,0,2
4,24.59,3.61,0,0,2,0,4


In [158]:
print(label_encoders)

{'sex': LabelEncoder(), 'smoker': LabelEncoder(), 'day': LabelEncoder(), 'time': LabelEncoder()}


In [160]:
#invers transform the encoded values 

for col in columns_to_encode:
    
    le = label_encoders[col]
    
    df[col] = le.inverse_transform(df[col].astype(int))
    
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


# 5.2. Deep Learning Methods
Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets. Deep learning methods, particularly neural networks like autoencoders, offer a powerful approach for imputing missing values in complex datasets. These methods are especially useful when the data has intricate, non-linear relationships that traditional statistical methods might not capture effectively.

Understanding Autoencoders for Imputation:
What is an Autoencoder?

An autoencoder is a type of neural network that is trained to copy its input to its output.
It has a hidden layer that describes a code used to represent the input.
The network may be viewed as consisting of two parts: an encoder function, which compresses the input into a latent-space representation, and a decoder function, which reconstructs the input from the latent space.
How Autoencoders Work for Imputation:

The key idea is to train the autoencoder to ignore the noise (missing values) in the input data.
During training, inputs with missing values are presented, and the network learns to predict the missing values in a way that minimizes reconstruction error for known parts of the data.
This results in the network learning a robust representation of the data, enabling it to make reasonable guesses about missing values.
Advantages of Using Autoencoders:

Handling Complex Patterns: They can capture non-linear relationships in the data, which is particularly useful for complex datasets.
Scalability: They can handle large-scale datasets efficiently.
Flexibility: They can be adapted to different types of data (e.g., images, text, time-series).
Implementation Considerations:

Data Preprocessing: Data should be normalized or standardized before feeding it into an autoencoder.
Network Architecture: The choice of architecture (number of layers, type of layers, etc.) depends on the complexity of the data.
Training Process: It might involve techniques like dropout or noise addition to improve the model's ability to handle missing data.
Example Use-Cases:

Image Data: Filling in missing pixels or reconstructing corrupted images.
Time-Series Data: Imputing missing values in sequences like stock prices or weather data.
Tabular Data: Handling missing entries in datasets used for machine learning.
Implementation Example:
Here's a simplified example of how you might set up an autoencoder for imputation in Python using TensorFlow and Keras: (Check the next notebook)