## **5.2. Deep Learning Methods**

Deep learning methods, particularly neural networks such as autoencoders, can impute missing values in complex datasets. They are most useful when the data contains intricate, non-linear relationships that traditional statistical methods may not capture.

### Understanding Autoencoders for Imputation:

1. **What is an Autoencoder?**
   - An autoencoder is a type of neural network that is trained to copy its input to its output.
   - It has a hidden layer that describes a code used to represent the input.
   - The network may be viewed as consisting of two parts: an encoder function, which compresses the input into a latent-space representation, and a decoder function, which reconstructs the input from the latent space.

2. **How Autoencoders Work for Imputation:**
   - The key idea is to treat missing values as noise in the input and train the autoencoder to reconstruct the data despite that noise.
   - During training, inputs with missing values are presented (with the gaps filled by a placeholder, since a network cannot take NaN as input), and the network learns to predict the missing values in a way that minimizes reconstruction error on the known parts of the data.
   - As a result, the network learns a robust representation of the data, which lets it make reasonable guesses about missing values.

3. **Advantages of Using Autoencoders:**
   - **Handling Complex Patterns:** They can capture non-linear relationships in the data, which is particularly useful for complex datasets.
   - **Scalability:** Because they are trained on mini-batches, they can handle large-scale datasets efficiently.
   - **Flexibility:** They can be adapted to different types of data (e.g., images, text, time-series).

4. **Implementation Considerations:**
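   - **Data Preprocessing:** Data should be normalized or standardized before feeding it into an autoencoder. If the output layer uses a sigmoid activation, scale features to the [0, 1] range so the network can reproduce them.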
   - **Network Architecture:** The choice of architecture (number of layers, type of layers, etc.) depends on the complexity of the data.
   - **Training Process:** It might involve techniques like dropout or noise addition to improve the model's ability to handle missing data.

5. **Example Use-Cases:**
   - **Image Data:** Filling in missing pixels or reconstructing corrupted images.
   - **Time-Series Data:** Imputing missing values in sequences like stock prices or weather data.
   - **Tabular Data:** Handling missing entries in datasets used for machine learning.
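
The idea from item 2, scoring the reconstruction only on known entries, and the noise addition mentioned in item 4 can be sketched without a deep learning framework. The following is a minimal NumPy illustration under stated assumptions, not the method used in the notebook: the helper names `masked_mse` and `corrupt` are hypothetical, and the arrays are made-up values standing in for real data and for a network's output.

```python
import numpy as np

def masked_mse(x_true, x_pred, observed):
    """Mean squared error computed over observed entries only.

    observed is a boolean array, True where x_true holds a real value.
    """
    observed = observed.astype(bool)
    # Zero out the error at missing positions so NaNs never reach the sum.
    diff = np.where(observed, x_pred - x_true, 0.0)
    n_obs = observed.sum()
    if n_obs == 0:
        return 0.0
    return float((diff ** 2).sum() / n_obs)

def corrupt(x, drop_prob, rng):
    """Randomly zero out entries to mimic missingness during training.

    Returns the corrupted copy and a boolean mask of the entries that were kept.
    """
    keep = rng.random(x.shape) >= drop_prob
    return np.where(keep, x, 0.0), keep

# Made-up data: NaN marks a missing value.
x_true = np.array([[0.2, np.nan, 0.5],
                   [0.9, 0.1, np.nan]])
observed = ~np.isnan(x_true)

# Made-up stand-in for an autoencoder's reconstruction.
x_pred = np.array([[0.2, 0.7, 0.3],
                   [0.8, 0.1, 0.4]])

# Only the four observed cells contribute:
# (0.2**2 + 0.1**2) / 4 = 0.0125, up to floating-point rounding.
loss = masked_mse(x_true, x_pred, observed)
```

In a training loop, `corrupt` would be applied to each batch so the network sees artificially hidden values whose true answers are known, and a loss like `masked_mse` would keep genuinely missing cells from influencing the gradient.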

### Implementation Example:

Here's a simplified example of how you might set up an autoencoder for imputation in Python using TensorFlow and Keras (see also the next notebook):

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [2]:
df = sns.load_dataset('titanic')

df = df[['survived', "pclass", 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

X = df.drop('survived', axis=1)
y = df['survived']

In [4]:
# handling missing values and categorical variables
numeric_features = ['age', 'fare', 'sibsp', 'parch']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# columntransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# preprocessing the dataset
X_preprocessed = preprocessor.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# define the autoencoder architecture
input_dim = X_train.shape[1]
encoding_dim = 32

input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

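# train the autoencoder to reconstruct its own input
# (the target must be X_train itself, not the survival labels in y_train)
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, shuffle=True, validation_split=0.2)

# reconstruct the test set with the trained autoencoder
# note: missing values were already mean-filled by the preprocessor above,
# so this output is a reconstruction of every feature, not only the missing cells
X_set_imputed = autoencoder.predict(X_test)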

Epoch 1/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 120ms/step - loss: 0.2666 - val_loss: 0.2668
Epoch 2/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step - loss: 0.2648 - val_loss: 0.2637
Epoch 3/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step - loss: 0.2618 - val_loss: 0.2608
Epoch 4/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step - loss: 0.2591 - val_loss: 0.2580
Epoch 5/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step - loss: 0.2566 - val_loss: 0.2553
Epoch 6/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step - loss: 0.2544 - val_loss: 0.2527
Epoch 7/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step - loss: 0.2521 - val_loss: 0.2502
Epoch 8/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step - loss: 0.2499 - val_loss: 0.2477
Epoch 9/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

In [6]:
X.isnull().sum().sort_values(ascending=False)

age         177
embarked      2
pclass        0
sex           0
sibsp         0
parch         0
fare          0


In [8]:
X_set_imputed.shape

(179, 13)
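
`X_set_imputed` is a full reconstruction of every test row (179 rows, 13 preprocessed columns), not only of the cells that were missing. A common final step, which the notebook does not show, is to keep the observed values and take the network's output only where a value was absent. The sketch below illustrates that merge on small made-up arrays; `fill_missing` is a hypothetical helper, and the numbers are illustrative rather than results from the Titanic run.

```python
import numpy as np

def fill_missing(x_original, x_reconstructed):
    """Keep observed values; use the reconstruction only where x_original is NaN.

    Returns the filled array and the boolean mask of positions that were filled.
    """
    missing = np.isnan(x_original)
    return np.where(missing, x_reconstructed, x_original), missing

# Made-up scaled features with NaN marking missing cells.
x_original = np.array([[0.25, np.nan],
                       [np.nan, 0.75]])

# Made-up stand-in for an autoencoder's output on the same rows.
x_reconstructed = np.array([[0.30, 0.40],
                            [0.60, 0.70]])

x_filled, missing = fill_missing(x_original, x_reconstructed)
# Observed cells (0.25 and 0.75) are unchanged; the two NaN cells
# take the reconstructed values 0.40 and 0.60.
```

To apply this to the pipeline above, the missing-value mask would need to be recorded before preprocessing, because the `SimpleImputer` steps replace the NaNs in `age` and `embarked` before the data reaches the network.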