# Autoencoders

## Setup and Context

### Introduction

An autoencoder is a type of neural network used to learn efficient codings of unlabeled data, which makes it an unsupervised learning method. It learns two functions: an encoding function that transforms the input into a compact representation, and a decoding function that recreates the input from that representation. The learned representation (encoding) is typically used for dimensionality reduction.
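
To make the encode/decode idea concrete, here is a minimal sketch using a purely linear stand-in built from the SVD (closely related to PCA). It is not the neural network trained later in this notebook; the toy data, `encode`, `decode`, and `reconstruction_error` are all hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy data: 100 samples, 5 features
X = X - X.mean(axis=0)          # center each feature

# Rows of Vt are orthonormal directions, ordered by how much variance they capture
_, _, Vt = np.linalg.svd(X, full_matrices=False)

def encode(x, k):
    """Project the input onto the first k directions (the k-dimensional code)."""
    return x @ Vt[:k].T

def decode(z):
    """Map a k-dimensional code back to the original feature space."""
    k = z.shape[1]
    return z @ Vt[:k]

def reconstruction_error(k):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((X - decode(encode(X, k))) ** 2))

# Larger codes keep more information, so the error should not increase with k
errors = [reconstruction_error(k) for k in range(1, 6)]
print(errors)
```

A smaller code forces more compression and a larger reconstruction error; a neural autoencoder makes the same trade-off, but with non-linear encoder and decoder functions learned by gradient descent.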

In this example, we are building a movie recommendation system.

### Variations of Autoencoders

Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.

- **Sparse Autoencoders:** Sparse autoencoders introduce a sparsity constraint on the latent space representation. This encourages the model to learn a compact and sparse representation of the data, which can be useful for feature extraction and dimensionality reduction. Sparse autoencoders are often used when the input data has many irrelevant features, helping the model to focus on the most important ones.

- **Denoising Autoencoders:** In this type of autoencoder, noise is added to the input data, and the network is trained to reconstruct the original, noise-free data. By learning to remove noise from the input, denoising autoencoders can capture robust features of the data. They are useful for tasks where data may be corrupted or noisy.

- **Contractive Autoencoders:** A CAE compresses input data into a lower-dimensional representation and then reconstructs the original data. It includes a regularization term that penalizes changes in the encoder's output with respect to small changes in the input data, encouraging the model to learn stable and invariant representations. CAEs are useful for tasks like data denoising, anomaly detection, and feature learning, providing compact and robust representations while maintaining the ability to reconstruct the input data.

- **Variational Autoencoders:** VAEs are probabilistic autoencoders that learn a latent variable model for the data. Instead of encoding data into a single fixed point in the latent space, VAEs map data to a probability distribution in the latent space. This allows for generating new data samples by sampling from the learned distribution. VAEs are commonly used for generating new data samples, such as in image generation tasks.

- **Stacked Autoencoders:** Stacked autoencoders consist of multiple layers of autoencoders stacked sequentially. Each layer encodes the input data into a more abstract representation, which serves as the input for the next layer. They are trained layer by layer using unsupervised learning techniques and are commonly used for feature learning and dimensionality reduction tasks.

- **Deep Autoencoders:** A deep autoencoder consists of multiple hidden layers in both the encoder and decoder parts of the network. These additional layers allow for learning more complex and hierarchical representations of the input data. Deep autoencoders are capable of capturing intricate features and structures in the data, making them suitable for tasks like image denoising, feature learning, and data compression. Deep Autoencoders are not the same as Stacked Autoencoders.
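
Two of the variations above can be sketched in a few lines. The snippet below shows one possible way to corrupt inputs for a denoising autoencoder (randomly zeroing entries) and one possible sparsity term (an L1 penalty on the latent code). Both helpers, `corrupt` and `l1_sparsity_penalty`, and their default values are hypothetical; other noise types and sparsity penalties are also in common use.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3):
    """Denoising setup: randomly zero out a fraction of the inputs.
    The network would see the corrupted x but be scored against the clean x."""
    keep_mask = rng.random(x.shape) >= drop_prob
    return x * keep_mask

def l1_sparsity_penalty(code, strength=1e-3):
    """Sparse setup: an L1 penalty on the latent activations.
    Added to the reconstruction loss, it pushes most activations toward zero."""
    return strength * np.abs(code).sum()

x_clean = np.ones((4, 10))              # toy batch: 4 samples, 10 features
x_noisy = corrupt(x_clean)              # same shape, some entries set to 0

code = np.array([[0.0, 2.0, -1.0]])     # a hypothetical latent code
penalty = l1_sparsity_penalty(code)     # 1e-3 * (0 + 2 + 1)
print(x_noisy)
print(penalty)
```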

### Import Statements

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim

## Data Preprocessing

### Importing the Dataset

The dataset used is from [Kaggle](https://www.kaggle.com/datasets/akkefa/movielens-9000-movies-dataset).

The movies dataframe is not used for training; it only serves as a lookup for movie titles and genres. Its most important column is `movieId`, because this is what we use to identify the movies.

In [2]:
movies = pd.read_csv("./data/movies.csv")

movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


The ratings are what we are using to train the autoencoder.

In [3]:
ratings = pd.read_csv("./data/ratings.csv")

ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


The original `movieId` values are not contiguous, so we renumber the movies sequentially from 1 and apply the same renumbering to the ratings dataframe.

In [4]:
# Add a new column 'id' to movies starting from 1
movies['id'] = movies.reset_index().index + 1

# Merge ratings with the movieId and id mapping
ratings = ratings.merge(movies[["movieId", "id"]], on="movieId")

# Replace movieId with the new sequential id and remove the helper id column
movies["movieId"] = movies["id"]
ratings["movieId"] = ratings["id"]

del movies["id"]
del ratings["id"]

# remove timestamp column from ratings dataframe as it is not used.
del ratings["timestamp"]
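
The renumbering step can be checked on a toy example. The sketch below follows the same merge-based approach on two small, made-up dataframes (`movies_toy` and `ratings_toy` are hypothetical names). Note that the default inner merge drops any rating whose `movieId` does not appear in the movies dataframe.

```python
import pandas as pd

# Toy data with gaps in movieId (1, 3, 47), mimicking non-contiguous ids
movies_toy = pd.DataFrame({"movieId": [1, 3, 47], "title": ["A", "B", "C"]})
ratings_toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [47, 1, 3], "rating": [5.0, 4.0, 3.0]}
)

# Sequential id starting from 1, in the order the movies appear
movies_toy["id"] = range(1, len(movies_toy) + 1)

# Attach the new id to each rating through the old movieId
ratings_toy = ratings_toy.merge(movies_toy[["movieId", "id"]], on="movieId")

# Overwrite the old ids with the new ones and drop the helper column
movies_toy["movieId"] = movies_toy.pop("id")
ratings_toy["movieId"] = ratings_toy.pop("id")

print(movies_toy)
print(ratings_toy)
```

After the remap, the movie that was `47` becomes `3` in both dataframes, so the ids can be used directly as column positions in a rating matrix.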

In [5]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [6]:
ratings.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 44 | 5.0 |
| 4 | 1 | 47 | 5.0 |


### Preparing the Training and Test Sets

Now we split the data into training and test sets, then convert both to NumPy arrays.

In [7]:
train_set, test_set = train_test_split(ratings, test_size=0.2, random_state=37)

train_set = np.array(train_set, dtype="int")
test_set = np.array(test_set, dtype="int")
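
One side effect of the cast above is worth seeing on a small example: `dtype="int"` truncates fractional values. If the ratings file contains half-star ratings (an assumption about this dataset, not something shown in the preview above), a rating of 3.5 would become 3 and a rating of 0.5 would become 0, which is the value later used for "not rated". The rows below are made up for illustration.

```python
import numpy as np

# Hypothetical (userId, movieId, rating) rows with half-star ratings
rows = np.array([[1.0, 10.0, 3.5],
                 [2.0, 11.0, 0.5]])

as_int = np.array(rows, dtype="int")          # same cast as in the cell above
as_float = np.array(rows, dtype="float32")    # alternative that keeps fractions

print(as_int[:, 2])    # truncated ratings
print(as_float[:, 2])  # original ratings preserved
```

If the float version were used instead, the user and movie columns would need an explicit `.astype(int)` before being used as matrix indices.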

### Getting the Total Number of Users and Movies

We need to get the total number of users and movies. This is because we are going to convert the training and test sets into matrices where the rows are the users, the columns represent the movies and each cell is the rating of a movie by a user.

If a user did not rate a movie then the cell would be 0.

In [8]:
# The users and movies are both using sequential ids. So the total number would just be the last id.
nb_users = max(ratings["userId"])
nb_movies = max(ratings["movieId"])

In [9]:
nb_users

610

In [10]:
nb_movies

9742

### Converting the Data into 2D Matrices

Let us create a function that converts our training and test sets into 2D matrices. Simply put, we want one row per user, where each row holds that user's rating for every movie.

In [11]:
def convert_data(data):
    # Initializing a matrix with zeros to store the ratings
    new_data = np.zeros((nb_users, nb_movies))
    
    # Iterating through each rating in the dataset
    for i in range(len(data)):
        # Extracting user, movie, and rating from the current row
        user = data[i, 0]
        movie = data[i, 1]
        rating = data[i, 2]

        # Storing the rating in the appropriate position in the matrix
        new_data[user - 1, movie - 1] = rating

    # Return the converted data matrix
    return new_data

In [12]:
train_set = convert_data(train_set)
test_set = convert_data(test_set)
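
The same conversion can also be written without an explicit Python loop by using NumPy integer-array indexing. This is only a sketch of an alternative; `convert_data_vectorized` is a hypothetical name, and it takes the matrix dimensions as arguments instead of reading the globals.

```python
import numpy as np

def convert_data_vectorized(data, nb_users, nb_movies):
    """Build a (nb_users, nb_movies) rating matrix from (userId, movieId, rating) rows."""
    matrix = np.zeros((nb_users, nb_movies))
    users = data[:, 0].astype(int) - 1    # 1-based user ids -> 0-based row indices
    movies = data[:, 1].astype(int) - 1   # 1-based movie ids -> 0-based column indices
    matrix[users, movies] = data[:, 2]    # write every rating in one indexing operation
    return matrix

# Toy rows: user 1 rated movie 2 with 4, user 3 rated movies 1 and 3 with 5 and 2
toy = np.array([[1, 2, 4],
                [3, 1, 5],
                [3, 3, 2]])
result = convert_data_vectorized(toy, nb_users=3, nb_movies=3)
print(result)
```

User 2 has no ratings in the toy data, so the second row stays all zeros.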

In [13]:
train_set

array([[0., 0., 4., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [2., 2., 2., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

In [14]:
test_set

array([[4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Converting to PyTorch Tensors

Now we convert our training and test sets to **PyTorch tensors**.

PyTorch tensors are multi-dimensional arrays similar to NumPy arrays, but with additional features and optimizations for deep learning tasks.

In [15]:
train_set = torch.from_numpy(train_set).float()
test_set = torch.from_numpy(test_set).float()

## The Neural Network

### Creating the Neural Network Architecture

We want to create an autoencoder, so the class we are going to create will contain the instructions for building it. To define an autoencoder in PyTorch, we need to specify several things: the layers, the number of layers, the number of nodes in each layer, the activation function, the criterion (loss function), and the optimizer.

In [16]:
# This is a stacked autoencoder
class SAE(nn.Module):
    # Initializes the architecture of the stacked autoencoder.
    # Defines the structure of the encoder and decoder layers, including fully connected hidden layers.
    def __init__(self):
        super().__init__()
        # These are the Fully Connected Hidden Layers
        # Encoder
        self.fc1 = nn.Linear(nb_movies, 20) # Fully connected layer from input (number of movies) to a hidden layer with 20 neurons
        self.fc2 = nn.Linear(20, 10) # Fully connected layer from the first hidden layer (20 neurons) to a second hidden layer with 10 neurons
        # Decoder
        self.fc3 = nn.Linear(10, 20) # Fully connected layer from the second hidden layer (10 neurons) to a third hidden layer with 20 neurons
        self.fc4 = nn.Linear(20, nb_movies) # Fully connected layer from the third hidden layer (20 neurons) to the output (number of movies)

        # The activation function
        self.activation = nn.Sigmoid() # Sigmoid activation function is used for non-linearity
        
    
    # This method defines the forward pass computation of the stacked autoencoder.
    # It describes how input data flows through the network's layers to produce a reconstructed output.
    def forward(self, x):
        # Encoding
        x = self.activation(self.fc1(x)) # Pass input through the first fully connected layer and apply sigmoid activation
        x = self.activation(self.fc2(x)) # Pass the result through the second fully connected layer and apply sigmoid activation
        # Decoding
        x = self.activation(self.fc3(x)) # Pass the result through the third fully connected layer and apply sigmoid activation
        x = self.fc4(x) # Pass the result through the fourth fully connected layer without any activation
        return x  # Return the reconstructed output
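
A quick way to sanity-check an architecture like this is to push a dummy batch through it and count the parameters. The sketch below uses a hypothetical `TinySAE` class with the same layer layout but a configurable input width, so it runs without the dataset.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Same layer layout as the class above, with the input width as a parameter."""
    def __init__(self, n_items):
        super().__init__()
        self.fc1 = nn.Linear(n_items, 20)   # encoder: input -> 20
        self.fc2 = nn.Linear(20, 10)        # encoder: 20 -> 10 (bottleneck)
        self.fc3 = nn.Linear(10, 20)        # decoder: 10 -> 20
        self.fc4 = nn.Linear(20, n_items)   # decoder: 20 -> output
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        return self.fc4(x)                  # no activation on the output layer

model = TinySAE(n_items=50)
batch = torch.zeros(3, 50)                  # 3 users, 50 items
output = model(batch)

# Weights plus biases: (50*20+20) + (20*10+10) + (10*20+20) + (20*50+50)
n_params = sum(p.numel() for p in model.parameters())
print(output.shape, n_params)
```

The output has the same shape as the input, as expected for a reconstruction. By the same count, an input width of 9742 gives 399,872 parameters, almost all of them in the first and last layers.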

Next, we set up the stacked autoencoder model, `sae`, define the loss function, `criterion`, which computes the reconstruction error, and initialize the RMSprop optimizer, `optimizer`, with the following parameters:

- `sae.parameters()` provides the parameters (weights and biases) of the autoencoder model to the optimizer for optimization.
- `lr=0.01` sets the learning rate to 0.01, which controls the step size during optimization.
- `weight_decay=0.5` applies L2 regularization with a strength of 0.5. L2 regularization helps prevent overfitting by penalizing large weights in the model.

In [17]:
sae = SAE()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(sae.parameters(), lr=0.01, weight_decay=0.5)
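
To see what `weight_decay` does in isolation, the sketch below applies one optimizer step to a single weight whose loss gradient is zero, so any change comes from the decay term alone. It uses plain SGD instead of RMSprop purely to keep the arithmetic readable; the values are made up for illustration.

```python
import torch
import torch.optim as optim

w = torch.nn.Parameter(torch.tensor([2.0]))
opt = optim.SGD([w], lr=0.1, weight_decay=0.5)  # SGD swapped in for readable arithmetic

opt.zero_grad()
loss = (w * 0.0).sum()   # a loss whose gradient with respect to w is zero
loss.backward()
opt.step()               # effective gradient: 0 + weight_decay * w = 0.5 * 2.0 = 1.0

updated = w.item()       # 2.0 - lr * 1.0 = 2.0 - 0.1
print(updated)
```

Even with no signal from the loss, the weight shrinks toward zero on every step, which is how the penalty discourages large weights.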

### Training the Stacked Autoencoder

We now train the stacked autoencoder on the training data. The loop iterates over each epoch and, within it, over each user in the training set: it prepares the input data, computes the reconstruction loss, and updates the model parameters accordingly.
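
The training loop masks predictions for unrated movies and then rescales the loss with a `mean_corrector`. The reason is that `nn.MSELoss` averages over every movie, including the masked ones that contribute zero error, so multiplying by `nb_movies / number_of_rated_movies` turns it into an average over rated movies only. The NumPy sketch below reproduces that arithmetic on made-up numbers.

```python
import numpy as np

target = np.array([4.0, 0.0, 0.0, 2.0, 0.0])   # 0 marks an unrated movie
output = np.array([3.0, 1.5, 2.5, 4.0, 0.7])   # hypothetical reconstruction

# Ignore predictions for unrated movies, as the training loop does
masked_output = np.where(target == 0, 0.0, output)

mse_all = np.mean((masked_output - target) ** 2)   # averaged over all 5 entries
n_rated = np.sum(target > 0)                       # 2 rated movies
mean_corrector = len(target) / n_rated             # 5 / 2
rmse_rated = np.sqrt(mse_all * mean_corrector)     # RMSE over rated movies only

# Direct computation over the rated entries, for comparison
rated = target > 0
direct = np.sqrt(np.mean((output[rated] - target[rated]) ** 2))
print(rmse_rated, direct)
```

Both values agree, so the number printed per epoch can be read as an average root-mean-square error, in rating units, on the movies each user actually rated.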

In [18]:
num_epoch = 200 # The total number of epochs we will run

# Iterate over the specified number of epochs
for epoch in range(1, num_epoch + 1):
    # Initialize variables to track training loss and number of processed samples
    train_loss = 0
    s = 0.0

    # Iterate over each user in the training set
    for id_user in range(nb_users):
        # Get the input data for the current user and add an extra dimension (unsqueeze) to match model input shape
        model_input = train_set[id_user].unsqueeze(0)
        # Clone the input data to use as target for computing loss
        target = model_input.clone()

        # Check if the target data contains nonzero values (i.e., if the user has rated any movies)
        if torch.sum(target.data > 0) > 0:
            # Pass the input data through the autoencoder to get the reconstructed output
            model_output = sae(model_input)
            # Freeze the target data to prevent gradients from being computed for these values
            target.requires_grad = False
            # Mask the output where target values are zero (unrated movies)
            model_output[target == 0] = 0

            # Compute the loss between the model output and target
            loss = criterion(model_output, target)
            # Compute the mean corrector factor so the loss is averaged over rated movies only
            # A very small number is added to the denominator to guard against division by zero
            mean_corrector = nb_movies / float(torch.sum(target.data > 0) + 1e-10)
            # Compute gradients and backpropagate the loss
            loss.backward()
            # Update the training loss by adding the scaled loss
            train_loss += np.sqrt(loss.item() * mean_corrector)

            # Increment the count of processed samples
            s += 1.0

            # Update model parameters using the optimizer
            optimizer.step()

    # Print epoch number and average loss for the epoch
    print(f"epoch: {epoch}\tloss: {train_loss / s}")

epoch: 1	loss: 3.3699132188881458
epoch: 2	loss: 2.5347151419233906
epoch: 3	loss: 2.094292741595523
epoch: 4	loss: 1.8541444945824823
epoch: 5	loss: 1.6997055287471245
epoch: 6	loss: 1.5912896643315348
epoch: 7	loss: 1.5106088953023271
epoch: 8	loss: 1.4479673065772705
epoch: 9	loss: 1.3976987831449645
epoch: 10	loss: 1.356310089634932
epoch: 11	loss: 1.3215266989020547
epoch: 12	loss: 1.2918052236087898
epoch: 13	loss: 1.2660537021577922
epoch: 14	loss: 1.2434845354511628
epoch: 15	loss: 1.223509546980243
epoch: 16	loss: 1.2056854334781908
epoch: 17	loss: 1.1896660055948423
epoch: 18	loss: 1.1751795192293204
epoch: 19	loss: 1.1620045527181753
epoch: 20	loss: 1.1499661283955416
epoch: 21	loss: 1.1389170300835056
epoch: 22	loss: 1.1287369128796663
epoch: 23	loss: 1.1193209077235273
epoch: 24	loss: 1.110585733436947
epoch: 25	loss: 1.1024570129507913
epoch: 26	loss: 1.094873005686984
epoch: 27	loss: 1.0877755893037224
epoch: 28	loss: 1.0811210114639962
epoch: 29	loss: 1.0748668526086322
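
One detail of the training loop deserves a note: it never calls `optimizer.zero_grad()`. PyTorch adds each new gradient to whatever is already stored in `.grad`, so without clearing, every update also includes gradients left over from earlier users. The sketch below demonstrates the accumulation on a tiny stand-in model; the model, the input, and the helper `grads_after_two_backward_calls` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 4)                          # tiny stand-in for the autoencoder
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters(), lr=0.01)

x = torch.tensor([[5.0, 0.0, 3.0, 0.0]])         # one hypothetical user

def grads_after_two_backward_calls(clear_between):
    """Run backward twice on the same input, optionally clearing gradients in between."""
    optimizer.zero_grad()
    criterion(model(x), x).backward()
    first = model.weight.grad.clone()
    if clear_between:
        optimizer.zero_grad()
    criterion(model(x), x).backward()
    return first, model.weight.grad.clone()

first, accumulated = grads_after_two_backward_calls(clear_between=False)
first_again, fresh = grads_after_two_backward_calls(clear_between=True)

print(torch.allclose(accumulated, 2 * first))    # gradients were summed
print(torch.allclose(fresh, first_again))        # gradients were reset
```

In a loop like the one above, the usual pattern is to call `optimizer.zero_grad()` once per user, before `loss.backward()`. Doing so would change the loss values, so the numbers printed above correspond to the loop as written.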

### Testing the Stacked Autoencoder

Finally, we test our autoencoder using the test set.

In [19]:
test_loss = 0  # Initialize test loss
s = 0.0  # Initialize count of processed samples

# Iterate over each user in the test set
for id_user in range(nb_users):
    # Get the input data for the current user from the training set and add an extra dimension
    model_input = train_set[id_user].unsqueeze(0)
    # Get the target data for the current user from the test set and add an extra dimension
    target = test_set[id_user].unsqueeze(0)

    # Check if the target data contains nonzero values (i.e., if the user has rated any movies)
    if torch.sum(target.data > 0) > 0:
        # Pass the input data through the autoencoder to get the reconstructed output
        model_output = sae(model_input)
        # Freeze the target data to prevent gradients from being computed for these values
        target.requires_grad = False
        # Mask the output where target values are zero (unrated movies)
        model_output[target == 0] = 0

        # Compute the loss between the model output and target
        loss = criterion(model_output, target)
        # Compute the mean corrector factor to adjust loss scale
        mean_corrector = nb_movies / float(torch.sum(target.data > 0) + 1e-10)
        # Update the test loss by adding the scaled loss
        test_loss += np.sqrt(loss.item() * mean_corrector)

        # Increment the count of processed samples
        s += 1.0

# Calculate and print the average test loss
print(f"test loss: {test_loss / s}")


test loss: 1.024999052430839
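
Once the model is trained, its output for a user can be turned into recommendations by ranking the movies that user has not rated yet. The sketch below shows that ranking step on made-up numbers; `recommend` is a hypothetical helper, and in this notebook `predicted` would presumably come from something like `sae(train_set[id_user].unsqueeze(0))` converted to a NumPy array.

```python
import numpy as np

def recommend(predicted, already_rated, k=3):
    """Return 1-based movie ids of the k highest predicted ratings among unrated movies."""
    # Exclude movies the user has already rated by giving them the lowest possible score
    scores = np.where(already_rated > 0, -np.inf, predicted)
    top = np.argsort(scores)[::-1][:k]       # indices sorted by descending score
    return (top + 1).tolist()                # back to 1-based movieId

predicted = np.array([4.8, 3.1, 4.5, 2.0, 4.9])       # hypothetical model output for one user
already_rated = np.array([5.0, 0.0, 0.0, 0.0, 0.0])   # the user's row in the rating matrix

top_movies = recommend(predicted, already_rated, k=2)
print(top_movies)
```

Movie 1 has a high predicted score but is skipped because the user already rated it; the resulting ids can be looked up in the movies dataframe to get titles.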
