### Intuition

A type of artificial neural network that consists of:
- Visible input nodes
- Encoding process
- Hidden Layers
- Decoding process
- Visible output nodes

This is a self-supervised learning method.

The aim of an autoencoder is to leanr a representation (encoding) of a set of a data by ignoring signal "noise" and reducing dimensions. 

E.g. translation of human languages, medical imaging, recommendation systems


A note on biases:
Our autoencoder may enclode bias terms to re-centre the data.


#### Training an autoencoder (movie ratings example):

1) Start with an array where row = user and column = movie. Each cell contains the rating 1-5 (0 if no rating) of the user u for movie i

2) First user is passed to the network, with input vector containing all ratings.

3) Input vector is encoded into a vector z of lower dimensions by a mapping function (e.g. sigmoid or tanh)

4) z is decoded into output vector y of the same dimensions as x, aiming to replicate the input vector x

5) The reconstruction error d(x, y) is computed. The goal is to minimize it

6) Using backpropagation, weights are updated according to how much they are responsible for the error. Learning rate determines how much we update the weights by.

7) Repeat steps 1-6 and update weights after each obsercation, or repeat 1-6 but update the weights only after a batch of observations.

8) When the whole training set is pased through the ANN, that becomes 1 epoch. Repeat for a set number of epochs.


#### Overcomplete hidden layers:
Suppose we had more nodes in the hidden layer than in the input layer. Although autoencoders are typically convenient for feature extraction/dimensionality reduction, nothing prevents us from having more nodes in the hidden layer than in the input layer.

PROBLEM:
If we have too many nodes in the hidden layer, the autoencoder's backpropagation algorithm can set some node weights to 0 or 1 such that some nodes are unused, and others are used at 100% weight (e.g. weight = 1). 


#### Sparse autoencoders:
It has been observed that when representations are learnt in a way that encourages sparsity, impproved performance is obtained on classification tasks.

Sparse autoencoders may include more hidden units than inputs, but only a small number of these hidden units are allowed to be active at once. This sparsity constraint forces the model to respond to the unique statistical features of the input data used for training. 

Specifically, a sparse autoencoder will have a training criterion that has a sparsity penalty on each layer. This penalty encourage the model to activate (output value close to 1) some specific areas of the network on the basis of the input data, while forcing all other neurons to be inactive (output values close to 0).

#### Denoising autoencoders:
Denoising autoencoders (DAE) try to achieve a good representation by changing the reconstruction criterion of the model.

DAEs take a partially corrupted input and are trained to recover the original undistorted input. In practice, the objective of denoising autoencoders is that of cleaning the corrupted input.

- Higher level representations much be relatively stable and robust to the corruption of the input

- The model needs to extract features that capture usefule structure in the distribution of the input.


In [7]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

In [8]:
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')

In [9]:
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype = 'int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int')

In [10]:
nb_users = int(max(max(training_set[:, 0], ), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1], ), max(test_set[:, 1])))

In [11]:
def convert(data):
  new_data = []
  for id_users in range(1, nb_users + 1):
    id_movies = data[:, 1] [data[:, 0] == id_users]
    id_ratings = data[:, 2] [data[:, 0] == id_users]
    ratings = np.zeros(nb_movies)
    ratings[id_movies - 1] = id_ratings
    new_data.append(list(ratings))
  return new_data
training_set = convert(training_set)
test_set = convert(test_set)

In [12]:
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [13]:
# Creating architecture for stacked autoencoder
# We will use inheritence via the nn library
class SAE(nn.Module):
    def __init__(self, ):
        super(SAE, self).__init__()
        # Now create linear transformations of data
        self.fc1 = nn.Linear(in_features = nb_movies, out_features = 20)
        self.fc2 = nn.Linear(20, 10)
        self.fc3 = nn.Linear(10, 20)
        self.fc4 = nn.Linear(20, nb_movies)
        self.activation = nn.Sigmoid()
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.fc4(x)
        return x

In [15]:
sae = SAE()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(sae.parameters(), 
                          lr = 0.01, weight_decay = 0.5)

In [21]:
# Training the SAE
nb_epoch = 200
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    normalizer = 0.
    for id_user in range(nb_users):
        # define our input
        # pytorch functions do not accept single 1d vectors
        # so we will add a new dimension to our input at index 0
        inputs = Variable(training_set[id_user]).unsqueeze(0)
        target = inputs.clone()
        # we now test to only select rated movies (0 means no rating)
        if (torch.sum(target.data > 0) > 0):
            outputs = sae(inputs)
            # we prevent algorithm from computing gradients at each step
            # this saves time and space on computation
            target.require_grad = False
            outputs[target == 0] = 0 # restore unrated values
            loss = criterion(outputs, target) # define our loss
            # mean corrector = nb_movies / nb of rated movies
            # we add + 1e-10 in order to prevent division by 0
            # this corrector provides us with an average error only for rated movies
            mean_corrector = nb_movies / float(torch.sum(target.data > 0) + 1e-10)
            loss.backward()
            train_loss += np.sqrt(loss.data*mean_corrector)
            normalizer += 1.
            optimizer.step()
    print('epoch: ' + str(epoch) + ' loss: '+ str(train_loss/normalizer))


epoch: 1 loss: tensor(0.9210)
epoch: 2 loss: tensor(0.9233)
epoch: 3 loss: tensor(0.9207)
epoch: 4 loss: tensor(0.9229)
epoch: 5 loss: tensor(0.9207)
epoch: 6 loss: tensor(0.9225)
epoch: 7 loss: tensor(0.9203)
epoch: 8 loss: tensor(0.9221)
epoch: 9 loss: tensor(0.9200)
epoch: 10 loss: tensor(0.9214)
epoch: 11 loss: tensor(0.9198)
epoch: 12 loss: tensor(0.9211)
epoch: 13 loss: tensor(0.9193)
epoch: 14 loss: tensor(0.9208)
epoch: 15 loss: tensor(0.9187)
epoch: 16 loss: tensor(0.9203)
epoch: 17 loss: tensor(0.9184)
epoch: 18 loss: tensor(0.9200)
epoch: 19 loss: tensor(0.9208)
epoch: 20 loss: tensor(0.9199)
epoch: 21 loss: tensor(0.9200)
epoch: 22 loss: tensor(0.9195)
epoch: 23 loss: tensor(0.9199)
epoch: 24 loss: tensor(0.9190)
epoch: 25 loss: tensor(0.9190)
epoch: 26 loss: tensor(0.9190)
epoch: 27 loss: tensor(0.9186)
epoch: 28 loss: tensor(0.9190)
epoch: 29 loss: tensor(0.9181)
epoch: 30 loss: tensor(0.9183)
epoch: 31 loss: tensor(0.9173)
epoch: 32 loss: tensor(0.9179)
epoch: 33 loss: t

In [24]:
test_loss = 0
normalizer = 0.
for id_user in range(nb_users):
    inputs = Variable(training_set[id_user]).unsqueeze(0)
    target = Variable(test_set[id_user]).unsqueeze(0)
    if torch.sum(target.data > 0) > 0:
        outputs = sae(inputs)
        target.require_grad = False
        outputs[target == 0] = 0
        loss = criterion(outputs, target)
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.data*mean_corrector)
        normalizer += 1.
print('test loss: '+str(test_loss/normalizer))

test loss: tensor(0.9404)
