## Autoencoders

Autoencoders are used for feature detection, recommender systems, and encoding.

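The hidden layer should normally have fewer nodes than the input layer. With as many or more hidden nodes than inputs (an overcomplete autoencoder), the network can simply copy its input instead of learning useful features. The following variants address this:
* Sparse Autoencoders: regularize with a sparsity penalty so that not all hidden nodes are used at the same time.
* Denoising Autoencoders: randomly set some input nodes to zero and train the network to reconstruct the original input.
* Contractive Autoencoders: add a penalty to the loss function, which is then backpropagated.
* Stacked Autoencoders: add more hidden layers.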
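As a rough illustration of two of these variants, the sketch below shows the input corruption used by a denoising autoencoder and an L1 sparsity penalty on hidden activations. It uses NumPy rather than PyTorch to stay small; the names `corrupt`, `drop_prob`, `sparsity_weight`, and the example values are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Zero each entry of x independently with probability drop_prob."""
    keep_mask = rng.random(x.shape) >= drop_prob
    return x * keep_mask

rng = np.random.default_rng(0)
x = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 5.0])   # hypothetical input vector

# Denoising: the network would see x_noisy but be trained to reproduce x
x_noisy = corrupt(x, drop_prob=0.5, rng=rng)

# Sparse: an L1 penalty on the hidden activations z is added to the loss,
# which pushes most activations towards zero
z = np.array([0.9, 0.0, 0.1, 0.0])             # hypothetical hidden activations
sparsity_weight = 0.01                         # hypothetical penalty strength
sparsity_penalty = sparsity_weight * np.abs(z).sum()
```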

1. We start with an array where the rows correspond to the users and the columns correspond to the movies. Each cell contains the rating (from 1 to 5) of movie i by user u.
2. The first user goes into the network. The input vector x contains all of that user's ratings for all the movies.
3. The input vector x is encoded into a vector z of lower dimension by a mapping function f (e.g. sigmoid or tanh): z = f(Wx + b), where W holds the input weights and b is the bias.
4. z is then decoded into an output vector y of the same dimension as x, aiming to replicate the input vector x.
5. The reconstruction error d(x, y) = ||x - y|| is computed. The goal is to minimize it.
6. The error is backpropagated. The weights are updated according to how much they are responsible for the error. The learning rate decides by how much the weights are updated.
7. Repeat steps 1 to 6, updating the weights either after each observation (Reinforcement Learning) or after a batch of observations (Batch Learning).
8. When the whole dataset has passed through the ANN, that makes one epoch.
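Steps 2 to 5 can be sketched for a single user as follows. This is a minimal NumPy illustration with made-up sizes and randomly initialised weights (`W_enc`, `W_dec`, and the biases are hypothetical names), and it stops before backpropagation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_movies, n_hidden = 6, 3                      # hypothetical sizes, hidden < input

# Step 2: one user's ratings (0 = not rated)
x = np.array([5.0, 0.0, 3.0, 4.0, 0.0, 1.0])

# Randomly initialised encoder and decoder parameters
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_movies))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_movies, n_hidden))
b_dec = np.zeros(n_movies)

z = sigmoid(W_enc @ x + b_enc)                 # step 3: z = f(Wx + b)
y = W_dec @ z + b_dec                          # step 4: decode back to input size
error = np.linalg.norm(x - y)                  # step 5: d(x, y) = ||x - y||

print(z.shape, y.shape, error)
```

Training would then adjust the weights to reduce `error`, as described in step 6.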

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

In [2]:
# Importing the dataset
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')

In [3]:
movies.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
users.head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


In [5]:
ratings.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [6]:
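# Preparing the training set and the test set
# Columns are: user id, movie id, rating, timestamp.
# Note: u1.base and u1.test have no header row, so without header = None pandas
# uses the first rating of each file as column names and it is missing from the arrays.
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype = 'int') # convert the DataFrame to a NumPy array
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int')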

In [7]:
training_set

array([[        1,         2,         3, 876893171],
       [        1,         3,         4, 878542960],
       [        1,         4,         3, 876893119],
       ...,
       [      943,      1188,         3, 888640250],
       [      943,      1228,         3, 888640275],
       [      943,      1330,         3, 888692465]])

In [8]:
test_set

array([[        1,        10,         3, 875693118],
       [        1,        12,         5, 878542960],
       [        1,        14,         5, 874965706],
       ...,
       [      459,       934,         3, 879563639],
       [      460,        10,         3, 882912371],
       [      462,       682,         5, 886365231]])

In [9]:
# Getting the number of users and movies
nb_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
nb_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))

The data is restructured into one list of ratings per user:

List of User 1: [Ratings of all the movies by User 1]

List of User 2: [Ratings of all the movies by User 2]

...

List of User 943: [Ratings of all the movies by User 943]

The new structure is a 2D array where:

* the rows are the users,
* the columns are the movies,
* the cells are the ratings (0 where the user has not rated the movie).

In [10]:
# Converting the data into an array with users in lines and movies in columns
def convert(data):
    new_data = []
    for id_users in range(1, nb_users + 1):
        id_movies = data[:,1][data[:,0] == id_users]
        id_ratings = data[:,2][data[:,0] == id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)

In [11]:
training_set[0][0:3]

[0.0, 3.0, 4.0]
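To make the conversion concrete, here is a small self-contained sketch on made-up data. It is a vectorised NumPy variant rather than the loop used in `convert`, and it assumes each (user, movie) pair appears at most once; `toy`, `to_user_rows`, and the sizes are hypothetical names chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: columns are user id, movie id, rating, timestamp
toy = np.array([
    [1, 2, 3, 0],
    [1, 3, 4, 0],
    [2, 1, 5, 0],
])
n_users, n_movies = 2, 3

def to_user_rows(data, n_users, n_movies):
    """Build a (users x movies) rating matrix; 0 marks an unrated movie."""
    matrix = np.zeros((n_users, n_movies))
    # ids start at 1, array indices start at 0
    matrix[data[:, 0] - 1, data[:, 1] - 1] = data[:, 2]
    return matrix

matrix = to_user_rows(toy, n_users, n_movies)
print(matrix)
# user 1 -> [0, 3, 4], user 2 -> [5, 0, 0]
```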

In [12]:
# Converting the data into Torch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [13]:
training_set

tensor([[0., 3., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 5., 0.,  ..., 0., 0., 0.]])

In [14]:
# Creating the architecture of the Neural Network (Sparse Autoencoders)

class SAE(nn.Module):
    def __init__(self):
        super(SAE, self).__init__()
        self.fc1 = nn.Linear(nb_movies, 20)
        self.fc2 = nn.Linear(20, 10)
        self.fc3 = nn.Linear(10, 20)
        self.fc4 = nn.Linear(20, nb_movies)
        self.activation = nn.Sigmoid()
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.fc4(x)
        return x
sae = SAE()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(sae.parameters(), lr = 0.01, weight_decay = 0.5)

In [15]:
# Training the SAE
# unsqueeze(0) adds a batch dimension at index 0, turning the ratings vector of shape
# (nb_movies,) into a batch of one with shape (1, nb_movies). The target is a clone of
# the input, so both tensors have the same shape.
nb_epoch = 100
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(nb_users):
        input = Variable(training_set[id_user]).unsqueeze(0)
        target = input.clone()
        if torch.sum(target.data > 0) > 0:
            output = sae(input)
            target.requires_grad = False
            output[target == 0] = 0
            loss = criterion(output, target)
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
            loss.backward()
            train_loss += np.sqrt(loss.item()*mean_corrector)
            s += 1.
            optimizer.step()
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))

epoch: 1 loss: 1.7719929318220722
epoch: 2 loss: 1.096760158905915
epoch: 3 loss: 1.0533094964245444
epoch: 4 loss: 1.0383982444571715
epoch: 5 loss: 1.0309382982452766
epoch: 6 loss: 1.0265652068782372
epoch: 7 loss: 1.0237284035617966
epoch: 8 loss: 1.0219070052636214
epoch: 9 loss: 1.0205892605646747
epoch: 10 loss: 1.0197149291934113
epoch: 11 loss: 1.0189747759361572
epoch: 12 loss: 1.018246281084498
epoch: 13 loss: 1.0178426867112027
epoch: 14 loss: 1.017592415517041
epoch: 15 loss: 1.0173394900170984
epoch: 16 loss: 1.0168938809318402
epoch: 17 loss: 1.0166501032386377
epoch: 18 loss: 1.0166134474345925
epoch: 19 loss: 1.0161801975558373
epoch: 20 loss: 1.016196408324552
epoch: 21 loss: 1.0159812203164016
epoch: 22 loss: 1.0160233392152687
epoch: 23 loss: 1.0158801475487444
epoch: 24 loss: 1.0158160473735829
epoch: 25 loss: 1.0158485653085103
epoch: 26 loss: 1.015772427463776
epoch: 27 loss: 1.0153465507614634
epoch: 28 loss: 1.0149782115337054
epoch: 29 loss: 1.0129961693909169
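The training loop above never calls `optimizer.zero_grad()`, and PyTorch accumulates gradients across `backward()` calls unless they are cleared. The sketch below shows one possible variant of the update step that clears them first and masks unrated movies without editing the output in place. It uses a small stand-in model and a single made-up user, not the `SAE` class or the MovieLens data, so its numbers are not comparable to the losses printed above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
n_movies = 8                                   # hypothetical toy size

# Small stand-in model, not the SAE class from the notebook
model = nn.Sequential(nn.Linear(n_movies, 4), nn.Sigmoid(), nn.Linear(4, n_movies))
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters(), lr=0.01)

ratings = torch.tensor([[5., 0., 3., 0., 4., 0., 0., 1.]])  # one user, 0 = not rated
rated = ratings > 0

losses = []
for step in range(50):
    optimizer.zero_grad()                      # clear gradients left from the previous step
    output = model(ratings)
    output = output * rated                    # zero the predictions for unrated movies
    loss = criterion(output, ratings)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```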

In [16]:
# Testing the SAE
test_loss = 0
s = 0.
for id_user in range(nb_users):
    input = Variable(training_set[id_user]).unsqueeze(0)
    target = Variable(test_set[id_user]).unsqueeze(0)
    if torch.sum(target.data > 0) > 0:
        output = sae(input)
        target.requires_grad = False
        output[target == 0] = 0
        loss = criterion(output, target)
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.item()*mean_corrector)
        s += 1.
print('test loss: '+str(test_loss/s))

test loss: 0.9604290318022629
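The `mean_corrector` factor used in both loops rescales the loss so that it is averaged over the movies the user actually rated rather than over all `nb_movies`. The following worked example, with made-up ratings and predictions, checks that the corrected value matches an RMSE computed directly on the rated movies.

```python
import numpy as np

target = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 4.0])   # 0 = not rated
output = np.array([4.0, 2.5, 3.5, 1.0, 3.0, 4.0])   # hypothetical predictions

rated = target > 0
output = np.where(rated, output, 0.0)    # same effect as output[target == 0] = 0

mse_all = np.mean((output - target) ** 2)            # averaged over all 6 movies
mean_corrector = target.size / rated.sum()           # 6 movies / 3 rated = 2
rmse_rated = np.sqrt(mse_all * mean_corrector)

# Direct computation over the rated movies only, for comparison
rmse_direct = np.sqrt(np.mean((output[rated] - target[rated]) ** 2))

print(rmse_rated, rmse_direct)
```

Both values agree, so a reported loss close to 1 means predictions are off by roughly one star on average for the rated movies.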
