Skip to content

Latest commit

 

History

History
339 lines (236 loc) · 14.2 KB

2016-02-21-cfn.md

File metadata and controls

339 lines (236 loc) · 14.2 KB
layout title comments author excerpt picture
post
Collaborative Filtering with Neural Networks
true
florian-strub
Collaborative Filtering methods are key algorithms for Recommender Systems. They are used by major companies such as Netflix, Amazon or Spotify to suggest items to users. While Neural Networks have tremendous success in image and speech recognition, Collaborative Filtering received less attention as it deals with sparse inputs/targets. After a quick introduction to Recommendation Systems, we show how to implement a well-defined Autoencoders to perform Collaborative Filtering.

Deep Learning has well-known state-of-the-art results in Computer Vision, Speech Recognition or Natural Language Processing. This post intends at using Neural Networks on a different topic : Recommender systems and Collaborative Filtering. It turns out that a well-defined Autoencoder can compete with the best Collaborative Fitering algorithms. This post focus on understating and implementing such Autoencoders.

#Collaborative Filtering#

Recommendation systems are a subclass of algorithms that seek to predict the rating that a user would give to an item. A good recommendation system may dramatically increase the amount of sales of a firm or retain customers. For instance, 80% of movies watched on Netflix come from the recommender system of the company.

One of the most successful approach of Recommendation systems is to use the explicit feedback of users to build a Matrix of ratings. This technique is known as Collaborative Filtering.

For instance, if we gather the feedback of movies of all users into a matrix, we obtain a sparse matrix of ratings. Note that the sparsity is induced by unknown values.

initialMatrix

In this case, should we recommend Bob to watch The Exorcist or Titanic?

Collaborative Filtering aims at estimating the ratings a user would have given to other movies by using the ratings of all the other users. Inherent features of users or items such as the genre, age, actors or some descriptions are not taken into account. The goal is then to turn the previous sparse matrix of ratings into a dense one:

finalMatrix

Therefore, the Recommender System will advise Bob to watch The Exorcist.

If Bob eventually rates The Exorcist, the matrix is re-estimated and so on... In practice, the matrix of ratings is very sparse. IMDB currently indexes around 1,7 millions movies while users rarely rate more than a few hundreds movies.

I would recommend the following paper [1] to curious people which provide a great insight regarding the challenges of the field (Latent factors, Cold-Sart Problem, Hybrid Systems etc.)

In this post, we will introduce a Collaborative Filtering approach based on Stacked Denoising Autoencoders. A preliminary (and excellent!) work was first pulished in 2007 by Ruslan Salakhutdinov [2].

Understanding the model

As a short reminder, Autoencoders are unsupervised Networks where the output of the Network plan to reconstruct the initial input. The Network is constrained to use narrow hidden layers, forcing a dimensionality reduction on the data.

In a Collaborative Filtering task, the goal is to turn the sparse matrix of ratings into a dense one. To do so, we are going to decompose the matrix by either its lines or columns and feed them to the Autoencoder. The output layer will reconstruct the lines/columns and complete them.

UV

Note that the input/output size will depend on the matrix dimension.

Before implemeting the Autoencoder, we need to define how to handle unknwon values.

In the forward pass, a naive but efficient approach is to turn unkwown values into zero values. This is strictly equivalent to removing the neurons from the network. In other words, unknown values are ignored.

In the backward pass, the same idea can be applied. When no rating exist, the error is set to zero as it cannot be computed.

training1

At first sight, there is very few constraint in such Autoencoders. Actually, dropout already removed 50% of the weights. Beware that more than 95% of the ratings are missing in real datasets. Thus, the bottleneck of the autoencoder is often greater than the number of ratings in the input vector! If we change the toys example by a bit more realistic autoencoder. It is not straightfroward that the network will be able to learn a useful representation from a very sparse representation.

error

The autoencoder is acctually diverted from its original purpose. It does not aim to reconstruct the initial input anymore, it aims at predicting the missing data.

By keeping this idea in mind, we applied the Denoising loss [2] to emphasize the prediction aspect during the training. First, a few input ratings are masked on every input vector to simulate missing ratings. And then, a higher error is returned when the output rating is erroneously predicted.

The training loss becomes DAELoss

where \tilde{x} is the corrupted input x, J are the indices of the corrupted elements of x, nn(x)_j is the jth output of the Network, K(x) are the indices of known values of x. The loss is based on two main hyperparameters Alpha, Beta. They balance the evolution of the Network between denoising the input (alpha) and reconstructing the input (beta).

Finally, the full training process is described below

training1

A more advanced description is described in our paper [4]). Importantly, we explain in this paper the underlying idea that links classic Collaborative Filtering techniques and Autoencoders and propose ideas that may explain why neural Networks performs well in Collaborative Filtering tasks.

Now, let's start the interesting part!

Implementation

First, we need to install the nnsparse package. It will provide us with the basic tools to handle sparse data.

luarocks install nnsparse

The data

We are going to work on a small but classic dataset called movieLens-1M. This dataset contains 1 million ratings from 1 to 5 given by 6040 users on 3700 movies.

To avoid memory overflow, data must be stored in sparse tensors. Given the toy matrix in the first section, the sparse matrix of users/items would be:

training1

The following code parses the movieLens-1M dataset and turns it into a sparse reprensation of items

   local train, test = {}, {}

   -- Step 1 : Load file
   local ratesfile = io.open("ml-1m/ratings.dat", "r")

   -- Step 2 : Retrieve ratings
   for line in ratesfile:lines() do

      -- parse the file by using regex
      local userIdStr, movieIdStr, ratingStr, _ = line:match('(%d+)::(%d+)::(%d%.?%d?)::(%d+)')

      local userId  = tonumber(userIdStr)
      local itemId  = tonumber(movieIdStr)
      local rating  = tonumber(ratingStr)

      -- normalize the rating between [-1, 1]
      rating = (rating-3)/2

      -- we are going to autoencode the item with a training ratio of 0.9
      if torch.uniform() < 0.9 then
          train[itemId] = train[itemId] or nnsparse.DynamicSparseTensor()
          train[itemId]:append(torch.Tensor{userId, rating})
      else
          test[itemId] = test[itemId] or nnsparse.DynamicSparseTensor()
          test[itemId]:append(torch.Tensor{userId, rating})
      end 

   end

   -- Step 3 : Build the final sparse matrices
   for k, oneTrain in pairs(train) do train[k] = oneTrain:build():ssortByIndex() end
   for k, oneTest  in pairs(test) do test[k]  = oneTest:build():ssortByIndex() end

It is important to unbias the lines/columns by the training mean to increase the final results.

-- Step 4 : remove mean
for k, oneTrain in pairs(train) do 
   local mean = oneTrain[{ {},2 }]:mean()
   train[k][{ {},2 }]:add(-mean) 
   test[k] [{ {},2 }]:add(-mean) 
end

The Network

We will first implement a shallow Autoencoder with sparse inputs for items. Deeper Autoencoder can be used, but the final results will only be increased by a small margin. So let's keep things simple!

  • On CPU:
local network = nn.Sequential()
network:add(nnsparse.SparseLinearBatch(6040, 500)) --There are 6040 users in movieLens-1M
network:add(nn.Tanh())
network:add(nn.Linear(500, 6040))
network:add(nn.Tanh())
  • On GPU:
local network = nn.Sequential()
network:add(nnsparse.Densify(6040)) --GPU does not handle sparse data, so we densify the input on the fly
network:add(nn.Linear(6040, 750))
network:add(nn.Tanh())
network:add(nn.Linear(750, 6040))
network:add(nn.Tanh())
network:cuda()

Training Loss

As described above, training the autoencoder requires to use a Denoising Loss. The nnsparse package offers a full Denoising Loss (Gaussian Noise/Masking Noise/ Salt&Paper Noise) for CPU and a Masking Loss that both work on CPU/GPU. We will use the latter since we want to focus on the prediction aspect.

   -- define the Loss function
   lossFct = nnsparse.MaskCriterion(nn.MSECriterion(), {
      alpha     = 1,
      beta      = 0.5,
      hideRatio = 0.25,
   })
   lossFct.inputDim    = 6040 --require for sparse inputs
   lossFct.sizeAverage = false

   -- Create minibatch
   local input, minibatch = {}, {}

   --shuffle the indices of the inputs to create the minibatch 
   local shuffle = torch.randperm(inputSize)
   shuffle:apply(function(k)
      if train[k] then
         input[#input+1] = train[k] 
         if #input == batchSize then
            minibatch[#minibatch+1] = input
            input = {}  
         end
      end
   end)
   if #input > 0 then 
      minibatch[#minibatch+1] = input 
   end

   -- Classic training 
   local w, dw = network:getParameters()
   
   for _, input in pairs(minibatch) do
      local function feval(x)

         -- Reset gradients and losses
         network:zeroGradParameters()

         -- AutoEncoder targets
         local target = input

         -- Compute noisy input for Denoising autoencoders
         local noisyInput = lossFct:prepareInput(input) 

         -- FORWARD
         local output = network:forward(noisyInput)
         local loss   = lossFct:forward(output, target)

         -- BACKWARD
         local dloss = lossFct:backward(output, target)
         local _     = network:backward(noisyInput, dloss)

         -- Return loss and gradients
         return loss/batchSize, dw:div(batchSize)
      end
      
      sgdConfiguration.evalCounter = t
      optim.sgd (feval, w, sgdConfiguration )

   end  

Testing Loss

Torch criterions do not handle sparse targets by default. Thus, it is necessary to wrap them into another criterion. The nnsparse.SparseCriterion takes a dense output and a sparse target as parameters. It then computes the loss/dloss regarding the indices of the target.

   -- Create minibatch
   local noRatings = 0
   local input, target, minibatch = {}, {}, {}
   
   for k, _ in pairs(train) do
   
     if test[k] ~= nil then --ignore when there is no target 
       input [#input  +1] = train[k] 
       target[#target +1] = test[k]
       
       noRatings = noRatings + test[k]:size(1)
     
       if #input == batchSize then
         minibatch[#minibatch+1] = {input = input, target = target}
         input, target = {}, {}  
       end
     end
   end
   if #input > 0 then 
      minibatch[#minibatch+1] = {input = input, target = target} 
   end

   -- define the testing criterion
   local criterion = nnsparse.SparseCriterion(nn.MSECriterion())
   criterion.sizeAverage = false

   -- Compute the RMSE by predicting the testing dataset thanks to the training dataset
   local err = 0
   for _, oneBatch in pairs(minibatch) do
     local output = network:forward(oneBatch.input)
     err = err + criterion:forward(output, oneBatch.target)
   end
   
   err = err/noRatings

   print("Current RMSE : " .. math.sqrt(err) * 2) -- The RMSE is rescaled

Here we are! The shallow autoencoder can be trained.

The full code can be found here.

A more advanced implementation can be found here.

Results

For movieLens-1M, the network should be trained in a few minutes (~20 epoches). The final RMSE should be around 0.839 with the following hyperparameters. Deeper Networks can go up to a RMSE of 0.831.

local sgdConfiguration = {
   learningRate      = 0.07,  
   learningRateDecay = 0.9,
   weightDecay       = 0.05,
}

As a benchmark, Alternative Least Squares [3] is a widely used Collaborative algorithm. It has a final RMSE around 0.850 on movieLens-1M - cf Mahout + Script

Furthermore, the bigger is the dataset, the better will be the Autoencoder. The current state of the art results can be obtained with a shallow autoencoder for movieLens-10/20M.

Once you have your network, you are free to implement your recommendation system!

More advanced Network can be built to integrate side information such as movie tags, age/sex/job etc. [4] Boosting can also be used to mix both User/Item-autoencoders to compute a single matrice.

References

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8), 30-37.,
  2. Salakhutdinov, R., Mnih, A., & Hinton, G. (2007, June). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning (pp. 791-798). ACM.
  3. Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management (pp. 337-348). Springer Berlin Heidelberg.
  4. Strub, F., Mary, J., Gaudel R., (2016). Hybrid Collaborative Filterings and Neural Networks. Under review