# Recommender systems based on collaborative filtering and latent factors

In this notebook we discuss an approach to building a recommender system that uses all the information stored in the user-product rating matrix. Here we assume that the matrix contains only **explicit ratings** that users gave to the products they bought, but there are alternative models for the user-product matrix that incorporate implicit sources of knowledge for the case where no explicit feedback is available from the user.

Because we use all the information in this matrix to predict unknown entries, we say that we follow a collaborative (or global) filtering approach: the prediction for each unknown entry is based on the information collected from all the known entries of the user-product matrix.

In [1]:
#
# Our preliminary set-up code
#

import pyspark
import os
import math
import random
import sys

%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt


spark_home = os.environ.get('SPARK_HOME', None)
sc = pyspark.SparkContext('local[*]')

print (spark_home, sc)

None <SparkContext master=local[*] appName=pyspark-shell>


## U-V matrix factorization for latent factors discovery

One approach to discovering latent factors that characterize users and products is U-V matrix factorization. We want to find a factorization of our user-product matrix as the product of two matrices $U$ and $V$, where:
1. Each user will be represented as a row vector with $c$ factors in matrix $U$, so $U$ will be a matrix with $m$ rows and $c$ columns.
2. Each product will be represented as a column vector with $c$ factors in matrix $V$, so $V$ will be a matrix with $c$ rows and $n$ columns.

Then, given our user-product rating matrix $M$, with $m$ rows (users) and $n$ columns (products), we want to find matrices $U$ and $V$ such that:

$$ \overbrace{ \left( \begin{matrix}
  U_{1,1} & \cdots & U_{1,c} \\
  \vdots & \ddots & \vdots \\
  U_{m,1} & \cdots & U_{m,c}
 \end{matrix} \right) }^{U \ (m\times c)} \times
 \overbrace{ \left( \begin{matrix}
  V_{1,1} & \cdots & V_{1,n} \\
  \vdots & \ddots & \vdots \\
  V_{c,1} & \cdots & V_{c,n}
 \end{matrix}  \right)  }^{V \ (c\times n) } \ = \
  \overbrace{ \left( \begin{matrix}
  M_{1,1} & \cdots & M_{1,n} \\
  \vdots & \ddots & \vdots \\
  M_{m,1} & \cdots & M_{m,n}
 \end{matrix}  \right)  }^{M \ (m \times n)}$$

Remember that in general the matrix $M$ may have empty entries, whereas the product of $U$ and $V$ provides values for all entries. So when we say that a U-V factorization is good for a partially filled matrix $M$, we usually mean that it agrees closely with the values of the filled entries of $M$. Once we have this factorization, the value of any entry $(i,j)$ of $M$, even an unknown one, is predicted by multiplying row $i$ of $U$ by column $j$ of $V$, that is:

$$ \hat{M}(i,j) = \sum_{k=1}^c U_{i,k} \cdot V_{k,j} $$
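
As a minimal sketch of this prediction rule, the following NumPy snippet uses small made-up factor matrices (the values of `U` and `V` and the helper name `predict` are assumptions for illustration, not part of the notebook's data). It shows that a single predicted entry is the dot product of a row of $U$ and a column of $V$, and that the full prediction matrix is just the matrix product.

```python
import numpy as np

# Hypothetical toy factors: m = 3 users, n = 4 products, c = 2 latent factors.
U = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [1.5, 1.0]])             # shape (m, c): one row per user
V = np.array([[2.0, 0.0, 1.0, -1.0],
              [1.0, 3.0, 0.0, 0.5]])   # shape (c, n): one column per product

def predict(U, V, i, j):
    """Predicted rating for user i and product j (0-based indices)."""
    return float(U[i, :] @ V[:, j])

# All predictions at once: entry (i, j) of U @ V equals predict(U, V, i, j).
M_hat = U @ V

print(predict(U, V, 0, 1))
print(M_hat)
```

Note that the code uses 0-based indices, while the formula above uses 1-based indices.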

Because it may not be possible to find an **exact** factorization with the desired number of latent factors, finding the factorization is actually posed as an optimization problem, where the goal is to find the factorization with the smallest root-mean-square error (RMSE), computed over the filled entries of $M$:

$$ \sqrt{\frac{\sum_{i,j}(U(row_i) \ \cdot \ V(col_j) - M(i,j))^2}{\# \ known \ entries}} $$

where the sum is over the entries $(i,j)$ of $M$ with known values. However, we are really more interested in **predicting the unknown values** of $M$ than in reproducing the known ones. To avoid **overfitting** to the known values, which would give a high error on the unknown entries, optimization algorithms usually also allow alternative objective functions that add regularization terms to control the tradeoff between the RMSE on the known entries and the quality of the predictions for unknown entries.
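
The two objective functions can be sketched in NumPy as follows. This is only an illustration under stated assumptions: unknown entries are marked with `NaN`, the function names `rmse_known` and `regularized_loss` are hypothetical, and the regularized objective shown is one common choice (squared error plus an L2 penalty weighted by `lam`), not necessarily the exact one used by any particular library.

```python
import numpy as np

def rmse_known(U, V, M, known):
    """RMSE of U @ V against M, restricted to entries where `known` is True."""
    diff = (U @ V - M)[known]
    return float(np.sqrt(np.mean(diff ** 2)))

def regularized_loss(U, V, M, known, lam):
    """Sum of squared errors on known entries plus an L2 penalty on U and V."""
    diff = (U @ V - M)[known]
    return float(np.sum(diff ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2)))

# Toy example: a 2x2 matrix with one unknown entry (marked with NaN).
M = np.array([[2.0, np.nan],
              [4.0, 6.0]])
known = ~np.isnan(M)
U = np.array([[1.0], [2.0]])   # m = 2 users, c = 1 latent factor
V = np.array([[2.0, 3.0]])     # c = 1 latent factor, n = 2 products

print(rmse_known(U, V, M, known))                  # error on known entries only
print(regularized_loss(U, V, M, known, lam=0.1))   # adds the penalty term
```

In this toy case the factorization reproduces every known entry exactly, so the RMSE is zero and the regularized loss consists only of the penalty term. A larger `lam` favors smaller factor values over an exact fit to the known entries.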


## Optimization algorithms for U-V matrix factorization

We may consider different algorithms for finding a good U-V factorization (with a small RMSE or any other generalized error function). These are the two main ones:
1. Gradient descent: Using the same approach we presented for finding a linear model, but this time applied to the parameters of our model (the coefficients of the matrices $U$ and $V$).

2. Alternating Least Squares (ALS): If we keep one of the two matrices ($U$ or $V$) fixed, the problem becomes a least-squares (quadratic) optimization problem that can be solved optimally in polynomial time. So we perform an iterative process where we fix one matrix and find the optimal values for the other, and in the next iteration we exchange the roles: the second one is fixed and the first one is optimized. This process is repeated until we reach a fixed point (or, in practice, a maximum number of iterations).
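
The alternating scheme in item 2 can be sketched in plain NumPy. This is a small in-memory illustration, not Spark's implementation: the function name `als_sketch`, the L2 regularization weight `lam`, the random initialization and the toy matrix are all assumptions made for this example. With one matrix fixed, each row of $U$ (or column of $V$) is the solution of a small regularized least-squares problem over the known entries only.

```python
import numpy as np

def als_sketch(M, known, c, lam=0.01, iterations=50, seed=0):
    """Alternating least squares on the known entries of M (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = np.zeros((m, c))
    V = rng.random((c, n))          # random starting point for V
    reg = lam * np.eye(c)
    for _ in range(iterations):
        # Fix V: solve one small regularized least-squares problem per user.
        for i in range(m):
            cols = known[i, :]
            Vk = V[:, cols]                         # factors of rated products
            U[i, :] = np.linalg.solve(Vk @ Vk.T + reg, Vk @ M[i, cols])
        # Fix U: solve one small regularized least-squares problem per product.
        for j in range(n):
            rows = known[:, j]
            Uk = U[rows, :]                         # factors of users who rated j
            V[:, j] = np.linalg.solve(Uk.T @ Uk + reg, Uk.T @ M[rows, j])
    return U, V

# Toy rank-1 matrix with one entry treated as unknown.
M = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
known = np.ones(M.shape, dtype=bool)
known[3, 2] = False                 # pretend we do not know M[3, 2] (true value 12)

U, V = als_sketch(M, known, c=1)
print("prediction for the hidden entry:", (U @ V)[3, 2])
```

Here a fixed number of iterations is used instead of testing for a fixed point, which mirrors the `numIterations` parameter passed to `ALS.train` later in this notebook.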


To learn more about these problems and the algorithms for solving them, a good source of information is the paper:

> Yehuda Koren, Robert Bell and Chris Volinsky. *Matrix factorization techniques for recommender systems*. Computer, IEEE Computer Society, Vol. 42(8), 2009.

and also the book about data mining for big data we recommended for this course.

Let's consider the same example matrix we used in our notebook about clustering algorithms: a database of users where, for each user, we store the ratings that the user gave to different movies. We first consider the matrix with all entries filled, and later we consider variations of this matrix where some of the entries are empty.

In [5]:
# Example data
#
# We have 10 users and 10 movies: STW1, STW2, STW3, STW4, STW5, STW6,
#                                 T1, T2, T3 and BaT
# Each entry i,j is the rating given by user i to movie j, in the range [-5.0,5.0]
# We can observe that we have four clear Star Wars fans (who also like the
# Terminator movies a little)
# We also have four clear Terminator fans (who also like the Star Wars movies a little)
# Finally, we have two clear Breakfast at Tiffany's (BaT) fans, who do not
# like science-fiction movies much

usersandmovies = [ [3,3,3,5,5,4, 3,3,-1, -1], \
                   [3,3,3,5,5,4, 4,2,0, -1], \
                   [3,3,4,5,5,4, 4,4,1, 0], \
                   [4,3,3,4,5,4, 3,3,1, -1], \
                   [1,1,1,0,1,1, 5,4,2, -1], \
                   [1,2,1,0,1,1, 4,4,2, -1], \
                   [1,2,2,1,1,1, 4,4,2, -1], \
                   [1,2,2,1,1,0, 5,4,3, -1], \
                   [-2,-3,-2,0,-2,-1, 0,0,-1,4], \
                   [-2,-3,-2,0,-2,-1, 0,0,-1,4]   ]

# In this first example, which we also used with the clustering algorithms, all the entries are known,
# so we will first consider how well the ALS algorithm decomposes a fully filled matrix as the product
# of the two latent-factor matrices U and V

In [6]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# We need a function to convert our matrix format to a list of
# pyspark.mllib.recommendation.Rating objects, which we later turn into an RDD:
#
#    Rating(int(userid), int(productid), float(rating))
#
# BEWARE: this function works with the whole matrix in the driver memory,
# building a Python list representation of the ratings from which we finally get the
# RDD. A better (more scalable) version should build the ratings from an RDD with the
# matrix entries loaded from a source file, not from a Python matrix in main memory

def convertMatrixToRatings( matrix ):
    ratings = []
    for user,userrow in enumerate(matrix):
        for product,productrating in enumerate(userrow):
            ratings.append( Rating( int(user) , int(product), float(productrating) ) )
    return ratings       

In [7]:


# Load and parse the data
# data = sc.textFile(spark_home+"data/mllib/als/test.data")
# ratings = data.map(lambda l: l.split(','))\
#    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

ratings = sc.parallelize( convertMatrixToRatings( usersandmovies ) )

# Check the data:
# print(ratings.collect())


# Build the recommendation model using Alternating Least Squares
rank = 3
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
# First, get the data without the rating values (only user-product IDs)
testdata = ratings.map(lambda p: (p[0], p[1]))

# Next, get the predictions obtained with our U-V factorization model
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
# join ((u,p), V) and ((u,p), W) to get ((u,p), (V, W))
ratesAndPreds = ratings.map( lambda r: ((r[0], r[1]), r[2]) ).join(predictions)

# Compute the mean squared error (MSE)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Mean Squared Error = 0.08866856107494435
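
The cell above reports the mean squared error, while the objective discussed earlier is the RMSE. As a small sketch, the RMSE is simply the square root of that value; the number below is copied from the printed output, and the variable names are chosen for this example.

```python
import math

# MSE value copied from the output printed above.
mse = 0.08866856107494435

# The RMSE is expressed in the same units as the ratings themselves.
rmse = math.sqrt(mse)
print("Root Mean Squared Error = " + str(rmse))
```

This gives an RMSE of about 0.30 rating points on the $[-5,5]$ scale.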


In [9]:
# We can also get the U-V factorization, as the set of user features (latent factors) of
# the U matrix

print (" Users latent factors:")
for userfactors in model.userFeatures().sortByKey().collect():
    print (userfactors)

# and the set of product features of the V matrix:
print ("\n Products latent factors:")
for productfactors in model.productFeatures().sortByKey().collect(): 
    print (productfactors)

 Users latent factors:
(0, array('d', [-1.0632634162902832, -3.317167282104492, 1.033426284790039]))
(1, array('d', [-1.170506238937378, -3.213942766189575, 0.9486397504806519]))
(2, array('d', [-1.5727169513702393, -3.1935672760009766, 1.2509902715682983]))
(3, array('d', [-1.2623188495635986, -2.9727492332458496, 0.6029390692710876]))
(4, array('d', [-1.681282639503479, -0.11645570397377014, 0.1271640658378601]))
(5, array('d', [-1.5544735193252563, -0.17659440636634827, -0.09581423550844193]))
(6, array('d', [-1.5795621871948242, -0.4811013638973236, 0.05381409823894501]))
(7, array('d', [-1.8329441547393799, -0.19166715443134308, -0.03150169923901558]))
(8, array('d', [0.3472416400909424, 0.6908116340637207, 1.7061084508895874]))
(9, array('d', [0.3472416400909424, 0.6908116340637207, 1.7061084508895874]))

 Products latent factors:
(0, array('d', [-0.46577298641204834, -1.0279934406280518, -0.6886447668075562]))
(1, array('d', [-0.9108235239982605, -0.9404649138450623, -1.18039691

Can you identify significant differences between the latent factors of the different user groups and product groups? Observe that the two "Breakfast at Tiffany's" fans have clearly different latent factors from the other users.
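
One possible way to make this comparison quantitative is to compute the cosine similarity between user factor vectors. The sketch below is an illustration, not part of the original analysis: it copies the user latent factors from the output printed above (rounded to four decimals) and uses only NumPy. The product factors are not included because their printed output is cut off.

```python
import numpy as np

# User latent factors copied from the printed output (rounded to 4 decimals).
user_factors = np.array([
    [-1.0633, -3.3172,  1.0334],   # users 0-3: Star Wars fans
    [-1.1705, -3.2139,  0.9486],
    [-1.5727, -3.1936,  1.2510],
    [-1.2623, -2.9727,  0.6029],
    [-1.6813, -0.1165,  0.1272],   # users 4-7: Terminator fans
    [-1.5545, -0.1766, -0.0958],
    [-1.5796, -0.4811,  0.0538],
    [-1.8329, -0.1917, -0.0315],
    [ 0.3472,  0.6908,  1.7061],   # users 8-9: Breakfast at Tiffany's fans
    [ 0.3472,  0.6908,  1.7061],
])

# Normalize each row to unit length; the dot products are then cosine similarities.
unit = user_factors / np.linalg.norm(user_factors, axis=1, keepdims=True)
similarity = unit @ unit.T

np.set_printoptions(precision=2, suppress=True)
print(similarity)
```

Users within the same group have similarities close to 1, while the similarity between the "Breakfast at Tiffany's" fans and the users of the other two groups is negative.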

Let's next check what happens if there are some missing values in the user-product matrix. Consider the following function to randomly erase $k$ entries from each row of the matrix:

In [10]:
def convertMatrixToRatingsWithBlanks( matrix, k ):
    ratings = []
    for user,userrow in enumerate(matrix):
        size = len(userrow)
        blanks = random.sample(range(size), k)
        for product,productrating in enumerate(userrow):
            if product not in blanks:
                ratings.append( Rating( int(user) , int(product), float(productrating) ) )
    return ratings      

In [11]:
ratings2 = sc.parallelize( convertMatrixToRatingsWithBlanks( usersandmovies, 4  ) )

# Check the data:
# print(ratings2.collect())


# Build the recommendation model using Alternating Least Squares
rank = 3
numIterations = 10
model2 = ALS.train(ratings2, rank, numIterations)

# Evaluate the model on training data
# First, get the data without the rating values (only user-product IDs)
testdata2 = ratings2.map(lambda p: (p[0], p[1]))

# Next, get the predictions obtained with our U-V factorization model
predictions2 = model2.predictAll(testdata2).map(lambda r: ((r[0], r[1]), r[2]))
# join ((u,p), V) and ((u,p), W) to get ((u,p), (V, W))
ratesAndPreds2 = ratings2.map( lambda r: ((r[0], r[1]), r[2]) ).join(predictions2)

# Compute the mean squared error (MSE)
MSE2 = ratesAndPreds2.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE2))

Mean Squared Error = 0.12332159044395823


In this run, the error on the known entries increases when fewer entries are available to find a good model (the erased entries are chosen at random, so the exact value changes from run to run). Remember, however, that this error is computed on the known entries only, so it does not tell us by itself how good the model's predictions on the unknown entries are. A real problem of this factorization approach is that, to be able to get a good factor vector for a given user, we need enough known entries for that user. In contrast, in the next notebook we will present a recommender system for products in an on-line store based on direct product-to-product comparison. That system can give recommendations to a user from any single product previously bought (or rated) by that user, so even if the user made only ONE purchase the system will be able to give recommendations, as long as the total number of user purchases in the on-line store is high enough.

To learn more about the ALS algorithm (and other global filtering algorithms) in Spark, check the documentation page:
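
To measure the error on entries the model has not seen, one option is to keep the erased entries instead of discarding them. The following pure-Python sketch is a variation of the function above under stated assumptions: the names `split_known_and_heldout` and `mse` are hypothetical, plain tuples are used instead of `Rating` objects so that the snippet runs without Spark, and the small matrix `toy` is made up.

```python
import random

def split_known_and_heldout(matrix, k, seed=None):
    """Hide k random entries per row; return (train, heldout) lists of
    (user, product, rating) tuples."""
    rng = random.Random(seed)
    train, heldout = [], []
    for user, userrow in enumerate(matrix):
        blanks = set(rng.sample(range(len(userrow)), k))
        for product, rating in enumerate(userrow):
            entry = (user, product, float(rating))
            if product in blanks:
                heldout.append(entry)   # hidden from training, used for testing
            else:
                train.append(entry)
    return train, heldout

def mse(pairs):
    """Mean squared error over (true_rating, predicted_rating) pairs."""
    pairs = list(pairs)
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)

# Hypothetical toy matrix: 3 users, 4 products.
toy = [[3, 3, 5, -1],
       [1, 2, 0, 4],
       [-2, -3, 0, 4]]
train, heldout = split_known_and_heldout(toy, k=1, seed=42)
print(len(train), "training entries,", len(heldout), "held-out entries")
```

With Spark, the `train` tuples could be converted to `Rating` objects and passed to `ALS.train`, and the `(user, product)` pairs of `heldout` could be passed to `predictAll`, in the same way as in the cells above. The MSE computed on those held-out pairs then estimates the error on unknown entries.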

> https://spark.apache.org/docs/1.6.1/mllib-collaborative-filtering.html
