# Item-to-item based collaborative filtering recommender systems

In this notebook we will consider how to build recommender systems based on item-to-item collaborative filtering, where the items are the products of the on-line store. The approach recommends new items to a user by finding items that are SIMILAR to the ones the user has already bought, or to the ones we know the user likes. It works well when the user-item matrix records which products were bought by each user but the typical user row vector is sparse (there are few purchases with respect to the total number of items of the on-line store). This item-centered approach allows us to compute recommendations from any single item bought by any user. Even in the extreme case where the user has bought only one item, the system can compute good recommendations if the total number of purchases of the on-line store is high enough, because the recommendations depend on the number of new (not yet bought) items that are similar enough to an item already bought.

We can also consider not only which items the user bought, but also which items they have browsed frequently in the on-line store, or any other source of implicit knowledge about the interests of the user.

In any case, the main issue here will be **how to compare items of the on-line store** and how to predict which ones will be more interesting for the user, given their current set of purchases or interests. The difference with the recommender systems based on latent factors is the following: although we still perform global filtering, because we consider the whole set of users and items to build the system, here we explicitly compute a similarity measure between any pair of products instead of decomposing users and products into vectors of latent factors.

Preliminary start-up code for the notebook:

In [1]:
import pyspark
import os
import math
import sys
from pyspark.mllib.linalg import SparseVector
# from pyspark.mllib.linalg import DenseVector

spark_home = os.environ.get('SPARK_HOME', None)
print (spark_home)

sc = pyspark.SparkContext('local[*]')
print (sc)

None
<SparkContext master=local[*] appName=pyspark-shell>


## Comparing items based on global comparison of common customers

The first step towards discovering (mining) similar items is deciding how to measure the similarity between two items in the store catalog. If we assume that the information we have for every item is the set of users that bought that item, the global filtering approach compares ALL the customers of the store, checking how many of them have bought both of the items we want to compare. The idea is that the similarity measure should be higher when the two items have a higher number of common customers. So, a very direct way of measuring the similarity would be to count the number of common customers and normalize it by the total number of customers of the store.
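The direct measure described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `common_customer_fraction` and the example customer ids are assumptions made here, not part of the notebook.

```python
def common_customer_fraction(buyers1, buyers2, total_customers):
    """Fraction of all the customers of the store that bought both items."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    # Customers that appear in both sets are the common customers
    return len(set(buyers1) & set(buyers2)) / total_customers

# Hypothetical data: ids of the customers that bought each item
buyers_of_a = {0, 1}
buyers_of_b = {1, 3}
similarity_ab = common_customer_fraction(buyers_of_a, buyers_of_b, total_customers=4)
print(similarity_ab)
```

Only customer 1 bought both items, so with 4 customers in total the measure is one quarter.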

However, when we consider user-item matrices that incorporate more complex information about user-item pairs, for example the degree of satisfaction of the customer with the item, it is better to consider other kinds of similarity measures. One widely used measure is the cosine of the angle between the two item vectors. It is usually called cosine similarity; this notebook refers to it as the cosine distance.
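As a reference, for two item vectors $\vec{a}$ and $\vec{b}$ with one component per customer, the cosine measure can be written as follows. The symbols $\vec{a}$, $\vec{b}$ and the index $u$ are chosen here only for illustration, and $M$ is the number of customers:

$$
\cos(\vec{a},\vec{b}) \;=\; \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}
\;=\; \frac{\sum_{u=1}^{M} a_u\, b_u}{\sqrt{\sum_{u=1}^{M} a_u^{2}}\;\sqrt{\sum_{u=1}^{M} b_u^{2}}}
$$

For instance, with $\vec{a} = (1,1,0,0)$ and $\vec{b} = (0,1,0,1)$ the dot product is $1$ and both norms are $\sqrt{2}$, so the value is $1/(\sqrt{2}\cdot\sqrt{2}) = 0.5$.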


For item-to-item collaborative filtering we must work with the columns of the user-item matrix that records which items have been bought by each user, the same matrix that we used with global filtering based on UV decomposition.

In this section we will assume that items are already given as vectors, representing their corresponding column vectors in the user-item matrix.

In [6]:
#
#  Compute the cosine distance between vectors vec1 and vec2, represented
#  as dense lists: with all the elements (non-zero and zero) values present
#
def CosineDistance( vec1, vec2 ):
    # dot product of vec1 and vec2
    dot = 0.0
    v1rs = 0.0
    v2rs = 0.0
    for i in range(len(vec1)):
        dot += (vec1[i]*vec2[i]) 
        v1rs += (vec1[i]*vec1[i])
        v2rs += (vec2[i]*vec2[i])
    v1rs = math.sqrt(v1rs)
    v2rs = math.sqrt(v2rs)
    return dot/(v1rs*v2rs)

Next, let's think about implementing the cosine distance with the vectors represented as the sparse vector data type of the Spark mllib library.

With a sparse representation we can work with bigger vectors, provided that many entries are empty. This should be the case for an on-line store, where many items will be bought by only a small fraction of the total number of customers.

Check https://spark.apache.org/docs/1.6.0/mllib-data-types.html for basic information about spark mllib data types for matrices and vectors
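To make the idea of a sparse representation concrete, here is a small plain-Python sketch that keeps only the non-zero entries of a dense vector. The helper `to_sparse` is hypothetical and is not a Spark function; it only mirrors the three pieces of information (size, indices, values) that a sparse vector stores.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values

# A vector with 8 positions but only 2 non-zero entries
size, indices, values = to_sparse([0, 1, 0, 0, 0, 0, 3, 0])
print(size, indices, values)
```

The memory needed grows with the number of non-zero entries instead of with the total number of customers, which is what makes the sparse form attractive for item vectors.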

In [None]:
#
# EXERCISE: Implement cosine distance working with the sparse vector data type of spark mllib library
# Observe that we are still assuming that single vectors can fit into the main memory of a single node when using
# this function for computing with spark
#
def CosineDistance_sparkSparseVectors( svec1, svec2 ):
    #
    # INSERT CODE HERE
    #
    
    return dot/(v1rs*v2rs)

Consider for example the following set of 5 item vectors, each one indicating which customers, from a total set of 4 customers, have bought (1) or have not bought (0) the item:

In [7]:
itemvectors = [ [1,1,0,0], [0,1,0,1], [0,1,0,0], [1,0,1,0], [0,1,0,1] ]

In [8]:
# Compute the similarity between any pair of items:

for i1, vec1 in enumerate(itemvectors):
    for i2, vec2 in enumerate(itemvectors):
        if (i1 < i2):
          print ("Cosine distance between ", i1, "and", i2, ":", CosineDistance( vec1, vec2 ))

Cosine distance between  0 and 1 : 0.4999999999999999
Cosine distance between  0 and 2 : 0.7071067811865475
Cosine distance between  0 and 3 : 0.4999999999999999
Cosine distance between  0 and 4 : 0.4999999999999999
Cosine distance between  1 and 2 : 0.7071067811865475
Cosine distance between  1 and 3 : 0.0
Cosine distance between  1 and 4 : 0.9999999999999998
Cosine distance between  2 and 3 : 0.0
Cosine distance between  2 and 4 : 0.7071067811865475
Cosine distance between  3 and 4 : 0.0


The similarity value provided by the cosine distance is also useful when our item vectors contain values in a range that includes negative values. This is the case when negative values mean *negative* ratings, positive values mean positive ratings and 0 means a neutral rating (or no rating at all). For example, consider the following modified set of items.

In [10]:
itemvectors2 = [ [-3,-3,-3,-2], [-2,-2,-2,-1], [1,1,1,1], [2,2,2,2], [3,3,3,3] ]

In [11]:
# Compute the similarity between any pair of items:

for i1, vec1 in enumerate(itemvectors2):
    for i2, vec2 in enumerate(itemvectors2):
        if (i1 < i2):
          print ("Cosine distance between ", i1, "and", i2, ":", CosineDistance( vec1, vec2 ))

Cosine distance between  0 and 1 : 0.996270962773436
Cosine distance between  0 and 2 : -0.987829161147262
Cosine distance between  0 and 3 : -0.987829161147262
Cosine distance between  0 and 4 : -0.987829161147262
Cosine distance between  1 and 2 : -0.9707253433941511
Cosine distance between  1 and 3 : -0.9707253433941511
Cosine distance between  1 and 4 : -0.9707253433941511
Cosine distance between  2 and 3 : 1.0
Cosine distance between  2 and 4 : 1.0
Cosine distance between  3 and 4 : 1.0


Observe that in this case, the cosine distance ranges from -1 to 1, where -1 means *totally opposite vectors*, and 1 means totally aligned vectors, without giving relevance to the magnitude of the vectors.
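The independence from the magnitude can be checked with a short, self-contained sketch. The function `cosine` below is a compact variant written for this example (it is not the notebook's `CosineDistance`), and the vectors are made up for illustration.

```python
import math

def cosine(vec1, vec2):
    """Cosine of the angle between two dense vectors of the same length."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

base = [1, 1, 1, 1]
scaled = [3 * x for x in base]      # same direction, larger magnitude
opposite = [-x for x in base]       # opposite direction

same_direction = cosine(base, scaled)
opposite_direction = cosine(base, opposite)
print(same_direction, opposite_direction)
```

Scaling a vector by a positive constant leaves the value at 1, while flipping its sign gives -1, which matches the pattern seen in the output for items 2, 3 and 4 above.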

## Mining similar items (products) based on item-to-item global filtering

Let's now consider the approach for recommending products to Amazon customers presented in the paper:

> Greg Linden, Brent Smith, and Jeremy York. *Amazon.com recommendations: Item-to-Item Collaborative Filtering*. In IEEE INTERNET COMPUTING. 2003

(Do not expect to find exact final algorithms in that paper, only the overall idea.) The approach is the one explained before: use some similarity measure between products to recommend products that are relevant with respect to the ones already bought by the user. We do not know which similarity measures Amazon currently uses, but we will use the cosine distance in this notebook to develop our recommender system.

The main building block of the Amazon recommender system is the algorithm that computes the similarity between any pair of items in the on-line catalog. This is the pseudo-code of the mentioned Amazon item-to-item similarity mining algorithm:

```python
def computeSimilarityBetweenProducts( I ):
 for each item i1 in product catalog I:
     for each customer C who purchased i1:
         for each item i2 purchased by customer C:
            record that *a customer* (C) purchased i1 and i2:
                    store that (i1,i2) were purchased by a same user
     for each item i2 in product catalog I:
        compute the similarity between i1 and i2
```

This algorithm can be thought of as an **off-line** algorithm: the similarity between products should be computed as a background process, and only be recomputed when there is a significant number of changes in the purchases database. What are the worst-case and the real complexity of this algorithm? Consider the following observations (where $M$ is the number of customers and $N$ the number of items of the on-line store):

1. The worst-case complexity is $O(N^{2} M)$. This is the case when almost every customer has bought almost every item of the store (this can be considered a very unrealistic scenario).
2. However, if we assume that many customers have very few purchases (let's say a constant number), the real complexity is closer to $O(N M)$.
3. The complexity can be further reduced if we only consider a sample (subset) of customers to compute the similarity between products (for example, users that buy best-selling products, who will typically contribute to many item-to-item similarity values). Of course, this will produce an approximation of the real similarity values between products.

For computing the similarity, we can use the cosine distance, or any other similarity measure we think is good for our application domain. Observe that, in some sense, the similarity between a pair of products (i1,i2) will be *a number* proportional to the total number of customers that purchased both i1 and i2, so it is only necessary to remember the total number of such customers, and not the particular customers. So, the computation of the similarity between i1 and i2 can be thought of as some kind of *reduce* (by key) operation over all the pairs (i1,i2) produced by different users C. That is, if we consider (i1,i2) as the key, and for example "1" as the value for each different user C that bought i1 and i2, the (key,value) pairs to produce would be:

                ((i1,i2), 1)  for each user C that bought i1 and i2

and then reduce by key (for example summing up the values) all such (key,value) pairs. Of course, to get a normalized similarity measure the sum should be divided by the maximum number of users.
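The two stages (produce the pairs, then reduce by key) can be sketched in plain Python with a `Counter` playing the role of the reduce step. The purchase data and the helper `emit_pairs` are hypothetical. On an RDD holding the item list of each customer, the same two stages would typically map to `flatMap` followed by `reduceByKey`, but that is not shown here.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchases: customer id -> items bought
purchases = {
    "c1": [1, 2],
    "c2": [2, 3, 4],
    "c3": [1, 2],
}

def emit_pairs(items):
    """Yield ((i1, i2), 1) for every ordered pair of distinct items of one customer."""
    for i1, i2 in permutations(sorted(set(items)), 2):
        yield (i1, i2), 1

# "Reduce by key": sum the values of all the pairs that share the same key
pair_counts = Counter()
for items in purchases.values():
    for key, value in emit_pairs(items):
        pair_counts[key] += value

print(pair_counts[(1, 2)], pair_counts[(2, 3)])
```

Both orders of a pair are produced so that the count can later be looked up from either item. Pairs that never occur together are simply absent, and a `Counter` reports them as 0.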

The output format of the data set computed by such global filtering algorithm could be something like the following. For each item I we have a list of (item,similarity) pairs:
  
 I :  [ (j1, sim(I,j1)), (j2, sim(I,j2)), ..., (jI, sim(I,jI)) ]
 
 where the set of items j1, j2, ... , jI is the set of items that have at least one common customer (user) with I, and so their similarity is > 0. We will refer to the (distributed) data set that contains such information for all the items as the *rddSimilarityPairs* in the rest of this notebook.
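As an illustration of this output format, the sketch below turns pair counts into one list of (item, similarity) pairs per item. It assumes binary purchase vectors, for which the cosine measure reduces to the number of common customers divided by the square root of the product of the number of buyers of each item. The input dictionaries and the function name `build_similarity_lists` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical inputs: number of common customers per ordered item pair,
# and number of customers that bought each item
pair_counts = {(1, 2): 2, (2, 1): 2, (2, 3): 1, (3, 2): 1}
item_counts = {1: 2, 2: 3, 3: 1}

def build_similarity_lists(pair_counts, item_counts):
    """Return {item: [(other_item, similarity), ...]} using the cosine for 0/1 vectors."""
    result = defaultdict(list)
    for (i1, i2), common in pair_counts.items():
        sim = common / math.sqrt(item_counts[i1] * item_counts[i2])
        result[i1].append((i2, sim))
    return dict(result)

similarity_lists = build_similarity_lists(pair_counts, item_counts)
for item in sorted(similarity_lists):
    print(item, similarity_lists[item])
```

Items that share no customer never appear in `pair_counts`, so only pairs with similarity > 0 end up in the lists.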

## Recommending similar items to a previously bought one

As a first recommender system, consider the case where we want to recommend items similar to one just bought by a user, or to one recently browsed by the user. So, we want to focus on the items similar to a particular one.

Here you have a possible pseudo-code for a recommender system for that particular case. The input is the user we are considering and the item we want to use as the base item to get the recommendations. Observe that the primary use of this algorithm would be "on-line": every time a user makes a new purchase or browses a new item, it would be desirable to get such focused recommendations.

```python 
def Recommend_K_most_Similar( RDD rddSimilarityPairs, user U, item I, integer K ):

     rdd1 = getSimilarItems( rddSimilarityPairs, I )
     rdd2 = RemoveBoughtItems( rdd1, U )
     rdd3 = rdd2.sortBySimilarity()
     bestK = rdd3.takeFirstKItems( K )
     
     return bestK
```

This function assumes that we have a previously computed data set, the rddSimilarityPairs, that should be the one computed by the similarity mining algorithm of the previous section (or by any other algorithm that provides such a data set).

Let's see a possible implementation in Spark of the four steps of the previous algorithm. Let's first consider the following similarity information data set which, for simplicity, we assume is stored in a plain ASCII file. In a final application this similarity information would be stored in a database.

1    2,0.6  4,0.3

2    1,0.6

3    4,0.7  5,0.8 

4    3,0.7  5,0.4 1,0.3

5    3,0.8  4,0.4

6    7,0.3

7    6,0.3

We need a parsing function to process each line of this text file and produce the entries of the rdd data set:

In [12]:
# Format of line:  ItemI    ItemI1,Sim_(I,I1) ...  ItemIN,Sim_(I,IN)
# 
# We assume that the line contains only the information for items such that their similarity with I is > 0
#
def parseSimilarityInfo( line ):
   toks = line.split()
   sourceitem = int(toks[0])
   targetitems = [ tuple(it.split(',')) for it in toks[1:] ]
   targetitems = [ (int(it[0]),float(it[1])) for it in targetitems ]
   return (sourceitem,targetitems)

We can now load and parse the information in our file similarityPairsInfo_1.txt to get the desired rdd data set:

In [13]:
rddSimilarityPairs = sc.textFile('similarityPairsInfo_1.txt').map( parseSimilarityInfo )

Let's take a look to check that the file was parsed correctly (remember that performing a collect() action is only a good idea if the result you expect to get is small enough to fit into the driver memory):

In [15]:
rddSimilarityPairs.collect()

[(1, [(2, 0.6), (4, 0.3)]),
 (2, [(1, 0.6)]),
 (3, [(4, 0.7), (5, 0.8)]),
 (4, [(3, 0.7), (5, 0.4), (1, 0.3)]),
 (5, [(3, 0.8), (4, 0.4)]),
 (6, [(7, 0.3)]),
 (7, [(6, 0.3)])]

Next, we can filter from such rdd data set only the similarity information related to our input item I:

In [16]:
def getSimilarItems( rddSimilarityPairs, I ):
   return  rddSimilarityPairs.filter( lambda x : x[0] == I )

Let's test the function with item 4:

In [17]:
rdd1 = getSimilarItems( rddSimilarityPairs, 4 )
rdd1.collect()

[(4, [(3, 0.7), (5, 0.4), (1, 0.3)])]

The next step is to retain only the items not bought by user U. Again, the implementation is highly dependent on whether we assume that the set of purchases of user U fits into a single machine or must be distributed. Let's assume here that the user's set of purchases fits into one machine, so we can read it inside the following function (again we will assume such information comes from an ASCII file). The format for such user purchases will be:

   user1  item11   item12 ...  item1N
   
   user2  item21   item22 ...  item2N
   
    ...
    
   userM  itemM1   itemM2 ...  itemMN
   
where the set of items in line i is the set of items bought by user i (different users may have a different number of items).

In [19]:
# In this function we remove the items already bought by U, and also drop the item key so that the final
# set of similar, filtered items is returned as a single list of (item, similarity) pairs
def RemoveBoughtItems( itemsSimilarToI, U ):
   purchases = getPurchases( U )
   print ( purchases )
   return itemsSimilarToI.flatMap( lambda x : [it for it  in x[1] if it[0] not in purchases ] )

We will consider the next example file of user purchases ('purchases.txt'):

    1    2  3
    2    3  4
    3    4  5  6
    4    6  1

In [20]:
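def getPurchases( U ):
    # Scan the purchases file and return the items bought by user U (empty list if U is not found)
    purchases = []
    with open( 'purchases.txt' ) as f:
        for line in f:
            toks = line.split()
            # Skip blank lines; the first token of a line is the user id
            if toks and int(toks[0]) == U:
                purchases = toks[1:]
                break
    return [ int(p) for p in purchases ]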

Let's check it filtering out the purchases of user 3:

In [21]:
rdd2 = RemoveBoughtItems( rdd1, 3 )
rdd2.collect()

[4, 5, 6]


[(3, 0.7), (1, 0.3)]

The last two steps are sorting the resulting set of items by similarity and taking the first K items. In Spark we can do both with a single action:

In [29]:
# Take only the first most similar non-bought item
rdd2.takeOrdered(1, key=lambda x: -x[1])

[(3, 0.7)]
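For comparison, the four steps can also be put together without Spark. The following is a minimal sketch, assuming that the similarity information fits in an ordinary dictionary on one machine; the function `recommend_k_most_similar` and its parameters are names chosen for this example, and `heapq.nlargest` stands in for the sort-and-take step.

```python
import heapq

# Same similarity information as in similarityPairsInfo_1.txt, as a plain dict
similarity_pairs = {
    1: [(2, 0.6), (4, 0.3)],
    2: [(1, 0.6)],
    3: [(4, 0.7), (5, 0.8)],
    4: [(3, 0.7), (5, 0.4), (1, 0.3)],
    5: [(3, 0.8), (4, 0.4)],
    6: [(7, 0.3)],
    7: [(6, 0.3)],
}

def recommend_k_most_similar(similarity_pairs, purchases, item, k):
    """Return the k items most similar to `item` that are not in `purchases`."""
    bought = set(purchases)
    # Steps 1 and 2: look up the similar items and drop the ones already bought
    candidates = [(j, sim) for j, sim in similarity_pairs.get(item, []) if j not in bought]
    # Steps 3 and 4: keep the k candidates with the highest similarity
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# User 3 bought items 4, 5 and 6 in the example purchases file
best = recommend_k_most_similar(similarity_pairs, [4, 5, 6], item=4, k=2)
print(best)
```

With `k=1` the result is the same single recommendation obtained above with `takeOrdered`.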

## Recommending based on the whole set of previous purchases

**Exercise**: Finally, consider the next modification of the previous use case as a programming exercise with Spark. Sometimes Amazon also sends emails to customers with recommendations based on *all* (or a subset of) their previous purchases, instead of a single recent one. Observe that in this case the same item could be recommended several times, with a different similarity value for each purchase of the user that it is similar to. So, we should aggregate in some way the different similarity values obtained for the same item.


Can you think of a good recommender algorithm and implement it with Spark? Take as a starting point the previous algorithm we have developed for recommendations based on a single item.