pip install implicit

BUILDING AND EVALUATING A RECOMMENDER SYSTEM

This Jupyter Notebook explains how an error metric is generated for weighted matrix factorization (WMF). It walks through loading the data, splitting it into a training and a test set, fitting the model, and evaluating its recommendations.




In [28]:
import random
import numpy as np
import pandas as pd
import time
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import operator
import scipy as sp
import scipy.stats
from sklearn import metrics
import implicit

LOADING THE DATA

In [29]:
# Please enter the path to the input data.
# Other input data should look like:
# 0,1,3
# 0,2,6
# .....
# where 0 is the user, 1 and 2 are items, and 3 and 6 are implicit feedback scores.
# NOTE: Because Python indexing is zero-based, the first user and the first item should both be labeled 0.

filepath = 'C:/Users/peter/Documents/uvt/implicit/finalxdata.csv'


In [30]:
def load_matrix(filepath, num_users, num_items):
    # Dense users x items matrix, initialised to zero (no interaction).
    counts = np.zeros((num_users, num_items))
    # The with-statement closes the file once all lines are read.
    with open(filepath, 'r') as f:
        for line in f:
            user, item, count = line.strip().split(',')
            counts[int(user)][int(item)] = float(count)
    return counts

# So what happens here?
# The function creates a zero-filled matrix (counts) of size U by I (users by items).
# Then it reads the input file line by line and extracts the user, item and feedback score from each line.
# These values are then used to fill the matrix with the implicit feedback scores.
# Finally it returns the filled matrix.
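
The loading steps described above can also be sketched with a sparse matrix, which avoids allocating a full users-by-items array. This is only a minimal sketch: `load_sparse_matrix` and the three-line sample input are hypothetical, and unlike `load_matrix` it sums duplicate user-item lines instead of overwriting them.

```python
import io
import scipy.sparse as sparse

def load_sparse_matrix(lines, num_users, num_items):
    """Build a CSR users x items matrix from 'user,item,count' lines."""
    users, items, counts = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, count = line.split(',')
        users.append(int(user))
        items.append(int(item))
        counts.append(float(count))
    # COO format takes (values, (row_ids, col_ids)); duplicates are summed on conversion.
    coo = sparse.coo_matrix((counts, (users, items)), shape=(num_users, num_items))
    return coo.tocsr()

# Hypothetical input in the format described above (not the real data file).
sample = io.StringIO("0,1,3\n0,2,6\n1,0,2\n")
R = load_sparse_matrix(sample, num_users=2, num_items=3)
print(R.toarray())
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `StringIO` object.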


In [37]:
matrix1 = load_matrix(filepath, num_users = 6518, num_items = 4036)

So now we have loaded the matrix. Let's look at the row of a single user to get a feeling for the data (for this example we'll use the 100th user, which is row 99 because IDs start at 0).

In [32]:
item=0
for value in matrix1[99]:
    if value>0:
        print(item, value)
    item+=1

690 2.0
1244 2.0
1833 2.0
1882 3.0
1911 2.0
1929 2.0
1941 2.0
2774 2.0


So this user bought 7 different items 2 times each, and 1 item 3 times. The code only prints the items for which the user has a purchase record, which means that the user did not purchase the other 4028 items (4036 items minus the 8 listed). For good measure, let's look at the 200th user (row 199).

In [33]:
item=0
for value in matrix1[199]:
    if value>0:
        print(item, value)
    item+=1

139 2.0
1592 2.0
1806 2.0
1821 2.0
1870 2.0
1931 2.0
2070 2.0
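
The two loops above can also be written with NumPy indexing. The sketch below uses a small hypothetical score row (not a row from the data set) to show the idea and to count the items without a purchase record.

```python
import numpy as np

# Hypothetical score row for one user with five items.
row = np.array([0.0, 2.0, 0.0, 3.0, 0.0])

# Indices of the items with a positive implicit feedback score.
item_ids = np.flatnonzero(row > 0)
for item_id in item_ids:
    print(item_id, row[item_id])

# Number of items the user has no purchase record for.
num_unseen = row.size - item_ids.size
print(num_unseen)
```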


The input data is ready, so let's create a train and test set. First, we need to select the users for whom we will hide one interaction. The following functions drop one value in a user's row.

In [34]:
# This function selects a random user-item interaction that can be dropped.
# Its input is the user's ID and the user's implicit scores on all items.
# It returns a list with the user ID and the ID of the item that will be dropped by another function.
def interactiontodrop(rownumber, row): #rownumber=userid, row=score vector
    # Collect the IDs of all items on which the user has a positive score.
    scoreditems = [k for k, score in enumerate(row) if score > 0]
    # Re-seeding on every call keeps the choice reproducible.
    random.seed(1)
    if not scoreditems:
        # A user without interactions has nothing to drop: print the ID for debugging;
        # random.choice below then raises an IndexError.
        print(rownumber)
    return [rownumber, random.choice(scoreditems)]

# This function 'drops' the interactions from the matrix (drop means set equal to zero).
# It lists all users, and then draws a random sample of n of these users.
# For each sampled user a random item is selected by the interactiontodrop function, after which it is set to zero.
# The input is a matrix and the sample size n; the output is the list of [user, item] pairs that were dropped.
# Note that the matrix itself is modified in place.
def dropper(matrix, n):
    r = list(range(len(matrix)))
    random.seed(1)
    sample = random.sample(r,n)
    toodrop = []
    for x in sample:
        toodrop.append(interactiontodrop(x, matrix[x]))
    for item in toodrop:
        matrix[item[0]][item[1]] = 0
    return toodrop

In [35]:
# So let's check the functions, consider the following matrix X with three users, and four items.
X = [[1,0,0,1],
     [8,0,8,8],
     [9,9,9,9]]

# If we apply the dropper-function to this matrix, it will drop one value in each row, and replace it with zero,
# furthermore, it will remember the user and item id of the dropped interactions

print(dropper(X, 3))
print(X)

[[0, 0], [2, 1], [1, 0]]
[[0, 0, 0, 1], [0, 0, 8, 8], [9, 0, 9, 9]]


So in this example, we have dropped one interaction in each row, namely the interactions [0,0], [2,1] and [1,0].
The original matrix X is now
    X = [[0,0,0,1],
         [0,0,8,8],
         [9,0,9,9]]
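
As a point of comparison, the same leave-one-out idea can be sketched with a single random generator and an explicit skip for users without interactions. This is not the notebook's method: `holdout_one_per_user` is a hypothetical helper, it also stores the dropped value, and because it does not re-seed on every call it will pick different items than the output shown above.

```python
import random
import numpy as np

def holdout_one_per_user(matrix, n, seed=1):
    """Zero out one positive entry for each of n sampled users (in place)."""
    rng = random.Random(seed)  # one generator for the whole run
    # Only users with at least one positive score can lose an interaction.
    candidates = [u for u in range(matrix.shape[0]) if np.any(matrix[u] > 0)]
    held_out = []
    for user in rng.sample(candidates, n):
        item = rng.choice(np.flatnonzero(matrix[user] > 0).tolist())
        held_out.append((user, item, float(matrix[user, item])))
        matrix[user, item] = 0
    return held_out

# The example matrix from above plus one hypothetical user without interactions.
X = np.array([[1, 0, 0, 1],
              [8, 0, 8, 8],
              [9, 9, 9, 9],
              [0, 0, 0, 0]], dtype=float)
before = (X > 0).sum(axis=1)
held_out = holdout_one_per_user(X, 3)
after = (X > 0).sum(axis=1)
print(held_out)
print(before - after)  # one interaction fewer for each sampled user
```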

With the dropper function we can create the test set. A test set often holds around 20% of the data;
here we hold out one interaction for each of 650 users, which is roughly 10% of the 6518 users. Therefore, we call the dropper function on the input data with n = 650.

In [38]:
test_set = dropper(matrix1, 650)
# So the test set is now ready and the first 10 user-item interactions in the test set are:
print(test_set[0:10])
# The values of these user-item pairs are now zero in the original matrix
test_user, test_item = test_set[1]
print('the implicit feedback value of the second pair in the test set is now zero')
print(test_user, test_item, matrix1[test_user][test_item])

[[1100, 922], [4662, 3496], [6256, 3905], [516, 1893], [2089, 2891], [965, 230], [4058, 3373], [6233, 3145], [3682, 3388], [3868, 3321]]
4662 3496 0.0


With the test set ready, we can now train on the training data and generate recommendations!
The code below fits a WMF model to the data.
By the way, don't mind the function below; it is an artifact of my old thesis and I will develop a new error metric.

In [42]:
# In order to evaluate our model we need to generate predictions for a user.
# We generate predictions by multiplying the user vector with the item vectors and then sorting the items by score.
def prediction(userid, user_vectors, item_vectors, originalmatrix, n=None):
    # The next line generates the scores of all items for a single user.
    predictions = np.dot(user_vectors[userid], item_vectors.T)
    num_items = len(predictions)
    if n is None:
        n = num_items  # by default, return the full ranking
    # Since there are also predictions for items the user DID interact with, we push these items to the bottom
    # by giving them a very low score (they would otherwise score highly, since they are positive in the original matrix).
    for i in range(num_items):
        if originalmatrix[userid][i] > 0:
            predictions[i] = -6000
    # Sort all items based on the score, and keep the sorted list of item IDs.
    scores = dict(enumerate(predictions))
    ranked = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
    return [item_id for item_id, _ in ranked[:n]]

def mpr_calc(dropped_items, user_vectors, item_vectors, originalmatrix):
    percentile_rankings = []
    num_items = len(item_vectors)
    # dropped_items is the test set.
    # For each [user, item] pair in the test set, generate the ranked list for the user,
    # locate the rank of the held-out item (index(b[1]) + 1) and divide it by the total number of items.
    # Finally, store the percentile ranking of this user in a list for all users.
    for b in dropped_items:
        ranking = prediction(b[0], user_vectors, item_vectors, originalmatrix).index(b[1]) + 1
        percentile_rankings.append(ranking / num_items)
    return percentile_rankings

def hlu_calc(dropped_items, user_vectors, item_vectors, originalmatrix):
    all_hlu = []
    for b in dropped_items:
        ranking = prediction(b[0], user_vectors, item_vectors, originalmatrix).index(b[1]) + 1
        individual_hlu = 1/(2**((ranking+1)/10))
        all_hlu.append(individual_hlu)
    return all_hlu

def sparser(originalmatrix, num_users, num_items):
    counts = sparse.dok_matrix((num_users, num_items), dtype=float)
    for i in range(len(originalmatrix)):
        row = originalmatrix[i]
        for j in range(len(row)):
            if row[j] > 0:
                user = int(i)
                item = int(j)
                count = float(row[j])
                counts[user, item] = count
    counts = counts.tocsr()
    return counts
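
To make the two metrics concrete, here is a small worked example on a hypothetical score vector with five items. It follows the same steps as `prediction`, `mpr_calc` and `hlu_calc` (push seen items to the bottom, rank, then apply the formulas), but `rank_of_item` is an illustrative helper and not part of the notebook.

```python
import numpy as np

def rank_of_item(scores, seen_mask, target_item):
    """1-based rank of target_item after pushing already-seen items to the bottom."""
    masked = np.where(seen_mask, -np.inf, scores)
    order = np.argsort(-masked, kind="stable")  # best score first
    return int(np.where(order == target_item)[0][0]) + 1

# Hypothetical scores for five items; item 0 is still in the user's training row.
scores = np.array([0.9, 0.1, 0.7, 0.4, 0.2])
seen = np.array([True, False, False, False, False])

# Item 3 is the held-out item: only item 2 scores higher among the unseen items.
rank = rank_of_item(scores, seen, target_item=3)
percentile_rank = rank / scores.size   # same formula as mpr_calc
hlu = 1 / (2 ** ((rank + 1) / 10))     # same formula as hlu_calc
print(rank, percentile_rank, hlu)
```

A lower percentile rank means the held-out item sits closer to the top of the list, while a higher HLU value means the same thing.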

In [43]:
matrix1 = matrix1*10
flipped = matrix1.T
sparsematrix = sparser(flipped, 4036, 6518)
wmf = implicit.als.AlternatingLeastSquares(factors=40, regularization=0.1, iterations=20)
wmf.fit(sparsematrix)
print(np.average(mpr_calc(test_set, (wmf.user_factors), (wmf.item_factors), matrix1)))
print(np.average(hlu_calc(test_set, (wmf.user_factors), (wmf.item_factors), matrix1)))

0.0860109018831
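
The `sparser` helper used above fills the sparse matrix entry by entry. As a sketch of an alternative, SciPy can convert a dense array directly; the tiny `dense` array below is hypothetical. One difference to keep in mind: `csr_matrix` stores every nonzero value, whereas `sparser` only keeps values greater than zero.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical users x items matrix: 2 users, 3 items.
dense = np.array([[0.0, 20.0, 0.0],
                  [30.0, 0.0, 0.0]])

# Transpose to items x users, as the notebook does, then store only the nonzero entries.
item_user = sparse.csr_matrix(dense.T)
print(item_user.shape, item_user.nnz)
```

Which orientation `fit` expects may depend on the installed version of the implicit library, so it is worth checking its documentation before reusing the transpose step.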


The error metric calculation is now done. Below are some small things that I wanted to show. The first box prints the user and item factors, while the second box shows how we can compute recommendations with these factors.

In [45]:
print(wmf.user_factors[1])
print(wmf.item_factors[1])
# Notice that the item factors in particular are small in magnitude, which suggests that the regularization is doing its job! :D

[ 1.02445708 -0.75462322  0.51603423  0.39406674  0.06896798 -0.38594222
 -0.36607757  1.2692164   0.33990333  0.01336935  1.64384154  0.44549519
 -0.38010104  0.32407807  0.05395025  0.12491241  0.56813029  1.3735044
  0.60461521  0.42391448  0.92220032  0.57576944  0.39123932 -0.83277707
  1.06618653 -0.07535544 -0.39173272  0.40998265 -0.10037952  0.56932853
 -1.34936214  0.00838501  0.45339072 -1.82116499 -1.11187079 -0.50163748
  0.5305226   0.12710697  0.38954433 -0.57443069]
[ 0.01765419  0.04446135 -0.03864371 -0.00268694  0.01044716  0.00210448
 -0.00953599  0.00906902  0.00620778 -0.00671141  0.05174084  0.03203721
  0.00553551 -0.0342268   0.01280689 -0.01784982 -0.00429016 -0.02965901
 -0.02506625  0.02072554 -0.0559107  -0.02168349 -0.00534054  0.01673514
  0.02664424  0.00927827 -0.02705793 -0.02476204 -0.0138883  -0.00703672
 -0.00802228  0.01938651  0.02338761 -0.01892339  0.02653332  0.01756498
  0.0052419   0.00820092  0.03683066  0.00417348]


So in the example below, where we have user 1 and items 0, 1 and 2, we can see the computed scores for this user. As we can see, the user has the highest score on item 2, so of these three items we would recommend item 2 to this user.

In [63]:
print(np.dot(wmf.item_factors[0],wmf.user_factors[1]))
print(np.dot(wmf.item_factors[1],wmf.user_factors[1]))
print(np.dot(wmf.item_factors[2],wmf.user_factors[1])) #<-- the highest score

-0.0202808522469
-0.00550191601516
0.0677161486725
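
The three dot products above extend naturally to all items at once. The sketch below uses small random factor matrices as stand-ins for `wmf.user_factors` and `wmf.item_factors` (the shapes and values are hypothetical) and picks the top three items for one user.

```python
import numpy as np

rng = np.random.default_rng(0)
user_factors = rng.normal(size=(3, 4))  # hypothetical: 3 users, 4 latent factors
item_factors = rng.normal(size=(5, 4))  # hypothetical: 5 items, 4 latent factors

user_id = 1
# One matrix-vector product gives the score of every item for this user.
scores = item_factors @ user_factors[user_id]
# Sort by descending score and keep the three best item IDs.
top_items = np.argsort(-scores)[:3]
print(scores)
print(top_items)
```

In a real recommender, items the user has already interacted with would be filtered out before taking the top of the list, as the `prediction` function does.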
