Copyright (C) 2023 Pablo Castells and Alejandro Bellogín

The code in this notebook was written for the lab assignments of the course "Búsqueda y minería de información" (Information Search and Mining), taught in the 4th year of the Degree in Computer Engineering at the Escuela Politécnica Superior of the Universidad Autónoma de Madrid. Its purpose and use are restricted to the teaching activities of that course.

### **Búsqueda y Minería de Información 2022-23**
### Universidad Autónoma de Madrid, Escuela Politécnica Superior
### Degree in Computer Engineering, 4th year

# Matrix-based implementation of recommender systems 

This notebook is a warm-up exercise that provides many of the elements you need to implement recommender systems in the lab assignment. Some blanks are left for you to fill in, all labeled with "# Your code here...". Most blanks can be filled in with a single line: many operations are matrix multiplications (either dot products or element-wise products), and others take just one function call from the NumPy array API. Because NumPy runs these operations in vectorized, compiled code, they are much faster than equivalent Python loops.

You don't need to hand in this notebook. It will, however, give you many bits of code and a step-by-step understanding of how to handle data and produce recommendations with NumPy matrix operations.

The techniques shown here are representative of how many other algorithms in recommender systems, machine learning, and data science at large are built.

## Creating a rating matrix from a dataframe of ratings

Example ratings data frame.

In [71]:
import pandas as pd

ratings_df = pd.DataFrame(columns=['u', 'i', 'r'],
                          data=[['v', 'b', 4], ['v', 'c', 5], ['v', 'd', 3],
                                ['x', 'a', 5], ['x', 'b', 2], ['x', 'e', 4], 
                                ['y', 'a', 1], ['y', 'b', 4], ['y', 'c', 4],
                                ['z', 'c', 3], ['z', 'd', 5]])
ratings_df.to_csv('recsys-data/toy1.csv', index=False)
ratings_df

,u,i,r
0,v,b,4
1,v,c,5
2,v,d,3
3,x,a,5
4,x,b,2
5,x,e,4
6,y,a,1
7,y,b,4
8,y,c,4
9,z,c,3
10,z,d,5


Create a ratings matrix.

In [72]:
matrix = ratings_df.pivot(index='u', columns='i', values='r')
matrix

u \ i,a,b,c,d,e
v,,4.0,5.0,3.0,
x,5.0,2.0,,,4.0
y,1.0,4.0,4.0,,
z,,,3.0,5.0,


Handle missing ratings as zeros, convert to numpy 2D array.

In [73]:
matrix = # Your code here...
matrix

array([[0., 4., 5., 3., 0.],
       [5., 2., 0., 0., 4.],
       [1., 4., 4., 0., 0.],
       [0., 0., 3., 5., 0.]])
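
One possible way to fill in the blank above, offered as a sketch rather than the reference solution: `fillna(0)` replaces the missing ratings with zeros and `to_numpy()` drops the row and column labels. The toy data is repeated so that the block runs by itself.

```python
import pandas as pd

# Same toy ratings as above, repeated so this block is self-contained.
ratings_df = pd.DataFrame(columns=['u', 'i', 'r'],
                          data=[['v', 'b', 4], ['v', 'c', 5], ['v', 'd', 3],
                                ['x', 'a', 5], ['x', 'b', 2], ['x', 'e', 4],
                                ['y', 'a', 1], ['y', 'b', 4], ['y', 'c', 4],
                                ['z', 'c', 3], ['z', 'd', 5]])

# Pivot to a user x item table, replace NaN with 0, and drop the labels.
matrix = ratings_df.pivot(index='u', columns='i', values='r').fillna(0).to_numpy()
print(matrix)
```

Note that treating a missing rating as zero is a convention of this notebook: it works here because real ratings are never zero in the toy data.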

How to access the internal rating matrix from external user and item ids.

*Why is this needed?* In the internal matrix representation, user and item ids are implicit: they are the row and column numbers, ranging from zero to the number of users minus one and the number of items minus one, respectively. We will refer to them as idxs. Working with ratings as a matrix is highly efficient in NumPy. However, users and items have external ids that may not be integers, and even integer ids may not range from zero to the number of users or items. Hence, we map external ids to internal idxs, compute recommendations, and map the results back to external ids when serving them. This is analogous to document titles, URLs, or pathnames versus internal doc ids in a search engine.

In [74]:
# Create mappings between internal and external user and item ids.

import numpy as np

uidx_to_uid = np.sort(ratings_df.u.unique())
iidx_to_iid = np.sort(ratings_df.i.unique())
uid_to_uidx = {u:j for j, u in enumerate(uidx_to_uid)}
iid_to_iidx = {i:j for j, i in enumerate(iidx_to_iid)}

print(uidx_to_uid)
print(iidx_to_iid)
print(uid_to_uidx)
print(iid_to_iidx)
print()

def rating(uid, iid):
    # Your code here...

def items_rated_by(uid):
    return # Your code here... Hint: numpy.ndarray.nonzero() returns a tuple of arrays with the indices of the non-zero elements, one array per dimension

# Example of how to get data from external ids through internal idxs.
print(rating('v', 'd'))
items_rated_by('v')

['v' 'x' 'y' 'z']
['a' 'b' 'c' 'd' 'e']
{'v': 0, 'x': 1, 'y': 2, 'z': 3}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

3.0


array(['b', 'c', 'd'], dtype=object)
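
A minimal sketch of how the two blanks above could be filled in, assuming the `matrix` and the four mappings from the previous cells (rebuilt here by hand so the block is self-contained). It is one reasonable answer, not necessarily the intended one.

```python
import numpy as np

# Toy rating matrix and id mappings from the cells above, repeated here.
matrix = np.array([[0., 4., 5., 3., 0.],
                   [5., 2., 0., 0., 4.],
                   [1., 4., 4., 0., 0.],
                   [0., 0., 3., 5., 0.]])
uidx_to_uid = np.array(['v', 'x', 'y', 'z'], dtype=object)
iidx_to_iid = np.array(['a', 'b', 'c', 'd', 'e'], dtype=object)
uid_to_uidx = {u: j for j, u in enumerate(uidx_to_uid)}
iid_to_iidx = {i: j for j, i in enumerate(iidx_to_iid)}

def rating(uid, iid):
    # Translate both external ids to matrix positions, then index.
    return matrix[uid_to_uidx[uid], iid_to_iidx[iid]]

def items_rated_by(uid):
    # nonzero() on a 1D row returns a 1-tuple of index arrays; take its only element.
    rated_iidx = matrix[uid_to_uidx[uid]].nonzero()[0]
    # Map the internal item positions back to external item ids.
    return iidx_to_iid[rated_iidx]

print(rating('v', 'd'))
print(items_rated_by('v'))
```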

## Computing a similarity matrix

Example ratings matrix.

In [75]:
import numpy as np
ratings_matrix = np.array([[0, 4, 5, 3, 0],
                           [5, 2, 0, 0, 4],
                           [1, 4, 4, 0, 0],
                           [0, 0, 3, 5, 0]])
ratings_matrix

array([[0, 4, 5, 3, 0],
       [5, 2, 0, 0, 4],
       [1, 4, 4, 0, 0],
       [0, 0, 3, 5, 0]])

Compute dot products of all matrix rows against each other.

In [76]:
dots = # Your code here...
dots

array([[50,  8, 36, 30],
       [ 8, 45, 13,  0],
       [36, 13, 33, 12],
       [30,  0, 12, 34]])
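
One way to obtain all pairwise dot products at once, shown as a sketch: multiplying the matrix by its own transpose places the dot product of rows `u` and `v` at position `(u, v)`.

```python
import numpy as np

ratings_matrix = np.array([[0, 4, 5, 3, 0],
                           [5, 2, 0, 0, 4],
                           [1, 4, 4, 0, 0],
                           [0, 0, 3, 5, 0]])

# (users x items) @ (items x users) -> (users x users) matrix of dot products.
dots = ratings_matrix @ ratings_matrix.T
print(dots)
```

The result is symmetric, and its diagonal holds the dot product of each row against itself.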

Get the modulus of each row as the square root of the dot product of the row against itself. Hint: the data you need is already in the dot product matrix.

In [77]:
mods = # Your code here...
mods[mods==0] = 1 # To avoid 0/0 later if some row is all zeros (and hence the row modulus is zero).
mods

array([7.07106781, 6.70820393, 5.74456265, 5.83095189])
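
A possible one-liner for the row moduli, sketched under the assumption that `dots` is the dot product matrix from the previous step: its diagonal holds each row's dot product with itself, so the square root of the diagonal gives the moduli.

```python
import numpy as np

# Dot product matrix from the previous step, repeated for self-containment.
dots = np.array([[50,  8, 36, 30],
                 [ 8, 45, 13,  0],
                 [36, 13, 33, 12],
                 [30,  0, 12, 34]])

# The diagonal holds row . row; np.sqrt returns a new, writable float array.
mods = np.sqrt(np.diag(dots))
mods[mods == 0] = 1  # Guard against 0/0 for all-zero rows, as in the cell above.
print(mods)
```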

Divide all rows and all columns of the dot product matrix by the modulus of the corresponding original matrix row.

In [78]:
sim = # Your code here...
sim

array([[1.        , 0.16865481, 0.88625874, 0.72760688],
       [0.16865481, 1.        , 0.33734954, 0.        ],
       [0.88625874, 0.33734954, 1.        , 0.35824886],
       [0.72760688, 0.        , 0.35824886, 1.        ]])
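
Dividing entry `(u, v)` by both `mods[u]` and `mods[v]` yields the cosine similarity between the two user rows. The sketch below does this with `np.outer`, which builds the matrix of modulus products; dividing with broadcasting (`dots / mods[:, None] / mods[None, :]`) would be an equivalent alternative.

```python
import numpy as np

dots = np.array([[50,  8, 36, 30],
                 [ 8, 45, 13,  0],
                 [36, 13, 33, 12],
                 [30,  0, 12, 34]])
mods = np.sqrt(np.diag(dots))
mods[mods == 0] = 1

# np.outer(mods, mods)[u, v] == mods[u] * mods[v], so this divides each
# dot product by the moduli of both rows involved.
sim = dots / np.outer(mods, mods)
print(sim)
```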

Remove self-similarities: we don't want users to be their own neighbors.

In [79]:
# Your code here... Hint: can be done in just one np array operation.
sim

array([[0.        , 0.16865481, 0.88625874, 0.72760688],
       [0.16865481, 0.        , 0.33734954, 0.        ],
       [0.88625874, 0.33734954, 0.        , 0.35824886],
       [0.72760688, 0.        , 0.35824886, 0.        ]])
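
One candidate for the single operation hinted at above is `np.fill_diagonal`, which overwrites the diagonal in place and returns `None`, so its result should not be assigned back to `sim`. A small sketch on the similarity matrix from the previous step:

```python
import numpy as np

sim = np.array([[1.        , 0.16865481, 0.88625874, 0.72760688],
                [0.16865481, 1.        , 0.33734954, 0.        ],
                [0.88625874, 0.33734954, 1.        , 0.35824886],
                [0.72760688, 0.        , 0.35824886, 1.        ]])

# In-place: sets sim[u, u] = 0 for every user u and leaves the rest untouched.
np.fill_diagonal(sim, 0)
print(sim)
```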

Zero out all but top k similarities of each row.

In [80]:
from IPython.display import display

# Given a matrix, returns a matrix of positions of top k values per row.
def top_positions_per_row(m, k):
    return np.argpartition(m, -k)[:, -k:]

# Positions of top k sim values of each row.
k = 2
uidx = top_positions_per_row(sim, k)
print(str(k) + ' nearest users per user row:')
display(uidx)

# Create mask with 1's on the top k of each row, 0 anywhere else.
mask = np.zeros_like(sim)
mask[np.arange(mask.shape[0]), uidx.T] = 1
print('\nIn mask form:')
display(mask)

# Apply mask to sim.
knn_sim = # Your code here...
print('\nTop ' + str(k) + ' sim matrix:')
knn_sim

2 nearest users per user row:


array([[3, 2],
       [0, 2],
       [3, 0],
       [2, 0]], dtype=int64)


In mask form:


array([[0., 0., 1., 1.],
       [1., 0., 1., 0.],
       [1., 0., 0., 1.],
       [1., 0., 1., 0.]])


Top 2 sim matrix:


array([[0.        , 0.        , 0.88625874, 0.72760688],
       [0.16865481, 0.        , 0.33734954, 0.        ],
       [0.88625874, 0.        , 0.        , 0.35824886],
       [0.72760688, 0.        , 0.35824886, 0.        ]])
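
Since the mask holds 1 at the positions to keep and 0 everywhere else, an element-wise product is enough to apply it. The sketch below repeats the steps of the cell above on the toy similarity matrix and fills in the blank that way.

```python
import numpy as np

sim = np.array([[0.        , 0.16865481, 0.88625874, 0.72760688],
                [0.16865481, 0.        , 0.33734954, 0.        ],
                [0.88625874, 0.33734954, 0.        , 0.35824886],
                [0.72760688, 0.        , 0.35824886, 0.        ]])

def top_positions_per_row(m, k):
    # argpartition places the k largest values of each row in the last k
    # columns, in no particular order among themselves.
    return np.argpartition(m, -k)[:, -k:]

k = 2
top = top_positions_per_row(sim, k)

# Mask with 1 at the top k positions of each row, 0 elsewhere.
mask = np.zeros_like(sim)
mask[np.arange(mask.shape[0]), top.T] = 1

# Element-wise product: keeps the top k similarities and zeroes out the rest.
knn_sim = sim * mask
print(knn_sim)
```

Note that the resulting matrix is no longer symmetric: `x` keeps `v` as a neighbor, but `v` does not keep `x`.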

## Computing similarity-based recommendations

Example data from previous steps.

In [81]:
knn_sim = np.array([[0, 0, 0.88625874, 0.72760688],
                    [0.16865481, 0, 0.33734954, 0],
                    [0.88625874, 0, 0, 0.35824886],
                    [0.72760688, 0, 0.35824886, 0]])

Can you now create a matrix of user/item scores?

In [82]:
scores = # Your code here...
scores

array([[0.88625874, 3.54503496, 5.7278556 , 3.6380344 , 0.        ],
       [0.33734954, 2.0240174 , 2.19267221, 0.50596443, 0.        ],
       [0.        , 3.54503496, 5.50604028, 4.45002052, 0.        ],
       [0.35824886, 4.34342296, 5.07102984, 2.18282064, 0.        ]])
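
One way to read the question above: the score of item `i` for user `u` is the sum, over the neighbors `v` of `u`, of `sim(u, v)` times the rating of `v` for `i`. That sum is exactly a matrix product, which the following sketch computes; it assumes this non-normalized kNN scoring, which is consistent with the output shown above.

```python
import numpy as np

ratings_matrix = np.array([[0, 4, 5, 3, 0],
                           [5, 2, 0, 0, 4],
                           [1, 4, 4, 0, 0],
                           [0, 0, 3, 5, 0]])
knn_sim = np.array([[0, 0, 0.88625874, 0.72760688],
                    [0.16865481, 0, 0.33734954, 0],
                    [0.88625874, 0, 0, 0.35824886],
                    [0.72760688, 0, 0.35824886, 0]])

# (users x users) @ (users x items) -> (users x items) matrix of scores.
scores = knn_sim @ ratings_matrix
print(scores)
```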

Now cancel out the scores of any user/item pairs that appear in the original ratings (we don't want to recommend items that the user has already rated).

In [83]:
print('Mask:')
display(ratings_matrix == 0)
scores = # Your code here...
print('\nScores:')
scores

Mask:


array([[ True, False, False, False,  True],
       [False, False,  True,  True, False],
       [False, False, False,  True,  True],
       [ True,  True, False, False,  True]])


Scores:


array([[0.88625874, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 2.19267221, 0.50596443, 0.        ],
       [0.        , 0.        , 0.        , 4.45002052, 0.        ],
       [0.35824886, 4.34342296, 0.        , 0.        , 0.        ]])
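
A sketch of one way to apply the mask: in arithmetic, NumPy treats `True` as 1 and `False` as 0, so multiplying the scores by `ratings_matrix == 0` keeps only the scores of unrated items.

```python
import numpy as np

ratings_matrix = np.array([[0, 4, 5, 3, 0],
                           [5, 2, 0, 0, 4],
                           [1, 4, 4, 0, 0],
                           [0, 0, 3, 5, 0]])
scores = np.array([[0.88625874, 3.54503496, 5.7278556 , 3.6380344 , 0.],
                   [0.33734954, 2.0240174 , 2.19267221, 0.50596443, 0.],
                   [0.        , 3.54503496, 5.50604028, 4.45002052, 0.],
                   [0.35824886, 4.34342296, 5.07102984, 2.18282064, 0.]])

# The boolean mask is True where the user has NOT rated the item.
scores = scores * (ratings_matrix == 0)
print(scores)
```

An equivalent formulation would be `np.where(ratings_matrix == 0, scores, 0)`.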

Now create ordered top n rankings from the scores, mapped back to external user and item ids.

In [88]:
def get_elements(m, indices):
    return np.array([s[t] for s, t in zip(m, indices)])

n = 2
top_iidx = top_positions_per_row(scores, n)
# Sort, because top_positions_per_row returns the top n positions in no particular order.
ranked_iidx = get_elements(top_iidx, np.argsort(get_elements(scores, top_iidx))[:, ::-1])
print('Ranked top item iidx per user rows:')
display(ranked_iidx)

# And now get the ranked iids and scores.
ranked_iids = iidx_to_iid[ranked_iidx]
rank_scores = get_elements(scores, ranked_iidx)

recs = {uid : [(iid, score) for iid, score in zip(ranked_iids[uidx], rank_scores[uidx]) if score > 0] 
        for uidx, uid in enumerate(uidx_to_uid)} 
print('\nRecommendations!')
recs

Ranked top item iidx per user rows:


array([[0, 1],
       [2, 3],
       [3, 4],
       [1, 0]], dtype=int64)


Recommendations!


{'v': [('a', 0.88625874)],
 'x': [('c', 2.19267221), ('d', 0.5059644299999999)],
 'y': [('d', 4.45002052)],
 'z': [('b', 4.34342296), ('a', 0.35824886)]}
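
For comparison, here is a simpler but less efficient sketch of the same ranking step. It sorts every row in full with `np.argsort` instead of partitioning first, and picks the scores with `np.take_along_axis`. This is an alternative to the cell above, not the notebook's method; the masked scores and id arrays are repeated so the block runs by itself.

```python
import numpy as np

scores = np.array([[0.88625874, 0., 0., 0., 0.],
                   [0., 0., 2.19267221, 0.50596443, 0.],
                   [0., 0., 0., 4.45002052, 0.],
                   [0.35824886, 4.34342296, 0., 0., 0.]])
uidx_to_uid = np.array(['v', 'x', 'y', 'z'])
iidx_to_iid = np.array(['a', 'b', 'c', 'd', 'e'])

n = 2
# Full descending sort of every row, then keep the first n positions.
ranked_iidx = np.argsort(-scores, axis=1)[:, :n]
# Pick the score at each ranked position, row by row.
rank_scores = np.take_along_axis(scores, ranked_iidx, axis=1)

# Map back to external ids, dropping items with a zero score.
recs = {str(uid): [(str(iidx_to_iid[i]), float(s))
                   for i, s in zip(row_iidx, row_scores) if s > 0]
        for uid, row_iidx, row_scores in zip(uidx_to_uid, ranked_iidx, rank_scores)}
print(recs)
```

A full sort costs more than `argpartition` when the number of items is much larger than n, which is why the cell above partitions first and sorts only the top n.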