# Filtrado Colaborativo: *K-Nearest Neighbors*

Los algoritmos de **filtrado colaborativo** basados en la técnica de los *k* vecinos (KNN) tratan de imitar el comportamiento de los seres humanos cuando buscan recibir una recomendación: cuando necesitamos conocer si nos va a interesar un item, preguntamos a personas que sabemos que conocen nuestros intereses si ellos consideran que el item nos va a gustar.

El método realizará esta misma operación empleando la matriz de votaciones. Este proceso seguirá el siguiente algoritmo:

1. Determinar la similaridad entre los usuarios
2. Encontrar el conjunto de *k* usuarios más similares (*k* vecinos)
3. Estimar las predicciones a los items no votados utilizando las votaciones realizadas por los *k* vecinos
4. (Opcional) Recomendar los *N* items con una predicción más alta

En los siguientes subapartados explicaremos en detalle cada uno de estos pasos.

## Carga del dataset

Para ilustrar mejor el funcionamiento de la técnica de KNN, vamos a desarrollar una implementación explicativa de cómo funciona.

Para ello usaremos el dataset de [MovieLens 100K](https://grouplens.org/datasets/movielens/) que contiene 100.000 votos de 943 usuarios sobre 1682 películas. Este dataset ha sido dividido en votaciones de entrenamiento (80%) y votaciones de test (20%). Además, los códigos de usuarios e items, han sido modificados para que comience en 0 y terminen en el número de (usuarios / items) - 1.

Inicialmente definimos algunas constantes que nos serán necesarias durante la codificación del algoritmo:

In [2]:
import urllib.request
import math
import numpy as np

In [3]:
NUM_USERS = 943
NUM_ITEMS = 1682

MIN_RATING = 1
MAX_RATING = 5

Cargamos el dataset en la matriz de votaciones. La ausencia de voto se representa con None:

In [4]:
ratings = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]

training_file = urllib.request.urlopen("https://drive.upm.es/s/tDdluElfGInyUnU/download")
for line in training_file:
  [u, i, rating] = line.decode("utf-8").split("::")
  ratings[int(u)][int(i)] = int(rating)

Del mismo modo, cargamos la matriz de votaciones de test:

In [5]:
test_ratings = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]

test_file = urllib.request.urlopen("https://drive.upm.es/s/Jn75Vg6okOPsgZu/download")
for line in test_file:
  [u, i, rating] = line.decode("utf-8").split("::")
  test_ratings[int(u)][int(i)] = int(rating)

In [6]:
ratings2 = np.asarray(ratings)
ratings2[ratings2 == None] = np.nan
ratings2 = ratings2.astype(np.float16)
ratings3 = ratings2.astype(np.float64)

In [7]:
%timeit np.isnan(ratings2)
%timeit np.isnan(ratings2)
%timeit np.isnan(ratings3)
%timeit [[rating == None for rating in user] for user in ratings]

2.7 ms ± 81 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.52 ms ± 62.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1 ms ± 13.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
34.7 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
%timeit ratings2[~np.isnan(ratings2)].mean()
%timeit ratings3[~np.isnan(ratings3)].mean()

4.91 ms ± 69.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.46 ms ± 46.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
ratings_np = ratings3
test_ratings_np = np.asarray(test_ratings)
test_ratings_np[test_ratings_np == None] = np.nan
test_ratings_np = test_ratings_np.astype(np.float64)

Definimos también algunas funciones auxiliares que nos serán útiles:

In [10]:
def rating_average_old(u):
    acc = 0
    count = 0

    for i in range(NUM_ITEMS):
        if ratings[u][i] != None:
            acc += ratings[u][i]
            count += 1

    avg = acc / count

    return avg


In [11]:
def rating_average(u):
    return ratings[u,:][~np.isnan(ratings[u,:])].mean()

## Calculo de la similaridad

El primer paso del algoritmo de KNN consiste en determinar el parecido de cada pareja de usuarios basándonos en las votaciones previas de dichos usuarios. Para calcular esta similaridad debemos tener presente que la matriz de votaciones es dispersa y, por lo tanto, parar comparar dos usuarios únicamente podremos emplear los votos sobre aquellos items que sean comunes a ambos usuarios.

Por ejemplo, si un usuario *u* ha realizado las siguientes votaciones:

|   	| i1 	| i2 	| i3 	| i4 	| i5 	| i6 	| i7 	| i8 	| i9 	| i10 	|
|:-:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|-----	|
| u 	|  1 	|  2 	|    	|  4 	|  2 	|    	|  3 	|    	|   	|   5  	|

Y un usuario *v* ha realizado las siguientes votaciones:

|   	| i1 	| i2 	| i3 	| i4 	| i5 	| i6 	| i7 	| i8 	| i9 	| i10 	|
|:-:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|-----	|
| v 	|  3 	|   	|  2  	|   	|  5 	|  4  	|  3 	|  3  	|   	|  1 	|

Únicamente podremos comparar los votos producidos en los items *i1*, *i5*, *i7* e *i10*.

Existen infinidad de métricas de similaridad que permiten conocer el parecido de dos usuarios en función de sus votos comunes. Las más tradicionales se basan en medidas estadísticas clásicas. Por ejemplo, una de las métricas más empleadas es la **correlación**:

$$sim(u, v) = \frac {
    \sum_{i \in I_{u,v}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)
   }{
   \sqrt{ \sum_{i \in I_{u,v}} (r_{u,i} - \bar{r}_u)^2 \sum_{i \in I_{u,v}} (r_{v,i} - \bar{r}_v)^2 }
   }$$
   
Donde $I_{u,v}$ representa los items que han votado tanto el usuario $u$ como el usuario $v$, $\bar{r}_u$ representa la votación media del usuario $u$ y $\bar{r}_v$ representa la votación media del usuario $v$.

La implementación de esta función podríamos hacerla del siguiente modo:

In [12]:
ratings_np

array([[ 5.,  3.,  4., ..., nan, nan, nan],
       [ 4., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [ 5., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan,  5., nan, ..., nan, nan, nan]])

In [23]:
set(np.arange(ratings_np.shape[0])[~np.isnan(ratings_np).all(axis=1)])

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,


In [30]:
ratings_np.ndim

2

In [57]:
data = ratings_np[0]
[*np.arange(data.shape[0])[~np.isnan(data).all(axis=tuple(range(1, data.ndim)))]]

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,


In [47]:
np.arange(data.shape[0])[~np.isnan(data).all(axis=-1)]

array([[   0,    1,    2, ..., 1679, 1680, 1681]])

In [None]:
class CustomNanSparse:
    def __init__(self, data: np.ndarray):
        valid_values = [*np.arange(data.shape[0])[~np.isnan(data).all(axis=tuple(range(1, data.ndim)))]]
        self._where = dict(zip(valid_values, range(len(valid_values))))   
        if data.ndim == 1:
            self._data = data[valid_values]
            return         
        self._data = [CustomNanSparse(data[i]) for i in valid_values]
        self.ndim = data.ndim
        
    def __getitem__(self, key):
        if isinstance(key, tuple):
            return self._data[key[0]][key[1]]
        if key in self._where:
            return self._data[self._where[key]]
        else:
            raise KeyError(f"Key {key} not found in sparse data.")
        
    def __setitem__(self, key, value):
        if isinstance(key, tuple):
            self._data[key[0]][key[1]] = value
            return
        if key in self._where:
            self._data[self._where[key]] = value
        else:
            raise KeyError(f"Key {key} not found in sparse data.")
        
    def __add__(self, other):
        assert self.ndim == other.ndim, "Dimensions must match for addition."
        if self.ndim == 1:
            result = []
            for i in self.where & other.where:
                result.append(self._data[i] + other._data[i])
            return CustomNanSparse(np.array(result))
        
        if isinstance(other, CustomNanSparse):
            intersection = self.where
            for i in self.where & other.where:
                self._data[i] += other._data[i]
            for i in self.where - other.where:
                self
                self._data[i] += 0
        else:
            raise TypeError(f"Unsupported type for addition: {type(other)}")
        

a = CustomNanSparse(ratings_np)
a[0][0] = 5.1
a[0,0] += 0.1
a[0,0]


5.199999999999999

In [76]:
set([1]) & set([2])

set()

In [68]:
ratings_np

array([[ 5.,  3.,  4., ..., nan, nan, nan],
       [ 4., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [ 5., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan,  5., nan, ..., nan, nan, nan]])

In [95]:
def correlation_similarity_old(u, v):
    num = 0

    den_u = 0
    den_v = 0

    count = 0

    avg_u = rating_average_old(u)
    avg_v = rating_average_old(v)

    for i in range(NUM_ITEMS):
        if ratings[u][i] != None and ratings[v][i] != None:
            r_u = ratings[u][i]
            r_v = ratings[v][i]

            num += (r_u - avg_u) * (r_v - avg_v)
            den_u += (r_u - avg_u) * (r_u - avg_u)
            den_v += (r_v - avg_v) * (r_v - avg_v)

            count += 1

    if count > 0 and den_u != 0 and den_v != 0:
        cor = num / math.sqrt( den_u * den_v )
        return cor;
    else:
        return None

In [96]:
%timeit correlation_similarity_old(5,940)
%timeit correlation_similarity_old(113,534)
%timeit correlation_similarity_old(92,355)

786 μs ± 24.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
773 μs ± 46.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
602 μs ± 44.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [529]:
def correlation_similarity(u, v):
    ratings_u = ratings_np[u]
    ratings_v = ratings_np[v]

    u_valid = ~np.isnan(ratings_u)
    v_valid = ~np.isnan(ratings_v)
    both_valid = u_valid & v_valid

    u_mean = ratings_u[u_valid].mean()
    v_mean = ratings_v[v_valid].mean()

    ratings_u = ratings_u[both_valid]
    if len(ratings_u) == 0:
        return np.nan
    ratings_v = ratings_v[both_valid]

    u_error = ratings_u - u_mean
    v_error = ratings_v - v_mean
    denominator = np.sqrt((u_error**2).sum() * (v_error**2).sum())
    if denominator == 0:
        return np.nan
    
    return (u_error * v_error).sum()/denominator

In [531]:
%timeit correlation_similarity(5,940)
%timeit correlation_similarity(113,534)
%timeit correlation_similarity(92,355)

72.6 μs ± 2.41 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
79.8 μs ± 3.94 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
37.9 μs ± 1.69 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [102]:
# Testing
print(correlation_similarity_old(5,940))
print(correlation_similarity_old(113,534))
print(correlation_similarity_old(92,355))
print()
print(correlation_similarity(5,940))
print(correlation_similarity(113,534)) # Es incluso más preciso (aunque la diferencia no es significativa)
print(correlation_similarity(92,355))

0.010617834767124239
-0.08282165590325245
None

0.010617834767124239
-0.08282165590325251
None


In [208]:
user1 = np.array([5, np.nan])
user2 = np.array([np.nan, 4])
intersection = np.array([False, False])
(user1 * user2).sum(where=intersection), 1/np.nan

(0.0, nan)

In [532]:
def correlation_similarities(u):
    all_ratings = ratings_np.copy()
    valids = ~np.isnan(all_ratings)
    intersections = valids[u] & valids
    
    means = all_ratings.mean(axis=1, where=valids)
    errors = all_ratings - means[:, np.newaxis]
    numerator = (errors * errors[u]).sum(axis=1, where=intersections)
    denominator = np.sqrt(
        (errors**2).sum(axis=1, where=intersections)
        *
        (np.broadcast_to(errors[u], all_ratings.shape)**2).sum(axis=1, where=intersections)
    )
        
    denominator[denominator == 0] = np.nan
    similarities = numerator / denominator
    similarities[u] = np.nan
    return similarities

In [533]:
u = 5
%timeit -n 20 correlation_similarities(u)
%timeit -n 20 correlation_similarities(u)
%timeit -n 20 [None if u == v else correlation_similarity(u, v) for v in range(NUM_USERS)]

52.1 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
52.7 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
73.6 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)


Aunque las métricas de similaridad basadas en medidas estadísticas clásicas ofrecen valores "aceptables" para el método de los *k* vecinos, existen otras métricas de similaridad específicas para el dominio del filtrado colaborativo. Una de las más populares es **JMSD** que permite calcular la similaridad atendiendo a dos factores: la información no numérica de los votos y la información numérica de los mismos. *JMSD* se define como el producto del índice de Jaccard por la diferencia cuadrática media:

$$JMSD(u,v) = Jaccard(u,v) * (1 - MSD(u, v))$$

El *índice de Jaccard* es una medida que permite comparar la similitud de dos conjuntos. En este caso se emplea para comprobar si los items votados por ambos usuarios coinciden, independientemente de la votación:

$$Jaccard(u,v) =\frac {I_u \cap I_v} {I_u \cup I_v} = \frac {\# \{ i \in I | r_{u,i} \neq \bullet \wedge r_{v,i} \neq \bullet \}} {\# \{ i \in I | r_{u,i} \neq \bullet \vee r_{v,i} \neq \bullet \}}$$

Donde $I_u$ representa los items votados por el usuario $u$ e $I_v$ representa los items votados por el usuario $v$.

El *MSD* permite comparar si los usuarios tienen la misma opinión sobre los items que votaron en común. En este caso, el *MSD* retorna el valor 0 cuando los usuarios votaron idéntico, por lo que suele invertirse para lograr que cuanto más se parezcan dos usuarios, mayor sea su similaridad. Para esto, es necesario normalizar previamente las votaciones:

$$MSD(u,v) = \frac {1} {\#I_{u,v}} \sum_{i \in I_{u,v}} (r_{u,i} - r_{v,i})^2$$

Donde $I_{u,v}$ representa los items votados en común por $u$ y $v$.

Esta medida busca un equilibrio entre los items que los dos usuarios han votado y lo parecidas que son estas votaciones.

La implementación de esta función podríamos hacerla del siguiente modo:

In [103]:
def jmsd_similarity_old(u, v):

    union = 0
    intersection = 0
    diff = 0

    for i in range(NUM_ITEMS):
        if ratings[u][i] != None and ratings[v][i] != None:
            r_u = (ratings[u][i] - MIN_RATING) / (MAX_RATING - MIN_RATING)
            r_v = (ratings[v][i] - MIN_RATING) / (MAX_RATING - MIN_RATING)

            diff = (r_u - r_v) * (r_u - r_v)

            intersection += 1
            union += 1

        elif ratings[u][i] != None or ratings[v][i] != None:
            union += 1


    if intersection > 0:
        jaccard = intersection / union
        msd = diff / intersection
        return jaccard * (1 - msd);
    else:
        return None

In [104]:
%timeit jmsd_similarity_old(5,940)
%timeit jmsd_similarity_old(113,534)
%timeit jmsd_similarity_old(92,355)

538 μs ± 32.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
535 μs ± 33 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
431 μs ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [None]:
def jmsd_similarity(u, v):
    ratings_u = ratings_np[u]
    ratings_v = ratings_np[v]

    valid_u = ~np.isnan(ratings_u)
    valid_v = ~np.isnan(ratings_v)

    union_size = (valid_u | valid_v).sum()
    intersection = valid_u & valid_v
    intersection_size = intersection.sum()

    if intersection_size == 0:
        return np.nan
    jaccard_similiarty = intersection_size/union_size
    
    ratings_u = ratings_u[intersection]
    ratings_v = ratings_v[intersection]
    
    ratings_u = (ratings_u - MIN_RATING) / (MAX_RATING - MIN_RATING)
    ratings_v = (ratings_v - MIN_RATING) / (MAX_RATING - MIN_RATING)

    return jaccard_similiarty * (1 - ((ratings_u - ratings_v)**2).mean())

In [108]:
%timeit jmsd_similarity(5,940)
%timeit jmsd_similarity(113,534)
%timeit jmsd_similarity(92,355)

67 μs ± 1.92 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
67.9 μs ± 2.52 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
28.9 μs ± 2.11 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [109]:
# Testing
print(jmsd_similarity_old(5,940))
print(jmsd_similarity_old(113,534))
print(jmsd_similarity_old(92,355))
print()
print(jmsd_similarity(5,940))
print(jmsd_similarity(113,534))
print(jmsd_similarity(92,355))

0.049268018018018014
0.12685381355932204
None

0.04025900900900901
0.1138771186440678
None


In [None]:
def jmsd_similarities(u):
    all_ratings = ratings_np.copy()

    # valid_u = ratings_u != None
    valids = ~np.isnan(all_ratings)
    union_sizes = (valids[u] | valids).sum(axis=1)
    intersections = valids[u] & valids
    intersection_sizes = intersections.sum(axis=1, dtype=np.float64)
    intersection_sizes[intersection_sizes == 0] = np.nan
    
    jaccard_similiarties = intersection_sizes/union_sizes

    all_ratings[intersections] = (all_ratings[intersections] - MIN_RATING) / (MAX_RATING - MIN_RATING)
    all_ratings[intersections] = ((np.broadcast_to(all_ratings[u], all_ratings.shape)[intersections] - all_ratings[intersections])**2)
    all_ratings[~intersections] = 0
    similarities = jaccard_similiarties * (1 - all_ratings.sum(axis=1) / intersection_sizes)
    similarities[u] = np.nan
    return similarities

In [None]:
u = 5
%timeit -n 20 jmsd_similarities(u)
%timeit -n 20 jmsd_similarities(u)
%timeit -n 20 [None if u == v else jmsd_similarity(u, v) for v in range(NUM_USERS)]

32 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
34.8 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
69.4 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)


## Búsqueda de los *k* vecinos

En el método de KNN, las predicciones se realizan a partir del conjunto de *k* vecinos de cada usuario, esto es, el conjunto de *k* usuarios más similares a uno dado. Este *k* será considerado como un parámetro del sistema y deberá tunearse para cada dataset.

La búsqueda de los *k* vecinos consiste únicamente en ordenar los usuarios en base a su similaridad y elegir a los *k* con una similaridad más alta.

La implementación de esta función podríamos hacerla del siguiente modo:

In [227]:
k = 25

In [None]:
def get_neighbors_old(u, similarities):

    neighbors = [None for _ in range(k)]

    for n in range(k):

        max_similarity = 0
        neighbor = None

        for v, sim in enumerate(similarities):
            if v not in neighbors and sim != None and sim > max_similarity:
                max_similarity = sim
                neighbor = v

        neighbors[n] = neighbor

    return neighbors

In [466]:
def get_neighbors(u, similarities):
    # Hay que tomar una decisión entre que devuelva vecinos nulos o que 
    # devuelva un número indeterminado de vecinos, opto por la segunda por
    # eficiencia.
    if len(similarities) <= k:
        # Not enough neighbors
        similarities = np.where(~np.isnan(similarities), similarities, 0)
        neighbors = np.arange(len(similarities))
        neighbors = neighbors[similarities[neighbors] != 0]
        return neighbors

    similarities = np.where(~np.isnan(similarities), similarities, 0)
    neighbors = np.argpartition(similarities, -k)[-k:] # Más eficiente que ordenar
    return neighbors[similarities[neighbors] != 0]


In [462]:
a = np.arange(10) - 5
index = np.argwhere((a == 0) | (a == 1))[0][0]
a[:index]

array([-5, -4, -3, -2, -1])

In [467]:
# Testing
u = 112
similarities_corr = [None if u == v else correlation_similarity_old(u, v) for v in range(NUM_USERS)]
similarities_jmsd = [None if u == v else jmsd_similarity_old(u, v) for v in range(NUM_USERS)]
similarities_corr_np = correlation_similarities(u)
similarities_jmsd_np = jmsd_similarities(u)

%timeit get_neighbors_old(u, similarities_corr)
%timeit get_neighbors_old(u, similarities_jmsd)
# Mejora tanto que se usa una unidad distinta
%timeit get_neighbors(u, similarities_corr_np)
%timeit get_neighbors(u, similarities_jmsd_np)
print()

neighbors_corr = get_neighbors(u, similarities_corr_np)
neighbors_jmsd = get_neighbors(u, similarities_jmsd_np)
print(neighbors_corr)
print(neighbors_jmsd)

20.1 ms ± 462 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.7 ms ± 562 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
29.3 μs ± 1.55 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
29.9 μs ± 2.1 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

[608 515 507 659 703 699 875 318 797 186  40 856  50 365  97 134 606 113
 379  66 722 121 521 132  87]
[138 673 322 729 700 577 904 925 643 469 172  73 148 742 412 169 581  62
 936 154 509 702 769 175 833]


## Estimación de las predicciones

La estimación de las predicciones se realiza agregando las votaciones que los *k* vecinos del usuario activo realizaron al item que se quiere predecir. De nuevo, nos encontramos con el problema de la dispersión de la matriz de votaciones, ya que no todos los *k* vecinos habrán votado los mismos items. Podría incluso darse la circunstancia de que un item no hubiera sido votado por ninguno de los *k* vecinos y, por tanto, no podría estimarse una predicción.

Esta situación puede ayudarnos a tunear el parámetro *k*. Si elegimos un *k* muy bajo conseguimos predicciones precisas ya que se realizan con usuarios muy similares al activo, pero existirán muchos items que no podremos predecir. Si elegimos un *k* muy elevado podremos predecir casi todos los items pero estas predicciones serán poco personalizadas. Llevado al extremo si *k* es igual al número de usuario del sistema (menos uno), las predicciones serían equivalentes a la media de los votos.

Existen diversas formas de agregar las votaciones de los *k* vecinos (*aggregation approach*), aunque, la más popular debido a su sencillez es la media:

$$\hat{r}_{u,i} = \frac{1}{\#N_{u,i}} \sum_{n \in N_{u,i}} r_{n,i}$$

Donde $N_{u,i}$ representa el conjunto de *k* vecinos del usuario $u$ que votaron el item $i$.

La implementación de esta medida de agregación podríamos hacerla del siguiente modo:

In [264]:
def average_prediction_old(u, i, neighbors):
    acc = 0
    count = 0

    for n in neighbors:
        if n == None: break

        if ratings[n][i] != None:
            acc += ratings[n][i]
            count += 1

    if count > 0:
        prediction = acc / count
        return prediction
    else:
        return None


In [295]:
def average_prediction(u, i, neighbors):
    values = ratings_np[:,i][neighbors] 
    valid_values = values[~np.isnan(values)]
    if len(valid_values) == 0:
        return np.nan
    return valid_values.mean()

In [275]:
%timeit ratings_np[:,i][neighbors_corr]
%timeit ratings_np[neighbors_corr, i]

1.05 μs ± 38.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
3.15 μs ± 236 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [296]:
# Testing
i = 324
# En este caso no mejoramos el rendimiento, pero tratamos con tiempos muy pequeños
%timeit average_prediction_old(u, i, neighbors_corr)
%timeit average_prediction_old(u, i, neighbors_jmsd)
%timeit average_prediction(u, i, neighbors_corr)
%timeit average_prediction(u, i, neighbors_jmsd)

avgpred_corr = average_prediction(u, i, neighbors_corr)
avgpred_jmsd = average_prediction(u, i, neighbors_jmsd)
print(avgpred_corr)
print(avgpred_jmsd)

11.6 μs ± 1.58 μs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
14.7 μs ± 414 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
15.2 μs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
15.6 μs ± 360 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.0
1.75


In [351]:
import warnings
def average_predictions(u, neighbors):
    values = ratings_np[neighbors] 
    valids = ~np.isnan(values)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return values.mean(axis=0, where=valids)
average_predictions(5, neighbors_corr)#, average_prediction(5,0,neighbors_corr)

array([3.375, 3.   , 1.   , ...,   nan,   nan,   nan])

In [357]:
%timeit [average_prediction(5, i, neighbors_corr) for i in range(NUM_ITEMS)]
%timeit average_predictions(5, neighbors_corr)

15.3 ms ± 392 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
344 μs ± 27.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Una evolución simple de esta medida de agregación es la media ponderada, en la cual el voto de cada uno de los vecinos se pondera en función de su similaridad con el usuario sobre el que se está calculando la predicción:

$$\hat{r}_{u,i} = \frac{\sum_{n \in N_{u,i}} sim(u,n) \cdot r_{n,i}}{\sum_{n \in N_{u,i}} sim(u,n)} $$

Donde $N_{u,i}$ representa el conjunto de *k* vecinos del usuario $u$ que votaron el item $i$ y $sim(u,n)$ simboliza la similaridad entre el usuario $u$ y el vecino $n$.

In [288]:
def weighted_average_prediction_old(u, i, neighbors, similarities):
    num = 0
    den = 0

    for n in neighbors:
        if n == None: break

        if ratings[n][i] != None:
            num += similarities[n] * ratings[n][i]
            den += similarities[n]

    if den > 0:
        prediction = num / den
        return prediction
    else:
        return None

In [415]:
def weighted_average_prediction(u, i, neighbors, similarities):
    neighbors_ratings = ratings_np[:, i][neighbors]
    valid_ratings = ~np.isnan(neighbors_ratings)
    
    neighbors_ratings = neighbors_ratings[valid_ratings]
    if len(neighbors_ratings) == 0:
        return np.nan
    
    neighbors_similarities = similarities[neighbors][valid_ratings]
    return (neighbors_similarities * neighbors_ratings).sum() / neighbors_similarities.sum()

In [363]:
# Testing
%timeit -n 10_000 weighted_average_prediction_old(u, i, neighbors_corr, similarities_corr)
%timeit -n 10_000 weighted_average_prediction_old(u, i, neighbors_jmsd, similarities_jmsd)
%timeit -n 10_000 weighted_average_prediction(u, i, neighbors_corr, similarities_corr_np)
%timeit -n 10_000 weighted_average_prediction(u, i, neighbors_jmsd, similarities_jmsd_np)

wavgpred_corr = weighted_average_prediction(u, i, neighbors_corr, similarities_corr_np)
wavgpred_jmsd = weighted_average_prediction(u, i, neighbors_jmsd, similarities_jmsd_np)
print(wavgpred_corr)
print(wavgpred_jmsd)

13.7 μs ± 2.66 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
15 μs ± 2.95 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
17.3 μs ± 2.63 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
16.7 μs ± 2.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
1.0
1.7426994721014446


In [416]:
def weighted_average_predictions(u, neighbors, similarities):
    neighbors_ratings = ratings_np[neighbors]
    valids = ~np.isnan(neighbors_ratings)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return (neighbors_ratings * similarities[neighbors, np.newaxis]).sum(axis=0, where=valids) / np.broadcast_to(similarities[neighbors, np.newaxis], valids.shape).sum(axis=0, where=valids)

In [419]:
%timeit [weighted_average_prediction(u, i, neighbors_corr, similarities_corr_np) for i in range(NUM_ITEMS)]
%timeit weighted_average_predictions(u, neighbors_corr, similarities_corr_np)

16.6 ms ± 384 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
319 μs ± 42.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Utilizar la media o la media ponderada como medida de agregación tiene un problema, presupone que todos los usuarios tienen la misma percepción de la escala de votaciones prefijada. Sin embargo, sabemos que esto no es cierto. Existes determinados sesgos que hacen que los usuarios realicen votaciones haciendo una interpretación particular del sentido de su voto. Por ejemplo, existen usuarios más "generosos" con las votaciones que tienden a asignar siempre valoraciones altas y existen usuarios más "tacaños" con las votaciones que tienden a asignar siempre valoraciones más bajas. Que el primer usuario valore un item con 5 y el segundo usuario valore el mismo item con un 4 no quiere decir que al primero le haya gustado más el item. Cada usuario hace su propia interpretación de lo que significan los votos 4 y 5.

Para incluir este fenómeno dentro de las medidas de agregación, es frecuente agregar las votaciones de los k vecinos mediante la **desviación respecto a la media** (*deviation from mean*):

$$\hat{r}_{u,i} = \bar{r}_{u} + \frac{ \sum_{n \in N_{u,i}} r_{n,i} - \bar{r}_n }{\#N_{u,i}}$$

Donde $N_{u,i}$ representa el conjunto de *k* vecinos del usuario $u$ que votaron el item $i$, $\bar{r}_u$ representa la media de votos del usuario $u$ y $\bar{r}_n$ representa la media de votos del usuario $n$.

La implementación de esta medida de agregación podríamos hacerla del siguiente modo:

In [309]:
def deviation_from_mean_prediction_old(u, i, neighbors):
  acc = 0
  count = 0

  for n in neighbors:
    if n == None: break

    if ratings[n][i] != None:
      avg_n = rating_average_old(n)
      acc += ratings[n][i] - avg_n
      count += 1

  if count > 0:
    avg_u = rating_average_old(u)
    prediction = avg_u + acc / count
    return prediction
  else:
    return None

In [310]:
# Testing
dfmpred_corr = deviation_from_mean_prediction_old(u, i, neighbors_corr)
dfmpred_jmsd = deviation_from_mean_prediction_old(u, i, neighbors_jmsd)
print(dfmpred_corr)
print(dfmpred_jmsd)

1.4452214452214451
2.3675879675879674


In [442]:
def deviation_from_mean_prediction (u, i, neighbors):
    mean_u = ratings_np[u].mean(where=~np.isnan(ratings_np[u]))
    neighbors_ratings = ratings_np[neighbors]
    mean_neighbors = neighbors_ratings.mean(axis=1, where=~np.isnan(neighbors_ratings))
    valid_neighbors = ~np.isnan(neighbors_ratings[:,i])
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return mean_u + (neighbors_ratings[:,i][valid_neighbors] - mean_neighbors[valid_neighbors]).mean()

In [319]:
# Testing
%timeit deviation_from_mean_prediction_old(u, i, neighbors_corr)
%timeit deviation_from_mean_prediction_old(u, i, neighbors_jmsd)
%timeit deviation_from_mean_prediction(u, i, neighbors_corr)
%timeit deviation_from_mean_prediction(u, i, neighbors_jmsd)
print()

dfmpred_corr = deviation_from_mean_prediction(u, i, neighbors_corr)
dfmpred_jmsd = deviation_from_mean_prediction(u, i, neighbors_jmsd)
print(dfmpred_corr)
print(dfmpred_jmsd)

620 μs ± 25.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.6 ms ± 48.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
341 μs ± 49.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
337 μs ± 35.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

1.4452214452214451
2.3675879675879674


In [443]:
def deviation_from_mean_predictions(u, neighbors):
    mean_u = ratings_np[u].mean(where=~np.isnan(ratings_np[u]))
    neighbors_ratings = ratings_np[neighbors]
    mean_neighbors = neighbors_ratings.mean(axis=1, where=~np.isnan(neighbors_ratings))
    valid_neighbors = ~np.isnan(neighbors_ratings)

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return mean_u + (neighbors_ratings - mean_neighbors[:, np.newaxis]).mean(axis=0, where=valid_neighbors)

In [445]:
%timeit [deviation_from_mean_prediction(u, i, neighbors_corr) for i in range(NUM_ITEMS)]
%timeit deviation_from_mean_predictions(u, neighbors_corr)

627 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
706 μs ± 46.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Cálculo de las recomendaciones

El cálculo de las recomendaciones, por lo general, simplemente implica seleccionar los *N* items con una predicción más alta. Por ejemplo, si quisiéramos recomendar *N = 3* items a un usuario que tuviera las siguientes predicciones:

|   	| i1 	| i2 	| i3 	| i4 	| i5 	| i6 	| i7 	| i8 	| i9 	| i10 	|
|:-:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|:--:	|-----	|
| u 	|   	|  2,9 	|    	|  4,7 	|  5,0 	|    	|  1,2 	|    	|   	|  3,1 	|

Se le recomendarían a dicho usuario los items *i5*, *i4* e *i10*.

En algunas ocasiones, es posible establecer filtros para acotar los items a recomendar. Por ejemplo: en un sistema de recomendación de restaurantes, es posible filtrar aquellos items que se encuentren a demasiada distancia del usuarios; en un sistema de recomendación de libros, el usuario puede filtrar el idioma o el género literario del libro; en una web de comercio electrónico es posible realizar recomendaciones sobre una categoría concreta...


In [324]:
N = 5

In [447]:
def get_recommendations_old(predictions):
    recommendations = [None for _ in range(N)]

    for n in range(N):

        max_value = 0
        item = None

        for i, value in enumerate(predictions):
            if i not in recommendations and value != None and value > max_value:
                max_value = value
                item = i

        recommendations[n] = item

    return recommendations

In [448]:
# Testing
avg_predictions_corr = [None if ratings[u][i] != None else average_prediction_old(u, i, neighbors_corr) for i in range(NUM_ITEMS)]
avg_predictions_jmsd = [None if ratings[u][i] != None else average_prediction_old(u, i, neighbors_jmsd) for i in range(NUM_ITEMS)]
wavg_predictions_corr = [None if ratings[u][i] != None else weighted_average_prediction_old(u, i, neighbors_corr, similarities_corr) for i in range(NUM_ITEMS)]
wavg_predictions_jmsd = [None if ratings[u][i] != None else weighted_average_prediction_old(u, i, neighbors_jmsd, similarities_jmsd) for i in range(NUM_ITEMS)]
dfm_predictions_corr = [None if ratings[u][i] != None else deviation_from_mean_prediction_old(u, i, neighbors_corr) for i in range(NUM_ITEMS)]
dfm_predictions_jmsd = [None if ratings[u][i] != None else deviation_from_mean_prediction_old(u, i, neighbors_jmsd) for i in range(NUM_ITEMS)]
recommendations_avg_corr = get_recommendations_old(avg_predictions_corr)
recommendations_avg_jmsd = get_recommendations_old(avg_predictions_jmsd)
recommendations_wavg_corr = get_recommendations_old(wavg_predictions_corr)
recommendations_wavg_jmsd = get_recommendations_old(wavg_predictions_jmsd)
recommendations_dfm_corr = get_recommendations_old(dfm_predictions_corr)
recommendations_dfm_jmsd = get_recommendations_old(dfm_predictions_jmsd)
print(recommendations_avg_corr)
print(recommendations_avg_jmsd)
print(recommendations_wavg_corr)
print(recommendations_wavg_jmsd)
print(recommendations_dfm_corr)
print(recommendations_dfm_jmsd)

[45, 52, 64, 130, 142]
[10, 21, 22, 55, 63]
[142, 273, 282, 398, 719]
[10, 21, 22, 55, 63]
[284, 286, 407, 876, 282]
[1136, 10, 21, 22, 55]


In [573]:
def get_recommendations(predictions):
    # Hay que tomar una decisión entre que devuelva recomendaciones nulas o que 
    # devuelva un número indeterminado de recomendaciones, opto por la segunda 
    # por eficiencia
    if len(predictions) < N:
        # Not enough recommendations
        recommendations = np.arange(len(predictions))
        recommendations = recommendations[~np.isnan(predictions[recommendations])]
        return recommendations
    

    predictions = np.where(~np.isnan(predictions), predictions, 0)
    recommendations = np.argpartition(predictions, -N)[-N:] # Más eficiente que ordenar
    return recommendations[predictions[recommendations] != 0]

In [470]:
avg_predictions_corr_np = average_prediction(u, i, neighbors_corr)
avg_predictions_corr_np

1.0

In [478]:
avg_predictions_corr_np = average_predictions(u, neighbors_corr)
avg_predictions_jmsd_np = average_predictions(u, neighbors_jmsd)
wavg_predictions_corr_np = weighted_average_predictions(u, neighbors_corr, similarities_corr_np)
wavg_predictions_jmsd_np = weighted_average_predictions(u, neighbors_jmsd, similarities_jmsd_np)
dfm_predictions_corr_np = deviation_from_mean_predictions(u, neighbors_corr)
dfm_predictions_jmsd_np = deviation_from_mean_predictions(u, neighbors_jmsd)

In [None]:
%timeit -n 100 get_recommendations_old(avg_predictions_corr)
%timeit -n 100 get_recommendations_old(avg_predictions_jmsd)
%timeit -n 100 get_recommendations_old(wavg_predictions_corr)
%timeit -n 100 get_recommendations_old(wavg_predictions_jmsd)
%timeit -n 100 get_recommendations_old(dfm_predictions_corr)
%timeit -n 100 get_recommendations_old(dfm_predictions_jmsd)
print()
%timeit -n 1000 get_recomendations(avg_predictions_corr_np)
%timeit -n 1000 get_recomendations(avg_predictions_jmsd_np)
%timeit -n 1000 get_recomendations(wavg_predictions_corr_np)
%timeit -n 1000 get_recomendations(wavg_predictions_jmsd_np)
%timeit -n 1000 get_recomendations(dfm_predictions_corr_np)
%timeit -n 1000 get_recomendations(dfm_predictions_jmsd_np)

2.6 ms ± 359 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.44 ms ± 215 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.72 ms ± 154 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.32 ms ± 340 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.05 ms ± 70 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.95 ms ± 69.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

23.4 μs ± 6.43 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.6 μs ± 6.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22.7 μs ± 6.42 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.6 μs ± 7.18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.9 μs ± 6.95 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
24.6 μs ± 6.94 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [None]:
recommendations_avg_corr_np = get_recommendations(avg_predictions_corr_np)
recommendations_avg_jmsd_np = get_recommendations(avg_predictions_jmsd_np)
recommendations_wavg_corr_np = get_recommendations(wavg_predictions_corr_np)
recommendations_wavg_jmsd_np = get_recommendations(wavg_predictions_jmsd_np)
recommendations_dfm_corr_np = get_recommendations(dfm_predictions_corr_np)
recommendations_dfm_jmsd_np = get_recommendations(dfm_predictions_jmsd_np)
print(recommendations_avg_corr_np)
print(recommendations_avg_jmsd_np)
print(recommendations_wavg_corr_np)
print(recommendations_wavg_jmsd_np)
print(recommendations_dfm_corr_np)
print(recommendations_dfm_jmsd_np)

[903 671 273 480 142]
[ 22 202  21  55 315]
[142 719 282 273 398]
[1232   22  466   21   55]
[282 284 286 407 876]
[  55  214   63 1136  171]


In [None]:
dfm_predictions_jmsd_np[171], dfm_predictions_jmsd[171], dfm_predictions_jmsd_np[22], dfm_predictions_jmsd[22] # Hay predicciones iguales por eso cambian los resultados

(5.099971189858831, 5.099971189858831, 5.099971189858831, 5.099971189858831)

## Ejemplo de ejecución: cálculo del MAE

En esta sección vamos a mostrar el ejemplo completo de cómo calcular el error medio absoluto (MAE) de las predicciones realizadas por el método de los *k*-vecinos.

Para ello, lo primero que debemos hacer es calcular las predicciones para todos los items que haya recibido una votación de test:

In [486]:
def has_test_ratings_old(u):
    for i in range(NUM_ITEMS):
        if test_ratings[u][i] != None:
            return True
    return False

In [546]:
def has_test_ratings(u):
    return (~np.isnan(test_ratings_np[u])).any()

In [550]:
%timeit -n 1_000 has_test_ratings_old(5)
%timeit -n 1_000 has_test_ratings_old(113)
%timeit -n 10_000 has_test_ratings(5)
%timeit -n 10_000 has_test_ratings(113)

195 μs ± 20.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
200 μs ± 27.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
7.66 μs ± 1.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
8.18 μs ± 2.32 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [596]:
avg_predictions_corr = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]
avg_predictions_jmsd = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]
wavg_predictions_corr = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]
wavg_predictions_jmsd = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]
dfm_predictions_corr = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]
dfm_predictions_jmsd = [[None for _ in range(NUM_ITEMS)] for _ in range(NUM_USERS)]

# Rellenamos la matriz de predicciones
for u in range(NUM_USERS):
  if has_test_ratings_old(u):

    # Calcular similaridades
    similarities_corr = [None if u == v else correlation_similarity_old(u, v) for v in range(NUM_USERS)]
    similarities_jmsd = [None if u == v else jmsd_similarity_old(u, v) for v in range(NUM_USERS)]

    # Calcular vecinos
    neighbors_corr = get_neighbors_old(u, similarities_corr)
    neighbors_jmsd = get_neighbors_old(u, similarities_jmsd)

    # Calcular predicciones sobre los items de test votados por el usuario
    for i in range(NUM_ITEMS):
      if test_ratings[u][i] != None:
        avg_predictions_corr[u][i] = average_prediction_old(u, i, neighbors_corr)
        avg_predictions_jmsd[u][i] = average_prediction_old(u, i, neighbors_jmsd)
        wavg_predictions_corr[u][i] = weighted_average_prediction_old(u, i, neighbors_corr, similarities_corr)
        wavg_predictions_jmsd[u][i] = weighted_average_prediction_old(u, i, neighbors_jmsd, similarities_jmsd)
        dfm_predictions_corr[u][i] = deviation_from_mean_prediction_old(u, i, neighbors_corr)
        dfm_predictions_jmsd[u][i] = deviation_from_mean_prediction_old(u, i, neighbors_jmsd)

In [597]:
avg_predictions_corr_np = np.zeros_like(ratings_np)
avg_predictions_jmsd_np = np.zeros_like(ratings_np)
wavg_predictions_corr_np = np.zeros_like(ratings_np)
wavg_predictions_jmsd_np = np.zeros_like(ratings_np)
dfm_predictions_corr_np = np.zeros_like(ratings_np)
dfm_predictions_jmsd_np = np.zeros_like(ratings_np)

for u in range(NUM_USERS):
    simmilarities_corr_np = correlation_similarities(u)
    similarities_jmsd_np = jmsd_similarities(u)
    
    neighbors_corr = get_neighbors(u, similarities_corr_np)
    neighbors_jmsd = get_neighbors(u, similarities_jmsd_np)
    
    avg_predictions_corr_np[u] = average_predictions(u, neighbors_corr)
    avg_predictions_jmsd_np[u] = average_predictions(u, neighbors_jmsd)
    wavg_predictions_corr_np[u] = weighted_average_predictions(u, neighbors_corr, similarities_corr_np)
    wavg_predictions_jmsd_np[u] = weighted_average_predictions(u, neighbors_jmsd, similarities_jmsd_np)
    dfm_predictions_corr_np[u] = deviation_from_mean_predictions(u, neighbors_corr)
    dfm_predictions_jmsd_np[u] = deviation_from_mean_predictions(u, neighbors_jmsd)
    

Y, a continuación, calculamos el MAE:

In [540]:
def get_user_mae_old(u, predictions):
    mae = 0
    count = 0

    for i in range(NUM_ITEMS):
        if test_ratings[u][i] != None and predictions[u][i] != None:
            mae += abs(test_ratings[u][i] - predictions[u][i])
            count += 1

    if count > 0:
        return mae / count
    else:
        return None

In [543]:
def get_mae_old(predictions):
    mae = 0
    count = 0

    for u in range(NUM_USERS):
        if has_test_ratings_old(u):
            user_mae = get_user_mae_old(u, predictions)

            if user_mae != None:
                mae += user_mae
                count += 1


    if count > 0:
        return mae / count
    else:
        return None

In [563]:
mae_avg_corr = get_mae_old(avg_predictions_corr)
mae_avg_jmsd = get_mae_old(avg_predictions_jmsd)
mae_wavg_corr = get_mae_old(wavg_predictions_corr)
mae_wavg_jmsd = get_mae_old(wavg_predictions_jmsd)
mae_dfm_corr = get_mae_old(dfm_predictions_corr)
mae_dfm_jmsd = get_mae_old(dfm_predictions_jmsd)
print("MAE Avg Prediction with Corr Similarity = " + str(mae_avg_corr))
print("MAE Avg Prediction with JMSD Similarity = " + str(mae_avg_jmsd))
print("MAE Weighted Avg Prediction with Corr Similarity = " + str(mae_wavg_corr))
print("MAE Weighted Avg Prediction with JMSD Similarity = " + str(mae_wavg_jmsd))
print("MAE Dev from Mean Prediction with Corr Similarity = " + str(mae_dfm_corr))
print("MAE Dev from Mean Prediction with JMSD Similarity = " + str(mae_dfm_jmsd))

MAE Avg Prediction with Corr Similarity = 1.00923070340394
MAE Avg Prediction with JMSD Similarity = 0.8951441319983335
MAE Weighted Avg Prediction with Corr Similarity = 1.0097877459011548
MAE Weighted Avg Prediction with JMSD Similarity = 0.8955826103175067
MAE Dev from Mean Prediction with Corr Similarity = 1.008429618748919
MAE Dev from Mean Prediction with JMSD Similarity = 0.828759929568012


In [565]:
def get_mae(predictions):
    ae = np.abs(test_ratings_np - predictions)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
    
        ae_per_user = ae.mean(axis=1, where=~np.isnan(test_ratings_np))
        return ae_per_user.mean(where=~np.isnan(ae_per_user))

In [566]:
# Es normal que las predicciones varíen, vimos que cuando se eligen los vecinos 
# hay varios igual de semejantes
mae_avg_corr_np = get_mae(avg_predictions_corr_np)
mae_avg_jmsd_np = get_mae(avg_predictions_jmsd_np)
mae_wavg_corr_np = get_mae(wavg_predictions_corr_np)
mae_wavg_jmsd_np = get_mae(wavg_predictions_jmsd_np)
mae_dfm_corr_np = get_mae(dfm_predictions_corr_np)
mae_dfm_jmsd_np = get_mae(dfm_predictions_jmsd_np)
print("MAE Avg Prediction with Corr Similarity = " + str(mae_avg_corr_np))
print("MAE Avg Prediction with JMSD Similarity = " + str(mae_avg_jmsd_np))
print("MAE Weighted Avg Prediction with Corr Similarity = " + str(mae_wavg_corr_np))
print("MAE Weighted Avg Prediction with JMSD Similarity = " + str(mae_wavg_jmsd_np))
print("MAE Dev from Mean Prediction with Corr Similarity = " + str(mae_dfm_corr_np))
print("MAE Dev from Mean Prediction with JMSD Similarity = " + str(mae_dfm_jmsd_np))

MAE Avg Prediction with Corr Similarity = 0.9568764172335601
MAE Avg Prediction with JMSD Similarity = 0.84855730143673
MAE Weighted Avg Prediction with Corr Similarity = 0.9632785734330227
MAE Weighted Avg Prediction with JMSD Similarity = 0.8490995875284602
MAE Dev from Mean Prediction with Corr Similarity = 0.9148744198777381
MAE Dev from Mean Prediction with JMSD Similarity = 0.7498648296898984


# Cálculo del resto de medidas

Calcula el resto de medidas (RMSE, Precision, Recall, F1 y nDCG) y compara las distintas combinaciones de funciones de similitud y de agregación para saber cuál combinación proporciona mejores recomendaciones.

In [578]:
def get_rmse(predictions):
    se = (test_ratings_np - predictions)**2
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
    
        rmse_per_user = np.sqrt(se.mean(axis=1, where=~np.isnan(test_ratings_np)))
        return rmse_per_user.mean(where=~np.isnan(rmse_per_user))

In [577]:
theta = 4

In [587]:
def get_user_precision(u, predictions):
    recommendations = get_recommendations(predictions[u])
    recommendation_ratings = test_ratings_np[:,recommendations]
    recommendation_ratings = recommendation_ratings[~np.isnan(recommendation_ratings)]
    if len(recommendation_ratings) == 0:
        return np.nan
    return (recommendation_ratings >= theta).mean()
    

In [592]:
def get_precision(predictions):
    precisions = np.array([get_user_precision(u, predictions) for u in range(NUM_USERS)])
    valid_precisions = precisions[~np.isnan(precisions)]
    if len(valid_precisions) == 0:
        return np.nan
    return valid_precisions.mean()

In [594]:
print("Precision = " + str(get_precision(avg_predictions_corr_np)))
print("Precision = " + str(get_precision(avg_predictions_jmsd_np)))
print("Precision = " + str(get_precision(wavg_predictions_corr_np)))
print("Precision = " + str(get_precision(wavg_predictions_jmsd_np)))
print("Precision = " + str(get_precision(dfm_predictions_corr_np)))
print("Precision = " + str(get_precision(dfm_predictions_jmsd_np)))

Precision = 0.5714285714285712
Precision = 0.5967796441311218
Precision = nan
Precision = 0.6146886225736182
Precision = 0.9166666666666665
Precision = 0.5536519215926838


In [None]:
def get_user_recall (u, recommendations):

  # TODO: Complete la función get_user_recall para devolver el Recall
  # de las recomendaciones realizadas al usuario u. Devuelva None si el usuario
  # no tiene recomendaciones de test.

In [None]:
def get_recall (recommendations):

  # TODO: Complete la función get_recall para devolver el Recall de las
  # recomendaciones realizadas sobre el conjunto de test. Devuelva None si no
  # se ha podido realizar ninguna recomendación.

In [None]:
def get_user_f1 (u, recommendations):

  # TODO: Complete la función get_user_f1 para devolver la función
  # harmónica (F1) de las recomendaciones realizadas al usuario u. Devuelva None
  # si el usuario no tiene recomendaciones de test.

In [None]:
def get_f1 (recommendations):

  # TODO: Complete la función get_f1 para devolver la función harmónica (F1)
  # de las recomendaciones realizadas sobre el conjunto de test. Devuelva None
  # si no se ha podido realizar ninguna recomendación.

In [None]:
def get_user_ndcg (u, recommendations):

  # TODO: Complete la función get_user_ndcg para devolver el descuento de
  # ganancia cumulativa normalizada (nDCG) de las recomendaciones realizadas al
  # usuario u. Devuelva None si el usuario no tiene recomendaciones de test.

In [None]:
def get_ndcg (recommendations):

  # TODO: Complete la función get_ndcg para devolver el descuento de
  # ganancia cumulativa normalizada (nDCG) de las recomendaciones realizadas
  # sobre el conjunto de test. Devuelva None si no se ha podido realizar ninguna
  # recomendación.

## Referencias

Bobadilla, J., Serradilla, F., & Bernal, J. (2010). **A new collaborative filtering metric that improves the behavior of recommender systems**. Knowledge-Based Systems, 23(6), 520-528.