# Recommender systems

In this notebook we will get familiar with some recommender system algorithms and a few metrics for evaluating them.

In [27]:
import os
import shutil
import zipfile
from typing import List

import requests
import numpy as np
import pandas as pd

from scipy.sparse import coo_matrix
from urllib import request

## Reading and loading the dataset

The first step is to load a dataset; in this case we will use [movielens](https://grouplens.org/datasets/movielens/100k/).

In [2]:
URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
ZIP_PATH = "ml-100k.zip"
CONTENTS_DIR = "ml-100k"

The following functions download a .zip file with the data, load the `"u.data"` file into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), and convert that data into an interactions matrix.

In [9]:
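def maybe_download(url, fpath):
    # skip the download when the file is already on disk
    if os.path.exists(fpath):
        print("{} already exists, skipping download".format(fpath))
        return
    print("...downloading to {}".format(fpath))
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    with open(fpath, "wb") as file:
        file.write(response.content)
    print("download complete!")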


def zipped_csv_to_df(zip_fpath, csv_fpath, **kwargs):
    with zipfile.ZipFile(zip_fpath) as z:
        
        with z.open(csv_fpath) as f:
            df = pd.read_csv(f, **kwargs)

    return df

def get_items_name(contents_dir, zip_path):
    csv_fpath = contents_dir + "/u.item"
    
    names = ("movie id,movie title,release date,video release date,"
            +"IMDb URL,unknown,Action,Adventure,Animation,Children's,"
            +"Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,"
            +"Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western")
            
    kwargs = dict(
        sep='|', header=None,
        names=names.split(','), encoding='latin-1'
    )
        
    df = zipped_csv_to_df(zip_path, csv_fpath, **kwargs)
    return df['movie title'].values

def convert_ratings_df_to_matrix(
        df, shape, columns="user id,item id,rating".split(',')):
    data = df[columns].values
    users = data[:, 0] - 1 # correct for zero index
    items = data[:, 1] - 1 # correct for zero index
    values = data[:, 2]
    return coo_matrix((values, (users, items)), shape=shape).toarray()
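To make the conversion step concrete, here is a minimal sketch on a toy DataFrame. The data and the names `toy_df` and `toy_matrix` are made up for illustration; the logic mirrors `convert_ratings_df_to_matrix`, assuming 1-based user and item ids as in `u.data`.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# toy ratings in the same long format as u.data (ids are 1-based)
toy_df = pd.DataFrame({
    "user id": [1, 1, 2, 3],
    "item id": [1, 3, 2, 3],
    "rating": [5, 3, 4, 1],
})

users = toy_df["user id"].values - 1  # shift to 0-based row indices
items = toy_df["item id"].values - 1  # shift to 0-based column indices
values = toy_df["rating"].values

# rows are users, columns are items, missing ratings become 0
toy_matrix = coo_matrix((values, (users, items)), shape=(3, 3)).toarray()
print(toy_matrix)
```

Each `(user, item, rating)` triple lands in one cell, and every cell without a rating stays at 0.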

To download the dataset we only need to call the download function:

In [10]:
maybe_download(URL, ZIP_PATH)

...downloading to ml-100k.zip
download complete!


Next we can look at a sample of the dataset we are going to work with:

In [11]:
csv_fpath = CONTENTS_DIR + "/u.data"
df = zipped_csv_to_df(
    ZIP_PATH, csv_fpath, sep='\t',
    header=None, names="user id,item id,rating,timestamp".split(',')
)
df.head(7)

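|   | user id | item id | rating | timestamp |
|---|---------|---------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
| 5 | 298 | 474 | 4 | 884182806 |
| 6 | 115 | 265 | 2 | 881171488 |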


Notice that the dataset is not laid out as a matrix. This is for efficiency: converting it into a matrix yields a huge matrix filled mostly with 0s. For this exercise, however, you can use whichever representation is most comfortable to work with. To extract the full matrix, run the following:

In [12]:
n_users = df['user id'].unique().shape[0]
n_items = df['item id'].unique().shape[0]
interactions = convert_ratings_df_to_matrix(df, shape=(n_users, n_items)).astype(np.float64)
interactions

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])
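With 100,000 ratings spread over 943 × 1682 cells, only about 6.3% of the MovieLens matrix is non-zero. The sketch below shows one way to measure that density; it runs on a made-up `toy_interactions` array so that it is self-contained, but the same two lines would work on `interactions`.

```python
import numpy as np

# toy stand-in for the interactions matrix (0 means "no rating")
toy_interactions = np.array([
    [5., 3., 0., 0.],
    [0., 0., 0., 4.],
    [0., 0., 0., 0.],
])

n_observed = np.count_nonzero(toy_interactions)  # cells that hold a rating
density = n_observed / toy_interactions.size     # fraction of filled cells
print(f"{n_observed} observed ratings, density = {density:.2%}")
```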

In [13]:
items_name = get_items_name(CONTENTS_DIR, ZIP_PATH)
items_name

array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
       'Sliding Doors (1998)', 'You So Crazy (1994)',
       'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object)

## Performance evaluation

As in any supervised learning process, it is important to have a training set and a validation set. Implement the function below, which makes that split possible.

In [14]:
def train_test_split(interactions, k=5, n=2):
    """Split the interactions matrix.

    It is important to remember that this is not about removing rows at
    random, because that would remove whole users; instead we want to
    remove some of the interactions of those users with the items.

    This function checks the number of ratings of each user in the
    interactions matrix and takes it into account to avoid removing
    r_ui values for users with fewer than n*k interactions.

    Args:
        interactions (np.ndarray): contains the data to be split.
        k (int): number of interactions moved to the test set per user.
            It should be at least as large as the k of the precision
            at k that you want to compute. Think of how many items you
            want to recommend for every user. Defaults to 5.
        n (int): multiplier such that a user must have at least n*k
            interactions before we move any of them from the original
            interactions to the test set. Defaults to 2.

    Returns:
        train (np.ndarray): the training set.
        test (np.ndarray): the test set.
        
    """
    # reserve the return matrices
    train = interactions.copy()
    test = np.zeros_like(train)

    # store all user indices from which we take interactions to the test set
    user_test_indices = []

    for uid in range(train.shape[0]):
        
        # get indices (item indices) of the interactions of this user
        user_interactions_indices = 
        
        # take k interactions only if that user has at least n*k interactions
        if len(user_interactions_indices) >= int(n*k):

            # pick k interactions to move to the test set
            test_interactions_indices = 
            
            # the train set should be 0 in all places the test set is non zero
            train[uid, test_interactions_indices] = 

            # fill the values of the test set 
            values = 
            test[uid, test_interactions_indices] = 

               
    return train, test
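For reference, the commented steps above could be filled in roughly as follows. This is only one possible sketch, not the notebook's official solution: the name `train_test_split_sketch`, the `seed` argument, and the choice of uniform random sampling without replacement are all assumptions added here.

```python
import numpy as np

def train_test_split_sketch(interactions, k=5, n=2, seed=0):
    """Hypothetical fill-in: move k random ratings per eligible user to test."""
    rng = np.random.default_rng(seed)  # seed is an added assumption
    train = interactions.copy()
    test = np.zeros_like(train)
    for uid in range(train.shape[0]):
        # item indices this user has rated
        rated = np.nonzero(train[uid])[0]
        if len(rated) >= int(n * k):
            # pick k distinct rated items to hold out
            held_out = rng.choice(rated, size=k, replace=False)
            test[uid, held_out] = interactions[uid, held_out]
            train[uid, held_out] = 0
    return train, test

toy = np.zeros((2, 12))
toy[0, :10] = 4.0  # 10 ratings: eligible with k=5, n=2
toy[1, :3] = 5.0   # 3 ratings: left untouched
toy_train, toy_test = train_test_split_sketch(toy)
print(np.count_nonzero(toy_test, axis=1))
```

In this sketch every held-out rating disappears from the training matrix, so the two matrices never overlap and their sum reproduces the original.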

In [15]:
train, test = train_test_split(interactions)

In [19]:
def msg(name, data):
    return f"{name} data has  shape {data.shape} and {np.sum(data != 0)} interactions"
print(msg(name="training", data=train))
print(msg(name="test", data=test))

training data has  shape (943, 1682) and 95285 interactions
test data has  shape (943, 1682) and 4715 interactions


The following should not raise an error if the function is implemented correctly:

In [20]:
# train and test must not share any non-zero cell
assert np.count_nonzero(train * test) == 0

Now it is time to define the metrics we will work with. You only need to implement the `precision_at_k` and `recall_at_k` functions. How you do it internally does not matter, so you can ignore the commented hints and the functions whose names start with `individual_`; they are there in case you want to use them as a guide.

In [21]:
def individual_precision_at_k(real_scores, pred_scores, k=5, threshold=None):
    """
    Computes the precision at k using the real and predicted scores for
    a particular user.

    Precision at k is the proportion of recommended items in the top-k
    set that are relevant.

    Args:
        real_scores (np.ndarray): real scores as a
            1d array.
        pred_scores (np.ndarray): predicted scores as a
            1d array.
        k (int): compute based on the top k recommendations.
        threshold (float): the value of the score from which an item is
            considered relevant.

    Returns:
        precision_at_k (float)

    """
    # if a threshold is not provided, assume it is the mean value of the
    # real scores
    if threshold is None:
        threshold = np.mean(real_scores)

    # get the relevant items from the scores, use np.where
    relevant_items = 

    # get the top k recommendations (e.g. with np.argsort on pred_scores)
    recommended_items = 
    
    # find the intersection of the relevant items with the recommended items
    recommended_relevant_items = 
    
    # compute the precision at k
    precision_at_k = 
    
    return precision_at_k
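As a worked illustration of the definition, the sketch below computes precision at k for a single user. The function name `individual_precision_at_k_sketch` and the toy scores are hypothetical, and treating a score at or above the threshold as relevant is an assumption about how "from which" should be read.

```python
import numpy as np

def individual_precision_at_k_sketch(real_scores, pred_scores, k=5, threshold=None):
    real_scores = np.asarray(real_scores, dtype=float)
    pred_scores = np.asarray(pred_scores, dtype=float)
    if threshold is None:
        threshold = np.mean(real_scores)
    # items whose real score reaches the threshold (assumed: >=)
    relevant_items = np.where(real_scores >= threshold)[0]
    # indices of the k highest predicted scores
    recommended_items = np.argsort(-pred_scores, kind="stable")[:k]
    hits = np.intersect1d(relevant_items, recommended_items)
    # fraction of the recommended items that are relevant
    return len(hits) / len(recommended_items)

real = [5, 0, 4, 0, 1]
pred = [0.9, 0.8, 0.1, 0.2, 0.7]
toy_precision = individual_precision_at_k_sketch(real, pred, k=2, threshold=3)
print(toy_precision)  # 0.5
```

Here the relevant items are 0 and 2 and the top-2 recommendations are 0 and 1, so one of the two recommendations is a hit.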

In [22]:
def precision_at_k(real_scores, pred_scores, k=5, threshold=None):
    """Average precision at k over all users.
    
    Args:
        real_scores (np.ndarray): real scores as a
            2d array of shape (n_users, n_items).
        pred_scores (np.ndarray): predicted scores as a
            2d array of shape (n_users, n_items).
        k (int): compute based on the top k recommendations.
        threshold (float): the value of the score from which an item is
            considered relevant.
    Returns:
        average precision at k over all users (float).

    """
    # if a threshold is not provided, assume it is the mean value of the
    # real scores
    if threshold is None:
        threshold = np.mean(real_scores)
        
    # initialize the total precision to 0
    total_precision = 0
    for uid in range(real_scores.shape[0]):
        
        # take the real scores for this user
        user_real_scores = 
        
        # take the predicted scores for this user
        user_pred_scores = 
        
        # update total_precision based on the value for this user
        total_precision += 

    # get the average precision
    avg_precision = 
    return avg_precision

In [23]:
def individual_recall_at_k(real_scores, pred_scores, k=5, threshold=None):
    """
    Computes the recall at k using the real and predicted scores for
    a particular user.

    Recall at k is the proportion of relevant items found in the top-k
    recommendations.

    Args:
        real_scores (sequence or np.ndarray): real scores as a
            1d array.
        pred_scores (sequence or np.ndarray): predicted scores as a
            1d array.
        k (int): compute based on the top k recommendations.
        threshold (float): the value of the score from which an item is
            considered relevant.

    Returns:
        recall_at_k (float)

    """
    # if a threshold is not provided, assume it is the mean value of the
    # real scores
    if threshold is None:
        threshold = np.mean(real_scores)

    # get the relevant items from the scores, use np.where
    relevant_items = 

    # get the top k recommendations
    recommended_items = 
    
    recommended_relevant_items = 

    recall_at_k = 
    return recall_at_k
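Recall at k differs from precision at k only in its denominator: the number of relevant items rather than the number of recommendations. A minimal single-user sketch follows; the name `individual_recall_at_k_sketch`, the toy data, and the choice to return 0.0 when a user has no relevant items are assumptions.

```python
import numpy as np

def individual_recall_at_k_sketch(real_scores, pred_scores, k=5, threshold=None):
    real_scores = np.asarray(real_scores, dtype=float)
    pred_scores = np.asarray(pred_scores, dtype=float)
    if threshold is None:
        threshold = np.mean(real_scores)
    relevant_items = np.where(real_scores >= threshold)[0]
    if len(relevant_items) == 0:
        return 0.0  # assumed convention: nothing to retrieve
    recommended_items = np.argsort(-pred_scores, kind="stable")[:k]
    hits = np.intersect1d(relevant_items, recommended_items)
    # fraction of the relevant items that were recommended
    return len(hits) / len(relevant_items)

real = [5, 4, 0, 3, 0, 5]
pred = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3]
toy_recall = individual_recall_at_k_sketch(real, pred, k=3, threshold=3)
print(toy_recall)  # 0.5
```

Four items are relevant (0, 1, 3, 5) and the top-3 recommendations (1, 2, 3) recover two of them.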

In [24]:
def recall_at_k(real_scores, pred_scores, k=5, threshold=None):
    """Average recall at k over all users.
    
    Args:
        real_scores (np.ndarray): real scores as a
            2d array of shape (n_users, n_items).
        pred_scores (sequence or np.ndarray): predicted scores as a
            2d array of shape (n_users, n_items).
        k (int): compute based on the top k recommendations.
        threshold (float): the value of the score from which an item is
            considered relevant.
    Returns:
        average recall at k over all users (float).

    """
    # if a threshold is not provided, assume it is the mean value of the
    # real scores
    if threshold is None:
        threshold = np.mean(real_scores)
        
    # initialize the total recall to 0
    total_recall = 0
    for uid in range(real_scores.shape[0]):
        
        # take the real scores for this user
        user_real_scores = 
        
        # take the predicted scores for this user
        user_pred_scores = 

        # update the total recall
        total_recall += 

    # compute the average recall
    avg_recall = 
    
    return avg_recall

## Exploring the results we get with several methods

### Baseline

The first step in any Machine Learning project is to have a baseline: a simple method with low computational cost. It solves our problem, even if the results are very poor. The idea is to always have a reference point against which to compare the improvements we make to our algorithms. Compute the metrics described above for this baseline.

In [None]:
raise NotImplementedError
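One common choice of baseline, sketched here as an assumption rather than as the expected answer, is to score every item by its mean observed rating and recommend the same ranking to everyone. The function name `popularity_baseline_scores` and the toy matrix are made up; masking already-rated items with `-inf` is a design choice so they are never recommended again.

```python
import numpy as np

def popularity_baseline_scores(train):
    """Score each item by its mean observed rating, identical for all users."""
    train = np.asarray(train, dtype=float)
    rated = train != 0
    counts = rated.sum(axis=0)
    sums = train.sum(axis=0)
    # mean over observed ratings only; items without ratings get 0
    item_means = np.divide(sums, counts,
                           out=np.zeros(train.shape[1]), where=counts > 0)
    scores = np.tile(item_means, (train.shape[0], 1))
    scores[rated] = -np.inf  # do not recommend what the user already rated
    return scores

toy_train = np.array([
    [5., 0., 0.],
    [4., 2., 0.],
    [0., 3., 0.],
])
baseline_scores = popularity_baseline_scores(toy_train)
print(baseline_scores.argmax(axis=1))
```

The resulting score matrix has the same shape as the interactions matrix, so it can be passed as `pred_scores` to the metric functions.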

### Memory-based method

Now try a slightly less simple method: use a memory-based approach.

In [None]:
raise NotImplementedError
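A memory-based method could, for example, be item-based k-nearest-neighbours without the neighbourhood cut-off: predict a rating as the similarity-weighted average of the user's own ratings. The sketch below uses cosine similarity between item columns; the name `item_similarity_scores`, the `eps` smoothing term, and the toy data are assumptions.

```python
import numpy as np

def item_similarity_scores(train, eps=1e-9):
    """Item-based collaborative filtering with cosine similarity."""
    train = np.asarray(train, dtype=float)
    norms = np.linalg.norm(train, axis=0)
    # cosine similarity between every pair of item columns
    sim = (train.T @ train) / (np.outer(norms, norms) + eps)
    np.fill_diagonal(sim, 0.0)  # an item should not vote for itself
    # normalise by the similarity mass of the items each user actually rated
    weights = (train != 0).astype(float) @ np.abs(sim)
    return (train @ sim) / (weights + eps)

toy_train = np.array([
    [5., 5., 0.],
    [4., 4., 0.],
    [0., 0., 5.],
    [5., 0., 0.],
])
knn_scores = item_similarity_scores(toy_train)
print(np.round(knn_scores[3], 2))
```

In the toy data, items 0 and 1 are rated together by the first two users, so the last user, who only rated item 0, gets a higher score for item 1 than for item 2.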

### Model-based method

The idea is to add complexity one step at a time. It is time to try an explicit model-based method.

In [None]:
raise NotImplementedError
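One simple model-based option is a truncated SVD of the rating matrix. The sketch below is a deliberately crude version: it centers the observed ratings on the global mean, leaves missing entries at 0 (that is, imputes them with the mean), and keeps the top `n_factors` singular vectors. The name `svd_scores` and the default of 20 factors are assumptions; a matrix factorization trained only on observed entries would usually be preferred.

```python
import numpy as np

def svd_scores(train, n_factors=20):
    """Low-rank reconstruction of the mean-centered rating matrix."""
    train = np.asarray(train, dtype=float)
    rated = train != 0
    global_mean = train[rated].mean()
    # missing entries stay at 0 after centering, i.e. they equal the mean
    centered = np.where(rated, train - global_mean, 0.0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    r = min(n_factors, len(s))
    approx = (U[:, :r] * s[:r]) @ Vt[:r]
    return approx + global_mean

toy_train = np.array([
    [5., 4., 0.],
    [4., 5., 1.],
    [1., 0., 5.],
    [0., 1., 4.],
])
full_rank = svd_scores(toy_train, n_factors=3)  # keeps every factor
low_rank = svd_scores(toy_train, n_factors=1)   # compressed model
print(np.round(low_rank, 2))
```

Keeping all factors reproduces the observed ratings exactly; lowering `n_factors` forces the model to generalise, which is what produces non-trivial scores for the unrated cells.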

### Implicit model

Remember that implicit methods also exist. What if our problem is better solved that way? It is time to put it to the test.

In [None]:
raise NotImplementedError
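For the implicit setting, one well-known approach is confidence-weighted alternating least squares: every rating is reduced to a binary preference (interacted or not), and the rating value only controls how much confidence that observation receives. The dense NumPy sketch below is an illustration under assumptions; the name `implicit_als_sketch` and the defaults for `factors`, `alpha`, `reg`, and `iters` are made up, and a real project would more likely rely on a dedicated sparse implementation.

```python
import numpy as np

def implicit_als_sketch(R, factors=2, alpha=40.0, reg=0.1, iters=10, seed=0):
    """Confidence-weighted ALS on binarized interactions (dense, for clarity)."""
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)  # binary preference
    C = 1.0 + alpha * R        # confidence grows with the rating
    X = rng.normal(scale=0.1, size=(n_users, factors))
    Y = rng.normal(scale=0.1, size=(n_items, factors))
    ridge = reg * np.eye(factors)
    for _ in range(iters):
        # fix item factors, solve a weighted ridge regression per user
        for u in range(n_users):
            YtC = Y.T * C[u]
            X[u] = np.linalg.solve(YtC @ Y + ridge, YtC @ P[u])
        # fix user factors, solve a weighted ridge regression per item
        for i in range(n_items):
            XtC = X.T * C[:, i]
            Y[i] = np.linalg.solve(XtC @ X + ridge, XtC @ P[:, i])
    return X @ Y.T

toy_R = np.array([
    [5., 4., 0., 0.],
    [4., 5., 0., 0.],
    [0., 0., 5., 4.],
    [0., 0., 4., 5.],
])
implicit_scores = implicit_als_sketch(toy_R)
print(np.round(implicit_scores, 2))
```

The scores are preference estimates rather than rating predictions, so they are meant for ranking items, not for comparison against the 1 to 5 rating scale.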

# Predictions

Now that you have seen several models in action, it is time to run inference. Implement the following function:

In [28]:
def recommend(user_id: int, k: int = 5) -> List[str]:
    """
    This function takes the id of the user and whatever number of predictions
    we want to make for them. Good luck!
    """
    raise NotImplementedError
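Whatever model ends up being used, the inference step usually amounts to ranking one user's predicted scores and mapping the top indices to titles. The sketch below shows that step with explicit arguments instead of globals; the name `recommend_from_scores`, its extra parameters, and the assumption that `user_id` is a 0-based row index (the dataset ids are 1-based) are all hypothetical.

```python
from typing import List, Sequence

import numpy as np

def recommend_from_scores(user_id: int, pred_scores: np.ndarray,
                          train: np.ndarray, items_name: Sequence[str],
                          k: int = 5) -> List[str]:
    """Return the titles of the k best-scored items the user has not rated."""
    user_scores = np.asarray(pred_scores[user_id], dtype=float).copy()
    user_scores[train[user_id] != 0] = -np.inf  # hide already-rated items
    top = np.argsort(-user_scores, kind="stable")[:k]
    # drop masked items in case fewer than k unseen items remain
    return [str(items_name[i]) for i in top if np.isfinite(user_scores[i])]

toy_pred = np.array([[0.9, 0.2, 0.8, 0.5]])
toy_train = np.array([[5., 0., 0., 0.]])
toy_names = np.array(["A", "B", "C", "D"], dtype=object)
toy_recs = recommend_from_scores(0, toy_pred, toy_train, toy_names, k=2)
print(toy_recs)  # ['C', 'D']
```

Item "A" has the highest predicted score but is skipped because the user already rated it.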