# RECOMMENDER SYSTEMS 2
## PRACTICE

<img src="portada.jpg" align="center" width="80%"/>

The main goal of any recommender system is to offer users a personalized experience. How do we translate that into a machine learning problem? There are two types of problem:

- Prediction problem: in the first approach we want to predict the rating for a user-item combination, assuming that training data indicating the user's preference for other items is available. Picture an m × n matrix, where m is the number of users and n the number of items; the goal is to predict the missing (unobserved) values.
<br>

- Ranking problem: in other cases we want to recommend the k best items for a given user, or to determine the k best users to target for a specific topic. This is also known as the top-k recommendation problem, and it is the ranking formulation of the recommendation problem. Think of a search engine that, depending on who runs the search, shows the top-k items so that results are personalized based on that user's past preferences and recent activity.
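To make the two formulations concrete, here is a minimal sketch on a toy matrix. The arrays `ratings` and `predicted` and the helper `top_k_unseen` are invented for illustration; they are not part of the dataset or of any library.

```python
import numpy as np

# Toy ratings matrix: 3 users (rows) x 4 items (columns); NaN marks an unobserved rating.
ratings = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, 1.0],
    [2.0, np.nan, np.nan, 5.0],
])

# Hypothetical model output: a predicted score for every user-item pair.
predicted = np.array([
    [4.8, 2.1, 3.2, 4.0],
    [3.0, 4.1, 3.9, 1.2],
    [2.2, 4.5, 1.0, 4.9],
])

def top_k_unseen(user, k):
    """Return the indices of the k unseen items with the highest predicted score."""
    scores = predicted[user].copy()
    scores[~np.isnan(ratings[user])] = -np.inf  # never recommend already-rated items
    return [int(i) for i in np.argsort(-scores)[:k]]

# Prediction problem: estimate one missing cell (user 0, item 1).
print(predicted[0, 1])
# Ranking problem: order the unseen items for user 0.
print(top_k_unseen(0, 2))
```

The prediction problem asks for the value of a single cell, while the ranking problem only cares about the order of the unseen items for each user.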

Classification of recommender systems

<img src="sintesis.jpg" align="center" width="60%"/>

In [2]:
# %pip install chart_studio
# %pip install datapane
# %pip install scikit-surprise
# %pip install --upgrade pandas

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datapane as dp
import chart_studio.plotly as py
import chart_studio
import os

from sklearn.model_selection import train_test_split as skl_train_test_split
from surprise import KNNBasic,NormalPredictor,BaselineOnly,NMF

# username = os.environ['PLOTLY_USERNAME']
# api_key = os.environ['PLOTLY_API_KEY']
# chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
%matplotlib inline

We use the goodbooks-10k dataset (available on Kaggle). The first CSV contains ratings of 10k books by 53k users. The second CSV contains the metadata (title, author, ISBN, etc.) for each of the 10k books.

In [4]:
ratings_data = pd.read_csv('./data/books/ratings.csv.zip')
books_metadata = pd.read_csv('./data/books/books.csv.zip')
ratings_data.head(10)

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4
5,1,2077,4
6,1,2487,4
7,1,2900,5
8,1,3662,4
9,1,3922,5


In [5]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981756 entries, 0 to 981755
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   book_id  981756 non-null  int64
 1   user_id  981756 non-null  int64
 2   rating   981756 non-null  int64
dtypes: int64(3)
memory usage: 22.5 MB


In [6]:
books_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

We also have a tags dataset, which gives us details about what each book is about.

In [7]:
book_tags = pd.read_csv('./data/books/book_tags.csv')
tags = pd.read_csv('./data/books/tags.csv')

In [8]:
data = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how='inner')
titles = books_metadata[['id', 'book_id','title','authors']]
data = pd.merge(titles, data, left_on='book_id', right_on='goodreads_book_id')
data.drop(['goodreads_book_id', 'tag_id', 'count'], axis=1,inplace=True)
data.head(3)

Unnamed: 0,id,book_id,title,authors,tag_name
0,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,to-read
1,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,fantasy
2,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,favorites


In [9]:
list_tags = data.groupby(by='book_id')['tag_name'].apply(set).apply(list)
books_metadata['tags'] = books_metadata['book_id'].apply(lambda x: ' '.join(list_tags[x]))
pd.reset_option('max_colwidth')
books_metadata.head(3)

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,tags
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,books teen-fiction loved adventure suzanne-col...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,books read-2016 england adventure childrens-bo...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,meh books teen-fiction movies young-adults twi...


In [10]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

When working with text, we can apply a transformation called **TF-IDF**, or Term Frequency-Inverse Document Frequency.

The **number of features** it creates equals the total number of distinct words used in the tags column.

The **values** are directly *proportional to the number of times a given word is used* and *inversely proportional to the number of documents (books, in this case) in which the word is used*.

It penalizes a word present in a book's tags if that word is common to many books. Words that appear several times but are common to many books are not very useful for telling the different books apart.

In other words, **TF-IDF tells us how relevant a word is to a document within a collection of documents**.

t - term (word) <br>
d - document (set of words) <br>
N - number of documents
<br>

tf(t,d) = number of occurrences of t in d / number of words in d

df(t) = number of documents (out of N) that contain t

idf(t) = N/df(t)

IDF is the inverse of the document frequency, and it measures how informative the term t is. IDF is very low for the most frequent words, such as stop words: they appear in almost every document, so N/df(t) gives them a very low value. This is what we want in the end, a relative weight.

The logarithm is usually applied to dampen the effect of a very large N: idf(t) = log(N/(df(t) + 1)).

The final score is the product of both terms:

**tf-idf(t, d) = tf(t, d) * log(N/(df(t) + 1))**
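The formula above can be checked by hand on a tiny corpus. This is a minimal sketch that follows the notebook's definition literally; the four toy documents and the helper names `tf`, `df` and `tf_idf` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalizes each row, so its values will not match these numbers.

```python
import math

# Toy corpus: each document is the tag string of one hypothetical book.
docs = [
    "fantasy adventure fantasy",
    "fantasy romance",
    "thriller suspense",
    "history",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    """Occurrences of term in doc divided by the number of words in doc."""
    return doc.count(term) / len(doc)

def df(term):
    """Number of documents that contain term."""
    return sum(1 for doc in tokenized if term in doc)

def tf_idf(term, doc):
    """tf(t, d) * log(N / (df(t) + 1)), as defined in the text."""
    return tf(term, doc) * math.log(N / (df(term) + 1))

first = tokenized[0]
print("fantasy:  ", tf_idf("fantasy", first))
print("adventure:", tf_idf("adventure", first))
```

In the first document "fantasy" appears twice and "adventure" once, yet "adventure" gets the higher score because it occurs in fewer documents.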

In [11]:
tfidf = TfidfVectorizer(stop_words='english')

tags_matrix = tfidf.fit_transform(books_metadata['tags'])

tags_matrix.shape

(10000, 16396)

In [12]:
# Mapping from each term to its column index in the TF-IDF matrix (not a count)
tfidf.vocabulary_

{'books': 2075,
 'teen': 13848,
 'fiction': 5289,
 'loved': 8573,
 'adventure': 641,
 'suzanne': 13620,
 'collins': 3179,
 'fantasy': 5132,
 'sci': 12468,
 'fi': 5268,
 'read': 11635,
 'owned': 10424,
 'science': 12470,
 'finished': 5358,
 'series': 12640,
 'hunger': 6863,
 'games': 5744,
 '2012': 253,
 'default': 3899,
 'dystopian': 4451,
 '2014': 255,
 'ebook': 4487,
 'novel': 10105,
 'favorite': 5188,
 'audiobooks': 1336,
 'ya': 15260,
 'futuristic': 5688,
 'favs': 5199,
 'favourites': 5197,
 'love': 8570,
 'triangle': 14275,
 'survival': 13600,
 'scifi': 12480,
 'thriller': 14012,
 'favorites': 5190,
 'reviewed': 11939,
 'borrowed': 2112,
 'young': 15308,
 'adult': 632,
 'action': 563,
 'novels': 10119,
 'audiobook': 1334,
 'kindle': 7818,
 'reads': 11662,
 'completed': 3269,
 'book': 2049,
 'coming': 3228,
 'age': 686,
 'trilogy': 14289,
 'distopian': 4180,
 '2011': 251,
 '2013': 254,
 'favourite': 5196,
 'suspense': 13612,
 'audio': 1333,
 'buy': 2418,
 'dystopia': 4450,
 'librar

In [13]:
doc = 0
feature_index = tags_matrix[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tags_matrix[doc, x] for x in feature_index])

In [14]:
books_metadata['tags'][0]

'books teen-fiction loved adventure suzanne-collins fantasy-sci-fi read-more-than-once owned sci-fi-fantasy science-fiction finished-series hunger-games read-2012 default dystopian-fiction read-in-2014 ebook novel favorite-books audiobooks fiction ya-fantasy futuristic series favs favourites own-it love-triangle survival re-read my-books to-read the-hunger-games scifi-fantasy thriller favorite-series my-favorites reviewed ya-fiction borrowed young-adult i-own action-adventure novels audiobook kindle 2012-reads completed-series e-book coming-of-age sci-fi trilogy dystopian distopian read-in-2011 read-in-2013 favourite-books suspense ya-dystopian scifi audio young-adult-fiction love ya-books teen favorites to-buy dystopia read-2011 my-library book-club contemporary post-apocalyptic sf ya all-time-favorites dystopias ebooks shelfari-favorites favorite 5-star favourite read-in-2012 reread books-i-own library currently-reading speculative-fiction faves action distopia ya-lit future english 

In [15]:
feature_names = tfidf.get_feature_names_out()
tfidf_scores_sorted=[]
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    tfidf_scores_sorted.append([w,s])
    print(w,s)
tfidf_scores_sorted    

romance 0.04199622977855118
stars 0.044902535151457965
2010 0.06125957409048316
drama 0.0493160446958989
english 0.04271333194149995
future 0.1014833466079337
lit 0.03843695526480429
distopia 0.12415576195547642
faves 0.07285947769502386
speculative 0.07251109380641134
reading 0.022981255535489718
currently 0.02323081081003255
reread 0.06447599331006632
star 0.07502456600504018
shelfari 0.04540261740206127
ebooks 0.03006444297235562
dystopias 0.11595489846127825
time 0.048117547849799436
sf 0.07317137596247389
apocalyptic 0.0941436201663041
post 0.09032041817820312
contemporary 0.03622900456869731
club 0.0409993795173276
library 0.04656396799540881
dystopia 0.08248694113279853
buy 0.025624979389037794
audio 0.03151942701504474
suspense 0.05403790104772188
favourite 0.1560120654896369
2013 0.039226482136424316
2011 0.10322233000148888
distopian 0.12208063904468137
trilogy 0.09256582324936077
age 0.06718730323616848
coming 0.06839317342500893
book 0.050751852706958665
completed 0.0895755

[['romance', 0.04199622977855118],
 ['stars', 0.044902535151457965],
 ['2010', 0.06125957409048316],
 ['drama', 0.0493160446958989],
 ['english', 0.04271333194149995],
 ['future', 0.1014833466079337],
 ['lit', 0.03843695526480429],
 ['distopia', 0.12415576195547642],
 ['faves', 0.07285947769502386],
 ['speculative', 0.07251109380641134],
 ['reading', 0.022981255535489718],
 ['currently', 0.02323081081003255],
 ['reread', 0.06447599331006632],
 ['star', 0.07502456600504018],
 ['shelfari', 0.04540261740206127],
 ['ebooks', 0.03006444297235562],
 ['dystopias', 0.11595489846127825],
 ['time', 0.048117547849799436],
 ['sf', 0.07317137596247389],
 ['apocalyptic', 0.0941436201663041],
 ['post', 0.09032041817820312],
 ['contemporary', 0.03622900456869731],
 ['club', 0.0409993795173276],
 ['library', 0.04656396799540881],
 ['dystopia', 0.08248694113279853],
 ['buy', 0.025624979389037794],
 ['audio', 0.03151942701504474],
 ['suspense', 0.05403790104772188],
 ['favourite', 0.1560120654896369],
 [

In [16]:
tfidf_scores_sorted_df=pd.DataFrame(tfidf_scores_sorted)
tfidf_scores_sorted_df.columns=['token','val']
tfidf_scores_sorted_df.sort_values(by='val',ascending=False,inplace=True)
tfidf_scores_sorted_df.head(20)

Unnamed: 0,token,val
65,hunger,0.354943
55,ya,0.301504
64,games,0.28707
61,dystopian,0.243866
70,read,0.227162
73,fantasy,0.209399
75,suzanne,0.16896
74,collins,0.162783
78,fiction,0.159109
28,favourite,0.156012


Which common words do you think we could remove?

### Cosine similarity

<img src="similitud_coseno.jpg" align="center" width="60%"/>

The image shows cosine similarity in two dimensions. If two data vectors are close, the angle between them is small and their cosine similarity is high. At the extreme, two identical vectors have an angle of 0 degrees, a cosine of 1, and therefore a cosine similarity of 1.

Naturally, the concept extends to N dimensions.
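As a minimal sketch of the definition, cosine similarity is the dot product of two vectors divided by the product of their norms. The helper `cosine_sim` and the three toy vectors are illustrative only, and the function assumes neither vector is all zeros.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_sim(a, b))  # close to 1: the vectors point the same way
print(cosine_sim(a, c))  # 0: the vectors share no component
```

Because only the angle matters, a book with many tags and a book with few tags can still be highly similar if their tag profiles point in the same direction.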

In [17]:
similarity_matrix = cosine_similarity(tags_matrix,tags_matrix)
similarity_matrix

array([[1.        , 0.38961343, 0.40894373, ..., 0.09091467, 0.16148787,
        0.03602845],
       [0.38961343, 1.        , 0.37207812, ..., 0.0886315 , 0.16692497,
        0.04592596],
       [0.40894373, 0.37207812, 1.        , ..., 0.05919159, 0.09024158,
        0.02882952],
       ...,
       [0.09091467, 0.0886315 , 0.05919159, ..., 1.        , 0.05409486,
        0.12427748],
       [0.16148787, 0.16692497, 0.09024158, ..., 0.05409486, 1.        ,
        0.11579571],
       [0.03602845, 0.04592596, 0.02882952, ..., 0.12427748, 0.11579571,
        1.        ]])

In [18]:
def recommend_books_based_on_plot(book_input,n=15):
    book_index = books_metadata.loc[books_metadata.original_title==book_input,:].index[0]

    #get similarity values with other books
    #similarity_score is the list of index and similarity matrix
    similarity_score = list(enumerate(similarity_matrix[book_index]))

    #sort in descending order the similarity score of book inputted with all the other books
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    # Keep positions 1..n-1 of the sorted list: position 0 is the input book itself, so n-1 books are returned.
    similarity_score = similarity_score[1:n]

    #return book names using the mapping series
    book_indices = [i[0] for i in similarity_score]

    return (books_metadata[['original_title','authors','tags']].iloc[book_indices])

In [19]:
recommend_books_based_on_plot('Nineteen Eighty-Four',10)

Unnamed: 0,original_title,authors,tags
54,Brave New World,Aldous Huxley,books read-for-school 1001-books general-ficti...
13,Animal Farm: A Fairy Story,George Orwell,books read-for-school 1001-books allegory gene...
47,Fahrenheit 451,Ray Bradbury,books read-for-school general-fiction school f...
27,Lord of the Flies,William Golding,books read-for-school 1001-books adventure gen...
70,"Frankenstein; or, The Modern Prometheus","Mary Wollstonecraft Shelley, Percy Bysshe Shel...",books 19th-century read-for-school 1001-books ...
172,A Clockwork Orange,Anthony Burgess,books rory-gilmore-reading-list movies england...
808,Brave New World/Brave New World Revisited,"Aldous Huxley, Christopher Hitchens",books read-for-school 1001-books general-ficti...
7,The Catcher in the Rye,J.D. Salinger,books read-for-school 1001-books general-ficti...
845,Animal Farm & 1984,"George Orwell, Christopher Hitchens",books read-for-school allegory general-fiction...


In [20]:
recommend_books_based_on_plot('Nineteen Eighty-Four',15)

Unnamed: 0,original_title,authors,tags
54,Brave New World,Aldous Huxley,books read-for-school 1001-books general-ficti...
13,Animal Farm: A Fairy Story,George Orwell,books read-for-school 1001-books allegory gene...
47,Fahrenheit 451,Ray Bradbury,books read-for-school general-fiction school f...
27,Lord of the Flies,William Golding,books read-for-school 1001-books adventure gen...
70,"Frankenstein; or, The Modern Prometheus","Mary Wollstonecraft Shelley, Percy Bysshe Shel...",books 19th-century read-for-school 1001-books ...
172,A Clockwork Orange,Anthony Burgess,books rory-gilmore-reading-list movies england...
808,Brave New World/Brave New World Revisited,"Aldous Huxley, Christopher Hitchens",books read-for-school 1001-books general-ficti...
7,The Catcher in the Rye,J.D. Salinger,books read-for-school 1001-books general-ficti...
845,Animal Farm & 1984,"George Orwell, Christopher Hitchens",books read-for-school allegory general-fiction...
64,"Slaughterhouse-Five, or The Children's Crusade...",Kurt Vonnegut Jr.,books read-for-school aliens 1001-books postmo...


In [21]:
recommend_books_based_on_plot('Misery')

Unnamed: 0,original_title,authors,tags
910,Gerald's Game,Stephen King,psychological-thriller books couldn-t-finish c...
1074,Dolores Claiborne,"Stephen King, Dominique Dill",psychological-thriller books calibre general-f...
674,Firestarter,Stephen King,books adventure calibre general-fiction suspen...
304,Pet Sematary,Stephen King,books movies general-fiction contemporary-fict...
669,The Dead Zone,Stephen King,books calibre general-fiction fantasy-sci-fi s...
6322,,Stephen King,books listened-to movies general-fiction conte...
552,Needful Things,Stephen King,books calibre general-fiction fantasy-sci-fi s...
1422,Rose Madder,Stephen King,books calibre general-fiction contemporary-fic...
914,Insomnia,"Stephen King, Bettina Blanch Tyroller",books calibre general-fiction contemporary-fic...
704,Thinner,"Richard Bachman, Stephen King",books calibre general-fiction prose stand-alon...


In [22]:
recommend_books_based_on_plot('The Hobbit')

Unnamed: 0,original_title,authors,tags
3520,Neverwhere Graphic Novel,"Mike Carey, Glenn Fabry, Neil Gaiman",england adventure fantasy-sci-fi urban comic-b...
963,The Hobbit and The Lord of the Rings,J.R.R. Tolkien,books hobbit loved movies adventure fantasy-sc...
154,The Two Towers,J.R.R. Tolkien,books movies 1001-books adventure fantasy-sci-...
18,The Fellowship of the Ring,J.R.R. Tolkien,books movies 1001-books adventure lotr movie r...
7421,Coraline,"Neil Gaiman, P. Craig Russell",books teen-fiction movies england adventure re...
6,The Hobbit or There and Back Again,J.R.R. Tolkien,books read-for-school 1001-books adventure sch...
160,The Return of the King,J.R.R. Tolkien,books movies 1001-books adventure fantasy-sci-...
8373,Darth Vader and Son,Jeffrey Brown,read-2016 quadrinhos funny read-2014 childrens...
6737,The Hedge Knight,"Ben Avery, Mike S. Miller, George R.R. Martin",books read-2016 2015-reads adventure male-auth...
1697,The Dark Tower: The Gunslinger Born,"Peter David, Robin Furth, Jae Lee, Richard Isa...",books adventure comix marvel fantasy-sci-fi co...


To get more diverse recommendations, we could leave out books by the same author as the target book.
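One possible way to apply that idea is to filter the recommendation table after it is built. This is a minimal sketch, not part of the notebook's pipeline: the helper `exclude_same_author` and the small `recs` frame are hypothetical, and it assumes the `authors` column holds comma-separated names as in the outputs above.

```python
import pandas as pd

def exclude_same_author(recommendations, target_authors):
    """Drop recommended rows that share at least one author with the target book."""
    target = {a.strip() for a in target_authors.split(',')}
    shares_author = recommendations['authors'].apply(
        lambda s: bool(target & {a.strip() for a in s.split(',')})
    )
    return recommendations[~shares_author]

# Hypothetical recommendation table with the same columns the notebook returns.
recs = pd.DataFrame({
    'original_title': ["Gerald's Game", 'Thinner', 'The Silence of the Lambs'],
    'authors': ['Stephen King', 'Richard Bachman, Stephen King', 'Thomas Harris'],
})

print(exclude_same_author(recs, 'Stephen King'))
```

Since filtering removes rows, one would probably request more neighbours than needed and trim the list afterwards.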

## Collaborative filtering: the Surprise library

We need to create a dataset object to work with Surprise. This dataset contains the following elements:
1. The user IDs
2. The ID of each item (in this case, each book)
3. The corresponding rating (in this case, on a scale from 1 to 5)


In [23]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_data[['user_id', 'book_id', 'rating']], reader)

### Training an SVD model with cross-validation

We can use cross-validation to train a model that applies SVD (Singular Value Decomposition) to build a recommender system. SVD is a matrix factorization algorithm that can be used for recommender systems.

Recommender systems based on matrix factorization usually follow a pattern in which a rating matrix is factorized into a product of matrices representing latent factors for the items (in this case, books) and for the users. This is the matrix factorization technique we saw in the previous class.
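The idea of learning latent factors from observed ratings can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Surprise's implementation: the toy ratings, the learning rate and the regularization value are invented, and the bias terms that Surprise's SVD also learns are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed ratings as (user, item, rating) triples; every other cell is unknown.
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
            (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 1)]
n_users, n_items, n_factors = 4, 3, 2

P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors

def rmse(P, Q):
    """Root mean squared error over the observed ratings only."""
    errors = [r - P[u] @ Q[i] for u, i, r in observed]
    return float(np.sqrt(np.mean(np.square(errors))))

initial_rmse = rmse(P, Q)
lr, reg = 0.01, 0.02  # illustrative learning rate and L2 regularization
for _ in range(1000):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]
        p_old = P[u].copy()
        # Stochastic gradient step on the regularized squared error
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
final_rmse = rmse(P, Q)

print(initial_rmse, final_rmse)
print(P[0] @ Q[2])  # estimate for a cell that was never observed
```

Once the factors are learned, the estimate for any user-item pair, observed or not, is the dot product of the corresponding rows of `P` and `Q`.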

In [24]:
from surprise import SVD
from surprise.model_selection import cross_validate

svd = SVD(verbose=True, n_epochs=10)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8562  0.8552  0.8573  0.8562  0.0009  
MAE (testset)     0.6759  0.6751  0.6755  0.6755  0.0003  
Fit time          3.50    4.00    4.09    3.86    0.26    
Test time         2.14    2.19    2.26    2.20    0.05    


{'test_rmse': array([0.85621786, 0.85521367, 0.85730108]),
 'test_mae': array([0.67586187, 0.67507524, 0.67550073]),
 'fit_time': (3.4980738162994385, 3.996868133544922, 4.0857460498809814),
 'test_time': (2.1360020637512207, 2.18822979927063, 2.262756824493408)}
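The two metrics reported in the table above can be computed by hand. A minimal sketch, with made-up true ratings and estimates; the helper names `rmse` and `mae` are illustrative and are not Surprise functions.

```python
import math

def rmse(true, pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    """Mean absolute error: the average size of the error."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

true_ratings = [5, 3, 4, 1]
estimates = [4.0, 3.0, 5.0, 3.0]

print(rmse(true_ratings, estimates))
print(mae(true_ratings, estimates))
```

RMSE is never smaller than MAE on the same data, which matches the cross-validation output, where RMSE is about 0.86 and MAE about 0.68.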

We can also train the model on the whole dataset with the fit method, after converting the Dataset object into a Surprise Trainset object with the build_full_trainset method.

In [25]:
trainset = data.build_full_trainset()
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9f1a86cca0>

We can now predict the rating of a specific user (uid) for a specific book (iid) with the predict method.

In [26]:
svd.predict(uid=10, iid=100)

Prediction(uid=10, iid=100, r_ui=None, est=3.902505952678895, details={'was_impossible': False})

Take as an example a self-confessed fan of Dan Brown's books (author of The Da Vinci Code). Let's look at their ratings of that author's books:

In [27]:
books_metadata.loc[books_metadata.authors.str.contains('Dan Brown'),:]

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,tags
8,9,960,960,3338963,311,1416524797,9781417000000.0,Dan Brown,2000.0,Angels & Demons,...,2078754,25112,77841,145740,458429,716569,680175,https://images.gr-assets.com/books/1303390735m...,https://images.gr-assets.com/books/1303390735s...,books murder-mystery movies adventure general-...
25,26,968,968,2982101,350,307277674,9780307000000.0,Dan Brown,2003.0,The Da Vinci Code,...,1557292,41560,71345,126493,340790,539277,479387,https://images.gr-assets.com/books/1303252999m...,https://images.gr-assets.com/books/1303252999s...,books movies adventure general-fiction mystery...
144,145,976,976,3135896,180,671027387,9780671000000.0,Dan Brown,2001.0,Deception Point,...,479733,7798,14380,45006,143182,160819,116346,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,books adventure general-fiction mystery-crime ...
169,170,11125,11125,40195,185,552151696,9780552000000.0,Dan Brown,1998.0,Digital Fortress,...,447503,7538,16592,47507,136196,144029,103179,https://images.gr-assets.com/books/1360095966m...,https://images.gr-assets.com/books/1360095966s...,books adventure general-fiction mystery-crime ...
200,201,6411961,6411961,6600281,208,385504225,9780386000000.0,Dan Brown,2009.0,The Lost Symbol,...,404982,24199,11293,40545,120122,136545,96477,https://images.gr-assets.com/books/1358274396m...,https://images.gr-assets.com/books/1358274396s...,books adventure general-fiction mystery-crime ...
239,240,17212231,17212231,23841765,235,385537859,9780386000000.0,Dan Brown,2013.0,Inferno,...,345560,33700,7261,26077,89237,128374,94611,https://images.gr-assets.com/books/1397093185m...,https://images.gr-assets.com/books/1397093185s...,books books-read-in-2013 read-2016 adventure g...
4495,4496,643301,643301,24585562,6,593054601,9780593000000.0,Dan Brown,2003.0,,...,18765,476,254,636,2797,6648,8430,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,books adventure religion before-2008 contempor...


In [28]:
libros_db=books_metadata.loc[books_metadata.authors.str.contains('Dan Brown'),'id'].values
libros_db

array([   9,   26,  145,  170,  201,  240, 4496])

In [29]:
libros_leidos_usuario_db=ratings_data.loc[(ratings_data.user_id==18313)&(ratings_data.book_id.isin(libros_db)),:]
libros_leidos_usuario_db

Unnamed: 0,book_id,user_id,rating
843,9,18313,5
2539,26,18313,5


We see that this user rates Dan Brown's (DB) books very highly: both books they rated received a 5.

In [30]:
libros_db[~np.isin(libros_db,libros_leidos_usuario_db['book_id'])]

array([ 145,  170,  201,  240, 4496])

In [31]:
# Compare against the book_id column only, not against every value in the DataFrame
pendientes_db=libros_db[~np.isin(libros_db,libros_leidos_usuario_db['book_id'])]
pred=[]
for libro in pendientes_db:
    pred.append(svd.predict(uid=18313, iid=libro).est)

media_db=ratings_data.loc[ratings_data.book_id.isin(libros_db),"rating"].mean()
print(f'Mean rating of DB books: {media_db}')
print(f'Mean estimated rating of unread DB books for user 18313: {np.array(pred).mean()}')


Mean rating of DB books: 3.437410071942446
Mean estimated rating of unread DB books for user 18313: 3.51842192050196


The average rating our model estimates for the Dan Brown books this user has not read yet (about 3.52) is above the overall average rating of his books (about 3.44).

### Generating recommendations

In [32]:
import difflib
import random

def get_book_id(book_title, metadata):
    """Return the id of the book whose title is the closest match to book_title."""
    existing_titles = list(metadata['title'].values)
    # Fuzzy matching tolerates small typos; raises IndexError if nothing is close enough
    closest_titles = difflib.get_close_matches(book_title, existing_titles)
    book_id = metadata[metadata['title'] == closest_titles[0]]['id'].values[0]
    return book_id

def get_book_info(book_id, metadata):
    """Return the main metadata fields of a book as a list of dicts."""
    book_info = metadata[metadata['id'] == book_id][['id', 'isbn',
                                                    'authors', 'title', 'original_title']]
    return book_info.to_dict(orient='records')

def predict_review(user_id, book_title, model, metadata):
    """Estimate the rating user_id would give to the book titled book_title."""
    book_id = get_book_id(book_title, metadata)
    review_prediction = model.predict(uid=user_id, iid=book_id)
    return review_prediction.est

def generate_recommendation(user_id, model, metadata, thresh=4):
    """Return the first randomly drawn book whose estimated rating is >= thresh."""
    book_titles = list(metadata['title'].values)
    random.shuffle(book_titles)

    for book_title in book_titles:
        rating = predict_review(user_id, book_title, model, metadata)
        if rating >= thresh:
            book_id = get_book_id(book_title, metadata)
            return get_book_info(book_id, metadata)
    return None  # no book reached the threshold


The generate_recommendation function produces a book recommendation for a user: it iterates over the shuffled list of book titles and predicts the user's rating for each one until it finds a book whose predicted rating is at or above the specified threshold, which qualifies it to be recommended.
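The search strategy described above can be isolated from the model. In this minimal sketch, `first_above_threshold`, `fake_scores` and the `seed` argument are hypothetical names; the dictionary stands in for the estimates that `model.predict(...).est` would return.

```python
import random

def first_above_threshold(candidates, predict_fn, thresh=4.0, seed=None):
    """Return the first candidate (in random order) whose predicted rating
    reaches thresh, or None if no candidate qualifies."""
    order = list(candidates)
    random.Random(seed).shuffle(order)
    for item in order:
        if predict_fn(item) >= thresh:
            return item
    return None

# Hypothetical predicted ratings standing in for the model's estimates
fake_scores = {'A': 3.2, 'B': 4.4, 'C': 2.9, 'D': 4.1}

pick = first_above_threshold(fake_scores, fake_scores.get, thresh=4.0, seed=0)
print(pick)
```

Because the search stops at the first hit in a random order, it returns some qualifying book rather than the best one, and repeated calls without a seed can return different books.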

In [33]:
generate_recommendation(18313, svd, books_metadata, 4)

[{'id': 8286,
  'isbn': '684800012',
  'authors': 'Harold McGee',
  'title': 'On Food and Cooking: The Science and Lore of the Kitchen',
  'original_title': 'On Food and Cooking: The Science and Lore of the Kitchen'}]

### Visualizing similarities between books with t-SNE

In [34]:
svd.qi.shape

(10000, 100)

We can now examine the similarity between books using the item-factor matrix produced by the SVD algorithm.

This 10,000 x 100 matrix holds a 100-dimensional vector for each book. That is too many dimensions to visualize intuitively, but we can use a dimensionality reduction technique to represent each book as a two-dimensional point.

We can use t-SNE (t-distributed Stochastic Neighbor Embedding, an alternative to PCA) to represent each book as a two-dimensional point.
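For comparison, the PCA alternative mentioned above can be sketched with plain NumPy. This block uses PCA, not t-SNE, and runs on random stand-in data: `item_factors` is a hypothetical replacement for `svd.qi`, and `pca_2d` is an illustrative helper.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
item_factors = rng.normal(size=(50, 100))  # stand-in for the 10,000 x 100 matrix
points = pca_2d(item_factors)

print(points.shape)  # (50, 2)
```

PCA is a linear projection that preserves global variance, whereas t-SNE is non-linear and focuses on keeping neighbouring points close, so the two plots can look quite different.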

In [None]:
from sklearn.manifold import TSNE

# Note: scikit-learn 1.5 renamed n_iter to max_iter; on recent versions pass max_iter=500 instead
tsne = TSNE(n_components=2, n_iter=500, verbose=3, random_state=1)
books_embedding = tsne.fit_transform(svd.qi)
projection = pd.DataFrame(columns=['x', 'y'], data=books_embedding)
projection['title'] = books_metadata['original_title']

In [None]:
import plotly.express as px

fig = px.scatter(
    projection, x='x', y='y'
)
fig.show()

report = dp.Report(dp.Plot(fig) ) #Create a report
# report.publish(name='books_scatter_plot', open=True, visibility='PUBLIC') #Publish the report

The points representing the 10,000 books appear to follow a two-dimensional normal distribution. One possible explanation:

- Some books may be popular across a wide range of audiences and therefore correspond to points near the center of the scatter plot.
- Other books may belong to very specific genres, such as vampire, mystery, or romance novels, which are popular with a particular audience. These books may correspond to points far from the center of the plot.

In [36]:
import datapane as dp

def plot_books(titles, plot_name):
    
    book_indices = []
    for book in titles:
        book_indices.append(get_book_id(book, books_metadata)-1)
        
    book_vector_df = projection.iloc[book_indices]
    
    fig = px.scatter(
    book_vector_df, x='x', y='y', text='title',
    )
    fig.show()
    
    report = dp.Report(dp.Plot(fig) ) #Create a report
#     report.publish(name=plot_name, open=True, visibility='PUBLIC') #Publish the report

What happens if we look at some specific books? Is the similarity of their features reflected in the reduced-dimension space?

In [None]:
books = list(books_metadata['title'][:30])
plot_books(books, plot_name='books_embedding')

In [38]:
from collections import defaultdict

from surprise import Dataset, SVD


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9f1f678100>

In [40]:
# Predict the ratings for every (u, i) pair that does not appear in the training set
# testset = trainset.build_anti_testset()
# predictions = algo_baseline.test(testset)

# We recommend using a subsample, because the number of book/user combinations that do not appear
# in the training set is very large!
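
The subsampling idea mentioned above can be sketched without materializing the full anti-testset. The helper `sample_unseen_pairs` and the toy ratings are hypothetical (they are not part of the notebook or of Surprise); the sketch draws random (user, item) pairs and keeps only those absent from the training data, which is cheap when the rating matrix is sparse.

```python
import random


def sample_unseen_pairs(seen_pairs, n_samples, fill=0.0, seed=0):
    """Draw up to n_samples distinct (user, item) pairs that are not in seen_pairs.

    Hypothetical helper: users and items are taken from seen_pairs themselves,
    so every candidate pair involves a user and an item known from training.
    """
    rng = random.Random(seed)
    seen = set(seen_pairs)
    users = sorted({u for u, _ in seen})
    items = sorted({i for _, i in seen})

    # Never ask for more pairs than actually exist outside the training data
    n_unseen = len(users) * len(items) - len(seen)
    n_samples = min(n_samples, n_unseen)

    sampled = set()
    while len(sampled) < n_samples:
        pair = (rng.choice(users), rng.choice(items))
        if pair not in seen:  # rejection step: skip pairs already rated
            sampled.add(pair)

    # (user, item, fill) triples, where fill is a placeholder "true" rating
    return [(uid, iid, fill) for uid, iid in sorted(sampled)]


# Toy training pairs (user_id, book_id); 3 users x 3 books = 9 pairs, 4 observed
seen = [(1, 10), (1, 20), (2, 10), (3, 30)]
print(sample_unseen_pairs(seen, n_samples=3))
```

The returned triples have the same (user, item, placeholder rating) shape as the testsets used elsewhere in this notebook, so they should be usable with `algo.test(...)`. Rejection sampling becomes slow if most pairs are already observed, which is rarely the case for rating data.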


In [41]:
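# Books already rated by user 18313, and every other book in the ratings data
libros_leidos_usuario=ratings_data.loc[(ratings_data.user_id==18313),'book_id']
libros_no_leidos_usuario=ratings_data.loc[~ratings_data.book_id.isin(libros_leidos_usuario),'book_id'].unique()

# Predict a rating for each book the user has not rated yet
# (`svd` is assumed to be an already-fitted SVD model; the `algo` fitted above can be used instead)
pred=[]
for libro in libros_no_leidos_usuario:
    pred.append(svd.predict(uid=18313, iid=libro))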


In [42]:
top_n = get_top_n(pred, n=10)
top_n

defaultdict(list,
            {18313: [(307, 5),
              (3753, 5),
              (4868, 5),
              (5207, 5),
              (9842, 4.986308263266028),
              (3628, 4.928676164700801),
              (3660, 4.906307765193705),
              (7831, 4.876989856940389),
              (9569, 4.874730176126109),
              (4778, 4.859533981643138)]})

In [43]:
# Print the titles of the top-10 recommendations for user 18313
for uid, user_ratings in top_n.items():
    print([books_metadata.loc[books_metadata.id==iid,'original_title'].values[0] for (iid, _) in user_ratings])

# In the top 10 books recommended for our test user we see science fiction, religion, and mystery books,
# which is consistent with this user's taste for Dan Brown's books

["The Wise Man's Fear", 'Harry Potter Collection (Harry Potter, #1-6)', 'Jesus the Christ: A Study of the Messiah and His Mission according to Holy Scriptures both Ancient and Modern', 'The Days Are Just Packed: A Calvin and Hobbes Collection', 'Humans of New York: Stories', 'The Complete Calvin and Hobbes', 'The Sandman: The Wake', 'رباعيات صلاح جاهين', 'The Hate U Give', 'Holy Bible English Standard Version']


### Comparing the performance of several algorithms

In [44]:
ratings_data_f=ratings_data.sample(n=100000)

In [45]:
from surprise import Reader
from surprise.model_selection import cross_validate

# CAUTION! It is better to use scikit-learn's train_test_split and then convert the resulting data to a Surprise Dataset object
train, test = skl_train_test_split(ratings_data_f, test_size=0.20,random_state=100)
reader = Reader(rating_scale=(1, 5))
data_train = Dataset.load_from_df(train, reader)

algo_svd = SVD()
algo_knn = KNNBasic()
algo_random=NormalPredictor()
algo_baseline=BaselineOnly()
algo_nmf=NMF()

In [46]:
cv_random=cross_validate(algo_random, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)

In [47]:
cv_baseline=cross_validate(algo_baseline, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_svd=cross_validate(algo_svd, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_knn=cross_validate(algo_knn, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_nmf=cross_validate(algo_nmf, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)

In [48]:
# Mean cross-validated RMSE of each algorithm, rounded to 3 decimals
res = [
    round(cv['test_rmse'].mean(), 3)
    for cv in (cv_random, cv_baseline, cv_svd, cv_knn, cv_nmf)
]

res_df = pd.DataFrame({
    'Algorithm': ['Random', 'Baseline', 'SVD', 'KNN', 'NMF'],
    'RMSE CV': res,
})
res_df

Unnamed: 0,Algorithm,RMSE CV
0,Random,1.326
1,Baseline,0.932
2,SVD,0.934
3,KNN,0.991
4,NMF,1.112


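The best algorithms appear to be the baseline (RMSE 0.932) and SVD (RMSE 0.934), which are practically tied and clearly ahead of KNN (0.991), NMF (1.112), and the random predictor (1.326).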

See fit times and expected performance at https://surpriselib.com/
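
As a reminder of what the RMSE column measures, here is a minimal sketch with made-up ratings (the lists `true_r`, `pred_a`, and `pred_b` are illustrative, not taken from the dataset). Both sets of predictions have the same mean absolute error (0.5), but RMSE penalizes the single large error more heavily.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equally long lists of ratings."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


true_r = [4, 5, 3, 2]
pred_a = [4.5, 4.5, 3.5, 2.5]  # every prediction is off by 0.5
pred_b = [4, 5, 3, 4]          # three exact predictions, one off by 2

print(rmse(true_r, pred_a))  # 0.5
print(rmse(true_r, pred_b))  # 1.0
```

The "RMSE CV" values in the table are the mean of this quantity over the 4 cross-validation folds.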

### Metrics specific to recommender systems

<img src="metricas.png" align="center" width="60%"/>

An item is considered relevant if its true rating is at or above a given threshold (e.g., greater than or equal to 4). An item is considered recommended if its estimated rating is at or above the threshold and it is among the k highest estimated ratings.

Note that in the edge cases where a division by zero occurs, Precision@k and Recall@k are undefined. By convention, we set them to 0 in those cases.
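
The definitions above can be written compactly as follows. The notation is an assumption introduced here for illustration: $r_{ui}$ is the true rating of user $u$ for item $i$, $\hat{r}_{ui}$ the estimated rating, $t$ the threshold, and $T_k(u)$ the set of the $k$ items with the highest estimated rating among those evaluated for user $u$.

$$
\mathrm{Rel}(u) = \{\, i : r_{ui} \ge t \,\}, \qquad
\mathrm{Rec}_k(u) = \{\, i \in T_k(u) : \hat{r}_{ui} \ge t \,\}
$$

$$
\mathrm{Precision@}k(u) = \frac{|\mathrm{Rel}(u) \cap \mathrm{Rec}_k(u)|}{|\mathrm{Rec}_k(u)|}, \qquad
\mathrm{Recall@}k(u) = \frac{|\mathrm{Rel}(u) \cap \mathrm{Rec}_k(u)|}{|\mathrm{Rel}(u)|}
$$

Each metric is set to 0 when its denominator is 0, and the reported figures are averages over all users.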

In [49]:
from collections import defaultdict

from surprise import Dataset, SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [76]:
kf = KFold(n_splits=4)
# algo = SVD()
precision=[]
recall=[]
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=4)

    # Precision and recall can then be averaged over all users
    precision.append(sum(prec for prec in precisions.values()) / len(precisions))
    recall.append(sum(rec for rec in recalls.values()) / len(recalls))

In [51]:
len(predictions)

245439

In [52]:
#Top 5
# print(f'Mean precision: {np.array(precision).mean()}')
# print(f'Mean recall: {np.array(recall).mean()}')

# Mean precision: 0.49
# Mean recall: 0.34

Mean precision: 0.4894720012978624
Mean recall: 0.34429402285592936


In [77]:
#Top 10
print(f'Mean precision: {np.array(precision).mean()}')
print(f'Mean recall: {np.array(recall).mean()}')

# Mean precision: 0.49
# Mean recall: 0.37

Mean precision: 0.48574111644278617
Mean recall: 0.36972470185920686


This means that, on average, about 49% of the books recommended within each user's top 10 (estimated rating of 4 or higher) are actually relevant (true rating of 4 or higher), while about 37% of all relevant books appear among those top-10 recommendations.

We could increase k, which would yield higher recall at the cost of lower precision: going from the top 5 to the top 10 raised recall from 0.34 to 0.37 while precision dropped slightly. This is analogous to using a lower decision threshold (i.e., being less strict) in a conventional ML classifier.

# PRACTICE

Build two recommender systems:
- a content-based one built from the "overview" column (the movie's plot summary)
- a collaborative-filtering model built from the ratings

In addition to what was covered in class, you can draw on the material here:
https://www.datacamp.com/tutorial/recommender-systems-python

The full dataset is available here: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data
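
As a starting point for the content-based system, here is a minimal sketch that uses TF-IDF vectors and cosine similarity. This is one common approach, not necessarily the one expected in the exercise. The `toy_movies` DataFrame and the `recommend_similar` helper are hypothetical stand-ins; with the real data, the `title` and `overview` columns of `movies_metadata` would take their place.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for movies_metadata[['title', 'overview']]
toy_movies = pd.DataFrame({
    'title': ['Space Quest', 'Galaxy Rescue', 'Love in Paris', 'Paris Wedding'],
    'overview': [
        'An astronaut crew travels through space to explore a distant galaxy.',
        'A pilot must rescue an astronaut crew stranded in a distant galaxy.',
        'Two strangers fall in love during a romantic summer in Paris.',
        'A romantic wedding in Paris brings two families together.',
    ],
})

# One TF-IDF vector per overview (missing overviews become empty strings)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(toy_movies['overview'].fillna(''))

# Pairwise cosine similarity between all overviews
similarity = cosine_similarity(tfidf_matrix)


def recommend_similar(title, n=2):
    """Return the n titles whose overview is most similar to that of `title`."""
    idx = toy_movies.index[toy_movies['title'] == title][0]
    scores = pd.Series(similarity[idx], index=toy_movies['title']).drop(title)
    return scores.sort_values(ascending=False).head(n).index.tolist()


print(recommend_similar('Space Quest', n=1))  # ['Galaxy Rescue']
```

On the full dataset (about 45,000 movies) the dense similarity matrix becomes large, so computing similarities only for the queried movie is a reasonable alternative.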

HOMEWORK: write a function that measures precision@k and recall@k from the list of recommendations produced by the content-based recommender system.
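
One possible shape for such a function is sketched below. It assumes the recommender returns an ordered list of item ids and that we have a set of items considered relevant for the user (for example, held-out movies the user rated 4 or higher; that choice is an assumption). The name `precision_recall_from_list` and the toy ids are hypothetical.

```python
def precision_recall_from_list(recommended, relevant, k=10):
    """Precision@k and recall@k for one user, from a ranked recommendation list.

    recommended: item ids ordered from most to least recommended.
    relevant: collection of item ids considered relevant for the user.
    """
    top_k = list(recommended)[:k]
    relevant = set(relevant)

    # Recommended items in the top k that are also relevant
    hits = sum(1 for item in top_k if item in relevant)

    # Same convention as above: undefined ratios are set to 0
    precision = hits / len(top_k) if top_k else 0
    recall = hits / len(relevant) if relevant else 0
    return precision, recall


recommended = ['A', 'B', 'C', 'D', 'E']  # ranked output of the recommender
relevant = {'A', 'C', 'F', 'G'}          # items the user actually liked

for k in (1, 3, 5):
    print(k, precision_recall_from_list(recommended, relevant, k=k))
```

In this toy case precision falls from 1.0 at k=1 to 0.4 at k=5, while recall rises from 0.25 to 0.5. Unlike `precision_recall_at_k` above, this version does not need estimated ratings: every item in the top k counts as recommended.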

In [60]:
ratings_metadata = pd.read_csv('./data/movies/ratings_small.csv', low_memory=False)
movies_metadata = pd.read_csv('./data/movies/movies_metadata.csv', low_memory=False)

print(ratings_metadata.shape)
ratings_metadata.head()

(100004, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [61]:
print(movies_metadata.shape)
movies_metadata.head()

(45466, 24)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


## BIBLIOGRAPHY

- https://towardsdatascience.com/how-you-can-build-simple-recommender-systems-with-surprise-b0d32a8e4802
- https://surprise.readthedocs.io/
- https://medium.com/mlearning-ai/basic-content-based-recommendation-system-with-python-code-be920b412067
- https://medium.com/analytics-vidhya/content-based-recommender-systems-in-python-2b330e01eb80
- https://towardsdatascience.com/hands-on-content-based-recommender-system-using-python-1d643bf314e4

LightFM
- https://towardsdatascience.com/solving-business-usecases-by-recommender-system-using-lightfm-4ba7b3ac8e62


