# SISTEMAS DE RECOMENDACIÓN 2
## PRÁCTICA

<img src="portada.jpg" align="center" width="80%"/>

El principal objetivo de cualquier sistema de recomendación es ofrecer a los usuarios una experiencia personalizada. ¿Cómo traducimos esto a un problema de machine learning? Tenemos dos tipos de problemas:

- Problema de predicción: el primer enfoque es aquel en el que nos gustaría predecir el valor de calificación de una combinación usuario-artículo con la suposición de que se dispone de datos de entrenamiento que indican la preferencia de un usuario por otros artículos. Imaginemos una matriz m × n, donde m es el número de usuarios y n el número de elementos, y el objetivo es predecir los valores que faltan (o no observados). En otras palabras, **queremos predecir el rating que un usuario preexistente le pondría a un ítem que aún no consumió.**
<br>

- Problema de clasificación: en otras ocasiones, deseamos **recomendar los k mejores artículos para un usuario concreto o determinar los k mejores usuarios a los que dirigirnos para un ítem específico**. Este problema también se denomina problema de recomendación top-k y es la formulación de clasificación del problema de recomendación. Pensemos en un motor de búsqueda en el que, en función de quién realice la búsqueda, nos gustaría mostrar los elementos top-k para ofrecer resultados personalizados basados en sus preferencias anteriores y en su actividad reciente.

Clasficación de sistemas de recomendación 

<img src="sintesis.jpg" align="center" width="60%"/>

In [37]:
# !pip install chart_studio
# !pip install datapane
# !pip install pip install scikit-surprise
# !pip install --upgrade pandas

In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datapane as dp
import chart_studio.plotly as py
import chart_studio
import os

from sklearn.model_selection import train_test_split as skl_train_test_split
from surprise import KNNBasic,NormalPredictor,BaselineOnly,NMF

# username = os.environ['PLOTLY_USERNAME']
# api_key = os.environ['PLOTLY_API_KEY']
# chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
%matplotlib inline

## Ejemplo práctico de un SR basado en contenido

Utilizamos el dataset goodbooks-10k (disponible en Kaggle). El primer CSV contiene datos de 10k libros puntuados por 53k usuarios. El segundo CSV contiene los metadatos (título, autor, ISBN, etc) para cada uno de los 10k libros.

In [39]:
ratings_data = pd.read_csv('./data/books/ratings.csv.zip')
books_metadata = pd.read_csv('./data/books/books.csv.zip')
ratings_data.head(10)

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4
5,1,2077,4
6,1,2487,4
7,1,2900,5
8,1,3662,4
9,1,3922,5


In [40]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981756 entries, 0 to 981755
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   book_id  981756 non-null  int64
 1   user_id  981756 non-null  int64
 2   rating   981756 non-null  int64
dtypes: int64(3)
memory usage: 22.5 MB


In [41]:
books_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

Tenemos además un dataset de tags, que nos da detalles sobre el argumento del libro.

In [42]:
book_tags = pd.read_csv('./data/books/book_tags.csv')
tags = pd.read_csv('./data/books/tags.csv')

In [43]:
data = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how='inner')
titles = books_metadata[['id', 'book_id','title','authors']]
data = pd.merge(titles, data, left_on='book_id', right_on='goodreads_book_id')
data.drop(['goodreads_book_id', 'tag_id', 'count'], axis=1,inplace=True)
data.head(3)

Unnamed: 0,id,book_id,title,authors,tag_name
0,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,to-read
1,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,fantasy
2,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,favorites


In [44]:
list_tags = data.groupby(by='book_id')['tag_name'].apply(set).apply(list)
books_metadata['tags'] = books_metadata['book_id'].apply(lambda x: ' '.join(list_tags[x]))
pd.reset_option('max_colwidth')
books_metadata.head(3)

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,tags
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,my-books read-2012 coming-of-age borrowed my-l...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,my-books owned-books my-library i-own classics...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,my-books read-in-2008 never-again owned-books ...


In [45]:
books_metadata['tags'][0]

'my-books read-2012 coming-of-age borrowed my-library i-own read-in-2012 read-more-than-once ebooks audio read-2011 action thriller series reviewed sci-fi-fantasy favorite favorite-books read-in-2013 distopian owned adventure default fantasy favorites hunger-games trilogy science-fiction suspense action-adventure audiobook sf ya-fantasy teen-fiction favourite-books favorite-series shelfari-favorites 5-stars teen ebook favs fantasy-sci-fi favourites dystopian-fiction faves books-i-own sci-fi finished read-in-2014 young-adult dystopia love the-hunger-games fiction reread distopia my-favorites post-apocalyptic future love-triangle suzanne-collins all-time-favorites dystopias novels books contemporary finished-series to-buy completed-series scifi drama english read-in-2011 young-adult-fiction scifi-fantasy currently-reading survival ya-dystopian ya-lit library novel favourite 2012-reads e-book read-in-2010 to-read dystopian ya-books ya-fiction futuristic book-club ya loved audiobooks 5-sta

In [46]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Al trabajar con texto, podemos aplicar una transformación llamada **TF-IDF**, o Term Frequency-Inverse Document Frecuency. 

El **número de features** que crea es igual al número total de palabras distintas utilizadas en la columna de tags. 

Los **valores** son directamente *proporcionales al número de veces que se utiliza una palabra concreta* e *inversamente proporcionales al número de documentos (películas en este caso) en los que se utiliza la palabra*.

Penalizará una palabra presente en los tags de un libro si es común a muchos libros. Las palabras que aparecen varias veces pero son comunes a muchos libros no son tan útiles para diferenciar los distintos libros.

En otras palabras, **el proceso TF-IDF nos dice que tan relevante es una palabra a un documento, en una colección de documentos**.

### Cálculo

t - término (palabra) <br>
d - documento (conjunto de palabras) <br>
N - cantidad de documentos
<br>

tf(t,d) = cantidad de t in d / cantidad de palabras en d

df(t) = ocurrencia de t en N documentos

idf(t) = N/df

IDF es la inversa de la frecuencia de documentos que mide que tan informativo es el término t. Cuando calculamos IDF, será muy bajo para las palabras más frecuentes, como las stop words (porque están presentes en casi todos los documentos, y N/df dará un valor muy bajo a esa palabra). Esto da finalmente lo que queremos, un peso relativo.

Se suele usar el logaritmo para contrarrestar el efecto de un N muy grande: idf(t) = log(N/(df + 1)).

El score final será la multiplicación de ambos términos:

**tf-idf(t, d) = tf(t, d) * log(N/(df + 1))**

In [47]:
tfidf = TfidfVectorizer(stop_words='english')

tags_matrix = tfidf.fit_transform(books_metadata['tags'])

tags_matrix.shape

(10000, 16396)

In [48]:
tags_matrix

<10000x16396 sparse matrix of type '<class 'numpy.float64'>'
	with 767020 stored elements in Compressed Sparse Row format>

In [49]:
#Cantidad de veces que se repite cada término
tfidf.vocabulary_

{'books': 2075,
 'read': 11635,
 '2012': 253,
 'coming': 3228,
 'age': 686,
 'borrowed': 2112,
 'library': 8325,
 'ebooks': 4490,
 'audio': 1333,
 '2011': 251,
 'action': 563,
 'thriller': 14012,
 'series': 12640,
 'reviewed': 11939,
 'sci': 12468,
 'fi': 5268,
 'fantasy': 5132,
 'favorite': 5188,
 '2013': 254,
 'distopian': 4180,
 'owned': 10424,
 'adventure': 641,
 'default': 3899,
 'favorites': 5190,
 'hunger': 6863,
 'games': 5744,
 'trilogy': 14289,
 'science': 12470,
 'fiction': 5289,
 'suspense': 13612,
 'audiobook': 1334,
 'sf': 12671,
 'ya': 15260,
 'teen': 13848,
 'favourite': 5196,
 'shelfari': 12754,
 'stars': 13324,
 'ebook': 4487,
 'favs': 5199,
 'favourites': 5197,
 'dystopian': 4451,
 'faves': 5182,
 'finished': 5358,
 '2014': 255,
 'young': 15308,
 'adult': 632,
 'dystopia': 4450,
 'love': 8570,
 'reread': 11889,
 'distopia': 4179,
 'post': 11099,
 'apocalyptic': 1068,
 'future': 5686,
 'triangle': 14275,
 'suzanne': 13620,
 'collins': 3179,
 'time': 14053,
 'dystopias

In [50]:
doc = 0
feature_index = tags_matrix[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tags_matrix[doc, x] for x in feature_index])

In [51]:
books_metadata['tags'][0]

'my-books read-2012 coming-of-age borrowed my-library i-own read-in-2012 read-more-than-once ebooks audio read-2011 action thriller series reviewed sci-fi-fantasy favorite favorite-books read-in-2013 distopian owned adventure default fantasy favorites hunger-games trilogy science-fiction suspense action-adventure audiobook sf ya-fantasy teen-fiction favourite-books favorite-series shelfari-favorites 5-stars teen ebook favs fantasy-sci-fi favourites dystopian-fiction faves books-i-own sci-fi finished read-in-2014 young-adult dystopia love the-hunger-games fiction reread distopia my-favorites post-apocalyptic future love-triangle suzanne-collins all-time-favorites dystopias novels books contemporary finished-series to-buy completed-series scifi drama english read-in-2011 young-adult-fiction scifi-fantasy currently-reading survival ya-dystopian ya-lit library novel favourite 2012-reads e-book read-in-2010 to-read dystopian ya-books ya-fiction futuristic book-club ya loved audiobooks 5-sta

In [None]:
feature_names = tfidf.get_feature_names_out()
tfidf_scores_sorted=[]
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    tfidf_scores_sorted.append([w,s])
    print(w,s)
tfidf_scores_sorted    

romance 0.04199622977855118
kindle 0.02671163235919224
speculative 0.07251109380641134
star 0.07502456600504018
audiobooks 0.031266504133500087
loved 0.0821114576482351
club 0.04099937951732761
futuristic 0.10222335450570373
2010 0.061259574090483174
book 0.05075185270695867
reads 0.037744265378818725
novel 0.04076188141812142
lit 0.0384369552648043
survival 0.09227270649072088
reading 0.02298125553548972
currently 0.023230810810032552
english 0.042713331941499955
drama 0.049316044695898904
scifi 0.11833114913195372
completed 0.08957555690286702
buy 0.025624979389037798
contemporary 0.036229004568697314
novels 0.03321781659626496
dystopias 0.11595489846127827
time 0.04811754784979944
collins 0.16278266572379443
suzanne 0.1689599755316416
triangle 0.09451559019735939
future 0.10148334660793373
apocalyptic 0.09414362016630412
post 0.09032041817820313
distopia 0.12415576195547645
reread 0.06447599331006632
love 0.1182182733789269
dystopia 0.08248694113279854
adult 0.050274307002089944
you

[['romance', 0.04199622977855118],
 ['kindle', 0.02671163235919224],
 ['speculative', 0.07251109380641134],
 ['star', 0.07502456600504018],
 ['audiobooks', 0.031266504133500087],
 ['loved', 0.0821114576482351],
 ['club', 0.04099937951732761],
 ['futuristic', 0.10222335450570373],
 ['2010', 0.061259574090483174],
 ['book', 0.05075185270695867],
 ['reads', 0.037744265378818725],
 ['novel', 0.04076188141812142],
 ['lit', 0.0384369552648043],
 ['survival', 0.09227270649072088],
 ['reading', 0.02298125553548972],
 ['currently', 0.023230810810032552],
 ['english', 0.042713331941499955],
 ['drama', 0.049316044695898904],
 ['scifi', 0.11833114913195372],
 ['completed', 0.08957555690286702],
 ['buy', 0.025624979389037798],
 ['contemporary', 0.036229004568697314],
 ['novels', 0.03321781659626496],
 ['dystopias', 0.11595489846127827],
 ['time', 0.04811754784979944],
 ['collins', 0.16278266572379443],
 ['suzanne', 0.1689599755316416],
 ['triangle', 0.09451559019735939],
 ['future', 0.1014833466079

In [None]:
tfidf_scores_sorted_df=pd.DataFrame(tfidf_scores_sorted)
tfidf_scores_sorted_df.columns=['token','val']
tfidf_scores_sorted_df.sort_values(by='val',ascending=False,inplace=True)
tfidf_scores_sorted_df.head(20)

Unnamed: 0,token,val
56,hunger,0.354943
48,ya,0.301504
55,games,0.28707
40,dystopian,0.243866
79,read,0.227162
64,fantasy,0.209399
26,suzanne,0.16896
25,collins,0.162783
52,fiction,0.159109
46,favourite,0.156012


¿Que palabras comunes se les ocurre que podríamos sacar?

### Similitud Coseno

<img src="similitud_coseno.jpg" align="center" width="60%"/>

Así como vimos la distancia euclideana y el coeficiente de Pearson para estimar la similitud entre dos personas o ítems, existe otra medida muy utilizada llamada *similitud coseno*. En este caso la utilizaremos para estimar la similitud entre los contenidos de los ítems.

La imagen muestra la similitud coseno en dos dimensiones. Si dos vectores de datos son cercanos, el ángulo entre estos dos vectores es pequeño y la similitud coseno será alta. En el extremo, dos vectores iguales tendrán un ángulo de 0 grados, un coseno de 1, y una similitud coseno también de 1.

Naturalmente, podemos llevar el concepto para N dimensiones.

In [None]:
similarity_matrix = cosine_similarity(tags_matrix,tags_matrix)
similarity_matrix

array([[1.        , 0.38961343, 0.40894373, ..., 0.09091467, 0.16148787,
        0.03602845],
       [0.38961343, 1.        , 0.37207812, ..., 0.0886315 , 0.16692497,
        0.04592596],
       [0.40894373, 0.37207812, 1.        , ..., 0.05919159, 0.09024158,
        0.02882952],
       ...,
       [0.09091467, 0.0886315 , 0.05919159, ..., 1.        , 0.05409486,
        0.12427748],
       [0.16148787, 0.16692497, 0.09024158, ..., 0.05409486, 1.        ,
        0.11579571],
       [0.03602845, 0.04592596, 0.02882952, ..., 0.12427748, 0.11579571,
        1.        ]])

In [None]:
def recommend_books_based_on_plot(book_input,n=15):
    book_index = books_metadata.loc[books_metadata.original_title==book_input,:].index[0]

    #get similarity values with other books
    #similarity_score is the list of index and similarity matrix
    similarity_score = list(enumerate(similarity_matrix[book_index]))

    #sort in descending order the similarity score of book inputted with all the other books
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    # Get the scores of the 15 most similar books. Ignore the first book.
    similarity_score = similarity_score[1:n]

    #return book names using the mapping series
    book_indices = [i[0] for i in similarity_score]

    return (books_metadata[['original_title','authors','tags']].iloc[book_indices])

In [None]:
recommend_books_based_on_plot('Nineteen Eighty-Four',10)

Unnamed: 0,original_title,authors,tags
54,Brave New World,Aldous Huxley,my-books owned-books my-library banned-books i...
13,Animal Farm: A Fairy Story,George Orwell,my-books school-reads owned-books borrowed my-...
47,Fahrenheit 451,Ray Bradbury,my-books school-reads owned-books borrowed my-...
27,Lord of the Flies,William Golding,my-books school-reads coming-of-age owned-book...
70,"Frankenstein; or, The Modern Prometheus","Mary Wollstonecraft Shelley, Percy Bysshe Shel...",my-books school-reads owned-books my-library i...
172,A Clockwork Orange,Anthony Burgess,2015-reading-challenge my-books owned-books ba...
808,Brave New World/Brave New World Revisited,"Aldous Huxley, Christopher Hitchens",my-books owned-books my-library i-own classics...
7,The Catcher in the Rye,J.D. Salinger,america my-books school-reads coming-of-age ow...
845,Animal Farm & 1984,"George Orwell, Christopher Hitchens",philosophical best my-books school-reads owned...


In [None]:
recommend_books_based_on_plot('Nineteen Eighty-Four',15)

Unnamed: 0,original_title,authors,tags
54,Brave New World,Aldous Huxley,my-books owned-books my-library banned-books i...
13,Animal Farm: A Fairy Story,George Orwell,my-books school-reads owned-books borrowed my-...
47,Fahrenheit 451,Ray Bradbury,my-books school-reads owned-books borrowed my-...
27,Lord of the Flies,William Golding,my-books school-reads coming-of-age owned-book...
70,"Frankenstein; or, The Modern Prometheus","Mary Wollstonecraft Shelley, Percy Bysshe Shel...",my-books school-reads owned-books my-library i...
172,A Clockwork Orange,Anthony Burgess,2015-reading-challenge my-books owned-books ba...
808,Brave New World/Brave New World Revisited,"Aldous Huxley, Christopher Hitchens",my-books owned-books my-library i-own classics...
7,The Catcher in the Rye,J.D. Salinger,america my-books school-reads coming-of-age ow...
845,Animal Farm & 1984,"George Orwell, Christopher Hitchens",philosophical best my-books school-reads owned...
64,"Slaughterhouse-Five, or The Children's Crusade...",Kurt Vonnegut Jr.,my-books owned-books world-war-ii my-library b...


In [None]:
recommend_books_based_on_plot('Misery')

Unnamed: 0,original_title,authors,tags
910,Gerald's Game,Stephen King,stephenking my-books unread owned-books borrow...
1074,Dolores Claiborne,"Stephen King, Dominique Dill",stephenking my-books unread owned-books my-lib...
674,Firestarter,Stephen King,stephenking my-books unread owned-books borrow...
304,Pet Sematary,Stephen King,stephenking my-books owned-books borrowed my-l...
669,The Dead Zone,Stephen King,stephenking my-books unread owned-books borrow...
6322,,Stephen King,stephenking my-books unread coming-of-age owne...
552,Needful Things,Stephen King,stephenking my-books unread owned-books borrow...
1422,Rose Madder,Stephen King,stephenking my-books unread owned-books my-lib...
914,Insomnia,"Stephen King, Bettina Blanch Tyroller",stephenking my-books owned-books my-library i-...
704,Thinner,"Richard Bachman, Stephen King",stephenking my-books unread borrowed my-librar...


In [None]:
recommend_books_based_on_plot('The Hobbit')

Unnamed: 0,original_title,authors,tags
3520,Neverwhere Graphic Novel,"Mike Carey, Glenn Fabry, Neil Gaiman",graphic-novels-manga my-books vertigo fantasy-...
963,The Hobbit and The Lord of the Rings,J.R.R. Tolkien,fantasy-scifi my-library i-own classics person...
154,The Two Towers,J.R.R. Tolkien,my-books fantasy-scifi owned-books my-library ...
18,The Fellowship of the Ring,J.R.R. Tolkien,my-books fantasy-scifi owned-books my-library ...
7421,Coraline,"Neil Gaiman, P. Craig Russell",my-books coming-of-age library-books owned-boo...
6,The Hobbit or There and Back Again,J.R.R. Tolkien,my-books fantasy-scifi owned-books my-library ...
160,The Return of the King,J.R.R. Tolkien,my-books fantasy-scifi owned-books my-library ...
8373,Darth Vader and Son,Jeffrey Brown,quadrinhos read-2012 pop-culture space library...
6737,The Hedge Knight,"Ben Avery, Mike S. Miller, George R.R. Martin",2015-reading-challenge my-books got fantasy-sc...
1697,The Dark Tower: The Gunslinger Born,"Peter David, Robin Furth, Jae Lee, Richard Isa...",graphic-novels-manga my-books read-in-2008 wes...


Para obtener recomendaciones más diversas, podríamos dejar afuera las novelas del mismo autor de la novela target.

## Ejemplo práctico de un SR basado en filtros colaborativos: Librería Surprise

Tenemos que crear un objeto dataset para trabajar con Surprise. Este dataset contiene los siguientes elementos:
1. Los IDs de los usuarios
2. Los IDs de cada ítem (en este caso, de cada libro)
3. El rating correspondiente (en este caso, en una escala del 1 al 5)


In [None]:
from surprise import Dataset
from surprise import Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_data[['user_id', 'book_id', 'rating']], reader)

### Entrenamos un modelo SVD con crossvalidation

Podemos entrenar con cross-validation un modelo que realiza SVD (descomposición de valor singular o Singular Value Decomposition) para construir un sistema de recomendación. SVD es un algoritmo de factorización de matrices que puede utilizarse para sistemas de recomendación.

Los sistemas de recomendación que utilizan la factorización matricial suelen seguir un patrón en el que una matriz de puntuaciones o ratings se factoriza en un producto de matrices que representan factores latentes para los ítems (en este caso, libros) y los usuarios. Funciona con la técnica de factorización matricial que vimos la clase pasada.

In [None]:
from surprise import SVD
from surprise.model_selection import cross_validate

svd = SVD(verbose=True, n_epochs=10)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8560  0.8566  0.8556  0.8560  0.0004  
MAE (testset)     0.6751  0.6756  0.6751  0.6752  0.0002  
Fit time          3.05    3.38    3.38    3.27    0.16    
Test time         1.42    1.60    1.66    1.56    0.10    


{'test_rmse': array([0.85598642, 0.8565825 , 0.85557451]),
 'test_mae': array([0.67507586, 0.67556842, 0.67509726]),
 'fit_time': (3.046085834503174, 3.3755037784576416, 3.38267183303833),
 'test_time': (1.420814037322998, 1.603020191192627, 1.6629512310028076)}

También podemos entrenar el modelo en todo el conjunto de datos utilizando el método fit después de convertir el objeto Dataset en un objeto Surprise Trainset utilizando el método build_full_trainset.

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14a4fd970>

Ahora podemos realizar predicciones para un ID de un libro específico, con el método predict.

In [None]:
svd.predict(uid=10, iid=100)

Prediction(uid=10, iid=100, r_ui=None, est=4.057148878995296, details={'was_impossible': False})

Tomemos de ejemplo un fanático confeso de los libros de Dan Brown (autor de El Código Da Vinci). Observemos sus puntuaciones a los libros de dicho autor:

In [None]:
books_metadata.loc[books_metadata.authors.str.contains('Dan Brown'),:]

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,tags
8,9,960,960,3338963,311,1416524797,9781417000000.0,Dan Brown,2000.0,Angels & Demons,...,2078754,25112,77841,145740,458429,716569,680175,https://images.gr-assets.com/books/1303390735m...,https://images.gr-assets.com/books/1303390735s...,my-books 2006 owned-books borrowed my-library ...
25,26,968,968,2982101,350,307277674,9780307000000.0,Dan Brown,2003.0,The Da Vinci Code,...,1557292,41560,71345,126493,340790,539277,479387,https://images.gr-assets.com/books/1303252999m...,https://images.gr-assets.com/books/1303252999s...,my-books 2006 owned-books borrowed my-library ...
144,145,976,976,3135896,180,671027387,9780671000000.0,Dan Brown,2001.0,Deception Point,...,479733,7798,14380,45006,143182,160819,116346,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,action-thriller my-books 2006 owned-books tech...
169,170,11125,11125,40195,185,552151696,9780552000000.0,Dan Brown,1998.0,Digital Fortress,...,447503,7538,16592,47507,136196,144029,103179,https://images.gr-assets.com/books/1360095966m...,https://images.gr-assets.com/books/1360095966s...,action-thriller my-books 2006 owned-books tech...
200,201,6411961,6411961,6600281,208,385504225,9780386000000.0,Dan Brown,2009.0,The Lost Symbol,...,404982,24199,11293,40545,120122,136545,96477,https://images.gr-assets.com/books/1358274396m...,https://images.gr-assets.com/books/1358274396s...,my-books owned-books borrowed my-library i-own...
239,240,17212231,17212231,23841765,235,385537859,9780386000000.0,Dan Brown,2013.0,Inferno,...,345560,33700,7261,26077,89237,128374,94611,https://images.gr-assets.com/books/1397093185m...,https://images.gr-assets.com/books/1397093185s...,my-books inferno nook owned-books 2013-books b...
4495,4496,643301,643301,24585562,6,593054601,9780593000000.0,Dan Brown,2003.0,,...,18765,476,254,636,2797,6648,8430,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,my-books unread owned-books my-library not-int...


In [None]:
libros_db=books_metadata.loc[books_metadata.authors.str.contains('Dan Brown'),'id'].values
libros_db

array([   9,   26,  145,  170,  201,  240, 4496])

In [None]:
libros_leidos_usuario_db=ratings_data.loc[(ratings_data.user_id==18313)&(ratings_data.book_id.isin(libros_db)),:]
libros_leidos_usuario_db

Unnamed: 0,book_id,user_id,rating
843,9,18313,5
2539,26,18313,5


Vemos que en general hace una muy buena evaluación de los libros de DB

In [None]:
libros_db[~np.isin(libros_db,libros_leidos_usuario_db)]

array([ 145,  170,  201,  240, 4496])

In [None]:
pendientes_db=libros_db[~np.isin(libros_db,libros_leidos_usuario_db)]
pred=[]
for libro in pendientes_db:
    pred.append(svd.predict(uid=18313, iid=libro).est)

media_db=ratings_data.loc[ratings_data.book_id.isin(libros_db),"rating"].mean()    
print(f'Media ratings libros DB: {round(media_db,2)}')    
print(f'Media ratings puntuación estimada libros no leídos de DB por usuario 18313: {round(np.array(pred).mean(),2)}')    


Media ratings libros DB: 3.44
Media ratings puntuación estimada libros no leídos de DB por usuario 18313: 3.67


Vemos que la puntuación promedio estimada por nuestro modelo para los libros aún no leídos de Dan Brown es superior al promedio.
<br>

*Dan Brown es un autor muy popular y por eso quizás no vemos una diferencia sustancial entre el promedio estimado por nuestro usuario y el promedio general. Podría verificarse la diferencia entre ambos promedios tomando como referencia otro autor menos popular*.

### Generando recomendaciones

In [None]:
import difflib
import random

def get_book_id(book_title, metadata):
    
    existing_titles = list(metadata['title'].values)
    closest_titles = difflib.get_close_matches(book_title, existing_titles)
    book_id = metadata[metadata['title'] == closest_titles[0]]['id'].values[0]
    return book_id

def get_book_info(book_id, metadata):
    
    book_info = metadata[metadata['id'] == book_id][['id', 'isbn', 
                                                    'authors', 'title', 'original_title']]
    return book_info.to_dict(orient='records')

def predict_review(user_id, book_title, model, metadata):
    
    book_id = get_book_id(book_title, metadata)
    review_prediction = model.predict(uid=user_id, iid=book_id)
    return review_prediction.est

def generate_recommendation(user_id, model, metadata, thresh=4):
    
    book_titles = list(metadata['title'].values)
    random.shuffle(book_titles)
    
    for book_title in book_titles:
        rating = predict_review(user_id, book_title, model, metadata)
        if rating >= thresh:
            book_id = get_book_id(book_title, metadata)
            return get_book_info(book_id, metadata)


La función generate_recommendation genera una recomendación de libros para un usuario iterando a través de la lista de títulos de libros y prediciendo las valoraciones de los usuarios para cada título hasta que encuentra un libro con una valoración igual o superior al umbral especificado que lo califica para ser recomendado a un usuario. 

In [None]:
%%time
generate_recommendation(18313, svd, books_metadata, 4.8)

CPU times: user 3 s, sys: 9.89 ms, total: 3.01 s
Wall time: 3.1 s


[{'id': 3082,
  'isbn': '1563891379',
  'authors': 'Neil Gaiman, Jill Thompson, Vince Locke, Peter Straub',
  'title': 'Brief Lives (The Sandman #7)',
  'original_title': 'The Sandman: Brief Lives'}]

### Visualizando las similitudes entre los libros utilizando t-SNE

In [None]:
svd.qi.shape

(10000, 100)

Ahora podemos apreciar la similitud entre los libros a partir de la matriz de factorización resultante luego de aplicar el algoritmo SVD.

Esta matriz de 10.000 x 100 tiene un vector de 100 dimensiones para cada libro, que son demasiadas dimensiones para que podamos visualizarlas intuitivamente, pero podemos utilizar una técnica de reducción de la dimensionalidad para representar cada libro como un punto bidimensional en el espacio. 

Podemos utilizar t-SNE (t-Distributed Stochastic Neighbors Embedding, una alternativa a PCA) para representar cada libro como un punto bidimensional.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, n_iter=500, verbose=3, random_state=1)
books_embedding = tsne.fit_transform(svd.qi)
projection = pd.DataFrame(columns=['x', 'y'], data=books_embedding)
projection['title'] = books_metadata['original_title']

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.000s...
[t-SNE] Computed neighbors for 10000 samples in 0.450s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.233667
[t-SNE] Computed conditional probabilities in 0.146s
[t-SNE] Iteration 50: error = 97.0348206, gradient norm = 0.0568880 (50 iterations in 3.090s)
[t-SNE] It

KeyboardInterrupt: 

In [None]:
import plotly.express as px

fig = px.scatter(
    projection, x='x', y='y'
)
fig.show()

report = dp.Report(dp.Plot(fig) ) #Create a report
# report.publish(name='books_scatter_plot', open=True, visibility='PUBLIC') #Publish the report

Podemos ver que los puntos que representan los 10.000 libros parecen seguir una distribución normal bidimensional. Podemos explicar esta distribución de la siguiente forma:

- Algunos libros pueden ser populares en general entre un amplio abanico de audiencias y, por tanto, corresponder a puntos en el centro de este diagrama de dispersión.
- Otros libros pueden pertenecer a géneros muy específicos, como las novelas de vampiros, de misterio o románticas, que son populares entre un público concreto. Estos libros pueden corresponder a puntos alejados del centro del diagrama.

In [None]:
import datapane as dp

def plot_books(titles, plot_name):
    
    book_indices = []
    for book in titles:
        book_indices.append(get_book_id(book, books_metadata)-1)
        
    book_vector_df = projection.iloc[book_indices]
    
    fig = px.scatter(
    book_vector_df, x='x', y='y', text='title',
    )
    fig.show()
    
    report = dp.Report(dp.Plot(fig) ) #Create a report
#     report.publish(name=plot_name, open=True, visibility='PUBLIC') #Publish the report

¿Qué pasa si observamos algunos libros específicos? ¿Su similitud de características se ve reflejada en el nuevo espacio de dimensiones?

In [None]:
books = list(books_metadata['title'][:30])
plot_books(books, plot_name='books_embedding')

In [None]:
from collections import defaultdict
from surprise import Dataset, SVD

def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2a4ecee20>

In [None]:
# Predice los ratings para todo par (u, i) que no aparece en train
# testset = trainset.build_anti_testset()
# predictions = algo_baseline.test(testset)

# Recomendamos usar un subsample porque las combinaciones de libros/usuarios que no aparecen en el set de
# train es muy grande!


In [None]:
#Predecimos los puntajes de los libros no leídos por el usuario amante de DB
libros_leidos_usuario=ratings_data.loc[(ratings_data.user_id==18313),'book_id']
libros_no_leidos_usuario=ratings_data.loc[~ratings_data.book_id.isin(libros_leidos_usuario),'book_id'].unique()

pred=[]
for libro in libros_no_leidos_usuario:
    pred.append(svd.predict(uid=18313, iid=libro))


In [None]:
pred

[Prediction(uid=18313, iid=6, r_ui=None, est=4.055893198586564, details={'was_impossible': False}),
 Prediction(uid=18313, iid=11, r_ui=None, est=4.410607082110741, details={'was_impossible': False}),
 Prediction(uid=18313, iid=12, r_ui=None, est=4.05997928144963, details={'was_impossible': False}),
 Prediction(uid=18313, iid=13, r_ui=None, est=3.962854418971632, details={'was_impossible': False}),
 Prediction(uid=18313, iid=15, r_ui=None, est=4.03924420213561, details={'was_impossible': False}),
 Prediction(uid=18313, iid=16, r_ui=None, est=3.9902435430257657, details={'was_impossible': False}),
 Prediction(uid=18313, iid=28, r_ui=None, est=3.792416208128159, details={'was_impossible': False}),
 Prediction(uid=18313, iid=29, r_ui=None, est=4.250935873439344, details={'was_impossible': False}),
 Prediction(uid=18313, iid=30, r_ui=None, est=3.702795215491001, details={'was_impossible': False}),
 Prediction(uid=18313, iid=31, r_ui=None, est=4.1471085374732155, details={'was_impossible': 

In [None]:
top_n = get_top_n(pred, n=10)
top_n

defaultdict(list,
            {18313: [(3628, 5),
              (1308, 4.987570198234232),
              (8663, 4.967533897974668),
              (6920, 4.959674680596504),
              (6361, 4.952630333298262),
              (3395, 4.943302568163949),
              (3736, 4.940580395432595),
              (4229, 4.904086657022224),
              (9076, 4.888844824005519),
              (7264, 4.8796943597651055)]})

In [None]:
# Predecimos el TOP-10 de recomendaciones para el usuario 18313

for uid, user_ratings in top_n.items():
    print([books_metadata.loc[books_metadata.id==iid,'original_title'].values[0] for (iid, _) in user_ratings])

#En el TOP 10 de libros recomendados para nuestro usuario de prueba, vemos libros de ciencia ficción, religión y misterios,
#lo cual guarda sentido con su gusto con los libros de Dan Brown

['The Complete Calvin and Hobbes', 'A Court of Mist and Fury', 'Locke & Key, Vol. 6: Alpha & Omega', 'The Indispensable Calvin and Hobbes: A Calvin and Hobbes Treasury', "There's Treasure Everywhere: A Calvin and Hobbes Collection", 'The Sandman: The Kindly Ones', 'Harry Potter Page to Screen: The Complete Filmmaking Journey', 'The Lord of the Rings: Weapons and Warfare', 'Preach My Gospel (A Guide to Missionary Service)', 'Master of the Senate']


### Comparamos el desempeño de varios algoritmos

In [None]:
ratings_data_f=ratings_data.sample(n=100000)

In [None]:
#CUIDADO! Mejor utilizar el método train_test_split de SK Learn, y luego convertir el conjunto de datos resultante al objeto Dataset
train, test = skl_train_test_split(ratings_data_f, test_size=0.20,random_state=100)
reader = Reader(rating_scale=(1, 5))
data_train = Dataset.load_from_df(train, reader)

algo_svd = SVD()
algo_knn = KNNBasic()
algo_random=NormalPredictor()
algo_baseline=BaselineOnly()
algo_nmf=NMF()

In [None]:
cv_random=cross_validate(algo_random, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)

In [None]:
cv_baseline=cross_validate(algo_baseline, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_svd=cross_validate(algo_svd, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_knn=cross_validate(algo_knn, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)
cv_nmf=cross_validate(algo_nmf, data_train, measures=['RMSE'],cv=4, verbose=False, n_jobs=-1)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
Done computing similarity matrix.
Done computing similarity matrix.


In [None]:
res=[]
res.append(round(cv_random['test_rmse'].mean(),3))
res.append(round(cv_baseline['test_rmse'].mean(),3))
res.append(round(cv_svd['test_rmse'].mean(),3))
res.append(round(cv_knn['test_rmse'].mean(),3))
res.append(round(cv_nmf['test_rmse'].mean(),3))

res_df=pd.DataFrame(pd.concat([pd.Series(['Random','Baseline','SVD','KNN','NMF']),pd.Series(res)],axis=1))
res_df.columns=['Algoritmo','RMSE CV']
res_df

Unnamed: 0,Algoritmo,RMSE CV
0,Random,1.322
1,Baseline,0.934
2,SVD,0.935
3,KNN,0.992
4,NMF,1.118


Los mejores algoritmos parecen ser el baseline y el SVD.

Ver tiempos de ajuste y rendimiento esperado en https://surpriselib.com/

### Métricas específicas de los sistemas de recomendación

<img src="metricas.png" align="center" width="60%"/>

- Un elemento se considera **relevante** si su puntuación es superior a un umbral determinado (por ej, mayor o igual a 4). En otras palabras, binarizamos el puntaje para convertir un problema de regresión en uno de clasificacción.
<br>

- Un elemento se considera **recomendado** si su puntuación estimada es superior al umbral y se encuentra entre las k mejores puntuaciones estimadas.

Tener en cuenta que en los casos extremos en los que se produce una división por cero, los valores Precision@k y Recall@k no están definidos. Por convención, fijamos sus valores en 0 en estos casos.

In [None]:
from collections import defaultdict

from surprise import Dataset, SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [None]:
kf = KFold(n_splits=4)
# algo = SVD()
precision=[]
recall=[]
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=4)

    # Precision and recall can then be averaged over all users
    precision.append(sum(prec for prec in precisions.values()) / len(precisions))
    recall.append(sum(rec for rec in recalls.values()) / len(recalls))

In [None]:
len(predictions)

245439

In [None]:
len(precision)

In [None]:
len(recall)

In [None]:
#Top 5
# print(f'Precision media: {np.array(precision).mean()}')
# print(f'Recall medio: {np.array(recall).mean()}')

# Precision media: 0.49
# Recall medio: 0.34

In [None]:
#Top 10
print(f'Precision media: {np.array(precision).mean()}')
print(f'Recall medio: {np.array(recall).mean()}')

# Precision media: 0.49
# Recall medio: 0.37

Precision media: 0.4858455931008703
Recall medio: 0.3696506802942668


Esto significa que, en promedio, el 48% del top 10 de recomendaciones hechas para cada usuario son relevantes (con un puntaje estimado mayor o igual a 4), mientras que de todos los libros relevantes, el 37% aparecieron en el top 10.

--------

Podríamos aumentar k, lo que resultaría en **mayor recall a costa de una menor precision**. Esto es equivalente a utilizar un punto de corte menor (= menos exigentes) en un modelo convencional de ML.

# PRÁCTICA

Construir dos sistemas de recomendación:
- uno basado en contenidos a partir de la columna "overview" (argumento de la película). Generar recomendaciones de películas similar a Toy Story.
- un modelo basado en filtros colaborativos a partir de los ratings. Generar 10 recomendaciones para un fanático de Star Wars.

Además de lo visto en clase, puede ayudarse a partir de lo que aparece acá:
https://www.datacamp.com/tutorial/recommender-systems-python

El dataset completo está acá: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data

TAREA para el hogar: construir una función para medir precision@k y recall@k a partir del listado de recomendaciones del SR basado en contenidos.

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [None]:
ratings_metadata = pd.read_csv('./data/movies/ratings_small.csv', low_memory=False)
movies_metadata = pd.read_csv('./data/movies/movies_metadata.csv', low_memory=False)

print(ratings_metadata.shape)
ratings_metadata.head()

(100004, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [None]:
print(movies_metadata.shape)
movies_metadata.head()

(45466, 24)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [None]:
links = pd.read_csv('./data/movies/links.csv', low_memory=False)
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
45838,176269,6209470,439050.0
45839,176271,2028550,111109.0
45840,176273,303758,67758.0
45841,176275,8536,227506.0


In [None]:
links.dtypes

movieId      int64
imdbId       int64
tmdbId     float64
dtype: object

In [None]:
print(movies_metadata.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')


In [None]:
import difflib
import random

def get_book_id(book_title, metadata):
    
    existing_titles = list(metadata['title'].values)
    closest_titles = difflib.get_close_matches(book_title, existing_titles)
    book_id = metadata[metadata['title'] == closest_titles[0]]['id'].values[0]
    return book_id

def get_book_info(book_id, metadata):
    
    book_info = metadata[metadata['id'] == book_id][['id', 'imdb_id', 
                                                    'production_companies', 'title', 'original_title']]
    return book_info.to_dict(orient='records')

def predict_review(user_id, book_title, model, metadata):
    
    book_id = get_book_id(book_title, metadata)
    review_prediction = model.predict(uid=user_id, iid=book_id)
    return review_prediction.est

def generate_recommendation(user_id, model, metadata, thresh=4):
    
    book_titles = list(metadata['title'].values)
    random.shuffle(book_titles)
    
    for book_title in book_titles:
        rating = predict_review(user_id, book_title, model, metadata)
        if rating >= thresh:
            book_id = get_book_id(book_title, metadata)
            return get_book_info(book_id, metadata)

In [None]:
user_id=18313
generate_recommendation(user_id, svd, movies_metadata)

# reader = Reader(rating_scale=(1, 5))
# data = Dataset.load_from_df(ratings_data[['user_id', 'book_id', 'rating']], reader)

TypeError: object of type 'float' has no len()

In [None]:
get_book_id()

## BIBLIOGRAFÍA

- https://towardsdatascience.com/how-you-can-build-simple-recommender-systems-with-surprise-b0d32a8e4802
- https://surprise.readthedocs.io/
- https://medium.com/mlearning-ai/basic-content-based-recommendation-system-with-python-code-be920b412067
- https://medium.com/analytics-vidhya/content-based-recommender-systems-in-python-2b330e01eb80
- https://towardsdatascience.com/hands-on-content-based-recommender-system-using-python-1d643bf314e4

LightFM
- https://towardsdatascience.com/solving-business-usecases-by-recommender-system-using-lightfm-4ba7b3ac8e62


