
<h1 align=center><font size = 5> ALS Recommender</font></h1>

---

<center>
  <img src="https://bobliu.io/assets/img/cards.509a5045.jpg" width="800" height="300">
</center>


## Objetivo de este Notebook

1. Cargar y preprocesar un Dataset.
2. Realizar un sistema de recomendación basado en ALS.
3. Comprobar el performance del sistema.

## Tabla de Contenidos

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Contexto</a>  
2. <a href="#item32">Descargar y preparar el Dataset</a>  
6. <a href="#item34">Entrenamiento del modelo</a>  
6. <a href="#item34">Validación del modelo</a>  

</font>
</div>

## 1. Contexto


El conjunto de datos MovieLens es uno de los conjuntos de datos de recomendación más populares y ampliamente utilizados en la investigación de sistemas de recomendación. Fue creado por el GroupLens Research Project en la Universidad de Minnesota para impulsar la investigación en sistemas de recomendación, proporcionando un recurso valioso para la comunidad académica y promoviendo el desarrollo y la comprensión de tecnologías de recomendación personalizada.


<b>Descripción de datos</b>

1.   List item
2.   List item



El conjunto de datos MovieLens contiene información sobre:

<b>Películas:</b> Detalles sobre las películas, incluyendo su título, género y año de lanzamiento.

<b>Usuarios:</b> Perfiles de los usuarios que han calificado y/o etiquetado las películas, incluyendo su ID y otros detalles demográficos opcionales.

<b>Calificaciones:</b> Calificaciones numéricas que los usuarios asignan a las películas en una escala de 1 a 5.

<b>Etiquetas:</b> Palabras clave o tags proporcionados por los usuarios para describir el contenido o la esencia de las películas.

El conjunto de datos es ampliamente utilizado con fines académicos y de investigación, siendo una referencia en el diseño y evaluación de sistemas de recomendación de películas. También es útil para el análisis de tendencias y comportamientos en la visualización de películas y la interacción del usuario con el contenido.

<strong>Puede consultar este [link](https://grouplens.org/datasets/movielens/) para leer más sobre la fuente de datos MovieLens proporcionada por GroupLens Research en la Universidad de Minnesota.</strong>

## 2. Descargar y preparar Dataset

In [1]:
# Descargar el dataset Movielens
!curl -o dataset.zip "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
!unzip dataset.zip
!ls -la

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  955k  100  955k    0     0  2391k      0 --:--:-- --:--:-- --:--:-- 2394k
Archive:  dataset.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  
total 976
drwxr-xr-x 1 root root   4096 Oct 12 21:22 .
drwxr-xr-x 1 root root   4096 Oct 12 21:21 ..
drwxr-xr-x 4 root root   4096 Oct 11 13:22 .config
-rw-r--r-- 1 root root 978202 Oct 12 21:22 dataset.zip
drwxr-xr-x 2 root root   4096 Sep 26  2018 ml-latest-small
drwxr-xr-x 1 root root   4096 Oct 11 13:23 sample_data


In [2]:
# Principales librerías
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore") # Turn off warnings


In [3]:
links   = pd.read_csv("ml-latest-small/links.csv")
movies  = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")
tags    = pd.read_csv("ml-latest-small/tags.csv")


In [4]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
print("  Movies: {} \n  Ratings: {}".format(len(movies), len(ratings)))


  Movies: 9742 
  Ratings: 100836


In [9]:
# Fusiona ambos datasets basados en la columna 'movieId'
data = pd.merge(ratings, movies, on='movieId')

In [10]:
movie_titles = data['title'].unique().tolist()
movie_ids = data['movieId'].unique().tolist()


In [11]:
# Crear matriz pivotada de usuarios y películas
user_movie_rating = data.pivot_table(index='userId', columns='title', values='rating')


In [12]:
#500 películas más vistas
movies_pop = user_movie_rating.isnull().sum().sort_values()[:500]


In [13]:
user_movie_rating = user_movie_rating[movies_pop.index.tolist()]

In [14]:
user_movie_rating = user_movie_rating.reset_index()

Muestreo (Enmascaramiento)

In [15]:
user_movie_rating

title,userId,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,Analyze This (1999),Mortal Kombat (1995),Gran Torino (2008),"Simpsons Movie, The (2007)",Rumble in the Bronx (Hont faan kui) (1995),Lethal Weapon 3 (1992),Beverly Hills Cop (1984),Phenomenon (1996),M*A*S*H (a.k.a. MASH) (1970),The Butterfly Effect (2004)
0,1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,...,,,,,,,,,5.0,
1,2,,3.0,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,1.0,5.0,1.0,5.0,,,,...,,,,,,,,,,
4,5,,3.0,5.0,,,,,4.0,3.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606,4.0,3.5,5.0,4.5,5.0,4.5,2.5,3.5,3.5,...,3.0,,4.5,3.5,,0.5,2.5,,,
606,607,,5.0,3.0,5.0,5.0,3.0,4.0,5.0,4.0,...,,,,,2.0,,,,3.0,
607,608,3.0,4.5,5.0,4.0,5.0,3.5,3.0,4.0,3.0,...,,0.5,,,,3.0,2.5,3.0,,4.0
608,609,4.0,4.0,4.0,,,,3.0,3.0,3.0,...,,,,,,,,,,


In [16]:
from sklearn.model_selection import train_test_split

# Convertir la matriz pivotada en un DataFrame y dividir en train y test
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)


In [17]:
train_data_matrix = train_data.pivot_table(index='userId', columns='title', values='rating')
test_data_matrix = test_data.pivot_table(index='userId', columns='title', values='rating')

train_data_matrix = train_data_matrix[movies_pop.index.tolist()].reset_index()
test_data_matrix = test_data_matrix[movies_pop.index.tolist()].reset_index()

In [18]:
# Volver a crear matrices pivotadas para entrenamiento y prueba
train_data_matrix = train_data_matrix.fillna(0)
test_data_matrix = test_data_matrix.fillna(0)

In [19]:
test_data_matrix.head()

title,userId,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,Analyze This (1999),Mortal Kombat (1995),Gran Torino (2008),"Simpsons Movie, The (2007)",Rumble in the Bronx (Hont faan kui) (1995),Lethal Weapon 3 (1992),Beverly Hills Cop (1984),Phenomenon (1996),M*A*S*H (a.k.a. MASH) (1970),The Butterfly Effect (2004)
0,1,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3.1 SVD (Singular Value Decomposition)

Aplicaremos el enfoque model based basado en SVD

In [34]:
train_data_matrix.head(5)

title,userId,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,Analyze This (1999),Mortal Kombat (1995),Gran Torino (2008),"Simpsons Movie, The (2007)",Rumble in the Bronx (Hont faan kui) (1995),Lethal Weapon 3 (1992),Beverly Hills Cop (1984),Phenomenon (1996),M*A*S*H (a.k.a. MASH) (1970),The Butterfly Effect (2004)
0,1,4.0,0.0,3.0,4.0,0.0,5.0,4.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
1,2,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,1.0,5.0,1.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,3.0,5.0,0.0,0.0,0.0,0.0,4.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
from numpy.linalg import svd

# Descomponemos la matriz de entrenamiento usando SVD
U, sigma_values, Vt = svd(train_data_matrix.drop(columns = ['userId']), full_matrices=False)

# La matriz sigma devuelta es solo una lista de valores singulares. La convertimos a una matriz diagonal.
sigma = np.diag(sigma_values)


In [36]:
# Predicciones con el modelo
predicted_ratings = np.dot(np.dot(U, sigma), Vt)


In [37]:
predicted_ratings_df = pd.DataFrame(predicted_ratings, columns=train_data_matrix.drop(columns = ['userId']).columns, index=train_data_matrix.index)
predicted_ratings_df['userId'] = train_data_matrix['userId']
predicted_ratings_df.head()


title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Mortal Kombat (1995),Gran Torino (2008),"Simpsons Movie, The (2007)",Rumble in the Bronx (Hont faan kui) (1995),Lethal Weapon 3 (1992),Beverly Hills Cop (1984),Phenomenon (1996),M*A*S*H (a.k.a. MASH) (1970),The Butterfly Effect (2004),userId
0,4.0,2.220446e-15,3.0,4.0,-9.478529e-15,5.0,4.0,4.0,2.114975e-14,7.299716e-15,...,6.258882e-15,-8.559126e-15,-9.15934e-16,1.346145e-14,-3.719247e-15,-6.77236e-15,9.339751e-15,5.0,-2.256875e-15,1
1,-4.1633360000000003e-17,3.0,-6.860484e-14,-3.834433e-14,-1.10606e-14,9.436896e-15,-3.134125e-14,1.083855e-14,-2.284284e-14,-9.006684e-15,...,-8.673617e-16,1.665335e-15,4.385381e-15,2.137179e-15,-1.887379e-15,-3.372302e-15,-1.797174e-15,-7.945034e-16,6.453171e-15,2
2,4.156397e-15,1.158448e-14,6.175616e-16,-2.275263e-14,-8.264223e-15,-5.551115e-15,1.529332e-14,5.842549e-15,-6.27276e-15,0.5,...,-1.94636e-15,-2.050443e-15,-1.221245e-15,-8.53484e-16,-4.8572260000000006e-17,5.995204e-15,-6.661338e-16,4.315992e-15,-9.714451000000001e-17,3
3,1.017242e-14,4.950901e-14,1.0,5.0,1.0,5.0,1.065814e-14,1.19349e-14,7.369105e-15,8.798517e-15,...,-1.054712e-15,-6.52256e-15,-2.275957e-15,4.607426e-15,2.331468e-15,1.387779e-15,3.191891e-15,4.163336e-15,-5.828671e-15,4
4,8.34402e-15,3.0,5.0,5.072331e-15,-4.915339e-15,1.69309e-15,-1.609823e-15,4.0,3.0,5.0,...,2.747802e-15,-4.163336e-15,-5.329071e-15,1.748601e-15,-1.186551e-15,2.220446e-16,3.38618e-15,5.467848e-15,-3.608225e-16,5


Predicciones

In [38]:
# Seleccionar un usuario (por ejemplo, el usuario con ID 82)
user_idx = 72
user_predictions = predicted_ratings_df[predicted_ratings_df.userId == user_idx]

In [39]:
# Peliculas calificadas por el cliente

rated_movies_by_user = train_data_matrix[train_data_matrix.userId == user_idx]
already_rated = rated_movies_by_user[rated_movies_by_user > 0].index.tolist()

In [40]:
pddf_rated_movies_by_user = rated_movies_by_user.T.reset_index()
pddf_rated_movies_by_user.columns = ['title', 'rating']
pddf_rated_movies_by_user = pddf_rated_movies_by_user[pddf_rated_movies_by_user.rating.between(1, 5)]
pddf_rated_movies_by_user.sort_values(by = 'rating', ascending = False, inplace = True)
already_rated = pddf_rated_movies_by_user.title.tolist()

pddf_rated_movies_by_user.head(10)

Unnamed: 0,title,rating
5,"Matrix, The (1999)",5.0
6,Star Wars: Episode IV - A New Hope (1977),5.0
87,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",5.0
21,Star Wars: Episode VI - Return of the Jedi (1983),4.5
19,Raiders of the Lost Ark (Indiana Jones and the...,4.5
137,Casablanca (1942),4.5
74,American History X (1998),4.5
23,"Fugitive, The (1993)",4.5
22,"Godfather, The (1972)",4.5
4,"Silence of the Lambs, The (1991)",4.5


In [41]:
# Películas que no ha calificado
movie_recommendations = user_predictions.T.reset_index()
movie_recommendations.columns = ['title', 'rating']
top_recommendations = movie_recommendations[~movie_recommendations.title.isin(already_rated + ['userId'])].sort_values(by = 'rating', ascending=False)
top_recommendations.head()

Unnamed: 0,title,rating
23,Batman (1989),1.182388e-14
112,Star Trek: Generations (1994),1.04361e-14
124,Crimson Tide (1995),1.036671e-14
6,Jurassic Park (1993),1.032507e-14
140,While You Were Sleeping (1995),1.004752e-14


## 3.2 Evaluación del modelo SVD

MSE

In [43]:
from sklearn.metrics import *

# Filtramos las predicciones reales
real_ratings = test_data_matrix.drop(columns = ['userId']).values[test_data_matrix.drop(columns = ['userId']).values.nonzero()]
predicted_ratings = predicted_ratings_df.values[test_data_matrix.drop(columns = ['userId']).values.nonzero()]

mse = mean_squared_error(real_ratings, predicted_ratings)
print("MSE en conjunto de entrenamiento:", mse)


MSE en conjunto de entrenamiento: 14.746152092121761


In [44]:
from sklearn.metrics import *

# Filtramos las predicciones reales
real_ratings = train_data_matrix.drop(columns = ['userId']).values[train_data_matrix.drop(columns = ['userId']).values.nonzero()]
predicted_ratings = predicted_ratings_df.values[train_data_matrix.drop(columns = ['userId']).values.nonzero()]

mse = mean_squared_error(real_ratings, predicted_ratings)
print("MSE en conjunto de entrenamiento:", mse)


MSE en conjunto de entrenamiento: 1.1508412303975523e-28


Evaluación del hit Rate

In [45]:
# Obtiene las películas vistas por cada usuario en entrenamiento

user_seen_movies = {}

for col in range(0, len(train_data_matrix)):
  user = train_data_matrix[train_data_matrix.index == col]
  temp = user.T.reset_index()
  temp.columns = ['tittle', 'rating']
  user_seen_movies[col] = temp[temp.rating.between(1,5)].tittle.tolist()

In [46]:
# Obtiene las películas con las calificaciones predichas más altas para cada usuario

predicted_movies = {}

for col in predicted_ratings_df.userId.tolist():
  user_pred = predicted_ratings_df[predicted_ratings_df.userId == col]
  temp = user_pred.T.reset_index()
  temp.columns = ['tittle', 'rating']
  recs = temp[~temp.tittle.isin(user_seen_movies.get(col, []) + ['userId'])]
  top_recs = recs.sort_values(by = 'rating', ascending = False).head(10)
  predicted_movies[col] = top_recs.tittle.tolist()


In [47]:
# Obtiene las películas vistas por cada usuario en test y que haya disfrutado

user_seen_movies_test = {}

for col in range(0, len(test_data_matrix)):
  user = test_data_matrix[test_data_matrix.index == col]
  temp = user.T.reset_index()
  temp.columns = ['tittle', 'rating']
  user_seen_movies_test[col] = temp[temp.rating.between(4,5)].tittle.tolist()

In [48]:
intersectan = 0

for col in user_seen_movies_test.keys():
  vistas = set(user_seen_movies_test[col])
  recomendadas = set(predicted_movies.get(col, []))
  # Verificar si hay intersección
  intersectan += not recomendadas.isdisjoint(vistas)

print('El hit rate de recomendaciones en usuarios en test es de :', round(intersectan/len(user_seen_movies_test.keys())*100), '%')

El hit rate de recomendaciones en usuarios en test es de : 31 %


## 4.1 ALS (Alternating Least Squares)

Aplicaremos el enfoque model based basado en ALS

In [42]:
%%capture
!pip install implicit


In [21]:
import implicit
from scipy.sparse import csr_matrix

movie_user_sparse = csr_matrix(train_data_matrix.drop(columns = ['userId']).values)

# Inicializar y entrenar el modelo ALS
model_als = implicit.als.AlternatingLeastSquares(factors=50, iterations=20)
model_als.fit(movie_user_sparse)


  0%|          | 0/20 [00:00<?, ?it/s]

In [49]:
# Predicciones con el modelo
user_factors = model_als.user_factors
item_factors = model_als.item_factors

reconstructed_ratings = np.dot(user_factors, item_factors.T)

predicted_ratings_df = pd.DataFrame(reconstructed_ratings, index=train_data_matrix.index, columns=train_data_matrix.drop(columns = ['userId']).columns)
predicted_ratings_df['userId'] = train_data_matrix['userId']


In [50]:
predicted_ratings_df.head()


title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Mortal Kombat (1995),Gran Torino (2008),"Simpsons Movie, The (2007)",Rumble in the Bronx (Hont faan kui) (1995),Lethal Weapon 3 (1992),Beverly Hills Cop (1984),Phenomenon (1996),M*A*S*H (a.k.a. MASH) (1970),The Butterfly Effect (2004),userId
0,1.033772,0.679769,0.682565,0.656327,0.528114,1.090132,0.763184,0.938122,0.116091,0.15005,...,0.256316,-0.0138,0.160909,0.183651,0.377323,0.294298,0.136697,0.781419,0.098726,1
1,0.446999,0.708537,-0.050318,0.345124,0.431016,0.146538,-0.034249,0.204286,-0.047315,0.177084,...,-0.002903,0.217387,-0.071233,0.003383,0.088224,-0.109534,0.031442,-0.118803,0.156484,2
2,-0.046826,0.060583,0.012623,-0.044683,0.038358,0.010453,0.023808,0.027675,-0.075773,0.188316,...,0.081345,-0.010044,-0.026783,-0.020067,0.058907,0.02976,0.013944,0.020494,0.005499,3
3,0.152236,0.158794,0.613825,1.131782,0.46668,0.787229,-0.206913,-0.032981,-0.235848,0.094054,...,0.000616,-0.134582,-0.040767,0.413574,-0.020334,-0.271199,0.152524,0.281298,-0.147274,4
4,0.842437,0.898597,1.006609,0.630942,0.019938,0.129982,0.771919,0.756271,0.751941,0.895664,...,0.137289,-0.09413,0.022077,0.164543,-0.045115,-0.200246,0.111165,0.066306,-0.194367,5


Predicciones

In [24]:
# Seleccionar un usuario (por ejemplo, el usuario con ID 82)
user_idx = 72
user_predictions = predicted_ratings_df[predicted_ratings_df.userId == user_idx]

In [25]:
# Peliculas calificadas por el cliente

rated_movies_by_user = train_data_matrix[train_data_matrix.userId == user_idx]
already_rated = rated_movies_by_user[rated_movies_by_user > 0].index.tolist()


In [26]:
pddf_rated_movies_by_user = rated_movies_by_user.T.reset_index()
pddf_rated_movies_by_user.columns = ['title', 'rating']
pddf_rated_movies_by_user = pddf_rated_movies_by_user[pddf_rated_movies_by_user.rating.between(1, 5)]
pddf_rated_movies_by_user.sort_values(by = 'rating', ascending = False, inplace = True)
already_rated = pddf_rated_movies_by_user.title.tolist()

pddf_rated_movies_by_user.head(10)

Unnamed: 0,title,rating
5,"Matrix, The (1999)",5.0
6,Star Wars: Episode IV - A New Hope (1977),5.0
87,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",5.0
21,Star Wars: Episode VI - Return of the Jedi (1983),4.5
19,Raiders of the Lost Ark (Indiana Jones and the...,4.5
137,Casablanca (1942),4.5
74,American History X (1998),4.5
23,"Fugitive, The (1993)",4.5
22,"Godfather, The (1972)",4.5
4,"Silence of the Lambs, The (1991)",4.5


In [27]:
# Películas que no ha calificado
movie_recommendations = user_predictions.T.reset_index()
movie_recommendations.columns = ['title', 'rating']
top_recommendations = movie_recommendations[~movie_recommendations.title.isin(already_rated + ['userId'])].sort_values(by = 'rating', ascending=False)
top_recommendations.head()

Unnamed: 0,title,rating
1,"Shawshank Redemption, The (1994)",0.932268
2,Pulp Fiction (1994),0.867403
12,Star Wars: Episode V - The Empire Strikes Back...,0.842946
6,Jurassic Park (1993),0.565127
10,Fight Club (1999),0.51529


## 4.2 Evaluación del modelo ALS

MSE

In [53]:
from sklearn.metrics import *

# Filtramos las predicciones reales
real_ratings_als = test_data_matrix.drop(columns = ['userId']).values[test_data_matrix.drop(columns = ['userId']).values.nonzero()]
predicted_ratings_als = predicted_ratings_df.values[test_data_matrix.drop(columns = ['userId']).values.nonzero()]

mse_als_test = mean_squared_error(real_ratings_als, predicted_ratings_als)
print("MSE en conjunto de test:", mse_als_test)


MSE en conjunto de test: 11.795892518909815


In [55]:
from sklearn.metrics import *

# Filtramos las predicciones reales
real_ratings_als = train_data_matrix.drop(columns = ['userId']).values[train_data_matrix.drop(columns = ['userId']).values.nonzero()]
predicted_ratings_als = predicted_ratings_df.values[train_data_matrix.drop(columns = ['userId']).values.nonzero()]

mse_als_train = mean_squared_error(real_ratings_als, predicted_ratings_als)
print("MSE en conjunto de entrenamiento:", mse_als_train)


MSE en conjunto de entrenamiento: 9.35513686278032


Evaluación del hit Rate

In [30]:
# Obtiene las películas vistas por cada usuario en entrenamiento

user_seen_movies = {}

for col in range(0, len(train_data_matrix)):
  user = train_data_matrix[train_data_matrix.index == col]
  temp = user.T.reset_index()
  temp.columns = ['tittle', 'rating']
  user_seen_movies[col] = temp[temp.rating.between(1,5)].tittle.tolist()

In [31]:
# Obtiene las películas con las calificaciones predichas más altas para cada usuario

predicted_movies = {}

for col in predicted_ratings_df.userId.tolist():
  user_pred = predicted_ratings_df[predicted_ratings_df.userId == col]
  temp = user_pred.T.reset_index()
  temp.columns = ['tittle', 'rating']
  recs = temp[~temp.tittle.isin(user_seen_movies.get(col, []) + ['userId'])]
  top_recs = recs.sort_values(by = 'rating', ascending = False).head(10)
  predicted_movies[col] = top_recs.tittle.tolist()


In [32]:
# Obtiene las películas vistas por cada usuario en test y que haya disfrutado

user_seen_movies_test = {}

for col in range(0, len(test_data_matrix)):
  user = test_data_matrix[test_data_matrix.index == col]
  temp = user.T.reset_index()
  temp.columns = ['tittle', 'rating']
  user_seen_movies_test[col] = temp[temp.rating.between(4,5)].tittle.tolist()

In [33]:
intersectan = 0

for col in user_seen_movies_test.keys():
  vistas = set(user_seen_movies_test[col])
  recomendadas = set(predicted_movies.get(col, []))
  # Verificar si hay intersección
  intersectan += not recomendadas.isdisjoint(vistas)

print('El hit rate de recomendaciones en usuarios en test es de :', round(intersectan/len(user_seen_movies_test.keys())*100), '%')

El hit rate de recomendaciones en usuarios en test es de : 34 %


## 5. Tarea

Tarea

1. Comparar SME de ALS en train y en test. ¿La técnica hace overfitting? ¿Qué tanto? Desde este punto de vista es mejor que SVD?

2. ¿El hit rate de ALS es mejor que SVD? Cuánto es?

3. En base a los resultados, ¿qué técnica recomendaría para nuestro problema de movieLens?

## ALS.

En el caso de ALS se tiene los resultados:

MSE en conjunto de test: 11.795892518909815

MSE en conjunto de entrenamiento: 9.35513686278032

El hit rate de recomendaciones en usuarios en test es de : 32 %

## SVD.

En el caso de SVD se tiene los resultados:

MSE en conjunto de test: 14.746152092121761

MSE en conjunto de entrenamiento: 1.1508412303975523e-28

El hit rate de recomendaciones en usuarios en test es de : 31 %



## 1.
<font color='blue'>
ALS no hace overfitting y tiene un buen resultado MSE en el set de Test. Por el contrario, SVD por su concepción hace overfitting. Desde este punto de vista ALS tiene un mejor comportamiento sobre el conjunto de datos seleccionados.
</font>

##2.
<font color='blue'>
Para ALS el hit rate es ligeramente mayor que el encontrado en SVD. 32% vs 31%
</font>

##3.
<font color='blue'>
En base a los resultados, incluyendo las "recomendaciones" dadas para el usuario de prueba, se observa que ALS produce resultados mas parecidos a las películas vistas por el usuario de prueba. El hit rate es muy parecido entre ambos asi que no marca mucho la diferencia pero el hecho que que SVD haga overfitting parace que si lo hace. Por lo tanto de los experimentos realizados, ALS sería una buena técnica para el grupo de datos utilizados.
</font>



