ALTERNATIVE 1) Recommender system

The goal is to build a recommendation model. The files can be downloaded from the following link:

https://drive.google.com/drive/folders/1F58VdEmIUakN7dOvaLeEmd5xCRrSJgB6?usp=sharing


Assignment:

- The first step is to read the files from Drive and build the two datasets: training and validation.

- In addition, create a third dataset for testing by sampling from the first two, and do not use that data for training. In other words, split the full data into three DataFrames: training, test and validation.

- Build a recommender. It must be able to generate recommendations for ALL users, including cold-start users who have no views in the train set. Generate 20 recommendations per user.
- Recommendations are made for each account_id, and the items to recommend are content_id values.
- Recommended content must not have been watched previously by the user (filter it out). The only accepted exception is content the user watched exactly once and whose rating falls in the last decile.
- Evaluate it with MAP.
- Originality in the code is valued.
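
One possible reading of the re-recommendation exception above, as a minimal sketch. It assumes a long-format DataFrame with columns account_id, content_id and rating (one row per viewing) and interprets "last decile" as a rating at or above the 90th percentile of all ratings. The function name and both assumptions are illustrative and not part of the assignment.

```python
import pandas as pd

def rewatch_allowed(views: pd.DataFrame) -> set:
    """Return the (account_id, content_id) pairs that may be recommended again."""
    # Assumed reading of "last decile": rating >= 90th percentile of all ratings
    threshold = views["rating"].quantile(0.9)
    per_pair = (
        views.groupby(["account_id", "content_id"])["rating"]
        .agg(n_views="size", rating="max")
        .reset_index()
    )
    # Keep pairs watched exactly once whose rating reaches the threshold
    ok = per_pair[(per_pair["n_views"] == 1) & (per_pair["rating"] >= threshold)]
    return set(zip(ok["account_id"], ok["content_id"]))

# Toy viewing log (not the real data)
views = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "content_id": [10, 10, 20, 10, 30],
    "rating":     [5, 5, 5, 2, 5],
})
allowed = rewatch_allowed(views)
```

Items in `allowed` would then be exempt from the "already watched" filter for that user.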


Recommendations:
- When reading the data, the files named "base" form the training set and the "test" files form the test set.
- Keep the columns user_id; item_id; rating; ts
- Start with something simple.
- Once the model is working, improve it and cover the remaining requirements one by one.
- You can also try a larger dataset: https://grouplens.org/datasets/movielens/

Submission date 04/02/2024, presentation 06/02/2024.

As an additional challenge, **you must upload the file to a (public) GitHub repository** and send its link to the following email: hugofolonier@gmail.com, or by private message on Slack.

**RECOMMENDER SYSTEM BASED ON A MOVIE DATASET**

In [256]:
import pandas as pd

In [257]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Dataset chosen**: u4base/u4test

Roughly 20% of u4base is held out for validation, and the model is trained on the rest of u4base. After that, the model is trained on the full u4base set and evaluated against the test set.

In [258]:
u4base = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataSet/u4.base', sep='\t', header=None, names=['UsuarioID', 'PeliculaID', 'Clasificacion', 'Timestamp'])

In [259]:
u4test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataSet/u4.test', sep='\t', header=None, names=['UsuarioID', 'PeliculaID', 'Clasificacion', 'Timestamp'])

In [260]:
u4test.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,4,3,876893119
1,1,7,4,875071561
2,1,42,5,876892425
3,1,43,4,878542869
4,1,55,5,875072688


In [261]:
u4test.shape

(20000, 4)

We start by preparing the u4base set so it can be split into train and validation

In [262]:
u4base.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,1,5,874965758
1,1,2,3,876893171
2,1,3,4,878542960
3,1,5,3,889751712
4,1,6,5,887431973


In [263]:
u4base.shape

(80000, 4)

In [264]:
u4base.UsuarioID.nunique()

943

In [265]:
u4base.PeliculaID.nunique()

1660

There are 943 users and 1660 movies

The Timestamp column is converted to a date format

In [266]:
from datetime import datetime

In [267]:
u4base['Timestamp'] = u4base['Timestamp'].apply(lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d'))

In [268]:
u4base.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,1,5,1997-09-22
1,1,2,3,1997-10-15
2,1,3,4,1997-11-03
3,1,5,3,1998-03-13
4,1,6,5,1998-02-14


In [269]:
u4base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UsuarioID      80000 non-null  int64 
 1   PeliculaID     80000 non-null  int64 
 2   Clasificacion  80000 non-null  int64 
 3   Timestamp      80000 non-null  object
dtypes: int64(3), object(1)
memory usage: 2.4+ MB


The dates are converted to the pandas datetime type

In [270]:
u4base['Timestamp'] = pd.to_datetime(u4base['Timestamp'], format='%Y-%m-%d')
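
As a sketch of an alternative, the epoch seconds could be converted in a single step with `pd.to_datetime(..., unit="s")`, which avoids the round trip through strings. The small DataFrame below is a stand-in for u4base and uses the first two timestamps from the file.

```python
import pandas as pd

# Stand-in for u4base: raw Timestamp values are Unix epoch seconds
df = pd.DataFrame({"Timestamp": [874965758, 876893171]})

# Convert epoch seconds to datetime64 and drop the time-of-day part
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="s").dt.normalize()
```

The result has dtype datetime64[ns] directly, so no second conversion is needed.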

In [271]:
u4base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   UsuarioID      80000 non-null  int64         
 1   PeliculaID     80000 non-null  int64         
 2   Clasificacion  80000 non-null  int64         
 3   Timestamp      80000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 2.4 MB


We look up the maximum and minimum dates in order to define the train and validation sets

In [272]:
u4base.Timestamp.min()

Timestamp('1997-09-20 00:00:00')

In [273]:
u4base.Timestamp.max()

Timestamp('1998-04-22 00:00:00')

We split the dataset into train and validation

In [274]:
train = u4base[(u4base.Timestamp < datetime(year=1998, month=3, day=15))]
train.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,1,5,1997-09-22
1,1,2,3,1997-10-15
2,1,3,4,1997-11-03
3,1,5,3,1998-03-13
4,1,6,5,1998-02-14


In [275]:
train.shape

(65663, 4)

In [276]:
train.UsuarioID.nunique()

767

Validation set

In [277]:
valid = u4base[u4base.Timestamp >= datetime(year=1998, month=3, day=15)]
valid.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
360,4,11,4,1998-04-08
361,4,50,5,1998-04-08
362,4,210,3,1998-04-08
363,4,258,5,1998-04-08
364,4,260,4,1998-04-08


In [278]:
valid.shape

(14337, 4)
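
The cutoff date above is hard-coded. A sketch of one way to derive it from the data, assuming the aim is to hold out roughly the most recent 20% of the rows. `temporal_split` and `valid_frac` are hypothetical names, and the toy DataFrame stands in for u4base.

```python
import pandas as pd

def temporal_split(df, time_col="Timestamp", valid_frac=0.2):
    """Rows at or after the (1 - valid_frac) time quantile go to validation."""
    cutoff = df[time_col].quantile(1 - valid_frac)
    train = df[df[time_col] < cutoff]
    valid = df[df[time_col] >= cutoff]
    return train, valid, cutoff

# Toy data: ten users, one rating per day
toy = pd.DataFrame({
    "UsuarioID": range(10),
    "Timestamp": pd.date_range("1998-01-01", periods=10, freq="D"),
})
train_toy, valid_toy, cutoff = temporal_split(toy)
```

With many rows sharing the same date, the resulting fraction is only approximate, since all rows of the cutoff day land on the same side.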

In [279]:
print(f'Users in train set = {train.UsuarioID.nunique()}',' / ',
      f'Users in validation set = {valid.UsuarioID.nunique()}')

Users in train set = 767  /  Users in validation set = 267


We check whether there are users who are in the validation set and NOT in the train set

In [280]:
valid[~valid.UsuarioID.isin(train.UsuarioID.unique())].UsuarioID.nunique()

176

Interaction matrix

In [281]:
matrix_columns = ['UsuarioID', 'PeliculaID', 'Clasificacion']

In [282]:
interactions = train[matrix_columns]
interactions.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion
0,1,1,5
1,1,2,3
2,1,3,4
3,1,5,3
4,1,6,5


In [283]:
interactions_matrix = interactions.pivot(index="UsuarioID", columns="PeliculaID", values="Clasificacion")

In [284]:
interactions_matrix.head()

PeliculaID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1664,1671,1675,1676,1680,1681,1682
UsuarioID,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,,3.0,5.0,,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
6,4.0,,,,,,2.0,4.0,,,...,,,,,,,,,,


NaN values are replaced with 0

In [285]:
interactions_matrix = interactions_matrix.fillna(0)

In [286]:
interactions_matrix.head()

PeliculaID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1664,1671,1675,1676,1680,1681,1682
UsuarioID,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,0.0,3.0,5.0,0.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [287]:
interactions_matrix.shape

(767, 1601)

The matrix is converted to a CSR matrix (compressed sparse row)

In [288]:
from scipy.sparse import csr_matrix

In [289]:
interactions_matrix_csr = csr_matrix(interactions_matrix.values)

In [290]:
interactions_matrix_csr

<767x1601 sparse matrix of type '<class 'numpy.float64'>'
	with 65663 stored elements in Compressed Sparse Row format>
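
The pivot-then-fillna route first materialises a dense users x movies table. As a sketch of an alternative, the sparse matrix can be built directly from the long-format ratings with `pd.factorize`. The toy DataFrame stands in for `interactions`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings (not the real data)
toy = pd.DataFrame({
    "UsuarioID":     [1, 1, 3, 5],
    "PeliculaID":    [10, 20, 10, 30],
    "Clasificacion": [5, 3, 4, 2],
})

# Map raw IDs to consecutive row/column positions, in sorted ID order
rows, user_index = pd.factorize(toy["UsuarioID"], sort=True)
cols, item_index = pd.factorize(toy["PeliculaID"], sort=True)

matrix = csr_matrix(
    (toy["Clasificacion"].to_numpy(dtype=np.float64), (rows, cols)),
    shape=(len(user_index), len(item_index)),
)
```

`user_index` and `item_index` keep the original IDs in row and column order, so they can replace the pivot table's index and columns when translating positions back to IDs.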

The difference between the uncompressed matrix and the compressed one is shown:

In [291]:
import sys

print(sys.getsizeof(interactions_matrix))
print(sys.getsizeof(interactions_matrix_csr))

9829888
48
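
`sys.getsizeof` on a SciPy sparse matrix only measures the small wrapper object, not the arrays it holds. A sketch of a fairer comparison sums the bytes of the three CSR arrays (`data`, `indices`, `indptr`) and compares them with the dense array's `nbytes`. The random matrix is a stand-in with the same shape as above and about 5% non-zero cells.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = np.zeros((767, 1601))
# Fill roughly 5% of the cells with ratings 1..5 (toy data, not the real matrix)
mask = rng.random(dense.shape) < 0.05
dense[mask] = rng.integers(1, 6, size=mask.sum())

sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
# Actual memory held by the CSR representation
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)
```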


Dictionary that will serve as an index: its keys are the user IDs and its values are the matrix row index.

In [292]:
user_ids = list(interactions_matrix.index)
user_dict = {}
counter = 0
for i in user_ids:
    user_dict[i] = counter
    counter += 1
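
The same ID-to-row mapping can be written as a dict comprehension over `enumerate`. A short sketch with a stand-in list of user IDs, plus the reverse mapping, which is handy for turning row positions back into IDs (the `toy_` names and `index_to_user` are illustrative).

```python
# Stand-in for list(interactions_matrix.index)
toy_ids = [1, 2, 3, 5, 6]

# user ID -> row position
toy_dict = {user_id: row for row, user_id in enumerate(toy_ids)}
# row position -> user ID
index_to_user = {row: user_id for user_id, row in toy_dict.items()}
```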

In [293]:
user_dict

{1: 0,
 2: 1,
 3: 2,
 5: 3,
 6: 4,
 8: 5,
 9: 6,
 10: 7,
 12: 8,
 13: 9,
 14: 10,
 15: 11,
 16: 12,
 17: 13,
 18: 14,
 19: 15,
 20: 16,
 21: 17,
 22: 18,
 23: 19,
 24: 20,
 25: 21,
 28: 22,
 29: 23,
 30: 24,
 31: 25,
 32: 26,
 34: 27,
 35: 28,
 36: 29,
 37: 30,
 40: 31,
 42: 32,
 43: 33,
 44: 34,
 45: 35,
 46: 36,
 47: 37,
 48: 38,
 49: 39,
 50: 40,
 51: 41,
 52: 42,
 53: 43,
 54: 44,
 55: 45,
 57: 46,
 58: 47,
 59: 48,
 60: 49,
 62: 50,
 63: 51,
 64: 52,
 65: 53,
 66: 54,
 67: 55,
 68: 56,
 69: 57,
 70: 58,
 71: 59,
 72: 60,
 73: 61,
 74: 62,
 75: 63,
 76: 64,
 77: 65,
 78: 66,
 80: 67,
 81: 68,
 82: 69,
 83: 70,
 84: 71,
 85: 72,
 86: 73,
 87: 74,
 89: 75,
 92: 76,
 93: 77,
 94: 78,
 95: 79,
 96: 80,
 97: 81,
 98: 82,
 99: 83,
 101: 84,
 102: 85,
 103: 86,
 104: 87,
 105: 88,
 106: 89,
 108: 90,
 109: 91,
 110: 92,
 112: 93,
 113: 94,
 114: 95,
 115: 96,
 116: 97,
 117: 98,
 118: 99,
 119: 100,
 120: 101,
 122: 102,
 123: 103,
 125: 104,
 126: 105,
 127: 106,
 128: 107,
 129: 108,
 ...

# Model
The LightFM library is used

In [294]:
pip install lightfm



In [295]:
from lightfm import LightFM

In [296]:
model = LightFM(no_components=3, random_state=100, learning_rate=0.03)

The model is trained with the CSR matrix

In [297]:
%%time
model = model.fit(interactions_matrix_csr, epochs=10000)

CPU times: user 2min 31s, sys: 180 ms, total: 2min 31s
Wall time: 2min 37s


In [298]:
model

<lightfm.lightfm.LightFM at 0x7e24c6e89ea0>

**COLD START users**
These are the users who are not in the train set but are in the validation set

These users will be assigned the most-watched, most popular movies.

In [299]:
train.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,1,5,1997-09-22
1,1,2,3,1997-10-15
2,1,3,4,1997-11-03
3,1,5,3,1998-03-13
4,1,6,5,1998-02-14


In [300]:
train.groupby("PeliculaID", as_index=False).agg({"UsuarioID":"nunique"})

Unnamed: 0,PeliculaID,UsuarioID
0,1,305
1,2,86
2,3,68
3,4,138
4,5,62
...,...,...
1596,1675,1
1597,1676,1
1598,1680,1
1599,1681,1


In [301]:
popularity_df = train.groupby("PeliculaID", as_index=False).agg({"UsuarioID":"nunique"}).sort_values(by="UsuarioID", ascending=False)

popularity_df.columns=["PeliculaID", "popularity"]
popularity_df.head()

Unnamed: 0,PeliculaID,popularity
49,50,392
99,100,345
180,181,337
293,294,319
257,258,318


List of the 20 most popular items:

In [302]:
popular_content = popularity_df.PeliculaID.values[:20]

In [303]:
popularity_df.head(20).PeliculaID.values

array([ 50, 100, 181, 294, 258, 288, 286,   1, 121,   7, 127, 237,  56,
       174,  98, 117, 300, 405, 172, 222])

In [304]:
popular_content

array([ 50, 100, 181, 294, 258, 288, 286,   1, 121,   7, 127, 237,  56,
       174,  98, 117, 300, 405, 172, 222])

**Generating recommendations**

Based on the analysis above, 20 recommendations are generated for every user, taking into account the movies the user has already watched and recommending popular content if the user is not in the train set.

In [305]:
import numpy as np

In [306]:
interactions_matrix.head()

PeliculaID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1664,1671,1675,1676,1680,1681,1682
UsuarioID,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,0.0,3.0,5.0,0.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [307]:
interactions_matrix.shape

(767, 1601)

In [308]:
interactions_matrix.index

Int64Index([  1,   2,   3,   5,   6,   8,   9,  10,  12,  13,
            ...
            929, 930, 933, 935, 936, 937, 939, 940, 941, 943],
           dtype='int64', name='UsuarioID', length=767)

In [309]:
from tqdm import tqdm # Shows a progress bar

# Dictionary used to store the recommendations
recomms_dict = {
    'user_id': [],
    'recomms': [],
    'cold': []
}

# Number of users and number of items
n_users, n_items = interactions_matrix.shape
item_ids = np.arange(n_items)

# Generate recommendations for each user in the validation set
for user in tqdm(valid.UsuarioID.unique()):
    # Check whether the user is in the interaction matrix
    if user in list(interactions_matrix.index):
      # The user is in train, so not cold start: use the model to recommend
      # Look up the user's row index in the matrix (ID -> index)
      user_x = user_dict[user]

      # Generate the predictions for user x
      preds = model.predict(user_ids=user_x, item_ids = item_ids)

      # Sort the predictions from highest to lowest score and keep the top 20
      scores = pd.Series(preds)
      scores.index = interactions_matrix.columns
      scores = list(pd.Series(scores.sort_values(ascending=False).index))[:20]

      # Movies the user has already watched in the train set
      watched_contents = train[train.UsuarioID == user].PeliculaID.unique()

      # Drop already-watched items from those 20 (the result is kept in `recomms`)
      recomms = [x for x in scores if x not in watched_contents][:20]

      # Store the user in the dictionary; the list stored is `scores` (the unfiltered top 20)
      recomms_dict['user_id'].append(user)
      recomms_dict['recomms'].append(scores)
      recomms_dict['cold'].append(False) # Flags whether the user is cold start

    # Cold-start users are recommended the popular content
    else:
      recomms_dict['user_id'].append(user)
      recomms_dict['recomms'].append(popular_content) # Recommend popular content
      recomms_dict['cold'].append(True) # Flags whether the user is cold start

100%|██████████| 267/267 [00:00<00:00, 1515.31it/s]
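
A sketch of how the seen-item filter could be applied before truncating to 20, so that each user still ends up with a full list whenever enough unseen items exist. `top_k_unseen` and its arguments are hypothetical names, and the scores are toy values rather than LightFM output.

```python
import numpy as np

def top_k_unseen(scores, item_labels, seen, k=20):
    """Rank items by score (descending) and return the first k not in `seen`."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # best score first
    seen = set(seen)
    recs = []
    for i in order:
        item = item_labels[i]
        if item not in seen:
            recs.append(item)
            if len(recs) == k:
                break
    return recs

# Toy example: five items, two of them already watched
item_labels = [10, 20, 30, 40, 50]
scores = [0.9, 0.1, 0.8, 0.7, 0.2]
recs = top_k_unseen(scores, item_labels, seen=[10, 40], k=3)
```

In the loop above, `scores` would be the model predictions, `item_labels` the matrix columns and `seen` the user's watched movies; that wiring is an assumption of this sketch.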


The dictionary is converted to a DataFrame

In [310]:
recomms_df = pd.DataFrame(recomms_dict)
recomms_df

Unnamed: 0,user_id,recomms,cold
0,4,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
1,7,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
2,11,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
3,13,"[294, 286, 50, 258, 288, 181, 748, 300, 100, 1...",False
4,14,"[294, 286, 258, 50, 288, 748, 300, 181, 100, 3...",False
...,...,...,...
262,931,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
263,932,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
264,934,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True
265,938,"[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ...",True


In [311]:
recomms_df.user_id.nunique()

267

Of the 267 users, 176 are COLD START

In [312]:
recomms_df.cold.value_counts()

True     176
False     91
Name: cold, dtype: int64

# METRIC
Our recommendations are compared with what the users actually watched

In [313]:
valid.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
360,4,11,4,1998-04-08
361,4,50,5,1998-04-08
362,4,210,3,1998-04-08
363,4,258,5,1998-04-08
364,4,260,4,1998-04-08


In the validation set, the movies are sorted by Clasificacion from highest to lowest.

In [314]:
ideal_recomms = valid.sort_values(by=['UsuarioID', 'Clasificacion'], ascending=False)\
                  .groupby(['UsuarioID'], as_index=False)\
                  .agg({'PeliculaID': 'unique'})
ideal_recomms.head()

Unnamed: 0,UsuarioID,PeliculaID
0,4,"[50, 258, 294, 300, 301, 303, 327, 329, 354, 3..."
1,7,"[4, 7, 8, 12, 22, 28, 44, 47, 50, 53, 56, 64, ..."
2,11,"[9, 15, 173, 258, 268, 277, 286, 332, 357, 423..."
3,13,"[896, 909, 915, 186, 362, 694, 232, 488, 889, ..."
4,14,"[12, 23, 32, 50, 81, 172, 174, 213, 302, 427, ..."


We join our recommendations and the ideal set in a single DataFrame.

In [315]:
df_map = ideal_recomms.merge(recomms_df, how='left', left_on='UsuarioID', right_on='user_id')[['UsuarioID', 'PeliculaID', 'recomms']]
df_map.columns = ['UsuarioID', 'ideal', 'recomms']
df_map.head()

Unnamed: 0,UsuarioID,ideal,recomms
0,4,"[50, 258, 294, 300, 301, 303, 327, 329, 354, 3...","[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ..."
1,7,"[4, 7, 8, 12, 22, 28, 44, 47, 50, 53, 56, 64, ...","[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ..."
2,11,"[9, 15, 173, 258, 268, 277, 286, 332, 357, 423...","[50, 100, 181, 294, 258, 288, 286, 1, 121, 7, ..."
3,13,"[896, 909, 915, 186, 362, 694, 232, 488, 889, ...","[294, 286, 50, 258, 288, 181, 748, 300, 100, 1..."
4,14,"[12, 23, 32, 50, 81, 172, 174, 213, 302, 427, ...","[294, 286, 258, 50, 288, 748, 300, 181, 100, 3..."


**MAP calculation**

In [316]:
aps = [] # empty list to collect the AP of each user

# The columns are unpacked in the order ("ideal", "recomms"),
# so `pred` holds the ideal list and `label` holds our recommendations
for pred, label in df_map[["ideal", "recomms"]].values:
  n = len(pred) # number of items in `pred`
  arange = np.arange(n, dtype=np.int32) + 1. # 1-based positions
  rel_k = np.in1d(pred[:n], label) # booleans marking which items of `pred` appear in `label`
  tp = np.ones(rel_k.sum(), dtype=np.int32).cumsum() # running count of true positives
  denom = arange[rel_k] # positions where the relevant items are found
  ap = (tp / denom).sum() / len(label) # average precision
  aps.append(ap)

In [317]:
MAP = np.mean(aps)
print(f'mean average precision = {round(MAP, 4)}')

mean average precision = 0.0663
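
For reference, a sketch of average precision at k in which the predictions are the ranked recommendations and the labels are the held-out items. Normalising by min(len(relevant), k) is one common convention and is an assumption here. `average_precision_at_k` is an illustrative name, and its values need not match the figure printed above.

```python
import numpy as np

def average_precision_at_k(recommended, relevant, k=20):
    """AP@k: sum of precision@i at each hit position i, normalised by min(#relevant, k)."""
    recommended = list(recommended)[:k]
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at position i
    return sum(precisions) / min(len(relevant), k)

# Toy data: ranked recommendations and held-out items per user
recs_by_user = {1: [50, 100, 181, 7], 2: [1, 2, 3, 4]}
truth_by_user = {1: [50, 7, 999], 2: [9]}
ap_values = [
    average_precision_at_k(recs_by_user[u], truth_by_user[u], k=4)
    for u in recs_by_user
]
map_at_k = float(np.mean(ap_values))
```

For user 1 the hits fall at positions 1 and 4, giving (1/1 + 2/4) / 3 = 0.5. User 2 has no hits, so the mean over both users is 0.25.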


Three runs were made, giving the following MAP values.

1) epochs=10000; learning_rate=0.03 ==> MAP=0.0663

2) epochs=10000; learning_rate=0.02 ==> MAP=0.0657

3) epochs=10000; learning_rate=0.01 ==> MAP=0.0661

Next, the full u4base set is used as train and u4test as test.

In [318]:
t_u4base = u4base
t_u4base.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,1,5,1997-09-22
1,1,2,3,1997-10-15
2,1,3,4,1997-11-03
3,1,5,3,1998-03-13
4,1,6,5,1998-02-14


In [319]:
t_u4base.shape

(80000, 4)

In [320]:
test = u4test
test.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion,Timestamp
0,1,4,3,876893119
1,1,7,4,875071561
2,1,42,5,876892425
3,1,43,4,878542869
4,1,55,5,875072688


In [321]:
test.shape

(20000, 4)

We check whether there are users who are in the test set and NOT in the train set t_u4base

In [322]:
test[~test.UsuarioID.isin(t_u4base.UsuarioID.unique())].UsuarioID.nunique()

0

Interaction matrix

In [323]:
interactions_Final = t_u4base[matrix_columns]
interactions_Final.head()

Unnamed: 0,UsuarioID,PeliculaID,Clasificacion
0,1,1,5
1,1,2,3
2,1,3,4
3,1,5,3
4,1,6,5


In [324]:
interactions_Final_matrix = interactions_Final.pivot(index="UsuarioID", columns="PeliculaID", values="Clasificacion")

In [325]:
interactions_Final_matrix.head()

PeliculaID,1,2,3,4,5,6,7,8,9,10,...,1668,1669,1670,1671,1674,1675,1676,1680,1681,1682
UsuarioID,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,,3.0,5.0,,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


In [326]:
interactions_Final_matrix = interactions_Final_matrix.fillna(0)

In [327]:
interactions_Final_matrix.head()

PeliculaID,1,2,3,4,5,6,7,8,9,10,...,1668,1669,1670,1671,1674,1675,1676,1680,1681,1682
UsuarioID,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,0.0,3.0,5.0,0.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [328]:
interactions_Final_matrix.shape

(943, 1660)

CSR matrix (Compressed Sparse Row)

In [329]:
from scipy.sparse import csr_matrix

In [330]:
interactions_Final_matrix_csr = csr_matrix(interactions_Final_matrix.values)

In [331]:
interactions_Final_matrix_csr

<943x1660 sparse matrix of type '<class 'numpy.float64'>'
	with 80000 stored elements in Compressed Sparse Row format>

In [332]:
import sys

print(sys.getsizeof(interactions_Final_matrix))
print(sys.getsizeof(interactions_Final_matrix_csr))

12530600
48


Dictionary

In [333]:
user_ids = list(interactions_Final_matrix.index)
user_dict = {}
counter = 0
for i in user_ids:
    user_dict[i] = counter
    counter += 1

In [334]:
user_dict

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 8: 7,
 9: 8,
 10: 9,
 11: 10,
 12: 11,
 13: 12,
 14: 13,
 15: 14,
 16: 15,
 17: 16,
 18: 17,
 19: 18,
 20: 19,
 21: 20,
 22: 21,
 23: 22,
 24: 23,
 25: 24,
 26: 25,
 27: 26,
 28: 27,
 29: 28,
 30: 29,
 31: 30,
 32: 31,
 33: 32,
 34: 33,
 35: 34,
 36: 35,
 37: 36,
 38: 37,
 39: 38,
 40: 39,
 41: 40,
 42: 41,
 43: 42,
 44: 43,
 45: 44,
 46: 45,
 47: 46,
 48: 47,
 49: 48,
 50: 49,
 51: 50,
 52: 51,
 53: 52,
 54: 53,
 55: 54,
 56: 55,
 57: 56,
 58: 57,
 59: 58,
 60: 59,
 61: 60,
 62: 61,
 63: 62,
 64: 63,
 65: 64,
 66: 65,
 67: 66,
 68: 67,
 69: 68,
 70: 69,
 71: 70,
 72: 71,
 73: 72,
 74: 73,
 75: 74,
 76: 75,
 77: 76,
 78: 77,
 79: 78,
 80: 79,
 81: 80,
 82: 81,
 83: 82,
 84: 83,
 85: 84,
 86: 85,
 87: 86,
 88: 87,
 89: 88,
 90: 89,
 91: 90,
 92: 91,
 93: 92,
 94: 93,
 95: 94,
 96: 95,
 97: 96,
 98: 97,
 99: 98,
 100: 99,
 101: 100,
 102: 101,
 103: 102,
 104: 103,
 105: 104,
 106: 105,
 107: 106,
 108: 107,
 109: 108,
 110: 109,
 ...

**Model**

The model is trained with the CSR matrix.

In [335]:
%%time
model = model.fit(interactions_Final_matrix_csr, epochs=10000)

CPU times: user 3min 4s, sys: 235 ms, total: 3min 4s
Wall time: 3min 11s


COLD START users: users who are not in the t_u4base set but are in the test set

Movie popularity is computed on the train set t_u4base

In [336]:
t_u4base.groupby("PeliculaID", as_index=False).agg({"UsuarioID":"nunique"})

Unnamed: 0,PeliculaID,UsuarioID
0,1,361
1,2,101
2,3,72
3,4,165
4,5,71
...,...,...
1655,1675,1
1656,1676,1
1657,1680,1
1658,1681,1


In [337]:
popularity_dff = t_u4base.groupby("PeliculaID", as_index=False).agg({"UsuarioID":"nunique"}).sort_values(by="UsuarioID", ascending=False)
popularity_dff.head()

Unnamed: 0,PeliculaID,UsuarioID
49,50,465
99,100,416
257,258,404
180,181,390
285,286,387


In [338]:
popularity_dff.columns=["PeliculaID", "popularity"]
popularity_dff.head(20)

Unnamed: 0,PeliculaID,popularity
49,50,465
99,100,416
257,258,404
180,181,390
285,286,387
287,288,383
293,294,382
0,1,361
299,300,344
126,127,335


List of the 20 most popular items:

In [339]:
populares_content = popularity_dff.PeliculaID.values[:20]

In [340]:
popularity_dff.head(20).PeliculaID.values

array([ 50, 100, 258, 181, 286, 288, 294,   1, 300, 127, 121, 174,  98,
        56,   7, 237, 117, 172, 405, 204])

20 recommendations are generated for every user, using the new train and test sets.

In [341]:
from tqdm import tqdm # Shows a progress bar

# Dictionary used to store the recommendations
recomms_dict = {
    'user_id': [],
    'recomms': [],
    'cold': []
}

# Number of users and number of items
n_users, n_items = interactions_Final_matrix.shape
item_ids = np.arange(n_items)

# Generate recommendations for each user in the test set
for user in tqdm(test.UsuarioID.unique()):
    # Check whether the user is in the interaction matrix
    if user in list(interactions_Final_matrix.index):
      # The user is in t_u4base, so not cold start: use the model to recommend
      # Look up the user's row index in the matrix (ID -> index)
      user_x = user_dict[user]

      # Generate the predictions for user x
      preds = model.predict(user_ids=user_x, item_ids = item_ids)

      # Sort the predictions from highest to lowest score and keep the top 20
      scores = pd.Series(preds)
      scores.index = interactions_Final_matrix.columns
      scores = list(pd.Series(scores.sort_values(ascending=False).index))[:20]

      # Movies the user has already watched in the train set t_u4base
      watched_contents = t_u4base[t_u4base.UsuarioID == user].PeliculaID.unique()

      # Drop already-watched items from those 20 (the result is kept in `recomms`)
      recomms = [x for x in scores if x not in watched_contents][:20]

      # Store the user in the dictionary; the list stored is `scores` (the unfiltered top 20)
      recomms_dict['user_id'].append(user)
      recomms_dict['recomms'].append(scores)
      recomms_dict['cold'].append(False) # Flags whether the user is cold start

    # Cold-start users are recommended the popular content
    else:
      recomms_dict['user_id'].append(user)
      recomms_dict['recomms'].append(populares_content) # Recommend popular content from t_u4base
      recomms_dict['cold'].append(True) # Flags whether the user is cold start

100%|██████████| 923/923 [00:01<00:00, 560.30it/s]


In [342]:
recomms_df = pd.DataFrame(recomms_dict)
recomms_df

Unnamed: 0,user_id,recomms,cold
0,1,"[286, 258, 294, 300, 288, 748, 313, 328, 50, 2...",False
1,2,"[286, 258, 294, 288, 300, 313, 748, 50, 328, 3...",False
2,3,"[286, 294, 258, 288, 300, 748, 313, 50, 328, 3...",False
3,4,"[286, 294, 258, 288, 300, 748, 313, 50, 328, 3...",False
4,5,"[286, 258, 294, 300, 288, 748, 328, 313, 50, 2...",False
...,...,...,...
918,939,"[286, 258, 294, 288, 300, 313, 748, 50, 328, 3...",False
919,940,"[286, 258, 294, 288, 300, 313, 748, 50, 328, 3...",False
920,941,"[286, 258, 294, 300, 288, 50, 748, 313, 328, 2...",False
921,942,"[258, 286, 300, 294, 288, 50, 328, 748, 313, 2...",False


In [343]:
recomms_df.user_id.nunique()

923

There are no COLD START users

In [344]:
recomms_df.cold.value_counts()

False    923
Name: cold, dtype: int64

METRIC

In the test set, the movies are sorted by Clasificacion from highest to lowest.

In [345]:
idealRecomms = test.sort_values(by=['UsuarioID', 'Clasificacion'], ascending=False)\
                  .groupby(['UsuarioID'], as_index=False)\
                  .agg({'PeliculaID': 'unique'})
idealRecomms.head()

Unnamed: 0,UsuarioID,PeliculaID
0,1,"[42, 55, 124, 165, 191, 198, 216, 7, 43, 58, 9..."
1,2,"[272, 304, 278, 288, 289]"
2,3,"[347, 336]"
3,4,"[324, 359, 362, 358]"
4,5,"[163, 181, 434, 239, 455, 366, 367, 412, 404, ..."


We merge the DataFrames holding our recommendations and the ideal set.

In [346]:
df_mapf = idealRecomms.merge(recomms_df, how='left', left_on='UsuarioID', right_on='user_id')[['UsuarioID', 'PeliculaID', 'recomms']]
df_mapf.columns = ['UsuarioID', 'ideal', 'recomms']
df_mapf.head()

Unnamed: 0,UsuarioID,ideal,recomms
0,1,"[42, 55, 124, 165, 191, 198, 216, 7, 43, 58, 9...","[286, 258, 294, 300, 288, 748, 313, 328, 50, 2..."
1,2,"[272, 304, 278, 288, 289]","[286, 258, 294, 288, 300, 313, 748, 50, 328, 3..."
2,3,"[347, 336]","[286, 294, 258, 288, 300, 748, 313, 50, 328, 3..."
3,4,"[324, 359, 362, 358]","[286, 294, 258, 288, 300, 748, 313, 50, 328, 3..."
4,5,"[163, 181, 434, 239, 455, 366, 367, 412, 404, ...","[286, 258, 294, 300, 288, 748, 328, 313, 50, 2..."


MAP calculation

In [347]:
aps_ = [] # empty list to collect the AP of each user

# The columns are unpacked in the order ("ideal", "recomms"),
# so `pred` holds the ideal list and `label` holds our recommendations
for pred, label in df_mapf[["ideal", "recomms"]].values:
  n = len(pred) # number of items in `pred`
  arange = np.arange(n, dtype=np.int32) + 1. # 1-based positions
  rel_k = np.in1d(pred[:n], label) # booleans marking which items of `pred` appear in `label`
  tp = np.ones(rel_k.sum(), dtype=np.int32).cumsum() # running count of true positives
  denom = arange[rel_k] # positions where the relevant items are found
  ap = (tp / denom).sum() / len(label) # average precision
  aps_.append(ap)

In [349]:
MAP_ = np.mean(aps_)
print(f'MAP on the test set = {round(MAP_, 4)}')

MAP on the test set = 0.0332
