### Game recommendation model

This Jupyter notebook contains the experimentation carried out to find a recommendation model that generates a list of 5 games from the ID of a given game.

The model uses an item-item relationship: given a game, similar games are recommended based on how similar that game is to every other game.

To build this model, memory-based algorithms were adopted. These approach the collaborative filtering problem using the entire dataset: in their user-based form they look for users similar to the active user (that is, the user who should receive the recommendations) and use their preferences to predict the active user's ratings. In this notebook the same idea is applied between items: each game is compared with every other game through its genre vector.

In [5]:
# Import the libraries
import pandas as pd
import numpy as np

import scipy as sp
from sklearn.metrics.pairwise import cosine_similarity
import operator
from sklearn.decomposition import PCA

import pyarrow as pa
import pyarrow.parquet as pq

In [6]:
# Load the file exported as a result of the EDA
df_modelo = pd.read_parquet('/Users/gaston/Documents/Proy.-Data-Analyst-1/Data/steam_games_EDA.parquet')

In [7]:
df_modelo

Unnamed: 0,genres,price,early_access,id,release_year,publisher,app_name,title,developer
88310,Action,4.99,0.0,761140.0,2018,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,Kotoshiro
88310,Casual,4.99,0.0,761140.0,2018,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,Kotoshiro
88310,Indie,4.99,0.0,761140.0,2018,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,Kotoshiro
88310,Simulation,4.99,0.0,761140.0,2018,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,Kotoshiro
88310,Strategy,4.99,0.0,761140.0,2018,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,Kotoshiro
...,...,...,...,...,...,...,...,...,...
120442,Indie,1.99,0.0,610660.0,2018,Laush Studio,Russian Roads,Russian Roads,Laush Dmitriy Sergeevich
120442,Racing,1.99,0.0,610660.0,2018,Laush Studio,Russian Roads,Russian Roads,Laush Dmitriy Sergeevich
120442,Simulation,1.99,0.0,610660.0,2018,Laush Studio,Russian Roads,Russian Roads,Laush Dmitriy Sergeevich
120443,Casual,4.99,0.0,658870.0,2017,SIXNAILS,EXIT 2 - Directions,EXIT 2 - Directions,"xropi,stev3ns"


In [8]:
# I decide to work with only 3 of the columns
df_modelo = df_modelo[['app_name','id','genres']]
df_modelo

Unnamed: 0,app_name,id,genres
88310,Lost Summoner Kitty,761140.0,Action
88310,Lost Summoner Kitty,761140.0,Casual
88310,Lost Summoner Kitty,761140.0,Indie
88310,Lost Summoner Kitty,761140.0,Simulation
88310,Lost Summoner Kitty,761140.0,Strategy
...,...,...,...
120442,Russian Roads,610660.0,Indie
120442,Russian Roads,610660.0,Racing
120442,Russian Roads,610660.0,Simulation
120443,EXIT 2 - Directions,658870.0,Casual


In [9]:
df_modelo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67037 entries, 88310 to 120443
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   app_name  67035 non-null  object  
 1   id        67035 non-null  float64 
 2   genres    65915 non-null  category
dtypes: category(1), float64(1), object(1)
memory usage: 1.6+ MB


In [10]:
# Drop rows with missing values; reassigning avoids modifying a slice in place
df_modelo = df_modelo.dropna()


In [11]:
df_modelo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 65911 entries, 88310 to 120443
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   app_name  65911 non-null  object  
 1   id        65911 non-null  float64 
 2   genres    65911 non-null  category
dtypes: category(1), float64(1), object(1)
memory usage: 1.6+ MB


In [12]:
# Cast the IDs to integers; assign() returns a new DataFrame instead of writing into a slice
df_modelo = df_modelo.assign(id=df_modelo['id'].astype(int))


In [13]:
df_modelo

Unnamed: 0,app_name,id,genres
88310,Lost Summoner Kitty,761140,Action
88310,Lost Summoner Kitty,761140,Casual
88310,Lost Summoner Kitty,761140,Indie
88310,Lost Summoner Kitty,761140,Simulation
88310,Lost Summoner Kitty,761140,Strategy
...,...,...,...
120442,Russian Roads,610660,Indie
120442,Russian Roads,610660,Racing
120442,Russian Roads,610660,Simulation
120443,EXIT 2 - Directions,658870,Casual


In [14]:
# Reshape the DataFrame so that cosine similarity can be applied afterwards
df_modelo_final  = pd.get_dummies(df_modelo, columns=['genres'], prefix='genre')
df_modelo_final = df_modelo_final.groupby(['app_name', 'id']).sum().reset_index()
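
The reshaping above can be illustrated on a small scale. The following is a minimal sketch with made-up games (`toy`, `Game A` and `Game B` are hypothetical and not part of the dataset); it shows how one row per (game, genre) pair becomes one row per game with one indicator column per genre.

```python
import pandas as pd

# Hypothetical long-format data: one row per (game, genre) pair
toy = pd.DataFrame({
    'app_name': ['Game A', 'Game A', 'Game B', 'Game B', 'Game B'],
    'id': [10, 10, 20, 20, 20],
    'genres': ['Action', 'Indie', 'Action', 'Casual', 'Indie'],
})

# One indicator column per genre, then collapse to one row per game
toy_wide = pd.get_dummies(toy, columns=['genres'], prefix='genre', dtype=int)
toy_wide = toy_wide.groupby(['app_name', 'id']).sum().reset_index()

print(toy_wide)
```

Each resulting row is the genre vector of one game, which is the input expected by the similarity step.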

In [15]:
df_modelo_final

Unnamed: 0,app_name,id,genre_Accounting,genre_Action,genre_Adventure,genre_Animation & Modeling,genre_Audio Production,genre_Casual,genre_Design & Illustration,genre_Early Access,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,! That Bastard Is Trying To Steal Our Gold !,449940,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"""Barely Attuned Magic Thingy"" Staff",308163,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,"""Glow Ball"" - The billiard puzzle game",388390,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0
3,"""Just Another Day"" - Seduce Me Otome CD",454790,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,"""Lethargic Sentience"" Wand",308164,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26345,（尘沙惑设定集）Lost in Secular Love - Concept Design ...,541220,0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
26346,４人打ちアクション麻雀 / ACTION MAHJONG,575810,0,1,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
26347,＜/reality＞,562280,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26348,＜/reality＞ Original Soundtrack,626850,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


To measure the similarity between games (item_similarity), cosine similarity was used, a measure commonly applied to evaluate how similar two vectors are in a multidimensional space. In recommender systems and data analysis it indicates how similar two items are, and it is computed as the cosine of the angle between the vectors that represent them. Here, each vector holds the genre indicators of one game.
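
As a worked example, the cosine similarity of two vectors u and v is their dot product divided by the product of their norms, u·v / (‖u‖ ‖v‖). The sketch below computes it by hand for three made-up genre vectors (`game_a`, `game_b` and `game_c` are hypothetical); the notebook itself relies on `cosine_similarity` from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical genre vectors over (Action, Adventure, Casual, Indie, RPG)
game_a = [1, 1, 1, 1, 0]
game_b = [1, 0, 0, 1, 0]
game_c = [0, 0, 0, 0, 1]

print(round(cosine(game_a, game_b), 6))  # 0.707107 (two shared genres)
print(cosine(game_a, game_c))            # 0.0 (no genre in common)
print(cosine(game_a, game_a))            # 1.0 (identical vectors)
```

With 0/1 genre vectors the score always falls between 0 and 1, and two games with exactly the same genres score 1.0.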

In [16]:
# Use every genre_* column (they start at position 2, after app_name and id)
similitudes = cosine_similarity(df_modelo_final.iloc[:,2:])


In [17]:
similitudes.shape

(26350, 26350)

In [18]:
similitudes = pd.DataFrame(similitudes)

In [33]:
similitudes

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26340,26341,26342,26343,26344,26345,26346,26347,26348,26349
0,1.000000,0.447214,0.500000,0.353553,0.447214,0.447214,0.577350,0.707107,0.447214,0.447214,...,0.000000,0.250000,0.000000,0.000000,0.447214,0.750000,0.750000,0.707107,0.707107,0.707107
1,0.447214,1.000000,0.223607,0.000000,1.000000,1.000000,0.258199,0.632456,1.000000,1.000000,...,0.447214,0.447214,0.447214,0.447214,0.800000,0.223607,0.447214,0.316228,0.316228,0.316228
2,0.500000,0.223607,1.000000,0.353553,0.223607,0.223607,0.866025,0.353553,0.223607,0.223607,...,0.000000,0.500000,0.000000,0.000000,0.000000,0.500000,0.500000,0.353553,0.353553,0.707107
3,0.353553,0.000000,0.353553,1.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.000000,...,0.000000,0.353553,0.000000,0.000000,0.000000,0.707107,0.353553,0.000000,0.000000,0.500000
4,0.447214,1.000000,0.223607,0.000000,1.000000,1.000000,0.258199,0.632456,1.000000,1.000000,...,0.447214,0.447214,0.447214,0.447214,0.800000,0.223607,0.447214,0.316228,0.316228,0.316228
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26345,0.750000,0.223607,0.500000,0.707107,0.223607,0.223607,0.577350,0.353553,0.223607,0.223607,...,0.000000,0.500000,0.000000,0.000000,0.223607,1.000000,0.500000,0.707107,0.707107,0.707107
26346,0.750000,0.447214,0.500000,0.353553,0.447214,0.447214,0.577350,0.707107,0.447214,0.447214,...,0.000000,0.250000,0.000000,0.000000,0.223607,0.500000,1.000000,0.353553,0.353553,0.707107
26347,0.707107,0.316228,0.353553,0.000000,0.316228,0.316228,0.408248,0.500000,0.316228,0.316228,...,0.000000,0.353553,0.000000,0.000000,0.316228,0.707107,0.353553,1.000000,1.000000,0.500000
26348,0.707107,0.316228,0.353553,0.000000,0.316228,0.316228,0.408248,0.500000,0.316228,0.316228,...,0.000000,0.353553,0.000000,0.000000,0.316228,0.707107,0.353553,1.000000,1.000000,0.500000


In the next step I keep only a fraction of the matrix, to be consumed by the API and to demonstrate how the model works.  
In this notebook, however, I use the function with the full matrix to show how it works.

In [28]:
similitudes_recortadas = similitudes.iloc[:1000, :1000]
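
The notebook does not state why the matrix is trimmed; one plausible reason is size. The following rough estimate assumes a dense matrix of 8-byte floats, and `dense_matrix_bytes` is a hypothetical helper written only for this illustration.

```python
def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Approximate size of a dense n_items x n_items matrix (float64 assumed)."""
    return n_items * n_items * bytes_per_value

full = dense_matrix_bytes(26350)    # shape reported by similitudes.shape
trimmed = dense_matrix_bytes(1000)  # the 1000 x 1000 slice

print(f"full:    {full / 1e9:.2f} GB")     # full:    5.55 GB
print(f"trimmed: {trimmed / 1e6:.2f} MB")  # trimmed: 8.00 MB
```

Note that the slice only covers the first 1000 games in the row order of `df_modelo_final`, so any game beyond that position has no row in the trimmed matrix.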

In [29]:
similitudes_recortadas

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,1.000000,0.447214,0.500000,0.353553,0.447214,0.447214,0.577350,0.707107,0.447214,0.447214,...,0.707107,0.707107,0.750000,0.500000,0.500000,0.577350,0.288675,0.866025,0.500000,0.500000
1,0.447214,1.000000,0.223607,0.000000,1.000000,1.000000,0.258199,0.632456,1.000000,1.000000,...,0.000000,0.000000,0.223607,0.000000,0.000000,0.258199,0.258199,0.516398,0.000000,0.000000
2,0.500000,0.223607,1.000000,0.353553,0.223607,0.223607,0.866025,0.353553,0.223607,0.223607,...,0.353553,0.353553,0.500000,0.500000,0.500000,0.288675,0.000000,0.577350,0.500000,0.500000
3,0.353553,0.000000,0.353553,1.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.000000,...,0.500000,0.500000,0.707107,0.707107,0.707107,0.000000,0.408248,0.408248,0.707107,0.707107
4,0.447214,1.000000,0.223607,0.000000,1.000000,1.000000,0.258199,0.632456,1.000000,1.000000,...,0.000000,0.000000,0.223607,0.000000,0.000000,0.258199,0.258199,0.516398,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.577350,0.258199,0.288675,0.000000,0.258199,0.258199,0.333333,0.408248,0.258199,0.258199,...,0.408248,0.408248,0.577350,0.000000,0.000000,1.000000,0.333333,0.333333,0.000000,0.000000
996,0.288675,0.258199,0.000000,0.408248,0.258199,0.258199,0.000000,0.408248,0.258199,0.258199,...,0.000000,0.000000,0.288675,0.000000,0.000000,0.333333,1.000000,0.333333,0.000000,0.000000
997,0.866025,0.516398,0.577350,0.408248,0.516398,0.516398,0.666667,0.816497,0.516398,0.516398,...,0.408248,0.408248,0.577350,0.577350,0.577350,0.333333,0.333333,1.000000,0.577350,0.577350
998,0.500000,0.000000,0.500000,0.707107,0.000000,0.000000,0.577350,0.000000,0.000000,0.000000,...,0.707107,0.707107,0.500000,1.000000,1.000000,0.000000,0.000000,0.577350,1.000000,1.000000


In [21]:
# Export the DataFrame with the final model so the API's recommendation function can consume it
df_modelo_final.to_parquet('df_modelo_final.parquet', engine='pyarrow', compression='snappy')

In [30]:
similitudes_recortadas.to_parquet('matriz_similitud.parquet', engine='pyarrow', compression='snappy')
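
Because the exported matrix has fewer rows than the exported DataFrame, code that consumes both files needs to check that the requested game is covered. The sketch below is one possible guard, not the API's actual code: `posicion_en_matriz` and the toy `juegos` table are hypothetical stand-ins for the loaded parquet files.

```python
import pandas as pd

def posicion_en_matriz(id_juego, juegos, n_filas_matriz):
    """Return the row position of id_juego, or None if the ID is unknown
    or lies beyond the rows kept in the trimmed matrix."""
    posiciones = (juegos['id'] == int(id_juego)).to_numpy().nonzero()[0]
    if len(posiciones) == 0:
        return None
    pos = int(posiciones[0])
    return pos if pos < n_filas_matriz else None

# Toy stand-ins: 4 games, but a matrix trimmed to the first 3 rows
juegos = pd.DataFrame({'id': [10, 20, 30, 40],
                       'app_name': ['A', 'B', 'C', 'D']})

print(posicion_en_matriz(20, juegos, n_filas_matriz=3))  # 1
print(posicion_en_matriz(40, juegos, n_filas_matriz=3))  # None (row 3 was trimmed)
print(posicion_en_matriz(99, juegos, n_filas_matriz=3))  # None (unknown ID)
```

Working with positions rather than labels also sidesteps any change in the matrix's column labels after a round trip through a file.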

In [31]:
# Here I define the function using the full matrix and check that it works
def recomendacion_juego(id_juego):
    
    id_juego = int(id_juego)
    # Filter the input game by its ID
    juego_seleccionado = df_modelo_final[df_modelo_final['id'] == id_juego]
    
    if juego_seleccionado.empty:
        return "The game with the specified ID does not exist in the database."
    
    # Get the similarity scores between the input game and every other game: the row label of the
    # game in df_modelo_final selects the matching column of the full similitudes matrix
    similarity_scores = similitudes[juego_seleccionado.index[0]]
    
    # Get the positions of the 5 most similar games, skipping the first one
    # (expected to be the input game itself, with similarity 1.0)
    indices_juegos_similares = similarity_scores.argsort()[::-1][1:5+1]
    
    # Get the names of the recommended games
    juegos_recomendados = df_modelo_final.iloc[indices_juegos_similares]['app_name'].tolist()
    
    return juegos_recomendados

In [32]:
recomendacion_juego(449940)

['MILF', 'Maria the Witch', 'Marble Void', 'Major Mayhem', 'Mad Dagger']
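
Games with identical genre sets all score 1.0 against each other, so the input game is not guaranteed to come first after sorting. The sketch below shows the effect on a tiny score vector and one way to exclude the input game explicitly. `top_n_excluyendo` is a hypothetical helper, and `kind='stable'` is used here only to make the example deterministic.

```python
import numpy as np

def top_n_excluyendo(scores, pos_entrada, n=5):
    """Positions of the n highest scores, never returning pos_entrada itself;
    tied scores keep their original order."""
    scores = np.asarray(scores, dtype=float)
    orden = np.argsort(-scores, kind='stable')
    return orden[orden != pos_entrada][:n].tolist()

# Game 0 ties at 1.0 with games 1 and 2 (identical genre sets)
scores = [1.0, 1.0, 1.0, 0.5]

# Reversed argsort, then dropping the first position
ingenuo = np.argsort(scores, kind='stable')[::-1][1:3].tolist()
print(ingenuo)                                       # [1, 0]: game 0 recommends itself
print(top_n_excluyendo(scores, pos_entrada=0, n=2))  # [1, 2]
```

When many games tie at the top score, which of them fill the five slots depends on the sort order rather than on similarity, so a secondary criterion may be worth considering.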