# Steam Game Recommendation
## Course project for **SCC0284 - Sistemas de Recomendação** (Recommender Systems)

* Lucas Ciziks - 12559472 - luciziks@usp.br

* Pedro Maçonetto - 12675419 - pedromaconetto@usp.br

In [2]:
# Libraries used in this project
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from caserec.recommenders.item_recommendation.item_attribute_knn import ItemAttributeKNN
# import caserecommender as caserec

## Datasets
We use two datasets found on Kaggle. The main dataset collects *reviews* written by users from around the world for a subset of the platform's games, and takes up about 8GB. Alongside it, we use a complementary dataset with game metadata to support the recommendation algorithms implemented here.

The full datasets are available at:

* [Steam Reviews](https://www.kaggle.com/datasets/andrewmvd/steam-reviews): real user reviews of games on the Steam platform in 2021;
* [Steam Metadata](https://www.kaggle.com/datasets/nikdavis/steam-store-games): information and metadata about the games available on the Steam platform.

### Sampling

Because the *reviews* dataset is very large (over 8GB), we take a sample of **100,000 users** so that the methods can be applied and tested within a reasonable processing time.

In [1]:
# Loading the Steam reviews file
# df = pd.read_csv('steam_reviews.csv')

# Extracting the unique IDs of all users present
# steam_id = df['author.steamid'].unique()

# Randomly choosing 100000 of these users (without replacement, so they are all distinct)
# random_users = np.random.choice(steam_id, 100000, replace=False)

# Building a separate DataFrame with only these users and their data
# data_reduzido = df[df['author.steamid'].isin(random_users)]

# Finally, saving this DataFrame as a .csv file to make the data easier to work with later
# data_reduzido.to_csv('reviews_reduzido.csv')
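
The sampling step can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's exact procedure: the helper name `sample_users`, its `seed` argument and the toy data are assumptions, and only the `author.steamid` column name comes from the dataset. Drawing without replacement ensures that the requested number of distinct users is obtained.

```python
import numpy as np
import pandas as pd

def sample_users(df, n_users, user_col="author.steamid", seed=0):
    """Keep every row that belongs to a random subset of distinct users."""
    rng = np.random.default_rng(seed)
    unique_users = df[user_col].unique()
    # Never ask for more users than exist
    n = min(n_users, len(unique_users))
    chosen = rng.choice(unique_users, size=n, replace=False)
    return df[df[user_col].isin(chosen)]

# Toy data: 4 users with 7 reviews in total (hypothetical values)
toy = pd.DataFrame({
    "author.steamid": [1, 1, 2, 3, 3, 3, 4],
    "app_id": [10, 20, 10, 30, 10, 20, 40],
})
sampled = sample_users(toy, n_users=2)
print(sampled["author.steamid"].nunique())
```

All reviews of a selected user are kept, so per-user histories stay intact in the reduced file.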

### Data Preprocessing

Using the reduced dataset, stored in the file *reviews_reduzido.csv*, the next step is to clean and prepare the data for the recommenders.

In [13]:
# Reduced game reviews
review = pd.read_csv('reviews_reduzido.csv')
review.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
0,1917,1917,292030,The Witcher 3: Wild Hunt,84817041,polish,cos malo kontentu,1610838136,1610838136,True,...,True,False,False,76561199069719082,16,1,3442.0,1777.0,2903.0,1611356000.0
1,2212,2212,292030,The Witcher 3: Wild Hunt,84770465,english,Better than some games.,1610776476,1610776476,True,...,True,False,False,76561198134193481,10,1,19397.0,1030.0,18917.0,1611383000.0
2,2292,2292,292030,The Witcher 3: Wild Hunt,84753053,english,still one of my favorite games love the story ...,1610749166,1610749166,True,...,True,False,False,76561198076025535,106,6,364.0,55.0,308.0,1610775000.0
3,2923,2923,292030,The Witcher 3: Wild Hunt,84646036,koreana,재미있음 너무 퀘스트가 많아서 할게 많아보이지만 안해도되는거 같음 메인퀘를 중심으로...,1610589095,1610589095,True,...,True,False,False,76561198043684666,10,1,2561.0,1836.0,1060.0,1610974000.0
4,3774,3774,292030,The Witcher 3: Wild Hunt,84500650,turkish,.,1610382839,1610382839,True,...,True,False,False,76561198954233218,26,1,2611.0,923.0,2141.0,1611362000.0


In [14]:
# Game metadata
metadados = pd.read_csv('steam_metadados.csv')
metadados.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [15]:
# Dropping columns that will not be used and renaming the ones that will
review = review.drop(['Unnamed: 0.1', 'Unnamed: 0', 'timestamp_created', 'received_for_free', 'written_during_early_access', 'votes_helpful', 'votes_funny', 'weighted_vote_score', 'timestamp_updated'], axis=1)
review = review.rename(columns = {'app_id': 'itemId', 'app_name': 'itemName', 'review_id': 'reviewId',
       'comment_count': 'commentCount', 'steam_purchase': 'steamPurchase',
       'author.steamid': 'userId', 'author.num_games_owned': 'userGames', 'author.num_reviews': 'userReviews',
       'author.playtime_forever': 'userPlaytimeForever', 'author.playtime_last_two_weeks': 'userPlaytimeLastTwoWeeks',
       'author.playtime_at_review': 'userPlaytimeAtReview', 'author.last_played': 'userLastPlayed'})
metadados.rename(columns = {'appid': 'itemId', 'name': 'itemName'}, inplace= True)

In [16]:
# Since we use a sample of the original dataset, we intersect the games in the sample
# with the games available in the metadata
commonItems = list(set(metadados.itemName.unique())&set(review.itemName))
review = review.loc[review['itemName'].isin(commonItems)]
metadados = metadados.loc[metadados['itemName'].isin(commonItems)]

In [17]:
# Mapping items and users to integer indices
itemId = {item: idx for idx, item in enumerate(commonItems)}
userId = {user: idx for idx, user in enumerate(review['userId'])}

metadados['itemId'] = metadados['itemName'].map(itemId).dropna()
review['itemId'] = review['itemName'].map(itemId).dropna()
review['userId'] = review['userId'].map(userId).dropna()

# Binarizing the user feedback (recommended or not recommended)
review['rating'] = review['recommended'].apply(lambda rec: 1 if rec else 0)

# Splitting the item genres from the metadata file into one row per genre
metadados = metadados.drop('genres', axis=1).join(metadados.genres.str.split(';', expand=True)
             .stack().reset_index(drop=True, level=1).rename('genre'))
metadados.dropna(inplace=True)
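
As an alternative sketch of the index mapping, `pandas.factorize` assigns contiguous integer codes (0 to n-1) to the distinct values of a column, which keeps any array sized by `max()+1` as small as possible. The toy DataFrame, its values and the `userIdx`/`itemIdx` column names are assumptions made for illustration; the notebook itself uses dictionary-based mappings.

```python
import pandas as pd

# Toy reviews with raw Steam-like identifiers (hypothetical values)
toy = pd.DataFrame({
    "userId": [76561198000000001, 76561198000000002, 76561198000000001],
    "itemName": ["Half-Life", "Portal 2", "Portal 2"],
    "recommended": [True, False, True],
})

# factorize returns (codes, uniques); codes are contiguous and follow
# the order in which each distinct value first appears
toy["userIdx"], user_index = pd.factorize(toy["userId"])
toy["itemIdx"], item_index = pd.factorize(toy["itemName"])

# Binary feedback: 1 for recommended, 0 otherwise
toy["rating"] = toy["recommended"].astype(int)

print(toy[["userIdx", "itemIdx", "rating"]])
```

The returned `user_index` and `item_index` can be used to translate the integer codes back to the original identifiers.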

In [18]:
# Datasets after preprocessing
review.head()

Unnamed: 0,itemId,itemName,reviewId,language,review,recommended,commentCount,steamPurchase,userId,userGames,userReviews,userPlaytimeForever,userPlaytimeLastTwoWeeks,userPlaytimeAtReview,userLastPlayed,rating
379,32,Half-Life,84840205,turkish,D-dostum torent kullanma ve al,True,0,True,0,26,2,79.0,0.0,79.0,1592832000.0,1
380,32,Half-Life,84803254,english,its pretty good.,True,0,True,3649,11,5,564.0,9.0,555.0,1610943000.0,1
381,32,Half-Life,83553107,german,Immer noch ein absolut geniales Spiel. Die Ste...,True,0,True,2,15,1,329.0,0.0,329.0,1608773000.0,1
382,32,Half-Life,83546336,turkish,bu oyunu unlost izledikten sonra alanlar :D,True,0,True,5658,6,4,650.0,69.0,319.0,1610887000.0,1
383,32,Half-Life,78617643,english,fun,True,0,True,10961,56,19,915.0,0.0,297.0,1605903000.0,1


In [19]:
metadados.head()

Unnamed: 0,itemId,itemName,release_date,english,developer,publisher,platforms,required_age,categories,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price,genre
6,32,Half-Life,1998-11-08,1,Valve,Valve,windows;mac;linux,0,Single-player;Multi-player;Online Multi-Player...,FPS;Classic;Action,0,27755,1100,1300,83,5000000-10000000,7.19,Action
10,12,Counter-Strike: Source,2004-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Cross-Platform Multiplayer;Steam ...,Action;FPS;Multiplayer,147,76640,3497,6842,400,10000000-20000000,7.19,Action
18,132,Half-Life 2: Episode Two,2007-10-10,1,Valve,Valve,windows;mac;linux,0,Single-player;Steam Achievements;Captions avai...,FPS;Action;Sci-fi,22,13902,696,354,301,5000000-10000000,5.79,Action
23,47,Portal 2,2011-04-18,1,Valve,Valve,windows;mac;linux,0,Single-player;Co-op;Steam Achievements;Full co...,Puzzle;Co-op;First-Person,51,138220,1891,1102,520,10000000-20000000,7.19,Action
23,47,Portal 2,2011-04-18,1,Valve,Valve,windows;mac;linux,0,Single-player;Co-op;Steam Achievements;Full co...,Puzzle;Co-op;First-Person,51,138220,1891,1102,520,10000000-20000000,7.19,Adventure
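
The one-row-per-genre layout shown above (Portal 2 appears once for Action and once for Adventure) can also be produced with `Series.str.split` followed by `DataFrame.explode`. The sketch below runs on toy data and assumes genres are separated by `;`, as in the metadata file; the variable names are illustrative.

```python
import pandas as pd

# Toy metadata with ';'-separated genres (hypothetical rows)
toy_meta = pd.DataFrame({
    "itemId": [0, 1],
    "genres": ["Action;Adventure", "Indie"],
})

# Split each genre string into a list, then emit one row per list element
long_meta = (
    toy_meta.assign(genre=toy_meta["genres"].str.split(";"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)
print(long_meta)
```

Each item is repeated once per genre, which is the shape expected by a two-column item/attribute metadata file.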


## Recommenders

### Hold-Out

To avoid biased results, we split the data into a training set and a test set. The recommender is fitted on the training set and then evaluated on the held-out test set.

In [20]:
train, test = train_test_split(review, test_size=.2, random_state=2)
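
A random row-wise split can leave some test users with no interactions in the training set. A quick way to measure this is sketched below; the helper name `cold_start_share` and the toy data are assumptions, not part of the notebook.

```python
import pandas as pd

def cold_start_share(train, test, col="userId"):
    """Fraction of test rows whose user never appears in the training set."""
    seen = set(train[col])
    return float((~test[col].isin(seen)).mean())

# Toy split (hypothetical user ids)
toy_train = pd.DataFrame({"userId": [0, 0, 1, 2]})
toy_test = pd.DataFrame({"userId": [1, 3, 3, 4]})

print(cold_start_share(toy_train, toy_test))  # 0.75
```

A high share suggests that many test interactions can only be predicted from biases or popularity, since no user-specific information was learned for them.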

### Evaluation

The functions below implement the metrics used to evaluate the recommenders.

In [21]:
# Computes the RMSE between a list of predictions and the true ratings
def rmse_user(preds, ratings):
    # Lists of different lengths cannot be compared; -1 signals the error
    if len(preds) != len(ratings):
        return -1
    total = 0
    for i in range(len(preds)):
        total += pow(preds[i]-ratings[i], 2)
    return np.sqrt(total/len(preds))

# Computes the Average Precision of the ranked list `rec` against the relevant
# items `gt`, stopping once `threshold` hits have been found
def AP(rec, gt, threshold):
    common = set(rec) & set(gt)
    hit = 0
    i = 0
    score = 0

    while i < len(rec) and hit < threshold:
        if rec[i] in common:
            hit += 1
            score += hit/(i+1)
        i += 1
    return score/hit if hit > 0 else 0

# Computes the Mean Average Precision over the users present in both `rec` and `gt`
def MAP(rec, gt, threshold=np.inf):
    common_users = list(set(rec['userId']) & set(gt['userId']))
    # Without any user in common there is nothing to average
    if not common_users:
        return 0
    score = 0

    for user in common_users:
        score += AP(rec.loc[rec.userId == user, 'itemId'].tolist(),
                    gt.loc[gt.userId == user, 'itemId'].tolist(), threshold)

    return score/len(common_users)
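
A small worked example may help when reading the Average Precision code. The sketch below re-implements the same idea under a hypothetical name, `average_precision`, without the hit limit: precision is accumulated at each rank where a relevant item appears and then divided by the number of hits, as in the notebook's `AP`. Other definitions divide by the number of relevant items instead, so values are not always comparable across libraries.

```python
def average_precision(rec, gt):
    """Mean of precision@rank over the ranks where a relevant item is found."""
    relevant = set(gt)
    hits = 0
    score = 0.0
    for rank, item in enumerate(rec, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0

# Ranked recommendations and ground truth for one user (toy item ids)
ap = average_precision([10, 20, 30, 40], [10, 30])
print(round(ap, 4))  # 0.8333
```

Here the relevant items appear at ranks 1 and 3, giving precisions 1/1 and 2/3, whose mean is about 0.8333.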

## Pointwise

### SVD Optimized (Collaborative Filtering)


In [23]:
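from math import sqrt

# Trains a biased matrix factorization model with stochastic gradient descent
def train_svdopt(train, n_factors, lr=0.05, reg=0.02, miter=10):
    global_mean = train['rating'].mean()
    # Sizes come from the full `review` DataFrame so that users and items that
    # appear only in the test set still have an entry in the parameter arrays
    n_users = review['userId'].max()+1
    n_items = review['itemId'].max()+1
    bu = np.zeros(n_users)  # user biases
    bi = np.zeros(n_items)  # item biases
    p = np.random.normal(0.1, 0.1, (n_users, n_factors))  # user latent factors
    q = np.random.normal(0.1, 0.1, (n_items, n_factors))  # item latent factors
    error = []  # training RMSE per epoch

    for t in range(miter):
        sq_error = 0
        for index, row in train.iterrows():
            u = row['userId']
            i = row['itemId']
            r_ui = row['rating']
            # Prediction: global mean + user bias + item bias + factor interaction
            pred = global_mean + bu[u] + bi[i] + np.dot(p[u], q[i])
            e_ui = r_ui - pred
            sq_error = sq_error + pow(e_ui, 2)
            # Bias updates (no regularization term is applied to the biases)
            bu[u] = bu[u] + lr * e_ui
            bi[i] = bi[i] + lr * e_ui
            for f in range(n_factors):
                # Keep the old user factor so the item update uses the pre-update value
                temp_uf = p[u][f]
                p[u][f] = p[u][f] + lr * (e_ui * q[i][f] - reg * p[u][f])
                q[i][f] = q[i][f] + lr * (e_ui * temp_uf - reg * q[i][f])
        error.append(sqrt(sq_error/len(train)))

    return global_mean, bu, bi, p, q, error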
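
The inner loop over latent factors can be written as two NumPy vector operations. The sketch below checks, on random toy vectors, that the vectorized form gives the same result as the per-factor loop; the error value `e_ui = 0.3` and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, lr, reg = 4, 0.05, 0.02
p_u = rng.normal(0.1, 0.1, n_factors)  # latent factors of one user
q_i = rng.normal(0.1, 0.1, n_factors)  # latent factors of one item
e_ui = 0.3                             # hypothetical prediction error

# Per-factor loop, mirroring the structure of the training function
p_loop, q_loop = p_u.copy(), q_i.copy()
for f in range(n_factors):
    temp_uf = p_loop[f]
    p_loop[f] = p_loop[f] + lr * (e_ui * q_loop[f] - reg * p_loop[f])
    q_loop[f] = q_loop[f] + lr * (e_ui * temp_uf - reg * q_loop[f])

# Vectorized form: both updates use the values from before the step
p_vec = p_u + lr * (e_ui * q_i - reg * p_u)
q_vec = q_i + lr * (e_ui * p_u - reg * q_i)

print(np.allclose(p_loop, p_vec), np.allclose(q_loop, q_vec))
```

Since Python-level loops dominate the running time of this kind of training, the vectorized form is usually noticeably faster for larger `n_factors`.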


In [24]:
gl, bu, bi, p, q, error = train_svdopt(train, 4, miter=30)

In [25]:
preds = []
for i, row in test.iterrows():
    preds.append(gl + bu[row['userId']] + bi[row['itemId']] + np.dot(p[row['userId']], q[row['itemId']]))

In [26]:
rmse_user(preds, test['rating'].tolist())

0.3336945339693189
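
To put an RMSE value in context, it can help to compare it with a constant predictor that always outputs the mean training rating. The sketch below uses made-up toy ratings, not the notebook's data, so the number it prints says nothing about the result above.

```python
import numpy as np

# Toy binary ratings (hypothetical values)
train_ratings = np.array([1, 1, 1, 0, 1])
test_ratings = np.array([1, 0, 1, 1, 1])

# Constant predictor: always output the mean rating seen in training
global_mean = train_ratings.mean()
baseline_preds = np.full(len(test_ratings), global_mean)

baseline_rmse = np.sqrt(np.mean((baseline_preds - test_ratings) ** 2))
print(round(baseline_rmse, 4))  # 0.4
```

A model is only informative to the extent that its test RMSE falls below this kind of baseline computed on the same split.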

In [27]:
# Training RMSE per epoch (rendering the figure in a notebook requires nbformat>=4.2.0)
px.line(error)

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

### Content-Based Filtering

As an example of a content-based filtering algorithm, we use **ItemAttributeKNN**, as implemented in the Case Recommender library.

In [38]:
metadados[['itemId', 'genre']].to_csv('meta_genres.dat', index=False, sep='\t', header=False)
train[['userId', 'itemId', 'rating']].to_csv('train.dat', index=False, header=False, sep='\t')
test[['userId', 'itemId', 'rating']].to_csv('test.dat', index=False, header=False, sep='\t')

In [39]:
ItemAttributeKNN('train.dat', 'test.dat', output_file='result_IAKNN.dat', metadata_file='meta_genres.dat', as_similar_first=True).compute()

[Case Recommender: Item Recommendation > Item Attribute KNN Algorithm]

train data:: 7031 users and 215 items (10596 interactions) | sparsity:: 99.30%
test data:: 2270 users and 182 items (2650 interactions) | sparsity:: 99.36%

training_time:: 0.033904 sec
>> metadata:: 216 items and 19 metadata (557 interactions) | sparsity:: 86.43%
prediction_time:: 9.823047 sec


Eval:: PREC@1: 0.004846 PREC@3: 0.005286 PREC@5: 0.005991 PREC@10: 0.004978 RECALL@1: 0.003524 RECALL@3: 0.011652 RECALL@5: 0.021527 RECALL@10: 0.037349 MAP@1: 0.004846 MAP@3: 0.009325 MAP@5: 0.01254 MAP@10: 0.01481 NDCG@1: 0.004846 NDCG@3: 0.013583 NDCG@5: 0.020204 NDCG@10: 0.026438 


In [40]:
result_IAKNN = pd.read_csv('result_IAKNN.dat', sep='\t', names=['userId', 'itemId', 'rating'])
result_IAKNN.head()

Unnamed: 0,userId,itemId,rating
0,0,11,1.0
1,0,12,1.0
2,0,17,1.0
3,0,23,1.0
4,0,38,1.0


In [31]:
# Computing MAP
MAP(result_IAKNN, test, 100)

0.031508382797042586

In [44]:
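from sklearn.metrics import pairwise_distances

# Dense user x item matrix built from the training ratings
A = np.zeros((train.userId.max() + 1, train.itemId.max() + 1))
for i, row in train.iterrows():
    A[row['userId']][row['itemId']] = row['rating']

# Note: pairwise_distances returns cosine *distances* (1 - cosine similarity),
# so larger values written to the file below mean less similar items
sim_matrix = pairwise_distances(A.T, metric="cosine")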
with open('sim_r_matrix.dat', 'w') as arq_sim_matrix:
    for i in range(len(sim_matrix)):
        for j in range(len(sim_matrix)):
            if i < j:
                arq_sim_matrix.write(
                    str(i) + '\t' + str(j) + '\t' + str(sim_matrix[i][j]) + '\n')

ItemAttributeKNN('train.dat', 'test.dat', output_file='recs_iaknn_cos.dat',
                 similarity_file='sim_r_matrix.dat', as_similar_first=True).compute()


[Case Recommender: Item Recommendation > Item Attribute KNN Algorithm]

train data:: 7031 users and 215 items (10596 interactions) | sparsity:: 99.30%
test data:: 2270 users and 182 items (2650 interactions) | sparsity:: 99.36%

training_time:: 0.068562 sec
prediction_time:: 8.676105 sec


Eval:: PREC@1: 0.003524 PREC@3: 0.002496 PREC@5: 0.001674 PREC@10: 0.001586 RECALL@1: 0.002203 RECALL@3: 0.005837 RECALL@5: 0.006498 RECALL@10: 0.01246 MAP@1: 0.003524 MAP@3: 0.00514 MAP@5: 0.00536 MAP@10: 0.00635 NDCG@1: 0.003524 NDCG@3: 0.006676 NDCG@5: 0.007117 NDCG@10: 0.009682 
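
Because `pairwise_distances` with `metric="cosine"` returns distances rather than similarities, it is worth making the relationship between the two explicit. The sketch below computes item-item cosine similarity with plain NumPy on a toy matrix; the function name `cosine_similarity_columns` and the data are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_columns(A):
    """Cosine similarity between the columns (items) of a user x item matrix."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # all-zero columns: avoid division by zero
    unit = A / norms         # each column scaled to unit length
    return unit.T @ unit

# Toy matrix: 3 users x 3 items; the last item has no positive rating
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

sim = cosine_similarity_columns(A)
dist = 1.0 - sim  # cosine distance is one minus the similarity

print(sim[0, 1], dist[0, 1])
```

In a dense matrix like this one, a rating of 0 is indistinguishable from a missing entry, so items that only received negative reviews end up as all-zero columns.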


In [45]:
result_IAKNN_Sim = pd.read_csv('recs_iaknn_cos.dat', sep='\t', names=['userId', 'itemId', 'rating'])
result_IAKNN_Sim.head()

Unnamed: 0,userId,itemId,rating
0,0,1,1.0
1,0,2,1.0
2,0,3,1.0
3,0,4,1.0
4,0,5,1.0


In [47]:
# Computing MAP
MAP(result_IAKNN_Sim, test, 100)

0.013509171241130003

## Pairwise

### BPR

In [34]:
from caserec.recommenders.item_recommendation.bprmf import BprMF

BprMF('train.dat', 'test.dat', 'result_BPRMF.dat').compute()

[Case Recommender: Item Recommendation > BPRMF]

train data:: 7031 users and 215 items (10596 interactions) | sparsity:: 99.30%
test data:: 2270 users and 182 items (2650 interactions) | sparsity:: 99.36%

training_time:: 7.713846 sec
prediction_time:: 0.946003 sec


Eval:: PREC@1: 0.039648 PREC@3: 0.028781 PREC@5: 0.025815 PREC@10: 0.020881 RECALL@1: 0.032412 RECALL@3: 0.069482 RECALL@5: 0.103069 RECALL@10: 0.165686 MAP@1: 0.039648 MAP@3: 0.059104 MAP@5: 0.067126 MAP@10: 0.074376 NDCG@1: 0.039648 NDCG@3: 0.077821 NDCG@5: 0.094308 NDCG@10: 0.114722 
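
The pairwise idea behind BPR can be sketched as a single stochastic gradient step on one triple (user `u`, interacted item `i`, non-interacted item `j`). This is a generic illustration of the BPR-MF update under assumed hyperparameters and toy vectors; it is not Case Recommender's implementation, and the function name `bpr_step` is hypothetical.

```python
import numpy as np

def bpr_step(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One gradient-ascent step on ln sigmoid(x_uij) with L2 regularization."""
    x_uij = p_u @ (q_i - q_j)        # score difference between items i and j
    d = 1.0 / (1.0 + np.exp(x_uij))  # sigmoid(-x_uij): derivative of ln sigmoid
    p_new = p_u + lr * (d * (q_i - q_j) - reg * p_u)
    qi_new = q_i + lr * (d * p_u - reg * q_i)
    qj_new = q_j + lr * (-d * p_u - reg * q_j)
    return p_new, qi_new, qj_new

# Toy latent vectors (hypothetical values)
p_u = np.array([0.1, -0.2])
q_i = np.array([0.3, 0.1])   # item the user interacted with
q_j = np.array([-0.1, 0.2])  # item the user did not interact with

before = p_u @ (q_i - q_j)
p_u2, q_i2, q_j2 = bpr_step(p_u, q_i, q_j)
after = p_u2 @ (q_i2 - q_j2)
print(before < after)
```

The step pushes the score of the interacted item above that of the other item, which is why BPR optimizes the ranking directly instead of a rating error.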


In [41]:
result_BPRMF = pd.read_csv('result_BPRMF.dat', sep='\t', names=['userId', 'itemId', 'score'])
result_BPRMF.head()

Unnamed: 0,userId,itemId,score
0,0,65,4.306164
1,0,62,4.268502
2,0,0,3.729371
3,0,108,3.712708
4,0,184,3.548882


In [36]:
# Computing MAP
MAP(result_BPRMF, test, 100)

0.15823161660790522

## Listwise

### CoFiRank

In [43]:
from adarank import AdaRank
from metrics import NDCGScorer

scorer = NDCGScorer(k=10)
model = AdaRank(max_iter=100, estop=10, scorer=scorer).fit(X, y, qid)
pred = model.predict(X_test, qid_test)
print(scorer(y_test, pred, qid_test).mean())

NameError: name 'X' is not defined
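
Listwise methods optimize a metric computed over the whole ranked list, such as NDCG@k. The sketch below shows one common formulation with exponential gain and a logarithmic discount. The formula used by `NDCGScorer` in the `metrics` module is not shown here, so this is an assumption, as are the function names and toy relevances.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked list."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the items in the order they were ranked (toy values)
ndcg = ndcg_at_k([1, 0, 1], k=3)
print(round(ndcg, 4))  # 0.9197
```

With binary relevance, the ranking loses score only because the second relevant item sits at rank 3 instead of rank 2.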

# Conclusion