<details>
<summary>About SBERT</summary>


**Reference Iteration**: [Multi-BERT for Embeddings for Recommendation System](https://arxiv.org/abs/2308.13050)

We chose SBERT over BERT and other BERT variants for a few reasons. The primary one is that SBERT produces sentence-level embeddings, and its siamese network structure is trained so that the similarity between two embeddings reflects the semantic similarity of the underlying texts. This makes it well suited to a content-based recommendation system, because recommendations can be driven by how similar the content actually is. We therefore feed user reviews, game descriptions, genres, developers and publishers into the model to build a content-based recommender that captures deeper semantic similarity.

**Environment needed**: We used a T4 GPU with High RAM.
</details>

## Import packages & set-up

In [None]:
import re
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from wordcloud import WordCloud

# download the NLTK resources used for tokenisation, stop words and lemmatisation
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df = pd.read_csv('/content/drive/MyDrive/BT4222 Project/EDA/bt4222_dataset_21.csv')
df_modelling = df.copy()

In [None]:
print(df.shape)

(126144, 31)


In [None]:
print(df.columns)

Index(['app_id', 'app_name', 'review_id', 'review', 'timestamp_updated',
       'recommended', 'author.steamid', 'author.num_games_owned',
       'author.playtime_at_review', 'Release date', 'Required age', 'Price',
       'DLC count', 'About the game', 'Windows', 'Mac', 'Linux', 'Genres',
       'Categories', 'Developers', 'Publishers', 'is_free', 'owned_games',
       'weighted_vote_score', 'votes_helpful', 'user_review_count',
       'item_review_count', 'game_playtime_percentile', 'game_description',
       'review_text_clean', 'pred_rating'],
      dtype='object')


## Preprocessing

In [None]:
# group reviews by app_id and join their text into one string per game
df_reviews_aggregated = df.groupby('app_id')['review_text_clean'].apply(lambda x: ' '.join(x)).reset_index()


In [None]:
print(df_reviews_aggregated.head())

   app_id                                  review_text_clean
0      70  One of the best FPS ever.  Stands the test of ...
1     240  awesome game. I was addicted to this for 5 yea...
2     420  The older, cooler brother of the Half-Life epi...
3     620  As Valve's first full length game since Half l...
4    4000  Now you too can make your very own creepy Pose...


In [None]:
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer




In [None]:
!pip install faiss-cpu



In [None]:
import torch
torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device(type='cuda')

Run SBERT on game reviews and game features

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import faiss

model = SentenceTransformer('all-MiniLM-L6-v2')

# combine game description fields into combined_text
df['combined_text'] = (
    df['About the game'].fillna('')+ ' ' +
    df['Categories'].fillna('') + ' '+ df['Genres'].fillna('') + ' ' +
    df['Developers'].fillna('') + ' ' +
    df['Publishers'].fillna('')
)

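# get embeddings for each distinct combined_text (encoded once, in batches) and map them back to the rows
unique_texts = df['combined_text'].unique().tolist()
description_lookup = dict(zip(unique_texts, model.encode(unique_texts, batch_size=64)))
df['description_embedding'] = df['combined_text'].map(description_lookup.get)

# get embeddings for each review in batches instead of one encode call per row
df['review_embedding'] = list(model.encode(df['review_text_clean'].tolist(), batch_size=64))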

# aggregate review embeddings by app_id
review_embeddings_aggregated = df.groupby('app_id')['review_embedding'].apply(lambda x: np.mean(x.tolist(), axis=0)).reset_index()
review_embeddings_aggregated.rename(columns={'review_embedding': 'aggregated_review_embedding'}, inplace=True)

# get description_embedding
description_embeddings = df.drop_duplicates('app_id')[['app_id', 'description_embedding']]

# merge reviews and descriptions
game_embeddings = pd.merge(review_embeddings_aggregated, description_embeddings, on='app_id')
game_embeddings['combined_embedding'] = game_embeddings.apply(
    lambda row: (row['aggregated_review_embedding'] + row['description_embedding']) / 2,
    axis=1
)
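
To make the pooling step concrete, the sketch below repeats it on made-up 3-dimensional vectors instead of real 384-dimensional SBERT output. The values and variable names are illustrative assumptions, not data from this notebook.

```python
import numpy as np

# toy stand-ins for SBERT vectors (the real ones have 384 dimensions)
review_embeddings = np.array([
    [1.0, 0.0, 2.0],   # review 1 of a game
    [3.0, 2.0, 0.0],   # review 2 of the same game
])
description_embedding = np.array([0.0, 1.0, 1.0])

# step 1: mean-pool all review vectors of the game
aggregated_review_embedding = review_embeddings.mean(axis=0)

# step 2: average the pooled review vector with the description vector
combined_embedding = (aggregated_review_embedding + description_embedding) / 2

print(aggregated_review_embedding)  # [2. 1. 1.]
print(combined_embedding)           # [1. 1. 1.]
```

Because the two parts are averaged with equal weight, the description contributes half of the final vector regardless of how many reviews a game has.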



Splitting Training and Test data

In [None]:
# split data into train and test
print(df.head())
df_modelling = df_modelling.sort_values(by=['author.steamid', 'timestamp_updated']).reset_index(drop=True)
split = 0.8

# get unique user IDs
unique_users = df_modelling['author.steamid'].unique()
train_data = []
test_data = []

# split data for each user
for user_id in unique_users:
    user_data = df_modelling[df_modelling['author.steamid'] == user_id]
    split_idx = round(len(user_data) * split)
    train_data.append(user_data.iloc[:split_idx])
    test_data.append(user_data.iloc[split_idx:])

train = pd.concat(train_data)
test = pd.concat(test_data)
test = test[test["app_id"].isin(train['app_id'])]

print(train.shape)
print(test.shape)


   app_id                            app_name  review_id  \
0    4000                         Garry's Mod     297534   
1   48700              Mount & Blade: Warband     297695   
2   48700              Mount & Blade: Warband     321598   
3   35140  Batman: Arkham Asylum GOTY Edition     321586   
4     240              Counter-Strike: Source     321552   

                                              review  timestamp_updated  \
0  Now you too can make your very own creepy Pose...         1290229222   
1  All the thrill of killing groups of raiders on...         1290283941   
2  I really liked this game. You really can build...         1290984804   
3  The story seemed awesome at first, but then it...         1291337732   
4  awesome game. I was addicted to this for 5 years.         1291338488   

   recommended     author.steamid  author.num_games_owned  \
0         True  76561197967992446                    1037   
1         True  76561197967992446                    1037   
2    
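
The per-user chronological split used in the cell above can be illustrated on a small, made-up review history. The helper name `split_user_history` and the `(app_id, timestamp)` pairs are hypothetical; the sketch only mirrors the `round(len(user_data) * split)` rule.

```python
def split_user_history(history, train_fraction=0.8):
    """Split one user's reviews, oldest first, into train and test parts."""
    ordered = sorted(history, key=lambda item: item[1])  # sort by timestamp
    split_idx = round(len(ordered) * train_fraction)
    return ordered[:split_idx], ordered[split_idx:]

# hypothetical (app_id, timestamp) pairs for one user
history = [(240, 50), (4000, 10), (48700, 30), (35140, 40), (70, 20)]
train_part, test_part = split_user_history(history)

print([app for app, _ in train_part])  # [4000, 70, 48700, 35140]
print([app for app, _ in test_part])   # [240]

# a user with a single review ends up entirely in the training set
print(split_user_history([(620, 5)]))  # ([(620, 5)], [])
```

The last call shows a side effect of the rounding rule: users with very few reviews may contribute nothing to the test set.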

FAISS

In [None]:
combined_embeddings_array = np.vstack(game_embeddings['combined_embedding'].values)
print(combined_embeddings_array.shape)


(166, 384)


The embedding dimension (384) is moderate, so no PCA is applied.

In [None]:
import faiss
import numpy as np

# convert embeddings to a float32 array (the dtype FAISS expects)
combined_embeddings_array = np.vstack(game_embeddings['combined_embedding'].values).astype('float32')

# normalise embeddings to unit length so that ranking by L2 distance matches ranking by cosine similarity
combined_embeddings_normalized = combined_embeddings_array / np.linalg.norm(combined_embeddings_array, axis=1, keepdims=True)

# set up FAISS
app_ids = game_embeddings['app_id'].values
dimension = combined_embeddings_normalized.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(combined_embeddings_normalized)

# get similar games with FAISS search
def get_similar_games(app_id, top_n=5):
    query_index = np.where(app_ids == app_id)[0][0]
    query_embedding = combined_embeddings_normalized[query_index].reshape(1, -1)

    # FAISS search
    distances, indices = faiss_index.search(query_embedding, top_n + 1)  # +1 to include itself in results
    # get top similar app_ids
    similar_indices = indices[0][1:] if indices[0][0] == query_index else indices[0][:top_n]
    similar_app_ids = app_ids[similar_indices]
    return similar_app_ids.tolist()
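
The reason an L2 index can stand in for cosine similarity is that, for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine`. The following sketch checks this on random toy vectors with plain NumPy; it does not use FAISS, and the data are an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))                      # toy embeddings
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query = unit[0]
cosine = unit @ query                                  # cosine similarity to the query
squared_l2 = ((unit - query) ** 2).sum(axis=1)         # squared L2 distance to the query

# identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(squared_l2, 2 - 2 * cosine))

# sorting by ascending distance gives the same order as descending cosine
print(np.array_equal(np.argsort(squared_l2), np.argsort(-cosine)))
```

`IndexFlatL2` reports squared L2 distances, so a smaller returned distance corresponds directly to a higher cosine similarity once the vectors are normalised.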


In [None]:

# get user embeddings by averaging embeddings of interacted games
def compute_user_embeddings(user_game_df, game_embeddings_df, app_id_column='app_id', embedding_column='combined_embedding'):
    user_embeddings = {}
    for user_id in user_game_df['author.steamid'].unique():
        user_app_ids = user_game_df[user_game_df['author.steamid'] == user_id][app_id_column]
        game_embeddings = game_embeddings_df[game_embeddings_df['app_id'].isin(user_app_ids)][embedding_column]
        if not game_embeddings.empty:
            user_embedding = np.mean(np.vstack(game_embeddings.values), axis=0)
            user_embedding = user_embedding / np.linalg.norm(user_embedding)
            user_embeddings[user_id] = user_embedding
    return user_embeddings

# get user embeddings from the training data
user_embeddings = compute_user_embeddings(train, game_embeddings)


# convert test for evaluation
test_df = test.groupby('author.steamid')['app_id'].apply(list).reset_index()
test_df.columns = ['author.steamid', 'actual_app_ids']  # Rename for clarity

# get recommendations for a user based on embedding
def get_recommendations_for_user(user_id, user_embeddings, faiss_index, app_ids, top_n=4):
    user_embedding = user_embeddings[user_id].reshape(1, -1).astype('float32')
    distances, indices = faiss_index.search(user_embedding, top_n)
    recommended_app_ids = app_ids[indices.flatten()]
    return recommended_app_ids.tolist()


all_recommendations = []
all_actuals = []
evaluated_user_ids = []  # users that actually received recommendations

# get recommendations for each user in the test set
for _, row in test_df.iterrows():
    user_id = row['author.steamid']
    actual_app_ids = row['actual_app_ids']
    if user_id in user_embeddings:  # Ensure user embedding is available
        recommendations = get_recommendations_for_user(user_id, user_embeddings, faiss_index, app_ids, top_n=4)
        all_recommendations.append(recommendations)
        all_actuals.append(actual_app_ids)
        evaluated_user_ids.append(user_id)


# print using the ids collected above so that skipped users cannot misalign the three lists
for user_id, recs, actuals in zip(evaluated_user_ids, all_recommendations, all_actuals):
    print(f"User {user_id} Recommendations: {recs}")
    print(f"User {user_id} Actuals: {actuals}")


Streaming output truncated to the last 5000 lines.
User 76561198064170000 Recommendations: [582660, 8930, 39210, 377160]
User 76561198064170000 Actuals: [435150, 431960, 214950, 739630]
User 76561198064178653 Recommendations: [239140, 377160, 233860, 460930]
User 76561198064178653 Actuals: [548430, 242760, 552520, 477160, 268910]
User 76561198064184650 Recommendations: [242760, 239140, 823130, 238460]
User 76561198064184650 Actuals: [782330, 379720, 8870, 424840, 418370]
User 76561198064203806 Recommendations: [239140, 637650, 242760, 460930]
User 76561198064203806 Actuals: [1289310, 359550, 548430, 304390, 381210]
User 76561198064215417 Recommendations: [304390, 233860, 204360, 582660]
User 76561198064215417 Actuals: [242760, 255710, 238320, 548430]
User 76561198064236234 Recommendations: [204360, 242760, 582660, 823130]
User 76561198064236234 Actuals: [823130, 519860, 638970, 205100, 447530, 427520, 1145360, 960090]
User 76561198064240292 Recommendations: [637650, 39210
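
Since each user embedding is the mean of the games that user reviewed in the training split, the nearest neighbours can include those same games, which by construction cannot appear in the user's test set. One possible variant, which is not part of the original notebook, is to drop already-seen games from the ranked list. The helper name `filter_seen` and the ids below are hypothetical.

```python
def filter_seen(ranked_app_ids, seen_app_ids, top_n=4):
    """Keep the first top_n ranked games that the user has not interacted with."""
    seen = set(seen_app_ids)
    return [app_id for app_id in ranked_app_ids if app_id not in seen][:top_n]

# hypothetical ranked neighbours for one user, nearest first
ranked = [582660, 8930, 39210, 377160, 242760, 239140]
# hypothetical games the same user reviewed in the training split
seen_in_train = [8930, 377160]

print(filter_seen(ranked, seen_in_train))  # [582660, 39210, 242760, 239140]
```

To leave enough candidates after filtering, the index search would need to request more neighbours than `top_n`, for example `top_n` plus the number of games the user has already seen.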

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# calculate Precision@K, Recall@K, F1@K, and NDCG@K
def calculate_metrics_for_user(actual, recommended, k=4):
    actual_set = set(actual)
    recommended_at_k = recommended[:k]

    # Precision
    hits_at_k = sum([1 for item in recommended_at_k if item in actual_set])
    precision_at_k = hits_at_k / k

    # Recall
    recall_at_k = hits_at_k / len(actual_set) if actual_set else 0

    # F1
    f1_at_k = (2 * precision_at_k * recall_at_k) / (precision_at_k + recall_at_k) if (precision_at_k + recall_at_k) > 0 else 0

    # NDCG
    dcg = sum([1 / np.log2(idx + 2) for idx, item in enumerate(recommended_at_k) if item in actual_set])
    idcg = sum([1 / np.log2(idx + 2) for idx in range(min(len(actual_set), k))])
    ndcg_at_k = dcg / idcg if idcg > 0 else 0

    return precision_at_k, recall_at_k, f1_at_k, ndcg_at_k

# get metrics across all users
def calculate_aggregated_metrics(all_actuals, all_recommendations, k=4):
    precision_list = []
    recall_list = []
    f1_list = []
    ndcg_list = []

    for actual, recommended in zip(all_actuals, all_recommendations):
        precision_at_k, recall_at_k, f1_at_k, ndcg_at_k = calculate_metrics_for_user(actual, recommended, k)
        precision_list.append(precision_at_k)
        recall_list.append(recall_at_k)
        f1_list.append(f1_at_k)
        ndcg_list.append(ndcg_at_k)

    # get avg metrics across all users
    metrics = {
        'Precision': np.mean(precision_list),
        'Recall': np.mean(recall_list),
        'F1 Score': np.mean(f1_list),
        'NDCG': np.mean(ndcg_list)
    }

    return metrics


all_recommendations = []
all_actuals = []

for _, row in test_df.iterrows():
    user_id = row['author.steamid']
    actual_app_ids = row['actual_app_ids']
    if user_id in user_embeddings:
        recommendations = get_recommendations_for_user(user_id, user_embeddings, faiss_index, app_ids, top_n=4)
        all_recommendations.append(recommendations)
        all_actuals.append(actual_app_ids)

# overall metrics
metrics = calculate_aggregated_metrics(all_actuals, all_recommendations, k=4)


print(f"Overall Precision@4: {metrics['Precision']}")
print(f"Recall@4: {metrics['Recall']}")
print(f"F1@4: {metrics['F1 Score']}")
print(f"NDCG@4: {metrics['NDCG']}")


Overall Precision@4: 0.021073961499493414
Recall@4: 0.016743214371499726
F1@4: 0.018418925440202037
NDCG@4: 0.019717864014615536
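
As a worked example of the four metrics for a single user, the sketch below applies the same formulas as `calculate_metrics_for_user` to a made-up recommendation list. The item ids are hypothetical.

```python
import math

actual = {10, 20, 30}            # hypothetical held-out games
recommended = [10, 99, 20, 98]   # hypothetical top-4 list, best first
k = 4

hits = [item in actual for item in recommended[:k]]
precision = sum(hits) / k
recall = sum(hits) / len(actual)
f1 = 2 * precision * recall / (precision + recall)

# a hit at 0-based rank i contributes 1 / log2(i + 2)
dcg = sum(1 / math.log2(i + 2) for i, hit in enumerate(hits) if hit)
# ideal ranking: as many hits as possible placed at the top
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(actual), k)))
ndcg = dcg / idcg

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} ndcg={ndcg:.3f}")
# precision=0.500 recall=0.667 f1=0.571 ndcg=0.704
```

Two of the four recommendations are hits, but NDCG is below 1 because the second hit sits at rank 3 and the third relevant game is missing entirely.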


Using HNSWlib (Hierarchical Navigable Small World)

In [None]:
!pip install hnswlib

Collecting hnswlib
  Downloading hnswlib-0.8.0.tar.gz (36 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (pyproject.toml) ... done
  Created wheel for hnswlib: filename=hnswlib-0.8.0-cp310-cp310-linux_x86_64.whl size=2360790 sha256=0b4d077a7dbaaba0cb7a0f4bb390bbee5ccbd1496ab69bfac242af9db13a5b51
  Stored in directory: /root/.cache/pip/wheels/af/a9/3e/3e5d59ee41664eb31a4e6de67d1846f86d16d93c45f277c4e7
Successfully built hnswlib
Installing collected packages: hnswlib
Successfully installed hnswlib-0.8.0


In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
import hnswlib
# load embeddings
combined_embeddings_array = np.vstack(game_embeddings['combined_embedding'].values).astype(np.float32)
combined_embeddings_normalized = combined_embeddings_array / np.linalg.norm(combined_embeddings_array, axis=1, keepdims=True)

# HNSWlib setup
dim = combined_embeddings_normalized.shape[1]
num_elements = combined_embeddings_normalized.shape[0]
app_ids = game_embeddings['app_id'].values  # Map back to app IDs

# initialise HNSWlib index for inner product (cosine similarity)
index = hnswlib.Index(space='ip', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=100, M=8)
index.add_items(combined_embeddings_normalized, app_ids)

# get user embeddings
def compute_user_embeddings(user_game_df, game_embeddings_df, app_id_column='app_id', embedding_column='combined_embedding'):
    user_embeddings = {}
    for user_id in user_game_df['author.steamid'].unique():
        user_app_ids = user_game_df[user_game_df['author.steamid'] == user_id][app_id_column]
        game_embeddings = game_embeddings_df[game_embeddings_df['app_id'].isin(user_app_ids)][embedding_column]
        user_embedding = np.mean(np.vstack(game_embeddings.values), axis=0)
        user_embedding = user_embedding / np.linalg.norm(user_embedding)
        user_embeddings[user_id] = user_embedding
    return user_embeddings

# training data user embeddings
user_embeddings = compute_user_embeddings(train, game_embeddings)

# get recommendations
def get_recommendations_for_user(user_id, user_embeddings, index, app_ids, top_n=4):
    user_embedding = user_embeddings[user_id].reshape(1, -1).astype('float32')
    # knn_query returns the labels passed to add_items, which here are the app_ids themselves,
    # so they must not be used as positions into the app_ids array
    labels, distances = index.knn_query(user_embedding, k=top_n)
    recommended_app_ids = [int(label) for label in labels[0]]
    return recommended_app_ids

# test data setup
all_recommendations = []
all_actuals = []

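# evaluate per user against the list of held-out games (test_df is built in the FAISS section)
for _, row in test_df.iterrows():
    user_id = row['author.steamid']
    actual_app_ids = row['actual_app_ids']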
    if user_id in user_embeddings:
        recommendations = get_recommendations_for_user(user_id, user_embeddings, index, app_ids, top_n=10)
        all_recommendations.append(recommendations)
        all_actuals.append(actual_app_ids)

# metrics - precision, recall, ndcg, f1
def calculate_metrics_for_user(actual, recommended, k=4):
    if isinstance(actual, int):
        actual = [actual]
    actual_set = set(actual)
    recommended_at_k = recommended[:k]
    hits_at_k = sum([1 for item in recommended_at_k if item in actual_set])
    precision_at_k = hits_at_k / k
    recall_at_k = hits_at_k / len(actual_set) if actual_set else 0
    f1_at_k = (2 * precision_at_k * recall_at_k) / (precision_at_k + recall_at_k) if (precision_at_k + recall_at_k) > 0 else 0
    dcg = sum([1 / np.log2(idx + 2) for idx, item in enumerate(recommended_at_k) if item in actual_set])
    idcg = sum([1 / np.log2(idx + 2) for idx in range(min(len(actual_set), k))])
    ndcg_at_k = dcg / idcg if idcg > 0 else 0
    return precision_at_k, recall_at_k, f1_at_k, ndcg_at_k

def calculate_aggregated_metrics(all_actuals, all_recommendations, k=4):
    precision_list = []
    recall_list = []
    f1_list = []
    ndcg_list = []

    for actual, recommended in zip(all_actuals, all_recommendations):
        precision_at_k, recall_at_k, f1_at_k, ndcg_at_k = calculate_metrics_for_user(actual, recommended, k)
        precision_list.append(precision_at_k)
        recall_list.append(recall_at_k)
        f1_list.append(f1_at_k)
        ndcg_list.append(ndcg_at_k)

    metrics = {
        'Precision': np.mean(precision_list),
        'Recall': np.mean(recall_list),
        'F1 Score': np.mean(f1_list),
        'NDCG': np.mean(ndcg_list)
    }
    return metrics

metrics = calculate_aggregated_metrics(all_actuals, all_recommendations, k=4)
print(f"Precision@4: {metrics['Precision']}")
print(f"Recall@4: {metrics['Recall']}")
print(f"F1@4: {metrics['F1 Score']}")
print(f"NDCG@4: {metrics['NDCG']}")


Precision@4: 1.994256541161455e-05
Recall@4: 7.97702616464582e-05
F1@4: 3.190810465858328e-05
NDCG@4: 7.97702616464582e-05
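
With only 166 games in the index, exact search is cheap, so one way to sanity-check an approximate index is to compare its output with a brute-force inner-product ranking. The helper `exact_top_k` and the toy data below are assumptions made for illustration; they are not part of the original notebook and do not call hnswlib.

```python
import numpy as np

def exact_top_k(query, item_matrix, item_ids, k=4):
    """Exact inner-product search: return the k item ids with the largest dot product."""
    scores = item_matrix @ query
    top = np.argsort(-scores)[:k]
    return [int(item_ids[i]) for i in top]

# toy unit-length item vectors and hypothetical ids
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
ids = np.array([70, 240, 420, 620])

query = np.array([0.8, 0.6])   # unit-length query vector
print(exact_top_k(query, items, ids, k=2))  # [420, 70]
```

Comparing this list with the labels returned by `index.knn_query` for the same query gives a quick estimate of how much the approximation loses, and also confirms that the returned labels are being interpreted correctly.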
