# Recommender Systems for Steam Video Games

This project focuses on building a recommendation system using Steam video game data. The dataset includes **user_id**, **name** (game title), **hours** (time spent playing), and **action** (whether the game was purchased or played). Since the data does not contain explicit feedback, I rely on implicit feedback using the hours feature. The more hours a user has spent playing a game, the more likely they are to prefer similar games, making hours a strong indicator of user preferences.

You can access the dataset here: [Steam Video Games Data](https://www.kaggle.com/datasets/tamber/steam-video-games/data)

Two models were chosen to implement this recommendation system:
* **Alternating Least Squares (ALS):** A collaborative filtering model implemented with PySpark, which is well-suited for implicit feedback data and scales efficiently with larger datasets. ALS leverages matrix factorization to learn latent factors for users and games.
* **Neural Collaborative Filtering (NCF):** A deep learning-based recommendation approach that models user-item interactions using neural networks. 
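
To make the implicit-feedback idea concrete, here is a minimal plain-Python sketch of how implicit ALS (in the Hu, Koren and Volinsky formulation that Spark's `implicitPrefs=True` mode follows) treats an observed value such as hours played: it becomes a binary preference plus a confidence weight. The function names, the `alpha` value, and the toy hours are illustrative assumptions, not values taken from this notebook.

```python
def preference_and_confidence(hours, alpha=1.0):
    """Turn a raw implicit signal into (preference, confidence).

    preference: 1.0 if any interaction was observed, else 0.0
    confidence: 1 + alpha * hours, so heavier play counts for more
    """
    preference = 1.0 if hours > 0 else 0.0
    confidence = 1.0 + alpha * hours
    return preference, confidence


def weighted_squared_error(hours, predicted, alpha=1.0):
    """Confidence-weighted squared error for one (user, game) cell."""
    preference, confidence = preference_and_confidence(hours, alpha)
    return confidence * (preference - predicted) ** 2


# Toy values (assumed): a game never played, and one played for 3 hours.
for hours in (0.0, 3.0):
    print(hours, preference_and_confidence(hours), weighted_squared_error(hours, 0.5))
```

The model is therefore fitted to "did the user interact at all", and the hours only control how much each cell matters in the loss.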

Although the dataset contains 200,000 records, which could be handled using **pandas**, I opted to use **PySpark** for data preprocessing to gain experience with distributed data processing tools. This choice helps develop proficiency in handling larger datasets efficiently and prepares for future projects where scalability might be critical.

Some sources:
* An Overview of Collaborative Filtering Algorithms for Implicit Feedback Data: https://andbloch.github.io/An-Overview-of-Collaborative-Filtering-Algorithms/
* Neural Collaborative Filtering: https://liqiangnie.github.io/paper/p173-he.pdf

In [122]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan, countDistinct, sum, round, max, min, explode, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import tensorflow as tf
import tensorflow_recommenders as tfrs
from tensorflow.keras import layers, Model
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

Create a Spark session

In [46]:
spark = SparkSession.builder.appName('Steam Recommender System').getOrCreate()
spark

Load the Steam video game data

In [47]:
df = spark.read.csv('steam-200k.csv', inferSchema = True)
df

DataFrame[_c0: int, _c1: string, _c2: string, _c3: double, _c4: int]

In [48]:
df.show(5)

+---------+--------------------+--------+-----+---+
|      _c0|                 _c1|     _c2|  _c3|_c4|
+---------+--------------------+--------+-----+---+
|151603712|The Elder Scrolls...|purchase|  1.0|  0|
|151603712|The Elder Scrolls...|    play|273.0|  0|
|151603712|           Fallout 4|purchase|  1.0|  0|
|151603712|           Fallout 4|    play| 87.0|  0|
|151603712|               Spore|purchase|  1.0|  0|
+---------+--------------------+--------+-----+---+
only showing top 5 rows



Display the schema of the DataFrame

In [49]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: integer (nullable = true)



In [50]:
df.columns

['_c0', '_c1', '_c2', '_c3', '_c4']

Rename the columns for better readability:

In [51]:
df = df.withColumnRenamed('_c0', 'user_id') \
       .withColumnRenamed('_c1', 'name') \
       .withColumnRenamed('_c2', 'action') \
       .withColumnRenamed('_c3', 'hours') \
       .withColumnRenamed('_c4', 'zero')

In [52]:
df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- action: string (nullable = true)
 |-- hours: double (nullable = true)
 |-- zero: integer (nullable = true)



Get the shape of the DataFrame by counting the number of rows and columns.

In [53]:
df.count(), len(df.columns) ## shape

(200000, 5)

Count the number of null values in each column

In [54]:
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+----+------+-----+----+
|user_id|name|action|hours|zero|
+-------+----+------+-----+----+
|      0|   0|     0|    0|   0|
+-------+----+------+-----+----+



Count the number of NaN values in each column

In [55]:
columns = ['user_id', 'name', 'action', 'hours', 'zero']

df.select([count(when(isnan(col(c)), c)).alias(c) for c in columns]).show()

+-------+----+------+-----+----+
|user_id|name|action|hours|zero|
+-------+----+------+-----+----+
|      0|   0|     0|    0|   0|
+-------+----+------+-----+----+



*Duplicate records*

In [56]:
df.groupBy(columns).count().filter('count > 1').show()



+---------+--------------------+--------+-----+----+-----+
|  user_id|                name|  action|hours|zero|count|
+---------+--------------------+--------+-----+----+-----+
| 86338111|Grand Theft Auto ...|purchase|  1.0|   0|    2|
|189858084|Grand Theft Auto ...|purchase|  1.0|   0|    2|
|150882304|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|189858084|Grand Theft Auto ...|purchase|  1.0|   0|    2|
|116617462|Grand Theft Auto ...|purchase|  1.0|   0|    2|
| 37422528|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|147859903|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|138941587|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|145825155|Grand Theft Auto ...|purchase|  1.0|   0|    2|
|105782521|Sid Meier's Civil...|purchase|  1.0|   0|    2|
| 46301758|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|179936723|Grand Theft Auto ...|purchase|  1.0|   0|    2|
| 64455019|Sid Meier's Civil...|purchase|  1.0|   0|    2|
|142650116|Grand Theft Auto ...|purchase|  1.0|   0|    

                                                                                

In [57]:
df_new = df.dropDuplicates()

In [58]:
df_new.count()

199293

*Unique Values* -- Calculates the number of unique values for each column 

In [59]:
df_new.select([countDistinct(c).alias(c) for c in columns]).show()

+-------+----+------+-----+----+
|user_id|name|action|hours|zero|
+-------+----+------+-----+----+
|  12393|5155|     2| 1593|   1|
+-------+----+------+-----+----+



Remove the 'zero' column from the DataFrame as it is not needed for analysis

In [60]:
df_new = df_new.drop('zero')

### User Interactions Analysis

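#### Count of Users per Game

Note that this count runs on the raw `df`, where a user can have both a `purchase` row and a `play` row for the same game, so `users_count` counts interaction records rather than distinct users (for example, Dota 2 shows 9682, twice its 4841 purchasers).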

In [61]:
df.groupBy('name')\
    .agg(count('user_id').alias('users_count'))\
    .orderBy('users_count', ascending = False)\
    .limit(20)\
    .show()

+--------------------+-----------+
|                name|users_count|
+--------------------+-----------+
|              Dota 2|       9682|
|     Team Fortress 2|       4646|
|Counter-Strike Gl...|       2789|
|            Unturned|       2632|
|       Left 4 Dead 2|       1752|
|Counter-Strike So...|       1693|
|      Counter-Strike|       1424|
|         Garry's Mod|       1397|
|The Elder Scrolls...|       1394|
|            Warframe|       1271|
|Half-Life 2 Lost ...|       1158|
|Sid Meier's Civil...|       1150|
|           Robocraft|       1096|
|Half-Life 2 Death...|       1021|
|              Portal|       1005|
|            Portal 2|        997|
|         Half-Life 2|        995|
|   Heroes & Generals|        993|
|            Terraria|        956|
|Counter-Strike Co...|        904|
+--------------------+-----------+



#### Count of Purchases per Game

In [62]:
df.filter(col('action') == 'purchase') \
  .groupBy('name') \
  .agg(count('user_id').alias('count_user_purchase')) \
  .orderBy('count_user_purchase', ascending=False) \
  .limit(20) \
  .show()

+--------------------+-------------------+
|                name|count_user_purchase|
+--------------------+-------------------+
|              Dota 2|               4841|
|     Team Fortress 2|               2323|
|            Unturned|               1563|
|Counter-Strike Gl...|               1412|
|Half-Life 2 Lost ...|                981|
|Counter-Strike So...|                978|
|       Left 4 Dead 2|                951|
|      Counter-Strike|                856|
|            Warframe|                847|
|Half-Life 2 Death...|                823|
|         Garry's Mod|                731|
|The Elder Scrolls...|                717|
|           Robocraft|                689|
|Counter-Strike Co...|                679|
|Counter-Strike Co...|                679|
|   Heroes & Generals|                658|
|         Half-Life 2|                639|
|Sid Meier's Civil...|                596|
|         War Thunder|                590|
|              Portal|                588|
+--------------------+-------------------+

#### Count of Plays per Game

In [63]:
df.filter(col('action') == 'play') \
  .groupBy('name') \
  .agg(count('user_id').alias('count_user_play')) \
  .orderBy('count_user_play', ascending=False) \
  .limit(20) \
  .show()

+--------------------+---------------+
|                name|count_user_play|
+--------------------+---------------+
|              Dota 2|           4841|
|     Team Fortress 2|           2323|
|Counter-Strike Gl...|           1377|
|            Unturned|           1069|
|       Left 4 Dead 2|            801|
|Counter-Strike So...|            715|
|The Elder Scrolls...|            677|
|         Garry's Mod|            666|
|      Counter-Strike|            568|
|Sid Meier's Civil...|            554|
|            Terraria|            460|
|            Portal 2|            453|
|            Warframe|            424|
|              Portal|            417|
|           Robocraft|            407|
|            PAYDAY 2|            390|
|       Borderlands 2|            386|
|         Half-Life 2|            356|
|   Heroes & Generals|            335|
|         War Thunder|            303|
+--------------------+---------------+



####  Total Play Hours per Game

In [64]:
df.filter(col('action') == 'play') \
  .groupBy('name') \
  .agg(round(sum('hours'), 2).alias('total_play')) \
  .orderBy('total_play', ascending = False) \
  .limit(50) \
  .show()

+--------------------+----------+
|                name|total_play|
+--------------------+----------+
|              Dota 2|  981684.6|
|Counter-Strike Gl...|  322771.6|
|     Team Fortress 2|  173673.3|
|      Counter-Strike|  134261.1|
|Sid Meier's Civil...|   99821.3|
|Counter-Strike So...|   96075.5|
|The Elder Scrolls...|   70889.3|
|         Garry's Mod|   49725.3|
|Call of Duty Mode...|   42009.9|
|       Left 4 Dead 2|   33596.7|
|Football Manager ...|   32308.6|
|Football Manager ...|   30845.8|
|Football Manager ...|   30574.8|
|            Terraria|   29951.8|
|            Warframe|   27074.6|
|Football Manager ...|   24283.1|
|              Arma 3|   24055.7|
|  Grand Theft Auto V|   22956.7|
|       Borderlands 2|   22667.9|
|    Empire Total War|   21030.3|
+--------------------+----------+
only showing top 20 rows



### Data prep for recommendation system models

This part prepares the dataset for the recommendation models. It first splits the data on the **action** column into two DataFrames, one for purchases and one for plays. The two DataFrames are then joined on **user_id** and **name** with an outer join, so games that were purchased but never played are kept. Null values in the **hours** column (pairs with no play record) are replaced with 0, and columns that are no longer needed are dropped. Finally, the **hours** column is min-max *normalized*.
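
The split, outer join, and null-fill steps can be sketched on a few toy records in plain Python. This is only an illustration of the logic under assumed data; the notebook itself does the same thing with PySpark DataFrames in the cells that follow.

```python
# Toy records in the same shape as the dataset: (user_id, name, action, hours).
records = [
    (1, "Dota 2", "purchase", 1.0),
    (1, "Dota 2", "play", 120.0),
    (1, "Portal", "purchase", 1.0),  # bought but never played
    (2, "Portal", "purchase", 1.0),
    (2, "Portal", "play", 4.5),
]

# Split by action, keyed on (user_id, name).
purchases = {(u, g) for u, g, action, _ in records if action == "purchase"}
plays = {(u, g): h for u, g, action, h in records if action == "play"}

# Outer join on the key, then fill missing play hours with 0.
keys = sorted(purchases | set(plays))
hours_by_pair = {key: plays.get(key, 0.0) for key in keys}

for key in keys:
    print(key, hours_by_pair[key])
```

The pair `(1, "Portal")` has no play record, so it ends up with 0 hours; in the real data these are the rows that show up as nulls after the join.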

Filter the DataFrame to create a new DataFrame 'df_purchase' that includes only the records where the action is 'purchase'

In [65]:
df_purchase = df_new.filter(col('action') == 'purchase')
df_purchase.count()

128804

Count the distinct users who made purchases and rename the columns

In [66]:
df_purchase.select(countDistinct('user_id').alias('number of unique users made purchases')).show()
df_purchase = df_purchase.withColumnRenamed('action', 'action_pur') \
                         .withColumnRenamed('hours', 'purchase')

+-------------------------------------+
|number of unique users made purchases|
+-------------------------------------+
|                                12393|
+-------------------------------------+



Create a new DataFrame 'df_play' that contains only the records where the action is 'play'

In [67]:
df_play = df_new.filter(col('action') == 'play')
df_play.count()

70489

Count the number of unique users

In [68]:
df_play.select(countDistinct('user_id').alias('number of unique users')).show()

+----------------------+
|number of unique users|
+----------------------+
|                 11350|
+----------------------+



The two DataFrames (df_purchase and df_play) are joined based on the **user_id** and **name** (game title). 

In [69]:
full_df = df_purchase.join(df_play, ["user_id", "name"], "outer")

In [70]:
full_df.count()

128816

Count the distinct values for each column

In [71]:
new_columns = full_df.columns

In [72]:
# show() returns None, so there is nothing useful to assign here
full_df.select([countDistinct(c).alias(c) for c in new_columns]).show()



+-------+----+----------+--------+------+-----+
|user_id|name|action_pur|purchase|action|hours|
+-------+----+----------+--------+------+-----+
|  12393|5155|         1|       1|     1| 1593|
+-------+----+----------+--------+------+-----+



                                                                                

Count the number of null (missing) values in each column

In [73]:
full_df.select([count(when(col(c).isNull(), c)).alias(c) for c in new_columns]).show()

+-------+----+----------+--------+------+-----+
|user_id|name|action_pur|purchase|action|hours|
+-------+----+----------+--------+------+-----+
|      0|   0|         0|       0| 58327|58327|
+-------+----+----------+--------+------+-----+



Fill any null values in the **hours** column with zero

In [74]:
data = full_df.fillna(0, subset=['hours'])

Drop unnecessary columns

In [75]:
data = data.drop("action_pur", "purchase", "action")

In [76]:
data.describe("hours").show()

+-------+------------------+
|summary|             hours|
+-------+------------------+
|  count|            128816|
|   mean|26.746411936405334|
| stddev|171.38236775962963|
|    min|               0.0|
|    max|           11754.0|
+-------+------------------+



Normalize the **hours** column

In [77]:
min_hours = data.select(min("hours")).collect()[0][0]
max_hours = data.select(max("hours")).collect()[0][0]

data = data.withColumn("normalized_hours", (col("hours") - min_hours) / (max_hours - min_hours))
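
As a worked example of the min-max formula `(hours - min) / (max - min)`, the sketch below plugs in the bounds reported by `data.describe("hours")` above. The helper name `min_max` is an assumption made for illustration.

```python
def min_max(value, lo, hi):
    """Scale value linearly so that lo maps to 0 and hi maps to 1."""
    return (value - lo) / (hi - lo)


# Bounds reported by data.describe("hours") above.
MIN_HOURS, MAX_HOURS = 0.0, 11754.0

print(min_max(144.0, MIN_HOURS, MAX_HOURS))               # a 144-hour game
print(min_max(26.746411936405334, MIN_HOURS, MAX_HOURS))  # the mean play time
```

A 144-hour game maps to roughly 0.0123 and the mean play time to roughly 0.0023, so the single 11,754-hour maximum pushes almost all normalized values very close to zero.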

## Recommender Systems

### 1) Alternating Least Squares (ALS) - PySpark

In this code, *StringIndexer* is used to convert the categorical game names in the **name** column into unique numerical identifiers stored in the **game_id** column. This transformation is necessary for preparing the dataset for the ALS model, which requires numerical input for training.
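
By default, *StringIndexer* orders labels by descending frequency and emits the index as a double, which is why the most common game ends up with id 0.0. The plain-Python sketch below imitates that behavior on toy data; the function name and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent first.

    Ties are broken alphabetically here; treat that detail as an assumption.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy column of game names (assumed data, not from the dataset).
names = ["Dota 2", "Portal", "Dota 2", "Team Fortress 2", "Dota 2", "Portal"]
index = fit_string_index(names)
print(index)
print([index[name] for name in names])
```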

In [78]:
indexer = StringIndexer(inputCol="name", outputCol="game_id")
data_indexed = indexer.fit(data).transform(data)

                                                                                

*Train - Test Split*

In [79]:
(train_set, test_set) = data_indexed.randomSplit([0.8, 0.2])

*Model Selection - Hyperparameter Tuning*

In [80]:
# The ALS model. coldStartStrategy='drop' removes rows whose prediction is NaN
# (users or games unseen during training) so the evaluation metric is not NaN.
als = ALS(userCol="user_id", itemCol="game_id", ratingCol="normalized_hours", implicitPrefs=True, coldStartStrategy='drop')

# the parameter grid
paramGrid = ParamGridBuilder() \
  .addGrid(als.maxIter, [5, 10, 15]) \
  .addGrid(als.regParam, [0.01, 0.05, 0.1]) \
  .build()

# cross-validator
crossval = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=RegressionEvaluator(metricName="rmse", labelCol="normalized_hours", predictionCol="prediction"), numFolds=5, seed=42)

# Fit the cross-validator to the training data
cvModel = crossval.fit(train_set)

24/10/21 14:37:11 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/10/21 14:37:11 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
                                                                                ]
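
To get a feel for the cost of this search, the sketch below enumerates the same grid in plain Python. It only counts model fits and does not train anything; the variable names are illustrative.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call above.
max_iter_values = [5, 10, 15]
reg_param_values = [0.01, 0.05, 0.1]
num_folds = 5

grid = [
    {"maxIter": max_iter, "regParam": reg_param}
    for max_iter, reg_param in product(max_iter_values, reg_param_values)
]

# Each combination is trained once per fold; the winning combination is
# then refit on the whole training set to produce bestModel.
fits_during_search = len(grid) * num_folds
print(len(grid), "combinations,", fits_during_search, "fits during the search")
```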

In [81]:
# best model
bestModel = cvModel.bestModel

# Parameter map
paramMap = bestModel.extractParamMap()

# best params
best_maxIter = bestModel._java_obj.parent().getMaxIter()
best_regParam = bestModel._java_obj.parent().getRegParam()

print("Best maxIter:", best_maxIter)
print("Best regParam:", best_regParam)


Best maxIter: 5
Best regParam: 0.1


*Model Training*

In [82]:
als_best = ALS(maxIter=best_maxIter, regParam=best_regParam, userCol="user_id", itemCol="game_id",
                ratingCol="normalized_hours", implicitPrefs=True, coldStartStrategy="drop")
als_best = als_best.fit(train_set)

*Making predictions*

In [83]:
predictions_als = als_best.transform(test_set)

*Model Evaluation*

In [84]:
evaluator_als = RegressionEvaluator(metricName="rmse", labelCol="normalized_hours", predictionCol="prediction")
rmse_als = evaluator_als.evaluate(predictions_als)
print('RMSE of ALS:', rmse_als)

                                                                                

RMSE of ALS: 0.09944930743217605


*Model Training by using all data to get recommendations*

In [85]:
als_best = ALS(maxIter=best_maxIter, regParam=best_regParam, userCol="user_id", itemCol="game_id",
                ratingCol="normalized_hours", implicitPrefs=True, coldStartStrategy="drop")
als_full_model = als_best.fit(data_indexed)

*Recommender System*

In [86]:
user_recommendations = als_full_model.recommendForAllUsers(5)
user_recommendations.show()



+--------+--------------------+
| user_id|     recommendations|
+--------+--------------------+
|   76767|[{44, 0.8258528},...|
|  144736|[{7, 0.60747874},...|
|  229911|[{7, 0.8815583}, ...|
|  835015|[{49, 0.0}, {48, ...|
|  948368|[{7, 0.6061667}, ...|
|  975449|[{20, 0.7061146},...|
| 1268792|[{7, 0.60748684},...|
| 2531540|[{5, 0.42228198},...|
| 2753525|[{5, 0.6406964}, ...|
| 3450426|[{7, 0.730884}, {...|
| 7923954|[{7, 0.6074828}, ...|
| 7987640|[{49, 0.0}, {48, ...|
| 8259307|[{11, 0.4106043},...|
| 8567888|[{5, 0.3509526}, ...|
| 8585433|[{11, 0.598985}, ...|
| 8784496|[{1, 0.4596405}, ...|
| 8795607|[{5, 0.66845745},...|
|10144413|[{49, 0.0}, {48, ...|
|10595342|[{5, 0.6881972}, ...|
|10599862|[{11, 0.82561207}...|
+--------+--------------------+
only showing top 20 rows



                                                                                

The resulting DataFrame has two columns: **user_id** and **recommendations**. The **recommendations** column lists five suggested games per user, each identified by a game ID and paired with its predicted score (the rating field). To make the recommendations easier to read, I map these IDs back to their game names. First, I display the schema of the DataFrame to see its structure.

In [87]:
user_recommendations.printSchema()

root
 |-- user_id: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- game_id: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)



According to the schema, the recommendations column contains an array, and each element is a struct with two fields: game_id (an integer) and rating (a float). I then explode the recommendations column, which produces one row per recommended game and makes the recommendations easier to process individually.
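
The effect of exploding an array column can be sketched in plain Python with a nested comprehension. The rows below are made-up values shaped like the `recommendForAllUsers` output, and `toy_recommendations` is a hypothetical name.

```python
# Toy rows shaped like recommendForAllUsers output (values are made up).
toy_recommendations = [
    {"user_id": 1, "recommendations": [
        {"game_id": 7, "rating": 0.61}, {"game_id": 5, "rating": 0.42}]},
    {"user_id": 2, "recommendations": [
        {"game_id": 0, "rating": 0.70}]},
]

# "Explode": emit one output row per array element, repeating user_id.
exploded = [
    {"user_id": row["user_id"], "game_id": rec["game_id"], "rating": rec["rating"]}
    for row in toy_recommendations
    for rec in row["recommendations"]
]

for row in exploded:
    print(row)
```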

In [88]:
recommendations_exploded = user_recommendations.withColumn("recommendation", explode("recommendations"))

Select relevant columns

In [89]:
recommendations_als = recommendations_exploded.select("user_id", "recommendation.game_id", "recommendation.rating")

Map game IDs to names

In [90]:
game_id_name = data_indexed[['name', 'game_id']]

In [91]:
game_id_name.count()

128816

In [92]:
game_id_name = game_id_name.drop_duplicates()

In [93]:
game_id_name.count()

5155

Add the game names to the recommendations DataFrame

In [94]:
recommendations_als = recommendations_als.join(game_id_name, on='game_id', how='left')

Sort results

In [95]:
recommendations_als = recommendations_als.orderBy(['user_id', 'rating'], ascending =[True, False])

In [96]:
recommendations_als.show(10)

                                                                                

+-------+-------+----------+--------------------+
|game_id|user_id|    rating|                name|
+-------+-------+----------+--------------------+
|      1|   5250|0.72689056|     Team Fortress 2|
|      0|   5250|  0.701716|              Dota 2|
|      6|   5250|0.13185938|       Left 4 Dead 2|
|     19|   5250|0.12507021|              Portal|
|      5|   5250|0.12311798|Counter-Strike So...|
|     44|  76767| 0.8258528|Call of Duty Mode...|
|     45|  76767| 0.8088316|Call of Duty Mode...|
|     69|  76767| 0.5850272|Call of Duty Blac...|
|     70|  76767| 0.5532179|Call of Duty Blac...|
|      3|  76767|0.53492707|Counter-Strike Gl...|
+-------+-------+----------+--------------------+
only showing top 10 rows



A function to find recommended games for a specific user

In [97]:
def print_recommendation_als(user_id):
    """
    A function to find and print the recommended games for a specific user.

    Parameters:
    - user_id (int): The ID of the user for whom the game recommendations will be retrieved.

    The function filters the DataFrame based on the provided user_id and retrieves the game names associated with the recommendations for that user. It then prints the names of the recommended games in a numbered list.
    """
    recommend_filtered = recommendations_als.filter(col('user_id') == user_id).select('name')
    print(f"User ID: {user_id}")
    recommended_games = recommend_filtered.collect()
    print('The list of all recommended games:')
    for i, row in enumerate(recommended_games, 1):
        print(f"{i}. {row['name']}")

    

In [98]:
print_recommendation_als(user_id = 1423371) ## example

User ID: 1423371


                                                                                

The list of all recommended games:
1. Counter-Strike Global Offensive
2. Counter-Strike Source
3. PAYDAY 2
4. Garry's Mod
5. Left 4 Dead 2


### 2) Neural Collaborative Filtering (NCF) - Tensorflow

*Data Prep*

Retrieve the unique user IDs and game names and convert them into Python lists

In [99]:
user_ids = data_indexed.select('user_id').distinct().rdd.flatMap(lambda x:x).collect()
game_names = data_indexed.select('name').distinct().rdd.flatMap(lambda x:x).collect()

                                                                                

Map each user ID and game name to a unique index

In [100]:
userid_to_index = {user_id: index for index, user_id in enumerate(user_ids)}
game_to_index = {name: index for index, name in enumerate(game_names)}

In [101]:
# Functions to Map Users and Games to Indexes
def get_user_index(user_id):
    return userid_to_index.get(user_id, -1)

def get_game_index(game_name):
    return game_to_index.get(game_name, -1)
  
# Creating UDFs (User Defined Functions) to Use in PySpark
user_index_udf = udf(get_user_index, IntegerType())
game_index_udf = udf(get_game_index, IntegerType())

# Applying the UDFs to the DataFrame
indexed_user_data = data_indexed.withColumn("user_index", user_index_udf(data_indexed["user_id"]))
indexed_user_data = indexed_user_data.withColumn("game_index", game_index_udf(indexed_user_data["name"]))

indexed_user_data.show()



+-------+--------------------+-----+--------------------+-------+----------+----------+
|user_id|                name|hours|    normalized_hours|game_id|user_index|game_index|
+-------+--------------------+-----+--------------------+-------+----------+----------+
|   5250|     Cities Skylines|144.0| 0.01225114854517611|  198.0|      5039|      2500|
|   5250|      Counter-Strike|  0.0|                 0.0|    7.0|      5039|      2653|
|   5250|Counter-Strike So...|  0.0|                 0.0|    5.0|      5039|        30|
|   5250|       Day of Defeat|  0.0|                 0.0|   21.0|      5039|      1449|
|   5250|              Dota 2|  0.2|1.701548409052237...|    0.0|      5039|         0|
|   5250|Half-Life 2 Episo...|  0.0|                 0.0|   30.0|      5039|      2137|
|   5250|Half-Life Blue Shift|  0.0|                 0.0|   40.0|      5039|       321|
|   5250|Half-Life Opposin...|  0.0|                 0.0|   36.0|      5039|      1094|
|  76767|Age of Empires II...| 1

                                                                                

Convert the data to NumPy arrays

In [102]:
X = np.array(indexed_user_data.select("user_index", "game_index").collect())
y = np.array(indexed_user_data.select("normalized_hours").collect())

                                                                                

*Train - test split*

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

*Model*

In [105]:
class NCFModel(Model):
    def __init__(self, num_users, num_games, embedding_dim=8):
        super(NCFModel, self).__init__()
        
        # embedding layers for users and games
        self.user_embedding = layers.Embedding(num_users, embedding_dim)
        self.game_embedding = layers.Embedding(num_games, embedding_dim)

        # MLP (Multi-Layer Perceptron) layers
        self.fc1 = layers.Dense(64, activation='relu')
        self.fc2 = layers.Dense(32, activation='relu')
        self.fc3 = layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        game_vector = self.game_embedding(inputs[:, 1])

        # merge embeddings
        interaction = layers.concatenate([user_vector, game_vector])

        # Fully Connected layers
        x = self.fc1(interaction)
        x = self.fc2(x)
        return self.fc3(x)
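
The forward pass of this architecture (embedding lookup, concatenation, dense layers, sigmoid output) can be traced by hand in plain Python. The sketch below is a simplification under stated assumptions: toy sizes, random untrained weights, and a single hidden layer instead of the two used in `NCFModel`.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible


def dense(vector, weights, biases, activation):
    """One fully connected layer: activation(W @ vector + b)."""
    return [
        activation(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(x):
    return x if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): 3 users, 4 games, 2-dimensional embeddings.
embedding_dim = 2
user_embedding = random_matrix(3, embedding_dim)
game_embedding = random_matrix(4, embedding_dim)
hidden_w, hidden_b = random_matrix(4, 2 * embedding_dim), [0.0] * 4
out_w, out_b = random_matrix(1, 4), [0.0]


def predict(user_index, game_index):
    """Embedding lookup -> concatenate -> hidden ReLU layer -> sigmoid."""
    interaction = user_embedding[user_index] + game_embedding[game_index]  # list concat
    hidden = dense(interaction, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]


print(predict(0, 1))
```

Because the last layer is a sigmoid, every prediction falls strictly between 0 and 1, which matches the range of the min-max normalized hours used as the target.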


*Model training and evaluation function*

In [108]:
def train_evaluate(num_users, num_games, embedding_dim):
    model_ncf = NCFModel(num_users, num_games, embedding_dim)
    model_ncf.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
    
    # Train the model
    model_ncf.fit(X_train, y_train, epochs=10, batch_size=32)

    # Predictions
    y_pred_ncf = model_ncf.predict(X_test)

    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_ncf))
    return rmse


Model Selection - Hyperparameter Tuning

In [110]:
embedding_dims = [8, 16, 32]
best_rmse = float('inf')
best_embedding_dim = None
num_users = len(user_ids)
num_games = len(game_names)

In [115]:
for embedding_dim in embedding_dims:
    rmse = train_evaluate(num_users, num_games, embedding_dim)
    print(f"Embedding Dimension: {embedding_dim}, RMSE: {rmse}")

    if rmse < best_rmse:
        best_rmse = rmse  # update the running best so later candidates are compared against it
        best_rmse_ncf = rmse
        best_embedding_dim = embedding_dim

print(f"Best RMSE: {best_rmse_ncf} with Embedding Dimension: {best_embedding_dim}")

Epoch 1/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 3s 721us/step - loss: 0.0161 - mae: 0.0497
Epoch 2/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 640us/step - loss: 2.2737e-04 - mae: 0.0036
Epoch 3/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 654us/step - loss: 1.9624e-04 - mae: 0.0035
Epoch 4/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 631us/step - loss: 1.9850e-04 - mae: 0.0032
Epoch 5/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 639us/step - loss: 1.9515e-04 - mae: 0.0030
Epoch 6/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 655us/step - loss: 1.8591e-04 - mae: 0.0031
Epoch 7/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 627us/step - loss: 1.6705e-04 - mae: 0.0030
Epoch 8/10
3221/3221 ━━━━━━━━━━━━━━━━━━━━ 2s 641us/step - loss: 1.8021e-04 - mae: 0.0028
Epoch 9/10
3221/3221

In [119]:
#model
embedding_dim = best_embedding_dim

best_ncf = NCFModel(num_users, num_games, embedding_dim)

In [120]:
best_ncf.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
best_ncf.fit(X, y, epochs=10, batch_size=32)

Epoch 1/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.0125 - mae: 0.0391
Epoch 2/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 7s 2ms/step - loss: 2.3770e-04 - mae: 0.0035
Epoch 3/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 2.2065e-04 - mae: 0.0035
Epoch 4/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 7s 2ms/step - loss: 2.2568e-04 - mae: 0.0032
Epoch 5/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 7s 2ms/step - loss: 1.8430e-04 - mae: 0.0031
Epoch 6/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 1.9718e-04 - mae: 0.0032
Epoch 7/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 2.0062e-04 - mae: 0.0033
Epoch 8/10
4026/4026 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 1.8482e-04 - mae: 0.0032
Epoch 9/10
4026/4026 ━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x313a49220>

*Recommender System*

In [121]:
def recommend_games(user_id, model, game_to_index, top_n=5):
    user_index = userid_to_index[user_id]
    user_vector = np.array([user_index])

    # game index
    game_indices = np.array(range(len(game_to_index)))

    # merge user index and game index
    user_game_pairs = np.array(np.meshgrid(user_vector, game_indices)).T.reshape(-1, 2)

    predictions = model.predict(user_game_pairs).flatten()

    # top games
    top_games_indices = np.argsort(predictions)[-top_n:][::-1]
    recommended_games = [game_names[i] for i in top_games_indices]

    return recommended_games


recommended_games = recommend_games(1423371, best_ncf, game_to_index)
print("Recommended games for user 1423371:", recommended_games)

[1m162/162[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 400us/step
Recommended games for user 1423371: ['TRIP Steam Edition', 'Overcast - Walden and the Werewolf - Soundtrack', 'The Counting Kingdom', 'eXceed - Gun Bullet Children', 'Reckless Ruckus']
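
Note that `recommend_games` scores every game in the catalog, including ones the user already has. A possible variation, sketched here in plain Python with made-up scores, filters out owned games before taking the top N. The function `top_n_unseen` and its inputs are hypothetical and not part of the notebook.

```python
def top_n_unseen(scores, owned, top_n=5):
    """Return the top_n highest-scoring games the user does not already own.

    scores: dict mapping game name -> predicted score
    owned:  set of game names the user has already purchased or played
    """
    candidates = [(score, name) for name, score in scores.items() if name not in owned]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in candidates[:top_n]]


# Made-up scores and library, purely for illustration.
scores = {"Dota 2": 0.9, "Portal": 0.7, "Terraria": 0.6, "Warframe": 0.2}
owned = {"Dota 2"}
print(top_n_unseen(scores, owned, top_n=2))
```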


## Comparison

In [125]:
comparison = pd.DataFrame({
    'Model': ['ALS', 'NCF'],
    'RMSE': [rmse_als, best_rmse_ncf]
})
print(comparison)

  Model      RMSE
0   ALS  0.099449
1   NCF  0.016287


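The NCF's RMSE is lower than the ALS's in these runs. The two numbers are not strictly comparable, though: the models were evaluated on different random test splits, and ALS with `implicitPrefs=True` predicts a preference score rather than normalized hours. The lower RMSE therefore suggests, but does not establish, that the NCF model captures user preferences and behavior better than the ALS model; a ranking metric computed on a shared test set would give a fairer comparison.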
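
One way to put these RMSE values in context is to compare them with a trivial baseline that always predicts the mean. Its RMSE equals the standard deviation of the target, which can be read off the `describe("hours")` output earlier in the notebook. The sketch below is a rough estimate under assumptions: it uses full-data statistics rather than either test split, and Spark reports the sample standard deviation, which is nearly identical to the population value at this row count.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Statistics reported by data.describe("hours") earlier in the notebook.
STD_HOURS, MAX_HOURS = 171.38236775962963, 11754.0

# A model that always predicts the mean has an RMSE equal to the standard
# deviation of the target, here expressed on the normalized scale.
baseline_rmse = STD_HOURS / MAX_HOURS
print("Constant-mean baseline RMSE (full data):", round(baseline_rmse, 6))

# Tiny check of the identity on toy values (population standard deviation).
toy = [0.0, 0.0, 0.0, 1.0]
mean = sum(toy) / len(toy)
print(rmse(toy, [mean] * len(toy)))
```

The baseline comes out at roughly 0.0146, the same order as the NCF RMSE reported above, so a low RMSE on this heavily skewed target does not by itself show that a model has learned individual preferences.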