
In this project, I will develop a board game recommendation engine using both content-based and collaborative filtering. Content-based filtering recommends games based on their similarity to other games that you have enjoyed in the past. Collaborative filtering recommends games based on the preferences of other users who are similar to you.

I will use the Board Game Database from BoardGameGeek to train my recommendation engine. This database contains information on nearly 22,000 board games, more than 411,000 users, and about 19 million ratings.
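
To make the two approaches concrete before touching the real data, here is a minimal sketch on invented numbers. The ratings, the three users, and the feature flags (strategy, party, cooperative) are all hypothetical and are not taken from the BoardGameGeek files.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: rows are users, columns are games (0 means "not rated").
ratings = np.array([
    [9.0, 8.0, 0.0],   # user A
    [8.0, 9.0, 7.0],   # user B: taste close to A
    [2.0, 0.0, 9.0],   # user C: different taste
])

# Collaborative view: compare users by their rating vectors.
user_sim = cosine_similarity(ratings)

# Content-based view: compare games by their feature vectors
# (hypothetical flags: strategy, party, cooperative).
game_features = np.array([
    [1, 0, 1],   # game 0
    [1, 0, 1],   # game 1: same features as game 0
    [0, 1, 0],   # game 2
])
game_sim = cosine_similarity(game_features)

print(user_sim.round(2))
print(game_sim.round(2))
```

In this toy setup user A is closer to user B than to user C, so collaborative filtering would lean on B's ratings, while content-based filtering would treat games 0 and 1 as interchangeable.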

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.spatial import distance


In [2]:
#access kaggle API to download dataset - used instructions from this article https://www.kaggle.com/discussions/general/74235
!pip install -q kaggle
from google.colab import files
files.upload()

Saving board-games-database-from-boardgamegeek.zip to board-games-database-from-boardgamegeek.zip


In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

!chmod 600 /root/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
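
The two errors above indicate that `kaggle.json` was not in the working directory when the cell ran. One way to fail more clearly is to check for the file before copying it. The helper below is a sketch only; the function name `install_kaggle_credentials` is hypothetical and not part of the Kaggle package.

```python
from pathlib import Path
import shutil

def install_kaggle_credentials(source="kaggle.json", target_dir=Path.home() / ".kaggle"):
    """Copy kaggle.json into place; return False if the file is missing."""
    source = Path(source)
    if not source.is_file():
        print(f"{source} not found: upload your Kaggle API token before continuing.")
        return False
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copy(source, target)
    target.chmod(0o600)  # same permissions as the chmod 600 call above
    return True
```

Calling `install_kaggle_credentials()` in place of the three shell commands would print a single readable message instead of the `cp` and `chmod` errors.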


In [4]:
#download boardgame dataset
!kaggle datasets download -d threnjen/board-games-database-from-boardgamegeek
!unzip /content/board-games-database-from-boardgamegeek.zip

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/kaggle/api/kaggle_api_extended.py", line 1734, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
Archive:  /content/board-games-database-from-boardgamegeek.zip
  inflating: artists_reduced.csv     
  inflating: bgg_data_documentation.txt  
  inflating: designers_reduced.csv   
  inflating: games.csv               
  inflating: mechanics.csv           
  inflating: publishers_reduced.

In [5]:
games = pd.read_csv('games.csv')
mechanics = pd.read_csv('mechanics.csv')
themes = pd.read_csv('themes.csv')
subcategories = pd.read_csv('subcategories.csv')
rank_dist = pd.read_csv('ratings_distribution.csv')
# Read the large ratings file in chunks, then concatenate once
user_rank = pd.concat(
    pd.read_csv('user_ratings.csv', chunksize=100000), ignore_index=True
)

games.head()

Unnamed: 0,BGGId,Name,Description,YearPublished,GameWeight,AvgRating,BayesAvgRating,StdDev,MinPlayers,MaxPlayers,...,Rank:partygames,Rank:childrensgames,Cat:Thematic,Cat:Strategy,Cat:War,Cat:Family,Cat:CGS,Cat:Abstract,Cat:Party,Cat:Childrens
0,1,Die Macher,die macher game seven sequential political rac...,1986,4.3206,7.61428,7.10363,1.57979,3,5,...,21926,21926,0,1,0,0,0,0,0,0
1,2,Dragonmaster,dragonmaster tricktaking card game base old ga...,1981,1.963,6.64537,5.78447,1.4544,3,4,...,21926,21926,0,1,0,0,0,0,0,0
2,3,Samurai,samurai set medieval japan player compete gain...,1998,2.4859,7.45601,7.23994,1.18227,2,4,...,21926,21926,0,1,0,0,0,0,0,0
3,4,Tal der Könige,triangular box luxurious large block tal der k...,1992,2.6667,6.60006,5.67954,1.23129,2,4,...,21926,21926,0,0,0,0,0,0,0,0
4,5,Acquire,acquire player strategically invest business t...,1964,2.5031,7.33861,7.14189,1.33583,2,6,...,21926,21926,0,1,0,0,0,0,0,0


In [6]:
#Shape
print(games.shape)
print(mechanics.shape)
print(themes.shape)
print(subcategories.shape)
print(rank_dist.shape)
print(user_rank.shape)

#dtypes
print(games.dtypes)
print(mechanics.dtypes)
print(themes.dtypes)
print(subcategories.dtypes)
print(rank_dist.dtypes)
print(user_rank.dtypes)

#Info
print(games.info())
print(mechanics.info())
print(themes.info())
print(subcategories.info())
print(rank_dist.info())
print(user_rank.info())

#Columns
print(games.columns)
print(mechanics.columns)
print(themes.columns)
print(subcategories.columns)
print(rank_dist.columns)
print(user_rank.columns)


(21925, 48)
(21925, 158)
(21925, 218)
(21925, 11)
(21925, 96)
(18942215, 3)
BGGId                    int64
Name                    object
Description             object
YearPublished            int64
GameWeight             float64
AvgRating              float64
BayesAvgRating         float64
StdDev                 float64
MinPlayers               int64
MaxPlayers               int64
ComAgeRec              float64
LanguageEase           float64
BestPlayers              int64
GoodPlayers             object
NumOwned                 int64
NumWant                  int64
NumWish                  int64
NumWeightVotes           int64
MfgPlaytime              int64
ComMinPlaytime           int64
ComMaxPlaytime           int64
MfgAgeRec                int64
NumUserRatings           int64
NumComments              int64
NumAlternates            int64
NumExpansions            int64
NumImplementations       int64
IsReimplementation       int64
Family                  object
Kickstarted              

In [7]:
popular = games.copy()
# Popularity score: average rating scaled to 0-1, multiplied by the number of owners
popular['game_popularity'] = popular['AvgRating'] / 10 * popular['NumOwned']
twenty_most_popular = popular.sort_values(by='game_popularity', ascending=False).head(20)
twenty_most_popular = twenty_most_popular[['BGGId','Name','Description','YearPublished','GameWeight','AvgRating']]
twenty_most_popular = twenty_most_popular.reset_index(drop=True)
twenty_most_popular

Unnamed: 0,BGGId,Name,Description,YearPublished,GameWeight,AvgRating
0,30549,Pandemic,pandemic virulent disease break simultaneously...,2008,2.4072,7.5913
1,822,Carcassonne,carcassonne tileplacement game player draw pla...,2000,1.9064,7.41883
2,13,Catan,catan settler catan player try dominant force ...,1995,2.3139,7.13746
3,68448,7 Wonders,leader great city ancient world gather resou...,2010,2.3258,7.73733
4,178900,Codenames,codename easy party game solve puzzle game div...,2015,1.2796,7.60087
5,173346,7 Wonders Duel,way wonder duel resemble parent game wonde...,2015,2.2257,8.1073
6,167791,Terraforming Mars,s mankind begin terraform planet mar giant cor...,2016,3.2441,8.41879
7,36218,Dominion,quotyou monarch like parent ruler small pleasa...,2008,2.3547,7.61081
8,9209,Ticket to Ride,elegantly simple gameplay ticket ride learn ...,2004,1.8449,7.41494
9,230802,Azul,introduce moor azulejos originally white blue ...,2017,1.7639,7.80254
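
The popularity score multiplies a rating between 0 and 1 by an ownership count, so ownership dominates the ranking. A small worked example with invented numbers (both games are hypothetical) shows the effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Widely owned", "Highly rated niche"],
    "AvgRating": [7.0, 9.0],
    "NumOwned": [100_000, 2_000],
})

# Same formula as above: rating scaled to 0-1, times number of owners.
toy["popularity"] = toy["AvgRating"] / 10 * toy["NumOwned"]
ranked = toy.sort_values("popularity", ascending=False).reset_index(drop=True)
print(ranked)
```

The widely owned game scores about 70,000 against about 1,800 for the niche game, even though the niche game has the higher average rating.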


In [8]:
import gc; gc.collect()

31

**Collaborative filtering**

In [9]:
user_rank.Username.nunique()

411374

In [10]:
#choose user to suggest to
active_user_id = 'therealshakeyt'
active_user_ratings = user_rank[user_rank['Username'] == active_user_id]

print(active_user_ratings.shape)
active_user_ratings.head()

(143, 3)


Unnamed: 0,BGGId,Rating,Username
122169,339031,9.0,therealshakeyt
504171,344258,9.46,therealshakeyt
1607695,30549,8.0,therealshakeyt
1741968,68448,8.0,therealshakeyt
1757214,230802,8.0,therealshakeyt


In [11]:
user_rank_filtered = user_rank.groupby('Username').filter(lambda x: len(x) >= 100)
user_rank_filtered.Username.nunique()

48395

In [12]:
user_item_matrix = user_rank_filtered.pivot_table(index='Username', columns='BGGId', values='Rating',fill_value=0)
user_item_matrix
pickle.dump(user_item_matrix, open("user_item_matrix.pkl", "wb"))
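
The pivot above produces a dense table with one column per game, most of it zeros. If memory becomes a problem, a sparse matrix is one possible alternative. The sketch below is an assumption about how that could look, not what this notebook runs; `build_sparse_user_item` is a hypothetical helper. Note one difference: `csr_matrix` sums duplicate (user, game) pairs, whereas `pivot_table` averages them by default.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def build_sparse_user_item(ratings):
    """Return (matrix, usernames, game_ids) for a long-format ratings frame."""
    users = ratings["Username"].astype("category")
    items = ratings["BGGId"].astype("category")
    matrix = csr_matrix(
        (
            ratings["Rating"].to_numpy(dtype=float),
            (users.cat.codes.to_numpy(), items.cat.codes.to_numpy()),
        ),
        shape=(len(users.cat.categories), len(items.cat.categories)),
    )
    return matrix, users.cat.categories, items.cat.categories

# Toy data with the same three columns as user_ratings.csv
toy_ratings = pd.DataFrame({
    "BGGId": [13, 822, 13, 30549],
    "Rating": [8.0, 7.0, 6.5, 9.0],
    "Username": ["ann", "ann", "bob", "bob"],
})
matrix, usernames, game_ids = build_sparse_user_item(toy_ratings)
print(matrix.shape, matrix.nnz)
```

Both `cosine_similarity` and `NearestNeighbors` in scikit-learn accept sparse input, so a matrix built this way could be passed to them, with `usernames` and `game_ids` used to translate row and column positions back to labels.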


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

def recommend_games(user_item_matrix, active_user_id, num_recommendations=10):
    if active_user_id not in user_item_matrix.index:
        print(f"User {active_user_id} not found in the user-item matrix.")
        return []

    # Cosine similarity between the active user and every user.
    # Only this one row is needed, not the full user-by-user matrix.
    active_ratings = user_item_matrix.loc[active_user_id]
    user_similarity = cosine_similarity(user_item_matrix.loc[[active_user_id]], user_item_matrix).flatten()
    similar_users = pd.Series(user_similarity, index=user_item_matrix.index)

    # Rank the other users from most to least similar, dropping the active user by label
    similar_users = similar_users.drop(active_user_id).sort_values(ascending=False)

    # Walk through similar users and collect their top-rated games
    recommended_games = []
    for similar_user in similar_users.index:
        similar_user_ratings = user_item_matrix.loc[similar_user]
        # Consider only games the active user hasn't rated (stored as 0)
        unrated_games = similar_user_ratings[active_ratings == 0]
        # Add games in descending order of this user's rating
        for game_id in unrated_games.sort_values(ascending=False).index:
            if game_id not in recommended_games:  # avoid duplicates
                recommended_games.append(game_id)
            if len(recommended_games) >= num_recommendations:
                break
        if len(recommended_games) >= num_recommendations:
            break

    return recommended_games[:num_recommendations]

# Calling function using active_user
recommendations = recommend_games(user_item_matrix, active_user_id)
game_names = []
for bgg_id in recommendations:
    game_name = games.loc[games['BGGId'] == bgg_id, 'Name'].values[0]  # Extract game name
    game_names.append(game_name)


print(f"Recommended games for {active_user_id}: {game_names}")


Recommended games for therealshakeyt: ["The King's Dilemma", 'Chinatown', 'Startups', 'Tales of the Arabian Nights', 'SHASN', 'Witness', 'Warhammer 40,000 (Eighth Edition)', 'Dune', 'Inis', 'Kemet: Blood and Sand']


In [14]:
from sklearn.neighbors import NearestNeighbors

def recommend_games_knn(user_item_matrix, active_user_id, num_recommendations=10, k=5):
    if active_user_id not in user_item_matrix.index:
        print(f"User {active_user_id} not found in the user-item matrix.")
        return []

    # Use KNN to find similar users
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=k, n_jobs=-1)
    model_knn.fit(user_item_matrix)

    distances, indices = model_knn.kneighbors(user_item_matrix.loc[[active_user_id]])
    indices = indices.flatten()[1:]  # Exclude the active user themself
    distances = distances.flatten()[1:]

    # Get games rated by similar users, weighted by similarity
    recommended_games = []
    for i, similar_user_index in enumerate(indices):
        similar_user_id = user_item_matrix.index[similar_user_index]
        similar_user_ratings = user_item_matrix.loc[similar_user_id]
        unrated_games = similar_user_ratings[user_item_matrix.loc[active_user_id] == 0]
        # Weight ratings by similarity
        weighted_ratings = unrated_games * (1 - distances[i])  # higher distance = lower similarity

        for game_id, rating in weighted_ratings.sort_values(ascending=False).items():
            if game_id not in recommended_games:
                recommended_games.append(game_id)
            if len(recommended_games) >= num_recommendations:
                break
        if len(recommended_games) >= num_recommendations:
            break

    return recommended_games[:num_recommendations]

# Example usage (assuming user_item_matrix and active_user_id are defined)
recommendations_knn = recommend_games_knn(user_item_matrix, active_user_id)
game_names_knn = []
for bgg_id in recommendations_knn:
    game_name = games.loc[games['BGGId'] == bgg_id, 'Name'].values[0]
    game_names_knn.append(game_name)

print(f"Recommended games for {active_user_id} using KNN: {game_names_knn}")


Recommended games for therealshakeyt using KNN: ["The King's Dilemma", 'Chinatown', 'Startups', 'Tales of the Arabian Nights', 'SHASN', 'Witness', 'Warhammer 40,000 (Eighth Edition)', 'Dune', 'Inis', 'Kemet: Blood and Sand']
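
Both functions above return the same list because each one fills all ten slots from the single most similar user; multiplying one user's ratings by that user's similarity does not change their order. A possible variation, sketched here on toy data, scores each game by the similarity-weighted mean rating across all k neighbours. The function name `knn_weighted_scores` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def knn_weighted_scores(user_item_matrix, active_user_id, k=3):
    """Similarity-weighted mean rating per game across the k nearest other users."""
    # Ask for k + 1 neighbours because the query user is part of the fitted data.
    n_neighbors = min(k + 1, len(user_item_matrix))
    model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=n_neighbors)
    model.fit(user_item_matrix.to_numpy())
    active = user_item_matrix.loc[[active_user_id]].to_numpy()
    distances, indices = model.kneighbors(active)

    # Similarity = 1 - cosine distance; drop the active user by label.
    neighbors = pd.Series(1 - distances.flatten(),
                          index=user_item_matrix.index[indices.flatten()])
    neighbors = neighbors.drop(active_user_id, errors="ignore").head(k)

    neighbor_ratings = user_item_matrix.loc[neighbors.index]
    rated = (neighbor_ratings > 0).astype(float)  # 0 means "not rated"
    weighted_sum = neighbor_ratings.mul(neighbors, axis=0).sum()
    weight_total = rated.mul(neighbors, axis=0).sum()
    scores = (weighted_sum / weight_total).dropna()  # NaN: no neighbour rated it

    # Keep only games the active user has not rated.
    unrated = user_item_matrix.loc[active_user_id] == 0
    return scores[unrated.reindex(scores.index)].sort_values(ascending=False)

toy = pd.DataFrame(
    [[9, 8, 0, 0],
     [9, 7, 8, 0],
     [8, 9, 6, 3],
     [1, 0, 2, 9]],
    index=["active", "u1", "u2", "u3"],
    columns=[101, 102, 103, 104],
    dtype=float,
)
scores = knn_weighted_scores(toy, "active", k=2)
print(scores)
```

With this scoring, a game that several close neighbours rated highly outranks a game that only one neighbour rated, which the first-neighbour-wins loop cannot express.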


In [15]:
# Get active_user_games and users that played same games
active_user_games = active_user_ratings['BGGId'].unique()
similar_users = user_rank[user_rank['BGGId'].isin(active_user_games)]

# Filter for users with minimum common items, excluding the active user
min_common_items = 20
similar_users = similar_users.groupby('Username').filter(lambda x: len(x) >= min_common_items)
similar_users = similar_users[similar_users['Username'] != active_user_id]

print(f"Number of similar users: {similar_users['Username'].nunique()}")
print(similar_users.shape)
similar_users.head()

Number of similar users: 29618
(893866, 3)


Unnamed: 0,BGGId,Rating,Username
35086,339031,8.0,Hutch86
35087,339031,8.0,HINATA1986
35089,339031,8.0,jkboardgames
35092,339031,7.75,jwbeane
35093,339031,7.75,kevinruns262


In [16]:
def calc_similarity(similar_user_ratings, active_user_ratings):
    """Calculates the cosine similarity between two users' ratings."""
    # Merge ratings on common games
    common_ratings = similar_user_ratings.merge(
        active_user_ratings,
        how='inner',
        on='BGGId',
        suffixes=('_similar', '_active')
    )

    # Calculate cosine distance and return similarity
    if not common_ratings.empty:  # Handle cases with no common ratings
        cos_distance = distance.cosine(
            common_ratings['Rating_similar'],
            common_ratings['Rating_active']
        )
        return 1 - cos_distance
    else:
        return 0  # Return 0 for no similarity if no common ratings


# Apply calc_similarity to find similar users
similarities = similar_users.groupby('Username').apply(
    calc_similarity, active_user_ratings=active_user_ratings
)
similarities.name = "Similarity"

# Filter for similar users above the threshold
min_similarity_score = 0.98
similar_users_filtered = similarities[similarities > min_similarity_score]

# Get candidate games that the similar users rated and the active user did not rate.
# Note: this uses every user in similar_users; similar_users_filtered is not applied here.
similar_user_games = user_rank.loc[user_rank['Username'].isin(similar_users['Username'])]
similar_user_games = similar_user_games[~similar_user_games['BGGId'].isin(active_user_games)]

# Keep only games rated by at least min_neighbors_ratings of those users
min_neighbors_ratings = 5
recommend_ratings = similar_user_games.groupby('BGGId').filter(lambda x: x.shape[0] >= min_neighbors_ratings)

print(recommend_ratings.shape)
recommend_ratings.BGGId.nunique()
recommend_ratings.info()


(7410011, 3)
<class 'pandas.core.frame.DataFrame'>
Index: 7410011 entries, 0 to 18942214
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   BGGId     int64  
 1   Rating    float64
 2   Username  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 226.1+ MB
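
Because ratings are all positive, cosine similarity on raw ratings tends to sit close to 1 even for users who disagree, which is probably why a threshold as high as 0.98 is used above. One common alternative is to mean-centre each user's ratings first, which gives the Pearson correlation. The function below is a sketch of that idea; `pearson_similarity` and its `min_common` parameter are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def pearson_similarity(similar_user_ratings, active_user_ratings, min_common=2):
    """Mean-centred cosine (Pearson correlation) over the games both users rated."""
    common = similar_user_ratings.merge(
        active_user_ratings, how="inner", on="BGGId", suffixes=("_similar", "_active")
    )
    if len(common) < min_common:
        return 0.0
    a = common["Rating_similar"].to_numpy(dtype=float)
    b = common["Rating_active"].to_numpy(dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:  # one user gave every common game the same rating
        return 0.0
    return float(a @ b / denom)

active = pd.DataFrame({"BGGId": [1, 2, 3], "Rating": [9.0, 7.0, 5.0]})
agrees = pd.DataFrame({"BGGId": [1, 2, 3], "Rating": [8.0, 6.0, 4.0]})
disagrees = pd.DataFrame({"BGGId": [1, 2, 3], "Rating": [4.0, 6.0, 8.0]})

sim_agree = pearson_similarity(agrees, active)
sim_disagree = pearson_similarity(disagrees, active)
print(sim_agree, sim_disagree)
```

In this toy case the user who ranks the games in the opposite order gets a similarity near -1, whereas raw cosine similarity on the same ratings would still be roughly 0.88.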


In [18]:
def calc_game_score(game_ratings):
    # Similarity-weighted average rating for one game
    game_ratings = game_ratings.join(similarities, on='Username')
    return (game_ratings['Rating'] *
            game_ratings['Similarity']).sum() / game_ratings['Similarity'].sum()

game_score = recommend_ratings.groupby('BGGId').apply(calc_game_score)
game_score.name = 'Score'
game_score.head()

In [19]:
# Option 2: similarity-weighted average rating per game, with similarities passed in explicitly
def calc_game_score(game_ratings, similarities):
    game_ratings = game_ratings.join(similarities, on='Username')
    return (game_ratings['Rating'] *
            game_ratings['Similarity']).sum() / \
            game_ratings['Similarity'].sum()

In [20]:
# apply calc_game_score on recommend_ratings
game_scores = recommend_ratings.groupby('BGGId').apply(calc_game_score, similarities=similarities)
game_scores.name = 'Score'
game_scores.head()



BGGId,Score
1,7.521359
2,6.37208
3,7.340347
4,6.333032
5,7.213765
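
The score for each game is the similarity-weighted mean of its ratings: the sum of rating times similarity, divided by the sum of similarities. `groupby.apply` with a Python function can be slow on millions of rows, so a vectorised form of the same calculation may be worth trying. The sketch below assumes the same column names as above; `weighted_game_scores` is a hypothetical name and the data is invented.

```python
import pandas as pd

def weighted_game_scores(ratings, similarities):
    """Similarity-weighted mean rating per BGGId, without groupby.apply."""
    # Look up each rater's similarity; raters without one get NaN and are
    # skipped by the sums below, as in the join-based version.
    weights = ratings["Username"].map(similarities)
    scored = ratings.assign(weight=weights, weighted=ratings["Rating"] * weights)
    totals = scored.groupby("BGGId")[["weighted", "weight"]].sum()
    return (totals["weighted"] / totals["weight"]).rename("Score")

toy_ratings = pd.DataFrame({
    "BGGId": [1, 1, 2],
    "Rating": [8.0, 6.0, 9.0],
    "Username": ["a", "b", "a"],
})
toy_similarities = pd.Series({"a": 1.0, "b": 0.5}, name="Similarity")

toy_scores = weighted_game_scores(toy_ratings, toy_similarities)
print(toy_scores)
```

For game 1 this gives (8 × 1.0 + 6 × 0.5) / (1.0 + 0.5) ≈ 7.33, and for game 2, which only user `a` rated, it gives 9.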


In [21]:
# Select the top n_recommendations games by score (similarity-weighted average rating)
n_recommendations = 10
final_recommendation = game_scores.sort_values(ascending=False)[:n_recommendations]
final_recommendation = final_recommendation.to_frame('score').reset_index()
final_recommendation

Unnamed: 0,BGGId,score
0,284121,9.670928
1,342942,9.533226
2,295785,9.403816
3,341169,9.351703
4,345976,9.291651
5,259970,9.122254
6,249277,9.076063
7,277659,9.059546
8,291951,8.998776
9,299659,8.983753


In [22]:
# Merge the top game suggestions with game details
top_picks = final_recommendation.merge(games[['BGGId','Name','Description']], on='BGGId', how='left')

In [23]:
# Print a list of game names from the 'Name' column.
print(top_picks['Name'].to_list())


['Uprising: Curse of the Last Emperor', 'Ark Nova', 'Euthia: Torment of Resurrection', 'Great Western Trail (Second Edition)', 'System Gateway (fan expansion for Android: Netrunner)', 'The Lord of the Rings: The Card Game – Two-Player Limited Edition Starter', 'Brazil: Imperial', 'Final Girl', 'The Everdeck', 'Clash of Cultures: Monumental Edition']


**Content-Based Filtering**

In [24]:
# Create a TF-IDF vectorizer object
tfidf = TfidfVectorizer(stop_words='english')

# Fill NaN values in 'Description' with empty strings before applying TF-IDF
games['Description'] = games['Description'].fillna('')

pickle.dump(games, open("games.pkl", "wb"))

# Fit and transform the 'Description' column
tfidf_matrix = tfidf.fit_transform(games['Description'])

pickle.dump(tfidf, open("tfidf.pkl", "wb"))
pickle.dump(tfidf_matrix, open("tfidf_matrix.pkl", "wb"))

# Function to get recommendations based on game description
def content_based_recommendations(game_title, tfidf_matrix, games, top_n=10):
    # Find the index of the game in the DataFrame
    idx = games[games['Name'] == game_title].index[0]

    # Cosine similarity between the game and every game
    # (TF-IDF rows are L2-normalized by default, so the dot product is the cosine)
    cosine_similarities = linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()

    # Indices of the top_n highest-scoring games, which normally includes the game itself
    related_docs_indices = cosine_similarities.argsort()[:-top_n-1:-1]

    # Exclude the game itself, which usually leaves top_n - 1 recommendations
    related_docs_indices = related_docs_indices[related_docs_indices != idx]

    # Return the names of the remaining games, most similar first
    return games['Name'].iloc[related_docs_indices].tolist()


recommendations = content_based_recommendations("Gloomhaven", tfidf_matrix, games)

# Content-based filtering using the "Cat:" columns
# These columns hold 0/1 flags, so each game becomes a string of "0" and "1" tokens
games['combined_cat_features'] = games.filter(regex='^Cat:').astype(str).agg(' '.join, axis=1)

pickle.dump(games['combined_cat_features'], open("games_cat.pkl", "wb"))


# Create TF-IDF matrix for combined cat features
tfidf_cat = TfidfVectorizer(stop_words='english', token_pattern=r"(?u)\b\w+\b") # Keep single-character words
tfidf_matrix_cat = tfidf_cat.fit_transform(games['combined_cat_features'])

# Define a function to recommend based on "Cat" features
def content_based_recommendations_cat(game_title, tfidf_matrix, games, top_n=10):
    idx = games[games['Name'] == game_title].index[0]
    cosine_similarities = linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()
    related_docs_indices = cosine_similarities.argsort()[:-top_n-1:-1]
    related_docs_indices = related_docs_indices[related_docs_indices != idx]
    return games['Name'].iloc[related_docs_indices].tolist()

pickle.dump(tfidf_cat, open("tfidf_cat.pkl", "wb"))
pickle.dump(tfidf_matrix_cat, open("tfidf_matrix_cat.pkl", "wb"))

# Example usage for "Cat" features
cat_recommendations = content_based_recommendations_cat("Gloomhaven", tfidf_matrix_cat, games)
cat_recommendations


['Space Explorers',
 'Blue Line Hockey',
 'Quests of Valeria',
 'LOKA: A Game of Elemental Strategy',
 'Waste Knights',
 'Flash!',
 "10' to Kill",
 'The Wreck of the B.S.M. Pandora',
 'Unspeakable Words']
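
Joining the 0/1 category flags into a string means TF-IDF only sees the tokens "0" and "1", so it cannot tell which category a flag belongs to. That may explain why the list above looks unrelated to Gloomhaven. A possible alternative is to compare the flag columns directly. The sketch below is an assumption, not what the notebook runs; `recommend_by_flags` and the toy games are hypothetical. It also excludes the query game before truncating, so it returns a full `top_n` names.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_by_flags(game_title, games, prefix="Cat:", top_n=10):
    """Rank games by cosine similarity of their 0/1 flag columns."""
    flags = games.filter(regex=f"^{prefix}").to_numpy(dtype=float)
    # Positional index of the first game with this name
    idx = np.flatnonzero((games["Name"] == game_title).to_numpy())[0]
    sims = cosine_similarity(flags[[idx]], flags).flatten()
    sims[idx] = -np.inf  # exclude the game itself before ranking
    order = np.argsort(-sims, kind="stable")[:top_n]
    return games["Name"].iloc[order].tolist()

toy_games = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Cat:Strategy": [1, 1, 0, 0],
    "Cat:Party": [0, 0, 1, 0],
    "Cat:War": [1, 0, 0, 1],
})
toy_recs = recommend_by_flags("A", toy_games, top_n=2)
print(toy_recs)
```

Here game A (strategy and war) is matched with B and D, which each share one category with it, and never with the party game C. With only eight category columns many games tie exactly, so in practice this would likely need extra features such as the mechanics or themes tables to break ties.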

Hybrid model that combines the collaborative and content-based filtering approaches above.

In [25]:
def hybrid_recommendations(user_id, game_title, user_item_matrix, tfidf_matrix, tfidf_matrix_cat, games, top_n=25):
    # Collaborative filtering recommendations
    collab_recommendations = recommend_games_knn(user_item_matrix, user_id)

    # Content-based filtering recommendations (categories)
    content_cat_recommendations = content_based_recommendations_cat(game_title, tfidf_matrix_cat, games)

    # Combine recommendations, prioritizing collaborative filtering
    hybrid_recs = []
    seen = set()  # Keep track of games already added

    for rec in collab_recommendations:
      game_name = games[games['BGGId'] == rec]['Name'].values
      if game_name.size > 0 and game_name[0] not in seen:
        hybrid_recs.append(game_name[0])
        seen.add(game_name[0])

    for rec in content_cat_recommendations:
        if rec not in seen and len(hybrid_recs) < top_n:
            hybrid_recs.append(rec)
            seen.add(rec)

    return hybrid_recs[:top_n]

hybrid_recs = hybrid_recommendations(active_user_id, "Ark Nova", user_item_matrix, tfidf_matrix, tfidf_matrix_cat, games)
alphabetized_games = sorted(hybrid_recs)
print(f"Hybrid recommendations for {active_user_id} based on 'Ark Nova': {alphabetized_games}")


Hybrid recommendations for therealshakeyt based on 'Ark Nova': ['7 Wonders: Architects', 'Chinatown', 'Creeper', 'Die Macher', 'Dune', 'Gobblet', 'Horrified: American Monsters', 'Inis', 'Kemet: Blood and Sand', 'Lost Patrol', 'Midway', 'SHASN', 'Startups', 'Tales of the Arabian Nights', "The King's Dilemma", 'Ultimate Stratego', 'Warhammer 40,000 (Eighth Edition)', 'Witness', 'Yahtzee']
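
The hybrid function appends content-based picks only after every collaborative pick, so with a small `top_n` the content-based list may never appear. One alternative, shown here as a sketch rather than the approach used above, is to alternate between the two lists. The helper name `interleave_recommendations` is hypothetical.

```python
from itertools import zip_longest

def interleave_recommendations(primary, secondary, top_n=10):
    """Alternate between two ranked lists, skipping names already taken."""
    merged, seen = [], set()
    for pair in zip_longest(primary, secondary):
        for name in pair:
            if name is not None and name not in seen:
                merged.append(name)
                seen.add(name)
    return merged[:top_n]

mixed = interleave_recommendations(["A", "B", "C"], ["B", "D"], top_n=4)
print(mixed)  # ['A', 'B', 'D', 'C']
```

The primary list still wins ties for position, since its item is taken first in each round, but the secondary list contributes from the second slot onward.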
