# Boardgame Recommender

## 1. Business Understanding

**BoardGameGeek** is an online database of over 125,600 board games, providing access to reviews, discussion forums, and detailed information on individual titles. It is widely regarded as the most extensive repository of board game data available. In addition to serving as an information resource, the platform lets users rate games on a 1–10 scale and manage their personal game collections.

The goal of this project is to develop a recommender system that leverages data from BoardGameGeek to suggest board games to users. We chose an item-based collaborative filtering approach, using community rating data to find similarities between games. The system will generate personalized recommendations for a selected user, based on that user's individual rating history.
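To make the idea of item-based collaborative filtering concrete, here is a minimal sketch on a toy rating matrix. The matrix values and the helper names `item_cosine_similarity` and `predict_rating` are assumptions made up for illustration; they are not part of the project code, which relies on the `surprise` library later on.

```python
import numpy as np

# Toy user x game rating matrix (rows: users, columns: games).
# 0 means "not rated". All values are invented for illustration.
ratings = np.array([
    [8.0, 7.0, 0.0],
    [9.0, 8.0, 2.0],
    [2.0, 0.0, 9.0],
])

def item_cosine_similarity(matrix):
    """Cosine similarity between the columns (items) of a rating matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated items
    normalized = matrix / norms
    return normalized.T @ normalized

def predict_rating(matrix, sims, user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = matrix[user] > 0
    weights = sims[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ matrix[user, rated] / weights.sum())

item_sims = item_cosine_similarity(ratings)
predicted = predict_rating(ratings, item_sims, user=0, item=2)
print(item_sims.round(2))
print(round(predicted, 2))
```

Games 0 and 1 are rated similarly by the same users, so their similarity is high, and the prediction for the unrated game is a weighted blend of the user's existing ratings.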

## 2. Data Understanding

The dataset used in this project is the **Board Game Database** available on [Kaggle](https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek), which compiles data from the **BoardGameGeek (BGG)** platform. The data, collected in 2021, captures detailed information about board games, their attributes, and user interactions on the site. 

This dataset provides a rich foundation for exploring patterns in board game design, user preferences, and popularity, making it well-suited for building a **recommender system** based on user ratings and item similarities.

### 2.1 Overview of Available Files

The dataset consists of **nine files**, each offering different dimensions of board game data:

- **GAMES** – Contains core details for approximately **22,000 board games** with **47 features**.  
  Each game is identified by a unique **BGGId**, serving as the primary key. This file includes metadata such as publication year, minimum/maximum players, average playtime, and average user rating.

- **RATINGS_DISTRIBUTION** – Provides the **full rating distribution** for each game (`BGGId`), detailing how users rated the game on a 1–10 scale.

- **THEMES** – Lists **thematic categories** associated with each game (`BGGId`), allowing thematic-based filtering and analysis.

- **MECHANICS** – Includes **game mechanics** represented as binary indicators per game, enabling insights into gameplay styles (e.g., deck building, worker placement, area control).

- **SUBCATEGORIES** – Contains **secondary classifications** (binary flags) that describe additional attributes or gameplay aspects of each title.

- **ARTISTS_REDUCED** – Identifies **artists** involved in the visual design of each game.  
  Only artists with more than 3 works are included explicitly; others are represented by a binary flag.

- **DESIGNERS_REDUCED** – Provides information about **game designers**, using the same filtering logic as the artists file (designers with >3 works listed individually).

- **PUBLISHERS_REDUCED** – Contains **publisher information**, with a binary flag for publishers associated with 3 or fewer published games.

- **USER_RATINGS** – The largest and most critical table for recommendation modeling.  
  It contains user-generated ratings with over **411,000 unique users** and approximately **19 million ratings**, linking users to the games they have rated via `username` and `BGGId`.
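The user and rating counts above can be checked directly once the file is loaded. The sketch below shows the idea on a small invented DataFrame that only mimics the `BGGId`, `Rating`, and `Username` columns; the `density` figure is an extra illustration of how sparse a user–item matrix is, not a number taken from the dataset.

```python
import pandas as pd

# Invented sample that mimics the structure of user_ratings.csv
toy = pd.DataFrame({
    "BGGId": [1, 1, 2, 3, 3, 3],
    "Rating": [8.0, 6.5, 7.0, 9.0, 4.0, 7.5],
    "Username": ["ann", "bob", "ann", "ann", "bob", "cat"],
})

n_ratings = len(toy)                      # number of rating rows
n_users = toy["Username"].nunique()       # distinct users
n_games = toy["BGGId"].nunique()          # distinct games
density = n_ratings / (n_users * n_games) # share of filled user-game cells

print(n_ratings, n_users, n_games, round(density, 3))
```

Applied to the real table, the same three lines would reproduce the reported counts, and the density would be expected to be far lower than in this toy example.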


### 2.2 Data Relevance to the Project

For this project, the **USER_RATINGS** dataset forms the foundation of the **item-based collaborative filtering** recommender system.

### 2.3 Data Source

- **Primary Source:** [BoardGameGeek](https://boardgamegeek.com/)  
- **Dataset Repository:** [Board Games Database on Kaggle](https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek)

### 2.4 Data Access and Loading

To begin the analysis, the dataset is retrieved directly from **Kaggle** using the `kagglehub` library.  
This ensures easy and reliable access to the **Board Game Database** for exploration and modeling.

Install dependencies as needed:  
`pip install kagglehub[pandas-datasets]`

In [None]:
import kagglehub
import os
import pandas as pd

path = kagglehub.dataset_download("threnjen/board-games-database-from-boardgamegeek")

### 2.5 Exploring the `GAMES` File

The **`games.csv`** file serves as the **master information table**, containing detailed metadata for each board game listed on BoardGameGeek.  
Each game is uniquely identified by the `BGGId` and described using 47 attributes covering gameplay, community ratings, rankings, and publisher information.

Key features include:

- **BGGId** – Unique BoardGameGeek game identifier  
- **Name** – Title of the board game  
- **Description** – Lemmatized and punctuation-stripped description text  
- **YearPublished** – Year the game was first published  
- **GameWeight** – Difficulty or complexity rating  
- **AvgRating / BayesAvgRating** – Average and Bayesian-weighted average user ratings  
- **MinPlayers / MaxPlayers / BestPlayers** – Recommended and community-voted player counts  
- **ComAgeRec / MfgAgeRec** – Community and manufacturer age recommendations  
- **MfgPlayTime / ComMinPlaytime / ComMaxPlaytime** – Estimated and community-reported play times  
- **NumOwned / NumWant / NumWish** – Ownership and wishlist statistics  
- **Rank:** fields – Game rankings across multiple categories (e.g., strategy, thematic, party)  
- **Cat:** fields – Binary indicators for major game categories  
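Because the `Rank:` and `Cat:` fields share a common prefix, they can be selected as groups by column name. The following is a small sketch on an invented two-row frame; the values and the `categories` helper list are assumptions for illustration only.

```python
import pandas as pd

# Invented sample with a few of the prefixed columns from games.csv
toy_games = pd.DataFrame({
    "BGGId": [1, 2],
    "Name": ["Game A", "Game B"],
    "Rank:boardgame": [10, 250],
    "Rank:strategygames": [5, 300],
    "Cat:Strategy": [1, 0],
    "Cat:Family": [0, 1],
})

# Group columns by their prefix
rank_cols = [c for c in toy_games.columns if c.startswith("Rank:")]
cat_cols = [c for c in toy_games.columns if c.startswith("Cat:")]

# Turn the binary category flags into readable labels per game
categories = [
    [col[len("Cat:"):] for col, flag in zip(cat_cols, flags) if flag == 1]
    for flags in toy_games[cat_cols].to_numpy()
]

print(rank_cols)
print(categories)
```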

The following code loads the `games.csv` file into a pandas DataFrame and displays the first few records for inspection.


In [None]:
df_games = pd.read_csv(os.path.join(path, "games.csv"))
display(df_games.head())

### 2.6 Exploring the `USER_RATINGS` File

The **`user_ratings.csv`** file captures individual user ratings for board games listed on BoardGameGeek.  
Each record links a **user** to a **game** through its unique `BGGId` and the corresponding numeric rating, forming the foundation for collaborative filtering–based recommendations.

Key features include:

- **BGGId** – Unique identifier for each board game (foreign key referencing the `GAMES` table)  
- **Rating** – The raw user-assigned rating on a 1–10 scale  
- **Username** – Identifier of the user who provided the rating  

This dataset contains over **19 million ratings** from approximately **411,000 unique users**, making it the **core interaction data** for the recommender system.

The following code loads the `user_ratings.csv` file into a pandas DataFrame and displays the first few records to examine its structure.


In [None]:
df_user_ratings = pd.read_csv(os.path.join(path, "user_ratings.csv"))
display(df_user_ratings.head())

### 2.7 Data Quality Assessment

A thorough assessment of data quality was conducted to ensure the dataset is suitable for analysis and modeling.  
This evaluation focused on two key aspects: **missing values** and **outliers**.  
The findings are based on the official dataset documentation provided on Kaggle:  
[Board Games Database (BoardGameGeek, Kaggle)](https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek?select=games.csv)

#### Missing Values
- The **`games.csv`** file was found to be **largely complete**, with **only one missing value** detected in the `Description` column.  
  Since this column contains textual metadata and is not critical for the recommendation system, this missing entry does not pose a problem for the analysis.
- The **`user_ratings.csv`** file contains **no missing values**, ensuring the user–item interaction data is fully populated and ready for modeling.

#### Outliers
- Based on both the exploratory inspection and the accompanying dataset documentation, the **`games.csv`** file contains **no outlier values** in its numerical attributes (e.g., `AvgRating`, `GameWeight`, or `NumUserRatings`).
- The **`user_ratings.csv`** file also shows **no outliers**, with all ratings conforming to the expected **1–10** user rating scale.

Overall, both datasets demonstrate **excellent data quality**, requiring minimal cleaning prior to the development of the recommender system.
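The two checks described in this section can be expressed in a few lines of pandas. This is a minimal sketch on an invented frame that deliberately contains one missing value and one out-of-range rating; the 1–10 bounds follow the rating scale stated above, and the variable names are illustrative.

```python
import pandas as pd

# Invented sample with one missing rating and one rating outside 1-10
toy_ratings = pd.DataFrame({
    "BGGId": [1, 2, 3, 4],
    "Rating": [8.0, 11.0, None, 6.5],
    "Username": ["ann", "bob", "cat", "dan"],
})

# Missing values per column
missing_per_column = toy_ratings.isna().sum()

# Rows whose rating is present but falls outside the expected scale
out_of_range = toy_ratings[
    toy_ratings["Rating"].notna() & ~toy_ratings["Rating"].between(1, 10)
]

print(missing_per_column)
print(out_of_range)
```

Running the same checks on `df_games` and `df_user_ratings` would be one way to confirm the documented findings against the loaded data.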

In [None]:
# df_artists = pd.read_csv(os.path.join(path, "artists_reduced.csv"))
# display(df_artists.head())
# 
# df_designers = pd.read_csv(os.path.join(path, "designers_reduced.csv"))
# display(df_designers.head())

# df_mechanics = pd.read_csv(os.path.join(path, "mechanics.csv"))
# display(df_mechanics.head())
# 
# df_publishers = pd.read_csv(os.path.join(path, "publishers_reduced.csv"))
# display(df_publishers.head())

# df_ratings_distribution = pd.read_csv(os.path.join(path, "ratings_distribution.csv"))
# display(df_ratings_distribution.head())

# df_subcategories = pd.read_csv(os.path.join(path, "subcategories.csv"))
# display(df_subcategories.head())

# df_themes = pd.read_csv(os.path.join(path, "themes.csv"))
# display(df_themes.head())


# For the documentation file, just preview as text (not tabular)
# with open(os.path.join(path, "bgg_data_documentation.txt"), "r", encoding="utf-8") as f:
#     bgg_documentation = f.read()
# print(bgg_documentation[:500])  # first 500 characters



# GAMES
# 	BGGId			BoardGameGeek game ID
# 	Name			Name of game
# 	Description		Description, stripped of punctuation and lemmatized
# 	YearPublished		First year game published
# 	GameWeight		Game difficulty/complexity
# 	AvgRating		Average user rating for game
# 	BayesAvgRating		Bayes weighted average for game (x # of average reviews applied)
# 	StdDev			Standard deviation of Bayes Avg
# 	MinPlayers		Minimum number of players
# 	MaxPlayers		Maximun number of players
# 	ComAgeRec		Community's recommended age minimum
# 	LanguageEase		Language requirement
# 	BestPlayers		Community voted best player count
# 	GoodPlayers		List of community voted good plater counts
# 	NumOwned		Number of users who own this game
# 	NumWant			Number of users who want this game
# 	NumWish			Number of users who wishlisted this game
# 	NumWeightVotes		? Unknown
# 	MfgPlayTime		Manufacturer Stated Play Time
# 	ComMinPlaytime		Community minimum play time
# 	ComMaxPlaytime		Community maximum play time
# 	MfgAgeRec		Manufacturer Age Recommendation
# 	NumUserRatings		Number of user ratings
# 	NumComments		Number of user comments
# 	NumAlternates		Number of alternate versions
# 	NumExpansions		Number of expansions
# 	NumImplementations	Number of implementations
# 	IsReimplementation	Binary - Is this listing a reimplementation? 
# 	Family			Game family
# 	Kickstarted		Binary - Is this a kickstarter?
# 	ImagePath		Image http:// path
# 	Rank:boardgame		Rank for boardgames overall
# 	Rank:strategygames	Rank in strategy games
# 	Rank:abstracts		Rank in abstracts
# 	Rank:familygames	Rank in family games
# 	Rank:thematic		Rank in thematic
# 	Rank:cgs		Rank in card games
# 	Rank:wargames		Rank in war games
# 	Rank:partygames		Rank in party games
# 	Rank:childrensgames	Rank in children's games
# 	Cat:Thematic		Binary is in Thematic category
# 	Cat:Strategy		Binary is in Strategy category
# 	Cat:War			Binary is in War category
# 	Cat:Family		Binary is in Family category
# 	Cat:CGS			Binary is in Card Games category
# 	Cat:Abstract		Binary is in Abstract category
# 	Cat:Party		Binary is in Party category
# 	Cat:Childrens		Binary is in Childrens category
# 
# 
# MECHANICS
# 	BGGId			BoardGameGeek game ID	
# 	Remaining headers are various mechanics with binary flag
# 
# 
# THEMES
# 	BGGId			BoardGameGeek game ID
# 	Remaining headers are various themes with binary flag
# 
# 
# SUBCATEGORIES
# 	BGGId			BoardGameGeek game ID
# 	Remaining headers are various subcategories with binary flag
# 
# 
# ARTISTS_REDUCED
# 	BGGId			BoardGameGeek game ID
# 	Low-Exp Artist		Indicates game has an unlisted artist with <= 3 entries
# 	Remaining headers are various artists with binary flag
# 
# 
# DESIGNERS_REDUCED
# 	BGGId			BoardGameGeek game ID
# 	Low-Exp Designer	Indicates game has an unlisted designer with <= 3 entries
# 	Remaining headers are various subcategories with binary flag
# 
# 
# PUBLISHERS_REDUCED
# 	BGGId			BoardGameGeek game ID
# 	Low-Exp Publisher	Indicated games has an unlisted publisher with <= 3 entries
# 	Remaining headers are various subcategories with binary flag
# 
# 
# USER_RATINGS
# 	BGGId			BoardGameGeek game ID	
# 	Rating			Raw rating given by user
# 	Username		User giving rating
# 
# 
# RATINGS_DISTRIBUTION
# 	BGGId			BoardGameGeek game ID
# 	Numbers 0.0-10.0	Number of ratings per rating header
# 	total_ratings		Total number of ratings for game

## 3. Data Preparation

In this step the data is cleaned: users and games with too few ratings are removed before modeling.

In [None]:
# Target user
target_user = "TeemuVataja"

In [None]:
from surprise import Reader, Dataset

# df_user_ratings has columns: BGGId, Rating, Username

# Build user table with unique IDs and rating counts
# If we ever need a dataframe without ratings just copy df_users before this step
df_users = (
    df_user_ratings.groupby("Username")
    .size()  # counts ratings per user
    .reset_index(name="RatingCount")
    .reset_index(names="UserId")  # create UserId from row index
)

# TODO: tidy up the printed output

print("Total number of users:")
display(len(df_users))

print("User table")
display(df_users.head())

print("User rating table")
display(df_user_ratings.head())

print("Selected user table")
display(df_users[df_users["Username"] == target_user])

Keep only users with at least 10 ratings.

In [None]:
print(f"Length of users set before reduction: {len(df_users)}")

df_users_reduced = df_users[df_users['RatingCount'] >= 10]

print(f"Length of users set after reduction: {len(df_users_reduced)}")

In [None]:
print(f"Length of games set before reduction: {len(df_games)}")

df_games_reduced = df_games[df_games['NumUserRatings'] >= 100]

print(f"Length of games set after reduction: {len(df_games_reduced)}")

In [None]:
print(f"Length of ratings set before reduction: {len(df_user_ratings)}")

df_ratings_reduced = df_user_ratings[df_user_ratings["BGGId"].isin(df_games_reduced["BGGId"])]
df_ratings_reduced = df_ratings_reduced[df_ratings_reduced["Username"].isin(df_users_reduced["Username"])]

print(f"Length of ratings set after reduction: {len(df_ratings_reduced)}")
print(f"Valid ratings after pruning for target user: {len(df_ratings_reduced[df_ratings_reduced['Username'] == target_user])}")

## 4. Modeling


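TODO: explain what is done in this section.

Notes on the algorithms:

- **SVD** – matrix factorization that learns latent factor vectors for users and items
- **KNN** – neighborhood-based similarity computed directly between users (or between items)

The model is created and saved to disk: if no saved model exists, it is trained and then saved.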
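As a rough illustration of what an SVD-style model computes for one user–game pair, the sketch below combines a global mean, a user bias, an item bias, and the dot product of two latent factor vectors, then clips the result to the rating scale. This matches the general form of the biased matrix-factorization estimate used by Surprise's `SVD`; the function name `svd_estimate` and all numbers are assumptions for illustration, not values from a trained model.

```python
import numpy as np

def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(0.0, 10.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = global_mean + user_bias + item_bias + float(np.dot(item_factors, user_factors))
    low, high = rating_scale
    return min(high, max(low, raw))

# Invented parameters for a single user and a single game
estimate = svd_estimate(
    global_mean=7.0,
    user_bias=0.5,
    item_bias=-0.2,
    user_factors=np.array([0.1, -0.3]),
    item_factors=np.array([0.4, 0.2]),
)
print(round(estimate, 2))
```

In a fitted Surprise `SVD` object the corresponding parameters are stored as `bu`, `bi`, `pu`, and `qi`, with the global mean available on the trainset.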

In [None]:
from surprise import SVD, dump
from surprise.model_selection import train_test_split
import os

def create_surprise_dataset(df, user_id, item_id, ratings, reader):
    return Dataset.load_from_df(df[[user_id, item_id, ratings]], reader)

data = create_surprise_dataset(df_ratings_reduced, "Username", "BGGId", "Rating", Reader(rating_scale=(0.0, 10.0)))

# def load_test_model(model_name, algo, force_recreate=False):
#     if not os.path.exists(model_name) or force_recreate:
#         surprise_dataset = create_surprise_dataset(df_ratings_reduced, "Username", "BGGId", "Rating", Reader(rating_scale=(0.0, 10.0))) # First create surprise dataset from an existing Pandas Dataframe. Hardcoded for now
#         trainset, testset = train_test_split(surprise_dataset, test_size=0.2)
#         algo.fit(trainset)
#         #surprise_trainset = surprise_dataset.build_full_trainset() # Then build the trainset
#         #algo.fit(surprise_trainset) # Lastly create the model using said training set
#         
#         dump.dump(model_name, algo=algo)
#         return algo, testset
# 
#     _, algo = dump.load(model_name)
#     return algo
# 
# model, testset = load_test_model("testmodel.pkl", SVD(n_factors=50, random_state=666))

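Prediction functions. The first helper, `top_similar_users`, finds the users whose latent factor vectors are most similar to the target user's.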

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_users(username, model, top_n=10):
    """
    Returns top-N similar users to a given username (raw ID).
    """
    try:
        inner_uid = model.trainset.to_inner_uid(username)
    except ValueError:
        return []

    user_vec = model.pu[inner_uid].reshape(1, -1)
    sims = cosine_similarity(user_vec, model.pu)[0]

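    # Rank users by similarity and drop the target user explicitly instead of
    # assuming they always come first in the sorted order
    order = np.argsort(-sims)
    top_inner = order[order != inner_uid][:top_n]
    return [model.trainset.to_raw_uid(i) for i in top_inner]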
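The neighbour search above can be tried out without a trained model. The following sketch uses a small invented matrix in place of the learned user factors and plain NumPy instead of scikit-learn; `most_similar_rows` is a hypothetical helper name used only here.

```python
import numpy as np

# Invented stand-in for a matrix of latent user factors (one row per user)
user_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

def most_similar_rows(factors, index, top_n=2):
    """Indices of the rows with the highest cosine similarity to row `index`."""
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
    sims = unit @ unit[index]
    order = np.argsort(-sims)
    # Remove the query row itself before taking the top-N
    return order[order != index][:top_n].tolist()

neighbours = most_similar_rows(user_factors, index=0)
print(neighbours)
```

Row 1 points in almost the same direction as row 0 and is ranked first, while row 3 points the opposite way and is ranked last.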


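The next helper, `user_predictions`, predicts ratings for every item the user has not yet rated and can optionally return only the top-N items by estimated rating.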

In [None]:

def user_predictions(username, model, top_n_items=None):
    """
    Predict ratings for all unseen items for a given username.
    Optionally, return only top-N items.
    """
    try:
        inner_uid = model.trainset.to_inner_uid(username)
    except ValueError:
        return []

    # Items already rated by user
    user_items = {iid for (iid, _) in model.trainset.ur[inner_uid]}
    all_items = set(range(model.trainset.n_items))
    unseen_items = all_items - user_items

    # Build test set (user x unseen items)
    testset = [(username, model.trainset.to_raw_iid(i), 0.) for i in unseen_items]

    # Get predictions
    predictions = model.test(testset)

    # Optionally keep top-N items
    if top_n_items:
        predictions = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n_items]

    return predictions


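The last helper, `user_predictions_from_similar`, restricts the candidate items to those rated by the most similar users and then ranks them by predicted rating.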

In [None]:
def user_predictions_from_similar(username, model, top_n_users=1000, top_n_items=10):
    """
    Predict items for a user based on items liked by similar users.
    """
    similar_users = top_similar_users(username, model, top_n=top_n_users)
    if not similar_users:
        # Unknown user (or no neighbours found): nothing to recommend
        return []

    # Collect items rated by similar users
    similar_items = set()
    for su in similar_users:
        inner_uid = model.trainset.to_inner_uid(su)
        similar_items.update([model.trainset.to_raw_iid(iid) for (iid, _) in model.trainset.ur[inner_uid]])

    # Filter out items already rated by target user
    inner_uid_target = model.trainset.to_inner_uid(username)
    user_items = {model.trainset.to_raw_iid(iid) for (iid, _) in model.trainset.ur[inner_uid_target]}
    candidate_items = similar_items - user_items

    # Build test set and predict
    testset = [(username, iid, 0.) for iid in candidate_items]
    predictions = model.test(testset)
    predictions = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n_items]

    return predictions


TODO: add explanation.

## 5. Evaluation

In [None]:
from surprise.model_selection import cross_validate

cross_validate(SVD(n_factors=50, random_state=666), data, measures=["RMSE", "MAE"], cv=3, return_train_measures=True, verbose=True)
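The cross-validation above reports RMSE and MAE. As a reminder of what these two measures mean, here is a minimal sketch that computes both from a handful of invented (true rating, estimated rating) pairs; the helper names `rmse` and `mae` are illustrative and independent of Surprise's own implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Invented (true rating, estimated rating) pairs
pairs = [(8.0, 7.0), (6.0, 6.5), (9.0, 7.0)]

print(round(rmse(pairs), 3))
print(round(mae(pairs), 3))
```

RMSE penalizes the single large miss more heavily than MAE does, which is why it comes out higher on the same pairs.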

## 6. Deployment

In [None]:
def load_surprise_model(model_name, algo, force_recreate=False):
    if not os.path.exists(model_name) or force_recreate:
        surprise_dataset = create_surprise_dataset(df_ratings_reduced, "Username", "BGGId", "Rating", Reader(rating_scale=(0.0, 10.0))) # First create surprise dataset from an existing Pandas Dataframe. Hardcoded for now
        surprise_trainset = surprise_dataset.build_full_trainset() # Then build the trainset
        algo.fit(surprise_trainset) # Lastly create the model using said training set
        
        dump.dump(model_name, algo=algo)
        return algo

    _, algo = dump.load(model_name)
    return algo

model = load_surprise_model("model.pkl", SVD(n_factors=50,random_state=666))
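The function above follows a train-once, load-afterwards pattern. The sketch below shows the same pattern in isolation with the standard library's `pickle`, using a plain dictionary in place of a trained model; `load_or_create` and `expensive_fit` are hypothetical names for illustration.

```python
import os
import pickle
import tempfile

def load_or_create(path, factory, force_recreate=False):
    """Return the cached object at `path`, building and saving it if needed."""
    if force_recreate or not os.path.exists(path):
        obj = factory()
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj
    with open(path, "rb") as f:
        return pickle.load(f)

calls = []

def expensive_fit():
    """Stand-in for model training; records how often it runs."""
    calls.append(1)
    return {"n_factors": 50}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_model.pkl")
    first = load_or_create(cache_path, expensive_fit)   # trains and saves
    second = load_or_create(cache_path, expensive_fit)  # loads from disk

print(len(calls), first == second)
```

One consequence of this design is that a cached file is reused even if the data or hyperparameters have changed, so `force_recreate=True` is needed after such changes.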

TODO: add notes on model creation here.

The next cell uses the model to generate recommendations for the target user.

In [None]:
# TODO: remove unneeded columns from the output
# Example usage
predictions = user_predictions_from_similar(target_user, model)

top10 = sorted(predictions, key=lambda x: x.est, reverse=True)
for pred in top10:
    print(pred.iid, pred.est)
    display(df_games_reduced[df_games_reduced["BGGId"] == pred.iid])
    

TODO: add thoughts and conclusions.