# RAG Study

Build a RAG system that recommends games based on player count and desired game characteristics (e.g., funny, action, FPS, beautiful scenery), using game descriptions and metadata. This notebook serves as a proof of concept (PoC) for a feature in a private project.


## Set Up

In [1]:
!pip install transformers langchain_community chromadb



## Imports

In [2]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModel
from typing import List, Optional

import torch
import chromadb
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Data collection

I'm mocking a dataset of games with descriptions and relevant information (e.g., player count, genres, tags, themes).


In [3]:
import pandas as pd

data = {
    'title': ['Game A', 'Game B', 'Game C', 'Game D', 'Game E'],
    'description': [
        'A funny adventure game with beautiful scenery.',
        'An action-packed FPS game with multiplayer.',
        'A single-player puzzle game.',
        'A strategy game with a fantasy theme.',
        'A co-op action game with funny moments.'
    ],
    'player_count': ['single-player', 'multiplayer', 'single-player', 'single-player', 'multiplayer'],
    'genres': ['adventure', 'action, FPS', 'puzzle', 'strategy', 'action'],
    'tags': ['funny', 'multiplayer', 'single-player', 'fantasy', 'funny, co-op'],
    'themes': ['scenery', 'action', 'puzzle', 'fantasy', 'funny']
}

df_games = pd.DataFrame(data)

display(df_games.head())

|   | title | description | player_count | genres | tags | themes |
|---|---|---|---|---|---|---|
| 0 | Game A | A funny adventure game with beautiful scenery. | single-player | adventure | funny | scenery |
| 1 | Game B | An action-packed FPS game with multiplayer. | multiplayer | action, FPS | multiplayer | action |
| 2 | Game C | A single-player puzzle game. | single-player | puzzle | single-player | puzzle |
| 3 | Game D | A strategy game with a fantasy theme. | single-player | strategy | fantasy | fantasy |
| 4 | Game E | A co-op action game with funny moments. | multiplayer | action | funny, co-op | funny |


## Data preprocessing

Clean and preprocess the game data: handle missing values, standardize text, and extract relevant features.


In [4]:
print("Missing values before handling:")
print(df_games.isnull().sum())

df_games.dropna(inplace=True)

text_columns = ['description', 'genres', 'tags', 'themes']
for col in text_columns:
    df_games[col] = df_games[col].str.lower().str.strip()

print("\nDataFrame after preprocessing:")
display(df_games.head())

Missing values before handling:
title           0
description     0
player_count    0
genres          0
tags            0
themes          0
dtype: int64

DataFrame after preprocessing:


|   | title | description | player_count | genres | tags | themes |
|---|---|---|---|---|---|---|
| 0 | Game A | a funny adventure game with beautiful scenery. | single-player | adventure | funny | scenery |
| 1 | Game B | an action-packed fps game with multiplayer. | multiplayer | action, fps | multiplayer | action |
| 2 | Game C | a single-player puzzle game. | single-player | puzzle | single-player | puzzle |
| 3 | Game D | a strategy game with a fantasy theme. | single-player | strategy | fantasy | fantasy |
| 4 | Game E | a co-op action game with funny moments. | multiplayer | action | funny, co-op | funny |


## Embeddings generation

Generate embeddings for the game descriptions using a language model. I chose a question-answering model because the feature assumes the user will always ask a question.


In [5]:
MODEL = "distilbert-base-cased-distilled-squad"  # alternative: "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

In [6]:
def get_embeddings(texts, tokenizer, model):
    # handle single string input
    if isinstance(texts, str):
        texts = [texts]

    # tokenize the texts
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # compute embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # mean pooling to get the average of all token embeddings
    sentence_embeddings = model_output.last_hidden_state.mean(dim=1)

    return sentence_embeddings.numpy()

# initialize ChromaDB client and collection
def setup_chromadb(collection_name: str = "game_descriptions", persist_directory: str = "./chroma_db"):
    """
    Set up ChromaDB client and create/get collection
    """
    client = chromadb.PersistentClient(path=persist_directory)

    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # use cosine similarity
    )

    return client, collection

def add_documents_to_chroma(collection, df_games, batch_size: int = 100):
    """
    Add game descriptions and embeddings to ChromaDB collection
    """
    descriptions = df_games['description'].tolist()

    # generate embeddings for all descriptions
    print("Generating embeddings...")
    description_embeddings = get_embeddings(descriptions, tokenizer, model)

    # prepare data for ChromaDB
    documents = descriptions
    ids = [f"game_{i}" for i in range(len(descriptions))]
    embeddings = description_embeddings.tolist()

    # create metadata (including other columns from dataframe)
    metadatas = []
    for idx, row in df_games.iterrows():
        metadata = {}
        for col in df_games.columns:
            if col not in ['description', 'description_embeddings']:
                value = row[col]
                # handle different data types
                if isinstance(value, (list, np.ndarray)):
                    # skip array-like columns or convert to string representation
                    continue
                elif pd.isna(value):
                    metadata[col] = ""
                else:
                    metadata[col] = str(value)
        metadatas.append(metadata)

    # add documents in batches
    print(f"Adding {len(documents)} documents to ChromaDB...")
    for i in range(0, len(documents), batch_size):
        end_idx = min(i + batch_size, len(documents))
        collection.add(
            documents=documents[i:end_idx],
            embeddings=embeddings[i:end_idx],
            metadatas=metadatas[i:end_idx],
            ids=ids[i:end_idx]
        )

    print(f"Successfully added {len(documents)} documents to ChromaDB!")
    return collection
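
One thing to watch in `get_embeddings`: with `padding=True`, shorter descriptions in a batch are padded to the length of the longest one, and `last_hidden_state.mean(dim=1)` averages over those padding positions too. A common alternative is to weight the mean by the attention mask so that only real tokens contribute. The sketch below illustrates the idea on plain Python lists for a single sequence; `masked_mean_pool` is a hypothetical helper, not part of this notebook or of `transformers`.

```python
from typing import List, Sequence


def masked_mean_pool(token_vectors: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average only the token vectors whose attention-mask entry is 1."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")

    # keep real tokens, drop padding positions (mask == 0)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")

    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# toy example: two real tokens followed by one padding position
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

# plain mean includes the padding vector and is pulled far away from the real tokens
plain_mean = [sum(col) / len(tokens) for col in zip(*tokens)]
masked_mean = masked_mean_pool(tokens, mask)

print(plain_mean)
print(masked_mean)  # [2.0, 4.0]
```

With tensors, the same idea would typically be expressed by multiplying `last_hidden_state` by the attention mask (expanded to the hidden dimension), summing over the sequence dimension, and dividing by the number of unmasked tokens.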


## Search and display results

In [None]:
def search_similar_games(collection, query: str, n_results: int = 5):
    """
    Search for similar games based on description
    """
    # generate embedding for the query
    query_embedding = get_embeddings([query], tokenizer, model)[0]

    # search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    return results

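def text_search(collection, query: str, n_results: int = 5):
    """
    Search using a text query directly. ChromaDB embeds the query with the
    collection's own embedding function, so results are only meaningful if
    that function matches the model used for the stored embeddings (the
    collection above is created without one, so ChromaDB uses its default).
    """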
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    return results

def display_search_results(results, query: str):
    """
    Display search results in a readable format
    """
    print(f"\nSearch results for: '{query}'\n" + "="*50)

    documents = results['documents'][0]
    metadatas = results['metadatas'][0]
    distances = results['distances'][0]

    for i, (doc, metadata, distance) in enumerate(zip(documents, metadatas, distances)):
        print(f"\nResult {i+1}:")
        print(f"Similarity Score: {1 - distance:.4f}")  # convert distance to similarity

        for key, value in metadata.items():
            if value and value.strip():  # only show non-empty values
                print(f"{key.title()}: {value}")

        print(f"Description: {doc[:200]}..." if len(doc) > 200 else f"Description: {doc}")
        print("-" * 40)
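
The `1 - distance` conversion in `display_search_results` relies on the collection being configured with `"hnsw:space": "cosine"`, in which case the reported distance is one minus the cosine similarity. A dependency-free sketch of that relationship follows; `cosine_similarity` and `cosine_distance` are illustrative helpers, not ChromaDB functions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# toy vectors standing in for a query embedding and a document embedding
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]

distance = cosine_distance(query_vec, doc_vec)
similarity = 1.0 - distance  # same conversion as in display_search_results

print(round(similarity, 4))  # 0.7071
```

If the collection were created with a different distance space, `1 - distance` would no longer be a cosine similarity and the label "Similarity Score" would need a different conversion.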

In [7]:
client, collection = setup_chromadb()

collection = add_documents_to_chroma(collection, df_games)

Generating embeddings...
Adding 5 documents to ChromaDB...
Successfully added 5 documents to ChromaDB!
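
Because `setup_chromadb` uses a `PersistentClient`, the collection survives between runs, and the IDs `game_0`, `game_1`, and so on are the same every time the ingest cell executes. One option for keeping re-runs idempotent is `collection.upsert`, which updates records whose IDs already exist and inserts the rest. A minimal sketch, assuming the same batching approach as `add_documents_to_chroma`; `upsert_in_batches` is a hypothetical helper, and `_RecordingCollection` is a stand-in used only so the example runs without a database.

```python
from typing import Any, Dict, List


def upsert_in_batches(collection, ids: List[str], documents: List[str],
                      embeddings: List[List[float]],
                      metadatas: List[Dict[str, Any]],
                      batch_size: int = 100) -> int:
    """Upsert records batch by batch; returns the number of batches sent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.upsert(
            ids=ids[start:end],
            documents=documents[start:end],
            embeddings=embeddings[start:end],
            metadatas=metadatas[start:end],
        )
        n_batches += 1
    return n_batches


class _RecordingCollection:
    """Stand-in for a ChromaDB collection (assumption): stores records keyed by ID."""

    def __init__(self) -> None:
        self.records: Dict[str, Any] = {}

    def upsert(self, ids, documents, embeddings, metadatas) -> None:
        for id_, doc, emb, meta in zip(ids, documents, embeddings, metadatas):
            self.records[id_] = (doc, emb, meta)


# toy data shaped like the notebook's five games
fake = _RecordingCollection()
ids = [f"game_{i}" for i in range(5)]
docs = [f"description {i}" for i in range(5)]
embs = [[float(i), 0.0] for i in range(5)]
metas = [{"title": f"Game {i}"} for i in range(5)]

# running the ingest twice leaves five records, not ten
first = upsert_in_batches(fake, ids, docs, embs, metas, batch_size=2)
second = upsert_in_batches(fake, ids, docs, embs, metas, batch_size=2)

print(first, second, len(fake.records))  # 3 3 5
```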


## Usage

In [10]:
query = "What is the best single player fantasy with magic?"

results = search_similar_games(collection, query, n_results=3)
display_search_results(results, query)


Search results for: 'What is the best single player fantasy with magic?'

Result 1:
Similarity Score: 0.9166
Genres: strategy
Tags: fantasy
Player_Count: single-player
Title: Game D
Themes: fantasy
Description: a strategy game with a fantasy theme.
----------------------------------------

Result 2:
Similarity Score: 0.9150
Title: Game C
Player_Count: single-player
Tags: single-player
Genres: puzzle
Themes: puzzle
Description: a single-player puzzle game.
----------------------------------------

Result 3:
Similarity Score: 0.9141
Tags: funny
Player_Count: single-player
Genres: adventure
Title: Game A
Themes: scenery
Description: a funny adventure game with beautiful scenery.
----------------------------------------
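
The three scores above differ by less than 0.003, so similarity alone barely separates the candidates. Since player count is stored as metadata, one option is to pass it as a hard constraint through the `where` argument of `collection.query`. A minimal sketch, assuming the metadata keys created by `add_documents_to_chroma`; `search_with_player_filter` is a hypothetical helper, and `_StubCollection` exists only so the example runs standalone.

```python
from typing import Any, Dict, List


def search_with_player_filter(collection, query_embedding: List[float],
                              player_count: str,
                              n_results: int = 5) -> Dict[str, Any]:
    """Vector search restricted to games whose player_count metadata matches."""
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where={"player_count": player_count},  # exact-match metadata filter
        include=["documents", "metadatas", "distances"],
    )


class _StubCollection:
    """Stand-in for a ChromaDB collection (assumption): records the query arguments."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    def query(self, **kwargs: Any) -> Dict[str, Any]:
        self.last_kwargs = kwargs
        return {"documents": [[]], "metadatas": [[]], "distances": [[]]}


stub = _StubCollection()
filtered = search_with_player_filter(stub, [0.1, 0.2, 0.3], "single-player", n_results=3)

print(stub.last_kwargs["where"])  # {'player_count': 'single-player'}
```

In the notebook, the embedding argument would come from `get_embeddings(query, tokenizer, model)[0].tolist()` and the first argument would be the real `collection`.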
