# RAG Study

Build a RAG system to recommend games based on player count and desired game characteristics (e.g., funny, action, FPS, beautiful scenery), using game descriptions and related metadata. This notebook is a PoC for a feature in a private project.


## Set Up

In [1]:
!pip install transformers langchain_community chromadb igdb-api-v4


## Imports

In [2]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModel
from typing import List, Optional

import torch
import chromadb
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Data collection

I'm mocking up a dataset of games with descriptions and relevant information (e.g., player count, genres, tags, themes), pulled from the IGDB API.


In [24]:
from google.colab import userdata
from requests import post

CLIENT_ID = userdata.get('TWITCH_CLIENT_ID')
TWITCH_SECRET = userdata.get('TWITCH_SECRET')

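# request an app access token (client credentials flow)
token_response = post(
    'https://id.twitch.tv/oauth2/token',
    params={
        'client_id': CLIENT_ID,
        'client_secret': TWITCH_SECRET,
        'grant_type': 'client_credentials',
    },
)
token_response.raise_for_status()  # fail early on bad credentials

access_token = 'Bearer ' + token_response.json()['access_token']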

In [32]:
QUERY = "fields id, name, game_type, game_modes, genres, keywords, platforms, ports, total_rating, total_rating_count, storyline, summary, tags, themes; limit 100;"

In [33]:
# games
response = post(
    'https://api.igdb.com/v4/games',
    headers={'Client-ID': CLIENT_ID, 'Authorization': access_token},
    data=QUERY,
)

result = response.json()

In [36]:
df_games = pd.DataFrame(result)
df_games.head()

Unnamed: 0,id,game_modes,genres,name,platforms,summary,themes,game_type,tags,keywords,total_rating,total_rating_count,storyline,ports
0,330684,"[1, 2, 4]","[10, 33]",Nightmare Kart: The Old Karts,[6],An upcoming free expansion to Nightmare Kart w...,[1],14,,,,,,
1,177310,[1],"[13, 32]",The Undying Beast,[6],"There was a flash of light, a choir of glass, ...",[19],0,"[19, 268435469, 268435488]",,,,,
2,357276,[1],"[9, 12, 32]",Hello Anxiety,[6],Hello Anxiety is a pixel art narrative puzzle ...,,0,"[268435465, 268435468, 268435488, 536872617, 5...","[1705, 1721, 1780, 1912, 3749, 4004, 44629]",,,,
3,350392,[2],[5],Rival Species,[6],In the far distant future of the 40th millenni...,"[1, 18]",5,,"[546, 1035, 2004, 45434]",,,,
4,63844,"[1, 2, 4]",[14],Ace wo Nerae!,[58],A tennis game for the Super Famicom based on t...,[1],0,"[1, 268435470, 536870990, 536871223, 536875053...","[78, 311, 4141, 25265, 29745, 48291]",52.904629,5.0,,


In [54]:
# game_modes
QUERY = "fields checksum, name; limit 100;"
response = post(
    'https://api.igdb.com/v4/game_modes',
    headers={'Client-ID': CLIENT_ID, 'Authorization': access_token},
    data=QUERY,
)

result = response.json()

df_game_mode = pd.DataFrame(result)
df_game_mode = df_game_mode.rename(columns={'name': 'game_mode', 'id' : 'game_mode_id'})
df_game_mode.head(2)

Unnamed: 0,game_mode_id,game_mode,checksum
0,1,Single player,1cc07088-c5fb-3cb2-9e68-af6620c18836
1,2,Multiplayer,3ffef62b-e19f-6bab-d510-98385c06d902


In [55]:
# genres
QUERY = "fields checksum, name; limit 100;"
response = post(
    'https://api.igdb.com/v4/genres',
    headers={'Client-ID': CLIENT_ID, 'Authorization': access_token},
    data=QUERY,
)

result = response.json()

df_genres = pd.DataFrame(result)
df_genres = df_genres.rename(columns={'name': 'genre', 'id' : 'genre_id'})
df_genres.head(2)

Unnamed: 0,genre_id,genre,checksum
0,2,Point-and-click,b295f28a-5f68-fc3e-5de2-f3195e10d160
1,4,Fighting,d23da988-5bb7-011e-34dc-0b712765e470


In [56]:
# themes
QUERY = "fields checksum, name; limit 100;"
response = post(
    'https://api.igdb.com/v4/themes',
    headers={'Client-ID': CLIENT_ID, 'Authorization': access_token},
    data=QUERY,
)

result = response.json()

df_themes = pd.DataFrame(result)
df_themes = df_themes.rename(columns={'name': 'theme', 'id' : 'theme_id'})
df_themes.head(2)

Unnamed: 0,theme_id,theme,checksum
0,31,Drama,a10308c4-a660-7016-742b-ec956e9e9675
1,32,Non-fiction,64d7a553-bbf8-1bbb-fc3c-8b1f7e404f5e
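
The games, game modes, genres, and themes requests above all share the same shape: a POST to `https://api.igdb.com/v4/<endpoint>` with the same two headers and the query string as the body. A minimal sketch of a helper that factors this out follows. The names `fetch_igdb` and `post_fn` are hypothetical (they are not part of the IGDB or `requests` APIs); `post_fn` is assumed to behave like `requests.post`, and passing it in lets the helper be exercised without network access.

```python
from typing import Any, Callable

IGDB_BASE_URL = "https://api.igdb.com/v4"  # same base URL as the cells above


def fetch_igdb(endpoint: str, query: str, client_id: str, access_token: str,
               post_fn: Callable[..., Any]) -> Any:
    """POST an IGDB query to `endpoint` and return the decoded JSON body.

    `post_fn` is assumed to have the same call shape as `requests.post`.
    """
    response = post_fn(
        f"{IGDB_BASE_URL}/{endpoint}",
        headers={"Client-ID": client_id, "Authorization": access_token},
        data=query,
    )
    return response.json()
```

In this notebook it could be called as `pd.DataFrame(fetch_igdb('genres', 'fields checksum, name; limit 100;', CLIENT_ID, access_token, post))`.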


In [69]:
df = df_games.explode('game_modes').merge(df_game_mode, left_on='game_modes', right_on='game_mode_id', how='left')
df = df.explode('genres').merge(df_genres, left_on='genres', right_on='genre_id', how='left')
df = df.explode('themes').merge(df_themes, left_on='themes', right_on='theme_id', how='left')
df.head(2)

Unnamed: 0,id,game_modes,genres,name,platforms,summary,themes,game_type,tags,keywords,...,ports,game_mode_id,game_mode,checksum_x,genre_id,genre,checksum_y,theme_id,theme,checksum
0,330684,1,10,nightmare kart: the old karts,[6],An upcoming free expansion to Nightmare Kart w...,1,14,,,...,,1.0,Single player,1cc07088-c5fb-3cb2-9e68-af6620c18836,10.0,Racing,41227287-0a1a-0f14-90f2-05655314e8b4,1.0,Action,cee4e3c1-6b2d-6dcc-a707-e00ca4de6ecc
1,330684,1,33,nightmare kart: the old karts,[6],An upcoming free expansion to Nightmare Kart w...,1,14,,,...,,1.0,Single player,1cc07088-c5fb-3cb2-9e68-af6620c18836,33.0,Arcade,cd4431bf-5482-b058-a863-7eb596a438dd,1.0,Action,cee4e3c1-6b2d-6dcc-a707-e00ca4de6ecc
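
Chaining `explode` and a left `merge` for each list column produces one row per combination of game mode, genre, and theme, which is why the same game appears several times above. The toy example below illustrates the row multiplication with two list columns; the frame contents are made up for illustration and are not taken from IGDB.

```python
import pandas as pd

# Toy stand-ins for df_games and the lookup tables (values are illustrative).
games = pd.DataFrame({
    "name": ["Toy Kart"],
    "game_modes": [[1, 2]],
    "genres": [[10, 33]],
})
modes = pd.DataFrame({"game_mode_id": [1, 2],
                      "game_mode": ["Single player", "Multiplayer"]})
genres = pd.DataFrame({"genre_id": [10, 33],
                       "genre": ["Racing", "Arcade"]})

# Each explode turns one list element into one row, so chaining two explodes
# yields len(game_modes) * len(genres) rows for the same game.
flat = games.explode("game_modes").merge(
    modes, left_on="game_modes", right_on="game_mode_id", how="left"
)
flat = flat.explode("genres").merge(
    genres, left_on="genres", right_on="genre_id", how="left"
)

print(len(flat))  # 2 modes x 2 genres
print(flat[["name", "game_mode", "genre"]])
```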


## Data preprocessing

Clean and preprocess the game data, including handling missing values, standardizing text, and extracting relevant features.


In [70]:
df = df[['name', 'summary', 'game_mode', 'genre', 'theme', 'storyline']].copy()
df.columns = ['name', 'description', 'game_mode', 'genre', 'theme', 'storyline']
df.fillna('Unkown',inplace=True)
df.head(2)

Unnamed: 0,name,description,game_mode,genre,theme,storyline
0,nightmare kart: the old karts,An upcoming free expansion to Nightmare Kart w...,Single player,Racing,Action,Unkown
1,nightmare kart: the old karts,An upcoming free expansion to Nightmare Kart w...,Single player,Arcade,Action,Unkown


In [71]:
df['genre'] = df.groupby('name')['genre'].transform(lambda x: ', '.join(x))
df['theme'] = df.groupby('name')['theme'].transform(lambda x: ', '.join(x))
df['game_mode'] = df.groupby('name')['game_mode'].transform(lambda x: ', '.join(x))
df = df.drop_duplicates(subset=['name'])

In [72]:
print("Missing values before handling:")
print(df.isnull().sum())

text_columns = ['name', 'description', 'game_mode', 'genre', 'theme', 'storyline']
for col in text_columns:
    df[col] = df[col].str.lower().str.strip()

print("\nDataFrame after preprocessing:")
display(df.head())

Missing values before handling:
name           0
description    0
game_mode      0
genre          0
theme          0
storyline      0
dtype: int64

DataFrame after preprocessing:


Unnamed: 0,name,description,game_mode,genre,theme,storyline
0,nightmare kart: the old karts,an upcoming free expansion to nightmare kart w...,"single player, single player, multiplayer, mul...","racing, arcade, racing, arcade, racing, arcade","action, action, action, action, action, action",unkown
6,the undying beast,"there was a flash of light, a choir of glass, ...","single player, single player","simulator, indie","horror, horror",unkown
8,hello anxiety,hello anxiety is a pixel art narrative puzzle ...,"single player, single player, single player","puzzle, role-playing (rpg), indie","unkown, unkown, unkown",unkown
11,rival species,in the far distant future of the 40th millenni...,"multiplayer, multiplayer","shooter, shooter","action, science fiction",unkown
13,ace wo nerae!,a tennis game for the super famicom based on t...,"single player, multiplayer, split screen","sport, sport, sport","action, action, action",unkown
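
The repeated values in the output above (e.g. "racing, arcade, racing, arcade") come from joining every exploded row of a game without dropping repeats. One possible way to keep each label once is sketched below on a toy frame; `join_unique` is a hypothetical helper, and the data is illustrative only.

```python
import pandas as pd

# Toy frame shaped like the exploded/merged `df` (values are illustrative).
toy = pd.DataFrame({
    "name": ["toy kart"] * 4,
    "game_mode": ["single player", "single player", "multiplayer", "multiplayer"],
    "genre": ["racing", "arcade", "racing", "arcade"],
})


def join_unique(values: pd.Series) -> str:
    """Join values with ', ' after dropping repeats, keeping first-seen order."""
    return ", ".join(dict.fromkeys(values))


# One row per game, with each label listed once.
collapsed = toy.groupby("name", as_index=False, sort=False).agg(
    game_mode=("game_mode", join_unique),
    genre=("genre", join_unique),
)
print(collapsed)
```

Deduplicated strings also keep the metadata shorter and easier to read in the search results.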


## Embeddings generation

Generate embeddings for the game descriptions using a suitable language model. I chose a question-answering model, since the feature assumes the user will always ask a question.


In [73]:
MODEL = "distilbert-base-cased-distilled-squad"  # alternative: "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)


In [74]:
def get_embeddings(texts, tokenizer, model):
    # handle single string input
    if isinstance(texts, str):
        texts = [texts]

    # tokenize the texts
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # compute embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # mean pooling to get the average of all token embeddings
    sentence_embeddings = model_output.last_hidden_state.mean(dim=1)

    return sentence_embeddings.numpy()

# initialize ChromaDB client and collection
def setup_chromadb(collection_name: str = "game_descriptions", persist_directory: str = "./chroma_db"):
    """
    Set up ChromaDB client and create/get collection
    """
    client = chromadb.PersistentClient(path=persist_directory)

    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # use cosine similarity
    )

    return client, collection

def add_documents_to_chroma(collection, df_games, batch_size: int = 100):
    """
    Add game descriptions and embeddings to ChromaDB collection
    """
    descriptions = df_games['description'].tolist()

    # generate embeddings for all descriptions
    print("Generating embeddings...")
    description_embeddings = get_embeddings(descriptions, tokenizer, model)

    # prepare data for ChromaDB
    documents = descriptions
    ids = [f"game_{i}" for i in range(len(descriptions))]
    embeddings = description_embeddings.tolist()

    # create metadata (including other columns from dataframe)
    metadatas = []
    for idx, row in df_games.iterrows():
        metadata = {}
        for col in df_games.columns:
            if col not in ['description', 'description_embeddings']:
                value = row[col]
                # handle different data types
                if isinstance(value, (list, np.ndarray)):
                    # skip array-like columns or convert to string representation
                    continue
                elif pd.isna(value):
                    metadata[col] = ""
                else:
                    metadata[col] = str(value)
        metadatas.append(metadata)

    # add documents in batches
    print(f"Adding {len(documents)} documents to ChromaDB...")
    for i in range(0, len(documents), batch_size):
        end_idx = min(i + batch_size, len(documents))
        collection.add(
            documents=documents[i:end_idx],
            embeddings=embeddings[i:end_idx],
            metadatas=metadatas[i:end_idx],
            ids=ids[i:end_idx]
        )

    print(f"Successfully added {len(documents)} documents to ChromaDB!")
    return collection
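
`get_embeddings` averages `last_hidden_state` over every token position, which with `padding=True` also includes the padding tokens of the shorter texts in a batch. A common alternative is to weight the average by the attention mask. The sketch below shows the idea in NumPy so it runs without a model; `masked_mean_pool` is a hypothetical helper, and the arrays stand in for `model_output.last_hidden_state` and `encoded_input['attention_mask']`.

```python
import numpy as np


def masked_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # sum over real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts


# Two sequences of length 3; the second one has a padded last position.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                   [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)  # unmasked mean, for comparison
print(pooled)
print(plain)
```

The same arithmetic can be written with torch tensors inside `get_embeddings`; whether it improves retrieval for this dataset would need to be checked.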


## Search and display results

In [75]:
def search_similar_games(collection, query: str, n_results: int = 5):
    """
    Search for similar games based on description
    """
    # generate embedding for the query
    query_embedding = get_embeddings([query], tokenizer, model)[0]

    # search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    return results

def text_search(collection, query: str, n_results: int = 5):
    """
    Search using the text query directly (ChromaDB embeds it with the collection's
    embedding function, which must match the model used for the stored embeddings)
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    return results

def display_search_results(results, query: str):
    """
    Display search results in a readable format
    """
    print(f"\nSearch results for: '{query}'\n" + "="*50)

    documents = results['documents'][0]
    metadatas = results['metadatas'][0]
    distances = results['distances'][0]

    for i, (doc, metadata, distance) in enumerate(zip(documents, metadatas, distances)):
        print(f"\nResult {i+1}:")
        print(f"Similarity Score: {1 - distance:.4f}")  # convert distance to similarity

        for key, value in metadata.items():
            if value and value.strip():  # only show non-empty values!!
                print(f"{key.title()}: {value}")

        print(f"Description: {doc[:200]}..." if len(doc) > 200 else f"Description: {doc}")
        print("-" * 40)
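
`collection.query` returns lists nested per query, which is why `display_search_results` indexes with `[0]`, and because the collection uses cosine space, `1 - distance` gives the cosine similarity. A minimal sketch of flattening one query's results into plain dictionaries is shown below; `flatten_results` is a hypothetical helper, and the sample dictionary is hand-written in the shape of a query result rather than taken from a real run.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn one query's nested result lists into a list of per-game dicts."""
    rows = []
    for doc, meta, dist in zip(results["documents"][query_index],
                               results["metadatas"][query_index],
                               results["distances"][query_index]):
        rows.append({
            **meta,
            "description": doc,
            "similarity": 1.0 - dist,  # cosine distance -> cosine similarity
        })
    # Smaller distance means more similar, so sort by similarity descending.
    return sorted(rows, key=lambda r: r["similarity"], reverse=True)


# Hand-written sample in the shape of a query result (values are illustrative).
sample = {
    "documents": [["a kart racer", "a tennis game"]],
    "metadatas": [[{"name": "toy kart"}, {"name": "toy tennis"}]],
    "distances": [[0.25, 0.5]],
}

rows = flatten_results(sample)
for row in rows:
    print(row["name"], round(row["similarity"], 4))
```

A list of dictionaries like this can be passed straight to `pd.DataFrame` for inspection.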

In [77]:
client, collection = setup_chromadb()

collection = add_documents_to_chroma(collection, df)

Generating embeddings...
Adding 100 documents to ChromaDB...
Successfully added 100 documents to ChromaDB!


## Usage

In [82]:
query_embedding = get_embeddings(["Return all"], tokenizer, model)[0]

# search in ChromaDB
collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    include=['documents', 'metadatas', 'distances'],
    where={"game_mode": "multiplayer"}
)

{'ids': [['game_56']],
 'embeddings': None,
 'documents': [['the pyro update was the second major content update for team fortress 2.\n\nthe pyro class was the focus of this update. in addition to three new weapons, the latent hadouken taunt was upgraded to become the first kill taunt. the update also added compression blast to the stock flamethrower, an ability that would become a core technique for the class.\n\ntwo community-made maps, fastlane and turbine, were made official and included in the update, along with a link to the valve developer community site. meet the sniper was also introduced in a similar manner to meet the scout. additionally, many improvements and fixes were included.']],
 'uris': None,
 'included': ['documents', 'metadatas', 'distances'],
 'data': None,
 'metadatas': [[{'theme': 'action',
    'game_mode': 'multiplayer',
    'genre': 'shooter',
    'name': 'team fortress 2: the pyro update',
    'storyline': 'unkown'}]],
 'distances': [[0.886254608631134]]}

TODO: Learn how to filter with `where` contains
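
The query above returned a single game because `where={"game_mode": "multiplayer"}` is an exact match, so games stored as "single player, multiplayer" are skipped. As far as I can tell, `$contains` in Chroma applies to `where_document` (the document text) rather than to metadata values, so one possible workaround is to store one boolean metadata key per mode at insert time and filter on equality. The sketch below assumes Chroma accepts boolean metadata values; `mode_flags` and the `mode_` key prefix are hypothetical.

```python
from typing import Dict


def mode_flags(game_mode: str, prefix: str = "mode_") -> Dict[str, bool]:
    """Split a comma-joined game_mode string into one boolean flag per mode."""
    flags = {}
    for part in game_mode.split(","):
        label = part.strip().lower()
        if label:
            flags[prefix + label.replace(" ", "_")] = True
    return flags


# Illustrative metadata entry, shaped like the ones built in add_documents_to_chroma.
metadata = {"name": "toy kart", "game_mode": "single player, multiplayer"}
metadata.update(mode_flags(metadata["game_mode"]))
print(metadata)

# With flags stored like this, the filter becomes an equality check:
where = {"mode_multiplayer": True}
```

The `where` dictionary would then be passed to `collection.query` in place of the exact string match.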

In [78]:
results = search_similar_games(collection,
                               "What is the best single player fantasy with magic?",
                               n_results=3)
display_search_results(results, "What is the best single player fantasy with magic?")


Search results for: 'What is the best single player fantasy with magic?'

Result 1:
Similarity Score: 0.8446
Genre: shooter, shooter, shooter, shooter
Theme: action, comedy, action, comedy
Storyline: unkown
Game_Mode: single player, single player, multiplayer, multiplayer
Name: faceball 2000
Description: welcome to the exciting new world of faceball 2000, where 3d graphics, first person perspective and 360° maneuverability make you feel like you're inside your video game! what you see is where you are...
----------------------------------------

Result 2:
Similarity Score: 0.8345
Game_Mode: unkown, unkown, unkown, unkown
Storyline: unkown
Theme: action, romance, action, romance
Name: moonlight destiny
Genre: role-playing (rpg), role-playing (rpg), adventure, adventure
Description: a chinese developed role-playing game that was later localized into japanese by nihon falcom under the name moonlight destiny.

yue ying chuan shuo jian xia qing yuan wai zhuan is a role-playing game....
---