In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"
random_state = 281997

BGG does not directly provide a way to list all the games in its archive, so we used a dump created by the community (2024-08-18).

# Dataset Generation
Our dataset is a corpus of reviews scraped from the BGG API. <br />
To download the comments we use the code in ```bgg_corpus_service.py```.

## Subsample the data
We should limit the number of reviews, but to how many? Let's look at some case studies:

- Amazon Product Reviews: size varies by category, but subsets of 5,000 to 20,000 reviews are common.
- Yelp Dataset: typically, 8,000 to 15,000 reviews are used in research for unsupervised aspect extraction.
- TripAdvisor Reviews: around 5,000 to 10,000 reviews in unsupervised experiments.

For unsupervised learning, 5,000–10,000 reviews is a reasonable starting point for recognizing 6 aspects. More reviews may improve diversity and robustness but come with increased computational costs.




We decided to subsample the corpus at several sizes; the cells below generate the 8k, 64k and 256k samples. <br>
We train on each of them to see how the model behaves with more data (and whether we are actually underfitting).

In [3]:
# File of our corpus:
corpus_file = "../data/corpus.csv"

### Random sampling
We randomly take the same number of reviews for every game, without applying any other criteria. <br>

In [3]:
from core.dataset_sampler import BggDatasetRandomBalancedSampler

# Sizes match the logged output and the file names used in the following cells
target_sizes = [8000, 64000, 256000]
for size in target_sizes:
    sampler = BggDatasetRandomBalancedSampler(size, output_dir="../data/sampled-dataset", random_state=random_state)
    sampler.make_sample_of_data(corpus_file, f"corpus.sampled.{size // 1000}k.csv")

I have a total of 2220 games with reviews. We want to be ~8000 reviews so we take 4 reviews per game.
Dataset with a total of 8880 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.sampled.8k.csv

I have a total of 2220 games with reviews. We want to be ~64000 reviews so we take 29 reviews per game.
Dataset with a total of 64380 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.sampled.64k.csv

I have a total of 2220 games with reviews. We want to be ~256000 reviews so we take 116 reviews per game.
Dataset with a total of 257520 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.sampled.256k.csv
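
The row counts logged above are consistent with rounding the per-game quota up: 2220 games with 4, 29 and 116 reviews each give 8880, 64380 and 257520 rows. The following is a minimal sketch of that arithmetic and of a balanced draw, using only the standard library. `reviews_per_game` and `balanced_sample` are hypothetical helpers written for illustration; they are not the actual `BggDatasetRandomBalancedSampler` implementation, and the `(game_id, text)` row layout is an assumption.

```python
import math
import random
from collections import defaultdict


def reviews_per_game(target_size: int, n_games: int) -> int:
    """Per-game quota, rounded up so the sample reaches at least target_size."""
    return math.ceil(target_size / n_games)


def balanced_sample(rows, target_size, seed=281997):
    """Draw the same number of reviews for every game.

    rows: iterable of (game_id, review_text) pairs (assumed layout).
    Games with fewer reviews than the quota contribute all they have.
    """
    by_game = defaultdict(list)
    for game_id, text in rows:
        by_game[game_id].append(text)

    quota = reviews_per_game(target_size, len(by_game))
    rng = random.Random(seed)
    sample = []
    for game_id, texts in by_game.items():
        picked = rng.sample(texts, min(quota, len(texts)))
        sample.extend((game_id, text) for text in picked)
    return sample


# Quotas for the three target sizes with 2220 games
print([reviews_per_game(size, 2220) for size in (8000, 64000, 256000)])
# [4, 29, 116]
```

Rounding up explains why each generated file is slightly larger than its nominal size.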



In [2]:
generated_corpora = dict(
    k8="../data/sampled-dataset/corpus.sampled.8k.csv",
    k64="../data/sampled-dataset/corpus.sampled.64k.csv",
    k256="../data/sampled-dataset/corpus.sampled.256k.csv"
)

### Longest reviews sampling

In [5]:
from core.dataset_sampler import BggDatasetLongestSampler

# Sizes match the logged output and the file names used in the following cells
target_sizes = [8000, 64000, 256000]
for size in target_sizes:
    sampler = BggDatasetLongestSampler(size, output_dir="../data/sampled-dataset", random_state=random_state)
    sampler.make_sample_of_data(corpus_file, f"corpus.longest-sampled.{size // 1000}k.csv")

Dataset with a total of 8000 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.longest-sampled.8k.csv

Dataset with a total of 64000 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.longest-sampled.64k.csv

Dataset with a total of 256000 rows has been generated.
Storing the dataset under: ../data/sampled-dataset/corpus.longest-sampled.256k.csv



In [3]:
generated_corpora["k8_longest"] = "../data/sampled-dataset/corpus.longest-sampled.8k.csv"
generated_corpora["k64_longest"] = "../data/sampled-dataset/corpus.longest-sampled.64k.csv"
generated_corpora["k256_longest"] = "../data/sampled-dataset/corpus.longest-sampled.256k.csv"

Check the distribution of reviews per game:

In [12]:
import pandas as pd

# todo plot?
data = pd.read_csv("../data/sampled-dataset/corpus.longest-sampled.256k.csv")
print(f"File contains a total of {len(data)} records")
data["game_id"].value_counts()

File contains a total of 256000 records


game_id
286096    442
72125     392
21050     392
175914    384
25613     384
         ... 
354544      6
326934      5
326933      5
177802      4
198487      4
Name: count, Length: 2220, dtype: int64

In [5]:
import pandas as pd

# todo plot?
pd.read_csv("../data/sampled-dataset/corpus.sampled.256k.csv")["game_id"].value_counts()

game_id
1         116
242722    116
242529    116
242574    116
242639    116
         ... 
127518    116
127398    116
127060    116
127024    116
414317    116
Name: count, Length: 2220, dtype: int64

In [14]:
from spacy import displacy
import spacy

nlp = spacy.load("en_core_web_md")

In [18]:
c = nlp("root is a fantastic boardgame. It's almost as good as settlers! want to play some risk?")
displacy.render(c)

## Special Scenario: Kickstarter
Many reviews on BGG reference the Kickstarter campaigns of the games. <br>
Most of these reviews are not informative and are not useful for training the model. <br>

For reviews containing ```Kickstarter``` we apply the following heuristic:
- If the review is short (<15 words) we remove it.
- If it is longer we keep it.
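
The heuristic above can be sketched as a small filter. This is an illustration only: `filter_kickstarter` is a hypothetical function, not the actual `KickstarterRemovalRule`, and counting words by splitting on whitespace is an assumption about how "words" are measured.

```python
from typing import Optional

MIN_WORDS = 15  # threshold from the heuristic described above


def filter_kickstarter(review: str, min_words: int = MIN_WORDS) -> Optional[str]:
    """Return None for short reviews mentioning Kickstarter, else the review."""
    mentions_kickstarter = "kickstarter" in review.lower()
    if mentions_kickstarter and len(review.split()) < min_words:
        return None
    return review


print(filter_kickstarter("Kickstarter edition, not played yet."))  # None
print(filter_kickstarter("Great worker placement game."))  # returned unchanged
```

Returning `None` for a dropped review mirrors how the other rules in this notebook signal a discarded record.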

In [12]:
# Removal of 'Kickstarter' reviews
from pre_processing import PreProcessingService, KickstarterRemovalRule

ps = PreProcessingService.kickstarter_filter_pipeline()

In [13]:
test = "Kickstarter is a great platform to launch games. The longer my review is the more likely we are going to keep it. Extraordinary."
ps.pre_process(test)

'kickstarter great platform launch game long review likely go extraordinary'

We have a special pipeline for the Kickstarter removal. <br>
With it we generate a separate dataset, so that we can compare and see whether the quality of the data improves.

In [4]:
ps.pre_process_corpus("../data/corpus.sampled.csv", "../data/corpus.preprocessed.kickstarter_removed.csv")

Pandas Apply:   0%|          | 0/64380 [00:00<?, ?it/s]

We also create a preprocessed corpus file starting from the 256k sample:

In [14]:
# todo refactor
ps.pre_process_corpus(generated_corpora['k256'], "../data/corpus.preprocessed.kickstarter_removed.256k.csv")

Pandas Apply:   0%|          | 0/257520 [00:00<?, ?it/s]

# Preprocessing
The information downloaded from the BGG API might be uninformative, faulty or bloated with useless content. <br>
To deal with this we apply some pre-processing steps that filter out what we don't need, which may be entire records or parts of the text inside a record.

During this process we also tokenize and lemmatize the text using ```spacy```.


In [None]:
import warnings

# Some parts of torch that are used by Spacy are deprecated, we can ignore them 
# (The new 3.8 Spacy has some little issues, so we keep it like it is for now)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Using Spacy
To download the model and use it with spacy:
```
python -m spacy download en_core_web_sm
```

In [None]:
import spacy

model = spacy.load("en_core_web_sm")

## PreProcessingService
Class that holds the process to clean the text and produce a lemmatized corpus. <br/> The result is then persisted in a file to avoid re-processing the same data.

In [None]:
from pre_processing import CleanTextRule

In [None]:
demo_text = "This is a demo text. Isn't Root just an amazing game? I love it!"

### BGG noise removal
BGG comments can carry metadata such as images and some pseudo-html tags. <br>
To avoid processing those we simply remove them applying two regexes:

In [None]:
# As defined in the PreProcessingService
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

In [None]:
CleanTextRule(clean_tags_regex).process(
    "This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content")

In [None]:
CleanTextRule(keep_tag_content_regex, r'\3').process("This is a test for processing [b=323]bold[/b] as content")
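
To see what the two patterns do on their own, here is a short sketch that applies them with plain `re.sub`. It assumes `CleanTextRule` performs a regex substitution of this kind, which is not confirmed here, and the input strings are made up for the example.

```python
import re

clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

# Drop the tag together with everything it wraps
removed = re.sub(clean_tags_regex, "", "See [IMG]https://example.com/a.gif[/IMG] here")
# Keep only the wrapped text (third capture group)
unwrapped = re.sub(keep_tag_content_regex, r"\3", "This is [b=323]bold[/b] text")

print(repr(removed))    # 'See  here'
print(repr(unwrapped))  # 'This is bold text'
```

Note that the first substitution leaves a double space behind; the later tokenization step makes this harmless.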

### Language detection
While a model with multi-language support would of course be amazing, we focus on English. <br>
To filter out other languages we use the ```fast_langdetect``` library.

In [None]:
from fast_langdetect import detect

german_sentence = "Naja, ich finde die Siedler von Catan immer noch besser"
print(f"For the demo sentence: \"{demo_text}\" we detected: {detect(demo_text)['lang']}")
print(f"For the demo sentence: \"{german_sentence}\" we detected: {detect(german_sentence)['lang']}")

In [7]:
from pre_processing import FilterLanguageRule

print(FilterLanguageRule(["it", "de"]).process("Wir hatten viel spass heute"))
print(FilterLanguageRule(["it", "de"]).process("We had lots of fun today"))

Wir hatten viel spass heute
None


### Tokenization and lemmatization
Using ```spacy``` we tokenize the text and then we lemmatize it. <br>

In [None]:
from pre_processing import LemmatizeTextRule

LemmatizeTextRule().process(demo_text)  # (Should be considered private)

### Remove too short texts
Comments (reviews) that are too short might not be informative. <br>
Stopwords and punctuation have already been removed at this point, so we can filter out comments with too few remaining tokens, but the threshold should stay reasonable (not too high). This step is done by the PreProcessingService as well.

In [15]:
from pre_processing import ShortTextFilterRule

ShortTextFilterRule(4).process(['this', 'is', 'short'])
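
A minimal sketch of such a length filter is shown below. `filter_short` is a hypothetical stand-in for `ShortTextFilterRule`; in particular, whether the real rule keeps texts with exactly the threshold number of tokens is an assumption.

```python
from typing import List, Optional


def filter_short(tokens: List[str], min_tokens: int) -> Optional[List[str]]:
    """Keep the token list only if it has at least min_tokens entries."""
    return tokens if len(tokens) >= min_tokens else None


print(filter_short(["this", "is", "short"], 4))  # None
print(filter_short(["great", "worker", "placement", "game"], 4))  # kept as-is
```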

## Batch Process

In [None]:
import pandas as pd
from core.pre_processing import PreProcessingService

# Our known game names.
game_names = pd.read_csv("../resources/2024-08-18.csv")['Name'].tolist()

In [4]:
# Specially tailored possible cases
game_names.remove("Quick")  # A tricky word that could be often used in reviews.
game_names.append("Catan")  # CATAN is the name in our Database.

This pre-processing might not be perfect, but it is good enough and probably a step in the right direction. <br>
A complete model or a well-thought-out way to recognize board game names would be desirable, but that is a long task on its own.
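
As an illustration of list-based name filtering, the sketch below drops known game names from a token list, trying the longest match first so that multi-word titles are removed as a whole. `build_name_index` and `remove_game_names` are hypothetical helpers; the actual `LemmatizeTextWithoutGameNamesRule` may work differently (for example on lemmas rather than raw tokens).

```python
from typing import Iterable, List, Set, Tuple


def build_name_index(game_names: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Lower-case every name and split it into a tuple of tokens."""
    return {tuple(name.lower().split()) for name in game_names if name.strip()}


def remove_game_names(tokens: List[str], names: Set[Tuple[str, ...]]) -> List[str]:
    """Drop every token span that equals a known game name (longest match first)."""
    max_len = max((len(name) for name in names), default=0)
    kept, i = [], 0
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + span]) in names:
                i += span  # skip the whole name
                break
        else:
            # No name starts at this position, keep the token
            kept.append(tokens[i])
            i += 1
    return kept


names = build_name_index(["Catan", "Ticket to Ride", "Root"])
print(remove_game_names("root is almost as good as ticket to ride".split(), names))
# ['is', 'almost', 'as', 'good', 'as']
```

This also shows why a name like "Quick" is risky: every occurrence of the ordinary word would be removed along with the title.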

In [5]:
print(len(game_names))

25899


In [6]:
pipelines = [
    PreProcessingService.default_pipeline("../data/processed-dataset/default"),
    PreProcessingService.game_name_less_pipeline(game_names, "../data/processed-dataset/game-name-filtered"),
    PreProcessingService.kickstarter_filter_pipeline_without_game_names(
        game_names, "../data/processed-dataset/kickstarter-filtered-game-name-filtered"
    ),
    PreProcessingService.kickstarter_filter_pipeline_without_game_names_and_numbers(
        game_names, "../data/processed-dataset/kickstarter-filtered-game-name-filtered-no-numbers"
    )
]

Generating game names tokenized representationL: (25899)
Done generating... cb ready for use!
Generating game names tokenized representationL: (25899)
Done generating... cb ready for use!


In [12]:
processed_corpora = dict()

for key in generated_corpora:
    # Generate the 4 datasets we desire.
    print(f"Processing the {key} datasets:")
    for pipeline in pipelines:
        rule_names = [rule.__class__.__name__ for rule in pipeline.pipeline]
        print(f"Started pipeline with pipe:\n {rule_names}")
        generated_file = pipeline.pre_process_corpus(generated_corpora[key], key)
        # Single quotes inside the f-string keep this valid before Python 3.12.
        processed_corpora[f"{key}_{pipeline.name.replace('pipeline', '')}"] = generated_file

Processing the k8 datasets:
Started pipeline PreProcessingService with pipe:
 ['CleanTextRule', 'CleanTextRule', 'FilterLanguageRule', 'LemmatizeTextRule', 'ShortTextFilterRule', 'ListToTextRegenerationRule']


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

Started pipeline PreProcessingService with pipe:
 ['CleanTextRule', 'CleanTextRule', 'FilterLanguageRule', 'LemmatizeTextWithoutGameNamesRule', 'ShortTextFilterRule', 'ListToTextRegenerationRule']


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

Started pipeline PreProcessingService with pipe:
 ['CleanTextRule', 'CleanTextRule', 'KickstarterRemovalRule', 'FilterLanguageRule', 'LemmatizeTextRule', 'ShortTextFilterRule', 'ListToTextRegenerationRule']


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

Started pipeline PreProcessingService with pipe:
 ['CleanTextRule', 'CleanTextRule', 'KickstarterRemovalRule', 'FilterLanguageRule', 'LemmatizeTextWithoutGameNamesRule', 'ShortTextFilterRule', 'ListToTextRegenerationRule']


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

Processing the k64 datasets:
Started pipeline PreProcessingService with pipe:
 ['CleanTextRule', 'CleanTextRule', 'FilterLanguageRule', 'LemmatizeTextRule', 'ShortTextFilterRule', 'ListToTextRegenerationRule']


Pandas Apply:   0%|          | 0/64380 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
print(f"We generated: {processed_corpora}")

See how the dataset changed:

In [None]:
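# Compare the row counts before and after pre-processing
original_size = len(pd.read_csv(generated_corpora["k8"]))
processed_size = len(pd.read_csv("../data/processed-dataset/default/8k.csv"))
print(f"We lost a total of {original_size - processed_size} reviews")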

# Custom Dataset Definition
To train the model we need a way to access the elements of our dataset. ```torch``` supports this through a custom ```Dataset``` class. <br>
This class is later loaded into a ```DataLoader``` that provides the batches of data to the model.

To generate valid inputs for the model we have to give our data a numerical representation. <br>
For this we use a ```WordEmbedding``` model that gives us the dictionary of recognized words (the embeddings themselves are generated inside the model). <br>

In [None]:
max_vocab_size = 16000
embedding_size = 128
target_embedding_model_file = "./../data/word-embeddings.model"

In [None]:
import core.utils as utils
import core.embeddings as embeddings

# We just show how to use them
embeddings_model = embeddings.WordEmbedding(
    utils.LoadCorpusUtility(), max_vocab_size=max_vocab_size, embedding_size=embedding_size,
    target_model_file=target_embedding_model_file, corpus_file="../data/processed-dataset/default/k8.preprocessed.csv"
)

In [None]:
# We require a vocabulary to map the words to indexes
embeddings_model.load_model()
embeddings_model.get_vocab()

vocabulary = embeddings_model.model.wv.key_to_index

## PositiveNegativeCommentGeneratorDataset
Gives a sample and also returns some negative samples for contrastive learning. <br>
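
The idea can be sketched without `torch`: for a given index, return that review together with a few other reviews drawn at random as negatives. `sample_with_negatives` is a hypothetical function written for illustration, and uniform random sampling of negatives is an assumption about how `PositiveNegativeCommentGeneratorDataset` behaves.

```python
import random
from typing import List, Sequence, Tuple


def sample_with_negatives(
    encoded_reviews: Sequence[List[int]],
    index: int,
    n_negatives: int,
    rng: random.Random,
) -> Tuple[List[int], List[List[int]]]:
    """Return the review at `index` plus `n_negatives` other reviews drawn at random."""
    candidates = [i for i in range(len(encoded_reviews)) if i != index]
    negative_ids = rng.sample(candidates, min(n_negatives, len(candidates)))
    return encoded_reviews[index], [encoded_reviews[i] for i in negative_ids]


reviews = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
positive, negatives = sample_with_negatives(reviews, 0, 2, random.Random(0))
print(positive, len(negatives))  # [1, 2, 3] 2
```

Excluding the positive index from the candidates guarantees that a review is never used as its own negative.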


In [None]:
from core.dataset import PositiveNegativeCommentGeneratorDataset

ds = PositiveNegativeCommentGeneratorDataset("./../data/corpus.preprocessed.csv", vocabulary, 10)

In [None]:
from torch.utils.data import DataLoader

lazy_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

In [None]:
i = 11  # An arbitrary index whose content we show
print(
    f"Sentence at index {i} original text is: `{ds.get_text_item(i)}` (Look at [comments] property for the stripped down version)\n "
    f"Its numeric representation:\n {ds[i][0][0]}"
)

### Sequence length truncation
The model will be trained on sequences of fixed length. <br>
The chosen length must be reasonable; we can't just pad everything out for the sake of it. <br>

We want at least 95% of the reviews to fit without truncation. <br>

In [None]:
# We have 137 of the 50461 total reviews that are bigger than 256 tokens.
# This is less than 1% of the total reviews. We can truncate.
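
One way to pick such a length and apply it is sketched below, using only the standard library. `length_covering` and `encode` are hypothetical helpers, and `pad_id` and `unk_id` are assumed reserved ids: a vocabulary taken directly from `key_to_index` usually starts at 0, so its indices would have to be shifted to make room for them.

```python
import math
from typing import Dict, List


def length_covering(lengths: List[int], coverage: float = 0.95) -> int:
    """Smallest length that leaves at least `coverage` of the reviews untruncated."""
    ordered = sorted(lengths)
    rank = max(math.ceil(coverage * len(ordered)), 1)
    return ordered[rank - 1]


def encode(tokens: List[str], vocabulary: Dict[str, int], max_len: int,
           pad_id: int = 0, unk_id: int = 1) -> List[int]:
    """Map tokens to ids, truncate to max_len and right-pad with pad_id."""
    ids = [vocabulary.get(token, unk_id) for token in tokens[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


lengths = [3, 5, 8, 12, 40]
print(length_covering(lengths, 0.8))  # 12
print(encode(["great", "game", "unknown"], {"great": 2, "game": 3}, 5))
# [2, 3, 1, 0, 0]
```

With the numbers quoted in the cell above, a length of 256 truncates only 137 of 50461 reviews, which comfortably meets the 95% target.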