In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"
random_state = 281997

BGG does not directly provide a way to list all the games in its archive, so we used a dump created by the community (2024-08-18).

# Dataset Generation
Our dataset is a corpus of reviews scraped from the BGG API. <br />
To download the comments we use the code in ```bgg_corpus_service.py```.

## Subsample the data
We should limit the number of reviews, but to how many? Let's look at some case studies:

- Amazon Product Reviews: the size varies by category, but subsets of 5,000 to 20,000 reviews are common.
- Yelp Dataset: typically, 8,000 to 15,000 reviews are used in research for unsupervised aspect extraction.
- TripAdvisor Reviews: around 5,000 to 10,000 reviews in unsupervised experiments.

For unsupervised learning, 5,000–10,000 reviews is a reasonable starting point for recognizing 6 aspects. More reviews may improve diversity and robustness but come with increased computational costs.




In [2]:
import pandas as pd

corpus_file = "../data/corpus.csv"
sampled_corpus_file = "../data/corpus.sampled.csv"

In [None]:
og_data = pd.read_csv(corpus_file)
# Count the distinct games once and reuse the value.
# (Nesting double quotes inside a double-quoted f-string is a SyntaxError before Python 3.12.)
n_games = og_data["game_id"].nunique()
reviews_per_game = int(64000 / n_games) + 1

print(f"I have a total of {n_games} games with reviews. "
      f"We want ~64k reviews, so we take {reviews_per_game} reviews per game.")

In [None]:
# We start with ~64k reviews (for more robustness). This is before pre-processing, which might reduce the total number of reviews later.
(
    og_data.groupby("game_id", group_keys=False)[og_data.columns]
    .apply(lambda x: x.sample(min(len(x), reviews_per_game), random_state=random_state))
    .to_csv(sampled_corpus_file, index=False)
)
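
As a sanity check on the sampling logic, here is a minimal sketch on toy data. It deliberately uses a different technique from the cell above (shuffle once, then keep the first `n` rows of each group) to show the same capping behaviour: large groups are cut to `n`, small groups are kept whole. The helper name `sample_per_group` and the toy frame are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def sample_per_group(df: pd.DataFrame, key: str, n: int, seed: int) -> pd.DataFrame:
    """Keep at most `n` randomly chosen rows for every value of `key`."""
    # Shuffling first makes "the first n rows of each group" a random subset.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(key).head(n).sort_index()


# Toy corpus: game 1 has five comments, game 2 has only two.
toy = pd.DataFrame({
    "game_id": [1, 1, 1, 1, 1, 2, 2],
    "comment": [f"comment {i}" for i in range(7)],
})

sampled = sample_per_group(toy, "game_id", n=3, seed=281997)
counts = sampled["game_id"].value_counts().to_dict()
print(counts)  # game 1 is capped at 3 rows, game 2 keeps both of its rows
```

Games with fewer than `n` reviews stay under-represented after sampling, which is worth keeping in mind when reading the per-game counts.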

Check the distribution of reviews across games:

In [None]:
data = pd.read_csv(sampled_corpus_file)
data

In [None]:
data.groupby(["game_id"]).count()
# Every game contributes at most `reviews_per_game` reviews, so the reviews should be balanced across all games.
# We can now proceed to pre-process the data.

# Preprocessing
The information downloaded from the BGG API may be uninformative, faulty, or bloated with useless content. <br>
To deal with this we apply some pre-processing steps that filter out what we don't need, which may be entire records or only part of the text inside a comment.

During this process we also tokenize and lemmatize the text using ```spacy```.


In [None]:
import warnings

# Some parts of torch that are used by spaCy are deprecated; we can ignore those warnings.
# (The new spaCy 3.8 has some small issues, so we keep things as they are for now.)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Using Spacy
To download the model and use it with spacy:
```
python -m spacy download en_core_web_sm
```

In [None]:
import spacy

model = spacy.load("en_core_web_sm")

## PreProcessingService
Class that holds the logic to clean the text and produce a lemmatized corpus. <br/> The result is then persisted to a file to avoid re-processing the same data.

In [None]:
from pre_processing import PreProcessingService

ps = PreProcessingService()

In [None]:
demo_text = "This is a demo text. Isn't Root just an amazing game? I love it!"

### BGG noise removal
BGG comments can carry markup such as embedded images and pseudo-HTML tags. <br>
To avoid processing those we simply remove them by applying two regexes:

In [None]:
# As defined in the PreProcessingService
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

In [None]:
ps.clean_text("This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content")

In [None]:
ps.clean_text("This is a test for processing [b=323]bold[/b] as content")
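
How `PreProcessingService.clean_text` combines the two patterns is not shown in this notebook, so the following is only a sketch under an assumed order: first unwrap lowercase tags while keeping their content, then drop any tagged spans that remain. The function name `strip_bgg_markup` is hypothetical; the two patterns are copied from the cell above.

```python
import re

# Patterns copied from the notebook cell above.
CLEAN_TAGS = re.compile(r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]")
KEEP_TAG_CONTENT = re.compile(r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]")


def strip_bgg_markup(text: str) -> str:
    """Assumed order: keep the content of lowercase tags, then remove the rest."""
    # Group 3 is the text between the opening and the closing tag.
    text = KEEP_TAG_CONTENT.sub(r"\3", text)
    # Whatever tagged span is still present (e.g. [IMG]...[/IMG]) is dropped entirely.
    return CLEAN_TAGS.sub("", text)


image_example = strip_bgg_markup("before [IMG]https://example.org/a.gif[/IMG] after")
bold_example = strip_bgg_markup("before [b=323]bold[/b] after")
print(repr(image_example))
print(repr(bold_example))
```

Note that removing a tagged span leaves the surrounding whitespace untouched, so a later whitespace-normalisation step may still be needed.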

### Language detection
While a model supporting multiple languages would of course be amazing, we are focusing on English. <br>
To filter out other languages we use the ```fast_langdetect``` library.

In [None]:
from fast_langdetect import detect

german_sentence = "Naja, ich finde die Siedler von Catan immer noch besser"
print(f"For the demo sentence: \"{demo_text}\" we detected: {detect(demo_text)['lang']}")
print(f"For the demo sentence: \"{german_sentence}\" we detected: {detect(german_sentence)['lang']}")
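
A minimal sketch of how such a detector could be used to filter a list of comments. The detector is passed in as a plain callable so the example runs without downloading a language model; `keep_english` and `toy_detect` are hypothetical names, and `toy_detect` is a crude stand-in, not a real language detector.

```python
from typing import Callable, Iterable, List


def keep_english(texts: Iterable[str], detect_lang: Callable[[str], str],
                 target: str = "en") -> List[str]:
    """Return only the texts whose detected language equals `target`."""
    kept = []
    for text in texts:
        try:
            lang = detect_lang(text)
        except Exception:
            # Assumption: a comment the detector cannot handle is simply skipped.
            continue
        if lang == target:
            kept.append(text)
    return kept


def toy_detect(text: str) -> str:
    """Stand-in detector for the demo: flags a single German pronoun."""
    return "de" if " ich " in f" {text.lower()} " else "en"


comments = [
    "I love it!",
    "Naja, ich finde die Siedler von Catan immer noch besser",
]
english_only = keep_english(comments, toy_detect)
print(english_only)
```

With the library used in this notebook, the callable could be `lambda t: detect(t)['lang']`, mirroring the call in the cell above.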

### Tokenization and lemmatization
Using ```spacy``` we tokenize the text and then we lemmatize it. <br>

In [None]:
ps._make_text_lemmas(demo_text)  # (Should be considered private)

### Remove texts that are too short
Comments (reviews) that are too short might not be informative. <br>
Since stopwords and punctuation have already been removed at this point, we can filter out comments with too few remaining tokens, but the threshold should be reasonable (not too high). This step is done by the PreProcessingService as well.

In [None]:
ps.pre_process(demo_text)
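
The length filter can be sketched as follows. The threshold value and the name `is_informative` are assumptions made for illustration; the actual threshold used by `PreProcessingService` is not shown in this notebook.

```python
from typing import List, Sequence


def is_informative(tokens: Sequence[str], min_tokens: int = 3) -> bool:
    """A comment is kept only if enough tokens survive the cleaning steps."""
    return len(tokens) >= min_tokens


# Token lists as they might look after stopword and punctuation removal.
candidates: List[List[str]] = [
    ["fun"],
    ["fun", "bit", "complex", "taste"],
    [],
]
kept = [tokens for tokens in candidates if is_informative(tokens)]
print(kept)
```

Counting tokens after stopword removal, rather than raw characters, keeps short but content-dense comments and drops long ones that are mostly filler.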

## Batch Process

In [3]:
preprocessed_corpus_file: str = "../data/corpus.preprocessed.csv"

In [None]:
from pre_processing import pre_process_corpus

pre_process_corpus(sampled_corpus_file, preprocessed_corpus_file, False)

See how the dataset changed:

In [None]:
len(pd.read_csv(preprocessed_corpus_file))  # We lost 14k reviews but it is okay! (I expect to lose more)

# Custom Dataset Definition
To train the model we need a way to access the elements of our dataset. ```torch``` provides this through a custom ```Dataset``` class. <br>
This class is later loaded into a ```DataLoader``` that provides the batches of data to the model.

To generate valid inputs for the model we have to give our data a numerical representation. <br>
To do so we use a ```WordEmbedding``` model that gives us the dictionary of recognized words (the embeddings themselves will be generated inside the model). <br>
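
To make the idea of a numerical representation concrete, here is a small sketch that maps tokens to vocabulary indices and pads the result to a fixed length. The conventions are assumptions for illustration only: index 0 is reserved for padding, vocabulary indices are shifted by one, unknown words are dropped, and padding is added on the left. The function name `encode` and the toy vocabulary are hypothetical.

```python
import numpy as np


def encode(tokens, vocab, max_len, pad_id=0):
    """Map tokens to integer ids and left-pad (or truncate) to `max_len`."""
    # Shift every index by one so that `pad_id` (0) never collides with a real word.
    ids = [vocab[t] + 1 for t in tokens if t in vocab][:max_len]
    out = np.full(max_len, pad_id, dtype=np.int64)
    if ids:  # guard: out[-0:] would address the whole array
        out[-len(ids):] = ids
    return out


toy_vocab = {"fun": 0, "complex": 1, "taste": 2}
encoded = encode(["fun", "bit", "complex", "taste"], toy_vocab, max_len=6)
print(encoded)  # "bit" is not in the toy vocabulary, so it is dropped
```

A dictionary of the same shape (word to integer index) is what `key_to_index` exposes on a gensim `KeyedVectors` object.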

In [4]:
max_vocab_size = 16000
embedding_size = 128
target_embedding_model_file = "./../data/word-embeddings.model"

In [5]:
import core.utils as utils
import core.embeddings as embeddings

embeddings_model = embeddings.WordEmbedding(
    utils.LoadCorpusUtility(), max_vocab_size=max_vocab_size, embedding_size=embedding_size,
    target_model_file=target_embedding_model_file, corpus_file=preprocessed_corpus_file
)

In [6]:
# We require a vocabulary to map the words to indexes
embeddings_model.load_model()
embeddings_model.get_vocab()

vocabulary = embeddings_model.model.wv.key_to_index



Pandas Apply:   0%|          | 0/50462 [00:00<?, ?it/s]



Pandas Apply:   0%|          | 0/50462 [00:00<?, ?it/s]

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 196688 words, keeping 8896 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #20000, processed 426738 words, keeping 11170 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #30000, processed 646351 words, keeping 12215 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #40000, processed 880823 words, keeping 12729 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #50000, processed 1120282 words, keeping 12946 word types
INFO:gensim.models.word2vec:collected 12954 word types from a corpus of 1132772 raw words and 50462 sentences
INFO:gensim.models.word2vec:Creating a fresh vocabulary
DEBUG:gensim.utils:starting a new internal lifecycle event log for Word2Vec
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_m

## PositiveNegativeCommentGeneratorDataset
Gives a sample and also returns some negative samples for contrastive learning. <br>
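
The sketch below shows one way a dataset can return a positive sample together with randomly drawn negatives. It is not the implementation in `core.dataset`; the class name `ToyContrastiveDataset`, the uniform sampling strategy, and the return layout are all assumptions made for illustration.

```python
import numpy as np


class ToyContrastiveDataset:
    """Map-style dataset: item i -> (positive row, k negative rows)."""

    def __init__(self, encoded, n_negatives, seed=0):
        self.encoded = np.asarray(encoded)
        if n_negatives > len(self.encoded) - 1:
            raise ValueError("not enough other rows to draw negatives from")
        self.n_negatives = n_negatives
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return len(self.encoded)

    def __getitem__(self, i):
        # Draw from N-1 slots, then shift every index >= i by one,
        # so that row i itself can never be picked as a negative.
        neg_idx = self.rng.choice(len(self) - 1, size=self.n_negatives, replace=False)
        neg_idx = neg_idx + (neg_idx >= i)
        return self.encoded[i], self.encoded[neg_idx]


# Five toy "reviews", each already encoded as four integers.
toy_encoded = np.arange(20).reshape(5, 4)
toy_ds = ToyContrastiveDataset(toy_encoded, n_negatives=3)
positive, negatives = toy_ds[2]
print(positive.shape, negatives.shape)
```

Because the class implements `__len__` and `__getitem__`, it follows the map-style protocol that `torch.utils.data.DataLoader` works with.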


In [7]:
from core.dataset import PositiveNegativeCommentGeneratorDataset

ds = PositiveNegativeCommentGeneratorDataset("./../data/corpus.preprocessed.csv", vocabulary, 10)

Loading spacy model.
Loading dataset from file: ./../data/corpus.preprocessed.csv
Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/50461 [00:00<?, ?it/s]

Max sequence length calculation in progress...
We loose information on 136 points.This is 0.2695150710449654% of the dataset.
Padding sequences to max length (256).
Max sequence length is:  1235  but we will limit sequences to 256 tokens.


In [8]:
from torch.utils.data import DataLoader

lazy_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

In [9]:
i = 11  # An arbitrary index whose content we show
print(f"Sentence at index {i} original text is: `{ds.get_text_item(i)}` (Look at [comments] property for the stripped down version)\n "
      f"It's numeric representation:\n {ds[i][0][0]}")

Sentence at index 11 original text is: `Fun, but a bit complex for my taste.` (Look at [comments] property for the stripped down version)
 It's numeric representation:
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0

### Sequence length truncation
The model will be trained on sequences of fixed length. <br>
The chosen length must be reasonable; we can't just pad everything out for the sake of it. <br>

We want at least 95% of the reviews to fit without being truncated. <br>

In [None]:
# 137 of the 50461 total reviews are longer than 256 tokens.
# This is less than 1% of the total reviews, so we can truncate.
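
A minimal sketch of how the cut-off could be checked: compute a percentile of the sequence lengths and count how many sequences a given limit would truncate. The helper name `truncation_report` and the toy lengths are illustrative assumptions; they are not the project's data.

```python
import numpy as np


def truncation_report(lengths, max_len):
    """Return (number of sequences longer than max_len, that share in percent)."""
    lengths = np.asarray(lengths)
    n_truncated = int((lengths > max_len).sum())
    return n_truncated, 100.0 * n_truncated / lengths.size


# Toy token counts per review.
toy_lengths = [3, 5, 8, 12, 20, 40, 300]

# Length below which 95% of the toy reviews fall (linear interpolation).
p95 = float(np.percentile(toy_lengths, 95))

n_trunc, pct = truncation_report(toy_lengths, max_len=256)
print(f"95th percentile: {p95:.1f} tokens")
print(f"{n_trunc} sequence(s) truncated at 256 tokens ({pct:.2f}%)")
```

If the chosen limit is at or above the 95th percentile, at most 5% of the reviews lose tokens.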