In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"
random_state = 281997

BGG does not directly provide a way to list all the games in its archive, so we used a dump created by the community (2024-08-18).

# Dataset Generation
Our dataset is a corpus of reviews scraped from the BGG API. <br /> 
To download the comments we use the code in ```bgg_corpus_service.py```.

## Subsample the data
We need to limit the number of reviews, but to how many? Let's look at some case studies:

- Amazon Product Reviews
Size: Varies by category, but subsets of 5,000 to 20,000 reviews are common.
- Yelp Dataset
Size: Typically, 8,000 to 15,000 reviews are used in research for unsupervised aspect extraction.
- TripAdvisor Reviews
Size: Around 5,000 to 10,000 reviews in unsupervised experiments.

For unsupervised learning, 5,000–10,000 reviews is a reasonable starting point for recognizing 6 aspects. More reviews may improve diversity and robustness but come with increased computational costs.




#### Subsample (before pre-processing) of 64K

In [2]:
import pandas as pd

corpus_file = "../data/corpus.csv"
sampled_corpus_file = "../data/corpus.sampled.csv"

In [None]:
og_data = pd.read_csv(corpus_file)
# Count the games once and reuse the value (this also avoids nesting double quotes
# inside the f-string, which is a syntax error before Python 3.12)
n_games = og_data["game_id"].nunique()
reviews_per_game = 64000 // n_games + 1

print(f"I have a total of {n_games} games with reviews. "
      f"We want ~64k reviews, so we take {reviews_per_game} reviews per game.")

In [None]:
# We start by using ~64k reviews (More robustness). This is before pre-processing which might reduce the total number of reviews later.
(
    og_data.groupby("game_id", group_keys=False)[og_data.columns]
    .apply(lambda x: x.sample(min(len(x), reviews_per_game), random_state=random_state))
    .to_csv(sampled_corpus_file, index=False)
)

#### Subsample (before pre-processing) of 256K
We also expand the dataset to see how the model behaves with more data.

In [10]:
from core.utils import subsample_corpus

sampled_corpus_file_256: str = "../data/corpus.sampled.256k.csv"

In [None]:
subsample_corpus(corpus_file, sampled_corpus_file_256, 256000, random_state)
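
The implementation of `subsample_corpus` is not shown in this notebook. As a rough illustration of the same idea (cap every game at the same number of reviews), here is a minimal sketch that shuffles the rows once and keeps the first few per game. The function name `subsample_per_game` and the toy data are hypothetical, and this shuffle-then-`head` approach is an alternative to the `groupby(...).apply(...)` cell above, not necessarily what the helper does.

```python
import pandas as pd


def subsample_per_game(df, target_size, random_state, group_col="game_id"):
    """Cap every game at the same number of reviews so the total lands near target_size."""
    per_game = target_size // df[group_col].nunique() + 1
    # Shuffle all rows once, then keep at most `per_game` rows for each game.
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.groupby(group_col).head(per_game)


# Toy corpus (hypothetical): three games with 1, 5 and 10 reviews.
toy = pd.DataFrame({
    "game_id": [1] * 1 + [2] * 5 + [3] * 10,
    "comment": [f"review {n}" for n in range(16)],
})
sampled = subsample_per_game(toy, target_size=6, random_state=281997)
print(sampled["game_id"].value_counts().sort_index().to_dict())
```

Games with fewer reviews than the cap keep all of them, which is why the final total can differ slightly from the target.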

Check distribution of games

In [None]:
data = pd.read_csv(sampled_corpus_file)
data

In [None]:
data.groupby(["game_id"]).count()
# Each game is represented about as much as the others, so the reviews should be balanced across all games.
# We can now proceed to pre-process the data.

## Special Scenario: Kickstarter
Many reviews on BGG reference the Kickstarter campaigns of the games. <br>
Most of these reviews are not informative and are not useful for training the model. <br>

For reviews containing ```Kickstarter``` we apply the following heuristic:
- If the review is short (<15 words) we remove it.
- If it is longer we keep it.
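
The heuristic above can be sketched in a few lines. This is only an illustration, not the code of `KickstarterRemovalRule`: the function name `kickstarter_filter`, the case-insensitive match, counting words on the raw text (rather than on tokens), and returning `None` for a dropped review are all assumptions.

```python
import re


def kickstarter_filter(text, min_words=15):
    """Return None for short reviews that mention Kickstarter, else the text unchanged."""
    mentions_kickstarter = re.search(r"kickstarter", text, flags=re.IGNORECASE) is not None
    # Whitespace-separated words are used as a simple length measure (assumption).
    if mentions_kickstarter and len(text.split()) < min_words:
        return None
    return text


short_review = "Backed it on Kickstarter, still waiting."
long_review = (
    "Kickstarter is a great platform to launch games. The longer my review is "
    "the more likely we are going to keep it. Extraordinary."
)
print(kickstarter_filter(short_review))
print(kickstarter_filter(long_review) == long_review)
```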

In [12]:
# Removal of 'Kickstarter' reviews
from pre_processing import PreProcessingService, KickstarterRemovalRule

ps = PreProcessingService.kickstarter_filter_pipeline()

In [13]:
test = "Kickstarter is a great platform to launch games. The longer my review is the more likely we are going to keep it. Extraordinary."
ps.pre_process(test)

'kickstarter great platform launch game long review likely go extraordinary'

We have a dedicated pipeline for the Kickstarter removal. <br>
With it we generate a separate dataset, so we can compare and see whether the quality of the data improves.

In [4]:
ps.pre_process_corpus("../data/corpus.sampled.csv", "../data/corpus.preprocessed.kickstarter_removed.csv")

Pandas Apply:   0%|          | 0/64380 [00:00<?, ?it/s]

We also create a preprocessed corpus file from the larger sample (> 256k reviews).

In [14]:
ps.pre_process_corpus(sampled_corpus_file_256, "../data/corpus.preprocessed.kickstarter_removed.256k.csv")

Pandas Apply:   0%|          | 0/257520 [00:00<?, ?it/s]

# Preprocessing
The information downloaded from the BGG API might be uninformative, faulty or bloated with useless content. <br>
To deal with this we apply some pre-processing steps that filter out what we don't need, which may be entire records or only part of the text inside a record.

During this process we also tokenize and lemmatize the text using ```spacy```.


In [None]:
import warnings

# Some parts of torch that are used by spaCy are deprecated; we can ignore those warnings
# (the new spaCy 3.8 has some small issues, so we keep it as it is for now)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Using Spacy
To download the model and use it with spacy:
```
python -m spacy download en_core_web_sm
```

In [None]:
import spacy

model = spacy.load("en_core_web_sm")

## PreProcessingService
Class that holds the process to clean the text and produce a lemmatized corpus. <br/> The result is then persisted to a file to avoid re-processing the same data.

In [None]:
from pre_processing import CleanTextRule

In [None]:
demo_text = "This is a demo text. Isn't Root just an amazing game? I love it!"

### BGG noise removal
BGG comments can carry metadata such as images and some pseudo-HTML tags. <br>
To avoid processing those we simply remove them by applying two regexes:

In [None]:
# As defined in the PreProcessingService
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

In [None]:
CleanTextRule(clean_tags_regex).process(
    "This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content")

In [None]:
CleanTextRule(keep_tag_content_regex, r'\3').process("This is a test for processing [b=323]bold[/b] as content")
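
To see what the two patterns do on their own, the sketch below applies them with plain `re.sub` to the same two example comments. It assumes `CleanTextRule` is essentially a thin wrapper around a regex substitution, which this notebook does not confirm.

```python
import re

clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

image_comment = (
    "This is a test for processing "
    "[IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content"
)
bold_comment = "This is a test for processing [b=323]bold[/b] as content"

# First pattern: drop the tag together with everything between open and close.
without_tag = re.sub(clean_tags_regex, "", image_comment)
# Second pattern: drop only the tag markers and keep the wrapped text (group 3).
unwrapped = re.sub(keep_tag_content_regex, r"\3", bold_comment)

print(repr(without_tag))  # 'This is a test for processing  as content'
print(repr(unwrapped))    # 'This is a test for processing bold as content'
```

Note that removing a whole tag leaves a double space behind. Also, because the first pattern is case-insensitive, it would delete `[b]bold[/b]` entirely if it ran before the second one, so the order in which the rules are applied matters.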

### Language detection
While it would of course be amazing to have a model with multi-language support, we are focusing on English. <br>
To filter out other languages we use the ```fast_langdetect``` library.

In [None]:
from fast_langdetect import detect

german_sentence = "Naja, ich finde die Siedler von Catan immer noch besser"
print(f"For the demo sentence: \"{demo_text}\" we detected: {detect(demo_text)['lang']}")
print(f"For the demo sentence: \"{german_sentence}\" we detected: {detect(german_sentence)['lang']}")

In [7]:
from pre_processing import FilterLanguageRule

print(FilterLanguageRule(["it", "de"]).process("Wir hatten viel spass heute"))
print(FilterLanguageRule(["it", "de"]).process("We had lots of fun today"))

Wir hatten viel spass heute
None
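
Judging from the output above, the list passed to `FilterLanguageRule` names the languages to keep: the German sentence is returned and the English one becomes `None`. A minimal sketch of that behaviour follows, with the detector injected as a function so the example runs without any model download. `keep_languages` and `toy_detector` are hypothetical names; in the notebook the detection itself would come from `detect(text)['lang']` as used earlier.

```python
from typing import Callable, Iterable, Optional


def keep_languages(text: str, allowed: Iterable[str],
                   detect_lang: Callable[[str], str]) -> Optional[str]:
    """Return the text if its detected language is in `allowed`, otherwise None."""
    return text if detect_lang(text) in set(allowed) else None


def toy_detector(text: str) -> str:
    # Stand-in for a real language detector, good enough for the two demo sentences only.
    return "de" if "spass" in text.lower() else "en"


print(keep_languages("Wir hatten viel spass heute", ["it", "de"], toy_detector))
print(keep_languages("We had lots of fun today", ["it", "de"], toy_detector))
```

For an English-only corpus the same function would be called with `["en"]` as the allowed list.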


### Tokenization and lemmatization
Using ```spacy``` we tokenize the text and then we lemmatize it. <br>

In [None]:
from pre_processing import LemmatizeTextRule

LemmatizeTextRule().process(demo_text)  # (Should be considered private)

### Remove too short texts
Comments (reviews) that are too short might not be informative. <br>
We already remove stopwords and punctuation, so comments can be filtered on their remaining length, but the threshold should stay reasonable (not too high). This step is also done by the PreProcessingService.

In [3]:
from pre_processing import ShortTextFilterRule

ShortTextFilterRule(3).process(['this', 'is', 'short'])

['this', 'is', 'short']
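
The output above suggests the threshold is inclusive: a list of exactly 3 tokens passes `ShortTextFilterRule(3)`. A minimal sketch of such a filter is shown below. The name `filter_short` is hypothetical, and returning `None` for a rejected comment is an assumption modelled on the language filter.

```python
from typing import List, Optional


def filter_short(tokens: List[str], min_tokens: int) -> Optional[List[str]]:
    """Keep token lists with at least `min_tokens` items; return None for shorter ones."""
    return tokens if len(tokens) >= min_tokens else None


print(filter_short(["this", "is", "short"], 3))
print(filter_short(["too", "short"], 3))
```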

## Batch Process

In [None]:
preprocessed_corpus_file: str = "../data/corpus.preprocessed.csv"

In [None]:
from pre_processing import pre_process_corpus

pre_process_corpus(sampled_corpus_file, preprocessed_corpus_file, False)

See how the dataset changed:

In [None]:
len(pd.read_csv(preprocessed_corpus_file))  # We lost 14k reviews but it is okay! (I expect to lose more)

# Custom Dataset Definition
To train the model we require a way to get elements of our dataset. ```torch``` provides a way to do this by defining a custom ```Dataset``` class. <br>
This class is later loaded into a ```DataLoader``` that will provide the batches of data to the model.

To generate valid inputs for the model we have to give our data a numerical representation. <br>
For this we use a ```WordEmbedding``` model that gives us the dictionary of recognized words (the embeddings themselves are generated inside the model). <br>
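
As an illustration of what "numerical representation" means here, the sketch below maps tokens to indices through a word-to-index dictionary and brings every sequence to a fixed length. It is not the dataset's actual code: the `encode` helper, the toy vocabulary, dropping out-of-vocabulary words, and using `len(vocabulary)` as the padding index are all assumptions.

```python
from typing import Dict, List


def encode(tokens: List[str], vocabulary: Dict[str, int],
           max_len: int, pad_index: int) -> List[int]:
    """Map known tokens to indices, truncate to max_len and right-pad with pad_index."""
    ids = [vocabulary[token] for token in tokens if token in vocabulary]
    ids = ids[:max_len]
    return ids + [pad_index] * (max_len - len(ids))


toy_vocabulary = {"game": 0, "great": 1, "long": 2}
pad_index = len(toy_vocabulary)  # one index past the last real word (assumption)

print(encode(["great", "game", "unknownword"], toy_vocabulary, 5, pad_index))
# [1, 0, 3, 3, 3]
```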

In [None]:
max_vocab_size = 16000
embedding_size = 128
target_embedding_model_file = "./../data/word-embeddings.model"

In [None]:
import core.utils as utils
import core.embeddings as embeddings

embeddings_model = embeddings.WordEmbedding(
    utils.LoadCorpusUtility(), max_vocab_size=max_vocab_size, embedding_size=embedding_size,
    target_model_file=target_embedding_model_file, corpus_file=preprocessed_corpus_file
)

In [None]:
# We require a vocabulary to map the words to indexes
embeddings_model.load_model()
embeddings_model.get_vocab()

vocabulary = embeddings_model.model.wv.key_to_index

## PositiveNegativeCommentGeneratorDataset
Gives a sample and also returns some negative samples for contrastive learning. <br>
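
How the negative samples are chosen is not shown in this notebook. One common and simple option, sketched here purely as an assumption, is to draw `k` other comments uniformly at random for each positive sample. The function name `sample_negative_indices` is hypothetical.

```python
import random
from typing import List


def sample_negative_indices(positive_index: int, dataset_size: int,
                            k: int, rng: random.Random) -> List[int]:
    """Draw k distinct indices uniformly at random, never returning the positive one."""
    candidates = [i for i in range(dataset_size) if i != positive_index]
    return rng.sample(candidates, k)


rng = random.Random(281997)
negatives = sample_negative_indices(positive_index=11, dataset_size=100, k=10, rng=rng)
print(negatives)
```

Building the candidate list is linear in the dataset size, which is fine for a sketch; a large corpus would more likely draw random indices directly and retry on a collision.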


In [None]:
from core.dataset import PositiveNegativeCommentGeneratorDataset

ds = PositiveNegativeCommentGeneratorDataset("./../data/corpus.preprocessed.csv", vocabulary, 10)

In [None]:
from torch.utils.data import DataLoader

lazy_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

In [None]:
i = 11  # An arbitrary index used to show the content of one item
print(
    f"Sentence at index {i} original text is: `{ds.get_text_item(i)}` (Look at [comments] property for the stripped down version)\n "
    f"Its numeric representation:\n {ds[i][0][0]}")

### Sequence length truncation
The model will be trained on sequences of fixed length. <br>
The chosen length must be reasonable; we can't just pad everything out for the sake of it. <br>

We want at least 95% of the reviews to remain untruncated. <br>
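
One way to pick such a length is to look at the distribution of review lengths (in tokens) and take the smallest length that covers the desired share. The sketch below does this with the standard library only. The helper names and the toy lengths are hypothetical, and the notebook itself may have chosen the length differently.

```python
from typing import List


def length_covering(lengths: List[int], coverage_percent: int = 95) -> int:
    """Smallest length such that at least coverage_percent of the items fit without truncation."""
    ordered = sorted(lengths)
    # Ceiling of coverage_percent * n / 100, computed with integer arithmetic.
    rank = max(1, (coverage_percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def truncated_share(lengths: List[int], max_len: int) -> float:
    """Fraction of items that are longer than max_len and would be truncated."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)


toy_lengths = list(range(1, 101))  # hypothetical review lengths: 1..100 tokens
chosen = length_covering(toy_lengths, 95)
print(chosen, truncated_share(toy_lengths, chosen))
```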

In [None]:
# 137 of the 50461 total reviews are longer than 256 tokens.
# This is less than 1% of the total reviews, so we can truncate.