In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"
random_state = 281997

BGG does not directly provide a way to list all the games in its archive, so we used a dump created by the community (2024-08-18).

# Dataset Generation
Our dataset is a corpus of reviews scraped from the BGG API. <br /> 
To download the comments we use the code in ```bgg_corpus_service.py```.

## Subsample the data
We should limit the number of reviews, but to how many? Let's look at some case studies:

- Amazon Product Reviews
Size: Varies by category, but subsets of 5,000 to 20,000 reviews are common.
- Yelp Dataset
Size: Typically, 8,000 to 15,000 reviews are used in research for unsupervised aspect extraction.
- TripAdvisor Reviews
Size: Around 5,000 to 10,000 reviews in unsupervised experiments.

For unsupervised learning, 5,000–10,000 reviews is a reasonable starting point for recognizing 6 aspects. <br>
More reviews may improve diversity and robustness but come with increased computational costs.




In [3]:
# File of our corpus:
corpus_file = "../data/corpus.csv"

## Special Scenario: Kickstarter
Many reviews on BGG reference the Kickstarter campaigns of the games. <br>
Most of these reviews are not informative and are not useful for training the model.

In [2]:
import pandas as pd

dataset = pd.read_csv(corpus_file)

### How Many comments contain Kickstarter?
Let's measure it! And while at it check how many of these are short comments:

In [34]:
kickstarter_subset = dataset[dataset["comments"].str.contains("kickstarter|kickstarted|kickstart", case=False)]
print(
    f"The subset is {len(kickstarter_subset) / len(dataset) * 100}% of "
    f"the original with a total of {len(kickstarter_subset)} comments."
)

The subset is 1.619698530100178% of the original with a total of 34600 comments.


In [36]:
kickstarter_counts = (kickstarter_subset["comments"].apply(lambda x: len(x.split(" ")) > 15)).value_counts()

In [39]:
ds = dataset[~dataset["comments"].str.contains("kickstarter|kickstarted|kickstart", case=False)]
counts = (ds["comments"].apply(lambda x: len(x.split(" ")) > 15)).value_counts()

In [40]:
print(
    f"We lose a total of: {kickstarter_counts.get(True) / (kickstarter_counts.get(True) + counts.get(True)) * 100}% possible extractions from the dataset. \nLess than 1.1%, so we can just ignore Kickstarter comments"
)

We lose a total of: 1.0276979373944586% possible extractions from the dataset. 
Less than 1.1%, so we can just ignore Kickstarter comments
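
The same two checks can be reproduced without pandas. The sketch below is a minimal illustration on made-up comments (the `comments` list and the helper names are hypothetical). It relies on the fact that `kickstarter` and `kickstarted` both contain `kickstart`, so a case-insensitive search for the shortest form catches every variant.

```python
import re

# Case-insensitive; "kickstarter" and "kickstarted" both contain "kickstart".
KICKSTART_RE = re.compile(r"kickstart", re.IGNORECASE)
MIN_WORDS = 15  # same threshold as in the cells above


def mentions_kickstarter(comment: str) -> bool:
    return KICKSTART_RE.search(comment) is not None


def is_long(comment: str) -> bool:
    # Mirrors len(x.split(" ")) > 15 from the notebook.
    return len(comment.split(" ")) > MIN_WORDS


# Made-up comments, for illustration only.
comments = [
    "Kickstarter edition.",
    "Backed it when it was KICKSTARTED and never played it.",
    "Great worker placement game with a lot of interaction between the players at the table, "
    "highly recommended.",
]

kickstarter_comments = [c for c in comments if mentions_kickstarter(c)]
usable_lost = [c for c in kickstarter_comments if is_long(c)]
print(len(kickstarter_comments), len(usable_lost))
```

Only the Kickstarter comments that are also long enough count as lost extractions, which is the quantity measured above.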


In [42]:
ds.to_csv("../data/corpus.csv", index=False)

### How many times are game titles referenced in the corpus? Let's see!


In [272]:
game_names = pd.read_csv("../resources/2024-08-18.csv")['Name']
# As we use regex pattern to check we might have some problems with some game names.
# These include cases like: [kosmopoli:t], [redacted], or **, or ???. They are only 9 games.
# This should not change the results of our inspections too dramatically especially because they are not as popular as other games.
game_names = game_names.drop([3011, 6800, 13330, 14764, 19280, 20312, 21764, 21796, 25651])

print(f"We have a total of {len(game_names)} games.")
match_string = "|".join(game_names)

We have a total of 25890 games.
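
Dropping the nine names is one option. An alternative, sketched here as an assumption and not what the notebook does, is to escape every name with `re.escape` before joining, so that characters such as `[`, `*` or `?` are matched literally. The names below are a small illustrative sample.

```python
import re

# Illustrative sample; the real list comes from the community dump.
sample_names = ["Risk", "[redacted]", "???", "**"]

# re.escape neutralises regex metacharacters, so no name has to be dropped.
safe_match_string = "|".join(re.escape(name) for name in sample_names)
safe_pattern = re.compile(safe_match_string)

# Brackets are matched literally.
print(bool(safe_pattern.search("We played [redacted] last night")))
# The search stays case-sensitive, so "risk" does not match "Risk".
print(bool(safe_pattern.search("That was a big risk")))
```

The escaped string can be passed to `str.contains` exactly like `match_string`.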


In [276]:
# We do match case for those unfortunate cases where game names are actually common use terms like Risk, Get Lucky etc...
# I prefer to underestimate than overestimate in this case as it seems the wisest of the two approaches.
game_named_subset = dataset[dataset["comments"].str.contains(match_string)]



In [281]:
print(
    f"A total of {len(dataset) - len(game_named_subset)} comments have no reference to game names. "
    f"The other {len(game_named_subset) / len(dataset) * 100}% reference at least one."
)

A total of 864815 comments have no reference to game names. The other 59.51619698530101% reference at least one.


About 59.5% of the comments reference a game name at least once; the remaining 864815 do not. <br> Replacing those references with the \<GAME_NAME> token might help reduce noise in the data.

# Preprocessing
The information downloaded from the BGG API might be uninformative, faulty or bloated with useless content. <br>
To deal with this we apply some pre-processing steps that filter out what we don't need, which may be entire records or only part of the text inside a record.

During this process we also tokenize and lemmatize the text using ```spacy```.


In [None]:
import warnings

# Some parts of torch that are used by Spacy are deprecated, we can ignore them 
# (The new 3.8 Spacy has some little issues, so we keep it like it is for now)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Using Spacy
To download the models used in this notebook:
```
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
```

In [None]:
import spacy

# Best compromise between accuracy and speed
model = spacy.load("en_core_web_md")

## PreProcessingService
Class that holds the process to clean the text and produce a lemmatized corpus. <br/> The result is then persisted to a file to avoid re-processing the same data.

In [1]:
from core.pre_processing import CleanTextRule

In [4]:
demo_text = "This is a demo text. Isn't Root just an amazing game? I love it!"

### BGG noise removal
BGG comments can carry metadata such as images and some pseudo-html tags. <br>
To avoid processing those we simply remove them applying two regexes:

In [13]:
# As defined in the PreProcessingService
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"(?i)\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

In [None]:
CleanTextRule(clean_tags_regex).process(
    "This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content"
)

In [None]:
CleanTextRule(keep_tag_content_regex, r'\3').process("This is a test for processing [b=323]bold[/b] as content")
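
Assuming `CleanTextRule` behaves like a thin wrapper around `re.sub` (an assumption, since its implementation is not shown here), the effect of the two patterns can be reproduced with the standard library alone:

```python
import re

clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"(?i)\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

# Drop the tag together with everything it wraps (e.g. image URLs).
without_image = re.sub(
    clean_tags_regex,
    "",
    "This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content",
)

# Drop only the markup and keep the wrapped text (group 3).
without_markup = re.sub(
    keep_tag_content_regex,
    r"\3",
    "This is a test for processing [b=323]bold[/b] as content",
)

print(repr(without_image))
print(repr(without_markup))
```

Removing a whole tag leaves the two surrounding spaces next to each other, so a whitespace normalisation step may be useful afterwards.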

### Language detection
While a model supporting multiple languages would of course be great, we focus on English. <br>
To filter out other languages we use the ```fast_langdetect``` library.

In [None]:
from fast_langdetect import detect

german_sentence = "Naja, ich finde die Siedler von Catan immer noch besser"
print(f"For the demo sentence: \"{demo_text}\" we detected: {detect(demo_text)['lang']}")
print(f"For the demo sentence: \"{german_sentence}\" we detected: {detect(german_sentence)['lang']}")

In [288]:
from pre_processing import FilterLanguageRule

print(FilterLanguageRule(["it", "de"]).process("Wir hatten heute viel spass"))
print(FilterLanguageRule(["it", "de"]).process("We had lots of fun today"))

Wir hatten heute viel spass
None


### Split reviews in sentences
This should help us generate shorter inputs and give the reviews more granularity, making aspect extraction simpler. <br>
A similar approach was taken by several academic studies, such as:
- Improving unsupervised neural aspect extraction for online discussions using out-of-domain classification (https://arxiv.org/abs/2006.09766)
- Embarrassingly Simple Unsupervised Aspect Extraction (https://arxiv.org/abs/2004.13580)

In [16]:
from core.pre_processing import SplitSentencesRule

SplitSentencesRule().process(
    "Brilliant!  Fits right into my wheelhouse all around and game weight."
    "This also has some of the most and best player interaction I've experienced in a game!"
)

['Brilliant!',
 'Fits right into my wheelhouse all around and game weight.',
 "This also has some of the most and best player interaction I've experienced in a game!"]

### Tokenization and lemmatization
Using ```spacy``` we tokenize the text and then we lemmatize it. <br>

In [None]:
from pre_processing import LemmatizeTextRule

LemmatizeTextRule().process(demo_text)  # (Should be considered private)

### Remove too narrow texts
Comments (reviews) that are too short might not be informative. <br>
Stopwords and punctuation have already been removed at this point, so we can filter out comments that are too short, as long as we set a reasonable threshold (not too high). This step is done by the PreProcessingService as well.

In [None]:
from pre_processing import ShortTextFilterRule

ShortTextFilterRule(4).process(['this', 'is', 'short'])
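
As a rough idea of what such a rule does, here is a minimal sketch. It assumes that a rejected text is signalled with `None`, as `FilterLanguageRule` does above, and that the threshold is inclusive; both are assumptions, and `filter_short_text` is a hypothetical helper, not the project's implementation.

```python
from typing import Optional


def filter_short_text(tokens: list[str], min_tokens: int) -> Optional[list[str]]:
    """Return the tokens unchanged, or None when there are fewer than min_tokens."""
    return tokens if len(tokens) >= min_tokens else None


print(filter_short_text(["this", "is", "short"], 4))
print(filter_short_text(["this", "is", "long", "enough"], 4))
```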

## Remove Dates
Dates can be useful information, but not when they are too specific. Thus, we replace the actual dates with a custom \<DATE> token using ```DateMatcherReplacementRule```. <br>
This lets us keep the information while reducing its granularity.

In [11]:
from pre_processing import LemmatizeTextWithMatcherRules, DateMatcherReplacementRule, GameNamesMatcherReplacementRule
import spacy

text = "Rating previous Gloomhaven to February 2017: 8.1 Rating previous to Oct 2017: 9.45 Rating previous to June 2018: 7.72. 10/10/2023"
nlp = spacy.load('en_core_web_md')

LemmatizeTextWithMatcherRules(nlp, rules=[DateMatcherReplacementRule(nlp.vocab), ]).process(text)

['rate',
 'previous',
 '<GAME_NAME>',
 '<DATE>',
 '8.1',
 'rating',
 'previous',
 '<DATE>',
 '9.45',
 'rate',
 'previous',
 '<DATE>',
 '7.72',
 '<DATE>']
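
`DateMatcherReplacementRule` works on spaCy tokens. As a rough approximation of the same idea with a different technique (a plain regular expression, not the spaCy matcher used by the project), the sketch below replaces month-year and numeric dates with the `<DATE>` token. The pattern is illustrative and far from exhaustive.

```python
import re

MONTHS = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)

# Either "<month> <4-digit year>" or a numeric d/m/y style date.
DATE_RE = re.compile(
    rf"\b(?:{MONTHS}\.?\s+\d{{4}}|\d{{1,2}}/\d{{1,2}}/\d{{2,4}})\b",
    re.IGNORECASE,
)


def replace_dates(text: str, token: str = "<DATE>") -> str:
    return DATE_RE.sub(token, text)


sample = (
    "Rating previous to February 2017: 8.1 Rating previous to Oct 2017: 9.45 "
    "Rating previous to June 2018: 7.72. 10/10/2023"
)
print(replace_dates(sample))
```

The ratings (`8.1`, `9.45`, `7.72`) are left untouched because they match neither alternative.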

## Delete duplicated rows
Our dataset does contain duplicate rows. These are filtered out by the ```PreProcessingService``` after each processing step. <br>
Rows are considered duplicates when they share the same original comment and game_id.
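
The idea can be sketched in plain Python as follows. The column names `game_id` and `original_text` and the helper `drop_duplicate_rows` are assumptions made for illustration; the service itself is not shown here.

```python
def drop_duplicate_rows(rows: list[dict], keys: tuple[str, ...]) -> list[dict]:
    """Keep the first row for each distinct combination of the key columns."""
    seen = set()
    unique_rows = []
    for row in rows:
        identity = tuple(row[key] for key in keys)
        if identity not in seen:
            seen.add(identity)
            unique_rows.append(row)
    return unique_rows


# Toy rows: the same comment text on two different games is NOT a duplicate.
rows = [
    {"game_id": 1, "original_text": "Great game!", "comments": "great game"},
    {"game_id": 1, "original_text": "Great game!", "comments": "great game"},
    {"game_id": 2, "original_text": "Great game!", "comments": "great game"},
]
unique = drop_duplicate_rows(rows, keys=("game_id", "original_text"))
print(len(unique))
```

On a pandas DataFrame the same effect is obtained with `drop_duplicates(subset=[...])` on the two key columns.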

## Batch Process

In [2]:
import pandas as pd

# File of our corpus:
corpus_file = "../data/corpus.csv"
# Our known game names.
game_names = pd.read_csv("../resources/2024-08-18.csv")['Name']

In [3]:
# Specially tailored possible cases
game_names = pd.concat([game_names, pd.Series(["Quick", "Catan"])], ignore_index=True)
print(len(game_names))

25901


This pre-processing is not perfect, but it is good enough and a step in the right direction. <br>
A complete model or a well-thought-out way to recognize board game names is desirable, but it would be a long task on its own.

In [4]:
import swifter
import spacy

nlp = spacy.load("en_core_web_sm")  # We use small as we don't need anything over the top.
document_game_names = game_names.swifter.apply(lambda x: nlp(x)).tolist()

Pandas Apply:   0%|          | 0/25901 [00:00<?, ?it/s]

In [5]:
from core.pre_processing import PreProcessingService

default_pipeline = PreProcessingService.default_pipeline("../data/processed-dataset/default")
full_pipeline = PreProcessingService.full_pipeline(document_game_names, "../data/processed-dataset/full")

We will create these datasets:
- ```default_pipeline```: 64k
- ```full_pipeline```: 64k, 64k_longest, 256k and 256k_longest

This is under the assumption that more data yields better models.

In [6]:
from core.dataset_sampler import ConsumingDatasetSampler, BggDatasetRandomBalancedSampler, BggDatasetLongestSampler
from dataclasses import dataclass


@dataclass
class DatasetGeneration:
    pipeline: PreProcessingService
    target_size: int
    sampler: ConsumingDatasetSampler

    def __iter__(self):
        # For a rapid unpacking of the object
        return iter((self.pipeline, self.target_size, self.sampler))


combinations: list[DatasetGeneration] = [
    DatasetGeneration(default_pipeline, 64000, BggDatasetRandomBalancedSampler(16000, corpus_file, random_state)),
    DatasetGeneration(full_pipeline, 64000, BggDatasetRandomBalancedSampler(16000, corpus_file, random_state)),
    DatasetGeneration(full_pipeline, 64000, BggDatasetLongestSampler(16000, corpus_file, random_state)),
    # We also try with more data. For now, we suppose that the full_pipeline with longer comments yield better results
    # so we preprocess this. Tests on previous datasets lead to our choice for the bigger dataset composition.
    DatasetGeneration(full_pipeline, 256000, BggDatasetRandomBalancedSampler(64000, corpus_file, random_state)),
    DatasetGeneration(full_pipeline, 256000, BggDatasetLongestSampler(64000, corpus_file, random_state)),
]

In [7]:
print('We will generate a total of:', len(combinations), ' datasets')
for combination in combinations:
    pipeline, target_size, sampler = combination
    longest_affix = "_longest" if isinstance(sampler, BggDatasetLongestSampler) else ""

    name = f"{target_size // 1000}k{longest_affix}"
    print("Generated dataset will be stored in file of prefix: " + name)
    file = pipeline.pre_process_corpus(target_size, sampler, name)

    print(f"Generated dataset in file: {file}")

We will generate a total of: 5  datasets
Generated dataset will be stored in file of prefix: 64k
I have a total of 2220 games with reviews. We take 8 reviews per game.


Pandas Apply:   0%|          | 0/17760 [00:00<?, ?it/s]

I have a total of 2220 games with reviews. We take 8 reviews per game.


Pandas Apply:   0%|          | 0/17760 [00:00<?, ?it/s]

Generated dataset in file: ../data/processed-dataset/default/64k.preprocessed.csv
Generated dataset will be stored in file of prefix: 64k
I have a total of 2220 games with reviews. We take 8 reviews per game.


Pandas Apply:   0%|          | 0/17760 [00:00<?, ?it/s]

I have a total of 2220 games with reviews. We take 8 reviews per game.


Pandas Apply:   0%|          | 0/17759 [00:00<?, ?it/s]

Generated dataset in file: ../data/processed-dataset/full/64k.preprocessed.csv
Generated dataset will be stored in file of prefix: 64k_longest


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  batch["original_text"] = batch["comments"]


Pandas Apply:   0%|          | 0/16000 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  batch["comments"] = batch["comments"].swifter.apply(self.pre_process)


Generated dataset in file: ../data/processed-dataset/full/64k_longest.preprocessed.csv
Generated dataset will be stored in file of prefix: 256k
I have a total of 2220 games with reviews. We take 29 reviews per game.


Pandas Apply:   0%|          | 0/64380 [00:00<?, ?it/s]

I have a total of 2220 games with reviews. We take 29 reviews per game.


Pandas Apply:   0%|          | 0/64378 [00:00<?, ?it/s]

Generated dataset in file: ../data/processed-dataset/full/256k.preprocessed.csv
Generated dataset will be stored in file of prefix: 256k_longest


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  batch["original_text"] = batch["comments"]


Pandas Apply:   0%|          | 0/64000 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  batch["comments"] = batch["comments"].swifter.apply(self.pre_process)


Generated dataset in file: ../data/processed-dataset/full/256k_longest.preprocessed.csv
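
The figures in the log above are consistent with a balanced sampler that rounds the per-game quota up. This is an inference from the output, not a description of the actual `BggDatasetRandomBalancedSampler` code, and `reviews_per_game` is a hypothetical helper.

```python
import math


def reviews_per_game(sampler_target: int, n_games: int) -> int:
    """Per-game quota, rounded up so the target is reached or slightly exceeded."""
    return math.ceil(sampler_target / n_games)


N_GAMES = 2220  # number of games with reviews reported in the log

for sampler_target in (16000, 64000):
    quota = reviews_per_game(sampler_target, N_GAMES)
    # Prints the target, the quota per game and the resulting number of sampled reviews.
    print(sampler_target, quota, quota * N_GAMES)
```

Rounding up explains why the progress bars count 17760 and 64380 rows rather than exactly 16000 and 64000.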


See how the dataset changed:

# Custom Dataset Definition
To train the model we need a way to access the elements of our dataset. ```torch``` provides this through a custom ```Dataset``` class. <br>
This class is later loaded into a ```DataLoader``` that provides the batches of data to the model.

To generate valid inputs for the model we have to give our data a numerical representation. <br>
For this we use a ```WordEmbedding``` model that gives us the dictionary of the recognized words (the embeddings themselves are generated inside the model). <br>

### Vocab size?
In an ideal world every word of the corpus gets its own entry in our dictionary.<br>
There are two problems with this approach:
- It might not be computationally feasible
- Words that occur only a few times might introduce too much noise

Let's still see if it is generally feasible:

In [2]:
from core.utils import LoadCorpusUtility

# Just to see if it is feasible even if we won't be going for all the words at the end.
utility = LoadCorpusUtility(min_word_count=0)

corpora = [
    dict(file="../data/processed-dataset/default/64k.preprocessed.csv"),
    dict(file="../data/processed-dataset/full/64k.preprocessed.csv"),
    dict(file="../data/processed-dataset/full/256k.preprocessed.csv"),
    dict(file="../data/processed-dataset/full/256k_longest.preprocessed.csv"),
]

In [8]:

for corpus in corpora:
    dictionary = utility.make_corpus_dictionary(corpus['file'])
    corpus["full_dictionary"] = dictionary
    print(f"For corpus file: {corpus['file']}.\nWe have a total of {len(dictionary)} unique words.")

Pandas Apply:   0%|          | 0/69200 [00:00<?, ?it/s]

For corpus file: ../data/processed-dataset/default/64k.preprocessed.csv.
We have a total of 23408 unique words.


Pandas Apply:   0%|          | 0/80318 [00:00<?, ?it/s]

For corpus file: ../data/processed-dataset/full/64k.preprocessed.csv.
We have a total of 27473 unique words.


Pandas Apply:   0%|          | 0/288651 [00:00<?, ?it/s]

For corpus file: ../data/processed-dataset/full/256k.preprocessed.csv.
We have a total of 55288 unique words.


Pandas Apply:   0%|          | 0/736608 [00:00<?, ?it/s]

For corpus file: ../data/processed-dataset/full/256k_longest.preprocessed.csv.
We have a total of 91499 unique words.


Let's try with the 256k-longest full dataset:

In [3]:
from core.embeddings import WordEmbedding

emb_model = WordEmbedding(
    corpus_loader_utility=utility, embedding_size=128,
    target_model_file='../output/256k-full-longest.embeddings.model',
    corpus_file=corpora[3]['file'], min_word_count=1
)

len(emb_model.get_vocab())

INFO:gensim.utils:loading Word2Vec object from ..\output\256k-full-longest.embeddings.model
DEBUG:smart_open.smart_open_lib:{'uri': '..\\output\\256k-full-longest.embeddings.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'compression': 'infer_from_extension', 'transport_params': None}
INFO:gensim.utils:loading wv recursively from ..\output\256k-full-longest.embeddings.model.wv.* with mmap=None
INFO:gensim.utils:loading vectors from ..\output\256k-full-longest.embeddings.model.wv.vectors.npy with mmap=None
INFO:gensim.utils:loading syn1neg from ..\output\256k-full-longest.embeddings.model.syn1neg.npy with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:Word2Vec lifecycle event {'fname': '..\\output\\256k-full-longest.embeddings.model', 'datetime': '2024-12-19T17:37:00.515183', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 

91499

Even for the longest dataset it took us only 30s, so it is feasible on the computational side. <br>
We still want to limit the noise in our dataset, so we keep only the words that reach a certain frequency threshold. We use ```min_word_count=5```, as proposed in the paper.
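
To get a feeling for how the threshold shrinks the vocabulary, here is a small sketch on a toy corpus. The helper `vocabulary_size` is hypothetical, and treating the threshold as inclusive (a word is kept when its count is at least `min_word_count`) is an assumption about how the project applies it.

```python
from collections import Counter


def vocabulary_size(tokenized_docs: list[list[str]], min_word_count: int) -> int:
    """Number of distinct words occurring at least min_word_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return sum(1 for count in counts.values() if count >= min_word_count)


# Toy corpus, for illustration only.
docs = [
    ["great", "game", "great", "art"],
    ["game", "too", "long"],
    ["great", "game"],
]

for threshold in (1, 2, 3):
    print(threshold, vocabulary_size(docs, threshold))
```

Words seen only once disappear as soon as the threshold is raised above 1, which is the noise reduction we are after.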

> We might want to test different values of min_word_count and see how it affects the model

In [3]:
from core.embeddings import WordEmbedding

emb_model = WordEmbedding(
    corpus_loader_utility=LoadCorpusUtility(min_word_count=4), embedding_size=128,
    target_model_file='../output/64k-full.embeddings.model',
    corpus_file=corpora[1]['file'], min_word_count=1
)

vocabulary = emb_model.get_vocab()

INFO:gensim.utils:loading Word2Vec object from ..\output\64k-full.embeddings.model
DEBUG:smart_open.smart_open_lib:{'uri': '..\\output\\64k-full.embeddings.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'compression': 'infer_from_extension', 'transport_params': None}
INFO:gensim.utils:loading wv recursively from ..\output\64k-full.embeddings.model.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:Word2Vec lifecycle event {'fname': '..\\output\\64k-full.embeddings.model', 'datetime': '2024-12-19T17:45:35.729095', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'loaded'}


In [4]:
len(vocabulary)  # Is it a small vocabulary?

6790

## PositiveNegativeCommentGeneratorDataset
For each requested sample it also returns a number of negative samples, used for contrastive learning. <br>
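
The idea can be sketched with a small map-style dataset (an object exposing `__len__` and `__getitem__`). `NegativeSamplingDataset` below is a hypothetical, simplified stand-in written for illustration, not the project's `PositiveNegativeCommentGeneratorDataset`; drawing the negatives uniformly at random from the other samples is an assumption.

```python
import random


class NegativeSamplingDataset:
    """Item i is (sample i, negative_size other samples drawn at random)."""

    def __init__(self, samples, negative_size, seed=0):
        if negative_size >= len(samples):
            raise ValueError("negative_size must be smaller than the number of samples")
        self.samples = samples
        self.negative_size = negative_size
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # Candidate negatives are all indices except the positive one.
        candidates = [i for i in range(len(self.samples)) if i != index]
        negative_indices = self.rng.sample(candidates, self.negative_size)
        return self.samples[index], [self.samples[i] for i in negative_indices]


# Toy encoded sentences (lists of word ids), for illustration only.
toy_dataset = NegativeSamplingDataset([[1, 2], [3, 4], [5, 6], [7, 8]], negative_size=2)
positive, negatives = toy_dataset[0]
print(positive, negatives)
```

Building the candidate list on every access is linear in the dataset size; that is fine for a sketch but too slow for a large corpus.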


In [4]:
from core.dataset import PositiveNegativeCommentGeneratorDataset

ds = PositiveNegativeCommentGeneratorDataset(corpora[1]['file'], vocabulary, 10)

Loading dataset from file: ../data/processed-dataset/full/64k.preprocessed.csv
Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/80318 [00:00<?, ?it/s]

Max sequence length calculation in progress...
Max sequence length is:  206 . The limit is set to 256 tokens.
Padding sequences to length (256).


In [5]:
from torch.utils.data import DataLoader

lazy_dataloader = DataLoader(ds, batch_size=32, shuffle=True)

In [7]:
i = 11  # An arbitrary index used to show a sample's content
print(
    f"Sentence at index {i} original text is: `{ds.get_text_sentence(i)}`\n "
    f"Its numeric representation:\n {ds[i][0][0]}"
)

Sentence at index 11 original text is: `play twice Grune`
 Its numeric representation:
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   

### Sequence length truncation
The model is trained on sequences of fixed length. <br>
The chosen length must be reasonable: we can't just pad everything out for the sake of it. <br>

We want at least 95% of the reviews to fit without truncation. <br>
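
One way to pick such a length is to take the 95th percentile of the sequence lengths and then pad or truncate every sequence to it. The sketch below uses hypothetical helpers and toy lengths. Padding on the left is an assumption based on the printed sample above, which starts with zeros; keeping the first tokens when truncating is also an assumption.

```python
import math


def percentile_length(lengths: list[int], coverage_percent: int = 95) -> int:
    """Smallest length L such that at least coverage_percent% of sequences have length <= L."""
    ordered = sorted(lengths)
    rank = math.ceil(len(ordered) * coverage_percent / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def pad_or_truncate(ids: list[int], max_len: int, pad_id: int = 0) -> list[int]:
    """Keep at most max_len ids and left-pad the rest with pad_id."""
    ids = ids[:max_len]
    return [pad_id] * (max_len - len(ids)) + ids


# Toy length distribution: 19 short sequences and one outlier.
lengths = [3] * 5 + [5] * 10 + [8] * 4 + [206]
max_len = percentile_length(lengths)
print(max_len)
print(pad_or_truncate([7, 8, 9], 5))
print(pad_or_truncate(list(range(1, 11)), 5))
```

A single very long outlier no longer dictates the padded length of every other sequence.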

In [None]:
# We have 137 of the 50461 total reviews that are bigger than 256 tokens.
# This is less than 1% of the total reviews. We can truncate.

In [19]:
from core.dataset import PositiveNegativeCommentGeneratorDataset

# We can use short sequences to reduce the overhead of the model. Most of our sentences are quite short.
# On this dataset (64k-full), 20 is enough.
ds = PositiveNegativeCommentGeneratorDataset(corpora[1]['file'], vocabulary, 10, max_seq_length=20)

Loading dataset from file: ../data/processed-dataset/full/64k.preprocessed.csv
Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/80318 [00:00<?, ?it/s]

Max sequence length calculation in progress...
Max sequence length is:  206 . The limit is set to 20 tokens.
We loose information on 1496 points.This is 1.8625961801837696% of the dataset.
Padding sequences to length (206).


In [None]:
# TODO: save the "study" parameters in an .ini file for each dataset config.