In [2]:
corpus_file = "../../data/corpus.csv"

## Pre-processing
We produce two datasets: a smaller "scouting" dataset used for hyperparameter tuning, and a larger one used to train the final model. For each dataset we generate an ABAE-ready version and a CoNLL version (required by CAt).

In [3]:
import pandas as pd

# Read the game names to replace with the <GAME_NAME> tag
game_names = pd.read_csv("../../resources/2024-08-18.csv")['Name']
game_names = pd.concat([game_names, pd.Series(["Quick", "Catan"])], ignore_index=True).tolist()
print(f"We have a total of {len(game_names)} different game names.")

We have a total of 25901 different game names.
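
The masking step itself happens inside `pre_processing`, which is not shown here. As a rough illustration only, replacing known game names with the `<GAME_NAME>` tag could look like the sketch below. The helper names `build_game_name_pattern` and `replace_game_names` are hypothetical, and the whole-word, case-sensitive matching is an assumption rather than the project's actual behaviour.

```python
import re


def build_game_name_pattern(names):
    """Compile one regex that matches any of the given names as a whole word."""
    # Longest names first, so "Catan Junior" is preferred over "Catan".
    ordered = sorted({n for n in names if n}, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    # Lookarounds prevent matches inside longer words such as "Catania".
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def replace_game_names(text, pattern, tag="<GAME_NAME>"):
    """Replace every matched game name with the placeholder tag."""
    return pattern.sub(tag, text)


pattern = build_game_name_pattern(["Catan", "Quick", "Catan Junior"])
masked = replace_game_names("I prefer Catan Junior to plain Catan.", pattern)
untouched = replace_game_names("We played it in Catania.", pattern)
print(masked)
print(untouched)
```

Sorting by length before building the alternation matters because Python's regex alternation takes the first alternative that matches, not the longest one.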


In [3]:
from pre_processing import PreProcessingServiceFactory

test_fraction = 0.25

ps = PreProcessingServiceFactory.default(game_names, "./output/default", test_fraction)
sentence_ps = PreProcessingServiceFactory.default_sentences(game_names, "./output/default_sentences", test_fraction)
pos_tag_ps = PreProcessingServiceFactory.pos_tagged(game_names, "./output/pos_tagged", test_fraction)
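
How the services use `test_fraction` is internal to `pre_processing`. A minimal sketch of a reproducible hold-out split, assuming the fraction is simply carved out of the processed rows, might look like this. `split_rows` and its `seed` parameter are hypothetical and only the standard library is used.

```python
import random


def split_rows(rows, test_fraction, seed=0):
    """Shuffle rows reproducibly and hold out `test_fraction` of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(80), test_fraction=0.25, seed=2)
print(len(train_rows), len(test_rows))
```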

In [4]:
pos_tag_ps.process("This is a neat trick I know, do you like it?")

'neat__adj trick__noun know__verb like__verb'
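
The output above suggests that only adjectives, nouns and verbs survive, each rendered as `token__pos`. The sketch below reproduces that formatting from tokens that are already tagged. The tags are assigned by hand in Universal POS style, and both `format_tagged` and the `KEPT_POS` set are assumptions for illustration. The real service presumably relies on a tagger and may filter differently.

```python
# Assumed set of content-word tags to keep (Universal POS style).
KEPT_POS = {"ADJ", "NOUN", "VERB"}


def format_tagged(tagged_tokens, kept_pos=KEPT_POS):
    """Join (token, pos) pairs as 'token__pos', keeping only selected tags."""
    parts = [
        f"{token.lower()}__{pos.lower()}"
        for token, pos in tagged_tokens
        if pos in kept_pos
    ]
    return " ".join(parts)


# Hand-tagged version of the example sentence (not produced by a tagger).
tagged = [
    ("This", "PRON"), ("is", "AUX"), ("a", "DET"), ("neat", "ADJ"),
    ("trick", "NOUN"), ("I", "PRON"), ("know", "VERB"), (",", "PUNCT"),
    ("do", "AUX"), ("you", "PRON"), ("like", "VERB"), ("it", "PRON"),
    ("?", "PUNCT"),
]
formatted = format_tagged(tagged)
print(formatted)
```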

In [5]:
from dataset_sampler import BggDatasetRandomBalancedSampler

paths = []
# 80k, 310k -> 80k will be for discovery while 310k for the actual model training
for size, random_state in zip([80000, 310000], [2, 8]):
    for service in [ps, sentence_ps, pos_tag_ps]:
        sampler = BggDatasetRandomBalancedSampler(int(size / 10), corpus_file, random_state)
        _, path = service.process_dataset(int(size), sampler)

        paths.append(path)

I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 5885/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 11776/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 17664/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 23501/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 29239/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 35072/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 40852/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 46721/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 52398/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 58059/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 63798/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 69524/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 75228/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 80853/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default/pre_processed.80k.test.csv
File created with success as ./output/default/pre_processed.80k.csv
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 13611/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 27083/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 40110/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 52890/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 65626/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 78534/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 91186/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default_sentences/pre_processed.80k.test.csv
File created with success as ./output/default_sentences/pre_processed.80k.csv
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 5886/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 11779/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 17670/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 23511/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 29253/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 35092/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 40879/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 46755/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 52442/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 58110/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 63857/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 69588/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 75303/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.
ds_size: 80935/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/pos_tagged/pre_processed.80k.test.csv
File created with success as ./output/pos_tagged/pre_processed.80k.csv
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 20590/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 40943/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 61045/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 81005/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 100723/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 120193/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 139659/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 158760/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 177863/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 196901/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 215479/310000
I have a total of 2218 games with reviews. We take 14 reviews per game.
ds_size: 234051/310000
I have a total of 2211 games with reviews. We take 15 reviews per game.
ds_size: 253737/310000
I have a total of 2206 games with reviews. We take 15 reviews per game.
ds_size: 273279/310000
I have a total of 2198 games with reviews. We take 15 reviews per game.
ds_size: 292741/310000
I have a total of 2188 games with reviews. We take 15 reviews per game.
ds_size: 311703/310000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/pos_tagged/pre_processed.310k.test.csv
File created with success as ./output/pos_tagged/pre_processed.310k.csv
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 46038/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 91783/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 136444/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 180728/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 224945/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 268431/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 311863/310000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default_sentences/pre_processed.310k.test.csv
File created with success as ./output/default_sentences/pre_processed.310k.csv
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 20590/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 40943/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 61045/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 81005/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 100723/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 120193/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 139659/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 158760/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 177863/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 196901/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 215479/310000
I have a total of 2218 games with reviews. We take 14 reviews per game.
ds_size: 234051/310000
I have a total of 2211 games with reviews. We take 15 reviews per game.
ds_size: 253737/310000
I have a total of 2206 games with reviews. We take 15 reviews per game.
ds_size: 273279/310000
I have a total of 2198 games with reviews. We take 15 reviews per game.
ds_size: 292741/310000
I have a total of 2188 games with reviews. We take 15 reviews per game.
ds_size: 311703/310000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/pos_tagged/pre_processed.310k.test.csv
File created with success as ./output/pos_tagged/pre_processed.310k.csv


The generated dataset is almost ready for use in the CAt model. ABAE, on the other hand, does not require this much
information. <br>
To speed things up, the two variants are generated at the same time.
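
The per-game quota printed in the logs ("We take 4 reviews per game", "We take 14 reviews per game", later 15) is consistent with rounding up the per-round sample size (`size / 10`) divided by the number of games that still have reviews. This is inferred from the log output, not from the sampler's source, so the sketch below is a worked check of that arithmetic under that assumption. `reviews_per_game` is a hypothetical helper.

```python
import math


def reviews_per_game(batch_size, n_games):
    """Assumed quota: spread one sampling round evenly over games, rounded up."""
    return math.ceil(batch_size / n_games)


# (target size, games with reviews left) -> quota reported in the logs
logged = {
    (80000, 2220): 4,
    (310000, 2220): 14,
    (310000, 2218): 14,
    (310000, 2211): 15,
    (310000, 2188): 15,
}
computed = {
    key: reviews_per_game(key[0] // 10, key[1]) for key in logged
}
print(computed == logged)
```

Under this reading, the quota rises from 14 to 15 in the 310k runs because some games run out of reviews, which leaves fewer games to share the same per-round sample size.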

In [5]:
from dataset_sampler import BggDatasetRandomBalancedSampler
from pre_processing import PreProcessingServiceFactory

test_fraction = 0.25
ps = PreProcessingServiceFactory.default(game_names, "./output/default", test_fraction)

paths = []
# Only the 310k dataset (for the actual model training) is processed here, with the default service
for size, random_state in zip([310000], [8]):
    for service in [ps]:
        sampler = BggDatasetRandomBalancedSampler(int(size / 10), corpus_file, random_state)
        _, path = service.process_dataset(int(size), sampler)

        paths.append(path)

I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 20580/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 40911/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 60986/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 80918/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 100619/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 120046/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 139474/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 158542/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 177614/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 196624/310000
I have a total of 2220 games with reviews. We take 14 reviews per game.
ds_size: 215166/310000
I have a total of 2218 games with reviews. We take 14 reviews per game.
ds_size: 233704/310000
I have a total of 2211 games with reviews. We take 15 reviews per game.
ds_size: 253344/310000
I have a total of 2206 games with reviews. We take 15 reviews per game.
ds_size: 272843/310000
I have a total of 2198 games with reviews. We take 15 reviews per game.
ds_size: 292262/310000
I have a total of 2188 games with reviews. We take 15 reviews per game.
ds_size: 311178/310000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default/pre_processed.310k.test.csv
File created with success as ./output/default/pre_processed.310k.csv
