In [1]:
corpus_file = "../../data/corpus.csv"
test_fraction = 0.25  # Fraction of samples used to generate the test set.

random_seed = 8
# sizes = [80000, 310000]
sizes = [80000]

In [2]:
from pre_processing import PreProcessingServiceFactory
from main.dataset.dataset_sampler import BggDatasetRandomBalancedSampler
import pandas as pd

## Pre-processing
We produce two datasets: a smaller "scouting" one for the first experiments and a larger one for the final models, which are eventually hyperparameter-tuned. <br>
Both are sampled with the same seed, so the smaller dataset is contained in the larger one. <br>
The 310k variant therefore only adds samples to the existing 80k one.
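
The containment claim can be illustrated with a toy example. This is only a sketch: it assumes a "shuffle with a fixed seed, then take a prefix" scheme, which may differ from what `BggDatasetRandomBalancedSampler` actually does. The helper `sample_ids` and the toy ID range are hypothetical.

```python
import random


def sample_ids(ids, size, seed):
    """Shuffle the IDs with a fixed seed and keep the first `size` of them.

    Hypothetical helper, not part of the project code.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:size]


ids = range(1000)  # toy population standing in for review IDs
small = sample_ids(ids, 80, seed=8)
large = sample_ids(ids, 310, seed=8)

# Same seed and same population give the same permutation,
# so the smaller sample is a prefix of the larger one.
assert large[:80] == small
assert set(small) <= set(large)
```

With a different seed for the larger sample the two permutations would differ, and the subset property would no longer be guaranteed.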

In [3]:
# Read the game names to replace with the <GAME_NAME> tag
game_names = pd.read_csv("../../resources/2024-08-18.csv")['Name']
game_names = pd.concat([game_names, pd.Series(["Quick", "Catan"])], ignore_index=True).tolist()
print(f"We have a total of {len(game_names)} different game names.")

We have a total of 25901 different game names.
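
For illustration, replacing game names with the `<GAME_NAME>` tag could look like the sketch below. It is an assumption about the approach, not the code used by `PreProcessingServiceFactory`; the helpers `build_game_name_pattern` and `tag_game_names` are hypothetical, and matching is case-sensitive here.

```python
import re


def build_game_name_pattern(names):
    """Compile one regex that matches any of the given names as whole words."""
    # Longest names first, so "Catan: Seafarers" wins over "Catan".
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Lookarounds instead of \b, because names may start or end with punctuation.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def tag_game_names(text, pattern, tag="<GAME_NAME>"):
    """Replace every matched game name in `text` with `tag`."""
    return pattern.sub(tag, text)


pattern = build_game_name_pattern(["Catan", "Catan: Seafarers", "Quick"])
tagged = tag_game_names("I prefer Catan: Seafarers over plain Catan.", pattern)
print(tagged)
```

Names that are also common words (such as "Quick") will be tagged wherever they appear with that capitalization, which is a trade-off to keep in mind when extending the name list.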


In [4]:
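def process(service, ds_sizes: list, random_state: int, corpus_file_path: str) -> None:
    """Build and store one pre-processed dataset per requested size."""
    for size in ds_sizes:
        # The sampler is configured with one tenth of the target size;
        # the service receives the full target size.
        sampler = BggDatasetRandomBalancedSampler(int(size / 10), corpus_file_path, random_state)
        service.process_dataset(int(size), sampler)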
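
The `ds_size: .../80000` lines logged by the cells below suggest that the service draws a batch, filters it, and repeats until the target size is reached. The following is a minimal sketch of such a loop under that assumption; it is not the implementation of `process_dataset`, and `accumulate_until`, `draw_batch`, `keep`, and `max_rounds` are hypothetical names.

```python
def accumulate_until(target_size, draw_batch, keep, max_rounds=1000):
    """Draw batches until at least `target_size` kept items are collected.

    `draw_batch(round_index)` returns an iterable of candidate items and
    `keep(item)` decides whether an item survives pre-processing.
    `max_rounds` guards against looping forever if nothing is ever kept.
    """
    dataset = []
    for round_index in range(max_rounds):
        if len(dataset) >= target_size:
            break
        batch = draw_batch(round_index)
        dataset.extend(item for item in batch if keep(item))
    return dataset


# Toy run: each round offers 10 integers, of which the 5 even ones are kept.
result = accumulate_until(
    target_size=12,
    draw_batch=lambda r: range(r * 10, r * 10 + 10),
    keep=lambda x: x % 2 == 0,
)
print(len(result))
```

Because whole batches are appended, the final size can overshoot the target, which would be consistent with the last logged counter exceeding 80000.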

## Default

In [5]:
ps = PreProcessingServiceFactory.default(game_names, "./output/default", test_fraction)
process(ps, sizes, random_seed, corpus_file)

I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 5912/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 11787/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 17654/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 23483/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 29281/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 35132/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 40991/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 46736/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 52470/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 58123/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 63860/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 69539/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 75238/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 80920/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default/pre_processed.80k.test.csv
File created with success as ./output/default/pre_processed.80k.csv


## Sentence-split

In [6]:
sentence_ps = PreProcessingServiceFactory.default_sentences(game_names, "./output/default_sentences", test_fraction)
process(sentence_ps, sizes, random_seed, corpus_file)

I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 13155/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 26182/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 39292/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 52171/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 64964/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 78379/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 91400/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/default_sentences/pre_processed.80k.test.csv
File created with success as ./output/default_sentences/pre_processed.80k.csv


## POS-tagged

In [7]:
pos_tag_ps = PreProcessingServiceFactory.pos_tagged(game_names, "./output/pos_tagged", test_fraction)
process(pos_tag_ps, sizes, random_seed, corpus_file)

I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 5913/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 11791/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 17662/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 23499/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 29299/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 35156/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 41020/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 46773/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 52516/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 58178/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 63926/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 69614/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 75330/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 81017/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/pos_tagged/pre_processed.80k.test.csv
File created with success as ./output/pos_tagged/pre_processed.80k.csv


Now a POS-tagged variant at the sentence level.

In [8]:
out_path = "./output/pos_tagged_sentence_level"
pos_tag_sentence_ps = PreProcessingServiceFactory.pos_tagged_sentence_level(game_names, out_path, test_fraction)
process(pos_tag_sentence_ps, sizes, random_seed, corpus_file)

I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 13156/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 26189/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 39309/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 52201/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 65007/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 78438/80000
I have a total of 2220 games with reviews. We take 4 reviews per game.


Pandas Apply:   0%|          | 0/8880 [00:00<?, ?it/s]

ds_size: 91476/80000
Processing terminated. We are storing the work ready file now...
Test subset created with success as ./output/pos_tagged_sentence_level/pre_processed.80k.test.csv
File created with success as ./output/pos_tagged_sentence_level/pre_processed.80k.csv
