In [None]:
# TODO clean up this file

In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"

## Regularization
>We hope to learn vector representations of the most representative aspects for a review dataset.
However, the aspect embedding matrix T may suffer from redundancy problems during training. [...] 
> The regularization term encourages orthogonality among the rows of the aspect embedding matrix T and penalizes redundancy between different aspect vectors
> ~ Ruidan

We use an orthogonal regularizer; a definition of the method can be found here: https://paperswithcode.com/method/orthogonal-regularization. <br/>
For the code we use the default implementation provided by Keras (https://keras.io/api/layers/regularizers/).
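
To make the quoted penalty concrete, here is a minimal NumPy sketch of an orthogonality term on the rows of an aspect matrix. It is an illustration only, not the Keras regularizer referenced above: the function name `orthogonality_penalty` and the choice of the Frobenius norm of `T_n @ T_n.T - I` (with `T_n` the row-normalized matrix) are assumptions made for this example.

```python
import numpy as np


def orthogonality_penalty(T: np.ndarray) -> float:
    """Frobenius norm of (T_n @ T_n.T - I), where T_n has unit-length rows."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    T_n = T / np.maximum(norms, 1e-12)  # guard against zero-length rows
    gram = T_n @ T_n.T                  # pairwise cosine similarities between rows
    return float(np.linalg.norm(gram - np.eye(T.shape[0])))


orthogonal_rows = np.eye(3, 5)      # three mutually orthogonal aspect vectors
redundant_rows = np.ones((3, 5))    # three identical aspect vectors

print(orthogonality_penalty(orthogonal_rows))
print(orthogonality_penalty(redundant_rows))
```

The penalty is zero when the rows are mutually orthogonal and grows as rows start to point in the same direction, which is the redundancy the quote describes.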

## Aspect Embedding Size
The aspect embedding matrix is what the model uses to infer aspects: each row is one aspect vector, and an aspect is interpreted through the words closest to it (?). <br />
We have to identify 7 actual aspects (luck, bookkeeping, downtime...), but that does not mean our matrix should be limited to 7 rows! What range of sizes to search is a good question and should be studied (which I may do later).

For the first try we set the aspect_size following this advice:
>The optimal number of rows is problem-dependent, so it’s crucial to: <br/>
> Start with a heuristic: Begin with 2–3x the number of aspects.
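
As a quick worked example of that heuristic, assuming the 7 known aspects mentioned above, the 2x to 3x rule gives the following candidate range. The variable names are only illustrative.

```python
n_known_aspects = 7  # aspects we expect to find (luck, bookkeeping, downtime, ...)

# Heuristic from the quote: search between 2x and 3x the number of aspects.
lower = 2 * n_known_aspects
upper = 3 * n_known_aspects
candidate_aspect_sizes = list(range(lower, upper + 1))

print(candidate_aspect_sizes)
```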

For **aspect extraction**, which involves identifying key aspects or topics in text, the best early stopping method depends on your approach:

### 1. Embedding-based Methods (e.g., Clustering Embeddings)
- **Silhouette Score**: Measure the separation and compactness of clusters. Stop when the score stabilizes.
- **Inertia/Distortion**: Track the sum of squared distances within clusters and stop when improvement flattens.
- **Centroid Movement**: Stop when the change in cluster centroids across iterations is minimal.
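
The last two criteria can be sketched in a few lines of NumPy. This is a hedged illustration: the helper names `inertia` and `centroid_shift` and the tolerance value are assumptions for this example, not part of any library used in this notebook.

```python
import numpy as np


def inertia(points: np.ndarray, centroids: np.ndarray, labels: np.ndarray) -> float:
    """Sum of squared distances between each point and its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))


def centroid_shift(old: np.ndarray, new: np.ndarray) -> float:
    """Largest Euclidean distance moved by any centroid between two iterations."""
    return float(np.max(np.linalg.norm(new - old, axis=1)))


# Toy data: two well-separated clusters on a line.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
old_centroids = np.array([[1.5, 0.0], [11.0, 0.0]])
new_centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

tolerance = 1e-3  # assumed threshold; tune it to the scale of the embeddings
converged = centroid_shift(old_centroids, new_centroids) < tolerance
```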

### 2. Topic Modeling (e.g., LDA)
- **Perplexity**: Monitor the perplexity on a held-out dataset and stop when it stops decreasing significantly.
- **Coherence Score**: Measure the semantic consistency of extracted topics and stop when it stabilizes.

### 3. Autoencoder-based Aspect Extraction
- **Reconstruction Loss**: Stop training when the validation reconstruction error no longer improves.
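
A patience-based rule is one common way to decide that the validation loss "no longer improves". The sketch below is plain Python; the function name `should_stop` and the default `patience` and `min_delta` values are assumptions for illustration, and the loss values are made up.

```python
def should_stop(history, patience=3, min_delta=1e-3):
    """Return True when the last `patience` losses did not beat the earlier best by `min_delta`."""
    if len(history) <= patience:
        return False  # not enough epochs to judge yet
    best_before = min(history[:-patience])
    recent_best = min(history[-patience:])
    return recent_best > best_before - min_delta


plateaued = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]   # illustrative validation losses
improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]

print(should_stop(plateaued), should_stop(improving))
```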

### 4. Qualitative Evaluation (if feasible)
- Periodically inspect extracted aspects for meaningfulness and diversity to decide on stopping.

For **aspect extraction**, combining an automated metric (like coherence score or silhouette score) with manual inspection often yields the best results.


## Parameters scouting

In [2]:
aspect_size = 2 * 7 + 2  # 16 seems reasonable. TODO: fine-tune this parameter.

In [2]:
from core.hp_tuning import RandomTunableDiscreteParameter, RandomTunableSteppedParameter

## Parameters scouting. We scout on our main dataset.
corpus_file = "../data/processed-dataset/full/256k.preprocessed.csv"

# We do random search. todo wrap around?
hp_aspect_size = RandomTunableSteppedParameter(14, 20, 1)
hp_embedding_size = RandomTunableDiscreteParameter([100, 128, 150, 200, 256])
hp_aspect_embedding_size = RandomTunableDiscreteParameter([100, 128, 150, 200, 256])
hp_epochs = RandomTunableDiscreteParameter([5, 7, 10, 14, 20])
hp_batch_size = RandomTunableDiscreteParameter([32, 64, 128])
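
For readers without access to `core.hp_tuning`, the idea behind the two parameter types can be sketched with the standard library. The classes below (`DiscreteChoice`, `SteppedRange`) are hypothetical stand-ins, not the project's implementation; in particular, treating the upper bound as inclusive is an assumption.

```python
import random


class DiscreteChoice:
    """Draws uniformly from an explicit list of values."""

    def __init__(self, values, seed=None):
        self.values = list(values)
        self._rng = random.Random(seed)

    def get_value(self):
        return self._rng.choice(self.values)


class SteppedRange(DiscreteChoice):
    """Draws uniformly from low, low + step, ... up to high (inclusive, by assumption)."""

    def __init__(self, low, high, step, seed=None):
        super().__init__(range(low, high + 1, step), seed=seed)


aspect_sizes = SteppedRange(14, 20, 1, seed=0)
batch_sizes = DiscreteChoice([32, 64, 128], seed=0)

# One random-search trial: every hyper-parameter is sampled independently.
trial = {"aspect_size": aspect_sizes.get_value(), "batch_size": batch_sizes.get_value()}
print(trial)
```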

In [3]:
from core.train import AbaeModelManager, AbaeModelConfiguration
from core.hp_tuning import KFoldDatasetWrapper
from core.dataset import PositiveNegativeCommentGeneratorDataset

k_fold = KFoldDatasetWrapper(k=5)

n = 15  # We try 15 different test configurations
for i in range(n):
    # todo dump the configuration
    config = AbaeModelConfiguration(
        corpus_file=corpus_file, model_name=f"hp_{i}",
        embedding_size=hp_embedding_size.get_value(),
        aspect_embedding_size=hp_aspect_embedding_size.get_value(),
        aspect_size=hp_aspect_size.get_value()
    )

    print("Configuration:", config)

    epochs = hp_epochs.get_value()
    print("Epochs:", epochs)

    batch_size = hp_batch_size.get_value()
    print("Batch size:", batch_size)

    manager = AbaeModelManager(config)

    train_dataset = PositiveNegativeCommentGeneratorDataset(
        vocabulary=manager.embedding_model.vocabulary(),
        csv_dataset_path=config.corpus_file, negative_size=15
    )

    k_fold.load_data(train_dataset)
    k_fold.run_k_fold_cv(manager, batch_size, epochs)

Configuration: AbaeModelConfiguration(corpus_file='../data/processed-dataset/full/256k.preprocessed.csv', model_name='hp_0', aspect_size=17, max_vocab_size=None, embedding_size=200, aspect_embedding_size=200, max_sequence_length=256, negative_sample_size=15, output_path='./output')
Epochs: 20
Batch size: 64


KeyboardInterrupt: 
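
The loop above relies on `KFoldDatasetWrapper`, whose internals are not shown here. As a rough idea of what a k-fold split does, the following plain-Python sketch partitions sample indices into `k` validation folds. The function `k_fold_indices` is hypothetical and makes no claim about how the wrapper is implemented.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that every sample is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, len(train_idx), val_idx)
```

In practice the indices are usually shuffled once before splitting, so that folds do not follow the order of the file.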

## Process:
We have an ```AbaeModelConfiguration``` dataclass for each config we want to train on. <br>
In the notebook we will handle things manually, but there is a script (```todo```) to run them all at once.

In [4]:
from core.train import AbaeModelConfiguration

# TODO: check whether to pass other parameters.
configs = [
    AbaeModelConfiguration(
        corpus_file="../data/processed-dataset/default/64k.preprocessed.csv",
        model_name="abae.default.64k", aspect_size=aspect_size
    ),

    AbaeModelConfiguration(
        corpus_file="../data/processed-dataset/full/64k.preprocessed.csv",
        model_name="abae.full.64k", aspect_size=aspect_size
    ),

    AbaeModelConfiguration(
        corpus_file="../data/processed-dataset/full/256k.preprocessed.csv",
        model_name="abae.full.256k", aspect_size=aspect_size
    ),

    AbaeModelConfiguration(
        corpus_file="../data/processed-dataset/full/512k.preprocessed.csv",
        model_name="abae.full.512k", aspect_size=aspect_size
    ),
]

### 64k - Full

In [5]:
config = configs[1]

In [6]:
from core.train import AbaeModelManager, AbaeModelConfiguration

manager = AbaeModelManager(config)

Pandas Apply:   0%|          | 0/80286 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/80286 [00:00<?, ?it/s]

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 69874 words, keeping 5554 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #20000, processed 144092 words, keeping 7208 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #30000, processed 218736 words, keeping 8081 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #40000, processed 295254 words, keeping 8600 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #50000, processed 372434 words, keeping 8811 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #60000, processed 445937 words, keeping 8934 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #70000, processed 517713 words, keeping 8976 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #80000, processed 593358 words, keeping 9003 word

exceptions must derive from BaseException


INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.utils:Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-01-02T10:52:22.696723', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'build_vocab'}
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'training model with 8 workers on 9004 vocabulary and 128 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-01-02T10:52:22.696723', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'train'}
DEBUG:gensim.models.word2vec:job loop exiting, total 60 jobs
DEBUG:gensim.models.word2vec:worker exiting, processed 7 jobs
DEBUG:gensim.models.word2vec:worker thread finished; awaiting finish of 7 more threads
DEBUG:gensim.models.word2vec:wor

exceptions must derive from BaseException


In [7]:
train_model = manager.prepare_training_model('adam')

File not found: filepath=./output/abae.full.64k/abae.full.64k.keras. Please ensure the file is an accessible `.keras` zip file.


  super(WeightedAspectEmb, self).__init__(**kwargs)


In [8]:
from core.dataset import PositiveNegativeCommentGeneratorDataset
from torch.utils.data import DataLoader

train_dataset = PositiveNegativeCommentGeneratorDataset(
    vocabulary=manager.embedding_model.vocabulary(),
    csv_dataset_path=config.corpus_file, negative_size=15
)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

Loading dataset from file: ../data/processed-dataset/full/64k.preprocessed.csv.
Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/80286 [00:00<?, ?it/s]

Max sequence length calculation in progress...
Max sequence length is:  206 . The limit is set to 256 tokens.
Padding sequences to length (256).


In [9]:
train_model.fit(train_dataloader, epochs=7)  # batch size is already fixed by the DataLoader

Epoch 1/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 260s 207ms/step - loss: 14.6209 - max_margin_loss: 14.6209
Epoch 2/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 261s 208ms/step - loss: 10.0960 - max_margin_loss: 10.0960
Epoch 3/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 264s 210ms/step - loss: 8.2702 - max_margin_loss: 8.2702
Epoch 4/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 264s 210ms/step - loss: 7.8211 - max_margin_loss: 7.8211
Epoch 5/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 261s 208ms/step - loss: 7.6295 - max_margin_loss: 7.6295
Epoch 6/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 264s 210ms/step - loss: 7.5364 - max_margin_loss: 7.5364
Epoch 7/7
1255/1255 ━━━━━━━━━━━━━━━━━━━━ 263s 210ms/step - loss: 7.4515 - max_margin_loss: 7.4515


<keras.src.callbacks.history.History at 0x1be4a3f7410>

In [15]:
manager.persist_model()

DEBUG:h5py._conv:Creating converter from 5 to 3


In [10]:
import torch

torch.cuda.get_device_name(0)

'NVIDIA GeForce RTX 3070 Ti'

We have too much data for my little PC:

> - **Sampling**: Randomly select a subset of your data that represents the overall distribution of aspects. This will help maintain diversity while reducing the size.
> - **Filtering**: Focus on the most informative or high-quality samples. For example, if certain reviews are very short, irrelevant, or don't have useful context for aspect extraction, remove them.
> - **Focus on diversity**: If you reduce the data, make sure the remaining dataset is still representative of the diversity of aspects you're trying to capture.
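
A minimal sketch of the first two suggestions, using only the standard library. The function name `reduce_corpus`, the token threshold and the sample size are assumptions for illustration; note that uniform random sampling does not by itself guarantee that every aspect stays represented.

```python
import random


def reduce_corpus(reviews, min_tokens=5, sample_size=1000, seed=42):
    """Drop very short reviews, then draw a uniform random sample of the rest."""
    kept = [review for review in reviews if len(review.split()) >= min_tokens]
    if len(kept) <= sample_size:
        return kept  # nothing to sample: the filtered corpus is already small enough
    return random.Random(seed).sample(kept, sample_size)


# Made-up reviews: 10 one-word comments and 50 longer ones.
reviews = ["good"] * 10 + [f"this game has a lot of luck {i}" for i in range(50)]
subset = reduce_corpus(reviews, min_tokens=5, sample_size=20)
print(len(subset))
```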

In [12]:
# How to Address Issues (If Any):
# Introduce Hard Negatives:
# Instead of randomly selecting negative samples, use hard negatives—examples that are more challenging to distinguish from positive pairs. This keeps the max-margin loss informative and prevents the model from converging too quickly.

# Regularization:
# Apply regularization (e.g., L2 regularization) to prevent overfitting and ensure the model generalizes well.

# Early Stopping:
# If the loss plateaus and aspect quality is satisfactory, consider using early stopping to avoid unnecessary training.
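
One possible reading of "hard negatives" is to pick, for each positive sentence, the candidate negatives whose embeddings are most similar to it. The NumPy sketch below illustrates that idea; `hardest_negatives` is a hypothetical helper, and cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np


def hardest_negatives(anchor: np.ndarray, candidates: np.ndarray, n: int) -> np.ndarray:
    """Indices of the n candidate rows with the highest cosine similarity to the anchor."""
    anchor_unit = anchor / np.linalg.norm(anchor)
    candidate_units = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    similarities = candidate_units @ anchor_unit
    return np.argsort(-similarities)[:n]  # most similar first


anchor = np.array([1.0, 0.0])            # toy embedding of the positive sentence
candidates = np.array([
    [1.0, 0.1],    # almost the same direction: a hard negative
    [0.0, 1.0],    # orthogonal: easy
    [-1.0, 0.0],   # opposite: easiest
    [0.9, 0.5],    # fairly close: hard
])

print(hardest_negatives(anchor, candidates, n=2))
```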

### Hyper-parameters
These should have been discussed earlier. <br>
We could do hyperparameter optimization, but how do we 'validate' our model? <br>