## Preprocessing
Unlike with LDA, we should not alter the sentence structure too much: ABAE works on word embeddings and needs the word sequence to weight each term by its surrounding context. One question remains:

**Should we work at the sentence level or on full reviews? Let's start with a simple comparison.**
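For context, a sentence-level corpus can be derived from a review-level one by splitting each comment on sentence-ending punctuation before the rest of the preprocessing runs. The sketch below is only an illustration: `split_into_sentences` and its regular expression are assumptions, not the code that produced the `default_sentences` dataset used further down.

```python
import re

# Hypothetical helper: naive split after '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(review: str) -> list[str]:
    """Return the non-empty sentences of a raw review."""
    return [s.strip() for s in _SENTENCE_END.split(review) if s.strip()]


reviews = [
    "Great artwork. The rulebook is vague!",
    "Too long for what it offers",
]

# One entry per sentence instead of one entry per review.
sentences = [s for review in reviews for s in split_into_sentences(review)]
print(sentences)
```

A real pipeline would likely use a proper sentence tokenizer, since abbreviations and ellipses break a regex like this one.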

### Full-reviews

In [None]:
from main.abae.model_manager import ABAEManagerConfig, ABAEDefaultManagerFactory

corpus = "../dataset/output/default/pre_processed.80k.csv"
default_config = ABAEManagerConfig(model_name="abae_default_ds", corpus_file_path=corpus)
abae_manager = ABAEDefaultManagerFactory().factory_method(default_config)

In [2]:
history, _ = abae_manager.train(corpus)
model = abae_manager.get_compiled_model(refresh=False)

474/474 ━━━━━━━━━━━━━━━━━━━━ 100s 211ms/step - loss: 4.6942 - max_margin_loss: 4.6932
Epoch 12/15
474/474 ━━━━━━━━━━━━━━━━━━━━ 98s 206ms/step - loss: 4.6925 - max_margin_loss: 4.6915
Epoch 13/15
474/474 ━━━━━━━━━━━━━━━━━━━━ 96s 202ms/step - loss: 4.6794 - max_margin_loss: 4.6784
Epoch 14/15
474/474 ━━━━━━━━━━━━━━━━━━━━ 99s 208ms/step - loss: 4.6635 - max_margin_loss: 4.6625
Epoch 15/15
474/474 ━━━━━━━━━━━━━━━━━━━━ 100s 210ms/step - loss: 4.6614 - max_margin_loss: 4.6604


Latest run result:
```
Max Margin loss: [4.6614, 4.6604]
```
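As a reminder of what this number measures: in the ABAE formulation, the max-margin loss is a hinge loss that pushes the reconstructed sentence vector closer to the sentence embedding than to the embeddings of randomly sampled negative sentences. Below is a minimal NumPy sketch for a single sample, assuming a margin of 1 and dot-product similarity. The function name and toy vectors are illustrative, and the project's own `max_margin` layer may differ in detail.

```python
import numpy as np


def max_margin_loss(z, r, negatives, margin=1.0):
    """Hinge loss for one sample.

    z: sentence embedding, shape (d,)
    r: reconstructed sentence embedding, shape (d,)
    negatives: embeddings of the negative samples, shape (m, d)
    """
    pos = r @ z          # similarity to the true sentence
    neg = negatives @ r  # similarity to each negative, shape (m,)
    return np.maximum(0.0, margin - pos + neg).sum()


z = np.array([1.0, 0.0])
r = np.array([1.0, 0.0])            # perfect reconstruction of z
negatives = np.array([[0.0, 1.0],   # orthogonal to r: contributes 0
                      [1.0, 0.0]])  # identical to z: contributes the full margin
print(max_margin_loss(z, r, negatives))
```

Because the hinge terms are summed over the negatives, the absolute value depends on the negative sample size, so losses are only comparable between runs that share that setting.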

In [3]:
from evaluation import ABAEEvaluationProcessor

inv_vocab = abae_manager.generator.emb_model.model.wv.index_to_key
processor = ABAEEvaluationProcessor.generate_for_model(model, inv_vocab)

In [15]:
from main.abae.dataset import PositiveNegativeABAEDataset
import pandas as pd
from torch.utils.data import DataLoader

test_corpus_path = "../dataset/output/default/pre_processed.80k.test.csv"
df = pd.read_csv(test_corpus_path)

npmi_coh = processor.c_npmi_coherence_model(top_n=10, ds=df['comments'].apply(lambda x: x.split(' ')))
npmi_coherence = npmi_coh.get_coherence()

cv_coh = processor.c_v_coherence_model(top_n=100, ds=df['comments'].apply(lambda x: x.split(' ')))
cv_coherence = cv_coh.get_coherence()


vocabulary = abae_manager.generator.emb_model.vocabulary()
max_seq_len = default_config.max_seq_len
negative_sample_size = default_config.negative_sample_size
test_ds = PositiveNegativeABAEDataset(df, vocabulary, max_seq_len, negative_sample_size)

res = model.evaluate(DataLoader(test_ds, batch_size=default_config.batch_size))
print(f"Max margin reconstruction result: {res}")

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary<12053 unique tokens: ['<game_name>', 'abstract', 'add', 'art', 'bad']...>
INFO:gensim.corpora.dictionary:adding document #20000 to Dictionary<16753 unique tokens: ['<game_name>', 'abstract', 'add', 'art', 'bad']...>
INFO:gensim.corpora.dictionary:built Dictionary<16850 unique tokens: ['<game_name>', 'abstract', 'add', 'art', 'bad']...> from 20213 documents (total 390827 corpus positions)
DEBUG:gensim.utils:starting a new internal lifecycle event log for Dictionary
INFO:gensim.utils:Dictionary lifecycle event {'msg': "built Dictionary<16850 unique tokens: ['<game_name>', 'abstract', 'add', 'art', 'bad']...> from 20213 documents (total 390827 corpus positions)", 'datetime': '2025-03-10T01:00:04.936508', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]', 'platform': 'Windows-11

Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/20213 [00:00<?, ?it/s]

Max sequence length calculation in progress...
We loose information on 464(2.295552367288379% of ds).
158/158 ━━━━━━━━━━━━━━━━━━━━ 19s 120ms/step - loss: 4.7868 - max_margin_loss: 4.7858
Max margin reconstruction result: [4.745960235595703, 4.744936943054199]


In [16]:
print(f"NPMI coherence: {npmi_coherence}")
print(f"CV score: {cv_coherence}")
print(f"Max margin reconstruction result: {res}")

NPMI coherence: -0.23037988688672537
CV score: 0.5646751917101251
Max margin reconstruction result: [4.745960235595703, 4.744936943054199]


Results of the latest run:
```
NPMI coherence: -0.23037988688672537
CV score: 0.5646751917101251
Max margin reconstruction result: [4.745960235595703, 4.744936943054199]
```

In [22]:
list(processor.extract_top_k_words(10, 10))

[('instruction', tensor(0.6540, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('manual', tensor(0.6516, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('page', tensor(0.6307, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('solution', tensor(0.6230, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('reference', tensor(0.5907, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('query', tensor(0.5859, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('vague', tensor(0.5815, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('wiki', tensor(0.5806, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('decipher', tensor(0.5799, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('rulebook', tensor(0.5764, device='cuda:0', grad_fn=<SelectBackward0>))]

### Sentence-split reviews

In [None]:
from main.abae.model_manager import ABAEManagerConfig, ABAEDefaultManagerFactory

corpus = "../dataset/output/default_sentences/pre_processed.80k.csv"
default_config = ABAEManagerConfig(model_name="abae_sent_ds", corpus_file_path=corpus)
abae_manager = ABAEDefaultManagerFactory().factory_method(default_config)

In [None]:
history, _ = abae_manager.train(corpus)
model = abae_manager.get_compiled_model(refresh=False)

In [26]:
# print history
history.history

{'loss': [8.786524772644043,
  5.993138313293457,
  5.584963321685791,
  5.477548122406006,
  5.42295503616333,
  5.371152877807617,
  5.343845844268799,
  5.310521125793457,
  5.257896423339844,
  5.230459690093994,
  5.210410118103027,
  5.190746307373047,
  5.181276321411133,
  5.164516448974609,
  5.154855728149414],
 'max_margin_loss': [8.780280113220215,
  5.991584777832031,
  5.584194660186768,
  5.476739883422852,
  5.423596382141113,
  5.369997978210449,
  5.3439788818359375,
  5.3094024658203125,
  5.2574005126953125,
  5.22948694229126,
  5.208757400512695,
  5.1898369789123535,
  5.179889678955078,
  5.1640849113464355,
  5.154535293579102]}

In [29]:
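from evaluation import ABAEEvaluationProcessor
from torch.nn.functional import normalize

# ABAEEvaluationProcessor.generate_for_model looks up a layer named
# 'weighted_aspect_embedding', but in this session the layer is called
# 'weighted_aspect_embedding_2' (see the ValueError below), so the
# processor is built by hand from the layer weights instead.
inv_vocab = abae_manager.generator.emb_model.model.wv.index_to_key
word_embeddings = model.get_layer('word_embedding').weights[0].value.data
word_embeddings = normalize(word_embeddings, dim=-1)  # unit-length rows
aspect_embeddings = model.get_layer('weighted_aspect_embedding_2').w
aspect_embeddings = normalize(aspect_embeddings, dim=-1)  # unit-length rows
processor = ABAEEvaluationProcessor(word_embeddings, aspect_embeddings, inv_vocab)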

In [27]:
from evaluation import ABAEEvaluationProcessor

inv_vocab = abae_manager.generator.emb_model.model.wv.index_to_key
processor = ABAEEvaluationProcessor.generate_for_model(model, inv_vocab)

ValueError: No such layer: weighted_aspect_embedding. Existing layers are: ['positive', 'word_embedding', 'attention', 'weight', 'negative', 'sentence_aspect', 'average_1', 'weighted_aspect_embedding_2', 'max_margin'].

In [30]:
from main.abae.dataset import PositiveNegativeABAEDataset
import pandas as pd
from torch.utils.data import DataLoader

test_corpus_path = "../dataset/output/default_sentences/pre_processed.80k.test.csv"
df = pd.read_csv(test_corpus_path)

npmi_coh = processor.c_npmi_coherence_model(top_n=10, ds=df['comments'].apply(lambda x: x.split(' ')))
npmi_coherence = npmi_coh.get_coherence()

cv_coh = processor.c_v_coherence_model(top_n=100, ds=df['comments'].apply(lambda x: x.split(' ')))
cv_coherence = cv_coh.get_coherence()


vocabulary = abae_manager.generator.emb_model.vocabulary()
max_seq_len = default_config.max_seq_len
negative_sample_size = default_config.negative_sample_size
test_ds = PositiveNegativeABAEDataset(df, vocabulary, max_seq_len, negative_sample_size)

res = model.evaluate(DataLoader(test_ds, batch_size=default_config.batch_size))
print(f"Max margin reconstruction result: {res}")

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary<7608 unique tokens: ['hide', 'like', 'multiple', 'possible', 'traitor']...>
INFO:gensim.corpora.dictionary:adding document #20000 to Dictionary<10781 unique tokens: ['hide', 'like', 'multiple', 'possible', 'traitor']...>
INFO:gensim.corpora.dictionary:built Dictionary<11509 unique tokens: ['hide', 'like', 'multiple', 'possible', 'traitor']...> from 22796 documents (total 176244 corpus positions)
DEBUG:gensim.utils:starting a new internal lifecycle event log for Dictionary
INFO:gensim.utils:Dictionary lifecycle event {'msg': "built Dictionary<11509 unique tokens: ['hide', 'like', 'multiple', 'possible', 'traitor']...> from 22796 documents (total 176244 corpus positions)", 'datetime': '2025-03-10T17:18:46.204925', 'gensim': '4.3.3', 'python': '3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]', 'platform': 

Generating numeric representation for each word of ds.


Pandas Apply:   0%|          | 0/22796 [00:00<?, ?it/s]

Max sequence length calculation in progress...
We loose information on 2(0.008773469029654327% of ds).
179/179 ━━━━━━━━━━━━━━━━━━━━ 23s 126ms/step - loss: 5.2631 - max_margin_loss: 5.2623
Max margin reconstruction result: [5.261548042297363, 5.262136936187744]


In [31]:
print(f"NPMI coherence: {npmi_coherence}")
print(f"CV score: {cv_coherence}")
print(f"Max margin reconstruction result: {res}")

NPMI coherence: -0.3117940799781337
CV score: 0.6031670887985386
Max margin reconstruction result: [5.261548042297363, 5.262136936187744]


Results for the run:
```
NPMI coherence: -0.3117940799781337
CV score: 0.6031670887985386
Max margin reconstruction result: [5.261548042297363, 5.262136936187744]
```

In [37]:
list(processor.extract_top_k_words(11, 10))

[('instruction', tensor(0.7166, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('rule', tensor(0.7164, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('rulebook', tensor(0.7001, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('manual', tensor(0.6964, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('forum', tensor(0.6304, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('faq', tensor(0.6292, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('language', tensor(0.5814, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('booklet', tensor(0.5774, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('iconography', tensor(0.5654, device='cuda:0', grad_fn=<SelectBackward0>)),
 ('rules', tensor(0.5641, device='cuda:0', grad_fn=<SelectBackward0>))]
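Since cell `In [29]` L2-normalises both the word and the aspect embeddings, the scores above are consistent with cosine similarities between one aspect vector and every word vector. As a rough sketch of how such a ranking can be computed (the function `top_k_words` and the toy matrices are hypothetical; `extract_top_k_words` may be implemented differently):

```python
import numpy as np


def top_k_words(word_emb, aspect_emb, inv_vocab, aspect, k):
    """Rank words by cosine similarity to one aspect embedding."""
    # L2-normalise rows so that the dot product equals the cosine similarity.
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    a = aspect_emb / np.linalg.norm(aspect_emb, axis=1, keepdims=True)
    scores = w @ a[aspect]
    best = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(inv_vocab[i], float(scores[i])) for i in best]


# Toy data: four words and two aspects in a 2-dimensional space.
inv_vocab = ["rule", "manual", "art", "price"]
word_emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
aspect_emb = np.array([[2.0, 0.0], [0.0, 3.0]])

top = top_k_words(word_emb, aspect_emb, inv_vocab, aspect=0, k=2)
print(top)
```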

A comparison based on a single run is admittedly not very meaningful. Running k-fold cross-validation would give an estimate of the expected model loss and a sounder analysis, but for the purposes of this experiment we consider a single run good enough.
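For reference, the k-fold idea could look roughly like this. It is a sketch only: `k_fold_indices` is a hand-rolled splitter, and `train_and_evaluate` is a placeholder that returns a dummy value where a real run would train ABAE on the training fold and return the test loss.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is in exactly one test fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def train_and_evaluate(train_idx, test_idx):
    """Placeholder for: train on train_idx, return the loss on test_idx.

    Returns a dummy value so the sketch runs without the model.
    """
    return len(test_idx) / (len(train_idx) + len(test_idx))


losses = [train_and_evaluate(tr, te) for tr, te in k_fold_indices(100, k=5)]
print(f"mean loss: {mean(losses):.4f} +/- {stdev(losses):.4f}")
```

With roughly 100 s per epoch and 15 epochs per run, five folds for both corpus variants would mean ten full trainings, which is the practical reason for stopping at a single run here.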

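For future work, and as done in the ABAE paper, we will not split reviews into sentences but use full reviews, since splitting did not improve the model much: the C_V score rose slightly (0.56 to 0.60), while NPMI coherence (-0.23 to -0.31) and the test max-margin loss (4.74 to 5.26) both got worse.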
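To make the comparison explicit, the test metrics reported in the two sections can be lined up as follows. The values are copied from the outputs above and rounded to four decimals; the `runs` dictionary and the "higher is better" flags are just a convenient way to organise them.

```python
# Test-set metrics from the two runs above (rounded to 4 decimals).
runs = {
    "full reviews": {"npmi": -0.2304, "c_v": 0.5647, "max_margin": 4.7449},
    "sentences": {"npmi": -0.3118, "c_v": 0.6032, "max_margin": 5.2621},
}

# Coherence scores should be high, the reconstruction loss low.
higher_is_better = {"npmi": True, "c_v": True, "max_margin": False}

winners = {}
for metric, hib in higher_is_better.items():
    full = runs["full reviews"][metric]
    sent = runs["sentences"][metric]
    delta = sent - full
    # The sentence run wins when the change goes in the preferred direction.
    winners[metric] = "sentences" if (delta > 0) == hib else "full reviews"
    print(f"{metric:>10}: full={full:.4f} sent={sent:.4f} "
          f"delta={delta:+.4f} -> {winners[metric]}")
```

Note that the two losses are computed on different test sets (full reviews versus split sentences), so that row is indicative rather than a strict like-for-like comparison.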