# LDA evaluation notebook

We use an interactive notebook to evaluate our summarization models with our LDA model.

The notebook relies on custom modules defined in other files. We use a notebook rather than a script because it keeps the loaded data in memory, so we do not have to reload it on every run.

### Setup logging

In [1]:
import logging
from logging import config
config.fileConfig('./logging.conf')

def pp(*args, **kwargs):
    logging.info(*args, **kwargs)

### Resource paths

In [2]:
import os
cwd = os.getcwd()

data_path = f'{cwd}/bart_output.json'
model_path = f'{cwd}/model/grid-xxx'
tf_idf_path = f'{cwd}/tf_idf'

### Load pre-computed resources

In [3]:
from gensim.models import TfidfModel

tf_idf = TfidfModel.load(tf_idf_path)

2022-08-13 12:35:23,563 - gensim.utils - INFO - loading TfidfModel object from /Users/danieltrugman/Documents/Education/NLP/Repository/tf_idf
2022-08-13 12:35:23,563 - smart_open.smart_open_lib - DEBUG - {'uri': '/Users/danieltrugman/Documents/Education/NLP/Repository/tf_idf', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'compression': 'infer_from_extension', 'transport_params': None}
2022-08-13 12:35:23,605 - gensim.utils - INFO - TfidfModel lifecycle event {'fname': '/Users/danieltrugman/Documents/Education/NLP/Repository/tf_idf', 'datetime': '2022-08-13T12:35:23.605300', 'gensim': '4.2.0', 'python': '3.10.4 (v3.10.4:9d38120e33, Mar 23 2022, 17:29:05) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'loaded'}


In [4]:
from lda_model import LdaModel

lda = LdaModel.load(model_path)

2022-08-13 12:35:23,625 - gensim.corpora.dictionary - INFO - adding document #0 to Dictionary<0 unique tokens: []>
2022-08-13 12:35:23,625 - gensim.corpora.dictionary - INFO - built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2022-08-13 12:35:23,625 - gensim.utils - DEBUG - starting a new internal lifecycle event log for Dictionary
2022-08-13 12:35:23,626 - gensim.utils - INFO - Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2022-08-13T12:35:23.625836', 'gensim': '4.2.0', 'python': '3.10.4 (v3.10.4:9d38120e33, Mar 23 2022, 17:29:05) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'created'}
2022-08-13 12:35:23,658 - gensim.utils - INFO - loading LdaMulticore object from /Users/danieltrugman/Documents/Education/N

### Load data

We load a JSON file that we generated ourselves. Each record stores the original article and its abstract (both part of the dataset) together with the summary produced by the BART model.

In [5]:
import json

with open(data_path) as fin:
    data = json.load(fin)

articles = [doc['article'] for doc in data]
abstracts = [doc['abstract'] for doc in data]
summaries = [doc['bart'] for doc in data]
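
The cell above reads three keys from every record. A minimal sketch of that record shape, with made-up text values and a hypothetical `validate` helper that reports records missing a key, might look like this:

```python
import json

# Hypothetical sample; only the three key names come from the loading cell.
sample = '''[
  {"article": "Full text of the source article.",
   "abstract": "Human-written abstract from the dataset.",
   "bart": "Summary generated by the BART model."}
]'''

REQUIRED_KEYS = {'article', 'abstract', 'bart'}

def validate(records):
    """Return the indices of records that lack any required key."""
    return [i for i, doc in enumerate(records)
            if not REQUIRED_KEYS.issubset(doc)]

records = json.loads(sample)
missing = validate(records)
```

Running such a check before the list comprehensions gives a clearer error than a bare `KeyError` from somewhere in the middle of the file.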

### Tokenize and pre-process test data

The LDA model expects bag-of-words (BOW) input (in our case weighted by TF-IDF), not raw strings. Hence, we need to convert each text into the expected format.

In [6]:
from generate_preprocessed import PreProcessor
from generate_bow import BowProcessor
from generate_tf_idf import TfIdfProcessor

pp_processor = PreProcessor()
bow_processor = BowProcessor(lda.dictionary)
tf_idf_processor = TfIdfProcessor(tf_idf)

articles_pp = pp_processor(articles)
abstracts_pp = pp_processor(abstracts)
summaries_pp = pp_processor(summaries)

articles_bow = bow_processor(articles_pp)
abstracts_bow = bow_processor(abstracts_pp)
summaries_bow = bow_processor(summaries_pp)

articles_tf_idf = tf_idf_processor(articles_bow)
abstracts_tf_idf = tf_idf_processor(abstracts_bow)
summaries_tf_idf = tf_idf_processor(summaries_bow)
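
To illustrate what the BOW and TF-IDF stages produce, here is a small plain-Python sketch on a toy corpus. It does not reproduce `BowProcessor` or `TfIdfProcessor`, whose code lives in other files. The corpus, the helper names, and the weighting (log2 inverse document frequency with unit-length normalization) are assumptions chosen for illustration and may differ from the saved TF-IDF model.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data).
corpus = [
    ['human', 'computer', 'interface'],
    ['computer', 'survey', 'response'],
    ['human', 'survey', 'survey'],
]

# Dictionary: map every known token to an integer id.
vocab = sorted({tok for doc in corpus for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Number of corpus documents in which each token id occurs.
doc_freq = Counter(tid for doc in corpus for tid, _ in to_bow(doc))

def to_tf_idf(bow):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    n_docs = len(corpus)
    weighted = [(tid, cnt * math.log2(n_docs / doc_freq[tid]))
                for tid, cnt in bow]
    weighted = [(tid, w) for tid, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

bow = to_bow(['human', 'survey', 'survey', 'unseen'])
tf_idf_vec = to_tf_idf(bow)
```

Both stages return sparse lists of `(token_id, value)` pairs; tokens missing from the dictionary, such as `'unseen'` here, are dropped.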

### Evaluate the topics for each doc and calculate distances

For every original article, we have two gists: one human-made (abstract) and one computer-made (summary).
We calculate the distance for each of the two pairs, (original, abstract) and (original, summary), and examine which gist retains the topics better: the smaller the distance, the better.
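
The distance used by `LdaEvaluator.distance` is defined in another file and is not shown here. As one possible reading, the sketch below compares two topic distributions with the Kullback-Leibler divergence. The choice of KL divergence, the `kl_divergence` helper, and the toy distributions are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against division by zero when q has empty topics.
    """
    if len(p) != len(q):
        raise ValueError('distributions must have the same length')
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical topic distributions over three topics.
article = [0.7, 0.2, 0.1]
abstract = [0.6, 0.3, 0.1]   # close to the article's topic mix
summary = [0.1, 0.2, 0.7]    # emphasizes a different topic

human_dist = kl_divergence(article, abstract)
comp_dist = kl_divergence(article, summary)
```

In this toy case `human_dist` is smaller than `comp_dist`, so the abstract would count as retaining the article's topics better.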

In [8]:
from lda_eval import LdaEvaluator

evaluator = LdaEvaluator(lda)

human_better = 0
comp_better = 0

for article, abstract, summary in zip(articles_tf_idf, abstracts_tf_idf, summaries_tf_idf):
    human_dist = evaluator.distance(article, abstract)
    comp_dist = evaluator.distance(article, summary)
    diff = abs(human_dist - comp_dist)
    pp(f'{human_dist:.3f}, {comp_dist:.3f} --> {diff:.3f}')
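    # A smaller distance means the gist stays closer to the article's topics.
    # Note that ties are counted in favor of the computer-made summary.
    if human_dist < comp_dist:
        human_better += 1
    else:
        comp_better += 1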

pp('---------------------------------------------------')
pp(f'Human [{human_better}] vs. Comp [{comp_better}]')


2022-08-13 12:35:40,102 - root - INFO - 10.896, 1.312 --> 9.584
2022-08-13 12:35:40,129 - root - INFO - 11.826, 12.293 --> 0.468
2022-08-13 12:35:40,141 - root - INFO - 1.153, 1.249 --> 0.097
2022-08-13 12:35:40,146 - root - INFO - 0.164, 2.352 --> 2.188
2022-08-13 12:35:40,208 - root - INFO - 4.679, 3.186 --> 1.493
2022-08-13 12:35:40,227 - root - INFO - 0.162, 0.126 --> 0.036
2022-08-13 12:35:40,238 - root - INFO - 0.974, 0.981 --> 0.007
2022-08-13 12:35:40,245 - root - INFO - 3.455, 7.556 --> 4.101
2022-08-13 12:35:40,260 - root - INFO - 4.264, 4.289 --> 0.025
2022-08-13 12:35:40,263 - root - INFO - 0.336, 0.389 --> 0.053
2022-08-13 12:35:40,290 - root - INFO - 0.000, 0.045 --> 0.045
2022-08-13 12:35:40,299 - root - INFO - 9.805, 5.239 --> 4.566
2022-08-13 12:35:40,307 - root - INFO - 9.925, 10.584 --> 0.659
2022-08-13 12:35:40,312 - root - INFO - 4.108, 8.212 --> 4.104
2022-08-13 12:35:40,315 - root - INFO - 5.767, 2.523 --> 3.244
2022-08-13 12:35:40,318 - root - INFO - 4.282, 4.69