# Topic Model Evaluation

This notebook contains the code needed to run the experiments for my Bachelor's thesis on topic modeling algorithms.

For installation instructions, see the `README.md` of this repository if you haven't already.

In [1]:
import sys
sys.path.append('../')

In [2]:
# Import necessary packages
import numpy as np
from sentence_transformers import SentenceTransformer

from evaluation import DataLoader, Trainer

[nltk_data] Downloading package punkt to /home/budu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Trump's tweets

First we will run BERTopic and Top2Vec on Donald Trump's tweets.

The archive of his tweets is available at https://www.thetrumparchive.com/

For our experiments we do not consider retweets or deleted tweets; the `DataLoader` class filters them out.
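
The filtering step might look like the following minimal sketch. It is not the actual `DataLoader` implementation: the field names `isRetweet` and `isDeleted` with `"t"`/`"f"` values are assumptions about the archive's export format, and `keep_original_tweets` is a hypothetical helper used only for illustration.

```python
def keep_original_tweets(rows):
    """Return the text of tweets that are neither retweets nor deleted.

    Assumes each row is a dict with "text", "isRetweet" and "isDeleted"
    keys, where the two flags are the strings "t" or "f" (hypothetical schema).
    """
    return [
        row["text"]
        for row in rows
        if row["isRetweet"] == "f" and row["isDeleted"] == "f"
    ]


# Toy rows standing in for the archive export
sample_rows = [
    {"text": "original tweet", "isRetweet": "f", "isDeleted": "f"},
    {"text": "a retweet", "isRetweet": "t", "isDeleted": "f"},
    {"text": "a deleted tweet", "isRetweet": "f", "isDeleted": "t"},
]
docs = keep_original_tweets(sample_rows)
print(docs)
```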

In [3]:
dataset, custom = "trump", True

In [4]:
%%time
dataloader_trump = DataLoader(dataset=dataset).\
                    prepare_docs(save=f"{dataset}.txt").\
                    preprocess_octis(output_folder=f"{dataset}")

100%|█████████████████████████████████████████████████████████████████████████| 46693/46693 [00:00<00:00, 253384.77it/s]


created vocab
53637
words filtering done
CPU times: user 2.62 s, sys: 284 ms, total: 2.9 s
Wall time: 12.8 s


To speed up BERTopic, we precompute the document embeddings with the `all-mpnet-base-v2` `SentenceTransformer` model. Otherwise, the embeddings would have to be recomputed for every parameter setting, which would result in a massive runtime.

In [6]:
# Prepare data
dataset, custom = "trump", True
dataloader = DataLoader(dataset)
data_trump = dataloader.load_octis(custom)
data_trump = [" ".join(words) for words in data_trump.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings_trump = model.encode(data_trump, show_progress_bar=True)

Batches: 100%|██████████████████████████████████████████████████████████████████████| 1383/1383 [13:28<00:00,  1.71it/s]


In [7]:
# Save embeddings
np.savetxt('embeddings_trump.txt', embeddings_trump)

As described in the thesis and the BERTopic paper (https://arxiv.org/pdf/2203.05794.pdf), the performance of the topic models is measured with two widely used metrics: topic coherence and topic diversity.

For each topic model, coherence is evaluated using Normalized Pointwise Mutual Information (NPMI). This measure ranges from $-1$ to $1$, where $1$ indicates a perfect association.
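
To make the measure concrete, here is a minimal sketch of NPMI for a single word pair. It is a simplification and not the implementation used in the experiments: probabilities are estimated as plain document frequencies, and the function name `npmi` and the toy corpus are assumptions for illustration. The coherence of a topic is then typically an average of such pairwise scores over its top words.

```python
import math


def npmi(word_a, word_b, documents):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(word_a in doc for doc in doc_sets) / n_docs
    p_b = sum(word_b in doc for doc in doc_sets) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # both words occur in every document; avoids dividing by log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Toy tokenized corpus (made up for this example)
toy_docs = [
    ["tax", "cut", "economy"],
    ["tax", "cut"],
    ["border", "wall"],
    ["border", "wall", "security"],
]
print(npmi("tax", "cut", toy_docs))   # words that always appear together
print(npmi("tax", "wall", toy_docs))  # words that never appear together
```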

Topic diversity is the proportion of unique words across all topics. This measure ranges from $0$ to $1$, where $0$ indicates redundant topics and $1$ indicates varied topics.
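
Topic diversity can be sketched in a few lines, assuming each topic is given as a list of its top words. The function name `topic_diversity` and the toy topics are illustrative assumptions, not the evaluation code used for the results in this notebook.

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


# Two toy topics sharing the word "economy": 5 unique words out of 6
toy_topics = [
    ["tax", "cut", "economy"],
    ["border", "wall", "economy"],
]
print(topic_diversity(toy_topics))
```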

The number of topics ranges from $10$ to $50$ in steps of $10$, and the NPMI score is calculated at each step for each topic model. Results are averaged across $3$ runs for each step.
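
The averaging can be sketched as follows. The NPMI values are transcribed from the BERTopic output of the three runs on Trump's tweets in this notebook; the list layout is an assumption for illustration and not the format written by `Trainer.train`.

```python
from statistics import mean

nr_topics = [10, 20, 30, 40, 50]

# One list per run, one NPMI score per number of topics
npmi_runs = [
    [-0.055941250571217155, -0.045564519617954206, -0.0519690206226661,
     -0.0444661116546153, -0.038141072709061086],
    [-0.015205397200341975, -0.01154739067237529, 0.00513796394164826,
     -0.02551536915132, -0.04739479467302068],
    [-0.038341753637132256, -0.032157074922578674, -0.057880019457847015,
     -0.02733355584221016, -0.052933953618729726],
]

# zip(*npmi_runs) groups the scores of all runs by number of topics
mean_npmi = {k: mean(scores) for k, scores in zip(nr_topics, zip(*npmi_runs))}

for k, score in mean_npmi.items():
    print(f"{k} topics: mean NPMI = {score:.4f}")
```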

In [6]:
# Evaluating BERTopic on Trump's tweets

for run in range(3):

    params = {
        "embedding_model": "all-mpnet-base-v2",
        "nr_topics": [(k+1)*10 for k in range(5)],  # 10, 20, ..., 50 topics
        "min_topic_size": 15,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings_trump,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"../results/Basic/Trump/bertopic_{run+1}")

2024-04-29 17:29:12,850 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:30:43,513 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:30:43,517 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:30:47,714 - BERTopic - Cluster - Completed ✓
2024-04-29 17:30:47,717 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:30:49,622 - BERTopic - Representation - Completed ✓
2024-04-29 17:30:49,625 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:30:51,283 - BERTopic - Topic reduction - Reduced number of topics from 379 to 10


Results
npmi: -0.055941250571217155
diversity: 0.7222222222222222
 


2024-04-29 17:31:07,815 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:31:45,255 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:31:45,258 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:31:48,480 - BERTopic - Cluster - Completed ✓
2024-04-29 17:31:48,482 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:31:50,341 - BERTopic - Representation - Completed ✓
2024-04-29 17:31:50,343 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:31:51,821 - BERTopic - Topic reduction - Reduced number of topics from 368 to 20


Results
npmi: -0.045564519617954206
diversity: 0.7105263157894737
 


2024-04-29 17:32:10,286 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:32:51,209 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:32:51,213 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:32:54,290 - BERTopic - Cluster - Completed ✓
2024-04-29 17:32:54,292 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:32:56,356 - BERTopic - Representation - Completed ✓
2024-04-29 17:32:56,362 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:32:57,772 - BERTopic - Topic reduction - Reduced number of topics from 380 to 30


Results
npmi: -0.0519690206226661
diversity: 0.7517241379310344
 


2024-04-29 17:33:14,925 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:33:48,893 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:33:48,896 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:33:52,487 - BERTopic - Cluster - Completed ✓
2024-04-29 17:33:52,489 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:33:54,769 - BERTopic - Representation - Completed ✓
2024-04-29 17:33:54,772 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:33:56,604 - BERTopic - Topic reduction - Reduced number of topics from 378 to 40


Results
npmi: -0.0444661116546153
diversity: 0.7743589743589744
 


2024-04-29 17:34:16,889 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:34:49,062 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:34:49,066 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:34:52,134 - BERTopic - Cluster - Completed ✓
2024-04-29 17:34:52,136 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:34:53,886 - BERTopic - Representation - Completed ✓
2024-04-29 17:34:53,888 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:34:55,375 - BERTopic - Topic reduction - Reduced number of topics from 366 to 50


Results
npmi: -0.038141072709061086
diversity: 0.7959183673469388
 


2024-04-29 17:35:14,676 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:35:55,771 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:35:55,775 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:35:58,870 - BERTopic - Cluster - Completed ✓
2024-04-29 17:35:58,872 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:36:00,668 - BERTopic - Representation - Completed ✓
2024-04-29 17:36:00,670 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:36:02,291 - BERTopic - Topic reduction - Reduced number of topics from 359 to 10


Results
npmi: -0.015205397200341975
diversity: 0.7
 


2024-04-29 17:36:18,180 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:36:53,344 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:36:53,348 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:36:56,540 - BERTopic - Cluster - Completed ✓
2024-04-29 17:36:56,541 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:36:58,402 - BERTopic - Representation - Completed ✓
2024-04-29 17:36:58,406 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:36:59,832 - BERTopic - Topic reduction - Reduced number of topics from 384 to 20


Results
npmi: -0.01154739067237529
diversity: 0.7315789473684211
 


2024-04-29 17:37:17,472 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:38:00,354 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:38:00,358 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:38:03,503 - BERTopic - Cluster - Completed ✓
2024-04-29 17:38:03,505 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:38:05,739 - BERTopic - Representation - Completed ✓
2024-04-29 17:38:05,741 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:38:07,436 - BERTopic - Topic reduction - Reduced number of topics from 371 to 30


Results
npmi: 0.00513796394164826
diversity: 0.7206896551724138
 


2024-04-29 17:38:23,790 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:38:59,569 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:38:59,573 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:39:02,754 - BERTopic - Cluster - Completed ✓
2024-04-29 17:39:02,756 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:39:04,608 - BERTopic - Representation - Completed ✓
2024-04-29 17:39:04,609 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:39:06,196 - BERTopic - Topic reduction - Reduced number of topics from 372 to 40


Results
npmi: -0.02551536915132
diversity: 0.7512820512820513
 


2024-04-29 17:39:24,204 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:39:57,267 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:39:57,270 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:40:00,325 - BERTopic - Cluster - Completed ✓
2024-04-29 17:40:00,327 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:40:02,147 - BERTopic - Representation - Completed ✓
2024-04-29 17:40:02,149 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:40:03,657 - BERTopic - Topic reduction - Reduced number of topics from 362 to 50


Results
npmi: -0.04739479467302068
diversity: 0.7877551020408163
 


2024-04-29 17:40:23,814 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:40:55,641 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:40:55,645 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:40:58,964 - BERTopic - Cluster - Completed ✓
2024-04-29 17:40:58,967 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:41:00,866 - BERTopic - Representation - Completed ✓
2024-04-29 17:41:00,867 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:41:02,260 - BERTopic - Topic reduction - Reduced number of topics from 371 to 10


Results
npmi: -0.038341753637132256
diversity: 0.7444444444444445
 


2024-04-29 17:41:16,552 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:41:49,482 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:41:49,486 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:41:52,584 - BERTopic - Cluster - Completed ✓
2024-04-29 17:41:52,586 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:41:54,369 - BERTopic - Representation - Completed ✓
2024-04-29 17:41:54,370 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:41:55,762 - BERTopic - Topic reduction - Reduced number of topics from 371 to 20


Results
npmi: -0.032157074922578674
diversity: 0.6947368421052632
 


2024-04-29 17:42:11,294 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:42:43,196 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:42:43,200 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:42:46,353 - BERTopic - Cluster - Completed ✓
2024-04-29 17:42:46,355 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:42:48,152 - BERTopic - Representation - Completed ✓
2024-04-29 17:42:48,154 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:42:49,533 - BERTopic - Topic reduction - Reduced number of topics from 376 to 30


Results
npmi: -0.057880019457847015
diversity: 0.7724137931034483
 


2024-04-29 17:43:07,143 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:43:40,301 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:43:40,304 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:43:43,400 - BERTopic - Cluster - Completed ✓
2024-04-29 17:43:43,402 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:43:45,171 - BERTopic - Representation - Completed ✓
2024-04-29 17:43:45,173 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:43:46,671 - BERTopic - Topic reduction - Reduced number of topics from 372 to 40


Results
npmi: -0.02733355584221016
diversity: 0.7769230769230769
 


2024-04-29 17:44:06,543 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 17:44:39,834 - BERTopic - Dimensionality - Completed ✓
2024-04-29 17:44:39,838 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 17:44:43,185 - BERTopic - Cluster - Completed ✓
2024-04-29 17:44:43,189 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 17:44:45,034 - BERTopic - Representation - Completed ✓
2024-04-29 17:44:45,036 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 17:44:46,590 - BERTopic - Topic reduction - Reduced number of topics from 374 to 50


Results
npmi: -0.052933953618729726
diversity: 0.8122448979591836
 


In [9]:
# Evaluating Top2Vec on Trump's tweets

for run in range(3):

    params = {"nr_topics": [(k+1)*10 for k in range(5)],  # 10, 20, ..., 50 topics
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=None,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"../results/Basic/Trump/Top2Vec_{run+1}")

2024-04-29 09:51:40,302 - top2vec - INFO - Pre-processing documents for training
2024-04-29 09:51:41,683 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 09:56:17,586 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 09:56:30,994 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Results
npmi: -0.10839552498307084
diversity: 0.73
 


2024-04-29 09:57:04,728 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:01:50,596 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:02:04,539 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1564537046781506
diversity: 0.7
 


2024-04-29 10:02:37,956 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:07:20,156 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:07:37,740 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1887930996218135
diversity: 0.74
 


2024-04-29 10:08:09,767 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:12:50,247 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:13:08,532 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.17824130198832835
diversity: 0.665
 


2024-04-29 10:13:38,068 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:18:21,570 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:18:36,527 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1889365130844378
diversity: 0.682
 


2024-04-29 10:19:07,688 - top2vec - INFO - Pre-processing documents for training
2024-04-29 10:19:08,759 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:23:46,704 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:24:05,154 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.09486627721478466
diversity: 0.72
 


2024-04-29 10:24:36,405 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:29:19,508 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:29:39,005 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.17518278762187464
diversity: 0.725
 


2024-04-29 10:30:16,010 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:34:59,792 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:35:14,085 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.17654762830718224
diversity: 0.72
 


2024-04-29 10:35:44,010 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:40:29,819 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:40:49,030 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.15864724223459115
diversity: 0.665
 


2024-04-29 10:41:25,123 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:46:10,182 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:46:24,872 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.19190606796403198
diversity: 0.664
 


2024-04-29 10:46:52,485 - top2vec - INFO - Pre-processing documents for training
2024-04-29 10:46:53,926 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:51:38,087 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:51:51,807 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1253801065816984
diversity: 0.8
 


2024-04-29 10:52:26,019 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 10:57:05,003 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 10:57:18,804 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1418263622536161
diversity: 0.7
 


2024-04-29 10:57:50,564 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 11:02:32,435 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 11:02:49,205 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.13877093061024934
diversity: 0.62
 


2024-04-29 11:03:18,383 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 11:07:52,911 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 11:08:06,787 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.1794100842662193
diversity: 0.6625
 


2024-04-29 11:08:36,539 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 11:13:16,493 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 11:13:33,125 - top2vec - INFO - Finding dense areas of documents

Results
npmi: -0.20866024629079075
diversity: 0.678
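
The `huggingface/tokenizers` warning in the Top2Vec output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is acceptable to disable tokenizer parallelism for these runs and that the cell is executed before any model is loaded (for example as the first cell of the notebook):

```python
import os

# Must be set before `tokenizers` is first used in this process
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```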
 


## 20Newsgroups

The 20Newsgroups dataset comprises 16309 newsgroup posts on 20 topics.

In [7]:
dataset, custom = "20NewsGroup", False
dataloader_20ng = DataLoader(dataset)
data_20ng = dataloader_20ng.load_octis(custom)
data_20ng = [" ".join(words) for words in data_20ng.get_corpus()]

# Extract embeddings (needed below for saving and for BERTopic)
model = SentenceTransformer("all-mpnet-base-v2")
embeddings_20ng = model.encode(data_20ng, show_progress_bar=True)

In [11]:
# Save embeddings
np.savetxt('embeddings_20ng.txt', embeddings_20ng)

In [12]:
# Evaluating BERTopic on 20Newsgroups

for run in range(3):

    params = {
        "embedding_model": "all-mpnet-base-v2",
        "nr_topics": [(k+1)*10 for k in range(5)],  # 10, 20, ..., 50 topics
        "min_topic_size": 15,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings_20ng,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"../results/Basic/20NewsGroup/bertopic_{run+1}")

2024-04-29 11:49:36,292 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:49:45,430 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:49:45,432 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:49:45,878 - BERTopic - Cluster - Completed ✓
2024-04-29 11:49:45,879 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:49:46,511 - BERTopic - Representation - Completed ✓
2024-04-29 11:49:46,512 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:49:47,093 - BERTopic - Topic reduction - Reduced number of topics from 90 to 10


Results
npmi: 0.099337322297008
diversity: 0.9111111111111111
 


2024-04-29 11:49:54,140 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:50:03,935 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:50:03,936 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:50:04,371 - BERTopic - Cluster - Completed ✓
2024-04-29 11:50:04,372 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:50:04,966 - BERTopic - Representation - Completed ✓
2024-04-29 11:50:04,967 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:50:05,555 - BERTopic - Topic reduction - Reduced number of topics from 75 to 20


Results
npmi: 0.11307398932436573
diversity: 0.7684210526315789
 


2024-04-29 11:50:13,484 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:50:20,309 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:50:20,311 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:50:20,753 - BERTopic - Cluster - Completed ✓
2024-04-29 11:50:20,754 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:50:21,332 - BERTopic - Representation - Completed ✓
2024-04-29 11:50:21,333 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:50:21,917 - BERTopic - Topic reduction - Reduced number of topics from 84 to 30


Results
npmi: 0.11118826619687264
diversity: 0.8068965517241379
 


2024-04-29 11:50:31,081 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:50:39,227 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:50:39,229 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:50:39,738 - BERTopic - Cluster - Completed ✓
2024-04-29 11:50:39,739 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:50:40,615 - BERTopic - Representation - Completed ✓
2024-04-29 11:50:40,616 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:50:41,245 - BERTopic - Topic reduction - Reduced number of topics from 88 to 40


Results
npmi: 0.11162682927325146
diversity: 0.782051282051282
 


2024-04-29 11:50:51,229 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:50:59,604 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:50:59,606 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:51:00,053 - BERTopic - Cluster - Completed ✓
2024-04-29 11:51:00,054 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:51:00,675 - BERTopic - Representation - Completed ✓
2024-04-29 11:51:00,676 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:51:01,302 - BERTopic - Topic reduction - Reduced number of topics from 78 to 50


Results
npmi: 0.11568636316859651
diversity: 0.7653061224489796
 


2024-04-29 11:51:11,912 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:51:18,803 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:51:18,805 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:51:19,265 - BERTopic - Cluster - Completed ✓
2024-04-29 11:51:19,266 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:51:20,002 - BERTopic - Representation - Completed ✓
2024-04-29 11:51:20,003 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:51:20,572 - BERTopic - Topic reduction - Reduced number of topics from 92 to 10


Results
npmi: 0.08912575048307643
diversity: 0.8555555555555555
 


2024-04-29 11:51:27,529 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:51:35,651 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:51:35,653 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:51:36,136 - BERTopic - Cluster - Completed ✓
2024-04-29 11:51:36,137 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:51:36,810 - BERTopic - Representation - Completed ✓
2024-04-29 11:51:36,811 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:51:37,575 - BERTopic - Topic reduction - Reduced number of topics from 86 to 20


Results
npmi: 0.11009031702994132
diversity: 0.8157894736842105
 


2024-04-29 11:51:46,994 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:51:55,222 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:51:55,224 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:51:55,667 - BERTopic - Cluster - Completed ✓
2024-04-29 11:51:55,668 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:51:56,377 - BERTopic - Representation - Completed ✓
2024-04-29 11:51:56,378 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:51:56,963 - BERTopic - Topic reduction - Reduced number of topics from 87 to 30


Results
npmi: 0.1206101017224498
diversity: 0.8
 


2024-04-29 11:52:05,636 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:52:17,191 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:52:17,193 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:52:17,666 - BERTopic - Cluster - Completed ✓
2024-04-29 11:52:17,667 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:52:18,304 - BERTopic - Representation - Completed ✓
2024-04-29 11:52:18,305 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:52:18,867 - BERTopic - Topic reduction - Reduced number of topics from 89 to 40


Results
npmi: 0.11585029897369872
diversity: 0.764102564102564
 


2024-04-29 11:52:29,271 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:52:36,292 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:52:36,294 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:52:36,750 - BERTopic - Cluster - Completed ✓
2024-04-29 11:52:36,751 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:52:37,340 - BERTopic - Representation - Completed ✓
2024-04-29 11:52:37,342 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:52:37,889 - BERTopic - Topic reduction - Reduced number of topics from 83 to 50


Results
npmi: 0.11187233643388889
diversity: 0.773469387755102
 


2024-04-29 11:52:48,920 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:52:56,637 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:52:56,639 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:52:57,104 - BERTopic - Cluster - Completed ✓
2024-04-29 11:52:57,105 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:52:57,840 - BERTopic - Representation - Completed ✓
2024-04-29 11:52:57,842 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:52:58,677 - BERTopic - Topic reduction - Reduced number of topics from 87 to 10


Results
npmi: 0.0888043763927259
diversity: 0.8666666666666667
 


2024-04-29 11:53:07,032 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:53:14,619 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:53:14,621 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:53:15,089 - BERTopic - Cluster - Completed ✓
2024-04-29 11:53:15,090 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:53:15,746 - BERTopic - Representation - Completed ✓
2024-04-29 11:53:15,747 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:53:16,290 - BERTopic - Topic reduction - Reduced number of topics from 86 to 20


Results
npmi: 0.12220646633055118
diversity: 0.8105263157894737
 


2024-04-29 11:53:24,344 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:53:32,893 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:53:32,894 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:53:33,347 - BERTopic - Cluster - Completed ✓
2024-04-29 11:53:33,348 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:53:33,983 - BERTopic - Representation - Completed ✓
2024-04-29 11:53:33,984 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:53:34,618 - BERTopic - Topic reduction - Reduced number of topics from 84 to 30


Results
npmi: 0.12545060235506214
diversity: 0.8103448275862069
 


2024-04-29 11:53:43,282 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:53:50,190 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:53:50,192 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:53:50,631 - BERTopic - Cluster - Completed ✓
2024-04-29 11:53:50,633 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:53:51,295 - BERTopic - Representation - Completed ✓
2024-04-29 11:53:51,295 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:53:51,873 - BERTopic - Topic reduction - Reduced number of topics from 84 to 40


Results
npmi: 0.11900921035347693
diversity: 0.7897435897435897
 


2024-04-29 11:54:01,181 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-29 11:54:08,962 - BERTopic - Dimensionality - Completed ✓
2024-04-29 11:54:08,963 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-29 11:54:09,408 - BERTopic - Cluster - Completed ✓
2024-04-29 11:54:09,410 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-29 11:54:10,004 - BERTopic - Representation - Completed ✓
2024-04-29 11:54:10,005 - BERTopic - Topic reduction - Reducing number of topics
2024-04-29 11:54:10,552 - BERTopic - Topic reduction - Reduced number of topics from 88 to 50


Results
npmi: 0.11900244987039556
diversity: 0.7551020408163265
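
Since every setting is run three times, the scores can be summarized per `nr_topics` as a mean and standard deviation. The sketch below does this for the BERTopic NPMI values printed above, copied by hand and rounded to four decimals; it assumes the results appear in the order `nr_topics` = 10, 20, ..., 50 within each run, as the topic reduction log lines suggest.

```python
import numpy as np

nr_topics = [10, 20, 30, 40, 50]

# NPMI values from the three BERTopic runs above (rounded to 4 decimals).
npmi = np.array([
    [0.0993, 0.1131, 0.1112, 0.1116, 0.1157],  # run 1
    [0.0891, 0.1101, 0.1206, 0.1159, 0.1119],  # run 2
    [0.0888, 0.1222, 0.1255, 0.1190, 0.1190],  # run 3
])

# Aggregate over runs (axis 0); ddof=1 gives the sample standard deviation.
mean_npmi = npmi.mean(axis=0)
std_npmi = npmi.std(axis=0, ddof=1)

for k, m, s in zip(nr_topics, mean_npmi, std_npmi):
    print(f"nr_topics={k}: NPMI = {m:.4f} ± {s:.4f}")
```

The same pattern applies to the diversity scores and to the Top2Vec runs.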
 


In [None]:
# Evaluating Top2Vec on 20Newsgroups

for i in range(3):

    # nr_topics takes the values 10, 20, ..., 50 (k avoids shadowing the run index i)
    params = {"nr_topics": [(k+1)*10 for k in range(5)],
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=None,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"../results/Basic/20NewsGroup/Top2Vec_{i+1}")

2024-04-29 17:54:13,368 - top2vec - INFO - Pre-processing documents for training
2024-04-29 17:54:16,742 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 17:58:09,316 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 17:58:21,903 - top2vec - INFO - Finding dense areas of documents
2024-04-29 17:58:22,808 - top2vec - INFO - Finding topics
2024-04-29 17:58:48,238 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19436396186822624
diversity: 0.97
 


2024-04-29 17:58:51,167 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:02:48,505 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:03:01,720 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:03:02,612 - top2vec - INFO - Finding topics
2024-04-29 18:03:31,644 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.20609795474080722
diversity: 0.88
 


2024-04-29 18:03:34,631 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:07:33,930 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:07:46,880 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:07:47,762 - top2vec - INFO - Finding topics
2024-04-29 18:08:14,824 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19377096470248933
diversity: 0.8133333333333334
 


2024-04-29 18:08:17,801 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:12:13,952 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:12:27,285 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:12:28,141 - top2vec - INFO - Finding topics
2024-04-29 18:12:56,327 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.18342764027958572
diversity: 0.76
 


2024-04-29 18:12:59,216 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:16:57,584 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:17:09,967 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:17:10,856 - top2vec - INFO - Finding topics


Results
npmi: 0.18405755259111825
diversity: 0.684
 


2024-04-29 18:17:39,540 - top2vec - INFO - Pre-processing documents for training
2024-04-29 18:17:42,662 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:21:37,854 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:21:53,737 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:21:54,596 - top2vec - INFO - Finding topics
2024-04-29 18:22:19,807 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19227975601031116
diversity: 0.99
 


2024-04-29 18:22:22,821 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:26:21,753 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:26:35,063 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:26:35,986 - top2vec - INFO - Finding topics
2024-04-29 18:27:04,071 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.20270333749902666
diversity: 0.91
 


2024-04-29 18:27:07,192 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:31:05,405 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:31:22,702 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:31:23,575 - top2vec - INFO - Finding topics
2024-04-29 18:31:49,440 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19159036798322354
diversity: 0.8266666666666667
 


2024-04-29 18:31:52,426 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:35:49,030 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:36:01,473 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:36:02,306 - top2vec - INFO - Finding topics
2024-04-29 18:36:29,094 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.18052582397552203
diversity: 0.7075
 


2024-04-29 18:36:32,096 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:40:27,721 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:40:39,937 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:40:40,856 - top2vec - INFO - Finding topics


Results
npmi: 0.18509200599650555
diversity: 0.672
 


2024-04-29 18:41:03,448 - top2vec - INFO - Pre-processing documents for training
2024-04-29 18:41:06,262 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:44:21,204 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:44:26,503 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:44:26,845 - top2vec - INFO - Finding topics
2024-04-29 18:44:36,511 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.18815594791965626
diversity: 0.96
 


2024-04-29 18:44:37,659 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:46:04,049 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:46:10,654 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:46:10,998 - top2vec - INFO - Finding topics
2024-04-29 18:46:21,753 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.20176575892468757
diversity: 0.905
 


2024-04-29 18:46:22,913 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:47:49,105 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:47:54,801 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:47:55,163 - top2vec - INFO - Finding topics
2024-04-29 18:48:05,677 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.18974979893393054
diversity: 0.8333333333333334
 


2024-04-29 18:48:07,331 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:49:43,819 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:49:49,711 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:49:50,110 - top2vec - INFO - Finding topics
2024-04-29 18:50:01,922 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.1821333753194903
diversity: 0.7275
 


2024-04-29 18:50:03,116 - top2vec - INFO - Creating joint document/word embedding
2024-04-29 18:51:32,535 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-29 18:51:40,265 - top2vec - INFO - Finding dense areas of documents
2024-04-29 18:51:40,703 - top2vec - INFO - Finding topics
