# Topic Model Evaluation

This notebook contains the code needed to run the experiments for my Bachelor's thesis on topic modeling algorithms.

For installation instructions, see the `README.md` of this repository if you haven't already.

In [2]:
import sys
sys.path.append('../')

In [3]:
# Import necessary packages
import numpy as np
from sentence_transformers import SentenceTransformer

from evaluation import DataLoader, Trainer

[nltk_data] Downloading package punkt to /home/budu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Trump's tweets

First, we will run BERTopic and Top2Vec on Donald Trump's tweets.

The tweet archive can be found here: https://www.thetrumparchive.com/

Our experiments do not consider retweets or deleted tweets; both are filtered out by the `DataLoader` class.
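The filtering step can be pictured with a minimal sketch. It is not the `DataLoader` implementation: the column names `isRetweet` and `isDeleted`, the `t`/`f` flag values, the sample rows, and the helper `keep_original_tweets` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample in the assumed layout: flags are "t" (true) or "f" (false).
SAMPLE = """id,text,isRetweet,isDeleted
1,Original tweet,f,f
2,RT @someone: quoted,t,f
3,Deleted tweet,f,t
"""


def keep_original_tweets(rows):
    """Return the text of rows that are neither retweets nor deleted tweets."""
    return [
        row["text"]
        for row in rows
        if row["isRetweet"] != "t" and row["isDeleted"] != "t"
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
docs = keep_original_tweets(rows)
print(docs)
```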

In [3]:
dataset, custom = "trump", True

In [4]:
%%time
dataloader_trump = (
    DataLoader(dataset=dataset)
    .prepare_docs(save=f"{dataset}.txt")
    .preprocess_octis(output_folder=f"{dataset}")
)

100%|█████████████████████████████████████████████████████████████████████████| 46693/46693 [00:00<00:00, 621214.28it/s]


created vocab
53637
words filtering done
CPU times: user 1.29 s, sys: 143 ms, total: 1.44 s
Wall time: 9.33 s


To speed up BERTopic, we precalculate the embeddings with the `SentenceTransformer` model `all-mpnet-base-v2`. Otherwise, the embeddings would have to be recalculated for every parameter setting, which would result in a massive runtime.
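The saving can be illustrated with a small stand-in, assuming a hypothetical `CountingEncoder` and `fit_topic_model` in place of the real embedding model and topic model. The real BERTopic `fit_transform` accepts an `embeddings` argument in a similar way, but the code here is only a sketch of the idea of encoding once and reusing the result.

```python
class CountingEncoder:
    """Stand-in for an embedding model that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, docs):
        self.calls += 1
        # Dummy one-dimensional "embedding": the document length.
        return [[float(len(doc))] for doc in docs]


def fit_topic_model(docs, nr_topics, embeddings=None, encoder=None):
    """Hypothetical fit step that only encodes when no embeddings are given."""
    if embeddings is None:
        embeddings = encoder.encode(docs)
    return {"nr_topics": nr_topics, "n_embeddings": len(embeddings)}


docs = ["first document", "second document", "third document"]
settings = [10, 20, 30, 40, 50]

# Without precalculation: one encoding pass per parameter setting.
naive = CountingEncoder()
for k in settings:
    fit_topic_model(docs, k, encoder=naive)

# With precalculation: a single encoding pass shared by all settings.
cached = CountingEncoder()
shared_embeddings = cached.encode(docs)
for k in settings:
    fit_topic_model(docs, k, embeddings=shared_embeddings)

print(naive.calls, cached.calls)
```

With five topic counts and three runs, the same reasoning applies to all fifteen fits per dataset.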

In [8]:
# Prepare data
dataset, custom = "trump", True
dataloader = DataLoader(dataset)
data_trump = dataloader.load_octis(custom)
data_trump = [" ".join(words) for words in data_trump.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings_trump = model.encode(data_trump, show_progress_bar=True)

Batches: 100%|██████████████████████████████████████████████████████████████████████| 1383/1383 [13:43<00:00,  1.68it/s]


In [9]:
# Save embeddings
np.savetxt('embeddings_trump.txt', embeddings_trump)

As described in the thesis and the BERTopic paper (https://arxiv.org/pdf/2203.05794.pdf), the performance of the topic models is measured with two widely used metrics: topic coherence and topic diversity.

For each topic model, coherence is evaluated using Normalized Pointwise Mutual Information (NPMI). This measure ranges from $-1$ to $1$, where $1$ indicates a perfect association.
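For a word pair, NPMI is $\log\frac{P(w_i, w_j)}{P(w_i)P(w_j)}$ divided by $-\log P(w_i, w_j)$, and a topic's score is the mean over all pairs of its top words. The sketch below is a simplified illustration, not the evaluation code used here: it estimates the probabilities from whole-document co-occurrence (coherence libraries often use sliding windows instead), the function names and toy documents are hypothetical, and the values returned for the two edge cases are conventions chosen for this sketch.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words, with probabilities estimated from document counts."""
    n = len(docs)
    p1 = sum(w1 in doc for doc in docs) / n
    p2 = sum(w2 in doc for doc in docs) / n
    p12 = sum(w1 in doc and w2 in doc for doc in docs) / n
    if p12 == 0:
        return -1.0  # convention: the words never co-occur
    if p12 == 1:
        return 1.0   # convention: the words co-occur in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(top_words, docs):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens.
docs = [
    {"tax", "cut", "jobs"},
    {"tax", "cut", "economy"},
    {"border", "wall", "security"},
    {"border", "wall", "jobs"},
]

print(topic_npmi(["tax", "cut"], docs))
print(npmi_pair("tax", "wall", docs))
```

In this toy corpus, "tax" and "cut" always appear together, "tax" and "wall" never do, and "tax" and "jobs" co-occur exactly as often as independence would predict.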

Topic diversity is the proportion of unique words among the top words of all topics. This measure ranges from $0$ to $1$, where $0$ indicates redundant topics and $1$ indicates varied topics.
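A minimal sketch of this measure, assuming each topic is given as a list of its top words; the function name `topic_diversity`, the `top_k` parameter, and the example topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


topics = [
    ["tax", "cut", "jobs"],
    ["border", "wall", "jobs"],
]

# "jobs" appears in both topics, so 5 of the 6 top words are unique.
print(topic_diversity(topics))
```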

The number of topics ranges from $10$ to $50$ in steps of $10$, and the NPMI score is calculated at each step for each topic model. Results are averaged across $3$ runs for each step.

In [11]:
# Evaluating BERTopic on Trump's tweets

for run in range(3):  # three independent runs, averaged later

    params = {
        "embedding_model": "all-mpnet-base-v2",
        "nr_topics": list(range(10, 51, 10)),  # 10, 20, 30, 40, 50
        "min_topic_size": 15,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings_trump,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_trump_{run+1}")

2024-03-08 09:04:07,820 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:05:09,432 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:05:09,435 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before

Results
npmi: 0.04670822821177222
diversity: 0.5888888888888889
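
The `huggingface/tokenizers` warning in the output above names its own remedy: set the environment variable `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is run in a cell before the tokenizer is first used; the choice of `"false"` is an assumption that trades tokenizer parallelism for a quieter log.

```python
import os

# Set before any tokenizer is used, so the forked worker processes
# do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```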
 


2024-03-08 09:05:24,489 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:05:53,925 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:05:53,927 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:05:55,682 - BERTopic - Cluster - Completed ✓
2024-03-08 09:05:55,684 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:05:56,836 - BERTopic - Representation - Completed ✓
2024-03-08 09:05:56,837 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:05:57,718 - BERTopic - Topic reduction - Reduced number of topics from 376 to 20


Results
npmi: -0.030685794454056486
diversity: 0.7263157894736842
 


2024-03-08 09:06:06,855 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:06:33,051 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:06:33,053 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:06:34,801 - BERTopic - Cluster - Completed ✓
2024-03-08 09:06:34,802 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:06:36,206 - BERTopic - Representation - Completed ✓
2024-03-08 09:06:36,208 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:06:37,237 - BERTopic - Topic reduction - Reduced number of topics from 366 to 30


Results
npmi: -0.04022613936181687
diversity: 0.7758620689655172
 


2024-03-08 09:06:46,852 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:07:12,576 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:07:12,578 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:07:14,247 - BERTopic - Cluster - Completed ✓
2024-03-08 09:07:14,248 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:07:15,317 - BERTopic - Representation - Completed ✓
2024-03-08 09:07:15,318 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:07:16,167 - BERTopic - Topic reduction - Reduced number of topics from 371 to 40


Results
npmi: -0.05237327016179927
diversity: 0.7897435897435897
 


2024-03-08 09:07:25,618 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:07:48,394 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:07:48,397 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:07:50,026 - BERTopic - Cluster - Completed ✓
2024-03-08 09:07:50,027 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:07:51,069 - BERTopic - Representation - Completed ✓
2024-03-08 09:07:51,070 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:07:51,902 - BERTopic - Topic reduction - Reduced number of topics from 387 to 50


Results
npmi: -0.04633290133116846
diversity: 0.7857142857142857
 


2024-03-08 09:08:02,840 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:08:27,029 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:08:27,032 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:08:28,746 - BERTopic - Cluster - Completed ✓
2024-03-08 09:08:28,747 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:08:29,739 - BERTopic - Representation - Completed ✓
2024-03-08 09:08:29,740 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:08:30,542 - BERTopic - Topic reduction - Reduced number of topics from 380 to 10


Results
npmi: -0.02172341849388519
diversity: 0.6777777777777778
 


2024-03-08 09:08:39,247 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:09:04,558 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:09:04,561 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:09:06,178 - BERTopic - Cluster - Completed ✓
2024-03-08 09:09:06,179 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:09:07,208 - BERTopic - Representation - Completed ✓
2024-03-08 09:09:07,210 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:09:08,035 - BERTopic - Topic reduction - Reduced number of topics from 376 to 20


Results
npmi: -0.0356582605420622
diversity: 0.6789473684210526
 


2024-03-08 09:09:17,646 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:09:47,990 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:09:47,993 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:09:49,750 - BERTopic - Cluster - Completed ✓
2024-03-08 09:09:49,751 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:09:50,786 - BERTopic - Representation - Completed ✓
2024-03-08 09:09:50,788 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:09:51,576 - BERTopic - Topic reduction - Reduced number of topics from 375 to 30


Results
npmi: -0.041131454651631316
diversity: 0.7551724137931034
 


2024-03-08 09:10:01,787 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:10:28,112 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:10:28,115 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:10:29,833 - BERTopic - Cluster - Completed ✓
2024-03-08 09:10:29,834 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:10:31,015 - BERTopic - Representation - Completed ✓
2024-03-08 09:10:31,018 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:10:31,942 - BERTopic - Topic reduction - Reduced number of topics from 378 to 40


Results
npmi: -0.03687611110385416
diversity: 0.7538461538461538
 


2024-03-08 09:10:41,886 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:11:11,121 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:11:11,125 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:11:13,127 - BERTopic - Cluster - Completed ✓
2024-03-08 09:11:13,129 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:11:14,448 - BERTopic - Representation - Completed ✓
2024-03-08 09:11:14,449 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:11:15,249 - BERTopic - Topic reduction - Reduced number of topics from 373 to 50


Results
npmi: -0.05025273712276578
diversity: 0.7775510204081633
 


2024-03-08 09:11:27,625 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:11:56,105 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:11:56,107 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:11:57,620 - BERTopic - Cluster - Completed ✓
2024-03-08 09:11:57,622 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:11:58,721 - BERTopic - Representation - Completed ✓
2024-03-08 09:11:58,722 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:11:59,656 - BERTopic - Topic reduction - Reduced number of topics from 354 to 10


Results
npmi: -0.08271111096597342
diversity: 0.7333333333333333
 


2024-03-08 09:12:08,418 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:12:35,501 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:12:35,504 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:12:37,122 - BERTopic - Cluster - Completed ✓
2024-03-08 09:12:37,124 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:12:38,146 - BERTopic - Representation - Completed ✓
2024-03-08 09:12:38,147 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:12:38,967 - BERTopic - Topic reduction - Reduced number of topics from 372 to 20


Results
npmi: -0.034301492223340994
diversity: 0.7315789473684211
 


2024-03-08 09:12:47,527 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:13:09,536 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:13:09,538 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:13:11,158 - BERTopic - Cluster - Completed ✓
2024-03-08 09:13:11,159 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:13:12,221 - BERTopic - Representation - Completed ✓
2024-03-08 09:13:12,222 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:13:13,102 - BERTopic - Topic reduction - Reduced number of topics from 377 to 30


Results
npmi: -0.024701089070270986
diversity: 0.7620689655172413
 


2024-03-08 09:13:22,414 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:13:44,104 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:13:44,106 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:13:45,692 - BERTopic - Cluster - Completed ✓
2024-03-08 09:13:45,693 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:13:46,690 - BERTopic - Representation - Completed ✓
2024-03-08 09:13:46,692 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:13:47,481 - BERTopic - Topic reduction - Reduced number of topics from 363 to 40


Results
npmi: -0.05100797738638282
diversity: 0.782051282051282
 


2024-03-08 09:13:57,761 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 09:14:19,133 - BERTopic - Dimensionality - Completed ✓
2024-03-08 09:14:19,136 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 09:14:20,821 - BERTopic - Cluster - Completed ✓
2024-03-08 09:14:20,823 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 09:14:21,961 - BERTopic - Representation - Completed ✓
2024-03-08 09:14:21,963 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 09:14:22,826 - BERTopic - Topic reduction - Reduced number of topics from 379 to 50


Results
npmi: -0.05082242495793191
diversity: 0.7918367346938775
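
Averaging across the three runs can be sketched as follows, using the values printed above for the first setting of each run. Two assumptions apply: the first `Results` block of every run is taken to be the 10-topic setting (the log of the first run is truncated before the topic reduction line), and the variable names are hypothetical.

```python
from statistics import mean

# First Results block of runs 1, 2 and 3, copied from the output above.
bertopic_trump_10_topics = {
    "npmi": [0.04670822821177222, -0.02172341849388519, -0.08271111096597342],
    "diversity": [0.5888888888888889, 0.6777777777777778, 0.7333333333333333],
}

averaged = {
    metric: mean(values)
    for metric, values in bertopic_trump_10_topics.items()
}
print(averaged)
```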
 


In [12]:
# Evaluating Top2Vec on Trump's tweets

for run in range(3):  # three independent runs, averaged later

    params = {"nr_topics": list(range(10, 51, 10)),  # 10, 20, 30, 40, 50
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=None,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_trump_{run+1}")

2024-03-08 09:15:18,223 - top2vec - INFO - Pre-processing documents for training
2024-03-08 09:15:19,369 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:20:05,551 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:20:19,064 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Di

Results
npmi: -0.10390814651919382
diversity: 0.75
 


2024-03-08 09:20:54,672 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:25:31,982 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:25:48,585 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.16731218688713242
diversity: 0.675
 


2024-03-08 09:26:21,332 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:30:49,891 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:31:03,919 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.18263644044072497
diversity: 0.7333333333333333
 


2024-03-08 09:31:35,687 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:36:10,490 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:36:23,607 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.18739525237244692
diversity: 0.6575
 


2024-03-08 09:36:54,223 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:41:27,169 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:41:43,184 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.19062295402595072
diversity: 0.628
 


2024-03-08 09:42:13,550 - top2vec - INFO - Pre-processing documents for training
2024-03-08 09:42:15,104 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:46:51,364 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:47:04,652 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Di

Results
npmi: -0.07873945450145572
diversity: 0.8
 


2024-03-08 09:47:38,639 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:52:12,613 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:52:27,937 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.1613443235492698
diversity: 0.765
 


2024-03-08 09:52:57,311 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 09:57:26,664 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 09:57:39,992 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.17164136166589014
diversity: 0.64
 


2024-03-08 09:58:11,261 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:02:42,190 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:02:59,379 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.18538097217944197
diversity: 0.6675
 


2024-03-08 10:03:28,451 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:07:59,671 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:08:13,134 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.19172241023067327
diversity: 0.654
 


2024-03-08 10:08:44,454 - top2vec - INFO - Pre-processing documents for training
2024-03-08 10:08:45,952 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:13:17,827 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:13:31,394 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Di

Results
npmi: -0.15038312355697606
diversity: 0.76
 


2024-03-08 10:14:05,015 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:18:37,078 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:18:50,059 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.1449889125821809
diversity: 0.75
 


2024-03-08 10:19:24,113 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:23:56,797 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:24:11,839 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.13949965709799034
diversity: 0.6766666666666666
 


2024-03-08 10:24:40,914 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:29:12,733 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:29:26,881 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.20281598009935015
diversity: 0.6575
 


2024-03-08 10:29:59,764 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:34:32,379 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:34:50,171 - top2vec - INFO - Finding dense areas of documents
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the 

Results
npmi: -0.16828750898234776
diversity: 0.648
 


## 20Newsgroups

The 20Newsgroups dataset comprises 16309 newsgroup posts on 20 topics.

In [13]:
dataset, custom = "20NewsGroup", False
dataloader_20ng = DataLoader(dataset)
data_20ng = dataloader_20ng.load_octis(custom)
data_20ng = [" ".join(words) for words in data_20ng.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings_20ng = model.encode(data_20ng, show_progress_bar=True)

Batches: 100%|████████████████████████████████████████████████████████████████████████| 510/510 [13:42<00:00,  1.61s/it]


In [14]:
# Save embeddings
np.savetxt('embeddings_20ng.txt', embeddings_20ng)

In [15]:
# Evaluating BERTopic on 20Newsgroups

for run in range(3):  # three independent runs, averaged later

    params = {
        "embedding_model": "all-mpnet-base-v2",
        "nr_topics": list(range(10, 51, 10)),  # 10, 20, 30, 40, 50
        "min_topic_size": 15,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings_20ng,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_20ng_{run+1}")

2024-03-08 10:49:09,150 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:49:15,984 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:49:15,986 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:49:16,424 - BERTopic - Cluster - Completed ✓
2024-03-08 10:49:16,425 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:49:17,072 - BERTopic - Representation - Completed ✓
2024-03-08 10:49:17,073 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:49:17,699 - BERTopic - Topic reduction - Reduced number of topics from 83 to 10


Results
npmi: 0.09357326978765458
diversity: 0.8333333333333334
 


2024-03-08 10:49:28,728 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:49:37,055 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:49:37,057 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:49:37,493 - BERTopic - Cluster - Completed ✓
2024-03-08 10:49:37,495 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:49:38,138 - BERTopic - Representation - Completed ✓
2024-03-08 10:49:38,139 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:49:38,739 - BERTopic - Topic reduction - Reduced number of topics from 85 to 20


Results
npmi: 0.10795244543605588
diversity: 0.7894736842105263
 


2024-03-08 10:49:55,728 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:50:04,130 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:50:04,132 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:50:04,553 - BERTopic - Cluster - Completed ✓
2024-03-08 10:50:04,554 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:50:05,177 - BERTopic - Representation - Completed ✓
2024-03-08 10:50:05,178 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:50:05,796 - BERTopic - Topic reduction - Reduced number of topics from 82 to 30


Results
npmi: 0.11294163543914515
diversity: 0.7965517241379311
 


2024-03-08 10:50:14,764 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:50:21,277 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:50:21,279 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:50:21,712 - BERTopic - Cluster - Completed ✓
2024-03-08 10:50:21,713 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:50:22,320 - BERTopic - Representation - Completed ✓
2024-03-08 10:50:22,322 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:50:22,896 - BERTopic - Topic reduction - Reduced number of topics from 79 to 40


Results
npmi: 0.11429852829021793
diversity: 0.8051282051282052
 


2024-03-08 10:50:32,888 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:50:40,727 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:50:40,730 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:50:41,154 - BERTopic - Cluster - Completed ✓
2024-03-08 10:50:41,155 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:50:41,750 - BERTopic - Representation - Completed ✓
2024-03-08 10:50:41,751 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:50:42,303 - BERTopic - Topic reduction - Reduced number of topics from 88 to 50


Results
npmi: 0.1110840738349059
diversity: 0.7591836734693878
 


2024-03-08 10:50:52,906 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:51:00,052 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:51:00,054 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:51:00,483 - BERTopic - Cluster - Completed ✓
2024-03-08 10:51:00,484 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:51:01,089 - BERTopic - Representation - Completed ✓
2024-03-08 10:51:01,090 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:51:01,676 - BERTopic - Topic reduction - Reduced number of topics from 88 to 10


Results
npmi: 0.10204368387497295
diversity: 0.8444444444444444
 


2024-03-08 10:51:09,104 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:51:15,627 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:51:15,629 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:51:16,072 - BERTopic - Cluster - Completed ✓
2024-03-08 10:51:16,073 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:51:16,687 - BERTopic - Representation - Completed ✓
2024-03-08 10:51:16,689 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:51:17,275 - BERTopic - Topic reduction - Reduced number of topics from 85 to 20


Results
npmi: 0.11085663609025606
diversity: 0.8
 


2024-03-08 10:51:25,641 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:51:32,129 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:51:32,131 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:51:32,560 - BERTopic - Cluster - Completed ✓
2024-03-08 10:51:32,561 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:51:33,132 - BERTopic - Representation - Completed ✓
2024-03-08 10:51:33,133 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:51:33,661 - BERTopic - Topic reduction - Reduced number of topics from 86 to 30


Results
npmi: 0.12636172863946518
diversity: 0.8137931034482758
 


2024-03-08 10:51:42,762 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:51:49,898 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:51:49,899 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:51:50,324 - BERTopic - Cluster - Completed ✓
2024-03-08 10:51:50,325 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:51:50,987 - BERTopic - Representation - Completed ✓
2024-03-08 10:51:50,989 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:51:51,615 - BERTopic - Topic reduction - Reduced number of topics from 87 to 40


Results
npmi: 0.12527156262414135
diversity: 0.7666666666666667
 


2024-03-08 10:52:01,225 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:52:08,924 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:52:08,926 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:52:09,358 - BERTopic - Cluster - Completed ✓
2024-03-08 10:52:09,359 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:52:09,999 - BERTopic - Representation - Completed ✓
2024-03-08 10:52:10,000 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:52:10,619 - BERTopic - Topic reduction - Reduced number of topics from 83 to 50


Results
npmi: 0.12260199241710841
diversity: 0.7714285714285715
 


2024-03-08 10:52:21,326 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:52:28,065 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:52:28,066 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:52:28,506 - BERTopic - Cluster - Completed ✓
2024-03-08 10:52:28,507 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:52:29,089 - BERTopic - Representation - Completed ✓
2024-03-08 10:52:29,090 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:52:29,616 - BERTopic - Topic reduction - Reduced number of topics from 84 to 10


Results
npmi: 0.08834939967672445
diversity: 0.8777777777777778
 


2024-03-08 10:52:37,110 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:52:44,165 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:52:44,167 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:52:44,596 - BERTopic - Cluster - Completed ✓
2024-03-08 10:52:44,597 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:52:45,211 - BERTopic - Representation - Completed ✓
2024-03-08 10:52:45,212 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:52:45,817 - BERTopic - Topic reduction - Reduced number of topics from 87 to 20


Results
npmi: 0.1281782336339834
diversity: 0.8105263157894737
 


2024-03-08 10:52:54,147 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:53:00,728 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:53:00,729 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:53:01,170 - BERTopic - Cluster - Completed ✓
2024-03-08 10:53:01,171 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:53:01,792 - BERTopic - Representation - Completed ✓
2024-03-08 10:53:01,793 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:53:02,373 - BERTopic - Topic reduction - Reduced number of topics from 87 to 30


Results
npmi: 0.12695008613977155
diversity: 0.803448275862069
 


2024-03-08 10:53:11,417 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:53:19,480 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:53:19,482 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:53:19,928 - BERTopic - Cluster - Completed ✓
2024-03-08 10:53:19,929 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:53:20,498 - BERTopic - Representation - Completed ✓
2024-03-08 10:53:20,499 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:53:21,029 - BERTopic - Topic reduction - Reduced number of topics from 77 to 40


Results
npmi: 0.12256301925854575
diversity: 0.7948717948717948
 


2024-03-08 10:53:30,447 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-08 10:53:38,051 - BERTopic - Dimensionality - Completed ✓
2024-03-08 10:53:38,053 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-08 10:53:38,477 - BERTopic - Cluster - Completed ✓
2024-03-08 10:53:38,478 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-08 10:53:39,096 - BERTopic - Representation - Completed ✓
2024-03-08 10:53:39,098 - BERTopic - Topic reduction - Reducing number of topics
2024-03-08 10:53:39,743 - BERTopic - Topic reduction - Reduced number of topics from 83 to 50


Results
npmi: 0.12048334037241167
diversity: 0.7673469387755102
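
The fifteen BERTopic scores above are easier to compare once averaged over the three runs. The sketch below hard-codes the printed `npmi` values (rounded to four decimals) in the order they appear, which the "Reduced number of topics from ... to N" log lines confirm is 10, 20, 30, 40, 50 topics within each run. The variable names are illustrative only.

```python
import numpy as np

nr_topics = [10, 20, 30, 40, 50]

# Rows: runs 1-3; columns: nr_topics = 10, 20, 30, 40, 50.
# Values copied from the output above, rounded to four decimals.
bertopic_npmi = np.array([
    [0.0936, 0.1080, 0.1129, 0.1143, 0.1111],
    [0.1020, 0.1109, 0.1264, 0.1253, 0.1226],
    [0.0883, 0.1282, 0.1270, 0.1226, 0.1205],
])

mean_npmi = bertopic_npmi.mean(axis=0)
std_npmi = bertopic_npmi.std(axis=0, ddof=1)  # sample std over the 3 runs

for n, m, s in zip(nr_topics, mean_npmi, std_npmi):
    print(f"nr_topics={n:2d}  npmi mean={m:.4f}  std={s:.4f}")
```

With only three runs per setting the standard deviation is a rough indicator of run-to-run variation rather than a reliable estimate.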
 


In [16]:
# Evaluating Top2Vec on 20Newsgroups (three repeated runs)

for i in range(3):

    # k (not i) avoids shadowing the run index; yields 10, 20, ..., 50
    params = {"nr_topics": [(k + 1) * 10 for k in range(5)],
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=None,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_20ng_{i+1}")

2024-03-08 10:53:48,212 - top2vec - INFO - Pre-processing documents for training
2024-03-08 10:53:49,389 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:55:19,876 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:55:25,633 - top2vec - INFO - Finding dense areas of documents
2024-03-08 10:55:25,996 - top2vec - INFO - Finding topics
2024-03-08 10:55:36,993 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19368906248309753
diversity: 0.95
 


2024-03-08 10:55:38,175 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:57:10,768 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:57:16,744 - top2vec - INFO - Finding dense areas of documents
2024-03-08 10:57:17,100 - top2vec - INFO - Finding topics
2024-03-08 10:57:27,852 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.2009235015681318
diversity: 0.905
 


2024-03-08 10:57:29,021 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 10:58:59,640 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 10:59:05,247 - top2vec - INFO - Finding dense areas of documents
2024-03-08 10:59:05,616 - top2vec - INFO - Finding topics
2024-03-08 10:59:17,287 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.1991827216303684
diversity: 0.8066666666666666
 


2024-03-08 10:59:18,454 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:00:48,381 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:00:54,316 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:00:54,677 - top2vec - INFO - Finding topics
2024-03-08 11:01:05,625 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.1855369696568224
diversity: 0.7525
 


2024-03-08 11:01:06,803 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:02:38,742 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:02:46,144 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:02:46,508 - top2vec - INFO - Finding topics


Results
npmi: 0.16967111255765294
diversity: 0.7
 


2024-03-08 11:02:56,709 - top2vec - INFO - Pre-processing documents for training
2024-03-08 11:02:57,916 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:04:30,468 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:04:36,615 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:04:37,049 - top2vec - INFO - Finding topics
2024-03-08 11:04:47,543 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.1948303188871267
diversity: 0.95
 


2024-03-08 11:04:48,713 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:06:20,932 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:06:26,632 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:06:26,990 - top2vec - INFO - Finding topics
2024-03-08 11:06:36,981 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.21181375194545868
diversity: 0.885
 


2024-03-08 11:06:38,171 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:08:08,582 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:08:15,021 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:08:15,396 - top2vec - INFO - Finding topics
2024-03-08 11:08:26,535 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19785540626246328
diversity: 0.82
 


2024-03-08 11:08:27,726 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:09:58,675 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:10:04,115 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:10:04,468 - top2vec - INFO - Finding topics
2024-03-08 11:10:16,160 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.18962697637390638
diversity: 0.7475
 


2024-03-08 11:10:17,343 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:11:49,331 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:11:56,550 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:11:56,922 - top2vec - INFO - Finding topics


Results
npmi: 0.17508521194766924
diversity: 0.672
 


2024-03-08 11:12:07,507 - top2vec - INFO - Pre-processing documents for training
2024-03-08 11:12:08,705 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:13:40,360 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:13:46,138 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:13:46,507 - top2vec - INFO - Finding topics
2024-03-08 11:13:57,233 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.21062832296068504
diversity: 0.98
 


2024-03-08 11:13:58,950 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:15:31,742 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:15:37,511 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:15:37,955 - top2vec - INFO - Finding topics
2024-03-08 11:15:50,008 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.20691903374412743
diversity: 0.89
 


2024-03-08 11:15:51,348 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:17:27,505 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:17:33,121 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:17:33,487 - top2vec - INFO - Finding topics
2024-03-08 11:17:45,063 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.19587179289722034
diversity: 0.8166666666666667
 


2024-03-08 11:17:46,854 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:19:25,833 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:19:31,655 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:19:32,011 - top2vec - INFO - Finding topics
2024-03-08 11:19:43,393 - top2vec - INFO - Pre-processing documents for training


Results
npmi: 0.17776161630479406
diversity: 0.735
 


2024-03-08 11:19:44,604 - top2vec - INFO - Creating joint document/word embedding
2024-03-08 11:21:22,284 - top2vec - INFO - Creating lower dimension embedding of documents
2024-03-08 11:21:29,177 - top2vec - INFO - Finding dense areas of documents
2024-03-08 11:21:29,546 - top2vec - INFO - Finding topics


Results
npmi: 0.17568380073170395
diversity: 0.676
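
The `diversity` scores above fall as the number of topics grows. Assuming the score is the common proportion-of-unique-words measure (the share of distinct words among the top-k words of all topics, as in OCTIS's `TopicDiversity`), it can be sketched as follows. The function and the toy topics are illustrative and not taken from the `Trainer` code.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        raise ValueError("no topic words given")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics: "game" and "team" each appear in two topics.
toy_topics = [
    ["space", "nasa", "orbit", "launch"],
    ["hockey", "team", "game", "season"],
    ["game", "team", "windows", "driver"],
]

score = topic_diversity(toy_topics, topk=4)
print(f"diversity: {score:.4f}")  # 10 unique words out of 12
```

Under this definition a score of 0.95 with 10 topics and 10 words each would mean 95 distinct words out of 100, so more topics drawn from the same vocabulary tend to share more top words.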
 
