# **Topic Model Evaluation**
Here, you will find the code needed to run the experiments of the paper:

*BERTopic: Neural topic modeling with a class-based TF-IDF procedure*.

The package itself can be found [here](https://github.com/MaartenGr/BERTopic) and the repository for evaluation [here]().

## **Installation**
First, we need to install a few packages in order to run our experiments. Most of the packages are installed through the `tm_evaluation` package of which [OCTIS](https://github.com/MIND-Lab/OCTIS) is an important component. 

You can install the evaluation package with `pip install .` from the root. To additionally install CTM run `pip install .[ctm]`To install BERTopic, run `pip install bertopic==v0.9.4` after installing the base package or use `pip install .[bertopic]`. Top2Vec should be installed with `pip install top2vec==v1.0.26` after installing the base package. 

To run a faster version of LDAseq for dynamic topic modeling, we need to uninstall gensim and install a specific merge that allows for this speed-up. First, run `pip uninstall gensim -y`, then, run `pip install git+https://github.com/RaRe-Technologies/gensim.git@refs/pull/3172/merge`

**NOTE**: After installing the above packages, make sure to restart the runtime otherwise you are likely to run into issues. 

#  1. **Data**
Some of the data can be accessed through OCTIS, such as the `20NewsGroup` and `BBC_News` datasets. Other datasets, however, are downloaded and then run through OCTIS in order to be used in their pipeline. 

The datasets that we are going to be preparing are: 
* Trump's tweets
* United Nations general debates between 2006 and 2015 

In [1]:
import pandas as pd
print("Loading climate data")
df = pd.read_csv('/Users/alessiogandelli/dev/internship/topic_modeling/data/alessio.csv',  sep = '\t', lineterminator='\n')
df = df[~df['text'].str.startswith('RT')]
# only english tweets 
df = df[df['lang'] == 'en']
# remove links
df['text'] = df['text'].str.replace(r'http\S+', '', regex=True)

# to lowercase 
df['text'] = df['text'].str.lower()
#remove punctuation
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# remove #cop22 and #climatechange
df['text'] = df['text'].str.replace(r'cop22', '', regex=True)
df['text'] = df['text'].str.replace(r'climatechange', '', regex=True)
df['text'] = df['text'].str.replace(r'p2', '', regex=True)
df['text'] = df['text'].str.replace(r'rt', '', regex=True)
# remove empty tweets
df = df[df['text'] != '']

df['date'] = pd.to_datetime(df['year'].astype(str) + '-' + df['month'].astype(str) + '-' + df['day'].astype(str))

timestamps = df.date.to_list()
docs = df.text.to_list()



Loading climate data


In [2]:
from evaluation import Trainer, DataLoader

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alessiogandelli/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Climate
Using our `DataLoader` we can prepare the documents and save them in an OCTIS-based format: 

In [4]:
%%time
dataloader = DataLoader(dataset="climate").prepare_docs(save="climate.txt").preprocess_octis(output_folder="climate")

Loading climate data
created vocab
5225
words filtering done
CPU times: user 1.33 s, sys: 91.1 ms, total: 1.42 s
Wall time: 2.72 s


In [None]:
%%time
dataloader = DataLoader(dataset="un_dtm").prepare_docs(save="un_dtm.txt").preprocess_octis(output_folder="un_dtm")

created vocab
69447
words filtering done
CPU times: user 22min, sys: 21.5 s, total: 22min 21s
Wall time: 22min 22s


# 2. **Evaluation**
After preparing our data, we can start evaluating the topic models as used in the experiments. OCTIS already has a number of models prepared that we can use directly as shown below. 

First, we specify what the dataset is and whether that was a custom dataset not found in OCTIS. To run our custom trump dataset, we run `dataset, custom = "trump", True`. In contrast, if we are to use the prepackaged 20NewsGroup dataset, we run `dataset, custom = "20NewsGroup", False` instead. 

The OCTIS datasets can be found [here](https://github.com/MIND-Lab/OCTIS#available-datasets). 

Second, we define a number of parameters to be used for the model. It uses the following format: 

`params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}`

were we define a number of topics to loop over and calculate the evluation metrics but also define a number of parameters used in the models. 

#### **Parameters**
The parameters for LDA and NMF:


```python
params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}`
```

The parameters for Top2Vec:

```python
params = {"nr_topics": [(i+1)*10 for i in range(5)],
          "hdbscan_args": {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}}
```
Note that the `min_cluster_size` is 15 for all datasets except BBC_News.

The parameters for CTM:

```python
params = {
    "n_components": [(i+1)*10 for i in range(5)],
    "contextual_size":768
}
```

The parameters for BERTopic:

```python
params = {
    "nr_topics": [(i+1)*10 for i in range(5)],
    "min_topic_size": 15,
    "verbose": True
}
```

Note that the `min_topic_size` is 15 for all datasets except BBC_News. Note that we do not set a `embedding_model` here. We do this on purpose as we can generate the embeddings beforehand and pass those to BERTopic. 

## **OCTIS**
Here, we can run the experiments for NMF and LDA. 

#### NMF

In [6]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "climate", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="NMF",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"NMF_trump_{i+1}")

Results
npmi: -0.05603186703787161
diversity: 0.47
 
Results
npmi: -0.04093330696194557
diversity: 0.475
 
Results
npmi: -0.04475179399892415
diversity: 0.41
 
Results
npmi: -0.07069132820451146
diversity: 0.3675
 
Results
npmi: -0.05190434104957615
diversity: 0.392
 
Results
npmi: -0.05631840316969816
diversity: 0.5
 
Results
npmi: -0.0684209728954257
diversity: 0.43
 
Results
npmi: -0.05425098973475934
diversity: 0.37333333333333335
 
Results
npmi: -0.05317002138686715
diversity: 0.3875
 
Results
npmi: -0.07643445436005926
diversity: 0.382
 
Results
npmi: -0.0309767664156465
diversity: 0.5
 
Results
npmi: -0.07403645270712103
diversity: 0.405
 
Results
npmi: -0.06447445313296334
diversity: 0.39
 
Results
npmi: -0.055027317482541485
diversity: 0.415
 
Results
npmi: -0.06521831805369326
diversity: 0.386
 


#### LDA

In [12]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "climate", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="LDA",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"LDA_trump_{i+1}")

Results
npmi: -0.03375526051098856
diversity: 0.25
 
Results
npmi: -0.027672619656356534
diversity: 0.185
 
Results
npmi: -0.038811510855194374
diversity: 0.15666666666666668
 
Results
npmi: -0.038554992738596625
diversity: 0.1425
 
Results
npmi: -0.04854904207078382
diversity: 0.152
 
Results
npmi: -0.03133763429252951
diversity: 0.25
 
Results
npmi: -0.03644984526176739
diversity: 0.185
 
Results
npmi: -0.041992068543786204
diversity: 0.15333333333333332
 
Results
npmi: -0.042552408782688976
diversity: 0.15
 
Results
npmi: -0.050066544594813925
diversity: 0.17
 
Results
npmi: -0.040218373898912994
diversity: 0.25
 
Results
npmi: -0.03462023351900684
diversity: 0.19
 
Results
npmi: -0.0406568512150982
diversity: 0.16
 
Results
npmi: -0.04640165888202994
diversity: 0.1975
 
Results
npmi: -0.04643163547300727
diversity: 0.154
 


## **BERTopic**

To speed up BERTopic, we can generate the embeddings before passing it to the `Trainer`. This way, the same embeddings do not have to be generated 5 times which speeds up evaluation quite a bit. 

In [13]:
%%capture
from sentence_transformers import SentenceTransformer

# Prepare data
dataset, custom = "climate", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]



In [16]:
# Extract embeddings
bert_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bert_model.encode(data, show_progress_bar=True)

Batches:   0%|          | 0/53 [00:00<?, ?it/s]

As show above, we load in the `data` which the data loader and combine the tokens in each document to generate our training data. Then, we pass it to the sentence transformer model of our choice and generate the embeddings. 

Next, we pass these embeddings to the `bt_embeddings` parameter to speed up training: 

In [19]:
for i in range(3):
    params = {
        "embedding_model": "all-MiniLM-L6-v2",
        "nr_topics": [(i+1)*10 for i in range(5)],
        "min_topic_size": 15,
        
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_MiniLM_{i+1}")

2023-03-15 17:45:43,654 - BERTopic - Reduced dimensionality
2023-03-15 17:45:43,951 - BERTopic - Clustered reduced embeddings
2023-03-15 17:45:47,081 - BERTopic - Reduced number of topics from 23 to 10


Results
npmi: -0.07073105062034826
diversity: 0.8111111111111111
 


2023-03-15 17:46:02,765 - BERTopic - Reduced dimensionality
2023-03-15 17:46:02,918 - BERTopic - Clustered reduced embeddings
2023-03-15 17:46:04,967 - BERTopic - Reduced number of topics from 21 to 20


Results
npmi: 0.06707825585125478
diversity: 0.8210526315789474
 


2023-03-15 17:46:20,988 - BERTopic - Reduced dimensionality
2023-03-15 17:46:21,154 - BERTopic - Clustered reduced embeddings
2023-03-15 17:46:21,603 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.26655536540664176
diversity: 0.9666666666666667
 


2023-03-15 17:46:38,000 - BERTopic - Reduced dimensionality
2023-03-15 17:46:38,252 - BERTopic - Clustered reduced embeddings
2023-03-15 17:46:38,971 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.26655536540664176
diversity: 0.9666666666666667
 


2023-03-15 17:46:55,431 - BERTopic - Reduced dimensionality
2023-03-15 17:46:55,584 - BERTopic - Clustered reduced embeddings
2023-03-15 17:46:56,719 - BERTopic - Reduced number of topics from 23 to 23


Results
npmi: 0.06764701045924498
diversity: 0.8090909090909091
 


2023-03-15 17:47:13,540 - BERTopic - Reduced dimensionality
2023-03-15 17:47:13,729 - BERTopic - Clustered reduced embeddings
2023-03-15 17:47:14,168 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.21626427024596295
diversity: 1.0
 


2023-03-15 17:47:28,594 - BERTopic - Reduced dimensionality
2023-03-15 17:47:28,780 - BERTopic - Clustered reduced embeddings
2023-03-15 17:47:29,222 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.21626427024596295
diversity: 1.0
 


2023-03-15 17:47:51,285 - BERTopic - Reduced dimensionality
2023-03-15 17:47:51,476 - BERTopic - Clustered reduced embeddings
2023-03-15 17:47:53,289 - BERTopic - Reduced number of topics from 23 to 23


Results
npmi: 0.059599982325588156
diversity: 0.8181818181818182
 


2023-03-15 17:48:11,738 - BERTopic - Reduced dimensionality
2023-03-15 17:48:11,923 - BERTopic - Clustered reduced embeddings
2023-03-15 17:48:12,264 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.26655536540664176
diversity: 0.9666666666666667
 


2023-03-15 17:48:27,260 - BERTopic - Reduced dimensionality
2023-03-15 17:48:27,460 - BERTopic - Clustered reduced embeddings
2023-03-15 17:48:27,821 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.26655536540664176
diversity: 0.9666666666666667
 


2023-03-15 17:48:42,626 - BERTopic - Reduced dimensionality
2023-03-15 17:48:42,811 - BERTopic - Clustered reduced embeddings
2023-03-15 17:48:43,225 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.21626427024596295
diversity: 1.0
 


2023-03-15 17:48:59,246 - BERTopic - Reduced dimensionality
2023-03-15 17:48:59,462 - BERTopic - Clustered reduced embeddings
2023-03-15 17:48:59,988 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.26655536540664176
diversity: 0.9666666666666667
 


2023-03-15 17:49:14,763 - BERTopic - Reduced dimensionality
2023-03-15 17:49:14,942 - BERTopic - Clustered reduced embeddings
2023-03-15 17:49:15,354 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.21626427024596295
diversity: 1.0
 


2023-03-15 17:49:30,204 - BERTopic - Reduced dimensionality
2023-03-15 17:49:30,405 - BERTopic - Clustered reduced embeddings
2023-03-15 17:49:31,130 - BERTopic - Reduced number of topics from 5 to 5


Results
npmi: 0.2676503800027916
diversity: 0.975
 


2023-03-15 17:49:47,837 - BERTopic - Reduced dimensionality
2023-03-15 17:49:48,042 - BERTopic - Clustered reduced embeddings
2023-03-15 17:49:48,354 - BERTopic - Reduced number of topics from 4 to 4


Results
npmi: 0.21626427024596295
diversity: 1.0
 


## **Top2Vec**
Aside from its Doc2Vec backend, we also want to explore its performance using the `"all-mpnet-base-v2"` SBERT model as that was used in BERTopic. To do so, we make a very slight change to the core code of Top2Vec, namely replacing all instances of `""distiluse-base-multilingual-cased"` with `"all-mpnet-base-v2"`:

In [22]:
import logging
import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
import umap
import hdbscan
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from joblib import dump, load
from sklearn.cluster import dbscan
import tempfile
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from scipy.special import softmax
from top2vec import Top2Vec

try:
    import hnswlib

    _HAVE_HNSWLIB = True
except ImportError:
    _HAVE_HNSWLIB = False

try:
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text

    _HAVE_TENSORFLOW = True
except ImportError:
    _HAVE_TENSORFLOW = False

try:
    from sentence_transformers import SentenceTransformer

    _HAVE_TORCH = True
except ImportError:
    _HAVE_TORCH = False

logger = logging.getLogger('top2vec')
logger.setLevel(logging.WARNING)
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(sh)


def default_tokenizer(doc):
    """Tokenize documents for training and remove too long/short words"""
    return simple_preprocess(strip_tags(doc), deacc=True)


class Top2VecNew(Top2Vec):
    """
    Top2Vec
    Creates jointly embedded topic, document and word vectors.
    Parameters
    ----------
    embedding_model: string
        This will determine which model is used to generate the document and
        word embeddings. The valid string options are:
            * doc2vec
            * universal-sentence-encoder
            * universal-sentence-encoder-multilingual
            * distiluse-base-multilingual-cased
        For large data sets and data sets with very unique vocabulary doc2vec
        could produce better results. This will train a doc2vec model from
        scratch. This method is language agnostic. However multiple languages
        will not be aligned.
        Using the universal sentence encoder options will be much faster since
        those are pre-trained and efficient models. The universal sentence
        encoder options are suggested for smaller data sets. They are also
        good options for large data sets that are in English or in languages
        covered by the multilingual model. It is also suggested for data sets
        that are multilingual.
        For more information on universal-sentence-encoder visit:
        https://tfhub.dev/google/universal-sentence-encoder/4
        For more information on universal-sentence-encoder-multilingual visit:
        https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
        The distiluse-base-multilingual-cased pre-trained sentence transformer
        is suggested for multilingual datasets and languages that are not
        covered by the multilingual universal sentence encoder. The
        transformer is significantly slower than the universal sentence
        encoder options.
        For more informati ond istiluse-base-multilingual-cased visit:
        https://www.sbert.net/docs/pretrained_models.html
    embedding_model_path: string (Optional)
        Pre-trained embedding models will be downloaded automatically by
        default. However they can also be uploaded from a file that is in the
        location of embedding_model_path.
        Warning: the model at embedding_model_path must match the
        embedding_model parameter type.
    documents: List of str
        Input corpus, should be a list of strings.
    min_count: int (Optional, default 50)
        Ignores all words with total frequency lower than this. For smaller
        corpora a smaller min_count will be necessary.
    speed: string (Optional, default 'learn')
        This parameter is only used when using doc2vec as embedding_model.
        It will determine how fast the model takes to train. The
        fast-learn option is the fastest and will generate the lowest quality
        vectors. The learn option will learn better quality vectors but take
        a longer time to train. The deep-learn option will learn the best
        quality vectors but will take significant time to train. The valid
        string speed options are:
        
            * fast-learn
            * learn
            * deep-learn
    use_corpus_file: bool (Optional, default False)
        This parameter is only used when using doc2vec as embedding_model.
        Setting use_corpus_file to True can sometimes provide speedup for
        large datasets when multiple worker threads are available. Documents
        are still passed to the model as a list of str, the model will create
        a temporary corpus file for training.
    document_ids: List of str, int (Optional)
        A unique value per document that will be used for referring to
        documents in search results. If ids are not given to the model, the
        index of each document in the original corpus will become the id.
    keep_documents: bool (Optional, default True)
        If set to False documents will only be used for training and not saved
        as part of the model. This will reduce model size. When using search
        functions only document ids will be returned, not the actual
        documents.
    workers: int (Optional)
        The amount of worker threads to be used in training the model. Larger
        amount will lead to faster training.
    
    tokenizer: callable (Optional, default None)
        Override the default tokenization method. If None then
        gensim.utils.simple_preprocess will be used.
    use_embedding_model_tokenizer: bool (Optional, default False)
        If using an embedding model other than doc2vec, use the model's
        tokenizer for document embedding. If set to True the tokenizer, either
        default or passed callable will be used to tokenize the text to
        extract the vocabulary for word embedding.
    umap_args: dict (Optional, default None)
        Pass custom arguments to UMAP.
    hdbscan_args: dict (Optional, default None)
        Pass custom arguments to HDBSCAN.
    
    verbose: bool (Optional, default True)
        Whether to print status data during training.
    """

    def __init__(self,
                 documents,
                 min_count=50,
                 embedding_model='doc2vec',
                 embedding_model_path=None,
                 speed='learn',
                 use_corpus_file=False,
                 document_ids=None,
                 keep_documents=True,
                 workers=None,
                 tokenizer=None,
                 use_embedding_model_tokenizer=False,
                 umap_args=None,
                 hdbscan_args=None,
                 verbose=True
                 ):

        if verbose:
            logger.setLevel(logging.DEBUG)
            self.verbose = True
        else:
            logger.setLevel(logging.WARNING)
            self.verbose = False

        if tokenizer is None:
            tokenizer = default_tokenizer

        # validate documents
        if not (isinstance(documents, list) or isinstance(documents, np.ndarray)):
            raise ValueError("Documents need to be a list of strings")
        if not all((isinstance(doc, str) or isinstance(doc, np.str_)) for doc in documents):
            raise ValueError("Documents need to be a list of strings")
        if keep_documents:
            self.documents = np.array(documents, dtype="object")
        else:
            self.documents = None

        # validate document ids
        if document_ids is not None:
            if not (isinstance(document_ids, list) or isinstance(document_ids, np.ndarray)):
                raise ValueError("Documents ids need to be a list of str or int")

            if len(documents) != len(document_ids):
                raise ValueError("Document ids need to match number of documents")
            elif len(document_ids) != len(set(document_ids)):
                raise ValueError("Document ids need to be unique")

            if all((isinstance(doc_id, str) or isinstance(doc_id, np.str_)) for doc_id in document_ids):
                self.doc_id_type = np.str_
            elif all((isinstance(doc_id, int) or isinstance(doc_id, np.int_)) for doc_id in document_ids):
                self.doc_id_type = np.int_
            else:
                raise ValueError("Document ids need to be str or int")

            self.document_ids_provided = True
            self.document_ids = np.array(document_ids)
            self.doc_id2index = dict(zip(document_ids, list(range(0, len(document_ids)))))
        else:
            self.document_ids_provided = False
            self.document_ids = np.array(range(0, len(documents)))
            self.doc_id2index = dict(zip(self.document_ids, list(range(0, len(self.document_ids)))))
            self.doc_id_type = np.int_

        acceptable_embedding_models = ["universal-sentence-encoder-multilingual",
                                       "universal-sentence-encoder",
                                       "all-MiniLM-L6-v2"]

        self.embedding_model_path = embedding_model_path

        if embedding_model == 'doc2vec':

            # validate training inputs
            if speed == "fast-learn":
                hs = 0
                negative = 5
                epochs = 40
            elif speed == "learn":
                hs = 1
                negative = 0
                epochs = 40
            elif speed == "deep-learn":
                hs = 1
                negative = 0
                epochs = 400
            elif speed == "test-learn":
                hs = 0
                negative = 5
                epochs = 1
            else:
                raise ValueError("speed parameter needs to be one of: fast-learn, learn or deep-learn")

            if workers is None:
                pass
            elif isinstance(workers, int):
                pass
            else:
                raise ValueError("workers needs to be an int")

            doc2vec_args = {"vector_size": 300,
                            "min_count": min_count,
                            "window": 15,
                            "sample": 1e-5,
                            "negative": negative,
                            "hs": hs,
                            "epochs": epochs,
                            "dm": 0,
                            "dbow_words": 1}

            if workers is not None:
                doc2vec_args["workers"] = workers

            logger.info('Pre-processing documents for training')

            if use_corpus_file:
                processed = [' '.join(tokenizer(doc)) for doc in documents]
                lines = "\n".join(processed)
                temp = tempfile.NamedTemporaryFile(mode='w+t')
                temp.write(lines)
                doc2vec_args["corpus_file"] = temp.name


            else:
                train_corpus = [TaggedDocument(tokenizer(doc), [i]) for i, doc in enumerate(documents)]
                doc2vec_args["documents"] = train_corpus

            logger.info('Creating joint document/word embedding')
            self.embedding_model = 'doc2vec'
            self.model = Doc2Vec(**doc2vec_args)

            if use_corpus_file:
                temp.close()

        elif embedding_model in acceptable_embedding_models:

            self.embed = None
            self.embedding_model = embedding_model

            self._check_import_status()

            logger.info('Pre-processing documents for training')

            # preprocess documents
            tokenized_corpus = [tokenizer(doc) for doc in documents]

            def return_doc(doc):
                return doc

            # preprocess vocabulary
            vectorizer = CountVectorizer(tokenizer=return_doc, preprocessor=return_doc)
            doc_word_counts = vectorizer.fit_transform(tokenized_corpus)
            words = vectorizer.get_feature_names()
            word_counts = np.array(np.sum(doc_word_counts, axis=0).tolist()[0])
            vocab_inds = np.where(word_counts > min_count)[0]

            if len(vocab_inds) == 0:
                raise ValueError(f"A min_count of {min_count} results in "
                                 f"all words being ignored, choose a lower value.")
            self.vocab = [words[ind] for ind in vocab_inds]

            self._check_model_status()

            logger.info('Creating joint document/word embedding')

            # embed words
            self.word_indexes = dict(zip(self.vocab, range(len(self.vocab))))
            self.word_vectors = self._l2_normalize(np.array(self.embed(self.vocab)))

            # embed documents
            if use_embedding_model_tokenizer:
                self.document_vectors = self._embed_documents(documents)
            else:
                train_corpus = [' '.join(tokens) for tokens in tokenized_corpus]
                self.document_vectors = self._embed_documents(train_corpus)

        else:
            raise ValueError(f"{embedding_model} is an invalid embedding model.")

        # create 5D embeddings of documents
        logger.info('Creating lower dimension embedding of documents')

        if umap_args is None:
            umap_args = {'n_neighbors': 15,
                         'n_components': 5,
                         'metric': 'cosine'}

        umap_model = umap.UMAP(**umap_args).fit(self._get_document_vectors(norm=False))

        # find dense areas of document vectors
        logger.info('Finding dense areas of documents')

        if hdbscan_args is None:
            hdbscan_args = {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}

        cluster = hdbscan.HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)

        # calculate topic vectors from dense areas of documents
        logger.info('Finding topics')

        # create topic vectors
        self._create_topic_vectors(cluster.labels_)

        # deduplicate topics
        self._deduplicate_topics()

        # find topic words and scores
        self.topic_words, self.topic_word_scores = self._find_topic_words_and_scores(topic_vectors=self.topic_vectors)

        # assign documents to topic
        self.doc_top, self.doc_dist = self._calculate_documents_topic(self.topic_vectors,
                                                                      self._get_document_vectors())

        # calculate topic sizes
        self.topic_sizes = self._calculate_topic_sizes(hierarchy=False)

        # re-order topics
        self._reorder_topics(hierarchy=False)

        # initialize variables for hierarchical topic reduction
        self.topic_vectors_reduced = None
        self.doc_top_reduced = None
        self.doc_dist_reduced = None
        self.topic_sizes_reduced = None
        self.topic_words_reduced = None
        self.topic_word_scores_reduced = None
        self.hierarchy = None

        # initialize document indexing variables
        self.document_index = None
        self.serialized_document_index = None
        self.documents_indexed = False
        self.index_id2doc_id = None
        self.doc_id2index_id = None

        # initialize word indexing variables
        self.word_index = None
        self.serialized_word_index = None
        self.words_indexed = False

    def _check_import_status(self):
        if self.embedding_model != 'all-MiniLM-L6-v2':
            if not _HAVE_TENSORFLOW:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_encoders]\n\n"
                                  "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")
        else:
            if not _HAVE_TORCH:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_transformers]\n\n"
                                  "Alternatively try: pip install torch sentence_transformers")

    def _check_model_status(self):
        if self.embed is None:
            if self.verbose is False:
                logger.setLevel(logging.DEBUG)

            if self.embedding_model != "all-MiniLM-L6-v2":
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    if self.embedding_model == "universal-sentence-encoder-multilingual":
                        module = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
                    else:
                        module = "https://tfhub.dev/google/universal-sentence-encoder/4"
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                self.embed = hub.load(module)

            else:
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    module = 'all-MiniLM-L6-v2'
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                model = SentenceTransformer(module)
                self.embed = model.encode

        if self.verbose is False:
            logger.setLevel(logging.WARNING)

We can then use this `Top2VecNew` class to run our experiments including the `"all-mpnet-base-v2"` model. 

In [28]:
for i in range(3):
    dataset, custom = "climate", True
    params = {"nr_topics": [(i+1)*10 for i in range(5)],
              # "embedding_model": "all-MiniLM-L6-v2",
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=Top2Vec,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_MiniLM_{i+1}")

2023-03-15 18:22:45,262 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:22:45,262 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:22:49,231 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:22:49,231 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:22:56,962 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:22:56,962 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:23:18,582 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:23:18,582 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:23:19,097 - top2vec - INFO - Finding topics
2023-03-15 18:23:19,097 - top2vec - INFO - Finding topics
2023-03-15 18:23:19,887 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:23:19,887 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.12922219450162747
diversity: 0.8
 


2023-03-15 18:23:20,443 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:23:20,443 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:23:28,614 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:23:28,614 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:23:51,268 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:23:51,268 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:23:51,534 - top2vec - INFO - Finding topics
2023-03-15 18:23:51,534 - top2vec - INFO - Finding topics
2023-03-15 18:23:52,844 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:23:52,844 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.16649268981896242
diversity: 0.8
 


2023-03-15 18:23:53,638 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:23:53,638 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:24:02,347 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:24:02,347 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:24:22,124 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:24:22,124 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:24:22,465 - top2vec - INFO - Finding topics
2023-03-15 18:24:22,465 - top2vec - INFO - Finding topics
2023-03-15 18:24:23,270 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:24:23,270 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.14366363907558516
diversity: 0.4666666666666667
 


2023-03-15 18:24:23,560 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:24:23,560 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:24:31,252 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:24:31,252 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:24:45,759 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:24:45,759 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:24:45,998 - top2vec - INFO - Finding topics
2023-03-15 18:24:45,998 - top2vec - INFO - Finding topics
2023-03-15 18:24:46,853 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:24:46,853 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.12642976149993565
diversity: 0.42857142857142855
 


2023-03-15 18:24:47,614 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:24:47,614 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:24:55,948 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:24:55,948 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:25:11,973 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:25:11,973 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:25:12,260 - top2vec - INFO - Finding topics
2023-03-15 18:25:12,260 - top2vec - INFO - Finding topics
2023-03-15 18:25:13,117 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:25:13,117 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.10694083932972438
diversity: 0.5166666666666667
 


2023-03-15 18:25:13,426 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:25:13,426 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:25:20,911 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:25:20,911 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:25:46,879 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:25:46,879 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:25:47,492 - top2vec - INFO - Finding topics
2023-03-15 18:25:47,492 - top2vec - INFO - Finding topics
2023-03-15 18:25:50,482 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:25:50,482 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.13399173608841577
diversity: 0.4142857142857143
 


2023-03-15 18:25:52,166 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:25:52,166 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:26:01,314 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:26:01,314 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:26:20,733 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:26:20,733 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:26:21,192 - top2vec - INFO - Finding topics
2023-03-15 18:26:21,192 - top2vec - INFO - Finding topics
2023-03-15 18:26:22,973 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:26:22,973 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.14451484245069598
diversity: 0.7666666666666667
 


2023-03-15 18:26:23,563 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:26:23,563 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:26:32,493 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:26:32,493 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:26:49,957 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:26:49,957 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:26:50,197 - top2vec - INFO - Finding topics
2023-03-15 18:26:50,197 - top2vec - INFO - Finding topics
2023-03-15 18:26:51,248 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:26:51,248 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.14478720769966877
diversity: 0.4125
 


2023-03-15 18:26:51,897 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:26:51,897 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:26:58,666 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:26:58,666 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:27:12,962 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:12,962 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:13,218 - top2vec - INFO - Finding topics
2023-03-15 18:27:13,218 - top2vec - INFO - Finding topics
2023-03-15 18:27:14,101 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:27:14,101 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.11916273477666309
diversity: 0.56
 


2023-03-15 18:27:14,380 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:27:14,380 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:27:20,260 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:27:20,260 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:27:35,879 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:35,879 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:36,344 - top2vec - INFO - Finding topics
2023-03-15 18:27:36,344 - top2vec - INFO - Finding topics
2023-03-15 18:27:37,330 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:27:37,330 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.162804230032822
diversity: 0.3333333333333333
 


2023-03-15 18:27:37,665 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:27:37,665 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:27:44,724 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:27:44,724 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:27:58,470 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:58,470 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:27:59,164 - top2vec - INFO - Finding topics
2023-03-15 18:27:59,164 - top2vec - INFO - Finding topics
2023-03-15 18:27:59,894 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:27:59,894 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.13883678667321278
diversity: 0.44285714285714284
 


2023-03-15 18:28:00,397 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:00,397 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:08,494 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:28:08,494 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:28:25,087 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:28:25,087 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:28:25,524 - top2vec - INFO - Finding topics
2023-03-15 18:28:25,524 - top2vec - INFO - Finding topics
2023-03-15 18:28:27,122 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:28:27,122 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.13531790843642938
diversity: 0.4
 


2023-03-15 18:28:27,466 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:27,466 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:33,492 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:28:33,492 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:28:47,588 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:28:47,588 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:28:47,871 - top2vec - INFO - Finding topics
2023-03-15 18:28:47,871 - top2vec - INFO - Finding topics
2023-03-15 18:28:48,660 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:28:48,660 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.1434393836091666
diversity: 0.36666666666666664
 


2023-03-15 18:28:48,911 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:48,911 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:28:54,954 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:28:54,954 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:29:09,962 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:29:09,962 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:29:10,329 - top2vec - INFO - Finding topics
2023-03-15 18:29:10,329 - top2vec - INFO - Finding topics
2023-03-15 18:29:11,267 - top2vec - INFO - Pre-processing documents for training
2023-03-15 18:29:11,267 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.13795672558617847
diversity: 0.4125
 


2023-03-15 18:29:11,713 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:29:11,713 - top2vec - INFO - Creating joint document/word embedding
2023-03-15 18:29:18,963 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:29:18,963 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-15 18:29:31,429 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:29:31,429 - top2vec - INFO - Finding dense areas of documents
2023-03-15 18:29:31,749 - top2vec - INFO - Finding topics
2023-03-15 18:29:31,749 - top2vec - INFO - Finding topics


Results
npmi: -0.11192726769531944
diversity: 0.45714285714285713
 


# **Wall time**
Here, we only focus on the wall time of each topic model, from instantiating the model to training. To do so, we take the Trump dataset and split it up into steps of 1000 documents. Then, we can train a model and track the wall time:

In [32]:
from tqdm import tqdm
import random
import time
import pandas as pd

embedding_model = "all-MiniLM-L6-v2"
# embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embedding_model_name = "all-MiniLM-L6-v2"
topic_model_name = "BERTopic_USE"

results = pd.DataFrame(columns=["dataset", "nr_documents", "vocab_size", "time",
                                "cpu", "gpu", "gpu_cudnn", "gpu_memory", "embedding_model"])
for index, nr_documents in enumerate(tqdm(np.arange(1000, len(data), 2_000, dtype=int))):
    
    selected_data = random.sample(data, nr_documents)
    selected_tokenized_data = random.sample(tokenized_data, nr_documents)
    

    # Run model
    start = time.time()
    
    if topic_model_name == "LDA":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        lda = LdaMulticore(id_corpus, id2word=id2word, num_topics=100)
    
    elif topic_model_name == "NFM":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        nmf_model = nmf.Nmf(id_corpus, id2word=id2word, num_topics=100)

    elif topic_model_name == "BERTopic":
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data)
        
    elif topic_model_name == "BERTopic_Doc2Vec":
        train_corpus = [TaggedDocument(default_tokenizer(doc), [i]) for i, doc in enumerate(selected_data)]
        doc2vec_args = {"vector_size": 300,
                        "min_count": 50,
                        "window": 15,
                        "sample": 1e-5,
                        "negative": 0,
                        "hs": 1,
                        "epochs": 40,
                        "dm": 0,
                        "dbow_words": 1,
                       "documents": train_corpus,
                       "workers": -1}
        model = Doc2Vec(**doc2vec_args)
        embeddings = model.docvecs.vectors_docs
        topic_model = BERTopic()    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)
        
    elif topic_model_name == "BERTopic_USE":
        embeddings = embedding_model(selected_data).cpu().numpy()
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)

    elif topic_model_name == "Top2Vec":
        model = Top2Vec(selected_data, hdbscan_args={"min_cluster_size": 15}, workers=-1)
#         model = Top2VecNew(selected_data, hdbscan_args={"min_cluster_size": 15}, embedding_model=embedding_model)
        

    
    end = time.time()

    # Calculate vocab size
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(selected_data)
    vocab_size = len(vectorizer.get_feature_names())
    
    results.loc[len(results)] = [dataset, len(selected_data), vocab_size, end - start, cpu_name, gpu_name, 
                                 gpu_cudnn, gpu_memory, embedding_model_name]

  0%|          | 0/1 [00:00<?, ?it/s]


NameError: name 'tokenized_data' is not defined