<a href="https://colab.research.google.com/github/drob-xx/UMAP_Instability/blob/main/UMAP_Instabiliity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates the impact that UMAP instability can have on BERTopic and shows that tuning HDBSCAN can both reduce that impact and improve BERTopic modeling overall.

The steps below are:

1. Set up the environment.
1. Create ten TopicTuner models, where each model has its own UMAP instance.
1. Select two of these models for demonstration.
1. Generate BERTopic models based on these TopicTuner models using default BERTopic settings.
1. Apply tuned HDBSCAN settings to the BERTopic models.

The text used here is a 2,000-article random sample taken from the [bbc-news](https://huggingface.co/datasets/SetFit/bbc-news) dataset, labeled with five categories: *business, entertainment, politics, sport* and *tech*.

The tuning parameters were arrived at (elsewhere) by generating ten models from this data and using TopicTuner to search through hundreds of combinations of HDBSCAN's min_cluster_size and min_samples parameters, choosing settings that produced similar results in all ten models for a given number of clusters.

To ensure reproducibility, UMAP's random_state parameter is set to the same value used for the models tuned in the previous step. Note, however, that each of the UMAP instances is different from the others.
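
The role of random_state can be illustrated without running UMAP at all. The sketch below is not UMAP: it swaps in a seeded random projection (the `random_projection` helper and the toy data are hypothetical stand-ins) simply to show that a stochastic reducer repeats itself exactly for a fixed seed and differs across seeds.

```python
import numpy as np

def random_projection(data, n_components, random_state):
    """Hypothetical stand-in for a stochastic reducer (this is NOT UMAP)."""
    rng = np.random.default_rng(random_state)
    projection = rng.normal(size=(data.shape[1], n_components))
    return data @ projection

# Toy data: four "documents" with five features each.
data = np.arange(20, dtype=float).reshape(4, 5)

run_a = random_projection(data, 2, random_state=42)
run_b = random_projection(data, 2, random_state=42)  # same seed as run_a
run_c = random_projection(data, 2, random_state=7)   # different seed

same_seed_matches = np.array_equal(run_a, run_b)
different_seed_matches = np.array_equal(run_a, run_c)
print(same_seed_matches, different_seed_matches)
```

Fixing the seed makes each individual run repeatable, while the ten different seeds are what produce ten different reductions of the same text.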

This notebook assumes a Colab environment. You can get a free Colab account if you don't have one, or you can modify the notebook for your own environment.

In [None]:
!pip install BERTopic

[TopicTuner](https://github.com/drob-xx/TopicTuner) is an HDBSCAN tuning solution for BERTopic. It currently has no installer, but cloning the repository into the base directory will work.

In [None]:
!git clone https://github.com/drob-xx/TopicTuner.git /content/TopicTuner

Then make sure it is on the path.

In [None]:
import sys
sys.path.insert(0,'/content/TopicTuner')

Clone the UMAP_Instability repo to access the needed data files.

In [None]:
!git clone https://github.com/drob-xx/UMAP_Instability.git /content/UMAP_Instability

In [None]:
import pandas as pd
import numpy as np
from random import randrange
import plotly.express as px
import pickle
from tqdm.notebook import tqdm

from topictuner import TopicModelTuner as TMT

from sklearn.metrics import adjusted_mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

Load and save utility procedures.

In [None]:
def load(filepath):
    with open(filepath, 'rb') as fp:
        return pickle.load(fp)

def save(var, filepath):
    with open(filepath, 'wb') as fp:
        return pickle.dump(var, fp)

In [None]:
BBC_ModelParamSettings = load('/content/UMAP_Instability/BBC_ModelParamSettings')
BBC_UMAP_RandomStates = load('/content/UMAP_Instability/BBC_UMAP_RandomStates')
bbcDataSets = load('/content/UMAP_Instability/bbcDataSets')

Create the embeddings. On a Colab instance with a GPU this should take less than two minutes; it takes longer without one.

In [None]:
bbcModels = {}
# All ten models share the same embeddings, so they only need to be computed once
# (looping ten times would just overwrite the same dictionary entry).
aModel = TMT(verbose=0)
aModel.createEmbeddings(bbcDataSets[2000]['text'].to_list())
bbcModels[2000] = aModel

Set up a container for the ten models.

In [None]:
BBC_Not_Optimized_Models = {}

Create ten TopicModelTuner models. For each one:

1. Set UMAP's random state.
1. Set the docs so we have an easy way to access them.
1. Set the embeddings.
1. Call .reduce(), which runs UMAP against the embeddings (takes about 2 minutes with a GPU).

In [None]:
for sampleSize in tqdm([2000]) :
  tmtmodels = []
  for idx in tqdm(range(len(BBC_UMAP_RandomStates[sampleSize]))) :
    tmtmodel = TMT(verbose=0)
    tmtmodel.reducer_model.random_state = BBC_UMAP_RandomStates[sampleSize][idx]
    tmtmodel.docs = bbcDataSets[sampleSize]['text'].to_list()
    tmtmodel.embeddings = bbcModels[sampleSize].embeddings  # the embedding array, not the TMT object
    tmtmodel.reduce()
    tmtmodels.append(tmtmodel)
  BBC_Not_Optimized_Models[sampleSize] = tmtmodels

We now have ten models generated against the same text and using ten different UMAP instances. Now we can compare each of the models to the others in the set to see what sort of variations occur due to UMAP instability.

The following code compares every pair of models in the set (100 comparisons, including each model against itself) by running HDBSCAN against the UMAP reduction and comparing the resulting .labels_, the cluster assignments that BERTopic uses to create a topic model. The labels are compared using scikit-learn's [adjusted_mutual_info_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html), which scores the agreement between two clusterings: 1 means the partitions are identical, and values near 0 mean the agreement is no better than chance.

We generate a heatmap to visualize the relationships.

In [None]:
modelCompareResults = {}
modelFigs = {}
for setKey, modelSet in tqdm(BBC_Not_Optimized_Models.items()) :
    csize = len(modelSet)
    ComparedResults = np.zeros((csize,csize))
    labels = [model.runHDBSCAN() for model in modelSet]  # run HDBSCAN once per model
    for row, labels1 in enumerate(labels) :
      for column, labels2 in enumerate(labels) :
        ComparedResults[row, column] = adjusted_mutual_info_score(labels1, labels2)
    fig = px.imshow(ComparedResults, color_continuous_scale='RdBu_r', zmin=0, zmax=1)
    fig.update_layout(
            xaxis={'side': 'top'}, 
    )
    modelFigs[setKey] = fig
    modelCompareResults[setKey] = ComparedResults
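
As a quick illustration of how the score behaves, here is a minimal sketch using made-up labels for six documents (the label lists are invented for this example and are not taken from the models above). It shows that the score ignores how clusters are numbered and drops when the groupings are unrelated.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Made-up cluster labels for six documents.
labels_a = [0, 0, 1, 1, 2, 2]
labels_renamed = [2, 2, 0, 0, 1, 1]    # same grouping, different cluster ids
labels_unrelated = [0, 1, 0, 1, 0, 1]  # grouping that cuts across labels_a

renamed_score = adjusted_mutual_info_score(labels_a, labels_renamed)
unrelated_score = adjusted_mutual_info_score(labels_a, labels_unrelated)

print(renamed_score)    # identical partitions score 1.0
print(unrelated_score)  # unrelated partitions score at or below zero
```

Because only the grouping matters, two models can be compared even when HDBSCAN assigns different numbers to what is effectively the same cluster.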

The heatmap shows how similar each model is to the others in the set.

In [None]:
for fig in modelFigs.values() :
  fig.show()

We can summarize these results:

In [None]:
txformedData = {}
for idx, runData in modelCompareResults.items() :
    # Keep each unordered pair once: the entries strictly below the diagonal
    # (45 values for ten models), which also skips the self-comparisons.
    txformedData[idx] = runData[np.tril_indices_from(runData, k=-1)]

pd.DataFrame(txformedData).describe()

To take a more detailed look at how UMAP instability affects these models, we'll select two of the less similar models, 4 and 9, and see how these differences are reflected in BERTopic models.

Each of the models has been created using BERTopic default settings. Comparing models 4 and 9 gives a score of about 0.81, below the first-quartile cutoff of 0.83.

In [None]:
model4 = BBC_Not_Optimized_Models[2000][4]
model9 = BBC_Not_Optimized_Models[2000][9]
adjusted_mutual_info_score(model4.runHDBSCAN(), model9.runHDBSCAN())
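
One way to locate a weakly matched pair programmatically, rather than reading it off the heatmap, is to search the entries below the diagonal of the comparison matrix. The sketch below uses a made-up 4x4 score matrix for four hypothetical models; the same indexing would apply to a larger matrix of pairwise scores.

```python
import numpy as np

# Made-up symmetric similarity scores for four hypothetical models.
scores = np.array([
    [1.00, 0.92, 0.88, 0.95],
    [0.92, 1.00, 0.81, 0.90],
    [0.88, 0.81, 1.00, 0.86],
    [0.95, 0.90, 0.86, 1.00],
])

# Indices of the entries strictly below the diagonal: one per unordered pair.
rows, cols = np.tril_indices_from(scores, k=-1)
pair_scores = scores[rows, cols]

worst = int(np.argmin(pair_scores))
least_similar_pair = (int(cols[worst]), int(rows[worst]))
lowest_score = float(pair_scores[worst])
print(least_similar_pair, lowest_score)
```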

Now let's create BERTopic models for these two models using the same default parameters.

In [None]:
btmodels = {4:None, 9:None}
custom_stop_words = ['mr', 'said']
stop_words_list = text.ENGLISH_STOP_WORDS.union(custom_stop_words)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stop_words_list)

for idx in btmodels.keys() :
  btmodels[idx] = BBC_Not_Optimized_Models[2000][idx].getBERTopicModel(10, None)

for idx, model in btmodels.items() :
  model.umap_model = BBC_Not_Optimized_Models[2000][idx].reducer_model
  model.vectorizer_model = vectorizer_model
  model.fit_transform(bbcDataSets[2000]['text'].to_list(), embeddings=BBC_Not_Optimized_Models[2000][idx].embeddings)

Of course, the TMT model and the corresponding BERTopic model produce the same results.

In [None]:
adjusted_mutual_info_score(model4.runHDBSCAN(), btmodels[4].topics_)

Let's take a look at the topics that BERTopic has determined.

In [None]:
for model in btmodels.values() :
  print(model.get_topic_info())
  print('')
  print('-----------------------------------------------')
  print('')

That's a lot of topics. Let's reduce them.

In [None]:
for idx, model in btmodels.items() :
  model.reduce_topics(bbcDataSets[2000]['text'].to_list(), 5)
  for topic in model.generate_topic_labels(nr_words=10, topic_prefix=False, separator=' ') :
    print(topic)
  print('')
  print('----------------------------------')
  print('')

More comprehensible. But now let's look at how similar the models are.

In [None]:
adjusted_mutual_info_score(btmodels[4].topics_,btmodels[9].topics_)

Looking at the topic vocabularies, we see that four of the five topics line up very closely with one another. However, the fifth topics,

- growth economy bank year economic sales prices rates rise dollar

and

- people mobile phone technology broadband tv digital search phones games

do not.
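
A rough way to put a number on "line up" is the Jaccard overlap between two topic vocabularies. This is only an illustrative sketch (the `jaccard` helper is hypothetical and is not part of BERTopic or TopicTuner), applied to the two mismatched vocabularies listed above.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two space-separated vocabularies have in common."""
    set_a, set_b = set(words_a.split()), set(words_b.split())
    return len(set_a & set_b) / len(set_a | set_b)

# The two fifth-topic vocabularies quoted above.
topic_a = "growth economy bank year economic sales prices rates rise dollar"
topic_b = "people mobile phone technology broadband tv digital search phones games"

overlap = jaccard(topic_a, topic_b)
self_overlap = jaccard(topic_a, topic_a)
print(overlap, self_overlap)
```

The two vocabularies share no words at all, so their overlap is zero, whereas a topic compared with itself scores one.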

We can also see the differences in the embeddings visualizations they produce.

In [None]:
embedding_vizs = []
for idx, model in tqdm(btmodels.items()) :
  BBC_Not_Optimized_Models[2000][idx].createVizReduction()
  embedding_vizs.append(model.visualize_documents(bbcDataSets[2000]['text'].to_list(),
                            reduced_embeddings=BBC_Not_Optimized_Models[2000][idx].viz_reducer.embedding_))
for viz in embedding_vizs :
  viz.show()

Let's see what happens when we use optimized HDBSCAN settings (calculated elsewhere) and summarize the resulting classifications.

In [None]:
tunedModels = {}
for idx in [4, 9] :
  tunedModels[idx] = (BBC_Not_Optimized_Models[2000][idx]) 
  tunedModels[idx].best_cs = BBC_ModelParamSettings[2000][idx][0]
  tunedModels[idx].best_ss = BBC_ModelParamSettings[2000][idx][1]
for model in tunedModels.values() :
  print(model.best_cs, model.best_ss)
  print(pd.Series(model.runHDBSCAN(model.best_cs, model.best_ss)).value_counts())

During the optimization process we don't just make the model more efficient (fewer documents labeled -1, HDBSCAN's outlier label); we can often also choose a specific number of clusters.

The metric indicates that these models are very closely aligned.
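
To make the "-1 results" concrete, here is a minimal sketch that measures the share of outliers and the number of clusters in a label array. The labels are made up for the example, and the `outlier_fraction` helper is hypothetical.

```python
from collections import Counter

def outlier_fraction(labels):
    """Fraction of documents carrying the outlier label -1."""
    counts = Counter(labels)
    return counts.get(-1, 0) / len(labels)

# Made-up labels for ten documents: three clusters plus two outliers.
toy_labels = [0, 0, 1, -1, 2, 1, -1, 0, 2, 2]

fraction = outlier_fraction(toy_labels)
cluster_count = len(set(toy_labels) - {-1})
print(fraction, cluster_count)
```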

In [None]:
adjusted_mutual_info_score(tunedModels[4].runHDBSCAN(), tunedModels[9].runHDBSCAN())

Next we can create BERTopic models based on these optimized results.

In [None]:
btmodels2 = {}

for idx, model in tqdm(tunedModels.items()) :
  btmodel = model.getBERTopicModel()
  btmodel.umap_model = model.reducer_model
  btmodel.vectorizer_model = vectorizer_model
  btmodel.fit_transform(bbcDataSets[2000]['text'].to_list(), model.embeddings)
  btmodels2[idx] = btmodel

Let's look at some scores. The first shows how the two "before" BERTopic models compare with one another, the second how different the "before" models are from the "after" models, and the last how similar the "after" models are to each other.

In [None]:
print('BERTopic "before" models compared with each other')
print(adjusted_mutual_info_score(btmodels[4].topics_, btmodels[9].topics_))
print('')
print('BERTopic "after" models vs. "before" models')
print(adjusted_mutual_info_score(btmodels2[4].topics_, btmodels[4].topics_))
print(adjusted_mutual_info_score(btmodels2[9].topics_, btmodels[9].topics_))
print('')
print('BERTopic "after" models compared with each other')
print(adjusted_mutual_info_score(btmodels2[4].topics_, btmodels2[9].topics_))


In [None]:
for model in btmodels2.values() :
  for topic in model.generate_topic_labels(nr_words=10, topic_prefix=False, separator=' ') :
    print(topic)
  print('')
  print('----------------------------------')
  print('')

Despite the fact that each of these models has a different UMAP instance, they are almost the same.

The takeaway is that UMAP instability can have a very pronounced effect on model consistency, unless the model is tuned. Tuning also has the added advantage of significantly reducing the number of uncategorized documents.

In [None]:
docs = bbcDataSets[2000]['text'].to_list()
viz_embeddings = tunedModels[4].viz_reducer.embedding_
tuned_btmodel4 = btmodels2[4]
tuned_btmodel4.visualize_documents(docs, reduced_embeddings=viz_embeddings).show()