<a href="https://colab.research.google.com/github/drob-xx/TopicTuner/blob/main/TopicTunerDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install bertopic

Get TopicTuner from GitHub

In [None]:
!git clone https://github.com/drob-xx/TopicTuner.git

Place TopicTuner on the path

In [3]:
import sys
sys.path.insert(0,'/content/TopicTuner')

In [4]:
from topictuner import TopicModelTuner as TMT
from sklearn.datasets import fetch_20newsgroups

Get the 20 Newsgroups data

In [5]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

Create a TMT instance from scratch

In [6]:
tmt = TMT()

Alternatively, you can create one from an existing BERTopic instance by calling:

newTMT = TMT.wrapBERTopicModel(<your BERTopic model>)

Create the embeddings.

In [7]:
tmt.createEmbeddings(docs)

Then reduce them to 5 features, à la BERTopic, by calling TMT.reduce().

In [None]:
tmt.reduce()

Now we can explore different HDBSCAN settings for this instance of the UMAP reductions.

TMT.randomSearch takes two lists as arguments. By default it runs 20 searches. For each search it randomly picks a min_cluster_size from the first list, then randomly picks a fraction from the second list and multiplies it by the chosen min_cluster_size to get the sample_size (the value passed to HDBSCAN as min_samples).

Note that the values in each of the search examples below will likely have to be modified to give good results for your specific UMAP reduction.

In [None]:
lastRunResultsDF = tmt.randomSearch([*range(120,180)], [.1, .25, .5, .75, 1])
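
To make the sampling described above concrete, here is a minimal pure-Python sketch of how such parameter pairs could be drawn. It is not TopicTuner's implementation: the function name `sample_random_pairs`, the `seed` argument, and the choice to truncate the product with `int()` (with a floor of 1) are all assumptions made for illustration.

```python
import random

def sample_random_pairs(cluster_sizes, sample_fractions, iters=20, seed=None):
    """Draw `iters` (min_cluster_size, sample_size) pairs at random.

    Hypothetical helper: truncation via int() and the floor of 1 are
    assumptions, not confirmed TopicTuner behaviour.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(iters):
        csize = rng.choice(cluster_sizes)    # pick a min_cluster_size
        frac = rng.choice(sample_fractions)  # pick a fraction of it
        ssize = max(1, int(csize * frac))    # derive the sample_size
        pairs.append((csize, ssize))
    return pairs

pairs = sample_random_pairs(list(range(120, 180)), [.1, .25, .5, .75, 1], seed=0)
for csize, ssize in pairs[:3]:
    print(csize, ssize)
```

Because the second list holds fractions no greater than 1, every derived sample_size stays at or below its min_cluster_size.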

Each time a TMT search is performed all the results are collected in the TMT.ResultsDF DataFrame. Each search returns a DataFrame with just the results of that search.

TMT.visualizeSearch produces a plotly parallel coordinates graph. You can pass it TMT.ResultsDF to get a view of all the searches, or pass it the results from a particular search.

In [None]:
tmt.visualizeSearch(lastRunResultsDF).show()

TMT.summarizeResults sorts a results table by number of clusters and, for each cluster count, keeps the 'best' run, i.e. the one with the fewest uncategorized documents.

In [None]:
tmt.summarizeResults(lastRunResultsDF).sort_values(by=['number_uncategorized'])
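
The selection rule can be sketched in plain Python as follows. This is an illustration of the idea only, not TopicTuner's code: the helper name `summarize`, the exact key spellings, and the toy rows are assumptions, and the numbers are made up rather than real search results.

```python
def summarize(rows):
    """Keep, for each cluster count, the row with the fewest uncategorized docs.

    Hypothetical helper operating on a list of dicts; the key names are
    assumed for illustration.
    """
    best = {}
    for row in rows:
        k = row["number_of_clusters"]
        if k not in best or row["number_uncategorized"] < best[k]["number_uncategorized"]:
            best[k] = row
    # Return one row per cluster count, ordered by cluster count.
    return [best[k] for k in sorted(best)]

# Toy values for illustration only (not actual output from a search).
toy_rows = [
    {"min_cluster_size": 131, "sample_size": 13, "number_of_clusters": 20, "number_uncategorized": 5000},
    {"min_cluster_size": 132, "sample_size": 33, "number_of_clusters": 20, "number_uncategorized": 4200},
    {"min_cluster_size": 133, "sample_size": 66, "number_of_clusters": 18, "number_uncategorized": 6100},
]
summary = summarize(toy_rows)
for row in summary:
    print(row["number_of_clusters"], row["number_uncategorized"])
```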

TMT.gridSearch() is suitable once you have narrowed down the ranges. It tries every min_cluster_size value passed in, combining each with every fraction in the second list to compute a sample_size. In the example below 15 runs are performed: five different sample_sizes for each of the three min_cluster_size values (131, 132 and 133).
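
As a rough sketch of what an exhaustive grid looks like, the snippet below enumerates every combination with `itertools.product`. The helper `grid_pairs` is hypothetical, and truncating the product with `int()` is an assumption; TopicTuner may round differently.

```python
from itertools import product

def grid_pairs(cluster_sizes, sample_fractions):
    """Enumerate every (min_cluster_size, sample_size) combination.

    Hypothetical helper: int() truncation with a floor of 1 is assumed.
    """
    return [
        (csize, max(1, int(csize * frac)))
        for csize, frac in product(cluster_sizes, sample_fractions)
    ]

grid = grid_pairs(range(131, 134), [.1, .25, .5, .75, 1])
print(len(grid))  # 15
```

Three cluster sizes times five fractions gives the 15 combinations, which is why grid searches are best kept to narrow ranges.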

In [14]:
lastRunResultsDF = tmt.gridSearch([*range(131,134)], [.1, .25, .5, .75, 1])

Once you have narrowed down the values of interest further, you may want to run a more thorough search. TMT.simpleSearch takes two lists as arguments: the first holds the min_cluster_sizes and the second the sample_sizes. The two lists are read in parallel, one pair per run. You might prepare them like this:

In [15]:
# Pair a single min_cluster_size (131) with every sample_size from 1 to 131.
csizes = []
ssizes = []
for csize in range(131, 132):
    for ssize in range(1, csize + 1):
        csizes.append(csize)
        ssizes.append(ssize)

In the above example csizes is a list of 131 copies of the value 131, and ssizes holds the values 1 through 131. This runs every possible sample_size (1 to 131) for a min_cluster_size of 131. The first run uses min_cluster_size=131, sample_size=1, the second min_cluster_size=131, sample_size=2, and so on.
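
If you want to cover more than one min_cluster_size this way, the same pairing can be wrapped in a small helper. The function `all_sample_sizes` below is a hypothetical convenience, not part of TopicTuner; it only builds the two parallel lists.

```python
def all_sample_sizes(cluster_sizes):
    """Build parallel lists pairing each cluster size with sizes 1..csize.

    Hypothetical helper for preparing simpleSearch-style inputs.
    """
    csizes, ssizes = [], []
    for csize in cluster_sizes:
        for ssize in range(1, csize + 1):
            csizes.append(csize)
            ssizes.append(ssize)
    return csizes, ssizes

csizes, ssizes = all_sample_sizes(range(131, 132))
run_pairs = list(zip(csizes, ssizes))  # one (min_cluster_size, sample_size) per run
print(len(run_pairs), run_pairs[0], run_pairs[-1])
```

Note that the number of runs grows quickly: two cluster sizes of 131 and 132 would already mean 263 runs.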

In [None]:
lastRunResultsDF = tmt.simpleSearch(csizes, ssizes)

TMT can generate a scatterplot of your embeddings overlaid with the clustering produced by a given set of parameters. This can help you decide how many clusters to select for your model.

To accomplish this, first you must create a 2D reduction of the embeddings suitable for a 2D scatterplot.

In [None]:
tmt.createVizReduction()

If TMT has access to the docs it will use them to add document text to the scatterplot.

In [None]:
tmt.visualizeEmbeddings(131,78).show()

You can save your TMT model with TMT.save()

In [None]:
tmt.save('temp')

And restore it using TMT.load()

In [None]:
tmt2 = TMT.load('temp')

Once you have determined parameters that work for your text, TMT can build a BERTopic model configured with them. Note that in this example we pass BERTopic the embeddings created earlier, so it does not need to recompute them.

In [None]:
bt1 = tmt2.getBERTopicModel(131, 24)
bt1.fit_transform(tmt2.docs, tmt2.embeddings)
bt1.get_topic_info()