### Import libraries and dataset

In [1]:
!pip install bertopic


In [2]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']



Let’s start by creating topics. As an example, I am using the popular 20 Newsgroups dataset provided by Scikit-Learn, which contains around 18,000 newsgroup posts on 20 topics. The documents are in English, which is BERTopic’s default language, so no language setting is needed:

In [18]:
model = BERTopic(calculate_probabilities=True)
topics, probs = model.fit_transform(docs)
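
Because BERTopic's UMAP step is randomized, repeated runs on the same documents can give different topics. If reproducible results matter, one option is to pass a UMAP model with a fixed seed. The sketch below assumes the `umap-learn` package that BERTopic depends on. The helper name `build_reproducible_model` and the seed value are illustrative and not part of the original notebook:

```python
def build_reproducible_model(seed=42):
    """Build a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported lazily so the helper can be defined before the packages are loaded.
    from bertopic import BERTopic
    from umap import UMAP

    # These UMAP settings follow BERTopic's documented defaults;
    # only random_state is added (an assumption for illustration).
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    return BERTopic(umap_model=umap_model, calculate_probabilities=True)

# Usage (commented out because fitting takes a while):
# model = build_reproducible_model()
# topics, probs = model.fit_transform(docs)
```

Fixing the seed typically makes UMAP run single-threaded, so training may be slower.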

Now let’s extract the most frequent topics:

In [19]:
model.get_topic_freq().head()

    Topic  Count
1      -1   6887
0       0   1834
24      1    640
41      2    524
46      3    464
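
The same kind of frequency ranking can be recomputed directly from the list of topic assignments, which is handy when the -1 label should be left out. This is a minimal sketch using only the standard library. The helper `topic_frequencies` and the example list are hypothetical stand-ins for the `topics` returned by `fit_transform`:

```python
from collections import Counter

def topic_frequencies(topics, skip_outliers=True):
    """Return (topic, count) pairs sorted by descending count."""
    counts = Counter(topics)
    if skip_outliers:
        counts.pop(-1, None)  # drop the -1 label if it is present
    return counts.most_common()

# Hypothetical topic assignments standing in for the output of fit_transform
example_topics = [-1, 0, 0, 1, -1, 0, 2, 1, -1, -1]
frequencies = topic_frequencies(example_topics)
print(frequencies)
```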


The first row of the table is topic -1. This label collects all outliers and should generally be ignored. Next, let’s take a look at the most frequent real topic, topic 0:

In [20]:
model.get_topic(0)[:10]

[('game', 0.010562552289888656),
 ('team', 0.009197001279897003),
 ('games', 0.007322072742198445),
 ('he', 0.007241745405015588),
 ('players', 0.006398398622812115),
 ('season', 0.006340396998154335),
 ('hockey', 0.0062259853942926565),
 ('play', 0.005885814506097448),
 ('25', 0.005759534007055261),
 ('year', 0.005722821565009483)]

BERTopic is stochastic (its UMAP dimensionality-reduction step is randomized), which means that the topics may differ from one run to another. Now let’s take a look at the topic probabilities to understand how confident BERTopic is that certain topics can be found in a document:

In [21]:
model.visualize_distribution(probs[0])
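
The plot can be complemented by reading the numbers directly. The sketch below assumes that each row of `probs` holds one probability per non-outlier topic, with position `j` corresponding to topic `j`. The helper `top_topics` and the example row are hypothetical:

```python
def top_topics(prob_row, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical probabilities for one document over five topics
example_row = [0.05, 0.60, 0.10, 0.20, 0.05]
best = top_topics(example_row, k=2)
print(best)
```

With a fitted model, the same helper could be called as `top_topics(probs[0])`.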

### Topic Reduction

Finally, we can also reduce the number of topics after training a BERTopic model. The advantage of doing this is that you can decide the number of topics after knowing how many were actually created.

It is difficult to predict before training your model how many topics are in your documents and how many will be retrieved. Instead, we can decide afterwards how many topics look realistic:

In [30]:
model.reduce_topics(docs, nr_topics=30)  # updates the fitted model in place
new_topics = model.topics_  # topic assignments after the reduction

We can now use the update_topics function to update the topic representations with new parameters for the c-TF-IDF vectorization:

In [32]:
# With topics omitted, update_topics uses the model's current assignments
model.update_topics(docs, n_gram_range=(1, 3))
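
To make the effect of n_gram_range=(1, 3) concrete, the sketch below lists every one-, two-, and three-word phrase in a short string. It is a simplified illustration with whitespace tokenization. The vectorizer inside BERTopic tokenizes differently, and the helper `word_ngrams` is hypothetical:

```python
def word_ngrams(text, n_gram_range=(1, 3)):
    """List all word n-grams whose length falls inside n_gram_range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("the ear wax")
print(grams)
```

This is why multi-word entries such as "the ear" can appear in a topic representation after the update.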

### Topic Search

After training our model, we can use find_topics to search for topics that are similar to a given search term. Here we will be looking for topics that are closely related to the search term car. Next, we extract the most similar topic and check the results:

In [34]:
similar_topics, similarity = model.find_topics("car", top_n=5)
# Topic IDs change between runs; the original run hard-coded topic 28 here
model.get_topic(similar_topics[0])

[('ear', 0.04351919910159403),
 ('the ear', 0.024707498431836603),
 ('the', 0.01944052804787548),
 ('wax', 0.018332402237108674),
 ('to', 0.01810263325504848),
 ('ears', 0.01778350654801203),
 ('and', 0.014766975473806608),
 ('aids', 0.013569875426414652),
 ('with', 0.013430102829237324),
 ('hearing', 0.012698610903812234)]
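
Conceptually, this kind of search embeds the search term and ranks topics by how close their embeddings are to it. The sketch below shows that ranking step with cosine similarity on toy vectors. It is an illustration under stated assumptions rather than BERTopic's actual implementation, and the 2-D embeddings and helper names are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar_topics(query_vec, topic_vecs, top_n=5):
    """Rank topic IDs by similarity to the query, most similar first."""
    scored = [(topic_id, cosine(query_vec, vec))
              for topic_id, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 2-D embeddings for three topics and one search term
topic_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
query_vec = [1.0, 0.1]
ranked = most_similar_topics(query_vec, topic_vecs, top_n=2)
print([topic_id for topic_id, _ in ranked])
```

A low similarity score for the best match is a hint that no topic is really about the search term, so it is worth checking the returned similarity values and not only the topic IDs.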

Source:

https://thecleverprogrammer.com/2021/01/12/topic-modeling-with-machine-learning/