Every day, businesses process large volumes of unstructured text. From customer interactions in emails to online reviews and reviews. To deal with this large amount of text, we use the concept of topic modeling. 

Topic modeling can be seen as a task of machine learning which can be used to present the huge volume of data generated due to advancements in computer and web technology in low dimension and to present the hidden concepts, important characteristics or latent variables of the data, depending on the context of the application of the identified text.

In the section, we will use Machine Learning project on Topic Modeling by using the **BERTopic** library. We can simply install this library by using the **pip command**; `pip install bertopic==0.9.3`.

When working with **BERTopic**, be sure to select a GPU runtime. Otherwise, the algorithm may take some time to create the document embeds.

In [1]:
# !pip install bertopic==0.9.3

In [2]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

In [3]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Let’s start by creating topics, for example, we are using the popular 20 Newsgroups dataset provided by Scikit-Learn which has around 18,000 newsgroup articles on 20 topics. We will therefore select English as the main language for our documents:

In [22]:
model = BERTopic(calculate_probabilities = True)
topics, probs = model.fit_transform(docs) # probs is probability

Now let’s extract the topics with the most number of frequencies:

In [23]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,7020
1,0,1825
2,1,621
3,2,454
4,3,326
...,...,...
219,218,10
220,219,10
221,220,10
222,221,10


We can see `-1` in the first row `(index 0)`, `-1` refers to all outliers and should generally be ignored. Next, let’s take a look at the most common topic generated:

In [24]:
model.get_topic(0)[:10]

[('games', 0.007011386314955602),
 ('players', 0.0062435246236494345),
 ('season', 0.006140010156016231),
 ('hockey', 0.006023493438856502),
 ('league', 0.004944771280055125),
 ('teams', 0.004618597644602934),
 ('baseball', 0.004350635257635803),
 ('player', 0.004334369853589185),
 ('nhl', 0.004075929442863264),
 ('gm', 0.003480695682702875)]

The model is therefore stochastic, which means that the topics may differ from one run to another. Now let’s take a look at the full list of support languages:

In [25]:
from bertopic.backend import languages
print(languages)

['afrikaans', 'albanian', 'amharic', 'arabic', 'armenian', 'assamese', 'azerbaijani', 'basque', 'belarusian', 'bengali', 'bengali romanize', 'bosnian', 'breton', 'bulgarian', 'burmese', 'burmese zawgyi font', 'catalan', 'chinese (simplified)', 'chinese (traditional)', 'croatian', 'czech', 'danish', 'dutch', 'english', 'esperanto', 'estonian', 'filipino', 'finnish', 'french', 'galician', 'georgian', 'german', 'greek', 'gujarati', 'hausa', 'hebrew', 'hindi', 'hindi romanize', 'hungarian', 'icelandic', 'indonesian', 'irish', 'italian', 'japanese', 'javanese', 'kannada', 'kazakh', 'khmer', 'korean', 'kurdish (kurmanji)', 'kyrgyz', 'lao', 'latin', 'latvian', 'lithuanian', 'macedonian', 'malagasy', 'malay', 'malayalam', 'marathi', 'mongolian', 'nepali', 'norwegian', 'oriya', 'oromo', 'pashto', 'persian', 'polish', 'portuguese', 'punjabi', 'romanian', 'russian', 'sanskrit', 'scottish gaelic', 'serbian', 'sindhi', 'sinhala', 'slovak', 'slovenian', 'somali', 'spanish', 'sundanese', 'swahili', '

Now let’s take a look at the topic probabilities to understand how safe BERTopic is that certain topics can be found in a document:

In [27]:
model.visualize_distribution(probs[0])

### Topic Reduction

Finally, we can also reduce the number of subjects after training a BERTopic model. The advantage of doing this is that we can decide the number of topics after knowing how many are actually created.

It is difficult to predict before training our model how many topics are in our documents and how many will be retrieved. Instead, we can decide afterwards how many subjects look realistic:

In [28]:
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=60)

The reasoning for placing documents, topics and probs as parameters is that these values are not saved in BERTopic for any purpose. If we had a million documents, it seems very inefficient to save them to BERTopic instead of a dedicated database.

We can now use the `update_topics` function to update the subject representation with new parameters for the **TF-IDF vectorization**

In [30]:
model.update_topics(docs, topics, n_gram_range=(1, 3))

### Topic Modeling

After training our model, we can use `find_topics` to search for topics similar to a `search_term` entry. Here we will be looking for topics that are closely related to the search term vehicle. Next, we extract the most similar topic and check the results:

In [31]:
similar_topics, similarity = model.find_topics("vehicle", top_n=5); similar_topics
model.get_topic(28)

[('jews', 0.011420209853273652),
 ('nazis', 0.007736673158460116),
 ('nazi', 0.007381017694300174),
 ('the nazis', 0.006611423665943947),
 ('jewish', 0.005491714851303139),
 ('the nazi', 0.004815521547444272),
 ('the holocaust', 0.0036082423266371694),
 ('the jews', 0.0029710905720823385),
 ('antisemitism', 0.0023869447218242803),
 ('the jewish', 0.0023809252454821636)]