### Getting Topics Using BERTopic and Sentence Transformer Embeddings

Every day, businesses deal with large volumes of unstructured text, from customer emails to online feedback and reviews. To make sense of this much text, we turn to topic modeling: a technique that automatically extracts meaning from documents by identifying recurring topics.
<br><br>
BERTopic is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
<br><br><br>
The main purpose of this article is to give you an in-depth overview of BERTopic’s features, along with tutorials on how best to apply them in your own projects.


In [None]:
!pip3 install numpy==1.21.2

Restart the kernel here, then check that the notebook is using the newly installed version of numpy.

In [1]:
import numpy as np
np.__version__

'1.21.2'

#### Install the packages

In [2]:
!pip install bertopic
!pip install bertopic[visualization]
!pip install -U sentence-transformers

Collecting bertopic
  Downloading bertopic-0.9.0-py2.py3-none-any.whl (55 kB)
Collecting plotly<4.14.3,>=4.7.0
  Downloading plotly-4.14.2-py2.py3-none-any.whl (13.2 MB)
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.27.tar.gz (6.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.0.0.tar.gz (85 kB)
Building wheels for collected packages: hdbscan, sentence-transformers
  Building wheel for hdbscan (PEP 517) ... done
  Created wheel for hdbscan: filename=hdbscan-0.8.27-cp37-cp37m-linux_x86_64

Restart the kernel after the pip install.<br>
numpy must be at the latest version; as of this writing, that is 1.21.2.

#### Import the libraries

In [3]:
import pandas as pd
import numpy as np
import string
import regex as re
import nltk 
from nltk.tokenize import TweetTokenizer
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sentence_transformers import SentenceTransformer


Read the tweets file, keep only the English-language tweets, and select the "user_id" and "tweet" columns.

In [4]:
data = pd.read_csv('/kaggle/input/windows11-tweets/Windows11_Tweets.csv')
data = data[data['language']=='en'][['user_id','tweet']]
data.head()

Unnamed: 0,user_id,tweet
0,831944710541418498,Running dev channel Windows 11 and Android 12 ...
5,465091440,@Longplay_Games @Thunder_Owl and i'm exacerbat...
6,4650866357,"@ppy i think it was a windows update, i realiz..."
9,1339170008866430977,@NathanMcNulty I THOUGHT YOU WERE TALKING ABOU...
10,1399778506159054850,@x0_1372 @msuhr10 Windows 11 fixes this.


In [5]:
# Use the NLTK English stopword list (needs the 'stopwords' corpus, e.g. via nltk.download('stopwords'))
stopword = set(nltk.corpus.stopwords.words('english'))

## remove punctuation and digits, then lowercase the text
def remove_punct(text):
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'[0-9]+', ' ', text)
    return text.lower()

## remove the hashtags within the tweets
def remove_hashtags(text):
    text  = re.sub(r"#(\w+)", '', text)
    return text

## remove the tweet handles
def remove_tagged_users(text):
    text  = re.sub(r"@(\w+)", '', text)
    return text

## remove the stopwords
def remove_stopwords(text):
    out = " ".join(word for word in text.split() if word not in stopword)
    return out

## remove the URLs embedded within tweets
def remove_urls(text):
    text = re.sub(r"((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)",'', text)
    return text

## remove the emojis from the tweets
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)



#### Data Pre-processing

In [7]:
### Clean the data using the preprocessing functions
data['clean_tweet'] = data['tweet'].apply(lambda x: remove_hashtags(x))
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: remove_tagged_users(x))
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: remove_emoji(x))
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: remove_urls(x))
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: remove_punct(x))
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: remove_stopwords(x))
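
The order of these passes matters: hashtags, handles and URLs have to be removed before punctuation is stripped, because their patterns rely on `#`, `@` and `://`. The sketch below chains the same kind of passes into one helper so the ordering is explicit. It is a simplified illustration, not the notebook's exact functions: the URL pattern is shorter, emoji removal is left out, and the stopword set is a tiny hypothetical stand-in for NLTK's list.

```python
import re

# Simplified stand-ins for the cleaning functions defined earlier (assumptions:
# shorter URL pattern, no emoji handling).
HASHTAG = re.compile(r"#\w+")
HANDLE = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
NON_WORD = re.compile(r"[^\w\s]")
DIGITS = re.compile(r"[0-9]+")

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's list.
STOPWORDS = {"and", "the", "is", "a", "to", "this"}


def clean_tweet(text: str) -> str:
    """Apply the cleaning passes in an order that keeps each pattern matchable."""
    # Hashtags, handles and URLs go first: once punctuation is stripped,
    # '#', '@' and '://' are gone and these patterns can no longer match.
    for pattern in (HASHTAG, HANDLE, URL):
        text = pattern.sub("", text)
    text = NON_WORD.sub(" ", text)
    text = DIGITS.sub(" ", text)
    # Lowercase, collapse whitespace and drop stopwords in one pass.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


sample = "@x0_1372 @msuhr10 Windows 11 fixes this. #Windows11 https://t.co/abc"
print(clean_tweet(sample))
```

With a single helper like this, the column could be cleaned with one `apply` call instead of six, which also avoids re-scanning the column for every pass.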

In [8]:
### find the common words which occur frequently in the tweet.
from collections import Counter
cnt = Counter()
for text in data['clean_tweet'].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('windows', 2392),
 ('new', 444),
 ('microsoft', 330),
 ('apps', 303),
 ('tool', 180),
 ('mail', 164),
 ('snipping', 160),
 ('feature', 157),
 ('calculator', 157),
 ('get', 143)]

In [9]:
### Remove the top-5 most commonly occurring words

freq = set([w for (w, wc) in cnt.most_common(5)])

def freqwords(text):
    return " ".join([word for word in str(text).split() if word not in freq])

data['clean_tweet'] = data['clean_tweet'].apply(freqwords)

In [10]:
### Remove the rare words: drop the 50 least frequent words

freq = pd.Series(' '.join(data['clean_tweet']).split()).value_counts()[-50:] # the 50 rarest words
freq = set(freq.index)
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in freq))
data.head()

Unnamed: 0,user_id,tweet,clean_tweet
0,831944710541418498,Running dev channel Windows 11 and Android 12 ...,running dev channel android beta simply much p...
5,465091440,@Longplay_Games @Thunder_Owl and i'm exacerbat...,exacerbating issue also dev ring use one set d...
6,4650866357,"@ppy i think it was a windows update, i realiz...",think update realized update went asleep idk c...
9,1339170008866430977,@NathanMcNulty I THOUGHT YOU WERE TALKING ABOU...,thought talking lmaooo
10,1399778506159054850,@x0_1372 @msuhr10 Windows 11 fixes this.,fixes
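
Note that the cell above drops a fixed number of the least frequent words, which is not the same as dropping every word that appears only once. If a count threshold is what you want, a minimal sketch could look like the following. The function name `drop_frequent_and_rare` and its parameters `top_k` and `min_count` are hypothetical, and the toy documents are made up for illustration.

```python
from collections import Counter


def drop_frequent_and_rare(docs, top_k=5, min_count=2):
    """Remove the top_k most frequent words and words seen fewer than min_count times."""
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(top_k)}
    rare = {word for word, n in counts.items() if n < min_count}
    blocked = frequent | rare  # a set gives constant-time membership checks
    return [" ".join(w for w in doc.split() if w not in blocked) for doc in docs]


# Toy documents (not taken from the dataset).
docs = [
    "windows update broke printer",
    "windows update fixed printer",
    "windows snipping tool update",
    "windows calculator printer",
]
print(drop_frequent_and_rare(docs, top_k=1, min_count=2))
```

Here `windows` is removed as the single most frequent word, and every word that occurs only once is removed by the `min_count` threshold.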


#### Lemmatization

In [11]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV} # Pos tag, used Noun, Verb, Adjective and Adverb

# Function for lemmatization using POS tag
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

# Lemmatize the 'clean_tweet' column in place
data['clean_tweet'] = data['clean_tweet'].apply(lemmatize_words)
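
The lookup `wordnet_map.get(pos[0], wordnet.NOUN)` works because `nltk.pos_tag` returns Penn Treebank tags, whose first letter roughly identifies the word class. The sketch below reproduces only that mapping step without NLTK, using the single-character values that WordNet's POS constants stand for. The helper name `to_wordnet_pos` is hypothetical.

```python
# WordNet's part-of-speech constants are single characters; these literals
# mirror nltk.corpus.wordnet.NOUN / VERB / ADJ / ADV.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"
TAG_PREFIX_TO_WORDNET = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}


def to_wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(penn_tag[:1], NOUN)


for tag in ("NNS", "VBD", "JJR", "RB", "IN"):
    print(tag, "->", to_wordnet_pos(tag))
```

Matching on the first letter is coarse: a particle tagged `RP`, for example, would be treated as an adverb. For short, already cleaned tweets this is usually an acceptable trade-off.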


In [12]:
data.sample(10)

Unnamed: 0,user_id,tweet,clean_tweet
795,1469265582,Clean Installation of Windows 11 with legit pr...,clean installation legit product key
1559,87537732,Microsoft releases Windows 11 Insider Preview ...,release insider preview build snip calculator ...
4587,1215912913342304257,https://t.co/HrbDfwhPgb Here are seven of th...,seven best third party start menu include one ...
1481,2693847595,"""Calculator now comes with a dark mode, and of...",calculator come dark mode course aligns look a...
3121,333969115,Intel released new Wi-Fi and Bluetooth drivers...,intel release wi fi bluetooth driver support
4896,2325079028,@windowsinsider Announcing Windows 11 Inside...,announce insider preview build write amanda la...
2792,1313106807707975680,How to Change Touch Keyboard Themes on Windows...,change touch keyboard theme
326,98160432,@JenMsft @JustinZumwalt My biggest issue is th...,big issue everything seem least one extra clic...
2235,1926422736,Will Your Windows 10 Apps Work on Windows 11? ...,work
2330,877514066436214785,Windows 11 now has its first beta release ...,first beta release release first beta availabl...


##### Getting Topics Using BERTopic and Sentence Transformer Embeddings

In [13]:
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

In [14]:
tweets_data = data['clean_tweet'].tolist()
embeddings = model.encode(tweets_data, show_progress_bar=True)

In [15]:
from bertopic import BERTopic

Under the hood, BERTopic uses sentence-transformers to create embeddings for the documents you pass it. By default, BERTopic uses an English model, but it supports any language for which an embedding model exists.
You can choose the language by setting the language parameter in BERTopic. Here we also pass in the embeddings computed above, so BERTopic does not have to compute them again.

In [16]:
model2 = BERTopic(language="english")
topics, probabilities = model2.fit_transform(tweets_data,embeddings)

Two outputs are generated: topics and probabilities. Each value in topics is the topic that the corresponding document is assigned to. Probabilities, on the other hand, indicate the likelihood of a document falling into any of the possible topics.

After generating topics and their probabilities, we can look at the most frequent topics that were generated:<br>

In [17]:
model2.get_topic_info().head()

Unnamed: 0,Topic,Count,Name
0,-1,618,-1_want_take_pc_check
1,0,152,0_shortage_chip_slow_click
2,1,114,1_phase_skype_default_mean
3,2,109,2_bad_claim_reduce_trim
4,3,97,3_kesy_le_hello_ho


In the output above, Topic -1 is the largest. -1 refers to all outliers, that is, documents that do not have a topic assigned. Forcing documents into a topic could lead to poor performance, so we ignore Topic -1.
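
Before ignoring the outliers, it can help to check how large a share of the corpus they represent. The sketch below does this on a plain list of topic ids. The helper `summarize_topics` is hypothetical and the assignments are made up; in the notebook you would pass the `topics` list returned by `fit_transform`.

```python
from collections import Counter


def summarize_topics(topics):
    """Return (outlier_share, counts of assigned topics) for a list of topic ids."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)  # -1 marks documents left without a topic
    share = outliers / len(topics) if topics else 0.0
    return share, counts


# Hypothetical assignments, not the notebook's real output.
example_topics = [-1, 0, 0, 1, -1, 2, 0, 1]
share, counts = summarize_topics(example_topics)
print(f"outlier share: {share:.2f}")
print(counts.most_common())
```

A large outlier share is not necessarily a problem, but it is worth knowing when interpreting the remaining topics.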

In [18]:
model2.get_topic_freq().head()

Unnamed: 0,Topic,Count
0,-1,618
1,0,152
2,1,114
3,2,109
4,3,97


In [19]:
model2.get_topic(1)

[('phase', 0.17401928913748355),
 ('skype', 0.17263523123817986),
 ('default', 0.14715061930690299),
 ('mean', 0.14463153117103902),
 ('seem', 0.14104493044360628),
 ('feature', 0.052818618461119876),
 ('learn', 0.004593108459323202),
 ('could', 0.003339526249014399),
 ('forever', 0.0),
 ('foresee', 0.0),
 ('foremost', 0.0),
 ('forehead', 0.0),
 ('force', 0.0),
 ('font', 0.0),
 ('fond', 0.0),
 ('follow', 0.0),
 ('followme', 0.0),
 ('folklore', 0.0),
 ('folder', 0.0),
 ('foldable', 0.0),
 ('fold', 0.0),
 ('foil', 0.0),
 ('foi', 0.0),
 ('focus', 0.0),
 ('fo', 0.0),
 ('fn', 0.0),
 ('flyout', 0.0),
 ('forget', 0.0),
 ('forgotten', 0.0),
 ('forgot', 0.0)]

In [20]:
model2.visualize_topics()

### Topic Reduction after Training

What if, after a training run that took many hours, you are left with too many topics? It would be a shame to have to re-train your model just to experiment with the number of topics.
<br>
Fortunately, we can reduce the number of topics after having trained a BERTopic model. Another advantage of doing so is that you can decide the number of topics after knowing how many are actually created:


In [22]:
new_topics, new_probs = model2.reduce_topics(tweets_data, topics, probabilities, nr_topics=20)
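
To build some intuition for what a reduction step can look like, here is a minimal sketch that repeatedly merges the smallest topic into its most similar neighbour until the requested number of topics remains. As far as I understand, BERTopic's reduction follows a similar idea, but this is not its implementation: the function `reduce_topics_sketch`, the toy sizes and the three-dimensional topic vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_topics_sketch(sizes, vectors, nr_topics):
    """Merge the smallest topic into its nearest neighbour until nr_topics remain.

    sizes:   {topic_id: document count}
    vectors: {topic_id: term-weight vector}
    Returns (mapping old_topic -> surviving topic, sizes after merging).
    """
    sizes = dict(sizes)  # work on copies so the inputs stay untouched
    vectors = {t: list(v) for t, v in vectors.items()}
    mapping = {t: t for t in sizes}
    while len(sizes) > max(nr_topics, 1):
        smallest = min(sizes, key=sizes.get)
        others = [t for t in sizes if t != smallest]
        target = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        # A size-weighted average keeps the merged vector representative.
        n_s, n_t = sizes[smallest], sizes[target]
        vectors[target] = [
            (a * n_s + b * n_t) / (n_s + n_t)
            for a, b in zip(vectors[smallest], vectors[target])
        ]
        sizes[target] += sizes.pop(smallest)
        del vectors[smallest]
        # Redirect every topic that pointed at the merged topic.
        mapping = {old: (target if new == smallest else new) for old, new in mapping.items()}
    return mapping, sizes


# Toy input: topic 3 is small and points in nearly the same direction as topic 0.
sizes = {0: 152, 1: 114, 2: 109, 3: 5}
vectors = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0], 3: [0.9, 0.1, 0.0]}
mapping, merged_sizes = reduce_topics_sketch(sizes, vectors, nr_topics=3)
print(mapping)
print(merged_sizes)
```

In this toy run, topic 3 is absorbed by topic 0, and the returned mapping can be used to relabel each document's topic id.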

In [23]:
model2.get_topic_info().head()

Unnamed: 0,Topic,Count,Name
0,-1,957,-1_pc_update_use_app
1,0,152,0_pc_shortage_chip_make
2,1,114,1_phase_skype_default_mean
3,2,109,2_bad_claim_reduce_good
4,3,97,3_hello_well_kesy_le


In [24]:
model2.get_topic(1)

[('phase', 0.3300204357461385),
 ('skype', 0.3300169189672654),
 ('default', 0.30453230703598855),
 ('mean', 0.300632677779694),
 ('seem', 0.2984266181726918),
 ('feature', 0.2102003061902054),
 ('learn', 0.005973649579753778),
 ('could', 0.004720067369444974),
 ('forever', 0.0),
 ('foresee', 0.0),
 ('foremost', 0.0),
 ('forehead', 0.0),
 ('force', 0.0),
 ('font', 0.0),
 ('fond', 0.0),
 ('follow', 0.0),
 ('followme', 0.0),
 ('folklore', 0.0),
 ('folder', 0.0),
 ('foldable', 0.0),
 ('fold', 0.0),
 ('foil', 0.0),
 ('foi', 0.0),
 ('focus', 0.0),
 ('fo', 0.0),
 ('fn', 0.0),
 ('flyout', 0.0),
 ('forget', 0.0),
 ('forgotten', 0.0),
 ('forgot', 0.0)]

### Topic Representation

Topics are typically represented by a set of words. In BERTopic, these words are extracted from the documents using a class-based TF-IDF. <br>
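
To make the idea concrete, the sketch below scores terms per class rather than per document: all documents in a topic are pooled, term frequencies are normalized by the pooled length, and terms that are common across all topics are down-weighted. The weighting used here, tf(t, c) * log(1 + A / f(t)) with A the average number of words per class and f(t) the term's total count, is one formulation of class-based TF-IDF. Treat it as an assumption; the BERTopic version installed above may compute it differently. The function name `class_tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs, labels):
    """Score terms per class as tf(t, c) * log(1 + A / f(t))."""
    # Pool the word counts of all documents that share a label.
    per_class = {}
    for doc, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(doc.split())
    # f(t): how often each term occurs across all classes.
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_class)  # A: average words per class
    scores = {}
    for label, counts in per_class.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores


# Toy documents with made-up topic labels.
docs = [
    "chip shortage slow pc",
    "chip shortage pc",
    "dark mode calculator",
    "calculator dark theme pc",
]
labels = [0, 0, 1, 1]
scores = class_tfidf(docs, labels)
for label, terms in scores.items():
    top = sorted(terms, key=terms.get, reverse=True)[:2]
    print(label, top)
```

The word `pc` appears in both classes, so it scores lower than class-specific words such as `chip` or `dark`, which is the behaviour that keeps topic descriptions distinctive.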

At times, you might not be happy with the representation of the topics that were created. This can happen when topics are represented by single words (1-grams) only. Perhaps you want to try out a different n-gram range, or you have a custom vectorizer that you want to use.

In [26]:
# Update the topic representation by widening the n-gram range to 1-3 words
model2.update_topics(tweets_data, new_topics, n_gram_range=(1, 3))
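
An n-gram range of (1, 3) means that single words, word pairs and word triples all become candidate terms. The sketch below shows which candidates a range produces for one short document. It is a plain-Python illustration with a hypothetical helper `ngrams`, assuming inclusive bounds as in scikit-learn's `CountVectorizer`; it is not how BERTopic builds its vocabulary internally.

```python
def ngrams(tokens, n_gram_range=(1, 3)):
    """Return all contiguous n-grams for every n in the inclusive range."""
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(ngrams("skype seem default feature".split()))
```

Because every longer n-gram overlaps with its shorter parts, widening the range tends to produce many near-duplicate terms with almost identical scores in a topic.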

In [27]:
model2.get_topic_info().head()

Unnamed: 0,Topic,Count,Name
0,-1,957,-1_update_pc_use_app
1,0,152,0_pc_chip shortage_shortage_run
2,1,114,1_seem default_default feature_skype seem defa...
3,2,109,2_bad_good bad_good_claim
4,3,97,3_well_hello_le_kesy ho


In [28]:
model2.get_topic(1)

[('seem default', 0.17170916351625662),
 ('default feature', 0.17170916351625662),
 ('skype seem default', 0.17170916351625662),
 ('seem default feature', 0.17170916351625662),
 ('skype seem', 0.17170916351625662),
 ('skype', 0.17122324867172853),
 ('mean phase', 0.17068883890570938),
 ('phase', 0.17068883890570938),
 ('default feature mean', 0.16966419527464766),
 ('feature mean', 0.16966419527464766),
 ('feature mean phase', 0.16966419527464766),
 ('phase skype', 0.1686351940593681),
 ('mean phase skype', 0.1686351940593681),
 ('phase skype seem', 0.1686351940593681),
 ('default', 0.16271594044283963),
 ('mean', 0.1608785770955346),
 ('seem', 0.16067773098334964),
 ('feature', 0.13122590209314922),
 ('feature skype seem', 0.0038176889846214344),
 ('feature could mean', 0.0038176889846214344),
 ('feature could', 0.0038176889846214344),
 ('mean phase learn', 0.0038176889846214344),
 ('feature skype', 0.0038176889846214344),
 ('learn skype seem', 0.0038176889846214344),
 ('learn skype',

For more configuration options, visit: https://github.com/MaartenGr/BERTopic