**BERTopic (Maarten Grootendorst)**

Installation with sentence-transformers

Three main algorithm components:
1. Embed Documents: Extract document embeddings with Sentence Transformers
2. Cluster Documents: Create groups of similar documents with UMAP (to reduce the dimensionality of the embeddings) and HDBSCAN (to identify and cluster semantically similar documents)
3. Create Topic Representation: Extract and reduce topics with c-TF-IDF (class-based term frequency-inverse document frequency)
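
Step 3 can be illustrated with a small standard-library sketch. It assumes one published formulation of c-TF-IDF, `tf * log(1 + A / f)`, where `tf` is a term's count within a class, `A` is the average number of words per class, and `f` is the term's count across all classes. The function name `c_tf_idf` and the toy clusters are made up for illustration, and the installed BERTopic version may weight or normalize terms differently.

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """Score terms per class; `classes` maps a label to a list of documents."""
    # Treat all documents of a class as one long document and count its terms.
    term_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in classes.items()
    }
    # Frequency of each term across all classes.
    overall = Counter()
    for counts in term_counts.values():
        overall.update(counts)
    # Average number of words per class.
    avg_words = sum(overall.values()) / len(term_counts)
    return {
        label: {
            term: tf * math.log(1 + avg_words / overall[term])
            for term, tf in counts.items()
        }
        for label, counts in term_counts.items()
    }

# Toy clusters, made up for illustration.
clusters = {
    "sports": ["the game was great", "the team won the game"],
    "crypto": ["the key is secret", "the encryption key"],
}
scores = c_tf_idf(clusters)
top_terms = {label: max(s, key=s.get) for label, s in scores.items()}
print(top_terms)
```

Words that are frequent inside one cluster but rare elsewhere score highest, which is why "the" ranks below "game" and "key" in this toy example even though it occurs more often.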

In [1]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.9.4-py2.py3-none-any.whl (57 kB)
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.28.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64

In [2]:
# Extract topics and generate topic probabilities

import pandas as pd
import numpy as np
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)


In [3]:
# Access the most frequent topics that were generated
# Topic -1 collects all outliers and should typically be ignored

topic_model.get_topic_info()

     Topic  Count  Name
0       -1   7025  -1_to_the_is_and
1        0   1827  0_game_team_games_he
2        1    591  1_key_clipper_encryption_chip
3        2    525  2_ites_cheek_yep_huh
4        3    463  3_israel_israeli_jews_arab
...    ...    ...  ...
229    228     10  228_68070_motorola_68010_68040
230    229     10  229_sound_soundbase_stereo_channel
231    230     10  230_manhattan_bobbeviceicotekcom_beauchaine_sank
232    231     10  231_fat_weight_insulin_muscle
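
The Count column reports how many documents were assigned to each topic, with -1 holding the outliers. A minimal sketch of that tally follows, using a small made-up `topics` list rather than the real model output:

```python
from collections import Counter

# Hypothetical topic assignments for eight documents (-1 marks outliers).
topics = [-1, 0, 0, 1, -1, 0, 2, 1]

counts = Counter(topics)

# List topics from most to least frequent, as in the table above.
for topic, count in counts.most_common():
    print(topic, count)

# Fraction of documents that were not assigned to any topic.
outlier_share = counts[-1] / len(topics)
print(outlier_share)
```

A large outlier share, like the one in the table above, is worth checking before interpreting the remaining topics.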


In [4]:
# Most frequent topic that was generated (excluding outliers): topic 0

topic_model.get_topic(0)

[('game', 0.010108389225147882),
 ('team', 0.008831490076270147),
 ('games', 0.007043755775980492),
 ('he', 0.0067277959256621155),
 ('players', 0.006206522524973441),
 ('season', 0.006138933697557294),
 ('hockey', 0.00601929992299583),
 ('play', 0.0056584316354696615),
 ('25', 0.005518971113407293),
 ('year', 0.005475923799531879)]
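
The result is a list of (word, score) pairs ordered from highest to lowest score. As a sketch of post-processing such a list, the snippet below filters out a stop word like "he" and builds a short label. The scores are rounded copies of the first five entries above; the stop-word set is a tiny illustrative one, not the list used by any library.

```python
# (word, score) pairs copied from the output above, rounded for brevity.
topic_0 = [
    ("game", 0.0101),
    ("team", 0.0088),
    ("games", 0.0070),
    ("he", 0.0067),
    ("players", 0.0062),
]

# Illustrative stop words only.
stop_words = {"he", "the", "to", "is", "and"}

# Keep the ranking order but skip stop words.
keywords = [word for word, score in topic_0 if word not in stop_words]

# Join the first three remaining words into a readable label.
label = "_".join(keywords[:3])
print(label)
```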

In [6]:
# Visualize Topics

topic_model.visualize_topics()

In [7]:
topic_model.visualize_barchart()

In [8]:
# Dynamic Topic Modeling (DTM) is a collection of techniques for analyzing how topics evolve over time.
# These methods let you see how the representation of a topic changes from one period to the next.
# Here, we use all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
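
The three `apply` calls above strip URLs and lowercase the text, drop @-mentions, and keep only letters. The same steps applied to a single string look like this; the helper name `clean_tweet` and the sample tweet are made up for illustration:

```python
import re

def clean_tweet(text):
    """Apply the same three cleaning steps to one tweet."""
    # 1. Remove URLs and lowercase the text.
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (mentions).
    text = " ".join(tok for tok in text.split() if not tok.startswith("@"))
    # 3. Replace every run of non-letters with a space and collapse whitespace.
    return " ".join(re.sub(r"[^a-zA-Z]+", " ", text).split())

sample = "Thank you @FoxNews! https://t.co/abc MAGA 2020"
cleaned = clean_tweet(sample)
print(cleaned)  # prints: thank you maga
```

Tweets that consist only of links, mentions, or digits become empty strings, which is why the notebook filters on `trump.text != ""` afterwards.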

In [9]:
# Extract the global topic representations by creating and training a BERTopic model

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

2022-04-07 17:44:09,181 - BERTopic - Transformed documents to Embeddings
2022-04-07 17:44:45,821 - BERTopic - Reduced dimensionality with UMAP
2022-04-07 17:44:49,106 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [10]:
# From these topics, generate the topic representation at each timestamp for each topic
# by calling topics_over_time and passing in the tweets, the corresponding timestamps, and the assigned topics

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)

20it [00:09,  2.07it/s]
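
To give an intuition for `nr_bins=20`, the sketch below groups timestamps into a fixed number of bins. It assumes equal-width bins between the earliest and latest timestamp; the helper `bin_timestamps` and the sample dates are hypothetical, and the library's own binning may handle edges differently.

```python
from datetime import datetime

def bin_timestamps(timestamps, nr_bins):
    """Assign each timestamp to one of nr_bins equal-width bins (0-based)."""
    start, end = min(timestamps), max(timestamps)
    span = (end - start).total_seconds()
    # Guard against a zero-width range when all timestamps are identical.
    width = span / nr_bins if span else 1.0
    # The latest timestamp sits on the right edge, so clamp it into the last bin.
    return [
        min(int((t - start).total_seconds() / width), nr_bins - 1)
        for t in timestamps
    ]

# Five made-up daily timestamps split into two bins.
days = [datetime(2020, 1, d) for d in range(1, 6)]
bins = bin_timestamps(days, nr_bins=2)
print(bins)
```

Fewer bins give smoother trends with more documents per time step; more bins give finer resolution at the cost of noisier topic representations.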


In [11]:
# Visualize the topics by calling visualize_topics_over_time()

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)