# This is a demo of BERTopic

(see https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html)

Note that this notebook exists only for convenience, to help you get started.

Read the original tutorial for complete information, in particular about other transformer models (for example, for languages other than English).

In [4]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups



In [6]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Load the model and tokenizer
model_name = 'jegorkitskerkin/robbert-v2-dutch-base-mqa-finetuned'
model = SentenceTransformer(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Your sentences
sentences = ["Dit is een voorbeeld zin", "Elke zin wordt omgezet"]

# Get tokens
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get embeddings
embeddings = model.encode(sentences)

# Now you have both tokens and embeddings
print("Tokens:", tokens[0].tokens)
print("Embeddings shape:", embeddings.shape)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Tokens: ['<s>', 'Dit', 'Ġis', 'Ġeen', 'Ġvoorbeeld', 'Ġzin', '</s>']
Embeddings shape: (2, 768)
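
The `Ġ` at the start of some tokens is not noise: in byte-level BPE vocabularies of the kind used by RoBERTa-style tokenizers, it marks a token that was preceded by a space. As a minimal sketch (the helper name `detokenize` and the hard-coded set of special tokens are assumptions for illustration, not part of `transformers`), the printed tokens can be mapped back to the sentence like this:

```python
# Hypothetical helper, for illustration only: rebuild text from byte-level BPE tokens.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}  # assumed special tokens, as seen in the output above

def detokenize(tokens):
    """Drop special tokens and turn the 'Ġ' space marker back into a space."""
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ").strip()

tokens = ['<s>', 'Dit', 'Ġis', 'Ġeen', 'Ġvoorbeeld', 'Ġzin', '</s>']
print(detokenize(tokens))  # Dit is een voorbeeld zin
```

In practice the tokenizer's own decoding methods handle this, including non-ASCII characters, which this sketch ignores.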


## 1. Load Data

In [5]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

print("Number of documents:", len(docs))
print("First document:", docs[0])

Number of documents: 18846
First document: 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




## 2. Train a BERTopic model: 

The following elements comprise the BERTopic stack:

<img src="https://raw.githubusercontent.com/annekroon/gesis-machine-learning/main/pictures/overview_bert.png" alt="Default Settings of BERTopic" width="300" height="200">

https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview

1. create embeddings from documents (with Sentence Transformers)
2. reduce the dimensionality of the embeddings (with UMAP)
3. cluster the reduced embeddings (with HDBSCAN)
4. create topics from the clusters (with c-TF-IDF)
5. optional fine-tuning (advanced, including LLMs)

**NOTE: simply running BERTopic will automatically execute all of these steps!**
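
To give a feeling for step 4, here is a simplified, standard-library sketch of the c-TF-IDF idea: all documents of a cluster are treated as one long document, and each word is weighted by its frequency within the cluster times `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` the frequency of the word across all clusters. The function name `c_tf_idf`, the whitespace tokenization, and the toy clusters are assumptions for illustration; this is not BERTopic's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of documents (strings)."""
    # Treat each cluster as one big document and count its words.
    counts = {cid: Counter(" ".join(docs).lower().split())
              for cid, docs in clusters.items()}

    # f_t: frequency of each word across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[cid] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in cluster_counts.items()
        }
    return scores

toy_clusters = {
    0: ["the game was great", "the team won the game"],
    1: ["the key encryption chip", "the encryption key is secret"],
}
scores = c_tf_idf(toy_clusters)
top_words = {cid: sorted(s, key=s.get, reverse=True)[:2] for cid, s in scores.items()}
print(top_words)
```

Words that are frequent in one cluster but rare elsewhere (such as "game" or "key") end up with higher scores than words that appear everywhere (such as "the").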



In [6]:
#initialize the model:
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, 
                  min_dist=0.0, metric='cosine', random_state=42)


topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L6-v2", umap_model=umap_model)

#fit the model to the texts:
#NOTE: this may take a while on a CPU!
topics, probs = topic_model.fit_transform(docs)

#NOTE: notice how we do not perform any preprocessing on the documents!

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
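
The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before the first import of `transformers` or `sentence_transformers`, and assuming `false` is the value you want:

```python
import os

# Set this before importing transformers / sentence_transformers,
# so the tokenizers library sees it when it is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```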

## 3. Inspect the results: 

In [7]:
#let's see the topics:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,7387,-1_the_of_to_and,"[the, of, to, and, is, you, it, in, that, for]",[\n\nI assume you are posting to encourage com...
1,0,1790,0_game_team_he_games,"[game, team, he, games, hockey, season, year, ...",[The FLYERS team that can beat any team on any...
2,1,607,1_key_encryption_clipper_chip,"[key, encryption, clipper, chip, keys, nsa, de...","[Hmm, followup on my own posting... Well, who ..."
3,2,479,2_awsome_nyc_matt_observed,"[awsome, nyc, matt, observed, nations, peace, ...","[\n, , Just observed at the National Model..."
4,3,457,3_space_launch_nasa_shuttle,"[space, launch, nasa, shuttle, orbit, mission,...",[Archive-name: space/acronyms\nEdition: 8\n\nA...
...,...,...,...,...,...
171,170,10,170_card_mirage_dip_tsoft,"[card, mirage, dip, tsoft, louisville, switche...",[Could someone please tell me what the dip swi...
172,171,10,171_server_memory_screen_shared,"[server, memory, screen, shared, shm, pixmaps,...",[This is a question aimed at those who have do...
173,172,10,172_polio_patients_she_motor,"[polio, patients, she, motor, syndrome, pps, m...",[[reply to keith@actrix.gen.nz (Keith Stewart)...
174,173,10,173_glassboro_950k_lockridge_dealler,"[glassboro, 950k, lockridge, dealler, sales, m...","[Of course, if you want to check the honesty o..."
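
Topic `-1` collects the documents that were not assigned to any cluster (outliers). A quick way to see how large that share is, sketched with the standard library only; the function name `outlier_share` and the toy list are illustrative assumptions, and in the notebook you would pass the `topics` list returned by `fit_transform`:

```python
from collections import Counter

def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic -1."""
    counts = Counter(topics)
    return counts[-1] / len(topics)

# Toy assignments standing in for the `topics` returned by fit_transform
toy_topics = [0, -1, -1, -1, 5, 0, 1, -1]
print(f"{outlier_share(toy_topics):.1%}")  # 50.0%

# With the counts from the table above: 7387 outliers out of 18846 documents
print(f"{7387 / 18846:.1%}")  # 39.2%
```

BERTopic also offers a `reduce_outliers` method for reassigning such documents; see its documentation for the available strategies.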


In [8]:
#let's get the topN words for a specific topic:
topic_model.get_topic(1)

[('key', 0.017382794942834966),
 ('encryption', 0.012646143220835184),
 ('clipper', 0.012245511060452465),
 ('chip', 0.011264899094988543),
 ('keys', 0.009608300330786555),
 ('nsa', 0.008508451442746793),
 ('des', 0.008005974695392451),
 ('escrow', 0.007367843008019796),
 ('be', 0.007335579608747007),
 ('algorithm', 0.007173844716354192)]

In [9]:
#the information we get for each document:
topic_model.get_document_info(docs)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,\n\nI am sure some bashers of Pens fans are pr...,0,0_game_team_he_games,"[game, team, he, games, hockey, season, year, ...",[The FLYERS team that can beat any team on any...,game - team - he - games - hockey - season - y...,1.000000,False
1,My brother is in the market for a high-perform...,-1,-1_the_of_to_and,"[the, of, to, and, is, you, it, in, that, for]",[\n\nI assume you are posting to encourage com...,the - of - to - and - is - you - it - in - tha...,0.000000,False
2,\n\n\n\n\tFinally you said what you dream abou...,-1,-1_the_of_to_and,"[the, of, to, and, is, you, it, in, that, for]",[\n\nI assume you are posting to encourage com...,the - of - to - and - is - you - it - in - tha...,0.000000,False
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,-1,-1_the_of_to_and,"[the, of, to, and, is, you, it, in, that, for]",[\n\nI assume you are posting to encourage com...,the - of - to - and - is - you - it - in - tha...,0.000000,False
4,1) I have an old Jasmine drive which I cann...,5,5_scsi_drive_ide_drives,"[scsi, drive, ide, drives, disk, controller, h...","[\n[ First of all, please edit your postings. ...",scsi - drive - ide - drives - disk - controlle...,1.000000,False
...,...,...,...,...,...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,-1,-1_the_of_to_and,"[the, of, to, and, is, you, it, in, that, for]",[\n\nI assume you are posting to encourage com...,the - of - to - and - is - you - it - in - tha...,0.000000,False
18842,\nNot in isolated ground recepticles (usually ...,75,75_paint_wax_scratches_coating,"[paint, wax, scratches, coating, polish, car, ...","[Is clear coat really worth it? Yes, on the s...",paint - wax - scratches - coating - polish - c...,0.395833,False
18843,I just installed a DX2-66 CPU in a clone mothe...,94,94_fan_cpu_heat_fans,"[fan, cpu, heat, fans, sink, cooling, chip, he...",[\n<speaking of CPU fans>\n\n\nDo these CPU Fa...,fan - cpu - heat - fans - sink - cooling - chi...,0.492144,False
18844,\nWouldn't this require a hyper-sphere. In 3-...,30,30_den_p3_p1_p2,"[den, p3, p1, p2, radius, op_rows, op_cols, po...",[There is another useful method based on Least...,den - p3 - p1 - p2 - radius - op_rows - op_col...,0.988176,False
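
The `Probability` column can be read as the strength of each assignment, so it can be used to keep only the clear-cut documents. A minimal sketch, assuming `topics` and `probs` are parallel sequences as returned by `fit_transform` above; the helper name `confident_docs`, the 0.5 threshold, and the toy values (loosely based on the rows above) are illustrative assumptions:

```python
def confident_docs(docs, topics, probs, threshold=0.5):
    """Return (doc, topic) pairs that are not outliers and meet the threshold."""
    return [
        (doc, topic)
        for doc, topic, prob in zip(docs, topics, probs)
        if topic != -1 and prob >= threshold
    ]

toy_docs = ["doc a", "doc b", "doc c", "doc d"]
toy_topics = [0, -1, 75, 30]
toy_probs = [1.0, 0.0, 0.395833, 0.988176]

print(confident_docs(toy_docs, toy_topics, toy_probs))
# [('doc a', 0), ('doc d', 30)]
```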


## 4. Change the topic labels

In [10]:
# Generate labels automatically from the top 3 words of each topic:

topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=10,
                                                 separator=", ")

topic_model.set_topic_labels(topic_labels)
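
To make the parameters above concrete, the sketch below imitates in plain Python what such a label looks like for topic 1, using the word scores printed in section 3. It is an approximation for illustration (the helper `make_label` is hypothetical), not BERTopic's actual implementation:

```python
# Word scores for topic 1, copied (rounded) from the get_topic(1) output
topic_words = [
    ("key", 0.0174), ("encryption", 0.0126), ("clipper", 0.0122),
    ("chip", 0.0113), ("keys", 0.0096),
]

def make_label(words_with_scores, nr_words=3, word_length=10, separator=", "):
    """Join the top words, truncating each to at most word_length characters."""
    words = [word[:word_length] for word, _ in words_with_scores[:nr_words]]
    return separator.join(words)

print(make_label(topic_words))  # key, encryption, clipper
```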

## 5. Visualize the results:

In [11]:
#how the topics are distributed:
#NOTE: you can hover / select topics in the plot!
topic_model.visualize_topics()

In [8]:
topic_model.visualize_barchart(top_n_topics=10)


In [12]:
#visualize the hierarchy of topics:
topic_model.visualize_hierarchy()

In [7]:
# Visualize the documents in 2D; without precomputed embeddings, BERTopic re-embeds `docs`
topic_model.visualize_documents(docs)

In [10]:
topic_model.visualize_barchart()

In [11]:
topic_model.visualize_heatmap()

## 6. Advanced: building your own BERTopic stack:

### Modularity: Choosing Your Preferred BERT Flavor
Customize the foundational elements of BERTopic to create the optimal model for your specific use case.

<img src="https://raw.githubusercontent.com/annekroon/gesis-machine-learning/main/pictures/modularity_bert.png" alt="Default Settings of BERTopic" width="600" height="600">

https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview


In [12]:
# Imports for the individual components of the stack
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

In [None]:
topics, probs = topic_model.fit_transform(docs)

In [None]:
embeddings = embedding_model.encode(docs, show_progress_bar=False)

topic_model.visualize_documents(docs, embeddings=embeddings)

# Optional: reduce the embeddings to 2D once up front, which is much faster when re-plotting iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)

### Changing pieces of the stack: PCA, K-means and TF-IDF

In [16]:

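# Imports for the alternative components used in this cell
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")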

# Step 2 - Reduce dimensionality
dim_model = PCA(n_components=5)

# Step 3 - Cluster reduced embeddings
cluster_model = KMeans(n_clusters=50)

# Step 4 - Tokenize topics
vectorizer_model = TfidfVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=dim_model,                     # Step 2 - Reduce dimensionality (PCA instead of UMAP)
  hdbscan_model=cluster_model,              # Step 3 - Cluster reduced embeddings (k-means instead of HDBSCAN)
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

In [17]:
topics, probs = topic_model.fit_transform(docs)