# Use BERTopic to do a literature review

Here is a [quick presentation of BERTopic](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html)

We will use scientific abstracts extracted from [Open Alex](https://openalex.org/) with the query "large language model" and "social".

We recommend using a GPU (Runtime > Change runtime type > choose one with a GPU), then reconnect.

Install packages

In [1]:
!pip install -q bertopic pandas "nbformat>=4.2.0" openai tiktoken

Load the packages

In [1]:
import pandas as pd
import bertopic



## Load the data and clean

In [2]:
# load the data
url = ("https://raw.githubusercontent.com/css-polytechnique/css-ipp-materials/"
       "refs/heads/main/Python-tutorials/SICSS-2025/bertopic/"
       "openalex_llm_social_02072025.csv")
df = pd.read_csv(url)

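# keep rows that have both a title and an abstract
# (.copy() avoids a SettingWithCopyWarning when adding the text column below)
df = df.dropna(subset=["abstract", "title"]).copy()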

# create a text column
df["text"] = df["title"] + "\n" + df["abstract"]

# keep "small" abstracts (avoid plain text errors)
df = df[df["text"].apply(len) < 5000]

Get a sense of the dataset

In [3]:
df["text"].apply(len).describe()

count    2490.000000
mean     1516.529719
std       518.751288
min       275.000000
25%      1191.000000
50%      1441.000000
75%      1738.000000
max      4792.000000
Name: text, dtype: float64

## Let's use BERTopic

Out-of-the-box solution: BERTopic with default parameters

![](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.svg)



### Run the pipeline

In [4]:
topic_model = bertopic.BERTopic(language="english")
topics, probabilities = topic_model.fit_transform(df["text"])

### The topic_model object

In [5]:
topic_model.get_topic_info()[0:15]

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,801,-1_the_and_of_to,"[the, and, of, to, in, for, we, language, that...",[Looking forward to the new year\n2023 was a y...
1,0,97,0_hate_speech_detection_content,"[hate, speech, detection, content, and, the, o...",[Investigating the Predominance of Large Langu...
2,1,92,1_robots_robot_interaction_and,"[robots, robot, interaction, and, humanrobot, ...",[Nadine: An LLM-driven Intelligent Social Robo...
3,2,92,2_medical_and_chatgpt_the,"[medical, and, chatgpt, the, of, health, to, i...",[ChatGPT: friend or foe?\nYou would have been ...
4,3,85,3_agents_simulation_social_simulations,"[agents, simulation, social, simulations, and,...",[Generative Agents: Interactive Simulacra of H...
5,4,68,4_urban_disaster_and_the,"[urban, disaster, and, the, media, of, data, t...",[Towards Human-AI Collaborative Urban Science ...
6,5,63,5_mental_depression_health_suicide,"[mental, depression, health, suicide, media, a...",[Utilizing Large Language Models to Detect Dep...
7,6,61,6_tom_mind_reasoning_of,"[tom, mind, reasoning, of, theory, and, llms, ...",[Do LLMs Exhibit Human-Like Reasoning? Evaluat...
8,7,59,7_public_health_analysis_media,"[public, health, analysis, media, and, covid19...",[Using Large Language Models for sentiment ana...
9,8,52,8_game_agents_games_behavior,"[game, agents, games, behavior, cooperation, i...",[Investigating Emergent Goal-Like Behaviour in...
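
Topic -1 is where BERTopic puts outliers, that is, documents the clustering step did not assign to any topic. A quick sketch of how large that group is, using only the counts printed above (801 documents in topic -1, 2,490 texts after cleaning):

```python
# Counts copied from the outputs above: describe() and get_topic_info()
n_documents = 2490
n_outliers = 801  # size of topic -1

outlier_share = n_outliers / n_documents
print(f"{outlier_share:.1%} of the abstracts are not assigned to any topic")
```

With the fitted model, the same share could be obtained from the `Count` column of `topic_model.get_topic_info()`. A large outlier group is one reason to tune the clustering step later on.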


Save it

In [8]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("bertopic", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

### Visualizations

**Topic level**

Once topics have been identified based on the semantic proximity of documents, they need to be interpreted. To do so, it is useful to visualize the most specific words of each topic.

In [7]:
topic_model.visualize_barchart()

Topics can be more or less distinct from one another. One way to see this is to project them in a 2D space based on their embeddings.

In [8]:
topic_model.visualize_topics()

Building on the distance between topics, it is possible to get a hierarchical clustering of all the topics. This is useful when you want to reduce the number of topics or decide which ones to merge.

In [9]:
topic_model.visualize_hierarchy()
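
To make the idea of "distance between topics" concrete, here is a minimal sketch of the first step of an agglomerative clustering: find the two closest topics, which would be merged first. The topic vectors are hypothetical values chosen for illustration, and BERTopic's own computation differs in its details.

```python
import math
from itertools import combinations

# Hypothetical 3-d topic vectors, for illustration only (not taken from the model)
topic_vectors = {
    "hate speech": [0.9, 0.1, 0.0],
    "content moderation": [0.8, 0.3, 0.1],
    "social robots": [0.0, 0.2, 0.9],
}

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# The pair with the smallest distance is the first candidate for a merge
closest_pair = min(
    combinations(topic_vectors, 2),
    key=lambda pair: cosine_distance(topic_vectors[pair[0]], topic_vectors[pair[1]]),
)
print(closest_pair)
```

Repeating this step, each time replacing the merged pair by a single group, yields the kind of tree that the hierarchy plot displays.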

**Document level**

Based on the semantic embedding of documents, we can obtain a 2D projection with each abstract represented by one point.

In [10]:
topic_model.visualize_documents(df["text"].to_list())

Save for a few documents that have a single topic, a document is generally a combination of topics. It is possible to calculate the probability that a document belongs to each topic and to visualize this distribution. This helps to investigate documents that straddle topics.

In [11]:
topic_model = bertopic.BERTopic(language="english", calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(df["text"])
topic_model.visualize_distribution(probabilities=probabilities[10], min_probability = 0.005)
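
As a rough sketch of what such a distribution lets you do, the snippet below keeps the topics above a threshold and flags a document whose two best topics are close. The probabilities and the 0.1 margin are hypothetical; only the `min_probability` value comes from the cell above.

```python
# Hypothetical topic probabilities for one document (illustrative values)
doc_probabilities = [0.41, 0.002, 0.37, 0.004, 0.09]
min_probability = 0.005  # same threshold as in the cell above

# Keep only the topics that pass the threshold, most likely first
visible = sorted(
    ((topic, p) for topic, p in enumerate(doc_probabilities) if p >= min_probability),
    key=lambda item: item[1],
    reverse=True,
)

# A document "straddles" topics when its two best topics are close
# (0.1 is an arbitrary margin, chosen for this example)
straddles = len(visible) >= 2 and visible[0][1] - visible[1][1] < 0.1
print(visible, straddles)
```

In the notebook, `probabilities[i]` would play the role of `doc_probabilities` for the i-th abstract.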

### Note that

The description of the topics is not perfect:
- Maybe we should use better embeddings?
- Maybe we should have more/fewer clusters?

We can modify each step of the pipeline to address this.

## Towards better results

Each element can be adapted:

- Remove stop words from the cluster descriptions
- Change the text representation

For instance, we can set the parameters of the dimensionality reduction (UMAP) and of the clustering algorithm (HDBSCAN).

In [7]:
from umap import UMAP
import hdbscan

umap_model = UMAP(n_neighbors=15, n_components=6, min_dist=0.0, metric='cosine')
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom')

Clean the text representation by removing stop words

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")
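
Conceptually, a stop-word filter drops very common words before they are counted. The sketch below applies a tiny hand-written list (an assumption made for illustration; scikit-learn's `"english"` list is much longer) to the first keywords of topic -1 from the default run:

```python
# Tiny hand-written stop-word list, for illustration only
stop_words = {"the", "and", "of", "to", "in", "for", "we", "that"}

# First keywords of topic -1 in the default run above
keywords = ["the", "and", "of", "to", "in", "for", "we", "language", "that"]

kept = [word for word in keywords if word not in stop_words]
print(kept)
```

Only `language` survives here. In the real pipeline the filtering happens before the keywords are ranked, so the freed slots are taken by the next most specific words rather than left empty.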

Re-run with these new parameters (options)

In [14]:
from bertopic import BERTopic

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model = vectorizer_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

2025-09-29 11:11:13,218 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 78/78 [00:15<00:00,  4.93it/s]
2025-09-29 11:11:31,461 - BERTopic - Embedding - Completed ✓
2025-09-29 11:11:31,461 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-29 11:11:35,804 - BERTopic - Dimensionality - Completed ✓
2025-09-29 11:11:35,805 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-29 11:11:35,836 - BERTopic - Cluster - Completed ✓
2025-09-29 11:11:35,841 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-29 11:11:36,059 - BERTopic - Representation - Completed ✓


In [15]:
topic_model.visualize_barchart()

Without stop words, the topics become more readable.

### Use a better text embedding

Let's use a sentence transformer model. [What is currently trending on Hugging Face?](https://huggingface.co/models?library=sentence-transformers&sort=likes)

Let's use a GTE model from Alibaba (`Alibaba-NLP/gte-multilingual-base`), which has a larger context window and can therefore represent the complete abstract rather than only part of it.
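
A back-of-the-envelope check shows why the context window matters for this dataset. The sketch below compares the text lengths reported earlier with the input limits of the default model and of the model loaded in the next cell. The limits (256 and 8,192 tokens) are quoted from the model cards as best recalled, and the ratio of 4 characters per token is only a rule of thumb, so treat all three as assumptions.

```python
# Text lengths in characters, copied from the describe() output above
lengths = {"min": 275, "25%": 1191, "50%": 1441, "75%": 1738, "max": 4792}

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text (assumption)

# Maximum input length in tokens (assumed from the model cards)
LIMITS = {"all-MiniLM-L6-v2": 256, "gte-multilingual-base": 8192}

# For each model, does a text of that length fit without truncation?
fits = {
    model: {stat: n_chars / CHARS_PER_TOKEN <= limit for stat, n_chars in lengths.items()}
    for model, limit in LIMITS.items()
}

for model, result in fits.items():
    print(model, result)
```

Under these assumptions, even the median text would be cut by the default model, while the longest one fits comfortably in the larger window.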

In [16]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
topic_model = bertopic.BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model = vectorizer_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

2025-09-29 11:12:22,617 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 78/78 [04:26<00:00,  3.41s/it]
2025-09-29 11:16:48,811 - BERTopic - Embedding - Completed ✓
2025-09-29 11:16:48,815 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-29 11:16:54,788 - BERTopic - Dimensionality - Completed ✓
2025-09-29 11:16:54,790 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-29 11:16:54,887 - BERTopic - Cluster - Completed ✓
2025-09-29 11:16:54,921 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-29 11:16:55,170 - BERTopic - Representation - Completed ✓


In [18]:
topic_model.get_topics()

{-1: [('ai', 0.013949735316118823),
  ('language', 0.013488201963898702),
  ('social', 0.013121915967556588),
  ('models', 0.01265324659924374),
  ('large', 0.012039136556194756),
  ('llms', 0.011132423687530897),
  ('human', 0.010906792105795905),
  ('llm', 0.008544974173594462),
  ('based', 0.008532086421850929),
  ('media', 0.008479979765699096)],
 0: [('bias', 0.04920363144564411),
  ('biases', 0.04206428024561533),
  ('gender', 0.02584673184662964),
  ('llms', 0.021043620672770325),
  ('models', 0.018802634128974295),
  ('language', 0.01677886572704667),
  ('fairness', 0.015996577058548404),
  ('stereotypes', 0.014801726266684828),
  ('social', 0.014313106185950187),
  ('large', 0.013877351436702065)],
 1: [('hate', 0.05349765121114142),
  ('speech', 0.042276110294848526),
  ('content', 0.02821914994352692),
  ('moderation', 0.023610289787686322),
  ('detection', 0.023477200663612047),
  ('online', 0.018939902670081464),
  ('offensive', 0.018042089943968907),
  ('harmful', 0.01689

## Use GenAI to Name Topics

More information on this: https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#prompt-engineering

The original approach consists in computing a c-TF-IDF score based on how specific the vocabulary is to each cluster.
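
As a minimal sketch of that score, assuming the formula given in the BERTopic documentation (term frequency within the topic, multiplied by `log(1 + A / f)` where `A` is the average number of words per topic and `f` the frequency of the word across all topics), here is a pure-Python version on a toy corpus invented for illustration:

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Score each word of each topic; higher means more specific to the topic."""
    # Treat each topic as one big document and count its words
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each word across all topics
    overall = Counter()
    for topic_counts in counts.values():
        overall.update(topic_counts)
    # Average number of words per topic
    avg_words = sum(overall.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (tf / n_words) * math.log(1 + avg_words / overall[word])
            for word, tf in topic_counts.items()
        }
    return scores

# Toy corpus, invented for illustration
toy = {
    0: ["hate speech detection with language models", "online hate speech moderation"],
    1: ["social robot interaction with language models", "robot interaction design"],
}
scores = c_tf_idf(toy)
top_words = {t: sorted(s, key=s.get, reverse=True)[:2] for t, s in scores.items()}
print(top_words)
```

Words shared by both topics, such as "language", score lower than words that appear in only one, which is why the keywords tend to be distinctive but not always easy to read as a label.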

This can be improved. The idea is to send representative documents and keywords, together with a prompt, to a genAI model to get a description of the topic.
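
A rough sketch of what gets sent to the model: a prompt in which placeholders are replaced by a few truncated documents and the topic keywords. The `[DOCUMENTS]` and `[KEYWORDS]` tags follow the convention described in the BERTopic documentation linked above; the template wording, the helper names, the example document and the whitespace tokenizer are all assumptions made for illustration.

```python
# Hypothetical prompt template using the [DOCUMENTS] and [KEYWORDS] tags
prompt_template = (
    "I have a topic that contains the following documents:\n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the above, give a short label for this topic."
)

def truncate(text, max_tokens):
    """Keep the first max_tokens tokens (whitespace split stands in for a real tokenizer)."""
    return " ".join(text.split()[:max_tokens])

def build_prompt(docs, keywords, max_tokens=100):
    """Fill the template with truncated documents and comma-separated keywords."""
    doc_block = "\n".join(f"- {truncate(doc, max_tokens)}" for doc in docs)
    return (
        prompt_template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )

# Keywords from the hate speech topic above; the document is made up
prompt = build_prompt(
    docs=["A study of hate speech detection with large language models"],
    keywords=["hate", "speech", "content", "moderation", "detection"],
    max_tokens=5,
)
print(prompt)
```

Truncating the documents keeps each request short and cheap, which is the role played by `doc_length` and `tokenizer` in the representation model configured below.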

In [3]:
# Load the packages specific to genAI
import openai
import tiktoken
from bertopic.representation import OpenAI

Configure the way you want to request the genAI model

In [11]:
# ENTER A KEY (never commit a real key to a shared notebook)
key = "YOUR_OPENROUTER_API_KEY"

# Tokenizer to limit the length of the texts
tokenizer= tiktoken.encoding_for_model("gpt-4o")

# Create your representation model
client = openai.OpenAI(api_key=key,
                       base_url="https://openrouter.ai/api/v1")
representation_model = OpenAI(
    client,
    model="gpt-4o",
    delay_in_seconds=2,
    chat=True,
    nr_docs=4,
    doc_length=100,
    tokenizer=tokenizer
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model = vectorizer_model,
    representation_model = representation_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

2025-09-29 11:25:26,388 - BERTopic - Embedding - Transforming documents to embeddings.


Batches: 100%|██████████| 78/78 [00:16<00:00,  4.84it/s]
2025-09-29 11:25:44,363 - BERTopic - Embedding - Completed ✓
2025-09-29 11:25:44,363 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-29 11:25:50,219 - BERTopic - Dimensionality - Completed ✓
2025-09-29 11:25:50,220 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-29 11:25:50,251 - BERTopic - Cluster - Completed ✓
2025-09-29 11:25:50,253 - BERTopic - Representation - Fine-tuning topics using representation models.
100%|██████████| 38/38 [01:51<00:00,  2.93s/it]
2025-09-29 11:27:42,004 - BERTopic - Representation - Completed ✓


In [12]:
topic_model.get_topic_info()[0:10]

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,612,-1_Large Language Models,[Large Language Models],[Measuring Social Norms of Large Language Mode...
1,0,321,0_AI in Healthcare,[AI in Healthcare],[The Growing Impact of Natural Language Proces...
2,1,221,1_Bias in LLMs,[Bias in LLMs],[Explicit vs. Implicit: Investigating Social B...
3,2,153,2_AI in Education,[AI in Education],"[Foreword\nAs technology advances, there are g..."
4,3,95,3_Human-Robot Interaction,[Human-Robot Interaction],[NewsGPT: ChatGPT Integration for Robot-Report...
5,4,85,4_LLM-based Social Simulation,[LLM-based Social Simulation],[Spontaneous Emergence of Agent Individuality ...
6,5,70,5_AI in Urban Research,[AI in Urban Research],[Towards Human-Ai Collaborative Urban Science ...
7,6,65,6_Hate Speech Detection,[Hate Speech Detection],[Investigating the Predominance of Large Langu...
8,7,60,7_Theory of Mind in LLMs,[Theory of Mind in LLMs],[Towards A Holistic Landscape of Situated Theo...
9,8,45,8_Multimodal Humor Understanding,[Multimodal Humor Understanding],[Can Language Models Laugh at YouTube Short-fo...


## Exercise

Use a custom BERT model, potentially better aligned with your dataset, to compute the embeddings.
For instance, we could use SciBERT: https://huggingface.co/allenai/scibert_scivocab_uncased


In [13]:
from transformers.pipelines import pipeline
embedding_model_bert = pipeline("feature-extraction",
                                model="allenai/scibert_scivocab_uncased",
                                tokenizer="allenai/scibert_scivocab_uncased",
                                truncation=True,
                                padding=True,
                                max_length=512)

Device set to use mps:0
