# Use BERTopic to do a literature review

Here is a [quick presentation of BERTopic](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html)

We will use scientific abstracts extracted from [OpenAlex](https://openalex.org/) with the query "large language model" and "social".

We recommend using a GPU (Runtime > Change runtime type > choose one with a GPU), then reconnect.

Install packages

In [None]:
#!pip install bertopic pandas

Load the packages

In [None]:
import pandas as pd
import bertopic

## Load the data and clean

In [None]:
# load the data
url = "https://raw.githubusercontent.com/css-polytechnique/ic2s2-tutorial-llm-2025/refs/heads/main/data/openalex_llm_social_02072025.csv"
df = pd.read_csv(url)

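# keep only the rows that have both a title and an abstract
# (.copy() avoids a pandas warning when adding a column below)
df = df[df["abstract"].notna() & df["title"].notna()].copy()

# create a text column
df["text"] = df["title"] + "\n" + df["abstract"]

# keep only reasonably short texts (under 5000 characters);
# much longer "abstracts" are likely extraction errors
df = df[df["text"].str.len() < 5000]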

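To see what this cleaning step does, here is a minimal sketch on a toy DataFrame. The three rows and the 40-character threshold are made up for illustration (the notebook itself uses 5000 characters).

```python
import pandas as pd

# Toy data (hypothetical): one complete row, one without abstract, one very long
toy = pd.DataFrame({
    "title": ["LLMs and society", "No abstract here", "Long paper"],
    "abstract": ["A short abstract.", None, "x" * 100],
})

# Drop rows that miss a title or an abstract
toy = toy[toy["abstract"].notna() & toy["title"].notna()].copy()

# Concatenate title and abstract into a single text column
toy["text"] = toy["title"] + "\n" + toy["abstract"]

# Keep only texts below a length threshold (40 here, 5000 in the notebook)
toy = toy[toy["text"].str.len() < 40]

# Only the first row survives both filters
print(toy["text"].tolist())
```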
Get a sense of the dataset

In [None]:
df["text"].apply(len).describe()

## Let's use BERTopic

Out-of-the-box solution: BERTopic with default parameters

![](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.svg)



### Run the pipeline

In [None]:
topic_model = bertopic.BERTopic(language="english")
topics, probabilities = topic_model.fit_transform(df["text"])

### The topic_model object

In [None]:
topic_model.get_topic_info()[0:15]

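In this table, the row with topic `-1` gathers the documents that were not assigned to any cluster (outliers). A quick way to check how many documents are affected is to count the labels directly. The sketch below uses a made-up list of labels; `topic_sizes` and `example_topics` are hypothetical names, and in the notebook you would pass the `topics` list returned by `fit_transform`.

```python
from collections import Counter

def topic_sizes(topics):
    """Return (real topics sorted by size, share of outlier documents)."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # -1 is the label given to outliers
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return counts.most_common(), outlier_share

# Hypothetical topic assignments for 8 documents
example_topics = [0, 0, 1, -1, 2, 0, -1, 1]
sizes, outlier_share = topic_sizes(example_topics)

print(sizes)          # (topic, number of documents) pairs, largest first
print(outlier_share)  # fraction of documents labelled -1
```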
Save it

In [None]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("bertopic", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

### Visualizations

**Topic level**

Once topics have been identified based on the semantic proximity of documents, they need to be interpreted. To do so, it is useful to visualize the most characteristic words of each topic.

In [None]:
topic_model.visualize_barchart()

Topics can be more or less different. One way to interpret them is to project them in a 2D space based on their embeddings.

In [None]:
topic_model.visualize_topics()

Building on the distance between topics, it is possible to compute a hierarchical clustering of all the topics. This is useful if you want to reduce the number of topics or decide which ones to merge.

In [None]:
topic_model.visualize_hierarchy()

**Document level**

Based on the semantic embedding of documents, we can obtain a 2D projection with each abstract represented by one point.

In [None]:
topic_model.visualize_documents(df["text"].to_list())

Apart from a few documents that belong to a single topic, a document is generally a combination of topics. BERTopic can compute the probability that a document belongs to each topic, and this distribution can be visualized. This helps to investigate documents that straddle topics.

In [None]:
topic_model = bertopic.BERTopic(language="english", calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(df["text"])
topic_model.visualize_distribution(probabilities=probabilities[10])

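To find which documents are worth inspecting this way, one option is to look for rows where the two most probable topics are close to each other. The sketch below assumes `probabilities` is a 2D array with one row per document and one column per topic (at least two topics). The helper `straddling_documents`, the toy matrix, and the `max_gap` threshold are illustrative choices, not part of BERTopic.

```python
import numpy as np

def straddling_documents(probabilities, max_gap=0.1):
    """Indices of documents whose two most probable topics are within max_gap."""
    probs = np.asarray(probabilities, dtype=float)
    # Sort each row in ascending order and keep the two largest values
    top_two = np.sort(probs, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]
    return np.flatnonzero(gap <= max_gap)

# Hypothetical document-topic probabilities (3 documents, 3 topics)
toy_probs = np.array([
    [0.90, 0.05, 0.05],  # clearly one topic
    [0.45, 0.40, 0.15],  # split between topics 0 and 1
    [0.10, 0.30, 0.35],  # split between topics 1 and 2
])

print(straddling_documents(toy_probs))
```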
### Note that

The description of the topics is not perfect
- Maybe we should use better embeddings?
- Maybe we should have more/fewer clusters?

We can modify each part of the process to this effect

## Towards better results

Each element can be adapted

- Remove stop words from the cluster description
- Change the text representation


For instance, we can define the parameters of the dimensionality reduction (UMAP) and of the clustering algorithm (HDBSCAN)

In [None]:
from umap import UMAP
import hdbscan

umap_model = UMAP(n_neighbors=15, n_components=6, min_dist=0.0, metric='cosine')
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom')

Clean the text representation by removing stop words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")

Re-run with these new parameters (options)

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

In [None]:
topic_model.visualize_barchart()




Without stop words, the topic descriptions become more readable.

### Use a better text embedding

Let's use a sentence transformer model. [What is currently trending on Hugging Face?](https://huggingface.co/models?library=sentence-transformers&sort=likes)

Let's use Qwen, which has a larger context window and can therefore represent the complete abstract rather than only part of it.

In [None]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

In [None]:
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

## Use GenAI to Name Topics

More information on this: https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#prompt-engineering

The original approach consists in computing a c-TF-IDF score based on the specificity of the vocabulary in each cluster.

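As a rough illustration of this score, the sketch below assumes the formulation given in the BERTopic documentation: the frequency of a word within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f the frequency of the word across all topics. The function `c_tf_idf` and the toy counts are hypothetical, and BERTopic's own implementation may differ in its details.

```python
import numpy as np

def c_tf_idf(counts):
    """counts: array of shape (n_topics, n_words) holding word counts per topic."""
    counts = np.asarray(counts, dtype=float)
    # Term frequency, normalised within each topic
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Frequency of each word across all topics (assumed non-zero here)
    word_freq = counts.sum(axis=0)
    # Average number of words per topic
    avg_words = counts.sum() / counts.shape[0]
    idf = np.log(1 + avg_words / word_freq)
    return tf * idf

# Hypothetical counts: rows = 2 topics, columns = ["model", "ethics", "code"]
toy_counts = np.array([
    [10, 8, 0],
    [10, 0, 6],
])
scores = c_tf_idf(toy_counts)

# "model" is the most frequent word in both topics, but because it is shared
# it scores lower than the word that is specific to each topic
print(scores.round(3))
print(scores.argmax(axis=1))
```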
This can be improved. The idea is to send representative documents and keywords, together with a prompt, to a genAI model to get a description of the topic.

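To make this concrete, here is a sketch of what such a prompt could look like once filled in. BERTopic prompts use the `[DOCUMENTS]` and `[KEYWORDS]` placeholders described in the page linked above. The template wording, the helper `fill_prompt`, and the example documents are hypothetical, and the substitution is done by hand here only to show the final text. In practice the representation model performs it, and a custom template can be supplied through its `prompt` argument.

```python
# Hypothetical template using the [DOCUMENTS] and [KEYWORDS] placeholders
prompt_template = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, give a short label for this topic."""

def fill_prompt(template, documents, keywords):
    """Illustrative helper: substitute documents and keywords into the template."""
    docs_block = "\n".join(f"- {doc}" for doc in documents)
    return (template
            .replace("[DOCUMENTS]", docs_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))

filled = fill_prompt(
    prompt_template,
    documents=["LLMs can simulate survey respondents.", "Chatbots in social research."],
    keywords=["survey", "respondents", "simulation"],
)
print(filled)
```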
In [None]:
# Load the packages specific to genAI
import openai
import tiktoken
from bertopic.representation import OpenAI

Configure the way you want to request the genAI model

In [None]:
# ENTER A KEY
key = "YOUR_KEY"

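# Tokenizer used to truncate the documents sent to the model
tokenizer = tiktoken.encoding_for_model("gpt-4o")

# Create your representation model
# (base_url points the OpenAI client at OpenRouter, so the key must be valid for that service)
client = openai.OpenAI(api_key=key,
                       base_url="https://openrouter.ai/api/v1")
# If the request fails, check the model identifier expected by your provider
# (OpenRouter typically uses prefixed names such as "openai/gpt-4o")
representation_model = OpenAI(
    client,
    model="gpt-4o",
    delay_in_seconds=2,   # pause between requests to limit rate errors
    chat=True,
    nr_docs=4,            # number of representative documents sent per topic
    doc_length=100,       # each document is truncated to 100 tokens
    tokenizer=tokenizer
)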

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

In [None]:
topic_model.get_topic_info()[0:10]

## Exercise

Use a custom BERT model, potentially better aligned with your dataset, to compute the embeddings.
For instance, we could use SciBERT: https://huggingface.co/allenai/scibert_scivocab_uncased


In [None]:
from transformers.pipelines import pipeline
embedding_model_bert = pipeline("feature-extraction",
                                model="allenai/scibert_scivocab_uncased",
                                tokenizer="allenai/scibert_scivocab_uncased",
                                truncation=True,
                                padding=True,
                                max_length=512)