# BERTopic Demo

The purpose of this notebook is to demonstrate topic modelling and related concepts using [BERTopic](https://maartengr.github.io/BERTopic/index.html).

BERTopic is a widely used topic-modelling library built around a modular five-step pipeline. Each step can be swapped out independently, which makes the library adaptable to many different topic-modelling tasks.

## Imports

In [25]:
from datasets import load_dataset
import pandas as pd

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

## Dataset

We will use the [stanfordnlp/web_questions](https://huggingface.co/datasets/stanfordnlp/web_questions) dataset for our modelling. 

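This dataset consists of question-answer pairs published in 2013. The questions were collected with the Google Suggest API and answered by crowd workers using Freebase. Each record has a `url` pointing to a Freebase entity, a natural-language `question`, and a list of `answers`.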

In [17]:
data = load_dataset("stanfordnlp/web_questions")

train = pd.DataFrame(data["train"])
test = pd.DataFrame(data["test"])

# Join each list of answers into a single comma-separated string
train["answers"] = train["answers"].apply(lambda x: ','.join(x))
test["answers"] = test["answers"].apply(lambda x: ','.join(x))

The BERTopic API requires a list of strings as training data.

Let's concatenate the `url`, `question`, and `answers` columns into a single `document` column for use as our training data.

In [23]:
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
train["document"] = train.apply(lambda row: f"{row['url']} Q: {row['question']} A: {row['answers']}", axis=1)
test["document"] = test.apply(lambda row: f"{row['url']} Q: {row['question']} A: {row['answers']}", axis=1)

train["document"].iloc[0]

'http://www.freebase.com/view/en/justin_bieber Q: what is the name of justin bieber brother? A: Jazmyn Bieber,Jaxon Bieber'

## BERTopic Pipeline

The BERTopic pipeline is composed of five steps:
1. **Embedding** - Transform the raw text into dense numeric vectors that capture its meaning.
2. **Dimensionality Reduction** - Compress those vectors into fewer dimensions so that they are easier to cluster and cheaper to store.
3. **Clustering** - Group the reduced embeddings by similarity. Each resulting cluster is a candidate topic.
4. **Tokenization** - Split the documents in each cluster into tokens (words or n-grams) and count them. The tokenization strategy used here affects how fine-grained the topic keywords are.
5. **Representation** - Score the tokens to produce a human-friendly description of each topic. By default this is a list of keywords.

Each step offers a choice of algorithms, and swapping one for another (for example, a different dimensionality reduction algorithm) can change the final topics. To keep this demonstration simple, I'll use the default algorithms.

Let's walk through each step of the pipeline.

**Note**: All of the algorithms used here run on the CPU, so this notebook is portable across environments. If a GPU is available, the code can be modified to use GPU acceleration by following the instructions in the [BERTopic documentation here](https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-gpu-to-speed-up-the-model).

### Embedding

This step converts the text into floating-point numerical arrays (embeddings). Once text is represented as numbers, we can measure mathematically how similar two documents are. These similarities are ultimately what groups documents by topic.

The `sentence-transformers` library is a good default here, balancing speed and quality.

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
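
To make "mathematical similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The four-dimensional vectors are made up for illustration and are not real model output; the helper name `cosine_similarity` is also my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones come from the embedding model
# and have far more dimensions.
brother_question = [0.9, 0.1, 0.0, 0.2]
sister_question = [0.8, 0.2, 0.1, 0.3]
currency_question = [0.0, 0.9, 0.7, 0.1]

related = cosine_similarity(brother_question, sister_question)
unrelated = cosine_similarity(brother_question, currency_question)

# Vectors pointing in similar directions score close to 1.
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

Documents about similar subjects should end up with vectors that point in similar directions, which is what the later clustering step relies on.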

### Dimensionality Reduction

The embeddings created in the previous step are high-dimensional (384 dimensions per document for `all-MiniLM-L6-v2`), which makes them expensive to store and harder to cluster well. Dimensionality reduction algorithms like `UMAP()` shrink the embeddings while preserving most of the embedded information.

The most important parameter in `UMAP()` is `n_neighbors`, which controls how many surrounding data points are used when estimating the structure of the data. Higher values give a more global view of the data at the cost of local detail. `n_components` sets the number of dimensions in the reduced output.

In [None]:
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine'
)
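
To build intuition for `n_neighbors`, the sketch below runs a brute-force nearest-neighbour lookup on six made-up 2-D points. It is not UMAP itself, only an illustration of how a small neighbourhood stays inside a local group while a larger one reaches across groups. The points and the helper `nearest_neighbors` are hypothetical.

```python
import math

def nearest_neighbors(points, index, k):
    """Return the indices of the k points closest to points[index]."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return others[:k]

# Two made-up groups of points, far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # group A (indices 0-2)
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]   # group B (indices 3-5)

local_view = nearest_neighbors(points, index=0, k=2)
wider_view = nearest_neighbors(points, index=0, k=4)

print(local_view)  # only indices from group A
print(wider_view)  # also includes indices from group B
```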

### Clustering

This is the core step in the pipeline, because it is where documents are first grouped by shared topic. These document groups (clusters) are what the topic representation step uses as the basis for extracting keywords.

The three most important parameters for `HDBSCAN()` are `min_cluster_size`, `min_samples`, and `cluster_selection_method`. These values affect how large and spread-out the resulting clusters are.

Higher values of `min_cluster_size` produce fewer, larger clusters that cover more documents but describe more general, high-level topics.
`min_samples` defaults to the value of `min_cluster_size` but can be set independently. Higher values make the clustering more conservative, so more documents are marked as outliers.
`cluster_selection_method` accepts `eom` (the default) or `leaf`. `eom` tends to pick one or two large clusters and a number of smaller ones, while `leaf` selects the leaves of the cluster tree, which gives many small, homogeneous clusters.

Experiment with combinations of the three parameters to see what fits your use case.

Read more about the three parameters and their effects:<br>
[min_samples and min_cluster_size](https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#hdbscan)<br>
[HDBSCAN](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html)

In [26]:
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    min_samples=15,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

### Vectorization

This step tokenizes the documents in each cluster, that is, splits them into words or short phrases and counts how often each one occurs. The resulting tokens are the candidate keywords for each topic.

The most important parameter for `CountVectorizer()` is `ngram_range`, a `(min_n, max_n)` tuple that sets how many words a token may contain; choose a value that makes sense for your data. For example, if you are interested in finding instances of "steak restaurant" in your data, `ngram_range` should allow tokens up to two words long, such as `(1, 2)`, since "steak restaurant" has two words.

`stop_words` are words that appear frequently in your documents but carry little meaning, so you want to ignore them. `CountVectorizer()` removes no stop words by default; passing `stop_words="english"` enables the built-in `sklearn` English list. [That list has known issues](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words), so consider supplying your own list of stop words instead. [NLTK stopwords are a common alternative](https://www.geeksforgeeks.org/nlp/removing-stop-words-nltk-python/).

[You can read more about the vectorizer parameters and algorithms here](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html)

In [None]:
vectorizer_model = CountVectorizer(ngram_range=(1,3), stop_words="english")
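
The effect of `ngram_range` and `stop_words` can be sketched in plain Python. The helper `word_ngrams` below is hypothetical and much simpler than `CountVectorizer()` (it only lowercases and splits on whitespace), but it follows the same order of operations: stop words are dropped first, then n-grams are built from what remains.

```python
def word_ngrams(text, ngram_range=(1, 2), stop_words=frozenset()):
    """Return every word n-gram in `text` whose length is within ngram_range."""
    min_n, max_n = ngram_range
    # Lowercase, split on whitespace, and drop stop words before building n-grams.
    words = [w for w in text.lower().split() if w not in stop_words]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

text = "The best steak restaurant"
unigrams_only = word_ngrams(text, ngram_range=(1, 1), stop_words={"the"})
up_to_bigrams = word_ngrams(text, ngram_range=(1, 2), stop_words={"the"})

print(unigrams_only)  # ['best', 'steak', 'restaurant']
print(up_to_bigrams)  # ['best', 'steak', 'restaurant', 'best steak', 'steak restaurant']
```

With `(1, 1)` the phrase "steak restaurant" can never appear as a keyword; it only becomes a candidate once the range allows two-word tokens.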

### Representation

This step examines each cluster and extracts the keywords that best represent it. These keywords form the topic description for that cluster.

The default algorithm, c-TF-IDF, was introduced with BERTopic and is a distinguishing feature of the library. It differs from standard TF-IDF by treating each cluster of documents as a single merged document and computing TF-IDF across those merged documents. The highest-scoring terms in a cluster, those that are frequent within it but comparatively rare in the other clusters, become that cluster's topic keywords.

The documentation explains it well: _"When you apply TF-IDF as usual on a set of documents, what you are doing is comparing the importance of words between documents. Now, what if, we instead treat all documents in a single category (e.g., a cluster) as a single document and then apply TF-IDF? The result would be importance scores for words within a cluster. The more important words are within a cluster, the more it is representative of that topic. In other words, if we extract the most important words per cluster, we get descriptions of **topics**! This model is called **class-based TF-IDF**"_

[You can read more about the c-TF-IDF algorithm here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation)

In [None]:
ctfidf_model = ClassTfidfTransformer()
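
As a rough illustration, the sketch below scores words with the formula given in the BERTopic documentation as I understand it: a word's frequency within a cluster, multiplied by `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the word's frequency across all clusters. The helper `class_tfidf` and the toy clusters are hypothetical, and the real `ClassTfidfTransformer()` works on sparse count matrices and offers further options.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """Score each word per cluster as tf * log(1 + A / f).

    clusters: dict mapping a cluster label to a list of document strings.
    """
    # Merge each cluster's documents into one bag of words.
    counts = {label: Counter(" ".join(docs).lower().split())
              for label, docs in clusters.items()}
    # f: frequency of each word across all clusters.
    total_freq = Counter()
    for cluster_counts in counts.values():
        total_freq.update(cluster_counts)
    # A: average number of words per cluster.
    avg_words = sum(total_freq.values()) / len(counts)

    scores = {}
    for label, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[label] = {
            word: (freq / n_words) * math.log(1 + avg_words / total_freq[word])
            for word, freq in cluster_counts.items()
        }
    return scores

# Made-up clusters of questions.
toy_clusters = {
    0: ["who plays batman", "who is robin in batman"],
    1: ["who is president of france", "who was the first president"],
}
scores = class_tfidf(toy_clusters)
top_words = {label: max(s, key=s.get) for label, s in scores.items()}
print(top_words)  # {0: 'batman', 1: 'president'}
```

In cluster 0, "who" occurs as often as "batman", but it scores lower because it also appears throughout cluster 1.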

## Training

Now that all the individual pipeline components have been prepared, we can begin training.

Training is as simple as passing the pipeline components to the `BERTopic()` constructor and running `fit_transform()`.

In [None]:
topic_model = BERTopic(
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  ctfidf_model=ctfidf_model
)

# BERTopic expects a list of strings, so convert the pandas Series first
topics, probs = topic_model.fit_transform(train["document"].tolist())
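
`fit_transform()` returns one topic id per document, with `-1` marking documents that were treated as outliers. A quick way to get a feel for the result is to count the ids. The sketch below uses a made-up list, `example_topics`, as a stand-in for the real `topics` so that it runs on its own.

```python
from collections import Counter

# Hypothetical topic assignments standing in for the `topics` list returned
# by fit_transform(); -1 marks documents treated as outliers.
example_topics = [0, 1, 0, -1, 2, 0, 1, -1, 0, 2]

counts = Counter(example_topics)
n_outliers = counts.pop(-1, 0)  # separate the outliers from the real topics
outlier_share = n_outliers / len(example_topics)
largest_topic, largest_size = counts.most_common(1)[0]

print(f"{len(counts)} topics, {outlier_share:.0%} outliers")  # 3 topics, 20% outliers
print(f"largest topic: {largest_topic} ({largest_size} documents)")  # largest topic: 0 (4 documents)
```

On the fitted model, `topic_model.get_topic_info()` gives a similar per-topic summary together with the extracted keywords.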