[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CRCTransformers/deepdive-book/blob/main/Chapter-3-TopicModeling.ipynb)

# Motivation

In this chapter, we looked at several applications of the Transformer architecture. In this case study, we see how to use pretrained (or finetuned) Transformer models to do topic modeling. If one is exploring a new dataset, this method could be used during exploratory data analysis.

We'll use pretrained Transformers to explore the [Yelp reviews dataset](https://huggingface.co/datasets/yelp_review_full) and see what kinds of things the reviewers have to say.

There are many ways one can generate sentence embeddings, but we are going to use sentence embeddings from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library. Sentence-transformers provides models pretrained for specific tasks, such as semantic search.

We're going to use [BERTopic](https://github.com/MaartenGr/BERTopic) for topic modeling and [Huggingface Datasets](https://huggingface.co/docs/datasets/) for loading the data.

Note: Huggingface Datasets lets you work with large datasets without needing to load the entire thing into memory (the data is memory-mapped from disk using Apache Arrow).



# Environment setup

In [25]:
# Workaround to avoid error when installing pyyaml
!pip install "cython<3.0.0" && pip install --no-build-isolation pyyaml==5.4.1


In [26]:
!pip install -U datasets==2.2.1 bertopic==0.10.0


In [27]:
import matplotlib.pyplot as plt

%matplotlib notebook

# Data

In [28]:
from datasets import load_dataset
import numpy as np

The training split of the dataset contains 650,000 reviews. To keep the runtime of this case study within reason, we'll only process the first 10,000 of them.

To process more reviews, simply change `N`.
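The subset is selected through the split string handed to `load_dataset`, which accepts slice notation. As a minimal sketch, the helper below builds such a string; `split_spec` is a hypothetical name used only for illustration and is not part of the Datasets library.

```python
def split_spec(split: str, n_rows: int) -> str:
    """Build a split string that selects the first n_rows rows of a split."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return f"{split}[:{n_rows}]"


spec = split_spec("train", 10_000)
print(spec)  # train[:10000]
```

The Datasets documentation also describes percentage slices such as `train[:1%]`, which can be convenient when the absolute number of rows is not known in advance.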

In [29]:
N = 10_000
dataset = load_dataset("yelp_review_full", split=f"train[:{N}]")

Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...



Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.


In [30]:
dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 10000
})

# Sentence Embeddings

In this case study, we're interested in exploring the Yelp dataset, seeing what topics are being written about.

We'll use the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model from sentence-transformers. It's built to perform well on semantic search when embedding sentences and longer spans of text.

To use the GPU when computing the embeddings, we set the `device` parameter in `SentenceTransformer` to "cuda".
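A hard-coded `"cuda"` will typically fail on a runtime without a GPU. One possible way to guard against that, sketched here under the assumption that PyTorch is the backend, is to pick the device at runtime. The helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # sentence-transformers is built on PyTorch
    except ImportError:
        # Without PyTorch there is no GPU backend to use
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can be passed as the `device` argument in place of the literal `"cuda"`. Embedding 10,000 reviews on a CPU works but is considerably slower.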

In [31]:
from sentence_transformers import SentenceTransformer

embeddings_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

In [32]:
# We embed the reviews in batches, to speed things up
batch_size = 256

In [33]:
def embed(batch):
    batch["embedding"] = embeddings_model.encode(batch["text"])
    return batch

In [None]:
dataset = dataset.map(embed, batch_size=batch_size, batched=True)
dataset.set_format(type='numpy', columns=['embedding'], output_all_columns=True)
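With `batched=True`, `map` hands the function slices of up to `batch_size` rows instead of one row at a time. The dependency-free sketch below mimics that slicing on stand-in data to show how many calls result for 10,000 reviews; `iter_batches` is a hypothetical helper, not a Datasets function.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the review texts; only the count matters here
reviews = [f"review {i}" for i in range(10_000)]

batches = list(iter_batches(reviews, 256))
n_batches = len(batches)

# The number of batches is the row count divided by the batch size, rounded up
assert n_batches == math.ceil(len(reviews) / 256)
print(n_batches, len(batches[-1]))  # 40 16
```

The last batch is smaller than the others because 10,000 is not a multiple of 256.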




# Topics

## Building Topics

In [None]:
from bertopic import BERTopic

In [None]:
topic_model = BERTopic(n_gram_range=(1, 2))
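The `n_gram_range=(1, 2)` setting lets topic descriptions contain both single words and two-word phrases. The sketch below illustrates the idea on a pre-tokenized sentence. It is a simplification: BERTopic delegates this step to a vectorizer that applies its own tokenization and filtering, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, n_gram_range):
    """Return all word n-grams whose length lies in the inclusive range."""
    shortest, longest = n_gram_range
    grams = []
    for n in range(shortest, longest + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "the pizza was great".split()
grams = word_ngrams(tokens, (1, 2))
print(grams)
```

A wider range such as `(1, 3)` adds three-word phrases as well, at the cost of a larger vocabulary.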

In [None]:
topics, probs = topic_model.fit_transform(dataset["text"],
                                          np.array(dataset["embedding"]))

In [None]:
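# Variant for comparison: also consider trigrams and compute, for every review,
# the probability of each topic. The rest of this case study uses topic_model.
topic_model1 = BERTopic(n_gram_range=(1, 3), calculate_probabilities=True)
topics1, probs1 = topic_model1.fit_transform(dataset["text"],
                                             np.array(dataset["embedding"]))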

In [None]:
# The count includes the unassigned cluster (topic -1), if there is one
print(f"Number of topics: {len(topic_model.get_topics())}")

Now that we have computed a topic distribution, we need to see what kind of reviews are in each topic.

In [None]:
topic_model.get_topic_info()

## Topic size distribution

What is the distribution of topic sizes, where the size of a topic is the number of reviews assigned to it?

In [None]:
topic_sizes = topic_model.get_topic_freq()

In [None]:
topic_sizes

Note the topic with an id of -1. This corresponds to the unassigned cluster output by the HDBSCAN algorithm, which is composed of all the reviews that could not be assigned to one of the other clusters. It can *generally* be ignored, but if it were too large, it would be a sign that our choice of parameters is probably not a good fit for our data.
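One simple way to judge whether the unassigned cluster is "too large" is to look at its share of all documents. The sketch below does this for a made-up list of topic assignments; `outlier_share` is a hypothetical helper, and what counts as too large is a judgment call that this sketch does not settle.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the unassigned topic (-1)."""
    if not topics:
        raise ValueError("topics must not be empty")
    return Counter(topics)[-1] / len(topics)


# Made-up assignments: one topic id per document
toy_topics = [-1, 0, 0, 1, -1, 2, 0, 1]
share = outlier_share(toy_topics)
print(f"{share:.0%}")  # 25%
```

Applied to the list of topic ids returned by `fit_transform`, the same function gives the share for the real data.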

In [None]:
topic_sizes[topic_sizes["Topic"] != -1]["Count"].hist()

Most topics have fewer than 50 reviews.

Note that the unassigned cluster has been omitted from the histogram.

In [None]:
n = len(topic_sizes) - 1  # subtract 1 to ignore the unassigned cluster

# Visualization of topics

This section shows off some of the ways the topics can be visualized with the BERTopic library.

In [None]:
# Visualize the 10 topics that are most prevalent in the dataset
topic_model.visualize_barchart(top_n_topics=10,
                               n_words=5, width=1000, height=800)

BERTopic can also show a heatmap of the cosine similarities of the topic embeddings.

In [None]:
topic_model.visualize_heatmap(top_n_topics=20, n_clusters=5)
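Each cell of such a heatmap is the cosine similarity between two topic embeddings. The dependency-free sketch below computes the similarity matrix for three made-up 2-D vectors. Real topic embeddings have many more dimensions, and the vectors here are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "topic embeddings"
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pairwise similarities: row i, column j compares topic i with topic j
similarity = [[cosine_similarity(u, v) for v in toy_embeddings]
              for u in toy_embeddings]

for row in similarity:
    print([round(value, 2) for value in row])
```

Orthogonal vectors score 0 and identical directions score 1, so bright off-diagonal cells in the heatmap point to topics that may be near-duplicates.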

# Sampling the distribution of topics

Let's look at the unassigned cluster, the largest topic, the smallest topic, and a topic of median size.

In [None]:
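def dump_topic_and_docs(text, topic_id):
    # Look up the size through the "Topic" column, which does not rely on
    # the DataFrame's row labels matching the topic ids
    size = topic_sizes.loc[topic_sizes["Topic"] == topic_id, "Count"].iloc[0]
    print(f"{text} size: {size}\n")

    # Representative reviews are only requested for real topics, not for -1
    if topic_id != -1:
        reviews = topic_model.get_representative_docs(topic_id)
        print("**** Representative reviews ****")
        for review in reviews:
            print(review, "\n")

    # Return up to ten (word, score) pairs describing the topic
    return topic_model.get_topic(topic_id)[:10]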

## Unassigned cluster

In [None]:
dump_topic_and_docs("Unassigned cluster", -1)

As we can see, the unassigned cluster is described by words that do not strongly belong to any topic.

## Largest topic

In [None]:
dump_topic_and_docs("Largest topic", 0)

## Smallest topic

In [None]:
dump_topic_and_docs("Smallest topic", n-1)

## Median size topic

In [None]:
dump_topic_and_docs("Median topic", n//2)
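The ids used in this section (`0`, `n-1`, `n//2`) rely on BERTopic numbering topics in descending order of size. As a sketch that does not depend on that numbering, the largest, median, and smallest topics can also be picked by ranking the sizes explicitly. The assignments below are made up for illustration.

```python
from collections import Counter

# Made-up assignments: one topic id per document, -1 marks unassigned documents
toy_topics = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, -1, -1]

# Count documents per topic, leaving out the unassigned cluster
sizes = Counter(topic for topic in toy_topics if topic != -1)

# Topic ids ordered from largest to smallest
ranked = [topic for topic, _ in sizes.most_common()]
n = len(ranked)

largest, median, smallest = ranked[0], ranked[n // 2], ranked[-1]
print(largest, median, smallest)  # 0 2 3
```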