[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CRCTransformers/deepdive-book/blob/main/Chapter-3-TopicModeling.ipynb)

# Motivation

In this chapter, we looked at several applications of the Transformer architecture. In this case study, we see how to use pretrained (or finetuned) Transformer models to do topic modeling. If one is exploring a new dataset, this method could be used during exploratory data analysis.

We'll use pretrained Transformers to explore the [Yelp reviews dataset](https://huggingface.co/datasets/yelp_review_full) and see what kinds of things the reviewers have to say.

There are many ways one can generate sentence embeddings, but we are going to use sentence embeddings from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library. Sentence-transformers provides models pretrained for specific tasks, such as semantic search.

We're going to use [BERTopic](https://github.com/MaartenGr/BERTopic) for topic modeling and [Huggingface Datasets](https://huggingface.co/docs/datasets/) for loading the data.

Note: Huggingface Datasets lets you work with large datasets without needing to load the entire thing into memory (the data is memory-mapped using Apache Arrow).



# Environment setup

In [2]:
# Workaround to avoid error when installing pyyaml
!pip install "cython<3.0.0" && pip install --no-build-isolation pyyaml==5.4.1


In [3]:
# Workaround to avoid error when using sklearn >= 1.2
!pip install "scikit-learn<1.2"



In [4]:
!pip install -U datasets==2.2.1 bertopic==0.10.0


In [5]:
import matplotlib.pyplot as plt

%matplotlib notebook

# Data

In [6]:
from datasets import load_dataset
import numpy as np

There are 650,000 reviews in the dataset. To keep the runtime of this case study within reason, we'll only process the first 10,000 reviews.

To process more reviews, simply change `N`.

In [7]:
N = 10_000
dataset = load_dataset("yelp_review_full", split=f"train[:{N}]")


Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...



Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.


In [8]:
dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 10000
})

# Sentence Embeddings

In this case study, we're interested in exploring the Yelp dataset to see what topics the reviewers write about.

We'll use the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model from sentence-transformers. It's built to perform well on semantic search when embedding sentences and longer spans of text.

To use the GPU when computing the embeddings, we set the `device` parameter in `SentenceTransformer` to "cuda".
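Hard-coding "cuda" will typically fail on a machine without a GPU. The following is a minimal sketch of a fallback, assuming PyTorch (which sentence-transformers is built on) is the backend; `pick_device` is a hypothetical helper name, not part of either library.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # sentence-transformers is built on PyTorch
    except ImportError:
        # Without PyTorch there is no GPU backend to use
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# The result could then be passed along, e.g.
# SentenceTransformer("all-mpnet-base-v2", device=device)
```

On a CPU, embedding 10,000 reviews will be considerably slower, so a smaller `N` may be preferable in that case.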

In [9]:
from sentence_transformers import SentenceTransformer

embeddings_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")


In [10]:
# We embed the reviews in batches, to speed things up
batch_size = 256

In [11]:
def embed(batch):
    batch["embedding"] = embeddings_model.encode(batch["text"])
    return batch

In [12]:
dataset = dataset.map(embed, batch_size=batch_size, batched=True)
dataset.set_format(type='numpy', columns=['embedding'], output_all_columns=True)



  0%|          | 0/40 [00:00<?, ?ba/s]
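To make the batching concrete, here is a minimal pure-Python sketch of what a batched map does: the texts are cut into consecutive slices and the encoder is called once per slice. `iter_batches`, `embed_in_batches`, and `toy_encode` are hypothetical names used only for illustration; `toy_encode` stands in for `embeddings_model.encode`.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, encode, batch_size):
    """Call `encode` once per batch and collect one vector per text."""
    embeddings = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encode(batch))
    return embeddings


def toy_encode(batch):
    # Hypothetical stand-in for a real encoder: a 1-dimensional "embedding"
    return [[float(len(text))] for text in batch]


texts = [f"review {i}" for i in range(10_000)]
num_batches = math.ceil(len(texts) / 256)
vectors = embed_in_batches(texts, toy_encode, 256)
print(num_batches, len(vectors))
```

With 10,000 reviews and a batch size of 256, this gives 40 batches (the last one holding only 16 reviews), which matches the total shown by the progress bar above.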

# Topics

## Building Topics

In [13]:
from bertopic import BERTopic

In [14]:
topic_model = BERTopic(n_gram_range=(1, 2))

In [15]:
topics, probs = topic_model.fit_transform(dataset["text"],
                                          np.array(dataset["embedding"]))

In [16]:
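# A second model with a wider n-gram range; calculate_probabilities=True makes
# fit_transform return a probability for every topic per document (slower).
topic_model1 = BERTopic(n_gram_range=(1, 3), calculate_probabilities=True)
topics1, probs1 = topic_model1.fit_transform(dataset["text"],
                                             np.array(dataset["embedding"]))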
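The `n_gram_range` parameter controls how many consecutive words a topic term may span: `(1, 2)` allows single words and two-word phrases. The sketch below illustrates the idea with plain whitespace tokenization, which is a simplification of what the underlying vectorizer does; `word_ngrams` is a hypothetical helper, not a BERTopic function.

```python
def word_ngrams(text, n_gram_range=(1, 2)):
    """Return all word n-grams of `text` whose length falls in `n_gram_range`."""
    tokens = text.lower().split()  # simplified tokenization (assumption)
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("The pizza crust", (1, 2)))
# ['the', 'pizza', 'crust', 'the pizza', 'pizza crust']
```

This is why multi-word terms such as "the pizza" can show up in topic names.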

In [17]:
print(f"Number of topics: {len(topic_model.get_topics())}")

Number of topics: 141


Now that we have computed a topic distribution, we need to see what kind of reviews are in each topic.

In [18]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3072 | `-1_the_and_was_to` |
| 1 | 0 | 356 | `0_we_our_to_was` |
| 2 | 1 | 331 | `1_italian_pasta_was_the` |
| 3 | 2 | 294 | `2_pizza_the pizza_crust_cheese` |
| 4 | 3 | 283 | `3_pittsburgh_the_is_and` |
| ... | ... | ... | ... |
| 136 | 135 | 11 | `135_nature_trails_preserve_lake` |
| 137 | 136 | 11 | `136_gyro_gyros_the gyro_the pita` |
| 138 | 137 | 11 | `137_comic_comics_comic book_comic books` |
| 139 | 138 | 10 | `138_thai_pad_pad thai_thai house` |


## Topic size distribution

What is the distribution of topic sizes, where a topic's size is the number of reviews assigned to it?

In [19]:
topic_sizes = topic_model.get_topic_freq()

In [20]:
topic_sizes

| | Topic | Count |
|---|---|---|
| 0 | -1 | 3072 |
| 1 | 0 | 356 |
| 2 | 1 | 331 |
| 3 | 2 | 294 |
| 4 | 3 | 283 |
| ... | ... | ... |
| 136 | 135 | 11 |
| 137 | 136 | 11 |
| 138 | 137 | 11 |
| 139 | 138 | 10 |


Note the topic with an id of -1. This corresponds to the unassigned cluster output by the HDBSCAN algorithm, which holds all the reviews that could not be assigned to one of the other clusters. It can *generally* be ignored, but if it were too large, it would be a sign that our choice of parameters is probably not a good fit for our data.
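One quick way to judge whether the unassigned cluster is "too large" is to compute the share of reviews it holds. The sketch below does this for any list of topic assignments, such as the `topics` list returned by `fit_transform`; `outlier_fraction` is a hypothetical helper, and what counts as too large is a judgment call rather than a fixed threshold.

```python
from collections import Counter


def outlier_fraction(topic_ids, outlier_id=-1):
    """Return the fraction of documents assigned to the outlier topic."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("topic_ids must not be empty")
    return counts[outlier_id] / total


# Toy assignments: 3 of 10 documents are unassigned
example = [-1, 0, 0, 1, -1, 2, 0, 1, -1, 2]
print(outlier_fraction(example))
```

For the run shown here, 3,072 of the 10,000 reviews are unassigned, which is about 31%.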

In [21]:
topic_sizes[topic_sizes["Topic"] != -1]["Count"].hist()


Most topics have fewer than 50 reviews.

Note that the unassigned cluster has been omitted from the histogram.

In [22]:
n = len(topic_sizes) - 1  # subtract 1 to ignore the unassigned cluster

# Visualization of topics

This section shows off some of the ways the topics can be visualized with the BERTopic library.

In [23]:
# Visualize the 10 topics that are most prevalent in the dataset
topic_model.visualize_barchart(top_n_topics=10,
                               n_words=5, width=1000, height=800)

BERTopic can also show a heatmap of the cosine similarities of the topic embeddings.

In [24]:
topic_model.visualize_heatmap(top_n_topics=20, n_clusters=5)
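For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following pure-Python sketch computes it for two plain lists of numbers; it illustrates the measure itself and is not BERTopic's internal code.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

Values near 1 mean two topic embeddings point in nearly the same direction, and values near 0 mean they are unrelated.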

# Sampling the distribution of topics

Let's look at the unassigned cluster, the largest topic, the smallest topic, and a topic of median size.

In [25]:
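def dump_topic_and_docs(text, topic_id):
    # Look the size up by topic id rather than by row position
    size = topic_sizes.loc[topic_sizes["Topic"] == topic_id, "Count"].iloc[0]
    print(f"{text} size: {size}\n")

    # Representative reviews are only shown for real topics, not for -1
    if topic_id != -1:
        reviews = topic_model.get_representative_docs(topic_id)
        print("**** Representative reviews ****")
        for review in reviews:
            print(review, "\n")

    # Top 10 (term, score) pairs for the topic
    return topic_model.get_topic(topic_id)[:10]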
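The topic ids passed to this function in the following subsections rely on topics being numbered in descending order of size, as the `get_topic_freq()` table suggests: topic 0 is the largest, the last topic is the smallest, and the middle rank has the median size. A minimal sketch of that indexing, with `topics_to_sample` as a hypothetical helper:

```python
def topics_to_sample(n_topics):
    """Pick topic ids by size rank, assuming ids are sorted by descending size."""
    if n_topics < 1:
        raise ValueError("need at least one topic")
    return {
        "largest": 0,
        "median": n_topics // 2,
        "smallest": n_topics - 1,
    }


print(topics_to_sample(5))
# {'largest': 0, 'median': 2, 'smallest': 4}
```

Here `n_topics` plays the role of `n`, the number of topics excluding the unassigned cluster.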

## Unassigned cluster

In [26]:
dump_topic_and_docs("Unassigned cluster", -1)

Unassigned cluster size: 3072



[('the', 0.006432108394492327),
 ('and', 0.005906223018030194),
 ('was', 0.00569506001514948),
 ('to', 0.005399500567884833),
 ('it', 0.0050572421870686245),
 ('of', 0.004938706757216315),
 ('for', 0.004802385197807139),
 ('is', 0.004699809047929484),
 ('in', 0.004443802940346615),
 ('but', 0.004441165974871375)]

As we can see, the top terms of the unassigned cluster are common function words ("the", "and", "was") that do not strongly belong to any topic.

## Largest topic

In [27]:
dump_topic_and_docs("Largest topic", 0)

Largest topic size: 356

**** Representative reviews ****
I ate at this restaurant last night with three dining companions.  We decided to sit in the bar area at a high table which had views of the ample number of TV's.  The beer list that they had was nice with about 20 different beers on tap and a nice selection of bottles.  I ended up having the Philly Shackamaximum with my waitress telling me that she preferred the Long Trails Porter.  It was served in a chilled glass and tasted pretty good to me.\n\nFor dinner we attempted to start off by ordering the Blues BBQ Pork Nachos.  We were told that they did not have any pork and so we were not able to order it.  At this point we were also told that they did not have any veggie burgers as well.  So we started out with the Chicken Chili Nachos.  They were pretty good and were covered with real cheese as opposed to nacho cheese which was to my liking.  For dinner I decided on having the Lovely Lisa's Salad since I love both blue cheese and

[('we', 0.009563742591435952),
 ('our', 0.008555623990846278),
 ('to', 0.006678497091471693),
 ('was', 0.006677529107727938),
 ('she', 0.006273530016344255),
 ('the', 0.00626708298542424),
 ('and', 0.005996068581890486),
 ('us', 0.005878453744101336),
 ('he', 0.00531349569030209),
 ('were', 0.005159237653436755)]

## Smallest topic

In [28]:
dump_topic_and_docs("Smallest topic", n-1)

Smallest topic size: 10

**** Representative reviews ****
For the location that it is in, a very busy intersection across from a hospital the parking lot is annoying to maneuver around. Once you finally make it into the store it is very hit or miss. And I'm talking everything from product selection, employee politeness and prescription availability. Went to Walgreens early in the morning (3am) and they where not able to fill a script because of it being written for the same day. Kinda makes being across from a hospital and having a 24hr pharmacy a waste of time. Children get sick in the middle of the night. How are you not going to dispense the medicine? Where not talking controlled substances here! I don't know that I would go back again. 

The last two times we've had to pick up prescriptions, we have been told a time they would be ready and have arrived promptly expecting things to go quickly and smoothly. However, the last two times they have lied and we have spent over an hour eac

[('prescription', 0.02705357048123608),
 ('pharmacy', 0.021642856384988862),
 ('pharmacist', 0.021324226940528903),
 ('prescriptions', 0.021027360200091008),
 ('insurance', 0.020092829742336654),
 ('cvs', 0.016742199443081817),
 ('wannabe pharmacist', 0.016414748271859188),
 ('wannabe', 0.015019543000065007),
 ('the prescription', 0.010465897255661293),
 ('the store', 0.010017232228389147)]

## Median size topic

In [29]:
dump_topic_and_docs("Median topic", n//2)

Median topic size: 28

**** Representative reviews ****
After multiple flat out miserable experiences at the Center City Hilton, I tried out the Omni and have been going back ever since. If you are looking for something more budget minded, Aloft right across the street will do well. If you are looking for something a little more high end but not quite Ritz level, Omni hits it on the head. I have stayed here 8 times thus far and on every occasion I have had excellent service via the front desk, immaculately clean rooms, and good room service. I always stay in the same executive corner suite over looking the epicenter so I can't speak toward the standard junior suite quality. I prefer the larger rooms. I have recommended this to my family coming to town for my upcoming wedding. Love this place. Easy walking distance to everything from Chima, Ruths Chris, Pita Pit, and smack across the street from the epicenter and all its restraunts, bars, movie theater, bowling alley, and the like. 

On

[('hotel', 0.015190692771942745),
 ('room', 0.01255084321837896),
 ('westin', 0.011081003903365147),
 ('rooms', 0.00945470222256651),
 ('stay', 0.00859681481279719),
 ('the hotel', 0.008495559812860671),
 ('stayed', 0.00761129817446559),
 ('are', 0.0072019352958643135),
 ('the', 0.0070946483932241376),
 ('for', 0.006813719879036257)]