(Custom) Embedding Models in BERTopic

In this notebook, we will go through the embedding models that can be used in BERTopic. Being able to choose the embedding model allows you to leverage pre-trained embeddings that suit your use case. Moreover, it helps you create a topic model when you have little data available. 

## Embedding models
Embedding models are used for the representation of text, such as words, sentences and documents, in the form of real-valued vectors. They typically encode the semantic meaning of text. 
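
To make the idea concrete, here is a minimal sketch of comparing texts through their vectors. The three-dimensional vectors are made up for illustration (real models produce hundreds of dimensions), and `cosine_similarity` is a helper defined here, not part of BERTopic:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only
embeddings = {
    "sports car": [0.9, 0.1, 0.0],
    "pickup truck": [0.8, 0.3, 0.1],
    "space shuttle": [0.1, 0.2, 0.9],
}

car = embeddings["sports car"]
for text, vector in embeddings.items():
    print(f"{text}: {cosine_similarity(car, vector):.2f}")
```

Texts with related meaning should end up with a higher similarity than unrelated ones, which is the property that clustering in a topic model relies on.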

<br>

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing BERTopic

We start by installing BERTopic, with all available backends, from PyPI:

In [5]:
%%capture
!pip install bertopic[flair,gensim,spacy,use]

**NOTE 1**: This may take a while as it needs to install spaCy, Torch, Gensim, USE, etc. 

**NOTE 2**: Installing several backends at once can cause dependency conflicts, so it might be worthwhile to install only the one you want to experiment with. 

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded, so we need to restart the notebook in order to use them correctly.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts. The training subset loaded below has 11,314 documents.

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='train',  remove=('headers', 'footers', 'quotes'))['data']

In [2]:
print(docs[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


# **Embedding Models**
In this section, we will go through all embedding models and backends that are supported in BERTopic.

## Sentence Transformers
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) and pass it through BERTopic with `embedding_model`:

In [3]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens").fit(docs)

In [4]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5035 | `-1_the_to_and_of` |
| 1 | 0 | 942 | `0_team_he_game_the` |
| 2 | 1 | 576 | `1_god_that_of_is` |
| 3 | 2 | 439 | `2_space_nasa_the_of` |
| 4 | 3 | 375 | `3_is_of_it_msg` |
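
In the output above, topic -1 holds the documents that were not assigned to any cluster (outliers), and each `Name` appears to be the topic number followed by the topic's top words, joined with underscores. A minimal sketch of that naming pattern, inferred from the output rather than taken from BERTopic's source (`make_topic_name` is a hypothetical helper):

```python
def make_topic_name(topic_id, top_words, n_words=4):
    """Join a topic number and its first n_words top words with underscores."""
    return "_".join([str(topic_id)] + list(top_words[:n_words]))

print(make_topic_name(-1, ["the", "to", "and", "of", "is"]))
print(make_topic_name(2, ["space", "nasa", "the", "of"]))
```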


Or we can select a SentenceTransformer model with our own parameters:



In [5]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")
topic_model = BERTopic(embedding_model=sentence_model).fit(docs)

In [6]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4930 | `-1_the_to_and_of` |
| 1 | 0 | 917 | `0_he_team_game_the` |
| 2 | 1 | 659 | `1_god_that_of_is` |
| 3 | 2 | 447 | `2_space_nasa_the_of` |
| 4 | 3 | 321 | `3_key_encryption_chip_clipper` |


## 🤗 Transformers

To use a Hugging Face transformers model, load in a pipeline and point to any model found on their model hub (https://huggingface.co/models):

In [7]:
from transformers.pipelines import pipeline

embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model).fit(docs)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [8]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5014 | `-1_the_to_and_of` |
| 1 | 0 | 759 | `0_he_team_game_the` |
| 2 | 1 | 569 | `1_car_bike_my_it` |
| 3 | 2 | 300 | `2____` |
| 4 | 3 | 281 | `3_msg_is_she_disease` |


## Flair

Flair allows you to choose almost any embedding model that is publicly available.<br> Flair can be used as follows:

In [5]:
!pip install flair
!pip install --upgrade bertopic[flair,gensim,spacy,use]

In [None]:
#%%capture
# -y skips the confirmation prompt, which would otherwise block the cell
!pip uninstall -y numpy
!pip install numpy

In [1]:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')

In [None]:
topic_model = BERTopic(embedding_model=roberta).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can use Flair to pool word embeddings into document embeddings. Under the hood, Flair simply averages all word embeddings in a document. We can then pass the pooled embeddings to BERTopic and use them as document embeddings:

In [6]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# 'crawl' loads FastText embeddings trained on web crawl data
fasttext_embedding = WordEmbeddings('crawl')
document_fasttext_embeddings = DocumentPoolEmbeddings([fasttext_embedding])

topic_model = BERTopic(embedding_model=document_fasttext_embeddings).fit(docs)

In [9]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 11013 | `0_the_ax_to_of` |
| 1 | 1 | 301 | `1_compuserve_com__` |
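
The averaging that pooled document embeddings rely on can be sketched in a few lines. This is only an illustration of mean pooling with made-up two-dimensional word vectors (`toy_vectors` and `mean_pool` are hypothetical names), not Flair's implementation:

```python
def mean_pool(tokens, word_vectors):
    """Average the vectors of all known tokens into one document vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up word vectors for illustration only
toy_vectors = {
    "car": [1.0, 0.0],
    "engine": [0.8, 0.2],
    "the": [0.0, 1.0],
}

doc_vector = mean_pool("the car engine".split(), toy_vectors)
print(doc_vector)
```

Because every token contributes equally, long documents with many common words tend to end up with similar averaged vectors.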


## spaCy
spaCy has shown great promise over the last years and now also offers transformer-based pipelines, which makes it interesting to use in BERTopic. 

We start by using a non-transformer-based model which we will have to download first:

In [10]:
!python -m spacy download en_core_web_md

Next, simply load the model into a spaCy `nlp` instance and pass it to BERTopic:

In [11]:
import spacy

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp, verbose=True).fit(docs)

100%|██████████| 11314/11314 [02:51<00:00, 65.80it/s]
2022-12-29 16:13:29,924 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:13:45,201 - BERTopic - Reduced dimensionality
2022-12-29 16:13:46,290 - BERTopic - Clustered reduced embeddings


In [12]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 10 | `-1_judas_the_of_that` |
| 1 | 0 | 10852 | `0_the_ax_to_of` |
| 2 | 1 | 305 | `1_critus_melittin_chris_` |
| 3 | 2 | 53 | `2_the_for_it_paradox` |
| 4 | 3 | 38 | `3_forged_the__________________________________...` |


In [13]:
topic_model.get_topic_info().tail()

| | Topic | Count | Name |
|---|---|---|---|
| 2 | 1 | 305 | `1_critus_melittin_chris_` |
| 3 | 2 | 53 | `2_the_for_it_paradox` |
| 4 | 3 | 38 | `3_forged_the__________________________________...` |
| 5 | 4 | 35 | `4_the_to_and_of` |
| 6 | 5 | 21 | `5_the_you_to_of` |


We can also use spaCy's transformer-based models, which also have to be downloaded first:

In [14]:
!python -m spacy download en_core_web_trf

Installing collected packages: spacy-alignments, spacy-transformers, en-core-web-trf
Successfully installed en-core-web-trf-3.4.1 spacy-alignments-0.9.0 spacy-transformers-1.1.9
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


As before, we simply load the model and pass it to BERTopic. Note that we exclude several pipeline components as they are not used by BERTopic.

In [4]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp, verbose=True, min_topic_size=5).fit(docs)

  1%|▏         | 144/11314 [00:11<20:10,  9.23it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1575 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 11314/11314 [08:55<00:00, 21.13it/s]
2022-12-29 16:26:36,500 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:27:05,965 - BERTopic - Reduced dimensionality
2022-12-29 16:27:06,544 - BERTopic - Clustered reduced embeddings


If you run into memory issues with spacy-transformer models, try:

In [None]:
import spacy
from thinc.api import set_gpu_allocator, require_gpu

# Route GPU memory allocation through PyTorch and select the GPU
# before loading the pipeline, so the model is created on the GPU
set_gpu_allocator("pytorch")
require_gpu(0)
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp, verbose=True).fit(docs)

  1%|▏         | 144/11314 [02:22<9:16:03,  2.99s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (1575 > 512). Running this sequence through the model will result in indexing errors
  3%|▎         | 358/11314 [06:01<53:17,  3.43it/s]

In [None]:
topic_model.get_topic_info().head()

In [None]:
topic_model.get_topic_info().tail()

## Universal Sentence Encoder (USE)
The Universal Sentence Encoder encodes text into high-dimensional vectors, which are used here to embed the documents. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs.



In [3]:
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
topic_model = BERTopic(verbose=True, embedding_model=embedding_model).fit(docs)

100%|██████████| 11314/11314 [01:35<00:00, 118.60it/s]
2022-12-29 16:36:37,413 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:37:05,816 - BERTopic - Reduced dimensionality
2022-12-29 16:37:06,271 - BERTopic - Clustered reduced embeddings


In [4]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3743 | `-1_the_to_of_is` |
| 1 | 0 | 1048 | `0_he_team_game_season` |
| 2 | 1 | 886 | `1_car_bike_my_the` |
| 3 | 2 | 654 | `2_you_your_this_it` |
| 4 | 3 | 428 | `3_key_encryption_chip_clipper` |


In [5]:
topic_model.get_topic_info().tail()

| | Topic | Count | Name |
|---|---|---|---|
| 96 | 95 | 12 | `95_monitor_17_t560i_mag` |
| 97 | 96 | 12 | `96_dpy_color_colormap_colormaps` |
| 98 | 97 | 11 | `97_pmp_payne_kits_dec` |
| 99 | 98 | 11 | `98_aspects_split_graphics_ch` |
| 100 | 99 | 11 | `99_life_christianity_you_christian` |


## Gensim
For Gensim, BERTopic supports its `gensim.downloader` module. With it, we can download any word embedding model to be used in BERTopic. Note that Gensim is primarily used for word embedding models. These typically work best for short documents since the word embeddings are pooled. 

In [6]:
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')
topic_model = BERTopic(verbose=True, embedding_model=ft).fit(docs)



100%|██████████| 11314/11314 [00:06<00:00, 1727.67it/s]
2022-12-29 16:46:09,793 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:46:26,077 - BERTopic - Reduced dimensionality
2022-12-29 16:46:26,560 - BERTopic - Clustered reduced embeddings


In [7]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 47 | `-1_ax_max_g9v_b8f` |
| 1 | 0 | 10823 | `0_the_to_of_and` |
| 2 | 1 | 351 | `1_com_postmaster_keywords_ieee` |
| 3 | 2 | 37 | `2_the_of_or_and` |
| 4 | 3 | 36 | `3_it_dsl_n3jxp_chastity` |


In [8]:
topic_model.get_topic_info().tail()

| | Topic | Count | Name |
|---|---|---|---|
| 1 | 0 | 10823 | `0_the_to_of_and` |
| 2 | 1 | 351 | `1_com_postmaster_keywords_ieee` |
| 3 | 2 | 37 | `2_the_of_or_and` |
| 4 | 3 | 36 | `3_it_dsl_n3jxp_chastity` |
| 5 | 4 | 20 | `4_kk_of_the_to` |


# **Customization**
Over the last years, many new embedding models have been released that could be interesting to use as a backend in BERTopic. It is not always feasible to implement them all as there are simply too many to follow. 

To still let you use those embeddings, BERTopic offers several ways to plug them in while keeping its full functionality. 

Moreover, there are several customization options that give a bit more control over which embedding is used when. 

## Word + Document Embeddings
You might want to use different language models for creating document and word embeddings.
For example, while SentenceTransformers might be great at embedding sentences and documents, you
might prefer to use FastText to create the word embeddings.

In [9]:
from bertopic.backend import WordDocEmbedder
import gensim.downloader as api
from sentence_transformers import SentenceTransformer

# Word embedding model
ft = api.load('fasttext-wiki-news-subwords-300')

# Document embedding model
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Create a model that uses both language models and pass it through BERTopic
word_doc_embedder = WordDocEmbedder(embedding_model=distilbert, word_embedding_model=ft)
topic_model = BERTopic(verbose=True, embedding_model=word_doc_embedder).fit(docs)

2022-12-29 16:52:02,348 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:52:16,536 - BERTopic - Reduced dimensionality
2022-12-29 16:52:17,041 - BERTopic - Clustered reduced embeddings


In [10]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5574 | `-1_the_to_and_of` |
| 1 | 0 | 407 | `0_hockey_team_game_nhl` |
| 2 | 1 | 407 | `1_god_jesus_that_of` |
| 3 | 2 | 403 | `2_of_is_it_msg` |
| 4 | 3 | 386 | `3_he_year_game_the` |


In [13]:
topic_model.get_topic_info().tail()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2 | `-1____` |
| 1 | 0 | 10988 | `0_the_ax_to_of` |
| 2 | 1 | 280 | `1_keywords_follows_art_article` |
| 3 | 2 | 22 | `2_the_to_that_we` |
| 4 | 3 | 22 | `3_critus___` |
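
Conceptually, a combined word and document backend only needs to route two kinds of requests to two different models. The sketch below illustrates that routing with stand-in models. `DualEmbedder`, its two methods, and the two toy functions are names chosen for this illustration and are not BERTopic's actual implementation:

```python
class DualEmbedder:
    """Route word-level and document-level requests to separate models."""

    def __init__(self, document_model, word_model):
        self.document_model = document_model
        self.word_model = word_model

    def embed_documents(self, documents):
        return [self.document_model(doc) for doc in documents]

    def embed_words(self, words):
        return [self.word_model(word) for word in words]

# Stand-in "models": any callable that maps text to a vector would do here
def toy_document_model(text):
    return [float(len(text)), float(len(text.split()))]

def toy_word_model(word):
    return [float(len(word))]

embedder = DualEmbedder(toy_document_model, toy_word_model)
print(embedder.embed_documents(["a funky looking car"]))
print(embedder.embed_words(["car", "engine"]))
```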


## Custom Backend
If the backend or model you need is not among those currently supported, you can use the `BaseEmbedder`
class to create your own backend. Below, you will find an example of creating a SentenceTransformer backend for BERTopic:


In [11]:
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings 

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to bertopic
topic_model = BERTopic(embedding_model=custom_embedder).fit(docs)

2022-12-29 16:52:58,472 - BERTopic - Transformed documents to Embeddings
2022-12-29 16:53:13,287 - BERTopic - Reduced dimensionality
2022-12-29 16:53:13,737 - BERTopic - Clustered reduced embeddings


In [12]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2 | `-1____` |
| 1 | 0 | 10988 | `0_the_ax_to_of` |
| 2 | 1 | 280 | `1_keywords_follows_art_article` |
| 3 | 2 | 22 | `2_the_to_that_we` |
| 4 | 3 | 22 | `3_critus___` |
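
The contract a custom backend has to fulfil is visible in `embed` above: take a list of documents and return one fixed-length vector per document. As a dependency-free illustration of that contract, the hypothetical `HashingEmbedder` below buckets word hashes into a fixed-size count vector. It does not subclass `BaseEmbedder` and returns plain lists, so treat it as a sketch of the input and output shape only. A real backend would subclass `BaseEmbedder` and return a NumPy array, as in the SentenceTransformer example.

```python
import zlib

class HashingEmbedder:
    """Toy embedder: counts words per hash bucket (illustration only)."""

    def __init__(self, n_dims=8):
        self.n_dims = n_dims

    def embed(self, documents, verbose=False):
        embeddings = []
        for doc in documents:
            vector = [0.0] * self.n_dims
            for word in doc.lower().split():
                # crc32 is deterministic across runs, unlike built-in hash()
                bucket = zlib.crc32(word.encode("utf-8")) % self.n_dims
                vector[bucket] += 1.0
            embeddings.append(vector)
        return embeddings

vectors = HashingEmbedder(n_dims=8).embed(["the car is fast", "the car is red"])
print(len(vectors), len(vectors[0]))  # prints: 2 8
```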


## Custom Embeddings
You can use any embeddings that you have previously created. This is useful for speeding up the creation of topic models, and also when your language model does not fit any of the options above. 

Here, we will be using **TF-IDF** as the main embedder in BERTopic:

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

embeddings = TfidfVectorizer(min_df=5, stop_words="english").fit_transform(docs)
topic_model = BERTopic().fit(docs, embeddings)

2022-12-29 16:55:44,923 - BERTopic - Reduced dimensionality
2022-12-29 16:55:45,474 - BERTopic - Clustered reduced embeddings


In [15]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4511 | `-1_to_of_the_is` |
| 1 | 0 | 1000 | `0_team_game_he_games` |
| 2 | 1 | 330 | `1_woof_uhhhh_tesrt_critus` |
| 3 | 2 | 328 | `2_space_launch_nasa_satellite` |
| 4 | 3 | 254 | `3_israel_israeli_jews_arab` |


In [16]:
topic_model.get_topic_info().tail()

| | Topic | Count | Name |
|---|---|---|---|
| 139 | 138 | 10 | `138_satan_angels_freewill_metaphorical` |
| 140 | 139 | 10 | `139_tank_gas_zephyr_seat` |
| 141 | 140 | 10 | `140_ellipse_problem_offset_trimming` |
| 142 | 141 | 10 | `141_hi_cornerstone_gifs_anyone` |
| 143 | 142 | 10 | `142_whatever_injury_eye_motto` |
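
For intuition about what is passed as `embeddings` in the TF-IDF example, the sketch below computes a small TF-IDF matrix by hand: one row per document and one column per vocabulary term. It follows the smoothed IDF formula and L2 row normalization that scikit-learn uses by default, but it is a simplified illustration (whitespace tokenization, no `min_df`, no stop words), and `tfidf_matrix` is a hypothetical helper:

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows

vocabulary, rows = tfidf_matrix(["nasa launch space", "space shuttle", "hockey game"])
print(vocabulary)
print(len(rows), len(rows[0]))
```

Terms that occur in fewer documents receive a larger weight, so in the first row "nasa" outweighs "space", which also appears in the second document.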
