<a href="https://colab.research.google.com/github/bin-crypto-test/BERTopic/blob/master/BERTopic_Embedding_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial** - (Custom) Embedding Models in BERTopic
(last updated 26-04-2021)

In this tutorial we will go through the embedding models that can be used in BERTopic. Having the option to choose the embedding model allows you to leverage pre-trained embeddings that suit your use case. Moreover, it helps when creating a topic model with little data available.

## Embedding models
Embedding models are used for the representation of text, such as words, sentences and documents, in the form of real-valued vectors. They typically encode the semantic meaning of text. 
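
To make "encode the semantic meaning" a little more concrete, here is a minimal sketch using hand-made vectors. The values and the `toy_embeddings` name are illustrative assumptions, not the output of any model. The idea is that vectors of related texts point in similar directions, which cosine similarity measures:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 3-dimensional "embeddings" (illustrative values, not model output).
toy_embeddings = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "bible": [0.0, 0.1, 0.9],
}

sim_car_truck = cosine_similarity(toy_embeddings["car"], toy_embeddings["truck"])
sim_car_bible = cosine_similarity(toy_embeddings["car"], toy_embeddings["bible"])

# The related pair scores higher than the unrelated pair.
print(sim_car_truck > sim_car_bible)
```

Real embedding models produce vectors with hundreds of dimensions, but the comparison works the same way.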

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing BERTopic

We start by installing BERTopic from PyPI, together with all available backends:

In [None]:
%%capture
!pip install "bertopic[flair,gensim,spacy,use]"

**NOTE 1**: This may take a while as it needs to install spaCy, Torch, Gensim, USE, etc.

**NOTE 2**: Installing several backends at once can cause dependency conflicts, so it might be worthwhile to install only the one you want to experiment with.

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts. The code below loads its training subset.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='train',  remove=('headers', 'footers', 'quotes'))['data']

In [None]:
print(docs[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


# **Embedding Models**
In this section, we will go through all embedding models and backends that are supported in BERTopic.

## Sentence Transformers
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) and pass its name to BERTopic via `embedding_model`:

In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens").fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4630 | -1_can_your_will_any |
| 1 | 32 | 693 | 32_jesus_church_bible_we |
| 2 | 49 | 466 | 49_patients_health_medical_pain |
| 3 | 2 | 441 | 2_space_launch_orbit_lunar |
| 4 | 22 | 381 | 22_key_encryption_keys_encrypted |


Alternatively, we can instantiate a SentenceTransformer model ourselves, which lets us set its parameters (for example, the device it runs on):

In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")
topic_model = BERTopic(embedding_model=sentence_model).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4560 | -1_if_can_but_any |
| 1 | 5 | 934 | 5_team_game_hockey_season |
| 2 | 45 | 510 | 45_jesus_bible_christian_faith |
| 3 | 2 | 443 | 2_space_nasa_launch_orbit |
| 4 | 29 | 320 | 29_key_encryption_keys_encrypted |
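
Hard-coding `device="cuda"` fails on a runtime without a GPU. A minimal sketch of a fallback is shown below. The helper name `pick_device` is hypothetical, and the sketch assumes PyTorch is the library that decides whether a GPU is usable:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing can run on the GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The result could then be passed along as `SentenceTransformer(..., device=device)` in place of the fixed string.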


## Flair

Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

In [None]:
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4063 | -1_they_can_an_your |
| 1 | 16 | 957 | 16_his_we_bible_christian |
| 2 | 8 | 862 | 8_team_hockey_games_his |
| 3 | 23 | 670 | 23_car_bike_cars_had |
| 4 | 6 | 451 | 6_hiv_gordon_geb_chastity |


You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can use Flair to pool word embeddings into document embeddings. Under the hood, Flair simply averages all word embeddings in a document. We can then pass the pooled model to BERTopic in order to use those word embeddings as document embeddings:

In [None]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# 'crawl' loads FastText embeddings trained on web crawl data
crawl_embedding = WordEmbeddings('crawl')
document_crawl_embeddings = DocumentPoolEmbeddings([crawl_embedding])

topic_model = BERTopic(embedding_model=document_crawl_embeddings).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4892 | -1_do_any_like_just |
| 1 | 0 | 300 | 0_disapprove_disapproves_disarm_disarmed |
| 2 | 85 | 271 | 85_encryption_keys_government_secure |
| 3 | 148 | 183 | 148_jesus_god_christ_lord |
| 4 | 141 | 180 | 141_anyone_please_me_help |
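
The averaging step described above can be sketched in a few lines of plain Python. This is not Flair's implementation. The `mean_pool` helper, the toy vectors, and the zero-vector fallback for documents without known words are all assumptions made for illustration:

```python
def mean_pool(tokens, word_vectors):
    """Average the vectors of all tokens that appear in `word_vectors`."""
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(word_vectors.values())))
        return [0.0] * dim
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional word vectors (illustrative values only).
toy_word_vectors = {
    "fast": [1.0, 0.0],
    "car": [0.0, 1.0],
}

# "please" is not in the vocabulary, so it is ignored.
doc_vector = mean_pool("fast car please".split(), toy_word_vectors)
print(doc_vector)  # [0.5, 0.5]
```

Because every word contributes equally, word order and emphasis are lost in the pooled vector.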


## Spacy
spaCy has shown great promise over the last few years and is now gradually adopting transformer-based techniques, which makes it interesting to use in BERTopic.

We start by using a non-transformer-based model which we will have to download first:

In [None]:
!python -m spacy download en_core_web_md

Next, simply load the model into a spaCy `nlp` instance and pass it to BERTopic:

In [None]:
import spacy

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp, verbose=True).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4985 | -1_would_so_just_get |
| 1 | 84 | 692 | 84_game_team_he_season |
| 2 | 10 | 411 | 10_patients_disease_pain_doctor |
| 3 | 0 | 306 | 0_disarray_disassociate_disaster_disasterous |
| 4 | 88 | 173 | 88_what_why_think_hey |


We can also use spaCy's transformer-based models, which we have to download first as well:

In [None]:
!python -m spacy download en_core_web_trf

As before, we simply load the model and pass it to BERTopic. Note that we exclude several pipeline components because BERTopic does not use them.

In [None]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp, verbose=True, min_topic_size=5).fit(docs)

If you run into memory issues with spacy-transformer models, try:

In [None]:
import spacy
from thinc.api import set_gpu_allocator, require_gpu

# Route GPU memory allocation through PyTorch and select GPU 0 before loading the model
set_gpu_allocator("pytorch")
require_gpu(0)
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp, verbose=True).fit(docs)

## Universal Sentence Encoder (USE)
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here for embedding the documents. The model is trained and optimized for text longer than a single word, such as sentences, phrases or short paragraphs.

In [None]:
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
topic_model = BERTopic(verbose=True, embedding_model=embedding_model).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3714 | -1_any_about_edu_db |
| 1 | 3 | 1069 | 3_team_hockey_players_league |
| 2 | 8 | 494 | 8_nasa_orbit_satellite_lunar |
| 3 | 17 | 459 | 17_encryption_keys_privacy_security |
| 4 | 21 | 290 | 21_car_cars_ford_toyota |


## Gensim
For Gensim, BERTopic supports its `gensim.downloader` module. Here, we can download any word embedding model to be used in BERTopic. Note that Gensim is primarily used for word embedding models. This typically works best for short documents since the word embeddings are pooled.

In [None]:
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')
topic_model = BERTopic(verbose=True, embedding_model=ft).fit(docs)

# **Customization**
Over the last few years, many new embedding models have been released that could be interesting to use as a backend in BERTopic. It is not always feasible to implement them all, as there are simply too many to follow.

To still let you use those embeddings, BERTopic offers several ways to add them while keeping its full functionality.

Moreover, there are several customization options that give you more control over which embedding is used when.

## Word + Document Embeddings
You might want to use different language models for creating document and word embeddings.
For example, while SentenceTransformers might be great at embedding sentences and documents, you
might prefer to use FastText to create the word embeddings.

In [None]:
from bertopic.backend import WordDocEmbedder
import gensim.downloader as api
from sentence_transformers import SentenceTransformer

# Word embedding model
ft = api.load('fasttext-wiki-news-subwords-300')

# Document embedding model
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Create a model that uses both language models and pass it through BERTopic
word_doc_embedder = WordDocEmbedder(embedding_model=distilbert, word_embedding_model=ft)
topic_model = BERTopic(verbose=True, embedding_model=word_doc_embedder).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4980 | -1_any_do_one_with |
| 1 | 66 | 475 | 66_he_not_what_do |
| 2 | 3 | 427 | 3_space_shuttle_orbit_satellite |
| 3 | 12 | 427 | 12_game_team_players_last |
| 4 | 17 | 396 | 17_team_game_season_players |
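
Conceptually, combining two models comes down to routing documents to one model and individual words to another. The sketch below illustrates that routing with plain functions. `TwoModelEmbedder` and both toy models are hypothetical stand-ins, not BERTopic's `WordDocEmbedder`:

```python
class TwoModelEmbedder:
    """Toy stand-in that routes documents and words to different models."""

    def __init__(self, document_model, word_model):
        self.document_model = document_model
        self.word_model = word_model

    def embed_documents(self, documents):
        # Whole documents go to the document-level model.
        return [self.document_model(doc) for doc in documents]

    def embed_words(self, words):
        # Single words go to the word-level model.
        return [self.word_model(word) for word in words]

# Hypothetical toy "models": plain functions returning tiny vectors.
def toy_document_model(text):
    return [float(len(text.split())), float(len(text))]

def toy_word_model(word):
    return [float(len(word))]

embedder = TwoModelEmbedder(toy_document_model, toy_word_model)
doc_vectors = embedder.embed_documents(["space shuttle launch"])
word_vectors = embedder.embed_words(["space", "orbit"])
print(doc_vectors, word_vectors)
```

Note that the two models may produce vectors of different sizes, as they do here.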


## Custom Backend
If your backend or model is not among the ones currently available, you can use the `BaseEmbedder` class to create your own backend. Below, you will find an example of creating a SentenceTransformer backend for BERTopic:

In [None]:
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings 

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to bertopic
topic_model = BERTopic(embedding_model=custom_embedder).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5149 | -1_can_any_if_all |
| 1 | 19 | 459 | 19_02_pitching_baseball_03 |
| 2 | 17 | 444 | 17_hockey_nhl_players_leafs |
| 3 | 15 | 415 | 15_space_nasa_lunar_orbit |
| 4 | 4 | 395 | 4_key_chip_encryption_keys |
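
The same pattern works for anything that turns a list of documents into one fixed-length vector per document. As a rough sketch, the hypothetical `HashingEmbedder` below hashes word counts into a small number of buckets. It captures no semantics and is only meant to show the shape of the `embed` contract. The fallback stub is an assumption added so that the sketch also runs where BERTopic is not installed:

```python
import hashlib
import numpy as np

try:
    from bertopic.backend import BaseEmbedder
except ImportError:
    # Fallback stub so the sketch also runs where BERTopic is not installed.
    class BaseEmbedder:
        def __init__(self):
            pass

class HashingEmbedder(BaseEmbedder):
    """Hypothetical backend: bag-of-words counts hashed into `dim` buckets."""

    def __init__(self, dim=16):
        super().__init__()
        self.dim = dim

    def embed(self, documents, verbose=False):
        # One row per document, one column per hash bucket.
        embeddings = np.zeros((len(documents), self.dim))
        for row, doc in enumerate(documents):
            for token in doc.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
                embeddings[row, int(digest, 16) % self.dim] += 1.0
        return embeddings

vectors = HashingEmbedder(dim=16).embed(
    ["the car is fast", "the car is fast", "bible study"]
)
print(vectors.shape)  # (3, 16)
```

An instance could be handed to BERTopic through `embedding_model` in the same way as `custom_embedder` above.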


## Custom Embeddings
You can use any embeddings that you have previously created. This is useful for speeding up repeated topic modeling runs, and also when your language model does not fit any of the options above.

Here, we will be using **TF-IDF** as the main embedder in BERTopic:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-compute a sparse TF-IDF matrix with one row per document
embeddings = TfidfVectorizer(min_df=5, stop_words="english").fit_transform(docs)
topic_model = BERTopic().fit(docs, embeddings)