# BERTopic - Tutorial
We start by installing BERTopic from PyPI before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!
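
To confirm that a GPU runtime is actually active, a quick check along the following lines may help. This is a minimal sketch that assumes the embeddings are computed with PyTorch (which sentence-transformers builds on); the helper name `gpu_available` is hypothetical and not part of BERTopic.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # Hypothetical helper: falls back to False when PyTorch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False`, the embeddings will be computed on the CPU, which still works but is considerably slower.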

In [6]:
!pip install bertopic



# Prepare data
For this example, we use the well-known 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics. Here we load only the training subset.

In [7]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='train')['data']

# Create Topics
We use the **distilbert-base-nli-mean-tokens** model, which the authors of the [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html) package recommend for creating sentence embeddings. However, you can use any of the pre-trained embedding models available in that package.

In [8]:
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

2020-10-13 00:30:58,729 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model
2020-10-13 00:33:14,912 - BERTopic - Transformed documents to Embeddings
INFO:BERTopic:Transformed documents to Embeddings
2020-10-13 00:33:31,872 - BERTopic - Reduced dimensionality with UMAP
INFO:BERTopic:Reduced dimensionality with UMAP
2020-10-13 00:33:32,430 - BERTopic - Clustered UMAP embeddings with HDBSCAN
INFO:BERTopic:Clustered UMAP embeddings with HDBSCAN


In [9]:
# Get most frequent topics
model.get_topics_freq()[:5]

|   | Topic | Count |
|---|-------|-------|
| 0 | -1    | 4996  |
| 1 | 19    | 1019  |
| 2 | 27    | 671   |
| 3 | 7     | 582   |
| 4 | 32    | 574   |
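
Topic `-1` collects the documents that HDBSCAN could not assign to any cluster, so it is best read as an outlier bucket rather than a real topic. To illustrate what such a frequency table represents, here is a minimal sketch that counts topic labels by hand. The helper `topic_frequencies` and the toy labels are made up for illustration; they are not BERTopic's implementation or real model output.

```python
from collections import Counter


def topic_frequencies(topics):
    """Count documents per topic label, most frequent first."""
    return Counter(topics).most_common()


# Toy labels (one per document), not real model output.
# -1 marks documents that were not assigned to any cluster.
toy_topics = [-1, 19, 19, -1, 27, -1]

freq = topic_frequencies(toy_topics)
print(freq)  # [(-1, 3), (19, 2), (27, 1)]

# Share of documents that ended up in the outlier bucket.
outlier_share = dict(freq).get(-1, 0) / len(toy_topics)
print(outlier_share)  # 0.5
```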


In [10]:
# Get a topic 
model.get_topic(22)[:10]

[('food', 0.02923392521819248),
 ('msg', 0.02481089911162462),
 ('foods', 0.020275559472531353),
 ('candida', 0.01946387751674543),
 ('aspartame', 0.01808243022495726),
 ('yeast', 0.015392662727587922),
 ('diet', 0.015146299226572546),
 ('crohn', 0.014936783654283775),
 ('meat', 0.014771013204004462),
 ('kodak', 0.013566318087196327)]
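
Each topic is returned as a list of `(word, score)` tuples, ordered from the highest score down. When only a short label is needed, the words can be pulled out with a small helper. The sketch below is one way to do it; `top_words` is a hypothetical name, and the sample data is copied from the first five entries of the output above.

```python
def top_words(topic, n=5):
    """Return the first n words of a topic given as (word, score) tuples."""
    return [word for word, _score in topic[:n]]


# First five entries of topic 22, copied from the output above.
topic_22 = [
    ("food", 0.02923392521819248),
    ("msg", 0.02481089911162462),
    ("foods", 0.020275559472531353),
    ("candida", 0.01946387751674543),
    ("aspartame", 0.01808243022495726),
]

print(", ".join(top_words(topic_22)))  # food, msg, foods, candida, aspartame
```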

## Model serialization
The model and its internal settings can easily be saved. Note that the documents and embeddings are not saved; the UMAP and HDBSCAN models, however, are saved along with it.

In [11]:
# Save model
model.save("my_model")

In [12]:
# Load model
my_model = BERTopic.load("my_model")

In [13]:
my_model.get_topic(4)[:10]

[('vs', 0.05358789990328188),
 ('chi', 0.04033116302824787),
 ('bos', 0.03707226540353622),
 ('cal', 0.03669518701060147),
 ('finals', 0.02928099281456523),
 ('det', 0.02882054974351385),
 ('pitt', 0.02795541930269434),
 ('playoff', 0.02299155071029984),
 ('round', 0.02113515613539666),
 ('pool', 0.020496358953549353)]
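
A simple way to gain confidence in the round trip is to compare a topic from the original model with the same topic from the reloaded one. The following sketch assumes only that both objects expose the `get_topic` method used above; `topics_match` and the stand-in class `_StubModel` are hypothetical and exist purely so the example runs without a trained model.

```python
def topics_match(original, reloaded, topic_id, n_words=10):
    """Check that both models return the same top words and scores for a topic."""
    return (
        original.get_topic(topic_id)[:n_words]
        == reloaded.get_topic(topic_id)[:n_words]
    )


class _StubModel:
    """Stand-in for a fitted model; only mimics get_topic for this demo."""

    def __init__(self, topics):
        self._topics = topics

    def get_topic(self, topic_id):
        return self._topics[topic_id]


# Toy data: two stubs holding identical topics, one holding a different one.
saved = _StubModel({4: [("vs", 0.05), ("chi", 0.04)]})
loaded = _StubModel({4: [("vs", 0.05), ("chi", 0.04)]})
other = _StubModel({4: [("food", 0.03), ("msg", 0.02)]})

print(topics_match(saved, loaded, 4))  # True
print(topics_match(saved, other, 4))   # False
```

With the objects from this notebook, the equivalent call would be `topics_match(model, my_model, 4)`.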