# BERTopic - Tutorial
We start by installing BERTopic from PyPI, together with the Vietnamese NLP packages (VnCoreNLP and underthesea) used for word segmentation, before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!
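
One way to confirm that a GPU runtime is active before running the rest of the notebook is to look for the NVIDIA driver tool. The following is a minimal sketch using only the standard library. It assumes an NVIDIA GPU whose driver ships `nvidia-smi`, and the helper name `gpu_runtime_available` is made up for this example.

```python
import shutil
import subprocess

def gpu_runtime_available():
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU runtime detected:", gpu_runtime_available())
```

If this prints `False`, switch the runtime type before computing embeddings.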

In [80]:
!pip install bertopic
!pip install vncorenlp
!pip install underthesea
!pip install tqdm



In [1]:
!mkdir -p models/wordsegmenter
!wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
!wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
!wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
!mv vi-vocab models/wordsegmenter/
!mv wordsegmenter.rdr models/wordsegmenter/

Collecting vncorenlp
  Downloading vncorenlp-1.0.3.tar.gz (2.6 MB)
     |████████████████████████████████| 2.6 MB 10.2 MB/s eta 0:00:01
Building wheels for collected packages: vncorenlp
  Building wheel for vncorenlp (setup.py) ... done
  Created wheel for vncorenlp: filename=vncorenlp-1.0.3-py3-none-any.whl size=2645935 sha256=418d59556343c25165fb7e10706ef6db70320321fe4c30304d7f6daeb5b2345a
  Stored in directory: /root/.cache/pip/wheels/3c/e9/86/706911c12e307aeb9a2702399f0dad38d36f1d6f9dde8af35e
Successfully built vncorenlp
Installing collected packages: vncorenlp
Successfully installed vncorenlp-1.0.3
--2020-10-09 02:20:32--  https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awai
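
The download cell leaves the jar in the working directory and moves the two word-segmenter files into `models/wordsegmenter/`. A small check like the one below can confirm that layout before VnCoreNLP is started. This is only a sketch: the helper name `missing_vncorenlp_files` is hypothetical, and the file list simply mirrors the `wget` and `mv` commands above.

```python
from pathlib import Path

# Files the download cell is expected to leave behind,
# relative to the notebook's working directory
EXPECTED_FILES = [
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
]

def missing_vncorenlp_files(base_dir="."):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]

missing = missing_vncorenlp_files()
print("All files present" if not missing else f"Missing: {missing}")
```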

In [2]:
import os       # needed to set the JAVA_HOME environment variable
def install_java():
  !apt update
  !apt install -y openjdk-11-jre-headless    # install OpenJDK 11 (JRE only)
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"     # must match the installed Java version
  !java -version       # check the Java version
install_java()

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [121 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [233 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7906 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 8387 kB in 1s (5665 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
7 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ca-certificates-java dbus java-common libapparmor1 libasound2
  libasound2-data libavahi
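
Since VnCoreNLP runs on the JVM, it can help to verify the Java setup from Python before starting it. The sketch below only reads the environment. The function name `java_setup_report` is an assumption of this example, not part of any library.

```python
import os
import shutil

def java_setup_report():
    """Summarize whether a `java` executable and a valid JAVA_HOME are visible."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        "java_on_path": shutil.which("java") is not None,
        "java_home": java_home,
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
    }

print(java_setup_report())
```

A `java_home_exists` value of `False` usually means that `JAVA_HOME` points at a directory for a different Java version than the one installed.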

# Prepare data
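The original BERTopic tutorial uses the famous 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics, and the English topic words in the sample outputs further down appear to stem from that dataset. This notebook instead downloads a plain-text file with one post per line (2,332,325 lines in the run shown below) and keeps each line as one document.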

In [82]:
from tqdm import tqdm
import urllib.request
url = "http://de3.qmanga.com/posts.txt"
file = urllib.request.urlopen(url)
docs = []
for line in tqdm(file):
    decoded_line = line.decode("utf-8")
    docs.append(decoded_line)
print(len(docs))

2332325it [00:08, 264650.12it/s]


2332325
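
The loop above keeps the trailing newline of every line and also stores empty lines as documents. If that is not wanted, a generator along the following lines could be used instead. It is a sketch under assumptions: `read_posts` is a hypothetical helper, the in-memory `BytesIO` stands in for the HTTP response, and replacing undecodable bytes rather than failing is a design choice you may not want.

```python
import io

def read_posts(stream, encoding="utf-8"):
    """Yield one stripped post per non-empty line of a binary stream."""
    for raw_line in stream:
        text = raw_line.decode(encoding, errors="replace").strip()
        if text:
            yield text

# Stand-in for the HTTP response: any iterable of byte lines works the same way
sample = io.BytesIO("bài viết một\n\nbài viết hai\n".encode("utf-8"))
posts = list(read_posts(sample))
print(len(posts), posts)
```

With the real data, `list(read_posts(urllib.request.urlopen(url)))` would take the place of the manual loop.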


In [21]:
from bertopic import BERTopic
from vncorenlp import VnCoreNLP
import re
from underthesea import word_tokenize, sent_tokenize

rdrsegmenter = VnCoreNLP("./VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m')
print("READY")

READY
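
With the `wseg` annotator, `rdrsegmenter.tokenize(text)` returns a list of sentences, each a list of tokens in which the syllables of a multi-syllable word are joined by underscores. The sketch below shows how such a nested list can be turned back into text with one sentence per line. The sample data is hand-written for illustration and was not produced by running VnCoreNLP, and `join_segmented` is a made-up helper name.

```python
def join_segmented(sentences):
    """Join tokens with spaces and sentences with newlines."""
    return "\n".join(" ".join(tokens) for tokens in sentences)

# Illustrative structure only: a list of sentences, each a list of tokens
segmented = [
    ["Tôi", "là", "sinh_viên", "."],
    ["Tôi", "sống", "ở", "Hà_Nội", "."],
]
print(join_segmented(segmented))
```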


In [84]:
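import concurrent.futures
import re
from tqdm import tqdm

def clean_doc(doc):
    # Turn literal "\n" sequences into real newlines, then strip every character
    # that is not a word character, whitespace or basic punctuation.
    # flags must be passed by keyword: the fourth positional argument of re.sub is count.
    return re.sub(r'[^\w\s.,!;\n]+', '', doc.replace("\\n", "\n"), flags=re.UNICODE)

def process_doc(doc):
    # Word-segment each sentence with underthesea, one sentence per line
    # vncore = ("\n".join(map(lambda sentence: " ".join(sentence), rdrsegmenter.tokenize(doc))))
    return "\n".join(word_tokenize(sen, format="text") for sen in sent_tokenize(doc))

def collect(batch, results):
    # Gather results in submission order; failed documents are reported and skipped
    for index, future in batch:
        try:
            results.append(future.result())
        except Exception as exc:
            print('document %d generated an exception: %s' % (index, exc))

tokenized_docs = []
with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
    batch = []
    for index, doc in enumerate(tqdm(docs)):
        batch.append((index, executor.submit(process_doc, clean_doc(doc))))
        if len(batch) == 100:
            collect(batch, tokenized_docs)
            batch = []
    collect(batch, tokenized_docs)  # flush the final, partially filled batch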

100%|██████████| 2332325/2332325 [1:47:30<00:00, 361.55it/s]  
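
The cleaning step is easy to check in isolation. The sketch below applies the same character filter to a short sample and also illustrates a common pitfall: the fourth positional parameter of `re.sub` is `count`, so a flag passed in that position limits the number of replacements instead of changing how the pattern matches. The sample strings are invented for this example.

```python
import re

PATTERN = r"[^\w\s.,!;\n]+"

def clean_doc(doc):
    """Convert literal "\\n" sequences to newlines and drop unwanted characters."""
    text = doc.replace("\\n", "\n")
    return re.sub(PATTERN, "", text, flags=re.UNICODE)

print(repr(clean_doc("Hello!!! #sale @shop\\nline (two)")))

# Pitfall: used as a count, re.UNICODE (integer value 32) caps the replacements
noisy = "#a" * 40
capped = re.sub(PATTERN, "", noisy, count=int(re.UNICODE))
print(capped.count("#"))  # 8 of the 40 '#' characters survive
```

Passing `flags=` by keyword avoids the problem.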


# Create Topics
The original tutorial uses the **distilbert-base-nli-mean-tokens** model, as it is the recommended model for creating sentence embeddings according to the authors of the [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html) package. That model is trained on English text, so the cell below uses the multilingual **xlm-r-100langs-bert-base-nli-stsb-mean-tokens** model instead. You can use any of the pre-trained models available in the package.
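
The `mean-tokens` suffix in these model names refers to mean pooling: the sentence embedding is the average of the token embeddings, with padding positions left out. The following pure-Python sketch illustrates the idea on toy numbers. It is not the library's implementation, and `mean_pool` is a name chosen for this example.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # third row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```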

In [None]:
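# print(tokenized_docs[10:20])
# The embedding model name is passed as the first positional argument
model = BERTopic("xlm-r-100langs-bert-base-nli-stsb-mean-tokens", verbose=True)
# Note: this fits on the raw docs; pass tokenized_docs instead to use the word-segmented text
topics = model.fit_transform(docs)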

100%|██████████| 1.01G/1.01G [00:17<00:00, 57.4MB/s]
2020-10-09 06:22:22,867 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model


In [4]:
# Get most frequent topics
model.get_topics_freq()[:5]

| | Topic | Count |
|---|---|---|
| 0 | -1 | 3815 |
| 1 | 36 | 2322 |
| 2 | 10 | 731 |
| 3 | 17 | 484 |
| 4 | 32 | 457 |
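
Topic `-1` collects the documents that the clustering step treats as outliers, so it is often the largest entry. Given one topic id per document, the frequency table and the share of outliers can be computed as sketched below. The `labels` list is made-up sample data and both helper names are hypothetical.

```python
from collections import Counter

def topic_frequencies(topic_ids):
    """Return (topic, count) pairs, most frequent first."""
    return Counter(topic_ids).most_common()

def outlier_share(topic_ids):
    """Fraction of documents assigned to the outlier topic -1."""
    if not topic_ids:
        return 0.0
    return sum(1 for t in topic_ids if t == -1) / len(topic_ids)

labels = [-1, 36, 36, 10, -1, 36, 17, -1, -1, 36]
print(topic_frequencies(labels))
print(outlier_share(labels))
```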


In [14]:
# Get a topic 
model.get_topic(8)[:10]

[('gay', 0.0378408054504732),
 ('cramer', 0.035105007037722194),
 ('optilink', 0.029938059903937837),
 ('homosexual', 0.028372928346573543),
 ('men', 0.027332436712704754),
 ('sexual', 0.024208844760479513),
 ('clayton', 0.020979311369394887),
 ('homosexuals', 0.01782175249550026),
 ('male', 0.016996039406207666),
 ('promiscuous', 0.01671068613901953)]
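
`get_topic` returns a list of `(word, score)` tuples sorted by score, as the output above shows. A short label for a topic can be built from the first few words, for example as follows. The helper `topic_label` is an assumption of this sketch, and the sample values are copied from a `get_topic` output in this notebook.

```python
def topic_label(word_scores, n_words=3, sep="_"):
    """Build a short label from the top-scoring words of a topic."""
    return sep.join(word for word, _score in word_scores[:n_words])

topic = [
    ("baseball", 0.01534818753609341),
    ("players", 0.01113384693242755),
    ("cubs", 0.010651317673247482),
    ("game", 0.01064425481072388),
]
print(topic_label(topic))
```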

## Model serialization
The model and its internal settings, including the fitted UMAP and HDBSCAN models, can easily be saved. Note that the documents and embeddings are not saved.

In [None]:
# Save model
model.save("my_model")	

In [None]:
# Load model
my_model = BERTopic.load("my_model")	

In [None]:
my_model.get_topic(4)[:10]

[('baseball', 0.01534818753609341),
 ('players', 0.01113384693242755),
 ('cubs', 0.010651317673247482),
 ('game', 0.01064425481072388),
 ('braves', 0.010439585241772109),
 ('pitching', 0.009477156669897367),
 ('games', 0.009166144809830891),
 ('runs', 0.009154570979537589),
 ('year', 0.008982491530594413),
 ('team', 0.00894693731063402)]
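
Because the documents are not stored with the model, it can be useful to keep them, together with their topic assignments, in a separate file. The sketch below does this with JSON from the standard library. It assumes you have one integer topic id per document. The function names and the file name `assignments.json` are invented for this example.

```python
import json
import tempfile
from pathlib import Path

def save_assignments(path, docs, topics):
    """Write documents and their topic ids to a JSON file."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    records = [{"doc": d, "topic": int(t)} for d, t in zip(docs, topics)]
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")

def load_assignments(path):
    """Read the file back as (docs, topics)."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [r["doc"] for r in records], [r["topic"] for r in records]

# Round trip in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "assignments.json"
    save_assignments(target, ["first post", "second post"], [4, -1])
    restored = load_assignments(target)

print(restored)
```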