# Tutorial: Contextualized Topic Models with Wikipedia Documents

(last updated 5-11-2020)

In this tutorial, we are going to use contextualized topic modeling to get topics out of a collection of articles that you will upload here.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents.

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://travis-ci.com/MilaNLProc/contextualized-topic-models](https://travis-ci.com/MilaNLProc/contextualized-topic-models.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)




# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Data

We are going to need some data. You should upload a file with one document per line (the run shown below instead uses a CSV file whose `Message` column holds the documents). We assume you haven't run any preprocessing script.

However, if you want to test the model first without uploading your own data, you can simply use the test file provided below:

In [None]:
# !wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt

In [None]:
# !head -n 1 dbpedia_sample_abstract_20k_unprep.txt

In [None]:
from google.colab import files 
uploaded = files.upload() 

Saving COVID_1year_group_3000.csv to COVID_1year_group_3000.csv


In [None]:
text_file = "COVID_1year_group_3000.csv" # EDIT THIS WITH THE FILE YOU UPLOAD

# Installing Contextualized Topic Models

Now we install the contextualized topic models library:

In [None]:
!pip install contextualized-topic-models==1.7.0
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp36-cp36m-linux_x86_64.whl (708.0MB)
     |████████████████████████████████| 708.0MB 25kB/s 
Collecting torchvision==0.7.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.7.0%2Bcu101-cp36-cp36m-linux_x86_64.whl (5.9MB)
     |████████████████████████████████| 5.9MB 6.3MB/s 
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.6.0
    Uninstalling torch-1.6.0:
      Successfully uninstalled torch-1.6.0
  Found existing installation: torchvision 0.7.0
    Uninstalling torchvision-0.7.0:
      Successfully uninstalled torchvision-0.7.0
Successfully installed torch-1.6.0+cu101 torchvision-0.7.0+cu101


## Importing what we need

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list, QuickText
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import common_texts
from gensim.models import ldamodel 
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import os
import pickle

## Preprocessing

Let's pass our file with preprocessed data to our text handler object. This object takes care of creating the bag of words (BoW) for you.

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words. We might also want to keep only the most frequent words in the BoW, since too many words might not help.
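As a rough illustration of those two steps, the sketch below strips punctuation from three made-up sentences and keeps only the most frequent words. The helper name `toy_bow_text` and the sample sentences are hypothetical and exist only for this example; the notebook's own preprocessing relies on `CountVectorizer` instead.

```python
import string
from collections import Counter

def toy_bow_text(docs, vocabulary_size):
    """Lowercase, replace punctuation with spaces, keep the top-N words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = [doc.lower().translate(table).split() for doc in docs]
    # Count every word across the whole collection
    counts = Counter(word for words in cleaned for word in words)
    vocabulary = {word for word, _ in counts.most_common(vocabulary_size)}
    # Rebuild each document using only in-vocabulary words
    return [" ".join(w for w in words if w in vocabulary) for words in cleaned]

# Hypothetical sample documents
docs = ["Masks, masks, masks!", "Wear masks; stay home.", "Stay home?"]
print(toy_bow_text(docs, vocabulary_size=3))
# ['masks masks masks', 'masks stay home', 'stay home']
```

The rare word "wear" falls outside the three-word vocabulary and is dropped, which is the same effect a small `vocabulary_size` has on the real data.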

In [None]:
import pandas as pd


In [None]:
nltk.download('stopwords')
documents = pd.read_csv(text_file)['Message'][:3000]
# documents = [line.strip() for line in open(text_file, encoding="utf-8").readlines()][:3000]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
documents

0       Remember Danielle DiCenso, the nurse age 33 wh...
1       Our Father Who art in heaven, Hallowed be his ...
2                              Death liability waiver. :(
3       The reason Dr. Fauci, our leading infectious d...
4       This is my brother Rob He let his wife cut his...
                              ...                        
2995    How is everyone?‼‼ ️ This is not fraud. 😍 I'am...
2996    I am pissed, I am so very angry. Why? Let me t...
2997                                                   Up
2998                                   More good news 🙏🙏🙏
2999    Good Evening All, I don't have the Time to Rea...
Name: Message, Length: 3000, dtype: object

In [None]:
# # news_df['clean_doc'].apply(lambda x: str(x).lower())
# # doc.lower() for doc in preprocessed_docs_tmp
# documents.apply(lambda x: str(x).lower())

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import string
from nltk.corpus import stopwords as stop_words
# English stopwords, plus URL fragments and corpus-specific terms
stopwords = set(stop_words.words("english"))
stopwords.update(["html","com","https","www","fwww", "dhttps","flogin","utm", "fnext","cnn","fox","foxnews","covid","coronavirus"])
vocabulary_size = 2000  # maximum number of words kept in the BoW vocabulary

# Lowercase, replace punctuation with spaces, then drop stopwords and words of 3 characters or fewer
preprocessed_docs_tmp = documents.apply(lambda x: str(x).lower())
preprocessed_docs_tmp = [doc.translate(
    str.maketrans(string.punctuation, ' ' * len(string.punctuation))) for doc in preprocessed_docs_tmp]
preprocessed_docs_tmp = [' '.join([w for w in doc.split() if len(w) > 3 and w not in stopwords])
                      for doc in preprocessed_docs_tmp]

# Keep only the `vocabulary_size` most frequent words
vectorizer = CountVectorizer(max_features=vocabulary_size, token_pattern=r'\b[a-zA-Z]{2,}\b')
vectorizer.fit_transform(preprocessed_docs_tmp)
vocabulary = set(vectorizer.get_feature_names())
preprocessed_docs_tmp = [' '.join([w for w in doc.split() if w in vocabulary])
                          for doc in preprocessed_docs_tmp]

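# Keep only documents that are non-empty after preprocessing, and record
# their original row positions so predictions can be mapped back later.
preprocessed_docs, unpreprocessed_docs = [], []
index = []
for i, doc in enumerate(preprocessed_docs_tmp):
  if len(doc) > 0:
    index.append(i)
    preprocessed_docs.append(doc)
    unpreprocessed_docs.append(documents[i])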

In [None]:
preprocessed_documents, unpreprocessed_corpus, vocab = preprocessed_docs, unpreprocessed_docs, list(vocabulary)

In [None]:
# nltk.download('stopwords')

# documents = [line.strip() for line in open(text_file, encoding="utf-8").readlines()]
# sp = WhiteSpacePreprocessing(documents, "english")
# preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

You might want to pickle the **training_dataset** object to avoid recomputing the BoW multiple times.

In [None]:
qt = QuickText("bert-base-nli-mean-tokens",
            text_for_bert=unpreprocessed_corpus,
            text_for_bow=preprocessed_documents)

In [None]:
training_dataset = qt.load_dataset()

100%|██████████| 405M/405M [00:13<00:00, 31.1MB/s]
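The suggestion to pickle `training_dataset` could look roughly like the sketch below. The helper `load_or_build`, the file name, and the toy stand-in dataset are assumptions made for illustration; in the notebook you would pass `qt.load_dataset` as the build function, provided the dataset object pickles cleanly in your environment.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    """Return the object pickled at `path`; build and cache it if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Toy stand-in for an expensive step such as qt.load_dataset (hypothetical data)
def build_toy_dataset():
    return {"bow": [[1, 0], [0, 2]]}

cache_path = os.path.join(tempfile.mkdtemp(), "training_dataset.pkl")
first = load_or_build(cache_path, build_toy_dataset)   # builds and saves to disk
second = load_or_build(cache_path, build_toy_dataset)  # loads the cached copy

# In the notebook (assumed usage):
# training_dataset = load_or_build("training_dataset.pkl", qt.load_dataset)
```

Delete the cached file whenever the preprocessing or the documents change, otherwise a stale dataset will be loaded.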






## Training our Contextualized Topic Model

Finally, we can fit our new topic model. We will ask the model to find 10 topics in our collection (the `n_components` parameter of the CTM object).

In [None]:
ctm = CombinedTM(input_size=len(qt.vocab), bert_input_size=768, n_components=10, num_epochs=1000)

ctm.fit(training_dataset) 

# Topics

After training, it is time to look at our topics. We can use the `get_topic_lists` function to get them; it also accepts a parameter that lets you select how many words you want to see for each topic.

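If you look at the topics, most of them should be interpretable and representative of the collection you uploaded. With the Wikipedia sample file they reflect general knowledge; in the run shown here, which uses the COVID messages file, they include themes such as politics, prayer, and public health.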

In [None]:
print(ctm.get_topic_lists(10))

[['successful', 'exercise', 'thanksgiving', 'copied', 'decisions', 'wore', 'sound', 'enemy', 'onto', 'decades'], ['trump', 'government', 'country', 'president', 'vaccine', 'gates', 'media', 'african', 'lockdown', 'billion'], ['sing', 'join', 'battle', 'sharing', 'welsh', 'lost', 'joining', 'anthem', 'tomorrow', 'wales'], ['jesus', 'pray', 'amen', 'lord', 'holy', 'mary', 'heart', 'love', 'mother', 'mercy'], ['million', 'share', 'name', 'fraud', 'nombre', 'para', 'organization', 'register', 'post', 'group'], ['health', 'class', 'public', 'department', 'employees', 'quarantine', 'center', 'said', 'medical', 'office'], ['successful', 'thanksgiving', 'exercise', 'decisions', 'copied', 'wore', 'sound', 'rise', 'realize', 'experiencing'], ['people', 'virus', 'patients', 'health', 'also', 'disease', 'like', 'many', 'deaths', 'care'], ['time', 'home', 'work', 'every', 'would', 'please', 'kids', 'know', 'make', 'need'], ['please', 'hospital', 'mask', 'hours', 'still', 'need', 'phone', 'room', 's
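Some topics in a run can end up nearly identical. A quick way to spot this is to compare the word sets of every topic pair, as in the sketch below. The helper `topic_overlaps` and the 0.5 threshold are assumptions for illustration; this is a plain Jaccard overlap, not the library's `InvertedRBO` measure. The three sample topics are copied from the output above.

```python
from itertools import combinations

def topic_overlaps(topics, threshold=0.5):
    """Return (i, j, jaccard) for topic pairs whose word sets overlap heavily."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(topics), 2):
        a_set, b_set = set(a), set(b)
        jaccard = len(a_set & b_set) / len(a_set | b_set)
        if jaccard >= threshold:
            pairs.append((i, j, round(jaccard, 2)))
    return pairs

# Three of the topics printed above (positions here are 0, 1, 2 within this list)
topics = [
    ['successful', 'exercise', 'thanksgiving', 'copied', 'decisions',
     'wore', 'sound', 'enemy', 'onto', 'decades'],
    ['jesus', 'pray', 'amen', 'lord', 'holy',
     'mary', 'heart', 'love', 'mother', 'mercy'],
    ['successful', 'thanksgiving', 'exercise', 'decisions', 'copied',
     'wore', 'sound', 'rise', 'realize', 'experiencing'],
]
print(topic_overlaps(topics))
# [(0, 2, 0.54)]
```

In the notebook you could pass `ctm.get_topic_lists(10)` directly. Many flagged pairs may suggest lowering the number of topics.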

### Let's find our documents' topics

OK, now we can take a document and see which topic has been assigned to it. Results will obviously change depending on the documents you are using. For example, let's predict the topic of the first preprocessed document.

In [None]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=20) # get all the topic predictions

In [None]:
topics_predictions.shape

(2860, 10)

In [None]:
preprocessed_documents[-1] # text of the last preprocessed document (index -1); the prediction below uses index 0

'good evening time read people taking virus another positive case area lost person needed fight behalf members wish family soul rest peace clinical trials unit unit director prevention research unit south african medical research council south africa trials working wife passed away london'

In [None]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [None]:
ctm.get_topic_lists(5)[topic_number] # top 5 words of the topic predicted for the first document

['sing', 'join', 'battle', 'sharing', 'welsh']
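The same `argmax` idea extends to every document at once. The sketch below uses a small made-up matrix as a stand-in for `topics_predictions` (one row per document, one column per topic); the variable names are hypothetical.

```python
import numpy as np

# Toy stand-in for topics_predictions: 3 documents x 4 topics
toy_predictions = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# Most probable topic id for each document (argmax along the topic axis)
dominant_topics = toy_predictions.argmax(axis=1)

# Number of documents assigned to each topic, including empty topics
topic_sizes = np.bincount(dominant_topics, minlength=toy_predictions.shape[1])

print(dominant_topics)  # [1 0 3]
print(topic_sizes)      # [1 1 0 1]
```

Applied to the real `topics_predictions`, `dominant_topics[k]` is the topic of the k-th preprocessed document, so it lines up with `preprocessed_documents[k]`.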

In [None]:
import pandas as pd
from google.colab import files
result_df = pd.DataFrame(topics_predictions).assign(index=index)
result_df.to_csv('topics_predictions.csv') 
files.download('topics_predictions.csv')

