# **Contextualized topic modeling to get topics out of a collection of Wikipedia abstracts**

**Topic Models**
Topic models allow us to discover latent topics in a collection of documents in a completely unsupervised way.

**Contextualized Topic Models**
What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents.

## **References:**
* https://colab.research.google.com/drive/1euxW3ya3_PX6Kj1tnCNrIQ7pjZIODsB6?usp=sharing

## **Dataset:**

We download some abstracts from Wikipedia and use them to run the topic modeling pipeline.

In [1]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_prep.txt

In [2]:
%%capture
# Install the contextualized topic models library (%%capture must be the first line of the cell)
!pip install contextualized-topic-models==1.8.1
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

## **Installing TensorBoard**


In [25]:
!pip install tensorboard



In [23]:
# from keras.callbacks import TensorBoard
# from time import time

# # Create a TensorBoard instance with the path to the logs directory
# tensorboard = TensorBoard(log_dir='logs/{}'.format(time()))

In [26]:
from torch.utils.tensorboard import SummaryWriter
tb = SummaryWriter()

## **Importing necessary libraries**

In [3]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file, TopicModelDataPreparation
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.models import ldamodel 
import os
import numpy as np
import pickle

Reading our data files and storing the documents as lists of strings:

In [4]:
with open("dbpedia_sample_abstract_20k_prep.txt", 'r') as fr_prep:
  text_training_preprocessed = [line.strip() for line in fr_prep.readlines()]

with open("dbpedia_sample_abstract_20k_unprep.txt", 'r') as fr_unprep:
  text_training_not_preprocessed = [line.strip() for line in fr_unprep.readlines()]

NOTE: The two lists of documents must have the same length, and the index of each unpreprocessed document must match the index of its preprocessed version.

In [5]:
assert len(text_training_preprocessed) == len(text_training_not_preprocessed)

print(text_training_not_preprocessed[0])
print(text_training_preprocessed[0])

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry


## **Splitting the documents into training and testing**

In [6]:
training_bow_documents = text_training_preprocessed[0:15000]
training_contextual_document = text_training_not_preprocessed[0:15000]

testing_bow_documents = text_training_preprocessed[15000:]
testing_contextual_documents = text_training_not_preprocessed[15000:]
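
The slicing above keeps the two lists aligned only because both are cut at the same index. A minimal sketch of a helper that makes this explicit is shown below; `split_aligned` is a hypothetical name used for illustration and is not part of the library.

```python
def split_aligned(preprocessed, unpreprocessed, n_train):
    """Split two parallel document lists at the same index.

    Returns (train_prep, train_unprep, test_prep, test_unprep).
    """
    if len(preprocessed) != len(unpreprocessed):
        raise ValueError("document lists must have the same length")
    if not 0 <= n_train <= len(preprocessed):
        raise ValueError("n_train is out of range")
    return (preprocessed[:n_train], unpreprocessed[:n_train],
            preprocessed[n_train:], unpreprocessed[n_train:])


# Toy example with four documents, three of them used for training.
prep = ["a b", "c d", "e f", "g h"]
unprep = ["A b.", "C d.", "E f.", "G h."]
train_prep, train_unprep, test_prep, test_unprep = split_aligned(prep, unprep, 3)
```

With the real data, the call would take the two lists read from the files and `15000` as the split point.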

## **Creating the Training Dataset**
* We pass both the unpreprocessed and the preprocessed documents to our TopicModelDataPreparation object.
* This object takes care of creating the bag of words and of obtaining the contextualized BERT representations of the documents.
* This operation gives us our training dataset.


In [7]:
tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")

training_dataset = tp.create_training_set(training_contextual_document, training_bow_documents)

100%|██████████| 405M/405M [00:07<00:00, 52.7MB/s]
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/sbert.net_models_bert-base-nli-mean-tokens/0_BERT were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Batches:   0%|          | 0/75 [00:00<?, ?it/s]


**Preprocessed text:**

We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.
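
The preprocessed file used here was downloaded ready-made, so the exact preprocessing steps are not shown. As a rough illustration of the idea only, the sketch below lowercases the text, drops punctuation and digits, and keeps the most frequent words. `simple_preprocess` and `vocabulary_size` are hypothetical names, and this is not necessarily how the downloaded file was produced.

```python
import re
from collections import Counter


def simple_preprocess(documents, vocabulary_size):
    """Lowercase, drop punctuation/digits, and keep only the most frequent words."""
    # Keep alphabetic tokens only, which also removes punctuation.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    counts = Counter(token for tokens in tokenized for token in tokens)
    vocabulary = {word for word, _ in counts.most_common(vocabulary_size)}
    return [" ".join(t for t in tokens if t in vocabulary) for tokens in tokenized]


docs = ["The highway, the road!", "The highway."]
cleaned = simple_preprocess(docs, vocabulary_size=2)
```

Here "road" is dropped because only the two most frequent words are kept.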

**Unpreprocessed text**: 

We provide unpreprocessed text as the input for BERT (or the contextualized model of your choice) to let the model output more accurate document representations.

**Vocabulary:**

In [8]:
tp.vocab[:10]

['abbreviated',
 'academic',
 'academy',
 'access',
 'according',
 'achieved',
 'acquired',
 'acre',
 'acres',
 'across']

## **Training the Combined Contextualized Topic Model**

Finally, we can fit our new topic model. 
We ask the model to find 50 topics in our collection (the `n_components` parameter of the CombinedTM object).

In [27]:
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=50, n_components=50)
ctm.fit(training_dataset)

Epoch: [50/50]	 Seen Samples: [750000/750000]	Train Loss: 136.03000888671875	Time: 0:00:04.297509: : 50it [03:36,  4.34s/it]


In [30]:
tb.add_scalar("Loss", ctm.best_loss_train)
tb.close()

## **Saving the Model**

In [10]:
ctm.save(models_dir="./")



## **Loading the Model**

In [None]:
# del ctm

In [11]:
!ls

'contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99'
 dbpedia_sample_abstract_20k_prep.txt
 dbpedia_sample_abstract_20k_unprep.txt
 sample_data


In [12]:
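# The model above was trained for 50 epochs. The checkpoint is assumed to be saved
# under the zero-based index of the last epoch, hence epoch=49.
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=50, n_components=50)

ctm.load("contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99/",
         epoch=49)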
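
The directory name written by `save` is long and easy to mistype. One option, sketched here under the assumption that saved model directories start with `contextualized_topic_model` (as in the `!ls` output above), is to look the name up instead of copying it by hand. `find_model_dirs` is a hypothetical helper, not a library function.

```python
import os
import tempfile


def find_model_dirs(base_dir, prefix="contextualized_topic_model"):
    """Return sorted names of sub-directories of base_dir that start with prefix."""
    return sorted(
        name for name in os.listdir(base_dir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(base_dir, name))
    )


# Demonstration in a temporary directory with a made-up model folder name.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "contextualized_topic_model_demo"))
    os.mkdir(os.path.join(tmp, "sample_data"))
    found = find_model_dirs(tmp)
```

In the notebook, `find_model_dirs("./")` would be expected to return the single saved model directory, whose name could then be passed to `ctm.load`.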



## **Topics**

After training, it is time to look at our topics: we can use the `get_topic_lists` method to get them. It accepts a parameter that sets how many words to show for each topic.

If you look at the topics, you will see that most of them are coherent and representative of a collection of documents drawn from Wikipedia (general knowledge).

In [13]:
ctm.get_topic_lists(5)

[['party', 'council', 'member', 'election', 'constituency'],
 ['book', 'novel', 'written', 'published', 'fiction'],
 ['member', 'politician', 'party', 'norwegian', 'political'],
 ['club', 'association', 'stadium', 'football', 'federation'],
 ['approximately', 'within', 'administrative', 'mi', 'poland'],
 ['school', 'high', 'education', 'college', 'students'],
 ['english', 'first', 'made', 'right', 'class'],
 ['family', 'brown', 'mm', 'found', 'described'],
 ['head', 'university', 'led', 'coach', 'represented'],
 ['company', 'founded', 'largest', 'based', 'headquartered'],
 ['son', 'st', 'bishop', 'john', 'priest'],
 ['best', 'known', 'director', 'american', 'born'],
 ['ice', 'hockey', 'professional', 'canadian', 'nhl'],
 ['czech', 'civil', 'village', 'parish', 'square'],
 ['usually', 'used', 'often', 'uses', 'generally'],
 ['west', 'east', 'south', 'within', 'capital'],
 ['united', 'states', 'representatives', 'senate', 'republican'],
 ['station', 'street', 'line', 'radio', 'fm'],
 ['s
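
A quick way to sanity-check a topic list like the one above is to measure how much the topics overlap. The sketch below computes the fraction of distinct words among all top words. It is a simple illustrative measure, not the library's `InvertedRBO` or `CoherenceNPMI`, and `topic_diversity` is a hypothetical name.

```python
def topic_diversity(topics):
    """Fraction of distinct words among all top words (1.0 means no overlap)."""
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        raise ValueError("topics must contain at least one word")
    return len(set(all_words)) / len(all_words)


# The first three topics from the output above.
sample_topics = [
    ['party', 'council', 'member', 'election', 'constituency'],
    ['book', 'novel', 'written', 'published', 'fiction'],
    ['member', 'politician', 'party', 'norwegian', 'political'],
]
diversity = topic_diversity(sample_topics)
```

For these three topics the value is 13/15, because "party" and "member" each appear in two topics.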

## **Using the Test Set**

Now we are going to use the test set: we want to predict the topics of unseen documents.

In [14]:
testing_dataset = tp.create_test_set(testing_contextual_documents, testing_bow_documents) # create dataset for the testset
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=1)


Sampling: [1/1]: : 1it [00:01,  1.34s/it]


In [15]:
print(testing_contextual_documents[10])

topic_index = np.argmax(predictions[10])
ctm.get_topic_lists(5)[topic_index]

The Special Operations Task Force (Abbreviation: SOTF; Chinese: 特别行动队; Malay: Operasi Khas Pasukan Khas) is a Special operations Force created by the Singapore Armed Forces to better combat terrorists threats that would harm Singaporean interests at home and overseas. According to Colonel Benedict Lim, then Assistant Chief of General Staff


['government', 'responsible', 'republic', 'ministry', 'national']
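
Taking the `argmax` keeps only the single most probable topic, although a document can mix several. A minimal sketch of ranking the top few topics for one document follows. The distribution is made up for illustration, and `top_k_topics` is a hypothetical helper.

```python
def top_k_topics(distribution, k=3):
    """Return (topic_index, probability) pairs for the k most probable topics."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up distribution over five topics for a single document.
doc_distribution = [0.05, 0.60, 0.10, 0.20, 0.05]
best = top_k_topics(doc_distribution, k=2)
```

In the notebook, a row such as `predictions[10]` could be passed in place of `doc_distribution`, and each returned index could be used to look up its words in `ctm.get_topic_lists(5)`.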

## **Gradio**

In [18]:
!pip install -q gradio

In [21]:
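import numpy as np
import gradio as gr

def predict_topic(text):
    # create_test_set expects lists of documents, so wrap the single input.
    # Assumption: the raw text is reused for the bag of words, relying on the
    # fitted vectorizer to ignore out-of-vocabulary words.
    text_dataset = tp.create_test_set([text], [text])
    prediction = ctm.get_doc_topic_distribution(text_dataset, n_samples=1)
    # Pick the most probable topic and show its top five words.
    topic_index = np.argmax(prediction[0])
    return ", ".join(ctm.get_topic_lists(5)[topic_index])

gr.Interface(fn=predict_topic,
             inputs="textbox",
             outputs="textbox").launch(share=True)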

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 72 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted
Running on External URL: https://46205.gradio.app
