<a href="https://colab.research.google.com/github/gauravlochab/notebooks/blob/main/Beyond_BOW_Text_Analysis_with_Contextualized_Topic_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial "Beyond the BOW: Text Analysis with Contextualized Topic Models"

### NLP+CSS 201 Series, November 22, 2021 ([link to the tutorial series](https://nlp-css-201-tutorials.github.io/nlp-css-201-tutorials/))

This tutorial will introduce Contextualized Topic Models (CTM), neural topic models which combine **contextualized document embeddings** with the classical BoW representations to increase the quality of the topics. Moreover, we will see how we can use multilingual embeddings to allow the model to **learn topics in one language and predict them for documents in unseen languages**, addressing a task of zero-shot cross-lingual topic modeling.

Contact: 

*   Silvia Terragni 
*   s.terragni4@campus.unimib.it
*   [silviatti.github.io](https://silviatti.github.io)

Main References:

* GitHub repo: https://github.com/MilaNLProc/contextualized-topic-models
* [Blog post on cross-lingual topic modeling](https://fede-bianchi.medium.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576)
* [Paper "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence"](https://aclanthology.org/2021.acl-short.96/)
* [Paper "Cross-lingual Contextualized Topic Models with Zero-shot Learning"](https://aclanthology.org/2021.eacl-main.143.pdf)




## Quick intro: What is Topic Modeling?

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/topic_modeling.PNG)


A topic model is usually an unsupervised model that aims at discovering the underlying themes or *topics* in large collections of documents.  

**Main inputs:**

*   Corpus of documents (D)
*   Number of topics (K)

**Main outputs:**

*   Topics or topic indicators (lists of words or distributions of the words in the vocabulary)  
*   Distributions of the topics on the documents





## Topic Models as Probabilistic Models

### Document representation

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/doc_simplex.PNG)

We can express a document as a **multinomial distribution over the topics**: a document talks about different topics in different proportions

### Topic representation
![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/topic_distrib.PNG)

We can express a topic as a **multinomial distribution over the vocabulary**: a topic is not just an unordered list of words, but each word has a specific probability weight.
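
To make the two distributions concrete, here is a minimal sketch with made-up numbers (the vocabulary, the two topics, and all probabilities are illustrative assumptions, not values from this tutorial). Under the usual mixture view of topic models, it combines a document's topic proportions with the topics' word distributions to obtain the probability of each word in that document.

```python
# Toy example: every number below is invented for illustration.
vocabulary = ["tax", "wage", "asylum", "refugees"]

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topic_word = [
    [0.5, 0.4, 0.05, 0.05],   # topic 0: mostly about the economy
    [0.05, 0.05, 0.4, 0.5],   # topic 1: mostly about immigration
]

# A document is a probability distribution over the topics (sums to 1).
doc_topic = [0.7, 0.3]

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(topic_word)))
    for w in range(len(vocabulary))
]

for word, prob in zip(vocabulary, doc_word):
    print(f"{word}: {prob:.3f}")
```

Because the document leans towards topic 0, the words that topic 0 favours receive most of the probability mass.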

## What is a Neural Topic model? 

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/neural_topic_modeling.PNG)


It is usually based on the Variational Autoencoder architecture: 
*   The encoder network learns the parameters of a probability distribution and from this distribution we sample the K-dimensional topical document representation (or distribution)
*   The decoder network aims to reconstruct the original BoW document representation
*   We get the top words of the topics from the weight matrix that reconstructs the BoW representations
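
The last step can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper name `top_words`, the toy vocabulary, and the weight values are assumptions made for this example, and a real model would hold the weights in a tensor with one row per topic and one column per vocabulary word.

```python
def top_words(topic_word_weights, vocabulary, k=3):
    """Return the k highest-weighted words for each topic (one row per topic)."""
    topics = []
    for weights in topic_word_weights:
        # Sort word indices by weight, highest first, and keep the first k.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:k]])
    return topics


# Hypothetical decoder weights: 2 topics over a 5-word vocabulary.
vocabulary = ["tax", "wage", "asylum", "refugees", "school"]
weights = [
    [2.1, 1.7, -0.3, -0.5, 0.2],
    [-0.4, -0.1, 1.9, 2.4, 0.0],
]
print(top_words(weights, vocabulary, k=2))
```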


### BoW Limitation: 
The BoW representation **disregards the syntactic and semantic** information of the words in a document

For example: 
"*the department chair couches offers*" and "*the chair department offers couches*" have the same BoW but **if we knew the context** of the word "chair" it would be easier to assign it the correct topic.

Note: LDA has the same limitation! It assumes that the words in a document are exchangeable, i.e. independently and identically distributed given the document's topic proportions (i.i.d. assumption).
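
The example above can be checked directly. The sketch below builds a bag of words with a simple whitespace tokenizer (the helper `bag_of_words` is a hypothetical name introduced here) and shows that the two sentences are indistinguishable once word order is discarded.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each whitespace-separated token occurs, ignoring order."""
    return Counter(text.lower().split())


a = bag_of_words("the department chair couches offers")
b = bag_of_words("the chair department offers couches")
print(a == b)  # the two sentences share exactly the same counts
```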

## Contextualized Representations

Current language models (e.g. BERT) are trained on huge document collections and can capture syntactic and semantic information of the words and documents. 

They can learn **contextualized representations of words**, i.e. word embeddings that change depending on the context of the given word 

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/contextualized_embeddings.png)

Image source: http://ai.stanford.edu/blog/contextual/


### Contextualized representations of documents (SentenceBERT)

We can also obtain contextualized representations of documents by **averaging the contextualized representations of the words** in a document.

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/sentencebert.PNG)
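
As a rough illustration of the averaging step, the sketch below mean-pools a handful of toy token vectors into a single document vector. The function name `mean_pool` and the 3-dimensional vectors are assumptions made for this example; real sentence encoders work with much larger vectors and typically also ignore padding tokens.

```python
def mean_pool(token_embeddings):
    """Average a list of equally sized token vectors into one document vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n_tokens for d in range(dim)]


# Hypothetical 3-dimensional embeddings for a three-token document.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
doc_embedding = mean_pool(tokens)
print(doc_embedding)
```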

*NOTE*: These representations can also be multilingual! 

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/multilingual_contextualized_embeddings.png)


Can we use this type of representations to improve topic models and overcome the BoW limitation? 


## Contextualized Topic Models: the Combined Topic Model

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/combined_ctm.PNG)

We can concatenate the two representations (BoW and contextualized) to help the model learn better topical representations of the documents 
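
The basic idea of the concatenation can be sketched as follows. This is a conceptual illustration only: `combined_input` is a hypothetical helper, the vectors are toy values, and the library may transform the contextualized embedding before joining it with the BoW.

```python
def combined_input(bow_vector, contextual_embedding):
    """Concatenate the BoW counts and the contextualized embedding."""
    return list(bow_vector) + list(contextual_embedding)


bow = [1, 0, 2, 0]               # toy BoW over a 4-word vocabulary
embedding = [0.12, -0.53, 0.98]  # toy 3-dimensional document embedding
x = combined_input(bow, embedding)
print(len(x))  # input size = vocabulary size + embedding size
```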


Let's see how this works in practice.

## Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)


## Contextualized Topic Models (python library)

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

You can find the CTM package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)


# Installing Contextualized Topic Models

First, we install the contextualized topic model library

In [None]:
%%capture
!pip install contextualized_topic_models


# Data

We are going to use the x-stance dataset [[paper](http://ceur-ws.org/Vol-2624/paper9.pdf) and [original data](https://github.com/ZurichNLP/xstance)]. 


* X-stance comprises more than 150 questions about Swiss politics and more than 67k answers given by candidates running for political office in Switzerland. 
* Questions are available in four languages: English, Swiss Standard German, French, and Italian. We only have answers in German, French and Italian. 
* The language of a comment depends on the candidate’s region of origin. The data cover 175 communal, cantonal and national elections between 2011 and 2020. 
* The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election:
   * Welfare, Healthcare, Education, Immigration, Society, Security, Finances, Economy, Foreign Policy, Infrastructure & Environment, Political System, Digitisation

Let us start to investigate the topics of the questions in English.


## Import data

Let us import the questions in English

In [None]:
import pandas as pd    

!wget https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/data/questions.en.jsonl
result = pd.read_json(path_or_buf='/content/questions.en.jsonl', lines=True)
result.head()

--2021-11-22 17:24:44--  https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/data/questions.en.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33999 (33K) [text/plain]
Saving to: ‘questions.en.jsonl’


2021-11-22 17:24:44 (14.3 MB/s) - ‘questions.en.jsonl’ saved [33999/33999]



| | id | text | topic |
|---|---|---|---|
| 0 | 2 | Do you think it is fundamentally right that th... | Welfare |
| 1 | 4 | Should a 24-week period of "parental leave" be... | Welfare |
| 2 | 6 | The disability insurance system no longer prov... | Welfare |
| 3 | 7 | Would you support a national hospital planning... | Healthcare |
| 4 | 9 | Do you think it's right that certain forms of ... | Healthcare |


In [None]:
result['topic'].unique()

array(['Welfare', 'Healthcare', 'Education', 'Immigration', 'Society',
       'Security', 'Finances', 'Economy', 'Foreign Policy',
       'Infrastructure & Environment', 'Political System', 'Digitisation'],
      dtype=object)

Let's drop the duplicates.

In [None]:
result = result.drop_duplicates(subset=['text'])
result

| | id | text | topic |
|---|---|---|---|
| 0 | 2 | Do you think it is fundamentally right that th... | Welfare |
| 1 | 4 | Should a 24-week period of "parental leave" be... | Welfare |
| 2 | 6 | The disability insurance system no longer prov... | Welfare |
| 3 | 7 | Would you support a national hospital planning... | Healthcare |
| 4 | 9 | Do you think it's right that certain forms of ... | Healthcare |
| ... | ... | ... | ... |
| 189 | 3464 | Do you support an expansion of the legal possi... | Security |
| 190 | 3468 | Should Switzerland start membership negotiatio... | Foreign Policy |
| 191 | 3469 | Should Switzerland strive for a free trade agr... | Foreign Policy |
| 192 | 3470 | An initiative calls for liability rules for Sw... | Foreign Policy |


## Importing what we need

In [None]:
from contextualized_topic_models.models.ctm import ZeroShotTM, CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk
import torch
import random
import numpy as np

We are going to create a function that fixes the random seeds so that we can replicate the results. We will use this function later.

In [None]:
def fix_seeds():
  torch.manual_seed(10)
  torch.cuda.manual_seed(10)
  np.random.seed(10)
  random.seed(10)
  torch.backends.cudnn.enabled = False
  torch.backends.cudnn.deterministic = True

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words. We also remove stop words, which usually do not convey thematic information. 
Also, in some cases, we might want to keep only the most frequent words in the BoW. Too many words might not help.

However, **there is no standard preprocessing that fits every dataset**. In this specific case, we have only 173 documents with at most 500 words, so we don't remove less frequent words.
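
To give a feeling for what this kind of preprocessing does, here is a minimal standalone sketch. It is not the library's implementation: `simple_preprocess` is a hypothetical helper, the stop-word set is a tiny hand-picked one, and the steps (lowercasing, replacing punctuation with spaces, whitespace tokenization, dropping stop words and numbers) are assumptions chosen to mirror the description above.

```python
import string


def simple_preprocess(text, stopwords):
    """Lowercase, strip punctuation, and drop stop words and pure numbers."""
    # Replace every punctuation character with a space, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    kept = [t for t in tokens if t not in stopwords and not t.isdigit()]
    return " ".join(kept)


# A tiny hand-picked stop-word set, for illustration only.
toy_stopwords = {"do", "you", "a", "the", "of", "to", "be", "in", "should"}
question = 'Should a 24-week period of "parental leave" be introduced?'
print(simple_preprocess(question, toy_stopwords))
```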


![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/combined_ctm.PNG)


In [None]:
from nltk.corpus import stopwords as stop_words
nltk.download('stopwords')
stopwords = list(set(stop_words.words('english')))

documents = result.text.tolist()
sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




Other parameters of the object `WhiteSpacePreprocessingStopwords`: 
* *vocabulary_size*: the number of most frequent words to include in the documents. Infrequent words will be discarded from the list of preprocessed documents.
* *max_df*: when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is interpreted as an absolute count. Default: 1.0
* *min_words*: documents with fewer words than this parameter will be removed. Default: 1
* *remove_numbers*: if true, numbers are removed from the documents. Default: True

Let's check the first ten words of the vocabulary 

In [None]:
vocab[:10]

['toughest',
 'commissioner',
 'low',
 'adults',
 'integrated',
 'embark',
 'resolutions',
 'supply',
 'relating',
 'tighten']

In [None]:
preprocessed_documents[:5]

['think fundamentally right state financially support provision childcare working parents tax allowances subsidies',
 'week period parental leave introduced addition existing maternity insurance benefits',
 'disability insurance system longer provides disability benefits paid pain disorders cannot objectively proved result whiplash injury approve',
 'would support national hospital planning scheme even might lead closure hospitals',
 'think right certain forms alternative medicine reimbursed basic healthcare system']

In [None]:
unpreprocessed_corpus[:5]

['Do you think it is fundamentally right that the state should financially support the provision of childcare for working parents (through tax allowances or subsidies)?',
 'Should a 24-week period of "parental leave" be introduced in addition to the existing maternity insurance benefits?',
 'The disability insurance system no longer provides for disability benefits to be paid for pain disorders that cannot be objectively proved (e.g. as a result of whiplash injury). Do you approve?',
 'Would you support a national hospital planning scheme even if it might lead to the closure of hospitals?',
 "Do you think it's right that certain forms of alternative medicine are once again to be reimbursed under the basic healthcare system?"]

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized representations of documents. This operation allows us to create our training dataset.

Note: You can use the contextualized representation that you like. In our experiments, we noticed that a "better" language model usually leads to more coherent results. For this reason, we are going to use "paraphrase-distilroberta-base-v2". For other models: https://www.sbert.net/docs/pretrained_models.html

In [None]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




### How many topics? 
There are different techniques to select the best number of topics. In this case, I **ran the topic model with different numbers of topics (5, 10, 15, 20) and selected the number that produced the topics with the highest coherence**. 

Also remember that **a topic model is a probabilistic model, and each run produces different results** even with the same hyperparameter values (e.g. the same number of topics). For this reason, I ran the topic model 5 times for each number of topics. 

Because of time constraints, we are not going to do this now, but we can play with different numbers of topics later. There are other techniques; for example, you can use a black-box optimization strategy to find the best number of topics w.r.t. an arbitrary metric. See OCTIS: https://github.com/mind-Lab/octis

However, my ultimate suggestion is to manually inspect the topics (this is reasonable if we don't have many topics to investigate). See also the references at the end of this notebook.


### Code to find the best number of topics (do not run it during the tutorial)

Do not fix the random seeds before running this; otherwise, every run with the same number of topics will produce the same results.

In [None]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
corpus = [d.split() for d in preprocessed_documents]

num_topics = [5, 10, 15, 20]
num_runs = 5

best_topic_coherence = -999
best_num_topics = 0
for n_components in num_topics:
  for i in range(num_runs):
    print("num topics:", n_components, "/ num run:", i)
    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, 
                     n_components=n_components, num_epochs=50)
    ctm.fit(training_dataset) # run the model
    coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
    coh_score = coh.score()
    print("coherence score:", coh_score)
    if best_topic_coherence < coh_score:
      best_topic_coherence = coh_score
      best_num_topics = n_components
    print("current best coherence", best_topic_coherence, "/ best num topics", best_num_topics)
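
The loop above keeps the single best run. A possible variation, sketched below, averages the score over the runs for each candidate so that one lucky run weighs less. The names `select_num_topics` and `train_and_score` are hypothetical, and the scorer used here is a made-up stand-in so that the sketch runs on its own; in the notebook it would train a `CombinedTM` and return its NPMI coherence.

```python
from statistics import mean


def select_num_topics(candidates, num_runs, train_and_score):
    """Return the candidate with the highest mean score over num_runs runs."""
    mean_scores = {}
    for n_components in candidates:
        scores = [train_and_score(n_components, run) for run in range(num_runs)]
        mean_scores[n_components] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def fake_train_and_score(n_components, run):
    """Stand-in scorer with invented values; replace with real training."""
    return -abs(n_components - 10) / 100 + 0.001 * run


best, scores = select_num_topics(
    [5, 10, 15, 20], num_runs=5, train_and_score=fake_train_and_score
)
print("best number of topics:", best)
```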

## Training our Combined Contextualized Topic Model
Let us run the topic model with 12 topics (parameter *n_components*). 

Recall that CTM is a neural model, so we need to define for **how many epochs** the model will train. We can also use an early stopping criterion to let the model stop automatically. In this case, we should provide a validation dataset to the `fit` function (parameter `validation_dataset`).
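
For readers unfamiliar with early stopping, the sketch below shows a generic patience-based rule on a list of validation losses. It illustrates the general technique and is not the library's code: the function name, the `patience` and `delta` arguments as used here, and the loss values are assumptions made for this example.

```python
def train_with_early_stopping(validation_losses, patience=3, delta=0.0):
    """Return the 0-based epoch at which training would stop.

    validation_losses stands in for the loss measured after each epoch.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss - delta:
            # The model improved: remember the new best and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_losses) - 1


# Invented validation losses: the model stops improving after the third epoch.
losses = [90.0, 80.0, 75.0, 76.0, 75.5, 77.0, 74.0]
stopped_at = train_with_early_stopping(losses, patience=3)
print("stopped at epoch", stopped_at)
```

With these toy values, training halts before the late improvement in the last epoch is ever seen, which is the usual trade-off when choosing the patience.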

We also need to set the dimension of the BoW and the dimension of the contextualized representation. 


In [None]:
fix_seeds() # comment this out if you don't want to fix the random seeds

num_topics = 12
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=num_topics, num_epochs=50)
ctm.fit(training_dataset) # run the model

Epoch: [50/50]	 Seen Samples: [8650/8650]	Train Loss: 72.65301266708815	Time: 0:00:00.226982: : 50it [00:12,  4.12it/s]
Sampling: [20/20]: : 20it [00:04,  4.59it/s]


There are other parameters that you may want to play with:
* *lr* (float): the learning rate, i.e. the step size at each iteration while moving towards a minimum of a loss function. If it's too small, the network will require too much time to reach a minimum, if it's too high then training may not converge;
* *batch_size* (integer): the batch size, i.e. the number of samples that will be propagated through the network. If it's too high (batch size == num of total instances), you may not be able to fit the samples in your machine's memory. If it's too small, the estimate of the gradient will be less accurate.
* *hidden_sizes* (tuple of integers): the number of hidden layers and neurons. Default: (100, 100) --> two layers of 100 neurons each
* *dropout* (float): probability of dropping out the units in the latent representation layer as regularization.

You can see the full list of parameters [here](https://github.com/MilaNLProc/contextualized-topic-models/blob/6c6d6a996ceae1d203ab34a08c72f8214f98ab65/contextualized_topic_models/models/ctm.py#L19).

# Topics

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get them. It also accepts a parameter that lets you select how many words you want to see for each topic.

If you look at the topics, you can see if they all make sense. 

In [None]:
ctm.get_topic_lists(5)

[['expanded', 'framework', 'tax', 'taxation', 'uber'],
 ['petrol', 'switzerland', 'co', 'fossil', 'fuels'],
 ['chf', 'minimum', 'wage', 'full', 'listed'],
 ['government', 'mountain', 'sites', 'focused', 'public'],
 ['refugees', 'united', 'accept', 'unhcr', 'asylum'],
 ['contributions', 'weak', 'cantons', 'road', 'women'],
 ['well', 'consumption', 'possession', 'legalize', 'soft'],
 ['schools', 'subjects', 'pe', 'swimming', 'events'],
 ['openly', 'telephone', 'security', 'political', 'socialization'],
 ['eu', 'trade', 'post', 'reliefs', 'agreement'],
 ['support', 'federal', 'government', 'financial', 'equal'],
 ['companies', 'human', 'relaxed', 'compliance', 'environmental']]

However, we also want to quantify how much better the contextualized models are with respect to previous work. For example, how much better does CTM perform than LDA?

Let's compare the models.

## Latent Dirichlet Allocation (LDA) 
We are going to use the gensim library to train LDA and then assess the quality of the topics using NPMI (normalized pointwise mutual information) topic coherence.
 

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel 
from gensim.models.coherencemodel import CoherenceModel

split_preprocessed_documents = [d.split() for d in preprocessed_documents]
dictionary = Dictionary(split_preprocessed_documents)
corpus = [dictionary.doc2bow(text) for text in split_preprocessed_documents]

lda = LdaModel(corpus, num_topics=num_topics, iterations=500, random_state=42)

Let's see the topics discovered by LDA

In [None]:
def get_topics_lda(topk=10):
  topic_terms = []
  for i in range(num_topics):
      topic_words_list = []
      for word_tuple in lda.get_topic_terms(i, topk):
          topic_words_list.append(dictionary[word_tuple[0]])
      topic_terms.append(topic_words_list)
  return topic_terms

get_topics_lda(5)

[['service', 'favour', 'federal', 'years', 'future'],
 ['think', 'refugees', 'increased', 'vaccination', 'countries'],
 ['support', 'switzerland', 'favour', 'federal', 'approve'],
 ['favour', 'construction', 'individual', 'operations', 'switzerland'],
 ['support', 'would', 'switzerland', 'level', 'cantons'],
 ['support', 'favour', 'building', 'services', 'federal'],
 ['switzerland', 'relaxed', 'consumption', 'chf', 'drugs'],
 ['switzerland', 'years', 'foreign', 'welcome', 'benefits'],
 ['switzerland', 'would', 'charge', 'support', 'fuels'],
 ['swiss', 'support', 'initiative', 'federal', 'switzerland'],
 ['switzerland', 'support', 'insurance', 'benefits', 'swiss'],
 ['swiss', 'support', 'federal', 'zurich', 'geneva']]

### Topic Coherence
We usually use topic coherence as the main indicator of the quality of the topics. NPMI topic coherence is the most widely used variant, and it is computed from the co-occurrences of words in the original corpus or in an external one. The intuition is that if two words often co-occur, they are more likely to be related to each other.
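
To make the intuition concrete, here is a small standalone sketch of NPMI for a single word pair. It is a simplification: probabilities are estimated from whole-document co-occurrence on a toy corpus, whereas library implementations such as gensim's may use sliding windows and smoothing. The function name `npmi`, the corpus, and the convention of returning -1 when two words never co-occur are assumptions of this example.

```python
import math


def npmi(word_a, word_b, documents):
    """NPMI of two words, with probabilities estimated from document co-occurrence."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # convention used here: the words never co-occur
    if p_ab == 1:
        return 1.0   # the words co-occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


toy_corpus = [
    ["asylum", "refugees", "accept"],
    ["asylum", "refugees", "unhcr"],
    ["tax", "wage", "minimum"],
    ["tax", "income"],
]
print(npmi("asylum", "refugees", toy_corpus))  # always together: high score
print(npmi("asylum", "tax", toy_corpus))       # never together: lowest score
```

The coherence of a topic is then typically obtained by averaging such pairwise scores over its top words.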

In [None]:
cm = CoherenceModel(model=lda, dictionary=dictionary, 
                    texts=split_preprocessed_documents, coherence='c_npmi')
lda_coherence = cm.get_coherence()  # get coherence value
print("coherence score LDA:", lda_coherence)

coherence score LDA: -0.38652023652055956


### Coherence on CTM
The CTM library already integrates gensim's computation of coherence. We just provide the list of topics and the corpus as input to the class `CoherenceNPMI` and compute the score with the `.score()` method.

In [None]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
corpus = [d.split() for d in preprocessed_documents]
coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
print("coherence score CTM:", coh.score())

coherence score CTM: -0.17936184974516126


### Diversity of the topics 

We can also compute how diverse the topics are from each other. Ideally, we expect topics that represent separate concepts or ideas. In this case, we use the IRBO (inverted rank-biased overlap) measure. Topics with common words at different rankings are penalized less than topics sharing the same words at the highest ranks.
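
The following sketch shows the idea behind the measure on toy topics. It uses a simplified rank-biased overlap that is truncated at the list length and normalized so that identical lists score 1; this is an assumption of the example and may differ from the exact variant the library computes. The function names, the weight `p=0.9`, and the word lists are likewise illustrative.

```python
from itertools import combinations


def rbo_truncated(list_a, list_b, p=0.9):
    """Simplified rank-biased overlap of two equally long ranked lists."""
    depth = len(list_a)
    weighted_agreement = 0.0
    total_weight = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        weight = p ** (d - 1)  # deeper ranks get smaller weights
        weighted_agreement += weight * overlap / d
        total_weight += weight
    return weighted_agreement / total_weight


def inverted_rbo(topics, p=0.9):
    """1 minus the average pairwise overlap: higher means more diverse topics."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)


shared_at_top = [["tax", "wage", "income"], ["tax", "asylum", "unhcr"]]
shared_at_bottom = [["wage", "income", "tax"], ["asylum", "unhcr", "tax"]]
disjoint = [["tax", "wage", "income"], ["asylum", "refugees", "unhcr"]]

print(inverted_rbo(shared_at_top))
print(inverted_rbo(shared_at_bottom))
print(inverted_rbo(disjoint))
```

Sharing a word at the top rank lowers the diversity score more than sharing the same word at the bottom rank, and topics with no words in common reach the maximum.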

In [None]:
irbo_lda = InvertedRBO(get_topics_lda(10))
print("diversity score LDA:", irbo_lda.score())

irbo_ctm = InvertedRBO(ctm.get_topic_lists(10))
print("diversity score CTM:", irbo_ctm.score())

diversity score LDA: 0.817167000141342
diversity score CTM: 0.9960327574487013


# Topic Predictions

Now we can take a document and see which topics have been assigned to it. 

We first consider the topic distribution of the training documents, which CTM already computed.

In [None]:
topics_predictions = ctm.training_doc_topic_distributions # get all the topic predictions

Then we get the index of the most likely topic of the document of our choice

In [None]:
import numpy as np
train_doc_id = 0
topic_id = np.argmax(topics_predictions[train_doc_id]) # get the topic id of the  document

And finally get the top words of the most likely topic for the considered document.

In [None]:
ctm.get_topic_lists(10)[topic_id]

['support',
 'federal',
 'government',
 'financial',
 'equal',
 'family',
 'students',
 'low',
 'income',
 'provide']

Let us compare it with the original document

In [None]:
unpreprocessed_corpus[train_doc_id]

'Do you think it is fundamentally right that the state should financially support the provision of childcare for working parents (through tax allowances or subsidies)?'

We can also compare the topic with the corresponding ground truth label.


In [None]:
print("Original label:", result['topic'][train_doc_id])
print("Most likely topic:", ctm.get_topic_lists(10)[topic_id])

Original label: Welfare
Most likely topic: ['support', 'federal', 'government', 'financial', 'equal', 'family', 'students', 'low', 'income', 'provide']


## Get the top K documents for a topic

A different way to explore the results consists in retrieving the K documents that are most likely to be assigned to a specific topic.

Let us first consider a topic index

In [None]:
topic_id = 5
print(ctm.get_topics()[topic_id])

['contributions', 'weak', 'cantons', 'road', 'women', 'reduced', 'employees', 'equalisation', 'municipality', 'introducing']


And then we use the `get_top_documents_per_topic_id` function to get the list of most likely documents with their corresponding probability. The probability we see here corresponds to the conditional probability of the document to be assigned to the considered topic. The parameter `k` controls how many documents we want to retrieve.  
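
Conceptually, this retrieval amounts to ranking the documents by their probability for the chosen topic. The sketch below illustrates that ranking on toy data; it is not the library's implementation, and the helper name, the documents, and the distributions are made up for this example.

```python
def top_documents_for_topic(documents, doc_topic_distributions, topic_id, k=2):
    """Return the k (document, probability) pairs with the highest probability for topic_id."""
    scored = [
        (doc, dist[topic_id])
        for doc, dist in zip(documents, doc_topic_distributions)
    ]
    # Highest probability first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented documents and topic distributions (one row per document).
docs = ["doc about cantons", "doc about schools", "doc about asylum"]
dists = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.1, 0.6],
]
print(top_documents_for_topic(docs, dists, topic_id=0, k=2))
```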

In [None]:
ctm.get_top_documents_per_topic_id(unpreprocessed_corpus, ctm.training_doc_topic_distributions, topic_id, k=7)

[('Financially strong cantons want their payment of contributions to the financially weak cantons be reduced within the framework of the financial equalisation (NFA). Do you support this?',
  0.5713527902960778),
 ('Would you be in favour of introducing a compulsory general civic service (military service, extended civil service or participation in municipality militias) for men and women?',
  0.5500882517546415),
 ('Are you in favour of stricter monitoring of pay equity for women and men?',
  0.39857772663235663),
 ('Would you support a national hospital planning scheme even if it might lead to the closure of hospitals?',
  0.3734117427840829),
 ('The disability insurance system no longer provides for disability benefits to be paid for pain disorders that cannot be objectively proved (e.g. as a result of whiplash injury). Do you approve?',
  0.36601305976510046),
 ('Do you support a further reduction in contributions paid by financially strong cantons to financially weak cantons withi

# Cross-lingual Topic Modeling with Zero-shot Contextualized Topic Model 
Recall that the data we have contain answers in German, French and Italian. It would be impossible for us to predict the topics of these documents without speaking German, French or Italian. 

Instead of concatenating the input BoW representation with the contextualized representation, we can simply replace the BoW input with the contextualized one. And instead of using a monolingual representation, we can use a multilingual one.

In this way, the model takes a multilingual representation as input and tries to learn a topical representation of the documents that allows it to reconstruct the original BoW.

![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/zeroshot_ctm.PNG)


## Training our Zero-Shot Contextualized Topic Model
Here too we need both preprocessed and unpreprocessed documents. We need the preprocessed text to extract the top words of the topics, while we need the unpreprocessed text to generate the contextualized document representations.
We are going to use the same preprocessed and unpreprocessed corpus as before. 


But this time we are going to need a **multilingual sentence encoder**. We are going to use "paraphrase-multilingual-mpnet-base-v2", because it is already pre-trained on the languages that we would like to explore later (French, German and Italian).


In [None]:
fix_seeds() # comment this out if you don't want to fix the seeds

zero_tp = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")

zero_training_dataset = zero_tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




And we are ready to train the model. Make sure you use the "ZeroShotTM" class and not the "CombinedTM" one. 

In [None]:
zero_ctm = ZeroShotTM(bow_size=len(zero_tp.vocab), contextual_size=768, 
                      n_components=12, num_epochs=50)
zero_ctm.fit(zero_training_dataset) # run the model

Epoch: [50/50]	 Seen Samples: [8650/8650]	Train Loss: 78.36704355581648	Time: 0:00:00.318247: : 50it [00:15,  3.14it/s]
Sampling: [20/20]: : 20it [00:05,  3.35it/s]


### Topics 
As before, let us look at the topics of the model


In [None]:
zero_ctm.get_topic_lists(5)

[['government', 'framework', 'federal', 'proposal', 'reduction'],
 ['petrol', 'fuels', 'co', 'chf', 'currently'],
 ['consumption', 'road', 'schengen', 'switzerland', 'service'],
 ['subjects', 'pe', 'events', 'education', 'integrated'],
 ['think', 'closed', 'plants', 'missions', 'approved'],
 ['benefits', 'government', 'listed', 'women', 'launched'],
 ['eu', 'trade', 'geneva', 'subject', 'well'],
 ['least', 'increased', 'foreigners', 'given', 'years'],
 ['plants', 'increased', 'export', 'regard', 'liability'],
 ['post', 'extended', 'plants', 'eliminate', 'relaxed'],
 ['introducing', 'federal', 'minimum', 'times', 'introduce'],
 ['hours', 'companies', 'large', 'brokerage', 'favour']]

### Let's predict the topics of the documents in unseen languages 

It's time to take advantage of the power of multilingual language models. Let's predict the topics of some answers in French, German and Italian. Remember that these answers are in languages that the model has not seen during training.


First, we download the data as before.

In [None]:
!wget https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/data/test.jsonl

df_test = pd.read_json(path_or_buf='/content/test.jsonl', lines=True)
df_test


--2021-11-22 17:54:30--  https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/data/test.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9109367 (8.7M) [text/plain]
Saving to: ‘test.jsonl’


2021-11-22 17:54:30 (90.0 MB/s) - ‘test.jsonl’ saved [9109367/9109367]



|  | id | language | question_id | question | comment | label | numerical_label | author | topic | test_set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17820 | de | 3415 | Sollen die Renten der Pensionskasse durch eine... | Es muss nach anderen Lösungen gesucht werden. ... | AGAINST | 25 | aea1e176f453 | Welfare | new_comments_defr |
| 1 | 17821 | de | 3416 | Befürworten Sie Bestrebungen in den Kantonen z... | Die Kantone sollen sich um Missbräuche durch L... | AGAINST | 25 | aea1e176f453 | Welfare | new_comments_defr |
| 2 | 17823 | de | 3423 | Soll sich der Staat stärker für gleiche Bildun... | Bildung ist eines unserer Ressourcen und sollt... | FAVOR | 100 | aea1e176f453 | Education | new_comments_defr |
| 3 | 17824 | de | 3446 | Soll der Ausbau des Mobilfunknetzes nach 5G-St... | Solange die Auswirkungen auf den Menschen nich... | AGAINST | 25 | aea1e176f453 | Digitisation | new_comments_defr |
| 4 | 18220 | de | 3414 | Eine Initiative fordert einen bezahlten Vaters... | Nein, da ich den Gegenvorschlag bezüglich 2 Wo... | AGAINST | 0 | 953e5b52fe06 | Welfare | new_comments_defr |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17700 | 144258 | it | 3421 | Un'iniziativa chiede che il sistema di riduzio... | Questa iniziativa non solo non risolve in alcu... | AGAINST | 0 | b2a346607f51 | Healthcare | new_topics_it |
| 17701 | 144259 | it | 3422 | Un'iniziativa vorrebbe dare alla Confederazion... | Il budget globale nel sistema sanitario non ha... | AGAINST | 25 | b2a346607f51 | Healthcare | new_topics_it |
| 17702 | 144280 | it | 3458 | I finanziamenti ai partiti e alle campagne per... | La protezione della sfera privata dei donatori... | AGAINST | 25 | b2a346607f51 | Political System | new_topics_it |
| 17703 | 144281 | it | 3459 | Ritiene che debba prosequire l'introduzione de... | Solamente una volta garantita la completa sicu... | FAVOR | 75 | b2a346607f51 | Political System | new_topics_it |


Since the text of the answer alone may be insufficient to understand the context, each testing document will be composed of the question and the corresponding answer. We want something like:

In [None]:
df_test['question'][0] + " <SEP> " + df_test['comment'][0]

'Sollen die Renten der Pensionskasse durch eine Senkung des Umwandlungssatzes gekürzt und an die gestiegene Lebenserwartung angepasst werden? <SEP> Es muss nach anderen Lösungen gesucht werden. Die Lebenshaltungskosten (u.a. Mieten) steigen auch weiter.'

Let's concatenate question and answer for all the instances. Here we simply join them with a space rather than an explicit `<SEP>` marker.

In [None]:
test_docs = [quest + " " + answ for quest, answ in zip(df_test['question'].tolist(), df_test['comment'].tolist())]

In [None]:
test_docs[:5]

['Sollen die Renten der Pensionskasse durch eine Senkung des Umwandlungssatzes gekürzt und an die gestiegene Lebenserwartung angepasst werden? Es muss nach anderen Lösungen gesucht werden. Die Lebenshaltungskosten (u.a. Mieten) steigen auch weiter.',
 'Befürworten Sie Bestrebungen in den Kantonen zur Senkung der Sozialhilfeleistungen? Die Kantone sollen sich um Missbräuche durch Leistungsbezüger kümmern.',
 'Soll sich der Staat stärker für gleiche Bildungschancen einsetzen (z.B. mit Nachhilfe-Gutscheinen für Schüler/-innen aus Familien mit geringem Einkommen)? Bildung ist eines unserer Ressourcen und sollte wo immer möglich gestärkt werden.',
 'Soll der Ausbau des Mobilfunknetzes nach 5G-Standard weiter vorangetrieben werden? Solange die Auswirkungen auf den Menschen nicht geklärt, soll auf den Ausbau verzichtet werden.',
 'Eine Initiative fordert einen bezahlten Vaterschaftsurlaub von vier Wochen. Befürworten Sie dieses Anliegen? Nein, da ich den Gegenvorschlag bezüglich 2 Wochen unte
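The concatenation step can also be wrapped in a small helper, which makes the separator explicit and catches mismatched inputs. This is only a sketch: `build_documents` is a hypothetical name and not part of the `contextualized-topic-models` library, and the toy question/answer pair is copied from the second document shown above.

```python
def build_documents(questions, answers, sep=" "):
    """Join each question with its answer into a single test document."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    return [f"{q}{sep}{a}" for q, a in zip(questions, answers)]


# Toy data: the second question/answer pair from the test set shown above.
toy_questions = ["Befürworten Sie Bestrebungen in den Kantonen zur Senkung der Sozialhilfeleistungen?"]
toy_answers = ["Die Kantone sollen sich um Missbräuche durch Leistungsbezüger kümmern."]

toy_docs = build_documents(toy_questions, toy_answers)                        # plain space
toy_docs_with_sep = build_documents(toy_questions, toy_answers, sep=" <SEP> ")  # explicit marker
print(toy_docs[0])
print(toy_docs_with_sep[0])
```

With the real data, the same helper could be called as `build_documents(df_test['question'].tolist(), df_test['comment'].tolist())`.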

There's no need to preprocess the documents if you want to do zero-shot topic modeling! A vocabulary built from the French, German and Italian documents wouldn't match our English vocabulary anyway. Let's just pass the French, German and Italian documents as they are (without preprocessing) to our `TopicModelDataPreparation` object and create the testing dataset using the function `transform`.

In [None]:
testing_dataset = zero_tp.transform(test_docs) # create dataset for the testset



Batches:   0%|          | 0/89 [00:00<?, ?it/s]

Now we are ready to compute the topic predictions for each document. In this case, we are going to use the function `get_thetas` because the model has never seen the documents during training. The parameter `n_samples` controls the number of times we sample from the distribution the model has learned. The higher the value, the more accurate the estimate, but the longer it takes to execute.


In [None]:
# n_samples: how many times to sample from the learned distribution (see the documentation)
test_topics_predictions = zero_ctm.get_thetas(testing_dataset, n_samples=10) # get all the topic predictions

Sampling: [10/10]: : 10it [01:09,  6.91s/it]
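To build some intuition for why a larger `n_samples` helps, here is a toy simulation. It is not the library's code: it assumes that each sample is a softmax over Gaussian noise around some latent means and that the final prediction is the average of the samples. The latent means `toy_mu`, the noise level and all function names are made up for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_theta(mu, sigma, rng):
    """Draw one noisy topic distribution: Gaussian noise around mu, then softmax."""
    return softmax([rng.gauss(m, sigma) for m in mu])


def averaged_theta(mu, sigma, n_samples, rng):
    """Average n_samples sampled topic distributions, component by component."""
    samples = [sample_theta(mu, sigma, rng) for _ in range(n_samples)]
    return [sum(col) / n_samples for col in zip(*samples)]


def spread(mu, sigma, n_samples, repeats=200, seed=0):
    """Standard deviation of the first topic's averaged weight over many repeats."""
    rng = random.Random(seed)
    values = [averaged_theta(mu, sigma, n_samples, rng)[0] for _ in range(repeats)]
    mean = sum(values) / repeats
    return math.sqrt(sum((v - mean) ** 2 for v in values) / repeats)


toy_mu = [2.0, 0.5, 0.0, -1.0]  # hypothetical latent means for 4 topics
spread_1 = spread(toy_mu, sigma=1.0, n_samples=1)
spread_20 = spread(toy_mu, sigma=1.0, n_samples=20)
print(f"spread with 1 sample:   {spread_1:.4f}")
print(f"spread with 20 samples: {spread_20:.4f}")
```

In this simulation the averaged prediction fluctuates much less with 20 samples than with 1, which is the trade-off between stability and runtime described above.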


Let's install this machine translation library that we are going to use later.

In [None]:
%%capture 
!pip install deep-translator

Now we can predict the topics of documents in unseen languages. Let's consider a document with an arbitrary index.

In [None]:
test_document_index=10000
test_docs[test_document_index]

"La naturalisation devrait-elle être facilitée aux étrangers de la troisième génération? pourquoi attendre 3 générations? je l'ai obtenue à l'époque le jour de mon mariage !"

Let us translate the document, just to check whether the model predicts the topic correctly.

In [None]:
from deep_translator import GoogleTranslator
gt = GoogleTranslator(source='auto', target='en')
translated = gt.translate(test_docs[test_document_index])
translated



'Should naturalization be made easier for third generation foreigners? why wait 3 generations? I got it at the time on my wedding day!'

Let's get the index of the most likely topic for this document and then show the topic words to see if the prediction is accurate.

In [None]:
topic_number = np.argmax(test_topics_predictions[test_document_index]) # get the id of the most likely topic for the selected document
zero_ctm.get_topic_lists(15)[topic_number] 

['eu',
 'trade',
 'geneva',
 'subject',
 'well',
 'ticket',
 'users',
 'join',
 'next',
 'residence',
 'countries',
 'data',
 'greater',
 'plants',
 'agreements']
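Looking only at the single most likely topic hides how confident the model is. A small sketch that returns the top few topics together with their probabilities is shown below. The helper name `top_topics` and the 12-dimensional distribution `toy_theta` are invented for illustration; with the real predictions you would pass `test_topics_predictions[test_document_index]` instead.

```python
def top_topics(theta, n=3):
    """Return the n most probable (topic_id, probability) pairs, most likely first."""
    ranked = sorted(enumerate(theta), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical topic distribution over 12 topics for one document (sums to 1).
toy_theta = [0.02, 0.03, 0.05, 0.01, 0.04, 0.10, 0.55, 0.08, 0.03, 0.02, 0.04, 0.03]

best = top_topics(toy_theta, n=3)
for topic_id, probability in best:
    print(topic_id, probability)
```

The first pair plays the role of the `np.argmax` result, while the remaining pairs show how close the runner-up topics are.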

### Top K documents per topic
Also in this case, we can retrieve the top K documents that are most likely assigned to a given topic. Let us use the function `get_top_documents_per_topic_id` as before, but this time with `test_docs` and `test_topics_predictions` as input.

In [None]:
topic_id = 1
print(zero_ctm.get_topics()[topic_id])
top_documents = zero_ctm.get_top_documents_per_topic_id(test_docs, test_topics_predictions, topic_id, k=10)
top_documents

['petrol', 'fuels', 'co', 'chf', 'currently', 'motor', 'fossil', 'combustibles', 'oil', 'natural']


[('Actuellement, une taxe CO2 est prélevée sur les combustibles fossiles (p. ex. mazout ou gaz naturel). Cette taxe devrait-elle également être étendue aux carburants (p. ex. essence, diesel, etc.)? En tout cas pas avant que l\'on ait des vraies solutions "vertes" de remplacement.',
  0.7705287575721741),
 ('Actuellement, une taxe CO2 est prélevée sur les combustibles fossiles (p. ex. mazout ou gaz naturel). Cette taxe devrait-elle également être étendue aux carburants (p. ex. essence, diesel, etc.)? Sans pénaliser le pouvoir d’achat des milieux populaires.',
  0.767707234621048),
 ('Finora sui combustibili fossili (p. es. olio da riscaldamento o gas naturale) viene riscossa una tassa per il CO2. Ritiene che questa tassa debba essere estesa anche ai carburanti (p.es. benzina e diesel)? Assolutamente. Non possiamo più permetterci di continuare con le emissioni.',
  0.7561641722917557),
 ('Finora una tassa sul CO2 vige sui combustibili fossili (olio, gas naturale). Ritiene che debba esse
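Conceptually, this kind of retrieval ranks the documents by the probability they assign to the chosen topic. The following is a minimal sketch of that idea, not the library's implementation: `top_documents_for_topic` is a hypothetical helper, and the documents and distributions are toy values. It returns `(document, probability)` pairs in the same shape as the output above.

```python
def top_documents_for_topic(docs, thetas, topic_id, k=5):
    """Return the k (document, probability) pairs with the highest weight on topic_id."""
    scored = [(doc, theta[topic_id]) for doc, theta in zip(docs, thetas)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy corpus with 4 documents and 3 topics (each row sums to 1).
toy_corpus = ["doc A", "doc B", "doc C", "doc D"]
toy_doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.1, 0.7],
]

toy_top = top_documents_for_topic(toy_corpus, toy_doc_topics, topic_id=1, k=2)
print(toy_top)
```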

Let's translate the documents to check

In [None]:
for original_doc, probability in top_documents:
  print(gt.translate(original_doc), probability)


### Topic distribution on the overall corpus  (training vs test)

Given the discovered topics, we can investigate how the topics are distributed in the training and the testing set. This is easier if we first assign a label to each discovered topic.

In [None]:
labels = ['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 
          'topic_10', 'topic_11']
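If you do not want to write the labels by hand, one option is to derive them from the top words of each topic. The sketch below assumes this is acceptable as a first draft of the labels. `make_labels` is a hypothetical helper, and the three word lists are copied from the topics printed earlier; with the real model you could pass `zero_ctm.get_topic_lists(5)` instead.

```python
def make_labels(topic_lists, n_words=2):
    """Build one short label per topic from its topic id and first n_words top words."""
    return [f"{i}: " + ", ".join(words[:n_words]) for i, words in enumerate(topic_lists)]


# First three topics from the output of zero_ctm.get_topic_lists(5) shown earlier.
toy_topic_lists = [
    ['government', 'framework', 'federal', 'proposal', 'reduction'],
    ['petrol', 'fuels', 'co', 'chf', 'currently'],
    ['consumption', 'road', 'schengen', 'switzerland', 'service'],
]

labels_from_words = make_labels(toy_topic_lists)
print(labels_from_words)
```

Labels built this way keep the topic id visible, so they can still be matched against the topic lists after being used in a plot.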

Maybe it's helpful to find the most likely training documents for a given topic

In [None]:
topic_id = 4
print(zero_ctm.get_topic_lists(10)[topic_id])
# use the zero-shot model here, consistently with the rest of this section
zero_ctm.get_top_documents_per_topic_id(unpreprocessed_corpus, zero_ctm.training_doc_topic_distributions, topic_id, k=5)

In [None]:
print(labels,"\n")

zero_ctm.get_topic_lists(5)

Now we can see how the different topics distribute over the whole training corpus

In [None]:
import matplotlib.pyplot as plt
fig1, ax1 = plt.subplots()
ax1.pie(np.average(zero_ctm.training_doc_topic_distributions,axis=0),
        labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  

plt.show()
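The quantity plotted in the pie chart is the column-wise mean of the document-topic matrix: for each topic, the average probability it receives across all documents. The following toy example spells this out in plain Python. The matrix `toy_matrix` and the helper name are made up, and `np.average(..., axis=0)` computes the same thing on the real data.

```python
def mean_topic_distribution(thetas):
    """Average the topic distributions of all documents, one value per topic."""
    n_docs = len(thetas)
    return [sum(column) / n_docs for column in zip(*thetas)]


# Toy document-topic matrix: 3 documents, 3 topics (each row sums to 1).
toy_matrix = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.2, 0.4],
]

corpus_share = mean_topic_distribution(toy_matrix)
print(corpus_share)
```

Because every row sums to 1, the averaged values also sum to 1, which is why they can be shown directly as pie slices.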


And we can compare this with the testing corpus 

In [None]:
fig1, ax1 = plt.subplots()
ax1.pie(np.average(test_topics_predictions,axis=0),
        labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  

plt.show()


We can also compare the distribution of the topics between the different languages. Let's split the topic predictions by language.

In [None]:
test_topics_predictions_it, test_topics_predictions_de, test_topics_predictions_fr = [], [], [] 
for ttp, lang in zip(test_topics_predictions, df_test['language'].tolist()):
  if lang == 'it':
    test_topics_predictions_it.append(ttp)
  elif lang == 'de':
    test_topics_predictions_de.append(ttp)
  elif lang == 'fr':
    test_topics_predictions_fr.append(ttp)
  else:
    print('something went wrong')
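The same split can be written without hard-coding the language codes, which may be handy if the test set contains additional languages. This is a sketch with toy values; `split_by_language` is a hypothetical helper and is not part of the library.

```python
from collections import defaultdict


def split_by_language(thetas, languages):
    """Group topic distributions by the language code of their document."""
    groups = defaultdict(list)
    for theta, lang in zip(thetas, languages):
        groups[lang].append(theta)
    return dict(groups)


# Toy predictions for 4 documents over 2 topics, with their language codes.
toy_predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]
toy_languages = ['it', 'de', 'it', 'fr']

by_language = split_by_language(toy_predictions, toy_languages)
for lang, group in sorted(by_language.items()):
    print(lang, len(group))
```

With the real data, `split_by_language(test_topics_predictions, df_test['language'].tolist())` would give one list per language, for example `by_language['it']`.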

Now we are ready to plot the different distributions, as seen above

In [None]:
fig1, axs = plt.subplots(1, 3, figsize=(20, 5), sharey=True)
axs[0].pie(np.average(test_topics_predictions_it,axis=0),
        labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
axs[0].axis('equal')  
axs[0].title.set_text('ITALIAN')

axs[1].pie(np.average(test_topics_predictions_de,axis=0),
        labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
axs[1].axis('equal')  
axs[1].title.set_text('GERMAN')


axs[2].pie(np.average(test_topics_predictions_fr,axis=0),
        labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
axs[2].axis('equal')  
axs[2].title.set_text('FRENCH')


plt.show()


# What's next?

## Kitty
**Kitty** classifies documents using a human-in-the-loop approach supported by Contextualized Topic Models.

![](https://github.com/silviatti/Contextualized-Topic-Models-Tutorial/blob/main/images/kitty.PNG?raw=true)

[Link](https://contextualized-topic-models.readthedocs.io/en/develop/kitty.html)

## OCTIS

**OCTIS (Optimizing and Comparing Topic models Is Simple)** can automatically discover the optimal hyperparameter configuration with respect to a given evaluation metric. It includes CTM along with many other topic models and evaluation metrics.

![](https://github.com/MIND-Lab/OCTIS/blob/master/logo.png?raw=true)


[Link](https://github.com/MIND-Lab/OCTIS/)


## Additional Resources 
* Jordan Boyd-Graber, David Mimno, and David Newman. "Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements": https://home.cs.colorado.edu/~jbg/docs/2014_book_chapter_care_and_feeding.pdf
* David Mimno, Jordan Boyd-Graber, and Yuening Hu. "Applications of Topic Models": https://mimno.infosci.cornell.edu/papers/2017_fntir_tm_applications.pdf