<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/Zero_shot_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Zero-shot topic modeling with ZeroShotTM


We are going to use our **Zero-Shot Topic Model** to extract topics from the collection of academic papers we have for all Hertie School supervisors.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents.


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
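To confirm that the runtime actually sees a GPU after changing the setting, a quick check along these lines can help. This is a minimal sketch: `gpu_available` is a hypothetical helper name, and it simply returns `False` if PyTorch is not installed yet.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # may be missing outside Colab or before installation
    except ImportError:
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the hardware accelerator setting has probably not been applied to the current runtime.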

# Installing Contextualized Topic Models

First, we install the `contextualized-topic-models` library.

In [1]:
%%capture
!pip install contextualized-topic-models
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install -U tqdm

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data

We are going to need some data. The model expects one document per entry, for example a file with one document per line. We assume you haven't run any preprocessing script.

Here, we mount Google Drive and read the papers from a CSV file, taking the documents from its `Content` column.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [32]:
import pandas as pd

file = pd.read_csv("/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-final.csv")
# Convert the 'Content' column into a list of documents.
# Note: [1:] skips the first data row of the CSV.
text_file = file['Content'][1:].values.tolist()

#print(text_file)
print(type(text_file))

<class 'list'>
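If your documents are in a plain text file with one document per line rather than a CSV, a minimal sketch for loading them could look like the following. The helper name `load_documents` and the file name in the comment are hypothetical; empty lines are skipped, which is an assumption about what you want.

```python
from pathlib import Path


def load_documents(path):
    """Read a UTF-8 text file and return one stripped, non-empty document per line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical usage:
# text_file = load_documents("/content/my_documents.txt")
```

The result is a plain Python list of strings, the same type as `text_file` above.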


# Importing what we need

In [33]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since too many words might not help.

In [34]:
nltk.download('stopwords')

documents = text_file
#documents = [line.strip() for line in open(text_file, encoding="utf-8").readlines()]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
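To make the idea behind this step concrete, here is a rough pure-Python sketch of whitespace-based preprocessing: lowercase, strip punctuation, drop stopwords, and keep only the most frequent words. It is an illustration only and not the library's implementation; `simple_preprocess`, the toy documents, and the toy stopword set are all made up for this example.

```python
import string
from collections import Counter


def simple_preprocess(docs, stopwords, vocabulary_size=2000):
    """Lowercase, strip punctuation, drop stopwords, keep the most frequent words."""
    # Map every punctuation character to a space before splitting on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in stopwords]
        for doc in docs
    ]
    # Count words over the whole collection and keep the top `vocabulary_size`.
    counts = Counter(w for tokens in tokenized for w in tokens)
    vocab = {w for w, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(w for w in tokens if w in vocab) for tokens in tokenized]
    return cleaned, sorted(vocab)


toy_docs = ["The bank raised rates; the bank is cautious.", "Voters and parties: the vote!"]
toy_stopwords = {"the", "is", "and"}
cleaned, vocab = simple_preprocess(toy_docs, toy_stopwords, vocabulary_size=5)
print(cleaned)
print(vocab)
```

Limiting the vocabulary keeps the bag of words small, which is the motivation given above for dropping rare words.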


In [36]:
preprocessed_documents[:10]

['nsee discussions author publication https www researchgate net publication incentive preferences issues strategies local government first evidence narticle authors including nsome authors publication also working related projects open innovative governments view project public sector future view project publications nsee profile school governance publications nsee profile nall content following page september nthe user downloaded nhttps www researchgate net publication motivation incentive preferences issues strategies local government first evidence austria enrichid rgreq xxx enrichsource el esc publicationcoverpdf nhttps www researchgate net publication motivation incentive preferences issues strategies local government first evidence austria enrichid rgreq xxx enrichsource el esc publicationcoverpdf nhttps www researchgate net project open innovative governments enrichid rgreq xxx enrichsource el esc publicationcoverpdf nhttps www researchgate net project public sector future enri

Stopwords coming from page layout, publisher boilerplate, etc. are still in there, and so are authors' names.
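Tokens such as `nsee`, `nthe`, and `nhttps` in the output above look like the remains of literal `\n` sequences (a backslash followed by `n`) in the raw text, with the backslash removed together with the punctuation. That is an assumption about this dataset, not something verified here. Under that assumption, a small cleaning step applied to the raw documents before preprocessing might look like this; `clean_raw_text` and `URL_PATTERN` are hypothetical names.

```python
import re

# Matches http(s) URLs and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_raw_text(text):
    """Replace literal backslash-n sequences and URLs with spaces, then collapse whitespace."""
    text = text.replace("\\n", " ")  # literal backslash + 'n', not a real newline
    text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split())


sample = "See discussions\\nThe user downloaded https://www.researchgate.net/publication/123 today"
print(clean_raw_text(sample))  # → See discussions The user downloaded today
```

It could be applied as `documents = [clean_raw_text(d) for d in text_file]` before calling `WhiteSpacePreprocessing`. Author names and publisher boilerplate would still need a separate custom stopword list.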

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of the documents. This operation allows us to create our training dataset.

In [37]:
#Note: Here we use the contextualized model "distiluse-base-multilingual-cased", because we need a multilingual model for performing cross-lingual predictions later. Maybe switch to English at a later point
tp = TopicModelDataPreparation("distiluse-base-multilingual-cased")

training_dataset = tp.create_training_set(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Let's check the first thirty words of the vocabulary 

In [39]:
tp.vocab[:30]

['ability',
 'able',
 'absence',
 'ac',
 'academic',
 'accepted',
 'access',
 'according',
 'account',
 'accountability',
 'accounting',
 'accounts',
 'achieve',
 'achieved',
 'across',
 'act',
 'action',
 'actions',
 'active',
 'activities',
 'activity',
 'actor',
 'actors',
 'actual',
 'actually',
 'ad',
 'adaptation',
 'added',
 'addition',
 'additional']
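Each document is represented against this vocabulary as a vector of word counts. The following toy sketch shows the idea in plain Python. It is not how the library builds its bag of words internally, and `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(docs, vocab):
    """Return one count vector per document, with columns ordered like `vocab`."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        # Count only words that are part of the vocabulary.
        counts = Counter(w for w in doc.split() if w in index)
        row = [0] * len(vocab)
        for word, n in counts.items():
            row[index[word]] = n
        rows.append(row)
    return rows


toy_vocab = ["ability", "able", "academic"]
toy_docs = ["academic ability ability", "able unknown"]
print(bag_of_words(toy_docs, toy_vocab))  # → [[2, 0, 1], [0, 1, 0]]
```

Words outside the vocabulary, like `unknown` here, are simply ignored, which is why the vocabulary size matters for what the topic model can see.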

## Training our Zero-Shot Contextualized Topic Model

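Finally, we can fit our new topic model. We will ask the model to find 50 topics in our collection (the `n_components` parameter of the `ZeroShotTM` object). The value `bert_input_size=512` matches the embedding size of the `distiluse-base-multilingual-cased` model used above.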

In [41]:
ctm = ZeroShotTM(input_size=len(tp.vocab), bert_input_size=512, n_components=50, num_epochs=15)
ctm.fit(training_dataset) # run the model


Epoch: [15/15]	 Seen Samples: [12150/12150]	Train Loss: 13821.533410493827	Time: 0:00:00.416182: : 15it [00:06,  2.39it/s]
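The training log can be used as a quick sanity check on the dataset size. Assuming the "Seen Samples" counter is cumulative across epochs (an assumption about the log format, not confirmed here), dividing it by the number of epochs gives the number of training documents. `documents_per_epoch` is a hypothetical helper.

```python
import re


def documents_per_epoch(log_line):
    """Divide the total seen-samples count by the number of epochs in a log line."""
    epochs = int(re.search(r"Epoch: \[\d+/(\d+)\]", log_line).group(1))
    total = int(re.search(r"Seen Samples: \[\d+/(\d+)\]", log_line).group(1))
    return total // epochs


line = "Epoch: [15/15]\t Seen Samples: [12150/12150]\tTrain Loss: 13821.533410493827"
print(documents_per_epoch(line))  # → 810
```

Under that assumption, the model was trained on 810 documents, which can be compared against `len(text_file)`.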


# Topics

After training, it is time to look at our topics. We can use the `get_topic_lists` function to get them. It also accepts a parameter that lets you select how many words you want to see for each topic.

If you look at the topics, you will see that several reflect themes from the supervisors' papers, such as voting and parties, governance, or migration. Others still contain residue that preprocessing did not remove, such as 'www', 'et', 'al', or 'nthe'. Notice that the topics are in English, because we trained the model on English documents.
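One simple way to get a first impression of the topic lists beyond reading them is topic diversity: the share of distinct words among all top words across topics. The sketch below assumes the list-of-lists format returned by `get_topic_lists`; `topic_diversity` is a hypothetical helper, and the sample is the first three topics from this notebook's output.

```python
def topic_diversity(topics):
    """Fraction of distinct words among all top words across topics (1.0 = no overlap)."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0


# First three topics from the output of ctm.get_topic_lists(5) in this notebook.
sample_topics = [
    ['research', 'higher', 'cognitive', 'effect', 'social'],
    ['task', 'al', 'www', 'innovation', 'dual'],
    ['size', 'price', 'majority', 'funds', 'analysis'],
]
print(topic_diversity(sample_topics))  # → 1.0
```

A value close to 1.0 means the topics share few top words. It says nothing about whether the words themselves are meaningful, so noisy tokens still need to be checked by eye.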

In [42]:
ctm.get_topic_lists(5)

[['research', 'higher', 'cognitive', 'effect', 'social'],
 ['task', 'al', 'www', 'innovation', 'dual'],
 ['size', 'price', 'majority', 'funds', 'analysis'],
 ['new', 'may', 'productivity', 'et', 'firms'],
 ['increases', 'range', 'small', 'often', 'likely'],
 ['sector', 'bank', 'often', 'concern', 'expenditures'],
 ['data', 'use', 'information', 'policy', 'bank'],
 ['informal', 'analysis', 'normative', 'vote', 'de'],
 ['fertility', 'nand', 'research', 'sector', 'data'],
 ['issue', 'dominant', 'increases', 'probability', 'good'],
 ['et', 'increases', 'law', 'legislation', 'uses'],
 ['social', 'energy', 'based', 'low', 'policy'],
 ['governments', 'private', 'governance', 'regimes', 'government'],
 ['model', 'growth', 'nthe', 'global', 'welfare'],
 ['countries', 'economy', 'country', 'organizations', 'nsome'],
 ['vote', 'support', 'change', 'parties', 'global'],
 ['online', 'voters', 'term', 'participants', 'found'],
 ['migration', 'system', 'market', 'new', 'research'],
 ['integration', '