<a href="https://colab.research.google.com/github/davidbadajkov/team_armenia_mda/blob/main/BERT_TopicModelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic
<br></br>
I am going to use BERTopic ([see here](https://github.com/MaartenGr/BERTopic)), which builds on BERT-style sentence-transformer models, to generate contextualized document embeddings and base the topic modelling on them.
<br></br>
After generating the sentence/document embeddings, we use UMAP to perform non-linear dimensionality reduction and then cluster the reduced document embeddings (BERTopic uses HDBSCAN by default). A class-based TF-IDF (c-TF-IDF) then extracts the most representative words for each cluster.
<br></br>
The benefit of contextualized embeddings over BoW, TF-IDF, Word2Vec or GloVe representations is that identical words get different representations depending on the context in which they are used.
<br></br> 
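
To make the class-based TF-IDF idea concrete, here is a minimal pure-Python sketch on two made-up clusters. It assumes the weighting `tf * log(1 + A / f_t)`, where `tf` is the term's share of the words in a cluster, `A` is the average number of words per cluster and `f_t` is the term's frequency across all clusters. The cluster names and texts are hypothetical, and this is only an illustration of the formula, not BERTopic's actual code.

```python
import math
from collections import Counter

# Hypothetical toy clusters: all documents of a cluster joined into one string.
clusters = {
    "climate": "climate change climate emissions global warming",
    "security": "nuclear weapons global security terrorism conflict",
}

def c_tf_idf(clusters):
    """Score every term of every cluster with tf * log(1 + A / f_t)."""
    # Term counts per cluster.
    counts = {name: Counter(text.split()) for name, text in clusters.items()}
    # f_t: frequency of each term across all clusters.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for name, counter in counts.items():
        n_words = sum(counter.values())
        scores[name] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores

scores = c_tf_idf(clusters)
for name, term_scores in scores.items():
    # Terms sorted from most to least characteristic of the cluster.
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    print(name, ranked[:3])
```

In this toy example "global" appears in both clusters, so it scores lower than the words that are unique to one cluster, which is the behaviour we want from a topic descriptor.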


## Imports

In [24]:
#!pip install bertopic[all]
from bertopic import BERTopic
#from sentence_transformers import SentenceTransformer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import tensorflow as tf
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Connecting to cloud hardware

In [2]:
device_name = tf.test.gpu_device_name()
if device_name == '/device:GPU:0':
  print('Found GPU at: {}'.format(device_name))
else:
  raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [3]:
if torch.cuda.is_available():
  device = torch.device("cuda")
  print('There are %d GPU(s) available' % torch.cuda.device_count())
  print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
  print('No GPU available, using CPU')
  device = torch.device("cpu")

There are 1 GPU(s) available
We will use the GPU: Tesla K80


## Importing data

In [29]:
df = pd.read_csv('/content/consolidated-transcripts')
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Session,Country,Transcript
0,0,2018,73,BRB,Let me begin by congratulating Ms. María Ferna...
1,1,2018,73,IND,"On my own behalf and on behalf of my country, ..."
2,2,2018,73,ARG,I would like to congratulate the President on ...
3,3,2018,73,JOR,It is an honour to take part in the general de...
4,4,2018,73,SWE,"Just a bit more than a week ago, we honoured t..."


In [30]:
df.drop(columns = 'Unnamed: 0', inplace = True)
df.head()

Unnamed: 0,Year,Session,Country,Transcript
0,2018,73,BRB,Let me begin by congratulating Ms. María Ferna...
1,2018,73,IND,"On my own behalf and on behalf of my country, ..."
2,2018,73,ARG,I would like to congratulate the President on ...
3,2018,73,JOR,It is an honour to take part in the general de...
4,2018,73,SWE,"Just a bit more than a week ago, we honoured t..."


## Pre-processing and cleaning data

In [31]:
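# gonna borrow this function from Kachatur
def clean_text(text):
    text = re.sub(r'[0-9]', '', text)  # strip ASCII digits
    text = text.lower()
    text = re.sub(r'\[,!.*?\]', '', text)  # drop bracketed spans that start with "[,!"
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # strip ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # drop any remaining tokens that contain a digit
    return text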
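
To see what this cleaning does in practice, here is a small self-contained check on a made-up sentence (the sentence is hypothetical, the function body is copied from the cell above). It shows two side effects worth keeping in mind: hyphenated words get glued together, and document symbols such as `A/73/PV.7` collapse into short tokens like `apv`.

```python
import re
import string

def clean_text(text):
    # Same steps as the notebook cell above.
    text = re.sub(r'[0-9]', '', text)
    text = text.lower()
    text = re.sub(r'\[,!.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Hypothetical sample sentence, not taken from the transcripts.
sample = "Seventy-third session: a Category 5 hurricane (see A/73/PV.7)."
cleaned = clean_text(sample)
print(repr(cleaned))
# → 'seventythird session a category  hurricane see apv'
```

Note the double space where the digit used to be; the tokenizer in the next step splits on whitespace, so this does not affect the final tokens.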

In [32]:
def remove_stopwords(text):
  stop_words = set(stopwords.words('english'))
  # keep only the tokens that are not English stop words
  filtered = [w for w in word_tokenize(text) if w not in stop_words]
  return ' '.join(filtered)

In [33]:
df['Transcript'] = df['Transcript'].apply(lambda x: clean_text(str(x)))

In [34]:
df['Transcript'] = df['Transcript'].apply(lambda x: remove_stopwords(str(x)))

In [35]:
df['Transcript'][0]

'let begin congratulating ms maría fernanda espinosa garcés election preside general assembly seventythird session particular fourth woman year history united nations receive high honour elevated position president general assembly however would like pause stage came speech impossible deliver given events happened past hours world live ignored events include transit tropical storm across country thought would passed us floods hit many communities overnight onslaught storm sister country saint lucia earthquake shores martinique guadeloupe dominica morning affect land destabilized islands earthquake tsunami indonesia earlier today typhoon deal blow people japan events great concern world live different world one known ask matter last year prime minister dominica stood rostrum within days passage hurricane violent describe category hurricane would injustice people dominica exposed see apv assembly heard colleague prime minister antigua speak earlier fact billion damage come caribbean regi

![](https://www.myinstants.com/media/instants_images/boratgs.jpg)

In [41]:
docs = list(df['Transcript'].values)
len(docs)

8094

## Setting up BERTopic

In [42]:
topic_model = BERTopic(language = 'english', calculate_probabilities=True)
topics, _ = topic_model.fit_transform(docs)


In [46]:
topic_freq = topic_model.get_topic_freq()
outliers = topic_freq['Count'][topic_freq['Topic']==-1].iloc[0]
print(f"{outliers} documents have not been classified")
print(f"The other {topic_freq['Count'].sum() - outliers} documents are spread across {topic_freq['Topic'].shape[0]-1} topics")

1829 documents have not been classified
The other 6265 documents are spread across 160 topics
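
To put these numbers in perspective, a quick back-of-the-envelope calculation using only the counts printed above (the variable names are my own):

```python
# Counts taken from the output above.
n_docs = 8094
n_outliers = 1829
n_topics = 160

outlier_share = n_outliers / n_docs
docs_per_topic = (n_docs - n_outliers) / n_topics

print(f"{outlier_share:.1%} of the documents are outliers")
print(f"{docs_per_topic:.1f} documents per topic on average")
```

So roughly a fifth of the speeches end up in the outlier topic `-1`, and the remaining topics are fairly small on average.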


In [47]:
topic_freq.head()

Unnamed: 0,Topic,Count
0,-1,1829
1,151,688
2,122,356
3,125,233
4,140,196


In [48]:
# inspect the top words (with their c-TF-IDF scores) of the largest topic;
# row 0 of topic_freq is the outlier topic -1, so we take row 1
topic_model.get_topic(topic_freq['Topic'].iloc[1])

[('global', 0.0037329359258601955),
 ('change', 0.0029076191897645706),
 ('need', 0.002865366100473969),
 ('terrorism', 0.002805874754185906),
 ('us', 0.0027618868313453535),
 ('climate', 0.0027023797680305094),
 ('war', 0.0026487936617529285),
 ('every', 0.0024023842998822796),
 ('nuclear', 0.002381102803956094),
 ('conflict', 0.0023681940203896007)]

In [49]:
topic_model.visualize_topics()

Many of the clusters sit on top of each other in the intertopic distance map, which suggests that the 160 topics are more fine-grained than they need to be. There is room to keep tuning the clustering step and the dimensionality reduction.

In [None]:
# let's try to search topics related to the environment
similar_topics, similarity = topic_model.find_topics('climate', top_n = 5)
similar_topics
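
As far as I understand it, this kind of search embeds the search term and ranks the topics by cosine similarity to that embedding. Below is a minimal sketch of the ranking step only, assuming that this is the mechanism. The 3-dimensional vectors and the helper `find_similar` are hypothetical stand-ins, not BERTopic's real embeddings or API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical topic embeddings, keyed by topic id.
topic_vectors = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.1],
    2: [0.7, 0.3, 0.2],
}
# Hypothetical embedding of the search term.
query_vector = [1.0, 0.2, 0.0]

def find_similar(query, vectors, top_n=2):
    """Return the top_n topic ids and their similarities, best first."""
    ranked = sorted(vectors, key=lambda t: cosine(query, vectors[t]), reverse=True)
    ranked = ranked[:top_n]
    return ranked, [cosine(query, vectors[t]) for t in ranked]

similar, sims = find_similar(query_vector, topic_vectors)
print(similar)
# → [0, 2]
```

The returned similarities are sorted in the same order as the topic ids, so the first entry is always the closest match.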