<a href="https://colab.research.google.com/github/belom-nlp/micro_topic_modelling/blob/main/notebooks/MTM_visualized.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we print the whole collection of texts from the Iran 1 dataset. Each sentence is printed in the colour of the cluster it belongs to. Neighbouring sentences are often printed in the same colour, which suggests that our model captures their proximity in meaning.
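
The observation about neighbouring sentences can also be checked numerically once every sentence has a cluster label. The sketch below is a minimal illustration, not part of the model: `neighbour_agreement` is a hypothetical helper, and the toy labels are made up rather than taken from the Iran 1 run.

```python
def neighbour_agreement(labels):
    """Share of adjacent sentence pairs that fall into the same cluster."""
    if len(labels) < 2:
        return 0.0
    # Compare every label with the label of the sentence that follows it.
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)

# Made-up cluster labels for six consecutive sentences.
toy_labels = [7, 7, 5, 2, 2, 2]
agreement = neighbour_agreement(toy_labels)
print(agreement)  # 3 of the 5 adjacent pairs share a label
```

A value well above the agreement expected from randomly shuffled labels would support the claim; a value close to it would not.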

#Installing MTM

In [None]:
from IPython.display import clear_output

!pip install -q umap-learn
!pip install -q --upgrade tbb
clear_output()

In [None]:
!pip uninstall scikit-learn -y

!pip install -U scikit-learn
clear_output()

In [None]:
!pip install umap-learn[plot]
clear_output()

In [None]:
! pip install sentence_transformers
! pip install transformers
clear_output()

In [None]:
#importing necessary libraries
from collections import Counter
import numpy as np
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import nltk

from transformers import AutoTokenizer
from transformers import BertModel
from sentence_transformers import SentenceTransformer

from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN, DBSCAN, KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import matplotlib.pyplot as plt

In [None]:
import umap
import umap.plot

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
clear_output()

In this version, we use a simplified implementation of our model: it only embeds and clusters the sentences and skips the vectorizer and LDA steps.

In [None]:
class Trial_MicroTopicModeller():

  def __init__(self, n_clusters=None, sent_transformer='intfloat/multilingual-e5-base', method='count', stop_words='english'):
    self.n_clusters = n_clusters
    self.data = None
    self.sent_transformer = SentenceTransformer(sent_transformer)
    if method == 'count':
      self.vectorizer = CountVectorizer(stop_words=stop_words, max_features=40000)


  def get_embeddings(self, data):

    """
    split documents by sentences and get embeddings for each sentence
    """
    self.data = sent_tokenize(data) #split at sentence-final punctuation marks
    sent_embs = self.sent_transformer.encode(self.data) #sentence embeddings; an array of shape (n_sentences, 768)

    return sent_embs

  def get_sentence_clusters(self, sent_embs):
    """
    sentences are grouped into clusters with either KMeans (recommended) or HDBSCAN

    returns 2 lists, ordered by ascending cluster label:
    emb_clusters contains lists of embeddings belonging to each cluster
    sent_collection contains corresponding sentences

    with HDBSCAN, the first entry holds the noise cluster (label -1) if there is one
    """
    if self.n_clusters is not None:
      cluster_maker = KMeans(n_clusters=self.n_clusters, n_init=10) #explicit n_init avoids the FutureWarning
    else:
      cluster_maker = HDBSCAN(min_cluster_size=3)
    cluster_maker.fit(sent_embs)
    labels = cluster_maker.labels_
    emb_clusters = []
    sent_collection = []
    for label in np.unique(labels): #sorted labels: 0..n-1 for KMeans, possibly starting at -1 for HDBSCAN
      members = np.flatnonzero(labels == label) #indices of the sentences in this cluster
      emb_clusters.append([sent_embs[i] for i in members])
      sent_collection.append([self.data[i] for i in members])
    return emb_clusters, sent_collection
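
The grouping step of `get_sentence_clusters` does not depend on the embedding model, so it can be sketched on its own. The helper below (`group_by_label`, a hypothetical name that is not part of the class) collects sentences under their cluster label. It is a minimal illustration rather than the implementation used above, and the toy labels are made up.

```python
from collections import defaultdict

def group_by_label(labels, sentences):
    """Collect sentences under their cluster label, keeping document order."""
    groups = defaultdict(list)
    for label, sentence in zip(labels, sentences):
        groups[int(label)].append(sentence)
    # Sort by label so that a noise cluster (-1), if present, comes first.
    return dict(sorted(groups.items()))

# Made-up labels in the style of HDBSCAN, where -1 marks noise.
toy_sentences = ["a", "b", "c", "d", "e"]
toy_labels = [0, -1, 1, 0, 1]
grouped = group_by_label(toy_labels, toy_sentences)
print(grouped)
```

Keying the groups by the label itself, instead of by a list position, gives the same result for KMeans labels (which start at 0) and for HDBSCAN labels (which may include -1).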

#Data processing

In [None]:
with open('text_data/iran_1.txt', 'r', encoding='utf-8') as file:
  lines = file.read() #the whole file as a single string

In [None]:
tmtm = Trial_MicroTopicModeller(n_clusters=12)
sent_embs = tmtm.get_embeddings(lines)
emb_clusters, sent_collection = tmtm.get_sentence_clusters(sent_embs)


#Results Visualization

In [None]:
colors = {
    0: '\033[90m',
    1: '\033[91m',
    2: '\033[92m',
    3: '\033[93m',
    4: '\033[94m',
    5: '\033[95m',
    6: '\033[96m',
    7: '\033[30m',
    8: '\033[31m',
    9: '\033[32m',
    10: '\033[33m',
    11: '\033[34m',
    12: '\033[35m',
    13: '\033[36m',
}
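
The values in `colors` are ANSI escape sequences: codes 30 to 37 select the standard foreground colours and 90 to 97 their bright variants. A colour stays active until another code is sent, so it can help to append the reset code `\033[0m` after each sentence. The sketch below shows one way to do this; `colourise` and `palette` are hypothetical names introduced for illustration only.

```python
RESET = "\033[0m"  # ANSI code that restores the terminal's default colour

def colourise(text, code):
    """Wrap text in an ANSI colour code and reset the colour afterwards."""
    return f"{code}{text}{RESET}"

# Hypothetical two-entry palette in the same format as `colors` above.
palette = {0: "\033[90m", 1: "\033[91m"}
for label, sentence in [(0, "First sentence."), (1, "Second sentence.")]:
    print(colourise(sentence, palette[label]))
```

Note that code 30 is black, which may be hard to read on a dark background.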

In [None]:
for sent in tmtm.data:
    for i in range(len(sent_collection)):
        if sent in sent_collection[i]:
            print(f"{colors[i]} {sent}")

[30m Mahsa Amini’s death could be the spark that ignites Iran around women’s rights
The country faces a litany of problems, from inflation to a democratic deficit, and the women’s movement is seen as an agent of change.
[30m On the day that news of Mahsa Amini’s death spread throughout Iran, a young woman with a shaved head joined protesters who had gathered outside Kasra hospital, where Amini had lain in a coma since her violent arrest by Iran’s morality police days earlier.
[95m In her hand she carried a plastic bag full of her long hair, shorn off in a gesture of solidarity with Amini and in defiance of the increasing crackdown on women by the regime.
[92m A week later, and protests sparked by Amini’s death are raging in the province of Kurdistan and Tehran as well as cities such as Rasht, Isfahan and Qom, one of Iran’s most religiously conservative cities.
[30m The rage across Iran at the brutal pointlessness of Amini’s death has lit the fires of protest and the increasing des