<a href="https://colab.research.google.com/github/belom-nlp/micro_topic_modelling/blob/main/notebooks/MTM_on_train_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

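In this notebook, we check how MicroTopicModeller performs on several English and French datasets, scoring its topics with PMI, NPMI, and coherence.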

In [1]:
from IPython.display import clear_output

! pip install sentence_transformers

!pip uninstall scikit-learn -y

!pip install -U scikit-learn

import nltk
nltk.download('punkt')

from collections import Counter

import numpy as np

import torch

from sklearn.cluster import HDBSCAN, KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA

from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

from sentence_transformers import SentenceTransformer

clear_output()

In [None]:
import random

def set_random_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)

set_random_seed(42)

In [None]:
import model

In [None]:
nltk.download('stopwords')
french_stops = list(set(stopwords.words('french')))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


We will use two instances of MicroTopicModeller, one for English texts and one for French texts.

In [None]:
mtm = model.MicroTopicModeller(n_clusters=12) #with English stop words by default

In [None]:
mtm_fr = model.MicroTopicModeller(n_clusters=12, stop_words=french_stops) #with French stop words downloaded from nltk

## Metrics

Here we install the metrics we will use to evaluate our results.

In [None]:
! git clone https://github.com/christianrfg/tm_metrics.git

Cloning into 'tm_metrics'...
remote: Enumerating objects: 79, done.
remote: Total 79 (delta 0), reused 0 (delta 0), pack-reused 79
Receiving objects: 100% (79/79), 1.19 MiB | 2.81 MiB/s, done.
Resolving deltas: 100% (22/22), done.


In [None]:
!sed -i 's/cv_model.get_feature_names()/cv_model.get_feature_names_out()/' tm_metrics/tm_metrics/feature_extraction/text.py
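
The `sed` command above is needed because `CountVectorizer.get_feature_names()` was removed in scikit-learn 1.2 in favour of `get_feature_names_out()`. Where `sed` is unavailable, the same substitution could be done in Python. The helper `patch_feature_names` below is hypothetical and is shown on a temporary file with a made-up line of code rather than on the real `text.py`.

```python
import tempfile
from pathlib import Path


def patch_feature_names(path):
    """Replace the removed get_feature_names() call with get_feature_names_out() in place."""
    source = Path(path).read_text(encoding="utf-8")
    patched = source.replace(
        "cv_model.get_feature_names()", "cv_model.get_feature_names_out()"
    )
    Path(path).write_text(patched, encoding="utf-8")
    return patched != source  # True if anything changed


# Demonstration on a throwaway file with a made-up line of code.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "text.py"
    demo_file.write_text("vocab = cv_model.get_feature_names()\n", encoding="utf-8")
    changed = patch_feature_names(demo_file)
    result = demo_file.read_text(encoding="utf-8")

print(changed, result.strip())
```

Unlike the `sed` call, `str.replace` rewrites every occurrence in the file, not only the first one on each line.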

In [None]:
!pip install -U /content/tm_metrics

Processing ./tm_metrics
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tm-metrics
  Building wheel for tm-metrics (setup.py) ... done
  Created wheel for tm-metrics: filename=tm_metrics-0.1-py3-none-any.whl size=8201 sha256=e26886205adb427d5d570dd916e6dccc3bb3931f452b8dc199f5eff2a8a4f3f9
  Stored in directory: /tmp/pip-ephem-wheel-cache-_0ybiksb/wheels/88/73/88/e84eb7e10e9fc6ecb2f0e636d851f394cd8789c1ed40d09cc7
Successfully built tm-metrics
Installing collected packages: tm-metrics
Successfully installed tm-metrics-0.1


Two of the metrics, PMI and NPMI, are computed here slightly differently from the `tm_metrics` implementation, so we redefine them.

In [None]:
def pmi(topic_words, word_frequency, word_frequency_in_documents, n_docs, normalise=False):
    """PMI/NPMI topic quality metric for a topic.

    Calculates the PMI or NPMI topic quality metric for one individual topic
    based on the topic words.

    Args:
        topic_words: list
            Words that compose one individual topic.
        word_frequency: dict
            Frequency of each word in the corpus.
        word_frequency_in_documents: dict
            For each word, the set of documents that contain it.
        n_docs: int
            Number of documents in the corpus.
        normalise: bool, default=False
            Whether to return the normalised value (NPMI) or not (PMI).

    Returns:
        float
            NPMI for the topic if normalise is True, otherwise PMI.
    """
    n_top = len(topic_words)
    pmi = 0.0
    npmi = 0.0

    for j in range(1, n_top):
        for i in range(0, j):
            ti = topic_words[i]
            tj = topic_words[j]

            c_i = word_frequency[ti]
            c_j = word_frequency[tj]
            c_i_and_j = len(word_frequency_in_documents[ti].intersection(word_frequency_in_documents[tj]))

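            # Smoothed probability that both words occur in the same document.
            dividend = (c_i_and_j + 1.0) / float(n_docs)
            # Product of the two marginal probabilities.
            divisor = ((c_i * c_j) / float(n_docs) ** 2)
            # Negative PMI values are clipped to zero.
            pmi += max([np.log(dividend / divisor), 0])

            # Accumulate the normaliser, which uses a smaller smoothing term (0.01).
            npmi += -1.0 * np.log((c_i_and_j + 0.01) / float(n_docs))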

    if npmi != 0:
        npmi = pmi / npmi

    if normalise:
        return npmi
    else:
        return pmi
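
Read off the loops above, the function computes the quantities below. The notation is introduced here for illustration and is not taken from `tm_metrics`: $w_1, \dots, w_n$ are the topic words, $c_i$ is the corpus frequency of $w_i$, $c_{ij}$ is the number of documents containing both $w_i$ and $w_j$, and $N$ is `n_docs`.

```latex
\mathrm{PMI} = \sum_{j=2}^{n} \sum_{i=1}^{j-1}
    \max\!\left( \log \frac{(c_{ij} + 1)/N}{c_i \, c_j / N^{2}},\; 0 \right),
\qquad
\mathrm{NPMI} = \frac{\mathrm{PMI}}
    {\displaystyle \sum_{j=2}^{n} \sum_{i=1}^{j-1} -\log \frac{c_{ij} + 0.01}{N}}
```

The normalisation divides the summed PMI by the summed normaliser, not pair by pair. If the denominator is zero, the function returns 0 for NPMI.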

In [None]:
from tm_metrics.feature_extraction import get_vocabulary, get_word_frequencies
from tm_metrics.metrics import coherence

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from gensim.models import Word2Vec

In [None]:
def prepare_data_for_metrics(data):
  """
  Prepares the data for the metrics in the format the installed implementation expects.
  Takes a list of strings as input and returns the two word-frequency structures.
  """
  n_features = int(len(data)/2)
  tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english', max_features=1000)
  tfidf = tfidf_vectorizer.fit_transform(data)
  tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
  vocabulary = get_vocabulary(data)
  word_frequency, word_frequency_in_documents = get_word_frequencies(data)
  return word_frequency, word_frequency_in_documents

In [None]:
def count_metrics(x, word_frequency, word_frequency_in_documents):

    """
    Computes PMI, NPMI, and coherence for each topic.
    Takes topic words (a list of lists of strings) and the two outputs of
    prepare_data_for_metrics() as input.
    """

    pmi_results = []
    npmi_results = []
    coherence_results = []


    n_samples = len(x)  # number of topics; this value is passed to pmi() as n_docs

    for tw in x:
        pmi_ = pmi(tw, word_frequency, word_frequency_in_documents, n_samples, normalise=False)
        npmi_ = pmi(tw, word_frequency, word_frequency_in_documents, n_samples, normalise=True)

        coherence_ = coherence(tw, word_frequency, word_frequency_in_documents)


        pmi_results.append(pmi_)
        npmi_results.append(npmi_)
        coherence_results.append(coherence_)

    return {'pmi_results': pmi_results,
          'npmi_results': npmi_results,
          'coherence_results': coherence_results}
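
To see what the redefined `pmi` computes for a single topic, a toy corpus helps. The sketch below is self-contained and makes two assumptions: the frequency structures are a word-to-count mapping and a word-to-set-of-document-ids mapping (inferred from how `pmi` indexes and intersects them), and the corpus itself is made up. `toy_pmi` is a hypothetical stand-in that mirrors `pmi(..., normalise=False)`; here `n_docs` is the number of toy documents.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus: each inner list is one tokenised document (made up for illustration).
docs = [
    ["virus", "wuhan"],
    ["virus", "health"],
    ["wuhan", "health"],
    ["virus", "wuhan", "health"],
]

# Assumed input shapes: word -> corpus count, and word -> set of document ids.
word_frequency = Counter(word for doc in docs for word in doc)
word_docs = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for word in doc:
        word_docs[word].add(doc_id)


def toy_pmi(topic_words, word_frequency, word_docs, n_docs):
    """Sum of clipped, smoothed pairwise PMI values over all word pairs."""
    total = 0.0
    for w_i, w_j in combinations(topic_words, 2):
        co_occurrences = len(word_docs[w_i] & word_docs[w_j])
        joint = (co_occurrences + 1.0) / n_docs
        independent = (word_frequency[w_i] * word_frequency[w_j]) / n_docs ** 2
        total += max(math.log(joint / independent), 0.0)
    return total


score = toy_pmi(["virus", "wuhan", "health"], word_frequency, word_docs, n_docs=len(docs))
print(score)
```

Each of the three word pairs co-occurs in two of the four documents and every word appears three times, so each pair contributes log((3/4) / (9/16)) = log(4/3) to the sum.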

In [None]:
def get_result_table(d):

  """
  Computes the average PMI, NPMI, and coherence.
  Returns a pandas DataFrame with the mean and standard deviation of each metric.
  """
  pmi_results = d['pmi_results']
  npmi_results = d['npmi_results']
  coherence_results = d['coherence_results']


  avg_pmi, std_pmi = np.mean(pmi_results), np.std(pmi_results)
  avg_npmi, std_npmi = np.mean(npmi_results), np.std(npmi_results)
  avg_coherence, std_coherence = np.mean(coherence_results), np.std(coherence_results)



  data = [
    ["PMI", avg_pmi, std_pmi],
    ["NPMI", avg_npmi, std_npmi],
    ["Coherence", avg_coherence, std_coherence],
  ]
  columns = ["Metric", "Avg", "Std"]
  df = pd.DataFrame(data, columns=columns)
  return df
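
Note that `np.std` defaults to the population standard deviation (`ddof=0`), so the `Std` column divides by the number of topics rather than by one less. A minimal sketch with the standard library, using made-up scores, shows the difference between the two conventions:

```python
import statistics

# Made-up per-topic scores, for illustration only.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_std = statistics.pstdev(scores)  # same convention as np.std(scores)
sample_std = statistics.stdev(scores)       # same convention as np.std(scores, ddof=1)

print(population_std, sample_std)
```

With only a handful of topics per dataset, the gap between the two conventions is noticeable, which is worth keeping in mind when comparing the `Std` values with numbers from other tools.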

The experiments below all follow the same steps: we get the topic words for each dataset, print them, then compute the metrics and print those as well.
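
Since every experiment repeats the same topic-collection loop, that step could be factored into a small helper. The sketch below is hypothetical (`collect_topics` is not part of the repository) and assumes, as the loops in this notebook do, that `pipeline` returns a dict mapping topic names to word lists, possibly with a `common_key_words` entry that should be skipped. The example dict is made up.

```python
def collect_topics(distr, skip_key="common_key_words"):
    """Return the topic word lists from a pipeline result, leaving out the shared keywords."""
    return [words for key, words in distr.items() if key != skip_key]


# Made-up pipeline output, for illustration only.
example_distr = {
    "topic0": ["virus", "health", "wuhan"],
    "topic1": ["cases", "outbreak", "hubei"],
    "common_key_words": ["said"],
}

topics = collect_topics(example_distr)
print(topics)
```

The resulting list of lists has the shape that `count_metrics` expects as its first argument.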

## Coronavirus 1 Dataset

In [None]:
with open('micro_topic_modelling/text_data/coronavirus.txt') as f:
    lines = f.readlines()

In [None]:
distr = mtm.pipeline(lines)

Getting sentence embeddings...
Getting sentence clusters...




Computing LDA...
Preparing your results...
Complete!


In [None]:
for key, value in distr.items():
    print(key, ':', ', '.join(value))

topic0 : 15, human, data, population, cases, situation, probable, coronavirus, ranging, able, year, medical, organisation, said, media, origins, virus, old, second, health, died, wuhan, authorities
topic1 : area, cases, report, worried, commission, said, virus, family, detection, widely, chinese, contact, health, new, pneumonia, attempts, wuhan, authorities, including, outbreak, hubei
topic2 : human, area, cases, coronavirus, people, radiographs, detected, recognised, said, organisation, illness, time, tweet, virus, mystery, recommended, shortcomings, enhanced, health, things, reports, contract, eyeglasses
topic3 : infected, sunday, 68, cases, people, commission, 324, illness, seven, condition, visited, 81, proven, fever, health, contracted, wuhan, authorities, patients, confirmed
topic4 : spokesman, washington, department, population, epidemics, cases, conceivable, month, ministry, said, patient, mayer, fix, state, political, fact, foreign, 20, zhao, fuelled, farrar, lookout, universi

In [None]:
word_frequency, word_frequency_in_documents = prepare_data_for_metrics(lines)

In [None]:
topics = []
for key in distr.keys():
    if key != 'common_key_words':
        topics.append(distr[key])
results = count_metrics(topics, word_frequency, word_frequency_in_documents)
get_result_table(results)

      Metric         Avg        Std
0        PMI  402.477379  94.027514
1       NPMI    0.556557   0.035232
2  Coherence    0.837278   9.199869


## Coronavirus 2 Dataset

In [None]:
with open('micro_topic_modelling/text_data/coronavirus_part_2.txt') as f:
    lines = f.read()

In [None]:
distr = mtm.pipeline(lines)

Getting sentence embeddings...
Getting sentence clusters...




Computing LDA...
Preparing your results...
Complete!


In [None]:
for key, value in distr.items():
    print(key, ':', ', '.join(value))

topic0 : city, spring, hubei, stop, ezhou, october, according, necessities, friday, museum, train, 95, said, passengers, general, huanggang, crowded, forbidden, cancelled, health, stations, diseases, session, distance, ignoring
topic1 : city, asked, facebook, tourists, primary, president, responsibly, special, said, care, rising, visit, ing, state, school, told, reported, nicola, leave, china, wuhan, session, come, tsai, 26
topic2 : kong, people, hong, lunar, year, said, media, virus, developed, authority, china, new, sars, believed, province, wuhan, concern, confirmed, outbreak, tripled, hubei
topic3 : britain, time, said, study, strategy, head, scheduled, pandemics, told, uk, brexit, reported, existential, chief, likely, government, security, outhwaite, diseases, come, outbreak, ghs
topic4 : cafes, apart, stop, tour, people, markets, police, internet, said, outdoor, events, masks, scale, park, ordered, prohibited, hotels, playgrounds, cultural, banned, rumours, bs, local, notice, hal

In [None]:
with open('micro_topic_modelling/text_data/coronavirus_part_2.txt') as f:
    lines = f.readlines()
word_frequency, word_frequency_in_documents = prepare_data_for_metrics(lines)

In [None]:
topics = []
for key in distr.keys():
    if key != 'common_key_words':
        topics.append(distr[key])
results = count_metrics(topics, word_frequency, word_frequency_in_documents)
get_result_table(results)

      Metric         Avg        Std
0        PMI  514.941049  68.439725
1       NPMI    0.602925   0.172321
2  Coherence    6.007732  12.317476


## French Coronavirus Dataset

In [None]:
with open('micro_topic_modelling/text_data/french_coronavirus.txt') as f:
    lines = f.read()

In [None]:
distr = mtm_fr.pipeline(lines)

Getting sentence embeddings...
Getting sentence clusters...
Computing LDA...




Preparing your results...
Complete!


In [None]:
for key, value in distr.items():
    print(key, ':', ', '.join(value))

topic0 : doses, signe, satisfait, vu, milliard, vaccin, pays, rapport, arme, certainement, retardée, vaccination, virus, comme, significatif, campagne, plus, oms, nouvelle, contre, dmitriev, très, mondiale
topic1 : étude, grand, rapport, vacciner, début, sondage, vaccination, directeur, nonagénaire, où, intentions, faire, 19, aucune, france, sort, plus, saint, assure, très, sud, français
topic2 : olivier, alarmant, grand, côté, week, signal, milliard, retraitée, pasteur, sécurité, hors, être, directeur, loin, suivent, end, bon, collègues, massives, sort, saint, très, produire, nuit, observé
topic3 : doses, salle, lieux, tour, arrivées, bertheau, vaccin, marie, raymonda, vaccination, problème, opération, heure, centre, aucune, sûr, gamaleya, absolument, minutes, quart, déjà, produire, cloche, semaine
topic4 : personnes, volontaires, 16, réaction, toutefois, 000, vaccin, placebo, ligne, produit, groupe, dont, phase, italie, heureuse, milliers, également, plus, effets, ans, essai, tous, a

In [None]:
with open('micro_topic_modelling/text_data/french_coronavirus.txt') as f:
    lines = f.readlines()
word_frequency, word_frequency_in_documents = prepare_data_for_metrics(lines)

In [None]:
topics = []
for key in distr.keys():
    if key != 'common_key_words':
        topics.append(distr[key])
results = count_metrics(topics, word_frequency, word_frequency_in_documents)
get_result_table(results)

      Metric        Avg        Std
0        PMI  577.01596  91.154104
1       NPMI    0.53572   0.073373
2  Coherence  13.345617   8.101391


## Iran 1 Dataset

In [None]:
with open('micro_topic_modelling/text_data/iran_1.txt') as f:
    lines = f.read()

In [None]:
distr = mtm.pipeline(lines)

Getting sentence embeddings...
Getting sentence clusters...




Computing LDA...
Preparing your results...
Complete!


In [None]:
for key, value in distr.items():
    print(key, ':', ', '.join(value))

topic0 : country, tuesday, nuclear, wednesday, said, weapons, york, general, rights, negotiator, protests, new, raisi, vague, iran, assembly, president
topic1 : headscarf, death, iranian, police, year, erfan, said, head, hashtags, tightened, enforcement, amini, women, historic, past, morality, hijab, iran, ebrahim, iranwire
topic2 : defiance, hijabs, hashtag, mahsa_amini, police, scarves, waving, taken, flames, protest, hair, posting, cut, headscarves, women, emotional, air, circle, hijab
topic3 : country, sound, people, police, arrested, detained, assault, internet, organise, media, difficult, social, story, additional, does, government, independently, regime, cities, president
topic4 : kurdish, said, khamenei, mahsa, death, protests, control, amini, women, iranian, resistance, people, iran, cities, president
topic5 : demonstrations, death, islamic, heard, iranian, 000, dictator, said, forces, witness, state, protests, tehran, security, government, iran, cities
topic6 : dealing, offic

In [None]:
with open('micro_topic_modelling/text_data/iran_1.txt') as f:
    lines = f.readlines()
word_frequency, word_frequency_in_documents = prepare_data_for_metrics(lines)

In [None]:
topics = []
for key in distr.keys():
    if key != 'common_key_words':
        topics.append(distr[key])
results = count_metrics(topics, word_frequency, word_frequency_in_documents)
get_result_table(results)

      Metric          Avg        Std
0        PMI     31.47536  32.276422
1       NPMI     0.038957   0.033444
2  Coherence  -109.982014  25.314477


## French Iran Dataset

In [None]:
with open('micro_topic_modelling/text_data/iran_fr .txt') as f:
    lines = f.read()

In [None]:
distr = mtm_fr.pipeline(lines)

Getting sentence embeddings...
Getting sentence clusters...




Computing LDA...
Preparing your results...
Complete!


In [None]:
for key, value in distr.items():
    print(key, ':', ', '.join(value))

topic0 : twitter, whatsapp, réseau, accès, instagram, ayatollah, pays, rares, antony, internet, accéder, dont, inaccessibles, afin, accessibles, manifestations, plus, secrétaire, iran, certains
topic1 : mœurs, femme, raïssi, selon, chef, mohsen, presse, police, hôpital, emmener, gouverneur, parquet, téhéran, iranien, enquête, déclaré, créer, ebrahim, jeune, troubles
topic2 : personnes, aide, homme, fars, fait, instagram, comptes, police, lacrymogènes, derniers, être, agence, gouverneur, gaz, manifestations, iran, arrestations, foule, violences, informer, ailleurs
topic3 : détention, puis, mahsa, iranien, étrangères, amini, porte, déclaré, sous, impartiale, mort, police, enquête, sanctions, puissants
topic4 : personnes, oui, quand, montre, non, ajouté, repentent, être, orientation, deux, liberté, séances, négligence, virales, sociaux, égalité, plus, cette, foulard, femmes, responsables, etat, interrogé, député
topic5 : jeudi, personnes, membres, selon, hommes, enfant, manifestants, simi

In [None]:
with open('micro_topic_modelling/text_data/iran_fr .txt') as f:
    lines = f.readlines()
word_frequency, word_frequency_in_documents = prepare_data_for_metrics(lines)

In [None]:
topics = []
for key in distr.keys():
    if key != 'common_key_words':
        topics.append(distr[key])
results = count_metrics(topics, word_frequency, word_frequency_in_documents)
get_result_table(results)

      Metric         Avg        Std
0        PMI   79.102385  49.796082
1       NPMI    0.080538   0.028063
2  Coherence  -94.730515  21.422938
