In this notebook we show how to use the tm_metrics library to extract the quality metrics proposed in [Topic Quality Metrics Based on Distributed Word Representations](https://logic.pdmi.ras.ru/~sergey/papers/N16_SIGIR.pdf).

To demonstrate the tm_metrics library, we use the example from [Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation](https://scikit-learn.org/0.20/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py) with a few modifications. For the topics generated by the NMF algorithm, we summarise each metric provided by the library (i.e., Coherence, TFIDF-Coherence, LCP, PMI, NPMI and Topic_W2V).

**Attention:** Although we use the NMF algorithm for topic modelling, the quality metrics can be calculated for the output of any topic modelling algorithm. All we need is the top words of each topic and the raw documents.
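Because the metrics only need each topic as a list of words, any model that yields per-topic word weights can be adapted. A minimal sketch, assuming a hypothetical `topic_word_weights` mapping and a hypothetical helper `top_words_per_topic` (neither is part of tm_metrics):

```python
# Hypothetical output of an arbitrary topic model: for each topic,
# a mapping from word to its weight in that topic (made-up values).
topic_word_weights = {
    0: {"game": 0.9, "team": 0.8, "disk": 0.1, "season": 0.6},
    1: {"drive": 0.7, "disk": 0.9, "game": 0.05, "scsi": 0.8},
}


def top_words_per_topic(weights_by_topic, n_top_words):
    """Return one list of the highest-weighted words per topic."""
    topics = []
    for topic_id in sorted(weights_by_topic):
        weights = weights_by_topic[topic_id]
        # Sort words by descending weight and keep the first n_top_words.
        ranked = sorted(weights, key=weights.get, reverse=True)
        topics.append(ranked[:n_top_words])
    return topics


example_topics = top_words_per_topic(topic_word_weights, n_top_words=3)
print(example_topics)
```

The resulting list of word lists has the same shape as the `topics` variable built later in this notebook.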

# Configuration

In [1]:
from time import time

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

from tm_metrics.feature_extraction import get_tfidf_matrices, get_vocabulary, get_word_frequencies
from tm_metrics.metrics import coherence, tfidf_coherence, lcp, pmi, topic_w2v

In [2]:
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
seed = 29

In [3]:
def get_topics(model, feature_names, n_top_words):
    topics = []

    for topic_idx, topic in enumerate(model.components_):
        topic = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        topics.append(topic)

    return topics


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx}: "
        message += " ".join(
            [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
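Both helpers rely on the slice `argsort()[:-n_top_words - 1:-1]` to pick the indices of the largest weights in descending order. A small sketch with made-up weights and a toy vocabulary shows what the slice selects:

```python
import numpy as np

# Made-up weights for a five-word vocabulary in a single topic.
feature_names = ["apple", "bread", "cider", "dates", "eggs"]
topic = np.array([0.2, 0.9, 0.1, 0.7, 0.4])
n_top_words = 3

# argsort() orders indices by ascending weight; the slice walks that
# order backwards from the end and stops after n_top_words entries.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```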

# Topic Modelling with TF-IDF and NMF for the 20 Newsgroups Dataset

Loading the 20 Newsgroups dataset:

In [4]:
dataset = fetch_20newsgroups(shuffle=True, random_state=seed,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

TF-IDF Representation:

In [5]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(data_samples)

NMF for topic modelling:

In [6]:
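# Note: `alpha` was removed in scikit-learn 1.2; newer releases use `alpha_W` and `alpha_H` instead.
nmf = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5)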
nmf.fit(tfidf)

NMF(alpha=0.1, beta_loss='frobenius', init=None, l1_ratio=0.5, max_iter=200,
  n_components=10, random_state=1, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

Topics generated by the NMF algorithm using the TF-IDF representation:

In [7]:
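# Note: `get_feature_names()` was removed in scikit-learn 1.2; newer releases provide `get_feature_names_out()`.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()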
print_top_words(nmf, tfidf_feature_names, n_top_words)
topics = get_topics(nmf, tfidf_feature_names, n_top_words)

Topic #0: just like don people think know good time ve right say way make did really does want ll going use
Topic #1: windows files use dos file program version ftp ms 00 application directory using hi pc window used image printer drivers
Topic #2: game team games hockey play year league win players season teams player best runs think points star fan good night
Topic #3: god jesus faith christian man christ sin christians father church life believe son lord bible religion truth christianity belief word
Topic #4: drive scsi disk drives hard floppy ide mb computer controller tape sale cd dos pay bus format speed power local
Topic #5: com list request mailing send hp sun edu article email ibm address voice internet want file wish user ask games
Topic #6: thanks advance address know looking does mail hi info send anybody help interested appreciate information email appreciated software like work
Topic #7: key chip clipper encryption keys government escrow security privacy public use algori

# Topic Quality Metrics

For each topic generated by the NMF algorithm in the previous step, we calculate the metrics provided by the tm_metrics library. We report the **mean** and **standard deviation** of each metric.
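As a reference point, the coherence metric is commonly defined in the literature (Mimno et al., 2011) as shown below, where $w_1, \dots, w_M$ are the top words of topic $t$ ordered by weight, $D(w)$ is the number of documents containing $w$, and $D(w, w')$ is the number of documents containing both words. Whether tm_metrics follows exactly this form is an assumption:

$$
C(t) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(w_m, w_l) + 1}{D(w_l)}
$$

Since the ratio inside the logarithm is usually below one, scores of this kind are typically negative, and values closer to zero indicate more coherent topics.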

First, we have to extract some information from the dataset documents. **We compute this information before running the metrics to improve efficiency.** Otherwise, each metric would have to compute it again. This way the computation is done only once and its results are reused.

In [8]:
tfidf_matrix, tfidf_matrix_transpose = get_tfidf_matrices(data_samples)
vocabulary = get_vocabulary(data_samples)
word_frequency, word_frequency_in_documents = get_word_frequencies(data_samples)
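The idea of computing corpus statistics once and reusing them for every metric can be sketched as follows. This is only an illustration: the helper `count_document_frequencies`, the whitespace tokenisation, and the returned structures are assumptions, and they may differ from what `get_word_frequencies` actually returns.

```python
from collections import Counter
from itertools import combinations


def count_document_frequencies(documents):
    """Count, in one pass, how many documents contain each word and each word pair."""
    word_doc_freq = Counter()
    pair_doc_freq = Counter()
    for document in documents:
        # Naive tokenisation; a set keeps each word once per document.
        words = sorted(set(document.lower().split()))
        word_doc_freq.update(words)
        # Pairs are stored in sorted order so (a, b) and (b, a) share a key.
        pair_doc_freq.update(combinations(words, 2))
    return word_doc_freq, pair_doc_freq


toy_documents = [
    "disk drive scsi",
    "disk drive floppy",
    "game team season",
]
word_doc_freq, pair_doc_freq = count_document_frequencies(toy_documents)
print(word_doc_freq["disk"], pair_doc_freq[("disk", "drive")])
```

Every metric that needs document or co-document counts can then look them up in these tables instead of scanning the corpus again.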

Calculation of all quality metrics for the topics we generated:

In [9]:
pmi_results = []
npmi_results = []
coherence_results = []
tfidf_coherence_results = []
lcp_results = []

for topic_words in topics:
    pmi_ = pmi(topic_words, word_frequency, word_frequency_in_documents, n_samples, normalise=False)
    npmi_ = pmi(topic_words, word_frequency, word_frequency_in_documents, n_samples, normalise=True)
    coherence_ = coherence(topic_words, word_frequency, word_frequency_in_documents)
    tfidf_coherence_ = tfidf_coherence(topic_words, tfidf_matrix_transpose, vocabulary)
    lcp_ = lcp(topic_words, word_frequency, word_frequency_in_documents)
    
    pmi_results.append(pmi_)
    npmi_results.append(npmi_)
    coherence_results.append(coherence_)
    tfidf_coherence_results.append(tfidf_coherence_)
    lcp_results.append(lcp_)
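For intuition about the difference between calling `pmi` with `normalise=False` and `normalise=True`, here is a sketch of the standard pairwise definitions, PMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) and NPMI = PMI / (-log p(w1, w2)), with probabilities estimated from document counts. The counts are made up, the function names `pair_pmi` and `topic_pmi` are hypothetical, and tm_metrics may smooth or aggregate the pairwise scores differently.

```python
import math
from itertools import combinations

# Made-up document counts for a corpus of n_docs documents.
n_docs = 100
doc_freq = {"disk": 20, "drive": 25, "scsi": 10}
co_doc_freq = {
    ("disk", "drive"): 15,
    ("disk", "scsi"): 8,
    ("drive", "scsi"): 9,
}


def pair_pmi(w1, w2, normalise=False):
    """PMI (or NPMI) of one word pair; no smoothing, so co-occurrence must be nonzero."""
    p1 = doc_freq[w1] / n_docs
    p2 = doc_freq[w2] / n_docs
    p12 = co_doc_freq[tuple(sorted((w1, w2)))] / n_docs
    value = math.log(p12 / (p1 * p2))
    if normalise:
        # NPMI rescales PMI into [-1, 1] by dividing by -log p(w1, w2).
        value /= -math.log(p12)
    return value


def topic_pmi(topic_words, normalise=False):
    """Average the pairwise score over all distinct word pairs of a topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(pair_pmi(a, b, normalise) for a, b in pairs) / len(pairs)


topic = ["disk", "drive", "scsi"]
print(topic_pmi(topic), topic_pmi(topic, normalise=True))
```

The normalised variant is bounded, which makes scores easier to compare across topics and corpora than raw PMI.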

We now have all the quality metrics for each topic, where every topic was generated with the TF-IDF representation and the NMF algorithm.

Next, we compute the average and standard deviation of each metric:

In [10]:
avg_pmi, std_pmi = np.mean(pmi_results), np.std(pmi_results)
avg_npmi, std_npmi = np.mean(npmi_results), np.std(npmi_results)
avg_coherence, std_coherence = np.mean(coherence_results), np.std(coherence_results)
avg_tfidf, std_tfidf = np.mean(tfidf_coherence_results), np.std(tfidf_coherence_results)
avg_lcp, std_lcp = np.mean(lcp_results), np.std(lcp_results)

data = [
    ["PMI", avg_pmi, std_pmi],
    ["NPMI", avg_npmi, std_npmi],
    ["Coherence", avg_coherence, std_coherence],
    ["TFIDF_Coherence", avg_tfidf, std_tfidf],
    ["LCP", avg_lcp, std_lcp]
]
columns = ["Metric", "Avg", "Std"]
df = pd.DataFrame(data, columns=columns)
df

|   | Metric | Avg | Std |
|---|--------|-----|-----|
| 0 | PMI | 303.014431 | 101.428537 |
| 1 | NPMI | 0.285646 | 0.085899 |
| 2 | Coherence | -156.715721 | 23.784665 |
| 3 | TFIDF_Coherence | -131.323619 | 30.57812 |
| 4 | LCP | -375.27175 | 49.22821 |
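Note that `np.std` defaults to the population standard deviation (`ddof=0`), so that is what the `Std` column reports. With ten topics, the sample estimate (`ddof=1`) would be about 5% larger. A sketch using the standard library on made-up scores illustrates the difference:

```python
import statistics

# Made-up per-topic scores, standing in for one of the result lists.
scores = [1.0, 2.0, 3.0, 4.0]

mean = statistics.mean(scores)
# Population standard deviation: divides by n, like np.std(scores).
population_std = statistics.pstdev(scores)
# Sample standard deviation: divides by n - 1, like np.std(scores, ddof=1).
sample_std = statistics.stdev(scores)

print(mean, round(population_std, 4), round(sample_std, 4))
```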
