
OCTIS: Optimizing and Comparing Topic models Is Simple

Implementation of LDA and NMF topic models using the OCTIS framework

https://github.com/MIND-Lab/OCTIS

https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_LDA_training_only.ipynb

https://octis.readthedocs.io/_/downloads/en/latest/pdf/

https://github.com/MIND-Lab/OCTIS/blob/7529e23c0f852076a46b88c8c073d54a8bf0d26b/octis/dataset/dataset.py

https://github.com/MIND-Lab/OCTIS/blob/7529e23c0f852076a46b88c8c073d54a8bf0d26b/octis/models/LDA.py

https://github.com/MIND-Lab/OCTIS/blob/7529e23c0f852076a46b88c8c073d54a8bf0d26b/octis/models/NMF.py

Evaluation of the LDA, NMF and BERTopic models using the topic diversity and topic coherence (NPMI and CV) metrics provided by the OCTIS framework

https://github.com/MIND-Lab/OCTIS/tree/master/octis/evaluation_metrics

https://github.com/MIND-Lab/OCTIS/issues/61

https://github.com/MIND-Lab/OCTIS/issues/55

In [1]:
%%capture
!pip install octis

In [2]:
from octis.models.LDA import LDA
from octis.models.NeuralLDA import NeuralLDA
from octis.models.NMF import NMF
from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
import pandas as pd
import io

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


BERTopic

In [4]:
# Read the results of the five BERTopic models (Model 0 to Model 4) into pandas DataFrames
bertopic_results = [
    pd.read_csv(f'/content/drive/MyDrive/BT INFO/Results/Model_{i}_Topics_Complete.csv')
    for i in range(5)
]

In [5]:
# Convert each model's BERTopic results into the {'topics': [[word, ...], ...]} dict expected by the OCTIS metrics
bertopic_results_list = []
for model_results_df in bertopic_results:
    # Drop the first row and the saved index column; not done in place, so re-running this cell is safe
    topic_words_df = model_results_df.drop(index=0).drop(columns='Unnamed: 0')
    bertopic_results_list.append({'topics': topic_words_df.values.tolist()})
# bertopic_results_list
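
The structure produced by the cell above can be illustrated without pandas. This is a minimal sketch, assuming a CSV laid out like the exported BERTopic files (an unnamed index column followed by one column per top word). The CSV content and the helper name `csv_to_octis_output` are hypothetical and only serve as an illustration.

```python
import csv
import io

# Hypothetical stand-in for one exported BERTopic CSV: an unnamed index column,
# one row per topic, one column per top word. The contents are made up.
csv_text = """,Word 1,Word 2,Word 3
0,the,and,for
1,election,vote,ballot
2,news,fox,media
"""

def csv_to_octis_output(text, skip_rows=1):
    """Build the {'topics': [[word, ...], ...]} dict from CSV text.

    skip_rows is the number of leading topic rows to discard
    (the notebook cell discards the first one).
    """
    rows = list(csv.reader(io.StringIO(text)))
    data_rows = rows[1 + skip_rows:]                    # skip the header and the discarded rows
    return {"topics": [row[1:] for row in data_rows]}   # drop the index column

octis_output = csv_to_octis_output(csv_text)
print(octis_output)
```

Each entry of `octis_output["topics"]` is a plain list of strings, which is the shape passed to the metrics' `score` method later in this notebook.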

OCTIS: LDA & NMF

In [6]:
# Load custom Parler dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("/content/drive/MyDrive/BT INFO/OCTIS/TSV without partition")

In [7]:
# Check Parler dataset
print(len(dataset.get_corpus()))
print(*list(dataset.get_corpus()[0:10]), sep="\n")

309069
['body']
['glad', 'see', 'parler', 'free', 'speech', 'actually', 'alive', 'well', 'looking', 'forward', 'mixing', 'keeping', 'truth', 'alive', 'face', 'constant', 'bias', 'face', 'let']
['not', 'enough', 'year', 'minimum']
['wonder', 'kamalaharris', 'blm', 'think', 'white', 'guy', 'placed', 'spot', 'power', 'dog', 'eat', 'dog', 'world', 'let', 'fight', 'amongst']
['agreed', 'seemed', 'like', 'close', 'race', 'till', 'inner', 'city', 'started', 'reporting', 'fake', 'result']
['well', 'well', 'abercrombie', 'fitch', 'president', 'canada', 'end', 'fan', 'disciple', 'trumpism', 'all', 'canada', 'first']
['well', 'well', 'well', 'fuck', 'fuckin', 'year', 'ago', 'know', 'every', 'damn', 'thing', 'say', 'except', 'jumping', 'cliff', 'even', 'damn', 'air', 'breath']
['big', 'guy', 'need', 'keep', 'bagman', 'close', 'prevent', 'ever', 'questioned', 'high']
['fear', 'post', 'trumpwhenever', 'isthe', 'republican', 'party', 'revert', 'back', 'play', 'nice', 'roll', 'over', 'sell', 'base', '

In [8]:
# Create OCTIS models: LDA, NMF
model_lda = LDA(num_topics=10, alpha=0.1)
model_lda.partitioning(False)
model_nmf = NMF(num_topics=10)
model_nmf.partitioning(False)
# NeuralLDA and CTM were not trained: even with a Colab Pro subscription, the available RAM was insufficient
# model_neural_lda = NeuralLDA(num_topics=10, num_epochs=1, num_layers=1, num_neurons=10)
# model_ctm = CTM(num_topics=10, num_epochs=1, num_layers=1, num_neurons=10, bert_model="all-MiniLM-L6-v2")

In [9]:
# Train OCTIS models: LDA and NMF
topics_lda = model_lda.train_model(dataset)
topics_nmf = model_nmf.train_model(dataset)

In [10]:
# Print topics extracted by OCTIS models: LDA and NMF
print("Topics extracted by LDA: ")
for topic in topics_lda['topics']:
    print(topic)
print("\nTopics extracted by NMF: ")
for topic in topics_nmf['topics']:
    print(topic)

Topics extracted by LDA: 
['law', 'constitution', 'court', 'congress', 'senator', 'capitol', 'corrupt', 'military', 'justice', 'system']
['not', 'people', 'know', 'get', 'would', 'going', 'want', 'think', 'take', 'one']
['election', 'trump', 'state', 'not', 'republican', 'vote', 'fraud', 'party', 'need', 'democrat']
['are', 'you', 'they', 'going', 'coming', 'will', 'not', 'get', 'wait', 'black']
['not', 'would', 'twitter', 'parler', 'video', 'have', 'get', 'account', 'post', 'see']
['like', 'antifa', 'look', 'blm', 'police', 'war', 'shit', 'woman', 'name', 'fuck']
['god', 'trump', 'president', 'patriot', 'love', 'you', 'thank', 'penny', 'great', 'america']
['biden', 'year', 'trump', 'president', 'last', 'joe', 'communist', 'vote', 'obama', 'ago']
['news', 'watch', 'china', 'fake', 'money', 'fox', 'paid', 'traitor', 'coward', 'commie']
['supporter', 'big', 'medium', 'trump', 'lol', 'follow', 'maga', 'tech', 'part', 'social']

Topics extracted by NMF: 
['trump', 'supporter', 'suck', 'wou

In [11]:
# Transform the topics found by LDA and NMF to dataframes and save as CSV files
topics_df_lda = pd.DataFrame(topics_lda['topics'],
                             index=[f'Topic {i}' for i in range(10)],
                             columns=[f'Word {i}' for i in range(1, 11)])
topics_df_lda.to_csv('OCTIS_LDA_Topics_Complete.csv')
topics_df_lda

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
Topic 0,law,constitution,court,congress,senator,capitol,corrupt,military,justice,system
Topic 1,not,people,know,get,would,going,want,think,take,one
Topic 2,election,trump,state,not,republican,vote,fraud,party,need,democrat
Topic 3,are,you,they,going,coming,will,not,get,wait,black
Topic 4,not,would,twitter,parler,video,have,get,account,post,see
Topic 5,like,antifa,look,blm,police,war,shit,woman,name,fuck
Topic 6,god,trump,president,patriot,love,you,thank,penny,great,america
Topic 7,biden,year,trump,president,last,joe,communist,vote,obama,ago
Topic 8,news,watch,china,fake,money,fox,paid,traitor,coward,commie
Topic 9,supporter,big,medium,trump,lol,follow,maga,tech,part,social


In [12]:
# Same as above for the topics found by NMF
topics_df_nmf = pd.DataFrame(topics_nmf['topics'],
                             index=[f'Topic {i}' for i in range(10)],
                             columns=[f'Word {i}' for i in range(1, 11)])
topics_df_nmf.to_csv('OCTIS_NMF_Topics_Complete.csv')
topics_df_nmf

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
Topic 0,trump,supporter,suck,would,donald,maga,win,know,republican,team
Topic 1,you,are,they,time,have,back,know,that,thank,will
Topic 2,not,doe,did,will,know,even,would,want,say,anything
Topic 3,president,god,biden,country,america,never,trump,love,bless,year
Topic 4,election,vote,biden,state,fraud,democrat,ballot,republican,voter,voting
Topic 5,need,country,back,take,make,would,state,stop,stand,time
Topic 6,going,think,one,right,take,know,that,cannot,democrat,would
Topic 7,get,not,cannot,let,will,back,away,time,way,rid
Topic 8,people,american,god,know,want,country,many,america,government,right
Topic 9,like,would,news,one,look,fox,see,good,never,sound


Evaluation of LDA, NMF and BERTopic using OCTIS metrics

In [13]:
results_df = pd.DataFrame(index = ['LDA', 'NMF', 'BERTopic Model 0', 'BERTopic Model 1', 'BERTopic Model 2', 'BERTopic Model 3', 'BERTopic Model 4'], columns = ['Topic Coherence NPMI', 'Topic Coherence CV', 'Topic Diversity'])
results_df

Unnamed: 0,Topic Coherence NPMI,Topic Coherence CV,Topic Diversity
LDA,,,
NMF,,,
BERTopic Model 0,,,
BERTopic Model 1,,,
BERTopic Model 2,,,
BERTopic Model 3,,,
BERTopic Model 4,,,


In [16]:
# Evaluation metric: NPMI (topic coherence)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi') # Initialize metric
results_df.loc['LDA', 'Topic Coherence NPMI'] = round(npmi.score(topics_lda), 4)
results_df.loc['NMF', 'Topic Coherence NPMI'] = round(npmi.score(topics_nmf), 4)
# enumerate yields each BERTopic model's position directly (list.index would return the first equal element);
# .loc avoids chained assignment
for i, topics_list in enumerate(bertopic_results_list):
    results_df.loc[f'BERTopic Model {i}', 'Topic Coherence NPMI'] = round(npmi.score(topics_list), 4)
results_df

Unnamed: 0,Topic Coherence NPMI,Topic Coherence CV,Topic Diversity
LDA,0.0508,,
NMF,0.042,,
BERTopic Model 0,0.1269,,
BERTopic Model 1,0.1537,,
BERTopic Model 2,0.1269,,
BERTopic Model 3,0.1203,,
BERTopic Model 4,0.1251,,
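
As a rough illustration of what the NPMI score measures, the sketch below computes NPMI for word pairs from document-level co-occurrence and averages it over all word pairs of a topic. This is a simplification under stated assumptions, not the OCTIS implementation: the library may estimate the probabilities differently (for example with a sliding window), so the values are not comparable to the table above. The toy corpus and the function names are hypothetical.

```python
import math
from itertools import combinations

def npmi_pair(w1, w2, docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
    if p12 == 0:
        return -1.0   # the words never co-occur: lower bound of NPMI
    if p12 == 1:
        return 1.0    # both words occur in every document: avoid dividing by log(1) = 0
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)

def topic_npmi(topic_words, docs):
    """Average NPMI over all unordered word pairs of one topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)

# Toy tokenized corpus (made up): 'a' and 'b' always appear together, 'a' and 'c' never do.
toy_docs = [['a', 'b'], ['a', 'b'], ['c'], ['c', 'd']]
print(npmi_pair('a', 'b', toy_docs))
print(npmi_pair('a', 'c', toy_docs))
print(topic_npmi(['c', 'd'], toy_docs))
```

NPMI ranges from -1 (words never co-occur) through 0 (independence) to 1 (words always co-occur), which is why higher values in the table indicate more coherent topics.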


In [20]:
# Evaluation metric: CV (topic coherence)
cv = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_v') # Initialize metric
results_df.loc['LDA', 'Topic Coherence CV'] = round(cv.score(topics_lda), 2)
results_df.loc['NMF', 'Topic Coherence CV'] = round(cv.score(topics_nmf), 2)
# enumerate yields each BERTopic model's position directly; .loc avoids chained assignment
for i, topics_list in enumerate(bertopic_results_list):
    results_df.loc[f'BERTopic Model {i}', 'Topic Coherence CV'] = round(cv.score(topics_list), 2)
results_df

Unnamed: 0,Topic Coherence NPMI,Topic Coherence CV,Topic Diversity
LDA,0.0508,0.58,0.87
NMF,0.042,0.59,0.69
BERTopic Model 0,0.1269,0.69,0.83
BERTopic Model 1,0.1537,0.71,0.89
BERTopic Model 2,0.1269,0.67,0.88
BERTopic Model 3,0.1203,0.65,0.91
BERTopic Model 4,0.1251,0.68,0.9


In [19]:
# Evaluation metric: TopicDiversity (proportion of unique words among the top-k words of all topics)
topic_diversity = TopicDiversity(topk=10) # Initialize metric
results_df.loc['LDA', 'Topic Diversity'] = round(topic_diversity.score(topics_lda), 4)
results_df.loc['NMF', 'Topic Diversity'] = round(topic_diversity.score(topics_nmf), 4)
# enumerate yields each BERTopic model's position directly; .loc avoids chained assignment
for i, topics_list in enumerate(bertopic_results_list):
    results_df.loc[f'BERTopic Model {i}', 'Topic Diversity'] = round(topic_diversity.score(topics_list), 4)
results_df

Unnamed: 0,Topic Coherence NPMI,Topic Coherence CV,Topic Diversity
LDA,0.0508,0.5778,0.87
NMF,0.042,0.586,0.69
BERTopic Model 0,0.1269,0.6948,0.83
BERTopic Model 1,0.1537,0.7053,0.89
BERTopic Model 2,0.1269,0.6722,0.88
BERTopic Model 3,0.1203,0.6522,0.91
BERTopic Model 4,0.1251,0.6753,0.9
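
Topic diversity can be checked by hand: it is the number of distinct words divided by the total number of top words over all topics. The sketch below applies this to the ten LDA topics printed earlier in this notebook. The helper name `topic_diversity_score` is hypothetical, and this is an independent re-computation, not the OCTIS code itself.

```python
def topic_diversity_score(topics, topk=10):
    """Share of unique words among the first topk words of every topic."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)

# The ten LDA topics as printed in the notebook output above.
lda_topics = [
    ['law', 'constitution', 'court', 'congress', 'senator', 'capitol', 'corrupt', 'military', 'justice', 'system'],
    ['not', 'people', 'know', 'get', 'would', 'going', 'want', 'think', 'take', 'one'],
    ['election', 'trump', 'state', 'not', 'republican', 'vote', 'fraud', 'party', 'need', 'democrat'],
    ['are', 'you', 'they', 'going', 'coming', 'will', 'not', 'get', 'wait', 'black'],
    ['not', 'would', 'twitter', 'parler', 'video', 'have', 'get', 'account', 'post', 'see'],
    ['like', 'antifa', 'look', 'blm', 'police', 'war', 'shit', 'woman', 'name', 'fuck'],
    ['god', 'trump', 'president', 'patriot', 'love', 'you', 'thank', 'penny', 'great', 'america'],
    ['biden', 'year', 'trump', 'president', 'last', 'joe', 'communist', 'vote', 'obama', 'ago'],
    ['news', 'watch', 'china', 'fake', 'money', 'fox', 'paid', 'traitor', 'coward', 'commie'],
    ['supporter', 'big', 'medium', 'trump', 'lol', 'follow', 'maga', 'tech', 'part', 'social'],
]

lda_diversity = topic_diversity_score(lda_topics)
print(round(lda_diversity, 2))  # 87 distinct words out of 100, i.e. 0.87 as in the LDA row
```

Words such as 'not' and 'trump' appear in four topics each, which is what pulls the score below 1.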


In [37]:
# Save dataframe with metrics as CSV file
results_df.to_csv('OCTIS_Evaluation_LDA_NMF_BERTopic.csv')