Step 1: Loading Data
For this tutorial, we’ll use a dataset of papers published at the NIPS conference. NIPS (Neural Information Processing Systems) is one of the most prestigious annual events in the machine learning community. The CSV data file contains information on the NIPS papers published from 1987 to 2016 (29 years!). These papers cover a wide variety of machine learning topics, from neural networks to optimization methods and many more.

Let’s start by looking at the contents of the file.

In [2]:
import zipfile
import pandas as pd
import os

# Open the zip file
with zipfile.ZipFile("./NIPS Papers.zip", "r") as zip_ref:
    # Extract the file to a temporary directory
    zip_ref.extractall("temp")

# Read the CSV file into a pandas DataFrame
papers = pd.read_csv("temp/NIPS Papers/papers.csv")

# Print head
papers.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


Step 2: Data Cleaning
Since the goal of this analysis is topic modeling, we will focus solely on the text of each paper and drop the other metadata columns.

In [3]:
# Remove the metadata columns (columns= already names the axis, so axis=1 is not needed)
papers = papers.drop(columns=['id', 'title', 'abstract',
                              'event_type', 'pdf_name', 'year'])

# sample only 100 papers
papers = papers.sample(100)

# Print out the first rows of papers
papers.head()

Unnamed: 0,paper_text
2389,Simplified Rules and Theoretical Analysis for\...
6024,Using hippocampal 'place cells' for\nnavigatio...
1932,From Batch to Transductive Online Learning\n\n...
1603,Information Bottleneck for\nGaussian Variables...
282,A Constructive RBF Network\nfor Writer Adaptat...


Remove punctuation/lower casing
Next, let’s apply some simple preprocessing to the paper_text column to make the text more amenable to analysis and the results more reliable. To do that, we’ll use a regular expression to remove common punctuation (commas, periods, exclamation marks and question marks) and then lowercase the text.

In [4]:
# Load the regular expression library
import re

# Remove punctuation (commas, periods, exclamation marks and question marks)
papers['paper_text_processed'] = papers['paper_text'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the paper text to lowercase
papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: x.lower())

# Print out the first rows of papers
papers['paper_text_processed'].head()

2389    simplified rules and theoretical analysis for\...
6024    using hippocampal 'place cells' for\nnavigatio...
1932    from batch to transductive online learning\n\n...
1603    information bottleneck for\ngaussian variables...
282     a constructive rbf network\nfor writer adaptat...
Name: paper_text_processed, dtype: object
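
The regular expression above strips only four characters (, . ! ?). As a hedged alternative sketch, not part of the original notebook, the standard library's string.punctuation can be used to drop every ASCII punctuation mark in one pass. The helper name strip_punctuation is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return the text lowercased, with all ASCII punctuation removed."""
    return text.translate(_PUNCT_TABLE).lower()

print(strip_punctuation("Using hippocampal 'place cells' for navigation!"))
```

Note that this also removes hyphens and apostrophes, so "self-organization" becomes "selforganization"; whether that is desirable depends on the downstream tokenizer.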

Tokenize words and further clean-up text
Let’s tokenize each document into a list of words, removing punctuation and unnecessary characters altogether.

In [7]:
import gensim
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; the tokenizer itself drops punctuation

data = papers.paper_text_processed.values.tolist()
data_words = list(sent_to_words(data))

print(data_words[:1][0][:30])

['simplified', 'rules', 'and', 'theoretical', 'analysis', 'for', 'information', 'bottleneck', 'optimization', 'and', 'pca', 'with', 'spiking', 'neurons', 'lars', 'buesing', 'wolfgang', 'maass', 'institute', 'for', 'theoretical', 'computer', 'science', 'graz', 'university', 'of', 'technology', 'graz', 'austria', 'larsmaass']


Step 3: Phrase Modeling: Bigram and Trigram Models
Bigrams are two words that frequently occur together in the documents. Trigrams are three words that frequently occur together. Examples of such phrases are 'back_bumper', 'oil_leakage', 'maryland_college_park', etc.

Gensim's Phrases model can build bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold.

The higher the values of these parameters, the harder it is for words to be combined into phrases.
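
To see why higher values make phrases rarer, here is a minimal sketch of the scoring rule, assuming the default scorer documented for Gensim's Phrases: (count(a, b) - min_count) * N / (count(a) * count(b)), where N is the vocabulary size. A word pair is joined only if its score exceeds threshold. The function name phrase_score and all counts below are hypothetical and chosen only for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; a higher score means a stronger collocation."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: word a seen 50 times, word b 40 times, the pair 30 times
score_low_min = phrase_score(50, 40, 30, vocab_size=10000, min_count=5)
score_high_min = phrase_score(50, 40, 30, vocab_size=10000, min_count=20)

threshold = 100
print(score_low_min, score_low_min > threshold)    # pair is joined
print(score_high_min, score_high_min > threshold)  # pair is not joined
```

Raising min_count lowers every score, and raising threshold raises the bar a score must clear, so both changes reduce the number of phrases formed.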

In [8]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Phraser objects are a faster, lighter way to apply the trained bigram/trigram models
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

Remove Stopwords, Make Bigrams and Lemmatize
The phrase models are ready. Let’s define functions to remove stopwords, make bigrams and trigrams, and lemmatize, and then call them in sequence.

In [11]:
# NLTK Stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/damlasenturk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokens, keeping only the allowed part-of-speech tags (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [13]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 1.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [14]:
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1][0][:30])

['simplify', 'rule', 'theoretical', 'analysis', 'information_bottleneck', 'optimization', 'pca', 'spiking_neuron', 'lar', 'buese', 'technology', 'abstract', 'show', 'suitable', 'assumption', 'primarily', 'linearization', 'simple', 'perspicuous', 'online', 'learning', 'rule', 'optimization', 'spiking_neuron', 'derive', 'rule', 'perform', 'common', 'benchmark', 'task']


Step 4: Data transformation: Corpus and Dictionary
The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them.

In [15]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0][:30])

[(0, 1), (1, 2), (2, 1), (3, 4), (4, 2), (5, 1), (6, 2), (7, 1), (8, 1), (9, 4), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 10), (18, 1), (19, 1), (20, 1), (21, 9), (22, 3), (23, 2), (24, 4), (25, 1), (26, 1), (27, 5), (28, 2), (29, 1)]
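
Each (id, count) pair in the output above says how many times the word with that id occurs in the document. The idea can be sketched with the standard library alone. This is an illustration of the bag-of-words representation, not Gensim's implementation; build_vocab and doc_to_bow are hypothetical helper names, and the id assignment order is an assumption made for simplicity.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two tiny hypothetical documents
docs = [['rule', 'analysis', 'rule'], ['analysis', 'network']]
token2id = build_vocab(docs)
print(token2id)
print([doc_to_bow(d, token2id) for d in docs])
```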


Step 5: Base Model
We have everything required to train the base LDA model. In addition to the corpus and dictionary, you need to provide the number of topics. Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a 1.0/num_topics prior (we'll use the defaults for the base model).

chunksize controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory.

passes controls how often we train the model on the entire corpus (set to 10). Another word for passes might be "epochs". iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of "passes" and "iterations" high enough.
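
As a rough, hedged illustration of how these settings interact, the sketch below counts how many chunks of documents are processed over a training run and builds the symmetric 1.0/num_topics prior mentioned above. It is a simplification: it ignores how LdaMulticore distributes chunks across workers, and both function names are hypothetical.

```python
import math

def count_chunks_processed(num_docs, chunksize, passes):
    """Number of document chunks seen over the whole training run."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass * passes

def symmetric_prior(num_topics):
    """Symmetric Dirichlet prior: the same 1.0/num_topics weight for every topic."""
    return [1.0 / num_topics] * num_topics

# Settings used in this tutorial: 100 sampled papers, chunksize=100, passes=10
print(count_chunks_processed(num_docs=100, chunksize=100, passes=10))
print(symmetric_prior(10)[0])
```

With 100 documents and chunksize=100, each pass processes the whole corpus as a single chunk, so the number of passes alone determines how often the model sees the data.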

In [16]:
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

The above LDA model is built with 10 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics().
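
Each topic returned by print_topics() is a string of the form weight*"word" + weight*"word" + ... If the weights are needed as numbers, a small parser along these lines can help. This is a sketch with a hypothetical helper name, parse_topic; Gensim's show_topic() method returns (word, probability) pairs directly and is likely the simpler choice inside the notebook.

```python
import re

# Matches terms such as 0.025*"model"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a print_topics() string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

example = '0.025*"model" + 0.012*"use" + 0.011*"network"'
print(parse_topic(example))
```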

In [17]:
from pprint import pprint

# Print the keywords in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.025*"model" + 0.012*"use" + 0.011*"network" + 0.010*"graph" + 0.010*"set" '
  '+ 0.008*"learn" + 0.007*"give" + 0.007*"structure" + 0.007*"datum" + '
  '0.007*"distribution"'),
 (1,
  '0.023*"model" + 0.010*"influence" + 0.010*"use" + 0.009*"graph" + '
  '0.009*"parameter" + 0.009*"function" + 0.008*"set" + 0.008*"result" + '
  '0.008*"cluster" + 0.008*"time"'),
 (2,
  '0.009*"method" + 0.009*"feature" + 0.008*"set" + 0.008*"training" + '
  '0.008*"use" + 0.007*"figure" + 0.007*"system" + 0.006*"neural" + '
  '0.006*"learn" + 0.006*"rate"'),
 (3,
  '0.011*"network" + 0.010*"cluster" + 0.010*"game" + 0.009*"player" + '
  '0.008*"noise" + 0.008*"point" + 0.008*"value" + 0.008*"set" + 0.007*"use" + '
  '0.006*"learn"'),
 (4,
  '0.008*"model" + 0.008*"use" + 0.008*"point" + 0.008*"datum" + '
  '0.007*"result" + 0.007*"problem" + 0.007*"cluster" + 0.007*"order" + '
  '0.006*"error" + 0.006*"number"'),
 (5,
  '0.010*"use" + 0.010*"problem" + 0.010*"function" + 0.009*"example" + '
 

Compute Model Perplexity and Coherence Score
Let's calculate the baseline coherence score.

In [18]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.2715528592528609
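
The cell above computes only coherence. For perplexity, Gensim's lda_model.log_perplexity(corpus) returns a per-word likelihood bound, and Gensim reports the corresponding perplexity as 2 raised to the negative of that bound. The sketch below shows just the conversion; perplexity_from_bound is a hypothetical helper, and the bound value is made up for illustration rather than taken from this model.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)

# Hypothetical bound; in the notebook it would come from
# lda_model.log_perplexity(corpus)
hypothetical_bound = -8.0
print(perplexity_from_bound(hypothetical_bound))
```

Lower perplexity indicates a better fit to the data, although it does not always agree with human judgments of topic quality, which is one reason coherence is used for tuning here.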


Step 6: Hyperparameter tuning
First, let's differentiate between model hyperparameters and model parameters:

Model hyperparameters can be thought of as settings for a machine learning algorithm that are tuned by the data scientist before training. Examples would be the number of trees in a random forest or, in our case, the number of topics K.

Model parameters can be thought of as what the model learns during training, such as the weights for each word in a given topic.

Now that we have the baseline coherence score for the default LDA model, let's perform a series of sensitivity tests to help determine the following model hyperparameters:

Number of Topics (K)
Dirichlet hyperparameter alpha: Document-Topic Density
Dirichlet hyperparameter beta: Word-Topic Density
We'll run these tests as a grid: each parameter is varied in turn while the others are held constant, and the whole grid is run over two different validation corpus sets (75% and 100% of the corpus). We'll use C_v as our metric for performance comparison.
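
Before launching the search it helps to know how many models will be trained. The sketch below rebuilds the grid with itertools.product, assuming the value ranges used in this step's tuning cell (2 to 10 topics, numeric priors from 0.01 to 0.91 in steps of 0.3, plus the named priors). The numeric values are written out by hand here instead of using np.arange.

```python
from itertools import product

topics_range = range(2, 11)
# Stand-in for np.arange(0.01, 1, 0.3): 0.01, 0.31, 0.61, 0.91
numeric_values = [round(0.01 + 0.3 * i, 2) for i in range(4)]
alpha = numeric_values + ['symmetric', 'asymmetric']
beta = numeric_values + ['symmetric']
corpus_title = ['75% Corpus', '100% Corpus']

# Every (validation set, K, alpha, beta) combination that will be evaluated
grid = list(product(corpus_title, topics_range, alpha, beta))
print(len(grid))
print(grid[0])
```

The grid holds 540 combinations, which matches the total shown by the progress bar during the run; at roughly 12 seconds per model that is close to two hours of training.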

In [19]:
# Supporting function: train an LDA model and return its C_v coherence score
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

In [21]:
import numpy as np
import tqdm

grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
               corpus]

corpus_title = ['75% Corpus', '100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Can take a long time to run
if 1 == 1:
    pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))
    
    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    
                    pbar.update(1)
    # to_csv does not create missing directories, so make sure ./results exists
    os.makedirs('./results', exist_ok=True)
    pd.DataFrame(model_results).to_csv('./results/lda_tuning_results.csv', index=False)
    pbar.close()

 92%|███████████████████████████████████▊   | 495/540 [1:42:07<09:45, 13.01s/it]

Step 7: Final Model
Based on external evaluation (code to be added from an Excel-based analysis), let's train the final model with the parameters yielding the highest coherence score.
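
Until that analysis is added, the best row can also be picked programmatically. The following is a minimal sketch using only the standard library; best_row is a hypothetical helper, the column names match the keys of model_results, and the sample rows are invented for illustration and are not results from this notebook.

```python
import csv
import io

# Hypothetical rows in the same layout as lda_tuning_results.csv
sample_csv = """Validation_Set,Topics,Alpha,Beta,Coherence
100% Corpus,2,0.01,0.31,0.30
100% Corpus,5,symmetric,0.61,0.42
100% Corpus,9,0.31,symmetric,0.37
"""

def best_row(csv_file):
    """Return the row with the highest coherence score."""
    rows = csv.DictReader(csv_file)
    return max(rows, key=lambda row: float(row['Coherence']))

best = best_row(io.StringIO(sample_csv))
print(best['Topics'], best['Alpha'], best['Beta'], best['Coherence'])
```

In the notebook, the same function could be applied to the saved file, for example by opening './results/lda_tuning_results.csv' and passing the file object to best_row.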

In [22]:
num_topics = 8

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.01,
                                           eta=0.9)

In [23]:
from pprint import pprint

# Print the keywords in the 8 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.011*"model" + 0.006*"object" + 0.006*"input" + 0.005*"cell" + 0.005*"use" '
  '+ 0.004*"neuron" + 0.004*"word" + 0.004*"performance" + 0.003*"show" + '
  '0.003*"learn"'),
 (1,
  '0.006*"player" + 0.006*"task" + 0.004*"regret" + 0.004*"game" + '
  '0.003*"regression" + 0.003*"different" + 0.003*"classification" + '
  '0.003*"result" + 0.003*"utility" + 0.003*"parameter"'),
 (2,
  '0.006*"set" + 0.006*"method" + 0.006*"feature" + 0.006*"training" + '
  '0.005*"neural" + 0.005*"use" + 0.005*"network" + 0.005*"figure" + '
  '0.004*"system" + 0.004*"learn"'),
 (3,
  '0.013*"network" + 0.007*"learn" + 0.006*"value" + 0.006*"layer" + '
  '0.005*"use" + 0.005*"set" + 0.004*"language" + 0.004*"neural" + '
  '0.004*"method" + 0.004*"game"'),
 (4,
  '0.011*"model" + 0.008*"learn" + 0.007*"use" + 0.007*"cluster" + '
  '0.007*"point" + 0.006*"graph" + 0.006*"result" + 0.006*"set" + 0.006*"show" '
  '+ 0.006*"function"'),
 (5,
  '0.009*"function" + 0.009*"set" + 0.008*"use" + 0.008*"learn

Step 8: Visualize Results

In [27]:
import pyLDAvis.gensim_models as gensimvis
import pickle 
import pyLDAvis

# Visualize the topics
pyLDAvis.enable_notebook()

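# Make sure the output directory exists before writing to it
os.makedirs('./results', exist_ok=True)
LDAvis_data_filepath = os.path.join('./results', 'ldavis_tuned_' + str(num_topics))

# This is a bit time-consuming; keep the condition True
# if you want to execute the visualization prep yourself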
if 1 == 1:
    LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

pyLDAvis.save_html(LDAvis_prepared, './results/ldavis_tuned_'+ str(num_topics) +'.html')

LDAvis_prepared

Hierarchical LDA

In [32]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
# data_lemmatized is the list of tokenized, lemmatized documents built earlier
dictionary = Dictionary(data_lemmatized)
corpus = [dictionary.doc2bow(doc) for doc in data_lemmatized]

# Training the top-level LDA model
top_model = LdaModel(corpus, num_topics=10, id2word=dictionary)

# For each topic, we collect the documents most relevant to that topic
topic_docs = {}
for i, doc in enumerate(corpus):
    topic = max(top_model[doc], key=lambda x: x[1])[0]
    if topic not in topic_docs:
        topic_docs[topic] = []
    topic_docs[topic].append(data_lemmatized[i])

# Now we train a separate LDA model for each set of documents
sub_models = {}
for topic, docs in topic_docs.items():
    sub_dictionary = Dictionary(docs)
    sub_corpus = [sub_dictionary.doc2bow(doc) for doc in docs]
    sub_model = LdaModel(sub_corpus, num_topics=5, id2word=sub_dictionary)
    sub_models[topic] = sub_model

In [33]:
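# Visualize the topics of the last sub-model trained in the loop above
# (sub_model, sub_corpus and sub_dictionary hold the values from the final iteration)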
vis = gensimvis.prepare(sub_model, sub_corpus, sub_dictionary)
pyLDAvis.display(vis)
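
The two-level approach above hinges on assigning each document to its single most probable top-level topic before training the sub-models. That grouping step can be sketched in isolation as follows; the function names and the topic probabilities are hypothetical, and the input mimics the list of (topic_id, probability) pairs that the top-level model yields per document.

```python
from collections import defaultdict

def dominant_topic(topic_probs):
    """Return the id of the topic with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

def group_by_dominant_topic(doc_topic_probs):
    """Map each topic id to the indices of the documents it dominates."""
    groups = defaultdict(list)
    for doc_index, topic_probs in enumerate(doc_topic_probs):
        groups[dominant_topic(topic_probs)].append(doc_index)
    return dict(groups)

# Hypothetical (topic_id, probability) pairs for three documents
hypothetical = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
print(group_by_dominant_topic(hypothetical))
```

Because each document lands in exactly one group, a topic that dominates only a few documents gives its sub-model very little data, which is worth keeping in mind with a 100-paper sample.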

Correlated Topic Models

In [44]:
import pandas as pd
from contextualized_topic_models.models.ctm import CombinedTM
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
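# initialize the CTM
# NOTE: bow_size is meant to be the size of the bag-of-words vocabulary;
# len(data_lemmatized) is the number of documents
ctm = CombinedTM(n_components=10, model_type="prodLDA", bow_size=len(data_lemmatized), contextual_size=768)

# fit the CTM to the data
# NOTE: fit() expects a dataset object exposing idx2token rather than a list of
# token lists, which is why this call raises the AttributeError shown below
ctm.fit(data_lemmatized)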

# visualize
topics = ctm.get_topic_lists(5)  # get top 5 words for each topic
topic_word_distributions = ctm.get_topic_word_matrix()
vis_data = gensimvis.prepare(topic_word_distributions, data_lemmatized, ctm.vectorizer)
pyLDAvis.display(vis_data)

AttributeError: 'list' object has no attribute 'idx2token'