Ignis: Latent Dirichlet Allocation
============

In [1]:
import glob
import math
import pathlib
import re
import threading
import time

import gensim
import nltk
import pyLDAvis
import tomotopy as tp
import tqdm

In [2]:
import ipywidgets as widgets

Model training (LDA)
----

Load from an `ignis.Corpus`, add the processed docs to an LDA model, and train it.

Both the random seed and the number of parallel workers can affect the trained model, so both need to be fixed if the results are to be reproducible.
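To illustrate why the seed matters, here is a minimal sketch of a toy collapsed Gibbs sampler for LDA in plain Python. It is not how `ignis` or `tomotopy` train their models; the function `toy_lda_assignments` and the tiny corpus are hypothetical and exist only for this illustration. It shows the seed effect only: every random draw comes from one seeded generator, so the same seed reproduces the same topic assignments. The effect of the number of workers is not modelled here.

```python
import random


def toy_lda_assignments(docs, k, seed, iterations=20, alpha=0.1, eta=0.01):
    """Toy collapsed Gibbs sampler; returns the topic assigned to each token."""
    rng = random.Random(seed)  # every random draw below comes from this generator
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * k for _ in docs]  # token counts per (document, topic)
    topic_word = [{} for _ in range(k)]  # token counts per (topic, word)
    topic_total = [0] * k                # token counts per topic
    assignments = []

    # Random initialisation: the first place where the seed matters.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(k)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] = topic_word[z].get(word, 0) + 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1

                # Resample the token's topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(word, 0) + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] = topic_word[z].get(word, 0) + 1
                topic_total[z] += 1

    return assignments


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["goal", "match", "team", "goal"],
    ["market", "shares", "profit", "market"],
    ["team", "match", "profit", "shares"],
]

run_a = toy_lda_assignments(docs, k=2, seed=42)
run_b = toy_lda_assignments(docs, k=2, seed=42)
run_c = toy_lda_assignments(docs, k=2, seed=7)

print(run_a == run_b)  # same seed, same draws, same assignments
print(run_a == run_c)  # a different seed may give different assignments
```

In a parallel sampler the order in which tokens are updated can also depend on how the work is split across workers, which is why the number of workers has to be fixed as well.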

In [3]:
import ignis

In [4]:
corpus = ignis.load_corpus("bbc-full.corpus")

In [5]:
# model_options = {"k": 10, "term_weighting": "idf", "until_max_ll": True, "verbose": True}
# vis_options = {"verbose": True}
# results = ignis.train_model(corpus, model_type="lda", model_options=model_options, vis_type="pyldavis", vis_options=vis_options)

In [6]:
# results.save("bbc-full.aurum")
results = ignis.load_results("bbc-full.aurum")

In [7]:
results.init_labeller("tomotopy", verbose=True)

Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.


Textual results
------

In [19]:
def show_topic(topic_id=1):
    print(f"[Topic {topic_id}]\n")
    
    # Labels
    labels = ", ".join(
        label for label, score in results.get_topic_labels(topic_id, top_n=20)
    )
    print(f"Suggested labels:\n{labels}\n")

    # Print this topic
    words_probs = results.get_topic_words(topic_id, top_n=10)
    words = [x[0] for x in words_probs]

    words = ", ".join(words)
    print(f"Top words:\n{words}")

widgets.interact(show_topic, topic_id=(1, results.get_num_topics()))


Graphical visualisation
--------

In [9]:
vis_data = results.get_vis_data()
pyLDAvis.display(vis_data, local=True)



In [10]:
results.export_visualisation("visualisation")

Experimenting with slightly different random results
--------------------------------
- Changed the number of workers to 16 (the default is 8)

In [11]:
# model_options = {"k": 10, "term_weighting": "idf", "until_max_ll": True, "verbose": True, "workers": 16}
# results2 = ignis.train_model(corpus, model_type="lda", model_options=model_options)

In [12]:
# Example: Initialising the visualisation after training
# results2.init_vis("pyldavis", force=True, verbose=True)

In [13]:
# results2.save("bbc-full-2.aurum")
results2 = ignis.load_results("bbc-full-2.aurum")

In [14]:
vis_data2 = results2.get_vis_data()
pyLDAvis.display(vis_data2, local=True)

Finding documents that belong to a given topic
--------

Because topics are distributions over words and documents are distributions over topics, a document does not belong to a single topic per se; every topic is represented in every document with some probability.

We therefore have to choose a cut-off `n`: a document counts as belonging to the topic we are interested in if that topic is among the document's top `n` most probable topics.

This is especially significant for topic models that use a term weighting scheme, because the common words (i.e., what we might consider stopwords) tend to get grouped into a single large topic. If we only consider each document's single most probable topic, we will unintentionally exclude documents that have this "stopwords" topic as their top topic.
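The effect of the cut-off can be sketched with plain dictionaries. This is a minimal illustration, not the `ignis` implementation: the helper `topic_documents`, the document IDs, and the probabilities are all hypothetical. Topic 6 plays the role of the "stopwords" topic and Topic 4 is the topic of interest.

```python
def topic_documents(doc_topic_probs, topic_id, within_top_n=1):
    """Return (doc_id, probability) pairs for documents that have `topic_id`
    among their `within_top_n` most probable topics, most probable first."""
    matches = []
    for doc_id, probs in doc_topic_probs.items():
        # Topic IDs for this document, ranked from most to least probable.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if topic_id in ranked[:within_top_n]:
            matches.append((doc_id, probs[topic_id]))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical document-topic distributions ({topic_id: probability}).
doc_topic_probs = {
    "doc_a": {4: 0.55, 6: 0.40, 1: 0.05},  # Topic 4 is the top topic
    "doc_b": {6: 0.60, 4: 0.35, 1: 0.05},  # the stopwords topic comes first
    "doc_c": {6: 0.70, 1: 0.25, 4: 0.05},  # Topic 4 is only a minor topic
}

top1 = topic_documents(doc_topic_probs, topic_id=4, within_top_n=1)
top2 = topic_documents(doc_topic_probs, topic_id=4, within_top_n=2)

print([doc_id for doc_id, _ in top1])
print([doc_id for doc_id, _ in top2])
```

With a cut-off of 1, `doc_b` is missed even though Topic 4 is its most probable topic after the stopwords topic; a cut-off of 2 picks it up while still leaving out `doc_c`, where Topic 4 is marginal.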

In [15]:
import pprint

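# Looking at the visualisation, we notice that there is a single "stopwords" topic (Topic 6).
# If we are interested in Topic 4, we accept documents that have it among their top 2 topics,
# so that documents whose top topic is the "stopwords" topic are not excluded.
topic_docs = [doc for doc, prob in results2.get_topic_documents(topic_id=4, within_top_n=2)]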

def show_topic_docs(index=0):
    doc_id = topic_docs[index]
    print(results2.get_document(doc_id))
    print()
    pprint.pprint(results2.get_document_topics(doc_id, 10))

widgets.interact(show_topic_docs, index=(0, len(topic_docs) - 1))


Slicing and iteration
--------
After seeing what the main topics might be, we can slice the initial corpus further and re-run topic modelling to get better resolution.