Ignis: Latent Dirichlet Allocation
============

In [1]:
import ignis

In [2]:
# Python setup: the `PYTHONHASHSEED` environment variable must be set *before* the Python kernel is initialised.
# We only print it here for easy post-hoc reference.
import os
os.environ.get("PYTHONHASHSEED")

'11399'

In [3]:
# Jupyter notebook setup
import ipywidgets as widgets
from IPython.display import display, HTML

# Custom styling:
# - Prevent vertical scrollbars in output subareas
# - Resize to fit pyLDAvis visualisations without causing other cells to overflow
style = """
<style>
   .jupyter-widgets-output-area .output_scroll {
        height: unset !important;
        border-radius: unset !important;
        -webkit-box-shadow: unset !important;
        box-shadow: unset !important;
    }
    .jupyter-widgets-output-area  {
        height: auto !important;
    }
</style>
<style>
    #notebook-container { width: 1370px !important; }
    div.output_area { width: unset !important; }
</style>
"""
display(HTML(style))

Model training (LDA)
----

Load a saved `ignis.Corpus`, add its processed documents to an LDA model, and train it.

The random seed and parallelisation can both affect results, so setting the seed and number of workers is necessary for reproducibility.
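As a minimal illustration of why the hash seed has to be fixed before the interpreter starts, the sketch below launches fresh Python processes with `PYTHONHASHSEED` set explicitly and compares a string hash across them. It uses only the standard library; the helper name `string_hash_in_fresh_interpreter` is made up for this example and is not part of `ignis`.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('ignis')."""
    # Hypothetical helper for illustration only; not an ignis function.
    env = {**os.environ, "PYTHONHASHSEED": str(hash_seed)}
    completed = subprocess.run(
        [sys.executable, "-c", "print(hash('ignis'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


# Two separate interpreters started with the same hash seed agree on string hashes.
first = string_hash_in_fresh_interpreter(11399)
second = string_hash_in_fresh_interpreter(11399)
print(first == second)
```

Changing `os.environ["PYTHONHASHSEED"]` inside an already running kernel has no effect on that kernel's own string hashing, which is why the setup cell only prints the value rather than setting it.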

In [4]:
corpus = ignis.load_corpus("bbc-full.corpus")

In [5]:
model_options = {
    "k": 6,
    "term_weighting": "idf",
    "workers": 12,
    "until_max_ll": True,
    "verbose": True,
}
vis_options = {"verbose": True}
results = ignis.train_model(
    corpus,
    model_type="tp_lda",
    model_options=model_options,
    vis_type="pyldavis",
    vis_options=vis_options,
)

Training LDA model on 2122 documents:
{'term_weighting': 'idf', 'k': 6, 'seed': 11399, 'workers': 12, 'parallel_scheme': 'default', 'iterations': 1000, 'update_every': 100, 'until_max_ll': True, 'max_extra_iterations': 5000, 'verbose': True, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}



100%|██████████| 1000/1000 [00:12<00:00, 82.06it/s, Log-likelihood=-21.18028]



Continuing to train until maximum log-likelihood.
(N.B.: This may not correlate with increased interpretability)



2900it [00:39, 72.59it/s, Log-likelihood=-21.07227]


Model training complete. (53.189s)
Preparing LDA visualisation . . . . . . . . . . . . . . Done. (17.056s)


In [6]:
# results.save("bbc-full.aurum")
# results = ignis.load_results("bbc-full.aurum")

In [7]:
results.init_labeller("tomotopy", verbose=True)

Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.


Graphical visualisation
--------

In [8]:
results.show_visualisation()

In [9]:
# results.export_visualisation("bbc_results")

Exploring the top words and suggested labels for each topic
------

In [10]:
results.nb_show_topics(top_labels=15, top_words=15)

interactive(children=(IntSlider(value=1, description='topic_id', max=6, min=1), Output()), _dom_classes=('widg…

<function ignis.aurum.Aurum.nb_show_topics.<locals>.show_topic(topic_id=1)>

Exploring documents that "belong" to a given topic
------

Because topics are distributions over words and documents are *distributions* over topics, documents don't belong to individual topics per se; every topic is represented in every document with some probability.

We therefore have to specify `n`: a document is treated as belonging to a topic if that topic is among the document's `n` most probable topics.

This is especially significant for topic models that use a term weighting scheme, because all the common words (i.e., what we might consider stopwords) tend to get grouped into a single large topic; if we only consider each document's single most probable topic, we will unintentionally exclude documents which have this "stopwords" topic as their top topic.
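The selection rule described above can be sketched in plain Python. This is only an illustration on made-up numbers, not how `ignis` implements it: the function name `docs_with_topic_in_top_n`, the toy distributions, and the choice of Topic 1 as the "stopwords" topic are all assumptions for the example. Topic IDs are 1-based to match the notebook.

```python
def docs_with_topic_in_top_n(doc_topic_dists, topic_id, within_top_n=1):
    """Return IDs of documents whose `within_top_n` most probable topics include `topic_id`."""
    selected = []
    for doc_id, dist in doc_topic_dists.items():
        # Rank 1-based topic IDs from most to least probable for this document.
        ranked = sorted(range(1, len(dist) + 1), key=lambda t: dist[t - 1], reverse=True)
        if topic_id in ranked[:within_top_n]:
            selected.append(doc_id)
    return selected


# Toy document-topic distributions (illustrative values only).
# Topic 1 plays the role of the large "stopwords" topic.
toy = {
    "doc_a": [0.6, 0.3, 0.1],
    "doc_b": [0.5, 0.1, 0.4],
    "doc_c": [0.2, 0.7, 0.1],
}

top_1 = docs_with_topic_in_top_n(toy, topic_id=2, within_top_n=1)
top_2 = docs_with_topic_in_top_n(toy, topic_id=2, within_top_n=2)
print(top_1)  # ['doc_c']
print(top_2)  # ['doc_a', 'doc_c']
```

With `within_top_n=1`, `doc_a` is missed because the "stopwords" topic outranks Topic 2 in it; widening to the top two topics recovers it.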

In [11]:
# Topic 4 appears to be related to the entertainment industry
results.nb_show_topic_documents(topic_id=4, within_top_n=2)

interactive(children=(IntSlider(value=0, description='index', max=379), Output()), _dom_classes=('widget-inter…

<function ignis.aurum.Aurum.nb_show_topic_documents.<locals>.show_topic_doc(index=0)>

Slicing and iteration
--------
After seeing what the main topics might be, we can slice the initial corpus further and re-run topic modelling to get better resolution.

In [12]:
# Try zooming in on Topic 4
entertainment_slice = results.slice_by_topic(topic_id=4, within_top_n=2)
entertainment_model = results.retrain_model(corpus_slice=entertainment_slice)

Training LDA model on 380 documents:
{'term_weighting': 'idf', 'k': 6, 'seed': 11399, 'workers': 12, 'parallel_scheme': 'default', 'iterations': 1000, 'update_every': 100, 'until_max_ll': True, 'max_extra_iterations': 5000, 'verbose': True, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}



100%|██████████| 1000/1000 [00:03<00:00, 328.08it/s, Log-likelihood=-19.58574]


Continuing to train until maximum log-likelihood.
(N.B.: This may not correlate with increased interpretability)




1800it [00:06, 277.18it/s, Log-likelihood=-19.50379]

Model training complete. (9.742s)
Extracting label candidates from model...





Preparing First-order relevance labeller...
Done.
Preparing LDA visualisation . . . . . . Done. (7.931s)


In [13]:
entertainment_model.show_visualisation()

In [14]:
# entertainment_model.export_visualisation("bbc_results_2")

The position of each topic cluster on the graph is not intrinsically informative (it is simply the result of the chosen dimensionality-reduction technique), but we can re-run the modelling algorithm with a different random seed to see whether we get a more meaningful set of topics.

(We can also override any of the previously set options.)
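One way to picture the override behaviour is a plain dictionary merge, where new values replace old ones and everything else is carried over. This is a sketch of the observable effect only (it matches the option dumps in the training logs); how `ignis` merges options internally is not shown in this notebook, and the variable names below are illustrative.

```python
# Options from the earlier training run (values taken from the training log).
previous_options = {
    "k": 6,
    "term_weighting": "idf",
    "workers": 12,
    "until_max_ll": True,
    "verbose": True,
    "seed": 11399,
}

# Only the keys we want to change.
overrides = {"seed": 1234}

# Later keys win, so `seed` is replaced and all other options are kept.
merged = {**previous_options, **overrides}
print(merged["seed"], merged["k"], merged["workers"])  # 1234 6 12
```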

In [15]:
model_options = {"seed": 1234}
entertainment_model_2 = entertainment_model.retrain_model(model_options=model_options)
entertainment_model_2.show_visualisation()

Training LDA model on 380 documents:
{'term_weighting': 'idf', 'k': 6, 'seed': 1234, 'workers': 12, 'parallel_scheme': 'default', 'iterations': 1000, 'update_every': 100, 'until_max_ll': True, 'max_extra_iterations': 5000, 'verbose': True, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}



100%|██████████| 1000/1000 [00:03<00:00, 330.25it/s, Log-likelihood=-19.62582]


Continuing to train until maximum log-likelihood.
(N.B.: This may not correlate with increased interpretability)




1600it [00:06, 266.13it/s, Log-likelihood=-19.54794]

Model training complete. (9.244s)
Extracting label candidates from model...
Preparing First-order relevance labeller...





Done.
Preparing LDA visualisation . . . . . . Done. (7.135s)


In [16]:
# entertainment_model_2.export_visualisation("bbc_results_2a")

Alternatively, we can slice the corpus by explicitly searching for documents that contain certain tokens, which is useful when we want to be sure we capture every document that mentions certain words or phrases.
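Conceptually, a token-based slice is a filter over tokenised documents, and combining it with a topic-based slice is a union of the two document sets. The sketch below shows that idea on a toy corpus; the helpers `slice_by_tokens` and `concat_slices` are stand-ins written for this example, and treating concatenation as an order-preserving union without duplicates is an assumption rather than a statement about `ignis` internals.

```python
def slice_by_tokens(docs, tokens):
    """Return IDs of documents that contain at least one of the given tokens."""
    wanted = set(tokens)
    return [doc_id for doc_id, doc_tokens in docs.items() if wanted & set(doc_tokens)]


def concat_slices(first, second):
    """Union of two ID lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*first, *second]))


# Toy tokenised corpus (illustrative only).
toy_docs = {
    "doc_1": ["oscar", "film", "award"],
    "doc_2": ["album", "chart", "band"],
    "doc_3": ["director", "actor", "premiere"],
    "doc_4": ["cinema", "ticket", "sales"],
}

# Pretend a topic-based slice picked these documents.
topic_slice = ["doc_1", "doc_3"]

token_slice = slice_by_tokens(toy_docs, ["film", "films", "movie", "cinema"])
combined = concat_slices(topic_slice, token_slice)
print(token_slice)  # ['doc_1', 'doc_4']
print(combined)     # ['doc_1', 'doc_3', 'doc_4']
```

Here `doc_3` is only found through the topic slice and `doc_4` only through the token search, which is the motivation for combining the two.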

In [17]:
# Topic that seems to deal with films
film_topic = entertainment_model_2.slice_by_topic(topic_id=2, within_top_n=2)

# Tokens that are related to films
by_token_slice = entertainment_model_2.slice_by_tokens(
    ["film", "films", "movie", "cinema"], include_root=True
)

concat_slice = film_topic.concat(by_token_slice)

model_options = {"iterations": 2000, "k": 5}
concat_model = entertainment_model_2.retrain_model(
    concat_slice, model_options=model_options
)
concat_model.show_visualisation()

Training LDA model on 309 documents:
{'term_weighting': 'idf', 'k': 5, 'seed': 1234, 'workers': 12, 'parallel_scheme': 'default', 'iterations': 2000, 'update_every': 100, 'until_max_ll': True, 'max_extra_iterations': 5000, 'verbose': True, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}



100%|██████████| 2000/2000 [00:05<00:00, 371.33it/s, Log-likelihood=-18.28157]


Continuing to train until maximum log-likelihood.
(N.B.: This may not correlate with increased interpretability)




800it [00:02, 309.84it/s, Log-likelihood=-18.27224]

Model training complete. (8.145s)
Extracting label candidates from model...
Preparing First-order relevance labeller...





Done.
Preparing LDA visualisation . . . . . Done. (6.768s)


In [18]:
# concat_model.export_visualisation("bbc_results_3")

In [20]:
# Topic 3 appears to be related to film awards
concat_model.nb_show_topic_documents(topic_id=3, within_top_n=2)

interactive(children=(IntSlider(value=0, description='index', max=162), Output()), _dom_classes=('widget-inter…

<function ignis.aurum.Aurum.nb_show_topic_documents.<locals>.show_topic_doc(index=0)>