Ignis: Hierarchical Dirichlet Process
============

In [1]:
import ignis

In [2]:
# Jupyter notebook setup
import ipywidgets as widgets
from IPython.display import display, HTML

# Custom styling:
# - Prevent vertical scrollbars in output subareas
# - Resize to fit pyLDAvis visualisations without causing other cells to overflow
style = """
<style>
   .jupyter-widgets-output-area .output_scroll {
        height: unset !important;
        border-radius: unset !important;
        -webkit-box-shadow: unset !important;
        box-shadow: unset !important;
    }
    .jupyter-widgets-output-area  {
        height: auto !important;
    }
</style>
<style>
    #notebook-container { width: 1370px !important; }
    div.output_area { width: unset !important; }
</style>
"""
display(HTML(style))

Model training (HDP)
----

Load an `ignis.Corpus`, add the processed docs to an HDP model, and train it.

The random seed and parallelisation can both affect results, so setting the seed and number of workers is necessary for reproducibility.
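
As a minimal sketch of pinning both values: the training log further down lists `seed` and `workers` among the effective options, so the helper below assumes they can be supplied through the options dictionary. That is an assumption read off the log rather than confirmed Ignis API, and `with_fixed_run` is a hypothetical helper name.

```python
def with_fixed_run(options, seed, workers):
    """Return a copy of `options` with the seed and worker count pinned."""
    pinned = dict(options)  # copy, so the caller's dictionary is left untouched
    pinned["seed"] = seed
    pinned["workers"] = workers
    return pinned


# Values taken from the training log in this notebook; the key names are
# assumed to be accepted as model options.
base = {"term_weighting": "pmi", "verbose": True}
pinned = with_fixed_run(base, seed=11399, workers=10)
print(pinned)
```

Recording the pinned dictionary alongside the results makes it possible to rerun the same configuration later.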

In [3]:
corpus = ignis.load_corpus("bbc.corpus")

With the current public version of `pyLDAvis` (2.1.2), preparing the visualisation data is very slow with recent versions of `pandas` (>0.23.4). Ignis includes an optimised version of the preparation function, which we enable here.

In [4]:
use_optimised = True
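
Rather than hard-coding the flag, one could derive a suggestion from the installed `pandas` version. The sketch below uses only the standard library; the 0.23.4 threshold comes from the note above, while `version_tuple` and `pandas_is_slow_for_pyldavis` are hypothetical helper names, not part of Ignis or pyLDAvis.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.0.3' into (1, 0, 3), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad short versions such as '0.23'
        parts.append(0)
    return tuple(parts)


def pandas_is_slow_for_pyldavis(pandas_version):
    """True when the version is newer than 0.23.4 (threshold from the text)."""
    return version_tuple(pandas_version) > (0, 23, 4)


try:
    installed = metadata.version("pandas")
except metadata.PackageNotFoundError:
    installed = None  # pandas is not installed in this environment

if installed is not None:
    suggested_use_optimised = pandas_is_slow_for_pyldavis(installed)
    print(f"pandas {installed}: suggested use_optimised = {suggested_use_optimised}")
```

The result is kept in a separate variable so that it informs the choice of `use_optimised` without overriding it.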

In [5]:
model_options = {
    "term_weighting": "pmi",
    "verbose": True,
    "workers": 10,
    "parallel_scheme": "none",
}
vis_options = {"verbose": True, "use_optimised": use_optimised}

For LDA, Ignis can suggest a number of topics based on the coherence scores of a range of lightly-trained models. HDP infers the number of topics during training, so that step is not needed here and we train the model directly.

In [None]:
results = ignis.train_model(
    corpus,
    model_type="tp_hdp",
    model_options=dict(model_options)
)

Training model on 2118 documents:
{'term_weighting': 'pmi', 'initial_k': 2, 'seed': 11399, 'workers': 10, 'parallel_scheme': 'none', 'iterations': 500, 'update_every': 100, 'until_max_ll': False, 'max_extra_iterations': 1000, 'verbose': True, 'tw': <TermWeight.PMI: 2>, 'parallel': <ParallelScheme.NONE: 1>}



  0%|          | 0/500 [00:00<?, ?it/s]

In [None]:
import pandas as pd

# Underlying topic model wrapped by the Ignis results object
model = results.ignis_model.model

# Build a table with one row per topic and one column per vocabulary term
topic_term_dists = [model.get_topic_word_dist(k) for k in range(model.k)]
topic_term_dist_cols = [
    pd.Series(topic_term_dist, dtype="float64") for topic_term_dist in topic_term_dists
]
topic_term_dists = pd.concat(topic_term_dist_cols, axis=1).T

# Sanity check: each topic's term distribution should sum to roughly 1.
# True marks a topic whose probabilities sum to less than 0.999.
topic_term_dists.sum(axis=1) < 0.999
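
The row-sum check in the cell above can be illustrated without a trained model. This is a minimal sketch in plain Python; `flag_short_rows` is a hypothetical helper, and the toy rows are made up for illustration rather than taken from the BBC model.

```python
import math


def flag_short_rows(rows, threshold=0.999):
    """Return one boolean per row: True when the row sums to less than `threshold`."""
    return [math.fsum(row) < threshold for row in rows]


# Toy values, made up for illustration only.
toy_rows = [
    [0.5, 0.3, 0.2],  # a proper distribution: probabilities sum to 1
    [0.5, 0.3, 0.1],  # probability mass is missing: sums to 0.9
]
flags = flag_short_rows(toy_rows)
print(flags)
```

A flagged row means that topic's term probabilities do not form a complete distribution, which is worth knowing before passing the table to a visualisation that expects rows summing to 1.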

In [None]:
results.init_vis("pyldavis", skip_validate=True, verbose=True)
results.show_visualisation()

In [None]:
# results.export_visualisation("bbc_results")