Ignis: Hierarchical Dirichlet Process
============

The skeleton code below trains a Hierarchical Dirichlet Process (HDP) model. It should work as-is, but the accuracy/run-time trade-off may be worse than with the basic LDA setup.

In [1]:
import ignis

In [2]:
# Jupyter notebook setup
import ipywidgets as widgets
from IPython.display import display, HTML

# Custom styling:
# - Prevent vertical scrollbars in output subareas
# - Resize to fit pyLDAvis visualisations without causing other cells to overflow
style = """
<style>
   .jupyter-widgets-output-area .output_scroll {
        height: unset !important;
        border-radius: unset !important;
        -webkit-box-shadow: unset !important;
        box-shadow: unset !important;
    }
    .jupyter-widgets-output-area  {
        height: auto !important;
    }
</style>
<style>
    #notebook-container { width: 1370px !important; }
    div.output_area { width: unset !important; }
</style>
"""
display(HTML(style))

Model training (HDP)
----

Load an `ignis.Corpus`, add the processed docs to an HDP model, and train it.

The random seed and parallelisation can both affect results, so setting the seed and number of workers is necessary for reproducibility.
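
As a minimal sketch of pinning both settings, the helper below adds an explicit seed and worker count to an options dictionary. It assumes that `seed` can be passed through `model_options` in the same way as `workers` (both keys appear in the options that ignis logs during training). `with_fixed_seed` is a hypothetical helper, and 11399 is simply the seed value reported in this notebook's training log.

```python
def with_fixed_seed(options, seed, workers):
    """Return a copy of `options` with the seed and worker count set explicitly."""
    pinned = dict(options)  # copy so the caller's dict is not mutated
    pinned["seed"] = seed
    pinned["workers"] = workers
    return pinned


# Hypothetical usage: pin the seed and worker count before training.
base_options = {"verbose": True, "until_max_coherence": True}
pinned_options = with_fixed_seed(base_options, seed=11399, workers=10)
print(pinned_options)
```

Changing either value between runs may produce different topics, so record both alongside any saved results.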

In [3]:
corpus = ignis.load_corpus("bbc.corpus")

In [4]:
model_options = {
    "verbose": True,
    "workers": 10,
    "until_max_coherence": True,
    "alpha": 0.05,
    "eta": 0.1,
    "gamma": 0.9,
    "initial_k": 7,
}
vis_options = {"verbose": True}

Precision issues might cause pyLDAvis to error out if it validates the model before preparing the visualisation, so validation is skipped below with `skip_validate=True`.
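
To illustrate the kind of precision problem involved: validation of this sort generally checks that each topic-term and document-topic distribution sums to 1 within some tolerance, and small numerical drift can fail that check. The sketch below is not pyLDAvis's actual validation code; `rows_sum_to_one`, `renormalise_rows`, the tolerance, and the example matrix are all assumptions made for illustration.

```python
import math


def rows_sum_to_one(dists, tol=1e-3):
    """Return True if every row in `dists` sums to 1 within `tol`."""
    return all(abs(math.fsum(row) - 1.0) <= tol for row in dists)


def renormalise_rows(dists):
    """Rescale each row so that it sums to 1."""
    return [[value / math.fsum(row) for value in row] for row in dists]


# Hypothetical topic-term matrix: the first row has drifted to 0.99.
topic_term = [
    [0.50, 0.30, 0.19],
    [0.25, 0.25, 0.50],
]

print(rows_sum_to_one(topic_term))                    # check on the raw rows
print(rows_sum_to_one(renormalise_rows(topic_term)))  # check after rescaling
```

This notebook does not renormalise anything; it simply bypasses the check with `skip_validate=True`.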

In [5]:
results = ignis.train_model(
    corpus, model_type="tp_hdp", model_options=dict(model_options)
)
results.init_vis("pyldavis", skip_validate=True, verbose=True)
results.show_visualisation()

Training model on 2118 documents:
{'term_weighting': 'one', 'initial_k': 7, 'seed': 11399, 'workers': 10, 'parallel_scheme': 'default', 'iterations': 500, 'update_every': 100, 'until_max_ll': False, 'until_max_coherence': True, 'max_extra_iterations': 1000, 'verbose': True, 'alpha': 0.05, 'eta': 0.1, 'gamma': 0.9, 'tw': <TermWeight.ONE: 0>, 'parallel': <ParallelScheme.DEFAULT: 0>}



Continuing to train until maximum coherence.



Best coherence: -4.40724 (Starting: -4.53217)
Docs: 2118, Vocab size: 19290, Total Words: 412636
Model training complete. (37.574s)
Preparing LDA visualisation... Done. (0.562s)
