Ignis: Latent Dirichlet Allocation
============

In [1]:
import ignis

Model training (LDA)
----

Load a saved `ignis.Corpus`, add its processed documents to an LDA model, and train it.

Both the random seed and the degree of parallelisation affect the results, so the seed and the number of workers must both be fixed for a run to be reproducible.

(Ignis sets both to default values in an attempt to keep runs reproducible, but these defaults can be overridden as necessary.)
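The override behaviour can be pictured as a plain dictionary merge. The sketch below is not Ignis's implementation: `DEFAULT_OPTIONS` and `resolve_options` are hypothetical names, and the default values are simply copied from the training log printed further down in this notebook.

```python
# Illustrative only: how user-supplied options might override library defaults.
# DEFAULT_OPTIONS and resolve_options are hypothetical, not part of Ignis.
DEFAULT_OPTIONS = {"seed": 11399, "workers": 8, "iterations": 2000}


def resolve_options(user_options=None):
    """Return the defaults with any user-supplied keys taking precedence."""
    return {**DEFAULT_OPTIONS, **(user_options or {})}


# Pinning both values explicitly makes a run independent of whatever the defaults are.
options = resolve_options({"seed": 7156, "workers": 1})
print(options)
```

Keys that are not overridden (here, `iterations`) keep their default values.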

In [2]:
corpus = ignis.load_corpus("bbc.corpus")

In [3]:
# Refer to the Tomotopy docs for model-specific options
model_options = {
    "term_weighting": "idf",
    "k": 5,
    "verbose": True,
}

In [4]:
results = ignis.train_model(corpus, model_options=model_options)
results.init_vis("pyldavis", verbose=True)
results.show_visualisation()

Training model on 2118 documents:
{'term_weighting': 'idf', 'k': 5, 'seed': 11399, 'workers': 8, 'parallel_scheme': 'default', 'iterations': 2000, 'update_every': 500, 'until_max_ll': False, 'until_max_coherence': False, 'max_extra_iterations': 2000, 'verbose': True, 'alpha': 0.1, 'eta': 0.01, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}





Docs: 2118, Vocab size: 19290, Total Words: 412636
Model training complete. (17.557s)
Preparing LDA visualisation... Done. (0.337s)


In [5]:
# results.export_visualisation("bbc_results")

## Saving results, using the automated labeller

In [6]:
results.save("bbc.aurum")
results = ignis.load_results("bbc.aurum")

In [7]:
results.init_labeller("tomotopy", verbose=True)

Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.


## Exploring documents that "belong" to a given topic

Because topics are distributions over words and documents are *distributions* over topics, documents don't belong to individual topics per se; every topic is represented in every document with some probability.

We therefore have to specify how many of each document's most probable topics (its top `n`) should be checked for the topic we are interested in.

This matters most for topic models that use a term weighting scheme, because the common words (i.e., what we might consider stopwords) tend to be grouped into a single large topic. If we consider only each document's single most probable topic, we will unintentionally exclude documents whose top topic is this "stopwords" topic. (This does not always happen, however; check the model's output and iterate from there.)

Alternatively, we can specify `ignore_topics` when slicing to completely ignore a given topic when classifying documents.
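To make the two strategies concrete, here is a minimal pure-Python sketch of the selection logic. It is not how Ignis implements slicing: `docs_with_topic`, `within_top_n` and the sample probabilities are all hypothetical, and topic IDs are taken to be 1-based.

```python
def docs_with_topic(doc_topic_probs, topic_id, within_top_n=1, ignore_topics=()):
    """Return IDs of documents whose `within_top_n` most probable topics include `topic_id`.

    doc_topic_probs maps a document ID to a list of topic probabilities;
    topic IDs are 1-based, so topic 1 is at list index 0.
    """
    selected = []
    for doc_id, probs in doc_topic_probs.items():
        # Rank topic IDs by probability, skipping any ignored topics entirely.
        ranked = sorted(
            (t for t in range(1, len(probs) + 1) if t not in ignore_topics),
            key=lambda t: probs[t - 1],
            reverse=True,
        )
        if topic_id in ranked[:within_top_n]:
            selected.append(doc_id)
    return selected


# Hypothetical data: topic 3 plays the role of the "stopwords" topic.
doc_topic_probs = {
    "a": [0.70, 0.10, 0.20],  # top topic is 1
    "b": [0.35, 0.05, 0.60],  # top topic is 3, with topic 1 second
    "c": [0.05, 0.80, 0.15],  # top topic is 2
}

print(docs_with_topic(doc_topic_probs, 1))                      # top topic only
print(docs_with_topic(doc_topic_probs, 1, within_top_n=2))      # top two topics
print(docs_with_topic(doc_topic_probs, 1, ignore_topics=(3,)))  # ignore topic 3
```

Document `b` is missed when only the top topic is checked, but is picked up either by widening the check to the top two topics or by ignoring topic 3.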

In [8]:
results.nb_explore_topics()


Slicing and iteration
--------
After seeing what the main topics might be, we can slice the initial corpus further and re-run topic modelling to get better resolution.

In [9]:
# Try zooming in on a sub-topic
tech_slice = results.slice_by_topic(topic_id=1)

We might start by checking exactly which documents a slice of the corpus contains, by exploring it directly:

In [10]:
# If all the documents in this slice have the same top topic and we order them by the probability of that topic descending,
# we will get the same document order as in the `nb_explore_topics()` widget above.
tech_slice.nb_explore(
    doc_sort_key=lambda doc: results.get_document_top_topic(doc.id)[1], reverse=True
)


Ignis can suggest a recommended number of topics to use for LDA, based on the coherence scores of a range of lightly-trained models.

The default coherence score, `u_mass`, is quick to calculate but might not correspond well to human intuitions.  Consider using `c_v` instead if runtime is not an issue.

Alternatively, you can iterate manually by increasing the number of topics if it looks like there are multiple sub-topics being joined into larger ones, and reducing the number of topics if it looks like many incoherent topics are forming instead.
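The selection step itself amounts to ranking the candidate topic counts by coherence. The sketch below is not Ignis's code; it reuses the scores reported in the log of the suggestion run in this notebook and assumes that a higher score (closer to zero for `u_mass`) is better, which is how that log orders them.

```python
# Coherence scores keyed by number of topics, copied from this notebook's suggestion log.
coherence_by_k = {
    4: -2.44842,
    5: -3.43766,
    6: -3.37790,
    7: -3.91176,
    8: -4.13817,
    9: -4.14025,
    10: -4.39088,
}

# Rank the candidates from most to least coherent (higher score first).
ranked = sorted(coherence_by_k.items(), key=lambda item: item[1], reverse=True)
best_k, best_score = ranked[0]

print(best_k, best_score)
print([k for k, _ in ranked])
```

Looking at the whole ranking, rather than only the winner, shows how close the runners-up are, which can help when deciding whether to try a neighbouring topic count manually.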

In [11]:
# Getting a suggested number of topics and retraining the model on the new slice
best_k = results.resuggest_num_topics(
    coherence="u_mass", corpus_slice=tech_slice, verbose=True, start_k=4, end_k=10
)

Training 7 mini-models to suggest a suitable number of topics between 4 and 10.
(335 documents, 100 iterations each, considering top 30 terms per topic)



Suggested topic count: 4	Coherence: -2.44842	
All suggestions: [4] -2.44842, [6] -3.37790, [5] -3.43766, [7] -3.91176, [8] -4.13817, [9] -4.14025, [10] -4.39088



In [12]:
tech_model = results.retrain_model(
    corpus_slice=tech_slice, model_options={"k": best_k, "until_max_coherence": True}
)
tech_model.show_visualisation()

Training model on 335 documents:
{'term_weighting': 'idf', 'k': 4, 'seed': 11399, 'workers': 8, 'parallel_scheme': 'default', 'iterations': 2000, 'update_every': 500, 'until_max_ll': False, 'until_max_coherence': True, 'max_extra_iterations': 2000, 'verbose': True, 'alpha': 0.1, 'eta': 0.01, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}






Continuing to train until maximum coherence.





Best coherence: -3.27110 (Starting: -3.29984)
Docs: 335, Vocab size: 9845, Total Words: 88066
Model training complete. (10.018s)
Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.
Preparing LDA visualisation... Done. (0.254s)


The position of each topic cluster on the graph is not informative in itself, being simply the output of the chosen dimensionality reduction technique. If we want, though, we can rerun the modelling algorithm with a different random seed and see whether we get a more cleanly separated set of topics.

(We can also override any of the previously set options.)

In [13]:
tech_model_2 = tech_model.retrain_model(model_options={"seed": 7156})
tech_model_2.show_visualisation()

Training model on 335 documents:
{'term_weighting': 'idf', 'k': 4, 'seed': 7156, 'workers': 8, 'parallel_scheme': 'default', 'iterations': 2000, 'update_every': 500, 'until_max_ll': False, 'until_max_coherence': True, 'max_extra_iterations': 2000, 'verbose': True, 'alpha': 0.1, 'eta': 0.01, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}






Continuing to train until maximum coherence.





Best coherence: -3.81928 (Starting: -3.95207)
Docs: 335, Vocab size: 9845, Total Words: 88066
Model training complete. (10.058s)
Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.
Preparing LDA visualisation... Done. (0.237s)


Notice that the top words for each topic remain more or less consistent across training runs, even though the topics' positions in the visualisation change from run to run.
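One rough way to check this consistency numerically is to compare the top-word sets of two runs. The sketch below uses Jaccard overlap; the helper names and the word lists are hypothetical and are not taken from the models trained above.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of words: intersection size over union size."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b by top-word overlap."""
    return {
        topic_a: max(
            ((topic_b, jaccard(words_a, words_b)) for topic_b, words_b in run_b.items()),
            key=lambda pair: pair[1],
        )
        for topic_a, words_a in run_a.items()
    }


# Hypothetical top words from two training runs; topic numbering differs between runs.
run_1 = {
    1: ["phone", "mobile", "network", "camera"],
    2: ["game", "console", "player", "sony"],
}
run_2 = {
    1: ["game", "player", "console", "nintendo"],
    2: ["mobile", "phone", "network", "handset"],
}

print(best_matches(run_1, run_2))
```

A high overlap for every topic suggests that the two runs found the same topics under different numbering.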

In addition to simply slicing by topic, we can also explicitly search the whole corpus for documents that contain certain tokens, in case we want to be absolutely sure we got all the documents that mention certain words or phrases.
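Conceptually, such a search keeps every document that contains at least one of the requested tokens. The sketch below illustrates the idea on hypothetical, already-tokenised documents; `find_docs_with_tokens` is not an Ignis function, and the sketch matches tokens exactly.

```python
def find_docs_with_tokens(tokenised_docs, tokens):
    """Return IDs of documents containing at least one of the given tokens."""
    wanted = set(tokens)
    return [
        doc_id
        for doc_id, doc_tokens in tokenised_docs.items()
        if wanted & set(doc_tokens)
    ]


# Hypothetical, already-tokenised documents.
tokenised_docs = {
    "doc1": ["new", "console", "games", "announced"],
    "doc2": ["broadband", "prices", "fall"],
    "doc3": ["online", "gaming", "grows"],
}

print(find_docs_with_tokens(tokenised_docs, ["game", "games", "gaming"]))
```

Because the match in this sketch is exact, searching for `"game"` alone would miss both `doc1` and `doc3`, so each variant has to be listed.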

In [14]:
# Tokens that are related to games, doing a full-text search through the entire corpus (not just within the current results)
game_slice = tech_model.slice_by_tokens(["game", "games", "gaming"], include_root=True)
game_k = tech_model.resuggest_num_topics(game_slice, verbose=True, start_k=3)
game_model = tech_model.retrain_model(game_slice, model_options={"k": game_k})
game_model.show_visualisation()

Training 8 mini-models to suggest a suitable number of topics between 3 and 10.
(434 documents, 100 iterations each, considering top 30 terms per topic)



Suggested topic count: 4	Coherence: -2.22030	
All suggestions: [4] -2.22030, [3] -2.35200, [5] -2.93415, [6] -3.90681, [8] -3.95977, [7] -4.98437, [9] -5.21826, [10] -5.51505

Training model on 434 documents:
{'term_weighting': 'idf', 'k': 4, 'seed': 11399, 'workers': 8, 'parallel_scheme': 'default', 'iterations': 2000, 'update_every': 500, 'until_max_ll': False, 'until_max_coherence': True, 'max_extra_iterations': 2000, 'verbose': True, 'alpha': 0.1, 'eta': 0.01, 'tw': <TermWeight.IDF: 1>, 'parallel': <ParallelScheme.DEFAULT: 0>}






Continuing to train until maximum coherence.





Best coherence: -2.37804 (Starting: -2.55674)
Docs: 434, Vocab size: 11855, Total Words: 94136
Model training complete. (11.629s)
Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.
Preparing LDA visualisation... Done. (0.265s)


In [15]:
# Topic 2 seems related to video games
game_model.nb_explore_topics()
