Ignis: LDA
============

Welcome to the Ignis LDA Modelling Template.

Use this Jupyter notebook as a demo and starting point for modelling and exploring your own data.

In [1]:
import ignis

----------

Model training (LDA)
----

Load from an `ignis.Corpus`, add the processed docs to an LDA model, and train it.

Because LDA is a probabilistic algorithm, the model's random seed and parallelisation options can both affect results, so setting the seed and number of workers is necessary for reproducibility.

(Most users don't need to be too concerned about this: to be safe, Ignis sets both parameters to initial default values, but these can be overridden by advanced users as necessary.)
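
To see why the seed matters, here is a minimal toy sketch (plain Python, not Ignis or its LDA backend): a sampler-style random initialisation gives identical results only when the seed is the same. The function name `toy_topic_assignments` is made up for illustration, and the sketch covers only the seed half of the story, not the effect of parallel workers.

```python
import random


def toy_topic_assignments(num_tokens, k, seed):
    """Randomly assign each token to one of `k` topics, much as a sampler's
    initial state might be drawn. Purely illustrative."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(num_tokens)]


run_a = toy_topic_assignments(10, k=5, seed=11399)
run_b = toy_topic_assignments(10, k=5, seed=11399)
run_c = toy_topic_assignments(10, k=5, seed=1234567)

print(run_a == run_b)  # same seed: the assignments are identical
print(run_a == run_c)  # different seed: the assignments will very likely differ
```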

In [2]:
corpus = ignis.load_corpus("data/bbc.corpus")

The LDA implementation also provides a number of other options that can be changed by the user as necessary, but for most general cases, we can stick with the default values.

See the library docs for more information about the available options.

In [3]:
# Setting an empty dictionary here uses the ignis-provided defaults for all options.
model_options = {}

The number of topics for the algorithm to infer, `k`, needs to be set in advance, but a convenience method, `ignis.suggest_num_topics()`, can be used to suggest a suitable initial setting heuristically. (Specifically, it trains a number of mini-models and assesses how well the top `n` words of each topic in each model are related.)

Because this is only a _suggested_ initial setting, users are free to ignore it and experiment with different manual values of `k` instead to see how the results change accordingly.
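
The selection step can be pictured with a small sketch. It assumes the rule is simply "score each candidate `k` and keep the one with the highest coherence"; the helper `suggest_k` and the scores in `toy_scores` are invented for illustration and are not Ignis internals or real results.

```python
def suggest_k(candidate_ks, score_fn):
    """Return (best_k, best_score) for the candidate with the highest score.

    `score_fn` stands in for "train a small model with k topics and measure
    its coherence"; here it is supplied by the caller.
    """
    scores = {k: score_fn(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


# Made-up coherence scores for k = 3..10 (8 candidates), for illustration only.
toy_scores = {3: 0.012, 4: 0.031, 5: 0.049, 6: 0.040,
              7: 0.022, 8: 0.015, 9: 0.010, 10: 0.004}

best_k, best_score = suggest_k(range(3, 11), toy_scores.__getitem__)
print(best_k, best_score)
```

Trying a manual `k` amounts to skipping this search and supplying the value directly.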

In [4]:
# Here, we override the `k` option with the output of `ignis.suggest_num_topics()`.
model_options["k"] = ignis.suggest_num_topics(corpus, model_options=model_options)

# To specify your own value for `k`, use something like the following line instead:
# model_options["k"] = 7

Training 8 mini-models to suggest a suitable number of topics between 3 and 10...
(2120 documents, 150 iterations each, coherence metric: 'c_npmi')


Suggested topic count: 5	Coherence: 0.048988079305304924
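
For intuition about the reported coherence value, here is a simplified sketch of normalised pointwise mutual information (NPMI), the quantity behind the `c_npmi` metric named above. It counts co-occurrence at the whole-document level and averages over word pairs, which is an assumption made to keep the example short; the real metric pipeline may segment and window the text differently. The toy documents are invented.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """NPMI of two words from document-level co-occurrence; lies in [-1, 1]."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1.0:
        return 0.0  # degenerate 0/0 case; 0.0 is this sketch's own convention
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words, docs):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Invented toy corpus: each document is a set of tokens.
docs = [
    {"match", "goal", "team"},
    {"match", "goal", "coach"},
    {"market", "shares", "profit"},
    {"market", "profit", "bank"},
]

related = topic_coherence(["match", "goal"], docs)      # always co-occur
unrelated = topic_coherence(["match", "market"], docs)  # never co-occur
print(related, unrelated)
```

Values near 1 mean the top words tend to appear together, values near 0 mean they look independent, and negative values mean they co-occur less often than chance would predict.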


We can then train the actual final topic model with the configured value of `k`...

In [5]:
results = ignis.train_model(
    corpus,
    model_options=model_options,
)

Model training complete. (14.179s)

<Ignis Options>
| Workers: 8
| ParallelScheme: <ParallelScheme.DEFAULT: 0>
|
<Basic Info>
| LDAModel (current version: 0.12.0)
| 2120 docs, 400185 words
| Total Vocabs: 14621, Used Vocabs: 14621
| Entropy of words: -8.15441
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 2000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -27.96613
|
<Initial Parameters>
| tw: TermWeight.PMI
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 5 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.2 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 11399 (random seed)
| trained in version 0.12.0
|
<Parameters>
| alpha 

... and show the basic graphical visualisation of the topics that were found:

In [6]:
results.show_visualisation()

----------

## Exporting visualisations

If we want, we can export the visualisations to a separate folder for offline display using the `.export_visualisation()` method.

Uncomment and specify a target folder in the cell below, then run the cell and open the exported `visualisation.html` file to view the visualisation.

The entire folder can be copied to a different PC to display the visualisation there.

If the display PC will not have internet access, set `use_cdn` to `False`.

In [7]:
# results.export_visualisation("data/bbc_results", use_cdn=True)

----------

## Saving topic modelling results

The trained model can be saved to a separate file and loaded on another PC or after a Jupyter notebook restart, which avoids having to redo a potentially time-consuming training run.

N.B.: The saving/loading of topic modelling results can also take some time, because the full contents of each document in the corpus are also saved to the results file for continued display and iteration.

In [8]:
# results.save("data/bbc.aurum")

Instead of running the modelling steps above on the other PC, the results can then be loaded directly with `ignis.load_results()`:

In [9]:
# import ignis
# results = ignis.load_results("data/bbc.aurum")
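
One way to combine saving and loading is a "load if present, otherwise train and save" helper. The sketch below is generic Python rather than part of Ignis: `load_or_train` and its callable parameters are hypothetical names, and JSON stands in for the real results file so that the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_or_train(path, load_fn, train_fn, save_fn):
    """Return cached results from `path` if it exists; otherwise train,
    save to `path`, and return the fresh results."""
    path = Path(path)
    if path.exists():
        return load_fn(str(path))
    results = train_fn()
    save_fn(results, str(path))
    return results


# Toy stand-ins for the expensive training step and the file format.
calls = []


def fake_train():
    calls.append("train")
    return {"k": 5}


def json_save(obj, p):
    Path(p).write_text(json.dumps(obj))


def json_load(p):
    return json.loads(Path(p).read_text())


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_results.json"
    first = load_or_train(target, json_load, fake_train, json_save)
    second = load_or_train(target, json_load, fake_train, json_save)

print(first == second, calls)  # training ran only once
```

With Ignis, `load_fn` could be `ignis.load_results`, `train_fn` a function that calls `ignis.train_model(...)`, and `save_fn` a small wrapper around `results.save(...)`, as used in the cells above.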

----------

## Using the automated labeller

The automated labeller tries to come up with a few key terms that describe each topic. Its results may provide a slightly different perspective from the main model output (which is simply a list of the most probable terms for each topic).

If the automated labeller is initialised using the `.init_labeller()` method below, its suggestions will automatically be shown in the `.nb_explore_topics()` widget in the next section.

In [10]:
results.init_labeller("tomotopy", verbose=True)

Extracting label candidates from model...
Preparing First-order relevance labeller...
Done.


----------

## Exploring documents that "belong" to a given topic

Because topics are distributions over words and documents are *distributions* over topics, documents don't belong to individual topics per se; every topic is represented in every document with some probability.

We therefore have to decide what counts as "belonging": a document is listed under a topic only if that topic is among the document's top `n` most probable topics, and `n` is the setting we choose.
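
The idea can be sketched with a few toy document-topic distributions. This is an illustration in plain Python, not the Ignis implementation: `docs_for_topic` is a made-up helper, the probabilities are invented, and topics are indexed from 0 here simply because they are list positions.

```python
def docs_for_topic(doc_topic_probs, topic_id, within_top_n=1):
    """Return (doc_id, probability) pairs for documents in which `topic_id`
    is among the `within_top_n` most probable topics, highest first."""
    matches = []
    for doc_id, probs in doc_topic_probs.items():
        # Topic indices ordered from most to least probable for this document.
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        if topic_id in ranked[:within_top_n]:
            matches.append((doc_id, probs[topic_id]))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Invented distributions over three topics.
toy = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.15, 0.45, 0.40],
    "doc_c": [0.05, 0.35, 0.60],
}

strict = docs_for_topic(toy, topic_id=1, within_top_n=1)
loose = docs_for_topic(toy, topic_id=1, within_top_n=2)
print(strict)  # only doc_b has topic 1 as its single top topic
print(loose)   # all three documents have topic 1 in their top two
```

A larger `n` casts a wider net, at the cost of including documents in which the topic is only a secondary theme.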

In [11]:
results.nb_explore_topics()

----------

Slicing and iteration
--------

After seeing what the main topics might be, we can slice the initial corpus further and re-run topic modelling to get better resolution.

In [12]:
# Try zooming in on a sub-topic
sub_slice = results.slice_by_topic(1)

We might start by checking exactly which documents a slice of the corpus contains, by exploring it directly:

In [13]:
# By default, the documents in the slice are put in an arbitrary order.
# For advanced users, a custom sort key can be used to change this:
# E.g., this one sorts documents by the probability of their top topic instead.
# (This gives us the same document order as in the `nb_explore_topics()` widget above.)
sub_slice.nb_explore(
    doc_sort_key=lambda doc: results.get_document_top_topic(doc.id)[1], reverse=True
)

As before, Ignis can suggest a recommended number of topics to use for LDA, based on the coherence scores of a range of lightly-trained models.

In [14]:
# Getting a suggested number of topics and retraining the model on the new slice
best_k = results.resuggest_num_topics(corpus_slice=sub_slice, verbose=True)
sub_model = results.retrain_model(corpus_slice=sub_slice, model_options={"k": best_k})
sub_model.show_visualisation()

Training 8 mini-models to suggest a suitable number of topics between 3 and 10...
(328 documents, 150 iterations each, coherence metric: 'c_npmi')


Suggested topic count: 4	Coherence: -0.07911263540965896


Model training complete. (4.116s)

<Ignis Options>
| Workers: 8
| ParallelScheme: <ParallelScheme.DEFAULT: 0>
|
<Basic Info>
| LDAModel (current version: 0.12.0)
| 328 docs, 84322 words
| Total Vocabs: 6958, Used Vocabs: 6958
| Entropy of words: -7.57679
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 2000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -21.22212
|
<Initial Parameters>
| tw: TermWeight.PMI
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 4 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.25 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 11399 (random seed)
| trained in version 0.12.0
|
<Parameters>
| alpha (Dir

The position of each topic cluster on the graph is not informative in itself (it is simply the output of the dimensionality reduction technique used for the visualisation), but if we want, we can run the modelling algorithm with a different random seed and see if we get a more clearly separated set of topics.

(We can also override any of the previously set options.)

In [15]:
new_model_options = {"seed": 1234567}
sub_model_2 = sub_model.retrain_model(model_options=new_model_options)
sub_model_2.show_visualisation()

Model training complete. (3.993s)

<Ignis Options>
| Workers: 8
| ParallelScheme: <ParallelScheme.DEFAULT: 0>
|
<Basic Info>
| LDAModel (current version: 0.12.0)
| 328 docs, 84322 words
| Total Vocabs: 6958, Used Vocabs: 6958
| Entropy of words: -7.57679
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 2000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -21.16450
|
<Initial Parameters>
| tw: TermWeight.PMI
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 4 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.25 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 1234567 (random seed)
| trained in version 0.12.0
|
<Parameters>
| alpha (D

The topic words remain more or less consistent across different training runs, even though their positions in the visualisation change when the random seed is changed.
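
One rough way to check this kind of consistency is to compare the top words of each topic across two runs and match up the most similar pairs. The sketch below uses Jaccard overlap on invented word lists; `best_matches` is a hypothetical helper, and the topic words are not taken from the models above.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(run_1, run_2):
    """For each topic in `run_1`, find the most similar topic in `run_2`."""
    result = {}
    for id_1, words_1 in run_1.items():
        id_2 = max(run_2, key=lambda t: jaccard(words_1, run_2[t]))
        result[id_1] = (id_2, jaccard(words_1, run_2[id_2]))
    return result


# Made-up top words from two hypothetical training runs with different seeds.
seed_a = {
    1: ["game", "console", "sony", "nintendo"],
    2: ["phone", "mobile", "network", "3g"],
}
seed_b = {
    1: ["mobile", "phone", "network", "operator"],
    2: ["game", "console", "nintendo", "xbox"],
}

matches = best_matches(seed_a, seed_b)
print(matches)
```

In this toy case the topic numbering is swapped between runs, but each topic still has a clear counterpart, which is the pattern to look for when comparing seeds.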

In [17]:
sub_model_2.nb_explore_topics()

----------

## Further slicing and iteration

In addition to simply slicing by topic, we can also explicitly search the whole corpus for documents that contain certain tokens, in case we want to be absolutely sure we got all the documents that mention certain words or phrases.

References and examples are available on the Ignis documentation site for all the slicing functions.
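
Conceptually, a token search looks like the toy sketch below. It is plain Python, not the Ignis implementation, and it rests on two assumptions: that a document matches if it contains any one of the given tokens, and that an `include_root`-style flag switches the search from the current slice to every document in the root corpus. The function and data are invented for illustration.

```python
def slice_by_tokens(root_docs, current_ids, tokens, include_root=False):
    """Return the IDs of documents containing at least one of `tokens`.

    `root_docs` maps every document ID in the root corpus to its token list;
    `current_ids` is the subset that makes up the current slice.
    """
    wanted = set(tokens)
    search_ids = root_docs.keys() if include_root else current_ids
    return {doc_id for doc_id in search_ids if wanted & set(root_docs[doc_id])}


# Invented root corpus and a current slice holding two of its documents.
root_docs = {
    "d1": ["game", "console"],
    "d2": ["election", "vote"],
    "d3": ["gaming", "pc"],
    "d4": ["film", "award"],
}
current_ids = {"d1", "d2"}
game_tokens = ["game", "games", "gaming"]

in_slice = slice_by_tokens(root_docs, current_ids, game_tokens)
in_root = slice_by_tokens(root_docs, current_ids, game_tokens, include_root=True)
print(in_slice)  # only matches inside the current slice
print(in_root)   # also picks up matching documents outside the slice
```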

In [18]:
# Search the entire corpus (not just the current results) for documents that
# contain game-related tokens.
game_slice = sub_model_2.slice_by_tokens(["game", "games", "gaming"], include_root=True)
game_k = sub_model_2.resuggest_num_topics(game_slice, verbose=True, start_k=3)
game_model = sub_model_2.retrain_model(game_slice, model_options={"k": game_k})
game_model.show_visualisation()

Training 8 mini-models to suggest a suitable number of topics between 3 and 10...
(435 documents, 150 iterations each, coherence metric: 'c_npmi')


Suggested topic count: 3	Coherence: -0.04343148168793708


Model training complete. (3.871s)

<Ignis Options>
| Workers: 8
| ParallelScheme: <ParallelScheme.DEFAULT: 0>
|
<Basic Info>
| LDAModel (current version: 0.12.0)
| 435 docs, 91385 words
| Total Vocabs: 9235, Used Vocabs: 9235
| Entropy of words: -7.88890
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 2000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -24.52070
|
<Initial Parameters>
| tw: TermWeight.PMI
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 3 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.33333 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 1234567 (random seed)
| trained in version 0.12.0
|
<Parameters>
| alpha

In [19]:
game_model.nb_explore_topics()

----------

## Manipulating the stop word list

If you want to add or remove words from the stop word list at run-time, you can use the `add_stop_word()` and `remove_stop_word()` methods on a slice (or their plural versions, `add_stop_words()` and `remove_stop_words()`) to do so.

In [20]:
# If we decide that certain tokens do not contribute to our `game_slice` model:
game_slice.add_stop_words(["try", "i am", "people"])
game_k = game_model.resuggest_num_topics(game_slice, verbose=True, start_k=3)
game_model = game_model.retrain_model(game_slice, model_options={"k": game_k})
game_model.show_visualisation()

Training 8 mini-models to suggest a suitable number of topics between 3 and 10...
(435 documents, 150 iterations each, coherence metric: 'c_npmi')


Suggested topic count: 3	Coherence: -0.021740968647045722


Model training complete. (3.899s)

<Ignis Options>
| Workers: 8
| ParallelScheme: <ParallelScheme.DEFAULT: 0>
|
<Basic Info>
| LDAModel (current version: 0.12.0)
| 435 docs, 90892 words
| Total Vocabs: 9233, Used Vocabs: 9233
| Entropy of words: -7.89434
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 2000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -24.51536
|
<Initial Parameters>
| tw: TermWeight.PMI
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 3 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.33333 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 1234567 (random seed)
| trained in version 0.12.0
|
<Parameters>
| alpha

In the new results above, the tokens we added to the stop word list no longer appear in the topics.

**N.B.:** These stop words are controlled at the root Corpus level, so any stop words that are added or removed will apply to _all_ slices that originate from the same initial Corpus.
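
The sharing behaviour can be pictured with a toy model in which every slice keeps a reference to its root corpus and the stop word set lives only on the root. The classes `ToyCorpus` and `ToySlice` are invented for this sketch and do not mirror the actual Ignis classes.

```python
class ToyCorpus:
    """Root corpus: owns the documents and the single shared stop word set."""

    def __init__(self, docs):
        self.docs = docs
        self.stop_words = set()


class ToySlice:
    """A view onto some documents of a root corpus."""

    def __init__(self, root, doc_ids):
        self.root = root
        self.doc_ids = list(doc_ids)

    def add_stop_words(self, words):
        # Stored on the root, so every slice of the same corpus sees the change.
        self.root.stop_words.update(words)

    def remove_stop_words(self, words):
        self.root.stop_words.difference_update(words)

    def tokens(self, doc_id):
        return [t for t in self.root.docs[doc_id] if t not in self.root.stop_words]


root = ToyCorpus({"d1": ["people", "game", "try"], "d2": ["people", "vote"]})
slice_a = ToySlice(root, ["d1"])
slice_b = ToySlice(root, ["d2"])

slice_a.add_stop_words(["try", "people"])
print(slice_a.tokens("d1"))  # stop words removed in the slice we edited
print(slice_b.tokens("d2"))  # ...and also in a sibling slice
```

If a stop word should apply to one analysis only, remember to remove it again before working with other slices of the same corpus.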