Knowing your way around ARTM EM algorithm


Prerequisites


The EM algorithm used in ARTM is described in many articles, such as the BigARTM release paper. From the described algorithm one can see that two special hyperparameters govern how the Theta and Phi matrices are obtained. In short, BigARTM exposes two hyperparameters that determine when Phi and Theta are considered converged. Namely, num_document_passes controls how many times Theta is updated before it is considered converged, while num_collection_passes determines how many times the Phi matrix is updated with a converged version of the Theta matrix.
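As a quick orientation, here is how these two hyperparameters appear in the high-level Python API. This is a minimal sketch: the dictionary and batch_vectorizer objects are assumed to have been built from your data beforehand.

import artm

# inner loop: Theta is updated num_document_passes times per document
model = artm.ARTM(num_topics=20, num_document_passes=10)
model.initialize(dictionary=dictionary)

# outer loop: Phi is updated num_collection_passes times over the collection
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)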

The following paragraphs will guide you through the EM algorithm implementation in BigARTM, helping you better understand the inner workings of the library.

Reconstructing the PLSA algorithm


Let's start with the most basic and yet most important part of this topic: building a PLSA model with the EM algorithm. Suppose the user has constructed an ARTM model without any regularizers and initialized it:

model = artm.ARTM(...)
model.initialize(dictionary)

they can access the Phi-related matrices stored inside the model using the following:

pwt = model.model_pwt
nwt = model.model_nwt
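
Note that model_pwt and model_nwt hold the internal names of these matrices rather than their values. To inspect the actual contents, the high-level API can extract them as pandas DataFrames; a small sketch, assuming the model has already been fitted:

phi = model.get_phi(model_name=pwt)   # normalized p(w|t) matrix
n_wt = model.get_phi(model_name=nwt)  # raw n_wt counters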

For a PLSA model, every Phi matrix update is performed by the following algorithm:

# algorithm 1 - offline fitting without regularization (PLSA)
num_collection_passes = 10
for outer_iteration in range(num_collection_passes):
    # E-step: process all batches, updating Theta num_document_passes times
    # and accumulating the n_wt counters (point batches_folder or batches
    # at your data instead of leaving both as None)
    model.master.process_batches(
        pwt="pwt",
        nwt="nwt",
        num_document_passes=10,
        batches_folder=None,
        batches=None
    )
    # M-step: normalize the n_wt counters to obtain the new Phi (p_wt) matrix
    model.master.normalize_model(pwt="pwt", nwt="nwt", rwt=None)

This code is what model.fit_offline(num_collection_passes=10) does under the hood in BigARTM.

Reconstructing the ARTM algorithm


The main benefit of working with BigARTM comes from the ability to use regularizers. To account for regularizers in the model, the previous code should be adjusted in the following way:

for outer_iteration in range(num_collection_passes):
    # E-step: same as in the PLSA case
    model.master.process_batches(
        pwt="pwt",
        nwt="nwt",
        num_document_passes=10,
        batches_folder=None,
        batches=None
    )
    # compute the regularization counters r_wt from the attached regularizers
    model.master.regularize_model(
        pwt="pwt",
        nwt="nwt",
        rwt="rwt",
        regularizer_name=...,
        regularizer_tau=...
    )
    # M-step: normalize n_wt together with r_wt to obtain the new Phi matrix
    model.master.normalize_model(pwt="pwt", nwt="nwt", rwt="rwt")
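
For reference, this loop is what runs under the hood when regularizers are attached through the high-level API. A minimal sketch (the regularizer name and tau value here are purely illustrative, and batch_vectorizer is assumed to exist):

model.regularizers.add(
    artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1)
)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)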

How to manage custom Theta initialization in the EM algorithm


Sometimes a researcher needs to run the EM algorithm differently from the default way it is programmed in BigARTM. One such case concerns the choice of the initial Theta matrix used in the EM algorithm. Let's say there is a better way to initialize the Theta matrix: Theta = Theta_init(*args, **kwargs).

First, one would need to make sure that the Theta matrix is:

  • stored in the model (the cache_theta=True option)
  • given a name (the theta_name='ptd' option)
  • provided by your function and not by the Theta saved in memory (the reuse_theta=False option)

Next, we make ARTM create the Theta matrix in the model, as sketched below.
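
A minimal sketch of such a configuration (the option names come from the list above; num_topics and dictionary are placeholders for your own values):

model = artm.ARTM(
    num_topics=20,
    cache_theta=True,    # keep Theta stored inside the model
    theta_name='ptd',    # give the Theta matrix an explicit name
    reuse_theta=False    # recompute Theta instead of reusing the cached one
)
model.initialize(dictionary=dictionary)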