This notebook demonstrates how to change the number of topics in the model "on the fly".

In [1]:
import artm
print(artm.version())

0.8.1


Let's start by fitting a simple model with 5 topics using the ``fit_offline`` algorithm. Let's also log the perplexity score after each iteration.

In [2]:
# Load the 'kos' collection, stored in UCI bag-of-words format
batch_vectorizer = artm.BatchVectorizer(data_path=r'C:\bigartm\data', data_format='bow_uci', collection_name='kos')
# Five topics; the perplexity score is tracked after every collection pass
model = artm.ARTM(topic_names=['topic_{}'.format(i) for i in range(5)],
                  scores=[artm.PerplexityScore(name='PerplexityScore')],
                  num_document_passes=10)
model.initialize(dictionary=batch_vectorizer.dictionary)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=5)
print(model.score_tracker['PerplexityScore'].value)

[6771.564197835558, 2516.6601030082857, 2407.3543922010576, 2187.2682550919926, 1996.181266257517]
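
For readers unfamiliar with the score, here is a minimal sketch of the usual definition of perplexity, `exp(-(1/n) * sum(n_dw * ln p(w|d)))`, where `p(w|d)` is the sum over topics of `p(w|t) * p(t|d)`. The function name `perplexity` and the nested-dict layout are assumptions made for illustration. This is not BigARTM code, and it is only assumed here that ``PerplexityScore`` follows the same definition.

```python
import math


def perplexity(docs, phi, theta):
    """docs: {doc: {token: count}}, phi: {topic: {token: p(w|t)}},
    theta: {doc: {topic: p(t|d)}} (hypothetical layout)."""
    log_likelihood = 0.0
    total_count = 0
    for doc_id, counts in docs.items():
        for token, n_dw in counts.items():
            # p(w|d) = sum over topics of p(w|t) * p(t|d)
            p_wd = sum(phi[topic][token] * p_td
                       for topic, p_td in theta[doc_id].items())
            log_likelihood += n_dw * math.log(p_wd)
            total_count += n_dw
    return math.exp(-log_likelihood / total_count)


# Toy model: two topics, both uniform over a four-token vocabulary
vocabulary = ['opinion', 'held', 'attitude', 'impeachment']
phi = {topic: {token: 0.25 for token in vocabulary}
       for topic in ('topic_0', 'topic_1')}
theta = {'doc_0': {'topic_0': 0.3, 'topic_1': 0.7}}
docs = {'doc_0': {'opinion': 2, 'held': 1}}

uniform_perplexity = perplexity(docs, phi, theta)
print(uniform_perplexity)
```

A model that spreads probability uniformly over four tokens has a perplexity of 4 (up to floating-point rounding), so lower values mean the model predicts the collection better.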


Now let's see how to use the internal method ``model.master.merge_model`` to add new topics.
``merge_model`` was originally designed to combine several ``nwt`` matrices with given weights.
In addition, it lets you specify which topics to include in the resulting matrix.
If a topic doesn't exist in any of the source matrices, it is initialized with zeros.
In the following example we "merge" just a single matrix with weight ``1.0``.

In [3]:
# Copy 'nwt' (weight 1.0) into a new matrix 'test' that has seven topics
model.master.merge_model({'nwt': 1.0}, 'test', topic_names=['topic_{}'.format(i) for i in range(7)])
model.get_phi(model_name='test')[:5]

token,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6
parentheses,1.036641,1.1e-05,0.036257,5.276785,1.650306,0.0,0.0
opinion,79.475784,25.148537,3.788871,29.092693,3.494116,0.0,0.0
attitude,8.835311,1.463707,0.004694,6.825315,3.870974,0.0,0.0
held,70.140091,90.400917,15.58677,59.392002,10.480217,0.0,0.0
impeachment,0.186589,11.420464,3.039623,1.935981,1.417344,0.0,0.0
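
The merge semantics described above (a weighted sum of the source matrices, with zeros for topics that no source contains) can be illustrated in plain Python. This is only a sketch: the function `merge_matrices` and the nested-dict layout are hypothetical and say nothing about how BigARTM implements ``merge_model`` internally.

```python
def merge_matrices(sources, weights, topic_names):
    """Weighted sum of token-by-topic count matrices, restricted to topic_names.

    sources: {matrix name: {topic: {token: count}}}  (hypothetical layout)
    weights: {matrix name: weight}
    """
    # Collect every token seen in any source so that all columns share rows.
    tokens = sorted({token
                     for matrix in sources.values()
                     for column in matrix.values()
                     for token in column})
    merged = {}
    for topic in topic_names:
        # A topic missing from every source keeps its all-zero column.
        column = dict.fromkeys(tokens, 0.0)
        for name, weight in weights.items():
            for token, count in sources[name].get(topic, {}).items():
                column[token] += weight * count
        merged[topic] = column
    return merged


nwt = {
    'topic_0': {'opinion': 79.5, 'held': 70.1},
    'topic_1': {'opinion': 25.1, 'held': 90.4},
}
# "Merge" a single matrix with weight 1.0 and ask for one extra topic
test = merge_matrices({'nwt': nwt}, {'nwt': 1.0},
                      ['topic_0', 'topic_1', 'topic_2'])
print(test['topic_2'])
```

In this toy run the existing columns are copied unchanged and `topic_2` comes out as a column of zeros, mirroring `topic_5` and `topic_6` in the output above.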


As a side note, it is always helpful to see which matrices exist in the model.
Normally you expect to see the ``pwt`` and ``nwt`` matrices, but because of the ``merge_model`` call we've just executed
there is an additional matrix named ``test``.

In [4]:
for model_description in model.info.model:
    print(model_description)

name: "nwt"
type: "class artm::core::DensePhiMatrix"
num_topics: 5
num_tokens: 6906

name: "pwt"
type: "class artm::core::DensePhiMatrix"
num_topics: 5
num_tokens: 6906

name: "test"
type: "class artm::core::DensePhiMatrix"
num_topics: 7
num_tokens: 6906



To modify the values, you need to *attach* to the matrix. In ``model.info`` you can see that ``test`` is now attached: its type changes to ``AttachedPhiMatrix``.

In [5]:
(test_model, test_matrix) = model.master.attach_model('test')
for model_description in model.info.model:
    print(model_description)

name: "nwt"
type: "class artm::core::DensePhiMatrix"
num_topics: 5
num_tokens: 6906

name: "pwt"
type: "class artm::core::DensePhiMatrix"
num_topics: 5
num_tokens: 6906

name: "test"
type: "class artm::core::AttachedPhiMatrix"
num_topics: 7
num_tokens: 6906



In [6]:
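import numpy as np
# Fill the two new topics (columns 5 and 6) with random values;
# writes to test_matrix show up in the attached 'test' matrix
test_matrix[:, [5,6]] = np.random.rand(test_matrix.shape[0], 2)
model.get_phi(model_name='test')[:5]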

token,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6
parentheses,1.036641,1.1e-05,0.036257,5.276785,1.650306,0.081645,0.714468
opinion,79.475784,25.148537,3.788871,29.092693,3.494116,0.336906,0.085047
attitude,8.835311,1.463707,0.004694,6.825315,3.870974,0.381812,0.2678
held,70.140091,90.400917,15.58677,59.392002,10.480217,0.349821,0.079702
impeachment,0.186589,11.420464,3.039623,1.935981,1.417344,0.422647,0.988804
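
Note that the new columns hold values in [0, 1), while the existing columns hold much larger counters. Here is a small sketch of why that difference in scale need not matter, under the assumption (not stated in this notebook) that a ``pwt``-style matrix is obtained by dividing each topic column of the counters by its sum, with regularizers ignored. The helper `normalize_columns` is hypothetical.

```python
def normalize_columns(counts):
    """counts: {topic: {token: non-negative value}} -> {topic: {token: probability}}"""
    probabilities = {}
    for topic, column in counts.items():
        total = sum(column.values())
        # An all-zero column cannot be normalized; leave it at zeros.
        probabilities[topic] = {
            token: (value / total if total > 0 else 0.0)
            for token, value in column.items()
        }
    return probabilities


counts = {
    'topic_4': {'opinion': 3.0, 'held': 9.0},     # trained counters
    'topic_5': {'opinion': 0.25, 'held': 0.75},   # freshly initialized topic
    'topic_6': {'opinion': 0.0, 'held': 0.0},     # topic left at zeros
}
pwt = normalize_columns(counts)
print(pwt['topic_4'])
print(pwt['topic_5'])
```

Under that assumption only the proportions within a column survive normalization, so `topic_4` and `topic_5` end up with the same distribution here, whereas a column left at zeros carries no information at all.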


Now, I'm really not sure what will happen if you modify ``pwt`` or ``nwt`` and then call ``fit_offline``.
That's because ``fit_offline`` expects matrices with the same number of topics
as specified in the configuration of the model.
However, it is quite safe to use low-level methods such as ``model.master.process_batches`` and ``model.master.normalize_model``.
The example below shows how to use these methods to reproduce the results of ``fit_offline``. Applying them to the modified matrices (those that have a different number of topics) is left for you to figure out.

In [7]:
# Fit the model with the low-level API: process batches, then normalize the model
model.initialize(dictionary=batch_vectorizer.dictionary)
for i in range(5):
    model.master.clear_score_cache()
    model.master.process_batches(model._model_pwt, model._model_nwt,
                                 batches=[x.filename for x in batch_vectorizer.batches_list],
                                 num_document_passes=10)
    model.master.normalize_model(model._model_pwt, model._model_nwt)
    print(model.get_score('PerplexityScore').value)

6771.56419784
2516.66014788
2407.35444318
2187.26840835
1996.18147091


As you can see, the perplexity values reproduce the results of ``fit_offline``: the first value is identical, and the rest agree to about seven significant digits.
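
To quantify how closely the two runs agree, the values printed in this notebook can be compared with a relative tolerance. The sketch below uses only the numbers shown in the outputs above; the tolerance `1e-6` is an arbitrary choice, and `math.isclose` requires Python 3.5 or later.

```python
import math

# Values logged by fit_offline (cell 2) and by the low-level loop (cell 7)
fit_offline_values = [6771.564197835558, 2516.6601030082857, 2407.3543922010576,
                      2187.2682550919926, 1996.181266257517]
low_level_values = [6771.56419784, 2516.66014788, 2407.35444318,
                    2187.26840835, 1996.18147091]

# Relative gap per iteration, measured against the fit_offline value
relative_gaps = [abs(a - b) / a
                 for a, b in zip(fit_offline_values, low_level_values)]
all_close = all(math.isclose(a, b, rel_tol=1e-6)
                for a, b in zip(fit_offline_values, low_level_values))

print(all_close)
print(max(relative_gaps))
```

A relative comparison is used rather than `==` because the second run's values were printed with fewer digits and the later iterations differ slightly between the two runs.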