# Attaching Phi and Theta matrices 
Author: Rose Aysina.

In this tutorial we demonstrate how to attach (i.e. replace or initialize) the Phi and Theta matrices with custom matrices, and how to keep Phi and Theta fixed during model iterations.

In [1]:
import pandas as pd
import numpy as np

import artm

import os
import glob

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
print(artm.version())

0.9.0


## Loading data and topic model initialization

We will use the small text collection `kos`, which is available in the [UCI](https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/) repository. Download all the needed files and place them in the folder passed as `data_path` (`kos` in the code below). Then let's create the `batch_vectorizer` and the corresponding `dictionary` objects:

In [4]:
batch_vectorizer = None

if len(glob.glob(os.path.join('kos', '*.batch'))) < 1:
    batch_vectorizer = artm.BatchVectorizer(data_path='kos', 
                                            data_format='bow_uci', 
                                            collection_name='kos', 
                                            target_folder='kos')
else:
    batch_vectorizer = artm.BatchVectorizer(data_path='kos', 
                                            data_format='batches')

In [5]:
dictionary = artm.Dictionary()

if not os.path.isfile('kos/dictionary.dict'):
    dictionary.gather(data_path=batch_vectorizer.data_path)
    dictionary.save(dictionary_path='kos/dictionary.dict')

dictionary.load(dictionary_path='kos/dictionary.dict')

The number of topics is set to 10 (for printing convenience), and there are no namespaces other than the default one.

Remember that you don't have to specify the default namespace. In this tutorial we set it explicitly and use it throughout, so that the code is easier to adapt if your model has several namespaces.

In [6]:
num_topics = 10
class_ids = {'@default_class': 1.0}

In [7]:
model_init = artm.ARTM(num_topics=num_topics,
                       dictionary=dictionary,
                       cache_theta=True,
                       reuse_theta=True,
                       theta_columns_naming='title',
                       theta_name='theta',
                       class_ids=class_ids)

model_init.initialize(dictionary=dictionary)

## Running model

Right after the model is created there is no Theta matrix, and the Phi matrix is filled with random values. Let's run the model for several iterations to obtain a Theta matrix:

In [8]:
model = model_init.clone()  # keep model_init untouched: we will need the initial model later

In [9]:
model.fit_offline(batch_vectorizer, num_collection_passes=10)

Let's look at the Phi and Theta matrices and save them:

In [10]:
model.get_phi().head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
predebate,1.559367e-05,4.771521e-10,1.8e-05,3e-06,0.0,8.889096e-05,0.0,8.481427e-13,5.111458e-15,1.340616e-09
barbour,8.666742e-12,0.0,0.0,7e-06,5.805268e-07,0.0002536439,1.983248e-05,1.508368e-12,1.18092e-12,1.898585e-07
bumblebums,0.0,9.774654e-14,0.000261,0.0,0.000471635,0.0,0.0,0.0,4.017255e-13,0.0
spindizzy,5.515514e-16,6.895758e-14,0.000273,0.0,0.000490344,0.0,0.0,0.0,2.716291e-14,0.0
mcentee,0.0,0.0,0.0,0.0,2.02694e-13,3.915133e-14,1.446393e-16,0.0,0.0005604492,1.04684e-15


In [11]:
model.get_theta().head()

Unnamed: 0,3001,3002,3003,3004,3005,3006,3007,3008,3009,3010,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
topic_0,0.317986,0.0,0.296328,0.0793248,0.006957792,0.0,0.1203053,0.0,6.219742e-14,2.655634e-14,...,0.405047,0.022242,0.08101612,2.016665e-16,0.0008619558,2.58694e-08,0.0,1.947321e-11,2.785035e-10,0.0
topic_1,0.357562,0.0,0.027244,1.06005e-13,0.0004282176,0.0,0.03307888,0.0,0.0,0.0,...,0.00031,1.1e-05,6.015801e-07,0.0,0.0,0.3074379,4.563028e-16,0.009136069,0.0,0.0
topic_2,0.0,0.032342,0.0,1.318244e-10,0.00949747,1.0,1.66665e-07,0.0,0.0,1.506634e-08,...,0.0,0.0,0.0,0.8237324,3.299619e-14,0.0141815,6.701738e-15,0.008544581,0.0,0.0
topic_3,0.000943,0.0,0.00446,0.2047599,0.1767254,1.004859e-12,0.0009707168,0.083137,0.0286475,0.004849287,...,0.242819,0.014091,0.01657028,8.908565e-09,0.386577,0.1513949,0.136242,0.0001340723,0.2210321,0.057991
topic_4,0.219647,0.000946,0.519475,0.0005952637,4.606398e-13,1.34044e-08,0.03212908,0.0,1.567543e-15,3.891018e-06,...,0.001198,0.173389,3.120982e-07,0.1184843,0.03170368,0.0,0.0,1.324152e-10,0.0,0.0


In [12]:
phi = model.get_phi().copy()
theta = model.get_theta().copy()

## Phi attaching

According to the docs, you can attach (i.e. replace) the Phi matrix using the `ARTM.master.attach_model()` method. [Here](http://docs.bigartm.org/en/stable/tutorials/python_userguide/attach_model.html) is the corresponding doc page.

So all you need is to get a reference to the Phi matrix and change its values through this reference. A problem may appear when you have several namespaces and want to set an exact value for each token in each topic rather than just random values. In that case you need to know the exact order of tokens and topics in the stored Phi matrix. Fortunately, the Protobuf message returned alongside the reference provides this.

Let's look at what `ARTM.master.attach_model()` returns:

In [13]:
tm_info, phi_ref = model.master.attach_model(model=model.model_pwt)

In [14]:
phi_ref  # Phi matrix reference

array([[1.55936686e-05, 4.77152096e-10, 1.80611205e-05, ...,
        8.48142682e-13, 5.11145801e-15, 1.34061595e-09],
       [8.66674180e-12, 0.00000000e+00, 0.00000000e+00, ...,
        1.50836797e-12, 1.18092016e-12, 1.89858511e-07],
       [0.00000000e+00, 9.77465382e-14, 2.61063658e-04, ...,
        0.00000000e+00, 4.01725501e-13, 0.00000000e+00],
       ...,
       [4.75688651e-03, 3.84311518e-03, 2.99096544e-04, ...,
        1.49234375e-02, 2.24081334e-04, 5.16588846e-03],
       [1.89932452e-05, 8.17234104e-05, 6.61607436e-11, ...,
        8.50354991e-05, 5.76613711e-05, 7.10015939e-08],
       [5.59726497e-04, 4.98893671e-04, 1.69469113e-05, ...,
        5.49448829e-04, 9.03802575e-04, 6.75734889e-04]], dtype=float32)

In [15]:
type(tm_info)  # Protobuf message

artm.messages_pb2.TopicModel

This message contains all the information you need to know how Phi is stored. Let's look at some of its fields:

In [16]:
fields = tm_info.ListFields()

In [17]:
fields[3][1][:5]  # tokens order

['predebate', 'barbour', 'bumblebums', 'spindizzy', 'mcentee']

In [18]:
fields[2][1][:5]  # topics order

['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4']

In [19]:
fields[5][1][:5]  # namespaces order for each token

['@default_class',
 '@default_class',
 '@default_class',
 '@default_class',
 '@default_class']

You can see that with these three fields you can replace exact cells of the Phi matrix by pointing to a certain token of a certain topic in a given namespace. Note that the Phi matrix consists of stacked per-namespace Phi matrices (in our case it's just one 'default' matrix).
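To make the addressing concrete, here is a minimal NumPy sketch with made-up orders. The tokens, the second namespace `@labels`, the zero-filled buffer and the helper `set_phi_cell` are all hypothetical and not part of BigARTM; they only stand in for the three orders read from the message and for the attached Phi reference.

```python
import numpy as np

# Hypothetical orders, standing in for what the TopicModel message reports.
token_order = np.array(['cat', 'dog', 'cat', 'bird'])
class_id_order = np.array(['@default_class', '@default_class', '@labels', '@labels'])
topic_order = np.array(['topic_0', 'topic_1', 'topic_2'])

# Stand-in for the attached Phi buffer: one row per (token, namespace) pair.
phi_buffer = np.zeros((len(token_order), len(topic_order)), dtype=np.float32)


def set_phi_cell(buffer, token, class_id, topic, value):
    """Write `value` into the cell addressed by (token, class_id, topic)."""
    rows = np.flatnonzero((token_order == token) & (class_id_order == class_id))
    cols = np.flatnonzero(topic_order == topic)
    if len(rows) != 1 or len(cols) != 1:
        raise KeyError((token, class_id, topic))
    buffer[rows[0], cols[0]] = value  # in-place write, the buffer object is kept


set_phi_cell(phi_buffer, 'cat', '@labels', 'topic_1', 0.25)
print(phi_buffer)
```

The same token may occur in several namespaces, which is why the row lookup combines the token with its namespace.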

Given that information, let's create a function that takes `model_`, replaces the part of the Phi matrix that corresponds to namespace `class_id` with the new `class_id_phi` matrix, and returns the updated model:

In [20]:
def init_custom_phi(model_, class_id, class_id_phi):
    model = model_.clone()
    
    tm_info, phi_ref = model.master.attach_model(model=model.model_pwt)

    fields = tm_info.ListFields()
    token_order = np.array(fields[3][1])
    topic_order = np.array(fields[2][1])
    class_id_order = np.array(fields[5][1])

    new_phi = pd.DataFrame(data=phi_ref,
                           index=token_order,
                           columns=topic_order)

    # a loop over all class_ids could be added here
    mask = class_id_order == class_id
    current_phi = new_phi.iloc[mask].copy()
    current_phi.update(class_id_phi)
    new_phi.iloc[mask] = current_phi

    np.copyto(phi_ref, new_phi.values)
    return model

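Remember that you can't just write `phi_ref = new_phi.values`! That would only rebind the local name `phi_ref` to a new array and leave the model's Phi matrix untouched; `np.copyto` writes the values into the existing buffer.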
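The difference between rebinding a name and writing in place can be seen in a few lines of plain NumPy. In this sketch `shared` is a hypothetical stand-in for the memory that the attached reference points to; no BigARTM calls are involved.

```python
import numpy as np

shared = np.zeros((2, 3), dtype=np.float32)     # stand-in for the model-owned buffer
ref = shared                                    # plays the role of the attached reference
new_values = np.ones((2, 3), dtype=np.float32)

ref = new_values                                # rebinding: `ref` now names another array
sum_after_rebind = float(shared.sum())          # the buffer is untouched

ref = shared                                    # take the reference again
np.copyto(ref, new_values)                      # in-place write into the existing memory
sum_after_copyto = float(shared.sum())          # the buffer now holds the new values

print(sum_after_rebind, sum_after_copyto)
```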

Also note how the actual replacement is done. The main point is that your custom matrix does not have to follow the stored layout: all you have is a DataFrame indexed by tokens with topics as columns, and both may come in arbitrary order.

That's why you first copy the existing part of the Phi matrix (the block corresponding to the given namespace), then update this block with the values from your new Phi DataFrame (`DataFrame.update` aligns rows and columns by label), and finally put the block back into the existing Phi matrix.
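The copy, update and put-back steps can be tried on toy data without a model. In this sketch the stored layout, the second namespace `@labels` and all values are hypothetical; only the three pandas/NumPy steps mirror the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical stored layout: two stacked namespaces, tokens in arbitrary order.
token_order = np.array(['b', 'a', 'c', 'a', 'b'])
class_id_order = np.array(['@default_class'] * 3 + ['@labels'] * 2)
topic_order = ['topic_0', 'topic_1']

stored = pd.DataFrame(np.zeros((5, 2)), index=token_order, columns=topic_order)

# Custom values for one namespace, with tokens and topics in a different order.
custom = pd.DataFrame([[0.3, 0.1], [0.2, 0.4], [0.5, 0.5]],
                      index=['a', 'c', 'b'], columns=['topic_1', 'topic_0'])

rows = np.flatnonzero(class_id_order == '@default_class')
block = stored.iloc[rows].copy()   # 1. copy the block of this namespace
block.update(custom)               # 2. align by labels and overwrite
stored.iloc[rows] = block.values   # 3. put the block back by position

print(stored)
```

Working on the block rather than on the whole frame keeps the update away from rows of other namespaces that happen to carry the same token.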

Let's look how it works:

In [21]:
x = np.random.random(phi.shape)
x = np.nan_to_num(x / np.sum(x, axis=0)[None])

phi_new = pd.DataFrame(data=x[:, :-2],  # drop last two topic columns
                       columns=phi.columns[0:-2][::-1], # drop last two topic columns and reverse topic order
                       index=phi.index[::-1])  # reverse token order

In [22]:
phi_new.head()

Unnamed: 0,topic_7,topic_6,topic_5,topic_4,topic_3,topic_2,topic_1,topic_0
close,0.000129,0.000209,6.6e-05,1.4e-05,1.9e-05,7.6e-05,0.000205,0.000159
assets,8.6e-05,0.000263,5e-06,0.00026,2.3e-05,4e-05,0.000244,4e-06
administration,0.000236,2.2e-05,0.000222,0.000231,5e-05,8.7e-05,4.9e-05,2.6e-05
contempt,0.000259,2.8e-05,0.000263,0.000119,6.9e-05,4.5e-05,4e-05,0.000148
deadline,0.000155,3e-05,0.00023,0.000237,0.000128,0.000217,8.3e-05,0.000267


In [23]:
phi_new.tail()

Unnamed: 0,topic_7,topic_6,topic_5,topic_4,topic_3,topic_2,topic_1,topic_0
mcentee,0.000115,0.000105,7.3e-05,0.000105,3.7e-05,0.000279,5.2e-05,0.000126
spindizzy,8.5e-05,0.000144,6.2e-05,1.4e-05,0.000138,0.000101,0.000116,0.000172
bumblebums,0.000235,3.1e-05,0.000186,0.000223,1.2e-05,0.000226,1.1e-05,0.000124
barbour,2e-06,0.000136,0.000215,0.000229,0.000233,0.000117,9.3e-05,0.000271
predebate,0.000147,2.7e-05,0.000273,5.2e-05,6.4e-05,8.2e-05,0.000266,8.1e-05


The DataFrame does not have to cover all topics or all tokens. Cells that are not specified in it are left unchanged.
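This behaviour comes from `DataFrame.update`, which only overwrites cells for which the other frame has a non-NA value at a matching label. A small sketch with hypothetical tokens and values:

```python
import numpy as np
import pandas as pd

current = pd.DataFrame({'topic_0': [0.1, 0.2], 'topic_1': [0.3, 0.4]},
                       index=['alpha', 'beta'])

# Hypothetical partial update: one topic only, one NaN, one unknown token.
partial = pd.DataFrame({'topic_1': [0.9, np.nan, 0.7]},
                       index=['alpha', 'beta', 'gamma'])

current.update(partial)  # NaN cells and labels missing from `current` are skipped
print(current)
```

Only the `alpha`/`topic_1` cell changes here: `topic_0` is absent from the update, the `beta` value is NaN, and `gamma` is not a row of `current`.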

In [24]:
model2 = init_custom_phi(model, class_id='@default_class', class_id_phi=phi_new)

As expected, the values for the first eight topics (`topic_0` to `topic_7`) have been updated, while the last two topics are unchanged:

In [25]:
model2.get_phi().head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
predebate,8.1e-05,0.000266,8.2e-05,6.4e-05,5.2e-05,0.000273,2.7e-05,0.000147,5.111458e-15,1.340616e-09
barbour,0.000271,9.3e-05,0.000117,0.000233,0.000229,0.000215,0.000136,2e-06,1.18092e-12,1.898585e-07
bumblebums,0.000124,1.1e-05,0.000226,1.2e-05,0.000223,0.000186,3.1e-05,0.000235,4.017255e-13,0.0
spindizzy,0.000172,0.000116,0.000101,0.000138,1.4e-05,6.2e-05,0.000144,8.5e-05,2.716291e-14,0.0
mcentee,0.000126,5.2e-05,0.000279,3.7e-05,0.000105,7.3e-05,0.000105,0.000115,0.0005604492,1.04684e-15


In [26]:
model2.get_phi().tail()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
deadline,0.000267,8.3e-05,0.000217,0.000128,0.000237,0.00023,3e-05,0.000155,0.000118,0.0001040973
contempt,0.000148,4e-05,4.5e-05,6.9e-05,0.000119,0.000263,2.8e-05,0.000259,7e-05,0.0002967782
administration,2.6e-05,4.9e-05,8.7e-05,5e-05,0.000231,0.000222,2.2e-05,0.000236,0.000224,0.005165888
assets,4e-06,0.000244,4e-05,2.3e-05,0.00026,5e-06,0.000263,8.6e-05,5.8e-05,7.100159e-08
close,0.000159,0.000205,7.6e-05,1.9e-05,1.4e-05,6.6e-05,0.000209,0.000129,0.000904,0.0006757349


In [27]:
model2.fit_offline(batch_vectorizer, num_collection_passes=5)

In [28]:
model2.get_phi().head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
predebate,5.92935e-05,4.436006e-06,1.7e-05,8e-06,0.0,1.160829e-05,0.0,9.17099e-07,1.020178e-05,1.00698e-05
barbour,4.506564e-09,2.968563e-14,0.0,3e-06,0.000461752,7.498069e-05,2.889659e-06,1.160586e-13,1.61849e-09,5.676034e-09
bumblebums,1.331954e-09,1.588872e-09,0.000389,0.0,2.884993e-06,1.915237e-09,0.0,0.0,5.879125e-11,0.0
spindizzy,8.243961e-10,5.129966e-10,0.000406,0.0,2.683245e-06,3.751949e-09,0.0,0.0,7.105313e-11,0.0
mcentee,3.960219e-16,6.763284e-16,0.0,0.0,1.853338e-11,2.047387e-10,3.656444e-12,0.0,0.0005578498,8.742719e-11


## Theta attaching

Attaching Theta works almost the same way. There are only two differences: (1) you must specify `theta_name` when creating the model instance, and (2) the matrix behind the Theta reference is transposed (documents in rows, topics in columns).

In [29]:
def init_custom_theta(model_, class_id_theta):
    model = model_.clone()
    
    tm_info, theta_ref = model.master.attach_model(model=model.theta_name)

    fields = tm_info.ListFields()
    doc_order = np.array(fields[3][1])
    topic_order = np.array(fields[2][1])
    
    new_theta = pd.DataFrame(data=theta_ref.T,
                             index=topic_order,
                             columns=doc_order)

    current_theta = new_theta.copy()
    current_theta.update(class_id_theta)
    new_theta = current_theta.copy()

    np.copyto(theta_ref, new_theta.values.T)
    return model
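The transposition is easy to get wrong, so here is a sketch of the same steps on a toy buffer. The document names, the values and the zero-filled `theta_buffer` are hypothetical; the buffer only imitates the documents-by-topics layout that the function above assumes.

```python
import numpy as np
import pandas as pd

doc_order = ['doc_a', 'doc_b', 'doc_c']
topic_order = ['topic_0', 'topic_1']

# Stand-in for the attached Theta buffer, stored as documents x topics.
theta_buffer = np.zeros((len(doc_order), len(topic_order)), dtype=np.float32)

# User-facing layout (as returned by get_theta()): topics x documents.
custom_theta = pd.DataFrame([[0.2, 0.5, 1.0], [0.8, 0.5, 0.0]],
                            index=topic_order, columns=doc_order)

# Work on a float64 copy of the transposed buffer so that label-based updates are safe.
frame = pd.DataFrame(theta_buffer.T.astype(np.float64),
                     index=topic_order, columns=doc_order)
frame.update(custom_theta)
np.copyto(theta_buffer, frame.values.T)  # transpose back before writing in place

print(theta_buffer)
```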

In [30]:
x = np.random.random(theta.shape)
x = np.nan_to_num(x / np.sum(x, axis=0)[None])

theta_new = pd.DataFrame(data=x,
                         columns=theta.columns,
                         index=theta.index)

In [31]:
theta_new.head()

Unnamed: 0,3001,3002,3003,3004,3005,3006,3007,3008,3009,3010,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
topic_0,0.142788,0.082596,0.106861,0.041266,0.169127,0.103364,0.073396,0.034039,0.222823,0.112804,...,0.170808,0.019612,0.063559,0.129988,0.144326,0.048276,0.145229,0.1558,0.095519,0.135563
topic_1,0.044472,0.109912,0.098161,0.121733,0.074178,0.068061,0.044898,0.028275,0.024751,0.128402,...,0.080104,0.027547,0.041547,0.019209,0.009714,0.033005,0.05728,0.134579,0.240483,0.042533
topic_2,0.073878,0.14291,0.046963,0.133567,0.125171,0.159091,0.188441,0.018838,0.142736,0.128207,...,0.009553,0.066851,0.026521,0.100441,0.002826,0.086326,0.074711,0.018839,0.058775,0.095397
topic_3,0.134575,0.126338,0.044359,0.108228,0.01138,0.001977,0.027978,0.056209,0.100357,0.022552,...,0.147635,0.116031,0.019031,0.112683,0.110663,0.033118,0.005276,0.107551,0.025976,0.000112
topic_4,0.100853,0.139144,0.12816,0.096326,0.188389,0.14444,0.079858,0.170195,0.058386,0.322618,...,0.165004,0.055706,0.253409,0.036142,0.108889,0.123851,0.195287,0.058474,0.042512,0.106639


In [32]:
model3 = init_custom_theta(model, class_id_theta=theta_new)

In [33]:
model3.get_theta().head()

Unnamed: 0,3001,3002,3003,3004,3005,3006,3007,3008,3009,3010,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
topic_0,0.142788,0.082596,0.106861,0.041266,0.169127,0.103364,0.073396,0.034039,0.222823,0.112804,...,0.170808,0.019612,0.063559,0.129988,0.144326,0.048276,0.145229,0.1558,0.095519,0.135563
topic_1,0.044472,0.109912,0.098161,0.121733,0.074178,0.068061,0.044898,0.028275,0.024751,0.128402,...,0.080104,0.027547,0.041547,0.019209,0.009714,0.033005,0.05728,0.134579,0.240483,0.042533
topic_2,0.073878,0.14291,0.046963,0.133567,0.125171,0.159091,0.188441,0.018838,0.142736,0.128207,...,0.009553,0.066851,0.026521,0.100441,0.002826,0.086326,0.074711,0.018839,0.058775,0.095397
topic_3,0.134575,0.126338,0.044359,0.108228,0.01138,0.001977,0.027978,0.056209,0.100357,0.022552,...,0.147635,0.116031,0.019031,0.112683,0.110663,0.033118,0.005276,0.107551,0.025976,0.000112
topic_4,0.100853,0.139144,0.12816,0.096326,0.188389,0.14444,0.079858,0.170195,0.058386,0.322618,...,0.165004,0.055706,0.253409,0.036142,0.108889,0.123851,0.195287,0.058474,0.042512,0.106639


## A few words about fixing Phi during iterations

To keep the Phi matrix fixed and obtain the Theta matrix, just use the `transform()` method. The effect of all regularizers persists in this case.
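To illustrate what "fixed Phi" means, the following toy NumPy sketch estimates Theta with PLSA-style updates while never modifying Phi. It is not BigARTM's implementation and applies no regularizers; the function `estimate_theta_fixed_phi`, the random counts and the matrix sizes are all hypothetical.

```python
import numpy as np


def estimate_theta_fixed_phi(n_wd, phi, num_iters=20, seed=0):
    """Toy PLSA-style estimate of Theta (topics x docs); Phi (tokens x topics) is only read."""
    rng = np.random.default_rng(seed)
    theta = rng.random((phi.shape[1], n_wd.shape[1]))
    theta /= theta.sum(axis=0, keepdims=True)
    for _ in range(num_iters):
        p_wd = phi @ theta  # model probabilities p(w|d), tokens x docs
        ratio = np.divide(n_wd, p_wd, out=np.zeros_like(p_wd), where=p_wd > 0)
        n_td = theta * (phi.T @ ratio)  # expected topic counts per document
        theta = n_td / n_td.sum(axis=0, keepdims=True)
    return theta


rng = np.random.default_rng(1)
phi_fixed = rng.random((6, 3))
phi_fixed /= phi_fixed.sum(axis=0, keepdims=True)        # each topic column sums to 1
counts = rng.integers(1, 5, size=(6, 4)).astype(float)   # hypothetical token counts

theta_est = estimate_theta_fixed_phi(counts, phi_fixed)
print(theta_est.sum(axis=0))  # each document column is a distribution over topics
```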