# Tutorial for the CollocationNet prediction

This tutorial shows how to use the predict function in CollocationNet to rank the adjectives in a given list by how likely each one is to describe a given noun. In the example we use the noun 'kohv' (Estonian for 'coffee') and the adjectives ['kange', 'tugev'] (both translate roughly as 'strong') to predict which of the two is more likely to describe 'kohv'.

### The usage of predict in CollocationNet

To rank the adjectives by how likely they are to describe the noun, we first need to import CollocationNet and create a CollocationNet object:

In [1]:
from collocation_net import CollocationNet

In [2]:
cn = CollocationNet()

Then we can use the prediction function in the following way:

In [3]:
cn.predict(noun='kohv', adjectives=['kange', 'tugev'])

[('kange', 0.10684924723040845), ('tugev', 7.489607178763322e-05)]

As we can see, the predict function returns a list sorted by descending probability. Each element is a tuple consisting of an adjective and the probability that the adjective describes the given noun. In this example the collocation 'kange kohv' is far more likely than 'tugev kohv'.
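
Because the result is a plain list of tuples, it can be processed with ordinary Python. The sketch below hard-codes the output shown above so that it runs without CollocationNet installed. The variable names are illustrative and not part of the library.

```python
# Output of cn.predict(noun='kohv', adjectives=['kange', 'tugev']), copied from above.
predictions = [('kange', 0.10684924723040845), ('tugev', 7.489607178763322e-05)]

# The first tuple holds the most likely adjective in this output.
top_adjective, top_prob = predictions[0]

# A dict makes it easy to look up the probability of a specific adjective.
prob_by_adjective = dict(predictions)

# How many times more likely 'kange' is than 'tugev' for this noun.
ratio = prob_by_adjective['kange'] / prob_by_adjective['tugev']
print(top_adjective, round(ratio))
```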

### Manual calculation of the results

Let's first import the modules necessary for the calculations.

In [4]:
import scipy.stats
import numpy as np

CollocationNet is built on a Latent Dirichlet Allocation (LDA) model that was trained on a dataset of noun-adjective collocations from a corpus. The LDA model divides the data into clusters (topics) and outputs two matrices: one describing how the nouns are distributed over the topics and one describing how the adjectives are distributed over them.

The matrices obtained from the LDA model are indexed by position rather than by word, so let's first find the indices of the noun and the adjectives in order to locate their vectors. We can do that using the dataset stored in CollocationNet.

In [5]:
data = cn.data
kohv_idx = list(data.index).index('kohv')
kange_idx = list(data.columns).index('kange')
tugev_idx = list(data.columns).index('tugev')
print(kohv_idx, kange_idx, tugev_idx)

1505 127 73


The noun distribution matrix contains one vector per noun, holding the probabilities of that noun belonging to each topic. The adjective distribution matrix contains pseudocounts that show how strongly each adjective was assigned to each topic.

In [6]:
noun_dist = cn.noun_dist
adj_dist = cn.adj_dist
print(noun_dist.shape, adj_dist.shape)

(5817, 1000) (1000, 7358)
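
The shapes show that nouns are stored as rows of the first matrix and adjectives as columns of the second. The toy sketch below illustrates why a noun's vector is taken with a row index while an adjective's vector needs a column slice. All numbers and names are made up for illustration and are not taken from the model.

```python
import numpy as np

# Toy stand-ins for cn.noun_dist and cn.adj_dist: 3 nouns, 2 topics, 4 adjectives.
toy_noun_dist = np.array([
    [0.9, 0.1],   # noun 0 belongs mostly to topic 0
    [0.5, 0.5],   # noun 1 is split evenly
    [0.2, 0.8],   # noun 2 belongs mostly to topic 1
])
toy_adj_dist = np.array([
    [12.0, 0.5, 3.0, 0.1],   # pseudocounts of each adjective in topic 0
    [0.3, 9.0, 3.0, 0.1],    # pseudocounts of each adjective in topic 1
])

# Nouns are rows, so a noun's topic vector is selected with a row index ...
noun_vector = toy_noun_dist[2]
# ... while adjectives are columns, so an adjective's vector is a column slice.
adj_vector = toy_adj_dist[:, 1]

print(toy_noun_dist.shape, toy_adj_dist.shape)
print(noun_vector, adj_vector)
```

Both selected vectors have one entry per topic, which is what makes the later multiplication possible.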


To calculate the probability of each adjective forming a collocation with the given noun, we first need to sample from the Dirichlet distribution, using the adjective's vector of topic pseudocounts as the distribution's parameters. Each sample contains one value per topic, and since our model has 1000 topics, each sample is a vector of 1000 values that sum to 1. To approximate the result in practice, we draw 1000 such vectors for each adjective and later average over them.

In [7]:
kange_theta = scipy.stats.dirichlet.rvs(adj_dist[:, kange_idx], size=1000)
tugev_theta = scipy.stats.dirichlet.rvs(adj_dist[:, tugev_idx], size=1000)
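
To see what such a sample looks like, here is a minimal sketch with a made-up pseudocount vector over 4 topics. It uses NumPy's `Generator.dirichlet` with a fixed seed instead of `scipy.stats.dirichlet.rvs`; that swap is an assumption made for this illustration only, and both draw from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up pseudocounts of one adjective over 4 topics.
alpha = np.array([8.0, 1.0, 0.5, 0.5])

# Draw 1000 vectors; each has one component per topic.
theta = rng.dirichlet(alpha, size=1000)

print(theta.shape)               # one row per draw, one column per topic
print(theta.sum(axis=1)[:3])     # every draw sums to 1 (up to rounding)
print(theta.mean(axis=0))        # averages land near the normalised pseudocounts
print(alpha / alpha.sum())
```

Topics with larger pseudocounts receive larger sampled values on average, which is why the pseudocounts can stand in for how strongly the adjective belongs to each topic.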

We will also need the vector of probabilities corresponding to the noun.

In [8]:
kohv_topics = noun_dist[kohv_idx]

To find the probability of the collocation, we take the dot product of the noun's vector of topic probabilities and a vector of sampled Dirichlet values: the two values for each topic are multiplied and the products are summed. If a topic has a high probability for the noun and the sampled value for that topic is also high (meaning the original pseudocount was high), their product is large and increases the overall probability of the collocation. Conversely, if the most probable topics for the noun and the highest pseudocounts of the adjective don't align, the probability of the collocation will be lower.

Since we have 1000 vectors of sampled Dirichlet values per adjective, the final probability is the average over those 1000 simulations.

In [9]:
kange_prob = np.mean(np.matmul(kohv_topics, kange_theta.T))
tugev_prob = np.mean(np.matmul(kohv_topics, tugev_theta.T))

In [10]:
print(("kange", kange_prob), ("tugev", tugev_prob))

('kange', 0.10683482609763297) ('tugev', 7.418066120767987e-05)
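
The multiply-and-average step can be traced on a toy example small enough to read. The sketch below uses made-up values for 3 topics and only 5 draws. The names are hypothetical, and NumPy's `Generator.dirichlet` is used for sampling as an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up values: a noun's topic probabilities and an adjective's pseudocounts.
noun_topics = np.array([0.7, 0.2, 0.1])
adj_pseudocounts = np.array([6.0, 1.0, 1.0])

# 5 draws keep the intermediate result small enough to inspect.
theta = rng.dirichlet(adj_pseudocounts, size=5)    # shape (5, 3)

# A length-3 vector times a (3, 5) matrix gives one dot product per draw.
per_draw = np.matmul(noun_topics, theta.T)         # shape (5,)

# The same numbers written as an explicit weighted sum over topics.
per_draw_explicit = (theta * noun_topics).sum(axis=1)

# The final estimate is the average over all draws.
estimate = per_draw.mean()
print(per_draw, estimate)
```

Each per-draw value is a weighted average of the noun's topic probabilities, so it always lies between the smallest and the largest of them.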


The manually calculated probabilities differ slightly from those returned by CollocationNet.predict because the sampled Dirichlet values are random.
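
The size of this run-to-run variation can be illustrated with a toy model. The sketch below uses made-up values and a hypothetical helper `estimate`, repeats the simulation with different seeds, and compares the results with the exact expectation. The comparison relies on the fact that the mean of a Dirichlet distribution is its parameter vector divided by its sum. How CollocationNet.predict handles sampling internally is not covered here.

```python
import numpy as np

# Made-up values for a toy model with 4 topics.
noun_topics = np.array([0.7, 0.2, 0.05, 0.05])
adj_pseudocounts = np.array([8.0, 1.0, 0.5, 0.5])

def estimate(seed, n_draws=1000):
    """Monte Carlo estimate of the collocation probability for one seed."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(adj_pseudocounts, size=n_draws)
    return float(np.mean(theta @ noun_topics))

run_a = estimate(seed=1)
run_b = estimate(seed=2)        # different seed: slightly different result
run_a_again = estimate(seed=1)  # same seed: identical result

# Exact value the simulation approximates: noun vector dotted with the
# Dirichlet mean, alpha / sum(alpha).
expected = float(noun_topics @ (adj_pseudocounts / adj_pseudocounts.sum()))

print(run_a, run_b, expected)
```

Fixing the seed makes a run repeatable, while increasing `n_draws` narrows the gap between the estimate and the exact value.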