<a href="https://colab.research.google.com/github/dmcguire81/metapy/blob/master/tutorials/5-topic-modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
%pip install https://github.com/dmcguire81/metapy/releases/download/v0.2.14/metapy-0.2.14-cp37-cp37m-manylinux_2_24_x86_64.whl

First, let's import the Python bindings, as usual.

In [2]:
import metapy

In [3]:
metapy.__version__  # your version should be >= the one shown below

'0.2.14'

If you like, you can tell MeTA to write its log output to stderr like so:

In [4]:
metapy.log_to_stderr()

Now, let's download a list of stopwords and a sample dataset to begin exploring MeTA's topic models.

In [5]:
%%capture
!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [6]:
%%capture
!wget -N https://meta-toolkit.org/data/2016-01-26/ceeaus.tar.gz
!tar xf ceeaus.tar.gz

We will need to index our data to proceed. We eventually want to be able to extract the bag-of-words representation for our individual documents, so we will want a `ForwardIndex` in this case.

In [7]:
%%capture
!wget -N https://raw.githubusercontent.com/dmcguire81/metapy/master/tutorials/ceeaus-config.toml

In [8]:
fidx = metapy.index.make_forward_index('ceeaus-config.toml')

1669441892: [info]     Creating forward index: ceeaus-idx/fwd (/metapy/deps/meta/src/index/forward_index.cpp:239)
1669441893: [info]     Done creating index: ceeaus-idx/fwd (/metapy/deps/meta/src/index/forward_index.cpp:278)


Just like in classification, the feature set used for the topic modeling will be the feature set used at the time of indexing, so if you want to play with a different set of features (like bigram words), you will need to re-index your data.

For now, we've just stuck with the default filter chain for unigram words, so we're operating in the traditional bag-of-words space.

Let's load our documents into memory to run the topic model inference now.

In [9]:
dset = metapy.learn.Dataset(fidx)



Now, let's try to find some topics for this dataset. To do so, we're going to use a generative model called a topic model.

There are many different topic models in the literature, but the most commonly used one is Latent Dirichlet Allocation (LDA). Here, we propose that there are $K$ topics $\phi_k$ (each represented by a categorical distribution over words) from which all of our documents are generated. These $K$ topics are modeled as being sampled from a Dirichlet distribution with parameter $\vec{\alpha}$. Then, to generate a document $d$, we first sample a distribution over the $K$ topics, $\theta_d$, from another Dirichlet distribution with parameter $\vec{\beta}$. For each word in this document, we sample a topic identifier $z \sim \theta_d$ and then draw the word from the topic we selected ($w \sim \phi_z$). Refer to the [Wikipedia article on LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for more information.
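To make the generative story concrete, here is a minimal sketch of it in plain Python (no metapy involved). The function names, the toy vocabulary, and the prior values are all illustrative assumptions; `topic_prior` and `proportion_prior` play the roles of the two Dirichlet parameters described above, both taken to be symmetric.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(vocab, num_topics, num_docs, doc_len,
                    topic_prior, proportion_prior, seed=0):
    rng = random.Random(seed)
    # phi[k]: categorical distribution over the vocabulary for topic k,
    # drawn from a symmetric Dirichlet (the prior on topics in the text).
    phi = [sample_dirichlet([topic_prior] * len(vocab), rng)
           for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # theta: this document's distribution over topics
        # (the prior on topic proportions in the text).
        theta = sample_dirichlet([proportion_prior] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # z ~ theta_d
            w = sample_index(phi[z], rng)  # w ~ phi_z
            words.append(vocab[w])
        docs.append(words)
    return phi, docs

# Hypothetical toy vocabulary and settings, chosen only for illustration.
toy_vocab = ["smoke", "ban", "job", "student", "think"]
phi, docs = generate_corpus(toy_vocab, num_topics=2, num_docs=3, doc_len=8,
                            topic_prior=0.5, proportion_prior=0.5)
for words in docs:
    print(" ".join(words))
```

Inference runs this story in reverse: given only the words, it estimates the `phi` and `theta` values that plausibly produced them.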

The goal of running inference for an LDA model is to infer the latent variables $\phi_k$ and $\theta_d$ for all of the $K$ topics and $D$ documents, respectively. MeTA provides a number of different inference algorithms for LDA, as each one entails a different set of trade-offs (exact inference in LDA is intractable, so all inference algorithms are approximations; different algorithms entail different approximation guarantees, running times, and memory consumption). For now, let's run a variational inference algorithm called CVB0 to find two topics. (In practice you will likely be finding many more topics than just two, but this is a very small toy dataset.)

In [10]:
lda_inf = metapy.topics.LDACollapsedVB(dset, num_topics=2, alpha=1.0, beta=0.01)
lda_inf.run(num_iters=1000)

Iteration 1 maximum change in gamma: 1.82245                                    
Iteration 2 maximum change in gamma: 0.473407                                   
Iteration 3 maximum change in gamma: 0.327967                                   
Iteration 4 maximum change in gamma: 0.576433                                   
Iteration 5 maximum change in gamma: 0.657188                                   
Iteration 6 maximum change in gamma: 1.00488                                    
Iteration 7 maximum change in gamma: 1.35403                                    
Iteration 8 maximum change in gamma: 1.4486                                     
Iteration 9 maximum change in gamma: 1.51951                                    
Iteration 10 maximum change in gamma: 1.43066                                   
Iteration 11 maximum change in gamma: 1.32694                                   
Iteration 12 maximum change in gamma: 1.20058                                   
Iteration 13 maximum change 

The cell above runs the CVB0 algorithm for up to 1000 iterations, stopping early if an algorithm-specific convergence criterion is met. Now let's save the current estimate for our topics and topic proportions.

In [11]:
lda_inf.save('lda-cvb0')

We can interrogate the topic inference results by using the `TopicModel` query class. Let's load our inference results back in.

In [12]:
model = metapy.topics.TopicModel('lda-cvb0')



Now, let's have a look at our topics. A typical way of doing this is to print the top $k$ words in each topic, so let's do that.

In [13]:
model.top_k(tid=0)

[(3341, 0.131104037927063),
 (3045, 0.05434934835077168),
 (2677, 0.03678011673524976),
 (3346, 0.03349265715315878),
 (281, 0.022530685310436133),
 (3729, 0.015620493350789287),
 (1953, 0.01278092492670823),
 (707, 0.01263507617737916),
 (592, 0.011987189464605957),
 (2448, 0.011317746215793184)]

The models operate on term ids instead of raw text strings, so let's convert this to a human-readable format by using the vocabulary contained in our `ForwardIndex` to map the term ids to strings.

In [14]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0)]

[('smoke', 0.131104037927063),
 ('restaur', 0.05434934835077168),
 ('peopl', 0.03678011673524976),
 ('smoker', 0.03349265715315878),
 ('ban', 0.022530685310436133),
 ('think', 0.015620493350789287),
 ('japan', 0.01278092492670823),
 ('complet', 0.01263507617737916),
 ('cigarett', 0.011987189464605957),
 ('non', 0.011317746215793184)]

In [15]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1)]

[('time', 0.06705641912649929),
 ('job', 0.05605927233040223),
 ('part', 0.05222303729314569),
 ('student', 0.04642936177478375),
 ('colleg', 0.0348813900900554),
 ('work', 0.029067466795057315),
 ('money', 0.02885020547633855),
 ('think', 0.022331336604501116),
 ('import', 0.020755687395793487),
 ('studi', 0.015483027954764635)]

We can pretty clearly see that this particular dataset is about two major issues: smoking in public and part-time jobs for students. This dataset is actually a collection of essays written by students, and there just so happened to be two different topics they could choose from!
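When there are more than two topics, repeating the list comprehension per topic gets tedious. A small helper along these lines could collect every topic in one pass. This is only a sketch: `describe_topics` and the two toy stand-in classes are hypothetical names introduced here, and the helper relies solely on the `top_k(tid=...)` and `term_text(...)` calls already used in this tutorial.

```python
def describe_topics(model, index, num_topics, scorer=None):
    """Map each topic id to its top (term string, weight) pairs."""
    summaries = {}
    for tid in range(num_topics):
        # Only keyword arguments that appear in this tutorial are relied on here.
        if scorer is None:
            pairs = model.top_k(tid=tid)
        else:
            pairs = model.top_k(tid=tid, scorer=scorer)
        summaries[tid] = [(index.term_text(term_id), weight)
                          for term_id, weight in pairs]
    return summaries

# Toy stand-ins (made-up ids and weights) so the sketch runs without metapy.
class ToyModel:
    def __init__(self, topics):
        self._topics = topics

    def top_k(self, tid, scorer=None):
        return self._topics[tid]

class ToyIndex:
    def __init__(self, vocab):
        self._vocab = vocab

    def term_text(self, term_id):
        return self._vocab[term_id]

toy_model = ToyModel({0: [(0, 0.6), (1, 0.4)], 1: [(2, 0.7), (3, 0.3)]})
toy_index = ToyIndex({0: "smoke", 1: "ban", 2: "job", 3: "student"})
summaries = describe_topics(toy_model, toy_index, num_topics=2)
for tid, terms in summaries.items():
    print(tid, terms)
```

With the real objects from this notebook, the call would presumably be `describe_topics(model, fidx, num_topics=2)`.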

The topics are pretty clear in this case, but in some cases it is also useful to score the terms in a topic using some function of the probability of the word in the topic and the probability of the word in the other topics. Intuitively, we might want to select words from each topic that best reflect that topic's content by picking words that both have high probability in that topic **and** have low probability in the other topics. In other words, we want to balance high-probability terms against highly specific terms (similar in spirit to tf-idf weighting). One such scoring function is provided by the toolkit in `BLTermScorer`, which implements a scoring function proposed by Blei and Lafferty.

In [16]:
scorer = metapy.topics.BLTermScorer(model)
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0, scorer=scorer)]

[('smoke', 0.8741649117009036),
 ('restaur', 0.3174629035659571),
 ('smoker', 0.20060276513583242),
 ('ban', 0.1285305927149838),
 ('cigarett', 0.06557610155520721),
 ('non', 0.06128425189479603),
 ('complet', 0.06105374081168069),
 ('japan', 0.05846309534398397),
 ('health', 0.05054837315437988),
 ('seat', 0.045339919921592475)]

In [17]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1, scorer=scorer)]

[('job', 0.34822035155943354),
 ('part', 0.313110580762872),
 ('student', 0.2832891809207996),
 ('colleg', 0.20808986731897416),
 ('time', 0.17797667989825072),
 ('money', 0.16234674183749842),
 ('work', 0.15585204614058856),
 ('studi', 0.08228285335580843),
 ('learn', 0.06491904130151305),
 ('experi', 0.05494478251960788)]

Here we can see that the uninformative word stem "think" was down-weighted and dropped out of the top-word list for each topic, since it had relatively high probability in both topics.
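To see why a shared word like "think" drops out, it can help to compute such a score by hand. The sketch below uses the term score as it is usually stated from Blei and Lafferty's work, $\text{score}_{k,w} = p_{k,w} \log\left(p_{k,w} / \left(\prod_{j=1}^{K} p_{j,w}\right)^{1/K}\right)$, on made-up probabilities. It is an assumption that `BLTermScorer` computes exactly this quantity, and the function name and toy numbers are hypothetical.

```python
import math

def bl_term_scores(topic_word_probs):
    """topic_word_probs[k][w] = P(w | topic k); returns scores of the same shape.

    Assumes every word has non-zero probability in every topic.
    """
    num_topics = len(topic_word_probs)
    scores = [dict() for _ in range(num_topics)]
    for w in topic_word_probs[0]:
        # Log of the geometric mean of this word's probability across topics.
        mean_log = sum(math.log(t[w]) for t in topic_word_probs) / num_topics
        for k, t in enumerate(topic_word_probs):
            scores[k][w] = t[w] * (math.log(t[w]) - mean_log)
    return scores

# Made-up toy distributions: "think" is equally likely in both topics.
toy_topics = [
    {"smoke": 0.45, "think": 0.30, "ban": 0.20, "job": 0.05},
    {"smoke": 0.15, "think": 0.30, "ban": 0.05, "job": 0.50},
]
scores = bl_term_scores(toy_topics)

by_probability = sorted(toy_topics[0], key=toy_topics[0].get, reverse=True)
by_score = sorted(scores[0], key=scores[0].get, reverse=True)
print(by_probability)  # ['smoke', 'think', 'ban', 'job']
print(by_score)        # ['smoke', 'ban', 'think', 'job']
```

Because "think" has the same probability in both toy topics, its probability equals the geometric mean and its score is zero, so the more topic-specific "ban" overtakes it.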

We can also see the inferred topic distribution for each document.

In [18]:
model.topic_distribution(0)

<metapy.stats.Multinomial {0: 0.021341, 1: 0.978659}>

It looks like our first document was written by a student who chose the part-time job essay topic...

In [19]:
model.topic_distribution(900)

<metapy.stats.Multinomial {0: 0.978797, 1: 0.021203}>

...whereas this document looks like it was written by a student who chose the public smoking essay topic.
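Reading off the most probable topic for each document is a simple way to turn these distributions into labels. The sketch below works on plain dictionaries copied from the two printed outputs above rather than on the `Multinomial` objects themselves, since how to read probabilities out of those objects is not shown in this tutorial. The helper name and the label strings are illustrative.

```python
def dominant_topic(proportions):
    """Return (topic id, probability) for the most probable topic."""
    return max(proportions.items(), key=lambda item: item[1])

# Values copied by hand from the printed distributions above.
doc_0 = {0: 0.021341, 1: 0.978659}
doc_900 = {0: 0.978797, 1: 0.021203}

# Human-readable labels based on the top words seen for each topic.
labels = {0: "public smoking", 1: "part-time job"}

for name, proportions in [("document 0", doc_0), ("document 900", doc_900)]:
    tid, prob = dominant_topic(proportions)
    print(f"{name}: {labels[tid]} ({prob:.3f})")
```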

We can also infer topics for a brand new document. First, let's create the document and use the forward index we loaded before to convert it to a feature vector:

In [20]:
doc = metapy.index.Document()
doc.content("I think smoking in public is bad for others' health.")
fvec = fidx.tokenize(doc)

Now, let's load a topic model inferencer that uses the same CVB inference method we used earlier:

In [21]:
inferencer = metapy.topics.CVBInferencer('lda-cvb0.phi.bin', alpha=1.0)



Now, let's infer the topic proportions for the new document:

In [22]:
proportions = inferencer.infer(fvec, max_iters=20, convergence=1e-4)
print(proportions)

<metapy.stats.Multinomial {0: 0.814392, 1: 0.185608}>
