First, let's import the Python bindings, as usual.

In [1]:
import metapy

In [2]:
metapy.__version__ # your version should be >= the one shown below

'0.2.13'

If you like, you can tell MeTA to write its log output to stderr like so:

In [3]:
metapy.log_to_stderr()

Now, let's download a list of stopwords and a sample dataset to begin exploring MeTA's topic models.

In [4]:
#!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [10]:
fidx = metapy.index.make_forward_index('review.toml')

RuntimeError: failed failed to open input file ./reviews/reviews.dat

Just like in classification, the feature set used for the topic modeling will be the feature set used at the time of indexing, so if you want to play with a different set of features (like bigram words), you will need to re-index your data.

For now, we've just stuck with the default filter chain for unigram words, so we're operating in the traditional bag-of-words space.
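To make the difference between unigram and bigram features concrete, here is a minimal sketch of bag-of-words counting in plain Python. It is not MeTA's filter chain: the helper `bag_of_ngrams` and the sample tokens are made up for illustration, and no stemming or stopword removal is applied.

```python
from collections import Counter

def bag_of_ngrams(tokens, n=1):
    """Count contiguous n-grams in a token list, ignoring where they occur."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical, already-tokenized text (not taken from the dataset).
tokens = ["great", "food", "and", "great", "service"]

unigrams = bag_of_ngrams(tokens, n=1)  # the traditional bag-of-words view
bigrams = bag_of_ngrams(tokens, n=2)   # pairs of adjacent words
print(unigrams)
print(bigrams)
```

Because the two feature sets have different vocabularies, a model trained on one cannot be reused with the other, which is why a change of features means re-indexing.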

Let's load our documents into memory to run the topic model inference now.

In [6]:
dset = metapy.learn.Dataset(fidx)



Now, let's try to find some topics for this dataset. To do so, we're going to use a generative model called a topic model.

There are many different topic models in the literature, but the most commonly used one is Latent Dirichlet Allocation (LDA). Here, we propose that there are $K$ topics $\phi_k$ (each represented as a categorical distribution over words) from which all of our documents are generated. These $K$ topics are modeled as being sampled from a Dirichlet distribution with parameter $\vec{\alpha}$. Then, to generate a document $d$, we first sample a distribution over the $K$ topics $\theta_d$ from another Dirichlet distribution with parameter $\vec{\beta}$. Then, for each word in this document, we first sample a topic identifier $z \sim \theta_d$ and then the word by drawing from the topic we selected ($w \sim \phi_z$). Refer to the [Wikipedia article on LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for more information.
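The generative story above can be sketched in a few lines of plain Python. This is only an illustration of the sampling steps, not anything MeTA runs. The function names are invented here, and `topic_prior` and `doc_prior` stand in for $\vec{\alpha}$ and $\vec{\beta}$ from the paragraph above, simplified to symmetric scalars.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_topics, vocab_size, num_docs, doc_length,
                    topic_prior, doc_prior, seed=0):
    rng = random.Random(seed)
    # phi_k: one distribution over the vocabulary per topic.
    phi = [sample_dirichlet(topic_prior, vocab_size, rng)
           for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # theta_d: this document's distribution over topics.
        theta = sample_dirichlet(doc_prior, num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)    # topic id, z ~ theta_d
            w = sample_categorical(phi[z], rng)   # word id,  w ~ phi_z
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return phi, thetas, docs

phi, thetas, docs = generate_corpus(num_topics=3, vocab_size=8, num_docs=2,
                                    doc_length=5, topic_prior=0.5, doc_prior=1.0)
print(docs)
```

Inference runs this story in reverse: given only the word ids in `docs`, it tries to recover plausible values for `phi` and `thetas`.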

The goal of running inference for an LDA model is to infer the latent variables $\phi_k$ and $\theta_d$ for all of the $K$ topics and $D$ documents, respectively. MeTA provides a number of different inference algorithms for LDA, as each one entails a different set of trade-offs (inference in LDA is intractable, so all inference algorithms are approximations; different algorithms entail different approximation guarantees, running times, and memory requirements). For now, let's run a variational inference algorithm called CVB0 to find ten topics. (The number of topics is a parameter you choose up front; ten is simply the value used in this run.)

In [7]:
lda_inf = metapy.topics.LDACollapsedVB(dset, num_topics=10, alpha=1.0, beta=0.01)
lda_inf.run(num_iters=100)

Iteration 1 maximum change in gamma: 1.48113                                                                     
Iteration 2 maximum change in gamma: 0.456762                                                                    
Iteration 3 maximum change in gamma: 0.676432                                                                   
Iteration 4 maximum change in gamma: 1.16341                                                                     
Iteration 5 maximum change in gamma: 1.28079                                                                     
Iteration 6 maximum change in gamma: 1.21783                                                                     
Iteration 7 maximum change in gamma: 0.982676                                                                    
Iteration 8 maximum change in gamma: 0.91563                                                                     
Iteration 9 maximum change in gamma: 1.0244                                              

The above ran the CVB0 algorithm for up to 100 iterations, stopping earlier if an algorithm-specific convergence criterion was met. Now let's save the current estimate for our topics and topic proportions.
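The "iterations or convergence" stopping rule can be sketched generically. The log lines above suggest that the monitored quantity is the maximum change in gamma, but the exact criterion and threshold MeTA uses are not shown in this notebook, so `convergence=1e-3` and the toy update below are assumptions made purely for illustration.

```python
def run_until_converged(update, state, num_iters, convergence=1e-3):
    """Apply `update` repeatedly until the largest per-coordinate change
    falls below `convergence` or `num_iters` iterations have run."""
    iteration = 0
    max_change = float("inf")
    for iteration in range(1, num_iters + 1):
        new_state = update(state)
        max_change = max(abs(n - o) for n, o in zip(new_state, state))
        state = new_state
        if max_change < convergence:
            break
    return state, iteration, max_change

# Toy update (hypothetical): move every value halfway toward the mean.
def halfway_to_mean(values):
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]

state, iterations, max_change = run_until_converged(
    halfway_to_mean, [0.0, 1.0, 4.0], num_iters=100)
print(iterations, max_change)
```

Under this scheme `num_iters` is a ceiling rather than a fixed count of iterations.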

In [8]:
lda_inf.save('lda-cvb0')

We can interrogate the topic inference results by using the `TopicModel` query class. Let's load our inference results back in.

In [9]:
model = metapy.topics.TopicModel('lda-cvb0')



Now, let's have a look at our topics. A typical way of doing this is to print the top $k$ words in each topic, so let's do that.

In [10]:
model.top_k(tid=0)

[(65608, 0.02169470121983371),
 (50925, 0.021593963127798904),
 (74689, 0.021433829452810242),
 (10722, 0.018157006042275256),
 (49957, 0.015830883573488632),
 (88407, 0.013145897147562237),
 (26637, 0.012809181262481813),
 (90395, 0.011990212562659237),
 (42444, 0.01185693362520633),
 (44494, 0.011559300765737197)]

The models operate on term ids instead of raw text strings, so let's convert this to a human readable format by using the vocabulary contained in our `ForwardIndex` to map the term ids to strings.

In [11]:
topic = [[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id)] for id in range(0,10)]

We can pretty clearly see that this particular dataset was about two major issues: smoking in public and part time jobs for students. This dataset is actually a collection of essays written by students, and there just so happen to be two different topics they can choose from!

The topics are pretty clear in this case, but in some cases it is also useful to score the terms in a topic using some function of the probability of the word in the topic and the probability of the word in the other topics. Intuitively, we might want to select words from each topic that best reflect that topic's content by picking words that both have high probability in that topic **and** have low probability in the other topics. In other words, we want to balance between high probability terms and highly specific terms (this is kind of like a tf-idf weighting). One such scoring function is provided by the toolkit in `BLTermScorer`, which implements a scoring function proposed by Blei and Lafferty.

In [12]:
scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic ' + str(id) : [(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}
topic_words

{'Topic 0': [('sandwich', 0.22453031604072748),
  ('breakfast', 0.21624518246080698),
  ('lunch', 0.15868801891145246),
  ('egg', 0.12804156499564953),
  ('coffe', 0.11319242130427742),
  ('locat', 0.09750457224002426),
  ('place', 0.06950254951025886),
  ('pancak', 0.059233436614966155),
  ('shop', 0.049345013321590164),
  ('love', 0.04757087439206664)],
 'Topic 1': [('great', 0.7683970896050292),
  ('food', 0.4408239393507806),
  ('servic', 0.4287363300627014),
  ('place', 0.21367771865620055),
  ('love', 0.16825087306813202),
  ('friend', 0.1627250227054333),
  ('staff', 0.1612898450398487),
  ('amaz', 0.14160499731568105),
  ('atmospher', 0.12480218282213397),
  ('excel', 0.1122930126276172)],
 'Topic 2': [("don't", 0.061208941108643),
  ('custom', 0.06098670188148066),
  ('review', 0.05670711140935136),
  ('know', 0.0542926770240591),
  ('guy', 0.04472296562954788),
  ('bad', 0.044666589396489),
  ('star', 0.04293761256266961),
  ('manag', 0.03412672933801915),
  ("i'm", 0.0330745

Here we can see that the uninformative word stem "think" was down-weighted in each topic's word list, since it had relatively high probability in every topic.
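This down-weighting can be reproduced by hand. The sketch below computes the term score described by Blei and Lafferty: a word's probability in a topic multiplied by the log of that probability divided by the geometric mean of its probabilities across all topics. That `BLTermScorer` matches this formula in every detail is an assumption, and the two-topic `phi` matrix and three-word vocabulary are made up for the example.

```python
import math

def bl_term_scores(phi):
    """Score word w in topic k as p * log(p / geometric mean of p over topics).

    phi[k][w] is the probability of word w in topic k; all entries must be > 0.
    """
    num_topics = len(phi)
    vocab_size = len(phi[0])
    scores = [[0.0] * vocab_size for _ in range(num_topics)]
    for w in range(vocab_size):
        # Log of the geometric mean equals the mean of the logs.
        mean_log = sum(math.log(phi[k][w]) for k in range(num_topics)) / num_topics
        for k in range(num_topics):
            p = phi[k][w]
            scores[k][w] = p * (math.log(p) - mean_log)
    return scores

# Hypothetical vocabulary and topic-word probabilities.
vocab = ["think", "sandwich", "servic"]
phi = [
    [0.4, 0.5, 0.1],  # topic 0
    [0.4, 0.1, 0.5],  # topic 1
]
scores = bl_term_scores(phi)
for k, row in enumerate(scores):
    ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
    print("Topic", k, ranked)
```

A word that is equally probable in every topic, like "think" here, scores zero no matter how frequent it is, while a word concentrated in one topic rises to the top of that topic's list.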

We can also see the inferred topic distribution for each document.

In [None]:
import json

with open('review_topic.json', 'w') as f:
    json.dump(topic_words, f)
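One detail to keep in mind when reading such a file back: JSON has no tuple type, so every `(term, score)` pair comes back as a two-element list. The sketch below shows the round trip in memory using a shortened, made-up `topic_words` dictionary instead of the real file.

```python
import json

# Hypothetical, shortened stand-in for the dictionary built above.
topic_words = {"Topic 0": [("sandwich", 0.22), ("breakfast", 0.21)]}

# Serialize and parse again, as writing and re-reading the file would.
restored = json.loads(json.dumps(topic_words))
print(restored["Topic 0"][0])  # a list, not a tuple

# Convert the pairs back to tuples if downstream code expects them.
as_tuples = {name: [tuple(pair) for pair in pairs]
             for name, pairs in restored.items()}
```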

Chinese Restaurant Positive Reviews

In [None]:
fidx_pos = metapy.index.make_forward_index('chinese_review_pos.toml')
dset_pos = metapy.learn.Dataset(fidx_pos)

In [None]:
lda_inf_pos = metapy.topics.LDACollapsedVB(dset_pos, num_topics=10, alpha=1.0, beta=0.01)
lda_inf_pos.run(num_iters=1000)
lda_inf_pos.save('lda-cvb0-pos')

In [None]:
model = metapy.topics.TopicModel('lda-cvb0-pos')

scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic '+str(id): [(fidx_pos.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}

In [None]:
with open('chinese_pos_topic.json', 'w') as f:
    json.dump(topic_words, f)

topic_words

Chinese Restaurant Negative Reviews

In [None]:
fidx_neg = metapy.index.make_forward_index('chinese_review_neg.toml')
dset_neg = metapy.learn.Dataset(fidx_neg)
lda_inf_neg = metapy.topics.LDACollapsedVB(dset_neg, num_topics=10, alpha=1.0, beta=0.01)
lda_inf_neg.run(num_iters=1000)
lda_inf_neg.save('lda-cvb0-neg')

In [None]:
model = metapy.topics.TopicModel('lda-cvb0-neg')

scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic '+str(id): [(fidx_neg.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}

with open('chinese_neg_topic.json', 'w') as f:
    json.dump(topic_words, f)

topic_words

In [None]:
model.topic_distribution(0)

It looks like our first document was written by a student who chose the part-time job essay topic...

In [None]:
model.topic_distribution(900)

...whereas this document looks like it was written by a student who chose the public smoking essay topic.

We can also infer topics for a brand new document. First, let's create the document and use the forward index we loaded before to convert it to a feature vector:

In [None]:
doc = metapy.index.Document()
doc.content("I think smoking in public is bad for others' health.")
fvec = fidx.tokenize(doc)

Now, let's load a topic model inferencer that uses the same CVB inference method we used earlier:

In [None]:
inferencer = metapy.topics.CVBInferencer('lda-cvb0.phi.bin', alpha=1.0)

Now, let's infer the topic proportions for the new document:

In [None]:
proportions = inferencer.infer(fvec, max_iters=20, convergence=1e-4)
print(proportions)
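To give a feel for what inferring proportions for an unseen document involves, here is a simplified EM-style "fold-in" in plain Python with the topics held fixed. It is not the CVB0 update that `CVBInferencer` uses. The function `fold_in`, the toy `phi` matrix and the word counts are all hypothetical, and the parameter names merely mirror the call above.

```python
def fold_in(word_counts, phi, alpha=1.0, max_iters=20, convergence=1e-4):
    """Estimate topic proportions for one document with topics `phi` fixed.

    word_counts: {word_id: count}; phi[k][w]: probability of word w in topic k.
    """
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform guess
    for _ in range(max_iters):
        expected = [alpha] * num_topics  # smoothing pseudo-counts
        for w, count in word_counts.items():
            # Share each word's count among topics by theta_k * phi[k][w].
            weights = [theta[k] * phi[k][w] for k in range(num_topics)]
            total = sum(weights)
            for k in range(num_topics):
                expected[k] += count * weights[k] / total
        norm = sum(expected)
        new_theta = [e / norm for e in expected]
        change = max(abs(n - o) for n, o in zip(new_theta, theta))
        theta = new_theta
        if change < convergence:
            break
    return theta

# Hypothetical topics over a three-word vocabulary.
phi = [
    [0.7, 0.2, 0.1],  # topic 0 favors word 0
    [0.1, 0.2, 0.7],  # topic 1 favors word 2
]
theta = fold_in({0: 3, 1: 1}, phi)  # word 0 three times, word 1 once
print(theta)
```

Because the document is dominated by word 0, the estimate leans toward topic 0, and `alpha` keeps the other topic from collapsing to zero.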