
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
# NOTE: this assumes you've uploaded a Python 3.7 build from our fork to Drive
# TODO: replace this with a stock install when it's published somewhere
%pip install /content/drive/MyDrive/metapy-0.2.13-cp37-cp37m-manylinux_2_24_x86_64.whl

# Part 1: Feature Engineering for Text Data with MeTA

In this part of the tutorial, we'll explore how to go from raw text data to feature representations for documents using MeTA. Everything downstream depends on this representation, so it's important that we spend some time talking about the many different ways you can analyze documents into feature representations.

First, we'll import the `metapy` python bindings.

In [3]:
import metapy

For reference, this tutorial was written against the following metapy version:

In [4]:
metapy.__version__

'0.2.13'

If you'd like, you can tell MeTA to log to stderr so you can get progress output when running long-running function calls.

In [5]:
metapy.log_to_stderr()

Now, let's create a document with some content.

In [6]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")

MeTA provides a stream-based interface for performing document tokenization. Each stream starts off with a Tokenizer object, and in most cases you should use the [Unicode standard aware](http://site.icu-project.org) `ICUTokenizer`.

In [7]:
tok = metapy.analyzers.ICUTokenizer()

Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens. Let's try running just the `ICUTokenizer` to see what it does.

In [8]:
tok.set_content(doc.content()) # this could be any string
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>']

One thing you likely notice immediately is the insertion of the pseudo-XML `<s>` and `</s>` tags. These are called "sentence boundary tags". A default-constructed `ICUTokenizer` also detects the sentences in a document and delimits each one with these tags. Let's try tokenizing a multi-sentence document to see what that looks like.

In [9]:
doc.content("I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.")
tok.set_content(doc.content())
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>',
 '<s>',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.',
 '</s>']

Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more.

Let's pass a flag to the `ICUTokenizer` constructor to disable sentence boundary tags for now.

In [10]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok.set_content(doc.content())
[token for token in tok]

['I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.']

I mentioned earlier that MeTA treats tokenization as a *streaming* process, and that it *starts* with a tokenizer. As you've learned, for optimal search performance it's often beneficial to modify the raw underlying tokens of a document, and thus change its representation, before adding it to an inverted index structure for searching.

The "intermediate" steps in the tokenization stream are represented with objects called Filters. Each filter consumes the content of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way.

Let's start by using a simple filter that can help eliminate a lot of noise that we might encounter when tokenizing web documents: a `LengthFilter`.

In [11]:
tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)
tok.set_content(doc.content())
[token for token in tok]

['said',
 'that',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '19.95',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '30',
 'before']

Here, we can see that the `LengthFilter` is consuming our original `ICUTokenizer`. It modifies the token stream by only emitting tokens that are of a minimum length of 2 and a maximum length of 30. This can get rid of a lot of punctuation tokens, but also excessively long tokens such as URLs.
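To make the streaming idea concrete, here is a minimal pure-Python sketch of a filter wrapping a tokenizer. It does not use MeTA: `whitespace_tokens` and `length_filter` are hypothetical stand-ins for `ICUTokenizer` and `LengthFilter`, and splitting on whitespace is a simplifying assumption.

```python
def whitespace_tokens(text):
    """Toy tokenizer (an assumption): yield whitespace-separated tokens lazily."""
    for token in text.split():
        yield token


def length_filter(tokens, min_len, max_len):
    """Consume another token stream and emit only tokens within the length bounds."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token


# The filter wraps the tokenizer, just as LengthFilter wraps ICUTokenizer above.
stream = length_filter(whitespace_tokens("I saw a dog at the park"), min_len=2, max_len=30)
filtered = list(stream)
print(filtered)
```

Because each stage is a generator, tokens flow through the chain one at a time, and further filters can be stacked by wrapping `stream` again.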

Another common trick is to remove stopwords. (Can anyone tell me what a stopword is?) In MeTA, this is done using a `ListFilter`.

In [12]:
%%capture
!wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [13]:
tok = metapy.analyzers.ListFilter(tok, "lemur-stopwords.txt", metapy.analyzers.ListFilter.Type.Reject)
tok.set_content(doc.content())
[token for token in tok]

["can't", 'believe', 'costs', '19.95', 'find', '30']

Here we've downloaded a common list of stopwords obtained from the [Lemur project](http://lemurproject.org) and created a `ListFilter` to reject any tokens that occur in that list of words.

You can see how much of a difference removing stopwords can make on the size of a document's token stream! This translates to a lot of space savings in the inverted index as well.

Another common filter is a stemmer (a lemmatizer plays a similar role). This kind of filter tries to modify individual tokens so that different inflected forms of a word all reduce to the same representation. This lets you, for example, find documents containing "run" when you search for "running" or "runs". A common stemmer is the [Porter2 Stemmer](http://snowball.tartarus.org/algorithms/english/stemmer.html), which MeTA implements. Let's try it!

In [14]:
tok = metapy.analyzers.Porter2Filter(tok)
tok.set_content(doc.content())
[token for token in tok]

["can't", 'believ', 'cost', '19.95', 'find', '30']

Notice how "believe" becomes "believ" and "costs" becomes "cost". Stemming can help search by allowing queries to return more matched documents by relaxing what it means for a document to match a query term. Note that it's important to ensure that queries are tokenized in the *exact same way* as your documents were before indexing them. If you ignore this, your query is unlikely to contain the raw token "believ" and you'll miss a lot of results.
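The sketch below illustrates why documents and queries should go through the same pipeline. It is pure Python and makes loud assumptions: `toy_stem` is a hypothetical suffix stripper, not the Porter2 algorithm, and `analyze` is a made-up helper rather than a MeTA function.

```python
def toy_stem(token):
    """Toy stemmer (NOT Porter2): strip a few common suffixes."""
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Hypothetical shared pipeline: lowercase, split on whitespace, stem."""
    return [toy_stem(token) for token in text.lower().split()]


doc_terms = set(analyze("I can't believe it costs that much"))

# Query processed with the same pipeline as the document: the stems line up.
matched = set(analyze("believing")) & doc_terms

# Raw query terms compared against stemmed document terms: nothing matches.
unmatched = {"believing"} & doc_terms

print(matched, unmatched)
```

Using one function for both sides is a simple way to guarantee that the two pipelines cannot drift apart.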

Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens. In the simplest case, which often is enough for "good enough" search results, our action can simply be counting how many times these tokens occur.

For clarity, let's switch back to a simpler token stream first. Write me a token stream that tokenizes using the Unicode standard, and then lowercases each token. (Hint: `help(metapy.analyzers)`.)

In [15]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok = metapy.analyzers.LowercaseFilter(tok)
tok.set_content(doc.content())
[token for token in tok]

['i',
 'said',
 'that',
 'i',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 'i',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.']

Now, let's count how often each individual token appears in the stream. You might have called this representation the "bag of words" representation, but it is also often called "unigram word counts". In MeTA, classes that consume a token stream and emit a document representation are called Analyzers.

In [16]:
ana = metapy.analyzers.NGramWordAnalyzer(1, tok)
print(doc.content())
ana.analyze(doc)

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.


{"can't": 1,
 'before': 1,
 'for': 1,
 '$': 2,
 '30': 1,
 'costs': 1,
 '19.95': 1,
 '.': 1,
 '!': 1,
 'could': 1,
 'than': 1,
 'that': 2,
 'said': 1,
 'believe': 1,
 'find': 1,
 'i': 3,
 'only': 2,
 'it': 2,
 'more': 1}

If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them. "Unigram" means "1-gram", and we count individual tokens. "Bigram" means "2-gram", and we count adjacent tokens together as a group. Let's try that now.

In [17]:
ana = metapy.analyzers.NGramWordAnalyzer(2, tok)
ana.analyze(doc)

{('$', '19.95'): 1,
 ('$', '30'): 1,
 ('i', "can't"): 1,
 ('only', 'costs'): 1,
 ('only', 'find'): 1,
 ('30', 'before'): 1,
 ('for', 'more'): 1,
 ('it', 'only'): 1,
 ('said', 'that'): 1,
 ('before', '.'): 1,
 ('i', 'said'): 1,
 ('!', 'i'): 1,
 ("can't", 'believe'): 1,
 ('that', 'i'): 1,
 ('it', 'for'): 1,
 ('more', 'than'): 1,
 ('believe', 'that'): 1,
 ('that', 'it'): 1,
 ('19.95', '!'): 1,
 ('find', 'it'): 1,
 ('i', 'could'): 1,
 ('costs', '$'): 1,
 ('than', '$'): 1,
 ('could', 'only'): 1}

Now the individual "tokens" we're counting are pairs of tokens. You can analyze any n-gram of tokens you would like to in this way (and this is a simple way to attempt to support phrase search). Note, however, that as you increase the size of the n-grams you are counting, you are also increasing (exponentially!) the number of possible n-grams you could observe, so there's no free lunch here.
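As a rough illustration of n-gram counting and of how quickly the space of possible n-grams grows, here is a small pure-Python sketch. The helper `ngram_counts` is hypothetical and only mimics the idea behind `NGramWordAnalyzer`; it is not how MeTA implements it.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens; keys are n-tuples."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = ["i", "could", "only", "find", "it", "i", "could"]

unigrams = ngram_counts(tokens, 1)  # keys are 1-tuples such as ("i",)
bigrams = ngram_counts(tokens, 2)   # keys are pairs such as ("i", "could")

# With a vocabulary of V distinct tokens there are V**n possible n-grams.
vocabulary_size = len(set(tokens))
possible_bigrams = vocabulary_size ** 2
possible_trigrams = vocabulary_size ** 3

print(bigrams.most_common(1), possible_bigrams, possible_trigrams)
```

Only a small fraction of the possible n-grams ever occurs in real text, but the number of distinct features you have to store still tends to grow quickly with `n`.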

This analysis pipeline feeds both the creation of the `InvertedIndex`, which is used for search applications, and the `ForwardIndex`, which is used for topic modeling and classification applications. For classification, sometimes looking at n-grams of characters is useful.

In [18]:
tok = metapy.analyzers.CharacterTokenizer()
ana = metapy.analyzers.NGramWordAnalyzer(4, tok)
ana.analyze(doc)

{('d', ' ', 't', 'h'): 1,
 ('a', 't', ' ', 'i'): 1,
 ('o', 'r', 'e', '.'): 1,
 ('f', 'o', 'r', 'e'): 1,
 (' ', 'm', 'o', 'r'): 1,
 (' ', 'c', 'a', 'n'): 1,
 (' ', 'c', 'o', 's'): 1,
 ('t', 'h', 'a', 't'): 2,
 ('o', 'n', 'l', 'y'): 2,
 ('u', 'l', 'd', ' '): 1,
 ('c', 'a', 'n', "'"): 1,
 (' ', '$', '1', '9'): 1,
 (' ', 'c', 'o', 'u'): 1,
 ('.', '9', '5', '!'): 1,
 ('$', '3', '0', ' '): 1,
 ('5', '!', ' ', 'I'): 1,
 ('n', 'd', ' ', 'i'): 1,
 ('t', 'h', 'a', 'n'): 1,
 ('i', 'd', ' ', 't'): 1,
 ('I', ' ', 'c', 'o'): 1,
 ('b', 'e', 'l', 'i'): 1,
 ('I', ' ', 'c', 'a'): 1,
 ('f', 'o', 'r', ' '): 1,
 ('I', ' ', 's', 'a'): 1,
 ('l', 'y', ' ', 'c'): 1,
 ('a', 'i', 'd', ' '): 1,
 ('o', 'r', ' ', 'm'): 1,
 (' ', 'I', ' ', 'c'): 2,
 ('o', 'r', 'e', ' '): 1,
 ('i', 'n', 'd', ' '): 1,
 ('t', 's', ' ', '$'): 1,
 ('e', 'l', 'i', 'e'): 1,
 ('t', ' ', 'o', 'n'): 1,
 ('y', ' ', 'f', 'i'): 1,
 ('l', 'y', ' ', 'f'): 1,
 ('o', 's', 't', 's'): 1,
 ('i', 'e', 'v', 'e'): 1,
 (' ', 's', 'a', 'i'): 1,
 (' ', 'f', 

Different analyzers can be combined together to create document representations that have many unique perspectives. Once things start to get more complicated, we recommend using a configuration file to specify each of the analyzers you wish to combine for your document representation.

Now, let's explore something a little bit different. MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech tagging and syntactic parsing.

(Does anyone know what part-of-speech tagging is?) POS tagging is a task in NLP that involves identifying a type for each word in a sentence. For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, or adjectives, or... This is useful as a first step towards developing an understanding of the meaning of a particular sentence.

MeTA places its POS tagging component in its "sequences" library. Let's play with some sequences first to get an idea of how they work. We'll start off by creating a sequence.

In [19]:
seq = metapy.sequence.Sequence()

Now, we can add individual words to this sequence. Sequences consist of a list of `Observation`s, which are essentially (word, tag) pairs. If we don't yet know the tags for a `Sequence`, we can just add individual words and leave the tags unset. Words are called "symbols" in the library terminology.

In [20]:
for word in ["The", "dog", "ran", "across", "the", "park", "."]:
    seq.add_symbol(word)
print(seq)

(The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)


The printed form of the sequence shows that we do not yet know the tags for each word. Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA.

In [21]:
%%capture
!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz
!tar xvf greedy-perceptron-tagger.tar.gz

In [22]:
tagger = metapy.sequence.PerceptronTagger("perceptron-tagger/")



Now let's fill in the missing tags in our sentence based on the best guess this model has.

In [23]:
tagger.tag(seq)
print(seq)

(The, DT), (dog, NN), (ran, VBD), (across, IN), (the, DT), (park, NN), (., .)


Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

But what if we want to POS-tag a document?

In [24]:
print(doc.content())

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.


We need a way of going from a document to a list of `Sequence`s, each representing an individual sentence. I'll get you started.

In [25]:
tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!
tok = metapy.analyzers.PennTreebankNormalizer(tok)
tok.set_content(doc.content())
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 'ca',
 "n't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>',
 '<s>',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.',
 '</s>']

(Notice that the `PennTreebankNormalizer` modifies some tokens to better match the conventions of the Penn Treebank training data. This should help improve performance a little.)

Now, write me a function that can take a token stream that contains sentence boundary tags and returns a list of `Sequence` objects. Don't include the sentence boundary tags in the actual `Sequence` objects.

In [26]:
def extract_sequences(tok):
    sequences = []
    for token in tok:
        if token == '<s>':
            sequences.append(metapy.sequence.Sequence())
        elif token != '</s>':
            sequences[-1].add_symbol(token)
    return sequences

In [27]:
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(seq)

(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)
(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)


This is still a rather shallow understanding of these sentences. The next major leap is to parse these sequences of POS-tagged words to obtain a tree for each sentence. These trees, in our case, will represent the hierarchical phrase structure of a single sentence by grouping together tokens that belong to one phrase together, and showing how small phrases combine into larger phrases, and eventually a sentence.

Let's try parsing the sentences in our document using a pre-trained constituency parser that's distributed with MeTA.

In [28]:
%%capture
!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-constituency-parser.tar.gz
!tar xvf greedy-constituency-parser.tar.gz

In [29]:
parser = metapy.parser.Parser("parser/")

In [30]:
print(' '.join([obs.symbol for obs in seq]))
print(seq)
tree = parser.parse(seq)
print(tree.pretty_str())

I could only find it for more than $ 30 before .
(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)
(ROOT
  (S
    (NP (PRP I))
    (VP
      (MD could)
      (ADVP (RB only))
      (VP
        (VB find)
        (NP (PRP it))
        (PP
          (IN for)
          (NP
            (QP
              (JJR more)
              (IN than)
              ($ $)
              (CD 30))))
        (ADVP (IN before))))
    (. .)))



(You can also play with this with a [prettier online demo](https://meta-toolkit.org/nlp-demo.html).)

We can now parse all of the sentences in our document.

In [31]:
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(parser.parse(seq).pretty_str())

(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD said)
      (SBAR
        (IN that)
        (S
          (NP (PRP I))
          (VP
            (MD ca)
            (RB n't)
            (VP
              (VB believe)
              (SBAR
                (IN that)
                (S
                  (NP (PRP it))
                  (ADVP (RB only))
                  (VP
                    (VBZ costs)
                    (NP
                      ($ $)
                      (CD 19.95))))))))))
    (. !)))

(ROOT
  (S
    (NP (PRP I))
    (VP
      (MD could)
      (ADVP (RB only))
      (VP
        (VB find)
        (NP (PRP it))
        (PP
          (IN for)
          (NP
            (QP
              (JJR more)
              (IN than)
              ($ $)
              (CD 30))))
        (ADVP (IN before))))
    (. .)))



Now that we know how POS-tagging and syntactic parsing work in MeTA, let's explore some features that we can add to our document representations using these techniques.

The simplest feature we can imagine that uses the POS-tagged sequences might be n-grams of POS tags. (As a quick detour, we'll need to download and extract a CRF-based POS tagging model.)

In [32]:
%%capture
!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/crf.tar.gz
!tar xf crf.tar.gz

Now, we can use the following analysis pipeline to get n-gram POS tag features by using the `NGramPOSAnalyzer`:

In [33]:
tok = metapy.analyzers.ICUTokenizer()
tok = metapy.analyzers.PennTreebankNormalizer(tok)
ana = metapy.analyzers.NGramPOSAnalyzer(2, tok, 'crf')
ana.analyze(doc)



{('CD', '.'): 1,
 ('MD', 'RB'): 2,
 ('$', 'CD'): 2,
 ('CD', 'RB'): 1,
 ('VBD', 'IN'): 1,
 ('VB', 'IN'): 1,
 ('PRP', 'MD'): 2,
 ('VBZ', '$'): 1,
 ('PRP', 'VBD'): 1,
 ('RB', 'VB'): 2,
 ('RB', 'VBZ'): 1,
 ('PRP', 'IN'): 1,
 ('JJR', 'IN'): 1,
 ('PRP', 'RB'): 1,
 ('IN', '$'): 1,
 ('IN', 'JJR'): 1,
 ('VB', 'PRP'): 1,
 ('RB', '.'): 1,
 ('IN', 'PRP'): 2}

We can also parse the sentences in the document and extract a number of different structural features from the parse trees using a `TreeAnalyzer`.

In [34]:
ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')



The `TreeAnalyzer` has a function `add()` that takes `TreeFeaturizer` subclasses. Conceptually, the extraction of structural features from parse trees looks something like this:

1. The tokenizer is run until a full sentence is read.
2. The greedy perceptron tagger is run to tag the words in the sentence.
3. The shift-reduce constituency parser is run to produce a parse tree.
4. Each `TreeFeaturizer` that is part of the `TreeAnalyzer` is run over the parse tree to produce features.

This process is repeated for each sentence found in the document.

Let's try adding just one `TreeFeaturizer` to the analyzer for now and see what features we get.

In [35]:
ana.add(metapy.analyzers.DepthFeaturizer())
ana.analyze(doc)

{'depth-8': 1, 'depth-12': 1}

The featurizer we used here simply computes the depth of each sentence's parse tree and creates a feature for each depth encountered. Our document has two sentences, so we get two depth features.
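For intuition, here is a pure-Python sketch of a depth computation over a toy parse tree. The tree encoding (a `(label, children)` tuple, with part-of-speech nodes having no children) and the helper `depth` are assumptions made for illustration. Counting nodes along the longest path from the root down to a part-of-speech node is consistent with the `depth-8` and `depth-12` values above, but MeTA's actual implementation may differ.

```python
def depth(node):
    """Number of nodes on the longest path from this node down to a POS node."""
    _label, children = node
    return 1 + max((depth(child) for child in children), default=0)


# Toy tree for a sentence shaped like "I walked the dog ." (words omitted).
tree = ("ROOT", [
    ("S", [
        ("NP", [("PRP", [])]),
        ("VP", [("VBD", []), ("NP", [("DT", []), ("NN", [])])]),
        (".", []),
    ]),
])

feature_name = "depth-{}".format(depth(tree))
print(feature_name)
```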

We can also see some features that utilize the structure of the trees if we use some different `TreeFeaturizer`s.

In [36]:
ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')
ana.add(metapy.analyzers.SubtreeFeaturizer())
ana.analyze(doc)



{'subtree-(SBAR (IN) (S))': 2,
 'subtree-(S (NP) (ADVP) (VP))': 1,
 'subtree-(VP (VB) (SBAR))': 1,
 'subtree-(QP (JJR) (IN) ($) (CD))': 1,
 'subtree-(VP (VB) (NP) (PP) (ADVP))': 1,
 'subtree-(.)': 2,
 'subtree-(PP (IN) (NP))': 1,
 'subtree-(PRP)': 5,
 'subtree-(S (NP) (VP) (.))': 2,
 'subtree-(VP (VBD) (SBAR))': 1,
 'subtree-(IN)': 5,
 'subtree-(VB)': 2,
 'subtree-(NP (PRP))': 5,
 'subtree-(ADVP (RB))': 2,
 'subtree-(VP (VBZ) (NP))': 1,
 'subtree-($)': 2,
 'subtree-(VP (MD) (RB) (VP))': 1,
 'subtree-(JJR)': 1,
 'subtree-(NP ($) (CD))': 1,
 'subtree-(ROOT (S))': 2,
 'subtree-(VBD)': 1,
 'subtree-(ADVP (IN))': 1,
 'subtree-(VP (MD) (ADVP) (VP))': 1,
 'subtree-(RB)': 3,
 'subtree-(NP (QP))': 1,
 'subtree-(S (NP) (VP))': 1,
 'subtree-(VBZ)': 1,
 'subtree-(CD)': 2,
 'subtree-(MD)': 2}

The `SubtreeFeaturizer` creates a new feature for each unique subtree seen in the data, to a depth of 1. This can create quite a lot of features, but describes how the sentence is decomposed structurally. This kind of feature is also known as a "rewrite rule" feature.
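A rewrite-rule extractor can be sketched in a few lines of pure Python. The `(label, children)` tree encoding and the helper `rewrite_rules` are hypothetical; the feature strings simply imitate the format of the output shown above and are not produced by MeTA.

```python
from collections import Counter


def rewrite_rules(node, counts):
    """Record one feature per node: its label plus the labels of its direct children."""
    label, children = node
    if children:
        rhs = " ".join("({})".format(child[0]) for child in children)
        counts["subtree-({} {})".format(label, rhs)] += 1
    else:
        counts["subtree-({})".format(label)] += 1
    for child in children:
        rewrite_rules(child, counts)
    return counts


# Toy tree (an assumption), with part-of-speech nodes as childless leaves.
tree = ("ROOT", [
    ("S", [
        ("NP", [("PRP", [])]),
        ("VP", [("VBD", []), ("NP", [("DT", []), ("NN", [])])]),
        (".", []),
    ]),
])

rules = rewrite_rules(tree, Counter())
for name, count in sorted(rules.items()):
    print(name, count)
```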

We can also ignore the labels of the subtrees entirely and just extract their structure if we use a `SkeletonFeaturizer`.

In [37]:
ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')
ana.add(metapy.analyzers.SkeletonFeaturizer())
ana.analyze(doc)



{'(()(())(()((()()()())))(()))': 1,
 '(()(())(()(())(()((()()()())))(())))': 1,
 '(()(()()))': 1,
 '(()(()((())(())(()(()())))))': 1,
 '(()()(()(()((())(())(()(()()))))))': 1,
 '(()())': 1,
 '((()()()()))': 1,
 '(()((())(()()(()(()((())(())(()(()()))))))))': 1,
 '(())': 8,
 '(()((())(())(()(()()))))': 1,
 '((())(()()(()(()((())(())(()(()())))))))': 1,
 '((())(()(()((())(()()(()(()((())(())(()(()())))))))))())': 1,
 '((())(())(()(()())))': 1,
 '()': 26,
 '(((())(()(()((())(()()(()(()((())(())(()(()())))))))))()))': 1,
 '(()()()())': 1,
 '(()(()((())(()()(()(()((())(())(()(()())))))))))': 1,
 '((())(()(())(()(())(()((()()()())))(())))())': 1,
 '(((())(()(())(()(())(()((()()()())))(())))()))': 1,
 '(()((()()()())))': 1}
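The bracket strings above can be reproduced in spirit with a short recursive function. This is a pure-Python sketch under assumptions: trees are encoded as hypothetical `(label, children)` tuples, and `skeleton` and `skeleton_features` are illustrative helpers, not MeTA's implementation.

```python
from collections import Counter


def skeleton(node):
    """Render a subtree's shape as nested parentheses, ignoring every label."""
    _label, children = node
    return "(" + "".join(skeleton(child) for child in children) + ")"


def skeleton_features(node, counts):
    """Count the skeleton of the subtree rooted at every node."""
    counts[skeleton(node)] += 1
    for child in node[1]:
        skeleton_features(child, counts)
    return counts


# Toy tree (an assumption): S -> NP(PRP) VP(VBD) .
tree = ("S", [("NP", [("PRP", [])]), ("VP", [("VBD", [])]), (".", [])])

features = skeleton_features(tree, Counter())
print(dict(features))
```

Part-of-speech nodes all collapse to `()`, which is why that feature dominates the counts: two subtrees with different labels but the same shape become indistinguishable.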

Play with the other featurizers to see what they do!

In practice, it is often beneficial to combine multiple feature sets together. We can do this with a `MultiAnalyzer`. Let's combine unigram words, bigram POS tags, and rewrite rules for our document feature representation.
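Conceptually, combining feature sets amounts to merging several feature-count dictionaries into one. The sketch below is pure Python with made-up counts; `combine_features` is a hypothetical helper, not the `MultiAnalyzer` API, and it assumes that feature names from different analyzers do not collide (naming prefixes such as `subtree-` help with that).

```python
from collections import Counter


def combine_features(*feature_sets):
    """Merge several {feature_name: count} mappings into a single Counter."""
    combined = Counter()
    for features in feature_sets:
        combined.update(features)  # adds counts; colliding names would be summed
    return combined


# Illustrative counts only, not real analyzer output.
word_counts = {"believ": 1, "cost": 1}
pos_bigrams = {"PRP_MD": 2, "MD_RB": 2}
rule_counts = {"subtree-(NP (PRP))": 5}

combined = combine_features(word_counts, pos_bigrams, rule_counts)
print(dict(combined))
```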

We can certainly do this programmatically, but doing so can become tedious quite quickly. Instead, let's use MeTA's configuration file format to specify our analyzer, which we can then load in one line of code. MeTA uses [TOML](https://en.wikipedia.org/wiki/TOML) configuration files for all of its configuration. If you haven't heard of TOML before, don't panic! It's a very simple, readable format that looks like old school INI files.

Let's create a simple configuration file now.

In [38]:
config = """stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "crf"

[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["subtree"]
tagger = "perceptron-tagger/"
parser = "parser/"
"""
with open('KDD-2017-config.toml', 'w') as f:
    f.write(config)

Each `[[analyzers]]` block defines another analyzer to combine for our feature representation. Since "ngram-word" is such a common analyzer, we have defined some default filter chains that can be used with shortcuts. "default-unigram-chain" is a filter chain suitable for unigram words; "default-chain" is a filter chain suitable for bigram words and above.

We can now load an analyzer from this configuration file like so:

In [39]:
ana = metapy.analyzers.load('KDD-2017-config.toml')



Now let's see what we get!

In [40]:
ana.analyze(doc)

{'subtree-(.)': 2,
 "can't": 1,
 'subtree-(ADVP (IN))': 1,
 'subtree-(S (NP) (ADVP) (VP))': 1,
 'subtree-(IN)': 5,
 'IN_JJR': 1,
 'subtree-(VP (MD) (ADVP) (VP))': 1,
 'PRP_IN': 1,
 'CD_.': 1,
 'subtree-(VBD)': 1,
 'RB_VB': 2,
 'CD_RB': 1,
 '$_CD': 2,
 'subtree-(NP ($) (CD))': 1,
 'subtree-(VP (VBZ) (NP))': 1,
 'subtree-(PP (IN) (NP))': 1,
 'subtree-(S (NP) (VP) (.))': 2,
 'find': 1,
 'subtree-(NP (QP))': 1,
 'subtree-(VBZ)': 1,
 'VBZ_$': 1,
 'VBD_IN': 1,
 'cost': 1,
 'subtree-(VP (VB) (SBAR))': 1,
 'RB_.': 1,
 'subtree-(SBAR (IN) (S))': 2,
 'subtree-(VP (VBD) (SBAR))': 1,
 'subtree-($)': 2,
 'MD_RB': 2,
 'subtree-(CD)': 2,
 'PRP_RB': 1,
 'subtree-(PRP)': 5,
 'IN_$': 1,
 'subtree-(VB)': 2,
 'IN_PRP': 2,
 'RB_VBZ': 1,
 'JJR_IN': 1,
 'believ': 1,
 'subtree-(JJR)': 1,
 'PRP_VBD': 1,
 'subtree-(VP (VB) (NP) (PP) (ADVP))': 1,
 'subtree-(VP (MD) (RB) (VP))': 1,
 'subtree-(QP (JJR) (IN) ($) (CD))': 1,
 'VB_IN': 1,
 'subtree-(ADVP (RB))': 2,
 'subtree-(S (NP) (VP))': 1,
 'subtree-(ROOT (S))': 2

# Part 2: Information Retrieval with MeTA

In this part of the tutorial, we'll play with the first major application of MeTA: search engines. We will be having the first contest in this part! Once we finish going through how to create an inverted index, search it, and evaluate retrieval algorithms, I will give you instructions on how to participate in the competition. There will be a leader board to keep track of the best submissions, and I intend on leaving it running until the end of the conference for people to play around with.

Let's get a publicly available retrieval dataset with relevance judgments first.

In [41]:
%%capture
!wget -N https://meta-toolkit.org/data/2016-11-10/cranfield.tar.gz
!tar xf cranfield.tar.gz

We're going to add a flag to our corpus' configuration file to force it to store full text for later.

In [42]:
with open('cranfield/tutorial.toml', 'w') as f:
    f.write('type = "line-corpus"\n')
    f.write('store-full-text = true\n')

Now, let's set up a MeTA configuration file to index the `cranfield` dataset we just downloaded using the default unigram words filter chain.

In [43]:
config = """prefix = "." # tells MeTA where to search for datasets

dataset = "cranfield" # a subfolder under the prefix directory
corpus = "tutorial.toml" # a configuration file for the corpus specifying its format & additional args

index = "cranfield-idx" # subfolder of the current working directory to place index files

query-judgements = "cranfield/cranfield-qrels.txt" # file containing the relevance judgments for this dataset

stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"
"""
with open('cranfield-config.toml', 'w') as f:
    f.write(config)

Let's index our data using the `InvertedIndex` format. In a search engine, we want to quickly determine what documents mention a specific query term, so the `InvertedIndex` stores a mapping from term to a list of documents that contain that term (along with how many times they do).
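To make the structure concrete, here is a toy in-memory inverted index in pure Python. It is only a sketch of the idea: `build_inverted_index` is a hypothetical helper, whitespace tokenization is an assumption, and MeTA's on-disk `InvertedIndex` format is considerably more involved.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


docs = [
    "gas flow at high velocity",
    "equilibrium flow and frozen flow",
    "shock wave",
]
index = build_inverted_index(docs)

# Looking up a query term immediately gives the documents that contain it.
print(index["flow"])
```

Answering a query then only requires reading the postings of the query's terms rather than scanning every document.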

In [44]:
inv_idx = metapy.index.make_inverted_index('cranfield-config.toml')

1669328352: [info]     Loading index from disk: cranfield-idx/inv (/metapy/deps/meta/src/index/inverted_index.cpp:171)


This may take a minute the first time, since the index needs to be built. Subsequent calls to `make_inverted_index` with this config file will simply load the existing index from disk, which is much faster.

Here's how we can interact with the index object:

In [45]:
inv_idx.num_docs()

1400

In [46]:
inv_idx.unique_terms()

4137

In [47]:
inv_idx.avg_doc_length()

87.17857360839844

In [48]:
inv_idx.total_corpus_terms()

122050

Let's search our index. We'll start by creating a ranker:

In [49]:
ranker = metapy.index.OkapiBM25()

Now we need a query. Let's create an example query.

In [50]:
query = metapy.index.Document()
query.content("flow equilibrium")

Now we can use this to search our index like so:

In [51]:
top_docs = ranker.score(inv_idx, query, num_results=5)
top_docs

[(235, 6.424363136291504),
 (1009, 6.096038818359375),
 (1229, 5.877272129058838),
 (1251, 5.866937160491943),
 (316, 5.859640121459961)]

We are returned a ranked list of *(doc_id, score)* pairs. The scores are from the ranker, which in this case was Okapi BM25. Since the `tutorial.toml` file we created for the cranfield dataset has `store-full-text = true`, we can verify the content of our top documents by inspecting the document metadata field "content".

In [52]:
for num, (d_id, _) in enumerate(top_docs):
    content = inv_idx.metadata(d_id).get('content')
    print("{}. {}...\n".format(num + 1, content[0:250]))

1. criteria for thermodynamic equilibrium in gas flow . when gases flow at high velocity, the rates of internal processes may not be fast enough to maintain thermodynamic equilibrium .  by defining quasi-equilibrium in flow as the condition in which the...

2. free-flight measurements of the static and dynamic . air-flow properties in nozzles were calculated and charted for equilibrium flow and two types of frozen flows .  in one type of frozen flow, air was assumed to be in equilibrium from the nozzle res...

3. hypersonic nozzle expansion of air with atom recombination present . an experimental investigation on the expansion of high- temperature, high-pressure air to hypersonic flow mach numbers in a conical nozzle of a hypersonic shock tunnel has been carr...

4. on the approach to chemical and vibrational equilibrium behind a strong normal shock wave . the concurrent approach to chemical and vibrational equilibrium of a pure diatomic gas passing through a strong normal shock wave i

Since we have the queries file and relevance judgments, we can do an IR evaluation.

In [53]:
ev = metapy.index.IREval('cranfield-config.toml')

We will loop over the queries file and add each result to the `IREval` object `ev`.

In [54]:
num_results = 10
with open('cranfield/cranfield-queries.txt') as query_file:
    for query_num, line in enumerate(query_file):
        query.content(line.strip())
        results = ranker.score(inv_idx, query, num_results)
        avg_p = ev.avg_p(results, query_num + 1, num_results)
        print("Query {} average precision: {}".format(query_num + 1, avg_p))

Query 1 average precision: 0.24166666666666664
Query 2 average precision: 0.4196428571428571
Query 3 average precision: 0.6383928571428572
Query 4 average precision: 0.25
Query 5 average precision: 0.3333333333333333
Query 6 average precision: 0.125
Query 7 average precision: 0.11666666666666665
Query 8 average precision: 0.1
Query 9 average precision: 0.6388888888888888
Query 10 average precision: 0.0625
Query 11 average precision: 0.09285714285714286
Query 12 average precision: 0.18
Query 13 average precision: 0.0
Query 14 average precision: 0.5
Query 15 average precision: 1.0
Query 16 average precision: 0.16666666666666666
Query 17 average precision: 0.08333333333333333
Query 18 average precision: 0.3333333333333333
Query 19 average precision: 0.0
Query 20 average precision: 0.4302469135802469
Query 21 average precision: 0.0
Query 22 average precision: 0.0
Query 23 average precision: 0.19952380952380952
Query 24 average precision: 0.3333333333333333
Query 25 average precision: 0.650

Afterwards, we can get the mean average precision of all the queries.

In [55]:
ev.map()

0.25511867318944054
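For reference, here is a pure-Python sketch of average precision and its mean over queries. It uses one common definition, normalizing by the smaller of the cutoff and the number of relevant documents; this is an assumption, and `IREval` may normalize differently. The function names and the example data are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids, k):
    """Average the precision measured at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denominator = min(k, len(relevant_ids))
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs, k):
    """Mean of the per-query average precision values."""
    return sum(average_precision(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Each run pairs a ranked result list with the set of relevant document ids.
runs = [
    ([1, 2, 3, 4], {1, 3}),  # relevant documents at ranks 1 and 3
    ([7, 8, 9], {9}),        # the only relevant document is at rank 3
]
map_score = mean_average_precision(runs, k=10)
print(map_score)
```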

In the competition, you should try experimenting with different rankers, ranker parameters, tokenization, and filters. What combination can give you the best results?

Lastly, it's possible to define your own ranking function in Python.

In [56]:
class SimpleRanker(metapy.index.RankingFunction):
    """
    Create a new ranking function in Python that can be used in MeTA.
    """
    def __init__(self, some_param=1.0):
        self.param = some_param
        # You *must* invoke the base class __init__() here!
        super(SimpleRanker, self).__init__()

    def score_one(self, sd):
        """
        You need to override this function to return a score for a single term.
        For fields available in the score_data sd object,
        @see https://meta-toolkit.org/doxygen/structmeta_1_1index_1_1score__data.html
        """
        return (self.param + sd.doc_term_count) / (self.param * sd.doc_unique_terms + sd.doc_size)
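To see what the formula in `score_one` computes without needing an index, here is a small standalone sketch. `ScoreData` is a hypothetical stand-in that carries only the three fields used above, and summing the per-term scores into a document score is an assumption about how the ranker combines them.

```python
from collections import namedtuple

# Hypothetical stand-in for MeTA's score_data: only the fields score_one reads.
ScoreData = namedtuple("ScoreData", ["doc_term_count", "doc_unique_terms", "doc_size"])


def score_one(sd, param=1.0):
    """Same arithmetic as SimpleRanker.score_one, for a single matched term."""
    return (param + sd.doc_term_count) / (param * sd.doc_unique_terms + sd.doc_size)


def score_document(per_term_data, param=1.0):
    """Assumption: a document's score is the sum of its per-term scores."""
    return sum(score_one(sd, param) for sd in per_term_data)


# A 200-word document with 50 unique terms; the query term occurs 3 times in it.
single = score_one(ScoreData(doc_term_count=3, doc_unique_terms=50, doc_size=200))
print(single)  # (1.0 + 3) / (1.0 * 50 + 200) = 0.016
```

Raising `some_param` pulls the score of every term toward `1 / doc_unique_terms`, so it acts as a smoothing knob that you could tune for the competition.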

**COMPETITION TIME**

# Part 3: Document Classification with MeTA

In this part of the tutorial, we'll play with the next major application for MeTA: creating classifiers. We will be having the second contest in this part! Once we finish going through how to create a forward index, train classifiers on top of it, and perform classifier evaluation and cross validation, I will give you instructions on how to participate in the competition (it will be similar to the first competition). Again, there will be another leader board to keep track of the best submissions, and I intend on leaving it running until the end of the conference for people to play around with.

Let's switch back to using the `ceeaus` dataset we downloaded before. If you're just joining us, grab it now:

In [57]:
%%capture
!wget -N https://meta-toolkit.org/data/2016-01-26/ceeaus.tar.gz
!tar xf ceeaus.tar.gz

We'll also need our standard stopword list. Grab it now if you don't already have it:

In [58]:
%%capture
!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

Let's create our MeTA configuration file for this part of the tutorial. We'll be using standard unigram words for now, but you're strongly encouraged to play with different features for the competition!

In [59]:
config = """prefix = "."
dataset = "ceeaus"
corpus = "line.toml"
index = "ceeaus-idx"
stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"
"""
with open('ceeaus-config.toml', 'w') as f:
    f.write(config)

Now, let's index this dataset. Since we are doing classification experiments, we will most likely be concerning ourselves with a `ForwardIndex`, since we want to map document ids to their feature vector representations.

In [60]:
fidx = metapy.index.make_forward_index('ceeaus-config.toml')

1669328353: [info]     Loading index from disk: ceeaus-idx/fwd (/metapy/deps/meta/src/index/forward_index.cpp:171)


Note that the feature set used for classification depends on your settings in the configuration file _at the time of indexing_. If you want to play with different feature sets, remember to change your `analyzer` pipeline in the configuration file, and also to **reindex** your documents!
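As a sketch of what changing the pipeline could look like, the cell below writes a second configuration file that uses bigram words. The file name and index directory are hypothetical, and `default-chain` is assumed to be the MeTA filter preset you want for bigrams, so check the MeTA analyzer documentation before relying on it.

```python
# Hypothetical bigram variant of the configuration above. The index directory
# name is changed so the new features are written to a fresh index instead of
# silently reusing the unigram one.
bigram_config = """prefix = "."
dataset = "ceeaus"
corpus = "line.toml"
index = "ceeaus-bigram-idx"
stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 2
filter = "default-chain"
"""
with open('ceeaus-bigram-config.toml', 'w') as f:
    f.write(bigram_config)
```

Passing `'ceeaus-bigram-config.toml'` to `metapy.index.make_forward_index` should then index the documents again with the new feature set.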

Here, we've just chosen simple unigram words. This is actually a surprisingly good baseline feature set for many text classification problems.

Now that we have a `ForwardIndex` on disk, we need to load the documents we want to start playing with into memory. Since this is a small enough dataset, let's load the whole thing into memory at once.

We need to decide what kind of dataset we're using. MeTA has classes for binary classification (`BinaryDataset`) and multi-class classification (`MulticlassDataset`), which you should choose from depending on the kind of classification problem you're dealing with. Let's see how many labels we have in our corpus.

In [61]:
fidx.num_labels()

3

Since this is more than 2, we likely want a `MulticlassDataset` so we can learn a classifier that can predict which of these three labels a document should have. (But we might be interested in only determining one particular class from the rest, in which case we might actually want a `BinaryDataset`.)

For now, let's focus on the multi-class case, as that likely makes the most sense for this kind of data. Let's load our documents.

In [62]:
dset = metapy.classify.MulticlassDataset(fidx)
len(dset)



1008

We have 1008 documents, split across three labels. What are our labels?

In [63]:
set([dset.label(instance) for instance in dset])

{'chinese', 'english', 'japanese'}

This dataset is a small collection of essays written by a bunch of students with different first languages. Our goal will be to try to identify whether an essay was written by a native-Chinese speaker, a native-English speaker, or a native-Japanese speaker.

Now, because these in-memory datasets can potentially be quite large, it's beneficial to not make unnecessary copies of them to, for example, create a new list that's shuffled that contains the same documents. In most cases, you'll be operating with a `DatasetView` (either `MulticlassDatasetView` or `BinaryDatasetView`) so that you can do things like shuffle or rotate the contents of a dataset without having to actually modify it. Doing so is pretty easy: you can use Python's slicing API, or you can just construct one directly.

In [64]:
view = dset[0:len(dset)]
# or
view = metapy.classify.MulticlassDatasetView(dset)

Now we can, for example, shuffle this view without changing the underlying dataset.

In [65]:
view.shuffle()
print("{} vs {}".format(view[0].id, dset[0].id))

544 vs 0


The view has been shuffled and now has documents in random order (useful in many cases to make sure that you don't have clumps of the same-labeled documents together, or to just permute the documents in a stochastic learning algorithm), but the underlying dataset is still sorted by id.

We can also use this slicing API to create a random training and testing set from our shuffled views (views also support slicing). Let's make a 75-25 split of training-testing data. (Note that it's really important that we already shuffled the view! Otherwise the split would follow the original document order rather than being random.)

In [66]:
training = view[0:int(0.75*len(view))]
testing = view[int(0.75*len(view)):]
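The same shuffle-then-slice pattern can be sketched with plain Python lists. This is only an illustration of the index arithmetic: the ids below stand in for documents, and the fixed seed is there just to make the sketch repeatable.

```python
import random

doc_ids = list(range(1008))   # stand-ins for the 1008 documents in the dataset
rng = random.Random(42)       # fixed seed only so the sketch is repeatable
shuffled = doc_ids[:]         # copy, so the original order is left untouched
rng.shuffle(shuffled)

cut = int(0.75 * len(shuffled))
training_ids = shuffled[:cut]
testing_ids = shuffled[cut:]
print(len(training_ids), len(testing_ids))  # 756 252
```

Without the shuffle, the first 75% of the ids would simply be the first 75% of the documents on disk, which can leave whole labels out of the testing set if the corpus is stored grouped by label.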

Now, we're ready to train a classifier! Let's start with a very simple one: [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

In MeTA, construction of a classifier implies training of that model. Let's train a Naive Bayes classifier on our training view now.

In [67]:
nb = metapy.classify.NaiveBayes(training)

We can now classify individual documents like so.

In [68]:
nb.classify(testing[0].weights)

'japanese'

We might be more interested in how well we classify the testing set.

In [69]:
mtrx = nb.test(testing)
print(mtrx)


            chinese   english   japanese  
          ------------------------------
  chinese | 1         -         -
  english | 0.0263    0.895     0.0789
 japanese | 0.0258    -         0.974




The `test()` method of MeTA's classifiers returns to you a `ConfusionMatrix`, which contains useful information about what kinds of mistakes your classifier is making.

(Note that, due to the random shuffling, you might see different results than we do here.)

For example, we can see where this classifier gets confused by looking at the rows of the confusion matrix. Each row tells you what fraction of documents with that _true_ label were assigned the label for each column by the classifier. In the run shown above, the native-English essays are the hardest: about 7.9% of them were miscategorized as native-Japanese and about 2.6% as native-Chinese, while every native-Chinese essay was labeled correctly.
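If the row-wise reading is unfamiliar, here is a small sketch that builds a row-normalized confusion matrix from (true label, predicted label) pairs. The pairs are made up for illustration and the function name is hypothetical; it is not how `ConfusionMatrix` is implemented.

```python
from collections import Counter, defaultdict


def row_normalized_confusion(pairs):
    """Map each true label to the fraction of its documents given each prediction."""
    counts = defaultdict(Counter)
    for true_label, predicted_label in pairs:
        counts[true_label][predicted_label] += 1
    matrix = {}
    for true_label, row in counts.items():
        total = sum(row.values())
        matrix[true_label] = {pred: n / total for pred, n in row.items()}
    return matrix


# Made-up predictions: 4 native-Chinese essays, one of them mislabeled.
pairs = (
    [("chinese", "chinese")] * 3
    + [("chinese", "japanese")]
    + [("english", "english")] * 2
    + [("japanese", "japanese")] * 5
)
matrix = row_normalized_confusion(pairs)
print(matrix["chinese"])  # {'chinese': 0.75, 'japanese': 0.25}
```

Each row sums to 1, so a row describes what happens to documents of one true class regardless of how common that class is.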

The `ConfusionMatrix` also computes a lot of metrics that are commonly used in classifier evaluation.

In [70]:
mtrx.print_stats()

------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.87        0.769       1           0.0794
english     0.944       1           0.895       0.151
japanese    0.979       0.984       0.974       0.77
------------------------------------------------------------
Total       0.967       0.97        0.964
------------------------------------------------------------
------------------------------------------------------------
252 predictions attempted, overall accuracy: 0.964
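To connect these statistics back to the confusion matrix, here is a sketch that recomputes precision, recall, F1, and accuracy from raw counts. The counts are not printed by MeTA: they are inferred from the fractions shown above under the assumption that the 252 test documents split 20/38/194 across the three labels, so treat them as an illustration.

```python
labels = ["chinese", "english", "japanese"]

# counts[true_label][predicted_label]; inferred from the run above (an assumption).
counts = {
    "chinese":  {"chinese": 20, "english": 0,  "japanese": 0},
    "english":  {"chinese": 1,  "english": 34, "japanese": 3},
    "japanese": {"chinese": 5,  "english": 0,  "japanese": 189},
}


def per_class_stats(counts, label):
    """Return (precision, recall, f1) for one label."""
    true_positives = counts[label][label]
    predicted_as_label = sum(counts[true][label] for true in counts)
    actually_label = sum(counts[label].values())
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


total = sum(sum(row.values()) for row in counts.values())
accuracy = sum(counts[label][label] for label in labels) / total

for label in labels:
    precision, recall, f1 = per_class_stats(counts, label)
    print("{:<10} F1={:.3f} P={:.3f} R={:.3f}".format(label, f1, precision, recall))
print("accuracy: {:.3f}".format(accuracy))
```

Recall is read along a row of the matrix (how many of the true documents were found), while precision is read down a column (how many of the predictions for that label were right).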



If we want to make sure that the classifier isn't overfitting to our training data, a common approach is to do [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Let's run CV for our Naive Bayes classifier across the whole dataset, using 5 folds, to get an idea of how well we might generalize to new data.

In [71]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.NaiveBayes(fold), view, 5)

1669328353: [info]     Cross-validating fold 1/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 2/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 3/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 4/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 5/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)


`cross_validate()` returns a `ConfusionMatrix` just like `test()` does. We give it a function to use to create the trained classifiers for each fold, and then pass in the dataset view containing all of our documents, and the number of folds we want to use.

Let's see how we did.

In [72]:
print(mtrx)
mtrx.print_stats()


            chinese   english   japanese  
          ------------------------------
  chinese | 0.859     0.0326    0.109
  english | 0.0342    0.904     0.0616
 japanese | 0.013     0.0104    0.977


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.849       0.84        0.859       0.0915
english     0.913       0.923       0.904       0.145
japanese    0.976       0.975       0.977       0.763
------------------------------------------------------------
Total       0.955       0.955       0.955
------------------------------------------------------------
------------------------------------------------------------
1005 predictions attempted, overall accuracy: 0.955
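The fold mechanics behind cross-validation can be sketched in a few lines of plain Python. This is an assumption about how the folds are cut (equal-sized folds, with any remainder used only for training), not MeTA's actual code, but it would be consistent with 1005 predictions for 1008 documents.

```python
def cross_validation_folds(items, k):
    """Yield (train, held_out) pairs for k equal-sized folds.

    Assumption: when len(items) is not divisible by k, the leftover items
    are never held out; they only ever appear in the training portion.
    """
    fold_size = len(items) // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        held_out = items[start:stop]
        train = items[:start] + items[stop:]
        yield train, held_out


doc_ids = list(range(1008))  # stand-ins for the shuffled documents
tested = sum(len(held_out) for _, held_out in cross_validation_folds(doc_ids, 5))
print(tested)  # 1005
```

Each document is held out at most once, so every prediction in the combined confusion matrix comes from a classifier that never saw that document during training.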



Now let's do the same thing, but for an arguably stronger baseline: [SVM](https://en.wikipedia.org/wiki/Support_vector_machine).

MeTA's implementation of SVM is actually an approximation using [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) on the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss). It's implemented as a `BinaryClassifier`, so we will need to adapt it before it can be used to solve our multi-class classification problem.

MeTA provides two different adapters for this scenario: [One-vs-All](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) and [One-vs-One](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one).

In [73]:
ova = metapy.classify.OneVsAll(training, metapy.classify.SGD, loss_id='hinge')

We construct the `OneVsAll` reduction by providing it the training documents, the binary classifier class to use (here `metapy.classify.SGD`), and then (as keyword arguments) any additional arguments to that chosen classifier. In this case, we use `loss_id` to specify the loss function to use.
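To show what a one-vs-all reduction over a hinge-loss SGD learner does conceptually, here is a toy pure-Python sketch. Everything in it (function names, learning rate, epoch count, the six toy documents) is hypothetical; it illustrates the technique and is not MeTA's implementation.

```python
import random


def train_hinge_sgd(examples, epochs=200, learning_rate=0.1, seed=0):
    """Train a linear binary classifier with SGD on the hinge loss.

    `examples` is a list of (features, y) pairs, where `features` maps a
    feature name to its value and y is +1 or -1.
    """
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, y in data:
            margin = y * (bias + sum(weights.get(f, 0.0) * v for f, v in features.items()))
            if margin < 1.0:
                # Hinge loss max(0, 1 - margin) is active: step along its subgradient.
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + learning_rate * y * v
                bias += learning_rate * y
    return weights, bias


def decision_value(model, features):
    weights, bias = model
    return bias + sum(weights.get(f, 0.0) * v for f, v in features.items())


def train_one_vs_all(labeled_docs):
    """Train one binary model per label: that label (+1) against all others (-1)."""
    labels = sorted({label for _, label in labeled_docs})
    return {
        label: train_hinge_sgd([(feats, 1 if doc_label == label else -1)
                                for feats, doc_label in labeled_docs])
        for label in labels
    }


def classify(models, features):
    """Pick the label whose binary model gives the largest decision value."""
    return max(models, key=lambda label: decision_value(models[label], features))


toy_docs = [
    ({"smoke": 2, "ban": 1}, "smoking"),
    ({"smoke": 1, "restaurant": 1}, "smoking"),
    ({"job": 2, "money": 1}, "jobs"),
    ({"job": 1, "student": 1}, "jobs"),
    ({"study": 2, "exam": 1}, "studying"),
    ({"study": 1, "library": 1}, "studying"),
]
models = train_one_vs_all(toy_docs)
predictions = [classify(models, feats) for feats, _ in toy_docs]
print(predictions)
```

One-vs-all trains as many binary models as there are labels, whereas one-vs-one trains a model for every pair of labels and lets them vote; with only three labels both need three models, so the difference mostly matters on problems with many classes.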

We can now use `OneVsAll` just like any other classifier.

In [74]:
mtrx = ova.test(testing)
print(mtrx)
mtrx.print_stats()


            chinese   english   japanese  
          ------------------------------
  chinese | 0.8       -         0.2
  english | -         0.921     0.0789
 japanese | -         0.00515   0.995


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.889       1           0.8         0.0794
english     0.946       0.972       0.921       0.151
japanese    0.98        0.965       0.995       0.77
------------------------------------------------------------
Total       0.969       0.969       0.968
------------------------------------------------------------
------------------------------------------------------------
252 predictions attempted, overall accuracy: 0.968



In [75]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.OneVsAll(fold, metapy.classify.SGD, loss_id='hinge'), view, 5)
print(mtrx)
mtrx.print_stats()

1669328353: [info]     Cross-validating fold 1/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 2/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 3/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 4/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)
1669328353: [info]     Cross-validating fold 5/5 (/metapy/deps/meta/include/meta/classify/classifier/classifier.h:103)



            chinese   english   japanese  
          ------------------------------
  chinese | 0.783     0.0326    0.185
  english | 0.0137    0.897     0.089
 japanese | 0.0013    0.00652   0.992


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.862       0.96        0.783       0.0915
english     0.919       0.942       0.897       0.145
japanese    0.977       0.962       0.992       0.763
------------------------------------------------------------
Total       0.959       0.959       0.959
------------------------------------------------------------
------------------------------------------------------------
1005 predictions attempted, overall accuracy: 0.959



That should be enough to get you started! Try looking at `help(metapy.classify)` for a list of what's included in the bindings.

**COMPETITION TIME**

# Part 4: Topic Modeling

In this part of the tutorial we will discuss how to run a topic model over data indexed as a `ForwardIndex`.

We will need to index our data to proceed. We eventually want to be able to extract the bag-of-words representation for our individual documents, so we will want a `ForwardIndex` in this case.

In [76]:
fidx = metapy.index.make_forward_index('ceeaus-config.toml')

1669328353: [info]     Loading index from disk: ceeaus-idx/fwd (/metapy/deps/meta/src/index/forward_index.cpp:171)


Just like in classification, the feature set used for the topic modeling will be the feature set used at the time of indexing, so if you want to play with a different set of features (like bigram words), you will need to re-index your data.

For now, we've just stuck with the default filter chain for unigram words, so we're operating in the traditional bag-of-words space.

Let's load our documents into memory to run the topic model inference now.

In [77]:
dset = metapy.learn.Dataset(fidx)



Now, let's try to find some topics for this dataset. To do so, we're going to use a generative model called a topic model.

There are many different topic models in the literature, but the most commonly used topic model is Latent Dirichlet Allocation. Here, we propose that there are K topics (each represented with a categorical distribution over words) $\phi_k$ from which all of our documents are generated. These K topics are modeled as being sampled from a Dirichlet distribution with parameter $\vec{\alpha}$. Then, to generate a document $d$, we first sample a distribution over the K topics $\theta_d$ from another Dirichlet distribution with parameter $\vec{\beta}$. Then, for each word in this document, we first sample a topic identifier $z \sim \theta_d$ and then the word by drawing from the topic we selected ($w \sim \phi_z$). Refer to the [Wikipedia article on LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for more information.
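The generative story above can be written out as a short sampling sketch. It uses only the standard library, assumes symmetric Dirichlet priors (a single scalar instead of a full parameter vector), and follows the naming in the paragraph above for which prior governs $\phi$ and which governs $\theta$. The vocabulary and all function names are made up for illustration.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, num_docs, doc_length, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One categorical distribution over words per topic: phi_k.
    phi = [sample_dirichlet(rng, alpha, len(vocab)) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Per-document distribution over topics: theta_d.
        theta = sample_dirichlet(rng, beta, num_topics)
        words = []
        for _ in range(doc_length):
            z = rng.choices(range(num_topics), weights=theta)[0]  # z ~ theta_d
            w = rng.choices(vocab, weights=phi[z])[0]             # w ~ phi_z
            words.append(w)
        docs.append(words)
    return phi, docs


vocab = ["smoke", "ban", "restaurant", "job", "money", "student"]
phi, docs = generate_corpus(vocab, num_topics=2, num_docs=3, doc_length=8,
                            alpha=0.5, beta=1.0)
for doc in docs:
    print(" ".join(doc))
```

Inference runs this story in reverse: given only the words in `docs`, it tries to recover plausible values for `phi` and for each document's `theta`.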

The goal of running inference for an LDA model is to infer the latent variables $\phi_k$ and $\theta_d$ for all of the $K$ topics and $D$ documents, respectively. MeTA provides a number of different inference algorithms for LDA, as each one entails a different set of trade-offs (inference in LDA is intractable, so all inference algorithms are approximations; different algorithms entail different approximation guarantees, running times, and required memory consumption). For now, let's run a variational inference algorithm called CVB0 to find two topics. (In practice you will likely be finding many more topics than just two, but this is a very small toy dataset.)

In [78]:
lda_inf = metapy.topics.LDACollapsedVB(dset, num_topics=2, alpha=1.0, beta=0.01)
lda_inf.run(num_iters=1000)

Iteration 1 maximum change in gamma: 1.96874                                    
Iteration 2 maximum change in gamma: 0.424639                                   
Iteration 3 maximum change in gamma: 0.343091                                   
Iteration 4 maximum change in gamma: 0.452894                                   
Iteration 5 maximum change in gamma: 0.685825                                   
Iteration 6 maximum change in gamma: 1.29713                                    
Iteration 7 maximum change in gamma: 1.4322                                     
Iteration 8 maximum change in gamma: 1.36543                                    
Iteration 9 maximum change in gamma: 1.41203                                    
Iteration 10 maximum change in gamma: 1.26582                                   
Iteration 11 maximum change in gamma: 1.14877                                   
Iteration 12 maximum change in gamma: 0.85183                                   
Iteration 13 maximum change 

The above ran the CVB0 algorithm for 1000 iterations, or until an algorithm-specific convergence criterion was met. Now let's save the current estimate for our topics and topic proportions.

In [79]:
lda_inf.save('lda-cvb0')

We can interrogate the topic inference results by using the `TopicModel` query class. Let's load our inference results back in.

In [80]:
model = metapy.topics.TopicModel('lda-cvb0')



Now, let's have a look at our topics. A typical way of doing this is to print the top $k$ words in each topic, so let's do that.

In [81]:
model.top_k(tid=0)

[(3341, 0.13110417416754439),
 (3045, 0.054349404799836486),
 (2677, 0.03678013558661191),
 (3346, 0.0334926919579171),
 (281, 0.022530708720050734),
 (3729, 0.015620501971815122),
 (1953, 0.012780938209343243),
 (707, 0.012635089307785483),
 (592, 0.01198720192147316),
 (2448, 0.011317757976702114)]

The models operate on term ids instead of raw text strings, so let's convert this to a human readable format by using the vocabulary contained in our `ForwardIndex` to map the term ids to strings.

In [82]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0)]

[('smoke', 0.13110417416754439),
 ('restaur', 0.054349404799836486),
 ('peopl', 0.03678013558661191),
 ('smoker', 0.0334926919579171),
 ('ban', 0.022530708720050734),
 ('think', 0.015620501971815122),
 ('japan', 0.012780938209343243),
 ('complet', 0.012635089307785483),
 ('cigarett', 0.01198720192147316),
 ('non', 0.011317757976702114)]

In [83]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1)]

[('time', 0.06705635166588672),
 ('job', 0.05605921726559277),
 ('part', 0.052222985996315356),
 ('student', 0.046429316169084245),
 ('colleg', 0.03488135582751223),
 ('work', 0.02906743830662369),
 ('money', 0.028850177138077207),
 ('think', 0.022331321863868842),
 ('import', 0.02075566781279595),
 ('studi', 0.015483012745978186)]

We can pretty clearly see that this particular dataset was about two major issues: part time jobs for students and smoking in public. This dataset is actually a collection of essays written by students, and there just so happen to be two different topics they can choose from!

The topics are pretty clear in this case, but in some cases it is also useful to score the terms in a topic using some function of the probability of the word in the topic and the probability of the word in the other topics. Intuitively, we might want to select words from each topic that best reflect that topic's content by picking words that both have high probability in that topic **and** have low probability in the other topics. In other words, we want to balance between high probability terms and highly specific terms (this is kind of like a tf-idf weighting). One such scoring function is provided by the toolkit in `BLTermScorer`, which implements a scoring function proposed by Blei and Lafferty.
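As a sketch of that idea, the score from Blei and Lafferty multiplies a word's probability in a topic by the log-ratio between that probability and the geometric mean of the word's probability across all $K$ topics: $\text{score}_{k,w} = \phi_{k,w} \log\left(\phi_{k,w} / (\prod_{j} \phi_{j,w})^{1/K}\right)$. The toy topics below are made up, and the function is a hypothetical re-implementation that may differ in details from what `BLTermScorer` does.

```python
import math


def bl_term_scores(topics):
    """Score every word in every topic.

    `topics` is a list of dicts mapping word -> P(word | topic). All topics
    must share the same vocabulary and have strictly positive probabilities.
    """
    num_topics = len(topics)
    scores = []
    for topic in topics:
        topic_scores = {}
        for word, prob in topic.items():
            # Log of the geometric mean of this word's probability over topics.
            mean_log = sum(math.log(other[word]) for other in topics) / num_topics
            topic_scores[word] = prob * (math.log(prob) - mean_log)
        scores.append(topic_scores)
    return scores


toy_topics = [
    {"smoke": 0.6, "think": 0.3, "job": 0.1},
    {"smoke": 0.1, "think": 0.3, "job": 0.6},
]
scores = bl_term_scores(toy_topics)
ranked_topic0 = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked_topic0)  # ['smoke', 'think', 'job']
```

A word that is equally likely in every topic gets a score of zero no matter how frequent it is, which is exactly the tf-idf-like behavior described above.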

In [84]:
scorer = metapy.topics.BLTermScorer(model)
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0, scorer=scorer)]

[('smoke', 0.8741658371694072),
 ('restaur', 0.3174616491011326),
 ('smoker', 0.20060314960939518),
 ('ban', 0.12853057250191835),
 ('cigarett', 0.06557615921387594),
 ('non', 0.061284317453813575),
 ('complet', 0.061053845989036404),
 ('japan', 0.05846318850470755),
 ('health', 0.050548449776892435),
 ('seat', 0.04533996037590082)]

In [85]:
[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1, scorer=scorer)]

[('job', 0.3482198672101333),
 ('part', 0.313110249389438),
 ('student', 0.2832889303013064),
 ('colleg', 0.2080895841416246),
 ('time', 0.17797615819391943),
 ('money', 0.16234658461450713),
 ('work', 0.15585330663466523),
 ('studi', 0.08228277659868777),
 ('learn', 0.06491894472311999),
 ('experi', 0.054945213312151964)]

Here we can see that the uninformative word stem "think" has dropped out of the top-word list for both topics: it was downweighted because it has relatively high probability in both topics.

We can also see the inferred topic distribution for each document.

In [86]:
model.topic_distribution(0)

<metapy.stats.Multinomial {0: 0.021341, 1: 0.978659}>

It looks like our first document was written by a student who chose the part-time job essay topic...

In [87]:
model.topic_distribution(900)

<metapy.stats.Multinomial {0: 0.978797, 1: 0.021203}>

...whereas this document looks like it was written by a student who chose the public smoking essay topic.