<a href="https://colab.research.google.com/github/illinois/metapy/blob/master/tutorials/3-deeper-text-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
%pip install https://github.com/illinois/metapy/releases/download/v0.2.14/metapy-0.2.14-cp37-cp37m-manylinux_2_24_x86_64.whl

First, we'll import the `metapy` python bindings.

In [2]:
import metapy

Now, let's create a document with some content.

In [3]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")

MeTA provides a stream-based interface for performing document tokenization. Each stream starts off with a Tokenizer object, and in most cases you should use the [Unicode standard aware](http://site.icu-project.org) `ICUTokenizer`.

In [4]:
tok = metapy.analyzers.ICUTokenizer()

Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens. Let's try running just the `ICUTokenizer` to see what it does.

In [5]:
tok.set_content(doc.content()) # this could be any string
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>']

One thing you likely noticed immediately is the insertion of the pseudo-XML `<s>` and `</s>` tags. These are called "sentence boundary tags". A default-constructed `ICUTokenizer` also detects the sentences in a document and delimits each one with these tags. Let's try tokenizing a multi-sentence document to see what that looks like.

In [6]:
doc.content("I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.")
tok.set_content(doc.content())
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>',
 '<s>',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.',
 '</s>']

Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more.

Let's pass a flag to the `ICUTokenizer` constructor to disable sentence boundary tags for now.

In [7]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok.set_content(doc.content())
[token for token in tok]

['I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.']

I mentioned earlier that MeTA treats tokenization as a *streaming* process, and that it *starts* with a tokenizer. As you've learned, for optimal search performance it's often beneficial to modify the raw underlying tokens of a document, and thus change its representation, before adding it to an inverted index structure for searching.

The "intermediate" steps in the tokenization stream are represented with objects called Filters. Each filter consumes the content of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way.

Let's start by using a simple filter that can help eliminate a lot of noise that we might encounter when tokenizing web documents: a `LengthFilter`.

In [8]:
tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)
tok.set_content(doc.content())
[token for token in tok]

['said',
 'that',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '19.95',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '30',
 'before']

Here, we can see that the `LengthFilter` is consuming our original `ICUTokenizer`. It modifies the token stream by emitting only tokens whose length is at least 2 and at most 30. This gets rid of a lot of punctuation tokens, and also of excessively long tokens such as URLs.
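To make the "one stage consumes the previous one" idea concrete, here is a minimal pure-Python sketch of a length filter written as a generator. It is only an analogy and not MeTA's implementation: the function names are made up for illustration, the input list is hard-coded from the tokenizer output shown earlier, and it is an assumption that `len()` measures length the same way `LengthFilter` does.

```python
def first_sentence_tokens():
    """Stand-in for a tokenizer: tokens copied from the output shown earlier."""
    return ["I", "said", "that", "I", "can't", "believe", "that", "it", "only",
            "costs", "$", "19.95", "!"]


def length_filter(tokens, min_len, max_len):
    """Lazily yield only the tokens whose length lies in [min_len, max_len]."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token


# The filter wraps the upstream token source, just like a stage in a stream.
filtered = list(length_filter(first_sentence_tokens(), min_len=2, max_len=30))
print(filtered)
```

The single-character tokens (`I`, `$`, `!`) are dropped and everything else passes through unchanged.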

Another common trick is to remove stopwords. (Can anyone tell me what a stopword is?) In MeTA, this is done using a `ListFilter`.

In [9]:
%%capture
!wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [10]:
tok = metapy.analyzers.ListFilter(tok, "lemur-stopwords.txt", metapy.analyzers.ListFilter.Type.Reject)
tok.set_content(doc.content())
[token for token in tok]

["can't", 'believe', 'costs', '19.95', 'find', '30']

Here we've downloaded a common list of stopwords obtained from the [Lemur project](http://lemurproject.org) and created a `ListFilter` to reject any tokens that occur in that list of words.

You can see how much of a difference removing stopwords can make on the size of a document's token stream! This translates to a lot of space savings in the inverted index as well.
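As a rough illustration of what a reject-style list filter does, the sketch below drops every token found in a stopword set. The set here is a tiny hand-picked stand-in chosen for this example, not the Lemur list, and `reject_filter` is a hypothetical helper rather than part of `metapy`.

```python
# Hypothetical, hand-picked stopwords; the real Lemur list is far longer.
TOY_STOPWORDS = {"said", "that", "it", "only", "could", "for", "more", "than", "before"}


def reject_filter(tokens, rejected):
    """Yield only the tokens that do not appear in the rejected set."""
    for token in tokens:
        if token not in rejected:
            yield token


# Tokens as they looked after the length filter above.
length_filtered = ["said", "that", "can't", "believe", "that", "it", "only",
                   "costs", "19.95", "could", "only", "find", "it", "for",
                   "more", "than", "30", "before"]
kept = list(reject_filter(length_filtered, TOY_STOPWORDS))
print(kept)
print("{} of {} tokens kept".format(len(kept), len(length_filtered)))
```

Using a `set` keeps each membership check cheap, which matters when the filter runs over every token of every document.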

Another common filter is a stemmer (a lemmatizer serves a similar purpose). This kind of filter tries to modify individual tokens in such a way that different inflected forms of a word all reduce to the same representation. This lets you, for example, find documents containing "run" when you search for "running" or "runs". A common stemmer is the [Porter2 Stemmer](http://snowball.tartarus.org/algorithms/english/stemmer.html), which MeTA implements. Let's try it!

In [11]:
tok = metapy.analyzers.Porter2Filter(tok)
tok.set_content(doc.content())
[token for token in tok]

["can't", 'believ', 'cost', '19.95', 'find', '30']

Notice how "believe" becomes "believ" and "costs" becomes "cost". Stemming can help search by allowing queries to return more matched documents by relaxing what it means for a document to match a query term. Note that it's important to ensure that queries are tokenized in the *exact same way* as your documents were before indexing them. If you ignore this, your query is unlikely to contain the raw token "believ" and you'll miss a lot of results.
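The mismatch problem can be shown with a toy example. The lookup table below stands in for a real stemmer (its two entries are copied from the output above), and the "index" is just a Python set of stemmed tokens; both are simplifications for illustration only.

```python
# Hypothetical stand-in for a stemmer: only the two stems seen in the output above.
TOY_STEMS = {"believe": "believ", "costs": "cost"}


def toy_analyze(tokens):
    """Apply the same (toy) stemming step to documents and queries alike."""
    return [TOY_STEMS.get(token, token) for token in tokens]


# "Index" the document using its analyzed tokens.
toy_index = set(toy_analyze(["can't", "believe", "costs", "19.95", "find", "30"]))

raw_query = ["believe"]
raw_hits = [term in toy_index for term in raw_query]                   # query left raw
analyzed_hits = [term in toy_index for term in toy_analyze(raw_query)]  # same pipeline
print(raw_hits, analyzed_hits)
```

The raw query term is not in the index because the index only ever saw `believ`; running the query through the same analysis step restores the match.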

Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens. In the simplest case, which often is enough for "good enough" search results, our action can simply be counting how many times these tokens occur.

For clarity, let's switch back to a simpler token stream first. Write me a token stream that tokenizes using the Unicode standard, and then lowercases each token. (Hint: `help(metapy.analyzers)`.)

In [12]:
help(metapy.analyzers)

Help on module metapy.metapy.analyzers in metapy.metapy:

NAME
    metapy.metapy.analyzers

CLASSES
    pybind11_builtins.pybind11_object(builtins.object)
        Analyzer
            MultiAnalyzer
            NGramPOSAnalyzer
            NGramWordAnalyzer
            TreeAnalyzer
        TokenStream
            AlphaFilter
            CharacterTokenizer
            EmptySentenceFilter
            EnglishNormalizer
            ICUFilter
            ICUTokenizer
            LengthFilter
            ListFilter
            LowercaseFilter
            PennTreebankNormalizer
            Porter2Filter
            SentenceBoundaryAdder
        TreeFeaturizer
            BranchFeaturizer
            DepthFeaturizer
            SemiSkeletonFeaturizer
            SkeletonFeaturizer
            SubtreeFeaturizer
            TagFeaturizer
    
    class AlphaFilter(TokenStream)
     |  Method resolution order:
     |      AlphaFilter
     |      TokenStream
     |      pybind11_builtins.pybind11_o

In [13]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok = metapy.analyzers.LowercaseFilter(tok)
tok.set_content(doc.content())
[token for token in tok]

['i',
 'said',
 'that',
 'i',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 'i',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.']

Now, let's count how often each individual token appears in the stream. You might have called this representation the "bag of words" representation, but it is also often called "unigram word counts". In MeTA, classes that consume a token stream and emit a document representation are called Analyzers.

In [14]:
ana = metapy.analyzers.NGramWordAnalyzer(1, tok)
print(doc.content())
ana.analyze(doc)

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.


{'.': 1,
 'more': 1,
 '19.95': 1,
 'i': 3,
 '30': 1,
 'it': 2,
 'believe': 1,
 'costs': 1,
 "can't": 1,
 'said': 1,
 '$': 2,
 'find': 1,
 'that': 2,
 'than': 1,
 'could': 1,
 'for': 1,
 'before': 1,
 'only': 2,
 '!': 1}
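For comparison, the same unigram counts can be reproduced in plain Python with `collections.Counter`. This is only a sketch of the idea behind the analyzer, using the lowercased token list from above as hard-coded input.

```python
from collections import Counter

# Lowercased tokens copied from the token stream output above.
lowercased = ["i", "said", "that", "i", "can't", "believe", "that", "it", "only",
              "costs", "$", "19.95", "!", "i", "could", "only", "find", "it",
              "for", "more", "than", "$", "30", "before", "."]

unigram_counts = Counter(lowercased)
print(unigram_counts["i"], unigram_counts["that"], unigram_counts["$"])
```

Word order is discarded entirely, which is exactly what "bag of words" means.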

If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them. "Unigram" means "1-gram", and we count individual tokens. "Bigram" means "2-gram", and we count adjacent tokens together as a group. Let's try that now.

In [15]:
ana = metapy.analyzers.NGramWordAnalyzer(2, tok)
ana.analyze(doc)

{('only', 'find'): 1,
 ('it', 'for'): 1,
 ('30', 'before'): 1,
 ('believe', 'that'): 1,
 ('that', 'i'): 1,
 ('i', 'could'): 1,
 ('$', '30'): 1,
 ('19.95', '!'): 1,
 ('!', 'i'): 1,
 ('i', 'said'): 1,
 ('that', 'it'): 1,
 ('find', 'it'): 1,
 ('for', 'more'): 1,
 ('before', '.'): 1,
 ('than', '$'): 1,
 ('only', 'costs'): 1,
 ('more', 'than'): 1,
 ('said', 'that'): 1,
 ('could', 'only'): 1,
 ('it', 'only'): 1,
 ('$', '19.95'): 1,
 ('costs', '$'): 1,
 ("can't", 'believe'): 1,
 ('i', "can't"): 1}

Now the individual "tokens" we're counting are pairs of tokens. You can analyze any n-gram of tokens you would like to in this way (and this is a simple way to attempt to support phrase search). Note, however, that as you increase the size of the n-grams you are counting, you are also increasing (exponentially!) the number of possible n-grams you could observe, so there's no free lunch here.
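A short sketch makes both points concrete: how n-grams are formed from adjacent tokens, and how quickly the space of possible n-grams grows. The helper name and the vocabulary size of 10,000 are assumptions made up for this illustration.

```python
def word_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))


sample = ["i", "could", "only", "find", "it"]
bigrams = word_ngrams(sample, 2)
print(bigrams)

# With a vocabulary of V distinct tokens there are V**n possible n-grams.
vocab_size = 10000  # hypothetical vocabulary size
possible = {n: vocab_size ** n for n in (1, 2, 3)}
print(possible)
```

Only a small fraction of those possible n-grams ever occurs in real text, but the index still has to store every distinct one it sees.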

This analysis pipeline feeds both the creation of the `InvertedIndex`, which is used for search applications, and the `ForwardIndex`, which is used for topic modeling and classification applications. For classification, sometimes looking at n-grams of characters is useful.

In [16]:
tok = metapy.analyzers.CharacterTokenizer()
ana = metapy.analyzers.NGramWordAnalyzer(4, tok)
ana.analyze(doc)

{('a', 't', ' ', 'i'): 1,
 (' ', 'I', ' ', 'c'): 2,
 ('f', 'o', 'r', ' '): 1,
 ('c', 'o', 'u', 'l'): 1,
 ('$', '1', '9', '.'): 1,
 ('t', 'h', 'a', 'n'): 1,
 ('h', 'a', 't', ' '): 2,
 (' ', 'b', 'e', 'f'): 1,
 ('f', 'i', 'n', 'd'): 1,
 ('I', ' ', 's', 'a'): 1,
 ('a', 'n', ' ', '$'): 1,
 ('t', 's', ' ', '$'): 1,
 ('d', ' ', 't', 'h'): 1,
 (' ', 'f', 'i', 'n'): 1,
 (' ', 'm', 'o', 'r'): 1,
 ('o', 'r', 'e', '.'): 1,
 ('u', 'l', 'd', ' '): 1,
 ('o', 'r', 'e', ' '): 1,
 ('l', 'd', ' ', 'o'): 1,
 ('9', '5', '!', ' '): 1,
 ('i', 't', ' ', 'f'): 1,
 (' ', '$', '1', '9'): 1,
 (' ', 'b', 'e', 'l'): 1,
 ('t', ' ', 'i', 't'): 1,
 ('1', '9', '.', '9'): 1,
 ('3', '0', ' ', 'b'): 1,
 ('n', 'l', 'y', ' '): 2,
 (' ', 'c', 'o', 's'): 1,
 ('y', ' ', 'f', 'i'): 1,
 ('t', ' ', 'b', 'e'): 1,
 ('n', ' ', '$', '3'): 1,
 ("'", 't', ' ', 'b'): 1,
 (' ', 'i', 't', ' '): 2,
 ('b', 'e', 'l', 'i'): 1,
 ('r', 'e', ' ', 't'): 1,
 ('b', 'e', 'f', 'o'): 1,
 ('m', 'o', 'r', 'e'): 1,
 (' ', 'c', 'a', 'n'): 1,
 (' ', 'o', 

Different analyzers can be combined to create document representations that capture several perspectives on the same text at once (for example, unigram word counts alongside character n-grams). Once things start to get more complicated, we recommend using a configuration file to specify each of the analyzers you wish to combine for your document representation.

Now, let's explore something a little bit different. MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech tagging and syntactic parsing.

(Does anyone know what part-of-speech tagging is?) POS tagging is a task in NLP that involves identifying a type for each word in a sentence. For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, or adjectives, or... This is useful as a first step towards developing an understanding of the meaning of a particular sentence.

MeTA places its POS tagging component in its "sequences" library. Let's play with some sequences first to get an idea of how they work. We'll start off by creating a sequence.

In [17]:
seq = metapy.sequence.Sequence()

Now, we can add individual words to this sequence. Sequences consist of a list of `Observation`s, which are essentially (word, tag) pairs. If we don't yet know the tags for a `Sequence`, we can just add individual words and leave the tags unset. Words are called "symbols" in the library terminology.

In [18]:
for word in ["The", "dog", "ran", "across", "the", "park", "."]:
    seq.add_symbol(word)
print(seq)

(The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)


The printed form of the sequence shows that we do not yet know the tags for each word. Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA.

In [19]:
%%capture
!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz
!tar xvf greedy-perceptron-tagger.tar.gz

In [20]:
tagger = metapy.sequence.PerceptronTagger("perceptron-tagger/")

Now let's fill in the missing tags in our sentence based on the best guess this model has.

In [21]:
tagger.tag(seq)
print(seq)

(The, DT), (dog, NN), (ran, VBD), (across, IN), (the, DT), (park, NN), (., .)


Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
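As a quick reading aid, the snippet below spells out the tags that appear in the tagged sentence above. The descriptions are paraphrased from the Penn Treebank tagset, and the (word, tag) pairs are copied from the printed output rather than produced by the tagger.

```python
# Descriptions for just the tags seen in the example sentence.
PENN_TAG_NAMES = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

# (word, tag) pairs copied from the tagger output above.
tagged_words = [("The", "DT"), ("dog", "NN"), ("ran", "VBD"), ("across", "IN"),
                ("the", "DT"), ("park", "NN"), (".", ".")]

for word, tag in tagged_words:
    print("{:8} {:4} {}".format(word, tag, PENN_TAG_NAMES[tag]))
```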

But what if we want to POS-tag a document?

In [22]:
print(doc.content())

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.


We need a way of going from a document to a list of `Sequence`s, each representing an individual sentence. I'll get you started.

In [23]:
tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!
tok = metapy.analyzers.PennTreebankNormalizer(tok)
tok.set_content(doc.content())
[token for token in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 'ca',
 "n't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>',
 '<s>',
 'I',
 'could',
 'only',
 'find',
 'it',
 'for',
 'more',
 'than',
 '$',
 '30',
 'before',
 '.',
 '</s>']

(Notice that the `PennTreebankNormalizer` modifies some tokens to better match the conventions of the Penn Treebank training data. This should help improve performance a little.)

Now, write me a function that takes a token stream containing sentence boundary tags and returns a list of `Sequence` objects. Don't include the sentence boundary tags in the actual `Sequence` objects.

In [24]:
def extract_sequences(tok):
    sequences = []
    for token in tok:
        if token == '<s>':
            sequences.append(metapy.sequence.Sequence())
        elif token != '</s>':
            sequences[-1].add_symbol(token)
    return sequences

In [25]:
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(seq)

(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)
(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)
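Note that `extract_sequences` relies on every sentence opening with `<s>`; a stream built with `suppress_tags=True` would hit `sequences[-1]` on an empty list. The sketch below shows the same grouping logic on plain Python lists, with one possible way to tolerate missing tags. It uses lists of strings instead of `Sequence` objects so it can run without MeTA, and `split_sentences` is a hypothetical helper name.

```python
def split_sentences(tokens):
    """Group tokens into sentences, dropping the <s> and </s> markers."""
    sentences = []
    current = None  # the sentence being filled, or None when outside one
    for token in tokens:
        if token == "<s>":
            current = []
            sentences.append(current)
        elif token == "</s>":
            current = None
        else:
            if current is None:  # token seen outside any <s>...</s> pair
                current = []
                sentences.append(current)
            current.append(token)
    return sentences


tagged_stream = ["<s>", "I", "agree", ".", "</s>", "<s>", "Good", "!", "</s>"]
print(split_sentences(tagged_stream))
print(split_sentences(["no", "tags", "here"]))
```

Treating untagged input as a single sentence is a design choice made for this sketch; raising an error would be an equally reasonable alternative.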


This is still a rather shallow understanding of these sentences. The next major leap is to parse these sequences of POS-tagged words to obtain a tree for each sentence. These trees, in our case, represent the hierarchical phrase structure of a single sentence: they group together the tokens that belong to one phrase and show how small phrases combine into larger phrases and, eventually, a sentence.

Let's try parsing the sentences in our document using a pre-trained constituency parser that's distributed with MeTA.

In [26]:
%%capture
!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-constituency-parser.tar.gz
!tar xvf greedy-constituency-parser.tar.gz

In [27]:
parser = metapy.parser.Parser("parser/")

In [28]:
print(' '.join([obs.symbol for obs in seq]))
print(seq)
tree = parser.parse(seq)
print(tree.pretty_str())

I could only find it for more than $ 30 before .
(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)
(ROOT
  (S
    (NP (PRP I))
    (VP
      (MD could)
      (ADVP (RB only))
      (VP
        (VB find)
        (NP (PRP it))
        (PP
          (IN for)
          (NP
            (QP
              (JJR more)
              (IN than)
              ($ $)
              (CD 30))))
        (ADVP (IN before))))
    (. .)))



(You can also play with this with a [prettier online demo](https://meta-toolkit.org/nlp-demo.html).)

We can now parse all of the sentences in our document.

In [29]:
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(parser.parse(seq).pretty_str())

(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD said)
      (SBAR
        (IN that)
        (S
          (NP (PRP I))
          (VP
            (MD ca)
            (RB n't)
            (VP
              (VB believe)
              (SBAR
                (IN that)
                (S
                  (NP (PRP it))
                  (ADVP (RB only))
                  (VP
                    (VBZ costs)
                    (NP
                      ($ $)
                      (CD 19.95))))))))))
    (. !)))

(ROOT
  (S
    (NP (PRP I))
    (VP
      (MD could)
      (ADVP (RB only))
      (VP
        (VB find)
        (NP (PRP it))
        (PP
          (IN for)
          (NP
            (QP
              (JJR more)
              (IN than)
              ($ $)
              (CD 30))))
        (ADVP (IN before))))
    (. .)))



Now that we know how to build these phrase structure trees from POS-tagged sentences extracted from raw text, let's explore a simple way we might be able to exploit this knowledge to help a downstream task.

Our goal is going to be to extract the Subject-Verb-Object triples from some simple sentences. This will allow us to understand who is doing what to whom, which is knowledge that might be useful for downstream tasks as diverse as question answering and stock market prediction. We should be able to extract these from our constituency parses. (This, of course, isn't the only way, and this method is quite naive. However, the implementation is simple enough that I think you should be able to grasp it in a single lecture.)

First, let's grab our sample data. This is a collection of BBC news headlines that will serve as our "simple" sentences.

In [30]:
%%capture
!wget -nc https://meta-toolkit.org/data/2017-03-27/headlines.tar.gz # please be nice!
!tar xvf headlines.tar.gz

In [31]:
!echo "" && echo "README:"
!cat headlines/README.md


README:
http://mlg.ucd.ie/datasets/bbc.html

Exactracted first sentence of each doc from this dataset.


Let's look at the first headline of the business category.

In [32]:
with open("headlines/business.txt") as f:
    business = f.readlines()
business[0].strip()

'Brazil approves bankruptcy reform'

This looks simple enough. Let's see how it gets tagged.

In [33]:
tok.set_content(business[0].strip())
sequence = extract_sequences(tok)[0]
tagger.tag(sequence)
print(sequence)

(Brazil, NNP), (approves, VBZ), (bankruptcy, NN), (reform, NN)


Let's also parse it.

In [34]:
tree = parser.parse(sequence)
print(tree.pretty_str())

(ROOT
  (S
    (NP (NNP Brazil))
    (VP
      (VBZ approves)
      (NP
        (NN bankruptcy)
        (NN reform)))))



Great. We can now start to develop our technique. We can see that the subject here is the first noun phrase (NP), the verb is the first verb-like token in the VP, and the object is the NP within that VP.

We're going to need to traverse this tree to extract what we want. MeTA supports this by exploiting the [Visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern), so the easiest way for us to get at what we're looking for is to write some classes that encapsulate the traversal we want to perform and keep track of things within this tree that we are interested in.
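If the Visitor pattern is new to you, the following pure-Python sketch shows the mechanics on a hand-built copy of the tree above. The `ToyLeaf`, `ToyInternal`, and `FirstCategoryFinder` classes are hypothetical stand-ins written for this illustration, not MeTA's classes; in particular, `category` is a plain attribute here rather than a method.

```python
class ToyLeaf:
    def __init__(self, category, word):
        self.category, self.word = category, word

    def accept(self, visitor):
        # A leaf dispatches to the visitor's leaf handler.
        return visitor.visit_leaf(self)


class ToyInternal:
    def __init__(self, category, children):
        self.category, self.children = category, children

    def accept(self, visitor):
        # An internal node dispatches to the visitor's internal-node handler.
        return visitor.visit_internal(self)


class FirstCategoryFinder:
    """Remember the first internal node (in pre-order) with a given category."""

    def __init__(self, category):
        self.category = category
        self.node = None

    def visit_leaf(self, leaf):
        pass  # leaves are never phrase nodes

    def visit_internal(self, node):
        if self.node is not None:
            return  # already found; skip the rest of the traversal
        if node.category == self.category:
            self.node = node
        else:
            for child in node.children:
                child.accept(self)


toy_tree = ToyInternal("S", [
    ToyInternal("NP", [ToyLeaf("NNP", "Brazil")]),
    ToyInternal("VP", [
        ToyLeaf("VBZ", "approves"),
        ToyInternal("NP", [ToyLeaf("NN", "bankruptcy"), ToyLeaf("NN", "reform")]),
    ]),
])

finder = FirstCategoryFinder("VP")
toy_tree.accept(finder)
print(finder.node.category, len(finder.node.children))
```

The tree classes know nothing about what is being searched for; all of that state lives in the visitor, which is what makes it easy to write several different traversals over the same tree.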

Let's write our first simple visitor that traverses the tree to find the first NP node, at which point it will stop and store the root of that subtree.

In [35]:
help(metapy.parser.Visitor)

Help on class Visitor in module metapy.metapy.parser:

class Visitor(pybind11_builtins.pybind11_object)
 |  Method resolution order:
 |      Visitor
 |      pybind11_builtins.pybind11_object
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(...)
 |      __init__(self: metapy.metapy.parser.Visitor) -> None
 |  
 |  visit_internal(...)
 |      visit_internal(self: metapy.metapy.parser.Visitor, arg0: metapy.metapy.parser.InternalNode) -> object
 |  
 |  visit_leaf(...)
 |      visit_leaf(self: metapy.metapy.parser.Visitor, arg0: metapy.metapy.parser.LeafNode) -> object
 |  
 |  ----------------------------------------------------------------------
 |  Static methods inherited from pybind11_builtins.pybind11_object:
 |  
 |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type
 |      Create and return a new object.  See help(type) for accurate signature.



In [36]:
class NounPhraseFinder(metapy.parser.Visitor):
    def __init__(self):
        self.node = None
        super(NounPhraseFinder, self).__init__() # required; invoke base class __init__
        
    def visit_leaf(self, node):
        pass # we don't care about leaf nodes
    
    def visit_internal(self, node):
        if self.node:
            return

        # we do care about internal nodes; check if it is an NP
        if node.category() == 'NP':
            # store this node and stop the traversal
            self.node = node
        else:
            # continue traversing by visiting all of the child nodes
            node.each_child(lambda child: child.accept(self))

In [37]:
npf = NounPhraseFinder()
tree.visit(npf)
print("{} with {} child(ren)".format(npf.node.category(), npf.node.num_children()))

NP with 1 child(ren)


Now that we have that working, we should be able to make a more generic PhraseFinder that finds the first internal node that matches a specific node category. We'll need one for finding the first NP and one for finding the first VP anyway, so this will be helpful.

In [38]:
class PhraseFinder(metapy.parser.Visitor):
    def __init__(self, category):
        super(PhraseFinder, self).__init__()
        self.node = None
        self.category = category
        
    def visit_leaf(self, node):
        pass # we don't care about leaf nodes
    
    def visit_internal(self, node):
        if self.node:
            return
        
        if node.category() == self.category:
            self.node = node
        else:
            node.each_child(lambda child: child.accept(self))

In [39]:
npf = PhraseFinder('NP')
vpf = PhraseFinder('VP')
tree.visit(npf)
tree.visit(vpf)
for node in [npf.node, vpf.node]:
    print("{} with {} child(ren)".format(node.category(), node.num_children()))

NP with 1 child(ren)
VP with 2 child(ren)


Now that we can find the first internal node matching a category label, we need to set about extracting the actual leaf nodes we care about. Fortunately there is already a visitor that can extract all leaf nodes from a subtree, so we can use that to get started.

From the first noun phrase, we want to extract all leaf nodes that are noun-like tags and join them together to make up our subject.

In [40]:
noun_tags = set(['NN', 'NNS', 'NNP', 'NNPS'])
lnf = metapy.parser.LeafNodeFinder()
npf.node.accept(lnf)
subject = ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in noun_tags])
print(subject)

Brazil


And from the first verb phrase, we want to extract (1) the first verb-like leaf node to be the verb and (2) the noun-like tags in the first NP that occurs within that VP. We should be able to re-use some existing code we've already written.

In [41]:
verb_tags = set(['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
lnf = metapy.parser.LeafNodeFinder()
vpf.node.accept(lnf)
verb = next(leaf.word() for leaf in lnf.leaves() if leaf.category() in verb_tags)
print(verb)

approves


In [42]:
np_finder = PhraseFinder('NP')
vpf.node.accept(np_finder)
lnf = metapy.parser.LeafNodeFinder()
np_finder.node.accept(lnf)
obj = ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in noun_tags])
print(obj)

bankruptcy reform


In [43]:
print("SUBJ: {} VERB: {} OBJ: {}".format(subject, verb, obj))

SUBJ: Brazil VERB: approves OBJ: bankruptcy reform


Putting this all together, we can write a visitor to extract (SUBJ, VERB, OBJ) triples.

In [44]:
class SVOExtractor(metapy.parser.Visitor):
    noun_tags = set(['NN', 'NNS', 'NNP', 'NNPS'])
    verb_tags = set(['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])    
    
    def __init__(self):
        super(SVOExtractor, self).__init__()
        self.subject = self.verb = self.object = None
        
    def extract_noun_tagged_words(self, node):
        lnf = metapy.parser.LeafNodeFinder()
        node.accept(lnf)
        # use the class attribute rather than relying on a global from an earlier cell
        return ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in self.noun_tags])
        
    def visit_leaf(self, node):
        pass # don't care about leaf nodes
    
    def visit_internal(self, node):
        # find and handle the first NP
        first_np = PhraseFinder('NP')        
        node.accept(first_np)
        if first_np.node:
            self.subject = self.extract_noun_tagged_words(first_np.node)
        
        # find and handle the first VP
        first_vp = PhraseFinder('VP')
        node.accept(first_vp)
        
        if first_vp.node:
            # find the first NP within the first VP
            vp_first_np = PhraseFinder('NP')
            first_vp.node.accept(vp_first_np)
            
            if vp_first_np.node:
                self.object = self.extract_noun_tagged_words(vp_first_np.node)
            
            lnf = metapy.parser.LeafNodeFinder()
            first_vp.node.accept(lnf)
            for leaf in lnf.leaves():
                if leaf.category() in self.verb_tags:
                    self.verb = leaf.word()
                    break
        

In [45]:
for line in business:
    tok.set_content(line.strip())
    seq = extract_sequences(tok)[0]
    
    tagger.tag(seq)
    tree = parser.parse(seq)
    
    extractor = SVOExtractor()
    tree.visit(extractor)
    print(line.strip())
    print("SUBJ: {} VERB: {} OBJ: {}".format(extractor.subject, extractor.verb, extractor.object))

Brazil approves bankruptcy reform
SUBJ: Brazil VERB: approves OBJ: bankruptcy reform
German business confidence slides
SUBJ: business confidence slides VERB: None OBJ: None
Dollar slides ahead of New Year
SUBJ: Dollar slides New Year VERB: None OBJ: None
Aviation firms eye booming India
SUBJ: Aviation firms VERB: eye OBJ: India
Metlife buys up Citigroup insurer
SUBJ: Metlife VERB: buys OBJ: Citigroup insurer
US economy still growing says Fed
SUBJ:  VERB: says OBJ: None
Russia WTO talks 'make progress'
SUBJ: Russia WTO talks make progress VERB: None OBJ: None
Deadline nears for Fiat-GM deal
SUBJ: Deadline VERB: nears OBJ: Fiat GM deal
Five million Germans out of work
SUBJ: Germans work VERB: None OBJ: None
Jobs go at Oracle after takeover
SUBJ: Jobs VERB: go OBJ: Oracle
Asian banks halt dollar's slide
SUBJ: banks VERB: halt OBJ: dollar slide
Markets signal Brazilian recovery
SUBJ: Markets VERB: signal OBJ: recovery
GE sees 'excellent' world economy
SUBJ: GE VERB: sees OBJ: world economy
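To see why some headlines come back with `VERB: None`, it helps to trace the heuristic by hand. The sketch below re-implements the same three steps on nested tuples so it runs without MeTA. The first tree is copied from the parse of the Brazil headline shown earlier; the second is a hypothetical parse (an assumption, not parser output) in which the whole headline is analyzed as a single noun phrase.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def is_leaf(node):
    # Leaves are (tag, word); internal nodes are (label, child, child, ...).
    return isinstance(node[1], str)


def leaves(node):
    if is_leaf(node):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]


def first_phrase(node, label):
    """Pre-order search for the first internal node with this label."""
    if is_leaf(node):
        return None
    if node[0] == label:
        return node
    for child in node[1:]:
        found = first_phrase(child, label)
        if found is not None:
            return found
    return None


def nouns(node):
    return " ".join(word for tag, word in leaves(node) if tag in NOUN_TAGS)


def extract_svo(root):
    subject = verb = obj = None
    subject_np = first_phrase(root, "NP")
    if subject_np is not None:
        subject = nouns(subject_np)
    vp = first_phrase(root, "VP")
    if vp is not None:
        object_np = first_phrase(vp, "NP")
        if object_np is not None:
            obj = nouns(object_np)
        verb = next((word for tag, word in leaves(vp) if tag in VERB_TAGS), None)
    return subject, verb, obj


brazil = ("ROOT", ("S", ("NP", ("NNP", "Brazil")),
                   ("VP", ("VBZ", "approves"),
                    ("NP", ("NN", "bankruptcy"), ("NN", "reform")))))
# Hypothetical parse: "slides" tagged as a plural noun, so there is no VP at all.
all_np = ("ROOT", ("NP", ("JJ", "German"), ("NN", "business"),
                   ("NN", "confidence"), ("NNS", "slides")))

print(extract_svo(brazil))
print(extract_svo(all_np))
```

When the tagger reads a word like "slides" as a noun, no VP is ever built, so the heuristic has nothing to extract a verb or object from. Errors in tagging propagate directly into the triples.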