# 7. Extracting Information from Text

For any given question, it's likely that someone has written the answer down somewhere. The amount of natural language text that is available in electronic form is truly staggering, and is increasing every day. However, the complexity of natural language can make it very difficult to access the information in that text. The state of the art in NLP is still a long way from being able to build general-purpose representations of meaning from unrestricted text. If we instead focus our efforts on a limited set of questions or "entity relations," such as "where are different facilities located," or "who is employed by what company," we can make significant progress. The goal of this chapter is to answer the following questions:

    How can we build a system that extracts structured data, such as tables, from unstructured text?
    
    What are some robust methods for identifying the entities and relationships described in a text?
    
    Which corpora are appropriate for this work, and how do we use them for training and evaluating our models?

Along the way, we'll apply techniques from the last two chapters to the problems of chunking and named-entity recognition.

# 1   Information Extraction

In [1]:
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']
print(query)

['BBDO South', 'Georgia-Pacific']


Things are more tricky if we try to get similar information out of text. For example, consider the following snippet (from nltk.corpus.ieer, for fileid NYT19980315.0085).

(1)		The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.

If you read through (1), you will glean the information required to answer the example question. But how do we get a machine to understand enough about (1) to return the answers in 1.2? This is obviously a much harder task. Unlike 1.1, (1) contains no structure that links organization names with location names.

One approach to this problem involves building a very general representation of meaning (Chapter 10). In this chapter we take a different approach, deciding in advance that we will only look for very specific kinds of information in text, such as the relation between organizations and locations. Rather than trying to use text like (1) to answer the question directly, we first convert the unstructured data of natural language sentences into the structured data of 1.1. Then we reap the benefits of powerful query tools such as SQL. This method of getting meaning from text is called Information Extraction.
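
As a rough illustration of that payoff, the tuples can be loaded into a relational table and queried with SQL. This is only a sketch using Python's built-in sqlite3 module; the table name `locations` and its column names are made up for this example.

```python
import sqlite3

# (entity, relation, entity) tuples, as in the locs example earlier in this section
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
# Hypothetical schema: one row per extracted tuple
conn.execute('CREATE TABLE locations (org TEXT, rel TEXT, loc TEXT)')
conn.executemany('INSERT INTO locations VALUES (?, ?, ?)', locs)

# Which organizations are located in Atlanta?
rows = conn.execute(
    'SELECT org FROM locations WHERE loc = ? ORDER BY org', ('Atlanta',))
atlanta_orgs = [org for (org,) in rows]
print(atlanta_orgs)
conn.close()
```

Once the data is in this form, the question "which organizations operate in Atlanta?" is a one-line query. The hard part is producing the tuples from text.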

Information Extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research involves the attempt to extract structured data out of electronically-available scientific literature, especially in the domain of biology and medicine.

## 1.1   Information Extraction Architecture

Figure 1.1 shows the architecture for a simple information extraction system. It begins by processing a document using several of the procedures discussed in Chapters 3 and 5: first, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity detection. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation detection to search for likely relations between different entities in the text.

Figure 1.1: Simple Pipeline Architecture for an Information Extraction System. This system takes the raw text of a document as its input, and generates a list of (entity, relation, entity) tuples as its output. For example, given a document that indicates that the company Georgia-Pacific is located in Atlanta, it might generate the tuple ([ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']).

To perform the first three tasks, we can define a simple function that simply connects together NLTK's default sentence segmenter [1], word tokenizer [2], and part-of-speech tagger [3]:

In [3]:
# Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the 
# following import statements:

# __future__ imports must come before any other statement (they only matter for Python 2 users)
from __future__ import division, print_function
import nltk, re, pprint
from nltk import word_tokenize

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # split the text into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # tag each word with its part of speech
    return sentences

# This might be all I need for now. I am not sure about including the NER tags?

In [4]:
# Me playing with this code

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

sample_string = ''' 1.1 shows the architecture for a simple information extraction system. It begins by 
processing a document using several of the procedures discussed in 3 and 5.: first, the raw text of the document is 
split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a 
tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next 
step, named entity detection. In this step, we search for mentions of potentially interesting entities in 
each sentence. Finally, we use relation detection to search for likely relations between different entities 
in the text. Figure 1.1: Simple Pipeline Architecture for an Information Extraction System. This system takes
the raw text of a document as its input, and generates a list of (entity, relation, entity) tuples as its 
output. For example, given a document that indicates that the company Georgia-Pacific is located in Atlanta, 
it might generate the tuple ([ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']). To perform the first three 
tasks, we can define a simple function that simply connects together NLTK's default sentence segmenter [1], 
word tokenizer [2], and part-of-speech tagger [3]:
'''

sentences = ie_preprocess(sample_string)


print (sentences[8])

#len(sentences)
#sentences[8]

[('To', 'TO'), ('perform', 'VB'), ('the', 'DT'), ('first', 'JJ'), ('three', 'CD'), ('tasks', 'NNS'), (',', ','), ('we', 'PRP'), ('can', 'MD'), ('define', 'VB'), ('a', 'DT'), ('simple', 'JJ'), ('function', 'NN'), ('that', 'WDT'), ('simply', 'RB'), ('connects', 'VBZ'), ('together', 'RB'), ('NLTK', 'NNP'), ("'s", 'POS'), ('default', 'NN'), ('sentence', 'NN'), ('segmenter', 'NN'), ('[', 'VBD'), ('1', 'CD'), (']', 'NN'), (',', ','), ('word', 'NN'), ('tokenizer', 'NN'), ('[', '$'), ('2', 'CD'), (']', 'NNP'), (',', ','), ('and', 'CC'), ('part-of-speech', 'JJ'), ('tagger', 'NN'), ('[', 'VBD'), ('3', 'CD'), (']', 'NNS'), (':', ':')]


Next, in named entity detection, we segment and label the entities that might participate in interesting relations with one another. Typically, these will be definite noun phrases such as the knights who say "ni", or proper names such as Monty Python. In some tasks it is useful to also consider indefinite nouns or noun chunks, such as every student or cats, and these do not necessarily refer to entities in the same way as definite NPs and proper names.

Finally, in relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.
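
To make the idea concrete, here is a toy sketch of pattern-based relation extraction. It assumes that named entity detection has already produced a sequence of labelled segments. The `segments` list, the `ORG`/`LOC`/`TEXT` labels and the `extract_in_relations` helper are all invented for this illustration and are not part of NLTK.

```python
import re

# Hand-labelled output of an imaginary entity detector: entity mentions
# carry a type, and the text between them is kept as plain 'TEXT'.
segments = [('ORG', 'BBDO South'), ('TEXT', 'in'), ('LOC', 'Atlanta'),
            ('TEXT', ', which handles corporate advertising for'),
            ('ORG', 'Georgia-Pacific'),
            ('TEXT', ', will assume additional duties')]

# The filler between the two entities must be just the word "in"
IN_PATTERN = re.compile(r'^\s*in\s*$', re.IGNORECASE)

def extract_in_relations(segments):
    """Return (org, 'IN', loc) for each ORG and LOC separated only by 'in'."""
    relations = []
    # Slide a window of three adjacent segments over the sequence
    for (t1, e1), (t2, filler), (t3, e2) in zip(segments, segments[1:], segments[2:]):
        if (t1, t2, t3) == ('ORG', 'TEXT', 'LOC') and IN_PATTERN.match(filler):
            relations.append((e1, 'IN', e2))
    return relations

relations = extract_in_relations(segments)
print(relations)
```

Note that this pattern finds BBDO South in Atlanta but misses the looser link between Georgia-Pacific and Atlanta, which is typical of the precision-over-coverage trade-off in hand-written patterns.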

# 2   Chunking

## 2.1   Noun Phrase Chunking

We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:

(2)		[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, the market for system-management software for Digital's hardware is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk the market. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. We demonstrate this approach using an example sentence that has been part-of-speech tagged in 2.2. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule [2]. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser [3], and test it on our example sentence [4]. The result is a tree, which we can either print [5], or display graphically [6].

Note to self: when starting out, just use default noun chunkers and taggers.

In [5]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), 
("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}" # [2]
# ?: 0 or 1
# *: 0 or any number
# <NN>: just one noun. There are no quantifiers here

cp = nltk.RegexpParser(grammar) # [3]
result = cp.parse(sentence) # [4]
print(result) # [5]
result.draw() # [6] opens a separate window, so it needs a graphical display

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


## 2.2   Tag Patterns

In [6]:
# More practice with NP-chunking...

phrase1 = [("another", "DT"), ("sharp", "JJ") , ("dive", "NN")]
phrase2 = [("trade", "NN"), ("figures", "NNS")]
phrase3 = [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")]
phrase4 = [("earlier", "JJR"), ("stages", "NNS")]
phrase5 = [("Panamanian", "JJ"), ("dictator", "NN"), ("Manuel", "NNP"), ("Noriega", "NNP")]
phrase6 = [("his", "PRP$"), ("Mansion", "NNP"), ("House", "NNP"), ("speech", "NN")]
phrase7 = [("the", "DT"), ("price", "NN"), ("cutting", "VBG")]
phrase8 = [("3", "CD"), ("%", "NN"), ("to", "TO"), ("4", "CD"), ("%", "NN")]
phrase9 = [("more", "JJR"), ("than", "IN"), ("10", "CD"), ("%", "NN")]
phrase10 = [("the", "DT"), ("fastest", "JJS"), ("developing", "VBG"), ("trends", "NNS")]
phrase11 = [("'s", "POS"), ("skill", "NN")]

grammar1 = "NP: {<DT>?<JJ>*<NN>}"
grammar2 = "NP: {<DT>?<JJ>*<NN.*>+}"
grammar3 = "NP: {<DT>?<JJ.*>*<NN.*>+}"
grammar4 = "NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+}"
grammar5 = "NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+<VBG>?}"
grammar6 = """
NP: {<CD>?<NN.*>+<TO>*<CD>?<NN.*>+}
NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+<VBG>?}
"""
grammar7 = """
NP: {<JJ.*>?<IN>?<CD>?<NN.*>+<TO>?<CD>?<NN.*>*}
NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+<VBG>?}
"""
grammar8 = """
NP: {<DT|PRP.*>?<JJ.*>*<VBG>?<NN.*>+}
NP: {<JJ.*>?<IN>?<CD>?<NN.*>+<TO>?<CD>?<NN.*>*}
NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+<VBG>?}
"""
grammar9 = """
NP: {<POS>?<NN.*>+}
NP: {<DT|PRP.*>?<JJ.*>*<VBG>?<NN.*>+}
NP: {<JJ.*>?<IN>?<CD>?<NN.*>+<TO>?<CD>?<NN.*>*}
NP: {<DT|PRP.*>?<JJ.*>*<NN.*>+<VBG>?}
"""

# Note: rules are applied in order, and tokens that an earlier rule has already chunked cannot be
# claimed by a later rule, so the most specific (irregular) patterns should come first.
# In grammar9 the first rule is very general (<POS>?<NN.*>+), so it grabs the nouns before the
# later rules can attach determiners and adjectives to them, as the output below shows.
# Also, you do not have to define one regular expression for every possible noun phrase: a grammar can hold several NP rules.

cp = nltk.RegexpParser(grammar9)
phrases = [phrase1, phrase2, phrase3, phrase4, phrase5, phrase6,
           phrase7, phrase8, phrase9, phrase10, phrase11]
for phrase in phrases:
    result = cp.parse(phrase)
    print(result)

# result.draw()

(S another/DT sharp/JJ (NP dive/NN))
(S (NP trade/NN figures/NNS))
(S any/DT new/JJ (NP policy/NN measures/NNS))
(S earlier/JJR (NP stages/NNS))
(S Panamanian/JJ (NP dictator/NN Manuel/NNP Noriega/NNP))
(S his/PRP$ (NP Mansion/NNP House/NNP speech/NN))
(S the/DT (NP price/NN) cutting/VBG)
(S 3/CD (NP %/NN) to/TO 4/CD (NP %/NN))
(S more/JJR than/IN 10/CD (NP %/NN))
(S the/DT fastest/JJS developing/VBG (NP trends/NNS))
(S (NP 's/POS skill/NN))


## 2.3   Chunking with Regular Expressions

To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.
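
Because rules are applied in turn and a token that is already inside a chunk is not re-chunked, the order of the rules matters. The following sketch (the two grammars and the three-word input are made up for illustration) applies the same two rules in opposite orders:

```python
import nltk

tokens = [("the", "DT"), ("little", "JJ"), ("dog", "NN")]

# Same two rules, opposite order.
narrow_first = r"""
  NP: {<NN>}              # bare noun
      {<DT>?<JJ>*<NN>}    # determiner, adjectives, noun
"""
broad_first = r"""
  NP: {<DT>?<JJ>*<NN>}    # determiner, adjectives, noun
      {<NN>}              # bare noun
"""

tree_narrow = nltk.RegexpParser(narrow_first).parse(tokens)
tree_broad = nltk.RegexpParser(broad_first).parse(tokens)
print(tree_narrow)  # only the noun ends up inside the NP
print(tree_broad)   # the whole phrase ends up inside the NP
```

With the narrow rule first, the noun is claimed before the broader rule gets a chance to attach the determiner and adjective to it.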

2.3 shows a simple chunk grammar consisting of two rules. The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked [1], and run the chunker on this input [2].

In [7]:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [8]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)


## Exploring Text Corpora

In 2 we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

In [9]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': print(subtree)

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK expected/VBN to/TO approve/VB)
(CHUNK expected/VBN to/TO make/VB)
(CHUNK intends/VBZ to/TO make/VB)
(CHUNK seek/VB to/TO set/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK designed/VBN to/TO provide/VB)
(CHUNK get/VB to/TO hear/VB)
(CHUNK expects/VBZ to/TO tell/VB)
(CHUNK expected/VBN to/TO give/VB)
(CHUNK prefer/VB to/TO pay/VB)
(CHUNK required/VBN to/TO obtain/VB)
(CHUNK permitted/VBN to/TO teach/VB)
(CHUNK designed/VBN to/TO reduce/VB)
(CHUNK Asked/VBN to/TO elaborate/VB)
(CHUNK got/VBN to/TO go/VB)
(CHUNK raised/VBN to/TO pay/VB)
(CHUNK scheduled/VBN to/TO go/VB)
(CHUNK cut/VBN to/TO meet/VB)
(CHUNK needed/VBN to/TO meet/VB)
(CHUNK hastened/VBD to/TO add/VB)
(CHUNK found/VBN to/TO prevent/VB)
(CHUNK continue/VB to/TO insist/VB)
(CHUNK compelled/VBN to/TO make/VB)
(CHUNK mad

In [10]:
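def bob_chunker(chunk_string):
    """Print every chunk labelled CHUNK that the grammar finds in the Brown corpus."""
    cp = nltk.RegexpParser(chunk_string)
    brown = nltk.corpus.brown
    for sent in brown.tagged_sents():
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':  # the grammar must name its chunks CHUNK
                print(subtree)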

chunk_phrase = 'CHUNK: {<V.*> <TO> <V.*>}'
chunk_phrase2 =  'CHUNK: {<N.*><N.*><N.*><N.*>+}' # alternative: 3 nouns + 1 or more nouns
bob_chunker(chunk_phrase2)

(CHUNK Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)
(CHUNK Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)
(CHUNK Georgia's/NP$ automobile/NN title/NN law/NN)
(CHUNK State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)
(CHUNK Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)
(CHUNK Mayor/NN-TL William/NP B./NP Hartsfield/NP)
(CHUNK Mrs./NP J./NP M./NP Cheshire/NP)
(CHUNK E./NP Pelham/NP Rd./NN-TL Aj/NN)
(CHUNK
  State/NN-TL
  Party/NN-TL
  Chairman/NN-TL
  James/NP
  W./NP
  Dorsey/NP)
(CHUNK Texas/NP Sen./NN-TL John/NP Tower/NP)
(CHUNK Lt./NN-TL Gov./NN-TL Garland/NP Byrd's/NP$ campaign/NN)
(CHUNK Schley/NP County/NN-TL Rep./NN-TL B./NP D./NP Pelham/NP)
(CHUNK Colquitt/NP-TL Policeman/NN-TL Tom/NP Williams/NP)
(CHUNK Rep./NN-TL Charles/NP E./NP Hughes/NP)
(CHUNK State/NN-TL Health/NN-TL Department's/NN$-TL authority/NN)
(CHUNK Lamar/NP-TL county/NN-TL Hospital/NN-TL District/NN-TL)
(CHUNK Sen./NN-TL A./NP R./NP Schwartz/NP)
(CHUNK Sen./NN-TL A./NP M./NP Aikin/NP Jr./NP)
(CH

## Chinking

Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains. These three possibilities are illustrated in 7.3.
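
The three possibilities can be tried out directly. This is a small sketch, assuming a three-token input and a helper named `chunk_then_chink` that is invented here: it first puts everything into one NP chunk and then applies a single chink rule.

```python
import nltk

tokens = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]

def chunk_then_chink(chink_pattern):
    """Put all tokens in one NP chunk, then remove whatever chink_pattern matches."""
    grammar = "NP:\n  {<.*>+}\n  }" + chink_pattern + "{"
    return nltk.RegexpParser(grammar).parse(tokens)

whole = chunk_then_chink("<DT><JJ>*<NN>")   # chink spans the entire chunk
middle = chunk_then_chink("<JJ>")           # chink in the middle of the chunk
edge = chunk_then_chink("<NN>")             # chink at the end of the chunk

for tree in (whole, middle, edge):
    print(tree)
```

The first tree should contain no NP at all, the second two one-word NPs on either side of little/JJ, and the third a shorter NP followed by the bare noun.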

In 7.5, we put the entire sentence into a single chunk, then excise the chinks.

In [None]:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print (cp.parse(sentence))

### Representing Chunks: Tags vs Trees

As befits their intermediate status between tagging and parsing (Chapter 8), chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown in 7.6.
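
The two representations can be converted into each other with `nltk.chunk.tree2conlltags` and `nltk.chunk.conlltags2tree`. The sketch below reuses the example sentence and the single-rule NP grammar from Section 2.1 to show the IOB tags that correspond to a chunk tree:

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(sentence)

# Tree -> list of (word, pos, IOB chunk tag) triples
iob = nltk.chunk.tree2conlltags(tree)
for word, pos, chunk_tag in iob:
    print(word, pos, chunk_tag)

# ... and back again: the round trip should give an equal tree
round_trip = nltk.chunk.conlltags2tree(iob)
```

The first token of each NP is tagged B-NP, the remaining tokens of the chunk I-NP, and barked/VBD and at/IN are tagged O.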

# My Own Code!

In [None]:
# Building Bob's special noun chunker...

# Here, we have a simple text string
sample_string = """
For any given question, it's likely that someone has written the answer down somewhere. The amount of natural language text that is available in electronic form is truly staggering, and is increasing every day. However, the complexity of natural language can make it very difficult to access the information in that text. The state of the art in NLP is still a long way from being able to build general-purpose representations of meaning from unrestricted text. If we instead focus our efforts on a limited set of questions or "entity relations," such as "where are different facilities located," or "who is employed by what company," we can make significant progress.
"""

# Now let's convert the text string into a format NLTK can understand
sentences = ie_preprocess(sample_string)

sentences[0]

In [None]:
# Inspired by http://web.media.mit.edu/~havasi/MAS.S60/chunk.pdf

grammar = r"""
NP:
{<DT>?<JJ>*<N.*>+} # chunk determiners, adjectives and nouns
{<NNP>+}           # chunk sequences of proper nouns
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences[0])
print(result)

In [None]:
# Now... how can we parse result?

# Print the first 16 children of the tree (plain tokens or NP subtrees)
for node in result[:16]:
    print (node)

In [None]:
type(result)
# result is an NLTK tree

In [None]:
# this is a way to access the leaves...

ROOT = 'ROOT'
tree = result
def getNodes(parent):
    for node in parent:
        if isinstance(node, nltk.Tree):
            if node.label() == ROOT:
                print ("======== Sentence =========")
                # leaves are (word, tag) pairs, so join only the words
                print ("Sentence:", " ".join(word for (word, tag) in node.leaves()))
            else:
                print ("Label:", node.label())
                print ("Leaves:", node.leaves())
            getNodes(node)
        else:
            print ("Word:", node)

getNodes(tree)
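
If only the noun phrases are needed, a shorter route than a recursive walk is the `filter` argument of `Tree.subtrees()`. The sketch below uses a hand-tagged fragment rather than the tagger output above, and the `np_phrases` helper is a made-up name:

```python
import nltk

# Hand-tagged fragment, so the example does not depend on a trained tagger
tagged = [("the", "DT"), ("state", "NN"), ("of", "IN"), ("the", "DT"),
          ("art", "NN"), ("is", "VBZ"), ("still", "RB"), ("a", "DT"),
          ("long", "JJ"), ("way", "NN")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<N.*>+}").parse(tagged)

def np_phrases(tree):
    """Return the text of every NP chunk in a chunk tree, in sentence order."""
    return [" ".join(word for (word, tag) in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

phrases = np_phrases(tree)
print(phrases)
```

Each phrase comes back as a plain string, which is a convenient input for a later relation extraction step.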

# 7.3 Developing and Evaluating Chunkers

Now you have a taste of what chunking does, but we haven't explained how to evaluate chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using a chunked corpus. We will see how to score the accuracy of a chunker relative to a corpus, then look at some more data-driven ways to search for NP chunks. Our focus throughout will be on expanding the coverage of a chunker.

## Reading IOB Format and the CoNLL 2000 Corpus

Using the corpora module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP and PP. As we have seen, each sentence is represented using multiple lines, as shown below:

In [None]:
# One token per line: word, part-of-speech tag, IOB chunk tag
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format. We can access the data using nltk.corpus.conll2000. Here is an example that reads the 100th sentence of the "train" portion of the corpus:

In [2]:
from nltk.corpus import conll2000
print (conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


As you can see, the CoNLL 2000 corpus contains three chunk types: NP chunks, which we have already seen; VP chunks such as has already delivered; and PP chunks such as because of. Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them:

In [None]:
print (conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

## Simple Evaluation and Baselines

Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:

In [None]:
from nltk.corpus import conll2000 # import chunked corpus
cp = nltk.RegexpParser("") # parser chunks nothing
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) # test dummy parser against true NP chunks
print (cp.evaluate(test_sents))
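
The precision, recall and f-measure in such a report are computed over whole chunks rather than individual tags. As a rough sketch of the definitions (not NLTK's own scoring code, which may differ in detail), chunks can be treated as sets of token spans. The `chunk_scores` helper and the spans below are made up for illustration.

```python
def chunk_scores(gold, guessed):
    """Precision, recall and F-measure over sets of (start, end) chunk spans."""
    correct = len(gold & guessed)                     # chunks found with exact boundaries
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Token spans of the NP chunks in "the little yellow dog barked at the cat"
gold = {(0, 4), (6, 8)}
no_chunks = set()                       # a trivial chunker that proposes nothing
over_eager = {(0, 4), (4, 5), (6, 8)}   # also (wrongly) chunks "barked"

baseline_scores = chunk_scores(gold, no_chunks)
over_eager_scores = chunk_scores(gold, over_eager)
print(baseline_scores)
print(over_eager_scores)
```

A chunker that proposes no chunks scores zero on all three measures, whatever its tag-level accuracy, while a chunker that proposes too many chunks keeps its recall but loses precision.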

The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e. not in an NP chunk. However, since our tagger did not find any chunks, its precision, recall, and f-measure are all zero. Now let's try a naive regular expression chunker that looks for tags beginning with letters that are characteristic of noun phrase tags (e.g. CD, DT, and JJ).

In [None]:
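# <[CDJNP].*> matches any tag whose first letter is C, D, J, N or P (such as CD, DT, JJ, NN, NNP, PRP);
# the trailing + chunks one or more such tags in a row
grammar = r"NP: {<[CDJNP].*>+}"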
cp = nltk.RegexpParser(grammar)
print (cp.evaluate(test_sents))

As you can see, this approach achieves decent results. However, we can improve on it by adopting a more data-driven approach, where we use the training corpus to find the chunk tag (I, O, or B) that is most likely for each part-of-speech tag. In other words, we can build a chunker using a unigram tagger (5.4). But rather than trying to determine the correct part-of-speech tag for each word, we are trying to determine the correct chunk tag, given each word's part-of-speech tag.
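
The core of that idea fits in a few lines of plain Python. The sketch below is not the NLTK implementation: it simply counts, on a tiny hand-made training set (not real corpus data), which chunk tag occurs most often with each part-of-speech tag.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (pos, chunk tag) sequences; not real corpus data.
train_data = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBZ", "O"), ("JJ", "O")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O")],
]

# Count how often each chunk tag co-occurs with each POS tag
counts = defaultdict(Counter)
for sent in train_data:
    for pos, chunk_tag in sent:
        counts[pos][chunk_tag] += 1

# For each POS tag, keep the chunk tag seen most often with it.
most_likely = {pos: tags.most_common(1)[0][0] for pos, tags in counts.items()}
print(most_likely)
```

Even on this toy data the table picks up the expected tendencies: determiners start chunks, nouns mostly continue them, and verbs fall outside.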

In 7.8, we define the UnigramChunker class, which uses a unigram tagger to label sentences with chunk tags. Most of the code in this class is simply used to convert back and forth between the chunk tree representation used by NLTK's ChunkParserI interface, and the IOB representation used by the embedded tagger. The class defines two methods: a constructor [1] which is called when we build a new UnigramChunker; and the parse method [3] which is used to chunk new sentences.

In [None]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): # [1]
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data) # [2]

    def parse(self, sentence): # [3]
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

The constructor [1] expects a list of training sentences, which will be in the form of chunk trees. It first converts the training data to a form that is suitable for training the tagger, using tree2conlltags to map each chunk tree to a list of (word, tag, chunk) triples. It then uses that converted training data to train a unigram tagger, and stores it in self.tagger for later use.

The parse method [3] takes a tagged sentence as its input, and begins by extracting the part-of-speech tags from that sentence. It then tags the part-of-speech tags with IOB chunk tags, using the tagger self.tagger that was trained in the constructor. Next, it extracts the chunk tags, and combines them with the original sentence, to yield conlltags. Finally, it uses conlltags2tree to convert the result back into a chunk tree.

Now that we have UnigramChunker, we can train it using the CoNLL 2000 corpus, and test its resulting performance:

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print (unigram_chunker.evaluate(test_sents))

This chunker does reasonably well, achieving an overall f-measure score of 83%. Let's take a look at what it's learned, by using its unigram tagger to assign a tag to each of the part-of-speech tags that appear in the corpus:

In [None]:
postags = sorted(set(pos for sent in train_sents
                     for (word,pos) in sent.leaves()))
print (unigram_chunker.tagger.tag(postags))

It has discovered that most punctuation marks occur outside of NP chunks, with the exception of # and $, both of which are used as currency markers. It has also found that determiners (DT) and possessives (PRP$ and WP$) occur at the beginnings of NP chunks, while noun types (NN, NNP, NNPS, NNS) mostly occur inside of NP chunks.

Having built a unigram chunker, it is quite easy to build a bigram chunker: we simply change the class name to BigramChunker, and modify line [2] in 7.8 to construct a BigramTagger rather than a UnigramTagger. The resulting chunker has slightly higher performance than the unigram chunker:

In [None]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): #[1]
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data) #[2]

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

bigram_chunker = BigramChunker(train_sents)
print (bigram_chunker.evaluate(test_sents))

### Training Classifier-Based Chunkers

Both the regular-expression based chunkers and the n-gram chunkers decide what chunks to create entirely based on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

(3)		

a.		Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

b.		Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, while the corresponding material in the second sentence, the computer monitor, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to just their part-of-speech tags, if we wish to maximize chunking performance.

One way that we can incorporate information about the content of words is to use a classifier-based tagger to chunk the sentence. Like the n-gram chunker considered in the previous section, this classifier-based chunker will work by assigning IOB tags to the words in a sentence, and then converting those tags to chunks. For the classifier-based tagger itself, we will use the same approach that we used in 6.1 to build a part-of-speech tagger.

The basic code for the classifier-based NP chunker is shown in 7.9. It consists of two classes. The first class [1] is almost identical to the ConsecutivePosTagger class from 6.5. The only two differences are that it calls a different feature extractor [2] and that it uses a MaxentClassifier rather than a NaiveBayesClassifier [3]. The second class [4] is basically a wrapper around the tagger class that turns it into a chunker. During training, this second class maps the chunk trees in the training corpus into tag sequences; in the parse() method, it converts the tag sequence provided by the tagger back into a chunk tree.

In [None]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
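        # 'megam' requires the external MEGAM binary; NLTK's built-in 'IIS' and
        # 'GIS' algorithms avoid that dependency but train more slowly.
        self.classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)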

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # In Python 3, zip() returns a one-shot iterator, so build a list
        return list(zip(sentence, history))

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
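
The conversion between chunk trees and IOB tag sequences that ConsecutiveNPChunker relies on can be tried in isolation. The following is a minimal sketch, assuming NLTK 3 is installed. The chunk tree is built by hand for sentence (a) above and is an assumption of this illustration, not corpus data.

```python
import nltk

# Hand-built chunk tree for "Joey sold the farmer rice ."
tree = nltk.Tree('S', [
    nltk.Tree('NP', [('Joey', 'NN')]),
    ('sold', 'VBD'),
    nltk.Tree('NP', [('the', 'DT'), ('farmer', 'NN')]),
    nltk.Tree('NP', [('rice', 'NN')]),
    ('.', '.'),
])

# Tree -> (word, pos, chunk-tag) triples, as done in __init__()
conlltags = nltk.chunk.tree2conlltags(tree)
print(conlltags)

# Regroup as ((word, pos), chunk-tag) pairs, the shape the tagger trains on
training_pairs = [((w, t), c) for (w, t, c) in conlltags]

# Triples -> tree, as done in parse()
rebuilt = nltk.chunk.conlltags2tree(conlltags)
print(rebuilt == tree)  # True
```

Each chunk starts with a B-NP tag and continues with I-NP tags, so the farmer and rice stay separate even though they are adjacent.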

The only piece left to fill in is the feature extractor. We begin by defining a simple feature extractor which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker is very similar to the unigram chunker, as is reflected in its performance:

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}
chunker = ConsecutiveNPChunker(train_sents)
print (chunker.evaluate(test_sents))

I skipped the rest of Section 7.3 because I am not sure that debugging the megam setup to get this working is worth the effort. To get started on my prototype, I will just use off-the-shelf chunkers and work on the relation extraction.

# 7.4   Recursion in Linguistic Structure

## Building Nested Structure with Cascaded Chunkers

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules. 7.10 has patterns for noun phrases, prepositional phrases, verb phrases, and sentences. This is a four-stage chunk grammar, and can be used to create structures having a depth of at most four.

In [None]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print (cp.parse(sentence))

Unfortunately this result misses the VP headed by saw. It has other shortcomings too. Let's see what happens when we apply this chunker to a sentence having deeper nesting. Notice that it fails to identify the VP chunk starting at saw.

In [None]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
     ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
     ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print (cp.parse(sentence))

The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process. We add an optional second argument loop to specify the number of times the set of patterns should be run:

In [None]:
cp = nltk.RegexpParser(grammar, loop=2)
print (cp.parse(sentence))



Note

This cascading process enables us to create deep structures. However, creating and debugging a cascade is difficult, and there comes a point where it is more effective to do full parsing (see Chapter 8). Also, the cascading process can only produce trees of fixed depth (no deeper than the number of stages in the cascade), and this is insufficient for complete syntactic analysis.


### Trees

A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that they are standardly drawn upside-down)

We use a 'family' metaphor to talk about the relationships of nodes in a tree: for example, S is the parent of VP; conversely VP is a child of S. Also, since NP and VP are both children of S, they are also siblings. For convenience, there is also a text format for specifying trees:
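
As a minimal sketch of the text format, assuming NLTK 3 (where bracketed strings are read with Tree.fromstring()), each node is written as a parenthesized label followed by its children. The sentence Alice chased the rabbit is used here for illustration.

```python
import nltk

# Bracketed text format: (LABEL child child ...)
t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')

root_label = t.label()                        # label of the root node
child_labels = [child.label() for child in t] # labels of the root's children
words = t.leaves()                            # the words, left to right

print(root_label)    # S
print(child_labels)  # ['NP', 'VP']
print(words)         # ['Alice', 'chased', 'the', 'rabbit']
```

Here S is the parent of NP and VP, which are therefore siblings.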

Although we will focus on syntactic trees, trees can be used to encode any homogeneous hierarchical structure that spans a sequence of linguistic forms (e.g. morphological structure, discourse structure). In the general case, leaves and node values do not have to be strings.

In NLTK, we create a tree by giving a node label and a list of children:

In [None]:
tree1 = nltk.Tree('NP', ['Alice'])
print (tree1)

In [None]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print (tree2)

We can incorporate these into successively larger trees as follows:

In [None]:
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print (tree4)

Here are some of the methods available for tree objects:

In [None]:
print (tree4[1])

In [None]:
print (tree4[0])

In [None]:
tree4[1].label() # use label() instead of .node

In [None]:
tree4.leaves()

In [None]:
tree4[1][1][1] # makes sense if you look at tree in NLTK book

The bracketed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful. It opens a new window, containing a graphical representation of the tree. The tree display window allows you to zoom in and out, to collapse and expand subtrees, and to print the graphical representation to a postscript file (for inclusion in a document).

In [None]:
tree3.draw()

### Tree Traversal

It is standard to use a recursive function to traverse a tree. The listing in 7.11 demonstrates this.

In [None]:
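def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t has no label(), so it is a leaf: just print it
        print(t, end=" ")
    else:
        # Now we know that t.label() is defined, so t is a tree
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

# note: I replaced t.node with t.label() for the function to work,
# and print(..., end=" ") keeps the Python 3 output on one line.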

In [None]:
# In NLTK 3 a bracketed string is parsed with Tree.fromstring(),
# not by passing it to the Tree constructor.
t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
traverse(t)


Note

We have used a technique called duck typing to detect that t is a tree (i.e. t.label() is defined).
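
To make the duck-typing idea concrete, here is a small sketch in plain Python. MiniTree is a hypothetical stand-in class invented for this illustration; it is not an NLTK class. The traversal never checks the type of its argument. It only tries to call label(), so it accepts any object that provides that method and can be iterated over.

```python
class MiniTree(list):
    """Hypothetical stand-in for a tree: a list of children plus a label."""
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

def flatten(t):
    """Return the bracketed form of t as a list of string tokens."""
    try:
        label = t.label()
    except AttributeError:
        return [str(t)]            # no label(): treat t as a leaf
    tokens = ['(', label]
    for child in t:
        tokens.extend(flatten(child))
    tokens.append(')')
    return tokens

tree = MiniTree('S', [
    MiniTree('NP', ['Alice']),
    MiniTree('VP', ['chased', MiniTree('NP', ['the', 'rabbit'])]),
])
result = ' '.join(flatten(tree))
print(result)  # ( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )
```

Returning the tokens instead of printing them also makes the traversal easier to reuse and test.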

# 7.5   Named Entity Recognition

At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. 7.4 lists some of the more commonly used types of NEs. These should be self-explanatory, except for "Facility": human-made artifacts in the domains of architecture and civil engineering; and "GPE": geo-political entities such as city, state/province, and country.

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type. While named entity recognition is frequently a prelude to identifying relations in Information Extraction, it can also contribute to other tasks. For example, in Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user's question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer. Now suppose the question was Who was the first President of the US?, and one of the documents that was retrieved contained the following passage:

(5)		The Washington Monument is the most prominent structure in Washington, D.C. and one of the city's early attractions. It was built in honor of George Washington, who led the country to independence and then became its first President.

Analysis of the question leads us to expect that an answer should be of the form X was the first President of the US, where X is not only a noun phrase, but also refers to a named entity of type PER. This should allow us to ignore the first sentence in the passage. While it contains two occurrences of Washington, named entity recognition should tell us that neither of them has the correct type.

How do we go about identifying named entities? One option would be to look up each word in an appropriate list of names. For example, in the case of locations, we could use a gazetteer, or geographical dictionary, such as the Alexandria Gazetteer or the Getty Gazetteer. However, doing this blindly runs into problems, as shown in 7.12.

Observe that the gazetteer has good coverage of locations in many countries, and incorrectly finds locations like Sanchez in the Dominican Republic and On in Vietnam. Of course we could omit such locations from the gazetteer, but then we won't be able to identify them when they do appear in a document.
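
The problem is easy to reproduce with a toy lookup. The sketch below uses a hypothetical three-entry gazetteer and a made-up sentence; neither comes from a real gazetteer or corpus.

```python
# Hypothetical mini-gazetteer mapping place names to countries.
GAZETTEER = {
    "Sanchez": "Dominican Republic",
    "On": "Vietnam",
    "Atlanta": "United States",
}

def naive_locations(tokens):
    """Flag every token found in the gazetteer, ignoring its context."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

# Made-up sentence: only "Atlanta" is actually used as a location here.
tokens = "On Monday , Mr. Sanchez flew to Atlanta".split()
hits = naive_locations(tokens)
print(hits)
# [('On', 'Vietnam'), ('Sanchez', 'Dominican Republic'), ('Atlanta', 'United States')]
```

Two of the three hits are wrong, because the lookup has no way of knowing that On is a preposition here and that Sanchez is a person.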

It gets even harder in the case of names for people or organizations. Any list of such names will probably have poor coverage. New organizations come into existence every day, so if we are trying to deal with contemporary newswire or blog entries, it is unlikely that we will be able to recognize many of the entities using gazetteer lookup.

Another major source of difficulty is that many named entity terms are ambiguous. Thus May and North are likely to be parts of named entities for DATE and LOCATION, respectively, but could both be part of a PERSON; conversely Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION. A term like Yankee will be an ordinary modifier in some contexts, but will be marked as an entity of type ORGANIZATION in the phrase Yankee infielders.

Further challenges are posed by multi-word names like Stanford University, and by names that contain other names such as Cecil H. Green Library and Escondido Village Conference Service Center. In named entity recognition, therefore, we need to be able to identify the beginning and end of multi-token sequences.

Named entity recognition is a task that is well-suited to the type of classifier-based approach that we saw for noun phrase chunking. In particular, we can build a tagger that labels each word in a sentence using the IOB format, where chunks are labeled by their appropriate type. Here is part of the CONLL 2002 (conll2002) Dutch training data:


In this representation, there is one token per line, each with its part-of-speech tag and its named entity tag. Based on this training corpus, we can construct a tagger that can be used to label new sentences; and use the nltk.chunk.conlltags2tree() function to convert the tag sequences into a chunk tree.
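
To show how typed IOB tags mark the beginning and end of multi-token names, here is a minimal sketch in plain Python. The function iob_to_spans is a hypothetical helper written for this illustration. It only mimics the grouping idea and is not how nltk.chunk.conlltags2tree() is implemented. The tagged tokens are hand-labeled examples, not corpus data.

```python
def iob_to_spans(tagged):
    """Group (word, pos, iob_tag) triples into (entity_type, text) spans."""
    spans = []
    current = None                      # the span being built, if any
    for word, pos, tag in tagged:
        if tag == "O":                  # outside any entity
            current = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == etype:
            current[1].append(word)     # continue the open span
        else:                           # "B-", or a stray "I-": open a new span
            current = (etype, [word])
            spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

# Hand-labeled example: one token per triple, with POS tag and NE tag.
tagged = [("Stanford", "NNP", "B-ORG"), ("University", "NNP", "I-ORG"),
          ("is", "VBZ", "O"), ("in", "IN", "O"),
          ("California", "NNP", "B-LOC")]
spans = iob_to_spans(tagged)
print(spans)  # [('ORG', 'Stanford University'), ('LOC', 'California')]
```

The B- prefix is what lets two adjacent entities of the same type be kept apart.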

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [None]:
sent = nltk.corpus.treebank.tagged_sents()[22]
print (nltk.ne_chunk(sent, binary=True))

# 7.6   Relation Extraction

Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for. The following example searches for strings that contain the word in. The special regular expression (?!\b.+ing) is a negative lookahead assertion that allows us to disregard strings such as success in supervising the transition of, where in is followed by a gerund.

In [None]:
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
        print (nltk.sem.rtuple(rel))
        
# Not sure I completely understand this regular expression. Come back...
# However, this is pretty damn impressive.
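
The regular expression can be explored on its own with the standard re module. The sketch below assumes the pattern is applied with match() to the text between the two entities; the filler strings are illustrative, with one taken from the discussion above.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# .*          any leading text
# \bin\b      the whole word "in" (not the "in" inside "within")
# (?!...)     succeed only if what follows does NOT match:
#   \b.+ing   any later text that contains "ing"

fillers = [
    "in",
    ", which is based in",
    "success in supervising the transition of",
    "in the building of",
    "within",
]
results = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, matched in results.items():
    print(repr(filler), matched)
```

The first two fillers match and the last three do not. Note that the lookahead rejects any later occurrence of ing, not only a gerund directly after in, which is why in the building of is also excluded.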

Searching for the keyword in works reasonably well, though it will also retrieve false positives such as [ORG: House Transportation Committee] , secured the most money in the [LOC: New York]; there is unlikely to be a simple string-based method of excluding filler strings such as this.

As shown above, the conll2002 Dutch corpus contains not just named entity annotation but also part-of-speech tags. This allows us to devise patterns that are sensitive to these tags, as shown in the next example. The function nltk.sem.clause() formats each relation in a clausal form, where the binary relation symbol is specified as the value of the parameter relsym.

In [None]:
from nltk.corpus import conll2002
vnv = """
    (
    is/V|    # 3rd sing. present of the verb zijn ('be')
    was/V|   # past of zijn
    werd/V|  # past of the verb worden ('become')
    wordt/V  # 3rd sing. present of worden
    )
    .*       # followed by anything
    van/Prep # followed by van ('of')
    """
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print  (nltk.sem.clause(r, relsym="VAN"))
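
The tag-sensitive pattern can also be tried without the corpus. This is a minimal sketch, assuming the text between the two entities is presented to the pattern as space-separated word/tag tokens. The filler strings are made up for illustration and do not come from conll2002.

```python
import re

vnv = r"""
    (
    is/V|     # forms of zijn ('be')
    was/V|
    werd/V|   # forms of worden ('become')
    wordt/V
    )
    .*        # followed by anything
    van/Prep  # followed by van ('of')
    """
# re.VERBOSE ignores the whitespace and comments inside the pattern
VAN = re.compile(vnv, re.VERBOSE)

fillers = [
    "was/V de/Art zanger/N van/Prep",   # starts with a listed verb form
    "is/V van/Prep",
    "zingt/V voor/Prep",                # neither a listed verb nor van
]
matches = [bool(VAN.match(f)) for f in fillers]
print(matches)  # [True, True, False]
```

Because the pattern is matched against word/tag tokens, it can require that van is tagged as a preposition rather than just matching the three letters.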



Note

Your Turn: Replace the last line of the example above with print(nltk.sem.rtuple(r, lcon=True, rcon=True)). This will show you the actual words that intervene between the two NEs and also their left and right context, within a default 10-word window. With the help of a Dutch dictionary, you might be able to figure out why the result VAN('annie_lennox', 'eurythmics') is a false hit.


In [None]:
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        # Use the loop variable r; rel is left over from the earlier ieer example
        print(nltk.sem.rtuple(r, lcon=True, rcon=True))

Dr. Bell - you completed Chapter 7...