The amount of natural language text that is available in electronic form is truly staggering, and is increasing every day. However, the complexity of natural language can make it very difficult to access the information in that text. The state of the art in NLP is still a long way from being able to build general-purpose representations of meaning from unrestricted text. 

If we instead focus our efforts on a limited set of questions or "entity relations," such as "where are different facilities located," or "who is employed by what company," we can make significant progress. The goal of this chapter is to answer the following questions:

- How can we build a system that extracts structured data, such as tables, from unstructured text?
- What are some robust methods for identifying the entities and relationships described in a text?
- Which corpora are appropriate for this work, and how do we use them for training and evaluating our models?

Along the way, we'll apply techniques from the last two chapters to the problems of chunking and named-entity recognition.

# 1   Information Extraction

Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships. For example, we might be interested in the relation between companies and locations. Given a particular company, we would like to be able to identify the locations where it does business; conversely, given a location, we would like to discover which companies do business in that location. If our data is in tabular form, such as the example in Table 1.1, then answering these queries is straightforward.
 
Table 1.1: Locations Data

$$
\begin{array}{|ll|}
\hline {\text { OrgName }} &  {\text { LocationName }} \\
\text { Omnicom } & \text { New York } \\
\text { DDB Needham } & \text { New York } \\
\text { Kaplan Thaler Group } & \text { New York } \\
\text { BBDO South } & \text { Atlanta } \\
\text { Georgia-Pacific } & \text { Atlanta } \\
\hline
\end{array}
$$

If this location data were stored in Python as a list of tuples (entity, relation, entity), then the question "Which organizations operate in Atlanta?" could be translated as follows:

In [1]:
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
# Companies that operate in Atlanta
query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']
print(query)

['BBDO South', 'Georgia-Pacific']
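
The text above notes that the same table should answer queries in both directions: from a location to its companies, and from a company to its locations. As a minimal sketch (the names `by_location` and `by_org` are our own, introduced only for this illustration), the tuples can be indexed once so that either lookup becomes a dictionary access:

```python
from collections import defaultdict

locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

# Hypothetical helper indexes: location -> organizations, organization -> locations
by_location = defaultdict(list)
by_org = defaultdict(list)
for org, rel, loc in locs:
    if rel == 'IN':
        by_location[loc].append(org)
        by_org[org].append(loc)

print(by_location['Atlanta'])  # organizations that operate in Atlanta
print(by_org['Omnicom'])       # locations where Omnicom operates
```

Building both indexes up front trades a little memory for constant-time lookups, which matters only once the table grows well beyond this toy size.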


Things are trickier if we try to get similar information out of text. For example, consider the following snippet (from nltk.corpus.ieer, for fileid NYT19980315.0085).

    (1)    The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.
    
If you read through (1), you will glean the information required to answer the example question. But how do we get a machine to understand enough about (1) to return the same answers as the query above? This is obviously a much harder task. Unlike Table 1.1, (1) contains no structure that links organization names with location names.

One approach to this problem involves building a very general representation of meaning. In this chapter we take a different approach, deciding in advance that we will only look for very specific kinds of information in text, such as the relation between organizations and locations. Rather than trying to use text like (1) to answer the question directly, we first convert the _unstructured data_ of natural language sentences into the structured data of Table 1.1. Then we reap the benefits of powerful query tools such as SQL. This method of getting meaning from text is called _Information Extraction_.
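
To make the remark about SQL concrete, here is a minimal sketch that loads the rows of the locations table into an in-memory SQLite database using Python's standard `sqlite3` module. The table name `locations` is our own choice for illustration; the column names follow the table header.

```python
import sqlite3

locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE locations (OrgName TEXT, LocationName TEXT)')
conn.executemany('INSERT INTO locations VALUES (?, ?)',
                 [(org, loc) for org, rel, loc in locs])

# "Which organizations operate in Atlanta?" expressed as SQL
rows = conn.execute(
    'SELECT OrgName FROM locations WHERE LocationName = ? ORDER BY rowid',
    ('Atlanta',)).fetchall()
atlanta_orgs = [name for (name,) in rows]
print(atlanta_orgs)
conn.close()
```

Once the extracted relations are in a table like this, any further question about them is a matter of writing a query rather than re-reading the text.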

Information Extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research involves the attempt to extract structured data out of electronically available scientific literature, especially in the domain of biology and medicine.

# 1.1   Information Extraction Architecture

Figure 1.1 shows the architecture for a simple information extraction system. It begins by processing a document using several of the procedures discussed in chapters 3 and 5: first, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity detection. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation detection to search for likely relations between different entities in the text.

Figure 1.1
![Figure 1.1](https://www.nltk.org/images/ie-architecture.png)

In [2]:
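import nltk, re, pprint
def ie_preprocess(document):
    # Split the raw text into sentences
    sentences = nltk.sent_tokenize(document)
    # Split each sentence into word tokens
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    # Tag each token with its part of speech
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences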

Next, in named entity detection, we segment and label the entities that might participate in interesting relations with one another. Typically, these will be definite noun phrases such as the knights who say "ni", or proper names such as Monty Python. In some tasks it is useful to also consider indefinite nouns or noun chunks, such as every student or cats, and these do not necessarily refer to entities in the same way as definite NPs and proper names.

Finally, in relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.
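
As a rough illustration of this last step, the sketch below assumes that entity detection has already produced labeled spans, and looks for one specific pattern: an organization followed closely by the word _in_ and then a location. The input format, the function name `extract_in_relations`, and the `max_gap` parameter are all hypothetical, chosen only for this example; it is not how any particular library implements relation extraction.

```python
# Hand-annotated fragment of (1): entities are (label, text) pairs, other tokens
# are plain strings. The annotation is an assumption made for illustration; a
# real system would obtain it from the named entity detection step.
sent = [('ORG', 'BBDO South'), 'in', ('LOC', 'Atlanta'), ',', 'which', 'handles',
        'corporate', 'advertising', 'for', ('ORG', 'Georgia-Pacific')]

def extract_in_relations(items, max_gap=2):
    """Return (org, 'IN', loc) tuples where an ORG is followed by a LOC with
    at most max_gap filler tokens between them, one of which is 'in'."""
    relations = []
    for i, item in enumerate(items):
        if not (isinstance(item, tuple) and item[0] == 'ORG'):
            continue
        filler = []
        # Look ahead to the next entity, giving up if the gap grows too long
        for nxt in items[i + 1:]:
            if isinstance(nxt, tuple):
                if nxt[0] == 'LOC' and 'in' in filler:
                    relations.append((item[1], 'IN', nxt[1]))
                break
            filler.append(nxt.lower())
            if len(filler) > max_gap:
                break
    return relations

print(extract_in_relations(sent))
```

The resulting tuples have the same (entity, relation, entity) shape as the `locs` list used earlier, so they can be queried in the same way.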

# 2   Chunking

The basic technique we will use for entity detection is _chunking_, which segments and labels multi-token sequences as illustrated in Figure 2.1. The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

Figure 2.1
![Figure 2.1](https://www.nltk.org/images/chunk-segmentation.png)

In this section, we will explore chunking in some depth, beginning with the definition and representation of chunks. We will see regular expression and n-gram approaches to chunking, and will develop and evaluate chunkers using the CoNLL-2000 chunking corpus. We will then return in sections 5 and 6 to the tasks of named entity recognition and relation extraction.


# 2.1   Noun Phrase Chunking

We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:

    (2)		[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.
    
    
As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, _the market for system-management software for Digital's hardware_ is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk _the market_. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.
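
The bracket notation in (2) makes the flat, non-nested nature of NP-chunks easy to check mechanically. The following sketch reads the notation into Python lists and rejects nested brackets; `parse_bracketed` is a hypothetical helper written for this illustration, not an NLTK function.

```python
def parse_bracketed(text):
    """Split bracketed chunk notation into a flat list whose items are either
    a chunk (a list of (word, tag) pairs) or one unchunked (word, tag) pair."""
    items, current = [], None
    for piece in text.split():
        if piece == '[':
            if current is not None:
                raise ValueError('NP-chunks may not be nested')
            current = []
        elif piece == ']':
            if current is None:
                raise ValueError('closing bracket without an open chunk')
            items.append(current)
            current = None
        else:
            # Split on the last slash so that tokens like ./. are handled
            word, tag = piece.rsplit('/', 1)
            (items if current is None else current).append((word, tag))
    return items

example = ("[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] "
           "for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ "
           "enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN "
           "[ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.")

chunks = [item for item in parse_bracketed(example) if isinstance(item, list)]
for chunk in chunks:
    print(' '.join(word for word, tag in chunk))
```

Six chunks are recovered, and the prepositions, verbs and adverbs between them stay outside any chunk.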


One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. We demonstrate this approach in the example below, using a sentence that has already been part-of-speech tagged. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree, which we can either print, or display graphically.



In [3]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# A single regular-expression rule: an optional determiner (DT),
# followed by any number of adjectives (JJ), and then a noun (NN)
grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)  # create a chunk parser from the grammar
result = cp.parse(sentence)      # run it on the example sentence
print(result)                    # the result is a tree, which we can print

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [4]:
result.draw()  # or display graphically (opens a window, so a graphical display is required)

# 2.2   Tag Patterns

The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. `<DT>?<JJ>*<NN>`. Tag patterns are similar to regular expression patterns. Now, consider the following noun phrases from the Wall Street Journal:

    another/DT sharp/JJ dive/NN
    trade/NN figures/NNS
    any/DT new/JJ policy/NN measures/NNS
    earlier/JJR stages/NNS
    Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
    
    
We can match these noun phrases using a slight refinement of the first tag pattern above, i.e. `<DT>?<JJ.*>*<NN.*>+`. This will chunk any sequence of tokens beginning with an optional determiner, followed by zero or more adjectives of any type (including comparative adjectives like earlier/JJR), followed by one or more nouns of any type. However, it is easy to find many more complicated examples which this rule will not cover:


    his/PRP$ Mansion/NNP House/NNP speech/NN
    the/DT price/NN cutting/VBG
    3/CD %/NN to/TO 4/CD %/NN
    more/JJR than/IN 10/CD %/NN
    the/DT fastest/JJS developing/VBG trends/NNS
    's/POS skill/NN
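
One way to see why the first group of phrases is covered and the second is not is to emulate tag-pattern matching with Python's `re` module. The helpers `tag_pattern_to_regex` and `is_covered` below are a simplified sketch written for this illustration, not NLTK's implementation, and they test whether a phrase as a whole matches the pattern:

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>?<JJ.*>*<NN.*>+' into an ordinary regex
    over a string of angle-bracketed tags (simplified emulation only)."""
    def convert(match):
        # Inside a tag, '.' must not run past the closing angle bracket
        inner = match.group(1).replace('.', '[^<>]')
        return '(?:<(?:' + inner + ')>)'
    return re.compile(re.sub(r'<([^<>]+)>', convert, tag_pattern))

def is_covered(phrase, regex):
    """True if the whole tagged phrase (word/TAG tokens) matches the pattern."""
    tags = ''.join('<%s>' % tok.rsplit('/', 1)[1] for tok in phrase.split())
    return regex.fullmatch(tags) is not None

np_regex = tag_pattern_to_regex('<DT>?<JJ.*>*<NN.*>+')

covered = ["another/DT sharp/JJ dive/NN",
           "trade/NN figures/NNS",
           "any/DT new/JJ policy/NN measures/NNS",
           "earlier/JJR stages/NNS",
           "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP"]
not_covered = ["his/PRP$ Mansion/NNP House/NNP speech/NN",
               "the/DT price/NN cutting/VBG",
               "3/CD %/NN to/TO 4/CD %/NN",
               "more/JJR than/IN 10/CD %/NN",
               "the/DT fastest/JJS developing/VBG trends/NNS",
               "'s/POS skill/NN"]

for phrase in covered + not_covered:
    print(is_covered(phrase, np_regex), phrase)
```

All five phrases in the first list match, while each phrase in the second list contains a tag (such as PRP$, VBG, CD or POS) that the pattern has no place for.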
    


In [5]:
nltk.app.chunkparser()  # launch the graphical chunk parser to experiment with patterns for these harder cases

# 2.3   Chunking with Regular Expressions


To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

The example below shows a simple chunk grammar consisting of two rules. The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input:

In [6]:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
#The $ symbol is a special character in regular expressions, 
#and must be backslash escaped in order to match the tag PP$.

cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [7]:
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:

In [8]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)


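Once the chunk for money market has been created, the context that would have allowed fund to be included in a chunk is gone. A more permissive rule, such as `NP: {<NN>+}`, avoids this problem:

In [9]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN>+}  # Chunk one or more consecutive nouns"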
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN fund/NN))
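
The same leftmost, non-overlapping behavior can be observed with ordinary regular expressions. In this sketch the tags are written as a string of `<TAG>` items and Python's `re.finditer` stands in for the chunker; the helper `matched_words` is hypothetical and only maps each match back to the words it covers.

```python
import re

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
tags = ''.join('<%s>' % tag for word, tag in nouns)  # '<NN><NN><NN>'

def matched_words(regex, tagged, tag_string):
    """Return the words covered by each non-overlapping match of regex.
    Assumes every tag in tag_string is written as '<TAG>'."""
    groups = []
    for m in re.finditer(regex, tag_string):
        start = tag_string.count('<', 0, m.start())  # index of first matched token
        end = tag_string.count('<', 0, m.end())      # index just past the last one
        groups.append([word for word, tag in tagged[start:end]])
    return groups

two_nouns = matched_words(r'(?:<NN>){2}', nouns, tags)  # exactly two nouns
any_nouns = matched_words(r'(?:<NN>)+', nouns, tags)    # one or more nouns
print(two_nouns)
print(any_nouns)
```

With the two-noun pattern, the match starting at the first noun consumes the second noun as well, so nothing is left for the third noun to pair with; the one-or-more pattern takes all three at once.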


# 2.4   Exploring Text Corpora

In chapter 5 we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:



In [10]:
# Chunk sequences of the form: verb, "to", verb
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK expected/VBN to/TO approve/VB)
(CHUNK expected/VBN to/TO make/VB)
(CHUNK intends/VBZ to/TO make/VB)
(CHUNK seek/VB to/TO set/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK designed/VBN to/TO provide/VB)
(CHUNK get/VB to/TO hear/VB)
(CHUNK expects/VBZ to/TO tell/VB)
(CHUNK expected/VBN to/TO give/VB)
(CHUNK prefer/VB to/TO pay/VB)
(CHUNK required/VBN to/TO obtain/VB)
(CHUNK permitted/VBN to/TO teach/VB)
(CHUNK designed/VBN to/TO reduce/VB)
(CHUNK Asked/VBN to/TO elaborate/VB)
(CHUNK got/VBN to/TO go/VB)
(CHUNK raised/VBN to/TO pay/VB)
(CHUNK scheduled/VBN to/TO go/VB)
(CHUNK cut/VBN to/TO meet/VB)
(CHUNK needed/VBN to/TO meet/VB)
(CHUNK hastened/VBD to/TO add/VB)
(CHUNK found/VBN to/TO prevent/VB)
(CHUNK continue/VB to/TO insist/VB)
(CHUNK compelled/VBN to/TO make/VB)
...

(CHUNK voting/VBG to/TO cut/VB)
(CHUNK prepared/VBD to/TO choke/VB)
(CHUNK used/VBD to/TO approach/VB)
(CHUNK trying/VBG to/TO hit/VB)
(CHUNK refused/VBD to/TO let/VB)
(CHUNK began/VBD to/TO acquire/VB)
(CHUNK proceeded/VBD to/TO sink/VB)
(CHUNK proceeded/VBD to/TO follow/VB)
(CHUNK hoping/VBG to/TO slice/VB)
(CHUNK chose/VBD to/TO hit/VB)
(CHUNK like/VB to/TO hit/VB)
(CHUNK tries/VBZ to/TO answer/VB)
(CHUNK want/VB to/TO talk/VB)
(CHUNK got/VBN to/TO get/VB)
(CHUNK used/VBD to/TO follow/VB)
(CHUNK try/VB to/TO play/VB)
(CHUNK conspired/VBN to/TO lose/VB)
(CHUNK needed/VBN to/TO revive/VB)
(CHUNK chosen/VBN to/TO run/VB)
(CHUNK hopes/VBZ to/TO visit/VB)
(CHUNK got/VBD to/TO see/VB)
(CHUNK arranged/VBD to/TO sell/VB)
(CHUNK delighted/VBN to/TO get/VB)
(CHUNK want/VB to/TO enjoy/VB)
(CHUNK tried/VBN to/TO get/VB)
(CHUNK try/VB to/TO close/VB)
(CHUNK required/VBN to/TO furnish/VB)
(CHUNK obliged/VBN to/TO dole/VB)
(CHUNK wished/VBD to/TO wait/VB)
(CHUNK decided/VBD to/TO act/VB)
...

(CHUNK forbidden/VBN to/TO sit/VB)
(CHUNK plans/VBZ to/TO import/VB)
(CHUNK likes/VBZ to/TO imagine/VB)
(CHUNK used/VBD to/TO get/VB)
(CHUNK trying/VBG to/TO make/VB)
(CHUNK ceased/VBD to/TO suggest/VB)
(CHUNK going/VBG to/TO work/VB)
(CHUNK wanting/VBG to/TO cut/VB)
(CHUNK choose/VB to/TO persuade/VB)
(CHUNK trying/VBG to/TO keep/VB)
(CHUNK like/VB to/TO embark/VB)
(CHUNK suited/VBN to/TO defeat/VB)
(CHUNK hastened/VBD to/TO put/VB)
(CHUNK like/VB to/TO add/VB)
(CHUNK want/VB to/TO preserve/VB)
(CHUNK required/VBN to/TO participate/VB)
(CHUNK happened/VBN to/TO save/VB)
(CHUNK doing/VBG to/TO promote/VB)
(CHUNK tempted/VBN to/TO quote/VB)
(CHUNK continuing/VBG to/TO capture/VB)
(CHUNK need/VB to/TO communicate/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK interested/VBN to/TO know/VB)
(CHUNK allowed/VBN to/TO rust/VB)
(CHUNK chose/VBD to/TO devote/VB)
(CHUNK left/VBN to/TO choose/VB)
(CHUNK want/VB to/TO own/VB)
(CHUNK plan/VB to/TO become/VB)
(CHUNK persuaded/VBN to/TO restock/VB)
...

(CHUNK starting/VBG to/TO go/VB)
(CHUNK expected/VBN to/TO race/VB)
(CHUNK designed/VBN to/TO push/VB)
(CHUNK looks/VBZ to/TO run/VB)
(CHUNK began/VBD to/TO motor/VB)
(CHUNK trained/VBN to/TO drag/VB)
(CHUNK fled/VBD to/TO make/VB)
(CHUNK seemed/VBD to/TO know/VB)
(CHUNK used/VBD to/TO say/VB)
(CHUNK preferred/VBD to/TO get/VB)
(CHUNK hope/VB to/TO cover/VB)
(CHUNK want/VB to/TO miss/VB)
(CHUNK scheduled/VBN to/TO vanish/VB)
(CHUNK continued/VBD to/TO live/VB)
(CHUNK seem/VB to/TO cascade/VB)
(CHUNK forget/VB to/TO buy/VB)
(CHUNK fail/VB to/TO shorten/VB)
(CHUNK intend/VB to/TO cook/VB)
(CHUNK sized/VBN to/TO fit/VB)
(CHUNK continue/VB to/TO release/VB)
(CHUNK wish/VB to/TO create/VB)
(CHUNK trim/VB to/TO fit/VB)
(CHUNK cut/VBN to/TO fit/VB)
(CHUNK help/VB to/TO prevent/VB)
(CHUNK designed/VBN to/TO take/VB)
(CHUNK used/VBN to/TO transport/VB)
(CHUNK want/VB to/TO buy/VB)
(CHUNK used/VBN to/TO fasten/VB)
(CHUNK help/VB to/TO keep/VB)
(CHUNK needed/VBN to/TO build/VB)
...

(CHUNK decides/VBZ to/TO proceed/VB)
(CHUNK interested/VBN to/TO hear/VB)
(CHUNK seems/VBZ to/TO probe/VB)
(CHUNK preferring/VBG to/TO consider/VB)
(CHUNK known/VBN to/TO go/VB)
(CHUNK amazed/VBN to/TO realize/VB)
(CHUNK seeking/VBG to/TO help/VB)
(CHUNK interpreted/VBN to/TO conform/VB)
(CHUNK explored/VBN to/TO find/VB)
(CHUNK trying/VBG to/TO throw/VB)
(CHUNK designed/VBN to/TO find/VB)
(CHUNK required/VBN to/TO mark/VB)
(CHUNK asked/VBN to/TO consider/VB)
(CHUNK seem/VB to/TO involve/VB)
(CHUNK seems/VBZ to/TO smell/VB)
(CHUNK seems/VBZ to/TO see/VB)
(CHUNK learned/VBN to/TO develop/VB)
(CHUNK arranged/VBN to/TO meet/VB)
(CHUNK need/VB to/TO work/VB)
(CHUNK want/VB to/TO raise/VB)
(CHUNK need/VB to/TO bring/VB)
(CHUNK expect/VB to/TO grow/VB)
(CHUNK expected/VBN to/TO work/VB)
(CHUNK want/VB to/TO include/VB)
(CHUNK going/VBG to/TO produce/VB)
(CHUNK want/VB to/TO hire/VB)
(CHUNK want/VB to/TO buy/VB)
(CHUNK want/VB to/TO hire/VB)
(CHUNK going/VBG to/TO farm/VB)
...

(CHUNK Ask/VB to/TO see/VB)
(CHUNK interested/VBN to/TO learn/VB)
(CHUNK pass/VB to/TO add/VB)
(CHUNK demanding/VBG to/TO know/VB)
(CHUNK seems/VBZ to/TO threaten/VB)
(CHUNK tried/VBD to/TO describe/VB)
(CHUNK tried/VBD to/TO explain/VB)
(CHUNK seem/VB to/TO feel/VB)
(CHUNK got/VBN to/TO play/VB)
(CHUNK ceased/VBN to/TO need/VB)
(CHUNK like/VB to/TO live/VB)
(CHUNK came/VBD to/TO see/VB)
(CHUNK threatens/VBZ to/TO break/VB)
(CHUNK begins/VBZ to/TO fade/VB)
(CHUNK begins/VBZ to/TO appear/VB)
(CHUNK equipped/VBN to/TO tell/VB)
(CHUNK used/VBN to/TO increase/VB)
(CHUNK hired/VBN to/TO repeat/VB)
(CHUNK wished/VBD to/TO make/VB)
(CHUNK seems/VBZ to/TO exist/VB)
(CHUNK means/VBZ to/TO choose/VB)
(CHUNK struggle/VB to/TO insulate/VB)
(CHUNK serves/VBZ to/TO crystallize/VB)
(CHUNK doomed/VBN to/TO become/VB)
(CHUNK serves/VBZ to/TO illuminate/VB)
(CHUNK failed/VBD to/TO furnish/VB)
(CHUNK encouraged/VBN to/TO trade/VB)
(CHUNK continued/VBD to/TO come/VB)
(CHUNK promised/VBD to/TO send/VB)
...

(CHUNK decided/VBD to/TO migrate/VB)
(CHUNK continued/VBD to/TO trouble/VB)
(CHUNK labored/VBD to/TO finish/VB)
(CHUNK decided/VBD to/TO return/VB)
(CHUNK waiting/VBG to/TO go/VB)
(CHUNK chosen/VBN to/TO serve/VB)
(CHUNK came/VBD to/TO know/VB)
(CHUNK helped/VBN to/TO escape/VB)
(CHUNK opened/VBN to/TO admit/VB)
(CHUNK happened/VBD to/TO see/VB)
(CHUNK brought/VBN to/TO bear/VB)
(CHUNK inclined/VBN to/TO argue/VB)
(CHUNK seeming/VBG to/TO say/VB)
(CHUNK prompted/VBN to/TO write/VB)
(CHUNK come/VBN to/TO dominate/VB)
(CHUNK used/VBN to/TO illustrate/VB)
(CHUNK prepared/VBN to/TO find/VB)
(CHUNK wish/VB to/TO argue/VB)
(CHUNK begin/VB to/TO read/VB)
(CHUNK plan/VB to/TO discuss/VB)
(CHUNK come/VBN to/TO call/VB)
(CHUNK expect/VB to/TO find/VB)
(CHUNK come/VBN to/TO believe/VB)
(CHUNK continue/VB to/TO pay/VB)
(CHUNK tend/VB to/TO thump/VB)
(CHUNK determined/VBN to/TO prove/VB)
(CHUNK learn/VB to/TO control/VB)
(CHUNK used/VBN to/TO frustrate/VB)
(CHUNK trying/VBG to/TO assert/VB)
...

(CHUNK come/VBN to/TO say/VB)
(CHUNK began/VBD to/TO move/VB)
(CHUNK went/VBD to/TO visit/VB)
(CHUNK got/VBD to/TO drink/VB)
(CHUNK seem/VB to/TO know/VB)
(CHUNK wanted/VBD to/TO help/VB)
(CHUNK seem/VB to/TO fall/VB)
(CHUNK tends/VBZ to/TO obscure/VB)
(CHUNK beginning/VBG to/TO point/VB)
(CHUNK trying/VBG to/TO prove/VB)
(CHUNK trying/VBG to/TO sort/VB)
(CHUNK Start/VB to/TO prepare/VB)
(CHUNK obliged/VBN to/TO go/VB)
(CHUNK declined/VBD to/TO introduce/VB)
(CHUNK enter/VB to/TO ask/VB)
(CHUNK seems/VBZ to/TO lie/VB)
(CHUNK continued/VBD to/TO shape/VB)
(CHUNK seem/VB to/TO pass/VB)
(CHUNK prepared/VBN to/TO accept/VB)
(CHUNK done/VBN to/TO obtaine/VB)
(CHUNK expected/VBN to/TO reach/VB)
(CHUNK seems/VBZ to/TO refer/VB)
(CHUNK tried/VBD to/TO consult/VB)
(CHUNK came/VBD to/TO put/VB)
(CHUNK seemed/VBN to/TO promise/VB)
(CHUNK needed/VBD to/TO possess/VB)
(CHUNK seem/VB to/TO indicate/VB)
(CHUNK purports/VBZ to/TO examine/VB)
(CHUNK attempts/VBZ to/TO understand/VB)
...

(CHUNK began/VBD to/TO talk/VB)
(CHUNK begins/VBZ to/TO take/VB)
(CHUNK decided/VBN to/TO make/VB)
(CHUNK asked/VBN to/TO give/VB)
(CHUNK formed/VBN to/TO give/VB)
(CHUNK taken/VBN to/TO link/VB)
(CHUNK put/VBN to/TO use/VB)
(CHUNK going/VBG to/TO work/VB)
(CHUNK combine/VB to/TO provide/VB)
(CHUNK wish/VB to/TO serve/VB)
(CHUNK expected/VBN to/TO increase/VB)
(CHUNK needed/VBN to/TO maintain/VB)
(CHUNK needed/VBN to/TO obtain/VB)
(CHUNK planned/VBN to/TO maintain/VB)
(CHUNK needed/VBN to/TO meet/VB)
(CHUNK proposed/VBN to/TO authorize/VB)
(CHUNK decided/VBN to/TO stop/VB)
(CHUNK scheduled/VBN to/TO become/VB)
(CHUNK seek/VB to/TO assure/VB)
(CHUNK agrees/VBZ to/TO furnish/VB)
(CHUNK prepared/VBN to/TO consider/VB)
(CHUNK prepared/VBN to/TO act/VB)
(CHUNK prepared/VBN to/TO act/VB)
(CHUNK required/VBN to/TO cease/VB)
(CHUNK required/VBN to/TO operate/VB)
(CHUNK designed/VBN to/TO operate/VB)
(CHUNK permitted/VBN to/TO operate/VB)
(CHUNK taken/VBN to/TO minimize/VB)
...

(CHUNK mentioned/VBN to/TO make/VB)
(CHUNK trying/VBG to/TO develop/VB)
(CHUNK compelled/VBN to/TO omit/VB)
(CHUNK continue/VB to/TO show/VB)
(CHUNK planning/VBG to/TO use/VB)
(CHUNK expecting/VBG to/TO recover/VB)
(CHUNK meant/VBD to/TO move/VB)
(CHUNK preferred/VBD to/TO continue/VB)
(CHUNK trying/VBG to/TO find/VB)
(CHUNK planned/VBN to/TO exterminate/VB)
(CHUNK trying/VBG to/TO marry/VB)
(CHUNK pledged/VBN to/TO hold/VB)
(CHUNK determined/VBN to/TO create/VB)
(CHUNK seemed/VBD to/TO assure/VB)
(CHUNK attempted/VBD to/TO marry/VB)
(CHUNK obliged/VBN to/TO concede/VB)
(CHUNK expected/VBD to/TO democratize/VB)
(CHUNK Failing/VBG to/TO heed/VB)
(CHUNK determined/VBN to/TO keep/VB)
(CHUNK tend/VB to/TO procrastinate/VB)
(CHUNK even/VB to/TO repudiate/VB)
(CHUNK served/VBD to/TO minimize/VB)
(CHUNK encouraged/VBN to/TO state/VB)
(CHUNK trying/VBG to/TO unearth/VB)
(CHUNK decided/VBD to/TO remove/VB)
(CHUNK decide/VB to/TO encourage/VB)
(CHUNK prefer/VB to/TO hire/VB)
...

(CHUNK began/VBD to/TO take/VB)
(CHUNK wanted/VBD to/TO tell/VB)
(CHUNK wanted/VBD to/TO substitute/VB)
(CHUNK want/VB to/TO make/VB)
(CHUNK come/VBN to/TO determine/VB)
(CHUNK begun/VBN to/TO ebb/VB)
(CHUNK intended/VBN to/TO incorporate/VB)
(CHUNK led/VBN to/TO postulate/VB)
(CHUNK hope/VB to/TO discover/VB)
(CHUNK tended/VBN to/TO emphasize/VB)
(CHUNK fails/VBZ to/TO explore/VB)
(CHUNK seeks/VBZ to/TO make/VB)
(CHUNK helping/VBG to/TO define/VB)
(CHUNK trying/VBG to/TO avoid/VB)
(CHUNK trying/VBG to/TO get/VB)
(CHUNK made/VBN to/TO symbolize/VB)
(CHUNK kneels/VBZ to/TO kiss/VB)
(CHUNK serve/VB to/TO travesty/VB)
(CHUNK used/VBN to/TO equate/VB)
(CHUNK altered/VBN to/TO show/VB)
(CHUNK altered/VBN to/TO show/VB)
(CHUNK taken/VBN to/TO branch/VB)
(CHUNK attempt/VB to/TO execute/VB)
(CHUNK used/VBN to/TO name/VB)
(CHUNK used/VBN to/TO name/VB)
(CHUNK used/VBN to/TO generate/VB)
(CHUNK used/VBN to/TO select/VB)
(CHUNK used/VBN to/TO select/VB)
(CHUNK used/VBN to/TO specify/VB)
...

(CHUNK consented/VBD to/TO meet/VB)
(CHUNK rose/VBD to/TO go/VB)
(CHUNK chosen/VBN to/TO read/VB)
(CHUNK started/VBN to/TO cross/VB)
(CHUNK seemed/VBD to/TO think/VB)
(CHUNK started/VBD to/TO undo/VB)
(CHUNK longed/VBD to/TO tell/VB)
(CHUNK chose/VBD to/TO read/VB)
(CHUNK served/VBN to/TO increase/VB)
(CHUNK refused/VBD to/TO bring/VB)
(CHUNK got/VBN to/TO stop/VB)
(CHUNK want/VB to/TO take/VB)
(CHUNK tried/VBD to/TO order/VB)
(CHUNK seeking/VBG to/TO create/VB)
(CHUNK hope/VB to/TO accomplish/VB)
(CHUNK attempt/VB to/TO rise/VB)
(CHUNK tried/VBD to/TO rise/VB)
(CHUNK began/VBD to/TO crawl/VB)
(CHUNK failed/VBD to/TO reach/VB)
(CHUNK began/VBD to/TO creep/VB)
(CHUNK began/VBD to/TO crawl/VB)
(CHUNK promised/VBD to/TO take/VB)
(CHUNK meant/VBN to/TO shout/VB)
(CHUNK longed/VBN to/TO increase/VB)
(CHUNK want/VB to/TO begin/VB)
(CHUNK seemed/VBD to/TO imply/VB)
(CHUNK stopped/VBD to/TO admire/VB)
(CHUNK stayed/VBD to/TO visit/VB)
(CHUNK want/VB to/TO get/VB)
...

(CHUNK continued/VBD to/TO discharge/VB)
(CHUNK seem/VB to/TO belong/VB)
(CHUNK began/VBD to/TO flicker/VB)
(CHUNK trying/VBG to/TO wreck/VB)
(CHUNK fit/VBN to/TO touch/VB)
(CHUNK going/VBG to/TO take/VB)
(CHUNK trying/VBG to/TO clear/VB)
(CHUNK want/VB to/TO spend/VB)
(CHUNK paused/VBD to/TO look/VB)
(CHUNK going/VBG to/TO allow/VB)
(CHUNK like/VB to/TO talk/VB)
(CHUNK planning/VBG to/TO set/VB)
(CHUNK bent/VBD to/TO examine/VB)
(CHUNK turned/VBD to/TO jump/VB)
(CHUNK started/VBD to/TO retch/VB)
(CHUNK going/VBG to/TO get/VB)
(CHUNK come/VBN to/TO recognize/VB)
(CHUNK expected/VBN to/TO report/VB)
(CHUNK failed/VBD to/TO see/VB)
(CHUNK failed/VBD to/TO notify/VB)
(CHUNK failed/VBD to/TO co-operate/VB)
(CHUNK stopping/VBG to/TO hear/VB)
(CHUNK want/VB to/TO talk/VB)
(CHUNK going/VBG to/TO cost/VB)
(CHUNK wanted/VBD to/TO ask/VB)
(CHUNK going/VBG to/TO get/VB)
(CHUNK going/VBG to/TO swear/VB)
(CHUNK tried/VBD to/TO keep/VB)
(CHUNK think/VB to/TO look/VB)
(CHUNK tried/VBD to/TO find/VB)


(CHUNK wanted/VBD to/TO get/VB)
(CHUNK continued/VBD to/TO snort/VB)
(CHUNK started/VBD to/TO reach/VB)
(CHUNK beginning/VBG to/TO recover/VB)
(CHUNK going/VBG to/TO talk/VB)
(CHUNK like/VB to/TO kill/VB)
(CHUNK going/VBG to/TO hear/VB)
(CHUNK wanted/VBD to/TO hear/VB)
(CHUNK came/VBD to/TO investigate/VB)
(CHUNK managed/VBD to/TO duck/VB)
(CHUNK began/VBD to/TO focus/VB)
(CHUNK trying/VBG to/TO yank/VB)
(CHUNK began/VBD to/TO snort/VB)
(CHUNK wanted/VBD to/TO show/VB)
(CHUNK tried/VBD to/TO start/VB)
(CHUNK going/VBG to/TO let/VB)
(CHUNK going/VBG to/TO fight/VB)
(CHUNK attempting/VBG to/TO speak/VB)
(CHUNK longing/VBG to/TO catch/VB)
(CHUNK rejoicing/VBG to/TO think/VB)
(CHUNK like/VB to/TO get/VB)
(CHUNK tried/VBD to/TO go/VB)
(CHUNK seemed/VBD to/TO make/VB)
(CHUNK aim/VB to/TO give/VB)
(CHUNK want/VB to/TO trade/VB)
(CHUNK beginning/VBG to/TO turn/VB)
(CHUNK forced/VBN to/TO maintain/VB)
(CHUNK hope/VB to/TO locate/VB)
(CHUNK paused/VBD to/TO gather/VB)
...

(CHUNK managed/VBD to/TO look/VB)
(CHUNK needed/VBD to/TO get/VB)
(CHUNK answered/VBD to/TO find/VB)
(CHUNK afford/VB to/TO get/VB)
(CHUNK started/VBD to/TO look/VB)
(CHUNK takes/VBZ to/TO get/VB)
(CHUNK going/VBG to/TO get/VB)
(CHUNK tried/VBD to/TO quiet/VB)
(CHUNK trying/VBG to/TO sound/VB)
(CHUNK came/VBD to/TO meet/VB)
(CHUNK seemed/VBD to/TO focus/VB)
(CHUNK want/VB to/TO talk/VB)
(CHUNK want/VB to/TO see/VB)
(CHUNK wants/VBZ to/TO get/VB)
(CHUNK went/VBD to/TO turn/VB)
(CHUNK surprised/VBN to/TO find/VB)
(CHUNK want/VB to/TO stay/VB)
(CHUNK going/VBG to/TO make/VB)
(CHUNK hoped/VBD to/TO dig/VB)
(CHUNK trying/VBG to/TO make/VB)
(CHUNK going/VBG to/TO lug/VB)
(CHUNK surprised/VBN to/TO see/VB)
(CHUNK stop/VB to/TO read/VB)
(CHUNK intended/VBD to/TO move/VB)
(CHUNK rising/VBG to/TO sting/VB)
(CHUNK arranged/VBN to/TO live/VB)
(CHUNK managed/VBN to/TO find/VB)
(CHUNK inclined/VBN to/TO wobble/VB)
(CHUNK supposed/VBN to/TO care/VB)
(CHUNK shuddered/VBD to/TO think/VB)
...

In [11]:
# Chunk sequences of four or more consecutive nouns
cp = nltk.RegexpParser("NOUNS: {<N.*>{4,}}")
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'NOUNS':
            print(subtree)

(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)
(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)
(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)
(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)
(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)
(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)
(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)
(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)
(NOUNS
  State/NN-TL
  Party/NN-TL
  Chairman/NN-TL
  James/NP
  W./NP
  Dorsey/NP)
(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)
(NOUNS Lt./NN-TL Gov./NN-TL Garland/NP Byrd's/NP$ campaign/NN)
(NOUNS Schley/NP County/NN-TL Rep./NN-TL B./NP D./NP Pelham/NP)
(NOUNS Colquitt/NP-TL Policeman/NN-TL Tom/NP Williams/NP)
(NOUNS Rep./NN-TL Charles/NP E./NP Hughes/NP)
(NOUNS State/NN-TL Health/NN-TL Department's/NN$-TL authority/NN)
(NOUNS Lamar/NP-TL county/NN-TL Hospital/NN-TL District/NN-TL)
(NOUNS Sen./NN-TL A./NP R./NP Schwartz/NP)
(NOUNS Sen./NN-TL A./NP M./NP Aikin/NP Jr./NP)
...

(NOUNS snow/NN emergency/NN route/NN plan/NN)
(NOUNS Spring/NN-TL Grove/NN-TL State/NN-TL Hospital/NN-TL)
(NOUNS Anne/NP-TL Arundel/NP-TL County/NN-TL Jail/NN-TL)
(NOUNS Trooper/NN-TL J./NP A./NP Grzesiak/NP)
(NOUNS Bow/NP-TL St./NN-TL police/NN court/NN)
(NOUNS Vic/NP theater/NN Saturday/NR afternoon/NN)
(NOUNS Judge/NN-TL Joseph/NP Sam/NP Perry/NP)
(NOUNS city/NN police/NN narcotics/NNS unit/NN)
(NOUNS Patrolman/NN James/NP F./NP Simms/NP)
(NOUNS Assistant/NN Prosecutor/NN-TL Fred/NP Lewis/NP)
(NOUNS Circuit/NN Judge/NN-TL Paul/NP R./NP Cash/NP)
(NOUNS gas/NN explosion/NN Saturday/NR night/NN)
(NOUNS Lyle/NP-TL Elliott/NP-TL Funeral/NN-TL Home/NN-TL)
(NOUNS Assistant/NN Fire/NN-TL Chief/NN-TL Chester/NP Cornell/NP)
(NOUNS mother's/NN$ death/NN yesterday/NR afternoon/NN)
(NOUNS Mrs./NP Frank/NP C./NP Smith/NP)
(NOUNS
  Tareytown/NP-TL
  Acres/NNS-TL
  Homeowners/NNS-TL
  Association/NN-TL)
(NOUNS Principal/NN Clayton/NP W./NP Pohly/NP)
...

(NOUNS college/NN age/NN limit/NN law/NN)
(NOUNS road/NN maintenance/NN bond/NN issue/NN)
(NOUNS Georgia/NP Tech/NP research/NN staff/NN)
(NOUNS Chairman/NN-TL Charles/NP O./NP Emmerich/NP)
(NOUNS
  Georgia/NP-TL
  Power/NN-TL
  Company's/NN$-TL
  record/NN
  construction/NN
  budget/NN)
(NOUNS Dictator/NN-TL Marcos/NP Perez/NP Jimenez/NP)
(NOUNS Mayor/NN-TL Robert/NP F./NP Wagner/NP)
(NOUNS State/NN-TL Controller/NN-TL Arthur/NP Levitt/NP)
(NOUNS press/NN conference/NN Mr./NP Kennedy/NP)
(NOUNS Viet/NP Minh/NP guerrilla/NN fighters/NNS)
(NOUNS Dade/NP-TL County/NN-TL Port/NN-TL Authority/NN-TL)
(NOUNS Chicago/NP-TL Tribune/NN-TL News/NN-TL Service/NN-TL)
(NOUNS world/NN trade/NN trip/NN people/NNS)
(NOUNS Stone/NN-TL Harbor/NN-TL bird/NN sanctuary/NN)
(NOUNS Stone/NN-TL Harbor/NN-TL bird/NN sanctuary's/NN$ allies/NNS)
(NOUNS
  glamor/NN
  President/NN-TL
  Kennedy's/NP$
  Peace/NN-TL
  Corps/NN-TL)
(NOUNS Dean/NN-TL John/NP W./NP Schwada/NP)
...

(NOUNS Wall/NN-TL Street/NN-TL Journal/NN-TL survey/NN)
(NOUNS Advance/NN-TL Neon/NN-TL Sign/NN-TL Co./NN-TL)
(NOUNS Du/NP Pont's/NP$ Polychemicals/NNS-TL Dept./NN-TL)
(NOUNS Advance/NN-TL Neon/NN-TL Sign/NN-TL Co./NN-TL)
(NOUNS Lumber/NN-TL Dealers'/NNS$-TL Research/NN-TL Council/NN-TL)
(NOUNS Lumber/NN-TL Dealers'/NNS$-TL Research/NN-TL Council/NN-TL)
(NOUNS Type/NN-TL 6-B/NP floating-load/NN method/NN)
(NOUNS Clay/NN-TL Products/NNS-TL Research/NN-TL Foundation/NN-TL)
(NOUNS Alfa/NP Romeo/NP Giulietta/NP models/NNS)
(NOUNS Charles/NP MacArthur-Helen/NP Hayes/NP saga/NN)
(NOUNS Dr./NN-TL Henry/NP Lee/NP Smith/NP)
(NOUNS President/NN-TL Franklin/NP Delano/NP Roosevelt/NP)
(NOUNS Lt./NN-TL Howard/NP D./NP Beckstrom/NP)
(NOUNS Lt./NN-TL Thomas/NP H./NP Richardson/NP)
(NOUNS Capt./NN-TL A./NP B./NP Jenks/NP)
(NOUNS Ensign/NN-TL Kay/NP K./NP Vesole/NP)
(NOUNS Seaman/NN-TL 2/c/NP-TL Donald/NP L./NP Norton/NP)
(NOUNS Seaman/NN-TL 1/c/NP-TL William/NP A./NP Rochford/NP)
(NOUNS Seaman/NN-TL 1

(NOUNS E./NP T./NP Leeds'/NP$ Archaeology/NN-TL)
(NOUNS Staff/NN-TL Hugh/NP L./NP Scott/NP)
(NOUNS Provost/NN-TL Marshal/NN-TL Enoch/NP Crowder/NP)
(NOUNS Democrat/NP Stanley/NP H./NP Dent/NP)
(NOUNS Floor/NN-TL Leader/NN-TL Claude/NP Kitchin/NP)
(NOUNS Senator/NN-TL James/NP A./NP Reed/NP)
(NOUNS minister/NN Harry/NP Emerson/NP Fosdick/NP)
(NOUNS cousin/NN Sir/NP Fulke/NP Greville/NP)
(NOUNS countriman/NN Mr./NP Wm./NP Shak./NP)
(NOUNS Sr./NP Edw/NP Grevyles/NP$ minaces/NNS)
(NOUNS tyme/NN Mr./NP Ryc'/NP Quyney/NP)
(NOUNS Mark/NP Antony/NP De/NP Wolfe/NP Howe/NP)
(NOUNS W./NP E./NP Burghardt/NP Du/NP Bois/NP)
(NOUNS Bill/NP Brown's/NP$ Training/NN-TL Camp/NN-TL)
(NOUNS Arthur/NP Clarke's/NP$ Childhood's/NN$-TL End/NN-TL)
(NOUNS Kurt/NP Vonnegut's/NP$ Player/NN-TL Piano/NN-TL)
(NOUNS San/NP Francisco/NP poetry/NN group/NN)
(NOUNS York/NP-TL Times/NNS-TL Book/NN-TL Review/NN-TL)
(NOUNS what-will-T./NIL S./NIL Eliot-or-Martin/NIL Buber-think/NIL)
(NOUNS William/NP Lyon/NP Phelps/NP atmos

(NOUNS George/NP Washington's/NP$ cherry/NN tree/NN)
(NOUNS lb/NN BOD/day/1,000/NN cu/NN ft/NN aeration/NN capacity/NN)
(NOUNS first-order/NN stress/NN relaxation/NN function/NN)
(NOUNS C./NP A./NP J./NP Hoeve/NP)
(NOUNS C./NP A./NP J./NP Hoeve/NP)
(NOUNS C./NP A./NP J./NP Hoeve/NP)
(NOUNS James/NP Dwight/NP Dana's/NP$ System/NN-TL)
(NOUNS reactor/NN irradiation/NN loop/NN system/NN)
(NOUNS Van/NP De/NP Graff/NP machines/NNS)
(NOUNS Army/NN-TL Quartermaster/NN-TL Corps/NN-TL program/NN)
(NOUNS
  material/NN
  (/NIL
  B.t.u./sq./NIL
  ft./NIL
  of/NIL
  material/hr./*0F./in./NIL
  of/NIL
  thickness/NIL
  )/NIL)
(NOUNS urethane/NN foam/NN sandwich/NN panels/NNS)
(NOUNS 90-degrees-F/NNS (/NIL 29/NIL -/NIL 32-degrees-C/NIL )/NIL)
(NOUNS 90-degrees-F/NNS (/NIL 32*0C./NIL )/NIL)
(NOUNS 140-degrees-F/NNS (/NIL 38*0/NIL to/NIL 60*0C./NIL )/NIL)
(NOUNS 275-degrees-F/NNS (/NIL 135*0C./NIL )/NIL)
(NOUNS 160-degrees-F/NNS (/NIL 49*0/NIL -/NIL 71*&0C./NIL )/NIL)
(NOUNS 105-degrees-F/NNS (/NIL 38/N

# 2.5   Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

    [ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
    
Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains. These three possibilities are illustrated in the table below.

$$
\begin{array}{|llll|}
\hline &  {\text { Entire chunk }} &  {\text { Middle of a chunk }} &  {\text { End of a chunk }} \\
\text { Input } & \text { [a/DT little/JJ dog/NN] } & \text { [a/DT little/JJ dog/NN] } & \text { [a/DT little/JJ dog/NN] } \\
\text { Operation } & \text { Chink "DT JJ NN" } & \text { Chink "JJ" } & \text { Chink "NN" } \\
\text { Pattern } & \text { \}DT JJ NN\{ } & \text { \}JJ\{ } & \text { \}NN\{ } \\
\text { Output } & \text { a/DT little/JJ dog/NN } & \text { [a/DT] little/JJ [dog/NN] } & \text { [a/DT little/JJ] dog/NN } \\
\hline
\end{array}
$$
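The three cases in the table can be reproduced with a small plain-Python sketch. It is a simplification rather than NLTK's implementation: the hypothetical helper `chink` removes tokens by individual tag, whereas a `RegexpParser` chink rule matches a pattern over tag sequences.

```python
def chink(chunk, chink_tags):
    """Split one chunk (a list of (word, tag) pairs) by removing chinked tokens.

    Returns the smaller chunks that remain; an empty list means the whole
    chunk was removed.
    """
    pieces, current = [], []
    for word, tag in chunk:
        if tag in chink_tags:
            # A chinked token closes whatever chunk was being built.
            if current:
                pieces.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        pieces.append(current)
    return pieces

chunk = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]
entire = chink(chunk, {"DT", "JJ", "NN"})  # whole chunk removed
middle = chink(chunk, {"JJ"})              # chunk split in two
end = chink(chunk, {"NN"})                 # smaller chunk remains
print(entire, middle, end, sep="\n")
```

Each result corresponds to one column of the "Output" row: no chunk, two chunks, and one shortened chunk.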

In [12]:
# we put the entire sentence into a single chunk, then excise the chinks.
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


As befits their intermediate status between tagging and parsing, chunk structures can be represented using either tags or trees. The most widespread file representation uses _IOB_ tags. In this scheme, each token is tagged with one of three special chunk tags, _I (inside)_, _O (outside)_, or _B (begin)_. A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown in 2.5 below.

![2.5](https://www.nltk.org/images/chunk-tagrep.png)

IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format. Here is how the information in 2.5 would appear in a file:

    We PRP B-NP
    saw VBD O
    the DT B-NP
    yellow JJ I-NP
    dog NN I-NP
    
In this representation there is one token per line, each with its part-of-speech tag and chunk tag. This format permits us to represent more than one chunk type, so long as the chunks do not overlap. As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly. An example is shown below.

![2.6](https://www.nltk.org/images/chunk-treerep.png)
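The correspondence between the tag representation and the tree representation can be sketched in plain Python. The helper `iob_to_chunks` below is hypothetical and independent of NLTK: it groups IOB-tagged tokens into chunks, which is the grouping a chunk tree makes explicit. It assumes well-formed input, and an I tag with no open chunk of the same type is simply ignored.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks, current = [], None
    for word, pos, iob in triples:
        if iob.startswith("B-"):
            # B opens a new chunk of the given type.
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # I continues the chunk that is currently open.
            current[1].append(word)
        else:
            # O, or an I tag with no matching open chunk, closes the chunk.
            current = None
    return chunks

triples = [("We", "PRP", "B-NP"), ("saw", "VBD", "O"), ("the", "DT", "B-NP"),
           ("yellow", "JJ", "I-NP"), ("dog", "NN", "I-NP")]
chunks = iob_to_chunks(triples)
print(chunks)
```

The token tagged O belongs to no chunk, so it does not appear in the result.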

# 3   Developing and Evaluating Chunkers

Now you have a taste of what chunking does, but we haven't explained how to evaluate chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using a chunked corpus. We will see how to score the accuracy of a chunker relative to a corpus, then look at some more data-driven ways to search for NP chunks. Our focus throughout will be on expanding the coverage of a chunker.

# 3.1   Reading IOB Format and the CoNLL 2000 Corpus

Using the corpus module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP and PP. As we have seen, each sentence is represented using multiple lines, as shown below:

    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
 
A conversion function chunk.conllstr2tree() builds a tree representation from one of these multi-line strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:

In [13]:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
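Selecting a subset of chunk types amounts to treating every other chunk tag as O. The following sketch shows that idea in plain Python; `read_conll` is a hypothetical helper written for illustration, not the code NLTK uses internally.

```python
def read_conll(text, chunk_types=None):
    """Parse 'word pos iob' lines; tags of unselected chunk types become O."""
    triples = []
    for line in text.strip().splitlines():
        word, pos, iob = line.split()
        if chunk_types is not None and iob != "O" and iob[2:] not in chunk_types:
            iob = "O"  # this chunk type was not requested
        triples.append((word, pos, iob))
    return triples

text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
"""
np_only = read_conll(text, chunk_types={"NP"})
for triple in np_only:
    print(triple)
```

With only NP selected, the VP and PP tokens are reported as lying outside any chunk.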

We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format. We can access the data using nltk.corpus.conll2000. Here is an example that reads the 100th sentence of the "train" portion of the corpus:

In [14]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


As you can see, the CoNLL 2000 corpus contains three chunk types: NP chunks, which we have already seen; VP chunks such as has already delivered; and PP chunks such as because of. Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them:



In [15]:
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)


# 3.2   Simple Evaluation and Baselines

Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:

In [16]:
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
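
To make the precision, recall, and F-measure entries of such a report concrete, here is a minimal sketch in plain Python. It assumes chunks are compared as exact (start, end, type) spans and that the F-measure is the harmonic mean of precision and recall; the function name and the toy spans are illustrative, not taken from NLTK.

```python
def chunk_scores(gold, guessed):
    """Precision, recall and F-measure over sets of (start, end, type) spans."""
    correct = len(gold & guessed)  # chunks with identical boundaries and type
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

gold = {(0, 1, "NP"), (2, 5, "NP"), (6, 8, "NP")}
no_chunks = set()                                    # a chunker that finds nothing
partial = {(0, 1, "NP"), (2, 4, "NP"), (6, 8, "NP")}  # one boundary is wrong
print(chunk_scores(gold, no_chunks))
print(chunk_scores(gold, partial))
```

A chunker that proposes no chunks scores zero on all three measures, whatever share of its O tags happens to be right.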


The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e. not in an NP chunk. However, since our tagger did not find any chunks, its precision, recall, and f-measure are all zero. Now let's try a naive regular expression chunker that looks for tags beginning with letters that are characteristic of noun phrase tags (e.g. CD, DT, and JJ).

In [17]:
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


As you can see, this approach achieves decent results. However, we can improve on it by adopting a more data-driven approach, where we use the training corpus to find the chunk tag (I, O, or B) that is most likely for each part-of-speech tag. In other words, we can build a chunker using a unigram tagger. But rather than trying to determine the correct part-of-speech tag for each word, we are trying to determine the correct chunk tag, given each word's part-of-speech tag.

In 3.1, we define the UnigramChunker class, which uses a unigram tagger to label sentences with chunk tags. Most of the code in this class is simply used to convert back and forth between the chunk tree representation used by NLTK's ChunkParserI interface, and the IOB representation used by the embedded tagger. The class defines two methods: a constructor which is called when we build a new UnigramChunker; and the parse method which is used to chunk new sentences.

In [18]:
#3.1
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): #: a constructor
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data) 

    def parse(self, sentence): # the parse method
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

The constructor expects a list of training sentences, which will be in the form of chunk trees. It first converts training data to a form that is suitable for training the tagger, using tree2conlltags to map each chunk tree to a list of word,tag,chunk triples. It then uses that converted training data to train a unigram tagger, and stores it in self.tagger for later use.

The parse method takes a tagged sentence as its input, and begins by extracting the part-of-speech tags from that sentence. It then tags the part-of-speech tags with IOB chunk tags, using the tagger self.tagger that was trained in the constructor. Next, it extracts the chunk tags, and combines them with the original sentence, to yield conlltags. Finally, it uses conlltags2tree to convert the result back into a chunk tree.
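The idea behind the embedded unigram tagger can be sketched without NLTK: for each part-of-speech tag, remember the chunk tag it most often carries in the training data. The toy training data and the fallback to O for unseen tags are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each POS tag to the chunk tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# Toy training data: sentences as lists of (POS tag, chunk tag) pairs.
train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("NN", "B-NP")],
]
model = train_unigram(train)
# Assumption: POS tags never seen in training are tagged O.
predicted = [model.get(pos, "O") for pos in ["DT", "NN", "VBD", "NN"]]
print(predicted)
```

Because each decision looks only at the current part-of-speech tag, the final NN is tagged I-NP even though it follows an O tag. A tagger with no context cannot avoid this kind of error.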

Now that we have UnigramChunker, we can train it using the CoNLL 2000 corpus, and test its resulting performance:

In [19]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


This chunker does reasonably well, achieving an overall f-measure score of 83%. Let's take a look at what it's learned, by using its unigram tagger to assign a tag to each of the part-of-speech tags that appear in the corpus:

In [20]:
postags = sorted(set(pos for sent in train_sents
                     for (word,pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]


It has discovered that most punctuation marks occur outside of NP chunks, with the exception of # and $\$$, both of which are used as currency markers. It has also found that determiners (DT) and possessives (PRP$\$$ and WP$\$$) occur at the beginnings of NP chunks, while noun types (NN, NNP, NNPS, NNS) mostly occur inside of NP chunks.

Having built a unigram chunker, it is quite easy to build a bigram chunker: we simply change the class name to BigramChunker, and modify the line in the constructor that builds the tagger so that it constructs a BigramTagger rather than a UnigramTagger. The resulting chunker has slightly higher performance than the unigram chunker:

In [21]:
#3.1
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): #: a constructor
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data) 

    def parse(self, sentence): # the parse method
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))
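
What the bigram tagger adds is context: the chunk tag chosen for the previous token becomes part of the lookup key. The sketch below illustrates this in plain Python with toy data; the `<START>` marker and the fallback to O for unseen contexts are assumptions of this illustration, not NLTK behaviour.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map (previous chunk tag, POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<START>"
        for pos, chunktag in sent:
            counts[(prev, pos)][chunktag] += 1
            prev = chunktag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_bigram(model, pos_tags):
    prev, out = "<START>", []
    for pos in pos_tags:
        chunktag = model.get((prev, pos), "O")  # assumption: unseen context -> O
        out.append(chunktag)
        prev = chunktag
    return out

train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("NN", "B-NP")],
]
model = train_bigram(train)
predicted = tag_bigram(model, ["DT", "NN", "VBD", "NN"])
print(predicted)
```

Here the second NN follows an O tag, so it is tagged B-NP and starts a new chunk. This distinction cannot be made from the part-of-speech tag alone.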


# 3.3   Training Classifier-Based Chunkers

Both the regular-expression based chunkers and the n-gram chunkers decide what chunks to create entirely based on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

    a.		Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

    b.		Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, while the corresponding material in the second sentence, my computer monitor, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to just their part-of-speech tags, if we wish to maximize chunking performance.
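The point can be checked mechanically. In the sketch below, the IOB tags are an encoding of the chunking described above, added here for illustration: the two sentences have identical part-of-speech sequences but different chunk tag sequences.

```python
# (word, POS tag, chunk tag); the chunk tags encode the chunking described in the text.
joey = [("Joey", "NN", "B-NP"), ("sold", "VBD", "O"), ("the", "DT", "B-NP"),
        ("farmer", "NN", "I-NP"), ("rice", "NN", "B-NP"), (".", ".", "O")]
nick = [("Nick", "NN", "B-NP"), ("broke", "VBD", "O"), ("my", "DT", "B-NP"),
        ("computer", "NN", "I-NP"), ("monitor", "NN", "I-NP"), (".", ".", "O")]

same_pos = [p for _, p, _ in joey] == [p for _, p, _ in nick]
same_chunks = [c for _, _, c in joey] == [c for _, _, c in nick]
print(same_pos, same_chunks)
```

Any chunker that sees only the part-of-speech tags must produce the same output for both sentences, so it is bound to get at least one of them wrong.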

One way that we can incorporate information about the content of words is to use a classifier-based tagger to chunk the sentence. Like the n-gram chunker considered in the previous section, this classifier-based chunker will work by assigning IOB tags to the words in a sentence, and then converting those tags to chunks. For the classifier-based tagger itself, we will use the same approach that we used in the previous chapter to build a part-of-speech tagger.

The basic code for the classifier-based NP chunker is shown in 3.2 below. It consists of two classes. The first class is almost identical to the ConsecutivePosTagger class from 1.5. The only two differences are that it calls a different feature extractor and that it uses a MaxentClassifier rather than a NaiveBayesClassifier. The second class is basically a wrapper around the tagger class that turns it into a chunker. During training, this second class maps the chunk trees in the training corpus into tag sequences; in the parse() method, it converts the tag sequence provided by the tagger back into a chunk tree.

In [None]:
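# Optional: MEGAM is an external binary that speeds up MaxentClassifier training.
# config_megam() expects the path to the megam executable itself, not to NLTK's
# megam.py wrapper module, so the call below did not work as written.
# MEGAM is not used in the rest of this section.
# from nltk.classify import megam
# megam.config_megam(r'c:\users\junglebook\anaconda3\lib\site-packages\nltk\classify\megam.py')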

In [None]:
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI): # sequence classifier

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent) # strip the chunk tags, keeping (word, pos) pairs
            history = [] # chunk tags assigned so far; history[j] belongs to token j
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history) # feature extractor
                train_set.append( (featureset, tag) ) # pair the features with the gold chunk tag
                history.append(tag) # during training the history holds the gold tags
        self.classifier = nltk.MaxentClassifier.train( # it uses a MaxentClassifier
            train_set,
            #algorithm='megam', # requires the external MEGAM binary
            trace=0)

    def tag(self, sentence): # tags a (word, pos) sentence after training
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history) # features of token i
            tag = self.classifier.classify(featureset) # most likely chunk tag for these features
            history.append(tag) # the predicted chunk tags so far
        return list(zip(sentence, history))

class ConsecutiveNPChunker(nltk.ChunkParserI): # a wrapper around the tagger class that turns it into a chunker
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)] # (word, POS tag) paired with its chunk tag
                        for sent in train_sents]  # one list per training chunk tree
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents) # learns to predict IOB tags

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence) # predict an IOB tag for each (word, pos) token
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents] # flatten to (word, pos, chunk) triples
        return nltk.chunk.conlltags2tree(conlltags)

The only piece left to fill in is the feature extractor. We begin by defining a simple feature extractor which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker is very similar to the unigram chunker, as is reflected in its performance:

In [None]:
#Create your own feature extractor for the above algorithm
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

We can also add a feature for the previous part-of-speech tag. Adding this feature allows the classifier to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.



In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

Next, we'll try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (which corresponds to about a 10% reduction in the error rate).

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
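
The relation between a gain in percentage points and a relative reduction in error rate, mentioned above, is simple arithmetic. The scores in this sketch are hypothetical and chosen only to illustrate the calculation; they are not results from this notebook.

```python
def relative_error_reduction(old_score, new_score):
    """Fraction of the remaining error removed when a score (in %) improves."""
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error

# Hypothetical scores, 1.5 percentage points apart.
reduction = relative_error_reduction(85.0, 86.5)
print(f"{reduction:.0%}")
```

The same 1.5-point gain removes a larger share of the remaining error when the starting score is higher, which is why both figures are worth reporting.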

Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features, paired features, and complex contextual features. The last of these, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos, # lookahead feature
            "prevpos+pos": "%s+%s" % (prevpos, pos), # paired features
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)} # complex contextual feature

In [None]:
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT': # a determiner resets the set, since it usually starts a new chunk
            tags = set()
        else:
            tags.add(pos) # record every tag seen since the last determiner
    return '+'.join(sorted(tags))
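
To see what this feature looks like in practice, the sketch below repeats the function so that it runs on its own and applies it to every position of a sentence used earlier in the chapter.

```python
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # start afresh at each determiner
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
features = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), feature in zip(sentence, features):
    print(f"{word}/{pos}: {feature!r}")
```

The value is empty at the start of the sentence and immediately after a determiner, and it grows as more tags are seen. Because the tags are collected in a set, the two adjectives contribute a single JJ.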

In [None]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

# 4   Recursion in Linguistic Structure


## 4.1   Building Nested Structure with Cascaded Chunkers

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules. The code below has patterns for noun phrases, prepositional phrases, verb phrases, and sentences. This is a four-stage chunk grammar, and can be used to create structures having a depth of at most four.

In [22]:
import nltk
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


Unfortunately this result misses the VP headed by saw. It has other shortcomings too. Let's see what happens when we apply this chunker to a sentence having deeper nesting. Notice that it fails to identify the VP chunk starting at saw.

In [23]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process. We add an optional second argument loop to specify the number of times the set of patterns should be run:

In [24]:
cp = nltk.RegexpParser(grammar, loop=2)
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


 - **Note**: This cascading process enables us to create deep structures. However, creating and debugging a cascade is difficult, and there comes a point where it is more effective to do full parsing. Also, the cascading process can only produce trees of fixed depth (no deeper than the number of stages in the cascade), and this is insufficient for complete syntactic analysis.
 
# 4.2   Trees
 
A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that they are standardly drawn upside-down):

![](https://www.nltk.org/book/tree_images/ch07-tree-3.png)

We use a 'family' metaphor to talk about the relationships of nodes in a tree: for example, S is the parent of VP; conversely VP is a child of S. Also, since NP and VP are both children of S, they are also siblings. For convenience, there is also a text format for specifying trees:

     	
    (S
       (NP Alice)
       (VP
          (V chased)
          (NP
             (Det the)
             (N rabbit))))
             
Although we will focus on syntactic trees, trees can be used to encode any homogeneous hierarchical structure that spans a sequence of linguistic forms (e.g. morphological structure, discourse structure). In the general case, leaves and node values do not have to be strings.
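The bracketed text format maps naturally onto nested lists. As a rough illustration that does not depend on NLTK, the hypothetical function `parse_bracketed` below reads such a string into `[label, child, ...]` lists. It assumes well-formed input in which labels and leaves contain no parentheses or whitespace.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested [label, child, ...] lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())  # a subtree
            else:
                node.append(tokens[pos])   # a leaf
                pos += 1
        pos += 1                      # skip ")"
        return node

    return parse_node()

tree = parse_bracketed("(S (NP Alice) (VP (V chased) (NP (Det the) (N rabbit))))")
print(tree)
```

The first element of each list is the node label and the remaining elements are its children, so `tree[2]` is the VP subtree.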

In NLTK, we create a tree by giving a node label and a list of children:



In [25]:
tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)

(NP Alice)


In [26]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)

(NP the rabbit)


In [27]:
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

(S (NP Alice) (VP chased (NP the rabbit)))


In [28]:
# methods available for tree objects
print(tree4[1])

(VP chased (NP the rabbit))


In [29]:
tree4[1].label()

'VP'

In [30]:
tree4[1][1][1]

'rabbit'

The bracketed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful. It opens a new window, containing a graphical representation of the tree. The tree display window allows you to zoom in and out, to collapse and expand subtrees, and to print the graphical representation to a PostScript file (for inclusion in a document).



In [31]:
tree3.draw()

# 4.3   Tree Traversal

It is standard to use a recursive function to traverse a tree. The code below demonstrates this.

In [33]:
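# A tree must be built from a bracketed string with Tree.fromstring();
# passing the string straight to nltk.Tree() does not work.
def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t is a leaf (a plain string), so just print it
        print(t, end=" ")
    else:
        # Now we know that t is a Tree with a label
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
traverse(t)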



# 5   Named Entity Recognition

At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Commonly used NE types include ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY, and GPE. These should be self-explanatory, except for FACILITY: human-made artifacts in the domains of architecture and civil engineering; and GPE: geo-political entities such as city, state/province, and country.

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type. While named entity recognition is frequently a prelude to identifying relations in Information Extraction, it can also contribute to other tasks. For example, in Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user's question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer. Now suppose the question was Who was the first President of the US?, and one of the documents that was retrieved contained the following passage:

    The Washington Monument is the most prominent structure in Washington, D.C. and one of the city's early attractions. It was built in honor of George Washington, who led the country to independence and then became its first President.
    
Analysis of the question leads us to expect that an answer should be of the form X was the first President of the US, where X is not only a noun phrase, but also refers to a named entity of type PERSON. This should allow us to ignore the first sentence in the passage. While it contains two occurrences of Washington, named entity recognition should tell us that neither of them has the correct type.

How do we go about identifying named entities? One option would be to look up each word in an appropriate list of names. For example, in the case of locations, we could use a gazetteer, or geographical dictionary, such as the Alexandria Gazetteer or the Getty Gazetteer. However, doing this blindly runs into problems, as shown below.

![](https://www.nltk.org/images/locations.png)

Observe that the gazetteer has good coverage of locations in many countries, and incorrectly finds locations like Sanchez in the Dominican Republic and On in Vietnam. Of course we could omit such locations from the gazetteer, but then we won't be able to identify them when they do appear in a document.
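The problem with blind lookup is easy to reproduce. The sketch below uses a toy gazetteer: the entries for Sanchez and On come from the example above, while Paris and the test sentence are made up for illustration.

```python
# Toy gazetteer mapping place names to countries (illustrative entries only).
GAZETTEER = {"Sanchez": "Dominican Republic", "On": "Vietnam", "Paris": "France"}

def gazetteer_lookup(tokens):
    """Return (index, token, country) for every token found in the gazetteer."""
    return [(i, tok, GAZETTEER[tok]) for i, tok in enumerate(tokens) if tok in GAZETTEER]

tokens = "On Monday Mr. Sanchez flew to Paris".split()
hits = gazetteer_lookup(tokens)
for hit in hits:
    print(hit)
```

Only the last hit is a genuine location mention. The preposition and the surname are matched purely because their spelling coincides with a place name.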

It gets even harder in the case of names for people or organizations. Any list of such names will probably have poor coverage. New organizations come into existence every day, so if we are trying to deal with contemporary newswire or blog entries, it is unlikely that we will be able to recognize many of the entities using gazetteer lookup.

Another major source of difficulty is caused by the fact that many named entity terms are ambiguous. Thus May and North are likely to be parts of named entities for DATE and LOCATION, respectively, but could both be part of a PERSON; conversely Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION. A term like Yankee will be an ordinary modifier in some contexts, but will be marked as an entity of type ORGANIZATION in the phrase Yankee infielders.

Further challenges are posed by multi-word names like Stanford University, and by names that contain other names such as Cecil H. Green Library and Escondido Village Conference Service Center. In named entity recognition, therefore, we need to be able to identify the beginning and end of multi-token sequences.

Named entity recognition is a task that is well-suited to the type of classifier-based approach that we saw for noun phrase chunking. In particular, we can build a tagger that labels each word in a sentence using the IOB format, where chunks are labeled by their appropriate type. Here is part of the CoNLL 2002 (conll2002) Dutch training data:

    Eddy N B-PER
    Bonte N I-PER
    is V O
    woordvoerder N O
    van Prep O
    diezelfde Pron O
    Hogeschool N B-ORG
    . Punc O

In this representation, there is one token per line, each with its part-of-speech tag and its named entity tag. Based on this training corpus, we can construct a tagger that labels new sentences, and then use the nltk.chunk.conlltags2tree() function to convert the resulting tag sequences into a chunk tree.
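
To make the IOB encoding concrete, here is a minimal sketch that groups the tagged tokens above into entity spans. The helper `iob_to_spans` is a hypothetical, hand-rolled function written for this illustration; it is not how nltk.chunk.conlltags2tree() is implemented, and it returns plain tuples rather than a chunk tree.

```python
def iob_to_spans(tagged):
    """Group (word, pos, iob) triples into (entity_type, [words]) spans."""
    spans = []
    current_type, current_words = None, []
    for word, _pos, iob in tagged:
        starts_new = iob.startswith("B-") or (
            iob.startswith("I-") and iob[2:] != current_type
        )
        if starts_new:
            # B- always opens a chunk; a stray I- of a new type is treated the same way.
            if current_words:
                spans.append((current_type, current_words))
            current_type, current_words = iob[2:], [word]
        elif iob.startswith("I-"):
            # Continuation of the chunk that is currently open.
            current_words.append(word)
        else:
            # "O": outside any chunk, so close whatever was open.
            if current_words:
                spans.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, current_words))
    return spans

sentence = [
    ("Eddy", "N", "B-PER"), ("Bonte", "N", "I-PER"), ("is", "V", "O"),
    ("woordvoerder", "N", "O"), ("van", "Prep", "O"), ("diezelfde", "Pron", "O"),
    ("Hogeschool", "N", "B-ORG"), (".", "Punc", "O"),
]
spans = iob_to_spans(sentence)
print(spans)  # [('PER', ['Eddy', 'Bonte']), ('ORG', ['Hogeschool'])]
```

The B- tag marks where a name begins and the I- tags extend it, which is what lets a word-by-word tagger recover multi-token names such as Eddy Bonte.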

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.



In [34]:
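import nltk

# Requires the treebank sample and the NE chunker data (installable via nltk.download()).
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))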

(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN
  (NE Medicine/NNP)
  ./.)


In [35]:
print(nltk.ne_chunk(sent)) 

(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
  ./.)


# 6   Relation Extraction

Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for. The following example searches for strings that contain the word in. The special regular expression (?!\b.+ing) is a negative lookahead assertion that allows us to disregard strings such as success in supervising the transition of, where in is followed by a gerund.

In [36]:
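import re
import nltk

# Keep fillers that contain the word "in", unless a word ending in -ing follows it.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))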

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


Searching for the keyword in works reasonably well, though it will also retrieve false positives such as \[ORG: House Transportation Committee\] , secured the most money in the \[LOC: New York\]; there is unlikely to be a simple string-based method of excluding filler strings such as this.
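
The effect of the lookahead can be checked without any corpus. The sketch below applies the same pattern to a few hand-written (X, α, Y) triples; the triples list is an assumption made for illustration, and the names `Some Org` and `Some Place` are placeholders, not entities from the IEER corpus.

```python
import re

# Same pattern as above: the word "in", not followed later by a word ending in -ing.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hand-written (X, filler, Y) triples, for illustration only.
triples = [
    ("WHYY", "in", "Philadelphia"),
    ("Open Text", ", based in", "Waterloo"),
    ("Some Org", "success in supervising the transition of", "Some Place"),
    ("House Transportation Committee", ", secured the most money in the", "New York"),
]

# Keep a pair only when the pattern matches from the start of the filler.
kept = [(x, y) for x, filler, y in triples if IN.match(filler)]
for pair in kept:
    print(pair)
```

The gerund filler is rejected, but the House Transportation Committee triple is kept: nothing in the string itself distinguishes it from a genuine location relation.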

As shown above, the conll2002 Dutch corpus contains not just named entity annotation but also part-of-speech tags. This allows us to devise patterns that are sensitive to these tags, as shown in the next example. The function nltk.sem.clause() returns each relation in clausal form, where the binary relation symbol is given by the parameter relsym.

In [37]:
import re
import nltk
from nltk.corpus import conll2002

vnv = """
(
is/V|    # 3rd sing present and
was/V|   # past forms of the verb zijn ('be')
werd/V|  # and also past
wordt/V  # and present of worden ('become')
)
.*       # followed by anything
van/Prep # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for rel in nltk.sem.extract_rels('PER', 'ORG', doc,
                                     corpus='conll2002', pattern=VAN):
        print(nltk.sem.clause(rel, relsym="VAN"))

VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')


In [39]:
# Same search, but show the intervening words with left and right context.
for doc in conll2002.chunked_sents('ned.train'):
    for rel in nltk.sem.extract_rels('PER', 'ORG', doc,
                                     corpus='conll2002', pattern=VAN):
        print(nltk.sem.rtuple(rel, lcon=True, rcon=True))

...'')[PER: "Cornet/V d'Elzius/N"] 'is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep' [ORG: 'Buitenlandse/N Handel/N'](''...
...'')[PER: 'Johan/N Rottiers/N'] 'is/V informaticacoördinator/N van/Prep het/Art' [ORG: 'Kardinaal/N Van/N Roey/N Instituut/N']('in/Prep'...
...'Door/Prep rugproblemen/N van/Prep zangeres/N')[PER: 'Annie/N Lennox/N'] 'wordt/V het/Art concert/N van/Prep' [ORG: 'Eurythmics/N']('vandaag/Adv in/Prep'...
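
To see why each of these fillers is accepted, the tag-sensitive pattern can be tried directly on the word/tag strings. This is a minimal sketch: the first two fillers are copied from the output above, while the third is a hypothetical filler added to show a rejection.

```python
import re

VAN = re.compile(r"""
    (is/V | was/V | werd/V | wordt/V)   # a form of zijn ('be') or worden ('become')
    .*                                  # followed by anything
    van/Prep                            # followed by van ('of')
""", re.VERBOSE)

fillers = [
    "is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep",
    "wordt/V het/Art concert/N van/Prep",
    "van/Prep de/Art",  # hypothetical: no verb at the start, so it should be rejected
]

matches = [bool(VAN.match(f)) for f in fillers]
print(matches)  # [True, True, False]
```

The second filler matches even though van attaches to concert rather than to the person, which suggests that a pattern over words and tags alone can accept a pair like Annie Lennox and Eurythmics for the wrong reason.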
