# Natural Language Processing with Python
## Chapter 7 Extracting Information from Text
### 1. Information Extraction
#### 1.1 Information Extraction Architecture
### 2 Chunking
#### 2.1 Noun Phrase Chunking

In [1]:
import re,nltk  

In [10]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
grammar = r"NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
result = chunker.parse(sentence)
print(result)
result.draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


#### 2.2 Tag Patterns
#### 2.3 Chunking with Regular Expressions
#### 2.4 Exploring Text Corpora 

In [17]:
from nltk.corpus import brown
def FindChunk(corpus=brown,rule='Chunk:{<V.*><TO><V.*>}'):
    chunker = nltk.RegexpParser(rule)
    ChunkTag = rule.split(':')[0]
    for sent in brown.tagged_sents()[:200]:
        tree = chunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == ChunkTag:
                print(subtree)
FindChunk(brown)

(Chunk combined/VBN to/TO achieve/VB)
(Chunk continue/VB to/TO place/VB)
(Chunk serve/VB to/TO protect/VB)
(Chunk wanted/VBD to/TO wait/VB)
(Chunk allowed/VBN to/TO place/VB)
(Chunk expected/VBN to/TO become/VB)
(Chunk expected/VBN to/TO approve/VB)
(Chunk expected/VBN to/TO make/VB)
(Chunk intends/VBZ to/TO make/VB)
(Chunk seek/VB to/TO set/VB)
(Chunk like/VB to/TO see/VB)
(Chunk designed/VBN to/TO provide/VB)
(Chunk get/VB to/TO hear/VB)
(Chunk expects/VBZ to/TO tell/VB)
(Chunk expected/VBN to/TO give/VB)
(Chunk prefer/VB to/TO pay/VB)
(Chunk required/VBN to/TO obtain/VB)
(Chunk permitted/VBN to/TO teach/VB)
(Chunk designed/VBN to/TO reduce/VB)


#### 2.5 Chinking

In [20]:
ChinkRule = r'''
NP:
    {<.*>}
    }<VBD|IN>+{
'''
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
chinker = nltk.RegexpParser(ChinkRule)
print(chinker.parse(sentence))

(S
  (NP the/DT)
  (NP little/JJ)
  (NP yellow/JJ)
  (NP dog/NN)
  barked/VBD
  at/IN
  (NP the/DT)
  (NP cat/NN))


#### 2.6 Representing Chunks: Tags vs Trees
### 3 Developing and Evaluating Chunkers
#### 3.1 Reading IOB Format and the CoNLL 2000 Corpus
#### 3.2 Simple Evaluation and Baselines

In [21]:
from nltk.corpus import conll2000
TestSet = conll2000.chunked_sents('test.txt',chunk_types=['NP'])
grammar = r'NP:{<[CDJNP].*>+}'
chunker = nltk.RegexpParser(grammar)
print(chunker.evaluate(TestSet))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


In [4]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self,TrainSet):
        TrainData = [[(tag,ChunkTag) for word,tag,ChunkTag in nltk.chunk.tree2conlltags(sent)] for sent in TrainSet]
        self.tagger = nltk.UnigramTagger(TrainData)
    def parse(self,sentence):
        tag = [tag for word,tag in sentence]
        ChunkTag = [ChunkTag for tag,ChunkTag in self.tagger.tag(tag)]
        ConllTag = [(word,tag,ChunkTag) for ((word,tag),ChunkTag) in zip(sentence,ChunkTag)]
        return nltk.chunk.conlltags2tree(ConllTag)
TrainSet = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
UnigramChunker = UnigramChunker(TrainSet)
print(UnigramChunker.evaluate(TestSet))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


#### 3.3 Training Classifier-Based Chunkers

In [7]:
def ExChunkFea(sentence,i,history):
    word,pos = sentence[i]
    if i == 0:
        PrevWord,PrevPos = "<START>", "<START>"
    else:
        PrevWord,PrevPos = sentence[i-1]
    if i == len(sentence)-1:
        NextWord,NextPos = "<END>", "<END>"
    else:
        NextWord,NextPos = sentence[i+1]
    TagsSinceDt = set()
    for word,tag in sentence[:i]:
        if tag == 'DT':
            TagsSinceDt = set()
        else:
            TagsSinceDt.add(tag)
    TagsSinceDt = '+'.join(sorted(TagsSinceDt))
    feature = {
        'word':word,
        'pos':pos,
        'PrevPos':PrevPos,
        'NextPos':NextPos,
        'PrevAndPos':'{}+{}'.format(PrevPos,pos),
        'NextAndPos':'{}+{}'.format(pos,NextPos),
        'PosSinceDT':TagsSinceDt
    }
    return feature
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self,TrainSent):
        TrainSet = []
        for sent in TrainSent:
            history = []
            untag = nltk.tag.untag(sent)
            for i,(word,tag) in enumerate(sent):
                feature = ExChunkFea(untag,i,history)
                TrainSet.append((feature,tag))
                history.append(tag)
        self.classifer = nltk.MaxentClassifier.train(TrainSet)
    def tag(self,sentence):
        history = []
        for i,word in enumerate(sentence):
            feature = ExChunkFea(sentence,i,history)
            tag = self.classifer.classify(feature)
            history.append(tag)
        return zip(sentence,history)
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self,TrainSet):
        TrainData = [[((word,tag),ChunkTag) for word,tag,ChunkTag in nltk.chunk.tree2conlltags(sent)] for sent in TrainSet]
        self.tagger = ConsecutiveNPChunkTagger(TrainData)
    def parse(self,sentence):
        taggedSent = self.tagger.tag(sentence)
        ConllTag = [(word,tag,ChunkTag) for ((word,tag),ChunkTag) in taggedSent]
        return nltk.chunk.conlltags2tree(ConllTag)
chunker = ConsecutiveNPChunker(TrainSet)
print(chunker.evaluate(TestSet))

==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.441
             2          -0.23049        0.942
             3          -0.14128        0.954
             4          -0.11146        0.959
             5          -0.09671        0.961
             6          -0.08773        0.964
             7          -0.08154        0.966
             8          -0.07691        0.968
             9          -0.07324        0.970
            10          -0.07021        0.971
            11          -0.06765        0.972
            12          -0.06542        0.973
            13          -0.06347        0.974
            14          -0.06173        0.975
            15          -0.06016        0.976
            16          -0.05874        0.977
            17          -0.05744        0.977
            18          -0.05625        0.978
            19          -0.05515        0.979
   

### 4. Recursion in Linguistic Structure
#### 4.1 Building Nested Structure with Cascaded Chunkers
#### 4.2 Trees
#### 4.3 Tree Traversal

In [4]:
def traverse(tree):
    try:
        tree.label()
    except AttributeError:
        print(tree,end=' ')
    else:
        print('(',end='')
        for child in tree:
            traverse(child)
        print(')',end='')
t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
traverse(t)

((Alice )(chased (the rabbit )))

### 5. Named Entity Recognition

In [6]:
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent))

(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
  ./.)


### 6. Relation Extraction

In [7]:
INrule = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern=INrule):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


### 7. Summary
### 8. Further Reading
### 9. Exercises
#### 1. The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
#### 2. Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.

In [5]:
PluralRule = 'NPP:{<DT>?<CD>?<JJ>*<NNS>}'
sentence = [("these", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dogs", "NNS"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
chunker = nltk.RegexpParser(PluralRule)
print(chunker.parse(sentence))

(S
  (NPP these/DT little/JJ yellow/JJ dogs/NNS)
  barked/VBD
  at/IN
  the/DT
  cat/NN)


#### 3. Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.
#### 4. An early definition of chunk was the material that occurs between chinks. Develop a chunker that starts by putting the whole sentence in a single chunk, and then does the rest of its work solely by chinking. Determine which tags (or tag sequences) are most likely to make up chinks with the help of your own utility program. Compare the performance and simplicity of this approach relative to a chunker based entirely on chunk rules.
#### 5. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

In [12]:
GerundRule = 'NPG:{<DT>?<NN>?<VBG>+<NN>}'
sentence = 'He is an assistant managing editor'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tagged = [('He', 'PRP'),('is', 'VBZ'),('an', 'DT'),('assistant', 'NN'),('managing', 'VBG'),('editor', 'NN')]
chunker = nltk.RegexpParser(GerundRule)
print(chunker.parse(tagged))

(S He/PRP is/VBZ (NPG an/DT assistant/NN managing/VBG editor/NN))


#### 6. Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS".

In [18]:
CordRule = 'NPC:{<DT>?<N.*><CC><DT>?<N.*>}'
sentence = "July/NNP and/CC August/NNP"
tagged = [nltk.str2tuple(i) for i in sentence.split()]
chunker = nltk.RegexpParser(CordRule)
print(chunker.parse(tagged))

(S (NPC July/NNP and/CC August/NNP))


#### 7. Carry out the following evaluation tasks for any of the chunkers you have developed earlier. (Note that most chunking corpora contain some internal inconsistencies, such that any reasonable rule-based approach will produce errors.)
##### a. Evaluate your chunker on 100 sentences from a chunked corpus, and report the precision, recall and F-measure.
##### b. Use the `chunkscore.missed()` and `chunkscore.incorrect()` methods to identify the errors made by your chunker. Discuss.
##### c. Compare the performance of your chunker to the baseline chunker discussed in the evaluation section of this chapter.
#### 8. Develop a chunker for one of the chunk types in the CoNLL corpus using a regular-expression based chunk grammar RegexpChunk. Use any combination of rules for chunking, chinking, merging or splitting.
#### 9. Sometimes a word is incorrectly tagged, e.g. the head noun in "12/CD or/CC so/RB cases/VBZ". Instead of requiring manual correction of tagger output, good chunkers are able to work with the erroneous output of taggers. Look for other examples of correctly chunked noun phrases with incorrect tags.
#### 10. The bigram chunker scores about 90% accuracy. Study its errors and try to work out why it doesn't get 100% accuracy. Experiment with trigram chunking. Are you able to improve the performance any more?

In [26]:
from nltk.corpus import conll2000
class TrigramChunker(nltk.ChunkParserI):
    def __init__(self,TrainSet):
        TrainData = [[(tag,ChunkTag) for word,tag,ChunkTag in nltk.chunk.tree2conlltags(sent)] for sent in TrainSet]
        self.tagger = nltk.TrigramTagger(TrainData)
    def parse(self,sentence):
        tag = [tag for word,tag in sentence]
        ChunkTag = [ChunkTag for tag,ChunkTag in self.tagger.tag(tag)]
        ConllTag = [(word,tag,ChunkTag) for ((word,tag),ChunkTag) in zip(sentence,ChunkTag)]
        return nltk.chunk.conlltags2tree(ConllTag)
TrainSet = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
TrigramChunker = TrigramChunker(TrainSet)
print(UnigramChunker.evaluate(TestSet))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.5%%
    Recall:        86.8%%
    F-Measure:     84.6%%


#### 11. Apply the n-gram and Brill tagging methods to IOB chunk tagging. Instead of assigning POS tags to words, here we will assign IOB tags to the POS tags. E.g. if the tag DT (determiner) often occurs at the start of a chunk, it will be tagged B (begin). Evaluate the performance of these chunking methods relative to the regular expression chunking methods covered in this chapter.
#### 12. We saw in 5. that it is possible to establish an upper limit to tagging performance by looking for ambiguous n-grams, n-grams that are tagged in more than one possible way in the training data. Apply the same method to determine an upper bound on the performance of an n-gram chunker.
#### 13. Pick one of the three chunk types in the CoNLL corpus. Write functions to do the following tasks for your chosen type:
##### a. List all the tag sequences that occur with each instance of this chunk type.
##### b. Count the frequency of each tag sequence, and produce a ranked list in order of decreasing frequency; each line should consist of an integer (the frequency) and the tag sequence.
##### c. Inspect the high-frequency tag sequences. Use these as the basis for developing a better chunker.
#### 14. The baseline chunker presented in the evaluation section tends to create larger chunks than it should. For example, the phrase: `[every/DT time/NN] [she/PRP] sees/VBZ [a/DT newspaper/NN]` contains two consecutive chunks, and our baseline chunker will incorrectly combine the first two: `[every/DT time/NN she/PRP]`. Write a program that finds which of these chunk-internal tags typically occur at the start of a chunk, then devise one or more rules that will split up these chunks. Combine these with the existing baseline chunker and re-evaluate it, to see if you have discovered an improved baseline.
#### 15. Develop an NP chunker that converts POS-tagged text into a list of tuples, where each tuple consists of a verb followed by a sequence of noun phrases and prepositions, e.g. the little cat sat on the mat becomes `('sat', 'on', 'NP')`...
#### 16. The Penn Treebank contains a section of tagged Wall Street Journal text that has been chunked into noun phrases. The format uses square brackets, and we have encountered it several times during this chapter. The Treebank corpus can be accessed using: for sent in `nltk.corpus.treebank_chunk.chunked_sents(fileid)`. These are flat trees, just as we got using `nltk.corpus.conll2000.chunked_sents()`.
##### a. The functions `nltk.tree.pprint()` and `nltk.chunk.tree2conllstr()` can be used to create Treebank and IOB strings from a tree. Write functions `chunk2brackets()` and `chunk2iob()` that take a single chunk tree as their sole argument, and return the required multi-line string representation.
##### b. Write command-line conversion utilities `bracket2iob.py` and `iob2bracket.py` that take a file in Treebank or CoNLL format (resp) and convert it to the other format. (Obtain some raw Treebank or CoNLL data from the NLTK Corpora, save it to a file, and then use `for line in open(filename)` to access it from Python.)
#### 17. An n-gram chunker can use information other than the current part-of-speech tag and the n-1 previous chunk tags. Investigate other models of the context, such as the n-1 previous part-of-speech tags, or some combination of previous chunk tags along with previous and following part-of-speech tags.
#### 18. Consider the way an n-gram tagger uses recent tags to inform its tagging choice. Now observe how a chunker may re-use this sequence information. For example, both tasks will make use of the information that nouns tend to follow adjectives (in English). It would appear that the same information is being maintained in two places. Is this likely to become a problem as the size of the rule sets grows? If so, speculate about any ways that this problem might be addressed.