# 7. Extracting Information from Text

## 1) Information Extraction

#### See Slides

#### If this location data was stored in Python as a list of tuples (entity, relation, entity), then the question "Which organizations operate in Atlanta?" could be translated as follows:


In [None]:
import nltk, re

In [None]:
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

In [None]:
query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']

In [None]:
print(query)
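
#### The same lookup can be wrapped in a small function so that other locations or relations can be queried. This is only a sketch: the helper name entities_related_to and its parameters are hypothetical and are not part of NLTK.

```python
# Hypothetical helper: generalizes the list comprehension above.
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

def entities_related_to(triples, target, relation='IN'):
    """Return every first entity linked to `target` by `relation`."""
    return [e1 for (e1, rel, e2) in triples if rel == relation and e2 == target]

print(entities_related_to(locs, 'Atlanta'))
print(entities_related_to(locs, 'New York'))
```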

#### See Slides

### 1.1 Information Extraction Architecture

#### See Slides

#### To perform the first three tasks, we can define a function that simply connects together NLTK's default sentence segmenter [1], word tokenizer [2], and part-of-speech tagger [3]:

In [None]:
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)  #1
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  #2
    sentences = [nltk.pos_tag(sent) for sent in sentences]  #3
    print(sentences)

In [None]:
def ie_fullprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [nltk.ne_chunk(sent) for sent in sentences]  # add named-entity chunks
    print(sentences)

In [None]:
document_sample="I am a student at Baruch College in New York and Georgia-Pacific is in Atlanta. University of Missouri is in Columbia."

In [None]:
ie_preprocess(document_sample)

In [None]:
ie_fullprocess(document_sample)

In [None]:
sentences = nltk.sent_tokenize(document_sample) 
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences] 
tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]

In [None]:
tagged_sentences

In [None]:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

In [None]:
for i, sent in enumerate(tagged_sentences):
    sent=nltk.chunk.ne_chunk(sent)
    for rel in nltk.sem.extract_rels('ORGANIZATION', 'GPE', sent, corpus='ace', pattern = IN):
        print(nltk.sem.rtuple(rel))

In [None]:
of = re.compile(r'.*\bof\b(?!\b.+ing)')

In [None]:
for i, sent in enumerate(tagged_sentences):
    sent=nltk.chunk.ne_chunk(sent)
    for rel in nltk.sem.extract_rels('ORGANIZATION', 'GPE', sent, corpus='ace', pattern = of):
        print(nltk.sem.rtuple(rel))

#### See Slides

## 2)   Chunking

### 2.1 Noun Phrase Chunking

#### In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule [2]. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser [3], and test it on our example sentence [4]. The result is a tree, which we can either print [5], or display graphically [6].

In [None]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), 
("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}" # 2

#### Review regular expressions

In [None]:
cp = nltk.RegexpParser(grammar) #3

In [None]:
result = cp.parse(sentence) #4

In [None]:
print(result) #5

In [None]:
result.draw() #6

### 2.2 Tag Patterns
#### See Slides

### 2.3  Chunking with Regular Expressions

#### The following code shows a simple chunk grammar consisting of two rules. The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input:

In [None]:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   
      {<NNP>+} """

In [None]:
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [None]:
print(cp.parse(sentence))

#### If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:


In [None]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

In [None]:
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.parse(nouns))

#### Once we have created the chunk for money market, we have removed the context that would have permitted fund to be included in a chunk. This issue would have been avoided with a more permissive chunk rule, e.g. NP: {<NN>+}.
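
#### The behaviour described above resembles ordinary non-overlapping regular-expression matching. As a rough illustration, assuming we encode the tag sequence as a plain string (this only mimics the idea and is not NLTK's actual implementation), Python's re module shows the same effect:

```python
import re

# Encode the tag sequence of "money market fund" as one string of bracketed tags.
tags = "<NN><NN><NN>"

# A two-noun rule: matching resumes after each match, so the third <NN> is left over.
two_nouns = re.findall(r"(?:<NN>){2}", tags)

# A more permissive rule: one or more nouns are consumed greedily in a single match.
one_or_more = re.findall(r"(?:<NN>)+", tags)

print(two_nouns)
print(one_or_more)
```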


In [None]:
grammar1 = "NP: {<NN>+}"

In [None]:
cp = nltk.RegexpParser(grammar1)

In [None]:
print(cp.parse(nouns))

### 2.4   Exploring Text Corpora

#### In Chapter 5, we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

In [None]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

In [None]:
brown = nltk.corpus.brown

In [None]:
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': print(subtree)

In [None]:
#### Exercise 1. Encapsulate the above example inside a function find_chunks() that takes a chunk string like "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the Brown corpus for several other patterns, such as four or more nouns in a row, e.g. "NOUNS: {<N.*>{4,}}".

### 2.5 Chinking

#### See Slides

#### In the following example, we put the entire sentence into a single chunk, then excise the chinks.

In [None]:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """

In [None]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.parse(sentence))
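
#### To see the idea of chinking in isolation, here is a plain-Python sketch that treats the whole sentence as one chunk and cuts it wherever a chink tag occurs. The function name chink is hypothetical, and the sketch returns lists of words rather than an NLTK tree.

```python
from itertools import groupby

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

def chink(tagged_sentence, chink_tags):
    """Start from one big chunk and split it at every run of chink tags."""
    chunks = []
    # groupby yields alternating runs of chink and non-chink tokens.
    for is_chink, group in groupby(tagged_sentence, key=lambda wt: wt[1] in chink_tags):
        if not is_chink:
            chunks.append([word for word, _tag in group])
    return chunks

print(chink(sentence, {"VBD", "IN"}))
```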

### 2.6 Representing Chunks: Tags vs Trees

#### See Slides

## 3 Developing and Evaluating Chunkers (Slides)

### 3.1   Reading IOB Format and the CoNLL 2000 Corpus

#### A conversion function chunk.conllstr2tree() builds a tree representation from one of these multi-line strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:


In [None]:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''

In [None]:
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
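
#### To make the IOB format concrete, the following sketch reads such a multi-line string by hand and collects the words of each chunk of one type. The function name iob_to_chunks is hypothetical; it only illustrates how B-, I- and O tags delimit chunks and is not how conllstr2tree is implemented.

```python
def iob_to_chunks(text, chunk_type="NP"):
    """Collect the word sequences labelled B-/I-<chunk_type> in an IOB string."""
    chunks, current = [], []
    for line in text.strip().splitlines():
        word, _pos, iob = line.split()
        if iob == "B-" + chunk_type:
            # B- always opens a new chunk, closing any chunk in progress.
            if current:
                chunks.append(current)
            current = [word]
        elif iob == "I-" + chunk_type and current:
            # I- continues the chunk that is currently open.
            current.append(word)
        else:
            # O, or a tag of another chunk type, ends the current chunk.
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
"""
print(iob_to_chunks(sample))
```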

#### We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format. We can access the data using nltk.corpus.conll2000. Here is an example that reads the 100th sentence of the "train" portion of the corpus:

In [None]:
from nltk.corpus import conll2000

In [None]:
print(conll2000.chunked_sents('train.txt')[99])

#### As you can see, the CoNLL 2000 corpus contains three chunk types: NP chunks, which we have already seen; VP chunks such as has already delivered; and PP chunks such as because of. Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them:

In [None]:
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

### 3.2   Simple Evaluation and Baselines


#### Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:



In [None]:
from nltk.corpus import conll2000

In [None]:
cp = nltk.RegexpParser("")

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

In [None]:
print(cp.evaluate(test_sents))

#### The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e. not in an NP chunk. However, since our chunker did not find any chunks, its precision, recall, and f-measure are all zero.
#### Now let's try a naive regular expression chunker that looks for tags beginning with letters that are characteristic of noun phrase tags (e.g. CD, DT, and JJ).


In [None]:
grammar = r"NP: {<[CDJNP].*>+}"

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
print(cp.evaluate(test_sents))
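
#### Precision, recall and f-measure are computed over whole chunks rather than individual tags. The sketch below shows the arithmetic on hypothetical (start, end) chunk spans, assuming a balanced f-measure (the harmonic mean of precision and recall); the function chunk_scores is illustrative and not part of NLTK.

```python
def chunk_scores(gold_spans, guessed_spans):
    """Precision, recall and balanced F-measure over sets of (start, end) chunk spans."""
    gold, guessed = set(gold_spans), set(guessed_spans)
    correct = len(gold & guessed)  # a chunk counts only if both boundaries agree
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = [(0, 4), (6, 8)]  # hypothetical gold-standard NP spans

# A chunker that proposes no chunks scores zero on all three measures.
print(chunk_scores(gold, []))

# One exact match and one chunk with a wrong left boundary.
print(chunk_scores(gold, [(0, 4), (5, 8)]))
```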

### Unigram Chunker

#### As you can see, this approach achieves decent results. However, we can improve on it by adopting a more data-driven approach, where we use the training corpus to find the chunk tag (I, O, or B) that is most likely for each part-of-speech tag.

#### In other words, we can build a chunker using a unigram tagger (4). But rather than trying to determine the correct part-of-speech tag for each word, we are trying to determine the correct chunk tag, given each word's part-of-speech tag.
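
#### The core idea can be sketched without NLTK: count how often each part-of-speech tag carries each chunk tag, and keep the most frequent one. The toy training data and the function name train_unigram below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (part-of-speech tag, chunk tag) pairs per sentence.
train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("DT", "B-NP"), ("NN", "I-NP"), ("IN", "O"), ("NN", "B-NP")],
]

def train_unigram(tagged_sents):
    """Map each POS tag to the chunk tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk_tag in sent:
            counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

model = train_unigram(train)
print(model)
```

#### Here NN is seen twice as I-NP and once as B-NP, so the model always assigns I-NP to NN, regardless of context.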

#### We define the UnigramChunker class, which uses a unigram tagger to label sentences with chunk tags. 

#### Most of the code in this class is simply used to convert back and forth between the chunk tree representation used by NLTK's ChunkParserI interface, and the IOB representation used by the embedded tagger. The class defines two methods: a constructor [1] which is called when we build a new UnigramChunker; and the parse method [3] which is used to chunk new sentences.

In [None]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): #1
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data) #2
    def parse(self, sentence): #3
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

#### The constructor [1] expects a list of training sentences, which will be in the form of chunk trees. It first converts training data to a form that is suitable for training the tagger, using tree2conlltags to map each chunk tree to a list of word,tag,chunk triples. It then uses that converted training data to train a unigram tagger, and stores it in self.tagger for later use.

#### The parse method [3] takes a tagged sentence as its input, and begins by extracting the part-of-speech tags from that sentence. It then tags the part-of-speech tags with IOB chunk tags, using the tagger self.tagger that was trained in the constructor. Next, it extracts the chunk tags, and combines them with the original sentence, to yield conlltags. Finally, it uses conlltags2tree to convert the result back into a chunk tree.

#### Now that we have UnigramChunker, we can train it using the CoNLL 2000 corpus, and test its resulting performance:

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

In [None]:
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

In [None]:
unigram_chunker = UnigramChunker(train_sents)

In [None]:
print(unigram_chunker.evaluate(test_sents))

#### This chunker does reasonably well, achieving an overall f-measure score (slides) of 83%. Let's take a look at what it's learned, by using its unigram tagger to assign a tag to each of the part-of-speech tags that appear in the corpus:


In [None]:
postags = sorted(set(pos for sent in train_sents
                    for (word,pos) in sent.leaves()))

In [None]:
print(unigram_chunker.tagger.tag(postags))

In [None]:
#### It has discovered that most punctuation marks occur outside of NP chunks, with the exception of # and $, both of which are used as currency markers. 
#### It has also found that determiners (DT) and possessives (PRP$ and WP$) occur at the beginnings of NP chunks, while noun types (NN, NNP, NNPS, NNS) mostly occur inside of NP chunks.


### Bigram Chunker

#### Having built a unigram chunker, it is quite easy to build a bigram chunker: we simply change the class name to BigramChunker, and modify the line marked #2 in the UnigramChunker class above to construct a BigramTagger rather than a UnigramTagger. The resulting chunker has slightly higher performance than the unigram chunker:
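
#### What the bigram model adds is context: the chunk tag is chosen from the current part-of-speech tag together with the previously assigned chunk tag. The following is a simplified plain-Python sketch of that idea, with hypothetical function names and toy data; it has no backoff and is not NLTK's BigramTagger implementation.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map (previous chunk tag, current POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # no previous chunk tag at the start of a sentence
        for pos, chunk_tag in sent:
            counts[(prev, pos)][chunk_tag] += 1
            prev = chunk_tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_bigram(model, pos_tags):
    """Tag left to right; an unseen context yields None because there is no backoff."""
    prev, result = None, []
    for pos in pos_tags:
        prev = model.get((prev, pos))
        result.append((pos, prev))
    return result

# Hypothetical toy data: NN continues a chunk after B-NP but opens one after O.
train = [[("DT", "B-NP"), ("NN", "I-NP"), ("IN", "O"), ("NN", "B-NP")]]
model = train_bigram(train)
print(tag_bigram(model, ["DT", "NN", "IN", "NN"]))
```

#### Unlike the unigram model, this one can assign different chunk tags to the same part-of-speech tag depending on what came before.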



In [None]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
    def parse(self, sentence): 
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

## 4   Recursion in Linguistic Structure

### 4.1   Building Nested Structure with Cascaded Chunkers

#### So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules. The grammar below has patterns for noun phrases, prepositional phrases, verb phrases, and sentences. This is a four-stage chunk grammar, and can be used to create structures having a depth of at most four.

In [None]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

In [None]:
cp = nltk.RegexpParser(grammar)

In [None]:
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [None]:
print(cp.parse(sentence))

#### Unfortunately this result misses the VP headed by saw. It has other shortcomings too. Let's see what happens when we apply this chunker to a sentence having deeper nesting. Notice that it fails to identify the VP chunk starting at saw.

In [None]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [None]:
print(cp.parse(sentence))

#### The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process. We add an optional second argument loop to specify the number of times the set of patterns should be run:


In [None]:
cp = nltk.RegexpParser(grammar, loop=2)

In [None]:
print(cp.parse(sentence))

### 4.2   Trees


#### See Slides

#### In NLTK, we create a tree by giving a node label and a list of children:



In [None]:
tree1 = nltk.Tree('NP', ['Alice'])

In [None]:
print(tree1)

In [None]:
tree1.draw()

In [None]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)

In [None]:
tree2.draw()

#### We can incorporate these into successively larger trees as follows:



In [None]:
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

In [None]:
tree4.draw()

#### Here are some of the methods available for tree objects:



In [None]:
print(tree4[1])

In [None]:
tree4[1].label()

In [None]:
tree4.leaves()

In [None]:
tree4[1][1][1]

## 5   Named Entity Recognition


#### See Slides

#### NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True [1], then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [None]:
sent = nltk.corpus.treebank.tagged_sents()[22]

In [None]:
sent

In [None]:
print(nltk.ne_chunk(sent, binary=True))

In [None]:
print(nltk.ne_chunk(sent)) 

#### Exercise 2: Use the function nltk.ne_chunk() to tag named entities as PERSON, ORGANIZATION, GPE, etc. in the 100th sentence (or any other sentence) of the Brown corpus.

## 6   Relation Extraction

#### See Slides

In [None]:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
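
#### The pattern above is intended to accept filler text (the words between two entities) that contains the word "in", while the negative lookahead rejects cases where "in" is followed later by text ending in "ing", which is typically a gerund. The filler strings below are hypothetical examples used only to show which ones the pattern accepts.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical filler strings, i.e. the words found between two named entities.
fillers = ["based in", "is in", "success in supervising the transition of", "of"]

# match() anchors at the start of the string, so .* absorbs everything before "in".
results = {text: bool(IN.match(text)) for text in fillers}

for text, accepted in results.items():
    print(repr(text), accepted)
```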

In [None]:
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    print(doc)

In [None]:
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                corpus='ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))