for a simple information extraction system. It begins by processing a document using several of the procedures discussed in Chapters 3 and 5: first, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.


In [1]:
# Import the libraries used for preprocessing and download the Punkt tokenizer models
import nltk
import re
import pprint
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's first talk about chunking: grouping tokens into multi-word units such as noun phrases, which do not necessarily refer to entities in the same way as definite NPs and proper names.

In [2]:
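# Chunks in IOB format
from nltk.corpus import conll2000
# CoNLL-2000: a corpus of chunked sentences in IOB format
nltk.download('conll2000')
# Baseline: a chunker with an empty grammar, which creates no chunks
cp = nltk.RegexpParser("")
# Keep only NP chunks, since we are only interested in noun phrases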
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%


The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e., not in an NP chunk. However, since our tagger did not find any chunks, its precision, recall, and F-measure are all zero. 
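
To make the gap between IOB accuracy and the chunk-level scores concrete, here is a minimal pure-Python sketch of how the two kinds of score can be computed. The helper names `iob_to_chunks` and `score` and the toy tag sequences are illustrative assumptions, not part of NLTK, and the sketch assumes well-formed IOB tags with a single chunk type.

```python
def iob_to_chunks(tags):
    """Return the set of (start, end) spans encoded by a list of IOB tags.

    Assumes well-formed tags (every chunk starts with B-) and one chunk type.
    """
    chunks, start = set(), None
    for i, tag in enumerate(tags):
        # An O tag or a new B- tag closes any chunk that is currently open.
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                chunks.add((start, i))
                start = None
        if tag.startswith("B-"):
            start = i
    if start is not None:
        chunks.add((start, len(tags)))
    return chunks


def score(gold, pred):
    """Return (IOB accuracy, chunk precision, chunk recall)."""
    # IOB accuracy is per token: how many tags agree with the gold tags.
    iob_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Precision and recall are per chunk: a chunk counts only if its span matches exactly.
    gold_chunks, pred_chunks = iob_to_chunks(gold), iob_to_chunks(pred)
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    return iob_accuracy, precision, recall


# Toy gold tags (hypothetical) and a chunker that, like the empty grammar, tags everything O.
gold = ["B-NP", "I-NP", "O", "B-NP", "O", "O"]
pred = ["O"] * len(gold)
print(score(gold, pred))  # (0.5, 0.0, 0.0)
```

Half of the toy tokens are O, so the all-O prediction gets half of the tags right while finding none of the chunks.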

In [3]:
# Let's try a naive regular expression that looks for tags beginning with letters
# that are characteristic of noun phrase tags (e.g. CD, DT and JJ)
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%
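
To see which part-of-speech tags the character class in the grammar above admits, the following sketch tests a handful of Penn Treebank tags with Python's `re` module. This only approximates the behaviour: it checks each tag on its own rather than going through NLTK's tag-pattern machinery, and the tag list is a hand-picked sample.

```python
import re

# Approximation (an assumption): match each tag individually against the
# bracketed part of the grammar, instead of using NLTK's tag-pattern matching.
TAG_PATTERN = re.compile(r"[CDJNP].*")

# A hand-picked sample of Penn Treebank tags.
tags = ["DT", "JJ", "NN", "NNP", "CD", "PRP", "CC", "POS", "VBZ", "IN", "RB"]

matched = [t for t in tags if TAG_PATTERN.fullmatch(t)]
rejected = [t for t in tags if not TAG_PATTERN.fullmatch(t)]

print(matched)   # ['DT', 'JJ', 'NN', 'NNP', 'CD', 'PRP', 'CC', 'POS']
print(rejected)  # ['VBZ', 'IN', 'RB']
```

The pattern is broader than the tags named in the comment: it also admits conjunctions (CC) and possessive markers (POS), which is one plausible reason a rule this simple both over- and under-chunks.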


These are decent results; however, we can improve on them by adopting a more data-driven approach.

In [0]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Convert each chunk tree to (POS tag, chunk tag) pairs; the words are dropped
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        # A bigram chunker could be used instead; it performs slightly better than the unigram one
        self.tagger = nltk.UnigramTagger(train_data)
        # Show the first training sentence as (POS tag, chunk tag) pairs
        print(train_data[0])

    def parse(self, sentence):
        # Predict a chunk tag for each POS tag, then rebuild the chunk tree
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
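
As a rough picture of what the unigram tagger inside the chunker learns, the sketch below maps each POS tag to the chunk tag it co-occurs with most often. It is a simplified stand-in written with the standard library, not NLTK's implementation; `train_unigram` and the two toy training sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each POS tag to the chunk tag seen with it most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


# Toy training data (hypothetical) in the same (POS tag, chunk tag) shape
# that the chunker builds from the CoNLL-2000 trees.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "O")],
    [("NN", "B-NP"), ("IN", "O"), ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
]
model = train_unigram(train)

# NN was seen twice as I-NP and once as B-NP, so I-NP wins.
print(model["NN"])  # I-NP
# Tags never seen in training (RB here) have no entry in the model.
print([model.get(pos) for pos in ["DT", "JJ", "NN", "RB"]])  # ['B-NP', 'I-NP', 'I-NP', None]
```

Because each POS tag always receives the same chunk tag regardless of context, a model like this cannot tell a noun that starts a chunk from one that continues it, which is the kind of limitation a bigram model partly addresses.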


In [5]:
test_sents = conll2000.chunked_sents('test.txt',chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt',chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

[('NN', 'B-NP'), ('IN', 'O'), ('DT', 'B-NP'), ('NN', 'I-NP'), ('VBZ', 'O'), ('RB', 'O'), ('VBN', 'O'), ('TO', 'O'), ('VB', 'O'), ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('IN', 'O'), ('NN', 'B-NP'), ('NNS', 'I-NP'), ('IN', 'O'), ('NNP', 'B-NP'), (',', 'O'), ('JJ', 'O'), ('IN', 'O'), ('NN', 'B-NP'), ('NN', 'B-NP'), (',', 'O'), ('VB', 'O'), ('TO', 'O'), ('VB', 'O'), ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('IN', 'O'), ('NNP', 'B-NP'), ('CC', 'I-NP'), ('NNP', 'I-NP'), ('POS', 'B-NP'), ('JJ', 'I-NP'), ('NNS', 'I-NP'), ('.', 'O')]
ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


To read more about automatic chunking, see the book, which builds an example chunker that reaches 96% IOB accuracy.

In [25]:
# Back to named entity recognition.
# NLTK provides a pretrained classifier for recognizing named entities, accessed with the
# function nltk.ne_chunk().
# If we set the parameter binary=True, then named entities are just tagged as NE;
# otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.
nltk.download('maxent_ne_chunker') 
nltk.download('treebank')
nltk.download('words')
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent,binary=True))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN
 