 - [IE](http://www.nltk.org/howto/relextract.html)
 - [IE](https://www.nltk.org/book/ch07.html)
 - [Oreilly](https://learning.oreilly.com/library/view/natural-language-processing/9781787285101/ch04s07.html)
 - [rel_extracts](https://www.nltk.org/_modules/nltk/sem/relextract.html)

# Named-entity recognition (NER)
We have learned about taggers and parsers, which we can use to build a basic information extraction (IE) engine. Let's jump directly to a very basic IE engine and see how a typical one can be developed using NLTK.

Meaningful information can be extracted only if the input stream passes through each of the following NLP steps: sentence tokenization, word tokenization, POS tagging, NER, and relation extraction. We already understand the first three well enough, so let's discuss NER and relation extraction.

We briefly discussed NER in the last chapter. Essentially, NER is a way of extracting some of the most common entities, such as names, organizations, and locations. However, customized NER models can also be used to extract entities such as product names, biomedical entities, author names, brand names, and so on.

Let's start with a generic example: we are given some text and need to extract the most informative named entities from it.

The following code follows the pipeline outlined above. It runs the preprocessing steps (sentence tokenization, word tokenization, and POS tagging) and then uses NLTK's pre-trained NER model to extract the named entities. Only the last two sentences are chunked and printed:

In [1]:
import nltk

In [27]:
text = "In October 1988, Bill Denbrough gives his six-year-old brother, Georgie, a paper sailboat."\
        " Georgie sails the boat along the rainy streets of small town Derry, and is disappointed when it falls down a storm drain."\
        " As he attempts to retrieve it, Georgie sees a clown in the sewer, who introduces himself as Pennywise."\
        " The clown entices Georgie to come closer, then severs his arm and drags him into the sewer."\
        " The following summer, Bill and his friends - Richie Tozier, Eddie Kaspbrak, and Stan Uris - run afoul of older bully Henry Bowers and his gang."\
        " Bill, still haunted by Georgie's disappearance and the resulting neglect from his grief-stricken parents, discovers that his brother's body may have washed up in a marshy wasteland called the Barrens."\
        " He recruits his friends to investigate, believing his brother may still be alive."\
        " Ben Hanscom learns that the town has been plagued by unexplained tragedies and child disappearances for centuries."\
        " He is targeted by Bowers' gang, after which he flees into the Barrens and meets Bill's group."\
        " They find the sneaker of a missing girl, while a member of the pursuing Bowers Gang, Patrick Hockstetter, is killed by Pennywise while searching the sewers for Ben."\
        " The film was produced by Universal Studios Inc and supported by the BBC in London."

In [28]:
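# Split the text into sentences, split each sentence into tokens, then POS-tag the tokens.
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# Run the pre-trained named-entity chunker on the last two sentences only.
for sent in tagged_sentences[-2:]:
    tree = nltk.ne_chunk(sent)
    print(tree)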

(S
  They/PRP
  find/VBP
  the/DT
  sneaker/NN
  of/IN
  a/DT
  missing/VBG
  girl/NN
  ,/,
  while/IN
  a/DT
  member/NN
  of/IN
  the/DT
  pursuing/VBG
  (ORGANIZATION Bowers/NNP Gang/NNP)
  ,/,
  (PERSON Patrick/NNP Hockstetter/NNP)
  ,/,
  is/VBZ
  killed/VBN
  by/IN
  (PERSON Pennywise/NNP)
  while/IN
  searching/VBG
  the/DT
  sewers/NNS
  for/IN
  (PERSON Ben/NNP)
  ./.)
(S
  The/DT
  film/NN
  was/VBD
  produced/VBN
  by/IN
  (ORGANIZATION Universal/NNP Studios/NNP Inc/NNP)
  and/CC
  supported/VBN
  by/IN
  the/DT
  (ORGANIZATION BBC/NNP)
  in/IN
  (GPE London/NNP)
  ./.)
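
The printed trees are convenient to read, but downstream code usually wants the entities as plain data. The following is a minimal sketch of one way to pull (label, text) pairs out of a chunk tree. To keep it runnable without downloading any models, it rebuilds the second tree above by hand; the helper name `named_entities` is an assumption made for illustration, not part of NLTK.

```python
from nltk import Tree

# Hand-built copy of the second tree printed above, so the sketch
# does not depend on any downloaded NLTK models.
tree = Tree('S', [
    ('The', 'DT'), ('film', 'NN'), ('was', 'VBD'), ('produced', 'VBN'), ('by', 'IN'),
    Tree('ORGANIZATION', [('Universal', 'NNP'), ('Studios', 'NNP'), ('Inc', 'NNP')]),
    ('and', 'CC'), ('supported', 'VBN'), ('by', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('BBC', 'NNP')]),
    ('in', 'IN'),
    Tree('GPE', [('London', 'NNP')]),
    ('.', '.'),
])


def named_entities(chunked):
    """Return (label, text) pairs for every entity chunk directly under the root."""
    entities = []
    for child in chunked:
        # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(child, Tree):
            text = ' '.join(token for token, _tag in child.leaves())
            entities.append((child.label(), text))
    return entities


entities = named_entities(tree)
print(entities)
```

The same helper should work on the trees returned by the chunking loop above, since those are also `Tree` objects whose entity chunks sit directly under the root.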


# Relation extraction

Relation extraction is another commonly used information extraction operation. As the name suggests, it is the process of extracting the relationships that hold between entities. Many kinds of relationships can exist between entities; we have already seen relationships such as inheritance, synonymy, and analogy. How a relation is defined depends on the information need. For example, if we want to find out from unstructured text who wrote which book, then authorship is a relation between an author name and a book name. With NLTK, the idea is to reuse the IE pipeline we built up to the NER step and extend it with a relation pattern based on the NER tags.
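
To make the idea of a relation pattern over NER tags concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation, only an illustration of the principle. It uses an organization-location relation rather than authorship, and reuses the last sentence of the NER example. The flat `chunks` layout, the `'O'` label for plain text, and the helper `find_relations` are all assumptions made for this sketch.

```python
import re

# Flat view of the last example sentence as (label, text) pairs.
# The label 'O' (an assumption of this sketch) marks plain text between entities.
chunks = [
    ('O', 'The film was produced by'),
    ('ORGANIZATION', 'Universal Studios Inc'),
    ('O', 'and supported by the'),
    ('ORGANIZATION', 'BBC'),
    ('O', 'in'),
    ('GPE', 'London'),
    ('O', '.'),
]


def find_relations(chunks, subj_label, obj_label, pattern):
    """Collect (subject, filler, object) triples.

    A triple is kept when an entity of type subj_label is followed by one
    plain-text chunk that matches pattern and then an entity of type obj_label.
    """
    relations = []
    for i in range(len(chunks) - 2):
        (s_label, s_text), (f_label, filler), (o_label, o_text) = chunks[i:i + 3]
        if (s_label == subj_label and f_label == 'O'
                and o_label == obj_label and pattern.match(filler)):
            relations.append((s_text, filler, o_text))
    return relations


# Relation pattern: the connecting text must contain the word "in".
located_in = re.compile(r'.*\bin\b')
relations = find_relations(chunks, 'ORGANIZATION', 'GPE', located_in)
print(relations)
```

Only the BBC and London pair is kept here: Universal Studios Inc is followed by another organization rather than a location, so it does not satisfy the requested entity types.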

In the following code, we use NLTK's built-in ieer corpus, whose sentences are already annotated up to the NER level, so we only need to specify the relation pattern and the entity types that the relation should connect. Here, we define a relation between an organization and a location and extract every pair that matches the pattern. This has many applications; for example, in a large corpus of unstructured text, we can identify organizations of interest along with their corresponding locations:

In [8]:
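import re
# Match connecting text that contains the word "in", unless that "in" is followed by text containing "ing".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
# Extract ORG-LOC pairs from one document whose connecting text matches the pattern.
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))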

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
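
The pattern itself deserves a closer look. In each line of the output, the quoted text between the two entities is the filler that the pattern was tested against. The sketch below tries the same regular expression on a few fillers using only the standard `re` module. It assumes the pattern is applied with `match`, that is, from the start of the filler. The third filler is a made-up example, not taken from the output above.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    'in',                                        # from the output above
    ', the research group in',                   # from the output above
    'success in supervising the transition of',  # made-up: "in" followed by a gerund
]

# Map each filler to whether the pattern accepts it.
matches = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, accepted in matches.items():
    print(f'{filler!r}: {accepted}')
```

The negative lookahead `(?!\b.+ing)` rejects an "in" that is followed by any text containing "ing", which is a rough way to filter out phrases such as "success in supervising", where "in" does not express a location.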
