# T-725 Natural Language Processing: Lab 8
In today's lab, we will be working with named entity recognition and information extraction.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [2]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Named entity recognition
NLTK includes a classifier for tagging named entities, which is described in [Chapter 7.5](https://www.nltk.org/book/ch07.html#sec-ner) of the NLTK book.

In [3]:
sent = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

sent_tokens = nltk.word_tokenize(sent)
sent_tagged = nltk.pos_tag(sent_tokens)
sent_ner = nltk.ne_chunk(sent_tagged)

print(sent_ner)

(S
  The/DT
  2020/CD
  Nobel/NNP
  Prize/NNP
  in/IN
  (GPE Physics/NNP)
  is/VBZ
  awarded/VBN
  to/TO
  (PERSON Roger/NNP Penrose/NNP)
  ,/,
  (PERSON Reinhard/NNP Genzel/NNP)
  and/CC
  (PERSON Andrea/NNP Ghez/NNP)
  for/IN
  their/PRP$
  work/NN
  on/IN
  black/JJ
  holes/NNS
  ./.)
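
The chunked groups in the printed tree above can be collected into `(label, text)` pairs. The following is a minimal sketch that works on the printed form of the tree using only the standard library; the helper name `extract_entities` is made up for illustration. With a live tree object you would normally walk its subtrees instead. Note that the output includes the chunker's questionable `GPE` label for "Physics".

```python
import re
from collections import Counter

# Printed form of the chunk tree shown above (tokens after the last chunk omitted).
tree_text = """(S
  The/DT
  2020/CD
  Nobel/NNP
  Prize/NNP
  in/IN
  (GPE Physics/NNP)
  is/VBZ
  awarded/VBN
  to/TO
  (PERSON Roger/NNP Penrose/NNP)
  ,/,
  (PERSON Reinhard/NNP Genzel/NNP)
  and/CC
  (PERSON Andrea/NNP Ghez/NNP)
  ./.)"""


def extract_entities(text):
    """Return (label, entity text) pairs for each innermost '(LABEL word/TAG ...)' group."""
    entities = []
    for label, chunk in re.findall(r'\((\w+) ([^()]+)\)', text):
        # Drop the POS tag that follows the last '/' in each token.
        words = [token.rsplit('/', 1)[0] for token in chunk.split()]
        entities.append((label, ' '.join(words)))
    return entities


entities = extract_entities(tree_text)
for label, name in entities:
    print(f"{label}: {name}")

# Count how often each entity category occurs.
label_counts = Counter(label for label, _ in entities)
print(label_counts)
```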


The NLTK book shows a list of commonly used named entity categories along with examples:

NE Type | Examples
--- | ---
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 AM on Monday, October 23rd. Remember to save your file before uploading it.

## Question 1
Use `nltk.ne_chunk(tagged_sentence)` to identify the named entities in the sentences below. Note that you have to tokenize and tag the sentences first. Print out and review the trees. Find at least one error and leave a description of it as a comment or in a text cell below.

In [4]:
# On this day, October 16th (from https://en.wikipedia.org/wiki/October_16):
sentences = [
    "1813 – The Sixth Coalition attacks Napoleon in the three-day Battle of Leipzig.",
    "1923 – The Walt Disney Company is founded.",
    "1968 – Yasunari Kawabata becomes the first Japanese person to be awarded the Nobel Prize in Literature.",
    "1975 – Three-year-old Rahima Banu, from Bangladesh, is the last known case of naturally occurring smallpox.",
    "2002 – The Bibliotheca Alexandrina opens in Egypt, commemorating the ancient library of Alexandria."
]


In [5]:
# Your solution here

sent_ner_2 = []

for sentence in sentences:
  sent_tokens_2 = nltk.word_tokenize(sentence)
  sent_tagged_2 = nltk.pos_tag(sent_tokens_2)
  ne_chunk = nltk.ne_chunk(sent_tagged_2)
  sent_ner_2.append(ne_chunk)
  print(ne_chunk)

'''
Some words are not classified as expected. For example:
  - "three-day" should be TIME (not classified)
  - "Napoleon" should be PERSON (not classified)
while others are classified correctly. For example:
  - "Walt Disney Company" is ORGANIZATION
'''

(S
  1813/CD
  –/VBZ
  The/DT
  (ORGANIZATION Sixth/JJ Coalition/NNP)
  attacks/NNS
  Napoleon/NNP
  in/IN
  the/DT
  three-day/JJ
  Battle/NNP
  of/IN
  (GPE Leipzig/NNP)
  ./.)
(S
  1923/CD
  –/VBZ
  The/DT
  (ORGANIZATION Walt/NNP Disney/NNP Company/NNP)
  is/VBZ
  founded/VBN
  ./.)
(S
  1968/CD
  –/NNP
  Yasunari/NNP
  Kawabata/NNP
  becomes/VBZ
  the/DT
  first/JJ
  (GPE Japanese/JJ)
  person/NN
  to/TO
  be/VB
  awarded/VBN
  the/DT
  (ORGANIZATION Nobel/NNP Prize/NNP)
  in/IN
  (GPE Literature/NNP)
  ./.)
(S
  1975/CD
  –/JJ
  Three-year-old/NNP
  (PERSON Rahima/NNP Banu/NNP)
  ,/,
  from/IN
  (GPE Bangladesh/NNP)
  ,/,
  is/VBZ
  the/DT
  last/JJ
  known/JJ
  case/NN
  of/IN
  naturally/RB
  occurring/VBG
  smallpox/NN
  ./.)
(S
  2002/CD
  –/VBZ
  The/DT
  (ORGANIZATION Bibliotheca/NNP Alexandrina/NNP)
  opens/VBZ
  in/IN
  (GPE Egypt/NNP)
  ,/,
  commemorating/VBG
  the/DT
  ancient/JJ
  library/NN
  of/IN
  (GPE Alexandria/NNP)
  ./.)



## Question 2
[SpaCy](https://spacy.io/) is another NLP library for Python. Try out its named entity recognition system on the sentences in Question 1. Does it repeat any of the mistakes that NLTK makes? Does it make any errors that NLTK doesn't? Leave your answer as a comment or in a text cell below.

In [6]:
import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Example
text = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [7]:
# Your solution here

for sentence in sentences:
  doc = nlp(sentence)
  displacy.render(doc, style="ent", jupyter=True)

'''
For the same words used as examples above, these are the results with SpaCy:
  - "three-day" should be TIME but is classified as DATE (better than NLTK, which did not classify it)
  - "Napoleon" should be PERSON but is classified as ORG (better than NLTK, which did not classify it)
  - "Walt Disney Company" is correctly classified as ORGANIZATION

Compared to NLTK, SpaCy recognizes more named entities, but
some of them are still labeled incorrectly.
'''


## Question 3
Use regular expressions to try to find instances of the following relationships in the `reuters` corpus:
1. Organizations or companies and their subsidiaries, divisions or parts, e.g.:
  * *Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S*
  * *Merrill Lynch Capital Partners, a unit of Merrill Lynch*
2. Executives and the companies they work for, e.g.:
  * *Isao Nakamura, president of Higashi Nippon*
  * *Henry Rosenberg, chairman of Crown Central Petroleum*

Your results don't have to be perfect! Getting a few relevant matches is enough, but try to keep irrelevant results to a minimum.

In [8]:
import re
from nltk.corpus import reuters
nltk.download('reuters')

# Create a copy of the text where there's only a single space between each word
text = " ".join(reuters.raw().split())

# Example
for m in re.findall(r'(?: [A-Z][a-z]+)+ said it acquired (?:[A-Z][a-z]+ )+', text):
  print(m)

# Note how normal groups and non-capturing groups work with re.findall():
# a_string = "a a b"
# re.findall(r'(a )+b', a_string): ['a '] (normal group)
# re.findall(r'(?:a )+b', a_string): ['a a b'] (non-capturing group)

[nltk_data] Downloading package reuters to /root/nltk_data...


 Douglas Corp said it acquired Frampton Computer Services 
 Corp said it acquired Private Formulations Inc 
 Forstmann Little said it acquired Sybron 
 Southmark Corp said it acquired Berg Ventures 
 Sico said it acquired Sterling 
 First Financial Management Corp said it acquired Confidata 
 Philadelphia Suburban Corp said it acquired Mentor Systems 
 Medar Inc said it acquired Automatic Inspection Devices 
 Stryker Corp said it acquired Hexcel Medical 
 Inspeech Inc said it acquired Norma Bork Associates Inc 
 Olin Hunt Specialty Products Inc said it acquired Image Technology Corp 
 Enro Holding Corp said it acquired Enro Shirt Co 
 Seal Inc said it acquired Ademco 


In [9]:
org_regex = r'\b(?:[A-Z][\w&\'\s]+)(?:,\s+a\s+)?(?:subsidiary|division|part)\s+of\s+(?:[A-Z][\w&\'\s]+)\b'
exec_regex = r'\b(?:[A-Z][\w&\'\s]+),\s+(?:president|chairman|CEO|executive)\s+of\s+(?:[A-Z][\w&\'\s]+)\b'

print("\n1. Subsidiaries")
for m in re.findall(org_regex, text)[:10]:
  print(" - "+m)

print("\n2. Executives")
for m in re.findall(exec_regex, text)[:10]:
  print(" - "+m)



1. Subsidiaries
 - Spie Batignolles, a subsidiary of Schneider SA &lt
 - Company is a subsidiary of Switzerland's BBC AG Brown Boveri und Cie &lt
 - CTS Magma Copper Co, a subsidiary of Newmont Mining Corp
 - UNIT TO RAISE HEAVY FUEL PRICES Scallop Petroleum Corp, a subsidiary of Royal Dutch
 - Industry sources told Reuters yesterday that Fundamental was close to acquiring the government securities brokerage division of MKI
 - The company added that Isis will operate as part of McDonnell Douglas Information Systems International
 - Warner will become a wholly owned subsidiary of AV Holdings
 - Allied Stores Corp, a subsidiary of Campeau Corp
 - Canadian subsidiary of Rothmans International Plc &lt
 - ARGOSystems will operate as a wholly owned subsidiary of Boeing Co

2. Executives
 - Edward Brennan, chairman of Sears Roebuck and Co &lt
 - Lichtblau, president of Petroleum Industry Research Associates
 - James Burke, president of Merrill Lynch Capital Partners
 - Robert Campeau, chairm

## Question 4
It's much easier to extract relationships from text that is tagged with named entities. This can be accomplished using the `nltk.sem.extract_rels()` function, as described in [Chapter 7.6](https://www.nltk.org/book/ch07.html#relation-extraction) of the NLTK book. The function takes two named entity categories and a regular expression as arguments and returns all instances where the pattern occurs between the two categories (allowing for up to 10 tokens between them, by default).

The `ieer` (Information Extraction and Entity Recognition) corpus contains named entity annotations, such as `PER`, `ORG` and `LOC`. Find some instances of the following relationships using `nltk.sem.extract_rels()`:
1. Professors and the organizations they work for, e.g.:
  * *Roger Goldman, a law professor at St. Louis University*
2. Family members, e.g.:
  * *Louis XIV and his brother, Philippe*
  * *Mildred Rosenbaum and her husband Stanley*
3. People and where they are from, e.g.:
  * *Anna Rechnio of Poland*

In [10]:
from nltk.corpus import ieer
nltk.download('ieer')

# Example
pattern = re.compile(r'.*\bacquired?\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('ORG', 'ORG', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

[nltk_data] Downloading package ieer to /root/nltk_data...
[nltk_data]   Unzipping corpora/ieer.zip.


[ORG: 'Omnicom'] 'moved to acquire' [ORG: 'GGT']
[ORG: 'BDDP'] 'was acquired last year by' [ORG: 'GGT']
[ORG: 'Safeway Stores'] 'acquired' [ORG: 'Mutual']


In [28]:
# Your solution here
print("\nProfessors and their organizations:\n")
pattern_1 = re.compile(r'.*\bprofessor at?\b')
for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('PER', 'ORG', doc, 'ieer', pattern_1):
    print(nltk.sem.rtuple(rel))
print("\nFamily members:\n")
pattern_2 = re.compile(r'.*\bhis?\b')
for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('PER', 'PER', doc, 'ieer', pattern_2):
    print(nltk.sem.rtuple(rel))
print("\nPeople and where they are from:\n")
pattern_3 = re.compile(r'.*\bfrom?\b')
for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('PER', 'LOC', doc, 'ieer', pattern_3):
    print(nltk.sem.rtuple(rel))


Professors and their organizations:

[PER: 'Pepper Schwartz'] ', a sociology professor at the' [ORG: 'University of Washington']
[PER: 'Roger Goldman'] ', a law professor at' [ORG: 'St. Louis University']
[PER: 'Joseph Jacobson'] ', an assistant professor at' [ORG: 'MIT']

Family members:

[PER: 'Yeltsin'] 'fired his Cabinet and named' [PER: 'Kiriyenko']
[PER: 'Jack N. Berkman'] ', an alumnus, and his wife,' [PER: 'Lillian R. Berkman']
[PER: 'Louis XIV'] 'and his brother,' [PER: 'Philippe']
[PER: 'Wilson'] 'brought his costume designer,' [PER: 'Frida Parmeggiani']
[PER: 'Moss'] 'and his longtime partner,' [PER: 'Stan Dragoti']
[PER: 'Clinton'] 'and his wife,' [PER: 'Hillary Rodham Clinton']
[PER: 'Ismoil'] 'did not testify in the trial, but his lawyer,' [PER: 'Louis R. Aidala']
[PER: 'Louis R. Aidala'] ', often sought to distance his client from' [PER: 'Yousef']
[PER: 'Brosius'] 'tapped his chest, acknowledging his mistake. Later,' [PER: 'Strawberry']
[PER: 'Johnson'] 'with his usual 