<a href="https://colab.research.google.com/github/giorgiosld/Natural-Language-Processing/blob/main/labs/lab8/T_725_Lab08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T-725 Natural Language Processing: Lab 8
In today's lab, we will be working with named entity recognition and information extraction.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [1]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Named entity recognition
NLTK includes a classifier for tagging named entities, which is described in [Chapter 7.5](https://www.nltk.org/book/ch07.html#sec-ner) of the NLTK book.

In [2]:
sent = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

sent_tokens = nltk.word_tokenize(sent)
sent_tagged = nltk.pos_tag(sent_tokens)
sent_ner = nltk.ne_chunk(sent_tagged)

print(sent_ner)

(S
  The/DT
  2020/CD
  Nobel/NNP
  Prize/NNP
  in/IN
  (GPE Physics/NNP)
  is/VBZ
  awarded/VBN
  to/TO
  (PERSON Roger/NNP Penrose/NNP)
  ,/,
  (PERSON Reinhard/NNP Genzel/NNP)
  and/CC
  (PERSON Andrea/NNP Ghez/NNP)
  for/IN
  their/PRP$
  work/NN
  on/IN
  black/JJ
  holes/NNS
  ./.)
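
The printed tree groups tokens into labeled chunks such as `(PERSON Roger/NNP Penrose/NNP)`. As a minimal sketch, assuming only the printed form is at hand, the chunks can be pulled out with a regular expression. The helper name `entities_from_printed_tree` is made up for this illustration and is not part of NLTK.

```python
import re

# Printed form of a chunk tree, abbreviated from the output above.
tree_text = """(S
  The/DT
  2020/CD
  Nobel/NNP
  Prize/NNP
  in/IN
  (GPE Physics/NNP)
  is/VBZ
  awarded/VBN
  to/TO
  (PERSON Roger/NNP Penrose/NNP)
  ,/,
  (PERSON Reinhard/NNP Genzel/NNP)
  and/CC
  (PERSON Andrea/NNP Ghez/NNP))"""


def entities_from_printed_tree(text):
    """Return (label, phrase) pairs for every labeled chunk in a printed tree."""
    pairs = []
    # A chunk looks like "(LABEL word/TAG word/TAG)" on a single line.
    # The root "(S" is skipped because it is followed by a newline, not a space.
    for label, body in re.findall(r'\(([A-Z]+) ([^()\n]+)\)', text):
        # Drop the POS tag after the last slash of each token.
        words = [token.rsplit('/', 1)[0] for token in body.split()]
        pairs.append((label, ' '.join(words)))
    return pairs


for label, phrase in entities_from_printed_tree(tree_text):
    print(label, phrase)
# GPE Physics
# PERSON Roger Penrose
# PERSON Reinhard Genzel
# PERSON Andrea Ghez
```

With a live tree object, the same pairs can be read off by iterating over `sent_ner.subtrees()`, keeping the subtrees whose `label()` is not `S`, and joining the words returned by `leaves()`.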


The NLTK book shows a list of commonly used named entity categories along with examples:

NE Type | Examples
--- | ---
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59, October 18th. Remember to save your file before uploading it.

## Question 1
Use `nltk.ne_chunk(tagged_sentence)` to identify the named entities in the sentences below. Note that you have to tokenize and tag the sentences first.

(a) Print out and review the trees.

(b) Find at least one error and leave a description of it as a comment or in a text cell below.

In [3]:
# On this day, October 16th (from https://en.wikipedia.org/wiki/October_16):
sentences = [
    "1813 – The Sixth Coalition attacks Napoleon in the three-day Battle of Leipzig.",
    "1923 – The Walt Disney Company is founded.",
    "1968 – Yasunari Kawabata becomes the first Japanese person to be awarded the Nobel Prize in Literature.",
    "1975 – Three-year-old Rahima Banu, from Bangladesh, is the last known case of naturally occurring smallpox.",
    "2002 – The Bibliotheca Alexandrina opens in Egypt, commemorating the ancient library of Alexandria."
]


In [15]:
# Your solution here

sentence_token = [nltk.word_tokenize(sentence) for sentence in sentences]
sentence_tagged = [nltk.pos_tag(tokenized_sentence) for tokenized_sentence in sentence_token]

for sentence in sentence_tagged:
    sentence_ner = nltk.ne_chunk(sentence)
    print(sentence_ner)


(S
  1813/CD
  –/VBZ
  The/DT
  (ORGANIZATION Sixth/JJ Coalition/NNP)
  attacks/NNS
  Napoleon/NNP
  in/IN
  the/DT
  three-day/JJ
  Battle/NNP
  of/IN
  (GPE Leipzig/NNP)
  ./.)
(S
  1923/CD
  –/VBZ
  The/DT
  (ORGANIZATION Walt/NNP Disney/NNP Company/NNP)
  is/VBZ
  founded/VBN
  ./.)
(S
  1968/CD
  –/NNP
  Yasunari/NNP
  Kawabata/NNP
  becomes/VBZ
  the/DT
  first/JJ
  (GPE Japanese/JJ)
  person/NN
  to/TO
  be/VB
  awarded/VBN
  the/DT
  (ORGANIZATION Nobel/NNP Prize/NNP)
  in/IN
  (GPE Literature/NNP)
  ./.)
(S
  1975/CD
  –/JJ
  Three-year-old/NNP
  (PERSON Rahima/NNP Banu/NNP)
  ,/,
  from/IN
  (GPE Bangladesh/NNP)
  ,/,
  is/VBZ
  the/DT
  last/JJ
  known/JJ
  case/NN
  of/IN
  naturally/RB
  occurring/VBG
  smallpox/NN
  ./.)
(S
  2002/CD
  –/VBZ
  The/DT
  (ORGANIZATION Bibliotheca/NNP Alexandrina/NNP)
  opens/VBZ
  in/IN
  (GPE Egypt/NNP)
  ,/,
  commemorating/VBG
  the/DT
  ancient/JJ
  library/NN
  of/IN
  (GPE Alexandria/NNP)
  ./.)


**Literature** in the third sentence is incorrectly tagged as a GPE. It is a field of study, not a geo-political entity, so it should be left untagged.

## Question 2
[SpaCy](https://spacy.io/) is another NLP library for Python. Try out its named entity recognition system on the sentences in Question 1.

Answer the following questions in a text cell below:

(a) Does it repeat any of the mistakes that NLTK makes?

(b) Does it make any errors that NLTK doesn't?

In [5]:
import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Example
text = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [16]:
# Your solution here
docs = [nlp(text) for text in sentences]

for doc in docs:
    displacy.render(doc, style="ent", jupyter=True)

a)
Yes, spaCy repeats some of the mistakes that NLTK makes:


*   In the third sentence, NLTK splits "the Nobel Prize in Literature" into an ORGANIZATION and a GPE, while spaCy tags it as "Work of Art". Both are misclassifications, since a Nobel Prize is an award.

b)
Yes, spaCy makes some errors that NLTK doesn't:


*   In the first sentence, spaCy tags "Battle of Leipzig" as GPE, whereas only "Leipzig" should be tagged as GPE.



## Question 3
Use regular expressions to try to find instances of the following relationships in the `reuters` corpus:
1. Organizations or companies and their subsidiaries, divisions or parts, e.g.:
  * *Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S*
  * *Merrill Lynch Capital Partners, a unit of Merrill Lynch*
2. Executives and the companies they work for, e.g.:
  * *Isao Nakamura, president of Higashi Nippon*
  * *Henry Rosenberg, chairman of Crown Central Petroleum*

Your results don't have to be perfect! Getting a few relevant matches is enough, but try to keep irrelevant results to a minimum.

In [7]:
import re
from nltk.corpus import reuters
nltk.download('reuters')

# Create a copy of the text where there's only a single space between each word
text = " ".join(reuters.raw().split())

# Example
for m in re.findall(r'(?: [A-Z][a-z]+)+ said it acquired (?:[A-Z][a-z]+ )+', text):
  print(m)

# Note how normal groups and non-capturing groups work with re.findall():
# a_string = "a a b"
# re.findall(r'(a )+b', a_string): ['a '] (normal group)
# re.findall(r'(?:a )+b', a_string): ['a a b'] (non-capturing group)

[nltk_data] Downloading package reuters to /root/nltk_data...


 Douglas Corp said it acquired Frampton Computer Services 
 Corp said it acquired Private Formulations Inc 
 Forstmann Little said it acquired Sybron 
 Southmark Corp said it acquired Berg Ventures 
 Sico said it acquired Sterling 
 First Financial Management Corp said it acquired Confidata 
 Philadelphia Suburban Corp said it acquired Mentor Systems 
 Medar Inc said it acquired Automatic Inspection Devices 
 Stryker Corp said it acquired Hexcel Medical 
 Inspeech Inc said it acquired Norma Bork Associates Inc 
 Olin Hunt Specialty Products Inc said it acquired Image Technology Corp 
 Enro Holding Corp said it acquired Enro Shirt Co 
 Seal Inc said it acquired Ademco 
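
The note on groups in the cell above can be checked directly. The sketch below reruns the two `findall()` calls from the comment and adds a third case that matters for this question: with two or more capturing groups, `findall()` returns one tuple per match. The sample string is one of the examples from the task description, and the variable names are chosen only for illustration.

```python
import re

a_string = "a a b"

# A capturing group makes findall() return only the group's last repetition.
capturing = re.findall(r'(a )+b', a_string)

# A non-capturing group leaves findall() returning the whole match.
non_capturing = re.findall(r'(?:a )+b', a_string)

# With two or more capturing groups, findall() returns one tuple per match.
sample = "Merrill Lynch Capital Partners, a unit of Merrill Lynch"
name = r'[A-Z][a-z]+(?: [A-Z][a-z]+)*'
pairs = re.findall(rf'({name}), a unit of ({name})', sample)

print(capturing)      # ['a ']
print(non_capturing)  # ['a a b']
print(pairs)          # [('Merrill Lynch Capital Partners', 'Merrill Lynch')]
```

Keeping the repeated word pattern non-capturing and wrapping each full name in a single capturing group is what makes each result a clean tuple of names.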


In [35]:
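# Names are modelled as runs of capitalised words ([A-Z][a-z]+) separated by single spaces.
print("\n1. Subsidiaries")
# m[0] is the subsidiary; m[1] is the phrase "a subsidiary/unit of <parent>".
for m in re.findall(r'([A-Z][a-z]+(?: [A-Z][a-z]+)*), (a (?:subsidiary|unit) of [A-Z][a-z]+(?: [A-Z][a-z]+)*)', text):
    print(f"{m[0]}, {m[1]}.")

print("\n2. Executives")
# m[0] is the person, m[1] the title, m[2] the company.
for m in re.findall(r'([A-Z][a-z]+(?: [A-Z][a-z]+)*), (president|chairman) of ([A-Z][a-z]+(?: [A-Z][a-z]+)*)', text):
    print(f"{m[0]}, {m[1]} of {m[2]}.")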



1. Subsidiaries
Spie Batignolles, a subsidiary of Schneider.
James Beam Distilling Co, a unit of American Brands Inc.
Magma Copper Co, a subsidiary of Newmont Mining Corp.
Scallop Petroleum Corp, a subsidiary of Royal Dutch.
Texas Pacific Oil Co Inc, a unit of Canada.
Allied Stores Corp, a subsidiary of Campeau Corp.
Merrill Lynch Capital Partners, a unit of Merrill Lynch.
Reliance Financial Serivces Corp, a subsidiary of Reliance Group Holdings Inc.
Acquisition Corp, a subsidiary of Merrill Lynch Capital Partners Inc.
Monsanto Chemical Company, a unit of Monsanto Co.
Poulenc Chimie, a unit of Rhone.
Gallaher Ltd, a subsidiary of American Brands Inc.
Algonquin Gas Transmission Co, a unit of Texas Eastern Corp.
Kennecott Corp, a unit of British Petroleum Co.
Chase Home Mortgage Corp, a subsidiary of Chase Manhattan Corp.
Permian Corp, a subsidiary of National Intergroup.
Inspiration Consolidated Copper Co, a subsidiary of Inspiration Resources Corp.
Belcher Oil Co, a unit of Coastal Co
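
Entries such as `a unit of Canada` and `Poulenc Chimie, a unit of Rhone` suggest that `[A-Z][a-z]+` stops at characters like hyphens or slashes. That is an assumption, since the underlying corpus text is not shown here. A hedged sketch of a looser word pattern, tried on one of the examples from the task description (the names `TOKEN`, `NAME`, `STRICT` and `SUBSIDIARY` are illustrative):

```python
import re

# Strict pattern from the solution above: capitalised words made of lowercase letters only.
STRICT_NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
STRICT = re.compile(rf"({STRICT_NAME}), a (?:subsidiary|unit) of ({STRICT_NAME})")

# Looser word: still starts with a capital, but may continue with further
# capitals, ampersands, slashes or hyphens (e.g. "A/S").
TOKEN = r"[A-Z][A-Za-z&/-]*"
NAME = rf"{TOKEN}(?: {TOKEN})*"
SUBSIDIARY = re.compile(
    rf"(?P<child>{NAME}), a (?:subsidiary|unit) of (?P<parent>{NAME})"
)

sample = "Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S"

print(STRICT.search(sample).group(2))             # Kvaerner Industrier
print(SUBSIDIARY.search(sample).group("parent"))  # Kvaerner Industrier A/S
```

The looser pattern trades precision for recall: it also accepts sentence-initial words such as "The" and all-caps tokens, so it would probably return more irrelevant matches on the full corpus.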

## Question 4
It's much easier to extract relationships from text that is tagged with named entities. This can be accomplished using the `nltk.sem.extract_rels()` function, as described in [Chapter 7.6](https://www.nltk.org/book/ch07.html#relation-extraction) of the NLTK book. The function takes two named entity categories and a regular expression as arguments and returns all instances where the pattern occurs between the two categories (allowing for up to 10 tokens between them, by default).

The `ieer` (Information Extraction and Entity Recognition) corpus contains named entity annotations, such as `PER`, `ORG` and `LOC`. Find some instances of the following relationships using `nltk.sem.extract_rels()`:
1. Professors and the organizations they work for, e.g.:
  * *Roger Goldman, a law professor at St. Louis University*
2. Family members, e.g.:
  * *Louis XIV and his brother, Philippe*
  * *Mildred Rosenbaum and her husband Stanley*
3. People and where they are from, e.g.:
  * *Anna Rechnio of Poland*

In [9]:
from nltk.corpus import ieer
nltk.download('ieer')

# Example
pattern = re.compile(r'.*\bacquired?\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('ORG', 'ORG', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

[nltk_data] Downloading package ieer to /root/nltk_data...
[nltk_data]   Unzipping corpora/ieer.zip.


[ORG: 'Omnicom'] 'moved to acquire' [ORG: 'GGT']
[ORG: 'BDDP'] 'was acquired last year by' [ORG: 'GGT']
[ORG: 'Safeway Stores'] 'acquired' [ORG: 'Mutual']
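
The matching described for `nltk.sem.extract_rels()` (two entity categories, a pattern applied to the text between them, and a limit on how many tokens may lie in between) can be pictured with a pure-Python sketch. This is a simplified stand-in, not NLTK's implementation: the function `simple_extract_rels` and the flat `chunks` list are hypothetical, whereas NLTK itself works on chunk trees and returns richer dictionaries.

```python
import re


def simple_extract_rels(subj_class, obj_class, chunks, pattern, window=10):
    """Toy illustration of windowed relation matching.

    `chunks` is a hypothetical flat list that mixes plain-string tokens
    with (label, text) tuples standing for named entities.
    """
    results = []
    # Positions of all entity tuples in the sequence.
    entity_positions = [i for i, c in enumerate(chunks) if isinstance(c, tuple)]
    # Pair each entity with the next one, so the filler holds plain tokens only.
    for left, right in zip(entity_positions, entity_positions[1:]):
        subj, obj = chunks[left], chunks[right]
        if subj[0] != subj_class or obj[0] != obj_class:
            continue
        filler_tokens = chunks[left + 1:right]
        # Skip pairs that are further apart than the window allows.
        if len(filler_tokens) > window:
            continue
        filler = " ".join(filler_tokens)
        # The pattern is anchored at the start of the filler, like re.match().
        if pattern.match(filler):
            results.append((subj[1], filler, obj[1]))
    return results


# Tokens modelled on the first relation printed above.
chunks = [("ORG", "Omnicom"), "moved", "to", "acquire", ("ORG", "GGT"), "."]
pattern = re.compile(r'.*\bacquired?\b')

print(simple_extract_rels("ORG", "ORG", chunks, pattern))
# [('Omnicom', 'moved to acquire', 'GGT')]
```

Because the pattern is applied with `match()`, it must succeed from the first character of the filler, which is why the example pattern begins with `.*`.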


In [39]:
# Your solution here

print("\nProfessors and the organizations they work for")
prof_pattern = re.compile(r'.*\bprofessor\b.*')

for doc in ieer.parsed_docs():
    for rel in nltk.sem.extract_rels('PER', 'ORG', doc, 'ieer', prof_pattern):
        print(nltk.sem.rtuple(rel))

print("\nFamily members")
family_pattern = re.compile(r'.*\b(husband|wife|brother|sister)\b.*')

for doc in ieer.parsed_docs():
    for rel in nltk.sem.extract_rels('PER', 'PER', doc, 'ieer', family_pattern):
        print(nltk.sem.rtuple(rel))

print("\nPeople and where are from")
people_pattern = re.compile(r'.*\b(from|born|of)\b.*')

for doc in ieer.parsed_docs():
    for rel in nltk.sem.extract_rels('PER', 'LOC', doc, 'ieer', people_pattern):
        print(nltk.sem.rtuple(rel))



Professors and the organizations they work for
[PER: 'Raymond Rosen'] ', a sex researcher and professor of psychiatry at the' [ORG: 'Robert Wood Johnson Medical School']
[PER: 'Leonore Tiefer'] ', a sex researcher and clinical professor of psychiatry at' [ORG: 'New York Medical Center']
[PER: 'Pepper Schwartz'] ', a sociology professor at the' [ORG: 'University of Washington']
[PER: 'Irwin Goldstein'] ', a professor of urology at the' [ORG: 'Boston University School of Medicine']
[PER: 'Roger Goldman'] ', a law professor at' [ORG: 'St. Louis University']
[PER: 'Joseph Jacobson'] ', an assistant professor at' [ORG: 'MIT']

Family members
[PER: 'Jack N. Berkman'] ', an alumnus, and his wife,' [PER: 'Lillian R. Berkman']
[PER: 'Louis XIV'] 'and his brother,' [PER: 'Philippe']
[PER: 'McCarthy'] "'s wife," [PER: 'Margaret Grundy McCarthy']
[PER: 'Clinton'] 'and his wife,' [PER: 'Hillary Rodham Clinton']
[PER: 'Mildred Rosenbaum'] 'and her husband' [PER: 'Stanley']

People and where are fro