<a href="https://colab.research.google.com/github/giorgiosld/Natural-Language-Processing/blob/main/labs/lab8/T_725_Lab08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T-725 Natural Language Processing: Lab 8
In today's lab, we will be working with named entity recognition and information extraction.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create your own copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [None]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

## Named entity recognition
NLTK includes a classifier for tagging named entities, which is described in [Chapter 7.5](https://www.nltk.org/book/ch07.html#sec-ner) of the NLTK book.

In [None]:
sent = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

sent_tokens = nltk.word_tokenize(sent)
sent_tagged = nltk.pos_tag(sent_tokens)
sent_ner = nltk.ne_chunk(sent_tagged)

print(sent_ner)

The NLTK book shows a list of commonly used named entity categories along with examples:

NE Type | Examples
--- | ---
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian
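
The tree returned by `nltk.ne_chunk()` can also be walked programmatically instead of only being printed. The sketch below is one possible way to collect `(label, text)` pairs from it. To keep it runnable without any model downloads, it uses a small hand-built tree; the helper name `named_entities` and the labels in the tree are assumptions made for illustration, not output from the classifier.

```python
from nltk import Tree

# Hand-built tree in the same shape as nltk.ne_chunk() output.
# The labels are written by hand for illustration, not produced by the classifier.
tree = Tree('S', [
    ('awarded', 'VBN'), ('to', 'TO'),
    Tree('PERSON', [('Roger', 'NNP'), ('Penrose', 'NNP')]),
    ('and', 'CC'),
    Tree('PERSON', [('Andrea', 'NNP'), ('Ghez', 'NNP')]),
])

def named_entities(tree):
    """Return (label, text) pairs for each entity subtree directly under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are nested Trees.
        if isinstance(node, Tree):
            text = ' '.join(token for token, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(named_entities(tree))
```

The same helper should work on a real `nltk.ne_chunk()` result, since entities appear there as subtrees whose label is the entity category.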

# Assignment
Answer the following questions and hand in your solution on Canvas before 23:59 on October 18th. Remember to save your file before uploading it.

## Question 1
Use `nltk.ne_chunk(tagged_sentence)` to identify the named entities in the sentences below. Note that you have to tokenize and tag the sentences first.

(a) Print out and review the trees.

(b) Find at least one error and leave a description of it as a comment or in a text cell below.

In [None]:
# On this day, October 16th (from https://en.wikipedia.org/wiki/October_16):
sentences = [
    "1813 – The Sixth Coalition attacks Napoleon in the three-day Battle of Leipzig.",
    "1923 – The Walt Disney Company is founded.",
    "1968 – Yasunari Kawabata becomes the first Japanese person to be awarded the Nobel Prize in Literature.",
    "1975 – Three-year-old Rahima Banu, from Bangladesh, is the last known case of naturally occurring smallpox.",
    "2002 – The Bibliotheca Alexandrina opens in Egypt, commemorating the ancient library of Alexandria."
]


In [None]:
# Your solution here


## Question 2
[SpaCy](https://spacy.io/) is another NLP library for Python. Try out its named entity recognition system on the sentences in Question 1.

Answer the following questions in a text cell below:

(a) Does it repeat any of the mistakes that NLTK makes?

(b) Does it make any errors that NLTK doesn't?

In [None]:
import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Example
text = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [None]:
# Your solution here


## Question 3
Use regular expressions to try to find instances of the following relationships in the `reuters` corpus:
1. Organizations or companies and their subsidiaries, divisions or parts, e.g.:
  * *Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S*
  * *Merrill Lynch Capital Partners, a unit of Merrill Lynch*
2. Executives and the companies they work for, e.g.:
  * *Isao Nakamura, president of Higashi Nippon*
  * *Henry Rosenberg, chairman of Crown Central Petroleum*

Your results don't have to be perfect! Getting a few relevant matches is enough, but try to keep irrelevant results to a minimum.

In [None]:
import re
from nltk.corpus import reuters
nltk.download('reuters')

# Create a copy of the text where there's only a single space between each word
text = " ".join(reuters.raw().split())

# Example
for m in re.findall(r'(?: [A-Z][a-z]+)+ said it acquired (?:[A-Z][a-z]+ )+', text):
  print(m)

# Note how normal groups and non-capturing groups work with re.findall():
# a_string = "a a b"
# re.findall(r'(a )+b', a_string): ['a '] (normal group)
# re.findall(r'(?:a )+b', a_string): ['a a b'] (non-capturing group)
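
The note on groups can be checked directly, and named groups with `re.finditer()` are one way to keep both sides of a relationship separate. The following is a minimal sketch on a made-up sentence (not taken from the `reuters` corpus); the group names `buyer` and `target` are arbitrary choices for illustration.

```python
import re

# Capturing vs. non-capturing groups with re.findall()
a_string = "a a b"
capturing = re.findall(r'(a )+b', a_string)        # only the last repetition of the group
non_capturing = re.findall(r'(?:a )+b', a_string)  # the whole match

# Made-up sample text, used only to illustrate the technique
sample = "In March, Acme Holdings said it acquired Widget Works for cash."

# Named groups label each side of the relationship
pattern = re.compile(
    r'(?P<buyer>(?:[A-Z][a-z]+ )+)said it acquired (?P<target>(?:[A-Z][a-z]+ ?)+)'
)

matches = [(m.group('buyer').strip(), m.group('target').strip())
           for m in pattern.finditer(sample)]

print(capturing, non_capturing)
print(matches)
```

Note that a pattern like this treats any run of capitalized words as a name, so a sentence-initial word directly before the company name would be swallowed into the match.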

In [None]:
print("\n1. Subsidiaries")


print("\n2. Executives")


## Question 4
It's much easier to extract relationships from text that is tagged with named entities. This can be accomplished using the `nltk.sem.extract_rels()` function, as described in [Chapter 7.6](https://www.nltk.org/book/ch07.html#relation-extraction) of the NLTK book. The function takes two named entity categories and a regular expression as arguments and returns all instances where the pattern occurs between the two categories (allowing for up to 10 tokens between them, by default).
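
To make the idea concrete, here is a simplified, pure-Python illustration of that filtering step. It is not NLTK's implementation: the function `toy_extract_rels` and the `chunks` format (plain token strings mixed with `(label, text)` tuples) are hypothetical and exist only for this sketch.

```python
import re

def toy_extract_rels(subj_label, obj_label, chunks, pattern, window=10):
    """Pair each subject entity with the next entity and test the tokens between them.

    chunks mixes plain token strings with (label, text) tuples for entities.
    """
    rels = []
    for i, chunk in enumerate(chunks):
        if not (isinstance(chunk, tuple) and chunk[0] == subj_label):
            continue
        filler = []
        for nxt in chunks[i + 1:]:
            if isinstance(nxt, tuple):
                # Keep the pair only if the category, window and pattern all fit
                if (nxt[0] == obj_label
                        and len(filler) <= window
                        and pattern.match(' '.join(filler))):
                    rels.append((chunk[1], ' '.join(filler), nxt[1]))
                break  # only the immediately following entity is considered
            filler.append(nxt)
    return rels

# Made-up, hand-annotated example
chunks = [("ORG", "Acme Holdings"), "said", "it", "acquired",
          ("ORG", "Widget Works"), "in", "March", "."]
pattern = re.compile(r'.*\bacquired?\b')

print(toy_extract_rels("ORG", "ORG", chunks, pattern))
```

Shrinking `window` below the number of tokens between the two entities (three here) makes the pair disappear from the results.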

The `ieer` (Information Extraction and Entity Recognition) corpus contains named entity annotations, such as `PER`, `ORG` and `LOC`. Find some instances of the following relationships using `nltk.sem.extract_rels()`:
1. Professors and the organizations they work for, e.g.:
  * *Roger Goldman, a law professor at St. Louis University*
2. Family members, e.g.:
  * *Louis XIV and his brother, Philippe*
  * *Mildred Rosenbaum and her husband Stanley*
3. People and where they are from, e.g.:
  * *Anna Rechnio of Poland*

In [None]:
from nltk.corpus import ieer
nltk.download('ieer')

# Example
pattern = re.compile(r'.*\bacquired?\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('ORG', 'ORG', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

In [None]:
# Your solution here
