---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files: 50 plaintext documents and 50 JSON documents.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains the *gold-standard* labels (also called *reference* labels) for the entities and relations in the corresponding plaintext document, stored as key-value pairs.


In [None]:
!if test -f "movies.zip"; then rm "movies.zip"; fi
!if test -d "movies/"; then rm -rf "movies/"; fi
!wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
!unzip "movies.zip"

---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [None]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt')) as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json')) as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [None]:
# Load the libraries which might be useful

import re
import nltk
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing 
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.
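
The shape described above can be checked with a small helper. This is a minimal sketch: `is_tagged_document` is a hypothetical name that is not part of the assignment, and the sample data is hand-written rather than produced by a tagger.

```python
def is_tagged_document(obj):
    """Return True if obj is a list of sentences, each a list of (token, tag) string pairs."""
    if not isinstance(obj, list):
        return False
    for sentence in obj:
        if not isinstance(sentence, list):
            return False
        for pair in sentence:
            is_pair = isinstance(pair, tuple) and len(pair) == 2
            if not (is_pair and all(isinstance(part, str) for part in pair)):
                return False
    return True


# Hand-written sample in the expected format (not real tagger output).
sample = [[("It", "PRP"), ("received", "VBD"), ("seven", "CD"), (".", ".")]]
untagged = [["It", "received", "seven", "."]]

print(is_tagged_document(sample))
print(is_tagged_document(untagged))
```

The first call should report a valid structure, while the second fails because its sentences hold bare strings instead of `(token, tag)` tuples.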


In [None]:
def ie_preprocess(document):
  '''Return a list of sentences tagged with part-of-speech tags for the given document.'''

  tagged_sentences = []

  # your code goes here
  # ...
  sentences = nltk.sent_tokenize(document)

  tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

  tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]

  return tagged_sentences

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [None]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [None]:
def find_named_entities(tagged_document):
  '''Return a list of all the named entities in the given tagged document.'''

  # your code goes here
  # ...
  named_entities = []

  # chunk every sentence; binary=True labels each entity subtree as "NE"
  chunked_sentences = nltk.ne_chunk_sents(tagged_document, binary=True)
  for tree in chunked_sentences:
    for subtree in tree.subtrees():
      if subtree.label() == "NE":
        # join the tokens of the (token, tag) leaves into a single string
        entity = " ".join(token for token, tag in subtree.leaves())
        named_entities.append(entity)

  return named_entities



Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [None]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']

## Task 3: Information / Relation Extraction (I) 

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.
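
As an illustration of the required interface (a string in, a list out), here is a minimal regex-only sketch. It is one possible approach under stated assumptions, not a reference solution: the function name `directed_by_regex`, the pattern, and the sample sentence are all invented for this example.

```python
import re

# Assumed pattern: the phrase "directed by" followed by one or more capitalised
# words. An initial such as "J." also counts as a capitalised word.
NAME_WORD = r"[A-Z](?:\.|[\w'-]+)"
DIRECTED_BY = re.compile(r"\bdirected by (" + NAME_WORD + r"(?: " + NAME_WORD + r")*)")


def directed_by_regex(document):
    """Return every capitalised name sequence that follows 'directed by'."""
    return [match.group(1) for match in DIRECTED_BY.finditer(document)]


# Invented sample sentence, used only to exercise the pattern.
text = "The film was written and directed by George Lucas."
print(directed_by_regex(text))  # ['George Lucas']
```

A pattern like this misses passive variants and lowercase name particles, which is one reason to prefer named-entity chunks over raw text.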


In [None]:
# relation 1 - your code goes here
def Written_by(document):

 #the class of subject named entity
 subject_class = 'NE'   

 #pattern for extracting written by relation 
 pattern = re.compile(r'.*\bwritten\b.*',re.IGNORECASE) 

 #The class of object named entity 
 object_class = 'NE'  

 written = [] #stores the output list

 tagged_sentences = ie_preprocess(document)

 for sent in tagged_sentences:
  
  # chunk each sent of tagged documents
  chunked_sent = nltk.ne_chunk(sent, binary = True) 

  #Group a chunk structure into a list of 'semi-relations'
  pairs = nltk.sem.relextract.tree2semi_rel(chunked_sent)  

  #Converts the pairs generated into a 'reldict': a dictionary which
  #stores information about the subject and object NEs plus the filler between them.
  reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])   


  relfilter = lambda x: (x['subjclass'] == subject_class and
                           pattern.match(x['filler']) and
                           x['objclass'] == object_class)

  rels = list(filter(relfilter, reldicts))

  for rel in rels:
    written.append(rel['objsym'].replace("_", " ").title())
    
 return written

In [None]:
def directed_by(document):

 #the class of subject named entity
 subject_class = 'NE'

 #pattern for extracting directed by relation
 pattern = re.compile(r'.*\bdirected\b.*',re.IGNORECASE)

 #The class of object named entity 
 object_class = 'NE'

 directed = []  #stores the output list

 tagged_sentences = ie_preprocess(document)

 for sent in tagged_sentences:

  # chunk each sent of tagged documents
  chunked_sent = nltk.ne_chunk(sent, binary = True) 

  #Group a chunk structure into a list of 'semi-relations'
  pairs = nltk.sem.relextract.tree2semi_rel(chunked_sent)

  #Converts the pairs generated into a 'reldict': a dictionary which
  #stores information about the subject and object NEs plus the filler between them.
  reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

  relfilter = lambda x: (x['subjclass'] == subject_class and
                           pattern.match(x['filler']) and
                           x['objclass'] == object_class)

  rels = list(filter(relfilter, reldicts))


  for rel in rels:
    directed.append(rel['objsym'].replace("_", " ").title())

 return directed

In [None]:
def Distribution(document):

 #the class of subject named entity
 subject_class = 'NE'

 #pattern for extracting distributed by relation
 pattern = re.compile(r'.*\bdistributed\b.*',re.IGNORECASE)

 #The class of object named entity
 object_class = 'NE'

 distributed = [] #stores the output list

 tagged_sentences = ie_preprocess(document)

 for sent in tagged_sentences:

  # chunk each sent of tagged documents
  chunked_sent = nltk.ne_chunk(sent, binary = True) 

  #Group a chunk structure into a list of 'semi-relations'
  pairs = nltk.sem.relextract.tree2semi_rel(chunked_sent)

  #Converts the pairs generated into a 'reldict': a dictionary which
  #stores information about the subject and object NEs plus the filler between them.
  reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

  relfilter = lambda x: (x['subjclass'] == subject_class and
                           pattern.match(x['filler']) and
                           x['objclass'] == object_class)

  rels = list(filter(relfilter, reldicts))

  for rel in rels:
    distributed.append(rel['objsym'].replace("_", " ").title())

 return distributed
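
The three functions above differ only in the keyword they search for in the filler text, so the filtering step could be shared. The sketch below shows that step in isolation. `filter_relations` is a hypothetical helper, and the sample dictionaries are hand-written stand-ins that only mimic the keys used above (`subjclass`, `filler`, `objclass`, `objsym`). The `word/TAG` filler format is an assumption about what NLTK produces.

```python
import re


def filter_relations(reldicts, keyword, subject_class="NE", object_class="NE"):
    """Return title-cased object entities whose filler text contains `keyword`."""
    pattern = re.compile(r".*\b" + re.escape(keyword) + r"\b.*", re.IGNORECASE)
    return [
        rel["objsym"].replace("_", " ").title()
        for rel in reldicts
        if rel["subjclass"] == subject_class
        and rel["objclass"] == object_class
        and pattern.match(rel["filler"])
    ]


# Hand-written stand-ins for reldicts; the values are invented for illustration.
sample_reldicts = [
    {"subjclass": "NE", "filler": "directed/VBN by/IN",
     "objclass": "NE", "objsym": "george_lucas"},
    {"subjclass": "NE", "filler": "distributed/VBN by/IN",
     "objclass": "NE", "objsym": "century_fox"},
]

print(filter_relations(sample_reldicts, "directed"))     # ['George Lucas']
print(filter_relations(sample_reldicts, "distributed"))  # ['Century Fox']
```

With a helper of this kind, each relation function would reduce to building the reldicts for a document and calling the helper with its own keyword.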

---

## Task 4: Information / Relation Extraction (II)  

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [None]:
def Awards_Won(document):

 subject_class = 'NE'
 pattern = re.compile(r'.*\bwon\b.*',re.IGNORECASE)
 object_class = 'NE'

 award = []

 tagged_sentences = ie_preprocess(document)

 for sent in tagged_sentences:
  chunked_sent = nltk.ne_chunk(sent, binary = True) 
  pairs = nltk.sem.relextract.tree2semi_rel(chunked_sent)

  reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])
  relfilter = lambda x: (x['subjclass'] == subject_class and
                           pattern.match(x['filler']) and
                           x['objclass'] == object_class)

  rels = list(filter(relfilter, reldicts))

  for rel in rels:
    award.append(rel['objsym'].replace("_", " ").title())



 return award

---

## Task 5: Combining information in the output 

Edit the function below to return a Python dictionary with the outputs from the functions defined in tasks $3 - 4$.

In [None]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.

  output = {
    ##### EDIT BELOW THIS LINE #####
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.
    
    "Directed by": directed_by(document),
    "Distributed companies": Distribution(document),
    "Written by": Written_by(document),

    # save the output from Task 4 here
    "Task 4": Awards_Won(document),

    ##### EDIT ABOVE THIS LINE #####
  }

  return output


# check output for the eighth document (index 7)

extract_info(documents[7])

{'Directed by': ['Steven Spielberg'],
 'Distributed companies': ['North America'],
 'Task 4': ['Spielberg'],
 'Written by': ['Robert Rodat']}

The output from the cell above should look something like the dictionary shown below. Overall values might be different, based on what four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```javascript
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 
               'Peter Cushing', 'David Prowse', 'James Earl Jones', ],
}
```

---

## Task 6: Evaluation (I)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not every piece of information or relation listed in Task $3$ has an associated label for every movie in the JSON documents, i.e., some JSON documents will have certain key-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies, and in the remaining 4 documents the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.
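
One way to handle the missing keys described above is to micro-average over all documents and skip any key that has no gold label for a given movie. The sketch below is an assumption-laden illustration, not the required solution: `micro_prf` is a hypothetical name, the toy data is invented, and the gold values are assumed to be lists of strings.

```python
def micro_prf(gold_docs, predicted_docs, keys):
    """Micro-averaged precision, recall and F1 over `keys`.

    A key that is absent from a document's gold labels is skipped for that document.
    """
    true_positive = n_predicted = n_gold = 0
    for gold, predicted in zip(gold_docs, predicted_docs):
        for key in keys:
            if key not in gold:
                continue  # no reference label, so nothing to score
            gold_set = set(gold[key])
            predicted_set = set(predicted.get(key, []))
            true_positive += len(gold_set & predicted_set)
            n_predicted += len(predicted_set)
            n_gold += len(gold_set)

    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented toy data: the second movie has no "Budget" label.
gold = [
    {"Directed by": ["George Lucas"], "Budget": ["$11 million"]},
    {"Directed by": ["Steven Spielberg"]},
]
predicted = [
    {"Directed by": ["George Lucas", "Lucasfilm"]},
    {"Directed by": [], "Budget": ["$70 million"]},
]

toy_scores = micro_prf(gold, predicted, keys=["Directed by", "Budget"])
print(toy_scores)
```

In this toy case one of two scored predictions is correct (precision 0.5) and one of three gold values is found (recall 1/3). The `Budget` prediction for the second movie is ignored because that movie has no `Budget` label.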

In [None]:
from itertools import zip_longest
def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.

  # your code goes here
  # ...


  true_positive = n_predicted = n_gold = 0   # running counts over all documents

  for prediction, label in zip_longest(predictions, labels):
    for key, values in prediction.items():
      # score a key only when the gold labels for this document contain it
      if key in label:
        predictions_set = set(values)
        label_set = set(label[key])
        true_positive += len(predictions_set & label_set)
        n_predicted += len(predictions_set)
        n_gold += len(label_set)

  # guard against division by zero when nothing was predicted or labelled
  if n_predicted:
    scores['precision'] = round(true_positive / n_predicted, 2)
  if n_gold:
    scores['recall'] = round(true_positive / n_gold, 2)
  if scores['precision'] + scores['recall']:
    scores['f1'] = 2 * round((scores['precision'] * scores['recall']) / (scores['precision'] + scores['recall']), 3)

  return scores

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher, or at least get as close as possible to these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |
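
The baseline F1 value is consistent with the usual harmonic-mean definition, assuming that is the definition the table uses, with $P$ for precision and $R$ for recall:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 0.5 \times 0.25}{0.5 + 0.25} = \frac{0.25}{0.75} \approx 0.333
$$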

In [None]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

|       | precision | recall | f1    |
| :---: | :---:     | :---:  | :---: |
| 0     | 0.79      | 0.45   | 0.574 |


---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.



---

> 
1. Only a few of the provided JSON files contain the key-value pair for "Written by". Extracting the written-by relation from the corpus returns roughly 30 names, but a returned name is not counted during evaluation when the corresponding JSON file has no "Written by" key, which reduces precision.

2. The relations "Produced by" and "Production companies" are both retrieved with the common pattern "produced", so both return the same named entities. For example, document 1 returns "Lucasfilm" for both relations, which lowers precision.
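
The first challenge can be quantified by counting how many gold-label dictionaries contain a given key. The sketch below uses a hypothetical helper, `label_coverage`, and invented toy data. Running it on the real labels would require the `labels` list loaded earlier in the notebook.

```python
def label_coverage(gold_docs, key):
    """Count how many gold-label dictionaries contain `key`."""
    return sum(1 for gold in gold_docs if key in gold)


# Invented toy labels: only two of the three documents have "Written by".
toy_labels = [
    {"Written by": ["A"]},
    {"Directed by": ["B"]},
    {"Written by": ["C"], "Directed by": ["D"]},
]

print(label_coverage(toy_labels, "Written by"))   # 2
print(label_coverage(toy_labels, "Directed by"))  # 2
```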