<a href="https://colab.research.google.com/github/dgromann/MCMLR/blob/main/MCMLR_BonusExercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Second Bonus Exercise: Information Extraction

This second bonus exercise will focus on methods for information extraction. We will first load an article from a website and then extract information from the article step by step. 

Each task you have to complete for this bonus exercise is marked with a "TASK" heading. 


*Step-wise instructions for this exercise:*


1.   Locate and complete all subtasks of this bonus exercise marked with "TASK"
2.   Complete each "TASK" directly in this notebook and finally submit this notebook for the second bonus exercise

---
To make it easier to spot, each TASK is separated by lines just like this sentence.


---




## Getting and parsing a news article

To perform information extraction, we first need to download a news article from the web and clean it. The Python library "newspaper3k" is very convenient for this purpose.

In [None]:
!pip3 install newspaper3k

In [None]:
import newspaper
from newspaper import Article

url = 'https://www.aljazeera.com/sports/2022/12/13/messi-leads-argentina-to-world-cup-final-in-3-0-win-over-croatia'

article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the entire text of the article
print("Text of article: \n", article.text)

Since the article is very long, let's only use its first five paragraphs. To do this, we split the plain string ```article.text``` on blank lines, keep the first five chunks, and join them back into a single string (text).

In [None]:
#Splitting on "\n\n" yields paragraphs; we keep the first five and join them with spaces
short_article = " ".join(article.text.split("\n\n")[:5])
print("First 5 paragraphs of the article:", short_article)
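The split on blank lines above yields paragraphs rather than sentences. If sentence-level truncation is wanted, a minimal sketch could look like the following. The helper name `first_n_sentences` and the sample text are made up for illustration, and the punctuation rule is a naive assumption that will mis-split abbreviations such as "Dr. Smith".

```python
import re

def first_n_sentences(text: str, n: int) -> str:
    """Return the first n sentences of text, joined by single spaces."""
    # Collapse newlines and repeated whitespace so paragraph breaks do not matter.
    flat = " ".join(text.split())
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return " ".join(sentences[:n])

# Hypothetical sample text, not taken from the article.
sample = "Argentina won. Croatia lost!\n\nWho plays next? The final follows."
print(first_n_sentences(sample, 2))
```

In the notebook, `first_n_sentences(article.text, 10)` would be the corresponding call. spaCy's own sentence segmentation, used further below via `doc.sents`, is likely to be more robust than this regular expression.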

## Getting started with NER

The first step of traditional information extraction is to detect all named entities in a document. 

In [None]:
!pip install --upgrade spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

In [66]:
import spacy
from spacy import displacy

nlp_sm = spacy.load("en_core_web_sm")

In [None]:
doc = nlp_sm(short_article)
displacy.render(doc, style="ent", jupyter=True)



---


# TASK #1: 
Can you spot any mistakes in the NER output above? Which ones? 


*Tip: to obtain information on the NER tags, use spaCy's explain function, e.g. for "GPE" use ```spacy.explain("GPE")```.*







In [None]:
#To facilitate this task, here is the code to output only the named entities from the text
for ent in doc.ents:
    print(ent.text, ent.label_)

#The above code can also be written in one line
ents = [(e.text, e.label_) for e in doc.ents]
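When checking the output for mistakes, it can help to see all entities that share a label side by side. The following is a small sketch under that assumption; `group_by_label` is a hypothetical helper, and the example pairs are hand-written for illustration rather than actual model output.

```python
from collections import defaultdict

def group_by_label(ents):
    """Map each NER label to the list of entity strings tagged with it."""
    grouped = defaultdict(list)
    for text, label in ents:
        grouped[label].append(text)
    return dict(grouped)

# Hypothetical (text, label) pairs for illustration only; in the notebook,
# pass the `ents` list built from doc.ents in the cell above.
example_ents = [("Argentina", "GPE"), ("Lionel Messi", "PERSON"), ("Croatia", "GPE")]
for label, texts in group_by_label(example_ents).items():
    print(label, texts)
```

An entity that looks out of place within its group, for instance a person listed among organizations, is a good candidate for a tagging error.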

## Your answer for TASK #1:

*Type your answer here*



---





---


# TASK #2: 

Does the NER result change if you use the large model or the TRF model (the English transformer pipeline based on RoBERTa)?

*Reference: If you would like to know more about transformers, take a look at [this](https://jalammar.github.io/illustrated-transformer/) or [this comparison of different ways of using transformers](https://youtu.be/xI0HHN5XKDo).*


In [None]:
#Try NER with the large or TRF model of spaCy - how can this be done? 

## Your answer for TASK #2:

*How does the result change with a different model? Which one did you choose?*



---



## Dependency parsing

Named entity recognition is very useful for identifying named entities in isolation or by entity type, e.g. geographical units, persons, etc.

With dependency parsing, first types of relations between words can be identified. While this is not yet information extraction, it provides access to initial relations between entities (not just named entities) in a sentence and has frequently been used as a preliminary step for information extraction.

In [None]:
doc = nlp_sm(short_article)

#Let's get the second sentence of our article (to get the first the index number would have to be [0] instead of [1])
sent = list(doc.sents)[1]
print("The second sentence is:")
print(sent)

#And see its dependency parsing output visualized with spaCy
displacy.render(sent, style="dep", jupyter=True)
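Behind the visualization, a dependency parse is a set of arcs, each linking a token to its head with a relation label. The sketch below works on plain tuples so that the structure is easy to inspect. The arcs are hand-written for a toy sentence and are an assumption for illustration, not parser output, and `root_and_dependents` is a hypothetical helper.

```python
# Hand-written (token, dependency label, head) arcs for the toy sentence
# "Messi scored a goal"; an assumption for illustration, not parser output.
arcs = [
    ("Messi", "nsubj", "scored"),
    ("scored", "ROOT", "scored"),
    ("a", "det", "goal"),
    ("goal", "dobj", "scored"),
]

def root_and_dependents(arcs):
    """Return the root token and the (label, token) pairs attached directly to it."""
    root = next(tok for tok, dep, head in arcs if dep == "ROOT")
    dependents = [(dep, tok) for tok, dep, head in arcs if head == root and dep != "ROOT"]
    return root, dependents

root, dependents = root_and_dependents(arcs)
print("Root:", root)
print("Direct dependents:", dependents)
```

In the notebook, the same kind of list can be built from the parsed sentence with `[(t.text, t.dep_, t.head.text) for t in sent]`. Matching heads by their text is a simplification that breaks when a word occurs twice in a sentence; comparing token indices would be safer.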

## Entity linking

We will now combine named entity recognition and dependency parsing to perform entity linking. In other words, we first need to find the named entities of a sentence and then determine how they depend on each other and which verb connects them.

The following code cell provides all noun phrases in the sentence. 

In [None]:
print("Input sentence: ", sent, "\n")

print("All the noun phrases detected in the sentence and their dependency relations:")
for chunk in sent.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)



---

# TASK #3:

Use the named entities extracted above and the dependency parsing output to extract entities and the links between them (usually verbs).

Make sure to use the **most** accurate spaCy model; e.g. the small model might fail to detect all words belonging to a named entity ("Messi" instead of "Lionel Messi" in our example sentence).

## Your answer for TASK #3:




In [None]:
#Provide your code to combine NER and dependency parsing here


*Please describe here how easy this task was and how well it worked*



---

## Question-Answering

Question answering can be used effectively for information extraction, especially when no training data is available for a language and the model has to be applied in a zero-shot transfer setting.

In the code cell below, you find an example of how [DistilBERT base cased distilled SQuAD](https://huggingface.co/distilbert-base-cased-distilled-squad?context=cars&question=What+type+of+animal+is+a+cow%3F), a transformer-based question-answering model, can be used to extract information from the first sentence.

The information we obtain is the triple 

```
(semifinal, location, Lusail Stadium)
```



In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

#list(doc.sents)[0] is a spaCy Span (<class 'spacy.tokens.span.Span'>), but the QA model requires a string,
#so we convert the sentence to a string with str()
context = str(list(doc.sents)[0])
question = "What was the location of the semifinal?"

#To see which datatype a Python variable is, you can use the type() command
print(type(list(doc.sents)[0]))

#Here we need to formulate a question that can be answered with the provided context 
result = question_answerer(question=question, context=context)

print(question)
print(context)
print("Answer: ", result['answer'], "score: ", round(result['score'], 4), "start: ", result['start'], "end: ", result['end'])
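To get from a QA result to a triple like the one shown above, the answer can be paired with a subject and a relation chosen by hand. The sketch below is one possible way to do this. `to_triple` is a hypothetical helper, the `min_score` threshold of 0.5 is an arbitrary assumption, and the example context and result dictionary are made up; only the keys `answer`, `score`, `start` and `end` come from the output printed above. It also assumes that `start` and `end` are character offsets into the context.

```python
def to_triple(subject, relation, result, context, min_score=0.5):
    """Turn a QA result dict into a (subject, relation, object) triple.

    Returns None when the score is below min_score or when the reported
    character span does not match the answer text.
    """
    if result["score"] < min_score:
        return None
    if context[result["start"]:result["end"]] != result["answer"]:
        return None
    return (subject, relation, result["answer"])

# Hypothetical context and result dict with the same keys the pipeline returns;
# the score is made up and the offsets are computed from the sample context.
example_context = "The semifinal was played at Lusail Stadium."
answer = "Lusail Stadium"
start = example_context.index(answer)
example_result = {"answer": answer, "score": 0.9, "start": start, "end": start + len(answer)}
print(to_triple("semifinal", "location", example_result, example_context))
```

In the notebook, `to_triple("semifinal", "location", result, context)` would be the corresponding call. Filtering on the score is a design choice: it trades missed answers for fewer wrong triples, and a suitable threshold has to be found by experiment.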



---


# TASK #4:

Formulate **three** questions to extract information from any of the other sentences in this article. 

## Your answer for TASK #4:

Use the code cell below to provide your answer as code. 

In [None]:
#Provide the code for your QA TASK #4 here

How easy was this task and did you always obtain the correct answer from the model?

*Use this text cell to provide an answer for the above question regarding your experiments.*

---




---


# TASK #5:

Formulate a single question and context in another language and run the question-answering model again. No need to search for a question and context online, you can freely make up both on your own. 

## Your answer for TASK #5:


In [None]:
#Provide the code for your QA TASK #5 in another language here

How well does it work on your own example in another language?


*Use this text cell to provide an answer for the above question regarding your experiments.*

---

When you are done with all of the above TASKS, please upload a link to your notebook on Moodle.