# Introduction to Python and Natural Language Technologies

__Laboratory 10 - NLP applications, Dependency parsing__

__April 22, 2021__

During this laboratory you will have to implement various evaluation methods and use them to measure the performance of pretrained models.

In [None]:
import stanza
import spacy
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarizer as gensim_summarizer
from transformers import pipeline
import nltk
import conllu
import os
import numpy as np
import requests

In [None]:
stanza.download('en')
stanza_nlp = stanza.Pipeline('en')
spacy_nlp = spacy.load("en_core_web_sm")

Let's download the UD treebanks if you do not have them already. We are going to use them for evaluations.

In [None]:
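url = "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3424/ud-treebanks-v2.7.tgz"
tgz = 'ud-treebanks-v2.7.tgz'
directory = 'ud_treebanks'
if not os.path.exists(directory):
    import shutil
    import tarfile
    # Stream the archive to disk in chunks instead of holding it in memory
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(tgz, 'wb') as ud:
            for chunk in response.iter_content(chunk_size=1 << 20):
                ud.write(chunk)
    os.mkdir(directory)
    # Flatten the archive: copy every regular file directly into `directory`
    with tarfile.open(tgz, 'r:gz') as _tar:
        for member in _tar:
            if not member.isfile():
                continue
            fname = os.path.basename(member.name)
            with _tar.extractfile(member) as src, open(os.path.join(directory, fname), 'wb') as dst:
                shutil.copyfileobj(src, dst)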

In [None]:
data = "ud_treebanks/en_ewt-ud-train.conllu"
with open(data, encoding='utf-8') as conll_data:
    trees = conllu.parse(conll_data.read())

In [None]:
print(trees[0].serialize())

## Evaluation Methods

### 1. F-score

The F-score is probably the most relevant measure we can use when evaluating classifiers.

Implement the function below. The function takes two iterables and returns a detailed dictionary that contains the True Positive, False Positive and False Negative counts and the Precision, Recall and F-score values for each unique class in the gold list. Additionally, the dictionary should contain the micro and macro averaged precision, recall and F-score values.
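
As a rough illustration of the per-class counts, the sketch below counts the true positives, false positives and false negatives of a single class in a one-vs-rest fashion. The toy tag lists and the helper name `count_for_class` are made up for this example and are not part of the exercise.

```python
def count_for_class(gold, predicted, label):
    """Count tp, fp and fn for one class, treating every other class as negative."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == label and g == label:
            tp += 1      # predicted the class and the gold label agrees
        elif p == label:
            fp += 1      # predicted the class, but the gold label differs
        elif g == label:
            fn += 1      # the gold label is the class, but the prediction missed it
    return {'tp': tp, 'fp': fp, 'fn': fn}

# Hypothetical toy data, not taken from the treebank
toy_gold = ['NOUN', 'VERB', 'NOUN', 'DET']
toy_pred = ['NOUN', 'NOUN', 'VERB', 'DET']
noun_counts = count_for_class(toy_gold, toy_pred, 'NOUN')
print(noun_counts)  # {'tp': 1, 'fp': 1, 'fn': 1}
```

True negatives are not needed, since neither precision nor recall uses them.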

You can read about the F-measure [here](https://en.wikipedia.org/wiki/F-score).

Help for the micro-macro averages: https://tomaxent.com/2018/04/27/Micro-and-Macro-average-of-Precision-Recall-and-F-Score/.
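
The difference between the two averages can be sketched on a small, invented set of per-class counts. The numbers, the variable names and the `safe_div` helper (which assumes that a score with an empty denominator is reported as 0.0) are illustrative assumptions only.

```python
# Hypothetical per-class counts: 8 gold items of class 'A', 3 of class 'B'
toy_counts = {
    'A': {'tp': 8, 'fp': 2, 'fn': 0},
    'B': {'tp': 1, 'fp': 0, 'fn': 2},
}

def safe_div(numerator, denominator):
    # Assumption: a score with an empty denominator is reported as 0.0
    return numerator / denominator if denominator else 0.0

# Macro average: compute the score for each class, then average the scores
per_class_precision = [safe_div(c['tp'], c['tp'] + c['fp']) for c in toy_counts.values()]
per_class_recall = [safe_div(c['tp'], c['tp'] + c['fn']) for c in toy_counts.values()]
macro_precision = sum(per_class_precision) / len(toy_counts)
macro_recall = sum(per_class_recall) / len(toy_counts)

# Micro average: pool the counts over all classes, then compute the score once
total_tp = sum(c['tp'] for c in toy_counts.values())
total_fp = sum(c['fp'] for c in toy_counts.values())
total_fn = sum(c['fn'] for c in toy_counts.values())
micro_precision = safe_div(total_tp, total_tp + total_fp)
micro_recall = safe_div(total_tp, total_tp + total_fn)

print(f'macro: P={macro_precision:.3f} R={macro_recall:.3f}')  # macro: P=0.900 R=0.667
print(f'micro: P={micro_precision:.3f} R={micro_recall:.3f}')  # micro: P=0.818 R=0.818
```

The macro average weights every class equally, so the poor recall on the rare class 'B' pulls the macro recall down, while the micro average is dominated by the frequent class 'A'.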

Example:

In [None]:
f_dict = {
    0: {'tp': 4, 'fp': 0, 'fn': 0, 'precision': 1.0, 'recall': 1.0, 'f': 1.0}, 
    1: {'tp': 4, 'fp': 0, 'fn': 0, 'precision': 1.0, 'recall': 1.0, 'f': 1.0}, 
    2: {'tp': 4, 'fp': 0, 'fn': 0, 'precision': 1.0, 'recall': 1.0, 'f': 1.0}, 
    'MICRO AVG': {'precision': 1.0, 'recall': 1.0, 'f': 1.0}, 
    'MACRO AVG': {'precision': 1.0, 'recall': 1.0, 'f': 1.0}
}

f_dict2 = {
    0: {'tp': 3, 'fp': 1, 'fn': 1, 'precision': 0.75, 'recall': 0.75, 'f': 0.75},
    1: {'tp': 3, 'fp': 1, 'fn': 1, 'precision': 0.75, 'recall': 0.75, 'f': 0.75},
    2: {'tp': 2, 'fp': 2, 'fn': 2, 'precision': 0.5, 'recall': 0.5, 'f': 0.5},
    'MICRO AVG': {'precision': 0.6666666666666666, 'recall': 0.6666666666666666, 'f': 0.6666666666666666},
    'MACRO AVG': {'precision': 0.6666666666666666, 'recall': 0.6666666666666666, 'f': 0.6666666666666666}
}

In [None]:
def f_score(gold, predicted):
    raise NotImplementedError()

In [None]:
gold = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2]
pred = [0, 2, 1, 1, 2, 0, 0, 2, 1, 0, 1, 2]

assert f_dict == f_score(gold, gold)
assert f_dict2 == f_score(gold, pred)

### 1.1 Evaluate a pretrained POS tagger using the example

Choose an existing POS tagger (e.g. stanza, spacy, nltk) and predict the POS tags of the sentence given below. Compare the results to the reference below using the f_score function above. Keep in mind that there are different POS tag sets (the reference provides both UPOS and XPOS tags), so make sure you compare tags from the same tag set.

In [None]:
sentence = trees[0].metadata["text"]
upos = [token['upos'] for token in trees[0]]
xpos = [token['xpos'] for token in trees[0]]

print(f'{sentence}\n{upos}\n{xpos}')

In [None]:
# Your solution here

### 2. ROUGE-N score

We usually use the ROUGE score to evaluate summaries by comparing the generated summaries with reference summaries. Write a function that takes a reference summary, a generated summary and a number N, which is the length of the n-grams to compare. The function should return a dictionary containing the precision, recall and F-score of the ROUGE-N score. (In practice, the most important part of the ROUGE score is its recall.)

\begin{equation*}
\text{Recall} = \frac{\text{overlapping n-grams}}{\text{all n-grams in the reference summary}}
\end{equation*}

\begin{equation*}
\text{Precision} = \frac{\text{overlapping n-grams}}{\text{all n-grams in the generated summary}}
\end{equation*}

\begin{equation*}
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation*}

You can read further about the ROUGE-N scoring method [here](https://www.aclweb.org/anthology/W04-1013.pdf).
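
The formulas above can be sketched on a toy pair of sentences. This is only one possible way to count overlapping n-grams: it assumes whitespace tokenization and counts each shared n-gram at most as often as it occurs in both texts. The sentences and the helper name `toy_ngrams` are invented for this illustration.

```python
from collections import Counter

def toy_ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sentences, tokenized on whitespace for simplicity
toy_reference = 'the quick brown fox jumps'.split()
toy_generated = 'the quick fox jumps'.split()

ref_bigrams = toy_ngrams(toy_reference, 2)   # 4 bigrams
gen_bigrams = toy_ngrams(toy_generated, 2)   # 3 bigrams

# Counter intersection keeps the minimum count of each shared n-gram
overlap = sum((ref_bigrams & gen_bigrams).values())   # 2 shared bigrams

toy_recall = overlap / sum(ref_bigrams.values())      # 2 / 4
toy_precision = overlap / sum(gen_bigrams.values())   # 2 / 3
print(overlap, toy_recall, toy_precision)
```

Only `('the', 'quick')` and `('fox', 'jumps')` occur in both sentences, so dropping a single word from the middle of the reference already costs two of its four bigrams.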

You are encouraged to implement and use the helper functions outlined below. You can use any tokenizer you'd like for this exercise.

Example results of the rouge_n function:

In [None]:
n2 = {'precision': 0.75, 'recall': 0.6, 'f': 0.6666666666666665}

In [None]:
def get_ngram(text, n):
    raise NotImplementedError()

def rouge_n(reference, generated, n):
    raise NotImplementedError()


In [None]:
reference = 'this cat is absolutely adorable today'
generated = 'this cat is adorable today'
assert n2 == rouge_n(reference, generated, 2)

### 2.1 Evaluate a pretrained summarizer using the example

Choose a summarizer (e.g. gensim, huggingface), summarize the following text (taken from the [CNN-Daily Mail dataset](https://cs.nyu.edu/~kcho/DMQA/)) and calculate the ROUGE-2 score of the summary.

In [None]:
article = """Manchester City starlet Devante Cole, son of Andy Cole, has joined Barnsley on loan until January.
City have also confirmed that £3m midfielder Bruno Zuculini has joined Valencia on loan for the rest of the season. 
Meanwhile Juventus and Roma remain keen on signing Matija Nastasic.
On the move: Manchester City striker Devante Cole, son of Andy, has joined Barnsley on loan"""

reference = """Devante Cole has joined Barnsley on loan until January.
Son of Andy Cole has impressed in the City youth ranks.
City have also confirmed that Bruno Zuculini has joined Valencia."""

In [None]:
# Your solution here

### 3. Dependency parse evaluation

We've discussed the two methods used to evaluate dependency parsers.

Reminder:

 - Labeled attachment score (LAS): the percentage of words that are assigned both the correct syntactic head and the correct dependency label
 - Unlabeled attachment score (UAS): the percentage of words that are assigned the correct syntactic head
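
The two definitions can be illustrated by hand on an invented four-token sentence. The sketch assumes that each parse is stored as a dict from token index to a (head, label) tuple and that both parses share the same tokenization. All names and values are hypothetical.

```python
# Hypothetical 4-token parses: token index -> (head index, dependency label)
toy_gold_graph = {1: (2, 'nsubj'), 2: (0, 'root'), 3: (2, 'obj'), 4: (2, 'punct')}
toy_pred_graph = {1: (2, 'nsubj'), 2: (0, 'root'), 3: (2, 'nsubj'), 4: (3, 'punct')}

# Token 3 has the right head but the wrong label; token 4 has the wrong head
correct_heads = sum(toy_gold_graph[i][0] == toy_pred_graph[i][0] for i in toy_gold_graph)
correct_both = sum(toy_gold_graph[i] == toy_pred_graph[i] for i in toy_gold_graph)

toy_uas = correct_heads / len(toy_gold_graph)   # 3 of 4 heads match
toy_las = correct_both / len(toy_gold_graph)    # 2 of 4 (head, label) pairs match
print(toy_uas, toy_las)  # 0.75 0.5
```

Since a word counted for LAS must also have the correct head, LAS can never exceed UAS.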

### 3.1 UAS method

Implement the UAS method for evaluating graphs!
The input of the function should be two graphs, both formatted in a simplified CoNLL-style dict format, where the keys are the indices of the tokens and the values are tuples consisting of the head and the dependency relation.

In [None]:
def convert_conllu(tree):
    return {token['id']: (token['head'], token['deprel']) for token in tree}

In [None]:
reference_graph = convert_conllu(trees[0])
reference_graph

In [None]:
pred = {1: (0, 'root'), 2: (1, 'punct'), 3: (1, 'flat'), 4: (1, 'punct'), 5: (6, 'amod'),
        6: (7, 'obj'), 7: (1, 'parataxis'), 8: (7, 'obj'), 9: (8, 'flat'), 10: (8, 'flat'),
        11: (8, 'punct'), 12: (8, 'flat'), 13: (8, 'punct'), 14: (15, 'det'), 15: (8, 'appos'),
        16: (18, 'case'), 17: (10, 'det'), 18: (7, 'obl'), 19: (8, 'case'), 20: (21, 'det'),
        21: (18, 'obl'), 22: (23, 'case'), 23: (21, 'nmod'), 24: (21, 'punct'), 25: (28, 'case'),
        26: (28, 'det'), 27: (28, 'amod'), 28: (8, 'obl'), 29: (1, 'punct')}

In [None]:
def uas(gold, predicted):
    raise NotImplementedError()

### 3.2 LAS method

Implement the LAS method as well, similarly to the previous evaluation script.

In [None]:
def las(gold, predicted):
    raise NotImplementedError()

In [None]:
assert 26/29 == uas(reference_graph, pred)
assert 24/29 == las(reference_graph, pred)

# ================ PASSING LEVEL ====================

### 3.3 Try out the evaluation methods with stanza

Evaluate the predictions of stanza on the given example! To do so, you will have to convert the output of stanza to the format expected by the uas and las methods. We recommend consulting the stanza [documentation](https://stanfordnlp.github.io/stanza/tutorials.html) for this.

In [None]:
def stanza_converter(stanza_doc):
    raise NotImplementedError()

In [None]:
# Your solution here

### 3.4 Compare the accuracy of stanza and spacy

Run the spacy dependency parser on the same input as before and evaluate the performance. To do so, you will have to implement a function that converts the output of spacy (see the [documentation](https://spacy.io/usage/linguistic-features#dependency-parse)) to the appropriate format, and then check the output of the las and uas methods.

In [None]:
def spacy_converter(spacy_doc):
    raise NotImplementedError()

In [None]:
# Your solution here

# ================ EXTRA LEVEL ====================