<a href="https://colab.research.google.com/github/fpgmina/DeepNLP/blob/main/L3_Part_1_NER_and_Intent_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

**Practice 3:** Named Entity Recognition (part 1) & Intent Detection (part 2)

## Named Entity Recognition (NER)

The Named Entity Recognition task aims at identifying and classifying named entities in a text. Named entities are real-world objects such as persons, locations, organizations, etc. The task takes as input a sentence and determines the boundaries of the named entities and their type.

For example, given the sentence:

```
I went to Paris last week.
```

the task is to identify the named entity `Paris` as a location.
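
As a rough illustration of what "boundaries and type" means, an entity can be stored as a character span plus a label. The sketch below only assumes one convenient representation: the `Entity` class is hypothetical and not part of the practice, and the span for `Paris` is located by a plain string search rather than by a model.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical container: character offsets plus entity type."""
    start: int  # index of the first character
    end: int    # index one past the last character
    label: str


sentence = "I went to Paris last week."

# Locate the mention with a plain string search (a real NER model predicts this).
start = sentence.find("Paris")
entity = Entity(start=start, end=start + len("Paris"), label="LOC")

print(entity)
print(sentence[entity.start:entity.end])
```

Slicing the sentence with the stored offsets gives back the entity text, which is why boundaries and type together are enough to describe a prediction.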

The figure below illustrates the NER task:

![Illustration of the NER task](https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg)

In the first part of this practice, you will:
- explore the NER task using pre-trained models available on spaCy and HuggingFace
- evaluate the performance of a spaCy NER model on a custom dataset
- evaluate the performance of a HuggingFace NER model on a custom dataset

NB: Model performance can be evaluated with `seqeval`, a library for evaluating sequence labeling tasks, or with `eval4ner`, which targets the NER task specifically.

### **Question 1: Data preparation**

The first step is to prepare the data. In this practice, you will use the WikiGold dataset [1][2], a collection of annotated sentences from Wikipedia. It is distributed in [CoNLL](https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format) format and can be downloaded [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt).

**Please read the following instructions carefully before starting to work on the practice.**

You need to extract the clean sentences (without annotations) and, for each sentence, the corresponding annotations. The extracted data should have the following format:

- `sentences`: list of sentences
- `annotations`: list of lists of entities, each given as a (string, class) pair. E.g., `[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')], [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')], ...]`. You can remove the `I-` prefix because it carries no useful information in this data collection.

---


[1] Balasuriya, Dominic, et al. "Named entity recognition in Wikipedia." Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Association for Computational Linguistics, 2009.

[2] Nothman, Joel, et al. "Learning multilingual named entity recognition from Wikipedia." Artificial Intelligence 194 (2013): 151-175.

---

The following cell downloads the dataset on Google Colab.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt

To avoid spending too much time on data processing, the following cells prepare the dataset for you.
After running them, you will have the following variables:
- `sentences_with_labels`: a list of sentences, each represented as a list of `[token, label]` pairs
- `sentences`: a list of the sentences in the dataset, as whitespace-joined strings
- `labels`: a list of lists of `(entity text, entity type)` tuples. Each element of the outer list holds the entities of one sentence in the dataset.

Here is an example of the provided data:

```python
sentences_with_labels[0] = [
    ['010', 'I-MISC'],
    ['is', 'O'],
    ['the', 'O'],
    ['tenth', 'O'],
    ['album', 'O'],
    ['from', 'O'],
    ['Japanese', 'I-MISC'],
    ['Punk', 'O'],
    ['Techno', 'O'],
    ['band', 'O'],
    ['The', 'I-ORG'],
    ['Mad', 'I-ORG'],
    ['Capsule', 'I-ORG'],
    ['Markets', 'I-ORG'],
    ['.', 'O']
]

sentences[0] = '010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets .'

labels[0] = [
    ('010', 'MISC'),
    ('Japanese', 'MISC'),
    ('The Mad Capsule Markets', 'ORG')
]
```

Please note that the labels are not in IOB format: every entity token in the example above carries the same `I-` prefix. You can ignore the `I-` prefix because it carries no useful information in this data collection.
Get familiar with the data by printing the first 10 sentences and their corresponding labels. Which labels appear in the dataset?
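
To see what a `B-` prefix would add, the sketch below rewrites `I-`-only tags into the IOB2 convention, where the first token of every entity is marked with `B-`. The helper `to_iob2` is hypothetical and not part of the provided cells. It assumes that a change of tag always starts a new entity, so two adjacent entities of the same type cannot be told apart.

```python
def to_iob2(tags):
    """Give the first token of each run of identical tags a B- prefix."""
    converted = []
    prev = "O"
    for tag in tags:
        if tag == "O" or tag == prev:
            # outside an entity, or the current entity simply continues
            converted.append(tag)
        else:
            # a new entity starts here: swap the prefix for B-
            converted.append("B-" + tag.split("-", 1)[1])
        prev = tag  # compare against the original tag, not the converted one
    return converted


tags = ["I-MISC", "O", "O", "I-ORG", "I-ORG", "O"]
print(to_iob2(tags))
```

Tools such as `seqeval` work on tag sequences like these, while `eval4ner` works on the `(text, type)` tuples built in the following cells.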

In [None]:
%%capture
! pip install datasets
! pip install transformers
! pip install spacy
! python -m spacy download en_core_web_sm

In [None]:
def split_text_label(filename):
    """Read a CoNLL file and return one list of [token, label] pairs per sentence."""
    split_labeled_text = []
    sentence = []
    with open(filename) as f:
        for line in f:
            # blank lines and -DOCSTART- markers separate sentences
            if not line.strip() or line.startswith("-DOCSTART"):
                if sentence:
                    split_labeled_text.append(sentence)
                    sentence = []
                continue
            splits = line.split(" ")
            # the first column is the token, the last column is the label
            sentence.append([splits[0], splits[-1].rstrip("\n")])
    # keep the last sentence when the file does not end with a blank line
    if sentence:
        split_labeled_text.append(sentence)
    return split_labeled_text

sentences_with_labels = split_text_label("wikigold.conll.txt")

In [None]:
print(sentences_with_labels[0])

In [None]:
sentences = []

for sent_list in sentences_with_labels:
    sentence = [s[0] for s in sent_list]
    sentence = " ".join(sentence)
    sentences.append(sentence)

In [None]:
print (sentences[0])

In [None]:
labels = []
overall_labels = []
for sent_list in sentences_with_labels:
    current_labels = []
    prev = "O"
    current_entity = ""
    for w, l in sent_list:
        overall_labels.append(l)
        if l != "O" and prev != "O":
            if l == prev:
                # continue entity
                current_entity += w + " "
            else:
                # end prev and start a new one
                current_labels.append((current_entity.strip(), prev.split("-")[1]))
                current_entity = w + " "
        elif l == "O" and  prev != "O":
            # end prev
            current_labels.append((current_entity.strip(), prev.split("-")[1]))
            current_entity = ""
        elif l != "O" and prev == "O":
            # start new
            current_entity = w + " "

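        prev = l
    # flush an entity that runs up to the last token of the sentence
    if prev != "O":
        current_labels.append((current_entity.strip(), prev.split("-")[1]))
    labels.append(current_labels)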

print(labels)
# unique entity types, without the "O" tag and without the "I-" prefix
overall_labels = sorted({o.split("-")[1] for o in overall_labels if o != "O"})
print(overall_labels)

In [None]:
print (labels[0])

### **Question 2: Inference with spaCy for entity recognition**

spaCy is a free, open-source library for advanced Natural Language Processing in Python. It features NER models for different languages including English.
The models are available [here](https://spacy.io/models).

For this question you are asked to instantiate a spaCy model for English and perform inference on the sentences in the dataset. The English model contains a superset of the labels in the dataset. For this reason, you need to map the labels that are not in the dataset to the `MISC` label.

You are expected to generate an output similar to the following:
```python
[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')]
```

Please pay attention to the token attributes (you can find more information [here](https://spacy.io/api/token#attributes)) and the entity attributes (you can find more information [here](https://spacy.io/api/entityrecognizer)).
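
As a possible refinement of the label mapping described above, some spaCy labels have an obvious counterpart in the dataset even though the names differ (for instance `PERSON` versus `PER`). The sketch below is an assumption and not the rule stated in the question: the dictionary `SPACY_TO_DATASET` and the helper `map_label` are hypothetical names, and the chosen correspondences are a judgment call.

```python
# Assumed correspondence between spaCy labels and the dataset labels.
SPACY_TO_DATASET = {
    "PERSON": "PER",
    "GPE": "LOC",  # countries, cities, states
    "LOC": "LOC",
    "ORG": "ORG",
}


def map_label(spacy_label, mapping=SPACY_TO_DATASET, default="MISC"):
    """Return the dataset label for a spaCy label, falling back to MISC."""
    return mapping.get(spacy_label, default)


for spacy_label in ["PERSON", "GPE", "ORG", "DATE"]:
    print(spacy_label, "->", map_label(spacy_label))
```

Under the simpler rule from the question, `PERSON` and `GPE` would both end up as `MISC`, which is worth keeping in mind when reading the evaluation scores.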

The following cell instantiates a spaCy model for English.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
from tqdm import tqdm

pred_ner = []
for s in tqdm(sentences):
    doc = nlp(s)
    current_ner = []
    for entity in doc.ents:
        # spaCy labels that do not exist in the dataset fall back to MISC
        label = entity.label_ if entity.label_ in overall_labels else "MISC"
        current_ner.append((entity.text, label))

    pred_ner.append(current_ner)
print(pred_ner)

### **Question 3: Compute metrics for evaluating NER**

The output of NER models consists of a set of named entities. To evaluate the performance of a model, we need to compare the predicted named entities with the ground truth.
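
To make this comparison concrete, here is a hand-rolled sketch of the strictest possible matching, where a prediction counts only if both its text and its type equal a ground-truth entity. The function `strict_prf` is hypothetical, the predicted list is invented for illustration, and this is not how any particular evaluation package computes its scores.

```python
def strict_prf(predicted, gold):
    """Micro-averaged precision, recall and F1 on exact (text, label) matches."""
    true_positives = 0
    n_pred = 0
    n_gold = 0
    for pred_sent, gold_sent in zip(predicted, gold):
        remaining = list(gold_sent)
        for entity in pred_sent:
            if entity in remaining:
                remaining.remove(entity)  # each gold entity can be matched once
                true_positives += 1
        n_pred += len(pred_sent)
        n_gold += len(gold_sent)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [[("010", "MISC"), ("Japanese", "MISC"), ("The Mad Capsule Markets", "ORG")]]
predicted = [[("Japanese", "MISC"), ("Mad Capsule Markets", "ORG")]]  # invented example
print(strict_prf(predicted, gold))
```

In this example only `Japanese` matches, because `Mad Capsule Markets` misses the leading `The`. Partial-match schemes exist precisely to give some credit in such cases.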

For this question, use the [`eval4ner`](https://github.com/cyk1337/eval4ner) package to evaluate the performance of the model.

In [None]:
%%capture
! pip install eval4ner

In [None]:
# eval4ner expects (NER_type, text) tuples, e.g.:
# [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')]
# so the two elements of each (text, NER_type) tuple must be swapped

labels_new = [[(ner_type, text) for text, ner_type in sent] for sent in labels]
pred_ner_new = [[(ner_type, text) for text, ner_type in sent] for sent in pred_ner]

In [None]:
import eval4ner.muc as muc

evaluations = muc.evaluate_all(pred_ner_new, labels_new, sentences, verbose=False)
print()

# report every matching scheme; `eval_type` avoids shadowing the built-in `eval`
eval_types = ["exact", "partial", "strict", "type"]
for eval_type in eval_types:
    print(eval_type)
    print("Precision: ", evaluations[eval_type]["precision"])
    print("Recall   : ", evaluations[eval_type]["recall"])
    print("F1_score : ", evaluations[eval_type]["f1_score"])
    print()

### **Question 4: Inference with transformers pipeline**

Transformer-based models can be fine-tuned for token-level classification. The task is to classify each token in a sentence and assign it to a class.
The NER task is a token-level classification task and the models can be used for performing inference on the sentences in the dataset.

You can use the pipeline available in the HuggingFace [transformers library](https://huggingface.co/docs/transformers/main_classes/pipelines). The pipeline lets you run inference on a single sentence or on a list of sentences.

Evaluate the **standard** model using the pipeline (`pipe = pipeline("ner")`). Check the documentation here: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline


A few notes about the question (**read carefully**):
1. The output of the pipeline differs from the output of spaCy. Make sure you process the data correctly before running the evaluation.
2. The `ignore_labels` parameter can be used to exclude labels from the prediction.
3. The `##` symbol marks a token that continues the previous one (Poli + ##TO). You may need to handle this case explicitly to merge the tokens correctly.
4. Use `eval4ner` to evaluate the performance of the model.
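
The merging mentioned in note 3 can be sketched in isolation as follows. The helper `merge_wordpieces` is hypothetical, and `mock_output` is a hand-written stand-in for the pipeline output that shows only the two keys used in this practice (`word` and `entity`); the `I-ORG` tag is assumed for illustration.

```python
def merge_wordpieces(tokens):
    """Glue every piece that starts with '##' onto the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return " ".join(words)


# Hand-written stand-in for the pipeline output (not produced by a model).
mock_output = [
    {"word": "Poli", "entity": "I-ORG"},
    {"word": "##TO", "entity": "I-ORG"},
]
print(merge_wordpieces([item["word"] for item in mock_output]))
```

Joining the pieces with spaces and then removing every `" ##"` occurrence is an equivalent shortcut when the pieces of one entity are already collected in a single string.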

In [None]:
%%capture
! pip install transformers
! pip install datasets

In [None]:
import torch
from transformers import pipeline

# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# ignore_labels=[] keeps the "O" predictions in the output; the merging
# logic in the next cell relies on them to detect where an entity ends
pipe = pipeline("ner", device=device, ignore_labels=[])

In [None]:
from tqdm import tqdm
pred_transformers_ner = []
overall_labels_transformers = []
for i, s in enumerate(tqdm(sentences)):
    out = pipe(s)
    current_labels = []
    prev = "O"
    current_entity = ""
    for o in out:
        overall_labels_transformers.append(o['entity'])
        l = o['entity']
        w = o['word']
        if l != "O" and prev != "O":
            if l == prev:
                # continue entity
                current_entity += w + " "
            else:
                # end prev and start a new one
                current_entity = current_entity.strip()
                current_entity = current_entity.replace(" ##", "")
                current_labels.append((current_entity, prev.split("-")[1]))
                current_entity = w + " "
        elif l == "O" and  prev != "O":
            # end prev
            current_entity = current_entity.strip()
            current_entity = current_entity.replace(" ##", "")
            current_labels.append((current_entity, prev.split("-")[1]))
            current_entity = ""
        elif l != "O" and prev == "O":
            # start new
            current_entity = w + " "

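        prev = l
    # flush an entity that runs up to the last token of the sentence
    if prev != "O":
        current_entity = current_entity.strip().replace(" ##", "")
        current_labels.append((current_entity, prev.split("-")[1]))
    pred_transformers_ner.append(current_labels)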

In [None]:
print (len(labels))
print (len(pred_transformers_ner))
overall_labels_transformers = list(set(overall_labels_transformers))
overall_labels_transformers = [o for o in overall_labels_transformers if o != "O"]
overall_labels_transformers = [o.split("-")[1] for o in overall_labels_transformers]
print (overall_labels_transformers)

In [None]:
# eval4ner expects (NER_type, text) tuples, e.g.:
# [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')]
# so the two elements of each (text, NER_type) tuple must be swapped

labels_new = [[(ner_type, text) for text, ner_type in sent] for sent in labels]
pred_transformers_ner_new = [[(ner_type, text) for text, ner_type in sent] for sent in pred_transformers_ner]

In [None]:
import eval4ner.muc as muc

evaluations = muc.evaluate_all(pred_transformers_ner_new, labels_new, sentences, verbose=False)
print()

# report every matching scheme; `eval_type` avoids shadowing the built-in `eval`
eval_types = ["exact", "partial", "strict", "type"]
for eval_type in eval_types:
    print(eval_type)
    print("Precision: ", evaluations[eval_type]["precision"])
    print("Recall   : ", evaluations[eval_type]["recall"])
    print("F1_score : ", evaluations[eval_type]["f1_score"])
    print()