
# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Named Entity Recognition & Intent Detection

## Named Entity Recognition

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

![https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg](https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg)   

The text domain (political, medical, food, ...) is **crucial** when recognizing entities.

In this practice you will:
- Experiment with pre-trained models to extract entities from text

### **Question 1: data preparation**

The data collection is available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt).
This dataset was presented in [1][2] and consists of manually annotated Wikipedia text. The data is already in [CoNLL](https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format) format. Please read the format description carefully before parsing the data.

You need to extract the clean sentences (without annotations) and, for each sentence, the text associated with each entity:
- `sentences`: list of sentences
- `annotations`: list of lists of entities (both string and class information). E.g., `[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')], [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')], ...]`. You can remove the `I-` prefix because, in this data collection, the prefixes do not carry any useful information.
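
The grouping step described above can be sketched with `itertools.groupby`, which merges adjacent tokens that share the same tag. This is only one possible approach, not the reference solution. The helper name `extract_entities` and the hand-written `sample` are illustrative assumptions; the tags in `sample` are reconstructed from the example annotation above rather than read from the file.

```python
from itertools import groupby


def extract_entities(tagged_tokens):
    """Group adjacent tokens that share a non-O tag into (text, class) tuples."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "O":  # tokens outside any entity are skipped
            continue
        text = " ".join(token for token, _ in group)
        entities.append((text, tag.split("-", 1)[-1]))  # drop the I- prefix
    return entities


# Hypothetical sample: (token, tag) pairs for the first sentence
sample = [
    ("010", "I-MISC"), ("is", "O"), ("the", "O"), ("tenth", "O"),
    ("album", "O"), ("from", "O"), ("Japanese", "I-MISC"), ("Punk", "O"),
    ("Techno", "O"), ("band", "O"), ("The", "I-ORG"), ("Mad", "I-ORG"),
    ("Capsule", "I-ORG"), ("Markets", "I-ORG"), (".", "O"),
]
print(extract_entities(sample))
```

Note that, without meaningful `B-`/`I-` prefixes, two distinct entities of the same class that are directly adjacent cannot be told apart and end up merged into one.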

---


[1] Balasuriya, Dominic, et al. "Named entity recognition in Wikipedia." Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Association for Computational Linguistics, 2009.

[2] Nothman, Joel, et al. "Learning multilingual named entity recognition from Wikipedia." Artificial Intelligence 194 (2013): 151-175.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt

--2021-11-23 17:37:26--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318530 (311K) [text/plain]
Saving to: ‘wikigold.conll.txt’


2021-11-23 17:37:26 (13.9 MB/s) - ‘wikigold.conll.txt’ saved [318530/318530]



In [5]:
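def split_text_label(filename):
    """Parse a CoNLL file into sentences, each a list of [token, label] pairs."""
    split_labeled_text = []
    sentence = []
    with open(filename) as f:  # the context manager closes the file for us
        for line in f:
            # blank lines and -DOCSTART- markers separate sentences
            if line.strip() == "" or line.startswith('-DOCSTART'):
                if len(sentence) > 0:
                    split_labeled_text.append(sentence)
                    sentence = []
                continue
            splits = line.split(' ')
            # first column is the token, last column is the entity label
            sentence.append([splits[0], splits[-1].rstrip("\n")])
    # flush the last sentence if the file does not end with a blank line
    if len(sentence) > 0:
        split_labeled_text.append(sentence)
    return split_labeled_text
sentences_with_labels = split_text_label("wikigold.conll.txt")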

In [12]:
# Your code here

cleaned_sentences = [' '.join([t[0] for t in sentence]) for sentence in sentences_with_labels]
cleaned_sentences[0]

'010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets .'

In [79]:
# Entities can be split across multiple tokens, so we keep track
# of the label of the preceding token, which by default is "O"
# (outside any entity)

labels, complete_labels = [], set()

for sentence in sentences_with_labels:

  current_labels, previous_label = [], "O"

  # since the given "entity" can be composed from 
  # different words, it must be "constructed"
  constructed_entity = ""

  for word, current_label in sentence:

    complete_labels.add(current_label)

    # we can append the previous one
    if  current_label == "O" and previous_label != "O":
      current_labels.append((constructed_entity.strip(), previous_label.split("-")[1])) # remove I-
      constructed_entity = "" # initialize again

    # start a new one
    if current_label != "O" and previous_label == "O":
      constructed_entity = word + " "


    # add element to the same label
    if current_label != "O" and previous_label == current_label:
      constructed_entity = constructed_entity + word + " "

    # new entity
    if current_label != "O" and previous_label != "O" and previous_label != current_label:
      current_labels.append((constructed_entity.strip(), previous_label.split("-")[1])) # remove I-
      constructed_entity = word + " " # initialize again with the new word

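    previous_label = current_label

  # flush an entity that runs up to the end of the sentence
  if previous_label != "O":
    current_labels.append((constructed_entity.strip(), previous_label.split("-")[1])) # remove I-

  labels.append(current_labels)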

# remove "O" and "I-"
complete_labels = [l.split("-")[1] for l in list(complete_labels) if l != "O"]

In [80]:
labels[:5]

[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')],
 [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')],
 [('Kojima Minoru', 'PER'),
  ('Good Day', 'MISC'),
  ('Wardanceis', 'MISC'),
  ('UK', 'LOC'),
  ('Killing Joke', 'ORG')],
 [('XXX can of This', 'MISC')],
 [('Cannabis', 'MISC'),
  ('Cannabis', 'MISC'),
  ('P.O.P', 'MISC'),
  ('HUMANITY', 'MISC')]]

In [61]:
complete_labels

['ORG', 'MISC', 'LOC', 'PER']

### **Question 2: inference with spacy for entity recognition**

spaCy pipelines come with a built-in NER component. Instantiate a [spaCy model](https://spacy.io/usage/models) for the English language and, for each sentence in the data collection, get the named entities extracted by the model.

Because the provided data collection only contains a subset of spaCy's labels, map all the classes not available in the data collection to the `MISC` class.

In [None]:
# Your code here

In [76]:
import spacy

nlp = spacy.load("en_core_web_sm")

predictions = []

for sentence in cleaned_sentences:

  out = nlp(sentence)

  entities = []

  # out.ents holds the entity spans recognized in the sentence
  # (see https://github.com/explosion/spaCy/issues/1131)
  for e in out.ents:
    if e.label_ not in complete_labels:
      entities.append((e.text, 'MISC'))
    else:
      entities.append((e.text, e.label_))

  predictions.append(entities)

In [78]:
predictions[:5]

[[('010', 'MISC'),
  ('tenth', 'MISC'),
  ('Japanese', 'MISC'),
  ('The Mad Capsule Markets', 'ORG')],
 [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC')],
 [('Kojima Minoru', 'MISC'),
  ('Good Day', 'MISC'),
  ('Wardanceis', 'MISC'),
  ('UK', 'MISC'),
  ('Killing Joke', 'MISC')],
 [('XXX', 'ORG')],
 [('Cannabis', 'ORG'),
  ('Cannabis', 'ORG'),
  ('P.O.P', 'ORG'),
  ('HUMANITY', 'ORG')]]
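
One caveat about the predictions above: spaCy's English pipelines use OntoNotes-style label names such as `PERSON` and `GPE`, which never match the dataset's `PER` and `LOC` by plain string comparison. This is probably why `Kojima Minoru` and `UK` end up as `MISC`. A minimal sketch of an explicit mapping follows. The dictionary `SPACY_TO_WIKIGOLD` and the helper `map_label` are illustrative names, and treating `GPE` as `LOC` is an assumption rather than part of the exercise text.

```python
# Assumed correspondence between spaCy labels and the dataset classes.
# Anything not listed here falls back to MISC, as the exercise requires.
SPACY_TO_WIKIGOLD = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",  # countries, cities, states (assumption: count them as LOC)
    "LOC": "LOC",
}


def map_label(spacy_label):
    """Map a spaCy entity label onto one of ORG, PER, LOC, MISC."""
    return SPACY_TO_WIKIGOLD.get(spacy_label, "MISC")


for label in ["PERSON", "GPE", "ORG", "ORDINAL"]:
    print(label, "->", map_label(label))
```

With this helper, the loop body could append `(e.text, map_label(e.label_))` instead of comparing `e.label_` against `complete_labels`.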

### **Question 3: compute metrics for evaluating NER**

Use [eval4ner](https://github.com/cyk1337/eval4ner) to evaluate the spaCy model for NER on the parsed dataset.

**Note**: please use `pip install git+https://github.com/MorenoLaQuatra/eval4ner` to install a fixed version of the library. Before passing the parameters to the evaluation function, create a deep copy of each variable (e.g., with `copy.deepcopy`).

The issue has already been reported to the original author.

In [None]:
! pip install git+https://github.com/MorenoLaQuatra/eval4ner

In [None]:
# Your code here
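
As a sanity check alongside the library, strict-match precision, recall and F1 can be computed by hand. The sketch below is not eval4ner and does not reproduce its partial-match scores: it only counts a prediction as correct when both the entity text and the class match a ground-truth tuple exactly. The function name `strict_scores` is an illustrative assumption, and the two sample lists are copied from the first sentence of the outputs shown earlier.

```python
from collections import Counter


def strict_scores(predictions, ground_truths):
    """Micro-averaged precision, recall and F1 under exact (text, class) match."""
    true_positives = n_predicted = n_gold = 0
    for predicted, gold in zip(predictions, ground_truths):
        # multiset intersection, so repeated entities are counted correctly
        overlap = Counter(predicted) & Counter(gold)
        true_positives += sum(overlap.values())
        n_predicted += len(predicted)
        n_gold += len(gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = [[("010", "MISC"), ("Japanese", "MISC"), ("The Mad Capsule Markets", "ORG")]]
pred = [[("010", "MISC"), ("tenth", "MISC"), ("Japanese", "MISC"),
         ("The Mad Capsule Markets", "ORG")]]
precision, recall, f1 = strict_scores(pred, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here three of the four predicted entities are correct and all three gold entities are found, so precision is 0.75 and recall is 1.0.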

### **Question 4: inference with transformers pipeline**

Transformer-based models can be fine-tuned for token-level classification. The most relevant task in this class is NER. Use [transformers pipelines](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.TokenClassificationPipeline) to recognize entities in the previous data collection. 

Evaluate the model using the same procedure as in Q3.

**Note:** the output of the pipeline differs from spaCy's. Please make sure to process the data correctly before running the evaluation.

**Note 2:** the `ignore_labels` parameter could be useful to correctly parse entities.

**Note 3:** the `##` symbol is used when a token is a continuation of the previous one (Poli + ##TO).

In [None]:
%%capture
! pip install datasets transformers

In [None]:
# Your code here
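
Notes 1 and 3 can be illustrated with a small post-processing sketch. Without aggregation, the pipeline returns one dictionary per word piece, with keys such as `word`, `entity` and `index`. The helper below (the name `merge_word_pieces` and the hand-written `tokens` are illustrative assumptions, not real model output) glues `##` continuations back together and joins adjacent pieces of the same class.

```python
def merge_word_pieces(tokens):
    """Merge token-level pipeline output into (text, class) entity tuples."""
    entities = []          # [text, class] pairs under construction
    previous_index = None  # position of the last token consumed
    for token in tokens:
        label = token["entity"].split("-")[-1]  # drop the B-/I- prefix
        word = token["word"]
        contiguous = previous_index is not None and token["index"] == previous_index + 1
        if word.startswith("##") and entities and contiguous:
            entities[-1][0] += word[2:]          # continuation of the same word
        elif entities and contiguous and entities[-1][1] == label:
            entities[-1][0] += " " + word        # next word of the same entity
        else:
            entities.append([word.lstrip("#"), label])  # start a new entity
        previous_index = token["index"]
    return [tuple(entity) for entity in entities]


# Hypothetical pipeline output for a sentence mentioning PoliTO and Turin
tokens = [
    {"word": "Poli", "entity": "I-ORG", "index": 1},
    {"word": "##TO", "entity": "I-ORG", "index": 2},
    {"word": "Turin", "entity": "I-LOC", "index": 5},
]
print(merge_word_pieces(tokens))
```

Alternatively, passing `aggregation_strategy="simple"` when creating the pipeline lets the library group word pieces into entities itself; the manual version is mainly useful for understanding what the raw output looks like.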

## Intent Detection

In data mining, intent mining (or intention mining) is the problem of determining a user's intention from logs of their interaction with a computer system, such as a search engine. Intent detection is the identification and categorization of what a user wants to find or do when they type to or speak with a conversational agent (or a search engine).

![https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png](https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png)

Data source (ATIS dataset): https://github.com/yvchen/JointSLU ; https://www.kaggle.com/siddhadev/atis-dataset-clean/home

Use the provided train/dev/test splits accordingly.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.test.csv

### **Question 5: two-step classification model**

Train a classification model to identify the intent from the sentence text. The model should leverage a pretrained BERT model to generate features for each sentence (no fine-tuning).

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true)


Assess the performance of the generated model by using the **classification accuracy**.

In [None]:
# Your code here
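
Classification accuracy is the fraction of sentences whose predicted intent equals the true intent. A minimal sketch in plain Python follows; the helper name `accuracy` and the intent labels in the example are made up for illustration and are not taken from the ATIS files.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(true == pred for true, pred in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical intent labels for four test sentences
y_true = ["flight", "airfare", "flight", "ground_service"]
y_pred = ["flight", "flight", "flight", "ground_service"]
print(accuracy(y_true, y_pred))
```

The same function can be reused to compare the feature-based classifier with the fine-tuned one, provided both are scored on the same test split.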

### **Question 6: finetuning end-to-end classification model**

Train a new BERT model for the task of [sequence classification](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification), this time fine-tuning BERT itself.

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true)

Assess the performance of the generated model by using the **classification accuracy**.

Which model has better performance?

In [None]:
# Your code here