<a href="https://colab.research.google.com/github/francescodisalvo05/polito-deep-nlp/blob/main/Labs/Lab_04_NER_and_Intent_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Named Entities Recognition & Intent Detection

## Named Entities Recognition

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

![https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg](https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg)   

The text domain (political, medical, food, ...) is **crucial** when recognizing entities.

In this practice you will:
- Experiment with pre-trained models to extract entities from text

### **Question 1: data preparation**

The data collection is available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt). 
This dataset was presented in [1][2] and consists of manually annotated Wikipedia text. The data is already in [CONLL](https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format) format. Please read the format description carefully before parsing the data.

You need to extract the clean sentences (without annotations) and, for each sentence, the text and class of each entity:
- `sentences`: list of sentences
- `annotations`: list of lists of entities (both string and class information). E.g., `[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')], [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')], ...]`. You can remove the `I-` prefix because the prefixes in this data collection do not carry useful information.
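
As a rough illustration of the target structure, the sketch below parses a single sentence from an in-memory string instead of the downloaded file. The sample is reconstructed from the first sentence of the collection and its annotation as they appear in this notebook; `parse_sentence` is a hypothetical helper, and the "one `token label` pair per line" layout is an assumption about the file.

```python
from itertools import groupby

# One "token label" pair per line (assumed layout), reconstructed from the
# first sentence of the collection and its expected annotation.
sample = """010 I-MISC
is O
the O
tenth O
album O
from O
Japanese I-MISC
Punk O
Techno O
band O
The I-ORG
Mad I-ORG
Capsule I-ORG
Markets I-ORG
. O
"""


def parse_sentence(block):
    """Return (clean sentence, [(entity text, entity class), ...])."""
    pairs = [line.rsplit(" ", 1) for line in block.strip().splitlines()]
    text = " ".join(token for token, _ in pairs)
    entities = []
    # consecutive tokens that share the same tag form one entity
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != "O":
            entity_text = " ".join(token for token, _ in group)
            entities.append((entity_text, tag.split("-", 1)[1]))  # drop "I-"
    return text, entities


sentence, entities = parse_sentence(sample)
print(sentence)
print(entities)
```

Grouping by identical consecutive tags means that two adjacent entities of the same class are merged into one, which is a limitation of this simple approach.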

---


[1] Balasuriya, Dominic, et al. "Named entity recognition in wikipedia."
    Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Association for Computational Linguistics, 2009.

[2] Nothman, Joel, et al. "Learning multilingual named entity recognition
    from Wikipedia." Artificial Intelligence 194 (2013): 151-175 

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt

--2021-11-24 17:11:37--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318530 (311K) [text/plain]
Saving to: ‘wikigold.conll.txt’


2021-11-24 17:11:37 (9.06 MB/s) - ‘wikigold.conll.txt’ saved [318530/318530]



In [None]:
def split_text_label(filename):
    """Parse a CoNLL file into a list of sentences, each a list of [token, label] pairs."""
    split_labeled_text = []
    sentence = []
    with open(filename) as f:
        for line in f:
            # blank lines and -DOCSTART- markers separate sentences
            if line.strip() == "" or line.startswith('-DOCSTART'):
                if len(sentence) > 0:
                    split_labeled_text.append(sentence)
                    sentence = []
                continue
            splits = line.split(' ')
            sentence.append([splits[0], splits[-1].rstrip("\n")])
    # store the last sentence if the file does not end with a blank line
    if len(sentence) > 0:
        split_labeled_text.append(sentence)
    return split_labeled_text
sentences_with_labels = split_text_label("wikigold.conll.txt")

In [None]:
# Your code here

cleaned_sentences = [' '.join([t[0] for t in sentence]) for sentence in sentences_with_labels]
cleaned_sentences[0]

'010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets .'

In [None]:
# we have to take care of the entities that are split into
# multiple tokens. Therefore we keep track of the "preceding"
# label, which by default is the outside tag "O"

labels, complete_labels = [], set()

for sentence in sentences_with_labels:

  current_labels, previous_label = [], "O"

  # since the given "entity" can be composed from 
  # different words, it must be "constructed"
  constructed_entity = ""

  for word, current_label in sentence:

    complete_labels.add(current_label)

    # we can append the previous one
    if  current_label == "O" and previous_label != "O":
      current_labels.append((constructed_entity.strip(), previous_label.split("-")[1])) # remove I-
      constructed_entity = "" # initialize again

    # start a new one
    if current_label != "O" and previous_label == "O":
      constructed_entity = word + " "


    # add element to the same label
    if current_label != "O" and previous_label == current_label:
      constructed_entity = constructed_entity + word + " "

    # new entity
    if current_label != "O" and previous_label != "O" and previous_label != current_label:
      current_labels.append((constructed_entity.strip(), previous_label.split("-")[1])) # remove I-
      constructed_entity = word + " " # initialize again with the new word

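    previous_label = current_label

  # flush an entity that runs up to the last token of the sentence
  if previous_label != "O":
    current_labels.append((constructed_entity.strip(), previous_label.split("-")[1]))

  labels.append(current_labels)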

# remove "O" and "I-"
complete_labels = [l.split("-")[1] for l in list(complete_labels) if l != "O"]

In [None]:
labels[:5]

[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')],
 [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')],
 [('Kojima Minoru', 'PER'),
  ('Good Day', 'MISC'),
  ('Wardanceis', 'MISC'),
  ('UK', 'LOC'),
  ('Killing Joke', 'ORG')],
 [('XXX can of This', 'MISC')],
 [('Cannabis', 'MISC'),
  ('Cannabis', 'MISC'),
  ('P.O.P', 'MISC'),
  ('HUMANITY', 'MISC')]]

In [None]:
complete_labels

['LOC', 'ORG', 'PER', 'MISC']

### **Question 2: inference with spacy for entity recognition**

spaCy pipelines come with a built-in NER component. Instantiate a [spacy model](https://spacy.io/usage/models) for the English language and, for each sentence in the data collection, get the named entities extracted by the model.

The data collection only contains a subset of the spaCy labels, so map all the classes that are not available in the data collection to the `MISC` class.
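
A minimal sketch of such a mapping is shown below. The `SPACY_ALIASES` table and the `to_wikigold` helper are assumptions introduced for illustration: spaCy's English models use label names such as `PERSON` and `GPE`, so aliasing them to `PER` and `LOC` before falling back to `MISC` is one possible design choice, not something the exercise prescribes.

```python
# Classes available in the data collection
WIKIGOLD_LABELS = {"LOC", "ORG", "PER", "MISC"}

# Assumption: treat these spaCy labels as equivalent to a data collection class
SPACY_ALIASES = {"PERSON": "PER", "GPE": "LOC"}


def to_wikigold(spacy_label):
    """Map a spaCy entity label to one of the data collection classes."""
    label = SPACY_ALIASES.get(spacy_label, spacy_label)
    # everything that is still unknown falls back to MISC
    return label if label in WIKIGOLD_LABELS else "MISC"


for spacy_label in ["PERSON", "GPE", "ORG", "CARDINAL"]:
    print(spacy_label, "->", to_wikigold(spacy_label))
```

Without the alias table, every `PERSON` entity would end up in `MISC` and could never match a gold `PER` annotation in type-sensitive evaluation modes.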

In [None]:
# Your code here

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

predictions = []

for sentence in cleaned_sentences:

  out = nlp(sentence)

  entities = []

  # https://github.com/explosion/spaCy/issues/1131
  # out.ents!
  for e in out.ents:
    if e.label_ not in complete_labels:
      entities.append((e.text, 'MISC'))
    else:
      entities.append((e.text, e.label_))

  predictions.append(entities)

In [None]:
predictions[:5]

[[('010', 'MISC'),
  ('tenth', 'MISC'),
  ('Japanese', 'MISC'),
  ('The Mad Capsule Markets', 'ORG')],
 [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC')],
 [('Kojima Minoru', 'MISC'),
  ('Good Day', 'MISC'),
  ('Wardanceis', 'MISC'),
  ('UK', 'MISC'),
  ('Killing Joke', 'MISC')],
 [('XXX', 'ORG')],
 [('Cannabis', 'ORG'),
  ('Cannabis', 'ORG'),
  ('P.O.P', 'ORG'),
  ('HUMANITY', 'ORG')]]

### **Question 3: compute metrics for evaluating NER**

Use [eval4ner](https://github.com/cyk1337/eval4ner) to evaluate the spacy model for NER on the parsed dataset.

**Note**: please use `pip install git+https://github.com/MorenoLaQuatra/eval4ner` to install a fixed version of the library. Before passing the arguments to the evaluation function, create a deepcopy of each variable.

The issue has already been reported to the original author.

In [None]:
! pip install git+https://github.com/MorenoLaQuatra/eval4ner

Collecting git+https://github.com/MorenoLaQuatra/eval4ner
  Cloning https://github.com/MorenoLaQuatra/eval4ner to /tmp/pip-req-build-moro925z
  Running command git clone -q https://github.com/MorenoLaQuatra/eval4ner /tmp/pip-req-build-moro925z
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: eval4ner
  Building wheel for eval4ner (PEP 517) ... done
  Created wheel for eval4ner: filename=eval4ner-0.0.4-py3-none-any.whl size=6306 sha256=fd71e44184590052eeb726776e2b9ab541e1bf498704b62568e5752fd50f302c
  Stored in directory: /tmp/pip-ephem-wheel-cache-9il7ocet/wheels/58/d2/e2/4b3613c62c5ceb2f9e5f021bd6d0a6f2490c01a927b07f154c
Successfully built eval4ner
Installing collected packages: eval4ner
Successfully installed eval4ner-0.0.4


In [None]:
# Your code here

In [None]:
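from copy import deepcopy
import eval4ner.muc as muc

# pass deep copies of the inputs, as recommended in the note above
evaluations = muc.evaluate_all(deepcopy(predictions), deepcopy(labels), deepcopy(cleaned_sentences), verbose=False)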


 NER evaluation scores:
  strict mode, Precision=0.1939, Recall=0.1637, F1:0.1732
   exact mode, Precision=0.3299, Recall=0.2680, F1:0.2868
 partial mode, Precision=0.3299, Recall=0.2680, F1:0.2868
    type mode, Precision=0.1939, Recall=0.1637, F1:0.1732
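
To build some intuition for the difference between the modes, the sketch below scores a single sentence by hand. It is not eval4ner's implementation: it only covers the *strict* mode (same text and same class) and the *exact* mode (same text, class ignored), and it leaves out the *partial* and *type* modes, which give credit for overlapping boundaries. The `prf` helper is hypothetical; the gold and predicted entities are copied from the third sentence shown in this notebook.

```python
from collections import Counter


def prf(pred, gold, key=lambda entity: entity):
    """Precision, recall and F1 for one sentence, matching entities by `key`."""
    pred_counts = Counter(key(e) for e in pred)
    gold_counts = Counter(key(e) for e in gold)
    # multiset intersection: a repeated entity counts once per occurrence
    hits = sum((pred_counts & gold_counts).values())
    precision = hits / sum(pred_counts.values()) if pred_counts else 0.0
    recall = hits / sum(gold_counts.values()) if gold_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = [("Kojima Minoru", "PER"), ("Good Day", "MISC"), ("Wardanceis", "MISC"),
        ("UK", "LOC"), ("Killing Joke", "ORG")]
pred = [("Kojima Minoru", "MISC"), ("Good Day", "MISC"), ("Wardanceis", "MISC"),
        ("UK", "MISC"), ("Killing Joke", "MISC")]

strict = prf(pred, gold)                              # text and class must match
exact = prf(pred, gold, key=lambda entity: entity[0])  # only the text must match

print("strict:", strict)
print("exact: ", exact)
```

All five boundaries are right, so the exact score is perfect, while only the two `MISC` entities also have the right class and count in strict mode.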


### **Question 4: inference with transformers pipeline**

Transformer-based models can be fine-tuned for token-level classification. The most relevant task in this class is NER. Use [transformers pipelines](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.TokenClassificationPipeline) to recognize entities in the previous data collection. 

Evaluate the model using the same procedure as in Q3.

**Note:** the output of the pipeline differs from the spaCy one. Make sure to process the data correctly before running the evaluation.

**Note 2:** the `ignore_labels` parameter could be useful to correctly parse entities.

**Note 3:** the `##` symbol marks a token that is a continuation of the previous one (Poli + ##TO)
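
The continuation rule of Note 3 can be sketched on plain strings, without loading any model. `merge_wordpieces` is a hypothetical helper, and the input list mimics what a WordPiece tokenizer might return; the actual split depends on the model vocabulary.

```python
def merge_wordpieces(pieces):
    """Glue '##' continuation pieces back onto the word they belong to."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # continuation: append without a space
        else:
            words.append(piece)     # a new word starts here
    return words


pieces = ["I", "study", "at", "Poli", "##TO"]
print(merge_wordpieces(pieces))
```

The same idea applies to the pipeline output: a token whose `word` starts with `##` extends the previous token instead of starting a new one.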

In [None]:
%%capture
! pip install datasets transformers

In [None]:
# Your code here

In [None]:
from transformers import pipeline

# ignore_labels=[] keeps the "O" tokens in the output, so entity boundaries stay visible
pipe = pipeline("ner", ignore_labels=[])

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



In [None]:
test_output = pipe("I love Central Park")
test_output # Central and Park are two different tokens, therefore we need to manage it

[{'end': 1,
  'entity': 'O',
  'index': 1,
  'score': 0.9969644,
  'start': 0,
  'word': 'I'},
 {'end': 6,
  'entity': 'O',
  'index': 2,
  'score': 0.9990342,
  'start': 2,
  'word': 'love'},
 {'end': 14,
  'entity': 'I-LOC',
  'index': 3,
  'score': 0.98957646,
  'start': 7,
  'word': 'Central'},
 {'end': 19,
  'entity': 'I-LOC',
  'index': 4,
  'score': 0.98647004,
  'start': 15,
  'word': 'Park'}]

In [None]:
predictions_ner = []
all_labels_ner = []

# assumes the pipeline returns every token, including the "O" ones
for sentence in cleaned_sentences:

  current_output = pipe(sentence)

  curr_labels = []
  previous_type, span_start, span_end = "O", None, None

  for token in current_output:

    entity = token['entity']
    all_labels_ner.append(entity)

    # "##" pieces continue the previous word: just extend the open span
    if token['word'].startswith("##"):
      if previous_type != "O":
        span_end = token['end']
      continue

    current_type = entity.split("-")[-1]  # "I-LOC" -> "LOC", "O" -> "O"

    # case 1 - the open entity ends here: store it
    if previous_type != "O" and (current_type != previous_type or entity.startswith("B-")):
      curr_labels.append((sentence[span_start:span_end], previous_type))
      previous_type = "O"

    # case 2 - start a new entity or extend the open one
    if current_type != "O":
      if previous_type == "O":
        span_start = token['start']
      span_end = token['end']

    previous_type = current_type

  # flush an entity that reaches the end of the sentence
  if previous_type != "O":
    curr_labels.append((sentence[span_start:span_end], previous_type))

  predictions_ner.append(curr_labels)

# note: running the model over the whole collection takes a long time

In [None]:
import eval4ner.muc as muc

evaluations = muc.evaluate_all(predictions_ner, labels, cleaned_sentences, verbose=False)

## Intent Detection

In data mining, intention mining or intent mining is the problem of determining a user's intention from logs of their behavior in interaction with a computer system, such as a search engine. Intent detection is the identification and categorization of what a user intended or wanted to find when typing to or speaking with a conversational agent (or a search engine).

![https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png](https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png)

Data source (ATIS dataset): https://github.com/yvchen/JointSLU ; https://www.kaggle.com/siddhadev/atis-dataset-clean/home

Use the provided train/dev/test split accordingly.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.test.csv

--2021-11-24 21:35:35--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 838864 (819K) [text/plain]
Saving to: ‘atis.train.csv’


2021-11-24 21:35:35 (18.3 MB/s) - ‘atis.train.csv’ saved [838864/838864]

--2021-11-24 21:35:35--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112033 (109K) [text/plain]
Saving to: ‘atis.dev.c

### **Question 5: two-step classification model**

Train a classification model to identify the intent from the sentence text. The model should leverage a pretrained BERT model to generate features for each sentence (no fine-tuning).

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true)


Assess the performance of the generated model by using the **classification accuracy**.
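
Classification accuracy is the fraction of sentences whose predicted intent equals the true one. The sketch below computes it by hand on a few made-up predictions; the `accuracy` helper and the example predictions are illustrative only, while the intent names come from the ATIS data. `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# made-up predictions, for illustration only
y_true = ["atis_flight", "atis_flight", "atis_airfare", "atis_abbreviation"]
y_pred = ["atis_flight", "atis_airfare", "atis_airfare", "atis_abbreviation"]

print(accuracy(y_true, y_pred))  # 3 correct out of 4 -> 0.75
```

Because most ATIS sentences belong to a single intent, accuracy alone can hide errors on the rare classes, so it is worth reading it next to a per-class report.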

In [None]:
# Your code here

In [None]:
# !pip install sklearn 
!pip install sentence-transformers

In [None]:
from sklearn.metrics import classification_report
from sentence_transformers import SentenceTransformer
import pandas as pd

In [None]:
df_train = pd.read_csv('atis.train.csv')
df_val = pd.read_csv('atis.dev.csv')
df_test = pd.read_csv('atis.test.csv')

In [None]:
df_test.head()

Unnamed: 0,id,tokens,slots,intent
0,test-00001,BOS what are the coach flights between dallas ...,O O O O B-class_type O O B-fromloc.city_name O...,atis_flight
1,test-00002,BOS i want a flight from nashville to seattle ...,O O O O O O B-fromloc.city_name O B-toloc.city...,atis_flight
2,test-00003,BOS i need a flight leaving kansas city to chi...,O O O O O O B-fromloc.city_name I-fromloc.city...,atis_flight
3,test-00004,BOS explain meal codes sd d EOS,O O B-meal O B-meal_code I-meal_code O,atis_abbreviation
4,test-00005,BOS show me all flights from atlanta to san fr...,O O O O O O B-fromloc.city_name O B-toloc.city...,atis_flight


In [None]:
def split_clean_df(df):

  tokens = df.tokens.values
  labels = df.intent.values

  # remove BOS and EOS
  tokens = [token.replace("BOS ", "") for token in tokens]
  tokens = [token.replace(" EOS", "") for token in tokens]

  return tokens, labels

train_tokens, y_train = split_clean_df(df_train)
val_tokens, y_val = split_clean_df(df_val)
test_tokens, y_test = split_clean_df(df_test)

In [None]:
train_tokens[:5], y_train[:5]

(['what is the cost of a round trip flight from pittsburgh to atlanta beginning on april twenty fifth and returning on may sixth',
  'now i need a flight leaving fort worth and arriving in denver no later than 2 pm next monday',
  'i need to fly from kansas city to chicago leaving next wednesday and returning the following day',
  'what is the meaning of meal code s',
  'show me all flights from denver to pittsburgh which serve a meal for the day after tomorrow'],
 array(['atis_airfare', 'atis_flight', 'atis_flight', 'atis_abbreviation',
        'atis_flight'], dtype=object))

In [None]:
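# note: "stsb-mpnet-base-v2" is an MPNet-based sentence encoder, used here as a frozen feature extractor
bert = SentenceTransformer("stsb-mpnet-base-v2")

# encode the sentences into fixed-size embeddings (no fine-tuning)
X_train = bert.encode(train_tokens, show_progress_bar=True)
X_dev = bert.encode(val_tokens, show_progress_bar=True)
X_test = bert.encode(test_tokens, show_progress_bar=True)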


In [None]:
# convert the labels into numerical values
from sklearn.preprocessing import LabelEncoder
import numpy as np


train_labels_set = list(set(y_train))

# LabelEncoder expects a 1-D collection of labels
label_encoder = LabelEncoder()
label_encoder.fit(train_labels_set)

# encode all of them
encoded_y_train = label_encoder.transform(y_train)
encoded_y_val = label_encoder.transform(y_val)
encoded_y_test = label_encoder.transform(y_test)


In [None]:
len(train_labels_set)

17

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.naive_bayes import GaussianNB

# very naive baseline

gnb = GaussianNB().fit(X_train, encoded_y_train)
y_pred_dev = gnb.predict(X_dev)

print("Validation set : \n")
print(classification_report(encoded_y_val, y_pred_dev))

Validation set : 

              precision    recall  f1-score   support

           0       0.88      0.82      0.85        17
           1       0.78      0.88      0.82         8
           2       0.60      0.87      0.71        46
           3       0.35      0.69      0.47        16
           4       1.00      0.67      0.80         3
           5       1.00      1.00      1.00         4
           6       0.20      0.50      0.29         2
           7       1.00      1.00      1.00         3
           8       0.96      0.83      0.89       423
           9       0.29      0.67      0.40         3
          10       0.00      0.00      0.00         2
          11       0.23      0.71      0.34         7
          12       1.00      1.00      1.00         2
          13       1.00      0.89      0.94        28
          14       0.00      0.00      0.00         1
          15       0.56      0.83      0.67         6
          16       0.00      0.00      0.00         1

    acc

In [None]:
y_pred_test = gnb.predict(X_test)

print("Test set : \n")
print(classification_report(encoded_y_test, y_pred_test))

Test set : 

              precision    recall  f1-score   support

           0       0.94      0.94      0.94        16
           1       0.73      1.00      0.84         8
           2       0.74      0.83      0.78        54
           3       0.36      0.72      0.48        18
           4       1.00      0.75      0.86         4
           5       1.00      1.00      1.00         4
           6       1.00      0.33      0.50         3
           7       0.75      1.00      0.86         3
           8       0.96      0.81      0.88       424
           9       0.30      1.00      0.46         3
          10       0.00      0.00      0.00         2
          11       0.19      0.86      0.31         7
          12       1.00      0.75      0.86         4
          13       1.00      0.93      0.96        29
          14       0.50      1.00      0.67         1
          15       0.33      0.80      0.47         5
          16       0.00      0.00      0.00         1

    acc

### **Question 6: finetuning end-to-end classification model**

Train a new BERT model for the task of [sequence classification](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification) (including BERT fine-tuning).

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true)

Assess the performance of the generated model by using the **classification accuracy**.

Which model has better performance?

In [None]:
# Your code here

In [42]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=17)

loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bert-base-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc7043

In [43]:
bert_train = tokenizer(train_tokens, padding="max_length", truncation=True, max_length=64)
bert_eval = tokenizer(val_tokens, padding="max_length", truncation=True, max_length=64)
bert_test = tokenizer(test_tokens, padding="max_length", truncation=True, max_length=64)

In [44]:
# taken from the lab solution
# note: look deeper into BERT pipelines

In [45]:
import torch

# bert requires tensors

class AtisDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_ds = AtisDataset(bert_train, encoded_y_train)
eval_ds = AtisDataset(bert_eval, encoded_y_val)
test_ds = AtisDataset(bert_test, encoded_y_test)

In [46]:
from sklearn.metrics import accuracy_score, f1_score, classification_report
def compute_metrics(pred):
  predictions = np.argmax(pred.predictions, axis=-1)
  labels = pred.label_ids
  return {
      "acc": accuracy_score(labels, predictions),
      "f1_macro": f1_score(labels, predictions, average="macro"),
      "f1_weight": f1_score(labels, predictions, average="weighted")
  }

In [47]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=10,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_ds,         # training dataset
    eval_dataset=eval_ds,             # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 4274
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 268


Epoch,Training Loss,Validation Loss,Acc,F1 Macro,F1 Weight
1,0.2692,0.188373,0.958042,0.491213,0.947132


***** Running Evaluation *****
  Num examples = 572
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-268
Configuration saved in ./results/checkpoint-268/config.json
Model weights saved in ./results/checkpoint-268/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-268 (score: 0.18837293982505798).


TrainOutput(global_step=268, training_loss=0.5597691658272672, metrics={'train_runtime': 114.3443, 'train_samples_per_second': 37.378, 'train_steps_per_second': 2.344, 'total_flos': 140586012752640.0, 'train_loss': 0.5597691658272672, 'epoch': 1.0})

In [48]:
preds = trainer.predict(test_ds)

# pick the class with the highest logit for each test sentence
y_pred = np.argmax(preds.predictions, axis=-1)
print(classification_report(encoded_y_test, y_pred))

***** Running Prediction *****
  Num examples = 586
  Batch size = 32


              precision    recall  f1-score   support

           0       0.88      0.94      0.91        16
           1       1.00      0.88      0.93         8
           2       0.96      1.00      0.98        54
           3       0.86      1.00      0.92        18
           4       0.00      0.00      0.00         4
           5       1.00      0.50      0.67         4
           6       0.00      0.00      0.00         3
           7       1.00      0.67      0.80         3
           8       0.99      0.99      0.99       424
           9       0.00      0.00      0.00         3
          10       0.00      0.00      0.00         2
          11       1.00      0.57      0.73         7
          12       0.00      0.00      0.00         4
          13       0.78      0.97      0.86        29
          14       1.00      1.00      1.00         1
          15       0.31      0.80      0.44         5
          16       0.00      0.00      0.00         1

    accuracy              