## Transfer learning & fine-tuning

So far we have seen how to use pre-trained Hugging Face language models (e.g. BERT, T5). In this part of the workshop we are going to talk about fine-tuning these pre-trained models for specific downstream NLP tasks, e.g. document classification (sentiment), named-entity recognition (NER) and summarisation.

This is generally known as transfer learning. Transfer learning is a machine learning technique for adapting pretrained models to solve specialized problems. Sequential transfer learning is learning on one task, or one dataset, and then transferring this learning to another task or dataset.

## Install dependencies

In [3]:
!pip install transformers datasets torch 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

### Dataset: The Yelp Review Full dataset for text classification.
Before we can fine-tune a pretrained model, we need to download a dataset and prepare it for training. We are going to use the Yelp Review Full dataset for fine-tuning.

It is drawn from the Yelp dataset, which is a subset of Yelp's businesses, reviews and user data.

Each example contains the review text and the corresponding label: an integer from 0 to 4, standing for 1 to 5 stars.



In [2]:
from datasets import load_dataset
dataset = load_dataset('yelp_review_full') 
dataset 


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

Let's take a look at an example:

In [3]:
dataset["train"][42]

{'label': 4,
 'text': 'What a find! I stopped in here for breakfast while in town for business. The service is so friendly I thought I was down south. The service was quick, frankly and felt like I was with family. \\nFantastic poached eggs, Cajun homefries and crispy bacon. Gab and Eat is definitely a place I world recommend to locals. I was stuffed and the bill was only $8.00.'}

Remember that we need to process the text using a tokenizer. We will use padding and truncation to handle variations in sequence length. Why do we need to do this?
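
One way to approach the question: a batch is fed to the model as a rectangular tensor, so every sequence in it must have the same length. The sketch below is a plain-Python illustration, not the tokenizer's implementation. The helper `pad_or_truncate`, the value `max_length=8` and the sample ids are all hypothetical, and a real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up with the pad token id
    mask += [0] * n_pad                  # 0 tells the model to ignore the padding
    return ids, mask

# A short sequence gets padded, a long one gets truncated
short_ids, short_mask = pad_or_truncate([101, 146, 27438, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 112)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have length 8, so they can be stacked into one tensor, and the attention mask records which positions hold real tokens.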

The primary purpose of [`map()`](https://huggingface.co/docs/datasets/process#map) is to speed up processing functions. It allows you to apply a processing function to each example in a dataset, independently or in batches. In this case we will use it to apply the `tokenize_function()` to the data. 

This applies the function to every element of every split in the dataset, so the train and test data are preprocessed with a single command.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast=True)


In [5]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

To reduce the time it takes for training, we can create smaller subsets of the full dataset for fine-tuning:

In [7]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(20000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(2000))

In [8]:
print(small_train_dataset[0])

{'label': 4, 'text': "I stalk this truck.  I've been to industrial parks where I pretend to be a tech worker standing in line, strip mall parking lots, and of course the farmer's market.  The bowls are so so absolutely divine.  The owner is super friendly and he makes each bowl by hand with an incredible amount of pride.  You gotta eat here guys!!!", 'input_ids': [101, 146, 27438, 1142, 4202, 119, 146, 112, 1396, 1151, 1106, 3924, 8412, 1187, 146, 9981, 1106, 1129, 170, 13395, 7589, 2288, 1107, 1413, 117, 6322, 8796, 5030, 7424, 117, 1105, 1104, 1736, 1103, 9230, 112, 188, 2319, 119, 1109, 20400, 1132, 1177, 1177, 7284, 10455, 119, 1109, 3172, 1110, 7688, 4931, 1105, 1119, 2228, 1296, 7329, 1118, 1289, 1114, 1126, 10965, 2971, 1104, 8188, 119, 1192, 13224, 3940, 1303, 3713, 106, 106, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [9]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 20000
})

In [10]:
small_eval_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2000
})

## Train
We will be using 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for training. The API supports a wide range of training options & features.

First we need to load the model we are going to fine-tune for a classification task. We are using the pretrained BERT base model with a sequence classification head on top. Start by loading your model and specifying the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), we know there are five labels.

The authors of BERT have released several versions of BERT pretrained on massive amounts of data, including a multilingual version which supports 104 languages in a single model. You can try these out in your spare time!

In [11]:
from transformers import AutoModelForSequenceClassification
model_name = "bert-base-cased" 
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Think about what this warning is telling us ...

In [12]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)
model.to(device)

cuda


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

To evaluate our model's performance we need to pass the `Trainer` a function for computing and reporting the metrics. You can load different metrics with `load_metric()` (e.g. `accuracy`, `precision`, `recall`, `f1`); in more recent releases this functionality lives in the separate 🤗 Evaluate library. You can read all about the [accuracy metric](https://huggingface.co/spaces/evaluate-metric/accuracy).

In [13]:

from datasets import load_metric

metric = load_metric("accuracy")

Next we define a function that calls the `compute` method on `metric` to calculate the prediction accuracy. The model does not output class labels directly: it returns `logits`, the raw, unnormalised scores of the last layer of the neural network, and these must first be converted to predictions.

The `argmax()` function returns the index of the largest value in the `logits`, which corresponds to the most likely class as predicted by the model.
The `softmax()` function rescales the `logits` to values between 0 and 1 that sum to 1, giving us a probability for each class. For accuracy we only need the predicted class, so `argmax()` alone is enough.
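
As a rough illustration of the difference between the two functions, the sketch below applies both to made-up logits for three reviews and then computes accuracy by hand. The numbers and the helper name `softmax` are assumptions for illustration only. In the workshop code the same steps are handled by `np.argmax` and the loaded metric.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 3 reviews over 5 classes (indices 0-4)
batch_logits = [
    [0.1, 0.3, 0.2, 1.5, 2.9],
    [2.2, 1.1, 0.0, -0.5, -1.0],
    [0.0, 0.4, 1.8, 1.7, 0.2],
]
labels = [4, 0, 3]

probs = [softmax(row) for row in batch_logits]
# argmax over the class axis: index of the largest logit in each row
predictions = [row.index(max(row)) for row in batch_logits]
# accuracy: fraction of predictions that match the reference labels
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)
print(accuracy)
```

Softmax preserves the ordering of the scores, so taking the argmax of the probabilities gives the same predictions as taking it of the logits.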

In [14]:
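import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    # Predicted class = index of the largest logit for each example
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)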

To monitor the evaluation metrics during fine-tuning we specify the `evaluation_strategy` parameter in the training arguments; here we evaluate at the end of each epoch.

The `TrainingArguments` class holds all the training hyperparameters, including `output_dir`, which sets where the training checkpoints are saved.

In [15]:
from transformers import TrainingArguments, Trainer
import numpy as np


training_args = TrainingArguments(output_dir="Bert_Classifier",
                                  num_train_epochs=3,
                                  evaluation_strategy="epoch",
                                  overwrite_output_dir=True,
                                  push_to_hub=True
                                  )

In [16]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [17]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


Create a Trainer object specifying the model, training arguments, datasets and the evaluation function we defined above.

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/jenniferjane/Bert_Classifier into local empty directory.


We are now ready to start fine-tuning the model for the text classification task, by calling the `train()` method.

In [19]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7500



Saving model checkpoint to Bert_Classifier/checkpoint-500
Configuration saved in Bert_Classifier/checkpoint-500/config.json
Model weights saved in Bert_Classifier/checkpoint-500/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/checkpoint-500/tokenizer_config.json
Special tokens file saved in Bert_Classifier/checkpoint-500/special_tokens_map.json
tokenizer config file saved in Bert_Classifier/tokenizer_config.json
Special tokens file saved in Bert_Classifier/special_tokens_map.json
Saving model checkpoint to Bert_Classifier/checkpoint-1000
Configuration saved in Bert_Classifier/checkpoint-1000/config.json
Model weights saved in Bert_Classifier/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/checkpoint-1000/tokenizer_config.json
Special tokens file saved in Bert_Classifier/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to Bert_Classifier/checkpoint-1500
Configuration saved in Bert_Classifier/checkpoint-1500/config.json
Mod

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.9528 | 0.909739 | 0.5985 |
| 2 | 0.7607 | 0.896941 | 0.627 |
| 3 | 0.5039 | 1.054579 | 0.634 |


Saving model checkpoint to Bert_Classifier/checkpoint-3000
Configuration saved in Bert_Classifier/checkpoint-3000/config.json
Model weights saved in Bert_Classifier/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/checkpoint-3000/tokenizer_config.json
Special tokens file saved in Bert_Classifier/checkpoint-3000/special_tokens_map.json
Saving model checkpoint to Bert_Classifier/checkpoint-3500
Configuration saved in Bert_Classifier/checkpoint-3500/config.json
Model weights saved in Bert_Classifier/checkpoint-3500/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/checkpoint-3500/tokenizer_config.json
Special tokens file saved in Bert_Classifier/checkpoint-3500/special_tokens_map.json
Saving model checkpoint to Bert_Classifier/checkpoint-4000
Configuration saved in Bert_Classifier/checkpoint-4000/config.json
Model weights saved in Bert_Classifier/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/checkpoint-4000/

TrainOutput(global_step=7500, training_loss=0.7998364725748698, metrics={'train_runtime': 3619.9901, 'train_samples_per_second': 16.575, 'train_steps_per_second': 2.072, 'total_flos': 1.578708854784e+16, 'train_loss': 0.7998364725748698, 'epoch': 3.0})
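
The step count in the training log can be checked by hand from the values it reports. A small sketch, using only numbers printed above (the variable names are ours):

```python
import math

num_examples = 20000   # "Num examples" in the log
batch_size = 8         # "Total train batch size" in the log
num_epochs = 3         # "Num Epochs" in the log

# One optimization step consumes one batch
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

The result agrees with `Total optimization steps = 7500` in the log and with `global_step=7500` in the `TrainOutput`.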

In [36]:
trainer.push_to_hub()


Saving model checkpoint to Bert_Classifier
Configuration saved in Bert_Classifier/config.json
Model weights saved in Bert_Classifier/pytorch_model.bin
tokenizer config file saved in Bert_Classifier/tokenizer_config.json
Special tokens file saved in Bert_Classifier/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.34k/413M [00:00<?, ?B/s]

Upload file runs/Sep07_18-38-37_ab3b3e29f37b/events.out.tfevents.1662575960.ab3b3e29f37b.71.0:  45%|####5     …

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/jenniferjane/Bert_Classifier
   e266888..c040067  main -> main

To https://huggingface.co/jenniferjane/Bert_Classifier
   c040067..ffc35d0  main -> main



'https://huggingface.co/jenniferjane/Bert_Classifier/commit/c040067a9460a7629c0549662803f0dacfa761b1'

In [20]:
model.save_pretrained("classify_model")

Configuration saved in classify_model/config.json
Model weights saved in classify_model/pytorch_model.bin


In [34]:
model = AutoModelForSequenceClassification.from_pretrained("classify_model")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
#input_text = "Last night we went to Hoppers in Marylebone and had a wonderful dinner, the food and service was fantastic."
#input_text = "Last night we went to Hoppers in Marylebone and had an awful dinner, the food and service were dreadful, we had to wait a very long time. Very disappointing."
input_text = "The food is awesome, the staff are awfully brilliant... I can't go back to eat there as my waist line will expand... "
tokenized_text = tokenizer(input_text,
                            truncation=True,
                            is_split_into_words=False,
                            return_tensors='pt')
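# Pass every tokenizer output (input_ids, token_type_ids, attention_mask) to the model
outputs = model(**tokenized_text)
# Index (0-4) of the highest-scoring class
predicted_label = outputs.logits.argmax(-1)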


loading configuration file classify_model/config.json
Model config BertConfig {
  "_name_or_path": "classify_model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.21.3",
  "type_vocab_size": 2,
  "use_cache"

In [35]:
predicted_label

tensor([3])
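
The predicted value is a class index, not a star count. Assuming the Yelp Review Full convention that index 0 stands for 1 star and index 4 for 5 stars, a small helper (hypothetical name `label_to_stars`) makes the result readable:

```python
def label_to_stars(label_index):
    """Map a class index (0-4) to a human-readable star rating (1-5)."""
    if not 0 <= label_index <= 4:
        raise ValueError(f"expected a label in 0..4, got {label_index}")
    stars = label_index + 1
    return f"{stars} star" if stars == 1 else f"{stars} stars"

# The value inside tensor([3]) above; in the notebook use predicted_label.item()
predicted_index = 3
print(label_to_stars(predicted_index))
```

Under that assumption, the model rates the example review at 4 stars.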

# Fine-tuning for Named Entity Recognition (NER)

In this part of the workshop we are going to learn how to fine-tune one of the 🤗 Transformers models to a token classification task. This is the task of predicting a label for each token. We will train a token classifier for the task of named entity recognition (NER).

An NER classifier assigns each named entity mentioned in a text to a category: person, organization, location or miscellaneous.

First we set up some variables. The task in our case is `ner`; other options are `pos` (part-of-speech tagging) or `chunk` (chunking).

You can read more about the pre-trained model [distilbert](https://huggingface.co/distilbert-base-uncased) we are using.


In [44]:
task = "ner" # "ner", "pos" or "chunk"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16 # adjust where necessary

Next we load the `conll2003` dataset. You can read more about this training dataset [here](https://huggingface.co/datasets/conll2003)

In [45]:
from datasets import load_dataset
datasets = load_dataset("conll2003")


The `datasets` object is a `DatasetDict`, which contains one key for each of the `train`, `validation` and `test` sets.

In [46]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

Each of the datasets has a `tokens` column (the input texts split into words) and one column of labels for each of the tasks (`pos_tags`, `chunk_tags` and `ner_tags`).

Let's have a look at an element...

In [49]:
datasets['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

The labels are already coded as integer ids to be easily usable by our model, but the correspondence with the actual categories is stored in the `features` of the dataset:

In [50]:
datasets["train"].features["ner_tags"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

# NER tags: 

0 corresponds to 'O'

1 to 'B-PER' 

2 to 'I-PER' etc... 

In addition to 'O' (which means **no** special entity), there are **four** entity types for NER, each prefixed with 'B-' (for beginning) or 'I-' (for inside, i.e. intermediate) to indicate whether the token is the first one of the current entity or a continuation of it:

*   'PER' for person
*   'ORG' for organization
*   'LOC' for location
*   'MISC' for miscellaneous
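
To make the B-/I- scheme concrete, the sketch below groups tagged tokens into entity spans. It uses the first training example shown earlier and the label names from the dataset features. The function `bio_to_spans` is a hypothetical helper written for this illustration, not part of 🤗 Datasets.

```python
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def bio_to_spans(tokens, tag_ids, label_names=LABEL_NAMES):
    """Group tokens into (entity_type, text) spans following the B-/I- prefixes."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_names[tag_id]
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)           # continue the open entity
            continue
        if current_type is not None:               # close the previous entity, if any
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):                   # start a new entity (a stray I- is treated like B-)
            current_type, current_tokens = entity_type, [token]
    if current_type is not None:                   # flush an entity that ends the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(bio_to_spans(tokens, ner_tags))
```

A two-word name tagged `B-PER`, `I-PER` comes back as a single `PER` span, which is the purpose of distinguishing the two prefixes.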

Since the labels are lists of `ClassLabel`, the actual names of the labels are nested in the feature attribute of the object above:

In [51]:
label_list = datasets['train'].features[f"{task}_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

To get a sense of what the data looks like, the following function shows some examples picked randomly from the dataset (automatically decoding the labels in passing):

In [52]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))


In [53]:
show_random_elements(datasets["train"])


| | id | tokens | pos_tags | chunk_tags | ner_tags |
|---|---|---|---|---|---|
| 0 | 11399 | [Come, on, ,, you, know, better, than, that, ., "] | [VB, IN, ,, PRP, VB, JJR, IN, DT, ., "] | [B-VP, B-PRT, O, B-NP, B-VP, B-ADJP, B-PP, B-NP, O, O] | [O, O, O, O, O, O, O, O, O, O] |
| 1 | 12270 | [Olympic, sprint, championship, (, three-man, teams, ), :] | [JJ, JJ, NN, (, JJ, NNS, ), :] | [B-NP, I-NP, I-NP, O, B-NP, I-NP, O, O] | [B-MISC, O, O, O, O, O, O, O] |
| 2 | 6883 | [Pusan, 0, 2, 0, 3, 3, 2] | [NN, CD, CD, CD, CD, CD, CD] | [B-NP, I-NP, I-NP, I-NP, I-NP, I-NP, I-NP] | [B-ORG, O, O, O, O, O, O] |
| 3 | 8278 | [Preston, 2, Crewe, 1] | [NNP, CD, NNP, CD] | [B-NP, I-NP, I-NP, I-NP] | [B-ORG, O, B-ORG, O] |
| 4 | 2519 | [SOCCER, -, VOGTS, KEEPS, FAITH, WITH, EURO, ', 96, CHAMPIONS, .] | [NN, :, NNS, NNS, IN, IN, NNP, POS, CD, NNS, .] | [B-NP, O, B-NP, I-NP, B-PP, I-PP, B-NP, B-NP, I-NP, I-NP, O] | [O, O, B-PER, O, O, O, B-MISC, I-MISC, I-MISC, O, O] |


# Preprocessing the data

As before, we now tokenize the inputs using a 🤗 Transformers tokenizer.

This converts the `tokens` to their corresponding IDs in the pretrained **vocabulary**, puts the data into the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:

* we get a tokenizer that corresponds to the model architecture we want to use,
* we download the vocabulary used when pretraining this specific checkpoint,
* the vocabulary is cached, so it's not downloaded again the next time we run the cell.



In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The next assertion makes sure that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

Next try calling this tokenizer on an example sentence.

In [None]:
tokenizer("This is an example boring sentence!")

{'input_ids': [101, 2023, 2003, 2019, 2742, 11771, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

If, as is the case here, your inputs have already been split into words, you should pass the list of words to your tokenizer with the argument `is_split_into_words=True`:

In [None]:
tokenizer(["This", "is", "an", "example", "boring", "sentence", "split", "into", "words", "."], is_split_into_words=True)


{'input_ids': [101, 2023, 2003, 2019, 2742, 11771, 6251, 3975, 2046, 2616, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer. Let's look at an example of that:

In [None]:
example = datasets["train"][4]
print(example["tokens"])

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']


In [None]:
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']


Here the words "Zwingmann" and "sheepmeat" have each been split into three subtokens.
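
The splitting above follows a greedy longest-match-first rule: take the longest vocabulary entry that matches the start of the word, then repeat on the remainder, marking continuation pieces with `##`. The sketch below is a simplified illustration with a tiny, hypothetical vocabulary chosen to reproduce the pieces shown above. The real tokenizer uses the full vocabulary of the checkpoint and also handles normalisation and punctuation.

```python
# Tiny hypothetical vocabulary, for illustration only
TOY_VOCAB = {"sheep", "##me", "##at", "z", "##wing", "##mann", "said", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one lowercased word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                     # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # nothing matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

for word in ["sheepmeat", "zwingmann", "said"]:
    print(word, wordpiece(word))
```

A word that is itself in the vocabulary, such as "said", stays in one piece, which is why most tokens in the example are unchanged.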

This means that we need to do some processing on our labels, as the input ids returned by the tokenizer are longer than the lists of labels our dataset contains. This is because some special tokens might be added (a [CLS] and a [SEP] above) and because words may be split into multiple tokens:

In [None]:
len(example[f"{task}_tags"]), len(tokenized_input["input_ids"])


(31, 39)

The tokenizer returns outputs that have a `word_ids` method which can help us.

In [None]:
print(tokenized_input.word_ids())

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]


As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to None and all other tokens to their respective word. This way, we can align the labels with the processed input ids.

In [None]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))

39 39


Here we set the labels of all special tokens to -100 (the index that is ignored by the PyTorch loss function) and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word. Both strategies are supported here; just change the value of the following flag:

In [None]:
label_all_tokens = True


The following function preprocesses the samples. It feeds them to the tokenizer with the arguments `truncation=True` (to truncate texts that are longer than the maximum size allowed by the model) and `is_split_into_words=True` (as above), then aligns the labels with the token ids using the chosen strategy:

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
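
To see what the flag changes without running the tokenizer, the sketch below applies the same alignment rule to a hand-written `word_ids` list. The three-word example and the helper name `align` are hypothetical. The word ids mimic what `word_ids()` would return if the second word were split into three subtokens.

```python
def align(word_ids, word_labels, label_all_tokens):
    """Expand word-level labels to token level; -100 marks positions ignored by the loss."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:                   # special tokens such as [CLS] and [SEP]
            aligned.append(-100)
        elif word_idx != previous:             # first subtoken of a word
            aligned.append(word_labels[word_idx])
        else:                                  # later subtokens of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned

word_ids = [None, 0, 1, 1, 1, 2, None]   # [CLS], word 0, word 1 in three pieces, word 2, [SEP]
word_labels = [1, 2, 0]                   # B-PER, I-PER, O

labels_all = align(word_ids, word_labels, label_all_tokens=True)
labels_first = align(word_ids, word_labels, label_all_tokens=False)
print(labels_all)
print(labels_first)
```

With the flag on, every subtoken of the split word carries its label. With it off, only the first subtoken does and the rest are masked out with -100.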

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
tokenize_and_align_labels(datasets['train'][:5])


{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], [101, 2848, 13934, 102], [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], [101, 1996, 2647, 3222, 2056, 2006, 9432, 2009, 18335, 2007, 2446, 6040, 2000, 10390, 2000, 18454, 2078, 2329, 12559, 2127, 6529, 5646, 3251, 5506, 11190, 4295, 2064, 2022, 11860, 2000, 8351, 1012, 102], [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100], [-100, 1, 2, -100], [-100, 5, 0, 

As before, we can apply this function to all the sentences (or pairs of sentences) in the dataset by using the `map` method of the dataset object created earlier.

This applies the function to every element of every split in the dataset, so the training, validation and testing data are preprocessed in a single command.

Pass `batched=True` to encode the texts in batches. This takes full advantage of the fast tokenizer loaded earlier, which uses multi-threading to process the texts in a batch concurrently.

In [None]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)


  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]
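To make the batched calling convention concrete, here is a rough pure-Python sketch of what a batched `map` does conceptually: it slices each column into chunks and hands the function a dict of lists. `batched_map` and `add_lengths` are hypothetical helpers, not part of the Datasets library, and the real implementation differs in many ways (Arrow tables, caching and so on). The default of 1000 examples per batch is an assumption about the library default.

```python
import math


def batched_map(columns, function, batch_size=1000):
    """Apply `function` to successive slices of a column-oriented table."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = function(batch)
        # Keep the original columns and add (or overwrite) the returned ones.
        for name, values in {**batch, **result}.items():
            output.setdefault(name, []).extend(values)
    return output


def add_lengths(batch):
    # batch["tokens"] is a list of examples, each of which is a list of words.
    return {"num_words": [len(tokens) for tokens in batch["tokens"]]}


toy = {"tokens": [["EU", "rejects"], ["Peter"], ["BRUSSELS", "1996-08-22"]]}
processed = batched_map(toy, add_lengths, batch_size=2)
print(processed["num_words"])  # [2, 1, 2]

# With 1000 examples per batch, 14041 training examples need 15 batches,
# which is consistent with the "0/15" progress bar.
train_batches = math.ceil(14041 / 1000)
print(train_batches)  # 15
```

This is also why `tokenize_and_align_labels` has to accept and return lists: with `batched=True` each call sees many examples at once.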

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

Downloading pytorch_model.bin:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN t

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the batch_size defined earlier and customize the number of epochs for training, as well as the weight decay.

The last argument sets everything up so we can push the model to the Hub regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook.

In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

Next is setting up a data collator that will batch the processed examples together while applying padding to make them all the same size (each batch will be padded to the length of its longest example). The Transformers library provides a data collator for this task that pads not only the inputs but also the labels:

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
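As a rough illustration of what this padding looks like, the sketch below pads the first two examples from the `tokenize_and_align_labels` output shown earlier to a common length. `pad_batch` is a hypothetical helper, not the collator's actual code. It assumes a padding token id of 0 and pads the labels with -100 so that padded positions are ignored by the loss. The real collator also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        # Padded positions get attention mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
        # Padded positions get label -100 so the loss ignores them.
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch


features = [
    {
        "input_ids": [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102],
        "attention_mask": [1] * 11,
        "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100],
    },
    {
        "input_ids": [101, 2848, 13934, 102],
        "attention_mask": [1] * 4,
        "labels": [-100, 1, 2, -100],
    },
]

batch = pad_batch(features)
print(batch["labels"][1])
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.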

In [None]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 1.5 MB/s
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16180 sha256=ade3563592e8ba518c5ea253ed86092749208ac3b2753c7f8aa424c7967c4ee2
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


Next is defining how to compute the metrics from the predictions in this NER example. We will load the seqeval metric via the Datasets library.

In [None]:
from datasets import load_metric

metric = load_metric("seqeval")


This metric takes lists of labels for the predictions and references:



In [None]:
labels = [label_list[i] for i in example[f"{task}_tags"]]
metric.compute(predictions=[labels], references=[labels])

{'LOC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
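The per-type scores above are computed over whole entities rather than individual tokens. The following is a simplified sketch of that idea, not seqeval's actual implementation: `extract_entities` and `entity_scores` are hypothetical helpers that assume IOB2-style tags and count a predicted entity as correct only if its type and both boundaries match.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from IOB2 tags (end exclusive)."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        # A span ends at "O", at any "B-" tag, or at an "I-" tag of another type.
        if current is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != current):
            entities.add((current, start, i))
            start, current = None, None
        # Lenient choice: an "I-" tag with no open span also starts an entity.
        if tag != "O" and current is None:
            start, current = i, tag[2:]
    return entities


def entity_scores(predictions, references):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_pred = n_true = 0
    for pred_tags, true_tags in zip(predictions, references):
        pred, true = extract_entities(pred_tags), extract_entities(true_tags)
        tp += len(pred & true)
        n_pred += len(pred)
        n_true += len(true)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = ["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"]
prediction = ["B-ORG", "O", "B-MISC", "O", "B-PER", "O"]
scores = entity_scores([prediction], [reference])
print(scores)
```

In this toy case five of six tokens are tagged correctly, but the truncated person name counts as a miss, so entity-level precision and recall are both 2/3.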

Next is some post-processing on the predictions:

- select the predicted index (with the maximum logit) for each token
- convert it to its string label
- ignore every position where we set a label of -100

The following function does all this post-processing on the predictions and labels that the `Trainer` passes to it during evaluation (an `EvalPrediction`, which unpacks into predictions and labels) before applying the metric:

In [None]:
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Note that this function drops the precision/recall/f1 computed for each category and returns only the overall precision/recall/f1/accuracy.
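A toy run of the same post-processing steps may help. This sketch uses plain Python lists instead of NumPy so each step is explicit; the shortened `label_list` and the logits are made up for illustration.

```python
label_list = ["O", "B-PER", "I-PER"]  # shortened, hypothetical label set

# One sentence of four tokens: [CLS], two word tokens, [SEP].
logits = [[[2.0, 0.1, 0.3], [0.2, 3.1, 0.4], [0.1, 0.5, 2.2], [1.5, 0.2, 0.1]]]
labels = [[-100, 1, 2, -100]]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Step 1: pick the highest-scoring label index for every token.
predicted_ids = [[argmax(token) for token in sentence] for sentence in logits]

# Steps 2 and 3: convert to strings, skipping positions labelled -100.
true_predictions = [
    [label_list[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predicted_ids, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predicted_ids, labels)
]

print(predicted_ids)     # [[0, 1, 2, 0]]
print(true_predictions)  # [['B-PER', 'I-PER']]
print(true_labels)       # [['B-PER', 'I-PER']]
```

The predictions at the two special-token positions are discarded, so they never affect the metric.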

Then pass all of this along with the datasets to the Trainer:

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="./distilbert_ner", 
                                  num_train_epochs=1,
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  overwrite_output_dir=True,
                                  push_to_hub=True)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/jenniferjane/ner_trainer into local empty directory.


Now fine-tune the model by calling the `train` method:

In [None]:
trainer.train()


The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 14041
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1756


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0029,0.10691,0.923145,0.932543,0.92782,0.983033


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoint to ./ner_trainer/checkpoint-1756
Configuration saved in ./ner_trainer/checkpoint-1756/config.json
Model weights saved in ./ner_trainer/checkpoint-1756/pytorch_model.bin
tokenizer config file saved in ./ner_trainer/checkpoint-1756/tokenizer_config.json
Special tokens file saved in ./ner_trainer/checkpoint-1756/special_tokens_map.json
tokenizer config file saved in ./ner_trainer/tokenizer_config.json
Special tokens file saved in ./ner_trainer/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1756, training_loss=0.0038222267307290183, metrics={'train_runtime': 116.3161, 'train_samples_per_second': 120.714, 'train_steps_per_second': 15.097, 'total_flos': 149063429673648.0, 'train_loss': 0.0038222267307290183, 'epoch': 1.0})
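The step count in the log can be reproduced from the other numbers it reports. This is a back-of-the-envelope check, assuming the last partial batch of each epoch is kept rather than dropped:

```python
import math

num_examples = 14041   # "Num examples" in the training log
batch_size = 8         # "Total train batch size" in the training log
grad_accum_steps = 1   # "Gradient Accumulation steps" in the training log
num_epochs = 1         # "Num Epochs" in the training log

batches_per_epoch = math.ceil(num_examples / batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 1756

# The same arithmetic gives the number of evaluation batches for 3250 examples.
eval_batches = math.ceil(3250 / batch_size)
print(eval_batches)  # 407
```

The result matches both "Total optimization steps = 1756" and the `checkpoint-1756` directory name.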

In [None]:
model.save_pretrained("ner_distilbert_model")

Configuration saved in ner_model/config.json
Model weights saved in ner_model/pytorch_model.bin


The `evaluate` method runs evaluation again, either on the evaluation dataset passed to the `Trainer` or on another dataset you supply:

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8


{'eval_loss': 0.09985809773206711,
 'eval_precision': 0.934257602862254,
 'eval_recall': 0.9347801767535519,
 'eval_f1': 0.9345188167533413,
 'eval_accuracy': 0.9840024147298521,
 'eval_runtime': 6.681,
 'eval_samples_per_second': 486.452,
 'eval_steps_per_second': 60.919,
 'epoch': 3.0}

To get the precision/recall/f1 computed for each category once training has finished, apply the same post-processing as before to the result of the `predict` method:

In [None]:
predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results


The following columns in the test set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3250
  Batch size = 8


{'LOC': {'precision': 0.95717017208413,
  'recall': 0.9560733384262796,
  'f1': 0.9566214408561056,
  'number': 2618},
 'MISC': {'precision': 0.8356054530874097,
  'recall': 0.8464662875710804,
  'f1': 0.841000807102502,
  'number': 1231},
 'ORG': {'precision': 0.8981835564053537,
  'recall': 0.9139105058365758,
  'f1': 0.9059787849566057,
  'number': 2056},
 'PER': {'precision': 0.9757217847769029,
  'recall': 0.98022412656559,
  'f1': 0.9779677737586321,
  'number': 3034},
 'overall_precision': 0.9329037991557432,
 'overall_recall': 0.9394786888913749,
 'overall_f1': 0.9361797001281981,
 'overall_accuracy': 0.9843836878643939}
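Each reported `f1` is the harmonic mean of the corresponding precision and recall. The short check below recomputes it from the values printed above; `f1_score` is a helper written for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the results above.
reported = {
    "LOC": (0.95717017208413, 0.9560733384262796, 0.9566214408561056),
    "MISC": (0.8356054530874097, 0.8464662875710804, 0.841000807102502),
    "ORG": (0.8981835564053537, 0.9139105058365758, 0.9059787849566057),
    "PER": (0.9757217847769029, 0.98022412656559, 0.9779677737586321),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: reported={f1:.4f} recomputed={recomputed:.4f}")

# The category with the lowest F1 is the one the model finds hardest.
weakest = min(reported, key=lambda name: reported[name][2])
print(weakest)  # MISC
```

Looking at categories separately shows that MISC trails the other entity types, which the overall score alone would hide.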

In [None]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


If you want to share your model, you can upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()


Saving model checkpoint to ./ner_trainer
Configuration saved in ./ner_trainer/config.json
Model weights saved in ./ner_trainer/pytorch_model.bin
tokenizer config file saved in ./ner_trainer/tokenizer_config.json
Special tokens file saved in ./ner_trainer/special_tokens_map.json
To https://huggingface.co/jenniferjane/ner_trainer
   425f1eb..0249b6c  main -> main




Now you can load the shared model by its name and test it with the `ner` pipeline discussed earlier in this workshop.

In [None]:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("jenniferjane/ner_trainer")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer)
sequence = 'Jennifer loves dining at Hoppers in Marylebone, London.'

In [None]:
for entity in ner_pipe(sequence):
  print(entity)

{'entity': 'LABEL_1', 'score': 0.9999734, 'index': 1, 'word': 'jennifer', 'start': 0, 'end': 8}
{'entity': 'LABEL_0', 'score': 0.99998367, 'index': 2, 'word': 'loves', 'start': 9, 'end': 14}
{'entity': 'LABEL_0', 'score': 0.9999815, 'index': 3, 'word': 'dining', 'start': 15, 'end': 21}
{'entity': 'LABEL_0', 'score': 0.9999722, 'index': 4, 'word': 'at', 'start': 22, 'end': 24}
{'entity': 'LABEL_3', 'score': 0.9146261, 'index': 5, 'word': 'hopper', 'start': 25, 'end': 31}
{'entity': 'LABEL_3', 'score': 0.92764086, 'index': 6, 'word': '##s', 'start': 31, 'end': 32}
{'entity': 'LABEL_0', 'score': 0.99736255, 'index': 7, 'word': 'in', 'start': 33, 'end': 35}
{'entity': 'LABEL_5', 'score': 0.99973494, 'index': 8, 'word': 'marylebone', 'start': 36, 'end': 46}
{'entity': 'LABEL_0', 'score': 0.99998033, 'index': 9, 'word': ',', 'start': 46, 'end': 47}
{'entity': 'LABEL_5', 'score': 0.999948, 'index': 10, 'word': 'london', 'start': 48, 'end': 54}
{'entity': 'LABEL_0', 'score': 0.9999869, 'index'
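The pipeline prints generic names such as `LABEL_1` because the saved model configuration does not contain the tag names. One way to read the output is to map the numeric suffix back to the label list, as sketched below. The label order shown is the usual CoNLL-2003 one and is an assumption here; check it against the `label_list` used during training. `readable` is a hypothetical helper.

```python
# Assumed CoNLL-2003 label order; verify against `label_list` from training.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}


def readable(entity):
    """Translate a generic 'LABEL_<n>' name into the corresponding tag."""
    index = int(entity.split("_")[-1])
    return id2label[index]


for generic in ["LABEL_1", "LABEL_3", "LABEL_5", "LABEL_0"]:
    print(generic, "->", readable(generic))
```

Under that assumption, `jennifer` is tagged `B-PER`, `hopper` `B-ORG` and `marylebone` `B-LOC`. Passing `id2label=id2label` and `label2id=label2id` to `from_pretrained` when the model is first created should store the names in the model configuration, so that the pipeline reports them directly.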