## Transfer learning & fine tuning

So far we have seen how to use pre-trained Hugging Face language models (e.g. BERT, T5). In this part of the workshop we are going to talk about fine-tuning these pre-trained models for specific downstream NLP tasks, e.g. document classification (sentiment) or summarisation.

This is generally known as transfer learning. Transfer learning is a machine learning technique for adapting pretrained models to solve specialized problems. Sequential transfer learning means learning on one task or dataset and then transferring what was learned to another task or dataset.

## Install dependencies

In [1]:
%pip install transformers datasets torch


### Dataset: The Yelp Review Full dataset for text classification.
Before we can fine-tune a pretrained model, we need to download a dataset and prepare it for training. We are going to use the Yelp dataset for fine-tuning. 

This dataset is a subset of businesses, reviews and user data.

Each example contains the review text and a label for its star rating. The labels are stored as class ids 0 to 4, which correspond to 1 to 5 stars.
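
The examples further down show zero-based label ids (for instance `'label': 0`), so it can help to map them back to star ratings. A minimal sketch, assuming label `k` corresponds to `k + 1` stars; `label_to_stars` is a hypothetical helper and not part of the `datasets` library:

```python
def label_to_stars(label: int) -> int:
    """Map a zero-based class id (0-4) to a star rating (1-5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label in 0..4, got {label}")
    return label + 1


print(label_to_stars(0))  # the one-star review shown as 'label': 0
print(label_to_stars(3))
```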



In [13]:
from datasets import load_dataset
dataset = load_dataset("/Users/JENSAM/GIT/edc22-nlp/data/yelp_review_full.py", cache_dir='/Users/JENSAM/GIT/edc22-nlp/data') 
dataset 

Reusing dataset yelp_review_full (/Users/JENSAM/GIT/edc22-nlp/data/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|██████████| 2/2 [00:00<00:00, 310.94it/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

Let's take a look at an example

In [21]:
dataset["train"][99]

{'label': 0,
 'text': "Eat at your own risk. The service is terrible, the staff seem to be generally clueless, the management is inclined to blame the staff for their own mistakes, and there's no sense of FAST in their fast food. When we came, half of the menu board was still on breakfast, and it was 4:30p. The only thing they have going for them is that the food is hot and tastes just like McDonald's should. \\n\\nThen again, the franchise is owned by Rice, and I've come to take terrible service is their MO."}

Remember that we need to process the text using a tokenizer. We will use padding and truncation to handle variations in sequence length.

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)



100%|██████████| 650/650 [01:25<00:00,  7.60ba/s]
100%|██████████| 50/50 [00:06<00:00,  7.66ba/s]
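
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a dependency-free sketch of what happens to a single sequence of token ids. It is an illustration only: `pad_and_truncate` is a hypothetical helper, the pad id of 0 is an assumption, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop everything past max_length.
    ids = list(token_ids[:max_length])
    # The attention mask is 1 for real tokens and 0 for padding.
    mask = [1] * len(ids)
    # Padding: append pad_id until the sequence reaches max_length.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_and_truncate([101, 8667, 102], max_length=5)
long_ids, long_mask = pad_and_truncate([101, 8667, 1139, 1271, 1110, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what allows examples to be stacked into a batch.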


To reduce training time, we can create smaller subsets of the full dataset for fine-tuning.

In [30]:
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(100))

Loading cached shuffled indices for dataset at /Users/JENSAM/GIT/edc22-nlp/data/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-a0e621c27d9b360e.arrow
Loading cached shuffled indices for dataset at /Users/JENSAM/GIT/edc22-nlp/data/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-61e0da4d9cd46a2c.arrow
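
The idea behind `shuffle(seed=42).select(range(n))` is a reproducible random sample: a fixed seed yields the same order on every run, and the first `n` rows of that order form the subset. A rough sketch with plain Python lists, assuming nothing about how `datasets` shuffles internally (so the rows picked here will not match the library's); `shuffled_subset` is a hypothetical helper:

```python
import random


def shuffled_subset(rows, n, seed=42):
    """Return n rows taken from a seeded shuffle of rows."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    return [rows[i] for i in indices[:n]]


reviews = [f"review {i}" for i in range(10)]
first = shuffled_subset(reviews, 3)
second = shuffled_subset(reviews, 3)
print(first == second)  # same seed, same subset
```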


In [31]:
small_train_dataset[99]

{'label': 3,
 'text': "Went here with my roommate last week for a late breakfast and was pleasantly surprised! I got the Eggs Benedict and a chocolate croissant while my roommate went for the French toast with fresh fruit. Everything was very tasty, although my roommate could have easily handled a larger portion. \\nAll the employees were super sweet and the French owner's eclectic playlist really added to how much I enjoyed the experience. Will have to go again for lunch or dinner!"}

In [25]:
small_eval_dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 100
})

## Train
We will be using the Hugging Face Transformers `Trainer` class for training. The API supports a wide range of training options and features.

First we need to load the model we are going to fine-tune for a classification task.

In [35]:
from transformers import AutoModelForSequenceClassification
model_name = "bert-base-cased" 
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Think about what this warning is telling us ...

We need to specify where to save the training checkpoints using the `TrainingArguments` class. This class also holds all the training hyperparameters.

In [36]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

To evaluate our model's performance we need to pass the `Trainer` a function for computing and reporting metrics. You can load different metrics with the `load_metric` function (e.g. accuracy, precision, recall, f1).

In [38]:

import numpy as np

from datasets import load_metric

metric = load_metric("accuracy")

Next we call the `compute` method on `metric` to calculate the prediction accuracy. The model returns logits, which are the raw scores of the last layer of the neural network, so these must first be converted to predictions.

The argmax function returns the index of the largest logit, which gives us the predicted class.
The softmax function rescales the logits to values between 0 and 1 that sum to 1, which can be read as a probability for each class. Accuracy only needs the predicted class, so `compute_metrics` below uses argmax alone.

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
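
To see what argmax, softmax and accuracy do with logits, here is a small pure-Python sketch. The logits and labels are made up for illustration, and the helper functions are hypothetical stand-ins for `np.argmax` and the accuracy metric rather than the library implementations.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value, i.e. the predicted class id."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Two made-up examples with five logits each (one per star rating).
batch_logits = [[2.0, 0.5, -1.0, 0.1, 0.0],
                [-0.3, 0.2, 0.1, 3.0, 1.0]]
labels = [0, 4]

predictions = [argmax(row) for row in batch_logits]
probabilities = [softmax(row) for row in batch_logits]
print(predictions)
print(accuracy(predictions, labels))
```

Softmax preserves the ordering of the logits, so taking argmax of the probabilities gives the same class as taking argmax of the logits directly.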

To monitor the evaluation metrics during fine-tuning, we need to specify the `evaluation_strategy` parameter in the training arguments. Here we evaluate at the end of each epoch.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=3, evaluation_strategy="epoch")

In [None]:
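# The subsets above were taken from the raw dataset, so tokenize them here:
# the Trainer needs input_ids and attention_mask, not the raw text.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset.map(tokenize_function, batched=True),
    eval_dataset=small_eval_dataset.map(tokenize_function, batched=True),
    compute_metrics=compute_metrics,
)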

In [None]:
trainer.train()

## Summarisation

In [63]:
%pip install rouge-score nltk sentencepiece
import nltk
nltk.download("punkt")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting absl-py
  Downloading absl_py-1.2.0-py3-none-any.whl (123 kB)
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Using lega

[nltk_data] Downloading package punkt to /Users/JENSAM/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [65]:
!apt install git-lfs

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Running `brew update --auto-update`...
==> Auto-updated Homebrew!
Updated 3 taps (homebrew/core, homebrew/cask and homebrew/cask-fonts).
==> New Formulae
cargo-depgraph             dura                       prql-compiler
chain-bench                page                       tlsx
==> New Casks
polypad             qwerty-fr           shop-different      workman

You have 6 outdated formulae installed.
You can upgrade them with brew upgrade
or list them with brew outdated.

==> Downloading https://ghcr.io/v2/homebrew/core/git-lfs/manifests/3.2.0
######################################################################## 10

## Parameters

In [66]:
MODEL_NAME = "t5-small"

## Prepare the dataset

In [71]:
from datasets import load_dataset, load_metric

raw_datsets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default


Downloading and preparing dataset xsum/default (download: 245.38 MiB, generated: 507.60 MiB, post-processed: Unknown size, total: 752.98 MiB) to /Users/JENSAM/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934...




[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Dataset xsum downloaded and prepared to /Users/JENSAM/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 149.83it/s]
Downloading builder script: 5.60kB [00:00, 5.32MB/s]                   


In [72]:
raw_datsets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [75]:
raw_datsets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

The function show_random_elements can be used to show some randomly picked examples from the dataset.

In [117]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
        # Debug print: shows the rows picked so far on every iteration.
        print(dataset[picks])

    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their label names, if the dataset has any.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    # The HTML table display is disabled, so only the print above produces output.
    #display(HTML(df.to_html()))


In [118]:
show_random_elements(raw_datsets["train"], 2)

{'document': ['Officers turned to Twitter in a bid to find a thief who stole £600 worth of cosmetics from a local Boots store.\n"We are looking for a 40-year-old man who looks 20, glowing skin, long eyelashes, raised eyebrows & pronounced lips," they added.\nIn response, one pun-loving joker replied: "Is there any foundation to these allegations?"\nEnd of Twitter post  by @MonklandsPol\nThe post by Monklands police sparked a series of witty responses from their followers on the social media platform.\nReferring to a popular brand of make-up, one asked: "If you put him in an identity parade, will he be No 7 in the line up?"\nAnother said: "When questioned as to why he had allegedly stolen £600 of cosmetics the suspect simply answered \'Because I\'m worth it.\'"\nThe theft happened at Boots in Main Street, Coatbridge, at about 12:30 on Thursday.\nAnyone with information is asked to call Police Scotland on 101 or Crimestoppers.'], 'summary': ['A light-hearted appeal to trace a shoplifter 

In [114]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

In [120]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
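
To make the precision, recall and fmeasure fields above less abstract, here is a simplified sketch of ROUGE-1 for a single prediction and reference. It is an approximation under stated assumptions: tokens are split on whitespace and lowercased, there is no stemming, and there is no low/mid/high aggregation over many examples, so it is not the `rouge_score` implementation. `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure for one pair of strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


perfect = rouge1("hello there", "hello there")
too_long = rouge1("hello there general", "hello there")
print(perfect)
print(too_long)
```

An identical prediction scores 1.0 on all three values, as in the fake example above. The second call adds an extra word to the prediction, which lowers precision while recall stays at 1.0.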

## Preprocessing the data 

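Use the transformers `Tokenizer` to tokenize the inputs. It converts the tokens to their IDs in the pretrained vocabulary, formats the inputs for the model, and generates the other inputs that the model needs.

Calling `AutoTokenizer.from_pretrained` gives us a tokenizer specific to the model architecture and its vocabulary. Note that the cell below passes `model_name`, which still holds "bert-base-cased" from the classification section, so the outputs that follow show BERT's configuration and token IDs. To tokenize for the T5 checkpoint chosen under Parameters, pass `MODEL_NAME` instead.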

In [121]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/JENSAM/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file https://huggingface.co/bert-base-cased/resolve/

In [123]:
tokenizer("Hello my name is Boris Johnson, I live and party at 10 Downing Street.")

{'input_ids': [101, 8667, 1139, 1271, 1110, 11265, 2921, 117, 146, 1686, 1105, 1710, 1120, 1275, 5245, 1158, 1715, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [124]:
tokenizer("Hello my name is Boris Johnson, I live and party at 10 Downing Street.", "This is a fabulous sentence.")

{'input_ids': [101, 8667, 1139, 1271, 1110, 11265, 2921, 117, 146, 1686, 1105, 1710, 1120, 1275, 5245, 1158, 1715, 119, 102, 1188, 1110, 170, 175, 20356, 5650, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
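
The two outputs above differ in `token_type_ids`: with a single sentence every entry is 0, while with a sentence pair the tokens of the second sentence are marked with 1. A minimal sketch of how such an encoding is assembled, assuming BERT-style special tokens with ids 101 and 102 (the values seen at the start and end of the outputs above); `encode_pair` is a hypothetical helper, not the tokenizer's actual implementation:

```python
CLS_ID, SEP_ID = 101, 102  # assumed special-token ids, taken from the outputs above


def encode_pair(first_ids, second_ids=None):
    """Assemble input_ids, token_type_ids and attention_mask for one or two segments."""
    input_ids = [CLS_ID] + list(first_ids) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0 includes [CLS] and the first [SEP]
    if second_ids is not None:
        second = list(second_ids) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)  # segment 1 includes the final [SEP]
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is real
    }


single = encode_pair([8667, 1139])
pair = encode_pair([8667, 1139], [1188, 1110])
print(single)
print(pair)
```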

Prefix the inputs with "summarize: " when using a T5 checkpoint. T5 can also perform other tasks such as translation, so it needs to know which task to perform.

In [125]:
if model_name in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
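
The prefix logic can also be wrapped in small functions, which makes it easy to check what the model will actually receive. This is only a sketch: `task_prefix` and `add_prefix` are hypothetical names, and the prefix string `"summarize: "` (with a colon and a trailing space) is assumed here to be the form the T5 checkpoints expect.

```python
T5_CHECKPOINTS = {"t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"}


def task_prefix(checkpoint):
    """Return the task prefix for T5 checkpoints and an empty string otherwise."""
    return "summarize: " if checkpoint in T5_CHECKPOINTS else ""


def add_prefix(documents, checkpoint):
    """Prepend the task prefix to every document in the batch."""
    prefix = task_prefix(checkpoint)
    return [prefix + doc for doc in documents]


docs = ["The cat sat on the mat."]
print(add_prefix(docs, "t5-small"))
print(add_prefix(docs, "bert-base-cased"))
```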

Write a function to preprocess the samples and pass them to the `tokenizer` with the argument `truncation=True`. This truncates any input longer than the given `max_length`, so that it does not exceed what the model can handle.

In [126]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    # Prepend the task prefix (empty for non-T5 checkpoints) to every document.
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    # The tokenized summaries become the labels the model is trained to generate.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [127]:
preprocess_function(raw_datsets['train'][:2])

{'input_ids': [[101, 1109, 1554, 2616, 1104, 3290, 1107, 8102, 5272, 117, 1141, 1104, 1103, 1877, 4997, 4634, 117, 1110, 1253, 1217, 14758, 119, 20777, 8341, 1250, 1110, 7173, 1107, 11679, 6196, 1105, 1242, 4744, 1107, 153, 3051, 8350, 6662, 3118, 6118, 4634, 1118, 2288, 1447, 119, 20223, 1113, 1103, 1745, 3153, 25229, 1339, 23730, 1496, 1106, 3290, 1120, 1103, 21372, 5541, 15709, 13890, 119, 2408, 5028, 1105, 7036, 1116, 1127, 4634, 1118, 9420, 1107, 8102, 5272, 1170, 1103, 1595, 140, 8871, 1166, 12712, 1174, 1154, 1103, 1411, 119, 1752, 2110, 18634, 1457, 27793, 1320, 3891, 1103, 1298, 1106, 25151, 1103, 3290, 119, 1109, 5635, 13275, 1174, 170, 13223, 2095, 117, 9420, 1242, 2595, 4625, 1113, 3006, 1715, 118, 1103, 1514, 6001, 17213, 14154, 119, 2893, 6347, 9727, 117, 1150, 8300, 1103, 140, 23339, 18375, 1134, 1108, 6118, 4634, 117, 1163, 1131, 1180, 1136, 6088, 1103, 4321, 118, 4792, 2593, 1517, 1103, 7870, 1855, 119, 1438, 117, 1131, 1163, 1167, 3843, 5838, 1250, 1180, 1138, 1151, 2

In [130]:
%pwd

'/Users/JENSAM/GIT/edc22-nlp'

In [None]:
%%HTML