## Transfer learning & fine-tuning

So far we have seen how to use pre-trained Hugging Face language models (e.g. BERT, T5). In this part of the workshop we are going to fine-tune these pre-trained models for specific downstream NLP tasks, such as document classification (sentiment) or summarisation.

This is generally known as transfer learning: a machine learning technique for adapting pretrained models to solve specialized problems. Sequential transfer learning means learning on one task or dataset, and then transferring what was learned to another task or dataset.

## Install dependencies

In [None]:
%pip install transformers datasets torch

### Dataset: The Yelp Review Full dataset for text classification.
Before we can fine-tune a pretrained model, we need to download a dataset and prepare it for training. We are going to use the Yelp Review Full dataset for fine-tuning.

This dataset is a subset of Yelp's businesses, reviews and user data.

Each example contains the review text and the corresponding label (1-5 stars, stored as the class indices 0-4).



In [None]:
from datasets import load_dataset
dataset = load_dataset("/Users/JENSAM/GIT/edc22-nlp/data/yelp_review_full.py", cache_dir='/Users/JENSAM/GIT/edc22-nlp/data') 
dataset 

Let's take a look at an example

In [None]:
dataset["train"][99]

Remember that we need to process the text using a tokenizer. We will use padding and truncation to handle variations in sequence length.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
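
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal sketch in plain Python. The helper `pad_and_truncate` and the token IDs are hypothetical and only illustrate the idea; a real tokenizer also keeps its special end token when truncating, which this toy version does not.

```python
# Toy illustration of padding to a fixed length and truncating long inputs.
# pad_and_truncate is a hypothetical helper, not part of transformers.
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop tokens past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)           # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_and_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what allows the examples to be stacked into batches.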



To reduce the time it takes to train, we can create smaller subsets of the full dataset for fine-tuning.

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

In [None]:
small_train_dataset[99]

In [None]:
small_eval_dataset

## Train
We will be using the `Trainer` class from Hugging Face Transformers for training. The API supports a wide range of training options and features.

First we need to load the model we are going to fine-tune for a classification task.

In [None]:
from transformers import AutoModelForSequenceClassification
model_name = "bert-base-cased" 
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)


Think about what this warning is telling us ...

We need to specify where to save the training checkpoints using the TrainingArguments class. This class also holds all the training hyperparameters.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

To evaluate our model's performance we need to pass the Trainer a function for computing and reporting the metrics. You can load different metrics with the load_metric function (e.g. accuracy, precision, recall, f1).

In [None]:

import numpy as np

from datasets import load_metric

metric = load_metric("accuracy")

Next, we call the compute method on metric to calculate the prediction accuracy. The model returns logits, which are the raw outputs of the last layer of the neural network, and these must first be converted to predictions.

The SoftMax function rescales the logits to values between 0 and 1 that sum to 1, which gives us a probability for each class.
The Argmax function returns the index of the largest value, which gives us the predicted class.
Because SoftMax preserves the ordering of the logits, we can apply Argmax to the logits directly, as the function below does.
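
The relationship between logits, SoftMax and Argmax can be checked with a small NumPy sketch. The logit values are made up for illustration, and the `softmax` helper is written here by hand rather than taken from any library.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for two reviews and five classes
example_logits = np.array([[2.0, 0.5, -1.0, 0.0, 1.0],
                           [-0.3, 0.1, 3.2, 0.4, -2.0]])

example_probs = softmax(example_logits)                 # each row sums to 1
example_preds = np.argmax(example_logits, axis=-1)      # index of the largest logit

print(example_probs.sum(axis=-1))
print(example_preds)
```

Taking the Argmax of the probabilities gives the same classes as taking it of the logits, so SoftMax is only needed when the probabilities themselves are of interest.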

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
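
As a sanity check, the same calculation can be written by hand in NumPy. This is a sketch only: `accuracy_from_logits` is a hypothetical stand-in for the loaded metric, and the logits and labels are invented.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Hand-written stand-in for the accuracy metric (hypothetical helper)."""
    predictions = np.argmax(logits, axis=-1)           # predicted class per example
    return {"accuracy": float((predictions == labels).mean())}

# Four invented examples with three classes each
toy_logits = np.array([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.0, 0.1, 0.9],
                       [0.7, 0.6, 0.5]])
toy_labels = np.array([1, 0, 1, 0])

toy_result = accuracy_from_logits(toy_logits, toy_labels)
print(toy_result)
```

Three of the four predictions match their labels here, so the reported accuracy is 0.75.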

To monitor the evaluation metrics during fine-tuning we need to specify the evaluation_strategy parameter in the training arguments. In this case we evaluate at the end of each epoch.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=3, evaluation_strategy="epoch")
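
To get a feel for how much work these settings imply, the number of optimisation steps can be estimated with simple arithmetic. The sketch assumes a single device, no gradient accumulation, and a per-device batch size of 8, which is the library default as far as I know; adjust the numbers if your setup differs.

```python
import math

n_train_examples = 1000   # size of small_train_dataset
n_epochs = 3              # num_train_epochs set above
assumed_batch_size = 8    # assumption: default per-device batch size, one device

steps_per_epoch = math.ceil(n_train_examples / assumed_batch_size)
total_steps = steps_per_epoch * n_epochs
n_evaluations = n_epochs  # one evaluation at the end of each epoch

print(steps_per_epoch, total_steps, n_evaluations)
```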

Create a Trainer object specifying the model, training arguments, datasets and the evaluation function we defined above.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

We are now ready to start fine-tuning the model for the text classification task, by calling the train() method.

In [None]:
trainer.train()

## Summarisation

In [None]:
%pip install rouge-score nltk sentencepiece
import nltk
nltk.download("punkt")

In [None]:
!apt install git-lfs

## Parameters

In [None]:
model_name = "t5-small"  # checkpoint used by the tokenizer and prefix cells below

## Prepare the dataset

In [None]:
from datasets import load_dataset, load_metric

raw_datsets = load_dataset("xsum")
metric = load_metric("rouge")

In [None]:
raw_datsets

In [None]:
raw_datsets["train"][0]

The function show_random_elements can be used to show some randomly picked examples from the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random indices (sampling without replacement)
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


In [None]:
show_random_elements(raw_datsets["train"], 2)

In [None]:
metric

In [None]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)
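
To build some intuition for what ROUGE measures, here is a simplified sketch of a ROUGE-1 style F1 score based on unigram overlap. It is not the implementation used by the `rouge` metric (which also handles stemming, other ROUGE variants and aggregation); `rouge1_f1` is a hypothetical helper for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count each shared word at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # share of predicted words that are correct
    recall = overlap / len(ref_tokens)       # share of reference words that were found
    return 2 * precision * recall / (precision + recall)

identical = rouge1_f1("hello there", "hello there")
disjoint = rouge1_f1("hello there", "general kenobi")
partial = rouge1_f1("hello there general", "hello there")

print(identical, disjoint, partial)
```

Identical texts score 1.0, texts with no words in common score 0.0, and an extra word in the prediction lowers precision and therefore the F1 score.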

## Preprocessing the data 

Use a Transformers `Tokenizer` to tokenize the inputs. This splits the text into tokens, converts the tokens to their IDs in the pretrained vocabulary, formats the inputs for the model, and generates any other inputs that the model needs.

Calling the `AutoTokenizer.from_pretrained` method gives us a tokenizer specific to the model architecture and its vocabulary.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
tokenizer("Hello my name is Boris Johnson, I used to live and party at 10 Downing Street.")

In [None]:
tokenizer.tokenize("Hello my name is Boris Johnson, I used to live and party at 10 Downing Street.", "This is a fabulous sentence.")

Prefix the inputs with "summarize: " when using a T5 model checkpoint. T5 can also perform other tasks such as translation, so it needs to know which task to perform.

In [None]:
if model_name in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
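
The effect of the prefix is easiest to see on a couple of strings. In this small sketch the documents are invented and the variable names are chosen only for the example; it assumes the usual T5 convention of a task name followed by a colon and a space.

```python
# Assumption: T5-style task prefix, including the colon and trailing space
example_prefix = "summarize: "

# Invented documents, standing in for the "document" column of the dataset
example_documents = [
    "The council approved the new budget on Monday.",
    "Heavy rain flooded several roads overnight.",
]

# Prepend the task prefix to every document before tokenization
example_inputs = [example_prefix + doc for doc in example_documents]

for text in example_inputs:
    print(text)
```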

Write a function to preprocess the samples and pass them to the `tokenizer` with the argument `truncation=True`. This truncates any input that is longer than the maximum length we set with `max_length`.

In [None]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
preprocess_function(raw_datsets['train'][:2])

In [None]:
%%HTML