# Day 2: Next Steps: Finetuning

Let's walk through the finetuning process step by step for a classification task using LoRA.

We also have a more full-fledged finetuning script in the `finetuning` folder, which lets you run four different types of PEFT from a single script.

NOTE: Your local machine likely does not have the memory required (at least 24 GB on the GPU) to run this notebook. We therefore recommend running it on a cloud instance such as an AWS `g5.4xlarge` or a GCP instance with an A100 40GB GPU.

NOTE 2: While this notebook may seem pretty long, the majority of it follows standard practices for training ML models. The key bit of code here is the following:
```
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=r, lora_alpha=32, lora_dropout=dropout_rate
)
model = get_peft_model(model, peft_config)
```

This creates a PEFT configuration (LoRA here), which we then use to wrap our model with LoRA adapters. Read on through this notebook to see it in action!

NOTE 3: If you want to play around with other types of finetuning, check out the scripts in the `finetuning` folder!

In [None]:
# Install the required libraries
!pip install git+https://github.com/huggingface/peft
!pip install datasets==2.12.0 evaluate==0.4.0 numpy==1.24.3 torch==2.0.1 tqdm==4.65.0 transformers==4.29.2 ipykernel ipywidgets

In [None]:
from transformers import AutoModelForSeq2SeqLM, AdamW
from peft import get_peft_model, TaskType, LoraConfig
import functools
import torch
import datasets
import os
import evaluate

from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import default_data_collator
from tqdm import tqdm

## Config

Let's set up some parameters for our model. In this case, the model is Google's Flan-T5-Large.

In [None]:
model_name_or_path = "google/flan-t5-large" # https://huggingface.co/google/flan-t5-large

r = 4 # LoRA rank (the "attention dimension" of the low-rank update matrices)
dropout_rate = 0.1
batch_size = 8
n_epochs = 3
lr = 3e-4
max_length = 128
grad_accumulation_steps = 1

os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Data

We use the [Financial Phrasebank dataset](https://huggingface.co/datasets/financial_phrasebank/viewer/sentences_allagree/train), which contains sentences from financial news labeled with their sentiment.

We first define a simple pre-processing function for our dataset. It replaces newlines with spaces, tokenizes the inputs, and prepares the labels.

In [None]:
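def preprocess_function(examples, tokenizer, text_column, label_column, max_length):
    # Replace newlines so each example is a single line of text
    inputs = [text.replace('\n', ' ') for text in examples[text_column]]

    # Tokenize the inputs, padding/truncating every example to max_length
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")

    # Targets are the label names as strings (the "text_label" column, added before this function is applied)
    targets = examples["text_label"]
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    # Set padding positions to -100 so the loss ignores them
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels

    return model_inputs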

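To illustrate why padding positions in the labels are set to `-100`, here is a minimal pure-Python sketch of a masked loss. It assumes that `-100` is the index the loss function ignores (the PyTorch cross-entropy default) and that the pad token id is `0`; the token id `1465` and the probabilities are made-up values, and `mask_padding` and `masked_mean_nll` are hypothetical helper names, not library functions.

```python
import math


def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding token ids with the index the loss should ignore."""
    return [ignore_index if t == pad_token_id else t for t in label_ids]


def masked_mean_nll(token_log_probs, labels, ignore_index=-100):
    """Average negative log-likelihood over the positions that are not ignored.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i.
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != ignore_index]
    if not kept:
        raise ValueError("all positions are masked")
    return sum(kept) / len(kept)


pad_token_id = 0                 # assumption: pad id of the tokenizer
label_ids = [1465, 0]            # hypothetical: one real token, then padding
labels = mask_padding(label_ids, pad_token_id)

# The model is confident on the real token and poor on the padding position.
log_probs = [math.log(0.9), math.log(0.01)]
loss = masked_mean_nll(log_probs, labels)
print(labels, round(loss, 4))
```

Because the padding position is masked, the poor prediction there does not contribute to the loss; only the real label token does.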
Now, let's load our dataset, and prepare our training and validation data. For the purposes of this demo, we just split 200 examples from the training set to use for validation.

In [None]:
# load the dataset
# sentences_allagree means all annotators agreed on the label
dataset = datasets.load_dataset("financial_phrasebank", "sentences_allagree")
text_column = "sentence"
label_column = "label"

dataset = dataset["train"].train_test_split(test_size=200, seed=42, shuffle=True)
dataset["validation"] = dataset["test"]
del dataset["test"]

classes = dataset["train"].features[label_column].names
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x[label_column]]},
    batched=True,
    num_proc=1,
)


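The `dataset.map` call above adds a `text_label` column holding each label's name. A minimal sketch of what the mapped function does to one batch, assuming the class names are ordered `negative`, `neutral`, `positive` (an assumption about this dataset's label feature) and using made-up sentences:

```python
# Assumed order of the label names; in the notebook this list comes from
# dataset["train"].features[label_column].names.
classes = ["negative", "neutral", "positive"]


def add_text_labels(batch, label_column="label"):
    """Map integer class ids to their string names for one batch.

    With batched=True, the mapped function receives a dict of lists
    and returns a dict of lists for the new column(s).
    """
    return {"text_label": [classes[label] for label in batch[label_column]]}


# Hypothetical batch in the dict-of-lists layout.
batch = {
    "sentence": ["Profit rose.", "The meeting is on Monday.", "Sales fell sharply."],
    "label": [2, 1, 0],
}
result = add_text_labels(batch)
print(result)
```

The string labels are what the model is later trained to generate, which is why the targets are tokenized text rather than class indices.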
## Model

Next, let's load our model and apply PEFT, specifically LoRA, to it! This is as simple as creating a PEFT configuration, such as `peft.LoraConfig`, and passing it to the `get_peft_model` function alongside your model. Note that PEFT is a relatively new library, so both the supported models and the supported techniques can and will change over time. Please refer to the [Hugging Face documentation](https://huggingface.co/docs/peft/main/en/index) for more information.
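To make the idea concrete, here is a minimal pure-Python sketch of what a LoRA-adapted linear layer computes, assuming the usual formulation in which a frozen weight `W` is augmented with a low-rank update `B A` scaled by `lora_alpha / r`. The helper names (`matvec`, `lora_forward`) and the toy matrices are illustrative only and are not part of the `peft` API.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    scaling = lora_alpha / r
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank path
    return [b + scaling * u for b, u in zip(base, update)]


# Toy example with hypothetical values: d_in = 3, d_out = 2, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
x = [1.0, 2.0, 3.0]

# In the standard LoRA setup B starts at zero, so the adapted layer
# initially behaves exactly like the frozen layer.
B_init = [[0.0], [0.0]]
out_init = lora_forward(W, A, B_init, x, r=1, lora_alpha=2)

# Once B is non-zero, the low-rank path shifts the output.
B_trained = [[1.0], [-1.0]]
out_trained = lora_forward(W, A, B_trained, x, r=1, lora_alpha=2)
print(out_init, out_trained)
```

Only `A` and `B` receive gradients during training, which is why the number of trainable parameters is so much smaller than the size of the full model.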

In [None]:
# Build Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

# Use LoRA
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=r, lora_alpha=32, lora_dropout=dropout_rate
)
model = get_peft_model(model, peft_config)

model.print_trainable_parameters()
model = model.to(device)

trainable params: 1,179,648 || all params: 784,329,728 || trainable%: 0.15040205131686657


As seen above, LoRA reduces the number of parameters that need to be trained from 784M to roughly 1.2M (about 0.15% of the model); the remaining weights stay frozen. The rest of this notebook follows a standard training procedure.
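The trainable parameter count can be reproduced with a little arithmetic. The sketch below assumes details that this notebook does not state: that Flan-T5-Large has 24 encoder and 24 decoder layers with 1024-wide query and value projections, and that LoRA is applied to the query and value projections of every attention block. The function name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(n_adapted_matrices, d_in, d_out, r):
    """Each adapted matrix gains A (r x d_in) and B (d_out x r)."""
    return n_adapted_matrices * r * (d_in + d_out)


# Assumptions about Flan-T5-Large (not stated in this notebook):
n_encoder_layers = 24      # one self-attention block each
n_decoder_layers = 24      # one self-attention and one cross-attention block each
d_model = 1024             # input width of the q/v projections
d_inner = 1024             # output width of the q/v projections
adapted_per_attention = 2  # LoRA on the query and value projections

n_attention_blocks = n_encoder_layers + 2 * n_decoder_layers
n_adapted = n_attention_blocks * adapted_per_attention

trainable = lora_trainable_params(n_adapted, d_model, d_inner, r=4)
total = 784_329_728  # "all params" as printed by print_trainable_parameters()
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}% of all params)")
```

Under these assumptions the count scales linearly with `r`, so doubling the rank doubles the number of trainable parameters.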

First we pre-process our data and build the dataloaders.

In [None]:
# Preprocess dataset
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

processed_datasets = dataset.map(
    functools.partial(preprocess_function, tokenizer=tokenizer, text_column=text_column, label_column=label_column, max_length=max_length),
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]

# Build dataloaders
train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)


And now we train!

In [None]:
# Model training loop
optimizer = AdamW(model.parameters(), lr=lr)

for epoch in range(n_epochs):
    # Train
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        
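        # step is 0-based: update after every grad_accumulation_steps batches and on the last batch
        if ((step + 1) % grad_accumulation_steps == 0) or (step + 1) == len(train_dataloader):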
            optimizer.step()
            optimizer.zero_grad()
    # Validate
    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        eval_loss += loss.detach().float()

        with torch.no_grad():
            preds = tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        eval_preds.extend(preds)
    # Calculate metrics
    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")


100%|██████████| 258/258 [01:11<00:00,  3.62it/s]
100%|██████████| 25/25 [00:02<00:00, 10.83it/s]


epoch=0: train_ppl=tensor(1.8698, device='cuda:0') train_epoch_loss=tensor(0.6258, device='cuda:0') eval_ppl=tensor(1.0450, device='cuda:0') eval_epoch_loss=tensor(0.0440, device='cuda:0')


100%|██████████| 258/258 [01:11<00:00,  3.60it/s]
100%|██████████| 25/25 [00:02<00:00, 10.65it/s]


epoch=1: train_ppl=tensor(1.0452, device='cuda:0') train_epoch_loss=tensor(0.0442, device='cuda:0') eval_ppl=tensor(1.0656, device='cuda:0') eval_epoch_loss=tensor(0.0635, device='cuda:0')


100%|██████████| 258/258 [01:11<00:00,  3.60it/s]
100%|██████████| 25/25 [00:02<00:00, 10.77it/s]

epoch=2: train_ppl=tensor(1.0269, device='cuda:0') train_epoch_loss=tensor(0.0265, device='cuda:0') eval_ppl=tensor(1.0434, device='cuda:0') eval_epoch_loss=tensor(0.0425, device='cuda:0')

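The perplexities in the log lines are the exponential of the mean cross-entropy loss for that epoch. A small sketch that recomputes them from the rounded losses printed above (the function name `perplexity` is just for illustration; the last digit can differ slightly because the printed losses are rounded):

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Mean losses copied from the epoch 0 and epoch 2 log lines.
logged = [
    ("train, epoch 0", 0.6258),
    ("eval, epoch 0", 0.0440),
    ("eval, epoch 2", 0.0425),
]
for name, loss in logged:
    print(f"{name}: loss={loss:.4f} -> perplexity={perplexity(loss):.4f}")
```

A perplexity close to 1 means the model puts almost all of its probability mass on the correct label tokens.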


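Gradient accumulation sums gradients over several batches before each optimizer update. The sketch below shows one common way to schedule those updates, assuming 0-based batch indices as produced by `enumerate`; `optimizer_update_steps` is a hypothetical helper, and 258 is the number of training batches per epoch shown in the progress bars.

```python
def optimizer_update_steps(n_batches, grad_accumulation_steps):
    """Return the 0-based batch indices after which the optimizer should step.

    An update happens after every grad_accumulation_steps batches, plus once
    at the end of the epoch so leftover gradients are not dropped.
    """
    return [
        step
        for step in range(n_batches)
        if (step + 1) % grad_accumulation_steps == 0 or (step + 1) == n_batches
    ]


n_batches = 258  # training batches per epoch in this notebook

updates_no_accum = optimizer_update_steps(n_batches, 1)
updates_accum_4 = optimizer_update_steps(n_batches, 4)
print(len(updates_no_accum), len(updates_accum_4))

# Small case that is easy to check by hand: 10 batches, accumulate over 4.
print(optimizer_update_steps(10, 4))
```

With `grad_accumulation_steps = 1`, as configured in this notebook, every batch triggers an update. When accumulating over more batches, it is also common to divide each batch's loss by the number of accumulation steps so the gradient magnitude stays comparable.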


## Evaluation

We evaluate using accuracy on the validation set, comparing the predictions from the final epoch against the true labels.

In [None]:
correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
print(f"{accuracy=} % on the evaluation dataset")
print(f"{eval_preds[40:60]=}")
print(f"{dataset['validation']['text_label'][40:60]=}")

accuracy=97.0 % on the evaluation dataset
eval_preds[40:60]=['neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'negative', 'positive', 'negative', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
dataset['validation']['text_label'][40:60]=['neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'negative', 'positive', 'negative', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
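Overall accuracy can hide differences between classes. As an optional follow-up, here is a minimal sketch of a per-class breakdown using the same string comparison as above. The function name `per_class_accuracy` is hypothetical and the predictions and labels in the example are made up, not results from this notebook.

```python
from collections import Counter


def per_class_accuracy(preds, labels):
    """Fraction of correct predictions for each true label."""
    correct, total = Counter(), Counter()
    for pred, true in zip(preds, labels):
        true = true.strip()
        total[true] += 1
        if pred.strip() == true:
            correct[true] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy data for illustration only.
toy_preds = ["neutral", "positive", "neutral", "negative"]
toy_labels = ["neutral", "positive", "positive", "negative"]
breakdown = per_class_accuracy(toy_preds, toy_labels)
print(breakdown)
```

In the notebook, the same function could be called with `eval_preds` and `dataset["validation"]["text_label"]`.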
