# Fine-tuning LLMs

In this notebook we will look at a simple way to fine-tune pretrained LLMs for tasks specific to your use case.

In [None]:
!pip install transformers[torch]
!pip install datasets
!pip install evaluate

Fine-tuning has the following advantages:

*   Lets you use state-of-the-art models without training one from scratch
*   Reduces computation cost and carbon footprint
*   Adapts a pretrained model to a specific dataset for your task



**Choosing the dataset**

Here we will be using the Yelp reviews dataset.

In [None]:
# Loading dataset

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [None]:
print(f'Number of training samples: {len(dataset["train"])}')
print(f'Number of test samples: {len(dataset["test"])}')

Number of training samples: 650000
Number of test samples: 50000


In [None]:
dataset["train"][11]

{'label': 0,
 'text': "This place is absolute garbage...  Half of the tees are not available, including all the grass tees.  It is cash only, and they sell the last bucket at 8, despite having lights.  And if you finish even a minute after 8, don't plan on getting a drink.  The vending machines are sold out (of course) and they sell drinks inside, but close the drawers at 8 on the dot.  There are weeds grown all over the place.  I noticed some sort of batting cage, but it looks like those are out of order as well.  Someone should buy this place and turn it into what it should be."}

In [None]:
dataset["train"][22]

{'label': 1,
 'text': "Very disappointed in the customer service. We ordered Reuben's  and wanted coleslaw instead of kraut. They charged us $3.00 for the coleslaw. We will not be back . The iced tea is also terrible tasting."}

**Tokenizing the text**

Use a tokenizer to process the text, with a padding and truncation strategy to handle variable sequence lengths.

In [None]:
from transformers import AutoTokenizer

# Tokenizer for the model name specified
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [26]:
print(list(tokenizer.vocab)[:1000])



In [None]:
print(len(tokenizer.vocab))

28996

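The padding and truncation strategy mentioned above can be illustrated without the library. The following is a minimal sketch, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption based on BERT-style vocabularies. The real tokenizer also keeps special tokens such as `[CLS]` and `[SEP]` in place when truncating, which this sketch ignores.

```python
# Illustrative only: mimics the effect of padding="max_length" and
# truncation=True on a list of token ids. Not the library's code.
PAD_ID = 0  # assumption: BERT-style vocabularies reserve id 0 for padding


def pad_and_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop ids past the limit
    n_pad = max_length - len(kept)                   # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded; a long one gets truncated.
short_ids, short_mask = pad_and_truncate([7, 8, 9, 10], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what allows the examples to be stacked into fixed-size batches later on.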

Define a tokenize function with a padding and truncation strategy, then use the Datasets `map` method to apply it to the entire dataset in batches.

In [None]:
def tokenize_function(samples):
    return tokenizer(samples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


Since this is an illustration, let's create a smaller subset of the dataset to fine-tune on, which reduces the time needed.

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

print(len(small_train_dataset), len(small_eval_dataset))

1000 100

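Conceptually, `shuffle(seed=42).select(range(1000))` draws a reproducible random sample of rows. The sketch below shows the idea with NumPy only. `shuffled_subset_indices` is a hypothetical helper, and it is not expected to pick the same rows as the Datasets library, whose internal shuffling may differ.

```python
import numpy as np


def shuffled_subset_indices(num_rows, subset_size, seed):
    """Pick subset_size distinct row indices after a seeded shuffle."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(num_rows)  # shuffle all row indices
    return permutation[:subset_size]         # keep the first subset_size of them


train_idx = shuffled_subset_indices(650_000, 1_000, seed=42)
train_idx_again = shuffled_subset_indices(650_000, 1_000, seed=42)

print(len(train_idx), len(set(train_idx.tolist())))  # sample size, number of distinct rows
print(np.array_equal(train_idx, train_idx_again))    # same seed, same sample
```

Fixing the seed keeps the subset identical across runs, which makes the accuracy numbers from different runs comparable.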

**Select model for fine-tuning**

Here we are going to fine-tune `bert-base-cased` on a sequence classification task involving 5 labels.

In [None]:
from transformers import AutoModelForSequenceClassification

# Specify number of labels
num_labels = 5

# Get pretrained model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=num_labels)

# Get number of model parameters (2 ways)
print(f'{model.num_parameters()/1e6} M parameters')
print(f'{sum(p.numel() for p in model.parameters())/1e6} M parameters')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


108.314117 M parameters
108.314117 M parameters

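The parameter count printed above can be checked by hand. The sketch below assumes the standard BERT-base configuration (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types, a pooler layer) together with the vocabulary size of 28996 printed earlier and a linear classification head with 5 outputs. The feed-forward size and the pooler are assumptions taken from the usual BERT-base setup, not from this notebook's output.

```python
# Assumed BERT-base sizes; only vocab_size and num_labels come from the notebook.
vocab_size, hidden, max_positions, token_types = 28996, 768, 512, 2
num_layers, intermediate, num_labels = 12, 3072, 5


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift value per hidden unit

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + token_types) * hidden + layer_norm

# One encoder layer: self-attention block plus feed-forward block.
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)

pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)  # the newly initialized head

total = embeddings + encoder + pooler + classifier
print(f"classification head: {classifier} parameters")
print(f"total: {total} parameters")
```

Under these assumptions the total agrees with the 108.314117 M reported by `model.num_parameters()`, and the classification head accounts for only a few thousand of those parameters.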

The warning says that some weights were newly (randomly) initialized. This is expected: the pretraining head of the BERT checkpoint is discarded and replaced with a randomly initialized classification head (`classifier.weight` and `classifier.bias`). This new head is fine-tuned on the sequence classification task, transferring the knowledge of the pretrained model to it.

In [None]:
# Get device
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Device: {device}')

Device: cuda


In [None]:
# Send model to device
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

**Trainer class**

Transformers provides the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class, which makes it easier to train without having to write the training loop manually. It provides many options and features for training, such as logging, gradient accumulation, and mixed precision.

The [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) class lets you set the flags that activate different training options. You can start with the default arguments and specify only where to save the checkpoints during training.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

The Trainer class doesn't automatically evaluate the model during training. You have to pass it a function that computes and reports metrics. The [Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) function that you can load with [evaluate.load](https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/loading_methods#evaluate.load).

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember that all 🤗 Transformers models return logits).

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

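To make the logits-to-predictions step concrete, here is a small NumPy-only sketch. The logits and labels are made up for illustration and are not output from the model above. It also shows that applying softmax first would not change the predicted class, which is why `compute_metrics` can take the `argmax` of the raw logits.

```python
import numpy as np

# Made-up logits for 4 reviews and 5 classes (illustrative values only).
logits = np.array([
    [ 2.1,  0.3, -1.0, -0.5, -2.0],
    [-0.2,  1.5,  0.9, -1.1, -0.7],
    [-1.3, -0.4,  0.2,  1.8,  0.6],
    [-2.0, -1.0,  0.1,  0.4,  2.2],
])
labels = np.array([0, 2, 3, 4])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

# Softmax turns logits into probabilities without changing their order.
shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(predictions)  # [0 1 3 4]
print(accuracy)     # 0.75
```
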
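If you'd like to monitor your evaluation metrics during fine-tuning, specify the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch. In recent Transformers releases this parameter is named `eval_strategy`, so use whichever name your installed version accepts.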

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

**Trainer object**

Create a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object with your model, training arguments, training and test datasets, and evaluation function.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling `trainer.train()`.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.330956,0.43
2,No log,1.035276,0.58
3,No log,1.212608,0.53


TrainOutput(global_step=375, training_loss=0.9383234049479167, metrics={'train_runtime': 286.551, 'train_samples_per_second': 10.469, 'train_steps_per_second': 1.309, 'total_flos': 789354427392000.0, 'train_loss': 0.9383234049479167, 'epoch': 3.0})
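
The numbers in `TrainOutput` can be reproduced with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults that apply when only `output_dir` and the evaluation strategy are set: a per-device batch size of 8, 3 epochs, and training-loss logging every 500 steps, on a single device. These defaults are assumptions about the installed Transformers version, so check them against `training_args` if your numbers differ.

```python
import math

num_train_samples = 1000    # size of small_train_dataset
per_device_batch_size = 8   # assumed TrainingArguments default
num_epochs = 3              # assumed TrainingArguments default
logging_steps = 500         # assumed TrainingArguments default
train_runtime = 286.551     # seconds, taken from the TrainOutput above

steps_per_epoch = math.ceil(num_train_samples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_train_samples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))

# The training loss is only logged every logging_steps optimizer steps.
print(total_steps < logging_steps)
```

Under these assumptions the run takes 125 steps per epoch and 375 steps in total, matching `global_step=375`. Because 375 is below the assumed logging interval of 500, no training loss is logged during the run, which would explain the `No log` entries in the table.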