## Fine-tuning for classification

In [20]:
import pandas as pd

import torch
from torch.utils.data import DataLoader

from gptlight.tokenizer import GPTTokenizer
from gptlight.models import GPTModel
from gptlight.config import GPTConfig
from gptlight.training import load_model, save_model
from gptlight.utils import fetch_sms_spam_collection
from gptlight.data import ClassifcationDataset

In [2]:
torch.manual_seed(123)
if torch.cuda.is_available():
    torch.cuda.manual_seed(123)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device : {device}")

device : cpu


In [3]:
tokenizer = GPTTokenizer()

So far, we have coded the LLM architecture, pretrained it, and learned how to import pretrained weights from an external source, such as OpenAI, into our model. Now we will reap the fruits of our labor by fine-tuning the LLM on a specific target task, such as classifying text. The concrete example we examine is classifying text messages as “spam” or “not spam.”

## Different categories of fine-tuning

The most common ways to fine-tune language models are instruction fine-tuning and classification fine-tuning. Instruction fine-tuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts.

In classification fine-tuning, a concept you might already be acquainted with if you have a background in machine learning, the model is trained to recognize a specific set of class labels, such as “spam” and “not spam.” Examples of classification tasks extend beyond LLMs and email filtering: they include identifying different species of plants from images; categorizing news articles into topics like sports, politics, and technology; and distinguishing between benign and malignant tumors in medical imaging.

The key point is that a classification fine-tuned model is restricted to predicting classes it has encountered during its training. For instance, it can determine whether something is “spam” or “not spam,” but it cannot say anything else about the input text.

In contrast to the classification fine-tuned model, an instruction fine-tuned model typically can undertake a broader range of tasks. We can view a classification fine-tuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks.

## Choosing the right approach

Instruction fine-tuning improves a model’s ability to understand and generate responses based on specific user instructions. Instruction fine-tuning is best suited for models that need to handle a variety of tasks based on complex user instructions, improving flexibility and interaction quality. Classification fine-tuning is ideal for projects requiring precise categorization of data into predefined classes, such as sentiment analysis or spam detection.

While instruction fine-tuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification fine-tuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained.

## Preparing the dataset

We will modify and classification fine-tune the GPT model we previously implemented and pretrained. We begin by downloading and preparing the dataset.

> We use the `fetch_sms_spam_collection` function (imported above from `gptlight.utils`) to download the dataset into a pandas DataFrame.

In [4]:
df = fetch_sms_spam_collection()
df.head()

| Unnamed: 0 | Label | Text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Let's examine the class label distribution:

In [5]:
df["Label"].value_counts()

Label
ham     4825
spam     747
Name: count, dtype: int64
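To put these counts in perspective, the short calculation below uses only the two numbers printed above. It is a quick sanity check rather than part of the pipeline, and the "always ham" baseline is a hypothetical classifier introduced purely for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total

# Accuracy of a hypothetical baseline that labels every message as "ham".
always_ham_accuracy = counts["ham"] / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {always_ham_accuracy:.1%}")
```

With roughly one spam message for every six to seven ham messages, a model could score well on accuracy while ignoring spam entirely, which is one reason to balance the classes before training.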

For simplicity, and because we prefer a small dataset (which will facilitate faster fine-tuning of the LLM), we choose to undersample the dataset to include 747 instances from each class.

In [6]:
num_spam = sum(df.Label=="spam")
ham_subset = df[df.Label=="ham"].sample(num_spam, random_state=123)
df_subset = pd.concat([
    ham_subset, df[df.Label=="spam"]
])
df_subset["Label"].value_counts()

Label
ham     747
spam    747
Name: count, dtype: int64

Now we will convert the string class labels "ham" and "spam" into the integer class labels 0 and 1, respectively:

In [7]:
df_subset["Label"] = df_subset["Label"].map({"ham":0, "spam":1})

Next, we create a random_split function to split the dataset into three parts: 70% for training, 10% for validation, and 20% for testing.

In [9]:
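def random_split(df: pd.DataFrame, train_frac, val_frac, random_state=None):
    # Shuffle all rows, then renumber the index from 0
    df = df.sample(
        frac=1, random_state=random_state
    ).reset_index(drop=True)

    # Row positions where the training and validation parts end
    train_end = int(len(df)*train_frac)
    val_end = train_end + int(len(df)*val_frac)

    # Whatever remains after the validation part becomes the test set
    train_df = df[:train_end]
    val_df = df[train_end:val_end]
    test_df = df[val_end:]

    return train_df, val_df, test_df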

In [10]:
train_df, val_df, test_df = random_split(df_subset, train_frac=0.7, val_frac=0.1, random_state=123)
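The resulting split sizes can be worked out by hand. The sketch below repeats the boundary arithmetic of `random_split` on the row count alone, assuming the balanced dataset has 747 + 747 = 1,494 rows as shown earlier.

```python
# Size of the balanced dataset: 747 "ham" + 747 "spam" messages.
n_total = 747 * 2

# Same boundary arithmetic as in random_split.
train_end = int(n_total * 0.7)
val_end = train_end + int(n_total * 0.1)

n_train = train_end
n_val = val_end - train_end
n_test = n_total - val_end

print(n_train, n_val, n_test)
```

Because `int()` truncates, the rows lost to rounding end up in the test set, so its share is slightly above 20%.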

## Creating data loaders

Previously, we utilized a sliding window technique to generate uniformly sized text chunks, which we then grouped into batches for more efficient model training. Each chunk functioned as an individual training instance.
However, we are now working with a spam dataset that contains text messages of varying lengths. To batch these messages as we did with the text chunks, we have two primary options:

- Truncate all messages to the length of the shortest message in the dataset or batch.
- Pad all messages to the length of the longest message in the dataset or batch.

The first option is computationally cheaper, but it may result in significant information loss if the shortest message is much shorter than the average or longest messages, potentially reducing model performance. So, we opt for the second option, which preserves the entire content of all messages.

To implement batching, where all messages are padded to the length of the longest message in the dataset, we add padding tokens to all shorter messages. For this purpose, we use "<|endoftext|>" as a padding token.

However, instead of appending the string "<|endoftext|>" to each of the text
messages directly, we can add the token ID corresponding to "<|endoftext|>" to the encoded text messages.

The token ID of the padding token "<|endoftext|>" is 50256.
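Padding at the token-ID level can be sketched in a few lines of plain Python. The helper name `pad_or_truncate` and the example IDs are assumptions made for illustration; this is not the gptlight implementation.

```python
PAD_TOKEN_ID = 50256  # token ID of "<|endoftext|>"


def pad_or_truncate(token_ids, max_length, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of token_ids with exactly max_length entries."""
    # Cut off anything beyond max_length ...
    clipped = list(token_ids[:max_length])
    # ... and fill the remaining positions with the padding token ID.
    return clipped + [pad_token_id] * (max_length - len(clipped))


# The IDs below are made-up placeholders, not real encoded messages.
short_message = [101, 102, 103]
long_message = [201, 202, 203, 204, 205, 206]

padded = pad_or_truncate(short_message, max_length=5)
truncated = pad_or_truncate(long_message, max_length=5)
print(padded)
print(truncated)
```

Every sequence that passes through such a helper has the same length, which is what allows messages to be stacked into one batch tensor.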

> See the `ClassificationDataset` class in `gptlight.data.datasets` for how we implement all these steps.

In [None]:
train_dataset = ClassifcationDataset(
    input_texts=train_df["Text"],
    target_labels=train_df["Label"],
    tokenizer=tokenizer
)

The longest sequence length is stored in the dataset’s max_length attribute.

In [17]:
print(train_dataset.max_length)

120


The code outputs 120, showing that the longest sequence contains no more than
120 tokens, a common length for text messages. The model can handle sequences
of up to 1,024 tokens, given its context length limit. If your dataset includes longer texts, you can pass max_length=1024 when creating the training dataset in the preceding code to ensure that the data does not exceed the model’s supported input (context) length.

Next, we pad the validation and test sets to match the length of the longest training sequence. Importantly, any validation and test set samples exceeding the length of the longest training example are truncated.

In [19]:
val_dataset = ClassifcationDataset(
    input_texts=val_df["Text"],
    target_labels=val_df["Label"],
    max_lenght=train_dataset.max_length,
    tokenizer=tokenizer
)
test_dataset = ClassifcationDataset(
    input_texts=test_df["Text"],
    target_labels=test_df["Label"],
    max_lenght=train_dataset.max_length,
    tokenizer=tokenizer
)
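The way the training set's `max_length` carries over to the other splits can be sketched without gptlight. Everything in the block below is a hypothetical stand-in: the class `ToyClassificationDataset`, its parameter names, and the word-length "tokenizer" are assumptions for illustration, and the real class may be organized differently.

```python
class ToyClassificationDataset:
    """Hypothetical stand-in, not the gptlight implementation."""

    def __init__(self, texts, labels, encode, max_length=None, pad_token_id=50256):
        self.labels = list(labels)
        encoded = [encode(text) for text in texts]

        # Without an explicit limit, use the longest encoded text.
        if max_length is None:
            max_length = max(len(ids) for ids in encoded)
        self.max_length = max_length

        # Truncate longer sequences, then pad shorter ones.
        self.encoded = []
        for ids in encoded:
            ids = ids[:max_length]
            self.encoded.append(ids + [pad_token_id] * (max_length - len(ids)))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.encoded[index], self.labels[index]


def toy_encode(text):
    # Stand-in tokenizer: one made-up ID (the word length) per word.
    return [len(word) for word in text.split()]


train_toy = ToyClassificationDataset(["win cash now", "ok"], [1, 0], toy_encode)
val_toy = ToyClassificationDataset(
    ["see you at the usual place tonight"], [0], toy_encode,
    max_length=train_toy.max_length,
)
print(train_toy.max_length, train_toy[1], val_toy[0])
```

In this toy setup the training set fixes the length at three tokens, so the seven-word validation message is cut to its first three tokens, while the one-word training message is padded.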

> With the datasets prepared, we create the PyTorch data loaders.

In [22]:
num_workers = 0
batch_size = 8
torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)

In [None]:
input_batch, target_batch = next(iter(train_loader))
print(input_batch, target_batch)

tensor([[ 4805,  3824,  6158,     0,  3406,  5816, 10781, 21983,   329,   657,
          3695, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
        [   39,  2394, 28323, 29250,  1921, 11015, 

In [26]:
print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)

Input batch dimensions: torch.Size([8, 120])
Label batch dimensions torch.Size([8])


As we can see, the input batches consist of eight training examples with 120 tokens each, as expected. The label tensor stores the class labels corresponding to the eight training examples.
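The number of batches each loader yields follows from the split sizes and the `drop_last` setting. The calculation below assumes the split sizes implied by a 70/10/20 split of 1,494 messages (1,045, 149, and 300). If those assumptions hold, the results should match `len(train_loader)`, `len(val_loader)`, and `len(test_loader)`.

```python
import math

batch_size = 8
# Split sizes implied by the 70/10/20 split of 1,494 messages (an assumption).
n_train, n_val, n_test = 1045, 149, 300

# drop_last=True discards the final, incomplete batch.
train_batches = n_train // batch_size
# drop_last=False keeps the incomplete batch, so we round up.
val_batches = math.ceil(n_val / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(f"{train_batches} training batches")
print(f"{val_batches} validation batches")
print(f"{test_batches} test batches")
```

Dropping the last training batch keeps every training batch at exactly eight examples, at the cost of skipping a few messages per epoch.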

## Initializing a model with pretrained weights

In [43]:
GPT_SMALL_CONFIG_124M = GPTConfig(
    vocab_size=50257,
    context_length=1024,
    emb_dim=768,
    n_heads=12,
    n_layers=12,
    drop_rate=0.0,
    qkv_bias=True
)
INPUT_PROMPT = "Every effort moves"
gpt_model = GPTModel(GPT_SMALL_CONFIG_124M)
gpt_model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): GPTTransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): GPTTransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_feat

In [44]:
state = load_model(path="../assets/checkpoints/gpt.pt", device=device)
gpt_model.load_state_dict(state_dict=state["model_state"])

<All keys matched successfully>

In [45]:
torch.manual_seed(123)
tokenizer = GPTTokenizer()
token_ids = gpt_model.generate(
    idx=tokenizer.encode(INPUT_PROMPT).unsqueeze(0).to(device),
    max_new_tokens=15,
    context_size=GPT_SMALL_CONFIG_124M.context_length,
    top_k=15,
    temperature=1.5
)
print("Output text:\n", tokenizer.decode(token_ids))

Output text:
 Every effort moves slowly and slowly in all parts of the galaxy, from asteroids to space and


Before we start fine-tuning the model as a spam classifier, let’s see whether the model already classifies spam messages by prompting it with instructions:

In [51]:
torch.manual_seed(123)
text = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)
token_ids = gpt_model.generate(
    idx=tokenizer.encode(text).unsqueeze(0).to(device),
    max_new_tokens=23,
    context_size=GPT_SMALL_CONFIG_124M.context_length,
    #top_k=15,
    #temperature=1.5
)
print("Output text:\n", tokenizer.decode(token_ids))

Output text:
 Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.'

The following text 'spam'? Answer with 'yes' or 'no': 'You are a winner


Based on the output, it’s apparent that the model is struggling to follow instructions. This result is expected, as it has only undergone pretraining and lacks instruction fine-tuning. So, let’s prepare the model for classification fine-tuning.

## Adding a classification head