### Finetuning SmolLM2 with GRPO and LoRA

In [1]:
# Imports
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer
import wandb



For this experiment, we will be using the `smoltldr` dataset.

`smoltldr` contains 2,000 short stories from Reddit in the `prompt` column, each paired with a summary in the `completion` column. The goal of this experiment is to see whether we can use GRPO to finetune our language model to summarize Reddit posts in a similar fashion to the dataset.

In [3]:
# Load dataset
dataset = load_dataset("mlabonne/smoltldr")
print(dataset)

Generating train split: 100%|██████████| 2000/2000 [00:00<00:00, 52769.16 examples/s]
Generating validation split: 100%|██████████| 200/200 [00:00<00:00, 60519.50 examples/s]
Generating test split: 100%|██████████| 200/200 [00:00<00:00, 58514.29 examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 200
    })
    test: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 200
    })
})





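The "large" language model we'll be using is SmolLM2-135M. As the name suggests, this is a small model with only 135M parameters (compared with the billions in today's leading-edge LLMs). Note that the cell below loads the `HuggingFaceTB/SmolLM-135M-Instruct` checkpoint, which is the first-generation SmolLM; the SmolLM2 counterpart is published as `HuggingFaceTB/SmolLM2-135M-Instruct`. The model's size makes it feasible to run and finetune on limited hardware for educational purposes, but the result is unlikely to be good enough for practical summarization work.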

In [5]:
# Load model
model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)



The next step is to set up the LoRA configuration. With LoRA, we train only a small number of added parameters instead of the full model, which reduces the memory needed for finetuning.

What is LoRA?

LoRA stands for "Low-Rank Adaptation." In brief, it is a parameter-efficient finetuning technique that trains only a small number of additional weights. Instead of updating all of the model's parameters, LoRA represents each weight update as the product of two low-rank matrices that "nudge" the model's behavior in the desired direction, and finetunes only those matrices.

Step-by-step:
1. Freezes the original model weights W so they are not updated during finetuning
2. Chooses the target layers to inject LoRA into
3. For each target layer, adds two trainable low-rank matrices A and B whose product represents the change in W (delta(W) = B * A, where A has shape r x d_in, B has shape d_out x r, and the rank r is much smaller than d_in and d_out)
4. Updates and saves only the LoRA parameters A and B during finetuning
5. At inference time, loads the frozen base model alongside the LoRA adapter weights, so each adapted layer computes its output with W + delta(W)
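The steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than PEFT's implementation: the layer sizes, the `matmul` helper, and the `effective_weight` function are hypothetical, while the zero initialization of B and the `alpha / r` scaling follow the usual LoRA convention.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
d_out, d_in, r, alpha = 8, 8, 2, 4  # toy sizes, chosen only for illustration

# Step 1: the pretrained weight W is frozen (never modified below).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Step 3: two small trainable matrices. B starts at zero, so delta(W) = 0 at first.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                 # d_out x r

def effective_weight(W, A, B):
    """Step 5: combine the frozen weight with the scaled low-rank update."""
    delta = matmul(B, A)   # d_out x d_in, but rank at most r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W_start = effective_weight(W, A, B)  # identical to W because B is all zeros

# Step 4: pretend an optimizer step changed one entry of B (W itself is untouched).
B[0][0] = 0.5
W_tuned = effective_weight(W, A, B)

full_params = d_out * d_in        # parameters a full update would train
lora_params = r * (d_in + d_out)  # parameters LoRA trains instead
print(full_params, lora_params)
```

Even at this toy size LoRA trains half as many values as a full update; the gap widens quickly as the layer dimensions grow while r stays small.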

In [6]:
# Load LoRA
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 4,884,480 || all params: 139,399,488 || trainable%: 3.5039
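The trainable-parameter count in the output above can be reproduced by hand, since a rank-r adapter on a linear layer adds r * (in_features + out_features) parameters. The sketch below assumes the layer shapes from the checkpoint's published configuration (hidden size 576, key/value projection width 192, MLP width 1536, 30 decoder layers) and assumes that `target_modules="all-linear"` adapts every linear layer except the output head; it is worth checking these against `model.config` before relying on them.

```python
# Assumed (in_features, out_features) of each linear layer in one decoder block.
hidden, kv_dim, intermediate = 576, 192, 1536
num_layers = 30  # assumed number of decoder blocks
r = 16           # LoRA rank from the LoraConfig above

linear_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter adds A (r x in_features) and B (out_features x r).
per_layer = sum(r * (n_in + n_out) for n_in, n_out in linear_shapes.values())
trainable = per_layer * num_layers

all_params = 139_399_488  # "all params" as reported in the cell output
trainable_pct = round(100 * trainable / all_params, 4)

print(f"{trainable:,}")
print(trainable_pct)
```

Under these assumptions the total comes to 4,884,480, matching the reported figure.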


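Now we need to define the reward function. GRPO is flexible and can use any reward function to improve the model. In this case, we'll use a simple reward function that encourages the model to generate completions with a target length of 50. Note that `len(completion)` counts characters when the completion is a string, so the target is 50 characters rather than 50 tokens. The reward is 0 at exactly that length and becomes more negative the further the length deviates from it.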

In [7]:
# Reward function
ideal_length = 50


def reward_len(completions, **kwargs):
    return [-abs(ideal_length - len(completion)) for completion in completions]
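
To see how this reward behaves, the sketch below calls the same function on a few made-up completions. It assumes completions arrive as plain strings, in which case `len` counts characters. The `reward_len_words` variant is hypothetical and not part of the original notebook; it uses whitespace-separated words as a rough stand-in for tokens, avoiding the need for a tokenizer.

```python
ideal_length = 50

def reward_len(completions, **kwargs):
    # Same reward as above: 0 at the ideal length, more negative further away.
    return [-abs(ideal_length - len(completion)) for completion in completions]

# Made-up completions of 50, 40, and 70 characters.
samples = ["x" * 50, "x" * 40, "x" * 70]
scores = reward_len(samples)
print(scores)  # [0, -10, -20]

# Hypothetical variant: measure length in words instead of characters.
def reward_len_words(completions, ideal_words=50, **kwargs):
    return [-abs(ideal_words - len(c.split())) for c in completions]

word_scores = reward_len_words(["word " * 50, "word " * 10])
print(word_scores)  # [0, -40]
```

The penalty is symmetric, so a completion that is 10 characters too short is scored the same as one that is 10 characters too long.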

In this step we use the `GRPOConfig` class to define the training arguments. The important ones are:
1. Learning rate (`learning_rate`): controls how fast the model learns. As with any neural network, too high a value makes training unstable and too low a value makes convergence slow.
2. Number of training epochs (`num_train_epochs`): more epochs mean more learning but a higher risk of overfitting; fewer epochs mean faster training but a risk of underfitting.
3. `bf16`: bfloat16 precision enables faster and more memory-efficient training on hardware that supports it (such as TPUs or newer GPUs).
4. Optimizer (`optim`): `adamw_8bit` is AdamW with its optimizer state stored in 8 bits, which reduces optimizer memory. It relies on the `bitsandbytes` package.

In [8]:
# Training arguments
training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=8,
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,
    logging_steps=1,
)
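
The `num_generations=8` setting means that GRPO samples 8 completions for each prompt and scores every completion relative to the others in its group. The sketch below illustrates that group-relative normalization with the length reward from earlier. The completion lengths are made up, the function name is hypothetical, and details such as the epsilon value or the choice of population versus sample standard deviation may differ from what TRL does internally.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

ideal_length = 50
# Hypothetical lengths of the 8 completions sampled for a single prompt.
lengths = [20, 35, 48, 50, 60, 75, 90, 96]
rewards = [-abs(ideal_length - n) for n in lengths]

advantages = group_relative_advantages(rewards)
for n, rew, adv in zip(lengths, rewards, advantages):
    print(f"len={n:3d}  reward={rew:4d}  advantage={adv:+.2f}")
```

Completions closer to the ideal length than the group average get a positive advantage and are reinforced, while the rest get a negative advantage, so no separate value model is needed to provide a baseline.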

Finally, we can initialize the trainer with model, dataset and training arguments to start the training process.

In [None]:
# Trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_len],
    args=training_args,
    train_dataset=dataset["train"],
)

# Train model
trainer.train()