# Fine-tuning an LLM with Hugging Face Trainer

In the previous notebook we fine-tuned a language model using a **manual PyTorch training loop**:

- custom `Dataset` and `DataLoader`
- explicit `optimizer.step()`, `scheduler.step()`, `loss.backward()`
- manual evaluation

This notebook uses the **same model and dataset**, but relies on **Hugging Face `datasets` and `Trainer`** to handle most of the training boilerplate.

Goals:

- Show how to fine-tune the same model and dataset with less code.
- Connect the high-level `Trainer` API to the manual loop from the previous notebook.

We still:

- Load a pretrained Llama-family model (TinyLlama) from the Hugging Face Hub.
- Load the Guanaco / OpenAssistant dataset from the Hugging Face Hub.
- Tokenize the data for causal language modeling.
- Train for a small number of steps.
- Run inference and compare base vs fine-tuned model.


> **Bazzite-AI Setup Required**  
> Run `D0_00_Bazzite_AI_Setup.ipynb` first to verify GPU access.

In [30]:
import os
from dataclasses import dataclass

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

torch.__version__

'2.9.1+cu130'

In [31]:
@dataclass
class Config:
    # Model from HuggingFace Hub (same as D3_02)
    HF_LLM_MODEL: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    # Output directory for fine-tuned model
    output_dir: str = "ft_model_trainer"

    # Data
    max_length: int = 256

    # Optimization
    batch_size: int = 1          # reduced for memory efficiency
    num_epochs: int = 1
    learning_rate: float = 5e-6
    weight_decay: float = 0.01
    warmup_ratio: float = 0.1
    gradient_accumulation_steps: int = 16  # increased to compensate

    # For demo, use small subsets
    train_subset_size: int = 500
    val_subset_size: int = 100

    seed: int = 42
    device: str = "cuda" if torch.cuda.is_available() else "cpu"

cfg = Config()
cfg

Config(HF_LLM_MODEL='TinyLlama/TinyLlama-1.1B-Chat-v1.0', output_dir='ft_model_trainer', max_length=256, batch_size=1, num_epochs=1, learning_rate=5e-06, weight_decay=0.01, warmup_ratio=0.1, gradient_accumulation_steps=16, train_subset_size=500, val_subset_size=100, seed=42, device='cuda')

The hyperparameters here mirror those from the manual PyTorch notebook:

- `batch_size`, `num_epochs`, `learning_rate`, `weight_decay`, `warmup_ratio`, and `gradient_accumulation_steps` have the same meaning.
- `max_length` controls the maximum sequence length after tokenization.
- For the demo we only use small subsets of the full dataset (`train_subset_size` and `val_subset_size`) so that fine-tuning finishes quickly.

The difference is that we will pass these values into `TrainingArguments` instead of using them directly in a manual training loop.
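To make the interaction between `batch_size`, `gradient_accumulation_steps`, and `warmup_ratio` concrete, the arithmetic can be sketched as follows. The helper name `training_schedule` is hypothetical, and the rounding rules (rounding up both the optimizer-step count and the warmup-step count) are an assumption about how recent `transformers` releases behave on a single device. The resulting step count does agree with the `global_step=32` reported by the training run later in this notebook.

```python
import math

def training_schedule(num_examples, batch_size, grad_accum_steps, num_epochs, warmup_ratio):
    """Estimate effective batch size, optimizer steps, and warmup steps (single device)."""
    effective_batch = batch_size * grad_accum_steps
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: a trailing, partially filled accumulation window still
    # triggers one optimizer step at the end of the epoch.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    total_steps = steps_per_epoch * num_epochs
    # Assumption: the number of warmup steps is rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# Values taken from the Config above: 500 examples, batch size 1, 16 accumulation steps.
effective_batch, total_steps, warmup_steps = training_schedule(500, 1, 16, 1, 0.1)
print(effective_batch, total_steps, warmup_steps)
```

With these settings each optimizer update averages gradients over 16 examples, even though only one example is held in GPU memory at a time.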

In [32]:
def set_seed(seed: int):
    import random
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(cfg.seed)

[No output generated]

In [33]:
tokenizer = AutoTokenizer.from_pretrained(cfg.HF_LLM_MODEL)

# Many causal LMs do not define a pad token by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Pad token:", tokenizer.pad_token, "ID:", tokenizer.pad_token_id)

model = AutoModelForCausalLM.from_pretrained(
    cfg.HF_LLM_MODEL,
    dtype=torch.bfloat16,  # Use bfloat16 for memory efficiency
)
model.to(cfg.device)

# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()

n_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {n_params / 1e6:.1f}M")
print(f"Model dtype: {next(model.parameters()).dtype}")

Pad token: </s> ID: 2


Number of parameters: 1100.0M
Model dtype: torch.bfloat16


## Dataset format and inference prompt

We use the Guanaco / OpenAssistant dataset. Each entry in the JSONL files has a single field:

```python
"text": "### Human: <instruction>### Assistant: <ideal answer>"
```
During fine-tuning:

- The model sees the full text string.
- It is trained as a causal language model to predict the next token at each position.

At inference time we only have a new user instruction. We must recreate the same format the model saw during training, for example:

```text
### Human: How do I build a PC?### Assistant:
```

and let the model generate the continuation.
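Going the other way, from a stored `text` string back to its individual turns, can be sketched with a regular expression. This is a minimal illustration, not part of the original notebook: the helper name `split_turns` is hypothetical, the sample string is made up, and the sketch assumes the only markers are `### Human:` and `### Assistant:`. It also covers the case where an entry contains more than one exchange.

```python
import re

# Matches the two speaker markers and captures the speaker name.
TURN_PATTERN = re.compile(r"### (Human|Assistant):")

def split_turns(text: str):
    """Split a Guanaco-style string into (speaker, message) pairs."""
    parts = TURN_PATTERN.split(text)
    # With one capturing group, re.split returns
    # [prefix, speaker1, message1, speaker2, message2, ...]
    speakers = parts[1::2]
    messages = parts[2::2]
    return [(speaker, message.strip()) for speaker, message in zip(speakers, messages)]

# Illustrative string in the dataset's format (not an actual dataset entry).
sample = "### Human: How do I build a PC?### Assistant: Start with a budget."
turns = split_turns(sample)
print(turns)
```

The first `Human` message is what would be passed to the prompt builder at inference time.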

In [34]:
def build_prompt_for_inference(user_instruction: str) -> str:
    """
    Build a Guanaco-style prompt for a new instruction at inference time.
    The dataset format looks like:
        "### Human: ...### Assistant: ..."
    """
    return f"### Human: {user_instruction}### Assistant:"

[No output generated]

In [35]:
# Load dataset from HuggingFace Hub (same as D3_02)
dataset = load_dataset("timdettmers/openassistant-guanaco")
dataset

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [36]:
print(dataset)
print("Example training entry:")
print(dataset["train"][0])

# For the demo, restrict to small subsets
train_dataset = dataset["train"].select(range(cfg.train_subset_size))
val_dataset   = dataset["test"].select(range(cfg.val_subset_size))  # Note: "test" not "validation"

len(train_dataset), len(val_dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})
Example training entry:
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast 

(500, 100)

## Tokenization and preprocessing with `datasets`

Instead of writing a custom PyTorch `Dataset` class and a `DataLoader`, we now:

1. Use `datasets.load_dataset` to load the dataset from the Hugging Face Hub.
2. Define a tokenization function that:
   - tokenizes the `"text"` field,
   - truncates or pads to `max_length`,
   - sets `labels` equal to `input_ids` for causal language modeling.
3. Apply this function to the whole dataset with `dataset.map(...)`.

The result is a `Dataset` object that already returns tokenized fields, which `Trainer` can use directly.
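One detail worth noting: because every example is padded to `max_length`, copying `input_ids` into `labels` also asks the model to predict the padding tokens. A common variation, which this notebook does not use, sets the label to `-100` wherever `attention_mask` is 0, since that is the index the cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists, the form the tokenizer returns in batched `map` mode. The helper name `mask_padding_labels` and the toy batch are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [token if keep == 1 else IGNORE_INDEX for token, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: two sequences padded to length 5 with pad id 2.
batch_input_ids = [[1, 835, 12968, 2, 2], [1, 1815, 366, 2436, 263]]
batch_attention = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
labels = mask_padding_labels(batch_input_ids, batch_attention)
print(labels)
```

Inside a tokenization function this could be used as `enc["labels"] = mask_padding_labels(enc["input_ids"], enc["attention_mask"])`. Doing so would change the reported losses, so the numbers would no longer match the outputs shown in this notebook.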


In [37]:
def tokenize_function(batch):
    """
    Tokenize the 'text' field for causal language modeling.

    We:
    - truncate or pad to cfg.max_length,
    - set labels = input_ids (shift is handled by the model internally).
    """
    enc = tokenizer(
        batch["text"],
        truncation=True,
        max_length=cfg.max_length,
        padding="max_length",
    )
    # For causal LM supervised fine-tuning, labels are often the same as input_ids
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],  # drop original string to keep only tokenized fields
)

tokenized_val = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

tokenized_train[0]


{'input_ids': [1, 835, 12968, 29901, 1815, 366, 2436, 263, 3273, 18707, 1048, 278,
  29527, 749, 310, 278, 1840, 376, 3712, 459, 1100, 29891, 29908, 297,
  7766, 1199, 29973, 3529, 671, 6455, 4475, 304, 7037, 1601, 459, 1100,
  583, 297, 278, 23390, 9999, 322, 274, 568, 8018, 5925, 29889, 2277,
  29937, 4007, 22137, 29901, 376, 7185, 459, 1100, 29891, 29908, 14637, 304,
  263, 9999, 3829, 988, 727, 338, 871, 697, 1321, 7598, 363, 263,
  3153, 1781, 470, 2669, 29889, 512, 7766, 1199, 29892, 445, 1840, 338,
  10734, 8018, 297, 278, 10212, 9999, 29892, 988, 263, 1601, 459, 1100,
  29891, 5703, 261, 756, 7282, 3081, 975, 278, 281, 1179, 322, 1985,
  5855, 310, 1009, 22873, 29889, 450, 10122, 310, 263, 1601, 459, 1100,
  29891, 508, 1121, 297, 5224, 281, 1179,

In [38]:
# Ensure the datasets return PyTorch tensors
tokenized_train.set_format(type="torch")
tokenized_val.set_format(type="torch")

tokenized_train

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 500
})

In [39]:
os.makedirs(cfg.output_dir, exist_ok=True)

training_args = TrainingArguments(
    output_dir=cfg.output_dir,
    per_device_train_batch_size=cfg.batch_size,
    per_device_eval_batch_size=cfg.batch_size,
    num_train_epochs=cfg.num_epochs,
    learning_rate=cfg.learning_rate,
    weight_decay=cfg.weight_decay,
    warmup_ratio=cfg.warmup_ratio,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    gradient_accumulation_steps=cfg.gradient_accumulation_steps,
    bf16=torch.cuda.is_available(),  # Use bf16 instead of fp16
    report_to="none",
    load_best_model_at_end=True,
)
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False,


## Using `TrainingArguments` and `Trainer`

`TrainingArguments` defines how training should work:

- `per_device_train_batch_size`, `num_train_epochs`, `learning_rate`, `weight_decay`, `warmup_ratio`, and `gradient_accumulation_steps` correspond directly to the values we used in the manual PyTorch loop.
- `eval_strategy="epoch"` (named `evaluation_strategy` in older `transformers` releases) and `save_strategy="epoch"` tell `Trainer` to run evaluation and save a checkpoint at the end of each epoch.
- `bf16=True` enables bfloat16 mixed-precision training on GPU, similar to wrapping the forward pass in `torch.autocast`.

`Trainer` will:

- construct DataLoaders internally,
- run the training loop (forward, loss, backward, optimizer step, scheduler step),
- handle evaluation and model saving.

Conceptually, under the hood it performs the same sequence of operations as our manual loop in the previous notebook.
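The "scheduler step" in that sequence can be made concrete. Unless `lr_scheduler_type` is set, `TrainingArguments` defaults to a linear schedule with warmup. The sketch below reimplements that shape in plain Python as an illustration only: the function name `linear_warmup_decay_lr` is hypothetical, and the values of 32 total steps and 4 warmup steps are assumptions derived from this notebook's configuration.

```python
def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 5e-6  # cfg.learning_rate
schedule = [
    linear_warmup_decay_lr(step, base_lr, warmup_steps=4, total_steps=32)
    for step in range(33)
]

# Print a few points along the schedule: start, mid-warmup, peak, mid-decay, end.
for step in (0, 2, 4, 18, 32):
    print(f"step {step:2d}: lr = {schedule[step]:.2e}")
```

The learning rate starts at zero, reaches `base_lr` once warmup ends, and then falls linearly back to zero at the final step.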

In [40]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

[No output generated]

In [41]:
train_result = trainer.train()
train_result

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.



TrainOutput(global_step=32, training_loss=2.4102229923009872, metrics={'train_runtime': 30.9312, 'train_samples_per_second': 16.165, 'train_steps_per_second': 1.035, 'total_flos': 794505510912000.0, 'train_loss': 2.4102229923009872, 'epoch': 1.0})

In [42]:
trainer.save_model(cfg.output_dir)
tokenizer.save_pretrained(cfg.output_dir)
print(f"Model saved to {cfg.output_dir}")

Model saved to ft_model_trainer


In [43]:
metrics = trainer.evaluate()
metrics


{'eval_loss': 1.7101176977157593,
 'eval_runtime': 1.1446,
 'eval_samples_per_second': 87.366,
 'eval_steps_per_second': 87.366,
 'epoch': 1.0}

After calling `trainer.train()`:

- `Trainer` has iterated over the training dataset for `num_train_epochs` epochs.
- For each step, it has:
  - computed the loss,
  - backpropagated the gradients,
  - stepped the optimizer and learning rate scheduler once every `gradient_accumulation_steps` batches.
- After each epoch, it has:
  - evaluated on the validation dataset,
  - saved a checkpoint.
- At the end of training, because `load_best_model_at_end=True`, it has reloaded the checkpoint with the best validation loss.

The `metrics` dictionary from `trainer.evaluate()` contains at least the validation loss (`eval_loss`), which is directly comparable to the validation loss from the manual PyTorch notebook.
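A validation loss is often easier to interpret as perplexity, the exponential of the mean per-token cross-entropy. The snippet below applies that conversion to the `eval_loss` value reported above. One caveat: in this notebook the labels include padded positions, so the figure is only meaningful when compared against runs that use the same preprocessing.

```python
import math

eval_loss = 1.7101176977157593  # value reported by trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

In the notebook itself the same conversion could be written as `math.exp(metrics["eval_loss"])`.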

In [44]:
# Base model (unmodified) from HuggingFace Hub
base_model = AutoModelForCausalLM.from_pretrained(
    cfg.HF_LLM_MODEL,
    dtype=torch.bfloat16,
).to(cfg.device)
base_model.eval()

# Fine-tuned model (from Trainer output)
ft_model = AutoModelForCausalLM.from_pretrained(
    cfg.output_dir,
    dtype=torch.bfloat16,
).to(cfg.device)
ft_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rot

In [45]:
def generate_response(model, instruction: str, max_new_tokens: int = 128):
    """
    Generate a reply from the model given a human instruction.
    We create a Guanaco-style prompt:
        "### Human: ...### Assistant:"
    and let the model continue.
    """
    prompt_text = build_prompt_for_inference(instruction)
    inputs = tokenizer(prompt_text, return_tensors="pt").to(cfg.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.pad_token_id,
        )

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text

[No output generated]
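Note that `generate_response` decodes the full sequence, so the returned text echoes the prompt before the model's continuation. If only the reply is wanted, one option is a small text-level post-processing step, sketched below. The helper name `extract_reply` and the sample strings are hypothetical. An alternative that avoids string matching would be to slice the generated token IDs by the prompt length before decoding.

```python
def extract_reply(decoded_text: str, prompt_text: str) -> str:
    """Return only the assistant's reply from a decoded generation."""
    # Drop everything up to and including the echoed prompt, if it is present.
    _, found, after = decoded_text.partition(prompt_text)
    reply = after if found else decoded_text
    # Cut at the next human turn in case the model keeps writing the dialogue.
    reply = reply.split("### Human:")[0]
    return reply.strip()

# Illustrative strings, not real model output.
prompt = "### Human: How do I build a PC?### Assistant:"
decoded = prompt + " Start by setting a budget.### Human: Thanks!"
reply = extract_reply(decoded, prompt)
print(reply)
```

This is a heuristic: it assumes the decoded text reproduces the prompt exactly, which can fail if decoding alters whitespace around special tokens.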

In [46]:
# Pick an example from the validation set
example = val_dataset[11]
example_text = example["text"]

# Crude split to get the human instruction and assistant answer, for display
if "### Human:" in example_text and "### Assistant:" in example_text:
    human_part = example_text.split("### Human:")[1].split("### Assistant:")[0].strip()
    assistant_part = example_text.split("### Assistant:")[1].strip()
else:
    human_part = example_text
    assistant_part = ""

print("### HUMAN (PROMPT) ###")
print(human_part)

print("\n### GROUND TRUTH ASSISTANT ###")
print(assistant_part)

print("\n### BASE MODEL ###")
print(generate_response(base_model, human_part))

print("\n### FINE-TUNED MODEL ###")
print(generate_response(ft_model, human_part))

### HUMAN (PROMPT) ###
How do I build a PC?

### GROUND TRUTH ASSISTANT ###
Building a PC can be a rewarding experience, but it requires some technical knowledge and patience. Here are the general steps you can follow:

1. Determine your budget and what you want to use the PC for (gaming, work, etc.). This will help you choose the appropriate components.

2. Choose the components you want to use (CPU, motherboard, RAM, storage, GPU, power supply, case, and peripherals).

3. Make sure all the components are compatible with each other. Check the manufacturer's website and online forums for compatibility information.

4. Gather the necessary tools (screwdriver, thermal paste, etc.).

5. Install the CPU onto the motherboard, following the manufacturer's instructions.

6. Install the RAM onto the motherboard, following the manufacturer's instructions.

7. Install the storage (hard drive or solid-state drive) onto the motherboard or into the case, following the manufacturer's instructions.

### BASE MODEL ###

### Human: How do I build a PC?### Assistant: Sure, let's break it down. Building a PC involves putting together different parts and components to create an overall system that meets your computing needs. Here are the steps involved:

1. Choose your computer hardware - You need to choose the motherboard, processor, RAM, graphics card, storage devices, power supply unit (PSU), and other essential components such as fans, heat sinks, and cables.

2. Install the motherboard - The motherboard is the brains of your PC, it controls everything from power supply, CPU, RAM, to video output. It connects all the other

### FINE-TUNED MODEL ###


### Human: How do I build a PC?### Assistant: To build a personal computer, you need to follow these steps:

1. Determine your budget and needs. Consider the type of operating system you want (Windows or Linux), storage capacity, processing power, graphics card, monitor, and other peripherals like keyboards, mice, speakers, and printers. 2. Choose a case and motherboard. A case is where your components go, while a motherboard is what connects them together. Check for compatibility with your chosen processor, RAM, and storage device. 3. Select a processor. There are many types of processors available depending


## Summary and comparison to manual PyTorch fine-tuning

In this notebook we:

1. Loaded the same pretrained Llama model and Guanaco dataset as in the manual PyTorch notebook.
2. Used `datasets.load_dataset` to load the dataset from the Hugging Face Hub into a `DatasetDict`.
3. Applied tokenization with `dataset.map`, creating `input_ids`, `attention_mask`, and `labels` fields.
4. Configured training behaviour via `TrainingArguments`.
5. Used `Trainer` to handle:
   - batching and shuffling,
   - gradient accumulation,
   - mixed precision (on GPU),
   - learning rate scheduling,
   - evaluation and checkpointing.
6. Saved the fine-tuned model and compared its outputs to the base model.

Conceptually, `Trainer` performs the same operations as the manual PyTorch training loop from the previous notebook:
forward pass, loss computation, backward pass, optimizer step, and scheduler step. The main difference is that these steps are now handled by a higher-level API, letting us focus on model, data, and hyperparameters rather than on training boilerplate.


In [None]:
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)