# Training generative models

<p align="center">
  <a href="https://colab.research.google.com/github/auduvignac/llm-finetuning/blob/main/notebooks/corrections/training_tinyllama.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/>
  </a>
</p>

This notebook gives a quick demonstration of how to train large generative models.

It also aims to show that many pre-built methods exist to automate common tasks.

## Download requirements

First, install the libraries needed to load and train quantized models.

In [1]:
pip install -U accelerate==1.8.1 peft==0.15.0 bitsandbytes==0.46.0 transformers==4.52.4 datasets

Collecting accelerate==1.8.1
  Downloading accelerate-1.8.1-py3-none-any.whl.metadata (19 kB)
Collecting peft==0.15.0
  Downloading peft-0.15.0-py3-none-any.whl.metadata (13 kB)
Collecting bitsandbytes==0.46.0
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting transformers==4.52.4
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Downloading accelerate-1.8.1-py3-none-any.whl (365 kB)
Downloading peft-0.15.0-py3-none-any.whl (410 kB)
Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl (67.0 MB)
Downloading transformers-4.52.4-py3-none-an

## Load a generative model

We are going to load a generative model. To make sure everything fits on a Colab GPU, we load the weights with 4-bit quantization.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit NF4 quantization; computations are carried out in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantized weights on the available GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

In [3]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), e
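
The shapes in this printout are enough to estimate the size of the model. The following is a rough sketch rather than an exact accounting: it assumes that the output head, which the truncated printout does not show, is an untied 32000 × 2048 linear layer, and it ignores quantization constants as well as the fact that embeddings and norms are not quantized.

```python
# Shapes read from the printed architecture above
vocab, hidden, kv_dim, mlp_dim, n_layers = 32000, 2048, 256, 5632, 22

attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * mlp_dim                             # gate_proj, up_proj, down_proj
norms = 2 * hidden                                     # two RMSNorm weight vectors
per_layer = attention + mlp + norms

embedding = vocab * hidden
lm_head = vocab * hidden  # ASSUMPTION: untied output head, not visible in the printout
total_params = embedding + n_layers * per_layer + hidden + lm_head


def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


print(f"parameters: {total_params:,}")
for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_gb(total_params, bits):.2f} GB")
```

Under these assumptions, the float32 figure is consistent with the 4.40G `model.safetensors` download shown earlier, and the 4-bit figure shows why quantization helps on a small GPU.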

🚧 **TODO** 🚧

Experiment with the generation of the model.

In [4]:
text = "Paris is the capital of"
DEVICE = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
# Sample up to 200 new tokens at temperature 1.0
outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0]))

<s> Paris is the capital of France, one is drawn to the city's architecture, history and the unique cultural mix between old and new. The French capital is a melting pot of contrasts, yet you'll be amazed by the glamour of the Eiffel Tower and Opéra Garnier.
From the frenetic vibrancy of Montmartre and Marais to the bohemian Bohemian café of Montparnasse, you'll see what Paris is truly about – the energy and the culture.
Paris is famous for all things romantic. You'll explore its romantic past and present, which includes the Left Bank and Montmartre for a glimpse of the legendary Moulin Rouge nightclub. The city offers many options for lovers of beautiful places to stay. Choose from 5-star accommodation from hotels, apartments, bedrooms, suites, bed and breakfasts, boutique hotels
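
The `temperature` and `do_sample` arguments control how each next token is drawn. As a minimal sketch of the idea (not the actual `transformers` implementation), temperature divides the logits before the softmax. The logits below are made-up values for three hypothetical candidate tokens.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three hypothetical candidate tokens
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling: draw one token index according to the probabilities
random.seed(0)
probs = softmax_with_temperature(logits, 1.0)
token_index = random.choices(range(len(logits)), weights=probs, k=1)[0]
print("sampled index:", token_index)
```

Lower temperatures concentrate probability on the most likely token, while higher ones flatten the distribution and make the generated text more varied.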


## Training

First, load an instruction dataset.

In [5]:
from datasets import load_dataset

def process_data(sample):
    # Tokenize the already formatted prompt and response stored in the "text" column
    return tokenizer(sample["text"])

data = load_dataset("tatsu-lab/alpaca")
data = data.map(process_data, batched=True)

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001-a09b74b3ef9c3b(…):   0%|          | 0.00/24.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

In [6]:
data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
        num_rows: 52002
    })
})

In [7]:
print(data["train"][0]["text"])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
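
The `text` column already contains the full prompt, so no extra formatting is needed before tokenization. For reference, the layout printed above can be rebuilt as follows. This is only a sketch for examples without an `input` field, and `format_alpaca` is a hypothetical helper, not part of `datasets`.

```python
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_alpaca(instruction, response=""):
    """Hypothetical helper: rebuild the prompt layout shown in the printout above."""
    return (
        f"{PROMPT_HEADER}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )


# Leaving the response empty gives a prompt that the model can complete
print(format_alpaca("Give three tips for staying healthy."))
```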


Since this model is quite big, we are going to reduce the number of trainable parameters.

To do that, we use **Parameter Efficient Fine-Tuning** (PEFT).

In PEFT, we do not tune the full model. Only a small subset of the parameters is trained, while the others are frozen (i.e., not updated during training).

🚧 **TODO** 🚧

Explain why PEFT reduces the memory cost.

In **full fine-tuning**, all model parameters are updated. This requires storing:

* the model weights,
* gradients for each parameter,
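* optimizer states (Adam keeps two extra values per parameter, a momentum and a variance estimate, so they take about twice the memory of the weights at the same precision).

As a result, the total memory footprint can reach **3–4× the size of the model**, before counting activations.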

In **parameter-efficient fine-tuning**, such as **LoRA** or **adapters**, the **original model weights are frozen** and only small additional modules are trained. This means:

* no gradients or optimizer states are stored for the frozen weights,
* only a tiny fraction of the parameters (often less than 5%, and about 0.25% in this notebook) requires training memory.

Consequently, PEFT drastically reduces GPU memory usage, making fine-tuning large language models feasible on much smaller hardware.
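
To make this concrete, here is a back-of-the-envelope estimate. It is a sketch under simplifying assumptions: about 1.1 billion parameters (rounded from the model name), float32 weights, gradients and Adam states for every trained parameter, 4-bit frozen weights in the PEFT case, and no activation memory. The trainable-parameter count is the one reported for the LoRA setup later in this notebook.

```python
GB = 1e9
n_params = 1.1e9         # ASSUMPTION: rounded from the "1.1B" in the model name
n_trainable = 1_531_904  # LoRA parameters reported later in this notebook

# Bytes per trained parameter in float32: weight + gradient + two Adam states
bytes_per_trained_param = 4 + 4 + 2 * 4

# Full fine-tuning: every parameter is trained
full_ft_gb = n_params * bytes_per_trained_param / GB

# PEFT: frozen base weights in 4 bits (0.5 byte each), only the adapters are trained
frozen_gb = n_params * 0.5 / GB
adapters_gb = n_trainable * bytes_per_trained_param / GB
peft_gb = frozen_gb + adapters_gb

print(f"full fine-tuning: {full_ft_gb:.1f} GB")
print(f"PEFT (4-bit base + LoRA): {peft_gb:.2f} GB")
```

Real usage is higher in both cases once activations are included, but the gap between the two estimates is what matters here.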


In [8]:
from peft import prepare_model_for_kbit_training

# Prepare the quantized model for training: freeze the base weights, upcast the remaining
# half-precision parameters (e.g. layer norms) to float32, and enable gradient checkpointing
model = prepare_model_for_kbit_training(model)

In [9]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.

    Note: 4-bit layers store two weights per element, so numel() reports
    half of their logical parameter count and "all params" is an undercount.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

The cell below adds a small number of extra parameters to the model using the LoRA technique, described here: https://arxiv.org/abs/2106.09685

In [10]:
from peft import LoraConfig, get_peft_model

# LoRA config: rank-8 adapters on the attention query, key and value projections
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 1531904 || all params: 617138176 || trainable%: 0.2482270680334642
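
Both numbers can be reproduced by hand from the layer shapes printed earlier. The sketch below assumes that each LoRA adapter adds two matrices of shapes `r × in_features` and `out_features × r`, that 4-bit layers store two weights per element (so `numel()` counts half of them), and that the output head, cut off in the model printout, is an untied 32000 × 2048 layer.

```python
r, n_layers = 8, 22
hidden, kv_dim, mlp_dim, vocab = 2048, 256, 5632, 32000


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * in_features + out_features * rank


# Adapters on q_proj, k_proj and v_proj in every decoder layer
per_layer_lora = (
    lora_params(hidden, hidden, r)
    + lora_params(hidden, kv_dim, r)
    + lora_params(hidden, kv_dim, r)
)
trainable = n_layers * per_layer_lora

# Frozen base model as seen by numel(): 4-bit linear weights are packed two per element
linear_per_layer = 2 * hidden * hidden + 2 * hidden * kv_dim + 3 * hidden * mlp_dim
packed_linear = n_layers * linear_per_layer // 2
norms = n_layers * 2 * hidden + hidden
embeddings = 2 * vocab * hidden  # embed_tokens + ASSUMED untied lm_head
all_params = packed_linear + norms + embeddings + trainable

print(f"trainable params: {trainable} || all params: {all_params}")
```

This is why "all params" is about 617M rather than 1.1B. The printed percentage is relative to the packed count; relative to the roughly 1.1 billion logical parameters, the trainable share is closer to 0.14%.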


We now use the Hugging Face `Trainer` class, which runs the training loop for us. For a simple setup such as this single-GPU notebook, it can be very effective.

In [11]:
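import transformers

# Make sure a padding token is set; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=16,
        gradient_accumulation_steps=1,
        max_steps=100,  # short demo run, far less than one epoch
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",  # paged 8-bit AdamW from bitsandbytes
        report_to="none",
    ),
    # mlm=False: causal language modeling, labels are copied from input_ids
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # the KV cache is not needed during training
trainer.train()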

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  return fn(*args, **kwargs)


| Step | Training Loss |
|------|---------------|
| 1 | 1.9273 |
| 2 | 2.0041 |
| 3 | 2.0764 |
| 4 | 1.8546 |
| 5 | 1.9452 |
| 6 | 1.881 |
| 7 | 1.8819 |
| 8 | 1.8382 |
| 9 | 1.6867 |
| 10 | 1.5169 |


TrainOutput(global_step=100, training_loss=1.2164075750112533, metrics={'train_runtime': 265.4437, 'train_samples_per_second': 6.028, 'train_steps_per_second': 0.377, 'total_flos': 2931690377576448.0, 'train_loss': 1.2164075750112533, 'epoch': 0.030759766225776683})
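
The metrics in `TrainOutput` can be cross-checked with a little arithmetic. This sketch assumes a single GPU (so the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps`) and that the epoch counter is the number of steps divided by the number of batches per epoch, with the last partial batch kept.

```python
import math

num_rows = 52002          # size of the train split
batch_size = 16 * 1       # per-device batch size x gradient accumulation steps
max_steps = 100
train_runtime = 265.4437  # seconds, from TrainOutput

samples_seen = max_steps * batch_size
steps_per_epoch = math.ceil(num_rows / batch_size)  # ASSUMPTION: last partial batch kept
epoch_fraction = max_steps / steps_per_epoch

print(f"samples seen: {samples_seen}")
print(f"epoch fraction: {epoch_fraction:.6f}")
print(f"samples/s: {samples_seen / train_runtime:.3f}")
print(f"steps/s: {max_steps / train_runtime:.3f}")
```

The run therefore covers only about 3% of the dataset, which is enough for a demonstration but far from a full epoch.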

In [12]:
text = "### Instruction:\nPropose an outdoor activity.\n\n### Response:\n"
inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
model.config.use_cache = True

outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0]))

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


<s> ### Instruction:
Propose an outdoor activity.

### Response:
I would like to propose an outdoor activity for my family this summer. We would go skimming at the lake. It is a great way to spend time outside together, and this would allow for a wonderful and fun experience for my family. The lake is an ideal location for this activity, it is a wide body of water, allowing for ample space for all members of our family, and it also provides the perfect opportunity to explore and learn new water skills, making the activity both educational and fun. Our family can try and perfect the skill of skimming at the lake, it is a very rewarding and fun water activity that we can all participate in together. This summer, I would like to invite all my family members and friends over for a fun day of outdoor adventure. With my family over, I would like to spend the day taking the time to learn new water skills, skim, swim, and explore our family's new water adventure.
