This notebook demonstrates how to fine-tune Llama 2 on Guanaco with TRL.
More details about the procedure here: https://kaitchup.substack.com/p/fine-tune-llama-2-on-your-computer

First, we need all these dependencies:

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U einops


Clone the model repository locally.

In [None]:
# Replace HF_TOKEN with your Hugging Face access token.
# Leave hf_user unchanged.
!git clone https://hf_user:HF_TOKEN@huggingface.co/meta-llama/Llama-2-7b-hf

Cloning into 'Llama-2-7b-hf'...
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 73 (delta 26), reused 0 (delta 0), pack-reused 17
Unpacking objects: 100% (73/73), 977.72 KiB | 2.79 MiB/s, done.
Filtering content: 100% (6/6), 9.10 GiB | 7.13 MiB/s, done.
Encountered 2 file(s) that may not have been copied correctly on Windows:
	pytorch_model-00001-of-00002.bin
	model-00001-of-00002.safetensors

See: `git lfs help smudge` for more details.


Import all the necessary packages.

In [None]:
import torch
from datasets import load_dataset
# prepare_model_for_kbit_training is called further down, so it must be imported here
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    GenerationConfig
)

from trl import SFTTrainer

Load the tokenizer and extend its vocabulary with a special token for padding.

In [None]:
model_name = "Llama-2-7b-hf"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
#Create a new token and add it to the tokenizer
tokenizer.add_special_tokens({"pad_token":"<pad>"})
tokenizer.padding_side = 'left'

1
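To see what the padding token and `padding_side = 'left'` do to a batch, here is a minimal pure-Python sketch. It does not call the tokenizer: `left_pad` is a hypothetical helper, the token IDs are made up, and the pad ID of 32000 assumes that `<pad>` is appended right after the 32,000 entries of the original Llama 2 vocabulary.

```python
def left_pad(batch, pad_id):
    """Pad every sequence on the left to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Padding goes in front, so the real tokens stay at the end of the row
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


PAD_ID = 32000  # assumption: ID given to the newly added <pad> token
batch = [[1, 15043, 3186], [1, 15043]]  # made-up token IDs
ids, mask = left_pad(batch, PAD_ID)
print(ids)
print(mask)
```

The attention mask is what tells the model to ignore the padded positions.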

Load the Guanaco dataset.

In [None]:
dataset = load_dataset("timdettmers/openassistant-guanaco")


Set up the quantization hyperparameters, load the model, resize the embeddings to account for the new vocabulary size, and then define the LoRA config.

In [None]:
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0}
)

#Resize the embeddings
model.resize_token_embeddings(len(tokenizer))
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching

model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        lora_alpha=32,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ["q_proj","v_proj"]
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
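As a sanity check on the LoRA config above, the number of trainable parameters can be worked out by hand. This is a back-of-the-envelope sketch, not something PEFT runs: `lora_param_count` is a hypothetical helper, and the hidden size (4096) and layer count (32) are assumptions about the Llama 2 7B architecture in which `q_proj` and `v_proj` are both 4096 x 4096 projections.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


HIDDEN_SIZE = 4096   # assumption: Llama 2 7B hidden size
NUM_LAYERS = 32      # assumption: Llama 2 7B decoder layers
TARGET_MODULES = ["q_proj", "v_proj"]  # from the LoraConfig above
R = 8                # from the LoraConfig above
LORA_ALPHA = 32      # from the LoraConfig above

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"{per_module:,} parameters per adapted module")
print(f"{total:,} trainable parameters in total")
print(f"scaling factor: {scaling}")
```

Under these assumptions the total matches the "Number of trainable parameters" line that the trainer logs during training.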

For training, I used the following hyperparameters. They are set for a quick test run. Once you have confirmed that the code works, replace them with the values given in the comments for your final training.

In [None]:
training_arguments = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=4,
        log_level="debug",
        optim="paged_adamw_32bit",
        save_steps=2, #change to 500
        logging_steps=1, #change to 100
        learning_rate=1e-4,
        eval_steps=5, #change to 200
        fp16=True,
        max_grad_norm=0.3,
        #num_train_epochs=3, # remove "#"
        max_steps=10, #remove this
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
)

Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
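To get a feel for what the full run involves once `max_steps` is removed, the number of optimization steps can be estimated from the batch settings. This is a rough sketch: it assumes a single GPU, uses the 9,846 training examples that the trainer reports for this dataset, and approximates the trainer's own bookkeeping rather than reproducing it.

```python
import math

num_train_examples = 9846        # size of the Guanaco training split
per_device_train_batch_size = 4  # from TrainingArguments
gradient_accumulation_steps = 1  # from TrainingArguments
num_devices = 1                  # assumption: one GPU
num_train_epochs = 3             # the commented-out value
save_steps = 500                 # the "change to" value
eval_steps = 200                 # the "change to" value

effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"periodic checkpoints: {total_steps // save_steps}")
print(f"evaluation passes:    {total_steps // eval_steps}")
```

Each evaluation pass goes through the whole test split, so lowering `eval_steps` makes the run noticeably longer.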


Now run the actual training. Validation may take up to 10 minutes.

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
The model is loaded in 8-bit precision. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check  the examples in https://github.com/huggingface/peft for more details.
max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 10
  Number of trainable parameters = 4,194,304


Step  Training Loss  Validation Loss
5     1.8423         1.522711
10    1.3339         1.451995


Saving model checkpoint to ./results/checkpoint-2
tokenizer config file saved in ./results/checkpoint-2/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-4
tokenizer config file saved in ./results/checkpoint-4/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-4/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./results/checkpoint-6
tokenizer config file saved in ./results/checkpoint-6/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-6/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-8
tokenizer config file saved in ./results/checkpoint-8/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-8/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./results/checkpoint-10
tok

TrainOutput(global_step=10, training_loss=1.539897632598877, metrics={'train_runtime': 607.8486, 'train_samples_per_second': 0.066, 'train_steps_per_second': 0.016, 'total_flos': 411868669476864.0, 'train_loss': 1.539897632598877, 'epoch': 0.0})
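The logged numbers are easier to interpret after a little arithmetic. The sketch below turns the validation losses into perplexities, assuming the loss is the mean per-token cross-entropy in nats, and derives the average wall-clock time per step from `train_runtime`. The values are copied from the output above.

```python
import math

# (step, training loss, validation loss) copied from the table above
eval_log = [(5, 1.8423, 1.522711), (10, 1.3339, 1.451995)]

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = {step: math.exp(val_loss) for step, _, val_loss in eval_log}
for step, ppl in perplexity.items():
    print(f"step {step}: validation perplexity = {ppl:.2f}")

train_runtime = 607.8486  # seconds, from TrainOutput
global_step = 10          # from TrainOutput
seconds_per_step = train_runtime / global_step
print(f"average time per step: {seconds_per_step:.1f} s")
```

The per-step average is only a coarse figure for this test run, since it spreads the reported runtime over just 10 steps.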

Testing inference with the last adapter saved during training.

In [None]:
model = PeftModel.from_pretrained(model, "./results/checkpoint-10")

def generate(instruction):
    prompt = "### Human: "+instruction+"### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
            input_ids=input_ids,
            generation_config=GenerationConfig(temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256
    )
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        print(output.split("### Assistant: ")[1].strip())
generate("Tell me about gravitation.")

🤖 Gravitation is the force that pulls all objects towards each other in the universe. It is a fundamental force of nature that is responsible for holding the planets in orbit around the sun, keeping the moon in orbit around the earth, and keeping the earth in orbit around the sun. Gravitation is a long-range force, meaning that it can act over long distances and can be felt even when the objects are far apart. Gravitation is also a universal force, meaning that it is present in all parts of the universe and is the same for all objects. Gravitation is a weak force, meaning that it is not as strong as the other fundamental forces, such as electromagnetism and the strong nuclear force. Gravitation is also a conservative force, meaning that it only acts on objects that are moving and does not cause objects to move. Gravitation is a central force, meaning that it acts on the center of mass of an object and not on individual particles. Gravitation is also a universal force, meaning that it i
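The prompt handling in `generate` can be isolated from the model, which makes it easy to check. The sketch below uses the same `### Human:` / `### Assistant:` markers. `build_prompt` and `extract_reply` are hypothetical helpers, and the decoded string is made up to mimic a model that keeps writing into a new human turn. Cutting the reply at the next `### Human:` marker is an addition of this sketch, not something the cell above does.

```python
HUMAN = "### Human: "
ASSISTANT = "### Assistant: "


def build_prompt(instruction):
    """Wrap an instruction in the Human/Assistant markers used in the cell above."""
    return HUMAN + instruction + ASSISTANT


def extract_reply(decoded):
    """Keep only the first assistant turn of a decoded sequence."""
    # Everything after the first assistant marker is the model's reply
    reply = decoded.split(ASSISTANT, 1)[1]
    # Drop any follow-up human turn that the model generated on its own
    reply = reply.split(HUMAN.strip(), 1)[0]
    return reply.strip()


prompt = build_prompt("Tell me about gravitation.")
# Made-up decoded output: prompt, a reply, then an unwanted extra turn
decoded = prompt + "Gravitation is a force.### Human: Thanks!"
print(extract_reply(decoded))
```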