## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we will cover how to fine-tune large language models using adapter libraries such as `peft` and `adapters`, with `bitsandbytes` for loading large models in 8-bit or 4-bit.
The fine-tuning relies on a method called Low-Rank Adaptation (LoRA): instead of fine-tuning the entire model, you only train a small set of adapter weights and load them into the model.
After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!
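
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-augmented linear layer, assuming the usual formulation `h = W x + (alpha / r) * B (A x)` with `W` frozen and `B` initialized to zero. The helper names (`matvec`, `lora_forward`) and the toy shapes are illustrative assumptions, not part of any library.

```python
# Minimal LoRA sketch on plain Python lists (illustrative, not library code).

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy shapes: 3 outputs, 4 inputs, rank r = 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]               # frozen weight, 3 x 4
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]               # down-projection, r x 4
B_zero = [[0.0, 0.0] for _ in range(3)]  # up-projection, 3 x r, zero-initialized
x = [2.0, 3.0, 4.0, 5.0]

# With B = 0 the adapter is a no-op, so training starts from the base model.
y_start = lora_forward(W, A, B_zero, x, alpha=8, r=2)

# Once B is non-zero (hypothetical "trained" values), the output shifts.
B_trained = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.5]]
y_tuned = lora_forward(W, A, B_trained, x, alpha=8, r=2)

print(y_start)
print(y_tuned)
```

Only `A` and `B` would receive gradients, which is why the adapter is so much smaller than the layer it modifies.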

### Install requirements

First, uncomment and run the cell below to install the requirements. The later cells also import the `adapters` package, which is not in this list:

In [1]:
# !pip install -q bitsandbytes datasets accelerate loralib
# !pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
os.environ["WANDB_ENTITY"] = "clif"
os.environ["WANDB_PROJECT"] = "adapters"

### Model loading

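Here we load the `meta-llama/Meta-Llama-3-8B` model. With roughly 8 billion parameters, its weights take about 16GB in half precision (2 bytes per parameter). Loading the linear layers in 4-bit NF4, as configured below, brings the weights down to under 6GB according to the dtype breakdown printed later in this notebook.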
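
As a rough intuition for what low-bit loading does, the sketch below quantizes one block of weights to 4-bit signed integers with a single scale per block. This is a deliberately simplified uniform scheme and an assumption for illustration only: it is not the NF4 codebook that `bitsandbytes` uses, and the function names are hypothetical.

```python
# Simplified blockwise absmax quantization to 4-bit signed integers.
# This is NOT bitsandbytes' NF4 codebook; it only illustrates the idea of
# storing small integer codes plus one scale per block.

def quantize_block(values, bits=4):
    """Map floats to integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the block scale."""
    return [c * scale for c in codes]

block = [0.12, -0.70, 0.33, 0.05, -0.41, 0.68, -0.02, 0.25]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(round(scale, 4), round(max_error, 4))
```

Each code fits in 4 bits, so two of them can share one byte. The reconstruction error per weight is bounded by half the block scale in this uniform scheme.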

In [3]:
import torch
import torch.nn as nn
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

# model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model_name="meta-llama/Meta-Llama-3-8B"

config = BitsAndBytesConfig(
    # load_in_8bit=True,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.51s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Post-processing on the model

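Finally, we apply some post-processing to the quantized model to prepare it for training. We cast the small one-dimensional parameters (such as the normalization weights) to `float32` for stability, and we cast the output of the last layer to `float32` for the same reason. The explicit freezing line is left commented out here; the base weights are frozen later, when the adapter is activated for training.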
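
One way to see why `float32` is preferred for these small tensors is to look at the limits of half precision directly. The sketch below uses only the standard library's `struct` module, whose `"e"` format is IEEE half precision; the helper name `to_fp16` is an assumption for illustration, and the snippet does not touch the model.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Near 1.0, float16 values are spaced 2**-10 apart, so a small change vanishes.
weight = 1.0
small_update = 1e-4
after_update_fp16 = to_fp16(weight + small_update)

# float16 cannot represent values much above 65504.
try:
    to_fp16(70000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

print(after_update_fp16, overflowed)
```

A value of `1.0001` rounds back to `1.0` in half precision, and `70000.0` does not fit at all, which is the kind of precision and range loss the `float32` cast avoids.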

In [4]:
for param in model.parameters():
    # param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

# model.gradient_checkpointing_enable()  # reduce number of stored activations
# model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    """Wrap a module so that its output is always returned in float32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

# Hack to prevent HF Trainer from throwing an error due to peft missing.
# model._hf_peft_config_loaded = True

In [5]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll

### Apply LoRA

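Here comes the magic! The next cell keeps the `peft` route, a `LoraConfig` passed to the `get_peft_model` utility function, commented out for reference. The cells after it attach low-rank adapters (LoRA) with the `adapters` library instead: `adapters.init` adds adapter support to the model, `add_adapter` creates the LoRA weights, and `train_adapter` activates them for training.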

In [6]:
# from peft import LoraConfig, get_peft_model

# config = LoraConfig(
#     r=16,
#     lora_alpha=32,
#     # target_modules="all-linear",
#     lora_dropout=0.05,
#     bias="none",
#     task_type="CAUSAL_LM"
# )

# model = get_peft_model(model, config)
# model.print_trainable_parameters()

In [7]:
# type(model.base_model.h[0].attn.c_attn)

In [8]:
import adapters
from adapters import LoRAConfig, SeqBnConfig, PrefixTuningConfig, DoubleSeqBnConfig

adapters.init(model)

config = LoRAConfig(alpha=8, r=8, dropout=0.05)
# config = DoubleSeqBnConfig()
model.add_adapter("my_adapter", config=config)
model.train_adapter("my_adapter")

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
my_adapter               lora              3,407,872       0.085       1       1
--------------------------------------------------------------------------------
Full model                              4,015,263,744     100.000               0
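
The adapter row of this summary can be checked by hand from the layer shapes printed in the model summary earlier. The sketch below assumes that a rank-8 LoRA pair is attached to `q_proj` and `v_proj` in each of the 32 decoder layers; the function name `lora_params` is hypothetical.

```python
# Reproduce the adapter parameter count from the layer shapes printed earlier.
# Assumption: LoRA (rank r) is attached to q_proj and v_proj in every layer,
# each adding A (r x in_features) and B (out_features x r).

def lora_params(in_features, out_features, r):
    """Number of trainable parameters in one LoRA pair (A and B)."""
    return r * in_features + out_features * r

R = 8
NUM_LAYERS = 32
q_proj = lora_params(4096, 4096, R)   # query projection
v_proj = lora_params(4096, 1024, R)   # value projection

adapter_total = NUM_LAYERS * (q_proj + v_proj)
full_model = 4_015_263_744            # "Full model" row of the summary
percent = 100 * adapter_total / full_model

print(adapter_total, round(percent, 3))
```

Under these assumptions the arithmetic gives 3,407,872 parameters, or about 0.085% of the reported full-model count, matching the `my_adapter` row.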


In [9]:
# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

torch.float16 1050673152 0.23122166765671184
torch.uint8 3489660928 0.7679697704206956
torch.float32 3674112 0.0008085619225925903
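
These three buckets can be rebuilt from the shapes in the printed model, which also gives a rough estimate of the weight memory. The sketch below rests on two assumptions inferred from the numbers rather than stated by the libraries here: 4-bit weights are packed two per `uint8` element, and the `float16` bucket holds the input embedding plus the output head. It ignores quantization state and activations.

```python
# Rebuild the dtype breakdown from the shapes in the printed model.
# Assumption: 4-bit weights are packed two per uint8 element, and the float16
# bucket holds the input embedding plus the output head.

VOCAB, HIDDEN, KV, MLP, LAYERS = 128256, 4096, 1024, 14336, 32

linear_weights_per_layer = (
    2 * HIDDEN * HIDDEN      # q_proj, o_proj
    + 2 * HIDDEN * KV        # k_proj, v_proj
    + 3 * HIDDEN * MLP       # gate_proj, up_proj, down_proj
)
uint8_elements = LAYERS * linear_weights_per_layer // 2   # two 4-bit codes per byte
fp16_elements = 2 * VOCAB * HIDDEN                         # embed_tokens + lm_head

norm_elements = LAYERS * 2 * HIDDEN + HIDDEN               # RMSNorm weights
lora_elements = 3_407_872                                  # from adapter_summary()
fp32_elements = norm_elements + lora_elements

# Approximate weight memory: bytes per element times element count.
total_bytes = uint8_elements * 1 + fp16_elements * 2 + fp32_elements * 4
print(uint8_elements, fp16_elements, fp32_elements)
print(round(total_bytes / 1e9, 2), "GB")
```

The three element counts agree with the cell output above, and the weights come to roughly 5.6GB in total.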


In [12]:
model.model.layers[0]

LlamaDecoderLayerWithAdapters(
  (self_attn): LlamaSdpaAttentionWithAdapters(
    (q_proj): LoRALinear4bit(
      in_features=4096, out_features=4096, bias=False
      (loras): ModuleDict(
        (my_adapter): LoRA(
          (lora_dropout): Dropout(p=0.05, inplace=False)
        )
      )
    )
    (k_proj): LoRALinear4bit(
      in_features=4096, out_features=1024, bias=False
      (loras): ModuleDict()
    )
    (v_proj): LoRALinear4bit(
      in_features=4096, out_features=1024, bias=False
      (loras): ModuleDict(
        (my_adapter): LoRA(
          (lora_dropout): Dropout(p=0.05, inplace=False)
        )
      )
    )
    (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
    (rotary_emb): LlamaRotaryEmbedding()
    (prefix_tuning): PrefixTuningLayer(
      (prefix_gates): ModuleDict()
      (pool): PrefixTuningPool(
        (prefix_tunings): ModuleDict()
      )
    )
  )
  (mlp): LlamaMLP(
    (gate_proj): Linear4bit(in_features=4096, out_features=14336, 

In [11]:
for _, v in model.get_adapter("my_adapter").items():
    for _, module in v.items():
        module.to("cuda")

In [12]:
batch = tokenizer("Two things are infinite: ", return_tensors='pt')

with torch.cuda.amp.autocast():
    output = model(**batch)
    print(output)


CausalLMOutputWithPast(loss={'logits': tensor([[[ -0.4431,  -1.6270,   3.3535,  ...,  -0.6890,  -1.7949,   3.3359],
         [-13.4141, -13.2891,   1.1348,  ...,  -7.6680, -11.1172,  -7.2969],
         [-11.2266,  -9.5312,   5.4531,  ...,  -4.4219,  -5.6133,  -4.5703],
         ...,
         [-11.8281,  -9.7500,   5.1289,  ...,  -4.5977,  -4.7734,  -6.8945],
         [-14.5469, -15.2891,   1.6826,  ...,  -7.1641,  -8.1406,  -9.9688],
         [ -6.0859,  -5.7695,   4.6719,  ...,   3.4707,   0.7173,  -0.0970]]],
       grad_fn=<ToCopyBackward0>), 'past_key_values': ((tensor([[[[-4.6631e-01,  5.1483e-02,  3.4210e-02,  ...,  2.9236e-02,
           -1.1429e-02, -1.2732e-01],
          [-6.4660e-01,  3.1546e-01,  5.4163e-02,  ...,  4.6786e-01,
           -1.7191e-01,  6.0746e-01],
          [ 1.6061e-02, -1.2154e-01, -2.3361e-01,  ...,  1.5136e-01,
            1.0709e-01,  1.2181e-01],
          ...,
          [ 3.2064e-01, -7.1216e-02, -1.3133e-01,  ...,  3.4407e-01,
           -3.0703e-02

### Training

In [10]:
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
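
The tokenized rows only contain `input_ids` and `attention_mask`; the labels are created later by the `DataCollatorForLanguageModeling(tokenizer, mlm=False)` used in the training cell. The pure-Python sketch below approximates that behavior under stated assumptions: right padding, hypothetical token ids, and a hypothetical function name `collate_causal_lm`. It is not the library implementation.

```python
# Sketch of what a causal-LM collator does with mlm=False (illustrative only):
# pad to the longest sequence, copy input_ids to labels, and set padded
# positions to -100 so the loss ignores them.

IGNORE_INDEX = -100

def collate_causal_lm(sequences, pad_token_id):
    """Pad token-id lists on the right and build matching labels and masks."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        pad = [pad_token_id] * (max_len - len(seq))
        input_ids = seq + pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * len(pad))
        # Every occurrence of the pad id is masked, including a genuine EOS
        # when pad_token is set to eos_token.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        )
    return batch

# Hypothetical token ids; 2 stands in for the shared pad/EOS id.
batch = collate_causal_lm([[5, 6, 7, 2], [8, 9]], pad_token_id=2)
print(batch["labels"])
```

Because the notebook sets `tokenizer.pad_token = tokenizer.eos_token`, any EOS token present in a sequence would be excluded from the loss in the same way as padding.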

In [11]:
import transformers
from adapters import AdapterTrainer

trainer = AdapterTrainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        warmup_steps=10,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=50,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: clif. Use `wandb login --relogin` to force relogin


Step,Training Loss


TrainOutput(global_step=10, training_loss=1.6779350280761718, metrics={'train_runtime': 15.896, 'train_samples_per_second': 1.258, 'train_steps_per_second': 0.629, 'total_flos': 21353697755136.0, 'train_loss': 1.6779350280761718, 'epoch': 0.01})
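
The training arguments above interact in ways that are easy to miss, so the sketch below works through them. It assumes a single device and a linear warmup followed by linear decay; `linear_schedule_lr` is a simplified, hypothetical stand-in for the scheduler, not the object the trainer builds.

```python
# Sketch of a linear warmup / linear decay schedule and the effective batch
# size for the arguments above. The function is a simplified stand-in, not
# the Trainer's scheduler object.

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

PER_DEVICE_BATCH = 1
GRAD_ACCUM = 2
MAX_STEPS = 10
WARMUP_STEPS = 10
BASE_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM      # samples per optimizer step
samples_seen = effective_batch * MAX_STEPS           # on a single device

lrs = [linear_schedule_lr(s, BASE_LR, WARMUP_STEPS, MAX_STEPS) for s in range(MAX_STEPS)]
print(effective_batch, samples_seen)
print([f"{lr:.1e}" for lr in lrs])
```

The 20 samples are consistent with the reported 1.258 samples per second over 15.896 seconds. With `warmup_steps` equal to `max_steps`, the whole run falls inside the warmup phase in this sketch, so the learning rate never reaches `2e-4`. Since `logging_steps=50` exceeds `max_steps=10`, the step/loss table is empty.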

## Share adapters on the 🤗 Hub

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
# model.push_to_hub("ybelkada/opt-6.7b-lora", use_auth_token=True)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
# import torch
# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer

# peft_model_id = "ybelkada/opt-6.7b-lora"
# config = PeftConfig.from_pretrained(peft_model_id)
# model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
# tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# # Load the Lora model
# model = PeftModel.from_pretrained(model, peft_model_id)

## Inference

You can then use the trained model, or the model that you have loaded from the 🤗 Hub, for inference as you usually would in `transformers`.

In [11]:
batch = tokenizer("Two things are infinite: ", return_tensors='pt')

with torch.cuda.amp.autocast():
    output = model(**batch)
    print(output)

# model.eval()
# with torch.cuda.amp.autocast():
#     output_tokens = model.generate(**batch, max_new_tokens=50)

# print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

CausalLMOutputWithPast(loss={'logits': tensor([[[-0.0551, -0.0559, -0.0935,  ..., -0.1272,  0.0433,  0.0605],
         [-0.0678, -0.0016, -0.0130,  ..., -0.0121, -0.0883, -0.0629],
         [-0.1071, -0.0006,  0.1290,  ...,  0.0263, -0.1172, -0.0187],
         ...,
         [-0.1353,  0.0658, -0.0550,  ...,  0.0662, -0.0860, -0.1201],
         [-0.0228, -0.0210,  0.1512,  ...,  0.0680, -0.0450, -0.0954],
         [-0.1010, -0.0561,  0.0629,  ..., -0.0280, -0.0747, -0.0199]]]), 'past_key_values': ((tensor([[[[ 0.0380,  0.0311,  0.1315, -0.0505],
          [ 0.0258,  0.1482, -0.0894, -0.0649],
          [ 0.0279,  0.1357, -0.2384, -0.0299],
          [-0.0515, -0.0163,  0.0718, -0.1258],
          [ 0.0900, -0.0170,  0.0889, -0.1793],
          [-0.0655,  0.0399,  0.0550,  0.0493],
          [-0.1348,  0.0595, -0.0254,  0.0206]],

         [[ 0.0635, -0.1151,  0.0024,  0.0004],
          [-0.0359,  0.0445,  0.1114,  0.0256],
          [-0.1259,  0.0241, -0.0282,  0.0881],
          [ 0.1

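The cell above only prints the raw model outputs; uncomment the `generate` lines to decode text. By fine-tuning for a few steps, the model should come close to recovering the quote from Albert Einstein that is present in the [training data](https://huggingface.co/datasets/Abirate/english_quotes).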
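
To connect the printed logits to generated text: the logits at the last input position score every vocabulary entry as the next token. The sketch below shows one greedy decoding step on a hypothetical 5-token vocabulary. The function names and the logit values are made up for illustration, and `generate` does considerably more (sampling options, stopping criteria, caching).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(last_position_logits):
    """Return (token_id, probability) of the highest-scoring token."""
    probs = softmax(last_position_logits)
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, probs[token_id]

# Hypothetical logits over a 5-token vocabulary for the last input position.
toy_logits = [-6.1, -5.8, 4.7, 3.5, 0.7]
token_id, prob = greedy_next_token(toy_logits)
print(token_id, round(prob, 3))
```

Repeating this step, appending the chosen token to the input each time, is the essence of greedy generation.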