### Install requirements

First, run the cells below to install the requirements:

## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we will cover how to fine-tune large language models using the recent `peft` library, with `bitsandbytes` for loading large models in 8-bit.
The fine-tuning relies on a recent method called Low-Rank Adaptation (LoRA): instead of fine-tuning the entire model, you only train a small set of adapter weights and load them into the model.
After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!
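To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass. It is an illustration only, not how `peft` implements it: the helper names (`matvec`, `lora_forward`) and the toy dimensions are assumptions chosen for readability. The frozen weight `W` is left untouched and only the two small matrices `A` and `B` would be trained.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 16, 16, 2, 8  # toy sizes, hypothetical

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero the adapter starts as a no-op.
starts_as_noop = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out        # parameters in the frozen weight
lora_params = r * (d_in + d_out)  # parameters in A and B combined
print(starts_as_noop, full_params, lora_params)
```

Because the rank `r` is much smaller than the layer width, the adapter holds far fewer parameters than the weight matrix it modifies.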

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q  git+https://github.com/zphang/peft.git@llama
!pip install -q git+https://github.com/zphang/transformers.git@llama_push


[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.0+cu116 requires torch==1.12.0, but you have torch 1.13.1 which is incompatible.
torchaudio 0.12.0+cu116 requires torch==1.12.0, but you have torch 1.13.1 which is incompatible.[0m[31m
[0m

### Model loading

Here let's load the `llama-7b-hf` model. Its weights in half-precision (float16) are about 13GB on the Hub! If we load them in 8-bit we need around 7GB of memory instead.
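The memory figures above follow from simple arithmetic on the parameter count. The sketch below assumes a round figure of 6.7 billion parameters and counts only the weights (activations, optimizer state and quantization overhead are ignored); `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 6.7e9  # assumed round figure for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # float16: 2 bytes per parameter
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # int8: 1 byte per parameter

print(f"float16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```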

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import LLaMAForCausalLM, LLaMATokenizer, DataCollatorForSeq2Seq, TrainingArguments, Trainer

torch.backends.cuda.matmul.allow_tf32 = True

model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...



normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


Embedding(32001, 4096)

### Post-processing on the model

Next, we need to apply some post-processing to the 8-bit model to enable training: we freeze all layers and cast the small 1-D parameters (such as the normalization weights) to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [3]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

### Apply LoRA

Here comes the magic with `peft`! Let's wrap the model in a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) via the `get_peft_model` utility function from `peft`.

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4194304 || all params: 6742618112 || trainable%: 0.06220586618327525
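The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes that the adapters are attached to two 4096×4096 projections in each of 32 decoder layers; those architecture details are not stated in this notebook and are assumptions that happen to match the printed number. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(num_layers, modules_per_layer, d_in, d_out, r):
    """Each adapted module adds A (r x d_in) and B (d_out x r)."""
    return num_layers * modules_per_layer * r * (d_in + d_out)

# Assumed architecture: 32 layers, 2 adapted projections per layer, width 4096.
trainable = lora_param_count(num_layers=32, modules_per_layer=2,
                             d_in=4096, d_out=4096, r=8)

total = 6742618112  # "all params" as printed by print_trainable_parameters
percent = 100 * trainable / total

print(trainable, percent)
```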


In [6]:
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    max_steps=200,
    learning_rate=1e-4,
    tf32=True,
    logging_steps=1,
    output_dir='outputs'
)

decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

# 8-bit Adam from bitsandbytes, configured from the TrainingArguments defaults
adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)

In [7]:
def get_max_column_length(tokenizer, dataset,column_name):
    tokenized_inputs = (dataset).map(lambda x: tokenizer(x[column_name], truncation=True, padding='max_length'), batched=True)
    max_source_length = max([len(x['input_ids']) for x in tokenized_inputs])
    return max_source_length


def preprocess(sample, max_source_length, max_target_length, tokenizer, input_column_name, target_column_name, padding="max_length"):
    # collect the prompt strings for this batch
    inputs = [item for item in sample[input_column_name]]
    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    # tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(sample[target_column_name], max_length=max_target_length, padding=padding, truncation=True)
    # set labels
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def load_prompt_file(filepath):
    # read the whole prompt template, closing the file afterwards
    with open(filepath, 'r') as f:
        return f.read()




In [8]:
def generate_training_prompt(item, prompt_a, prompt_b):
    instruction = item['instruction']
    input_text = item['input']  # avoid shadowing the built-in input()

    # if the item has an input use prompt_a, else use prompt_b
    if len(input_text) > 0:
        return prompt_a.format(instruction, input_text)
    else:
        return prompt_b.format(instruction)

def generate_json(item, prompt_a, prompt_b):
    return {
        "text": generate_training_prompt(item, prompt_a, prompt_b),
        "target": item['output']
    }

prompt_with_input = load_prompt_file('prompt_with_input.txt')
prompt_without_input = load_prompt_file('prompt_without_input.txt')
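The contents of the two prompt files are not shown in this notebook. Because the code above fills them with positional `str.format` arguments, each template needs one `{}` per argument. The sketch below uses made-up template text (an assumption, not the actual files) and a hypothetical `build_prompt` helper that mirrors the selection logic.

```python
# Hypothetical template text; the real prompt files may differ.
PROMPT_WITH_INPUT = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n"
PROMPT_WITHOUT_INPUT = "### Instruction:\n{}\n\n### Response:\n"

def build_prompt(item, prompt_a, prompt_b):
    """Use the two-slot template when the item carries a non-empty input."""
    if len(item["input"]) > 0:
        return prompt_a.format(item["instruction"], item["input"])
    return prompt_b.format(item["instruction"])

with_input = build_prompt(
    {"instruction": "Translate to French.", "input": "Good morning"},
    PROMPT_WITH_INPUT, PROMPT_WITHOUT_INPUT,
)
without_input = build_prompt(
    {"instruction": "Name three colours.", "input": ""},
    PROMPT_WITH_INPUT, PROMPT_WITHOUT_INPUT,
)

print(with_input)
print(without_input)
```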



In [9]:
import json
from datasets import Dataset
alpaca_dataset_path = 'alpaca_data.json'

def load_json_file(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)


json_objects = load_json_file(alpaca_dataset_path)
print(len(json_objects))
mapped_json_objects = [generate_json(item,prompt_with_input,prompt_without_input) for item in json_objects]
mapped_data = {'text': [], 'target': []}
for example in mapped_json_objects:
    mapped_data['text'].append(example['text'])
    mapped_data['target'].append(example['target'])

dataset = Dataset.from_dict(mapped_data)
print(dataset)

52002
Dataset({
    features: ['text', 'target'],
    num_rows: 52002
})


In [14]:
label_pad_token_id = tokenizer.pad_token_id

In [15]:
max_source_length = min([512, get_max_column_length(tokenizer, dataset, 'text')])
max_target_length = min([512, get_max_column_length(tokenizer, dataset, 'target')])
input_column_name = 'text'
target_column_name = 'target'

print("Max Source Length: {}\n Max Target Length: {}".format(max_source_length, max_target_length))

# preprocess dataset
tokenized_dataset = dataset.map(lambda x: preprocess(x, max_source_length, max_target_length, tokenizer, input_column_name, target_column_name), batched=True)

# the collator pads labels with label_pad_token_id
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=label_pad_token_id)


Max Source Length: 512
 Max Target Length: 512



In [18]:
print(tokenized_dataset)

Dataset({
    features: ['text', 'target', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 52002
})


### Training

In [30]:



trainer = Trainer(
    model=model, 
    train_dataset=tokenized_dataset,
    optimizers=(adam_bnb_optim, None),
    args=TrainingArguments(
        hub_model_id="epinnock/llama7b-lora",
        gradient_checkpointing=True,
        per_device_train_batch_size=8, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=1563, 
        learning_rate=1e-4, 
        tf32=True,
        bf16=True,
        logging_steps=100, 
        group_by_length=True,
        save_steps=500,
        push_to_hub=True,
        output_dir='outputs'
    ),
    data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

ValueError: Token is required (write-access action) but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.
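As a rough check on the training arguments above, the sketch below works out the effective batch size and how much of the dataset `max_steps=1563` covers. It assumes a single GPU (consistent with `CUDA_VISIBLE_DEVICES="0"` set earlier); the variable names are illustrative.

```python
per_device_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1          # assumption: one visible GPU
max_steps = 1563
dataset_size = 52002     # num_rows reported for the tokenized dataset

# Samples consumed per optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Total samples seen during training and the fraction of one epoch this represents
samples_seen = max_steps * effective_batch_size
epochs = samples_seen / dataset_size

print(effective_batch_size, samples_seen, round(epochs, 2))
```

Under these assumptions the run covers slightly less than one full pass over the data.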

## Share adapters on the 🤗 Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model.push_to_hub("epinnock/llama7b-lora", use_auth_token=True)


## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "epinnock/llama7b-lora"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...





## Inference

You can then use the trained model, or the model you have loaded from the 🤗 Hub, for inference as you usually would in `transformers`.

In [33]:
batch = tokenizer("List 3 ways to make more money: ", return_tensors='pt').to('cuda')
with torch.no_grad():
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id,  
            no_repeat_ngram_size=2,       
            do_sample=True, 
            top_k=3, 
            top_p=0.7,
            temperature=0.8)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

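The sampling call above failed with a `RuntimeError` (the probability tensor contained `inf`, `nan` or negative values), so this run produced no generated text. With a successfully fine-tuned adapter, the model would be expected to answer the prompt in the style of the Alpaca instruction data used for training.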