# Finetuning Llama 2 with _Adapters_ and QLoRA

In this notebook, we show how to efficiently fine-tune a quantized 7B Llama 2 model using [**QLoRA** (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314) and the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library.

Specifically, we finetune Llama 2 on a supervised **instruction tuning** dataset collected by the [Open Assistant project](https://github.com/LAION-AI/Open-Assistant) for training chatbot models. This is similar to the setup used to train the Guanaco models in the QLoRA paper.

## Installation

Besides `adapters`, we require `bitsandbytes` for quantization, `accelerate` for training, and `datasets` for loading the training data.

In [1]:
# !pip install -qq -U adapters accelerate bitsandbytes datasets

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Load Open Assistant dataset

We use the [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset published by the QLoRA authors. It contains a small subset of conversations from the full Open Assistant database and was also used to finetune the Guanaco models in the QLoRA paper.

In [3]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")



The training split has roughly 10k samples, and the test split has about 500:

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [5]:
print(dataset["train"][0]["text"])

### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading

## Load and prepare model and tokenizer

We download the official Llama 2 7B checkpoint from the Hugging Face Hub (**Note:** You must request access to this model on the Hugging Face website and use an API token to download it).

Via the `BitsAndBytesConfig`, we specify that the model should be loaded with 4-bit quantization (using the NF4 data type) and with double quantization for even better memory efficiency. See the [bitsandbytes documentation](https://huggingface.co/docs/bitsandbytes/main/en/index) for more on this.

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

modelpath="meta-llama/Llama-2-7b-hf"

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(modelpath)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.70s/it]
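
To get a feeling for why double quantization saves memory, the following back-of-the-envelope sketch computes the per-parameter overhead of the quantization constants. The block sizes (64 weights per block, 256 first-level constants per second-level block) and the 8-bit second-level storage follow the description in the QLoRA paper and are assumptions here, not values read from the loaded model. The parameter count is a hypothetical round figure.

```python
# Rough memory overhead of quantization constants, following the QLoRA paper.
# Assumed values (not read from the loaded model):
WEIGHT_BLOCK = 64      # weights that share one quantization constant
CONSTANT_BLOCK = 256   # first-level constants that share one second-level constant

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / WEIGHT_BLOCK

# With double quantization: first-level constants are stored in 8 bits, plus
# one 32-bit second-level constant per 256 first-level constants.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

n_params = 6.5e9  # hypothetical number of quantized weights in a 7B model


def to_mib(bits_per_param: float) -> float:
    """Convert a per-parameter overhead in bits to total MiB."""
    return n_params * bits_per_param / 8 / 2**20


print(f"plain:  {overhead_plain:.3f} bits/param, {to_mib(overhead_plain):.0f} MiB")
print(f"double: {overhead_double:.3f} bits/param, {to_mib(overhead_double):.0f} MiB")
```

Under these assumptions, the overhead drops from 0.5 bits to roughly 0.127 bits per parameter, on top of the 4 bits used for each weight.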


We initialize the adapter functionality in the loaded model via `adapters.init()`, add a new LoRA adapter (named `"assistant_adapter"`) via `add_adapter()`, and activate it for training via `train_adapter()`, which freezes the base model weights.

In the call to `LoRAConfig()`, you can configure how and where LoRA layers are added to the model. Here, we add LoRA layers to the query, key, and value projections of the self-attention modules (`attn_matrices=["q", "k", "v"]`) as well as to the intermediate and output linear layers.

In [8]:
import adapters
from adapters import LoRAConfig

adapters.init(model)

config = LoRAConfig(
    selfattn_lora=True, intermediate_lora=True, output_lora=True,
    attn_matrices=["q", "k", "v"],
    alpha=16, r=64, dropout=0.1
)
model.add_adapter("assistant_adapter", config=config)
model.train_adapter("assistant_adapter")

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
assistant_adapter        lora            112,197,632       3.330       1       1
--------------------------------------------------------------------------------
Full model                              3,369,340,928     100.000               0
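
The adapter parameter count in the summary can be reproduced by hand. This is a sketch under stated assumptions: the model dimensions are those of Llama 2 7B (hidden size 4096, MLP width 11008, 32 layers), and we assume that `intermediate_lora` and `output_lora` each wrap one MLP projection between the hidden and MLP dimensions. The second assumption is inferred from the reported total, not confirmed from the library source.

```python
# Reproduce the LoRA parameter count reported by adapter_summary().
HIDDEN = 4096         # hidden size of Llama 2 7B (assumed from the model config)
INTERMEDIATE = 11008  # MLP width of Llama 2 7B (assumed from the model config)
LAYERS = 32           # number of decoder layers
R = 64                # LoRA rank from LoRAConfig


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A has shape (r, in_features), B has shape (out_features, r)."""
    return r * in_features + out_features * r


attn = 3 * lora_params(HIDDEN, HIDDEN, R)       # q, k, v projections
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE, R)  # assumed: two MLP projections
total = LAYERS * (attn + mlp)

print(f"{total:,}")  # 112,197,632
```

Because the count is symmetric in `in_features` and `out_features`, the result does not depend on which direction each MLP projection maps.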


In [9]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayerWithAdapters(
        (self_attn): LlamaSdpaAttentionWithAdapters(
          (q_proj): LoRALinear4bit(
            in_features=4096, out_features=4096, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (k_proj): LoRALinear4bit(
            in_features=4096, out_features=4096, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (v_proj): LoRALinear4bit(
            in_features=4096, out_features=4096, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
       

In [10]:
# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

torch.float32 374607872 0.10369450727620085
torch.uint8 3238002688 0.8963054927237991
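
The two numbers above can be checked with a little arithmetic. The sketch below assumes the standard Llama 2 7B dimensions (vocabulary 32000, hidden size 4096, MLP width 11008, 32 layers) and that bitsandbytes packs two 4-bit values into each `uint8` element, so the `uint8` count is half the number of quantized weights. The breakdown of the unquantized parameters is our own accounting, chosen because it matches the printed total.

```python
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000

# Quantized linear weights per decoder layer:
# q, k, v, o projections plus three MLP projections.
linear_per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE
quantized_weights = LAYERS * linear_per_layer

# Two 4-bit values are packed into each uint8 element.
packed_uint8 = quantized_weights // 2

# Parameters that stay unquantized.
embeddings = 2 * VOCAB * HIDDEN       # input embeddings and LM head
norms = LAYERS * 2 * HIDDEN + HIDDEN  # two norms per layer plus the final norm
lora = 112_197_632                    # adapter size from adapter_summary()
unquantized = embeddings + norms + lora

print(packed_uint8)  # 3238002688
print(unquantized)   # 374607872
```

So the roughly 90% share of `uint8` elements corresponds to about 6.5 billion 4-bit weights, while the embeddings, norms, and LoRA matrices make up the remainder.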


## Prepare data for training

We tokenize each conversation and truncate it to at most 512 tokens. No special tokens are added at this stage.

In [11]:
import os 

def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=512, # can set to longer values such as 2048
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize, 
    batched=True, 
    num_proc=os.cpu_count(),    # one worker process per CPU core
    remove_columns=["text"]     # don't need this anymore, we have tokens from here on
)

Map (num_proc=24): 100%|██████████| 9846/9846 [00:01<00:00, 9248.73 examples/s] 
Map (num_proc=24): 100%|██████████| 518/518 [00:00<00:00, 1767.51 examples/s]


In [12]:
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 518
    })
})

## Training

We specify training hyperparameters and train the model using the `AdapterTrainer` class.

These hyperparameters are not well tuned, so feel free to play around!

In [13]:
args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="epoch",
    logging_steps=10,
    save_steps=50,
    save_total_limit=3,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    fp16=True,
)
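
As a quick sanity check on these settings, the sketch below works out the effective batch size and an approximate number of optimizer steps. It assumes a single GPU (as set via `CUDA_VISIBLE_DEVICES` earlier); the step count is only an estimate, since the trainer's exact handling of the last partial batch may differ.

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
num_devices = 1        # assumed: a single visible GPU
train_samples = 9846   # size of the training split
epochs = 2             # num_train_epochs

# Samples that contribute to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Approximate number of optimizer updates (rounding is an assumption).
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

With an effective batch size of 16, `logging_steps=10` logs roughly every 160 samples and `save_steps=50` writes a checkpoint roughly every 800 samples.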

In [None]:
from adapters import AdapterTrainer
from transformers import DataCollatorForLanguageModeling

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args,
)

trainer.train()
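
To illustrate what the data collator contributes, here is a simplified pure-Python sketch of causal language modeling collation: sequences are padded to a common length, and labels are a copy of the inputs with padding positions set to `-100` so that the loss ignores them. The helper `collate_causal_lm` is hypothetical, and it pads on the right for readability; the real tokenizer may pad on the other side.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def collate_causal_lm(sequences, pad_token_id):
    """Pad to the longest sequence and derive labels from the inputs."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids = seq + [pad_token_id] * n_pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; every occurrence of the pad id is masked.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        )
    return batch


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=2)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

Note that masking is done by token id. Since we set `pad_token` to the EOS token above, any genuine EOS token in the data would be masked in the labels as well.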

## Inference

Finally, we can prompt the model:

In [None]:
# Ignore warnings
from transformers import logging
logging.set_verbosity(logging.CRITICAL)

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text}\n### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    
    model.eval()
    with torch.inference_mode():
        output_tokens = model.generate(**batch, max_new_tokens=50)

    return tokenizer.decode(output_tokens[0], skip_special_tokens=True)


In [None]:
print(prompt_model(model, "Explain Calculus to a primary school student"))

## Merge LoRA weights

In [None]:
model.merge_adapter("assistant_adapter")
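
Conceptually, merging folds the low-rank update into the frozen weight, so that inference needs a single matrix multiplication per layer. The toy sketch below uses the standard LoRA formulation, W' = W + (alpha / r) * B A, with tiny hand-written matrices. It is not the library's implementation, the rank is shrunk from 64 to 2 for readability, and it does not cover how the quantized 4-bit weights are handled during merging.

```python
ALPHA, R = 16, 2       # R is reduced from 64 to keep the toy example readable
SCALING = ALPHA / R    # standard LoRA scaling factor (assumed formulation)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen weight (out x in)
A = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]                   # LoRA down-projection (r x in)
B = [[1.0, 0.0], [0.0, 0.5], [0.25, 0.25]]               # LoRA up-projection (out x r)
x = [[1.0], [2.0], [3.0]]                                # input column vector

# Unmerged: base path plus the scaled low-rank path.
base = matmul(W, x)
delta = matmul(B, matmul(A, x))
unmerged = [[b[0] + SCALING * d[0]] for b, d in zip(base, delta)]

# Merged: fold the update into the weight once, then use a single matmul.
BA = matmul(B, A)
W_merged = [
    [w + SCALING * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)
]
merged = matmul(W_merged, x)

print(unmerged)
print(merged)
```

Both paths produce the same output, which is why a merged model should respond the same way as the unmerged one, up to numerical precision.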

In [None]:
print(prompt_model(model, "Explain NLP in simple terms"))