# Fine-tune Llama 2 with LoRA on an AMD Radeon Pro W7900

In this blog, we show you how to fine-tune Llama 2 on a single AMD Radeon Pro W7900 GPU (48 GB GDDR) with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and computing limitations and make open-source large language models (LLMs) more accessible. We also show you how to fine-tune and upload models to Hugging Face.

This blog is based on https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html, which gives technical background on Llama 2, fine-tuning, and LoRA, and runs LoRA fine-tuning on an AMD MI250 GPU. Here, we jump straight to the steps for fine-tuning on an AMD Radeon Pro W7900 GPU.

## Step-by-step Llama 2 fine-tuning

Standard (full-parameter) fine-tuning updates all of the model's parameters. It requires significant computational power to manage optimizer states and gradient checkpointing. The resulting memory footprint is typically about four times larger than the model itself. For example, loading a 7 billion parameter model (e.g. Llama 2) in FP32 (4 bytes per parameter) requires approximately 28 GB of GPU memory, while fine-tuning demands around 28 × 4 = 112 GB of GPU memory. Note that the 112 GB figure is derived empirically, and various factors like batch size, data precision, and gradient accumulation contribute to overall memory usage.
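The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the 4x overhead factor is the empirical rule of thumb quoted in the text rather than a measured value, and the function name `full_finetune_memory_gb` is ours.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=4, overhead_factor=4):
    """Return (weights_gb, training_gb) using the rule of thumb from the text."""
    # Memory needed just to hold the weights (FP32 = 4 bytes per parameter).
    weights_gb = num_params * bytes_per_param / 1e9
    # Assumed empirical multiplier covering gradients, optimizer states, etc.
    training_gb = weights_gb * overhead_factor
    return weights_gb, training_gb


weights_gb, training_gb = full_finetune_memory_gb(7e9)
vram_gb = 48  # AMD Radeon Pro W7900

print(f"weights: {weights_gb:.0f} GB, full fine-tuning: ~{training_gb:.0f} GB")
print(f"fits in {vram_gb} GB of VRAM: {training_gb <= vram_gb}")
```

Under these assumptions, the weights alone fit in 48 GB, but the full fine-tuning estimate does not.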

To overcome this memory limitation, you can use a parameter-efficient fine-tuning (PEFT) technique, such as LoRA.

This example leverages the AMD Radeon Pro W7900 GPU with 48 GB of VRAM. This setup allows us to explore different settings for fine-tuning the Llama-2-7b weights with LoRA.


Our setup:

- Hardware: AMD Radeon Pro W7900
- Software:
    - ROCm 6.0+
    - PyTorch 2.0.1+
- Libraries: transformers, accelerate, peft, trl, bitsandbytes, scipy

### Step 0: Set up the ROCm environment

The easiest way is to use a ROCm Docker image from https://hub.docker.com/r/rocm/pytorch. I use the image `rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`, which you can pull with:

`docker pull rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`

Here is my Docker start command for reference.

```
$alias drun='docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G --hostname=w7900  -p 80:80 -p 8080:8080 -v /DATA:/DATA -w /DATA'

$drun rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2
```

To run this Jupyter notebook, install JupyterLab with `pip install jupyterlab`.

### Step 1: Getting started

First, let’s confirm the availability of the GPU.

In [1]:
!rocm-smi --showproductname



GPU[0]		: Card series: 		0x7448
GPU[0]		: Card model: 		0x0e0d
GPU[0]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		D7070100


Next, install the required libraries.

In [2]:
!pip install -q pandas peft==0.9.0 transformers==4.31.0 trl==0.4.7 accelerate scipy

#### Install bitsandbytes
1. Install bitsandbytes using the following code.

- For ROCm 6.2

2. Check the bitsandbytes version (0.42.0).

In [3]:
%%bash
pip list | grep bitsandbytes

bitsandbytes              0.42.0


#### Import the required packages

In [4]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer



### Step 2: Configuring the model and data
You can access Meta’s official Llama-2 model from Hugging Face after making a request, which can take a couple of days. Instead of waiting, we’ll use NousResearch’s Llama-2-7b-chat-hf as our base model (it’s the same as the original, but quicker to access). I downloaded it to /DATA/NousResearch/Llama-2-7b-chat-hf/ on my machine ahead of time.

In [5]:
# Model and tokenizer names
base_model_name = "/DATA/NousResearch/Llama-2-7b-chat-hf/"
new_model_name = "llama-2-7b-chat-enhanced"  # Choose any name for your fine-tuned model

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.91s/it]


In [6]:
# Data set
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# Check the shape of the data
print(training_data.shape)
# Sample 11 is a QA sample in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture

In [7]:
# Logging with report_to="tensorboard" needs a TensorBoard writer, so install tensorboardX
!pip install tensorboardX



### Step 3: Start fine-tuning
To set your training parameters, use the following code:

In [8]:
# Training Params
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

I ran out of memory (OOM) with per_device_train_batch_size=2 on the AMD Radeon Pro W7900 with 48 GB of VRAM. You can check the VRAM usage while the LoRA fine-tuning runs, as described below.
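To see how the batch settings relate to the number of optimizer steps, here is a small sketch. The helper `steps_per_epoch` is hypothetical and only approximates how the trainer counts steps. It also illustrates that raising `gradient_accumulation_steps` increases the effective batch size without increasing the per-step batch, which may be an option when a larger `per_device_train_batch_size` runs out of memory.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Approximate (effective_batch_size, optimizer_steps_per_epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)


# Settings used in this blog: 1000 samples, batch size 1, no accumulation.
batch_a, steps_a = steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=1)

# Hypothetical alternative: same per-step batch, gradients accumulated over 2 steps.
batch_b, steps_b = steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=2)

print(batch_a, steps_a)
print(batch_b, steps_b)
```

With the settings used here, one epoch over the 1,000-sample data set corresponds to 1,000 optimizer steps.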

#### Training with LoRA configuration
Now you can integrate LoRA into the base model and assess its additional parameters. LoRA essentially adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains the newly added weights.
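As a toy illustration of the update matrices, the sketch below computes `W x + (lora_alpha / r) * B (A x)` in plain Python. The tiny matrices and the helper names `matvec` and `lora_forward` are made up for illustration; this is not how PEFT implements it internally. `B` starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, lora_alpha):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    r = len(A)                        # rank = number of rows of A
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # low-rank path, only A and B are trained
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[0.5, -0.5, 1.0]]                                    # r x d_in, with r = 1
B = [[0.0], [0.0], [0.0]]                                 # d_out x r, zero-initialized
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B, x, lora_alpha=1)

B_trained = [[1.0], [0.0], [0.0]]  # pretend training changed B
y_trained = lora_forward(W, A, B_trained, x, lora_alpha=1)

print(y_init)
print(y_trained)
```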

In [9]:
from peft import get_peft_model
# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199


Note that LoRA adds only about 0.062% of the model's parameters, a tiny fraction of the original model. These are the only parameters we update through fine-tuning, as follows.
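The trainable parameter count can be sanity-checked by hand. The sketch below assumes that Llama-2-7b has 32 decoder layers with a hidden size of 4096, and that PEFT's default LoRA targets for Llama are the `q_proj` and `v_proj` projections (each 4096 x 4096). These are assumptions on our part, not something printed by the notebook. The helper name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(num_layers, r, target_shapes):
    """Count LoRA parameters: each (d_in, d_out) weight gets A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes)
    return num_layers * per_layer


# Assumed: 32 layers, q_proj and v_proj are both 4096 x 4096, r=8 as in LoraConfig above.
trainable = lora_trainable_params(32, 8, [(4096, 4096), (4096, 4096)])

total = 6_742_609_920  # "all params" reported by print_trainable_parameters()
percent = 100 * trainable / total

print(f"trainable params: {trainable:,} ({percent:.4f}% of all params)")
```

Under these assumptions the count matches the 4,194,304 trainable parameters reported above.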

In [10]:
# Trainer with LoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 6006.52 examples/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50   | 1.9542        |
| 100  | 1.7783        |
| 150  | 1.5594        |
| 200  | 1.4967        |
| 250  | 1.3451        |
| 300  | 1.408         |
| 350  | 1.4026        |
| 400  | 1.3362        |
| 450  | 1.3006        |
| 500  | 1.1761        |




TrainOutput(global_step=1000, training_loss=1.397334213256836, metrics={'train_runtime': 966.394, 'train_samples_per_second': 1.035, 'train_steps_per_second': 1.035, 'total_flos': 1.67211744380928e+16, 'train_loss': 1.397334213256836, 'epoch': 1.0})

After training completes, save the fine-tuned model:

In [11]:
# Save Model
fine_tuning.model.save_pretrained(new_model_name)

#### Checking memory usage during training with LoRA
During training, you can check the memory usage by running the `rocm-smi` command in a terminal.
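If you prefer to capture the report from Python, for example in a second notebook, a minimal sketch could look like the following. It assumes `rocm-smi` is on the `PATH` and uses its `--showmeminfo vram` option. The helper name `vram_report` is hypothetical.

```python
import shutil
import subprocess


def vram_report():
    """Return rocm-smi's VRAM report as text, or None if rocm-smi is not installed."""
    if shutil.which("rocm-smi") is None:
        return None
    result = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram"],
        capture_output=True,
        text=True,
        check=False,  # do not raise if rocm-smi exits with an error code
    )
    return result.stdout


report = vram_report()
print(report if report is not None else "rocm-smi not found on PATH")
```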

To facilitate a comparison between fine-tuning with and without LoRA, our subsequent phase involves running a thorough fine-tuning process on the base model. This involves updating all parameters within the base model. We then analyze differences in memory usage, training speed, training loss, and other relevant metrics.

#### Training without LoRA configuration

Full-parameter fine-tuning, as described in https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html, is likely to fail with an out-of-memory (OOM) error on this setup. One Radeon Pro W7900 with 48 GB of VRAM is not enough for this case.

### Step 4: Test the fine-tuned model with LoRA

To test your model, run the following code:
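The following is a minimal sketch of such a test, not necessarily the exact code used for this blog. It assumes the adapter saved earlier under `new_model_name` is merged into the base model with PEFT's `PeftModel.from_pretrained` and `merge_and_unload`, and that the prompt follows the `<s>[INST] ... [/INST]` format seen in the training samples. The helper names and the example query are hypothetical.

```python
def build_prompt(query):
    """Wrap a query in the [INST] format used by the training samples."""
    return f"<s>[INST] {query} [/INST]"


def load_finetuned_pipeline(base_model_path, adapter_path, max_length=200):
    """Load the base model in FP16, merge the LoRA adapter, and return a text-generation pipeline."""
    # Imported here so build_prompt stays usable without the heavy libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    base = AutoModelForCausalLM.from_pretrained(
        base_model_path, torch_dtype=torch.float16, device_map="auto"
    )
    # Attach the saved LoRA weights, then fold them into the base weights.
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    return pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=max_length)


# Hypothetical example query.
prompt = build_prompt("What do you think is the most important part of building an AI chatbot?")
print(prompt)

# Uncomment on a machine with the model and adapter available:
# generator = load_finetuned_pipeline("/DATA/NousResearch/Llama-2-7b-chat-hf/", "llama-2-7b-chat-enhanced")
# print(generator(prompt)[0]["generated_text"])
```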

The output looks like this:

Uploading the model to Hugging Face lets you conduct subsequent tests or share your model with others (to proceed with this step, you’ll need an active Hugging Face account).

Now you can test with the base model (original) and your fine-tuned model.

- Base model:

- Fine-tuned model:

You can observe the outputs of the two models based on a given query. These outputs exhibit slight differences due to the fine-tuning process altering the model weights.