# Fine-tune Llama-3.1 with LoRA on AMD ROCm GPUs

In this blog, we show you how to fine-tune Llama-3.1-8B on an AMD GPU with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and compute limitations and make open-source large language models (LLMs) more accessible.

## Step-by-step fine-tuning

Standard (full-parameter) fine-tuning updates all of the model's parameters. It requires significant compute and memory to manage optimizer states and gradient checkpointing; the resulting memory footprint is typically about four times the size of the model itself.

To overcome this memory limitation, you can use a parameter-efficient fine-tuning (PEFT) technique, such as LoRA.
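As a rough illustration of the "four times" estimate for full-parameter fine-tuning, the sketch below adds up weights, gradients, and the two Adam moment buffers. It assumes all four are stored at the same precision and ignores activations and framework overhead; the function name `full_finetune_memory` is hypothetical, and the parameter count is the base-model size implied by the parameter summary printed in Step 3.

```python
# Rough memory estimate for full-parameter fine-tuning with an Adam-style optimizer.
# Assumption: weights, gradients, and both Adam moment buffers use the same
# precision; activations and framework overhead are ignored.

GIB = 2**30


def full_finetune_memory(num_params: int, bytes_per_value: int = 4) -> dict:
    """Return the estimated memory in GiB for each component."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value        # one gradient per weight
    adam_states = 2 * num_params * bytes_per_value  # first and second moments
    total = weights + gradients + adam_states
    return {
        "weights": weights / GIB,
        "gradients": gradients / GIB,
        "adam_states": adam_states / GIB,
        "total": total / GIB,
    }


# Base parameter count of Llama-3.1-8B (total params minus the LoRA params).
estimate = full_finetune_memory(8_030_261_248)
for name, gib in estimate.items():
    print(f"{name:12s} {gib:8.1f} GiB")
print(f"total / weights = {estimate['total'] / estimate['weights']:.1f}x")
```

With FP32 values this comes to roughly 30 GiB of weights and about 120 GiB in total before activations, which is why a parameter-efficient method is attractive on a single GPU.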


Our setup:

- Hardware: AMD ROCm GPU (MI325X, MI300X, etc.); see the [device list](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html)
- Software:
    - ROCm 6.0+
    - PyTorch 2.0.1+
    - Libraries: transformers, accelerate, peft, trl, bitsandbytes, scipy

### Step 0: Set up the ROCm environment

The easiest way is to use a ROCm Docker image from https://hub.docker.com/r/rocm/pytorch. I use the tag `rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`.

```bash
docker pull rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2
```

Here is my Docker start command for reference.

```bash
alias drun='docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G --hostname=ROCm-FT -v /DATA:/DATA -w /DATA'

drun rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2
```

To run this Jupyter notebook, install JupyterLab with `pip install jupyterlab`.

### Step 1: Getting started

First, confirm that the GPU is available, then install the required libraries.

In [1]:
!rocm-smi --showproductname



GPU[0]		: Card series: 		Instinct MI210
GPU[0]		: Card model: 		0x0c34
GPU[0]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		D67301V
GPU[1]		: Card series: 		Instinct MI210
GPU[1]		: Card model: 		0x0c34
GPU[1]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]		: Card SKU: 		D67301V
GPU[2]		: Card series: 		Instinct MI210
GPU[2]		: Card model: 		0x0c34
GPU[2]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[2]		: Card SKU: 		D67301V
GPU[3]		: Card series: 		Instinct MI210
GPU[3]		: Card model: 		0x0c34
GPU[3]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[3]		: Card SKU: 		D67301V


!pip install -q pandas peft==0.14.0 transformers==4.47.1 trl==0.13.0 accelerate==1.2.1 scipy tensorboardX

In [2]:
%%bash
pip list | grep peft
pip list | grep transformer
pip list | grep accelerate
pip list | grep trl

peft                       0.14.0
transformers               4.47.1
accelerate                 1.2.1
trl                        0.13.0


#### Install bitsandbytes
1. Install bitsandbytes using the following code.

- For ROCm 6.2

2. Check the bitsandbytes version (0.42.0).

In [3]:
%%bash
pip list | grep bitsandbytes

bitsandbytes               0.42.0


#### Check and Set GPUs for fine-tuning

In [4]:
import os
import torch
# Set the visible GPUs as needed
gpus = [0, 1, 2, 3]
os.environ.setdefault("CUDA_VISIBLE_DEVICES", ','.join(map(str, gpus)))
print(f"PyTorch detected number of available devices: {torch.cuda.device_count()}")

PyTorch detected number of available devices: 4
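One detail of the cell above is worth a quick check: `os.environ.setdefault` only writes the variable when it is not already defined, so a `CUDA_VISIBLE_DEVICES` value exported in the shell takes precedence over the list in the notebook. The sketch below shows this behavior on a throwaway variable name (`DEMO_VISIBLE_DEVICES` is hypothetical, chosen so the real setting is left alone).

```python
import os

# Hypothetical variable name used only for this demonstration.
var = "DEMO_VISIBLE_DEVICES"

os.environ.pop(var, None)               # start from an unset variable
os.environ.setdefault(var, "0,1,2,3")   # unset, so the default is stored
first = os.environ[var]

os.environ.setdefault(var, "0")         # already set, so nothing changes
second = os.environ[var]

os.environ[var] = "0"                   # plain assignment always overrides
third = os.environ[var]

os.environ.pop(var)                     # clean up
print(first, second, third)             # prints "0,1,2,3 0,1,2,3 0"
```

If you want the notebook's GPU list to win regardless of the shell environment, plain assignment is the option that does so.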


#### Import the required packages

In [5]:
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer



### Step 2: Configuring the model and data
Make sure the model files have been downloaded, and replace the path in the code cell below with their actual location.

In [6]:
# Model and tokenizer names
base_model_name = "/data/HF-MODEL/huggingface-model/Meta-Llama-3.1-8B/"
new_model_name = "Llama-3.1-8B-lora" #You can give your own name for fine tuned model

# Tokenizer
#llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True, use_fast=True)
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.21s/it]


After you have the base model, you can start fine-tuning. We fine-tune our base model for a question-and-answer task using a small data set called mlabonne/guanaco-llama2-1k, a 1,000-sample subset of the timdettmers/openassistant-guanaco data set. That data set is in turn drawn from the OpenAssistant Conversations corpus, a human-generated, human-annotated, assistant-style conversation corpus that contains 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.

In [7]:
# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# check the data
print(training_data.shape)
# #11 is a QA sample in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture
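To inspect samples more easily, the printed record can be split into its instruction and response parts. This is a minimal sketch that assumes the single-turn `<s>[INST] ... [/INST] ...` layout visible in the sample above, with an optional trailing `</s>`; the helper name `split_sample` is hypothetical, and multi-turn records would need a more general parser.

```python
import re

# Matches the single-turn "<s>[INST] prompt [/INST] answer" layout.
# Assumption: an end-of-sequence marker "</s>" may or may not be present.
_SAMPLE_RE = re.compile(
    r"^<s>\[INST\]\s*(.*?)\s*\[/INST\]\s*(.*?)\s*(?:</s>)?$",
    re.DOTALL,
)


def split_sample(text: str) -> tuple:
    """Split one record into (instruction, response)."""
    match = _SAMPLE_RE.match(text)
    if match is None:
        raise ValueError("text does not follow the expected [INST] layout")
    return match.group(1), match.group(2)


instruction, response = split_sample("<s>[INST] Hi there [/INST] Hello! </s>")
print(instruction)  # Hi there
print(response)     # Hello!
```

Calling `split_sample(training_data[11]["text"])` would apply the same split to the record printed above.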

### Step 3: Start fine-tuning
To set your training parameters, use the following code:

In [8]:
# Training Params
train_params = TrainingArguments(
    output_dir="./results_lora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

**NOTE**: If you run out of memory (OOM), decrease `per_device_train_batch_size`. Use `rocm-smi` to monitor VRAM usage while fine-tuning runs.
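If you lower `per_device_train_batch_size`, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged. The sketch below works through that arithmetic for the 1,000-sample data set. The helper `steps_per_epoch` is hypothetical, the formula is an approximation of the trainer's step count, and `num_devices=1` assumes a single training process (the model is sharded with `device_map="auto"` rather than replicated).

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size,
                    grad_accum_steps=1, num_devices=1):
    """Approximate optimizer steps per epoch and the effective batch size."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch), effective_batch


# Settings used in this post: batch size 4, no gradient accumulation.
print(steps_per_epoch(1000, 4, 1))  # (250, 4)

# Halving the batch size and doubling accumulation keeps the same schedule.
print(steps_per_epoch(1000, 2, 2))  # (250, 4)
```

The smaller per-device batch reduces peak activation memory, at the cost of more forward and backward passes per optimizer step.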

**Training with LoRA configuration**

Now you can integrate LoRA into the base model and assess its additional parameters. LoRA essentially adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains the newly added weights.
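In equation form, and using notation from the LoRA paper rather than from this post, a frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is augmented with a trainable low-rank update, so the layer output for an input $x$ becomes:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Here $r$ corresponds to the `r` setting and $\alpha$ to `lora_alpha`, so with `lora_alpha=8` and `r=8` the scaling factor $\alpha / r$ is 1. Only $A$ and $B$ are trained, which adds $r(d + k)$ parameters per adapted matrix instead of the $d \cdot k$ that a full update would require.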

In [9]:
from peft import get_peft_model
# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.0424
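The trainable-parameter count can be reproduced by hand. This sketch assumes the published Llama-3.1-8B shapes (hidden size 4096, 32 layers, 32 attention heads, 8 key/value heads) and that PEFT applies LoRA to `q_proj` and `v_proj` by default for Llama models, since the `LoraConfig` above does not set `target_modules`; the helper `lora_params` is hypothetical.

```python
# Llama-3.1-8B shapes (assumed from the model's published config).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 128


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r = 8
# Assumed default targets for Llama: q_proj (4096 -> 4096), v_proj (4096 -> 1024).
q_proj = lora_params(HIDDEN_SIZE, NUM_HEADS * HEAD_DIM, r)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, r)
trainable = NUM_LAYERS * (q_proj + v_proj)

all_params = 8_033_669_120  # "all params" from the output above
print(f"{trainable:,}")                          # 3,407,872
print(f"{100 * trainable / all_params:.4f}")     # 0.0424
```

Under these assumptions the result matches the `trainable params` and `trainable%` values reported by `print_trainable_parameters()`.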


The output of `print_trainable_parameters()` shows that the trainable parameters are a tiny fraction (about 0.04%) of the original model. These are the only parameters we update through fine-tuning, as follows.

In [10]:
# Trainer with LoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    #dataset_text_field="text",
    #tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

| Step | Training Loss |
|------|---------------|
| 50   | 1.6537        |
| 100  | 1.4221        |
| 150  | 1.3367        |
| 200  | 1.3874        |
| 250  | 1.3835        |


TrainOutput(global_step=250, training_loss=1.4366847839355468, metrics={'train_runtime': 550.3139, 'train_samples_per_second': 1.817, 'train_steps_per_second': 0.454, 'total_flos': 1.6854644828110848e+16, 'train_loss': 1.4366847839355468, 'epoch': 1.0})
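The throughput figures in `TrainOutput` follow directly from the runtime, which makes them easy to sanity-check. The sketch below recomputes them from the values reported above. The helper `estimated_runtime_minutes` is hypothetical and assumes runtime scales linearly with the number of epochs, which ignores warm-up and checkpointing overhead.

```python
# Values taken from the TrainOutput above.
num_samples = 1000
global_step = 250
train_runtime_s = 550.3139

samples_per_second = num_samples / train_runtime_s
steps_per_second = global_step / train_runtime_s
seconds_per_step = train_runtime_s / global_step

print(f"{samples_per_second:.3f} samples/s")  # 1.817 samples/s
print(f"{steps_per_second:.3f} steps/s")      # 0.454 steps/s
print(f"{seconds_per_step:.2f} s/step")


def estimated_runtime_minutes(num_epochs: int) -> float:
    """Linear extrapolation of the measured one-epoch runtime (an assumption)."""
    return num_epochs * train_runtime_s / 60


print(f"3 epochs: about {estimated_runtime_minutes(3):.0f} minutes")
```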

In [11]:
# Save Model
fine_tuning.model.save_pretrained(new_model_name)

#### Checking memory usage during training with LoRA
During training, you can check memory usage by running the `rocm-smi` command in a terminal.
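For periodic monitoring from Python, a minimal sketch is to call `rocm-smi --showmeminfo vram` in a loop. The helper names `vram_report` and `watch_vram` are hypothetical, the raw text is printed without parsing because the exact layout can differ between ROCm versions, and the code returns early when `rocm-smi` is not on the `PATH`.

```python
import shutil
import subprocess
import time


def vram_report():
    """Return the text printed by `rocm-smi --showmeminfo vram`, or None if unavailable."""
    if shutil.which("rocm-smi") is None:
        return None
    result = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def watch_vram(samples: int = 3, interval_s: float = 5.0) -> None:
    """Print a VRAM report a few times, pausing between samples."""
    for _ in range(samples):
        report = vram_report()
        if report is None:
            print("rocm-smi not found on PATH")
            return
        print(report)
        time.sleep(interval_s)


if __name__ == "__main__":
    watch_vram(samples=1, interval_s=0.0)
```

Run it in a separate terminal or notebook while training is in progress, since the training cell blocks until it finishes.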

To compare fine-tuning with and without LoRA, the next phase runs full-parameter fine-tuning on the base model, updating all of its parameters. We then analyze the differences in memory usage, training speed, training loss, and other relevant metrics.

### Step 4: Test the fine-tuned model with LoRA

To test your model, run the following code:

In [12]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)

#### Fine-tuned Model Inference

In [13]:
# Reload the base model in FP16 and merge it with the LoRA weights
from peft import PeftModel

#base_model_name = "/data/HF-MODEL/huggingface-model/Meta-Llama-3.1-8B/"
#new_model_name = "Llama-3.1-8B-lora" #You can give your own name for fine tuned model

# torch_dtype makes the load match the FP16 intent; the default would be FP32
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
peft_model = PeftModel.from_pretrained(base_model, new_model_name)
peft_model = peft_model.merge_and_unload()

# Reload tokenizer to save it
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.62it/s]


In [14]:
# Use a distinct name so the imported pipeline() function is not shadowed
text_generator = pipeline(
    "text-generation",
    model=peft_model,
    tokenizer=llama_tokenizer,
    device_map="auto"
)

Device set to use cuda:0


In [15]:
query = "What do you think is the most important part of building an AI chatbot?"
output = text_generator(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] There are many different aspects to consider when building an AI chatbot, but I believe that the most
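Because the model was fine-tuned on text in the `<s>[INST] ... [/INST]` layout, queries should be wrapped the same way at inference time. A small helper keeps that formatting in one place; `build_prompt` is a hypothetical name, and the template is simply the one used in the query above.

```python
def build_prompt(query: str) -> str:
    """Wrap a user query in the [INST] template used by the training data."""
    return f"<s>[INST] {query.strip()} [/INST]"


prompt = build_prompt("  What do you think is the most important part of building an AI chatbot? ")
print(prompt)
```

The response above stops mid-sentence because generation ended at the default length limit. Passing a generation argument such as `max_new_tokens=256` when calling the text-generation pipeline should produce a longer completion.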
