# Fine-Tuning Llama-3.1 with QLoRA on AMD ROCm GPUs

This tutorial demonstrates how to fine-tune the **Llama-3.1-8B** large language model using **Quantized Low-Rank Adaptation (QLoRA)** on AMD ROCm GPUs. **Llama-3.1**, developed by Meta, is a widely used open-source large language model. For more information, visit [Meta's Llama page](https://ai.meta.com/llama/).

**QLoRA**, introduced by Dettmers et al. in [their 2023 paper](https://arxiv.org/abs/2305.14314), is a parameter-efficient fine-tuning technique that combines quantization and low-rank adaptation to enable efficient fine-tuning of large models with minimal resource requirements.

> **Reference**: Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs," 2023.



## **Prerequisites**

### **1. Hardware Requirements**
- AMD Instinct GPUs (e.g., MI210, MI300X) or Radeon GPUs (e.g., Radeon Pro W7900).
- Larger LLMs may require multiple GPUs. This notebook runs Llama-3.1-8B QLoRA fine-tuning on a single Radeon Pro W7900.
- Ensure your system meets the [System Requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html), including ROCm 6.0+ and Ubuntu 22.04.

### **2. Docker**
- Install Docker with GPU support.
- Ensure your user has appropriate permissions to access the GPU.
- Verify Docker permissions and GPU access:
  ```bash
  docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 rocm-smi
  ```

### **3. Hugging Face API Access**
- Obtain an API token from [Hugging Face](https://huggingface.co) for downloading models.
- Ensure you have a Hugging Face API token with the necessary permissions and approval to access [Meta's LLaMA checkpoints](https://huggingface.co/meta-llama/Llama-3.1-8B).

### **4. Data Preparation**
- For this tutorial, we use a sample dataset from Hugging Face, which will be prepared during the setup steps.


## **Prepare Training Environment**

### **1. Pull the Docker Image**

Ensure your system meets the [System Requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html).

Pull the Docker image required for this tutorial:

```bash
docker pull rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
```

### **2. Launch the Docker Container**

Launch the Docker container and map the directory that holds these notebooks into it:

```bash
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 8G \
  --hostname=ROCm-FT \
  -v /path/to/notebooks:/workspace/notebooks \
  -w /workspace/notebooks \
  rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
```

**Important**: Replace `/path/to/notebooks` with the absolute path to the directory on your host machine where your notebooks are stored. Ensure this directory is accessible to Docker and contains the necessary files for this tutorial.

### **3. Install and Launch Jupyter**

Inside the Docker container, install Jupyter using the following command:

```bash
pip install --upgrade pip setuptools wheel
pip install jupyter
```

Start the Jupyter server:
```bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```

### **4. Install Required Libraries**
Install the libraries needed for this tutorial. Run the following commands inside the Jupyter notebook running within the Docker container:

In [1]:
# Install necessary libraries for fine-tuning, including parameter-efficient fine-tuning (peft) and transformers
!pip install pandas peft==0.14.0 transformers==4.47.1 trl==0.13.0 accelerate==1.2.1 scipy tensorboardX


Verify the installation:

In [2]:
# Verify the installation and version of the required libraries
!pip list | grep peft
!pip list | grep transformer
!pip list | grep accelerate
!pip list | grep trl

peft                      0.14.0
transformers              4.47.1
accelerate                1.2.1
trl                       0.13.0


### **5. Install BitsAndBytes (ROCm 6.2)**
Install bitsandbytes from source for ROCm 6.2:

# Install bitsandbytes from source to enable quantization on ROCm GPUs
!git clone --recurse https://github.com/ROCm/bitsandbytes.git && cd bitsandbytes && git checkout rocm6.2_internal_testing && make hip && python setup.py install

In [3]:
!git clone --recurse https://github.com/ROCm/bitsandbytes.git && cd bitsandbytes && git checkout rocm_enabled_multi_backend && pip install -r requirements-dev.txt && cmake -DCOMPUTE_BACKEND=hip -S . && make && pip install .

fatal: destination path 'bitsandbytes' already exists and is not an empty directory.


Verify the installation (version 0.43.3.dev):

In [4]:
# Verify the installation and version of bitsandbytes
try:
    import bitsandbytes as bnb
    print("bitsandbytes version:", bnb.__version__)
except ImportError as e:
    print("Error:", e)

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

bitsandbytes version: 0.43.3.dev


In [5]:
!pip list | grep bitsandbytes

bitsandbytes              0.43.3.dev0



**⚠️ Important: Ensure the Correct Kernel is Selected**  
If this process fails, please ensure the correct Jupyter kernel is selected for your notebook.
To do this:
1. Go to the "Kernel" menu.
2. Click "Change Kernel."
3. Select `Python 3 (ipykernel)` from the list.

**Failure to select the correct kernel may lead to unexpected issues when running the notebook.**


### **6. Provide Your Hugging Face Token**

You will need a Hugging Face API token to access Llama-3.1-8B. Tokens typically start with "hf_". Generate your token at [Hugging Face Tokens](https://huggingface.co/settings/tokens) and request access for [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B).

Run the following interactive block in your Jupyter notebook to set up the token:

***Note***: Please uncheck the "Add token as Git credential" option.

In [6]:
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
status = notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Verify that your token was captured correctly:

In [7]:
# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")


Token validated successfully! Logged in as: alexhegit


## Fine-Tuning the Model

This section walks through the process of setting up and executing fine-tuning for the Llama-3.1 model using the QLoRA technique. The following steps include setting up GPUs, importing the required libraries, configuring the model and training parameters, and running the fine-tuning process.

### Set and Verify GPU Availability

Begin by specifying the GPUs available for fine-tuning and verifying that they are properly detected by PyTorch.

In [8]:
import os
import torch
gpus = [0, 1, 2, 3] # Specify the GPUs to be used for training
os.environ.setdefault("CUDA_VISIBLE_DEVICES", ','.join(map(str, gpus)))
# Ensure PyTorch detects the GPUs correctly
print(f"PyTorch detected number of available devices: {torch.cuda.device_count()}") 

PyTorch detected number of available devices: 1
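One detail worth keeping in mind is that `os.environ.setdefault` only writes a value when the variable is not already set, so a `CUDA_VISIBLE_DEVICES` value exported before Jupyter started takes precedence over the list in the cell. The sketch below illustrates this with a plain dictionary standing in for `os.environ` (the pre-existing value and the helper name `visible_devices_value` are hypothetical):

```python
def visible_devices_value(gpu_indices):
    """Build the comma-separated string expected by CUDA_VISIBLE_DEVICES."""
    return ",".join(str(i) for i in gpu_indices)


# Plain dict used as a stand-in for os.environ (assumption: the variable
# was already exported as "0" before the notebook started).
env = {"CUDA_VISIBLE_DEVICES": "0"}

# setdefault leaves an existing value untouched.
env.setdefault("CUDA_VISIBLE_DEVICES", visible_devices_value([0, 1, 2, 3]))
after_setdefault = env["CUDA_VISIBLE_DEVICES"]
print("after setdefault:", after_setdefault)

# A direct assignment overrides it.
env["CUDA_VISIBLE_DEVICES"] = visible_devices_value([0, 1])
after_assignment = env["CUDA_VISIBLE_DEVICES"]
print("after assignment:", after_assignment)
```

In a real session the variable has to be set before PyTorch initializes the GPU runtime for the change to take effect.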


### Import the Required Packages

Next, import the libraries necessary for fine-tuning, including utilities for dataset loading, model configuration, training setup, and evaluation.

In [9]:
# Load datasets and transformers for handling the Llama-3.1 model
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
# Import utilities for QLoRA fine-tuning and training configurations
from peft import LoraConfig
from trl import SFTTrainer

print("Successfully imported required libraries for dataset handling, model configuration, and QLoRA fine-tuning.")

Successfully imported required libraries for dataset handling, model configuration, and QLoRA fine-tuning.


### Configuring the Model 

Load the base model, tokenizer, and set up the quantization configuration for efficient fine-tuning on ROCm-enabled GPUs.

In [10]:
base_model_name = "meta-llama/Llama-3.1-8B"  # Hugging Face model repository name
new_model_name = "Llama-3.1-8B-qlora"  # Name for the fine-tuned model

# Load and configure the tokenizer for padding and tokenization
llama_tokenizer = AutoTokenizer.from_pretrained(
    base_model_name, 
    trust_remote_code=True, 
    use_fast=True
)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

### Quantization Configuration in QLoRA

As outlined in the [QLoRA paper](https://arxiv.org/abs/2305.14314), weights are stored in a 4-bit format while computation happens in 16-bit or 32-bit precision: each time a quantized weight tensor is used, it is dequantized to the chosen compute precision before the matrix multiplication. Compute types such as `float16`, `bfloat16`, and `float32` are supported. Two 4-bit quantization types are available, **NormalFloat4 (NF4)** and plain 4-bit float (`fp4`). Based on the theoretical analysis and empirical results in the paper, NF4 is the recommended choice.
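To make the store-in-4-bit, compute-in-higher-precision idea concrete, here is a minimal pure-Python sketch of blockwise absmax quantization. It is a simplification under stated assumptions: it uses a uniform 16-level grid rather than the actual NF4 code book, and the function names are hypothetical, not part of bitsandbytes.

```python
# Toy blockwise 4-bit quantization (uniform grid, NOT the NF4 code book).
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]  # 16 evenly spaced values in [-1, 1]


def quantize_block(weights):
    """Return (absmax scale, list of 4-bit indices) for one block of weights."""
    absmax = max(abs(w) for w in weights) or 1.0  # fall back to 1.0 for an all-zero block
    indices = []
    for w in weights:
        normalized = w / absmax
        # Pick the index of the closest representable level.
        idx = min(range(16), key=lambda i: abs(LEVELS[i] - normalized))
        indices.append(idx)
    return absmax, indices


def dequantize_block(absmax, indices):
    """Map 4-bit indices back to floats, as done before a matrix multiplication."""
    return [LEVELS[i] * absmax for i in indices]


block = [0.12, -0.40, 0.05, 0.33, -0.07, 0.26, -0.19, 0.01]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("4-bit codes:", codes)
print("max reconstruction error:", round(max_error, 4))
```

Each weight costs 4 bits plus a share of the per-block scale, and the reconstruction error is bounded by half the spacing between adjacent levels.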

For this tutorial, we use the following configuration:

- **4-bit quantization** with the NF4 type.  
- **16-bit (float16)** precision for computations.  
- **Double quantization**, which quantizes the quantization constants themselves to save roughly an additional 0.37 bits per parameter, as reported in the QLoRA paper.  

These quantization parameters are controlled using `BitsAndBytesConfig` (refer to the [Hugging Face documentation](https://huggingface.co/docs)) as follows:

- **`load_in_4bit`**: Activates loading the model in 4-bit precision.  
- **`bnb_4bit_quant_type`**: Specifies the quantization type, with options for `fp4` (four-bit float) and `nf4` (normal four-bit float). As NF4 is optimized for normally distributed weights, it is the recommended choice.  
- **`bnb_4bit_compute_dtype`**: Determines the data type used for linear layer computations.  
- **`bnb_4bit_use_double_quant`**: Activates double quantization for further memory optimization.  
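
The memory effect of double quantization can be checked with a few lines of arithmetic. The sketch below follows the calculation in the QLoRA paper and assumes a block size of 64 weights and a second-level block of 256 constants. Those two values are assumptions taken from the paper, not read from this notebook's configuration.

```python
# Assumed block sizes (from the QLoRA paper, not from this notebook).
BLOCK_SIZE = 64            # weights sharing one quantization constant
SECOND_LEVEL_BLOCK = 256   # constants sharing one second-level constant

# Without double quantization: one 32-bit constant per block of weights.
overhead_plain = 32 / BLOCK_SIZE

# With double quantization: 8-bit constants, plus one 32-bit constant
# for every 256 of them.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

savings = overhead_plain - overhead_double

print(f"bits/param without double quantization: {4 + overhead_plain:.3f}")
print(f"bits/param with double quantization:    {4 + overhead_double:.3f}")
print(f"savings: {savings:.3f} bits per parameter")
```

Under these assumptions the per-parameter overhead drops from 0.5 bits to about 0.127 bits.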


In [11]:

# Configure 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype="float16", 
    bnb_4bit_use_double_quant=True
)

# Load the pre-trained Llama-3.1 model with device mapping for GPU
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True
)

# Disable caching to optimize for fine-tuning
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

### Load and Prepare the Dataset

Fine-tune the base model for a question-and-answer task using a small dataset called [mlabonne/guanaco-llama2-1k](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k/tree/main). It is a 1,000-sample subset of the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset, which is in turn drawn from the OpenAssistant Conversations corpus: a human-generated, human-annotated, assistant-style conversation corpus that contains 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.

In [12]:
# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
# Load the fine-tuning dataset from Hugging Face
training_data = load_dataset(data_name, split="train")

# Display dataset structure and a sample for verification
print(training_data.shape)
# Sample 11 is a question-and-answer pair in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture
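
Each record is a single `text` field that wraps the instruction in `[INST] ... [/INST]` markers, followed by the response. The sketch below shows one way to build and split such strings. The helper names and the sample string are hypothetical, and the splitter only handles the first turn of a conversation.

```python
def build_prompt(query: str) -> str:
    """Wrap a user query in the [INST] template seen in the dataset sample."""
    return f"<s>[INST] {query.strip()} [/INST]"


def split_sample(text: str):
    """Split a single-turn sample into (instruction, response)."""
    start = text.index("[INST]") + len("[INST]")
    end = text.index("[/INST]")
    instruction = text[start:end].strip()
    response = text[end + len("[/INST]"):].strip()
    # Drop a trailing end-of-sequence marker if one is present.
    if response.endswith("</s>"):
        response = response[: -len("</s>")].strip()
    return instruction, response


# Hypothetical sample, written for illustration only.
sample = "<s>[INST] What is ROCm? [/INST] ROCm is AMD's software stack for GPU computing. </s>"
instruction, response = split_sample(sample)

print(build_prompt("What is ROCm?"))
print(instruction)
print(response)
```

Using the same template at inference time keeps the prompt format consistent with what the model saw during fine-tuning.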

### Fine-Tuning Configuration

Define the hyperparameters and configurations for the fine-tuning process.

In [13]:
# Define training arguments, including output directory and optimization settings
# Specify number of epochs, batch size, learning rate, and logging steps
train_params = TrainingArguments(
    output_dir="./results_qlora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

print("Training parameters configured!")

Training parameters configured!


***NOTE***: If you encounter out-of-memory (OOM) errors, reduce `per_device_train_batch_size` or enable gradient checkpointing. Use `rocm-smi` to monitor VRAM usage during fine-tuning.
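
When reducing the per-device batch size, `gradient_accumulation_steps` can be raised to keep the effective batch size unchanged. The sketch below is an approximation of how these settings translate into optimizer steps per epoch. The helper `steps_per_epoch` is hypothetical and ignores details such as dropped last batches.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)


# Settings used in this tutorial: 1,000 samples, batch size 1, no accumulation.
print(steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=1))

# Two configurations with the same effective batch size of 4; the second
# holds only one sample in memory at a time.
print(steps_per_epoch(1000, per_device_batch_size=4, grad_accum_steps=1))
print(steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=4))
```

With the tutorial's settings this gives 1,000 steps for one epoch.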

### QLoRA Configuration

Low-Rank Adaptation (LoRA) introduces lightweight rank-decomposition matrices into the base model and updates only those matrices while the original weights stay frozen. QLoRA applies the same idea on top of the 4-bit quantized base model, which significantly reduces both the number of trainable parameters and the memory needed to fine-tune large models.
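
As a rough illustration of the rank-decomposition idea, the pure-Python sketch below computes `h = W x + (alpha / r) * B (A x)` for a tiny layer. The dimensions and helper names are made up for the example. `B` starts at zero, as described in the LoRA paper, so the adapted layer initially behaves exactly like the frozen one.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_out, d_in, r, alpha = 8, 8, 2, 2  # toy sizes, chosen for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]

full_params = d_out * d_in
lora_param_count = r * (d_in + d_out)

print("same output as frozen layer at init:", lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
print("full-matrix params:", full_params, "| LoRA params:", lora_param_count)
```

The saving grows with layer size, because the LoRA parameter count scales with `r * (d_in + d_out)` instead of `d_in * d_out`.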

In [14]:
from peft import get_peft_model

# Configure QLoRA parameters for low-rank adaptation
peft_parameters = LoraConfig(
    lora_alpha=8, # Alpha controls the scaling parameter
    lora_dropout=0.1,
    r=8, # r specifies the rank of the low-rank matrices
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.0424


Expected Output:
```
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.0424
```
This indicates that only 0.042% of the total parameters are trainable during fine-tuning, which is a tiny fraction of the overall model, ensuring resource efficiency.
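
The trainable-parameter count can be reproduced by hand. The sketch below assumes the published Llama-3.1-8B dimensions (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and assumes that PEFT falls back to adapting `q_proj` and `v_proj` when `target_modules` is not set. These assumptions are not stated in the notebook itself.

```python
# Assumed Llama-3.1-8B dimensions (not read from the notebook).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 8  # matches r=8 in the LoraConfig above


def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
trainable = NUM_LAYERS * (q_proj + v_proj)

total = 8_033_669_120  # "all params" from the printed output
percent = 100 * trainable / total

print(f"trainable params: {trainable:,}")
print(f"trainable%: {percent:.4f}")
```

Under these assumptions the result matches the 3,407,872 trainable parameters reported by `print_trainable_parameters()`.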

### Fine-Tuning with QLoRA
QLoRA's lightweight approach allows fine-tuning while maintaining high efficiency in terms of computation and memory usage. We now define a training pipeline using the QLoRA-integrated model.

In [15]:
# Initialize the trainer with the fine-tuning dataset and configurations
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    args=train_params
)

# Execute the training process
fine_tuning.train()


| Step | Training Loss |
|------|---------------|
| 50   | 1.6805 |
| 100  | 1.5443 |
| 150  | 1.466  |
| 200  | 1.3582 |
| 250  | 1.3069 |
| 300  | 1.3639 |
| 350  | 1.4001 |
| 400  | 1.3689 |
| 450  | 1.3129 |
| 500  | 1.2768 |


TrainOutput(global_step=1000, training_loss=1.3864017639160156, metrics={'train_runtime': 1154.1898, 'train_samples_per_second': 0.866, 'train_steps_per_second': 0.866, 'total_flos': 1.6503795464994816e+16, 'train_loss': 1.3864017639160156, 'epoch': 1.0})

During training, the model outputs metrics such as training loss, step progress, and runtime performance, which can be monitored for insights.
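
As a quick sanity check, the throughput figures in `TrainOutput` can be derived from the step count and runtime. The sketch below copies the values from the output above into a plain dictionary (the variable names are illustrative):

```python
# Values copied from the TrainOutput shown above.
train_output = {
    "global_step": 1000,
    "train_runtime": 1154.1898,  # seconds
}

steps_per_second = train_output["global_step"] / train_output["train_runtime"]
seconds_per_step = train_output["train_runtime"] / train_output["global_step"]
runtime_minutes = train_output["train_runtime"] / 60

print(f"steps per second: {steps_per_second:.3f}")
print(f"seconds per step: {seconds_per_step:.2f}")
print(f"total runtime:    {runtime_minutes:.1f} minutes")
```

Because the batch size is 1 with no gradient accumulation, steps per second and samples per second are the same number in this run.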

### Save the Fine-Tuned Model

After training is complete, save the model with the specified name.

In [16]:
# Save the fine-tuned model to the specified directory
fine_tuning.model.save_pretrained(new_model_name)
print("Successfully saved the model!")

Successfully saved the model!


### Monitoring GPU Memory

To monitor GPU memory during training, run `rocm-smi` in a separate terminal (for example, `watch -n 1 rocm-smi` to refresh the view every second).

This displays memory usage and other GPU metrics so you can confirm that your hardware resources are being used as expected.
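
Memory can also be inspected from inside the notebook. The sketch below is one possible helper (the name `gpu_memory_report` is hypothetical) built on PyTorch's `torch.cuda` memory functions, which ROCm builds of PyTorch expose under the same names. It returns `None` when PyTorch or a GPU is not available.

```python
def gpu_memory_report(device_index=0):
    """Return allocated, peak, free, and total GPU memory in GiB, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    gib = 1024 ** 3
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    return {
        "allocated_gib": torch.cuda.memory_allocated(device_index) / gib,
        "peak_allocated_gib": torch.cuda.max_memory_allocated(device_index) / gib,
        "free_gib": free_bytes / gib,
        "total_gib": total_bytes / gib,
    }


report = gpu_memory_report()
print(report)
```

Calling it right after `fine_tuning.train()` gives the peak memory allocated by PyTorch tensors. That figure is usually lower than the total reported by `rocm-smi`, which also counts allocator caches and other processes.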

### Comparison: Fine-Tuning with and without QLoRA

To understand the benefits of QLoRA, you can compare fine-tuning metrics (such as memory usage, training speed, and loss) between:

- Fine-tuning with QLoRA.
- Fine-tuning without QLoRA.

QLoRA's resource-efficient approach is especially beneficial for training on hardware with limited memory or computational power.
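
For a first-order feel of the difference, the sketch below estimates the memory needed for the model weights alone. It is a back-of-the-envelope calculation under loud assumptions: every base weight is treated as quantized (in practice some layers, such as embeddings, stay in 16-bit), NF4 with double quantization is taken as about 4.127 bits per parameter following the QLoRA paper's arithmetic, and activations, gradients, and optimizer state are ignored.

```python
# Parameter counts taken from the print_trainable_parameters() output.
LORA_PARAMS = 3_407_872
BASE_PARAMS = 8_033_669_120 - LORA_PARAMS


def weights_gib(num_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024 ** 3


fp16_gib = weights_gib(BASE_PARAMS, 16)
nf4_gib = weights_gib(BASE_PARAMS, 4.127)  # assumed NF4 + double quantization cost

print(f"16-bit base weights: {fp16_gib:.1f} GiB")
print(f"4-bit base weights:  {nf4_gib:.1f} GiB (idealized)")
print(f"parameters needing gradients and optimizer state: "
      f"{BASE_PARAMS:,} (full fine-tuning) vs {LORA_PARAMS:,} (QLoRA)")
```

Measured numbers from `rocm-smi` during an actual run are the reliable basis for a comparison. This estimate only indicates the expected order of magnitude.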

### Testing the Fine-Tuned Model

Load the fine-tuned model and run inference to evaluate its performance.

In [17]:
# Reload the base model in FP16 and merge the QLoRA adapter weights into it
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16  # load in half precision, as the comment above states
)
peft_model = PeftModel.from_pretrained(base_model, new_model_name)
peft_model = peft_model.merge_and_unload()

# Configure the tokenizer for text generation
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

# Note: this assignment rebinds the imported `pipeline` function to the pipeline object,
# so re-import `pipeline` from transformers before re-running this cell.
pipeline = pipeline(
    "text-generation", 
    model=peft_model, 
    tokenizer=llama_tokenizer,
    max_length=1024,
    device_map="auto"
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0


Now let's run a query and view the response generated by our fine-tuned model.

In [18]:

# Use the fine-tuned model to generate responses for a query
query = "What do you think is the most important part of building an AI chatbot?"
output = pipeline(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] There are several key components that are important for building an AI chatbot. These include:

1. Natural Language Processing (NLP): NLP is the ability of a computer program to understand and process human language. This is essential for a chatbot to be able to understand and respond to user input.

2. Machine Learning: Machine learning algorithms are used to train the chatbot on a large dataset of conversations and to improve its ability to understand and respond to user input over time.

3. Dialog management: Dialog management is the process of managing the conversation between the user and the chatbot. This includes understanding the user's intent, maintaining context, and generating appropriate responses.

4. User experience: The user experience is an important factor in the success of a chatbot. This includes the design of the interface, the speed and accuracy of the responses, and the overa