# Fine-tuning a Large Language Model (LLM) on a Custom Dataset with QLoRA

## 1. Why Fine-tuning LLMs?

Although prompt engineering is an effective technique, it has limits. Well-designed prompts can guide the output of a Large Language Model (LLM), but they may be insufficient for more complex tasks. In many cases, you will have to add extra context, such as specific sections of text or even entire documents, to ensure that the LLM works properly for your use case.

Fine-tuning is the preferred choice when you want to get the most out of an LLM. It means continuing to train a pre-trained model on your own data. This way, you can adapt the LLM to your own domain or application, so that the text it produces is more relevant to your task.

## 2. Falcon LLM
Falcon LLM [1], an open-source Large Language Model (LLM) from the Technology Innovation Institute, has 40 billion parameters and was trained on one trillion tokens. Falcon LLM distinguishes itself by using only a fraction of the training compute required by other popular LLMs. It uses custom software and a unique data pipeline to extract high-quality content from web data, built independently of the work of NVIDIA, Microsoft, and HuggingFace.

Falcon, a 40 billion parameter autoregressive decoder-only model, underwent two months of training using 384 GPUs on AWS. The pretraining dataset was carefully constructed from public web crawls, filtering out machine-generated text and adult content, resulting in a dataset of nearly five trillion tokens. To enhance Falcon's capabilities, curated sources such as research papers and social media conversations were added to the dataset. The model's performance was extensively validated against open-source benchmarks, confirming its competitiveness with state-of-the-art LLMs from DeepMind, Google, and Anthropic. Falcon outperforms GPT-3 with only 75% of the training compute budget and requires significantly less compute during inference.

## 3. QLoRA

Full fine-tuning becomes impractical for extremely large models such as GPT-3 with 175B parameters. To address this, the authors of LoRA (Low-Rank Adaptation) [5] introduce a technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer, significantly reducing the number of trainable parameters. Despite having fewer trainable parameters and training faster, LoRA achieves comparable or better performance than full fine-tuning on various models like RoBERTa, DeBERTa, GPT-2, and GPT-3.
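
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer. It is an illustration under assumptions, not the PEFT implementation: the helper names `matmul` and `lora_forward`, the tiny dimensions, and the initialization values are all made up for the example. The frozen weight `W` is never modified; only the small matrices `A` and `B` would be trained.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A @ B) without touching W.

    x: (batch, d_in), W: (d_in, d_out), A: (d_in, r), B: (r, d_out).
    """
    base = matmul(x, W)                # frozen pre-trained path
    update = matmul(matmul(x, A), B)   # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4     # hypothetical toy sizes
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
B = [[0.0] * d_out for _ in range(r)]  # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

full_params = d_in * d_out             # parameters in the frozen weight
lora_params = r * (d_in + d_out)       # parameters in A and B
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because `B` starts at zero, the adapted layer initially produces exactly the same output as the frozen layer, and training only moves it away from that starting point through `A` and `B`.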

QLoRA [6] combines a frozen, 4-bit quantized pretrained language model with LoRA, allowing finetuning of 65B parameter models on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA introduces memory-saving techniques such as the 4-bit NormalFloat (NF4) data type, double quantization, and paged optimizers. The study demonstrates QLoRA's effectiveness by finetuning over 1,000 models across different datasets, model types, and scales, achieving state-of-the-art results.
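
A rough back-of-the-envelope calculation shows why 4-bit storage matters. The sketch below counts only the memory needed for the weights themselves and ignores activations, optimizer state, adapter weights, and quantization constants, so the real footprint is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Compare storage precision for a 65B-parameter model.
for bits in (16, 8, 4):
    print(f"65B parameters at {bits}-bit: {weight_memory_gb(65e9, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights, which is what brings a 65B model within reach of a single 48GB GPU.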

## Installing the required dependencies

In [1]:
!pip install -Uqqq pip --progress-bar off
!pip install -qqq bitsandbytes==0.39.0 --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq -U git+https://github.com/huggingface/transformers.git@e03a9cc --progress-bar off
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f --progress-bar off
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git@c9fbb71 --progress-bar off
!pip install -qqq datasets==2.12.0 --progress-bar off
!pip install -qqq loralib==0.1.1 --progress-bar off
!pip install -qqq einops==0.6.1 --progress-bar off

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kaggle-environments 1.14.15 requires transformers>=4.33.1, but you have transformers 4.30.0.dev0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.6 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 16.1.0 which is incompatible.
pathos 0.3.2 requires dill>=0.3.8, but you have dill 0.3.6 which is incompatible.
pathos 0.3.2 requires multiprocess>=0.70.16, but you have multiprocess 0.70.14 which is incompatible.

## Import Required Libraries

In [2]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
# from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so



CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...


2024-08-03 07:18:10.279162: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-03 07:18:10.279278: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-03 07:18:10.386932: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Data
We'll use a dataset [8] consisting of 79 frequently asked questions (FAQs) and their corresponding answers from an e-commerce website. The dataset is available on Kaggle, and we'll download a copy of it with `gdown`.

In [4]:
!gdown 1u85RQZdRTmpjGKcCc5anCMAHZ-um4DUC

Downloading...
From: https://drive.google.com/uc?id=1u85RQZdRTmpjGKcCc5anCMAHZ-um4DUC
To: /kaggle/working/ecommerce-faq.json
100%|██████████████████████████████████████| 21.0k/21.0k [00:00<00:00, 56.0MB/s]


### Let's open the JSON file and take a look at the data

In [5]:
with open("ecommerce-faq.json") as json_file:
    data = json.load(json_file)

### Let's look at a single example of the JSON file

In [6]:
pprint(data["questions"][0], sort_dicts=False)

{'question': 'How can I create an account?',
 'answer': "To create an account, click on the 'Sign Up' button on the top "
           'right corner of our website and follow the instructions to '
           'complete the registration process.'}


### Save the JSON data

In [31]:
with open('dataset.json', 'w') as f:
    json.dump(data["questions"], f)

In [32]:
pd.DataFrame(data["questions"]).head()

|   | question | answer |
|---|----------|--------|
| 0 | How can I create an account? | To create an account, click on the 'Sign Up' b... |
| 1 | What payment methods do you accept? | We accept major credit cards, debit cards, and... |
| 2 | How can I track my order? | You can track your order by logging into your ... |
| 3 | What is your return policy? | Our return policy allows you to return product... |
| 4 | Can I cancel my order? | You can cancel your order if it has not been s... |


In [None]:
# !huggingface-cli login

### Log in to Hugging Face

In [7]:
from huggingface_hub import notebook_login
notebook_login()


## Load the Model

To load the model and tokenizer, we'll use the `AutoModelForCausalLM` and `AutoTokenizer` classes from the 🤗 Transformers library. We'll also set the `pad_token` to the `eos_token` to avoid issues with padding.

In [9]:
MODEL_NAME = "tiiuae/falcon-7b"
 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
 
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)
 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token




Note that we're using the `BitsAndBytesConfig` class to load the model in 4-bit mode. The `bnb_4bit_use_double_quant` parameter enables double quantization, which also quantizes the quantization constants themselves to save additional memory. The `bnb_4bit_compute_dtype` parameter sets the data type used for computation: the weights are stored in 4 bits and dequantized to `bfloat16` for the matrix multiplications. We also specify the `nf4` (4-bit NormalFloat) quantization type from QLoRA.
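
To give a feel for what block-wise 4-bit quantization does, here is a simplified pure-Python sketch. It is not how bitsandbytes implements NF4: it uses evenly spaced integer levels instead of the NormalFloat levels, a toy block size of 4, and hypothetical helper names. What it does show is the role of the per-block quantization constant (`scale`), which is the value that double quantization would compress further.

```python
def quantize_block(block, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integers."""
    levels = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    scale = max(abs(x) for x in block) or 1.0    # per-block quantization constant
    codes = [round(x / scale * levels) for x in block]
    return codes, scale

def dequantize_block(codes, scale, bits=4):
    """Map integer codes back to approximate float values."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

def quantize_blockwise(weights, block_size=4, bits=4):
    """Split the weights into blocks and quantize each block on its own scale."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(block, bits) for block in blocks]

weights = [0.12, -0.50, 0.33, 0.07, 2.00, -1.25, 0.60, -0.10]  # made-up values
quantized = quantize_blockwise(weights)
restored = [x for codes, scale in quantized for x in dequantize_block(codes, scale)]

for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.3f}")
```

Quantizing each block with its own scale keeps one large weight from degrading the precision of the whole tensor, at the cost of storing one extra constant per block.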

### Let's prepare the model for training

In [10]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

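The `gradient_checkpointing_enable` method enables gradient checkpointing, a technique that trades compute for memory: instead of keeping every intermediate activation, some are recomputed during the backward pass. The `prepare_model_for_kbit_training` function from PEFT prepares the quantized model for training in 4-bit mode, for example by freezing the base model weights.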

In [12]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
 
model = get_peft_model(model, config)

### Define a function to show the trainable parameters

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params}|| all params: {all_param} || trainable: {100*trainable_params/all_param}"
    )

In [44]:
print_trainable_parameters(model)

trainable params: 4718592|| all params: 3613463424 || trainable: 0.13058363808693696


The `LoraConfig` class is used to define the configuration for LoRA, and the following parameters are set:

* r=16: Specifies the rank of the low-rank update matrices, which controls the number of trainable parameters in the adapted layers.
* lora_alpha=32: Sets the scaling factor for the LoRA update. The update is multiplied by `lora_alpha / r` (here 32 / 16 = 2) before it is added to the output of the frozen layer.
* target_modules=["query_key_value"]: Specifies the modules in the model that will be adapted using LoRA. In this case, only the "query_key_value" module will be adapted.
* lora_dropout=0.05: Sets the dropout probability applied to the input of the LoRA layers.
* bias="none": Leaves the bias terms untrained.
* task_type="CAUSAL_LM": Specifies the type of task as causal language modeling.

After configuring LoRA, the `get_peft_model` function is called to wrap the model with the adapter based on the provided configuration. Note that we're going to train only about `0.13%` of the model's parameters.
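
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the Falcon-7B layer shapes, which are not stated in this notebook: a hidden size of 4544, a fused `query_key_value` projection with 4672 outputs, and 32 decoder layers. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r, num_layers):
    """Each adapted linear layer adds two matrices: A (d_in x r) and B (r x d_out)."""
    return num_layers * r * (d_in + d_out)

# Assumed Falcon-7B shapes (not taken from this notebook).
trainable = lora_param_count(d_in=4544, d_out=4672, r=16, num_layers=32)

all_params = 3613463424  # "all params" as reported by print_trainable_parameters
print(f"trainable params: {trainable} ({100 * trainable / all_params:.2f}%)")
```

Under these assumptions the result matches the `4718592` trainable parameters reported by `print_trainable_parameters`. Doubling `r` would double the count, since it grows linearly with the rank.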

## Let's test the model before training by using the following prompt format

In [15]:
prompt = f"""
<human>: How can I create an account?
<assistant>:
""".strip()
print(prompt)

<human>: How can I create an account?
<assistant>:


### We'll modify the model generation config using the following parameters

In [16]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config

GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 11,
  "eos_token_id": 11,
  "max_new_tokens": 200,
  "pad_token_id": 11,
  "temperature": 0.7,
  "top_p": 0.7,
  "transformers_version": "4.30.0.dev0"
}
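
To make `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the idea only and is not the Transformers implementation; the function names and the example numbers are hypothetical. Note also that in Transformers these two settings only take effect when sampling is enabled (`do_sample=True`); otherwise decoding is greedy.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize over the kept tokens

probs = [0.5, 0.3, 0.15, 0.05]                # made-up next-token probabilities
kept = top_p_filter(probs, top_p=0.7)
print(kept)

sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.7)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
print(sharp[0] > flat[0])
```

With `top_p=0.7`, only the two most likely tokens survive in this example, and the next token would be sampled from those two after renormalization.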

### Using the provided configuration, we can generate a response that corresponds to our given prompt

In [17]:
%%time
device = "cuda:0"
 
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<human>: How can I create an account?
<assistant>: Please enter your name.
<human>: My name is <human>.
<assistant>: Please enter your email address.
<human>: My email address is <email>.
<assistant>: Please enter your password.
<human>: My password is <password>.
<assistant>: Please enter your password again.
<human>: My password is <password>.
<assistant>: Your password is incorrect.
<assistant>: Please enter your password again.
<assistant>: Your password is correct.
<assistant>: Please enter your phone number.
<human>: My phone number is <phone>.
<assistant>: Please enter your phone number again.
<assistant>: Your phone number is correct.
<assistant>: Please enter your date of birth.
<human>: My date of birth is <date>.
<assistant>: Please enter your date of birth again.
<ass
CPU times: user 1min 18s, sys: 465 ms, total: 1min 19s
Wall time: 1min 21s


Inside the `torch.inference_mode()` context, the `model.generate()` function is called to generate a response based on the provided prompt. The function takes the `input_ids` and `attention_mask` from the encoding tensors, as well as the `generation_config` object.

Finally, the generated output is decoded using the `tokenizer.decode()` method, which converts the output tokens to a human-readable string. The `skip_special_tokens=True` argument ensures that any special tokens, such as padding or separator tokens, are excluded from the decoded output.

The generated response repeats itself and keeps going until it reaches the `max_new_tokens` limit, where it is cut off mid-token. Can fine-tuning improve the quality of the response?

### Load the Dataset

In [70]:
from datasets import Dataset  # Dataset was not imported earlier; needed for from_pandas

df = pd.read_json('/kaggle/working/dataset.json')

# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

In [72]:
dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 79
})

In [75]:
dataset[0]

{'question': 'How can I create an account?',
 'answer': "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process."}

### The next step is to convert each question and answer pair to a prompt and pass it to the tokenizer

In [76]:
def generate_prompt(data_point):
    return f"""
<human>: {data_point["question"]}
<assistant>: {data_point["answer"]}
""".strip()
 
def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt
 
data = dataset.shuffle().map(generate_and_tokenize_prompt)
data

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 79
})

## Training

Training with a `QLoRA` adapter is similar to training any transformer using the Trainer by HuggingFace, but we'll need to provide several parameters. The `TrainingArguments` class is used to define the training parameters.

In [79]:
OUTPUT_DIR = "experiments"

In [80]:
%load_ext tensorboard
%tensorboard --logdir experiments/runs

In [92]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=80,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    report_to="tensorboard",
)

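Because `max_steps=80` is set, it overrides `num_train_epochs`, so training runs for 80 optimizer steps rather than 10 epochs. With an effective batch size of 4 (`per_device_train_batch_size=1` times `gradient_accumulation_steps=4`), that is roughly 4 passes over the 79 examples. We use a cosine learning rate scheduler and a paged 8-bit AdamW optimizer; paged optimizers were introduced with QLoRA to cope with memory spikes during training. The `report_to` argument is used to specify that we want to log the training metrics to TensorBoard.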
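
The arithmetic behind these settings can be sketched in plain Python. The sketch assumes that `max_steps` takes precedence over `num_train_epochs`, as the Transformers documentation describes, and that the warmup length is the step count times `warmup_ratio`, rounded up. The `lr_at` function is a hypothetical helper that mirrors the general shape of a linear warmup followed by cosine decay, not the library's scheduler code.

```python
import math

# Values copied from the TrainingArguments above and the dataset size.
num_examples = 79
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 80
warmup_ratio = 0.05
peak_lr = 2e-4

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = max_steps * effective_batch_size
approx_epochs = examples_seen / num_examples
warmup_steps = math.ceil(max_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} (about {approx_epochs:.2f} epochs)")
print(f"warmup steps: {warmup_steps}")
for step in (0, warmup_steps, max_steps // 2, max_steps):
    print(f"step {step:>2}: lr = {lr_at(step):.6f}")
```

To actually train for 10 epochs, one option would be to drop `max_steps` and let `num_train_epochs` determine the number of steps.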

### Let's use the Trainer class to train our model

In [1]:
trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train()

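We pass the `model`, `data`, and `training_args` to the Trainer class. The `data_collator` pads each batch and, with `mlm=False`, builds the labels for causal language modeling: a copy of the input IDs in which padding positions are ignored by the loss, rather than randomly masked tokens as in masked language modeling.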
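
As a rough illustration of what a causal language modeling collator produces, here is a simplified pure-Python sketch. It is not the `DataCollatorForLanguageModeling` source: the function name is hypothetical, it right-pads plain lists instead of tensors, and it masks only the positions it pads itself. The value `-100` is the label index that the loss ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def collate_causal_lm(batch, pad_token_id):
    """Pad to the longest sequence and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # Labels are the inputs themselves; the shift by one position
        # happens inside the model when it computes the loss.
        labels.append(ids + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Made-up token IDs; 11 is the eos/pad token ID shown in the generation config.
out = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=11)
for key, value in out.items():
    print(key, value)
```

One thing to keep in mind: because this notebook reuses the end-of-sequence token as the padding token, a collator that masks labels by token ID would also ignore genuine end-of-sequence tokens in the loss.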

### Upload the Trained Model
After training our model, we can save it in two common locations. First, we can save it locally using the `save_pretrained()` method:

In [None]:
model.save_pretrained("trained-model")

### Next, we can upload the model to the HuggingFace Hub using the push_to_hub() method

In [None]:
model.push_to_hub(
    "curiousily/falcon-7b-qlora-chat-support-bot-faq", use_auth_token=True
)

### Load the Trained Model
To load the trained model, we can use code similar to what we used for loading the original Falcon 7b model.

In [97]:
PEFT_MODEL = "curiousily/falcon-7b-qlora-chat-support-bot-faq"
 
config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token
 
model = PeftModel.from_pretrained(model, PEFT_MODEL)


Note that we're loading the config first and then the model. The model and tokenizer are using the base model path (Falcon 7b in this case). The final model is a PeftModel that wraps the original model and adds the QLoRA adapter.

## Evaluation
Let's reuse the generation configuration that we previously set for the base model.

In [99]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

### We're ready to generate some responses

In [100]:
DEVICE = "cuda:0"
 
prompt = f"""
<human>: How can I create an account?
<assistant>:
""".strip()
 
encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<human>: How can I create an account?
<assistant>: To create an account, please visit our sign-up page and enter your email address. Once you have completed the registration process, you will receive a confirmation email with instructions on how to activate your account. If you do not receive the email within a few minutes, please check your spam or junk folder. If you still cannot find it, contact our customer support team for assistance.


The response is much improved compared to the untrained model. It's worth noting that the generated answer differs from the reference answer in the dataset, so the model didn't simply reproduce the training text verbatim. Let's write a helper function to make generating responses easier.

In [101]:
def generate_response(question: str) -> str:
    prompt = f"""
<human>: {question}
<assistant>:
""".strip()
    encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
 
    assistant_start = "<assistant>:"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start) :].strip()
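
The string handling at the end of `generate_response` can be tested on its own, without a model. The sketch below is a hypothetical variant, not part of the original notebook: it also cuts the text at the next `<human>:` tag, in case the model keeps writing further turns, and it returns the whole text if no `<assistant>:` tag is found.

```python
def extract_first_answer(decoded: str,
                         assistant_tag: str = "<assistant>:",
                         human_tag: str = "<human>:") -> str:
    """Return only the first assistant reply from a decoded generation."""
    start = decoded.find(assistant_tag)
    if start == -1:
        return decoded.strip()            # no tag found: fall back to the full text
    answer = decoded[start + len(assistant_tag):]
    next_turn = answer.find(human_tag)
    if next_turn != -1:
        answer = answer[:next_turn]       # drop any extra turns the model invented
    return answer.strip()

# Made-up decoded output with an unwanted extra turn.
decoded = "<human>: Hi\n<assistant>: Hello there.\n<human>: Thanks"
print(extract_first_answer(decoded))
```

Whether trimming extra turns is desirable depends on the application; it is shown here only as one option.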

## Now, we can try a few questions

In [102]:
prompt = "Can I return a product if it was a clearance or final sale item?"
print(generate_response(prompt))

Clearance and final sale items are typically non-returnable and non-refundable. Please review the product description or contact our customer support team for more information.
If you have any questions about our return policy, please contact our customer support team for assistance. We will be happy to assist you with the process.


In [103]:
prompt = "How do I know when I'll receive my order?"
 
print(generate_response(prompt))

Once your order is placed, you will receive a confirmation email with tracking information. Please allow up to 24 hours for the tracking information to become available. If you do not receive your tracking information within this time frame, please contact our customer support team. We will assist you with the tracking information and resolve the issue.
