# LoRA Finetuning

Low-Rank Adaptation (LoRA) is a technique for fine-tuning large language models (LLMs) efficiently. This notebook covers what LoRA is, why it is useful, and how to apply it in practice:

## What is LoRA?

- **Concept**: LoRA freezes the pre-trained weights and adds a trainable low-rank decomposition alongside selected weight matrices in the transformer.
- **Efficiency**: Because only this small number of additional parameters is trained, LoRA substantially reduces the memory and compute cost of fine-tuning.
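
To make the efficiency point concrete, the sketch below counts trainable parameters for a single weight matrix. The 4096 x 4096 shape and rank 8 are assumed values for illustration, and `lora_param_counts` is a hypothetical helper, not part of any library.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the d_out x d_in matrix W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Assumed shape and rank, for illustration only
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The ratio shrinks further as the matrix grows, because the LoRA count grows linearly with the dimensions while the full count grows with their product.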

## Benefits of LoRA

- **Speed**: Fine-tuning with LoRA is significantly faster due to fewer parameters being updated.
- **Customization**: It allows data scientists to tailor large models to their specific tasks without extensive retraining.

## Some Use Cases

- **Personalized AI**: Customize models to understand specific jargon or concepts in niche fields.
- **Optimized Performance**: Improve performance on tasks like sentiment analysis or document summarization with domain-specific fine-tuning.
- **Efficient Deployment**: Deploy one large base LLM and several small LoRA adapters instead of several large models.

In the following sections, we will explore how to implement LoRA in practice and see its benefits firsthand.


# Installation


*   Make sure we're on a GPU instance
*   Install required packages for finetuning; see [LLM Finetuning Hub](https://github.com/georgian-io/LLM-Finetuning-Hub)
*   Install PEFT from source for new features
*   Restart instance as required



In [None]:
# Clone the LLM Finetuning Hub and install its requirements.txt
!git clone https://github.com/georgian-io/LLM-Finetuning-Hub.git
!pip install -r ./LLM-Finetuning-Hub/requirements.txt

In [None]:
!pip install flash-attn --no-build-isolation

Collecting flash-attn
  Downloading flash_attn-2.3.3.tar.gz (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 12.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.3.3-cp310-cp310-linux_x86_64.whl size=57075008 sha256=bcb63b64213ab61590b340b77de84e448a442e19c100480895194df39ad7673d
  Stored in directory: /root/.cache/pip/wheels/e5/e6/fa/941802ec61d1afd320d27160ab1db98e6dba65381f84b76d4a
Successfully built flash-attn
Installing collected packages: flash-attn
Successfully installed flash-attn-2.3.3


In [None]:
!cp ./LLM-Finetuning-Hub/llama2/llama_patch.py ./llama_patch.py

In [None]:
# Install peft from source
!git clone https://github.com/huggingface/peft
!pip install ./peft

**Restart Runtime!**

# Imports

In [None]:
import pickle  # used later to save the training results

import torch
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

# peft module helps us generate & inject LoRA modules into base model
from peft import (
    LoraConfig,
    prepare_model_for_kbit_training,
    get_peft_model,
)

# transformers module helps us load a base model
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig, # helper to quantize the model so we can run on a single GPU
    TrainingArguments,
)

# trl modules help us train LoRA weights
from trl import SFTTrainer


# huggingface datasets module
import datasets
from datasets import load_dataset


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


In [None]:
# let's also filter out warnings so outputs are a little easier to read
import warnings
warnings.filterwarnings("ignore")

# Getting NewsGroup Classification Dataset
The 20 Newsgroups dataset is a classic text classification dataset in natural language processing. It contains around 20,000 newsgroup posts across 20 topics. The following functions load and prepare this dataset for fine-tuning with LoRA:

- `get_newsgroup_instruction_data`: Builds formatted prompts for training or inference from texts and labels.
- `clean_newsgroup_data`: Keeps only the examples whose text and label are both strings.
- `get_newsgroup_data_for_ft`: Loads the dataset's train and test splits, takes a stratified sample of the training split, and formats both splits with the functions above.
- `get_newsgroup_classes`: Returns the unique class names in the training split as a comma-separated string, which we later insert into the zero-shot prompt.

With these functions, we can prepare the Newsgroups data in the prompt format used for fine-tuning.


In [None]:
TRAINING_CLASSIFIER_PROMPT_v2 = """### Sentence:{sentence} ### Class:{label}"""
INFERENCE_CLASSIFIER_PROMPT_v2 = """### Sentence:{sentence} ### Class:"""

def get_newsgroup_instruction_data(mode, texts, labels):
    # format each (text, label) pair with the prompt template for the given mode
    if mode == "train":
        prompt = TRAINING_CLASSIFIER_PROMPT_v2
    elif mode == "inference":
        prompt = INFERENCE_CLASSIFIER_PROMPT_v2
    else:
        raise ValueError(f"Unknown mode: {mode!r}; expected 'train' or 'inference'")

    instructions = []

    for text, label in zip(texts, labels):
        if mode == "train":
            example = prompt.format(
                sentence=text,
                label=label,
            )
        elif mode == "inference":
            example = prompt.format(
                sentence=text,
            )
        instructions.append(example)

    return instructions


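def clean_newsgroup_data(texts, labels):
    # keep only examples whose text and label are both strings;
    # label2data maps each label to the first example seen with that label
    label2data = {}
    clean_data, clean_labels = [], []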
    for data, label in zip(texts, labels):
        if isinstance(data, str) and isinstance(label, str):
            clean_data.append(data)
            clean_labels.append(label)

            if label not in label2data:
                label2data[label] = data

    return label2data, clean_data, clean_labels


def get_newsgroup_data_for_ft(mode="train", train_sample_fraction=0.99):
    newsgroup_dataset = load_dataset("rungalileo/20_Newsgroups_Fixed")
    train_data = newsgroup_dataset["train"]["text"]
    train_labels = newsgroup_dataset["train"]["label"]
    label2data, train_data, train_labels = clean_newsgroup_data(
        train_data, train_labels
    )

    test_data = newsgroup_dataset["test"]["text"]
    test_labels = newsgroup_dataset["test"]["label"]
    _, test_data, test_labels = clean_newsgroup_data(test_data, test_labels)

    # take a stratified sample (train_sample_fraction) of the training data
    train_df = pd.DataFrame(data={"text": train_data, "label": train_labels})
    train_df, _ = train_test_split(
        train_df,
        train_size=train_sample_fraction,
        stratify=train_df["label"],
        random_state=42,
    )
    train_data = train_df["text"]
    train_labels = train_df["label"]

    train_instructions = get_newsgroup_instruction_data(mode, train_data, train_labels)
    test_instructions = get_newsgroup_instruction_data(mode, test_data, test_labels)

    train_dataset = datasets.Dataset.from_pandas(
        pd.DataFrame(
            data={
                "instructions": train_instructions,
                "labels": train_labels,
            }
        )
    )
    test_dataset = datasets.Dataset.from_pandas(
        pd.DataFrame(
            data={
                "instructions": test_instructions,
                "labels": test_labels,
            }
        )
    )

    return train_dataset, test_dataset


def get_newsgroup_classes():
    newsgroup_dataset = load_dataset("rungalileo/20_Newsgroups_Fixed")
    train_data = newsgroup_dataset["train"]["text"]
    train_labels = newsgroup_dataset["train"]["label"]

    label2data, clean_data, clean_labels = clean_newsgroup_data(
        train_data, train_labels
    )
    df = pd.DataFrame(data={"text": clean_data, "label": clean_labels})

    newsgroup_classes = df["label"].unique()
    newsgroup_classes = ", ".join(newsgroup_classes)

    return newsgroup_classes
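
As a quick illustration of the two prompt templates used above, the sketch below formats one made-up example in both modes. `build_prompt` is a hypothetical helper written for this illustration; the template strings mirror the ones defined in the cell above.

```python
TRAIN_PROMPT = "### Sentence:{sentence} ### Class:{label}"
INFER_PROMPT = "### Sentence:{sentence} ### Class:"


def build_prompt(text, label=None):
    """Training prompts include the label; inference prompts stop at '### Class:'."""
    if label is None:
        return INFER_PROMPT.format(sentence=text)
    return TRAIN_PROMPT.format(sentence=text, label=label)


# Made-up example text and label
train_example = build_prompt("My car won't start.", "rec.autos")
infer_example = build_prompt("My car won't start.")
print(train_example)
print(infer_example)
```

The training prompt is the inference prompt followed by the label, so a model fine-tuned on the former learns to continue the latter with a class name.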

Let's load in the dataset!

In [None]:
sample_fraction = 0.025 # editable

train_dataset, _ = get_newsgroup_data_for_ft(mode="train", train_sample_fraction=sample_fraction)
_, test_dataset = get_newsgroup_data_for_ft(mode="inference")
newsgroup_classes = get_newsgroup_classes()

print(f"Sample fraction:{sample_fraction}")
print(f"Training samples:{train_dataset.shape}")




Sample fraction:0.025
Training samples:(266, 3)


In [None]:
# Let's take a look at a training example
train_dataset['instructions'][0]

"### Sentence:\n\nThere's a package called Workspace on cica that has 5 desktops; I\nhaven't done much with it yet, but it seems to be able to do what you\nwant it to.\n\nDon't have the exact archive name handy, but it's something like\nwspace<blah>.zip.\n\nTom\n\n-- \n finn@convex.com           \t\t\t      I speak only for myself. ### Class:comp.os.ms-windows.misc"

In [None]:
# how about a test example?
test_dataset['instructions'][0]

'### Sentence:I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy. ### Class:'

# Loading Model

## Quantization Config

-  We are using 4-bit quantization for LoRA training ([QLoRA](https://arxiv.org/abs/2305.14314))
-  From the Hugging Face [blog](https://huggingface.co/blog/4bit-transformers-bitsandbytes): **QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 46GB GPU.**
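
For a rough sense of why 4-bit loading matters on a single GPU, the sketch below estimates the memory needed for the weights alone. The round figure of 7e9 parameters is an assumption, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7e9  # assumed round figure for a "7B" model
fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)
print(f"16-bit: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```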


In [None]:
# BitsAndBytesConfig: 4-bit (NF4) quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load Model and Tokenizer

To begin fine-tuning the model with LoRA, we first need to load a pre-trained model and its corresponding tokenizer. This code snippet accomplishes the following:

- **Model Loading**: We use `AutoModelForCausalLM` to load a pre-trained causal language model from the Hugging Face Hub, which is suitable for tasks such as text generation. Here, we're using "NousResearch/Llama-2-7b-hf", a Llama 2 7B checkpoint in Hugging Face format.

- **Quantization**: We pass `bnb_config` as the `quantization_config` argument so that the weights are loaded in 4-bit precision, reducing the model's memory footprint. This is particularly useful when working with large models or when deploying to environments with limited resources.

- **Tokenizer**: We use `AutoTokenizer` to load the tokenizer that corresponds to our pre-trained model.

This setup ensures that the model and tokenizer are configured consistently before we start fine-tuning with LoRA.


In [None]:
# Load model and tokenizer
pretrained_ckpt = "NousResearch/Llama-2-7b-hf"

# You can try other 7B models (or larger ones, if you have access to better GPUs),
# e.g. mistral, flan, falcon ("tiiuae/falcon-7b"), rp, zephyr


model = AutoModelForCausalLM.from_pretrained(
    pretrained_ckpt,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
)
model.config.pretraining_tp = 1  # a value other than 1 activates the more accurate but slower computation of the linear layers

tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"





## Inference Helper

The function `infer_one_example` is designed to perform inference on a single example using the provided model and tokenizer.

The function takes a text instruction, tokenizes it, and feeds it to the model to generate a prediction. It strips the prompt from the decoded output to keep only the generated text, and returns an empty string if generation fails (e.g., because the input is too long).


In [None]:
def infer_one_example(model, tokenizer, instruction):
    # tokenize the prompt and move it to the GPU
    input_ids = tokenizer(instruction, return_tensors="pt", truncation=True).input_ids.cuda()

    with torch.inference_mode():
        try:
            outputs = model.generate(
                input_ids=input_ids,
                max_new_tokens=20,
                do_sample=True,
                top_p=0.95,
                temperature=1e-3,  # near-zero temperature makes sampling close to greedy
            )
            result = tokenizer.batch_decode(
                outputs.detach().cpu().numpy(), skip_special_tokens=True
            )[0]
            # keep only the text generated after the prompt
            result = result[len(instruction):]
        except Exception:
            # generation failed (e.g., the input is too long)
            result = ""

    return result
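
Because the helper generates up to 20 new tokens, the returned text can continue past the class name. One possible post-processing step, sketched below, maps the generated text back to a known class. `extract_label` is a hypothetical helper and the class list is a made-up subset; neither comes from the notebook.

```python
def extract_label(generated: str, classes: list[str]) -> str:
    """Map free-form generated text to a known class name, or '' if none matches."""
    stripped = generated.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # Check longer class names first so a name that is a prefix of another cannot shadow it
    for name in sorted(classes, key=len, reverse=True):
        if first_line.startswith(name):
            return name
    return ""


# Made-up subset of class names, for illustration only
classes = ["rec.autos", "rec.motorcycles", "sci.med"]
print(extract_label("rec.autos\n### Sentence:", classes))
print(repr(extract_label("something else", classes)))
```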

# Zero-Shot Example on Base Model

After setting up our model and tokenizer, we should evaluate the base model's performance before any fine-tuning (maybe it's already good enough!). We'll conduct a zero-shot test, which allows us to assess the model's ability to make predictions on tasks it hasn't been explicitly trained on.

In zero-shot learning, the model uses its pre-trained knowledge to infer the correct output for a given input. Here's how we'll proceed:

1. **Select an Example**: We'll choose a text example that the model hasn't seen during training.
2. **Run Inference**: Using the `infer_one_example` function, we will pass our selected text to the model and generate a prediction.
3. **Evaluate**: We'll examine the model's output to determine if it aligns with expected outcomes, considering that the model has no prior fine-tuning on this specific task.


In [None]:
# Let's get the first instance in our dataset
instruction, label = test_dataset["instructions"][0], test_dataset["labels"][0]
instruction, label

('### Sentence:I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy. ### Class:',
 'rec.autos')

In [None]:
# prompt format
ZERO_SHOT_CLASSIFIER_PROMPT = """Classify the sentence into one of 20 classes. The list of classes is provided below, where the classes are separated by commas:

{newsgroup_classes}

From the above list of classes, select only one class that the provided sentence can be classified into. The sentence will be delimited with triple backticks. Once again, only predict the class from the given list of classes. Do not predict anything else.

### Sentence: ```{sentence}```
### Class:
"""

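# inject data into the prompt
# note: `instruction` already contains the "### Sentence:" / "### Class:" markers,
# so they appear twice in the final prompt
prompt_zeroshot = ZERO_SHOT_CLASSIFIER_PROMPT.format(
    newsgroup_classes=newsgroup_classes,
    sentence=instruction,
)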

In [None]:
infer_one_example(model, tokenizer, prompt_zeroshot)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


'\n### Sentence: ```### Sentence:I am a little confused on all of'

- Obviously not what we wanted: the model echoes the prompt instead of predicting a class.
- Let's try to coerce the model into the correct output by fine-tuning it!

# Fine Tuning!

Here's the fun part!

Now that we have assessed the base model's performance, the next stage is to fine-tune it using Low-Rank Adaptation (LoRA). Fine-tuning is a critical step to tailor a pre-trained model to our specific task, in this case, text classification within the 20 Newsgroups dataset. We will implement LoRA as it enables us to update the model's parameters efficiently, providing a significant computational advantage.


## Basic Configs

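- **Dropout**: A dropout rate of 0.1, applied in the LoRA layers to reduce overfitting during training.
- **Epochs**: The number of passes over the training data.
- **Rank**: A rank of 8, which determines the size of the low-rank matrices in LoRA. A higher rank gives the adapter more capacity but adds trainable parameters, so we need to balance model flexibility against parameter efficiency.
- **Alpha**: A scaling factor that controls how strongly the LoRA weight update affects the base weights (the update is scaled by `alpha / rank`).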
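
To show how rank and alpha interact, here is a minimal pure-Python sketch of a LoRA-adapted linear layer, computing `h = W x + (alpha / rank) * B (A x)`. The tiny matrices are arbitrary toy values and the helper names are hypothetical; real implementations operate on tensors inside the model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute h = W x + (alpha / rank) * B (A x)."""
    scaling = alpha / rank
    base = matvec(W, x)                # frozen base layer
    update = matvec(B, matvec(A, x))   # low-rank path: project down, then up
    return [b + scaling * u for b, u in zip(base, update)]


# Toy 2x2 layer with rank 1 (values are arbitrary)
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 1.0]]              # rank x d_in, trainable
B = [[0.5], [0.0]]            # d_out x rank, trainable
x = [2.0, 3.0]

h = lora_forward(W, A, B, x, alpha=2, rank=1)
print(h)
```

With the values used in this notebook (`alpha = 16`, `rank = 8`), the scaling factor is 2.0. When `B` is all zeros, the low-rank path contributes nothing and the layer's output equals the base layer's output.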

In [None]:
# Basic training config
dropout = 0.1
epochs = 3    # 3 epochs take ~20 min on a T4

# LoRA Configs
rank = 8      # try larger value for more complex task
alpha = 16    # try larger value if task is substantially different from language understanding/processing

## Configuring LoRA Parameters

Using the `peft` library, we create a `LoraConfig` object with the parameters defined above.


We then invoke `prepare_model_for_kbit_training` to prepare the quantized model for training.

Finally, we invoke `get_peft_model` to apply the LoRA configuration to our model (this creates the LoRA weights that we train).


In [None]:
# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=alpha,
    lora_dropout=dropout,
    r=rank,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
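
As a back-of-the-envelope check on how small the adapter is, the sketch below counts LoRA parameters for the whole model. All shapes are assumptions for illustration: a hidden size of 4096, 32 transformer layers, and adapters on two square attention projections per layer. The actual count depends on which modules the config targets; calling `model.print_trainable_parameters()` on the PEFT model reports the real number.

```python
hidden_size = 4096      # assumed hidden size of the base model
num_layers = 32         # assumed number of transformer layers
targets_per_layer = 2   # assumed: two square attention projections per layer get adapters
rank = 8

# Each adapted matrix adds A (rank x hidden) and B (hidden x rank)
params_per_matrix = rank * (hidden_size + hidden_size)
trainable = num_layers * targets_per_layer * params_per_matrix
print(f"trainable LoRA parameters: {trainable:,}")
```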

In [None]:
# directory to save model artifact
results_dir = "./finetuned_model"

## Training Loop

With all configurations in place, we initiate a training loop using the `SFTTrainer` class, which has been set up to accommodate our LoRA-enhanced model. We pass our training dataset along with other parameters like maximum sequence length, tokenizer, and the training arguments.

After training, we output the training loss to monitor our model's performance and save the fine-tuned model along with the tokenizer to our specified directory. Additionally, we serialize the training results using `pickle` for later analysis.

By the end of this stage, our model will be fine-tuned with LoRA, making it more adept at handling the classification tasks specific to our dataset.
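
The training cell enables `packing=True` with a maximum sequence length. The idea behind packing can be sketched as follows: tokenized examples are concatenated, separated by an end-of-sequence token, and cut into fixed-length blocks so that little compute is spent on padding. This is a simplified illustration with made-up token ids, not the `trl` implementation.

```python
def pack_sequences(tokenized_examples, eos_token_id, seq_length):
    """Concatenate examples (each followed by EOS) and cut into fixed-length blocks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)
    # Drop the trailing remainder that does not fill a whole block
    num_blocks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(num_blocks)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # made-up token ids
blocks = pack_sequences(examples, eos_token_id=2, seq_length=4)
print(blocks)
```

Note that a block can span several examples, so the number of training sequences after packing generally differs from the number of raw examples.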


In [None]:
# Training Loop

training_args = TrainingArguments(
    output_dir=results_dir,
    logging_dir=f"{results_dir}/logs",
    num_train_epochs=epochs,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    learning_rate=2e-4,
    bf16=False,  # set to True (and fp16 to False) if you're using an A10/A100
    tf32=False,  # set to True if you're using an A10/A100
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    report_to="none",
)

max_seq_length = 512  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    args=training_args,
    dataset_text_field="instructions",
)

trainer_stats = trainer.train()
train_loss = trainer_stats.training_loss
print(f"Training loss:{train_loss}")

peft_model_id = f"{results_dir}/assets"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

with open(f"{results_dir}/results.pkl", "wb") as handle:
    run_result = [
        epochs,
        rank,
        dropout,
        train_loss,
    ]
    pickle.dump(run_result, handle)

You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.


Step,Training Loss
100,1.735


Training loss:1.7350161743164063
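
To relate the batch settings to the number of optimizer steps, the sketch below computes the effective batch size from the training arguments above and gives a rough step estimate. The single-GPU setting and the count of 400 training sequences are assumptions for illustration; with packing, the relevant count is the number of packed blocks, and the trainer's exact rounding may differ.

```python
import math

per_device_train_batch_size = 6
gradient_accumulation_steps = 2
num_devices = 1  # assumed: a single GPU

# Number of sequences that contribute to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)


def estimate_optimizer_steps(num_sequences, epochs):
    """Rough estimate of optimizer steps; the trainer's exact rounding may differ."""
    return math.ceil(num_sequences / effective_batch_size) * epochs


print(effective_batch_size)
print(estimate_optimizer_steps(num_sequences=400, epochs=3))  # 400 is a hypothetical count
```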


# Testing Out Fine-Tuned Model

After completing the fine-tuning process, let's evaluate the performance of the updated model. Testing allows us to verify that the model has indeed learned from the training data and can now perform better on the task at hand.

In [None]:
idx = 0
instruction, label = test_dataset["instructions"][idx], test_dataset["labels"][idx]

In [None]:
# Our test text
instruction



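'### Sentence:I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy. ### Class:'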

In [None]:
# Run the fine-tuned model on it
infer_one_example(model, tokenizer, instruction)

'rec.motorcycles'

In [None]:
# Ground-truth label
label

'rec.autos'
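
A single example says little about overall quality: here the model now outputs a valid class name, but it does not match the ground truth. A simple way to get a broader picture is to compute accuracy over many test examples. The sketch below uses made-up predictions and labels; in the notebook, the predictions would come from calling `infer_one_example` on several entries of `test_dataset["instructions"]`.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the ground-truth label."""
    if not labels:
        return 0.0
    correct = sum(p.strip() == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up predictions and labels, for illustration only
preds = ["rec.motorcycles", "sci.med", "rec.autos ", ""]
golds = ["rec.autos", "sci.med", "rec.autos", "sci.space"]
print(accuracy(preds, golds))
```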

# More Models & Benchmarks!

You can check out complete benchmark results for each model at [our repo](https://github.com/georgian-io/LLM-Finetuning-Hub).

The repo hosts reusable code snippets to use in your experiments!

**Now onto Mariia to tell us how we can serve a trained model**