# Efficient Fine-Tuning of Large Language Models

Fine-tuning large language models (LLMs) is a critical process that allows these models to adapt to specific tasks or domains. While traditional fine-tuning methods often require significant computational resources, recent advancements in parameter-efficient techniques have made this process more accessible.

#### Parameter-Efficient Fine-Tuning Techniques

1. **Low-Rank Adaptation (LoRA):**  
   LoRA freezes the pre-trained weights and injects trainable low-rank matrices into the model's layers, enabling task-specific adaptation with minimal additional parameters ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)).

2. **Prefix Tuning:**  
   This approach adds task-specific prefixes to condition the model’s attention mechanisms, fine-tuning only a small fraction of parameters ([Li & Liang, 2021](https://arxiv.org/abs/2101.00190)).

3. **Adapter Modules:**  
   Small adapter layers are inserted into the model architecture and trained for specific tasks while the core model remains frozen, reducing the number of parameters that must be trained and stored per task ([Houlsby et al., 2019](https://arxiv.org/abs/1902.00751)).

These techniques are designed to reduce the computational and memory demands of fine-tuning, making it feasible on smaller hardware setups.
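
To make the LoRA idea concrete, here is a minimal sketch in plain Python. It assumes the formulation from the LoRA paper, where the adapted output is `W x + (alpha / r) * B (A x)` with `B` initialized to zero. The helper names (`matvec`, `lora_forward`) and the tiny matrix sizes are illustrative choices, not part of any library.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x), the LoRA-adapted output."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # low-rank path: k -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, alpha = 4, 6, 2, 2  # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, starts at zero
x = [1.0] * k

# With B = 0 the adapter contributes nothing, so training starts from the base model.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

# Parameter comparison for a hypothetical 4096 x 4096 weight matrix at rank 16.
full_params = 4096 * 4096
lora_params = 16 * (4096 + 4096)
print(f"full update: {full_params:,} parameters, LoRA update: {lora_params:,} parameters")
```

Only `A` and `B` receive gradients, which is why the number of trainable parameters grows with `r * (d + k)` rather than `d * k`.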

---

#### Efficient Fine-Tuning Frameworks

Several tools and frameworks have been developed to make fine-tuning more efficient, combining techniques such as quantization, optimized backpropagation, and reduced memory usage.

Unsloth is one such framework. It supports 4-bit quantization to reduce memory usage, optimized training kernels for speed, and a range of models such as Llama-3 and Mistral ([Unsloth Documentation](https://docs.unsloth.ai/)).


---

#### References

1. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. *arXiv preprint arXiv:2106.09685*.  
2. Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. *arXiv preprint arXiv:2101.00190*.  
3. Houlsby, N., Giurgiu, A., Jastrzębski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. *arXiv preprint arXiv:1902.00751*.  
4. Unsloth Documentation: [https://docs.unsloth.ai/](https://docs.unsloth.ai/)

---

#### About This Tutorial

In this tutorial, we will explore the principles of efficient fine-tuning, demonstrate how frameworks like Unsloth implement these principles, and provide hands-on examples to apply these techniques in real-world scenarios. By the end, you will have a clear understanding of how to fine-tune LLMs efficiently for your specific tasks.


### Step 1: Open Your Google Drive and Colab

1. **Google Drive**: Go to [Google Drive](https://drive.google.com) and log in with your account.
2. **Google Colab**: Open [Google Colab](https://colab.research.google.com) to run and edit Python notebooks in the cloud.
3. **Runtime Setup**: In Colab, go to `Runtime` > `Change runtime type` and select **GPU (T4)** under the hardware accelerator options for efficient computation.
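
After switching the runtime, it can help to confirm that a GPU is actually attached before installing anything. The sketch below is one possible check using only the standard library; it assumes the `nvidia-smi` tool is on the `PATH` (as it normally is on GPU runtimes), and `gpu_summary` is a hypothetical helper name.

```python
import shutil
import subprocess

def gpu_summary():
    """Return the GPU name and total memory reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tools on PATH, so probably not a GPU runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

summary = gpu_summary()
print(summary if summary else "No GPU detected: check Runtime > Change runtime type.")
```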


### Step 2: Installing Required Dependencies for Unsloth

This script installs and upgrades the necessary libraries to work with the **Unsloth** framework effectively.

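#### Code Explanation:

1. **`%%capture`**:
   - Suppresses the output of the cell to keep the notebook clean.

2. **Installing Required Libraries**:
   ```bash
   !pip install unsloth "xformers==0.0.28.post2" codecarbon google transformers
   ```

3. **Upgrading Unsloth**:
   - The cell below installs `unsloth` from PyPI, then uninstalls it and reinstalls the latest version directly from the GitHub repository. The `--no-deps` flag leaves the already installed dependencies untouched.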

In [6]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [7]:
#check installation
!pip show unsloth

Name: unsloth
Version: 2024.12.2
Summary: 2-5X faster LLM finetuning
Home-page: http://www.unsloth.ai
Author: Unsloth AI team
Author-email: info@unsloth.ai
License: Apache License, Version 2.0

### Step 3: Loading Pre-trained Models with Unsloth

The following script demonstrates how to load a pre-trained model using Unsloth's `FastLanguageModel`, a utility for efficiently loading and using pre-trained language models. It handles model loading, configuration, and tokenizer setup in a single call, which makes it convenient for tasks like text generation, classification, and summarization.

We can use a variety of pre-trained models suited to different applications.
Supported models include **Llama 3.1 (8B), Llama 3.2 (1B + 3B), Mistral NeMo (12B), Gemma 2 (9B), Qwen 2.5 (7B), Mistral Small (22B), Phi-3.5 (Mini), Llama 3 (8B), Mistral v0.3 (7B), Phi-3 (Medium), Qwen2 (7B), Gemma 2 (2B), TinyLlama**. For more details, see the [Model Documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks).


The loading cell uses the following settings.

#### Configuration:
- **`max_seq_length`**: Maximum input sequence length in tokens (2048).
- **`dtype`**: Data type for computation. `None` lets Unsloth detect it automatically (float16 on a Tesla T4, bfloat16 on GPUs that support it).
- **`load_in_4bit`**: Loads the weights with 4-bit quantization to reduce memory usage.

In this example, we load the 3B-parameter Llama 3.2 model in a pre-quantized 4-bit variant (`unsloth/Llama-3.2-3B-bnb-4bit`), along with its tokenizer. You can change the model and data as required.
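
To see why 4-bit loading matters on a 15 GB GPU, the following back-of-the-envelope sketch estimates the memory taken by the weights alone. It assumes a round figure of 3 billion parameters and ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so treat the numbers as rough lower bounds. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 3e9  # assumed round figure for a 3B-parameter model

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 6 GB in float16 to roughly 1.5 GB in 4-bit, leaving more room for activations and the LoRA training state.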


In [8]:
from unsloth import FastLanguageModel

# Configuration parameters
max_seq_length = 2048  # Maximum sequence length for inputs
dtype = None           # Data type for model weights (default: None)
load_in_4bit = True    # Use 4-bit precision to optimize memory and speed

# Load the pre-trained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-bnb-4bit",  # Specify the model
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Step 4: Configuring a PEFT Model

The following code configures a parameter-efficient fine-tuning (PEFT) model using the `FastLanguageModel.get_peft_model` method.

#### Purpose
The `FastLanguageModel.get_peft_model` function attaches LoRA (Low-Rank Adaptation) adapters to the base model, so that only the small adapter matrices are trained while the original weights stay frozen. This reduces resource usage while aiming to preserve performance.

#### Parameters
1. **`model`**: The base model to be fine-tuned. This is typically a pre-trained large language model.
2. **`r`**: The rank of the low-rank adaptation matrices. This value determines the number of trainable parameters, balancing efficiency and flexibility. In this example, `r=16`.
3. **`target_modules`**: A list of modules within the model to which LoRA is applied. `q_proj`, `k_proj`, `v_proj`, and `o_proj` are the attention projections; `gate_proj`, `up_proj`, and `down_proj` are the feed-forward (MLP) projections.
4. **`lora_alpha`**: A scaling factor for the LoRA layers. The adapter output is multiplied by `lora_alpha / r`, so with `lora_alpha = 16` and `r = 16` the scale is 1.
5. **`lora_dropout`**: Specifies the dropout rate for the LoRA layers. A value of `0` means no dropout is applied.
6. **`bias`**: Determines how biases are treated during fine-tuning:
   - `"none"`: Biases remain frozen.
   - `"all"`: All biases are trainable.
   - `"lora_only"`: Only the biases of the LoRA-adapted modules are trainable.
7. **`use_gradient_checkpointing`**: Trades compute for memory by storing only some activations during the forward pass and recomputing the rest during backpropagation. The `"unsloth"` value selects Unsloth's own implementation.
8. **`random_state`**: A seed value (e.g., `3407`) used to make the adapter initialization reproducible.
9. **`use_rslora`**: Enables rank-stabilized LoRA, which scales the adapter output by `lora_alpha / sqrt(r)` instead of `lora_alpha / r`. In this case, it is set to `False`.
10. **`loftq_config`**: Optional configuration for LoftQ, a method for initializing LoRA weights for a quantized base model. `None` disables it.
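
The effect of `r`, `target_modules`, and the two scaling rules can be estimated with a few lines of arithmetic. This sketch assumes the following layer dimensions for Llama 3.2 3B, which are not stated in this tutorial and should be checked against `model.config`: hidden size 3072, MLP size 8192, key/value projection width 1024, and 28 layers. The helper names are hypothetical.

```python
import math

# Assumed dimensions for Llama 3.2 3B (verify against model.config).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3072, 8192, 1024, 28

# (input features, output features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

def lora_scale(lora_alpha, r, use_rslora=False):
    """Standard LoRA scales by alpha / r; rank-stabilized LoRA by alpha / sqrt(r)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

print(f"trainable LoRA parameters at r=16: {lora_trainable_params(16):,}")
print(f"scale with standard LoRA: {lora_scale(16, 16)}")
print(f"scale with rsLoRA:        {lora_scale(16, 16, use_rslora=True)}")
```

Under these assumed dimensions, `r = 16` yields about 24 million trainable parameters, a small fraction of the roughly 3 billion in the base model. Doubling `r` doubles that count.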



In [9]:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)


Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


### Step 5: Sentiment Analysis Use Case

This workflow processes a sentiment dataset into conversation-style inputs for a chat-based LLM. It uses the `unsloth` library to apply the chat template and the Hugging Face `datasets` library to format the data. The objective is to build prompt and response pairs for fine-tuning on sentiment analysis.

#### Overview

1. **Dataset Preparation**  
   The script loads a sentiment dataset into a Pandas DataFrame. The cell reads `sample.csv`; the commented-out lines show how to load the full `dataset-sentiment.csv` and randomly sample 1,000 rows for manageable processing.

2. **Label Conversion**  
   Sentiment labels (`positive`, `neutral`, `negative`) are mapped to numerical values (`2`, `1`, `0`) in a `label_num` column. The conversations themselves use the text label.

3. **Prompt Creation**  
   A custom function generates detailed prompts for each text entry. These prompts ask the model to analyze sentiment based on:
   - Emotional language
   - Positive/negative expressions
   - Tone shifts or balanced tone

4. **Response Generation**  
   A predefined function fills a fixed three-line template with the upper-cased sentiment label. The reasoning text is generic and does not vary with the input.

5. **Conversation Structuring**  
   Each dataset entry is transformed into a structured conversation:
   - **System Role**: Sets the context for the model (e.g., "You are a financial sentiment analyzer").
   - **User Role**: Provides the sentiment analysis prompt.
   - **Assistant Role**: Supplies the sentiment assessment response.

6. **Conversion to Hugging Face Dataset**  
   The DataFrame is converted into a Hugging Face `Dataset` so it can be mapped and passed to the trainer.

7. **Chat Template Formatting**  
   Each conversation is rendered to a single string with the `llama-3.1` chat template (attached to the tokenizer by `get_chat_template`), so every example uses the role markers the model expects.

#### Final Output

The script produces a HuggingFace-compatible dataset with conversation-style prompts and responses. These are ready for use in training or evaluating sentiment analysis tasks.
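
To give a feel for what a chat template does, here is a hand-written approximation of the Llama 3 style layout, in which each message is wrapped in header and end-of-turn markers. It is a simplified illustration only: the real template applied by `tokenizer.apply_chat_template` may add further content to the system block, so always use the tokenizer's output for training. `render_llama3_style` is a hypothetical helper and the example messages are invented.

```python
def render_llama3_style(conversation):
    """Join messages using Llama 3 style header and end-of-turn markers (simplified)."""
    parts = ["<|begin_of_text|>"]
    for message in conversation:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return "".join(parts)

# Invented example conversation in the same three-role structure as the dataset.
example = [
    {"role": "system", "content": "You are a financial sentiment analyzer."},
    {"role": "user", "content": "Text: Profits rose this quarter."},
    {"role": "assistant", "content": "1. POSITIVE - Analysis based on the text's indicators and tone."},
]
rendered = render_llama3_style(example)
print(rendered)
```

The `<|start_header_id|>user<|end_header_id|>` and `<|start_header_id|>assistant<|end_header_id|>` markers are the same delimiters that are later used to separate instructions from responses.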


In [13]:
import pandas as pd
from datasets import Dataset
from unsloth.chat_templates import get_chat_template

# Apply chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

# Load and prepare dataset
# df = pd.read_csv("/content/dataset-sentiment.csv")
df = pd.read_csv("/content/sample.csv")

# Sample 1000 random rows
# df = df.sample(n=1000, random_state=42)

# Convert labels to numerical format
label_map = {'positive': 2, 'neutral': 1, 'negative': 0}
df['label_num'] = df['label'].map(label_map)

def create_prompt(text):
    return f"""Assess the sentiment of the following text by identifying the presence of sentiment indicators such as emotional language, positive or negative expressions, and tone shifts.

If you find strong sentiment indicators, mark them accordingly and provide a reasoning for why the sentiment is positive, negative, or neutral.

Now, assess the following text:

Text: {text}

Sentiment Indicators Checklist:
- Emotional Language: Uses words that convey strong feelings, such as joy, anger, sadness, or excitement.
- Positive Expressions: Words or phrases that promote positive feelings, praise, or optimism.
- Negative Expressions: Words or phrases that convey criticism, pessimism, or negative attitudes.
- Tone Shifts: Noticeable changes in tone that affect how the content is perceived, potentially altering the sentiment.
- Balanced or Neutral Tone: Absence of strong emotional language, implying a more neutral or objective sentiment.

Provide three separate assessments of the sentiment in the following text. Each assessment should be in the format:

[NUMBER]. [SENTIMENT] - [REASONING]

Where [NUMBER] is 1, 2, or 3, [SENTIMENT] is 'Positive', 'Negative', or 'Neutral', and [REASONING] is a one-line explanation for the chosen sentiment."""

def create_response(label):
    return f"""1. {label.upper()} - Analysis based on the text's indicators and tone.
2. {label.upper()} - Evaluation of language and contextual elements.
3. {label.upper()} - Assessment of overall sentiment impact."""

# Create conversation format
def create_conversation(row):
    return [
        {"role": "system", "content": "You are a financial sentiment analyzer. Your task is to assess text sentiment with detailed reasoning."},
        {"role": "user", "content": create_prompt(row['text'])},
        {"role": "assistant", "content": create_response(row['label'])}
    ]

df['conversations'] = df.apply(create_conversation, axis=1)

# Convert to HuggingFace dataset
dataset = Dataset.from_pandas(df)

# Format prompts
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
            for convo in convos]
    return {"text": texts}

# Format the dataset
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=dataset.column_names
)

Map:   0%|          | 0/627 [00:00<?, ? examples/s]

### Step 6: Fine-Tuning with `SFTTrainer`

This workflow demonstrates how to fine-tune a model using the `SFTTrainer` class from the `trl` library. The configuration optimizes for small batch sizes, mixed-precision training, and efficient sequence handling.

#### Key Components

#### Libraries and Utilities
- **`SFTTrainer`**: Simplifies supervised fine-tuning for transformer models.
- **`TrainingArguments`**: Manages hyperparameters and training configurations.
- **`DataCollatorForSeq2Seq`**: Pads the inputs and labels of each batch to a common length.
- **`is_bfloat16_supported`**: Dynamically checks if the current environment supports BF16 precision.

---

#### Workflow Breakdown

1. **Model and Dataset**  
   - The `SFTTrainer` requires a `model`, `tokenizer`, and `train_dataset`. The dataset's text field is specified using `dataset_text_field`.

2. **Data Collation**  
   - `SFTTrainer` tokenizes the text and truncates it to `max_seq_length`; `DataCollatorForSeq2Seq` then pads the inputs and labels in each batch to a common length.

3. **Training Configuration**  
   - **Key Parameters**:
     - `per_device_train_batch_size`: Number of samples per device in each forward/backward pass (set to `2`).
     - `gradient_accumulation_steps`: Number of batches whose gradients are accumulated before each optimizer step (set to `4`, giving an effective batch size of 2 × 4 = 8 per device).
     - `warmup_steps`: Number of initial steps over which the learning rate ramps up to its peak (`5`).
     - `learning_rate`: Peak learning rate for the optimizer (`2e-4`).
     - `fp16` and `bf16`: Mixed-precision settings for faster training and lower memory usage, chosen automatically based on hardware support.
     - `max_steps`: Maximum number of optimizer steps (`60` in this example).
     - `logging_steps`: Frequency of training logs (every `1` step).
     - `optim`: Optimizer type (`adamw_8bit`, which keeps optimizer state in 8-bit to save memory).
     - `weight_decay`: Regularization parameter for weight decay (`0.01`).
     - `lr_scheduler_type`: Learning rate scheduler (`linear`).
     - `seed`: Ensures reproducibility (`3407`).
     - `output_dir`: Directory to save outputs (`outputs`).

4. **Sequence Packing**  
   - Disabled here (`packing = False`), so each example is its own sequence. When enabled, packing combines several short examples into one sequence of up to `max_seq_length` tokens, which according to the comment in the code can make training up to 5x faster on short sequences.

5. **Output and Reporting**  
   - Outputs are written locally to `output_dir`, and nothing is reported to external tools such as WandB (`report_to = "none"`). Change this setting to integrate a monitoring platform.
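
The arithmetic behind these settings can be checked with a short sketch. It uses the values from the trainer cell and the 627 rows shown in the dataset's `Map` progress bar, and assumes a single GPU. The `linear_schedule_lr` function is a hypothetical re-implementation of the shape of a linear warmup and decay schedule, written for illustration rather than taken from the library.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
warmup_steps = 5
learning_rate = 2e-4
num_examples = 627  # row count shown in the Map progress bar

# One optimizer step consumes batch_size * accumulation_steps examples (single GPU).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps
print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen} "
      f"(~{examples_seen / num_examples:.2f} epochs)")

def linear_schedule_lr(step):
    """Linear warmup from 0 to the peak rate, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return learning_rate * remaining / max(1, max_steps - warmup_steps)

for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

With these values, 60 steps cover 480 examples, which is less than one pass over the 627-row dataset. That is worth keeping in mind when judging the results of such a short run.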

---

#### Advantages of this Configuration

1. **Efficient Training**:  
   - Small batch sizes and gradient accumulation allow training on systems with limited GPU memory.
   - Mixed precision (`fp16` or `bf16`) accelerates training while saving memory.

2. **Dynamic Precision Handling**:  
   - Automatically determines the best precision (FP16 or BF16) based on hardware support.

3. **Customizable and Lightweight**:  
   - Parameters like `max_steps`, `learning_rate`, and `logging_steps` allow flexible tuning for small or large-scale training runs.

4. **Optimized for Short Sequences**:  
   - Option to enable sequence packing to maximize GPU utilization and reduce training time.

---

#### Notes

- **Extending Training**: To train by epochs instead of steps, uncomment `num_train_epochs` and remove `max_steps`; when both are set, `max_steps` takes precedence.
- **Monitoring Tools**: Replace `"none"` in `report_to` with a tool such as `"wandb"` for real-time tracking and visualization.
- **Sequence Packing**: Disabled in this example; consider enabling it for datasets made up of very short sequences.

This setup is ideal for fine-tuning LLMs on modest hardware while maintaining flexibility for advanced configurations.
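
The idea behind sequence packing can be illustrated with a simple greedy grouping of example lengths. This is only a sketch of the concept, not the packing algorithm used by `trl`. The token lengths are made up and `pack_lengths` is a hypothetical helper.

```python
def pack_lengths(lengths, max_seq_length):
    """Greedily group example lengths so each group fits in max_seq_length tokens."""
    packs, current, used = [], [], 0
    for length in lengths:
        length = min(length, max_seq_length)  # over-long examples are truncated
        if used + length > max_seq_length and current:
            packs.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        packs.append(current)
    return packs

token_lengths = [300, 450, 120, 900, 700, 250, 1500, 80]  # made-up example lengths
packs = pack_lengths(token_lengths, max_seq_length=2048)

print(f"{len(token_lengths)} examples -> {len(packs)} packed sequences")
for pack in packs:
    print(f"  {pack} uses {sum(pack)} of 2048 tokens")
```

Fewer, fuller sequences mean fewer padded positions per batch, which is where the speed-up on short examples comes from.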


In [14]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_dataset,  # chat-formatted "text" column built in Step 5
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/627 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


### Training on Responses Only with `unsloth.chat_templates`

The `train_on_responses_only` utility from `unsloth.chat_templates` restricts the training loss to the assistant's responses while keeping the user instructions in the input as context. This is useful when fine-tuning conversational models, where the goal is to improve the responses rather than to teach the model to reproduce prompts.

---

#### Functionality Overview

- **Purpose**: Masks the labels of everything outside the assistant's responses so those tokens do not contribute to the loss; the full conversation is still fed to the model as input.
- **Integration**: Applied to an already configured `SFTTrainer`; the call returns the trainer to use for training.

---

#### Key Components

#### Parameters

1. **`trainer`**  
   - The existing trainer instance configured for supervised fine-tuning.
   - Carries the model, tokenizer, dataset, and training arguments.

2. **`instruction_part`**  
   - Specifies the token delimiter for user instructions.  
   - Example: `<|start_header_id|>user<|end_header_id|>\n\n` marks the user input section.

3. **`response_part`**  
   - Specifies the token delimiter for assistant responses.  
   - Example: `<|start_header_id|>assistant<|end_header_id|>\n\n` marks the assistant's response section.

---

#### Workflow

1. **Preparation**  
   - Ensure the training dataset contains conversations formatted with clear markers for the user and assistant roles.  
   - The delimiters (`instruction_part` and `response_part`) tell the utility where each user turn and each assistant turn begins.

2. **Applying the Utility**  
   - Pass the trainer instance and delimiters to `train_on_responses_only`.  
   - The model still reads the whole conversation, but only the assistant's tokens count toward the loss.

3. **Training Process**  
   - When the returned trainer is run, gradient updates are driven by the assistant's part of each conversation, which is the part the model will need to generate.

---

#### Example Use Case

In a typical chat-based training scenario:
- The dataset contains structured conversations between a user and an assistant.
- The assistant’s responses are the primary output of interest.
- With `train_on_responses_only`, the loss covers only the response part, so training effort goes into the output that matters.
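
The masking idea can be sketched on a toy sequence. This is a conceptual illustration, not Unsloth's implementation: whole strings stand in for token IDs, the markers are shortened placeholders, and `mask_non_response` is a hypothetical helper. It relies on the convention that a label of `-100` is ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def mask_non_response(tokens, instruction_marker, response_marker):
    """Return labels that keep only the tokens following a response marker."""
    labels, in_response = [], False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels

# Toy "tokens": strings stand in for token IDs to keep the example readable.
tokens = ["<user>", "Assess", "this", "text", "<assistant>", "1.", "POSITIVE", "<eot>"]
labels = mask_non_response(tokens, "<user>", "<assistant>")
print(labels)
# [-100, -100, -100, -100, -100, '1.', 'POSITIVE', '<eot>']
```

The inputs are unchanged, so the model still conditions on the prompt; only the loss targets are restricted to the response.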

---


In [15]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/627 [00:00<?, ? examples/s]

In [18]:
#@title Show current memory stats
# Display the GPU name, total memory capacity, and peak memory reserved so far
# in this session, to help monitor resource utilization.
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.086 GB of memory reserved.
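
To put the reserved figure in context, the two numbers printed above can be turned into a percentage and a remaining headroom. The sketch below plugs in the values from this run; `memory_report` is a hypothetical helper.

```python
def memory_report(reserved_gb, total_gb):
    """Return (percent of GPU memory reserved, GB still unreserved)."""
    percent = round(reserved_gb / total_gb * 100, 1)
    free_gb = round(total_gb - reserved_gb, 3)
    return percent, free_gb

percent, free_gb = memory_report(reserved_gb=5.086, total_gb=14.748)
print(f"{percent}% of GPU memory reserved, {free_gb} GB still available.")
```

Roughly a third of the T4's memory is reserved at this point, before any training step has run.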
