- **Author:** **Kandimalla Hemanth**
- **Date modified:** **1-13-2024**
- **E-mail:** **speechcodehemanth2@gmail.com**


In [None]:
!pip install -q -U transformers accelerate evaluate deepspeed tqdm datasets peft

## Parameter-Efficient Fine-Tuning (PEFT): A Paradigm Shift in Large Language Model Adaptation

**Abstract:** Parameter-efficient fine-tuning (PEFT) changes how large language models (LLMs) are adapted: instead of updating every weight, as in traditional full fine-tuning, it trains only a small number of parameters. Hugging Face's PEFT library packages several such methods. This document covers the core principles of PEFT, its advantages, and its practical applications across diverse domains.

**Benefits Beyond Efficiency:**

While a reduced computational footprint is a significant advantage, PEFT's true value lies in its **flexibility and versatility**. It empowers practitioners to treat LLMs not as monolithic entities, but as adaptable platforms capable of specializing in specific tasks. Imagine a seasoned chef tailoring their culinary expertise to diverse palates: PEFT offers a similar level of control and precision in LLM adaptation.

**Ideal Use Cases:**

PEFT shines in scenarios where traditional approaches falter:

* **Resource constraints:** Limited compute power or memory? PEFT thrives in such settings, dramatically reducing the resource footprint needed to leverage the power of LLMs.
* **Data scarcity:** Possessing a small, high-value dataset? Because PEFT trains far fewer parameters, it can reduce the risk of overfitting on small datasets.
* **Dynamic needs:** Require your LLM to wear multiple hats? Because the base weights stay frozen, separate lightweight adapters can be trained per task, which helps mitigate catastrophic forgetting.

**Real-World Applications:**

The impact of PEFT extends far beyond the realm of academic research. Its potential spans diverse industries:

* **Personalized AI assistants:** PEFT paves the way for truly personalized AI companions, capable of understanding and responding to individual user preferences.
* **Targeted marketing:** Craft marketing campaigns that resonate deeply with specific customer segments. PEFT equips LLMs to analyze data and generate personalized messaging for improved engagement and conversion rates.
* **Domain-specific expertise:** Need an LLM that speaks the language of medicine, finance, or law? PEFT allows for fine-tuning to specific domains, unlocking a plethora of industry-specific applications.

**Beyond the Immediate:**

The implications of PEFT extend beyond its immediate benefits, fostering:

* **Democratized AI:** By lowering resource barriers, PEFT empowers smaller players and individuals to leverage the power of LLMs, promoting a more inclusive AI landscape.
* **Rapid prototyping:** PEFT's agility facilitates rapid development and iteration of LLM-based applications, accelerating the path from idea to implementation.
* **Responsible AI:** PEFT's fine-grained control over LLM adaptation can support responsible AI development, although it does not by itself remove risks such as bias.

**Conclusion:**

PEFT is not merely a technical innovation; it is a catalyst for a new era of human-AI collaboration. By embracing its transformative potential, we can unlock a future where LLMs are not just powerful, but also adaptable, personalized, and accessible, shaping a world where technology seamlessly complements and enhances human endeavors.




## PEFT: Demystifying Parameter-Efficient Fine-Tuning

**What is PEFT?**

Parameter-Efficient Fine-Tuning (PEFT) is a family of methods for adapting large language models (LLMs), implemented in Hugging Face's PEFT library. Unlike traditional fine-tuning, which updates the entire model, PEFT adjusts a much smaller set of parameters, resulting in:

* **Reduced computational cost:** Less computing power and memory are needed compared to traditional methods.
* **Increased adaptation flexibility:** Fine-tune the LLM for specific tasks without affecting its general knowledge.
* **Improved data efficiency:** Work effectively even with limited training datasets.
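
As a rough illustration of "adjusting a smaller subset of parameters", the sketch below freezes every weight in a small PyTorch model except its final layer and reports the trainable fraction. The toy model, its sizes, and the choice of which layer stays trainable are assumptions made for illustration; they are not something PEFT prescribes.

```python
import torch.nn as nn

# Toy stand-in for a pre-trained network (hypothetical sizes).
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 2),  # small task-specific head
)

# Freeze everything, then re-enable gradients for the head only.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

Only the parameters that still require gradients need optimizer state, which is where most of the memory saving during training comes from.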

**When to use PEFT?**

PEFT shines in situations where traditional fine-tuning is less than ideal:

* **Resource constraints:** Limited computing power or budget? PEFT is more resource-friendly.
* **Small datasets:** Don't have a large training dataset? PEFT prevents overfitting and improves generalization.
* **Dynamic needs:** Need the LLM to handle diverse tasks? PEFT allows for flexible adaptation to avoid "catastrophic forgetting" of previously learned skills.

**Where to use PEFT?**

The applications of PEFT are vast and extend to various domains:

* **Personalized AI assistants:** Imagine an assistant that truly understands your preferences and adapts to your unique needs.
* **Targeted marketing:** Develop personalized marketing campaigns that resonate with specific customer segments.
* **Domain-specific expertise:** Train an LLM to become an expert in a particular field like medicine, finance, or law.
* **Rapid prototyping:** Quickly test and iterate on LLM-based applications with PEFT's agile adaptation capabilities.

**How to use PEFT?**

Hugging Face provides various tools and libraries to implement PEFT, making it accessible to both novice and experienced users. Resources include:

* **Pre-trained PEFT models:** Get started quickly with readily available, fine-tuned models for specific tasks.
* **Fine-tuning APIs:** Fine-tune existing models on your own data with flexible configuration options.
* **Community support:** Leverage the Hugging Face community for tutorials, code examples, and best practices.

**Why does PEFT matter?**

PEFT matters for two related reasons:

* **Parameters:** In PEFT, only a subset of parameters is adjusted. Understanding which parameters matter most for a task helps decide which parts of the model to adapt.
* **Applications:** As mentioned earlier, PEFT has numerous applications across diverse domains. The most relevant ones depend on your specific needs and goals.




# LoRA: Low-Rank Adaptation of Large Language Models

## Introduction

Low-rank adaptation of large language models (LoRA) is a technique designed to address the challenges of fine-tuning massive pre-trained language models. Although it was introduced for language models, the same idea is also of interest in other domains such as computer vision and audio.

## The Challenge of Fine-Tuning Large Language Models

Fine-tuning language models, such as GPT-3 with its 175 billion parameters, is resource-intensive, time-consuming, and can lead to overfitting. LoRA aims to mitigate these challenges by introducing trainable rank decomposition matrices into the model's architecture.

## The LoRA Approach

LoRA freezes the original parameters of the pre-trained model and injects trainable rank decomposition matrices into the layers of the Transformer architecture. Each weight update is represented as the product of two small matrices (written A and B in the implementation steps later in this document), which greatly reduces the number of trainable parameters while leaving the frozen model intact. For GPT-3 175B, the LoRA paper reports a reduction in trainable parameters of about 10,000 times compared with full fine-tuning, that is, from 175 billion to on the order of 17 million.
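
To make the saving concrete, the arithmetic for a single weight matrix can be sketched as below. The hidden size and rank are hypothetical values chosen for illustration, not figures taken from the LoRA paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

d = 4096   # hypothetical hidden size
rank = 8   # hypothetical LoRA rank

full = d * d                                   # values updated by full fine-tuning
low_rank = lora_trainable_params(d, d, rank)   # values updated by LoRA
print(full, low_rank, full // low_rank)
```

The ratio grows with the hidden size and shrinks with the rank, which is why the saving is most dramatic on very large models with small ranks.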

## Implementation Details

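In the original paper, LoRA is applied to the weight matrices of the self-attention module (for example, the query and value projections) while the feed-forward weights stay frozen, although in principle it can be applied to any dense layer. The low-rank product is added to the frozen weights rather than multiplied with them, so the learned update is constrained to a low-rank subspace, allowing it to concentrate on the features that matter for the new task or domain.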
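
In symbols, the low-rank update can be sketched as follows. The notation is an assumption chosen to match the `W_new = W + AB` convention used in the implementation steps of this document (the LoRA paper names the two factors in the opposite order), and the $\alpha / r$ scaling is the factor described in the paper:

$$
h = W x + \frac{\alpha}{r} A B x, \qquad W \in \mathbb{R}^{d \times k},\; A \in \mathbb{R}^{d \times r},\; B \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Here $W$ stays frozen and only $A$ and $B$ are trained, so the number of trainable values per matrix drops from $dk$ to $r(d + k)$.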

## Advantages Over Alternative Methods

LoRA compares favorably with other efficient tuning methods like adapters, prefix-tuning, and PET in several aspects:

- **No Added Inference Latency:** After training, the low-rank product can be merged into the frozen weights, so the deployed model has the same size and speed as the original, unlike methods that introduce extra layers.

- **Few New Hyperparameters:** The main new hyperparameter is the rank of the decomposition matrices. The LoRA paper reports that small ranks already work well, although the best value still depends on the model and the task.

- **No Change in Architecture:** LoRA seamlessly integrates with the existing Transformer architecture, ensuring compatibility and interoperability without altering the model's structure.
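
The "no added latency" point follows from simple linear algebra, which the sketch below checks numerically. The toy sizes and random tensors are assumptions for illustration; the factor names `A` and `B` follow the convention used in the implementation steps of this document.

```python
import torch

torch.manual_seed(0)
d_out, d_in, rank = 6, 4, 2          # hypothetical toy sizes

W = torch.randn(d_out, d_in)         # frozen pre-trained weight
A = torch.randn(d_out, rank)         # trained low-rank factor
B = torch.randn(rank, d_in)          # trained low-rank factor
x = torch.randn(3, d_in)             # batch of 3 inputs

# During training: base path plus a separate low-rank path.
y_unmerged = x @ W.T + (x @ B.T) @ A.T

# For deployment: fold the update into one matrix of the original shape.
W_merged = W + A @ B
y_merged = x @ W_merged.T

print(torch.allclose(y_unmerged, y_merged, atol=1e-5))
```

Because `W_merged` has exactly the shape of `W`, the deployed layer performs one matrix multiplication, the same as the original model.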

## Experimental Results

LoRA has demonstrated competitive or superior results when applied to popular language models like RoBERTa, DeBERTa, GPT-2, and GPT-3 across various natural language understanding and generation tasks.

## Future Directions

Future work could explore integrating tools from probability, statistics, and information theory into LoRA. Investigating its application in other domains, including computer vision and audio processing, is another opportunity for novel contributions.

## Conclusion

LoRA is an efficient technique for adapting large language models to specific tasks or domains. Its reduction in trainable parameters, GPU memory requirements, and storage space, coupled with easy task-switching during deployment, makes it a practical choice for adapting large models.

**References:**
1. [LoRA: Low-Rank Adaptation of Large Language Models - arXiv.org](https://arxiv.org/abs/2106.09685)
2. [LoRA: Low-Rank Adaptation of Large Language Models - GitHub](https://github.com/microsoft/LoRA)
3. [LoRA - Hugging Face](https://huggingface.co/docs/diffusers/main/en/training/lora)
4. [Using LoRA for Efficient Stable Diffusion Fine-Tuning - Hugging Face](https://huggingface.co/blog/lora)
5. [LoRA: Low-Rank Adaptation of Large Language Models - Medium](https://medium.com/serpdotai/lora-low-rank-adaptation-of-large-language-models-82869419b7b8)

# LoRA: Low-Rank Adaptation of Large Language Models

## What is LoRA?

LoRA, which stands for Low-Rank Adaptation, is a technique designed to efficiently fine-tune large language models (LLMs) by reducing the number of trainable parameters while maintaining or even improving model performance. It achieves this by freezing the pre-trained model weights and introducing smaller, trainable rank decomposition matrices into each layer of the Transformer architecture. This method significantly decreases the computational resources required for training and adaptation.

## When to Use LoRA

LoRA is particularly useful when there is a need to fine-tune large models on specific tasks or datasets, especially in scenarios where computational resources are limited. It is also beneficial when rapid adaptation to new tasks is required without the need for extensive retraining.

## Where to Use LoRA

LoRA can be applied across various models and contexts, including but not limited to RoBERTa, DeBERTa, GPT-2, and GPT-3. It is also being used in the open-source community for instruct-tuning LLMs and fine-tuning diffusion models.

## How to Use LoRA

To use LoRA, one would integrate it into their PyTorch models using the provided package from Microsoft. During training, only the LoRA parameters are modified, which are significantly fewer than the original model weights. After fine-tuning, the LoRA weights can be merged with the pre-trained weights for deployment, avoiding additional inference latency.

## Why to Use LoRA

LoRA is used because it offers a more efficient and cost-effective way to fine-tune large language models. It reduces the number of trainable parameters by a factor of up to 10,000 and the GPU memory requirement by a factor of 3, making it feasible to fine-tune on less powerful hardware. Additionally, it provides comparable or better performance than full fine-tuning.

## Why It Is Becoming More Popular

LoRA has gained popularity because it widens access to large-model fine-tuning, allowing practitioners with limited resources to adapt large models without prohibitive costs. Its efficiency and the growing need for task-specific adaptations of pre-trained models in the AI community have both contributed to its adoption.

## Future Improvements

Future work on LoRA may include a better understanding of the low-rank structure of adaptation updates and exploration of its applicability to other neural network types. As implementations mature, their APIs and functionality may continue to evolve, potentially expanding their versatility and ease of use.

In conclusion, LoRA is a transformative technique that enables efficient fine-tuning of large language models, making it a valuable tool for a wide range of applications in the field of AI. Its ability to reduce computational requirements while maintaining high model quality has made it an increasingly popular choice among researchers and practitioners.

To implement LoRA (Low-Rank Adaptation) for Large Language Models, follow these detailed steps:

### Preparing the Environment
1. **Set up your development environment**: Ensure you have Python installed and create a virtual environment for your project to manage dependencies.
2. **Install necessary libraries**: Install PyTorch and any other necessary libraries such as transformers, if you're working with models like BERT or GPT-2.

### Integrating LoRA into the Model
3. **Select a pre-trained language model**: Choose a model that you want to adapt, such as GPT-2 or BERT, which are available in the Hugging Face model repository.
4. **Implement low-rank matrices**: Modify the model's architecture by introducing low-rank matrices A and B which will be trained during the adaptation process. These matrices are smaller and will update the original weight matrices W in a low-rank fashion: `W_new = W + AB`.

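   ```python
   # Minimal sketch of a LoRA update for a single weight matrix
   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class LoRALayer(nn.Module):
       def __init__(self, config):
           super().__init__()
           self.rank = config.lora_rank
           # A starts small and random, B starts at zero, so A @ B = 0
           # and the adapted layer initially matches the pre-trained one.
           self.A = nn.Parameter(torch.randn(config.hidden_size, self.rank) * 0.01)
           self.B = nn.Parameter(torch.zeros(self.rank, config.hidden_size))

       def forward(self, W, x):
           # W is the frozen pre-trained weight; only A and B receive gradients.
           W_lora = W + self.A @ self.B
           return F.linear(x, W_lora)
   ```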

### Training the Model
5. **Freeze pre-trained weights**: Ensure the original weights of the model are frozen and will not be updated during training.
6. **Train low-rank matrices**: Train only the low-rank matrices (A and B) using your task-specific dataset. You can use standard training procedures like backpropagation, but make sure only the LoRA parameters are updated.

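   ```python
   # Sketch of a training loop; `model`, `dataloader`, `criterion`, and
   # `num_epochs` are assumed to be defined, with the base weights already frozen.
   lora_params = [p for p in model.parameters() if p.requires_grad]
   optimizer = torch.optim.AdamW(lora_params, lr=1e-4)  # example learning rate

   for epoch in range(num_epochs):
       for batch in dataloader:
           # Forward pass with LoRA layers
           outputs = model(batch)
           loss = criterion(outputs, batch.labels)

           # Backward pass; only A and B are in the optimizer, so only they change
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()
   ```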
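
For a version of steps 5 and 6 that runs end to end, here is a toy sketch in plain PyTorch. The `ToyLoRALinear` class, the random data, and the hyperparameters are all hypothetical and exist only to show that the frozen weight stays fixed while the low-rank factors train.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyLoRALinear(nn.Module):
    """Hypothetical linear layer with a frozen weight and trainable factors A, B."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))

    def forward(self, x):
        return x @ (self.weight + self.A @ self.B).T

layer = ToyLoRALinear(d_in=8, d_out=8, rank=2)
frozen_before = layer.weight.clone()

# Pass only the parameters that require gradients to the optimizer.
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

x, target = torch.randn(32, 8), torch.randn(32, 8)
losses = []
for _ in range(100):
    loss = nn.functional.mse_loss(layer(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("frozen weight unchanged:", torch.equal(layer.weight, frozen_before))
```

Filtering on `requires_grad` when building the optimizer is the design choice that matters here: the optimizer never allocates state for the frozen weight.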

### Finalizing the Model
7. **Merge the low-rank matrices**: After training, merge the low-rank matrices with the original model weights to create the final adapted model. This step ensures that the model can be used for inference without any modifications to the original structure.

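   ```python
   # Sketch of merging; assumes each layer stores W, A, and B as tensors
   with torch.no_grad():  # in-place edits to weights must not be tracked by autograd
       for layer in model.layers:
           layer.W += layer.A @ layer.B
   ```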

### Evaluating and Using the Model
8. **Evaluate the model**: Test the adapted model on a validation set to ensure that it performs well on the specific task.
9. **Deploy the model**: The adapted model is now ready for deployment. You can use it for inference on your task without additional training.

### Additional Notes
- **Model Checkpointing**: Throughout the training process, regularly save checkpoints of your model to avoid losing progress in case of interruptions.
- **Hyperparameter Tuning**: Experiment with different ranks for the low-rank matrices and other hyperparameters to achieve the best performance on your specific task.
- **Efficiency**: Training with LoRA should need far fewer trainable parameters and less memory than fully fine-tuning the original model. Once merged, the adapted model has the same size as the original.

By following these steps, you will have successfully implemented LoRA to adapt a large language model for your specific NLP task. Remember to consult the original LoRA paper and code repository for more detailed instructions and implementation nuances.



| Paper | Novelty | Importance | Application |
| --- | --- | --- | --- |
| LoRA: Low-Rank Adaptation of Large Language Models | Freezes the pre-trained weights and learns low-rank decomposition matrices that are added to selected weight matrices | Greatly reduces the number of trainable parameters and the memory cost of fine-tuning, with no added inference latency after merging | Natural language understanding and generation tasks |
| Prefix-Tuning: Optimizing Continuous Prompts for Generation | Keeps the language model frozen and optimizes a small sequence of continuous task-specific vectors (the prefix) that are prepended at each layer | Approaches fine-tuning quality on generation tasks while training only a small fraction of the parameters | Table-to-text generation, text summarization |
| P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks | Applies continuous prompts at every layer of the model (deep prompt tuning) rather than only at the input | Achieves performance comparable to fine-tuning across model scales and natural language understanding tasks | Classification, sequence labeling, extractive question answering, etc. |
| GPT Understands, Too | Introduces P-tuning, which learns continuous prompt embeddings in place of hand-written discrete prompts | Shows that GPT-style models can be competitive with BERT-style models on natural language understanding | Knowledge probing, natural language understanding benchmarks |
| The Power of Scale for Parameter-Efficient Prompt Tuning | Prompt tuning: learns soft prompt embeddings prepended to the input while the model stays frozen | Shows that prompt tuning closes the gap with full fine-tuning as model size grows | Natural language understanding tasks |
| AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning | Parameterizes the weight updates in an SVD-like form and adaptively allocates the rank budget across weight matrices according to importance scores | Spends the parameter budget where it matters most, improving efficiency especially at small budgets | Natural language understanding, question answering, natural language generation |
| Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning | Introduces (IA)^3, which rescales activations with learned vectors, and the T-Few recipe built on it | Outperforms in-context learning and reduces the inference cost | Few-shot learning on unseen tasks |
| Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning | Distills knowledge from multiple task-specific source prompts into a single transferable prompt, then adapts it to each target task with low-rank updates | Improves transfer over standard prompt tuning while tuning very few task-specific parameters | Multi-task learning, transfer learning, etc. |
| FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning | Re-parameterizes layer weights as a Hadamard product of low-rank matrices | Reduces the communication overhead of federated learning while keeping model capacity close to the original | Federated learning, distributed optimization, etc. |
| KronA: Parameter Efficient Tuning with Kronecker Adapter | Uses a Kronecker product in place of low-rank projections in adapter modules | Aims to improve on the limited representation power of low-rank updates at a similar parameter budget | Natural language understanding tasks |
| LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models | Quantizes the model while jointly finding a low-rank initialization for LoRA, so that the quantized weights plus the low-rank terms approximate the original weights | Narrows the gap between quantized and full-precision fine-tuning, especially at low bit widths | Natural language understanding and generation tasks |
| Controlling Text-to-Image Diffusion by Orthogonal Finetuning | Orthogonal Finetuning (OFT): learns orthogonal transformations of the pre-trained weights, preserving the pairwise angles between neurons | Adapts text-to-image diffusion models while preserving their semantic generation ability | Subject-driven generation, controllable text-to-image generation |





In [None]:
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)
import torch
from torch.utils.data import DataLoader, Dataset

# Define a custom dataset
class CustomTextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        # Tokenize everything once; tensors have shape (num_texts, max_length)
        self.encodings = tokenizer(
            texts,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding="max_length",
        )

    def __len__(self):
        return self.encodings["input_ids"].size(0)

    def __getitem__(self, idx):
        input_ids = self.encodings["input_ids"][idx]
        attention_mask = self.encodings["attention_mask"][idx]
        # For causal LM training the labels are the inputs; -100 masks padding in the loss
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Sample text data (replace this with your actual data)
texts = ["Example sentence 1", "Example sentence 2", "Example sentence 3"]

# Parameters
max_length = 128
batch_size = 8
num_training_steps = 1000
warmup_steps = 100

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Create dataset and dataloader
dataset = CustomTextDataset(texts, tokenizer, max_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Create a LoRA configuration (LoRA-specific options live on LoraConfig, not PeftConfig)
config = LoraConfig(
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling: updates are multiplied by lora_alpha / r
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Load a base model and wrap it; get_peft_model freezes the base weights
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
peft_model.train()  # Set model to training mode

# Setup optimizer (trainable parameters only) and scheduler
trainable_params = [p for p in peft_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_training_steps)

# Training loop (a single pass over the data; wrap it in an epoch loop for real training)
for step, batch in enumerate(dataloader):
    outputs = peft_model(**batch)  # labels are in the batch, so outputs.loss is set
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

    if (step + 1) % 50 == 0:  # Log every 50 steps
        print(f"Step {step + 1}/{num_training_steps}, Loss: {loss.item()}")

# Save the fine-tuned adapter weights
peft_model.save_pretrained("fine_tuned_model")

# Inference with the fine-tuned model
peft_model.eval()  # Set model to evaluation mode
input_text = "The quick brown fox"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    # Pass keyword arguments so the attention mask is forwarded as well
    outputs = peft_model.generate(**inputs, max_length=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

In [None]:
import gc
import sys

import psutil
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    get_scheduler,
)
from peft import LoraConfig, get_peft_model

# Load dataset and drop the empty lines that WikiText uses as separators
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

# Tokenizer and data collator
tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
# With mlm=False the collator copies input_ids into labels and masks padding
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Preprocessing data
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Accelerator for mixed precision and distributed training
accelerator = Accelerator()

# LoRA attention dimension (rank of the update matrices)
lora_r = 64

# Alpha parameter for LoRA scaling (updates are multiplied by lora_alpha / lora_r)
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

# Load LoRA configuration
config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load base model and wrap it; get_peft_model freezes the base weights
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()

# Prepare everything with accelerator
model, optimizer, dataloader = accelerator.prepare(
    peft_model,
    torch.optim.AdamW(peft_model.parameters(), lr=5e-5),
    DataLoader(tokenized_datasets, shuffle=True, collate_fn=data_collator, batch_size=8)
)

# Scheduler and progress bar
num_training_steps = 10_000
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=500, num_training_steps=num_training_steps
)
progress_bar = tqdm(range(num_training_steps))

# Training loop: stops after num_training_steps so the scheduler and progress bar stay in sync
model.train()
completed_steps = 0
for epoch in range(3):
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        completed_steps += 1
        if completed_steps % 500 == 0:
            # Save only the adapter weights from the unwrapped model
            accelerator.unwrap_model(model).save_pretrained(f"model_epoch_{epoch}_step_{completed_steps}")
            gc.collect()
            torch.cuda.empty_cache()
            if psutil.virtual_memory().percent > 90:
                sys.exit("Exiting to avoid OOM")
        if completed_steps >= num_training_steps:
            break
    if completed_steps >= num_training_steps:
        break

# Inference
model.eval()
prompt = "In a distant future, humans"
inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
with torch.no_grad():
    generated_ids = accelerator.unwrap_model(model).generate(
        **inputs, max_length=100, pad_token_id=tokenizer.pad_token_id
    )
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

**Parameters for LoRA**

**Key Parameters:**

**1. LoRA Rank (lora_r):**

- **Purpose:** Sets the rank `r` of the low-rank update matrices, that is, the inner dimension shared by the two factors. It is also called the "LoRA attention dimension".
- **Value:** 64 (in this case).
- **Impacts:**
    - Model capacity: Higher values make the update more expressive but add trainable parameters and memory.
    - Parameter budget: Lower values impose a stronger low-rank constraint and train fewer parameters.

**2. Alpha Parameter for LoRA Scaling (lora_alpha):**

- **Purpose:** Scales the low-rank update: the update is multiplied by `lora_alpha / lora_r` before it is added to the frozen layer's output.
- **Value:** 16 (in this case), giving a scaling factor of 16 / 64 = 0.25.
- **Impacts:**
    - Update magnitude: Higher values let the adapter change the model's behavior more strongly, while lower values keep it closer to the pre-trained model. The effect interacts with the learning rate.

**3. Dropout Probability for LoRA Layers (lora_dropout):**

- **Purpose:** Regularizes LoRA layers to reduce overfitting and improve generalization.
- **Value:** 0.1 (in this case), meaning each element of the input to the LoRA branch is zeroed with 10% probability during training.
- **Impacts:**
    - Overfitting prevention: Helps keep the adapter from becoming too sensitive to specific training examples.
    - Generalization: Can improve performance on unseen data.
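
To see how these values play out numerically, the sketch below computes the scaling factor and an estimate of the number of LoRA parameters. The GPT-2 shapes (12 blocks, one fused attention projection of 768 to 2304 per block) and the assumption that only this projection is adapted are stated for illustration; check them against the model you actually load.

```python
lora_r = 64
lora_alpha = 16

# LoRA multiplies the low-rank update by lora_alpha / lora_r.
scaling = lora_alpha / lora_r
print(f"scaling = {scaling}")

# Assumed GPT-2 (small) shapes: 12 blocks, fused attention projection 768 -> 2304.
n_layers, d_in, d_out = 12, 768, 2304

# Each adapted matrix gets two factors with r * d_in and r * d_out values.
lora_params = n_layers * lora_r * (d_in + d_out)
print(f"LoRA parameters: {lora_params:,}")
```

Doubling `lora_r` doubles the parameter count and, with `lora_alpha` fixed, halves the scaling factor, so the two values are usually chosen together.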
