# PEFT - LoRA Basics

**bitsandbytes**

Default version requires CUDA. Experimental versions available for different CPU.

https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

**Inspired by**

https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_opt_bnb_peft.ipynb



In [10]:
# !pip install bitsandbytes

In [11]:

import os

import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Load model
# "facebook/opt-6.7b"

model_name = "facebook/opt-350m"

# Load with 8 bit quantization
model = AutoModelForCausalLM.from_pretrained(model_name) #, quantization_config=BitsAndBytesConfig(load_in_8bit=True))

tokenizer = AutoTokenizer.from_pretrained(model_name)

## Model layers


In transformer architectures, the modules that can be configured with **LoRA** often depend on the model's specific implementation. While the commonly known modules are `q_proj`, `k_proj`, and `v_proj`, there are additional modules in transformers that can also benefit from LoRA adaptation.

Here’s a more comprehensive list of LoRA-compatible layers typically found in transformer models:

---

### Common LoRA-Configurable Layers:
1. **`q_proj`**: Query projection in the attention mechanism.
2. **`k_proj`**: Key projection in the attention mechanism.
3. **`v_proj`**: Value projection in the attention mechanism.
4. **`out_proj`**: Output projection in the attention mechanism.
5. **`fc1` / `dense_h_to_4h`**: First fully connected layer in the feedforward network (FFN).
6. **`fc2` / `dense_4h_to_h`**: Second fully connected layer in the feedforward network (target_layers)

### Explanation of Additional Layers:
1. **`out_proj`**:
   - After the attention mechanism calculates its weighted sum, `out_proj` projects the result back into the model's embedding space.

2. **`fc1` and `fc2`**:
   - These are the layers in the feedforward network within each transformer block.
   - Adapting these with LoRA can improve performance in certain downstream tasks.

3. **Other Custom Layers**:
   - Some transformer variants might use different naming conventions (`dense_h_to_4h`, `dense_4h_to_h` in OpenAI's GPT-like architectures) or add additional modules that could be LoRA-configured.

---

### Practical Tips:
- Check the model’s documentation or inspect its architecture to find precise names for the layers.
- Use filtering logic (e.g., keywords like `proj`, `fc`, `dense`) to identify potential layers.
- Test LoRA configurations iteratively to see the impact of adapting additional layers.

This approach ensures you maximize the effectiveness of LoRA while minimizing unnecessary adaptations.

In [60]:
# import torch.nn as nn

# Dumps a list of layers
def list_all_layers():
    # Returns an iterator over all modules in the model, along with their names
    modules = model.named_modules()
    for name, module in modules:
        print(name)

# Read model documentation to learn the keywords
def list_lora_compatible_layers():
    num_lora_compatible_layers = 0
    for name, module in model.named_modules():
        if any(keyword in name for keyword in ["proj", "fc", "dense"]):  # Adapt this as needed
            num_lora_compatible_layers = num_lora_compatible_layers + 1

    print("LoRA-compatible layers:", num_lora_compatible_layers)

def count_lora_layers():
    modules = model.named_modules()
    num_layers = 0
    lora_layers_k = 0
    lora_layers_q = 0
    lora_layers_v = 0
    lora_layers_out = 0
    for name, module in modules:
        num_layers = num_layers + 1
        if 'k_proj' in name:
            lora_layers_k = lora_layers_k + 1
        if 'q_proj' in name:
            lora_layers_q = lora_layers_q + 1  
        if 'v_proj' in name:
            lora_layers_v = lora_layers_v + 1  
        if 'out_proj' in name:
            lora_layers_out = lora_layers_out + 1  

    print("Number of layers in the model = ", num_layers)
    print("LORA layers (k_proj, q_proj, v_proj, out_proj) :  ", lora_layers_k, lora_layers_q, lora_layers_v, lora_layers_out )

In [64]:
# list_all_layers()

list_lora_compatible_layers()

print("--------------------------------")

count_lora_layers()



LoRA-compatible layers: 626
--------------------------------
Number of layers in the model =  791
LORA layers (k_proj, q_proj, v_proj, out_proj) :   24 264 264 24


## Trainable parameters 

Primarily determined by the value assigned to **r**

* **r**: Current rank value in the iteration.
* **lora_alpha**: A scaling factor to balance the effect of LoRA updates.
* **target_modules**: Specifies the modules of the base model to which LoRA is applied. Here, q_proj and v_proj are likely related to query and value * projections in transformer architectures.
* **lora_dropout:** Dropout probability for regularization of LoRA parameters.
* **bias:** Defines how biases in the original model should be treated. "none" means no additional bias parameters are added.
* **task_type**: Specifies the task type. "CAUSAL_LM" indicates this is a causal language modeling task (e.g., autoregressive text generation).


### Task types
The `task_type` parameter in LoRA (Low-Rank Adaptation) specifies the type of task for which the model is being fine-tuned. The specific values of `task_type` determine how the adaptation layers are configured and integrated into the model. Below are the most common `task_type` values supported by LoRA implementations, such as those in the [PEFT](https://github.com/huggingface/peft) library:

---

### List of Supported `task_type` Values:

1. **`CAUSAL_LM`** (Causal Language Modeling)
   - Used for autoregressive tasks where the model predicts the next token in a sequence.
   - Example models: GPT, GPT-2, GPT-3, Codex.
   - Applications: Text generation, code generation.

2. **`SEQ2SEQ_LM`** (Sequence-to-Sequence Language Modeling)
   - Used for tasks that involve input-output sequence transformations.
   - Example models: T5, BART.
   - Applications: Machine translation, text summarization.

3. **`TOKEN_CLASSIFICATION`**
   - Used for token-level tasks where each token in the input receives a label.
   - Example models: BERT, RoBERTa.
   - Applications: Named Entity Recognition (NER), Part-of-Speech tagging.

4. **`SEQ_CLASSIFICATION`** (Sequence Classification)
   - Used for tasks where an entire sequence is classified into a single category.
   - Example models: BERT, RoBERTa.
   - Applications: Sentiment analysis, spam detection, document classification.

5. **`MULTIPLE_CHOICE`**
   - Used for tasks where the model selects the correct choice from multiple options based on a context.
   - Example models: RoBERTa, DeBERTa.
   - Applications: Question answering in a multiple-choice format (e.g., SWAG, RACE).

6. **`QUESTION_ANSWERING`**
   - Used for extractive question-answering tasks where the model predicts a span of text from the input context.
   - Example models: BERT, DistilBERT.
   - Applications: SQuAD-style tasks, FAQ extraction.

7. **`IMAGE_CLASSIFICATION`**
   - Used for image classification tasks in vision models.
   - Example models: ViT, ConvNext.
   - Applications: Object detection, image categorization.

8. **`FEATURE_EXTRACTION`**
   - Used for extracting intermediate features from a model without specific downstream task adaptation.
   - Applications: Embedding generation, unsupervised feature analysis.

9. **`UNSPECIFIED`**
   - A general or fallback category for tasks not explicitly defined in the library.
   - Applications: Custom tasks requiring specialized configurations.

---

### How to Find Supported `task_type` Values:
If you're using a library like PEFT, consult its [documentation](https://github.com/huggingface/peft) or the source code to confirm all supported `task_type` values. You can also check the library's implementation for class definitions or enumerations that list these values.



In [73]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params r={r}: {trainable_params} || all params: {all_param} || trainable: {round(100 * trainable_params / all_param,2)}%"
    )

In [74]:
from peft import LoraConfig, get_peft_model

for r in (1,8,16,32,64,128):
    config = LoraConfig(
        r=r, 
        lora_alpha=32, 
        target_modules=["q_proj","v_proj"], 
        lora_dropout=0.05, 
        bias="none", 
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, config)
    print_trainable_parameters(model)

trainable params r=1: 98304 || all params: 331294720 || trainable: 0.03%
trainable params r=8: 786432 || all params: 331982848 || trainable: 0.24%
trainable params r=16: 1572864 || all params: 332769280 || trainable: 0.47%
trainable params r=32: 3145728 || all params: 334342144 || trainable: 0.94%
trainable params r=64: 6291456 || all params: 337487872 || trainable: 1.86%
trainable params r=128: 12582912 || all params: 343779328 || trainable: 3.66%
