## Loading a PEFT Model
To load and use a PEFT adapter model with Hugging Face Transformers, first make sure the Hub repository or local directory contains an `adapter_config.json` file and the adapter weights. We can then load the PEFT adapter model using an `AutoModelFor...` class (here `AutoModelForCausalLM`). This takes two steps:

1. Specify the PEFT `model_id`.
2. Pass it to the `AutoModelFor...` class via `from_pretrained()`.
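Before loading from a local directory, it can help to confirm that the expected files are present. The sketch below is only an illustration: `describe_adapter_dir` is a hypothetical helper (not part of Transformers or PEFT), and it assumes the conventional file names `adapter_config.json`, `adapter_model.safetensors`, and `adapter_model.bin`.

```python
import json
from pathlib import Path

# File names PEFT conventionally uses when saving an adapter (assumed here).
CONFIG_NAME = "adapter_config.json"
WEIGHT_NAMES = ("adapter_model.safetensors", "adapter_model.bin")


def describe_adapter_dir(path):
    """Return a small report on whether `path` looks like a saved PEFT adapter."""
    root = Path(path)
    config_path = root / CONFIG_NAME
    has_config = config_path.is_file()
    weights = [name for name in WEIGHT_NAMES if (root / name).is_file()]
    report = {
        "has_config": has_config,
        "weight_files": weights,
        "loadable": has_config and bool(weights),
    }
    if has_config:
        config = json.loads(config_path.read_text())
        # Keys usually found in adapter_config.json; missing keys become None.
        report["peft_type"] = config.get("peft_type")
        report["base_model"] = config.get("base_model_name_or_path")
    return report
```

Calling `describe_adapter_dir("path/to/adapter")` returns a dictionary whose `loadable` entry is `True` only when both the config and at least one weight file are found.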

In [1]:
## we will load the PEFT adapter for causal language modeling
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id="ybelkada/opt-350m-lora"
model=AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=peft_model_id)



In [2]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=16, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=16, out_features=1024, bias=Fa

In [3]:
params=model.parameters()

num_params=sum(p.numel() for p in params)

print('Total parameters (base model + adapter): ', num_params)

Total parameters (base model + adapter):  332769280
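The count above covers every parameter returned by `model.parameters()`, so it includes the frozen base weights as well as the LoRA matrices. As a rough worked example, the size of the LoRA part can be estimated from the shapes in the printed model (`lora_A`: 1024 to 16, `lora_B`: 16 to 1024, 24 decoder layers). The helper name `lora_param_count` is made up for this sketch, and because the printout is truncated, only `v_proj` is counted here; other modules may carry adapters too.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


# Shapes read from the printed model: 1024 -> 16 -> 1024 on v_proj, 24 layers.
per_module = lora_param_count(1024, 1024, 16)
v_proj_only = 24 * per_module

total_reported = 332_769_280  # the count printed by the cell above

print(per_module)   # 32768
print(v_proj_only)  # 786432
print(f"{v_proj_only / total_reported:.4%} of all parameters")
```

To count only the parameters that will actually be updated during training, the same `sum(...)` can be restricted to parameters with `p.requires_grad` set.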


We can also load the adapter in two steps: first load the base model with `AutoModelForCausalLM.from_pretrained(model_id)`, then call `load_adapter(peft_model_id)` on the resulting model.

In [4]:
model_id="facebook/opt-350m"
peft_model_id="ybelkada/opt-350m-lora"

In [5]:
model_=AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)


In [6]:
print(model_==model)
model_.load_adapter(peft_model_id) ## add the adapters to the base model

False


In [7]:
model_==model

False

In [8]:
model_ ## now we can see lora_A and lora_B

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=16, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=16, out_features=1024, bias=Fa

## Load in 8-bit or 4-bit

Many pretrained models on the Hugging Face Hub are stored in 32-bit precision (`float32`). Quantization lets us load them with fewer bits per weight, such as 8 bits or 4 bits.

Loading a model in 8-bit or 4-bit saves memory, which is especially useful for large language models. We do this by creating a `transformers.BitsAndBytesConfig` object and passing it to ``.from_pretrained()`` as `quantization_config`. We also set `device_map='auto'` so that the model is distributed effectively across the available hardware.
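To get a feel for the savings, here is a back-of-the-envelope estimate of weight memory at different precisions. It is only a sketch: it uses the parameter count reported earlier in this notebook, the helper name `weight_memory_gib` is made up, and it ignores quantization constants, modules that stay in higher precision, activations, and framework overhead.

```python
# Approximate storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params, dtype):
    """Rough size of the weights alone, in GiB, for the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 332_769_280  # parameter count reported earlier in this notebook

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

The real footprint after quantization is usually somewhat higher than this estimate, since not every module is quantized.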

***More about `BitsAndBytesConfig()`***

- **load_in_8bit (bool, optional, defaults to False)** — This flag is used to enable 8-bit quantization with LLM.int8().

- **llm_int8_threshold (float, optional, defaults to 6)** — This corresponds to the outlier threshold for outlier detection as described in the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (https://arxiv.org/abs/2208.07339). Any hidden-state value above this threshold is considered an outlier, and the operations on those values are done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).

- **llm_int8_skip_modules (List[str], optional)** — An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as Jukebox that have several heads in different places, not necessarily at the last position. For example, for CausalLM models, the last `lm_head` is kept in its original dtype.

- **llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False)** — This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not be run on CPU.
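The role of `llm_int8_threshold` can be illustrated with a toy example in plain Python. This is a simplified sketch, not the bitsandbytes implementation: the real method treats whole feature dimensions as outliers and uses vector-wise quantization, whereas the hypothetical helpers `split_outliers` and `absmax_int8` below work on individual made-up values.

```python
def split_outliers(values, threshold=6.0):
    """Separate values whose magnitude exceeds the threshold from the rest."""
    regular = {i: v for i, v in enumerate(values) if abs(v) <= threshold}
    outliers = {i: v for i, v in enumerate(values) if abs(v) > threshold}
    return regular, outliers


def absmax_int8(values):
    """Quantize floats to integers in [-127, 127]; return (ints, scale)."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


hidden = [0.5, -3.2, 45.0, 1.1, -7.5]  # made-up hidden-state values
regular, outliers = split_outliers(hidden)

# Only the non-outlier values go through 8-bit quantization.
quantized, scale = absmax_int8(list(regular.values()))
restored = [q * scale for q in quantized]

print(sorted(outliers))  # positions kept in higher precision: [2, 4]
print(quantized)         # integer codes for the remaining values
print(restored)          # dequantized approximations of those values
```

Without the split, the value 45.0 would set the quantization scale and the small values would lose most of their resolution, which is the motivation for handling outliers separately.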

In [9]:
from transformers import BitsAndBytesConfig

bit_byte_config=BitsAndBytesConfig(load_in_8bit=True) ## this internally uses LLM.int8() from the bitsandbytes library for optimized 8-bit matrix multiplication

model=AutoModelForCausalLM.from_pretrained(peft_model_id,device_map='auto',quantization_config=bit_byte_config)

In [10]:
model.device

device(type='cuda', index=0)

In [11]:
import torch
device=('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

device: cuda


In [12]:
## loading model in 4-bit
bit_byte_config=BitsAndBytesConfig(load_in_4bit=True) ## this uses the 4-bit quantization from the bitsandbytes library (FP4 by default), not LLM.int8()

model=AutoModelForCausalLM.from_pretrained(peft_model_id,device_map='auto',quantization_config=bit_byte_config)

## Add a new adapter || set active adapter/s
We can use `PeftModel.add_adapter()` to add a new adapter to a model that already has one, as long as the new adapter is the same type as the current one. For example, if a LoRA adapter is already attached to the model, we can attach a second LoRA adapter under a different name. The cells below call the method of the same name that Transformers exposes directly on the model, `model.add_adapter()`.

Although we can add multiple adapters, not all of them are used during training or inference unless we specify which ones. We choose the active adapter(s) with `set_adapter()`.

Note: We can set a list of adapters; we do this if we are training a given model for multiple tasks. But if we are only working on a single task (such as sentiment analysis), then we set a single adapter.
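The idea of several named adapters with a selectable active set can be sketched with a toy layer in plain Python. This is only a mental model, not how PEFT implements it: `ToyLoraLinear`, `matvec`, and the adapter names `task_a` and `task_b` are all made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class ToyLoraLinear:
    """Toy stand-in for a LoRA-wrapped linear layer (not PEFT's implementation)."""

    def __init__(self, weight):
        self.weight = weight  # frozen base weight, shape (out, in)
        self.adapters = {}    # name -> (A, B, scaling)
        self.active = []      # names of the adapters used in forward()

    def add_adapter(self, name, A, B, scaling=1.0):
        if name in self.adapters:
            raise ValueError(f"adapter {name!r} already exists")
        self.adapters[name] = (A, B, scaling)

    def set_adapter(self, names):
        names = [names] if isinstance(names, str) else list(names)
        missing = [n for n in names if n not in self.adapters]
        if missing:
            raise ValueError(f"unknown adapters: {missing}")
        self.active = names

    def forward(self, x):
        out = matvec(self.weight, x)
        for name in self.active:
            A, B, scaling = self.adapters[name]
            delta = matvec(B, matvec(A, x))  # low-rank update: B @ (A @ x)
            out = [o + scaling * d for o, d in zip(out, delta)]
        return out


layer = ToyLoraLinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("task_a", A=[[1.0, 1.0]], B=[[0.5], [0.0]])
layer.add_adapter("task_b", A=[[1.0, -1.0]], B=[[0.0], [2.0]])

x = [1.0, 2.0]
print(layer.forward(x))  # no active adapter: [1.0, 2.0]
layer.set_adapter("task_a")
print(layer.forward(x))  # [2.5, 2.0]
layer.set_adapter(["task_a", "task_b"])
print(layer.forward(x))  # [2.5, 0.0]
```

Adding an adapter only registers it; it changes the output once it has been made active, which mirrors the distinction between `add_adapter()` and `set_adapter()` described above.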

In [13]:
from peft import LoraConfig

lora_config=LoraConfig(target_modules=['q_proj','k_proj'],
                       init_lora_weights=True)

model.add_adapter(lora_config,'adapter_1')

In [15]:
from peft import LoraConfig

lora_config=LoraConfig(target_modules=['q_proj','k_proj'],
                       init_lora_weights=True)

model.add_adapter(lora_config,'adapter_2')

In [18]:
model ## we can see three LoRA adapters: adapter_1, adapter1, and adapter_2

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (adapter_1): Identity()
                (adapter1): Identity()
                (adapter_2): Identity()
              )
              (lora_A): ModuleDict(
                (adapter_1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter_2): Linear(in_feature

In [23]:
model.set_adapter('adapter1') ## now adapter1 is the active adapter for training

In [24]:
model.active_adapter()



'adapter1'

In [25]:
## setting multiple adapters
model.add_adapter(lora_config,'adapter2')
model.add_adapter(lora_config,'adapter3')

In [26]:
model.set_adapter(['adapter1','adapter2','adapter3'])

In [28]:
model.active_adapters()

['adapter1', 'adapter2', 'adapter3']

``model.enable_adapters()`` ||  ``model.disable_adapters()``

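- model.enable_adapters(): Re-enables the adapters attached to the model. It takes no arguments; use `set_adapter()` to choose which adapters are active.

- model.disable_adapters(): Disables the adapters attached to the model, so that only the base model is used until `enable_adapters()` is called again. In Transformers these are plain method calls. The context-manager form, where the adapter is automatically re-enabled on exit, is `PeftModel.disable_adapter()` in the PEFT library.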


In [34]:
# enable_adapters() takes no arguments. Passing a name or a list, as in
# model.enable_adapters([adapter_name]), raises:
# TypeError: enable_adapters() takes 1 positional argument but 2 were given
model.set_adapter(["adapter1", "adapter2"])  # choose the active adapters
model.enable_adapters()                      # re-enable the adapter layers

PEFT also provides other utilities such as `merge_adapter()`, `unmerge_adapter()`, `unload()`, and `delete_adapter()`.
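Conceptually, merging folds the low-rank update into the base weight, and unmerging subtracts it again. The sketch below shows that arithmetic on tiny made-up matrices in plain Python. It illustrates the idea only; the helpers `matmul` and `add_scaled` are hypothetical, and PEFT's own methods deal with details such as dtypes and quantized layers that are ignored here.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add_scaled(W, D, factor):
    """Return W + factor * D, element by element."""
    return [[w + factor * d for w, d in zip(wr, dr)] for wr, dr in zip(W, D)]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out x in)
A = [[1.0, 2.0]]              # LoRA down-projection, shape (r x in), r = 1
B = [[0.5], [0.25]]           # LoRA up-projection, shape (out x r)
scaling = 2.0                 # stands in for lora_alpha / r

delta = matmul(B, A)                            # low-rank update, (out x in)
merged = add_scaled(W, delta, scaling)          # merge: W + scaling * B @ A
restored = add_scaled(merged, delta, -scaling)  # unmerge: subtract it again

print(merged)    # [[2.0, 2.0], [0.5, 2.0]]
print(restored)  # [[1.0, 0.0], [0.0, 1.0]]
```

After merging, a forward pass needs only one matrix multiplication per layer, which is why merging is commonly done before deployment.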

In [22]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (adapter_1): Identity()
                (adapter1): Identity()
                (adapter_2): Identity()
                (adapter2): Identity()
                (adapter3): Identity()
              )
              (lora_A): ModuleDict(
                (adapter_1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter1): Linear(in_features=10

## disabling all adapters (note: this does not delete adapter_1 or adapter_2)
model.disable_adapters()

**More about LoraConfig**

*Note taken with the assistance of Gemini*

This section summarizes the key parameters available in the `LoraConfig` class for controlling how LoRA is applied to the base model in PEFT:

**Parameters:**

* `r (int)`: Rank of the update matrices. Lower rank leads to smaller matrices with fewer trainable parameters.
* `target_modules (list)`: Names of modules (e.g., attention blocks) where LoRA update matrices will be applied.
* `lora_alpha (int)`: LoRA scaling factor. The low-rank update is scaled by `lora_alpha / r`.
* `bias (str)`: Controls training of bias parameters. Options: 'none', 'all', 'lora_only'.
* `use_rslora (bool)`: When True, uses Rank-Stabilized LoRA, which sets the scaling factor to `lora_alpha / sqrt(r)`. When False (the default), the original scaling `lora_alpha / r` is used.
* `modules_to_save (list, optional)`: List of modules (besides LoRA layers) to be set as trainable and saved in the final checkpoint. This might include a custom head for fine-tuning.
* `layers_to_transform (list, optional)`: List of layers to be transformed by LoRA. If not specified, all layers in `target_modules` are transformed.
* `layers_pattern (str, optional)`: Name of the layer container (often `layers` or `h`) used to locate layer indices. It is only used when `layers_to_transform` is set. If not specified, PEFT falls back to common layer pattern names. Use this for custom models.
* `rank_pattern (dict, optional)`: Mapping from layer names (or regex) to custom ranks, overriding the default `r`.
* `alpha_pattern (dict, optional)`: Mapping from layer names (or regex) to custom alphas, overriding the default `lora_alpha`.
* `task_type (str, optional)`: Type of downstream task the model will be used for (e.g., 'SEQ_CLS' for text classification). While not strictly required for LoRA, specifying the task type can be helpful for frameworks/libraries that leverage PEFT to provide task-specific functionalities.
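The two scaling rules mentioned for `lora_alpha` and `use_rslora` can be compared with a small calculation. The function name `lora_scaling` is made up for this sketch; it simply evaluates the two formulas given in the list above.

```python
import math


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Scaling applied to the LoRA update for a given alpha and rank."""
    if r <= 0:
        raise ValueError("r must be positive")
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # original LoRA scaling


for r in (4, 16, 64):
    print(r, lora_scaling(16, r), lora_scaling(16, r, use_rslora=True))
# 4 4.0 8.0
# 16 1.0 4.0
# 64 0.25 2.0
```

With the original rule the scaling shrinks quickly as the rank grows, while the rank-stabilized rule shrinks more slowly.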

**Remember:**

* Lower `r` values reduce parameters but might impact accuracy.
* Choose `target_modules` carefully to target relevant parts of the model for adaptation.
* Experiment with `lora_alpha` and `bias` settings for optimal performance.
* Consider `use_rslora` for potentially better results.
* Use `modules_to_save` to include custom heads in fine-tuning.
* `layers_to_transform` and related options offer fine-grained LoRA application control.
* Specifying the `task_type` can be beneficial for task-specific adaptation within PEFT.



