## Loading a PEFT Model
To load and use a PEFT adapter model with Hugging Face Transformers, first make sure the Hub repository or local directory contains an `adapter_config.json` file and the adapter weights. We can then load the PEFT adapter model using an `AutoModelFor...` class (here `AutoModelForCausalLM`). This involves two steps:

1. Specify the PEFT `model_id`.
2. Pass it to the `from_pretrained()` method of the `AutoModelFor...` class.
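Before loading from a local directory, the file requirement above can be checked with a small helper. This is a minimal sketch: `looks_like_peft_adapter` is a hypothetical name, and the weight file names (`adapter_model.safetensors`, `adapter_model.bin`) are assumed from PEFT's usual save format, so adjust them if your adapter was saved differently.

```python
from pathlib import Path

# ASSUMPTION: file names follow PEFT's usual save format.
CONFIG_NAME = "adapter_config.json"
WEIGHT_NAMES = ("adapter_model.safetensors", "adapter_model.bin")


def looks_like_peft_adapter(directory):
    """Return True if `directory` holds an adapter config and adapter weights."""
    directory = Path(directory)
    has_config = (directory / CONFIG_NAME).is_file()
    has_weights = any((directory / name).is_file() for name in WEIGHT_NAMES)
    return has_config and has_weights
```

Calling `looks_like_peft_adapter("path/to/adapter")` only inspects file names; it does not validate the contents of the config or the weights.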

In [1]:
## we will load the PEFT adapter for causal language modeling
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id="ybelkada/opt-350m-lora"
model=AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=peft_model_id)



In [4]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=16, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=16, out_features=1024, bias=Fa

In [2]:
params=model.parameters() ## yields every parameter: the base weights as well as the LoRA weights

num_params=sum(p.numel() for p in params)

print('Total parameters: ', num_params)

Total parameters:  332769280
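The count above sums every tensor returned by `model.parameters()`, so it includes the base model weights and not just the adapter. As a rough check on how small the adapter is, the LoRA shapes in the printout (rank 16, 1024 input and output features, 24 decoder layers) can be turned into a parameter count. The printout is truncated, so which projections are wrapped is an assumption here: the sketch assumes two per layer (`q_proj` and `v_proj`).

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank), no biases."""
    return rank * in_features + out_features * rank


per_module = lora_param_count(in_features=1024, out_features=1024, rank=16)
num_layers = 24
modules_per_layer = 2  # ASSUMPTION: q_proj and v_proj are wrapped in each layer

adapter_params = per_module * num_layers * modules_per_layer
total_reported = 332_769_280  # value printed by the cell above

print("adapter parameters:", adapter_params)                    # 1572864
print("remaining (base) parameters:", total_reported - adapter_params)
print(f"adapter share: {100 * adapter_params / total_reported:.2f}%")  # 0.47%
```

Under that assumption the adapter accounts for well under 1% of the reported total.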


We can also load the base model first with `AutoModelForCausalLM.from_pretrained(model_id)` and then attach the adapter by calling `load_adapter(peft_model_id)` on it.

In [3]:
model_id="facebook/opt-350m"
peft_model_id="ybelkada/opt-350m-lora"

In [4]:
model_=AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)


In [5]:
print(model_==model) ## `==` on modules compares object identity, so two separately loaded models are never equal
model_.load_adapter(peft_model_id)

False


In [6]:
model_==model

False

In [7]:
model_ ## now we can see lora_A and lora_B

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=16, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=16, out_features=1024, bias=Fa

## Load in 8bit or 4bit

Many pretrained models on the Hugging Face Hub are stored in 32-bit precision (`float32`). When loading them, we can quantize the weights to fewer bits, such as 8 bits or 4 bits.

Loading a model in 8-bit or 4-bit saves memory, which matters most for large language models. To do this, we create a `transformers.BitsAndBytesConfig()` object and pass it to `.from_pretrained()` through the `quantization_config` argument. We should also set `device_map='auto'` so that the model is distributed effectively across the available hardware.
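To get a feel for the memory savings, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is only a sketch: `weight_memory_gb` is a hypothetical helper, the parameter count is the one printed earlier in this notebook, and the estimate ignores quantization constants and any modules that stay in higher precision, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 332_769_280  # parameter count printed earlier for opt-350m + LoRA

for label, bits in [("float32", 32), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.2f} GB")
```

For this small model the difference is roughly 1.33 GB versus 0.33 GB versus 0.17 GB; the same ratios apply to much larger models, where they decide whether the model fits on a single GPU at all.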

***More about `BitsAndBytesConfig()`***

- **load_in_8bit (bool, optional, defaults to False)**: This flag is used to enable 8-bit quantization with LLM.int8().

- **llm_int8_threshold (float, optional, defaults to 6)**: This corresponds to the outlier threshold for outlier detection as described in the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (https://arxiv.org/abs/2208.07339). Any hidden state value that is above this threshold will be considered an outlier, and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).

- **llm_int8_skip_modules (List[str], optional)**: An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as Jukebox that have several heads in different places and not necessarily at the last position. For example, for CausalLM models, the last lm_head is kept in its original dtype.

- **llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False)**: This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not be run on CPU.
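The role of `llm_int8_threshold` can be illustrated with a simplified pure-Python sketch. It is not the bitsandbytes implementation: `split_outlier_columns` and `absmax_quantize` are hypothetical helpers. The sketch assumes that a feature column counts as an outlier when any of its values exceeds the threshold in magnitude, and that the remaining values are quantized with a simple symmetric absmax scheme.

```python
def split_outlier_columns(hidden, threshold=6.0):
    """Return (regular, outlier) column indices of a row-major matrix."""
    n_cols = len(hidden[0])
    outlier = [j for j in range(n_cols)
               if any(abs(row[j]) > threshold for row in hidden)]
    regular = [j for j in range(n_cols) if j not in outlier]
    return regular, outlier


def absmax_quantize(values):
    """Symmetric int8 quantization: the largest magnitude maps to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale


hidden = [[0.5, -1.2, 45.0, 2.0],
          [1.5, 0.3, -38.0, -3.1]]
regular, outlier = split_outlier_columns(hidden, threshold=6.0)
print("kept in higher precision:", outlier)  # [2]
print("quantized to int8:", regular)         # [0, 1, 3]

# Quantize the regular part of the first row.
q_row, scale = absmax_quantize([hidden[0][j] for j in regular])
print(q_row, scale)
```

Lowering the threshold moves more columns into the higher-precision path, which trades speed and memory for accuracy.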

In [15]:
from transformers import BitsAndBytesConfig

bit_byte_config=BitsAndBytesConfig(load_in_8bit=True) ## this internally uses LLM.int8() from the bitsandbytes library for optimized 8-bit matrix multiplication

model=AutoModelForCausalLM.from_pretrained(peft_model_id,device_map='auto',quantization_config=bit_byte_config)

In [16]:
model.device

device(type='cuda', index=0)

In [17]:
import torch
device=('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

device: cuda


In [19]:
## loading the model in 4-bit
bit_byte_config=BitsAndBytesConfig(load_in_4bit=True) ## this replaces the linear layers with 4-bit layers (Linear4bit) from the bitsandbytes library instead of using LLM.int8()

model=AutoModelForCausalLM.from_pretrained(peft_model_id,device_map='auto',quantization_config=bit_byte_config)

## Add a new adapter
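We can use the model's `add_adapter()` method (the Transformers counterpart of `PeftModel.add_adapter()`) to add a new adapter to a model that already has one, as long as the new adapter is of the same type as the existing one. For example, if a LoRA adapter is already attached to the model, we can add more LoRA adapters under different names: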

In [21]:
from peft import LoraConfig

lora_config=LoraConfig(target_modules=['q_proj','k_proj'],
                       init_lora_weights=True)

model.add_adapter(lora_config,'adapter1')

In [28]:
from peft import LoraConfig

lora_config=LoraConfig(target_modules=['q_proj','k_proj'],
                       init_lora_weights=True)

model.add_adapter(lora_config,'adapter_1') ## 'adapter_1' is already attached, so reusing the name raises a ValueError

ValueError: Adapter with name adapter_1 already exists. Please use a different name.

In [22]:
## Adding adapter 2 with the same config type 
model.add_adapter(lora_config,'adapter2')

In [27]:
model ## we can see three LoRA adapters: adapter_1, adapter1, and adapter2

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (adapter_1): Identity()
                (adapter1): Identity()
                (adapter2): Identity()
              )
              (lora_A): ModuleDict(
                (adapter_1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter2): Linear(in_features=

In [29]:
model.set_adapter('adapter2') ## Now we will be using adapter2
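As a rough mental model of what several named adapters on one layer and `set_adapter()` mean, here is a toy pure-Python sketch. `ToyLoraLinear` and `matvec` are hypothetical names and are not part of PEFT or Transformers. Biases, dropout, and quantization are left out, and the adapter update is taken to be `scaling * B(A x)` added to the base output, as in LoRA.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class ToyLoraLinear:
    """Hypothetical stand-in for a LoRA-wrapped linear layer (no bias)."""

    def __init__(self, weight):
        self.weight = weight   # base weight, shape [out][in]
        self.adapters = {}     # name -> (A, B, scaling)
        self.active = None     # ASSUMPTION: nothing is active until set_adapter()

    def add_adapter(self, name, A, B, scaling=1.0):
        if name in self.adapters:
            raise ValueError(f"Adapter with name {name} already exists.")
        self.adapters[name] = (A, B, scaling)

    def set_adapter(self, name):
        if name not in self.adapters:
            raise ValueError(f"Adapter with name {name} not found.")
        self.active = name

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.active is not None:
            A, B, scaling = self.adapters[self.active]
            delta = matvec(B, matvec(A, x))  # low-rank update B(A x)
            y = [yi + scaling * di for yi, di in zip(y, delta)]
        return y


layer = ToyLoraLinear([[1.0, 0.0], [0.0, 1.0]])
# B starts at zero, as in the standard LoRA initialization, so adapter1 changes nothing yet.
layer.add_adapter("adapter1", A=[[1.0, 0.0]], B=[[0.0], [0.0]])
layer.add_adapter("adapter2", A=[[1.0, 1.0]], B=[[1.0], [0.0]])

x = [1.0, 2.0]
layer.set_adapter("adapter1")
print(layer.forward(x))  # [1.0, 2.0]
layer.set_adapter("adapter2")
print(layer.forward(x))  # [4.0, 2.0]
```

All adapters share the same base weight; switching the active adapter only changes which low-rank pair is added to the base output.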

In [30]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (lora_dropout): ModuleDict(
                (adapter_1): Identity()
                (adapter1): Identity()
                (adapter2): Identity()
              )
              (lora_A): ModuleDict(
                (adapter_1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter1): Linear(in_features=1024, out_features=8, bias=False)
                (adapter2): Linear(in_features=