# Fine-Tuning with LoRA and QLoRA

First, let's install the libraries we need.

In [120]:
%pip install -U peft transformers datasets



Then we load a base model and its tokenizer.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')

In [2]:
tokenizer.special_tokens_map

{'bos_token': '</s>',
 'eos_token': '</s>',
 'unk_token': '</s>',
 'pad_token': '<pad>'}

Now let's get some data for the example: one dataset in Spanish and one in French. We are not going to fully train a model, but we will walk through the logic we would follow if we actually wanted to.

In [2]:
from datasets import load_dataset

spanish_data = load_dataset('andreamorgar/spanish_poetry')
french_data = load_dataset('Abirate/french_book_reviews')



In [4]:
spanish_data['train']['content']

['\n\nEn el parque confuso\nQue con lánguidas brisas el cielo sahúma,\nEl ciprés, como un huso,\nDevana un ovillo de de bruma.\nEl telar de la luna tiende en plata su urdimbre;\nAbandona la rada un lúgubre corsario,\nY después suena un timbre\nEn el vecindario.\n\nSobre el horizonte malva\nDe una mar argentina,\nEn curva de frente calva\nLa luna se inclina,\nO bien un vago nácar disemina\nComo la valva\nDe una madreperla a flor del agua marina.\n\nUn brillo de lóbrego frasco\nAdquiere cada ola,\nY la noche cual enorme peñasco\nVa quedándose inmensamente sola.\n\nForma el tic-tac de un reloj accesorio,\nLa tela de la vida, cual siniestro pespunte.\nFlota en la noche de blancor mortuorio\nUna benzoica insispidez de sanatorio,\nY cada transeúnte\nParece una silueta del Purgatorio.\n\nCon emoción prosaica,\nSuena lejos, en canto de lúgubre alarde,\nUna voz de hombre desgraciado, en que arde\nEl calor negro del rom de Jamaica.\nY reina en el espíritu con subconsciencie arcaica,\nEl miedo de

In [173]:
french_data['train']['reader_review']

['Ce n\'est pas le premier roman à aborder les thèmes lourds de l\'inceste et de l\'enfance martyre, mais il le fait avec une audace et un brio incomparables qui rendent ce livre marquant dans une vie de lecteur. On y sent à quel point l\'auteur n\'a pas cherché à "faire quelque chose", on ne sent jamais l\'intention, on sent juste l\'urgence, incandescente, à raconter la vérité d\'un homme maltraité par la vie au point de dire à la nuit «\xa0 tu ne me feras pas peur j\'ai plus de noir que toi dans mon enfance\xa0».',
 'Simple, alias Barnabé, est un jeune homme de 22 ans qui a l’âge mental d’un enfant de 3 ans. Kléber, son frère de 17 ans, entre en terminale au lycée, mais décide de s’occuper lui-même de son frère. Leur mère étant morte et leur père refusant de s’encombrer de sa progéniture afin de vivre pleinement sa nouvelle vie, Kléber refuse d’abandonner son frère à Malicroix, l’institution où il dépérissait. Se mettant tant bien que mal à la recherche d’un appartement pour vivre a

Now let's get that data ready for training by tokenizing it, truncating or padding every example to 128 tokens.

In [None]:
max_length = 128

def preprocess_spanish(examples):
    return tokenizer(
        [x for x in examples['content'] if x], 
        max_length=max_length,
        truncation=True, 
        padding='max_length'
    )

def preprocess_french(examples):
    return tokenizer(
        [x for x in examples['reader_review'] if x], 
        max_length=max_length,
        truncation=True, 
        padding='max_length'  # alternative: 'longest' pads only up to the longest example in each batch
    )

tokenized_spanish = spanish_data.map(
    preprocess_spanish,
    batched=True,
    remove_columns=spanish_data['train'].column_names,
)

tokenized_french = french_data.map(
    preprocess_french,
    batched=True,
    remove_columns=french_data['train'].column_names,
)


In [6]:
tokenized_french

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 9645
    })
})
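
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a minimal pure-Python sketch of what happens to each sequence of token ids. The helper `pad_and_truncate` is hypothetical (it is not part of `transformers`). It assumes right-side truncation, the left padding we requested with `padding_side='left'`, and a pad id of 1 as used by OPT. The token ids are made up for illustration.

```python
def pad_and_truncate(token_ids, max_length, pad_id=1, padding_side="left"):
    """Toy stand-in for tokenizer(..., truncation=True, padding='max_length')."""
    ids = list(token_ids)[:max_length]      # truncation keeps the first max_length tokens
    n_pad = max_length - len(ids)           # how many pad tokens are still needed
    mask = [1] * len(ids)                   # 1 = real token, 0 = padding
    if padding_side == "left":
        return {
            "input_ids": [pad_id] * n_pad + ids,
            "attention_mask": [0] * n_pad + mask,
        }
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


# Made-up token ids, only to show the shapes involved.
short = pad_and_truncate([2, 713, 16], max_length=5)
long = pad_and_truncate([2, 713, 16, 10, 1296, 4], max_length=5)
print(short)
print(long)
```

Every example ends up with exactly `max_length` ids, which is why the tokenized dataset only contains `input_ids` and `attention_mask` columns of fixed size.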

Now let's add some LoRA adapters. We start by defining the config.

In [4]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    task_type="CAUSAL_LM",
    target_modules={'q_proj', 'v_proj'}
)

In [8]:
print(lora_config)

LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=64, target_modules={'q_proj', 'v_proj', 'embed_tokens'}, exclude_modules=None, lora_alpha=8, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, use_qalora=False, qalora_group_size=16, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, target_parameters=None)
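
The printed config shows `r=64`, `lora_alpha=8` and `use_rslora=False`, so the low-rank update is scaled by `lora_alpha / r`. The sketch below illustrates the LoRA forward pass `h = W x + (lora_alpha / r) * B (A x)` in pure Python. The helper names and the tiny matrices are hypothetical, and a toy rank of 1 is used to keep the numbers readable. It also shows that with `B` initialized to zeros (the default for linear layers when `init_lora_weights=True`), a freshly added adapter does not change the output.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scaling):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


scaling = 8 / 64                   # lora_alpha / r from the config above

W = [[1.0, 0.0], [0.0, 1.0]]       # frozen base weight, shape (out=2, in=2)
A = [[0.5, -0.5]]                  # shape (rank=1, in=2)
B_init = [[0.0], [0.0]]            # shape (out=2, rank=1), zeros at initialization
B_trained = [[1.0], [2.0]]         # made-up values standing in for a trained B
x = [2.0, 1.0]

y_init = lora_forward(W, A, B_init, x, scaling)
y_trained = lora_forward(W, A, B_trained, x, scaling)
print(y_init, y_trained)
```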


In [9]:
model = AutoModelForCausalLM.from_pretrained(model_id)
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=409
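
The layer shapes in this printout let us estimate how many parameters the adapter trains. The arithmetic below is a rough sketch that counts only the `q_proj` and `v_proj` targets from the config cell, uses the 1024x1024 projections and 24 decoder layers shown above, and relies on a hypothetical helper `lora_params`.

```python
def lora_params(in_features, out_features, r):
    """A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


r = 64
hidden = 1024            # in_features and out_features of q_proj / v_proj
n_layers = 24
targets_per_layer = 2    # q_proj and v_proj

per_module = lora_params(hidden, hidden, r)
adapter_total = per_module * targets_per_layer * n_layers
full_module = hidden * hidden + hidden   # weight plus bias of one frozen projection

print(per_module, adapter_total, per_module / full_module)
```

Each adapted projection adds about an eighth of the parameters of the frozen layer it wraps, and only those added parameters receive gradients.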

Then we add the adapter that will fine-tune the model for Spanish causal language modeling.

In [10]:
model.add_adapter(lora_config, adapter_name='spanish_adapter')




In [11]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): lora.Embedding(
        (base_layer): Embedding(50272, 512, padding_idx=1)
        (lora_dropout): ModuleDict(
          (spanish_adapter): Identity()
        )
        (lora_A): ModuleDict()
        (lora_B): ModuleDict()
        (lora_embedding_A): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 64x50272])
        (lora_embedding_B): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 512x64])
        (lora_magnitude_vector): ModuleDict()
      )
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=T

Using the same config, we add a second adapter for French.

In [12]:
model.add_adapter(lora_config, adapter_name='french_adapter')



In [13]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): lora.Embedding(
        (base_layer): Embedding(50272, 512, padding_idx=1)
        (lora_dropout): ModuleDict(
          (spanish_adapter): Identity()
          (french_adapter): Identity()
        )
        (lora_A): ModuleDict()
        (lora_B): ModuleDict()
        (lora_embedding_A): ParameterDict(
            (spanish_adapter): Parameter containing: [torch.FloatTensor of size 64x50272]
            (french_adapter): Parameter containing: [torch.FloatTensor of size 64x50272]
        )
        (lora_embedding_B): ParameterDict(
            (spanish_adapter): Parameter containing: [torch.FloatTensor of size 512x64]
            (french_adapter): Parameter containing: [torch.FloatTensor of size 512x64]
        )
        (lora_magnitude_vector): ModuleDict()
      )
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=Fa

We can switch back and forth between adapters. The most recently added adapter is the active one until we change it.

In [14]:
model.active_adapters()

['french_adapter']

In [15]:
model.set_adapter('spanish_adapter')

In [16]:
model.active_adapters()

['spanish_adapter']

There is another way to add adapters. Let's reload the base model first.

In [17]:
model = AutoModelForCausalLM.from_pretrained(model_id)
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=409

We can use the `get_peft_model` function

In [5]:
from peft import get_peft_model

peft_model = get_peft_model(
    model, 
    lora_config, 
    adapter_name='spanish_adapter'
)

In [19]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): lora.Embedding(
            (base_layer): Embedding(50272, 512, padding_idx=1)
            (lora_dropout): ModuleDict(
              (spanish_adapter): Identity()
            )
            (lora_A): ModuleDict()
            (lora_B): ModuleDict()
            (lora_embedding_A): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 64x50272])
            (lora_embedding_B): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 512x64])
            (lora_magnitude_vector): ModuleDict()
          )
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23

In [20]:
peft_model.get_base_model()

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): lora.Embedding(
        (base_layer): Embedding(50272, 512, padding_idx=1)
        (lora_dropout): ModuleDict(
          (spanish_adapter): Identity()
        )
        (lora_A): ModuleDict()
        (lora_B): ModuleDict()
        (lora_embedding_A): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 64x50272])
        (lora_embedding_B): ParameterDict(  (spanish_adapter): Parameter containing: [torch.FloatTensor of size 512x64])
        (lora_magnitude_vector): ModuleDict()
      )
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=T

In [6]:
peft_model.add_adapter(
    adapter_name='french_adapter', 
    peft_config=lora_config
)

peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 512, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23): 24 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (spanish_adapter): Identity()
                    (french_adapter): Identity()
                  )
                  (lora_A): ModuleDict(
                    (spanish_adapter): Linear(

Let's create the data collator for causal language modeling (`mlm=False`), which builds the labels from the input ids.

In [22]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False
)
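
As a rough illustration of what the collator does when `mlm=False`, the sketch below copies the input ids into `labels` and replaces padding positions with `-100`, the index that the cross-entropy loss ignores. The function `causal_lm_collate` is a hypothetical pure-Python stand-in (the real collator returns tensors), and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def causal_lm_collate(batch, pad_id=1):
    """Toy version of causal-LM collation: labels mirror input_ids, padding is masked."""
    input_ids = [list(example["input_ids"]) for example in batch]
    attention_mask = [list(example["attention_mask"]) for example in batch]
    labels = [
        [IGNORE_INDEX if token == pad_id else token for token in ids]
        for ids in input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = [
    {"input_ids": [1, 1, 2, 713, 16], "attention_mask": [0, 0, 1, 1, 1]},
    {"input_ids": [2, 100, 200, 300, 400], "attention_mask": [1, 1, 1, 1, 1]},
]
collated = causal_lm_collate(batch)
print(collated["labels"])
```

The shift between inputs and targets (predicting token `t+1` from tokens up to `t`) happens inside the model's loss computation, not in the collator.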

Now let's train the Spanish adapter.

In [23]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./result_training",
    learning_rate=2e-5,
    weight_decay=0.01,
)

peft_model.set_adapter('spanish_adapter')

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_spanish['train'],
    data_collator=data_collator,
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
500,3.5039
1000,3.3938
1500,3.3636




TrainOutput(global_step=1926, training_loss=3.4101653767523366, metrics={'train_runtime': 467.7073, 'train_samples_per_second': 32.912, 'train_steps_per_second': 4.118, 'total_flos': 3811843305897984.0, 'train_loss': 3.4101653767523366, 'epoch': 3.0})

In [28]:
base_model = peft_model.get_base_model()

In [29]:
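def generate_text(prompt, model):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding of up to 100 new tokens
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode the single sequence of the batch, special tokens included
    return tokenizer.decode(output[0])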

base_model.to('cpu')
generate_text('Como estas?', base_model)

'</s>Como estas?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n\n¿Qué es la vida?\n'

In [30]:

peft_model.to('cpu')
peft_model.set_adapter('spanish_adapter')
generate_text('Como estas?', peft_model)

'</s>Como estas?\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!\n\n¡Viva la vida!'

Now let's train the French adapter.

In [31]:
training_args = TrainingArguments(
    output_dir="./result_training",
    learning_rate=2e-5,
    weight_decay=0.01,
)

peft_model.to('mps')
peft_model.set_adapter('french_adapter')

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_french['train'],
    data_collator=data_collator,
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
500,3.4952
1000,4.2169
1500,4.6014
2000,4.7981
2500,4.8022
3000,4.8788
3500,4.949




TrainOutput(global_step=3618, training_loss=4.548540257695896, metrics={'train_runtime': 888.0512, 'train_samples_per_second': 32.583, 'train_steps_per_second': 4.074, 'total_flos': 7165314497249280.0, 'train_loss': 4.548540257695896, 'epoch': 3.0})
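
The `global_step=3618` reported here can be reproduced from the dataset size. The sketch below assumes the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs (neither is set explicitly above), a single device, and no gradient accumulation. The helper `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size=8, num_epochs=3):
    """Optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


french_steps = total_training_steps(9645)   # 9645 rows in tokenized_french['train']
print(french_steps)
```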

In [32]:
base_model.to('cpu')
generate_text('Comment ca va?', base_model)

'</s>Comment ca va?\n\nCommenter\nCommenter\nCommenter\n\nCommenter l’s de la vu, je suis de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la vue de la'

In [33]:
peft_model.set_adapter('french_adapter')
generate_text('Comment ca va?', peft_model)

"</s>Comment ca va?\nC'est pas de la vie de la vie de la vu de la ville de la vie. La vie de la vie. La vie de la vie. La vie de la vie. La vie de la vie. La vie de la vie. La vie de viele. La vie de vie de vie. La vie de la vie. La vie de vie de la"

We can save the adapters

In [34]:
peft_model.save_pretrained('peft_adapters')



We can load them back. Each adapter is saved in its own subfolder.

In [35]:
from peft import PeftModelForCausalLM

model_spanish = PeftModelForCausalLM.from_pretrained(
    model,
    'peft_adapters/spanish_adapter'
)



We can combine the two adapters into a new one, here with equal weights.

In [7]:
peft_model.add_weighted_adapter(
    ['spanish_adapter', 'french_adapter'], 
    [0.5, 0.5], 
    adapter_name='new_adapter')

In [8]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 512, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23): 24 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (spanish_adapter): Identity()
                    (french_adapter): Identity()
                    (new_adapter): Identity()
                  )
                  (lora_A): ModuleDict(


We can run inference with several adapters in the same batch (multi-LoRA). Each input is paired with an adapter name, and `__base__` means no adapter.

In [9]:
inputs = tokenizer(
    [
        "Hello",
        "Bonjour",
        "Hola",
    ],
    return_tensors="pt",
    padding=True,
)

adapter_names = [
    "__base__", 
    "french_adapter",
    "spanish_adapter",
]

peft_model.eval()

output = peft_model.generate(
    **inputs, 
    adapter_names=adapter_names, 
    max_new_tokens=20
)

In [10]:
tokenizer.decode(output[0]) 

"<pad><pad></s>Hello, I'm a newbie to this sub. I'm looking for a good place to start."

In [11]:
tokenizer.decode(output[1]) 

"</s>Bonjour, je suis pas en train de dire que c'est un peu plus important que ce"

In [12]:
tokenizer.decode(output[2]) 

'<pad></s>Hola, me parece que el pueblo está en el pueblo.\nP'
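
Conceptually, mixed-adapter inference routes each sample of the batch through the frozen weights plus the low-rank update of its own adapter. The pure-Python sketch below illustrates that routing for a single linear layer. It is not how `peft` implements it internally, and the function name, matrices and values are all made up for illustration.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def mixed_batch_forward(W, adapters, batch, adapter_names, scaling=1.0):
    """Apply the base layer to every sample, plus the update of that sample's adapter."""
    outputs = []
    for x, name in zip(batch, adapter_names):
        y = matvec(W, x)
        if name != "__base__":                   # "__base__" skips the adapters
            A, B = adapters[name]
            delta = matvec(B, matvec(A, x))
            y = [yi + scaling * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


W = [[1.0, 0.0], [0.0, 1.0]]                     # shared frozen weight
adapters = {
    "french_adapter": ([[1.0, 0.0]], [[1.0], [0.0]]),    # (A, B) with rank 1
    "spanish_adapter": ([[0.0, 1.0]], [[0.0], [2.0]]),
}
batch = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
names = ["__base__", "french_adapter", "spanish_adapter"]

outs = mixed_batch_forward(W, adapters, batch, names)
print(outs)
```

The same input produces three different outputs, one per adapter choice, while the base weights are shared by the whole batch.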

We can add LoRA adapters to any custom PyTorch model.

In [13]:
from torch import nn


class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)

In [15]:
from peft import LoraConfig

config = LoraConfig(
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)

model = MLP()
peft_model = get_peft_model(model, config)
peft_model

PeftModel(
  (base_model): LoraModel(
    (model): MLP(
      (seq): Sequential(
        (0): lora.Linear(
          (base_layer): Linear(in_features=20, out_features=2000, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=20, out_features=8, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in_features=8, out_features=2000, bias=False)
          )
          (lora_embedding_A): ParameterDict()
          (lora_embedding_B): ParameterDict()
          (lora_magnitude_vector): ModuleDict()
        )
        (1): ReLU()
        (2): lora.Linear(
          (base_layer): Linear(in_features=2000, out_features=2000, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=2000, out_features=8, bias=False)
          )
          (lora
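
To see what this config leaves trainable, we can count parameters by hand. The sketch below assumes the default rank of 8 visible in the printout, treats `seq.0` and `seq.2` as LoRA targets, and treats `seq.4` as fully trainable because it is listed in `modules_to_save`. The helper functions are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def lora_params(in_features, out_features, r):
    """A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


r = 8
hidden = 2000

lora_trainable = lora_params(20, hidden, r) + lora_params(hidden, hidden, r)
head_trainable = linear_params(hidden, 2)        # seq.4, kept trainable via modules_to_save
trainable = lora_trainable + head_trainable

base_total = (
    linear_params(20, hidden)
    + linear_params(hidden, hidden)
    + linear_params(hidden, 2)
)
print(trainable, base_total, trainable / base_total)
```

Under these assumptions only a little over 1% of the original parameter count is trained.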

Let's quantize our model

In [210]:
%pip install -U bitsandbytes



Unfortunately, we need a GPU to quantize with bitsandbytes, so the cell below fails on this machine.

In [11]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import prepare_model_for_kbit_training

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=config,
)

model = prepare_model_for_kbit_training(model)

RuntimeError: No GPU found. A GPU is needed for quantization.
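
Even without running it, we can estimate what 4-bit loading would save. The back-of-the-envelope sketch below only counts the storage of the weights. It ignores quantization constants, activations, optimizer state, and any layers left unquantized, and it uses the nominal 350M parameter count from the model name as an assumption rather than a measured value.

```python
def weight_memory_mib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in MiB."""
    return n_params * bits_per_param / 8 / 2**20


n_params = 350_000_000   # nominal size taken from the model name (assumption)

fp32 = weight_memory_mib(n_params, 32)
bf16 = weight_memory_mib(n_params, 16)
nf4 = weight_memory_mib(n_params, 4)

print(f"fp32: {fp32:.0f} MiB, bf16: {bf16:.0f} MiB, 4-bit: {nf4:.0f} MiB")
```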

Once the model is quantized, we can attach a LoRA adapter to it, exactly as before.

In [219]:
from peft import get_peft_model

peft_model = get_peft_model(
    model, 
    lora_config, 
    adapter_name='spanish_adapter'
)

And train the model

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./result_training",
    learning_rate=2e-5,
    weight_decay=0.01,
)

peft_model.set_adapter('spanish_adapter')

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_spanish['train'],
    data_collator=data_collator,
)

trainer.train()

As an alternative, we can quantize with quanto, which ran on this machine without a GPU.

In [221]:
%pip install quanto



In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
from peft import prepare_model_for_kbit_training

quantization_config = QuantoConfig(weights="int8")
model_id = "facebook/opt-350m"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=quantization_config,
)
# quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)

quantized_model = prepare_model_for_kbit_training(quantized_model)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [5]:
QuantoConfig().to_dict()

{'quant_method': <QuantizationMethod.QUANTO: 'quanto'>,
 'weights': 'int8',
 'activations': None,
 'modules_to_not_convert': None}
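
To give an intuition for what `weights="int8"` means, here is a simplified sketch of symmetric int8 quantization with a single scale for the whole tensor. It is an illustration of the general idea under that assumption, not necessarily the scheme quanto uses internally, and the function names and weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 if all weights are zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half of the scale.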