# [Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem](https://pytorch.org/blog/finetune-llms/?utm_content=278057354&utm_medium=social&utm_source=twitter&hss_channel=tw-776585502606721024)

## PARAMETER EFFICIENT FINE-TUNING (PEFT) METHODS
* PEFT methods aim at drastically reducing the number of trainable parameters of a model while keeping the same performance as full fine-tuning.

### 1.) LOW-RANK ADAPTATION FOR LARGE LANGUAGE MODELS (LORA) USING 🤗 PEFT
* The LoRA method, introduced by Hu et al. at Microsoft in 2021, works by attaching extra trainable parameters to a model.
* To make fine-tuning more efficient, LoRA represents the update to a large weight matrix as the product of two smaller, low-rank matrices (called update matrices). These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn’t receive any further adjustments. To produce the final results, both the original and the adapted weights are combined.
* This approach has several advantages:

    - LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
    - The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
    - LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
    - The performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
    - LoRA does not add any inference latency when adapter weights are merged with the base model.
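
The decomposition and the merge step described above can be illustrated with a small, dependency-free sketch. It assumes the usual formulation from the LoRA paper, `h = W x + (alpha / r) * B (A x)`, with `B` initialized to zero. The helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and the toy dimensions are made up for illustration and are not part of 🤗 PEFT.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply matrices P (n x r) and Q (r x k)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x). W stays frozen; only A and B would be trained."""
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

random.seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

h_before = lora_forward(W, A, B, x, alpha, r)  # equals W x while B is all zeros

# Stand-in for training: give B some non-zero values, then compare both paths.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
h_adapter = lora_forward(W, A, B, x, alpha, r)   # base weight plus separate adapter
h_merged = matvec(merge(W, A, B, alpha, r), x)   # single merged weight matrix
```

`h_adapter` and `h_merged` agree up to floating-point rounding, which is why a merged adapter adds no extra work at inference time: the model again holds one weight matrix per layer.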

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Create peft config: rank-8 adapters on the attention and MLP projections
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Create PeftModel which inserts LoRA adapters using the above config
model = get_peft_model(model, lora_config)

# Train the model using HF Trainer / HF Accelerate / custom loop

# Save the adapter weights (only the LoRA parameters, not the base model)
model.save_pretrained("my_awesome_adapter")
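
To get a feel for how few parameters the config above trains, the count can be worked out by hand. The sketch below assumes the Llama-2-7b layer shapes from the public model config (hidden size 4096, MLP size 11008, 32 decoder layers); these numbers are not stated in this document. The helper `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Llama-2-7b shapes (taken from the public config, not from this document)
hidden, intermediate, n_layers, r = 4096, 11008, 32, 8

# (input dim, output dim) of every module listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total_lora = per_layer * n_layers
total_frozen = sum(d_in * d_out for d_in, d_out in shapes.values()) * n_layers

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
print(f"Share of targeted weights: {100 * total_lora / total_frozen:.2f}%")
```

Under these assumptions the adapters add roughly 20 million trainable parameters on top of about 6.5 billion frozen weights in the targeted layers. On a real model, `model.print_trainable_parameters()` from 🤗 PEFT reports the measured figure.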

### 2.) QLORA: ONE OF THE CORE CONTRIBUTIONS OF BITSANDBYTES TOWARDS THE DEMOCRATIZATION OF AI
* According to the LoRA formulation, the base model can be compressed in any data type (‘dtype’) as long as the hidden states from the base model are in the same dtype as the output hidden states from the LoRA matrices.
* Compressing and quantizing large language models has recently become an exciting topic as SOTA models become larger and more difficult to serve and use for end users. Many people in the community proposed various approaches for effectively compressing LLMs with minimal performance degradation.
* This is where the `bitsandbytes` library comes in. Quantization of LLMs has largely focused on quantization for inference, but the `QLoRA (Quantized model weights + Low-Rank Adapters)` paper showed the breakthrough utility of using backpropagation through frozen, quantized weights at large model scales.

* To use the LLM.int8() and QLoRA algorithms, pass `load_in_8bit=True` or `load_in_4bit=True`, respectively, to the `from_pretrained` method.

* In addition to generous use of LoRA, to achieve high-fidelity fine-tuning of 4-bit models, QLoRA uses 3 further algorithmic tricks:

    - 4-bit NormalFloat (NF4) quantization, a custom data type exploiting the property of the normal distribution of model weights and distributing an equal number of weights (per block) to each quantization bin, thereby enhancing information density.
    - Double Quantization, quantization of the quantization constants (further savings).
    - Paged Optimizers, preventing memory spikes during gradient checkpointing from causing out-of-memory errors.
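
The idea behind quantile-based 4-bit quantization can be sketched in plain Python. This is a simplified illustration, not the NF4 codebook that `bitsandbytes` ships (the real data type is constructed asymmetrically so that zero is represented exactly). The function names, the block size of 64, and the toy weight distribution are assumptions made for this example.

```python
import random
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Build 2**bits levels from evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p = 0 and p = 1.
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, levels):
    """Scale a block by its absolute maximum, then store the index of the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda j: abs(levels[j] - w / absmax))
        for w in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Look up each level and undo the absmax scaling."""
    return [levels[c] * absmax for c in codes]

random.seed(0)
levels = normal_quantile_levels(bits=4)
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of toy weights

codes, absmax = quantize_block(block, levels)         # 64 four-bit codes + 1 constant
restored = dequantize_block(codes, absmax, levels)
max_err = max(abs(w - v) for w, v in zip(block, restored))
print(f"absmax={absmax:.4f}, max reconstruction error={max_err:.5f}")
```

Because the levels follow the quantiles of a normal distribution, they sit closer together near zero, where most weights are, and further apart in the tails. Each block stores one scaling constant (`absmax`); Double Quantization compresses those constants in turn.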

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"

# For LLM.int8()
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

# For QLoRA
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
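
A rough back-of-the-envelope calculation shows what the two flags buy in weight memory. The sketch below counts only the stored weights and ignores activations, gradients, optimizer state, and framework overhead. The round figure of 7 billion parameters and the block sizes (64 weights per block, 256 constants per second-level block, as described in the QLoRA paper) are assumptions for illustration.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 7e9  # assumed round figure for a "7B" model

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gib(n_params, bits):5.2f} GiB")

# Overhead of the per-block quantization constants, in bits per parameter.
block_size = 64          # weights sharing one 32-bit constant (assumed)
second_level_block = 256 # constants sharing one 32-bit second-level constant (assumed)

overhead_plain = 32 / block_size
overhead_double_quant = 8 / block_size + 32 / (block_size * second_level_block)

print(f"constants without double quantization: {overhead_plain:.3f} bits/param")
print(f"constants with double quantization:    {overhead_double_quant:.3f} bits/param")
```

With these assumptions, 4-bit storage cuts the weights of a 7B model from about 13 GiB to a little over 3 GiB, and Double Quantization trims the constants from 0.5 to roughly 0.127 bits per parameter.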

#### Training QLoRA Model using HuggingFace PEFT Library

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Create quantization config: 4-bit NF4 weights, float16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

# Base model, loaded with quantized weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
)

# Prepare model for quantized training
model = prepare_model_for_kbit_training(model)

# Create peft config
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Create PeftModel which inserts LoRA adapters using the above config
model = get_peft_model(model, lora_config)

# Train the model using HF Trainer / HF Accelerate / custom loop

# Save the adapter weights (only the LoRA parameters, not the base model)
model.save_pretrained("my_awesome_adapter")