# Introduction

We are finally at the point that many people have been waiting for: small LLMs have become very powerful and can run on consumer GPUs. With good fine-tuning in a given domain, they even rival some of the best commercially available LLMs.

The combo of being runnable and fine-tuneable on consumer hardware is possible thanks to weight quantization and LoRA adapters, respectively. 

This post fine-tunes a text embedding model with the unsloth and Sentence Transformers libraries. Specifically, we fine-tune a set of QLoRA adapters using a contrastive loss on a simple Question and Answer dataset. 

# The `unsloth` library

The [`unsloth`](https://unsloth.ai/) library makes it both efficient and affordable to fine-tune transformer networks on consumer hardware.

Most of their work focuses on fine-tuning decoder models, aka the LLM family of models. This makes sense given the high visibility and ever-increasing capabilities of generative networks.

Unsloth has an ocean of starter notebooks that make it easy for anyone to fine-tune relevant, modern LLMs. Many of the notebooks use quantization setups that even fit on 8GB GPUs. If you had gone back a few years and told people we'd be able to meaningfully fine-tune powerful, SoTA LLMs on such small cards, it would have sounded outlandish.

While it makes a ton of sense that generative LLMs receive so much attention, there is also the other side of the architecture coin: encoder models. These are models like BERT that transform sentences into vector embeddings that capture their semantic content and relationships. 

Encoder models power incredibly useful tools like RAG. Despite the LLM hype, it is RAG engines that are the backbone of most LLM applications currently deployed in the wild. 

# RAG workhorses

RAG engines rely on text embedding models, aka the encoder side of transformer networks.  

There is a great post from the creators of the recent [`ModernBERT`](https://huggingface.co/blog/modernbert) embedding model that describes how LLMs capture all the hype and fanfare, while encoder models are the actual workhorses for AI products.

Unfortunately, as of writing, unsloth does not support fine-tuning encoder models. It's been a feature in their pipeline for a while, but they understandably have a ton of other pressing work. 

We can, however, still leverage some recent PRs, along with the Sentence Transformers library, to patch embedding fine-tuning into unsloth. 

In this post, we will fine-tune an `all-MiniLM` model, specifically the recent [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2). 

# Fine-tuning embeddings with Unsloth

Here is the rough process we'll go through. We will load the all-MiniLM model and wrap it with unsloth's QLoRA adapters. Then, we'll wrap the unsloth-patched model again, this time inside a custom Sentence Transformers model. It is this final double-wrapped model that will be fine-tuned. 

Both Sentence Transformers and unsloth actually subclass HuggingFace's Trainer and TrainingArguments. Their APIs and functionality are not quite identical, but are close enough for our purposes. 

Sentence Transformers will do the heavy lifting of the learning loop: preparing the input batches, computing the embeddings-specific loss, and handling the weight updates.

Let's get started and put all of this together. First, we need to prepare our environment. 

## Installing Unsloth 

The following command installs unsloth:  

```bash
pip install unsloth
```  

But unsloth is under constant development. It directly patches and modifies many low-level libraries used for LLM inference and training. Because of this, it can be quite tricky to install. The default setup in their Google Colab notebooks does a specific pip installation dance that is quite robust. 

> You may have good luck with the simple `pip install unsloth`. Depending on your Linux setup, it might not be so simple. If the simple install fails, mimic the Colab-specific pip installation below.  

```bash
# only do this if the simple pip install fails
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
pip install --no-deps unsloth
```

Once this is ready, we import unsloth and get started. 

In [None]:
# import unsloth first so it can patch in optimizations
import unsloth

Note that it's best practice to import unsloth first. This lets it patch all the lower-level libraries it needs. Then we'll be using the `FastModel` class. This takes a very handy `auto_model` argument that lets us load encoder models. 

In [None]:
# loads encoder models
from unsloth import FastModel

Now we can bring in all of our regular imports. We'll import them all here, then mention them when it's needed.

In [None]:
# some setup
from pathlib import Path
import torch

# Import the correct base model class for our model
from transformers import BertModel
from datasets import load_dataset, concatenate_datasets
from peft import LoraConfig, TaskType
import sentence_transformers
from sentence_transformers import SentenceTransformerTrainingArguments, SentenceTransformerTrainer, SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import InformationRetrievalEvaluator

Now we can start defining the variables we'll need. Let's start by setting the all-MiniLM model and its class. 

In [None]:
# Model Configuration
BASE_MODEL_ID = 'sentence-transformers/all-MiniLM-L12-v2'
BERT_MODEL = BertModel

# Maximum sequence length of this model
MAX_SEQ_LENGTH = 512
LOAD_IN_4BIT = True  # For QLoRA (4-bit quantization) 

Let's load this `FastModel`. This is the full model, before we've attached QLoRA adapters.

In [None]:
model, tokenizer = FastModel.from_pretrained(
    model_name = BASE_MODEL_ID,
    auto_model = BERT_MODEL,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = None, # Auto-detects (BF16/FP16)
    load_in_4bit = LOAD_IN_4BIT,
)
print(f"Loaded {BASE_MODEL_ID} with Unsloth.")
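
To get a feel for what 4-bit loading buys us, here is a back-of-the-envelope sketch of weight memory at different precisions. The parameter count is an approximate figure I'm assuming for a MiniLM-L12-sized encoder, and the helper `weight_memory_mb` is made up for illustration. It also ignores quantization constants and the fact that layers such as embeddings and layer norms typically stay in higher precision, so treat the 4-bit number as optimistic.

```python
# Rough weight-memory estimate at different precisions.
# ASSUMPTION: ~33 million parameters is an approximate size for a
# MiniLM-L12-style encoder; check your real model with
# sum(p.numel() for p in model.parameters()).
APPROX_PARAMS = 33_000_000


def weight_memory_mb(num_params: int, bits_per_param: float) -> float:
    """Memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_mb(APPROX_PARAMS, bits):7.1f} MB")
```

For a model this small the savings are modest in absolute terms. The same arithmetic is what makes multi-billion parameter models fit on consumer cards.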

## QLoRA patches

Next, we'll patch in the QLoRA weights to be learned using the unsloth library. Unsloth has a whole set of good default arguments that have been earned and hard-won for fine-tuning LLMs. From my initial experiments, it seems like some of these will need re-thinking for encoder models. But, they are certainly a good starting point.

In [None]:
# LORA Configuration
LORA_R = 16          # Rank of the LORA matrices. 
LORA_ALPHA = 32      
LORA_DROPOUT = 0.0   # Dropout of 0 is best.
USE_RSLORA = False   # Rank-Stabilized LoRA if desired

# Target modules for adapters
LORA_TARGET_MODULES = ["query", "key", "value", "dense"] 
LORA_EXCLUDE_MODULES = [] # put anything you want to skip here


With our QLoRA parameters defined, we can attach the adapters to our base model.

In [None]:
print("Attaching QLoRA adapters...")
model = FastModel.get_peft_model(
    model,
    r = LORA_R,
    lora_alpha = LORA_ALPHA,
    lora_dropout = LORA_DROPOUT,
    target_modules = LORA_TARGET_MODULES,
    exclude_modules = LORA_EXCLUDE_MODULES,
    use_rslora = USE_RSLORA,
    bias = "none", # Standard practice for LoRA
    use_gradient_checkpointing = "unsloth",
    modules_to_save = None, # Add to train non-LORA modules
    task_type = TaskType.FEATURE_EXTRACTION, # Important!
)
print("LORA adapters added.")

A key part here is the line `task_type = TaskType.FEATURE_EXTRACTION`, which sets the model up for embedding tasks.

We can see how QLoRA only learns a fraction of the model's original parameters, making it feasible to run this training on consumer hardware instead of massive clusters. 

In [None]:
# check how many parameters we will actually learn
model.print_trainable_parameters()
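
To see where that small fraction comes from, here is a rough sketch of the arithmetic for a single linear layer. LoRA freezes the original `d_out x d_in` weight and learns two thin matrices of rank `r` instead. The hidden size of 384 is my assumption for a MiniLM-L12-style encoder, and both helper functions are hypothetical names used only for this illustration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix (bias ignored)."""
    return d_in * d_out


HIDDEN = 384  # ASSUMPTION: hidden size of a MiniLM-L12-style encoder
R = 16        # same value as LORA_R above

lora = lora_param_count(HIDDEN, HIDDEN, R)
full = full_param_count(HIDDEN, HIDDEN)
print(f"LoRA adds {lora:,} trainable params next to {full:,} frozen ones ({lora / full:.1%})")
```

The ratio shrinks further for wider layers, since the adapter grows linearly with the layer width while the frozen matrix grows quadratically.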

## Wrapping unsloth model with Sentence Transformers

To use an embeddings-specific loss, we need the Sentence Transformers library. We basically need to patch our own model into a regular ST setup. 

Once we have the QLoRA-patched embeddings model, we can follow the Sentence Transformers documentation to create a custom model. There are only a few requirements.
  
We need to manually create a Transformer model. For this, we'll directly pass in our embeddings model.

We also need to tell Sentence Transformers how to convert the model's final output into an embedding. This is called the pooling stage. There are a ton of techniques, but it seems like mean-pooling is currently winning out. Mean pooling means we basically take the token-wise average of the network's final activations and call that final single vector the embedding. 
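
As a tiny illustration of mean pooling, here is a plain-Python sketch for one sentence. It skips padding positions using the attention mask, which is how masked mean pooling is usually done. The function name `mean_pool` and the toy vectors are made up for this example; the library's Pooling module does the same thing on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vec):
                summed[i] += value
    # guard against an all-padding input
    return [s / max(n_real, 1) for s in summed]


# two real tokens and one padding token, each with a 2-dimensional vector
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```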
  
Lastly, many models include a normalization stage. This determines whether or not we scale vectors to have a uniform unit length. It's the default for Sentence Transformers, and in practice I've found it's saved me a lot of headache to always and only deal with normalized vectors. 
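
One reason normalized vectors are so convenient: once everything has unit length, a plain dot product is the cosine similarity. Here is a small sketch that checks this on toy vectors. The helper names are hypothetical and only exist for this illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed the long way, on unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# after normalization, the dot product matches the cosine similarity
print(dot(a_n, b_n), cosine(a, b))
```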

Next, we pass our three modules into a SentenceTransformer class, which creates the final model that can be used by the library's Trainer class.

> Note: you can also pass in additional arguments here that would have typically been passed to the Hugging Face model, such as the attention implementation.

Phew. That's a lot. Let's write a function to make our life a bit easier. 

In [None]:
# Prepare the ST model
def get_st_unsloth_wrapper(
        model,
        tokenizer,
        base_model_id=BASE_MODEL_ID,
        pooling_mode="mean",
        max_seq_length=MAX_SEQ_LENGTH,
        ):
    print("Initializing Sentence Transformer modules...")

    # 1. Create the Transformer module instance
    transformer_module = sentence_transformers.models.Transformer(
        model_name_or_path=base_model_id,
        max_seq_length=max_seq_length,
    )

    # 2. Replace the internal Hugging Face model with our LORA-patched Unsloth model
    transformer_module.auto_model = model
    transformer_module.tokenizer = tokenizer

    print(f"Manually assigned Unsloth LORA model to sentence_transformers.models.Transformer module.")

    # 3. Create the Pooling module
    hidden_size = model.config.hidden_size
    pooling_module = sentence_transformers.models.Pooling(
        word_embedding_dimension=hidden_size,
        pooling_mode=pooling_mode,
    )
    print(f"Using Pooling module with mode: {pooling_mode}")

    # 4. Add the Normalize module
    normalize_module = sentence_transformers.models.Normalize()
    modules = [transformer_module, pooling_module, normalize_module]

    # 5. Initialize SentenceTransformer with custom modules
    sbert_model = SentenceTransformer(modules=modules)

    print(f"SentenceTransformer wrapper created with custom modules.")
    return sbert_model

# wrap our unsloth model in Sentence Transformers
sbert_model = get_st_unsloth_wrapper(
        model,
        tokenizer,
        max_seq_length=MAX_SEQ_LENGTH,
        base_model_id=BASE_MODEL_ID,
        pooling_mode="mean",
  )

## Data preparation

Now that the model is set up, let's get the data ready. This is the data setup from Philipp Schmid's notebook, grabbed here since we're mainly focused on the unsloth and QLoRA setup.

The main thing we need to do is properly format this dataset for the contrastive loss we will be using.

A proper deep dive into contrastive losses is far beyond the scope of this guide. Here's an excellent reference that teaches you all the basics (and then some) you'll likely need to get going. 

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")

# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")

# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))

# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

The key takeaway is that all the hard research into contrastive losses has paid off tremendously: it has resulted in a certain kind of loss, called Multiple Negatives Ranking Loss (MNRL), that makes it possible to train embedding models with loosely, implicitly labeled data like Q&A pairs. 

Question and Answer pairs become a pair of reference (anchor) and matching (positive) vectors that should cluster together.

During training, the loss takes the matching vectors from the *other* training examples in the same batch and uses them as negatives. 

This means all you need to start training a good embeddings model is a good set of Q&A pairs.

With how ubiquitous and powerful this kind of data has become thanks to SFT and reasoning-based RL, you can see how we're very close to an insanely powerful data bootstrapping feedback loop. It's just around the corner...

And, we can always do some more work to improve this loss, and pick better negative examples. But as an aside, it is pretty outrageous and lucky how quickly we can set up fine-tuning embeddings models with this loss.

Let's go ahead and define this powerful loss function. 

In [None]:
# define the loss function
loss = MultipleNegativesRankingLoss(sbert_model) 
print(f"Using loss: {type(loss).__name__}")
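
To make the in-batch negatives idea concrete, here is a plain-Python sketch of the core computation: a cross-entropy over each row of an anchor-by-positive similarity matrix, where the diagonal holds the true pairs. This is an illustration under my own assumptions, not the library's implementation. In particular, the function name is made up, and the `scale` of 20 is the value I believe the library uses by default for cosine similarities.

```python
import math


def in_batch_negatives_loss(sim, scale=20.0):
    """Mean cross-entropy over the rows of a similarity matrix.

    sim[i][j] is the similarity between anchor i and positive j.
    The correct column for row i is column i; every other column
    in that row acts as a negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# anchors are closest to their own positives (the diagonal)
good = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.1, 0.9]]
# anchors are closest to somebody else's positive
bad = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.9, 0.1]]

print(f"well-aligned batch: {in_batch_negatives_loss(good):.4f}")
print(f"misaligned batch:   {in_batch_negatives_loss(bad):.4f}")
```

The loss is low when each anchor scores highest against its own positive and grows when another example's positive wins, which is exactly the pressure that pulls matching pairs together.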

The next step is to define and group up all of our training arguments. A rule of thumb is that LoRA can overfit if it's trained for too many epochs. So we start with one, but this is definitely a parameter to play with.

For our batch size, you should use the largest one that fits in your GPU's VRAM. This is especially important for our loss, since it draws its negative examples from the other pairs in the same batch. The larger the batch size, the more negatives each anchor is compared against. 

In [None]:
## Preparing all of our training arguments 

# Training Configuration 
NUM_TRAIN_EPOCHS = 1              # Start with 1 epoch
PER_DEVICE_TRAIN_BATCH_SIZE = 512   # Adjust based on GPU VRAM and MAX_SEQ_LENGTH.
PER_DEVICE_EVAL_BATCH_SIZE = 1024   # Can usually be higher than train batch size.
GRADIENT_ACCUMULATION_STEPS = 1   # Only for small cards

# don't repeat samples in the same batch given our loss
batch_sampler = BatchSamplers.NO_DUPLICATES if isinstance(loss, MultipleNegativesRankingLoss) else None
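
Why does a no-duplicates sampler matter here? If the same document shows up twice in one batch, the loss would treat a correct match as a negative. Below is a greedy plain-Python sketch of the idea. The function `no_duplicate_batches` is a hypothetical helper written for this post, and the library's actual sampler may well work differently.

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily group (anchor, positive) pairs into batches in which
    no text appears twice. Clashing pairs are deferred to a later batch."""
    remaining = list(pairs)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for anchor, positive in remaining:
            fits = len(batch) < batch_size
            if fits and anchor not in seen and positive not in seen:
                batch.append((anchor, positive))
                seen.update((anchor, positive))
            else:
                deferred.append((anchor, positive))
        batches.append(batch)
        remaining = deferred
    return batches


# two different questions share the same answer document "docA"
pairs = [("q1", "docA"), ("q2", "docA"), ("q3", "docB"), ("q4", "docC")]
for batch in no_duplicate_batches(pairs, batch_size=3):
    print(batch)
```

In this toy run, the second question that points at `docA` is pushed into its own batch, so `docA` never acts as a false negative.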

The rest of the training arguments are standard for unsloth models. However, QLoRA for encoders is a quite unexplored space. There are likely far more optimal configurations, but this is a good start. 

In [None]:
# set lower for longer training runs
LEARNING_RATE = 1e-4

WARMUP_RATIO = 0.1                 # Fraction of total steps used for warmup
OPTIMIZER = "adamw_8bit"           # start with 8bit optimizer
LR_SCHEDULER_TYPE = "cosine"       # schedule for the lr
WEIGHT_DECAY = 0.1                 # Weight decay 
FP16 = not torch.cuda.is_bf16_supported() # Use FP16 if BF16 is not available
BF16 = torch.cuda.is_bf16_supported()     # Use BF16 on supported GPUs (Ampere+) for stability.
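
To picture what the warmup ratio and the cosine schedule do together, here is a small sketch of the learning rate at a given step: a linear ramp up during warmup, then a half-cosine decay to zero. It is my own approximation of the usual linear-warmup-plus-cosine shape, with a made-up function name and a hypothetical run length of 100 steps.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 100  # HYPOTHETICAL run length, just for illustration
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

With very short runs, note that the warmup phase may only last a step or two, so the peak learning rate kicks in almost immediately.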

Let's define how we'll evaluate the model. We'll also define where to save the fine-tuned model.

In [None]:
# set the output directory
OUTPUT_DIR = Path("finetuned_embeddings")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# evaluation and saving
EVAL_STRATEGY = "steps"
EVAL_STEPS = 8        # evaluate every N steps
SAVE_STRATEGY = "steps"
SAVE_STEPS = 8       # save checkpoint every N steps
SAVE_TOTAL_LIMIT = 2    # keep only the last N checkpoints
LOGGING_STEPS = 10      # log metrics every N steps
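
With a batch size this large, an epoch has very few optimizer steps, so it is worth checking how often evaluation will actually fire. Here is a quick sketch of that arithmetic. The training set size below is a hypothetical number, the helper is made up, and the exact rounding can differ between library versions and batch samplers. One thing to keep in mind: when loading the best model at the end with step-based strategies, the save interval needs to be a round multiple of the eval interval, which is why both are set to the same value here.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size,
                    grad_accum_steps=1, num_devices=1):
    """Approximate optimizer steps in one epoch (last partial batch kept)."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return math.ceil(batches / grad_accum_steps)


TRAIN_EXAMPLES = 6_300  # HYPOTHETICAL training set size
EVAL_EVERY = 8          # same value as EVAL_STEPS above

steps = steps_per_epoch(TRAIN_EXAMPLES, per_device_batch_size=512)
print(f"{steps} optimizer steps per epoch")
print(f"{steps // EVAL_EVERY} evaluation(s) triggered by the step counter")
```

If the step count comes out smaller than the eval interval, lower the interval, otherwise no mid-run evaluation or checkpoint would happen at all.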

Once again, the specific evaluation setup is borrowed from Philipp Schmid's notebook. 

In [None]:
# load test dataset
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)

# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => [relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]


evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    score_functions={"cosine": cos_sim},
)
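
The evaluator reports several ranking metrics. To give a feel for what they measure in our setup, where each query has exactly one relevant document, here is a simplified plain-Python sketch for a single query. The function `retrieval_metrics` is a hypothetical helper for illustration; the evaluator itself computes these over the whole query set and at several cutoffs.

```python
import math


def retrieval_metrics(ranked_ids, relevant_id, k=10):
    """Accuracy@k, MRR@k and NDCG@k for one query with one relevant document."""
    top_k = ranked_ids[:k]
    if relevant_id not in top_k:
        return {"accuracy": 0.0, "mrr": 0.0, "ndcg": 0.0}
    rank = top_k.index(relevant_id) + 1  # 1-based position of the hit
    return {
        "accuracy": 1.0,
        "mrr": 1.0 / rank,
        # with a single relevant document the ideal DCG is 1,
        # so NDCG reduces to the discounted gain at the hit's rank
        "ndcg": 1.0 / math.log2(rank + 1),
    }


# the relevant document (id 42) was retrieved in third place
print(retrieval_metrics([7, 3, 42, 5], relevant_id=42))
```

All three metrics reward putting the right document near the top. MRR and NDCG additionally penalize how far down the list it lands.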

In [None]:
print("Defining training arguments...")
args = SentenceTransformerTrainingArguments(
    # Core Training Parameters
    output_dir=str(OUTPUT_DIR),
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    warmup_ratio=WARMUP_RATIO,
    weight_decay=WEIGHT_DECAY,
    optim=OPTIMIZER,
    batch_sampler=batch_sampler,
    fp16=FP16,
    bf16=BF16,
    tf32=True, # NOTE: gpu must support
    fp16_full_eval=True,
    # Evaluation and Saving
    eval_strategy=EVAL_STRATEGY,
    eval_steps=EVAL_STEPS,
    save_strategy=SAVE_STRATEGY,
    save_steps=SAVE_STEPS,
    save_total_limit=SAVE_TOTAL_LIMIT,
    load_best_model_at_end=True if evaluator else False,
    # metric key reported by an unnamed InformationRetrievalEvaluator with a "cosine" score function
    metric_for_best_model="eval_cosine_ndcg@10" if evaluator and isinstance(evaluator, InformationRetrievalEvaluator) else None,
    greater_is_better=True, 
    # Logging and Reporting
    logging_steps=LOGGING_STEPS,
    report_to="tensorboard",
    run_name=f"{BASE_MODEL_ID.split('/')[-1]}-st-finetune",
    seed=42,
)

# Preparing the Trainer

In [None]:
print("Initializing SentenceTransformerTrainer...")
trainer = SentenceTransformerTrainer(
    model=sbert_model, # Pass the standard SentenceTransformer model
    args=args,
    train_dataset=train_dataset.select_columns(["anchor", "positive"]),
    # the dataset only has "anchor", "positive" and "id" columns; there is no "score" column
    eval_dataset=test_dataset.select_columns(["anchor", "positive"]) if evaluator else None,
    loss=loss,
    evaluator=evaluator,
    callbacks=[],
)

In [None]:
# launch the fine-tuning run
trainer.train()

# Comparison with the original model

In [None]:

original_model = SentenceTransformer(BASE_MODEL_ID)
original_model.eval()

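# FastModel.from_pretrained returns a (model, tokenizer) pair, so unpack it.
# This assumes the fine-tuned adapters were saved to args.output_dir.
fine_tuned_model, _ = FastModel.from_pretrained(
    model_name = args.output_dir,
    auto_model = BERT_MODEL,
    max_seq_length = MAX_SEQ_LENGTH,
    load_in_4bit = LOAD_IN_4BIT,
)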
fine_tuned_model.eval()
fine_tuned_sbert = get_st_unsloth_wrapper(
    fine_tuned_model,
    tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    base_model_id=BASE_MODEL_ID,
    pooling_mode="mean",
)

# Evaluate the model
baselines = evaluator(original_model)
fine_tuned_results = evaluator(fine_tuned_sbert)

# Print the main score
print(f"Original model: {baselines}")
print(f"Fine-tuned model: {fine_tuned_results}")

# Conclusion