# Introduction

- This post walks through a simple, barebones setup for fine-tuning a text embedding model using the Unsloth and Sentence Transformers libraries.
	- Specifically, we will fine-tune a set of QLoRA weights using a contrastive loss on a simple Q&A dataset. 
- The Unsloth library has made it incredibly easy, efficient, and affordable to fine-tune Large Language Models (LLMs).
- Most of their work has focused on fine-tuning decoders, aka the generative side of models. This makes sense, given the high visibility and ever-increasing capabilities of said models.
	- We are finally reaching the point many people have been hoping for: small (roughly 8B parameters or fewer) models that can run on powerful consumer GPUs can, via good fine-tuning, reach performance on specialized and niche tasks that rivals some of the most powerful commercial offerings.
- They have an ocean of notebooks that make it incredibly easy for anyone to get set up fine-tuning powerful LLMs.
	- They've done a ton of quantization work to make sure many, if not most, fine-tuning runs fit on reasonable consumer hardware, which is a massive game-changer.
- However, tools like RAG are also incredibly powerful. RAG is the backbone of most LLM applications currently deployed in the real world.
- RAG engines rely on text embedding models, aka the encoder side of LLM training. 
- There is a great post here from the creators of the recent `ModernBERT` embedding model that details how LLMs capture all the hype and fanfare, while encoder models are the actual workhorses of AI products.
- Unsloth does not explicitly support fine-tuning encoder models. It has been a feature in the pipeline for a while, but they understandably have a ton going on.
- Thankfully, we can easily leverage some recent PRs, along with the Sentence Transformers library, to fine-tune an embedding model with Unsloth.
- In summary, we will load in a regular huggingface embeddings model, then we will wrap it in unsloth's QLoRA fine-tuning setup. Then, we'll wrap this again inside of a custom sentence-transformers model. Finally, we can use sentence-transformers directly to train this model.
	- Thankfully, both sentence-transformers and unsloth subclass huggingface's Trainer and TrainingArguments. The APIs and functionality are not identical, but they are close enough for our purposes here. 
- Sentence Transformers will do the heavy lifting of our learning loop, handling the input data batches, the embeddings-specific loss, and the weight updates. We are then going to need a custom Sentence Transformers model with unsloth as a base, and we'll be ready to fine-tune. 

- First, let's start by picking a good embeddings baseline. We'll use the recent Nomic embeddings based on ModernBERT.
- We'll load up this model, and look under the hood to see exactly how QLoRA will work.
- Next, we'll patch in the QLoRA weights to be learned using the Unsloth library. Unsloth has a whole set of good, hard-won default arguments for fine-tuning LLMs. From my initial experiments, it seems like some of these will need rethinking for encoder models, but they are certainly a good starting point.
- We can see how QLoRA only trains a small set of adapter weights, a fraction of the size of the model's original parameters, making it feasible to run this training on powerful consumer hardware instead of massive clusters.
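- To make that fraction concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes a standard LoRA adapter, which adds two small matrices of rank `r` next to each frozen weight matrix. The layer shapes and the rank below are hypothetical and are not taken from the actual model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Weights in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(layer_shapes, rank):
    """Adapter weights divided by the frozen weights they sit next to."""
    frozen = sum(d_in * d_out for d_in, d_out in layer_shapes)
    adapters = sum(
        lora_trainable_params(d_in, d_out, rank) for d_in, d_out in layer_shapes
    )
    return adapters / frozen


# Hypothetical example: four square 768-wide projections, adapter rank 16.
shapes = [(768, 768)] * 4
fraction = trainable_fraction(shapes, rank=16)
print(f"trainable share of the adapted layers: {fraction:.2%}")
```

- Note that this only compares the adapters against the matrices they attach to. Layers without adapters, such as the token embeddings, push the whole-model fraction lower still.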

- Once we have the QLoRA-patched embeddings model, we can follow the Sentence Transformers documentation to create a custom model. There are only a few requirements:
	- We need to manually create a Transformer model. For this, we'll directly pass in our embeddings model.
	- We also need to tell Sentence Transformers how to convert the model's final output into an embedding. This is called the pooling stage. There are a ton of techniques, but it seems like mean pooling is currently winning out. Mean pooling means we take the token-wise average of the network's final activations and call that single vector the embedding.
	- Lastly, many models include a normalization stage. This determines whether or not we scale vectors to unit length. It's the default for Sentence Transformers, and in practice I've found it saves me a lot of headaches to always and only deal with normalized vectors.
- Next, we pass our three modules into a SentenceTransformer class, which creates the final model that can be used by the library's Trainer class.
	- Note: you can also pass in additional arguments here that would typically have been passed to the Hugging Face model, such as the attention implementation.
- Now, we are ready to set up the data. Since we're focused on the QLoRA and unsloth details, we'll pick a simple hello-world embeddings dataset. 
- Let's load it up, and poke around to see what's inside.
- The main thing we need to do is properly format this dataset for the contrastive loss we will be using.
	- A proper deep dive into contrastive losses is far beyond the scope of this guide. Here's an excellent reference that teaches you all the basics (and then some) you'll likely need to get going.
	- The key takeaway is that all the hard research into contrastive losses has paid off tremendously: it has resulted in a certain kind of loss, called MBCE, that makes it possible to train embedding models with loosely, implicitly labeled data like Q&A pairs.
		- Question and Answer pairs become a pair of reference (anchor) and matching (positive) vectors that should cluster together.
		- During training, the loss takes the matching vectors from *different* training examples in the same batch and uses them as negatives.
		- This means all you need to start training a good embeddings model is a good set of Q&A pairs.
			- With how ubiquitous and powerful this kind of data has become thanks to SFT and reasoning-based RL, you can see how we're very close to an insanely powerful data bootstrapping feedback loop. It's just around the corner...
		- And, we can always do some more work to improve this loss, and pick better negative examples. But as an aside, it is pretty outrageous and lucky how quickly we can set up fine-tuning embeddings models th
	- Let's focus back on the task at hand. We'll prepare both the test and training datasets in the format needed by the semi-magical MBCE loss.
		- This boils down to marking questions as "anchors" and answers as "positives".
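- To tie the pieces above together, here is a toy, plain-Python sketch of mean pooling, normalization, the anchor/positive formatting, and an in-batch negatives loss. This is not the Sentence Transformers implementation: the rows, the `question`/`answer` column names, the tiny 2-d activations, and the `scale` value are all made up for illustration. The loss is assumed to be a softmax cross-entropy over in-batch cosine similarities (Sentence Transformers ships a loss of this shape as `MultipleNegativesRankingLoss`).

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, skipping padding tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct "class" for anchor i is positive i.

    Every other positive in the batch acts as a negative for anchor i.
    Inputs are assumed unit-length, so a dot product is a cosine similarity.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * sum(a * p for a, p in zip(anchor, pos)) for pos in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical Q&A rows, renamed into the anchor/positive format.
rows = [
    {"question": "What is QLoRA?", "answer": "Quantized low-rank adapters."},
    {"question": "What is pooling?", "answer": "Collapsing token vectors into one."},
]
pairs = [{"anchor": row["question"], "positive": row["answer"]} for row in rows]

# Made-up final activations (3 tokens x 2 dims); the last token is padding.
mask = [1, 1, 0]
question_tokens = [
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],
    [[0.0, 2.0], [0.0, 4.0], [9.0, 9.0]],
]
answer_tokens = [
    [[2.0, 0.0], [2.0, 0.0], [9.0, 9.0]],
    [[0.0, 1.0], [0.0, 5.0], [9.0, 9.0]],
]

anchors = [normalize(mean_pool(tokens, mask)) for tokens in question_tokens]
positives = [normalize(mean_pool(tokens, mask)) for tokens in answer_tokens]

matched = in_batch_negatives_loss(anchors, positives)
shuffled = in_batch_negatives_loss(anchors, positives[::-1])
print(f"matched pairs: {matched:.4f}, mismatched pairs: {shuffled:.4f}")
```

- With correctly matched pairs the loss is close to zero, and swapping the positives makes it large. That gap is the signal that pulls matching questions and answers together.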