# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll start with the lowest-hanging improvement you can make to a RAG pipeline:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
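
As a toy illustration of that goal, the sketch below uses made-up 2-D vectors (not real model outputs) to show what "closer together" means in terms of cosine similarity. The vectors and the `cosine_similarity` helper are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration
document = [1.0, 0.0]
question_before = [0.0, 1.0]  # before fine-tuning: unrelated direction
question_after = [0.8, 0.6]   # after fine-tuning: pulled toward the document

sim_before = cosine_similarity(question_before, document)
sim_after = cosine_similarity(question_after, document)
print(sim_before, sim_after)
```

Fine-tuning aims to produce this kind of shift for every question and its matching document, while keeping unrelated documents further away.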

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

#### Q1 Answer
When training embedding models, the **choice of data type**—specifically **Question & Document (Q\&D) pairs** vs. **inter-document or related sentence pairs**—has a significant impact on the learned representations.

---

### 🔍 **Nuance: Q\&D Pairs vs. Inter-Document/Sentence Pairs**

#### **1. Q\&D Pairs (Query-Document)**

* **Nature**: These consist of a **question (query)** and a **relevant answer/document**.

* **Goal**: Learn to embed **semantic asymmetry**—i.e., the question often **underspecifies** the content in the document.

* **Example**:

  * Q: "What causes rain?"
  * D: "Rain is caused by the condensation of water vapor in the atmosphere."

* **Embedding Effect**:

  * Models learn to match texts that differ in form: a short question on one side, a longer passage containing the answer on the other.
  * Strong performance in **retrieval** tasks where the input is a query and the system must find relevant long/short documents.

* **Use Case Fit**: Great for **information retrieval**, search, QA, chatbots.

---

#### **2. Inter-Document / Related Sentence Pairs**

* **Nature**: These consist of two **semantically similar or paraphrased** texts.

* **Goal**: Learn to embed **semantic equivalence** or high similarity.

* **Example**:

  * Sentence 1: "He drove to work."
  * Sentence 2: "He took the car to the office."

* **Embedding Effect**:

  * Models focus on **symmetric** similarity (A ↔ B).
  * Useful for clustering, deduplication, paraphrase detection, semantic search.

* **Use Case Fit**: Better for **semantic similarity**, duplicate detection, clustering, summarization alignment.

---

### ⚠️ **Caveats of Q\&D Training**

1. **Asymmetry**:
   Embeddings trained on Q\&D pairs may perform poorly on tasks requiring **symmetric similarity**, since "question" and "document" are not interchangeable.

2. **Query Diversity Matters**:
   If queries are too generic or templated (e.g., "What is..."), the model may not generalize well. **Diverse and natural queries** improve robustness.

3. **Answer Length Mismatch**:
   Documents are often much longer than queries. Models must learn to **distill relevance**, not just match overall topic.

4. **One-to-Many Mapping**:
   One query may have multiple valid answers. If the model sees only one positive, it may be penalized for ranking valid alternatives lower.

5. **In-batch Negatives Risk**:
   Some in-batch negatives may actually be relevant answers to other queries, adding noise unless deduplicated carefully.

6. **Shallow vs. Deep Semantics**:
   Q\&D training often emphasizes **high-level semantic matching** rather than deep understanding or reasoning, unless carefully constructed.
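
The in-batch negatives caveat is easy to hit when several questions are generated from the same document: if two of them land in one batch, each question's positive document is treated as a negative for the other. One possible mitigation, sketched here with a hypothetical helper `batches_without_duplicate_docs`, is to group pairs so that no batch repeats a document ID. The sentence-transformers documentation also recommends a no-duplicates batch sampler with this kind of loss.

```python
def batches_without_duplicate_docs(pairs, batch_size):
    """Greedily group (question, doc_id) pairs so no batch repeats a doc_id."""
    batches = []
    for question, doc_id in pairs:
        for batch in batches:
            # Reuse the first batch that has room and does not contain this doc yet
            if len(batch) < batch_size and all(d != doc_id for _, d in batch):
                batch.append((question, doc_id))
                break
        else:
            # No suitable batch found: start a new one
            batches.append([(question, doc_id)])
    return batches

# Hypothetical pairs: two questions per document
pairs = [("q1", "docA"), ("q2", "docA"), ("q3", "docB"), ("q4", "docB")]
batches = batches_without_duplicate_docs(pairs, batch_size=2)
print(batches)
```

This greedy grouping can leave some batches smaller than `batch_size`; it is a sketch of the idea, not a tuned sampler.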

---

### ✅ **Special Considerations for Questions (Q)**

* **Naturalness**: Use **natural language questions**, not keyword-based or templated ones.
* **Variety**: Include a mix of factoid, causal, comparison, and complex queries.
* **Clarity**: Ensure questions are unambiguous.
* **Answerability**: Avoid questions with no real answer in the corpus (or flag them separately).
* **Length**: Use varied lengths to prevent overfitting to short or long queries.

---

### 🧠 Summary

| Aspect             | Q\&D Pairs                        | Related Sentences / Docs          |
| ------------------ | --------------------------------- | --------------------------------- |
| Focus              | Asymmetric semantic matching      | Symmetric similarity              |
| Ideal For          | Retrieval, QA, search             | Paraphrasing, deduplication       |
| Embedding Behavior | Aligns different semantic forms   | Aligns similar semantic content   |
| Special Needs      | Query diversity, natural phrasing | Balanced, semantically rich pairs |

---



## Task 1: Dependencies and Boilerplate

We'll apply `nest_asyncio` so we can run async event loops inside our notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [3]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [4]:
#!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [5]:
#!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml 

### Provide OpenAI API Key

In [6]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

## Task 2: Loading Data

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [7]:
!mkdir data

A subdirectory or file data already exists.


In [8]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 31554    0 31554    0     0  64890      0 --:--:-- --:--:-- --:--:-- 65059


In [9]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 70721    0 70721    0     0   247k      0 --:--:-- --:--:-- --:--:--  249k


In [10]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
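
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window chunker. It is only an illustration: `RecursiveCharacterTextSplitter` first splits on separators (paragraphs, lines, words) and then merges pieces, so its real output will differ. The helper name `sliding_window_chunks` is an assumption.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk shares its first character with the end of the previous one, which is the role the overlap plays: context at a chunk boundary is not lost entirely.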

Next we can load/split these documents as follows.

> NOTE: You may need to run this cell twice to get it to work.

In [12]:
training_documents = text_splitter.split_documents(text_loader.load())

In [13]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [14]:
import uuid

id_set = set()

for document in training_documents:
  doc_id = str(uuid.uuid4())
  # Regenerate in the (extremely unlikely) event of a collision
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [15]:
# Hold out the last 24 chunks: 12 for validation, then 12 for testing
n_docs = len(training_documents)
training_split_documents = training_documents[:n_docs - 24]
val_split_documents = training_documents[n_docs - 24:n_docs - 12]
test_split_documents = training_documents[n_docs - 12:]
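
The same slicing can be wrapped in a small helper so the split sizes are stated once and checked. This is a sketch; `split_train_val_test` is a hypothetical name, and a list of integers stands in for the 102 chunks.

```python
def split_train_val_test(items, n_val, n_test):
    """Slice a list into train/val/test, taking val and test from the end."""
    n_train = len(items) - n_val - n_test
    if n_train < 0:
        raise ValueError("not enough items for the requested split sizes")
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Stand-in for the 102 chunks produced above
train, val, test = split_train_val_test(list(range(102)), n_val=12, n_test=12)
print(len(train), len(val), len(test))
```

Because the slices are taken in document order, the validation and test chunks all come from the end of the corpus; shuffling first would be a reasonable variation.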


## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can now construct a fine-tuning dataset with OpenAI's `gpt-4.1-mini`.

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [16]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4.1-mini` to generate questions for each context.

In [17]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [18]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in the question-generation function - let's take a closer look:

1. First, we provide a list of documents and a number of questions.
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with `id` keys whose values are question strings (`str`).
- An object with `question_id` keys whose values are a `List[str]` of the associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.
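
To make that relationship concrete, the mapping can be inverted so each context lists its questions. This is a small sketch with shortened, hypothetical IDs; `questions_per_context` is not part of the notebook.

```python
from collections import defaultdict

def questions_per_context(relevant_docs):
    """Invert question_id -> [context_id] into context_id -> [question_id]."""
    inverted = defaultdict(list)
    for question_id, context_ids in relevant_docs.items():
        for context_id in context_ids:
            inverted[context_id].append(question_id)
    return dict(inverted)

# Hypothetical short IDs standing in for the UUIDs above
relevant_docs = {"q1": ["ctx-a"], "q2": ["ctx-a"], "q3": ["ctx-b"]}
by_context = questions_per_context(relevant_docs)
print(by_context)
```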

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

#### Activity 1 completed in the code chunk below:

In [19]:
import tqdm
import asyncio

"""
Sample Usage of TQDM:

for i in tqdm.tqdm(range(10)):
  time.sleep(1)
"""

async def create_questions(documents, n_questions):

    questions = {}
    relevant_docs = {}
    
    for doc in tqdm.tqdm(documents, desc="Generating questions"):
      # Prepare the input for the chain
      input_context = doc.page_content
      doc_id = doc.metadata["id"]
      
      # Call the question generating chain
      response = await question_generation_chain.ainvoke({"context": input_context, "n_questions": n_questions})
      
      # Extract questions
      generated_questions = response.content.split("\n")
      generated_questions = [q.strip() for q in generated_questions if q.strip()]
      
      # Strip list numbering such as "1. ", "12. ", "3) " or "1 " from the start of each line
      cleaned_questions = []
      for q in generated_questions:
        stripped = q.lstrip("0123456789")
        if stripped != q and stripped[:1] in (".", ")", " "):
          cleaned_questions.append(stripped[1:].strip())
        else:
          cleaned_questions.append(q)
      
      #Now save each question
      for q in cleaned_questions:
        question_id = str(uuid.uuid4())
        questions[question_id] = q
        relevant_docs[question_id] = [doc_id]

    return questions, relevant_docs
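
Note that the function above awaits each LLM call in turn, so the requests run one after another even though the function is `async`. A possible variation, sketched below with a stand-in coroutine instead of a real LLM call, is to run several requests concurrently with a cap on how many are in flight. The names `gather_with_limit` and `fake_generate` are hypothetical, and a suitable `max_concurrency` depends on your API rate limits.

```python
import asyncio

async def gather_with_limit(coroutine_fn, items, max_concurrency=5):
    """Run coroutine_fn(item) for every item, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await coroutine_fn(item)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))

async def fake_generate(context):
    """Stand-in for the LLM call; returns a canned question."""
    await asyncio.sleep(0)
    return f"What is said about {context}?"
```

In a notebook with `nest_asyncio` applied, this could be used as `results = await gather_with_limit(fake_generate, ["agents", "prompts"])`.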

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [20]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Generating questions: 100%|██████████| 78/78 [01:48<00:00,  1.39s/it]


We'll use the function to generate training, validation, and test data.

In [21]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Generating questions: 100%|██████████| 12/12 [00:13<00:00,  1.15s/it]


In [22]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Generating questions: 100%|██████████| 12/12 [00:16<00:00,  1.35s/it]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [23]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [24]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [25]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
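
Although the files use a `.jsonl` extension, each one holds a single JSON object written by `json.dump`, so it should be read back with `json.load` rather than line by line. The sketch below round-trips a tiny hypothetical dataset and checks that every referenced ID exists; `load_dataset` is an illustrative helper, not part of the notebook.

```python
import json
import os
import tempfile

def load_dataset(path):
    """Load a dataset dict saved with json.dump and check its references."""
    with open(path) as f:
        dataset = json.load(f)
    for question_id, context_ids in dataset["relevant_contexts"].items():
        assert question_id in dataset["questions"], f"unknown question {question_id}"
        for context_id in context_ids:
            assert context_id in dataset["corpus"], f"unknown context {context_id}"
    return dataset

# Tiny hypothetical dataset written to a temporary file
toy = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.json")
with open(path, "w") as f:
    json.dump(toy, f)

loaded = load_dataset(path)
print(loaded["questions"])
```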

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [26]:
#!pip install -qU sentence_transformers datasets pyarrow

In [27]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [28]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [29]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [30]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [31]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [32]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

#### MultipleNegativesRankingLoss

The following summary was provided by ChatGPT.

This code defines a **PyTorch loss function class** named `MultipleNegativesRankingLoss`, used to **train sentence embedding models** (like those from `SentenceTransformer`) for **retrieval tasks**, such as matching a query to a relevant document.

---

### 🔍 **Key Purpose**

Train a model so that, for each **anchor** input (like a question), the model **maximizes similarity** with its paired **positive** (like the correct answer) while minimizing similarity with all other inputs (used as in-batch negatives).

---

### 🧱 **Structure Breakdown**

#### 1. **Imports**

* Standard Python typing and iterable tools
* PyTorch modules for tensors and neural nets
* `sentence_transformers` utilities for computing similarity and working with models

#### 2. **Class Definition: `MultipleNegativesRankingLoss`**

Inherits from `torch.nn.Module`.

##### **Initialization (`__init__`)**

* Takes a `SentenceTransformer` model, a similarity scale (default = `20.0`), and a similarity function (`cos_sim` by default).
* Prepares a `CrossEntropyLoss` function, which is used during training.

##### **Forward Pass**

* Input: A list of dictionaries with sentence inputs (`sentence_features`) and dummy `labels` (not used).
* **Step-by-step:**

  1. Compute embeddings for each input group (anchor, positive, optional negatives).
  2. Separate the **anchors** from the **candidates** (all the rest: positives and negatives).
  3. Compute a **similarity score matrix** (anchor vs. all candidates).
  4. Multiply scores by a scaling factor (e.g., 20.0), which acts as an inverse temperature and sharpens the softmax over candidates.
  5. Construct labels: Each anchor should match its corresponding positive at the same index.
  6. Apply cross-entropy loss on the similarity scores to train the model.
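
The steps above can be sketched in plain Python. This is a simplified illustration for (anchor, positive) pairs only, using lists instead of tensors; the library's actual implementation is batched and differentiable. The function names are assumptions.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positive i is the correct 'class' for anchor i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor against every positive in the batch
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp computed stably by subtracting the max logit
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings, invented for illustration
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

When each anchor is most similar to its own positive the loss is near zero; when the positives are swapped the loss is large, which is the signal that pushes matching pairs together.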

##### **Other Methods**

* `get_config_dict`: Returns the loss config (scale and similarity function name).
* `citation`: Returns the BibTeX citation for the original paper.

---

### ✅ **Use Case**

Best used when you have datasets of (anchor, positive) pairs or triplets and want to **optimize embeddings** for **semantic search, FAQ matching, or retrieval systems**. The in-batch structure allows for efficient training without needing explicit negative samples.

---

### 📚 **Example Provided**

Trains a model with the `SentenceTransformerTrainer` using a small dataset of sentence pairs.

---

#### MatryoshkaLoss

This code implements **Matryoshka Loss**, a technique that allows a SentenceTransformer model to **learn embeddings that perform well at multiple dimensionalities**, enabling **faster and cheaper inference** without retraining the model.

---

## 🧠 **Core Idea**

Train a model so that **embeddings truncated to smaller sizes** (like 512, 256, 128, etc.) still retain semantic meaning and maintain performance. This is inspired by the paper [*Matryoshka Representation Learning* (Kusupati et al., 2024)](https://arxiv.org/abs/2205.13147).

---

## ⚙️ **Components Explained**

### 1. `shrink(tensor, dim)`

* Truncates the embedding to the first `dim` dimensions.
* Normalizes the result (`L2` norm).
* Used to simulate lower-dimensional embeddings from the full representation.
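
A plain-Python analogue of this truncate-then-normalize step, operating on a list rather than a tensor, might look like the following. It mirrors the description above rather than the library's code, and it assumes the truncated vector is not all zeros.

```python
import math

def shrink(embedding, dim):
    """Keep the first `dim` values, then rescale to unit L2 norm."""
    truncated = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Toy 4-D embedding, invented for illustration
full = [3.0, 4.0, 12.0, 5.0]
small = shrink(full, 2)
print(small)
```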

---

### 2. `ForwardDecorator`

* **Wraps `SentenceTransformer.forward()`**.
* Caches the full embeddings and applies shrinking **without recomputing the model's outputs** for each dimension.
* Used only when the underlying loss **is not** a `Cached...` loss.

---

### 3. `CachedLossDecorator`

* Applies when using a loss that **already caches embeddings**, like:

  * `CachedMultipleNegativesRankingLoss`
  * `CachedGISTEmbedLoss`
  * `CachedMultipleNegativesSymmetricRankingLoss`
* Applies shrinking to precomputed representations, avoiding the need to wrap `model.forward`.

---

### 4. `MatryoshkaLoss` (Main Class)

* Wraps another loss (e.g. `MultipleNegativesRankingLoss`) and evaluates it **across multiple embedding dimensions**.
* Takes:

  * A SentenceTransformer model
  * A loss function (standard or cached)
  * A list of target embedding sizes (e.g., `[768, 512, 256]`)
  * Optional weights and control over how many dimensions to evaluate per training step

#### Two Execution Paths:

* For **cached losses**: Uses `CachedLossDecorator`
* For **standard losses**: Applies `ForwardDecorator` on the fly to avoid recomputing model outputs

---

## ✅ **Example Use Case**

Train a model where you want to:

* Use **full-size embeddings (e.g. 768d)** for high-performance tasks
* Use **smaller embeddings (e.g. 128d or 64d)** for fast similarity search or edge inference
* Do this **without training separate models** for each case

---

## 📌 Summary

**MatryoshkaLoss** enables a single SentenceTransformer model to be trained so its embeddings are **robust and performant across multiple dimensions**, optimizing both quality and deployment efficiency.


Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [33]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if you had significantly more data.

In [34]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
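
To see what a warm-up does, here is a sketch of a learning-rate multiplier that ramps up linearly and then decays linearly. It assumes a linear warm-up/decay schedule and assumes 156 training pairs (78 chunks with 2 questions each); check your own `len(loader)` for the real numbers. The function name is hypothetical.

```python
import math

def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed sizes: 156 training pairs, batch size 10, 10 epochs, 10% warm-up
steps_per_epoch = math.ceil(156 / 10)
total_steps = steps_per_epoch * 10
warmup_steps = int(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)
print(warmup_linear_multiplier(8, warmup_steps, total_steps))
```

Under these assumptions the warm-up covers roughly the first epoch, so early updates are small while the optimizer statistics settle.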

In [35]:
import wandb
wandb.init(mode="disabled")

> NOTE: You may not see direct improvement during the training cycles - this is absolutely expected. We will verify performance later in the notebook. 

In [36]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.979167,0.972222,0.972222
32,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.963789,0.951389,0.951389
48,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
50,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
64,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
80,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
96,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
100,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
112,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
128,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
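
The precision values in the log above (0.333 at k=3, 0.2 at k=5, 0.1 at k=10) may look low next to a recall of 1.0, but they follow directly from each question having exactly one relevant chunk: the best achievable precision at k is 1/k. A small sketch with hypothetical document IDs shows the arithmetic.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for one query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)

# Hypothetical ranking where the single relevant chunk is ranked first
retrieved = ["d1", "d7", "d3", "d9", "d2"]
p_at_3, r_at_3 = precision_recall_at_k(retrieved, {"d1"}, k=3)
p_at_5, r_at_5 = precision_recall_at_k(retrieved, {"d1"}, k=5)
print(p_at_3, r_at_3, p_at_5, r_at_5)
```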


In [69]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [70]:
hf_username = "bsmith3715"

In [71]:
import uuid

model.push_to_hub(f"{hf_username}/legal-ft-{uuid.uuid4()}")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/bsmith3715/legal-ft-c1756b89-ec54-4faf-8008-d6229c57bdd7/commit/10783caeac44f38336d75c75b4f730a9ce0bfa8e'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [40]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [41]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
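
Hit rate only records whether the expected document appears somewhere in the top `k`; it ignores its position. If rank matters to you, mean reciprocal rank (MRR) is a common complement. The sketch below uses hypothetical rankings and is independent of the function above.

```python
def mean_reciprocal_rank(ranked_ids_per_query, expected_ids):
    """Average of 1/rank of the expected document (0 when it is missing)."""
    total = 0.0
    for ranked_ids, expected_id in zip(ranked_ids_per_query, expected_ids):
        if expected_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(expected_id) + 1)
    return total / len(expected_ids)

# Hypothetical results: found at rank 1, found at rank 2, not found
rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
expected = ["a", "y", "q"]
mrr = mean_reciprocal_rank(rankings, expected)
print(mrr)
```

For these three queries the hit rate would be 2/3, while the MRR is lower because one of the hits is at rank 2.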

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [42]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:11<00:00,  2.03it/s]


In [43]:
te3_results_df = pd.DataFrame(te3_results)

In [44]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(1.0)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [45]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:03<00:00,  7.05it/s]


In [46]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [47]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.7916666666666666)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [48]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:03<00:00,  6.93it/s]


In [49]:
finetune_results_df = pd.DataFrame(finetune_results)

In [50]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.9583333333333334)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [51]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (not fine-tuned) embedding model.

#### R - Retrieval

In [52]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [53]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [54]:
rag_llm =  ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [55]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
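
If the LCEL syntax is unfamiliar, the data flow of the chain above can be approximated in plain Python: retrieve context for the question, fill the prompt, generate, and return both the response and the context. This is a conceptual sketch with stand-in functions (all hypothetical), not how LangChain executes the chain internally.

```python
def make_rag_chain(retrieve, format_prompt, generate):
    """Compose retrieve -> prompt -> generate, returning response and context."""
    def invoke(inputs):
        question = inputs["question"]
        context = retrieve(question)
        prompt = format_prompt(context=context, question=question)
        return {"response": generate(prompt), "context": context}
    return invoke

# Stand-ins for the retriever, prompt template, and LLM
def fake_retrieve(question):
    return ["chunk about agents"]

def fake_format_prompt(context, question):
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

def fake_llm(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"

chain = make_rag_chain(fake_retrieve, fake_format_prompt, fake_llm)
result = chain({"question": "What is an agent?"})
print(result["response"])
```

Returning the context alongside the response, as the real chain does, makes it possible to inspect which chunks the answer was grounded in.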

In [56]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'Based on the provided context, an "agent" in the context of AI refers to systems that can act on behalf of a user, such as travel agents or digital assistants. However, the term is highly vague and lacks a clear, universally accepted definition. Different people interpret "agents" differently—some see them as systems that go and act independently on your behalf, while others think of them as LLMs with access to tools that can be used iteratively to solve problems. The concept of "agents" remains somewhat elusive and is often associated with the idea of autonomy, but without a precise or consistent meaning.'

In [57]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [58]:
base_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context does not specify a particular time of year that is considered the "laziest" for AI.'

In [59]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context does not specify the name "Simon" or details about the largest model he has run on his phone. Therefore, I do not know the answer.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [60]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})
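
As a rough intuition for why swapping the embedding model is enough, the toy sketch below runs the same top-k search over two sets of vectors. Everything in it is an assumption for illustration: the 2-D vectors are made up, `top_k` is a hypothetical helper, and plain Euclidean distance stands in for whatever index FAISS actually builds.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return indices of the k document vectors closest to query_vec."""
    dists = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(dists)[:k]]


# Made-up embeddings of the same three documents under two models.
base_doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1)]
tuned_doc_vecs = [(0.2, 0.8), (1.0, 0.0), (0.1, 0.95)]

# Made-up embeddings of the same question under the two models.
base_query = (0.8, 0.2)
tuned_query = (0.15, 0.85)

# The search code is identical; only the vectors differ.
base_hits = top_k(base_query, base_doc_vecs, k=2)
tuned_hits = top_k(tuned_query, tuned_doc_vecs, k=2)
print(base_hits, tuned_hits)
```

The retrieval logic never changes between the two calls. Different embeddings place the question near different documents, which is the effect fine-tuning is meant to produce.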

In [61]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [62]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'The provided documents indicate that the term "agent" is used in various ways and lacks a clear, universally accepted definition. Generally, some people consider AI agents to be systems that act on your behalf, like travel agents, while others think of them as LLMs given access to tools to solve problems in a loop. However, the concept remains vague and often associated with the idea of autonomous systems that can perform tasks independently. Despite ongoing discussions and prototypes, true autonomous AI agents that reliably act on your behalf are still considered to be "coming soon" and face challenges such as gullibility and difficulty distinguishing truth from fiction.'

In [63]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced better-than-GPT-3 models, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [64]:
finetune_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context does not specify a particular time of year when AI is the "laziest." Therefore, I do not know.'

In [65]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is Mistral 7B.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

Based on a "vibe" check, the fine-tuned RAG chain (FT) answered the questions better than the baseline RAG chain (BL).

Q1: The FT response opens by stating that "agent" is a vague term and then explains it in more detail. The BL response attempts to define "agent" with an example and only later notes that the term is vague. Both responses are reasonable.

Q2: The responses are nearly identical.

Q3: Both responses indicate that the provided context doesn't specify an answer, but the FT response explicitly states that it does not know.

Q4: The BL response was unable to answer the question, while the FT response gave a specific answer (Mistral 7B).

The RAGAS evaluation (found below) tells a similar story: the fine-tuned RAG chain scored higher than the baseline on five of the six metrics. The exception is noise sensitivity, where lower is better, so the rise from 0.1029 to 0.1547 is a regression rather than an improvement. The better score in each row is in bold:

1. context_recall: 0.5139 (BL), **0.7139 (FT)**
2. faithfulness: 0.5358 (BL), **0.7413 (FT)**
3. factual_correctness(mode=f1): 0.4000 (BL), **0.5742 (FT)**
4. answer_relevancy: 0.6293 (BL), **0.8532 (FT)**
5. context_entity_recall: 0.3757 (BL), **0.4856 (FT)**
6. noise_sensitivity(mode=relevant): **0.1029 (BL)**, 0.1547 (FT)


## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe checks, but let's use RAGAS to gain more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

In [66]:
#!pip install -qU ragas langchain openai sentence-transformers faiss-cpu rapidfuzz
!pip install ragas






In [74]:
### YOUR CODE HERE

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [75]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(training_documents, testset_size=10)


#### Baseline Data Set Evaluation

In [82]:


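import copy

# Deep-copy so that writing responses below does not mutate `dataset`
# (plain assignment would only create another name for the same object).
baseline_data = copy.deepcopy(dataset)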

for test_row in baseline_data:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [83]:
baseline_data.to_pandas()


| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | Can you explane why 2023 was such an important... | [The legal arguments here are complex. I’m not... | [Stuff we figured out about AI in 2023\n\n\n\n... | Based on the provided context, 2023 was an imp... | 2023 was the breakthrough year for Large Langu... | single_hop_specifc_query_synthesizer |
| 1 | What were some key insights about Large Langua... | [Prompt injection explained, with video, slide... | [Large Language Models\nThey’re actually quite... | Some key insights about Large Language Models ... | In 2023, it was noted that Large Language Mode... | single_hop_specifc_query_synthesizer |
| 2 | Whut are Large Langauge Modles and how have th... | [Prompt injection explained, with video, slide... | [Here’s the sequel to this post: Things we lea... | Large Language Models (LLMs) are advanced AI s... | Large Language Models are created by taking a ... | single_hop_specifc_query_synthesizer |
| 3 | As an AI Research and Development Specialist, ... | [Prompt injection explained, with video, slide... | [They’re actually quite easy to build\nThe mos... | The surprising ease of building Large Language... | LLMs are surprisingly easy to build, as it onl... | single_hop_specifc_query_synthesizer |
| 4 | Why is critical thinking so important when dea... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nI like people who are skeptical of... | Critical thinking is essential when dealing wi... | Critical thinking is important because there h... | multi_hop_abstract_query_synthesizer |
| 5 | How does the recent dramatic collapse in the c... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nI’m still trying to figure out the... | The recent dramatic collapse in the cost of ru... | The recent dramatic collapse in the cost of ru... | multi_hop_abstract_query_synthesizer |
| 6 | How does the Apple MLX library improve the exp... | [While MLX is a game changer, Apple’s own “App... | [<1-hop>\n\nI’m still trying to figure out the... | The provided context does not contain specific... | The Apple MLX library is considered excellent ... | multi_hop_abstract_query_synthesizer |
| 7 | How has the introduction of GPT-4 Turbo influe... | [That same laptop that could just about run a ... | [<1-hop>\n\nHere’s the rest of the transcript.... | The introduction of GPT-4 Turbo contributed to... | The introduction of GPT-4 Turbo significantly ... | multi_hop_abstract_query_synthesizer |
| 8 | How can mistral.rs be used to run Llama 3.2 Vi... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nOctober\n\n1st: OpenAI DevDay 2024... | I do not know. | Mistral.rs was used on October 19th to run Lla... | multi_hop_specific_query_synthesizer |
| 9 | How do Meta's Llama 3.2 models compare to Llam... | [This unleashed a whirlwind of innovation, whi... | [<1-hop>\n\nMeta’s Llama 3.2 models deserve a ... | Meta's Llama 3.2 models represent an advanceme... | Meta's Llama 3.2 models, particularly the 1B a... | multi_hop_specific_query_synthesizer |


In [84]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(baseline_data.to_pandas())

In [85]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [86]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


{'context_recall': 0.5139, 'faithfulness': 0.5358, 'factual_correctness(mode=f1)': 0.4000, 'answer_relevancy': 0.6293, 'context_entity_recall': 0.3757, 'noise_sensitivity(mode=relevant)': 0.1029}

In [87]:
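import copy

# Deep-copy so the fine-tuned run writes into its own rows instead of
# overwriting responses stored on the shared `dataset` object.
finetuned_data = copy.deepcopy(dataset)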

for test_row in finetuned_data:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
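
Both evaluation loops write responses back into the rows they iterate over, so it matters whether `baseline_data` and `finetuned_data` share rows with `dataset`. The small sketch below shows the difference between plain assignment and `copy.deepcopy`. `Sample` is a hypothetical stand-in class, not the RAGAS sample type, so treat this as an illustration of Python semantics only.

```python
import copy


class Sample:
    """Hypothetical stand-in for a test row with a writable response."""

    def __init__(self, user_input):
        self.user_input = user_input
        self.response = None


dataset = [Sample("What is an agent?")]

alias = dataset                       # another name for the same list and rows
independent = copy.deepcopy(dataset)  # separate list with separate rows

alias[0].response = "baseline answer"
independent[0].response = "fine-tuned answer"

# The write through `alias` is visible on `dataset`; the deep copy is not.
print(dataset[0].response)
print(independent[0].response)
```

In this notebook the baseline scores are computed from a pandas snapshot taken before the fine-tuned loop runs, so the reported numbers are not affected. Displaying a shared `baseline_data` afterwards would show the fine-tuned responses, though.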

In [88]:
finetuned_data.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | Can you explane why 2023 was such an important... | [Stuff we figured out about AI in 2023\n\n\n\n... | [Stuff we figured out about AI in 2023\n\n\n\n... | 2023 was a significant year for AI because it ... | 2023 was the breakthrough year for Large Langu... | single_hop_specifc_query_synthesizer |
| 1 | What were some key insights about Large Langua... | [Things we learned about LLMs in 2024\n\n\n\n\... | [Large Language Models\nThey’re actually quite... | Some key insights about Large Language Models ... | In 2023, it was noted that Large Language Mode... | single_hop_specifc_query_synthesizer |
| 2 | Whut are Large Langauge Modles and how have th... | [Here’s the sequel to this post: Things we lea... | [Here’s the sequel to this post: Things we lea... | Large Language Models (LLMs) are a type of sof... | Large Language Models are created by taking a ... | single_hop_specifc_query_synthesizer |
| 3 | As an AI Research and Development Specialist, ... | [They’re actually quite easy to build\nThe mos... | [They’re actually quite easy to build\nThe mos... | As an AI Research and Development Specialist, ... | LLMs are surprisingly easy to build, as it onl... | single_hop_specifc_query_synthesizer |
| 4 | Why is critical thinking so important when dea... | [If you think about what they do, this isn’t s... | [<1-hop>\n\nI like people who are skeptical of... | Critical thinking is essential when dealing wi... | Critical thinking is important because there h... | multi_hop_abstract_query_synthesizer |
| 5 | How does the recent dramatic collapse in the c... | [These price drops are driven by two factors: ... | [<1-hop>\n\nI’m still trying to figure out the... | The recent dramatic collapse in the cost of ru... | The recent dramatic collapse in the cost of ru... | multi_hop_abstract_query_synthesizer |
| 6 | How does the Apple MLX library improve the exp... | [On the other hand, as software engineers we a... | [<1-hop>\n\nI’m still trying to figure out the... | The Apple MLX library enhances the experience ... | The Apple MLX library is considered excellent ... | multi_hop_abstract_query_synthesizer |
| 7 | How has the introduction of GPT-4 Turbo influe... | [The GPT-4 barrier was comprehensively broken\... | [<1-hop>\n\nHere’s the rest of the transcript.... | The introduction of GPT-4 Turbo significantly ... | The introduction of GPT-4 Turbo significantly ... | multi_hop_abstract_query_synthesizer |
| 8 | How can mistral.rs be used to run Llama 3.2 Vi... | [This unleashed a whirlwind of innovation, whi... | [<1-hop>\n\nOctober\n\n1st: OpenAI DevDay 2024... | I do not know. | Mistral.rs was used on October 19th to run Lla... | multi_hop_specific_query_synthesizer |
| 9 | How do Meta's Llama 3.2 models compare to Llam... | [Meta’s Llama 3.2 models deserve a special men... | [<1-hop>\n\nMeta’s Llama 3.2 models deserve a ... | The provided information indicates that Meta's... | Meta's Llama 3.2 models, particularly the 1B a... | multi_hop_specific_query_synthesizer |


In [89]:
evaluation_tuned_dataset = EvaluationDataset.from_pandas(finetuned_data.to_pandas())

In [90]:
custom_run_config = RunConfig(timeout=360)

result_tuned = evaluate(
    dataset=evaluation_tuned_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result_tuned


{'context_recall': 0.7139, 'faithfulness': 0.7413, 'factual_correctness(mode=f1)': 0.5742, 'answer_relevancy': 0.8532, 'context_entity_recall': 0.4856, 'noise_sensitivity(mode=relevant)': 0.1547}

In [92]:
print(result)
print(result_tuned)

{'context_recall': 0.5139, 'faithfulness': 0.5358, 'factual_correctness(mode=f1)': 0.4000, 'answer_relevancy': 0.6293, 'context_entity_recall': 0.3757, 'noise_sensitivity(mode=relevant)': 0.1029}
{'context_recall': 0.7139, 'faithfulness': 0.7413, 'factual_correctness(mode=f1)': 0.5742, 'answer_relevancy': 0.8532, 'context_entity_recall': 0.4856, 'noise_sensitivity(mode=relevant)': 0.1547}
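
To read the two result lines side by side, the sketch below computes the change per metric and labels its direction. The scores are copied from the printed output above. The `LOWER_IS_BETTER` set is an assumption based on the RAGAS description of noise sensitivity as a metric where lower values are better, and the variable names are illustrative.

```python
# Scores copied from the two printed results above.
baseline = {
    "context_recall": 0.5139,
    "faithfulness": 0.5358,
    "factual_correctness(mode=f1)": 0.4000,
    "answer_relevancy": 0.6293,
    "context_entity_recall": 0.3757,
    "noise_sensitivity(mode=relevant)": 0.1029,
}
finetuned = {
    "context_recall": 0.7139,
    "faithfulness": 0.7413,
    "factual_correctness(mode=f1)": 0.5742,
    "answer_relevancy": 0.8532,
    "context_entity_recall": 0.4856,
    "noise_sensitivity(mode=relevant)": 0.1547,
}

# Assumption: noise sensitivity is the only metric here where lower is better.
LOWER_IS_BETTER = {"noise_sensitivity(mode=relevant)"}

rows = []
for metric, base in baseline.items():
    tuned = finetuned[metric]
    delta = tuned - base
    improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    rows.append((metric, base, tuned, delta, improved))
    verdict = "improved" if improved else "worse"
    print(f"{metric:<34} {base:.4f} -> {tuned:.4f} ({delta:+.4f}) {verdict}")
```

With only ten test questions, differences of this size are suggestive rather than conclusive, so a larger `testset_size` would give a more reliable comparison.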
