# Fine-tuning Embeddings for RAG on Specific Data - HARD MODE

As we kick off our "fine-tuning" week, we'll start with the lowest-hanging improvement we can make to RAG:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
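
To make the idea concrete, here is a minimal sketch with made-up 2-D vectors. Real embeddings have hundreds of dimensions, and real training updates the model's weights rather than the vectors directly. `nudge` is a hypothetical helper that moves the question vector part of the way toward its document, which raises their cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge(vec, target, step=0.5):
    """Move `vec` a fraction `step` of the way toward `target` (illustration only)."""
    return [v + step * (t - v) for v, t in zip(vec, target)]

# Toy vectors, not real model outputs.
question = [1.0, 0.0]   # "Who drives the bus?"
document = [0.6, 0.8]   # "The bus was driven by Kyle, the Bus Driver."

before = cosine(question, document)
after = cosine(nudge(question, document), document)
print(f"before={before:.3f} after={after:.3f}")
```

The similarity after the nudge is higher than before, which is the effect we want fine-tuning to have on real question/document pairs.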

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [2]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [3]:
#!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [4]:
#!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml 

### Provide OpenAI API Key

In [5]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

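These webpages are posts from the [Mr. Money Mustache](https://www.mrmoneymustache.com/) blog:

- [The Shockingly Simple Math Behind Early Retirement](https://www.mrmoneymustache.com/2012/01/13/the-shockingly-simple-math-behind-early-retirement/)
- [How Much Do I Need for Retirement?](https://www.mrmoneymustache.com/2012/05/29/how-much-do-i-need-for-retirement/)
- [Killing Your $1000 Grocery Bill](https://www.mrmoneymustache.com/2012/03/29/killing-your-1000-grocery-bill/)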

Let's start by collecting our data into a useful pile!

In [6]:
!mkdir -p data_hard


In [7]:
!curl https://www.mrmoneymustache.com/2012/01/13/the-shockingly-simple-math-behind-early-retirement/ -o data_hard/the-shockingly-simple-math-behind-early-retirement.html
!curl https://www.mrmoneymustache.com/2012/05/29/how-much-do-i-need-for-retirement/ -o data_hard/how-much-do-i-need-for-retirement.html
!curl https://www.mrmoneymustache.com/2012/03/29/killing-your-1000-grocery-bill/ -o data_hard/killing-your-1000-grocery-bill.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  293k    0  293k    0     0   102k      0 --:--:--  0:00:02 --:--:--  102k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  359k    0  359k    0     0   169k      0 --:--:--  0:00:02 --:--:--  169k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  475k    0  475k    0     0   108k      0 --:--:--  0:00:04 --:--:--  118k


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data_hard/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
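
For intuition about `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `naive_chunks` is a hypothetical helper used only for illustration.

```python
def naive_chunks(text, chunk_size=750, chunk_overlap=20):
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 200  # 2,000 characters of stand-in text
chunks = naive_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk begins with the last 20 characters of the previous one, so a sentence cut at a boundary still appears whole in at least part of a neighboring chunk.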

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

765

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  doc_id = str(uuid.uuid4())
  # regenerate on the (extremely unlikely) chance of a collision
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [13]:
# Calculate the indices for slicing
train_end = int(len(training_documents) * 0.76)
val_end = train_end + int(len(training_documents) * 0.12)

# Slice the data
training_split_documents = training_documents[:train_end]
val_split_documents = training_documents[train_end:val_end]
test_split_documents = training_documents[val_end:]
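
As a sanity check on the arithmetic: with the 765 chunks reported above, this slicing yields 581 training, 91 validation, and 93 test chunks. The sketch below reproduces that calculation without needing the documents; `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_items, train_frac=0.76, val_frac=0.12):
    """Return (train, val, test) sizes using the same int() truncation as the slicing above."""
    train_end = int(n_items * train_frac)
    val_end = train_end + int(n_items * val_frac)
    return train_end, val_end - train_end, n_items - val_end

print(split_sizes(765))  # (581, 91, 93)
```

Note that the slices are taken in document order, without shuffling, so each split covers a different contiguous stretch of the source pages.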

## [HARD-MODE] Task 3a: Constructing a Fine-tuning Dataset - Naive Approach

Using the chunks we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (see the [release announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.
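
Step 2 depends on turning the model's numbered list back into individual questions. Here is a minimal sketch of that parsing, assuming the model follows the `1. QUESTION` format requested in the prompt. `parse_numbered_questions` is a hypothetical helper, and the sample response is made up.

```python
import re

def parse_numbered_questions(response_text):
    """Extract questions from lines shaped like '1. question', skipping blank lines."""
    questions = []
    for line in response_text.splitlines():
        cleaned = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions

sample_response = "1. What is the 4% rule?\n\n2. How does savings rate affect retirement age?"
print(parse_numbered_questions(sample_response))
```

Skipping blank lines matters because a model may separate its numbered items with empty lines, and an empty string would otherwise be stored as a "question".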

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object keyed by question `id`, whose values are the `str` questions.
- An object keyed by `question_id`, whose values are a `List[str]` of the associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
```

context_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
```

As you can see, a piece of context can be associated with more than 1 question.

The task is to write the Python function(s) to accomplish this task.

Your function signature is provided below, along with the desired return values.

> NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [17]:
import tqdm
import uuid
import re

async def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  ### YOUR CODE HERE
  for document in tqdm.tqdm(documents, desc="Processing documents"):
        doc_id = document.metadata["id"]
        context = document.page_content

        # Use question_generation_chain to generate questions on the chunks
        question_generation_result = await question_generation_chain.ainvoke({
            "context":context,
            "n_questions":n_questions
        })

        for question in question_generation_result.content.split('\n'):
            # Strip the leading "1." style numbering and surrounding whitespace
            question = re.sub(r"^\s*\d+\.", "", question).strip()

            # Skip blank lines so we don't store empty questions
            if not question:
                continue

            # Associate doc with question by id in return value objects
            q_id = str(uuid.uuid4())
            questions[q_id] = question
            relevant_docs[q_id] = [doc_id]

  return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [18]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Processing documents: 100%|██████████| 581/581 [13:23<00:00,  1.38s/it]  


We'll use the function to generate training, validation, and test data.

In [19]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing documents: 100%|██████████| 91/91 [02:10<00:00,  1.44s/it]


In [20]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing documents: 100%|██████████| 93/93 [01:46<00:00,  1.14s/it]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [21]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("data_hard/training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [22]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("data_hard/val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [23]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("data_hard/test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
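
Before training on these files, it can be worth checking that every question points at a chunk that actually exists in the corpus. Here is a small sketch, assuming the `questions` / `relevant_contexts` / `corpus` layout used above. `check_dataset` is a hypothetical helper and the toy dataset is made up.

```python
def check_dataset(dataset):
    """Return a list of problems found in a questions/relevant_contexts/corpus dataset."""
    problems = []
    for q_id, question in dataset["questions"].items():
        if not question.strip():
            problems.append(f"{q_id}: empty question")
        for doc_id in dataset["relevant_contexts"].get(q_id, []):
            if doc_id not in dataset["corpus"]:
                problems.append(f"{q_id}: unknown context {doc_id}")
        if q_id not in dataset["relevant_contexts"]:
            problems.append(f"{q_id}: no relevant context")
    return problems

toy = {
    "questions": {"q1": "Who drives the bus?", "q2": ""},
    "relevant_contexts": {"q1": ["d1"], "q2": ["d9"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}
print(check_dataset(toy))
```

An empty list means the dataset is internally consistent; the same function could be run on the dictionaries loaded back from the saved files.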

## [HARD-MODE] Task 3b: Constructing a Fine-tuning Dataset - RAGAS Knowledge Graph Approach

In [24]:
os.environ["RAGAS_APP_TOKEN"] = getpass.getpass("Please enter your Ragas API key!")

In [25]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [26]:
from ragas.testset.persona import Persona

persona_fiduciary = Persona(
    name = "Fiduciary",
    role_description = "You are an expert in personal and family finance who speaks in the best interest of others",
)

persona_fire = Persona(
    name = "FIRE",
    role_description = "You have achieved financial independence and have retired early, simply by mastering the fundamentals of personal finance",
)

persona_worker = Persona(
    name = "Worker",
    role_description = "You are a hard worker, who one day hopes to be financially independent and retire early",
)

persona_list = [persona_fiduciary, persona_fire, persona_worker]
persona_list

[Persona(name='Fiduciary', role_description='You are an expert in personal and family finance who speaks in the best interest of others'),
 Persona(name='FIRE', role_description='You have achieved financial independence and have retired early, simply by mastering the fundamentals of personal finance'),
 Persona(name='Worker', role_description='You are a hard worker, who one day hopes to be financially independent and retire early')]

In [27]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, persona_list=persona_list)

In [28]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1),
        
        #multi hops were not working for me - seeing "No clusters found in the knowledge graph. Try changing the relationship condition."
        #(MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0),
        #(MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0),
]

In [29]:
import copy

kg_docs = copy.deepcopy(training_split_documents)

for document in kg_docs:
  id = document.metadata["id"]
  document.page_content = id+"###"+document.page_content

kg_training_dataset = generator.generate_with_langchain_docs(kg_docs, testset_size=50, query_distribution=query_distribution)
training_dataframe = kg_training_dataset.to_pandas()


Applying SummaryExtractor:   0%|          | 0/453 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/581 [00:00<?, ?it/s]

Node 005ca84a-a5e2-438d-9bd4-780909d777e6 does not have a summary. Skipping filtering.
Node 401c9b16-b9b6-490f-95ae-e0e7d6d7a952 does not have a summary. Skipping filtering.
Node 984b76e9-91ea-494a-b3d9-324759bf0e79 does not have a summary. Skipping filtering.
Node 659ea6c4-c892-4cdf-afd6-59e4f0c2477a does not have a summary. Skipping filtering.
Node 6774a6fd-5afa-460d-9242-2dcf893ef117 does not have a summary. Skipping filtering.
Node c0495fbf-e295-491e-ac69-1551fb0eeef6 does not have a summary. Skipping filtering.
Node 5e7f6b5c-1e2c-4f87-913a-6df0cabf2889 does not have a summary. Skipping filtering.
Node 47d16392-d478-43ac-be74-146bdcbd0942 does not have a summary. Skipping filtering.
Node 5155a824-6630-4259-952b-3dea95457c8c does not have a summary. Skipping filtering.
Node 9951e7b0-36dd-43c1-b511-b5c586e39471 does not have a summary. Skipping filtering.
Node 5aade670-db32-4008-83e6-0a27b9a227d8 does not have a summary. Skipping filtering.
Node 26582e64-b5e1-4db4-8fb5-fce670159c35 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/1615 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/50 [00:00<?, ?it/s]

In [30]:
kg_val_docs = copy.deepcopy(val_split_documents)

for document in kg_val_docs:
  id = document.metadata["id"]
  document.page_content = id+"###"+document.page_content

kg_val_dataset = generator.generate_with_langchain_docs(kg_val_docs, testset_size=50, query_distribution=query_distribution)
val_dataframe = kg_val_dataset.to_pandas()

Applying SummaryExtractor:   0%|          | 0/72 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/91 [00:00<?, ?it/s]

Node 61bf9e22-a29f-41c0-a325-4becfadfb294 does not have a summary. Skipping filtering.
Node ce9482a2-97d6-429e-b902-1924ccfe35a4 does not have a summary. Skipping filtering.
Node fda9e80f-7bb9-49cb-848d-4200689a446a does not have a summary. Skipping filtering.
Node ce558062-d93f-45ea-8ebc-6a86688c1f92 does not have a summary. Skipping filtering.
Node 55b874b7-e10b-43d2-b4a7-82e1cb23a237 does not have a summary. Skipping filtering.
Node 87ac9130-ad01-45d6-a9c9-ce34b330a13a does not have a summary. Skipping filtering.
Node f4b34e79-7ab2-43df-bb5b-a8d5e65da9ca does not have a summary. Skipping filtering.
Node dffbc500-4dc6-430d-94d6-3aa2e84407fa does not have a summary. Skipping filtering.
Node 0df59c03-b4b9-4c5e-bf93-77bc88418c14 does not have a summary. Skipping filtering.
Node 7fe94bfe-5cf7-4d93-893e-6f98a90a4714 does not have a summary. Skipping filtering.
Node 09520c08-5f36-4b8d-8119-84cf80962a15 does not have a summary. Skipping filtering.
Node fffbc69b-8b3c-4e42-aea6-587bfad85cea d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/254 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/50 [00:00<?, ?it/s]

In [31]:
kg_test_docs = copy.deepcopy(test_split_documents)

for document in kg_test_docs:
  id = document.metadata["id"]
  document.page_content = id+"###"+document.page_content

kg_test_dataset = generator.generate_with_langchain_docs(kg_test_docs, testset_size=50, query_distribution=query_distribution)
test_dataframe = kg_test_dataset.to_pandas()

Applying SummaryExtractor:   0%|          | 0/74 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/93 [00:00<?, ?it/s]

Node 420b7f38-1409-43b5-85b5-d5eb035fc750 does not have a summary. Skipping filtering.
Node 5e09cfe0-55a3-41d3-9eed-8639a0f5cc56 does not have a summary. Skipping filtering.
Node 75d56d2d-c02d-4794-961a-329cfa760a79 does not have a summary. Skipping filtering.
Node 25a245ce-1040-4fb7-8d9f-b26023ec59ca does not have a summary. Skipping filtering.
Node dffbe640-4b0b-4480-94c7-9f607757f327 does not have a summary. Skipping filtering.
Node 6f34be65-caf5-42f9-82ba-b1ed0fba9dac does not have a summary. Skipping filtering.
Node 01cf39bf-f35f-4068-b2ac-7387d4fef54c does not have a summary. Skipping filtering.
Node efb1c802-f645-4a2e-ab62-a7c4f32b21d2 does not have a summary. Skipping filtering.
Node f1232243-6d44-4d06-8b63-62b9042eb52a does not have a summary. Skipping filtering.
Node ba458b2a-4ddb-4121-9f1e-606644be8c8c does not have a summary. Skipping filtering.
Node 6713794e-13c4-48ec-8e45-7eb7da04d1d6 does not have a summary. Skipping filtering.
Node 7b6f208e-dfe6-4830-b33c-8ade3530052a d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/260 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/50 [00:00<?, ?it/s]

In [32]:
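# Convert a generated testset dataframe into our questions/relevant_contexts/corpus format and save it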
def createKGDataSet(filename, dataframe, docs):
  questions = {}
  relevant_contexts = {}
  corpus = {}

  generated_questions = dataframe["user_input"].to_list()

  for i, question in enumerate(generated_questions):
      q_id = str(uuid.uuid4())
      questions[q_id] = question
      reference_context = dataframe["reference_contexts"][i]
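      # the chunk ID is everything before the first "###" separator
      doc_id = reference_context[0].split("###", 1)[0]
      relevant_contexts[q_id] = [doc_id]

  for document in docs:
      document_id = document.metadata["id"]
      # split only once so any "###" inside the chunk text is preserved
      document_content = document.page_content.split("###", 1)[1]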
      corpus[document_id] = document_content

  ds = {
      "questions": questions,
      "relevant_contexts": relevant_contexts,
      "corpus": corpus
  }

  with open(filename, "w") as f:
      json.dump(ds, f)


# create the jsonl files and save to our data directory
createKGDataSet("data_hard/kg_training_dataset.jsonl", training_dataframe, kg_docs)
createKGDataSet("data_hard/kg_val_dataset.jsonl", val_dataframe, kg_val_docs)
createKGDataSet("data_hard/kg_test_dataset.jsonl", test_dataframe, kg_test_docs)
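
The `id###content` prefix is what lets us map each generated sample back to its source chunk. Here is a minimal sketch of that round trip, assuming the generator returns the reference context with the prefix intact. `tag` and `untag` are hypothetical helpers; splitting only on the first separator keeps any `###` that appears in the chunk text itself.

```python
SEPARATOR = "###"

def tag(doc_id, text):
    """Prefix the chunk text with its ID so the ID survives dataset generation."""
    return f"{doc_id}{SEPARATOR}{text}"

def untag(tagged_text):
    """Split a tagged chunk back into (doc_id, text), splitting only on the first separator."""
    doc_id, text = tagged_text.split(SEPARATOR, 1)
    return doc_id, text

tagged = tag("abc-123", "Heading ### with a separator inside")
print(untag(tagged))
```

Because UUIDs never contain `#`, the first separator in a tagged chunk is always the one we added.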

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [33]:
#!pip install -qU sentence_transformers datasets pyarrow

In [34]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [35]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [36]:
BATCH_SIZE = 10

In [37]:
# train on subset of data for time
SUBSET_SIZE = 100

Let's move our dataset into the expected format for training.

In [38]:
with open("data_hard/training_dataset.jsonl", "r") as f:
    t_ds = json.load(f)
    n_corpus = t_ds["corpus"]
    n_queries = t_ds["questions"]
    n_relevant_docs = t_ds["relevant_contexts"]

# corpus = train_dataset['corpus']
# queries = train_dataset['questions']
# relevant_docs = train_dataset['relevant_contexts']

naive_examples = []
for query_id, query in n_queries.items():
    doc_id = n_relevant_docs[query_id][0]
    text = n_corpus[doc_id]
    naive_example = InputExample(texts=[query, text])
    naive_examples.append(naive_example)

Now we can create a `torch` `DataLoader`!

In [39]:
subset_naive_examples = naive_examples[:SUBSET_SIZE]
# for naive
naive_loader = DataLoader(
    subset_naive_examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [40]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

- `MultipleNegativesRankingLoss` is a ranking-style loss that pulls each positive pair (here, a question and its source chunk) closer together while pushing the question away from the other documents in the same batch, which act as negatives. For each question, the loss is a cross-entropy over its similarity scores against every document in the batch, with the paired document as the correct "class". Each positive pair is ranked against multiple negatives, hence the name, and larger batches provide more negatives.

- `MatryoshkaLoss` is a wrapper around an "inner" loss function. It applies the inner loss to the embedding truncated to each size listed in `matryoshka_dimensions` and combines the results. This trains the model to pack the most important information into the leading dimensions, like nested matryoshka dolls, so that shorter embeddings stay useful. From the paper, it "allows a single embedding to adapt to the computational constraints of downstream tasks", while maintaining the fidelity of the truncated embeddings.

- In the code, `MultipleNegativesRankingLoss` is the inner loss function for `MatryoshkaLoss`. The dimensions `[768, 512, 256, 128, 64]` are the truncation sizes at which `MultipleNegativesRankingLoss` is applied.
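
To see both ideas in miniature, here is a pure-Python sketch on made-up 4-D vectors. It is not the library implementation: it assumes cosine similarity scaled by 20 (which, as far as I know, matches the `sentence-transformers` defaults) and equal weights for every truncation size. `mnr_loss` and `matryoshka_loss` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, positive) for positive in positives]
        log_sum_exp = math.log(sum(math.exp(logit) for logit in logits))
        total += log_sum_exp - logits[i]  # -log softmax probability of the true pair
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims):
    """Sum the inner loss over embeddings truncated to each size in `dims`."""
    return sum(
        mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy "embeddings" (made up): two questions and their matching documents.
questions = [[1.0, 0.1, 0.0, 0.2], [0.1, 1.0, 0.3, 0.0]]
documents = [[0.9, 0.2, 0.1, 0.1], [0.2, 0.8, 0.2, 0.1]]

aligned = matryoshka_loss(questions, documents, dims=[4, 2])
swapped = matryoshka_loss(questions, documents[::-1], dims=[4, 2])
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

The loss is small when each question is paired with its own document and large when the pairs are swapped, at both the full and the truncated size.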

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [41]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

with open("data_hard/val_dataset.jsonl", "r") as f:
    v_ds = json.load(f)
    val_corpus = v_ds["corpus"]
    val_queries = v_ds["questions"]
    val_relevant_docs = v_ds["relevant_contexts"]

#for naive
naive_evaluator = InformationRetrievalEvaluator(val_queries, val_corpus, val_relevant_docs)

Next, we'll build the same loader and evaluator for the knowledge-graph (KG) dataset.

In [44]:
# build KG loader and evaluator

with open("data_hard/kg_training_dataset.jsonl", "r") as f:
    kg_t_ds = json.load(f)
    kg_corpus = kg_t_ds['corpus']
    kg_queries = kg_t_ds['questions']
    kg_relevant_docs = kg_t_ds['relevant_contexts']

kg_examples = []
for i, query in kg_queries.items():
    doc_id = kg_relevant_docs[i][0]
    text = kg_corpus[doc_id]
    kg_example = InputExample(texts=[query, text])
    kg_examples.append(kg_example)

subset_kg_examples = kg_examples[:SUBSET_SIZE]
kg_loader = DataLoader(
    subset_kg_examples, batch_size=BATCH_SIZE
)

with open("data_hard/kg_val_dataset.jsonl", "r") as f:
    kg_v_ds = json.load(f)
    kg_val_corpus = kg_v_ds['corpus']
    kg_val_queries = kg_v_ds['questions']
    kg_val_relevant_docs = kg_v_ds['relevant_contexts']

kg_evaluator = InformationRetrievalEvaluator(kg_val_queries, kg_val_corpus, kg_val_relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if you had significantly more data.

In [45]:
EPOCHS = 5

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
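
The training cell sets the warm-up to 10% of all training steps. As a worked example with the naive settings above (`SUBSET_SIZE = 100`, `BATCH_SIZE = 10`, `EPOCHS = 5`), that is 5 of 50 steps. `warmup_steps_for` is a hypothetical helper that mirrors the calculation.

```python
import math

def warmup_steps_for(n_examples, batch_size, epochs, warmup_fraction=0.1):
    """Warm-up steps when `warmup_fraction` of all training steps ramp up the learning rate."""
    # Matches len(DataLoader) when incomplete final batches are kept.
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return int(steps_per_epoch * epochs * warmup_fraction)

print(warmup_steps_for(n_examples=100, batch_size=10, epochs=5))  # 5
```

With so few total steps the warm-up is very short; on a larger dataset the same 10% would cover many more steps.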

In [46]:
import wandb
wandb.init(mode="disabled")

Now we'll train with both datasets.

In [None]:

# make a reusable function since we are training with two different datasets
# NOTE: `hf_username` must already be defined (see the Hugging Face login cell) before this runs
def trainMoustache(tuned_model_name, loader, evaluator):
    # warm up the learning rate over the first 10% of training steps
    warmup_steps = int(len(loader) * EPOCHS * 0.1)

    model.fit(
        train_objectives=[(loader, train_loss)],
        epochs=EPOCHS,
        warmup_steps=warmup_steps,
        output_path=tuned_model_name,
        show_progress_bar=True,
        evaluator=evaluator,
        evaluation_steps=50
    )
    model.push_to_hub(f"{hf_username}/{tuned_model_name}")

# NOTE: both calls update the same in-memory `model`, so the KG run continues from the naive run's weights
trainMoustache("finetuned_arctic_ft_naive", naive_loader, naive_evaluator)
trainMoustache("finetuned_arctic_ft_kg", kg_loader, kg_evaluator)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | No log | No log | 0.752747 | 0.884615 | 0.93956 | 0.978022 | 0.752747 | 0.294872 | 0.187912 | 0.097802 | 0.752747 | 0.884615 | 0.93956 | 0.978022 | 0.868374 | 0.832827 | 0.833937 |
| 20 | No log | No log | 0.758242 | 0.906593 | 0.93956 | 0.967033 | 0.758242 | 0.302198 | 0.187912 | 0.096703 | 0.758242 | 0.906593 | 0.93956 | 0.967033 | 0.869751 | 0.837751 | 0.839842 |
| 30 | No log | No log | 0.763736 | 0.901099 | 0.950549 | 0.967033 | 0.763736 | 0.300366 | 0.19011 | 0.096703 | 0.763736 | 0.901099 | 0.950549 | 0.967033 | 0.869945 | 0.838025 | 0.840238 |
| 40 | No log | No log | 0.758242 | 0.906593 | 0.945055 | 0.972527 | 0.758242 | 0.302198 | 0.189011 | 0.097253 | 0.758242 | 0.906593 | 0.945055 | 0.972527 | 0.870583 | 0.837156 | 0.839162 |
| 50 | No log | No log | 0.758242 | 0.912088 | 0.945055 | 0.972527 | 0.758242 | 0.304029 | 0.189011 | 0.097253 | 0.758242 | 0.912088 | 0.945055 | 0.972527 | 0.870936 | 0.837581 | 0.839587 |


| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | No log | No log | 0.62 | 0.7 | 0.82 | 0.86 | 0.62 | 0.233333 | 0.164 | 0.086 | 0.62 | 0.7 | 0.82 | 0.86 | 0.726857 | 0.684833 | 0.691476 |
| 10 | No log | No log | 0.62 | 0.7 | 0.82 | 0.9 | 0.62 | 0.233333 | 0.164 | 0.09 | 0.62 | 0.7 | 0.82 | 0.9 | 0.743678 | 0.695468 | 0.699493 |
| 15 | No log | No log | 0.62 | 0.74 | 0.86 | 0.9 | 0.62 | 0.246667 | 0.172 | 0.09 | 0.62 | 0.74 | 0.86 | 0.9 | 0.753898 | 0.707333 | 0.711707 |
| 20 | No log | No log | 0.66 | 0.76 | 0.88 | 0.9 | 0.66 | 0.253333 | 0.176 | 0.09 | 0.66 | 0.76 | 0.88 | 0.9 | 0.772709 | 0.731857 | 0.736651 |
| 25 | No log | No log | 0.66 | 0.76 | 0.88 | 0.9 | 0.66 | 0.253333 | 0.176 | 0.09 | 0.66 | 0.76 | 0.88 | 0.9 | 0.773585 | 0.732857 | 0.737745 |


In [51]:
from huggingface_hub import notebook_login

notebook_login()

hf_username = "don-unagi"


In [52]:
#model.push_to_hub(f"{hf_username}/legal-ft-2")

#just drag drop files to new model in HF with same names

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [53]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [56]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
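
The `is_hit` flags collected above are later averaged into a hit rate. Here is a minimal sketch of that metric on made-up retrieval results, with no vector store involved. `hit_rate` is a hypothetical helper, and it assumes one expected document per question, as the note above states.

```python
def hit_rate(retrieved_ids_per_question, expected_ids):
    """Fraction of questions whose expected document ID appears in the retrieved IDs."""
    hits = [
        expected in retrieved
        for retrieved, expected in zip(retrieved_ids_per_question, expected_ids)
    ]
    return sum(hits) / len(hits)

# Made-up retrieval results for three questions with top_k=2.
retrieved = [["d1", "d4"], ["d7", "d2"], ["d3", "d5"]]
expected = ["d1", "d2", "d9"]
print(hit_rate(retrieved, expected))
```

Two of the three questions find their expected document in the top 2, so the hit rate is two thirds. The position within the top `k` does not matter for this metric.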

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [57]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 186/186 [01:03<00:00,  2.92it/s]


In [58]:
te3_results_df = pd.DataFrame(te3_results)

In [59]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(0.9247311827956989)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [60]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 186/186 [00:07<00:00, 25.65it/s]


In [61]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [62]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.44623655913978494)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned) - NAIVE

In [69]:
finetune_naive_embeddings = HuggingFaceEmbeddings(model_name="don-unagi/finetuned_arctic_ft_naive")
finetune_naive_results = evaluate_openai(test_dataset, finetune_naive_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at don-unagi/finetuned_arctic_ft_naive and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


100%|██████████| 186/186 [00:03<00:00, 47.92it/s]


In [70]:
finetune_naive_results_df = pd.DataFrame(finetune_naive_results)

In [72]:
finetune_naive_hit_rate = finetune_naive_results_df["is_hit"].mean()
finetune_naive_hit_rate

np.float64(0.9247311827956989)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned) - Knowledge Graph

In [73]:
finetune_kg_embeddings = HuggingFaceEmbeddings(model_name="don-unagi/finetuned_arctic_ft_kg")
finetune_kg_results = evaluate_openai(test_dataset, finetune_kg_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at don-unagi/finetuned_arctic_ft_kg and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


100%|██████████| 186/186 [00:03<00:00, 47.34it/s]


In [74]:
finetune_kg_results_df = pd.DataFrame(finetune_kg_results)

In [76]:
finetune_kg_hit_rate = finetune_kg_results_df["is_hit"].mean()
finetune_kg_hit_rate

np.float64(0.9247311827956989)
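
The hit rates above are easier to compare as raw counts. A minimal sketch that converts the reported rates back into hits out of the 186 test questions (the dictionary labels are ours, chosen for readability):

```python
# Hit rates reported by the cells above (top_k=5, 186 test questions).
NUM_QUESTIONS = 186
reported_hit_rates = {
    "text-embedding-3-small": 0.9247311827956989,
    "arctic-embed-l (base)": 0.44623655913978494,
    "arctic-embed-l (naive fine-tune)": 0.9247311827956989,
    "arctic-embed-l (KG fine-tune)": 0.9247311827956989,
}

# Each rate is hits / NUM_QUESTIONS, so multiplying back recovers the hit count.
hit_counts = {
    name: round(rate * NUM_QUESTIONS) for name, rate in reported_hit_rates.items()
}

for name, count in hit_counts.items():
    rate = reported_hit_rates[name]
    print(f"{name:35s} {count:3d} / {NUM_QUESTIONS} hits ({rate:.1%})")
```

Seen this way, both fine-tuned models match `text-embedding-3-small` on this test set, while the base model retrieves the expected document for fewer than half of the questions.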

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [77]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=50,
    length_function=len,
)

training_documents = text_splitter.split_documents(text_loader.load())
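
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch. It assumes plain fixed-size character windows; `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraph and line breaks, so its real chunks will differ. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=50):
    """Simplified chunker: fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts 550 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1500-character document.
toy_text = "x" * 1500
toy_chunks = sliding_window_chunks(toy_text)
toy_chunk_lengths = [len(chunk) for chunk in toy_chunks]
print(toy_chunk_lengths)
```

The overlap means the last 50 characters of one window are repeated at the start of the next, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.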

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [78]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})
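
Conceptually, a `k=6` retriever embeds the question and returns the six stored chunks whose vectors are nearest to it. The sketch below illustrates that with a brute-force search, assuming Euclidean distance as the similarity measure; the helper `top_k_by_distance` and the 2-D toy vectors are hypothetical, and a real FAISS index does this far more efficiently.

```python
import math


def top_k_by_distance(query_vector, doc_vectors, k=6):
    """Return the ids of the k document vectors closest to the query (Euclidean)."""
    scored = [(math.dist(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort()  # smallest distance first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 2-D "embeddings" for four chunks.
toy_vectors = {
    "a": (0.0, 1.0),
    "b": (1.0, 0.0),
    "c": (0.9, 0.1),
    "d": (-1.0, 0.0),
}
toy_top2 = top_k_by_distance((1.0, 0.0), toy_vectors, k=2)
print(toy_top2)
```

Fine-tuning the embedding model changes where questions and chunks land in this space, which is the only thing that differs between the chains built in this section.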

#### A - Augmented

In [79]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [80]:
from langchain_openai import ChatOpenAI

rag_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [81]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
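
If the LCEL syntax is hard to follow, the data flow can be sketched with plain function calls. This is an illustrative stand-in, not how LangChain executes the chain internally; `run_rag_chain`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the fake retriever and LLM just return canned values.

```python
def run_rag_chain(inputs, retrieve, generate, prompt_template):
    """Mirror the three stages of the LCEL chain with ordinary Python."""
    # Stage 1: fan the question out into retrieved context plus the question itself.
    state = {"context": retrieve(inputs["question"]), "question": inputs["question"]}
    # Stage 2: the assign(context=...) step re-assigns context to itself,
    # so the state passes through unchanged.
    # Stage 3: build the prompt, generate a response, and pass the context along.
    prompt = prompt_template.format(context=state["context"], question=state["question"])
    return {"response": generate(prompt), "context": state["context"]}


# Hypothetical stand-ins for the retriever and the LLM.
def fake_retrieve(question):
    return ["The 4% Rule: multiply annual spending by 25."]


def fake_generate(prompt):
    return f"(answer based on a prompt of {len(prompt)} characters)"


toy_template = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"
toy_output = run_rag_chain(
    {"question": "How much do I need for retirement?"},
    fake_retrieve,
    fake_generate,
    toy_template,
)
print(toy_output["response"])
```

Returning the context alongside the response is what later lets us hand the retrieved chunks to RAGAS for evaluation.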

In [67]:
base_rag_chain.invoke({"question" : "How do retirement accounts affect the financial aid process when applying for college grants through FAFSA?"})["response"]

In [84]:
base_rag_chain.invoke({"question" : "How much do I need for retirement?"})["response"]

'To determine how much you need for retirement, you can use the 4% Rule. This rule suggests that you should take your annual spending level and multiply it by 25. For example, if you spend $25,000 annually, you would need $625,000 to retire. Additionally, financial independence enthusiasts recommend multiplying your annual spending by somewhere between 20 and 30 to find your retirement number.'

In [123]:
base_rag_chain.invoke({"question" : "What is the cost of beef per pound in MN??"})["response"]

'The cost of beef per pound in MN is mentioned as $2.50/lb.'

In [86]:
base_rag_chain.invoke({"question" : "What priorities should I make if retiring early is a goal of mine?"})["response"]

"If retiring early is a goal of yours, you should prioritize the following:\n\n1. **Lifestyle Consideration**: Assess the lifestyle you want after retirement. Ensure that your saving efforts do not lead to a miserable existence before retirement, as it's important to enjoy your life now while planning for the future.\n\n2. **Savings and Frugality Skills**: Focus on building your savings and developing frugality skills. Worrying about whether you will have enough to retire should come after you have established a solid savings plan.\n\n3. **Investment**: Make it a priority to invest your savings, even if you feel you can't afford to. The reality is that you can't afford not to invest, as it can significantly impact your financial situation in retirement.\n\n4. **Flexibility in Retirement Plans**: Consider the possibility of working part-time during retirement or making lifestyle adjustments, such as living in a less expensive area or cutting back on discretionary spending, to reduce the

### NAIVE Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [87]:
finetune_naive_vectorstore = FAISS.from_documents(training_documents, finetune_naive_embeddings)
finetune_naive_retriever = finetune_naive_vectorstore.as_retriever(search_kwargs={"k": 6})

In [88]:
finetune_naive_rag_chain = (
    {"context": itemgetter("question") | finetune_naive_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [89]:
finetune_naive_rag_chain.invoke({"question" : "How do I retire early?"})["response"]

'To retire early, you can follow these steps:\n\n1. **Increase Your Savings Rate**: Aim to save a significant portion of your income, such as 50% or more. This can be achieved by living below your means and avoiding lifestyle inflation.\n\n2. **Invest Wisely**: Invest your savings in a diversified portfolio that can grow over time. Consider using retirement accounts that offer tax advantages.\n\n3. **Calculate Your Retirement Needs**: Use tools like retirement calculators to determine how much you need to save to sustain your desired lifestyle in retirement. The 4% rule is a common guideline, suggesting you can withdraw 4% of your retirement savings annually.\n\n4. **Consider Part-Time Work**: Some early retirees find that working part-time can help cover living expenses while allowing them to enjoy retirement.\n\n5. **Plan for Healthcare**: Ensure you have a plan for healthcare costs, as these can be significant before you qualify for Medicare.\n\n6. **Adjust Your Lifestyle**: Be prep

In [90]:
finetune_naive_rag_chain.invoke({"question" : "How much do I need for retirement?"})["response"]

'To determine how much you need for retirement, you can use the 4% Rule. This rule suggests that you should take your annual spending level and multiply it by 25. For example, if you spend $25,000 a year, you would need $625,000 to retire. Additionally, financial independence enthusiasts recommend taking your annual spending and multiplying it by somewhere between 20 and 30 to find your retirement number.'

In [91]:
finetune_naive_rag_chain.invoke({"question" : "How do I decrease my grocery bill?"})["response"]

'To decrease your grocery bill, consider the following strategies:\n\n1. **Track Your Spending**: Keep a record of your grocery expenses to identify areas where you can cut back.\n2. **Cook from Scratch**: Use basic ingredients that you buy in bulk to prepare meals at home, which can be more cost-effective than pre-packaged foods.\n3. **Reduce Food Waste**: Pay attention to what you throw away and adjust your shopping habits accordingly to minimize waste.\n4. **Buy Fresh Produce**: Allocate part of your budget for fresh fruits and vegetables, which can be healthier and more satisfying.\n5. **Limit Eating Out**: Cut back on dining out to save money, as this can significantly impact your overall food expenses.\n6. **Use Simple Cleaning Products**: Consider switching to inexpensive cleaning alternatives like white vinegar and baking soda, which can reduce costs associated with cleaning supplies.\n\nBy implementing these tips, you can effectively lower your grocery bill.'

In [92]:
finetune_naive_rag_chain.invoke({"question" : "What priorities should I make if retiring early is a goal of mine?"})["response"]

'If retiring early is a goal of yours, you should prioritize the following:\n\n1. **Budgeting and Expense Management**: Start by assessing your current expenses and identify areas where you can cut back. Focus on living below your means to increase your savings rate.\n\n2. **Savings Rate**: Aim to save a significant portion of your income, ideally 50% or more if possible. This will help you build your nest egg more quickly.\n\n3. **Investment Strategy**: Invest your savings wisely to ensure they grow over time. Consider strategies like the 4% rule for withdrawals and factor in inflation.\n\n4. **Lifestyle Choices**: Be prepared to make lifestyle sacrifices now for a more comfortable retirement later. This may involve avoiding unnecessary expenses and not upgrading your lifestyle with raises.\n\n5. **Creativity and Engagement**: Cultivate interests and hobbies that you can pursue in retirement to avoid boredom and ensure a fulfilling life after work.\n\n6. **Long-term Planning**: Set a 

### KG Fine-tuned Embedding Model

In [93]:
finetune_kg_vectorstore = FAISS.from_documents(training_documents, finetune_kg_embeddings)
finetune_kg_retriever = finetune_kg_vectorstore.as_retriever(search_kwargs={"k": 6})

In [94]:
finetune_kg_rag_chain = (
    {"context": itemgetter("question") | finetune_kg_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [102]:
finetune_kg_rag_chain.invoke({"question" : "How do I retire early?"})["response"]

'To retire early, you can follow these steps:\n\n1. **Set a Savings Goal**: Determine how much money you need to save to support your desired lifestyle in retirement. A common guideline is to aim for a nest egg that allows you to withdraw 4% annually.\n\n2. **Increase Your Savings Rate**: Aim to save a significant portion of your income. Some individuals save as much as 60% of their income, especially if they avoid lifestyle inflation (e.g., not spending raises on luxury items).\n\n3. **Invest Wisely**: Invest your savings in a diversified portfolio to grow your wealth over time. The timing of market conditions can impact your retirement, so aim to reach your retirement number during favorable market conditions.\n\n4. **Consider Alternative Income Sources**: Look for ways to generate additional income, such as side hustles or passive income streams, which can supplement your savings.\n\n5. **Plan for Healthcare and Other Expenses**: Factor in healthcare costs and other potential expens

In [101]:
finetune_kg_rag_chain.invoke({"question" : "How much do I need for retirement?"})["response"]

'To determine how much you need for retirement, you can use the 4% rule. This rule suggests that you should take your annual spending level and multiply it by 25. For example, if you plan to spend $25,000 per year, you would need $625,000 to retire. Financial Independence enthusiasts recommend a similar approach, suggesting you multiply your annual spending by somewhere between 20 and 30 to find your retirement number.'

In [100]:
finetune_kg_rag_chain.invoke({"question" : "How do I decrease my grocery bill?"})["response"]

'To decrease your grocery bill, consider the following strategies:\n\n1. **Stock Up on Basics**: Keep a supply of staple ingredients that you frequently use in your recipes. This allows you to cook more from scratch, which can be more cost-effective.\n\n2. **Buy in Bulk**: Purchase staples like beans and legumes in bulk to save money.\n\n3. **Focus on Fresh Produce**: By cooking from staples, you can allocate more of your budget to fresh fruits and vegetables, which are essential for a healthy diet.\n\n4. **Track Your Spending**: Monitor your grocery expenses to identify areas where you might be overspending or wasting food.\n\n5. **Reduce Waste**: Look at what you throw away and adjust your shopping habits accordingly to minimize waste.\n\n6. **Simplify Cleaning Supplies**: Consider switching to basic cleaning products like white vinegar and baking soda, which can save you money on cleaning supplies.\n\n7. **Shop Seasonally and Locally**: Buying seasonal and local produce can often be

In [99]:
finetune_kg_rag_chain.invoke({"question" : "What priorities should I make if retiring early is a goal of mine?"})["response"]

'If retiring early is your goal, you should prioritize the following:\n\n1. **Savings Rate**: Aim to save a significant portion of your income, ideally 50% or more, to build your nest egg quickly.\n\n2. **Financial Independence**: Focus on reaching a point where your investments can generate enough income to cover your living expenses, often referred to as financial independence.\n\n3. **Budgeting**: Keep track of your expenses and create a budget that allows you to live below your means, which will help you save more.\n\n4. **Investment Strategy**: Invest wisely to grow your savings. Consider a diversified portfolio that balances risk and return.\n\n5. **Lifestyle Choices**: Make conscious choices about your lifestyle to reduce unnecessary expenses, which can help you save more for retirement.\n\n6. **Flexibility in Work**: Consider jobs or side gigs that you enjoy, which can provide income while allowing you to maintain a work-life balance.\n\n7. **Long-term Planning**: Set clear fin

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

- From just a few basic vibe checks, the fine-tuned models seem to give fuller answers with better references to concepts from the blogs, but none of the chains does a bad job. These are probably widely covered topics, so the base model is already pretty good. I would still prefer one of the tuned responses. Naive and KG seem pretty much tied, but we didn't really try very hard to make them fail. The vibes are too chill, I guess. Let's leave it to RAGAS to bring the harsh vibes and evaluate more closely. Vibe checking only goes so far anyway.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

In [115]:
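### YOUR CODE HERE
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

# `persona_list` is assumed to be defined in an earlier cell.
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
dataset_generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, persona_list=persona_list)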


In [116]:
base_dataset = dataset_generator.generate_with_langchain_docs(text_loader.load(), testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/3 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/3 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/3 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/70 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/85 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [117]:
base_dataset.upload()

'https://app.ragas.io/dashboard/alignment/testset/677eccf5-76b6-442a-b04a-9d2c4e1eeadf'

In [106]:

from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")
os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

In [118]:
from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluation_dataset = EvaluationDataset.from_pandas(base_dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [119]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

def evaluateDSWithRagChain(rag_chain):
    # Fill in each test sample with this chain's response and retrieved contexts.
    # Note: this overwrites the fields on `base_dataset` in place on every call.
    for test_row in base_dataset:
        response = rag_chain.invoke({"question" : test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"]
        test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

    # Rebuild the evaluation dataset so it reflects the responses written above.
    evaluation_dataset = EvaluationDataset.from_pandas(base_dataset.to_pandas())
    # Give the LLM-judged metrics up to 360 seconds before timing out.
    custom_run_config = RunConfig(timeout=360)

    result = evaluate(
        dataset=evaluation_dataset,
        metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
        llm=evaluator_llm,
        run_config=custom_run_config
    )
    return result

In [120]:
base_rag_chain_results = evaluateDSWithRagChain(base_rag_chain)
base_rag_chain_results

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.1647, 'faithfulness': 0.5224, 'factual_correctness': 0.2692, 'answer_relevancy': 0.6511, 'context_entity_recall': 0.1134, 'noise_sensitivity_relevant': 0.2495}

In [121]:
finetune_naive_rag_chain_results = evaluateDSWithRagChain(finetune_naive_rag_chain)
finetune_naive_rag_chain_results

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.2976, 'faithfulness': 0.4999, 'factual_correctness': 0.3450, 'answer_relevancy': 0.8059, 'context_entity_recall': 0.2327, 'noise_sensitivity_relevant': 0.1627}

In [122]:
finetune_kg_rag_chain_results = evaluateDSWithRagChain(finetune_kg_rag_chain)
finetune_kg_rag_chain_results

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[50]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.2579, 'faithfulness': 0.5012, 'factual_correctness': 0.3991, 'answer_relevancy': 0.8067, 'context_entity_recall': 0.2340, 'noise_sensitivity_relevant': 0.2125}



| Metric | Base | Naive fine-tune | KG fine-tune |
|--------|---------|---------|---------|
| Context Recall | 0.1647 | 0.2976 | 0.2579 |
| Faithfulness | 0.5224 | 0.4999 | 0.5012 |
| Factual Correctness | 0.2692 | 0.3450 | 0.3991 |
| Answer Relevancy | 0.6511 | 0.8059 | 0.8067 |
| Context Entity Recall | 0.1134 | 0.2327 | 0.2340 |
| Noise Sensitivity Relevant | 0.2495 | 0.1627 | 0.2125 |

From the table, we can see that both fine-tuning methods (Naive and KG) improve the performance of the base model across most metrics. However, the improvements vary depending on the specific metric and the fine-tuning method used.


1. Context Recall: This metric measures how much of the information in the reference answer is covered by the retrieved context. The Naive fine-tune shows the largest gain over the base model (0.1647 to 0.2976); the KG fine-tune also improves (0.2579), but by less.
2. Faithfulness: This metric measures how well the model's responses stay true to the information in the retrieved context. Both fine-tuning methods score slightly lower than the base model.
3. Factual Correctness: This metric measures the factual accuracy of the model's responses against the reference answers. Both fine-tuning methods improve factual correctness, with the KG fine-tune showing the largest improvement.
4. Answer Relevancy: This metric measures how relevant the model's responses are to the input query. Both fine-tuning methods improve answer relevancy substantially and are effectively tied (0.8067 for KG vs. 0.8059 for Naive).
5. Context Entity Recall: This metric measures how many of the entities in the reference appear in the retrieved context. Both fine-tuning methods roughly double the base model's score and are effectively tied (0.2340 for KG vs. 0.2327 for Naive).
6. Noise Sensitivity Relevant: This metric measures how often the responses contain incorrect claims when using the retrieved documents, so lower is better. The Naive fine-tune shows the largest improvement over the base model (0.2495 down to 0.1627); the KG fine-tune also improves (0.2125), but by less.
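
To put the differences in perspective, here is a small sketch that computes each fine-tuned model's relative change against the base model, using the scores copied from the table above. The helper `relative_change` is our own; keep in mind that for noise sensitivity a negative change is the desirable direction, since lower is better.

```python
# Scores copied from the table above.
ragas_scores = {
    "context_recall": {"base": 0.1647, "naive": 0.2976, "kg": 0.2579},
    "faithfulness": {"base": 0.5224, "naive": 0.4999, "kg": 0.5012},
    "factual_correctness": {"base": 0.2692, "naive": 0.3450, "kg": 0.3991},
    "answer_relevancy": {"base": 0.6511, "naive": 0.8059, "kg": 0.8067},
    "context_entity_recall": {"base": 0.1134, "naive": 0.2327, "kg": 0.2340},
    "noise_sensitivity_relevant": {"base": 0.2495, "naive": 0.1627, "kg": 0.2125},
}


def relative_change(new, old):
    """Change of `new` relative to `old`, e.g. 0.25 means a 25% increase."""
    return (new - old) / old


relative_changes = {
    metric: {
        variant: relative_change(scores[variant], scores["base"])
        for variant in ("naive", "kg")
    }
    for metric, scores in ragas_scores.items()
}

for metric, changes in relative_changes.items():
    print(f"{metric:28s} naive {changes['naive']:+.1%}   kg {changes['kg']:+.1%}")
```

Relative changes on a test set this small should be read as directional signals rather than precise measurements.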


Some evaluation caveats:
- The KG fine-tuned model was evaluated on the naive test data, so the naive approach had a slight advantage (we also had KG test data).
- The knowledge graph was primarily made up of single-hop queries. If we were to use multi-hop test data, the KG fine-tune would probably start to outperform the naive approach. The lesson here is that evaluation results can vary!
