# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll begin with the lowest-hanging improvement you can make to a RAG system:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-m`
  - Task 5: Evaluating our Retriever

- 🤝 Breakout Room #2:
  - Task 1: Vibe Checking Our LCEL RAG Chain
  - Task 2: Ragas Evaluation



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
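
To make "closer together" concrete, here is a toy sketch in plain Python. The 3-dimensional vectors and the `nudge` helper are made up for illustration; a real embedding model changes its weights through a loss function rather than editing vectors directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge(vector, target, step=0.5):
    """Hypothetical helper: move `vector` a fraction `step` of the way toward `target`."""
    return [v + step * (t - v) for v, t in zip(vector, target)]

# Made-up embeddings for a question and the document that answers it
question = [1.0, 0.0, 0.0]
document = [0.0, 1.0, 0.0]

before = cosine_similarity(question, document)
question_after = nudge(question, document)
after = cosine_similarity(question_after, document)
print(f"before={before:.3f} after={after:.3f}")
```

The similarity between the question and its document goes up after the nudge, which is the effect fine-tuning aims for across many question/document pairs at once.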

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means we are making our embedding model better at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

In essence, training on question-document pairs teaches the model which documents should be matched to a given question, whereas training on inter-document pairs or related sentences teaches it which pieces of text are similar to one another.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [21]:
from ragas.metrics import context_recall
from ragas.metrics import context_precision
from ragas import evaluate

In [22]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

In [23]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community==0.2.17 langchain-text-splitters

In [24]:
!pip install -qU faiss-cpu unstructured==0.15.7 python-pptx==1.0.2 nltk==3.9.1

### Provide OpenAI API Key

In [25]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

## Task 2: Loading Data

We'll be using a recent document released by the EU 'laying down harmonised rules on artificial intelligence and amending Regulations'.

The data can be found [here](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689), though we will be using a HTML version which was collected into the AIM DataRepository.

First, we'll clone and then `cd` into the DataRepository.

In [26]:
# !git clone https://github.com/AI-Maker-Space/DataRepository.git

In [27]:
# %cd DataRepository

Next we're going to be using the `UnstructuredHTMLLoader` to load our HTML document into a LangChain document using the [Unstructured](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html.UnstructuredHTMLLoader.html) library.

In [28]:
# from langchain_community.document_loaders import UnstructuredHTMLLoader

# training_documents_loaded = UnstructuredHTMLLoader("eu_ai_act.html")

In [29]:
from langchain_community.document_loaders import PyMuPDFLoader

doc1 = "Blueprint-for-an-AI-Bill-of-Rights.pdf"
loader = PyMuPDFLoader(
    doc1
)
documents = loader.load()

doc2 = "NIST.AI.600-1.pdf"
loader = PyMuPDFLoader(
    doc2
)
documents.extend(loader.load())

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [30]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
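
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter in plain Python. It is only a sketch: `fixed_window_chunks` is a hypothetical helper, and it does not reproduce the recursive separator logic of `RecursiveCharacterTextSplitter`, which prefers to break on paragraph and sentence boundaries.

```python
def fixed_window_chunks(text, chunk_size=750, chunk_overlap=20):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 2,000 characters of repeating alphabet so overlaps are easy to inspect
sample = "".join(chr(97 + i % 26) for i in range(2000))
chunks = fixed_window_chunks(sample)
print([len(chunk) for chunk in chunks])
```

Each chunk starts 730 characters after the previous one, so the last 20 characters of one chunk are repeated at the start of the next.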

Next we can load/split these documents as follows.

In [31]:
# training_documents = text_splitter.split_documents(training_documents_loaded.load())
training_documents = text_splitter.split_documents(documents)

In [32]:
len(training_documents)

603

Next, we're going to associate each of our chunks with a unique identifier.

In [33]:
import uuid

id_set = set()

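for document in training_documents:
  doc_id = str(uuid.uuid4())
  while doc_id in id_set:
    # Regenerate on the (extremely unlikely) collision, keeping the id a string
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id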

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [34]:
training_split_documents = training_documents[:400]
val_split_documents = training_documents[400:500]
test_split_documents = training_documents[500:600]
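
Plain slicing keeps neighboring chunks from the same pages together in one split. If a more mixed split is wanted, one option is a seeded shuffle before slicing. The sketch below uses integers as stand-ins for chunks; `split_chunks` and the seed value are assumptions for illustration, not part of the notebook.

```python
import random

def split_chunks(chunks, n_train, n_val, n_test, seed=42):
    """Shuffle a copy of `chunks` with a fixed seed, then slice into disjoint splits."""
    if n_train + n_val + n_test > len(chunks):
        raise ValueError("split sizes exceed the number of chunks")
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Integers stand in for the 603 chunks produced above
train, val, test = split_chunks(list(range(603)), 400, 100, 100)
print(len(train), len(val), len(test))
```

Because the three slices never overlap, no chunk can appear in more than one split.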

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [July 18th](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [35]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [36]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [37]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are question `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are `List[str]`: the list of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [38]:
import asyncio
import uuid
from tqdm import tqdm

async def process_document(document, n_questions):
    questions_generated = await question_generation_chain.ainvoke({"context": document.page_content, "n_questions": n_questions})

    doc_questions = {}
    doc_relevant_docs = {}

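    for line in questions_generated.content.split("\n"):
        # Skip blank lines and anything that is not a numbered "N. question" line
        number, separator, question = line.partition(".")
        if not separator or not number.strip().isdigit():
            continue
        question_id = str(uuid.uuid4())
        doc_questions[question_id] = question.strip()
        doc_relevant_docs[question_id] = [document.metadata["id"]]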

    return doc_questions, doc_relevant_docs

async def create_questions(documents, n_questions):
    tasks = [process_document(doc, n_questions) for doc in documents]

    questions = {}
    relevant_docs = {}

    for task in tqdm(asyncio.as_completed(tasks), total=len(documents), desc="Processing documents"):
        doc_questions, doc_relevant_docs = await task
        questions.update(doc_questions)
        relevant_docs.update(doc_relevant_docs)

    return questions, relevant_docs
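
The parsing step can also be written with a regular expression, which makes it easy to test without calling the LLM. This is a sketch under assumptions: `parse_numbered_questions` and `NUMBERED_LINE` are hypothetical names, and the sample output is hand-written in the `1. QUESTION` format the prompt asks for.

```python
import re
import uuid

# Matches lines such as "1. Some question?" or "2) Some question?"
NUMBERED_LINE = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")

def parse_numbered_questions(llm_output, document_id):
    """Return (questions, relevant_docs) dicts keyed by a fresh UUID per question."""
    questions = {}
    relevant_docs = {}
    for line in llm_output.splitlines():
        match = NUMBERED_LINE.match(line)
        if match is None:
            continue  # blank lines and stray text are ignored
        question_id = str(uuid.uuid4())
        questions[question_id] = match.group(1)
        relevant_docs[question_id] = [document_id]
    return questions, relevant_docs

sample_output = "1. What does Section 2.1 cover?\n\n2. Who drives the bus?\n"
questions, relevant_docs = parse_numbered_questions(sample_output, "doc-123")
print(list(questions.values()))
```

Blank lines produce no entries, and periods inside a question (such as `2.1`) are kept intact.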

We'll use the function to generate training, validation, and test data with `n_questions=2` for each.

In [39]:
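# NOTE: this passes all 603 chunks, not just `training_split_documents`; questions for chunks
# outside the training corpus are skipped later with an "ignoring" message.
training_questions, training_relevant_contexts = await create_questions(training_documents, 2)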

Processing documents: 100%|██████████| 603/603 [00:06<00:00, 91.05it/s] 


In [40]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing documents: 100%|██████████| 100/100 [00:07<00:00, 12.66it/s]


In [41]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing documents: 100%|██████████| 100/100 [00:08<00:00, 11.51it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

> NOTE: If you ran into issues creating the data - you can use the data from the DataRepository. It's simply called: `train_dataset.jsonl`, etc.

In [42]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [43]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [44]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
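
Before training, it can be worth checking that every question points at a chunk that actually exists in the saved corpus. The sketch below round-trips a toy dataset with the same three keys through a temporary file; `find_orphan_questions` and the toy ids are assumptions made for illustration.

```python
import json
import os
import tempfile

def find_orphan_questions(dataset):
    """Return question ids whose relevant context ids are missing from the corpus."""
    corpus_ids = set(dataset["corpus"])
    return [
        question_id
        for question_id, context_ids in dataset["relevant_contexts"].items()
        if not all(context_id in corpus_ids for context_id in context_ids)
    ]

# Toy dataset: q2 points at a chunk ("d9") that is not in the corpus
toy_dataset = {
    "questions": {"q1": "Who drives the bus?", "q2": "What is red-teaming?"},
    "relevant_contexts": {"q1": ["d1"], "q2": ["d9"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

# Save and reload the same way the notebook does, using a temporary path
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.jsonl")
with open(path, "w") as f:
    json.dump(toy_dataset, f)
with open(path) as f:
    reloaded = json.load(f)

orphans = find_orphan_questions(reloaded)
print(orphans)
```

Any id reported here would raise a `KeyError` when the question is looked up against the corpus.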

## Task 4: Fine-tuning `snowflake-arctic-embed-m`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

In [45]:
!pip install -qU sentence_transformers datasets pyarrow

In [46]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-m"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [47]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [48]:
BATCH_SIZE = 20

Let's move our dataset into the expected format for training.

In [49]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    try:
        doc_id = relevant_docs[query_id][0]
        text = corpus[doc_id]
        example = InputExample(texts=[query, text])
        examples.append(example)
    except KeyError:
        # Some keys in doc_id are not in the corpus
        print('ignoring ', query_id)
        continue

print(len(examples), ' examples')

ignoring  05df20e2-f4e5-4344-b7a8-4239ec0f65e7
ignoring  af1fa2ea-8182-486f-8911-b0c78707026d
ignoring  b3db14d0-8f7f-4d6b-8f66-61438802c0e9
ignoring  ac7793c5-60d7-47f8-a704-90aea802aed7
ignoring  7f78b0fe-cc1d-423b-ace7-dc4179078d4d
ignoring  b9a08058-39c1-4c74-a070-91eae0b12a4f
ignoring  b7eb08df-39c4-4721-b0e4-9001abdb8ca2
ignoring  77b659b4-1b8e-4324-9fd7-c277d59c8474
ignoring  24487d5e-f936-4aa7-a592-1793df2585f1
ignoring  26718fbd-55f0-4633-a141-d13440a6df51
ignoring  239c55b0-8728-4013-a2b8-fc829b62dd37
ignoring  d63d1666-24f9-41b1-9a60-f37f46d46513
ignoring  85bdd22f-225b-417e-a1f9-87b31f9dccc0
ignoring  ae2b6484-60b4-4fc9-abcd-9ca42114593d
ignoring  e20d343c-9547-4073-a5cc-99bdd64acbf5
ignoring  22339da4-e28f-4563-acd0-82318f4162bb
ignoring  d4cfdcba-cff9-43f0-9191-817dcd174a94
ignoring  d076ccc2-81f8-4f19-9dbb-4d51d07b7e15
ignoring  0748ff61-68f5-41ba-815e-0b191943e4ab
ignoring  9ca5e319-4c96-4a89-bb4a-b1a20ec2861b
ignoring  0aead898-5597-4edb-af6f-cbcdee0810de
ignoring  128

Now we can create a `torch` `DataLoader`!

In [50]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [51]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

`MultipleNegativesRankingLoss` trains the model so that each query is most similar to its own positive document. It only needs (query, positive) pairs: for a given query, the positives belonging to the other pairs in the same batch act as negatives, and a cross-entropy loss over the similarity scores pulls the matching pair together and pushes the rest apart. This is also why larger batch sizes tend to help.

`MatryoshkaLoss` wraps an inner loss (`MultipleNegativesRankingLoss` in this case) and applies it to the embedding truncated to each of several sizes (768, 512, 256, 128, and 64 here). This pushes the most important information into the leading dimensions, so shortened embeddings remain useful.
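
As a rough sketch of what the two losses compute, here is a plain-Python version on a toy batch. It is written from the description of the losses rather than copied from `sentence-transformers`: the function names are hypothetical, `scale=20.0` and the unweighted sum over dimensions are assumed to mirror the library defaults, and the 4-dimensional vectors are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    every other doc in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)

def matryoshka_loss(query_vecs, doc_vecs, dims):
    """Sum the inner loss over embeddings truncated to each size in `dims`."""
    return sum(
        mnr_loss([q[:d] for q in query_vecs], [v[:d] for v in doc_vecs])
        for d in dims
    )

queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned_docs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]      # each query matches its own doc
misaligned_docs = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]   # each query matches the wrong doc

aligned = mnr_loss(queries, aligned_docs)
misaligned = mnr_loss(queries, misaligned_docs)
nested = matryoshka_loss(queries, misaligned_docs, dims=(4, 2))
print(f"aligned={aligned:.6f} misaligned={misaligned:.3f} nested={nested:.3f}")
```

The loss is near zero when every query already sits closest to its own document and large when it sits closest to another document in the batch, which is the signal that drives training.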

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [52]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if you had significantly more data.

In [53]:
EPOCHS = 5

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
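
To see what the warm-up arithmetic works out to, here is a small sketch. It assumes 800 training examples (400 training chunks with 2 questions each) and a linear warm-up followed by linear decay (the `WarmupLinear` schedule, which is understood to be the default for `model.fit`); `warmup_linear_multiplier` is a hypothetical helper, not a library function.

```python
import math

def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

num_examples = 800   # assumed: 400 training chunks x 2 questions each
batch_size = 20
epochs = 5

steps_per_epoch = math.ceil(num_examples / batch_size)  # what len(loader) would return
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)
print(steps_per_epoch, total_steps, warmup_steps)

for step in (0, 10, 20, 110, 200):
    print(step, round(warmup_linear_multiplier(step, warmup_steps, total_steps), 3))
```

Under these assumptions the first 20 of 200 steps ramp the learning rate up, and the remaining steps decay it toward zero.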

In [54]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100,Dot Accuracy@1,Dot Accuracy@3,Dot Accuracy@5,Dot Accuracy@10,Dot Precision@1,Dot Precision@3,Dot Precision@5,Dot Precision@10,Dot Recall@1,Dot Recall@3,Dot Recall@5,Dot Recall@10,Dot Ndcg@10,Dot Mrr@10,Dot Map@100
40,No log,No log,0.765,0.915,0.95,0.97,0.765,0.305,0.19,0.097,0.765,0.915,0.95,0.97,0.874843,0.843262,0.84494,0.765,0.915,0.95,0.97,0.765,0.305,0.19,0.097,0.765,0.915,0.95,0.97,0.874843,0.843262,0.84494
50,No log,No log,0.785,0.92,0.955,0.98,0.785,0.306667,0.191,0.098,0.785,0.92,0.955,0.98,0.887723,0.857486,0.8586,0.785,0.92,0.955,0.98,0.785,0.306667,0.191,0.098,0.785,0.92,0.955,0.98,0.887723,0.857486,0.8586
80,No log,No log,0.8,0.935,0.96,0.98,0.8,0.311667,0.192,0.098,0.8,0.935,0.96,0.98,0.896011,0.868353,0.869346,0.8,0.935,0.96,0.98,0.8,0.311667,0.192,0.098,0.8,0.935,0.96,0.98,0.896011,0.868353,0.869346
100,No log,No log,0.8,0.925,0.96,0.975,0.8,0.308333,0.192,0.0975,0.8,0.925,0.96,0.975,0.895241,0.868631,0.870163,0.8,0.925,0.96,0.975,0.8,0.308333,0.192,0.0975,0.8,0.925,0.96,0.975,0.895241,0.868631,0.870163
120,No log,No log,0.8,0.925,0.96,0.975,0.8,0.308333,0.192,0.0975,0.8,0.925,0.96,0.975,0.895296,0.868806,0.870346,0.8,0.925,0.96,0.975,0.8,0.308333,0.192,0.0975,0.8,0.925,0.96,0.975,0.895296,0.868806,0.870346
150,No log,No log,0.805,0.93,0.965,0.975,0.805,0.31,0.193,0.0975,0.805,0.93,0.965,0.975,0.896157,0.870048,0.871504,0.805,0.93,0.965,0.975,0.805,0.31,0.193,0.0975,0.805,0.93,0.965,0.975,0.896157,0.870048,0.871504
160,No log,No log,0.8,0.925,0.965,0.97,0.8,0.308333,0.193,0.097,0.8,0.925,0.965,0.97,0.890467,0.864042,0.865887,0.8,0.925,0.965,0.97,0.8,0.308333,0.193,0.097,0.8,0.925,0.965,0.97,0.890467,0.864042,0.865887
200,No log,No log,0.805,0.925,0.965,0.97,0.805,0.308333,0.193,0.097,0.805,0.925,0.965,0.97,0.892093,0.866292,0.868008,0.805,0.925,0.965,0.97,0.805,0.308333,0.193,0.097,0.805,0.925,0.965,0.97,0.892093,0.866292,0.868008
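
With exactly one relevant chunk per question, Recall@k equals Accuracy@k and Precision@k is Accuracy@k divided by k, which is why the log shows 0.915 and 0.305 at k=3. The sketch below computes these metrics on a made-up set of rankings; `retrieval_metrics` and `mean_metrics` are hypothetical helpers, not the evaluator's own code.

```python
def retrieval_metrics(ranked_ids, relevant_id, k):
    """Accuracy@k, Precision@k, Recall@k and reciprocal rank for one query
    that has exactly one relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "reciprocal_rank": 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0,
    }

def mean_metrics(rankings, k):
    """Average each metric over all (ranked_ids, relevant_id) pairs."""
    per_query = [retrieval_metrics(ranked, relevant, k) for ranked, relevant in rankings]
    return {name: sum(m[name] for m in per_query) / len(per_query) for name in per_query[0]}

# Made-up rankings: (retrieved ids in rank order, the single relevant id)
toy_rankings = [
    (["d1", "d2", "d3", "d4"], "d1"),  # relevant at rank 1
    (["d2", "d1", "d3", "d4"], "d1"),  # relevant at rank 2
    (["d2", "d3", "d4", "d1"], "d1"),  # relevant outside the top 3
    (["d5", "d2", "d3", "d4"], "d5"),  # relevant at rank 1
]
summary = mean_metrics(toy_rankings, k=3)
print(summary)
```

The averaged reciprocal rank is the MRR@k figure, and it rewards placing the relevant chunk first rather than merely somewhere in the top k.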


## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [55]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [56]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate! We'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-m`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [57]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 200/200 [02:58<00:00,  1.12it/s]


In [58]:
te3_results_df = pd.DataFrame(te3_results)

In [59]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

0.935

### `Snowflake/snowflake-arctic-embed-m` (base)

In [60]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-m")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 200/200 [00:05<00:00, 37.28it/s]


In [61]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [62]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.485

### `Snowflake/snowflake-arctic-embed-m` (fine-tuned)

In [63]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 200/200 [00:04<00:00, 48.36it/s]


In [64]:
finetune_results_df = pd.DataFrame(finetune_results)

In [65]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

0.99

# 🤝 Breakout Room #2

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [66]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

# training_documents = text_splitter.split_documents(training_documents_loaded.load())
training_documents = text_splitter.split_documents(documents)

### Base Chain

We'll start by constructing our base chain, which will use the base (not fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [67]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [68]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [69]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [70]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
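
The data flow through this chain can be traced with plain functions. The sketch below is not LangChain code: `fake_retriever` and `fake_llm` are hypothetical stand-ins for the retriever and for the `rag_prompt_template | rag_llm | StrOutputParser()` step, used only to show which keys exist at each stage.

```python
def fake_retriever(question):
    """Hypothetical stand-in for `base_retriever`: returns fixed context strings."""
    return ["Red-teaming stress tests safeguards.", "AI risks vary by lifecycle stage."]

def fake_llm(prompt):
    """Hypothetical stand-in for the prompt | LLM | output parser step."""
    return f"stub answer to a prompt of {len(prompt)} characters"

def rag_chain(inputs):
    # Stage 1: build {"context", "question"} from the incoming question
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Stage 2: the assign step keeps every existing key and (re)sets "context"
    state = {**state, "context": state["context"]}
    # Stage 3: format the prompt, call the model, and pass the context through
    context_block = "\n".join(state["context"])
    prompt = f"Context:\n{context_block}\n\nQuestion:\n{state['question']}\n\nAnswer:\n"
    return {"response": fake_llm(prompt), "context": state["context"]}

result = rag_chain({"question": "What is the purpose of red-teaming?"})
print(result["response"])
print(len(result["context"]), "context chunks returned alongside the response")
```

Returning the retrieved context next to the response is what later makes it possible to evaluate retrieval and generation together.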

In [71]:
base_rag_chain.invoke({"question" : "What are the risks of AI?"})["response"]

'The risks of AI can be categorized into several types, including:\n\n1. **Technical / Model Risks**: These include risks from malfunction such as confabulation, dangerous or violent recommendations, data privacy issues, and challenges related to value chain and component integration.\n\n2. **Misuse by Humans**: This category encompasses malicious use of AI, including the dissemination of harmful content, data privacy violations, and issues related to human-AI configuration.\n\n3. **Ecosystem / Societal Risks**: These are systemic risks that can affect society at large, including data privacy concerns, environmental impacts, and intellectual property issues.\n\nAdditionally, AI risks can vary based on the stage of the AI lifecycle (design, development, deployment, operation, and decommissioning) and the scope of the risks (individual model, application, or ecosystem level). Some risks may also be cross-cutting between these categories.'

In [72]:
base_rag_chain.invoke({"question" : "What is the purpose of red-teaming in AI safety?"})["response"]

'The purpose of red-teaming in AI safety is to identify potential adverse behaviors or outcomes of AI models or systems, understand how these issues could occur, and stress test safeguards. Red-teaming exercises can be performed before or after AI models are made available to the public, with a focus on pre-deployment contexts. The quality of the outputs from AI red-teaming is influenced by the background and expertise of the red team, and diverse teams can help identify flaws in the AI systems. Ultimately, the results of red-teaming should be analyzed further before being incorporated into organizational governance, decision-making, and AI risk management efforts.'

In [73]:
base_rag_chain.invoke({"question" : "What are the biggest privacy concerns related to AI?"})["response"]

'The biggest privacy concerns related to AI include data privacy, which involves the protection of personal information and ensuring that data is used responsibly and ethically. Other concerns may involve the misuse of AI technologies that could lead to unauthorized access to sensitive information, the potential for surveillance, and the risk of data breaches that compromise individual privacy. Additionally, there are concerns about how AI systems handle and process personal data, which can lead to issues of consent and transparency.'

In [74]:
base_rag_chain.invoke({"question" : "How can government regulate AI?"})["response"]

'The government can regulate AI by establishing principles that ensure AI systems are lawful, purposeful, accurate, safe, understandable, responsible, monitored, transparent, and accountable. This can be achieved by integrating these principles into policy, practice, and the technological design process. Additionally, the government can inform policy decisions using frameworks like the Blueprint for an AI Bill of Rights, which aims to protect the public from potential harms associated with AI technologies.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [75]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [76]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [77]:
finetune_rag_chain.invoke({"question" : "What are the risks of AI?"})["response"]

'The risks of AI can differ from or intensify traditional software risks and can be categorized along various dimensions, including:\n\n1. **Stage of the AI lifecycle**: Risks can arise during design, development, deployment, operation, and decommissioning.\n2. **Scope**: Risks may exist at individual model or system levels, at the application or implementation levels, or at the ecosystem level.\n3. **Technical / Model risks**: These include risks from malfunction, such as confabulation, dangerous or violent recommendations, data privacy issues, and integration challenges.\n4. **Human interaction risks**: These involve the abuse, misuse, and unsafe repurposing of AI systems by humans.\n5. **Time scale**: Risks may materialize abruptly or over extended periods, such as immediate emotional harm from harmful deepfake images or long-term societal impacts from disinformation.\n\nAdditionally, there are concerns about emotional entanglement between humans and AI systems, which could lead to 

In [78]:
finetune_rag_chain.invoke({"question" : "What is the purpose of red-teaming in AI safety?"})["response"]

'The purpose of red-teaming in AI safety is to assess the resilience of AI models or systems against potential adverse behaviors or outcomes. It involves structured testing exercises to identify flaws and vulnerabilities, such as inaccurate, harmful, or discriminatory outputs. Red-teaming can be performed before or after AI models are made available to the public, and it aims to stress test safeguards, inform design and implementation decisions, and enhance overall AI risk management efforts.'

In [79]:
finetune_rag_chain.invoke({"question" : "What are the biggest privacy concerns related to AI?"})["response"]

'The biggest privacy concerns related to AI include:\n\n1. **Use of Personal Data**: GAI systems often require large volumes of data for training, which may include personal data. This raises risks to privacy principles such as transparency, individual participation (including consent), and purpose specification.\n\n2. **Lack of Disclosure**: Many model developers do not disclose the specific data sources used for training, limiting user awareness of whether personally identifiable information (PII) was included and how it was collected.\n\n3. **Data Memorization**: Outputs from AI systems may display instances of training data memorization, which could infringe on copyright and privacy rights.\n\n4. **Reverse Engineering Risks**: There are risks associated with outputting training data samples, which can lead to reverse engineering, model extraction, and membership inference attacks.\n\n5. **Revealing Sensitive Information**: AI systems may inadvertently reveal biometric, confidential

In [80]:
finetune_rag_chain.invoke({"question" : "How can government regulate AI?"})["response"]

'The government can regulate AI by implementing strong safety regulations and measures to address harms when they occur, similar to the regulatory framework for motor vehicles. This includes ensuring that AI systems are lawful, purposeful, performance-driven, accurate, reliable, effective, safe, secure, resilient, understandable, responsible, traceable, regularly monitored, transparent, and accountable. Additionally, agencies can create inventories of AI use cases and develop plans to bring AI systems into compliance with established regulations.'

#####❓ Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

The fine-tuned RAG chain performed better. It answered the codes of practice question, and its answers to the other questions were roughly on par with the base chain's.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe checks, but let's use RAGAS to get more insight into how things are improving!

In [81]:
# !pip install -qU ragas

### RAGAS Synthetic Testset Generation

First things first, we need to generate some data to test our model on.

Let's use our test data that we created before as a base!

In [82]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

In [83]:
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [84]:
testset = generator.generate_with_langchain_docs(test_split_documents, test_size=20, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})


In [85]:
testset.to_pandas().head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | How can advancing accountability in AI contrib... | [National Institue of Standards and Technology... | Advancing accountability in AI can contribute ... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 1 | How can structured feedback help improve diver... | [39 \nMS-3.3-004 \nProvide input for training ... | Structured feedback can help improve diverse a... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 2 | What are some examples of latent systemic bias... | [intersecting groups; Completeness, representa... | Forms of latent systemic bias in images, text,... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 3 | How does the concept of Human-AI Configuration... | [45 \nMG-4.1-007 \nVerify that AI Actors respo... | The concept of Human-AI Configuration is relat... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
| 4 | How can provenance data tracking and synthetic... | [distinguish human-generated content from AI-g... | Provenance data tracking and synthetic content... | simple | [{'source': 'NIST.AI.600-1.pdf', 'file_path': ... | True |
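
The `distributions` argument passed to the generator asks for a mix of question types. As a rough sketch of the arithmetic only, assuming each share is simply scaled by `test_size` (an assumption about the generator's behavior, not something this notebook confirms), the requested split works out as below. The helper `expected_counts` is hypothetical, and plain strings stand in for the `simple`, `reasoning`, and `multi_context` evolution objects.

```python
def expected_counts(test_size, distributions):
    """Scale each evolution type's share by the requested test size."""
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"distribution shares should sum to 1, got {total}")
    return {name: share * test_size for name, share in distributions.items()}

# Same numbers as the generate_with_langchain_docs call in this section.
requested = expected_counts(
    20,
    {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25},
)
print(requested)
```

These are targets rather than guarantees: the answer-generation progress bars later in this section show 19 questions, not the 20 requested.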


### Generating Answer Datasets

For each of our pipelines, let's generate answers to these questions!

Once we have our questions, answers, contexts, and ground truths, we can move on to evaluating our datasets!

In [86]:
from datasets import Dataset
from tqdm import tqdm

def generate_answers(chain, testset):
  """Run every test set question through `chain` and collect the columns used for evaluation."""
  answers = []
  contexts = []
  testset_df = testset.to_pandas()
  questions = testset_df["question"].values.tolist()
  ground_truths = testset_df["ground_truth"].values.tolist()

  for question in tqdm(questions):
    # The chain returns a dict holding the generated "response" and the retrieved "context" documents.
    answer = chain.invoke({"question" : question})
    answers.append(answer["response"])
    contexts.append([context.page_content for context in answer["context"]])

  return Dataset.from_dict({
      "question" : questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : ground_truths
  })
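
Before handing these columns to `Dataset.from_dict`, it can help to sanity-check their shape. The following is a minimal sketch, not part of the original notebook: `check_columns` is a hypothetical helper, and the toy row is made up for illustration. It checks that the four columns built by `generate_answers` are present, have equal lengths, and that each `contexts` entry is a list of passages rather than a single string.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")

def check_columns(columns):
    """Return the row count, or raise ValueError if the columns are malformed."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")

    # `contexts` holds one list of retrieved passages per question.
    if not all(isinstance(ctx, list) for ctx in columns["contexts"]):
        raise ValueError("each entry in `contexts` should be a list of strings")

    return lengths["question"]

# Toy data (made up) in the same layout that generate_answers produces.
toy = {
    "question": ["Who drives the bus?"],
    "answer": ["Kyle drives the bus."],
    "contexts": [["The bus was driven by Kyle, the Bus Driver."]],
    "ground_truth": ["Kyle, the Bus Driver."],
}
n_rows = check_columns(toy)
print(n_rows)
```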

In [87]:
base_dataset = generate_answers(base_rag_chain, testset)

100%|██████████| 19/19 [00:52<00:00,  2.76s/it]


In [88]:
finetune_dataset = generate_answers(finetune_rag_chain, testset)

100%|██████████| 19/19 [01:03<00:00,  3.32s/it]


### Evaluating Using the Test Set

Now that we have a test set, it's time to evaluate our pipelines with it!

In [89]:
from ragas.metrics import (
    context_recall,
    context_precision,
)

In [90]:
from ragas import evaluate

result = evaluate(
    base_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)


In [91]:
result

{'context_precision': 0.4996, 'context_recall': 0.4292}
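
For intuition about what these two numbers measure, here is a minimal sketch based on my reading of the Ragas documentation: context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the fraction of ground-truth statements that the retrieved context supports. In Ragas the relevance and attribution verdicts come from an LLM judge. Here they are hard-coded, made-up values, and both function names are hypothetical.

```python
def context_precision_sketch(relevance):
    """Mean of precision@k over the ranks k where the chunk is relevant.

    `relevance` is a list of 0/1 verdicts ordered by retrieval rank.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0

def context_recall_sketch(attributed):
    """Fraction of ground-truth statements supported by the retrieved context.

    `attributed` is a list of 0/1 verdicts, one per ground-truth statement.
    """
    return sum(attributed) / len(attributed) if attributed else 0.0

# Made-up verdicts: five retrieved chunks, the fourth one irrelevant.
precision = context_precision_sketch([1, 1, 1, 0, 1])
# Made-up verdicts: one of three ground-truth statements is supported.
recall = context_recall_sketch([1, 0, 0])
print(round(precision, 4), round(recall, 4))  # 0.95 0.3333
```

Because precision@k is only counted at relevant ranks, an irrelevant chunk near the top of the ranking costs more than one near the bottom.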

In [92]:
result.to_pandas().head()

| | question | contexts | answer | ground_truth | context_precision | context_recall |
|---|---|---|---|---|---|---|
| 0 | How can advancing accountability in AI contrib... | [FROM \nPRINCIPLES \nTO PRACTICE \nA TECHINCAL... | Advancing accountability in AI can contribute ... | Advancing accountability in AI can contribute ... | 0.0 | 0.0 |
| 1 | How can structured feedback help improve diver... | [FROM \nPRINCIPLES \nTO PRACTICE \nA TECHINCAL... | Structured feedback can help improve diverse a... | Structured feedback can help improve diverse a... | 0.0 | 0.333333 |
| 2 | What are some examples of latent systemic bias... | [sources; demographic group and subgroup cover... | Some examples of latent systemic bias that can... | Forms of latent systemic bias in images, text,... | 0.755556 | 1.0 |
| 3 | How does the concept of Human-AI Configuration... | [FROM \nPRINCIPLES \nTO PRACTICE \nA TECHINCAL... | The concept of Human-AI Configuration relates ... | The concept of Human-AI Configuration is relat... | 0.566667 | 1.0 |
| 4 | How can provenance data tracking and synthetic... | [information access about both authentic and s... | Provenance data tracking and synthetic content... | Provenance data tracking and synthetic content... | 0.95 | 1.0 |


In [93]:
result = evaluate(
    finetune_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)


In [94]:
result

{'context_precision': 0.9576, 'context_recall': 0.8289}

In [95]:
result.to_pandas().head()

| | question | contexts | answer | ground_truth | context_precision | context_recall |
|---|---|---|---|---|---|---|
| 0 | How can advancing accountability in AI contrib... | [14 \nGOVERN 1.2: The characteristics of trust... | Advancing accountability in AI can contribute ... | Advancing accountability in AI can contribute ... | 1.0 | 1.0 |
| 1 | How can structured feedback help improve diver... | [39 \nMS-3.3-004 \nProvide input for training ... | Structured feedback can help improve diverse a... | Structured feedback can help improve diverse a... | 1.0 | 0.666667 |
| 2 | What are some examples of latent systemic bias... | [sources; demographic group and subgroup cover... | Some examples of latent systemic bias that can... | Forms of latent systemic bias in images, text,... | 0.926667 | 1.0 |
| 3 | How does the concept of Human-AI Configuration... | [Human-AI Conﬁguration \nMP-3.4-005 Implement ... | The concept of Human-AI Configuration relates ... | The concept of Human-AI Configuration is relat... | 1.0 | 1.0 |
| 4 | How can provenance data tracking and synthetic... | [A.1.6. Content Provenance \nOverview \nGAI te... | Provenance data tracking and synthetic content... | Provenance data tracking and synthetic content... | 1.0 | 1.0 |
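
To make the comparison between the two runs concrete, the aggregate scores printed for each pipeline can be compared directly. This sketch copies the two printed result dicts into plain Python dictionaries. The names `base_scores` and `finetune_scores` are hypothetical, since the notebook reuses `result` for both runs.

```python
# Aggregate scores copied from the two `result` outputs in this section.
base_scores = {"context_precision": 0.4996, "context_recall": 0.4292}
finetune_scores = {"context_precision": 0.9576, "context_recall": 0.8289}

# Absolute improvement of the fine-tuned pipeline over the base pipeline.
deltas = {
    metric: finetune_scores[metric] - base_scores[metric]
    for metric in base_scores
}

for metric, delta in deltas.items():
    base = base_scores[metric]
    tuned = finetune_scores[metric]
    print(
        f"{metric}: {base:.4f} -> {tuned:.4f} "
        f"({delta:+.4f} absolute, {delta / base:+.1%} relative)"
    )
```

With only 19 questions in the test set, these aggregates are probably best read as directional rather than precise.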


#### 🏗️ Activity #3:

Discuss changes that you'd make to this pipeline based on the performance improvements that you see with RAGAS and the fine-tuning.

Come up with 3 changes, and then we'll discuss these options as a group!

1. ...
2. ...
3. ...

In [98]:
finetune_embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='finetuned_arctic', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)
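
The printed module list describes a three-stage pipeline: a BERT transformer, CLS-token pooling (`pooling_mode_cls_token: True`), and L2 normalization. As a rough illustration of the last two stages only, the sketch below uses made-up 3-dimensional token vectors in place of real transformer output (the real vectors have 768 dimensions). The function names are hypothetical and this is not the sentence-transformers implementation.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Made-up token vectors; the first row plays the role of the [CLS] token.
token_embeddings = [
    [3.0, 4.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.5, 0.5, 0.5],
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0]
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.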

In [99]:
from huggingface_hub import notebook_login

notebook_login()


In [101]:
model.push_to_hub("checkthisout/finetuned_arctic")


'https://huggingface.co/checkthisout/finetuned_arctic/commit/169c1c3b2f24466ea4d6db3d4ad22ebf2ee1bf06'