<a href="https://colab.research.google.com/github/bekingcn/colab-archive/blob/main/Fine_tuning_Embedding_Models_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Embedding Models

In this Notebook we'll explore one of the most powerful techniques for taking a single-domain RAG pipeline to the next level: fine-tuning the embedding model on data from that domain.

## Task 1: Dependencies and Boilerplate

We'll set up `nest_asyncio` so we can run async event loops inside our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

In [None]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community


In [None]:
!pip install -qU unstructured faiss-cpu


### Provide OpenAI API Key

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

The data can be found in [this GitHub repo](https://github.com/AI-Maker-Space/DataRepository/tree/main/high-performance-rag).

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 81 (delta 22), reused 28 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (81/81), 70.06 MiB | 24.98 MiB/s, done.
Resolving deltas: 100% (22/22), done.


In [None]:
%cd DataRepository

/content/DataRepository/DataRepository


In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

training_documents_loaded = UnstructuredHTMLLoader("elon_lex_transcript.html.html")

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

In [None]:
training_documents = text_splitter.split_documents(training_documents_loaded.load())

In [None]:
len(training_documents)

963
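To build some intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` and `toy_text` are hypothetical names used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 2000 characters of repeating digits (an assumption, not the transcript)
toy_text = "".join(str(i % 10) for i in range(2000))
toy_chunks = sliding_window_chunks(toy_text, chunk_size=750, chunk_overlap=20)

print(len(toy_chunks), [len(chunk) for chunk in toy_chunks])
# The last 20 characters of one chunk are the first 20 of the next
print(toy_chunks[0][-20:] == toy_chunks[1][:20])
```

The overlap means a sentence that straddles a chunk boundary is less likely to be cut off from all of its context.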

In [None]:
import uuid

id_set = set()

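for document in training_documents:
  doc_id = str(uuid.uuid4())
  # Regenerate in the (very unlikely) event of a collision, keeping the ID a string
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id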

In [None]:
training_split_documents = training_documents[:300]

In [None]:
val_split_documents = training_documents[300:350]

In [None]:
test_split_documents = training_documents[350:400]
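The three slices above can be sanity-checked with a short sketch. It uses integers as stand-ins for the 963 `Document` objects, and all variable names here are illustrative rather than taken from the notebook.

```python
# Integers stand in for the 963 chunks produced by the splitter
toy_documents = list(range(963))

train_split = toy_documents[:300]
val_split = toy_documents[300:350]
test_split = toy_documents[350:400]

# Contiguous, non-overlapping slices: no chunk lands in two splits
no_leakage = (
    not set(train_split) & set(val_split)
    and not set(train_split) & set(test_split)
    and not set(val_split) & set(test_split)
)

# Chunks that are not used for training, validation, or testing
unused = len(toy_documents) - len(train_split) - len(val_split) - len(test_split)

print(len(train_split), len(val_split), len(test_split), unused, no_leakage)
```

Only the first 400 chunks are used. Note that because the splitter was configured with a 20-character overlap, neighbouring chunks at a split boundary may still share a small amount of text.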

## Task 3: Constructing a Fine-tuning Dataset

Using the documents we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (see the [release announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [None]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [None]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

In [None]:
import tqdm

def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}
  for document in tqdm.tqdm(documents):
    questions_generated = question_generation_chain.invoke({"context": document.page_content, "n_questions": n_questions})
    for line in questions_generated.content.split("\n"):
      # Skip blank lines the model may put between numbered questions
      if not line.strip():
        continue
      question_id = str(uuid.uuid4())
      # Drop the leading "1." numbering but keep any periods inside the question
      questions[question_id] = line.split(".", 1)[-1].strip()
      # Each question maps back to the single document it was generated from
      relevant_docs[question_id] = [document.metadata["id"]]
  return questions, relevant_docs
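The parsing step inside `create_questions` depends on the model returning a numbered list. Here is a minimal sketch of that step in isolation, run on a made-up response; `parse_numbered_questions` and `sample_response` are hypothetical and not part of the notebook.

```python
def parse_numbered_questions(raw: str) -> list[str]:
    """Turn '1. Question\\n2. Question' style output into a list of questions."""
    parsed_questions = []
    for line in raw.split("\n"):
        line = line.strip()
        if not line:
            continue  # ignore blank lines between items
        # Split on the first period only, so periods inside the question survive
        prefix, separator, rest = line.partition(".")
        if separator and prefix.strip().isdigit():
            parsed_questions.append(rest.strip())
        else:
            parsed_questions.append(line)  # line was not numbered; keep it whole
    return parsed_questions


# Made-up model output with a blank line and an abbreviation containing periods
sample_response = "1. What is chunk overlap?\n\n2. What does the abbreviation e.g. stand for?"
parsed = parse_numbered_questions(sample_response)
print(parsed)
```

Checking that `len(parsed)` equals the requested number of questions is a cheap way to catch responses that ignored the format.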

We'll use the function to generate training, validation, and test data.

In [None]:
training_questions, training_relevant_contexts = create_questions(training_split_documents, 2)

100%|██████████| 300/300 [05:02<00:00,  1.01s/it]


In [None]:
val_questions, val_relevant_contexts = create_questions(val_split_documents, 2)

100%|██████████| 50/50 [00:44<00:00,  1.11it/s]


In [None]:
test_questions, test_relevant_contexts = create_questions(test_split_documents, 2)

100%|██████████| 50/50 [00:51<00:00,  1.04s/it]


We'll save each dataset for use later.

> NOTE: These datasets will be provided in the repository in case you run into any issues with the data generation steps or you wish to save API calls.

In [None]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [None]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [None]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
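Each saved file holds one JSON object with three keys, so despite the `.jsonl` extension it should be read back with `json.load` rather than line by line. The sketch below round-trips a toy dataset of the same shape and checks that every question points at a document in the corpus; `toy_dataset` and `check_dataset` are illustrative names, not part of the notebook.

```python
import json
import os
import tempfile

# Toy dataset with the same three keys as the datasets saved above
toy_dataset = {
    "questions": {"q1": "What is chunking?", "q2": "Why overlap chunks?"},
    "relevant_contexts": {"q1": ["d1"], "q2": ["d1"]},
    "corpus": {"d1": "Chunking splits a document into overlapping pieces."},
}


def check_dataset(dataset: dict) -> bool:
    """Every question needs at least one relevant doc ID, and each ID must be in the corpus."""
    for question_id in dataset["questions"]:
        doc_ids = dataset["relevant_contexts"].get(question_id, [])
        if not doc_ids or any(doc_id not in dataset["corpus"] for doc_id in doc_ids):
            return False
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(path, "w") as f:
        json.dump(toy_dataset, f)
    with open(path) as f:
        reloaded = json.load(f)

print(check_dataset(reloaded), reloaded == toy_dataset)
```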

## Task 4: Fine-tuning `snowflake-arctic-embed-m`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

In [None]:
!pip install -qU sentence_transformers datasets pyarrow

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.[0m[31m


In [None]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-m"
model = SentenceTransformer(model_id)


We'll grab some necessary imports from `sentence_transformers` and `torch`.

In [None]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

Ideally we'd want to use a much larger batch size (~64+), since `MultipleNegativesRankingLoss` treats the other examples in each batch as negatives.

In [None]:
BATCH_SIZE = 20

Let's move our dataset into the expected format for training.

In [None]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [None]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

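The loss we're using today is `MultipleNegativesRankingLoss`, wrapped in `MatryoshkaLoss` so that the same objective is also applied to truncated embeddings (768 down to 64 dimensions) - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).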
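To make the idea behind `MultipleNegativesRankingLoss` concrete, here is a dependency-free sketch of the computation as I understand it: for each query, its paired document is the positive and every other document in the batch acts as a negative, scored with scaled cosine similarity and cross-entropy. The function names and toy vectors are assumptions for illustration; the library's real implementation works on batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def multiple_negatives_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where document i is the correct 'class' for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong document

aligned_loss = multiple_negatives_ranking_loss(toy_queries, aligned_docs)
swapped_loss = multiple_negatives_ranking_loss(toy_queries, swapped_docs)
print(aligned_loss < swapped_loss)
```

Because negatives come from the rest of the batch, a batch of size `B` gives each query `B - 1` negatives, which is why larger batches tend to help with this loss.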

In [None]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
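The `matryoshka_dims` list tells the wrapper which prefix lengths of the embedding should remain useful on their own. A rough sketch of the truncation idea, using a toy 8-dimensional vector in place of the model's 768-dimensional output (all names and numbers here are made up for illustration):

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


toy_dims = [8, 4, 2]  # stands in for [768, 512, 256, 128, 64]
query_vec = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
doc_vec = [0.45, 0.5, 0.2, 0.25, 0.0, 0.2, 0.1, 0.0]

# Cosine similarity of the same pair at each truncation length
similarity_by_dim = {
    dim: dot(truncate_and_normalize(query_vec, dim), truncate_and_normalize(doc_vec, dim))
    for dim in toy_dims
}

for dim, similarity in similarity_by_dim.items():
    print(dim, round(similarity, 4))
```

Training with the inner loss at every one of these lengths encourages the leading dimensions to carry most of the signal, so a shorter vector can be stored when index size matters.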

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though we could increase this number if we had significantly more data.

In [None]:
EPOCHS = 5

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
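Before running the training cell, it can help to work out how many optimizer steps this configuration implies. The sketch below assumes every one of the 300 training documents yielded exactly 2 questions; the variable names are illustrative.

```python
import math

n_documents = 300            # size of the training split
questions_per_document = 2   # n_questions passed to create_questions
batch_size = 20
epochs = 5

# Assumes the LLM returned exactly two questions for every document
n_examples = n_documents * questions_per_document

# A DataLoader without drop_last reports ceil(n_examples / batch_size) batches
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs

# 10% of all steps are spent ramping the learning rate up
warmup_steps = int(total_steps * 0.1)

print(n_examples, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions an epoch ends every 30 steps, which would explain why the training log reports evaluations at multiples of 30 in addition to the multiples of 50 requested through `evaluation_steps`.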

In [None]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100,Dot Accuracy@1,Dot Accuracy@3,Dot Accuracy@5,Dot Accuracy@10,Dot Precision@1,Dot Precision@3,Dot Precision@5,Dot Precision@10,Dot Recall@1,Dot Recall@3,Dot Recall@5,Dot Recall@10,Dot Ndcg@10,Dot Mrr@10,Dot Map@100
30,No log,No log,0.87,0.97,0.98,0.99,0.87,0.323333,0.196,0.099,0.87,0.97,0.98,0.99,0.937034,0.919167,0.919936,0.87,0.97,0.98,0.99,0.87,0.323333,0.196,0.099,0.87,0.97,0.98,0.99,0.937034,0.919167,0.919936
50,No log,No log,0.9,0.98,0.99,0.99,0.9,0.326667,0.198,0.099,0.9,0.98,0.99,0.99,0.951724,0.938667,0.939436,0.9,0.98,0.99,0.99,0.9,0.326667,0.198,0.099,0.9,0.98,0.99,0.99,0.951724,0.938667,0.939436
60,No log,No log,0.9,0.98,0.98,0.99,0.9,0.326667,0.196,0.099,0.9,0.98,0.98,0.99,0.952727,0.94,0.940909,0.9,0.98,0.98,0.99,0.9,0.326667,0.196,0.099,0.9,0.98,0.98,0.99,0.952727,0.94,0.940909
90,No log,No log,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948808,0.934762,0.935671,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948808,0.934762,0.935671
100,No log,No log,0.89,0.97,0.98,0.99,0.89,0.323333,0.196,0.099,0.89,0.97,0.98,0.99,0.947936,0.93375,0.934659,0.89,0.97,0.98,0.99,0.89,0.323333,0.196,0.099,0.89,0.97,0.98,0.99,0.947936,0.93375,0.934659
120,No log,No log,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948629,0.934583,0.935492,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948629,0.934583,0.935492
150,No log,No log,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948629,0.934583,0.935492,0.89,0.98,0.98,0.99,0.89,0.326667,0.196,0.099,0.89,0.98,0.98,0.99,0.948629,0.934583,0.935492
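One way to read the log: since every question has exactly one relevant document, accuracy@k and recall@k coincide, and precision@k is simply recall@k divided by k. The short check below uses the cosine values copied from the step-30 row; the dictionary names are illustrative.

```python
# Values copied from the step-30 row of the log
recall_at_k = {1: 0.87, 3: 0.97, 5: 0.98, 10: 0.99}
reported_precision_at_k = {1: 0.87, 3: 0.323333, 5: 0.196, 10: 0.099}

# With one relevant document per query, at most one of the top-k results can be
# relevant, so precision@k = recall@k / k
derived_precision_at_k = {k: recall / k for k, recall in recall_at_k.items()}

for k in recall_at_k:
    matches = abs(derived_precision_at_k[k] - reported_precision_at_k[k]) < 1e-5
    print(k, round(derived_precision_at_k[k], 6), matches)
```

This is why the precision columns look low even though retrieval is working well: they are capped at `1 / k` in this setup.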



## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [None]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: Each question has exactly one correct document, so a "hit" means that document appears in the top `top_k` results.

In [None]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
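The metric this function feeds into is a plain hit rate. Here is a self-contained sketch of the same idea on hand-written retrieval results, with no vector store involved; `hit_rate` and the toy IDs are hypothetical.

```python
def hit_rate(retrieved_ids_by_question, expected_id_by_question, top_k=5):
    """Fraction of questions whose expected document is among the first top_k results."""
    hits = [
        expected_id_by_question[question_id] in retrieved_ids[:top_k]
        for question_id, retrieved_ids in retrieved_ids_by_question.items()
    ]
    return sum(hits) / len(hits)


# Ranked document IDs a retriever might return for four questions (made up)
toy_retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d6", "d5"],
    "q4": ["d4", "d2", "d1"],
}
toy_expected = {"q1": "d1", "q2": "d4", "q3": "d2", "q4": "d4"}

print(hit_rate(toy_retrieved, toy_expected, top_k=3))
print(hit_rate(toy_retrieved, toy_expected, top_k=1))
```

Hit rate ignores where in the top `k` the document appears, so it is more forgiving than rank-aware metrics such as MRR.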

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed-source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-m`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [None]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 100/100 [00:19<00:00,  5.25it/s]


In [None]:
te3_results_df = pd.DataFrame(te3_results)

In [None]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-m` (base)

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-m")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 100/100 [00:01<00:00, 81.37it/s]


In [None]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [None]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.64

### `Snowflake/snowflake-arctic-embed-m` (fine-tuned)

In [None]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 100/100 [00:01<00:00, 80.47it/s]


In [None]:
finetune_results_df = pd.DataFrame(finetune_results)

In [None]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

0.99

## Vibe Checking the RAG Pipeline

We're going to vibe check our RAG pipeline on a few domain-specific questions, comparing the base and fine-tuned embedding models!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(training_documents_loaded.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [None]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [None]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [None]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
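The LCEL expression above is compact, so here is a plain-Python analogue of the data flow as I read it. It uses no LangChain at all: `toy_retriever`, `toy_llm`, and `toy_rag_chain` are stand-ins invented for this sketch, and the prompt is shortened.

```python
TOY_PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"


def toy_retriever(question):
    """Stand-in for the FAISS retriever: returns a list of 'chunks'."""
    return [f"(toy chunk about: {question})"]


def toy_llm(prompt):
    """Stand-in for the chat model plus StrOutputParser: returns a string."""
    return f"(toy answer; prompt had {len(prompt)} characters)"


def toy_rag_chain(inputs):
    # Step 1: the first dict retrieves context for the question and passes the question along
    state = {"context": toy_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: the assign step re-assigns "context" to itself, so the state is unchanged here
    # Step 3: the final dict fills the prompt, calls the model, and keeps the context
    prompt = TOY_PROMPT.format(context=state["context"], question=state["question"])
    return {"response": toy_llm(prompt), "context": state["context"]}


toy_result = toy_rag_chain({"question": "What is a Neural Decoder?"})
print(sorted(toy_result))
```

Returning the context next to the response is what later makes it possible to score retrieval and generation separately.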

In [None]:
base_rag_chain.invoke({"question" : "What is Ayahuasca?"})["response"]

'I do not know.'

In [None]:
base_rag_chain.invoke({"question" : "What is the difference between Neurosurgery vs. Neuralink surgery?"})["response"]

'The difference between neurosurgery and Neuralink surgery lies primarily in the approach and risk involved. Traditional neurosurgery often involves more invasive procedures, such as opening deeper parts of the brain or manipulating blood vessels, which carry significant risks. In contrast, Neuralink surgery involves cortical micro-insertions that are performed on the surface of the brain, which significantly reduces the risk compared to more invasive surgeries like those for tumors or aneurysms. Additionally, Neuralink aims to utilize advanced technology and robotics to enhance precision in electrode placement, potentially changing the landscape of neurosurgical practices.'

In [None]:
base_rag_chain.invoke({"question" : "What is Neural Dust?"})["response"]

'I do not know.'

In [None]:
base_rag_chain.invoke({"question" : "What is a Neural Decoder?"})["response"]

'A Neural Decoder is a system or algorithm that interprets neural signals, such as sequences of spikes from neurons, to extract meaningful information or predictions. It involves machine learning techniques to create a mapping between the neural data and specific outputs or labels, addressing challenges in architecture and hyperparameters to optimize performance.'

In [None]:
base_rag_chain.invoke({"question" : "Who got Neuralink surgery?"})["response"]

'The first human being to receive Neuralink surgery is referred to as Noland.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [None]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [None]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [None]:
finetune_rag_chain.invoke({"question" : "What is Ayahuasca?"})["response"]

'Ayahuasca is a psychoactive brew made from the Banisteriopsis caapi vine and other ingredients, often used in traditional Amazonian shamanic practices for spiritual and healing purposes. It contains DMT (dimethyltryptamine), a powerful hallucinogenic compound, which can induce altered states of consciousness and profound experiences.'

In [None]:
finetune_rag_chain.invoke({"question" : "What is the difference between Neurosurgery vs. Neuralink surgery?"})["response"]

'Neurosurgery typically involves more complex procedures that may include opening deeper parts of the brain or manipulating blood vessels, which carries significant risks. In contrast, Neuralink surgery focuses on making cortical micro-insertions on the surface of the brain, which is considered to carry significantly less risk compared to traditional neurosurgeries like tumor or aneurysm surgeries. Additionally, Neuralink has developed a rigorous practice regimen, including hundreds of animal surgeries and the use of lifelike models to simulate the procedure before performing it on humans.'

In [None]:
finetune_rag_chain.invoke({"question" : "What is Neural Dust?"})["response"]

'Neural dust refers to a technology that involves tiny, wireless sensors that can be implanted in the brain to monitor neural activity. These sensors are designed to communicate with external devices, allowing for real-time data collection and potentially enabling advanced brain-computer interfaces.'

In [None]:
finetune_rag_chain.invoke({"question" : "What is a Neural Decoder?"})["response"]

'A Neural Decoder is a system or model that interprets and translates neural signals, such as sequences of spikes from neurons, into meaningful outputs or predictions. It involves building a dataset, optimizing labels for the model, and determining the appropriate architecture and hyperparameters to effectively map the input data to the desired output. The process combines elements of machine learning, science, and art to create a reliable and efficient decoding mechanism.'

In [None]:
finetune_rag_chain.invoke({"question" : "Who got Neuralink surgery?"})["response"]

'Noland Arbaugh got Neuralink surgery.'

## RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe checks, but let's use RAGAS to get more insight into how things are improving!

In [None]:
!pip install -qU ragas


### RAGAS Synthetic Testset Generation

First things first, we need to generate some data to test our model on.

Let's use our test data that we created before as a base!

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

In [None]:
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [None]:
testset = generator.generate_with_langchain_docs(test_split_documents, test_size=20, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})


In [None]:
testset.to_pandas().head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | How does the variability of tetraplegia situat... | [DJ Seo (03:27:14) Yeah, I mean the primary go... | The variability of tetraplegia situations impa... | simple | [{'source': 'elon_lex_transcript.html.html', '... | True |
| 1 | What is the potential future scenario of havin... | [Lex Fridman (03:37:00) And then, you’d be con... | The potential future scenario is that there co... | simple | [{'source': 'elon_lex_transcript.html.html', '... | True |
| 2 | What was Matthew MacDougall's initial focus wh... | [Matthew MacDougall (03:45:07) Basically, the ... | Matthew MacDougall's initial focus when lookin... | simple | [{'source': 'elon_lex_transcript.html.html', '... | True |
| 3 | What led the speaker to decide to study the br... | [Lex Fridman (03:48:25) So, from there, the ea... | The speaker decided to study the brain because... | simple | [{'source': 'elon_lex_transcript.html.html', '... | True |
| 4 | How does being nervous while talking affect pe... | [Lex Fridman (03:26:05) … a lot of novel insig... | Being nervous while talking can affect perform... | simple | [{'source': 'elon_lex_transcript.html.html', '... | True |


### Generating Answer Datasets

For each of our pipelines, let's generate answers to these questions!

Once we have our questions, answers, contexts, and ground truths, we can move on to evaluating our pipelines!

In [None]:
from datasets import Dataset

def generate_answers(chain, testset):
  answers = []
  contexts = []
  questions = testset.to_pandas()["question"].values.tolist()
  ground_truths = testset.to_pandas()["ground_truth"].values.tolist()

  for question in tqdm.tqdm(questions):
    answer = chain.invoke({"question" : question})
    answers.append(answer["response"])
    contexts.append([context.page_content for context in answer["context"]])

  return Dataset.from_dict({
      "question" : questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : ground_truths
  })

In [None]:
base_dataset = generate_answers(base_rag_chain, testset)

100%|██████████| 20/20 [00:25<00:00,  1.28s/it]


In [None]:
finetune_dataset = generate_answers(finetune_rag_chain, testset)

100%|██████████| 20/20 [00:29<00:00,  1.48s/it]


### Evaluating Using the Test Set

Now that we have a test set - it's time to evaluate our pipelines with it!

In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

In [None]:
from ragas import evaluate

result = evaluate(
    base_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)


In [None]:
result

{'context_precision': 0.5128, 'faithfulness': 0.4737, 'answer_relevancy': 0.4764, 'context_recall': 0.4583}

In [None]:
result.to_pandas().head()

| | question | answer | contexts | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | How does the variability of tetraplegia situat... | I do not know. | [Future Neuralink capabilities, that you can e... | The variability of tetraplegia situations impa... | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | What is the potential future scenario of havin... | I do not know. | [Future Neuralink capabilities, experience, I’... | The potential future scenario is that there co... | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | What was Matthew MacDougall's initial focus wh... | Matthew MacDougall's initial focus when lookin... | [Matthew MacDougall (03:48:34) Yeah. It was so... | Matthew MacDougall's initial focus when lookin... | 1.0 | 1.0 | 0.999999 | 0.0 |
| 3 | What led the speaker to decide to study the br... | The speaker was influenced to study the brain ... | [into delving deep into the human brain, so no... | The speaker decided to study the brain because... | 1.0 | 1.0 | 0.977114 | 0.0 |
| 4 | How does being nervous while talking affect pe... | I do not know. | [point, muscle memory kicks in and you sort of... | Being nervous while talking can affect perform... | 1.0 | 0.0 | 0.0 | 0.0 |


In [None]:
result = evaluate(
    finetune_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)


In [None]:
result

{'context_precision': 0.8086, 'faithfulness': 0.8799, 'answer_relevancy': 0.8680, 'context_recall': 0.9667}
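To put the two RAGAS runs side by side, the sketch below subtracts the base scores from the fine-tuned scores. The numbers are copied from the two result dictionaries printed in this section; the variable names are illustrative.

```python
# Scores copied from the base and fine-tuned evaluation outputs
base_scores = {
    "context_precision": 0.5128,
    "faithfulness": 0.4737,
    "answer_relevancy": 0.4764,
    "context_recall": 0.4583,
}
finetune_scores = {
    "context_precision": 0.8086,
    "faithfulness": 0.8799,
    "answer_relevancy": 0.8680,
    "context_recall": 0.9667,
}

# Absolute improvement per metric (fine-tuned minus base)
score_deltas = {
    metric: round(finetune_scores[metric] - base_scores[metric], 4)
    for metric in base_scores
}

for metric, delta in score_deltas.items():
    print(f"{metric}: {base_scores[metric]:.4f} -> {finetune_scores[metric]:.4f} ({delta:+.4f})")
```

Keep in mind that these scores come from a 20-question synthetic test set and a single run, so they are best read as a directional signal rather than a precise measurement.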

In [None]:
result.to_pandas().head()

| | question | answer | contexts | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | How does the variability of tetraplegia situat... | The variability of tetraplegia situations impa... | [DJ Seo (03:27:14) Yeah, I mean the primary go... | The variability of tetraplegia situations impa... | 0.966667 | 1.0 | 1.0 | 1.0 |
| 1 | What is the potential future scenario of havin... | The potential future scenario suggests that th... | [Lex Fridman (00:31:41) And that output rate w... | The potential future scenario is that there co... | 1.0 | 0.8 | 0.0 | 1.0 |
| 2 | What was Matthew MacDougall's initial focus wh... | Matthew MacDougall's initial focus when lookin... | [Matthew MacDougall (03:48:34) Yeah. It was so... | Matthew MacDougall's initial focus when lookin... | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | What led the speaker to decide to study the br... | The speaker was influenced by personal experie... | [(01:29:17) But also at the same time, I think... | The speaker decided to study the brain because... | 1.0 | 1.0 | 0.977114 | 1.0 |
| 4 | How does being nervous while talking affect pe... | Being nervous while talking can negatively imp... | [point, muscle memory kicks in and you sort of... | Being nervous while talking can affect perform... | 0.7 | 0.428571 | 0.997113 | 1.0 |


## Conclusion

As you can see - with only a few hundred data points, we're able to improve our embedding model (the test hit rate rose from 0.64 to 0.99) and increase the effectiveness of our RAG pipeline!