## Finetuning nomic-embed-text-v1

This notebook finetunes the [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model on AI Ethical Framework documents.

This model is ranked 8th on the MTEB leaderboard among models with fewer than 250M parameters. The hope is that, once finetuned on bespoke data, it will outperform off-the-shelf models.

We start by generating training, validation and test datasets of question/context pairs on which this model can be finetuned. Thereafter, we evaluate the performance of this model using a retrieval pipeline and the RAGAS framework.



## Install Dependencies

In [51]:
!pip install -qU sentence_transformers datasets pyarrow==15.0.2 pymupdf

In [52]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters langchain-qdrant qdrant-client ragas

In [53]:
!pip install -qU faiss-cpu unstructured==0.15.7 python-pptx==1.0.2 nltk==3.9.1

In [54]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


In [86]:
import os
from typing import Tuple
import torch
from operator import itemgetter
from langchain_core.prompts.chat import SystemMessagePromptTemplate, ChatPromptTemplate, PromptTemplate
from langchain_core.messages import SystemMessage, ChatMessage
from langchain_core.runnables import Runnable, RunnableParallel, RunnablePassthrough
from langchain.agents import Tool, AgentType, initialize_agent
from langchain.memory import ConversationBufferMemory, ConversationSummaryBufferMemory
from langchain_openai.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor
from langchain_community.document_loaders import PyMuPDFLoader, PyPDFLoader
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents.output_parsers import ReActSingleInputOutputParser
from langchain.tools.render import render_text_description
from langchain_core.output_parsers import StrOutputParser
from langchain.chains.conversation.base import ConversationChain
from langchain.agents import AgentExecutor, create_react_agent
from langchain.agents.tools import tool
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

## Prepare Relevant Documents

In [13]:
import numpy as np
from typing import List
from langchain_core.documents.base import Document
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


bill_of_rights_docs = PyMuPDFLoader('Blueprint-for-an-AI-Bill-of-Rights.pdf').load()
nist_framework_docs = PyMuPDFLoader('NIST.AI.600-1.pdf').load()
all_documents = bill_of_rights_docs + nist_framework_docs

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=250,
    length_function=len
  )
documents = text_splitter.split_documents(all_documents)

In [14]:
from uuid import uuid4
def assign_ids(documents):
  """Assigns a unique id to each document"""
  seen = set()
  for d in documents:
    doc_id = str(uuid4())
    # Regenerate on the (extremely unlikely) chance of a collision
    while doc_id in seen:
      doc_id = str(uuid4())
    seen.add(doc_id)
    d.metadata['id'] = doc_id

assign_ids(documents)

In [15]:
len(documents)

1517

In [17]:
random_idxs = np.random.permutation(len(documents))
training_documents = np.array(documents)[random_idxs].tolist()

## Generating Questions using ChatGPT
The section below generates a set of question/context pairs upon which an instance of nomic-embed-text-v1 can be finetuned.

In [18]:
prompt = ChatPromptTemplate.from_template("""
Given the context below, please generate questions which can be answered using the provided context.

You must generate exactly {n} questions per context in the following format:

1. Question# 1
2. Question# 2
.. and so on

Context:
{context}

""")

In [19]:
llm = ChatOpenAI(temperature=0, model='gpt-4o-mini')
question_chain = prompt | llm

In [20]:
# Sample run
r = question_chain.invoke({"n": 2, "context": training_documents[15].page_content})

In [21]:
r.content

'1. What determines the applicability of suggested actions to relevant AI actors?  \n2. How are Action IDs structured in relation to AI RMF functions and subcategories?'
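
The response comes back as one string containing a numbered list. As a minimal sketch of turning it into individual questions, the hypothetical helper `parse_numbered_questions` below (not part of the notebook) strips the leading numbering with a regular expression and skips blank lines, so it also copes with two-digit numbering such as `10.`:

```python
import re

def parse_numbered_questions(text):
    """Split a numbered-list string into a list of question strings."""
    questions = []
    for line in text.split("\n"):
        # Remove a leading "<number>." prefix plus surrounding whitespace
        cleaned = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions

# The sample response shown above
sample = (
    "1. What determines the applicability of suggested actions to relevant AI actors?  \n"
    "2. How are Action IDs structured in relation to AI RMF functions and subcategories?"
)
parsed = parse_numbered_questions(sample)
print(parsed)
```

Stripping the prefix by pattern rather than by a fixed character count avoids leaving stray characters behind when the model numbers its questions differently.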

In [22]:
from tqdm.auto import tqdm
from collections import defaultdict
def get_questions_and_contexts(question_chain, documents, num_questions=2):
  """Generates questions that can be answered from each of the given `documents`.
  Args:
  question_chain - The prompt | llm chain used to generate questions
  documents - The given context documents
  num_questions - Number of questions to request per document
  Returns:
  question_store, document_store
  question_store: A dictionary from unique question id -> generated question
  document_store: A dictionary from unique question id -> document id
  """
  question_store = {}
  document_store = {}
  seen_ids = set()

  for d in tqdm(documents):
    response = question_chain.invoke({
        "n": num_questions,
        "context": d.page_content
        })
    # Drop the leading "1. " numbering and skip blank lines in the response
    questions = [q[3:].strip() for q in response.content.split("\n") if q.strip()]
    for q in questions:
      id = str(uuid4())
      while id in seen_ids:
        id = str(uuid4())
      seen_ids.add(id)
      question_store[id] = q
      document_store[id] = d.metadata['id']

  return question_store, document_store

In [23]:
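from sklearn.model_selection import train_test_split
training_documents, val_documents = train_test_split(training_documents, test_size=0.20)
# NOTE: this second split draws from `training_documents`, so val/test are subsets
# of the training set (leakage). Splitting `val_documents` instead gives disjoint
# sets; the counts and scores recorded below come from the split as written.
val_documents, test_documents = train_test_split(training_documents, test_size=0.50)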
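
For the three sets to serve as independent training, validation and test data, they should not share documents. A minimal sketch of one way to get a disjoint split of roughly 80/10/10, using only the standard library; the helper `three_way_split`, its fractions and its seed are illustrative assumptions rather than the notebook's code:

```python
import random

def three_way_split(items, val_fraction=0.10, test_fraction=0.10, seed=42):
    """Shuffle `items` and cut them into disjoint train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Integer ids stand in for the 1517 document chunks
train_ids, val_ids, test_ids = three_way_split(range(1517))
print(len(train_ids), len(val_ids), len(test_ids))
```

Because each item lands in exactly one slice of the shuffled list, no document can appear in more than one set.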

In [24]:
training_question_store, training_document_store = get_questions_and_contexts(question_chain, training_documents)

  0%|          | 0/1213 [00:00<?, ?it/s]

In [25]:
val_question_store, val_document_store = get_questions_and_contexts(question_chain, val_documents)

  0%|          | 0/606 [00:00<?, ?it/s]

In [26]:
test_question_store, test_document_store = get_questions_and_contexts(question_chain, test_documents)

  0%|          | 0/607 [00:00<?, ?it/s]

In [28]:
# Write datasets to disk
import json
training_corpus = {d.metadata['id']: d.page_content for d in training_documents}
val_corpus = {d.metadata['id']: d.page_content for d in val_documents}
test_corpus = {d.metadata['id']: d.page_content for d in test_documents}


def write_data(data, filename):
  with open(filename, 'w') as f:
    f.write(json.dumps(data))

def write_corpus(question_store, document_store, corpus, prefix="training_"):
  data = {
    "questions": question_store,
    "contexts": document_store,
    "corpus": corpus
  }
  write_data(data, f"{prefix}data.jsonl")

write_corpus(training_question_store,
             training_document_store,
             training_corpus,
             prefix="training_")

write_corpus(val_question_store,
             val_document_store,
             val_corpus,
             prefix="val_")

write_corpus(test_question_store,
             test_document_store,
             test_corpus,
             prefix="test_")

In [87]:
# Read datasets from disk
import json

with open('training_data.jsonl', 'r') as f:
  training_data = json.loads(f.read())
training_question_store = training_data["questions"]
training_document_store = training_data["contexts"]
training_corpus = training_data["corpus"]


with open('val_data.jsonl', 'r') as f:
  val_data = json.loads(f.read())
val_question_store = val_data["questions"]
val_document_store = val_data["contexts"]
val_corpus = val_data["corpus"]


with open('test_data.jsonl', 'r') as f:
  test_data = json.loads(f.read())
test_question_store = test_data["questions"]
test_document_store = test_data["contexts"]
test_corpus = test_data["corpus"]
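
Each file holds a single JSON object with three maps: `questions` (question id to question text), `contexts` (question id to document id) and `corpus` (document id to chunk text). A small sketch with made-up ids and text, showing how the three maps join back into question/context pairs after a JSON round trip; the helper `to_pairs` is hypothetical:

```python
import json

# Toy data in the same layout as the files above (ids and text are made up)
toy_data = {
    "questions": {"q1": "What is the AI RMF?", "q2": "Who are AI actors?"},
    "contexts": {"q1": "d1", "q2": "d2"},
    "corpus": {
        "d1": "The AI RMF is a risk management framework.",
        "d2": "AI actors are those who play a role in the AI lifecycle.",
    },
}

def to_pairs(data):
    """Join each question to its source chunk via the question id -> document id map."""
    return [
        (question, data["corpus"][data["contexts"][q_id]])
        for q_id, question in data["questions"].items()
    ]

# Serialize and parse again, as write_data and the read cell do
pairs = to_pairs(json.loads(json.dumps(toy_data)))
for question, context in pairs:
    print(question, "->", context)
```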

## Finetuning nomic-embed-text-v1 on AI Ethics Dataset
Now that we have training, validation and test datasets of question/context pairs, we finetune an instance of the nomic-embed-text-v1 model on this data.

In [32]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample
from sentence_transformers import SentenceTransformer


model_id = "nomic-ai/nomic-embed-text-v1"
model = SentenceTransformer(model_id, trust_remote_code=True)



In [33]:
examples = []
for query_id, query in training_question_store.items():
    doc_id = training_document_store[query_id]
    text = training_corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

In [34]:
# https://huggingface.co/blog/matryoshka
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
BATCH_SIZE = 32
train_dl = DataLoader(
    examples, batch_size=BATCH_SIZE
)

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
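
The loss configured above combines two ideas: `MultipleNegativesRankingLoss` treats every other context in a batch as a negative for a given question, and `MatryoshkaLoss` applies the inner loss to truncated prefixes of the embedding so that shorter vectors stay useful. The NumPy sketch below is a rough illustration of that idea on random vectors, not the sentence-transformers implementation; the function names, the scale of 20 and the unweighted sum over dimensions are assumptions made for the example:

```python
import numpy as np

def in_batch_negatives_loss(queries, docs, scale=20.0):
    """Cross-entropy over scaled cosine similarities; row i's positive is doc i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each true pair
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_style_loss(queries, docs, dims):
    """Sum the inner loss over embeddings truncated to their first k dimensions."""
    return sum(in_batch_negatives_loss(queries[:, :k], docs[:, :k]) for k in dims)

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 64))
queries = docs + 0.1 * rng.normal(size=(8, 64))  # each query sits near its own doc

aligned_loss = matryoshka_style_loss(queries, docs, dims=[64, 32, 16])
# Reversing the docs pairs every query with the wrong positive
misaligned_loss = matryoshka_style_loss(queries, docs[::-1], dims=[64, 32, 16])
print(aligned_loss, misaligned_loss)
```

The loss is small when each question is closest to its own context at every truncation length, and large when the pairing is wrong, which is the signal the finetuning run optimizes.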

In [36]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_corpus
queries = val_question_store
relevant_docs = {}

# Convert to format needed by retriever
for qid, doc_id in val_document_store.items():
  relevant_docs[qid] = [doc_id]

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

In [37]:
EPOCHS = 20

# https://huggingface.co/blog/how-to-train-sentence-transformers
model.fit(
    train_objectives=[(train_dl, train_loss)],
    epochs=EPOCHS,
    warmup_steps=0,
    output_path='ai_ethics_nomic',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100,Dot Accuracy@1,Dot Accuracy@3,Dot Accuracy@5,Dot Accuracy@10,Dot Precision@1,Dot Precision@3,Dot Precision@5,Dot Precision@10,Dot Recall@1,Dot Recall@3,Dot Recall@5,Dot Recall@10,Dot Ndcg@10,Dot Mrr@10,Dot Map@100
50,No log,No log,0.75,0.945793,0.970874,0.988673,0.75,0.315264,0.194175,0.098867,0.75,0.945793,0.970874,0.988673,0.884,0.848802,0.849302,0.75,0.945793,0.970874,0.988673,0.75,0.315264,0.194175,0.098867,0.75,0.945793,0.970874,0.988673,0.884,0.848802,0.849302
77,No log,No log,0.779935,0.964401,0.986246,0.995146,0.779935,0.321467,0.197249,0.099515,0.779935,0.964401,0.986246,0.995146,0.90438,0.873451,0.87368,0.779935,0.964401,0.986246,0.995146,0.779935,0.321467,0.197249,0.099515,0.779935,0.964401,0.986246,0.995146,0.90438,0.873451,0.87368
100,No log,No log,0.770227,0.961165,0.983819,0.995146,0.770227,0.320388,0.196764,0.099515,0.770227,0.961165,0.983819,0.995146,0.899835,0.867413,0.867654,0.770227,0.961165,0.983819,0.995146,0.770227,0.320388,0.196764,0.099515,0.770227,0.961165,0.983819,0.995146,0.899835,0.867413,0.867654
150,No log,No log,0.797735,0.972492,0.9911,0.999191,0.797735,0.324164,0.19822,0.099919,0.797735,0.972492,0.9911,0.999191,0.9147,0.885863,0.885897,0.797735,0.972492,0.9911,0.999191,0.797735,0.324164,0.19822,0.099919,0.797735,0.972492,0.9911,0.999191,0.9147,0.885863,0.885897
154,No log,No log,0.802589,0.970874,0.9911,0.999191,0.802589,0.323625,0.19822,0.099919,0.802589,0.970874,0.9911,0.999191,0.916693,0.888537,0.88857,0.802589,0.970874,0.9911,0.999191,0.802589,0.323625,0.19822,0.099919,0.802589,0.970874,0.9911,0.999191,0.916693,0.888537,0.88857
200,No log,No log,0.804207,0.978155,0.992718,1.0,0.804207,0.326052,0.198544,0.1,0.804207,0.978155,0.992718,1.0,0.919039,0.891272,0.891272,0.804207,0.978155,0.992718,1.0,0.804207,0.326052,0.198544,0.1,0.804207,0.978155,0.992718,1.0,0.919039,0.891272,0.891272
231,No log,No log,0.829288,0.98301,0.994337,0.999191,0.829288,0.32767,0.198867,0.099919,0.829288,0.98301,0.994337,0.999191,0.929667,0.905739,0.905801,0.829288,0.98301,0.994337,0.999191,0.829288,0.32767,0.198867,0.099919,0.829288,0.98301,0.994337,0.999191,0.929667,0.905739,0.905801
250,No log,No log,0.817152,0.981392,0.991909,0.999191,0.817152,0.327131,0.198382,0.099919,0.817152,0.981392,0.991909,0.999191,0.924799,0.899208,0.899275,0.817152,0.981392,0.991909,0.999191,0.817152,0.327131,0.198382,0.099919,0.817152,0.981392,0.991909,0.999191,0.924799,0.899208,0.899275
300,No log,No log,0.830097,0.986246,0.995955,1.0,0.830097,0.328749,0.199191,0.1,0.830097,0.986246,0.995955,1.0,0.931378,0.907681,0.907681,0.830097,0.986246,0.995955,1.0,0.830097,0.328749,0.199191,0.1,0.830097,0.986246,0.995955,1.0,0.931378,0.907681,0.907681
308,No log,No log,0.834142,0.984628,0.995146,1.0,0.834142,0.328209,0.199029,0.1,0.834142,0.984628,0.995146,1.0,0.932873,0.909711,0.909711,0.834142,0.984628,0.995146,1.0,0.834142,0.328209,0.199029,0.1,0.834142,0.984628,0.995146,1.0,0.932873,0.909711,0.909711
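
Because every validation question has exactly one relevant context, the columns in this table are tied together: recall@k equals accuracy@k, and precision@k is accuracy@k divided by k. A quick check against the step-50 row, with values copied from the table:

```python
# Step-50 row: cosine accuracy@k and precision@k as reported above
accuracy_at_k = {1: 0.75, 3: 0.945793, 5: 0.970874, 10: 0.988673}
precision_at_k = {1: 0.75, 3: 0.315264, 5: 0.194175, 10: 0.098867}

for k, accuracy in accuracy_at_k.items():
    # With one relevant document per query, a hit contributes 1/k to precision@k
    derived = accuracy / k
    print(f"k={k}: accuracy/k = {derived:.6f}, reported precision = {precision_at_k[k]:.6f}")
```

The low precision@10 values therefore reflect the single-relevant-document setup rather than poor retrieval.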



Store the finetuned model

In [39]:
!zip -r ai_ethics_nomic.zip ai_ethics_nomic

  adding: ai_ethics_nomic/ (stored 0%)
  adding: ai_ethics_nomic/model.safetensors (deflated 8%)
  adding: ai_ethics_nomic/sentence_bert_config.json (deflated 4%)
  adding: ai_ethics_nomic/2_Normalize/ (stored 0%)
  adding: ai_ethics_nomic/special_tokens_map.json (deflated 80%)
  adding: ai_ethics_nomic/README.md (deflated 69%)
  adding: ai_ethics_nomic/tokenizer.json (deflated 71%)
  adding: ai_ethics_nomic/tokenizer_config.json (deflated 76%)
  adding: ai_ethics_nomic/config_sentence_transformers.json (deflated 36%)
  adding: ai_ethics_nomic/modules.json (deflated 62%)
  adding: ai_ethics_nomic/vocab.txt (deflated 53%)
  adding: ai_ethics_nomic/config.json (deflated 60%)
  adding: ai_ethics_nomic/1_Pooling/ (stored 0%)
  adding: ai_ethics_nomic/1_Pooling/config.json (deflated 57%)


Also push the finetuned model to the Hugging Face Hub

In [40]:
from huggingface_hub import notebook_login
notebook_login()


In [42]:
model.push_to_hub("deman539/nomic-embed-text-v1")

model.safetensors:   0%|          | 0.00/547M [00:00<?, ?B/s]

'https://huggingface.co/deman539/nomic-embed-text-v1/commit/41a21be8294607f3b748546566951fda8d8a6962'

## Evaluating context-retrieval hitrate
Now that the model is finetuned, we evaluate its context-retrieval performance: for each synthetic question, we check whether the context used to create it appears among the top 10 retrieved documents (a hit).

In [88]:
from langchain_community.vectorstores import FAISS

def evaluate(data, embedding):
  questions = data['questions']
  corpus = data['corpus']
  contexts = data['contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embedding)
  retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

  num_hits = 0
  for q_id, question in questions.items():
    results = retriever.invoke(question)
    expected_id = contexts[q_id]
    is_hit = expected_id in [d.metadata['id'] for d in results]
    num_hits += int(is_hit)
  return num_hits/len(questions)
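
To make the metric concrete, here is a toy version of the same hit-rate calculation with the retriever replaced by hard-coded rankings. The ids and rankings are made up for illustration, and the helper `hit_rate` is hypothetical:

```python
def hit_rate(retrieved_ids_by_question, expected_id_by_question):
    """Fraction of questions whose expected document id appears in its retrieved list."""
    hits = sum(
        expected_id_by_question[q_id] in retrieved
        for q_id, retrieved in retrieved_ids_by_question.items()
    )
    return hits / len(retrieved_ids_by_question)

retrieved = {
    "q1": ["d1", "d7", "d3"],  # hit: d1 is present
    "q2": ["d4", "d2", "d9"],  # hit: d2 is present
    "q3": ["d5", "d6", "d8"],  # miss: d3 is absent
    "q4": ["d4", "d1", "d2"],  # hit: d4 is present
}
expected = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}
print(hit_rate(retrieved, expected))  # 0.75
```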

In [89]:
baseline_embeddings = HuggingFaceEmbeddings(model_name="nomic-ai/nomic-embed-text-v1", model_kwargs={'trust_remote_code': True})
baseline_hit_rate = evaluate(test_data, baseline_embeddings)



In [90]:
print(baseline_hit_rate)

0.9408381265406738


In [91]:
finetuned_embeddings = HuggingFaceEmbeddings(model_name="ai_ethics_nomic", model_kwargs={'trust_remote_code': True})
finetuned_hit_rate = evaluate(test_data, finetuned_embeddings)



In [93]:
print(finetuned_hit_rate)

0.9991783073130649


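The un-finetuned model already achieves a high hit rate of 94.1%, while the finetuned model reaches 99.9%, retrieving the appropriate context for almost every question. Note, however, that the second `train_test_split` call above draws the validation and test documents from the training documents, so the finetuned model saw these contexts during training and its score likely overstates performance on unseen text.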

## RAGAS Evaluation
Now we evaluate our finetuned model on the same RAGAS metrics that we previously used to evaluate the `text-embedding-3-small` and `text-embedding-3-large` models.

In [71]:
# imports
import tiktoken
import os
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from qdrant_client.http.models import Distance, VectorParams
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai.llms import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory
from langchain.chains.conversation.base import ConversationChain
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.messages import SystemMessage, AIMessage, HumanMessage
from langchain_core.prompts import (ChatMessagePromptTemplate, SystemMessagePromptTemplate,
                                    AIMessagePromptTemplate, HumanMessagePromptTemplate)
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts.chat import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser
from langchain.output_parsers import OutputFixingParser

# RAGAS imports
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context






In [56]:
# Load documents
bill_of_rights_docs = PyMuPDFLoader('Blueprint-for-an-AI-Bill-of-Rights.pdf').load()
nist_framework_docs = PyMuPDFLoader('NIST.AI.600-1.pdf').load()
all_documents = bill_of_rights_docs + nist_framework_docs

In [68]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")
def nomic_len_function(text):
  inputs = tokenizer(text)
  return len(inputs.input_ids)

In [69]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=nomic_len_function
)
split_documents = text_splitter.split_documents(all_documents)

In [74]:
# Index documents
LOCATION = ":memory:"
COLLECTION_NAME = "ai-ethics-nomic"
VECTOR_SIZE = 768

embedding = HuggingFaceEmbeddings(model_name="ai_ethics_nomic", model_kwargs={'trust_remote_code': True})

qdrant_client = QdrantClient(LOCATION)

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embedding
)

_ = qdrant_vector_store.add_documents(split_documents)



In [75]:
from operator import itemgetter
prompt = """
Please answer the question below using the provided context. If the question cannot be answered
using the context, politely state that you can't answer that question.

Question:
{question}

Context:
{context}
"""
prompt = ChatPromptTemplate.from_template(prompt)
retriever = qdrant_vector_store.as_retriever()
llm = ChatOpenAI(model='gpt-4o', temperature=0)
rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | llm, "context": itemgetter("context")}
)

In [76]:
# Read back golden question eval set generated previously
import pandas as pd
df = pd.read_csv('golden_eval_set.csv')

In [77]:
from tqdm.auto import tqdm
test_questions = df["question"].values.tolist()
test_groundtruths = df["ground_truth"].values.tolist()
answers = []
contexts = []

for question in tqdm(test_questions):
  response = rag_chain.invoke({"question": question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

  0%|          | 0/193 [00:00<?, ?it/s]

In [78]:
from datasets import Dataset
response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
response_dataset.save_to_disk('nomic-finetuned-responses-semantic')

Saving the dataset (0/1 shards):   0%|          | 0/193 [00:00<?, ? examples/s]

In [79]:
!zip -r nomic-finetuned-responses-semantic.zip nomic-finetuned-responses-semantic

  adding: nomic-finetuned-responses-semantic/ (stored 0%)
  adding: nomic-finetuned-responses-semantic/data-00000-of-00001.arrow (deflated 73%)
  adding: nomic-finetuned-responses-semantic/state.json (deflated 38%)
  adding: nomic-finetuned-responses-semantic/dataset_info.json (deflated 65%)


In [80]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/965 [00:00<?, ?it/s]



In [81]:
import pickle
with open('nomic-eval-results', 'wb') as f:
  pickle.dump(results, f)

In [82]:
results

{'faithfulness': 0.8835, 'answer_relevancy': 0.8371, 'context_recall': 0.8972, 'context_precision': 0.8873, 'answer_correctness': 0.6834}

In [84]:
import pickle
with open('nomic-eval-results', 'rb') as f:
  results_nomic = pickle.load(f)

In [83]:
# Load previously cached results from the notebook `Test Data and RAGAS Evaluation`
import pickle
with open('te3-eval-results', 'rb') as f:
    results_te3_small = pickle.load(f)

with open('te3-large-semantic-eval-results', 'rb') as f:
    results_te3_large_semantic = pickle.load(f)

In [85]:
df_te3_small = pd.DataFrame(list(results_te3_small.items()), columns=['Metric', 'TE3-Small'])
df_te3_large_semantic = pd.DataFrame(list(results_te3_large_semantic.items()), columns=['Metric', 'TE3-Large-Semantic'])
df_nomic = pd.DataFrame(list(results_nomic.items()), columns=['Metric', 'Nomic'])
df_merged = pd.merge(df_te3_small, df_te3_large_semantic, on='Metric')
df_merged = pd.merge(df_merged, df_nomic, on='Metric')
df_merged['TE3-Small -> TE3-Large-Semantic'] = df_merged['TE3-Large-Semantic'] - df_merged['TE3-Small']
df_merged['TE3-Small -> Nomic'] = df_merged['Nomic'] - df_merged['TE3-Small']
df_merged

| | Metric | TE3-Small | TE3-Large-Semantic | Nomic | TE3-Small -> TE3-Large-Semantic | TE3-Small -> Nomic |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.91295 | 0.902031 | 0.883496 | -0.01092 | -0.029455 |
| 1 | answer_relevancy | 0.813614 | 0.815119 | 0.83709 | 0.001505 | 0.023476 |
| 2 | context_recall | 0.903331 | 0.858105 | 0.8972 | -0.045226 | -0.006131 |
| 3 | context_precision | 0.886874 | 0.901267 | 0.887306 | 0.014393 | 0.000432 |
| 4 | answer_correctness | 0.625057 | 0.618008 | 0.683363 | -0.007049 | 0.058306 |


Compared with `text-embedding-3-small`, `faithfulness` and `context_recall` went down slightly with the finetuned model (by about 0.03 and 0.006 respectively), while `answer_correctness` rose by about 0.058 (5.8 percentage points). However, note that this synthetic dataset (and the corresponding ground truth) was itself generated using off-the-shelf models, so these performance numbers may not be very reliable. Even so, the finetuned model achieving the highest answer correctness of the three is notable.
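
The last two columns of the table are absolute differences, so the gain in `answer_correctness` is measured in percentage points rather than as a relative percentage. The arithmetic, using the values from the table:

```python
# answer_correctness scores copied from the table above
te3_small = 0.625057
nomic = 0.683363

absolute_change = nomic - te3_small            # difference in score
relative_change = absolute_change / te3_small  # change as a fraction of the baseline

print(f"absolute: {absolute_change:.6f}")
print(f"relative: {relative_change:.1%}")
```

The absolute gain is about 0.058, which corresponds to a relative improvement of roughly 9% over the `text-embedding-3-small` score.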