# Local Evaluation of Embedding Model Fine-tuning

## Install dependencies

This notebook uses [LangChain](https://github.com/hwchase17/langchain) to wrap the embedding models and the vector store, and Hugging Face `datasets` to fetch the evaluation data.

In this example, you will use [Facebook AI Similarity Search (Faiss)](https://faiss.ai/) as the vector database to store your embeddings. There are CPU or GPU options available, depending on your platform.

#### Ignore any errors from installing dependencies

In [None]:
%pip install -Uq langchain==0.3.21
%pip install -Uq langchain-community # provides the FAISS vector store wrapper
%pip install -Uq "pydantic>=2.7.4,<3" # langchain 0.3.x requires pydantic v2
%pip install -Uq sqlalchemy==2.0.21
%pip install -Uq faiss-cpu==1.7.4 # For CPU installation
#%pip install faiss-gpu # For GPUs with CUDA 7.5+ support
%pip install -Uq pypdf==3.14.0
%pip install -Uq datasets
%pip install -Uq langchain_huggingface
%pip install -Uq langchain-progress

In [None]:
from IPython.display import display_html

display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

## Fetching Data

In this evaluation, you'll pull samples from the [PubMedQA dataset](https://huggingface.co/datasets/qiaojin/PubMedQA). It contains prebuilt question/context/answer sets on complex medical topics, which makes it a good test of whether fine-tuning the embeddings helps with retrieval.

In [None]:
from datasets import load_dataset

In [None]:
source_dataset = load_dataset("qiaojin/PubMedQA", "pqa_artificial", split="train")

In [None]:
source_dataset[1]

Here you process the dataset elements into document objects so they can be loaded into the FAISS datastore. Since these context objects are already small, chunking them further is not necessary.

Use the `max_items` parameter to scope down the evaluation set, or set it to `-1` to run the entire set. 

>Note that the full dataset has about 211k question sets and 655k contexts in total, so processing all of it takes a long time.
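
The `-1` convention can be sketched as a small helper. This is only an illustration: `resolve_limit` is a hypothetical name that does not appear in the notebook, and the sizes are made up. It also clamps a request that is larger than the dataset, which the cell below does not do.

```python
def resolve_limit(max_items: int, dataset_len: int) -> int:
    """Return how many dataset rows to process.

    A negative value (such as -1) means "use everything". A non-negative
    value is clamped so it never exceeds the dataset length.
    """
    if max_items < 0:
        return dataset_len
    return min(max_items, dataset_len)


# Hypothetical sizes, for illustration only.
print(resolve_limit(-1, 1000))    # whole dataset
print(resolve_limit(250, 1000))   # scoped-down subset
print(resolve_limit(5000, 1000))  # larger than the dataset, so clamped
```

Clamping matters because selecting a range longer than the dataset would fail instead of quietly using all rows.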

In [None]:
from langchain_core.documents import Document


max_items = -1

documents = []

if max_items > -1:
    print(f"max_items set, reducing input to {max_items} items.")
else:
    max_items = len(source_dataset)

for idx, item in enumerate(source_dataset.select(range(max_items))):
    #print(item["pubid"])
    print(f"{idx} of {max_items}", end="\r")

    for context in item["context"]["contexts"]:
        document = Document(
            page_content= context,
            metadata={
                "pubid":item["pubid"],
                "question": item["question"],
                "meshes":item["context"]["meshes"]
            }
        )

        documents.append(document)

Check the dataset size and a sample document. Note that the document object has metadata attached for `meshes` that could be useful in a metadata filtering search.

Since medical terminology is complex and determining whether the right contexts were retrieved could be difficult, the `pubid` of the source set is included in the metadata as well. You will use this to determine context correctness in evaluation.
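
To make the role of the metadata concrete, here is a minimal sketch using plain dictionaries instead of LangChain `Document` objects. The two rows are made up and only mimic the shape of PubMedQA records, and `has_mesh` is a hypothetical helper, not part of the notebook.

```python
# Two hypothetical records shaped like PubMedQA rows (values are made up).
rows = [
    {"pubid": 1, "question": "Q1?",
     "context": {"contexts": ["c1a", "c1b"], "meshes": ["Humans", "Liver"]}},
    {"pubid": 2, "question": "Q2?",
     "context": {"contexts": ["c2a"], "meshes": ["Mice"]}},
]

# One entry per context passage, each carrying its parent row's metadata.
flat = [
    {
        "page_content": ctx,
        "metadata": {
            "pubid": row["pubid"],
            "question": row["question"],
            "meshes": row["context"]["meshes"],
        },
    }
    for row in rows
    for ctx in row["context"]["contexts"]
]


def has_mesh(term):
    """Build a predicate over a metadata dict that checks for a MeSH term."""
    def predicate(metadata):
        return term in metadata.get("meshes", [])
    return predicate


# Keep only passages whose parent row is tagged with "Liver".
liver_only = [doc for doc in flat if has_mesh("Liver")(doc["metadata"])]
print(len(flat), len(liver_only))
```

Because every passage inherits its parent's `pubid`, several documents can share one `pubid`, which is why a single query can have more than one correct context. A predicate like this could be a starting point for a metadata filter at search time, if your installed vector store version accepts one.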

In [None]:
print(len(documents))
print(documents[0])

This example hosts the embedding models locally, but they could also be hosted on SageMaker AI endpoints.

The next cell checks whether a GPU is available so the embedding models can use GPU acceleration, and falls back to the CPU otherwise.

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU instead")

Load the base model for comparison. The fine-tuned model was trained from `Alibaba-NLP/gte-base-en-v1.5`, so that is the baseline here. You could also test larger models to see how the tuned model compares on efficiency and quality.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

base_model_name = "Alibaba-NLP/gte-base-en-v1.5"

base_filesafe_model_name = base_model_name.replace("/","_")

base_model_kwargs = {"device": device, "trust_remote_code":True}
base_encode_kwargs = {"normalize_embeddings": True}
base_embeddings = HuggingFaceEmbeddings(
    model_name=base_model_name, model_kwargs=base_model_kwargs, encode_kwargs=base_encode_kwargs
)

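Load the fine-tuned model from a local folder. Replace the `<<YOUR_MODEL_NAME>>` and `<<PATH_TO_YOUR_TUNED_MODEL_ARTIFACTS>>` placeholders with your own values.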

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

tuned_model_name = "<<YOUR_MODEL_NAME>>"

tuned_filesafe_model_name = tuned_model_name.replace("/","_")

tuned_model_kwargs = {"device": device, "trust_remote_code":True}
tuned_encode_kwargs = {"normalize_embeddings": True}
tuned_embeddings = HuggingFaceEmbeddings(
    model_name="<<PATH_TO_YOUR_TUNED_MODEL_ARTIFACTS>>", model_kwargs=tuned_model_kwargs, encode_kwargs=tuned_encode_kwargs
)

## Build Vector Databases

With the models loaded, you can now build your vector databases. You'll take the set of documents you created above, embed them with each embedding model, then save them into separate vector stores for comparison. The cells that save each database are commented out so that you don't overwrite a sample database you may have been given. If you were given one, skip the `FAISS.from_documents()` cells and load the index from the filesystem instead.

If you save a new index, the directory name is built from the number of documents you are embedding and the file-safe model name, followed by `-faiss_index`.
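
As a quick illustration of that naming scheme, the sketch below builds the directory name for a made-up document count. `index_dir_name` is a hypothetical helper. The notebook builds the same string inline.

```python
def index_dir_name(num_documents: int, model_name: str) -> str:
    """Build the index directory name from the corpus size and model name."""
    # Model names such as "org/model" contain a slash, which would otherwise
    # be treated as a path separator.
    filesafe_model_name = model_name.replace("/", "_")
    return f"{num_documents}_{filesafe_model_name}-faiss_index"


# 1000 is a made-up document count used only for illustration.
print(index_dir_name(1000, "Alibaba-NLP/gte-base-en-v1.5"))
```

Including the document count in the name keeps indexes built from differently sized subsets from overwriting each other.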

In [None]:
# Vector store integrations live in the langchain-community package
from langchain_community.vectorstores import FAISS

In [None]:
base_db = FAISS.from_documents(documents, base_embeddings)

In [None]:
base_db_name = f"{len(documents)}_{base_filesafe_model_name}-faiss_index"
base_db_name

### UNCOMMENT THIS TO SAVE LOCALLY

In [None]:
#commented to avoid overwriting by accident
#base_db.save_local(base_db_name)

In [None]:
base_db = FAISS.load_local(base_db_name, base_embeddings, allow_dangerous_deserialization=True)

In [None]:
tuned_db_name = f"{len(documents)}_{tuned_filesafe_model_name}-faiss_index"
tuned_db_name

In [None]:
tuned_db = FAISS.from_documents(documents, tuned_embeddings)

### UNCOMMENT THIS TO SAVE LOCALLY

In [None]:
#commented to avoid overwriting by accident
#tuned_db.save_local(tuned_db_name)

In [None]:
tuned_db = FAISS.load_local(tuned_db_name, tuned_embeddings, allow_dangerous_deserialization=True)

## Perform Sample Searches

With your `base_db` and `tuned_db` populated, you can run some quick searches to confirm that documents are coming back and to take a look at result quality. The `similarity_search_with_score` API returns a set of elements (defaulting to `k=4`) and scores them based on the L2 (Euclidean) distance from the query embedding. Since you are measuring distance, smaller scores mean that the vectors are closer together, and therefore more semantically similar. Cosine similarity is another metric you can use.
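
Because both models are loaded with `normalize_embeddings` set to `True`, L2 distance and cosine similarity rank results the same way. For unit vectors, the squared L2 distance equals `2 - 2 * cosine_similarity`. The sketch below checks this identity in plain Python on two made-up vectors. It assumes the index reports squared L2 distances, as a Faiss flat L2 index does. If your index reports plain distances, take the square root first.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity_unit(a, b):
    """Cosine similarity of two unit vectors, which is just their dot product."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 3-dimensional "embeddings", for illustration only.
query_vec = normalize([1.0, 2.0, 3.0])
doc_vec = normalize([2.0, 1.0, 3.0])

distance = squared_l2(query_vec, doc_vec)
cosine = cosine_similarity_unit(query_vec, doc_vec)

# The two printed values agree up to floating-point error.
print(distance, 2 - 2 * cosine)
```

Under the same assumption, a score of `0` means identical direction and a score of `2` means orthogonal vectors.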

You'll first pull an element from the `source_dataset` (the original data from the HF Datasets repository). Note the `question` and the `pubid`.

In [None]:
query_test_item = source_dataset[1]
query_test_item

In [None]:
query = query_test_item["question"]

Now run your searches. In each result, look at the `metadata` field and compare the `pubid` element in the context to the one from the source data.

In the example for `source_dataset[1]`, you'll notice that the first result is aligned with `pubid = 25433161`, showing a correct context. Also note the L2 distance of `~0.230`. The other contexts in the results are incorrect, as shown by the `pubid` mismatch.

In [None]:
print(f"Pubid: {query_test_item['pubid']}\nQuery: {query}\n")

results_with_scores = base_db.similarity_search_with_score(query)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")

Running the same query against the tuned embeddings shows an improvement in retrieval. Notice that in both tests, the first context result is the same and is correct. However, in the tuned example, the score for that context is `0.088` versus `0.230`, indicating that the vectors are much closer together. Also note that this example retrieved two correct contexts instead of just one, with the second context getting a score of `0.49`. The base model with `k=4` missed that second context.

In [None]:
print(f"Pubid: {query_test_item['pubid']}\nQuery: {query}\n")

results_with_scores = tuned_db.similarity_search_with_score(query)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")

## Benchmark Search Performance

This section runs the same test against a larger set of queries so that you can get a sense of overall performance.

The `get_results` function defined below takes one of the vector db instances and a test document, searches with that document's `question`, and marks each result as correct when its `pubid` matches the test document's `pubid`. It returns the number of correct contexts and their average distance. When no correct context is retrieved, the average distance defaults to `1`. You can then average these values across the test set to see whether the tuned model is an improvement over the base.

In [None]:
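def get_results(db_instance, test, print_context=False, print_results=False):
    """Search db_instance with the test document's question and score the hits.

    A hit is correct when its pubid matches the test document's pubid.
    Returns the number of correct hits and their average distance; the
    average defaults to 1 when no correct context is retrieved.
    """
    query = test.metadata["question"]

    results_with_scores = db_instance.similarity_search_with_score(query)

    correct = 0
    sum_distance_correct = 0
    ground_truth_pubid = test.metadata["pubid"]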
    
    if print_context:
        print(f"""ground truth pubid: {ground_truth_pubid}""")
    
    for doc, score in results_with_scores:
        if ground_truth_pubid == doc.metadata["pubid"]:
            if print_context:
                print("CORRECT CONTEXT")
            sum_distance_correct += score
            correct += 1
        
        if print_context:
            print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")
    
    
    if correct > 0:
        avg_distance_correct = sum_distance_correct/correct
    else:
        avg_distance_correct = 1
        
    if print_results:
        print("========================================")
        print(f"Number of correct contexts: {correct}, avg distance={avg_distance_correct}")
        print("========================================")

    return {
        "correct": correct,
        "avg_distance_correct": avg_distance_correct
    }
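
The scoring logic can be tried without a real index. The sketch below is a condensed, hypothetical restatement of it: `FakeDoc`, `FakeDB`, and `score_query` are stand-ins invented for this illustration, and the distances are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Minimal stand-in for a document with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeDB:
    """Hypothetical vector store that returns canned (document, distance) pairs."""

    def __init__(self, hits):
        self._hits = hits

    def similarity_search_with_score(self, query, k=4):
        return self._hits[:k]


def score_query(db, test_doc, miss_distance=1.0):
    """Count hits whose pubid matches the test document and average their distance."""
    hits = db.similarity_search_with_score(test_doc.metadata["question"])
    distances = [
        score for doc, score in hits
        if doc.metadata["pubid"] == test_doc.metadata["pubid"]
    ]
    # Fall back to a fixed penalty distance when nothing correct was retrieved.
    average = sum(distances) / len(distances) if distances else miss_distance
    return {"correct": len(distances), "avg_distance_correct": average}


test_doc = FakeDoc("ctx", {"pubid": 1, "question": "Q?"})
db = FakeDB([
    (FakeDoc("a", {"pubid": 1}), 0.2),
    (FakeDoc("b", {"pubid": 2}), 0.5),
    (FakeDoc("c", {"pubid": 1}), 0.4),
    (FakeDoc("d", {"pubid": 3}), 0.6),
])

result = score_query(db, test_doc)
print(result["correct"], round(result["avg_distance_correct"], 3))
```

Note that the fallback distance of `1` acts as a penalty: every query with no correct context pulls the dataset-wide average distance toward `1`.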

Grab an element for testing.

In [None]:
test_item = documents[0]

query = test_item.metadata["question"]
correct_id = test_item.metadata["pubid"]

test_item

In [None]:
get_results(base_db,test_item)

Define the `sample_size` for the test.

In [None]:
sample_size = 20000

In [None]:
%%time

base_total_correct = 0
base_total_avg_distance_correct = 0

# Slice once so the averages divide by the number of items actually tested,
# even when there are fewer documents than sample_size.
test_items = documents[:sample_size]

for test_item in test_items:
    result = get_results(base_db, test_item)
    base_total_correct += result["correct"]
    base_total_avg_distance_correct += result["avg_distance_correct"]

base_scores = {
        "avg_correct": base_total_correct/len(test_items),
        "avg_distance_correct": base_total_avg_distance_correct/len(test_items)
    }

base_scores

In [None]:
%%time

tuned_total_correct = 0
tuned_total_avg_distance_correct = 0

# Slice once so the averages divide by the number of items actually tested,
# even when there are fewer documents than sample_size.
test_items = documents[:sample_size]

for test_item in test_items:
    result = get_results(tuned_db, test_item)
    tuned_total_correct += result["correct"]
    tuned_total_avg_distance_correct += result["avg_distance_correct"]

tuned_scores = {
        "avg_correct": tuned_total_correct/len(test_items),
        "avg_distance_correct": tuned_total_avg_distance_correct/len(test_items)
    }

tuned_scores

With a `sample_size` of `20000`, the tuned model returned 8.6% more correct contexts than the base model, and those correct contexts were 29% closer to the query. That is a substantial retrieval improvement from fine-tuning the embedding model on a small dataset (9k examples) with less than 20 minutes of training time.
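
Both percentages are relative changes against the base score. The sign is flipped for distance, where lower is better. Here is a minimal sketch of that arithmetic. `relative_improvement` is a hypothetical helper, and the input scores are made up rather than taken from this notebook's results.

```python
def relative_improvement(base: float, tuned: float, higher_is_better: bool) -> float:
    """Return the tuned model's improvement over the base, as a percentage of base."""
    delta = (tuned - base) if higher_is_better else (base - tuned)
    return delta / base * 100


# Hypothetical scores, NOT the notebook's measured results.
correct_gain = relative_improvement(base=2.0, tuned=2.2, higher_is_better=True)
distance_gain = relative_improvement(base=0.40, tuned=0.30, higher_is_better=False)

print(round(correct_gain, 1), round(distance_gain, 1))
```

A positive value means the tuned model did better on that dimension, regardless of whether the underlying metric is one you want to raise or lower.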

In [None]:
import pandas as pd
data = {'dimension':[], 'base': [], 'tuned': [], 'delta': [], 'delta_percent': []}

for key in base_scores.keys():
        
    if key == "avg_correct":
        delta = tuned_scores[key]-base_scores[key]
    else:
        delta = base_scores[key]-tuned_scores[key]
        
    delta_percent = (delta/base_scores[key])*100
    
    data['dimension'].append(key)
    data['base'].append(base_scores[key])
    data['tuned'].append(tuned_scores[key])
    data['delta'].append(delta)
    data['delta_percent'].append(delta_percent)
    
df = pd.DataFrame(data)

print(f"sample size: {sample_size}")

df