This notebook was inspired by [this LlamaIndex example](https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/).

I made some changes to it purely to try out ideas and learn.

Note that I am assuming you have the relevant API keys set as environment variables. The code below also reads `DATA_DIR` and `PERSIST_DIR` from the environment.
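A quick check before running anything can save a failed cell later. This is only a minimal sketch: `DATA_DIR` and `PERSIST_DIR` are the variables this notebook reads further down, while `OPENAI_API_KEY` is an assumption about which key the OpenAI client expects, and `missing_env_vars` is a hypothetical helper, not part of any library used here.

```python
import os

# Assumed names: DATA_DIR and PERSIST_DIR are read later in this notebook;
# OPENAI_API_KEY is assumed to be the key the OpenAI client looks for.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "DATA_DIR", "PERSIST_DIR")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing a plain dict as `environ` makes the helper easy to try without touching the real environment.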

In [20]:
%pip install llama-index-finetuning
%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface

In [2]:
from bubls.utils.data.download import download_file_from_url
from bubls.utils.data.loading import load_corpus
from bubls.utils.evaluation.evaluate_embeddings import (
    get_query_hit_pairs, sentence_transformer_ir_evaluator
)
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.finetuning import SentenceTransformersFinetuneEngine
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
import pandas as pd
import json
import os

## Defining Global Variables

In [2]:
METADATA = {
    "train": {
        "lyft_10k": {
            "source_url": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf",
            "file_name": "lyft_10k_2021.pdf",
            "save_data_to": os.path.join(os.environ["DATA_DIR"], "lyft_10k"),
        }
    },
    "val": {
        "uber_10k": {
            "source_url": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf",
            "file_name": "uber_10k_2021.pdf",
            "save_data_to": os.path.join(os.environ["DATA_DIR"], "uber_10k"),
        }
    }
}

PERSIST_FINETUNE_DATA_TO = os.path.join(os.environ["PERSIST_DIR"], "eg1_finetune_data")

## Ingest Data
- Download the source files
- Keep train and validation data in separate splits
- Load the corpus
- Generate QA embedding pairs

In [3]:
data = {}
for split in METADATA:
    files = []
    for k, md in METADATA[split].items():
        files.append(
            download_file_from_url(md["source_url"], md["file_name"], md["save_data_to"])
        )
    data_path = os.path.join(PERSIST_FINETUNE_DATA_TO, f"{split}_data.json")
    if not os.path.exists(data_path):
        print("Generating data with QA embedding pairs")
        # For every node we have id, embedding placeholder, metadata, text, relationships, etc.
        nodes = load_corpus(files, verbose=True)
        data[split] = generate_qa_embedding_pairs(
            llm=OpenAI(model="gpt-3.5-turbo"), nodes=nodes
        )
        data[split].save_json(data_path)
    else:
        print("Loading data with QA embedding pairs")
        data[split] = EmbeddingQAFinetuneDataset.from_json(data_path)
        


Loading data with QA embedding pairs
Loading data with QA embedding pairs
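It can help to sanity-check a dataset after generating or loading it. As far as I understand it, `EmbeddingQAFinetuneDataset` exposes three mappings: `queries` (query id to question), `corpus` (node id to text) and `relevant_docs` (query id to a list of node ids). The sketch below works on plain dictionaries with that shape; `summarize_qa_dataset` and the toy data are made up for illustration.

```python
def summarize_qa_dataset(queries, corpus, relevant_docs):
    """Count queries and chunks, and list queries that point at unknown chunk ids."""
    dangling = sorted(
        query_id
        for query_id, doc_ids in relevant_docs.items()
        if any(doc_id not in corpus for doc_id in doc_ids)
    )
    return {
        "n_queries": len(queries),
        "n_corpus_chunks": len(corpus),
        "dangling_query_ids": dangling,
    }


# Toy data, invented for this example only.
toy_corpus = {
    "node-1": "Revenue grew year over year.",
    "node-2": "Risk factors include regulation.",
}
toy_queries = {
    "q-1": "How did revenue change?",
    "q-2": "What risks are listed?",
    "q-3": "Who audits the company?",
}
toy_relevant_docs = {"q-1": ["node-1"], "q-2": ["node-2"], "q-3": ["node-9"]}

summary = summarize_qa_dataset(toy_queries, toy_corpus, toy_relevant_docs)
print(summary)
```

With the real objects, the same helper could be called as `summarize_qa_dataset(data["val"].queries, data["val"].corpus, data["val"].relevant_docs)`, assuming those attributes are present in your installed version.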


## Run Embedding Fine-tuning

In [4]:
finetune_engine = SentenceTransformersFinetuneEngine(
    data["train"],
    model_id="BAAI/bge-small-en",
    model_output_path="finetuned_model",
    val_dataset=data["val"],
)

In [5]:
# The two lines below run the training and return the fine-tuned model.
# I am running this notebook on a Raspberry Pi 5, which does not have enough
# memory for training, so they are left commented out.
# finetune_engine.finetune()
# embed_model = finetune_engine.get_finetuned_model()

## Evaluate and compare
I am going to compare the retrieval performance of `BAAI/bge-small-en` against the OpenAI embedding model.
I could not include the fine-tuned model because memory constraints prevented me from running the training. Nevertheless, all the ideas hold, and the process would be the same as what follows.
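The metric used in the next cells is a hit rate: the fraction of queries for which the expected chunk appears among the top retrieved results. The implementation of `get_query_hit_pairs` is not shown in this notebook, so the following is only a sketch of the idea under my own assumptions: cosine similarity for ranking, a cut-off `k`, and pre-computed toy vectors instead of a real embedding model. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(query_vecs, corpus_vecs, relevant_docs, k=5):
    """Fraction of queries whose relevant chunk is among the k most similar chunks."""
    hits = []
    for query_id, query_vec in query_vecs.items():
        ranked = sorted(
            corpus_vecs,
            key=lambda doc_id: cosine_similarity(query_vec, corpus_vecs[doc_id]),
            reverse=True,
        )
        top_k = ranked[:k]
        hits.append(any(doc_id in top_k for doc_id in relevant_docs[query_id]))
    return sum(hits) / len(hits) if hits else 0.0


# Toy 2-d "embeddings", invented for illustration.
corpus_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
relevant_docs = {"q1": ["a"], "q2": ["a"]}

print(hit_rate_at_k(query_vecs, corpus_vecs, relevant_docs, k=1))
print(hit_rate_at_k(query_vecs, corpus_vecs, relevant_docs, k=3))
```

In the toy data, `q2` is closest to chunk `b` while its expected chunk is `a`, so it only counts as a hit once `k` is large enough to include `a`.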

### OpenAI

In [8]:
openai_embedding = OpenAIEmbedding()
ada_val_results = get_query_hit_pairs(data["val"], openai_embedding)
df_ada = pd.DataFrame(ada_val_results)
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada


0.8829268292682927

### BAAI/bge-small-en

In [23]:
bge = "local:BAAI/bge-small-en"
bge_val_results = get_query_hit_pairs(data["val"], bge)
df_bge = pd.DataFrame(bge_val_results)
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge


0.802439024390244

In [25]:
# Use separate names so the hit-rate results computed above are not overwritten.
bge_ir_results = sentence_transformer_ir_evaluator(data["val"], bge, "bge")
df_bge_ir = pd.DataFrame(bge_ir_results)
df_bge_ir

### Fine-tuned
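The expectation is that fine-tuning the small open-source embedding model improves its retrieval quality and brings it closer to the OpenAI embedding. I could not run the training here, so the cell below is unexecuted and this remains to be verified.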

In [None]:
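# Requires the model saved to `finetuned_model` by finetune_engine.finetune().
finetuned = "local:finetuned_model"
finetuned_val_results = get_query_hit_pairs(data["val"], finetuned)
df_finetuned = pd.DataFrame(finetuned_val_results)
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned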
