# Part 3: Experiment and optimize the RAG application

- GitHub repository: https://github.com/anyscale/ray-summit-2023-training/tree/main
- Anyscale Endpoints: https://endpoints.anyscale.com/
- Ray documentation: https://docs.ray.io/
- LlamaIndex documentation: https://gpt-index.readthedocs.io/en/stable/

<span style="background: yellow; color: red; font-size: 1rem;"><b>TODO:</b></span> assume evaluation functions have all been defined in previous part.

## Strategy 1: Search for optimal configuration for standard components

## Strategy 2: Use different data representation for retrieval vs. generation

## Strategy 3: Fine-tune embeddings

In this section, we explore fine-tuning the embedding model to improve retrieval performance.
We consider the "cold start" regime, where we haven't deployed the model yet and haven't collected any user queries or labeled "golden" context.

Therefore, we take a synthetic data approach, where we use an LLM to generate (question, relevant context, answer) examples from our knowledge corpus.

### Experiment configs

In [56]:
import os
from pathlib import Path

# Since we are working with a relatively large set of documents, 
# we sub-sample the data for quicker iterations.
SUBSAMPLE_RATIO = 0.05

# can be any sentence-transformer compatible model
# https://www.sbert.net/docs/pretrained_models.html
BASE_MODEL = 'BAAI/bge-small-en'  

# Select a chunk size that is: 
# 1) under the context window limit of the chosen model, and
# 2) close to the desired chunk size at retrieval time
FINETUNE_CHUNK_SIZE = 512


ROOT_DIR = Path(os.getcwd()).parent
print(ROOT_DIR)

/home/ray/default/ray-summit-2023-training/Ray-LlamaIndex


### Load data

We start by loading our knowledge corpus (i.e. the Ray documentation webpages that we've downloaded and parsed previously).

In [16]:
from pathlib import Path

DATASETS_DIRECTORY = Path("/efs/shared_storage/simon/datasets")

In [26]:
import json
from llama_index.schema import Document

def read_json(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    return data

def to_doc(entry_dict):
    return Document(text=entry_dict['text'], metadata={'source': entry_dict['source']}) 

def load_corpus(filename):
    sections = read_json(filename)
    docs = [to_doc(dict_) for dict_ in sections]
    return docs

In [27]:
docs = load_corpus(DATASETS_DIRECTORY / 'eval_full_corpus.json')

Now, we split the documents into chunks. The main considerations for the chunk size here are:
1. Each chunk needs to fit into the context window of the embedding model that we want to fine-tune (512 tokens for the sentence transformer model that we selected, similar to most open source models).
2. The chunk size should be close to the one we want to use at retrieval time (so it's best to first run some experiments to determine the best chunk size for your application, see Strategy 1).
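The first consideration can be sanity-checked before fine-tuning. The snippet below is a minimal sketch, not part of the original notebook: `find_oversized_chunks` is a hypothetical helper, and its default token counter is a crude whitespace word count. Real tokenizers usually produce more tokens than words, so for an accurate check you would pass in a counter based on the tokenizer of the model you fine-tune.

```python
def find_oversized_chunks(chunks, max_tokens, count_tokens=None):
    """Return the indices of chunks whose token count exceeds max_tokens.

    `count_tokens` is any callable mapping a string to an integer count.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words as a rough proxy for tokens.
        # This under-counts compared to most subword tokenizers.
        def count_tokens(text):
            return len(text.split())
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


# Toy chunks for illustration only
chunks = ["short chunk", "word " * 600]
oversized = find_oversized_chunks(chunks, max_tokens=512)
print(oversized)  # → [1]
```

With the real data, you could pass the text of each parsed node as `chunks` and `FINETUNE_CHUNK_SIZE` as `max_tokens`.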

In [36]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(chunk_size=FINETUNE_CHUNK_SIZE)
nodes = parser.get_nodes_from_documents(docs, show_progress=True)
print('Parsed {} docs into {} nodes'.format(len(docs), len(nodes)))

Parsed 8944 docs into 14242 nodes


### Subsample and create train/validation split

In [37]:
import random

def train_test_split(data, split_ratio=0.8):
    """
    Split a list of items into training and testing sets.

    Args:
        data (list): The list of items to be split.
        split_ratio (float): The ratio of items to include in the training set (default is 0.8).

    Returns:
        tuple: A tuple containing two lists - the training set and the testing set.
    """
    if not 0 <= split_ratio <= 1:
        raise ValueError("Split ratio must be between 0 and 1")

    # Shuffle a copy so that the caller's list is not modified in place
    shuffled = list(data)
    random.shuffle(shuffled)

    # Calculate the split index
    split_index = int(len(shuffled) * split_ratio)

    # Split the data into training and testing sets
    train_set = shuffled[:split_index]
    test_set = shuffled[split_index:]

    return train_set, test_set

def subsample(data, ratio):
    """
    Subsample a list to a given ratio.

    Args:
        data (list): The list of items to be subsampled.
        ratio (float): The ratio of items to retain in the subsample.

    Returns:
        list: A subsampled list containing the specified ratio of items.
    """
    if not 0 <= ratio <= 1:
        raise ValueError("Ratio must be between 0 and 1")

    # Calculate the number of items to retain in the subsample
    num_items_to_retain = int(len(data) * ratio)

    # Randomly select items to retain
    subsampled_data = random.sample(data, num_items_to_retain)

    return subsampled_data

In [41]:
train_nodes, val_nodes = train_test_split(nodes)
print('Split dataset into: {} train nodes, {} val nodes'.format(len(train_nodes), len(val_nodes)))

Split dataset into: 11393 train nodes, 2849 val nodes


In [42]:
train_nodes = subsample(train_nodes, SUBSAMPLE_RATIO)
val_nodes = subsample(val_nodes, SUBSAMPLE_RATIO)
print('Subsampled dataset into: {} train nodes, {} val nodes'.format(len(train_nodes), len(val_nodes)))

Subsampled dataset into: 569 train nodes, 142 val nodes
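The split and subsample steps above rely on the global `random` state, so the selected nodes change between runs. If you want repeatable experiments, one option is to use a dedicated seeded generator. The following is a minimal sketch under that assumption: `seeded_split_and_subsample` and its `seed` parameter are hypothetical names, not part of the original notebook.

```python
import random


def seeded_split_and_subsample(items, split_ratio=0.8, subsample_ratio=0.05, seed=42):
    """Split items into train/val sets and subsample each, reproducibly."""
    rng = random.Random(seed)  # private generator, independent of global state

    # Shuffle a copy so the input list is left untouched
    shuffled = list(items)
    rng.shuffle(shuffled)

    split_index = int(len(shuffled) * split_ratio)
    train, val = shuffled[:split_index], shuffled[split_index:]

    # Subsample after splitting so train and val never overlap
    train = rng.sample(train, int(len(train) * subsample_ratio))
    val = rng.sample(val, int(len(val) * subsample_ratio))
    return train, val


# Integers stand in for the 14242 parsed nodes
items = list(range(14242))
train, val = seeded_split_and_subsample(items)
print(len(train), len(val))  # → 569 142
```

Calling the function twice with the same seed returns the same subsets, which makes runs easier to compare.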


### Generate synthetic dataset

Now, we will generate a synthetic dataset of questions, "golden" contexts, and "golden" answers by leveraging an LLM (by default, gpt-3.5-turbo from OpenAI).

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs

train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

Let's save the dataset for future use, since it's fairly time-consuming to generate.

In [None]:
train_dataset.save_json(Path(ROOT_DIR, "datasets/synthetic_train_dataset.json"))
val_dataset.save_json(Path(ROOT_DIR, "datasets/synthetic_val_dataset.json"))
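To make the saved files easier to inspect, here is a minimal sketch of how such a dataset can be read back with plain Python. It assumes the JSON contains three mappings named `queries`, `corpus`, and `relevant_docs`; treat these field names as an assumption and verify them against the file produced by your installed LlamaIndex version. The toy data and the `iter_pairs` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Toy stand-in for a saved dataset (field names are an assumption)
toy_dataset = {
    "queries": {"q1": "Placeholder question generated from chunk n1?"},
    "corpus": {"n1": "Placeholder text of the chunk the question was generated from."},
    "relevant_docs": {"q1": ["n1"]},
}


def iter_pairs(dataset):
    """Yield (question, context) pairs by joining queries to their relevant chunks."""
    for query_id, question in dataset["queries"].items():
        for node_id in dataset["relevant_docs"][query_id]:
            yield question, dataset["corpus"][node_id]


# Round-trip through a temporary JSON file, as you would with the saved dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_dataset.json"
    path.write_text(json.dumps(toy_dataset))
    loaded = json.loads(path.read_text())

pairs = list(iter_pairs(loaded))
print(len(pairs))  # → 1
```

Spot-checking a handful of generated pairs this way is a cheap guard against low-quality synthetic questions before you spend time on fine-tuning.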

### Run embedding finetuning

Now, we are ready to fine-tune our embedding model!

In [None]:
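from llama_index.finetuning import EmbeddingQAFinetuneDataset

# Reload the synthetic datasets saved in the previous step
train_dataset = EmbeddingQAFinetuneDataset.from_json(Path(ROOT_DIR, "datasets/synthetic_train_dataset.json"))
val_dataset = EmbeddingQAFinetuneDataset.from_json(Path(ROOT_DIR, "datasets/synthetic_val_dataset.json"))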

We can construct a fine-tune engine, which is an easy-to-use interface for running fine-tuning jobs (either locally or via an API-based service).

Here, we use the sentence transformer fine-tuning engine to run fine-tuning locally on the Ray cluster.

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id=BASE_MODEL,  # 'BAAI/bge-small-en', set in the experiment configs
    model_output_path="exp_finetune_test",
    val_dataset=val_dataset,
)

For demonstration purposes, we will run the fine-tuning job for 2 epochs over our synthetic dataset.
In practice, you should use the validation loss to determine how many epochs to fine-tune the embedding model for.
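One common way to turn a validation loss curve into an epoch count is patience-based early stopping. The snippet below is a minimal sketch of that idea only: `pick_best_epoch` is a hypothetical helper, the loss values are made up for illustration, and nothing here is wired into the fine-tune engine.

```python
def pick_best_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss) using patience-based early stopping.

    Stops scanning once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far. Epochs are 1-indexed.
    """
    best_epoch, best_loss = None, float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss


# Made-up validation losses, one per epoch, for illustration only
made_up_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.60]
print(pick_best_epoch(made_up_losses))  # → (3, 0.65)
```

Note the trade-off in the toy values: with `patience=2` the scan stops before the later improvement at epoch 6, while a larger patience would find it at the cost of more training time.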

In [None]:
finetune_engine.finetune()

After the fine-tuning job finishes, we can easily get a reference to the fine-tuned model to use in our LlamaIndex application.

In [None]:
embed_model = finetune_engine.get_finetuned_model()

### Evaluate our fine-tuned embedding model

Now, we will leverage the retrieval evaluation process we built out in Part 2 to assess the quality of our fine-tuned embedding model.

In [61]:
import re
import json
from pathlib import Path

# Load labeled eval dataset (one JSON object per line)
with open(DATASETS_DIRECTORY / "eval-dataset-v1.jsonl", "r") as f:
    test_dataset = [json.loads(line) for line in f]

# Update source
# TODO: update saved dataset to avoid this step
for row in test_dataset:
    row["source"] = row["source"].replace("https://docs.ray.io/en/latest/", "https://docs.ray.io/en/master/")

In [62]:
test_dataset[:5]

[{'question': 'I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?',
  'source': 'https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format'},
 {'question': 'How does autoscaling work in a Ray Serve application?',
  'source': 'https://docs.ray.io/en/master/serve/scaling-and-resource-allocation.html#autoscaling'},
 {'question': 'how do I get the address of a ray node',
  'source': 'https://docs.ray.io/en/master/ray-core/miscellaneous.html#node-information'},
 {'question': 'Does Ray support NCCL?',
  'source': 'https://docs.ray.io/en/master/ray-more-libs/ray-collective.html'},
 {'question': 'Is Ray integrated with DeepSpeed?',
  'source': 'https://docs.ray.io/en/master/ray-air/examples/gptj_deepspeed_fine_tuning.html#fine-tuning-the-model-with-ray-air-a-name-train-a'}]

In [64]:
from tqdm import tqdm

def evaluate_retrieval(
    labeled_dataset,
    index,
    top_k=5,
    verbose=False,
):
    # Retrieve the top_k most similar chunks for each query
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for entry in tqdm(labeled_dataset):
        query = entry['question']
        expected_source = entry['source']

        retrieved_nodes = retriever.retrieve(query)
        retrieved_sources = [node.node.metadata['source'] for node in retrieved_nodes]
        # A hit means the expected source appears among the retrieved chunks
        is_hit = expected_source in retrieved_sources  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_sources,
            'expected': expected_source,
            'query': query,
        }
        eval_results.append(eval_result)
    return eval_results

Now, we can build an index with our fine-tuned embedding model.

In [65]:
from llama_index import VectorStoreIndex, ServiceContext

In [None]:
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    chunk_size=FINETUNE_CHUNK_SIZE,  # 512, same chunk size as used for fine-tuning
)

index = VectorStoreIndex.from_documents(
    docs, 
    service_context=service_context, 
    show_progress=True
)

Finally, we run the retrieval evaluation.

In [None]:
results = evaluate_retrieval(test_dataset, index, top_k=5, verbose=True)

In [None]:
import pandas as pd

df = pd.DataFrame(results)
hit_rate = df['is_hit'].mean()
print(hit_rate)
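Because each evaluation result stores the ranked list of retrieved sources, the same results can be re-scored at smaller cutoffs without re-running retrieval. The snippet below is a minimal sketch of that: `hit_rate_at_k` is a hypothetical helper, and the toy results only mimic the `expected` and `retrieved` keys produced by `evaluate_retrieval`.

```python
def hit_rate_at_k(eval_results, k):
    """Fraction of queries whose expected source is in the first k retrieved sources."""
    if not eval_results:
        return 0.0
    hits = sum(1 for r in eval_results if r["expected"] in r["retrieved"][:k])
    return hits / len(eval_results)


# Toy results for illustration only (same keys as evaluate_retrieval's output)
toy_results = [
    {"expected": "a", "retrieved": ["a", "b", "c"]},  # hit at rank 1
    {"expected": "b", "retrieved": ["c", "a", "b"]},  # hit at rank 3
    {"expected": "d", "retrieved": ["a", "b", "c"]},  # miss
]

for k in (1, 3):
    print(k, round(hit_rate_at_k(toy_results, k), 3))
```

Keep `k` at or below the `top_k` used during retrieval, since sources ranked beyond `top_k` were never recorded. To judge whether fine-tuning helped, compare these numbers against the same evaluation run on an index built with the base embedding model.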