# Part 2: Evaluating our LLM application

So far, we've chosen typical/arbitrary values for the various parts of our RAG application. But if we change something, such as our chunking logic, embedding model, or LLM, how can we know that the new configuration is better than the old one? A generative task like this is difficult to assess quantitatively, so we need to develop creative ways to do so.

Because we have many moving parts in our application, we need to perform both unit/component and end-to-end evaluation. Component-wise evaluation can involve evaluating our retrieval in isolation (is the best source in our set of retrieved chunks?) and evaluating our LLM's response (given the best source, is the LLM able to produce a quality answer?). As for end-to-end evaluation, we can assess the quality of the entire system (given all the data, what is the quality of the response?).

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Component and end-to-end evaluation.

## Setup

In [1]:
import os

os.environ["ANYSCALE_API_BASE"] = "https://api.endpoints.anyscale.com/v1/chat/completions"
os.environ["ANYSCALE_API_KEY"] = "<YOUR_ANYSCALE_API_KEY>"  # placeholder: supply your own key and never commit real secrets

os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"  # placeholder: supply your own key and never commit real secrets

## Golden Context Dataset

In an ideal world, we would have a golden validation dataset: given a set of queries, we would have the correct sources that answer those queries, and optionally the correct answer that should be returned by the LLM.

For this example, we have manually collected 177 representative user queries and identified the source in the documentation that correctly answers each one.

In [2]:
from pathlib import Path
import json

golden_dataset_path = Path("../datasets/eval-dataset-v1.jsonl")

with open(golden_dataset_path, "r") as f:
    data = [json.loads(item) for item in list(f)]
    
len(data)

177

Our dataset contains 'question' and 'source' pairs. If we have a golden context dataset, it is the best option for evaluation.

In [3]:
data[:5]

[{'question': 'I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?',
  'source': 'https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format'},
 {'question': 'How does autoscaling work in a Ray Serve application?',
  'source': 'https://docs.ray.io/en/master/serve/scaling-and-resource-allocation.html#autoscaling'},
 {'question': 'how do I get the address of a ray node',
  'source': 'https://docs.ray.io/en/master/ray-core/miscellaneous.html#node-information'},
 {'question': 'Does Ray support NCCL?',
  'source': 'https://docs.ray.io/en/master/ray-more-libs/ray-collective.html'},
 {'question': 'Is Ray integrated with DeepSpeed?',
  'source': 'https://docs.ray.io/en/master/ray-air/examples/gptj_deepspeed_fine_tuning.html#fine-tuning-the-model-with-ray-air-a-name-train-a'}]

## Cold Start

We may not always have a prepared dataset of questions and their best sources readily available. To address this cold start problem, we could use an LLM to look at our documents and generate questions that a specific chunk would answer. This provides us with quality questions and the exact source the answer is in. However, this dataset generation method can be a bit noisy. The generated questions may not always resemble what your users would ask, and the information in the chunk we label as the best source may also appear in other chunks. Nonetheless, this is a great way to start our development process while we collect and manually label a high-quality dataset.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show the synthetic data generation process.

We need to define a few parameters first.  
- Notably, the chunk size determines the size of the text chunk shown to the LLM when generating hypothetical question & answer pairs. This must be set below the context window limitation of the chosen LLM.
- We choose a subsample ratio since we just want to construct a small representative subset for the purpose of evaluation and iteration. (We choose an even smaller subset for the purpose of the demonstration here).
- We use `gpt-3.5-turbo` since it's fast and cheap. 

In [8]:
from pathlib import Path

RAY_DOCS_DIRECTORY = Path("/efs/shared_storage/amog/docs.ray.io/en/master/")

First, we load in the documents and chunk them to the appropriate sizes, creating LlamaIndex nodes. We already did the data processing in part 1, and have packaged the logic as a utility.

In [30]:
from data import create_nodes

# needs to be smaller than context window
CHUNK_SIZE = 1024

nodes = create_nodes(RAY_DOCS_DIRECTORY, chunk_size=CHUNK_SIZE, chunk_overlap=20).take_all()

2023-09-18 13:59:24,133	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[FlatMap(extract_sections)] -> TaskPoolMapOperator[FlatMap(chunk_document)]
2023-09-18 13:59:24,134	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-18 13:59:24,134	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/200 [00:00<?, ?it/s]

In [31]:
nodes = [node_dict["node"] for node_dict in nodes]
id_to_node = {node.node_id: node for node in nodes}

Now, we subsample the nodes to obtain a representative subset (here we use a very small subset for a fast demonstration).

In [32]:
from utils import subsample

SUBSAMPLE_RATIO = 0.01

subsampled_nodes = subsample(nodes, SUBSAMPLE_RATIO)
print('Subsampled {} nodes into {} nodes'.format(len(nodes), len(subsampled_nodes)))

Subsampled 6836 nodes into 68 nodes
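
The `subsample` helper is imported from `utils`, and its body is not shown in this notebook. As a rough illustration only, here is a minimal sketch of what such a helper could look like, assuming uniform random sampling without replacement. The name `subsample_sketch`, the `seed` parameter, and the round-down rule are assumptions, not the actual implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_sketch(items: Sequence[T], ratio: float, seed: int = 42) -> List[T]:
    """Return a uniform random subset holding roughly `ratio` of `items`."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"ratio must be in (0, 1], got {ratio}")
    if not items:
        return []
    # Round down, but always keep at least one item.
    k = max(1, int(len(items) * ratio))
    # A private Random instance leaves the global RNG state untouched.
    return random.Random(seed).sample(list(items), k)


toy_nodes = list(range(6836))  # stand-in for the real node list
print(len(subsample_sketch(toy_nodes, 0.01)))
```

Under these assumptions, 6836 items and a ratio of 0.01 yield 68 items, which is consistent with the count printed above.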


Now, we use LlamaIndex's built-in utility `generate_qa_embedding_pairs` to create synthetic query/context pairs.

(We can also use this utility for fine-tuning embeddings, hence the naming. More on this in part 3!)

In [33]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.llms import OpenAI

llm = OpenAI(model='gpt-3.5-turbo')
synthetic_dataset = generate_qa_embedding_pairs(subsampled_nodes, llm=llm, num_questions_per_chunk=2)

100%|██████████| 68/68 [02:18<00:00,  2.04s/it]


Now we will transform the shape of the data a bit to match the format of our labeled data.

In [34]:
synthetic_data = []
for query_id, context_ids in synthetic_dataset.relevant_docs.items():
    query = synthetic_dataset.queries[query_id]
    golden_context = id_to_node[context_ids[0]].metadata['source']
    entry = {
        'question': query,
        'source': golden_context,
    }
    synthetic_data.append(entry)

In [35]:
synthetic_data[:5]

[{'question': 'How can Tune experiments be stopped using metrics, trial errors, and early stopping schedulers?',
  'source': 'https://docs.ray.io/en/master/tune/tutorials/tune-stopping.html#summary'},
 {'question': 'What steps can be taken to resume an experiment that was manually interrupted or experienced unexpected cluster failure while trials were still running?',
  'source': 'https://docs.ray.io/en/master/tune/tutorials/tune-stopping.html#summary'},
 {'question': 'What is the purpose of the `wait_for_gpu` function in the given context?',
  'source': 'https://docs.ray.io/en/master/tune/api/doc/ray.tune.utils.wait_for_gpu.html#ray-tune-utils-wait-for-gpu'},
 {'question': 'How can the `wait_for_gpu` function be used in the example code provided?',
  'source': 'https://docs.ray.io/en/master/tune/api/doc/ray.tune.utils.wait_for_gpu.html#ray-tune-utils-wait-for-gpu'},
 {'question': 'What is the purpose of the `sync_down` function in the `Syncer` class?',
  'source': 'https://docs.ray.io

In [36]:
from utils import write_jsonl

write_jsonl("../datasets/synthetic-eval-dataset.jsonl", synthetic_data)
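
`write_jsonl` also comes from `utils` and is not shown here. A minimal sketch of a JSON Lines writer and a matching reader, assuming one JSON object per line encoded as UTF-8; the names `write_jsonl_sketch` and `read_jsonl_sketch` are hypothetical and may differ from the real helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl_sketch(path, records: Iterable[Dict]) -> None:
    """Write one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps characters such as curly quotes readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_sketch(path) -> List[Dict]:
    """Read the file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


toy = [{"question": "Does Ray support NCCL?",
        "source": "https://docs.ray.io/en/master/ray-more-libs/ray-collective.html"}]
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy.jsonl"
    write_jsonl_sketch(out_path, toy)
    round_trip = read_jsonl_sketch(out_path)
print(round_trip == toy)
```

Keeping the synthetic file in the same one-object-per-line shape as the labeled dataset means the same loading code works for both.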

Since we already have a dataset with representative user queries and ground truth labels, we will use that for evaluation instead of a synthetically generated dataset.

## Evaluating Retrieval

The first component to evaluate in our RAG application is retrieval. Given a query, is our retriever pulling in the correct context to answer that query? Regardless of how good our LLM is, if it does not have the right context to answer the question, it cannot provide the right answer.

We can use our golden context dataset to evaluate retrieval. The simplest approach is that for each query in our dataset, we can test to see if the correct source is included in any of the chunks that are retrieved by our retriever. This measures "hit rate".

However, simply checking for existence can be misleading if we increase the number of chunks that we retrieve. Therefore, we also want to check the score that our retriever gives for the correct source. A higher score means our retriever is accurately determining the correct context. 

To summarize, for each query in our evaluation dataset, we will measure the following:
1. Is the correct source included in any of the retrieved chunks?
2. What is the score our retriever gives to the correct source?
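
To see why a bare existence check gets easier to satisfy as more chunks are retrieved, here is a small sketch on made-up data. The record layout (`expected`, `retrieved`, `scores`) and the function names are assumptions for illustration, not part of the evaluation code used in this notebook.

```python
from typing import Dict, List


def hit_rate_at_k(ranked_results: List[Dict], k: int) -> float:
    """Fraction of queries whose expected source is among the top-k retrieved sources."""
    hits = sum(r["expected"] in r["retrieved"][:k] for r in ranked_results)
    return hits / len(ranked_results)


def mean_expected_score(ranked_results: List[Dict]) -> float:
    """Average retriever score of the expected source, counting misses as 0.0."""
    total = 0.0
    for r in ranked_results:
        if r["expected"] in r["retrieved"]:
            total += r["scores"][r["retrieved"].index(r["expected"])]
    return total / len(ranked_results)


# Toy results: sources are ranked best-first, with the retriever's score for each.
toy_results = [
    {"expected": "a", "retrieved": ["a", "b", "c"], "scores": [0.9, 0.8, 0.7]},
    {"expected": "b", "retrieved": ["c", "a", "b"], "scores": [0.9, 0.8, 0.6]},
    {"expected": "z", "retrieved": ["a", "b", "c"], "scores": [0.9, 0.8, 0.7]},
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(toy_results, k))
print(mean_expected_score(toy_results))
```

On this toy data the hit rate rises from one third at k=1 to two thirds at k=3 without the retriever getting any better, which is why the score of the correct source is tracked alongside it.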

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show retrieval evaluation

First, let's get a retriever over the vector database. We have packaged this as a utility; it is the same retriever we built in notebook 1.

In [37]:
from utils import get_retriever

In [38]:
retriever = get_retriever(similarity_top_k=5)

LLM is explicitly disabled. Using MockLLM.


Now let's evaluate our retriever. 

In [39]:
results = []

for entry in data:
    query = entry["question"]
    expected_source = entry['source']
    
    retrieved_nodes = retriever.retrieve(query)
    retrieved_sources = [node.metadata['source'] for node in retrieved_nodes]
    
    # If our label does not include a section, then any sections on the page should be considered a hit.
    if "#" not in expected_source:
        retrieved_sources = [source.split("#")[0] for source in retrieved_sources]
    
    if expected_source in retrieved_sources:
        is_hit = True
        score = retrieved_nodes[retrieved_sources.index(expected_source)].score
    else:
        is_hit = False
        score = 0.0
    
    result = {
        "is_hit": is_hit,
        "score": score,
        "retrieved": retrieved_sources,
        "expected": expected_source,
        "query": query,
    }
    results.append(result)

In [40]:
results[:2]

[{'is_hit': True,
  'score': 0.9110969673181731,
  'retrieved': ['https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format',
   'https://docs.ray.io/en/master/data/working-with-tensors.html#transforming-tensor-data',
   'https://docs.ray.io/en/master/data/working-with-pytorch.html#transformations-with-torch-tensors',
   'https://docs.ray.io/en/master/data/examples/pytorch_resnet_batch_prediction.html#preprocessing',
   'https://docs.ray.io/en/master/data/transforming-data.html#transforming-batches-with-tasks'],
  'expected': 'https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format',
  'query': 'I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?'},
 {'is_hit': True,
  'score': 0.9274146984119639,
  'retrieved': ['https://docs.ray.io/en/master/serve/architecture.html#ray-serve-autoscaling',
   'https://docs.ray.io/en/master/cluster/key-concepts.html#autoscaling',
   'https://docs.ray.io/en/master/ser

Let's see how well our retriever does. It's not great right now, but we now have a solid metric to evaluate our retriever for future optimizations.

In [41]:
total_hits = sum(result["is_hit"] for result in results)
hit_percentage = total_hits / len(results)
hit_percentage

0.4406779661016949

In [42]:
average_score = sum(result["score"] for result in results) / len(results)
average_score

0.39457275827305033

## End-to-end evaluation

While we can evaluate our retriever in isolation, ultimately we want to evaluate our RAG application end-to-end, which includes the final response generated from our LLM.

To effectively evaluate our generated responses, we need "ground truth" responses. These ground truth responses can be generated by feeding the correct context to a "golden" LLM. Then, we can use an LLM to evaluate our generated responses compared to the ground truth responses.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show e2e evaluation

### Choosing a Golden LLM

To generate ground truth responses, and then to evaluate the generated responses against the ground truth, we need a "golden" LLM. But which LLM should we use? We now run into a problem: we need to determine the quality of different LLMs to choose one as a "golden" LLM, but doing so requires a "golden" LLM. Leaderboards on general benchmarks provide a rough indication of which LLMs perform better, but in this case we will go with the eye test.

Let's get responses from both GPT-4 and Llama2-70B and see for ourselves which one is better.

In [43]:
from bs4 import BeautifulSoup

def fetch_text_from_source(source: str):
    url, anchor = source.split("#") if "#" in source else (source, None)
    file_path = Path("/efs/shared_storage/amog/", url.split("https://")[-1])
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")
    if anchor:
        target_element = soup.find(id=anchor)
        if target_element:
            text = target_element.get_text()
        else:
            return fetch_text_from_source(source=url)
    else:
        text = soup.get_text()
    return text
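
The first two lines of `fetch_text_from_source` turn a documentation URL into a local file path plus an optional section anchor. The sketch below isolates that step using only the standard library. The helper names `split_source` and `local_path_for` are hypothetical, and `str.partition` is swapped in for `split` so that a source containing more than one `#` cannot raise an unpacking error.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def split_source(source: str) -> Tuple[str, Optional[str]]:
    """Split 'url#anchor' into (url, anchor); anchor is None when absent or empty."""
    url, sep, anchor = source.partition("#")
    return url, (anchor if sep and anchor else None)


def local_path_for(url: str, root: str = "/efs/shared_storage/amog/") -> PurePosixPath:
    """Map a docs URL onto the mirrored directory tree under `root`."""
    return PurePosixPath(root) / url.split("https://")[-1]


source = "https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format"
url, anchor = split_source(source)
print(anchor)
print(local_path_for(url))
```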

In [44]:
example_source = data[0]["source"]
print(example_source)

text = fetch_text_from_source(example_source)
text

https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format


'\nConfiguring batch format#\nRay Data represents batches as dicts of NumPy ndarrays or pandas DataFrames. By\ndefault, Ray Data represents batches as dicts of NumPy ndarrays.\nTo configure the batch type, specify batch_format in\nmap_batches(). You can return either format from your function.\n\n\n\nNumPy\nfrom typing import Dict\nimport numpy as np\nimport ray\n\ndef increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:\n    batch["image"] = np.clip(batch["image"] + 4, 0, 255)\n    return batch\n\nds = (\n    ray.data.read_images("s3://[email\xa0protected]/image-datasets/simple")\n    .map_batches(increase_brightness, batch_format="numpy")\n)\n\n\n\n\n\npandas\nimport pandas as pd\nimport ray\n\ndef drop_nas(batch: pd.DataFrame) -> pd.DataFrame:\n    return batch.dropna()\n\nds = (\n    ray.data.read_csv("s3://[email\xa0protected]/iris.csv")\n    .map_batches(drop_nas, batch_format="pandas")\n)\n\n\n\n\n'

In [51]:
from tqdm import tqdm
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.response_synthesizers import get_response_synthesizer
from llama_index.schema import TextNode, NodeWithScore

def generate_responses(entries, llm, context_window=None):
    service_context = ServiceContext.from_defaults(llm=llm, context_window=context_window)
    rs = get_response_synthesizer(service_context=service_context)

    responses = []
    for entry in tqdm(entries):
        query = entry["question"]
        source = entry["source"]

        context = fetch_text_from_source(source)
        nodes = [NodeWithScore(node=TextNode(text=context))]

        response = rs.synthesize(query, nodes=nodes)
        responses.append(response.response)
    return responses

Let's get responses from GPT-4.

In [53]:
llm = OpenAI(model='gpt-4', temperature=0.0)
gpt4_responses = generate_responses(data[:3], llm)

100%|██████████| 3/3 [00:45<00:00, 15.26s/it]


In [54]:
gpt4_responses

['Sure, when using the `map_batches()` function in Ray Data, you can specify the batch format by using the `batch_format` argument. If you want to work with NumPy ndarrays, you can set `batch_format="numpy"`. For example, if you have a function that increases the brightness of an image, you can use it like this:\n\n```python\nfrom typing import Dict\nimport numpy as np\nimport ray\n\ndef increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:\n    batch["image"] = np.clip(batch["image"] + 4, 0, 255)\n    return batch\n\nds = (\n    ray.data.read_images("s3://[email\xa0protected]/image-datasets/simple")\n    .map_batches(increase_brightness, batch_format="numpy")\n)\n```\n\nOn the other hand, if you prefer to work with pandas DataFrames, you can set `batch_format="pandas"`. For instance, if you have a function that drops NA values from a DataFrame, you can use it like this:\n\n```python\nimport pandas as pd\nimport ray\n\ndef drop_nas(batch: pd.DataFrame) -> pd.DataF

Now let's get responses from Llama2-70B.

In [55]:
from llama_index.llms import Anyscale
from llama_index import ServiceContext

llm = Anyscale(model='meta-llama/Llama-2-70b-chat-hf', temperature=0.0)
llama_responses = generate_responses(data[:3], llm, context_window=4096)

100%|██████████| 3/3 [00:34<00:00, 11.49s/it]


In [56]:
llama_responses

[' It sounds like you\'re having trouble with converting data types when using the `map_batches` function in Ray Data. Specifically, you\'re mentioning issues with NumPy arrays and pandas DataFrames.\n\nOne thing to keep in mind is that Ray Data represents batches as dictionaries of NumPy arrays or pandas DataFrames by default. This means that when you\'re working with batches, you\'ll need to specify the correct data type when passing them to functions that operate on batches.\n\nOne way to do this is by using the `batch_format` parameter in the `map_batches` function. This parameter allows you to specify whether the batches should be represented as NumPy arrays or pandas DataFrames.\n\nFor example, in the code snippet you provided, the `batch_format` parameter is set to `"numpy"` when reading images from an S3 bucket. This tells Ray Data to expect batches to be represented as NumPy arrays. Similarly, when reading a CSV file, the `batch_format` parameter is set to `"pandas"` to indica

Now let's compare the two.

In [57]:
BOLD = '\033[1m'
END = '\033[0m'
    
# Only three responses were generated per model, so compare the same three queries.
for query, gpt_response, llama_response in zip(data[:3], gpt4_responses, llama_responses):
    print(f"{BOLD}Query:{END} {query['question']}")
    print(f"{BOLD}GPT4 answer:{END} {gpt_response}")
    print(f"{BOLD}Llama2-70B answer:{END} {llama_response}")
    print("\n")

Query: I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?
GPT4 answer: Sure, when using the `map_batches()` function in Ray Data, you can specify the batch format by using the `batch_format` argument. If you want to work with NumPy ndarrays, you can set `batch_format="numpy"`. For example, if you have a function that increases the brightness of an image, you can use it like this:

```python
from typing import Dict
import numpy as np
import ray

def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    batch["image"] = np.clip(batch["image"] + 4, 0, 255)
    return batch

ds = (
    ray.data.read_images("s3://[email protected]/image-datasets/simple")
    .map_batches(increase_brightness, batch_format="numpy")
)
```

On the other hand, if you prefer to work with pandas DataFrames, you can set `batch_format="pandas"`. For instance, if you have a function that drops NA values from a DataFrame, you can use it like

Based on these answers, we go with GPT-4 as our "golden" LLM.

### Generating our Golden Responses

Now that we have chosen which LLM to use, we can generate our reference responses. Let's generate 10 reference responses and save them to a file.

In [60]:
llm = OpenAI(model='gpt-4', temperature=0.0)
ten_samples = data[:10]
golden_responses = generate_responses(ten_samples, llm)

100%|██████████| 10/10 [01:17<00:00,  7.78s/it]


In [61]:
reference_dataset = [{"question": entry["question"], "source": entry["source"], "response": response} for entry, response in zip(ten_samples, golden_responses)]

In [62]:
with open("golden-responses.json", "w") as file:
    json.dump(reference_dataset, file, indent=4)

## Evaluating our Query Engine

Once we have reference responses, we can get the generated responses from our query engine. We then pass both sets of responses to our golden LLM, which scores our application's responses against the references.

In [63]:
with open("golden-responses.json", "r") as file:
    golden_responses = json.load(file)

In [64]:
golden_responses[0]

{'question': 'I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?',
 'source': 'https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format',
 'response': 'Sure, when using the `map_batches()` function in Ray Data, you can specify the batch format by using the `batch_format` parameter. If you want to represent batches as dictionaries of NumPy ndarrays, you can set `batch_format="numpy"`. For example, if you have a function like `increase_brightness` that operates on NumPy ndarrays, you can use it with `map_batches()` like this:\n\n```python\nds = (\n    ray.data.read_images("s3://[email\xa0protected]/image-datasets/simple")\n    .map_batches(increase_brightness, batch_format="numpy")\n)\n```\n\nOn the other hand, if you want to represent batches as pandas DataFrames, you can set `batch_format="pandas"`. For instance, if you have a function like `drop_nas` that operates on pandas DataFrames, you can use it with `map_batches()` li

In [65]:
from utils import get_query_engine

In [67]:
query_engine = get_query_engine(similarity_top_k=5, llm_model_name='meta-llama/Llama-2-70b-chat-hf')

# Store both the original response object and the response string.
rag_responses = []
rag_response_str = []

for entry in tqdm(golden_responses):
    query = entry["question"]
    response = query_engine.query(query)
    rag_responses.append(response)
    rag_response_str.append(response.response)

100%|██████████| 10/10 [01:27<00:00,  8.76s/it]


In [68]:
rag_response_str[0]

' It seems like you\'re encountering issues with type conversions when using `map_batches` with Ray Data. Here are some tips that may help:\n\n1. Use the `batch_format` argument: When calling `map_batches`, you can specify the `batch_format` argument to indicate the format of the batches. If you\'re working with NumPy arrays, set `batch_format="numpy"`. If you\'re working with PyTorch tensors, set `batch_format="torch"`. This can help Ray Data handle the type conversions correctly.\n2. Return the correct type: When writing a transformation function for `map_batches`, make sure to return a batch in the correct format. If you\'re working with NumPy arrays, return a dictionary of NumPy arrays. If you\'re working with PyTorch tensors, return a dictionary of PyTorch tensors.\n3. Avoid returning lists: When using `map_batches`, avoid returning lists of arrays or tensors. Instead, return a dictionary of arrays or tensors, where each key corresponds to a batch dimension. This can help Ray Data

In [69]:
from llama_index.evaluation import CorrectnessEvaluator

In [70]:
eval_llm = OpenAI(model='gpt-4', temperature=0.0)
service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator = CorrectnessEvaluator(service_context=service_context)

In [71]:
import nest_asyncio

# Jupyter already runs an event loop, so the evaluator's internal asyncio.run()
# call raises a RuntimeError unless nested event loops are enabled.
nest_asyncio.apply()

eval_results = []
# Materialize the zip so tqdm can report a total.
for rag_response, golden_response in tqdm(list(zip(rag_response_str, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    generated_answer = rag_response

    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=generated_answer)
    eval_results.append(eval_result)

In [None]:
[r.score for r in eval_results]

Let's save the query, both responses, and the score to a JSON file.

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]

In [None]:
with open("eval-scores.json", "w") as file:
    json.dump(scores, file, indent=4)

We can also calculate the average score.

In [None]:
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

## Evaluation without Golden Responses

Generating reference responses and then using them for evaluation can give us a more accurate assessment of how our query engine is performing. However, this approach can be expensive: we have to make an initial pass through GPT-4 to generate each reference response, and then another pass through GPT-4 to evaluate our application's response against it.

We can explore other evaluation metrics to get a better sense of how our query engine is performing, without needing to make multiple passes through GPT-4.

### Evaluating for faithfulness/relevancy

One metric we can test is relevancy, which does not require generating reference responses. With this approach, we check whether the generated response is relevant to the query and to at least one of the retrieved sources. This helps ensure that our LLM is not making up a response: the answer should address the question being asked and be grounded in at least one of the retrieved contexts.

This does NOT check whether the response is a correct response.

This capability is built into LlamaIndex, via the various `Evaluator` modules. We use gpt-4 as the evaluator.

In [72]:
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms import OpenAI
from llama_index import ServiceContext

def evaluate(queries: list, responses: list, metric: str):
    llm = OpenAI(model="gpt-4", temperature=0.0)
    service_context = ServiceContext.from_defaults(llm=llm)
    
    
    if metric == 'faithfulness':
        evaluator = FaithfulnessEvaluator(service_context=service_context)
    elif metric == 'relevancy':
        evaluator = RelevancyEvaluator(service_context=service_context)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    evals = []
    for query, response in tqdm(list(zip(queries, responses))):
        eval_result = evaluator.evaluate_response(query=query, response=response)
        evals.append(eval_result)
    
    return evals

def get_pass_rate(evals):
    # Count only the results that passed; counting every result would always give 1.0.
    return sum(1 for val in evals if val.passing) / len(evals)
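
The pass rate is meant to be the fraction of evaluation results whose `passing` flag is true. A minimal sketch on toy data, assuming each result exposes a boolean `passing` attribute; `ToyEvalResult` and `pass_rate_sketch` are hypothetical stand-ins, not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyEvalResult:
    """Stand-in for an evaluator result; only the `passing` flag matters here."""
    passing: bool


def pass_rate_sketch(evals: List[ToyEvalResult]) -> float:
    """Share of results that passed; 0.0 for an empty list."""
    if not evals:
        return 0.0
    # Count the True flags. Taking the length of the flag list instead
    # would count every result and always return 1.0.
    return sum(1 for e in evals if e.passing) / len(evals)


toy_evals = [ToyEvalResult(True), ToyEvalResult(False), ToyEvalResult(True), ToyEvalResult(True)]
print(pass_rate_sketch(toy_evals))  # 0.75
```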

In [73]:
import nest_asyncio

# The evaluators call asyncio.run() internally, which fails inside Jupyter's
# running event loop unless nested loops are enabled (applying this twice is harmless).
nest_asyncio.apply()

faithfulness_results = evaluate(queries=[sample["question"] for sample in ten_samples], responses=rag_responses, metric='faithfulness')



In [None]:
faithfulness_score = get_pass_rate(faithfulness_results)
faithfulness_score

In [None]:
relevancy_results = evaluate(queries=[sample["question"] for sample in ten_samples], responses=rag_responses, metric='relevancy')

In [None]:
relevancy_score = get_pass_rate(relevancy_results)
relevancy_score