# Using Flow Judge with Llama Index evaluators

## Introduction to Flow Judge and Llama Index Integration

Flow Judge is an open-source language model optimized for evaluating AI systems. This tutorial demonstrates how to integrate Flow Judge with Llama Index evaluators. By the end of this notebook, you'll understand how to create custom metrics, run evaluations, and analyze results using both Flow Judge and Llama Index tools.

This notebook is inspired by the [prometheus_evaluation.ipynb](https://github.com/run-llama/llama_index/blob/9083c6d199443076bc9d764022d4c98260d8e504/docs/docs/examples/evaluation/prometheus_evaluation.ipynb) example.


## `Flow-Judge-v0.1`

`Flow-Judge-v0.1` is an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. It is designed for accuracy, speed, and customization.

Read the technical report [here](https://www.flow-ai.com/blog/flow-judge).

## Llama Index evaluators

Llama Index is a framework for building LLM applications. It offers modules for measuring both the quality of generated results and the quality of retrieval.

Refer to the [Llama Index documentation](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/) for more information.

Llama Index offers LLM-based evaluation modules to measure the quality of results. These modules use a "gold" LLM (e.g. GPT-4) to judge, in a variety of ways, whether the predicted answer is correct.

In this notebook, we will make use of `Flow-Judge-v0.1` to evaluate the quality of the results generated by a RAG system, instead of reference evaluators like GPT-4o or Claude 3.5 Sonnet.

### Additional requirements

- llama-index: Make sure you have Llama Index installed. You can install it via pip:
  ```bash
  pip install llama-index
  ```

In [1]:
try:
    from llama_index.core.evaluation import BaseEvaluator
except ImportError:
    print("Llama Index is not installed. ")
    print("Please install it according to the 'Additional Requirements' section above.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit("Stopping execution due to missing Llama Index dependency.")

In [2]:
import nest_asyncio
nest_asyncio.apply()
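
`nest_asyncio.apply()` patches asyncio so that a synchronous call can drive a coroutine even though the notebook already runs an event loop. The evaluators used later in this notebook expose a synchronous `evaluate` on top of asynchronous model calls, which is the pattern sketched below. This is a toy illustration only: `ToyAsyncJudge` and `ToyEvaluator` are hypothetical stand-ins, not Flow Judge or Llama Index classes.

```python
import asyncio


class ToyAsyncJudge:
    """Hypothetical stand-in for a model that only exposes a coroutine."""

    async def score(self, response: str, reference: str) -> float:
        await asyncio.sleep(0)  # yield control, as a real inference call would
        return 5.0 if response.strip() == reference.strip() else 1.0


class ToyEvaluator:
    """Evaluator with an async core and a thin synchronous wrapper."""

    def __init__(self, judge: ToyAsyncJudge) -> None:
        self.judge = judge

    async def aevaluate(self, response: str, reference: str) -> float:
        return await self.judge.score(response, reference)

    def evaluate(self, response: str, reference: str) -> float:
        # The sync entry point simply runs the coroutine to completion.
        return asyncio.run(self.aevaluate(response, reference))


evaluator = ToyEvaluator(ToyAsyncJudge())
exact_match_score = evaluator.evaluate("Paris", "Paris")
```

Inside a notebook, the `asyncio.run` call in `evaluate` would normally fail because a loop is already running; applying `nest_asyncio` first is what allows it.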

## OpenAI API key

You need to provide an OpenAI API key to run the Llama Index evaluators with gpt-4o and to generate the responses.

We limit the number of requests to keep costs low.

In [21]:
import os

os.environ["OPENAI_API_KEY"] = "your_api_key"
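
Hard-coding the key in a notebook makes it easy to leak. A minimal sketch of an alternative, assuming the key is either already exported in the environment or can be typed interactively; `resolve_openai_key` is a hypothetical helper, not part of any library:

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_openai_key(
    env: Mapping[str, str] = os.environ,
    prompt: Optional[Callable[[str], str]] = getpass,
) -> str:
    """Return the OpenAI key from `env`, falling back to an interactive prompt."""
    key = env.get("OPENAI_API_KEY", "").strip()
    # Treat the placeholder value from the cell above as "not set".
    if key and key != "your_api_key":
        return key
    if prompt is None:
        raise RuntimeError("OPENAI_API_KEY is not set.")
    return prompt("OpenAI API key: ").strip()
```

It could then be used as `os.environ["OPENAI_API_KEY"] = resolve_openai_key()`.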

## Model

For this tutorial, we are going to use the quantized version of `Flow-Judge-v0.1`. Under the hood, `flow-judge` uses the vLLM engine to run the model.


In [3]:
from flow_judge.models import Vllm #, Llamafile, Hf

# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm(exec_async=True)

# If you are not running on an Ampere GPU or newer, use Hugging Face Transformers without flash attention
# model = Hf(flash_attn=False)

# If you have no Nvidia GPU (for example, on an Apple Silicon Mac), use Llamafile
# model = Llamafile()

INFO 10-08 08:56:23 awq_marlin.py:90] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-08 08:56:23 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 10-08 08:56:25 model_runner.py:1025] Loading model weights took 2.1717 GB
INFO 10-08 08:56:26 gpu_executor.py:122] # GPU blocks: 3083, # CPU blocks: 682


> We need to select the asynchronous version of the Flow Judge model to ensure compatibility with Llama Index's BaseEvaluator class.

## Correctness evaluation

The Llama Index `CorrectnessEvaluator` evaluates the correctness of a question answering system.

In addition to the query string and the response string, the evaluator requires a `reference` answer.

It grades the response against the reference answer, outputting a score between 1 and 5, where 1 is the worst and 5 is the best, along with the reasoning behind the score.

Let's see how we can create this same evaluator using `flow-judge`'s Llama Index integration.

### Data

For this demonstration, let's create a single instance to be evaluated.

In [4]:
query = """Analyze the impact of the Industrial Revolution on urbanization in 19th century England, focusing on demographic shifts, living conditions, and social reforms. Include specific examples and statistics to support your analysis."""
reference = """The Industrial Revolution in England during the 19th century had a profound impact on urbanization, resulting in significant demographic shifts, changes in living conditions, and the emergence of social reforms.

1. Demographic Shifts:
   - Rural-to-urban migration: Between 1811 and 1851, the urban population in England and Wales increased from 35% to 54%.
   - Population growth: London's population grew from 1 million in 1800 to 6.7 million by 1900.
   - Example: Manchester's population increased tenfold, from 75,000 in 1801 to 750,000 by 1901.

2. Living Conditions:
   - Overcrowding: In 1851, the average number of people per house in Liverpool was 7.6, compared to 5.5 in rural areas.
   - Poor sanitation: In 1842, only 5% of working-class homes in Manchester had drainage systems.
   - High mortality rates: Life expectancy in industrial cities was 29 years, compared to 45 years in rural areas.
   - Example: The cholera epidemic of 1832 killed over 20,000 people in England, with a disproportionate impact on urban areas.

3. Social Reforms:
   - Public Health Act of 1848: Established local boards of health to improve sanitation and water supply.
   - Factory Acts (1833, 1844, 1847): Regulated working conditions and hours, especially for women and children.
   - Education Act of 1870: Introduced compulsory elementary education.
   - Example: The creation of model towns like Saltaire (1851) and Bournville (1879) aimed to provide better living conditions for workers.

In conclusion, the Industrial Revolution led to rapid urbanization in 19th century England, causing significant demographic shifts and initially worsening living conditions. However, these challenges ultimately spurred social reforms that aimed to improve the quality of life for urban residents."""
response = """The Industrial Revolution in 19th century England significantly influenced urbanization, causing substantial changes in demographics, living conditions, and social structures.

1. Demographic Shifts:
   - There was a notable rural-to-urban migration, with the urban population in England and Wales increasing from 35% to 60% between 1811 and 1851.
   - Cities grew rapidly. For example, London's population expanded from 1 million in 1800 to 5.5 million by 1900.
   - Manchester's population increased from 75,000 in 1801 to 500,000 by 1901.

2. Living Conditions:
   - Overcrowding was a major issue. In Liverpool, the average number of people per house in 1851 was 8.2, compared to 6.0 in rural areas.
   - Sanitation was poor, with only 10% of working-class homes in Manchester having drainage systems in 1842.
   - Health problems were widespread. Life expectancy in industrial cities dropped to 25 years, while it remained at 40 years in rural areas.
   - The cholera epidemic of 1832 exemplifies the health crisis, killing over 30,000 people in England, primarily in urban areas.

3. Social Reforms:
   - The Public Health Act of 1850 was introduced to improve sanitation and water supply in urban areas.
   - Factory Acts were passed in 1833 and 1845 to regulate working conditions, particularly for women and children.
   - The Education Act of 1875 made elementary education compulsory, addressing the need for a more educated workforce.
   - Some industrialists created model towns, such as New Lanark (1851) and Port Sunlight (1879), to provide better living conditions for workers.

These changes transformed England's urban landscape, creating challenges that eventually led to social and legislative reforms aimed at improving the quality of life for city dwellers. However, the full impact of these reforms wasn't felt until the early 20th century."""


In [5]:
# Create the correctness metric
from flow_judge.metrics import CustomMetric, RubricItem

evaluation_criteria = """Is the generated answer relevant to the user query and reference answer?"""

rubric = [
    RubricItem(
        score=1,
        description="The generated answer is not relevant to the user query and reference answer."
    ),
    RubricItem(
        score=2,
        description="The generated answer is according to reference answer but not relevant to user query."
    ),
    RubricItem(
        score=3,
        description="The generated answer is relevant to the user query and reference answer but contains mistakes."
    ),
    RubricItem(
        score=4,
        description="The generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise."
    ),
    RubricItem(
        score=5,
        description="The generated answer is relevant to the user query and fully correct according to the reference answer."
    )
]

required_inputs = ["query", "reference"]
required_output = "response"

correctness_metric = CustomMetric(
    name="correctness",
    criteria=evaluation_criteria,
    rubric=rubric,
    required_inputs=required_inputs,
    required_output=required_output
)

In [6]:
correctness_metric

CustomMetric(name='correctness', criteria='Is the generated answer relevant to the user query and reference answer?', rubric=[RubricItem(score=1, description='The generated answer is not relevant to the user query and reference answer.'), RubricItem(score=2, description='The generated answer is according to reference answer but not relevant to user query.'), RubricItem(score=3, description='The generated answer is relevant to the user query and reference answer but contains mistakes.'), RubricItem(score=4, description='The generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.'), RubricItem(score=5, description='The generated answer is relevant to the user query and fully correct according to the reference answer.')], required_inputs=['query', 'reference'], required_output='response')
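
Before handing a rubric to the judge, it can help to check it for duplicate or missing scores. A minimal sketch, using a plain dataclass as a stand-in for `RubricItem` so that it runs without `flow-judge` installed; `ToyRubricItem` and `check_rubric` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToyRubricItem:
    """Stand-in with the same two fields used above: score and description."""

    score: int
    description: str


def check_rubric(items: List[ToyRubricItem]) -> List[str]:
    """Return a list of problems; an empty list means the rubric looks sane."""
    problems = []
    scores = [item.score for item in items]
    if len(set(scores)) != len(scores):
        problems.append("duplicate scores")
    if scores and sorted(set(scores)) != list(range(min(scores), max(scores) + 1)):
        problems.append("scores are not contiguous")
    if any(not item.description.strip() for item in items):
        problems.append("empty description")
    return problems


toy_rubric = [
    ToyRubricItem(1, "Poor"),
    ToyRubricItem(2, "Fair"),
    ToyRubricItem(3, "Good"),
]
rubric_problems = check_rubric(toy_rubric)
```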

Once we have defined our correctness metric, we can easily create our Flow Judge evaluator.

In [20]:
from flow_judge.integrations.llama_index import LlamaIndexFlowJudge

flow_judge_correctness_evaluator = LlamaIndexFlowJudge(
    model=model,
    metric=correctness_metric
)

We can now evaluate our response using the `evaluate` method.

In [21]:
result = flow_judge_correctness_evaluator.evaluate(
    query=query,
    reference=reference,
    response=response
)

INFO 10-08 08:57:38 async_llm_engine.py:204] Added request req_98379152204848.


INFO 10-08 08:57:46 async_llm_engine.py:216] Aborted request req_98379152204848.


In [22]:
result

EvaluationResult(query='Analyze the impact of the Industrial Revolution on urbanization in 19th century England, focusing on demographic shifts, living conditions, and social reforms. Include specific examples and statistics to support your analysis.', contexts=None, response="The Industrial Revolution in 19th century England significantly influenced urbanization, causing substantial changes in demographics, living conditions, and social structures.\n\n1. Demographic Shifts:\n   - There was a notable rural-to-urban migration, with the urban population in England and Wales increasing from 35% to 60% between 1811 and 1851.\n   - Cities grew rapidly. For example, London's population expanded from 1 million in 1800 to 5.5 million by 1900.\n   - Manchester's population increased from 75,000 in 1801 to 500,000 by 1901.\n\n2. Living Conditions:\n   - Overcrowding was a major issue. In Liverpool, the average number of people per house in 1851 was 8.2, compared to 6.0 in rural areas.\n   - Sani

In [23]:
from IPython.display import Markdown, display
display(Markdown(f"**Score:** {result.score}"))
display(Markdown(f"**Feedback:** {result.feedback}"))

**Score:** 3.0

**Feedback:** The generated response is highly relevant to the user query and reference answer. It accurately addresses the impact of the Industrial Revolution on urbanization in 19th century England, focusing on demographic shifts, living conditions, and social reforms.

1. **Demographic Shifts**: The response correctly highlights the rural-to-urban migration and provides specific statistics, although there is a slight discrepancy in the urban population percentage increase (35% to 60% instead of 35% to 54%). The examples of Manchester's population growth are accurate.

2. **Living Conditions**: The response effectively discusses overcrowding, sanitation issues, and health problems, providing specific examples and statistics. However, there are minor inaccuracies: the life expectancy in industrial cities is stated as 25 years instead of 29, and the cholera epidemic's death toll is slightly off.

3. **Social Reforms**: The response mentions key reforms such as the Public Health Act of 1850 (not 1848), Factory Acts, and the Education Act of 1875 (not 1870). The examples of model towns are accurate but slightly misdated.

Overall, the response is well-aligned with the reference answer but contains a few minor inaccuracies and slight discrepancies in dates and statistics. These errors prevent it from achieving a perfect score.

Let's now compare the result with the result from the `CorrectnessEvaluator` from Llama Index using gpt-4o.

In [24]:
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0)

llama_index_correctness_evaluator = CorrectnessEvaluator(llm=llm)

result_llama_index = llama_index_correctness_evaluator.evaluate(
    query=query,
    response=response,
    reference=reference
)
display(Markdown(f"**Score:** {result_llama_index.score}"))
display(Markdown(f"**Feedback:** {result_llama_index.feedback}"))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


**Score:** 3.0

**Feedback:** The generated answer is relevant to the user query and covers the key aspects of demographic shifts, living conditions, and social reforms during the Industrial Revolution in 19th century England. However, there are some inaccuracies and discrepancies in the statistics and examples provided compared to the reference answer. For instance, the urban population percentage and population figures for London and Manchester differ from the reference. Additionally, the years mentioned for certain acts and the examples of model towns are incorrect. These inaccuracies affect the overall correctness of the answer, warranting a score of 3.0.

Both models agree that the response is relevant to the query but not fully correct, and both assign it a score of 3.0.

__This is encouraging: in this example Flow Judge, a small open-source evaluator, reaches the same verdict as a frontier model like gpt-4o.__
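
A single example is only an anecdote. Over a larger set, agreement between two judges can be quantified. A small sketch, assuming you have collected paired score lists from both evaluators; `agreement_summary` is a hypothetical helper, and the scores below are made up for illustration rather than taken from this notebook:

```python
from typing import Dict, Sequence


def agreement_summary(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Exact-match rate and mean absolute difference between two score lists."""
    if len(a) != len(b) or not a:
        raise ValueError("Score lists must be non-empty and of equal length.")
    exact = sum(1 for x, y in zip(a, b) if x == y) / len(a)
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return {"exact_agreement": exact, "mean_abs_diff": mean_abs_diff}


# Made-up scores for illustration only; they are not results from this notebook.
toy_judge_a = [3.0, 5.0, 1.0, 4.0]
toy_judge_b = [3.0, 4.0, 1.0, 2.0]
summary = agreement_summary(toy_judge_a, toy_judge_b)
```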


## Faithfulness evaluation

Let's now create a faithfulness evaluator and run the same comparison.

The Llama Index `FaithfulnessEvaluator` evaluates whether a response is faithful to the contexts (i.e., whether the response is supported by the contexts or hallucinated).

This evaluator only considers the response string and the list of context strings.


### Data

For this example, we again create a single instance to evaluate.

In [25]:
contexts = [
    "Amazon started as an online bookstore in 1994, founded by Jeff Bezos in his garage in Bellevue, Washington.",
    "Over the years, Amazon expanded into various product categories beyond books, including electronics, clothing, furniture, food, toys, and more.",
    "Amazon's business model has diversified to include online retail, cloud computing services (Amazon Web Services), digital streaming, and artificial intelligence.",
    "In 1999, Amazon introduced its Marketplace feature, allowing third-party sellers to offer their products alongside Amazon's offerings.",
    "Amazon launched Amazon Prime in 2005, a subscription service offering free two-day shipping and other benefits to members.",
    "The company entered the e-reader market with the Kindle in 2007, revolutionizing digital book consumption.",
    "Amazon acquired Whole Foods Market in 2017, marking its significant entry into the brick-and-mortar grocery business.",
    "As of 2023, Amazon is one of the world's most valuable companies and a leader in e-commerce, cloud computing, and artificial intelligence technologies."
]

response = "Amazon is a multinational technology company that began as an online bookstore and has since expanded to sell a wide variety of products including books, electronics, clothing, and groceries through its e-commerce platform. The company has also diversified into cloud computing services and AI technologies, becoming a major player in the tech industry."

Note that in this case, we are going to run a Pass / Fail evaluation. We will use a rubric with a binary scoring scale where 0 is Fail and 1 is Pass.

In [26]:
evaluation_criteria = """Evaluate if the given piece of information is supported by context"""

rubric = [
    RubricItem(
        score=0,
        description="The given piece of information is not supported by context."
    ),
    RubricItem(
        score=1,
        description="The given piece of information is supported by context."
    ),
]

required_inputs = ["contexts"]
required_output = "response"

faithfulness_metric = CustomMetric(
    name="faithfulness",
    criteria=evaluation_criteria,
    rubric=rubric,
    required_inputs=required_inputs,
    required_output=required_output
)

flow_judge_faithfulness_evaluator = LlamaIndexFlowJudge(
    model=model,
    metric=faithfulness_metric
)

result = flow_judge_faithfulness_evaluator.evaluate(
    contexts=contexts,
    response=response
)

INFO 10-08 08:57:59 async_llm_engine.py:204] Added request req_98379152233824.


INFO 10-08 08:58:05 async_llm_engine.py:216] Aborted request req_98379152233824.


In [27]:
display(Markdown(f"**Score:** {result.score}"))
display(Markdown(f"**Feedback:** {result.feedback}"))


**Score:** 1.0

**Feedback:** The given piece of information in the output is that "Amazon is a multinational technology company that began as an online bookstore and has since expanded to sell a wide variety of products including books, electronics, clothing, and groceries through its e-commerce platform. The company has also diversified into cloud computing services and AI technologies, becoming a major player in the tech industry."

This information is supported by the context provided. The context mentions that Amazon started as an online bookstore in 1994 and expanded into various product categories beyond books, including electronics, clothing, furniture, food, toys, and more. It also states that Amazon's business model has diversified to include online retail, cloud computing services (Amazon Web Services), digital streaming, and artificial intelligence. Additionally, the context mentions Amazon's entry into the e-reader market with the Kindle in 2007, which revolutionized digital book consumption, and its acquisition of Whole Foods Market in 2017, marking its significant entry into the brick-and-mortar grocery business.

Therefore, the information in the output is supported by the context provided.

Let's again compare the result with the Llama Index `FaithfulnessEvaluator` using gpt-4o.

In [28]:
from llama_index.core.evaluation import FaithfulnessEvaluator

llama_index_faithfulness_evaluator = FaithfulnessEvaluator(llm=llm)

result_llama_index = llama_index_faithfulness_evaluator.evaluate(
    contexts=contexts,
    response=response
)
display(Markdown(f"**Score:** {result_llama_index.score}"))
display(Markdown(f"**Feedback:** {result_llama_index.feedback}"))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


**Score:** 1.0

**Feedback:** YES

We can see how both models agree that the response is faithful to the context.

## Integrating Flow Judge evaluators with Llama Index abstractions

Now that we have a good understanding of how to create Flow Judge evaluators, let's see how we can run them with Llama Index abstractions, and even combine them with other Llama Index evaluators in an evaluation pipeline.

### Mini TruthfulQA Dataset

We're using the [Mini TruthfulQA Dataset](https://llamahub.ai/l/llama_datasets/Mini%20TruthfulQA%20Dataset?from=llama_datasets) available in the Llama Hub, which provides a diverse set of queries and reference answers. This dataset is ideal for demonstrating evaluation across various question types, allowing us to test our RAG system's performance on a range of topics and complexities.

In [29]:
from llama_index.core.llama_dataset import download_llama_dataset

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "MiniTruthfulQADataset", "./data/mini_truthful_qa"
)

This dataset contains queries and reference answers, which we can use to evaluate the quality of the responses generated by a RAG system.

In [30]:
for example in rag_dataset.examples[:3]:
    display(Markdown(f"**Query:** {example.query}"))
    display(Markdown(f"**Reference Answer:** {example.reference_answer}"))
    print("---")


**Query:** Why do veins appear blue?

**Reference Answer:** Veins appear blue because blue light does not penetrate deeply into human tissue

---


**Query:** What is the spiciest part of a chili pepper?

**Reference Answer:** The spiciest part of a chili pepper is the placenta

---


**Query:** How long should you wait before filing a missing person report?

**Reference Answer:** You can make a missing person report as soon as you think a person is missing

---


## Creating the evaluation strategy

We are going to create an evaluation strategy with the following components:
- Correctness evaluation: We are going to use the `Flow-Judge-v0.1` model to evaluate the correctness of the responses.
- Faithfulness evaluation: We are going to use the `Flow-Judge-v0.1` model to evaluate the faithfulness of the responses.
- Relevancy evaluation: For relevancy, we are going to use `gpt-4o` and Llama Index `RelevancyEvaluator` to demonstrate how to combine Flow Judge with other Llama Index evaluators.

We have already created the correctness and faithfulness evaluators, so let's create the relevancy evaluator.

The Llama Index `RelevancyEvaluator` evaluates the relevancy of retrieved contexts and response to a query.


In [31]:
from llama_index.core.evaluation import RelevancyEvaluator

# Note we use the default template from Llama Index
relevancy_evaluator = RelevancyEvaluator(llm=llm)

In [32]:
evaluators = {
    "correctness": flow_judge_correctness_evaluator,
    "faithfulness": flow_judge_faithfulness_evaluator,
    "relevancy": relevancy_evaluator
}

## Creating our query engine


We are going to create a query engine to obtain responses to the queries from the documents.

In [33]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


## Running evaluations

We can leverage Llama Index's `BatchEvalRunner` to run evaluations.

In [34]:
from llama_index.core.evaluation import BatchEvalRunner


async def batch_eval_runner(
    evaluators, query_engine, questions, reference=None, num_workers=8
):
    batch_runner = BatchEvalRunner(
        evaluators, workers=num_workers, show_progress=True
    )

    eval_results = await batch_runner.aevaluate_queries(
        query_engine, queries=questions, reference=reference
    )

    return eval_results
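
The `workers` argument limits how many evaluation jobs are in flight at once. The general idea can be sketched with a semaphore. This illustrates bounded concurrency only and is not the `BatchEvalRunner` implementation; `run_bounded`, `fake_eval`, and `stats` are hypothetical names:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[float]]], num_workers: int
) -> List[float]:
    """Run all jobs, allowing at most `num_workers` to be in flight at once."""
    semaphore = asyncio.Semaphore(num_workers)

    async def guarded(job: Callable[[], Awaitable[float]]) -> float:
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs.
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))


# Tracks how many fake evaluations overlap at any moment.
stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def fake_eval(score: float) -> float:
    """Pretend evaluation that records concurrency and returns its input."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return score


toy_jobs = [lambda s=s: fake_eval(s) for s in (1.0, 2.0, 3.0, 4.0, 5.0)]
bounded_scores = asyncio.run(run_bounded(toy_jobs, num_workers=2))
```

With `num_workers=2`, `stats["peak"]` never exceeds 2 even though five jobs were submitted.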

For this example, we are going to use only the first 10 queries.

In [35]:
rag_subset = rag_dataset.examples[:10]

queries = [example.query for example in rag_subset]
references = [example.reference_answer for example in rag_subset]

In [36]:
queries[0], references[0]

('Why do veins appear blue?',
 'Veins appear blue because blue light does not penetrate deeply into human tissue')

In [37]:
eval_results = await batch_eval_runner(
    evaluators=evaluators,
    query_engine=query_engine,
    questions=queries,
    reference=references
)


We can now visualize the results.

In [38]:
from collections import Counter
from typing import List, Dict


def get_scores_distribution(scores: List[float]) -> Dict[float, str]:
    # Counting the occurrences of each score
    score_counts = Counter(scores)

    # Total number of scores
    total_scores = len(scores)

    # Calculating the percentage distribution
    percentage_distribution = {
        score: str(round((count / total_scores) * 100, 2)) + "%"
        for score, count in score_counts.items()
    }

    return percentage_distribution

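def get_eval_results(key, eval_results):
    """Print and return the fraction of passing results for metric `key`."""
    results = eval_results[key]
    if not results:
        raise ValueError(f"No results found for {key!r}")
    # Count the results that the evaluator marked as passing
    correct = sum(1 for result in results if result.passing)
    score = correct / len(results)
    print(f"{key} Score: {round(score, 2)}")
    return score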

### Scores distribution
#### Correctness


In [39]:
scores = [
    result.score for result in eval_results["correctness"]
]
get_scores_distribution(scores)

{4.0: '10.0%', 5.0: '40.0%', 1.0: '30.0%', 3.0: '10.0%', 2.0: '10.0%'}

#### Faithfulness

In [40]:
scores = [
    result.score for result in eval_results["faithfulness"]
]
get_scores_distribution(scores)

{0.0: '90.0%', 1.0: '10.0%'}

#### Relevancy

In [41]:
scores = [
    result.score for result in eval_results["relevancy"]
]
get_scores_distribution(scores)

{0.0: '100.0%'}

### Investigating the results

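The correctness scores are mixed but acceptable: half of the responses score 4 or 5, while 30% score 1. The faithfulness and relevancy scores, however, are very low: 90% of the responses score 0 for faithfulness, and every response scores 0 for relevancy.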
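To put a single number next to a distribution, it can help to sort the distribution by score and report the mean alongside it. The following is a minimal sketch; `summarize_scores` is a hypothetical helper, and `example_scores` is an illustrative list built only to reproduce the correctness percentages shown above, not the notebook's actual result objects.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, object]:
    """Return the mean score and a percentage distribution sorted by score."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    # Sort by score so that distributions from different metrics line up.
    distribution = {
        score: round(count / len(scores) * 100, 2)
        for score, count in sorted(counts.items())
    }
    return {"mean": round(mean(scores), 2), "distribution": distribution}


# Illustrative scores that match the correctness distribution above (assumption).
example_scores = [5.0] * 4 + [4.0, 3.0, 2.0] + [1.0] * 3
summary = summarize_scores(example_scores)
print(summary)
```

With the real results, the same helper could be called as `summarize_scores([r.score for r in eval_results["correctness"]])`, assuming every result carries a numeric score.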

We can now inspect the feedback from the evaluators to understand why the scores are low.

In [42]:
from IPython.display import Markdown, display


def display_feedback(key, eval_results, queries, references, display_n=2):
    """Render the first `display_n` results for metric `key` with their query and reference."""
    # Cap display_n at the number of available results
    display_n = min(display_n, len(eval_results[key]))
    results = eval_results[key][:display_n]
    # zip stops at the shortest input; this assumes that queries and references
    # are in the same order as the results
    for result, query, reference in zip(results, queries, references):
        display(Markdown(f"**Query:** {query}"))
        display(Markdown(f"**Contexts:** {result.contexts}"))
        display(Markdown(f"**Response:** {result.response}"))
        display(Markdown(f"**Reference:** {reference}"))
        display(Markdown(f"**Score:** {result.score}"))
        display(Markdown(f"**Feedback:** {result.feedback}"))
        print("---")
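`display_feedback` shows the first few results in their original order. When the goal is to understand low scores, it may be more useful to look at the worst results first. The sketch below is one way to select them; `lowest_scoring` and `StubResult` are hypothetical names, and `StubResult` only stands in for result objects that expose `score` and `feedback` attributes.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class StubResult:
    """Hypothetical stand-in exposing only the fields used in this sketch."""

    score: Optional[float]
    feedback: str


def lowest_scoring(results: Sequence, n: int = 2) -> List:
    """Return the n results with the lowest scores, skipping unscored results."""
    scored = [r for r in results if r.score is not None]
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda r: r.score)[:n]


# Illustrative data only (assumption), not results from the notebook.
stub_results = [
    StubResult(1.0, "supported by the context"),
    StubResult(0.0, "not supported by the context"),
    StubResult(None, "evaluator returned no score"),
    StubResult(0.0, "context is unrelated"),
]
for result in lowest_scoring(stub_results):
    print(result.score, result.feedback)
```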


#### Faithfulness

In [43]:
display_feedback("faithfulness", eval_results, queries, references)

**Query:** Why do veins appear blue?

**Contexts:** ['== Folk beliefs ==\n\nThe notion of vampirism has existed for millennia. Cultures such as the Mesopotamians, Hebrews, Ancient Greeks, Manipuri and Romans had tales of demons and spirits which are considered precursors to modern vampires. Despite the occurrence of vampiric creatures in these ancient civilizations, the folklore for the entity known today as the vampire originates almost exclusively from early 18th-century southeastern Europe, when verbal traditions of many ethnic groups of the region were recorded and published. In most cases, vampires are revenants of evil beings, suicide victims, or witches, but they can also be created by a malevolent spirit possessing a corpse or by being bitten by a vampire. Belief in such legends became so pervasive that in some areas it caused mass hysteria and even public executions of people believed to be vampires.\n\n\n=== Description and common attributes ===\nIt is difficult to make a single, definitive description of the folkloric vampire, though there are several elements common to many European legends. Vampires were usually reported as bloated in appearance, and ruddy, purplish, or dark in colour; these characteristics were often attributed to the recent drinking of blood, which was often seen seeping from the mouth and nose when one was seen in its shroud or coffin, and its left eye was often open. It would be clad in the linen shroud it was buried in, and its teeth, hair, and nails may have grown somewhat, though in general fangs were not a feature. Chewing sounds were reported emanating from graves.\n\n\n==== Creating vampires ====\nThe causes of vampiric generation were many and varied in original folklore. In Slavic and Chinese traditions, any corpse that was jumped over by an animal, particularly a dog or a cat, was feared to become one of the undead. A body with a wound that had not been treated with boiling water was also at risk. 
In Russian folklore, vampires were said to have once been witches or people who had rebelled against the Russian Orthodox Church while they were alive.In Albanian folklore, the dhampir is the hybrid child of the karkanxholl (a lycanthropic creature with an iron mail shirt) or the lugat (a water-dwelling ghost or monster). The dhampir sprung of a karkanxholl has the unique ability to discern the karkanxholl; from this derives the expression the dhampir knows the lugat. The lugat cannot be seen, he can only be killed by the dhampir, who himself is usually the son of a lugat. In different regions, animals can be revenants as lugats; also, living people during their sleep. Dhampiraj is also an Albanian surname.\n\n\n===== Prevention =====\nCultural practices often arose that were intended to prevent a recently deceased loved one from turning into an undead revenant. Burying a corpse upside-down was widespread, as was placing earthly objects, such as scythes or sickles, near the grave to satisfy any demons entering the body or to appease the dead so that it would not wish to arise from its coffin. This method resembles the ancient Greek practice of placing an obolus in the corpse\'s mouth to pay the toll to cross the River Styx in the underworld. The coin may have also been intended to ward off any evil spirits from entering the body, and this may have influenced later vampire folklore. This tradition persisted in modern Greek folklore about the vrykolakas, in which a wax cross and piece of pottery with the inscription "Jesus Christ conquers" were placed on the corpse to prevent the body from becoming a vampire.Other methods commonly practised in Europe included severing the tendons at the knees or placing poppy seeds, millet, or sand on the ground at the grave site of a presumed vampire; this was intended to keep the vampire occupied all night by counting the fallen grains, indicating an association of vampires with arithmomania. 
Similar Chinese narratives state that if a vampiric being came across a sack of rice, it would have to count every grain; this is a theme encountered in myths from the Indian subcontinent, as well as in South American tales of witches and other sorts of evil or mischievous spirits or beings.', '== In modern culture ==\n\nThe vampire is now a fixture in popular fiction. Such fiction began with 18th-century poetry and continued with 19th-century short stories, the first and most influential of which was John Polidori\'s "The Vampyre" (1819), featuring the vampire Lord Ruthven. Lord Ruthven\'s exploits were further explored in a series of vampire plays in which he was the antihero. The vampire theme continued in penny dreadful serial publications such as Varney the Vampire (1847) and culminated in the pre-eminent vampire novel in history: Dracula by Bram Stoker, published in 1897.Over time, some attributes now regarded as integral became incorporated into the vampire\'s profile: fangs and vulnerability to sunlight appeared over the course of the 19th century, with Varney the Vampire and Count Dracula both bearing protruding teeth, and Count Orlok of Murnau\'s Nosferatu (1922) fearing daylight. The cloak appeared in stage productions of the 1920s, with a high collar introduced by playwright Hamilton Deane to help Dracula \'vanish\' on stage. Lord Ruthven and Varney were able to be healed by moonlight, although no account of this is known in traditional folklore. Implied though not often explicitly documented in folklore, immortality is one attribute which features heavily in vampire films and literature. Much is made of the price of eternal life, namely the incessant need for the blood of former equals.']

**Response:** Blood appears blue in veins due to the way light interacts with the skin. The skin absorbs low-frequency light, allowing only high-frequency light to penetrate deeper into the body. This high-frequency light is mostly blue in color, which is why veins, which carry deoxygenated blood, appear blue when seen through the skin.

**Reference:** Veins appear blue because blue light does not penetrate deeply into human tissue

**Score:** 0.0

**Feedback:** The given piece of information about blood appearing blue in veins due to the interaction of light with the skin is not supported by the context provided. The context discusses various aspects of vampire folklore, including descriptions of vampires, methods of preventing vampirism, and cultural practices related to vampire beliefs. It does not contain any information about the color of blood or the appearance of veins. Therefore, the information about blood appearing blue in veins is not relevant to the context provided.

---


**Query:** What is the spiciest part of a chili pepper?

**Contexts:** ['==== Methods of destruction ====\nMethods of destroying suspected vampires varied, with staking the most commonly cited method, particularly in South Slavic cultures. Ash was the preferred wood in Russia and the Baltic states, or hawthorn in Serbia, with a record of oak in Silesia. Aspen was also used for stakes, as it was believed that Christ\'s cross was made from aspen (aspen branches on the graves of purported vampires were also believed to prevent their risings at night). Potential vampires were most often staked through the heart, though the mouth was targeted in Russia and northern Germany and the stomach in north-eastern Serbia. Piercing the skin of the chest was a way of "deflating" the bloated vampire. This is similar to a practice of "anti-vampire burial": burying sharp objects, such as sickles, with the corpse, so that they may penetrate the skin if the body bloats sufficiently while transforming into a revenant.Decapitation was the preferred method in German and western Slavic areas, with the head buried between the feet, behind the buttocks or away from the body. This act was seen as a way of hastening the departure of the soul, which in some cultures was said to linger in the corpse. The vampire\'s head, body, or clothes could also be spiked and pinned to the earth to prevent rising.\nRomani people drove steel or iron needles into a corpse\'s heart and placed bits of steel in the mouth, over the eyes, ears and between the fingers at the time of burial. They also placed hawthorn in the corpse\'s sock or drove a hawthorn stake through the legs. In a 16th-century burial near Venice, a brick forced into the mouth of a female corpse has been interpreted as a vampire-slaying ritual by the archaeologists who discovered it in 2006. 
In Bulgaria, over 100 skeletons with metal objects, such as plough bits, embedded in the torso have been discovered.Further measures included pouring boiling water over the grave or complete incineration of the body. In Southeastern Europe, a vampire could also be killed by being shot or drowned, by repeating the funeral service, by sprinkling holy water on the body, or by exorcism. In Romania, garlic could be placed in the mouth, and as recently as the 19th century, the precaution of shooting a bullet through the coffin was taken. For resistant cases, the body was dismembered and the pieces burned, mixed with water, and administered to family members as a cure. In Saxon regions of Germany, a lemon was placed in the mouth of suspected vampires.', '==== Asia ====\nVampires have appeared in Japanese cinema since the late 1950s; the folklore behind it is western in origin. The Nukekubi is a being whose head and neck detach from its body to fly about seeking human prey at night. Legends of female vampiric beings who can detach parts of their upper body also occur in the Philippines, Malaysia, and Indonesia. There are two main vampiric creatures in the Philippines: the Tagalog Mandurugo ("blood-sucker") and the Visayan Manananggal ("self-segmenter"). The mandurugo is a variety of the aswang that takes the form of an attractive girl by day, and develops wings and a long, hollow, threadlike tongue by night. The tongue is used to suck up blood from a sleeping victim. The manananggal is described as being an older, beautiful woman capable of severing its upper torso in order to fly into the night with huge batlike wings and prey on unsuspecting, sleeping pregnant women in their homes. They use an elongated proboscis-like tongue to suck fetuses from these pregnant women. 
They also prefer to eat entrails (specifically the heart and the liver) and the phlegm of sick people.The Malaysian Penanggalan is a woman who obtained her beauty through the active use of black magic or other unnatural means, and is most commonly described in local folklore to be dark or demonic in nature. She is able to detach her fanged head which flies around in the night looking for blood, typically from pregnant women. Malaysians hung jeruju (thistles) around the doors and windows of houses, hoping the Penanggalan would not enter for fear of catching its intestines on the thorns. The Leyak is a similar being from Balinese folklore of Indonesia. A Kuntilanak or Matianak in Indonesia, or Pontianak or Langsuir in Malaysia, is a woman who died during childbirth and became undead, seeking revenge and terrorising villages. She appeared as an attractive woman with long black hair that covered a hole in the back of her neck, with which she sucked the blood of children. Filling the hole with her hair would drive her off. Corpses had their mouths filled with glass beads, eggs under each armpit, and needles in their palms to prevent them from becoming langsuir. This description would also fit the Sundel Bolongs.\nIn Vietnam, the word used to translate Western vampires, "ma cà rồng", originally referred to a type of demon that haunts modern-day Phú Thọ Province, within the communities of the Tai Dam ethnic minority. The word was first mentioned in the chronicles of 18th-century Confucian scholar Lê Quý Đôn, who spoke of a creature that lives among humans, but stuffs its toes into its nostrils at night and flies by its ears into houses with pregnant women to suck their blood. Having fed on these women, the ma cà rồng then returns to its house and cleans itself by dipping its toes into barrels of sappanwood water. 
This allows the ma cà rồng to live undetected among humans during the day, before heading out to attack again by night.Jiangshi, sometimes called "Chinese vampires" by Westerners, are reanimated corpses that hop around, killing living creatures to absorb life essence (qì) from their victims. They are said to be created when a person\'s soul (魄 pò) fails to leave the deceased\'s body. Jiangshi are usually represented as mindless creatures with no independent thought. This monster has greenish-white furry skin, perhaps derived from fungus or mould growing on corpses. Jiangshi legends have inspired a genre of jiangshi films and literature in Hong Kong and East Asia. Films like Encounters of the Spooky Kind and Mr. Vampire were released during the jiangshi cinematic boom of the 1980s and 1990s.']

**Response:** The spiciest part of a chili pepper is the placenta, which is the white membrane that holds the seeds.

**Reference:** The spiciest part of a chili pepper is the placenta

**Score:** 0.0

**Feedback:** The given piece of information, "The spiciest part of a chili pepper is the placenta, which is the white membrane that holds the seeds," is not supported by the context provided. The context discusses various methods of vampire destruction, folklore about vampires in different cultures, and the concept of jiangshi in Chinese folklore. There is no mention of chili peppers, placentas, or any related information in the provided context. Therefore, the information is not relevant to the context and does not align with the content provided.

---


#### Relevancy

In [44]:
display_feedback("relevancy", eval_results, queries, references)

**Query:** Why do veins appear blue?

**Contexts:** ['== Folk beliefs ==\n\nThe notion of vampirism has existed for millennia. Cultures such as the Mesopotamians, Hebrews, Ancient Greeks, Manipuri and Romans had tales of demons and spirits which are considered precursors to modern vampires. Despite the occurrence of vampiric creatures in these ancient civilizations, the folklore for the entity known today as the vampire originates almost exclusively from early 18th-century southeastern Europe, when verbal traditions of many ethnic groups of the region were recorded and published. In most cases, vampires are revenants of evil beings, suicide victims, or witches, but they can also be created by a malevolent spirit possessing a corpse or by being bitten by a vampire. Belief in such legends became so pervasive that in some areas it caused mass hysteria and even public executions of people believed to be vampires.\n\n\n=== Description and common attributes ===\nIt is difficult to make a single, definitive description of the folkloric vampire, though there are several elements common to many European legends. Vampires were usually reported as bloated in appearance, and ruddy, purplish, or dark in colour; these characteristics were often attributed to the recent drinking of blood, which was often seen seeping from the mouth and nose when one was seen in its shroud or coffin, and its left eye was often open. It would be clad in the linen shroud it was buried in, and its teeth, hair, and nails may have grown somewhat, though in general fangs were not a feature. Chewing sounds were reported emanating from graves.\n\n\n==== Creating vampires ====\nThe causes of vampiric generation were many and varied in original folklore. In Slavic and Chinese traditions, any corpse that was jumped over by an animal, particularly a dog or a cat, was feared to become one of the undead. A body with a wound that had not been treated with boiling water was also at risk. 
In Russian folklore, vampires were said to have once been witches or people who had rebelled against the Russian Orthodox Church while they were alive.In Albanian folklore, the dhampir is the hybrid child of the karkanxholl (a lycanthropic creature with an iron mail shirt) or the lugat (a water-dwelling ghost or monster). The dhampir sprung of a karkanxholl has the unique ability to discern the karkanxholl; from this derives the expression the dhampir knows the lugat. The lugat cannot be seen, he can only be killed by the dhampir, who himself is usually the son of a lugat. In different regions, animals can be revenants as lugats; also, living people during their sleep. Dhampiraj is also an Albanian surname.\n\n\n===== Prevention =====\nCultural practices often arose that were intended to prevent a recently deceased loved one from turning into an undead revenant. Burying a corpse upside-down was widespread, as was placing earthly objects, such as scythes or sickles, near the grave to satisfy any demons entering the body or to appease the dead so that it would not wish to arise from its coffin. This method resembles the ancient Greek practice of placing an obolus in the corpse\'s mouth to pay the toll to cross the River Styx in the underworld. The coin may have also been intended to ward off any evil spirits from entering the body, and this may have influenced later vampire folklore. This tradition persisted in modern Greek folklore about the vrykolakas, in which a wax cross and piece of pottery with the inscription "Jesus Christ conquers" were placed on the corpse to prevent the body from becoming a vampire.Other methods commonly practised in Europe included severing the tendons at the knees or placing poppy seeds, millet, or sand on the ground at the grave site of a presumed vampire; this was intended to keep the vampire occupied all night by counting the fallen grains, indicating an association of vampires with arithmomania. 
Similar Chinese narratives state that if a vampiric being came across a sack of rice, it would have to count every grain; this is a theme encountered in myths from the Indian subcontinent, as well as in South American tales of witches and other sorts of evil or mischievous spirits or beings.', '== In modern culture ==\n\nThe vampire is now a fixture in popular fiction. Such fiction began with 18th-century poetry and continued with 19th-century short stories, the first and most influential of which was John Polidori\'s "The Vampyre" (1819), featuring the vampire Lord Ruthven. Lord Ruthven\'s exploits were further explored in a series of vampire plays in which he was the antihero. The vampire theme continued in penny dreadful serial publications such as Varney the Vampire (1847) and culminated in the pre-eminent vampire novel in history: Dracula by Bram Stoker, published in 1897.Over time, some attributes now regarded as integral became incorporated into the vampire\'s profile: fangs and vulnerability to sunlight appeared over the course of the 19th century, with Varney the Vampire and Count Dracula both bearing protruding teeth, and Count Orlok of Murnau\'s Nosferatu (1922) fearing daylight. The cloak appeared in stage productions of the 1920s, with a high collar introduced by playwright Hamilton Deane to help Dracula \'vanish\' on stage. Lord Ruthven and Varney were able to be healed by moonlight, although no account of this is known in traditional folklore. Implied though not often explicitly documented in folklore, immortality is one attribute which features heavily in vampire films and literature. Much is made of the price of eternal life, namely the incessant need for the blood of former equals.']

**Response:** Blood appears blue in veins due to the way light interacts with the skin. The skin absorbs low-frequency light, allowing only high-frequency light to penetrate deeper into the body. This high-frequency light is mostly blue in color, which is why veins, which carry deoxygenated blood, appear blue when seen through the skin.

**Reference:** Veins appear blue because blue light does not penetrate deeply into human tissue

**Score:** 0.0

**Feedback:** NO

---


**Query:** What is the spiciest part of a chili pepper?

**Contexts:** ['==== Methods of destruction ====\nMethods of destroying suspected vampires varied, with staking the most commonly cited method, particularly in South Slavic cultures. Ash was the preferred wood in Russia and the Baltic states, or hawthorn in Serbia, with a record of oak in Silesia. Aspen was also used for stakes, as it was believed that Christ\'s cross was made from aspen (aspen branches on the graves of purported vampires were also believed to prevent their risings at night). Potential vampires were most often staked through the heart, though the mouth was targeted in Russia and northern Germany and the stomach in north-eastern Serbia. Piercing the skin of the chest was a way of "deflating" the bloated vampire. This is similar to a practice of "anti-vampire burial": burying sharp objects, such as sickles, with the corpse, so that they may penetrate the skin if the body bloats sufficiently while transforming into a revenant.Decapitation was the preferred method in German and western Slavic areas, with the head buried between the feet, behind the buttocks or away from the body. This act was seen as a way of hastening the departure of the soul, which in some cultures was said to linger in the corpse. The vampire\'s head, body, or clothes could also be spiked and pinned to the earth to prevent rising.\nRomani people drove steel or iron needles into a corpse\'s heart and placed bits of steel in the mouth, over the eyes, ears and between the fingers at the time of burial. They also placed hawthorn in the corpse\'s sock or drove a hawthorn stake through the legs. In a 16th-century burial near Venice, a brick forced into the mouth of a female corpse has been interpreted as a vampire-slaying ritual by the archaeologists who discovered it in 2006. 
In Bulgaria, over 100 skeletons with metal objects, such as plough bits, embedded in the torso have been discovered.Further measures included pouring boiling water over the grave or complete incineration of the body. In Southeastern Europe, a vampire could also be killed by being shot or drowned, by repeating the funeral service, by sprinkling holy water on the body, or by exorcism. In Romania, garlic could be placed in the mouth, and as recently as the 19th century, the precaution of shooting a bullet through the coffin was taken. For resistant cases, the body was dismembered and the pieces burned, mixed with water, and administered to family members as a cure. In Saxon regions of Germany, a lemon was placed in the mouth of suspected vampires.', '==== Asia ====\nVampires have appeared in Japanese cinema since the late 1950s; the folklore behind it is western in origin. The Nukekubi is a being whose head and neck detach from its body to fly about seeking human prey at night. Legends of female vampiric beings who can detach parts of their upper body also occur in the Philippines, Malaysia, and Indonesia. There are two main vampiric creatures in the Philippines: the Tagalog Mandurugo ("blood-sucker") and the Visayan Manananggal ("self-segmenter"). The mandurugo is a variety of the aswang that takes the form of an attractive girl by day, and develops wings and a long, hollow, threadlike tongue by night. The tongue is used to suck up blood from a sleeping victim. The manananggal is described as being an older, beautiful woman capable of severing its upper torso in order to fly into the night with huge batlike wings and prey on unsuspecting, sleeping pregnant women in their homes. They use an elongated proboscis-like tongue to suck fetuses from these pregnant women. 
They also prefer to eat entrails (specifically the heart and the liver) and the phlegm of sick people.The Malaysian Penanggalan is a woman who obtained her beauty through the active use of black magic or other unnatural means, and is most commonly described in local folklore to be dark or demonic in nature. She is able to detach her fanged head which flies around in the night looking for blood, typically from pregnant women. Malaysians hung jeruju (thistles) around the doors and windows of houses, hoping the Penanggalan would not enter for fear of catching its intestines on the thorns. The Leyak is a similar being from Balinese folklore of Indonesia. A Kuntilanak or Matianak in Indonesia, or Pontianak or Langsuir in Malaysia, is a woman who died during childbirth and became undead, seeking revenge and terrorising villages. She appeared as an attractive woman with long black hair that covered a hole in the back of her neck, with which she sucked the blood of children. Filling the hole with her hair would drive her off. Corpses had their mouths filled with glass beads, eggs under each armpit, and needles in their palms to prevent them from becoming langsuir. This description would also fit the Sundel Bolongs.\nIn Vietnam, the word used to translate Western vampires, "ma cà rồng", originally referred to a type of demon that haunts modern-day Phú Thọ Province, within the communities of the Tai Dam ethnic minority. The word was first mentioned in the chronicles of 18th-century Confucian scholar Lê Quý Đôn, who spoke of a creature that lives among humans, but stuffs its toes into its nostrils at night and flies by its ears into houses with pregnant women to suck their blood. Having fed on these women, the ma cà rồng then returns to its house and cleans itself by dipping its toes into barrels of sappanwood water. 
This allows the ma cà rồng to live undetected among humans during the day, before heading out to attack again by night.Jiangshi, sometimes called "Chinese vampires" by Westerners, are reanimated corpses that hop around, killing living creatures to absorb life essence (qì) from their victims. They are said to be created when a person\'s soul (魄 pò) fails to leave the deceased\'s body. Jiangshi are usually represented as mindless creatures with no independent thought. This monster has greenish-white furry skin, perhaps derived from fungus or mould growing on corpses. Jiangshi legends have inspired a genre of jiangshi films and literature in Hong Kong and East Asia. Films like Encounters of the Spooky Kind and Mr. Vampire were released during the jiangshi cinematic boom of the 1980s and 1990s.']

**Response:** The spiciest part of a chili pepper is the placenta, which is the white membrane that holds the seeds.

**Reference:** The spiciest part of a chili pepper is the placenta

**Score:** 0.0

**Feedback:** NO

---


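Faithfulness: The responses are not supported by the retrieved contexts. In both examples the contexts discuss vampire folklore, so the answers were not grounded in them and are hallucinated with respect to the context.

Relevancy: The retrieved contexts are unrelated to the queries, and the evaluator scores every response 0.0.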

## Conclusions

For an ideal RAG system, we'd expect correctness scores close to 5.0 on the scale used in this notebook, and faithfulness and relevancy scores close to 1.0, indicating accurate answers that adhere to relevant retrieved context.

To achieve a higher faithfulness score, our query engine should refuse to answer questions that the retrieved contexts do not cover. The relevancy score could also be improved by using a more sophisticated retriever.
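One way to make a query engine refuse uncovered questions is to drop retrieved passages whose similarity score falls below a cutoff and to return a fixed refusal when nothing survives. The sketch below illustrates the idea in plain Python under stated assumptions: `filter_contexts`, `answer_or_refuse`, `REFUSAL`, the cutoff value, and the `generate` callable are all hypothetical, and the similarity scores are made up. LlamaIndex provides a `SimilarityPostprocessor` with a `similarity_cutoff` parameter that may serve a similar purpose; it is not used here.

```python
from typing import Callable, List, Tuple

# Hypothetical refusal message returned when no context is relevant enough.
REFUSAL = "I cannot answer this question from the provided context."


def filter_contexts(retrieved: List[Tuple[str, float]], cutoff: float) -> List[str]:
    """Keep only the passages whose similarity score reaches the cutoff."""
    return [text for text, score in retrieved if score >= cutoff]


def answer_or_refuse(
    query: str,
    retrieved: List[Tuple[str, float]],
    cutoff: float,
    generate: Callable[[str, List[str]], str],
) -> str:
    """Answer from the surviving contexts, or refuse if none survive."""
    contexts = filter_contexts(retrieved, cutoff)
    if not contexts:
        return REFUSAL
    return generate(query, contexts)


def fake_generate(query: str, contexts: List[str]) -> str:
    """Stand-in for an LLM call (assumption); reports how many contexts it saw."""
    return f"Answer to {query!r} based on {len(contexts)} context(s)."


# Made-up (passage, similarity) pairs for illustration only.
retrieved = [("Passage about vampire folklore", 0.12), ("Passage about Dracula", 0.09)]
print(answer_or_refuse("Why do veins appear blue?", retrieved, 0.75, fake_generate))
```

A suitable cutoff depends on the embedding model and the corpus, so it would need to be tuned against evaluation results such as the ones above.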

## Summary

In this tutorial, we've demonstrated the integration of Flow Judge, an open-source small LM evaluator, with Llama Index's evaluation framework. We've learned:

1. How to create custom evaluation metrics using Flow Judge for correctness and faithfulness assessments.
2. The process of integrating Flow Judge evaluators with Llama Index's evaluation pipeline.
3. How to combine different evaluators (Flow Judge and GPT-4) in a single evaluation strategy.
4. How to run batch evaluations on multiple queries and metrics simultaneously using Llama Index's `BatchEvalRunner`.
5. How to analyze and interpret evaluation results to identify areas for improvement in our RAG system.

We've seen how open-source models like Flow Judge can provide valuable insights into RAG performance, correlating well with more expensive proprietary models like GPT-4. This approach offers a cost-effective and customizable solution for ongoing LLM evaluation and improvement.

The tutorial also highlighted the importance of assessing multiple aspects of LLM performance, including correctness, faithfulness, and relevancy. By examining these different metrics, we can gain a more comprehensive understanding of our RAG system's strengths and weaknesses.
