# LlamaIndex Bottoms-Up Development - Evaluation Baseline

LlamaIndex provides some basic evaluation of query engines! We can set up an evaluator that measures both hallucinations and whether the query was actually answered!

This is provided by two main evaluations:

- `ResponseSourceEvaluator` - uses an LLM to decide if the response is similar enough to the sources -- a good measure for hallucination detection! (The hallucination check later in this notebook is run with `FaithfulnessEvaluator`.)
- `QueryResponseEvaluator` - uses an LLM to decide if a response is similar enough to the original query -- a good measure for checking if the query was answered!

You may have noticed that we are using an LLM for this task. That means we will want to pick a powerful LLM, like GPT-4 or Claude-2.
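To make the idea concrete, here is a minimal sketch of the LLM-as-judge pattern that both evaluators rely on. It is not LlamaIndex's implementation: `judge_faithfulness`, the prompt wording, and the `llm` callable are all hypothetical, and `stub_llm` stands in for a real model call so the example runs offline.

```python
from typing import Callable, List

# Hypothetical prompt; the real evaluators use their own templates.
JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Response:\n{response}\n\n"
    "Is the response supported by the context? Answer YES or NO."
)

def judge_faithfulness(llm: Callable[[str], str], response: str, sources: List[str]) -> bool:
    """Ask the LLM for a YES/NO verdict and convert it to a boolean."""
    prompt = JUDGE_PROMPT.format(context="\n".join(sources), response=response)
    verdict = llm(prompt)
    return verdict.strip().upper().startswith("YES")

def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: answers YES only if the response text appears in the context."""
    context, rest = prompt.split("\n\nResponse:\n", 1)
    response = rest.split("\n\nIs the response", 1)[0]
    return "YES" if response in context else "NO"

sources = ["Run pip install llama-index to install."]
supported = judge_faithfulness(stub_llm, "pip install llama-index", sources)
unsupported = judge_faithfulness(stub_llm, "conda install llama", sources)
```

Swapping `stub_llm` for a call to a strong model is what makes the verdict meaningful, which is why the choice of evaluation LLM matters.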

Lastly, using these methods, we can also use the LLM to generate synthetic questions to evaluate with!

## Set Up the Baseline Query Engine

### Loading our Docs

In [1]:
import openai
import os

#os.environ["OPENAI_API_KEY"] = "API_KEY_HERE"
#openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
import os
import sys
sys.path.append(os.path.join(os.getcwd(), '..'))

In [3]:
from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader
from llama_index import SimpleDirectoryReader

def load_markdown_docs(filepath):
    """Load markdown docs from a directory, excluding all other file types."""
    loader = SimpleDirectoryReader(
        input_dir=filepath, 
        required_exts=[".md"],
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )

    documents = loader.load_data()

    # exclude some metadata from the LLM
    for doc in documents:
        doc.excluded_llm_metadata_keys = ["File Name", "Content Type", "Header Path"]

    return documents

In [4]:
# load our documents from each folder.
# we keep them separate for now, in order to create separate indexes later
getting_started_docs = load_markdown_docs("../docs/getting_started")
community_docs = load_markdown_docs("../docs/community")
data_docs = load_markdown_docs("../docs/core_modules/data_modules")
agent_docs = load_markdown_docs("../docs/core_modules/agent_modules")
model_docs = load_markdown_docs("../docs/core_modules/model_modules")
query_docs = load_markdown_docs("../docs/core_modules/query_modules")
supporting_docs = load_markdown_docs("../docs/core_modules/supporting_modules")
tutorials_docs = load_markdown_docs("../docs/end_to_end_tutorials")
contributing_docs = load_markdown_docs("../docs/development")

### Create the indices

In [5]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

# create a global service context
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))
set_global_service_context(service_context)

In [6]:
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage

# create a vector store index for each folder
try:
    getting_started_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./getting_started_index"))
    community_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./community_index"))
    data_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./data_index"))
    agent_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./agent_index"))
    model_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./model_index"))
    query_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./query_index"))
    supporting_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./supporting_index"))
    tutorials_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./tutorials_index"))
    contributing_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./contributing_index"))
except Exception:
    # any index failed to load, so rebuild and persist all of them
    getting_started_index = VectorStoreIndex.from_documents(getting_started_docs)
    getting_started_index.storage_context.persist(persist_dir="./getting_started_index")

    community_index = VectorStoreIndex.from_documents(community_docs)
    community_index.storage_context.persist(persist_dir="./community_index")

    data_index = VectorStoreIndex.from_documents(data_docs)
    data_index.storage_context.persist(persist_dir="./data_index")

    agent_index = VectorStoreIndex.from_documents(agent_docs)
    agent_index.storage_context.persist(persist_dir="./agent_index")

    model_index = VectorStoreIndex.from_documents(model_docs)
    model_index.storage_context.persist(persist_dir="./model_index")

    query_index = VectorStoreIndex.from_documents(query_docs)
    query_index.storage_context.persist(persist_dir="./query_index")    

    supporting_index = VectorStoreIndex.from_documents(supporting_docs)
    supporting_index.storage_context.persist(persist_dir="./supporting_index")

    tutorials_index = VectorStoreIndex.from_documents(tutorials_docs)
    tutorials_index.storage_context.persist(persist_dir="./tutorials_index")

    contributing_index = VectorStoreIndex.from_documents(contributing_docs)
    contributing_index.storage_context.persist(persist_dir="./contributing_index")
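One consequence of the single `try`/`except` above is that a single missing folder triggers a rebuild of all nine indices. A per-index helper avoids that. The sketch below is an assumption-laden illustration: `load_or_build` and its `load`/`build` callables are hypothetical names, and the demo uses stand-in functions rather than real LlamaIndex calls.

```python
import os
import tempfile
from typing import Any, Callable

def load_or_build(persist_dir: str, load: Callable[[str], Any], build: Callable[[str], Any]) -> Any:
    """Return the cached index if persist_dir exists, otherwise build (and persist) a new one."""
    if os.path.isdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)

# In this notebook the callables could wrap the calls already used above, e.g.
#   load  = lambda d: load_index_from_storage(StorageContext.from_defaults(persist_dir=d))
#   build = a function that calls VectorStoreIndex.from_documents(...) and then
#           index.storage_context.persist(persist_dir=d)

# Offline demo with stand-in callables (hypothetical, for illustration only).
calls = []

def fake_build(d: str) -> str:
    os.makedirs(d)
    calls.append("build")
    return "index"

def fake_load(d: str) -> str:
    calls.append("load")
    return "index"

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "demo_index")
    load_or_build(target, fake_load, fake_build)  # first call builds
    load_or_build(target, fake_load, fake_build)  # second call loads from disk
```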

### Create Query Engine Tools

Since we have so many indices, we can create a query engine tool for each and then use them in a single query engine!

In [7]:
from llama_index.tools import QueryEngineTool

# create a query engine tool for each folder
getting_started_tool = QueryEngineTool.from_defaults(
    query_engine=getting_started_index.as_query_engine(), 
    name="Getting Started", 
    description="Useful for answering questions about installing and running llama index, as well as basic explanations of how llama index works."
)

community_tool = QueryEngineTool.from_defaults(
    query_engine=community_index.as_query_engine(),
    name="Community",
    description="Useful for answering questions about integrations and other apps built by the community."
)

data_tool = QueryEngineTool.from_defaults(
    query_engine=data_index.as_query_engine(),
    name="Data Modules",
    description="Useful for answering questions about data loaders, documents, nodes, and index structures."
)

agent_tool = QueryEngineTool.from_defaults(
    query_engine=agent_index.as_query_engine(),
    name="Agent Modules",
    description="Useful for answering questions about data agents, agent configurations, and tools."
)

model_tool = QueryEngineTool.from_defaults(
    query_engine=model_index.as_query_engine(),
    name="Model Modules",
    description="Useful for answering questions about using and configuring LLMs, embedding models, and prompts."
)

query_tool = QueryEngineTool.from_defaults(
    query_engine=query_index.as_query_engine(),
    name="Query Modules",
    description="Useful for answering questions about query engines, query configurations, and using various parts of the query engine pipeline."
)

supporting_tool = QueryEngineTool.from_defaults(
    query_engine=supporting_index.as_query_engine(),
    name="Supporting Modules",
    description="Useful for answering questions about supporting modules, such as callbacks, service context, and evaluation."
)

tutorials_tool = QueryEngineTool.from_defaults(
    query_engine=tutorials_index.as_query_engine(),
    name="Tutorials",
    description="Useful for answering questions about end-to-end tutorials and giving examples of specific use-cases."
)

contributing_tool = QueryEngineTool.from_defaults(
    query_engine=contributing_index.as_query_engine(),
    name="Contributing",
    description="Useful for answering questions about contributing to llama index, including how to contribute to the codebase and how to build documentation."
)
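The nine tool definitions above all share one shape: an index, a name, and a description. As a sketch, they could be generated from a single mapping. `build_tools` and `make_tool` are hypothetical names; in this notebook `make_tool` would be `QueryEngineTool.from_defaults`, while the demo uses stand-ins so it runs without LlamaIndex.

```python
from typing import Any, Callable, Dict, List, Tuple

def build_tools(specs: Dict[str, Tuple[Any, str]], make_tool: Callable[..., Any]) -> List[Any]:
    """Create one tool per entry; specs maps tool name -> (index, description)."""
    return [
        make_tool(query_engine=index.as_query_engine(), name=name, description=description)
        for name, (index, description) in specs.items()
    ]

# Stand-ins for illustration only (not LlamaIndex classes).
class FakeIndex:
    def as_query_engine(self) -> str:
        return "engine"

def fake_make_tool(query_engine: Any, name: str, description: str) -> dict:
    return {"query_engine": query_engine, "name": name, "description": description}

tools = build_tools(
    {
        "Getting Started": (FakeIndex(), "Installing and running llama index."),
        "Community": (FakeIndex(), "Integrations built by the community."),
    },
    fake_make_tool,
)
```

Keeping names and descriptions in one mapping also makes it easier to review the descriptions side by side, since the sub-question engine relies on them to pick a tool.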

### Create Unified Query Engine

In [8]:
# needed for notebooks
import nest_asyncio
nest_asyncio.apply()

from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        getting_started_tool,
        community_tool,
        data_tool,
        agent_tool,
        model_tool,
        query_tool,
        supporting_tool,
        tutorials_tool,
        contributing_tool
    ],
    # enable this for streaming
    # response_synthesizer=get_response_synthesizer(streaming=True),
    verbose=False
)

### Test the Query Engine!

In [9]:
response = query_engine.query("How do I install llama index?")
print(str(response))

To install Llama Index, you can follow these steps:

1. Clone the repository by running the following command in your terminal:
   `git clone https://github.com/jerryjliu/llama_index.git`

2. Once the repository is cloned, navigate to the cloned directory.

3. If you want to do an editable install (where you can modify source files), run the command:
   `pip install -e .`

4. If you want to install optional dependencies and dependencies used for development (such as unit testing), run the command:
   `pip install -r requirements.txt`

After completing these steps, Llama Index should be downloaded and installed on your system.


## Evaluate the Baseline!

Now that we have our baseline query engine created, we can create a basic evaluation pipeline!

Our pipeline will:

- Generate a small dataset of questions
- Save/cache these questions (so we can properly compare performance later!)
- Evaluate both response quality and hallucination

To do this reliably, we need to use an LLM smarter than `gpt-3.5-turbo`, so we will set up `gpt-4` for the evaluation process!

### Generate the Dataset

In order to make the question generation more efficient, we can remove small documents and combine all documents into a single giant document.

I also modify the question generation prompt, to generate a single question for each chunk, along with extra context for what it is reading.
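The "remove small documents" step can be sketched as a length filter applied before joining. This is only an illustration on plain strings: `combine_texts` and the `min_chars` threshold are hypothetical, and the value 50 is an arbitrary assumption, not a setting taken from this notebook.

```python
from typing import List

def combine_texts(texts: List[str], min_chars: int = 50) -> str:
    """Drop texts shorter than min_chars, then join the rest into one string."""
    kept = [t for t in texts if len(t.strip()) >= min_chars]
    return "\n\n".join(kept)

sample_texts = [
    "# Title",  # too short, dropped
    "LlamaIndex connects your data to large language models. " * 2,
    "ok",  # too short, dropped
]
combined = combine_texts(sample_texts)
```

With loaded documents, the same helper could be applied as `Document(text=combine_texts([doc.text for doc in documents]))`.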

In [10]:
from llama_index import Document

documents = SimpleDirectoryReader("../docs", recursive=True, required_exts=[".md"]).load_data()

all_text = ""

for doc in documents:
    all_text += doc.text

giant_document = Document(text=all_text)

In [11]:
import os
import random
random.seed(42)

from llama_index import ServiceContext
from llama_index.prompts import Prompt
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

# note: the keyword is `model`, matching the gpt-3.5-turbo setup earlier
gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", "r") as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(
        [giant_document],
        text_question_template=Prompt(
            "A sample from the LlamaIndex documentation is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Using the documentation sample, carefully follow the instructions below:\n"
            "{query_str}"
        ),
        question_gen_query=(
            "You are an evaluator for a search pipeline. Your task is to write a single question "
            "using the provided documentation sample above to test the search pipeline. The question should "
            "reference specific names, functions, and terms. Restrict the question to the "
            "context information provided.\n"
            "Question: "
        ),
        # use the GPT-4 service context for question generation
        service_context=gpt4_service_context
    )
    generated_questions = data_generator.generate_questions_from_nodes()

    # randomly pick 40 of the generated questions
    generated_questions = random.sample(generated_questions, 40)
    question_dataset.extend(generated_questions)

    print(f"Generated {len(question_dataset)} questions.")

    # save the questions!
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

In [13]:
print(random.sample(question_dataset, 5))

['What are the node postprocessors available in the LlamaIndex documentation?', 'What are the available options for the storage backend of the index store in LlamaIndex?', 'What are the three primary sections within the layout of the ChatView component?', 'What embedding model does LlamaIndex use by default?', 'What is the purpose of the `load_collection_model` function in the LlamaIndex documentation?']


### Evaluate with the Dataset

Now that we have our dataset, let's measure performance!

#### Evaluating Response for Hallucination

In [20]:
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()

from llama_index import Response

def evaluate_query_engine(evaluator, query_engine, questions):
    async def run_query(query_engine, q):
        try:
            return await query_engine.aquery(q)
        except Exception:
            return Response(response="Error, query failed.")

    total_correct = 0
    all_results = []
    # process the questions in batches of 5
    for start in range(0, len(questions), 5):
        batch_qs = questions[start:start+5]

        tasks = [run_query(query_engine, q) for q in batch_qs]
        responses = asyncio.run(asyncio.gather(*tasks))
        # ceiling division, so a partial final batch is counted too
        print(f"finished batch {(start // 5) + 1} out of {-(-len(questions) // 5)}")

        for response in responses:
            if evaluator.evaluate_response(response=response).passing: 
                eval_result = 1
            else:
                eval_result = 0
            total_correct += eval_result
            all_results.append(eval_result)

        
        # helps avoid rate limits
        time.sleep(1)

    return total_correct, all_results

In [21]:
from llama_index.evaluation import FaithfulnessEvaluator

# gpt-4 evaluator!
evaluator = FaithfulnessEvaluator(service_context=gpt4_service_context)

total_correct, all_results = evaluate_query_engine(evaluator, query_engine, question_dataset)

print(f"Hallucination? Scored {total_correct} out of {len(question_dataset)} questions correctly.")

finished batch 1 out of 8
finished batch 2 out of 8
finished batch 3 out of 8
finished batch 4 out of 8
finished batch 5 out of 8
finished batch 6 out of 8
finished batch 7 out of 8
finished batch 8 out of 8
Hallucination? Scored 31 out of 40 questions correctly.


#### Investigating Hallucinations

In [22]:
import numpy as np

hallucinated_questions = np.array(question_dataset)[np.array(all_results) == 0]
print(hallucinated_questions)

['How can I convert tools to LangChain tools using the provided documentation sample?'
 'What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?'
 'What are the available options for the storage backend of the index store in LlamaIndex?'
 'What is the purpose of the "router query engine" in the LlamaIndex framework?'
 'What is the purpose of the `fetchDocuments` function in the `fetchDocuments.tsx` file in the React frontend?'
 "What is the function used to retrieve the collections for the logged-in user in the Delphic project's frontend?"
 'What is the purpose of the Algovera tool built on top of LlamaIndex?'
 'What are the three primary sections within the layout of the ChatView component?'
 'What is the purpose of the SQLTableNodeMapping object in the LlamaIndex documentation sample?']
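The NumPy boolean mask above works because `all_results` is aligned with `question_dataset`. The same selection can be sketched in plain Python with an explicit length check. `failed_questions` is a hypothetical helper, and the demo inputs are placeholders rather than real results.

```python
from typing import List, Sequence

def failed_questions(questions: Sequence[str], results: Sequence[int]) -> List[str]:
    """Return the questions whose evaluation result is 0 (failed)."""
    if len(questions) != len(results):
        raise ValueError("questions and results must have the same length")
    return [q for q, r in zip(questions, results) if r == 0]

demo_questions = ["q1", "q2", "q3"]  # placeholder questions
demo_results = [1, 0, 0]             # placeholder pass/fail flags
failures = failed_questions(demo_questions, demo_results)
```

The length check is the main reason to prefer this form: a silent mismatch between questions and results would otherwise attribute failures to the wrong questions.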


In [23]:
response = query_engine.query('What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=1000))

The purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation is not provided in the given context information.
-----------------
> Source (Doc id: 2a787a34-1b5d-4b64-860d-6c945c58ffa0): Sub question: What is the purpose of the GuidancePydanticProgram class?
Response: The purpose of the GuidancePydanticProgram class is not provided in the given context information.

> Source (Doc id: 7af36a27-d9a2-4230-9491-9ef44b27ebd9): Sub question: How do I use and configure LLMs?
Response: To use and configure LLMs, you can refer to the code snippet provided in the context information. Additionally, if you are using other LLM classes from langchain, you may need to explicitly configure the `context_window` and `num_output` via the `ServiceContext` since the information is not available by default.

> Source (Doc id: cc4520d5-38bc-472a-bac7-386818b6d063): Sub question: What is the purpose of the Query Modules?
Response: The purpose of the Query Modules is to provide a generic i

In [24]:
response = query_engine.query('What is the purpose of the `RouterQueryEngine` in LlamaIndex and how can it be used in the search pipeline?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=1000))

The purpose of the `RouterQueryEngine` in LlamaIndex is to allow for query transformations over index structures. It can be used in the search pipeline by taking in a natural language query and returning a rich response. The `RouterQueryEngine` is often built on one or many Indices via Retrievers, and multiple query engines can be composed together to achieve more advanced capability.
-----------------
> Source (Doc id: 30a5b893-072d-4344-ae6e-af10a27d85f4): Sub question: What is the purpose of the `RouterQueryEngine` in LlamaIndex?
Response: The purpose of the `RouterQueryEngine` in LlamaIndex is to allow you to perform query transformations over your index structures. Query transformations are modules that convert a query into another query. They can be single-step, where the transformation is run once before the query is executed against an index, or multi-step, where the query is transformed, executed against an index, the response is retrieved, and subsequent queries are transform

#### Evaluating Response for Answer Quality

In [32]:
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()
from llama_index import Response

def evaluate_query_engine(evaluator, query_engine, questions):
    async def run_query(query_engine, q):
        try:
            return await query_engine.aquery(q)
        except Exception:
            return Response(response="Error, query failed.")

    total_correct = 0
    all_results = []
    # process the questions in batches of 5
    for start in range(0, len(questions), 5):
        batch_qs = questions[start:start+5]

        tasks = [run_query(query_engine, q) for q in batch_qs]
        responses = asyncio.run(asyncio.gather(*tasks))
        # ceiling division, so a partial final batch is counted too
        print(f"finished batch {(start // 5) + 1} out of {-(-len(questions) // 5)}")
 
        for query, response in zip(batch_qs, responses):
    
            if evaluator.evaluate_response(query=query, response=response).passing: 
                eval_result = 1
            else:
                eval_result = 0
            total_correct += eval_result
            all_results.append(eval_result)
        
        # helps avoid rate limits
        time.sleep(1)

    return total_correct, all_results

In [33]:
from llama_index.evaluation import QueryResponseEvaluator

evaluator = QueryResponseEvaluator(service_context=gpt4_service_context)

#evaluator = RelevancyEvaluator(service_context=service_context)

# query index
#query_engine = vector_index.as_query_engine()
#query = "What battles took place in New York City in the American Revolution?"
#response = query_engine.query(query)
#eval_result = evaluator.evaluate_response(query=query, response=response)
#print(str(eval_result))

total_correct, all_results = evaluate_query_engine(evaluator, query_engine, question_dataset)

print(f"Response satisfies the query? Scored {total_correct} out of {len(question_dataset)} questions correctly.")

finished batch 1 out of 8
finished batch 2 out of 8
finished batch 3 out of 8
finished batch 4 out of 8
finished batch 5 out of 8
finished batch 6 out of 8
finished batch 7 out of 8
finished batch 8 out of 8
Response satisfies the query? Scored 20 out of 40 questions correctly.
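Turning the two raw scores into rates makes later comparisons against this baseline easier. A small sketch, using the counts printed in this notebook (31 of 40 for the hallucination check, 20 of 40 for answer quality); `pass_rate` is a hypothetical helper name.

```python
def pass_rate(total_correct: int, total: int) -> float:
    """Fraction of questions that passed an evaluation."""
    if total == 0:
        raise ValueError("no questions were evaluated")
    return total_correct / total

hallucination_rate = pass_rate(31, 40)  # passed the hallucination check
answer_rate = pass_rate(20, 40)         # response judged to satisfy the query
summary = f"hallucination check: {hallucination_rate:.1%}, answer quality: {answer_rate:.1%}"
print(summary)
```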


#### Investigating Incorrect Answers

In [34]:
import numpy as np

unanswered_queries = np.array(question_dataset)[np.array(all_results) == 0]
print(unanswered_queries)

['How can I convert tools to LangChain tools using the provided documentation sample?'
 'What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?'
 'What is the purpose of the SubQuestionQueryEngine class in LlamaIndex?'
 'What is the purpose of the `query_wrapper_prompt` in the `HuggingFaceLLM` class?'
 'What are the available options for the storage backend of the index store in LlamaIndex?'
 'What is the purpose of the LoadAndSearchToolSpec in the LlamaIndex documentation?'
 'What is the purpose of the DEFAULT_REFINE_PROMPT_SEL_LC in the LlamaIndex documentation?'
 "What is the purpose of the `CollectionQueryConsumer` class in the Delphic application's WebSocket handling?"
 'How can I create a Django superuser using the Delphic application?'
 'What is the purpose of the "router query engine" in the LlamaIndex framework?'
 'What is the purpose of the `VectorStoreIndex` class in the LlamaIndex documentation sample?'
 'What is the purpose of the `Refi
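Several questions appear in both failure lists, and those are natural candidates to inspect first. The sketch below shows how to find that overlap. The printed list above is truncated, so both lists here are short excerpts copied from the outputs and the result is illustrative only, not a complete comparison.

```python
# Excerpt of the questions that failed the hallucination check.
hallucinated_excerpt = [
    "How can I convert tools to LangChain tools using the provided documentation sample?",
    "What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?",
    "What is the purpose of the Algovera tool built on top of LlamaIndex?",
]

# Excerpt of the questions that failed the answer-quality check.
unanswered_excerpt = [
    "How can I convert tools to LangChain tools using the provided documentation sample?",
    "What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?",
    "What is the purpose of the SubQuestionQueryEngine class in LlamaIndex?",
]

# Keep the original order while testing membership against a set.
unanswered_set = set(unanswered_excerpt)
failed_both = [q for q in hallucinated_excerpt if q in unanswered_set]
```

In the notebook itself, the full `hallucinated_questions` and `unanswered_queries` arrays would take the place of the excerpts.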

In [35]:
response = query_engine.query('What is the purpose of the `ReActAgent` and how can it be initialized with other agents as tools?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=256))

The purpose of the `ReActAgent` is not provided in the given context information. However, the `ReActAgent` can be initialized with other agents as tools by creating instances of the desired agents and passing them as arguments to the `QueryEngineTool` constructor. These instances are then added to a list of query engine tools, along with their corresponding metadata. Finally, the `ReActAgent` is instantiated using the `from_tools()` method, which takes the list of query engine tools as an argument.
-----------------
> Source (Doc id: e745971e-773c-4e73-a6e5-802ddc1dbe64): Sub question: What is the purpose of the ReActAgent?
Response: The purpose of the ReActAgent is not provided in the given context information.

> Source (Doc id: f860a27d-5f2f-4748-9164-c96e7534d762): Sub question: How can the ReActAgent be initialized with other agents as tools?
Response: The ReActAgent can be initialized with other agents as tools by creating instances of the desired agents and passing them as argu

In [36]:
response = query_engine.query('What is the purpose of the LoadAndSearchToolSpec in the LlamaIndex documentation?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=256))

The purpose of the LoadAndSearchToolSpec in the LlamaIndex documentation is not mentioned in the given context information.
-----------------
> Source (Doc id: 2fdce762-1dd8-4454-94b8-0342e1f6e527): Sub question: What is the purpose of the LoadAndSearchToolSpec in the LlamaIndex documentation?
Response: The purpose of the LoadAndSearchToolSpec in the LlamaIndex documentation is not mentioned in the given context information.

> Source (Doc id: 3f72f10f-76f1-462c-b4b0-cc59fda677c7): Sub question: How does the LoadAndSearchToolSpec work in the LlamaIndex documentation?
Response: The LoadAndSearchToolSpec is not mentioned in the given context information.

> Source (Doc id: 71a15480-fa4a-4603-8729-f60b77736e1d): Sub question: What are the features of the LoadAndSearchToolSpec in the LlamaIndex documentation?
Response: The features of the LoadAndSearchToolSpec in the LlamaIndex documentation are not mentioned in the given context information.

> Source (Doc id: a048c2cb-f494-43b5-bfaa-d97e

## Conclusion

In this notebook, we covered several key topics!

- setting up a sub-question query engine
- generating a dataset of evaluation questions
- evaluating responses for hallucination
- evaluating responses for answer quality