## RAG using LlamaIndex `QueryEngine`
Alfred is hosting a party and needs to find relevant information about the personas attending it. We will therefore use a `QueryEngine` to index and search through a database of personas.

We will be using personas from the [dvilasuero/finepersonas-v0.1-tiny](https://huggingface.co/datasets/dvilasuero/finepersonas-v0.1-tiny) dataset. This dataset contains 5K personas that will be attending the party!

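```python
from datasets import load_dataset
from pathlib import Path

# Download the persona dataset from the Hugging Face Hub
dataset = load_dataset(
    path='dvilasuero/finepersonas-v0.1-tiny',
    split='train'
)

# Write each persona description to its own text file in ./data
data_dir = Path('data')
data_dir.mkdir(parents=True, exist_ok=True)
for i, persona in enumerate(dataset):
    with open(data_dir / f'persona_{i}.txt', 'w', encoding='utf-8') as f:
        f.write(persona['persona'])
```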

### Loading and embedding persona documents
We will use `SimpleDirectoryReader` to load the persona descriptions from the `data` directory. It returns a list of `Document` objects, one per file.
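Conceptually, the reader walks the directory and turns each file into a document carrying some file metadata. The following is a simplified, pure-Python sketch of that idea, assuming plain-text files only: `load_text_documents` is a hypothetical helper that returns plain dictionaries rather than LlamaIndex `Document` objects, and the real reader supports many more file types and metadata fields.

```python
from pathlib import Path
from typing import Any, Dict, List, Union


def load_text_documents(input_dir: Union[str, Path]) -> List[Dict[str, Any]]:
    """Hypothetical, simplified reader: one record per .txt file, sorted by file name."""
    records: List[Dict[str, Any]] = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        records.append(
            {
                "text": path.read_text(encoding="utf-8"),
                "metadata": {
                    "file_path": str(path.resolve()),
                    "file_name": path.name,
                    "file_size": path.stat().st_size,  # size in bytes
                },
            }
        )
    return records
```

Calling `load_text_documents('data')` on the directory created above would return one record per persona file.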

```python
from llama_index.core import SimpleDirectoryReader

# Load every file in ./data as a Document
reader = SimpleDirectoryReader(input_dir='data')
documents = reader.load_data()
len(documents)
```

```
5000
```

Now that we have a list of `Document` objects, we can use an `IngestionPipeline` to create nodes from them and prepare them for the `QueryEngine`. The `SentenceSplitter` splits the documents into smaller chunks, and `HuggingFaceEmbedding` embeds each chunk. To keep this step quick, we only run the pipeline on the first 10 documents.
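`SentenceSplitter` tries to keep sentences intact: it packs whole sentences into chunks up to a size limit (`chunk_size`) and repeats some trailing text in the next chunk (`chunk_overlap`) so that context is not lost at chunk boundaries. The sketch below is a simplified, pure-Python illustration of that idea, not LlamaIndex's implementation: it counts words instead of tokens, splits sentences with a naive regex, and `sentence_chunks` and its default values are our own hypothetical choices.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive sentence boundary: ".", "!" or "?" followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, chunk_size: int = 40, chunk_overlap: int = 10) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` words.

    Trailing sentences totalling at most `chunk_overlap` words are repeated at the
    start of the next chunk. A single sentence longer than `chunk_size` is kept whole.
    """
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and sum(len(s.split()) for s in current) + n > chunk_size:
            chunks.append(" ".join(current))
            # Collect trailing sentences to carry over as overlap.
            overlap: List[str] = []
            words = 0
            for s in reversed(current):
                w = len(s.split())
                if words + w > chunk_overlap:
                    break
                overlap.insert(0, s)
                words += w
            # Drop the overlap if it would push the next chunk over the limit.
            current = overlap if words + n <= chunk_size else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in sentence_chunks("One two three. Four five six. Seven eight nine.", chunk_size=6, chunk_overlap=3):
    print(chunk)
# One two three. Four five six.
# Four five six. Seven eight nine.
```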

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# Create the pipeline: split documents into chunks, then embed each chunk
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5'),
    ]
)

# Run asynchronously on the first 10 documents (top-level `await` works in Jupyter)
nodes = await pipeline.arun(documents=documents[:10])
nodes
```

```
[TextNode(id_='6bbf9d1f-b2c8-4c8b-9bb8-9ab6cad94aaa', embedding=[-0.04367109015583992, -0.0038708068896085024, 0.02097388729453087, 0.02056472934782505, ...
```

### Storing and indexing the documents

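Since we are using an ingestion pipeline, we can attach a vector store directly to the pipeline so that it is populated as the pipeline runs. Here we use Chroma to store our documents and run the pipeline again with the vector store attached. The `IngestionPipeline` caches the output of each transformation, so re-running a pipeline on documents it has already processed avoids recomputing them. By default this is an in-memory cache owned by the pipeline object, so the new pipeline created below starts with an empty cache; with only 10 documents, it is still quick.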
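The caching idea can be sketched in a few lines: each transformation's output is stored under a hash of its input texts and of the transformation, so feeding the same documents through the same step again becomes a lookup instead of a recomputation. This is a toy illustration under our own assumptions: `ToyIngestionCache`, `run_pipeline` and the step names are hypothetical, and LlamaIndex's `IngestionCache` works on nodes rather than plain strings.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A transformation maps a list of texts to a new list of texts (e.g. split, embed).
Transform = Callable[[List[str]], List[str]]


class ToyIngestionCache:
    """Hypothetical cache: hash(input texts + transformation name) -> transformation output."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(texts: List[str], transform_name: str) -> str:
        digest = hashlib.sha256()
        for text in texts:
            digest.update(text.encode("utf-8"))
            digest.update(b"\x00")  # separator so ["ab"] and ["a", "b"] hash differently
        digest.update(transform_name.encode("utf-8"))
        return digest.hexdigest()

    def run(self, texts: List[str], transform_name: str, transform: Transform) -> List[str]:
        k = self.key(texts, transform_name)
        if k in self._store:
            self.hits += 1  # already computed: reuse the stored output
            return self._store[k]
        self.misses += 1  # first time: compute and remember
        output = transform(texts)
        self._store[k] = output
        return output


def run_pipeline(texts: List[str], steps: List[Tuple[str, Transform]], cache: ToyIngestionCache) -> List[str]:
    """Apply each (name, transformation) step in order, consulting the cache at every step."""
    for name, transform in steps:
        texts = cache.run(texts, name, transform)
    return texts


# Usage: the second run over the same documents is served from the cache.
cache = ToyIngestionCache()
steps = [("lowercase", lambda texts: [t.lower() for t in texts])]
run_pipeline(["A Chef.", "A Pilot."], steps, cache)
run_pipeline(["A Chef.", "A Pilot."], steps, cache)
print(cache.hits, cache.misses)  # 1 1
```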

```python
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persist Chroma data on disk in ./alfred_chroma_db
db = chromadb.PersistentClient(path='./alfred_chroma_db')
chroma_collection = db.get_or_create_collection('alfred')
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Same transformations as before, but the resulting nodes are also written to the vector store
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5'),
    ],
    vector_store=vector_store,
)

nodes = await pipeline.arun(documents=documents[:10])
len(nodes)
```

```
10
```

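We can create a `VectorStoreIndex` on top of the vector store by passing the vector store and an embedding model to `VectorStoreIndex.from_vector_store()`. The embedding model must be the same one used in the ingestion pipeline: queries are embedded with this model and compared against the stored chunk embeddings, so both have to live in the same vector space.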
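Retrieval boils down to embedding the query with the same model as the stored chunks and ranking chunks by vector similarity. A toy in-memory illustration, assuming bag-of-words "embeddings" and exact cosine-similarity search: `VOCAB`, `embed` and `ToyVectorStore` are hypothetical, whereas in this notebook the embeddings come from `BAAI/bge-small-en-v1.5` and Chroma performs the nearest-neighbour search.

```python
import math
import re
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; a real embedding model produces learned dense vectors instead.
VOCAB = ["author", "travel", "culture", "science", "history"]


def embed(text: str) -> List[float]:
    """Toy bag-of-words 'embedding': how often each VOCAB word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: node id -> embedding, with exact top-k search."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[float]] = {}

    def add(self, node_id: str, embedding: List[float]) -> None:
        self._rows[node_id] = embedding

    def query(self, query_embedding: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
        scored = [(node_id, cosine(query_embedding, emb)) for node_id, emb in self._rows.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
personas = {
    "persona_0": "An author who writes about travel and culture.",
    "persona_1": "A science teacher with a love of history.",
}
for node_id, text in personas.items():
    store.add(node_id, embed(text))  # ingestion uses embed() ...

# ... and so does the query, otherwise the vectors would not be comparable.
print(store.query(embed("Which author writes about travel?"), top_k=1))
```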

```python
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Must match the embedding model used during ingestion
embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5')
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
)
```

We don't need to persist the index to disk separately: the nodes and their embeddings are already stored by Chroma's `PersistentClient` in the `./alfred_chroma_db` directory, so the index can be rebuilt from the vector store at any time.

### Querying the index

Under the hood, the query engine does not just pass the question to the LLM: it also uses a `ResponseSynthesizer` to decide how the retrieved chunks are turned into a response. This strategy is fully customisable, but three main strategies work well out of the box:

- `refine`: create and refine an answer by going through each retrieved text chunk sequentially. This makes a separate LLM call per node (retrieved chunk).
- `compact` (default): similar to `refine`, but concatenates the chunks beforehand so that fewer LLM calls are needed.
- `tree_summarize`: answer the question over the retrieved chunks, then recursively summarize those answers in a tree structure until a single final answer remains.
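The main practical difference between the three strategies is how many LLM calls they make. The toy simulation below only counts those calls; it is a simplified sketch, not LlamaIndex's `ResponseSynthesizer` code. `CountingLLM`, `pack`, `max_chars` and `fan_in` are hypothetical: LlamaIndex packs chunks by tokens to fit the model's context window, while this sketch packs by characters.

```python
from typing import Callable, List

LLM = Callable[[str], str]


class CountingLLM:
    """Hypothetical stand-in for an LLM: counts calls and returns a placeholder answer."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer {self.calls}"


def pack(chunks: List[str], max_chars: int) -> List[str]:
    """Concatenate neighbouring chunks while the result stays within max_chars."""
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        if current and len(current) + 1 + len(chunk) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = f"{current}\n{chunk}" if current else chunk
    if current:
        packed.append(current)
    return packed


def refine(llm: LLM, question: str, chunks: List[str]) -> str:
    """One LLM call per chunk; each call refines the previous answer."""
    if not chunks:
        raise ValueError("need at least one chunk")
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context: {chunk}\nQuestion: {question}")
    return answer


def compact(llm: LLM, question: str, chunks: List[str], max_chars: int = 1000) -> str:
    """Like refine, but chunks are packed together first, so fewer calls are made."""
    return refine(llm, question, pack(chunks, max_chars))


def tree_summarize(llm: LLM, question: str, chunks: List[str], max_chars: int = 1000, fan_in: int = 2) -> str:
    """Answer each packed chunk, then merge answers fan_in at a time until one remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    answers = [llm(f"Context: {c}\nQuestion: {question}") for c in pack(chunks, max_chars)]
    while len(answers) > 1:
        answers = [
            llm("Partial answers:\n" + "\n".join(answers[i:i + fan_in]) + f"\nQuestion: {question}")
            for i in range(0, len(answers), fan_in)
        ]
    return answers[0]


chunks = ["x" * 300] * 8  # eight retrieved chunks of 300 characters each
for strategy in (refine, compact, tree_summarize):
    llm = CountingLLM()
    strategy(llm, "Which persona writes about travel?", chunks)
    print(strategy.__name__, llm.calls)
# refine 8
# compact 3
# tree_summarize 6
```

Here `compact` packs the eight chunks into 3 prompts, while `tree_summarize` makes 3 leaf calls plus 3 calls to merge the partial answers.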

```python
from llama_index.llms.openai import OpenAI
import nest_asyncio
from dotenv import load_dotenv

# Jupyter already runs an asyncio event loop; nest_asyncio lets code that
# starts its own event loop (as parts of LlamaIndex do) run inside it
nest_asyncio.apply()

# Load OPENAI_API_KEY from a .env file; OpenAI() reads it from the environment
load_dotenv()
llm = OpenAI()

query_engine = index.as_query_engine(
    llm=llm,
    response_mode='tree_summarize',
)

response = query_engine.query(
    'Respond using a persona that describes author and travel experiences?'
)
response
```

```
Response(response='A seasoned anthropologist and cultural expert with a profound interest in delving into the depths of various cultures around the world. Having dedicated significant time to researching and immersing myself in the rich tapestry of societies, I have cultivated a unique perspective on the intricacies of different customs, histories, and ways of life. My travels have not only broadened my understanding of diverse cultures but have also allowed me to appreciate the nuances and complexities that shape human societies across the globe.', source_nodes=[NodeWithScore(node=TextNode(id_='16108cc5-4030-4c1c-8f47-2848d8cf38ea', embedding=None, metadata={'file_path': '.../data/persona_1.txt', 'file_name': 'persona_1.txt', 'file_type': 'text/plain', 'file_size': 266, 'creation_date': '2025-04-28', 'last_modified_date': '2025-04-28'}, ...
```

### Evaluation and observability

LlamaIndex provides built-in evaluation tools to assess response quality. These evaluators leverage LLMs to analyze responses across different dimensions. Let’s look at the three main evaluators available:

- `FaithfulnessEvaluator`: checks whether the answer is supported by the retrieved context, i.e. whether the LLM made something up.
- `AnswerRelevancyEvaluator`: checks whether the answer actually addresses the question.
- `CorrectnessEvaluator`: checks whether the answer is correct by comparing it against a reference answer.
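These evaluators follow the LLM-as-a-judge pattern: the evaluator sends the answer, together with the context (or the question, or a reference answer), to an LLM and turns its verdict into a pass/fail result. A minimal sketch of a faithfulness check under that assumption; `ToyEvalResult`, `judge_faithfulness`, the prompt wording and `stub_judge` (a crude substring check standing in for a real LLM) are all hypothetical, and LlamaIndex's `FaithfulnessEvaluator` uses its own prompt templates and returns an `EvaluationResult`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyEvalResult:
    passing: bool
    feedback: str


def judge_faithfulness(answer: str, contexts: List[str], judge: Callable[[str], str]) -> ToyEvalResult:
    """LLM-as-a-judge: ask a model whether `answer` is supported by `contexts`."""
    prompt = (
        "Decide whether the answer is fully supported by the context.\n"
        "Reply YES or NO, then explain briefly.\n\n"
        "Context:\n" + "\n---\n".join(contexts) + "\n\n"
        f"Answer:\n{answer}\n"
    )
    verdict = judge(prompt).strip()
    # Anything that does not start with YES counts as a failure.
    return ToyEvalResult(passing=verdict.upper().startswith("YES"), feedback=verdict)


def stub_judge(prompt: str) -> str:
    """Hypothetical judge stub: passes only if the answer text appears verbatim in the context."""
    context, answer = prompt.rsplit("Answer:\n", 1)
    supported = answer.strip().rstrip(".").lower() in context.lower()
    return "YES: the answer appears in the context" if supported else "NO: not found in the context"


contexts = ["A seasoned anthropologist who travels widely."]
print(judge_faithfulness("A seasoned anthropologist.", contexts, stub_judge).passing)
print(judge_faithfulness("The Battle of Brooklyn was fought in 1776.", contexts, stub_judge).passing)
```

In practice `judge` would be a call to a real LLM, which can recognise paraphrases that a substring check misses.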

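```python
from llama_index.core.evaluation import FaithfulnessEvaluator

# Use the same LLM as a judge of faithfulness
evaluator = FaithfulnessEvaluator(llm=llm)

# Query the index
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)

# Check whether the response is supported by the retrieved source nodes
eval_result = evaluator.evaluate_response(response=response)
eval_result.passing
```

```
False
```

The question has nothing to do with the personas in our index, so the answer is most likely not supported by the retrieved persona chunks, which would explain why the faithfulness check fails.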

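#### Observability with LlamaTrace

[LlamaTrace](https://phoenix.arize.com/llamatrace/) lets us trace what happens during each query. Once the global handler in the cell below is enabled, all subsequent queries are logged to LlamaTrace and can be viewed at https://llamatrace.com/projects/. Enable it before running the queries you want to trace.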

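```python
import os
from dotenv import load_dotenv
from llama_index.core import set_global_handler

# Requires the llama-index-callbacks-arize-phoenix integration package
load_dotenv()
phoenix_api_key = os.getenv('PHOENIX_API_KEY')

# Authenticate trace exports to LlamaTrace with the Phoenix API key
os.environ['OTEL_EXPORTER_OTLP_HEADERS'] = f'api_key={phoenix_api_key}'
set_global_handler(
    'arize_phoenix',
    endpoint='https://llamatrace.com/v1/traces',
)
```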

