# 🔍 Monitor and Evaluate a RAG application with Argilla

In this tutorial, we will show you how to monitor and evaluate a RAG system with Argilla. We will use the [Argilla](https://argilla.ai/) span handler to log the questions asked of the RAG system together with its answers. We will also use the [Wikipedia data loader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WikipediaDataLoaderDemo/) to build the RAG system on top of a Wikipedia page.

We will log the answers to Argilla and then retrieve the logs to evaluate the performance of the RAG system. The tutorial covers the following steps:

- Setting up the Argilla span handler for LlamaIndex.
- Creating an index with a toy example on Wikipedia.
- Creating a RAG system out of the Wikipedia page, asking questions, and automatically logging the answers to Argilla.
- Retrieving the logs from Argilla and evaluating the performance of the RAG system.

This tutorial is based on the [GitHub Repository Reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo/) made by LlamaIndex.


## Getting started

### Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/).


### Set up the environment

To complete this tutorial, you need to install the `argilla-llama-index` integration and a few third-party libraries via pip.


In [None]:
%pip install -qqq "llama-index" \
            "llama-index-readers-wikipedia" \
            "argilla-llama-index>=2.1.0" \
            "wikipedia" \
            "llama-index-embeddings-huggingface" \
            "llama-index-llms-huggingface-api" \
            "huggingface_hub[inference]" \
            "argilla"

We also need a Hugging Face API token, which is used to query the LLM through the Hugging Face Inference API.
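Rather than hardcoding credentials in the notebook, you may prefer to read them from environment variables. The following is a minimal sketch. The variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY`, and `HF_TOKEN`, as well as the helper `read_setting`, are assumptions made for illustration and are not required by the integration.

```python
import os


def read_setting(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise RuntimeError(f"Please set the {name} environment variable.")


# Hypothetical variable names; the fallbacks are the placeholders used in this tutorial.
api_url = read_setting("ARGILLA_API_URL", default="YOUR_API_URL")
api_key = read_setting("ARGILLA_API_KEY", default="YOUR_API_KEY")
HF_TOKEN = read_setting("HF_TOKEN", default="YOUR_HUGGING_FACE_API_TOKEN")
```

If you go this route, skip the hardcoded assignments of `api_url`, `api_key`, and `HF_TOKEN` in the later cells so they do not overwrite these values.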


## Set up Argilla's LlamaIndex handler

To log your data into Argilla within your LlamaIndex workflow, you only need to initialize the span handler and attach it to the LlamaIndex dispatcher. This ensures that the predictions obtained using LlamaIndex are automatically logged to the Argilla instance. The handler accepts the following arguments:

- `dataset_name`: The name of the dataset. If the dataset does not exist, it will be created with the specified name. Otherwise, it will be updated.
- `api_url`: The URL to connect to the Argilla instance.
- `api_key`: The API key to authenticate with the Argilla instance.
- `number_of_retrievals`: The number of retrieved documents to be logged. Defaults to 0.
- `workspace_name`: The name of the workspace to log the data. By default, the first available workspace.

> For more information about the credentials, check the documentation for [users](https://docs.argilla.io/latest/how_to_guides/user/) and [workspaces](https://docs.argilla.io/latest/how_to_guides/workspace/).


In [5]:
import argilla as rg
import llama_index.core.instrumentation as instrument
from argilla_llama_index import ArgillaHandler
from llama_index.core import (
    Settings,
    VectorStoreIndex,
)

api_key = "YOUR_API_KEY"
api_url = "YOUR_API_URL"
dataset_name = "wikipedia_query_llama_index"

span_handler = ArgillaHandler(
    dataset_name=dataset_name,
    api_url=api_url,
    api_key=api_key,
    number_of_retrievals=2,
)

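# add_span_handler does not return the dispatcher, so get it first and then attach the handler
dispatcher = instrument.get_dispatcher()
dispatcher.add_span_handler(span_handler)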



In [6]:
from llama_index.readers.wikipedia import WikipediaReader

# Initialize WikipediaReader
reader = WikipediaReader()

# Load data from Wikipedia
documents = reader.load_data(pages=["Python (programming language)"])

## Create the index and make some queries


Now, let's create a LlamaIndex index from this document so that we can query the RAG system. We will use two open-source models from the Hugging Face Hub:

- Llama 3.1 8B Instruct: This is an instruction-tuned model by Meta with 8 billion parameters. We will query it through the Hugging Face Inference API.
- bge-small-en-v1.5: This is a small embedding model by BAAI. We will run this model locally.


In [7]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

HF_TOKEN = "YOUR_HUGGING_FACE_API_TOKEN"

# LLM settings
Settings.llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Llama-3.1-8B-Instruct", token=HF_TOKEN
)

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

index = VectorStoreIndex.from_documents(documents)

# Create the query engine with the same similarity top k as the number of retrievals
query_engine = index.as_query_engine(similarity_top_k=2)
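
To make the role of `similarity_top_k` more concrete, here is a toy sketch of top-k retrieval by cosine similarity in plain Python. It is not how LlamaIndex implements retrieval internally. The vectors, the chunk names, and the `top_k` helper are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the names of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vectors,
        key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings"; real embedding vectors are much larger.
chunks = {
    "history": [1.0, 0.0],
    "syntax": [0.0, 1.0],
    "licensing": [0.8, 0.6],
}
query = [1.0, 0.1]

retrieved = top_k(query, chunks, k=2)
print(retrieved)  # ['history', 'licensing']
```

With `k=2`, two chunks are passed to the LLM as context. The tutorial sets the handler's `number_of_retrievals` to the same value as `similarity_top_k`.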

Now that we have an index and a query engine, we can ask the RAG system questions with the query engine's `query` method.

Below we define a few queries that relate to the document we indexed. In a real-world scenario, you would ask questions that are relevant to the document you indexed and your use case. You might want to maintain these queries in a separate file or a dataset.

In [None]:
queries_about_python = [
    "Who invented python?",
    "What is the latest version of python?",
    "What is the license of python?",
    "How is python different from other programming languages?",
    "What is the python software foundation?",
    "What is the python package index?",
    "What is the python package manager?",
    "When was python first released?",
]

for question in queries_about_python:
    response = query_engine.query(question)
    print(response.response)
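
If you keep your queries in a separate file, as suggested above, a small loader is enough. The sketch below assumes a plain-text file with one question per line. The file name `queries.txt` and the helper `load_queries` are hypothetical.

```python
import tempfile
from pathlib import Path


def load_queries(path):
    """Read one query per line, skipping blank lines and lines starting with '#'."""
    queries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            queries.append(line)
    return queries


# Example: write a small file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    queries_file = Path(tmp_dir) / "queries.txt"
    queries_file.write_text(
        "# Questions about Python\n"
        "Who invented python?\n"
        "\n"
        "When was python first released?\n",
        encoding="utf-8",
    )
    loaded_queries = load_queries(queries_file)

print(loaded_queries)
```

The resulting list can be used in place of `queries_about_python` in the loop above.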


The generated responses will be automatically logged in our Argilla instance. Check it out! From Argilla, you can quickly review your predictions and annotate them, so you can combine synthetic data with human feedback.

![Argilla UI](images/llama_index_wikipedia.png)


## Collect feedback from Argilla

After visiting the Argilla UI and adding some feedback, we can retrieve the responses from Argilla and evaluate the performance of the RAG system.

In [32]:
client = rg.Argilla(api_key=api_key, api_url=api_url)

dataset = client.datasets(dataset_name)

ratings = []
responses = []
feedbacks = []

for record in dataset.records(with_responses=True):
    # The last message in the chat field is the answer generated by the RAG system
    answer = record.fields["chat"][-1].content
    responses.append(answer)
    # Collect the annotations submitted in the Argilla UI
    for annotation in record.responses:
        if annotation.question_name == "response-rating":
            ratings.append(annotation.value)
        elif annotation.question_name == "response-feedback":
            feedbacks.append(annotation.value)

print(ratings)

[2, 3, 3, 3, 2, 2]
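
The ratings can then be summarized in whatever way suits your evaluation. As one possible sketch, the snippet below computes the mean rating and the count per rating value with the standard library, using the list printed above as sample input. The helper name `summarize_ratings` is an assumption, and the rating scale depends on how the question is configured in your dataset.

```python
from collections import Counter
from statistics import mean


def summarize_ratings(ratings):
    """Return the number of ratings, their mean, and a count per rating value."""
    if not ratings:
        return {"count": 0, "mean": None, "distribution": {}}
    return {
        "count": len(ratings),
        "mean": mean(ratings),
        "distribution": dict(sorted(Counter(ratings).items())),
    }


sample_ratings = [2, 3, 3, 3, 2, 2]  # the ratings printed above
summary = summarize_ratings(sample_ratings)
print(summary)  # {'count': 6, 'mean': 2.5, 'distribution': {2: 3, 3: 3}}
```

Tracking such a summary across runs gives a simple way to compare different retrieval settings or prompts.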


## Add feedback to prompts

You could use the feedback to improve the performance of the RAG system. For example, you could add the feedback to a prompt to create few-shot examples.

In [34]:
FEW_SHOT_PROMPT = """Respond to this question based on the feedback provided about previous responses."""

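# zip pairs the lists by position and stops at the shortest one,
# so the questions, ratings, responses, and feedback must be in the same order
for question, rating, response, feedback in zip(queries_about_python, ratings, responses, feedbacks):
    FEW_SHOT_PROMPT += f"\n\nQuestion: {question}\nRating: {rating}\nResponse: {response}\nFeedback: {feedback}"
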
print(FEW_SHOT_PROMPT)

Respond to this question based on the feedback provided about previous responses.

Question: Who invented python?
Rating: 2
Response: 3.12.0
Rating: 5
Feedback: Good job!  You got it right!  The latest version of python is indeed 3.12.0.  Well done!  You must have read the text carefully.  Keep up the good work!  You are on a roll!  Keep it up!  You are doing great!  Keep going!  You are almost there!  Keep it up!  You are so close!  Keep going!  You did it!  You got it right!  Well done!  You must have read the text carefully.  Keep up the good work!  You are on a roll!  Keep it up!  You are doing great!  Keep going!  You are almost there!  Keep it up!  You are so close!  Keep going!  You did it!  You got it right!  Well done!  You must have read the text carefully.  Keep up the good work!  You are on a roll!  Keep it up!  You are doing great!  Keep going!  You are almost there!  Keep it up!  You are so close!  Keep going!  You did it!  You got
Feedback: Should only be one sentence

Q
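
As the output above shows, a long model response can dominate the prompt. One option is to shorten each response before adding it. The following is a sketch only. The helper names `truncate` and `build_few_shot_prompt`, the character limit, and the sample record are assumptions for illustration.

```python
def truncate(text, max_chars=200):
    """Shorten text to at most max_chars characters, marking the cut with '...'."""
    text = " ".join(text.split())  # collapse repeated whitespace
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3].rstrip() + "..."


def build_few_shot_prompt(examples, max_chars=200):
    """Build a prompt from (question, rating, response, feedback) tuples."""
    parts = [
        "Respond to this question based on the feedback provided about previous responses."
    ]
    for question, rating, response, feedback in examples:
        parts.append(
            f"Question: {question}\n"
            f"Rating: {rating}\n"
            f"Response: {truncate(response, max_chars)}\n"
            f"Feedback: {feedback}"
        )
    return "\n\n".join(parts)


# Illustrative record loosely modeled on the output above, not read from the dataset.
examples = [
    (
        "What is the latest version of python?",
        2,
        "3.12.0 " + "Keep going! " * 50,
        "Should only be one sentence",
    ),
]
prompt = build_few_shot_prompt(examples, max_chars=40)
print(prompt)
```

Building the examples as tuples per record, rather than zipping separate lists, also keeps each question paired with its own response and feedback.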