# AI Agent Evaluation by Using Dria

This notebook shows how to generate an evaluation set for your AI agents using Dria's [QA pipeline](https://docs.dria.co/factory/qa/). At the end, you evaluate these agents with [promptfoo](https://www.promptfoo.dev/) and inspect the evaluation results.


## Step 1: Initialization


### Install Dependencies

First, install the dependencies for this notebook by running the following code block. We ***recommend*** using your local machine instead of Google Colab, because some of the dependencies used here are incompatible with those preinstalled in Colab. After you create a Python virtual environment, run the following commands:

In [1]:
%pip install --upgrade pip

%pip install requests openai pandas nltk matplotlib firecrawl upstash_vector cohere python-dotenv pyyaml

%pip install dria==0.0.109

Note: you may need to restart the kernel to use updated packages.


### Set Environmental Variables

To run and use external applications in this notebook, you need to have API keys. You can obtain API keys from the following providers' websites:

- **Firecrawl**: https://www.firecrawl.dev/

- **Jina Reader**: https://jina.ai/reader/

- **Upstash**: https://upstash.com/docs/introduction

- **Cohere**: https://cohere.com/

- **OpenAI**: https://openai.com/

- **Open Router**: https://openrouter.ai/docs/api-keys

After obtaining these keys, create a *.env* file with the following content. Note that python-dotenv reads `KEY=value` lines:

```bash
FIRECRAWL_KEY="<YOUR_FIRECRAWL_API_KEY>"
OPEN_ROUTER_KEY="<YOUR_OPEN_ROUTER_API_KEY>"
JINA_KEY="<YOUR_JINA_READER_API_KEY>"
OPENAI_KEY="<YOUR_OPENAI_API_KEY>"
COHERE_KEY="<YOUR_COHERE_API_KEY>"
UPSTASH_KEY="<YOUR_UPSTASH_KEY>"
```

You **do not** need any API keys to use Dria itself. If you intend to run this notebook solely to generate the evaluation set with Dria, you can skip obtaining the API keys. To load the keys from the *.env* file, run the following block:

In [7]:
import os
from dotenv import load_dotenv

load_dotenv()

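# os.getenv returns None for a missing variable instead of raising,
# so the values are checked explicitly rather than wrapped in try/except.
firecrawl_api_key = os.getenv("FIRECRAWL_KEY")
upstash_key = os.getenv("UPSTASH_KEY")
jina_api_key = os.getenv("JINA_KEY")
openai_api_key = os.getenv("OPENAI_KEY")
cohere_api_key = os.getenv("COHERE_KEY")
open_router_key = os.getenv("OPEN_ROUTER_KEY")

keys = {
    "FIRECRAWL_KEY": firecrawl_api_key,
    "UPSTASH_KEY": upstash_key,
    "JINA_KEY": jina_api_key,
    "OPENAI_KEY": openai_api_key,
    "COHERE_KEY": cohere_api_key,
    "OPEN_ROUTER_KEY": open_router_key,
}
missing = [name for name, value in keys.items() if not value]
if missing:
    print(f"Missing keys: {', '.join(missing)}")
else:
    print("All keys loaded successfully")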


All keys loaded successfully


## Step 2: Get Proprietary Data for QA Generation

To generate the evaluation set, we first need proprietary data. This step provides the necessary inputs (context and personas) to the QA pipeline: we extract contexts from documents and generate a personas dataset. Both datasets must be structured appropriately for use with Dria.

### Generate Context Data for QA

One effective method for getting this data is to scrape documents from specific web domains. Using Firecrawl, you can map all URLs under a domain; Jina Reader then fetches the content of each mapped URL, or of a single URL you supply. In this notebook, we use Dria's documentation as the proprietary data. Each result is saved in the following format:

```json
{
    "url": "url",
    "content": "response.text"
}
```

where *url* represents the document's URL and *content* is the fetched content of the document in Markdown format.

In [4]:
import requests
from firecrawl import FirecrawlApp

# Initialize Firecrawl
firecrawl = FirecrawlApp(api_key=firecrawl_api_key)


def fetch_content_with_jina(urls):
    """
    Fetch content from a list of URLs using the Jina Reader.

    Args:
        urls (list): List of URLs to fetch content from.

    Returns:
        list: List of dictionaries containing URL and content.
    """
    content_data = []
    for url in urls:
        try:
            headers = {
                'Authorization': f'Bearer {jina_api_key}',
                'X-Retain-Images': 'none'
            }
            response = requests.get(f'https://r.jina.ai/{url}', headers=headers)
            if response.status_code == 200:
                content_data.append({'url': url, 'content': response.text})
            else:
                print(f"Failed to fetch content for {url}: {response.status_code}")
        except Exception as e:
            print(f"Error fetching content for {url}: {e}")
    return content_data


def scrape_single_url(url):
    """
    Scrape content from a single URL using the Jina Reader.

    Args:
        url (str): URL to scrape.

    Returns:
        dict: Dictionary containing the URL and content.
    """
    try:
        headers = {
            'Authorization': f'Bearer {jina_api_key}',
            'X-Retain-Images': 'none'
        }
        response = requests.get(f'https://r.jina.ai/{url}', headers=headers)
        if response.status_code == 200:
            print(f"Successfully fetched content for {url}")
            return {'url': url, 'content': response.text}
        else:
            print(f"Failed to fetch content for {url}: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None


def map_and_scrape_domain(domain):
    """
    Map all URLs under a domain using Firecrawl and fetch content.

    Args:
        domain (str): The domain to map and scrape.

    Returns:
        list: List of dictionaries containing URLs and content.
    """
    try:
        # Map the domain to gather all URLs
        response = firecrawl.map_url(domain)
        print("Firecrawl Response:", response)  # Log the full response for debugging

        # Check if the 'links' key is present
        if 'links' in response:
            urls = response['links']
            print(f"Mapped {len(urls)} URLs from {domain}")

            # Fetch content for all URLs
            return fetch_content_with_jina(urls)
        else:
            print(f"Unexpected response structure: {response}")
            return []  # Return empty list if mapping fails
    except Exception as e:
        print(f"Error mapping domain {domain}: {e}")
        return []

To simplify scraping content from web domains, you can use the command-line interface below. It lets you scrape either an entire domain or a single URL; after making your selection, enter the corresponding domain or URL. In our example, we chose to scrape the entire [Dria Docs](https://docs.dria.co/) domain.

In [12]:
import json

# A command-line interface for scraping URLs and domains
print("Choose an option:")
print("1. Scrape all URLs under a domain")
print("2. Scrape only the given URL")

choice = input("Enter your choice (1 or 2): ").strip()

if choice == "1":
    domain = input("Enter the domain (e.g., https://example.com): ").strip()
    domain_content = map_and_scrape_domain(domain)
    if domain_content:
        print("Scraping complete. Here's the content:")
        print(domain_content)

        # Save in structured JSON format
        with open("scraped_domain_content.json", "w") as f:
            json.dump(domain_content, f, indent=2)

        print("Data saved to 'scraped_domain_content.json'")
    else:
        print("No content was scraped.")
elif choice == "2":
    url = input("Enter the URL to scrape: ").strip()
    if url:
        result = scrape_single_url(url)
        if result:
            print("Scraping complete. Here's the content:")
            print(result)

            # Save in structured JSON format
            with open("scraped_single_url_content.json", "w") as f:
                json.dump([result], f, indent=2)
                
            print("Data saved to 'scraped_single_url_content.json'")
        else:
            print("Failed to scrape the URL.")
    else:
        print("Invalid URL.")
else:
    print("Invalid choice. Exiting.")

Choose an option:
1. Scrape all URLs under a domain
2. Scrape only the given URL
Firecrawl Response: {'success': True, 'links': ['https://docs.dria.co', 'https://docs.dria.co/installation', 'https://docs.dria.co/node', 'https://docs.dria.co/quickstart', 'https://docs.dria.co/factory/search', 'https://docs.dria.co/cookbook/eval', 'https://docs.dria.co/factory/text_retrieval', 'https://docs.dria.co/cookbook/nemotron_qa', 'https://docs.dria.co/factory/quality_evolution', 'https://docs.dria.co/factory/multihopqa', 'https://docs.dria.co/factory/csv_extender', 'https://docs.dria.co/factory/instruction_backtranslation', 'https://docs.dria.co/factory/list_extender', 'https://docs.dria.co/factory/qa', 'https://docs.dria.co/factory/subtopic', 'https://docs.dria.co/factory/iterate_code', 'https://docs.dria.co/factory/text_matching', 'https://docs.dria.co/factory/web_multi_choice', 'https://docs.dria.co/factory/persona', 'https://docs.dria.co/factory/evolve_complexity', 'https://docs.dria.co/facto
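
Before moving on, it can be worth checking that the saved records match the `url`/`content` format described earlier, since pages that failed to load or came back empty would otherwise produce empty contexts later. The helper below is a minimal sketch on toy records; `is_valid_record` is a hypothetical name and not part of Firecrawl or Jina Reader.

```python
def is_valid_record(record):
    """Return True if the record has non-empty string 'url' and 'content' fields."""
    return all(
        isinstance(record.get(key), str) and bool(record[key].strip())
        for key in ("url", "content")
    )


# Toy records standing in for the scraped output (assumption: real records
# have the same two keys as the format shown above)
toy_records = [
    {"url": "https://example.com/a", "content": "# Title\n\nSome Markdown."},
    {"url": "https://example.com/b", "content": ""},  # empty page
]

valid_records = [r for r in toy_records if is_valid_record(r)]
print(f"{len(valid_records)} of {len(toy_records)} records are usable")
```

The same filter could be applied to the list loaded from `scraped_domain_content.json` before combining it with personas.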

### Generate Personas Data with Dria

The next step in generating proprietary data is obtaining a personas dataset to feed into the QA pipeline. Dria provides a [Persona Pipeline](https://docs.dria.co/factory/persona/) made up of four singletons that generate backstories or bios for characters based on their traits and a simulation description. The pipeline offers two schemas: PersonaBio for short bios and PersonaBackstory for longer backstories. In this notebook, we generate short bios that describe each character's traits and background knowledge.

You can use a single language model or several to generate these bios, and you can replace the simulation description with your own. To ensure a structured output, we save the generated dataset in JSON format.

In [15]:
from dria import DriaDataset, DatasetGenerator, Model
from dria.factory.persona import PersonaBio

# Create Dria Dataset with any name and description you want
my_dataset = DriaDataset(
    name="dria-cookbook-personas", 
    description="A persona dataset for agent evaluation by using QA pipeline",
    schema=PersonaBio[-1].OutputSchema,
)

# Create generator
generator = DatasetGenerator(dataset=my_dataset)

# Define your simulation description
simulation_desc = """AI engineers and researchers trying to generate high-quality synthetic data with Dria. Dria is the only synthetic data infrastructure that you can balance data quality, diversity, and complexity all together in a single interface."""

# Define instructions with simulation description and number of samples
instructions = [
    {
        "simulation_description": simulation_desc,
        "num_of_samples": 10,
    }
]

# Generate personas data using the generator
# You can use a single model or a list of models
await generator.generate(
    instructions=instructions,
    singletons=PersonaBio,
    models=[
        Model.ANTHROPIC_SONNET_3_5_OR,
        Model.QWEN2_5_72B_OR,
        Model.GPT4O,
    ],
)

# Export results using to_json() method
with open("personas.json", "w") as f:
    my_dataset.to_json(f)

# Print first item as example
df = my_dataset.to_pandas()
print("\nExample Generated Persona:")
print(df.iloc[0]['bio'])

Fetching results...: 100%|██████████| 1/1 [00:17<00:00, 17.04s/it]
Adding entries to DB: 100%|██████████| 10/10 [00:00<00:00, 57.73it/s]
Fetching results...: 100%|██████████| 10/10 [00:29<00:00,  2.91s/it]
Adding entries to DB: 100%|██████████| 10/10 [00:00<00:00, 279.81it/s]
Adding entries to DB: 100%|██████████| 10/10 [00:00<00:00, 697.49it/s]


Example Generated Persona:
Despite living on a low income and managing a chronic Condition A, 58-year-old Michael found solace and exceptional care at Provider C, whose compassionate staff significantly enhanced his healthcare experience, leading to his very satisfied survey completion in 34 minutes, a stark reminder of the value of kindness in health services as AI engineers, like those using Dria, strive to replicate such nuanced human interactions in synthetic data environments.





### Combine Personas with Context

The next step in generating proprietary data is to combine personas data with context data. This step creates comprehensive inputs for the QA pipeline by pairing each persona with every piece of content. This approach ensures maximum coverage for evaluation by creating all possible combinations.

The combined data follows this JSON format:
```json
{
  "persona_bio": "<PERSONA BIO>",
  "context": "<FETCHED CONTENT>",
  "url": "<URL OF THE CONTENT>"
}
```

Here's how the combination process works:

1. **Load Data**: First, we load both the scraped content and personas from their respective JSON files
2. **Create Combinations**: Using `itertools.product()`, we generate every possible pairing of personas with content
3. **Structure Data**: Each combination is formatted into our desired JSON structure
4. **Save Results**: The combined data is saved to a new JSON file for further processing

For example, if we have:
- 10 personas
- 20 pieces of content

The output will contain 200 combinations (10 × 20), ensuring that each persona is evaluated against every piece of content.

This comprehensive pairing approach helps ensure thorough evaluation of how different personas interact with various content contexts.
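
The arithmetic and the ordering of the pairs can be checked on toy data. This is only an illustrative sketch: the `toy_` names and placeholder values are assumptions, chosen so they do not clash with the real variables used in the notebook.

```python
import itertools

# Toy stand-ins for the real files: 10 personas and 20 scraped pages
toy_personas = [{"bio": f"persona {i}"} for i in range(10)]
toy_pages = [
    {"url": f"https://example.com/{j}", "content": f"page {j}"} for j in range(20)
]

# product() varies the second iterable fastest, so the first 20 records
# all belong to the first persona
toy_combined = [
    {"persona_bio": p["bio"], "context": c["content"], "url": c["url"]}
    for p, c in itertools.product(toy_personas, toy_pages)
]

print(len(toy_combined))
print(toy_combined[0]["persona_bio"], "|", toy_combined[20]["persona_bio"])
```

Because the output grows multiplicatively, the number of QA generation requests in the next step grows the same way.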

In [16]:
import json
import itertools

# Load scraped content data
with open("scraped_domain_content.json", "r") as f:
    scraped_content = json.load(f)

# Load personas data
with open("personas.json", "r") as f:
    personas = json.load(f)

# Create all possible combinations of personas and content
combined_data = []
for persona, content in itertools.product(personas, scraped_content):
    combined_data.append({
        "persona_bio": persona["bio"],
        "context": content["content"],
        "url": content["url"]
    })

# Save combined data to a JSON file
with open("combined_data.json", "w") as f:
    json.dump(combined_data, f, indent=2)

print(f"Created {len(combined_data)} combinations")

Created 480 combinations


As an example, the first item of the combined data is:

In [17]:
import json

# Load the combined data
with open("combined_data.json", "r") as f:
    combined_data = json.load(f)

# Display the combined data
display(combined_data[0])

{'persona_bio': 'Despite living on a low income and managing a chronic Condition A, 58-year-old Michael found solace and exceptional care at Provider C, whose compassionate staff significantly enhanced his healthcare experience, leading to his very satisfied survey completion in 34 minutes, a stark reminder of the value of kindness in health services as AI engineers, like those using Dria, strive to replicate such nuanced human interactions in synthetic data environments.',
 'context': 'Title: What is Dria? - Dria Docs\n\nURL Source: https://docs.dria.co/\n\nMarkdown Content:\nDria is the only synthetic data infrastructure that you can balance data quality, diversity, and complexity all together in a single interface.\n\n*   A framework for creating, managing, and orchestrating synthetic data pipelines.\n*   A multi-agent network that can synthesize data from web and siloed sources.\n\n### Why use Dria?[¶](https://docs.dria.co/#why-use-dria "Permanent link")\n\nDria provides the scalab

## Step 3: Generate an Evaluation Dataset by Using Dria




In this step, we enrich the dataset by generating Question-Answer (QA) pairs based on the provided persona and context. To accomplish this, we use the following singletons:

- QuestionGeneration: Generates questions derived from the persona's bio and the given context.
- AnswerGeneration: Produces answers using the persona's bio, the context, and the generated questions.

This approach ensures that each QA pair is tailored to the specific persona and context, creating a well-defined evaluation set for the AI agent. Each record in the output has the following format:

```json
  {
    "persona_bio": "<PERSONA BIO>",
    "context": "<FETCHED CONTENT>",
    "url": "<URL OF THE CONTENT>",
    "question": "<GENERATED QUESTION>",
    "answer": "<GENERATED ANSWER>"
  }

```

In [18]:
from dria import DriaDataset, DatasetGenerator, Model
from qa_pipeline.question import QuestionGeneration
from qa_pipeline.answer import AnswerGeneration
import json
import os

# If the JSON file is in the same directory as the notebook
json_file_path = 'combined_data.json'

# Check if the file exists to prevent errors
if not os.path.exists(json_file_path):
    raise FileNotFoundError(f"The file {json_file_path} does not exist.")

# Open and load the JSON data
with open(json_file_path, 'r', encoding='utf-8') as file:
    instructions = json.load(file)

# Optional: Verify the loaded data
print(f"Loaded {len(instructions)} instruction(s) from {json_file_path}.")

# Initialize the dataset
my_dataset = DriaDataset(
    name="Dria-cookbook-QA_pairs_dataset",
    description=" ",
    schema=AnswerGeneration.OutputSchema
)


# Initialize the generator
generator = DatasetGenerator(dataset=my_dataset)

# Run the asynchronous generate function using await
await generator.generate(
    instructions=instructions,
    singletons=[QuestionGeneration, AnswerGeneration],
    models=[Model.OPENAI,Model.GEMINI,Model.QWEN2_5_7B]
)

# Export the dataset to JSON
my_dataset.to_json("QA_dataset.json")

Loaded 480 instruction(s) from combined_data.json.


2024-12-25 18:28:24,546 - INFO - Background tasks cancelled.


Fetching results...:  98%|█████████▊| 49/50 [00:45<00:00,  1.09it/s]
Adding entries to DB: 100%|██████████| 49/49 [00:00<00:00, 238.41it/s]
Fetching results...:  96%|█████████▌| 48/50 [00:34<00:01,  1.41it/s]
Adding entries to DB: 100%|██████████| 48/48 [00:00<00:00, 517.21it/s]
Fetching results...:  96%|█████████▌| 48/50 [00:46<00:01,  1.04it/s]
Adding entries to DB: 100%|██████████| 48/48 [00:00<00:00, 327.02it/s]
Fetching results...:  98%|█████████▊| 49/50 [00:36<00:00,  1.36it/s]
Adding entries to DB: 100%|██████████| 49/49 [00:00<00:00, 304.39it/s]
Fetching results...:  94%|█████████▍| 47/50 [01:29<00:05,  1.90s/it]
Adding entries to DB: 100%|██████████| 47/47 [00:00<00:00, 311.23it/s]
Fetching results...: 100%|██████████| 50/50 [00:31<00:00,  1.61it/s]
Adding entries to DB: 100%|██████████| 50/50 [00:00<00:00, 497.65it/s]
Fetching results...: 100%|██████████| 50/50 [00:35<00:00,  1.42it/s]
Adding entries to DB: 100%|██████████| 50/50 [00:00<00:00, 577.36it/s]
Fetching results...:

### Convert QA Pairs to Embeddings

Before the evaluation, we convert the question of each QA pair into an embedding vector using OpenAI's embeddings API (`text-embedding-3-small`) and save the resulting pandas DataFrame in CSV format. These embeddings allow quick, vector-based comparison and retrieval when we evaluate the AI agent.

In [19]:
import openai
import pandas as pd

# Load the generated dataset
df = pd.read_json("QA_dataset.json") # Replace with the actual path to the generated dataset

print("Dataset loaded successfully!")

# Save as CSV
df.to_csv("QA_dataset.csv", index=False) # You can change the filename if needed

print("Dataset in JSON saved as CSV!")

# Initialize OpenAI client
client = openai.OpenAI(api_key=openai_api_key)

# A function to get embeddings
def get_embeddings(text):
    try:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding
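    except Exception as e:
        print(f"An error occurred: {e}")
        return None  # Explicitly signal that no embedding was created

# Load the evaluation dataset (to_csv wrote UTF-8, so read it back as UTF-8)
data_frame = pd.read_csv("QA_dataset.csv", encoding="utf-8")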

print("Creating embeddings for the evaluation dataset...")
data_frame['question_embedding'] = data_frame['question'].apply(get_embeddings)

# Save the embeddings
data_frame.to_csv('db_evaluation_embeddings.csv', index=False)

print("Embeddings created and saved successfully!")

Dataset loaded successfully!
Dataset in JSON saved as CSV!
Creating embeddings for the evaluation dataset...


KeyboardInterrupt: 
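
Note that a CSV cell can only hold text, so each embedding list is written out as its string representation and has to be parsed back into a list of floats when the file is reloaded. The sketch below illustrates that round trip with the standard library only, using a toy three-dimensional vector in place of a real embedding and an in-memory buffer in place of a file; pandas' `to_csv` stringifies list-valued cells in the same way.

```python
import ast
import csv
import io

# Write one row whose second cell is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "question_embedding"])
writer.writerow(["What is Dria?", [0.1, -0.2, 0.3]])

# Read it back: the list has become a plain string
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["question_embedding"]
print(type(raw).__name__, raw)

# literal_eval parses Python literals only, so it is safer than eval
embedding = ast.literal_eval(raw)
print(type(embedding).__name__, len(embedding))
```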

## Step 4: Create a Vector Index for Embeddings

In this step, we create a vector index in Upstash and upload the evaluation embeddings generated in the previous step. The index enables vector-based search, that is, efficient retrieval of the stored entries whose embeddings are closest to a query embedding. Specifically, we upsert each question's embedding along with its metadata, which contains the question itself, the golden answer (the answer generated in the previous step), and the relevant context.

In [74]:
import ast

import pandas as pd
from upstash_vector import Index

# Connect to the Upstash index (replace the URL with your own index endpoint)
index = Index(
    url="https://champion-tetra-58691-eu1-vector.upstash.io",
    token=upstash_key
)

# Load the CSV file
csv_path = "db_evaluation_embeddings.csv"  # Replace with your actual file path
data = pd.read_csv(csv_path)

# Prepare the data for upsert
vectors = []
for index_, row in data.iterrows():
    vector_id = f"vector_{index_}"
    # Embeddings are stored as strings in the CSV; literal_eval parses them
    # without executing arbitrary code, unlike eval
    embedding = ast.literal_eval(row["question_embedding"])
    print(f"Uploading: ID={vector_id}, Length={len(embedding)}")  # Log the vector ID and length for debugging
    metadata = {
        "Question": row["question"],
        "Golden Answer": row["answer"],
        "Context": row["context"]
    }
    vectors.append((vector_id, embedding, metadata))

# Upsert the vectors to Upstash
print("Uploading data to Upstash...")
index.upsert(vectors=vectors)
print("Data uploaded successfully!")

Uploading: ID=vector_0, Length=1536
Uploading: ID=vector_1, Length=1536
Uploading: ID=vector_2, Length=1536
Uploading: ID=vector_3, Length=1536
Uploading: ID=vector_4, Length=1536
Uploading: ID=vector_5, Length=1536
Uploading: ID=vector_6, Length=1536
Uploading: ID=vector_7, Length=1536
Uploading: ID=vector_8, Length=1536
Uploading: ID=vector_9, Length=1536
Uploading: ID=vector_10, Length=1536
Uploading: ID=vector_11, Length=1536
Uploading: ID=vector_12, Length=1536
Uploading: ID=vector_13, Length=1536
Uploading: ID=vector_14, Length=1536
Uploading: ID=vector_15, Length=1536
Uploading: ID=vector_16, Length=1536
Uploading: ID=vector_17, Length=1536
Uploading: ID=vector_18, Length=1536
Uploading: ID=vector_19, Length=1536
Uploading: ID=vector_20, Length=1536
Uploading: ID=vector_21, Length=1536
Uploading: ID=vector_22, Length=1536
Uploading: ID=vector_23, Length=1536
Uploading: ID=vector_24, Length=1536
Uploading: ID=vector_25, Length=1536
Uploading: ID=vector_26, Length=1536
Uploading: 

After uploading the data, you can check the result with a sample request:

In [75]:
query_vector = vectors[0][1]  # Using the first vector for querying
results = index.query(vector=query_vector, top_k=1, include_metadata=True)
print("Query Result:", results)

Query Result: [QueryResult(id='vector_0', score=1.0, vector=None, metadata={'Question': "Given Dria's advertised scalability and multi-agent network architecture, and my current struggles with resource constraints and a tight 20-week deadline,  how can Databricks potentially integrate or complement Dria to improve my team's productivity and address the current dissatisfaction with the synthetic data pipeline?", 'Golden Answer': "Okay, let's analyze how Databricks and Dria could potentially integrate to improve your team's productivity given your resource constraints and tight deadline.\n\n**Understanding the Core Strengths of Each Platform:**\n\n*   **Dria:**\n    *   **Synthetic Data Focus:** Dria is specifically designed for creating, managing, and orchestrating synthetic data pipelines.\n    *   **Scalability & Parallelization:** Leverages a multi-agent network to offload compute, avoiding the need for personal GPUs. This is a key benefit for resource-constrained teams.\n    *   **D
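
Conceptually, a `top_k` query returns the stored vectors most similar to the query vector, which is why querying with a vector that is already in the index returns that same entry as the best match. The brute-force sketch below illustrates the idea with cosine similarity on toy vectors. It is an assumption-laden illustration, not a description of how Upstash computes or scales its scores; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=1):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy three-dimensional vectors standing in for real embeddings
toy_vectors = {
    "vector_0": [1.0, 0.0, 0.0],
    "vector_1": [0.6, 0.8, 0.0],
    "vector_2": [0.0, 0.0, 1.0],
}

best_id, best_score = top_k(toy_vectors["vector_0"], toy_vectors, k=1)[0]
print(best_id, round(best_score, 3))
```

A hosted index avoids this linear scan over every stored vector, which matters once the evaluation set grows beyond a few hundred entries.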

## Step 5: Running an Evaluation with Promptfoo

In the last step, we evaluate the AI agent using Promptfoo. We run three different evaluations:

- Vanilla RAG
- RAG + Jina Reranker
- RAG + Cohere Reranker

across multiple models, to see which model performs best with each methodology in our use case.
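
The difference between the vanilla and reranked setups can be sketched in a few lines: vanilla RAG passes all retrieved documents to the model, while a reranker rescores them against the question and keeps only the best one. In the sketch below, `overlap_score` is a deliberately crude word-overlap stand-in for the Jina and Cohere rerankers (an assumption for illustration only), and the documents are toy strings.

```python
def overlap_score(query, document):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(query, documents, top_n=1):
    """Reorder retrieved documents by relevance and keep the best top_n."""
    ranked = sorted(documents, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:top_n]


# Toy retrieval result standing in for the top-k documents from the index
retrieved = [
    "Installation steps for the command line tool.",
    "The QA pipeline generates question and answer pairs from context.",
    "Personas are generated from a simulation description.",
]
query = "how does the qa pipeline generate answer pairs"

vanilla_context = "\n\n".join(retrieved)                 # vanilla RAG: everything retrieved
reranked_context = rerank(query, retrieved, top_n=1)[0]  # reranked RAG: best document only
print(reranked_context)
```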

### Preparing Datasets and Configurations

First, we prepare our datasets:

In [76]:
import pandas as pd
import requests
from upstash_vector import Index
import cohere

# Jina Reranker endpoint and headers
jina_url = "https://api.jina.ai/v1/rerank"
jina_headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {jina_api_key}"
}

# Cohere Reranker client setup
cohere_client = cohere.Client(api_key=cohere_api_key)

# Upstash Index
index = Index(
    url="https://champion-tetra-58691-eu1-vector.upstash.io",
    token=upstash_key
)

# File path
file_path = "db_evaluation_embeddings.csv" # Replace with the actual path to the Embeddings CSV

# Load the CSV
df = pd.read_csv(file_path)

# Ensure columns exist and are of type `object`
for col in ["Simple_context", "Jina_context", "Cohere_context"]:
    if col not in df.columns:
        df[col] = ""
    df[col] = df[col].astype("object")

# Function to check if embedding is valid
def is_valid_embedding(embedding):
    return isinstance(embedding, str) and not pd.isna(embedding)

# Function to fetch documents from Upstash
def fetch_documents_from_upstash(embedding):
    try:
        # Query the Upstash index
        response = index.query(
            vector=eval(embedding),  # Ensure the embedding is a list of floats
            top_k=5,  # Number of top documents to retrieve
            include_metadata=True
        )
        print(f"Upstash Response: {response}")  # Debug: Inspect response structure

        # Extract documents from the response
        documents = []
        for item in response:  # Iterate over the list of QueryResult objects
            # The upsert step stored the source text under the "Context" metadata key
            if item.metadata and "Context" in item.metadata:
                documents.append(item.metadata["Context"])

        # Combine documents into a single string
        return "|||".join(documents)
    except Exception as e:
        print(f"Error fetching documents from Upstash: {e}")
        return ""

# Function to get reranked context from Jina
def get_jina_reranked_context(question, documents):
    try:
        payload = {
            "model": "jina-reranker-v2-base-multilingual",
            "query": question,
            "top_n": 1,
            "documents": documents.split("|||")
        }
        response = requests.post(jina_url, headers=jina_headers, json=payload)
        response.raise_for_status()
        return response.json()["results"][0]["document"]["text"]
    except Exception as e:
        print(f"Jina reranker failed for question: {question}, error: {e}")
        return ""

# Function to get reranked context from Cohere
def get_cohere_reranked_context(question, documents):
    try:
        doc_list = documents.split("|||")  # Split the documents into a list
        response = cohere_client.rerank(
            model="rerank-v3.5",
            query=question,
            documents=doc_list,
            top_n=1
        )
        top_result = response.results[0]  # Get the top result (response is an object, not a dict)
        top_index = top_result.index  # Get the index of the top-ranked document
        return doc_list[top_index]  # Return the document corresponding to the index
    except Exception as e:
        print(f"Cohere reranker failed for question: {question}, error: {e}")
        return ""

# Process each row in the CSV
for idx, row in df.iterrows():
    try:
        print(f"Processing Question: {row['question']}")

        # Check if question_embedding is valid
        if not is_valid_embedding(row["question_embedding"]):
            print(f"Skipping row {idx}: Invalid embedding.")
            continue

        # Fetch Simple Context from Upstash
        simple_context = fetch_documents_from_upstash(row["question_embedding"])
        print(f"Simple_context for row {idx}: {simple_context}")
        df.at[idx, "Simple_context"] = simple_context

        # Apply Jina Reranker
        jina_context = get_jina_reranked_context(row["question"], simple_context)
        print(f"Jina Reranker Result for row {idx}: {jina_context}")
        df.at[idx, "Jina_context"] = jina_context

        # Apply Cohere Reranker
        cohere_context = get_cohere_reranked_context(row["question"], simple_context)
        print(f"Cohere Reranker Result for row {idx}: {cohere_context}")
        df.at[idx, "Cohere_context"] = cohere_context

    except Exception as e:
        print(f"Error processing row {idx}: {e}")

# Save the updated DataFrame to CSV
df.to_csv(file_path, index=False, encoding="utf-8")
print(f"Updated CSV saved to {file_path}")

Processing Question: Given Dria's advertised scalability and multi-agent network architecture, and my current struggles with resource constraints and a tight 20-week deadline,  how can Databricks potentially integrate or complement Dria to improve my team's productivity and address the current dissatisfaction with the synthetic data pipeline?
Upstash Response: [QueryResult(id='vector_0', score=1.0, vector=None, metadata={'Question': "Given Dria's advertised scalability and multi-agent network architecture, and my current struggles with resource constraints and a tight 20-week deadline,  how can Databricks potentially integrate or complement Dria to improve my team's productivity and address the current dissatisfaction with the synthetic data pipeline?", 'Golden Answer': "Okay, let's analyze how Databricks and Dria could potentially integrate to improve your team's productivity given your resource constraints and tight deadline.\n\n**Understanding the Core Strengths of Each Platform:**\

After the updated *db_evaluation_embeddings.csv* file is saved, we prepare YAML configuration files for Promptfoo.

In [67]:
import yaml
import os

# API Key for OpenRouter
open_router_key = os.getenv("OPEN_ROUTER_KEY")

if not open_router_key:
    raise ValueError("Please set the OPEN_ROUTER_KEY environment variable.")

def generate_yaml(config_name, description, context_field, csv_path, providers):
    """
    Generates a YAML configuration file for promptfoo evaluation.

    Args:
        config_name (str): Name of the YAML file.
        description (str): Description of the evaluation.
        context_field (str): Context field to use in the prompt (e.g., Simple_context, Jina_context, Cohere_context).
        csv_path (str): Path to the input CSV file.
        providers (list): List of provider configurations with their API keys.

    Returns:
        None
    """
    yaml_data = {
        "description": description,
        "providers": providers,
        "prompts": [
            {
                "id": f"{config_name}_prompt",
                "label": f"{description} Prompt",
                "raw": f"""
Context:
{{{{{context_field}}}}}

Question:
{{{{question}}}}

Provide a detailed, accurate answer.
"""
            }
        ],
        "tests": csv_path,
        "defaultTest": {
            "assert": [
                {
                    "type": "llm-rubric",
                    "value": """
Evaluate the responses based on the following criteria against the golden answer:
Golden Answer: {{answer}}
- Relevance: How well does the response answer the question?
- Completeness: Does the response fully address the question?
- Clarity: Is the response clear and coherent?

If you don't receive any reference answer, fail all models.
"""
                }
            ]
        },
        "outputPath": f"{config_name}_results.csv"  # Save results in the current directory
    }

    # Write to a YAML file
    yaml_path = f"./{config_name}.yaml"
    with open(yaml_path, 'w') as yaml_file:
        yaml.dump(yaml_data, yaml_file, sort_keys=False)
    print(f"YAML configuration saved: {yaml_path}")

# Define CSV file for tests
csv_path = "db_evaluation_embeddings.csv"

# Provider configurations with API keys
providers = [
    {"id": "openrouter:openai/gpt-4o", "config": {"apiKey": open_router_key}},
    {"id": "openrouter:anthropic/claude-3.5-sonnet:beta", "config": {"apiKey": open_router_key}},
    {"id": "openrouter:x-ai/grok-2-1212", "config": {"apiKey": open_router_key}},
    {"id": "openrouter:meta-llama/llama-3.2-3b-instruct:free", "config": {"apiKey": open_router_key}},
    {"id": "openrouter:meta-llama/llama-3.3-70b-instruct", "config": {"apiKey": open_router_key}},
]

# Generate YAMLs for all configurations
generate_yaml("simple_rag_config", "Simple RAG", "Simple_context", csv_path, providers)
generate_yaml("jina_reranker_config", "Jina Reranker RAG", "Jina_context", csv_path, providers)
generate_yaml("cohere_reranker_config", "Cohere Reranker RAG", "Cohere_context", csv_path, providers)

YAML configuration saved: ./simple_rag_config.yaml
YAML configuration saved: ./jina_reranker_config.yaml
YAML configuration saved: ./cohere_reranker_config.yaml
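
The five-brace sequence in the prompt template is easy to misread. Inside a Python f-string, `{{` and `}}` produce literal braces, so `{{{{{context_field}}}}}` expands to the column name wrapped in double braces, which is the placeholder syntax promptfoo fills from the CSV. The sketch below reproduces only the template part of `generate_yaml`. The helper name `render_prompt_template` is hypothetical and not part of the notebook.

```python
def render_prompt_template(context_field: str) -> str:
    """Build the prompt text with the same brace escaping used in generate_yaml.

    Hypothetical helper for illustration only.
    """
    # '{{{{' -> '{{', '{context_field}' -> the column name, '}}}}' -> '}}'
    return f"""
Context:
{{{{{context_field}}}}}

Question:
{{{{question}}}}

Provide a detailed, accurate answer.
"""


template = render_prompt_template("Simple_context")
print(template)
```

The printed template contains `{{Simple_context}}` and `{{question}}`. The rubric string in `generate_yaml` is a plain (non-f) string, which is why `{{golden_answer}}` is written there with only two braces.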


### Install Promptfoo

Now, we can install Promptfoo and use it for evaluation.

In [19]:
# Fetch and run promptfoo through npx (npm_config_yes=true skips the install prompt)
%env npm_config_yes=true
!npx promptfoo@latest

env: npm_config_yes=true
Usage: promptfoo [options] [command]

Options:
  -V, --version               output the version number
  -h, --help                  display help for command

Commands:
  eval [options]              Evaluate prompts
  init [options] [directory]  Initialize project with dummy files or download
                              an example
  view [options] [directory]  Start browser ui
  redteam                     Red team LLM applications
  share [options] [evalId]    Create a shareable URL of an eval (defaults to
                              most recent)
  
  auth                        Manage authentication
  cache                       Manage cache
  config                      Edit configuration settings
  debug [options]             Display debug information for troubleshooting
  delete [options] <id>       Delete various resources
  export [options] <evalId>   Export an eval record to a JSON file
  feedback [message]          Send feedback to the promptfoo de

We need to add the OpenAI API key to the environment so that GPT-4 can serve as the judge model in our evaluations:

In [90]:
import os

# Add your API key to the environment
os.environ["OPENAI_API_KEY"] = openai_api_key

# Verify the API key is set
if "OPENAI_API_KEY" in os.environ:
    print("OpenAI API Key is set.")

OpenAI API Key is set.
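
Before starting the runs, it may be worth checking that the test CSV actually has a column for every variable the configs reference (`question`, `golden_answer`, and the chosen context field). If a column is missing, the placeholder may be left unfilled and the rubric instructs the judge to fail all models. The following is a minimal sketch of such a check. The helper names `template_variables`, `missing_columns`, and `read_header` are hypothetical, and the check is skipped if the CSV is not in the working directory.

```python
import csv
import os
import re


def template_variables(text):
    """Return the set of {{variable}} names used in a template string."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", text))


def missing_columns(header, required):
    """Return the required names that do not appear in the CSV header."""
    return sorted(set(required) - set(header))


def read_header(csv_file):
    """Read only the first row (the column names) of a CSV file."""
    with open(csv_file, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


# Variables used by the Simple RAG prompt and the llm-rubric assertion
required = template_variables("{{Simple_context}} {{question}} {{golden_answer}}")

csv_file = "db_evaluation_embeddings.csv"
if os.path.exists(csv_file):
    missing = missing_columns(read_header(csv_file), required)
    print("Missing columns:", missing or "none")
else:
    print(f"{csv_file} not found; skipping the check.")
```

The same check can be repeated with `Jina_context` and `Cohere_context` for the other two configurations.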


Finally, we'll run all three evaluations one by one and view the results in the web viewer at the end:

In [92]:
!npx promptfoo@latest eval -c simple_rag_config.yaml --no-progress-bar --no-cache

Cache is disabled.
Running 75 concurrent evaluations with 4 threads...

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ Co… │ Ji… │ Si… │ an… │ co… │ pe… │ qu… │ qu… │ [o… │ [o… │ [o… │ [o… │ [o… │
│     │     │     │     │     │     │     │

In [93]:
!npx promptfoo@latest eval -c jina_reranker_config.yaml --no-progress-bar --no-cache

Cache is disabled.
Running 75 concurrent evaluations with 4 threads...

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ Co… │ Ji… │ Si… │ an… │ co… │ pe… │ qu… │ qu… │ [o… │ [o… │ [o… │ [o… │ [o… │
│     │     │     │     │     │     │     │

In [94]:
!npx promptfoo@latest eval -c cohere_reranker_config.yaml --no-progress-bar --no-cache

Cache is disabled.
Running 75 concurrent evaluations with 4 threads...

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ Co… │ Ji… │ Si… │ an… │ co… │ pe… │ qu… │ qu… │ [o… │ [o… │ [o… │ [o… │ [o… │
│     │     │     │     │     │     │     │

In [95]:
!npx promptfoo@latest view -y

Migrated results from file system to database
Server running at http://localhost:15500 and monitoring for new evals.
Press Ctrl+C to stop the server
^C
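
As an alternative to three separate cells, the evaluation runs could be scripted in a loop. The sketch below only reuses the command and flags shown above. The names `build_eval_command` and `run_all` are hypothetical, and the loop defaults to a dry run that prints the commands without executing them.

```python
import subprocess

CONFIG_NAMES = ["simple_rag_config", "jina_reranker_config", "cohere_reranker_config"]


def build_eval_command(config_name):
    """Return the promptfoo eval command for one generated YAML config."""
    return [
        "npx", "promptfoo@latest", "eval",
        "-c", f"{config_name}.yaml",
        "--no-progress-bar", "--no-cache",
    ]


def run_all(config_names, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    commands = [build_eval_command(name) for name in config_names]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=False: a non-zero exit status may only mean that some
            # assertions failed, so the remaining configs should still run.
            subprocess.run(cmd, check=False)
    return commands


commands = run_all(CONFIG_NAMES)
```

Calling `run_all(CONFIG_NAMES, dry_run=False)` would execute the three evaluations sequentially, after which `npx promptfoo@latest view -y` can be used as before.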
