# 🚀 Automated Evaluation of AH-GPT Responses using PyRIT

## **📌 Overview**
This notebook evaluates the responses of **AH-GPT** using the **PyRIT** framework. It sends the questions from a predefined QA dataset to AH-GPT, scores each response with an LLM-based evaluator, and writes the results to an HTML report.

## **🛠️ Steps in this Notebook**
- **Configuration** - Set up API endpoints and authentication.
- **📋 Load QA Dataset** - Define test questions and expected answers.
- **🚀 Initialize PyRIT** - Configure the testing environment.
- **🔄 Create Chat Threads** - Set up conversation threads.
- **📡 Send Prompts & Evaluate Responses** - Run the main test loop.
- **📊 Generate Report** - Save the results for analysis.

In [1]:
# Standard library
import asyncio
import os
import time
import uuid
from datetime import datetime
from pathlib import Path

# Third-party
import requests
from dotenv import load_dotenv

# PyRIT imports
from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.common.text_helper import generate_dataset_report
from pyrit.orchestrator import AHGPTPromptSendingOrchestrator
from pyrit.prompt_target import AHGPTHttpTarget, AHGPTResponseParser, OpenAIChatTarget
from pyrit.score.evaluator import Evaluator


In [2]:
initialize_pyrit(memory_db_type=IN_MEMORY)


In [3]:
# Load environment variables
load_dotenv()

BASE_URL = os.getenv("AH_GPT_NONPRD_ENDPOINT")
TOKEN = os.getenv("AH_GPT_NONPRD_TOKEN")
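
`os.getenv` returns `None` when a variable is unset, so a missing `.env` entry would only surface later as a malformed request. A minimal sketch of failing fast instead; `require_env` is a hypothetical helper introduced here for illustration and is not part of PyRIT or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    # Hypothetical helper for illustration; not part of PyRIT or python-dotenv.
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the two assignments above could read `BASE_URL = require_env("AH_GPT_NONPRD_ENDPOINT")` and `TOKEN = require_env("AH_GPT_NONPRD_TOKEN")`.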

In [4]:
from pyrit.loaders.test_data_loader import load_test_data

# === Dataset Configuration ===
DATASET_PATH = "tests/data/ahgpt/dataset"  # no trailing slash: the loader call below adds the separator
available_datasets = {
    "general": "qa_general_dataset.yaml",
    "conversational": "qa_conversational_dataset.yaml"
}

# === Evaluator Configuration ===
available_evaluators = {
    "dataset": "assets/AH_Evaluators/ah_gpt/ah_gpt_dataset_evaluator.yaml",
    "content_filter": "assets/AH_Evaluators/ah_gpt/ah_gpt_content_filter_evaluator.yaml",
    "chat": "assets/AH_Evaluators/ah_gpt/ah_gpt_chat_evaluator.yaml"
}

# Select which dataset and evaluator you want to test
selected_dataset = "conversational"
selected_evaluator = "dataset"

# Load the dataset
current_dataset = available_datasets[selected_dataset]
qa_pairs = load_test_data(f"{DATASET_PATH}/{current_dataset}")

# Set evaluator path
evaluator_path = available_evaluators[selected_evaluator]

# Preview the loaded data
print(f"Dataset: {selected_dataset} → {len(qa_pairs)} cases loaded")
print(f"Evaluator: {selected_evaluator}")

Dataset: conversational → 4 cases loaded
Evaluator: dataset


In [5]:
http_prompt_target = AHGPTHttpTarget(
    http_request=f"""
        POST {BASE_URL}
        Content-Type: application/json
        X-Authorization: {TOKEN}
        Accept: */*

        {{
            "message": "{{PROMPT}}",
            "model": "gpt-4o-mini"
        }}
    """,
    prompt_regex_string="{PROMPT}",
    timeout=60.0,
    callback_function=AHGPTResponseParser.parse_response
)

scorer = Evaluator(
    chat_target=OpenAIChatTarget(),
    evaluator_yaml_path=Path(evaluator_path),
    scorer_type="float_scale"
)

orchestrator = AHGPTPromptSendingOrchestrator(
    objective_target=http_prompt_target,
    scorers=[scorer]
)
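
Because the request template is an f-string, `{{` and `}}` are escaped braces: after formatting, the body contains a literal `{PROMPT}` placeholder, which is the text `prompt_regex_string` refers to. The sketch below reproduces that with plain string operations, independent of PyRIT. `build_body` is a hypothetical helper, and the JSON-escaping step is an assumption about what a safe substitution needs, not a description of how `AHGPTHttpTarget` fills the placeholder.

```python
import json

# The f-string turns "{{" into "{", leaving a literal {PROMPT} placeholder.
model = "gpt-4o-mini"
template = f'{{"message": "{{PROMPT}}", "model": "{model}"}}'


def build_body(template: str, prompt: str, placeholder: str = "{PROMPT}") -> str:
    """Substitute the prompt into the template, escaped for use inside a JSON string."""
    # json.dumps adds surrounding quotes; strip them because the template already has them.
    escaped = json.dumps(prompt)[1:-1]
    return template.replace(placeholder, escaped)


body = build_body(template, 'wat is "gezond" eten?')
payload = json.loads(body)  # parses cleanly despite the quotes in the prompt
```

Escaping matters for prompts that contain quotes or newlines, which would otherwise produce an invalid JSON body.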


In [6]:
async def generate_report(results, execution_time):
    # Define the report directory path and create it if it doesn't exist.
    report_dir = Path("tests/E2E/reports/AHGPT/DataSet").resolve()
    report_dir.mkdir(parents=True, exist_ok=True)

    # Create a timestamp string
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Build the filename; the timestamp goes before the .html extension.
    filename = f"{selected_dataset}_dataset_report_{timestamp}.html"
    
    generate_dataset_report(
        results=results,
        save_path=report_dir / filename,
        description="Mixed evaluation of single-turn and multi-turn prompt responses. Lowest step score is used to indicate the final score of the scenario.",
        execution_time=execution_time,
    )
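
The report description states that the lowest step score determines the final score of a scenario. A minimal sketch of that rule, assuming each step yields a single float (consistent with `scorer_type="float_scale"`, although the actual result objects returned by the orchestrator are not shown in this notebook); `scenario_score` is a hypothetical name.

```python
from typing import Sequence


def scenario_score(step_scores: Sequence[float]) -> float:
    """Final score of a scenario: the lowest score among its steps."""
    if not step_scores:
        raise ValueError("a scenario needs at least one scored step")
    return min(step_scores)


single_turn = scenario_score([0.8])
multi_turn = scenario_score([0.9, 0.4, 1.0])  # one weak turn caps the whole scenario
```

Taking the minimum means a multi-turn conversation passes only if every turn passes, which is stricter than averaging.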


In [7]:
async def main():
    # Start the timer before sending prompts.
    start_time = time.time()
    
    # Send every question/answer pair to AH-GPT and score the responses.
    await orchestrator.send_qa_pairs_async(qa_pairs)

    # Collect the results of every chat run by the orchestrator.
    results = orchestrator.get_all_chat_results()

    # Calculate the total execution time.
    execution_time = time.time() - start_time
    
    await generate_report(results, execution_time)
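
Running `main()` echoes the raw HTTP request, headers included, into the cell output, and notebook outputs are often committed or shared. A sketch of masking credential headers before printing; `mask_auth_headers` is a hypothetical helper, the list of header names is an assumption, and where such a step would hook into the target's logging is not shown in this notebook.

```python
import re

# Header names treated as sensitive here are an assumption for illustration.
_SENSITIVE_HEADER = re.compile(
    r"^([ \t]*(?:X-)?Authorization[ \t]*:[ \t]*)\S.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)


def mask_auth_headers(raw_request: str) -> str:
    """Replace the value of credential headers with a fixed marker."""
    return _SENSITIVE_HEADER.sub(r"\1<redacted>", raw_request)


raw = "POST https://example.invalid/v1/chats\nX-Authorization: abc.def.ghi\nAccept: */*"
masked = mask_auth_headers(raw)
```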

In [8]:
await main()



Executing test case: 1
Question: wat is gezond eten?
Raw HTTP request: 
        POST https://ahgpt-service.kaas.nonprd.k8s.ah.technology/v1/chats/test
        Content-Type: application/json
        X-Authorization: <redacted>
        Accept: */*

       