# 🚀 Automated Evaluation of AH-GPT Responses using PyRIT

## **📌 Overview**
This notebook evaluates **AH-GPT** responses with the **PyRIT** framework. It sends a predefined set of questions to AH-GPT, scores each response against an expected answer, and saves the results as an HTML report.

## **🛠️ Steps in this Notebook**
- **Configuration** - Set up API endpoints and authentication.
- **📋 Load QA Dataset** - Define test questions and expected answers.
- **🚀 Initialize PyRIT** - Configure the testing environment.
- **🔄 Create Chat Threads** - Set up conversation threads.
- **📡 Send Prompts & Evaluate Responses** - Run the main test loop.
- **📊 Generate Report** - Save the results for analysis.

In [1]:
import os
import time
from pathlib import Path

import requests
from dotenv import load_dotenv

# PyRIT imports
# Note: the AHGPT* classes, Evaluator, and the HTML report helpers do not appear
# to be part of upstream PyRIT; this notebook assumes a fork that provides them.
from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.common.text_helper import generate_single_turn_html_report, save_html_report
from pyrit.orchestrator import AHGPTPromptSendingOrchestrator
from pyrit.prompt_target import AHGPTHttpTarget, AHGPTResponseParser, OpenAIChatTarget
from pyrit.score.evaluator import Evaluator


In [2]:
# Load environment variables
load_dotenv()

BASE_URL = os.getenv("AH_GPT_NONPRD_ENDPOINT")
TOKEN = os.getenv("AH_GPT_NONPRD_TOKEN")

# Headers for API Requests
HEADERS = {
    "Content-Type": "application/json",
    "X-Authorization": TOKEN,
    "Accept": "*/*"
}

# Payload for creating a chat thread
thread_init_payload = {
    "message": "Hello, start this chat!",
    "model": "gpt-4o-mini"
}
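
If either environment variable is missing, `os.getenv` returns `None` and the failure only shows up later as an authentication or URL error. A minimal sketch of an early check is shown below. The helper name `require_env` and the stand-in values are assumptions for illustration, not part of the notebook or of PyRIT.

```python
from typing import Mapping


def require_env(names: list[str], env: Mapping[str, str]) -> dict[str, str]:
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Stand-in mapping with dummy values; in the notebook this would be os.environ.
fake_env = {
    "AH_GPT_NONPRD_ENDPOINT": "https://example.invalid/chats",
    "AH_GPT_NONPRD_TOKEN": "dummy-token",
}
settings = require_env(["AH_GPT_NONPRD_ENDPOINT", "AH_GPT_NONPRD_TOKEN"], fake_env)
```

Passing the mapping in as an argument keeps the check testable without touching the real environment.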


In [3]:
qa_pairs = [
    # Store-level & Business Operations
    {
        "question": "What’s the WiFi password for staff at AHXL Utrecht?",
        "expected_outcomes": "text_message: You can connect using the staff network 'AH_Staff_Utrecht'. The password is available in the store's back office notice board or via SAM under 'Facilities'."
    },
    {
        "question": "What are the top 3 selling dairy products this quarter?",
        "expected_outcomes": "text_message: The top 3 selling dairy SKUs this quarter are: 1) AH Halfvolle Melk 1L, 2) AH Griekse Yoghurt Naturel, 3) Campina Karnemelk."
    },
    {
        "question": "Who is the HR contact for North Holland?",
        "expected_outcomes": "text_message: The regional HR contact for North Holland is Sanne de Vries. You can reach her via hr.noordholland@ah.nl."
    },
    {
        "question": "Where can I find the latest store safety protocol?",
        "expected_outcomes": "text_message: The latest safety protocol is available on SAM under ‘Documents > Store Operations > Safety’. Last updated: March 2025."
    },
    {
        "question": "How do I submit a maintenance request for my store?",
        "expected_outcomes": "text_message: You can submit maintenance requests via the AH Maintenance Portal in SAM under 'Store Support > Maintenance'."
    },
    {
        "question": "When is the fresh produce pricing updated each week?",
        "expected_outcomes": "text_message: Fresh produce pricing is updated every Tuesday morning at 5:00 AM."
    },
    {
        "question": "What system do we use for shelf planning?",
        "expected_outcomes": "text_message: We use the Blue Yonder (JDA) system for shelf and planogram planning."
    },
    {
        "question": "Where can I download the AH brand book?",
        "expected_outcomes": "text_message: The AH brand book is available in SAM under 'Marketing > Brand Guidelines'."
    },
    {
        "question": "How many stores have Scan & Go active?",
        "expected_outcomes": "text_message: As of March 2025, 187 AH stores have Scan & Go enabled nationwide."
    },
    {
        "question": "What’s the employee discount on non-food products?",
        "expected_outcomes": "text_message: Employees receive a 10% discount on eligible non-food items. Check SAM > HR > Benefits for full details."
    },

    # Technical Developer Questions
    {
        "question": "Where can I find the API documentation for the bonus card system?",
        "expected_outcomes": "text_message: You can find the API documentation in the internal DevPortal under 'Loyalty Services > Bonus Card API'."
    },
    {
        "question": "How do I authenticate against AH’s internal GraphQL endpoint?",
        "expected_outcomes": "text_message: Use your service account token via OAuth2. The endpoint requires a bearer token in the Authorization header."
    },
    {
        "question": "What’s the best way to consume real-time stock updates?",
        "expected_outcomes": "text_message: Subscribe to the Kafka topic 'ah.realtime.stock-updates'. Docs available on the DataHub."
    },
    {
        "question": "How do I request access to the staging environment?",
        "expected_outcomes": "text_message: Submit a request via Jira under the 'Platform Engineering > Access Requests' project. Include your team name and purpose."
    },
    {
        "question": "What’s the difference between the legacy POS API and the new POS Gateway?",
        "expected_outcomes": "text_message: The legacy POS API is synchronous and store-specific, while the POS Gateway supports async messaging, is cloud-native, and has broader integration coverage."
    },
    {
        "question": "Where can I view logs for the Self-Checkout Mobile app?",
        "expected_outcomes": "text_message: Logs are centralized in the Splunk dashboard under 'Retail Apps > SCO Mobile'. Use your developer credentials to access it."
    },
    {
        "question": "How do I deploy a microservice to the AH Kubernetes cluster?",
        "expected_outcomes": "text_message: Use the AH Deployment CLI with the appropriate Helm chart. Documentation is in Confluence under 'Platform > Kubernetes'."
    },
    {
        "question": "What’s the naming convention for new backend services?",
        "expected_outcomes": "text_message: Services should follow the 'team-domain-function' format, e.g., 'loyalty-customer-profiles'. Guidelines available on the Developer Handbook."
    },
    {
        "question": "Where do I report a bug in the retail API sandbox?",
        "expected_outcomes": "text_message: Bugs can be reported in Jira under 'API Team > Sandbox Issues'. Include request sample and environment details."
    },
    {
        "question": "How do I get access to the Databricks environment for analytics?",
        "expected_outcomes": "text_message: Fill out the access request form in SAM > Data & Analytics > Tooling > Databricks. Access is reviewed within 1–2 business days."
    }
]
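
Because the dataset is hand-written, a quick structural check before running the evaluation can catch empty fields or duplicated questions. The sketch below assumes that every `expected_outcomes` value should start with the `text_message:` prefix, as all entries above do. The function name `find_dataset_problems` is hypothetical.

```python
def find_dataset_problems(pairs: list[dict]) -> list[str]:
    """Return human-readable descriptions of structural problems in the dataset."""
    problems = []
    seen_questions = set()
    for index, pair in enumerate(pairs):
        question = str(pair.get("question", "")).strip()
        expected = str(pair.get("expected_outcomes", "")).strip()
        if not question:
            problems.append(f"entry {index}: missing question")
        elif question in seen_questions:
            problems.append(f"entry {index}: duplicate question")
        seen_questions.add(question)
        if not expected.startswith("text_message:"):
            problems.append(f"entry {index}: expected_outcomes lacks 'text_message:' prefix")
    return problems


# Small sample in the same shape as qa_pairs; the second entry is deliberately broken.
sample_pairs = [
    {"question": "What system do we use for shelf planning?",
     "expected_outcomes": "text_message: We use the Blue Yonder (JDA) system."},
    {"question": "", "expected_outcomes": "We use Blue Yonder."},
]
sample_problems = find_dataset_problems(sample_pairs)
```

In the notebook, the same function could be called on `qa_pairs` directly and the run aborted if the returned list is not empty.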


In [4]:
initialize_pyrit(memory_db_type=IN_MEMORY)


In [5]:
http_prompt_target = AHGPTHttpTarget(
    http_request=f"""
        POST {BASE_URL}/{{CHAT_ID}}/messages/stream
        Content-Type: application/json
        X-Authorization: {TOKEN}
        Accept: */*

        {{
            "message": "{{PROMPT}}"
        }}
    """,
    prompt_regex_string="{PROMPT}",
    timeout=60.0,
    callback_function=AHGPTResponseParser.parse_response
)

scorer = Evaluator(
    chat_target=OpenAIChatTarget(),
    evaluator_yaml_path=Path("assets/AH_Evaluators/ah_gpt/ah_gpt_dataset_evaluator.yaml"),
    scorer_type="float_scale"
)

orchestrator = AHGPTPromptSendingOrchestrator(
    objective_target=http_prompt_target,
    scorers=[scorer]
)
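
The request template above relies on f-string brace escaping: `{{CHAT_ID}}` and `{{PROMPT}}` survive formatting as the literal placeholders `{CHAT_ID}` and `{PROMPT}`, which are filled in later. The sketch below reproduces that behavior in isolation and shows one way a prompt could be JSON-escaped before substitution. How `AHGPTHttpTarget` actually performs the substitution is not shown in this notebook, so `build_template` and `fill_template` are hypothetical stand-ins.

```python
import json


def build_template(base_url: str) -> str:
    """Build a raw HTTP request with literal {CHAT_ID} and {PROMPT} placeholders."""
    return (
        # Doubled braces in an f-string come out as single literal braces.
        f"POST {base_url}/{{CHAT_ID}}/messages/stream\n"
        "Content-Type: application/json\n"
        "\n"
        '{"message": "{PROMPT}"}'
    )


def fill_template(template: str, chat_id: str, prompt: str) -> str:
    """Substitute the placeholders, escaping the prompt so the body stays valid JSON."""
    # json.dumps adds surrounding quotes; strip them because the template has its own.
    escaped_prompt = json.dumps(prompt, ensure_ascii=False)[1:-1]
    return template.replace("{CHAT_ID}", chat_id).replace("{PROMPT}", escaped_prompt)


template = build_template("https://example.invalid/chats")
request_text = fill_template(template, "abc123", 'Where is the "staff" network?')
```

Without the escaping step, a prompt containing a double quote or a newline would produce a request body that is no longer valid JSON.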


In [6]:
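def create_chat_threads(required_count: int, delay_seconds: float = 0.5, max_attempts: int = 100):
    """Create chat threads until `required_count` thread IDs are collected.

    Raises RuntimeError after `max_attempts` tries (an arbitrary cap) so that a
    failing endpoint cannot keep the loop running forever.
    """
    thread_ids = []
    attempt = 0

    while len(thread_ids) < required_count:
        if attempt >= max_attempts:
            raise RuntimeError(
                f"Created {len(thread_ids)}/{required_count} threads after {attempt} attempts."
            )
        attempt += 1
        response = None  # reset so a stale response from an earlier attempt is never printed
        try:
            # A timeout prevents a hung connection from blocking the notebook.
            response = requests.post(BASE_URL, headers=HEADERS, json=thread_init_payload, timeout=30)
            response.raise_for_status()

            thread_id = response.json().get("chatId")
            if thread_id:
                thread_ids.append(thread_id)
            else:
                print(f"[Attempt {attempt}] No chatId in response.")

        except Exception as e:
            print(f"[Attempt {attempt}] Failed to create thread: {e}")
            if response is not None:
                print("Raw response:", response.text)

        time.sleep(delay_seconds)

    print("✅ All threads created.\n")
    return thread_ids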
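
The retry logic is easier to test when the network call is passed in as a function. The sketch below separates the collection loop from `requests`, so it can be exercised with a fake creator. Both `collect_thread_ids` and `make_fake_creator` are hypothetical helpers, not part of the notebook or of PyRIT.

```python
from typing import Callable, Optional


def collect_thread_ids(
    create_one: Callable[[], Optional[str]],
    required_count: int,
    max_attempts: int,
) -> list[str]:
    """Call create_one until enough IDs are collected or attempts run out."""
    thread_ids: list[str] = []
    for _ in range(max_attempts):
        if len(thread_ids) >= required_count:
            break
        try:
            thread_id = create_one()
        except Exception:
            continue  # treat any failure as a missed attempt
        if thread_id:
            thread_ids.append(thread_id)
    if len(thread_ids) < required_count:
        raise RuntimeError(f"Only {len(thread_ids)}/{required_count} threads created.")
    return thread_ids


def make_fake_creator(outcomes: list) -> Callable[[], Optional[str]]:
    """Return a creator that replays outcomes: IDs, None, or exceptions to raise."""
    iterator = iter(outcomes)

    def create_one() -> Optional[str]:
        outcome = next(iterator)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return create_one


fake_creator = make_fake_creator(["t1", None, ConnectionError("down"), "t2"])
collected = collect_thread_ids(fake_creator, required_count=2, max_attempts=5)
```

In the notebook, `create_one` would wrap the `requests.post` call and return the `chatId` value.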


In [7]:
async def main():
    questions = [pair["question"] for pair in qa_pairs]
    expected_outcomes = [pair["expected_outcomes"] for pair in qa_pairs]

    start_time = time.time()

    # Create chat threads
    thread_ids = create_chat_threads(required_count=len(qa_pairs))

    # Send prompts and evaluate responses
    await orchestrator.send_prompts_async(
        prompt_list=questions,
        expected_output_list=expected_outcomes,
        thread_ids=thread_ids
    )

    # Collect results and execution time
    results = orchestrator.get_chat_results()
    execution_time = time.time() - start_time

    # Save HTML report
    report_dir = Path("tests/E2E/reports/DataSet").resolve()
    report_dir.mkdir(parents=True, exist_ok=True)

    save_html_report(
        results=results,
        directory=str(report_dir),
        report_generator=generate_single_turn_html_report,
        is_chat_evaluation=False,
        threshold=0.7,
        file_name="ah_gpt_dataset",
        description="Evaluation of inputs vs. expected/actual outputs with scoring.",
        execution_time=execution_time
    )
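
The `threshold=0.7` argument suggests that each float-scale score is compared against 0.7 in the report. The structure of `results` is not shown in this notebook, so the sketch below works on a plain list of floats and assumes that a score equal to the threshold counts as a pass. The function name `summarize_scores` is hypothetical.

```python
def summarize_scores(scores: list[float], threshold: float = 0.7) -> dict:
    """Count how many scores meet the threshold (assumption: >= counts as a pass)."""
    total = len(scores)
    passed = sum(1 for score in scores if score >= threshold)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }


# Illustrative scores only; these are not results from an actual run.
summary = summarize_scores([0.9, 0.7, 0.4, 0.65])
```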


In [8]:
await main()


✅ All threads created.

✅ Report saved at: /Users/denizdalkilic/Documents/Forks/PyRIT/tests/E2E/reports/DataSet/ah_gpt_dataset_20250331_094629.html
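
The saved file name combines the `file_name` argument with a timestamp suffix. Judging only from the output above, the suffix looks like `%Y%m%d_%H%M%S`. The sketch below reproduces that pattern as an inference from the file name, not as the actual `save_html_report` implementation. The helper name `timestamped_report_name` is hypothetical.

```python
from datetime import datetime


def timestamped_report_name(file_name: str, moment: datetime) -> str:
    """Append a YYYYMMDD_HHMMSS suffix and the .html extension to file_name."""
    return f"{file_name}_{moment.strftime('%Y%m%d_%H%M%S')}.html"


report_name = timestamped_report_name("ah_gpt_dataset", datetime(2025, 3, 31, 9, 46, 29))
```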
