## Evaluation with Azure Evaluation SDK

Use the evaluation SDK to test the generated output of your agent(s) and verify accuracy, performance, clarity, coherence, risk and safety, and more. You can even build your own evaluators!

To see all the available metrics, visit https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python
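As a rough illustration of what "build your own evaluators" can look like, the sketch below defines a code-based evaluator as a plain callable class that returns a dictionary of metrics. The class name `AnswerLengthEvaluator` and the metric key `answer_length` are hypothetical names chosen for this example; they are not part of this notebook or of the built-in evaluators.

```python
class AnswerLengthEvaluator:
    """Hypothetical code-based evaluator: reports how long a response is."""

    def __call__(self, *, response: str, **kwargs):
        # Return a dict of metric name -> value, mirroring the dict-shaped
        # results that the built-in evaluators in this notebook return.
        return {"answer_length": len(response)}


answer_length_eval = AnswerLengthEvaluator()
result = answer_length_eval(response="Deep Purple has 11 albums.")
print(result)
```

Because the evaluator is just a callable, it can be invoked the same way as `relevance_eval` and `similarity_eval` later in this notebook.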

In [2]:
import requests


api_url = "http://localhost:8000"   # FastAPI uvicorn URL with port 8000
# api_url = "http://localhost:80"     # Docker container URL since we exposed the port 80
# api_url = "https://chinook-backend-api.azurewebsites.net"  # Azure Web App URL
# api_url = "http://20.118.71.68:80"  # AKS URL

res = requests.get(f"{api_url}/health")
res.json()

{'status': '🤙'}

In [3]:
res.status_code

200

In [4]:
res.text

'{"status":"🤙"}'

In [5]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)



# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
}
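Since `os.environ.get` silently returns `None` for a missing variable, a misconfigured `.env` file may only surface later as an authentication error inside an evaluator. A minimal sketch of an early check is shown below; the helper name `missing_env_vars` is an assumption introduced for this example, while the variable names are the ones used in `model_config` above.

```python
import os

# The three settings read into model_config above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_env_vars(env=os.environ):
    """Hypothetical helper: list required settings that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Azure OpenAI settings are present.")
```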

## Compare the semantic meaning of the generated answer with the ground truth, and measure the relevance of the generated answer

In [17]:

from azure.ai.evaluation import RelevanceEvaluator, SimilarityEvaluator


"""The relevance measure assesses the ability of answers to capture the key points of the context.
High relevance scores signify the AI system's understanding of the input and its capability to produce coherent and contextually appropriate outputs. 
Conversely, low relevance scores indicate that generated responses might be off-topic, lacking in context, or insufficient in addressing the user's intended queries. 
Use the relevance metric when evaluating the AI system's performance in understanding the input and generating contextually appropriate responses.
Relevance scores range from 1 to 5, with 1 being the worst and 5 being the best."""

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)



"""The similarity measure evaluates the likeness between a ground truth sentence (or document) and the AI model's generated prediction.
This calculation involves creating sentence-level embeddings for both the ground truth and the model's prediction, which are high-dimensional vector
representations capturing the semantic meaning and context of the sentences.

Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses.
Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy.

Similarity scores range from 1 to 5, with 1 being the least similar and 5 being the most similar."""


# Initializing Similarity Evaluator
similarity_eval = SimilarityEvaluator(model_config)



In [10]:
res = requests.post(f"{api_url}/sql-invoke",
    json={
        "message": "Find albums released by artists who have more than 5 albums",
        "thread_id": "847c6285-8fc9-4560-a83f-4e6285809364"
    }
)
res

<Response [200]>

## Define Invoke Endpoint

In [None]:
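def invoke_sql_query(message, thread_id):
    """Send a message to the /sql-invoke endpoint and return the agent's reply, or None on failure."""
    try:
        res = requests.post(
            f"{api_url}/sql-invoke",
            json={"message": message, "thread_id": thread_id},
            timeout=120,  # Assumed limit so a stalled request cannot hang the notebook; adjust as needed
        )
        res.raise_for_status()  # Surface HTTP errors instead of failing later on a missing key
        content = res.json()["content"]  # Parse the body once and reuse it
        print(content)
        return content
    except (requests.RequestException, KeyError, ValueError) as e:
        print(f"Request failed: {e}")
        return None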
        

## Single Local Evaluation

In [None]:
import uuid
thread_id = str(uuid.uuid4())
results = invoke_sql_query("Find albums released by artists who have more than 5 albums", thread_id)
ground_truth = """Here are some albums released by artists who have more than 5 albums:

### **Deep Purple**
1. Come Taste The Band
2. Deep Purple In Rock
3. Fireball
4. Knocking at Your Back Door: The Best Of Deep Purple in the 80's
5. MK III The Final Concerts [Disc 1]
6. Machine Head
7. Purpendicular
8. Slaves And Masters
9. Stormbringer
10. The Battle Rages On
11. The Final Concerts (Disc 2)

### **Iron Maiden**
1. A Matter of Life and Death
2. A Real Dead One
3. A Real Live One
4. Brave New World
5. Dance Of Death
6. Fear Of The Dark
7. Iron Maiden
8. Killers
9. Live After Death
10. Live At Donington 1992 (Disc 1)
11. Live At Donington 1992 (Disc 2)
12. No Prayer For The Dying
13. Piece Of Mind
14. Powerslave
15. Rock In Rio [CD1]
16. Rock In Rio [CD2]
17. Seventh Son of a Seventh Son
18. Somewhere in Time
19. The Number of The Beast
20. The X Factor
21. Virtual XI

### **Led Zeppelin**
1. BBC Sessions [Disc 1] [Live]
2. BBC Sessions [Disc 2] [Live]
3. Coda
4. Houses Of The Holy
5. IV
6. In Through The Out Door
7. Led Zeppelin I
8. Led Zeppelin II
9. Led Zeppelin III
10. Physical Graffiti [Disc 1]
11. Physical Graffiti [Disc 2]
12. Presence
13. The Song Remains The Same (Disc 1)
14. The Song Remains The Same (Disc 2)

### **Metallica**
1. ...And Justice For All
2. Black Album
3. Garage Inc. (Disc 1)
4. Garage Inc. (Disc 2)"""


relevance_score = relevance_eval(
    response=results,
    context=ground_truth,
    query="Find albums released by artists who have more than 5 albums",
)
print(relevance_score)


similarity_score = similarity_eval(
    query="Find albums released by artists who have more than 5 albums",
    response=results,
    ground_truth=ground_truth)

print(similarity_score)

## Batch Local Evaluation

In [None]:
import uuid
import json

# Define the input and output file paths
file_path_input = './data/question_batch.json'
file_path_output = './data/question_batch_output.json'

# Read input JSON file
with open(file_path_input, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Prepare output structure
output_data = {"Results": []}

# Process each question
for q in data['Questions']:
    thread_id = str(uuid.uuid4())  # Generate unique thread ID

    # Extract question and ground truth
    question = q["Question"]
    ground_truth = q["GroundTruth"]

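    # Call the function to get a response
    results = invoke_sql_query(question, thread_id)
    if results is None:
        # Skip questions the API failed to answer so the evaluators are not called with None
        print(f"Skipping question (no response): {question}")
        continue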

    # Compute relevance and similarity scores 
    relevance_score = relevance_eval(response=results, context=ground_truth, query=question)
    similarity_score = similarity_eval(query=question, response=results, ground_truth=ground_truth)

    # Store results
    output_data["Results"].append({
        "Question": question,
        "Answer": results,
        "GroundTruth": ground_truth,
        "RelevanceScore": relevance_score["relevance"],
        "SimilarityScore": similarity_score["similarity"]
    })

    # Print scores for debugging
    print(f"Question: {question}")
    print(f"Answer: {results}")
    print(f"Relevance Score: {relevance_score}")
    print(f"Similarity Score: {similarity_score}")
    print("-" * 50)

# Write results to the output JSON file
with open(file_path_output, 'w', encoding='utf-8') as file_output:
    json.dump(output_data, file_output, indent=4, ensure_ascii=False)

print(f"Results saved to {file_path_output}")
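Once the batch has run, it can help to condense the per-question scores into a few aggregate numbers. The sketch below works on the same `{"Results": [...]}` structure the batch loop writes; the helper `summarize_scores`, the sample records, and the cutoff of 3 are assumptions made for illustration rather than part of the SDK or this notebook.

```python
import json
from statistics import mean

# Hypothetical sample mirroring the structure written by the batch loop above.
sample_output = {
    "Results": [
        {"Question": "q1", "RelevanceScore": 4, "SimilarityScore": 5},
        {"Question": "q2", "RelevanceScore": 2, "SimilarityScore": 3},
    ]
}


def summarize_scores(output_data, threshold=3):
    """Average both scores and list questions where either score is below the threshold."""
    results = output_data["Results"]
    return {
        "count": len(results),
        "mean_relevance": mean(r["RelevanceScore"] for r in results),
        "mean_similarity": mean(r["SimilarityScore"] for r in results),
        "below_threshold": [
            r["Question"]
            for r in results
            if min(r["RelevanceScore"], r["SimilarityScore"]) < threshold
        ],
    }


summary = summarize_scores(sample_output)
print(json.dumps(summary, indent=2))
```

To apply it to a real run, load `./data/question_batch_output.json` with `json.load` and pass the resulting dictionary to `summarize_scores`.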