### Setup and Sample Dataset    

In [2]:
import pandas as pd
import time
import random

# Concept: Creating a test dataset
data = {
    "prompt": [
        "What is the capital of France?",
        "Explain photosynthesis in one sentence.",
        "How do I reset my password?",
    ],
    "context": [
        "The capital of France is Paris.",
        "Photosynthesis is the process by which plants use sunlight to synthesize food.",
        "Users can reset passwords via the 'Forgot Password' link on the login page.",
    ],
    "ground_truth": [
        "Paris",
        "Plants use sunlight to make food.",
        "Use the 'Forgot Password' link.",
    ],
}

df = pd.DataFrame(data)

### Simulating Model Generation & Latency
This cell simulates model generation and measures latency, a key metric for user experience.

In [3]:
def simulate_genai_call(prompt):
    # Canned responses stand in for a real model; the wording deliberately
    # differs from the ground truth to mimic the variation typical of GenAI
    responses = {
        "What is the capital of France?": "The capital is Paris.",
        "Explain photosynthesis in one sentence.": "Plants convert light into chemical energy.",
        "How do I reset my password?": "Go to the settings and click reset.",
    }
    # perf_counter() is a monotonic clock, better suited to timing than time.time()
    start_time = time.perf_counter()
    time.sleep(random.uniform(0.5, 1.5))  # Simulating network/compute time
    elapsed = time.perf_counter() - start_time
    # Prompts missing from the lookup table fall back to "I don't know."
    return responses.get(prompt, "I don't know."), elapsed

# Concept: Generation and Latency measurement
results = [simulate_genai_call(p) for p in df['prompt']]
df['model_output'] = [r[0] for r in results]
df['latency_sec'] = [r[1] for r in results]
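
Per-call timings are usually easier to interpret as summary statistics. The sketch below is one way to do that with the standard library only; `timed_call` and `fake_model` are hypothetical helpers introduced for illustration, not part of the notebook above.

```python
import statistics
import time


def timed_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_model(prompt):
    # Hypothetical stand-in for a model call; sleeps briefly to mimic latency.
    time.sleep(0.01)
    return prompt.upper()


# Time a handful of calls and keep only the elapsed seconds.
latencies = [timed_call(fake_model, p)[1] for p in ["a", "b", "c", "d", "e"]]

latency_summary = {
    "mean_sec": statistics.mean(latencies),
    "median_sec": statistics.median(latencies),
    "max_sec": max(latencies),
}
for name, value in latency_summary.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's DataFrame, the same figures could be read from `df['latency_sec']` instead of the `latencies` list.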

### Automated Evaluation: Relevance & Faithfulness
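Here we simulate automated evaluation.
A simple character-level similarity check (`difflib.SequenceMatcher`) stands in for relevance (how closely the output matches the ground truth, our proxy for user intent) and faithfulness (how closely the output stays grounded in the provided context). It is a rough proxy for both, not a measure of meaning.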

In [None]:
from difflib import SequenceMatcher


# This helper function calculates a similarity ratio between two strings (0.0 to 1.0).
def calculate_score(a, b):
    return SequenceMatcher(None, a, b).ratio()


# 1. Calculating Relevance:
# Measures how well the model's output matches the "Ground Truth" (user intent)
# Note: the ground truth may be a single word such as 'Paris'; the extra words in a
# full-sentence answer count as noise, which lowers the relevance score
df["relevance_score"] = df.apply(
    lambda x: calculate_score(x["model_output"], x["ground_truth"]), axis=1
)

# 2. Calculating Faithfulness:
# Measures if the response stays grounded in the provided "Context" without inventing info
# A low score suggests the LLM is drifting away from the context in its replies
df["faithfulness_score"] = df.apply(
    lambda x: calculate_score(x["model_output"], x["context"]), axis=1
)

# Printing the analysis across multiple dimensions to see where the model succeeded or failed.
print("Evaluation Results Across Multiple Dimensions:")
print(df[["prompt", "model_output", "relevance_score", "faithfulness_score"]])

Evaluation Results Across Multiple Dimensions:
                                    prompt  \
0           What is the capital of France?   
1  Explain photosynthesis in one sentence.   
2              How do I reset my password?   

                                 model_output  relevance_score  \
0                       The capital is Paris.         0.384615   
1  Plants convert light into chemical energy.         0.533333   
2         Go to the settings and click reset.         0.303030   

   faithfulness_score  
0            0.807692  
1            0.366667  
2            0.236364  
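
The relevance score of 0.38 for "The capital is Paris." against the ground truth "Paris" shows a limit of character-level matching: a correct answer is penalized for being a full sentence. One possible alternative, sketched here under the assumption that word overlap is an acceptable proxy, is token recall: the fraction of ground-truth words that appear in the output. `tokenize` and `token_recall` are hypothetical helpers, not the notebook's method.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_recall(candidate, reference):
    """Fraction of unique reference tokens that also appear in the candidate."""
    ref_tokens = set(tokenize(reference))
    if not ref_tokens:
        return 0.0
    cand_tokens = set(tokenize(candidate))
    return len(ref_tokens & cand_tokens) / len(ref_tokens)


# The full-sentence answer now gets full credit for containing "Paris".
print(token_recall("The capital is Paris.", "Paris"))  # 1.0

# Only "the" is shared with the ground truth, so recall stays low.
print(token_recall("Go to the settings and click reset.",
                   "Use the 'Forgot Password' link."))  # 0.2
```

Token recall ignores word order and synonyms, so it is still only a proxy; it simply fails in different places than the character-level ratio.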


### Demonstrating Robustness (Stability)

Robustness measures how stable the model is when the prompt wording changes slightly.

In [9]:
# We define two versions of the same question to test stability
prompt_v1 = "What is the capital of France?"
prompt_v2 = "Tell me the capital city of France."

# We generate responses for both variations using our simulation function
output_v1, _ = simulate_genai_call(prompt_v1)
output_v2, _ = simulate_genai_call(prompt_v2)

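# Concept: Robustness Score
# We compare the two outputs. If the model provides wildly different answers for
# nearly identical questions, it indicates low stability.
# Note: prompt_v2 is not a key in the simulated lookup table, so the function returns
# the fallback "I don't know."; the low score here comes from that exact-match lookup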
robustness_score = calculate_score(output_v1, output_v2)
print(f"Robustness Score (Stability): {robustness_score:.2f}")

Robustness Score (Stability): 0.24
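
A single pair of prompts gives a noisy estimate. A minimal sketch of one way to extend the idea is to average the pairwise similarity of outputs across several paraphrases. The `robustness` function and the `keyword_model` stand-in are assumptions made for illustration; `keyword_model` keys on content words rather than exact wording, so it answers all three paraphrases identically.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Character-level similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


def robustness(generate, prompts):
    """Mean pairwise similarity of the outputs produced for paraphrased prompts."""
    outputs = [generate(p) for p in prompts]
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # A single output is trivially consistent with itself.
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def keyword_model(prompt):
    # Hypothetical stand-in: matches on content words, not the exact prompt string.
    text = prompt.lower()
    if "capital" in text and "france" in text:
        return "The capital is Paris."
    return "I don't know."


paraphrases = [
    "What is the capital of France?",
    "Tell me the capital city of France.",
    "France's capital is which city?",
]
print(f"Robustness Score (Stability): {robustness(keyword_model, paraphrases):.2f}")
```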


### Human Evaluation Simulation
Since automated metrics struggle with meaning, we add columns for human evaluation, where a reviewer judges subjective qualities such as "Clarity" and "Tone". Only clarity is scored in this example.

In [None]:
# 1. Simulating Human Input:
# We manually assign scores for 'Clarity', which is a subjective human judgment
df["human_rating_clarity"] = [5, 4, 3]  # Rating on a 1-5 scale.

# 2. Capturing Nuance:
# Humans provide qualitative feedback (comments) that a similarity score cannot capture
df["human_comments"] = ["Perfect", "A bit technical", "Slightly vague instructions"]

# 3. Identifying Weak Areas:
# We filter the dataframe to find scores below 4.
# This helps identify weak areas to be improved through tuning
weak_areas = df[df["human_rating_clarity"] < 4]

print("Areas identified for improvement:")
print(weak_areas[["prompt", "human_comments"]])

Areas identified for improvement:
                        prompt               human_comments
2  How do I reset my password?  Slightly vague instructions
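
Human ratings are also useful for checking the automated metrics themselves. The sketch below flags rows where the two signals disagree, treating each as a rough pass/fail. The scores are copied from the outputs shown earlier; the thresholds (0.5 and 4) and the `needs_review` helper are hypothetical choices for illustration.

```python
# Scores copied from the evaluation and human-rating outputs in this notebook.
rows = [
    {"prompt": "What is the capital of France?",
     "relevance_score": 0.384615, "human_rating_clarity": 5},
    {"prompt": "Explain photosynthesis in one sentence.",
     "relevance_score": 0.533333, "human_rating_clarity": 4},
    {"prompt": "How do I reset my password?",
     "relevance_score": 0.303030, "human_rating_clarity": 3},
]


def needs_review(row, auto_threshold=0.5, human_threshold=4):
    """True when the automated score and the human rating disagree on pass/fail."""
    auto_pass = row["relevance_score"] >= auto_threshold
    human_pass = row["human_rating_clarity"] >= human_threshold
    return auto_pass != human_pass


disagreements = [r["prompt"] for r in rows if needs_review(r)]
print(disagreements)  # ['What is the capital of France?']
```

The flagged row is the one where the reviewer rated the answer "Perfect" but the character-level score was low, which points to a weakness in the metric rather than in the answer.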
