
## Multi-Agent System on SQuAD with Centralized, Distributed, and Hybrid Governance

This notebook contains experiments for a multi-agent system built on the SQuAD dataset. It explores how different governance strategies—**centralized**, **distributed**, and **hybrid**—affect the performance and safety of large language model (LLM)-based agents.

### About the Dataset: Why SQuAD?

We chose the **Stanford Question Answering Dataset (SQuAD)** because it’s a trusted benchmark for evaluating question-answering systems. It contains over 100,000 questions paired with Wikipedia passages and gold-standard answers.

SQuAD is ideal for testing multi-agent systems because:

* It covers a wide variety of factual, text-based questions.
* Answers are context-grounded and span-based, making evaluation easier and more accurate.
* It reflects tasks like retrieval, reasoning, and summarization—core skills for agentic systems.

### System Design: The Three Core Agents

Our system breaks down the QA task into three modular agents:

* **ClassifierAgent** – Detects the question type (e.g., who, what, when) to guide how the question is handled.
* **RAGAgent** – Uses retrieval-augmented generation to find relevant information and generate an answer.
* **SummarizerAgent** – (Optional) Summarizes longer answers for clarity and readability.

This design reflects how people solve complex problems—by breaking them into focused steps—and allows us to analyze each part individually.
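
As a rough illustration of the classification step, the sketch below labels a question by the first interrogative word it contains. The function name `classify_question` and the label set are assumptions made for this example; the ClassifierAgent itself may well rely on an LLM rather than on rules like these.

```python
import re

# Hypothetical label set; the notebook does not specify one.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")

def classify_question(question: str) -> str:
    """Return the first interrogative word found in the question, or 'other'."""
    tokens = re.findall(r"[a-z]+", question.lower())
    for token in tokens:
        if token in QUESTION_WORDS:
            return token
    return "other"

print(classify_question("Who was first to sequence a DNA-based genome?"))  # who
print(classify_question("In what year did the event occur?"))              # what
print(classify_question("Name the capital of France."))                    # other
```

A rule of this kind is cheap and deterministic, which makes it a convenient baseline even if a model-based classifier is used in practice.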

### Governance Architectures: How Control is Applied

We implement and compare three governance models for managing agent behavior:

#### 1. Centralized Governance

* **How it works**: The system completes all agent steps, then runs quality checks only at the end.
* **Key feature**: A single controller reviews the final answer, like a teacher grading a test.
* **Purpose**: Acts as a baseline to measure the effects of having full control at one point.

#### 2. Distributed Governance

* **How it works**: Each agent checks its own output before passing it on. If the output fails, it retries or stops the process.
* **Key feature**: There’s no central controller—each agent is responsible for its own quality.
* **Purpose**: Models decentralized, peer-like agent systems where each module is self-accountable.

#### 3. Hybrid Governance

* **How it works**: A central controller decides when and where to apply checks based on confidence scores.

  * If the system is confident, it runs fewer checks.
  * If uncertain, it adds extra evaluations (e.g., transparency or summarization).
* **Key feature**: A balance between the other two models—adaptive, not fixed.
* **Purpose**: Reflects real-world systems that optimize resources by checking more only when needed.
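
The three control patterns can be sketched with plain functions. This is a simplified model rather than the notebook's implementation: `Step`, `Check`, and the three `run_*` functions are hypothetical names, agents are reduced to text-to-text callables, and guardrails are reduced to pass/fail callables.

```python
from typing import Callable, List, Tuple

Step = Callable[[str], str]    # one agent: text in, text out
Check = Callable[[str], bool]  # one guardrail: True means pass

def run_centralized(x: str, steps: List[Step], final_check: Check) -> Tuple[str, bool]:
    """Run every step, then apply a single check to the final output."""
    for step in steps:
        x = step(x)
    return x, final_check(x)

def run_distributed(x: str, steps: List[Step], checks: List[Check],
                    max_retries: int = 1) -> Tuple[str, bool]:
    """Follow each step with its own check; retry, then stop on failure."""
    for step, check in zip(steps, checks):
        for _ in range(max_retries + 1):
            out = step(x)
            if check(out):
                break
        else:
            return out, False  # every attempt failed: halt the pipeline
        x = out
    return x, True

def run_hybrid(x: str, steps: List[Step], confidence: Callable[[str], float],
               light_check: Check, full_check: Check,
               threshold: float = 0.55) -> Tuple[str, bool]:
    """Run every step, then choose the check set from a confidence score."""
    for step in steps:
        x = step(x)
    check = light_check if confidence(x) >= threshold else full_check
    return x, check(x)

steps = [str.strip, str.upper]
print(run_centralized("  hi ", steps, lambda s: s == "HI"))                  # ('HI', True)
print(run_distributed("  hi ", steps, [lambda s: True, lambda s: False]))    # ('HI', False)
print(run_hybrid("  hi ", steps, lambda s: 0.9, lambda s: True, lambda s: False))  # ('HI', True)
```

The only structural difference between the three is where the checks sit: once at the end, after every step, or at the end with a check set chosen at run time.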

### Evaluation and Guardrails

We used **DeepEval** to implement guardrails—automated quality checks—for each governance mode. These guardrails evaluated:

* Answer relevance
* Task completion
* Transparency
* Helpfulness

In **distributed** and **hybrid** modes, these metrics were applied at the agent level, allowing dynamic responses like retrying or escalating low-confidence answers.


In [1]:
import os
import time
import json
import nest_asyncio; nest_asyncio.apply()  # Needed for notebook async loop fixes
from dotenv import load_dotenv
from datasets import load_dataset
from llama_index.core import Document, VectorStoreIndex
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.tools import QueryEngineTool, FunctionTool
from llama_index.core.agent.workflow import AgentWorkflow, ReActAgent





In [2]:
# --- Setup ---
load_dotenv()
GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
assert GOOGLE_API_KEY, "Set your Google API key!"

# LLM and Embedding Model
llm = GoogleGenAI(model="models/gemini-2.5-pro", api_key=GOOGLE_API_KEY)
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [3]:
# --- DeepEval Metrics ---
from deepeval.models import GeminiModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    GEval,
    TaskCompletionMetric
)


In [4]:
# DeepEval model/metrics
eval_model = GeminiModel(model_name="gemini-2.5-pro", api_key=GOOGLE_API_KEY)
transparency_metric = GEval(
    model=eval_model,
    name="Transparency",
    criteria="Does the answer clearly explain its reasoning steps and tool usage? Rate 0.0–1.0.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    threshold=0.35,
)
helpfulness_metric = GEval(
    model=eval_model,
    name="Helpfulness",
    criteria="How helpful is this answer to the user's question? Rate 0.0-1.0.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.35,
)
task_metric = TaskCompletionMetric(model=eval_model)


In [5]:
metric_answer_relevancy = AnswerRelevancyMetric(model=eval_model, threshold=0.8)


In [6]:
# 1. Load the first 12,000 SQuAD training examples and shuffle them
full_train = load_dataset("squad", split="train[:12000]").shuffle(seed=42)
eval_size = 2000  # Number of examples held out for evaluation

index_size = len(full_train) - eval_size

# 2. Build the RAG corpus from the first index_size (10,000) shuffled examples
rag_docs = [Document(text=d["context"]) for d in full_train.select(range(index_size))]

# 3. Build the eval set from the remaining eval_size (2,000) examples
eval_examples = full_train.select(range(index_size, len(full_train)))

questions = [d["question"] for d in eval_examples]
golds = [d["answers"]["text"][0] if d["answers"]["text"] else "" for d in eval_examples]
eval_contexts = [d["context"] for d in eval_examples]  # For debugging

print(f"RAG corpus size: {len(rag_docs)} | Eval set: {len(questions)} Qs (all context in index)")


RAG corpus size: 10000 | Eval set: 2000 Qs (all context in index)
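
SQuAD pairs several questions with each passage, so a corpus built from one document per example will contain repeated passages. The sketch below shows one way to drop repeats while preserving order before indexing. `unique_contexts` is a hypothetical helper, and whether deduplication is desirable for these experiments is an assumption, not something the notebook states.

```python
from typing import Iterable, List

def unique_contexts(contexts: Iterable[str]) -> List[str]:
    """Drop repeated passages, keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so this acts as an ordered set.
    return list(dict.fromkeys(contexts))

sample = ["passage A", "passage B", "passage A", "passage C", "passage B"]
print(unique_contexts(sample))  # ['passage A', 'passage B', 'passage C']
```

With the notebook's variables, the passages from `full_train.select(range(index_size))` could be passed through this helper before being wrapped in `Document` objects.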


In [7]:
# --- RAG index and retriever ---
docs_index = VectorStoreIndex.from_documents(rag_docs, embed_model=embed_model)
query_engine = docs_index.as_query_engine(llm=llm)
rag_tool = QueryEngineTool.from_defaults(
    query_engine, name="RAGAgent", description="Retrieval-augmented generation for SQuAD."
)
retriever = docs_index.as_retriever(similarity_top_k=3)
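
Later cells turn each retrieved node into a string with `str(n)`, which is why the logged snippets begin with `Node ID: ...`. If the goal is to pass only the passage text to the prompt, a helper along these lines may help. It assumes the retrieved objects expose a `get_content()` method, as LlamaIndex node objects generally do, and falls back to `str()` otherwise. `node_text` and the stand-in `FakeNode` class are hypothetical names used only for this illustration.

```python
def node_text(node) -> str:
    """Return the passage text of a retrieved node, without ID or score decoration."""
    getter = getattr(node, "get_content", None)
    if callable(getter):
        return getter()
    return str(node)  # fallback for plain strings or unknown objects

class FakeNode:
    """Stand-in for a retrieved node, used only to exercise the helper."""
    def __init__(self, text: str):
        self._text = text

    def get_content(self) -> str:
        return self._text

    def __str__(self) -> str:
        return f"Node ID: 123 Text: {self._text}"

print(node_text(FakeNode("In 1976, Walter Fiers ...")))  # In 1976, Walter Fiers ...
print(node_text("already a string"))                     # already a string
```

Switching the notebook to such a helper would change the prompts sent to the LLM, so results produced with `str(n)` would not be directly comparable.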

# DISTRIBUTED GOVERNANCE - independent agents with guardrails at every agent level and no central controller

In [8]:
import re
import nest_asyncio; nest_asyncio.apply()

In [9]:

# --- Build RAG index and query engine ---
#docs_index = VectorStoreIndex.from_documents(rag_docs, embed_model=embed_model)
retriever = docs_index.as_retriever(similarity_top_k=10)  # Increased top_k for more/better context

In [15]:
def rag_prompt(question, context_snippets):
    context = "\n".join([f"[Source {i+1}]: {snippet}" for i, snippet in enumerate(context_snippets)])
    prompt = f'''
You are a question answering system. Use the information in the SOURCES below to answer the question as completely and accurately as possible.
- Cite supporting facts with [Source #] after each claim.
- If the answer cannot be found in the sources, reply: "Not found in context."
- Keep your answer concise but complete; do not copy sources verbatim unless necessary.

SOURCES:
{context}

QUESTION:
{question}

ANSWER (with citations):
'''
    return prompt

def verify_citations(answer, context_snippets):
    import re
    for match in re.finditer(r"\[Source (\d+)\]", answer):
        idx = int(match.group(1)) - 1
        if idx < 0 or idx >= len(context_snippets):
            return False
    return True

def deepeval_guardrail(answer: str, question: str, retrieval_context: list = None):
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=retrieval_context or []
    )
    helpfulness = float(helpfulness_metric.measure(test_case))
    answer_relevancy = float(metric_answer_relevancy.measure(test_case))
    citations_ok = verify_citations(answer, retrieval_context or [])
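    # Compare against the thresholds configured on the metrics (0.35 and 0.8)
    if helpfulness < helpfulness_metric.threshold:
        return f"FAIL: Helpfulness below threshold ({helpfulness:.2f})"
    if answer_relevancy < metric_answer_relevancy.threshold:
        return f"FAIL: AnswerRelevancy below threshold ({answer_relevancy:.2f})"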
    if not citations_ok:
        return "FAIL: Citation(s) do not match provided sources"
    return "PASS"

def rag_generate(question, retrieved_nodes):
    context_snippets = [str(n) for n in retrieved_nodes]
    prompt = rag_prompt(question, context_snippets)
    response = llm.complete(prompt)
    if hasattr(response, 'text'):
        return response.text
    return str(response)
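
`verify_citations` matches only the exact form `[Source N]`. Some answers logged later in this notebook group citations as `[Source 1, Source 2]`, which that pattern does not match, so grouped citations are never range-checked. The sketch below is one possible, more tolerant variant; `cited_indices` and `citations_in_range` are hypothetical names, not part of the original code.

```python
import re
from typing import List

def cited_indices(answer: str) -> List[int]:
    """Collect every source number cited inside square brackets."""
    indices: List[int] = []
    # Look inside each bracketed group, then pick out every "Source N" in it.
    for group in re.findall(r"\[([^\]]*)\]", answer):
        indices.extend(int(n) for n in re.findall(r"Source\s+(\d+)", group))
    return indices

def citations_in_range(answer: str, num_sources: int) -> bool:
    """True if every cited source number refers to one of the provided sources."""
    return all(1 <= i <= num_sources for i in cited_indices(answer))

answer = "Approved on April 1, 2008 [Source 1] before the torch arrived [Source 1, Source 2]."
print(cited_indices(answer))          # [1, 1, 2]
print(citations_in_range(answer, 2))  # True
print(citations_in_range(answer, 1))  # False
```

Like the original function, this variant accepts an answer that contains no citations at all; requiring at least one citation would be a separate policy decision.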


In [16]:
import re
import time
import json
import asyncio

# 1. Normalize text and robustly check for answer in context
def normalize_text(text):
    """Lowercase, remove punctuation, extra whitespace."""
    return re.sub(r'[\W_]+', ' ', text.lower()).strip()

def answer_in_context(gold, rag_context, min_frac=0.7):
    """
    Returns True if gold answer is present (even partially) in any context snippet.
    min_frac: minimum fraction of gold words found in context for robust match.
    """
    gold_norm = normalize_text(gold)
    gold_words = set(gold_norm.split())
    if not gold_words:
        return False
    for snippet in rag_context:
        snippet_norm = normalize_text(snippet)
        # 1. Exact match (normalized)
        if gold_norm in snippet_norm:
            return True
        # 2. Loose/partial match: most gold words are in context
        snippet_words = set(snippet_norm.split())
        common = gold_words & snippet_words
        if len(common) / len(gold_words) >= min_frac:
            return True
    return False

# 2. Filter eval set for only those Qs with gold answer in retrieved context
selected_questions, selected_golds = [], []
for i, (q, g) in enumerate(zip(questions, golds)):
    nodes = retriever.retrieve(q)
    rag_context = [str(n) for n in nodes]
    if answer_in_context(g, rag_context, min_frac=0.7):
        selected_questions.append(q)
        selected_golds.append(g)
        print(f"[OK] Q{i+1} added. Gold answer is present in context.")
    else:
        print(f"[SKIP] Q{i+1}: Gold NOT found in context.\nGold: {g}\n")
    #if len(selected_questions) >= 20:   # <-- How many eval Qs you want
    #    break
    if len(selected_questions) >= 50:   # <-- How many eval Qs you want
        break

print(f"\nSelected {len(selected_questions)} eval Qs out of {len(questions)}")

[OK] Q1 added. Gold answer is present in context.
[OK] Q2 added. Gold answer is present in context.
[SKIP] Q3: Gold NOT found in context.
Gold: Portuguese Football Federation (FPF)  – Federação Portuguesa de Futebol

[OK] Q4 added. Gold answer is present in context.
[SKIP] Q5: Gold NOT found in context.
Gold: nearly 1,500

[OK] Q6 added. Gold answer is present in context.
[OK] Q7 added. Gold answer is present in context.
[OK] Q8 added. Gold answer is present in context.
[SKIP] Q9: Gold NOT found in context.
Gold: Island Def Jam

[OK] Q10 added. Gold answer is present in context.
[OK] Q11 added. Gold answer is present in context.
[SKIP] Q12: Gold NOT found in context.
Gold: John Logan

[OK] Q13 added. Gold answer is present in context.
[OK] Q14 added. Gold answer is present in context.
[OK] Q15 added. Gold answer is present in context.
[OK] Q16 added. Gold answer is present in context.
[SKIP] Q17: Gold NOT found in context.
Gold: Eastern

[OK] Q18 added. Gold answer is present in contex
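
To see why a question can be skipped even when part of its gold answer is retrieved, the partial-match rule can be reduced to a single overlap fraction. The sketch below reuses the same normalization as `normalize_text`; `gold_overlap` is a hypothetical helper, and the snippet is invented for illustration rather than taken from the dataset.

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase and replace punctuation/underscores with single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def gold_overlap(gold: str, snippet: str) -> float:
    """Fraction of distinct gold-answer words that also occur in the snippet."""
    gold_words = set(normalize_text(gold).split())
    if not gold_words:
        return 0.0
    snippet_words = set(normalize_text(snippet).split())
    return len(gold_words & snippet_words) / len(gold_words)

snippet = "The album was released by Def Jam in 2008."  # made-up passage
print(round(gold_overlap("Island Def Jam", snippet), 2))  # 0.67
print(round(gold_overlap("Def Jam", snippet), 2))         # 1.0
```

With `min_frac=0.7`, an overlap of 0.67 falls just short, so a three-word gold answer needs all three words present in one snippet to count as found.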

#run for 50 selected questions

In [19]:
# 3. Run distributed governance on the filtered eval set

async def run_distributed_squad(n=None, max_retries=1, rag_conf_threshold=0.35):
    results = []
    initial_top_k = retriever.similarity_top_k  # Save at start
    use_n = n if n else len(selected_questions)

    # Define all metrics (including new ones)
    metrics = {
        "helpfulness": helpfulness_metric,
        "transparency": transparency_metric,
        "task_completion": task_metric,
        "answer_relevancy": metric_answer_relevancy,
        
    }

    for i, (q, g) in enumerate(zip(selected_questions, selected_golds)):
        if i >= use_n:
            break
        print(f"\n=== Q{i+1}: {q}\n{'='*40}")
        t0 = time.time()
        retries = 0
        agent_answer = None
        rag_context = []
        rag_conf = 0.0
        decision = "retry"
        last_top_k = retriever.similarity_top_k

        while decision != "accept" and retries <= max_retries:
            step_start = time.time()
            # ---- Retrieval Step ----
            nodes = retriever.retrieve(q)
            rag_context = [str(n) for n in nodes]
            rag_conf = max(getattr(n, "score", 0.0) for n in nodes) if nodes else 0.0
            print(f"  [Retry {retries}] RAG context retrieved, confidence: {rag_conf:.2f}")

            print("    [Context snippets]:")
            for idx, snippet in enumerate(rag_context):
                snippet_preview = snippet[:180].replace('\n', ' ')
                print(f"      [{idx+1}] {snippet_preview}...")

            # Dynamically increase top_k, but reset per question
            if rag_conf < rag_conf_threshold and retries < max_retries:
                retriever.similarity_top_k += 2
                if retriever.similarity_top_k != last_top_k:
                    print(f"    [Info] Increased retriever.similarity_top_k to {retriever.similarity_top_k}")
                    last_top_k = retriever.similarity_top_k

            # ---- RAG Agent Step ----
            agent_start = time.time()
            rag_prompt_str = rag_prompt(q, rag_context)
            print(f"    [RAG Prompt to LLM]:\n{rag_prompt_str[:600]}...\n--- END PROMPT ---")
            agent_answer = rag_generate(q, nodes)
            agent_end = time.time()
            print(f"    [RAG LLM] Agent answer generated in {agent_end-agent_start:.2f}s")
            print(f"    [LLM Answer]:\n{str(agent_answer)[:400]}...\n--- END ANSWER ---")
            
            # ---- Guardrail Step ----
            guardrail_start = time.time()
            guardrail_result = deepeval_guardrail(agent_answer, q, rag_context)
            guardrail_end = time.time()
            print(f"    [Guardrail] Result: {guardrail_result} (checked in {guardrail_end-guardrail_start:.2f}s)")
            if guardrail_result == "PASS" and rag_conf >= rag_conf_threshold:
                decision = "accept"
            else:
                print(f"    [Retry] Guardrail failed or RAG conf too low; retrying...")
                retries += 1
            step_end = time.time()
            print(f"  [Retry {retries}] Step runtime: {step_end-step_start:.2f}s")

        t1 = time.time()
        retriever.similarity_top_k = initial_top_k  # Reset for next question
        print(f"    [Gold Answer]: {g}\n")

        # ---- Metrics Calculation ----
        test_case = LLMTestCase(
            input=q,
            actual_output=agent_answer,
            expected_output=g,
            retrieval_context=rag_context,
            tools_called=[], expected_tools=[]
        )
        metric_scores = {}
        for name, metric in metrics.items():
            score = float(metric.measure(test_case))
            metric_scores[name] = score
        metric_str = ", ".join(f"{k}: {v:.2f}" for k, v in metric_scores.items())
        print(f"    [Metrics] {metric_str}")

        results.append({
            "index": i+1,
            "query": q,
            "gold_answer": g,
            "agent_answer": agent_answer,
            "retrieved_context": rag_context,
            "rag_confidence": rag_conf,
            "decision": decision,
            "runtime": t1 - t0,
            **metric_scores  # Inject all metric values directly into the result dict
        })
        print(f"Q{i+1} done. RAG Conf: {rag_conf:.2f} (total {t1-t0:.2f}s)")

    # ---- Table Stats (Dynamic Metrics) ----
    avg_metrics = {
        name: sum(d[name] for d in results) / len(results)
        for name in metrics
    }
    avg_ragconf = sum(d['rag_confidence'] for d in results) / len(results)
    avg_rt = sum(d['runtime'] for d in results) / len(results)

    metric_values_str = " | ".join(f"{m}: {avg_metrics[m]:.2f}" for m in metrics)
    print(f"\nTable row: | Distributed | {metric_values_str} | RAG Conf.: {avg_ragconf:.2f} | Runtime: {avg_rt:.2f} |")

    # ---- Save Results ----
    with open("distributed_deepeval_guardrail_results_ICAIR.json", "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print("\nSaved distributed_deepeval_guardrail_results_ICAIR.json")

# 4. RUN!
await run_distributed_squad(max_retries=1, rag_conf_threshold=0.35) #FINAL
#await run_distributed_squad(n=2, max_retries=1, rag_conf_threshold=0.35)  # quick smoke test on two questions




=== Q1: Who was first to sequence a DNA-based genome?
  [Retry 0] RAG context retrieved, confidence: 0.64
    [Context snippets]:
      [1] Node ID: 3d7dd8f4-8a9b-4ec5-a29c-58f6cb326d88 Text: In 1976, Walter Fiers at the University of Ghent (Belgium) was the first to establish the complete nucleotide sequence of a vira...
      [2] Node ID: 1b3b67ed-d61d-4cd0-b2ce-ffaeccd8b821 Text: New sequencing technologies, such as massive parallel sequencing have also opened up the prospect of personal genome sequencing ...
      [3] Node ID: 73461e84-73fb-4900-bd9d-fc8ab65b0c3b Text: The development of new technologies has made it dramatically easier and cheaper to do sequencing, and the number of complete gen...
    [RAG Prompt to LLM]:

You are a question answering system. Use the information in the SOURCES below to answer the question as completely and accurately as possible.
- Cite supporting facts with [Source #] after each claim.
- If the answer cannot be found in the sources, reply: "Not 

    [Guardrail] Result: FAIL: AnswerRelevancy below threshold (0.75) (checked in 34.58s)
    [Retry] Guardrail failed or RAG conf too low; retrying...
  [Retry 1] Step runtime: 44.15s
  [Retry 1] RAG context retrieved, confidence: 0.64
    [Context snippets]:
      [1] Node ID: 3d7dd8f4-8a9b-4ec5-a29c-58f6cb326d88 Text: In 1976, Walter Fiers at the University of Ghent (Belgium) was the first to establish the complete nucleotide sequence of a vira...
      [2] Node ID: 1b3b67ed-d61d-4cd0-b2ce-ffaeccd8b821 Text: New sequencing technologies, such as massive parallel sequencing have also opened up the prospect of personal genome sequencing ...
      [3] Node ID: 73461e84-73fb-4900-bd9d-fc8ab65b0c3b Text: The development of new technologies has made it dramatically easier and cheaper to do sequencing, and the number of complete gen...
    [RAG Prompt to LLM]:

You are a question answering system. Use the information in the SOURCES below to answer the question as completely and accurately as

    [Guardrail] Result: FAIL: AnswerRelevancy below threshold (0.67) (checked in 42.76s)
    [Retry] Guardrail failed or RAG conf too low; retrying...
  [Retry 2] Step runtime: 52.69s
    [Gold Answer]: Fred Sanger



    [Metrics] helpfulness: 1.00, transparency: 0.20, task_completion: 1.00, answer_relevancy: 0.75
Q1 done. RAG Conf: 0.64 (total 96.84s)

=== Q2: When was a resolution agreed to about Chinese human rights issues in San Francisco?
  [Retry 0] RAG context retrieved, confidence: 0.63
    [Context snippets]:
      [1] Node ID: 6101ba51-4049-4520-9fe9-04c86270e8fd Text: On April 1, 2008, the San Francisco Board of Supervisors approved a resolution addressing human rights concerns when the Beijing...
      [2] Node ID: 072f98f4-bea0-43d0-8265-8e5052a8b356 Text: Some advocates for Tibet, Darfur, and the spiritual practice Falun Gong, planned to protest the April 9 arrival of the torch in ...
    [RAG Prompt to LLM]:

You are a question answering system. Use the information in the SOURCES below to answer the question as completely and accurately as possible.
- Cite supporting facts with [Source #] after each claim.
- If the answer cannot be found in the sources, reply: "Not found in context."

    [Guardrail] Result: PASS (checked in 42.01s)
  [Retry 0] Step runtime: 51.22s
    [Gold Answer]: April 1, 2008



    [Metrics] helpfulness: 1.00, transparency: 0.00, task_completion: 1.00, answer_relevancy: 1.00
Q2 done. RAG Conf: 0.63 (total 51.22s)

Table row: | Distributed | helpfulness: 1.00 | transparency: 0.10 | task_completion: 1.00 | answer_relevancy: 0.88 | RAG Conf.: 0.64 | Runtime: 74.03 |

Saved distributed_deepeval_guardrail_results_ICAIR.json
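
The loop above raises `retriever.similarity_top_k` during retries and restores it once the question is finished. If an exception interrupts a question, the restore line is skipped. One way to make the reset unconditional is a context manager, sketched here under the assumption that the retriever exposes a writable `similarity_top_k` attribute (which the code above already relies on). `temporary_top_k` is a hypothetical helper, and `SimpleNamespace` stands in for a real retriever.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporary_top_k(retriever):
    """Restore retriever.similarity_top_k on exit, even if an error occurs."""
    original = retriever.similarity_top_k
    try:
        yield retriever
    finally:
        retriever.similarity_top_k = original

# Stand-in object; the notebook would pass its real retriever here.
fake_retriever = SimpleNamespace(similarity_top_k=10)
with temporary_top_k(fake_retriever) as r:
    r.similarity_top_k += 2            # widen retrieval for a retry
    print(r.similarity_top_k)          # 12
print(fake_retriever.similarity_top_k)  # 10
```

Wrapping the per-question retry loop in `with temporary_top_k(retriever):` would replace the manual reset at the end of each iteration.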


In [20]:
import json

# Load  results JSON
with open("distributed_deepeval_guardrail_results_ICAIR.json", "r") as f:
    results = json.load(f)

# The metrics to average
metrics = [
    "helpfulness",
    "transparency",
    "task_completion",
    "answer_relevancy",
    "rag_confidence",
    "runtime"
]

# Compute averages (protect against missing keys)
avg = {}
N = len(results)
for m in metrics:
    vals = [r.get(m, 0.0) for r in results]
    avg[m] = sum(vals) / N if N else 0.0

# Print in table format
print("==== Final Results (Averages over {} samples) ====".format(N))
print("Helpfulness: {:.2f}".format(avg["helpfulness"]))
print("Transparency: {:.2f}".format(avg["transparency"]))
print("Task Completion: {:.2f}".format(avg["task_completion"]))
print("Answer Relevancy: {:.2f}".format(avg["answer_relevancy"]))
print("Avg. RAG Conf.: {:.2f}".format(avg["rag_confidence"]))
print("Avg. Runtime (s): {:.2f}".format(avg["runtime"]))


==== Final Results (Averages over 2 samples) ====
Helpfulness: 1.00
Transparency: 0.10
Task Completion: 1.00
Answer Relevancy: 0.88
Avg. RAG Conf.: 0.64
Avg. Runtime (s): 74.03



# CENTRALIZED GOVERNANCE
- Centralized governance runs the agentic flow end to end, evaluates only the final output, and then decides whether to accept it or retry.
- It works like a teacher grading the final answer produced by all agents, and how they arrived at it, rather than checking each agent's work along the way.

In [21]:
def summarizer_prompt(answer_with_sources, question):
    return (
        "Summarize the answer below, preserving all source citations.\n\n"
        f"Answer: {answer_with_sources}\n\n"
        f"Question: {question}\n\n"
        "Summary:"
    )

def summarize_answer(answer_with_sources, question):
    """
    Summarizes the RAG answer, preserving citations and using the original question for clarity.
    """
    prompt = summarizer_prompt(answer_with_sources, question)
    response = llm.complete(prompt)
    if hasattr(response, 'text'):
        return response.text
    return str(response)


In [None]:
import time
import json

# --- Centralized governance (run full pipeline then evaluate once) ---
def centralized_governance(question: str, gold: str) -> dict:
    t0 = time.time()
    
    # 1. RAG Retrieval
    nodes = retriever.retrieve(question)
    rag_context = [str(n) for n in nodes]
    rag_conf = max(getattr(n, "score", 0.0) for n in nodes) if nodes else 0.0

    # 2. RAG Generation
    rag_answer = rag_generate(question, nodes)

    # 3. Summarization (if you want to summarize before evaluation)
    try:
        summary = summarize_answer(rag_answer, question)
    except Exception:
        summary = rag_answer   # fallback

    t1 = time.time()  # Runtime covers retrieval, generation, and summarization; metric evaluation is not timed

    # 4. Evaluation (centralized, on summary)
    test_case = LLMTestCase(
        input=question,
        actual_output=summary,
        expected_output=gold,
        retrieval_context=rag_context,
        tools_called=[], expected_tools=[]
    )
    output = {
        "query": question,
        "gold_answer": gold,
        "response": summary,
        "retrieved_context": rag_context,
        "rag_confidence": rag_conf,
        "runtime": t1-t0,
        "helpfulness": float(helpfulness_metric.measure(test_case)),
        "transparency": float(transparency_metric.measure(test_case)),
        "task_completion": float(task_metric.measure(test_case)),
        "answer_relevancy": float(metric_answer_relevancy.measure(test_case)),
    }
    
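    # All-or-nothing guardrail (centralized): helpfulness, transparency, and task
    # completion must all pass; answer relevancy is recorded but does not gate the decision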
    passed = (
        output["helpfulness"] >= helpfulness_metric.threshold and
        output["transparency"] >= transparency_metric.threshold and
        output["task_completion"] >= task_metric.threshold
    )
    output["decision"] = "accept" if passed else "retry"
    
    return output

# ---- Batch runner ----
results = []
for i, (q, g) in enumerate(zip(selected_questions, selected_golds)):
    print(f"\n=== Q{i+1}: {q}\n{'='*40}")
    out = centralized_governance(q, g)
    print(f"  [Gold]: {g}")
    print(f"  [Pred]: {out['response']}")
    print(
        f"  [Help]: {out['helpfulness']:.2f} | [Transp]: {out['transparency']:.2f} | "
        f"[Task]: {out['task_completion']:.2f} | [Rel]: {out['answer_relevancy']:.2f} | "
        f"[RAG Conf]: {out['rag_confidence']:.2f} | [Decision]: {out['decision']}"
    )
    results.append(out)

# --- Print summary table ---
n = len(results)
avg_help = sum(r['helpfulness'] for r in results) / n
avg_transp = sum(r['transparency'] for r in results) / n
avg_task = sum(r['task_completion'] for r in results) / n
avg_relevancy = sum(r['answer_relevancy'] for r in results) / n
avg_ragconf = sum(r['rag_confidence'] for r in results) / n
avg_rt = sum(r['runtime'] for r in results) / n

print(
    f"\nTable row: | Centralized | "
    f"Helpfulness: {avg_help:.2f} | Transparency: {avg_transp:.2f} | Task: {avg_task:.2f} | "
    f"Answer Relevancy: {avg_relevancy:.2f} | "
    f"RAG Conf.: {avg_ragconf:.2f} | Runtime: {avg_rt:.2f} |"
)

with open("centralized_deepeval_guardrail_results_ICAIR.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
print("\nSaved centralized_deepeval_guardrail_results_ICAIR.json")
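
`centralized_governance` records a `"retry"` decision, but the batch runner does not act on it. If a rerun is wanted, a thin wrapper such as the following could call the pipeline again until the output is accepted or an attempt limit is reached. `run_until_accepted` and `max_attempts` are hypothetical; the stub pipeline stands in for `centralized_governance` so the example runs without an LLM.

```python
from typing import Callable, Dict

def run_until_accepted(pipeline: Callable[[], Dict], max_attempts: int = 2) -> Dict:
    """Call pipeline() until its 'decision' is 'accept' or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        output = pipeline()
        output["attempts"] = attempt  # record how many runs were needed
        if output.get("decision") == "accept":
            break
    return output

# Stub that is rejected once, then accepted.
decisions = iter(["retry", "accept"])

def stub_pipeline() -> Dict:
    return {"decision": next(decisions)}

print(run_until_accepted(stub_pipeline, max_attempts=3))
# {'decision': 'accept', 'attempts': 2}
```

In the notebook this might be called as `run_until_accepted(lambda: centralized_governance(q, g))`, at the cost of repeating the full pipeline and evaluation on each attempt.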


## HYBRID GOVERNANCE

The system (controller logic) dynamically chooses where to insert guardrails/checks, depending on the agent outputs, context, or intermediate metric scores.

For example:

* If RAG retrieval confidence is high, skip the summarization check.
* If RAG retrieval confidence is low, add extra checks or reroute to summarization for additional validation.

How this hybrid governance works:

* Runs the full agentic flow (Classifier → RAG → Summarizer).
* Computes a simple proxy for RAG confidence: the highest similarity score among the retrieved nodes.
* Conditionally applies guardrails/metrics:

  * If confidence is high: only the helpfulness check must pass (fewer, faster checks).
  * If confidence is low: all checks must pass (helpfulness, transparency, and task completion).
* Makes a final decision (accept/retry) based only on the checks applied.
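
The selection rule described above can be sketched as follows. The check names mirror the metrics used in this notebook, and the 0.55 threshold matches the value passed to the hybrid runner. `select_checks`, `run_selected`, and the stub scorers are hypothetical, and the 0.5 threshold for task completion is an assumed value. This sketch calls only the scorers that were selected, which is where any runtime saving would come from.

```python
from typing import Callable, Dict, List

def select_checks(rag_conf: float, threshold: float = 0.55) -> List[str]:
    """High confidence: helpfulness only. Low confidence: all three checks."""
    if rag_conf >= threshold:
        return ["helpfulness"]
    return ["helpfulness", "transparency", "task_completion"]

def run_selected(checks: List[str],
                 scorers: Dict[str, Callable[[], float]],
                 thresholds: Dict[str, float]) -> Dict:
    """Score only the selected checks and decide accept/retry from them."""
    scores = {name: scorers[name]() for name in checks}
    passed = all(scores[name] >= thresholds[name] for name in checks)
    return {"scores": scores, "decision": "accept" if passed else "retry"}

# Stub scorers returning fixed values; real ones would call the DeepEval metrics.
scorers = {
    "helpfulness": lambda: 1.0,
    "transparency": lambda: 0.1,
    "task_completion": lambda: 1.0,
}
thresholds = {"helpfulness": 0.35, "transparency": 0.35, "task_completion": 0.5}

print(run_selected(select_checks(0.64), scorers, thresholds)["decision"])  # accept
print(run_selected(select_checks(0.40), scorers, thresholds)["decision"])  # retry
```

The same low transparency score leads to different decisions depending on the branch, which is the behavior the hybrid mode is designed to exhibit.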



In [None]:
import time
import json

# --- Hybrid governance: dynamic guardrails based on RAG confidence ---
def hybrid_governance(question: str, gold: str, rag_conf_threshold=0.55) -> dict:
    t0 = time.time()
    # 1. RAG Retrieval
    nodes = retriever.retrieve(question)
    rag_context = [str(n) for n in nodes]
    rag_conf = max(getattr(n, "score", 0.0) for n in nodes) if nodes else 0.0

    # 2. RAG Generation
    rag_answer = rag_generate(question, nodes)

    # 3. Summarization (optional)
    try:
        summary = summarize_answer(rag_answer, question)
    except Exception:
        summary = rag_answer   # fallback

    t1 = time.time()  # Runtime covers retrieval, generation, and summarization; metric evaluation is not timed

    # 4. Evaluation (guardrails selected dynamically)
    test_case = LLMTestCase(
        input=question,
        actual_output=summary,
        expected_output=gold,
        retrieval_context=rag_context,
        tools_called=[], expected_tools=[]
    )
    output = {
        "query": question,
        "gold_answer": gold,
        "response": summary,
        "retrieved_context": rag_context,
        "rag_confidence": rag_conf,
        "runtime": t1-t0,
        "helpfulness": float(helpfulness_metric.measure(test_case)),
        "transparency": float(transparency_metric.measure(test_case)),
        "task_completion": float(task_metric.measure(test_case)),
        "answer_relevancy": float(metric_answer_relevancy.measure(test_case)),
    }

    # --- Guardrail logic ---
    if rag_conf >= rag_conf_threshold:
        # High confidence: only require helpfulness
        passed = output["helpfulness"] >= helpfulness_metric.threshold
    else:
        # Low confidence: stricter; must pass all
        passed = (
            output["helpfulness"] >= helpfulness_metric.threshold and
            output["transparency"] >= transparency_metric.threshold and
            output["task_completion"] >= task_metric.threshold
        )
    output["decision"] = "accept" if passed else "retry"
    return output

# --- Batch runner for hybrid ---
results = []
for i, (q, g) in enumerate(zip(selected_questions, selected_golds)):
    print(f"\n=== Q{i+1}: {q}\n{'='*40}")
    out = hybrid_governance(q, g, rag_conf_threshold=0.55)
    print(f"  [Gold]: {g}")
    print(f"  [Pred]: {out['response']}")
    print(
        f"  [Helpfulness]: {out['helpfulness']:.2f} | [Transparency]: {out['transparency']:.2f} | "
        f"[Task]: {out['task_completion']:.2f} | [Rel]: {out['answer_relevancy']:.2f} | "
        f"[RAG Conf]: {out['rag_confidence']:.2f} | "
        f"[Runtime]: {out['runtime']:.2f} | [Decision]: {out['decision']}"
    )
    results.append(out)

# --- Print summary table ---
n = len(results)
avg_help = sum(r['helpfulness'] for r in results) / n
avg_transp = sum(r['transparency'] for r in results) / n
avg_task = sum(r['task_completion'] for r in results) / n
avg_relevancy = sum(r['answer_relevancy'] for r in results) / n
avg_ragconf = sum(r['rag_confidence'] for r in results) / n
avg_rt = sum(r['runtime'] for r in results) / n

print(
    f"\nTable row: | Hybrid | "
    f"Helpfulness: {avg_help:.2f} | Transparency: {avg_transp:.2f} | Task: {avg_task:.2f} | "
    f"Answer Relevancy: {avg_relevancy:.2f} | "
    f"RAG Conf.: {avg_ragconf:.2f} | Runtime: {avg_rt:.2f} |"
)
with open("hybrid_deepeval_guardrail_results_ICAIR.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
print("\nSaved hybrid_deepeval_guardrail_results_ICAIR.json")



=== Q1: Who was first to sequence a DNA-based genome?


  [Gold]: Fred Sanger
  [Pred]: In 1977, Fred Sanger sequenced the first DNA-based genome, the 5,386 base-pair Phage Φ-X174 [Source 1].
  [Helpfulness]: 1.00 | [Transparency]: 0.10 | [Task]: 1.00 | [Rel]: 1.00 | [RAG Conf]: 0.64 | [Runtime]: 24.48 | [Decision]: accept

=== Q2: When was a resolution agreed to about Chinese human rights issues in San Francisco?


  [Gold]: April 1, 2008
  [Pred]: On April 1, 2008, the San Francisco Board of Supervisors approved a resolution about human rights in China [Source 1] in anticipation of the arrival of the Beijing Olympic torch [Source 1, Source 2].
  [Helpfulness]: 1.00 | [Transparency]: 0.00 | [Task]: 1.00 | [Rel]: 1.00 | [RAG Conf]: 0.63 | [Runtime]: 18.97 | [Decision]: accept

=== Q3: When did Beyonce sign a letter for ONE Campaign?


In [None]:
# --- Batch runner for hybrid with branch/rate debugging ---
results = []
for i, (q, g) in enumerate(zip(selected_questions, selected_golds)):
    out = hybrid_governance(q, g, rag_conf_threshold=0.55)
    # FAST = high-confidence branch (helpfulness only); SLOW = low-confidence branch (all checks)
    which_branch = "FAST" if out["rag_confidence"] >= 0.55 else "SLOW"
    print(
        f"Q{i+1}: [Decision]: {out['decision']} | [RAG Conf]: {out['rag_confidence']:.2f} | "
        f"[Branch]: {which_branch} | [Runtime]: {out['runtime']:.2f} s"
    )
    results.append(out)


## Final Aggregation and Table Output


In [None]:
import json
import numpy as np

def get_metrics_from_file(filename, mode="centralized"):
    with open(filename) as f:
        data = json.load(f)
    # Unwrap rows saved as {"centralized": {...}} or {"hybrid": {...}} if present;
    # rows saved as flat dicts (as the runners above write them) are used as-is.
    if mode in ("centralized", "hybrid"):
        outs = [row.get(mode, row) for row in data]
    else:
        outs = data  # distributed
    # Compute metrics
    accuracy = np.mean([row.get("task_completion", 0) for row in outs])
    helpfulness = np.mean([row.get("helpfulness", 0) for row in outs])
    transparency = np.mean([row.get("transparency", 0) for row in outs])
    rag_conf = np.mean([row.get("rag_confidence", 0) for row in outs])
    runtime = np.mean([row.get("time", row.get("runtime", 0)) for row in outs])
    accept = sum(1 for row in outs if row.get("decision", "").lower() == "accept")
    retry = sum(1 for row in outs if row.get("decision", "").lower() == "retry")
    n = len(outs)
    decision_str = f"{accept} Accept, {retry} Retry"
    #return accuracy, helpfulness, transparency, rag_conf, runtime, decision_str
    return  helpfulness, transparency, rag_conf, runtime

# Load metrics for all three
centralized_metrics = get_metrics_from_file("centralized_deepeval_guardrail_results_ICAIR.json", mode="centralized")
hybrid_metrics = get_metrics_from_file("hybrid_deepeval_guardrail_results_ICAIR.json", mode="hybrid")
distributed_metrics = get_metrics_from_file("distributed_deepeval_guardrail_results_ICAIR.json", mode="distributed")

# Print LaTeX table rows
for name, vals in zip(
    ["Centralized", "Hybrid", "Distributed"],
    [centralized_metrics, hybrid_metrics, distributed_metrics]
):
    #print(f"{name} & {vals[0]:.2f} & {vals[1]:.2f} & {vals[2]:.2f} & {vals[3]:.2f} & {vals[4]:.2f} & {vals[5]} \\\\")
    print(f"{name} & {vals[0]:.2f} & {vals[1]:.2f} & {vals[2]:.2f} & {vals[3]:.2f}  \\\\")
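
To paste the rows into a paper, they need a surrounding `tabular` environment. The following is a minimal sketch, assuming the four columns returned by `get_metrics_from_file` (helpfulness, transparency, RAG confidence, runtime). `latex_table` is a hypothetical helper, and the numbers in the example are placeholders, not results.

```python
from typing import Dict, Sequence

def latex_table(rows: Dict[str, Sequence[float]]) -> str:
    """Format {mode name: (helpfulness, transparency, rag_conf, runtime)} as LaTeX."""
    header = "Governance & Helpfulness & Transparency & RAG Conf. & Runtime (s) \\\\"
    lines = ["\\begin{tabular}{lrrrr}", "\\hline", header, "\\hline"]
    for name, vals in rows.items():
        cells = " & ".join(f"{v:.2f}" for v in vals)
        lines.append(f"{name} & {cells} \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)

# Placeholder values for illustration only.
print(latex_table({"ModeA": (0.5, 0.25, 0.75, 10.0)}))
```

In the notebook, the dictionary could be built from `centralized_metrics`, `hybrid_metrics`, and `distributed_metrics`.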
