---
## 1. Why evaluate agents? *(Required)*

An agent that *seems* to work in a few demos might fail on edge cases, hallucinate tool results, or give confidently wrong answers. **Evaluation** is how you find out before your users do.

MLflow 3 provides three types of judges:

| Judge type | What it does | When to use |
|-----------|-------------|-------------|
| **Built-in** (e.g., `Correctness`, `Safety`) | Pre-configured scorers with standard rubrics | Quick baseline — does the agent give correct, safe answers? |
| **Guidelines** | You write natural-language rules; the judge checks compliance | Domain-specific quality bars (e.g., "must cite the data source") |
| **Custom** (`make_judge()`) | You write the full prompt template | Full control — your own rubric, scoring, and output format |

In this assignment you'll use all three on your UltraFeedback Expert agent and compare how they rate the same outputs.

---
## 2. Install dependencies *(Required)*

In [None]:
%pip install --upgrade "mlflow[databricks]>=3.1.0" databricks-langchain unitycatalog-ai[databricks] numpy databricks-agents
dbutils.library.restartPython()

---
## 3. Verify embeddings table *(Required)*

The embeddings table (`main.default.ultrafeedback_embeddings`) was created in Assignment 3. Verify it exists and has the expected data.


In [None]:
# Verify the embeddings table from Assignment 3 exists
emb_df = spark.table("main.default.ultrafeedback_embeddings")
print(f"Embeddings table has {emb_df.count()} rows.")
print(f"Columns: {emb_df.columns}")
display(emb_df.select("id", "instruction", "source").limit(3))


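Next, recreate the UltraFeedback Expert agent from Assignment 4 and enable tracing, then run a quick test question.


In [None]: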
import mlflow
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate
from databricks_langchain import ChatDatabricks, UCFunctionToolkit

# Enable tracing
mlflow.langchain.autolog()
mlflow.set_experiment("/Users/" + spark.sql("SELECT current_user()").first()[0] + "/aai594_assignment5_eval")

# Load prompt from UC Prompt Registry
PROMPT_NAME = "main.default.ultrafeedback_expert_prompt"
try:
    # Prompt registry URIs take the form prompts:/<name>@<alias>
    loaded_prompt = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")
    SYSTEM_PROMPT = loaded_prompt.template
    print(f"Loaded prompt from registry: {PROMPT_NAME}@production")
except Exception:  # a bare except would also swallow KeyboardInterrupt
    # Fallback if prompt wasn't registered in Assignment 4
    SYSTEM_PROMPT = """You are the UltraFeedback Expert, an AI assistant that helps users
explore and understand the UltraFeedback LLM preference dataset.
Use your tools to answer questions accurately.
Always cite which tool you used and explain the results.
If you don't have a relevant tool, say so rather than guessing."""
    print("Using fallback prompt (prompt registry not available).")

# Build the agent
LLM_ENDPOINT = "databricks-meta-llama-3-3-70b-instruct"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT, temperature=0.1)

UC_FUNCTION_NAMES = [
    "main.default.lookup_source_info",
    "main.default.analyze_instruction",
    # "main.default.your_custom_function",  # <-- add your custom function
]
toolkit = UCFunctionToolkit(function_names=UC_FUNCTION_NAMES)
tools = toolkit.tools

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

# Quick test
resp = agent_executor.invoke({"input": "How many rows come from evol_instruct?"})
print("Agent test:", resp["output"][:200])

---
## 4. Create an evaluation dataset *(Required)*

An evaluation dataset is a set of questions (inputs) with optional expected answers (expectations). You need enough variety to test different agent capabilities.

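You can build the dataset in two ways:
1. **Manually**: you write questions that specifically target your tools
2. **Synthetically**: you use an LLM to generate additional questions

The cells below build the manual set, which is sufficient on its own. Synthetic questions are optional.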

**Docs:** [Build evaluation datasets](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset)

In [None]:
# ---- Manual evaluation questions ----
# These target specific tool capabilities. Add 5 of your own at the end.
manual_eval_data = [
    # Should trigger lookup_source_info
    {
        "inputs": {"input": "How many rows come from the sharegpt source?"},
        "expectations": {"expected_facts": ["should call lookup_source_info", "should return a row count"]}
    },
    {
        "inputs": {"input": "What kind of instructions does the flan_v2 source contain?"},
        "expectations": {"expected_facts": ["should call lookup_source_info", "should show a sample instruction"]}
    },
    # Should trigger analyze_instruction
    {
        "inputs": {"input": "Analyze the complexity of: What is 2+2?"},
        "expectations": {"expected_facts": ["should call analyze_instruction", "should report low complexity"]}
    },
    {
        "inputs": {"input": "Analyze this instruction: Write a comprehensive essay on the socioeconomic impacts of climate change across developing nations, including policy recommendations."},
        "expectations": {"expected_facts": ["should call analyze_instruction", "should report high complexity"]}
    },
    # Should NOT trigger tools (tests the agent's judgment)
    {
        "inputs": {"input": "What is the capital of France?"},
        "expectations": {"expected_facts": ["should say it doesn't have a relevant tool or answer from general knowledge"]}
    },
    # Multi-step or ambiguous
    {
        "inputs": {"input": "Compare the evol_instruct and sharegpt sources. Which has more data?"},
        "expectations": {"expected_facts": ["should call lookup_source_info for both sources", "should compare the row counts"]}
    },
    {
        "inputs": {"input": "What sources are available in the dataset?"},
        "expectations": {"expected_facts": ["should list multiple source names"]}
    },
    # Edge cases
    {
        "inputs": {"input": "Look up the source called 'nonexistent_source_xyz'"},
        "expectations": {"expected_facts": ["should call lookup_source_info", "should indicate no data found or zero rows"]}
    },

    # ---- YOUR 5 QUESTIONS BELOW ----
    # Add 5 more evaluation questions that test different aspects of the agent.
    # Include expected_facts for each.
    # {
    #     "inputs": {"input": "YOUR QUESTION HERE"},
    #     "expectations": {"expected_facts": ["what you expect"]}
    # },
]

print(f"Manual eval dataset: {len(manual_eval_data)} questions")

In [None]:
# Combine into the final evaluation dataset.
# If you generated synthetic questions, append them to this list;
# otherwise the manual set is sufficient on its own.
eval_data = manual_eval_data
print(f"Total evaluation dataset: {len(eval_data)} questions")
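
Before spending judge calls, it can help to confirm that every record has the shape used above. The following is a minimal sketch, not part of MLflow: `find_malformed_records` is a hypothetical helper that only checks the `inputs` / `expectations.expected_facts` layout used in this notebook, and the sample records are made up for illustration.

```python
def find_malformed_records(records):
    """Return (index, problem) pairs for records that do not match the layout
    {"inputs": {"input": str}, "expectations": {"expected_facts": [str, ...]}}."""
    problems = []
    for i, rec in enumerate(records):
        question = rec.get("inputs", {}).get("input")
        if not isinstance(question, str) or not question.strip():
            problems.append((i, "missing or empty inputs['input']"))
        facts = rec.get("expectations", {}).get("expected_facts")
        facts_ok = (
            isinstance(facts, list)
            and len(facts) > 0
            and all(isinstance(f, str) for f in facts)
        )
        if not facts_ok:
            problems.append((i, "expected_facts must be a non-empty list of strings"))
    return problems


# Hypothetical sample: the second record is missing its expected_facts.
sample = [
    {"inputs": {"input": "How many rows come from the sharegpt source?"},
     "expectations": {"expected_facts": ["should return a row count"]}},
    {"inputs": {"input": "What sources are available?"}, "expectations": {}},
]
print(find_malformed_records(sample))
```

In the notebook you would call `find_malformed_records(eval_data)` and expect an empty list.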

---
## 5. Built-in judge *(Required)*

Built-in judges are pre-configured scorers that assess standard quality dimensions. They require minimal setup.

We'll use `RelevanceToQuery` — it checks whether the agent's response actually addresses the user's question.

**Docs:** [Built-in scorers](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges)

In [None]:
from mlflow.genai.scorers import RelevanceToQuery

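# Define the predict function, which wraps the agent for evaluation.
# mlflow.genai.evaluate passes the keys of each record's "inputs" dict as keyword
# arguments, so the parameter must be named `input` to match the dataset.
def predict_fn(input):
    """Run the agent on one question and return the output string."""
    response = agent_executor.invoke({"input": input})
    return response["output"]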

# Run evaluation with the built-in RelevanceToQuery judge
print("Running evaluation with RelevanceToQuery...")
builtin_results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery()]
)

print("\nBuilt-in judge evaluation complete. Check the MLflow UI link above for detailed results.")
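
As the MLflow documentation describes it, `mlflow.genai.evaluate` passes the keys of each record's `inputs` dict to `predict_fn` as keyword arguments, so the parameter names must match those keys. The sketch below imitates that call pattern with a stub in place of the agent, so you can smoke-test a wrapper without any LLM calls. `StubExecutor`, `stub_predict_fn`, and `dry_run` are hypothetical names used only for illustration.

```python
class StubExecutor:
    """Stand-in for agent_executor: echoes the question instead of calling an LLM."""

    def invoke(self, payload):
        return {"output": f"stub answer to: {payload['input']}"}


stub_executor = StubExecutor()


def stub_predict_fn(input):
    # The parameter is named `input` because the dataset's inputs dict uses that key.
    return stub_executor.invoke({"input": input})["output"]


def dry_run(predict_fn, records):
    """Call predict_fn the way the harness is assumed to: predict_fn(**inputs)."""
    return [predict_fn(**rec["inputs"]) for rec in records]


records = [{"inputs": {"input": "What is the capital of France?"}}]
print(dry_run(stub_predict_fn, records))
```

A wrapper whose parameter name does not match the `inputs` key raises a `TypeError` under this call pattern, which makes the mismatch visible before a full evaluation run.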

---
## 6. Guidelines judge *(Required)*

A **guidelines judge** checks whether the response follows specific rules you define in natural language. This is great for domain-specific quality bars.

You'll write rules specific to the UltraFeedback Expert agent.

In [None]:
from mlflow.genai.scorers import Guidelines

# Define domain-specific guidelines for the UltraFeedback Expert
tool_citation_judge = Guidelines(
    name="tool_citation",
    guidelines="The response must clearly state which tool or data source was used "
               "to generate the answer. If no tool was needed, the response should "
               "acknowledge that. Vague answers that don't explain their source fail."
)

data_grounding_judge = Guidelines(
    name="data_grounding",
    guidelines="When the question is about the UltraFeedback dataset, the response "
               "must include specific numbers or examples from the actual data "
               "(e.g., row counts, source names, sample instructions). Generic or "
               "made-up statistics fail this criterion."
)

# Run evaluation with both guidelines judges
print("Running evaluation with Guidelines judges...")
guidelines_results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[tool_citation_judge, data_grounding_judge]
)

print("\nGuidelines judge evaluation complete.")
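
LLM judges can be cross-checked with a cheap deterministic test. The sketch below is a keyword heuristic, not the `Guidelines` judge itself: it passes a response only if the text names one of the agent's tools. `mentions_tool` is a hypothetical helper, and the sample responses are made up for illustration.

```python
TOOL_NAMES = ("lookup_source_info", "analyze_instruction")


def mentions_tool(response, tool_names=TOOL_NAMES):
    """Return True if the response text names at least one known tool (case-insensitive)."""
    text = response.lower()
    return any(name.lower() in text for name in tool_names)


# Made-up responses for illustration only.
cited = "Using lookup_source_info, I found matching rows for that source."
uncited = "There are a lot of rows from that source."
print(mentions_tool(cited), mentions_tool(uncited))
```

Rows where this heuristic and the `tool_citation` judge disagree are good candidates for manual review.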

---
## 7. Custom judge *(Required)*

A **custom judge** gives you full control over the evaluation prompt. You define the rubric, scoring criteria, and output format. This is useful when built-in and guidelines judges don't capture exactly what you care about.

We'll create a custom judge that evaluates whether the agent used its tools **appropriately** — not just whether the answer was relevant, but whether the agent chose the *right* tool for the question.

**Docs:** [Custom judges](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge)

In [None]:
from mlflow.genai.judges import make_judge

# Create a custom judge that evaluates tool usage appropriateness.
# make_judge takes the rubric as `instructions` and fills the reserved
# {{ inputs }} and {{ outputs }} template variables from each evaluated row.
tool_usage_judge = make_judge(
    name="tool_usage_quality",
    instructions="""You are evaluating an AI agent's ability to use tools appropriately.

The agent has access to these tools:
- lookup_source_info: looks up row counts and samples for a data source name
- analyze_instruction: analyzes complexity of instruction text
- Vector Search: finds semantically similar instructions

User question: {{ inputs }}
Agent response: {{ outputs }}

Evaluate the agent's tool usage:
1. Did the agent call the appropriate tool(s) for this question?
2. If the question didn't need a tool, did the agent correctly avoid using one?
3. Did the agent use the tool results correctly in its response?

Return YES if tool usage was appropriate, NO if it was not.
Explain your reasoning.""",
)

# Run evaluation with the custom judge
print("Running evaluation with custom judge...")
custom_results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[tool_usage_judge]
)

print("\nCustom judge evaluation complete.")
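
It can help to preview a judge prompt with real values before running it. The sketch below is a hypothetical stand-in for template substitution, not MLflow's own renderer: `render_template` fills `{{ name }}` placeholders (with or without inner spaces) and raises if a placeholder has no value. The placeholder names and sample values are made up for illustration.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, **values):
    """Replace {{ name }} placeholders with values; raise KeyError if one is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, template)


template = "User question: {{ question }}\nAgent response: {{answer}}"
preview = render_template(
    template,
    question="How many rows come from sharegpt?",
    answer="(agent output goes here)",
)
print(preview)
```

Reading the rendered text end to end is a quick way to catch a rubric that refers to information the judge never receives.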

---
## 8. Compare all judges and analyze *(Required)*

Now run all judges together in a single evaluation pass and compare how they rate the same outputs.

In [None]:
# Run all judges together
print("Running combined evaluation with all judges...")
all_results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        RelevanceToQuery(),
        tool_citation_judge,
        data_grounding_judge,
        tool_usage_judge,
    ]
)

print("\nCombined evaluation complete. See the MLflow UI for the full results table.")
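
If you copy the per-question verdicts out of the results table, the comparisons asked for in 8a and 8b can be tallied in a few lines. This is a sketch under assumptions: the rows below are made-up placeholders rather than real results, and the "yes"/"no" strings and the `rows` layout are an assumed export format, not something MLflow produces in this shape.

```python
def pass_rates(rows):
    """Fraction of 'yes' verdicts per judge across all rows."""
    judges = rows[0]["verdicts"].keys()
    return {
        judge: sum(r["verdicts"][judge] == "yes" for r in rows) / len(rows)
        for judge in judges
    }


def disagreements(rows):
    """Questions where the judges did not all return the same verdict."""
    return [r["question"] for r in rows if len(set(r["verdicts"].values())) > 1]


# Made-up placeholder verdicts for illustration; replace with your own results.
rows = [
    {"question": "q1", "verdicts": {"relevance": "yes", "tool_citation": "no"}},
    {"question": "q2", "verdicts": {"relevance": "yes", "tool_citation": "yes"}},
]
print(pass_rates(rows))
print(disagreements(rows))
```

The judge with the lowest pass rate is the strictest on this dataset, and the questions returned by `disagreements` are the ones worth reading closely.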

### Your analysis

Review the evaluation results in the MLflow UI (click the link in the cell output above, then go to the **Evaluations** tab). Answer these questions:

**8a. Which judge was strictest? Which was most lenient?**

*Look at the pass/fail rates across all judges. Which one failed the most responses?*

**[Your answer here]**

**8b. Where did the judges disagree?**

*Find specific examples where one judge passed and another failed the same response. Why did they disagree? What does this tell you about evaluation design?*

**[Your answer here]**

**8c. What did the evaluation reveal about your agent?**

*What are the agent's strengths and weaknesses based on the evaluation? If you were to improve the agent, what would you change first — the tools, the prompt, or the LLM?*

**[Your answer here]**

**8d. How does this connect to the readings?**

*Reference at least one of the Week 5 readings (Product Evals, Survey on Evaluation, Evals FAQ). How did the reading inform your approach to evaluation?*

**[Your answer here]**

---
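## 9. Clean up *(Required)*

There are no Vector Search resources to clean up. The embeddings are stored in a Delta table (`main.default.ultrafeedback_embeddings`), which persists across assignments at no additional cost.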

---
## Lab complete

### Required (Sections 1–9)
- [ ] **Section 3:** Agent recreated and verified (tools + embeddings table + prompt).
- [ ] **Section 4:** Evaluation dataset created with at least 8 provided + 5 of your own questions.
- [ ] **Section 5:** Built-in judge (`RelevanceToQuery`) evaluation ran successfully.
- [ ] **Section 6:** Guidelines judges (`tool_citation`, `data_grounding`) evaluation ran successfully.
- [ ] **Section 7:** Custom judge (`tool_usage_quality`) evaluation ran successfully.
- [ ] **Section 8:** Combined evaluation ran. All four analysis questions answered.
- [ ] **Section 9:** No VS cleanup needed (embeddings persist in Delta table).

**Submit:** Your executed notebook (`.ipynb` with all outputs) and the completed `SUBMISSION_5.md`.

*Next week you'll generate traces, provide human feedback, and use `align()` to improve your judges.*