---
## Clean up

No Vector Search resources to clean up — the embeddings are stored in a Delta table (`main.default.ultrafeedback_embeddings`) which persists across assignments at no additional cost.


---
## 1. The human-in-the-loop *(Required)*

Automated judges are scalable but imperfect. They can miss domain nuances, apply criteria inconsistently, or disagree with expert opinions. **Human feedback** addresses this:

- A domain expert reviews agent outputs and rates them (good/bad, with rationale)
- The `align()` function uses this feedback to optimize the judge's prompt, making it better match human standards
- The aligned judge can then evaluate at scale, acting as a proxy for the human expert

This is the **SIMBA** (Stochastic Introspective Mini-Batch Ascent) approach described in the MLflow docs. It uses DSPy optimization under the hood to iteratively refine the judge's instructions.

### Requirements for `align()`
- A judge created with `make_judge()` (template-based)
- At least **10 traces** with human feedback (we'll generate 15)
- The feedback assessment name must **exactly match** the judge name
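
The requirements above can be checked before spending time on an alignment run. The following is a minimal sketch, assuming the feedback has already been gathered as `(trace_id, assessment_name)` pairs. The helper `alignment_ready` and that tuple layout are hypothetical and not part of MLflow.

```python
def alignment_ready(feedback, judge_name, min_traces=10):
    """Return (ok, n_matching) for a list of (trace_id, assessment_name) pairs.

    Only feedback whose name equals judge_name exactly is counted,
    and each trace is counted once.
    """
    matching = {trace_id for trace_id, name in feedback if name == judge_name}
    return len(matching) >= min_traces, len(matching)


# Illustrative data: 12 correctly named entries and one with the wrong case
feedback = [(f"tr-{i}", "tool_usage_quality") for i in range(12)]
feedback.append(("tr-99", "Tool_Usage_Quality"))  # does not count: name differs

print(alignment_ready(feedback, "tool_usage_quality"))  # (True, 12)
```

The comparison is deliberately case-sensitive, since a near-match in the assessment name is the easiest way to end up with too few usable traces.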

---
## 2. Install dependencies *(Required)*

In [None]:
# align() requires MLflow 3.4.0+
# dspy is listed on the assumption that the SIMBA optimizer needs it installed separately
%pip install --upgrade "mlflow[databricks]>=3.4.0" databricks-langchain "unitycatalog-ai[databricks]" dspy numpy
dbutils.library.restartPython()

---
## 3. Recreate agent and custom judge *(Required)*

Rebuild everything from previous assignments: the tools, prompt, agent, and custom judge. First confirm that the embeddings table from Assignment 3 is still available.

### 3a. Verify embeddings table *(Required)*

The embeddings table (`main.default.ultrafeedback_embeddings`) was created in Assignment 3. Verify it exists and has the expected data.


In [None]:
# Verify the embeddings table from Assignment 3 exists
emb_df = spark.table("main.default.ultrafeedback_embeddings")
print(f"Embeddings table has {emb_df.count()} rows.")
print(f"Columns: {emb_df.columns}")
display(emb_df.select("id", "instruction", "source").limit(3))


In [None]:
import mlflow
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate
from databricks_langchain import ChatDatabricks, UCFunctionToolkit

# ---- Build the agent ----
mlflow.langchain.autolog()

# Set an experiment for this assignment's traces
EXPERIMENT_NAME = "/Users/" + spark.sql("SELECT current_user()").first()[0] + "/aai594_assignment6_align"
mlflow.set_experiment(EXPERIMENT_NAME)

# Load prompt
PROMPT_NAME = "main.default.ultrafeedback_expert_prompt"
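try:
    # Prompt registry URIs take the form prompts:/<name>@<alias>
    loaded_prompt = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")
    SYSTEM_PROMPT = loaded_prompt.template
    print("Loaded prompt from registry.")
except Exception:  # prompt missing or registry unavailable: fall back to the inline prompt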
    SYSTEM_PROMPT = """You are the UltraFeedback Expert, an AI assistant that helps users
explore and understand the UltraFeedback LLM preference dataset.
Use your tools to answer questions accurately.
Always cite which tool you used and explain the results.
If you don't have a relevant tool, say so rather than guessing."""
    print("Using fallback prompt.")

LLM_ENDPOINT = "databricks-meta-llama-3-3-70b-instruct"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT, temperature=0.1)

UC_FUNCTION_NAMES = [
    "main.default.lookup_source_info",
    "main.default.analyze_instruction",
]
toolkit = UCFunctionToolkit(function_names=UC_FUNCTION_NAMES)
tools = toolkit.tools

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

print("Agent ready.")

In [None]:
from mlflow.genai.judges import make_judge

# Recreate the custom judge from Assignment 5
# The name must match exactly when we log feedback later
JUDGE_NAME = "tool_usage_quality"

tool_usage_judge = make_judge(
    name=JUDGE_NAME,
    # make_judge() takes the prompt as `instructions`; {{ inputs }} and {{ outputs }}
    # are its template variables for the agent's request and response
    instructions="""You are evaluating an AI agent's ability to use tools appropriately.

The agent has access to these tools:
- lookup_source_info: looks up row counts and samples for a data source name
- analyze_instruction: analyzes complexity of instruction text
- Vector Search: finds semantically similar instructions

User question: {{ inputs }}
Agent response: {{ outputs }}

Evaluate the agent's tool usage:
1. Did the agent call the appropriate tool(s) for this question?
2. If the question didn't need a tool, did the agent correctly avoid using one?
3. Did the agent use the tool results correctly in its response?

Return YES if tool usage was appropriate, NO if it was not.
Explain your reasoning.""",
)

print(f"Custom judge '{JUDGE_NAME}' created.")

---
## 4. Generate 15 traces *(Required)*

Run the agent on 15 diverse questions. Each invocation is automatically logged as a **trace** in MLflow (thanks to `mlflow.langchain.autolog()`). Traces capture the full execution: inputs, outputs, tool calls, and intermediate steps.

The questions cover different tool types and edge cases to give you varied outputs to review.

In [None]:
# 15 diverse questions for trace generation
TRACE_QUESTIONS = [
    # Should use lookup_source_info
    "How many rows come from the evol_instruct source?",
    "What kind of data does the sharegpt source contain?",
    "Tell me about the flan_v2 source in the dataset.",
    "How does the ultrachat source compare to evol_instruct in size?",
    # Should use analyze_instruction
    "Analyze the complexity of: What is the weather today?",
    "Analyze this instruction: Design a comprehensive machine learning pipeline that ingests real-time streaming data, performs feature engineering, trains multiple models in parallel, and deploys the best performer to a REST API with A/B testing.",
    "How complex is: List three fruits.",
    # Should NOT use tools
    "What is the capital of Japan?",
    "Explain what a large language model is.",
    # Multi-step or complex
    "Compare evol_instruct and sharegpt: which has more data and what kind of instructions does each contain?",
    "Is the evol_instruct source good for training coding assistants? Look up what it contains.",
    # Edge cases
    "Look up the source called 'nonexistent_source'",
    "Analyze the complexity of an empty instruction: ",
    "What sources exist and how many rows does each have?",
    "Can you analyze this and also look up sharegpt: Analyze complexity of 'Write a poem about AI'",
]

# Run all queries and store results
trace_results = []
for i, question in enumerate(TRACE_QUESTIONS, start=1):
    print(f"\n--- Trace {i}/{len(TRACE_QUESTIONS)}: {question[:60]}... ---")
    try:
        response = agent_executor.invoke({"input": question})
        output = response["output"]
        print(f"Output: {output[:150]}...")
        trace_results.append({"question": question, "output": output, "error": None})
    except Exception as e:
        print(f"Error: {e}")
        trace_results.append({"question": question, "output": None, "error": str(e)})

print(f"\n=== Generated {len(trace_results)} traces ===")

In [None]:
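# Retrieve the trace IDs from MLflow
# These are needed to attach feedback to specific traces
# search_traces() filters by experiment ID, so resolve the ID from the name first
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    max_results=15,
    order_by=["timestamp_ms DESC"]
)

# The ID column is `trace_id` in MLflow 3 (`request_id` in older releases)
TRACE_ID_COL = "trace_id" if "trace_id" in traces.columns else "request_id"

print(f"Found {len(traces)} traces in experiment.")
if len(traces) > 0:
    print(f"\nFirst trace ID: {traces.iloc[0][TRACE_ID_COL]}")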
    print(f"Go to the MLflow Experiments UI to view traces:")
    print(f"  Sidebar > Experiments > {EXPERIMENT_NAME} > Traces tab")

---
## 5. Review traces and log feedback *(Required)*

Now you become the **domain expert**. For each trace:

1. **Review the trace** in the MLflow UI (Experiments > your experiment > Traces tab). Look at:
   - Did the agent call the right tool?
   - Was the response accurate?
   - Did the agent handle edge cases well?

2. **Log feedback** using `mlflow.log_feedback()`. For each trace, provide:
   - `value`: `True` (good) or `False` (bad)
   - `rationale`: Why you rated it that way (1-2 sentences)
   - `name`: Must exactly match `"tool_usage_quality"` (the judge name)

**Important:** The assessment name must exactly match the judge name for `align()` to work.

**Docs:** [Label during development](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/dev-annotations) · [Feedback collection](https://mlflow.org/docs/latest/genai/assessments/feedback/)
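
Before logging, it can help to sanity-check the feedback entries against the rules listed above. This is a minimal sketch, assuming each entry is a dict with `value` and `rationale` keys as in the cell below. The helper `validate_feedback` is hypothetical and not part of MLflow.

```python
def validate_feedback(entries, expected_count):
    """Return a list of problems; an empty list means the entries look usable."""
    problems = []
    if len(entries) != expected_count:
        problems.append(f"expected {expected_count} entries, got {len(entries)}")
    for i, entry in enumerate(entries, start=1):
        # Ratings must be real booleans, not strings such as "yes"
        if not isinstance(entry.get("value"), bool):
            problems.append(f"entry {i}: value must be True or False")
        # A rationale that is missing or only whitespace counts as empty
        if not str(entry.get("rationale") or "").strip():
            problems.append(f"entry {i}: rationale is empty")
    return problems


sample = [
    {"value": True, "rationale": "Called lookup_source_info and reported the count."},
    {"value": "yes", "rationale": ""},  # deliberately malformed
]
for problem in validate_feedback(sample, expected_count=2):
    print(problem)
```

In the notebook, the same check could be run as `validate_feedback(human_feedback, expected_count=15)` before the logging cell.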

In [None]:
# Review each trace and provide your expert feedback.
# You MUST review the actual outputs and provide honest assessments.
#
# INSTRUCTIONS:
# 1. Look at the trace_results list above (question + output)
# 2. For each, decide: did the agent use tools appropriately? (True/False)
# 3. Write a brief rationale explaining your rating
# 4. Run this cell to log all feedback
#
# Replace the placeholder feedback below with YOUR actual assessments
# after reviewing the outputs above and in the MLflow UI.

# Example feedback structure — REPLACE with your actual assessments:
human_feedback = [
    # Trace 1: "How many rows come from the evol_instruct source?"
    {"value": True, "rationale": "Correctly called lookup_source_info and reported the row count."},
    # Trace 2: "What kind of data does the sharegpt source contain?"
    {"value": True, "rationale": "Used lookup_source_info to show a sample instruction from sharegpt."},
    # Trace 3: "Tell me about the flan_v2 source in the dataset."
    {"value": True, "rationale": "Appropriately used lookup_source_info for flan_v2."},
    # Trace 4: "How does the ultrachat source compare to evol_instruct in size?"
    {"value": True, "rationale": "Called lookup_source_info for both sources and compared."},
    # Trace 5: "Analyze the complexity of: What is the weather today?"
    {"value": True, "rationale": "Correctly called analyze_instruction."},
    # Trace 6: Long complexity analysis
    {"value": True, "rationale": "Used analyze_instruction for the complex prompt."},
    # Trace 7: "How complex is: List three fruits."
    {"value": True, "rationale": "Called analyze_instruction appropriately."},
    # Trace 8: "What is the capital of Japan?"
    {"value": True, "rationale": "Correctly recognized no tool was needed."},
    # Trace 9: "Explain what a large language model is."
    {"value": True, "rationale": "Answered from general knowledge without calling tools."},
    # Trace 10: Compare evol_instruct and sharegpt
    {"value": True, "rationale": "Called lookup_source_info for both and compared them."},
    # Trace 11: evol_instruct for coding assistants
    {"value": True, "rationale": "Looked up evol_instruct data and assessed relevance."},
    # Trace 12: Nonexistent source
    {"value": True, "rationale": "Called the tool and handled zero results gracefully."},
    # Trace 13: Empty instruction
    {"value": False, "rationale": "Should have noted the empty input or handled it better."},
    # Trace 14: "What sources exist and how many rows does each have?"
    {"value": False, "rationale": "Only looked up one source instead of listing all."},
    # Trace 15: Multi-tool query
    {"value": True, "rationale": "Used both analyze_instruction and lookup_source_info."},
]

print(f"Prepared {len(human_feedback)} feedback entries.")
print("\nIMPORTANT: Replace the placeholder feedback above with your actual assessments!")
print("Review each trace output in the cells above and in the MLflow UI before running the next cell.")

In [None]:
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Log feedback for each trace
# Traces are ordered most recent first, so we reverse to match our question order
id_col = "trace_id" if "trace_id" in traces.columns else "request_id"
trace_ids = list(traces[id_col])
trace_ids.reverse()  # Now trace_ids[0] corresponds to question 1

logged_count = 0
for i, (trace_id, feedback) in enumerate(zip(trace_ids[:15], human_feedback[:15])):
    try:
        mlflow.log_feedback(
            trace_id=trace_id,
            name=JUDGE_NAME,  # Must match the judge name exactly
            value=feedback["value"],
            rationale=feedback["rationale"],
            source=AssessmentSource(
                source_type=AssessmentSourceType.HUMAN,
                source_id="human_reviewer",  # placeholder label; use your own identifier
            )
        )
        logged_count += 1
        status = "GOOD" if feedback["value"] else "BAD"
        print(f"  Trace {i+1}: [{status}] {feedback['rationale'][:60]}...")
    except Exception as e:
        print(f"  Trace {i+1}: Error logging feedback: {e}")

print(f"\nLogged feedback for {logged_count}/{len(human_feedback)} traces.")

---
## 6. Run `align()` *(Required)*

The `align()` function uses the SIMBA optimizer (built on DSPy) to improve the custom judge's prompt based on your human feedback. It iteratively refines the judge so its ratings better match yours.

**Requirements:**
- At least 10 traces with feedback (we have 15)
- Feedback assessment name matches the judge name
- MLflow 3.4.0+

**Docs:** [Align judges with humans](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/align-judges)
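
One simple way to quantify whether a judge's ratings "match yours" is the fraction of traces on which the judge and the human agree. The sketch below is a plain-Python illustration, not what `align()` computes internally. The helpers `agreement_rate` and `disagreements` are hypothetical, and the label lists are invented for illustration rather than taken from a real run.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of traces where the judge's verdict equals the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labelled trace")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels, judge_labels):
    """1-based positions of the traces where the two verdicts differ."""
    pairs = zip(human_labels, judge_labels)
    return [i for i, (h, j) in enumerate(pairs, start=1) if h != j]


# Invented labels for four traces (not real results)
human = [True, True, False, False]
original = [True, True, True, True]
aligned = [True, True, False, True]

print(agreement_rate(human, original), disagreements(human, original))  # 0.5 [3, 4]
print(agreement_rate(human, aligned), disagreements(human, aligned))    # 0.75 [4]
```

The list of disagreeing positions is a useful starting point for reflection question 7b, because it points at the specific traces worth re-reading.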

In [None]:
# Run alignment to optimize the judge based on human feedback
print("Running align() — this may take a few minutes...")
print("(SIMBA optimizer is iterating over your feedback to improve the judge prompt)\n")

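# align() is a method on the judge: it takes the traces that carry human feedback
# plus an optimizer (SIMBA here, run against the same endpoint as the agent)
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
feedback_traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    max_results=15,
    return_type="list",
)
optimizer = SIMBAAlignmentOptimizer(model=f"databricks:/{LLM_ENDPOINT}")
aligned_judge = tool_usage_judge.align(traces=feedback_traces, optimizer=optimizer)

print("\nAlignment complete!")
print(f"\nOriginal judge instructions (first 200 chars):\n{tool_usage_judge.instructions[:200]}...")
print(f"\nAligned judge instructions (first 200 chars):\n{aligned_judge.instructions[:200]}...")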

In [None]:
# Compare: run both judges on the same evaluation data
comparison_data = [
    {"inputs": {"input": q}, "expectations": {}}
    for q in TRACE_QUESTIONS[:5]  # Use first 5 questions for a quick comparison
]

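# mlflow.genai.evaluate() passes each row's "inputs" dict as keyword arguments,
# so the parameter name must match the key used in comparison_data ("input")
def predict_fn(input):
    return agent_executor.invoke({"input": input})["output"]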

print("Running original judge...")
original_results = mlflow.genai.evaluate(
    data=comparison_data,
    predict_fn=predict_fn,
    scorers=[tool_usage_judge]
)

print("\nRunning aligned judge...")
aligned_results = mlflow.genai.evaluate(
    data=comparison_data,
    predict_fn=predict_fn,
    scorers=[aligned_judge]
)

print("\nComparison complete. Check the MLflow UI to compare the two evaluation runs.")

---
## 7. Reflection *(Required)*

### 7a. How did alignment change the judge?

*Compare the original and aligned judge prompts. What changed? Did the aligned judge add new criteria, change the emphasis, or use different language?*

**[Your answer here]**

### 7b. Where did the original judge disagree with your feedback?

*Look at the comparison results. Were there cases where the original judge rated differently from your human feedback? What did the aligned judge do in those cases?*

**[Your answer here]**

### 7c. What does this teach about human-in-the-loop evaluation?

*Reflect on the entire arc: automated judges (Week 5) → human feedback → alignment (this week). Why isn't just one sufficient? Connect to the "Who Validates the Validators" reading.*

**[Your answer here]**

### 7d. Looking ahead: how would you use this in your final project?

*If you were to apply the trace → feedback → align workflow to your final project agent, how would you design it? What would your judge evaluate? Who would be the domain expert?*

**[Your answer here]**

---
## 8. Clean up *(Required)*

Nothing needs to be deleted for this assignment. No Vector Search resources were created, and the embeddings Delta table (`main.default.ultrafeedback_embeddings`) persists across assignments at no additional cost.

---
## 9. Bonus: Deploy to a serving endpoint *(Optional, strongly encouraged)*

On a paid Databricks workspace, you would deploy the agent using `databricks.agents.deploy()`, which creates a serving endpoint and a **Review App** for collecting stakeholder feedback through a UI.

On **Free Edition**, this may not work (Agent Bricks is listed as unsupported). But you can try — model serving endpoints themselves are available with limits.

If deployment fails, that's OK — the notebook-based approach above covers the same concepts.

### Conceptual walkthrough

In a production Databricks workspace, the workflow would be:

```python
import mlflow
from databricks import agents

# 1. Log the agent as an MLflow model
with mlflow.start_run():
    model_info = mlflow.langchain.log_model(
        lc_model=agent_executor,
        artifact_path="agent",
        registered_model_name="main.default.ultrafeedback_expert"
    )

# 2. Deploy to a serving endpoint (creates Review App automatically)
deployment = agents.deploy(
    model_name="main.default.ultrafeedback_expert",
    model_version=model_info.registered_model_version
)

# 3. Share the Review App URL with domain experts
print(f"Review App: {deployment.review_app_url}")

# 4. Experts interact with the agent and leave feedback
# 5. Run align() on the production traces
```

The Review App provides a chat interface where experts can rate responses, leave comments, and provide expected outputs — all of which feed into `align()`. The notebook approach we used in Sections 4-6 covers the same programmatic workflow.

---
## Lab complete

### Required (Sections 1–8)
- [ ] **Section 3:** Agent and custom judge recreated. Embeddings table verified.
- [ ] **Section 4:** 15 traces generated and visible in MLflow Experiments.
- [ ] **Section 5:** Human feedback logged for all 15 traces with honest ratings and rationales.
- [ ] **Section 6:** `align()` ran successfully. Original and aligned judge compared.
- [ ] **Section 7:** All four reflection questions answered.
- [ ] **Section 8:** No VS cleanup needed (embeddings persist in Delta table).

### Optional but strongly encouraged (Section 9)
- [ ] **Section 9:** Attempted deployment or understood the production workflow.

**Submit:** Your executed notebook (`.ipynb` with all outputs) and the completed `SUBMISSION_6.md`.

*Congratulations! You've completed the full agent lifecycle: tools → agent → evaluation → human feedback → judge alignment. Weeks 7-8 are dedicated to your final project.*