# AAI 594 — Assignment 4a

## From Tools to a Working Agent

**In this lab you will:**
- **Required (Sections 1–6):** Prototype an agent in the **AI Playground**, export the code, and run it in this notebook.
- **Required (Section 7):** Register your agent's system prompt in the **Unity Catalog Prompt Registry** and version it.
- **Required (Section 8):** Swap the underlying LLM and compare output quality and token usage across models.
- **Required (Section 9):** Clean up Vector Search resources.
- **Optional, strongly encouraged (Section 10):** Explore **prompt optimization** concepts from the Week 4 readings.

### The big picture

| Week | What you do | Deliverable |
|------|------------|-------------|
| 3 | Build tools: UC functions, Vector Search, MCP | Tested tools + MCP config |
| **4 (this week)** | **Wire tools into an agent; register a prompt; compare LLMs** | **Working agent** |
| 5 | Evaluate the agent with judges and an eval dataset | Evaluation report |

**Readings this week:**
- [GEPA DSPy Optimization](https://arxiv.org/pdf/2507.19457)
- [Everything is Context](https://www.arxiv.org/pdf/2512.05470)
- Optional: [Evolving Excellence](https://arxiv.org/html/2512.09108v1)

**Key docs:**
- [Prototype tool-calling agents in AI Playground](https://docs.databricks.com/aws/en/generative-ai/agent-framework/ai-playground-agent)
- [Prompt Registry](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/prompt-registry/)
- [Author an agent](https://docs.databricks.com/aws/en/generative-ai/agent-framework/author-agent)

---
## 1. From tools to agent *(Required)*

In Assignment 3 you created three kinds of tools:

| Tool | What it does |
|------|--------------|
| `main.default.lookup_source_info` | SQL lookup — row count and sample for a source |
| `main.default.analyze_instruction` | Python — instruction complexity metrics |
| Your custom function | SQL or Python — your own design |
| Vector Search index | Semantic search over 1,000 UltraFeedback instructions |
| You.com MCP | Live web search (configured in Cursor) |

This week you'll wire those tools into a **working agent** — an LLM that can decide which tool to call based on the user's question. The workflow is:

1. **Prototype** in the AI Playground (no code)
2. **Export** the agent code to a notebook
3. **Register** the system prompt in Unity Catalog for versioning
4. **Swap LLMs** and compare how different models use the same tools
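
The routing idea behind this workflow can be sketched without any LLM at all. In the minimal sketch below, `fake_model` is a hypothetical stand-in for the LLM's tool-selection step (a real model chooses from the tool descriptions, not from keyword rules), and both tool bodies are stubs that return placeholder data instead of querying Unity Catalog.

```python
from typing import Callable

# Stub tools: hypothetical stand-ins for the UC functions from Assignment 3.
def lookup_source_info(source: str) -> dict:
    return {"source": source, "row_count": None}  # placeholder; no table is queried

def analyze_instruction(text: str) -> dict:
    return {"word_count": len(text.split())}

# Registry mapping tool names to callables, so a tool can be dispatched by name.
TOOLS: dict[str, Callable[[str], dict]] = {
    "lookup_source_info": lookup_source_info,
    "analyze_instruction": analyze_instruction,
}

def fake_model(question: str) -> dict:
    """Hypothetical stand-in for the LLM: choose a tool call or answer directly."""
    words = question.lower().rstrip("?").split()
    if "rows" in words and "source" in words and words.index("source") > 0:
        # Use the word just before "source" as the argument.
        return {"tool": "lookup_source_info", "arg": words[words.index("source") - 1]}
    if words[:1] == ["analyze"]:
        # Use the text after the first colon, or the whole question if there is none.
        return {"tool": "analyze_instruction", "arg": question.partition(":")[2].strip() or question}
    return {"tool": None, "answer": "I don't have a relevant tool for that."}

def run_agent(question: str) -> dict:
    """One agent step: ask the 'model' for a decision, then dispatch the chosen tool."""
    decision = fake_model(question)
    if decision["tool"] is None:
        return {"tool": None, "output": decision["answer"]}
    result = TOOLS[decision["tool"]](decision["arg"])
    return {"tool": decision["tool"], "output": result}

print(run_agent("How many rows come from the evol_instruct source?"))
```

The real agent in Section 5 follows the same shape: the model picks a tool by name, the executor dispatches it, and the result flows back into the answer.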

---
## 2. Install dependencies *(Required)*

In [None]:
%pip install --upgrade "mlflow[databricks]>=3.1.0" databricks-langchain unitycatalog-ai[databricks] databricks-vectorsearch
dbutils.library.restartPython()

---
## 3. Recreate Vector Search *(Required)*

You deleted the Vector Search endpoint and index at the end of Assignment 3 (good resource management). Recreate them now: the agent you prototype in the AI Playground (Section 4) needs the index to answer similarity questions.

This code is the same as Assignment 3, Section 6.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

# Create the VS source table only if it does not already exist.
# tableExists is an explicit check; spark.table() can be lazy and may not raise here.
if spark.catalog.tableExists("main.default.ultrafeedback_vs_source"):
    print("VS source table already exists.")
else:
    print("Creating VS source table...")
    vs_source = (
        spark.table("main.default.assignment_file")
        .select("source", "instruction")
        .dropDuplicates(["instruction"])
        .limit(1000)
        # IDs are unique but not guaranteed to be consecutive
        .withColumn("id", monotonically_increasing_id())
    )
    vs_source.write.format("delta") \
        .option("delta.enableChangeDataFeed", "true") \
        .mode("overwrite") \
        .saveAsTable("main.default.ultrafeedback_vs_source")
    print("VS source table created.")

In [None]:
from databricks.vector_search.client import VectorSearchClient

vs_client = VectorSearchClient()
VS_ENDPOINT_NAME = "aai594_vs_endpoint"
VS_INDEX_NAME = "main.default.ultrafeedback_vs_index"

# Recreate endpoint
try:
    vs_client.create_endpoint_and_wait(name=VS_ENDPOINT_NAME, endpoint_type="STANDARD")
    print(f"Endpoint '{VS_ENDPOINT_NAME}' is ready.")
except Exception as e:
    if "already exists" in str(e).lower():
        print(f"Endpoint '{VS_ENDPOINT_NAME}' already exists — reusing.")
    else:
        raise  # re-raise unexpected errors with the original traceback

In [None]:
# Recreate delta sync index
try:
    index = vs_client.create_delta_sync_index_and_wait(
        endpoint_name=VS_ENDPOINT_NAME,
        source_table_name="main.default.ultrafeedback_vs_source",
        index_name=VS_INDEX_NAME,
        pipeline_type="TRIGGERED",
        primary_key="id",
        embedding_source_column="instruction",
        embedding_model_endpoint_name="databricks-gte-large-en"
    )
    print(f"Index '{VS_INDEX_NAME}' created and synced.")
except Exception as e:
    if "already exists" in str(e).lower():
        print(f"Index '{VS_INDEX_NAME}' already exists — reusing.")
        index = vs_client.get_index(endpoint_name=VS_ENDPOINT_NAME, index_name=VS_INDEX_NAME)
    else:
        raise  # re-raise unexpected errors with the original traceback

# Quick test
results = index.similarity_search(query_text="machine learning", columns=["id", "instruction"], num_results=2)
print("VS index is working:", len(results.get("result", {}).get("data_array", [])), "results returned")
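
The quick test only counts rows. To read the hits themselves, it helps to pair each row with its column names. This is a minimal sketch, assuming the response carries column names under `manifest.columns` next to the `result.data_array` rows used above. `rows_as_dicts` and `sample_response` are hypothetical names, and the sample values are made up for illustration (they are not real index output).

```python
def rows_as_dicts(response: dict) -> list[dict]:
    """Pair each row in result.data_array with the column names in manifest.columns."""
    columns = [c["name"] for c in response.get("manifest", {}).get("columns", [])]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in rows]

# Made-up response in the assumed shape (not real index output).
sample_response = {
    "manifest": {"columns": [{"name": "id"}, {"name": "instruction"}, {"name": "score"}]},
    "result": {"data_array": [[7, "Explain gradient descent.", 0.61]]},
}

for hit in rows_as_dicts(sample_response):
    print(hit["id"], hit["instruction"], hit["score"])
```

In the notebook, `rows_as_dicts(results)` would apply the same helper to the `results` object from the quick test, provided the response has the assumed shape.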

---
## 4. Prototype in the AI Playground *(Required)*

The **AI Playground** lets you prototype a tool-calling agent with no code. You select an LLM, attach tools, and chat — then export the working agent as a Python notebook.

### Step-by-step

1. **Open AI Playground** — In the Databricks sidebar, click **Playground** (under Machine Learning or the top-level menu).

2. **Select a Tools-enabled LLM** — In the model dropdown, choose a model that supports tool calling. Good options:
   - `databricks-meta-llama-3-3-70b-instruct`
   - `databricks-claude-sonnet-4` (if available)
   
   Make sure the model shows as **"Tools enabled"** in the Playground UI.

3. **Add your tools** — Click the **Tools** button and add:
   - **UC Functions:** `main.default.lookup_source_info`, `main.default.analyze_instruction`, and your custom function from Assignment 3
   - **Vector Search index:** `main.default.ultrafeedback_vs_index`
   - You can add up to 20 tools total

4. **Set a system prompt** — In the System Prompt field, enter something like:

   ```
   You are the UltraFeedback Expert, an AI assistant that helps users
   explore and understand the UltraFeedback LLM preference dataset.
   
   Use your tools to answer questions accurately:
   - Use lookup_source_info to get statistics about data sources
   - Use analyze_instruction to assess instruction complexity
   - Use Vector Search to find similar instructions by meaning
   
   Always cite which tool you used and explain the results.
   If you don't have a relevant tool, say so rather than guessing.
   ```

5. **Test the agent** — Try these queries to verify tool calling works:
   - *"How many rows come from the evol_instruct source?"* (should call `lookup_source_info`)
   - *"Find instructions similar to 'Explain quantum computing'"* (should call Vector Search)
   - *"Analyze the complexity of: Write a detailed essay about climate change including economic impacts"* (should call `analyze_instruction`)

6. **Export the code** — Once your agent is working:
   - Click **Get code** (or **Export**) in the top-right of the Playground
   - Select **Create agent notebook**
   - Review the generated code

> **Take a screenshot** of your agent in the AI Playground with tools attached and a successful tool-calling conversation. Save as `screenshots/ai_playground.png`.

**Docs:** [Prototype tool-calling agents in AI Playground](https://docs.databricks.com/aws/en/generative-ai/agent-framework/ai-playground-agent)

---
## 5. Adapt and run the exported agent code *(Required)*

Paste the exported code from the AI Playground into the cells below and run it. If the Playground's export didn't work or you prefer to build manually, use the template provided.

The key components are:
1. **LLM endpoint:** which Foundation Model to use
2. **Tools:** your UC functions wrapped in `UCFunctionToolkit`
3. **System prompt:** defined inline in the code below; you'll register it in the Prompt Registry in Section 7
4. **Agent executor:** the LangChain agent that orchestrates LLM + tools

In [None]:
import mlflow
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from databricks_langchain import ChatDatabricks, UCFunctionToolkit

# Enable MLflow tracing so all agent calls are logged
mlflow.langchain.autolog()

# ---- 1. Choose the LLM ----
LLM_ENDPOINT = "databricks-meta-llama-3-3-70b-instruct"
llm = ChatDatabricks(endpoint=LLM_ENDPOINT, temperature=0.1)

# ---- 2. Load tools from Unity Catalog ----
# Add your UC function names here (include your custom function from Assignment 3)
UC_FUNCTION_NAMES = [
    "main.default.lookup_source_info",
    "main.default.analyze_instruction",
    # "main.default.your_custom_function",  # <-- uncomment and replace with your function
]

toolkit = UCFunctionToolkit(function_names=UC_FUNCTION_NAMES)
tools = toolkit.tools
print(f"Loaded {len(tools)} UC function tools: {[t.name for t in tools]}")

# ---- 3. Define the system prompt ----
SYSTEM_PROMPT = """You are the UltraFeedback Expert, an AI assistant that helps users
explore and understand the UltraFeedback LLM preference dataset.

Use your tools to answer questions accurately:
- Use lookup_source_info to get statistics about data sources
- Use analyze_instruction to assess instruction complexity

Always cite which tool you used and explain the results.
If you don't have a relevant tool, say so rather than guessing."""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# ---- 4. Create and run the agent ----
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print("Agent ready.")

In [None]:
# Test the agent with a query that should trigger a tool call
response = agent_executor.invoke({"input": "How many rows come from the evol_instruct source?"})
print(response["output"])

In [None]:
# Test with a complexity analysis query
response2 = agent_executor.invoke({
    "input": "Analyze the complexity of this instruction: Explain the process of photosynthesis in detail, including light-dependent and light-independent reactions."
})
print(response2["output"])

In [None]:
# Test with a general knowledge question (should say it doesn't have a tool)
response3 = agent_executor.invoke({"input": "What is the capital of France?"})
print(response3["output"])
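
Printing `output` alone doesn't confirm which tool ran. A minimal sketch for checking that, assuming the executor is created with `return_intermediate_steps=True` so that the response includes an `intermediate_steps` list of `(action, observation)` pairs whose action carries a `tool` name. `tools_called` is a hypothetical helper, and the fake response only imitates that shape; no agent is run here.

```python
from types import SimpleNamespace

def tools_called(response: dict) -> list[str]:
    """Return the tool names recorded in a response's intermediate steps, in call order."""
    return [action.tool for action, _observation in response.get("intermediate_steps", [])]

# Fake response imitating the assumed shape; the tool name is illustrative.
fake_response = {
    "output": "...",
    "intermediate_steps": [
        (SimpleNamespace(tool="lookup_source_info", tool_input={"source": "evol_instruct"}), "..."),
    ],
}

print(tools_called(fake_response))
```

In the notebook this would look like `AgentExecutor(agent=agent, tools=tools, return_intermediate_steps=True)` followed by `tools_called(agent_executor.invoke({"input": "..."}))`. The names returned are whatever the toolkit reported in the "Loaded ... UC function tools" line, which may differ from the short names used in the system prompt.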

---
## 6. Verify UC functions are registered *(Required)*

Quick check that all your tools are still in Unity Catalog.

In [None]:
%sql
SHOW USER FUNCTIONS IN main.default;

---
## 7. Register a prompt in Unity Catalog *(Required)*

The **MLflow Prompt Registry** lets you version and manage prompt templates in Unity Catalog. This is important because:

- **Versioning:** Every change creates an immutable snapshot. You can roll back if a new prompt performs worse.
- **Aliases:** Point "production" at a specific version. Update the alias without changing code.
- **Collaboration:** Non-engineers can edit prompts through the UI.
- **Governance:** Unity Catalog tracks who changed what and when.

**Docs:** [Prompt Registry](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/prompt-registry/) · [Create and edit prompts](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/prompt-registry/create-and-edit-prompts)
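
To make the versioning and alias semantics concrete, here is a toy in-memory model. It is only an illustration of the ideas in the list above and is not the MLflow API; `ToyPromptRegistry` and its methods are hypothetical names.

```python
class ToyPromptRegistry:
    """In-memory illustration of versions and aliases; not the MLflow API."""

    def __init__(self) -> None:
        self._versions: list[str] = []       # index 0 holds version 1
        self._aliases: dict[str, int] = {}   # alias -> version number

    def register(self, template: str) -> int:
        """Append a new version (existing ones are never modified) and return its number."""
        self._versions.append(template)
        return len(self._versions)

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at an existing version."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._aliases[alias] = version

    def load(self, alias: str) -> str:
        """Resolve an alias to the template it currently points at."""
        return self._versions[self._aliases[alias] - 1]

registry = ToyPromptRegistry()
v1 = registry.register("Answer briefly.")
registry.set_alias("production", v1)
v2 = registry.register("Answer briefly and cite the tool you used.")

print(registry.load("production"))     # registering v2 did not move the alias
registry.set_alias("production", v2)   # promote v2
print(registry.load("production"))
registry.set_alias("production", v1)   # roll back to v1
```

Code that loads by alias never changes during a promotion or rollback; only the pointer moves.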

In [None]:
import mlflow

# Register the system prompt as a versioned prompt in Unity Catalog
PROMPT_NAME = "main.default.ultrafeedback_expert_prompt"

prompt_info = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template=SYSTEM_PROMPT,
    commit_message="Initial system prompt for the UltraFeedback Expert agent"
)

print(f"Registered: {prompt_info.name}, version: {prompt_info.version}")

In [None]:
# Create an alias so we can reference "production" without knowing the version number
mlflow.genai.set_prompt_alias(
    name=PROMPT_NAME,
    alias="production",
    version=prompt_info.version
)
print(f"Alias 'production' set to version {prompt_info.version}")

In [None]:
# Demonstrate loading the prompt by alias. This is how your agent would
# load the prompt in production (decoupled from the version number).
# Alias references use the "prompts:/<name>@<alias>" URI form.
loaded = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")
print("Loaded prompt template:")
print(loaded.template)

In [None]:
# Register a second version with a small improvement
IMPROVED_PROMPT = """You are the UltraFeedback Expert, an AI assistant that helps users
explore and understand the UltraFeedback LLM preference dataset.

Use your tools to answer questions accurately:
- Use lookup_source_info to get statistics about data sources
- Use analyze_instruction to assess instruction complexity

Always cite which tool you used and explain the results.
If you don't have a relevant tool, say so rather than guessing.
When comparing sources or models, use specific numbers from the data."""

v2 = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template=IMPROVED_PROMPT,
    commit_message="Added instruction to cite specific numbers when comparing"
)
print(f"Version {v2.version} registered. You now have two versions.")
print(f"Alias 'production' still points to version {prompt_info.version} (you can update it after testing).")
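
With two versions registered, it is useful to see exactly what changed between them. A minimal sketch using the standard library's `difflib`. `prompt_diff` is a hypothetical helper, and the two strings are shortened stand-ins for the templates registered above.

```python
import difflib

def prompt_diff(old: str, new: str) -> str:
    """Return a unified diff between two prompt templates (empty string if identical)."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="v1", tofile="v2", lineterm="",
    )
    return "\n".join(lines)

# Shortened stand-ins for the two registered templates.
old_prompt = "Always cite which tool you used.\nIf you don't have a relevant tool, say so."
new_prompt = old_prompt + "\nWhen comparing sources or models, use specific numbers from the data."

print(prompt_diff(old_prompt, new_prompt))
```

In the notebook, `prompt_diff(SYSTEM_PROMPT, IMPROVED_PROMPT)` would show the one added line, prefixed with `+`.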

---
## 8. Swap LLMs and compare *(Required)*

A key advantage of the agent architecture is that you can **swap the underlying LLM** without changing the tools or prompt. Different models may:
- Call tools more or less reliably
- Produce more concise or verbose answers
- Use more or fewer tokens

You'll run the **same three test queries** against two different LLMs and compare the results.

### Test queries
1. *"How many rows come from the sharegpt source?"*
2. *"Analyze the complexity of: What is 2+2?"*
3. *"What sources are available in the dataset and how do they compare in size?"*

In [None]:
# Helper function to run test queries and collect results
TEST_QUERIES = [
    "How many rows come from the sharegpt source?",
    "Analyze the complexity of: What is 2+2?",
    "What sources are available in the dataset and how do they compare in size?",
]

def run_test_queries(executor, model_name):
    """Run test queries and return results with the model name."""
    results = []
    for q in TEST_QUERIES:
        print(f"\n--- {model_name} | Query: {q[:60]}... ---")
        resp = executor.invoke({"input": q})
        output = resp["output"]
        print(output[:300])
        results.append({"model": model_name, "query": q, "output": output})
    return results

In [None]:
# ---- Model A ----
MODEL_A = "databricks-meta-llama-3-3-70b-instruct"
llm_a = ChatDatabricks(endpoint=MODEL_A, temperature=0.1)
agent_a = create_tool_calling_agent(llm_a, tools, prompt)
executor_a = AgentExecutor(agent=agent_a, tools=tools, verbose=False)

print(f"=== Testing Model A: {MODEL_A} ===")
results_a = run_test_queries(executor_a, MODEL_A)

In [None]:
# ---- Model B ----
# Change this to another available Foundation Model endpoint.
# Check which models are available in your workspace under Serving > Foundation Models.
# Options may include: databricks-claude-sonnet-4, databricks-dbrx-instruct, etc.
MODEL_B = "databricks-claude-sonnet-4"  # <-- change if this model is not available

try:
    llm_b = ChatDatabricks(endpoint=MODEL_B, temperature=0.1)
    agent_b = create_tool_calling_agent(llm_b, tools, prompt)
    executor_b = AgentExecutor(agent=agent_b, tools=tools, verbose=False)

    print(f"=== Testing Model B: {MODEL_B} ===")
    results_b = run_test_queries(executor_b, MODEL_B)
except Exception as e:
    print(f"Model B ({MODEL_B}) not available: {e}")
    print("Try a different endpoint name. Check Serving > Foundation Models in the sidebar.")
    results_b = []
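
Before writing the analysis, it can help to line the two result lists up by query. A minimal sketch, assuming each result is a dict with the `query` and `output` keys produced by `run_test_queries`. `summarize` is a hypothetical helper, the sample outputs are made up, and word count is only a rough proxy for verbosity (it is not a token count).

```python
def summarize(results_a: list[dict], results_b: list[dict]) -> list[dict]:
    """Pair results by query and report word counts as a rough verbosity measure."""
    by_query_b = {r["query"]: r for r in results_b}
    rows = []
    for a in results_a:
        b = by_query_b.get(a["query"])  # None if Model B produced no result
        rows.append({
            "query": a["query"],
            "words_a": len(a["output"].split()),
            "words_b": len(b["output"].split()) if b else None,
        })
    return rows

# Made-up outputs for illustration only.
sample_a = [{"model": "A", "query": "q1", "output": "one two three"}]
sample_b = [{"model": "B", "query": "q1", "output": "one two"}]

for row in summarize(sample_a, sample_b):
    print(row)
```

Calling `summarize(results_a, results_b)` in the notebook gives one row per test query; `words_b` is `None` for every row if Model B was unavailable and `results_b` is empty.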

### Your comparison analysis

Compare the two models across the three test queries. Consider:
- **Tool usage:** Did both models call the right tools? Did one make unnecessary calls?
- **Output quality:** Which responses were more accurate, specific, or helpful?
- **Verbosity:** Which model was more concise? Is that better or worse for this use case?
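- **Error handling:** Did either model hallucinate or fail to use a tool when it should have?
- **Token usage:** Did one model use noticeably more tokens for the same query? The MLflow traces logged by `mlflow.langchain.autolog()` are one place to look.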

*Write your analysis below (replace this text):*

**[Your comparison analysis here]**

---
## 9. Clean up *(Required)*

Delete the Vector Search endpoint and index to conserve Free Edition resources. You'll recreate them in Assignment 5.

In [None]:
# Delete index first, then endpoint
try:
    vs_client.delete_index(endpoint_name=VS_ENDPOINT_NAME, index_name=VS_INDEX_NAME)
    print(f"Index '{VS_INDEX_NAME}' deleted.")
except Exception as e:
    print(f"Index deletion note: {e}")

try:
    vs_client.delete_endpoint(name=VS_ENDPOINT_NAME)
    print(f"Endpoint '{VS_ENDPOINT_NAME}' deleted.")
except Exception as e:
    print(f"Endpoint deletion note: {e}")

---
## 10. Prompt optimization *(Optional, strongly encouraged)*

The Week 4 readings cover **prompt optimization** — using algorithms to systematically improve prompts rather than relying on manual iteration.

### Key concepts from the readings

- **DSPy** treats prompts as programs with optimizable parameters. Instead of hand-crafting prompts, you define a task signature and let an optimizer (like MIPRO or BootstrapFewShot) find better instructions and examples.
- **"Everything is Context"** argues that prompts, tools, and memory are all forms of context — optimizing one often helps the others.
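- **GEPA** (Genetic-Pareto) is a reflective prompt optimizer: an LLM reads execution traces, reflects in natural language on what went wrong, and proposes revised prompts. The paper reports that this can outperform reinforcement learning while using far fewer rollouts. GEPA is also available as an optimizer in DSPy.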
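
The core loop behind these ideas is "score candidate prompts against a metric, keep the best." The sketch below shows only that loop in its simplest form: an exhaustive search over hand-written candidates. It is not DSPy, MIPRO, or GEPA, which also generate new candidates. `pick_best_prompt` is a hypothetical helper, `fake_run` is a hypothetical stand-in for running the agent and observing which tool it calls, and the metric is plain tool-selection accuracy.

```python
from typing import Callable

def pick_best_prompt(
    candidates: list[str],
    run: Callable[[str, str], str],
    examples: list[tuple[str, str]],
) -> tuple[str, float]:
    """Score each candidate prompt on labeled examples and return the best one with its score."""
    def score(prompt: str) -> float:
        hits = sum(run(prompt, question) == expected for question, expected in examples)
        return hits / len(examples)
    best = max(candidates, key=score)
    return best, score(best)

# Hypothetical stand-in for "prompt + question -> which tool the agent calls".
def fake_run(prompt: str, question: str) -> str:
    if "analyze_instruction" in prompt and question.lower().startswith("analyze"):
        return "analyze_instruction"
    if "lookup_source_info" in prompt:
        return "lookup_source_info"
    return "none"

# Labeled examples: (question, tool the agent is expected to call).
examples = [
    ("How many rows come from the sharegpt source?", "lookup_source_info"),
    ("Analyze the complexity of: What is 2+2?", "analyze_instruction"),
]
candidates = [
    "Answer the question.",
    "Use lookup_source_info for statistics.",
    "Use lookup_source_info for statistics and analyze_instruction for complexity.",
]

best_prompt, best_score = pick_best_prompt(candidates, fake_run, examples)
print(best_score, best_prompt)
```

An optimizer replaces the hand-written candidate list with generated ones, but it still depends on a labeled example set and a metric like the one here.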

### Reflection

Based on your experience comparing two LLMs in Section 8 and the readings, answer:

1. How could prompt optimization improve your UltraFeedback Expert agent? What would you optimize?
2. Why might automated prompt optimization be more effective than manual prompt engineering for complex agents?
3. If you had to set up a DSPy optimization for this agent, what would your metric (evaluation function) look like?

*Write your reflection below (replace this text):*

**[Your reflection here]**

---
## Lab complete

### Required (Sections 1–9)
- [ ] **Section 3:** Vector Search endpoint and index recreated and verified.
- [ ] **Section 4:** Prototyped the agent in AI Playground with tools attached (screenshot taken).
- [ ] **Section 5:** Exported or adapted agent code runs in this notebook. Agent answers test queries using tools.
- [ ] **Section 6:** UC functions verified.
- [ ] **Section 7:** System prompt registered in Unity Catalog Prompt Registry with two versions and a "production" alias.
- [ ] **Section 8:** Ran the same 3 queries against 2 different LLMs. Written comparison analysis provided.
- [ ] **Section 9:** Vector Search endpoint and index deleted.

### Optional but strongly encouraged (Section 10)
- [ ] **Section 10:** Written reflection on prompt optimization.

**Also submit:** `PROPOSAL_4b.md` — your final project proposal.

**Submit:** Your executed notebook (`.ipynb` with all outputs), `SUBMISSION_4a.md`, and `PROPOSAL_4b.md`.

*Next week you'll evaluate the agent using built-in judges, guidelines judges, and custom judges.*