# AAI 594 — Assignment 1

## Pull in data & configure Unity Catalog

**In this lab you will:**
- **Required (Sections 1–6):** Download the dataset, store it in Unity Catalog, run the Foundation Model demo, and connect your local machine via the **Databricks CLI**.
- **Optional but strongly encouraged (Sections 7–8):** Connect **Cursor** to Databricks (Databricks Connect) and/or create an **external model endpoint** (OpenAI or Claude) with your own API key.
- Get comfortable with the Databricks notebook and data platform.

*Next week you'll perform EDA on this dataset; the course will build in complexity from there.*


---
## 1. Install dependencies *(Required)*

You'll install [`kagglehub`](https://github.com/Kaggle/kagglehub) so you can load the UltraFeedback dataset from Kaggle. The `[pandas-datasets]` extra gives you a Pandas-friendly loader.

*Run the cell below. If the kernel prompts you, re-run the notebook from the top (or run the next cell manually).*


In [0]:
# Install kagglehub with pandas support; %restart_python ensures the new package is available in this session
%pip install kagglehub[pandas-datasets]
%restart_python

Collecting kagglehub[pandas-datasets]
  Downloading kagglehub-0.4.2-py3-none-any.whl.metadata (38 kB)
Collecting kagglesdk<1.0,>=0.1.14 (from kagglehub[pandas-datasets])
  Downloading kagglesdk-0.1.15-py3-none-any.whl.metadata (13 kB)
Collecting tqdm (from kagglehub[pandas-datasets])
  Downloading tqdm-4.67.3-py3-none-any.whl.metadata (57 kB)
Downloading kagglesdk-0.1.15-py3-none-any.whl (160 kB)
Downloading kagglehub-0.4.2-py3-none-any.whl (69 kB)
Downloading tqdm-4.67.3-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, kagglesdk, kagglehub
Successfully installed kagglehub-0.4.2 kagglesdk-0.1.15 tqdm-4.67.3
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


---
## 2. Download the dataset from Kaggle *(Required)*

In this lab you'll use the **LLM Human Preference Data (UltraFeedback)** dataset from Kaggle:

**Dataset link:** [LLM Human Preference Data (UltraFeedback) — Kaggle](https://www.kaggle.com/datasets/thedrcat/llm-human-preference-data-ultrafeedback?select=ultrafeedback.csv)

### About the dataset

- **What it is:** A large-scale human preference dataset for training and evaluating language models. It contains **preference pairs**: for each prompt, multiple model responses were collected and humans (or model-based annotators) indicated which response was *chosen* (preferred) vs *rejected*.
- **Why it matters for agentic AI:** This kind of data is used for **reward modeling** and **alignment**—teaching models to behave in ways humans prefer. Those techniques underpin safer and more controllable agents.
- **What you're loading:** The file `ultrafeedback.csv` in this dataset. Typical columns include `source` (e.g. evol_instruct), the instruction/prompt, **chosen** vs **rejected** model outputs, and which models produced them (e.g. `chosen-model`, `rejected-model`). You'll explore the exact schema during EDA next week.
- **Reference:** UltraFeedback-style data is widely used in the literature; the Kaggle version gives you a convenient, tabular form for analysis and experimentation in Databricks.

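In the cell below you'll load it as a Pandas DataFrame. The cell calls `kagglehub.load_dataset()`, which recent kagglehub releases deprecate in favor of `kagglehub.dataset_load()`; if you see a deprecation warning, you can switch to the newer function with the same arguments.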

**If the download fails:** Open the [Kaggle dataset page](https://www.kaggle.com/datasets/thedrcat/llm-human-preference-data-ultrafeedback) while signed in, click "Download", and accept the dataset terms if prompted. If your environment doesn't have Kaggle credentials, set `KAGGLE_USERNAME` and `KAGGLE_KEY` (from your [Kaggle account → Settings → API](https://www.kaggle.com/docs/api#authentication)) in the notebook or compute environment.
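
If you need to supply credentials from the notebook, a minimal sketch could look like the following. The helper name `set_kaggle_credentials` and the placeholder values are assumptions for illustration; only the two environment variable names come from the instructions above.

```python
import os


def set_kaggle_credentials(username: str, key: str) -> None:
    """Expose Kaggle credentials to kagglehub through environment variables."""
    if not username or not key:
        raise ValueError("Both a Kaggle username and an API key are required.")
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = key


# Placeholders only: replace with your own values and never commit a real key.
# In Databricks you could read the key from a secret scope instead, e.g.
# dbutils.secrets.get(scope="<your-scope>", key="kaggle_key")  # hypothetical scope/key names
set_kaggle_credentials("<your-kaggle-username>", "<your-kaggle-api-key>")
```

Run this before the download cell so the variables are already set when `kagglehub` looks for them.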


In [0]:
# Import kagglehub and the adapter that returns Pandas DataFrames
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Which file from the dataset to load (this dataset has ultrafeedback.csv)
file_path = "ultrafeedback.csv"

# Load the latest version of the dataset as a Pandas DataFrame.
# Dataset: "thedrcat/llm-human-preference-data-ultrafeedback"
# For more options (e.g. sql_query, pandas_kwargs), see:
# https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "thedrcat/llm-human-preference-data-ultrafeedback",
    file_path,
)

print("First 5 records:", df.head())
print("\nShape:", df.shape)


Downloading to /home/spark-2f8f00ca-bd01-4beb-b161-e2/.cache/kagglehub/datasets/thedrcat/llm-human-preference-data-ultrafeedback/versions/2/ultrafeedback.csv...


  0%|          | 0.00/101M [00:00<?, ?B/s]  1%|          | 1.00M/101M [00:00<00:23, 4.49MB/s]  2%|▏         | 2.00M/101M [00:00<00:18, 5.72MB/s]  3%|▎         | 3.00M/101M [00:00<00:16, 6.29MB/s]  4%|▍         | 4.00M/101M [00:00<00:15, 6.67MB/s]  5%|▍         | 5.00M/101M [00:00<00:14, 7.01MB/s]  6%|▌         | 6.00M/101M [00:00<00:13, 7.35MB/s]  7%|▋         | 7.00M/101M [00:01<00:12, 7.60MB/s]  8%|▊         | 8.00M/101M [00:01<00:12, 7.91MB/s]  9%|▉         | 9.00M/101M [00:01<00:11, 8.07MB/s] 10%|▉         | 10.0M/101M [00:01<00:11, 8.30MB/s] 12%|█▏        | 12.0M/101M [00:01<00:09, 10.2MB/s] 15%|█▍        | 15.0M/101M [00:01<00:05, 15.4MB/s] 21%|██        | 21.0M/101M [00:01<00:04, 19.3MB/s] 23%|██▎       | 23.0M/101M [00:02<00:04, 19.4MB/s] 28%|██▊       | 28.0M/101M [00:02<00:02, 26.7MB/s] 31%|███       | 31.0M/101M [00:02<00:02, 27.3MB/s] 34%|███▎      | 34.0M/101M [00:02<00:02, 26.0MB/s] 40%|███▉      | 40.0M/101M [00:02<00:01, 34.4MB/s] 44%|████▎     | 44.

Extracting zip of ultrafeedback.csv...





First 5 records:           source  ...    rejected-model
0  evol_instruct  ...         alpaca-7b
1  evol_instruct  ...        vicuna-33b
2  evol_instruct  ...        pythia-12b
3  evol_instruct  ...  llama-2-13b-chat
4  evol_instruct  ...          starchat

[5 rows x 8 columns]


---
## 3. Configure Unity Catalog *(Required)*

**What is Unity Catalog?**
[Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/index.html) is Databricks' **governance layer** for data and AI assets. It organizes everything in a hierarchy: **catalogs** (top-level containers) → **schemas** (namespaces inside a catalog) → **tables** (and views). It provides fine-grained access control, audit logging, and data lineage so you know who can see what and where data came from. All tables you'll use in this course live under Unity Catalog (e.g. `main.default.assignment_file` = catalog `main`, schema `default`, table `assignment_file`).

In the next cell you'll ensure a catalog and schema exist (e.g. `main.default`) so you can write your Delta table into them.


In [0]:
# Create the default catalog and schema if they don't already exist
# (Safe to run repeatedly; IF NOT EXISTS avoids errors.)
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.default")

DataFrame[]
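
To make the three-level naming concrete, here is a small sketch that splits a full table name into its catalog, schema, and table parts. `TableName` and `parse_table_name` are hypothetical helpers, not part of any Databricks API, and the sketch assumes none of the parts contains a dot.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three levels of a Unity Catalog table name."""

    catalog: str
    schema: str
    table: str

    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table' into its parts; reject anything else."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


parsed = parse_table_name("main.default.assignment_file")
print(parsed.catalog, parsed.schema, parsed.table)
```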

---
## 4. Write the dataset to a Delta table in Unity Catalog *(Required)*

**What is a Delta table?**
[Delta Lake](https://docs.databricks.com/en/delta/index.html) is an open-source storage format (built on Parquet) that adds **ACID transactions** (reads and writes are consistent), **time travel** (query or restore earlier versions of the table), and **efficient upserts and deletes**. In Databricks, when you write a table with `format("delta")`, it becomes a Delta table—the default and recommended way to store tabular data in the lakehouse. You get reliability and versioning without managing it yourself.

In the next cell you'll convert the Pandas DataFrame to a Spark DataFrame and write it as a Delta table at `main.default.assignment_file` (catalog.schema.table).


In [0]:
# Unity Catalog location: catalog.schema.table
catalog = "main"
schema = "default"
table_name = "assignment_file"

# Convert Pandas DataFrame to Spark and write as Delta
# mode("overwrite") replaces the table if it exists (good for re-running the lab)
assignment_file = spark.createDataFrame(df)
assignment_file.write.format("delta").mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name}")

# Verify: query the table (optional)
display(spark.table(f"{catalog}.{schema}.{table_name}").limit(5))
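
Some column names in this dataset contain hyphens (for example `chosen-model`), and Spark SQL needs such names wrapped in backticks. The sketch below builds a query string with quoted column names; `quote_identifier` and `select_columns_sql` are hypothetical helpers written for this lab, not Spark functions.

```python
from typing import Sequence


def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks, doubling any backticks inside it."""
    return "`" + name.replace("`", "``") + "`"


def select_columns_sql(table: str, columns: Sequence[str], limit: int = 5) -> str:
    """Build a SELECT statement with every column name backtick-quoted."""
    column_list = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {column_list} FROM {table} LIMIT {int(limit)}"


query = select_columns_sql("main.default.assignment_file", ["source", "chosen-model"])
print(query)
# In the notebook you could then run: display(spark.sql(query))
```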

---
## 5. Demo: LLM call via Foundation Model APIs *(Required)*

**What are Foundation Model APIs?**
[Foundation Model APIs](https://docs.databricks.com/en/machine-learning/model-serving/score-foundation-models.html) are Databricks-hosted large language models (LLMs) that you can call **without deploying or managing your own model**. Databricks runs the model; you send a request (e.g. a chat message) and get a response. You can use **pay-per-token** (no upfront capacity) or **provisioned throughput** (reserved capacity for production).

**Calling via the MLflow Deployments SDK:** Below you'll use the [MLflow Deployments SDK](https://docs.databricks.com/en/mlflow/mlflow-deployments.html) (`mlflow.deployments.get_deploy_client("databricks")`) to call the Foundation Model. When you run this in a Databricks notebook, the SDK uses your **workspace authentication**—no API key or env vars needed. You pass the model name as the `endpoint` and your messages in the `inputs` dict. This is the same pattern you'll use later with MLflow for tracking and the model registry.


In [None]:
# Demo: call databricks-gpt-oss-120b via the MLflow Deployments SDK (Foundation Model API)
# In a Databricks notebook, the client uses your workspace auth—no API key needed.
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

chat_response = client.predict(
    endpoint="databricks-gpt-oss-120b",
    inputs={
        "messages": [
            {"role": "system", "content": "You are a concise teaching assistant for an agentic AI course."},
            {"role": "user", "content": "In 1–2 sentences, what is agentic AI and why might we store training data in Unity Catalog?"},
        ],
        "max_tokens": 256,
    },
)

# Response has the same structure as the Foundation Model REST API (OpenAI-style)
print(chat_response["choices"][0]["message"]["content"])
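
If you want something more defensive than indexing straight into the response, a sketch like the one below may help. It assumes the OpenAI-style shape shown above. It also tolerates `content` arriving as a list of typed blocks instead of a plain string; that second shape is an assumption about how some reasoning-capable endpoints respond, so check it against your own output. `first_message_text` is a hypothetical helper.

```python
from typing import Any, Mapping


def first_message_text(response: Mapping[str, Any]) -> str:
    """Return the text of the first choice in an OpenAI-style chat response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("Response has no 'choices' entry.")
    content = choices[0].get("message", {}).get("content") or ""
    if isinstance(content, str):
        return content
    # Assumed alternative shape: a list of blocks such as {"type": "text", "text": "..."}.
    texts = [
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    ]
    return "".join(texts)


# Stand-in response for illustration; in the notebook, pass chat_response instead.
fake_response = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
print(first_message_text(fake_response))
```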

---
## 6. Connect your local machine: Databricks CLI *(Required)*

The **Databricks CLI** lets you run Databricks commands from your laptop or dev machine—manage workspace resources, run jobs, sync files, and use the same auth as in the browser. You'll use it to keep local code and Databricks in sync and to script common tasks.

**Workspace URL:** Your workspace host is `https://<your-workspace>.cloud.databricks.com`. Here `<your-workspace>` is the prefix: everything *before* `.cloud.databricks.com`. Example: for `https://dbc-xxxx-xxxx.cloud.databricks.com`, the prefix is `dbc-xxxx-xxxx`.

**Do this on your local machine** (not in this notebook):

1. **Install the CLI**  
   - macOS: `brew tap databricks/tap` then `brew install databricks`  
   - See [Install or update the Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/install.html) for Windows and other options.

2. **Sign in with OAuth**  
   - Run: `databricks auth login --host https://<your-workspace>.cloud.databricks.com`  
   - Replace `<your-workspace>` with your workspace prefix (e.g. `dbc-xxxx-xxxx`). Your browser will open; sign in to Databricks. The CLI saves a profile (workspace host and auth type) in `~/.databrickscfg`; the OAuth tokens themselves are cached separately on your machine rather than written into that file.

3. **Verify**  
   - Run `databricks -v` (check version).  
   - Run `databricks auth profiles` (list profiles).  
   - Run `databricks workspace list /` to confirm the CLI can reach your workspace.

**Docs:** [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) · [Authentication](https://docs.databricks.com/en/dev-tools/cli/authentication.html)
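
As a sanity check on the workspace prefix and the login command, here is a small sketch you could run locally. It only builds the command and checks whether the CLI is on your `PATH`; it does not log you in. `workspace_host` and `login_command` are hypothetical helpers.

```python
import shutil
from typing import List


def workspace_host(prefix: str) -> str:
    """Build the workspace URL from the prefix (the part before .cloud.databricks.com)."""
    prefix = prefix.strip()
    if not prefix or "/" in prefix or "." in prefix:
        raise ValueError("Pass only the prefix, for example 'dbc-xxxx-xxxx'.")
    return f"https://{prefix}.cloud.databricks.com"


def login_command(prefix: str) -> List[str]:
    """Return the OAuth login command from step 2 as an argument list."""
    return ["databricks", "auth", "login", "--host", workspace_host(prefix)]


print(" ".join(login_command("dbc-xxxx-xxxx")))
print("Databricks CLI found on PATH:", shutil.which("databricks") is not None)
```

To run the command from Python rather than the terminal, you could pass the list to `subprocess.run(..., check=True)`.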

---
## 7. Connect Cursor to Databricks *(Optional, strongly encouraged)*

Follow the steps below to set up **Cursor with [Databricks Connect](https://docs.databricks.com/en/dev-tools/databricks-connect/index.html)** on your local machine. Databricks Connect lets you develop in Cursor while running Spark workloads on Databricks compute. Once this is in place, you can optionally add MCP (e.g. Unity Catalog functions) for more IDE integration; see the links at the end.

**Instructions below match:** [Cursor with Databricks: AI Enhanced Development](https://dustinvannoy.com/2025/09/29/cursor-with-databricks-ai-enhanced-development/) (Dustin Vannoy).

**Installation and authentication:**

1. **Install Cursor and create an account**
   - Go to [cursor.com](https://cursor.com) and download Cursor for your OS. Install it (e.g. on macOS: open the .dmg and drag Cursor to Applications; on Windows: run the .exe). See [Cursor installation docs](https://docs.cursor.com/get-started/installation) if you need help.
   - On first launch, sign in or create a Cursor account. The 14-day trial includes what you need to get started; after that, Pro or a higher tier is useful for regular use.

2. **Install the Databricks extension within Cursor**
   - In Cursor, open the Extensions view and search for **Databricks**. Install the official Databricks extension.
   - This extension manages your Databricks connection, provides run options, and supports Asset Bundles integration.

3. **Configure your Databricks profile using the extension's setup wizard**
   - Use the extension to add a Databricks connection. You'll need your **workspace URL** (from your Databricks workspace address) and an **authentication method**. OAuth is the preferred method; an access token will also work well.

**The critical configuration detail:**

4. **Name your profile "DEFAULT" (in all caps)**
   - This avoids a common class of configuration errors with AI tools. Databricks Connect falls back to the profile named `DEFAULT` when no profile is specified, so your AI assistant won't need explicit instructions about which profile to use. When you switch environments, change which profile is the default rather than reconfiguring everything.

5. **Set serverless compute in the profile**
   - We use the free edition, so compute is always serverless. Edit the profile configuration file at `~/.databrickscfg` and set `serverless_compute_id = auto`.

**Virtual environment setup:**

6. **Use a virtual environment and install Databricks Connect**
   - Use a virtual environment for Python development (create one manually or let Cursor create it). Install Databricks Connect in that environment. The Databricks extension can handle this automatically; you may need to adjust the version to match your Databricks runtime.

7. **Verify the connection**
   - Run simple Databricks Connect code (e.g. create a Spark session and run a trivial query). The built-in Cursor run button works well and mimics how the AI agent executes code.
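
The steps above can be sketched as a two-part check: first confirm that the `DEFAULT` profile has the serverless setting, then run a trivial query. The helper names and the sample profile text are assumptions for illustration. The last function needs `databricks-connect` installed and a working profile, so treat it as a sketch, not a guaranteed recipe.

```python
import configparser
from typing import Dict

# Sample profile text for illustration; your real file lives at ~/.databrickscfg.
SAMPLE_CFG = """
[DEFAULT]
host = https://dbc-xxxx-xxxx.cloud.databricks.com
serverless_compute_id = auto
"""


def read_profile(cfg_text: str, profile: str = "DEFAULT") -> Dict[str, str]:
    """Return the key/value settings of one profile from .databrickscfg text."""
    # configparser normally treats [DEFAULT] as a special fallback section;
    # renaming the fallback lets us read it like any other profile.
    parser = configparser.ConfigParser(default_section="__unused__", interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(profile):
        raise KeyError(f"Profile {profile!r} not found")
    return dict(parser[profile])


def uses_serverless(settings: Dict[str, str]) -> bool:
    """True when the profile sets serverless_compute_id = auto."""
    return settings.get("serverless_compute_id") == "auto"


def run_trivial_query() -> int:
    """Create a Databricks Connect session and count a five-row range."""
    # Imported lazily so the profile check works without databricks-connect installed.
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()
    return spark.range(5).count()


settings = read_profile(SAMPLE_CFG)
print("Serverless configured:", uses_serverless(settings))
```

To check your real configuration, pass `Path.home().joinpath(".databrickscfg").read_text()` to `read_profile`, then call `run_trivial_query()` once the profile looks right.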

**Optional — MCP (Unity Catalog, etc.):** For access to Unity Catalog functions, SQL, or other MCP tools from Cursor, see [Connect non-Databricks clients to Databricks MCP servers](https://docs.databricks.com/aws/en/generative-ai/mcp/connect-external-services). You can add the Databricks MCP server (PAT or OAuth via mcp-remote) once this Connect setup is working.


---
## 8. Create an external model endpoint (OpenAI or Claude) *(Optional, strongly encouraged)*

**External models** are third-party LLMs (OpenAI, Anthropic, etc.) that you call *through* Databricks. Databricks stores the API credentials in one place so you don’t put keys in code or notebooks. You create a **model serving endpoint** that forwards requests to the provider and returns responses in the same OpenAI-style format you used in Section 5.

**You need your own API keys for this step.** If you don't already have one, get an API key from [OpenAI](https://platform.openai.com/api-keys) or [Anthropic](https://console.anthropic.com/settings/keys) and configure it in Databricks (see below). There is no built-in external-model key in the free edition—you must supply and configure your own.

**Why use this:** Centralized credential management, consistent query interface (same `predict()` or chat API), and the option to add [Mosaic AI Gateway](https://docs.databricks.com/aws/en/generative-ai/external-models/) for rate limiting and guardrails.

Below is an example using the **MLflow Deployments SDK** to create an endpoint that calls **OpenAI** (you can swap in **Anthropic** by changing the provider and config). Store your API key in [Databricks Secrets](https://docs.databricks.com/en/security/secrets/index.html) and reference it as `{{secrets/<scope>/<key>}}`. For a quick test you can use plaintext (not for production).

In [None]:
# Create an external model endpoint that calls OpenAI (or Anthropic) using the MLflow Deployments SDK.
# You must configure your own OpenAI or Anthropic API key (see Section 8 text above) if not already done.
# Run this once; then query the endpoint by name with client.predict(endpoint="<name>", inputs={...}).
# Store your API key in Databricks Secrets: https://docs.databricks.com/en/security/secrets/index.html
# Reference as "{{secrets/<scope>/<key>}}". For a one-off test only, you can use openai_api_key_plaintext (not for production).

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Example: OpenAI chat endpoint. Replace the secret scope and key with your own, or use openai_api_key_plaintext for testing.
client.create_endpoint(
    name="assignment1-openai-chat",  # use a unique name per workspace
    config={
        "served_entities": [
            {
                "name": "openai-chat",
                "external_model": {
                    "name": "gpt-4o-mini",  # or gpt-4o, gpt-3.5-turbo, etc.
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        "openai_api_key": "{{secrets/<your-scope>/openai_api_key}}",
                        # "openai_api_key_plaintext": "<your-key>",  # testing only
                    },
                },
            }
        ]
    },
)

# To use Anthropic instead, use provider "anthropic", name e.g. "claude-3-5-sonnet-20241022",
# task "llm/v1/chat", and anthropic_config with anthropic_api_key (or secret reference).
# See: https://docs.databricks.com/aws/en/generative-ai/external-models/

After the endpoint is created and in **Ready** state (check **Serving** in the left sidebar), you can call it with the same MLflow Deployments SDK pattern: `client.predict(endpoint="assignment1-openai-chat", inputs={"messages": [...], "max_tokens": 256})`. The response format is the same as in Section 5.
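
To see how the OpenAI and Anthropic variants differ, here is a sketch that builds the served-entity dictionary for either provider using the field names from the cell and comments above. `external_model_entity` is a hypothetical helper, and the secret scope and key names are placeholders.

```python
from typing import Any, Dict


def external_model_entity(provider: str, model: str, secret_scope: str, secret_key: str) -> Dict[str, Any]:
    """Build one served entity for an OpenAI or Anthropic chat endpoint."""
    if provider not in ("openai", "anthropic"):
        raise ValueError("This sketch only covers 'openai' and 'anthropic'.")
    # Secret reference in the {{secrets/<scope>/<key>}} form described above.
    secret_ref = "{{secrets/" + f"{secret_scope}/{secret_key}" + "}}"
    return {
        "name": f"{provider}-chat",
        "external_model": {
            "name": model,
            "provider": provider,
            "task": "llm/v1/chat",
            f"{provider}_config": {f"{provider}_api_key": secret_ref},
        },
    }


entity = external_model_entity(
    "anthropic", "claude-3-5-sonnet-20241022", "my-scope", "anthropic_api_key"
)
print(entity)
# In the notebook (endpoint name is a placeholder):
# client.create_endpoint(name="assignment1-anthropic-chat", config={"served_entities": [entity]})
```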

---
## Getting comfortable with Databricks

As you work through this course, you'll use several Databricks features over and over. Here's a short guide so you know what to expect and where to look.

### What you've seen (or will see) in this lab

- **Unity Catalog + Delta tables** (Sections 3–4): Your table `main.default.assignment_file` is the kind of "source of truth" we'll use for data that agents and models read from or write to. Get used to the three-level name: *catalog.schema.table*.
- **Foundation Model APIs** (Section 5): You called a hosted LLM without deploying anything. We'll use these for prompts, tool-calling, and multi-turn flows. Try the same model in **Serving** → **AI Playground** in the left sidebar to compare with the API.
- **Databricks CLI** (Section 6, *required*): Use it from your local machine to manage workspaces, run jobs, and keep code in sync—no need to do everything in the browser.
- **Cursor and Databricks Connect** (Section 7, *optional but strongly encouraged*): Setting up Cursor with the Databricks extension and DEFAULT profile gives you local development with Databricks compute; you can optionally add MCP later for Unity Catalog and SQL in your IDE.
- **External model endpoints** (Section 8, *optional but strongly encouraged*): OpenAI/Claude (and other providers) through Databricks with credentials in one place; same `predict()` interface as Foundation Models.
- **Notebooks and Spark**: Running cells in order, using `spark.sql(...)` and DataFrames, and (later) `%run` to share code are the basics. **Workflows** (jobs) turn notebooks into scheduled or triggered steps—useful when you orchestrate agents.

### MLflow: experiments, models, and agent traces

**MLflow** on Databricks is the platform for the full ML lifecycle. You'll use it a lot in this course:

- **Tracking**: Log parameters, metrics, and artifacts for each run so you can compare experiments (e.g. different prompts or models).
- **Model Registry**: Register and version models, then promote a version toward production. With models registered in Unity Catalog, promotion is handled with aliases rather than fixed staging/production stages.
- **AI/agent evaluation and tracing**: For LLM apps and agents, MLflow lets you trace requests (inputs, outputs, tool calls, latency) and evaluate quality. When something goes wrong, you can inspect the trace instead of guessing.

Docs: [MLflow on Databricks](https://docs.databricks.com/en/mlflow/index.html). You'll use Tracking and the registry as you build and compare agents.
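
As a preview of Tracking, here is a minimal sketch of logging one run. The function name, parameter names, and the `quality_score` metric are made up for illustration. `mlflow` is imported inside the function so the definition loads even where MLflow is not installed.

```python
def log_prompt_experiment(prompt_name: str, model_name: str, quality_score: float) -> str:
    """Log one run with two parameters and one metric; return the run ID."""
    import mlflow  # available by default in Databricks notebooks

    with mlflow.start_run(run_name=prompt_name) as run:
        mlflow.log_params({"prompt_name": prompt_name, "model": model_name})
        mlflow.log_metric("quality_score", quality_score)
        return run.info.run_id


# Example call (values are placeholders):
# run_id = log_prompt_experiment("concise-ta-v1", "databricks-gpt-oss-120b", 0.8)
```

Each call creates a separate run, so two prompts or models can be compared side by side in the MLflow experiment UI.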

### Agent Bricks: framework for production agents

**Agent Bricks** is Databricks' framework for building and optimizing production AI agents *on your data*. It automates a lot of the heavy lifting:

- **Auto-optimization**: It can generate domain-specific synthetic data, create task-aware benchmarks, and tune model choice, prompts, and config to balance quality and cost—so you spend less time on manual trial-and-error.
- **Evaluation**: Built-in evaluation and LLM-as-judge workflows so you can compare agent versions and know when a change is an improvement.
- **Use cases**: Information extraction from documents, knowledge assistants over your data, document classification, multi-agent systems, and more.

You'll work with Agent Bricks when we move from single LLM calls to full agent workflows. Overview: [Agent Bricks](https://www.databricks.com/product/artificial-intelligence/agent-bricks).

### Things to try on your own

- In **Catalog** (Catalog Explorer, labeled **Data** in older workspaces), find `main.default.assignment_file`, open it, and check the **Lineage** tab to see where the table came from.
- In a **SQL** notebook (or a new cell), run `SELECT * FROM main.default.assignment_file LIMIT 10` and confirm you get the same data.
- **Re-run this notebook** from top to bottom and confirm the table is overwritten; later we'll use Delta time travel to look at older versions.

### Where to look things up

- [Foundation Model APIs](https://docs.databricks.com/en/machine-learning/model-serving/score-foundation-models.html)
- [MLflow Deployments SDK](https://docs.databricks.com/en/mlflow/mlflow-deployments.html)
- [Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/index.html)
- [Delta Lake](https://docs.databricks.com/en/delta/index.html)
- [MLflow on Databricks](https://docs.databricks.com/en/mlflow/index.html)
- [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html)
- [Databricks Connect](https://docs.databricks.com/en/dev-tools/databricks-connect/index.html)
- [Databricks Secrets](https://docs.databricks.com/en/security/secrets/index.html)
- [External Models / AI Gateway](https://docs.databricks.com/aws/en/generative-ai/external-models/)
- [Agent Bricks](https://www.databricks.com/product/artificial-intelligence/agent-bricks)


**Try it yourself:** Run the cell below to explore the table and the workspace. Change the SQL or DataFrame code and re-run to get comfortable with Spark and Unity Catalog.


In [None]:
# --- 1. Query the table with SQL (same as in Catalog Explorer or a SQL notebook) ---
display(spark.sql("SELECT * FROM main.default.assignment_file LIMIT 10"))

# --- 2. Inspect the schema (column names and types) ---
spark.table("main.default.assignment_file").printSchema()

# --- 3. Same table as a Spark DataFrame: filter, select, count ---
tbl = spark.table("main.default.assignment_file")
print("Row count:", tbl.count())
# Example: show one column (change to any column name you see in the schema)
# display(tbl.select("source").limit(5))

# --- 4. See where your table lives: schemas in the main catalog ---
display(spark.sql("SHOW SCHEMAS IN main"))
# To list all catalogs: display(spark.sql("SHOW CATALOGS"))

# --- 5. Optional: list the DBFS root (dbutils is handy in Databricks) ---
# display(dbutils.fs.ls("/"))

# --- 6. Teaser: Delta keeps history of table changes (time travel — we'll use this later) ---
# display(spark.sql("DESCRIBE HISTORY main.default.assignment_file"))
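
To make the time-travel teaser concrete, here is a sketch that builds the two SQL statements involved: one to list the table's history and one to read an earlier version. `history_sql` and `version_as_of_sql` are hypothetical helpers; which version numbers exist depends on how many times you have written the table.

```python
def history_sql(table: str) -> str:
    """SQL that lists the recorded versions of a Delta table."""
    return f"DESCRIBE HISTORY {table}"


def version_as_of_sql(table: str, version: int, limit: int = 5) -> str:
    """SQL that reads a specific earlier version of a Delta table."""
    if version < 0:
        raise ValueError("Delta table versions start at 0.")
    return f"SELECT * FROM {table} VERSION AS OF {int(version)} LIMIT {int(limit)}"


table = "main.default.assignment_file"
print(history_sql(table))
print(version_as_of_sql(table, 0))
# In the notebook: display(spark.sql(version_as_of_sql(table, 0)))
```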

---
## Lab complete

**Required:**
- **Sections 1–4:** The table `main.default.assignment_file` exists in Unity Catalog and the exploration cell above shows rows and schema.
- **Section 5:** The Foundation Model demo cell ran and returned a short LLM response.
- **Section 6:** Databricks CLI installed and signed in on your local machine (`databricks auth login --host https://<your-workspace>.cloud.databricks.com`).
- **Exploration cell:** You've run the exploration cell (SQL, schema, DataFrame) and optionally changed the SQL or DataFrame code to try your own queries.

**Optional but strongly encouraged:**
- **Section 7:** Cursor (or another IDE) set up with the Databricks extension and a DEFAULT profile using serverless compute.
- **Section 8:** An external model endpoint (OpenAI or Anthropic) created with your own API key and at least one successful `predict()` call.

*Next week you'll run EDA on this dataset.*
