# AAI 594 — Assignment 2

## Exploratory Data Analysis with the Data Science Agent

**In this lab you will:**
- **Required:** Use the Databricks **Data Science Agent** (Agent Mode in the Assistant) to perform exploratory data analysis on the UltraFeedback dataset you loaded in Assignment 1.
- **Required:** Review the agent's output, identify what the data contains, and assess how it could be integrated into an agent (e.g., as tools, reference data, or evaluation benchmarks).
- **Required:** Provide your own written analysis of the data — the agent helps you explore, but **your interpretation is the deliverable**.

*This assignment is intentionally bare-bones. The notebook sets up the data reference; you drive the exploration through the Agent and record your findings below.*


---
## 1. About this assignment

Last week you loaded the **UltraFeedback** dataset into Unity Catalog as `main.default.assignment_file`. This week you'll explore that data in depth.

Instead of writing all EDA code by hand, you'll use the **Databricks Data Science Agent** — an AI assistant built into your notebook that can plan, generate, and run code on your behalf. Your job is to **guide the agent with good prompts**, **review what it produces**, and **write your own analysis** of the results.

### Why this matters

In agentic AI, keeping a human in the loop is critical. An agent can automate mechanical tasks (generating plots, computing statistics), but **you** are responsible for interpreting results, catching errors, and making decisions. This assignment practices that skill: you'll delegate the exploration to an agent, then own the analysis.

### How to use the Data Science Agent

1. Open the **Assistant** side panel in this notebook (click the Assistant icon or press **Cmd+I** / **Ctrl+I**).
2. In the bottom-right corner of the panel, toggle to **Agent** mode.
3. Type a prompt. Reference the table with `@main.default.assignment_file` so the agent knows which data to use.
4. The agent will create a plan, may ask clarifying questions, and will generate notebook cells. **Review each step** before clicking **Allow** or **Continue**.
5. After the agent finishes, read the generated cells and outputs. Add your own markdown cells with your interpretation.

**Docs:** [Use the Data Science Agent](https://docs.databricks.com/aws/en/notebooks/ds-agent)

**Requirements:**
- Partner-powered AI features must be enabled for your workspace.
- Databricks Assistant Agent Mode preview must be enabled. See [Manage Databricks previews](https://docs.databricks.com/aws/en/admin/workspace-settings/manage-previews).


---
## 2. Verify the dataset *(Required)*

Run the cell below to confirm the table from Assignment 1 is still available and to see a quick preview. If you get an error, re-run Assignment 1 first.


```python
# Quick check: confirm the table exists and preview the first few rows
df = spark.table("main.default.assignment_file")
print(f"Row count: {df.count()}")
print(f"Columns: {df.columns}")
df.printSchema()
display(df.limit(5))
```

---
## 3. Agent-assisted EDA *(Required)*

Use the Data Science Agent to explore the dataset. Below are **sample prompts** to get you started — you don't need to use all of them, and you're encouraged to ask your own follow-up questions. The agent will generate code cells and outputs directly in this notebook.

### Sample prompts to try

**Data overview and quality:**
- "Describe the `@main.default.assignment_file` dataset. Show column statistics (counts, nulls, unique values) and data types. Think like a data scientist."
- "Check for missing values in every column of `@main.default.assignment_file`. Show a summary table and a heatmap of nulls."
- "How many duplicate rows are in `@main.default.assignment_file`? Show me examples if any exist."
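
If you want to sanity-check the agent's null and duplicate counts by hand, a minimal sketch along these lines may help. It works on plain Python dicts rather than on the Spark table, so it only makes sense on a small collected sample. The helper names `null_counts` and `duplicate_count` are hypothetical, and the toy rows only assume the column names mentioned in the prompts.

```python
from collections import Counter


def null_counts(rows):
    """Count None or empty-string values per column across a list of row dicts."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                counts[col] += 1
    return dict(counts)


def duplicate_count(rows, keys):
    """Count rows whose values for `keys` repeat an earlier row."""
    seen = set()
    dupes = 0
    for row in rows:
        # str() makes nested values (lists, dicts) hashable for comparison
        fingerprint = tuple(str(row.get(k)) for k in keys)
        if fingerprint in seen:
            dupes += 1
        else:
            seen.add(fingerprint)
    return dupes


# Toy rows standing in for a sample. In the notebook you might instead build
# `sample_rows` from a small slice of the table (assumption, adjust the limit):
# sample_rows = [r.asDict() for r in spark.table("main.default.assignment_file").limit(1000).collect()]
sample_rows = [
    {"source": "evol_instruct", "chosen": "A", "rejected": "B"},
    {"source": "evol_instruct", "chosen": "A", "rejected": "B"},
    {"source": None, "chosen": "C", "rejected": ""},
]

nulls = null_counts(sample_rows)
dupes = duplicate_count(sample_rows, ["source", "chosen", "rejected"])
print(nulls)
print(dupes)
```

Counts from a sample will not match the agent's full-table numbers, but a large disagreement in proportions is a signal to ask the agent to investigate.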

**Distribution and patterns:**
- "What are the unique values in the `source` column of `@main.default.assignment_file`? Show a bar chart of how many rows come from each source."
- "Compare the average length (in characters) of the `chosen` vs `rejected` responses in `@main.default.assignment_file`. Visualize the distributions."
- "Which models appear most frequently as `chosen-model` and `rejected-model` in `@main.default.assignment_file`? Show the top 10 of each."
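
The length and frequency prompts above can be cross-checked on a small sample with a sketch like the following. It assumes `chosen` and `rejected` hold plain strings; if your table stores them as nested structures, extract the response text first. The function names are hypothetical, and the toy rows exist only to make the block runnable.

```python
from collections import Counter
from statistics import mean


def mean_lengths(rows, chosen_col="chosen", rejected_col="rejected"):
    """Return (mean chosen length, mean rejected length) in characters.

    Rows with a missing value in either column are skipped.
    """
    pairs = [
        (len(r[chosen_col]), len(r[rejected_col]))
        for r in rows
        if r.get(chosen_col) is not None and r.get(rejected_col) is not None
    ]
    if not pairs:
        return (0.0, 0.0)
    return (mean(c for c, _ in pairs), mean(r for _, r in pairs))


def top_models(rows, col, n=10):
    """Return the n most frequent non-empty values in `col` as (value, count) pairs."""
    return Counter(r[col] for r in rows if r.get(col)).most_common(n)


# Toy rows; the model names are placeholders, not values from the dataset.
sample_rows = [
    {"chosen": "abcd", "rejected": "ab", "chosen-model": "model-a", "rejected-model": "model-b"},
    {"chosen": "abcdef", "rejected": "abcd", "chosen-model": "model-a", "rejected-model": "model-c"},
    {"chosen": "ab", "rejected": "abc", "chosen-model": "model-b", "rejected-model": "model-c"},
]

chosen_mean, rejected_mean = mean_lengths(sample_rows)
top_chosen = top_models(sample_rows, "chosen-model", n=1)
top_rejected = top_models(sample_rows, "rejected-model", n=1)
print(chosen_mean, rejected_mean)
print(top_chosen, top_rejected)
```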

**Content and usefulness for agent integration:**
- "Show me 5 example rows from `@main.default.assignment_file` where the `source` is 'evol_instruct'. What kind of instructions are these?"
- "Is there a relationship between the `source` column and which model was chosen? Create a cross-tabulation."
- "Which columns contain structured data (e.g., model names, sources) that could be turned into lookup or filtering tools for an agent?"
- "Identify any columns that would need transformation before being used in an agent workflow (e.g., text that needs parsing, inconsistent formats, columns that aren't useful)."

**Go deeper (optional):**
- "Create a word cloud of the most common terms in the instruction/prompt column."
- "Are there any outliers in response length? Show a box plot."
- "Summarize the key characteristics of this dataset in a new markdown cell."

> **Tip:** After the agent generates cells, read the output carefully. If something looks wrong or surprising, ask the agent to investigate further. This back-and-forth is how real agent-assisted analysis works.


*The Data Science Agent will insert cells below as you interact with it. Leave this section open for agent-generated content.*


---
## 4. Your analysis *(Required)*

After the agent has helped you explore the data, write your own analysis in the sections below. **This is the core deliverable** — the agent does the mechanical work; you provide the thinking.

### 4a. What does this dataset contain?

Summarize what you found. Include:
- How many rows and columns are there?
- What do the key columns represent (e.g., source, instruction, chosen, rejected, models)?
- What is the overall structure of a single record — what does one row "mean"?
- Any data quality issues (nulls, duplicates, inconsistencies)?

*Write your summary below (replace this text):*

**[Your answer here]**


### 4b. How could this data be integrated into an agent?

In this course you won't be training models directly — instead, you'll be building **agents** that leverage data through **tools** (Unity Catalog functions, Vector Search, MCP). Based on your EDA, assess how this dataset could be useful in an agent workflow:
- Which columns could an agent **query as a tool** (e.g., a function that looks up model preferences by source, or retrieves example prompts by category)?
- Could any text columns (instructions, chosen/rejected responses) be indexed for **Vector Search** so an agent can find similar examples?
- What parts of the data are useful for **evaluating** an agent's outputs (e.g., using chosen vs rejected as a benchmark for quality)?
- Are there columns that are metadata-only and wouldn't be useful in an agent context?
- Is the data balanced (e.g., even distribution of sources, models) or skewed? How might that affect an agent tool built on this data?
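
To make the "query as a tool" idea concrete, here is a rough sketch of the logic such a lookup might contain, written as plain Python over row dicts. It is not a Unity Catalog function and not a prescribed design; `build_preference_index` and `preferred_models` are hypothetical names, and the column names are assumed from the sample prompts.

```python
from collections import Counter, defaultdict


def build_preference_index(rows, source_col="source", model_col="chosen-model"):
    """Group chosen-model counts by source so lookups do not rescan the rows."""
    index = defaultdict(Counter)
    for row in rows:
        source, model = row.get(source_col), row.get(model_col)
        if source is not None and model is not None:
            index[source][model] += 1
    return index


def preferred_models(index, source, n=3):
    """Hypothetical tool body: the n most frequently chosen models for one source."""
    return index.get(source, Counter()).most_common(n)


# Toy rows; source and model names are placeholders.
sample_rows = [
    {"source": "evol_instruct", "chosen-model": "model-a"},
    {"source": "evol_instruct", "chosen-model": "model-a"},
    {"source": "evol_instruct", "chosen-model": "model-b"},
    {"source": "sharegpt", "chosen-model": "model-b"},
]

index = build_preference_index(sample_rows)
best_for_evol = preferred_models(index, "evol_instruct", n=1)
unknown = preferred_models(index, "not_a_source")
print(best_for_evol)
print(unknown)
```

Note how an unknown source returns an empty list, and how a source with very few rows would return counts too small to trust. That is one way skew in the data can surface in a tool's answers.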

*Write your assessment below (replace this text):*

**[Your answer here]**


### 4c. What transformations would you recommend?

If you were preparing this data to be used by an agent (e.g., as a queryable tool, a Vector Search index, or an evaluation dataset), what changes would you make?
- Do any text columns need parsing, cleaning, or reformatting before an agent could use them?
- Would you filter out certain rows (e.g., specific sources, very short/long responses) to improve tool quality?
- Would you create new derived columns (e.g., response length difference, model win rate) that an agent could use as features or filters?
- Would you split the data into separate tables for different agent tools (e.g., one for lookup, one for Vector Search)?
- Any other preprocessing steps?
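
As one illustration of the derived columns mentioned above, the sketch below computes a per-row response length difference and a per-model win rate. It assumes `chosen` and `rejected` are plain strings and that every row has both model columns filled in; the function names and the toy rows are hypothetical.

```python
from collections import Counter


def add_length_diff(rows, chosen_col="chosen", rejected_col="rejected"):
    """Return new row dicts with `length_diff` = len(chosen) - len(rejected)."""
    return [
        {**r, "length_diff": len(r[chosen_col]) - len(r[rejected_col])}
        for r in rows
    ]


def win_rates(rows, chosen_col="chosen-model", rejected_col="rejected-model"):
    """Win rate per model: times chosen / (times chosen + times rejected)."""
    wins = Counter(r[chosen_col] for r in rows)
    losses = Counter(r[rejected_col] for r in rows)
    models = set(wins) | set(losses)
    return {m: wins[m] / (wins[m] + losses[m]) for m in models}


# Toy rows; model names are placeholders, not values from the dataset.
sample_rows = [
    {"chosen": "abcd", "rejected": "ab", "chosen-model": "model-a", "rejected-model": "model-b"},
    {"chosen": "ab", "rejected": "abc", "chosen-model": "model-b", "rejected-model": "model-a"},
    {"chosen": "abc", "rejected": "a", "chosen-model": "model-a", "rejected-model": "model-c"},
]

enriched = add_length_diff(sample_rows)
rates = win_rates(sample_rows)
print([r["length_diff"] for r in enriched])
print(rates)
```

A win rate computed this way ignores how often each model appears, so a model seen only once can show 0.0 or 1.0. Whether to filter on a minimum count is the kind of recommendation this section asks you to justify.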

*Write your recommendations below (replace this text):*

**[Your answer here]**


### 4d. How well did the Data Science Agent perform?

Reflect on using the agent for EDA:
- What did the agent do well? Where did it save you time?
- Did the agent make any mistakes or produce anything you had to correct?
- What prompts worked best? What would you do differently next time?
- How does this relate to the idea of "human-in-the-loop" that we discussed in Week 1?

*Write your reflection below (replace this text):*

**[Your answer here]**


---
## Lab complete

**Required:**
- **Section 2:** The dataset verification cell ran and printed the row count, column list, schema, and a preview of the table.
- **Section 3:** You used the Data Science Agent to explore the data. Agent-generated cells with outputs are visible in the notebook.
- **Section 4a:** You wrote a summary of what the dataset contains.
- **Section 4b:** You assessed how this data could be integrated into an agent (tools, Vector Search, evaluation).
- **Section 4c:** You recommended transformations to prepare the data for agent use.
- **Section 4d:** You reflected on the agent's performance and the human-in-the-loop experience.

**Submit:** Your executed notebook (`.ipynb` with all outputs, including agent-generated cells) and the completed `SUBMISSION_2.md`.

*Next week you'll create tools (Unity Catalog functions, Vector Search, MCP) for the agent you'll build over Weeks 3–5.*
