# Quick Start Example (Optional)

This notebook demonstrates basic data access patterns using DuckDB and pandas. For a deeper dive into building full analytics agents on MotherDuck (prompt design, MCP integration, security), see [MotherDuck’s analytics agent guide](https://motherduck.com/docs/key-tasks/ai-and-motherduck/building-analytics-agents/).

## Contents
- Tool installation
- Loading Parquet files
- SQL queries
- Log file parsing
- Submission format


## Install Tools

Run this once:


In [None]:
%pip install -q duckdb pandas pyarrow lancedb

import duckdb
import pandas as pd
import os
from pathlib import Path

print("Ready.")


## Connect to DuckDB

DuckDB lets you run SQL queries on Parquet files. It's fast and works well for analytics.


In [None]:
con = duckdb.connect('retail.duckdb')
print("Connected to DuckDB")


## Option 1: Load Data into local DuckDB

Load Parquet files into DuckDB so you can query them:


In [None]:
# List all Parquet files in the GCS bucket
files = con.execute("SELECT file FROM glob('gs://antm-dataset/**/*.parquet')").fetchall()

# Create one table per file, named after the file stem
for (file_path,) in files:
    table_name = Path(file_path).stem
    # Quote the identifier so stems with dashes or leading digits still work
    con.execute(
        f"CREATE OR REPLACE TABLE \"{table_name}\" AS SELECT * FROM read_parquet('{file_path}')"
    )

# Show tables
print(f"\nCreated {len(files)} tables")
con.execute("SHOW TABLES").show()
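
Because the glob pattern spans subdirectories and each table is named after the file stem, two files with the same name in different folders would silently overwrite each other via `CREATE OR REPLACE`. The following is a minimal sketch of a pre-flight check; `find_stem_collisions` and the example paths are hypothetical and not part of the dataset.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(paths):
    """Group paths by file stem and return only the stems shared by 2+ files."""
    by_stem = defaultdict(list)
    for p in paths:
        by_stem[PurePosixPath(p).stem].append(p)
    return {stem: ps for stem, ps in by_stem.items() if len(ps) > 1}


# Hypothetical paths, for illustration only
example_paths = [
    "gs://example-bucket/2023/sales.parquet",
    "gs://example-bucket/2024/sales.parquet",
    "gs://example-bucket/customer.parquet",
]
collisions = find_stem_collisions(example_paths)
print(collisions)
```

In the notebook you could call it as `find_stem_collisions([f for (f,) in files])` before creating tables; an empty result means every stem is unique.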

## Option 2: Connect to MotherDuck

MotherDuck gives your local DuckDB cloud compute resources. It also lets you share data with others easily. 

1. Go to [app.motherduck.com](https://app.motherduck.com) and create an account.
2. [Create an access token](https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/#creating-an-access-token).


In [None]:
os.environ["motherduck_token"] = "your_actual_token_here"

Attach the 'antm_hack' share with pre-loaded data.

In [None]:
token = os.environ.get("motherduck_token", "")
if not token or token == "your_actual_token_here":
    # No real token configured, so keep using the local database
    print("Using local DuckDB")
else:
    print("Attaching MotherDuck 'antm_hack' share")

    con.execute("ATTACH 'md:_share/antm_hack/88329567-1b97-4593-9696-73fd2be9c63d'")
    con.execute("USE antm_hack")

## Run a Query

Example: list the first five customers, ordered by customer key


In [None]:
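# Run the query only if the customer table was loaded or attached
tables = {row[0] for row in con.execute("SHOW TABLES").fetchall()}
if "customer" in tables:
    query = """
    SELECT
        c_customer_sk,
        c_first_name,
        c_last_name
    FROM customer
    ORDER BY c_customer_sk
    LIMIT 5
    """

    result = con.execute(query).df()
    print(result)
else:
    print("Table 'customer' not found.")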


## Read Log Files

Logs are in JSONL format (one JSON object per line). You can use pandas:


In [None]:
if Path('dataset/logs/clickstream.jsonl').exists():
    logs = pd.read_json('dataset/logs/clickstream.jsonl', lines=True)
    print(f"Loaded {len(logs)} events")
    print(logs.head())
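
To make the "one JSON object per line" layout concrete, here is a minimal standard-library sketch that parses JSONL line by line and records malformed lines instead of stopping. The helper `read_jsonl` and the sample records are hypothetical; only the `event_type` field name comes from this notebook.

```python
import io
import json


def read_jsonl(stream):
    """Return (records, bad_line_numbers) parsed from a line-oriented JSON stream."""
    records, bad_lines = [], []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line_no)
    return records, bad_lines


# Hypothetical sample: two valid events, one blank line, one malformed line
sample = io.StringIO(
    '{"event_type": "product_view", "user_id": 1}\n'
    "\n"
    '{"event_type": "add_to_cart", "user_id": 2}\n'
    '{"event_type": broken\n'
)
records, bad_lines = read_jsonl(sample)
print(len(records), bad_lines)
```

This can help when `pd.read_json(..., lines=True)` fails on a file and you want to find which lines are the problem.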


## Alternative: Query Logs with DuckDB

You can also load logs into DuckDB and use SQL:


In [None]:
if Path('dataset/logs/clickstream.jsonl').exists():
    con.execute("CREATE OR REPLACE TABLE logs AS SELECT * FROM read_json_auto('dataset/logs/clickstream.jsonl')")
    
    result = con.execute("""
        SELECT event_type, COUNT(*) as count 
        FROM logs 
        GROUP BY event_type
    """).df()
    
    print(result)


## Using LanceDB for Semantic Search

LanceDB is useful when you need to search by meaning rather than exact matches. Example use case: searching through log descriptions or PDF content.


In [None]:
import lancedb

# Connect to LanceDB
db = lancedb.connect('lance_db')

# Example: Load log data into LanceDB
if Path('dataset/logs/clickstream.jsonl').exists():
    log_data = pd.read_json('dataset/logs/clickstream.jsonl', lines=True)
    
    # Create a table
    table = db.create_table("logs", data=log_data, mode="overwrite")
    
    print(f"Loaded {len(log_data)} rows into LanceDB")
    
    # Query example: Filter logs
    results = table.search().where("event_type = 'product_view'").limit(5).to_pandas()
    print("\nSample product view events:")
    print(results)
else:
    print("Log files not found.")


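## Query LanceDB table or pandas DataFrame with DuckDB

DuckDB can query a Lance dataset or a pandas DataFrame directly by referencing the Python variable name in SQL. This cell assumes the previous cell found the log file, so `table` and `results` exist.


In [None]:
arrow_table = table.to_lance()
con.query("SELECT * FROM arrow_table").show()
con.query("SELECT * FROM results").show()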


## Submission Format

Save your answers in CSV format:

```csv
question_id,answer_type,answer_value,confidence,explanation
1,customer_id,12345,high,Top customer by revenue
1,total_spent,50000,high,Sum of net_paid
```

Create submissions programmatically:


In [None]:
submission = pd.DataFrame([
    {'question_id': 1, 'answer_type': 'customer_id', 'answer_value': 12345, 'confidence': 'high', 'explanation': 'Top customer by revenue'},
    {'question_id': 1, 'answer_type': 'total_spent', 'answer_value': 50000, 'confidence': 'high', 'explanation': 'Sum of net_paid'},
])

print(submission)

# To save:
# submission.to_csv('my_submission.csv', index=False)
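
Before submitting, it may be worth checking that a file matches the header shown above. The sketch below uses only the standard library; `check_submission` is a hypothetical helper, and the rule that every field must be non-empty is an assumption rather than a stated requirement.

```python
import csv
import io

EXPECTED_COLUMNS = ["question_id", "answer_type", "answer_value", "confidence", "explanation"]


def check_submission(csv_text):
    """Return a list of problems found in the CSV text (empty list means none found)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    for row_no, row in enumerate(reader, start=2):  # row 1 is the header
        for col in EXPECTED_COLUMNS:
            # Assumption: empty fields are treated as problems
            if not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col}")
    return problems


good = (
    "question_id,answer_type,answer_value,confidence,explanation\n"
    "1,customer_id,12345,high,Top customer by revenue\n"
)
bad_header = "question_id,answer_type,answer_value\n1,customer_id,12345\n"
print(check_submission(good))
print(check_submission(bad_header))
```

To check a saved file, pass its contents, for example `check_submission(Path('my_submission.csv').read_text())`.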


## Competition Structure

**Training Round (12:30-2:00):** 25 questions with answers provided. Practice only, no submission required.

**Test Round (2:00-6:00):** 30 questions without answers. Submit by 6:00 PM. Worth 70% of final score.

**Holdout Round (6:00-7:30):** 20 secret questions. We run your system automatically. Worth 30% of final score.

That's the basics. Check the README for more details.
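
The 70/30 weighting can be read as a weighted average. The sketch below assumes each round's score is the fraction of questions answered correctly, which the text above does not state; check the README for the actual scoring rule. `final_score` is a hypothetical helper.

```python
TEST_WEIGHT = 0.70     # Test Round share of the final score
HOLDOUT_WEIGHT = 0.30  # Holdout Round share of the final score


def final_score(test_score, holdout_score):
    """Weighted average of two per-round scores, each assumed to lie in [0, 1]."""
    return TEST_WEIGHT * test_score + HOLDOUT_WEIGHT * holdout_score


# Hypothetical example: 24 of 30 correct in the test round, 10 of 20 in the holdout round
example = final_score(24 / 30, 10 / 20)
print(f"{example:.2f}")
```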


## Tool Use Cases

**DuckDB:** SQL queries on structured data (Parquet tables) and logs. Fast for aggregations, joins, filtering.

**LanceDB:** Semantic search when you need to find things by meaning, not exact matches. Good for searching PDFs or finding similar log entries.

**MotherDuck:** Cloud version of DuckDB. Useful for sharing data/queries with teammates or working with larger datasets.


## MCP Server Quickstart

1. **Pick a language/runtime.** With the stdio transport, an MCP server only needs stdin/stdout plus JSON-RPC. Python (`mcp`), TypeScript (`@modelcontextprotocol/sdk`), or Go all work.
2. **Choose resources.** Decide what data you’ll expose (DuckDB tables, Parquet files, PDF search). Give each a stable URI so clients know how to reference them.
3. **Implement tools.** Each MCP tool wraps an action—run SQL, summarize a log window, fetch a PDF section. Keep inputs/outputs typed and minimal so LLMs can call them safely.
4. **Advertise capabilities.** Declare tool/resource support in the `initialize` response; clients such as Cursor or Claude Desktop then discover the details through `tools/list` and `resources/list`.
5. **Run & register.** Start the server (e.g., `python mcp_server.py`) and add that command under `Cursor → Settings → MCP Servers`.

Minimal Python scaffold:

```python
from mcp.server.fastmcp import FastMCP
import duckdb

con = duckdb.connect('retail.duckdb')
server = FastMCP("retail")

@server.tool()
def describe_customer(limit: int = 5) -> list[dict]:
    """Return sample customer rows from DuckDB."""
    cur = con.execute(
        "SELECT c_customer_sk, c_first_name, c_last_name FROM customer LIMIT ?",
        [limit],
    )
    columns = [col[0] for col in cur.description]
    # fetchall() returns tuples, so pair each value with its column name
    return [dict(zip(columns, row)) for row in cur.fetchall()]

if __name__ == "__main__":
    server.run()
```

Register `python mcp_server.py` as a custom MCP server and the `describe_customer` tool becomes available directly in your prompts.
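
For orientation, this is roughly what a client sends when it invokes that tool. The sketch builds a JSON-RPC 2.0 `tools/call` request as I understand the MCP specification; `make_tool_call` is a hypothetical helper, and in practice the client library constructs this message for you.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = make_tool_call(1, "describe_customer", {"limit": 3})
# Over the stdio transport each message is serialized as a single line of JSON
wire_message = json.dumps(request) + "\n"
print(wire_message, end="")
```

Keeping tool arguments small and typed, as in `{"limit": 3}`, makes these requests easy for a model to produce correctly.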

> **Shortcut:** Don’t want to build your own? MotherDuck ships an OSS MCP server that connects to both DuckDB and MotherDuck backends, complete with SaaS/read-only modes and Claude/Cursor examples. Install it via `uvx mcp-server-motherduck …` using the instructions in their repo: https://github.com/motherduckdb/mcp-server-motherduck.
