# Quick Start Example (Optional)

This notebook demonstrates basic data access patterns using DuckDB, pandas, and LanceDB.

## Contents
- Tool installation
- Loading Parquet files
- SQL queries
- Log file parsing
- Semantic search with LanceDB
- Submission format
- Competition structure
- Tool use cases


## Install Tools

Run this once:


In [1]:
%pip install -q duckdb pandas pyarrow lancedb

import duckdb
import pandas as pd
from pathlib import Path

print("Ready.")


Note: you may need to restart the kernel to use updated packages.
Ready.


## Connect to DuckDB

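DuckDB is an in-process analytical database that can run SQL directly on Parquet files, and it is fast for aggregations and joins. The connection below stores its tables in a local file, `retail.duckdb`.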


In [2]:
con = duckdb.connect('retail.duckdb')
print("Connected to DuckDB")


Connected to DuckDB


## Load Data

Load Parquet files into DuckDB so you can query them:


In [3]:
# Load a few example tables
# You'll need to load all 24 tables for the actual competition

if Path('data/parquet/customer.parquet').exists():
    con.execute("CREATE OR REPLACE TABLE customer AS SELECT * FROM 'data/parquet/customer.parquet'")
    con.execute("CREATE OR REPLACE TABLE store_sales AS SELECT * FROM 'data/parquet/store_sales.parquet'")
    con.execute("CREATE OR REPLACE TABLE date_dim AS SELECT * FROM 'data/parquet/date_dim.parquet'")
    
    print("Tables loaded")
    print(con.execute("SHOW TABLES").df())
else:
    print("Data files not found. Download the dataset first.")


Data files not found. Download the dataset first.
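
The cell above loads three tables by hand. To load every Parquet file in the directory, a loop along these lines may help. This is a minimal sketch, not an official loader: `load_all_parquet` is a hypothetical helper, and it assumes each file is named `<table>.parquet` under `data/parquet/`, matching the paths used above.

```python
from pathlib import Path


def load_all_parquet(con, parquet_dir="data/parquet"):
    """Create one table per *.parquet file and return the table names loaded.

    `con` is any object with an execute(sql) method, such as the DuckDB
    connection opened earlier. Table names come from the file stems, which
    assumes a <table>.parquet naming scheme (an assumption, not a given).
    """
    loaded = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        table = path.stem
        # Skip stems that are not plain identifiers instead of quoting them.
        if not table.isidentifier():
            continue
        con.execute(
            f"CREATE OR REPLACE TABLE {table} AS "
            f"SELECT * FROM '{path.as_posix()}'"
        )
        loaded.append(table)
    return loaded
```

Calling `load_all_parquet(con)` returns the list of tables it created, or an empty list if the directory is missing or holds no Parquet files.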


## Run a Query

Example: find the top five customers by total spending (`ss_net_paid`) in 2023:


In [4]:
if Path('data/parquet/customer.parquet').exists():
    query = """
    SELECT 
        c.c_customer_sk,
        c.c_first_name,
        c.c_last_name,
        SUM(ss.ss_net_paid) as total_spent
    FROM store_sales ss
    JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    WHERE d.d_year = 2023
    GROUP BY c.c_customer_sk, c.c_first_name, c.c_last_name
    ORDER BY total_spent DESC
    LIMIT 5
    """
    
    result = con.execute(query).df()
    print(result)


## Read Log Files

Logs are in JSONL format (one JSON object per line), which pandas can read with `lines=True`:


In [5]:
if Path('data/logs/clickstream.jsonl').exists():
    logs = pd.read_json('data/logs/clickstream.jsonl', lines=True)
    print(f"Loaded {len(logs)} events")
    print(logs.head())
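
If a log file is too large to load at once, or contains a few malformed lines, reading it line by line is an alternative. The following is a sketch using only the standard library. `iter_jsonl` is a hypothetical helper, and skipping bad lines (rather than raising) is an assumption about the behavior you want.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file.

    Lines that are not valid JSON are skipped and counted; the count is
    printed once the whole file has been read.
    """
    skipped = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                skipped += 1
    if skipped:
        print(f"Skipped {skipped} malformed line(s)")
```

The result can still be turned into a DataFrame, for example with `pd.DataFrame(list(iter_jsonl('data/logs/clickstream.jsonl')))`.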


## Alternative: Query Logs with DuckDB

You can also load logs into DuckDB and use SQL:


In [6]:
if Path('data/logs/clickstream.jsonl').exists():
    con.execute("CREATE OR REPLACE TABLE logs AS SELECT * FROM read_json_auto('data/logs/clickstream.jsonl')")
    
    result = con.execute("""
        SELECT event_type, COUNT(*) as count 
        FROM logs 
        GROUP BY event_type
    """).df()
    
    print(result)


## Using LanceDB for Semantic Search

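LanceDB is useful when you need to search by meaning rather than exact matches. Example use case: searching through log descriptions or PDF content. The cell below only loads the log data and applies a plain filter; semantic search also needs an embedding (vector) column to search against.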


In [7]:
import lancedb

# Connect to LanceDB
db = lancedb.connect('lance_db')

# Example: Load log data into LanceDB
if Path('data/logs/clickstream.jsonl').exists():
    log_data = pd.read_json('data/logs/clickstream.jsonl', lines=True)
    
    # Create a table
    table = db.create_table("logs", data=log_data, mode="overwrite")
    
    print(f"Loaded {len(log_data)} rows into LanceDB")
    
    # Query example: Filter logs
    results = table.search().where("event_type = 'product_view'").limit(5).to_pandas()
    print("\nSample product view events:")
    print(results)
else:
    print("Log files not found.")




Log files not found.
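
The cell above only filters rows. Semantic search instead ranks rows by how close their embedding vectors are to a query vector. The sketch below illustrates that ranking in plain Python with made-up 3-dimensional vectors. It is not LanceDB's implementation: `cosine_similarity`, `top_k`, and the toy vectors are all hypothetical. In practice an embedding model would produce the vectors and LanceDB would run the search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, rows, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": made-up numbers for illustration only.
rows = [
    ("user viewed a product page", [0.9, 0.1, 0.0]),
    ("user added item to cart", [0.2, 0.9, 0.1]),
    ("payment failed at checkout", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # pretend embedding of "someone looked at an item"

for text, score in top_k(query, rows):
    print(f"{score:.3f}  {text}")
```

The row whose vector points in nearly the same direction as the query ranks first, even though the texts share no exact keywords.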


## Submission Format

Save your answers in CSV format:

```csv
question_id,answer_type,answer_value,confidence,explanation
1,customer_id,12345,high,Top customer by revenue
1,total_spent,50000,high,Sum of net_paid
```

Create submissions programmatically:


In [8]:
submission = pd.DataFrame([
    {'question_id': 1, 'answer_type': 'customer_id', 'answer_value': 12345, 'confidence': 'high', 'explanation': 'Top customer by revenue'},
    {'question_id': 1, 'answer_type': 'total_spent', 'answer_value': 50000, 'confidence': 'high', 'explanation': 'Sum of net_paid'},
])

print(submission)

# To save:
# submission.to_csv('my_submission.csv', index=False)


   question_id  answer_type  answer_value confidence              explanation
0            1  customer_id         12345       high  Top customer by revenue
1            1  total_spent         50000       high          Sum of net_paid
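
Before submitting, it can help to check that a saved file has the expected header. The following is a minimal sketch using the standard library. `check_submission` is a hypothetical helper, and the only rules it enforces are the five column names shown above plus non-empty `question_id` and `answer_value` fields. Any stricter rules (allowed `confidence` values, for example) are not specified here and are not checked.

```python
import csv

EXPECTED_COLUMNS = [
    "question_id", "answer_type", "answer_value", "confidence", "explanation",
]


def check_submission(path):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"Unexpected header: {reader.fieldnames}"]
        # Number data rows from 1; the header row is not counted.
        for row_no, row in enumerate(reader, start=1):
            for column in ("question_id", "answer_value"):
                if not (row[column] or "").strip():
                    problems.append(f"Row {row_no}: empty {column}")
    return problems
```

An empty return value from `check_submission('my_submission.csv')` means the header matches and no required field is blank.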


## Competition Structure

**Training Round (12:30-2:00):** 25 questions with answers provided. Practice only, no submission required.

**Test Round (2:00-6:00):** 30 questions without answers. Submit by 6:00 PM. Worth 70% of final score.

**Holdout Round (6:00-7:30):** 20 secret questions. We run your system automatically. Worth 30% of final score.
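
Assuming the final score is a simple weighted sum of the two scored rounds (an assumption; the README should confirm the exact formula), the weighting can be sketched as follows. `final_score` is a hypothetical helper that takes each round's accuracy as a fraction between 0 and 1.

```python
def final_score(test_accuracy, holdout_accuracy):
    """Combine round accuracies (each between 0 and 1) with 70/30 weights."""
    for value in (test_accuracy, holdout_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must be between 0 and 1")
    # Test round counts for 70%, holdout round for 30%.
    return 0.7 * test_accuracy + 0.3 * holdout_accuracy
```

Under this assumption the Test Round dominates, but a system that fails on the holdout questions still gives up almost a third of the total.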

That's the basics. Check the README for more details.


## Tool Use Cases

**DuckDB:** SQL queries on structured data (Parquet tables) and logs. Fast for aggregations, joins, filtering.

**LanceDB:** Semantic search when you need to find things by meaning, not exact matches. Good for searching PDFs or finding similar log entries.

**MotherDuck:** Cloud version of DuckDB. Useful for sharing data/queries with teammates or working with larger datasets.
