# Web Shop Orders — Databao Data Preparation Demo (Case 1)

Welcome! This notebook walks through a practical end‑to‑end data preparation workflow for a webshop dataset using Databao.

We follow the user workflow and progression:
- Flow: Understanding → Cleaning → Integration → Feature Engineering → Aggregation & Export → Insights
- Progression: “What data do we have?” → “Is it clean?” → “Can we aggregate and group it?” → “What KPIs can we compute?” → “What are the trends and drivers?” → “What actions do we take?”


In [7]:
# Quick installs (safe to re-run)
!pip install -q duckdb databao matplotlib pandas


In [1]:
# Imports and DB connection
import os
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

# Connect to the local DuckDB file. Use read_only=False to allow registering temp views/DFs.
DB_PATH = "data/web_shop_raw.duckdb"
if not os.path.exists(DB_PATH):
    raise FileNotFoundError(
        f"Expected DuckDB at {DB_PATH}. If you don't have it, check the examples README for setup instructions."
    )
conn = duckdb.connect(DB_PATH, read_only=False)
print(f"Connected to DuckDB database: {DB_PATH}")


Connected to DuckDB database: data/web_shop_raw.duckdb


In [2]:
# Databao imports
import databao
from databao import LLMConfig


### 1) Choose your LLM (cloud or local)

Option A — Cloud (OpenAI)
- Requires environment variable OPENAI_API_KEY to be set in your shell/Jupyter.
- Example (Jupyter only): `%env OPENAI_API_KEY=YOUR_API_KEY` before running this cell.
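
Before building the cloud config, it can help to fail fast when the key is missing. The sketch below uses only the standard library; the helper name `require_env` is hypothetical and not part of Databao.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or use %env in Jupyter."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the config cell would surface a missing key before any request is sent.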

Option B — Local (Ollama)
- Install Ollama and pull a suitable model, e.g. `ollama pull qwen3:8b` or a tuned config.

The cell below uses the OpenAI config. To run against a local model instead, comment it out and uncomment the Ollama config.


In [4]:
# Option A — Cloud (OpenAI). Low temperature helps produce stable SQL/plots.
llm_config = LLMConfig(name="gpt-4.1-2025-04-14", temperature=0)

# Option B — Local (Ollama). Uncomment to use a local model instead of OpenAI.
# llm_config = LLMConfig.from_yaml("../configs/qwen3-8b-ollama.yaml")


### 2) Create a Databao agent and register sources/contexts

We’ll register:
- a DuckDB connection
- a small per‑source schema overview (markdown file)
- a general, project‑wide context via `agent.add_context()`


In [None]:
agent = databao.new_agent(name="preparation-demo", llm_config=llm_config)

# Register the DuckDB connection as a source (a per-source context, such as a path to a markdown file, is also supported)
agent.add_db(conn)

# Add additional, project‑wide context (not tied to a specific source)
agent.add_context(
    """
    Project‑wide notes:
    - Monetary values are in EUR unless stated otherwise.
    - Orders with status 'canceled' should be excluded from KPIs unless explicitly requested.
    - Treat obviously fake/test rows as data quality issues (e.g., emails like test@, or names like 'Test User').
    - Date columns use UTC timestamps unless noted; compute delivery_days from purchase to delivered_customer.
    """
)

print("Registered DBs:", list(agent.dbs.keys()))


### 3) Start a thread (.ask() is eager by default)

- `.ask(prompt)` executes immediately (eager mode) and materializes results.
- You can switch to lazy mode via `agent.thread(lazy=True)` if you want to chain asks before executing.


In [11]:
thread = agent.thread()

## Stage 1 — Data Understanding

Question: What data do we have, and how are the tables connected?
Action: Explore schema (orders, items, products, payments, reviews, customers, sellers).
Outcome: Identify key joins (order_id, customer_id, seller_id).


In [12]:
thread.ask(
    """
    Provide a concise schema overview of our webshop database. Include:
    - List of core tables (orders, order_items, products, payments, reviews, customers, sellers).
    - Primary keys and key foreign keys per table.
    - A short note on how to join them at the order level.
    Return a small markdown/text summary; keep it tight.
    """
)




In [13]:
display(Markdown(thread.text()))

Here’s a concise schema overview of your webshop database:

---

### Core Tables & Keys

- **orders**
  - Primary Key: `order_id`
  - Foreign Keys: `customer_id` → customers
- **order_items**
  - Primary Key: (`order_id`, `order_item_id`)
  - Foreign Keys: `order_id` → orders, `product_id` → products, `seller_id` → sellers
- **products**
  - Primary Key: `product_id`
  - Foreign Keys: `product_category_name` → product_category_name_translation (optional, for category translation)
- **order_payments**
  - Primary Key: (`order_id`, `payment_sequential`)
  - Foreign Key: `order_id` → orders
- **order_reviews**
  - Primary Key: `review_id`
  - Foreign Key: `order_id` → orders
- **customers**
  - Primary Key: `customer_id`
- **sellers**
  - Primary Key: `seller_id`

---

### Joining at the Order Level

- Use `order_id` as the central key:
  - Join `orders` to `order_items`, `order_payments`, and `order_reviews` via `order_id`.
  - Join `orders.customer_id` to `customers.customer_id`.
  - Join `order_items.product_id` to `products.product_id`.
  - Join `order_items.seller_id` to `sellers.seller_id`.

---

**Note:** Exclude orders with status 'canceled' for most KPIs. All monetary values are in EUR.
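
The join paths listed above only work cleanly if every foreign key has a matching parent row. A minimal sketch of an anti-join check for orphaned `order_items` follows. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without the demo database, and the rows are invented for illustration.

```python
import sqlite3

# Toy rows, invented for illustration; the demo tables are not used here.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE orders (order_id TEXT, customer_id TEXT);
    CREATE TABLE order_items (order_id TEXT, order_item_id INTEGER, seller_id TEXT);
    INSERT INTO orders VALUES ('ord_a', 'cust_a'), ('ord_b', 'cust_b');
    INSERT INTO order_items VALUES
        ('ord_a', 1, 'sell_x'),
        ('ord_a', 2, 'sell_y'),
        ('ord_missing', 1, 'sell_x');
    """
)

# Anti-join: order_items rows whose order_id has no match in orders.
orphans = conn_demo.execute(
    """
    SELECT oi.order_id, oi.order_item_id
    FROM order_items AS oi
    LEFT JOIN orders AS o ON o.order_id = oi.order_id
    WHERE o.order_id IS NULL
    """
).fetchall()
print(orphans)
```

The same `LEFT JOIN ... IS NULL` pattern can be pointed at `product_id`, `seller_id`, or `customer_id` to check the other relationships.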

In [14]:
# Ask for quick row counts to gauge table sizes
thread.ask(
    """
    Return a dataframe of row counts for the main tables
    """
)




[tool_call: 'run_sql_query']


```sql
SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
UNION ALL
SELECT 'order_items', COUNT(*) FROM order_items
UNION ALL
SELECT 'products', COUNT(*) FROM products
UNION ALL
SELECT 'order_payments', COUNT(*) FROM order_payments
UNION ALL
SELECT 'order_reviews', COUNT(*) FROM order_reviews
UNION ALL
SELECT 'customers', COUNT(*) FROM customers
UNION ALL
SELECT 'sellers', COUNT(*) FROM sellers
```


[tool_call_output: 'run_sql_query']
```
Exception Name: CatalogException. Exception Desc: Catalog Error: Table wi
```



In [15]:
row_counts = thread.df()
row_counts

|   | table_name | row_count |
|---|---|---|
| 0 | orders | 606 |
| 1 | order_items | 1352 |
| 2 | products | 106 |
| 3 | order_payments | 1409 |
| 4 | order_reviews | 172 |
| 5 | customers | 203 |
| 6 | sellers | 55 |


## Stage 2 — Cleaning & Validation

Question: Are there duplicates or missing values?
Action: Fix data types, clean nulls, remove test rows.
Outcome: Normalized, ready-to-merge datasets.

We will ask Databao to propose cleaning steps and produce cleaned, queryable outputs (views or temp tables) we can reuse.
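
As one way to get a reusable cleaned output without modifying the raw tables, duplicates can be hidden behind a view that keeps the first row per key. The sketch below is an assumption about how such a view might look, not Databao's output. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, with invented rows and a hypothetical view name `orders_clean`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders (order_id TEXT, order_status TEXT);
    INSERT INTO orders VALUES
        ('ord_a', 'delivered'),
        ('ord_b', 'shipped'),
        ('ord_a', 'delivered');
    -- Keep the first physical row per order_id without touching the base table.
    CREATE VIEW orders_clean AS
    SELECT * FROM orders
    WHERE rowid IN (SELECT MIN(rowid) FROM orders GROUP BY order_id);
    """
)

# Detect duplicated keys, then compare raw and cleaned row counts.
duplicates = db.execute(
    "SELECT order_id, COUNT(*) AS cnt FROM orders "
    "GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()
raw_count = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
clean_count = db.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]
print(duplicates, raw_count, clean_count)
```

A view leaves the raw data intact, so the cleaning rule can be revised later without reloading the source.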


In [16]:
thread.ask(
    """
    Check tables for common issues and propose pragmatic cleaning steps:
    - Duplicates by natural keys (e.g., one row per order_id in orders etc.).
    - Obvious type issues (timestamps, numerics).
    - Clearly fake/test rows (e.g., customers with emails like '%test%' or names 'test%').
    Show me the tables with issues and describe what is wrong with each.
    """
)


To address your request, I will:

1. **Check for duplicate rows** in each main table using their natural (primary) keys.
2. **Scan for obvious type issues** in key columns (e.g., timestamps that are null or invalid, numerics that are negative where not expected).
3. **Identify clearly fake/test rows** (e.g., customers with names or emails containing 'test').

I will show only the tables where issues are found and describe the problems.

---

**Step 1: Check for Duplicates by Natural Keys**

- orders: should be unique by order_id
- order_items: unique by (order_id, order_item_id)
- products: unique by product_id
- order_payments: unique by (order_id, payment_sequential)
- order_reviews: unique by review_id
- customers: unique by customer_id
- sellers: unique by seller_id

**Step 2: Obvious Type Issues**

- Timestamps: check for nulls in required timestamp columns (e.g., order_purchase_timestamp in orders)
- Numerics: check for negative prices, payment values, or freight values

**Step 

|   | table_name | order_id | cnt |
|---|---|---|---|
| 0 | orders | ord_0044 | 2 |
| 1 | orders | ord_0301 | 2 |
| 2 | orders | ord_0205 | 2 |


In [17]:
print(thread.text())

Issues found in the following tables:

1. orders: Duplicate order_id values found (ord_0044, ord_0301, ord_0205). This violates the expected uniqueness of order_id. Cleaning step: Remove or deduplicate these rows.

2. products: Duplicate product_id values found (prod_0001, prod_0093, prod_0055, prod_0094, prod_0002). Cleaning step: Remove or deduplicate these rows.

3. sellers: Duplicate seller_id values found (sell_0042, sell_0004, sell_0048, sell_0035, sell_0039). Cleaning step: Remove or deduplicate these rows.

4. orders: Null or missing order_purchase_timestamp/order_approved_at for order_id ord_0205. Cleaning step: Investigate and fill or remove incomplete records.

5. customers: Clearly fake/test customer_ids (test_cust_01, test_cust_ceo, test_cust_03) and related rows with suspicious city/state/zip. Cleaning step: Remove rows where customer_id or city/state/zip contains 'test'.

No negative prices, payment values, or freight values were found. No duplicate keys in order_items, 

In [18]:
thread.ask(
    """
    Give me SQL to fix the issues mentioned above.
    """
)


Here are SQL statements to pragmatically fix the issues found:

---

### 1. Remove Duplicate Rows (Keep Only the First Occurrence)

**orders:**
```sql
DELETE FROM db1.main.orders
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM db1.main.orders
    GROUP BY order_id
);
```

**products:**
```sql
DELETE FROM db1.main.products
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM db1.main.products
    GROUP BY product_id
);
```

**sellers:**
```sql
DELETE FROM db1.main.sellers
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM db1.main.sellers
    GROUP BY seller_id
);
```

---

### 2. Remove Orders with Missing Timestamps

```sql
DELETE FROM db1.main.orders
WHERE order_purchase_timestamp IS NULL OR order_approved_at IS NULL;
```

---

### 3. Remove Clearly Fake/Test Customers

```sql
DELETE FROM db1.main.customers
WHERE LOWER(customer_id) LIKE '%test%'
   OR LOWER(customer_city) LIKE '%test%'
   OR LOWER(customer_state) LIKE '%test%'
   OR LOWER(customer_zip_code_prefix) LIKE '%test%';
```

## Stage 3 — Integration

Question: Can we combine customers, sellers, payments, and reviews?
Action: Join entities into a consolidated dataset.
Outcome: Unified order-level table with spend, freight, category, seller location, and reviews.
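
Collapsing items to one row per order needs a rule for orders that span several sellers or categories. A common choice is the most frequent value. The sketch below shows that rule in plain Python on invented rows. The tie-break (smallest value wins) is an assumption made here for determinism and may differ from what the agent's SQL does.

```python
from collections import Counter, defaultdict

# Hypothetical (order_id, seller_id, category) item rows for illustration.
items = [
    ("ord_a", "sell_x", "toys"),
    ("ord_a", "sell_y", "toys"),
    ("ord_a", "sell_y", "books"),
    ("ord_b", "sell_z", "garden"),
]


def most_frequent(values):
    """Most common value; ties go to the smallest value for determinism."""
    counts = Counter(values)
    return min(counts, key=lambda v: (-counts[v], v))


# Group item rows by order, then reduce each group to a single summary row.
by_order = defaultdict(list)
for order_id, seller_id, category in items:
    by_order[order_id].append((seller_id, category))

orders_summary = {
    order_id: {
        "items_count": len(rows),
        "seller_id": most_frequent(s for s, _ in rows),
        "main_category": most_frequent(c for _, c in rows),
    }
    for order_id, rows in by_order.items()
}
print(orders_summary)
```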


In [19]:
thread.ask(
    """
    Build a unified order-level dataset from the cleaned tables with one row per order_id. Include at least:
    - order_id, order_purchase_timestamp, order_approved_at
    - customer_id, customer_city, customer_state
    - seller_id (dominant seller for the order if multiple), seller_city, seller_state
    - items_count, total_items_price, total_freight_value, total_payment_value
    - main_product_category_name (mode across items), review_score
    Name this output orders_unified (as a DuckDB temp table) and also return it as a DataFrame sample (head).
    """
)



To build the unified order-level dataset (orders_unified), I will:

1. Join orders with customers to get customer info.
2. Aggregate order_items to get item counts, total price, total freight, and dominant seller (most items in the order).
3. Join sellers to get seller info for the dominant seller.
4. Aggregate product categories per order and select the mode (most frequent).
5. Aggregate order_payments for total payment value.
6. Join order_reviews for review_score.
7. Combine all into a single temp table orders_unified, one row per order_id.

Now, I will write and execute the SQL to create this temp table and return a sample.

[tool_call: 'run_sql_query']
```
{"sql":"CREATE OR REPLACE TEMP TABLE orders_unified AS\nWITH item_aggs AS (\n    SELECT \n        oi.order_id,\n        COUNT(*) AS items_count,\n        SUM(oi.price) AS total_items_price,\n        SUM(oi.freight_value) AS total_freight_value,\n        -- Dominant seller: seller with most items in the order\n        FIRST(sell
```



In [21]:
orders_unified = thread.df()
orders_unified

|   | order_id | order_purchase_timestamp | order_approved_at | customer_id | customer_city | customer_state | seller_id | seller_city | seller_state | items_count | total_items_price | total_freight_value | total_payment_value | main_product_category_name | review_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ord_0026 | 2025-07-27 18:02:00 | 2025-07-27 21:00:00 | cust_0026 | leeuwarden | FR | sell_0006 | zwolle | OV | 1 | 50.59 | 27.58 | 78.17 | huisdieren |  |
| 1 | ord_0034 | 2025-07-25 14:39:00 | 2025-07-25 17:06:00 | cust_0034 | zaandam | NH | sell_0043 | breda | NB | 1 | 151.47 | 57.73 | 209.2 | kantoorartikelen |  |
| 2 | ord_0038 | 2025-07-06 00:58:00 | 2025-07-06 02:55:00 | cust_0038 | assen | DR | sell_0033 | emmmen | DR | 1 | 214.1 | 7.92 | 227.23 | tuin_gereedschap |  |
| 3 | ord_0098 | 2025-06-02 05:08:00 | 2025-06-02 05:37:00 | cust_0098 | rotterdam | ZH | sell_0007 | vlissingen | ZL | 3 | 878.88 | 99.33 | 915.79 | bed_bad_tafel |  |
| 4 | ord_0121 | 2025-09-25 06:40:00 | 2025-09-25 08:49:00 | cust_0121 | zwolle | OV | sell_0019 | tilburg | NB | 3 | 2335.63 | 100.04 | 2435.67 | coole_spullen |  |
| 5 | ord_0154 | 2025-09-17 00:17:00 | 2025-09-17 02:01:00 | cust_0154 | zaandam | NH | sell_0012 | zoetermeer | ZH | 3 | 2771.86 | 98.02 | 2822.08 | coole_spullen |  |
| 6 | ord_0194 | 2025-09-03 15:42:00 | 2025-09-03 17:24:00 | cust_0194 | nieuwegein | UT | sell_0040 | lelystad | FL | 1 | 1778.02 | 29.79 | 1807.81 | meubels_decoratie |  |
| 7 | ord_0203 | 2025-09-11 18:11:00 | 2025-09-11 18:21:00 | cust_0003 | zwolle | OV | sell_0044 | lelystad | FL | 1 | 125.46 | 41.15 | 144.71 | huishoudartikelen |  |
| 8 | ord_0211 | 2025-07-14 08:21:00 | 2025-07-14 10:40:00 | cust_0011 | goes | ZL | sell_0020 | roermond | LI | 8 | 2409.21 | 399.7 | 2808.91 |  |  |
| 9 | ord_0260 | 2025-08-14 11:55:00 | 2025-08-14 14:29:00 | cust_0060 | hoofddorp | NH | sell_0048 | groningen | GR | 2 | 187.45 | 86.2 | 206.07 | kantoorartikelen |  |


## Stage 4 — Feature Engineering

Question: Which KPIs help analyze performance?
Action: Compute total_price, freight, delivery_days, delay_days, review_score.
Outcome: Metrics such as average delivery time, review score, and order value.


In [22]:
thread.ask(
    """
    From orders_unified, compute a feature-rich table orders_features with per-order KPIs. Choose the most important metrics and explain why you chose each one.
    """
)



To create a feature-rich table orders_features from orders_unified, I will select and engineer the most important per-order KPIs. Here’s my reasoning for each metric:

---

### Key Metrics & Why They Matter

1. **order_id**  
   - Unique identifier for tracking and joining.

2. **order_purchase_timestamp, order_approved_at**  
   - Useful for time-based analysis and calculating delays.

3. **customer_id, seller_id**  
   - For customer/seller segmentation and repeat analysis.

4. **items_count**  
   - Indicates basket size; larger baskets may signal higher-value customers.

5. **total_items_price, total_freight_value, total_payment_value**  
   - Core financials: revenue, shipping cost, and total paid.

6. **avg_item_price**  
   - Shows if the order is made up of high- or low-value items.

7. **freight_share**  
   - Proportion of shipping in total payment; high values may indicate expensive delivery or low-value items.

8. **payment_vs_items_diff**  
   - Difference between what wa



In [24]:
display(Markdown(thread.text()))

Here is a feature-rich per-order KPI table (orders_features) created from orders_unified. The most important metrics included are:

- **order_id**: Unique order identifier.
- **order_purchase_timestamp, order_approved_at**: For time-based and delay analysis.
- **customer_id, seller_id**: For segmentation and repeat analysis.
- **items_count**: Basket size, indicating order value and customer type.
- **total_items_price, total_freight_value, total_payment_value**: Core financials for revenue and cost analysis.
- **avg_item_price**: Indicates if the order is made up of high- or low-value items.
- **freight_share**: Proportion of shipping in total payment, useful for cost structure analysis.
- **payment_vs_items_diff**: Difference between payment and item price, highlighting discounts, surcharges, or errors.
- **main_product_category_name**: For product mix and category analysis.
- **review_score**: Direct customer satisfaction indicator.
- **delivery_delay_days**: Measures operational efficiency and customer experience.

These metrics are chosen because they cover financial, operational, and customer experience aspects, which are essential for understanding and optimizing webshop performance.

Sample (head) of orders_features:

| order_id  | items_count | total_items_price | total_freight_value | total_payment_value | avg_item_price | freight_share | payment_vs_items_diff | main_product_category_name | review_score | delivery_delay_days |
|-----------|-------------|-------------------|---------------------|--------------------|---------------|--------------|----------------------|---------------------------|--------------|--------------------|
| ord_0026  | 1           | 50.59             | 27.58               | 78.17              | 50.59         | 0.35         | 27.58                | huisdieren                |              | 0                  |
| ord_0034  | 1           | 151.47            | 57.73               | 209.2              | 151.47        | 0.28         | 57.73                | kantoorartikelen           |              | 0                  |
| ...       | ...         | ...               | ...                 | ...                | ...           | ...          | ...                  | ...                       | ...          | ...                |
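
The agent's output does not show the formulas behind the derived columns. The sketch below recomputes three of them under assumed definitions (`avg_item_price` as price per item, `freight_share` as freight over total payment, `payment_vs_items_diff` as payment minus item price). These definitions are inferred from the sample rows, which they reproduce, and are not confirmed by the generated SQL.

```python
# Values copied from the ord_0026 and ord_0038 sample rows.
sample = [
    {"order_id": "ord_0026", "items_count": 1, "total_items_price": 50.59,
     "total_freight_value": 27.58, "total_payment_value": 78.17},
    {"order_id": "ord_0038", "items_count": 1, "total_items_price": 214.1,
     "total_freight_value": 7.92, "total_payment_value": 227.23},
]


def derive_kpis(row):
    """Recompute the three derived columns under the assumed formulas."""
    return {
        "avg_item_price": round(row["total_items_price"] / row["items_count"], 6),
        "freight_share": round(
            row["total_freight_value"] / row["total_payment_value"], 6
        ),
        "payment_vs_items_diff": round(
            row["total_payment_value"] - row["total_items_price"], 2
        ),
    }


kpis = {row["order_id"]: derive_kpis(row) for row in sample}
print(kpis)
```

Spot checks like this are a cheap way to confirm what a generated column actually means before it feeds a dashboard.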

In [23]:
orders_features = thread.df()
orders_features

|   | order_id | order_purchase_timestamp | order_approved_at | customer_id | seller_id | items_count | total_items_price | total_freight_value | total_payment_value | main_product_category_name | review_score | avg_item_price | freight_share | payment_vs_items_diff | delivery_delay_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ord_0026 | 2025-07-27 18:02:00 | 2025-07-27 21:00:00 | cust_0026 | sell_0006 | 1 | 50.59 | 27.58 | 78.17 | huisdieren |  | 50.59 | 0.352821 | 27.58 | 0 |
| 1 | ord_0034 | 2025-07-25 14:39:00 | 2025-07-25 17:06:00 | cust_0034 | sell_0043 | 1 | 151.47 | 57.73 | 209.2 | kantoorartikelen |  | 151.47 | 0.275956 | 57.73 | 0 |
| 2 | ord_0038 | 2025-07-06 00:58:00 | 2025-07-06 02:55:00 | cust_0038 | sell_0033 | 1 | 214.1 | 7.92 | 227.23 | tuin_gereedschap |  | 214.1 | 0.034855 | 13.13 | 0 |
| 3 | ord_0098 | 2025-06-02 05:08:00 | 2025-06-02 05:37:00 | cust_0098 | sell_0007 | 3 | 878.88 | 99.33 | 915.79 | bed_bad_tafel |  | 292.96 | 0.108464 | 36.91 | 0 |
| 4 | ord_0121 | 2025-09-25 06:40:00 | 2025-09-25 08:49:00 | cust_0121 | sell_0019 | 3 | 2335.63 | 100.04 | 2435.67 | coole_spullen |  | 778.543333 | 0.041073 | 100.04 | 0 |
| 5 | ord_0154 | 2025-09-17 00:17:00 | 2025-09-17 02:01:00 | cust_0154 | sell_0012 | 3 | 2771.86 | 98.02 | 2822.08 | coole_spullen |  | 923.953333 | 0.034733 | 50.22 | 0 |
| 6 | ord_0194 | 2025-09-03 15:42:00 | 2025-09-03 17:24:00 | cust_0194 | sell_0040 | 1 | 1778.02 | 29.79 | 1807.81 | meubels_decoratie |  | 1778.02 | 0.016479 | 29.79 | 0 |
| 7 | ord_0203 | 2025-09-11 18:11:00 | 2025-09-11 18:21:00 | cust_0003 | sell_0044 | 1 | 125.46 | 41.15 | 144.71 | huishoudartikelen |  | 125.46 | 0.284362 | 19.25 | 0 |
| 8 | ord_0211 | 2025-07-14 08:21:00 | 2025-07-14 10:40:00 | cust_0011 | sell_0020 | 8 | 2409.21 | 399.7 | 2808.91 |  |  | 301.15125 | 0.142297 | 399.7 | 0 |
| 9 | ord_0260 | 2025-08-14 11:55:00 | 2025-08-14 14:29:00 | cust_0060 | sell_0048 | 2 | 187.45 | 86.2 | 206.07 | kantoorartikelen |  | 93.725 | 0.418304 | 18.62 | 0 |


## Stage 5 — Aggregation, Grouping & Export

Question: Can we summarize results per category or seller?
Action: Aggregate by category, month, and seller.
Outcome: Analytical dataset ready for KPI dashboards or modeling.
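
For reference, a month-by-category rollup with `orders_count`, `revenue_total`, and `aov` (average order value) can be expressed as a single grouped query. The sketch below is illustrative only. It uses Python's built-in `sqlite3` with `strftime` as a stand-in for DuckDB's `date_trunc`, the rows are invented, and treating `total_payment_value` as revenue is an assumption.

```python
import sqlite3

agg_db = sqlite3.connect(":memory:")
agg_db.executescript(
    """
    CREATE TABLE orders_features (
        order_id TEXT,
        order_purchase_timestamp TEXT,
        main_product_category_name TEXT,
        total_payment_value REAL
    );
    -- Hypothetical rows, not taken from the demo database.
    INSERT INTO orders_features VALUES
        ('ord_a', '2025-07-03 10:00:00', 'toys', 100.0),
        ('ord_b', '2025-07-20 12:30:00', 'toys', 50.0),
        ('ord_c', '2025-08-01 09:15:00', 'toys', 30.0),
        ('ord_d', '2025-07-11 18:45:00', 'books', 20.0);
    """
)

# One output row per (month, category); aov = revenue_total / orders_count.
rows = agg_db.execute(
    """
    SELECT strftime('%Y-%m', order_purchase_timestamp) AS month,
           main_product_category_name,
           COUNT(*) AS orders_count,
           SUM(total_payment_value) AS revenue_total,
           SUM(total_payment_value) / COUNT(*) AS aov
    FROM orders_features
    GROUP BY 1, 2
    ORDER BY 1, 2
    """
).fetchall()
for row in rows:
    print(row)
```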


In [None]:
# Aggregate by month and product category
thread.ask(
    """
    From orders_features, join back main_product_category_name (if not already present) and aggregate by:
    month = date_trunc('month', order_purchase_timestamp), main_product_category_name.
    Compute: orders_count, revenue_total, aov (average order value).
    Return as df_category_month.
    """
)


In [None]:
df_category_month = thread.df()
df_category_month.head()

In [None]:
# Aggregate by seller overall
thread.ask(
    """
    Aggregate performance by seller_id using orders_features
    """
)


In [None]:
df_seller_kpis = thread.df()
df_seller_kpis.head()


### Wrap up

- You saw a full data preparation flow with Databao: Understanding → Cleaning → Integration → Feature Engineering → Aggregation & Export.
- We used per‑source and project‑wide contexts to guide the LLM.
- .ask() was used in eager mode (default) to materialize results step by step. If you prefer, create a thread with `lazy=True` to chain multiple asks before computing.
- The resulting datasets are ready for downstream analytics, dashboards, or modeling.
