# Reporting with Databao - Web shop orders demo (Case 2)

Welcome! This notebook walks you through a complete exploratory data analysis (EDA) workflow using [Databao](https://databao.app), a data agent that helps you query, clean, and visualize your enterprise data.
You'll learn how to calculate and analyze metrics, generate charts and tables, and get insights.

The notebook contains a DuckDB file with a sample dataset, and it can be used with both cloud and local LLMs.
To use a cloud LLM, such as GPT-5.1, you will need an OpenAI API key.

You can learn more about connecting to data, using LLMs, and running Databao in the [Databao docs](https://jetbrains.github.io/databao-docs/).

🚀 Let's dive in!


## Project setup

### Install and import packages

In [None]:
# Install Databao and other packages (safe to rerun)
!pip install -q duckdb databao matplotlib pandas

In [1]:
# Import packages
import os
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

# Connect to the local DuckDB file (read-only)
DB_PATH = "data/web_shop.duckdb"
conn = duckdb.connect(DB_PATH, read_only=True)
print(f"Connected to DuckDB database: {DB_PATH}")

Connected to DuckDB database: data/web_shop.duckdb


In [2]:
# Import Databao
import databao
from databao import LLMConfig

### Configure your LLM

Databao supports both cloud and local LLMs.
For this demo, it's easier and faster to use an OpenAI cloud model, but it requires an API key.

If you prefer to use a local model, all your data remains on your machine, but downloading a model may take some time. Depending on the model you use and your machine specs, generating answers may be slower compared to a cloud model.

For easier setup, this notebook uses a cloud LLM by default. If you prefer to use a local LLM, uncomment the corresponding section and comment out the line with the cloud LLM config.


In [None]:
# Add your OpenAI API key. Comment out the following line if you prefer to use a local model
%env OPENAI_API_KEY=<YOUR_API_KEY>

In [4]:
# Option A: Cloud model (OpenAI). A low temperature helps produce more deterministic SQL and plots
llm_config = LLMConfig(name="gpt-5.1", temperature=0)

# Option B: Local model (Ollama)
# llm_config = LLMConfig.from_yaml("../configs/qwen3-8b-ollama.yaml")  # Use a custom config file
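
If you switch between the two options often, a small check can tell you which one is usable in the current environment. The sketch below is only an illustration: `pick_llm_option` and `PLACEHOLDER` are hypothetical names, not part of Databao, and the helper only inspects the `OPENAI_API_KEY` environment variable set in the previous cell.

```python
import os

# Hypothetical helper (not part of Databao): suggest which option from the
# cell above to use, based on whether an OpenAI API key is available.
PLACEHOLDER = "<YOUR_API_KEY>"


def pick_llm_option(env=None):
    """Return "cloud" if a usable OPENAI_API_KEY is set, otherwise "local"."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # An empty value or the untouched placeholder means no real key was provided
    if key and key != PLACEHOLDER:
        return "cloud"
    return "local"


option = pick_llm_option()
print(f"Suggested LLM option: {option}")
```

If the suggestion is `"local"`, uncomment Option B and comment out Option A before creating the agent.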

### Create a Databao agent session and register data sources

An *agent* in Databao acts as the main interface for database connections and context.
It can handle multiple *threads* (conversations), each operating independently on the same data sources.



In [5]:
# Create a new agent and add a database connection
agent = databao.new_agent(llm_config=llm_config)
agent.add_db(conn)

# Start a new thread
thread = agent.thread()

## Run analysis and get insights

The following sections guide you through the different steps of data analysis.
In every step, Databao uses your questions to generate SQL queries, return the results as dataframes, produce charts, or provide text explanations.


### 1. Descriptive metrics & KPI overview

##### How do our key business metrics perform overall?

Goal: Calculate and analyze top-line KPIs, including total orders, revenue, AOV, freight, delivery time, and satisfaction.



In [6]:
# Ask a question in the thread
thread.ask(
    """
    Compute a KPI overview
    Return:
      - total orders
      - total revenue from all orders
      - average order value (AOV)
      - total freight
      - average delivery days (only for delivered orders)
      - average review score (satisfaction proxy)
    """
)


**Calculating totals and orders**

I need to carefully aggregate totals: for revenue, I'll sum distinct payments by order ID in the fct_order_payments table. I'll do the same for freight costs from fct_sales. For total orders, I should count distinct orders in dim_orders, but maybe limit it to those with at least one item or payment. I'm thinking of left joining with aggregated data to accurately reflect all orders, while being cautious about the ambiguous KPI specs.

**Defining total revenue and freight**

I'm calculating total revenue by summing payment values in fct_order_payments, filtering those orders through an inner join for distinct orders. However, if there are partial payments, that can complicate things. I could also compute revenue by summing item prices and freight values from fct_sales, but since there's a separate total for freight, I'll define total revenue strictly as item sales.

So:  
- Total revenue as sum of item prices in fct_sales.  
- Total freight fr

Unnamed: 0,total_orders,total_revenue,average_order_value,total_freight,average_delivery_days,average_review_score
0,565,673450.06,1191.95,43640.23,10.09,3.41


In [7]:
# Output the result as a dataframe
df_kpis = thread.df()
df_kpis

Unnamed: 0,total_orders,total_revenue,average_order_value,total_freight,average_delivery_days,average_review_score
0,565,673450.06,1191.95,43640.23,10.09,3.41
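
It's worth sanity-checking an agent-computed KPI against its definition. The short sketch below recomputes AOV as total revenue divided by total orders, using only the values from the table above, and also derives freight as a share of merchandise revenue. The variable names are chosen for illustration.

```python
# Values copied from the KPI table above
total_orders = 565
total_revenue = 673450.06
total_freight = 43640.23
reported_aov = 1191.95

# AOV = total merchandise revenue / number of distinct orders
recomputed_aov = round(total_revenue / total_orders, 2)
assert recomputed_aov == reported_aov, "AOV does not match its definition"

# Freight as a share of merchandise revenue (freight is excluded from revenue here)
freight_share = total_freight / total_revenue
print(f"Recomputed AOV: {recomputed_aov}")
print(f"Freight share of revenue: {freight_share:.1%}")
```

The same pattern works for any derived metric: recompute it from its inputs and compare it with the value the agent returned.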


In [8]:
# Check out the SQL query used to calculate the result
print("SQL query for the KPI overview:\n", thread.code())

SQL query for the KPI overview:
 WITH sales AS (
  SELECT * FROM db1.main.fct_sales
),
orders_in_sales AS (
  SELECT DISTINCT order_id FROM sales
),
reviews_in_sales AS (
  SELECT DISTINCT review_id FROM sales WHERE review_id IS NOT NULL
)
SELECT
  -- Total number of orders that have at least one sales line
  (SELECT COUNT(DISTINCT order_id) FROM sales) AS total_orders,
  -- Total merchandise revenue (excluding freight)
  (SELECT ROUND(SUM(item_price), 2) FROM sales) AS total_revenue,
  -- Average order value = total revenue / total orders
  (SELECT ROUND(SUM(item_price) / COUNT(DISTINCT order_id), 2) FROM sales) AS average_order_value,
  -- Total freight collected/charged
  (SELECT ROUND(SUM(item_freight_value), 2) FROM sales) AS total_freight,
  -- Average delivery days for delivered orders (based on orders that appear in sales)
  (
    SELECT ROUND(
      AVG(
        DATEDIFF('day',
          d.order_purchased_at::DATE,
          d.order_delivered_to_customer_at::DATE
        )
   

### 2. Trend & seasonality analysis

Goal: Identify monthly trends in revenue, orders, and reviews.

In [14]:
# Threads have memory, so new questions can reference previous answers in the same thread
thread.ask(
    """
    Produce monthly time series for:
      - revenue
      - orders_count
      - average_review_score
    Use the same definitions as before.
    Include 2-month moving averages.
    """
)


Step 1: I'll build monthly aggregates from `fct_sales` using the established definitions:
- revenue = SUM(item_price) per month of order_purchased_at
- orders_count = COUNT(DISTINCT order_id) per month.

Step 2: I'll compute monthly average_review_score using distinct reviews linked to sales, mapped to the order's purchase month.

Step 3: I'll join these into a single monthly table and add 2-month moving averages with window functions, then return the full time series as a dataset plus a line chart visualization.

[tool_call: 'run_sql_query']
```
{"sql": "WITH sales AS (\n  SELECT\n    order_id,\n    review_id,\n    item_price,\n    order_purchased_at\n  FROM db1.main.fct_sales\n),\n-- Base month per order (from order_purchased_at in fct_sales)\norders_month AS (\n  SELECT DISTINCT\n    order_id,\n    DATE_TRUNC('month', order_purchased_at)::DATE AS month_start\n  FROM sales\n),\n-- Monthly revenue and orders count\nrevenue_orders AS (\n  SELECT\n    DATE_TRUNC('month', orde
```

Unnamed: 0,month_start,revenue,orders_count,average_review_score,revenue_ma_2m,orders_count_ma_2m,average_review_score_ma_2m
0,2025-06-01,166536.77,129,3.088235,166536.77,129.0,3.09
1,2025-07-01,179466.59,144,3.325581,173001.68,136.5,3.21
2,2025-08-01,164676.53,140,3.477273,172071.56,142.0,3.4
3,2025-09-01,162770.17,152,3.627451,163723.35,146.0,3.55


In [15]:
df_trend = thread.df()
df_trend

Unnamed: 0,month_start,revenue,orders_count,average_review_score,revenue_ma_2m,orders_count_ma_2m,average_review_score_ma_2m
0,2025-06-01,166536.77,129,3.088235,166536.77,129.0,3.09
1,2025-07-01,179466.59,144,3.325581,173001.68,136.5,3.21
2,2025-08-01,164676.53,140,3.477273,172071.56,142.0,3.4
3,2025-09-01,162770.17,152,3.627451,163723.35,146.0,3.55
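
To see how the `*_ma_2m` columns relate to the raw series, here is a minimal sketch of a trailing 2-month moving average in plain Python. It assumes the window covers the current month and the one before it, and that the first month averages over itself only, which is consistent with the values in the table above. The function name `trailing_mean` is chosen for illustration.

```python
def trailing_mean(values, window=2):
    """Trailing moving average; early points average over the rows available so far."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Monthly values copied from the table above (2025-06 to 2025-09)
revenue = [166536.77, 179466.59, 164676.53, 162770.17]
orders_count = [129, 144, 140, 152]

revenue_ma_2m = [round(v, 2) for v in trailing_mean(revenue)]
orders_count_ma_2m = trailing_mean(orders_count)

print(revenue_ma_2m)
print(orders_count_ma_2m)
```

With only four months of data, a 2-month window smooths very little. Treat the moving averages as a demonstration of the technique, not as evidence of seasonality.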


In [16]:
# Generate a chart
thread.plot('Draw a line chart for Revenue and on the other axis show average review score. Include legend')

In [13]:
print("SQL query for trends & seasonality:\n", thread.code())

SQL query for trends & seasonality:
 WITH delivered_orders AS (
    SELECT 
        order_id,
        order_purchased_at
    FROM db1.main.dim_orders
    WHERE order_status = 'delivered'
      AND order_purchased_at IS NOT NULL
),
monthly_orders AS (
    SELECT
        DATE_TRUNC('month', order_purchased_at)::DATE AS month,
        COUNT(DISTINCT order_id) AS orders_count
    FROM delivered_orders
    GROUP BY 1
),
monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', d.order_purchased_at)::DATE AS month,
        SUM(fp.payment_value) AS revenue
    FROM db1.main.fct_order_payments fp
    JOIN delivered_orders d USING(order_id)
    GROUP BY 1
),
monthly_reviews AS (
    SELECT
        DATE_TRUNC('month', review_sent_at)::DATE AS month,
        AVG(review_score) AS average_review_score
    FROM db1.main.dim_order_reviews
    WHERE review_sent_at IS NOT NULL
    GROUP BY 1
),
month_spine AS (
    SELECT month FROM monthly_orders
    UNION
    SELECT month FROM monthly_revenue
    U

### 3. Payment & fulfilment behavior

Goal: Correlate payment types and delivery performance with AOV and satisfaction.

Deliverables: Grouped bar charts for AOV and avg_review_score by payment_type and installments buckets; dataframe with review scores and AOV per payment type and installments buckets


In [20]:
# Start a new thread
thread = agent.thread()

thread.ask(
    """
    Analyze payment behavior and fulfilment performance:
    - Group by payment_type and installments buckets (1, 2-6, >6).
    - Compute AOV and avg_review_score for each group.
    """
)


**Analyzing payment records**

I need to group payments at the order level because multiple payment methods can exist for one order. I'm considering using the payment method with the highest value as the primary one, or treating each record individually. I plan to define a group based on payment type and installments, keeping average order value in mind. I'll also check if having multiple records per order is common, even though I want to minimize using multiple tools for this check. Ultimately, I'll review both options as instructed.

**Preparing to compute metrics**

I need to calculate the average review score by joining with either stg__order_reviews or dim_order_reviews using order_id. The stg__order_reviews table has both order_id and review_score, making this easier. For fulfillment performance, while the user didn't specify metrics, I'll focus on average order value (AOV) and average review score. Before running any queries, I'll provide clear steps. I'll also categorize p

Unnamed: 0,payment_type,installments_bucket,order_count,reviewed_order_count,aov,avg_review_score
0,boleto,1,23,3,691.759130,3.000000
1,boleto,2-6,73,24,538.439041,3.583333
2,boleto,>6,20,6,249.313000,1.833333
3,credit_card,1,155,46,768.777871,3.630435
4,credit_card,2-6,245,77,1009.792286,3.337662
...,...,...,...,...,...,...
7,debit_card,2-6,67,19,550.784776,3.105263
8,debit_card,>6,18,4,292.396111,2.000000
9,voucher,1,25,10,853.856400,3.200000
10,voucher,2-6,75,21,333.897467,3.285714


In [21]:
df_payment = thread.df()
df_payment

Unnamed: 0,payment_type,installments_bucket,order_count,reviewed_order_count,aov,avg_review_score
0,boleto,1,23,3,691.75913,3.0
1,boleto,2-6,73,24,538.439041,3.583333
2,boleto,>6,20,6,249.313,1.833333
3,credit_card,1,155,46,768.777871,3.630435
4,credit_card,2-6,245,77,1009.792286,3.337662
5,credit_card,>6,74,23,2288.473108,3.130435
6,debit_card,1,30,10,500.757333,3.8
7,debit_card,2-6,67,19,550.784776,3.105263
8,debit_card,>6,18,4,292.396111,2.0
9,voucher,1,25,10,853.8564,3.2


In [22]:
thread.plot('draw a grouped bar chart - x axis should be payment_type, color - installments bucket')

In [23]:
print("SQL query for payment & fulfillment:\n", thread.code())

SQL query for payment & fulfillment:
 WITH payments AS (
    SELECT 
        fp.order_id,
        dp.payment_type,
        dp.payment_installments,
        fp.payment_value
    FROM db1.main.fct_order_payments fp
    JOIN db1.main.dim_order_payments dp
        USING (payment_id)
),
payments_enriched AS (
    SELECT
        order_id,
        payment_type,
        CASE
            WHEN payment_installments = 1 THEN '1'
            WHEN payment_installments BETWEEN 2 AND 6 THEN '2-6'
            WHEN payment_installments > 6 THEN '>6'
            ELSE 'Unknown'
        END AS installments_bucket,
        payment_value
    FROM payments
),
order_values AS (
    -- Order-level value per payment_type & installments bucket
    SELECT
        order_id,
        payment_type,
        installments_bucket,
        SUM(payment_value) AS total_payment_value
    FROM payments_enriched
    GROUP BY 1,2,3
),
order_reviews AS (
    -- Average review_score per order (in case of multiple reviews per order

### 4. Product mix & basket analysis

#### How does freight cost differ between single-item and multi-item orders? Which type experiences higher cancellation rates?

Goal: Compare single-item vs multi-item orders in terms of freight and cancellation rates.

Deliverables: Orders count, average freight per order, and cancellation rate by item group (single vs multi). Bar plot.


In [24]:
thread = agent.thread()

thread.ask(
    """
    Compare single-item vs multi-item orders:
      - For each group, compute orders_count, avg_total_freight_per_order, and cancellation_rate.
      - Provide a bar chart illustrating differences.
    """
)


**Considering SQL Strategy**

I'm contemplating how to handle counting order items in the database, weighing whether to use COUNT(*) or COUNT(DISTINCT order_item_id) for safety. If I want a robust approach, I might lean towards COUNT(DISTINCT) since order_item_id is unique per line. Next, I plan to join with the orders table to get the order status and create a flag for order types based on the number of items. Additionally, I'll compute various metrics and set up a bar chart for visualization after running the main SQL query.

**Planning the SQL Query Steps**

I need to describe each step briefly before executing the query and explain the reasons behind my choices. It's crucial to avoid cross joins, so I'll only join stg__order_items with stg__orders based on order_id, which is standard practice. I plan to produce a single query. After running the SQL query, I'll submit the result with the correct query_id and a description, ensuring the visualization prompt is set for a simple b

Unnamed: 0,order_type,orders_count,avg_total_freight_per_order,cancellation_rate
0,multi_item,270,128.446222,7.777778
1,single_item,295,30.372034,0.677966


In [25]:
df_basket = thread.df()
df_basket

Unnamed: 0,order_type,orders_count,avg_total_freight_per_order,cancellation_rate
0,multi_item,270,128.446222,7.777778
1,single_item,295,30.372034,0.677966


In [26]:
thread.plot('draw a horizontal bar chart for orders count')

In [27]:
print("SQL query for basket analysis:\n", thread.code())

SQL query for basket analysis:
 WITH order_level AS (
    SELECT 
        soi.order_id,
        COUNT(DISTINCT soi.order_item_id) AS items_per_order,
        SUM(soi.item_freight_value)      AS total_freight
    FROM db1.main.stg__order_items AS soi
    GROUP BY soi.order_id
),
classified_orders AS (
    SELECT 
        ol.order_id,
        CASE 
            WHEN ol.items_per_order = 1 THEN 'single_item'
            ELSE 'multi_item'
        END AS order_type,
        ol.total_freight,
        so.order_status
    FROM order_level AS ol
    JOIN db1.main.stg__orders AS so
      ON ol.order_id = so.order_id
)
SELECT 
    order_type,
    COUNT(*) AS orders_count,
    AVG(total_freight) AS avg_total_freight_per_order,
    100.0 * SUM(CASE WHEN order_status = 'canceled' THEN 1 ELSE 0 END) / COUNT(*) AS cancellation_rate
FROM classified_orders
GROUP BY order_type
ORDER BY order_type;
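
The query above has three steps: aggregate items to the order level, classify each order, then summarize per group. Here is a minimal Python sketch of the same logic on made-up data. The `order_items` and `order_status` values are hypothetical, and each row is assumed to represent one distinct order item.

```python
from collections import defaultdict

# Hypothetical order items: one (order_id, item_freight_value) row per item
order_items = [
    ("a", 10.0),
    ("b", 12.0), ("b", 8.0),
    ("c", 15.0),
    ("d", 20.0), ("d", 5.0), ("d", 5.0),
]
# Hypothetical order statuses
order_status = {"a": "delivered", "b": "canceled", "c": "delivered", "d": "delivered"}

# Step 1: items and total freight per order
items_per_order = defaultdict(int)
freight_per_order = defaultdict(float)
for order_id, freight in order_items:
    items_per_order[order_id] += 1
    freight_per_order[order_id] += freight

# Step 2: classify orders (inner join: skip orders without a status)
groups = defaultdict(list)
for order_id, n_items in items_per_order.items():
    if order_id not in order_status:
        continue
    groups["single_item" if n_items == 1 else "multi_item"].append(order_id)

# Step 3: summarize each group
summary = {}
for order_type, ids in groups.items():
    canceled = sum(order_status[i] == "canceled" for i in ids)
    summary[order_type] = {
        "orders_count": len(ids),
        "avg_total_freight_per_order": sum(freight_per_order[i] for i in ids) / len(ids),
        "cancellation_rate": 100.0 * canceled / len(ids),  # percent, as in the query
    }

print(summary)
```

Note that `cancellation_rate` is expressed in percent, so the 7.78 reported for multi-item orders means roughly 7.8% of those orders were canceled.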


### 5. Customer retention & cohort trends

Goal: Analyze cohort-based customer LTV and monthly revenue over time, segmented by customers' first-order month. Include each cohort's size and the number of months active. Plot cumulative LTV per cohort per month (area or line).


In [28]:
thread = agent.thread()

thread.ask(
    """
    Build customer cohorts by first_order_month.
    For each cohort across subsequent months, compute:
      - monthly_revenue_per_cohort
      - cumulative_LTV_per_customer (revenue divided by cohort size)
      - cohort_size
      - months_since_cohort_start
    """
)


**Defining cohort analysis**

I need to use the database to set up cohorts based on the month of the first purchase. Customers will be grouped according to their first order date, which can be found in the `dim_customers` table. I'll also consider linking orders and customers through `stg__order_customers`. I have to decide if monthly revenue should include freight costs. The monthly revenue per cohort will be the total revenue calculation. I'll need to calculate cumulative lifetime value too, following the steps provided before running the SQL queries.

**Connecting customers to sales data**

I need to effectively join customers to sales data since `fct_sales` includes `customer_id`. I can connect this to `dim_customers` using that ID. For handling months, I plan to use `date_trunc('month', ...)` to group data. I might consider joining to `metricflow_time_spine` to avoid missing months, although the question doesn't explicitly require it. Including only months with revenue is proba

Unnamed: 0,cohort_month,revenue_month,months_since_cohort_start,cohort_size,monthly_revenue_per_cohort,cumulative_LTV_per_customer
0,2025-06-01,2025-06-01,0,105,177834.03,1693.657429
1,2025-06-01,2025-07-01,1,105,53274.06,2201.029429
2,2025-06-01,2025-08-01,2,105,75580.85,2920.847048
3,2025-06-01,2025-09-01,3,105,71139.77,3598.368667
4,2025-07-01,2025-07-01,0,68,137484.81,2021.835441
5,2025-07-01,2025-08-01,1,68,55422.62,2836.873971
6,2025-07-01,2025-09-01,2,68,54988.99,3645.535588
7,2025-08-01,2025-08-01,0,23,43793.06,1904.046087
8,2025-08-01,2025-09-01,1,23,36242.83,3479.821304
9,2025-09-01,2025-09-01,0,4,11329.27,2832.3175


In [30]:
df_cohort = thread.df()
df_cohort

Unnamed: 0,cohort_month,revenue_month,months_since_cohort_start,cohort_size,monthly_revenue_per_cohort,cumulative_LTV_per_customer
0,2025-06-01,2025-06-01,0,105,177834.03,1693.657429
1,2025-06-01,2025-07-01,1,105,53274.06,2201.029429
2,2025-06-01,2025-08-01,2,105,75580.85,2920.847048
3,2025-06-01,2025-09-01,3,105,71139.77,3598.368667
4,2025-07-01,2025-07-01,0,68,137484.81,2021.835441
5,2025-07-01,2025-08-01,1,68,55422.62,2836.873971
6,2025-07-01,2025-09-01,2,68,54988.99,3645.535588
7,2025-08-01,2025-08-01,0,23,43793.06,1904.046087
8,2025-08-01,2025-09-01,1,23,36242.83,3479.821304
9,2025-09-01,2025-09-01,0,4,11329.27,2832.3175
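
The `cumulative_LTV_per_customer` column is a running total of cohort revenue divided by cohort size. The sketch below reproduces it for the June 2025 cohort, using only the values from the table above.

```python
from itertools import accumulate

# June 2025 cohort, values copied from the table above
cohort_size = 105
monthly_revenue = [177834.03, 53274.06, 75580.85, 71139.77]

# Running revenue total per month, divided by the fixed cohort size
cumulative_ltv = [total / cohort_size for total in accumulate(monthly_revenue)]

for months_since_start, ltv in enumerate(cumulative_ltv):
    print(f"month {months_since_start}: {ltv:.2f}")
```

Because the cohort size stays fixed, cumulative LTV can only grow over time. Younger cohorts have fewer months of history, so compare cohorts at the same `months_since_cohort_start`, not at their latest values.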


In [40]:
thread.plot('Line chart of cumulative LTV by cohort age')

In [32]:
print("SQL query for cohort analysis:\n", thread.code())

SQL query for cohort analysis:
 WITH cohorts AS (
    SELECT
        customer_id,
        CAST(DATE_TRUNC('month', first_order_made_at) AS DATE) AS cohort_month
    FROM db1.main.dim_customers
    WHERE first_order_made_at IS NOT NULL
),
cohort_sizes AS (
    SELECT
        cohort_month,
        COUNT(DISTINCT customer_id) AS cohort_size
    FROM cohorts
    GROUP BY cohort_month
),
cohort_revenue AS (
    SELECT
        c.cohort_month,
        CAST(DATE_TRUNC('month', fs.order_purchased_at) AS DATE) AS revenue_month,
        SUM(fs.item_price + fs.item_freight_value) AS monthly_revenue_per_cohort
    FROM cohorts c
    JOIN db1.main.fct_sales fs
      ON c.customer_id = fs.customer_id
    GROUP BY c.cohort_month, revenue_month
),
cohort_revenue_enriched AS (
    SELECT
        cr.cohort_month,
        cr.revenue_month,
        cs.cohort_size,
        cr.monthly_revenue_per_cohort,
        DATE_DIFF('month', cr.cohort_month, cr.revenue_month) AS months_since_cohort_start
    FROM cohor

### 6. Delivery performance & logistics efficiency

Goal: Analyze seller_state → customer_state lanes with sufficient volume; compute average and median delivery days and orders per lane; visualize as a heatmap and a ranked bar chart.


In [41]:
thread = agent.thread()

thread.ask(
    """
    For seller_state → customer_state lanes with at least 20 delivered orders:
      - Compute avg_delivery_days, median_delivery_days, orders_count.
    """
)


**Developing SQL Aggregation**

I need to reason stepwise about SQL aggregation before formulating the query. I'll apply the fact table fct_sales, which joins with dim_orders and includes customer and seller details for states. It's crucial to focus on delivered orders, indicated by order_status = 'delivered' from dim_orders. Then there's the mapping between orders, sellers, and customers, supported by calculating delivery days using date_diff. To verify the function in DuckDB, date_diff seems to work as expected when tested.

**Calculating Median with DuckDB**

I need to calculate the median, and DuckDB offers a median() aggregate function or quantile_cont(0.5). I can use median(delivery_days), but I can't refer to the alias in the same select. Maybe I'll compute it in a CTE. I also need to filter for lanes that have at least 20 delivered orders using a HAVING clause. It's essential to deduplicate the orders because fct_sales shows one row per order item, but I need the distinct coun

Unnamed: 0,seller_state,customer_state,orders_count,avg_delivery_days,median_delivery_days
0,UT,ZH,25,10.12,10.0
1,DR,NH,24,10.541667,10.0
2,LI,ZH,23,9.478261,10.0
3,UT,UT,20,6.95,7.5


In [42]:
df_lanes = thread.df()
df_lanes

Unnamed: 0,seller_state,customer_state,orders_count,avg_delivery_days,median_delivery_days
0,UT,ZH,25,10.12,10.0
1,DR,NH,24,10.541667,10.0
2,LI,ZH,23,9.478261,10.0
3,UT,UT,20,6.95,7.5
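
The lane analysis groups orders by state pair, keeps only lanes with enough volume, and then compares the mean with the median. Here is a small sketch of that logic with Python's `statistics` module. The `orders` rows are made up, and the threshold is lowered from the notebook's 20 to 3 so the toy data produces a result.

```python
from collections import defaultdict
from statistics import mean, median

MIN_ORDERS = 3  # the notebook uses 20; lowered here for the toy data

# Hypothetical delivered orders: (seller_state, customer_state, delivery_days)
orders = [
    ("UT", "ZH", 9), ("UT", "ZH", 10), ("UT", "ZH", 14),
    ("UT", "UT", 6), ("UT", "UT", 8),
    ("DR", "NH", 7), ("DR", "NH", 10), ("DR", "NH", 10), ("DR", "NH", 13),
]

# Group delivery times by lane
lanes = defaultdict(list)
for seller_state, customer_state, days in orders:
    lanes[(seller_state, customer_state)].append(days)

# Keep only lanes with enough volume (the equivalent of a HAVING clause)
lane_stats = {
    lane: {
        "orders_count": len(days),
        "avg_delivery_days": mean(days),
        "median_delivery_days": median(days),
    }
    for lane, days in lanes.items()
    if len(days) >= MIN_ORDERS
}

for lane, stats in lane_stats.items():
    print(lane, stats)
```

A mean above the median, as in the UT → ZH lane of this toy data, suggests a few slow deliveries are pulling the average up.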


In [44]:
thread.plot('heatmap of avg_delivery_days by seller-customer state pair')

In [45]:
print("SQL query for lanes analysis:\n", thread.code())

SQL query for lanes analysis:
 WITH delivered_orders AS (
    SELECT 
        o.order_id,
        o.order_purchased_at::DATE AS order_purchased_date,
        o.order_delivered_to_customer_at::DATE AS delivered_date,
        date_diff('day', o.order_purchased_at::DATE, o.order_delivered_to_customer_at::DATE) AS delivery_days
    FROM db1.main.dim_orders o
    WHERE o.order_status = 'delivered'
      AND o.order_delivered_to_customer_at IS NOT NULL
      AND o.order_purchased_at IS NOT NULL
), order_customer_state AS (
    SELECT 
        s.order_id,
        c.customer_state
    FROM db1.main.fct_sales s
    JOIN db1.main.dim_customers c ON s.customer_id = c.customer_id
    GROUP BY s.order_id, c.customer_state
), order_seller_state AS (
    SELECT 
        s.order_id,
        se.seller_state
    FROM db1.main.fct_sales s
    JOIN db1.main.dim_sellers se ON s.seller_id = se.seller_id
    GROUP BY s.order_id, se.seller_state
), order_lanes AS (
    SELECT 
        d.order_id,
        os.s

### 7. Correlation & efficiency analysis
Goal: Explore relationships among cost, delivery time, satisfaction, and revenue. Deliverables: correlation matrix and scatter plots with trend lines.


In [46]:
thread = agent.thread()

thread.ask(
    """
    Construct an order-level analysis with the following numeric fields:
      - revenue_per_order
      - total_freight
      - delivery_days
      - review_score (satisfaction)
    Compute a correlation matrix for selected pairs:
      - delivery_days vs review_score
      - total_freight vs revenue_per_order
      - delivery_days vs revenue_per_order
    Write a short summary of the results and explain it in simple words.
    """
)


**Analyzing database metrics**

I need to analyze order-level metrics and correlations using SQL on this database, specifically with DuckDB. First, I'll need to decide whether to use fact, staging, or raw tables. For metrics like revenue per order, I'll sum item prices from relevant tables. I'll also find total freight and delivery days by calculating differences between specific timestamps. The reviews table will help with review scores. I need to explore both options thoroughly to ensure I choose the most robust approach.

**Planning order-level analysis**

First, I'll compare record counts between the orders tables: dim_orders, stg__orders, and raw.orders. I also need to check if fct_sales includes all orders or just those with items, as my analysis focuses on orders with shipping and review data. I'll probably use fct_sales, dim_orders, and dim_order_reviews. Dim tables like dim_order_items are often distinct from staging, while fct_sales includes review IDs which point to 

Unnamed: 0,corr_delivery_vs_review,corr_freight_vs_revenue,corr_delivery_vs_revenue
0,-0.304521,0.666748,0.000559


In [47]:
df_corr = thread.df()
df_corr

Unnamed: 0,corr_delivery_vs_review,corr_freight_vs_revenue,corr_delivery_vs_revenue
0,-0.304521,0.666748,0.000559
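
The values above are Pearson correlation coefficients, which range from -1 to 1. To make the calculation concrete, here is a minimal pure-Python sketch. The `delivery_days` and `review_score` lists are made-up sample data, not values from the web shop dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical order-level values: slower deliveries tend to get lower scores
delivery_days = [5, 8, 10, 12, 15, 20]
review_score = [5, 4, 4, 3, 3, 1]

print(f"corr(delivery_days, review_score) = {pearson(delivery_days, review_score):.3f}")
```

A negative coefficient, like the -0.30 reported for delivery time vs review score, means the two values tend to move in opposite directions. A value near 0, like the 0.000559 for delivery time vs revenue, indicates no linear relationship. Correlation alone does not show causation.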


In [48]:
thread.plot()

In [50]:
print("SQL query for correlation analysis:\n", thread.code())

SQL query for correlation analysis:
 -- Compute Pearson correlations for the requested pairs
WITH order_metrics AS (
    SELECT 
        d.order_id,
        f.revenue_per_order,
        f.total_freight,
        d.delivery_days,
        r.review_score
    FROM (
        SELECT 
            o.order_id,
            DATE_DIFF('day', o.order_purchased_at::DATE, o.order_delivered_to_customer_at::DATE) AS delivery_days
        FROM db1.main.dim_orders o
        WHERE o.order_delivered_to_customer_at IS NOT NULL
    ) d
    JOIN (
        SELECT 
            oi.order_id,
            SUM(oi.item_price) AS revenue_per_order,
            SUM(oi.item_freight_value) AS total_freight
        FROM db1.main.stg__order_items oi
        GROUP BY oi.order_id
    ) f ON d.order_id = f.order_id
    JOIN db1.main.stg__order_reviews r ON d.order_id = r.order_id
)
SELECT 
    corr(delivery_days, review_score)      AS corr_delivery_vs_review,
    corr(total_freight, revenue_per_order) AS corr_freight_vs_revenu

### 8. Compare performance & generate insights
Goal: Rank top and bottom performers and generate narrative insights/recommendations suitable for reporting.


In [62]:
thread = agent.thread()

thread.ask(
    """
    Identify performance by seller:
      - Rank top/bottom performers on revenue, AOV, and average_review_score.
      - Provide a summary table with ranks and key metrics.
      - Generate narrative insights and brief recommendations (bulleted) suitable for a report.
    """
)


**Defining revenue metrics**

I'm considering "gross revenue" as the total of item price and item freight value, but I think I'll just go with item price for simplicity, since the question doesn't explicitly mention freight. In e-commerce, revenue can include freight, which complicates things a bit. I also need to find the top and bottom performers, possibly top and bottom 10 sellers based on revenue, average order value, and review scores. A summary table could be helpful here. I'll need to use certain tools for running queries.

**Developing seller metrics**

I'm thinking I might need a second query to get the top and bottom N for the narrative. However, I can derive insights by pulling the top and bottom results directly from SQL, focusing on the first 12 rows returned. To build seller metrics, I'll aggregate data from the sales and order reviews. For revenue, I'll sum item prices, and I'll consider total orders per seller to calculate average order value. I'll need a s

Unnamed: 0,seller_id,seller_city,seller_state,total_revenue,order_count,aov,avg_review_score,num_reviews,revenue_rank,aov_rank,review_rank
0,9f9692b1aed52b21a5ea668594b7f897,rotterdam,ZH,21426.93,35,612.198000,4.000000,7,1,11,7
1,17a8fb87daefff54ea1a81918ef4526b,zoetermeer,ZH,20609.83,28,736.065357,3.600000,5,2,2,16
2,0fbff65ebd0a050da5eb315b4913742c,maastricht,LI,19631.87,32,613.495937,3.470588,17,3,10,20
3,c6df05b5066fa241db19df273f88d33e,utrecht,UT,19411.59,29,669.365172,4.000000,9,4,4,7
4,15f3ff60c4386de2f9deff09d297fc82,utrecht,UT,19358.36,43,450.194419,2.769231,13,5,38,43
...,...,...,...,...,...,...,...,...,...,...,...
45,5e099f6f745d7eb52736f256772996d2,drachten,FR,9181.94,29,316.618621,3.900000,10,46,49,11
46,4b6bc5e7ee1d25f03fd86cc01a8b8d20,kampen,OV,8388.80,26,322.646154,3.333333,9,47,48,27
47,616dc9e7cb4d7fa8282ce8bfd3420708,kampen,OV,8077.15,18,448.730556,5.000000,6,48,39,1
48,9890be93654742ad61723a6db4eff490,emmmen,DR,7153.41,27,264.941111,3.375000,8,49,50,26


In [63]:
df_perf = thread.df()
df_perf

Unnamed: 0,seller_id,seller_city,seller_state,total_revenue,order_count,aov,avg_review_score,num_reviews,revenue_rank,aov_rank,review_rank
0,9f9692b1aed52b21a5ea668594b7f897,rotterdam,ZH,21426.93,35,612.198,4.0,7,1,11,7
1,17a8fb87daefff54ea1a81918ef4526b,zoetermeer,ZH,20609.83,28,736.065357,3.6,5,2,2,16
2,0fbff65ebd0a050da5eb315b4913742c,maastricht,LI,19631.87,32,613.495937,3.470588,17,3,10,20
3,c6df05b5066fa241db19df273f88d33e,utrecht,UT,19411.59,29,669.365172,4.0,9,4,4,7
4,15f3ff60c4386de2f9deff09d297fc82,utrecht,UT,19358.36,43,450.194419,2.769231,13,5,38,43
5,b9be7424bfd1f48aecfe450cec7425f4,maastricht,LI,18004.42,29,620.842069,3.888889,9,6,7,12
6,4ecf1077ea04e42e3cf2cdd54b9958d9,assen,DR,17609.16,38,463.398947,4.272727,11,7,33,3
7,2ddb1be5564390ad73105d2da0e05e52,zwolle,OV,17532.88,27,649.365926,3.454545,11,8,5,21
8,f55c207b0821146f710fa0dee35d16f5,breda,NB,16957.91,27,628.070741,3.0,9,9,6,39
9,d3ebf96597c63546df2cf48e8d19666a,emmmen,DR,16891.97,22,767.816818,3.0,6,10,1,39
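
In the table above, sellers with the same `avg_review_score` share a rank: the two sellers scoring 4.0 are both ranked 7, and the two scoring 3.0 are both ranked 39. The sketch below shows one way to produce that kind of ranking, known as competition ranking (the behavior of SQL `RANK()`). The notebook output does not show which window function the agent actually used, so treat this as an assumption.

```python
def rank_desc(values):
    """Competition ranking, highest value first: rank = 1 + count of strictly larger values."""
    return [1 + sum(other > value for other in values) for value in values]


# avg_review_score of the first five sellers in the table above
scores = [4.0, 3.6, 3.470588, 4.0, 2.769231]

print(rank_desc(scores))
```

The ranks here are relative to these five rows only, so they differ from the `review_rank` column, which ranks all 50 sellers.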


In [59]:
thread.plot()

In [64]:
print("\nNarrative insights and recommendations:\n")
print(thread.text())



Narrative insights and recommendations:

Summary: Seller performance ranked by revenue, average order value (AOV), and average review score.

Metric definitions
- total_revenue: Sum of item_price for all items sold by the seller across all orders (from fct_sales).
- order_count: Number of distinct orders per seller (we first aggregate revenue at seller_id + order_id, then count those order-level rows).
- aov: Average order value = total_revenue / order_count (implemented as AVG(order_revenue)).
- avg_review_score: Average of review_score for all orders with a linked review for that seller; sellers without reviews would have NULL here (none in this dataset).
- num_reviews: Count of non-null reviews linked to the seller's orders.
- revenue_rank: Rank 1 = highest total_revenue.
- aov_rank: Rank 1 = highest AOV.
- review_rank: Rank 1 = highest avg_review_score (NULLS LAST).

High-level insights (for report)
- Revenue concentration:
  - There are 50 sellers in total; the top 5 by revenue

### Wrapping it up

- You just walked through an EDA workflow in Databao, generating figures and tables along the way. Databao created the SQL queries that extract data from DuckDB based on the dbt context.
- To adjust results, you can edit the prompts and rerun individual cells.
- To start a fresh analysis with its own memory, create a new thread using `agent.thread()`.


In [65]:
# Close the database connection
conn.close()
print("Database connection closed successfully!")


Database connection closed successfully!
