# Reporting with Databao - Web shop orders demo (Case 2)

Welcome! This notebook walks you through a complete exploratory data analysis (EDA) workflow using [Databao](https://databao.app), a powerful data agent that helps you query, clean, and visualize your enterprise data.
You'll learn how to calculate and analyze metrics, generate charts and tables, and get insights.

The notebook comes with a DuckDB file containing a sample dataset, and it works with both cloud and local LLMs.
To use a cloud LLM, such as GPT-4.1, you will need an OpenAI API key.

You can learn more about connecting to data, using LLMs, and running Databao in the [Databao docs](https://jetbrains.github.io/databao-docs/).

🚀 Let's dive in!


## Project setup

### Install and import packages

In [None]:
# Install Databao and other packages (safe to rerun)
!pip install -q duckdb databao matplotlib pandas

In [None]:
# Import packages
import os
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

# Connect to the local DuckDB file (read-only)
DB_PATH = "data/web_shop.duckdb"
conn = duckdb.connect(DB_PATH, read_only=True)
print(f"Connected to DuckDB database: {DB_PATH}")

In [None]:
# Import Databao
import databao
from databao import LLMConfig

### Configure your LLM

Databao supports both cloud and local LLMs.
For this demo, it's easier and faster to use an OpenAI cloud model, but it requires an API key.

If you prefer to use a local model, all your data remains on your machine, but downloading a model may take some time. Depending on the model you use and your machine specs, generating answers may be slower than with a cloud model.

For easier setup, this notebook uses a cloud LLM by default. If you prefer to use a local LLM, uncomment the corresponding section and comment out the line with the cloud LLM config.
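
If you go with the cloud option, it can help to fail early when the key is missing or still set to the placeholder. The following is a minimal sketch under that assumption: `require_api_key` is a hypothetical helper, not part of Databao, and it only inspects the `OPENAI_API_KEY` environment variable used in this notebook.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    # A value such as "<YOUR_API_KEY>" means the placeholder was never replaced.
    if not key or key.startswith("<"):
        raise RuntimeError(f"{name} is not set; add your key or switch to the local model.")
    return key


# Demo with an explicit dictionary instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-demo"}))
```

In the notebook itself you would call `require_api_key()` without arguments, after the cell that sets the variable.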


In [None]:
# Add your OpenAI API key. Comment out the following line if you prefer to use a local model
%env OPENAI_API_KEY=<YOUR_API_KEY>

In [None]:
# Option A: Cloud model (OpenAI). Low temperature helps produce deterministic SQL/plots
llm_config = LLMConfig(name="gpt-4.1-2025-04-14", temperature=0)

# Option B: Local model (Ollama)
# llm_config = LLMConfig.from_yaml("../configs/qwen3-8b-ollama.yaml")  # Use a custom config file

### Create a Databao agent session and register data sources

An *agent* in Databao acts as the main interface for database connections and context.
It can handle multiple *threads* (conversations), each operating independently on the same data sources.



In [None]:
# Create a new agent and add a database connection
agent = databao.new_agent(llm_config=llm_config)
agent.add_db(conn)

# Start a new thread
thread = agent.thread()

## Run analysis and get insights

The following sections guide you through the different steps of data analysis.
In each step, Databao uses your question to generate a SQL query and can return the result as a dataframe, a chart, or a text explanation.


### 1. Descriptive metrics & KPI overview

#### How do our key business metrics perform overall?

Goal: Calculate and analyze top-line KPIs: total orders, revenue, average order value (AOV), freight, delivery time, and satisfaction.
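
To make the KPI definitions concrete, here is a dependency-free sketch on made-up rows. The tuple layout and the numbers are assumptions for illustration only; the demo database may define revenue or delivery time differently, so treat this as a way to sanity-check the agent's answer rather than as its method.

```python
from statistics import mean

# Made-up order rows: (order_id, revenue, freight, delivery_days, review_score)
orders = [
    ("o1", 120.0, 10.0, 5, 5),
    ("o2", 80.0, 8.0, 9, 3),
    ("o3", 100.0, 12.0, 7, 4),
]

total_orders = len(orders)
total_revenue = sum(o[1] for o in orders)

kpis = {
    "total_orders": total_orders,
    "total_revenue": total_revenue,
    "aov": total_revenue / total_orders,  # average order value
    "total_freight": sum(o[2] for o in orders),
    "avg_delivery_days": mean(o[3] for o in orders),
    "avg_review_score": mean(o[4] for o in orders),
}
print(kpis)
```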



In [None]:
# Ask a question in the thread
thread.ask(
    """
    Compute a KPI overview
    Return:
      - total orders
      - total revenue
      - average order value (AOV)
      - total freight
      - average delivery days
      - average review score (satisfaction proxy)
    """
)

In [None]:
# Output the result as a dataframe
df_kpis = thread.df()
df_kpis

In [None]:
# Check out the SQL query used to calculate the result
print("SQL query for the KPI overview:\n", thread.code())

### 2. Trend & seasonality analysis

Goal: Identify monthly trends in revenue, orders, and reviews.
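
The prompt in this section asks for 2-month moving averages. As a rough illustration, a trailing moving average over a monthly series can be sketched as follows. The function name and the sample values are hypothetical, and the SQL that Databao generates may treat the first month differently (for example, by averaging over a partial window).

```python
def moving_average(values, window=2):
    """Trailing moving average; positions without a full window yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


monthly_revenue = [100.0, 140.0, 120.0, 160.0]  # made-up values, oldest first
print(moving_average(monthly_revenue))  # [None, 120.0, 130.0, 140.0]
```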

In [None]:
# Threads have memory, so new questions can reference previous answers in the same thread
thread.ask(
    """
    Produce monthly time series for:
      - revenue
      - orders_count
      - average_review_score
    Include 2-month moving averages.
    """
)

In [None]:
df_trend = thread.df()
df_trend

In [None]:
# Generate a chart
thread.plot('Draw a line chart for Revenue')

In [None]:
print("SQL query for trends & seasonality:\n", thread.code())

### 3. Payment & fulfillment behavior

Goal: Correlate payment types and delivery performance with AOV and satisfaction.

Deliverables: Grouped bar charts of AOV and avg_review_score by payment_type and installment bucket; a dataframe with AOV and review scores per payment type and installment bucket.
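
The installment buckets used in this section's prompt (1, 2-6, >6) can be illustrated with a small grouping sketch. Everything here is an assumption for demonstration: the payment type names, the row layout, and the choice to average the value of each payment row are made up and may not match the demo database.

```python
from collections import defaultdict
from statistics import mean


def installment_bucket(n):
    """Map an installment count to the buckets 1, 2-6, and >6."""
    if n <= 1:  # also covers 0, should the data contain it
        return "1"
    if n <= 6:
        return "2-6"
    return ">6"


# Made-up rows: (payment_type, installments, order_value, review_score)
payments = [
    ("credit_card", 1, 100.0, 5),
    ("credit_card", 3, 200.0, 4),
    ("credit_card", 6, 300.0, 2),
    ("credit_card", 10, 500.0, 3),
    ("voucher", 1, 50.0, 4),
]

groups = defaultdict(list)
for payment_type, installments, value, score in payments:
    groups[(payment_type, installment_bucket(installments))].append((value, score))

summary = {
    key: {
        "aov": mean(v for v, _ in rows),
        "avg_review_score": mean(s for _, s in rows),
    }
    for key, rows in groups.items()
}
print(summary)
```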


In [None]:
thread.ask(
    """
    Analyze payment behavior and fulfillment performance:
    - Group by payment_type and installments buckets (1, 2-6, >6).
    - Compute AOV and avg_review_score for each group.
    Return summary DataFrames and produce a grouped bar chart.
    """
)

In [None]:
df_payment = thread.df()
df_payment

In [None]:
thread.plot()

In [None]:
print("SQL query for payment & fulfillment:\n", thread.code())

### 4. Product mix & basket analysis

#### How does freight differ between single-item and multi-item orders? Which type has a higher cancellation rate?

Goal: Compare single-item and multi-item orders in terms of freight and cancellation rates.

Deliverables: Orders count, average freight per order, and cancellation rate by item group (single vs multi); a bar plot.
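
For reference, the three metrics in this comparison can be computed by hand on a few rows. This is a sketch under stated assumptions: the row layout is invented, and the status label `"canceled"` is a guess at how cancellations are encoded, so check the actual values in the database.

```python
def summarize(orders):
    """Split orders into single-item and multi-item groups and compute the metrics."""
    groups = {"single": [], "multi": []}
    for _, item_count, freight, status in orders:
        groups["single" if item_count == 1 else "multi"].append((freight, status))

    summary = {}
    for name, rows in groups.items():
        n = len(rows)
        summary[name] = {
            "orders_count": n,
            "avg_total_freight_per_order": sum(f for f, _ in rows) / n if n else None,
            # Share of orders in the group whose status marks them as canceled.
            "cancellation_rate": sum(s == "canceled" for _, s in rows) / n if n else None,
        }
    return summary


# Made-up rows: (order_id, item_count, total_freight, status)
orders = [
    ("o1", 1, 10.0, "delivered"),
    ("o2", 1, 14.0, "canceled"),
    ("o3", 3, 30.0, "delivered"),
    ("o4", 2, 20.0, "delivered"),
    ("o5", 1, 12.0, "delivered"),
]
print(summarize(orders))
```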


In [None]:
thread.ask(
    """
    Compare single-item vs multi-item orders:
      - For each group, compute orders_count, avg_total_freight_per_order, and cancellation_rate.
      - Provide a bar chart illustrating differences.
    """
)

In [None]:
df_basket = thread.df()
df_basket

In [None]:
thread.plot()

In [None]:
print("SQL query for basket analysis:\n", thread.code())

### 5. Customer retention & cohort trends

Goal: Analyze cohort-based customer LTV and monthly revenue over time, segmented by customers' first-order month. Include each cohort's size and the number of months active. Plot cumulative LTV per cohort per month (area or line).
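
The cohort logic can be sketched on a handful of made-up orders: assign each customer to the month of their first order, sum revenue per cohort and cohort age, then divide the running total by the cohort size. The data and helper names are hypothetical, and months without revenue are simply skipped here, which the agent's SQL may handle differently.

```python
from collections import defaultdict


def month_index(ym):
    """Turn 'YYYY-MM' into a running month number so months can be subtracted."""
    year, month = map(int, ym.split("-"))
    return year * 12 + month


# Made-up rows: (customer_id, order_month, revenue)
orders = [
    ("c1", "2024-01", 100.0),
    ("c2", "2024-01", 60.0),
    ("c1", "2024-02", 40.0),
    ("c3", "2024-02", 90.0),
    ("c2", "2024-03", 20.0),
]

# The month of the first order defines each customer's cohort.
first_month = {}
for customer, ym, _ in orders:
    first_month[customer] = min(first_month.get(customer, ym), ym)

cohort_size = defaultdict(int)
for cohort in first_month.values():
    cohort_size[cohort] += 1

# (cohort, months_since_cohort_start) -> monthly revenue of that cohort
revenue = defaultdict(float)
for customer, ym, amount in orders:
    cohort = first_month[customer]
    revenue[(cohort, month_index(ym) - month_index(cohort))] += amount

# Running revenue per cohort divided by cohort size gives cumulative LTV per customer.
running = defaultdict(float)
cumulative_ltv = {}
for cohort, age in sorted(revenue):
    running[cohort] += revenue[(cohort, age)]
    cumulative_ltv[(cohort, age)] = running[cohort] / cohort_size[cohort]

print(cumulative_ltv)
```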


In [None]:
thread.ask(
    """
    Build customer cohorts by first_order_month.
    For each cohort across subsequent months, compute:
      - monthly_revenue_per_cohort
      - cumulative_LTV_per_customer (cumulative revenue divided by cohort size)
      - cohort_size
      - months_since_cohort_start
    """
)

In [None]:
df_cohort = thread.df()
df_cohort

In [None]:
thread.plot('Line chart of cumulative LTV by cohort age, with one line per cohort (separate cohorts in different colors)')

In [None]:
print("SQL query for cohort analysis:\n", thread.code())

### 6. Delivery performance & logistics efficiency

Goal: Analyze seller_state → customer_state lanes with sufficient volume; compute average and median delivery days and orders per lane; visualize as a heatmap and a ranked bar chart.
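
A lane here is a (seller_state, customer_state) pair. As a minimal sketch, the per-lane statistics with a volume threshold could look like this; the state codes are placeholders, the rows are invented, and the demo threshold is lowered to 3 so the tiny sample produces output.

```python
from collections import defaultdict
from statistics import mean, median


def lane_stats(deliveries, min_orders=20):
    """Aggregate delivery days per lane and keep lanes with enough delivered orders."""
    lanes = defaultdict(list)
    for seller_state, customer_state, days in deliveries:
        lanes[(seller_state, customer_state)].append(days)
    return {
        lane: {
            "avg_delivery_days": mean(days),
            "median_delivery_days": median(days),
            "orders_count": len(days),
        }
        for lane, days in lanes.items()
        if len(days) >= min_orders
    }


# Made-up rows: (seller_state, customer_state, delivery_days)
deliveries = [
    ("AA", "BB", 4),
    ("AA", "BB", 6),
    ("AA", "BB", 11),
    ("AA", "CC", 15),
]
print(lane_stats(deliveries, min_orders=3))
```

The average (7) and median (6) differ for the AA → BB lane because one slow delivery pulls the mean up, which is why the section asks for both.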


In [None]:
thread.ask(
    """
    For seller_state → customer_state lanes with at least 20 delivered orders:
      - Compute avg_delivery_days, median_delivery_days, orders_count.
    """
)

In [None]:
df_lanes = thread.df()
df_lanes

In [None]:
thread.plot('heatmap of avg_delivery_days by seller-customer state pair')

In [None]:
print("SQL query for lanes analysis:\n", thread.code())

### 7. Correlation & efficiency analysis

Goal: Explore relationships among cost, delivery time, satisfaction, and revenue.

Deliverables: Correlation matrix and scatter plots with trend lines.
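
For intuition about the numbers in a correlation matrix, here is a plain-Python sketch of the Pearson coefficient for one pair of fields. It assumes Pearson is the measure in use, which this notebook does not state, and the sample values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up order-level values: slower deliveries paired with lower review scores.
delivery_days = [3, 5, 8, 12, 20]
review_score = [5, 5, 4, 3, 1]

r = pearson(delivery_days, review_score)
print(round(r, 3))
```

A value close to -1 means the two fields move in opposite directions, close to +1 means they move together, and close to 0 means there is no linear relationship.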


In [None]:
thread.ask(
    """
    Construct an order-level analysis with the following numeric fields:
      - revenue_per_order
      - total_freight
      - delivery_days
      - review_score (satisfaction)
    Compute a correlation matrix for selected pairs:
      - delivery_days vs review_score
      - total_freight vs revenue_per_order
      - delivery_days vs revenue_per_order
    Write a short summary of the results and explain it in simple words.
    """
)

In [None]:
df_corr = thread.df()
df_corr

In [None]:
thread.plot()

In [None]:
print(thread.text())

In [None]:
print("SQL query for correlation analysis:\n", thread.code())

### 8. Compare performance & generate insights

Goal: Rank top and bottom performers and generate narrative insights/recommendations suitable for reporting.
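
Ranking top and bottom performers comes down to sorting by a metric. The sketch below does this for revenue growth, defined here as the relative change between two periods. That definition, the category names, and the figures are all assumptions for illustration.

```python
def revenue_growth(previous, current):
    """Relative change between two periods, e.g. 0.12 for +12%."""
    return (current - previous) / previous


def top_and_bottom(rows, metric, k=2):
    """Return the k best and k worst rows by `metric`, both in descending order."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return ordered[:k], ordered[-k:]


# Made-up categories with revenue in a previous and a current period.
raw = [("toys", 100.0, 112.0), ("books", 100.0, 95.0),
       ("garden", 100.0, 130.0), ("audio", 100.0, 102.0)]
rows = [
    {"category": name, "revenue_growth": revenue_growth(prev, cur)}
    for name, prev, cur in raw
]

top, bottom = top_and_bottom(rows, "revenue_growth")
print([r["category"] for r in top], [r["category"] for r in bottom])
```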


In [None]:
thread.ask(
    """
    Identify performance by category and by seller:
      - Rank top/bottom performers on revenue growth, AOV, and average_review_score.
      - Provide a summary table with ranks and key metrics.
      - Generate narrative insights and brief recommendations (bulleted) suitable for a report.
    """
)

In [None]:
df_perf = thread.df()
df_perf

In [None]:
thread.plot()

In [None]:
print("SQL query for performance comparison:\n", thread.code())


In [None]:
print("\nNarrative insights and recommendations:\n")
print(thread.text())


## Wrapping it up

- You just walked through an EDA workflow in Databao and generated figures and tables. Databao created the SQL queries that extracted the data from DuckDB.
- To adjust results, you can edit the prompts and rerun individual cells.
- To start a fresh analysis with its own memory, create a new separate thread using `agent.thread()`.


In [None]:
# Close the database connection
conn.close()
print("Database connection closed successfully!")
