# Web Shop Orders — Databao Reporting Demo (Case 2)

This notebook demonstrates a full exploratory data analysis (EDA) process using Databao.


Notes:
- You need a DuckDB database at `data/web_shop.duckdb`
- You can use either a cloud LLM (OpenAI) or a local model (Ollama).


In [None]:
# Quick installs (safe to re-run)
!pip install -q duckdb databao matplotlib pandas

In [None]:
# Imports and DB connection
import os
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

# Set up the database connection (read-only)
DB_PATH = "data/web_shop.duckdb"
conn = duckdb.connect(DB_PATH, read_only=True)
print(f"Connected to DuckDB database: {DB_PATH}")

In [None]:
# Databao imports
import databao
from databao import LLMConfig

### LLM configuration

Choose one of the options below:

- Cloud (OpenAI):
  - Ensure the environment variable `OPENAI_API_KEY` is set before running this cell.
  - Example (Jupyter only): `%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY` (replace with your key; do not commit it).
- Local (Ollama):
  - Install Ollama and pull a suitable model, e.g. `ollama pull gpt-oss:20b`.
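
Before building the LLM config for the cloud option, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the key is read from the `OPENAI_API_KEY` environment variable as described above. `has_api_key` is a hypothetical helper written for this notebook, not part of Databao.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if the variable is present and non-empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; set it or switch to a local model.")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.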


In [None]:
# Default: Cloud LLM (OpenAI). A temperature of 0 makes SQL/plot generation more repeatable.
llm_config = LLMConfig(name="gpt-4.1-2025-04-14", temperature=0)

# Alternative: Local LLM (uncomment the line below to use)

# llm_config = LLMConfig.from_yaml("../configs/qwen3-8b-ollama.yaml")  # Use a custom config file


In [None]:
# To use a cloud model, put your OpenAI API key in this environment variable

%env OPENAI_API_KEY=

### Open Databao session and register data

We add the DuckDB connection and provide dbt’s manifest as context for better schema understanding.


In [None]:
session = databao.open_session(name="reporting-demo", llm_config=llm_config)
session.add_db(conn)

thread = session.thread()

## Case 2: Analytics & Insights (Reporting)

The following sections align with the requested analytical stages. Each step uses Databao to generate SQL, return DataFrames, produce plots, and provide optional narrative text.


### 1) Descriptive & KPI Overview

##### How do our key business metrics perform overall?

Goal: Calculate and analyze topline KPIs, including total orders, revenue, AOV, freight, delivery time, and satisfaction.
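
To make the KPI definitions concrete, here is a small plain-Python sketch over made-up rows. It assumes one row per order, treats AOV as total revenue divided by the number of orders, and skips orders without a delivery time. The field names and values are illustrative and are not taken from `web_shop.duckdb`.

```python
from statistics import mean

# Toy order rows (made-up values, hypothetical field names)
orders = [
    {"order_id": "a", "revenue": 100.0, "freight": 10.0, "delivery_days": 5, "review_score": 5},
    {"order_id": "b", "revenue": 50.0, "freight": 5.0, "delivery_days": 9, "review_score": 3},
    {"order_id": "c", "revenue": 150.0, "freight": 15.0, "delivery_days": None, "review_score": 4},
]


def kpi_overview(rows):
    """Compute topline KPIs; undelivered orders are excluded from delivery time."""
    total_orders = len(rows)
    total_revenue = sum(r["revenue"] for r in rows)
    delivered = [r["delivery_days"] for r in rows if r["delivery_days"] is not None]
    return {
        "total_orders": total_orders,
        "total_revenue": total_revenue,
        "aov": total_revenue / total_orders if total_orders else None,
        "total_freight": sum(r["freight"] for r in rows),
        "avg_delivery_days": mean(delivered) if delivered else None,
        "avg_review_score": mean(r["review_score"] for r in rows) if rows else None,
    }


kpis = kpi_overview(orders)
print(kpis)
```

A hand-checkable toy like this is a quick way to confirm which definition of each KPI the generated SQL follows.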



In [None]:
thread.ask(
    """
    Compute a KPI overview
    Return:
      - total orders
      - total revenue
      - average order value (AOV)
      - total freight
      - average delivery days
      - average review score (satisfaction proxy)
    """
)

In [None]:
df_kpis = thread.df()
df_kpis

In [None]:
print("SQL for KPI overview:\n", thread.code())

### 2) Trend & Seasonality Analysis

Goal: Analyze monthly trends in revenue, orders, and reviews.
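
For reference, a trailing moving average over a monthly series can be sketched as follows. This assumes a trailing (not centered) window and leaves positions without a full window empty. The generated SQL may choose a different convention, so treat this as one possible reading of the request.

```python
def moving_average(values, window=2):
    """Trailing moving average; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


monthly_revenue = [100.0, 120.0, 90.0, 150.0]  # made-up values
ma2 = moving_average(monthly_revenue, window=2)
print(ma2)
```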


In [None]:
thread.ask(
    """
    Produce monthly time series for:
      - revenue
      - orders_count
      - average_review_score
    Include 2-month moving averages.
    """
)

In [None]:
df_trend = thread.df()
df_trend

In [None]:
thread.plot('Draw a line chart for Revenue')

In [None]:
print("SQL for Trend & Seasonality:\n", thread.code())

### 3) Payment & Fulfillment Behavior


Goal: Correlate payment types and delivery performance with AOV and satisfaction.

Deliverables: Grouped bar charts for AOV and avg_review_score by payment_type and installments buckets; dataframe with review scores and AOV per payment type and installments buckets
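
The installments buckets (1, 2-6, >6) and the per-group AOV can be sketched in plain Python. This is an illustration under assumptions: values of 1 or less fall into the first bucket, each tuple stands for one order, and the payment type names and amounts are made up.

```python
from collections import defaultdict


def installments_bucket(n):
    """Map an installments count to the buckets 1, 2-6, >6."""
    if n <= 1:
        return "1"
    if n <= 6:
        return "2-6"
    return ">6"


def aov_by_group(rows):
    """rows: iterable of (payment_type, installments, order_value)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of values, order count]
    for payment_type, installments, value in rows:
        key = (payment_type, installments_bucket(installments))
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows with hypothetical payment type names
payments = [
    ("card", 1, 80.0),
    ("card", 3, 120.0),
    ("card", 5, 160.0),
    ("card", 10, 300.0),
    ("voucher", 1, 60.0),
]
aov = aov_by_group(payments)
print(aov)
```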


In [None]:
thread.ask(
    """
    Analyze payment behavior and fulfillment performance:
    - Group by payment_type and installments buckets (1, 2-6, >6).
    - Compute AOV and avg_review_score for each group.
    Return summary DataFrames and produce a grouped bar chart.
    """
)

In [None]:
df_payment = thread.df()
df_payment

In [None]:
thread.plot()

In [None]:
print("SQL for Payment & Fulfillment:\n", thread.code())

### 4) Product Mix & Basket Analysis
##### How does freight differ between single-item and multi-item orders? Which group is cancelled more often?


Goal: Compare single vs multi-item orders for freight and cancellations.

Deliverables: Orders count, average freight per order, and cancellation rate by item group (single vs multi). Barplot.
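
The three deliverables can be illustrated with a short plain-Python sketch. It assumes each order row carries an item count, a total freight value, and a cancellation flag. The field names and values are hypothetical.

```python
def basket_summary(orders):
    """Summarize single-item vs multi-item orders."""
    groups = {"single": [], "multi": []}
    for order in orders:
        groups["single" if order["items"] == 1 else "multi"].append(order)

    summary = {}
    for name, rows in groups.items():
        n = len(rows)
        summary[name] = {
            "orders_count": n,
            "avg_total_freight_per_order": sum(r["freight"] for r in rows) / n if n else None,
            # True counts as 1, so the sum is the number of cancelled orders
            "cancellation_rate": sum(r["canceled"] for r in rows) / n if n else None,
        }
    return summary


# Made-up rows
orders = [
    {"items": 1, "freight": 10.0, "canceled": False},
    {"items": 1, "freight": 20.0, "canceled": True},
    {"items": 3, "freight": 30.0, "canceled": False},
    {"items": 2, "freight": 50.0, "canceled": False},
]
summary = basket_summary(orders)
print(summary)
```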


In [None]:
thread.ask(
    """
    Compare single-item vs multi-item orders:
      - For each group, compute orders_count, avg_total_freight_per_order, cancellation_rate.
      - Provide a bar chart illustrating the differences.
    """
)

In [None]:
df_basket = thread.df()
df_basket

In [None]:
thread.plot()

In [None]:
print("SQL for Basket Analysis:\n", thread.code())

### 5) Customer Retention & Cohort Trends
Goal: Cohort-based LTV and monthly revenue over time by first order month. Include cohort size and months since cohort start; plot cumulative LTV per month per cohort (area or line).
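
The cohort logic can be sketched in plain Python to show how the requested columns relate. This assumes a customer's cohort is the month of their first purchase and that cumulative LTV per customer is the running cohort revenue divided by cohort size. Months in which a cohort has no revenue are simply omitted here. All names and values are illustrative.

```python
from collections import defaultdict


def month_index(ym):
    """Convert 'YYYY-MM' to a month counter so months can be subtracted."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month)


def cohort_ltv(purchases):
    """purchases: list of (customer_id, 'YYYY-MM', revenue)."""
    first = {}  # customer -> first order month
    for customer, ym, _ in purchases:
        if customer not in first or month_index(ym) < month_index(first[customer]):
            first[customer] = ym

    cohort_size = defaultdict(int)
    for ym in first.values():
        cohort_size[ym] += 1

    revenue = defaultdict(float)  # (cohort, months since start) -> revenue
    for customer, ym, amount in purchases:
        cohort = first[customer]
        revenue[(cohort, month_index(ym) - month_index(cohort))] += amount

    rows, running = [], defaultdict(float)
    for cohort, age in sorted(revenue, key=lambda k: (month_index(k[0]), k[1])):
        running[cohort] += revenue[(cohort, age)]
        rows.append({
            "first_order_month": cohort,
            "months_since_cohort_start": age,
            "cohort_size": cohort_size[cohort],
            "monthly_revenue_per_cohort": revenue[(cohort, age)],
            "cumulative_ltv_per_customer": running[cohort] / cohort_size[cohort],
        })
    return rows


# Made-up purchases
purchases = [
    ("c1", "2018-01", 100.0),
    ("c2", "2018-01", 50.0),
    ("c1", "2018-02", 30.0),
    ("c3", "2018-02", 80.0),
    ("c2", "2018-03", 20.0),
]
rows = cohort_ltv(purchases)
for row in rows:
    print(row)
```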


In [None]:
thread.ask(
    """
    Build customer cohorts by first_order_month.
    For each cohort across subsequent months, compute:
      - monthly_revenue_per_cohort
      - cumulative_LTV_per_customer (revenue divided by cohort size)
      - cohort_size
      - months_since_cohort_start
    """
)

In [None]:
df_cohort = thread.df()
df_cohort

In [None]:
thread.plot('Line chart of cumulative LTV by months since cohort start, one line per cohort in a different color')

In [None]:
print("SQL for Cohort Analysis:\n", thread.code())

### 6) Delivery Performance & Logistics Efficiency
Goal: Analyze seller_state → customer_state lanes with sufficient volume; compute average and median delivery days, orders per lane; visualize as heatmap and ranked bar chart.
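
The lane aggregation with a minimum-volume filter can be sketched as follows. It assumes one tuple per delivered order and uses placeholder state codes. The threshold is a parameter so the toy data can use a small value, while the prompt in this notebook asks for at least 20 delivered orders.

```python
from collections import defaultdict
from statistics import mean, median


def lane_stats(deliveries, min_orders=20):
    """deliveries: iterable of (seller_state, customer_state, delivery_days)."""
    by_lane = defaultdict(list)
    for seller_state, customer_state, days in deliveries:
        by_lane[(seller_state, customer_state)].append(days)
    return {
        lane: {
            "orders_count": len(days),
            "avg_delivery_days": mean(days),
            "median_delivery_days": median(days),
        }
        for lane, days in by_lane.items()
        if len(days) >= min_orders  # drop low-volume lanes
    }


# Made-up deliveries with placeholder state codes
deliveries = [("AA", "BB", d) for d in (2, 4, 9)] + [("AA", "CC", 7)]
stats = lane_stats(deliveries, min_orders=3)
print(stats)
```

Reporting the median next to the average helps spot lanes where a few very slow deliveries pull the mean up.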


In [None]:
thread.ask(
    """
    For seller_state → customer_state lanes with at least 20 delivered orders:
      - Compute avg_delivery_days, median_delivery_days, orders_count.
    """
)

In [None]:
df_lanes = thread.df()
df_lanes

In [None]:
thread.plot('heatmap of avg_delivery_days by seller-customer state pair')

In [None]:
print("SQL for Lanes Analysis:\n", thread.code())

### 7) Correlation & Efficiency Analysis
Goal: Explore relationships among cost, delivery time, satisfaction, and revenue. Deliverables: correlation matrix and scatter plots with trend lines.
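
As a reference for reading the correlation matrix, here is a minimal Pearson correlation written in plain Python. It assumes the Pearson coefficient is the measure of interest, though the generated analysis could use another one. The sample values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Made-up values: longer deliveries paired with lower review scores
delivery_days = [2, 4, 6, 8]
review_score = [5, 4, 3, 2]
r = pearson(delivery_days, review_score)
print(round(r, 3))
```

A value near -1 or 1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. Correlation alone does not show that one variable causes the other.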


In [None]:
thread.ask(
    """
    Construct an order-level analysis with the following numeric fields:
      - revenue_per_order
      - total_freight
      - delivery_days
      - review_score (satisfaction)
    Compute a correlation matrix for selected pairs:
      - delivery_days vs review_score
      - total_freight vs revenue_per_order
      - delivery_days vs revenue_per_order
    Write a short summary of the results and explain it in simple words.
    """
)

In [None]:
df_corr = thread.df()
df_corr

In [None]:
thread.plot()

In [None]:
print(thread.text())

In [None]:
print("SQL for Correlation Analysis:\n", thread.code())


### 8) Performance Comparison & Insight Generation
Goal: Rank top and bottom performers and generate narrative insights/recommendations suitable for reporting.
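
Ranking top and bottom performers on a single metric can be sketched like this. It assumes a higher value is better and uses hypothetical seller IDs with made-up scores.

```python
def top_bottom(metric_by_entity, k=2):
    """Return (top k best-first, bottom k worst-first) as (entity, value) pairs."""
    ranked = sorted(metric_by_entity.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[-k:][::-1]


# Made-up average review scores per seller
avg_review_score = {"s1": 4.8, "s2": 3.1, "s3": 4.2, "s4": 2.5}
top, bottom = top_bottom(avg_review_score, k=2)
print("top:", top)
print("bottom:", bottom)
```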


In [None]:
thread.ask(
    """
    Identify performance by category and by seller:
      - Rank top/bottom performers on revenue growth, AOV, and average_review_score.
      - Provide a summary table with ranks and key metrics.
      - Generate narrative insights and brief recommendations (bulleted) suitable for a report.
    """
)

In [None]:
df_perf = thread.df()
df_perf

In [None]:
thread.plot()


In [None]:
print("SQL for Performance Comparison:\n", thread.code())


In [None]:
print("\nNarrative insights and recommendations:\n")
print(thread.text())


### Wrap up

- All figures and tables above are generated on demand by Databao using SQL against DuckDB, guided by dbt context.
- Re-run individual cells if you tweak prompts.
- You can start a fresh analysis with a new `session.thread()` for isolation.


In [None]:
# Close the database connection
conn.close()
print("Database connection closed successfully!")
