<a href="https://colab.research.google.com/github/elebon26/Unit1_TheLook_Team9/blob/main/individual/Unit1_Ethan_DIVE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [63]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
project_id = "mgmt-467-1234"   # <- your project
client = bigquery.Client(project=project_id)

import pandas as pd
pd.options.display.float_format = "{:,.4f}".format


In [64]:
import pandas as pd
from google.cloud import bigquery
import os

# 1) 🔍 Discover


## Prompts

**Provided:**

Identify the top 3 growth KPIs for the business (e.g., 90-day revenue trend, repeat purchase rate, average order value).
Use CTEs and window functions to compute trends and MoM/YoY growth for at least one KPI.

**Adjusted:**

Using bigquery-public-data.thelook_ecommerce, and given the tables orders, order_items, products, customers, and optional sessions/attribution, propose the three growth KPIs most diagnostic for a DTC business. For each KPI, define the numerator and denominator precisely, state the time grain (daily/weekly/monthly), and note edge cases (refunds, zero-priced orders, outliers). Recommend one KPI for a 90-day rolling trend and specify the window functions and CTEs needed to compute MoM and YoY deltas.

# KPI 1: 90-Day Rolling Revenue + MoM/YoY

In [65]:
sql_rev = """
-- KPI 1: 90-day rolling revenue with MoM/YoY growth
-- Revenue = SUM(order_items.sale_price) for orders with status = 'Complete'
WITH items AS (
  SELECT
    DATE(o.created_at) AS dt,
    SAFE_CAST(oi.sale_price AS NUMERIC) AS sale_price
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders` o
    USING (order_id)
  WHERE o.status = 'Complete'
),
daily AS (
  SELECT
    dt,
    SUM(sale_price) AS net_revenue
  FROM items
  GROUP BY dt
),
roll90 AS (
  SELECT
    dt,
    net_revenue,
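    -- ROWS counts rows, not days: this is a true 90-day window only if
    -- `daily` has one row per calendar date (no missing days).
    SUM(net_revenue) OVER (
      ORDER BY dt
      ROWS BETWEEN 89 PRECEDING AND CURRENT ROW
    ) AS rev_rolling_90d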
  FROM daily
),
moyoy AS (
  SELECT
    dt,
    net_revenue,
    rev_rolling_90d,
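    -- LAG offsets also count rows: 30 and 365 rows back stand in for
    -- "one month ago" and "one year ago" when there is one row per day.
    LAG(rev_rolling_90d, 30)  OVER (ORDER BY dt) AS rev_rolling_90d_30d_ago,
    LAG(rev_rolling_90d, 365) OVER (ORDER BY dt) AS rev_rolling_90d_365d_ago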
  FROM roll90
)
SELECT
  dt,
  net_revenue,
  rev_rolling_90d,
  SAFE_DIVIDE(rev_rolling_90d - rev_rolling_90d_30d_ago,  rev_rolling_90d_30d_ago)  AS mom_growth,
  SAFE_DIVIDE(rev_rolling_90d - rev_rolling_90d_365d_ago, rev_rolling_90d_365d_ago) AS yoy_growth
FROM moyoy
ORDER BY dt;
"""
df_rev = client.query(sql_rev).result().to_dataframe()
df_rev.tail(10)


| | dt | net_revenue | rev_rolling_90d | mom_growth | yoy_growth |
|---|---|---|---|---|---|
| 2289 | 2025-10-11 | 5725.389998909 | 391202.890326535 | 0.159394013 | 1.344511073 |
| 2290 | 2025-10-12 | 9468.030006399 | 397566.020336177 | 0.170827287 | 1.386428013 |
| 2291 | 2025-10-13 | 7710.05001687 | 398941.19032123 | 0.174136344 | 1.391045111 |
| 2292 | 2025-10-14 | 7755.75000691 | 401868.290323758 | 0.183712266 | 1.408591506 |
| 2293 | 2025-10-15 | 11443.870019428 | 410154.080349942 | 0.202614111 | 1.447290431 |
| 2294 | 2025-10-16 | 12801.210034117 | 419648.070371397 | 0.224396873 | 1.506402332 |
| 2295 | 2025-10-17 | 19934.430012683 | 437282.550384987 | 0.267972297 | 1.608343323 |
| 2296 | 2025-10-18 | 33575.800000878 | 467110.65038129 | 0.348768707 | 1.769324448 |
| 2297 | 2025-10-19 | 15466.460035555 | 478556.980422455 | 0.375481243 | 1.848283885 |
| 2298 | 2025-10-20 | 14100.080017076 | 490039.220444148 | 0.403460313 | 1.907540609 |
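
The rolling window in the query above counts rows rather than calendar days, so it matches a 90-day window only when every date has a row. The sketch below uses plain Python with made-up toy numbers and hypothetical helper names (`rolling_sum_rows`, `rolling_sum_calendar`, `growth`) to illustrate how the two approaches diverge when a day is missing, and how a SAFE_DIVIDE-style growth rate behaves. It is an illustration under those assumptions, not the notebook's method.

```python
from datetime import date, timedelta

# Toy daily revenue (hypothetical numbers); note that 2025-01-03 has no row.
daily = {
    date(2025, 1, 1): 10.0,
    date(2025, 1, 2): 20.0,
    date(2025, 1, 4): 40.0,
}

def rolling_sum_rows(daily, n_rows):
    """Sum over the current row and the n_rows - 1 preceding rows (like ROWS BETWEEN)."""
    dates = sorted(daily)
    return {
        d: sum(daily[k] for k in dates[max(0, i - n_rows + 1): i + 1])
        for i, d in enumerate(dates)
    }

def rolling_sum_calendar(daily, n_days):
    """Sum over the trailing n_days calendar days, treating missing days as zero."""
    out = {}
    for d in sorted(daily):
        start = d - timedelta(days=n_days - 1)
        out[d] = sum(v for k, v in daily.items() if start <= k <= d)
    return out

def growth(current, previous):
    """Relative change; None when the base is missing or zero (similar to SAFE_DIVIDE)."""
    if previous is None or previous == 0:
        return None
    return (current - previous) / previous

by_rows = rolling_sum_rows(daily, 3)
by_days = rolling_sum_calendar(daily, 3)
print(by_rows[date(2025, 1, 4)])  # 70.0: reaches back to Jan 1 because Jan 3 is missing
print(by_days[date(2025, 1, 4)])  # 60.0: covers Jan 2 through Jan 4 only
print(growth(by_days[date(2025, 1, 4)], by_days[date(2025, 1, 2)]))  # 1.0
```

If the daily table ever has gaps, one option is to join it to a generated calendar of dates before applying the row-based window, so that both definitions agree.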


# KPI 2: Repeat Purchase Rate (Daily)

In [66]:
sql_rpr = """
-- RPR = % of orders from customers who have made >= 2 lifetime orders up to that order
WITH cust_orders AS (
  SELECT
    user_id,
    order_id,
    DATE(created_at) AS dt,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS nth_order
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
daily_flags AS (
  SELECT
    dt,
    SUM(CASE WHEN nth_order >= 2 THEN 1 ELSE 0 END) AS repeat_orders,
    COUNT(*) AS total_orders
  FROM cust_orders
  GROUP BY dt
)
SELECT
  dt,
  SAFE_DIVIDE(repeat_orders, total_orders) AS repeat_purchase_rate
FROM daily_flags
ORDER BY dt;
"""
df_rpr = client.query(sql_rpr).result().to_dataframe()
df_rpr.tail(10)


| | dt | repeat_purchase_rate |
|---|---|---|
| 2289 | 2025-10-11 | 0.1268 |
| 2290 | 2025-10-12 | 0.1705 |
| 2291 | 2025-10-13 | 0.1585 |
| 2292 | 2025-10-14 | 0.2381 |
| 2293 | 2025-10-15 | 0.1727 |
| 2294 | 2025-10-16 | 0.1026 |
| 2295 | 2025-10-17 | 0.1598 |
| 2296 | 2025-10-18 | 0.1872 |
| 2297 | 2025-10-19 | 0.1 |
| 2298 | 2025-10-20 | 0.106 |
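
To make the repeat purchase rate definition concrete, here is a minimal plain-Python sketch of the same idea: number each customer's orders in sequence, then take the share of each day's orders that are a second or later order. The toy orders and the function name `daily_repeat_purchase_rate` are hypothetical, and ties within a day are broken by `order_id` as an assumption (the SQL orders by `created_at`).

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy orders: (user_id, order_id, order_date), all assumed 'Complete'.
orders = [
    (1, 101, date(2025, 1, 1)),
    (2, 102, date(2025, 1, 1)),
    (1, 103, date(2025, 1, 2)),  # user 1's second order, so it counts as a repeat
    (3, 104, date(2025, 1, 2)),
]

def daily_repeat_purchase_rate(orders):
    """Share of each day's orders that are a customer's 2nd or later order."""
    nth_order = defaultdict(int)  # running order count per user (like ROW_NUMBER)
    repeat = defaultdict(int)
    total = defaultdict(int)
    # Assumption: order_id breaks ties within a day; the SQL orders by created_at.
    for user_id, order_id, dt in sorted(orders, key=lambda o: (o[2], o[1])):
        nth_order[user_id] += 1
        total[dt] += 1
        if nth_order[user_id] >= 2:
            repeat[dt] += 1
    return {dt: repeat[dt] / total[dt] for dt in total}

rpr = daily_repeat_purchase_rate(orders)
print(rpr[date(2025, 1, 1)])  # 0.0
print(rpr[date(2025, 1, 2)])  # 0.5
```

Because the denominator is a single day's orders, daily values can swing widely on low-volume days; a weekly or monthly grain would smooth that out.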


# KPI 3: AOV (Daily)

In [67]:
sql_aov = """
WITH revenue_by_day AS (
  SELECT
    DATE(o.created_at) AS dt,
    SUM(SAFE_CAST(oi.sale_price AS NUMERIC)) AS revenue
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders` o
    USING (order_id)
  WHERE o.status = 'Complete'
  GROUP BY dt
),
orders_by_day AS (
  SELECT
    DATE(created_at) AS dt,
    COUNT(DISTINCT order_id) AS orders
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
  GROUP BY dt
)
SELECT
  r.dt,
  SAFE_DIVIDE(r.revenue, o.orders) AS aov
FROM revenue_by_day r
JOIN orders_by_day o USING (dt)
ORDER BY r.dt;
"""
df_aov = client.query(sql_aov).result().to_dataframe()
df_aov.tail(10)


| | dt | aov |
|---|---|---|
| 2289 | 2025-10-11 | 80.639295759 |
| 2290 | 2025-10-12 | 107.591250073 |
| 2291 | 2025-10-13 | 94.025000206 |
| 2292 | 2025-10-14 | 92.330357225 |
| 2293 | 2025-10-15 | 104.035181995 |
| 2294 | 2025-10-16 | 82.05903868 |
| 2295 | 2025-10-17 | 81.698483659 |
| 2296 | 2025-10-18 | 89.774866313 |
| 2297 | 2025-10-19 | 96.665375222 |
| 2298 | 2025-10-20 | 93.378013358 |
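
The AOV query divides item-level revenue by the number of distinct orders, which is not the same as averaging item prices. A small sketch with hypothetical toy data shows the difference; the variable names are illustrative only.

```python
# Hypothetical order items: (order_id, sale_price).
order_items = [
    (1, 50.0),
    (1, 30.0),
    (2, 100.0),
]

revenue = sum(price for _, price in order_items)
n_orders = len({order_id for order_id, _ in order_items})
n_items = len(order_items)

aov = revenue / n_orders            # revenue per distinct order
avg_item_price = revenue / n_items  # revenue per line item

print(aov)             # 90.0
print(avg_item_price)  # 60.0
```

Counting distinct orders matters because the join to `order_items` yields one row per item, so a plain row count would understate AOV for multi-item orders.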


# 2) 🗂 Investigate:

## Prompts
**Provided:**

Dive into one product category and one customer segment.

Use AI-assisted SQL to explore drivers (discounts, marketing channel if available, region, device).

**Adjusted:**

Perform a focused analysis on a single product category and customer segment using the bigquery-public-data.thelook_ecommerce dataset.

Use SQL to explore key revenue and AOV drivers such as discount level (derived from sale_price vs. retail_price), marketing channel (users.traffic_source), region (users.country), and, if available, device type.

Return a tidy, analysis-ready table with columns [date, metric, value, category, segment, driver, group], including only completed orders (orders.status = 'Complete'). Ensure all metrics are consistently filtered and aggregated for business insight.

In [68]:
sql_peek = """
SELECT category, SUM(oi.sale_price) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
JOIN `bigquery-public-data.thelook_ecommerce.orders`   o USING(order_id)
WHERE o.status = 'Complete'
GROUP BY category
ORDER BY revenue DESC
LIMIT 25;
"""
peek_categories = client.query(sql_peek).result().to_dataframe()
peek_categories

| | category | revenue |
|---|---|---|
| 0 | Outerwear & Coats | 333409.9098 |
| 1 | Jeans | 324555.5304 |
| 2 | Sweaters | 206136.5301 |
| 3 | Suits & Sport Coats | 168242.5098 |
| 4 | Swim | 165391.5702 |
| 5 | Fashion Hoodies & Sweatshirts | 165011.4 |
| 6 | Sleep & Lounge | 137316.2603 |
| 7 | Shorts | 133337.2104 |
| 8 | Tops & Tees | 119974.7105 |
| 9 | Dresses | 115956.7703 |


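Start with a quick peek at revenue by category to decide which product category to analyze. Outerwear & Coats leads, so it is used below together with one customer segment (gender = 'F').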

In [69]:
sql_investigate = """
/* Investigate: one category × one segment (gender)
   Drivers: discount_band, marketing_channel (users.traffic_source), region (users.country)
   Output: [dt, metric, value, category, segment, driver, grp]
*/
WITH base AS (
  SELECT
    DATE(o.created_at)                    AS dt,
    o.order_id,
    u.id                                  AS user_id,
    u.gender                              AS segment,
    u.traffic_source                      AS marketing_channel,
    u.country                             AS country,
    p.category                            AS category,
    SAFE_CAST(oi.sale_price AS NUMERIC)   AS sale_price,
    SAFE_CAST(p.retail_price AS NUMERIC)  AS retail_price
  FROM `bigquery-public-data.thelook_ecommerce.order_items`  AS oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders`       AS o
    ON oi.order_id = o.order_id
  JOIN `bigquery-public-data.thelook_ecommerce.products`     AS p
    ON oi.product_id = p.id
  JOIN `bigquery-public-data.thelook_ecommerce.users`        AS u
    ON o.user_id = u.id
  WHERE o.status = 'Complete'
    AND p.category = @category
    AND u.gender   = @segment
),
per_item AS (
  SELECT
    dt, order_id, user_id, segment, marketing_channel, country, category,
    sale_price,
    GREATEST(0, LEAST(0.9, 1 - SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)))) AS discount_rate
  FROM base
),
per_order AS (
  SELECT
    dt, order_id, user_id, segment, marketing_channel, country, category,
    SUM(sale_price) AS net_revenue,
    AVG(discount_rate) AS discount_rate
  FROM per_item
  GROUP BY dt, order_id, user_id, segment, marketing_channel, country, category
),
banded AS (
  SELECT
    dt, order_id, user_id, segment, marketing_channel, country, category, net_revenue,
    CASE
      WHEN discount_rate <= 0.00 THEN '0%'
      WHEN discount_rate <  0.10 THEN '0-10%'
      WHEN discount_rate <  0.25 THEN '10-25%'
      ELSE '25%+'
    END AS discount_band
  FROM per_order
),
agg AS (
  SELECT
    dt,
    category,
    segment,
    marketing_channel,
    country,
    discount_band,
    SUM(net_revenue) AS revenue,
    COUNT(DISTINCT order_id) AS orders,
    SAFE_DIVIDE(SUM(net_revenue), COUNT(DISTINCT order_id)) AS aov
  FROM banded
  GROUP BY dt, category, segment, marketing_channel, country, discount_band
)
-- Tidy long format (notice alias `grp` instead of reserved word `group`)
SELECT dt, 'revenue' AS metric, revenue AS value, category, segment, 'discount_band' AS driver, discount_band AS grp FROM agg
UNION ALL
SELECT dt, 'orders',  orders,              category, segment, 'discount_band',      discount_band FROM agg
UNION ALL
SELECT dt, 'aov',     aov,                 category, segment, 'discount_band',      discount_band FROM agg
UNION ALL
SELECT dt, 'revenue', revenue,             category, segment, 'marketing_channel',  marketing_channel FROM agg
UNION ALL
SELECT dt, 'revenue', revenue,             category, segment, 'region',             country FROM agg
ORDER BY dt, metric, driver, grp;
"""

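# Category and segment to analyze: the top-revenue category from the peek
# above and the female ('F') gender segment.
CATEGORY_TO_USE = "Outerwear & Coats"
SEGMENT_TO_USE = "F"

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("category", "STRING", CATEGORY_TO_USE),
        bigquery.ScalarQueryParameter("segment",  "STRING", SEGMENT_TO_USE),
    ]
)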

df_drivers = client.query(sql_investigate, job_config=job_config).result().to_dataframe()
df_drivers.head(10)


| | dt | metric | value | category | segment | driver | grp |
|---|---|---|---|---|---|---|---|
| 0 | 2019-03-31 | aov | 124.989997864 | Outerwear & Coats | F | discount_band | 0% |
| 1 | 2019-03-31 | orders | 1.0 | Outerwear & Coats | F | discount_band | 0% |
| 2 | 2019-03-31 | revenue | 124.989997864 | Outerwear & Coats | F | discount_band | 0% |
| 3 | 2019-03-31 | revenue | 124.989997864 | Outerwear & Coats | F | marketing_channel | Email |
| 4 | 2019-03-31 | revenue | 124.989997864 | Outerwear & Coats | F | region | China |
| 5 | 2019-06-05 | aov | 125.0 | Outerwear & Coats | F | discount_band | 0% |
| 6 | 2019-06-05 | orders | 1.0 | Outerwear & Coats | F | discount_band | 0% |
| 7 | 2019-06-05 | revenue | 125.0 | Outerwear & Coats | F | discount_band | 0% |
| 8 | 2019-06-05 | revenue | 125.0 | Outerwear & Coats | F | marketing_channel | Facebook |
| 9 | 2019-06-05 | revenue | 125.0 | Outerwear & Coats | F | region | China |
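
The discount driver in this query is derived from `sale_price` versus `retail_price` and then bucketed. The sketch below mirrors that logic in plain Python so the thresholds and edge cases are easy to inspect. The function names are hypothetical, and the handling of a missing rate reflects my reading of how a NULL falls through the SQL `CASE` to its `ELSE` branch; treat it as an assumption to verify.

```python
def discount_rate(sale_price, retail_price):
    """1 - sale/retail, clamped to [0, 0.9]; None when retail price is missing or zero."""
    if sale_price is None or not retail_price:
        return None
    return max(0.0, min(0.9, 1 - sale_price / retail_price))

def discount_band(rate):
    """Same thresholds as the CASE expression in the `banded` CTE."""
    if rate is None:
        # Assumption: in SQL a NULL rate fails every WHEN test and lands in ELSE.
        return "25%+"
    if rate <= 0.0:
        return "0%"
    if rate < 0.10:
        return "0-10%"
    if rate < 0.25:
        return "10-25%"
    return "25%+"

print(discount_band(discount_rate(100.0, 100.0)))  # 0%
print(discount_band(discount_rate(95.0, 100.0)))   # 0-10%
print(discount_band(discount_rate(80.0, 100.0)))   # 10-25%
print(discount_band(discount_rate(50.0, 100.0)))   # 25%+
print(discount_band(discount_rate(120.0, 100.0)))  # 0% (negative discount clamped to 0)
```

If missing retail prices are possible, an explicit 'Unknown' band may be safer than letting them default into the deepest discount bucket.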


# 3) ✅ Validate:

## Prompts
**Provided:**

Cross-check at least two AI-generated insights with alternative queries or counterexamples.

Show at least one case where the first answer was misleading and how you corrected it.

**Adjusted:**

Re-evaluate earlier insights by testing their robustness using alternative SQL checks. Hold one driver constant at a time (e.g., channel or region) to detect hidden bias, run pre-post comparisons around promo periods, and highlight one misleading initial finding with its corrected interpretation.

**Validate 1 – “Higher discounts drive revenue”**

The AI initially concluded that larger discounts increased total revenue. However, the dataset hinted that this relationship might be affected by marketing channel distribution. To verify, I compared revenue by channel while holding discounts constant at 0%.

In [70]:
client = bigquery.Client(project="mgmt-467-1234")
print("Connected:", client.project)

query = """
WITH base AS (
  SELECT
    u.traffic_source AS marketing_channel,
    SAFE_CAST(oi.sale_price AS NUMERIC) AS sale_price,
    SAFE_CAST(p.retail_price AS NUMERIC) AS retail_price,
    p.category AS category,
    u.gender AS segment
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
  JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
  JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
  WHERE o.status = 'Complete'
    AND p.category = 'Outerwear & Coats'
    AND u.gender = 'F'
),
banded AS (
  SELECT
    marketing_channel,
    CASE
      WHEN SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)) >= 0.90 THEN '0%'
      WHEN SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)) >= 0.75 THEN '10–25%'
      ELSE '25%+'
    END AS discount_band,
    SUM(sale_price) AS revenue
  FROM base
  GROUP BY marketing_channel, discount_band
)
SELECT marketing_channel, discount_band, revenue
FROM banded
ORDER BY marketing_channel, discount_band;
"""
df_validate1 = client.query(query).result().to_dataframe()
df_validate1

Connected: mgmt-467-1234


| | marketing_channel | discount_band | revenue |
|---|---|---|---|
| 0 | Display | 0% | 4392.449977875 |
| 1 | Email | 0% | 6940.530019759 |
| 2 | Facebook | 0% | 6599.859991075 |
| 3 | Organic | 0% | 15677.670051566 |
| 4 | Search | 0% | 93754.019960385 |

**Interpretation:**

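Every row in the result falls in the query's 0% band (sale price at least 90% of retail price), so this slice contains no meaningfully discounted sales to compare against. Within that single band, revenue still varies widely by channel: Search (about $93.8K) and Organic (about $15.7K) lead, far ahead of Email, Facebook, and Display (each under $7K). This contradicts the AI's initial finding: revenue here is not driven by deeper discounts, but by high-performing channels that required no discounting.
Correction: The original insight was misleading because it attributed channel-level differences to discounting. With the discount level held constant, revenue still differs sharply by channel, so discounts cannot explain the pattern in this slice.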
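
To quantify the channel concentration, revenue shares can be computed directly from the result table. The sketch below hard-codes the revenue figures from the Validate 1 output (rounded to cents) and covers only the Outerwear & Coats / female slice; the variable names are illustrative.

```python
# Revenue by channel, rounded from the Validate 1 output.
revenue_by_channel = {
    "Display": 4392.45,
    "Email": 6940.53,
    "Facebook": 6599.86,
    "Organic": 15677.67,
    "Search": 93754.02,
}

total = sum(revenue_by_channel.values())
shares = {ch: rev / total for ch, rev in revenue_by_channel.items()}

# Print channels from largest to smallest share.
for ch, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ch:<9}{share:6.1%}")

search_plus_organic = shares["Search"] + shares["Organic"]
print(f"Search + Organic: {search_plus_organic:.1%}")
```

On these figures, Search and Organic together account for roughly 86% of revenue in the slice, while Email, Facebook, and Display are each below 10%.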

**Validate 2 – “Returning customers spend less per order”**

The AI also hypothesized that returning customers tend to spend less than new customers. To validate this, I segmented orders by customer type and compared Average Order Value (AOV) between the two groups.

In [71]:
query = """
WITH orders_base AS (
  SELECT
    o.order_id,
    o.user_id,
    DATE(o.created_at) AS dt,
    SAFE_CAST(oi.sale_price AS NUMERIC) AS sale_price,
    SAFE_CAST(p.retail_price AS NUMERIC) AS retail_price,
    p.category AS category,
    u.gender AS segment
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
  JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
  JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
  WHERE o.status = 'Complete'
    AND p.category = 'Outerwear & Coats'
    AND u.gender = 'F'
),
user_first_order AS (
  SELECT user_id, MIN(dt) AS first_order_date
  FROM orders_base
  GROUP BY user_id
),
flagged AS (
  SELECT
    o.order_id,
    o.user_id,
    o.dt,
    CASE WHEN o.dt = f.first_order_date THEN 'New' ELSE 'Returning' END AS customer_type,
    SUM(o.sale_price) AS order_value
  FROM orders_base o
  JOIN user_first_order f USING (user_id)
  GROUP BY o.order_id, o.user_id, o.dt, customer_type
)
SELECT
  customer_type,
  COUNT(DISTINCT order_id) AS orders,
  SUM(order_value) AS total_revenue,
  SAFE_DIVIDE(SUM(order_value), COUNT(DISTINCT order_id)) AS avg_order_value
FROM flagged
GROUP BY customer_type
ORDER BY avg_order_value DESC;
"""

df_validate2 = client.query(query).result().to_dataframe()
df_validate2

| | customer_type | orders | total_revenue | avg_order_value |
|---|---|---|---|---|
| 0 | New | 842 | 126683.940000507 | 150.455985749 |
| 1 | Returning | 7 | 680.590000153 | 97.227142879 |


**Interpretation:**

The validation is directionally consistent with the AI's insight: returning customers spend less per order than new customers (AOV ≈ $97 vs. $150). However, the sample sizes are extremely unbalanced (842 vs. 7 orders), so the gap may reflect limited data on repeat buyers rather than a true behavioral difference. Note also that "New" here means an order placed on the customer's first Outerwear & Coats order date within this filtered slice, not their first order overall.
Thus, while the AI's conclusion matches the observed averages, it is incomplete without accounting for the imbalance in customer frequency.
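
How a customer gets labeled "New" or "Returning" depends on which orders are used to find the first order date. The query computes it from the category-filtered rows, so a customer who previously bought in another category still counts as new. The sketch below uses hypothetical toy orders for a single user and a hypothetical helper `flag_orders` to show how the labels change under a lifetime definition.

```python
from datetime import date

# Hypothetical orders for one user: (order_id, category, order_date).
orders = [
    (1, "Jeans", date(2024, 1, 5)),
    (2, "Outerwear & Coats", date(2024, 3, 1)),
    (3, "Outerwear & Coats", date(2024, 6, 1)),
]

def flag_orders(orders, first_order_scope):
    """Label each order 'New' if placed on the first order date within the scope."""
    first_date = min(dt for _, _, dt in first_order_scope)
    return {oid: ("New" if dt == first_date else "Returning") for oid, _, dt in orders}

slice_orders = [o for o in orders if o[1] == "Outerwear & Coats"]

# First order taken from the filtered slice (as in the query).
within_slice = flag_orders(slice_orders, first_order_scope=slice_orders)
# First order taken from the user's full history (an alternative definition).
lifetime = flag_orders(slice_orders, first_order_scope=orders)

print(within_slice)  # {2: 'New', 3: 'Returning'}
print(lifetime)      # {2: 'Returning', 3: 'Returning'}
```

Re-running the comparison with a lifetime first-order date would likely move some orders from "New" to "Returning" and give the returning group a larger sample.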

# 4) 👐 Extend

## Prompts

Build one interactive Plotly chart in Colab with:

Scorecard: revenue (or profit), last 30 days

Pie/Donut: sales % by region or channel

Bar: top 5 products/categories

Write 1–2 specific recommendations using the Strategist pattern

In [72]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
pd.options.display.float_format = "{:,.2f}".format

**1) Scorecard — Revenue (last 30 days) with delta vs prior 30**

In [73]:
sql_scorecard = """
WITH maxd AS (
  SELECT DATE(MAX(created_at)) AS max_dt
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
win AS (
  SELECT
    DATE_SUB(max_dt, INTERVAL 29 DAY) AS cur_start,
    max_dt                            AS cur_end,
    DATE_SUB(max_dt, INTERVAL 59 DAY) AS prev_start,
    DATE_SUB(max_dt, INTERVAL 30 DAY) AS prev_end
  FROM maxd
),
base AS (
  SELECT DATE(o.created_at) AS dt, SAFE_CAST(oi.sale_price AS NUMERIC) AS sale_price
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
  WHERE o.status = 'Complete'
)
SELECT
  SUM(CASE WHEN dt BETWEEN (SELECT cur_start  FROM win) AND (SELECT cur_end  FROM win) THEN sale_price ELSE 0 END) AS rev_30d,
  SUM(CASE WHEN dt BETWEEN (SELECT prev_start FROM win) AND (SELECT prev_end FROM win) THEN sale_price ELSE 0 END) AS rev_prev_30d
FROM base;
"""
df_score = client.query(sql_scorecard).result().to_dataframe()
rev_30d = float(df_score.loc[0, "rev_30d"] or 0)
rev_prev = float(df_score.loc[0, "rev_prev_30d"] or 0)

fig_score = go.Figure(go.Indicator(
    mode="number+delta",
    value=rev_30d,
    delta={"reference": rev_prev, "relative": True, "valueformat": ".2%"},
    number={"valueformat": ",.0f"},
    title={"text": "Revenue — Last 30 Days"}
))
fig_score.update_layout(height=220)
fig_score.show()
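
The `win` CTE defines two adjacent, inclusive 30-day windows relative to the latest completed order date. The plain-Python sketch below reproduces that date arithmetic and the relative change shown by the indicator's delta. The helper names and the example date and revenue values are hypothetical.

```python
from datetime import date, timedelta

def thirty_day_windows(max_dt):
    """Return (cur_start, cur_end, prev_start, prev_end) as inclusive date bounds."""
    cur_start = max_dt - timedelta(days=29)
    prev_end = max_dt - timedelta(days=30)
    prev_start = max_dt - timedelta(days=59)
    return cur_start, max_dt, prev_start, prev_end

def relative_delta(current, reference):
    """Relative change vs. the reference; None if the reference is zero."""
    return None if reference == 0 else (current - reference) / reference

cur_start, cur_end, prev_start, prev_end = thirty_day_windows(date(2025, 10, 20))
print(cur_start, cur_end)    # 2025-09-21 2025-10-20
print(prev_start, prev_end)  # 2025-08-22 2025-09-20
print((cur_end - cur_start).days + 1, (prev_end - prev_start).days + 1)  # 30 30
print(relative_delta(120.0, 100.0))  # 0.2
```

Because `BETWEEN` is inclusive on both ends, subtracting 29 and 59 days (rather than 30 and 60) keeps each window at exactly 30 days with no overlap.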


**2) Donut — Sales % by channel (traffic_source) in last 30 days and Sales % by region (country)**

In [74]:
sql_channel = """
WITH maxd AS (
  SELECT DATE(MAX(created_at)) AS max_dt
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
win AS (
  SELECT DATE_SUB(max_dt, INTERVAL 29 DAY) AS start_dt, max_dt AS end_dt FROM maxd
)
SELECT
  COALESCE(u.traffic_source, 'Unknown') AS channel,
  SUM(SAFE_CAST(oi.sale_price AS NUMERIC)) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.orders`  o ON oi.order_id = o.order_id
JOIN `bigquery-public-data.thelook_ecommerce.users`   u ON o.user_id = u.id
WHERE o.status = 'Complete'
  AND DATE(o.created_at) BETWEEN (SELECT start_dt FROM win) AND (SELECT end_dt FROM win)
GROUP BY channel
ORDER BY revenue DESC;
"""
df_chan = client.query(sql_channel).result().to_dataframe()

fig_chan = px.pie(df_chan, names="channel", values="revenue",
                  title="Sales % by Channel — Last 30 Days", hole=0.5)
fig_chan.update_traces(textposition="inside")
fig_chan.show()

sql_region = """
WITH maxd AS (
  SELECT DATE(MAX(created_at)) AS max_dt
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
win AS (
  SELECT DATE_SUB(max_dt, INTERVAL 29 DAY) AS start_dt, max_dt AS end_dt FROM maxd
)
SELECT
  COALESCE(u.country, 'Unknown') AS region,
  SUM(SAFE_CAST(oi.sale_price AS NUMERIC)) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.orders`  o ON oi.order_id = o.order_id
JOIN `bigquery-public-data.thelook_ecommerce.users`   u ON o.user_id = u.id
WHERE o.status = 'Complete'
  AND DATE(o.created_at) BETWEEN (SELECT start_dt FROM win) AND (SELECT end_dt FROM win)
GROUP BY region
ORDER BY revenue DESC;
"""
df_region = client.query(sql_region).result().to_dataframe()

fig_region = px.pie(df_region, names="region", values="revenue",
                    title="Sales % by Region — Last 30 Days", hole=0.5)
fig_region.update_traces(textposition="inside")
fig_region.show()


**3) Bar — Top-5 categories by revenue (last 30 days) and Top-5 products by revenue (last 30 days)**

In [75]:
sql_top_categories = """
WITH maxd AS (
  SELECT DATE(MAX(created_at)) AS max_dt
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
win AS (
  SELECT DATE_SUB(max_dt, INTERVAL 29 DAY) AS start_dt, max_dt AS end_dt FROM maxd
)
SELECT
  p.category,
  SUM(SAFE_CAST(oi.sale_price AS NUMERIC)) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.orders`   o ON oi.order_id = o.order_id
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE o.status = 'Complete'
  AND DATE(o.created_at) BETWEEN (SELECT start_dt FROM win) AND (SELECT end_dt FROM win)
GROUP BY p.category
ORDER BY revenue DESC
LIMIT 5;
"""
df_topcat = client.query(sql_top_categories).result().to_dataframe()

fig_topcat = px.bar(df_topcat.sort_values("revenue"),
                    x="revenue", y="category", orientation="h",
                    title="Top 5 Categories — Last 30 Days")
fig_topcat.update_layout(yaxis_title="", xaxis_title="Revenue")
fig_topcat.show()

sql_top_products = """
WITH maxd AS (
  SELECT DATE(MAX(created_at)) AS max_dt
  FROM `bigquery-public-data.thelook_ecommerce.orders`
  WHERE status = 'Complete'
),
win AS (
  SELECT DATE_SUB(max_dt, INTERVAL 29 DAY) AS start_dt, max_dt AS end_dt FROM maxd
)
SELECT
  p.name AS product_name,
  SUM(SAFE_CAST(oi.sale_price AS NUMERIC)) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.orders`   o ON oi.order_id = o.order_id
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE o.status = 'Complete'
  AND DATE(o.created_at) BETWEEN (SELECT start_dt FROM win) AND (SELECT end_dt FROM win)
GROUP BY product_name
ORDER BY revenue DESC
LIMIT 5;
"""
df_topprod = client.query(sql_top_products).result().to_dataframe()

fig_topprod = px.bar(df_topprod.sort_values("revenue"),
                     x="revenue", y="product_name", orientation="h",
                     title="Top 5 Products — Last 30 Days")
fig_topprod.update_layout(yaxis_title="", xaxis_title="Revenue")
fig_topprod.show()


# Recommendations

**1. Channel Optimization**

Observation: Search and Organic channels account for roughly 85% of total sales, while Display, Facebook, and Email each contribute less than 10%.

Insight: This concentration shows that growth is driven by high-intent and unpaid traffic, not paid advertising.

Recommendation: Reallocate budget from low-ROI channels (Display, Facebook) toward SEO optimization, product search visibility, and content partnerships that amplify organic reach.

**2. Product and Regional Strategy**

Observation: The Outerwear & Coats and Jeans categories dominate recent revenue, with China, the U.S., and Brazil as leading markets.

Insight: Demand is strong in cold-weather regions and core wardrobe segments.

Recommendation: Expand regional campaigns in these top markets while testing seasonal product bundles (e.g., coats + sweaters) to increase AOV during high-traffic periods.

# DIVE Reflection

After completing the validation phase, my understanding of the business drivers became much sharper. Initially, I assumed that higher discounts were directly increasing revenue, but the validation step revealed that this effect was misleading—Search and Organic channels were actually responsible for most sales even without discounting. This clarified that revenue growth is primarily a result of strong acquisition channels rather than price incentives.

Similarly, while the AI insight that "returning customers spend less" matched the observed averages, validation showed that the comparison rested on a very small returning-customer sample (7 orders vs. 842), so it may not reflect an inherent behavioral trend. This distinction emphasized the importance of checking data balance before generalizing conclusions.

Overall, the validation process shifted my focus from surface-level correlations to more controlled, evidence-based insights. It underscored the value of testing AI-driven results against alternative cuts of the data to uncover bias, improve reliability, and make recommendations that better align with real business performance patterns.