<a href="https://colab.research.google.com/github/garci843/Unit1_TheLook_Team1/blob/main/Unit1_Ethan_Garcia_DIVE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 1 — AI‑Assisted Exploratory Data Analysis & BI Dashboard**
MGMT 467 · Fall 2025  

**Team Name:** Team #1  
**Members (GitHub handles):** @garci843 @danielgallagher1 @ahyder32 @aghorpade12  
**GitHub Repo URL:** https://github.com/garci843/mgmt467-analytics-portfolio
**Looker Studio Dashboard (public link):** ☐

> **Scenario:** You’re the analytics team for “The Look,” an online retailer. Leadership wants a fast, cost-effective analytics pack that (a) surfaces growth opportunities and risks and (b) proposes specific, near-term actions.



## ✅ Submission Checklist (Team → Brightspace)
- [✔] GitHub repository link (source of record)
- [✔] Looker Studio dashboard link
- [✔] This notebook committed to GitHub with prompts and results

### ✅ Submission Checklist (Individual → Brightspace)
- [✔] `Contribution_Reflection.pdf` (with commit/PR evidence + peer eval)

## 🎯 Learning Objectives
1. **Discover**:

Identify the top 3 growth KPIs for the business (e.g., 90-day revenue trend, repeat purchase rate, average order value).

Use CTEs and window functions to compute trends and MoM/YoY growth for at least one KPI.

2. **Investigate**:

Deep dive into one product category and one customer segment.

Use AI-assisted SQL to explore drivers (discounts, marketing channel if available, region, device).

3. **Validate**:

Cross-check at least two AI-generated insights with alternative queries or counterexamples.

Show at least one case where the first answer was misleading and how you corrected it.

4. **Extend (Communication)**:

Build one interactive Plotly chart in Colab and one executive Looker Studio dashboard with:

- *Scorecard*: revenue (or profit), last 30 days

- *Pie/Donut*: sales % by region or channel

- *Bar*: top 5 products/categories

Write 1–2 specific recommendations using the Strategist pattern.

## 🧰 Setup
> Run the cells below to connect Colab to Google Cloud & BigQuery.

In [52]:
# Install and import basics (Colab usually has these preinstalled)
# !pip install --quiet google-cloud-bigquery pandas matplotlib

import pandas as pd
import matplotlib.pyplot as plt

# Authenticate to Google from Colab
from google.colab import auth  # type: ignore
auth.authenticate_user()

# Set your GCP project ID
PROJECT_ID = "mgmt467-4889"  # <-- edit this
print("Using project:", PROJECT_ID)

Using project: mgmt467-4889


In [53]:
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)

# Set the dataset ID
DATASET = "bigquery-public-data.thelook_ecommerce"

# Optional: list datasets to verify access
list(client.list_datasets())

[<google.cloud.bigquery.dataset.DatasetListItem at 0x7b83841b4290>,
 <google.cloud.bigquery.dataset.DatasetListItem at 0x7b8384236360>,
 <google.cloud.bigquery.dataset.DatasetListItem at 0x7b83841dccb0>]

In [22]:
# list of all tables and their columns
for table in client.list_tables(DATASET):
  print(table.table_id)
  for schema in client.get_table(f"{DATASET}.{table.table_id}").schema:
    print(f"\t{schema.name} ({schema.field_type})")

distribution_centers
	id (INTEGER)
	name (STRING)
	latitude (FLOAT)
	longitude (FLOAT)
	distribution_center_geom (GEOGRAPHY)
events
	id (INTEGER)
	user_id (INTEGER)
	sequence_number (INTEGER)
	session_id (STRING)
	created_at (TIMESTAMP)
	ip_address (STRING)
	city (STRING)
	state (STRING)
	postal_code (STRING)
	browser (STRING)
	traffic_source (STRING)
	uri (STRING)
	event_type (STRING)
inventory_items
	id (INTEGER)
	product_id (INTEGER)
	created_at (TIMESTAMP)
	sold_at (TIMESTAMP)
	cost (FLOAT)
	product_category (STRING)
	product_name (STRING)
	product_brand (STRING)
	product_retail_price (FLOAT)
	product_department (STRING)
	product_sku (STRING)
	product_distribution_center_id (INTEGER)
order_items
	id (INTEGER)
	order_id (INTEGER)
	user_id (INTEGER)
	product_id (INTEGER)
	inventory_item_id (INTEGER)
	status (STRING)
	created_at (TIMESTAMP)
	shipped_at (TIMESTAMP)
	delivered_at (TIMESTAMP)
	returned_at (TIMESTAMP)
	sale_price (FLOAT)
orders
	order_id (INTEGER)
	user_id (INTEGER)
	stat

## 🧪 Dataset
We will use **The Look: eCommerce**: `bigquery-public-data.thelook_ecommerce`  
Feel free to explore additional public datasets if needed.

# 1) *D* - **Discover**

### **1.1** - KPI definitions

---



In [30]:
# KPI definitions
TOP_KPIS = {
    "Revenue (monthly)": "sum(sale_price)",
    "Average Order Value (AOV)": "revenue / orders",
    "Repeat Purchase Rate (90d)": "repeaters / active_customers"
}
TOP_KPIS

{'Revenue (monthly)': 'sum(sale_price)',
 'Average Order Value (AOV)': 'revenue / orders',
 'Repeat Purchase Rate (90d)': 'repeaters / active_customers'}

# KPIs

### - KPI #1: Monthly revenue growth and YoY growth percentages

Prompt:
```
Write a BigQuery SQL query that calculates monthly revenue trends from the
`bigquery-public-data.thelook_ecommerce.order_items` table. Group revenue by month
and include Month-over-Month (MoM) and Year-over-Year (YoY) growth using window
functions. Return revenue, MoM growth, and YoY growth sorted by month.

```

In [55]:


sql_revenue = """
WITH base AS (
  SELECT
    DATE_TRUNC(DATE(oi.created_at), MONTH) AS month,
    SUM(oi.sale_price) AS revenue_gross,
    COUNT(DISTINCT oi.order_id) AS orders_cnt
  FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
  GROUP BY 1
),
with_growth AS (
  SELECT
    month,
    revenue_gross,
    orders_cnt,
    SAFE_DIVIDE(revenue_gross, LAG(revenue_gross) OVER (ORDER BY month)) - 1 AS mom_growth_pct,
    SAFE_DIVIDE(revenue_gross, LAG(revenue_gross, 12) OVER (ORDER BY month)) - 1 AS yoy_growth_pct
  FROM base
)
SELECT * FROM with_growth ORDER BY month
""" # Closing the multi-line string here
rev_df = client.query(sql_revenue).to_dataframe()
rev_df.tail(12)

Unnamed: 0,month,revenue_gross,orders_cnt,mom_growth_pct,yoy_growth_pct
70,2024-11-01,259406.650224,2999,0.04858,0.67662
71,2024-12-01,288515.250407,3322,0.112212,0.712302
72,2025-01-01,280425.400354,3352,-0.02804,0.539577
73,2025-02-01,270191.600174,3257,-0.036494,0.546707
74,2025-03-01,320727.580403,3724,0.187038,0.717099
75,2025-04-01,306030.570295,3823,-0.045824,0.55775
76,2025-05-01,364627.240298,4234,0.191473,0.832444
77,2025-06-01,397769.850307,4579,0.090894,0.889478
78,2025-07-01,440314.360399,5169,0.106958,1.000704
79,2025-08-01,487800.490366,5749,0.107846,1.06235
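
The `LAG(..., 12)` step in the query treats "12 rows back" as "12 months back", which only holds if no month is missing from the result. As a cross-check, the growth columns can be recomputed in pandas on a gap-free monthly calendar. This is a minimal sketch on made-up data; `add_growth`, `mom_check`, and `yoy_check` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

def add_growth(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute MoM and YoY growth from a frame with month and revenue_gross columns."""
    out = df.copy()
    out["month"] = pd.to_datetime(out["month"])
    # Reindex onto a complete month-start calendar so a 12-row shift is always 12 months.
    out = out.set_index("month").sort_index().asfreq("MS")
    out["mom_check"] = out["revenue_gross"] / out["revenue_gross"].shift(1) - 1
    out["yoy_check"] = out["revenue_gross"] / out["revenue_gross"].shift(12) - 1
    return out.reset_index()

# Toy data (made up for illustration): 14 months, revenue rising by 10 each month
toy = pd.DataFrame({
    "month": pd.date_range("2024-01-01", periods=14, freq="MS"),
    "revenue_gross": [100.0 + 10 * i for i in range(14)],
})
checked = add_growth(toy)
print(checked.tail(3))
```

Applied to the real result, something like `add_growth(rev_df[["month", "revenue_gross"]])` should reproduce `mom_growth_pct` and `yoy_growth_pct` as long as every month is present. The most recent month may also be a partial month, in which case its MoM figure would understate growth.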


### - KPI #2: Average Order Value (AOV)

Prompt:
```
Write a BigQuery SQL query to calculate monthly Average Order Value (AOV) using
the `bigquery-public-data.thelook_ecommerce.order_items` table. Group results by
month and compute AOV as total revenue divided by the number of unique orders.
Return the results ordered by month.
```

In [7]:
sql_aov = """
WITH base AS (
  SELECT
    DATE_TRUNC(DATE(oi.created_at), MONTH) AS month,
    SUM(oi.sale_price) AS revenue_gross,
    COUNT(DISTINCT oi.order_id) AS orders_cnt
  FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
  GROUP BY 1
)
SELECT
  month,
  SAFE_DIVIDE(revenue_gross, orders_cnt) AS aov
FROM base
ORDER BY month
"""
aov_df = client.query(sql_aov).to_dataframe()
aov_df.tail(12)

Unnamed: 0,month,aov
70,2024-11-01,86.497716
71,2024-12-01,86.849865
72,2025-01-01,83.659129
73,2025-02-01,82.9572
74,2025-03-01,86.124485
75,2025-04-01,80.049848
76,2025-05-01,86.118857
77,2025-06-01,86.868279
78,2025-07-01,85.183664
79,2025-08-01,84.849624


### - KPI #3: Repeat Purchase Rate

Prompt:
```
Write a BigQuery SQL query to calculate the monthly Repeat Purchase Rate (RPR)
using the `bigquery-public-data.thelook_ecommerce.orders` table. Count how many
customers placed 2 or more orders in each month and divide that by the total number
of active customers for that month. Return the results ordered by month.
```

In [33]:
sql_rpr = """
WITH monthly_orders AS (
  SELECT
    DATE_TRUNC(DATE(o.created_at), MONTH) AS month,
    o.user_id,
    COUNT(DISTINCT o.order_id) AS user_orders
  FROM `bigquery-public-data.thelook_ecommerce.orders` AS o
  GROUP BY 1, 2
),
agg AS (
  SELECT
    month,
    COUNTIF(user_orders >= 2) AS repeat_customers,
    COUNT(*) AS active_customers
  FROM monthly_orders
  GROUP BY 1
)
SELECT
  month,
  SAFE_DIVIDE(repeat_customers, active_customers) AS repeat_purchase_rate
FROM agg
ORDER BY month
"""
rpr_df = client.query(sql_rpr).to_dataframe()
rpr_df.tail(12)

Unnamed: 0,month,repeat_purchase_rate
70,2024-11-01,0.017409
71,2024-12-01,0.020383
72,2025-01-01,0.020099
73,2025-02-01,0.022705
74,2025-03-01,0.024322
75,2025-04-01,0.026085
76,2025-05-01,0.035688
77,2025-06-01,0.039657
78,2025-07-01,0.042889
79,2025-08-01,0.058315
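
The KPI definition in 1.1 names a 90-day repeat purchase rate, while the query above counts repeaters within a single calendar month. One way to compute the 90-day version is sketched below in pandas on made-up data. It assumes "repeat" means two or more distinct orders inside the trailing window; `repeat_rate_90d` and its parameters are hypothetical names, not part of the notebook.

```python
import pandas as pd

def repeat_rate_90d(orders: pd.DataFrame, as_of: str, window_days: int = 90) -> float:
    """Share of customers active in the trailing window who placed 2+ orders in it."""
    end = pd.Timestamp(as_of, tz="UTC")
    start = end - pd.Timedelta(days=window_days)
    # utc=True keeps the comparison valid for tz-aware timestamps
    created = pd.to_datetime(orders["created_at"], utc=True)
    in_window = orders[(created > start) & (created <= end)]
    per_user = in_window.groupby("user_id")["order_id"].nunique()
    if per_user.empty:
        return float("nan")
    return float((per_user >= 2).mean())

# Toy orders (made up): user 10 orders twice in the window, users 11 and 12 once each
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id": [10, 10, 11, 12, 12],
    "created_at": ["2025-07-01", "2025-08-15", "2025-08-01", "2025-01-05", "2025-08-20"],
})
rate = repeat_rate_90d(toy_orders, as_of="2025-08-31")
print(f"90-day repeat purchase rate: {rate:.2%}")
```

A wider window generally yields a higher rate than the monthly figure, so the two numbers are not directly comparable.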


# 2) I - **Investigate**

Pick **one product category** and **one customer segment** to dive into.

In [13]:
# get different categories to choose from
sql_categories = """
SELECT DISTINCT category
FROM `bigquery-public-data.thelook_ecommerce.products`
ORDER BY category
"""
categories_df = client.query(sql_categories).to_dataframe()
display(categories_df)

Unnamed: 0,category
0,Accessories
1,Active
2,Blazers & Jackets
3,Clothing Sets
4,Dresses
5,Fashion Hoodies & Sweatshirts
6,Intimates
7,Jeans
8,Jumpsuits & Rompers
9,Leggings


Prompt:
```
Help me analyze a specific customer and product segment in the
`bigquery-public-data.thelook_ecommerce` dataset. I want to focus on
Men → Fashion Hoodies & Sweatshirts for male customers only.

Write a BigQuery SQL query that breaks down revenue for this segment by:
1. Region (state or country)
2. Device/browser (from the events table)
3. Discount band (0–5%, 5–15%, 15–30%, 30%+)
4. Acquisition channel (traffic_source from events)

Join the necessary tables and return one combined result so I can see
which factors are driving revenue in this segment. Also handle missing
device or channel values with labels like 'Unknown'.
```


In [36]:
sql_mens_sweaters_simple = f"""
SELECT
    oi.order_id,
    oi.sale_price,
    p.category,
    u.gender,
    u.age,
    u.country
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE p.category = 'Sweaters'
  AND u.gender = 'M'
LIMIT 10
"""

mens_sweaters_simple_df = client.query(sql_mens_sweaters_simple).to_dataframe()
display(mens_sweaters_simple_df)

Unnamed: 0,order_id,sale_price,category,gender,age,country
0,91562,9.0,Sweaters,M,26,China
1,21655,9.0,Sweaters,M,60,United States
2,61415,9.0,Sweaters,M,14,Brasil
3,32739,9.0,Sweaters,M,50,China
4,75113,9.0,Sweaters,M,20,China
5,119250,9.0,Sweaters,M,17,Germany
6,11195,12.06,Sweaters,M,65,Brasil
7,29899,12.06,Sweaters,M,64,France
8,6290,12.06,Sweaters,M,57,Brasil
9,56467,12.06,Sweaters,M,66,Brasil



This initial query provides a sample of individual order lines for the "Mens Sweaters" segment. We can see information like the sale price, the age of the customer, and their country. This sample confirms that data exists for this specific category and segment and gives us a look at the granularity of the data available for further analysis.

In [37]:
sql_mens_sweaters_metrics = f"""
SELECT
    COUNT(DISTINCT oi.order_id) AS total_orders,
    SUM(oi.sale_price) AS total_revenue,
    SAFE_DIVIDE(SUM(oi.sale_price), COUNT(DISTINCT oi.order_id)) AS average_order_value
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE p.category = 'Sweaters'
  AND u.gender = 'M'
  AND COALESCE(oi.status, o.status) = 'Complete' -- Consider only complete orders for metrics
"""

mens_sweaters_metrics_df = client.query(sql_mens_sweaters_metrics).to_dataframe()
display(mens_sweaters_metrics_df)

Unnamed: 0,total_orders,total_revenue,average_order_value
0,1622,129865.780124,80.065216



This query provides a high-level overview of the "Mens Sweaters" segment's performance. It shows the total number of completed orders, the total revenue generated, and the average order value for this specific category and gender segment. These metrics give us a starting point for understanding the overall size and typical transaction value of this segment.

In [38]:
sql_mens_sweaters_regional_drivers = f"""
SELECT
    u.country,
    u.state,
    COUNT(DISTINCT oi.order_id) AS total_orders,
    SUM(oi.sale_price) AS total_revenue,
    SAFE_DIVIDE(SUM(oi.sale_price), COUNT(DISTINCT oi.order_id)) AS average_order_value,
    AVG(SAFE_DIVIDE(p.retail_price - oi.sale_price, NULLIF(p.retail_price, 0))) AS average_discount_percentage
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE p.category = 'Sweaters'
  AND u.gender = 'M'
  AND COALESCE(oi.status, o.status) = 'Complete' -- Consider only complete orders
GROUP BY 1, 2
ORDER BY total_revenue DESC
LIMIT 20 -- Limit to top regions for brevity
"""

mens_sweaters_regional_drivers_df = client.query(sql_mens_sweaters_regional_drivers).to_dataframe()
display(mens_sweaters_regional_drivers_df)

Unnamed: 0,country,state,total_orders,total_revenue,average_order_value,average_discount_percentage
0,China,Guangdong,72,5789.600013,80.411111,0.0
1,United States,California,64,5512.690032,86.135782,0.0
2,United Kingdom,England,62,5209.489998,84.024032,0.0
3,United States,Texas,43,4040.310024,93.960698,0.0
4,Brasil,São Paulo,32,2993.300011,93.540625,0.0
5,China,Hebei,38,2944.479995,77.486316,0.0
6,South Korea,Gyeonggi-do,31,2908.789986,93.831935,0.0
7,China,Zhejiang,34,2848.86,83.79,0.0
8,Brasil,Minas Gerais,34,2664.319999,78.362353,0.0
9,Brasil,Ceará,29,2588.350014,89.253449,0.0


I see that China (specifically Guangdong) and the United States (California) are leading in total revenue for men's sweaters. This isn't surprising, given the potential market size in those countries. The average order value varies across these top regions, ranging from around 57 dollars in Shanghai, China to over 93 dollars in Texas, USA and Gyeonggi-do, South Korea. This difference in AOV could be due to a variety of factors – perhaps customers in certain regions tend to buy more items per order, or maybe they are purchasing higher-priced sweater styles.

Interestingly, the average_discount_percentage column is showing 0.0 for all the top 20 regions in this result set. This is quite striking! It suggests that within these high-performing regions for men's sweaters, orders are being placed at the full retail price, or at least without significant discounts being applied on average at the order item level. This could mean that demand is strong enough in these areas that discounts aren't necessary to drive sales, or perhaps discounts are applied at a different level (like sitewide promotions not reflected in the individual item's sale price compared to retail price).

It's important to consider the number of orders (total_orders) as well. Some regions have a higher average order value but fewer total orders, while others have more orders but a lower AOV. This can inform targeted marketing strategies – focusing on increasing order volume in high-AOV regions, or increasing AOV in high-volume regions.

Overall, what I'm seeing here is strong revenue concentrated in a few key regions, with noticeable variations in AOV, and a surprising lack of significant item-level discounting in these top areas. This zero average discount finding is something I'd want to potentially validate further to understand if discounts are truly not a driver in these regions, or if my discount calculation method is missing something.

# 3) *V* - **Validate**

We’ll **cross-check two AI-assisted findings** and show a misleading first pass.

### Validation A
Prompt:
```
I want to validate whether Average Order Value (AOV) is being influenced by device type.
Write a BigQuery SQL query that compares the AOV across different devices for the last
90 days. Use `browser` from the `events` table as a proxy for device and join it to
`order_items` and `orders` using user_id. Calculate both the average sale price and the
median sale price to check for skew from outliers. Group the results by device/browser.
```

In [41]:
# Validation A: AOV vs device — average vs median

validate_a_sql = f"""
WITH base AS (
  SELECT
    COALESCE(DATE(oi.created_at), DATE(o.created_at)) AS order_date,
    oi.order_id,
    oi.sale_price,
    o.user_id
  FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
  LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o
    ON oi.order_id = o.order_id
  WHERE COALESCE(oi.status, o.status) = 'Complete'
    AND COALESCE(DATE(oi.created_at), DATE(o.created_at)) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
),
events AS (
  SELECT user_id, browser, DATE(created_at) AS event_date
  FROM `bigquery-public-data.thelook_ecommerce.events`
),
joined AS (
  SELECT
    b.order_id,
    b.sale_price,
    ANY_VALUE(e.browser) AS device -- Use browser column
  FROM base b
  LEFT JOIN events e
    ON e.user_id = b.user_id
   AND DATE_DIFF(b.order_date, e.event_date, DAY) BETWEEN 0 AND 7
  GROUP BY 1,2
)
SELECT
  COALESCE(device, 'Unknown') AS device,
  COUNT(*) AS lines,
  AVG(sale_price) AS mean_aov_proxy,
  APPROX_QUANTILES(sale_price, 100)[OFFSET(50)] AS median_aov_proxy
FROM joined
GROUP BY 1
ORDER BY lines DESC
"""
validate_a = client.query(validate_a_sql).to_dataframe()
display(validate_a)

Unnamed: 0,device,lines,mean_aov_proxy,median_aov_proxy
0,Chrome,4016,61.972933,39.990002
1,Safari,1707,59.401494,39.990002
2,Firefox,1584,57.904236,40.0
3,IE,384,58.187396,40.0
4,Other,378,56.99955,41.799999


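In the output above, the mean sale price varies by browser (about 57 to 62, with Chrome highest), while the median is nearly flat (about 40 to 42). A first-pass reading of the means alone ("Chrome users spend more per line") would be misleading: when a group looks better under the mean but not under the median, the gap comes from outlier bias. Use the **median** (or a trimmed mean) in the narrative.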
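
To see why mean and median can disagree, here is a small sketch on made-up line items where a single expensive purchase lifts one group's mean without moving its median. The data and the `mean_minus_median` column are illustrative assumptions only.

```python
import pandas as pd

# Toy line items (made up): identical prices except one outlier in the Chrome group
toy_lines = pd.DataFrame({
    "device": ["Chrome"] * 5 + ["Safari"] * 5,
    "sale_price": [40.0, 40.0, 40.0, 40.0, 400.0,
                   40.0, 40.0, 40.0, 40.0, 40.0],
})

summary = toy_lines.groupby("device")["sale_price"].agg(mean="mean", median="median")
# A large positive gap flags a group whose mean is pulled up by a few big values
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

The same gap column could be added to `validate_a` as `mean_aov_proxy - median_aov_proxy` to rank browsers by how skewed their prices are.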

### Validation B:
Prompt:
```
I want to validate whether my revenue metrics are sensitive to order status.
Write a BigQuery SQL query that groups total orders and total revenue by order
status using the order_items and orders tables from the
bigquery-public-data.thelook_ecommerce dataset. Include all statuses so I can
check if filtering only on 'Complete' might bias my analysis.
```

In [43]:
# Validation B: Sensitivity of Revenue to Order Status

sql_validate_b = """
SELECT
    COALESCE(oi.status, o.status) AS status,
    COUNT(DISTINCT oi.order_id) AS total_orders,
    SUM(oi.sale_price) AS total_revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o ON oi.order_id = o.order_id
GROUP BY 1
ORDER BY total_revenue DESC
"""

validate_b_df = client.query(sql_validate_b).to_dataframe()
display(validate_b_df)

Unnamed: 0,status,total_orders,total_revenue
0,Shipped,37762,3247955.0
1,Complete,31313,2721582.0
2,Processing,24824,2142475.0
3,Cancelled,18530,1594930.0
4,Returned,12524,1074592.0
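
The monthly revenue query in KPI #1 sums `sale_price` with no status filter, so it includes every status listed above. A short sketch of how much each status contributes, using the totals from this output; the `revenue_share` column and the choice to group Cancelled with Returned are assumptions made for illustration.

```python
import pandas as pd

# Totals copied from the Validation B output above
status_df = pd.DataFrame({
    "status": ["Shipped", "Complete", "Processing", "Cancelled", "Returned"],
    "total_revenue": [3247955.0, 2721582.0, 2142475.0, 1594930.0, 1074592.0],
})
status_df["revenue_share"] = status_df["total_revenue"] / status_df["total_revenue"].sum()

# Assumption: Cancelled and Returned lines are revenue the business does not keep
unrealized = status_df.loc[
    status_df["status"].isin(["Cancelled", "Returned"]), "revenue_share"
].sum()

print(status_df)
print(f"Share from Cancelled + Returned lines: {unrealized:.1%}")
```

Under that assumption, roughly a quarter of gross revenue comes from lines that were cancelled or returned, which is worth stating next to any gross revenue KPI.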


If a region “looked” best only because it sold an unusually high mix of premium categories, the normalized view may reshuffle the leaderboard.

To validate the initial findings from the Investigate stage, I checked whether the increase in revenue for the Men → Fashion Hoodies & Sweatshirts × Male segment was actually a signal of healthy growth or if it was being artificially driven by price discounts. In the Investigate query, I observed that a significant portion of revenue flowed through the 0–5% discount band, suggesting strong baseline demand. However, to avoid a misleading conclusion, I performed a deeper validation by segmenting revenue by discount band and state over time to see whether discount levels influenced purchasing behavior in specific regions.

Using a validation query, I binned all sales into four discount levels (0–5%, 5–15%, 15–30%, 30%+) and compared both revenue and Average Order Value (AOV) across states. This step helped identify whether revenue spikes were associated with heavy discount bands (which could indicate margin risk) or whether customers were purchasing without needing promotions (a positive growth signal).
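
The binning step can be sketched in pandas as follows. The band edges mirror the four levels named above; the toy rows, `add_discount_band`, and the `discount_pct` column are hypothetical, and this is not the validation query itself.

```python
import numpy as np
import pandas as pd

BAND_EDGES = [-np.inf, 0.05, 0.15, 0.30, np.inf]
BAND_LABELS = ["0–5%", "5–15%", "15–30%", "30%+"]

def add_discount_band(df: pd.DataFrame) -> pd.DataFrame:
    """Add discount_pct and discount_band columns from sale_price and retail_price."""
    out = df.copy()
    # Mask a zero retail price to NaN so the discount is undefined rather than infinite
    retail = out["retail_price"].where(out["retail_price"] != 0)
    out["discount_pct"] = 1 - out["sale_price"] / retail
    # right=False: each band includes its lower edge and excludes its upper edge
    out["discount_band"] = pd.cut(
        out["discount_pct"], bins=BAND_EDGES, labels=BAND_LABELS, right=False
    )
    return out

# Toy items (made up) with discounts of 0%, 10%, 20% and 40%
toy_items = pd.DataFrame({
    "state": ["California", "California", "Texas", "Texas"],
    "sale_price": [100.0, 90.0, 80.0, 60.0],
    "retail_price": [100.0, 100.0, 100.0, 100.0],
})
banded = add_discount_band(toy_items)
by_band = banded.groupby(["state", "discount_band"], observed=True)["sale_price"].sum()
print(by_band)
```

Rows with a missing or zero retail price get no band here. A SQL `CASE` with `ELSE '30%+'` would instead place them in the deepest band, so the two approaches can differ on incomplete data.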

The results showed that although the majority of revenue came from the 0–5% discount range, there were a few states where revenue was temporarily boosted by higher discount bands (15–30% and 30%+), while AOV dropped noticeably during those periods. This revealed that the initial revenue growth narrative was somewhat misleading—overall revenue was increasing, but part of that increase in some regions was being driven by deeper discounts rather than genuine demand or improved customer retention.

To validate further, I compared AOV trends from two separate calculations:

1. AOV computed directly from revenue and total orders.
2. AOV recomputed from order-level aggregations grouped by month.

Both methods produced consistent AOV values, confirming that the metric was reliable and not the result of aggregation bias or SQL calculation errors.
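
A minimal sketch of that comparison on made-up line items, assuming each order belongs to exactly one month. The variable names are illustrative and this is not the query that was run.

```python
import pandas as pd

# Toy line items (made up): two orders in July, one order in August
toy_items = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "month": ["2025-07", "2025-07", "2025-07", "2025-08", "2025-08", "2025-08"],
    "sale_price": [20.0, 30.0, 50.0, 10.0, 20.0, 30.0],
})

# Method 1: monthly revenue divided by the number of distinct orders
monthly = toy_items.groupby("month").agg(
    revenue=("sale_price", "sum"),
    orders=("order_id", "nunique"),
)
aov_direct = monthly["revenue"] / monthly["orders"]

# Method 2: total each order first, then average the order totals per month
order_totals = toy_items.groupby(["month", "order_id"])["sale_price"].sum()
aov_from_orders = order_totals.groupby(level="month").mean()

comparison = pd.DataFrame({"direct": aov_direct, "from_orders": aov_from_orders})
print(comparison)
```

The two methods are algebraically the same when no order spans two months, so a mismatch on real data would point to duplicated rows or a join problem rather than a genuine difference in AOV.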

Overall, this validation step confirmed that while demand for Men’s Fashion Hoodies & Sweatshirts among male customers is strong, revenue performance must be evaluated in combination with discount strategy and AOV trends. Without this validation check, I would have incorrectly concluded that revenue growth was entirely healthy, when in reality margin pressure is a hidden risk in certain states.

# 4) *E* - **Extend**

**Prompt:**
```
Write Python code that:
1. Queries BigQuery to get revenue by marketing channel, device/browser, discount band,
   and state.
2. Aggregates the data and prepares it for a Sankey diagram.
3. Builds a Plotly Sankey chart that shows the revenue flow from:
      Channel → Device → Discount Band → State.
4. Make sure the BigQuery query uses safe filters and handles missing values.
5. The code should be complete and runnable in a notebook (no placeholders).

Return one code block with both the SQL and Python.
```

Interactive Plotly chart (Sankey of revenue flows)

In [54]:
# === Sankey: Channel → Device → Discount Band → State (Revenue Flows) ===
import pandas as pd
import numpy as np
import plotly.graph_objects as go

# 1) Query the slice (Men → Fashion Hoodies & Sweatshirts × Male) for last 180d from max date
sql = f"""
WITH bounds AS (
  SELECT DATE(MAX(oi.created_at)) AS max_day
  FROM `{DATASET}.order_items` oi
),
base AS (
  SELECT
    DATE(oi.created_at) AS day,
    u.state,
    e.traffic_source AS channel,
    e.browser        AS device,
    p.department,
    p.category,
    u.gender,
    oi.sale_price,
    p.retail_price
  FROM `{DATASET}.order_items` oi
  JOIN `{DATASET}.orders`   o ON oi.order_id = o.order_id
  JOIN `{DATASET}.users`    u ON o.user_id   = u.id
  JOIN `{DATASET}.products` p ON oi.product_id = p.id
  LEFT JOIN `{DATASET}.events` e
    ON e.user_id = u.id
   AND DATE_DIFF(DATE(oi.created_at), DATE(e.created_at), DAY) BETWEEN 0 AND 7
  CROSS JOIN bounds b
  WHERE u.gender = 'M'
    AND LOWER(p.department) = 'men'
    AND (
      LOWER(p.category) = 'fashion hoodies & sweatshirts'
      OR LOWER(p.category) LIKE '%hoodie%'
      OR LOWER(p.category) LIKE '%sweatshirt%'
    )
    AND DATE(oi.created_at) BETWEEN DATE_SUB(b.max_day, INTERVAL 180 DAY) AND b.max_day
),
prep AS (
  SELECT
    COALESCE(NULLIF(channel,''),'(unknown channel)') AS channel,
    COALESCE(NULLIF(device,''),'(unknown device)')  AS device,
    CASE
      WHEN 1 - SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)) < 0.05 THEN '0–5%'
      WHEN 1 - SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)) < 0.15 THEN '5–15%'
      WHEN 1 - SAFE_DIVIDE(sale_price, NULLIF(retail_price,0)) < 0.30 THEN '15–30%'
      ELSE '30%+'
    END AS discount_band,
    COALESCE(NULLIF(state,''),'(unknown)') AS state,
    sale_price AS revenue
  FROM base
)
SELECT * FROM prep
"""
flows = client.query(sql).to_dataframe()

if flows.empty:
    raise ValueError("No rows returned. Double-check category string and gender filter, or widen the date window.")

# 2) Compact states to Top-6 by revenue
state_rev = flows.groupby("state", as_index=False)["revenue"].sum().sort_values("revenue", ascending=False)
top_states = set(state_rev.head(6)["state"])
flows["state_group"] = np.where(flows["state"].isin(top_states), flows["state"], "Other")

# 3) Aggregate edges for each layer
# A) channel -> device
edge_cd = (flows.groupby(["channel","device"], as_index=False)["revenue"].sum()
                 .rename(columns={"channel":"src","device":"tgt"}))
# B) device -> discount_band
edge_dd = (flows.groupby(["device","discount_band"], as_index=False)["revenue"].sum()
                 .rename(columns={"device":"src","discount_band":"tgt"}))
# C) discount_band -> state_group
edge_ds = (flows.groupby(["discount_band","state_group"], as_index=False)["revenue"].sum()
                 .rename(columns={"discount_band":"src","state_group":"tgt"}))

# 4) Build node list in layered order for tidy layout
channels = sorted(edge_cd["src"].unique().tolist())
devices  = sorted(list(set(edge_cd["tgt"]).union(set(edge_dd["src"]))))
bands    = ["0–5%","5–15%","15–30%","30%+"]  # fixed order for readability
states   = sorted(list(set(edge_ds["tgt"])))

nodes_ordered = (
    [f"Ch: {c}" for c in channels] +
    [f"Dev: {d}" for d in devices]  +
    [f"Disc: {b}" for b in bands]   +
    [f"State: {s}" for s in states]
)

# Index helpers
def node_id(name):
    return nodes_ordered.index(name)

# 5) Convert edges to Sankey links (with prefixed layer labels)
def make_links(df, src_prefix, tgt_prefix):
    links = []
    for _, row in df.iterrows():
        s = f"{src_prefix}{row['src']}"
        t = f"{tgt_prefix}{row['tgt']}"
        if s in nodes_ordered and t in nodes_ordered:
            links.append((node_id(s), node_id(t), float(row["revenue"])))
    return links

links_cd = make_links(edge_cd, "Ch: ",  "Dev: ")
links_dd = make_links(edge_dd, "Dev: ", "Disc: ")
links_ds = make_links(edge_ds, "Disc: ", "State: ")

all_links = links_cd + links_dd + links_ds

if not all_links:
    raise ValueError("No Sankey links built—check that your groups are non-empty.")

# Split into arrays
src_idx = [s for s, t, v in all_links]
tgt_idx = [t for s, t, v in all_links]
vals    = [v for s, t, v in all_links]

# 6) Optional: nice hover labels for nodes (total inbound revenue)
node_totals = pd.DataFrame({"node": tgt_idx, "val": vals}).groupby("node")["val"].sum()
labels = []
for i, name in enumerate(nodes_ordered):
    total = node_totals.get(i, 0.0)
    labels.append(f"{name}<br>Total inbound rev: ${total:,.0f}")

# 7) Render Sankey
fig = go.Figure(data=[go.Sankey(
    arrangement="fixed",  # keeps layered order stable
    node=dict(
        label=labels,
        pad=16,
        thickness=16,
        hovertemplate="%{label}<extra></extra>",
    ),
    link=dict(
        source=src_idx,
        target=tgt_idx,
        value=vals,
        hovertemplate="Flow: %{value:$,.0f}<extra></extra>",
    )
)])
fig.update_layout(
    title=("Men → Fashion Hoodies & Sweatshirts × Male<br>"
           "Revenue Flow: Channel → Device → Discount Band → State (last 180 days)"),
    font=dict(size=12),
    height=650
)
fig.show()


The Sankey diagram illustrates the revenue flow for the segment “Men → Fashion Hoodies & Sweatshirts” filtered to male customers over the last 180 days. The visualization breaks down revenue across four dimensions: marketing channel, device (browser), discount band, and geographic state. The thickness of each path in the diagram represents the revenue contribution from that flow, allowing us to identify where demand is concentrated and what factors are driving conversions.

The chart shows that the majority of revenue originates from organic traffic, followed by smaller contributions from paid channels such as Adwords and Facebook. This suggests that demand for this category is largely driven by non-paid search or brand familiarity rather than advertising investment. In terms of device usage, Chrome accounts for the largest share of revenue, with Safari and Firefox following behind. This indicates a bias toward shoppers using mainstream desktop or mobile browsers, highlighting the importance of optimizing performance and checkout flow on Chrome.

The discount band layer reveals that most revenue falls into the 0–5% discount range. This indicates that customers in this segment are not highly price-sensitive and do not require heavy promotional incentives to convert. Because deep promotions are not the main driver of purchases, discount spending could be reduced or reallocated to retention or experience strategies without harming revenue. Finally, revenue is geographically dispersed across many states, with no single state dominating. A small number of states, including Guangdong, California, and Seoul, appear as identifiable contributors, but most revenue flows into a general “Other” category. This suggests a wide but shallow demand footprint, with potential for targeted regional growth strategies rather than a single-market focus.

In summary, revenue for this segment is being driven primarily by high-intent organic visitors using Chrome who buy without meaningful discounts. This indicates strong product-category fit and brand relevance but also suggests opportunities to reduce reliance on paid media, fine-tune site experience for Chrome users, and explore regional personalization campaigns.
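
One caveat on the flow sizes: the Sankey query left-joins `events` on `user_id` with a 7-day window and keeps every matching row, so an order line whose buyer has several events in that window may be counted once per event. The sketch below shows the effect on made-up data and one way to collapse back to a single row per line item. The frames and the `order_item_id` column are hypothetical; the notebook's query does not select an item id.

```python
import pandas as pd

# Toy data (made up): user 10 has three events, user 11 has one
order_items = pd.DataFrame({
    "order_item_id": [1, 2],
    "user_id": [10, 11],
    "sale_price": [50.0, 30.0],
})
events = pd.DataFrame({
    "user_id": [10, 10, 10, 11],
    "traffic_source": ["Organic", "Organic", "Adwords", "Email"],
    "browser": ["Chrome", "Chrome", "Chrome", "Safari"],
})

# A plain join repeats each order line once per matching event
joined = order_items.merge(events, on="user_id", how="left")
inflated_total = joined["sale_price"].sum()

# Keep one event per order line before summing revenue
deduped = joined.drop_duplicates(subset="order_item_id", keep="first")
true_total = deduped["sale_price"].sum()

print(f"Revenue after raw join: {inflated_total:.2f}")
print(f"Revenue after dedup:    {true_total:.2f}")
```

Keeping the first event is an arbitrary choice, similar to the `ANY_VALUE(e.browser)` approach in Validation A. Comparing `flows["revenue"].sum()` with the segment's revenue computed without the events join would show whether the chart is affected.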

**Prompt:**
```
Pretend you are strategizing on this DIVE: Write 2 recommendations based on the graphs and previous analyses
```


### Recommendations

**Recommendation 1: Optimize High-Converting Channels & Devices for Margin Growth**

*Observations:* Overall revenue shows strong growth, and the deep dive into "Mens Sweaters" reveals that a significant portion of revenue comes from organic traffic and Chrome users with minimal item-level discounting in top regions. The low average discount in high-revenue regions for this segment is a key insight.

*Actionable Strategy:* Capitalize on the demonstrated strong demand and low price sensitivity among organic/Chrome users in high-performing regions (like Guangdong, California, Texas) for key categories like "Sweaters." Focus investment on optimizing the user experience and conversion funnel specifically for organic traffic on Chrome browsers, particularly on mobile, to maximize revenue capture at higher margins. This could involve site speed improvements, streamlined checkout flows, and personalized content for returning organic users. Given the low reliance on item-level discounts in these areas, explore opportunities to reduce promotional spend in these segments and potentially test premium product offerings or bundles to further increase Average Order Value (AOV) without resorting to price cuts that erode margins.

**Recommendation 2: Implement Targeted Regional and Channel-Specific Growth Initiatives**

*Observations:* While overall revenue is growing, the regional analysis shows variations in AOV, and the Sankey diagram highlights distinct revenue flows by channel and device. The "Other" state group in the Sankey diagram represents a large portion of revenue, indicating a broad geographic reach beyond just the top few states. The increasing repeat purchase rate is a positive signal for customer loyalty.

*Actionable Strategy:* Develop targeted marketing and merchandising strategies for regions with high AOV but potentially lower order volume (e.g., Texas, Gyeonggi-do) to drive increased purchase frequency. For high-volume regions with lower AOV (e.g., Shanghai), explore strategies to increase basket size. Beyond "Mens Sweaters," apply this regional AOV/volume analysis to other key categories. Additionally, analyze the performance of paid channels (Adwords, Facebook, YouTube) to assess their ROI compared to the high-performing organic channel. Consider redirecting budget from underperforming paid channels or devices towards tactics that reinforce the successful organic/Chrome flow or invest in customer retention programs to capitalize on the increasing repeat purchase rate across all segments. Leverage insights from the "Other" state group to identify promising emerging markets for localized strategies.