# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or Gemini in Colab Enterprise chat panel).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`  
If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run.  
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
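
The checklist above can be turned into a small prompt template so every prompt states the same fields. This is a minimal sketch, not part of the lab: `build_sql_prompt` and its parameters are hypothetical names, and the table path in the usage lines is a placeholder.

```python
def build_sql_prompt(task, table, output_columns, filters=None, sort=None, limit=None):
    """Assemble an explicit prompt: table path, filters, output columns, sort order, limit."""
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output columns: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {limit}` to control cost during exploration.")
    return "\n".join(lines)


# Placeholder table path and task, for illustration only.
prompt = build_sql_prompt(
    task="List all unique sub-categories sold in the West region.",
    table="my-project.my_dataset.sales",
    output_columns=["sub_category"],
    filters=["region = 'West'"],
    sort="sub_category ASC",
    limit=100,
)
print(prompt)
```

Leaving `limit` unset drops the cost-control line, which matches the advice to remove `LIMIT` for the final run.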

## Install Dependencies

In [1]:
# Install the Google Cloud BigQuery client library
#!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4   #installed

## Authenticate

In [2]:
# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

# Note: in Colab Enterprise this call prints "WARNING: google.colab.auth.authenticate_user() is not supported in Colab Enterprise."
# The warning appears because the runtime already runs inside GCP.

Authenticated


## Copy Schema to a dataframe

In [3]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'sunlit-plasma-471119-s7'
dataset_id = 'lab1_foundation'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object. A fully qualified "project.dataset.table" string
# avoids the deprecated client.dataset() helper.
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Display the schema DataFrame (optional, for verification)
print("Schema DataFrame created:")
# To see the output, run the code.
schema_df.head()


Schema DataFrame created:


Unnamed: 0,name,field_type,mode,description
0,Row ID,INTEGER,NULLABLE,
1,Order ID,STRING,NULLABLE,
2,Order Date,DATE,NULLABLE,
3,Ship Date,DATE,NULLABLE,
4,Ship Mode,STRING,NULLABLE,


## Clean Column Names

In [4]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore`;
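
The cleaning step above only handles spaces and hyphens. A slightly more general variant, sketched below, collapses any run of non-alphanumeric characters into one underscore and always backticks the original name (which BigQuery accepts for any column). The helper names `clean_column_name` and `alias_expression` are hypothetical, and column names such as `Profit (%)` are invented here purely to exercise the edge cases.

```python
import re


def clean_column_name(name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # Unquoted identifiers cannot start with a digit, so prefix an underscore.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def alias_expression(name: str) -> str:
    """Build one SELECT-list entry: backticked original name aliased to the clean name."""
    return f"`{name}` AS {clean_column_name(name)}"


for original in ["Order ID", "Sub-Category", "Profit (%)", "2017 Sales"]:
    print(alias_expression(original))
```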



## Generate View with standard column naming convention

In [5]:
# The code below is disabled (wrapped in a string) because it raises an error:
# client.list_rows() reads table storage directly and does not work on views.
'''
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")
'''



In [6]:
# WORKING CODE:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a SQL query to select from the new view
    query_to_select_from_view = f"""
    SELECT *
    FROM `{project_id}.{dataset_id}.{new_view_id}`
    LIMIT 10;
    """

    # Execute the query
    query_job_select = client.query(query_to_select_from_view)

    # Fetch the results into a DataFrame
    rows_df = query_job_select.to_dataframe()

    # Print header and rows from the DataFrame
    if not rows_df.empty:
        print(" | ".join(rows_df.columns))
        print("-" * 80) # Separator
        for index, row in rows_df.iterrows():
            print(" | ".join([str(item) for item in row.values]))
    else:
        print("No rows returned from the view.")

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")


View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
5769 | CA-2015-154900 | 2015-02-25 | 2015-03-01 | Standard Class | SS-20875 | Sung Shariari | Consumer | United States | Leominster | Massachusetts | 1453 | East | OFF-LA-10001641 | Office Supplies | Labels | Avery 518 | 3.15 | 1 | 0.0 | 1.512
5770 | CA-2015-154900 | 2015-02-25 | 2015-03-01 | Standard Class | SS-20875 | Sung Shariari | Consumer | United States | Leominster | Massachusetts | 1453 | East | OFF-PA-10002377 | Office Supplies | Paper | Adams Telephone Message Book W/Dividers/Space For Phone Numbers, 5 1/4"X8 1/2", 200/Mes

In [7]:
%%bigquery
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
LIMIT 10;

Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


In [8]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

# A triple-quoted string (""") lets the query span multiple lines
query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

The result did match my expectations. It correctly used the column names in superstore_clean and produced the correct code.

In [9]:
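# Not executed below: this version points at a different project and at the raw
# `Sub-Category` column, which the cleaned view exposes as `sub_category`.
query_string_1 = """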
SELECT
    DISTINCT `Sub-Category` AS Sub_Category
FROM
    `mgmt-467-47888.lab1_foundation.superstore_clean`
WHERE
    Region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""
# Query generated from the prompt above
prompt_query_string_1 = """
SELECT
    DISTINCT sub_category
FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
    region = 'West'
ORDER BY
    sub_category ASC
LIMIT 100
"""

query_job_1 = client.query(prompt_query_string_1)
results_df_1 = query_job_1.to_dataframe()
display(results_df_1)

Unnamed: 0,sub_category
0,Accessories
1,Appliances
2,Art
3,Binders
4,Bookcases
5,Chairs
6,Copiers
7,Envelopes
8,Fasteners
9,Furnishings


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [10]:
query_string_2 = """
SELECT
    customer_id,
    SUM(profit) AS total_profit
FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
GROUP BY
    customer_id
ORDER BY
    total_profit DESC
LIMIT 10
"""

query_job_2 = client.query(query_string_2)
results_df_2 = query_job_2.to_dataframe()
display(results_df_2)


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [11]:
query_string_3 = """
SELECT
    ship_mode,
    COUNT(*) AS order_count
FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
    category = 'Technology'
GROUP BY
    ship_mode
ORDER BY
    order_count DESC
"""

sanity_check_query = """
SELECT
    COUNT(*)
FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
    category = 'Technology'
"""

query_job_3 = client.query(query_string_3)
results_df_3 = query_job_3.to_dataframe()
display(results_df_3)

print("Sanity Check:")
sanity_job = client.query(sanity_check_query)
sanity_df = sanity_job.to_dataframe()
display(sanity_df)


Sanity Check:


Unnamed: 0,f0_
0,1847
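
One way to use a total like this is to check that the per-`ship_mode` counts from the grouped query add up to it. The sketch below shows only the shape of that check: `reconcile_counts` is a hypothetical helper, and the per-mode numbers are placeholders, not results from this dataset.

```python
def reconcile_counts(grouped_counts, expected_total):
    """Compare the sum of grouped counts with a separately computed total.

    Returns (matches, grouped_total).
    """
    grouped_total = sum(grouped_counts.values())
    return grouped_total == expected_total, grouped_total


# Placeholder counts, invented for illustration.
example_counts = {"Standard Class": 60, "Second Class": 25, "First Class": 10, "Same Day": 5}
ok, total = reconcile_counts(example_counts, expected_total=100)
print(ok, total)
```

In the notebook, `grouped_counts` could be built from the grouped result, for example with `dict(zip(results_df_3["ship_mode"], results_df_3["order_count"]))`, assuming those column names.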


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```
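
"Last 12 full months" is easy to read in more than one way, so it helps to pin the window down before judging the generated SQL. The sketch below uses one interpretation, stated as an assumption: the window ends at the first day of the reference date's month (exclusive) and starts 12 months earlier. `last_full_months` is a hypothetical helper.

```python
from datetime import date


def last_full_months(ref: date, n: int = 12) -> tuple[date, date]:
    """Return (start, end) for the n full months before ref's month.

    start is inclusive, end is exclusive; the month containing ref is left out
    because it may be incomplete.
    """
    end = ref.replace(day=1)
    # Count months from year 0 so that subtracting n handles year boundaries.
    month_index = end.year * 12 + (end.month - 1) - n
    start = date(month_index // 12, month_index % 12 + 1, 1)
    return start, end


start, end = last_full_months(date(2017, 12, 30))
print(start, end)
```

A generated query that anchors on `MAX(order_date)` and includes that month uses a different window, so compare its first and last `year_month` values against the interpretation you intended.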

In [12]:
query_string_b1 = """
SELECT
  FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
  SUM(sales) AS monthly_revenue
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
  order_date >= DATE_TRUNC(DATE_SUB((SELECT MAX(order_date) FROM `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`), INTERVAL 11 MONTH), MONTH)
GROUP BY
  1
ORDER BY
  year_month ASC
LIMIT 100;
"""

query_job_b1 = client.query(query_string_b1)
results_df_b1 = query_job_b1.to_dataframe()
display(results_df_b1)

Unnamed: 0,year_month,monthly_revenue
0,2017-01,43971.374
1,2017-02,20301.1334
2,2017-03,58872.3528
3,2017-04,36521.5361
4,2017-05,44261.1102
5,2017-06,52981.7257
6,2017-07,45264.416
7,2017-08,63120.888
8,2017-09,87866.652
9,2017-10,77776.9232


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.
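
The difference can be mirrored in plain Python: `WHERE` filters rows before they are summed, while `HAVING` filters the sums. The rows below are invented for illustration, and `total_by_group` is a hypothetical helper standing in for `GROUP BY ... SUM(...)`.

```python
from collections import defaultdict

# Hypothetical rows, not taken from the dataset.
rows = [
    {"sub_category": "Tables", "profit": -50.0},
    {"sub_category": "Tables", "profit": 20.0},
    {"sub_category": "Phones", "profit": -5.0},
    {"sub_category": "Phones", "profit": 40.0},
]


def total_by_group(rows):
    """Equivalent of SUM(profit) ... GROUP BY sub_category."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sub_category"]] += row["profit"]
    return dict(totals)


# WHERE profit < 0: drop rows first, then aggregate what is left.
where_result = total_by_group([r for r in rows if r["profit"] < 0])

# HAVING SUM(profit) < 0: aggregate everything, then drop groups.
having_result = {k: v for k, v in total_by_group(rows).items() if v < 0}

print(where_result)
print(having_result)
```

Here the `WHERE` version reports both groups because each has at least one loss-making row, while the `HAVING` version keeps only the group whose overall total is negative.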

In [13]:
query_string_b2 = """
SELECT
  sub_category,
  SUM(profit) AS total_profit
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
GROUP BY
  sub_category
HAVING
  total_profit < 0
ORDER BY
  total_profit ASC;
"""

print("Explanation: `HAVING` is used to filter groups after aggregation, whereas `WHERE` filters individual rows before they are grouped.")

query_job_b2 = client.query(query_string_b2)
results_df_b2 = query_job_b2.to_dataframe()
display(results_df_b2)

Explanation: `HAVING` is used to filter groups after aggregation, whereas `WHERE` filters individual rows before they are grouped.


Unnamed: 0,sub_category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [14]:
query_string_c1 = """
WITH
  products_dimension AS (
    -- This CTE simulates a product dimension table by getting unique product IDs and names
    SELECT DISTINCT
      product_id,
      product_name
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
  ),
  sales_facts AS (
    -- This CTE aggregates sales data to get the total sales for each product
    SELECT
      product_id,
      SUM(sales) AS total_sales
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
    GROUP BY
      product_id
  )
-- Final join to enrich the aggregated sales with product names
SELECT
  pd.product_id,
  pd.product_name,
  sf.total_sales
FROM
  products_dimension AS pd
JOIN
  sales_facts AS sf
  ON pd.product_id = sf.product_id
ORDER BY
  total_sales DESC
LIMIT 20; -- Using a LIMIT to control costs during exploration
"""

query_job_c1 = client.query(query_string_c1)
results_df_c1 = query_job_c1.to_dataframe()
display(results_df_c1)

Unnamed: 0,product_id,product_name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.48
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
5,OFF-BI-10000545,GBC Ibimaster 500 Manual ProClick Binding System,19024.5
6,TEC-CO-10001449,Hewlett Packard LaserJet 3310 Copier,18839.686
7,TEC-MA-10001127,HP Designjet T520 Inkjet Large Format Printer ...,18374.895
8,OFF-BI-10004995,GBC DocuBind P400 Electric Binding System,17965.068
9,OFF-SU-10000151,High Speed Automatic Electric Letter Opener,17030.312
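
A caveat when simulating a dimension with `SELECT DISTINCT product_id, product_name`: if one `product_id` ever appears with more than one `product_name`, the join returns one row per name, each carrying the full total. Whether that happens in this dataset is not shown here, so the sketch below only illustrates how to detect it. `ids_with_multiple_names` is a hypothetical helper and the pairs are invented.

```python
from collections import defaultdict


def ids_with_multiple_names(pairs):
    """Return {product_id: sorted names} for ids that map to more than one name."""
    names_by_id = defaultdict(set)
    for product_id, product_name in pairs:
        names_by_id[product_id].add(product_name)
    return {pid: sorted(names) for pid, names in names_by_id.items() if len(names) > 1}


# Invented (product_id, product_name) pairs for illustration.
example_pairs = [
    ("P-001", "Desk Lamp"),
    ("P-001", "Desk Lamp, Black"),
    ("P-002", "Stapler"),
    ("P-002", "Stapler"),
]
print(ids_with_multiple_names(example_pairs))
```

If this returns anything for the real data, picking a single name per id in the dimension CTE (for example with `ANY_VALUE` or `MIN`) keeps the join one-to-one.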


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

In [15]:
query_string_d1 = """
WITH
  state_sales AS (
    -- First, calculate the total sales for each state within each region.
    SELECT
      region,
      state,
      SUM(sales) AS total_sales
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
    GROUP BY
      region,
      state
  ),
  ranked_state_sales AS (
    -- Then, rank the states within each region based on their total sales.
    SELECT
      region,
      state,
      total_sales,
      RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
    FROM
      state_sales
  )
-- Finally, select only the top 3 ranked states from each region.
SELECT
  region,
  state,
  total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  region,
  sales_rank;
"""

query_job_d1 = client.query(query_string_d1)
results_df_d1 = query_job_d1.to_dataframe()
display(results_df_d1)

Unnamed: 0,region,state,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

In [16]:
# We used 2016 and 2017 instead of 2023 and 2024, because there was no data for 2023 and 2024
query_string_d2 = """
WITH
  yr_sales AS (
    -- Aggregate sales by sub-category for only the years of interest.
    -- We are using 2016 and 2017 because the dataset contains data for these years.
    SELECT
      sub_category,
      EXTRACT(YEAR FROM order_date) AS sales_year,
      SUM(sales) AS total_sales
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
    WHERE
      EXTRACT(YEAR FROM order_date) IN (2016, 2017)
    GROUP BY
      sub_category,
      sales_year
  ),
  pivoted_sales AS (
    -- Pivot the data and alias the year columns correctly.
    -- Without aliasing, BigQuery would name them `_2016` and `_2017`.
    SELECT * FROM yr_sales
    PIVOT(SUM(total_sales) FOR sales_year IN (2016 AS sales_2016, 2017 AS sales_2017))
  )
-- Final selection and calculation of year-over-year growth.
SELECT
  sub_category,
  -- Use COALESCE to handle cases where a sub-category might exist in one year but not the other.
  COALESCE(sales_2016, 0) AS sales_2016,
  COALESCE(sales_2017, 0) AS sales_2017,
  -- Calculate the year-over-year change.
  (COALESCE(sales_2017, 0) - COALESCE(sales_2016, 0)) AS yoy_delta
FROM
  pivoted_sales
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

query_job_d2 = client.query(query_string_d2)
results_df_d2 = query_job_d2.to_dataframe()
display(results_df_d2)

Unnamed: 0,sub_category,sales_2016,sales_2017,yoy_delta
0,Phones,78962.03,105340.516,26378.486
1,Binders,49683.325,72788.045,23104.72
2,Accessories,41895.854,59946.232,18050.378
3,Appliances,26050.315,42926.932,16876.617
4,Copiers,49599.41,62899.388,13299.978


### Failure Modes and Solutions for YoY Analysis

1.  **Failure Mode: Sub-category Exists in One Year but Not Both**
    *   **Problem:** If a sub-category has sales in one year (e.g., 2017) but not the other (2016), the `PIVOT` operation produces `NULL` for the missing year's sales. Any arithmetic involving `NULL` (e.g., `sales_2017 - NULL`) yields `NULL`, and because BigQuery sorts `NULL` last under `ORDER BY yoy_delta DESC`, that sub-category would drop to the bottom of the ranking instead of being compared on its actual growth.
    *   **Solution:** The query uses the `COALESCE()` function (e.g., `COALESCE(sales_2016, 0)`). This function checks if a value is `NULL` and, if it is, replaces it with a specified default value—in this case, `0`. This ensures that new or discontinued sub-categories are correctly included in the calculation.

2.  **Failure Mode: No Data for the Specified Years**
    *   **Problem:** If the query is run with years for which there is no data in the table (as we initially saw with 2023 and 2024), the first CTE (`yr_sales`) will be empty. This leads to the final output being an empty table, which is correct but might be confusing.
    *   **Solution:** While the current query handles this by correctly returning an empty result, a more robust approach in an automated pipeline would be to run a **pre-flight check**. Before executing the main query, you could run a simple `COUNT` to verify that data exists for the selected period. If the count is zero, you can skip the main query and output a clear message like, "No data available for the selected years," saving processing cost and providing better context.
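
The pre-flight idea can be sketched independently of BigQuery by passing the count and the main query in as callables. Everything here is an assumption for illustration: `run_with_preflight` is a hypothetical helper, and the stub functions stand in for calls that would wrap `client.query(...)` in the notebook.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_preflight(
    count_rows: Callable[[Sequence[int]], int],
    run_main: Callable[[], T],
    years: Sequence[int],
) -> Optional[T]:
    """Run run_main only if count_rows reports data for the given years."""
    if count_rows(years) == 0:
        print(f"No data available for the selected years: {list(years)}")
        return None
    return run_main()


# Stubs standing in for BigQuery calls; the row counts are invented.
fake_row_counts = {2016: 120, 2017: 150}


def stub_count(years):
    return sum(fake_row_counts.get(y, 0) for y in years)


def stub_main():
    return "main query result"


print(run_with_preflight(stub_count, stub_main, [2016, 2017]))
print(run_with_preflight(stub_count, stub_main, [2023, 2024]))
```

In a real pipeline, `count_rows` would run a `SELECT COUNT(*)` filtered on the same years as the main query, so the two stay in sync.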

## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

In [17]:
query_string_e1 = """
WITH
  subcat_sales AS (
    -- First, calculate the total sales for each sub-category within each region.
    SELECT
      region,
      sub_category,
      SUM(sales) AS total_sales
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
    GROUP BY
      region,
      sub_category
  ),
  ranked_sales AS (
    -- Then, assign a unique row number to each sub-category within its region, ordered by sales.
    SELECT
      region,
      sub_category,
      total_sales,
      ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) as rn
    FROM
      subcat_sales
  )
-- Finally, select only the sub-category with row number 1 for each region.
SELECT
  region,
  sub_category,
  total_sales
FROM
  ranked_sales
WHERE
  rn = 1
ORDER BY
  region;
"""

query_job_e1 = client.query(query_string_e1)
results_df_e1 = query_job_e1.to_dataframe()
display(results_df_e1)

Unnamed: 0,region,sub_category,total_sales
0,Central,Chairs,85230.646
1,East,Phones,100614.982
2,South,Phones,58304.438
3,West,Chairs,101781.328


### `ROW_NUMBER()` vs. `RANK()`

We use `ROW_NUMBER()` here because it assigns the number `1` to exactly one row within each region, even if two sub-categories have identical sales (which of the tied rows wins is arbitrary unless the `ORDER BY` adds a tiebreaker). In contrast, `RANK()` assigns the same rank to tied sub-categories, so a `WHERE rn = 1` filter could return multiple rows for a single region.
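
A small Python imitation of the two functions makes the tie behavior concrete. The sales figures are invented, and `row_number` and `rank` are simplified stand-ins for the SQL window functions over a single partition ordered descending.

```python
def row_number(values):
    """Assign 1..n by descending value; ties get distinct numbers.

    Python's stable sort keeps input order among ties. SQL gives no such
    guarantee without an explicit tiebreaker in ORDER BY.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    numbers = [0] * len(values)
    for position, index in enumerate(order, start=1):
        numbers[index] = position
    return numbers


def rank(values):
    """Assign 1 + (count of strictly larger values); ties share a rank."""
    return [1 + sum(1 for other in values if other > v) for v in values]


# Hypothetical total_sales for three sub-categories in one region, with a tie at the top.
sales = [100.0, 100.0, 80.0]
print(row_number(sales))
print(rank(sales))
```

Filtering on `1` keeps one entry with `row_number` and two with `rank`, which is the difference described above.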

### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

In [18]:
query_string_e2 = """
WITH yearly_phone_sales AS (
  -- First, filter for the 'Phones' sub-category and aggregate sales by year.
  SELECT
    EXTRACT(YEAR FROM order_date) AS sales_year,
    SUM(sales) AS yearly_revenue
  FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
  WHERE
    sub_category = 'Phones'
  GROUP BY
    sales_year
)
-- Then, use the LAG window function to access the previous year's revenue
-- and calculate the year-over-year percentage change.
SELECT
  sales_year AS year,
  yearly_revenue,
  -- LAG fetches data from a previous row in the window. Here, it gets last year's revenue.
  LAG(yearly_revenue, 1) OVER (ORDER BY sales_year) AS prev_revenue,
  -- Calculate the percentage change. NULLIF turns a zero prior-year revenue into NULL, avoiding a division-by-zero error.
  -- For the first year, LAG already returns NULL, so yoy_pct is NULL.
  100.0 * (yearly_revenue - LAG(yearly_revenue, 1) OVER (ORDER BY sales_year)) / NULLIF(LAG(yearly_revenue, 1) OVER (ORDER BY sales_year), 0) AS yoy_pct
FROM
  yearly_phone_sales
ORDER BY
  year ASC;
"""

query_job_e2 = client.query(query_string_e2)
results_df_e2 = query_job_e2.to_dataframe()
display(results_df_e2)

|   | year | yearly_revenue | prev_revenue | yoy_pct |
| --- | --- | --- | --- | --- |
| 0 | 2014 | 77390.806 |  |  |
| 1 | 2015 | 68313.702 | 77390.806 | -11.728918 |
| 2 | 2016 | 78962.03 | 68313.702 | 15.587397 |
| 3 | 2017 | 105340.516 | 78962.03 | 33.406545 |


### Guarding Against Divide-by-Zero Errors

The query uses `NULLIF` to prevent division-by-zero errors when calculating the year-over-year percentage (`yoy_pct`). Here's the relevant part of the code:

```sql
... / NULLIF(LAG(yearly_revenue, 1) OVER (ORDER BY sales_year), 0)
```

**How it works:**

1.  **`LAG(...)`**: This function retrieves the `yearly_revenue` from the previous year. For the very first year in the dataset (2014), there is no previous year, so `LAG` returns `NULL`.

2.  **`NULLIF(value, value_to_check)`**: This function guards the zero case. It returns `NULL` if its two arguments are equal; otherwise, it returns the first argument.
    *   **For the First Year**: The expression becomes `NULLIF(NULL, 0)`, which returns its first argument, `NULL`. Any arithmetic with `NULL` (like division) results in `NULL`, which is why `prev_revenue` and `yoy_pct` are empty for 2014 in the output (pandas typically renders such values as `NaN` or `<NA>`).
    *   **If a Previous Year's Sales were Zero**: If `LAG` returned `0`, the expression would be `NULLIF(0, 0)`, which returns `NULL`. This prevents a "division by zero" error and yields a `NULL` growth percentage instead.

Strictly speaking, `NULLIF` is only needed for the zero-revenue edge case; the first-year case is already handled by `NULL` propagation, which `NULLIF` leaves intact. BigQuery's `SAFE_DIVIDE(numerator, denominator)` is an alternative that returns `NULL` on division by zero.
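The guard logic can be mirrored in a few lines of plain Python, with `None` standing in for SQL `NULL`. This is only an illustrative sketch: the helper names `nullif` and `yoy_pct` are hypothetical, and the revenue figures are copied from the E2 output above.

```python
from typing import Optional


def nullif(value: Optional[float], sentinel: float) -> Optional[float]:
    """Mimic SQL NULLIF: None when value equals sentinel, otherwise value."""
    if value is None:
        return None
    return None if value == sentinel else value


def yoy_pct(current: Optional[float], previous: Optional[float]) -> Optional[float]:
    """Mimic 100.0 * (current - previous) / NULLIF(previous, 0) with NULL propagation."""
    denominator = nullif(previous, 0)
    if current is None or denominator is None:
        return None
    return 100.0 * (current - previous) / denominator


# Yearly 'Phones' revenue as displayed in the E2 output.
yearly_revenue = {2014: 77390.806, 2015: 68313.702, 2016: 78962.03, 2017: 105340.516}

growth = {}
previous = None  # LAG returns NULL for the first row
for year in sorted(yearly_revenue):
    growth[year] = yoy_pct(yearly_revenue[year], previous)
    previous = yearly_revenue[year]

print(growth)
print(yoy_pct(500.0, 0.0))  # zero prior revenue: None instead of an error
```

The first year and the zero-revenue case both come out as `None`, but for different reasons: the former through propagation, the latter through the `nullif` guard.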

### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

In [19]:
query_string_e3 = """
WITH monthly_corporate_sales AS (
  -- First, filter for the 'Corporate' segment and aggregate sales by month.
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
)
-- Then, calculate the 3-month moving average using a window function.
SELECT
  month,
  monthly_revenue,
  -- AVG(...) OVER (...) calculates the average of 'monthly_revenue' across a specified window of rows.
  -- 'ROWS BETWEEN 2 PRECEDING AND CURRENT ROW' defines the window as the current month and the two before it.
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  monthly_corporate_sales
ORDER BY
  month ASC;
"""

query_job_e3 = client.query(query_string_e3)
results_df_e3 = query_job_e3.to_dataframe()
display(results_df_e3)

|   | month | monthly_revenue | ma_3 |
| --- | --- | --- | --- |
| 0 | 2014-01-01 | 1701.528 | 1701.528 |
| 1 | 2014-02-01 | 1183.668 | 1442.598 |
| 2 | 2014-03-01 | 11106.799 | 4663.998333 |
| 3 | 2014-04-01 | 14131.729 | 8807.398667 |
| 4 | 2014-05-01 | 9142.0 | 11460.176 |
| 5 | 2014-06-01 | 3970.914 | 9081.547667 |
| 6 | 2014-07-01 | 10032.988 | 7715.300667 |
| 7 | 2014-08-01 | 7451.774 | 7151.892 |
| 8 | 2014-09-01 | 15507.745 | 10997.502333 |
| 9 | 2014-10-01 | 12637.678 | 11865.732333 |


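**Cost Control Note:** While iterating, you can temporarily narrow the `WHERE` clause inside the `monthly_corporate_sales` CTE to a smaller date range (e.g., `WHERE segment = 'Corporate' AND EXTRACT(YEAR FROM order_date) = 2017`). Two caveats: a date filter reduces the bytes BigQuery scans only if the underlying table is partitioned or clustered on `order_date` (otherwise it shrinks the result, not the scan), and the first two months of the narrowed range will show partial-window averages. A dry run reports the bytes a query would process before you run it.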

## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
[PASTE ERROR MESSAGE and the exact SQL here]
Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

The root cause of the error is the missing `BETWEEN` keyword in the window function's `ROWS` clause, which is required when the frame names both a start and an end boundary. The corrected query adds this keyword, creating the valid syntax `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` for the moving average calculation.

In [20]:
query_string_e3_fixed = """
WITH monthly_corporate_sales AS (
  -- First, filter for the 'Corporate' segment and aggregate sales by month.
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
)
-- Then, calculate the 3-month moving average using a window function.
SELECT
  month,
  monthly_revenue,
  -- AVG(...) OVER (...) calculates the average of 'monthly_revenue' across a specified window of rows.
  -- 'ROWS BETWEEN 2 PRECEDING AND CURRENT ROW' defines the window as the current month and the two before it.
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  monthly_corporate_sales
ORDER BY
  month ASC;
"""

query_job_e3 = client.query(query_string_e3_fixed)
results_df_e3 = query_job_e3.to_dataframe()
display(results_df_e3)

|   | month | monthly_revenue | ma_3 |
| --- | --- | --- | --- |
| 0 | 2014-01-01 | 1701.528 | 1701.528 |
| 1 | 2014-02-01 | 1183.668 | 1442.598 |
| 2 | 2014-03-01 | 11106.799 | 4663.998333 |
| 3 | 2014-04-01 | 14131.729 | 8807.398667 |
| 4 | 2014-05-01 | 9142.0 | 11460.176 |
| 5 | 2014-06-01 | 3970.914 | 9081.547667 |
| 6 | 2014-07-01 | 10032.988 | 7715.300667 |
| 7 | 2014-08-01 | 7451.774 | 7151.892 |
| 8 | 2014-09-01 | 15507.745 | 10997.502333 |
| 9 | 2014-10-01 | 12637.678 | 11865.732333 |


### Sanity Check Query

To verify the fix, this query manually calculates the moving average for just the third month in the series. You can compare the `manual_ma_3` result from this query with the `ma_3` value in the third row of the output from the corrected query above; they should match.

In [21]:
sanity_check_e3 = """
WITH monthly_corporate_sales AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
),
ranked_sales AS (
  SELECT
    monthly_revenue,
    ROW_NUMBER() OVER (ORDER BY month) as rn
  FROM
    monthly_corporate_sales
)
SELECT
  AVG(monthly_revenue) as manual_ma_3
FROM
  ranked_sales
WHERE
  rn BETWEEN 1 AND 3;
"""

sanity_job = client.query(sanity_check_e3)
sanity_df = sanity_job.to_dataframe()
display(sanity_df)

|   | manual_ma_3 |
| --- | --- |
| 0 | 4663.998333 |
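The same check can be extended offline to every displayed row. The sketch below recomputes the trailing average in plain Python from the ten `monthly_revenue` values shown in the output and compares it with the reported `ma_3` column; the `trailing_average` helper is hypothetical and simply mimics `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`.

```python
def trailing_average(values, window=3):
    """Average each value with up to (window - 1) preceding values."""
    averages = []
    for i in range(len(values)):
        # The frame is shorter than `window` for the first rows, as in SQL.
        frame = values[max(0, i - window + 1): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


# Values copied from the first ten rows of the query output.
monthly_revenue = [1701.528, 1183.668, 11106.799, 14131.729, 9142.0,
                   3970.914, 10032.988, 7451.774, 15507.745, 12637.678]
reported_ma_3 = [1701.528, 1442.598, 4663.998333, 8807.398667, 11460.176,
                 9081.547667, 7715.300667, 7151.892, 10997.502333, 11865.732333]

computed = trailing_average(monthly_revenue)
max_gap = max(abs(c - r) for c, r in zip(computed, reported_ma_3))
print(f"largest difference from reported ma_3: {max_gap:.7f}")
```

Any remaining difference is rounding in the displayed values. The first two rows average over one and two months respectively, which is why `ma_3` equals `monthly_revenue` in January 2014.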


### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
[PASTE YOUR SQL HERE]
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

Given the query to find the top sub-category by sales in each region, here are three actionable ways to reduce its cost and improve performance, prioritized for impact:

1.  **Partition and Cluster the Base Table:** The change with the largest potential effect is to the physical layout of the source data. If the `superstore` table (which the `superstore_clean` view is built on) were **partitioned by `order_date`** and **clustered by `region` and `sub_category`**, queries that filter on those columns could skip the storage blocks they do not need. This particular query has no filter, so it still reads every row of the `region`, `sub_category`, and `sales` columns. Clustering may speed up the `GROUP BY` in the `subcat_sales` CTE because matching values are stored together, but it will not by itself reduce the bytes scanned. The benefit grows with table size and is negligible for small tables.

2.  **Create a Pre-Aggregated Summary Table:** This query calculates a fundamental business metric. Instead of re-calculating it from raw data every time, you could create a much smaller, aggregated table. A scheduled query could run daily or hourly to populate a table like `analytics.daily_region_subcategory_sales` with the total sales for each combination. Your analysis query would then run against this tiny, pre-aggregated table, making it exceptionally fast and cheap, as the heavy lifting has already been done.

3.  **Refactor with the `QUALIFY` Clause:** While your use of CTEs is clear, BigQuery offers a more concise way to express this logic. The `QUALIFY` clause filters on the result of a window function in the same `SELECT`, avoiding the need for a second CTE or subquery. This does not reduce the bytes scanned (that's the job of options 1 and 2) and should not be expected to change performance much; its main benefit is a shorter query that is easier to read and maintain.

    **Example using `QUALIFY`:**
    ```sql
    SELECT
      region,
      sub_category,
      SUM(sales) AS total_sales
    FROM
      `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
    GROUP BY
      region, sub_category
    QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) = 1
    ORDER BY
      region;
    ```
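The pre-aggregation idea from item 2 can be sketched locally. The example below uses Python's built-in `sqlite3` as a stand-in for BigQuery (an assumption; it needs SQLite 3.25 or newer, and it uses a subquery because SQLite has no `QUALIFY`). The `orders` rows and the `region_subcategory_sales` table name are hypothetical.

```python
import sqlite3

# Hypothetical order lines; the real table has many more columns and rows.
orders = [
    ("East", "Phones", 120.0), ("East", "Phones", 80.0), ("East", "Chairs", 150.0),
    ("West", "Chairs", 200.0), ("West", "Phones", 90.0), ("West", "Chairs", 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sub_category TEXT, sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

# One-off (or scheduled) aggregation step: one row per region and sub-category.
conn.execute("""
    CREATE TABLE region_subcategory_sales AS
    SELECT region, sub_category, SUM(sales) AS total_sales
    FROM orders
    GROUP BY region, sub_category
""")

# Top sub-category per region, parameterized by where the totals come from.
TOP_SQL = """
    SELECT region, sub_category, total_sales
    FROM (
      SELECT region, sub_category, total_sales,
             ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY total_sales DESC, sub_category
             ) AS rn
      FROM {source}
    )
    WHERE rn = 1
    ORDER BY region
"""

raw_source = (
    "(SELECT region, sub_category, SUM(sales) AS total_sales "
    "FROM orders GROUP BY region, sub_category)"
)
from_raw = conn.execute(TOP_SQL.format(source=raw_source)).fetchall()
from_summary = conn.execute(
    TOP_SQL.format(source="region_subcategory_sales")
).fetchall()

print(from_raw)
print(from_summary)
```

Both paths return the same rows; the difference is that the second one reads only the small summary table, which is where the savings would come from on a large source table.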

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

### Generate Counter-Queries and Analysis Note

**Subtask**:
Generate two alternative BigQuery SQL queries to test the finding that the 'Tables' sub-category is unprofitable, and provide a note on how to interpret the results.


**Reasoning**:
I will now write the first BigQuery SQL query to analyze the sales, profit, and average discount for the 'Tables' sub-category, broken down by region and year. I will then execute this query and display the results.



In [22]:
query_string_g1_a = """
-- Query 1: Slicing by Region and Time
-- This query investigates if the unprofitability of 'Tables' is uniform or concentrated in specific regions or years.
SELECT
  region,
  EXTRACT(YEAR FROM order_date) AS order_year,
  SUM(sales) AS total_sales,
  SUM(profit) AS total_profit,
  AVG(discount) AS average_discount
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  region,
  order_year
ORDER BY
  region,
  order_year;
"""

query_job_g1_a = client.query(query_string_g1_a)
results_df_g1_a = query_job_g1_a.to_dataframe()
display(results_df_g1_a)

|   | region | order_year | total_sales | total_profit | average_discount |
| --- | --- | --- | --- | --- | --- |
| 0 | Central | 2014 | 7785.478 | -1424.331 | 0.326667 |
| 1 | Central | 2015 | 6857.26 | -265.0939 | 0.207143 |
| 2 | Central | 2016 | 13922.926 | 292.6211 | 0.205882 |
| 3 | Central | 2017 | 10589.307 | -2162.8466 | 0.292308 |
| 4 | East | 2014 | 10603.704 | -3537.8375 | 0.38 |
| 5 | East | 2015 | 8884.806 | -2275.8641 | 0.373333 |
| 6 | East | 2016 | 7825.328 | -2306.7783 | 0.368182 |
| 7 | East | 2017 | 11825.969 | -2904.9002 | 0.373913 |
| 8 | South | 2014 | 9940.9445 | 1107.9902 | 0.113636 |
| 9 | South | 2015 | 7370.6745 | -2171.3765 | 0.21875 |


**Reasoning**:
The first query has been executed. Now I will write and execute the second BigQuery SQL query to analyze the same metrics (sales, profit, average discount) for the 'Tables' sub-category, but this time broken down by `ship_mode`, as requested in the second step of the subtask.



In [23]:
query_string_g1_b = """
-- Query 2: Slicing by Ship Mode
-- This query checks if certain shipping methods are associated with higher costs or discounts, impacting profit.
SELECT
  ship_mode,
  SUM(sales) AS total_sales,
  SUM(profit) AS total_profit,
  AVG(discount) AS average_discount
FROM
  `sunlit-plasma-471119-s7.lab1_foundation.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  ship_mode
ORDER BY
  total_profit;
"""

query_job_g1_b = client.query(query_string_g1_b)
results_df_g1_b = query_job_g1_b.to_dataframe()
display(results_df_g1_b)

|   | ship_mode | total_sales | total_profit | average_discount |
| --- | --- | --- | --- | --- |
| 0 | Standard Class | 124826.6615 | -11910.0122 | 0.270526 |
| 1 | Second Class | 43693.7475 | -3320.6799 | 0.248361 |
| 2 | First Class | 28800.776 | -1365.3665 | 0.240426 |
| 3 | Same Day | 9644.347 | -1129.4225 | 0.261905 |


### How to Compare Outcomes

To nuance your initial finding, compare the results from these two queries to see whether the unprofitability of 'Tables' is universal or concentrated. The first query, slicing by region and year, shows that the losses are not uniform: the 'South' and 'West' regions had years with positive profit (the West rows fall outside the ten rows displayed above), and 'Central' turned a small profit in 2016, whereas the 'East' lost money in every year from 2014 to 2017. This suggests the problem is influenced by regional strategy and is plausibly tied to `average_discount`, which is higher in the unprofitable region-years: the East averaged roughly 0.37 to 0.38 every year, while the South's profitable 2014 averaged 0.11. The second query, slicing by `ship_mode`, shows that all modes are unprofitable and that `average_discount` is similar across them (about 0.24 to 0.27), so shipping method is unlikely to be the primary driver of the losses. These slices show association rather than proof of cause, but together they let you move from a simple conclusion ('Tables are unprofitable') to a more actionable one ('The unprofitability of Tables is primarily driven by high discount strategies in specific regions, particularly the East and Central, rather than being an issue with a particular shipping method').
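One rough way to check the discount pattern is to split the displayed region-years by the sign of their profit and compare average discounts. The sketch below does that in plain Python using only the ten rows shown in the first query's output, so West and the later South years are missing. It takes an unweighted mean of per-group averages, which is a simplification.

```python
# (region, year, total_profit, average_discount) copied from the ten displayed rows.
rows = [
    ("Central", 2014, -1424.331, 0.326667),
    ("Central", 2015, -265.0939, 0.207143),
    ("Central", 2016, 292.6211, 0.205882),
    ("Central", 2017, -2162.8466, 0.292308),
    ("East", 2014, -3537.8375, 0.38),
    ("East", 2015, -2275.8641, 0.373333),
    ("East", 2016, -2306.7783, 0.368182),
    ("East", 2017, -2904.9002, 0.373913),
    ("South", 2014, 1107.9902, 0.113636),
    ("South", 2015, -2171.3765, 0.21875),
]


def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Split average discounts by whether the region-year made or lost money.
profitable = [discount for _, _, profit, discount in rows if profit > 0]
unprofitable = [discount for _, _, profit, discount in rows if profit <= 0]

print(f"profitable region-years:   n={len(profitable)}, mean discount={mean(profitable):.4f}")
print(f"unprofitable region-years: n={len(unprofitable)}, mean discount={mean(unprofitable):.4f}")
```

With only two profitable region-years in the displayed rows, this is a direction check rather than a statistical test. Rerunning it on the full query result would be the natural next step.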

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

The 'Tables' sub-category has created a significant financial drain, generating a net loss of over $17,700 despite consistently high sales. This loss is not universal but is driven primarily by aggressive and unsustainable discount strategies in the East and Central regions. We recommend that the East and Central regional sales managers immediately review and cap 'Tables' discount rates to align with profitable regions, starting next quarter. The key metric to monitor will be the monthly profit ratio for 'Tables' within these specific regions to confirm the effectiveness of this policy change.
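The headline figure in the summary can be traced back to the ship-mode output from G1. The sketch below sums that table in plain Python. It assumes every 'Tables' order line has a ship mode, so that the four rows add up to the sub-category total.

```python
# (ship_mode, total_sales, total_profit) copied from the G1 ship-mode output.
by_ship_mode = [
    ("Standard Class", 124826.6615, -11910.0122),
    ("Second Class", 43693.7475, -3320.6799),
    ("First Class", 28800.776, -1365.3665),
    ("Same Day", 9644.347, -1129.4225),
]

total_sales = sum(sales for _, sales, _ in by_ship_mode)
total_profit = sum(profit for _, _, profit in by_ship_mode)
profit_margin_pct = 100.0 * total_profit / total_sales

print(
    f"sales={total_sales:,.2f} "
    f"profit={total_profit:,.2f} "
    f"margin={profit_margin_pct:.2f}%"
)
```

Quoting the margin alongside the absolute loss gives readers of the summary a sense of scale relative to sales.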

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use the `google-cloud-bigquery` Python client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

---
## Submission checklist
- [x] Kept prompts precise and reproducible  
- [x] Captured at least **one** CTE query and **one** window function query  
- [x] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [x] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---