<a href="https://colab.research.google.com/github/bjrodarmel/mgmt467-analytics-portfolio/blob/main/Lab_VertexAI_BigQuery_PromptsOnly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or the Gemini chat panel in Colab Enterprise).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`  
If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
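- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run. Keep in mind that `LIMIT` trims the rows returned, not necessarily the bytes scanned, so also ask for only the columns you need.  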
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
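
The checklist above can be turned into a small template so that no field is forgotten. The following is a minimal sketch, not part of the lab materials: the `build_prompt` helper and its parameter names are hypothetical, and the example values are borrowed from prompt A1 in this notebook.

```python
def build_prompt(task, table, output_columns, filters=None, sort=None, limit=None):
    """Assemble an explicit BigQuery prompt from the checklist fields (hypothetical helper)."""
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output columns: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {limit}` during exploration.")
    return "\n".join(lines)


prompt = build_prompt(
    task="List all unique sub_category values sold in the 'West' region.",
    table="mgmt-467.lab1_foundation.superstore_clean",
    output_columns=["sub_category"],
    filters=["region = 'West'"],
    sort="sub_category A to Z",
    limit=100,
)
print(prompt)
```

Keeping the fields as separate arguments makes it harder to leave out the table path or the sort order, which are the details the checklist asks you to state explicitly.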

## Install Dependencies

In [None]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

Collecting google-cloud-bigquery==3.17.0
  Downloading google_cloud_bigquery-3.17.0-py2.py3-none-any.whl.metadata (8.8 kB)
Collecting pandas==2.1.4
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy<2,>=1.23.2 (from pandas==2.1.4)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading google_cloud_bigquery-3.17.0-py2.py3-none-any.whl (230 kB)
Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.ma

## Authenticate

In [None]:
# Authenticate your Colab environment
#from google.colab import auth
#auth.authenticate_user()
#print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [None]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud project ID, dataset, and table
project_id = 'mgmt-467'
dataset_id = 'lab1_foundation'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object by its fully qualified ID
# (Client.dataset() is deprecated in recent google-cloud-bigquery releases)
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Confirm that the schema DataFrame was built
print("Schema DataFrame created:")


Schema DataFrame created:


## Clean Column Names

In [None]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or hyphen, it must be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `mgmt-467.lab1_foundation.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `mgmt-467.lab1_foundation.superstore`;



## Generate View with standard column naming convention

In [None]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

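# Now try to print 10 rows from the new view to verify it.
# Note: list_rows() reads table storage directly and does not support views,
# so this attempt fails with "Cannot list a table of type VIEW";
# the next cell queries the view instead.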
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")



View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
An error occurred while fetching rows from the view: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt-467/datasets/lab1_foundation/tables/superstore_clean/data?maxResults=10&formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.


In [None]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt-467.lab1_foundation.superstore`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

The results did not match my expectations at first: the query returned a number of other fields instead of the single column I asked for. I had missed renaming some of the columns in the prompt and had not told Gemini about those naming changes. After correcting the names and running the prompt again, I received the output I expected. The lesson is to reread each prompt carefully and confirm that every table and column name matches the schema.
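
One way to catch this kind of naming mismatch before sending a prompt is to compare the identifiers in the prompt against the columns of the cleaned view. This is a minimal sketch, assuming column names are written in backticks as in the prompts here; `unknown_identifiers` is a hypothetical helper, and the column set is copied from the `superstore_clean` view generated earlier in this notebook.

```python
import re

# Column names of the superstore_clean view, as generated earlier in this notebook.
VIEW_COLUMNS = {
    "row_id", "order_id", "order_date", "ship_date", "ship_mode",
    "customer_id", "customer_name", "segment", "country", "city", "state",
    "postal_code", "region", "product_id", "category", "sub_category",
    "product_name", "sales", "quantity", "discount", "profit",
}


def unknown_identifiers(prompt, known_columns):
    """Return backticked single-word identifiers that are not known column names.

    Only tokens made of letters, digits, and underscores are considered, so
    table paths and backticked expressions such as `LIMIT 100` are ignored.
    """
    found = re.findall(r"`([A-Za-z_]\w*)`", prompt)
    return sorted({name for name in found if name not in known_columns})


a1_prompt = """Task: List all unique `Sub_Category` values sold in the 'West' region.
Table: `mgmt-467.lab1_foundation.superstore_clean`
Filter: `Region = 'West'`
Output: a single column named `Sub_Category`"""

print(unknown_identifiers(a1_prompt, VIEW_COLUMNS))
```

The check is case-sensitive on purpose, so `Sub_Category` is flagged even though `sub_category` exists; that is the cue to restate the column name in the prompt.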

In [None]:
print("✅ Step 1: Defining the query string...")

query_string2 = """
SELECT
  DISTINCT sub_category
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  region = 'West'
ORDER BY
  sub_category ASC
LIMIT 100
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job2 = client.query(query_string2)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df2 = query_job2.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df2)} rows.")

    if results_df2.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df2)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 17 rows.

--- Displaying Results ---


Unnamed: 0,sub_category
0,Accessories
1,Appliances
2,Art
3,Binders
4,Bookcases
5,Chairs
6,Copiers
7,Envelopes
8,Fasteners
9,Furnishings


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467.lab1_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string3 = """
SELECT
  customer_id,
  SUM(profit) AS total_profit
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
GROUP BY
  customer_id
ORDER BY
  total_profit DESC
LIMIT 10
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job3 = client.query(query_string3)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df3 = query_job3.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df3)} rows.")

    if results_df3.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df3)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,customer_id,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [None]:
print("✅ Step 1: Defining the query string...")

query_string4 = """
SELECT
  ship_mode,
  COUNT(*) AS order_count
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  category = 'Technology'
GROUP BY
  ship_mode
ORDER BY
  order_count DESC
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job4 = client.query(query_string4)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df4 = query_job4.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df4)} rows.")

    if results_df4.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df4)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


print("✅ Step 1: Defining the query string...")

query_string5 = """
SELECT
  COUNT(*) AS standard_class_tech_orders
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  category = 'Technology' AND ship_mode = 'Standard Class';
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job5 = client.query(query_string5)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df5 = query_job5.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df5)} rows.")

    if results_df5.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df5)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


print("✅ Step 1: Defining the query string...")

query_string6 = """
SELECT
  COUNT(*) AS total_technology_orders
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  category = 'Technology';
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job6 = client.query(query_string6)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df6 = query_job6.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df6)} rows.")

    if results_df6.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df6)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


Unnamed: 0,ship_mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 1 rows.

--- Displaying Results ---


Unnamed: 0,total_technology_orders
0,1847
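
The ship-mode breakdown and the overall Technology count above support one of the sanity checks directly: the grouped counts should add up to the ungrouped total. A minimal sketch using the numbers reported in the outputs (the variable names are illustrative):

```python
# Counts copied from the query outputs above.
orders_by_ship_mode = {
    "Standard Class": 1082,
    "Second Class": 366,
    "First Class": 301,
    "Same Day": 98,
}
total_technology_orders = 1847

# Sanity check 1: the grouped counts should sum to the ungrouped total.
grouped_total = sum(orders_by_ship_mode.values())
assert grouped_total == total_technology_orders, (grouped_total, total_technology_orders)

# Sanity check 2: shares per ship mode should sum to 1 and look plausible.
shares = {mode: count / grouped_total for mode, count in orders_by_ship_mode.items()}
largest_mode = max(shares, key=shares.get)
print(grouped_total, largest_mode)
```

Note that `COUNT(*)` counts rows, and an `order_id` can appear on several rows in this table (one per product), so `COUNT(DISTINCT order_id)` may be closer to a true order count if that is what you need.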


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

This query worked once I adjusted it: the data does not include any orders from the last 12 full calendar months, so I had Gemini compute the window relative to the data instead, using the 12 full months before the month of the most recent order.
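
The window logic can be checked outside BigQuery with plain date arithmetic. This is a minimal sketch: `last_full_months_window` is a hypothetical helper, and the example date is an assumption chosen for illustration, not a value read from the table.

```python
from datetime import date


def last_full_months_window(max_order_date, months=12):
    """Return (start, end) for the `months` full months before max_order_date's month.

    The window is half-open: start <= order_date < end.
    """
    end = max_order_date.replace(day=1)  # first day of the latest (partial) month
    month_index = end.year * 12 + (end.month - 1) - months
    start = date(month_index // 12, month_index % 12 + 1, 1)
    return start, end


# Assumed latest order date, for illustration only.
start, end = last_full_months_window(date(2017, 12, 30))
print(start, end)
```

Using a half-open interval mirrors the `>=` and `<` comparisons in the SQL, which avoids having to know how many days the final month has.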

In [None]:
print("✅ Step 1: Defining the query string...")

# This revised query finds the last 12 full months of data *relative to the latest order date in the table*,
# which handles cases where the data is not recent.
query_string_b1 = """
WITH latest_date AS (
  SELECT MAX(order_date) as max_order_date
  FROM `mgmt-467.lab1_foundation.superstore_clean`
)
SELECT
  FORMAT_DATE('%Y-%m', order_date) AS year_month,
  ROUND(SUM(sales), 2) AS monthly_revenue
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
CROSS JOIN
  latest_date
WHERE
  -- Filter for the 12 full months preceding the latest month in the data
  order_date >= DATE_SUB(DATE_TRUNC(max_order_date, MONTH), INTERVAL 12 MONTH)
  AND order_date < DATE_TRUNC(max_order_date, MONTH)
GROUP BY
  year_month
ORDER BY
  year_month ASC
LIMIT 12
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job_b1 = client.query(query_string_b1)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df_b1 = query_job_b1.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_b1)} rows.")

    if results_df_b1.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. This might mean there's no data in the last 12 full months in your table.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df_b1)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,year_month,monthly_revenue
0,2016-12,96999.04
1,2017-01,43971.37
2,2017-02,20301.13
3,2017-03,58872.35
4,2017-04,36521.54
5,2017-05,44261.11
6,2017-06,52981.73
7,2017-07,45264.42
8,2017-08,63120.89
9,2017-09,87866.65


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

Why use HAVING instead of WHERE?

HAVING is used instead of WHERE because it filters the results after the data has been grouped and aggregated (e.g., with SUM()), while WHERE filters individual rows before the aggregation takes place.
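
The difference is easy to see on a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in for BigQuery (the `GROUP BY`, `WHERE`, and `HAVING` semantics shown here are the same), and the four rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sub_category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Tables", -50.0), ("Tables", 20.0),  # net -30: unprofitable overall
        ("Chairs", -10.0), ("Chairs", 40.0),  # net +30: profitable overall
    ],
)

# HAVING filters groups after SUM() has been computed.
having_rows = conn.execute(
    "SELECT sub_category, SUM(profit) AS total_profit FROM sales "
    "GROUP BY sub_category HAVING SUM(profit) < 0 ORDER BY sub_category"
).fetchall()

# WHERE filters rows before aggregation, so it answers a different question:
# the sum of the loss-making rows in each sub-category.
where_rows = conn.execute(
    "SELECT sub_category, SUM(profit) AS total_profit FROM sales "
    "WHERE profit < 0 GROUP BY sub_category ORDER BY sub_category"
).fetchall()

print(having_rows)
print(where_rows)
```

With `WHERE`, Chairs shows up as a loss even though it is profitable overall, which is the mistake the `HAVING` version avoids.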

In [None]:
print("✅ Step 1: Defining the query string...")

query_string_b2 = """
SELECT
  sub_category,
  ROUND(SUM(profit), 2) AS total_profit
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
GROUP BY
  sub_category
HAVING
  SUM(profit) < 0
ORDER BY
  total_profit ASC
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_b2 = client.query(query_string_b2)

    print("✅ Step 3: Fetching results...")
    results_df_b2 = query_job_b2.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_b2)} rows.")

    if results_df_b2.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying Sub-Categories with Negative Profit ---")
        display(results_df_b2)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 3 rows.

--- Displaying Sub-Categories with Negative Profit ---


Unnamed: 0,sub_category,total_profit
0,Tables,-17725.48
1,Bookcases,-3472.56
2,Supplies,-1189.1


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [None]:
print("✅ Step 1: Defining the query string with a simulated dimension table...")

query_string_c1_join = """
WITH synthetic_product_dimension AS (
  -- This CTE creates our temporary 'dimension' table.
  -- It contains a unique list of each product ID and its name.
  SELECT DISTINCT
    product_id,
    product_name
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
)
SELECT
  p.product_id,
  p.product_name,
  ROUND(SUM(s.sales), 2) AS total_sales
FROM
  `mgmt-467.lab1_foundation.superstore_clean` AS s
JOIN
  synthetic_product_dimension AS p
  ON s.product_id = p.product_id
GROUP BY
  p.product_id,
  p.product_name
ORDER BY
  total_sales DESC
LIMIT 20 -- Adding a LIMIT to keep the initial output manageable
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_c1_join = client.query(query_string_c1_join)

    print("✅ Step 3: Fetching results...")
    results_df_c1_join = query_job_c1_join.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_c1_join)} rows.")

    if results_df_c1_join.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying Top 20 Products by Total Sales (Using a JOIN) ---")
        display(results_df_c1_join)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string with a simulated dimension table...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 20 rows.

--- Displaying Top 20 Products by Total Sales (Using a JOIN) ---


Unnamed: 0,product_id,product_name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.82
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.38
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.48
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.58
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.48
5,OFF-BI-10000545,GBC Ibimaster 500 Manual ProClick Binding System,19024.5
6,TEC-CO-10001449,Hewlett Packard LaserJet 3310 Copier,18839.69
7,TEC-MA-10001127,HP Designjet T520 Inkjet Large Format Printer ...,18374.9
8,OFF-BI-10004995,GBC DocuBind P400 Electric Binding System,17965.07
9,OFF-SU-10000151,High Speed Automatic Electric Letter Opener,17030.31
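
One thing worth checking with a dimension built from `SELECT DISTINCT product_id, product_name`: if any `product_id` maps to more than one `product_name`, the join matches each sale to several dimension rows and the totals are overstated. Whether that happens in your table is something to verify, not something assumed here. The sketch below reproduces the effect on invented rows, using Python's built-in `sqlite3` as a stand-in for BigQuery; `MIN(product_name)` is just one simple way to keep a single name per ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, product_name TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("P1", "Stapler", 10.0),
        ("P1", "Stapler (new label)", 5.0),  # same ID, second spelling of the name
        ("P2", "Binder", 7.0),
    ],
)

# Check: which product IDs map to more than one name?
ambiguous_ids = conn.execute(
    "SELECT product_id FROM sales GROUP BY product_id "
    "HAVING COUNT(DISTINCT product_name) > 1"
).fetchall()

JOIN_TEMPLATE = """
WITH dim AS ({dim_sql})
SELECT p.product_id, p.product_name, SUM(s.sales) AS total_sales
FROM sales AS s
JOIN dim AS p ON s.product_id = p.product_id
GROUP BY p.product_id, p.product_name
ORDER BY p.product_id, p.product_name
"""

# DISTINCT keeps both spellings for P1, so every P1 sale joins to two rows.
distinct_dim = "SELECT DISTINCT product_id, product_name FROM sales"
# One row per product_id: MIN() picks a single name deterministically.
one_row_dim = (
    "SELECT product_id, MIN(product_name) AS product_name "
    "FROM sales GROUP BY product_id"
)

fanned_out = conn.execute(JOIN_TEMPLATE.format(dim_sql=distinct_dim)).fetchall()
deduplicated = conn.execute(JOIN_TEMPLATE.format(dim_sql=one_row_dim)).fetchall()

print(ambiguous_ids)
print(sum(row[2] for row in fanned_out), sum(row[2] for row in deduplicated))
```

A quick validation for any join is to compare the summed metric after the join with the same sum on the fact table alone; if they differ, the dimension has duplicate keys.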


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

Explanation of the Query Steps

First, the `state_sales` CTE aggregates the data to compute the total sales for each state within each region, creating a summarized temporary result. Next, the `ranked_state_sales` CTE takes this summary and applies the `RANK()` window function, which assigns a rank to each state within its region based on total sales, ordered from highest to lowest (states with equal totals share the same rank). Finally, the main `SELECT` statement queries the results of the second CTE and keeps only the rows where `sales_rank` is less than or equal to 3, giving the top three states for each region.
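
One detail worth checking when filtering on a rank: `RANK()` gives tied rows the same rank and then skips ahead, so `sales_rank <= 3` can return more than three rows per region. The sketch below shows this on invented totals, using Python's built-in `sqlite3` as a stand-in for BigQuery (window functions need SQLite 3.25 or newer); `DENSE_RANK()` and `ROW_NUMBER()` are included for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_sales (region TEXT, state TEXT, total_sales REAL)")
conn.executemany(
    "INSERT INTO state_sales VALUES (?, ?, ?)",
    [
        ("West", "A", 400.0),
        ("West", "B", 300.0),
        ("West", "C", 200.0),
        ("West", "D", 200.0),  # ties with C for third place
        ("West", "E", 100.0),
    ],
)

rows = conn.execute(
    """
    SELECT state,
           RANK()       OVER (PARTITION BY region ORDER BY total_sales DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC, state) AS row_num
    FROM state_sales
    ORDER BY row_num
    """
).fetchall()

for row in rows:
    print(row)

# States kept by a "rank <= 3" filter: the tie at third place lets four through.
top3_by_rank = [state for state, rnk, _, _ in rows if rnk <= 3]
print(top3_by_rank)
```

If exactly three rows per region are required, `ROW_NUMBER()` with an explicit tie-breaker in the `ORDER BY` is the usual alternative; if ties should all be shown, `RANK()` is the right choice and the extra rows are expected.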

In [None]:
print("✅ Step 1: Defining the query string...")

query_string_d1 = """
WITH state_sales AS (
  -- CTE 1: Calculate total sales for each state within each region
  SELECT
    region,
    state,
    SUM(sales) AS total_sales
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
  GROUP BY
    region, state
),
ranked_state_sales AS (
  -- CTE 2: Rank the states within each region based on their total sales
  SELECT
    region,
    state,
    total_sales,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
  FROM
    state_sales
)
-- Final SELECT: Filter for the top 3 ranked states in each region
SELECT
  region,
  state,
  ROUND(total_sales, 2) AS total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  region,
  sales_rank;
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_d1 = client.query(query_string_d1)

    print("✅ Step 3: Fetching results...")
    results_df_d1 = query_job_d1.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_d1)} rows.")

    if results_df_d1.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying Top 3 States by Sales Per Region ---")
        display(results_df_d1)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 12 rows.

--- Displaying Top 3 States by Sales Per Region ---


Unnamed: 0,region,state,total_sales,sales_rank
0,Central,Texas,170188.05,1
1,Central,Illinois,80166.1,2
2,Central,Michigan,76269.61,3
3,East,New York,310876.27,1
4,East,Pennsylvania,116511.91,2
5,East,Ohio,78258.14,3
6,South,Florida,89473.71,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.16,3
9,West,California,457687.63,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

### Failure Modes and Solutions for YoY Analysis

1.  **Missing Year Data:** A sub-category might have sales in one year but not the other (e.g., a new product line in 2017). An inner `JOIN` between the two years would silently exclude these cases. **Solution:** The query below uses a conditional aggregation (`SUM(IF(...))`), which treats a missing year's sales as `0`, so brand-new or discontinued sub-categories are still included in the calculation.

2.  **Data Skew from a Single Large Order:** A large, one-time order could make a sub-category appear to have huge growth, when in reality its typical performance hasn't changed. **Solution:** While not implemented in this query, a good follow-up would be to look at `order_count` or `AVG(sales)` in addition to `SUM(sales)`. This helps verify if the growth is consistent across many orders or driven by a single outlier.
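
To make the first failure mode concrete, here is a minimal pure-Python sketch on made-up rows (the sub-category names and amounts are illustrative assumptions, not values from the dataset). It mimics `SUM(IF(year = ..., sales, 0))` and contrasts it with an inner join of two per-year totals.

```python
from collections import defaultdict

# Hypothetical toy rows: (sub_category, year, sales). Not taken from the lab dataset.
rows = [
    ("Phones", 2016, 100.0),
    ("Phones", 2017, 150.0),
    ("Binders", 2016, 80.0),
    ("Binders", 2017, 70.0),
    ("Gadgets", 2017, 40.0),  # new in 2017, no 2016 sales
]

def pivot_conditional(rows, prev_year, curr_year):
    """Mimic SUM(IF(year = Y, sales, 0)): a missing year contributes 0."""
    totals = defaultdict(lambda: {prev_year: 0.0, curr_year: 0.0})
    for sub_category, year, sales in rows:
        if year in (prev_year, curr_year):
            totals[sub_category][year] += sales
    return {
        sub: (t[prev_year], t[curr_year], t[curr_year] - t[prev_year])
        for sub, t in totals.items()
    }

def pivot_inner_join(rows, prev_year, curr_year):
    """Mimic an INNER JOIN of per-year totals: one-year sub-categories drop out."""
    prev, curr = defaultdict(float), defaultdict(float)
    for sub_category, year, sales in rows:
        if year == prev_year:
            prev[sub_category] += sales
        elif year == curr_year:
            curr[sub_category] += sales
    return {
        sub: (prev[sub], curr[sub], curr[sub] - prev[sub])
        for sub in prev.keys() & curr.keys()
    }

conditional = pivot_conditional(rows, 2016, 2017)
joined = pivot_inner_join(rows, 2016, 2017)

print(sorted(conditional))  # includes the 2017-only sub-category
print(sorted(joined))       # the 2017-only sub-category is missing
```

In this toy data the 2017-only sub-category survives the conditional pivot with a 2016 value of `0`, while the inner-join version loses it entirely.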

In [None]:
print("✅ Step 1: Defining the query string...")

# Note: This query is adapted to use 2016 and 2017, as the dataset does not contain 2023-2024 data.
query_string_d2 = """
WITH yearly_sales_pivoted AS (
  -- This CTE pivots the data, creating columns for 2016 and 2017 sales for each sub-category.
  SELECT
    sub_category,
    SUM(IF(EXTRACT(YEAR FROM order_date) = 2016, sales, 0)) AS sales_2016,
    SUM(IF(EXTRACT(YEAR FROM order_date) = 2017, sales, 0)) AS sales_2017
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
  WHERE
    EXTRACT(YEAR FROM order_date) IN (2016, 2017)
  GROUP BY
    sub_category
)
-- Final SELECT: Calculate the year-over-year change and find the top 5
SELECT
  sub_category,
  ROUND(sales_2016, 2) AS sales_2016,
  ROUND(sales_2017, 2) AS sales_2017,
  ROUND(sales_2017 - sales_2016, 2) AS yoy_delta
FROM
  yearly_sales_pivoted
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_d2 = client.query(query_string_d2)

    print("✅ Step 3: Fetching results...")
    results_df_d2 = query_job_d2.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_d2)} rows.")

    if results_df_d2.empty:
        print("\n⚠️ The query returned an empty result. Check if the years 2016 and 2017 exist in the data.")
    else:
        print("\n--- Displaying Top 5 Sub-Categories by YoY Revenue Growth (2016 vs 2017) ---")
        display(results_df_d2)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 5 rows.

--- Displaying Top 5 Sub-Categories by YoY Revenue Growth (2016 vs 2017) ---


Unnamed: 0,sub_category,sales_2016,sales_2017,yoy_delta
0,Phones,78962.03,105340.52,26378.49
1,Binders,49683.32,72788.04,23104.72
2,Accessories,41895.85,59946.23,18050.38
3,Appliances,26050.32,42926.93,16876.62
4,Copiers,49599.41,62899.39,13299.98


## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

### `ROW_NUMBER()` vs. `RANK()`

`ROW_NUMBER()` is used here to ensure that exactly one sub-category is returned per region, even if there's a tie in total sales, because it assigns a unique, sequential integer to each row. In contrast, `RANK()` would assign the same rank to tied sub-categories, which could cause the query to return multiple "top" sub-categories for a region if their sales were identical.
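
The contrast can be checked on a tiny tie case. The sketch below is plain Python that imitates the two window functions over a single partition; the sub-category totals are made up to force a tie and are not taken from the dataset.

```python
def row_number(totals):
    """Mimic ROW_NUMBER() OVER (ORDER BY total_sales DESC): unique 1..n."""
    ordered = sorted(totals, key=lambda r: r[1], reverse=True)
    return [(name, sales, i) for i, (name, sales) in enumerate(ordered, start=1)]

def rank(totals):
    """Mimic RANK() OVER (ORDER BY total_sales DESC): ties share a rank, then a gap."""
    ordered = sorted(totals, key=lambda r: r[1], reverse=True)
    ranked = []
    for i, (name, sales) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == sales:
            ranked.append((name, sales, ranked[-1][2]))  # reuse the tied rank
        else:
            ranked.append((name, sales, i))
    return ranked

# Hypothetical totals for one region, with a deliberate tie at the top.
region_totals = [("Chairs", 500.0), ("Phones", 500.0), ("Tables", 300.0)]

top_by_row_number = [r for r in row_number(region_totals) if r[2] == 1]
top_by_rank = [r for r in rank(region_totals) if r[2] == 1]

print(top_by_row_number)  # one row
print(top_by_rank)        # both tied rows
```

One caveat worth asking the model about: with `ROW_NUMBER()`, which of the tied rows receives `rn = 1` is arbitrary unless the `ORDER BY` includes a tie-breaker column (for example `Sub_Category`).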

In [None]:
print("✅ Step 1: Defining the query string...")

query_string_e1 = """
WITH subcat_sales AS (
  -- First, aggregate sales for each sub-category within each region
  SELECT
    region,
    sub_category,
    SUM(sales) AS total_sales
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
  GROUP BY
    region,
    sub_category
),
ranked_sales AS (
  -- Now, assign a row number to each sub-category within its region based on sales
  SELECT
    region,
    sub_category,
    total_sales,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) as rn
  FROM
    subcat_sales
)
-- Finally, select only the top-ranked sub-category for each region
SELECT
  region,
  sub_category,
  ROUND(total_sales, 2) AS total_sales
FROM
  ranked_sales
WHERE
  rn = 1
ORDER BY
  region;
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_e1 = client.query(query_string_e1)

    print("✅ Step 3: Fetching results...")
    results_df_e1 = query_job_e1.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_e1)} rows.")

    if results_df_e1.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying the Top Sub-Category per Region ---")
        display(results_df_e1)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying the Top Sub-Category per Region ---


Unnamed: 0,region,sub_category,total_sales
0,Central,Chairs,85230.65
1,East,Phones,100614.98
2,South,Phones,58304.44
3,West,Chairs,101781.33


### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string_e2 = """
WITH phones_yearly_revenue AS (
  SELECT
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS yearly_revenue
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
  WHERE
    sub_category = 'Phones'
  GROUP BY
    year
)
SELECT
  year,
  ROUND(yearly_revenue, 2) AS yearly_revenue,
  ROUND(LAG(yearly_revenue, 1) OVER (ORDER BY year), 2) AS prev_year_revenue,
  -- NULLIF guards against divide-by-zero; the first year has no previous value, so its growth is NULL
  ROUND(100.0 * (yearly_revenue - LAG(yearly_revenue, 1) OVER (ORDER BY year)) / NULLIF(LAG(yearly_revenue, 1) OVER (ORDER BY year), 0), 2) AS yoy_growth_pct
FROM
  phones_yearly_revenue
ORDER BY
  year ASC;
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_e2 = client.query(query_string_e2)

    print("✅ Step 3: Fetching results...")
    results_df_e2 = query_job_e2.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_e2)} rows.")

    if results_df_e2.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying YoY Growth for 'Phones' Sub-Category ---")
        display(results_df_e2)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying YoY Growth for 'Phones' Sub-Category ---


Unnamed: 0,year,yearly_revenue,prev_year_revenue,yoy_growth_pct
0,2014,77390.81,,
1,2015,68313.7,77390.81,-11.73
2,2016,78962.03,68313.7,15.59
3,2017,105340.52,78962.03,33.41
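
The guarded YoY formula from E2 (`LAG` plus `NULLIF`) can be mirrored in plain Python as a hand check. This is a sketch, not the notebook's method: the function name `yoy_growth` is an assumption, `None` stands in for SQL `NULL`, and the yearly figures are the rounded 'Phones' revenues from the E2 output.

```python
def yoy_growth(yearly_revenue):
    """Mimic LAG(revenue) OVER (ORDER BY year) with a NULLIF(prev, 0) guard.

    yearly_revenue: list of (year, revenue) tuples.
    Returns (year, revenue, prev_revenue, yoy_pct) tuples; None stands in for NULL.
    """
    rows = []
    prev = None
    for year, revenue in sorted(yearly_revenue):
        if prev is None or prev == 0:
            pct = None  # first year, or a zero base year: no meaningful growth rate
        else:
            pct = round(100.0 * (revenue - prev) / prev, 2)
        rows.append((year, revenue, prev, pct))
        prev = revenue
    return rows

# Rounded yearly 'Phones' revenue as shown in the E2 output.
phones = [(2014, 77390.81), (2015, 68313.70), (2016, 78962.03), (2017, 105340.52)]

for row in yoy_growth(phones):
    print(row)
```

The first year comes back with no growth figure because there is no previous row, and a base year of exactly zero is treated the same way instead of raising a division error.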


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

In [None]:
print("✅ Step 1: Defining the query string...")

query_string_e3 = """
WITH corporate_monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `mgmt-467.lab1_foundation.superstore_clean`
  WHERE
    segment = 'Corporate'
    -- Cost-control tip: For exploration, you could add a date filter like:
    -- AND order_date >= '2017-01-01'
  GROUP BY
    month
)
SELECT
  month,
  ROUND(monthly_revenue, 2) AS monthly_revenue,
  ROUND(
    AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
    2
  ) AS ma_3_month_revenue
FROM
  corporate_monthly_revenue
ORDER BY
  month ASC;
"""

print("✅ Step 2: Sending the query to BigQuery...")

try:
    query_job_e3 = client.query(query_string_e3)

    print("✅ Step 3: Fetching results...")
    results_df_e3 = query_job_e3.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df_e3)} rows.")

    if results_df_e3.empty:
        print("\n⚠️ The query returned an empty result.")
    else:
        print("\n--- Displaying 3-Month Moving Average for Corporate Segment ---")
        display(results_df_e3)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery...
✅ Step 3: Fetching results...
✅ Step 4: Query finished. Found 48 rows.

--- Displaying 3-Month Moving Average for Corporate Segment ---


Unnamed: 0,month,monthly_revenue,ma_3_month_revenue
0,2014-01-01,1701.53,1701.53
1,2014-02-01,1183.67,1442.6
2,2014-03-01,11106.8,4664.0
3,2014-04-01,14131.73,8807.4
4,2014-05-01,9142.0,11460.18
5,2014-06-01,3970.91,9081.55
6,2014-07-01,10032.99,7715.3
7,2014-08-01,7451.77,7151.89
8,2014-09-01,15507.75,10997.5
9,2014-10-01,12637.68,11865.73
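
The first few `ma_3_month_revenue` values above can be reproduced by hand, which is a quick way to confirm what the `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` frame does at the start of the series. The sketch below is plain Python (the function name `trailing_average` is an assumption) applied to the first five monthly revenues from the output.

```python
def trailing_average(values, window=3):
    """Mimic AVG(x) OVER (ORDER BY month ROWS BETWEEN window-1 PRECEDING AND CURRENT ROW)."""
    result = []
    for i in range(len(values)):
        # The frame is shorter than `window` for the first rows, as in SQL.
        frame = values[max(0, i - window + 1): i + 1]
        result.append(round(sum(frame) / len(frame), 2))
    return result

# First five monthly revenues from the Corporate segment output (Jan to May 2014).
monthly_revenue = [1701.53, 1183.67, 11106.80, 14131.73, 9142.00]

ma_3 = trailing_average(monthly_revenue)
print(ma_3)
```

The first two values average over one and two months respectively, so they are not true 3-month averages. Note also that a `ROWS` frame counts rows, not calendar months: if a month had no orders and was missing from the CTE, the window would silently span more than three months.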


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
[PASTE ERROR MESSAGE and the exact SQL here]
Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

### Corrected SQL Query

```sql
WITH corporate_monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month, -- Fixed alias to be consistent
    SUM(sales) AS monthly_revenue
  FROM
    `mgmt-467.lab1_foundation.superstore_clean` -- Fixed table name
  WHERE
    segment = 'Corporate'
  GROUP BY
    month -- Match the alias
)
SELECT
  month,
  ROUND(monthly_revenue, 2) AS monthly_revenue,
  ROUND(
    AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
    2 -- Fixed the number of digits for rounding
  ) AS ma_3_month_revenue
FROM
  corporate_monthly_revenue
ORDER BY
  month ASC;
```

### Quick Sanity Check

To verify the fix, you can run just the CTE part of the query. This ensures the monthly revenue aggregation is working correctly before the window function is applied. If this query runs successfully, the main issue was likely in the window function or final select statement.

```sql
-- Sanity Check Query
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  SUM(sales) AS monthly_revenue
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  segment = 'Corporate'
GROUP BY
  month
ORDER BY
  month ASC
LIMIT 10;
```

### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
[PASTE YOUR SQL HERE]
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

### Three Ways to Optimize the Query

1.  **Enable Partition Pruning with a Direct Date Filter:** The biggest gain would come from partition pruning. The current `WHERE EXTRACT(YEAR FROM order_date) IN (2016, 2017)` clause wraps `order_date` in a function, which typically prevents BigQuery from skipping partitions. If the underlying table were partitioned by `order_date` (e.g., by day or month), you could rewrite the filter as `WHERE order_date >= '2016-01-01' AND order_date < '2018-01-01'`. BigQuery could then read only the relevant partitions, reducing the bytes scanned.

2.  **Pre-Aggregate Results with a Materialized View:** Since this query calculates aggregates (SUM of sales), it is a good candidate for pre-aggregation, especially if the analysis is run frequently. You could create a materialized view that stores the total sales for each sub-category by month. The main query would then run against this much smaller, pre-computed summary table, making it faster and cheaper than re-calculating from the raw data every time.

3.  **Explicitly Prune Columns in CTEs/Views:** BigQuery is a columnar store and prunes unreferenced columns automatically, but in complex queries or nested views it is still good practice to be explicit. Your `superstore_clean` view selects all columns (`*`). An optimized version would be a materialized view that selects *only* the columns needed for your analyses (e.g., `order_date`, `sub_category`, `sales`, `region`, etc.). This keeps queries against the view from reading columns they do not need.
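
Before swapping the `EXTRACT(YEAR ...)` filter for a date range, it helps to confirm the two filters select the same rows. The sketch below is a plain-Python check under the assumption that the years are contiguous; the helper names `year_range_bounds` and `in_range` are illustrative and not part of any BigQuery API.

```python
from datetime import date

def year_range_bounds(first_year, last_year):
    """Return (inclusive_start, exclusive_end) covering whole years first_year..last_year."""
    return date(first_year, 1, 1), date(last_year + 1, 1, 1)

def in_range(d, bounds):
    """Mimic: order_date >= start AND order_date < end (half-open interval)."""
    start, end = bounds
    return start <= d < end

bounds = year_range_bounds(2016, 2017)

# Boundary dates on both sides of the range.
samples = [date(2015, 12, 31), date(2016, 1, 1), date(2017, 12, 31), date(2018, 1, 1)]
for d in samples:
    # The range filter must agree with EXTRACT(YEAR FROM order_date) IN (2016, 2017).
    assert in_range(d, bounds) == (d.year in (2016, 2017))

print(bounds[0].isoformat(), bounds[1].isoformat())
```

The half-open form (`>=` start, `<` end) avoids off-by-one mistakes at year boundaries. The equivalence only holds when the years are contiguous; a filter such as `IN (2015, 2017)` would need two ranges.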

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [None]:
print("--- Query 1: Slicing by Region and Time ---")
query_g1_regional = """
SELECT
  region,
  EXTRACT(YEAR FROM order_date) AS order_year,
  ROUND(SUM(sales), 2) AS total_sales,
  ROUND(SUM(profit), 2) AS total_profit,
  ROUND(AVG(discount) * 100, 2) AS avg_discount_pct
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  region,
  order_year
ORDER BY
  region,
  order_year;
"""

try:
    query_job_g1r = client.query(query_g1_regional)
    results_df_g1r = query_job_g1r.to_dataframe()
    print("Regional and Yearly Breakdown for 'Tables':")
    display(results_df_g1r)
except Exception as e:
    print(f"An error occurred: {e}")

--- Query 1: Slicing by Region and Time ---
Regional and Yearly Breakdown for 'Tables':


Unnamed: 0,region,order_year,total_sales,total_profit,avg_discount_pct
0,Central,2014,7785.48,-1424.33,32.67
1,Central,2015,6857.26,-265.09,20.71
2,Central,2016,13922.93,292.62,20.59
3,Central,2017,10589.31,-2162.85,29.23
4,East,2014,10603.7,-3537.84,38.0
5,East,2015,8884.81,-2275.86,37.33
6,East,2016,7825.33,-2306.78,36.82
7,East,2017,11825.97,-2904.9,37.39
8,South,2014,9940.94,1107.99,11.36
9,South,2015,7370.67,-2171.38,21.87


In [None]:
print("--- Query 2: Controlling for Ship Mode ---")
query_g1_shipmode = """
SELECT
  ship_mode,
  COUNT(*) AS num_orders,
  ROUND(SUM(sales), 2) AS total_sales,
  ROUND(SUM(profit), 2) AS total_profit,
  ROUND(AVG(profit), 2) AS avg_profit_per_order,
  ROUND(AVG(discount) * 100, 2) AS avg_discount_pct
FROM
  `mgmt-467.lab1_foundation.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  ship_mode
ORDER BY
  total_profit;
"""

try:
    query_job_g1s = client.query(query_g1_shipmode)
    results_df_g1s = query_job_g1s.to_dataframe()
    print("Ship Mode Breakdown for 'Tables':")
    display(results_df_g1s)
except Exception as e:
    print(f"An error occurred: {e}")

--- Query 2: Controlling for Ship Mode ---
Ship Mode Breakdown for 'Tables':


Unnamed: 0,ship_mode,num_orders,total_sales,total_profit,avg_profit_per_order,avg_discount_pct
0,Standard Class,190,124826.66,-11910.01,-62.68,27.05
1,Second Class,61,43693.75,-3320.68,-54.44,24.84
2,First Class,47,28800.78,-1365.37,-29.05,24.04
3,Same Day,21,9644.35,-1129.42,-53.78,26.19


### How to Compare These Outcomes

To compare these outcomes, first run both queries. In the regional breakdown, look for regions or years that are disproportionately unprofitable; if the negative profit is isolated to one area or time period, the problem may not be universal. In the ship mode breakdown, check whether certain shipping methods correlate with negative profit, which might suggest that shipping costs for bulky items, not just discounts, are a key factor. If 'Tables' turn out to be profitable under certain conditions (e.g., in a specific region and year, or with a particular ship mode), that refines the initial conclusion that high discounts are the sole cause of unprofitability.
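
One way to put the ship mode results on a comparable footing is to convert each row to a profit margin. The sketch below is plain Python over the figures copied from the ship mode output above; the helper name `profit_margin_pct` is an assumption for illustration.

```python
# Figures copied from the ship mode output above: (ship_mode, total_sales, total_profit).
ship_mode_results = [
    ("Standard Class", 124826.66, -11910.01),
    ("Second Class", 43693.75, -3320.68),
    ("First Class", 28800.78, -1365.37),
    ("Same Day", 9644.35, -1129.42),
]

def profit_margin_pct(total_sales, total_profit):
    """Profit as a percentage of sales, rounded to two decimals."""
    return round(100.0 * total_profit / total_sales, 2)

margins = {
    mode: profit_margin_pct(sales, profit)
    for mode, sales, profit in ship_mode_results
}
total_profit = round(sum(profit for _, _, profit in ship_mode_results), 2)
profitable_modes = [mode for mode, margin in margins.items() if margin > 0]

for mode, margin in margins.items():
    print(f"{mode}: {margin}%")
print("Total profit:", total_profit)
print("Profitable ship modes:", profitable_modes)
```

On these figures every ship mode has a negative margin, so ship mode alone does not rescue 'Tables'. The regional and yearly slice shows more variation and is the more promising place to look for nuance.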

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

The 'Tables' sub-category, despite being a significant source of revenue, incurred a total loss of over $17,700, making it the most unprofitable sub-category in the dataset. This loss is primarily driven by an aggressive discount strategy, especially in the East, where the average discount stayed above 36% in every year from 2014 to 2017, and in the Central region, where it ranged from roughly 21% to 33%, while the sub-category remained profitable in the West with more moderate discount rates. We recommend that regional sales leadership in the Central and East markets cap discounts on 'Tables' at 20%, effective from the start of the next quarter. To measure the impact of this change, we will monitor `total_profit` and `avg_discount_pct` for 'Tables' in these regions on a weekly basis.

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use python‑bigquery client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

In [None]:
import logging
from datetime import datetime
from google.cloud import bigquery

# --- Configuration ---
# IMPORTANT: Replace with your actual Google Cloud project ID.
PROJECT_ID = 'mgmt-467'
DESTINATION_DATASET = 'analytics' # Assumes this dataset exists
DESTINATION_TABLE = 'outputs_kpi'

# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def run_kpi_job(start_date_str, end_date_str):
    """
    Executes a BigQuery query to analyze profitability and saves the results
    to a destination table.

    Args:
        start_date_str (str): The start date for the analysis in 'YYYY-MM-DD' format.
        end_date_str (str): The end date for the analysis in 'YYYY-MM-DD' format.
    """
    try:
        logging.info("Initializing BigQuery client...")
        client = bigquery.Client(project=PROJECT_ID)

        # --- Define Destination and Job Configuration ---
        destination_table_id = f"{PROJECT_ID}.{DESTINATION_DATASET}.{DESTINATION_TABLE}"

        # This defines how the job will run, including the destination for the results.
        job_config = bigquery.QueryJobConfig(
            destination=destination_table_id,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE, # Overwrite table on each run
            query_parameters=[
                bigquery.ScalarQueryParameter("start_date", "DATE", datetime.strptime(start_date_str, "%Y-%m-%d").date()),
                bigquery.ScalarQueryParameter("end_date", "DATE", datetime.strptime(end_date_str, "%Y-%m-%d").date()),
            ]
        )

        # --- Define the SQL Query ---
        # This query calculates sales, profit, and discount metrics for the 'Tables'
        # sub-category, sliced by region and year, within a parameterized date range.
        sql = """
        SELECT
          region,
          EXTRACT(YEAR FROM order_date) AS order_year,
          ROUND(SUM(sales), 2) AS total_sales,
          ROUND(SUM(profit), 2) AS total_profit,
          ROUND(AVG(discount) * 100, 2) AS avg_discount_pct
        FROM
          `mgmt-467.lab1_foundation.superstore_clean`
        WHERE
          sub_category = 'Tables'
          AND order_date BETWEEN @start_date AND @end_date
        GROUP BY
          region,
          order_year
        ORDER BY
          region,
          order_year;
        """

        logging.info(f"Starting BigQuery job. Results will be saved to {destination_table_id}")
        query_job = client.query(sql, job_config=job_config)  # Make an API request.

        # Wait for the job to complete.
        query_job.result()
        logging.info("BigQuery job finished successfully.")

        # --- Verification (Optional) ---
        destination_table = client.get_table(destination_table_id)
        logging.info(f"Loaded {destination_table.num_rows} rows into {destination_table_id}")

    except Exception:
        # Log the full traceback, then re-raise so a scheduler sees the failure.
        logging.exception("KPI job failed.")
        raise

# --- Main execution block ---
if __name__ == "__main__":
    # Example: Run the job for the full year of 2017.
    # In a real scheduled job, you might calculate these dates dynamically.
    start_date = "2017-01-01"
    end_date = "2017-12-31"

    logging.info(f"Running job for date range: {start_date} to {end_date}")
    # To run this, you would uncomment the next line in your script environment
    # run_kpi_job(start_date, end_date)
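
For a scheduled run, the hard-coded dates could be replaced by a range computed at run time. The sketch below is one possible approach, assuming the job should cover the previous full calendar month; the helper name `previous_month_range` is hypothetical. Both ends are inclusive, matching the `BETWEEN @start_date AND @end_date` filter in the script.

```python
from datetime import date, timedelta

def previous_month_range(today):
    """Return (first_day, last_day) of the calendar month before `today` as 'YYYY-MM-DD' strings."""
    first_of_this_month = today.replace(day=1)
    last_of_prev = first_of_this_month - timedelta(days=1)
    first_of_prev = last_of_prev.replace(day=1)
    return first_of_prev.isoformat(), last_of_prev.isoformat()

# In a scheduled job you would pass date.today(); a fixed date keeps the example reproducible.
start_date, end_date = previous_month_range(date(2018, 1, 15))
print(start_date, end_date)  # 2017-12-01 2017-12-31
```

The two strings can be passed directly to `run_kpi_job(start_date, end_date)`, which already parses the `'YYYY-MM-DD'` format.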

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---