<a href="https://colab.research.google.com/github/bulut19/mgmt467-analytics-portfolio/blob/main/Lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or the Gemini chat panel in Colab Enterprise).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`

If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run.  
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
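
As a sketch of how the checklist could be applied consistently, the helper below assembles a prompt from explicit fields. The function `build_sql_prompt` and its parameters are hypothetical (they are not part of Vertex AI or of this lab), and the table path is the placeholder from the dataset assumptions.

```python
def build_sql_prompt(task, table, output_columns, sort=None, limit=None):
    """Assemble an explicit prompt for a SQL-generating model (hypothetical helper)."""
    lines = [
        "Return runnable BigQuery SQL in a single code block.",
        f"Task: {task}",
        f"Table: `{table}`",
        "Output columns: " + ", ".join(f"`{c}`" for c in output_columns),
    ]
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        # Cost control during exploration; drop the limit for the final run.
        lines.append(f"Add: `LIMIT {limit}` to control cost during exploration.")
    # Validation request from the checklist.
    lines.append("After the SQL, briefly explain each clause and give one sanity check.")
    return "\n".join(lines)


prompt = build_sql_prompt(
    task="List all unique `Sub_Category` values sold in the 'West' region.",
    table="[YOUR_PROJECT].superstore_data.sales",
    output_columns=["Sub_Category"],
    sort="alphabetically A→Z",
    limit=100,
)
print(prompt)
```

Keeping the fields separate makes it easy to change one element (for example the sort order) and compare how the generated SQL changes.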

## Install Dependencies

In [None]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [None]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'boxwood-veld-471119-r6' # This is derived from your provided table name
dataset_id = 'lab1_foundation'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object. get_table() accepts a fully qualified
# "project.dataset.table" string; client.dataset() is deprecated.
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Display the first rows of the schema DataFrame for verification
print("Schema DataFrame created:")
print(schema_df.head())

Schema DataFrame created:
         name field_type      mode description
0      Row ID    INTEGER  NULLABLE        None
1    Order ID     STRING  NULLABLE        None
2  Order Date       DATE  NULLABLE        None
3   Ship Date       DATE  NULLABLE        None
4   Ship Mode     STRING  NULLABLE        None


## Clean Column Names

In [None]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `boxwood-veld-471119-r6.lab1_foundation.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `boxwood-veld-471119-r6.lab1_foundation.superstore`;
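
The cleaning step above handles spaces and hyphens only. A minimal sketch of a more general sanitizer follows, assuming you also want to collapse other punctuation (parentheses, slashes, and so on) into underscores. The helper names `clean_column_name` and `alias_expression` are illustrative, not part of the lab.

```python
import re


def clean_column_name(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # Unquoted identifiers cannot start with a digit, so prefix an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def alias_expression(name):
    """Build '`Original Name` AS clean_name'; the original is always backtick-quoted."""
    return f"`{name}` AS {clean_column_name(name)}"


columns = ["Row ID", "Sub-Category", "Postal Code", "Sales"]
print(",\n  ".join(alias_expression(c) for c in columns))
```

Quoting every original name with backticks removes the need for the `if ' ' in original_name` branch, at the cost of slightly noisier SQL.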



## Generate View with standard column naming convention

In [None]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

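# Now, let's print 10 rows from the newly created view to verify.
# Note: list_rows() does not support views, so this attempt fails with
# "Cannot list a table of type VIEW" (see the output). The next cell previews
# the view with a SELECT ... LIMIT 10 query instead.
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")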
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")


View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
An error occurred while fetching rows from the view: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/boxwood-veld-471119-r6/datasets/lab1_foundation/tables/superstore_clean/data?maxResults=10&formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.


In [None]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt-467-47888.lab1_foundation.superstore`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

Output:
```
SELECT DISTINCT Sub_Category
FROM `mgmt-467-47888.lab1_foundation.superstore`
WHERE Region = 'West'
ORDER BY Sub_Category ASC
LIMIT 100;
```



In [None]:
query_string = """
SELECT
    DISTINCT `sub_category` AS Sub_Category
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    Region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Sub_Category
0,Accessories
1,Appliances
2,Art
3,Binders
4,Bookcases
5,Chairs
6,Copiers
7,Envelopes
8,Fasteners
9,Furnishings


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab1_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [None]:
query_string = '''
SELECT
    customer_ID,
    SUM(profit) AS total_profit
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
GROUP BY
    customer_ID
ORDER BY
    total_profit DESC
LIMIT 10
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,customer_ID,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [None]:
query_string = '''
SELECT
    Ship_Mode,
    COUNT(*) AS order_count
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    Category = 'Technology'
GROUP BY
    Ship_Mode
ORDER BY
    order_count DESC
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Ship_Mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98


In [None]:
# sanity check - total count verification
query_string = '''
SELECT
    COUNT(*) AS total_technology_orders
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    Category = 'Technology'
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,total_technology_orders
0,1847


In [None]:
# sanity check - list the categories present for these ship modes to confirm
# that 'Technology' is spelled exactly as in the filter above
query_string = '''
SELECT
    DISTINCT Category
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    Ship_Mode IN ('Standard Class', 'Second Class', 'First Class', 'Same Day')
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Category
0,Office Supplies
1,Furniture
2,Technology


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

In [None]:
query_string = '''
SELECT
    FORMAT_DATE('%Y-%m', Order_Date) AS year_month,
    SUM(Sales) AS monthly_revenue
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    Order_Date >= DATE_SUB((SELECT MAX(DATE_TRUNC(Order_Date, MONTH)) FROM `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`), INTERVAL 12 MONTH)
    AND Order_Date < (SELECT MAX(DATE_TRUNC(Order_Date, MONTH)) FROM `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`)
GROUP BY
    year_month
ORDER BY
    year_month ASC
LIMIT 100
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,year_month,monthly_revenue
0,2016-12,96999.043
1,2017-01,43971.374
2,2017-02,20301.1334
3,2017-03,58872.3528
4,2017-04,36521.5361
5,2017-05,44261.1102
6,2017-06,52981.7257
7,2017-07,45264.416
8,2017-08,63120.888
9,2017-09,87866.652
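
The WHERE clause above anchors the window to the data rather than to today's date: it takes the first day of the latest month in the table as an exclusive end and goes back 12 months from there, so the latest (possibly partial) month is excluded. A plain-Python sketch of the same boundary arithmetic follows; the function name and the exact latest order date are assumptions for illustration.

```python
from datetime import date


def last_full_months_window(latest_order_date, months=12):
    """Return (start, end) with start inclusive and end exclusive.

    end is the first day of the month containing latest_order_date, which is
    treated as a partial month; start is `months` calendar months earlier.
    """
    end = latest_order_date.replace(day=1)
    # Count months since year 0 so that stepping back across January works.
    month_index = end.year * 12 + (end.month - 1) - months
    start = date(month_index // 12, month_index % 12 + 1, 1)
    return start, end


# Assumed latest order date somewhere in December 2017.
start, end = last_full_months_window(date(2017, 12, 30))
print(start, end)  # 2016-12-01 2017-12-01
```

This matches the result above, whose first row is `2016-12`.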


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

- HAVING is used instead of WHERE because we need to filter on the aggregated result SUM(profit) after grouping, whereas WHERE filters individual rows before grouping occurs.
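
To see the difference concretely, here is a small sketch that runs both filters on made-up rows. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery (an assumption for illustration; the two dialects agree on this behavior, but the data and table are not from the lab).

```python
import sqlite3

# Toy rows (sub_category, profit); the values are made up for illustration.
rows = [
    ("Tables", -50.0), ("Tables", 20.0),
    ("Phones", 40.0), ("Phones", -5.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sub_category TEXT, profit REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# HAVING filters groups after SUM(): only sub-categories that lose money overall.
having_result = conn.execute(
    """
    SELECT sub_category, SUM(profit) AS total_profit
    FROM sales
    GROUP BY sub_category
    HAVING SUM(profit) < 0
    ORDER BY total_profit ASC
    """
).fetchall()

# WHERE filters rows before grouping: every sub-category with any losing row.
where_result = conn.execute(
    """
    SELECT sub_category, SUM(profit) AS total_profit
    FROM sales
    WHERE profit < 0
    GROUP BY sub_category
    ORDER BY total_profit ASC
    """
).fetchall()

print(having_result)  # [('Tables', -30.0)]
print(where_result)   # [('Tables', -50.0), ('Phones', -5.0)]
conn.close()
```

Phones is profitable overall, so it only appears when the filter is (incorrectly, for this task) applied to individual rows.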

In [None]:
query_string = '''
SELECT
    sub_category AS Sub_Category,
    SUM(profit) AS total_profit
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
GROUP BY
    sub_category
HAVING
    SUM(profit) < 0
ORDER BY
    total_profit ASC
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Sub_Category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [None]:
query_string = '''
SELECT
    product_id AS Product_ID,
    product_name AS Product_Name,
    SUM(sales) AS total_sales
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
GROUP BY
    product_id, product_name
ORDER BY
    total_sales DESC
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Product_ID,Product_Name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
1889,OFF-AR-10003986,Avery Hi-Liter Pen Style Six-Color Fluorescent...,7.700
1890,OFF-EN-10001535,Grip Seal Envelopes,7.072
1891,OFF-PA-10000048,Xerox 20,6.480
1892,OFF-LA-10003388,Avery 5,5.760
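
The query above aggregates a single table and does not perform a join. If you lack a dimension table, one way to practice the join is to derive a temporary product dimension in a CTE and join back to it. The sketch below shows the shape of that query on made-up rows, using Python's built-in `sqlite3` as a local stand-in for BigQuery (an assumption for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, product_name TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [  # made-up rows, not taken from the Superstore data
        ("P-1", "Stapler", 10.0),
        ("P-1", "Stapler", 15.0),
        ("P-2", "Desk Lamp", 40.0),
    ],
)

joined = conn.execute(
    """
    WITH products AS (        -- temporary dimension: one row per product
        SELECT product_id, MIN(product_name) AS product_name
        FROM sales
        GROUP BY product_id
    )
    SELECT p.product_id, p.product_name, SUM(s.sales) AS total_sales
    FROM sales AS s
    JOIN products AS p ON s.product_id = p.product_id
    GROUP BY p.product_id, p.product_name
    ORDER BY total_sales DESC
    """
).fetchall()

print(joined)  # [('P-2', 'Desk Lamp', 40.0), ('P-1', 'Stapler', 25.0)]
conn.close()
```

The CTE keeps exactly one row per `product_id` so the join cannot duplicate sales rows; picking `MIN(product_name)` is an arbitrary but deterministic choice when an ID carries more than one name.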


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

The first CTE state_sales aggregates sales data by summing total sales for each unique combination of region and state, creating a foundation dataset that shows how much revenue each state generated within its respective region. The second CTE ranked_state_sales takes this aggregated data and applies a window function using RANK() to assign rankings within each region partition, ordering states by their total sales in descending order so the highest-selling state in each region gets rank 1. The final SELECT statement filters the ranked results to show only the top 3 performing states per region (where sales_rank <= 3) and sorts the output first by region alphabetically, then by sales rank numerically, giving us a clear view of the top performers in each geographical area.

In [None]:
query_string = '''
WITH state_sales AS (
    SELECT
        region AS Region,
        state AS State,
        SUM(sales) AS total_sales
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    GROUP BY
        region, state
),
ranked_state_sales AS (
    SELECT
        Region,
        State,
        total_sales,
        RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC) AS sales_rank
    FROM
        state_sales
)
SELECT
    Region,
    State,
    total_sales,
    sales_rank
FROM
    ranked_state_sales
WHERE
    sales_rank <= 3
ORDER BY
    Region, sales_rank
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Region,State,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

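**Missing years failure mode:** If a sub-category has sales in only one of the two years (e.g., a product line launched in 2017 or discontinued after 2016), the FULL OUTER JOIN returns NULL for the missing side, and a delta computed from a NULL is itself NULL. (This dataset covers 2014 to 2017, so the query below compares 2016 with 2017 rather than 2023 with 2024.)

**How to handle:** To compare only sub-categories present in both years, filter out the unmatched rows (equivalent to using an INNER JOIN):

```
WHERE s16.Sub_Category IS NOT NULL AND s17.Sub_Category IS NOT NULL
```

**Zero/NULL sales failure mode:** A sub-category with no rows in one year is absent from `yr_sales` for that year, so its `total_sales` arrives as NULL after the join, causing incorrect delta calculations.

**How to handle:** Use COALESCE to convert NULLs to 0 (already implemented in the query) and optionally add a minimum threshold. Because this condition filters joined rows rather than groups, it belongs in WHERE; HAVING requires a GROUP BY or an aggregate in the query:


```
WHERE COALESCE(s16.total_sales, 0) > 100 AND COALESCE(s17.total_sales, 0) > 100
```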
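
The two handling strategies can be compared side by side in plain Python. This is a minimal sketch with made-up totals; the function name `yoy_delta` and its `require_both` flag are illustrative, with the union of keys standing in for the FULL OUTER JOIN and the intersection for the "present in both years" filter.

```python
def yoy_delta(sales_prev, sales_curr, require_both=False):
    """Compute current-minus-previous sales per key, like a FULL OUTER JOIN.

    sales_prev and sales_curr map sub-category -> total sales for one year.
    A key missing from one year is treated as 0 (the COALESCE behaviour),
    unless require_both=True, which keeps only keys present in both years.
    """
    if require_both:
        keys = sales_prev.keys() & sales_curr.keys()
    else:
        keys = sales_prev.keys() | sales_curr.keys()
    deltas = {k: sales_curr.get(k, 0.0) - sales_prev.get(k, 0.0) for k in keys}
    # Largest increase first, mirroring ORDER BY yoy_delta DESC.
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Made-up totals: "Gadgets" exists only in the later year.
prev_year = {"Phones": 100.0, "Tables": 80.0}
curr_year = {"Phones": 130.0, "Tables": 70.0, "Gadgets": 60.0}

print(yoy_delta(prev_year, curr_year))
print(yoy_delta(prev_year, curr_year, require_both=True))
```

With COALESCE-to-zero, the newly launched "Gadgets" tops the ranking; requiring both years removes it. Which behavior is right depends on whether new sub-categories should count as "most improved".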

In [None]:
query_string = '''
WITH yr_sales AS (
    SELECT
        sub_category AS Sub_Category,
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(sales) AS total_sales
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    GROUP BY
        sub_category, year
)
SELECT
    COALESCE(s16.Sub_Category, s17.Sub_Category) AS Sub_Category,
    COALESCE(s16.total_sales, 0) AS sales_2016,
    COALESCE(s17.total_sales, 0) AS sales_2017,
    COALESCE(s17.total_sales, 0) - COALESCE(s16.total_sales, 0) AS yoy_delta
FROM
    (SELECT * FROM yr_sales WHERE year = 2016) s16
FULL OUTER JOIN
    (SELECT * FROM yr_sales WHERE year = 2017) s17
ON
    s16.Sub_Category = s17.Sub_Category
ORDER BY
    yoy_delta DESC
LIMIT 5
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Sub_Category,sales_2016,sales_2017,yoy_delta
0,Phones,78962.03,105340.516,26378.486
1,Binders,49683.325,72788.045,23104.72
2,Accessories,41895.854,59946.232,18050.378
3,Appliances,26050.315,42926.932,16876.617
4,Copiers,49599.41,62899.388,13299.978


## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

ROW_NUMBER() guarantees exactly one result per region even if multiple sub-categories have identical sales totals, while RANK() would return multiple rows in case of ties. Since the task specifically asks for "only the single highest-revenue Sub_Category" per region, ROW_NUMBER() ensures we get exactly one winner per region regardless of ties.
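
The tie behavior is easy to check on a toy example. The sketch below emulates both functions in plain Python on made-up totals with a tie for first place; it is an illustration of the semantics, not how BigQuery implements them.

```python
from itertools import groupby


def rank_within_partition(rows, use_row_number):
    """Rank (region, sub_category, total_sales) rows by sales within each region.

    use_row_number=True numbers rows 1, 2, 3, ... even when sales tie;
    use_row_number=False gives tied rows the same rank, like RANK().
    """
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    ranked = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        prev_sales, prev_rank = None, 0
        for position, row in enumerate(group, start=1):
            if use_row_number or row[2] != prev_sales:
                prev_rank = position
            prev_sales = row[2]
            ranked.append(row + (prev_rank,))
    return ranked


# Made-up totals with a tie for first place in the West region.
rows = [("West", "Chairs", 100.0), ("West", "Phones", 100.0), ("West", "Tables", 50.0)]

top_by_row_number = [r for r in rank_within_partition(rows, True) if r[3] == 1]
top_by_rank = [r for r in rank_within_partition(rows, False) if r[3] == 1]
print(len(top_by_row_number), len(top_by_rank))  # 1 2
```

In SQL, which of the tied rows `ROW_NUMBER()` picks is not defined unless the `ORDER BY` includes a tie-breaker (for example, adding `Sub_Category` after `total_sales DESC`).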

In [None]:
query_string = '''
WITH subcat_sales AS (
    SELECT
        region AS Region,
        sub_category AS Sub_Category,
        SUM(sales) AS total_sales,
        ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS rn
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    GROUP BY
        region, sub_category
)
SELECT
    Region,
    Sub_Category,
    total_sales
FROM
    subcat_sales
WHERE
    rn = 1
ORDER BY
    Region
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,Region,Sub_Category,total_sales
0,Central,Chairs,85230.646
1,East,Phones,100614.982
2,South,Phones,58304.438
3,West,Chairs,101781.328


### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

The CASE statement handles three cases:

- NULL previous year (first year in the dataset) → returns NULL for `yoy_pct`
- Zero previous revenue (divide by zero) → returns NULL for `yoy_pct`
- Valid previous year → calculates the normal percentage growth

In [None]:
query_string = '''
WITH phones_yearly AS (
    SELECT
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(sales) AS yearly_revenue
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    WHERE
        sub_category = 'Phones'
    GROUP BY
        year
)
SELECT
    year,
    yearly_revenue,
    LAG(yearly_revenue) OVER (ORDER BY year) AS prev_revenue,
    CASE
        WHEN LAG(yearly_revenue) OVER (ORDER BY year) IS NULL
             OR LAG(yearly_revenue) OVER (ORDER BY year) = 0
        THEN NULL
        ELSE 100.0 * (yearly_revenue - LAG(yearly_revenue) OVER (ORDER BY year)) / LAG(yearly_revenue) OVER (ORDER BY year)
    END AS yoy_pct
FROM
    phones_yearly
ORDER BY
    year ASC
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,year,yearly_revenue,prev_revenue,yoy_pct
0,2014,77390.806,,
1,2015,68313.702,77390.806,-11.728918
2,2016,78962.03,68313.702,15.587397
3,2017,105340.516,78962.03,33.406545
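
The guard logic can be verified independently of BigQuery. The sketch below recomputes `yoy_pct` in plain Python from the yearly revenue values in the result above; the function name is illustrative, and the running `previous` variable plays the role of `LAG(yearly_revenue)`.

```python
def yoy_pct(current, previous):
    """Percentage growth versus the previous year, or None when undefined."""
    if previous is None or previous == 0:
        return None  # first year in the series, or division by zero
    return 100.0 * (current - previous) / previous


# Yearly 'Phones' revenue as reported in the result above.
yearly_revenue = [(2014, 77390.806), (2015, 68313.702), (2016, 78962.03), (2017, 105340.516)]

previous = None  # plays the role of LAG(yearly_revenue)
for year, revenue in yearly_revenue:
    print(year, revenue, previous, yoy_pct(revenue, previous))
    previous = revenue
```

The printed percentages should agree with the `yoy_pct` column above up to rounding, which is a quick way to confirm the query's arithmetic.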


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

**Cost control note:** Uncomment the date filter `AND order_date >= '2017-01-01'` to limit data processing to recent months while developing and testing the query.

In [None]:
query_string = '''
WITH corporate_monthly AS (
    SELECT
        DATE_TRUNC(order_date, MONTH) AS month,
        SUM(sales) AS monthly_revenue
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    WHERE
        segment = 'Corporate'
        -- AND order_date >= '2017-01-01'  -- Cost control: restrict date range while iterating
    GROUP BY
        month
)
SELECT
    month,
    monthly_revenue,
    AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
    corporate_monthly
ORDER BY
    month ASC
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)

Unnamed: 0,month,monthly_revenue,ma_3
0,2014-01-01,1701.528,1701.528
1,2014-02-01,1183.668,1442.598
2,2014-03-01,11106.799,4663.998333
3,2014-04-01,14131.729,8807.398667
4,2014-05-01,9142.0,11460.176
5,2014-06-01,3970.914,9081.547667
6,2014-07-01,10032.988,7715.300667
7,2014-08-01,7451.774,7151.892
8,2014-09-01,15507.745,10997.502333
9,2014-10-01,12637.678,11865.732333
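
The frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` averages the current row with up to two earlier rows, which is why the first two `ma_3` values are averages of one and two months. A plain-Python sketch of the same row-based frame, checked against the first three rows above (the function name is illustrative):

```python
def trailing_average(values, window=3):
    """Average each value with up to window-1 preceding values (row-based frame)."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


# First three monthly revenues from the result above.
monthly_revenue = [1701.528, 1183.668, 11106.799]
print([round(v, 6) for v in trailing_average(monthly_revenue)])
```

Because the frame counts rows rather than calendar months, a month with no Corporate sales would be skipped silently and the "3-month" average would span a longer period; checking for gaps in `month` is a reasonable sanity check.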


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:

SELECT
    priority AS order_priority,
    ship_mode,
    sub_category,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS avg_discount,
    SUM(sales) / COUNT(*) AS avg_order_value,
    COUNT(*) AS order_count
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    priority, ship_mode, sub_category
ORDER BY
    priority, ship_mode

---------------------------------------------------------------------------
BadRequest                                Traceback (most recent call last)
/tmp/ipython-input-4182244379.py in <cell line: 0>()
     20 '''
     21
---> 22 results_df = client.query(query_string).to_dataframe()
     23 display(results_df)

9 frames
/usr/local/lib/python3.11/dist-packages/google/api_core/future/polling.py in result(self, timeout, retry, polling)
    259             # pylint: disable=raising-bad-type
    260             # Pylint doesn't recognize that this is valid in this case.
--> 261             raise self._exception
    262
    263         return self._result

BadRequest: 400 Unrecognized name: priority; Did you mean profit? at [3:5]; reason: invalidQuery, location: query, message: Unrecognized name: priority; Did you mean profit? at [3:5]

Location: US
Job ID: 0ccd4cc3-4590-47b0-bd3c-223ac109a2a4

Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

Output:

**Root cause:** The column priority doesn't exist in your table - BigQuery is suggesting you meant profit instead.

**Corrected SQL:**

```
SELECT
    ship_mode,
    sub_category,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS avg_discount,
    SUM(sales) / COUNT(*) AS avg_order_value,
    COUNT(*) AS order_count
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode, sub_category
ORDER BY
    ship_mode
```

**Sanity check query:**


```
SELECT column_name
FROM `boxwood-veld-471119-r6`.lab1_foundation.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'superstore_clean'
```

**Rationale:** The priority column doesn't exist in your dataset, so I removed it from the SELECT and GROUP BY clauses. The sanity check will show all available columns to confirm what fields are actually present in your table.



### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.

WITH corporate_monthly AS (
    SELECT
        DATE_TRUNC(order_date, MONTH) AS month,
        SUM(sales) AS monthly_revenue
    FROM
        `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    WHERE
        segment = 'Corporate'
        -- AND order_date >= '2017-01-01'  -- Cost control: restrict date range while iterating
    GROUP BY
        month
)
SELECT
    month,
    monthly_revenue,
    AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
    corporate_monthly
ORDER BY
    month ASC
    
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

**Output:**

3 ways to reduce scanned bytes and improve performance:

**1. Activate the date partition filter**

```
AND order_date >= '2017-01-01'
```
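This reduces scanned bytes only if the underlying table is partitioned or clustered on `order_date`; on an unpartitioned table BigQuery still reads the full `order_date`, `sales`, and `segment` columns. The filter also changes the result set, so use it while exploring and remove it for the final run.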

**2. Column pruning in the base table scan**

```
WITH corporate_monthly AS (
    SELECT
        DATE_TRUNC(order_date, MONTH) AS month,
        SUM(sales) AS monthly_revenue
    FROM (
        SELECT order_date, sales  -- Only scan needed columns
        FROM `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
        WHERE segment = 'Corporate'
        AND order_date >= '2017-01-01'
    )
    GROUP BY month
)
```
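BigQuery storage is columnar and bills only for the columns a query references, so this version reads just `order_date`, `sales`, and `segment`. The original query already referenced only those three columns, so the subquery mainly makes the pruning explicit; the real saving comes from never widening the scan with `SELECT *`.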


**3. Pre-filter before expensive operations**

```
WITH filtered_data AS (
    SELECT order_date, sales
    FROM `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
    WHERE segment = 'Corporate'
    AND order_date >= '2017-01-01'  -- Apply filters early
),
corporate_monthly AS (
    SELECT
        DATE_TRUNC(order_date, MONTH) AS month,
        SUM(sales) AS monthly_revenue
    FROM filtered_data
    GROUP BY month
)
```
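This applies the segment and date filters before any aggregation, so later steps process fewer rows. BigQuery's optimizer usually pushes such filters down on its own, so treat this mainly as a readability gain and confirm any byte savings with a dry run.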
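
To check whether any of these rewrites reduces scanned bytes, a dry run can report the estimate without executing the query. The sketch below assumes the `google-cloud-bigquery` package and an authenticated client; the helper names and the default price of 6.25 USD per TiB are assumptions, so check current on-demand pricing for your region. The byte counts in the offline example are made up.

```python
BYTES_PER_TIB = 2 ** 40


def estimate_cost_usd(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Convert scanned bytes to an approximate on-demand cost.

    usd_per_tib is an assumed price, not a value from the lab.
    """
    return bytes_processed / BYTES_PER_TIB * usd_per_tib


def dry_run_bytes(client, sql: str) -> int:
    """Ask BigQuery how many bytes a query would scan, without running it."""
    # Imported lazily so the cost helper above works without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Offline example: compare two hypothetical dry-run results.
before = 5 * 2 ** 30  # 5 GiB (made-up number)
after = 1 * 2 ** 30   # 1 GiB (made-up number)
saved = estimate_cost_usd(before) - estimate_cost_usd(after)
print(f"Estimated saving per run: ${saved:.4f}")
```

Calling `dry_run_bytes(client, sql)` on the original and the rewritten query gives two numbers to compare; if they match, the rewrite changed readability but not cost.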

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

```
# Slice by region and time
query_string = '''
SELECT
    region,
    EXTRACT(YEAR FROM order_date) AS year,
    sub_category,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS avg_discount,
    COUNT(*) AS order_count
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    region, year, sub_category
ORDER BY
    region, year
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)
```

|  | region | year | sub_category | total_sales | total_profit | avg_discount | order_count |
|---|---|---|---|---|---|---|---|
| 0 | Central | 2014 | Tables | 7785.478 | -1424.331 | 0.326667 | 15 |
| 1 | Central | 2015 | Tables | 6857.26 | -265.0939 | 0.207143 | 14 |
| 2 | Central | 2016 | Tables | 13922.926 | 292.6211 | 0.205882 | 17 |
| 3 | Central | 2017 | Tables | 10589.307 | -2162.8466 | 0.292308 | 26 |
| 4 | East | 2014 | Tables | 10603.704 | -3537.8375 | 0.38 | 20 |
| 5 | East | 2015 | Tables | 8884.806 | -2275.8641 | 0.373333 | 15 |
| 6 | East | 2016 | Tables | 7825.328 | -2306.7783 | 0.368182 | 22 |
| 7 | East | 2017 | Tables | 11825.969 | -2904.9002 | 0.373913 | 23 |
| 8 | South | 2014 | Tables | 9940.9445 | 1107.9902 | 0.113636 | 11 |
| 9 | South | 2015 | Tables | 7370.6745 | -2171.3765 | 0.21875 | 8 |


```
# Control for ship mode (the table has no order-priority column)
query_string = '''
SELECT
    ship_mode,
    sub_category,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS avg_discount,
    SUM(sales) / COUNT(*) AS avg_order_value,
    COUNT(*) AS order_count
FROM
    `boxwood-veld-471119-r6.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode, sub_category
ORDER BY
    ship_mode
'''

results_df = client.query(query_string).to_dataframe()
display(results_df)
```

|  | ship_mode | sub_category | total_sales | total_profit | avg_discount | avg_order_value | order_count |
|---|---|---|---|---|---|---|---|
| 0 | First Class | Tables | 28800.776 | -1365.3665 | 0.240426 | 612.782468 | 47 |
| 1 | Same Day | Tables | 9644.347 | -1129.4225 | 0.261905 | 459.254619 | 21 |
| 2 | Second Class | Tables | 43693.7475 | -3320.6799 | 0.248361 | 716.290943 | 61 |
| 3 | Standard Class | Tables | 124826.6615 | -11910.0122 | 0.270526 | 656.982429 | 190 |


**Comparison approach:** These queries could reveal that Tables' negative profitability might be concentrated in specific regions (perhaps due to local competition) or time periods (seasonal promotions), or that certain shipping methods actually generate positive profits despite discounts, suggesting the loss-making pattern isn't universal but depends on operational context. Compare profit margins across these dimensions to see if the "high discounts = negative profit" narrative holds consistently or if there are profitable segments within Tables that could inform better pricing strategies.
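
One way to make this comparison concrete is to turn the totals into profit margins (profit divided by sales) per slice. The sketch below does that for the four ship-mode rows reported above; the helper name `profit_margin` is illustrative, and the figures are copied from the query output rather than re-queried.

```python
# (ship_mode, total_sales, total_profit) copied from the ship-mode query output.
rows = [
    ("First Class", 28800.776, -1365.3665),
    ("Same Day", 9644.347, -1129.4225),
    ("Second Class", 43693.7475, -3320.6799),
    ("Standard Class", 124826.6615, -11910.0122),
]


def profit_margin(sales: float, profit: float) -> float:
    """Profit as a fraction of sales; negative means the slice loses money."""
    return profit / sales


margins = {mode: profit_margin(sales, profit) for mode, sales, profit in rows}

# Print the slices from worst to best margin.
for mode, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{mode:<15} {margin:7.2%}")

all_negative = all(margin < 0 for margin in margins.values())
print("Loss-making in every ship mode:", all_negative)
```

On these four rows every ship mode has a negative margin, so ship mode alone does not overturn the Tables finding; the region-by-year slice is where profitable pockets would show up, if there are any.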

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

**Executive Summary:**

The Tables sub-category lost $17,725 in total profit despite high sales volume, with Standard Class shipping accounting for $11,910 in losses across 190 orders. The losses stem from excessive discount rates averaging 24-27% across all shipping methods, which appear unsustainable given the product's cost structure and operational expenses. The merchandising team should reduce Standard Class discounts to under 20% by March 2025 while testing regional pricing strategies in profitable markets like South and West regions. Monitor weekly profit margins by shipping mode to track whether discount optimization moves Tables from loss-making to break-even within 90 days.
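
Before a summary like this is shared, the headline figures are worth recomputing from the query output. The short sketch below sums the per-ship-mode profits reported in Part G and derives Standard Class's share of the loss; the variable names are illustrative.

```python
# total_profit per ship mode, copied from the Part G ship-mode query output.
profit_by_ship_mode = {
    "First Class": -1365.3665,
    "Same Day": -1129.4225,
    "Second Class": -3320.6799,
    "Standard Class": -11910.0122,
}

total_profit = sum(profit_by_ship_mode.values())
standard_share = profit_by_ship_mode["Standard Class"] / total_profit

print(f"Total Tables profit: {total_profit:,.2f}")
print(f"Standard Class share of the loss: {standard_share:.1%}")
```

If a recomputed figure disagrees with the prose, fix the summary rather than the data.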


### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use the google-cloud-bigquery Python client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

**Output: automated Python job for KPI calculation**

This script converts the moving average analysis into a reusable and schedulable Python job. It connects to BigQuery, runs a parameterized query to calculate the 3-month moving average of revenue for the 'Corporate' segment, and saves the results into a new destination table for reporting and further analysis. The script includes configuration for your project, error handling, logging, and ensures the destination dataset exists before running.

```
import logging
from datetime import datetime, timedelta
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

# --- Configuration ---
# The project ID is hardcoded here; replace it with your own project ID.
PROJECT_ID = "boxwood-veld-471119-r6"
DESTINATION_DATASET = "analytics"
DESTINATION_TABLE = "outputs_kpi_moving_avg"
SOURCE_TABLE = f"`{PROJECT_ID}.lab1_foundation.superstore_clean`"

# Configure logging to see the script's progress and any errors
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def create_dataset_if_not_exists(client: bigquery.Client, dataset_id: str):
    """Creates a BigQuery dataset if it does not already exist."""
    full_dataset_id = f"{client.project}.{dataset_id}"
    try:
        client.get_dataset(full_dataset_id)
        logging.info(f"Dataset {full_dataset_id} already exists.")
    except NotFound:
        logging.info(f"Dataset {full_dataset_id} not found, creating it...")
        dataset = bigquery.Dataset(full_dataset_id)
        # Specify the location, e.g., "US". It should match the source dataset's location.
        dataset.location = "US" # Change to your dataset's location if needed
        client.create_dataset(dataset, timeout=30)
        logging.info(f"Successfully created dataset {full_dataset_id}.")

def run_moving_average_job(start_date: str, end_date: str):
    """
    Runs a BigQuery job to calculate the 3-month moving average for the 'Corporate'
    segment and saves the results to a destination table.

    Args:
        start_date (str): The start date for the analysis in 'YYYY-MM-DD' format.
        end_date (str): The end date for the analysis in 'YYYY-MM-DD' format.
    """
    logging.info("Starting BigQuery KPI job...")

    try:
        # Initialize the BigQuery client for the configured project.
        client = bigquery.Client(project=PROJECT_ID)

        # Ensure the destination dataset exists before trying to write to it.
        create_dataset_if_not_exists(client, DESTINATION_DATASET)

        # This query calculates the 3-month moving average of revenue.
        # It uses @start_date and @end_date as parameters for security and flexibility.
        sql = f"""
            WITH corporate_monthly AS (
                SELECT
                    DATE_TRUNC(order_date, MONTH) AS month,
                    SUM(sales) AS monthly_revenue
                FROM
                    {SOURCE_TABLE}
                WHERE
                    segment = 'Corporate'
                    AND order_date BETWEEN @start_date AND @end_date
                GROUP BY
                    month
            )
            SELECT
                month,
                monthly_revenue,
                AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS revenue_3_month_moving_avg,
                CURRENT_TIMESTAMP() as load_timestamp
            FROM
                corporate_monthly
            ORDER BY
                month ASC;
        """

        # Configure the job to write to a destination table.
        # WRITE_TRUNCATE will overwrite the table each time the job runs.
        destination_table_ref = f"{PROJECT_ID}.{DESTINATION_DATASET}.{DESTINATION_TABLE}"
        job_config = bigquery.QueryJobConfig(
            destination=destination_table_ref,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            query_parameters=[
                bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
                bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
            ]
        )

        logging.info(f"Executing query and writing results to {destination_table_ref}...")

        # Start the query job and wait for it to complete.
        query_job = client.query(sql, job_config=job_config)
        query_job.result()  # Waits for the job to finish.

        logging.info(f"Job completed successfully. Results are in {destination_table_ref}.")

    except Exception as e:
        logging.error(f"An error occurred during the BigQuery job: {e}", exc_info=True)
        # Re-raise so a scheduler sees the run as failed instead of silently succeeding.
        raise


# --- Main execution block ---
if __name__ == '__main__':
    # This part runs when the script is executed directly.
    # For a scheduled job, you might get dates from an event payload.

    # We'll calculate the metric for the last 3 years as an example.
    end_date_obj = datetime.now()
    start_date_obj = end_date_obj - timedelta(days=3*365)

    start_date_str = start_date_obj.strftime('%Y-%m-%d')
    end_date_str = end_date_obj.strftime('%Y-%m-%d')

    logging.info(f"Running job for date range: {start_date_str} to {end_date_str}")
    run_moving_average_job(start_date=start_date_str, end_date=end_date_str)
```
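
For a scheduled run, the date range would typically come from the trigger rather than from `datetime.now()`. The helper below is a minimal sketch of that step, assuming the scheduler passes a dict-like payload; the function name `resolve_date_range` and the payload keys `start_date` and `end_date` are hypothetical. It falls back to a trailing window when no dates are supplied and rejects an inverted range.

```python
from datetime import date, timedelta


def resolve_date_range(payload, today=None, default_days=3 * 365):
    """Return (start, end) as 'YYYY-MM-DD' strings.

    Uses payload["start_date"] / payload["end_date"] when present (hypothetical
    keys); otherwise falls back to the `default_days` days ending today.
    """
    today = today or date.today()
    end = date.fromisoformat(payload["end_date"]) if payload.get("end_date") else today
    start = (
        date.fromisoformat(payload["start_date"])
        if payload.get("start_date")
        else end - timedelta(days=default_days)
    )
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start.isoformat(), end.isoformat()


# Explicit range supplied by the trigger.
print(resolve_date_range({"start_date": "2014-01-01", "end_date": "2017-12-31"}))

# No dates supplied: trailing window ending on a fixed "today" for reproducibility.
print(resolve_date_range({}, today=date(2017, 12, 31)))
```

The two returned strings can be passed straight to `run_moving_average_job(start_date=..., end_date=...)`. The query results shown in Part G only cover 2014 to 2017, so a default window anchored on the current date may select no rows and, with `WRITE_TRUNCATE`, leave the destination table empty; passing an explicit range avoids that.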

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---