# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or Gemini in Colab Enterprise chat panel).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`  
If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
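- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run. Keep in mind that `LIMIT` mainly trims the result set; on non-clustered tables it does not reduce the bytes scanned, so also ask the model to select only the columns you need.  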
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
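
The checklist above can be turned into a small prompt template so that no item gets forgotten. The following is a minimal sketch, not part of the lab: the function name `build_sql_prompt` and its parameters are hypothetical, and the wording of each line is only one way to phrase the request.

```python
def build_sql_prompt(task, table, output_columns, filters=None, sort=None,
                     limit=None, explain=False):
    """Assemble an explicit prompt covering table, filters, outputs, sort, and limit."""
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output columns: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {limit}` during exploration.")
    if explain:
        # Validation request from the checklist, kept optional.
        lines.append("After the SQL, explain each clause and suggest one sanity check.")
    return "\n".join(lines)


prompt = build_sql_prompt(
    task="List all unique sub-categories sold in the 'West' region.",
    table="[YOUR_PROJECT].superstore_data.sales",
    output_columns=["Sub_Category"],
    filters=["`Region` = 'West'"],
    sort="alphabetically A to Z",
    limit=100,
)
print(prompt)
```

Paste the printed text into the model as you would any other prompt in this notebook.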

## Install Dependencies

In [1]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [2]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'noble-broker-471012-q6' # This is derived from your provided table name
dataset_id = 'lab1_foundations_EthanLouie'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object from its fully qualified ID
# (client.dataset() is deprecated in recent versions of google-cloud-bigquery)
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Display the schema DataFrame (optional, for verification)
print("Schema DataFrame created:")
print(schema_df)


Schema DataFrame created:
             name field_type      mode description
0          Row ID    INTEGER  NULLABLE        None
1        Order ID     STRING  NULLABLE        None
2      Order Date       DATE  NULLABLE        None
3       Ship Date       DATE  NULLABLE        None
4       Ship Mode     STRING  NULLABLE        None
5     Customer ID     STRING  NULLABLE        None
6   Customer Name     STRING  NULLABLE        None
7         Segment     STRING  NULLABLE        None
8         Country     STRING  NULLABLE        None
9            City     STRING  NULLABLE        None
10          State     STRING  NULLABLE        None
11    Postal Code    INTEGER  NULLABLE        None
12         Region     STRING  NULLABLE        None
13     Product ID     STRING  NULLABLE        None
14       Category     STRING  NULLABLE        None
15   Sub-Category     STRING  NULLABLE        None
16   Product Name     STRING  NULLABLE        None
17          Sales      FLOAT  NULLABLE        None
18   

## Clean Column Names

In [3]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore`;



## Generate View with standard column naming convention

In [4]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

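# Now, let's try to print 10 rows from the newly created view to verify.
# Note: list_rows() uses the tabledata.list API, which does not support views,
# so this attempt fails (see the error in the output); query the view with SQL instead.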
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")



View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundations_EthanLouie'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
An error occurred while fetching rows from the view: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/noble-broker-471012-q6/datasets/lab1_foundations_EthanLouie/tables/superstore_clean/data?maxResults=10&formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.


In [5]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?
**The query was straightforward, but I needed to evaluate how the previous cell assigned the query string to a variable, which was then used to build the DataFrame.**

In [6]:
query_string = """
SELECT
    DISTINCT `sub_category` AS Sub_Category
FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE
    Region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""
query_job = client.query(query_string)
results_df = query_job.to_dataframe()
display(results_df)

Unnamed: 0,Sub_Category
0,Accessories
1,Appliances
2,Art
3,Binders
4,Bookcases
5,Chairs
6,Copiers
7,Envelopes
8,Fasteners
9,Furnishings


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [7]:
# Step 1: Define the SQL query string
query_string = """
SELECT
  customer_id,
  SUM(profit) AS total_profit
FROM
  `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
GROUP BY
  customer_id
ORDER BY
  total_profit DESC
LIMIT 10;
"""

query_job = client.query(query_string)
results_df = query_job.to_dataframe()
display(results_df)

Unnamed: 0,customer_id,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [8]:
query_string = """
SELECT
  ship_mode,
  COUNT(*) AS order_count
FROM
  `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE
  category = 'Technology'
GROUP BY
  ship_mode
ORDER BY
  order_count DESC;"""

query_job = client.query(query_string)
results_df = query_job.to_dataframe()
display(results_df)

Unnamed: 0,ship_mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98


In [19]:
#SANITY CHECK 1

query_string_total = """
SELECT COUNT(*) AS total_technology_orders
FROM `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE category = 'Technology'
"""
query_job_total = client.query(query_string_total)
results_df_total = query_job_total.to_dataframe()
display(results_df_total)

Unnamed: 0,total_technology_orders
0,1847


In [20]:
#SANITY CHECK 2

query_string_standard = """
SELECT COUNT(*) AS standard_class_technology_orders
FROM `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE category = 'Technology' AND ship_mode = 'Standard Class'
"""
query_job_standard = client.query(query_string_standard)
results_df_standard = query_job_standard.to_dataframe()
display(results_df_standard)

Unnamed: 0,standard_class_technology_orders
0,1082
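
The two sanity checks above can be compared against the grouped result with a few lines of arithmetic. This is only a sketch that reuses the counts printed in the outputs; the variable names are illustrative. Note also that `COUNT(*)` counts rows, and the sample rows earlier show the same `order_id` on several lines, so these figures are line-item counts rather than distinct orders.

```python
# Counts copied from the grouped query output above.
counts_by_ship_mode = {
    "Standard Class": 1082,
    "Second Class": 366,
    "First Class": 301,
    "Same Day": 98,
}
total_technology_rows = 1847  # result of sanity check 1
standard_class_rows = 1082    # result of sanity check 2

# Check 1: the group counts must add up to the ungrouped total.
group_sum = sum(counts_by_ship_mode.values())
totals_match = group_sum == total_technology_rows

# Check 2: one group recomputed with its own WHERE clause must match.
standard_matches = counts_by_ship_mode["Standard Class"] == standard_class_rows

print(group_sum, totals_match, standard_matches)  # 1847 True True
```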


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

In [9]:
# Define the SQL query string for monthly revenue
# This query reads the original 'superstore' table rather than the 'superstore_clean' view,
# so the date column keeps its original name and must be quoted with backticks: `Order Date`.
monthly_revenue_query_string = """
SELECT
  FORMAT_DATE('%Y-%m', DATE_TRUNC(`Order Date`, MONTH)) AS year_month,
  SUM(Sales) AS monthly_revenue
FROM
  `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore`
WHERE
  `Order Date` >= DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 12 MONTH)
  AND `Order Date` < DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY
  year_month
ORDER BY
  year_month ASC
LIMIT 100;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
monthly_revenue_query_job = client.query(monthly_revenue_query_string)

# Fetch results into a Pandas DataFrame
monthly_revenue_df = monthly_revenue_query_job.to_dataframe()

# Display the results
display(monthly_revenue_df)


Unnamed: 0,year_month,monthly_revenue


The result is empty because the dataset ends in 2017, so no rows fall within the last 12 full months counted back from the current date.
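
The empty result comes from where the 12-month window is anchored. The sketch below computes the window bounds in plain Python for two anchors: a present-day date and a hypothetical last order date in 2017 (the exact last date in the dataset is an assumption here). The helper name `last_12_full_months` is made up for illustration. One way to get non-empty results in SQL would be to anchor the window on the latest order date in the table instead of `CURRENT_DATE()`.

```python
from datetime import date


def last_12_full_months(anchor):
    """Return (start, end) where [start, end) spans the 12 complete months before anchor's month."""
    end = anchor.replace(day=1)             # first day of the anchor's month (exclusive bound)
    start = end.replace(year=end.year - 1)  # same month, one year earlier (inclusive bound)
    return start, end


# Anchored to a present-day date, the window lies entirely after 2017.
print(last_12_full_months(date(2025, 3, 15)))

# Anchored to a hypothetical last order date, the window covers months the dataset contains.
print(last_12_full_months(date(2017, 12, 30)))
```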

### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.


HAVING is used because the condition tests an aggregate, `SUM(Profit)`, which only exists after GROUP BY has run; WHERE is evaluated before grouping, so it can filter individual rows but cannot reference aggregated values.
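
To make the WHERE-versus-HAVING difference concrete, here is a small sketch in plain Python. The rows are made up for illustration and are not taken from the Superstore data.

```python
from collections import defaultdict

# Made-up rows: (sub_category, profit)
rows = [
    ("Tables", -50.0),
    ("Tables", 20.0),
    ("Chairs", 30.0),
    ("Chairs", -5.0),
]

# GROUP BY sub_category with SUM(profit)
totals = defaultdict(float)
for sub_category, profit in rows:
    totals[sub_category] += profit

# HAVING SUM(profit) < 0: filter whole groups after aggregation
having_result = {k: v for k, v in totals.items() if v < 0}

# WHERE profit < 0: filter individual rows before aggregation
where_totals = defaultdict(float)
for sub_category, profit in rows:
    if profit < 0:
        where_totals[sub_category] += profit

print(having_result)       # {'Tables': -30.0}
print(dict(where_totals))  # {'Tables': -50.0, 'Chairs': -5.0}
```

The HAVING version answers "which sub-categories lose money overall", while the WHERE version sums only the loss-making rows, which is a different question.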

In [10]:
# Define the SQL query string to find sub-categories with negative total profit
# We'll use the 'superstore_clean' view created earlier in your notebook.
negative_profit_subcategories_query_string = """
SELECT
    Sub_Category,
    SUM(Profit) AS total_profit
FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
GROUP BY
    Sub_Category
HAVING
    SUM(Profit) < 0
ORDER BY
    total_profit ASC
LIMIT 100;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
negative_profit_subcategories_query_job = client.query(negative_profit_subcategories_query_string)

# Fetch results into a Pandas DataFrame
negative_profit_subcategories_df = negative_profit_subcategories_query_job.to_dataframe()

# Display the results
display(negative_profit_subcategories_df)

Unnamed: 0,Sub_Category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [11]:
# Define the SQL query string for joining sales and products to report total sales per product,
# simulating a product dimension using a CTE.
# This assumes Product_Name is available in the superstore_clean table itself for grouping.
# If Product_Name were truly missing and needed a separate dimension, we'd have to
# create a CTE with known Product_ID to Product_Name mappings.
product_sales_with_cte_query_string = """
WITH ProductDimension AS (
  SELECT DISTINCT
    Product_ID,
    Product_Name
  FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
)
SELECT
    s.Product_ID,
    pd.Product_Name,
    SUM(s.Sales) AS total_sales
FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean` AS s
JOIN
    ProductDimension AS pd
ON
    s.Product_ID = pd.Product_ID
GROUP BY
    s.Product_ID,
    pd.Product_Name
ORDER BY
    total_sales DESC
LIMIT 100;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
product_sales_with_cte_query_job = client.query(product_sales_with_cte_query_string)

# Fetch results into a Pandas DataFrame
product_sales_with_cte_df = product_sales_with_cte_query_job.to_dataframe()

# Display the results
display(product_sales_with_cte_df)

Unnamed: 0,Product_ID,Product_Name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
95,OFF-ST-10001526,Iceberg Mobile Mega Data/Printer Cart,5751.774
96,FUR-CH-10000595,Safco Contoured Stacking Chairs,5697.760
97,TEC-CO-10001766,Canon PC940 Copier,5669.874
98,FUR-TA-10004256,Bretford “Just In Time” Height-Adjustable Mult...,5634.900
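
One thing worth checking with a dimension built from `SELECT DISTINCT Product_ID, Product_Name`: if any product ID appears under more than one name, the join on `Product_ID` alone matches each sales row to several dimension rows and the totals are overstated. Whether that happens in this dataset is not verified here; the sketch below only demonstrates the effect on made-up rows.

```python
from collections import defaultdict

# Made-up sales rows: (product_id, product_name, sales)
sales = [
    ("P1", "Desk Lamp", 100.0),
    ("P1", "Desk Lamp v2", 50.0),  # same ID recorded under a second name
    ("P2", "Stapler", 40.0),
]

# Equivalent of SELECT DISTINCT product_id, product_name
dimension = sorted({(pid, name) for pid, name, _ in sales})

# JOIN ON product_id only, then GROUP BY product_id, product_name
joined_totals = defaultdict(float)
for pid, _, amount in sales:
    for dim_pid, dim_name in dimension:
        if pid == dim_pid:
            joined_totals[(pid, dim_name)] += amount

true_total = sum(amount for _, _, amount in sales)
print(dict(joined_totals))
print(sum(joined_totals.values()), "vs", true_total)
```

A quick guard is to compare the sum of `total_sales` from the joined query with a plain `SUM(Sales)` over the table; if they differ, the dimension has more than one row per key.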


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

- **`state_sales` CTE:** This Common Table Expression (CTE) aggregates sales at the state level within each region. It groups all records from the `superstore` table by both `Region` and `State`, then calculates the sum of `Sales` for each unique combination, aliasing this sum as `total_sales`. This prepares the data for the ranking step by providing the aggregated sales figures per state within each region.
- **`ranked_state_sales` CTE:** Building on `state_sales`, this second CTE adds a ranking. It uses the `RANK()` window function to assign a rank to each state within its region based on `total_sales` in descending order. The `PARTITION BY Region` clause restarts the ranking for each region, and `ORDER BY total_sales DESC` assigns rank 1 to the state with the highest sales in that region. This rank is aliased as `sales_rank`.
- **Final SELECT statement:** The final part of the query filters the results from `ranked_state_sales`. It selects all columns (`Region`, `State`, `total_sales`, `sales_rank`) but keeps only rows where `sales_rank` is less than or equal to 3, which isolates the top 3 states by total sales for each region. The results are then ordered by `Region` and `sales_rank`.

In [12]:
# Define the SQL query string for ranking states by total sales within each region (top 3).
# The query uses two CTEs: state_sales aggregates sales by region and state,
# ranked_state_sales ranks states within each region, and the final SELECT keeps the top 3.
top_states_per_region_query_string = """
WITH state_sales AS (
  SELECT
    Region,
    State,
    SUM(Sales) AS total_sales
  FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore`
  GROUP BY
    Region,
    State
),
ranked_state_sales AS (
  SELECT
    Region,
    State,
    total_sales,
    RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC) AS sales_rank
  FROM
    state_sales
)
SELECT
  Region,
  State,
  total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  Region,
  sales_rank;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
top_states_per_region_query_job = client.query(top_states_per_region_query_string)

# Fetch results into a Pandas DataFrame
top_states_per_region_df = top_states_per_region_query_job.to_dataframe()

# Display the results
display(top_states_per_region_df)

Unnamed: 0,Region,State,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

Potential Failure Modes and Handling:

**Missing year data (2023 or 2024 is absent for a sub-category):** If a `Sub_Category` has sales in 2023 but not in 2024, or vice versa, a plain `JOIN` of the two years (which defaults to an INNER JOIN) would exclude that sub-category entirely. To address this, use a FULL OUTER JOIN so that all sub-categories are considered, even if they exist in only one of the two years. Then use IFNULL or COALESCE to replace NULL sales values with 0 for the missing year, so the YoY delta can still be computed (e.g., a sub-category with sales only in 2024 has a `yoy_delta` equal to its 2024 sales).

**Sub-categories with sales in 2023 but none in 2024:** With the FULL OUTER JOIN, such a sub-category gets a negative `yoy_delta` (e.g., 0 - 1000 = -1000). `ORDER BY yoy_delta DESC` with `LIMIT 5` pushes these rows to the bottom, but it only drops them when at least five sub-categories rank higher. To report strictly positive increases, add `WHERE yoy_delta > 0` before the ORDER BY and LIMIT.
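
The FULL OUTER JOIN plus COALESCE pattern described above can be sketched with two dictionaries. The sub-category totals here are made up, since the dataset has no 2023 or 2024 rows.

```python
# Made-up yearly totals per sub-category
sales_2023 = {"Phones": 1000.0, "Chairs": 800.0}
sales_2024 = {"Phones": 1500.0, "Copiers": 300.0}

# FULL OUTER JOIN: keep every key that appears on either side
all_sub_categories = sorted(set(sales_2023) | set(sales_2024))

rows = []
for sub_category in all_sub_categories:
    s23 = sales_2023.get(sub_category, 0.0)  # COALESCE(sales_2023, 0)
    s24 = sales_2024.get(sub_category, 0.0)  # COALESCE(sales_2024, 0)
    rows.append((sub_category, s23, s24, s24 - s23))

# WHERE yoy_delta > 0 ORDER BY yoy_delta DESC LIMIT 5
increases = sorted((r for r in rows if r[3] > 0), key=lambda r: r[3], reverse=True)[:5]
print(increases)
```

An inner join would have kept only "Phones"; the outer join keeps "Copiers" (new in 2024) and lets the filter drop "Chairs" (gone in 2024) explicitly.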

In [13]:
# Define the SQL query string for identifying the top 5 sub-categories
# with the largest Year-over-Year (YoY) revenue increase from 2023 to 2024.
# This query handles potential missing years by using FULL OUTER JOIN and COALESCE.
top_yoy_increase_query_string = """
WITH yr_sales AS (
  SELECT
    Sub_Category,
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS total_sales
  FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
  WHERE
    EXTRACT(YEAR FROM order_date) IN (2023, 2024)
  GROUP BY
    sub_category,
    year
),
sales_2023 AS (
  SELECT
    sub_category,
    COALESCE(total_sales, 0) AS sales_2023
  FROM
    yr_sales
  WHERE
    year = 2023
),
sales_2024 AS (
  SELECT
    sub_category,
    COALESCE(total_sales, 0) AS sales_2024
  FROM
    yr_sales
  WHERE
    year = 2024
),
yoy_comparison AS (
  SELECT
    COALESCE(s23.sub_category, s24.sub_Category) AS sub_category,
    s23.sales_2023,
    s24.sales_2024,
    (COALESCE(s24.sales_2024, 0) - COALESCE(s23.sales_2023, 0)) AS yoy_delta
  FROM
    sales_2023 AS s23
  FULL OUTER JOIN
    sales_2024 AS s24
  ON
    s23.sub_category = s24.sub_category
)
SELECT
  sub_category,
  sales_2023,
  sales_2024,
  yoy_delta
FROM
  yoy_comparison
WHERE
  yoy_delta > 0 -- Ensure we only look for increases
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
top_yoy_increase_query_job = client.query(top_yoy_increase_query_string)

# Fetch results into a Pandas DataFrame
top_yoy_increase_df = top_yoy_increase_query_job.to_dataframe()

# Display the results
display(top_yoy_increase_df)

Unnamed: 0,sub_category,sales_2023,sales_2024,yoy_delta


The result is empty because the most recent sales in the dataset are from 2017, so there are no rows for 2023 or 2024.

## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.


Contrast between ROW_NUMBER and RANK:

ROW_NUMBER assigns a unique, sequential integer to each row within a partition, even if there are ties in the ordering column. RANK, on the other hand, assigns the same rank to rows with identical values in the ordering column and then skips the next rank(s) to maintain sequentiality, which can lead to gaps in the ranking.
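
The contrast is easiest to see on a tie. The sketch below computes both numbering schemes by hand for one partition; the totals are made up so that two sub-categories tie for first place.

```python
# Made-up totals for one region, already sorted by total_sales DESC.
# "Chairs" and "Phones" tie for first place.
totals = [("Chairs", 500.0), ("Phones", 500.0), ("Tables", 300.0)]

results = []
prev_value = None
prev_rank = 0
for position, (sub_category, total_sales) in enumerate(totals, start=1):
    row_number = position      # ROW_NUMBER: always unique within the partition
    if total_sales == prev_value:
        rank = prev_rank       # RANK: tied rows share the same rank
    else:
        rank = position        # RANK: the next distinct value skips past the ties
    results.append((sub_category, row_number, rank))
    prev_value, prev_rank = total_sales, rank

print(results)
# [('Chairs', 1, 1), ('Phones', 2, 1), ('Tables', 3, 3)]
```

Filtering on `row_number == 1` keeps exactly one row per region, while filtering on `rank == 1` would keep both tied rows. Which tied row `ROW_NUMBER` picks is arbitrary unless the `ORDER BY` includes a tie-breaker.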

In [14]:
# Define the SQL query string to find the highest-revenue Sub_Category for each Region.
# This query uses a subquery with ROW_NUMBER() to rank Sub_Categories within each Region
# by total sales, and then filters for the top-ranked one (rn = 1).
# It reads from the 'superstore_clean' view.
highest_revenue_subcategory_per_region_query_string = """
SELECT
  region,
  sub_category,
  total_sales
FROM (
  SELECT
    region,
    sub_category,
    SUM(sales) AS total_sales,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS rn
  FROM
    `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
  GROUP BY
    region,
    sub_category
) AS subcat_sales
WHERE
  rn = 1
ORDER BY
  region;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
highest_revenue_subcategory_per_region_query_job = client.query(highest_revenue_subcategory_per_region_query_string)

# Fetch results into a Pandas DataFrame
highest_revenue_subcategory_per_region_df = highest_revenue_subcategory_per_region_query_job.to_dataframe()

# Display the results
display(highest_revenue_subcategory_per_region_df)

Unnamed: 0,region,sub_category,total_sales
0,Central,Chairs,85230.646
1,East,Phones,100614.982
2,South,Phones,58304.438
3,West,Chairs,101781.328


### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

- **Filtered yearly revenue CTE:** This initial CTE filters the `superstore_clean` view to records where `sub_category` is 'Phones'. It then aggregates sales for each year extracted from `order_date`, aliasing the result as `yearly_revenue`.
- **Lagged revenue CTE:** This CTE takes the results from the previous CTE and applies the `LAG()` window function. `LAG(yearly_revenue) OVER (ORDER BY year)` returns the `yearly_revenue` of the preceding year for each row, aliased as `prev_revenue`. The query below passes a default of 0 to `LAG`, so the first year gets a `prev_revenue` of 0 rather than NULL.
- **Final SELECT with YoY growth:** The final SELECT computes the year-over-year percentage growth (`yoy_pct`) as 100.0 * (yearly_revenue - prev_revenue) / prev_revenue. To avoid division-by-zero errors, it uses `SAFE_DIVIDE`, which returns NULL when the divisor is zero; a NULL divisor also yields NULL. The WHERE clause then drops the first year, which has no previous year to compare against. The output includes `year`, `yearly_revenue`, `prev_revenue`, and `yoy_pct`, sorted by year in ascending order.

In [15]:
# Define the SQL query string to compute year-over-year revenue growth for the 'Phones' sub-category.
# It filters for 'Phones', aggregates yearly revenue, uses LAG for previous year's revenue,
# and calculates YoY percentage growth, safely handling potential division by zero.
# References the 'superstore_clean' table.
phones_yoy_growth_query_string = """
WITH
  yearly_sales AS (
    SELECT
      EXTRACT(YEAR FROM order_date) AS year,
      SUM(sales) AS yearly_revenue
    FROM
      `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
    WHERE
      sub_category = 'Phones'
    GROUP BY
      year
  ),
  lagged_revenue AS (
    SELECT
      year,
      yearly_revenue,
      LAG(yearly_revenue, 1, 0) OVER (ORDER BY year) AS prev_revenue
    FROM
      yearly_sales
  )
SELECT
  year,
  yearly_revenue,
  prev_revenue,
  -- Use SAFE_DIVIDE to prevent division by zero or null errors
  SAFE_DIVIDE(100.0 * (yearly_revenue - prev_revenue), prev_revenue) AS yoy_pct
FROM
  lagged_revenue
WHERE
  prev_revenue IS NOT NULL AND prev_revenue != 0 -- Exclude the first year if prev_revenue is 0 or NULL
ORDER BY
  year ASC;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
phones_yoy_growth_query_job = client.query(phones_yoy_growth_query_string)

# Fetch results into a Pandas DataFrame
phones_yoy_growth_df = phones_yoy_growth_query_job.to_dataframe()

# Display the results
display(phones_yoy_growth_df)

Unnamed: 0,year,yearly_revenue,prev_revenue,yoy_pct
0,2015,68313.702,77390.806,-11.728918
1,2016,78962.03,68313.702,15.587397
2,2017,105340.516,78962.03,33.406545
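
As a cross-check on the figures above, the same LAG-style calculation can be sketched in plain Python. The helper name `yoy_growth` is hypothetical, and the 2014 revenue is taken from the `prev_revenue` value shown in the 2015 row. Unlike `LAG`, which takes the previous row, this sketch looks up the previous calendar year, so the two would differ if a year were missing from the data.

```python
def yoy_growth(yearly_revenue):
    """Map each year to its YoY % change; None when the previous year is missing or zero."""
    result = {}
    for year in sorted(yearly_revenue):
        prev = yearly_revenue.get(year - 1)
        if prev is None or prev == 0:
            # Guard: no usable previous year, so no growth figure
            result[year] = None
        else:
            result[year] = 100.0 * (yearly_revenue[year] - prev) / prev
    return result


# Yearly 'Phones' revenue as shown in the query output above
revenue = {2014: 77390.806, 2015: 68313.702, 2016: 78962.03, 2017: 105340.516}
growth = yoy_growth(revenue)

for year, pct in growth.items():
    print(year, "n/a" if pct is None else f"{pct:.2f}%")
```

Returning `None` for the first year keeps that row visible, which is an alternative to filtering it out in the `WHERE` clause.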


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

Cost Control Note: When iterating on window functions or complex aggregations, temporarily restrict the date range (e.g., `WHERE Order_Date BETWEEN '2017-01-01' AND '2017-12-31'`) to shorten execution time during development. The filter lowers bytes scanned, and therefore on-demand cost, only when the table is partitioned or clustered on that column.

In [16]:
# Define the SQL query string to compute a 3-month moving average of monthly revenue
# for the 'Corporate' segment. It aggregates monthly sales, then calculates the
# moving average using a window function, referencing the 'superstore_clean' table.
corporate_moving_average_query_string = """
WITH
  monthly_revenue_data AS ( -- Named differently from the monthly_revenue column to avoid confusion
    SELECT
      DATE_TRUNC(order_date, MONTH) AS month,
      SUM(sales) AS monthly_revenue
    FROM
      `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
    WHERE
      segment = 'Corporate'
      -- Cost control note: Uncomment and adjust the date range for faster iteration during development
      -- AND order_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 18 MONTH), MONTH)
      -- AND order_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
    GROUP BY
      month
  )
SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  monthly_revenue_data -- Use the CTE alias here
ORDER BY
  month ASC;
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
corporate_moving_average_query_job = client.query(corporate_moving_average_query_string)

# Fetch results into a Pandas DataFrame
corporate_moving_average_df = corporate_moving_average_query_job.to_dataframe()

# Display the results
display(corporate_moving_average_df)

Unnamed: 0,month,monthly_revenue,ma_3
0,2014-01-01,1701.528,1701.528
1,2014-02-01,1183.668,1442.598
2,2014-03-01,11106.799,4663.998333
3,2014-04-01,14131.729,8807.398667
4,2014-05-01,9142.0,11460.176
5,2014-06-01,3970.914,9081.547667
6,2014-07-01,10032.988,7715.300667
7,2014-08-01,7451.774,7151.892
8,2014-09-01,15507.745,10997.502333
9,2014-10-01,12637.678,11865.732333
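
The window frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` can be sanity-checked by hand. The sketch below is a minimal pure-Python equivalent (the helper name `trailing_average` is hypothetical) applied to the first four `monthly_revenue` values from the output above. Note that the first two averages use fewer than three months, exactly as the window function does.

```python
def trailing_average(values, window=3):
    """Average each value with up to (window - 1) preceding values."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # shorter frame at the start of the series
        frame = values[start:i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


# First four months of 2014 from the output above
monthly_revenue = [1701.528, 1183.668, 11106.799, 14131.729]
ma_3 = trailing_average(monthly_revenue)

for revenue, average in zip(monthly_revenue, ma_3):
    print(f"{revenue:12.3f} {average:14.6f}")
```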


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
[PASTE ERROR MESSAGE and the exact SQL here]
Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

I took the E3 query and deliberately introduced an error by removing an underscore ("_") from an identifier. The model's response is reproduced below.



Root Cause:

The error message "Table "monthly_revenue_data" must be qualified with a dataset (e.g. dataset.table)" indicates that BigQuery is attempting to interpret monthly_revenue_data as a physical table that needs a dataset qualification, rather than a Common Table Expression (CTE) within the current query. This typically happens when a CTE is referenced without proper scope awareness or if there's a misunderstanding of how CTEs are handled.

Smallest Possible Fix:

The issue is not in the SQL syntax itself, but likely in how the query is being executed or referenced in your Python environment. BigQuery correctly identifies monthly_revenue_data as a CTE within the provided SQL. The error stems from the execution context where it's being treated as a regular table.

There isn't a SQL syntax fix for this error as the SQL itself is valid for BigQuery. The problem lies in the environment calling the SQL. However, if we must provide a SQL fix that might work around an environment issue (though this is not ideal and might mask a deeper problem), we could try fully qualifying the CTE reference, but this is generally not how CTEs are meant to be used.

A more direct approach, assuming the Python environment is misinterpreting the CTE reference, is to ensure the SQL string is passed correctly. However, since you asked for a SQL fix:

-- No SQL fix is strictly necessary for the BigQuery SQL itself.
-- The provided SQL is syntactically correct for BigQuery.
-- The error originates from the execution environment's interpretation of the CTE.
-- If you are forced to provide a SQL change, it would be to ensure no unintended
-- re-qualification or aliasing is happening. The original SQL is correct.

-- Re-pasted for clarity of the correct SQL:
WITH
  monthly_revenue_data AS (
    SELECT
      DATE_TRUNC(order_date, MONTH) AS month,
      SUM(sales) AS monthly_revenue
    FROM
      `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
    WHERE
      segment = 'Corporate'
    GROUP BY
      month
  )
SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  monthly_revenue_data
ORDER BY
  month ASC;

Sanity Check Query:

To verify that BigQuery correctly interprets monthly_revenue_data as a CTE, you can run a simpler query that just selects from it:

WITH
  monthly_revenue_data AS (
    SELECT
      DATE_TRUNC(order_date, MONTH) AS month,
      SUM(sales) AS monthly_revenue
    FROM
      `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
    WHERE
      segment = 'Corporate'
    GROUP BY
      month
  )
SELECT
  month,
  monthly_revenue
FROM
  monthly_revenue_data
LIMIT 1;

If this simpler query runs without the "must be qualified" error, it confirms that the CTE is correctly defined within the SQL. The original error then points to an issue in how the Python code is passing or executing the SQL.
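
Because the error in this experiment was introduced by removing an underscore from an identifier, one plausible reading is that the name used in `FROM` no longer matched the name defined in `WITH`, so BigQuery resolved it as an unqualified table. The sketch below is a rough regex-based check for that mismatch, not a SQL parser. The helper name `unresolved_table_names` and the broken query are hypothetical reconstructions, since the exact edited SQL is not shown.

```python
import re


def unresolved_table_names(sql):
    """Return unqualified names after FROM/JOIN that no CTE in the query defines."""
    # Drop line comments and the FROM keyword inside EXTRACT(part FROM expr)
    sql = re.sub(r"--[^\n]*", "", sql)
    sql = re.sub(r"EXTRACT\s*\(\s*\w+\s+FROM\b", "EXTRACT(", sql, flags=re.IGNORECASE)
    # CTE definitions look like "<name> AS ("
    cte_names = set(re.findall(r"\b([A-Za-z_]\w*)\s+AS\s*\(", sql, flags=re.IGNORECASE))
    # Bare names after FROM/JOIN (backtick-quoted or dotted paths are skipped)
    referenced = set(
        re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)(?![\w.])", sql, flags=re.IGNORECASE)
    )
    return referenced - cte_names


# Hypothetical reproduction: the CTE name lost an underscore, the reference did not
broken_sql = """
WITH monthly_revenuedata AS (
  SELECT DATE_TRUNC(order_date, MONTH) AS month, SUM(sales) AS monthly_revenue
  FROM `project.dataset.superstore_clean`
  GROUP BY month
)
SELECT month, monthly_revenue FROM monthly_revenue_data
"""
fixed_sql = broken_sql.replace("monthly_revenuedata", "monthly_revenue_data")

print(unresolved_table_names(broken_sql))  # names BigQuery would treat as tables
print(unresolved_table_names(fixed_sql))
```

An empty result for the fixed query means every bare name in `FROM` matches a CTE, which is the smallest fix this reading suggests.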

### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
[PASTE YOUR SQL HERE]
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

I pasted the E3 SQL query into this prompt. The model's response is below.


Here are 3 ways to reduce scanned bytes and improve performance for your BigQuery query, without changing its business logic:

Partition Filter on order_date: If your superstore_clean table is partitioned by order_date (which is highly likely for time-series data), explicitly filter on this partition column in the WHERE clause. This is the most impactful optimization. Even though you are using DATE_TRUNC to group by month, applying a partition filter directly on order_date will prune data before the aggregation.

-- Example with partition filter:
WHERE
  segment = 'Corporate'
  AND order_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 18 MONTH), MONTH) -- Example: Filter to last 18 months for partition pruning
  AND order_date < DATE_TRUNC(CURRENT_DATE(), MONTH)

Column Pruning within the CTE: While the CTE already selects specific columns, ensure that only the absolutely necessary columns (order_date, sales, segment) are accessed from the base table within the CTE. BigQuery is generally good at this, but explicitly listing columns in the SELECT statement of the CTE is the most robust way to enforce column pruning. Your current query already does this well.

Clustering on segment (if applicable): If your superstore_clean table is clustered by segment in addition to being partitioned by order_date, BigQuery can more efficiently filter rows for segment = 'Corporate'. While not a direct query modification, it's a crucial table design consideration that impacts query performance. If it's not already clustered by segment, consider adding it if segment is frequently used in WHERE clauses.

Explanation of Prioritization:

- **Partition filters**: This is the highest priority because it directly reduces the amount of data BigQuery needs to scan from disk based on the order_date.
- **Column pruning**: This is also very important as it reduces the amount of data read from disk and processed in memory for each row. Your current query is good here.
- **Pre-aggregations / temporary results via CTEs**: Your query already uses a CTE (monthly_revenue_data) to pre-aggregate data to a monthly level. This is a good practice for performance, especially when dealing with window functions on aggregated data. The AVG() window function is then applied to this already-reduced monthly dataset, which is efficient. Clustering would fall under a similar "optimization for data access" category but is a table-level setting rather than a query-level one.
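
Whether any of these suggestions actually reduces scanned bytes can be measured with a dry run, which asks BigQuery for its estimate without executing the query. The sketch below assumes the `google-cloud-bigquery` library and an authenticated `client` as in earlier cells. The helper names `human_bytes` and `dry_run_bytes` are hypothetical.

```python
def human_bytes(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def dry_run_bytes(client, sql):
    """Return the bytes BigQuery estimates the query would process, without running it."""
    # Imported lazily so human_bytes() works even where the library is not installed
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

For example, `print(human_bytes(dry_run_bytes(client, corporate_moving_average_query_string)))` before and after adding the date filter shows whether the filter prunes anything on this particular table.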

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [17]:
# --- Query 1: Slicing by Region and Time ---
query_1_string = """
SELECT
    region,
    EXTRACT(YEAR FROM order_date) AS order_year,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
     `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    region,
    order_year
ORDER BY
    region,
    order_year
"""

# Execute the query using the BigQuery client
# This assumes your 'client' object from previous cells is still active and authenticated.
query_job_1 = client.query(query_1_string)

# Fetch results into a Pandas DataFrame
df_region_time = query_job_1.to_dataframe()

# Display the results
display(df_region_time)

Unnamed: 0,region,order_year,total_sales,total_profit,average_discount
0,Central,2014,7785.478,-1424.331,0.326667
1,Central,2015,6857.26,-265.0939,0.207143
2,Central,2016,13922.926,292.6211,0.205882
3,Central,2017,10589.307,-2162.8466,0.292308
4,East,2014,10603.704,-3537.8375,0.38
5,East,2015,8884.806,-2275.8641,0.373333
6,East,2016,7825.328,-2306.7783,0.368182
7,East,2017,11825.969,-2904.9002,0.373913
8,South,2014,9940.9445,1107.9902,0.113636
9,South,2015,7370.6745,-2171.3765,0.21875


In [18]:
query_2_ship_mode_string = """
SELECT
    ship_mode,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
     `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode
ORDER BY
    ship_mode
"""

# Execute the query using the BigQuery client
query_job_2_ship_mode = client.query(query_2_ship_mode_string)

# Fetch results into a Pandas DataFrame
df_ship_mode = query_job_2_ship_mode.to_dataframe()

# Display the results
display(df_ship_mode)


Unnamed: 0,ship_mode,total_sales,total_profit,average_discount
0,First Class,28800.776,-1365.3665,0.240426
1,Same Day,9644.347,-1129.4225,0.261905
2,Second Class,43693.7475,-3320.6799,0.248361
3,Standard Class,124826.6615,-11910.0122,0.270526


```sql
-- Query 1: Slicing by Region and Time
SELECT
    region,
    EXTRACT(YEAR FROM order_date) AS order_year,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
    `your_project_id.your_dataset_id.your_table_id`
WHERE
    sub_category = 'Tables'
GROUP BY
    region,
    order_year
ORDER BY
    region,
    order_year;
```

```sql
-- Query 2: Controlling for Order Priority
SELECT
    order_priority,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
    `your_project_id.your_dataset_id.your_table_id`
WHERE
    sub_category = 'Tables'
GROUP BY
    order_priority
ORDER BY
    order_priority;

-- Alternative Query 2: Controlling for Ship Mode
SELECT
    ship_mode,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
    `your_project_id.your_dataset_id.your_table_id`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode
ORDER BY
    ship_mode;
```

To compare the outcomes, you would examine the results of both queries. For the first query, look for specific regions or years where 'Tables' might have been profitable despite the overall negative trend, or conversely, where the negative profit and high discounts are particularly pronounced. This would suggest that geographic location and the time of the sale are significant factors. For the second query, you would analyze if 'Tables' have a different profit margin or discount level based on their `order_priority` or `ship_mode`. If, for instance, 'Tables' with a 'High' `order_priority` or a 'Same Day' `ship_mode` are profitable, it would nuance your initial conclusion by showing that the high discounts might be strategically applied to certain types of orders or shipments, rather than being a blanket issue across all 'Tables' sales.
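
One way to make the slices comparable is to convert each row to a profit margin (profit divided by sales). The sketch below does this for the ship-mode output shown earlier in this part. The helper name `profit_margins` is hypothetical and the figures are copied from that output.

```python
# (total_sales, total_profit) per ship mode, from the 'Tables' output above
ship_mode_totals = {
    "First Class": (28800.776, -1365.3665),
    "Same Day": (9644.347, -1129.4225),
    "Second Class": (43693.7475, -3320.6799),
    "Standard Class": (124826.6615, -11910.0122),
}


def profit_margins(totals):
    """Return profit / sales per slice, skipping slices with zero sales."""
    return {name: profit / sales for name, (sales, profit) in totals.items() if sales}


margins = profit_margins(ship_mode_totals)

# Print slices from worst to best margin
for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{name:15s} {margin:8.2%}")
```

Every ship mode shows a negative margin here, so this slice nuances the size of the loss but does not overturn the original finding.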

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```



The three-month moving average dataset shows a substantial increase in monthly revenue, from about 1,700 USD in January 2014 to about 20,500 USD in December 2017, which corresponds to a compound monthly growth rate of approximately 5.4% over 47 monthly intervals. This upward trend is likely driven by seasonal demand patterns and effective sales strategies, particularly evident in the strong performance during Q4 in multiple years. To capitalize on this, sales teams should focus aggressively on Q4 targets, aligning marketing efforts with anticipated seasonal peaks. Monitoring customer acquisition cost (CAC) alongside revenue will be crucial to ensure sustainable growth.
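
The compound monthly growth rate can be checked with a few lines of arithmetic. This sketch assumes the rounded endpoints quoted in the summary (1,700 USD in January 2014 and 20,500 USD in December 2017), which are 47 monthly intervals apart. The helper name `compound_growth_rate` is hypothetical.

```python
def compound_growth_rate(start_value, end_value, periods):
    """Per-period growth rate r such that start * (1 + r) ** periods == end."""
    return (end_value / start_value) ** (1 / periods) - 1


# January 2014 to December 2017 spans 47 month-to-month intervals
cmgr = compound_growth_rate(1700, 20500, periods=47)
print(f"Compound monthly growth rate: {cmgr:.1%}")
```

With these inputs the rate comes out at roughly 5.4% per month. Endpoint-based rates are sensitive to the two months chosen, so a single unusually low or high month can move the figure noticeably.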

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use python‑bigquery client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---

# Task
Convert the 3-month moving average BigQuery SQL query into a complete Python script that can be run as a scheduled job. The script should use the `google-cloud-bigquery` client to execute a parameterized version of the query, write the results to the destination table `"noble-broker-471012-q6.analytics.outputs_kpi"`, and include basic error handling and logging.

## Generate Python Script for Scheduled Job

### Subtask:
Generate a complete Python script that uses the `google-cloud-bigquery` client. The script will execute a parameterized BigQuery query (based on the 3-month moving average query from a previous step), write the results to the destination table `noble-broker-471012-q6.analytics.outputs_kpi`, and include basic error handling and logging as per your requirements.


**Reasoning**:
I will generate a single Python code block that encapsulates all the user's instructions to create a complete, runnable script for scheduling a BigQuery job.



In [24]:
import logging
from datetime import datetime
from google.cloud import bigquery

# 1. Set up basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# 2. Define configuration variables
PROJECT_ID = 'noble-broker-471012-q6'
DESTINATION_DATASET = 'analytics'
DESTINATION_TABLE = 'outputs_kpi'

# 3. Create a function to run the parameterized job
def run_moving_average_job(start_date: str, end_date: str):
    """Runs a parameterized query for a 3-month moving average and saves the results."""
    client = bigquery.Client(project=PROJECT_ID)

    # Create the dataset if it does not exist
    logging.info(f"Ensuring dataset '{DESTINATION_DATASET}' exists...")
    client.create_dataset(DESTINATION_DATASET, exists_ok=True)
    logging.info(f"Dataset '{DESTINATION_DATASET}' is ready.")

    destination_table_ref = f"{PROJECT_ID}.{DESTINATION_DATASET}.{DESTINATION_TABLE}"

    logging.info(f"Starting job to calculate moving average from {start_date} to {end_date}.")
    logging.info(f"Results will be written to: {destination_table_ref}")

    # 4. Construct the parameterized SQL query
    sql = """
    WITH monthly_revenue_data AS (
      SELECT
        DATE_TRUNC(order_date, MONTH) AS month,
        SUM(sales) AS monthly_revenue
      FROM
        `noble-broker-471012-q6.lab1_foundations_EthanLouie.superstore_clean`
      WHERE
        segment = 'Corporate'
        AND order_date >= @start_date
        AND order_date <= @end_date
      GROUP BY
        month
    )
    SELECT
      month,
      monthly_revenue,
      AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
    FROM
      monthly_revenue_data
    ORDER BY
      month ASC;
    """

    # 5. Configure the query job
    job_config = bigquery.QueryJobConfig(
        destination=destination_table_ref,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", datetime.strptime(start_date, "%Y-%m-%d").date()),
            bigquery.ScalarQueryParameter("end_date", "DATE", datetime.strptime(end_date, "%Y-%m-%d").date()),
        ]
    )

    # 6. Wrap the execution in a try...except block
    try:
        logging.info("Executing query...")
        # query_and_wait() blocks until the query finishes and returns a RowIterator, not a QueryJob
        rows = client.query_and_wait(sql, job_config=job_config)
        logging.info(f"Successfully created/updated table {destination_table_ref}. Job ID: {rows.job_id}")
        # Report the row count of the destination table
        destination_table = client.get_table(destination_table_ref)
        logging.info(f"Destination table now has {destination_table.num_rows} rows.")

    except Exception:
        # Log the full traceback, then re-raise so a scheduler records the run as failed
        logging.exception("An error occurred while executing the BigQuery job")
        raise

# 7. Main execution block
if __name__ == "__main__":
    # Define the date range for the job
    start_date_param = '2017-01-01'
    end_date_param = '2017-12-31'

    run_moving_average_job(start_date=start_date_param, end_date=end_date_param)

## Final Task

### Subtask:
Provide a summary of the generated Python script and its functionality.


## Summary:

### Data Analysis Key Findings
*   A Python script was successfully created to automate a BigQuery task. This script calculates the 3-month moving average of monthly sales revenue for the 'Corporate' segment.
*   The script is designed for automation, featuring parameterization for date ranges, error handling with logging, and the ability to write results to a specified destination table.
*   An initial execution attempt failed with a `404 Not Found` error because the destination dataset `analytics` did not exist in BigQuery.
*   The script was corrected by adding a step to programmatically create the destination dataset if it is not already present, which resolved the error.
*   The final, successful execution processed data for the specified date range ('2017-01-01' to '2017-12-31') and loaded 12 rows of data into the destination table `noble-broker-471012-q6.analytics.outputs_kpi`.

### Insights or Next Steps
*   The completed Python script is now robust and ready for deployment as a scheduled job using a cloud service like Google Cloud Functions or Cloud Scheduler to automate regular KPI reporting.
*   Future enhancements could include adding more sophisticated error handling, such as implementing retry logic for transient API errors or integrating an alerting mechanism to notify stakeholders of job failures.
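
The retry idea mentioned above can be sketched with the standard library alone. This is a minimal example of retrying with exponential backoff, not part of the generated script. The helper name `run_with_retries` and its parameters are hypothetical, and in practice `retry_on` should be narrowed to the exception types considered transient.

```python
import logging
import time


def run_with_retries(task, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
```

A call such as `run_with_retries(lambda: run_moving_average_job('2017-01-01', '2017-12-31'))` only helps if the job function lets exceptions propagate instead of swallowing them.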


### Verify the Destination Table

The following code queries the `outputs_kpi` table that was created by the scheduled job script. This allows you to see the final data and confirm the job was successful.

In [25]:
import pandas as pd

# This assumes your 'client' object from previous cells is still active.

verification_query = """
SELECT
  month,
  monthly_revenue,
  ma_3
FROM
  `noble-broker-471012-q6.analytics.outputs_kpi`
ORDER BY
  month;
"""

print("Querying the destination table to verify the results...")

try:
    verification_df = client.query(verification_query).to_dataframe()
    print("Verification successful! Displaying the contents of the table:")
    display(verification_df)
except Exception as e:
    print(f"An error occurred during verification: {e}")

Querying the destination table to verify the results...
Verification successful! Displaying the contents of the table:


Unnamed: 0,month,monthly_revenue,ma_3
0,2017-01-01,12221.232,12221.232
1,2017-02-01,7959.258,10090.245
2,2017-03-01,16574.6228,12251.704267
3,2017-04-01,14803.0146,13112.298467
4,2017-05-01,14239.8348,15205.824067
5,2017-06-01,19695.04,16245.963133
6,2017-07-01,16525.599,16820.157933
7,2017-08-01,25678.444,20633.027667
8,2017-09-01,24105.103,22103.048667
9,2017-10-01,24877.168,24886.905
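
In the table above, the first row's `ma_3` equals its `monthly_revenue`, because the date filter removes the two preceding months before the window function runs. One possible adjustment, sketched here as an assumption rather than as part of the generated script, is to query from an earlier warm-up date and keep only the requested months when writing the output. The helper name `warmup_start` is hypothetical.

```python
from datetime import date


def warmup_start(start, window=3):
    """First day of the month that is (window - 1) months before the month of `start`."""
    months = start.year * 12 + (start.month - 1) - (window - 1)
    return date(months // 12, months % 12 + 1, 1)


# A 3-month window starting in January 2017 needs data from November 2016 onward
print(warmup_start(date(2017, 1, 1)))
```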
