
# Lab 2 – Advanced EDA with SQL + Plotly
**Week 4 (Thu)** – CTEs, Window Functions, Interactive Line Chart

In [None]:
#@title Setup (auth + libs)
!pip -q install google-cloud-bigquery pandas pyarrow plotly --upgrade
from google.cloud import bigquery
from google.colab import auth
import pandas as pd, plotly.express as px
auth.authenticate_user()
PROJECT_ID = "YOUR_PROJECT_ID_HERE"  # <-- EDIT
bq = bigquery.Client(project=PROJECT_ID)

## D – Discover (CTE): Top 3 Sub-Categories in 2023

In [None]:
prompt = """
# TASK: Generate a BigQuery SQL query using a Common Table Expression (CTE).
# GOAL: Find the top 3 product sub-categories by total sales in the year 2023.
# STEP 1 (in the CTE): Create a temporary table named 'yearly_sales' that calculates the sum of 'Sales' for each 'Sub_Category', filtering for orders where the year of 'Order_Date' is 2023.
# STEP 2 (in the final SELECT): Select the 'Sub_Category' and total sales from 'yearly_sales', order by sales descending, and limit to 3 results.
"""
print(prompt)

In [None]:
sql_cte = f"""
WITH yearly_sales AS (
  SELECT Sub_Category, SUM(Sales) AS total_sales
  FROM `{PROJECT_ID}.superstore_data.sales`
  WHERE EXTRACT(YEAR FROM Order_Date) = 2023
  GROUP BY Sub_Category
)
SELECT Sub_Category, total_sales
FROM yearly_sales
ORDER BY total_sales DESC
LIMIT 3;
"""
df_cte = bq.query(sql_cte).to_dataframe()
display(df_cte)

Write: “The top 3 sub‑categories in 2023 were …”
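
One way to sanity-check the CTE result is to reproduce the same aggregation in pandas on a small sample. The sketch below is illustrative only: the `sample` rows and the helper `top_sub_categories` are hypothetical, and only the column names (`Sub_Category`, `Order_Date`, `Sales`) come from the query above.

```python
import pandas as pd

# Hypothetical sample rows; values are invented for illustration only.
sample = pd.DataFrame({
    "Sub_Category": ["Phones", "Chairs", "Phones", "Binders", "Tables", "Chairs", "Phones"],
    "Order_Date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-15", "2023-04-01",
        "2023-05-20", "2022-12-31", "2023-06-30",
    ]),
    "Sales": [100.0, 250.0, 50.0, 40.0, 300.0, 999.0, 25.0],
})

def top_sub_categories(df, year, n=3):
    """Mirror the CTE: filter to one year, sum Sales per Sub_Category, keep the top n."""
    in_year = df[df["Order_Date"].dt.year == year]
    totals = in_year.groupby("Sub_Category")["Sales"].sum()
    return totals.sort_values(ascending=False).head(n)

top3 = top_sub_categories(sample, 2023)
print(top3)
```

The 2022 row is included on purpose: if the year filter were missing, `Chairs` would jump to first place, which makes a dropped `WHERE` clause easy to spot.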

## I – Investigate (Window): YoY growth for Phones

In [None]:
prompt = """
# TASK: Generate a BigQuery SQL query using a window function.
# GOAL: Calculate the year-over-year sales growth for the 'Phones' sub-category.
# STEPS:
# 1. Filter for 'Phones'.
# 2. Sum sales by year.
# 3. Use LAG() to get previous_year_sales.
# 4. Compute YoY% = ((current - previous)/previous)*100.
"""
print(prompt)

In [None]:
sql_yoy = f"""
WITH yearly AS (
  SELECT EXTRACT(YEAR FROM Order_Date) AS year,
         SUM(Sales) AS current_sales
  FROM `{PROJECT_ID}.superstore_data.sales`
  WHERE Sub_Category = 'Phones'
  GROUP BY year
)
SELECT year,
       current_sales,
       LAG(current_sales) OVER (ORDER BY year) AS previous_year_sales,
       SAFE_DIVIDE(current_sales - LAG(current_sales) OVER (ORDER BY year), LAG(current_sales) OVER (ORDER BY year)) * 100 AS yoy_pct
FROM yearly
ORDER BY year;
"""
df_yoy = bq.query(sql_yoy).to_dataframe()
display(df_yoy)
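
To see what `LAG()` contributes, the same YoY calculation can be sketched in pandas, where `shift(1)` plays the role of `LAG(current_sales) OVER (ORDER BY year)`. The yearly totals below are made-up numbers, not query output.

```python
import pandas as pd

# Hypothetical yearly totals; the numbers are invented for illustration.
yearly = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023],
    "current_sales": [100.0, 120.0, 90.0, 135.0],
})

# Order by year first, just as the window's ORDER BY clause does.
yearly = yearly.sort_values("year").reset_index(drop=True)

# shift(1) returns the previous row's value, like LAG().
yearly["previous_year_sales"] = yearly["current_sales"].shift(1)

# The first year has no predecessor, so its YoY value is NaN (NULL in SQL).
yearly["yoy_pct"] = (
    (yearly["current_sales"] - yearly["previous_year_sales"])
    / yearly["previous_year_sales"] * 100
)
print(yearly)
```

One difference to keep in mind: `SAFE_DIVIDE` returns NULL when the previous year's sales are 0, whereas plain pandas division yields `inf` (or `NaN` for 0/0).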

In [None]:
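# px was already imported in the Setup cell.
fig = px.line(df_yoy, x='year', y='current_sales', markers=True, title='Yearly Sales – Phones')
fig.update_layout(xaxis_title='Year', yaxis_title='Sales')
# dtick=1 keeps the x-axis on whole years instead of fractional ticks such as 2022.5.
fig.update_xaxes(dtick=1)
fig.show()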

Note one insight that is visible only through interactivity (hover values, inflection year, etc.).

## Challenges – Author Prompts

In [None]:
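# Challenge 1 – Highest single-day sales in 2023, by sub-category (CTE)
sql_c1 = f"""
WITH daily AS (
  -- The alias is sale_date rather than order_date: column names are case-insensitive,
  -- so an alias matching the Order_Date column can make GROUP BY ambiguous.
  SELECT DATE(Order_Date) AS sale_date,
         Sub_Category,
         SUM(Sales) AS day_sales
  FROM `{PROJECT_ID}.superstore_data.sales`
  WHERE EXTRACT(YEAR FROM Order_Date) = 2023
  GROUP BY sale_date, Sub_Category
)
SELECT sale_date, Sub_Category, day_sales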
FROM daily
ORDER BY day_sales DESC
LIMIT 1;
"""
df_c1 = bq.query(sql_c1).to_dataframe()
display(df_c1)
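
The Challenge 1 query groups by date and sub-category, so it answers "which sub-category had the best single day", not "which day had the highest total sales". The pandas sketch below, on invented rows, shows that the two grains can give different answers; `orders` and the other variable names are hypothetical.

```python
import pandas as pd

# Hypothetical order lines; values are invented for illustration only.
orders = pd.DataFrame({
    "Order_Date": pd.to_datetime(
        ["2023-03-01", "2023-03-01", "2023-03-01", "2023-07-04", "2023-07-04"]
    ),
    "Sub_Category": ["Phones", "Chairs", "Phones", "Tables", "Tables"],
    "Sales": [200.0, 350.0, 100.0, 250.0, 150.0],
})

# Same grain as the SQL: one row per (date, sub-category).
by_day_subcat = orders.groupby(["Order_Date", "Sub_Category"])["Sales"].sum()
best_day_subcat = by_day_subcat.idxmax()  # (Timestamp, sub-category) pair

# Coarser grain: one row per date, all sub-categories combined.
by_day = orders.groupby("Order_Date")["Sales"].sum()
best_day = by_day.idxmax()

print("Best (day, sub-category):", best_day_subcat)
print("Best day overall:", best_day)
```

Here the best day and sub-category pair is Tables on 2023-07-04, while the best day overall is 2023-03-01, so it is worth stating which question the prompt is meant to answer.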

In [None]:
# Challenge 2 – 3‑month moving average for Corporate segment
# Note: the window counts rows, so it spans 3 calendar months only if no month is missing.
sql_c2 = f"""
WITH monthly AS (
  SELECT DATE_TRUNC(DATE(Order_Date), MONTH) AS month,
         SUM(Sales) AS sales
  FROM `{PROJECT_ID}.superstore_data.sales`
  WHERE Segment = 'Corporate'
  GROUP BY month
)
SELECT month,
       sales,
       AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3m
FROM monthly
ORDER BY month;
"""
df_c2 = bq.query(sql_c2).to_dataframe()
display(df_c2)
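
`ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` averages the current row and the two rows before it, whatever months they belong to. The sketch below, on invented data with March missing, compares that row-based window with a calendar-based variant. Filling the missing month with 0 is an assumption; whether a gap means "no sales" depends on the data.

```python
import pandas as pd

# Hypothetical monthly totals with March missing; values are invented.
monthly = pd.Series(
    [100.0, 200.0, 300.0, 400.0],
    index=pd.to_datetime(["2023-01-01", "2023-02-01", "2023-04-01", "2023-05-01"]),
    name="sales",
)

# Row-based window: equivalent to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
# min_periods=1 matches SQL, which averages whatever rows exist at the start.
row_based = monthly.rolling(window=3, min_periods=1).mean()

# Calendar-based variant: insert the missing month as 0 sales first (assumption).
filled = monthly.asfreq("MS", fill_value=0.0)
calendar_based = filled.rolling(window=3, min_periods=1).mean()

print(row_based)
print(calendar_based)
```

For May, the row-based average uses February, April, and May, while the calendar-based average uses March (as 0), April, and May, so the two values differ.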


## Reflection (DIVE)

- **Discover**: Briefly note the first relevant answer Gemini produced and what it showed.
- **Investigate**: Describe another angle or alternate query you explored.
- **Validate**: Explain how you checked Gemini's outputs (e.g., sanity counts, comparing against expectations).
- **Extend**: Propose a new business question that emerged from your results.


### Insight (from interactive Plotly chart)
- Example: *'Phones had consistent growth from 2020–2023, with the largest jump in 2022 visible on hover values.'*


## Submit

- Push **Lab2_...ipynb** to your GitHub repo under the `Labs/` folder.  
- Submit the GitHub link on Brightspace before the deadline.
