
# DIVE Analysis: Diagnosing Model Failure due to Data Drift

### 1. Discover (The Problem)
After training the initial BQML Linear Regression model on the 2004 historical dataset, I attempted to score the live streaming data from Rome (2025). The results were immediately suspicious.
*   **Observation:** The live API reported Carbon Monoxide (CO) levels around **422.0**, but the model predicted values near **-1.0** or **0.5**.
*   **Initial Thought:** I suspected the model was overfitted or that the correlation between Nitrogen Dioxide (NO2) and Carbon Monoxide (CO) had fundamentally changed over 20 years.

### 2. Investigate (The Hypothesis)
I prompted the AI assistant to help me compare the statistics of the two datasets. My hypothesis shifted from "Model Failure" to "Data Schema Drift."

**Prompt Log (Investigation):**
> "The model accuracy is good on training data but terrible on live data. The live actuals are huge (400+) but predictions are tiny (<5). Compare the average values of the target column in both tables."

**Investigation Findings:**
*   **Historical Average (2004):** 2.13 (Target: CO_GT_)
*   **Live Average (2025):** 422.0 (Target: value)
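*   **Insight:** The live average is roughly 200x the historical one (422.0 / 2.13 ≈ 198), a gap far too large to attribute to environmental change alone and more consistent with a unit mismatch. Further research confirmed that older scientific sensors often recorded CO in **mg/m³** (milligrams per cubic meter), while modern APIs like Open-Meteo return **µg/m³** (micrograms per cubic meter), a 1000x difference in scale.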
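
The magnitude check can be reproduced offline once the two averages are known. The following is a minimal sketch using the averages reported in the findings; the helper name `scale_ratio` and the 10x alert threshold are illustrative assumptions, not part of the project code.

```python
def scale_ratio(historical_mean: float, live_mean: float) -> float:
    """Return how many times larger the live mean is than the historical mean."""
    if historical_mean == 0:
        raise ValueError("historical mean must be non-zero")
    return live_mean / historical_mean


# Averages reported in the investigation findings.
HISTORICAL_MEAN_CO = 2.13  # CO_GT_, 2004 dataset
LIVE_MEAN_CO = 422.0       # value, 2025 stream

# Assumed threshold: a gap above 10x is treated as a likely unit problem.
UNIT_MISMATCH_THRESHOLD = 10.0

ratio = scale_ratio(HISTORICAL_MEAN_CO, LIVE_MEAN_CO)
suspect_units = ratio > UNIT_MISMATCH_THRESHOLD
print(f"ratio = {ratio:.1f}x, suspect unit mismatch: {suspect_units}")
```

A ratio this large does not identify the conversion factor by itself; it only signals that the units are worth checking before blaming the model.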

### 3. Validate (The Experiment)
To validate this, I rescaled the training label to match the streaming units (1 mg = 1000 µg) and retrained the model.

**Validation Query:**
```sql
SELECT
    (CO_GT_ * 1000) as label, -- Adjusted to Micrograms
    NO2_GT_ as no2
FROM `air_quality_dataset.sensor_data`
```
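
A query like this would typically feed a BQML `CREATE OR REPLACE MODEL` statement. The sketch below only builds the SQL string and does not run it. The function name `build_retrain_sql` is hypothetical, the model name `air_quality_model_v2` is taken from the prediction cell later in this notebook, and the `OPTIONS` clause is an assumption about how the model was configured.

```python
MG_TO_UG = 1000  # 1 mg/m3 = 1000 ug/m3


def build_retrain_sql(project_id: str, dataset: str, table: str, model: str) -> str:
    """Build a BQML statement that retrains the regression on microgram-scaled CO."""
    return f"""
CREATE OR REPLACE MODEL `{project_id}.{dataset}.{model}`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT
    (CO_GT_ * {MG_TO_UG}) AS label,  -- historical mg/m3 scaled to ug/m3
    NO2_GT_ AS no2
FROM `{project_id}.{dataset}.{table}`
""".strip()


retrain_sql = build_retrain_sql(
    "mgmt467-fp-11", "air_quality_dataset", "sensor_data", "air_quality_model_v2"
)
print(retrain_sql)
```

Passing the string to `bq_client.query(retrain_sql).result()` would submit the job and wait for training to finish.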

**Result:**
After retraining with the adjustment factor, the model predicted **430.88** against an actual of **422.0**. The prediction error dropped from >99% to roughly **2.1%**.
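
The error figures follow from a simple absolute percentage error on a single reading. A short sketch of the arithmetic, using the predicted and actual values quoted in this section (the helper name `percent_error` is illustrative):

```python
def percent_error(predicted: float, actual: float) -> float:
    """Absolute percentage error of one prediction relative to the actual value."""
    if actual == 0:
        raise ValueError("actual must be non-zero")
    return abs(predicted - actual) / abs(actual) * 100


ACTUAL_CO = 422.0

before = percent_error(-1.0, ACTUAL_CO)    # original model, mg/m3-scale output
after = percent_error(430.88, ACTUAL_CO)   # unit-corrected model

print(f"before: {before:.1f}%  after: {after:.1f}%")
```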

### 4. Extend (The Business Impact)
This discovery dictated the architecture of the Executive Dashboard. We could not simply display raw historical data alongside live data without causing confusion.
*   **Action:** I created a dedicated view (`pollution_trend_view`) that applies this multiplication factor dynamically.
*   **Visual Proof:** The Plotly Figure generated in this notebook demonstrates the corrected model tracking the live data accurately.
*   **Dashboard Link:** This logic is applied in the "Forecast Accuracy" KPI on the Looker Studio Dashboard.

# Project Prompt Log & DIVE Journal

**Project:** Hybrid Air Quality Pipeline

**Model:** Gemini-3-Pro-Preview (Google)

**Date:** Fall 2025

---

## System Instructions
The following system prompt established the persona and workflow for the session:

> "You are a big data and cloud computing expert supporting a upper level business school undergraduate elective course. You will be given the requirements for a comprehensive final project and will assist in building a python colab notebook.
>
> For each set of instructions given you will work step by step code wise. Give the first step, have the user run it and provide feedback/confirmation before moving on to the next step."

---

## Phase 1: Ingestion & Infrastructure (Iterative Debugging)

This phase demonstrates how prompts evolved to handle API deprecation and infrastructure errors.

### 1. Initial Setup & Configuration
*   **User Prompt:** "The first task... is Batch Ingest (from Kaggle)... Store raw data in GCS, load curated tables into BigQuery... This is my theorized workflow... set up all required libraries..."
*   **Evolution:** The initial Kaggle authentication step asked for credentials to be pasted into the notebook, which did not meet the requirement; the user wanted to upload the key file securely instead.
*   **Refined Prompt:** "This is good but I want to upload the json file instead."
*   **Outcome:** The model pivoted to using `google.colab.files.upload()` to securely handle the `kaggle.json` key.
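
The upload-based flow can be sketched as follows. `files.upload()` in Colab returns a dictionary mapping each uploaded filename to its bytes; the helper `install_kaggle_key` is a hypothetical name, and writing the key to `~/.kaggle/kaggle.json` with owner-only permissions reflects the location the Kaggle CLI conventionally reads, not necessarily the exact code the model produced.

```python
import os
from pathlib import Path


def install_kaggle_key(uploaded: dict, home: Path) -> Path:
    """Write an uploaded kaggle.json into <home>/.kaggle with owner-only permissions."""
    if "kaggle.json" not in uploaded:
        raise FileNotFoundError("expected a file named kaggle.json in the upload")
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    target.write_bytes(uploaded["kaggle.json"])
    os.chmod(target, 0o600)  # the Kaggle CLI warns if the key is readable by others
    return target


# In Colab, the dictionary would come from the upload widget:
#   from google.colab import files
#   uploaded = files.upload()
#   install_kaggle_key(uploaded, Path.home())
```

Keeping the key out of the notebook source means it is not committed to the repository along with the code.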

### 2. Streaming Ingestion (Handling Failure)
*   **User Prompt:** "The next part... is Streaming Ingest... The API we plan to use is the OpenAQ api."
*   **Failure Point 1 (Syntax):** The initial `gcloud` command generated by the model used a flag the CLI did not recognize (`--bigquery-schema-use-table-schema`).
*   **Refined Prompt:** Provided the exact error message: "unrecognized arguments: --bigquery-schema-use-table-schema (did you mean '--use-table-schema'?)"
*   **Outcome:** Model corrected the syntax in the Ops Runbook and setup scripts.

*   **Failure Point 2 (API Deprecation):** The generated code called OpenAQ V2 endpoints, which have been retired.
*   **Observation:** "API Error: 410 - Gone. Version 1 and Version 2 API endpoints are retired."
*   **Refined Prompt:** "I want to ensure the above code properly calls the API... please walk me through what data is available from OpenAQ."
*   **Evolution:** We attempted to perform a "Discovery" on OpenAQ V3. However, V3 separated metadata from data, returning "0 readings."
*   **Final Pivot Prompt:** "Are there other free air quality apis with more reliable documentation?"
*   **Resolution:** We switched to **Open-Meteo**, which provided reliable, auth-free access to the specific gases (CO, NO2, O3) required for the model.
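
A rough sketch of how an Open-Meteo request and its response could be shaped into the `(timestamp, parameter, value)` rows used by the streaming table. The endpoint URL, the `hourly` variable names, the response layout, and the short codes in `PARAM_CODES` are all assumptions to verify against the Open-Meteo documentation; `build_url` and `to_rows` are hypothetical helpers, and no network call is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint for Open-Meteo's air quality API.
BASE_URL = "https://air-quality-api.open-meteo.com/v1/air-quality"
# Assumed mapping from API variable names to the short codes in the streaming table.
PARAM_CODES = {"carbon_monoxide": "co", "nitrogen_dioxide": "no2", "ozone": "o3"}


def build_url(latitude: float, longitude: float) -> str:
    """Build a request URL for hourly CO, NO2 and O3 at the given coordinates."""
    query = urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "hourly": ",".join(PARAM_CODES),
    })
    return f"{BASE_URL}?{query}"


def to_rows(payload: dict) -> list[dict]:
    """Flatten an hourly payload into (timestamp, parameter, value) rows."""
    hourly = payload["hourly"]
    rows = []
    for api_name, code in PARAM_CODES.items():
        for ts, value in zip(hourly["time"], hourly.get(api_name, [])):
            if value is not None:  # skip gaps instead of streaming nulls
                rows.append({"timestamp": ts, "parameter": code, "value": value})
    return rows


# Approximate coordinates for Rome.
rome_url = build_url(41.9, 12.5)
print(rome_url)
```

The long format (one row per gas per timestamp) matches the `parameter` / `value` columns that the prediction query later pivots with `MAX(CASE WHEN ...)`.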

---

## Phase 2: Analytics & DIVE (Substantive Insight)

This section documents the "DIVE" process (Discover, Investigate, Validate, Extend) used to solve the critical ML failure.

### 1. Discover (The Prompt)
*   **User Prompt:** "Step 3.3: Predict on Live Streaming Data."
*   **Result:** The model returned predictions of `-1.0` while actual values were `422.0`.
*   **Prompt:** "The model seemed to be accurate [on evaluation]... But on live data the results were not as good... PREDICTED CO: -1.0 | ACTUAL CO: 422.0"

### 2. Investigate (The Hypothesis)
*   **Model Analysis:** The LLM suggested checking for **Data Drift** or Unit Mismatches.
*   **User Prompt:** "This is great insight, but instead of just noting the difference lets work on correcting for it, lets rebuild the model and engineer the model features to match the current APIs formatting."
*   **Action:** We ran a query comparing `AVG(target)` in 2004 vs 2025.
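*   **Finding:** The 2004 data averaged `2.1` (mg/m³) and the 2025 data averaged `422.0` (µg/m³). The two tables differ in units by a factor of exactly 1000 (1 mg = 1000 µg), even though the observed ratio of averages is closer to 200x.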

### 3. Validate (The Fix)
*   **Refined Prompt:** "Retrain BQML Model (Unit-Corrected)... multiply the Kaggle CO by 1000."
*   **Outcome:** The retrained model predicted `430.88` against an actual of `422.0`.

### 4. Extend (Visualization)
*   **User Prompt:** "I need... at least one interactive Plotly figure... linking to the dashboard section."
*   **Outcome:** Generated a Plotly time-series chart overlaying the "Predicted (Corrected)" line against the "Actual" line to visually prove the fix worked.

---

## Phase 3: Dashboarding (Optimization)

### 1. Looker Studio Errors
*   **User Prompt:** "Lets work step by step through... KPI 1... and KPI 5 (Time-Series)."
*   **Failure:** Looker Studio reported "Too many rows" for the time-series chart.
*   **Refined Prompt:** "fix one doesnt work looker says there are too many rows... chart date axis shows times from 9:00pm to 9:54pm."
*   **Evolution:** The model identified that high-frequency streaming (per-second readings) produced more rows than the Looker Studio time-series chart could render.
*   **Resolution:** We modified the SQL view to truncate timestamps to 1-minute buckets (`TIMESTAMP_TRUNC(timestamp, MINUTE)`), which resolved the rendering issue.
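
The effect of minute-level truncation can be illustrated in plain Python. This is a sketch of the idea rather than the view's actual logic: `truncate_to_minute` mirrors `TIMESTAMP_TRUNC(timestamp, MINUTE)`, while averaging the values inside each bucket is an assumption, since the view may aggregate differently.

```python
from collections import defaultdict
from datetime import datetime


def truncate_to_minute(ts: datetime) -> datetime:
    """Python analogue of TIMESTAMP_TRUNC(timestamp, MINUTE)."""
    return ts.replace(second=0, microsecond=0)


def average_per_minute(readings: list[tuple[datetime, float]]) -> dict[datetime, float]:
    """Collapse high-frequency readings into one averaged value per minute bucket."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[truncate_to_minute(ts)].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
```

With per-second data, this reduces up to 60 rows per minute to a single row, which is what brings the chart back under the row limit.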

### 2. Cloud Scheduling (Data Filling)
*   **User Prompt:** "The orignial time series plot is showing 0 for live and baseline all the way through."
*   **Evolution:** We realized that a 5-minute demo run does not collect enough points to form a trend line.
*   **Action:** We set up a Cloud Scheduler job to stream live data from the API continuously, filling the streaming table ahead of the Executive Demo.

---

## Phase 4: Documentation (Formatting)

*   **User Prompt:** "give me the ops_runbook.md again but with proper markdown formatting for copy paste... The issue is that your Markdown code block is being closed too early."
*   **Action:** The model corrected the Markdown rendering to ensure code blocks were nested correctly for direct copying into the GitHub repository.
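
One common way to avoid an outer code block closing early is to give it a fence longer than any backtick run inside it. The sketch below shows that technique; the helper name `fence` is hypothetical, and this is an assumption about the kind of fix involved rather than a record of what the model actually changed.

```python
import re


def fence(content: str, lang: str = "") -> str:
    """Wrap content in a backtick fence longer than any backtick run it contains."""
    longest_run = max((len(run) for run in re.findall(r"`+", content)), default=0)
    marker = "`" * max(3, longest_run + 1)
    return f"{marker}{lang}\n{content}\n{marker}"


# A runbook snippet that already contains a three-backtick block.
TICKS = "`" * 3
inner = f"{TICKS}bash\necho hello\n{TICKS}"

# The outer fence comes out four backticks long, so the inner block stays intact.
print(fence(inner, "markdown"))
```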


## Plotly Figure

```python
# @title Step 1.1: Setup & Authentication
# Install required libraries for Google Cloud and Kaggle
!pip install -q kaggle google-cloud-bigquery google-cloud-storage pandas db-dtypes

# Authenticate User for Google Cloud access
# This will trigger a popup to allow access to your GCP resources
from google.colab import auth
auth.authenticate_user()

print("Libraries installed and Google Cloud successfully authenticated!")
```

```
Libraries installed and Google Cloud successfully authenticated!
```


```python
# @title Step 1.2: Configuration & Kaggle JSON Upload
import os
import json
from google.colab import files
from google.cloud import bigquery
from google.cloud import storage

# --- Google Cloud Config ---
# User Input for Project ID
project_id = "mgmt467-fp-11" # @param {type:"string"}
region = "us-central1" # @param {type:"string"}

# Define Resource Names
bucket_name = f"air_quality_raw_{project_id}" # Unique bucket name
dataset_name = "air_quality_dataset"
table_name = "sensor_data"
streaming_table_name = "streaming_air_quality"

# Set the environment variable for the project
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Initialize Clients
bq_client = bigquery.Client(project=project_id)
storage_client = storage.Client(project=project_id)

print(f"Configuration set for Project: {project_id}")
print(f"Target Bucket: {bucket_name}")
print("-------------------------------------------------")
```

```
Configuration set for Project: mgmt467-fp-11
Target Bucket: air_quality_raw_mgmt467-fp-11
-------------------------------------------------
```


```python
# @title Interactive KPI Prototype (Plotly)
import plotly.graph_objects as go
from google.cloud import bigquery

# 1. Fetch Prediction vs Actuals
# Pivot the long-format stream (one row per parameter) into one row per timestamp,
# then score each row with the unit-corrected model.
kpi_query = f"""
    WITH live_features AS (
        SELECT
            timestamp,
            MAX(CASE WHEN parameter = 'no2' THEN value END) as no2,
            MAX(CASE WHEN parameter = 'co' THEN value END) as actual_co
        FROM `{project_id}.{dataset_name}.{streaming_table_name}`
        GROUP BY timestamp
        HAVING no2 IS NOT NULL AND actual_co IS NOT NULL
    )

    SELECT
        timestamp,
        predicted_label as predicted_co,
        actual_co
    FROM
        ML.PREDICT(MODEL `{project_id}.{dataset_name}.air_quality_model_v2`,
        (SELECT * FROM live_features))
    ORDER BY timestamp ASC
"""

df_kpi = bq_client.query(kpi_query).to_dataframe()

# 2. Build Interactive Chart
fig = go.Figure()

# Actual Line
fig.add_trace(go.Scatter(
    x=df_kpi['timestamp'],
    y=df_kpi['actual_co'],
    mode='lines+markers',
    name='Actual Live CO (µg/m³)',
    line=dict(color='blue')
))

# Predicted Line
fig.add_trace(go.Scatter(
    x=df_kpi['timestamp'],
    y=df_kpi['predicted_co'],
    mode='lines+markers',
    name='Predicted CO (Model)',
    line=dict(color='orange', dash='dot')
))

fig.update_layout(
    title='Model Performance: Predicted vs. Actual CO Levels',
    xaxis_title='Time (UTC)',
    yaxis_title='Concentration (µg/m³)',
    template='plotly_white'
)

fig.show()
```