
# Final Project Blueprint


## Section 1 — Business Problem

I want to build a near real-time weather-risk monitoring tool for a small delivery company operating in Indiana.
Dispatchers need to know when severe weather (heavy rain, snow, or high winds) is likely to affect on-time deliveries
in the next few hours so they can reroute drivers or adjust schedules. Historical weather and delay data will be used
to understand patterns, while live weather data will give an up-to-date view of current risk. Success will be measured
by reduced late deliveries during bad weather and by how quickly dispatchers can see and react to changing conditions.

## Section 2 — Data Sources

### 2.1 Batch Source
- **Source name / URL:** Kaggle “US Weather Events (2016–2021)” CSV export
- **File format & storage location:** Multiple CSV files uploaded to `gs://proven-agility-477721-q9-bucket/weather_batch/`
- **Rough size:** ~6 million rows, 6 years of data (2016–2021) across the U.S.
- **Important fields and grain:**
  - Grain: one row per weather event (e.g., `Rain`, `Snow`, `Fog`) in a given location and time interval
  - Fields: `event_type`, `severity`, `start_time`, `end_time`, `city`, `state`, `latitude`, `longitude`

### 2.2 Streaming Source
- **Public API name and endpoint:** OpenWeatherMap Current Weather API (`https://api.openweathermap.org/data/2.5/weather`)
- **How often new data arrives:** Cloud Scheduler will trigger the pipeline every 5 minutes.
- **Important fields I will use:** `dt` (timestamp), `main.temp`, `main.humidity`, `wind.speed`,
  `weather[0].main`, `weather[0].description`
- **Landing path:** `OpenWeather API → HTTP Cloud Function (ingest_weather_producer) → Pub/Sub topic (live-data-stream) →`
  `Dataflow template (Pub/Sub to BigQuery) → BigQuery table proven-agility-477721-q9.superstore_data.realtime_weather_streaming`
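
As a rough illustration of what the Cloud Function might publish, the sketch below flattens one OpenWeather response into a single row using the fields listed above. The function name `flatten_weather` and the output column names are assumptions (the `current_*` names mirror the streaming features in Section 4), and it assumes the API is called with `units=metric` so that `main.temp` arrives in Celsius rather than the default Kelvin.

```python
from datetime import datetime, timezone


def flatten_weather(payload: dict) -> dict:
    """Flatten one OpenWeather current-weather response into a flat row.

    Hypothetical helper: the output column names are assumptions, not a
    confirmed schema for realtime_weather_streaming.
    """
    main = payload.get("main", {})
    wind = payload.get("wind", {})
    # `weather` is a list; the plan only uses its first entry.
    first = (payload.get("weather") or [{}])[0]
    return {
        # `dt` is a Unix timestamp in seconds (UTC).
        "observed_at": datetime.fromtimestamp(payload["dt"], tz=timezone.utc).isoformat(),
        # Celsius only if the request used units=metric (assumed here).
        "current_temp_c": main.get("temp"),
        "current_humidity": main.get("humidity"),
        "current_wind_speed": wind.get("speed"),
        "current_condition_main": first.get("main"),
        "current_condition_description": first.get("description"),
    }
```

Missing sections come back as `None` instead of raising, so an incomplete API response would still produce a row that can be inspected later.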

## Section 3 — Cloud Architecture

### 3.1 Text Architecture

- **Batch path:**

  `Kaggle CSV files → GCS bucket (weather_batch) → BigQuery dataset superstore_data → table historical_weather`

- **Streaming path:**

  `OpenWeather API → Cloud Function (ingest_weather_producer) → Pub/Sub topic live-data-stream →`
  `Dataflow streaming job (Pub/Sub to BigQuery template) → BigQuery table realtime_weather_streaming`

- **ML + Analytics:**

  `historical_weather + realtime_weather_streaming (joined on location & time bucket) →`
  `BigQuery ML model (weather_delay_risk_model) → prediction table weather_risk_scores`

- **Dashboard:**

  `BigQuery tables (realtime_weather_streaming, weather_risk_scores) → Looker Studio dashboard (live risk map + trends)`
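
The "location & time bucket" join in the ML + Analytics path needs both tables to agree on a key. One possible way to derive that key is sketched below; the helper names, the one-hour default bucket, and the use of `city` and `state` as the location are all assumptions for illustration.

```python
def time_bucket(unix_seconds: int, bucket_minutes: int = 60) -> int:
    """Return the Unix timestamp at the start of the bucket containing unix_seconds."""
    size = bucket_minutes * 60
    return unix_seconds - (unix_seconds % size)


def join_key(city: str, state: str, unix_seconds: int, bucket_minutes: int = 60) -> tuple:
    """Build a (city, state, bucket_start) key; hypothetical helper.

    City and state are normalized so that "Indianapolis " / "in" from one
    source matches "indianapolis" / "IN" from the other.
    """
    return (
        city.strip().lower(),
        state.strip().upper(),
        time_bucket(unix_seconds, bucket_minutes),
    )
```

Rows from either table that produce the same tuple fall into the same location and time slot. The same rounding could be done in BigQuery itself once the bucket size is settled.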

### 3.2 Optional Diagram Link

_Example placeholder_: Link to architecture diagram stored in Google Drive or as an image in the report.

### Gemini Architecture Prompt

```python
prompt = """
# TASK: Act as a Google Cloud Solution Architect.
# CONTEXT: I am building a project to monitor near real-time weather risk for a small delivery company in Indiana.
# DATA SOURCES: A batch CSV of historical weather events (uploaded to BigQuery) and a streaming API (OpenWeather)
# that sends current weather data every few minutes.
# GOAL: A Looker Studio dashboard showing current high-risk areas plus historical patterns of bad weather and delays.
# REQUEST: Design a simple GCP architecture using services like Cloud Functions, Pub/Sub, Dataflow, BigQuery,
#          and BigQuery ML. Explain the specific role each service plays in this pipeline and how they connect together.
"""
prompt
```



Gemini said my design is a good fit for what I want to do. It explained that the Cloud Function is a simple way to collect data from the OpenWeather API. Pub/Sub helps make sure messages don’t get lost and lets Dataflow handle the streaming data separately. Dataflow keeps sending the live weather data into BigQuery, where I can also store my batch weather data. Gemini also said BigQuery ML is the right place to train and run my prediction model so everything stays in one system. It recommended using monitoring and alerts in case the API has problems or the stream stops working.
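
The monitoring suggestion could start with a simple freshness check on the streaming table. The sketch below is a minimal version; the function name and the 15-minute threshold (three missed 5-minute triggers) are assumptions, not part of Gemini's response.

```python
def is_stream_stale(last_event_unix: int, now_unix: int, max_gap_seconds: int = 900) -> bool:
    """Return True when the newest streamed row is older than max_gap_seconds.

    Hypothetical check: last_event_unix would come from the latest timestamp
    in realtime_weather_streaming; 900 s is an assumed threshold.
    """
    return (now_unix - last_event_unix) > max_gap_seconds
```

A scheduled job could run this against the most recent `dt` value and raise an alert when it returns `True`.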


## Section 4 — Machine Learning Plan (BigQuery ML)

- **Problem type:** Classification — predict whether weather conditions are **High-Risk** vs **Normal**
  for delivery delays in a given area and time slot.
- **Label (target variable):** `is_high_risk` (boolean), derived from historical data where severe weather
  corresponded to high numbers of late deliveries.
- **Features from batch table (`historical_weather`):**
  - `event_type` (encoded as dummy variables)
  - `severity` (ordinal)
  - `season` (derived from date)
  - `avg_temp` and `avg_wind_speed` for the event window
- **Features from streaming table (`realtime_weather_streaming`):**
  - `current_temp_c`, `current_wind_speed`, `current_humidity`
  - `current_condition_main` (rain, snow, clear, etc.)
- **BigQuery ML model type:** `LOGISTIC_REG` (logistic regression) in BigQuery ML.
- **Evaluation:** Use `ML.EVALUATE` to inspect accuracy, precision, recall, and ROC AUC, with a focus on recall
  for the high-risk class.
- **Usage:** Regularly run `ML.PREDICT` on the latest joined streaming features to generate a table of risk scores
  by region and time, which feeds the Looker Studio dashboard.
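
To make the plan concrete, the sketch below renders the three BigQuery ML statements it implies (`CREATE MODEL`, `ML.EVALUATE`, `ML.PREDICT`) as strings from Python. The dataset and model names come from Section 3; the table names passed in (for example `training_features`) and the helper functions themselves are hypothetical, and the feature list is only a subset of the one above.

```python
DATASET = "superstore_data"
MODEL = "weather_delay_risk_model"
LABEL = "is_high_risk"


def create_model_sql(training_table: str, features: list) -> str:
    """Render a CREATE MODEL statement for a logistic regression classifier."""
    cols = ",\n  ".join(list(features) + [LABEL])
    return (
        f"CREATE OR REPLACE MODEL `{DATASET}.{MODEL}`\n"
        f"OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['{LABEL}']) AS\n"
        f"SELECT\n  {cols}\nFROM `{DATASET}.{training_table}`"
    )


def evaluate_sql() -> str:
    """Render an ML.EVALUATE query (precision, recall, accuracy, roc_auc, ...)."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{DATASET}.{MODEL}`)"


def predict_sql(feature_table: str) -> str:
    """Render an ML.PREDICT query over a table of fresh features."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{DATASET}.{MODEL}`, "
        f"TABLE `{DATASET}.{feature_table}`)"
    )


if __name__ == "__main__":
    # training_features is an assumed table holding the joined features and label.
    print(create_model_sql("training_features", ["event_type", "severity", "season"]))
```

For a label named `is_high_risk`, `ML.PREDICT` returns its class prediction in a column called `predicted_is_high_risk`, which is what the dashboard tables would read.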

## Section 5 — Dashboard KPIs (Looker Studio)

| # | KPI name                               | Why it matters                                                                 | Data source/table                                                          |
|---|----------------------------------------|--------------------------------------------------------------------------------|----------------------------------------------------------------------------|
| 1 | % of regions flagged as High-Risk now  | Shows how widespread severe weather risk is at the current moment              | `weather_risk_scores` (latest predictions)                                |
| 2 | Avg predicted risk score by hour       | Reveals how risk has changed over the last 24 hours                            | `weather_risk_scores` aggregated by hour                                  |
| 3 | Number of severe weather events today  | Indicates how active the weather has been in the operating area                | `realtime_weather_streaming` filtered to today’s date (`historical_weather` ends in 2021) |
| 4 | Deliveries delayed during bad weather* | (If delivery data is added) measures real impact of weather on operations      | Join of delivery table with `historical_weather` (optional extension)     |
| 5 | Model confidence distribution          | Helps monitor times when the model is uncertain and may need retraining/tuning | `weather_risk_scores` (prediction probabilities)                          |

\*If there is no real delivery data, this KPI can be removed or simulated.
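
As a worked example of KPI 1, the sketch below keeps the most recent prediction per region and reports the share flagged as high-risk. The row keys `region` and `scored_at` are assumed column names for `weather_risk_scores`; `predicted_is_high_risk` follows BigQuery ML's `predicted_<label>` naming.

```python
def pct_high_risk_now(rows: list) -> float:
    """Percent of regions whose latest prediction is high-risk.

    Each row is a dict with assumed keys: region, scored_at (Unix seconds),
    and predicted_is_high_risk (bool).
    """
    latest = {}
    for row in rows:
        prev = latest.get(row["region"])
        # Keep only the newest score for each region.
        if prev is None or row["scored_at"] > prev["scored_at"]:
            latest[row["region"]] = row
    if not latest:
        return 0.0
    flagged = sum(1 for r in latest.values() if r["predicted_is_high_risk"])
    return 100.0 * flagged / len(latest)
```

In Looker Studio the same logic would more likely live in a BigQuery view, but the Python version makes the definition of "now" explicit: the latest score per region, not all scores from today.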

## Section 6 — Challenge: Devil’s Advocate Prompt

### 6.1 My Devil’s Advocate Prompt

> Play devil’s advocate for my weather-risk monitoring architecture. Identify the single biggest
> technical risk or failure point in this design (for example, dependence on the external OpenWeather
> API, cost of running a continuous Dataflow job, or maintaining consistent schemas between batch and
> streaming tables). Explain why it is risky and propose at least two concrete mitigation strategies
> I can implement before and during development.

### 6.2 Gemini’s Devil’s Advocate Response (summary)
Gemini said the biggest risk in my design is relying too heavily on the OpenWeather API. If the API goes down, changes its response format, or rate-limits my requests, the whole streaming pipeline could fail. To lower the risk, Gemini suggested adding retries and error logging so I know when problems happen, and storing backup copies of the API responses in GCS so data isn’t lost. It also mentioned that Dataflow could become expensive if it runs continuously, so I should monitor costs and scale the job only when needed.
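
The retry and error-logging mitigation could look roughly like the sketch below. It is a minimal version with exponential backoff; `fetch_with_retry` and its parameters are hypothetical, and the actual API call is passed in as `fetch` so the helper stays independent of any HTTP library.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep, on_error=print):
    """Call fetch() up to `retries` times, doubling the wait after each failure.

    Returns fetch()'s result, or None if every attempt raised. `sleep` and
    `on_error` are injectable so the helper can be tested without waiting.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            # Log every failure so outages are visible, not silent.
            on_error(f"attempt {attempt + 1}/{retries} failed: {exc}")
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

A `None` result tells the Cloud Function to skip publishing for that cycle instead of crashing. The GCS backup Gemini mentioned would be a separate step, writing each raw response to the bucket before it is published to Pub/Sub.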
