

# Lab 4 — Predicting Flight Arrival Delays (Regression with **BigQuery ML**)

**Objective:** Train, evaluate, and explain a **linear regression** model in **BigQuery ML** to predict `arr_delay` (arrival delay in minutes).  
**Business goal:** Support airline resource planning by estimating arrival delays.

> **Submission tips:** This notebook already includes completed prompts, queries, and explanations. You only need to set your **PROJECT_ID** (and optionally **DATASET_ID**) and then run all cells.


## 1) Setup

In [None]:

# If running in Colab, authenticate:
# from google.colab import auth
# auth.authenticate_user()

!pip -q install -U google-cloud-bigquery pandas-gbq google-cloud-bigquery-storage

from google.cloud import bigquery
import pandas as pd

print("BigQuery libraries installed.")



## 2) Configuration

- Set your **GCP project ID** and an existing **BigQuery dataset** where the model will be stored.  
- If the dataset doesn't exist, create it in the BigQuery UI before running the model cell.
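As an alternative to the UI, the dataset can also be created with a DDL statement. The sketch below only builds that statement. `build_create_schema_sql` is a hypothetical helper name, the project and dataset values are placeholders, and the commented usage line assumes the `client` object created in the configuration cell.

```python
import re

# BigQuery dataset IDs may contain only letters, digits, and underscores.
_DATASET_ID_RE = re.compile(r"^[A-Za-z0-9_]+$")


def build_create_schema_sql(project_id: str, dataset_id: str, location: str = "US") -> str:
    """Return a CREATE SCHEMA IF NOT EXISTS statement for the given dataset."""
    if not _DATASET_ID_RE.match(dataset_id):
        raise ValueError(f"Invalid dataset ID: {dataset_id!r}")
    return (
        f"CREATE SCHEMA IF NOT EXISTS `{project_id}.{dataset_id}` "
        f"OPTIONS(location = '{location}')"
    )


# Placeholder names, not the values used in this lab.
sql = build_create_schema_sql("my-project", "my_dataset")
print(sql)
# To run it against BigQuery: client.query(sql).result()
```

Because of `IF NOT EXISTS`, re-running the statement leaves an existing dataset untouched.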


In [None]:

PROJECT_ID = "confident-nexus-475419-k9"
DATASET_ID = "superstore_data"
LOCATION   = "US"
MODEL_ID   = "flight_delay_predictor"

MODEL_FQN = f"`{PROJECT_ID}.{DATASET_ID}.{MODEL_ID}`"

client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

print("Project:", PROJECT_ID)
print("Dataset:", DATASET_ID)
print("Model:", MODEL_FQN)


## 3) Feature Brainstorming (Gemini prompt + answer)

In [None]:

prompt = """
# TASK: Brainstorm features for a machine learning model.
# CONTEXT: I'm using the BigQuery public flights dataset. I want to predict the 'arr_delay' (arrival delay in minutes), which is a numerical value.
# GOAL: List 5 columns from the dataset that you think would be the best predictors for 'arr_delay' and briefly explain why for each one.
"""
print(prompt)



**Filled answer (documented rationale):**

| Feature     | Why it may help |
|-------------|------------------|
| `dep_delay` | Departure delays usually propagate to arrival delays. |
| `distance`  | Longer flights can recover or compound delays depending on routing and buffers. |
| `carrier`   | Airlines differ in operations and on‑time performance. |
| `origin`    | Airport congestion and local weather influence delays. |
| `month`     | Seasonality (storms, holidays) affects delay risk. |


## 4) Train a Regression Model (BQML)


We train a **linear regression** model with BQML.  
- **Label:** `arr_delay`  
- **Features:** `dep_delay, distance, carrier, origin, month`  
- **Public source table:** `bigquery-public-data.airline_ontime_data.flights`  
- We filter out rows where `arr_delay` or `dep_delay` is `NULL` and apply a **LIMIT** to cap the number of training rows and control cost.
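To make "linear regression" concrete, here is a minimal sketch of how such a model turns one flight into a prediction: an intercept, plus a weight times each numeric feature, plus one weight per category level. Every weight below is invented for illustration and `month` is left out for brevity. The weights of the trained model are not shown in this notebook and can be inspected with `ML.WEIGHTS`.

```python
# Hypothetical weights, chosen only for illustration.
INTERCEPT = -2.0
NUMERIC_WEIGHTS = {"dep_delay": 0.95, "distance": -0.001}
CATEGORY_WEIGHTS = {
    "carrier": {"AA": 1.5, "DL": -0.5},
    "origin": {"JFK": 2.0, "ATL": 0.5},
}


def predict_arr_delay(row: dict) -> float:
    """Intercept + weighted numeric features + one weight per category level."""
    total = INTERCEPT
    for name, weight in NUMERIC_WEIGHTS.items():
        total += weight * row[name]
    for name, levels in CATEGORY_WEIGHTS.items():
        # Category levels without a weight contribute nothing in this sketch.
        total += levels.get(row[name], 0.0)
    return total


flight = {"dep_delay": 30, "distance": 2000, "carrier": "AA", "origin": "JFK"}
prediction = predict_arr_delay(flight)
print(f"Predicted arrival delay: {prediction:.1f} minutes")
```

With weights like these, the departure delay dominates the prediction, which matches the rationale in the feature table.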


In [None]:

# OPTIONS clause (completed)
options_prompt = """
# TASK: Generate the OPTIONS clause for a BigQuery ML linear regression model.
# GOAL: I need to specify the model_type, the input_label_cols, and enable the model to be explainable.
"""
print(options_prompt)

create_model_sql = f"""
CREATE OR REPLACE MODEL {MODEL_FQN}
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['arr_delay'],
  enable_global_explain=TRUE
) AS
SELECT
  arr_delay,
  dep_delay,
  distance,
  carrier,
  origin,
  month
FROM
  `bigquery-public-data.airline_ontime_data.flights`
WHERE
  arr_delay IS NOT NULL
  AND dep_delay IS NOT NULL
LIMIT 200000
"""
print(create_model_sql)


In [None]:

# Run the CREATE MODEL job
job = client.query(create_model_sql)
job.result()
print("Model created successfully. Job ID:", job.job_id)


## 5) Evaluate Model Performance (ML.EVALUATE)

In [None]:

evaluate_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL {MODEL_FQN})
"""
print(evaluate_sql)

eval_job = client.query(evaluate_sql)
eval_df = eval_job.result().to_dataframe()
eval_df


## 6) Explain a Key Metric (Filled prompt + answer)

In [None]:

prompt = """
# TASK: Explain a technical metric to a non-technical audience.
# CONTEXT: My machine learning model has a 'mean_absolute_error' of 14.5.
# GOAL: Explain what this means in the business context of predicting flight delays, which are measured in minutes. Is an error of 14.5 good or bad in this context?
"""
print(prompt)



**Answer (example explanation):**  
A **mean absolute error (MAE) of 14.5** means that, on average, the model’s predictions are off by about **14–15 minutes**. For high-level planning (e.g., staffing or gate ranges), this can be **useful**. For minute‑by‑minute operations (tight turnarounds, crew changes), it may still be **too high**, so we’d either seek more features (weather, taxi‑out time) or consider a different model to reduce error.
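The calculation behind MAE is small enough to check by hand. The sketch below uses five made-up flights (the delays are not taken from the dataset) and averages the absolute gap between actual and predicted arrival delay.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up arrival delays in minutes (negative = early arrival).
actual_delays = [10, -5, 42, 0, 75]
predicted_delays = [18, 3, 30, 12, 60]

mae = mean_absolute_error(actual_delays, predicted_delays)
print(f"MAE: {mae} minutes")
```

Because MAE takes absolute values, over-predictions and under-predictions do not cancel out: a model that is 15 minutes high on one flight and 15 minutes low on the next still has an MAE of 15.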


## 7) Challenge — Explain a Prediction (Prompt + ML.EXPLAIN_PREDICT)

In [None]:

challenge_prompt = f"""
# TASK: Help me write an ML.EXPLAIN_PREDICT query in BigQuery ML.
# CONTEXT: I trained a linear regression model called `{MODEL_ID}` in dataset `{DATASET_ID}`.
# GOAL: Explain a prediction for a hypothetical 2000-mile flight on carrier 'AA' with a 30-minute departure delay.
# OUTPUT: Provide the full SQL query that uses ML.EXPLAIN_PREDICT to show feature contributions for this prediction.
"""
print(challenge_prompt)


In [None]:

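# Example EXPLAIN_PREDICT query. The input row must supply every feature the model was trained on,
# so origin and month are given example values here; change them to explain a different flight.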
explain_predict_sql = f"""
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL {MODEL_FQN},
  (
    SELECT
      30   AS dep_delay,
      2000 AS distance,
      'AA' AS carrier,
      'JFK' AS origin,
      7    AS month
  )
)
"""
print(explain_predict_sql)

explain_job = client.query(explain_predict_sql)
explain_df = explain_job.result().to_dataframe()
explain_df



### Interpreting the explanation
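Look at the `top_feature_attributions` column, an array of (feature, attribution) pairs, alongside `baseline_prediction_value` and `prediction_value`. Positive attributions **increase** the predicted arrival delay; negative attributions **decrease** it. For the example above, you’ll typically see: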
- **dep_delay** contributing the most (late departures usually land late),
- **carrier** adding/removing minutes depending on historical performance,
- **origin** and **month** nudging the prediction based on congestion/seasonality,
- **distance** with a smaller effect (some recovery time on longer routes).
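A minimal sketch of reading one explanation row in Python follows. It assumes the row has been converted to a plain dict (for example with `explain_df.iloc[0].to_dict()`) and that `top_feature_attributions` arrives as a list of dicts with `feature` and `attribution` keys. All numbers are invented for illustration, and `rank_attributions` is a hypothetical helper.

```python
# Hypothetical explanation row; every value is made up.
row = {
    "prediction_value": 27.0,
    "baseline_prediction_value": 5.5,
    "top_feature_attributions": [
        {"feature": "dep_delay", "attribution": 21.5},
        {"feature": "carrier", "attribution": 1.5},
        {"feature": "origin", "attribution": 1.0},
        {"feature": "month", "attribution": -0.5},
        {"feature": "distance", "attribution": -2.0},
    ],
}


def rank_attributions(explanation: dict) -> list:
    """Sort features by the size of their contribution, largest first."""
    return sorted(
        explanation["top_feature_attributions"],
        key=lambda item: abs(item["attribution"]),
        reverse=True,
    )


ranked = rank_attributions(row)
for item in ranked:
    direction = "adds" if item["attribution"] >= 0 else "removes"
    print(f"{item['feature']:<10} {direction} {abs(item['attribution']):.1f} min")

# When all features are listed, baseline + attributions reproduces the prediction.
reconstructed = row["baseline_prediction_value"] + sum(
    item["attribution"] for item in row["top_feature_attributions"]
)
print(f"Reconstructed prediction: {reconstructed:.1f} min")
```

The reconstruction check is a quick way to confirm that no feature was dropped from the attribution list.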



## 8) Checklist & Submission

- ✅ Feature brainstorming prompt + filled rationale  
- ✅ `CREATE MODEL` SQL with selected features (and cost control `LIMIT`)  
- ✅ `ML.EVALUATE` results table  
- ✅ MAE explanation for a non‑technical audience  
- ✅ Challenge prompt + `ML.EXPLAIN_PREDICT` query and interpretation  

**Before you submit:**
1. Set `PROJECT_ID` (and `DATASET_ID` if needed).
2. Run all cells.
3. Push this notebook to your GitHub repository.
4. Submit the GitHub link in Brightspace.
