
# **Lab 4: Predicting Delays (Regression with BQML)**
**Unit 2 • Week 7 (Thu) — ML Foundations & Regression**

**Objective:** Train, evaluate, and interpret a **linear regression** model with **BigQuery ML** to predict a flight’s **arrival delay (minutes)**. You’ll use Gemini for feature brainstorming and explanation, and run all SQL from **Colab**.

**Business Goal:** Improve airline resource planning by predicting arrival delays.

> **Dataset (recommended):** `bigquery-public-data.airline_ontime_data.flights`  
> If your project uses a different on-time dataset, update the `FULL_TABLE` constant below accordingly.


## Setup & Authentication

In [2]:

# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd
import os

# ▶️ Prompt user to enter their GCP Project ID
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"

# ▶️ Export environment variable for consistency
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# ▶️ Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

print("✅ Authenticated successfully!")
print("Project:", PROJECT_ID)
print("Region:", REGION)

# ▶️ Set active project for gcloud
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project

# ▶️ Example: choose your working table (change if needed)
FULL_TABLE = f"{PROJECT_ID}.flights.flights predict"
print("Using table:", FULL_TABLE)

# Done: Auth + Project/Region set


Enter your GCP Project ID: mgmt-46700-471119
✅ Authenticated successfully!
Project: mgmt-46700-471119
Region: us-central1
Updated property [core/project].
mgmt-46700-471119
Using table: mgmt-46700-471119.flights.flights predict
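
Because `FULL_TABLE` is pasted into SQL text later in the lab, it can help to check its shape once. The sketch below is not part of the original lab: `quote_table_id` is a hypothetical helper that assumes the `project.dataset.table` form and wraps the ID in backticks, which is the safe way to reference names containing hyphens or spaces (as this table name does).

```python
def quote_table_id(full_table: str) -> str:
    """Return a backtick-quoted table reference for use in SQL text.

    Hypothetical helper: assumes the ID has exactly three dot-separated
    parts (project, dataset, table) and contains no backticks.
    """
    parts = full_table.split(".")
    if len(parts) != 3 or not all(p.strip() for p in parts):
        raise ValueError(f"Expected project.dataset.table, got: {full_table!r}")
    if "`" in full_table:
        raise ValueError("Table ID must not contain backticks")
    return f"`{full_table}`"


# Example with the table used in this notebook
print(quote_table_id("mgmt-46700-471119.flights.flights predict"))
```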


In [3]:
!gcloud config get-value project
print("Region:", REGION)

mgmt-46700-471119
Region: us-central1


Use Gemini to brainstorm predictive features:

```python
prompt = """
# TASK: Brainstorm features for a machine learning model.
# CONTEXT: I'm using the BigQuery public flights dataset. I want to predict the 'arr_delay' (arrival delay in minutes), which is a numerical value.
# GOAL: List 5 columns from the dataset that you think would be the best predictors for 'arr_delay' and briefly explain why for each one.
"""
```
Add your chosen features below.


**Chosen features (edit this cell):** `dep_delay`, `distance`, `carrier`, `origin`, `dest`
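
One way to keep the chosen features in a single place is to generate the `SELECT` list from a small mapping instead of typing each `CAST` by hand. This is only a sketch: `FEATURES`, `LABEL`, and `build_select_list` are hypothetical names, and the column types are assumptions based on the features listed above.

```python
# Hypothetical mapping of column name -> BigQuery type for the chosen features.
FEATURES = {
    "dep_delay": "FLOAT64",
    "distance": "FLOAT64",
    "carrier": "STRING",
    "origin": "STRING",
    "dest": "STRING",
}
LABEL = "arr_delay"


def build_select_list(label: str, features: dict[str, str]) -> str:
    """Render 'CAST(col AS TYPE) AS col' lines, with the label column first."""
    columns = {label: "FLOAT64", **features}
    lines = [f"  CAST({col} AS {typ}) AS {col}" for col, typ in columns.items()]
    return ",\n".join(lines)


print(build_select_list(LABEL, FEATURES))
```

Adding or removing a feature then means editing one dictionary entry, and the generated text can be dropped into the training query.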


---
## Train Your First Regression Model

Use **BQML** to create a **linear regression** model. Edit the feature list as needed.  
We also set `enable_global_explain=TRUE` so that global feature importance (`ML.GLOBAL_EXPLAIN`) is available alongside the per-prediction explanations from **ML.EXPLAIN_PREDICT**.
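
The training query in this section caps the data with `LIMIT 200000`. Without an `ORDER BY`, `LIMIT` returns whichever rows BigQuery happens to read first, so the subset is neither random nor guaranteed to repeat. A possible alternative, sketched below, keeps a fixed percentage of rows by hashing each row. The helper name `sampled_training_query` and the choice to hash the whole row with `TO_JSON_STRING` are assumptions for illustration, not something the lab prescribes.

```python
def sampled_training_query(full_table: str, percent: int) -> str:
    """Build a SELECT that keeps roughly `percent`% of rows, repeatably.

    Each row is serialized with TO_JSON_STRING and hashed with
    FARM_FINGERPRINT; rows whose hash bucket (0-99) is below `percent`
    are kept. Identical duplicate rows always land in the same bucket.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"""
SELECT
  CAST(arr_delay AS FLOAT64) AS arr_delay,
  CAST(dep_delay AS FLOAT64) AS dep_delay,
  CAST(distance  AS FLOAT64) AS distance,
  CAST(carrier   AS STRING)  AS carrier,
  CAST(origin    AS STRING)  AS origin,
  CAST(dest      AS STRING)  AS dest
FROM `{full_table}` AS t
WHERE arr_delay IS NOT NULL
  AND MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 100) < {percent}
"""


# Example: keep about 10% of the rows
print(sampled_training_query("mgmt-46700-471119.flights.flights predict", 10))
```

The returned text could replace the `SELECT ... LIMIT` portion after `AS` in the `CREATE MODEL` statement.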


In [4]:
# STEP 2: Create and Train a Linear Regression Model in BigQuery ML

# Create a linear regression model
# NOTE: Adjust the feature list if your columns differ.
model_id = f"{PROJECT_ID}.flights.flight_delay_predictor"

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['arr_delay'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(arr_delay AS FLOAT64) AS arr_delay,
  CAST(dep_delay AS FLOAT64) AS dep_delay,
  CAST(distance  AS FLOAT64) AS distance,
  CAST(carrier   AS STRING)  AS carrier,
  CAST(origin    AS STRING)  AS origin,
  CAST(dest      AS STRING)  AS dest
FROM `{FULL_TABLE}`
WHERE arr_delay IS NOT NULL
LIMIT 200000;  -- cap the number of training rows to avoid using too many resources
"""

print("⏳ Training model... This may take 1–2 minutes.")
job = client.query(create_model_sql)
job.result()

print("✅ Model created successfully!")
print("Model ID:", model_id)


⏳ Training model... This may take 1–2 minutes.
✅ Model created successfully!
Model ID: mgmt-46700-471119.flights.flight_delay_predictor
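
Since the model was trained with `enable_global_explain=TRUE`, its global feature attributions can be queried with `ML.GLOBAL_EXPLAIN`. The sketch below only builds the query text; `global_explain_sql` is a hypothetical helper, and running the query assumes the `client` object created in the setup cell.

```python
def global_explain_sql(model_id: str) -> str:
    """Build a query that lists global feature attributions for a model."""
    return f"SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `{model_id}`)"


sql = global_explain_sql("mgmt-46700-471119.flights.flight_delay_predictor")
print(sql)

# In the notebook, with `client` already created:
# global_df = client.query(sql).result().to_dataframe()
```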



---
## Evaluate Model Performance (`ML.EVALUATE`) — Validate

Run evaluation to view **mean_absolute_error**, **r2_score**, etc. Then use Gemini to explain a key metric to a non-technical audience.

**Gemini Explainer Prompt:**

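```python
prompt = """
# TASK: Explain a regression evaluation metric in plain language.
# CONTEXT: My BigQuery ML linear regression model predicts flight arrival delay in minutes. ML.EVALUATE reported a mean_absolute_error of about 11.5 and an r2_score of about 0.88.
# GOAL: Explain what mean_absolute_error means for a non-technical airline operations audience in 3-4 sentences, without jargon.
"""
```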


In [5]:

eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.flights.flight_delay_predictor`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df


| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 11.481701 | 257.534469 | 0.963442 | 8.880821 | 0.877425 | 0.877426 |
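
To make the metrics above more concrete, the following sketch computes MAE, MSE, and R² by hand on a tiny made-up example, then converts the reported `mean_squared_error` into an RMSE in minutes. The toy values and the function name `regression_metrics` are illustrative assumptions; only `257.534469` comes from the evaluation output.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, and R^2 from paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - (mse * n) / ss_tot                         # 1 - SS_res / SS_tot
    return {"mae": mae, "mse": mse, "r2": r2}


# Toy data (made up): actual vs. predicted delays in minutes
toy = regression_metrics([10, 0, 30, -5], [12, -2, 25, -5])
print(toy)  # mae = 2.25, mse = 8.25, r2 ≈ 0.954

# RMSE from the mean_squared_error reported by ML.EVALUATE
reported_mse = 257.534469
rmse = math.sqrt(reported_mse)
print(f"RMSE ≈ {rmse:.2f} minutes")  # RMSE ≈ 16.05 minutes
```

Read together, the evaluation says predictions are off by about 11.5 minutes on average (MAE), with larger misses pulling the RMSE up to about 16 minutes.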



---
## Explain Predictions (`ML.EXPLAIN_PREDICT`) — Why did the model predict that?

Author your own prompt to generate a query that explains a **hypothetical** case:  
> Predict the arrival delay for a 2000-mile flight with a 30-minute departure delay on carrier 'AA', and explain the top contributing features.

**Hint:** You’ll need a small input table with the required features.

> **Your Gemini prompt (write in the next cell):**  
> *Ask for a full `ML.EXPLAIN_PREDICT` example for this model and scenario, returning top features.*


In [6]:

# Example scaffold: fill in the SELECT for your hypothetical instance
explain_sql = f"""
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `{PROJECT_ID}.flights.flight_delay_predictor`,
  (SELECT
     30.0  AS dep_delay,
     2000.0 AS distance,
     'AA'  AS carrier,
     'JFK' AS origin,
     'LAX' AS dest
  ),
  STRUCT(3 AS top_k_features)
)
"""
explain_df = client.query(explain_sql).result().to_dataframe()
explain_df


| | predicted_arr_delay | top_feature_attributions | baseline_prediction_value | prediction_value | approximation_error | dep_delay | distance | carrier | origin | dest |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -271659.694111 | [{'feature': 'carrier', 'attribution': -431277... | 170016.956925 | -271659.694111 | 0.0 | 30.0 | 2000.0 | AA | JFK | LAX |
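
For a linear model, the attributions are additive: the prediction equals the baseline plus the sum of all feature attributions, which is consistent with the `approximation_error` of 0.0 above. The sketch below uses the two values from the output to recover that sum and applies a simple plausibility check. The bounds in `is_plausible_delay` are arbitrary assumptions chosen for illustration, and only three of the five attributions are listed because `top_k_features` is 3.

```python
# Values copied from the ML.EXPLAIN_PREDICT output above
baseline = 170016.956925
prediction = -271659.694111

# prediction = baseline + sum(attributions), so the sum can be recovered
total_attribution = prediction - baseline
print(f"Sum of all feature attributions: {total_attribution:,.2f} minutes")


def is_plausible_delay(minutes, low=-120.0, high=1440.0):
    """Hypothetical sanity check: between 2 hours early and 24 hours late."""
    return low <= minutes <= high


print(is_plausible_delay(prediction))  # False
```

A predicted delay of roughly -271,660 minutes is not a realistic value, so this result deserves investigation before it goes into a report. One possible cause (an assumption, not verified here) is very large, offsetting weights on the one-hot encoded `carrier`, `origin`, and `dest` columns. Retraining with regularization (BigQuery ML exposes an `l2_reg` option) or with fewer high-cardinality features may be worth trying.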


### Challenge: Construct Your Own Prompt


prompt = """
TASK: Write a BigQuery ML.EXPLAIN_PREDICT query.
CONTEXT: I trained a regression model named `flights.flight_delay_predictor` that predicts flight arrival delays.
GOAL: Show how to use ML.EXPLAIN_PREDICT to understand why the model predicted a certain delay value.
SCENARIO: Predict the delay for a hypothetical flight with the following details:
- carrier: 'AA'
- departure delay: 30 minutes
- distance: 2000 miles
The query should return the predicted delay and feature contributions explaining the prediction.
"""


---
## ✅ Deliverable for Lab 4

- Completed `Lab4_Regression_BQML.ipynb` with:
  - Brainstormed features + rationale
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` output and explanation
  - `ML.EXPLAIN_PREDICT` output (hypothetical case)
- Push to **GitHub** and submit the link on **Brightspace**.
