
# **Lab 4: Predicting Delays (Regression with BQML)**
**Unit 2 • Week 7 (Thu) — ML Foundations & Regression**

**Objective:** Train, evaluate, and interpret a **linear regression** model with **BigQuery ML** to predict a flight’s **arrival delay (minutes)**. You’ll use Gemini for feature brainstorming and explanation, and run all SQL from **Colab**.

**Business Goal:** Improve airline resource planning by predicting arrival delays.

> **Dataset (recommended):** `bigquery-public-data.airline_ontime_data.flights`  
> If your project uses a different on-time dataset, update the `FULL_TABLE` constant below accordingly.


## Setup & Authentication

In [9]:
# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

# ▶️ Set your GCP project id
PROJECT_ID = "sapient-office-471119-g4"

# ▶️ Table containing flights data (update if your dataset differs)
FULL_TABLE = "sapient-office-471119-g4.superstore_data.airdash_base_enriched"  # e.g., "your_project.your_dataset.flights"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
print("Using table:", FULL_TABLE)

Authenticated. Project: sapient-office-471119-g4
Using table: sapient-office-471119-g4.superstore_data.airdash_base_enriched




---
## Feature Brainstorming (Gemini Prompt)

Use Gemini to brainstorm predictive features:

```python
prompt = "Brainstorm predictive features for this notebook."
```
Add your chosen features below.

- **Time-based features:** Day of the week, time of day, month, or season could influence delays.
- **Flight-specific features:** Aircraft type, tail number, or the specific flight number might be correlated with delays.
- **Weather information:** Not directly available in this dataset, but external weather data for the origin and destination airports could be a strong predictor.
- **Airport congestion:** The number of flights scheduled around the same time at the origin or destination airport could contribute to delays.
- **Previous flight performance:** The on-time performance of the same aircraft or carrier on previous flights could be a useful feature.
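The time-based features suggested above could be derived from a scheduled departure timestamp. The following is a minimal sketch, assuming a hypothetical ISO 8601 timestamp string; the lab table may store or name this field differently, and `time_features` is an illustrative helper, not part of the lab.

```python
from datetime import datetime

def time_features(scheduled_dep: str) -> dict:
    """Derive simple calendar features from an ISO 8601 timestamp string.

    `scheduled_dep` is a hypothetical input; adapt it to your table's columns.
    """
    ts = datetime.fromisoformat(scheduled_dep)
    return {
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "dep_hour": ts.hour,              # hour of day, 0-23
        "month": ts.month,                # 1-12
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }

features = time_features("2024-07-04T17:45:00")
print(features)
```

In BigQuery SQL the same idea is usually expressed with `EXTRACT(DAYOFWEEK FROM ...)`, `EXTRACT(HOUR FROM ...)`, and `EXTRACT(MONTH FROM ...)`; note that BigQuery's `DAYOFWEEK` numbers days from 1 (Sunday) to 7 (Saturday), unlike Python's `weekday()`.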


**Chosen features (edit this cell):** `dep_delay_min`, `distance_miles`, `carrier`, `origin`, `dest`


---
## Train Your First Regression Model

Use **BQML** to create a **linear regression** model. Edit the feature list as needed.  
We also enable explanations to support **ML.EXPLAIN_PREDICT** and global feature importance.


In [12]:
# Create a linear regression model
# NOTE: Adjust the feature list if your columns differ.
model_id = f"{PROJECT_ID}.superstore_data.flight_delay_predictor"  # You may change dataset/model name

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['arr_delay_min'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(arr_delay_min AS FLOAT64) AS arr_delay_min,
  CAST(dep_delay_min AS FLOAT64) AS dep_delay_min,
  CAST(distance_miles  AS FLOAT64) AS distance_miles,
  CAST(carrier   AS STRING)  AS carrier,
  CAST(origin    AS STRING)  AS origin,
  CAST(dest      AS STRING)  AS dest
FROM `{FULL_TABLE}`
WHERE arr_delay_min IS NOT NULL
LIMIT 200000;  -- keep costs low
"""

job = client.query(create_model_sql)
job.result()
print("Model created:", model_id)

Model created: sapient-office-471119-g4.superstore_data.flight_delay_predictor
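Since the feature list is meant to be edited, it may help to generate the statement from lists of column names rather than editing the SQL by hand. The sketch below assumes a hypothetical helper, `build_create_model_sql`, and placeholder project, dataset, and table names; it only builds the SQL string and does not run it.

```python
def build_create_model_sql(model_id, table, label, numeric, categorical, limit=200000):
    """Assemble a BQML linear regression CREATE MODEL statement from feature lists.

    Hypothetical helper for illustration; it mirrors the options used in this lab.
    """
    cols = [f"  CAST({label} AS FLOAT64) AS {label}"]
    cols += [f"  CAST({c} AS FLOAT64) AS {c}" for c in numeric]
    cols += [f"  CAST({c} AS STRING) AS {c}" for c in categorical]
    select_list = ",\n".join(cols)
    return (
        f"CREATE OR REPLACE MODEL `{model_id}`\n"
        "OPTIONS(\n"
        "  model_type='linear_reg',\n"
        f"  input_label_cols=['{label}'],\n"
        "  enable_global_explain=TRUE\n"
        ") AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM `{table}`\n"
        f"WHERE {label} IS NOT NULL\n"
        f"LIMIT {limit}"
    )

# Placeholder identifiers; substitute your own project, dataset, and table.
sql = build_create_model_sql(
    model_id="your_project.your_dataset.flight_delay_predictor",
    table="your_project.your_dataset.flights",
    label="arr_delay_min",
    numeric=["dep_delay_min", "distance_miles"],
    categorical=["carrier", "origin", "dest"],
)
print(sql)
```

The resulting string can be passed to `client.query(sql).result()` in the same way as `create_model_sql` above. Adding or removing a feature then means editing one Python list.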



---
## Evaluate Model Performance (`ML.EVALUATE`) — Validate

Run evaluation to view **mean_absolute_error**, **r2_score**, etc. Then use Gemini to explain a key metric to a non-technical audience.

**Gemini Explainer Prompt:**

```python
prompt = "Explain a key metric from the model evaluation to a non-technical audience."
```

**Gemini response:**

> Okay, let's explain a key metric from the model evaluation to a non-technical audience.
>
> Looking at the evaluation results, the Mean Absolute Error (MAE) is a key metric. Think of it this way:
>
> If the model predicts a flight will be 10 minutes late, and it actually arrives 15 minutes late, the error for that single flight is 5 minutes. The Mean Absolute Error is essentially the average of these error amounts across all the flights the model predicted on.
>
> So, if your model has a Mean Absolute Error of, say, 8 minutes, it means that, on average, the model's prediction for a flight's arrival delay is off by about 8 minutes. A lower MAE means the model's predictions are closer to the actual delays.
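The averaging described in the explanation can be reproduced by hand. The sketch below uses four hypothetical flights (the first one matches the 10-versus-15-minute example); the numbers are made up for illustration and do not come from the lab dataset.

```python
def mean_absolute_error(predicted, actual):
    """Average of the absolute differences between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical delays in minutes (negative means the flight arrived early).
predicted = [10, 0, 25, 5]
actual = [15, -5, 20, 20]

# Per-flight errors are 5, 5, 5, and 15 minutes.
mae = mean_absolute_error(predicted, actual)
print(f"MAE: {mae} minutes")
```

Note that the sign of each error is discarded: predicting 5 minutes too early counts the same as predicting 5 minutes too late.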


In [13]:

eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df


| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 8.003696 | 102.745023 | 0.942901 | 6.01449 | 0.597406 | 0.597558 |
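The squared-error metric is easier to interpret once it is converted back to minutes. The sketch below takes the values reported in the evaluation output above and derives the root mean squared error (RMSE); the variable names are illustrative.

```python
import math

# Values copied from the ML.EVALUATE output above.
mean_absolute_error = 8.003696
mean_squared_error = 102.745023
r2_score = 0.597406

# RMSE is in the same unit as the label (minutes), unlike MSE (minutes squared).
rmse = math.sqrt(mean_squared_error)

print(f"MAE:  {mean_absolute_error:.2f} minutes")
print(f"RMSE: {rmse:.2f} minutes")
print(f"Share of variance explained (R^2): {r2_score:.1%}")
```

RMSE is never smaller than MAE, and because it weights large misses more heavily, a gap between the two (here roughly 10.1 versus 8.0 minutes) suggests that some flights are predicted considerably worse than the typical case.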



---
## Explain Predictions (`ML.EXPLAIN_PREDICT`) — Why did the model predict that?

Author your own prompt to generate a query that explains a **hypothetical** case:  
> Predict the arrival delay for a 2000-mile flight with a 30-minute departure delay on carrier 'AA', and explain the top contributing features.

**Hint:** You’ll need a small input table with the required features.

> **Your Gemini prompt (write in the next cell):**  
> *Ask for a full `ML.EXPLAIN_PREDICT` example for this model and scenario, returning top features.*


In [14]:
# Explain one hypothetical flight: edit the values in the inner SELECT to match your scenario
explain_sql = f"""
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`,
  (SELECT
     30.0  AS dep_delay_min,
     2000.0 AS distance_miles,
     'AA'  AS carrier,
     'JFK' AS origin,
     'LAX' AS dest
  ),
  STRUCT(3 AS top_k_features)
)
"""
explain_df = client.query(explain_sql).result().to_dataframe()
explain_df

| predicted_arr_delay_min | top_feature_attributions | baseline_prediction_value | prediction_value | approximation_error | dep_delay_min | distance_miles | carrier | origin | dest |
|---|---|---|---|---|---|---|---|---|---|
| -8137.235882 | `[{'feature': 'carrier', 'attribution': -1126.9...` | -6863.403084 | -8137.235882 | 0.0 | 30.0 | 2000.0 | AA | JFK | LAX |
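The explanation columns can be read as arithmetic. The sketch below assumes the attributions are additive, meaning the baseline plus the attributions of all features equals the prediction (up to `approximation_error`, which is 0.0 here). It also applies a plausibility check whose thresholds are hypothetical and chosen only for illustration.

```python
# Values copied from the ML.EXPLAIN_PREDICT output above.
baseline_prediction_value = -6863.403084
prediction_value = -8137.235882

# Under the additivity assumption, this is the sum of all feature attributions.
# With top_k_features=3, only the three largest are listed in the output.
total_attribution = prediction_value - baseline_prediction_value
print(f"Total attribution across all features: {total_attribution:.6f} minutes")

def is_plausible(delay_minutes, low=-120.0, high=1440.0):
    """Flag predictions outside a hypothetical sanity range (2 h early to 24 h late)."""
    return low <= delay_minutes <= high

print(f"Prediction in days: {prediction_value / 60 / 24:.2f}")
print("Plausible arrival delay?", is_plausible(prediction_value))
```

A predicted delay of several days in the negative direction fails the check, so this result should be investigated before it is interpreted. Possible things to verify include whether the `carrier`, `origin`, and `dest` values in the hypothetical row actually occur in the training data, and whether the numeric inputs fall within the range the model was trained on.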



---
## ✅ Deliverable for Lab 4

- Completed `Lab4_Regression_BQML.ipynb` with:
  - Brainstormed features + rationale
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` output and explanation
  - `ML.EXPLAIN_PREDICT` output (hypothetical case)
- Push to **GitHub** and submit the link on **Brightspace**.
