
# **Lab 4: Predicting Delays (Regression with BQML)**
**Unit 2 • Week 7 (Thu) — ML Foundations & Regression**

**Objective:** Train, evaluate, and interpret a **linear regression** model with **BigQuery ML** to predict a flight’s **arrival delay (minutes)**. You’ll use Gemini for feature brainstorming and explanation, and run all SQL from **Colab**.

**Business Goal:** Improve airline resource planning by predicting arrival delays.

> **Dataset (recommended):** `bigquery-samples.airline_ontime_data.flights`  
> If your project uses a different on-time dataset, update the `FULL_TABLE` constant below accordingly.


## Setup & Authentication

In [4]:

# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

# ▶️ Set your GCP project id
PROJECT_ID = "noble-broker-471012-q6"

# ▶️ Table containing flights data (update if your dataset differs)
FULL_TABLE = "bigquery-samples.airline_ontime_data.flights"  # e.g., "your_project.your_dataset.flights"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
print("Using table:", FULL_TABLE)


Authenticated. Project: noble-broker-471012-q6
Using table: bigquery-samples.airline_ontime_data.flights
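
Before choosing features, it can help to list the table's columns and types. The sketch below builds an `INFORMATION_SCHEMA.COLUMNS` query from the table id. The helper name `build_schema_sql` is hypothetical, and the sketch assumes the id has the `project.dataset.table` form and that your account is allowed to read the dataset's metadata.

```python
def build_schema_sql(full_table: str) -> str:
    """Return SQL that lists column names and types for a `project.dataset.table` id."""
    project, dataset, table = full_table.split(".")
    return (
        "SELECT column_name, data_type\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.COLUMNS\n"
        f"WHERE table_name = '{table}'\n"
        "ORDER BY ordinal_position"
    )

schema_sql = build_schema_sql("bigquery-samples.airline_ontime_data.flights")
print(schema_sql)
# In the notebook, run it with the client created above:
# client.query(schema_sql).result().to_dataframe()
```

Querying metadata this way avoids scanning the flights table itself just to see its column names.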



---
## Feature Brainstorming (Gemini Prompt)

Use Gemini to brainstorm predictive features:

```python
prompt = """
TASK: Generate a BQML query to create a regression model.
GOAL: Predict the arrival delay ('arr_delay') using 'carrier' and 'dep_delay' as features.
SPECIFICATIONS: The model type must be a numerical regression.
"""
```
Add your chosen features below.


**Chosen features (edit this cell):** `dep_delay`, `airline`, `departure_airport`, `arrival_airport`. I dropped `distance` because the dataset has no such column. It could be computed from the airports' latitude/longitude, but I decided against it to keep the computation cheap.
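
If you later want to add distance back as a feature, the great-circle distance between two airports can be derived from their coordinates with the haversine formula. The sketch below is plain Python for illustration only. The function name is hypothetical and the airport coordinates are approximate values supplied here, not read from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates (assumed for illustration)
SFO = (37.62, -122.38)
JFK = (40.64, -73.78)

distance_km = haversine_km(*SFO, *JFK)
print(f"SFO to JFK: {distance_km:.0f} km")
```

To use this as a model feature, the same arithmetic would have to run in SQL over every training row. BigQuery's geography functions `ST_GEOGPOINT` and `ST_DISTANCE` are one possible way to do that.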


---
## Train Your First Regression Model

Use **BQML** to create a **linear regression** model. Edit the feature list as needed.  
We also enable explanations to support **ML.EXPLAIN_PREDICT** and global feature importance.


In [12]:
# Create the BigQuery dataset to store the model
dataset_id = f"{PROJECT_ID}.superstore_data"
dataset = bigquery.Dataset(dataset_id)
dataset.location = "US"

client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset_id} created or already exists.")

# Create a linear regression model
# NOTE: The source columns are snake_case (arrival_delay, departure_delay, ...);
# the SELECT below aliases them to shorter feature names.
model_id = f"{PROJECT_ID}.superstore_data.flight_delay_predictor"  # You may change dataset/model name

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['arr_delay'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(arrival_delay AS FLOAT64) AS arr_delay,
  CAST(departure_delay AS FLOAT64) AS dep_delay,
  CAST(airline   AS STRING)  AS carrier,
  CAST(departure_airport    AS STRING)  AS origin,
  CAST(arrival_airport      AS STRING)  AS dest
FROM `{FULL_TABLE}`
WHERE arrival_delay IS NOT NULL
LIMIT 200000;  -- cap training rows to keep costs low (without ORDER BY, the rows kept are arbitrary)
"""

job = client.query(create_model_sql)
job.result()
print("Model created:", model_id)

Dataset noble-broker-471012-q6.superstore_data created or already exists.
Model created: noble-broker-471012-q6.superstore_data.flight_delay_predictor
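
Because the model was created with `enable_global_explain=TRUE`, its overall feature importance can be queried with `ML.GLOBAL_EXPLAIN`. A minimal sketch follows. The helper name `build_global_explain_sql` is hypothetical, and the project id in the example call is a placeholder.

```python
def build_global_explain_sql(model_id: str) -> str:
    """Return SQL that asks BQML for global feature attributions of a model."""
    return f"SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `{model_id}`)"

# Placeholder model id; in the notebook, pass the model_id defined above.
global_explain_sql = build_global_explain_sql(
    "your-project.superstore_data.flight_delay_predictor"
)
print(global_explain_sql)
# In the notebook:
# client.query(global_explain_sql).result().to_dataframe()
```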



---
## Evaluate Model Performance (`ML.EVALUATE`) — Validate

Run evaluation to view **mean_absolute_error**, **r2_score**, etc. Then use Gemini to explain a key metric to a non-technical audience.

**Gemini Explainer Prompt:**

```python
prompt = """TASK: Explain the meaning of the explained variance metric and how it differs from r2. The audience is non-technical."""
```


These metrics give you a detailed report card on your model's performance. Here's what the most important ones mean:

`mean_absolute_error` (8.84): This is the most straightforward metric. On average, the model's prediction for the arrival delay is off by about 8.8 minutes. For a flight delay predictor, this is a reasonable result.

`r2_score` (0.85): This tells you how much of the variation in arrival delay the model captures. A score of 0.85 means that the features (`dep_delay`, `carrier`, etc.) explain about 85% of the variance in arrival delay on the evaluation data. This is a strong score.

`median_absolute_error` (6.42): For half of all predictions, the error was 6.4 minutes or less. This is lower than the mean, which suggests that a minority of predictions are significantly off and pull the average up, while the model is more accurate for the typical flight.

`explained_variance` (0.85): This is closely related to `r2_score`. Explained variance ignores any constant offset in the errors, while R² penalizes it, so the two match (as they do here, at 0.849674) when the errors average out to roughly zero.

In short, the model performs well on the evaluation data: it explains most of the variance in arrival delay and is typically accurate to within a few minutes.

In [13]:

eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df


| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 8.842249 | 175.946025 | 1.035393 | 6.423609 | 0.849674 | 0.849674 |
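
To make these metrics concrete, the sketch below computes them by hand on a toy example, using the standard textbook definitions (assumed, not confirmed, to match what BQML reports). The delay values are made up for illustration. It also shows why `r2_score` and `explained_variance` are identical in the output above: the two differ only when the errors have a non-zero mean, that is, when the model is consistently too high or too low.

```python
from statistics import mean, median, pvariance

def regression_metrics(y_true, y_pred):
    """Compute a few regression metrics from their textbook definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "mean_absolute_error": mean(abs_errors),
        "median_absolute_error": median(abs_errors),
        "r2_score": 1 - ss_res / ss_tot,
        "explained_variance": 1 - pvariance(errors) / pvariance(y_true),
    }

actual = [10.0, 20.0, 30.0, 40.0]       # made-up arrival delays (minutes)
unbiased = [12.0, 18.0, 32.0, 38.0]     # errors average to zero
biased = [p + 5.0 for p in unbiased]    # same pattern, always 5 minutes too high

m_unbiased = regression_metrics(actual, unbiased)
m_biased = regression_metrics(actual, biased)
print(m_unbiased)  # r2_score and explained_variance agree
print(m_biased)    # r2_score drops, explained_variance does not
```

With the constant 5-minute offset, explained variance is unchanged while R² falls, because R² counts the offset as error.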



---
## Explain Predictions (`ML.EXPLAIN_PREDICT`) — Why did the model predict that?

Author your own prompt to generate a query that explains a **hypothetical** case:  
> Predict the arrival delay for a flight from SFO to JFK on airline AA with a departure delay of 30 minutes.

**Hint:** You’ll need a small input table with the required features.

> **Your Gemini prompt (write in the next cell):**  
> *Ask for a full `ML.EXPLAIN_PREDICT` example for this model and scenario, returning top features.*


**Gemini prompt:**

> Generate a BQML query using `ML.EXPLAIN_PREDICT` to predict the arrival delay for a hypothetical flight from SFO to JFK on airline AA with a 30-minute departure delay, and return the top 3 features influencing the prediction.

**Reasoning:** The query below passes a one-row input table containing the four features the model was trained on (`dep_delay`, `carrier`, `origin`, `dest`) and requests the top 3 feature attributions through `top_k_features`.



In [16]:
explain_sql = f"""
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`,
  (SELECT
     30.0 AS dep_delay,
     'AA' AS carrier,
     'SFO' AS origin,
     'JFK' AS dest
  ),
  STRUCT(3 AS top_k_features)
)
"""
explain_df = client.query(explain_sql).result().to_dataframe()
explain_df

| | predicted_arr_delay | top_feature_attributions | baseline_prediction_value | prediction_value | approximation_error | dep_delay | carrier | origin | dest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 106084.796517 | [{'feature': 'origin', 'attribution': 375970.9... | -614113.374652 | 106084.796517 | 0.0 | 30.0 | AA | SFO | JFK |
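
The `top_feature_attributions` column holds a list of `feature`/`attribution` pairs, which is hard to read in the truncated display. The sketch below flattens one result row into sorted pairs and compares their sum with the gap between prediction and baseline. The helper name `attributions_to_rows` is hypothetical, the toy row uses made-up numbers rather than the values above, and the check assumes attributions are additive, so that anything left over belongs to features outside the top k.

```python
def attributions_to_rows(row):
    """Flatten one ML.EXPLAIN_PREDICT row into (feature, attribution) pairs,
    ordered by absolute attribution, largest first."""
    pairs = [(a["feature"], a["attribution"]) for a in row["top_feature_attributions"]]
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)

# Toy row with made-up numbers, shaped like one row of explain_df
toy_row = {
    "baseline_prediction_value": 5.0,
    "prediction_value": 31.5,
    "top_feature_attributions": [
        {"feature": "carrier", "attribution": -1.5},
        {"feature": "dep_delay", "attribution": 27.0},
        {"feature": "origin", "attribution": 0.5},
    ],
}

rows = attributions_to_rows(toy_row)
listed_total = sum(value for _, value in rows)
# Part of (prediction - baseline) not covered by the listed features
remainder = toy_row["prediction_value"] - toy_row["baseline_prediction_value"] - listed_total

for feature, value in rows:
    print(f"{feature:>10}: {value:+.1f}")
print(f"not listed: {remainder:+.1f}")
```

In the notebook, `attributions_to_rows(explain_df.iloc[0])` would apply the same flattening to the real result.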


## Summary:

### Q&A
**Question:** What is the predicted arrival delay for a flight from SFO to JFK on airline AA with a 30-minute departure delay, and what are the top features influencing this prediction?
**Answer:** For the specified flight scenario, the model predicted an arrival delay of approximately 106,084.8 minutes. The `ML.EXPLAIN_PREDICT` function identified the top features influencing this outcome, with the `origin` airport (SFO) having the most significant impact on this specific prediction.

### Data Analysis Key Findings
*   An `ML.EXPLAIN_PREDICT` query was successfully executed to generate a prediction and explanation for a hypothetical flight with a 30-minute departure delay from SFO to JFK on airline AA.
*   The model predicted an arrival delay of `106084.8` minutes, which is roughly 73.7 days. This result is clearly unrealistic and suggests a problem with the model's training or its input features.
*   The query successfully returned the top 3 features contributing to the prediction, as requested.

### Insights or Next Steps
*   The extreme predicted delay shows that the model does not behave realistically for this scenario, even though its aggregate `ML.EVALUATE` metrics look strong. Review the training data for outliers or errors, and consider retraining on cleaned data or with an adjusted feature set.
*   Investigate the feature attributions further to understand why features such as the `origin` airport have such a disproportionately large influence on the prediction. The `baseline_prediction_value` of roughly -614,113 minutes is itself implausible, which suggests the issue lies in the fitted weights rather than in this particular input.
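
One possible way to follow up on these next steps is to inspect the fitted weights with `ML.WEIGHTS` and, if they look unstable, retrain with L2 regularization. The sketch below only builds the SQL strings. `build_weights_sql` and `build_regularized_model_sql` are hypothetical helper names, the project id and the `_l2` model suffix are placeholders, and the `l2_reg` value of 1.0 is an arbitrary starting point rather than a tuned setting. Whether regularization actually removes the extreme prediction is untested here.

```python
def build_weights_sql(model_id: str) -> str:
    """Return SQL that lists the fitted weights of a BQML linear model."""
    return f"SELECT * FROM ML.WEIGHTS(MODEL `{model_id}`)"

def build_regularized_model_sql(model_id: str, full_table: str, l2_reg: float = 1.0) -> str:
    """Return CREATE MODEL SQL matching the lab's model, plus an L2 penalty."""
    return f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['arr_delay'],
  l2_reg={l2_reg},
  enable_global_explain=TRUE
) AS
SELECT
  CAST(arrival_delay AS FLOAT64) AS arr_delay,
  CAST(departure_delay AS FLOAT64) AS dep_delay,
  CAST(airline AS STRING) AS carrier,
  CAST(departure_airport AS STRING) AS origin,
  CAST(arrival_airport AS STRING) AS dest
FROM `{full_table}`
WHERE arrival_delay IS NOT NULL
LIMIT 200000
"""

# Placeholder ids; in the notebook, reuse PROJECT_ID and FULL_TABLE.
weights_sql = build_weights_sql("your-project.superstore_data.flight_delay_predictor")
retrain_sql = build_regularized_model_sql(
    "your-project.superstore_data.flight_delay_predictor_l2",
    "bigquery-samples.airline_ontime_data.flights",
)
print(weights_sql)
print(retrain_sql)
# In the notebook: client.query(weights_sql).result().to_dataframe()
```

Writing the regularized model under a separate name keeps the original available for a side-by-side comparison of `ML.EVALUATE` and `ML.EXPLAIN_PREDICT` results.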


In [19]:
training_data_sql = f"""
SELECT
  CAST(arrival_delay AS FLOAT64) AS arr_delay,
  CAST(departure_delay AS FLOAT64) AS dep_delay,
  CAST(airline   AS STRING)  AS carrier,
  CAST(departure_airport    AS STRING)  AS origin,
  CAST(arrival_airport      AS STRING)  AS dest
FROM `{FULL_TABLE}`
WHERE arrival_delay IS NOT NULL
AND airline = "AA" AND departure_airport = "SFO" AND arrival_airport = "JFK"
LIMIT 10;  -- Limiting to 10 for a quick preview
"""
training_df = client.query(training_data_sql).result().to_dataframe()
display(training_df)

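| | arr_delay | dep_delay | carrier | origin | dest |
|---|---|---|---|---|---|
| 0 | 95.0 | 106.0 | AA | SFO | JFK |
| 1 | -3.0 | -8.0 | AA | SFO | JFK |
| 2 | 74.0 | 77.0 | AA | SFO | JFK |
| 3 | -9.0 | -4.0 | AA | SFO | JFK |
| 4 | -35.0 | -2.0 | AA | SFO | JFK |
| 5 | 55.0 | 61.0 | AA | SFO | JFK |
| 6 | 11.0 | -2.0 | AA | SFO | JFK |
| 7 | -11.0 | -1.0 | AA | SFO | JFK |
| 8 | 13.0 | -5.0 | AA | SFO | JFK |
| 9 | 25.0 | 43.0 | AA | SFO | JFK |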
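
As a quick sanity check on the ten rows above, a naive baseline that predicts `arr_delay = dep_delay` can be scored by hand, and the largest observed arrival delay can be compared with the model's prediction for this route. The sketch below hard-codes the preview rows. Ten rows are far too few to draw conclusions from, so treat it as an illustration of the check rather than a result.

```python
# (arr_delay, dep_delay) pairs copied from the preview above
preview_rows = [
    (95.0, 106.0), (-3.0, -8.0), (74.0, 77.0), (-9.0, -4.0), (-35.0, -2.0),
    (55.0, 61.0), (11.0, -2.0), (-11.0, -1.0), (13.0, -5.0), (25.0, 43.0),
]

# Naive baseline: assume the flight arrives exactly as late as it departed
abs_errors = [abs(arr - dep) for arr, dep in preview_rows]
naive_mae = sum(abs_errors) / len(abs_errors)
max_arr_delay = max(arr for arr, _ in preview_rows)

print(f"naive MAE: {naive_mae:.1f} min")                            # naive MAE: 12.2 min
print(f"largest observed arrival delay: {max_arr_delay:.0f} min")   # largest observed arrival delay: 95 min
```

Nothing in this sample comes close to a delay of 106,084.8 minutes, which is consistent with the problem lying in the model rather than in the data for this route.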



---
## ✅ Deliverable for Lab 4

- Completed `Lab4_Regression_BQML.ipynb` with:
  - Brainstormed features + rationale
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` output and explanation
  - `ML.EXPLAIN_PREDICT` output (hypothetical case)
- Push to **GitHub** and submit the link on **Brightspace**.
