
# **Lab 4: Predicting Delays (Regression with BQML)**
**Unit 2 • Week 7 (Thu) — ML Foundations & Regression**

**Objective:** Train, evaluate, and interpret a **linear regression** model with **BigQuery ML** to predict a flight’s **arrival delay (minutes)**. You’ll use Gemini for feature brainstorming and explanation, and run all SQL from **Colab**.

**Business Goal:** Improve airline resource planning by predicting arrival delays.

> **Dataset (recommended):** `bigquery-public-data.airline_ontime_data.flights`  
> If your project uses a different on-time dataset, update the `FULL_TABLE` constant below accordingly.


## Setup & Authentication

In [13]:
# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

# ▶️ Set your GCP project id
PROJECT_ID = "imposing-coast-442802-a7"

# ▶️ Table containing flights data (update if your dataset differs)
# Using Citibike data as a placeholder - **REPLACE THIS WITH YOUR TABLE**
FULL_TABLE = "bigquery-public-data.new_york_citibike.citibike_trips"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
print("Using table:", FULL_TABLE)

Authenticated. Project: imposing-coast-442802-a7
Using table: bigquery-public-data.new_york_citibike.citibike_trips


In [14]:
# List column names in the Citibike dataset (reuses `client` from the setup cell above)
# NOTE: without a table_name filter, INFORMATION_SCHEMA.COLUMNS returns columns for
# every table in the dataset, so the first rows shown below appear to come from the
# stations table rather than `citibike_trips`.
# Append  WHERE table_name = 'citibike_trips'  to restrict the result to the trips table.
query = "SELECT column_name FROM `bigquery-public-data.new_york_citibike`.INFORMATION_SCHEMA.COLUMNS"
columns_df = client.query(query).result().to_dataframe()
print("Columns in the table:")
display(columns_df)

Columns in the table:


| | column_name |
|---|---|
| 0 | station_id |
| 1 | name |
| 2 | short_name |
| 3 | latitude |
| 4 | longitude |
| 5 | region_id |
| 6 | rental_methods |
| 7 | capacity |
| 8 | eightd_has_key_dispenser |
| 9 | num_bikes_available |



---
## Feature Brainstorming (Gemini Prompt)

Use Gemini to brainstorm predictive features:

```python
prompt = ""  # TODO: write your feature-brainstorming prompt here
```
Add your chosen features below.


**Chosen features (edit this cell):** `dep_delay`, `distance`, `carrier`, `origin`, `dest`
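
One way to keep the chosen feature list and the training query in sync is to generate the `SELECT` clause from the list. The sketch below is only an illustration: `build_training_select` is a hypothetical helper, and the label name `arr_delay` is an assumption, so check both the label and the feature names against your table's actual schema before using it.

```python
# Hypothetical helper: turn a chosen-feature list into the SELECT statement
# that feeds a BQML training query.
FEATURES = ["dep_delay", "distance", "carrier", "origin", "dest"]
LABEL = "arr_delay"  # ASSUMED label column name; verify against your schema


def build_training_select(features, label, table, limit=200000):
    """Return a SELECT statement with the label first, then each feature."""
    columns = [label] + [f for f in features if f != label]
    column_sql = ",\n  ".join(columns)
    return (
        f"SELECT\n  {column_sql}\n"
        f"FROM `{table}`\n"
        f"WHERE {label} IS NOT NULL\n"
        f"LIMIT {limit}"
    )


training_select = build_training_select(
    FEATURES, LABEL, "bigquery-public-data.airline_ontime_data.flights"
)
print(training_select)
```

The resulting string can be placed after the `AS` keyword of a `CREATE OR REPLACE MODEL` statement, so editing `FEATURES` is the only change needed when you revise your feature choices.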


---
## Train Your First Regression Model

Use **BQML** to create a **linear regression** model. Edit the feature list as needed.  
We also set `enable_global_explain=TRUE` so that global feature importance can be queried later with **ML.GLOBAL_EXPLAIN**; per-row explanations come from **ML.EXPLAIN_PREDICT**.


In [None]:
# Create a linear regression model
# NOTE: Adjust the feature list and input_label_cols to match the columns in your dataset.
# The `superstore_data` dataset must already exist in your project.
model_id = f"{PROJECT_ID}.superstore_data.flight_delay_predictor"  # You may change dataset/model name

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['tripduration'], -- 'tripduration' is the label for the Citibike placeholder
  enable_global_explain=TRUE
) AS
SELECT
  CAST(tripduration AS FLOAT64) AS tripduration, -- Example: predict trip duration
  CAST(starttime  AS TIMESTAMP) AS starttime,
  -- CAUTION: stoptime is only known once the trip has ended, so it leaks
  -- information about the label; drop it for a realistic model.
  CAST(stoptime   AS TIMESTAMP) AS stoptime,
  CAST(start_station_id AS STRING) AS start_station_id,
  CAST(end_station_id AS STRING) AS end_station_id,
  CAST(bikeid      AS STRING) AS bikeid,
  CAST(usertype   AS STRING) AS usertype,
  CAST(birth_year  AS INT64) AS birth_year,
  CAST(gender      AS STRING) AS gender
FROM `{FULL_TABLE}`
WHERE tripduration IS NOT NULL -- Adjust based on your label column
LIMIT 200000;  -- keep costs low
"""

job = client.query(create_model_sql)
job.result()
print("Model created:", model_id)

Model created: imposing-coast-442802-a7.superstore_data.flight_delay_predictor
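
Because the model was trained with `enable_global_explain=TRUE`, its overall feature importance can be inspected with `ML.GLOBAL_EXPLAIN`. The following is a minimal sketch: `global_explain_sql` is a hypothetical helper and the model path is a placeholder, so substitute the `model_id` you used in `CREATE MODEL`.

```python
def global_explain_sql(model_id: str) -> str:
    """Build a query that returns global feature attributions for a BQML model."""
    return (
        "SELECT *\n"
        f"FROM ML.GLOBAL_EXPLAIN(MODEL `{model_id}`)\n"
        "ORDER BY attribution DESC"
    )


# Placeholder model path (an assumption for illustration only).
example_sql = global_explain_sql("my-project.my_dataset.flight_delay_predictor")
print(example_sql)

# In the notebook, run it with the client created during setup:
#   client.query(example_sql).result().to_dataframe()
```

Features with larger attribution values contribute more to the model's predictions on average, which is a useful sanity check on the features you brainstormed.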



---
## Evaluate Model Performance (`ML.EVALUATE`) — Validate

Run evaluation to view **mean_absolute_error**, **r2_score**, etc. Then use Gemini to explain a key metric to a non-technical audience.

**Gemini Explainer Prompt:**

```python
prompt = ""  # TODO: write your metric-explainer prompt here
```
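
To build intuition for the two metrics named above before explaining them to others, it can help to compute them by hand on a few values. The sketch below uses the standard textbook definitions of mean absolute error and R²; the delay values are made up for illustration and are not output from the lab's model.

```python
# Made-up values for illustration only (minutes of arrival delay).
actual = [10.0, 0.0, 25.0, 5.0]
predicted = [12.0, 3.0, 20.0, 5.0]


def mean_absolute_error(y_true, y_pred):
    """Average size of the prediction error, in the label's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Share of the label's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(f"MAE = {mae:.2f} minutes")  # MAE = 2.50 minutes
print(f"R^2 = {r2:.3f}")           # R^2 = 0.891
```

In plain language: an MAE of 2.5 means the predictions are off by 2.5 minutes on average, and an R² near 1 means the model captures most of the variation in delays.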


In [1]:

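# Requires the Setup & Authentication cell to have run in this session
# (it defines PROJECT_ID and client).
eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df


NameError: name 'PROJECT_ID' is not defined

> **Note:** This error means the cell was run in a fresh session before the **Setup & Authentication** cell (note the `In [1]` counter). Run the setup cell first, then re-run this cell to see the metrics.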


---
## Explain Predictions (`ML.EXPLAIN_PREDICT`) — Why did the model predict that?

Author your own prompt to generate a query that explains a **hypothetical** case:  
> Predict the arrival delay for a 2000-mile flight with a 30-minute departure delay on carrier 'AA', and explain the top contributing features.

**Hint:** You’ll need a small input table with the required features.

> **Your Gemini prompt (write in the next cell):**  
> *Ask for a full `ML.EXPLAIN_PREDICT` example for this model and scenario, returning top features.*


In [None]:
# Example scaffold: fill in the SELECT for your hypothetical instance
explain_sql = f"""
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `{PROJECT_ID}.superstore_data.flight_delay_predictor`,
  (SELECT
     30.0 AS tripduration, -- Label column: optional at prediction time and passed through to the output; Citibike tripduration is in seconds
     TIMESTAMP('2023-10-27 08:00:00+00') AS starttime, -- Hypothetical start time
     TIMESTAMP('2023-10-27 08:30:00+00') AS stoptime,  -- Hypothetical stop time
     '316' AS start_station_id, -- Hypothetical start station ID
     '435' AS end_station_id,   -- Hypothetical end station ID
     '20000' AS bikeid,         -- Hypothetical bike ID
     'Subscriber' AS usertype,  -- Hypothetical user type
     1990 AS birth_year,      -- Hypothetical birth year
     'Male' AS gender           -- Hypothetical gender
  ),
  STRUCT(3 AS top_k_features)
)
"""
explain_df = client.query(explain_sql).result().to_dataframe()
explain_df

| | predicted_tripduration | top_feature_attributions | baseline_prediction_value | prediction_value | approximation_error | tripduration | starttime | stoptime | start_station_id | end_station_id | bikeid | usertype | birth_year | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 620.513539 | [{'feature': 'starttime', 'attribution': 288.2... | 53.131792 | 620.513539 | 0.0 | 30.0 | 2023-10-27 08:00:00+00:00 | 2023-10-27 08:30:00+00:00 | 316 | 435 | 20000 | Subscriber | 1990 | Male |
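
The `top_feature_attributions` column is truncated in the display above, so it is usually easier to unpack it in Python. The sketch below assumes each entry is a dict with `feature` and `attribution` keys, as the truncated output suggests; the row and its values are hypothetical, and `rank_attributions` is an illustrative helper rather than part of any library.

```python
# Hypothetical row shaped like one ML.EXPLAIN_PREDICT result (values made up).
row = {
    "prediction_value": 62.5,
    "top_feature_attributions": [
        {"feature": "distance", "attribution": -3.0},
        {"feature": "dep_delay", "attribution": 20.0},
        {"feature": "carrier", "attribution": -4.5},
    ],
}


def rank_attributions(attributions):
    """Sort attributions by absolute size, largest influence first."""
    return sorted(attributions, key=lambda a: abs(a["attribution"]), reverse=True)


ranked = rank_attributions(row["top_feature_attributions"])
for item in ranked:
    direction = "raises" if item["attribution"] > 0 else "lowers"
    print(f"{item['feature']}: {direction} the prediction by {abs(item['attribution']):.1f}")
```

With a real result, the same helper could be applied to `explain_df.loc[0, "top_feature_attributions"]` to list the features that pushed the prediction up or down the most.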



---
## ✅ Deliverable for Lab 4

- Completed `Lab4_Regression_BQML.ipynb` with:
  - Brainstormed features + rationale
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` output and explanation
  - `ML.EXPLAIN_PREDICT` output (hypothetical case)
- Push to **GitHub** and submit the link on **Brightspace**.
