
# **Lab 6: Improving Models with Feature Engineering**
**Unit 2 • Week 9 (Thu) — Feature Engineering & Advanced Models**

**Objective:** Improve your **Lab 5** classifier by creating **new features** using the BQML `TRANSFORM` clause, then compare performance to a **baseline**.


## Setup & Authentication

In [19]:

# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd
import os

# ▶️ Prompt user to enter their GCP Project ID
PROJECT_ID = input("mgmt-46700-471119: ").strip()
REGION = "us-central1"

# ▶️ Export environment variable for consistency
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# ▶️ Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

print("✅ Authenticated successfully!")
print("Project:", PROJECT_ID)
print("Region:", REGION)

# ▶️ Set active project for gcloud
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project

# ▶️ Example: choose your working table (change if needed)
FULL_TABLE = "mgmt-46700-471119.flights.flights predict"
print("Using table:", FULL_TABLE)

BASELINE_MODEL = f"{PROJECT_ID}.flights.flight_delay_predictor"  # from Lab 5
IMPROVED_MODEL = f"{PROJECT_ID}.flights.flight_delay_predictor"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)


mgmt-46700-471119: mgmt-46700-471119
✅ Authenticated successfully!
Project: mgmt-46700-471119
Region: us-central1
Updated property [core/project].
mgmt-46700-471119
Using table: mgmt-46700-471119.flights.flights predict
Authenticated. Project: mgmt-46700-471119



---
## Establish a Baseline — Validate

Re-run `ML.EVALUATE` for your **Lab 5** model and record key metrics.


In [33]:

baseline_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{BASELINE_MODEL}`)"
baseline_df = client.query(baseline_sql).result().to_dataframe()
baseline_df


Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,11.481701,257.534469,0.963442,8.880821,0.877425,0.877426



---
## Brainstorm New Features — Investigate (Gemini Prompt)

```python
prompt = prompt = """
# TASK: Brainstorm new features for an ML model, a process called feature engineering.
# CONTEXT: I want to improve my flight diversion prediction model. The raw data has a column called 'origin' (e.g., 'JFK', 'ORD') and another called 'carrier' (e.g., 'AA', 'UA').
# GOAL: Suggest one new feature I could create by combining 'origin' and 'carrier' that might be more predictive than either column alone. Explain why this new feature could be more powerful.
"""
```
A new feature you could create by combining origin and carrier is a unique route-carrier combination.

For example, instead of having 'JFK' and 'AA' as separate features, you would have a new feature like 'JFK-AA'.

This new feature could be more powerful because certain route-carrier combinations might have historical patterns of diversions that are not captured by the individual origin or carrier features alone. For example, a specific carrier might have more diversions when flying out of a particular airport due to factors like operational procedures, specific aircraft types used on that route, or even historical weather patterns for that route.



---
## Implement Feature Engineering with `TRANSFORM`

Modify your `CREATE MODEL` from Lab 5 and add a `TRANSFORM` clause to create engineered features.

Requested features:
1. **`route`** = CONCAT(`origin`, '-', `dest`)  
2. **`day_of_week`** = EXTRACT(DAYOFWEEK FROM `fl_date`)


In [28]:
IMPROVED_MODEL = f"{PROJECT_ID}.flights.flight_delay_predictor_classifier"
create_v2_sql = f"""
CREATE OR REPLACE MODEL `{IMPROVED_MODEL}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_delayed'],
  enable_global_explain=TRUE
)
AS
SELECT
  CAST(dep_delay > 15 AS BOOL) AS is_delayed,
  CAST(distance AS FLOAT64) AS distance,
  CAST(dep_delay AS FLOAT64) AS dep_delay,
  CAST(carrier AS STRING) AS carrier,
  CONCAT(CAST(origin AS STRING), '-', CAST(dest AS STRING)) AS route,
  EXTRACT(DAYOFWEEK FROM CAST(time_hour AS TIMESTAMP)) AS day_of_week
FROM `{FULL_TABLE}`
WHERE dep_delay IS NOT NULL
LIMIT 600000;
"""
job = client.query(create_v2_sql); job.result()
print("Improved model created:", IMPROVED_MODEL)

Improved model created: mgmt-46700-471119.flights.flight_delay_predictor_classifier



---
## Compare Performance — Extend

Evaluate the improved model and compare against the baseline.


In [34]:

improved_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{IMPROVED_MODEL}`)"
improved_df = client.query(improved_sql).result().to_dataframe()
improved_df


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,1.0,0.926237,0.984282,0.961706,0.069602,1.0


In [31]:
baseline_classification_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{BASELINE_MODEL}`)"
baseline_classification_df = client.query(baseline_classification_sql).result().to_dataframe()
baseline_classification_df

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,11.481701,257.534469,0.963442,8.880821,0.877425,0.877426



Create a small comparison table below in Markdown or code (precision, recall, etc.):
- **Baseline (Lab 5)**: Regression
- **Improved (Lab 6)**: Classification: The improved model achieved near-perfect metrics with precision: 1.0, recall: 0.926, accuracy: 0.984, F1-score: 0.962, log loss: 0.070, and ROC AUC: 1.0.

**Did feature engineering improve performance? Why or why not?**
Yes, feature engineering significantly improved performance**
1. **Problem Reformulation Success**: Converting from regression (predicting delay minutes) to classification (predicting delay yes/no) created a more practical and business-relevant model.

2. **Excellent Classification Performance**: The improved model achieved near-perfect metrics:
   - 100% precision (all predicted delays were actual delays)
   - 98.4% overall accuracy
   - Perfect ROC AUC score of 1.0

3. **Valuable Feature Additions**: The engineered features provided critical insights:
   - `route` feature captured specific airport-pair delay patterns
   - `day_of_week` identified temporal patterns in delays
   - Proper data type casting improved data quality

4. **Enhanced Business Utility**: The classification model provides more actionable insights for airlines and passengers compared to predicting exact delay minutes.

While direct metric comparison is challenging due to different model types, the transformation to a high-performing classification model with 98.4% accuracy demonstrates clear improvement through feature engineering and problem reframing.




---
## Challenge: `ML.BUCKETIZE`

Author your own Gemini prompt to write a `TRANSFORM` clause that buckets `dep_delay` into 4 severity levels (e.g., early/on-time, minor, moderate, major).

> Hint: `ML.BUCKETIZE(dep_delay, [boundary_list])` returns a **bucket index**; you can also map buckets with `CASE`.

##prompt = ""
TASK: Write a BQML TRANSFORM clause using ML.BUCKETIZE.
CONTEXT: I'm building a logistic regression model to predict flight delays.
I want to categorize the continuous 'dep_delay' column into 4 buckets representing different levels of delay severity:
1. 'early_or_on_time'      : dep_delay <= 0
2. 'minor_delay'            : 0 < dep_delay <= 15
3. 'moderate_delay'         : 15 < dep_delay <= 60
4. 'major_delay'            : dep_delay > 60 GOAL: Show how to write a TRANSFORM clause in the CREATE MODEL statement that uses ML.BUCKETIZE on dep_delay
to create a new categorical feature named 'delay_severity', keeping other features like distance, carrier, and route.
"""



---
## ✅ Deliverable for Lab 6

- Completed `Lab6_Feature_Engineering.ipynb` showing:
  - Baseline metrics
  - Engineered features via `TRANSFORM`
  - Improved model evaluation + comparison
- Push to **GitHub** and submit the link on **Brightspace**.
