
# **Lab 6: Improving Models with Feature Engineering**
**Unit 2 • Week 9 (Thu) — Feature Engineering & Advanced Models**

**Objective:** Improve your **Lab 5** classifier by creating **new features** using the BQML `TRANSFORM` clause, then compare performance to a **baseline**.


## Setup & Authentication

```python
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

PROJECT_ID = "sapient-office-471119-g4"
FULL_TABLE = "sapient-office-471119-g4.superstore_data.airdash_base_enriched"  # update if needed

BASELINE_MODEL = f"{PROJECT_ID}.superstore_data.flight_diverted_classifier"  # from Lab 5
IMPROVED_MODEL = f"{PROJECT_ID}.superstore_data.flight_diverted_classifier_v2"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
```

```
Authenticated. Project: sapient-office-471119-g4
```



---
## Establish a Baseline — Validate

Re-run `ML.EVALUATE` for your **Lab 5** model and record key metrics.


```python
baseline_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{BASELINE_MODEL}`)"
baseline_df = client.query(baseline_sql).result().to_dataframe()
baseline_df
```


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.995119 | 0.0 | 0.030853 | 0.498643 |



---
## Brainstorm New Features — Investigate (Gemini Prompt)

```python
prompt = (
    "Write a short paragraph explaining future ideas for improving machine learning "
    "models using feature engineering. Discuss potential methods like feature selection, "
    "transformation, interaction terms, handling class imbalance, and using domain "
    "knowledge to create more predictive inputs."
)
```



---
## Implement Feature Engineering with `TRANSFORM`

Modify your `CREATE MODEL` from Lab 5 and add a `TRANSFORM` clause to create engineered features.

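Requested features:
1. **`route`** = `CONCAT(origin, '-', dest)`  
2. **`day_of_week`** = `EXTRACT(DAYOFWEEK FROM date)`

In `airdash_base_enriched` the date column is named `date` (not `fl_date`), and a `route` column already exists, so the model below reads `route` directly.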
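
For reference, here is a minimal sketch of how the two requested features might be expressed inside a `TRANSFORM` clause. It assumes the column names used elsewhere in this notebook (`origin`, `dest`, `date`, `carrier`, `distance_miles`, `dep_delay_min`, and the `diverted` label). The helper `build_transform_model_sql` and the placeholder model and table names are hypothetical and only for illustration; they are not part of the lab template.

```python
def build_transform_model_sql(model_name: str, table_name: str) -> str:
    """Return a CREATE MODEL statement whose features are built in TRANSFORM.

    Assumption: the source table has origin, dest, date, carrier,
    distance_miles, dep_delay_min, and diverted columns.
    """
    return f"""
CREATE OR REPLACE MODEL `{model_name}`
TRANSFORM(
  CONCAT(origin, '-', dest) AS route,            -- engineered feature 1
  EXTRACT(DAYOFWEEK FROM date) AS day_of_week,   -- engineered feature 2 (1 = Sunday)
  carrier,
  distance_miles,
  dep_delay_min,
  diverted                                       -- label must pass through TRANSFORM
)
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['diverted']
)
AS
SELECT origin, dest, date, carrier, distance_miles, dep_delay_min, diverted
FROM `{table_name}`
WHERE diverted IS NOT NULL
"""


# Placeholder names (hypothetical); in the notebook, pass IMPROVED_MODEL and FULL_TABLE.
sql = build_transform_model_sql(
    "my-project.my_dataset.my_model_v2",
    "my-project.my_dataset.my_table",
)
print(sql)
```

In the notebook this could be run with `client.query(build_transform_model_sql(IMPROVED_MODEL, FULL_TABLE)).result()`. Because the `TRANSFORM` clause is stored with the model, the same preprocessing is applied again when the model is used with `ML.EVALUATE` or `ML.PREDICT`.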


```python
create_v2_sql = f"""
CREATE OR REPLACE MODEL `{IMPROVED_MODEL}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['diverted'],
  enable_global_explain=TRUE
)
AS
SELECT
  CAST(distance_miles AS FLOAT64) AS distance_miles,          # numeric feature, cast from INT64
  CAST(dep_delay_min AS FLOAT64) AS dep_delay_min,            # numeric feature, cast from INT64
  CAST(carrier AS STRING) AS carrier,
  CAST(route AS STRING) AS route,                             # the table already has a route column
  EXTRACT(DAYOFWEEK FROM CAST(date AS DATE)) AS day_of_week,  # 1 = Sunday ... 7 = Saturday
  CAST(diverted AS BOOL) AS diverted                          # label
FROM `{FULL_TABLE}`
WHERE diverted IS NOT NULL
LIMIT 600000;
"""
job = client.query(create_v2_sql)
job.result()  # wait for training to finish
print("Improved model created:", IMPROVED_MODEL)
```

```
Improved model created: sapient-office-471119-g4.superstore_data.flight_diverted_classifier_v2
```


```python
# Query the table schema to check column names
schema_sql = f"""
SELECT column_name, data_type
FROM `{PROJECT_ID}.superstore_data`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'airdash_base_enriched';
"""
schema_df = client.query(schema_sql).result().to_dataframe()
display(schema_df)
```

| | column_name | data_type |
|---|---|---|
| 0 | date | DATE |
| 1 | month | STRING |
| 2 | carrier | STRING |
| 3 | origin | STRING |
| 4 | dest | STRING |
| 5 | route | STRING |
| 6 | distance_miles | INT64 |
| 7 | seats | INT64 |
| 8 | passengers | INT64 |
| 9 | dep_delay_min | INT64 |



---
## Compare Performance — Extend

Evaluate the improved model and compare against the baseline.


```python
improved_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{IMPROVED_MODEL}`)"
improved_df = client.query(improved_sql).result().to_dataframe()
improved_df
```


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.995086 | 0.0 | 0.030824 | 0.57581 |



Create a small comparison table below in Markdown or code (precision, recall, etc.):
- **Baseline (Lab 5)**: …  
- **Improved (Lab 6)**: …  
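
One possible way to build the comparison in code is sketched below. The metric values are copied from the two `ML.EVALUATE` outputs recorded in this notebook; the helper `compare_metrics` is a hypothetical name introduced for illustration.

```python
# Metrics copied from the ML.EVALUATE outputs shown in this notebook.
baseline = {
    "precision": 0.0, "recall": 0.0, "accuracy": 0.995119,
    "f1_score": 0.0, "log_loss": 0.030853, "roc_auc": 0.498643,
}
improved = {
    "precision": 0.0, "recall": 0.0, "accuracy": 0.995086,
    "f1_score": 0.0, "log_loss": 0.030824, "roc_auc": 0.57581,
}


def compare_metrics(base: dict, new: dict) -> list:
    """Return rows of (metric, baseline, improved, improved - baseline)."""
    return [(name, base[name], new[name], new[name] - base[name]) for name in base]


rows = compare_metrics(baseline, improved)

# Print a fixed-width table with a signed delta column.
print(f"{'metric':<10} {'baseline':>10} {'improved':>10} {'delta':>10}")
for name, base_value, new_value, delta in rows:
    print(f"{name:<10} {base_value:>10.6f} {new_value:>10.6f} {delta:>+10.6f}")
```

In the notebook, the two dictionaries could instead be taken from `baseline_df.iloc[0].to_dict()` and `improved_df.iloc[0].to_dict()`. With the recorded values, precision, recall, and F1 stay at 0.0 for both models (neither correctly flags a diverted flight at the default threshold), accuracy is essentially unchanged near 0.995, and ROC AUC rises from 0.498643 to 0.57581.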

**Did feature engineering improve performance? Why or why not?**



---
## Challenge: `ML.BUCKETIZE`

Author your own Gemini prompt to write a `TRANSFORM` clause that buckets `dep_delay_min` into 4 severity levels (e.g., early/on-time, minor, moderate, major).

> Hint: `ML.BUCKETIZE(dep_delay_min, [split_points])` returns a **bucket label** (a string such as `bin_2`); you can also map delays to named levels with `CASE`.
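
As a starting point for checking whatever Gemini returns, the sketch below assembles a `TRANSFORM` clause around `ML.BUCKETIZE`. It assumes the delay column is `dep_delay_min`, as listed in the schema output of this notebook. The split points (15, 60, and 120 minutes), the alias `delay_severity`, and the helper `build_bucketize_item` are hypothetical choices for illustration, not values given by the lab.

```python
def build_bucketize_item(column: str, split_points: list, alias: str) -> str:
    """Return one TRANSFORM item that buckets `column` with ML.BUCKETIZE."""
    points = ", ".join(str(point) for point in split_points)
    return f"ML.BUCKETIZE({column}, [{points}]) AS {alias}"


# Hypothetical thresholds in minutes: three split points produce four buckets.
DELAY_SPLIT_POINTS = [15, 60, 120]

transform_item = build_bucketize_item(
    "dep_delay_min", DELAY_SPLIT_POINTS, "delay_severity"
)

# The label column (diverted) has to be passed through TRANSFORM as well.
transform_clause = f"""TRANSFORM(
  {transform_item},
  carrier,
  diverted
)"""
print(transform_clause)
```

With these illustrative split points, delays under 15 minutes would land in `bin_1` (early/on-time) and delays of 120 minutes or more in `bin_4` (major), with `bin_2` and `bin_3` covering the minor and moderate ranges in between. The clause goes between the `CREATE OR REPLACE MODEL` line and `OPTIONS(...)`.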



---
## ✅ Deliverable for Lab 6

- Completed `Lab6_Feature_Engineering.ipynb` showing:
  - Baseline metrics
  - Engineered features via `TRANSFORM`
  - Improved model evaluation + comparison
- Push to **GitHub** and submit the link on **Brightspace**.
