

# Lab 6 — Improving Models with Feature Engineering (**BQML TRANSFORM**)

**Objective:** Improve your **Lab 5** classifier by creating new, more informative features using **BQML's TRANSFORM** clause. Compare performance to your baseline.


## 1) Setup

In [None]:

# from google.colab import auth
# auth.authenticate_user()

!pip -q install -U google-cloud-bigquery pandas-gbq google-cloud-bigquery-storage

from google.cloud import bigquery
import pandas as pd

print("BigQuery libraries installed.")


## 2) Configuration

In [None]:

# >>> EDIT THESE <<<
PROJECT_ID = "confident-nexus-475419-k9"     # <-- Replace
DATASET_ID = "superstore_data"     # Use the same dataset as Lab 5 (or update as needed)
LOCATION   = "US"
BASE_MODEL_ID = "flight_diversion_classifier"        # From Lab 5
NEW_MODEL_ID  = "flight_diversion_classifier_fe"     # New model with TRANSFORM

BASE_MODEL_FQN = f"`{PROJECT_ID}.{DATASET_ID}.{BASE_MODEL_ID}`"
NEW_MODEL_FQN  = f"`{PROJECT_ID}.{DATASET_ID}.{NEW_MODEL_ID}`"

client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

print("Base model:", BASE_MODEL_FQN)
print("New model:", NEW_MODEL_FQN)


## 3) Establish a Baseline (Re-run Evaluation of Lab 5 Model)

In [None]:

baseline_eval_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL {BASE_MODEL_FQN})
"""
print(baseline_eval_sql)
baseline_eval_df = client.query(baseline_eval_sql).result().to_dataframe()
baseline_eval_df



**Record the baseline metrics (precision and recall)** from the table above. They are the reference point for judging the engineered model in Section 6.


## 4) Brainstorm New Features (Gemini prompt)

In [None]:

prompt = """
# TASK: Brainstorm new features for an ML model, a process called feature engineering.
# CONTEXT: I want to improve my flight diversion prediction model. The raw data has a column called 'origin' (e.g., 'JFK', 'ORD') and another called 'carrier' (e.g., 'AA', 'UA').
# GOAL: Suggest one new feature I could create by combining 'origin' and 'carrier' that might be more predictive than either column alone. Explain why this new feature could be more powerful.
"""
print(prompt)



**Filled idea:** Create a categorical **origin_carrier** feature (e.g., `JFK-AA`) to capture **interaction effects**: some carriers perform differently at specific airports (slots, gates, congestion patterns).
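
As a rough illustration of the interaction feature, the sketch below builds `origin_carrier` values in plain Python. The function name `origin_carrier` and the sample rows are made up for illustration; inside the model the equivalent would be a SQL expression such as `CONCAT(origin, '-', carrier) AS origin_carrier`. The helper returns `None` when either input is missing, mirroring how BigQuery's `CONCAT` returns NULL if any argument is NULL.

```python
from collections import Counter
from typing import Optional


def origin_carrier(origin: Optional[str], carrier: Optional[str]) -> Optional[str]:
    """Combine airport and carrier codes into one categorical value, e.g. 'JFK-AA'."""
    # Mirror SQL CONCAT semantics: a missing input yields a missing output.
    if origin is None or carrier is None:
        return None
    return f"{origin}-{carrier}"


# Made-up sample rows (origin, carrier), for illustration only.
sample_rows = [("JFK", "AA"), ("JFK", "UA"), ("ORD", "AA"), ("JFK", "AA"), ("ORD", None)]

# Count how often each combination occurs; rare pairs give the model little signal.
pair_counts = Counter(origin_carrier(o, c) for o, c in sample_rows)
print(pair_counts)
```

Counting the combinations first is a cheap way to see whether the new category is dense enough to be useful, since a pair that appears only a handful of times is unlikely to help the classifier.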


## 5) Implement Feature Engineering with TRANSFORM

In [None]:

prompt = """
# TASK: Generate the TRANSFORM clause for a BQML CREATE MODEL statement.
# GOAL: I need to pass through my original features ('dep_delay', 'distance') and create two new features:
# 1. 'route': a new feature created by combining the 'origin' and 'dest' columns using the CONCAT() function.
# 2. 'day_of_week': a new feature created by extracting the day of the week from the 'fl_date' column using the EXTRACT() function.
"""
print(prompt)


In [None]:

create_fe_model_sql = f"""
CREATE OR REPLACE MODEL {NEW_MODEL_FQN}
-- TRANSFORM is its own clause (not an OPTIONS entry) and holds only a select list.
TRANSFORM(
  diverted_label,
  -- pass-through
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  -- engineered
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM fl_date) AS day_of_week
)
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['diverted_label'],
  enable_global_explain=TRUE
)
AS
SELECT
  CAST(diverted AS BOOL) AS diverted_label,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  fl_date  -- consumed by TRANSFORM; only TRANSFORM outputs are used as features
FROM `bigquery-public-data.airline_ontime_data.flights`
WHERE diverted IS NOT NULL
  AND distance IS NOT NULL
  AND dep_delay IS NOT NULL
"""
print(create_fe_model_sql)
job = client.query(create_fe_model_sql)
job.result()
print("Engineered model created:", job.job_id)
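
One detail worth checking in the engineered features: BigQuery's `EXTRACT(DAYOFWEEK FROM ...)` numbers days 1 to 7 starting on Sunday, and as an INT64 column it is likely to be treated as a numeric feature rather than a category. The sketch below reproduces that numbering, plus the `route` concatenation, in plain Python so values can be sanity-checked. The helper names `bq_day_of_week` and `route` are hypothetical. If a categorical feature is wanted, one option (an assumption, not part of the lab) is to wrap the SQL expression in `CAST(... AS STRING)`.

```python
from datetime import date


def bq_day_of_week(d: date) -> int:
    """Return the day number in BigQuery's DAYOFWEEK convention (Sunday=1 ... Saturday=7)."""
    # isoweekday() is Monday=1 ... Sunday=7; rotate so Sunday becomes 1.
    return d.isoweekday() % 7 + 1


def route(origin: str, dest: str) -> str:
    """Python counterpart of CONCAT(origin, '-', dest)."""
    return f"{origin}-{dest}"


print(bq_day_of_week(date(2024, 1, 7)))  # 2024-01-07 is a Sunday
print(route("JFK", "ORD"))
```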


## 6) Compare Performance

In [None]:

new_eval_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL {NEW_MODEL_FQN})
"""
print(new_eval_sql)
new_eval_df = client.query(new_eval_sql).result().to_dataframe()
new_eval_df



**Create a comparison table** (fill with your numbers):

| Model | Precision | Recall |
|------|-----------|--------|
| Baseline (Lab 5) | _(your value)_ | _(your value)_ |
| Engineered (Lab 6) | _(your value)_ | _(your value)_ |

**Interpretation:** Did feature engineering improve precision and/or recall? Explain the trade-offs.
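
To make the comparison less error-prone than copying numbers by hand, a small helper along these lines could compute the change per metric. This is a minimal sketch: `compare_metrics` is a hypothetical name, and the example numbers are placeholders, not real results.

```python
from typing import Dict, List, Sequence


def compare_metrics(
    baseline: Dict[str, float],
    engineered: Dict[str, float],
    names: Sequence[str] = ("precision", "recall"),
) -> List[str]:
    """Build Markdown table rows showing each metric and its change (engineered - baseline)."""
    lines = [
        "| Metric | Baseline (Lab 5) | Engineered (Lab 6) | Change |",
        "|--------|------------------|--------------------|--------|",
    ]
    for name in names:
        base, new = baseline[name], engineered[name]
        lines.append(f"| {name} | {base:.4f} | {new:.4f} | {new - base:+.4f} |")
    return lines


# Placeholder numbers only, NOT real results; substitute your own ML.EVALUATE output.
example_baseline = {"precision": 0.50, "recall": 0.25}
example_engineered = {"precision": 0.75, "recall": 0.25}
print("\n".join(compare_metrics(example_baseline, example_engineered)))
```

In the notebook, the two dictionaries could come from `baseline_eval_df.iloc[0].to_dict()` and `new_eval_df.iloc[0].to_dict()`, assuming each evaluation returned a single row with `precision` and `recall` columns.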


## 7) Challenge — BUCKETIZE dep_delay

In [None]:

challenge_prompt = """
# TASK: Help me write a TRANSFORM clause in BQML that uses ML.BUCKETIZE on the dep_delay column
#       to create 4 buckets representing delay severity (e.g., 'early_or_on_time', 'minor_delay',
#       'moderate_delay', 'major_delay').
# CONTEXT: This is for improving the flight diversion classifier.
"""
print(challenge_prompt)



**Hint (example pattern to adapt):**
```sql
TRANSFORM(
  ...,
  ML.BUCKETIZE(dep_delay, [0, 15, 45]) AS dep_delay_bucket  -- 4 buckets: (-inf,0), [0,15), [15,45), [45,inf)
)
```
Choose bucket edges that make business sense for your analysis.
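
Before committing to edges, it can help to check where boundary values land. The sketch below imitates the half-open bucketing from the hint in plain Python using `bisect`. The `SPLIT_POINTS` and `LABELS` constants and both helper names are illustrative assumptions. As far as I know, `ML.BUCKETIZE` itself returns generic names such as `bin_1` rather than readable labels, so attaching labels in SQL would need something extra, for example a `CASE` expression.

```python
from bisect import bisect_right
from typing import Sequence

# Edges match the hint above; labels are taken from the challenge prompt.
SPLIT_POINTS = [0, 15, 45]
LABELS = ["early_or_on_time", "minor_delay", "moderate_delay", "major_delay"]


def bucket_index(value: float, split_points: Sequence[float]) -> int:
    """Return the 1-based bucket number; each bucket is [lower, upper)."""
    # bisect_right counts the split points <= value, so a value equal to an edge
    # lands in the bucket that starts at that edge.
    return bisect_right(split_points, value) + 1


def delay_label(dep_delay: float) -> str:
    """Map a departure delay in minutes to a severity label."""
    return LABELS[bucket_index(dep_delay, SPLIT_POINTS) - 1]


for minutes in (-5, 0, 14, 15, 45, 120):
    print(minutes, f"bin_{bucket_index(minutes, SPLIT_POINTS)}", delay_label(minutes))
```

Note that with a first edge of 0, a delay of exactly 0 minutes falls into the second bucket, so the first bucket covers only early departures. If on-time flights should share a bucket with early ones, shifting the first edge (for instance to 1, assuming delays are whole minutes) is one way to do it.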
