
# **Lab 6: Improving Models with Feature Engineering**
**Unit 2 • Week 9 (Thu) — Feature Engineering & Advanced Models**

**Objective:** Improve your **Lab 5** classifier by creating **new features** using the BQML `TRANSFORM` clause, then compare performance to a **baseline**.


## Setup & Authentication

In [2]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

PROJECT_ID = "imposing-coast-442802-a7"
FULL_TABLE = "bigquery-public-data.new_york_citibike.citibike_trips"  # update if needed

BASELINE_MODEL = f"{PROJECT_ID}.bqml_tutorial.usertype_classifier"  # from Lab 5
IMPROVED_MODEL = f"{PROJECT_ID}.bqml_tutorial.usertype_classifier_v2"

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)

Authenticated. Project: imposing-coast-442802-a7



---
## Establish a Baseline — Validate

Re-run `ML.EVALUATE` for your **Lab 5** model and record key metrics.


In [3]:

baseline_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{BASELINE_MODEL}`)"
baseline_df = client.query(baseline_sql).result().to_dataframe()
baseline_df


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.99236 | 1.0 | 0.99236 | 0.996165 | 0.044444 | 0.612626 |



---
## Brainstorm New Features — Investigate (Gemini Prompt)

```python
prompt = ""  # write your Gemini prompt here
```



---
## Implement Feature Engineering with `TRANSFORM`

Modify your `CREATE MODEL` from Lab 5 and add a `TRANSFORM` clause to create engineered features.

Requested features:
1. **`route`** = CONCAT(CAST(`start_station_id` AS STRING), '-', CAST(`end_station_id` AS STRING))  
2. **`day_of_week`** = EXTRACT(DAYOFWEEK FROM `starttime`)


In [14]:
query_columns = f"""
SELECT
    column_name,
    data_type
FROM
    `bigquery-public-data`.new_york_citibike.INFORMATION_SCHEMA.COLUMNS
WHERE
    table_name = 'citibike_trips';
"""

columns_df = client.query(query_columns).result().to_dataframe()
display(columns_df)

| | column_name | data_type |
|---|---|---|
| 0 | tripduration | INT64 |
| 1 | starttime | DATETIME |
| 2 | stoptime | DATETIME |
| 3 | start_station_id | INT64 |
| 4 | start_station_name | STRING |
| 5 | start_station_latitude | FLOAT64 |
| 6 | start_station_longitude | FLOAT64 |
| 7 | end_station_id | INT64 |
| 8 | end_station_name | STRING |
| 9 | end_station_latitude | FLOAT64 |


In [22]:
create_v2_sql = f"""
CREATE OR REPLACE MODEL `{IMPROVED_MODEL}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['usertype'],
  enable_global_explain=TRUE
)
AS
SELECT
  tripduration,
  starttime,
  stoptime,
  start_station_id,
  end_station_id,
  bikeid,
  birth_year,
  gender,
  CONCAT(CAST(start_station_id AS STRING), '-', CAST(end_station_id AS STRING)) AS route,
  EXTRACT(DAYOFWEEK FROM CAST(starttime AS DATE)) AS day_of_week,
  usertype
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE usertype IS NOT NULL
LIMIT 600000;
"""
job = client.query(create_v2_sql)
job.result()
print("Improved model created:", IMPROVED_MODEL)


Improved model created: imposing-coast-442802-a7.bqml_tutorial.usertype_classifier_v2
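
The cell above computes `route` and `day_of_week` inside the training `SELECT`, so the model receives them as ordinary input columns. A minimal sketch of the two engineered features, plus a few raw columns, moved into a `TRANSFORM` clause is shown below. With `TRANSFORM`, BigQuery ML stores the preprocessing with the model and reapplies it during `ML.EVALUATE` and `ML.PREDICT`. The helper name `build_transform_sql` and the model name passed to it are placeholders for illustration, and this variant has not been run here, so the metrics recorded in this notebook do not describe it.

```python
def build_transform_sql(model_name: str, row_limit: int = 600000) -> str:
    """Return a CREATE MODEL statement that engineers features in TRANSFORM."""
    return f"""
CREATE OR REPLACE MODEL `{model_name}`
TRANSFORM(
  -- Raw columns passed through unchanged
  tripduration,
  birth_year,
  gender,
  -- Engineered features
  CONCAT(CAST(start_station_id AS STRING), '-', CAST(end_station_id AS STRING)) AS route,
  EXTRACT(DAYOFWEEK FROM starttime) AS day_of_week,
  -- The label column must also be listed, because only TRANSFORM outputs reach training
  usertype
)
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['usertype'],
  enable_global_explain=TRUE
)
AS
SELECT
  tripduration, starttime, start_station_id, end_station_id,
  birth_year, gender, usertype
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE usertype IS NOT NULL
LIMIT {row_limit}
"""


# "my-project" is a placeholder; substitute IMPROVED_MODEL from the setup cell.
transform_sql = build_transform_sql("my-project.bqml_tutorial.usertype_classifier_v2")
print(transform_sql)
# To train: client.query(transform_sql).result()
```

Only the columns listed in `TRANSFORM` are used for training, which is why `starttime` and the station IDs can appear in the `SELECT` without becoming raw features.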



---
## Compare Performance — Extend

Evaluate the improved model and compare against the baseline.


In [23]:

improved_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{IMPROVED_MODEL}`)"
improved_df = client.query(improved_sql).result().to_dataframe()
improved_df


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.971493 | 0.918744 | 0.977483 | 0.941768 | 0.231853 | 0.993946 |



Create a small comparison table below in Markdown or code (precision, recall, etc.):
- **Baseline (Lab 5)**: …  
- **Improved (Lab 6)**: …  
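
One way to build the comparison in code is sketched below. The metric values are copied from the two `ML.EVALUATE` outputs recorded in this notebook; the dictionary and function names are illustrative only. The two models may have been evaluated on different held-out rows, so the deltas are indicative rather than a controlled comparison.

```python
# Metrics copied from the recorded ML.EVALUATE outputs.
baseline = {"precision": 0.99236, "recall": 1.0, "accuracy": 0.99236,
            "f1_score": 0.996165, "log_loss": 0.044444, "roc_auc": 0.612626}
improved = {"precision": 0.971493, "recall": 0.918744, "accuracy": 0.977483,
            "f1_score": 0.941768, "log_loss": 0.231853, "roc_auc": 0.993946}


def comparison_rows(base, new):
    """Return (metric, baseline, improved, delta) tuples in metric order."""
    return [(m, base[m], new[m], round(new[m] - base[m], 6)) for m in base]


def to_markdown(rows):
    """Render the comparison rows as a Markdown table."""
    lines = ["| metric | baseline | improved | delta |", "|---|---|---|---|"]
    for metric, b, n, d in rows:
        lines.append(f"| {metric} | {b:.6f} | {n:.6f} | {d:+.6f} |")
    return "\n".join(lines)


rows = comparison_rows(baseline, improved)
print(to_markdown(rows))
```

In the recorded numbers, `roc_auc` rises from about 0.61 to about 0.99, while precision, recall, accuracy, and F1 fall and log loss rises. A baseline recall of 1.0 with accuracy equal to precision is the pattern produced when a model predicts the positive class for every row, which would make its high accuracy a reflection of class imbalance rather than of ranking ability. That reading is an inference from the metrics, not something this notebook verifies.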

**Did feature engineering improve performance? Why or why not?**



---
## Challenge: `ML.BUCKETIZE`

Author your own Gemini prompt to write a `TRANSFORM` clause that buckets `tripduration` into 4 duration levels (e.g., under 10 minutes, 10 to 20, 20 to 30, over 30).

> Hint: `ML.BUCKETIZE(tripduration, [boundary_list])` returns a **bucket name** as a string (`bin_1`, `bin_2`, ...), not an integer index; you can map these names to readable labels with `CASE`.
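
To reason about boundaries before running a query, the default bucket naming can be approximated locally. The sketch below assumes `ML.BUCKETIZE` names its buckets `bin_1`, `bin_2`, and so on, and that each boundary is the inclusive lower edge of the next bucket. `bucket_name` is a hypothetical helper written for this illustration, not a BigQuery API.

```python
from bisect import bisect_right

BOUNDARIES = [600, 1200, 1800]  # seconds: 10, 20, and 30 minutes


def bucket_name(value, boundaries=BOUNDARIES):
    """Approximate the default ML.BUCKETIZE bucket name for one value.

    Assumes half-open buckets [lower, upper) and NULL in, NULL out.
    """
    if value is None:
        return None
    return f"bin_{bisect_right(boundaries, value) + 1}"


for seconds in (300, 600, 1500, 4000, None):
    print(seconds, bucket_name(seconds))
```

Three boundaries produce four buckets, which matches the four levels the challenge asks for.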


In [31]:
bucketize_sql = """
WITH BucketedTrips AS (
    SELECT
        tripduration,
        CAST(ML.BUCKETIZE(tripduration, [600, 1200, 1800]) AS INT64) AS tripduration_bucket_index
    FROM
        `bigquery-public-data.new_york_citibike.citibike_trips`
    LIMIT 10
)
SELECT
    tripduration,
    tripduration_bucket_index,
    CASE
        WHEN tripduration_bucket_index = 0 THEN 'Under 10 mins'
        WHEN tripduration_bucket_index = 1 THEN '10–20 mins'
        WHEN tripduration_bucket_index = 2 THEN '20–30 mins'
        WHEN tripduration_bucket_index = 3 THEN 'Over 30 mins'
        ELSE 'Unknown'
    END AS tripduration_bucket_label
FROM
    BucketedTrips;
"""

job = client.query(bucketize_sql)
bucketize_df = job.result().to_dataframe()
display(bucketize_df)


| | tripduration | tripduration_bucket_index | tripduration_bucket_label |
|---|---|---|---|
| 0 | | | Unknown |
| 1 | | | Unknown |
| 2 | | | Unknown |
| 3 | | | Unknown |
| 4 | | | Unknown |
| 5 | | | Unknown |
| 6 | | | Unknown |
| 7 | | | Unknown |
| 8 | | | Unknown |
| 9 | | | Unknown |
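
Every row in the output above has an empty `tripduration`, so each one falls through to `'Unknown'`. A hedged variant is sketched below: it filters out NULL durations and matches on the string bucket names (`bin_1` to `bin_4`) that `ML.BUCKETIZE` returns by default, rather than casting them to `INT64`, which would be expected to fail on a non-NULL value such as `'bin_1'`. The variable name `bucketize_fixed_sql` is illustrative, and the query has not been run here, so no output is shown.

```python
bucketize_fixed_sql = """
WITH BucketedTrips AS (
    SELECT
        tripduration,
        -- Keep the bucket as a STRING name such as 'bin_1'
        ML.BUCKETIZE(tripduration, [600, 1200, 1800]) AS tripduration_bucket
    FROM
        `bigquery-public-data.new_york_citibike.citibike_trips`
    WHERE
        tripduration IS NOT NULL  -- skip rows with no recorded duration
    LIMIT 10
)
SELECT
    tripduration,
    tripduration_bucket,
    CASE tripduration_bucket
        WHEN 'bin_1' THEN 'Under 10 mins'
        WHEN 'bin_2' THEN '10-20 mins'
        WHEN 'bin_3' THEN '20-30 mins'
        WHEN 'bin_4' THEN 'Over 30 mins'
        ELSE 'Unknown'
    END AS tripduration_bucket_label
FROM
    BucketedTrips
"""

print(bucketize_fixed_sql)
# To run: client.query(bucketize_fixed_sql).result().to_dataframe()
```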



---
## ✅ Deliverable for Lab 6

- Completed `Lab6_Feature_Engineering.ipynb` showing:
  - Baseline metrics
  - Engineered features via `TRANSFORM`
  - Improved model evaluation + comparison
- Push to **GitHub** and submit the link on **Brightspace**.
