# Assignment 2: Feature Engineering and Predictive Modeling
### Dataset: Titanic (same as Assignment 1)
Objective: Build a classification model to predict survival.

## Tasks:
1. Load and preprocess the Titanic dataset.
2. Perform feature engineering (e.g., family size, titles).
3. Encode categorical variables.
4. Split dataset into train/test sets.
5. Train Logistic Regression and Random Forest models.
6. Evaluate using accuracy, precision, recall, F1, ROC AUC.
7. Compare results and discuss model performance.
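
Task 2 names family size and titles as candidate features. A minimal sketch of both, assuming names follow the `Surname, Title. Given names` pattern seen in this dataset; the helper names `extract_title` and `family_size` and the `"Unknown"` fallback are illustrative choices, not part of the assignment.

```python
import re

# Captures the token between the comma and the first period,
# e.g. "Mr" in "Leonard, Mr. Lionel".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name: str) -> str:
    """Return the honorific from a 'Surname, Title. Given' style name (hypothetical helper)."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"

def family_size(sibsp: int, parch: int) -> int:
    """Siblings/spouses plus parents/children plus the passenger themselves."""
    return sibsp + parch + 1

print(extract_title("Leonard, Mr. Lionel"))
print(family_size(0, 0))
```

Rare titles could then be grouped into a single category before encoding, which keeps the number of one-hot columns small.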

Deliverable: Notebook with code, metrics, and interpretation.

Rubric:
- Feature engineering: 20%
- Model training: 20%
- Evaluation metrics: 20%
- Comparison and discussion: 30%
- Clarity and organization: 10%


In [2]:
# EXAMPLE (from LLM): Auth + Project/Region (write your own cell using the prompt)
from google.colab import auth
auth.authenticate_user()

import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt-467-2500
Project: mgmt-467-2500 | Region: us-central1
INFORMATION: Project 'mgmt-467-2500' has no 'environment' tag set. Use either 'Production', 'Development', 'Test', or 'Staging'. Add an 'environment' tag using `gcloud resource-manager tags bindings create`.
Updated property [core/project].
mgmt-467-2500


## Kaggle API

In [3]:
# EXAMPLE (from LLM): Kaggle setup
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
os.makedirs('/root/.kaggle', exist_ok=True)
# Write the first uploaded file to the location the Kaggle CLI expects
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only; set after the file is closed

!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


## Download and unzip Dataset

In [4]:
# EXAMPLE (from LLM): Download & unzip
!mkdir -p /content/data/raw
!kaggle datasets download -d yasserh/Titanic-Dataset -p /content/data
!unzip -o /content/data/*.zip -d /content/data/raw
# List CSV inventory
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/yasserh/Titanic-Dataset
License(s): CC0-1.0
Downloading Titanic-Dataset.zip to /content/data
  0% 0.00/22.0k [00:00<?, ?B/s]
100% 22.0k/22.0k [00:00<00:00, 91.4MB/s]
Archive:  /content/data/Titanic-Dataset.zip
  inflating: /content/data/raw/Titanic-Dataset.csv  
-rw-r--r-- 1 root root 60K Dec 24  2021 /content/data/raw/Titanic-Dataset.csv


## Create GCS Bucket and Upload

In [5]:
# EXAMPLE (from LLM): GCS staging
import uuid, os
bucket_name = f"mgmt467-titanic-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name
!gcloud storage buckets create gs://$BUCKET_NAME --location="us-central1"
!gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
print("Bucket:", bucket_name)
# Verify contents
!gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating gs://mgmt467-titanic-ab1a309e/...
Copying file:///content/data/raw/Titanic-Dataset.csv to gs://mgmt467-titanic-ab1a309e/netflix/Titanic-Dataset.csv
Bucket: mgmt467-titanic-ab1a309e
gs://mgmt467-titanic-ab1a309e/netflix/Titanic-Dataset.csv


## Create Dataset

In [6]:
# EXAMPLE (from LLM): BigQuery dataset
DATASET="titanic"
# Attempt to create; ignore if exists
!bq --location=US mk -d --description "MGMT467 Titanic dataset" $DATASET || echo "Dataset may already exist."

BigQuery error in mk operation: Dataset 'mgmt-467-2500:titanic' already exists.
Dataset may already exist.


## Load Data into table

In [7]:
# EXAMPLE (from LLM): Load tables
import os

tables = {
    "Titanic": "Titanic-Dataset.csv",
}
DATASET = "titanic"  # same dataset name as the previous cell
for tbl, fname in tables.items():
    src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
    print("Loading", tbl, "from", src)
    # bq load takes the destination table first, then the source URI.
    # --replace overwrites the table so re-running this cell does not append duplicate rows.
    !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

# Row counts: the query is single-quoted so the shell leaves the backticks alone
for tbl in tables.keys():
    query = f'SELECT "{tbl}" AS table_name, COUNT(*) AS n FROM `{os.environ["GOOGLE_CLOUD_PROJECT"]}.{DATASET}.{tbl}`'
    !bq query --nouse_legacy_sql '{query}'

Loading Titanic from gs://mgmt467-titanic-ab1a309e/netflix/Titanic-Dataset.csv
Waiting on bqjob_r17c83125b652b19b_0000019a5568a989_1 ... (1s) Current status: DONE   


## Load Data From Table

In [8]:
# --- Minimal setup (edit 2 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-2500"   # e.g., mgmt-467-47888
REGION     = "us-central1"
TABLE_PATH = "mgmt-467-2500.titanic.Titanic"

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)

BQ Project: mgmt-467-2500
Source table: mgmt-467-2500.titanic.Titanic


## Sanity Check

In [9]:
bq.query(f"SELECT * FROM `{TABLE_PATH}` LIMIT 5").result().to_dataframe()

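|   | PassengerId | Survived | Pclass | Name                           | Sex  | Age  | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|--------------------------------|------|------|-------|-------|--------|------|-------|----------|
| 0 | 180         | 0        | 3      | Leonard, Mr. Lionel            | male | 36.0 | 0     | 0     | LINE   | 0.0  |       | S        |
| 1 | 264         | 0        | 1      | Harrison, Mr. William          | male | 40.0 | 0     | 0     | 112059 | 0.0  | B94   | S        |
| 2 | 278         | 0        | 2      | Parkes, Mr. Francis "Frank"    | male |      | 0     | 0     | 239853 | 0.0  |       | S        |
| 3 | 303         | 0        | 3      | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0     | 0     | LINE   | 0.0  |       | S        |
| 4 | 414         | 0        | 2      | Cunningham, Mr. Alfred Fleming | male |      | 0     | 0     | 239853 | 0.0  |       | S        |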


In [13]:
%%bigquery --project $PROJECT_ID
CREATE OR REPLACE TABLE `titanic.Model_B_features` AS
SELECT
  Pclass,
  Sex,
  Age,
  Fare,
  Embarked,
  SibSp + Parch + 1 AS family_size,
  -- Rows with a NULL Fare match no branch and get a NULL fare_bucket
  CASE
    WHEN Fare <= 15 THEN 'Low'
    WHEN Fare > 15 AND Fare <= 50 THEN 'Mid'
    WHEN Fare > 50 THEN 'High'
  END AS fare_bucket,
  CONCAT(Sex, '_', CAST(Pclass AS STRING)) AS sex_pclass,
  Survived
FROM `titanic.Titanic`;
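
The three engineered columns can be checked locally before running the query. The following is a minimal Python sketch that mirrors the SQL expressions on a single record; the function names `fare_bucket` and `engineer` are hypothetical, and treating a missing fare as `None` is an assumption about how the data would be represented outside BigQuery.

```python
from typing import Optional

def fare_bucket(fare: Optional[float]) -> Optional[str]:
    """Mirror the SQL CASE: a missing fare matches no branch and stays None."""
    if fare is None:
        return None
    if fare <= 15:
        return "Low"
    if fare <= 50:
        return "Mid"
    return "High"

def engineer(row: dict) -> dict:
    """Add family_size, fare_bucket and sex_pclass to one passenger record."""
    return {
        **row,
        "family_size": row["SibSp"] + row["Parch"] + 1,
        "fare_bucket": fare_bucket(row["Fare"]),
        "sex_pclass": f"{row['Sex']}_{row['Pclass']}",
    }

# Values taken from the sanity-check output (PassengerId 264).
sample = {"Pclass": 1, "Sex": "male", "SibSp": 0, "Parch": 0, "Fare": 0.0}
print(engineer(sample))
```

Checking the boundary values (15 and 50) this way makes it easy to confirm which bucket each edge falls into.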


## Train Model

In [15]:
%%bigquery --project $PROJECT_ID
-- ✅ Train enhanced model
CREATE OR REPLACE MODEL `titanic.Model_B`
OPTIONS(model_type='logistic_reg', input_label_cols=['Survived']) AS
SELECT
  Pclass,
  Sex,
  Age,
  Fare,
  Embarked,
  family_size,
  fare_bucket,
  sex_pclass,
  Survived
FROM `titanic.Model_B_features`;


## Evaluate Model

In [17]:
%%bigquery --project $PROJECT_ID
-- ✅ Evaluate enhanced model
SELECT *
FROM ML.EVALUATE(MODEL `titanic.Model_B`);


|   | precision | recall   | accuracy | f1_score | log_loss | roc_auc  |
|---|-----------|----------|----------|----------|----------|----------|
| 0 | 0.75      | 0.545455 | 0.765363 | 0.631579 | 0.461262 | 0.844109 |


In [18]:
%%bigquery --project $PROJECT_ID
-- Get the confusion matrix for the trained model
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `titanic.Model_B`);


|   | expected_label | _0  | _1  |
|---|----------------|-----|-----|
| 0 | 0              | 202 | 24  |
| 1 | 1              | 60  | 72  |


## Sample Answer Based on Notebook Results

Based on the results from the notebook, here is a sample response addressing parts of the rubric:

**Baseline or engineered model build + clear metrics (AUC/log_loss)**

The engineered model (Model B) was successfully built using BigQuery ML's logistic regression. The evaluation metrics for Model B (from `ML.EVALUATE`) are as follows:

*   **AUC:** 0.844109
*   **Log Loss:** 0.461262
*   **Accuracy:** 0.765363
*   **Precision:** 0.75
*   **Recall:** 0.545455
*   **F1 Score:** 0.631579

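These metrics describe Model B's performance on the evaluation data that BigQuery ML reserved during training (the 358 rows counted in the confusion matrix), since `ML.EVALUATE` was called without an input table. The AUC of 0.844 indicates good discriminatory power: given one randomly chosen survivor and one randomly chosen non-survivor, the model ranks the survivor higher about 84% of the time. The log loss of 0.461 is well below the ln 2 ≈ 0.693 that a model predicting 0.5 for every passenger would score, so the predicted probabilities carry real information.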
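
Whether a log loss of 0.461 counts as low depends on the reference point. One common reference, assumed here rather than computed in the notebook, is a constant model that predicts the observed survival rate for every passenger. A sketch of that baseline, using the class counts from the `ML.CONFUSION_MATRIX` output (132 survivors out of 358 evaluated rows); the function name `constant_log_loss` is hypothetical.

```python
import math

def constant_log_loss(positives: int, total: int) -> float:
    """Log loss of always predicting the base rate p = positives / total."""
    p = positives / total
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Counts read off the confusion matrix: 60 + 72 survivors, 358 rows in total.
baseline = constant_log_loss(positives=132, total=358)
print(f"baseline log loss: {baseline:.3f}")  # compare with Model B's 0.461
```

Under these assumptions the baseline comes out near 0.66, so Model B's 0.461 is a clear improvement over predicting the base rate alone.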

**Confusion matrix interpretation (default 0.5)**

The confusion matrix for Model B at the default 0.5 threshold (from `ML.CONFUSION_MATRIX`) is:

| expected_label | _0  | _1  |
|----------------|-----|-----|
| 0              | 202 | 24  |
| 1              | 60  | 72  |

Interpreting this matrix:

*   **True Negatives (TN):** 202 passengers were correctly predicted as not surviving.
*   **False Positives (FP):** 24 passengers were incorrectly predicted as surviving (Type I error).
*   **False Negatives (FN):** 60 passengers were incorrectly predicted as not surviving (Type II error).
*   **True Positives (TP):** 72 passengers were correctly predicted as surviving.
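
The four counts above are enough to reproduce the threshold-dependent metrics reported by `ML.EVALUATE`. A short sketch using the standard definitions; the function name `classification_metrics` is illustrative, and specificity is added here although the notebook does not report it.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the Model B confusion matrix at the default 0.5 threshold.
metrics = classification_metrics(tn=202, fp=24, fn=60, tp=72)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The accuracy, precision, recall, and F1 values computed this way agree with the `ML.EVALUATE` row, which confirms that both outputs describe the same 358 evaluation rows.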

At this threshold, the model identifies non-survivors far more reliably than survivors: it correctly classifies 202 of 226 non-survivors (about 89%) but only 72 of 132 survivors (about 55%, the recall reported by `ML.EVALUATE`). The 60 false negatives are survivors the model predicted would not survive, while the 24 false positives are non-survivors it predicted would survive, so its errors lean toward under-predicting survival.
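
The balance between false negatives and false positives depends on the 0.5 cutoff. The sketch below shows the mechanism on a handful of made-up `(actual_label, predicted_probability)` pairs; the data and the function name `confusion_counts` are hypothetical and are not taken from Model B's predictions.

```python
# Hypothetical (actual_label, predicted_probability) pairs, invented for illustration.
scored = [
    (1, 0.92), (1, 0.71), (1, 0.48), (1, 0.35),
    (0, 0.55), (0, 0.30), (0, 0.22), (0, 0.08),
]

def confusion_counts(pairs, threshold):
    """Return (tn, fp, fn, tp) when probabilities >= threshold are predicted as 1."""
    tn = fp = fn = tp = 0
    for actual, prob in pairs:
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

for threshold in (0.5, 0.3):
    tn, fp, fn, tp = confusion_counts(scored, threshold)
    print(f"threshold={threshold}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

On this toy data, lowering the cutoff from 0.5 to 0.3 removes both false negatives at the cost of one extra false positive, which is the kind of trade-off to weigh if missing survivors matters more than over-predicting survival.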

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
!git config --global user.name "DanielGallagher1"
!git config --global user.email "gallagherdaniel555@gmail.com"

In [3]:
from getpass import getpass
token = getpass('Enter your GitHub token: ')
!git clone https://{token}@github.com/garci843/Unit1_TheLook_Team1.git

Enter your GitHub token: ··········
Cloning into 'Unit1_TheLook_Team1'...
remote: Enumerating objects: 179, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 179 (delta 21), reused 21 (delta 21), pack-reused 155 (from 1)
Receiving objects: 100% (179/179), 548.43 KiB | 4.94 MiB/s, done.
Resolving deltas: 100% (83/83), done.


In [4]:
%cd Unit1_TheLook_Team1

/content/Unit1_TheLook_Team1


In [5]:
!cp "/content/drive/My Drive/MGMT467/Assignments/Unit2_Daniel_BQML.ipynb" "/content/Unit1_TheLook_Team1/Assignment_2/Individual"

In [8]:
import nbformat

# Path to this notebook (change if needed)
notebook_path = "Assignment_2/Individual/Unit2_Daniel_BQML.ipynb"

# Read the notebook
nb = nbformat.read(notebook_path, as_version=4)

# Remove the broken 'widgets' metadata if it exists
if "widgets" in nb["metadata"]:
    del nb["metadata"]["widgets"]
    print("Removed 'metadata.widgets' from notebook.")
else:
    print("No 'metadata.widgets' field found.")

# Save the cleaned notebook
nbformat.write(nb, notebook_path)
print(f"✅ Cleaned notebook saved as: {notebook_path}")

Removed 'metadata.widgets' from notebook.
✅ Cleaned notebook saved as: Assignment_2/Individual/Unit2_Daniel_BQML.ipynb


In [1]:
%cd /content/Unit1_TheLook_Team1
!git add Assignment_2/Individual/Unit2_Daniel_BQML.ipynb
!git commit -m "Updated Unit2_Daniel_BQML.ipynb with latest analysis"
!git push https://{token}@github.com/garci843/Unit1_TheLook_Team1.git main

[Errno 2] No such file or directory: '/content/Unit1_TheLook_Team1'
/content
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
