# Day 30: Integration Speed Drill + Exit Gate Preparation (Type A)

**Session Context:**

- Last 3 days: Day 27-B, Day 28-A, Day 29-A
- Previous day energy: Normal (precision maintained)
- Cumulative training day: 30
- Week 3 Day 6

**Type:** A (Heavy Practice Day)
**Focus:** Verify speed and readiness across ALL exit gate requirements

**Critical Observation:**
The exit gate has THREE parts:

- Part A: Regression (90 min). You've been practicing this all week ✓
- Part B: Classification (90 min). Last practiced Day 24 (~5 days ago) ⚠️
- Part C: Conceptual Validation (30 min)

Today ensures all three are ready.

## FOUNDATION DRILLING (MANDATORY)

### Part 1: Python Patterns Speed Test

**Goal:** Complete all 4 tasks in under 5 minutes total.

**Task 1:** Use .get() to safely access a dict with default

In [5]:
config = {'model': 'ridge', 'alpha': 1.0}
# Get 'learning_rate' with default 0.01

task1 = config.get('learning_rate', 0.01)
print(task1)

0.01


**Task 2:** Use sorted() with key=lambda to sort tuples by second element

In [11]:
scores = [('ridge', 0.65), ('lasso', 0.63), ('linear', 0.64)]
# Sort by score (second element), highest first

task2 = sorted(scores, key=lambda model: -model[1])
print(task2)

[('ridge', 0.65), ('linear', 0.64), ('lasso', 0.63)]


**Task 3:** Use dict comprehension to create squared values

In [7]:
numbers = [1, 2, 3, 4, 5]
# Create dict: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

task3 = {num: num**2 for num in numbers}
print(task3)

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}


**Task 4:** Use .join() to create comma-separated string

In [8]:
features = ['MedInc', 'HouseAge', 'bedroom_ratio']
# Result: "MedInc, HouseAge, bedroom_ratio"

task4 = ', '.join(features)
print(task4)

MedInc, HouseAge, bedroom_ratio


**Time yourself.** Record: 3:37 minutes

### Part 2: Scaling Workflow Verification

**Verbal recitation (no code, just speak/write):** 

**1. What are the 4 steps in order?** train_test_split, create scaler, fit_transform(X_train), transform(X_test)

**2. What does fit_transform() do that transform() doesn't?** It learns the data's mean and std (the fit step) before transforming; transform() only applies statistics that were already learned.

**3. Why does fitting on test data cause data leakage?** If the scaler fits on all data, it learns a mean/std that includes the test set. When we evaluate on test, those statistics have already influenced the transformation → artificially optimistic scores. The test set is no longer "unseen."
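The four steps and the leakage point can be seen in a minimal sketch. It uses a small synthetic single-feature array (an assumption for illustration, not the session's data) and compares the mean a scaler learns from the training split alone with the mean it would learn if fitted on every row.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

# Step 1: split first, so the test rows stay unseen
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Steps 2-3: create the scaler and learn mean/std from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Step 4: reuse the training statistics on the test rows (no refit)
X_test_scaled = scaler.transform(X_test)

# Leaky variant for comparison: statistics computed from train + test together
leaky_scaler = StandardScaler().fit(X)

print(f"Mean learned from train only: {scaler.mean_[0]:.3f}")
print(f"Mean learned from all rows:   {leaky_scaler.mean_[0]:.3f}")
```

The two means differ slightly because the leaky scaler has seen the test rows; that difference is the information that leaks into evaluation.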

### Part 3: Classification Metrics Recitation

**From memory, write the definitions:**

- Precision = the ability of the classifier not to label as positive a sample that is negative; of all positive predictions, how many are correct?; TP / (TP + FP)
- Recall = the ability of the classifier to find all the positive samples; of all actual positives, how many did we catch?; TP / (TP + FN)
- F1 = Harmonic mean of Precision and Recall; 2 x (Precision x Recall) / (Precision + Recall)
- When prioritize precision: when a false positive is more costly
- When prioritize recall: when a false negative is more costly
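As a quick check on the definitions above, the formulas can be computed by hand in plain Python. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision(tp, fp):
    """Of all positive predictions, the fraction that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of all actual positives, the fraction that were caught."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn = 30, 10, 20

p = precision(tp, fp)  # 30 / (30 + 10)
r = recall(tp, fn)     # 30 / (30 + 20)

print(f"Precision: {p:.4f}")
print(f"Recall:    {r:.4f}")
print(f"F1:        {f1(p, r):.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.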

**Checkpoint:** Post Foundation Drilling completion before proceeding.

## BLOCK 1: Regression Pipeline Speed Drill (Timed)

**Goal:** Complete full regression pipeline in under 15 minutes.

**START YOUR TIMER NOW.**

In [17]:
# Complete these steps as fast as you can while maintaining correctness:

# 1. Imports (sklearn, pandas, numpy, matplotlib)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_california_housing

# 2. Load California Housing data
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# 3. Create 3 engineered features:
#    - bedroom_ratio
#    - rooms_per_person
#    - one more of your choice

df['bedroom_ratio'] = df['AveBedrms'] / df['AveRooms']
df['rooms_per_person'] = df['AveRooms'] / df['AveOccup']
df['rooms_per_income'] = df['AveRooms'] / df['MedInc']

# 4. Define X (your feature columns) and y (target)

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                'Latitude', 'Longitude', 'bedroom_ratio', 'rooms_per_person', 'rooms_per_income']

X = df[feature_cols]
y = df['MedHouseVal']

# 5. Train/test split (80/20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Scale features (correct order!)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 7. Fit LinearRegression on scaled training data

lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# 8. Make predictions on scaled test data

lr_pred = lr_model.predict(X_test_scaled)

# 9. Calculate and print: R², MAE, RMSE

lr_r2_scaled = r2_score(y_test, lr_pred)
lr_mae_scaled = mean_absolute_error(y_test, lr_pred)
lr_rmse_scaled = np.sqrt(mean_squared_error(y_test, lr_pred))

print(f"LR R2 Scaled: {lr_r2_scaled:.4f}")
print(f"LR MAE Scaled: ${lr_mae_scaled * 100000:,.0f}")
print(f"LR RMSE Scaled: ${lr_rmse_scaled * 100000:,.0f}")

# 10. BONUS (if time): Compare with Ridge using cross_val_score


LR R2 Scaled: 0.6511
LR MAE Scaled: $48,710
LR RMSE Scaled: $67,613


**STOP TIMER. Record: 9:46 minutes**

**Scoring:**

- < 10 min: ✅ Exit gate ready
- 10-15 min: ✅ On track
- 15-20 min: 🟡 Review workflow before exit gate
- > 20 min: 🟡 Need additional practice day

**Record your results:**

- Time: 9:46 min
- R²: 0.6511
- MAE: $48,710
- Any errors encountered: no errors

**Checkpoint:** Share time and results before Block 2.

## BLOCK 2: Classification Refresher (Critical)

**Why this matters:** Exit gate Part B requires classification. You haven't practiced this in ~5 days. This block verifies those skills are still automatic.

### Setup with Clean Synthetic Data

In [18]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Generate clean classification data (no encoding confusion)
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.7, 0.3],  # Slight imbalance
    random_state=42
)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Class distribution: {np.bincount(y_train)}")
print(f"X_train shape: {X_train.shape}")


Class distribution: [564 236]
X_train shape: (800, 10)


### Task 1: Build Two Classifiers

**Task:** From memory, build LogisticRegression and DecisionTreeClassifier. Fit both on training data.

In [19]:
# Your code here

lr_model = LogisticRegression(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)

lr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)

### Task 2: Generate Predictions and Confusion Matrix

**Task:** For each model:

1. Generate predictions on test data
2. Create confusion matrix
3. Extract TN, FP, FN, TP

In [21]:
# Your code here

lr_pred = lr_model.predict(X_test)
dt_pred = dt_model.predict(X_test)

cm_lr = confusion_matrix(y_test, lr_pred)
cm_dt = confusion_matrix(y_test, dt_pred)

tn_lr, fp_lr, fn_lr, tp_lr = cm_lr.ravel()
tn_dt, fp_dt, fn_dt, tp_dt = cm_dt.ravel()

print(f"LogReg - TN:{tn_lr}, FP:{fp_lr}, FN:{fn_lr}, TP:{tp_lr}")
print(f"DecTree - TN:{tn_dt}, FP:{fp_dt}, FN:{fn_dt}, TP:{tp_dt}")

LogReg - TN:127, FP:6, FN:35, TP:32
DecTree - TN:129, FP:4, FN:7, TP:60


### Task 3: Calculate All Four Metrics

**Task:** Calculate accuracy, precision, recall, F1 for both models.

In [26]:
# Your code here

lr_acc = accuracy_score(y_test, lr_pred)
lr_prec = precision_score(y_test, lr_pred)
lr_rec = recall_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred)

dt_acc = accuracy_score(y_test, dt_pred)
dt_prec = precision_score(y_test, dt_pred)
dt_rec = recall_score(y_test, dt_pred)
dt_f1 = f1_score(y_test, dt_pred)

print(f"Accuracy Scores:\nLR {lr_acc:.4f} & DT {dt_acc:.4f}")
print(f"\nPrecision Scores:\nLR {lr_prec:.4f} & DT {dt_prec:.4f}")
print(f"\nRecall Scores:\nLR {lr_rec:.4f} & DT {dt_rec:.4f}")
print(f"\nF1 Scores:\nLR {lr_f1:.4f} & DT {dt_f1:.4f}")

Accuracy Scores:
LR 0.7950 & DT 0.9450

Precision Scores:
LR 0.8421 & DT 0.9375

Recall Scores:
LR 0.4776 & DT 0.8955

F1 Scores:
LR 0.6095 & DT 0.9160


**Record (exact values):**
| Metric | LogisticRegression | DecisionTree |
| --- | --- | --- |
| Accuracy | 0.7950 | 0.9450 |
| Precision | 0.8421 | 0.9375 |
| Recall | 0.4776 | 0.8955 |
| F1 | 0.6095 | 0.9160 |

### Task 4: Business Justification

**Scenario:** This classifier predicts customer churn. A false negative means we miss a customer who was about to leave (lost revenue). A false positive means we offer a retention discount to someone who wasn't leaving (unnecessary cost, but keeps customer happy).

**Question:** Which metric should we prioritize, and why? 

**Your answer (2-3 sentences):** In this instance, we should prioritize Recall. This is a classic example of "choose Recall over Precision when a false negative is more costly."

**More Professional Version:** Prioritize Recall. Missing a churning customer (FN) forfeits that customer's entire lifetime value (potentially thousands of dollars), while an unnecessary discount (FP) costs far less and maintains goodwill. DecisionTree's 0.8955 recall catches about 90% of churners, versus about 48% for LogisticRegression.
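The cost argument can be made concrete with a rough sketch. The FP and FN counts are the ones recorded in Task 2; the dollar amounts are hypothetical assumptions, since the scenario does not state actual costs.

```python
# FP / FN counts recorded in Task 2 for each model
counts = {
    "LogisticRegression": {"fp": 6, "fn": 35},
    "DecisionTree": {"fp": 4, "fn": 7},
}

# Hypothetical per-error costs (assumptions for illustration only)
COST_FN = 1000  # revenue lost when a churning customer is missed
COST_FP = 50    # discount given to a customer who was not leaving


def total_cost(fp, fn):
    """Total cost of a model's mistakes under the assumed unit costs."""
    return fp * COST_FP + fn * COST_FN


costs = {name: total_cost(**c) for name, c in counts.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,}")
```

Under these assumed costs the false negatives dominate the total, which is why the model with higher recall comes out far cheaper.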

## BLOCK 3: Cross-Validation Quick Check

**Exit gate Part A requires:** "Compare models with cross-validation"

**Task:** Use cross_val_score to compare LinearRegression and Ridge on California Housing data.

In [27]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Use Pipeline to combine scaling + model (prevents data leakage in CV)
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

pipe_ridge = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

# X and y should be your California Housing data from Block 1
# Cross-validate (5-fold)

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                'Latitude', 'Longitude', 'bedroom_ratio', 'rooms_per_person', 'rooms_per_income']

X = df[feature_cols]
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores_lr = cross_val_score(pipe_lr, X, y, cv=5, scoring='r2')
scores_ridge = cross_val_score(pipe_ridge, X, y, cv=5, scoring='r2')

print(f"LinearRegression CV R²: {scores_lr.mean():.4f} (+/- {scores_lr.std():.4f})")
print(f"Ridge CV R²: {scores_ridge.mean():.4f} (+/- {scores_ridge.std():.4f})")


LinearRegression CV R²: 0.6096 (+/- 0.0474)
Ridge CV R²: 0.6096 (+/- 0.0474)


**Record:**

- LinearRegression CV R²: 0.6096 (+/- 0.0474)
- Ridge CV R²: 0.6096 (+/- 0.0474)

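**Question:** Why use Pipeline inside cross_val_score instead of scaling beforehand? Data leakage prevention. Inside cross_val_score, the Pipeline refits the scaler on each fold's training portion only, so the validation fold's statistics never influence the transformation.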
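To see what the Pipeline does on each fold, here is a minimal sketch that repeats the same work by hand and checks that the scores match. It uses synthetic data from make_regression (an assumption, not the California Housing data) so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, for illustration only
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5)

# Pipeline version: scaling is refit inside every fold automatically
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")

# Manual version: the same steps written out fold by fold
manual_scores = []
for train_idx, val_idx in cv.split(X):
    # Fit the scaler on this fold's training rows only
    scaler = StandardScaler().fit(X[train_idx])
    model = Ridge(alpha=1.0).fit(scaler.transform(X[train_idx]), y[train_idx])
    # The validation rows are transformed with the training statistics
    pred = model.predict(scaler.transform(X[val_idx]))
    manual_scores.append(r2_score(y[val_idx], pred))

print("Scores match:", np.allclose(pipe_scores, manual_scores))
```

Scaling the full X once before calling cross_val_score would skip the per-fold fit, so every validation fold would contribute to the mean and std used to transform it.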

## BLOCK 4: Exit Gate Conceptual Review

**These questions mirror Part C of the exit gate. Answer each in 2-3 sentences.**

### Question 1: Bias-Variance Tradeoff

**Explain the bias-variance tradeoff to a non-technical stakeholder. Use an analogy if helpful.**

Imagine teaching archery. High bias is like always aiming 2 feet left of the target (consistent error, simple approach). High variance is like hitting all over the board based on wind, stance, mood (inconsistent, overthinking). We want balanced aim.

**Standard definition:**

- Bias = Error from oversimplified model (underfitting). Think: "Always guessing the average price regardless of features"
- Variance = Error from model being too sensitive to training data (overfitting). Think: "Memorizing training examples, failing on new ones"
- Tradeoff = Simple models have high bias/low variance. Complex models have low bias/high variance.
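The tradeoff can be illustrated with a small NumPy sketch that fits polynomials of increasing degree to noisy data. The data-generating function, noise level, and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical noisy samples from a smooth nonlinear curve."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 4, 10):
    # Higher degree = more flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_mse, test_mse = results[degree]
    print(f"degree {degree:>2}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only go down as the degree rises, because each model contains the simpler ones. Test error is the number to watch: a straight line is too rigid for this curve (high bias), while a very flexible polynomial typically starts chasing noise in the 30 training points (high variance).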

### Question 2: Overfitting Diagnosis

**Given: Train R² = 0.92, Test R² = 0.61. Diagnose the problem and suggest two solutions.**

The model is overfitting. The large gap between train and test R² shows it is too complex: it is memorizing noise in the training data instead of learning true patterns.

**Correct solutions:**

- Regularization (Ridge/Lasso with higher alpha)
- Simplify model (fewer features, shallower tree, etc.)
- Get more training data
- Remove noisy features

### Question 3: Feature Engineering Design

**A client wants to predict house prices. They give you: square footage, number of bedrooms, number of bathrooms, year built, zip code. Propose 2 engineered features and explain why each might help.** 

(Note: this is a difficult hypothetical, because the given inputs already cover most of what drives house prices: location, size, bedroom count, bathroom count, and age.)

- **Engineered Feature 1:** bedrooms_per_sqft = number of bedrooms / square footage. This indicates how the floor area is divided up, and so how much personal space the occupants may have aside from common areas such as the living room and kitchen.
- **Engineered Feature 2:** is_new = (year_built > 2015).astype(int), a binary flag for new construction, or house_age = 2025 - year_built, a direct age feature. Either one helps the model separate new construction from older homes. That matters because builders of new homes tend to offer very competitive incentives, such as closing cost credits and home additions, that older homes cannot match.
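A minimal pandas sketch of these features follows. The DataFrame, its column names, and its values are hypothetical, since the client's actual schema is not given; the reference year 2025 is the one used in the answer.

```python
import pandas as pd

# Hypothetical listings; column names and values are assumptions
homes = pd.DataFrame({
    "sqft": [1500, 2400, 900],
    "bedrooms": [3, 4, 2],
    "year_built": [1995, 2018, 1960],
})

REFERENCE_YEAR = 2025  # same year used in the answer above

# Feature 1: how densely the floor area is divided into bedrooms
homes["bedrooms_per_sqft"] = homes["bedrooms"] / homes["sqft"]

# Feature 2 (two variants): direct age, and a binary new-construction flag
homes["house_age"] = REFERENCE_YEAR - homes["year_built"]
homes["is_new"] = (homes["year_built"] > 2015).astype(int)

print(homes)
```

In practice the reference year would more likely come from the sale date or the current date than from a hard-coded constant, so that house_age does not go stale.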

### Question 4: Scaling Decision

**A teammate asks: "I'm building a Random Forest classifier. Should I scale my features?" What's your answer and why?** 

There's no need to scale your features. Random Forests (classifiers and regressors alike) are built from decision trees that split on one feature's threshold at a time, so the splits don't depend on other features' scales.
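This claim can be checked empirically with a rough sketch: train the same forest on raw and on standardized features and compare predictions. The synthetic data and the per-column multipliers are assumptions chosen to exaggerate the scale differences.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with deliberately mismatched feature scales
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)

# Same model and seed, trained once on raw and once on scaled features
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0)
rf_raw.fit(X_train, y_train)

rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0)
rf_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Prediction agreement (raw vs scaled): {agreement:.3f}")
```

Standardizing is a monotonic transformation of each feature, so it preserves the ordering of values that tree splits rely on. The two forests should therefore agree on all or nearly all test rows, with any difference coming from floating-point rounding rather than from the scaling itself.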

### Question 5: Metric Selection

**A fraud detection model has: Accuracy 99%, Precision 80%, Recall 20%. Is this a good model? Explain.** 

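No. For fraud detection we need to prioritize Recall, since a false negative (missed fraud) is extremely costly. The 99% accuracy is misleading: fraud is rare, so a model that almost always predicts "not fraud" still scores high accuracy. With only 20% Recall, this model misses 80% of fraud cases; it is completely unfit for the task and needs significant improvement before deployment.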
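To make the accuracy trap concrete, here is a small worked example in plain Python. The confusion-matrix counts are hypothetical, chosen so that they reproduce the stated 99% accuracy, 80% precision, and 20% recall.

```python
# Hypothetical counts that reproduce the metrics in the question:
# 8,500 transactions, of which 100 are actually fraudulent
tp, fn, fp, tn = 20, 80, 5, 8395
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline: a "model" that labels every transaction as not fraud
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:          {accuracy:.2%}")
print(f"Precision:         {precision:.2%}")
print(f"Recall:            {recall:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Fraud cases missed: {fn} of {tp + fn}")
```

With these counts, the do-nothing baseline is already within about one percentage point of the model's accuracy, while the model still misses 80 of the 100 fraud cases.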