<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/demos/week03_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 3 Demo - Regression: From Linear Relationships to Logistic Classification
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Chapter 3** | Competencies: 1.3, 1.4, 1.5, 1.6, 6 (partial)

**What we're building today:**
- **Example 1 (Basic):** Simple linear regression: one predictor, one target
- **Example 2 (Intermediate):** Multiple regression: dummy variables, scaling, feature selection
- **Example 3 (Full Pipeline):** Complete housing price prediction + logistic regression classifier

**Pipeline position:** Regression is the interpretable baseline. Every model you build from here forward (decision trees, random forests, neural networks) gets compared against a regression baseline first.

**Datasets:**
| Dataset | Rows | Use |
|---------|------|-----|
| WA Housing Sales | ~4,600 | Predict home prices (regression) |
| Cars | 205 | Feature engineering playground |
| UCLA Admissions | 400 | Binary classification (logistic regression) |

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run this cell to load all libraries and datasets. Do not modify.
</div>

In [None]:
# ============================================================
# Setup: run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.metrics import (mean_squared_error, r2_score,
                             classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)

plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

# Load datasets from GitHub
housing_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/housingData.csv"
cars_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/cars.csv"
admissions_url = "https://stats.idre.ucla.edu/stat/data/binary.csv"

housing_df = pd.read_csv(housing_url)
cars_df = pd.read_csv(cars_url)
admissions_df = pd.read_csv(admissions_url)

print(f"Housing:    {housing_df.shape[0]:,} rows × {housing_df.shape[1]} columns")
print(f"Cars:       {cars_df.shape[0]:,} rows × {cars_df.shape[1]} columns")
print(f"Admissions: {admissions_df.shape[0]:,} rows × {admissions_df.shape[1]} columns")
print("\n✅ All datasets loaded successfully")

---
# Example 1 - Simple Linear Regression on Housing Data

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Before we throw every feature at a model, we start with the simplest version: <strong>one predictor → one target</strong>. This establishes a baseline and teaches the core sklearn workflow: <code>fit()</code> → <code>predict()</code> → <code>score()</code>. You'll use this exact pattern in every ML model for the rest of the course.
</div>
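
As a minimal sketch of that fit → predict → score pattern, the cell below runs the three steps on synthetic data. The slope of 150, the intercept of 50,000, and the noise level are made-up values for illustration, not estimates from the housing data, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price = 150 * sqft + 50,000 + noise (illustrative numbers only)
rng = np.random.default_rng(42)
sqft = rng.uniform(500, 4000, size=200)
price = 150 * sqft + 50_000 + rng.normal(0, 20_000, size=200)

X_demo = sqft.reshape(-1, 1)  # sklearn expects a 2D feature matrix
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, price, test_size=0.33, random_state=42)

demo_model = LinearRegression()
demo_model.fit(X_tr_demo, y_tr_demo)              # 1. learn slope and intercept
demo_pred = demo_model.predict(X_te_demo)         # 2. predict rows the model has not seen
demo_r2 = demo_model.score(X_te_demo, y_te_demo)  # 3. R² on the test set

print(f"Slope: {demo_model.coef_[0]:.1f}  Test R²: {demo_r2:.3f}")
```

Because the synthetic noise is small relative to the trend, the fitted slope should land close to 150 and the test R² close to 1. Real data, as the cells below show, is much noisier.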

**Question:** Can we predict a home's sale price using only its square footage?

In [None]:
# 1a. Filter housing data: remove outliers
housing = housing_df.copy()
housing = housing[(housing["sqft_living"] < 8000) &
                  (housing["price"] < 1_000_000) &
                  (housing["price"] > 0)]

print(f"Filtered: {len(housing):,} rows (removed {len(housing_df) - len(housing):,} outliers)")
print(f"Price range: ${housing['price'].min():,.0f} – ${housing['price'].max():,.0f}")
print(f"Sqft range:  {housing['sqft_living'].min():,.0f} – {housing['sqft_living'].max():,.0f}")

In [None]:
# 1b. EDA scatterplot: does square footage relate to price?
plt.figure(figsize=(10, 5))
plt.scatter(housing["sqft_living"], housing["price"], alpha=0.15, s=10, color="steelblue")
plt.title("Square Footage vs Sale Price (WA Housing)")
plt.xlabel("Living Area (sqft)")
plt.ylabel("Sale Price ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 1c. Correlation bar chart: which features correlate most with price?
numeric_cols = housing.select_dtypes(include=[np.number])
corr = numeric_cols.corr()["price"].drop("price").sort_values(ascending=False)

plt.figure(figsize=(8, 6))
corr.plot(kind="barh", color=["steelblue" if v > 0 else "salmon" for v in corr.values])
plt.title("Feature Correlations with Price (Sorted)")
plt.xlabel("Pearson Correlation")
plt.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()

print("Top 5 features correlated with price:")
for feat, val in corr.head(5).items():
    print(f"  {feat:20s} r = {val:.3f}")

In [None]:
# 1d. Simple linear regression: sqft_living → price
X = housing[["sqft_living"]]     # 2D DataFrame (sklearn requires this)
y = housing["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

r2_simple = model_simple.score(X_test, y_test)
y_pred_simple = model_simple.predict(X_test)
rmse_simple = np.sqrt(mean_squared_error(y_test, y_pred_simple))

print(f"Simple Linear Regression (sqft_living only)")
print(f"  R²:   {r2_simple:.4f}")
print(f"  RMSE: ${rmse_simple:,.0f}")
print(f"  Coefficient: ${model_simple.coef_[0]:,.2f} per sqft")
print(f"  Intercept:   ${model_simple.intercept_:,.2f}")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 INTERPRETING R² ≈ 0.29</strong><br>
  An R² of 0.29 means square footage explains about 29% of the variation in price. That's a real relationship (larger homes do cost more), but 71% of the variation comes from other factors: location, condition, year built, and so on. This is our <strong>baseline</strong> to beat.
</div>
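
To make "explains about 29% of the variation" concrete, here is a small sketch that computes R² and RMSE by hand from their definitions. The four prices and predictions are made-up numbers chosen for easy arithmetic, not values from the housing data.

```python
import numpy as np

# Illustrative numbers only (not taken from the housing data)
y_true_demo = np.array([200_000.0, 300_000.0, 400_000.0, 500_000.0])
y_hat_demo = np.array([250_000.0, 300_000.0, 350_000.0, 500_000.0])

ss_res = np.sum((y_true_demo - y_hat_demo) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total variation around the mean price
r2_by_hand = 1 - ss_res / ss_tot
rmse_by_hand = np.sqrt(np.mean((y_true_demo - y_hat_demo) ** 2))

print(f"R² = {r2_by_hand:.2f}, RMSE = ${rmse_by_hand:,.0f}")  # R² = 0.90, RMSE = $35,355
```

R² compares the model's squared errors against the squared errors of always guessing the mean, so an R² of 0 means "no better than the mean" and 1 means a perfect fit.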

In [None]:
# 1e. Predicted vs actual scatter
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_simple, alpha=0.15, s=10, color="steelblue")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         "r--", linewidth=2, label="Perfect prediction")
plt.title(f"Predicted vs Actual: Simple Regression (R²={r2_simple:.3f})")
plt.xlabel("Actual Price ($)")
plt.ylabel("Predicted Price ($)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 1f. Residual plot: are errors random or systematic?
residuals = y_test - y_pred_simple

plt.figure(figsize=(10, 5))
plt.scatter(y_pred_simple, residuals, alpha=0.15, s=10, color="steelblue")
plt.axhline(y=0, color="red", linewidth=2)
plt.title("Residual Plot: Simple Regression")
plt.xlabel("Predicted Price ($)")
plt.ylabel("Residual (Actual − Predicted)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 1g. Regression line visualization
plt.figure(figsize=(10, 5))
sns.regplot(x="sqft_living", y="price", data=housing,
            scatter_kws={"alpha": 0.1, "s": 8, "color": "steelblue"},
            line_kws={"color": "red", "linewidth": 2})
plt.title("Regression Line: sqft_living vs price")
plt.xlabel("Living Area (sqft)")
plt.ylabel("Sale Price ($)")
plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 1</strong><br>
  <ul>
    <li>R² ≈ 0.29: one variable captures the trend but leaves most variation unexplained</li>
    <li>RMSE ≈ $130K–$140K: a typical prediction misses by roughly this much (RMSE weights large misses more heavily than a plain average would)</li>
    <li>Residual plot fans out to the right: the model is worse at predicting expensive homes</li>
    <li>The regression line tilts upward: positive relationship confirmed</li>
  </ul>
</div>

### ⚡ Common Error Demo: 1D vs 2D Input
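
Single brackets return a 1D Series, while double brackets return a 2D DataFrame, and scikit-learn estimators expect the 2D form for features. A small sketch on a toy DataFrame (the values are made up) makes the shape difference visible:

```python
import pandas as pd

toy = pd.DataFrame({"sqft_living": [900, 1500, 2100]})

as_series = toy["sqft_living"]     # single brackets -> 1D Series
as_frame = toy[["sqft_living"]]    # double brackets -> 2D DataFrame

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)

# A 1D array can also be reshaped into a single-column 2D array
as_2d_array = as_series.to_numpy().reshape(-1, 1)
print(as_2d_array.shape)  # (3, 1)
```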

In [None]:
# ⚡ DELIBERATE ERROR: what happens when you pass a 1D Series?
try:
    X_wrong = housing["sqft_living"]       # 1D Series (wrong)
    LinearRegression().fit(X_wrong, y)     # fresh estimator, so model_simple stays untouched
except ValueError as e:
    print(f"❌ ValueError: {e}")
    print()
    print("FIX: Use double brackets to create a 2D DataFrame:")
    print('  X = housing[["sqft_living"]]   # ← 2D DataFrame (correct)')
    print('  X = housing["sqft_living"]     # ← 1D Series (wrong)')

---
# Example 2 - Multiple Regression + Feature Engineering on Cars Data

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Real datasets have categorical variables (text labels), features at different scales, and too many columns. This example teaches three critical skills: (1) converting categories to numbers with <strong>dummy variables</strong>, (2) normalizing features with <strong>StandardScaler</strong>, and (3) automatically picking the best features with <strong>SelectKBest</strong>.
</div>
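
For intuition about what scaling does, StandardScaler applies a z-score to each column: subtract the column mean, then divide by the column standard deviation. A minimal sketch of that calculation, using two made-up columns on very different scales:

```python
import numpy as np

# Two made-up features on very different scales (illustrative values)
engine_size = np.array([97.0, 109.0, 130.0, 152.0, 183.0])
curb_weight = np.array([1900.0, 2300.0, 2550.0, 2900.0, 3400.0])

def z_score(x):
    """Center on the mean and divide by the population standard deviation."""
    return (x - x.mean()) / x.std()

engine_z = z_score(engine_size)
weight_z = z_score(curb_weight)

print(engine_z.round(2))
print(weight_z.round(2))
```

After the transformation both columns have mean 0 and standard deviation 1, so their coefficients can be compared on the same footing.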

**Question:** Can we predict car prices better by adding features, engineering new ones, and selecting wisely?

In [None]:
# 2a. Cars EDA: sorted correlation with price
cars = cars_df.copy()
numeric_cars = cars.select_dtypes(include=[np.number])
corr_cars = numeric_cars.corr()["price"].drop("price").sort_values(ascending=False)

plt.figure(figsize=(8, 8))
corr_cars.plot(kind="barh", color=["steelblue" if v > 0 else "salmon" for v in corr_cars.values])
plt.title("Feature Correlations with Car Price (Sorted)")
plt.xlabel("Pearson Correlation")
plt.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()

print("Top 5:")
for feat, val in corr_cars.head(5).items():
    print(f"  {feat:20s} r = {val:.3f}")

In [None]:
# 2b. Simple regression baseline: enginesize only
X_cars = cars[["enginesize"]]
y_cars = cars["price"]

X_tr, X_te, y_tr, y_te = train_test_split(X_cars, y_cars, test_size=0.33, random_state=42)

model_1var = LinearRegression().fit(X_tr, y_tr)
r2_1var = model_1var.score(X_te, y_te)

print(f"Simple Regression (enginesize only)")
print(f"  R²: {r2_1var:.4f}")

In [None]:
# 2c. Two-variable model: enginesize + curbweight
X_cars_2 = cars[["enginesize", "curbweight"]]
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_cars_2, y_cars, test_size=0.33, random_state=42)

model_2var = LinearRegression().fit(X_tr2, y_tr2)
r2_2var = model_2var.score(X_te2, y_te2)

print(f"Two-Variable Model (enginesize + curbweight)")
print(f"  R²: {r2_2var:.4f}  ({r2_2var - r2_1var:+.4f} vs. one feature)")

In [None]:
# 2d. Identify categorical columns and create dummy variables
cat_cols = cars.select_dtypes(include=["object"]).columns.tolist()
print(f"Categorical columns ({len(cat_cols)}):")
for col in cat_cols:
    print(f"  {col}: {cars[col].nunique()} unique values, e.g. {cars[col].unique()[:5].tolist()}...")

# Create dummies for key categoricals
cars_encoded = pd.get_dummies(cars, columns=["fueltype", "aspiration", "drivewheel"],
                               drop_first=True, dtype=int)
print(f"\nAfter dummies: {cars_encoded.shape[1]} columns (was {cars.shape[1]})")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY drop_first=True?</strong><br>
  If a categorical column has two values (gas/diesel), one dummy column is enough: <code>fueltype_gas = 1</code> means gas, <code>fueltype_gas = 0</code> means diesel. Keeping both columns adds redundant information: together with the intercept they are perfectly collinear, so the coefficients are no longer uniquely determined. This is called the <strong>dummy variable trap</strong>.
</div>
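
The redundancy can be checked numerically. The sketch below uses a five-row toy column (made-up values) and compares the rank of the design matrix, intercept included, with and without `drop_first=True`:

```python
import numpy as np
import pandas as pd

toy_cars = pd.DataFrame({"fueltype": ["gas", "diesel", "gas", "gas", "diesel"]})

both = pd.get_dummies(toy_cars, columns=["fueltype"], dtype=int)
one = pd.get_dummies(toy_cars, columns=["fueltype"], drop_first=True, dtype=int)

print(both.columns.tolist())  # ['fueltype_diesel', 'fueltype_gas']
print(one.columns.tolist())   # ['fueltype_gas']

# Add an intercept column of ones, as a linear model with an intercept does implicitly
design_both = np.column_stack([np.ones(len(both)), both.to_numpy()])
design_one = np.column_stack([np.ones(len(one)), one.to_numpy()])

print(np.linalg.matrix_rank(design_both), "of", design_both.shape[1])  # 2 of 3
print(np.linalg.matrix_rank(design_one), "of", design_one.shape[1])    # 2 of 2
```

With both dummies, three columns carry only two columns' worth of information (diesel + gas always equals the intercept column). Dropping one restores full rank.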

In [None]:
# 2e. StandardScaler: normalize features to the same scale
feature_cols = ["enginesize", "curbweight", "horsepower", "carwidth", "citympg"]
X_multi = cars_encoded[feature_cols + ["fueltype_gas", "aspiration_turbo",
                                        "drivewheel_fwd", "drivewheel_rwd"]]
y_multi = cars_encoded["price"]

# Note: the scaler is fit on all rows here for demonstration;
# cell 3g shows the fit-on-training-rows-only pattern.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_multi),
                        columns=X_multi.columns, index=X_multi.index)

print("Before scaling (first row):")
print(X_multi.iloc[0].to_string())
print(f"\nAfter scaling (first row):")
print(X_scaled.iloc[0].round(3).to_string())

In [None]:
# 2f. Multi-feature regression: 5 numeric + 4 dummies
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(X_scaled, y_multi,
                                                     test_size=0.33, random_state=42)

model_multi = LinearRegression().fit(X_tr_m, y_tr_m)
r2_multi = model_multi.score(X_te_m, y_te_m)
r2_train = model_multi.score(X_tr_m, y_tr_m)

print(f"Multi-Feature Model (9 features)")
print(f"  Train R²: {r2_train:.4f}")
print(f"  Test R²:  {r2_multi:.4f}")
print(f"  Gap:      {r2_train - r2_multi:.4f}")

In [None]:
# 2g. R² progression: showing the value of feature engineering
progression = pd.DataFrame({
    "Model": ["1 feature (enginesize)", "2 features (+curbweight)", "9 features (+dummies+scale)"],
    "Test R²": [r2_1var, r2_2var, r2_multi]
})
print(progression.to_string(index=False))

# Bar chart
plt.figure(figsize=(8, 4))
bars = plt.bar(progression["Model"], progression["Test R²"],
               color=["#AED6F1", "#5DADE2", "#2E86C1"])
for bar, val in zip(bars, progression["Test R²"]):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f"{val:.3f}", ha="center", fontsize=11, fontweight="bold")
plt.title("R² Progression: Adding Features Improves the Model")
plt.ylabel("Test R²")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

In [None]:
# 2h. SelectKBest: automated feature ranking
all_numeric = cars_encoded.select_dtypes(include=[np.number]).drop(columns=["price"])
X_all = all_numeric.fillna(0)
y_all = cars_encoded["price"]

selector = SelectKBest(score_func=mutual_info_regression, k=10)
selector.fit(X_all, y_all)

feature_scores = pd.DataFrame({
    "Feature": X_all.columns,
    "Score": selector.scores_
}).sort_values("Score", ascending=False)

print("Top 10 Features by Mutual Information:")
print(feature_scores.head(10).to_string(index=False))

In [None]:
# 2i. Find the "sweet spot": train vs test R² for k=1 to 20
results = []
for k in range(1, min(21, X_all.shape[1] + 1)):
    sel = SelectKBest(score_func=mutual_info_regression, k=k)
    X_sel = sel.fit_transform(X_all, y_all)
    Xtr, Xte, ytr, yte = train_test_split(X_sel, y_all, test_size=0.33, random_state=42)
    m = LinearRegression().fit(Xtr, ytr)
    results.append({"k": k, "Train R²": m.score(Xtr, ytr), "Test R²": m.score(Xte, yte)})

results_df = pd.DataFrame(results)

plt.figure(figsize=(10, 5))
plt.plot(results_df["k"], results_df["Train R²"], "o-", label="Train R²", color="steelblue")
plt.plot(results_df["k"], results_df["Test R²"], "s--", label="Test R²", color="darkorange")
plt.fill_between(results_df["k"], results_df["Train R²"], results_df["Test R²"],
                 alpha=0.15, color="salmon", label="Overfitting gap")
plt.title("Feature Count vs R²: Finding the Sweet Spot")
plt.xlabel("Number of Features (k)")
plt.ylabel("R²")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

best_k = results_df.loc[results_df["Test R²"].idxmax(), "k"]
print(f"\n🏆 Best test R² at k={int(best_k)} features")

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 2</strong><br>
  <ul>
    <li>R² progressed from ~0.60 → ~0.76 → ~0.84: each addition improved the model</li>
    <li>The sweet spot chart shows the train-test gap widening as features increase, which is overfitting</li>
    <li>The best test R² occurs somewhere around k=8–12 features, not at the maximum</li>
  </ul>
</div>
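
One variation worth knowing: when the selector is fit on every row before the split, the test rows help decide which features are kept. A hedged sketch of an alternative wraps the selector and the model in a scikit-learn `Pipeline`, so the selection is learned from the training rows only. It runs on synthetic data from `make_regression` rather than the cars data, and it swaps in `f_regression` as the score function because that score is deterministic. Both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 20 features, only 5 carry signal (illustrative)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10.0, random_state=42)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42)

# The pipeline fits the selector on the training rows only,
# then applies the same column choice to the test rows
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", LinearRegression()),
])
pipe.fit(X_tr_syn, y_tr_syn)

print(f"Train R²: {pipe.score(X_tr_syn, y_tr_syn):.3f}")
print(f"Test R²:  {pipe.score(X_te_syn, y_te_syn):.3f}")
print("Kept columns:", pipe.named_steps["select"].get_support(indices=True))
```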

---
# Example 3 - Full Housing Pipeline + Logistic Regression Classifier

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Two objectives in one example. <strong>First half:</strong> Build a complete housing price model with feature engineering; this is the regression pipeline you'll replicate in the lab. <strong>Second half:</strong> Switch to <em>classification</em> with logistic regression. When the target is yes/no instead of a dollar amount, the entire evaluation framework changes.
</div>

## Part A: Full Housing Regression Pipeline

In [None]:
# 3a. Feature engineering on housing data
h = housing.copy()

# Create binary feature: has basement?
h["has_basement"] = (h["sqft_basement"] > 0).astype(int)

# Select features based on correlation analysis from Example 1
feature_cols = ["sqft_living", "bathrooms", "sqft_above", "floors", "has_basement"]
X_house = h[feature_cols]
y_house = h["price"]

print(f"Features: {feature_cols}")
print(f"Target: price")
print(f"Samples: {len(h):,}")
print(f"\nhas_basement distribution:")
print(h["has_basement"].value_counts().to_string())

In [None]:
# 3b. Train/test split and fit
X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(
    X_house, y_house, test_size=0.33, random_state=42)

model_full = LinearRegression().fit(X_tr_h, y_tr_h)
y_pred_full = model_full.predict(X_te_h)

r2_full = r2_score(y_te_h, y_pred_full)
rmse_full = np.sqrt(mean_squared_error(y_te_h, y_pred_full))

print(f"Full Housing Model (5 features)")
print(f"  R²:   {r2_full:.4f}")
print(f"  RMSE: ${rmse_full:,.0f}")
print(f"\nCoefficients:")
coef_df = pd.DataFrame({
    "Feature": feature_cols,
    "Coefficient": model_full.coef_
}).sort_values("Coefficient", ascending=False)
print(coef_df.to_string(index=False))
print(f"\nIntercept: ${model_full.intercept_:,.2f}")

In [None]:
# 3c. Residual plot: full model
residuals_full = y_te_h - y_pred_full

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs actual
axes[0].scatter(y_te_h, y_pred_full, alpha=0.15, s=10, color="steelblue")
axes[0].plot([y_te_h.min(), y_te_h.max()], [y_te_h.min(), y_te_h.max()],
             "r--", linewidth=2)
axes[0].set_title(f"Predicted vs Actual (R²={r2_full:.3f})")
axes[0].set_xlabel("Actual Price ($)")
axes[0].set_ylabel("Predicted Price ($)")
axes[0].grid(True, alpha=0.3)

# Residuals
axes[1].scatter(y_pred_full, residuals_full, alpha=0.15, s=10, color="steelblue")
axes[1].axhline(y=0, color="red", linewidth=2)
axes[1].set_title(f"Residuals (RMSE=${rmse_full:,.0f})")
axes[1].set_xlabel("Predicted Price ($)")
axes[1].set_ylabel("Residual ($)")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 3</strong><br>
  <ul>
    <li>R² jumped from ~0.29 (simple) to ~0.40+ (multiple): adding features helped</li>
    <li>RMSE is now in the $100K–$120K range: still large, but this dataset is limited</li>
    <li>Residuals should be roughly randomly scattered around zero</li>
  </ul>
</div>

---
## Part B: Logistic Regression, When the Target is Yes/No

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Linear regression predicts <em>continuous</em> values: dollars, temperatures, counts. But what if the target is <strong>binary</strong>: admitted/rejected, click/no-click, churn/stay? That's where <strong>logistic regression</strong> comes in. Same <code>.fit()</code> → <code>.predict()</code> workflow, completely different evaluation.
</div>
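
Under the hood, logistic regression computes a linear score (intercept plus weighted features) and passes it through the sigmoid function to get a probability. A minimal sketch of that mapping, with hypothetical scores rather than fitted coefficients:

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores for three applicants (not fitted values)
for score in (-2.0, 0.5, 2.0):
    prob = sigmoid(score)
    label = int(prob > 0.5)  # probabilities above 0.5 become class 1 (admit)
    print(f"score={score:+.1f}  P(admit)={prob:.3f}  predicted class={label}")
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.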

**Question:** Can we predict graduate school admission using GRE score, GPA, and school prestige?

In [None]:
# 3d. UCLA Admissions: EDA
print(admissions_df.head())
print(f"\nAdmission rate: {admissions_df['admit'].mean():.1%}")
print(f"\nAdmit rate by prestige rank:")
print(admissions_df.groupby("rank")["admit"].mean().to_string())

In [None]:
# 3e. Visualize: GPA distribution by admission outcome
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# GPA
for outcome, color, label in [(1, "steelblue", "Admitted"), (0, "salmon", "Rejected")]:
    subset = admissions_df[admissions_df["admit"] == outcome]
    axes[0].hist(subset["gpa"], bins=15, alpha=0.6, color=color, label=label)
axes[0].set_title("GPA Distribution by Outcome")
axes[0].set_xlabel("GPA")
axes[0].legend()

# GRE
for outcome, color, label in [(1, "steelblue", "Admitted"), (0, "salmon", "Rejected")]:
    subset = admissions_df[admissions_df["admit"] == outcome]
    axes[1].hist(subset["gre"], bins=15, alpha=0.6, color=color, label=label)
axes[1].set_title("GRE Distribution by Outcome")
axes[1].set_xlabel("GRE Score")
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# 3f. Prepare features: create dummy variables for prestige rank
adm = admissions_df.copy()
adm = pd.get_dummies(adm, columns=["rank"], drop_first=True, dtype=int)

print("Features after dummies:")
print(adm.columns.tolist())
print(f"\nShape: {adm.shape}")
adm.head()

In [None]:
# 3g. Logistic Regression: fit and predict
X_adm = adm.drop(columns=["admit"])
y_adm = adm["admit"]

X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_adm, y_adm, test_size=0.25, random_state=42)

# Scale features
scaler_adm = StandardScaler()
X_tr_a_scaled = scaler_adm.fit_transform(X_tr_a)
X_te_a_scaled = scaler_adm.transform(X_te_a)

log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_tr_a_scaled, y_tr_a)

y_pred_adm = log_model.predict(X_te_a_scaled)
accuracy = log_model.score(X_te_a_scaled, y_te_a)

print(f"Logistic Regression: Graduate Admissions")
print(f"  Accuracy: {accuracy:.4f} ({accuracy:.1%})")

In [None]:
# 3h. Confusion Matrix: who did we get right and wrong?
cm = confusion_matrix(y_te_a, y_pred_adm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=["Rejected", "Admitted"])
fig, ax = plt.subplots(figsize=(6, 5))
disp.plot(ax=ax, cmap="Blues", values_format="d")
plt.title("Confusion Matrix: Graduate Admissions")
plt.tight_layout()
plt.show()

# Interpret the matrix
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (correctly predicted rejected):  {tn}")
print(f"False Positives (predicted admit, actually rejected): {fp}")
print(f"False Negatives (predicted reject, actually admitted): {fn}")
print(f"True Positives (correctly predicted admitted):  {tp}")

In [None]:
# 3i. Classification Report: precision, recall, F1
print("Classification Report:")
print(classification_report(y_te_a, y_pred_adm,
                            target_names=["Rejected", "Admitted"]))

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE CLASSIFICATION REPORT</strong><br>
  <ul>
    <li><strong>Precision</strong> = "Of everyone we <em>predicted</em> would be admitted, what % actually were?" It answers "How trustworthy are our positive predictions?"</li>
    <li><strong>Recall</strong> = "Of everyone who was <em>actually</em> admitted, what % did we catch?" It answers "How many real admits did we miss?"</li>
    <li><strong>F1</strong> = The harmonic mean of precision and recall: a single number that balances both</li>
    <li><strong>Support</strong> = The number of actual cases in each class</li>
  </ul>
  For an admissions office: high <em>recall</em> means you're not accidentally rejecting strong candidates. High <em>precision</em> means you're not wasting interview slots on unlikely admits.
</div>
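
These definitions reduce to simple arithmetic on the four confusion-matrix counts. The sketch below uses made-up counts (not results from the admissions model) to show each formula for the positive class:

```python
# Made-up confusion-matrix counts (not results from the admissions model)
tn_demo, fp_demo, fn_demo, tp_demo = 60, 8, 22, 10

precision_demo = tp_demo / (tp_demo + fp_demo)  # of predicted admits, how many were real
recall_demo = tp_demo / (tp_demo + fn_demo)     # of real admits, how many were caught
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
accuracy_demo = (tp_demo + tn_demo) / (tn_demo + fp_demo + fn_demo + tp_demo)

print(f"Precision: {precision_demo:.3f}")  # 0.556
print(f"Recall:    {recall_demo:.3f}")     # 0.312
print(f"F1:        {f1_demo:.3f}")         # 0.400
print(f"Accuracy:  {accuracy_demo:.3f}")   # 0.700
```

In this toy case accuracy looks respectable at 0.70 even though recall for the positive class is only about 0.31, which is why the report shows per-class metrics as well.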

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 4</strong><br>
  <ul>
    <li>Accuracy ≈ 0.70–0.75: decent but not spectacular (the baseline of "predict everyone rejected" would be ~0.68)</li>
    <li>The confusion matrix shows the model is better at predicting rejections than admissions</li>
    <li>Recall for "Admitted" is likely lower than for "Rejected": the model is conservative</li>
  </ul>
</div>
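
The "predict everyone rejected" baseline is just the share of the majority class. A small sketch of how to compute it, using made-up labels with roughly the class balance the checkpoint implies (about 32% admitted):

```python
from collections import Counter

# Made-up labels: 32 admitted (1) and 68 rejected (0), for illustration only
labels_demo = [1] * 32 + [0] * 68

majority_class, majority_count = Counter(labels_demo).most_common(1)[0]
baseline_accuracy = majority_count / len(labels_demo)

print(f"Always predict {majority_class}: accuracy = {baseline_accuracy:.2f}")  # Always predict 0: accuracy = 0.68
```

Any classifier's accuracy is worth reading against this number: a model at 0.72 has learned something, but only about four points' worth beyond guessing the majority class every time.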

---
## Regression vs Classification: When to Use Which

| Question | Linear Regression | Logistic Regression |
|----------|------------------|-------------------|
| Target type | Continuous (dollars, sqft, degrees) | Binary (yes/no, 0/1) |
| Output | A number on a scale | A probability (0–1) → class label |
| Evaluation | R², RMSE, residual plot | Accuracy, precision, recall, F1, confusion matrix |
| sklearn class | `LinearRegression()` | `LogisticRegression()` |
| Same workflow? | ✅ `.fit()` → `.predict()` → `.score()` | ✅ `.fit()` → `.predict()` → `.score()` |

---
## Takeaway

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ WHAT WE BUILT TODAY</strong><br>
  Three types of regression models (simple, multiple, and logistic) using the same sklearn <code>.fit()</code> → <code>.predict()</code> → <code>.score()</code> workflow. The only things that change are the features, the model class, and the evaluation metrics.
</div>

**Pipeline position:** Regression is the interpretable baseline. Every model you build from here forward will be compared against a regression baseline.

**Next chapter preview:** In Chapter 4, you'll apply logistic regression to a real business problem: predicting which of 7,032 telecom customers are about to cancel their service, and putting a $509,000 price tag on the cost of getting it wrong.

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 3 Demo - Regression: Simple, Multiple, and Logistic | 19 code cells across 3 examples
</p>