<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/labs/lab02_regression_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2: SOLUTION KEY 🔑
## Regression Pipeline on a Dataset of Your Choice
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 (+3 bonus) | **Format:** Individual | **Due:** End of Week 3

| Part | Skills | Points |
|------|--------|--------|
| A: EDA | Correlation, scatterplot, data description | 5 |
| B: Simple Regression | One predictor, R², RMSE | 4 |
| C: Multiple Regression | Dummies, residuals, coefficients | 5 |
| D: Written Interpretation | Model comparison, business application | 6 |
| Bonus: Logistic Extension | Binary target, confusion matrix | +3 |

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GRADING PHILOSOPHY</strong><br>
  This lab rewards <strong>process over perfection</strong>. If your code fails but you explain what you tried, you earn most of the points.
</div>

### Student Information
- **Name:** SOLUTION KEY
- **Date:** Spring 2026
- **Dataset Chosen:** California Housing

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell, then uncomment your chosen dataset in the next cell.
</div>

In [None]:
# ============================================================
# Setup: run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (mean_squared_error, r2_score,
                             classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)

plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")
print("✅ Setup complete")

In [None]:
# ============================================================
# Choose your dataset: uncomment ONE option below
# ============================================================

# --- Option 1: California Housing ---
from sklearn.datasets import fetch_california_housing
cal = fetch_california_housing(as_frame=True)
df = cal.frame
TARGET = "MedHouseVal"

# --- Option 2: Auto MPG ---
# df = sns.load_dataset("mpg").dropna()
# TARGET = "mpg"

# --- Option 3: Medical Insurance ---
# insurance_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/insurance.csv"
# df = pd.read_csv(insurance_url)
# TARGET = "charges"

print(f"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Target: {TARGET}")
df.head()

---
# Part A: Exploratory Data Analysis (5 points)

### Task 1: Dataset Description (2 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Display <code>.info()</code> and <code>.describe()</code>. Then write 2–3 sentences describing the dataset.
</div>

In [None]:
# Task 1: Explore
df.info()
print()
df.describe().round(2)

**Dataset description (2–3 sentences):**

**Sample:** The California Housing dataset has 20,640 rows, each representing a census block group. It has 8 features including median income, house age, average rooms, and geographic coordinates, with the target being median house value in units of $100K. Each row represents an aggregated neighborhood, not an individual home.

### Task 2: Correlation Heatmap (1.5 points)

In [None]:
# Task 2: Sorted correlation with target
numeric = df.select_dtypes(include=[np.number])
corr = numeric.corr()[TARGET].drop(TARGET).sort_values(ascending=False)

plt.figure(figsize=(8, 5))
corr.plot(kind="barh", color=["steelblue" if v > 0 else "salmon" for v in corr.values])
plt.title(f"Feature Correlations with {TARGET}")
plt.xlabel("Pearson Correlation")
plt.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()

print("Top 3:", corr.head(3).index.tolist())

### Task 3: Scatterplot of Strongest Predictor (1.5 points)

In [None]:
# Task 3: Scatterplot of strongest predictor
best_feature = "MedInc"  # Top correlated for California Housing
plt.figure(figsize=(10, 5))
plt.scatter(df[best_feature], df[TARGET], alpha=0.1, s=8, color="steelblue")
plt.title(f"{best_feature} vs {TARGET}")
plt.xlabel(best_feature)
plt.ylabel(TARGET)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**What do you see? (1–2 sentences):**

**Sample:** There is a strong positive relationship between median income and median house value. The relationship is roughly linear until values reach the top of the range, where the target is capped at 5.0 (that is, $500K), creating a visible ceiling effect.
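
The ceiling effect described above can be illustrated on synthetic data. The sketch below is not the California Housing data: the predictor, the slope of 0.45, the noise level, and the `CAP` constant are all assumptions chosen to mimic a target that is top-coded at 5.0. It suggests that capping the target tends to flatten the fitted slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor and an uncapped "true" target (illustrative values only)
income = rng.uniform(0.5, 15.0, size=1_000)
true_value = 0.45 * income + rng.normal(0.0, 0.5, size=1_000)

CAP = 5.0  # assumed ceiling, mirroring a target capped at 5.0 ($500K)
observed = np.minimum(true_value, CAP)  # top-code everything above the cap

# Share of rows sitting exactly on the ceiling
share_capped = np.mean(observed >= CAP)

# Straight-line fits (degree-1 polynomial); index 0 is the slope
slope_true = np.polyfit(income, true_value, 1)[0]
slope_observed = np.polyfit(income, observed, 1)[0]

print(f"Share of rows at the cap: {share_capped:.1%}")
print(f"Slope without cap: {slope_true:.3f}")
print(f"Slope with cap:    {slope_observed:.3f}")
```

On the real dataset, a comparable check would be counting the rows where the target equals its maximum value.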

---
# Part B: Simple Linear Regression (4 points)

### Task 4: Simple Regression (2 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a simple regression using your strongest predictor. Use <code>test_size=0.25, random_state=42</code>. Report R².
</div>

In [None]:
# Task 4: Simple regression
X_simple = df[["MedInc"]]
y = df[TARGET]
X_train, X_test, y_train, y_test = train_test_split(X_simple, y, test_size=0.25, random_state=42)
model_simple = LinearRegression().fit(X_train, y_train)
r2_simple = model_simple.score(X_test, y_test)
y_pred_simple = model_simple.predict(X_test)
print(f"Simple Regression R²: {r2_simple:.4f}")

### Task 5: RMSE Interpretation (2 points)

In [None]:
# Task 5: RMSE
rmse_simple = np.sqrt(mean_squared_error(y_test, y_pred_simple))
print(f"RMSE: {rmse_simple:.4f} ($100K units)")
print(f"In dollars: ${rmse_simple * 100_000:,.0f}")

**RMSE interpretation (1–2 sentences):**

**Sample:** The RMSE of approximately 0.84 means the model's predictions typically miss by about $84,000; because RMSE squares the errors before averaging, large misses count more heavily than they would in a plain average error. For a housing market where median home values range from about $15K to $500K, this is a substantial error: the model captures the general trend but misses many neighborhood-level factors.
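
To make the RMSE interpretation concrete, here is a small worked example comparing RMSE with the mean absolute error (MAE). The four values are hypothetical and stand in for prices in $100K units; they are not taken from the lab's model.

```python
import numpy as np

# Hypothetical actual and predicted values, in $100K units
y_true = np.array([2.0, 3.0, 1.5, 4.0])
y_pred = np.array([2.5, 2.0, 1.5, 2.0])

errors = y_true - y_pred              # [-0.5, 1.0, 0.0, 2.0]

mae = np.mean(np.abs(errors))         # (0.5 + 1.0 + 0.0 + 2.0) / 4 = 0.875
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(5.25 / 4) = sqrt(1.3125), about 1.1456

print(f"MAE:  {mae:.4f}  -> ${mae * 100_000:,.0f}")
print(f"RMSE: {rmse:.4f}  -> ${rmse * 100_000:,.0f}")
```

RMSE comes out larger than MAE here because the single 2.0 miss dominates the squared errors. That is why RMSE is best read as a typical error size that penalizes big misses, not as a literal average.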

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT</strong><br>
  You should have a baseline R² and RMSE. A negative test R² means the model predicts worse than always guessing the mean; your feature choice may be poor, so try a different one.
</div>
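
A short sketch of why R² can go negative, using the standard definition R² = 1 − SS_res / SS_tot. The helper `r_squared` and the toy values are written for illustration only; in the lab itself, `model.score()` or `r2_score` computes this for you.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (illustrative helper, not part of the lab code)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]     # hypothetical targets
mean_baseline = [2.5] * 4         # always predicts the mean
reversed_model = [4.0, 3.0, 2.0, 1.0]  # gets the trend backwards

print(r_squared(y_true, y_true))          # 1.0  (perfect)
print(r_squared(y_true, mean_baseline))   # 0.0  (no better than the mean)
print(r_squared(y_true, reversed_model))  # -3.0 (worse than the mean)
```

A score of 0 is the "predict the mean" baseline, so anything below 0 means the model's errors are larger than that baseline's.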

---
# Part C: Multiple Regression (5 points)

### Task 6: Multiple Regression with Dummies (2 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a multiple regression using 3–5 features. If categorical columns exist, use <code>pd.get_dummies(drop_first=True)</code>. Report R² and compare to Task 4.
</div>

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
  If your dataset has string/object columns, <code>.fit()</code> will raise a <code>ValueError</code>. Use <code>get_dummies()</code> to encode them, or <code>.select_dtypes(include=[np.number])</code> to keep only numeric features.
</div>
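
As an illustration of the dummy-encoding step, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical (loosely modeled on the Auto MPG option); `dtype=int` is an optional choice made here so the dummies print as 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "horsepower": [130, 95, 88, 150],
    "origin": ["usa", "japan", "europe", "usa"],
})

# drop_first=True drops the first category (alphabetically "europe"),
# which becomes the reference level and avoids redundant columns
encoded = pd.get_dummies(toy, columns=["origin"], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['horsepower', 'origin_japan', 'origin_usa']
print(encoded)
```

A row with zeros in both dummy columns is a "europe" row. Each dummy coefficient in the fitted model is then read relative to that dropped reference category.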

In [None]:
# Task 6: Multiple regression
features = ["MedInc", "AveRooms", "HouseAge", "AveOccup", "Latitude"]
X_multi = df[features]
X_tr, X_te, y_tr, y_te = train_test_split(X_multi, y, test_size=0.25, random_state=42)
model_multi = LinearRegression().fit(X_tr, y_tr)
r2_multi = model_multi.score(X_te, y_te)
y_pred_multi = model_multi.predict(X_te)
print(f"Multiple Regression R²: {r2_multi:.4f} (was {r2_simple:.4f})")
print(f"Improvement: {r2_multi - r2_simple:+.4f}")  # "+" format shows the correct sign even if R² drops

### Task 7: Residual Plot (1.5 points)

In [None]:
# Task 7: Residual plot
residuals = y_te - y_pred_multi
plt.figure(figsize=(10, 5))
plt.scatter(y_pred_multi, residuals, alpha=0.1, s=8, color="steelblue")
plt.axhline(y=0, color="red", linewidth=2)
plt.title("Residual Plot: Multiple Regression")
plt.xlabel("Predicted Value")
plt.ylabel("Residual")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Residual interpretation (2+ sentences):**

**Sample:** The residuals show a roughly random scatter around zero for lower predicted values, but there is a visible pattern at the upper end: the model consistently underpredicts for the most expensive neighborhoods due to the $500K cap in the target variable. Because every capped row has the same actual value, those rows fall along a straight diagonal band in the plot. There is also a slight funnel shape, with larger errors at higher predictions.
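
One rough way to back up a "funnel shape" observation with numbers is to compare the residual spread across bins of the predicted value. The sketch below uses synthetic predictions and residuals (the ranges and the 0.1 noise factor are assumptions), so it shows the technique only and says nothing about the lab's actual residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: noise grows with the prediction, which produces a funnel
predicted = rng.uniform(0.5, 5.0, size=2_000)
residuals = rng.normal(0.0, 0.1 * predicted)

# Split predictions into four equal-count bins (quartiles)
edges = np.quantile(predicted, [0.25, 0.50, 0.75])
bin_ids = np.digitize(predicted, edges)  # values 0, 1, 2, 3

# Standard deviation of the residuals inside each bin
spread = np.array([residuals[bin_ids == b].std() for b in range(4)])

for b, s in enumerate(spread):
    print(f"Quartile {b + 1} of predictions: residual std = {s:.3f}")
```

If the spread rises steadily from the first quartile to the last, the error variance is not constant. With the lab's variables, the same check would use `y_pred_multi` and `residuals` from Task 7.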

### Task 8: Coefficient Table (1.5 points)

In [None]:
# Task 8: Coefficients
coef_df = pd.DataFrame({
    "Feature": features,
    "Coefficient": model_multi.coef_
}).sort_values("Coefficient", key=abs, ascending=False)
print(coef_df.to_string(index=False))
print(f"\nIntercept: {model_multi.intercept_:.4f}")

---
# Part D: Written Interpretation (6 points)

### Task 9: Analysis (minimum 150 words)

Answer ALL four questions:

1. **Model comparison:** How much did R² improve from simple to multiple regression?
2. **Feature insight:** Which features were strongest? Any surprises?
3. **Business application:** In 2–3 sentences, explain what your model does for a non-technical audience and give one actionable recommendation.
4. **Limitations:** Name one factor NOT in the dataset that would improve predictions.

**1. Model comparison:** R² improved from approximately 0.47 (simple, income only) to approximately 0.60 (multiple, 5 features). The improvement of +0.13 was meaningful: adding geographic and housing characteristics gave the model substantially more explanatory power. The added complexity of four more features was justified by the gain in test-set fit.

**2. Feature insight:** Median income was by far the strongest predictor, which is expected: wealthier neighborhoods have more expensive homes. Latitude was a surprising addition; it captures part of the North/South California price gradient (coastal Southern California is more expensive). Average occupancy had a negative coefficient, meaning crowded neighborhoods tend to have lower property values.

**3. Business application:** This model predicts median home values across California neighborhoods based on demographic and housing characteristics. For a real estate investment firm, the key insight is that median income alone explains nearly half of the variation in neighborhood home values, making it the single most important factor to evaluate. Our recommendation: prioritize investment in neighborhoods where median income is rising but home values haven't caught up yet, because that is where the model predicts the largest undervaluation.

**4. Limitations:** School district quality is not in the dataset but is widely regarded as one of the strongest real-world predictors of home values. Adding school ratings, test scores, or proximity to top-rated schools could plausibly improve R² by 5–10 percentage points.
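
A note on the feature-strength comparison in answer 2: raw regression coefficients are expressed in each feature's own units, so sorting them by absolute size can mislead when features sit on very different scales. The sketch below uses synthetic data with two hypothetical features and a hand-rolled least-squares helper (`ols_coefs`, an illustrative name) to show how standardizing the inputs puts the coefficients on a comparable footing. It is one possible approach under these assumptions, not the method the lab requires.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two hypothetical features on very different scales
income = rng.normal(4.0, 2.0, n)          # small numbers
occupancy = rng.normal(3000.0, 500.0, n)  # large numbers
y = 0.4 * income - 0.0004 * occupancy + rng.normal(0.0, 0.1, n)

def ols_coefs(X, y):
    """Least-squares slopes (intercept fitted, then dropped). Illustrative helper."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

X = np.column_stack([income, occupancy])
raw = ols_coefs(X, y)

# Standardize each column to mean 0 and standard deviation 1, then refit
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefs(X_std, y)

print("Raw coefficients:         ", np.round(raw, 5))
print("Standardized coefficients:", np.round(standardized, 3))
```

In raw units the occupancy coefficient looks negligible only because occupancy is measured in the thousands. After standardizing, each coefficient reads as the change in the target per one standard deviation of the feature, which is a fairer basis for ranking.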

---
# Bonus Challenge: Logistic Regression Extension (+3 points)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 OPTIONAL</strong><br>
  Create a binary target (e.g., "above median price"), build a <code>LogisticRegression</code>, generate a confusion matrix and classification report, and interpret precision/recall in one paragraph. No scaffolding provided.
</div>

In [None]:
# Bonus: Logistic regression extension
median_val = df[TARGET].median()
df["above_median"] = (df[TARGET] > median_val).astype(int)

X_log = df[["MedInc", "AveRooms", "HouseAge", "AveOccup", "Latitude"]]
y_log = df["above_median"]

X_tr_l, X_te_l, y_tr_l, y_te_l = train_test_split(X_log, y_log, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_l)
X_te_scaled = scaler.transform(X_te_l)

log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_tr_scaled, y_tr_l)
y_pred_log = log_model.predict(X_te_scaled)

print(f"Accuracy: {log_model.score(X_te_scaled, y_te_l):.4f}")
print()
print(classification_report(y_te_l, y_pred_log, target_names=["Below Median", "Above Median"]))

cm = confusion_matrix(y_te_l, y_pred_log)
disp = ConfusionMatrixDisplay(cm, display_labels=["Below Median", "Above Median"])
fig, ax = plt.subplots(figsize=(6, 5))
disp.plot(ax=ax, cmap="Blues")
plt.title("Confusion Matrix: Above/Below Median Price")
plt.tight_layout()
plt.show()

**Bonus interpretation:**

**Sample:** The logistic regression achieved approximately 83% accuracy in classifying neighborhoods as above or below median value. Precision for 'above median' was strong (~0.83), meaning when the model predicts a neighborhood is expensive, it's usually right. Recall was similar, meaning it catches most of the truly expensive neighborhoods. The confusion matrix shows the errors are roughly balanced between false positives and false negatives, suggesting no strong bias toward either class.
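
For reference, precision and recall can be read directly off a 2×2 confusion matrix. The counts below are hypothetical, not the lab's results; the layout (rows = actual, columns = predicted, negative class first) follows what `confusion_matrix` returns for 0/1 labels.

```python
import numpy as np

# Hypothetical counts: rows = actual class, columns = predicted class
#                 pred below  pred above
cm = np.array([[80,         20],    # actually below median
               [10,         90]])   # actually above median

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of predicted "above", how many were right: 90 / 110
recall = tp / (tp + fn)           # of actual "above", how many were caught:   90 / 100
accuracy = (tp + tn) / cm.sum()   # overall share correct:                     170 / 200

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

False positives (top right) lower precision, and false negatives (bottom left) lower recall. Comparing those two cells shows which kind of mistake the model makes more often.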

---
## Reflection (required)

> *Regression models predict, but they also reveal which features matter. Think about a dataset from your life or career: what would you predict, what features would you include, and how would you know if the model was "good enough"?*

**Your reflection (minimum 3 sentences):**

**Sample:** I would predict monthly customer churn at a subscription service using features like usage frequency, support ticket count, subscription tenure, payment method, and plan type. I would know the model was 'good enough' if it could identify at-risk customers at least 2 weeks before they cancel, with precision above 70% ‚Äî meaning most flagged customers would actually churn without intervention. The business value comes not from perfect prediction but from giving the retention team a prioritized list to act on.

---
## Troubleshooting

| Problem | Fix |
|---------|-----|
| `ValueError: could not convert string` | Categorical columns present. Use `get_dummies()` or drop them |
| R² is negative | The model is doing worse than predicting the mean. Try the top-correlated features |
| `ValueError: reshape` | Use `df[["col"]]` (double brackets) for X |
| `fetch_california_housing` error | Run `!pip install -U scikit-learn` |
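
The double-bracket fix in the table comes down to dimensionality: single brackets return a 1-D Series, while double brackets return a 2-D DataFrame, which is the shape scikit-learn expects for `X`. A quick check on a toy frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data reusing the lab's column names
toy = pd.DataFrame({
    "MedInc": [3.2, 5.1, 8.0],
    "MedHouseVal": [1.5, 2.4, 4.1],
})

as_series = toy["MedInc"]    # single brackets -> Series, 1-D
as_frame = toy[["MedInc"]]   # double brackets -> DataFrame, 2-D

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

Passing the 1-D version as `X` is what triggers the reshape error. The target `y` is fine as a Series.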

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 2: Regression Pipeline | 20 Points (+3 Bonus)
</p>