
# Week 3 Group Exercise - SOLUTION KEY 🔑 - Multiple Regression: Predict and Interpret
**CAP4767 Data Mining with Python** | Miami Dade College - Kendall Campus

**Points:** 10 | **Duration:** ~45 minutes | **Deliverable:** Completed notebook + 2-minute presentation

**Objective:** Build a multiple regression model on WA housing data, evaluate it, interpret the coefficients, and present findings to the class.

### Group Members & Roles

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Drives the notebook |
| 📊 **Data Interpreter** | | Reads outputs, explains numbers |
| 🎤 **Presenter** | | Delivers the 2-minute share-out |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints |

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding, 3 minutes)</strong><br>
  Your group has been hired by a Washington State real estate agency. They want a model that predicts home sale prices. Before touching any code, discuss:
  <ol>
    <li>Which features do you <em>think</em> will be the strongest predictors of price? Rank your top 3.</li>
    <li>Are any features likely to be redundant with each other?</li>
    <li>What is one feature NOT in this dataset that would improve predictions?</li>
  </ol>
</div>

**Our group's predictions:**

**Sample:** We predict sqft_living, bathrooms, and sqft_above will be strongest. sqft_living and sqft_above are likely redundant since above-ground sqft is a subset of total living sqft. A feature like school district rating or neighborhood name would greatly improve predictions since location drives real estate pricing.

---

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell below. Do not modify.
</div>

In [None]:
# ============================================================
# Setup - Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

data_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/housingData.csv"
housing = pd.read_csv(data_url)
housing = housing[(housing["sqft_living"] < 8000) &
                  (housing["price"] < 1_000_000) &
                  (housing["price"] > 0)].copy()  # copy so Task 5 can add a column without touching a view
print(f"Dataset: {len(housing):,} rows × {housing.shape[1]} columns")
print(f"Price range: ${housing['price'].min():,.0f} – ${housing['price'].max():,.0f}")

---
## Task 1 - Explore the Data (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Display <code>.info()</code> and <code>.describe()</code> to understand the data.
</div>

In [None]:
# Task 1: Explore
housing.info()
print()
housing.describe().round(2)

---
## Task 2 - Correlation Analysis (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate correlations with <code>price</code> and display a sorted bar chart. Identify the top 3 features.
</div>

In [None]:
# Task 2: Correlation analysis
numeric = housing.select_dtypes(include=[np.number])
corr = numeric.corr()["price"].drop("price").sort_values(ascending=False)

plt.figure(figsize=(8, 6))
corr.plot(kind="barh", color=["steelblue" if v > 0 else "salmon" for v in corr.values])
plt.title("Feature Correlations with Price")
plt.xlabel("Pearson Correlation")
plt.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()

print("Top 3:", corr.head(3).index.tolist())

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 1</strong><br>
  <code>sqft_living</code> should be the strongest positive correlation with price. If not, check your filter.
</div>
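
The ranking in Task 2 only compares each feature with price. To spot redundancy between predictors, such as the `sqft_living` / `sqft_above` pair flagged in the group discussion sample, it can help to correlate the predictors with each other as well. The sketch below uses synthetic numbers rather than the housing file, and the helper `pearson` is a hypothetical name introduced only for illustration.

```python
import numpy as np

# Synthetic stand-in for the housing data (values are invented for illustration).
rng = np.random.default_rng(42)
n = 500
sqft_above = rng.uniform(800, 3000, n)          # above-ground area
sqft_basement = rng.uniform(0, 600, n)          # basement area
sqft_living = sqft_above + sqft_basement        # total living area
floors = rng.integers(1, 4, n).astype(float)    # unrelated to area in this toy data

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_redundant = pearson(sqft_living, sqft_above)
r_unrelated = pearson(sqft_living, floors)
print(f"sqft_living vs sqft_above: {r_redundant:.2f}")
print(f"sqft_living vs floors:     {r_unrelated:.2f}")
```

A predictor pair with a correlation close to 1 carries nearly the same information, so adding both to a model tends to make their individual coefficients unstable even when R² barely changes.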

---
## Task 3 - Simple Regression Baseline (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a simple linear regression using ONLY your strongest predictor. Use <code>test_size=0.33, random_state=42</code>. Report R².
</div>

In [None]:
# Task 3: Simple regression baseline
X = housing[["sqft_living"]]
y = housing["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model_simple = LinearRegression().fit(X_train, y_train)
r2_simple = model_simple.score(X_test, y_test)
print(f"Simple Regression R²: {r2_simple:.4f}")
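
For a regressor, `score` reports R², the share of variation in the test prices that the predictions account for: $R^2 = 1 - SS_{res} / SS_{tot}$. The worked sketch below shows the arithmetic on four invented prices (in $1,000s, not taken from the housing file); `r_squared` is a hypothetical helper written for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model misses
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    return 1.0 - ss_res / ss_tot

# Invented sale prices in $1,000s.
actual = [300, 400, 500, 600]
good_guess = [320, 380, 520, 580]    # close to the actual prices
mean_guess = [450, 450, 450, 450]    # always predict the mean price
bad_guess = [600, 500, 400, 300]     # reversed order, worse than the mean

r2_good = r_squared(actual, good_guess)   # 1 - 1,600 / 50,000
r2_mean = r_squared(actual, mean_guess)   # 1 - 50,000 / 50,000
r2_bad = r_squared(actual, bad_guess)     # 1 - 200,000 / 50,000
print(r2_good, r2_mean, r2_bad)
```

The three cases come out near 0.968, exactly 0, and -3: predicting the mean scores zero, and anything worse than the mean goes negative.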

---
## Task 4 - Multiple Regression (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a multiple regression using 3–4 numeric features. Use the same split parameters. Report R² and compare to Task 3.
</div>

In [None]:
# Task 4: Multiple regression
features = ["sqft_living", "bathrooms", "sqft_above", "floors"]
X_multi = housing[features]
X_tr, X_te, y_tr, y_te = train_test_split(X_multi, y, test_size=0.33, random_state=42)
model_multi = LinearRegression().fit(X_tr, y_tr)
r2_multi = model_multi.score(X_te, y_te)
print(f"Multiple Regression R²: {r2_multi:.4f} (was {r2_simple:.4f})")

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 2</strong><br>
  Your multiple regression R² should be higher than the simple regression R².
</div>
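
Checkpoint 2 usually holds, but only the training-side version is guaranteed: with an intercept in the model, adding a feature can never lower R² on the data the model was fitted to, whereas test R² can move either way. The sketch below illustrates that guarantee on synthetic data with a deliberately useless extra feature. It uses NumPy's least-squares solver as a stand-in for `LinearRegression`, and all names and numbers are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2(X, y, beta):
    """R^2 of the fitted model on the given data."""
    pred = np.column_stack([np.ones(len(X)), X]) @ beta
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 4000, n)
noise_feature = rng.normal(size=n)                 # unrelated to price by construction
price = 100_000 + 150 * sqft + rng.normal(0, 60_000, n)

X_small = sqft.reshape(-1, 1)                      # one real feature
X_big = np.column_stack([sqft, noise_feature])     # real feature plus pure noise

train_r2_small = r2(X_small, price, fit_ols(X_small, price))
train_r2_big = r2(X_big, price, fit_ols(X_big, price))
print(f"Training R² with 1 feature:  {train_r2_small:.4f}")
print(f"Training R² with 2 features: {train_r2_big:.4f}")
```

Because even a noise column nudges training R² upward, a higher score alone does not show that a new feature is useful; the comparison on held-out test data is the one that matters.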

---
## Task 5 - Add Dummy Variables (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Create <code>has_basement</code> (1 if <code>sqft_basement > 0</code>, else 0). Add it to your model and report R¬≤.
</div>

In [None]:
# Task 5: Add dummy variables
housing["has_basement"] = (housing["sqft_basement"] > 0).astype(int)
features_dum = ["sqft_living", "bathrooms", "sqft_above", "floors", "has_basement"]
X_dum = housing[features_dum]
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(X_dum, y, test_size=0.33, random_state=42)
model_dum = LinearRegression().fit(X_tr_d, y_tr_d)
r2_dum = model_dum.score(X_te_d, y_te_d)
y_pred = model_dum.predict(X_te_d)
print(f"With Dummies R²: {r2_dum:.4f}")
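
A 0/1 dummy such as `has_basement` shifts the prediction by a fixed dollar amount. When the dummy is the only feature, its fitted coefficient equals the difference between the two group means, which gives a simple way to sanity-check what the variable encodes. The sketch below uses six invented prices (in $1,000s) and NumPy least squares in place of `LinearRegression`.

```python
import numpy as np

# Invented prices in $1,000s; the dummy marks homes that have a basement.
price = np.array([300.0, 340.0, 320.0, 400.0, 420.0, 380.0])
sqft_basement = np.array([0, 0, 0, 500, 800, 300])
has_basement = (sqft_basement > 0).astype(int)     # same construction as Task 5

# Fit price = intercept + slope * has_basement by least squares.
A = np.column_stack([np.ones(len(price)), has_basement])
(intercept, slope), *_ = np.linalg.lstsq(A, price, rcond=None)

mean_without = price[has_basement == 0].mean()
mean_with = price[has_basement == 1].mean()
print(f"intercept = {intercept:.1f} (mean without basement = {mean_without:.1f})")
print(f"slope     = {slope:.1f} (difference in means = {mean_with - mean_without:.1f})")
```

Once other features are in the model, the dummy's coefficient is the shift with those features held constant, so it will generally differ from the raw difference in means.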

---
## Task 6 - Calculate RMSE (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate RMSE on your best model's test predictions. Interpret what this number means in dollars.
</div>

In [None]:
# Task 6: RMSE
rmse = np.sqrt(mean_squared_error(y_te_d, y_pred))
print(f"RMSE: ${rmse:,.0f}")
print(f"Interpretation: A typical prediction misses the sale price by about ${rmse:,.0f} (large misses count extra).")

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 3</strong><br>
  RMSE should be in the range of $55,000–$85,000.
</div>
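
RMSE is expressed in dollars, like price, but it is not a plain average of the errors. Squaring gives large misses extra weight, so RMSE is always at least as large as the mean absolute error (MAE). The arithmetic below uses four invented errors in $1,000s.

```python
import math

# Invented prediction errors (actual - predicted) in $1,000s.
errors = [10, -10, 10, -50]

mae = sum(abs(e) for e in errors) / len(errors)               # (10 + 10 + 10 + 50) / 4
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))   # sqrt((100 + 100 + 100 + 2500) / 4)

print(f"MAE:  {mae:.1f}")    # 20.0
print(f"RMSE: {rmse:.1f}")   # 26.5
```

The single $50,000 miss pulls RMSE well above MAE, which is why RMSE is best described as a typical error that penalizes large misses rather than as the average error.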

---
## Task 7 - Residual Plot (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Plot residuals (actual − predicted) vs predicted values with a red zero line.
</div>

In [None]:
# Task 7: Residual plot
residuals = y_te_d - y_pred
plt.figure(figsize=(10, 5))
plt.scatter(y_pred, residuals, alpha=0.15, s=10, color="steelblue")
plt.axhline(y=0, color="red", linewidth=2)
plt.title("Residual Plot")
plt.xlabel("Predicted Price ($)")
plt.ylabel("Residual ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Residual interpretation (minimum 2 sentences):**

**Sample:** The residuals are roughly randomly scattered around zero, which suggests the model is not systematically biased. However, there is a slight funnel shape (errors are larger for higher predicted prices), indicating the model is less reliable for expensive homes.
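
One way to check a funnel shape numerically rather than by eye is to compare the spread of the residuals for low and high predicted prices. The sketch below does not use the model output from this notebook: it builds synthetic predictions whose noise grows with price (an assumption chosen to mimic the pattern described in the sample) and splits them at the median prediction.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
predicted = rng.uniform(200_000, 900_000, n)
# Synthetic residuals whose spread is proportional to the predicted price.
residuals = rng.normal(0, 1, n) * 0.10 * predicted

cutoff = np.median(predicted)
spread_low = residuals[predicted <= cutoff].std()    # cheaper half
spread_high = residuals[predicted > cutoff].std()    # pricier half

print(f"Residual std, cheaper half: ${spread_low:,.0f}")
print(f"Residual std, pricier half: ${spread_high:,.0f}")
print(f"Ratio (high / low):         {spread_high / spread_low:.2f}")
```

With real results, `predicted` and `residuals` would be replaced by `y_pred` and `residuals` from Task 7. A ratio near 1 suggests a roughly constant spread, while a ratio well above 1 supports the funnel reading.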

---
## Task 8 - Coefficient Interpretation (1 point)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Display coefficients alongside feature names as a sorted DataFrame. Answer: Which feature has the largest positive effect? Does <code>has_basement</code> make business sense?
</div>

In [None]:
# Task 8: Coefficients
coef_df = pd.DataFrame({
    "Feature": features_dum,
    "Coefficient": model_dum.coef_
}).sort_values("Coefficient", ascending=False)
print(coef_df.to_string(index=False))
print(f"\nIntercept: ${model_dum.intercept_:,.2f}")

**Coefficient interpretation (minimum 3 sentences):**

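**Sample:** sqft_living has the largest positive coefficient: holding the other features constant, each additional square foot raises the predicted price by that many dollars. has_basement is positive, which makes business sense because homes with basements offer extra usable space, storage, and potential for finishing, all of which add value. The floors coefficient is smaller, suggesting that the number of stories matters less than total square footage for pricing. Keep in mind that each coefficient is in dollars per unit of its own feature (one square foot, one bathroom, one floor, or a 0/1 flag), so raw coefficient sizes are not directly comparable across features.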
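
Raw coefficients are in dollars per unit of each feature, so a feature measured in small units (square feet) can have a tiny coefficient and still matter more than one measured in large units (bathrooms). One common way to compare features is to standardize them before fitting. The sketch below uses synthetic data with invented true effects of $150 per square foot and $20,000 per bathroom, and NumPy least squares as a stand-in for `LinearRegression`; `ols_coefs` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
sqft = rng.normal(2000, 800, n)
baths = rng.normal(2.0, 0.7, n)
price = 50_000 + 150 * sqft + 20_000 * baths + rng.normal(0, 30_000, n)

def ols_coefs(X, y):
    """Least-squares slopes (the intercept is fitted but not returned)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X_raw = np.column_stack([sqft, baths])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # z-scores per column

raw_sqft, raw_baths = ols_coefs(X_raw, price)   # dollars per unit
std_sqft, std_baths = ols_coefs(X_std, price)   # dollars per standard deviation

print(f"Raw:          sqft = {raw_sqft:,.0f}   baths = {raw_baths:,.0f}")
print(f"Standardized: sqft = {std_sqft:,.0f}   baths = {std_baths:,.0f}")
```

In this toy setup the bathroom coefficient is far larger in raw units, yet square footage has the larger effect once both features are on the same scale. The same check can be applied to the Task 8 features before ranking them by importance.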

---

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `ValueError: could not convert string to float` | X contains a categorical (string) column. Remove it or encode it with `pd.get_dummies()` |
| R² is negative | The model does worse on the test data than always predicting the mean price. Check your feature selection |
| `ValueError: reshape your data` | Use double brackets: `housing[["sqft_living"]]` not `housing["sqft_living"]` |
| RMSE is in the millions | Check that you filtered outliers in the setup cell |
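
The double-bracket fix in the table works because scikit-learn estimators expect a 2-D feature matrix. Single brackets return a 1-D Series, while double brackets return a one-column DataFrame. A toy frame with invented values shows the difference in shape.

```python
import pandas as pd

# Tiny invented frame standing in for the housing data.
toy = pd.DataFrame({"sqft_living": [1200, 1800, 2400],
                    "price": [250_000, 340_000, 455_000]})

as_series = toy["sqft_living"]     # single brackets -> 1-D Series
as_frame = toy[["sqft_living"]]    # double brackets -> 2-D DataFrame

print(as_series.shape)   # (3,)
print(as_frame.shape)    # (3, 1)
```

The target `y` can stay 1-D; only the feature matrix `X` needs the second dimension.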

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 3 Group Exercise - Multiple Regression: Predict and Interpret | 10 Points
</p>