## Regression
Scenario: Predicting the price of houses based on features such as square footage, number of bedrooms, bathrooms, and location.

---

## Dataset
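This analysis uses the California Housing dataset, which is open and commonly used for regression tasks. The code loads it through scikit-learn (preinstalled in Colab) with `fetch_california_housing`. The following table provides descriptions, data ranges, and data types for the block-level columns of the data set. scikit-learn exposes the same information under different feature names (`MedInc`, `HouseAge`, `AveRooms`, `AveBedrms`, `Population`, `AveOccup`, `Latitude`, `Longitude`).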

---

| Feature | Description | Data set min | Data set max | Data type |
|---|---|---|---|---|
| **longitude** | A measure of how far west a house is; a more negative value is farther west. Longitude values range from -180 to +180. | -124.3 | -114.3 | float64 |
| **latitude** | A measure of how far north a house is; a higher value is farther north. Latitude values range from -90 to +90. | 32.5 | 42.5 | float64 |
| **housingMedianAge** | Median age of a house within a block; a lower number is a newer building | 1.0 | 52.0 | float64 |
| **totalRooms** | Total number of rooms within a block | 2.0 | 37937.0 | float64 |
| **totalBedrooms** | Total number of bedrooms within a block | 1.0 | 6445.0 | float64 |
| **population** | Total number of people residing within a block | 3.0 | 35682.0 | float64 |
| **households** | Total number of households (a group of people residing within a home unit) for a block | 1.0 | 6082.0 | float64 |
| **medianIncome** | Median income for households within a block of houses (measured in tens of thousands of US Dollars) | 0.5 | 15.0 | float64 |
| **medianHouseValue** | Median house value for households within a block (measured in US Dollars) | 14999.0 | 500001.0 | float64 |

---

Target: medianHouseValue
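
The block totals described above and the features that `fetch_california_housing` returns are related by simple ratios. As far as the scikit-learn documentation describes it, `AveRooms`, `AveBedrms`, and `AveOccup` are per-household averages, and the target `MedHouseVal` is expressed in units of \$100,000. A minimal sketch of that conversion for one made-up block (the helper `to_sklearn_style` and all numbers are illustrative assumptions, not rows from the data set):

```python
def to_sklearn_style(block):
    """Convert block totals into per-household averages and a scaled target.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    households = block["households"]
    return {
        "MedInc": block["medianIncome"],
        "HouseAge": block["housingMedianAge"],
        "AveRooms": block["totalRooms"] / households,
        "AveBedrms": block["totalBedrooms"] / households,
        "Population": block["population"],
        "AveOccup": block["population"] / households,
        "Latitude": block["latitude"],
        "Longitude": block["longitude"],
        # Target in units of $100,000 instead of US Dollars.
        "MedHouseVal": block["medianHouseValue"] / 100_000,
    }


# Made-up block, chosen so the ratios are easy to check by hand.
example_block = {
    "longitude": -122.0,
    "latitude": 37.5,
    "housingMedianAge": 30.0,
    "totalRooms": 2000.0,
    "totalBedrooms": 440.0,
    "population": 1200.0,
    "households": 400.0,
    "medianIncome": 4.0,
    "medianHouseValue": 250000.0,
}

converted = to_sklearn_style(example_block)
for name, value in converted.items():
    print(f"{name}: {value}")
```

This scaling is why the error metrics computed later are small numbers such as 0.5 rather than dollar amounts: multiply them by 100,000 to read them in US Dollars.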

In [1]:
# Step 3: Train model and compute regression metrics

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Compute metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error (MAE):", round(mae, 3))
print("Mean Squared Error (MSE):", round(mse, 3))
print("Root Mean Squared Error (RMSE):", round(rmse, 3))
print("R² Score:", round(r2, 3))

Mean Absolute Error (MAE): 0.533
Mean Squared Error (MSE): 0.556
Root Mean Squared Error (RMSE): 0.746
R² Score: 0.576


## Interpretation of Metrics

Each metric is listed with its meaning and its interpretation for this model.

**MAE (0.533): Mean Absolute Error**
- Meaning: Average absolute difference between predicted and actual house prices.

- Interpretation: On average, predictions are off by about \$53,000 (since the target is in units of \$100,000).

**MSE (0.556): Mean Squared Error**
- Meaning: Average squared difference; penalizes larger errors more.

- Interpretation: MSE is in squared target units, so it is hard to read on its own. It is noticeably larger than the square of MAE (0.556 vs. about 0.284), which indicates that the errors are uneven: some predictions miss by much more than the typical amount.

**RMSE (0.746): Root Mean Squared Error**
- Meaning: Square root of MSE; same scale as the target variable.

- A lower RMSE indicates a better fit, as it signifies smaller prediction errors. An RMSE of 0 would mean a perfect fit.

- Interpretation: A typical prediction error is about \$74,600. RMSE weights large errors more heavily than MAE does, which is why it is higher than the \$53,000 MAE.

**R² (0.576)**
- Meaning: Proportion of the variance in the target that is explained by the model's input features.

- An R² of 1 means the model perfectly explains all the variability of the response data around its mean.

- Interpretation: The model explains about 58% of the variation in house prices using the given features (income, rooms, etc.), which is decent for a simple linear model. The remaining 42% is due to factors the model does not capture or to randomness.
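
The four metrics can also be reproduced from their definitions. The sketch below uses four made-up pairs of actual and predicted values (in units of \$100,000, like the target) purely to illustrate the arithmetic; the numbers are not taken from the model's output.

```python
import math

# Hypothetical actual and predicted median house values, in units of $100,000.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # back on the target's scale

# R² = 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Multiply by 100,000 to express MAE and RMSE in US Dollars.
print(f"MAE  = {mae:.3f} (about ${mae * 100_000:,.0f})")
print(f"MSE  = {mse:.3f} (squared units, not dollars)")
print(f"RMSE = {rmse:.3f} (about ${rmse * 100_000:,.0f})")
print(f"R²   = {r2:.3f}")
```

In this toy example MAE is 0.5 while RMSE is about 0.612. The gap comes from the single error of 1.0, which squaring weights more heavily than the two errors of 0.5.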

## Compare Evaluation Methods

### Method 1 — Hold-Out Split

This is the evaluation performed above: the data was split once into 80% training data and 20% test data, and all metrics were computed on the test portion.

### Method 2 — k-Fold Cross-Validation (k = 5)

Instead of using one fixed split, k-fold cross-validation divides the dataset into 5 parts (folds). Each fold takes a turn being the “test set,” while the remaining 4 folds are used for training. The results are then averaged across all folds, giving a more reliable estimate of model performance.
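
To make the fold rotation concrete, here is a minimal pure-Python sketch of how k-fold splitting partitions row indices. The helper `k_fold_indices` is an illustrative assumption and not scikit-learn's `KFold` implementation, though it follows the same idea of shuffling once and then giving each fold one turn as the test set.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"Fold {fold}: train on {len(train)} rows, test on {len(test)} rows")
```

Every row index lands in exactly one test fold, so each observation is scored once by a model that never saw it during training.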

In [2]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
import numpy as np

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Compute cross-validated R², MAE, and RMSE scores.
# The error scorers return negated values, so the sign is flipped when printing.
r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
mae_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error')
rmse_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_root_mean_squared_error')

print("Average R²:", round(np.mean(r2_scores), 3))
print("Average MAE:", round(-np.mean(mae_scores), 3))
print("Average RMSE:", round(-np.mean(rmse_scores), 3))


Average R²: 0.601
Average MAE: 0.532
Average RMSE: 0.728


## Reflection

The cross-validation results are very close to those from the hold-out split, and every metric comes out marginally better. Since the model itself is unchanged, these differences reflect a different estimate of its performance rather than a better model. MAE differs by only 0.001: the average error goes from about \$53,300 to about \$53,200, which is negligible.

R² rises from 0.576 to 0.601, so the model explains about 60% of the variation in house prices based on the given inputs. Leaving 40% of the variation unexplained still doesn't bode well.

The largest difference, though still small, is in RMSE: the typical prediction error is about \$72,800, down from \$74,600.

In practice, I'd choose k-Fold Cross-Validation (k=5 or k=10) because:
- It gives a more reliable picture of model performance, since the estimate does not depend on one particular random split.

- It helps detect if the model performs inconsistently across subsets of data.

- It is especially useful when data is limited: every data point gets used for both training and testing at some point.
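
One way to act on the consistency point is to look at the spread of the per-fold scores rather than only their mean. A minimal sketch, assuming the per-fold R² values have already been collected (for instance the `r2_scores` array from the cross-validation cell); the five numbers here are placeholders chosen for illustration, not results from this model:

```python
import statistics

# Placeholder per-fold R² scores; substitute the real values from cross_val_score.
fold_r2 = [0.58, 0.61, 0.60, 0.62, 0.59]

mean_r2 = statistics.mean(fold_r2)
std_r2 = statistics.stdev(fold_r2)  # sample standard deviation across folds

print(f"R² per fold: mean = {mean_r2:.3f}, std = {std_r2:.3f}")
print(f"Range: {min(fold_r2):.2f} to {max(fold_r2):.2f}")
```

A small standard deviation relative to the mean suggests the model behaves similarly on every fold; a large one would point to subsets of the data where it fits poorly.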

However, for very large datasets or early prototyping, a simple hold-out split is often good enough to save computation time.

## Communication Reflection

The goal is to explain the model's performance and evaluation process in plain language so that a retail stakeholder (such as a marketing manager or business director) can decide whether the model is trustworthy and ready for use.

First, I would start with the business context, and explain the purpose:

We built a model to estimate housing prices in California based on characteristics like income levels, house age, and number of rooms. The goal is to predict prices reasonably accurately so we can make better pricing or marketing decisions.

Next, I would translate the metrics into plain English, to give them something tangible and sensible rather than raw statistics:

On average, our model's predictions are off by about \$53,000, and it captures about 60% of the patterns that drive house prices. The remaining 40% reflects factors we don't have in our data, such as renovations, neighborhood quality, or current market trends.

Then, I would explain the model's reliability and how it was checked:

We didn't just test the model once. We used a technique called cross-validation, which checks the model's accuracy across multiple random subsets of data. This helps ensure the results are consistent and not just a lucky outcome from one test.

Lastly, I would interpret this for business use:

Our model performs reasonably well: it captures most of the main factors influencing prices, but it's not perfect. I would recommend using it as a support tool, not as the sole source for pricing decisions. If we collect more detailed features (like neighborhood or school quality), we can likely improve its accuracy.