# Lab: Regression Analysis & Metrics from Scratch

## Objectives
1.  Train a simple **Linear Regression** model using `scikit-learn`.
2.  **Implement regression metrics from scratch** (RMSE, MAE, $R^2$) to understand the math.
3.  **Perform Residual Analysis** to check model assumptions (Normality, Homoscedasticity).


## 1. Setup and Data Loading

We will use the **California Housing dataset**.
* **Input (X):** 8 features including median income, house age, average rooms, etc.
* **Target (y):** Median house value (in $100,000s).

*Note: We start by inspecting the shape of the data and the first few rows.*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Set plotting style
sns.set_style("whitegrid")

# Load the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print(f"Feature shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nFirst 5 rows of features:")
print(X.head())

## 2. Data Splitting and Training

**Task:**
1.  Split `X` and `y` into training and testing sets. Use `test_size=0.2` and `random_state=42`.
2.  Initialize a `LinearRegression` model.
3.  Fit the model on the training data.
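
As a rough illustration of these three steps on data unrelated to the lab, the sketch below uses a small synthetic dataset. The names `X_toy`, `y_toy`, and `toy_model` are made up for this example. Because the data is generated from known coefficients, the fitted values can be compared against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 1 plus small noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows; a fixed random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training rows only, so the held-out rows stay unseen.
toy_model = LinearRegression()
toy_model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)  # (160, 2) (40, 2)
print(toy_model.coef_, toy_model.intercept_)  # close to [3, -2] and 1
```

The same pattern should carry over to the housing data, with `X` and `y` in place of the toy arrays.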

In [None]:
# 1. Split the data
X_train, X_test, y_train, y_test = None, None, None, None  # Replace None

# 2. Initialize model
model = None

# 3. Fit model
# model.fit(...)


In [None]:
# Validation
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

## 3. Predictions

**Task:** Generate predictions for the **Test Set**.
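
For a fitted linear model, a prediction is the intercept plus the dot product of a feature row with the learned coefficients (exposed as `intercept_` and `coef_` in scikit-learn), so `predict` returns one value per input row. The sketch below spells out that computation with made-up coefficients, not values from the housing model.

```python
import numpy as np

# Hypothetical fitted parameters (made up for illustration).
coef = np.array([2.0, -1.0])
intercept = 0.5

# Two new samples with two features each.
X_new = np.array([[1.0, 1.0],
                  [3.0, 0.0]])

# One prediction per row: intercept + sum_j coef[j] * X_new[i, j]
y_hat = X_new @ coef + intercept
print(y_hat)  # [1.5 6.5]
```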


In [None]:

y_pred = None

In [None]:
print(f"First 5 predictions: {y_pred[:5]}")
print(f"First 5 actual values: {y_test[:5]}")

## 4. Regression Metrics (From Scratch)

We will evaluate how close our predictions are to the actual values.

### Definitions

1.  **MAE (Mean Absolute Error):** The average absolute difference between predicted and actual values.
    $$ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$

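2.  **RMSE (Root Mean Squared Error):** The square root of the mean squared error (MSE). Squaring penalizes large errors more heavily than MAE does, and the square root brings the result back to the units of the target.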
    $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
    $$ RMSE = \sqrt{MSE} $$

3.  **$R^2$ Score (Coefficient of Determination):** Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
    $$ SS_{res} = \sum (y_i - \hat{y}_i)^2 \quad (\text{Residual Sum of Squares}) $$
    $$ SS_{tot} = \sum (y_i - \bar{y})^2 \quad (\text{Total Sum of Squares}) $$
    $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
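
As a quick worked check of these formulas (the numbers are made up for illustration), take $y = (3, -0.5, 2, 7)$ and $\hat{y} = (2.5, 0, 2, 8)$. The errors $y_i - \hat{y}_i$ are $(0.5, -0.5, 0, -1)$, and $\bar{y} = 2.875$, so $SS_{res} = 1.5$ and $SS_{tot} = 29.1875$:

$$ MAE = \frac{0.5 + 0.5 + 0 + 1}{4} = 0.5 $$
$$ MSE = \frac{0.25 + 0.25 + 0 + 1}{4} = 0.375, \quad RMSE = \sqrt{0.375} \approx 0.612 $$
$$ R^2 = 1 - \frac{1.5}{29.1875} \approx 0.949 $$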

**Task:** Implement these formulas using `numpy`. Do **not** use `sklearn.metrics`.
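
Two properties of these metrics are easier to see on toy numbers: RMSE reacts more strongly than MAE to a single large error, and a model that always predicts the mean scores $R^2 = 0$. The sketch below is one possible way to write the formulas in `numpy`. The helper names `mae_np`, `rmse_np`, and `r2_np` and the arrays are made up for this example.

```python
import numpy as np

def mae_np(y_true, y_hat):
    # Mean of the absolute errors.
    return np.mean(np.abs(y_true - y_hat))

def rmse_np(y_true, y_hat):
    # Square root of the mean squared error.
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def r2_np(y_true, y_hat):
    # 1 - SS_res / SS_tot, following the definitions above.
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Two sets of predictions with the same total absolute error (4.0):
even_errors = y_true + np.array([1.0, 1.0, 1.0, 1.0])    # four errors of size 1
one_big_error = y_true + np.array([0.0, 0.0, 0.0, 4.0])  # one error of size 4

print(mae_np(y_true, even_errors), rmse_np(y_true, even_errors))      # 1.0 1.0
print(mae_np(y_true, one_big_error), rmse_np(y_true, one_big_error))  # 1.0 2.0

# Predicting the mean everywhere makes SS_res equal SS_tot, so R^2 is 0.
mean_baseline = np.full_like(y_true, y_true.mean())
print(r2_np(y_true, mean_baseline))  # 0.0
```

Both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE.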


In [None]:
n = len(y_test)

# 1. Calculate MAE
mae = 0

# 2. Calculate MSE and RMSE
mse = 0
rmse = 0

# 3. Calculate R2
# Hint: You need the mean of the actual test values (y_mean)
y_mean = np.mean(y_test)
ss_res = 0
ss_tot = 0
r2 = 0

In [None]:
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2:.4f}")

## 5. Residual Analysis

Analyzing the **residuals** (errors) is crucial to check if a linear model is appropriate.

$$ \text{Residuals} = y_{test} - y_{pred} $$

We look for two things:
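1.  **Homoscedasticity:** The residuals should have constant variance, meaning they scatter randomly around 0 with roughly the same spread across all predicted values. A funnel shape means the variance changes with the prediction (heteroscedasticity), which violates this assumption.
2.  **Normality:** The residuals should approximately follow a normal distribution (bell curve) centered at 0.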

**Task:** Calculate the residuals.
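
The funnel-shape check can also be made roughly numeric by comparing the spread of the residuals for low and high predictions. The sketch below does this on synthetic data. The helper `spread_ratio` and the generated arrays are assumptions for illustration, and the median split is an informal heuristic, not a formal statistical test.

```python
import numpy as np

def spread_ratio(pred, resid):
    """Std of residuals for above-median predictions divided by std for the rest."""
    high = pred > np.median(pred)
    return np.std(resid[high]) / np.std(resid[~high])

rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 10.0, size=1000)  # synthetic "predictions"

# Same spread everywhere (homoscedastic).
constant_noise = rng.normal(scale=1.0, size=1000)
# Spread grows with the prediction (funnel shape, heteroscedastic).
funnel_noise = rng.normal(scale=1.0, size=1000) * pred

ratio_constant = spread_ratio(pred, constant_noise)
ratio_funnel = spread_ratio(pred, funnel_noise)

print(f"constant variance: {ratio_constant:.2f}")  # close to 1
print(f"funnel shape:      {ratio_funnel:.2f}")    # well above 1
```

A ratio far from 1 suggests the residual spread depends on the predicted value.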

In [None]:
residuals = None

### Visualization: Residual Analysis

Run the code below to visualize your residuals. You do not need to write code here.

* **Left Plot (Residuals vs. Predicted):** Look for patterns. A random cloud is good. A U-shape implies non-linearity. A funnel implies heteroscedasticity.
* **Right Plot (Residual Distribution):** The histogram should look roughly like a bell curve centered at 0.
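
As a numeric companion to the histogram, skewness and excess kurtosis are both close to 0 for normally distributed values. The sketch below computes them with `numpy` on synthetic samples. The helper `skew_and_excess_kurtosis` and the sample data are made up for this example, and the check is a rough summary, not a formal normality test.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis; both are about 0 for normal data."""
    z = (x - np.mean(x)) / np.std(x)  # standardize to mean 0, std 1
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
bell = rng.normal(size=5000)               # symmetric, bell-shaped
skewed = rng.exponential(size=5000) - 1.0  # long right tail, mean about 0

bell_skew, bell_kurt = skew_and_excess_kurtosis(bell)
skewed_skew, skewed_kurt = skew_and_excess_kurtosis(skewed)

print(f"bell-shaped: skew={bell_skew:.2f}, excess kurtosis={bell_kurt:.2f}")
print(f"right-tailed: skew={skewed_skew:.2f}, excess kurtosis={skewed_kurt:.2f}")
```

Large positive skewness points to a long right tail, which would show up as an asymmetric histogram in the right plot.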

In [None]:
def plot_residual_analysis(y_pred, residuals):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Residuals vs Predicted
    sns.scatterplot(x=y_pred, y=residuals, alpha=0.5, ax=axes[0])
    axes[0].axhline(0, color="red", linestyle="--", lw=2)
    axes[0].set_xlabel("Predicted Values")
    axes[0].set_ylabel("Residuals")
    axes[0].set_title("Residuals vs. Predicted Values")

    # Plot 2: Distribution of Residuals
    sns.histplot(residuals, kde=True, ax=axes[1], color="purple")
    axes[1].set_xlabel("Residual Error")
    axes[1].set_title("Distribution of Residuals")

    plt.tight_layout()
    plt.show()


# Run the plot
if residuals is not None:
    plot_residual_analysis(y_pred, residuals)
else:
    print("Please calculate residuals first.")

### Visualization: Actual vs Predicted

Plotting actual against predicted values is a simple way to see how well the model fits. Ideally, all points lie on the red dashed diagonal, where the prediction equals the actual value.



In [None]:
def plot_actual_vs_predicted(y_test, y_pred):
    plt.figure(figsize=(8, 8))
    sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)

    # Draw the perfect prediction line
    min_val = min(np.min(y_test), np.min(y_pred))
    max_val = max(np.max(y_test), np.max(y_pred))
    plt.plot([min_val, max_val], [min_val, max_val], color="red", linestyle="--", lw=2)

    plt.xlabel("Actual Values")
    plt.ylabel("Predicted Values")
    plt.title("Actual vs. Predicted")
    plt.show()


if y_pred is not None:
    plot_actual_vs_predicted(y_test, y_pred)