# CSIS4260 — Linear Regression: From Fundamentals to Evaluation
#### Lecture 9, Chapters 14 and 15
##### Based on "Data Science from Scratch, 2nd Edition" by Joel Grus

---

## Introduction: Scenario & Motivation

Suppose you’re a data scientist at a fintech startup, tasked with predicting a customer’s annual spending from features such as age and income. Before turning to regression, recall two tools from last class, clustering and Apriori (association rule mining), and why neither is sufficient when the outcome is continuous rather than categorical or binary.

Recall:  
- Clustering is unsupervised: finds groupings, doesn’t predict values.
- Apriori finds associations (e.g., "If {milk, bread}, then {butter}"), works with categorical/basket data, not for forecasting.

Problem:  
When the target is numeric (e.g., salary, price, grade), we need a regression approach.


## Revisiting Previous Methods: Clustering and Apriori

Before we dive into regression, let’s briefly connect to what you’ve learned so far:

- **Clustering**: Unsupervised learning technique for discovering groups or patterns in data. It does not provide a way to predict a numeric target variable.
- **Apriori (Association Rule Mining)**: Used to find associations or frequent patterns in categorical data, such as items frequently purchased together in transactions. These rules help uncover relationships but do not predict continuous outcomes.

Both methods are valuable for understanding structure and relationships in data, but when our goal is to predict or explain a numeric quantity, we need a fundamentally different approach: regression.

---

In this lecture, we will transition from grouping and association-based methods to **regression analysis**—the foundation for predictive modeling with continuous outcomes.


## Linear Regression

### Definition
Linear regression models the relationship between one or more independent variables and a continuous dependent variable. It assumes the relationship is linear: the expected value of the dependent variable is estimated as a linear function of the independent variables.

### Mathematical Formula

**Simple Linear Regression:**

$$y = \beta_0 + \beta_1 x + \epsilon$$

**Multiple Linear Regression:**

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon$$

- $y$: Dependent variable
- $x_i$: Independent variables
- $\beta_0$: Intercept
- $\beta_i$: Coefficients
- $\epsilon$: Error (residual)

### Explaining the Formula

- **Intercept** ($\beta_0$): Value when all $x$ are zero.
- **Coefficients** ($\beta_i$): Change in $y$ per unit increase in $x_i$.
- **Error** ($\epsilon$): Difference between actual and predicted $y$.
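
As a minimal sketch of how the formula turns inputs into a prediction, the function below computes the intercept plus the weighted sum of the features. The name `predict` and the coefficient values are hypothetical, chosen only for illustration; they are not estimated from any dataset in this lecture.

```python
from typing import List

def predict(beta_0: float, betas: List[float], xs: List[float]) -> float:
    """Return beta_0 + beta_1*x_1 + ... + beta_n*x_n (the error term is not modeled)."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per feature")
    return beta_0 + sum(b * x for b, x in zip(betas, xs))

# Hypothetical coefficients for two features (illustration only).
beta_0 = 10.0
betas = [2.0, -0.5]

baseline = predict(beta_0, betas, [3.0, 4.0])  # 10 + 2*3 - 0.5*4
bumped = predict(beta_0, betas, [4.0, 4.0])    # same point with x_1 raised by one unit

print("baseline prediction:", baseline)
print("change after +1 in x_1:", bumped - baseline)
```

Raising $x_1$ by one unit while holding $x_2$ fixed changes the prediction by exactly $\beta_1$, which is the sense in which a coefficient is a "change in $y$ per unit increase in $x_i$".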


## Understanding Linear Regression: Key Concepts & Definitions

**Linear regression** is a statistical technique that models the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictors, features, or explanatory variables). The goal is to fit a line (or in higher dimensions, a hyperplane) that best predicts the outcome based on the inputs.

### Key Terms:

- **Dependent Variable ($y$)**: The outcome you are trying to predict or explain (e.g., house price).

- **Independent Variable ($x$ or $x_i$)**: The input or predictor used to explain changes in the dependent variable (e.g., living area in square feet).

- **Regression Line/Equation**: The mathematical relationship is typically written as:
  $$
  y = \beta_0 + \beta_1 x + \epsilon
  $$
  - $y$: observed value of the dependent variable  
  - $x$: value of the independent variable  
  - $\beta_0$: intercept (the value of $y$ when $x = 0$)  
  - $\beta_1$: slope (how much $y$ changes for a one-unit increase in $x$)  
  - $\epsilon$: error or residual (difference between observed and predicted value)  

- **Slope ($\beta_1$)**: Measures the change in the dependent variable for a one-unit increase in the independent variable. In the equation, it is the coefficient of $x$.

- **Intercept ($\beta_0$)**: The expected value of the dependent variable when all independent variables are zero. Graphically, it is where the regression line crosses the $y$-axis.

- **Residuals ($\epsilon$)**: The difference between the actual observed value and the value predicted by the regression model for each observation. Residuals are used to diagnose the fit and assumptions of the model.

- **Ordinary Least Squares (OLS)**: The most common method to estimate the regression coefficients. OLS chooses the line that minimizes the sum of the squared residuals.

### Main Assumptions of Linear Regression:

1. **Linearity**: The relationship between the independent and dependent variable is linear.
2. **Independence**: The residuals (errors) are independent.
3. **Homoscedasticity**: The residuals have constant variance at all levels of the independent variable.
4. **Normality**: The residuals are normally distributed.

**Why regression?**  
Regression allows us to quantify the relationship between variables, make predictions, and infer which variables are most strongly associated with the outcome.


## Example: Predicting Home Prices

| Living Area (sq ft) | Price ($1000s) |
|---------------------|----------------|
| 1400                | 245            |
| 1600                | 312            |
| 1700                | 279            |
| 1875                | 308            |
| 1100                | 199            |



## Step-by-Step: Manual Linear Regression Calculation

### OLS Formulas:

**Slope $(\beta_1)$:**
$$
\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

**Intercept $(\beta_0)$:**
$$
\beta_0 = \bar{y} - \beta_1\bar{x}
$$
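
The two formulas translate almost line for line into code. The sketch below uses only the standard library; the function name `least_squares_fit` and the sample points are illustrative assumptions, not part of the lecture's datasets.

```python
from statistics import mean
from typing import List, Tuple

def least_squares_fit(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (beta_0, beta_1) for simple linear regression using the OLS formulas."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator of the slope
    s_xx = sum((x - x_bar) ** 2 for x in xs)                       # denominator of the slope
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

# Made-up points that lie exactly on y = 1 + 2x, so the fit should recover that line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = least_squares_fit(xs, ys)
print(f"intercept = {beta_0}, slope = {beta_1}")
```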

---

### Step 1: Calculate the Means

- Mean of living area ($\bar{x}$):
$$
\bar{x} = \frac{1400 + 1600 + 1700 + 1875 + 1100}{5} = \frac{7675}{5} = 1535
$$

- Mean of price ($\bar{y}$):
$$
\bar{y} = \frac{245 + 312 + 279 + 308 + 199}{5} = \frac{1343}{5} = 268.6
$$

---

### Step 2: Calculate $(\beta_1)$

Compute each component separately:

| $(x_i)$ | $(y_i)$ | $(x_i - \bar{x})$ | $(y_i - \bar{y})$ | $((x_i - \bar{x})(y_i - \bar{y}))$ | $((x_i - \bar{x})^2)$ |
|---------|---------|-------------------|-------------------|-------------------------------------|-----------------------|
| 1400    | 245     | -135              | -23.6             | 3186                                | 18225                 |
| 1600    | 312     | 65                | 43.4              | 2821                                | 4225                  |
| 1700    | 279     | 165               | 10.4              | 1716                                | 27225                 |
| 1875    | 308     | 340               | 39.4              | 13396                               | 115600                |
| 1100    | 199     | -435              | -69.6             | 30276                               | 189225                |

**Summations:**

- Sum of $(x_i - \bar{x})(y_i - \bar{y})$:
$$
3186 + 2821 + 1716 + 13396 + 30276 = 51395
$$

- Sum of $(x_i - \bar{x})^2$:
$$
18225 + 4225 + 27225 + 115600 + 189225 = 354500
$$

Calculate the slope $(\beta_1)$:
$$
\beta_1 = \frac{51395}{354500} \approx 0.1450
$$

---

### Step 3: Calculate $(\beta_0)$

Using the calculated slope $(\beta_1)$ before rounding ($\approx 0.14498$):
$$
\beta_0 = \bar{y} - \beta_1\bar{x} = 268.6 - (0.14498 \times 1535) \approx 268.6 - 222.54 = 46.06
$$

---

### Final Regression Equation:

The resulting linear regression equation is:
$$
y = 46.06 + 0.1450x
$$

**Interpretation:**  
This equation predicts home prices (in thousands of dollars) based on the living area (in square feet). For each additional square foot, the price is expected to increase by approximately \$145.
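
A hand calculation like this is easy to get wrong in one of the sums, so it is worth recomputing numerically. The sketch below takes the five rows from the home-price table and prints the means, both sums, the slope, and the intercept so each step can be compared against the worked values.

```python
from statistics import mean

# Data from the home-price table: living area (sq ft) and price ($1000s).
areas = [1400, 1600, 1700, 1875, 1100]
prices = [245, 312, 279, 308, 199]

x_bar, y_bar = mean(areas), mean(prices)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
s_xx = sum((x - x_bar) ** 2 for x in areas)

beta_1 = s_xy / s_xx
beta_0 = y_bar - beta_1 * x_bar

print(f"x_bar = {x_bar}, y_bar = {y_bar}")
print(f"sum of cross-products = {s_xy:.0f}, sum of squares = {s_xx:.0f}")
print(f"beta_1 = {beta_1:.4f}, beta_0 = {beta_0:.2f}")
```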


## Book Example: Step-by-Step Calculation

Suppose we have this dataset (from the book):

| $x$ | $y$ |
|-----|-----|
| 1   | 2   |
| 2   | 4   |
| 3   | 5   |
| 4   | 4   |
| 5   | 5   |

We want to fit a line $y = \beta_0 + \beta_1 x$.

### Step 1: Compute the means
- $\bar{x} = (1+2+3+4+5) / 5 = 3$
- $\bar{y} = (2+4+5+4+5) / 5 = 4$

### Step 2: Compute the slope ($\beta_1$)
$$
\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

Calculate each term:
- $(1-3)(2-4) = (-2) \times (-2) = 4$
- $(2-3)(4-4) = (-1) \times (0) = 0$
- $(3-3)(5-4) = 0 \times 1 = 0$
- $(4-3)(4-4) = 1 \times 0 = 0$
- $(5-3)(5-4) = 2 \times 1 = 2$

Sum: $4 + 0 + 0 + 0 + 2 = 6$

Denominator:
- $(1-3)^2 = 4$
- $(2-3)^2 = 1$
- $(3-3)^2 = 0$
- $(4-3)^2 = 1$
- $(5-3)^2 = 4$

Sum: $4 + 1 + 0 + 1 + 4 = 10$

So, $\beta_1 = 6 / 10 = 0.6$

### Step 3: Compute the intercept ($\beta_0$)
$$
\beta_0 = \bar{y} - \beta_1 \bar{x} = 4 - 0.6 \times 3 = 2.2
$$

**Final regression equation:**  
$$
y = 2.2 + 0.6x
$$

---

This fitted line describes the best linear relationship between $x$ and $y$ in the dataset above.
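
"Best" here means smallest sum of squared residuals. As a rough check rather than a proof, the sketch below computes that sum for the fitted line and for a few lines obtained by nudging the intercept and slope; the nudge sizes are arbitrary choices.

```python
def sse(beta_0, beta_1, xs, ys):
    """Sum of squared residuals for the line y = beta_0 + beta_1 * x."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sse(2.2, 0.6, xs, ys)  # the fitted line y = 2.2 + 0.6x

# Nudge the intercept and slope in each direction; none of these lines should fit better.
nudged = [
    sse(2.2 + d0, 0.6 + d1, xs, ys)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.05, 0.0, 0.05)
    if (d0, d1) != (0.0, 0.0)
]

print(f"SSE of fitted line: {best:.4f}")
print(f"smallest SSE among nudged lines: {min(nudged):.4f}")
```

Every nudged line has a larger sum of squared residuals than the fitted one, which is what OLS guarantees for any other choice of intercept and slope.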

## Step-by-Step Linear Regression with sklearn Dataset

We will use the California Housing dataset, fit a linear regression model, show predictions, and plot the residuals.

---

### 1. Load the dataset


In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

# Load California housing data
data = fetch_california_housing(as_frame=True)
X = data.data  # Use all variables
y = data.target  # Median house value in $100,000s

print("Variables:", list(X.columns))
df = pd.concat([X, y], axis=1)
df.head()


In [None]:
print(data.DESCR)

In [None]:
import plotly.figure_factory as ff
import plotly.colors

num_cols = df.select_dtypes(include='number').columns
colors = plotly.colors.qualitative.Plotly  # 10 unique, cycling if >10 vars

for i, col in enumerate(num_cols):
    # Use a separate name so the dataset object `data` loaded earlier is not overwritten
    hist_data = [df[col].dropna()]
    group_labels = [col]
    color = colors[i % len(colors)]
    fig = ff.create_distplot(
        hist_data, group_labels,
        bin_size=(df[col].max() - df[col].min()) / 20,
        show_hist=True, show_curve=True, show_rug=False,
        colors=[color]
    )
    fig.update_layout(
        title_text=f'Histogram and KDE for {col}',
        template='plotly_white',
        xaxis_title=col,
        yaxis_title="Density",
        width=700,
        height=400,
        title_x=0.5
    )
    fig.update_traces(marker_line_color='black', marker_line_width=1)
    fig.show()


### Understanding the Value of the KDE (Kernel Density Estimate) Line

While histograms show the frequency distribution of values in discrete bins, the **KDE (Kernel Density Estimate) line** provides a smoothed, continuous approximation of the data’s underlying probability density function. The KDE helps in several ways:

- **Smooths Out Noise:** Unlike histograms, which depend on bin edges and can appear jagged, the KDE line shows the overall shape of the distribution without depending on bin width or bin placement. It does depend on its own smoothing parameter, the bandwidth.
- **Reveals Structure:** It can make it easier to see if the data is unimodal, bimodal, or skewed—helping to identify outliers or multiple populations in the data.
- **Better Comparison:** When comparing multiple distributions, KDE lines provide a clearer, direct comparison of density and shape than overlapping histograms.

In summary, adding a KDE line to a histogram makes your data visualization more informative, especially when diagnosing patterns or anomalies in numeric data.
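
To make the idea concrete, here is a minimal from-scratch sketch of a Gaussian KDE: one bell-shaped bump is centered on each observation and the bumps are averaged. The sample values are made up, and the bandwidth uses the normal-reference rule of thumb $1.06\,\sigma\,n^{-1/5}$, which is an assumption for illustration and not necessarily what plotly uses internally.

```python
import math
from statistics import stdev
from typing import Callable, List

def gaussian_kde(sample: List[float], bandwidth: float) -> Callable[[float], float]:
    """Return a function estimating the density at x by averaging one Gaussian bump per observation."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return density

# Small made-up sample with two clumps (illustration only).
sample = [1.0, 1.2, 1.3, 1.9, 2.1, 5.8, 6.0, 6.1, 6.5]

# Rule-of-thumb bandwidth (assumed default for this sketch).
h = 1.06 * stdev(sample) * len(sample) ** (-1 / 5)
density = gaussian_kde(sample, h)

# Riemann-sum check on a wide grid: a density should integrate to about 1.
step = 0.01
grid = [-15 + i * step for i in range(3501)]
area = sum(density(x) for x in grid) * step

print(f"bandwidth = {h:.3f}")
print(f"area under the curve = {area:.3f}")
```

A smaller bandwidth makes the curve follow individual points more closely; a larger one smooths the two clumps together.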


In [None]:
import plotly.express as px

# Compute the correlation matrix
corr = df.corr(numeric_only=True)

fig = px.imshow(
    corr,
    text_auto=".2f",  # Automatically annotate with correlation value
    color_continuous_scale='Viridis',
    aspect='auto',
    labels=dict(x="Variable", y="Variable", color="Correlation"),
    title="Correlation Heatmap (Interactive)"
)

# Enhance layout and interactivity
fig.update_layout(
    width=900,
    height=900,
    title_x=0.5,
    font=dict(size=16),
    xaxis=dict(tickangle=45),
)

# Customize the hover text and draw the cell annotations in white (readable on the dark end of the color scale)
fig.update_traces(
    hovertemplate="<b>%{x}</b> vs <b>%{y}</b><br>Correlation: %{z:.2f}<extra></extra>",
    textfont_size=16,
    textfont_color="white"
)

fig.show()


In [None]:
from sklearn.linear_model import LinearRegression

# Fit linear regression model with all features
model = LinearRegression()
model.fit(X, y)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_j):")
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.4f}")


### Understanding the Slope and Intercept in Multiple Regression

In a multiple regression model, the prediction equation takes the form:

$$
\text{MedHouseVal} = \beta_0 + \beta_1 \cdot \text{MedInc} + \beta_2 \cdot \text{HouseAge} + \cdots + \beta_8 \cdot \text{Longitude}
$$

**Intercept ($\beta_0$):**
- The intercept is the predicted value of the target variable ($y$) when all features ($x_1, x_2, ..., x_n$) are set to zero.
- In our California housing model:
  - $\beta_0 = -36.94$
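- Interpretation: If all input features were zero, the model would predict a median house value of $-36.94 \times \$100,000$ (about $-\$3.69$ million). No block group has all features at zero (latitude 0 and longitude 0 are nowhere near California), so this value is an extrapolation: it positions the regression hyperplane rather than describing a real home.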

**Slope ($\beta_j$ for each variable):**

Each coefficient ($\beta_j$) represents the **change in the target ($y$)** for a **one-unit increase** in that feature, *holding all other features constant*.

For example, in our results:

- $0.4367$ for **MedInc**: For each one-unit increase in median income (one unit is \$10,000), the predicted median house value increases by $0.4367 \times \$100,000$ (\$43,670), keeping all other variables constant.
- $0.0094$ for **HouseAge**: For each extra year of median house age, the predicted value increases by about \$940.
- $-0.1073$ for **AveRooms**: For each additional average room per household, the predicted value decreases by about \$10,730.
- $0.6451$ for **AveBedrms**: For each additional average bedroom per household, the predicted value increases by about \$64,510.
- $-0.0000$ for **Population**: For each extra person in the block group, the effect on the predicted value is negligible (the coefficient rounds to zero at four decimals).
- $-0.0038$ for **AveOccup**: For each additional person per household, the predicted value decreases by about \$380.
- $-0.4213$ for **Latitude**: For each degree further north, the predicted value decreases by about \$42,130.
- $-0.4345$ for **Longitude**: For each degree further east, the predicted value decreases by about \$43,450.


**Why the Means Matter:**
- The regression line (or hyperplane) is positioned based on the means of $x$ and $y$. For an OLS fit with an intercept, the prediction at the mean values of all predictors equals the mean of $y$ (the average house value in the dataset), up to floating-point rounding.
- The intercept itself is often not meaningful in real-world terms if no homes have all features at zero, but it is essential for accurate predictions and for anchoring the regression model.
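
The role of the means is easiest to see with one feature, where $\beta_0 = \bar{y} - \beta_1\bar{x}$ forces the line through $(\bar{x}, \bar{y})$. The sketch below checks this on the small home-price table from earlier rather than on the California data, which is a simplification; with an intercept in the model, the same identity holds for the multiple-regression hyperplane.

```python
from statistics import mean

areas = [1400, 1600, 1700, 1875, 1100]   # living area (sq ft)
prices = [245, 312, 279, 308, 199]       # price ($1000s)

x_bar, y_bar = mean(areas), mean(prices)
beta_1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
          / sum((x - x_bar) ** 2 for x in areas))
beta_0 = y_bar - beta_1 * x_bar

prediction_at_mean = beta_0 + beta_1 * x_bar
residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(areas, prices)]

print(f"prediction at x_bar: {prediction_at_mean:.4f}")
print(f"y_bar:               {y_bar:.4f}")
print(f"mean residual:       {mean(residuals):.2e}")
```

The mean residual is zero up to rounding, which is the same fact seen from the other side: the errors above and below the line cancel out.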

**Summary Table from the Model:**

| Feature    | Coefficient ($\beta_j$) | Interpretation Example |
|------------|-------------------------|------------------------|
| Intercept  | -36.94                  | Baseline prediction when all features are zero |
| MedInc     | 0.4367                  | +1 unit MedInc (\$10,000) $\Rightarrow$ +\$43,670 |
| HouseAge   | 0.0094                  | +1 year $\Rightarrow$ +\$940 |
| AveRooms   | -0.1073                 | +1 room $\Rightarrow$ -\$10,730 |
| AveBedrms  | 0.6451                  | +1 bedroom $\Rightarrow$ +\$64,510 |
| Population | -0.0000                 | +1 person $\Rightarrow$ negligible effect |
| AveOccup   | -0.0038                 | +1 person/household $\Rightarrow$ -\$380 |
| Latitude   | -0.4213                 | +1 degree north $\Rightarrow$ -\$42,130 |
| Longitude  | -0.4345                 | +1 degree east $\Rightarrow$ -\$43,450 |

---

In multiple regression, **each slope quantifies the unique contribution of that variable to the prediction, controlling for all others**. The intercept is a mathematical anchor; the means show where your model is centered in the feature space.
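
Converting a coefficient into dollars is a single multiplication by the target unit (\$100,000). The sketch below applies it to the rounded coefficients reported in this section; the dictionary is typed in by hand from those values rather than read from the fitted model, so treat it as an illustration.

```python
# Rounded coefficients as reported above (target is in units of $100,000).
coefficients = {
    "MedInc": 0.4367,
    "HouseAge": 0.0094,
    "AveRooms": -0.1073,
    "AveBedrms": 0.6451,
    "Population": -0.0000,  # rounds to zero at four decimals
    "AveOccup": -0.0038,
    "Latitude": -0.4213,
    "Longitude": -0.4345,
}
TARGET_UNIT = 100_000  # dollars per unit of MedHouseVal

for name, beta in coefficients.items():
    dollars = beta * TARGET_UNIT
    if dollars > 0:
        direction = "increase"
    elif dollars < 0:
        direction = "decrease"
    else:
        direction = "no change"
    print(f"{name:<10} +1 unit -> {direction} of ${abs(dollars):,.0f}")
```

Reporting the direction separately from the magnitude avoids phrases such as "decreases by a negative amount".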


In [None]:
# Predict the target values
y_pred = model.predict(X)

# Create a DataFrame with all features, actual, and predicted values
results_df = X.copy()
results_df['Actual'] = y
results_df['Predicted'] = y_pred

# Show the first 10 rows
results_df.head(10)


In [None]:
import plotly.express as px

# Compute residuals
results_df['Residual'] = results_df['Actual'] - results_df['Predicted']

# Interactive residuals scatter plot
fig = px.scatter(
    results_df, 
    x='Predicted', y='Residual',
    hover_data=results_df.columns,
    color='Residual',
    color_continuous_scale='RdBu',
    title='Residuals vs. Predicted Values',
    labels={
        'Predicted': 'Predicted Median House Value ($100,000s)',
        'Residual': 'Residual (Actual - Predicted)'
    },
    width=800,
    height=500
)
fig.add_hline(y=0, line_dash="dash", line_color="black")
fig.update_traces(marker=dict(size=7, line=dict(width=1, color='black')))
fig.update_layout(
    title_x=0.5,
    font=dict(size=16),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True)
)
fig.show()


In [None]:
import plotly.graph_objects as go

# You can pick the two most important features for X and Y axes
x_feature = 'MedInc'
y_feature = 'AveRooms'

fig = go.Figure(data=[go.Scatter3d(
    x=results_df[x_feature],
    y=results_df[y_feature],
    z=results_df['Actual'],
    mode='markers',
    marker=dict(
        size=4,
        color=results_df['Residual'],         # Color by residual value
        colorscale='RdBu',
        colorbar=dict(title="Residual"),
        opacity=0.85,
        line=dict(width=0.5, color='black')
    ),
    text=[f"{x_feature}: {x:.2f}<br>{y_feature}: {y:.2f}<br>Actual: {a:.2f}<br>Predicted: {p:.2f}<br>Residual: {r:.2f}"
          for x, y, a, p, r in zip(
              results_df[x_feature], 
              results_df[y_feature], 
              results_df['Actual'], 
              results_df['Predicted'], 
              results_df['Residual']
          )],
    hoverinfo='text'
)])

fig.update_layout(
    scene = dict(
        xaxis_title=x_feature,
        yaxis_title=y_feature,
        zaxis_title='Actual Median House Value',
    ),
    title=f"3D Scatter: {x_feature}, {y_feature}, Actual Value (Colored by Residual)",
    width=900,
    height=700,
    title_x=0.5,
    margin=dict(l=0, r=0, b=0, t=60)
)
fig.show()


### Model Evaluation Metrics: How to Assess Your Regression Model

No single metric summarizes the performance of a regression model. Each metric below measures the errors in a different way, so they are best read together.

---

#### **1. Sum of Squared Errors (SSE)**

**Definition:**  
The sum of the squares of the residuals (errors) between actual ($y_i$) and predicted ($\hat{y}_i$) values.

**Equation:**  
$$
\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

**How to Calculate:**  
- For every data point, compute the difference between actual and predicted value.
- Square each difference.
- Sum all squared differences.

**When to Use:**  
- Useful for measuring the total error of the model.
- Not comparable across datasets of different sizes (grows with $n$).

---

#### **2. Mean Squared Error (MSE)**

**Definition:**  
The average of squared errors across all data points.

**Equation:**  
$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

**How to Calculate:**  
- Compute SSE.
- Divide by the number of observations $n$.

**When to Use:**  
- Standard for comparing regression models on the same dataset.
- Penalizes large errors more than small ones due to squaring.
- Same unit as the square of the target variable.

---

#### **3. Root Mean Squared Error (RMSE)**

**Definition:**  
The square root of MSE, bringing the error measure back to the original unit of $y$.

**Equation:**  
$$
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$

**How to Calculate:**  
- Take the square root of the MSE.

**When to Use:**  
- Directly interpretable in the same units as the target variable.
- Useful for comparing model error to the scale of $y$.
- Sensitive to outliers.
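
The first three metrics build on one another, which the following from-scratch sketch makes explicit. It reuses the five-point book example and its fitted line $y = 2.2 + 0.6x$ from earlier in the lecture.

```python
import math

# Book example data and the fitted line y = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predictions = [2.2 + 0.6 * x for x in xs]

residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]
sse = sum(r ** 2 for r in residuals)   # total squared error
mse = sse / len(ys)                    # average squared error
rmse = math.sqrt(mse)                  # back in the units of y

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.4f}")
```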

---

#### **4. Mean Absolute Error (MAE)**

**Definition:**  
The average of the absolute differences between actual and predicted values.

**Equation:**  
$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$

**How to Calculate:**  
- Compute the absolute value of each residual.
- Take the average over all data points.

**When to Use:**  
- Less sensitive to large outliers than MSE/RMSE.
- Useful when you care about average magnitude of errors, not their direction.
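
The difference in outlier sensitivity is easy to see on a toy example. The residuals below are made up for illustration: the second list is identical to the first except that one error is replaced by a much larger one.

```python
import math

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up residuals (illustration only).
typical = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]

for label, errors in (("typical", typical), ("with outlier", with_outlier)):
    print(f"{label:<13} MAE = {mae(errors):.2f}, RMSE = {rmse(errors):.2f}")
```

Both metrics grow when the outlier is introduced, but RMSE grows more because the large error is squared before averaging.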

---

#### **5. R-squared ($R^2$) — Coefficient of Determination**

**Definition:**  
The proportion of the variance in the dependent variable explained by the regression model.

**Equation:**  
$$
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
$$

Where:
- $\mathrm{SSE}$ is the sum of squared errors as above.
- $\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares, measuring total variance in $y$.

**How to Calculate:**  
- Compute SSE and SST.
- Take $1 - (\mathrm{SSE}/\mathrm{SST})$.

**When to Use:**  
- For an OLS fit with an intercept, $R^2$ on the training data lies between $0$ and $1$; on new data it can be negative when the model predicts worse than always guessing the mean.
- Closer to $1$ means a better fit.
- Use for assessing model’s explanatory power.
- Be careful: A high $R^2$ does not guarantee a good model (overfitting, omitted variable bias, etc.).
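
The sketch below computes $R^2$ from scratch on the five-point book example for three sets of predictions: the fitted line, a baseline that always predicts the mean, and a deliberately poor line. The poor line ($\hat{y} = 6 - x$) is a hypothetical choice used only to show how a negative value arises.

```python
from statistics import mean

def r_squared(y_true, y_pred):
    """1 - SSE/SST, where SST is the squared error of always predicting the mean."""
    y_bar = mean(y_true)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

fitted = [2.2 + 0.6 * x for x in xs]   # OLS line from the book example
mean_only = [mean(ys)] * len(ys)       # baseline: always predict the mean
bad = [6 - x for x in xs]              # hypothetical line sloping the wrong way

print(f"fitted line: {r_squared(ys, fitted):.2f}")
print(f"mean only:   {r_squared(ys, mean_only):.2f}")
print(f"bad line:    {r_squared(ys, bad):.2f}")
```

The mean-only baseline scores exactly zero, so $R^2$ can be read as the improvement over that baseline.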

---

### **How to Choose the Right Metric**

- **MAE**: Use for interpretability and when outliers are not as important.
- **MSE/RMSE**: Use when large errors should be penalized heavily. RMSE is in the units of $y$; MSE is in squared units.
- **SSE**: Diagnostic, but best for comparing fits on the same dataset.
- **$R^2$**: Use for a quick sense of explained variance, but always in conjunction with error metrics.

**Key Practice:**  
- Always report multiple metrics.
- Interpret them in the context of your data’s scale, distribution, and the problem at hand.


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Actual and predicted values
y_true = results_df['Actual']
y_pred = results_df['Predicted']

# Sum of Squared Errors (SSE)
sse = np.sum((y_true - y_pred) ** 2)

# Mean Squared Error (MSE)
mse = mean_squared_error(y_true, y_pred)

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_true, y_pred)

# Total Sum of Squares (SST)
sst = np.sum((y_true - y_true.mean()) ** 2)

# R-squared
r2 = r2_score(y_true, y_pred)

print(f"Sum of Squared Errors (SSE): {sse:.2f}")
print(f"Total Sum of Squares (SST): {sst:.2f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R-squared (R^2): {r2:.4f}")


### Interpretation of Regression Metrics

Based on your model's output:

- **Sum of Squared Errors (SSE): 10,821.99**
  - This represents the total squared deviation of the predicted values from the actual median house values.
  - While the absolute number is dataset-dependent, a lower SSE on the same data always indicates a closer fit. Here, the predictions leave about 10,822 squared units of error.

- **Total Sum of Squares (SST): 27,483.20**
  - This is the total variance present in the actual data, i.e., the sum of squared differences between the observed values and the mean of the observed values.
  - SST equals the SSE of a baseline model that always predicts the mean, so it is the benchmark against which the model's SSE is compared.

- **Mean Squared Error (MSE): 0.5243**
  - The average squared error per prediction.
  - MSE is expressed in squared target units. The target is in \$100,000s, so $0.5243$ corresponds to about $5.2 \times 10^{9}$ squared dollars; it cannot be read directly as a dollar amount. Use RMSE or MAE for that.

- **Root Mean Squared Error (RMSE): 0.7241**
  - The square root of MSE, bringing error units back to the original scale.
  - The typical magnitude of a prediction error is roughly $0.7241$ (\$72,410).
  - This gives a sense of the typical prediction error: predicted median house values often miss the true value by an amount on the order of \$72,410, and some miss by considerably more.

- **Mean Absolute Error (MAE): 0.5312**
  - The mean of the absolute differences between predicted and actual values.
  - On average, your predictions are off by about $0.5312$ (\$53,120) per block group.
  - MAE is less sensitive to outliers than MSE/RMSE and offers a straightforward, intuitive measure of typical error.

- **R-squared ($R^2$): 0.6062**
  - This means that approximately 60.6% of the variance in median house value across California is explained by your linear regression model.
  - In other words, your features capture a substantial but not exhaustive amount of the underlying patterns in the housing prices. There’s still ~39.4% of the variance that your model does not explain, potentially due to non-linearities, omitted features, or inherent randomness.
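
The unit conversions behind these interpretations can be reproduced with a few lines of arithmetic. The sketch below starts from the rounded values reported above, so its results can differ from the notebook output in the last decimal place.

```python
import math

# Metric values as reported above (target unit: $100,000).
sse, sst = 10_821.99, 27_483.20
mse, mae = 0.5243, 0.5312
UNIT = 100_000

rmse = math.sqrt(mse)
r2 = 1 - sse / sst

print(f"RMSE = {rmse:.4f} units   -> ${rmse * UNIT:,.0f}")
print(f"MAE  = {mae:.4f} units   -> ${mae * UNIT:,.0f}")
print(f"MSE  = {mse:.4f} units^2 -> {mse * UNIT ** 2:.3e} squared dollars")
print(f"R^2 from SSE and SST = {r2:.4f}")
```

Only RMSE and MAE convert to dollars by multiplying by the unit; MSE has to be multiplied by the unit squared.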

---

#### **Summary and Recommendations:**
- **Model Quality:** Your model does a reasonable job: it explains about 61% of the variance and has a mean absolute error of about \$53,000. This is a solid start for a purely linear approach with default features.
- **Diagnostic Use:** 
  - *RMSE* and *MAE* should always be interpreted relative to the natural variability (SST) and the range of the target variable.
  - If the average house value is, for example, $2.07$ (\$207,000), an RMSE of $0.72$ (\$72,410) is about a third of it, which suggests there is still significant room for improvement.
- **Next Steps:**
  - Investigate residuals for patterns (non-linearity, heteroscedasticity).
  - Try more complex models (e.g., polynomial regression, decision trees, random forests).
  - Feature engineering: add or transform variables, or include location-based features with higher resolution.
  - Cross-validate to ensure generalizability to unseen data.

**Bottom line:**  
This model is interpretable and captures much of the important variation in California housing prices, but there’s scope to reduce error further by refining the model or incorporating richer data.


### Next Steps: Beyond Basic Regression

So far, we have:
- Built and evaluated a linear regression model using all available data.

However, two important topics remain for building robust, real-world predictive models:

---

**1. Train/Test Split**

In practice, we want to assess how well our model generalizes to new, unseen data.  
To do this, we divide our data into:
- **Training set:** Used to fit the model.
- **Test set:** Used only to evaluate model performance.

This helps us detect overfitting (when a model learns the training data too well but performs poorly on new data).
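
A from-scratch sketch of the idea, in the spirit of the book's pure-Python approach: shuffle the rows once, then cut the shuffled list at the requested fraction. The function name `split_data`, its signature, and the seed are assumptions made for this example; `train_test_split` in scikit-learn serves the same purpose with more options.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")

def split_data(rows: List[T], test_fraction: float, seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]                         # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))                        # stand-in for 100 observations
train, test = split_data(rows, test_fraction=0.2)

print("train size:", len(train), "test size:", len(test))
print("no overlap:", set(train).isdisjoint(test))
```

Shuffling before cutting matters when the rows are stored in a meaningful order, for example sorted by location or date.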

---

**2. Regularization**

With multiple predictors, models can become too flexible and start fitting noise rather than true relationships—a problem known as overfitting.  
**Regularization** adds a penalty to the model for having large or complex coefficients, encouraging simpler, more generalizable solutions.

Two common types:
- **Ridge Regression (L2):** Penalizes the sum of squared coefficients.
- **Lasso Regression (L1):** Penalizes the sum of absolute coefficients and can set some coefficients to zero (feature selection).
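
With a single feature, both penalties have closed-form solutions, which makes the difference between them visible. The sketch below assumes the objectives $\mathrm{SSE} + \alpha\beta_1^2$ (ridge) and $\mathrm{SSE} + \alpha|\beta_1|$ (lasso) with an unpenalized intercept. This scaling is a choice made for the example: scikit-learn's `Lasso` divides the squared-error term by $2n$, so the same `alpha` is not directly comparable.

```python
from statistics import mean

def shrunken_slopes(xs, ys, alpha):
    """Return (ols, ridge, lasso) slopes for one feature with an unpenalized intercept.

    Assumed objectives: SSE + alpha * beta^2 (ridge) and SSE + alpha * |beta| (lasso).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)

    ols = s_xy / s_xx
    ridge = s_xy / (s_xx + alpha)                        # shrinks the slope but leaves it nonzero
    magnitude = max(abs(s_xy) - alpha / 2, 0.0) / s_xx   # soft-thresholding
    lasso = magnitude if s_xy >= 0 else -magnitude       # can be exactly zero
    return ols, ridge, lasso

# Book example data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

for alpha in (0.0, 1.0, 12.0):
    ols, ridge, lasso = shrunken_slopes(xs, ys, alpha)
    print(f"alpha = {alpha:>4}: OLS = {ols:.3f}, ridge = {ridge:.3f}, lasso = {lasso:.3f}")
```

As `alpha` grows, the ridge slope keeps shrinking without reaching zero, while the lasso slope hits exactly zero once the penalty is large enough. That is the mechanism behind lasso's feature selection.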

---

**In the next steps, we will:**
1. Split the data into training and test sets.
2. Fit and evaluate models with and without regularization (ridge and lasso).
3. Compare the results to see how regularization helps prevent overfitting and improves generalization.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import numpy as np

# 1. Train/Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Fit Models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1)
}

results = []

# Use a loop variable other than `model` so the full-data model fitted earlier is not overwritten
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)  # compute once, reuse for RMSE
    results.append({
        'Model': name,
        'SSE': np.sum((y_test - y_pred)**2),
        'MSE': test_mse,
        'RMSE': np.sqrt(test_mse),
        'MAE': mean_absolute_error(y_test, y_pred),
        'R^2': r2_score(y_test, y_pred)
    })

# 3. Results DataFrame (named metrics_df so the per-row results_df built earlier stays available)
metrics_df = pd.DataFrame(results)
metrics_df = metrics_df[['Model', 'SSE', 'MSE', 'RMSE', 'MAE', 'R^2']]
display(metrics_df)


In [None]:
# Make predictions for all models on the test set
y_pred_lin = models['Linear Regression'].predict(X_test)
y_pred_ridge = models['Ridge Regression'].predict(X_test)
y_pred_lasso = models['Lasso Regression'].predict(X_test)

# Build a DataFrame with all features, actual values, and all model predictions
compare_df = X_test.copy()
compare_df['Actual'] = y_test
compare_df['Linear_Pred'] = y_pred_lin
compare_df['Ridge_Pred'] = y_pred_ridge
compare_df['Lasso_Pred'] = y_pred_lasso

# Show the first 10 rows for inspection
compare_df.head(10)


In [None]:
# Prompt user for input with instructions on data type

print("Please enter values for the following features:")

medinc = float(input("Median Income (MedInc, e.g. 4.5) [float]: "))
houseage = int(input("House Age (HouseAge, e.g. 30) [int]: "))
averooms = float(input("Average Rooms (AveRooms, e.g. 6.0) [float]: "))
avebedrms = float(input("Average Bedrooms (AveBedrms, e.g. 1.0) [float]: "))
population = int(input("Population (e.g. 1000) [int]: "))
aveoccup = float(input("Average Occupancy (AveOccup, e.g. 3.0) [float]: "))
latitude = float(input("Latitude (e.g. 34.0) [float]: "))
longitude = float(input("Longitude (e.g. -118.0) [float]: "))

# Assemble input into a DataFrame
user_sample = pd.DataFrame([{
    'MedInc': medinc,
    'HouseAge': houseage,
    'AveRooms': averooms,
    'AveBedrms': avebedrms,
    'Population': population,
    'AveOccup': aveoccup,
    'Latitude': latitude,
    'Longitude': longitude
}])

# Predict with all trained models
lin_pred = models['Linear Regression'].predict(user_sample)[0]
ridge_pred = models['Ridge Regression'].predict(user_sample)[0]
lasso_pred = models['Lasso Regression'].predict(user_sample)[0]

print(f"\nPredicted Median House Value (Linear Regression): ${lin_pred * 100000:,.2f}")
print(f"Predicted Median House Value (Ridge Regression):  ${ridge_pred * 100000:,.2f}")
print(f"Predicted Median House Value (Lasso Regression):  ${lasso_pred * 100000:,.2f}")


### Final Interpretation and Reflection

#### **Comparing Model Performance**

Based on the test set results:

| Model              | SSE      | MSE     | RMSE    | MAE     | $R^2$   |
|--------------------|----------|---------|---------|---------|---------|
| Linear Regression  | 2294.72  | 0.5559  | 0.7456  | 0.5332  | 0.5758  |
| Ridge Regression   | 2294.36  | 0.5558  | 0.7455  | 0.5332  | 0.5759  |
| Lasso Regression   | 2532.58  | 0.6135  | 0.7833  | 0.5816  | 0.5318  |

- **Linear and Ridge regression** perform almost identically here. Ridge's $R^2$ and error metrics are better only in the fourth decimal place, so at `alpha=1.0` the penalty has almost no effect on this model.
- **Lasso regression** produces a higher error and lower $R^2$. At the chosen penalty (`alpha=0.1`), shrinking coefficients toward zero, and possibly some to exactly zero, removes more signal than noise for this dataset. The features were also not standardized before fitting, so the penalty does not treat all coefficients equally. Lasso is most useful when strong feature selection is needed (a sparse model), which may not be the case here.
- Linear and Ridge explain roughly **58% of the variance** in California house prices in the test set, and Lasso about **53%**. This leaves about 42–47% unexplained, pointing to possible non-linear patterns, omitted variables, or noise.

---

#### **Interpreting Individual Predictions**

Sample rows show that even with the best model, individual predictions can still be far from the actual value, especially at the extremes or for outlier data points. For example, the first test example had an actual value of $0.477$ ($\$47,700$) but the predictions ranged from $0.719$ ($\$71,900$) to $1.046$ ($\$104,600$).

---

#### **Reference to the Book**

As presented in *Data Science from Scratch (Chapters 14–15)*, the core workflow of regression modeling includes:
- Splitting data to assess generalization,
- Fitting models with and without regularization,
- Interpreting both coefficients and prediction errors,
- Using evaluation metrics ($R^2$, MAE, MSE, RMSE) to understand both fit and predictive quality,
- Recognizing limitations of linear models and when more flexible or sophisticated approaches may be necessary.

---

#### **Something to Think About**

> **Are we reaching the limit of what a linear model can do with these features?**
>
> The gap between actual and predicted values, and the $R^2$ ceiling, suggest opportunities for:
> - Adding new, more informative features (feature engineering)
> - Using non-linear models (decision trees, ensembles, or neural nets)
> - Further exploring regularization and hyperparameter tuning

**Reflect:** What real-world processes or data limitations could account for the unexplained variance in house prices?  
What trade-offs exist between interpretability and predictive power in model selection?

---

**For next time:** Consider how you would approach this problem if model *explainability* was crucial (for a policymaker or a homeowner) versus if you only cared about raw predictive performance.
