<a href="https://colab.research.google.com/github/financieras/math_for_ai/blob/main/articulos/linear_regression_from_scratch_in_python_part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Linear Regression from Scratch in Python. Part 2/2**

#### **Evaluation, Metrics & Real-World Use**

You already know how to implement linear regression using least squares. But a model that makes predictions isn't useful if you don't know **how good those predictions are** and **when it's appropriate to use it**.

In this article, you'll learn to evaluate your model, interpret its results, and decide when to use this method. By the end, you'll understand the key metrics, the method's limitations, and when to consider alternatives.

**Note**: This article picks up where the previous one left off. If you haven't implemented linear regression using least squares yet, start there to get the most out of this content.

---

# **1. Initial Setup**

First, let's recreate the model we built in the previous article using the same data:

In [7]:
import numpy as np
import matplotlib.pyplot as plt

# Sample data: housing size vs price (same as previous article)
area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100])
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000])

# Design matrix
X = np.column_stack([np.ones(len(area)), area])
y = price

def least_squares(X, y):
    """Calculates optimal coefficients using least squares"""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Calculate coefficients and predictions
w = least_squares(X, y)
y_pred = X @ w

print("Calculated coefficients:")
print(f"  w₀ (intercept) = ${w[0]:,.2f}")
print(f"  w₁ (slope) = ${w[1]:,.2f}/m²")

Calculated coefficients:
  w₀ (intercept) = $10,722.85
  w₁ (slope) = $2,791.81/m²
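
Explicitly inverting $X^T X$ works for this small example, but solving the linear system directly is usually the numerically safer habit. The following is a minimal sketch on the same data; `least_squares_solve` is a hypothetical name introduced here, not part of the original implementation.

```python
import numpy as np

# Same sample data as in the setup cell
area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100])
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000])
X = np.column_stack([np.ones(len(area)), area])

def least_squares_solve(X, y):
    """Solve the normal equations (X^T X) w = X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w_inv = np.linalg.inv(X.T @ X) @ X.T @ price   # approach used in this article
w_solve = least_squares_solve(X, price)        # hypothetical alternative

print(f"inv:   {w_inv}")
print(f"solve: {w_solve}")
print("Same coefficients:", np.allclose(w_inv, w_solve))
```

Both routes give the same coefficients here; the difference only starts to matter when $X^T X$ is close to singular.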


---

# **2. Evaluation Metrics**

Now that we have predictions, we need to quantify how good they are. We'll use three fundamental metrics that give us complementary perspectives on the model's performance.

## Correlation coefficient (r): Measuring the linear relationship

Before evaluating our model, we need to understand a fundamental metric: the **Pearson linear correlation coefficient (r)**, which measures the strength and direction of the linear relationship between two variables.

**Values of r:**
- r = 1: perfect positive linear relationship (when x increases, y increases proportionally)
- r = -1: perfect negative linear relationship (when x increases, y decreases proportionally)  
- r = 0: no linear relationship
- |r| > 0.7: strong linear relationship (a common rule of thumb, not a hard cutoff)
- 0.3 ≤ |r| ≤ 0.7: moderate linear relationship
- |r| < 0.3: weak linear relationship

**Formula:**

$$r = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}$$

Or equivalently using covariance and standard deviations:

$$r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

> **Intuitive interpretation**: The numerator measures how the deviations of x and y from their means move together. If both variables tend to rise and fall together, r will be close to 1. If one tends to rise when the other falls, r will be close to -1. If there's no pattern, r will be near 0.

**Why is it important to calculate it?**  
The coefficient r tells us **before training the model** whether there's a linear relationship worth modeling. If |r| is very low, we know in advance that a straight-line fit on that variable will explain little of the variability in the target.
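
To make the formula concrete, here is a short sketch that computes r directly from the deviations and compares it with NumPy's built-in `np.corrcoef`. The helper name `pearson_r` is an assumption made for this example.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100])
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000])

def pearson_r(x, y):
    """Pearson correlation computed term by term from the formula above."""
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

r_manual = pearson_r(area, price)
r_numpy = np.corrcoef(area, price)[0, 1]

print(f"Manual formula: {r_manual:.4f}")
print(f"np.corrcoef:    {r_numpy:.4f}")
```

The two values should agree to floating-point precision, which is a quick way to convince yourself that the library function implements the same formula.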

## R²: Explained variability

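**R² (Coefficient of Determination)** indicates what proportion of the data's variability our model explains. For a least squares fit with an intercept, evaluated on the data it was trained on, it ranges between 0 and 1, where 1 means a perfect fit. On new data, or for a model without an intercept, it can drop below 0.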

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}$$

**Practical interpretation:**
- **R² = 1**: Perfect model (explains 100% of variability)
- **R² = 0.95**: The model explains 95% of the variability in the data
- **R² = 0**: The model is no better than simply predicting the mean
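
These reference points are easy to reproduce. The sketch below uses a tiny made-up target vector and a hypothetical helper `r_squared`. It also includes a deliberately bad set of predictions to show that the formula yields a negative value when predictions are worse than the mean, something a least squares fit with an intercept cannot produce on its own training data.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """R² = 1 - SS_res / SS_tot, as in the formula above."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # made-up target values

r2_perfect = r_squared(y_true, y_true)                            # predictions equal the targets
r2_mean = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
r2_worse = r_squared(y_true, y_true[::-1])                        # predictions in reverse order

print(f"Perfect predictions: {r2_perfect:.1f}")   # 1.0
print(f"Predicting the mean: {r2_mean:.1f}")      # 0.0
print(f"Reversed predictions: {r2_worse:.1f}")    # -3.0
```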

## Connection between r and R²

Here's an important mathematical detail: **in simple linear regression (one variable), R² = r²**.

This means the coefficient of determination is simply the square of the correlation coefficient.

**Practical implications:**

| Correlation (r) | R² | Interpretation |
|-----------------|----|----------------|
| r = 0.95 | R² = 0.90 | The model explains 90% of variability |
| r = 0.70 | R² = 0.49 | The model explains 49% of variability |
| r = 0.50 | R² = 0.25 | The model explains only 25% of variability |
| r = -0.90 | R² = 0.81 | The model explains 81% (sign doesn't affect R²) |

**Key differences:**
- **r** tells you the direction (with the ± sign) and the strength of the relationship
- **R²** carries no sign and tells you what percentage of variability the model explains
- In multiple regression (several variables), there is no single r between the predictors and the target, so R² is the metric to use
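
The identity R² = r² follows in a few lines. Here is a sketch of the derivation, assuming the model includes an intercept and using the shorthand $S_{xx}$, $S_{yy}$, $S_{xy}$ (notation introduced for this derivation only):

$$S_{xx} = \sum_{i=1}^{m} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{m} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$$

The least squares line passes through $(\bar{x}, \bar{y})$ with slope $w_1 = S_{xy} / S_{xx}$, so $\hat{y}_i - \bar{y} = w_1 (x_i - \bar{x})$. Substituting into the residual sum of squares:

$$\sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{m} \big[(y_i - \bar{y}) - w_1 (x_i - \bar{x})\big]^2 = S_{yy} - 2 w_1 S_{xy} + w_1^2 S_{xx} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

Dividing by $S_{yy}$ and subtracting from 1:

$$R^2 = 1 - \frac{S_{yy} - S_{xy}^2 / S_{xx}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} \, S_{yy}} = r^2$$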

## RMSE: The typical error

**RMSE (Root Mean Square Error)** measures the typical magnitude of the prediction errors in the same units as the target variable (dollars here). For example, an RMSE of \\$15,000 means predictions are typically off by around \\$15,000. It is not a plain average of the errors: each error is squared before averaging, so large errors weigh more heavily.

$$\text{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$

**Why is it useful?**
- Unlike R², RMSE is in the **same units** as your target variable (dollars, meters, etc.)
- It gives a concrete sense of "how far off" the predictions typically are
- Because errors are squared before averaging, larger errors are penalized more heavily
- It's easy to communicate: with RMSE = \\$14,574, you can say "our model is typically off by about \\$15,000."
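
To see the penalty on large errors in action, the sketch below compares RMSE with the mean absolute error (MAE). MAE is not used elsewhere in this article and appears here only for contrast. Both error sets are made up and have the same MAE.

```python
import numpy as np

# Two made-up sets of prediction errors (in dollars) with the same mean absolute error
errors_even = np.array([10_000.0, 10_000.0, 10_000.0, 10_000.0])   # four equal misses
errors_spiky = np.array([0.0, 0.0, 0.0, 40_000.0])                 # one large miss

def rmse(errors):
    """Root mean square error: square, average, then take the square root."""
    return np.sqrt(np.mean(errors ** 2))

def mae(errors):
    """Mean absolute error: the plain average of the error magnitudes."""
    return np.mean(np.abs(errors))

for name, errors in [("even", errors_even), ("spiky", errors_spiky)]:
    print(f"{name:>5}:  MAE = ${mae(errors):,.0f}   RMSE = ${rmse(errors):,.0f}")
# even:  MAE = $10,000   RMSE = $10,000
# spiky: MAE = $10,000   RMSE = $20,000
```

A single large miss doubles the RMSE while leaving the plain average untouched, which is why RMSE is better read as a "typical" error than as a literal average.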




## Calculating the metrics


In [8]:
# Calculate evaluation metrics

# Correlation coefficient r (for simple regression)
r = np.corrcoef(area, price)[0, 1]

# MSE and RMSE
mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)

# Coefficient of determination R²
ss_res = np.sum((y - y_pred) ** 2)  # Residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # Total sum of squares
r2 = 1 - (ss_res / ss_tot)

# Display results
print("\n" + "="*50)
print("PERFORMANCE METRICS")
print("="*50)
print(f"r (correlation): {r:.4f}")
print(f"  → {'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) >= 0.3 else 'Weak'} linear relationship")
print(f"\nR² (coef. of determination): {r2:.4f}")
print(f"  → The model explains {r2*100:.2f}% of variability")
print(f"\nVerification: r² = {r**2:.4f} ≈ R² = {r2:.4f} ✓")
print(f"\nRMSE (root mean square error): ${rmse:,.2f}")
print(f"  → Typical error of approximately ${rmse:,.0f}")


PERFORMANCE METRICS
r (correlation): 0.9488
  → Strong linear relationship

R² (coef. of determination): 0.9001
  → The model explains 90.01% of variability

Verification: r² = 0.9001 ≈ R² = 0.9001 ✓

RMSE (root mean square error): $14,573.71
  → Typical error of approximately $14,574


# **3. Visualizing the Fit and Making Predictions**

## Visualizing the fit

The metrics give us numbers, but visualizing the fit helps us better understand the model's performance:

In [None]:
# Model visualization
plt.figure(figsize=(12, 5))

# Subplot 1: Model fit
plt.subplot(1, 2, 1)
plt.scatter(area, price/1000, alpha=0.6, s=100, label='Actual data')
plt.plot(area, y_pred/1000, 'r-', linewidth=2, label='Linear fit')
plt.xlabel('Area (m²)', fontsize=11)
plt.ylabel('Price (thousands $)', fontsize=11)
plt.title(f'Model Fit (R² = {r2:.3f})', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)

# Subplot 2: Residual analysis
plt.subplot(1, 2, 2)
residuals = y - y_pred
plt.scatter(y_pred/1000, residuals/1000, alpha=0.6, s=100)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.xlabel('Predicted Price (thousands $)', fontsize=11)
plt.ylabel('Residuals (thousands $)', fontsize=11)
plt.title(f'Residual Analysis (RMSE = ${rmse/1000:.1f}k)', fontsize=12)
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nResidual analysis shows how far each prediction is from the actual value.")
print("Ideally, residuals should be randomly distributed around zero.")
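
One detail is worth checking numerically when you read a residual plot. For a least squares fit that includes an intercept, the residuals always sum to zero and are uncorrelated with the feature, by construction. The sketch below verifies this on the same data; it refits the line with `np.linalg.lstsq` so the block is self-contained.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100], dtype=float)
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000], dtype=float)

X = np.column_stack([np.ones(len(area)), area])
w, *_ = np.linalg.lstsq(X, price, rcond=None)   # least squares coefficients
residuals = price - X @ w

# Both entries of X^T @ residuals are zero up to rounding:
# the first is the sum of residuals, the second is sum(area * residuals)
orthogonality = X.T @ residuals
print("Sum of residuals close to zero:", np.isclose(residuals.sum(), 0, atol=1e-3))
print("X^T @ residuals close to zero: ", np.allclose(orthogonality, 0, atol=1e-3))
```

Because centering around zero is automatic, it tells you nothing about model quality. What matters in the plot is whether the residuals show a visible pattern, such as a curve or a widening funnel.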

## Making predictions

With our model evaluated, we can use it to predict prices for new properties:

In [10]:
# Predictions for new properties
new_areas = np.array([65, 85, 95])
predictions = w[0] + w[1] * new_areas

print("\nPREDICTIONS FOR NEW PROPERTIES")
print("-" * 40)
for area_val, pred in zip(new_areas, predictions):
    print(f"   {area_val}m² house  →  ${pred:,.0f}")

print(f"\n(Remember: predictions have a typical error of ~${rmse:,.0f})")


PREDICTIONS FOR NEW PROPERTIES
----------------------------------------
   65m² house  →  $192,190
   85m² house  →  $248,027
   95m² house  →  $275,945

(Remember: predictions have a typical error of ~$14,574)


---

# **4. Comparison with scikit-learn**

An excellent way to validate our implementation is to compare it with the industry-standard library:

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create and train the scikit-learn model
sklearn_model = LinearRegression()
sklearn_model.fit(area.reshape(-1, 1), price)

# Predictions with scikit-learn
sklearn_pred = sklearn_model.predict(area.reshape(-1, 1))
sklearn_rmse = np.sqrt(mean_squared_error(price, sklearn_pred))
sklearn_r2 = r2_score(price, sklearn_pred)

# Compare results
print("\n" + "="*65)
print("COMPARISON: Our Implementation vs scikit-learn")
print("="*65)
print(f"\n{'Parameter':<20} {'Our impl.':<22} {'scikit-learn':<22}")
print("-" * 65)
print(f"{'w₀ (intercept)':<20} ${w[0]:>18,.2f}   ${sklearn_model.intercept_:>18,.2f}")
print(f"{'w₁ (slope)':<20} ${w[1]:>17,.2f}/m²  ${sklearn_model.coef_[0]:>17,.2f}/m²")
print(f"\n{'R²':<20} {r2:>21.4f}   {sklearn_r2:>21.4f}")
print(f"{'RMSE':<20} ${rmse:>20,.2f}   ${sklearn_rmse:>20,.2f}")
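# Confirm the match numerically before claiming it
assert np.allclose(w, [sklearn_model.intercept_, sklearn_model.coef_[0]])
print("\n✓ Results are identical! Our implementation is correct.")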


COMPARISON: Our Implementation vs scikit-learn

Parameter            Our impl.              scikit-learn          
-----------------------------------------------------------------
w₀ (intercept)       $         10,722.85   $         10,722.85
w₁ (slope)           $         2,791.81/m²  $         2,791.81/m²

R²                                  0.9001                  0.9001
RMSE                 $           14,573.71   $           14,573.71

✓ Results are identical! Our implementation is correct.
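
If scikit-learn isn't installed, NumPy alone offers two further cross-checks. This sketch compares the normal-equations result with `np.polyfit` and `np.linalg.lstsq`, using `np.allclose` instead of reading the numbers by eye.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100])
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000])
X = np.column_stack([np.ones(len(area)), area])

# This article's implementation: [intercept, slope]
w = np.linalg.inv(X.T @ X) @ X.T @ price

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope_pf, intercept_pf = np.polyfit(area, price, deg=1)

# np.linalg.lstsq returns (solution, residuals, rank, singular_values)
w_lstsq, *_ = np.linalg.lstsq(X, price, rcond=None)

matches_polyfit = np.allclose(w, [intercept_pf, slope_pf])
matches_lstsq = np.allclose(w, w_lstsq)
print("Matches np.polyfit:     ", matches_polyfit)
print("Matches np.linalg.lstsq:", matches_lstsq)
```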


---

# **5. Advantages, Limitations, and When to Use It**

Now that we understand how it works and how to evaluate it, it's crucial to know when it's appropriate to use this method.

## Advantages of the least squares method

The Ordinary Least Squares (OLS) method has characteristics that make it especially valuable:

- **Exact solution**: Finds optimal coefficients directly, without iterations or approximations
- **Speed**: For small and medium datasets, it's computationally very efficient
- **Interpretability**: Coefficients have a direct and clear interpretation you can explain to stakeholders
- **Mathematical guarantee**: When $X^T X$ is invertible, the solution is unique and is the global minimum of the squared error
- **No hyperparameters**: Doesn't require tuning learning rate or other parameters

## Important limitations

However, the method has important limitations you should be aware of:

In [12]:
# Example of multicollinearity problem
X_problem = np.column_stack([area, area * 2])  # Linearly dependent columns

print("Example with linearly dependent columns:")
print("\nFirst 3 rows of problematic matrix:")
print(X_problem[:3])
print(f"\nDeterminant of X.T @ X: {np.linalg.det(X_problem.T @ X_problem):.2e}")

Example with linearly dependent columns:

First 3 rows of problematic matrix:
[[ 50 100]
 [ 55 110]
 [ 60 120]]

Determinant of X.T @ X: 0.00e+00
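
The determinant depends on the scale of the data, so a rank check is often a more dependable way to detect this situation in code. Below is a minimal sketch; `has_full_column_rank` is a hypothetical helper, not something defined earlier in this series.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100])

X_problem = np.column_stack([area, area * 2])        # same matrix as above
X_ok = np.column_stack([np.ones(len(area)), area])   # design matrix from section 1

def has_full_column_rank(X):
    """True when the columns of X are linearly independent (X^T X is invertible)."""
    return np.linalg.matrix_rank(X) == X.shape[1]

rank_problem = np.linalg.matrix_rank(X_problem)
print(f"X_problem: {X_problem.shape[1]} columns, rank {rank_problem}")
print("X_problem usable for least squares:", has_full_column_rank(X_problem))  # False
print("X_ok usable for least squares:     ", has_full_column_rank(X_ok))       # True
```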


**Main problems with the method:**

1. **Singular matrices / Perfect collinearity**  
   When predictor variables are linearly dependent (one is an exact linear combination of others), the matrix $(X^T X)$ becomes singular and cannot be inverted.  
   In the example above, the second column is exactly twice the first → the determinant of $X^T X$ is exactly 0 → no unique least squares solution exists.

2. **Multicollinearity (high but not perfect correlation)**  
   Even when features are not perfectly collinear, strong correlations between predictors make the estimated coefficients unstable and hard to interpret (small changes in the data can cause large swings in the coefficients).

3. **Limited scalability**  
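   With m records and n features, forming $X^T X$ costs O(m·n²) operations and inverting it costs O(n³). The method therefore becomes prohibitively expensive as the number of features grows, and very large datasets (millions of records) add to the cost.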

4. **Sensitivity to outliers**  
   Because errors are squared, extreme values have a disproportionately large influence and can significantly bias the model.

5. **Memory consumption**  
   The implementation shown here builds the full design matrix, so the entire dataset must fit in RAM.
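
The sensitivity to outliers (point 4) is easy to demonstrate on our housing data. In this sketch a single price is corrupted by a hypothetical data-entry error and the line is refitted; `fit_line` is a helper name introduced for the example.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100], dtype=float)
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000], dtype=float)

def fit_line(x, y):
    """Least squares fit of y = w0 + w1 * x; returns the array [w0, w1]."""
    X = np.column_stack([np.ones(len(x)), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

w_clean = fit_line(area, price)

# Hypothetical data-entry error: the largest house is recorded at twice its price
price_outlier = price.copy()
price_outlier[-1] = 600000
w_outlier = fit_line(area, price_outlier)

print(f"Slope without outlier: ${w_clean[1]:,.0f}/m²")
print(f"Slope with outlier:    ${w_outlier[1]:,.0f}/m²")
```

One corrupted value out of ten is enough to more than double the estimated price per square meter, which is why inspecting the data before fitting matters.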

**Practical solutions for multicollinearity:**
- Remove one of the highly correlated features
- Combine correlated features (e.g., using Principal Component Analysis – PCA)
- Switch to regularized regression methods (Ridge or Lasso), which are more robust to collinearity
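
To give a feel for the last option, here is a minimal sketch of ridge regression in closed form, $w = (X^T X + \lambda I)^{-1} X^T y$, applied to the perfectly collinear matrix from the example above. The value of `lam` is arbitrary and purely illustrative. The example has no intercept column, to match `X_problem`; in practice the features are usually standardized and the intercept is left unpenalized, and a library implementation such as scikit-learn's `Ridge` is the safer choice.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100], dtype=float)
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000], dtype=float)

X_problem = np.column_stack([area, area * 2])   # perfectly collinear columns

def ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lam = 1.0   # regularization strength (illustrative value, not tuned)
w_ridge = ridge(X_problem, price, lam)

print("Ridge coefficients:", w_ridge)
print("Second is twice the first:", np.isclose(w_ridge[1], 2 * w_ridge[0]))
```

Adding $\lambda I$ makes the matrix invertible even though $X^T X$ itself is singular, and the penalty spreads the weight across the two redundant columns in proportion to their scale instead of failing.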

## When to use Least Squares vs Gradient Descent?

The choice between these methods depends mainly on the size and nature of your data:

**Use least squares when:**
- You have few features (typically fewer than a few hundred on consumer hardware)
- You need the exact analytical solution in a single operation
- The dataset fits comfortably in RAM
- Interpretability of individual coefficients is important
- There are no severe multicollinearity issues

**Use Gradient Descent when:**
- You have millions of records or hundreds/thousands of features
- The dataset doesn't fit in memory (you can use mini-batches)
- You need to update the model continuously with new data (online learning)
- You're working with neural networks or other non-linear models
- You want built-in regularization (L1, L2) for better handling of collinearity or feature selection

**Why size matters**: Beyond several hundred features, the cost and numerical instability of matrix inversion grow rapidly, making iterative methods like gradient descent far more practical for large-scale or high-dimensional problems.

**Practical scalability rule:**
- < 10,000 observations and < 100 features → Least squares (fast and exact)
- \> 100,000 observations or > 1,000 features → Gradient Descent (scalable)
- Between 10k–100k → Both work; choose based on your specific needs
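
For a first taste of the iterative alternative, here is a minimal sketch of batch gradient descent on the same housing data, compared against the closed-form solution. The learning rate and iteration count are illustrative choices, and the feature is standardized first so that a single learning rate behaves well. As a result the coefficients are on the standardized scale and differ from w₀ and w₁ above, although both methods describe the same fitted line.

```python
import numpy as np

area = np.array([50, 55, 60, 64, 70, 78, 80, 89, 90, 100], dtype=float)
price = np.array([140000, 155000, 190000, 200000, 225000,
                  212000, 240000, 230000, 270000, 300000], dtype=float)

# Standardize the feature (zero mean, unit variance)
x_std = (area - area.mean()) / area.std()
X = np.column_stack([np.ones(len(x_std)), x_std])
y = price

# Closed-form solution via the normal equations
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error
w_gd = np.zeros(2)
learning_rate = 0.1        # illustrative value
for _ in range(1000):      # illustrative iteration count
    gradient = (2 / len(y)) * X.T @ (X @ w_gd - y)
    w_gd -= learning_rate * gradient

print("Closed form:     ", w_closed)
print("Gradient descent:", w_gd)
print("Same result:", np.allclose(w_closed, w_gd))
```

On a problem this small the closed form is the obvious choice. The value of the loop is that each step only needs matrix-vector products, which is what lets it scale where inversion cannot.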

---

# **6. Conclusion**

In this article, we completed our journey through linear regression with least squares, learning not only to build the model but also to evaluate it correctly and understand when to use it.

## Recap

**What we learned:**
- How to measure performance with **RMSE** and **R²**
- The relationship between R² and correlation
- How to visualize and validate the model fit
- The advantages and limitations of the method
- When to choose least squares vs Gradient Descent

**Why it matters:**
Linear regression using least squares is:
- **Your starting point** for any numerical prediction problem
- **Extremely effective** when there are linear relationships in the data
- **Interpretable and explainable**, crucial in business contexts
- **The foundation** for understanding more advanced methods

## A practical note for beginners

Here's some straightforward advice as you start your journey in Data Science and Machine Learning:

1. **For real-world projects: Reach for scikit-learn first**  
   In almost every practical scenario, your go-to choice should be scikit-learn's `LinearRegression`. It handles the implementation details reliably, is numerically robust for any dataset that fits in memory, and saves you from worrying about the underlying solver.

2. **For hands-on learning and future-proof skills: Implement gradient descent**  
   When building models from scratch to deepen your understanding, prioritize **gradient descent** over the closed-form least squares solution. Gradient descent scales to problems of any size and forms the foundation of nearly all modern machine learning optimization, especially in deep learning. Iterative methods like this are the tools you'll use throughout your career.

3. **For building intuition: Study the least squares (normal equations) method**  
   The normal equations are excellent for developing intuition about convex optimization and understanding what "best-fitting line" really means in linear regression. However, most loss functions you'll meet beyond linear regression (those of neural networks, for example) are not convex and have no closed-form solution, so treat least squares primarily as a powerful teaching tool. In practice, rely on libraries or gradient descent.

## Next steps

Now that you've mastered the closed-form solution, the natural next step is to explore the **Gradient Descent algorithm**. This method will allow you to:
- Scale to problems with millions of data points
- Understand iterative optimization
- Prepare for Deep Learning

In the next article in this series, we'll implement Gradient Descent from scratch and compare its performance with least squares, giving you all the tools to decide which method to use in each situation.

---

> ### Thank you

---
© 2025
---