# Regression Metrics

1. MAE
2. MSE
3. RMSE
4. R² Score
5. Adjusted R²

## 1. MAE (Mean Absolute Error)

MAE is the mean of the absolute errors: take the absolute value of each data point's error (so that it is never negative) and average the results.

$$
MAE = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|
$$
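
The formula above can be sketched in plain Python. The function name and the sample salaries are illustrative assumptions, not part of the original notes.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all m data points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical salaries in LPA
actual = [6.0, 10.0, 15.0, 8.0]
predicted = [7.0, 9.0, 12.0, 8.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # 1.25, i.e. the predictions are off by 1.25 LPA on average
```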

### Advantages

1. MAE is expressed in the same unit as y (LPA for salary, units of 100,000 for housing prices)
2. MAE is robust to outliers (relatively less affected by them)

### Disadvantages

1. Non-differentiable at zero residual

    The absolute value function has a sharp corner at zero, so its derivative is undefined at that point. While practical implementations use subgradients and do not crash, the loss surface is not smooth. This can lead to slower or less stable convergence in gradient-based optimization compared to smooth losses like MSE.

2. Constant gradient magnitude

    The gradient of MAE depends only on the sign of the error, not its magnitude. As a result, the gradient does not decrease as the model approaches the minimum. This can cause oscillations near the optimum and slower convergence compared to MSE.

3. Less penalization of large errors

    Since errors are not squared, large residuals are penalized linearly. This makes MAE robust to outliers but reduces the incentive for the model to aggressively correct very large mistakes.
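
The constant-gradient behaviour can be seen by comparing the gradients of MAE and MSE with respect to each prediction. This is a minimal sketch: the function names are hypothetical, and the choice of 0 as the subgradient at a zero residual is one common convention, not the only valid one.

```python
def mae_gradient(y_true, y_pred):
    """Subgradient of MAE w.r.t. each prediction: -sign(y - y_hat) / m."""
    m = len(y_true)

    def sign(v):
        return (v > 0) - (v < 0)  # returns 0 at v == 0 (assumed subgradient choice)

    return [-sign(y - y_hat) / m for y, y_hat in zip(y_true, y_pred)]


def mse_gradient(y_true, y_pred):
    """Gradient of MSE w.r.t. each prediction: -2 * (y - y_hat) / m."""
    m = len(y_true)
    return [-2 * (y - y_hat) / m for y, y_hat in zip(y_true, y_pred)]


actual = [10.0, 10.0, 10.0]
predicted = [0.0, 9.0, 9.9]  # errors of roughly 10, 1 and 0.1

print(mae_gradient(actual, predicted))  # same magnitude for every non-zero error
print(mse_gradient(actual, predicted))  # magnitude shrinks as the error shrinks
```

The MAE gradient is identical for an error of 10 and an error of 0.1, which is why a fixed learning rate can overshoot near the optimum.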

---

## 2. MSE (Mean Squared Error)

$$
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
$$

MSE calculates the average of the squared differences between actual and predicted values.


---
### Example

If the true salary is **6 LPA** and the predicted salary is **12 LPA**, then:

Error = $6 - 12 = -6$

Squared Error = $(-6)^2 = 36$

So the contribution to MSE is **36 (LPA²)**.

Notice that the unit becomes **squared**, which makes interpretation less intuitive.
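
A minimal sketch of the same calculation in Python. The function name and the four-point data set are assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (y_i - y_hat_i)^2 over all m data points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# The single data point from the example: true 6 LPA, predicted 12 LPA
print(mean_squared_error([6.0], [12.0]))  # 36.0 (in LPA squared)

# Hypothetical salaries in LPA
actual = [6.0, 10.0, 15.0, 8.0]
predicted = [7.0, 9.0, 12.0, 8.0]
mse = mean_squared_error(actual, predicted)
print(mse)  # 2.75
```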



### Advantages

1. **Smooth and fully differentiable**

   MSE is based on a quadratic function, which is differentiable everywhere with continuous gradients. This makes it highly suitable for gradient-based optimization methods like gradient descent. As the error approaches zero, the gradient smoothly decreases to zero, enabling stable convergence.

2. **Strongly penalizes large errors**

   Since errors are squared, large residuals grow rapidly.
   For example:

   * Error = 2 → Squared = 4
   * Error = 10 → Squared = 100
   * Error = 50 → Squared = 2500

   This forces the model to prioritize correcting large mistakes.


### Disadvantages

1. **Less interpretable**

   The loss is expressed in squared units of the target variable (e.g., LPA²), which makes it harder to interpret directly compared to MAE.

2. **Sensitive to outliers**

   Because large errors are amplified quadratically, even a single outlier can significantly increase the loss and heavily influence the model. Therefore, MSE is not robust to outliers.
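
To make the outlier effect concrete, the sketch below adds one large residual to a small hypothetical set and compares how much MAE and MSE grow. The residual values are invented for illustration.

```python
def mae_from_residuals(residuals):
    """Mean of absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def mse_from_residuals(residuals):
    """Mean of squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)


clean = [1.0, -1.0, 2.0, -2.0]   # hypothetical residuals
with_outlier = clean + [50.0]    # same residuals plus one outlier

mae_growth = mae_from_residuals(with_outlier) / mae_from_residuals(clean)
mse_growth = mse_from_residuals(with_outlier) / mse_from_residuals(clean)

print(mae_from_residuals(clean), mae_from_residuals(with_outlier))  # 1.5 11.2
print(mse_from_residuals(clean), mse_from_residuals(with_outlier))  # 2.5 502.0
print(round(mae_growth, 1), round(mse_growth, 1))                   # 7.5 200.8
```

One outlier multiplies MAE by about 7.5 here, while MSE grows by a factor of about 200.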

---



## 3. RMSE (Root Mean Squared Error)

$$
RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}
$$

RMSE is simply the square root of the Mean Squared Error.

---

### Example

If the true salary is **6 LPA** and the predicted salary is **12 LPA**, then:

Error = $6 - 12 = -6$

Squared Error = $(-6)^2 = 36$

MSE contribution = 36 (LPA²)

RMSE = $\sqrt{36} = 6$ (LPA), treating this as the only data point

Notice that RMSE brings the loss back to the **same unit as the target variable (LPA)**, making it easier to interpret compared to MSE.
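
A minimal Python sketch of RMSE, assuming the same hypothetical salaries used for illustration elsewhere; the function name is not from the original notes.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the unit of the target."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# The single data point from the example: true 6 LPA, predicted 12 LPA
print(root_mean_squared_error([6.0], [12.0]))  # 6.0 (in LPA)

# Hypothetical salaries in LPA
actual = [6.0, 10.0, 15.0, 8.0]
predicted = [7.0, 9.0, 12.0, 8.0]
rmse = root_mean_squared_error(actual, predicted)
print(round(rmse, 3))  # 1.658
```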

---

### Properties

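Because the square root is monotonic, RMSE ranks models in the same order as MSE and is minimized by the same predictions. Its sensitivity to large errors carries over as well; only the scale and unit change.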

---

### Advantages

1. **Interpretable unit**

   Unlike MSE, RMSE is expressed in the same unit as the target variable, making it more intuitive to understand.

2. **Smooth and differentiable**

   Since it is derived from MSE, it is differentiable wherever the MSE is non-zero and remains suitable for gradient-based optimization.

---

### Disadvantages

1. **Sensitive to outliers**

   Because it is based on squared errors, large residuals are still heavily penalized. Therefore, RMSE is not robust to outliers.

2. **Still influenced by large errors**

   Taking the square root restores the original unit but does not undo the squaring inside the mean, so large mistakes still carry more weight than they do in MAE.

---


## 4. R² Score (Coefficient of Determination)

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

Where:

$$
SS_{res} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
$$

$$
SS_{tot} = \sum_{i=1}^{m} (y_i - \bar{y})^2
$$

* $SS_{res}$ → Residual Sum of Squares (squared error of the regression model)
* $SS_{tot}$ → Total Sum of Squares (squared error of always predicting the mean)
* $\bar{y}$ → Mean of the actual target values
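
The three quantities above translate directly into code. This is a sketch with a hypothetical function name and invented salary data; it raises an error when all targets are equal, since $SS_{tot}$ is then zero and the ratio is undefined.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1 - ss_res / ss_tot


# Hypothetical salaries in LPA
actual = [6.0, 10.0, 15.0, 8.0]
predicted = [7.0, 9.0, 12.0, 8.0]

r2 = r2_score(actual, predicted)
print(round(r2, 3))  # 0.754 (SS_res = 11, SS_tot = 44.75)
```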

---

### What is R²?

R² measures how much variance in the target variable is explained by the model compared to simply predicting the mean.

It compares:

* Error made by your regression model
  vs
* Error made by a naive model that always predicts the mean

So it tells us how much better our model is than just using the average.

---

### Interpretation of R²

#### 1️⃣ If R² = 0

$$
SS_{res} = SS_{tot}
$$

This means:

Your regression model performs exactly the same as predicting the mean.

In other words:

* The regression line is effectively behaving like the mean line.
* The features are not helping.
* The model is not learning any useful relationship.

---

#### 2️⃣ If R² = 1

$$
SS_{res} = 0
$$

This means:

* Predictions are perfectly equal to actual values.
* There is no residual error.
* The regression line fits all data points exactly.

Since the numerator becomes 0:
$$
R^2 = 1 - 0 = 1
$$

This represents a perfect fit.

---

#### 3️⃣ If R² is Between 0 and 1

Example:

R² = 0.75

This means:

75% of the variance in the target variable is explained by the model, and 25% remains unexplained.

---

#### 4️⃣ If R² is Negative

This happens when:

$$
SS_{res} > SS_{tot}
$$

Meaning:

The model performs worse than simply predicting the mean.

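This usually shows up when the model is scored on data it was not fitted on (a least-squares model with an intercept cannot go below 0 on its own training data). Common causes:

* The model is poorly trained
* The wrong features are used
* The model overfits the training data severely and is then evaluated on unseen data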
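
The cases above (R² of 0, 1, and a negative value) can be reproduced with a few hand-picked predictions. The data and the function name are assumptions chosen so that the arithmetic is easy to check by hand.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]        # mean is 5.0, SS_tot is 20

always_mean = [5.0, 5.0, 5.0, 5.0]   # naive baseline
perfect = [2.0, 4.0, 6.0, 8.0]       # predictions equal the targets
reversed_trend = [8.0, 6.0, 4.0, 2.0]  # worse than the baseline

print(r2_score(actual, always_mean))     # 0.0
print(r2_score(actual, perfect))         # 1.0
print(r2_score(actual, reversed_trend))  # -3.0 (SS_res = 80 > SS_tot = 20)
```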

---

### Key Insight

* R² does **not** measure prediction accuracy directly.
* It measures how much variance is explained relative to a baseline (mean model).
* A higher R² does not always mean a better model, especially when the model overfits.

---

## 5. Adjusted R²

Adjusted R² is a modified version of R² that accounts for the number of features in the model.

While R² **never decreases** on the training data when you add more features to a least-squares model (even useless ones), Adjusted R² increases **only if the new feature improves the model more than would be expected by chance**.
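
These notes do not state the formula, so the sketch below assumes the standard definition, with $m$ data points and $p$ predictors ($p$ is a symbol introduced here):

$$
R^2_{adj} = 1 - (1 - R^2) \frac{m - 1}{m - p - 1}
$$

The function name, the sample size of 30, and the R² values are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r2(r2, m, p):
    """Adjusted R^2 for m data points and p predictors (standard definition)."""
    if m - p - 1 <= 0:
        raise ValueError("need more data points than predictors plus one")
    return 1 - (1 - r2) * (m - 1) / (m - p - 1)


m = 30  # hypothetical number of data points

base = adjusted_r2(0.780, m, 2)     # two predictors
useless = adjusted_r2(0.781, m, 3)  # third predictor barely raises R^2
useful = adjusted_r2(0.850, m, 3)   # third predictor raises R^2 substantially

print(round(base, 3), round(useless, 3), round(useful, 3))  # 0.764 0.756 0.833
```

A tiny gain in R² does not cover the cost of the extra predictor, so the adjusted score falls; a large gain does, so it rises.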

---

### Why is Adjusted R² Used?

Because R² can be misleading when comparing models with different numbers of features.

When you add a new column:

* R² will either increase or stay the same.
* It never penalizes complexity.

Adjusted R² introduces a penalty for adding unnecessary features.

So it helps answer:

> Is this new feature genuinely improving the model, or just artificially inflating R²?

---

### Example 1 — Useless Feature (R² Increases, Adjusted R² Decreases)

Suppose you are predicting salary using:

* Years of experience
* Education level

Model performance:

* R² = 0.78
* Adjusted R² = 0.77

Now you add a random column:

* Employee ID number (which has no logical relationship with salary)

After retraining:

* R² = 0.79 (it increased slightly)
* Adjusted R² = 0.75 (it decreased)

Why?

R² only checks whether the residual error decreased, and even random noise can reduce it slightly by chance.

Adjusted R² also charges for the extra predictor. Here the small gain in fit does not cover that penalty, so the score drops.

So in this case:

* R² suggests improvement.
* Adjusted R² correctly signals overfitting.
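
The behaviour in this example can be checked with a small simulation. The sketch below is an assumption-laden illustration, not the notes' own experiment: it generates synthetic salary data, fits ordinary least squares through the normal equations in plain Python, and then refits with an extra column of random noise. All names, the sample size, and the data-generating numbers are hypothetical, and the adjusted R² formula used is the standard one.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares weights from the normal equations (intercept first)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(rows, w):
    return [w[0] + sum(c * v for c, v in zip(w[1:], r)) for r in rows]


def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, m, p):
    return 1 - (1 - r2) * (m - 1) / (m - p - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
m = 40
experience = [rng.uniform(0, 20) for _ in range(m)]
salary = [3 + 0.8 * x + rng.gauss(0, 2) for x in experience]  # synthetic target
noise = [rng.random() for _ in range(m)]                      # unrelated column

rows_base = [[x] for x in experience]
rows_noise = [[x, z] for x, z in zip(experience, noise)]

r2_base = r2_score(salary, predict(rows_base, fit_ols(rows_base, salary)))
r2_with_noise = r2_score(salary, predict(rows_noise, fit_ols(rows_noise, salary)))

print("R2:", r2_base, "->", r2_with_noise)
print("Adjusted R2:", adjusted_r2(r2_base, m, 1), "->", adjusted_r2(r2_with_noise, m, 2))
```

On the training data, R² with the noise column is never lower than without it. Whether adjusted R² drops depends on the particular random draw, so the output is worth inspecting rather than assuming.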

---

### Example 2 — Useful Feature (Both Increase)

Same original model:

* R² = 0.78
* Adjusted R² = 0.77

Now you add:

* Skill certification score (which genuinely affects salary)

After retraining:

* R² = 0.85
* Adjusted R² = 0.84

Here:

* Error decreases significantly.
* The new feature explains additional variance.
* Adjusted R² increases because the improvement justifies the added complexity.

---

### When Adjusted R² Is Not Very Useful

Adjusted R² is less useful when:

* You are not comparing models with different numbers of features.
* You care more about predictive performance on unseen data (where cross-validation is better).
* You are using non-linear or non-parametric models (like decision trees or neural networks).

In modern ML workflows, validation metrics often matter more than Adjusted R².

---

### Key Takeaway

* R² measures how much variance is explained.
* R² never decreases on the training data when you add features; it either increases or stays the same.
* Adjusted R² increases only if the new feature adds real explanatory power.
* It is mainly used for comparing multiple regression models with different numbers of predictors.

---
