## Loss functions

Loss functions measure how well a regression model's predictions match the observed outcomes. Each one quantifies error differently, with its own emphasis and mathematical properties, so the right choice depends on the data distribution and the objective. This section explains each major regression loss function in beginner-friendly terms, with formulas and practical considerations.

- Loss is a numerical metric that tells us how far off our predictions are from the actual observed data.

- Loss functions are minimized during model training to improve prediction accuracy.

### Mean Squared Error (MSE)

- **Definition**: MSE calculates the average squared difference between predicted ($\hat{y}_i$) and actual ($y_i$) values.
- **Formula**: 
  $$
  \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2
  $$
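- **Advantages**: Smooth and differentiable everywhere, so gradient-based optimization is straightforward; the gradient shrinks as errors shrink, which helps convergence near the minimum; penalizes larger mistakes heavily; widely used.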
- **Drawbacks**: Extremely sensitive to outliers – large errors have disproportionate influence, which can destabilize optimization and bias the model.
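
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `mse` and the sample data are made up for the example, and inputs are assumed to be equal-length lists of numbers.

```python
def mse(y_true, y_pred):
    """Average of squared residuals (y_i - y_hat_i)^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: three small errors and one large one.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 2.5, 17.0]

# Squared residuals: 0.25, 0, 0.25, 100 -> mean = 100.5 / 4
print(mse(y_true, y_pred))  # 25.125
```

The single residual of 10 contributes 100 of the 100.5 total, which is the outlier sensitivity described in the drawbacks.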


### Root Mean Squared Error (RMSE)

- **Definition**: RMSE is the square root of MSE and shares its properties, but critically, it is expressed in the same units as the target variable.
- **Formula**:
  $$
  \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}
  $$
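- **Advantages**: Easy to interpret because it is in the target's units, penalizes larger errors more heavily, useful for applications sensitive to large deviations. It summarizes typical error magnitude as a quadratic mean, so it is always at least as large as MAE.
- **Drawbacks**: Sensitive to outliers, less robust when errors are not normally distributed, and scale-dependent, so values are hard to compare across datasets with different target scales.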
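
As a rough sketch of the formula, the snippet below computes RMSE next to MAE on two made-up prediction sets with the same total squared error. The helper names (`rmse`, `mae`) and the data are assumptions for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared residual, in the target's units."""
    mean_sq = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


def mae(y_true, y_pred):
    """Mean absolute residual, included only for comparison."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]
spread_out = [8.0, 12.0, 8.0, 12.0]      # every prediction is off by 2
concentrated = [10.0, 10.0, 10.0, 14.0]  # one prediction is off by 4

print(rmse(y_true, spread_out), mae(y_true, spread_out))      # 2.0 2.0
print(rmse(y_true, concentrated), mae(y_true, concentrated))  # 2.0 1.0
```

Both sets have an RMSE of 2.0, but when the error is concentrated in one prediction the MAE drops to 1.0. RMSE weights that single large deviation more heavily, so it is never smaller than MAE.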



### Mean Absolute Error (MAE)

- **Definition**: MAE computes the average absolute difference between predicted and actual values.
- **Formula**:
  $$
  \text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|
  $$
- **Advantages**: Simple, computationally efficient, less sensitive to outliers than MSE.
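- **Drawbacks**: Errors are weighted in proportion to their size, so large errors get no extra penalty; not differentiable at zero error, and the gradient has the same magnitude for small and large errors, which can make convergence near the minimum less stable.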
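
A minimal sketch of the MAE formula, assuming equal-length lists of numbers. The function name and the sample values are hypothetical.

```python
def mae(y_true, y_pred):
    """Average of absolute residuals |y_i - y_hat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
clean = [2.5, 5.0, 2.5, 6.5]          # residuals: 0.5, 0, -0.5, 0.5
with_outlier = [2.5, 5.0, 2.5, 17.0]  # last residual is now -10

print(mae(y_true, clean))         # 1.5 / 4 = 0.375
print(mae(y_true, with_outlier))  # 11.0 / 4 = 2.75
```

The outlier raises the loss only in proportion to its size (10 out of 11), rather than quadratically as it would under MSE.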


### Huber Loss (Smooth MAE)

- **Definition**: Huber loss is quadratic like MSE for small errors, which keeps optimization smooth, and linear like MAE for large errors, which reduces sensitivity to outliers.
- **Formula** (per-sample loss with threshold $\delta$):
  $$
  L_\delta(y_i, \hat{y}_i) =
  \begin{cases}
      \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \leq \delta \\
      \delta|y_i - \hat{y}_i| - \frac{1}{2}\delta^2 & \text{for } |y_i - \hat{y}_i| > \delta
  \end{cases}
  $$
- **Advantages**: Robust to outliers, continuously differentiable, stable for optimization.
- **Drawbacks**: Requires tuning the threshold parameter $\delta$ for best results.
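
The piecewise formula above is a per-sample loss. The sketch below averages it over a dataset, which is a common convention but an assumption here, as are the function name `huber` and the default `delta=1.0`.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic within delta of the target, linear beyond it."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        r = abs(y - p)
        if r <= delta:
            total += 0.5 * r ** 2                  # MSE-like branch
        else:
            total += delta * r - 0.5 * delta ** 2  # MAE-like branch
    return total / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 2.5, 17.0]  # residuals: 0.5, 0, -0.5, -10

print(huber(y_true, y_pred, delta=1.0))  # (0.125 + 0 + 0.125 + 9.5) / 4 = 2.4375
print(huber(y_true, y_pred, delta=5.0))  # (0.125 + 0 + 0.125 + 37.5) / 4 = 9.4375
```

Raising $\delta$ widens the quadratic region, so the same outlier costs more. This is the trade-off that tuning $\delta$ controls.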


### Mean Squared Logarithmic Error (MSLE)

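- **Definition**: MSLE is the mean squared difference between the logarithms of the actual and predicted values, each shifted by 1. Because a difference of logs is the log of a ratio, it measures relative rather than absolute error. This is useful when targets span several orders of magnitude and large absolute differences on large values should be penalized less aggressively. It requires $y_i > -1$ and $\hat{y}_i > -1$, so it is typically used with non-negative targets.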
- **Formula**:
  $$
  \text{MSLE} = \frac{1}{n} \sum_{i=1}^n \left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2
  $$
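- **Advantages**: Treats the same relative error on small and large targets comparably, with less severe penalties for huge absolute differences.
- **Drawbacks**: Penalizes underprediction more than overprediction of the same absolute size. A model trained on it targets the mean of $\log(y + 1)$ rather than the mean of $y$, so back-transformed predictions can be biased.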
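
The following sketch illustrates two properties of the formula above: the dependence on ratios and the asymmetry between under- and overprediction. It uses `math.log1p` from the standard library, which computes $\log(1 + x)$. The function name `msle` and the numbers are made up for the example.

```python
import math


def msle(y_true, y_pred):
    """Mean squared difference of log(1 + y) and log(1 + y_hat)."""
    return sum(
        (math.log1p(y) - math.log1p(p)) ** 2 for y, p in zip(y_true, y_pred)
    ) / len(y_true)


# Same ratio (1 + y) / (1 + y_hat) = 2 at two very different scales.
small_scale = msle([9.0], [4.0])    # log(10 / 5) squared
large_scale = msle([99.0], [49.0])  # log(100 / 50) squared
print(small_scale, large_scale)     # both about 0.4805

# Same absolute error (50) in opposite directions around a true value of 100.
under = msle([100.0], [50.0])
over = msle([100.0], [150.0])
print(under > over)  # True: underprediction is penalized more
```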


### Mean Bias Error (MBE)

- **Definition**: MBE calculates the average signed error between prediction and observation (not absolute), highlighting model bias.
- **Formula**:
  $$
  \text{MBE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)
  $$
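- **Advantages**: Useful for identifying whether the model generally overestimates or underestimates. With the sign convention used here ($y_i - \hat{y}_i$), a positive MBE means the model underestimates on average and a negative MBE means it overestimates.
- **Drawbacks**: Positive and negative errors cancel out, so an MBE near zero can hide large individual mistakes. It is also not robust to outliers.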
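
A small sketch of the cancellation effect, using the same sign convention as the formula above (actual minus predicted). The function name `mbe` and the data are hypothetical.

```python
def mbe(y_true, y_pred):
    """Mean signed residual (y_i - y_hat_i); the sign indicates bias direction."""
    return sum(y - p for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]

# Large errors in both directions cancel to an MBE of zero.
scattered = [0.0, 20.0, 5.0, 15.0]  # residuals: 10, -10, 5, -5
print(mbe(y_true, scattered))       # 0.0

# Consistent overprediction gives a negative MBE under this convention.
too_high = [12.0, 11.0, 13.0, 12.0]  # residuals: -2, -1, -3, -2
print(mbe(y_true, too_high))         # -2.0
```

The first model has an average absolute error of 7.5 yet an MBE of zero, which is why MBE is best read alongside a magnitude-based measure such as MAE.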



### Log-Cosh Loss

- **Definition**: Log-cosh loss applies $\log(\cosh(x))$ to the error $x = y_i - \hat{y}_i$. It behaves like $x^2/2$ for small errors (smooth, like MSE) and like $|x| - \log 2$ for large errors (robust to outliers, like MAE).
- **Formula**:
  $$
  \text{Log-cosh} = \frac{1}{n}\sum_{i=1}^n \log(\cosh(y_i - \hat{y}_i))
  $$
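- **Advantages**: Smooth gradient everywhere (it is twice differentiable), symmetric, robust to outliers.
- **Drawbacks**: Computationally more complex than MSE or MAE, and a naive implementation can overflow for large errors because $\cosh$ grows exponentially. It has no parameter for tuning outlier sensitivity (unlike Huber's $\delta$), and its values are sometimes hard to interpret.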
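
Evaluating $\log(\cosh(x))$ literally can overflow for large errors. One common workaround, sketched below, uses the identity $\log(\cosh(x)) = |x| + \log(1 + e^{-2|x|}) - \log 2$. The function names are assumptions for this example, not part of any library.

```python
import math


def stable_log_cosh(x):
    """log(cosh(x)) via |x| + log1p(exp(-2|x|)) - log(2), avoiding overflow."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def log_cosh(y_true, y_pred):
    """Mean log-cosh loss over paired actual and predicted values."""
    return sum(stable_log_cosh(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# The literal formula fails for a large residual in standard Python floats.
try:
    math.log(math.cosh(1000.0))
except OverflowError:
    print("naive log(cosh(1000)) overflows")

print(log_cosh([1000.0], [0.0]))  # about 1000 - log(2), i.e. roughly 999.3069
print(log_cosh([0.5], [0.0]))     # matches math.log(math.cosh(0.5))
```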



### Mean Absolute Percentage Error (MAPE)

- **Definition**: Expresses prediction error as a percentage; good for reporting error relative to actual value magnitude.
- **Formula**:
  $$
  \text{MAPE} = \frac{100\%}{n}\sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}
  $$
- **Advantages**: Scale-free, interpretable as percent error.
- **Drawbacks**: Undefined when a true value is zero and unstable when true values are near zero, because each error is divided by $|y_i|$.
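
A minimal sketch of the MAPE formula. Raising an error on zero targets is one possible way to handle the undefined case and is a design choice made for this example; the function name and data are hypothetical.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = (abs(y - p) / abs(y) for y, p in zip(y_true, y_pred))
    return 100.0 * sum(ratios) / len(y_true)


# Both predictions are off by 10% of the true value.
print(mape([100.0, 200.0], [110.0, 180.0]))  # about 10 (percent)

# The same absolute error of 1.0 on a tiny target gives a huge percentage.
print(mape([0.01], [1.01]))  # on the order of 10,000 (percent)
```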


### Comparison Table: Regression Loss Functions

| Loss Function | Formula | Key Traits / When to Use | Drawbacks |
|---------------|---------|--------------------------|-----------|
| MSE | $\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2$ | Standard for regression, penalizes large errors, smooth optimization | Sensitive to outliers |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2}$ | Same units as the target, penalizes large errors, preferred when error magnitude must be reported | Sensitive to outliers |
| MAE | $\frac{1}{n}\sum_{i=1}^n \lvert y_i-\hat{y}_i \rvert$ | Simple, robust to outliers, easy to compute | Not differentiable at zero |
| Huber Loss | See above (piecewise) | Robust, combines smoothness and outlier resistance, stable optimization | Requires threshold tuning |
| MSLE | $\frac{1}{n}\sum_{i=1}^n (\log(y_i+1)-\log(\hat{y}_i+1))^2$ | Use when ratios matter, tolerant of huge absolute differences | Penalizes underprediction more |
| MBE | $\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)$ | Measures model bias, checks for systematic under- or overestimation | Errors cancel; can mislead |
| Log-Cosh | $\frac{1}{n}\sum_{i=1}^n \log(\cosh(y_i-\hat{y}_i))$ | Smooth, robust, symmetric | Hard to interpret, costlier to compute |
| MAPE | $\frac{100\%}{n}\sum_{i=1}^n \frac{\lvert y_i-\hat{y}_i \rvert}{\lvert y_i \rvert}$ | Scale-independent, percent error reporting | Unstable near zero true values |


### Practical Notes

- **Choosing a loss function depends on your data and the problem context.** Use MSE or RMSE when large errors are especially costly; MAE or Huber when outliers are present; MSLE when relative (ratio-based) error matters; and MBE to diagnose systematic bias.
- Understanding these trade-offs and choosing accordingly makes a model more reliable for the task at hand.
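
One way to see why the choice matters when outliers are present: if a model could only predict a single constant, MSE would be minimized by the mean of the targets and MAE by their median. The sketch below checks this by brute force on a made-up dataset. The grid search and helper names are illustrative assumptions, not a training procedure.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical targets; 100 is an outlier


def mse_const(c):
    """MSE when every prediction is the constant c."""
    return sum((y - c) ** 2 for y in data) / len(data)


def mae_const(c):
    """MAE when every prediction is the constant c."""
    return sum(abs(y - c) for y in data) / len(data)


candidates = [k / 10 for k in range(0, 1001)]  # 0.0, 0.1, ..., 100.0
best_mse = min(candidates, key=mse_const)
best_mae = min(candidates, key=mae_const)

print(best_mse, statistics.mean(data))    # 22.0 22.0
print(best_mae, statistics.median(data))  # 3.0 3.0
```

The outlier drags the MSE-optimal constant to 22, far from most of the data, while the MAE-optimal constant stays at 3.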
