# Loss Function Overview

For supervised learning, there are two main categories of learning objectives.

- Classification (discrete prediction)
- Regression (continuous prediction)

I will use the following notation and assume indexing starts at 1 for ease of writing.

- $N$ is the number of training inputs
- $i$ is an index of a training input
- $P$ is the number of parameters or weights in the model
- $j$ is an index of a parameter
- $w$ is a parameter/weight of the model
- $y_i$ is the ground truth label for the $i$-th input
- $\hat{y}_i$ is the predicted label for the $i$-th input

## Regression Losses

### L2 Loss 

This is also called Mean Squared Error (MSE) loss. It measures the average squared difference between predictions and actual observations.

$$
L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$
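The formula above can be sketched in plain Python. The function name `mse_loss` and the sample values are illustrative assumptions, not part of any particular library:

```python
def mse_loss(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical labels and predictions, chosen only for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Squared errors are 0.25, 0.25, 0.0, 1.0, so the mean is 1.5 / 4.
print(mse_loss(y_true, y_pred))  # 0.375
```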


Now add an L2 regularization term.

$$
L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{P} w_j^2
$$

$\lambda$ is the hyperparameter that controls the L2 regularization strength. A linear regression model that uses L2 regularization is called **Ridge Regression**. The penalty grows with the square of each weight, so its gradient is proportional to the weight itself: every update removes a fixed percentage of the weight. Because the shrinkage gets smaller as the weight gets smaller, L2 regularization pushes weights toward zero but generally does not drive them to exactly zero.
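To see where the percentage comes from: the gradient of $\lambda w_j^2$ is $2\lambda w_j$, so a gradient step on the penalty alone multiplies the weight by $(1 - 2\eta\lambda)$, where $\eta$ is the learning rate. A minimal sketch, assuming plain gradient descent and ignoring the data-loss gradient; the function name, learning rate, and starting weight are illustrative choices:

```python
def l2_penalty_step(w, lr, lam):
    """One gradient step on the penalty lam * w**2 alone (data loss ignored)."""
    return w - lr * 2.0 * lam * w  # equivalent to w * (1 - 2 * lr * lam)


w = 1.0
history = [w]
for _ in range(5):
    w = l2_penalty_step(w, lr=0.1, lam=0.5)
    history.append(w)

# With lr=0.1 and lam=0.5, each step keeps 90% of the previous weight,
# so the weight keeps shrinking but never reaches exactly zero.
print(history)
```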

**Behavior**

Intuitively, because the L2 loss squares the error, an example with a large error ($|e| > 1$) contributes $e^2$ rather than $|e|$, which is far more than it would contribute under the L1 loss. The model is therefore much more sensitive to that example and adjusts itself to reduce that error. If the example is an outlier, the model is fit to that single case at the expense of many common examples, whose errors are small by comparison.
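A small numeric illustration of this effect, using made-up residuals in which the last example is the outlier. It compares how much of the total loss that one example accounts for under each norm:

```python
# Hypothetical residuals (y_i - y_hat_i); the last one is the outlier.
errors = [0.5, -0.5, 0.5, 10.0]

squared = [e ** 2 for e in errors]
absolute = [abs(e) for e in errors]

# Fraction of the total loss contributed by the outlier alone.
outlier_share_l2 = squared[-1] / sum(squared)
outlier_share_l1 = absolute[-1] / sum(absolute)

print(f"L2: outlier is {outlier_share_l2:.1%} of the total loss")
print(f"L1: outlier is {outlier_share_l1:.1%} of the total loss")
```

With these values the outlier makes up roughly 99% of the squared-error total but roughly 87% of the absolute-error total, so minimizing the L2 loss is driven almost entirely by that one example.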

### L1 Loss

This is also called Mean Absolute Error (MAE) loss. It measures the average absolute difference between predictions and actual observations.

$$
L = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
$$
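As a plain-Python sketch of this formula (the name `mae_loss` and the sample values are illustrative assumptions):

```python
def mae_loss(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical labels and predictions, chosen only for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Absolute errors are 0.5, 0.5, 0.0, 1.0, so the mean is 2.0 / 4.
print(mae_loss(y_true, y_pred))  # 0.5
```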

Now add an L1 regularization term.

$$
L = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| + \lambda \sum_{j=1}^{P} \left| w_j \right|
$$

$\lambda$ is the hyperparameter that controls the L1 regularization strength. A linear regression model that uses L1 regularization is called **Lasso Regression**. The penalty grows with the absolute value of each weight, so it acts as a force that subtracts a constant amount from the weight at every update, regardless of the weight's size. This can drive weights to exactly zero, which is useful with sparse inputs because it reduces computational effort.

In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model.
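The constant-subtraction behavior can be sketched as follows. This is one common way to apply an L1 penalty step, not the only one: it moves the weight toward zero by `lr * lam` and clamps it at zero instead of letting it cross over (plain subgradient descent would overshoot and oscillate around zero). The function name, learning rate, and starting weight are illustrative assumptions, and the data-loss gradient is ignored:

```python
def l1_penalty_step(w, lr, lam):
    """Move w toward zero by a constant lr * lam, clamping at exactly 0.0."""
    step = lr * lam
    if abs(w) <= step:
        return 0.0  # the weight is removed from the model
    return w - step if w > 0 else w + step


w = 0.25
history = [w]
for _ in range(4):
    w = l1_penalty_step(w, lr=0.1, lam=1.0)
    history.append(w)

# The weight drops by about 0.1 per step and then stays at exactly 0.0.
print(history)
```

Unlike the multiplicative shrinkage of an L2 penalty, the amount subtracted here does not depend on the size of the weight, which is why small weights reach exactly zero after a finite number of steps.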

### Comparison

## Classification Losses