# Regression
- Classification models output a probability distribution over classes, describing "what" something is
- Regression models predict a specific numeric value from the input, for instance the output of a PDE
- Regression therefore needs more granular output, a different loss function, and a different activation function in the output layer
- The training data must also contain scalar target values rather than target classes

## Linear Activation Output Function
- The output layer uses a linear activation, i.e. y = x, which leaves its input unmodified and so preserves the numeric value rather than scaling it
- The derivative of this function is 1
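
As a minimal sketch of the linear activation described above: the forward pass returns its input unchanged, and the backward pass hands the upstream gradient straight through because the derivative is 1. The class name `ActivationLinear` and its attribute names are illustrative assumptions, not taken from these notes.

```python
import numpy as np


class ActivationLinear:
    """Hypothetical linear (identity) activation for a regression output layer."""

    def forward(self, inputs):
        # y = x: the input is passed through unchanged
        self.inputs = inputs
        self.output = inputs

    def backward(self, dvalues):
        # dy/dx = 1, so the chain rule gives dvalues * 1
        self.dinputs = dvalues.copy()


activation = ActivationLinear()
activation.forward(np.array([[-1.5, 0.0, 2.7]]))
activation.backward(np.array([[0.1, -0.2, 0.3]]))
print(activation.output)
print(activation.dinputs)
```

Unlike a sigmoid or ReLU, negative and large values both survive the forward pass, which is what a regression target needs.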

## Mean Squared Error Loss
- MSE is one of the two main methods for calculating error in regression
- Square the difference between the predicted and true value of each output, then **average** the squared values when there are multiple outputs
- y is the target, y-hat is the prediction, i is the current sample, j is the current output within the sample, and J is the number of outputs

![image.png](attachment:image.png)

- For instance, with 2 neurons in the output layer, the loss for sample i is the average of the two neurons' squared differences. For a batch of 100 samples, repeat this for each sample; the batch loss is then the average of the 100 per-sample losses

- The partial derivative of the loss for sample i with respect to output j:

![image-2.png](attachment:image-2.png)
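
For reference, this derivative follows from the MSE definition by the chain rule. The sketch below uses the symbols defined earlier, writing $\hat{y}$ for y-hat; the attached images may use slightly different notation.

```latex
L_i = \frac{1}{J} \sum_{j=1}^{J} \left( y_{i,j} - \hat{y}_{i,j} \right)^2

\frac{\partial L_i}{\partial \hat{y}_{i,j}}
  = \frac{1}{J} \cdot 2 \left( y_{i,j} - \hat{y}_{i,j} \right) \cdot (-1)
  = -\frac{2}{J} \left( y_{i,j} - \hat{y}_{i,j} \right)
```

Only the j-th term of the sum depends on $\hat{y}_{i,j}$, which is why the sum disappears; the factor $-1$ comes from differentiating the inner term $y_{i,j} - \hat{y}_{i,j}$ with respect to $\hat{y}_{i,j}$.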

In [5]:
# Per-sample losses use axis=-1, which takes the mean across the outputs (the mean of each row) for every sample separately
import numpy as np

# e.g. 3 samples, and the output layer has 3 neurons
output = np.array([[2, 3, 5],
                   [4, 6, 1],
                   [2, 4, 1]])
expected = np.array([[3, 4, 5],
                     [4, 7, 1],
                     [2, 4, 1]])

# axis=-1 averages along the last axis, i.e. across the columns of each row
sample_losses = np.mean((expected - output)**2, axis=-1)
print(sample_losses)


[0.66666667 0.33333333 0.        ]
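
The MSE partial derivative from this section can be sketched the same way, reusing the example values above. The variable names (`dinputs`, `samples`, `outputs`) are illustrative assumptions. Dividing by the batch size at the end is also an assumed convention, chosen so that the gradient corresponds to the batch-mean loss.

```python
import numpy as np

# Same illustrative values as in the cell above, stored as floats
output = np.array([[2., 3., 5.],
                   [4., 6., 1.],
                   [2., 4., 1.]])
expected = np.array([[3., 4., 5.],
                     [4., 7., 1.],
                     [2., 4., 1.]])

samples = output.shape[0]   # number of samples in the batch
outputs = output.shape[1]   # J, the number of output neurons

# dL_i / dy_hat_ij = -2 * (y_ij - y_hat_ij) / J
dinputs = -2 * (expected - output) / outputs

# Assumed convention: normalize by the batch size to match the batch-mean loss
dinputs = dinputs / samples
print(dinputs)
```

The gradient is negative wherever the prediction is below the target, so a gradient-descent step would push those predictions upward.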


## Mean Absolute Error Loss
- Rather than squaring the difference between y_pred and y_true, take its absolute value
- Analogous to L1 versus L2 regularization: MAE penalizes errors linearly, while MSE penalizes them quadratically, so large errors weigh more heavily under MSE
- MAE loss is used less often than MSE

![image.png](attachment:image.png)
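
To make the linear versus non-linear penalty concrete, the sketch below computes per-sample MAE and MSE for two hypothetical samples, one with small errors and one with a single large error. All values are made up for illustration.

```python
import numpy as np

# Hypothetical values: the first sample is slightly off on both outputs,
# the second is exact on one output and far off on the other
y_true = np.array([[1.0, 2.0],
                   [1.0, 2.0]])
y_pred = np.array([[1.5, 2.5],
                   [1.0, 12.0]])

mae = np.mean(np.abs(y_true - y_pred), axis=-1)   # per-sample MAE
mse = np.mean((y_true - y_pred) ** 2, axis=-1)    # per-sample MSE

print(mae)
print(mse)
```

With these values the MAE of the second sample is 10 times that of the first (5.0 versus 0.5), while its MSE is 200 times larger (50.0 versus 0.25).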

- The derivative of a sample loss with respect to a predicted output is 1 if the difference (predicted minus target) is greater than 0, or -1 if it is less than 0, divided by the number of output neurons

![image-2.png](attachment:image-2.png)
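
A minimal sketch of this derivative for a single sample, with made-up values. It assumes the difference is taken as predicted minus target, which is the sign that results from differentiating |y - y_hat| with respect to y_hat. Where the two are equal the derivative is undefined; `np.sign` returns 0 there, which this sketch simply accepts.

```python
import numpy as np

# Hypothetical single sample with 3 outputs
y_true = np.array([[1.0, 2.0, 3.0]])
y_pred = np.array([[2.0, 2.0, 1.0]])

outputs = y_pred.shape[1]   # J, the number of output neurons

# d|y - y_hat| / dy_hat = sign(y_hat - y), divided by J from the mean
dinputs = np.sign(y_pred - y_true) / outputs
print(dinputs)
```

Unlike the MSE gradient, the magnitude here does not depend on how large the error is, only on its direction.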