## Lecture 5 - Linear Regression

### 4. Empirical Risk

-----

#### Compute Hinge Loss

The empirical risk $R_n$ is defined as

$$R_ n(\theta ) = \frac{1}{n} \sum _{t=1}^{n} \text {Loss}(y^{(t)} - \theta \cdot x^{(t)})$$

where $(x^{(t)}, y^{(t)})$ is the $t$-th training example (there are $n$ in total), and $\text{Loss}$ is some loss function, such as hinge loss.

Recall from a previous lecture the definition of hinge loss:

$$\text {Loss}_ h(z) = \begin{cases}  0 & \text {if } z \geq 1 \\ 1 - z & \text {otherwise} \end{cases}$$

In this problem, we calculate the empirical risk with hinge loss for a specific $\theta$ and a specific training set $\big \{ (x^{(t)}, y^{(t)})\big \} _{t=1,...,n}$. Assume we have 4 training examples (i.e. $n=4$), where $x^{(t)}\in \mathbb {R}^3$ and $y^{(t)}$ is a scalar. The training examples $\big \{ (x^{(t)}, y^{(t)})\big \} _{t=1,2,3,4}$ are given as follows:

![4-1](Media/4-1.PNG)

Also, we have $\theta = \big [ 0,1,2\big ]^ T$. Compute the value of

$$R_ n(\theta ) = \frac{1}{4} \sum _{t=1}^{4} \text {Loss}_ h(y^{(t)} - \theta \cdot x^{(t)}).$$

```python
import numpy as np

# Training inputs: one example per column after the transpose, shape (3, 4)
xt: np.ndarray = np.asarray(
    [
        [ 1, 0,  1],
        [ 1, 1,  1],
        [ 1, 1, -1],
        [-1, 1,  1]
    ]
).T

# Training targets, shape (1, 4)
yt: np.ndarray = np.asarray([[2, 2.7, -0.7, 2]])

# Linear regression parameters, shape (1, 3)
theta: np.ndarray = np.asarray([[0, 1, 2]])

# Hinge loss
def hinge_loss(z: np.ndarray) -> np.ndarray:
    # Loss is 1 - z, except where z >= 1, where it is 0
    return np.where(z >= 1, 0.0, 1 - z)

# Number of training examples
n = xt.shape[1]

# Empirical risk: average loss over the residuals y - theta . x
Rn: float = (1/n) * np.sum(hinge_loss(yt - np.dot(theta, xt)))

# Final answer
print("Answer:", Rn)
```

Answer: 1.25
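
As a check by hand, the per-example values behind this result can be worked out directly. The symbol $z^{(t)} = y^{(t)} - \theta \cdot x^{(t)}$ is shorthand introduced only for this illustration:

$$\theta \cdot x^{(t)} = 2,\ 3,\ -1,\ 3 \quad \Rightarrow \quad z^{(t)} = 0,\ -0.3,\ 0.3,\ -1$$

$$\text {Loss}_ h(z^{(t)}) = 1,\ 1.3,\ 0.7,\ 2 \quad \Rightarrow \quad R_ n(\theta ) = \frac{1 + 1.3 + 0.7 + 2}{4} = 1.25$$

None of the four values of $z^{(t)}$ reaches 1, so every example contributes $1 - z^{(t)}$ to the sum.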


#### Compute Squared Error Loss

Now, we will calculate the empirical risk with the squared error loss. Remember that the squared error loss is given by:

$$\displaystyle \text {Loss}(z) = \frac{z^2}{2}$$

The 4 training examples and the $\theta$ parameters are as in the previous problem. Compute the empirical risk.

```python
# Squared error loss
def squared_error_loss(z: np.ndarray) -> np.ndarray:
    return z**2 / 2

# Empirical risk, reusing n, xt, yt and theta from the previous cell
Rn: float = (1/n) * np.sum(squared_error_loss(yt - np.dot(theta, xt)))

# Final answer
print("Answer:", Rn)
```

Answer: 0.1475
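
The same result can be checked by hand. For the four examples, the differences $y^{(t)} - \theta \cdot x^{(t)}$ are $0,\ -0.3,\ 0.3,\ -1$, so:

$$\frac{\big (y^{(t)} - \theta \cdot x^{(t)}\big )^2}{2} = 0,\ 0.045,\ 0.045,\ 0.5 \quad \Rightarrow \quad R_ n(\theta ) = \frac{0 + 0.045 + 0.045 + 0.5}{4} = 0.1475$$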


### Empirical Risk and Model Performance

![4-2](Media/4-2.PNG)

-----------

### 5. Gradient Based Approach

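The gradient of the squared error loss on a single training example $(x^{(t)}, y^{(t)})$ follows from the chain rule:

$$\nabla_\theta \frac{(y^{(t)} - \theta \cdot x^{(t)})^2}{2} = (y^{(t)} - \theta \cdot x^{(t)}) \, \nabla_\theta (y^{(t)} - \theta \cdot x^{(t)}) = -(y^{(t)} - \theta \cdot x^{(t)}) \, x^{(t)}$$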
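
One way to sanity-check a gradient formula like this is to compare it with a finite-difference approximation. The sketch below does that for a single example. The function names (`squared_loss`, `analytic_gradient`, `numeric_gradient`) and the step size `eps` are illustrative choices, not part of the lecture.

```python
import numpy as np

def squared_loss(theta, x, y):
    # Squared error loss on one example: (y - theta . x)^2 / 2
    return 0.5 * (y - theta @ x) ** 2

def analytic_gradient(theta, x, y):
    # Gradient from the formula above: -(y - theta . x) x
    return -(y - theta @ x) * x

def numeric_gradient(theta, x, y, eps=1e-6):
    # Central finite differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        loss_plus = squared_loss(theta + step, x, y)
        loss_minus = squared_loss(theta - step, x, y)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Second training example and theta from the empirical risk exercise
theta = np.array([0.0, 1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
y = 2.7

grad_analytic = analytic_gradient(theta, x, y)
grad_numeric = numeric_gradient(theta, x, y)
print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

Here the prediction $\theta \cdot x = 3$ overshoots $y = 2.7$, so the gradient is a positive multiple of $x$ and a step against it lowers the prediction.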

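Algorithm (stochastic gradient descent):
- Initialize $\theta = 0$
- Randomly pick an example $t \in \{1, \dots, n\}$
- Update $\theta \leftarrow \theta + \alpha \, (y^{(t)} - \theta \cdot x^{(t)}) \, x^{(t)}$, where $\alpha$ is the learning rate
- Repeat the last two steps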

The learning rate $\alpha$ controls how large a step we take in the direction opposite to the gradient. It can be made a function of the iteration number $k$, for example

$$\alpha_k = \frac{1}{1+k}$$

With this schedule, the steps get smaller and smaller the longer the algorithm runs.

Note that this algorithm is self-correcting. For example, if the prediction $\theta \cdot x^{(t)}$ is much smaller than $y^{(t)}$, the difference $y^{(t)} - \theta \cdot x^{(t)}$ is positive, so the update adds a positive multiple of $x^{(t)}$ to $\theta$. This increases $\theta \cdot x^{(t)}$ and narrows the gap between the prediction and the ground truth.
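
The update rule and the decaying learning rate can be sketched in a few lines of NumPy. This is a minimal illustration on the four training examples from the empirical risk exercise, not a tuned implementation. The names `sgd_linear_regression` and `empirical_risk` and the parameters `num_steps` and `seed` are assumptions made for this sketch.

```python
import numpy as np

def sgd_linear_regression(X, y, num_steps=5000, seed=0):
    # X has one training example per row, y holds the targets
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                      # initialize theta = 0
    for k in range(num_steps):
        t = rng.integers(n)                  # pick a random example
        alpha = 1.0 / (1.0 + k)              # decaying learning rate
        residual = y[t] - theta @ X[t]       # y - theta . x
        theta = theta + alpha * residual * X[t]
    return theta

def empirical_risk(theta, X, y):
    # Average squared error loss (y - theta . x)^2 / 2
    return np.mean((y - X @ theta) ** 2) / 2

X = np.array([[1, 0, 1], [1, 1, 1], [1, 1, -1], [-1, 1, 1]], dtype=float)
y = np.array([2.0, 2.7, -0.7, 2.0])

theta_sgd = sgd_linear_regression(X, y)
risk_start = empirical_risk(np.zeros(3), X, y)
risk_sgd = empirical_risk(theta_sgd, X, y)
print("theta after SGD:", theta_sgd)
print("risk at theta = 0:", risk_start)
print("risk after SGD:", risk_sgd)
```

Because each step uses a single random example, individual updates can temporarily increase the risk on the full training set. The shrinking learning rate damps this effect over time.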

![4-3](Media/4-3.PNG)
![4-4](Media/4-4.PNG)

-----

### 6. Closed Form Solution

Notes:
- The empirical risk with squared error loss is a convex, quadratic function of $\theta$, so setting its gradient to zero gives a linear system that can be solved in closed form.
- The gradient of the empirical risk has the form $A\theta - b$, so the minimizer satisfies $A\theta = b$. If $A$ is invertible, we get the exact solution $\hat{\theta} = A^{-1} b$. However, $A$ is invertible only if the training examples span the whole feature space. In practice this holds when the number of training examples is substantially larger than the dimensionality of the feature vector.

$$R_ n(\theta ) = \frac{1}{n} \sum _{t=1}^{n} \frac{(y^{(t)} - \theta \cdot x^{(t)})^2}{2},$$

$$\nabla R_ n(\theta ) = A\theta - b = 0 \quad \text {where } \,  A = \frac{1}{n} \sum _{t=1}^{n} x^{(t)} ( x^{(t)})^ T,\,  b = \frac{1}{n} \sum _{t=1}^{n} y^{(t)} x^{(t)}.$$

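- Getting the closed form solution can be computationally costly: $A$ is a $d \times d$ matrix, where $d$ is the dimensionality of the feature vector, and inverting it takes roughly $O(d^3)$ operations. When $d$ is large, a gradient based approach is often the more practical choice.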
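
As an illustration, the closed form solution can be computed for the four training examples from the empirical risk exercise. The sketch below builds $A$ and $b$ as defined above and solves $A\theta = b$ with `np.linalg.solve`, which avoids forming the inverse explicitly. The variable names (`theta_hat`, `theta_given`) and the helper `empirical_risk` are choices made for this example.

```python
import numpy as np

# One training example per row
X = np.array([[1, 0, 1], [1, 1, 1], [1, 1, -1], [-1, 1, 1]], dtype=float)
y = np.array([2.0, 2.7, -0.7, 2.0])
n = X.shape[0]

# A = (1/n) sum_t x x^T  and  b = (1/n) sum_t y x
A = X.T @ X / n
b = X.T @ y / n

# Solve A theta = b (requires A to be invertible)
theta_hat = np.linalg.solve(A, b)

def empirical_risk(theta):
    # Average squared error loss (y - theta . x)^2 / 2
    return np.mean((y - X @ theta) ** 2) / 2

theta_given = np.array([0.0, 1.0, 2.0])
print("theta_hat:", theta_hat)
print("risk at theta_hat:", empirical_risk(theta_hat))
print("risk at [0, 1, 2]:", empirical_risk(theta_given))
```

The risk at `theta_hat` cannot exceed the risk at any other parameter vector, including the $\theta = [0, 1, 2]^T$ used earlier, because `theta_hat` minimizes the empirical risk on this training set.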


---

### 7. Generalization and Regularization
### 8. Regularization

Notes:
- What happens if you don't have enough training examples? What happens if your training data is noisy? In both cases, you use regularization.
- Regularization pushes you away from trying to fit the training examples perfectly. It makes the algorithm "lazy": the parameters stay close to a default value of 0 unless the data provide strong evidence to move them.
- To do this, we add a new term to the empirical risk: the squared norm of the parameters.
- The relative contribution of the regularization term is controlled by $\lambda$.
- $R_n$ pulls $\theta$ toward values that fit the training data as well as possible. The penalty $||\theta||^2$ is never negative and grows as $\theta$ moves away from 0, so it pulls $\theta$ back toward 0. As a result, small amounts of noise have little effect on the estimate: only a substantial push from the data can overcome the regularization term.

$$J_{\lambda, n}(\theta) = \frac{\lambda}{2}||\theta||^2 + R_n(\theta)$$

- By increasing $\lambda$, we say that we still care about the empirical risk, but we also care about keeping $\theta$ as small as possible. The higher $\lambda$ is, the worse the fit to the training data becomes. We accept this because we no longer want every tiny change in the training data to cause a significant change in the parameters. We want the model to generalize better, so we introduce a base value around which the parameters settle, and they move away from it only when there is substantial evidence supporting the change.
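
The effect of $\lambda$ can be seen numerically. With squared error loss, setting the gradient of $J_{\lambda, n}$ to zero gives $\lambda\theta + A\theta - b = 0$, i.e. $(\lambda I + A)\theta = b$, with $A$ and $b$ as in the closed form solution. The sketch below solves this system for a few values of $\lambda$ on the four training examples used earlier. The helper names (`regularized_solution`, `empirical_risk`) and the chosen values of $\lambda$ are assumptions made for illustration.

```python
import numpy as np

# One training example per row
X = np.array([[1, 0, 1], [1, 1, 1], [1, 1, -1], [-1, 1, 1]], dtype=float)
y = np.array([2.0, 2.7, -0.7, 2.0])
n, d = X.shape

A = X.T @ X / n
b = X.T @ y / n

def regularized_solution(lam):
    # Minimizer of (lam / 2) ||theta||^2 + R_n(theta):
    # solve (lam I + A) theta = b
    return np.linalg.solve(lam * np.eye(d) + A, b)

def empirical_risk(theta):
    # Average squared error loss (y - theta . x)^2 / 2
    return np.mean((y - X @ theta) ** 2) / 2

lambdas = [0.0, 0.1, 1.0, 10.0]
norms, risks = [], []
for lam in lambdas:
    theta = regularized_solution(lam)
    norms.append(np.linalg.norm(theta))
    risks.append(empirical_risk(theta))
    print(f"lambda={lam:5.1f}  ||theta||={norms[-1]:.3f}  R_n={risks[-1]:.4f}")
```

As $\lambda$ grows, the norm of $\theta$ shrinks toward 0 and the empirical risk on the training set rises, which is the trade-off described in the notes above.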