# Linear Regression

<hr>

**Objective: *Minimize Empirical Risk***

$R_n(\theta)$ is the empirical risk, where $n$ is the number of training samples and $\theta$ is the parameter vector. It is defined as follows:

$R_n (\theta) = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \theta x^{(i)})^2 / 2$
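As a minimal sketch, the empirical risk can be computed with NumPy. The function name `empirical_risk`, the layout of `X` as one sample per row, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average of the halved squared residuals over all samples."""
    residuals = y - X @ theta          # y^(i) - theta . x^(i) for every sample i
    return np.mean(residuals ** 2) / 2

# Toy data (made up for illustration): y = 2 * x exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

risk_at_zero = empirical_risk(np.array([0.0]), X, y)   # (4 + 16 + 36) / 3 / 2
risk_at_true = empirical_risk(np.array([2.0]), X, y)   # all residuals are zero
```

With $\theta = 0$ the residuals equal the targets themselves, while the parameter that generated the data gives a risk of exactly zero.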

In practice, a low empirical risk on the training set is not the end goal: the aim is a low risk on data outside the training set, i.e. a model that generalizes.

There are two basic sources of error when applying linear regression:

1. **Structural**: The relationship between the dependent and independent variables is non-linear, so no linear model can fit it well. The remedy is to consider a much broader family of non-linear functions as the estimator.

2. **Estimation**: The training data is too limited to determine the parameters reliably, so the fitted model does not generalize well to the population.

<hr>

**Algorithm 1: *Gradient Descent***

Gradient of the loss on a single sample $t$ (one term of the empirical risk):

$\nabla_{\theta} (y^{(t)} - \theta x^{(t)})^2 / 2 = - (y^{(t)} - \theta x^{(t)}) \cdot x^{(t)}$

Process: 

- Initialize $\theta = 0$
- Randomly pick $t \in \{ 1, \dots, n \}$
- Update $\theta := \theta - \eta \cdot [- (y^{(t)} - \theta x^{(t)}) \cdot x^{(t)}]$, where $\eta$ is the learning rate. It can be set to $\eta_k = \frac{1}{1+k}$, where $k$ is the iteration number, so that the algorithm takes smaller corrective steps the longer it has been running
- Repeat the pick and update steps, incrementing $k$ each time
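The process above can be sketched in a few lines of NumPy. Because each update uses one randomly chosen sample, this is the stochastic variant of gradient descent. The function name `sgd_linear_regression`, the fixed step count `n_steps`, the seeds, and the noise-free toy data are all assumptions for illustration, not part of the notes.

```python
import numpy as np

def sgd_linear_regression(X, y, n_steps=20000, seed=0):
    """Single-sample gradient descent with learning rate eta_k = 1 / (1 + k)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                        # initialize theta = 0
    for k in range(n_steps):
        t = rng.integers(n)                    # random sample index (0-based here)
        eta = 1.0 / (1.0 + k)                  # decaying learning rate
        grad = -(y[t] - theta @ X[t]) * X[t]   # gradient of the single-sample loss
        theta = theta - eta * grad
    return theta

# Toy data (assumed): noise-free linear targets with features in [-1, 1]
data_rng = np.random.default_rng(42)
theta_true = np.array([2.0, -1.0])
X = data_rng.uniform(-1.0, 1.0, size=(200, 2))
y = X @ theta_true

theta_hat = sgd_linear_regression(X, y)
print(theta_hat)   # should move from 0 toward theta_true
```

The features are kept in $[-1, 1]$ so that the large early steps (with $\eta_0 = 1$) cannot overshoot badly. With this schedule the steps shrink quickly, so convergence is slow; a different schedule may work better on real data.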

****

**Closed form solution: *Linear Algebra***

$\hat \theta = (X^T X)^{-1} X^T Y$

Necessary conditions:

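- The number of training samples, $n$, must be at least the dimensionality, $d$
- The matrix $X^T X$ must be invertible, which holds exactly when the columns of $X$ are linearly independent (i.e. $X$ has full column rank). $X$ itself is $n \times d$ and generally not square, so it does not need to be invertible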
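A minimal sketch of the closed-form solution, assuming $X^T X$ is invertible. The function name `least_squares_closed_form` and the toy data are hypothetical. The rank check is one possible way to test the invertibility condition, and the linear system is solved directly instead of forming the inverse explicitly.

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    n, d = X.shape
    # X^T X is invertible only if the d columns of X are linearly independent
    if np.linalg.matrix_rank(X) < d:
        raise ValueError("X^T X is singular: X does not have full column rank")
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (assumed): n = 3 samples, d = 2 features, noise-free targets
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
theta_true = np.array([3.0, -2.0])
y = X @ theta_true

theta_hat = least_squares_closed_form(X, y)
```

Since the targets here contain no noise, `theta_hat` recovers `theta_true` up to floating-point error.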

****

**Generalization & Regularization**

Regularization helps prevent the estimated parameters from overfitting the training data and improves the model's ability to generalize beyond the training set. It does so by adding a penalty term, $\frac{\lambda}{2} \Vert \theta \Vert^2$, to the loss function. Intuitively, the penalty pushes $\theta$ toward 0 unless there is a strong pattern in the data to justify otherwise.


Regularization example, *Ridge Regression*

Loss function with regularization,

$J_{\lambda, n} (\theta) = \frac{\lambda}{2} \Vert \theta \Vert^2 + R_n (\theta)$

Gradient of the regularized loss, evaluated on a single sample $t$:

$\nabla_{\theta} J_{\lambda, n} (\theta) = \lambda \theta - (y^{(t)} - \theta x^{(t)}) \cdot x^{(t)}$

Then we update $\theta$:

$\theta := \theta - \eta \cdot [\lambda \theta - (y^{(t)} - \theta x^{(t)}) \cdot x^{(t)}] = (1 - \eta \lambda) \cdot \theta - \eta \cdot [- (y^{(t)} - \theta x^{(t)}) \cdot x^{(t)}]$

The second term of the update is the same as before, but the first term has changed from $\theta$ to $(1 - \eta \lambda) \cdot \theta$, so every learning step shrinks $\theta$ toward zero (for $0 < \eta \lambda < 1$).
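The two forms of the ridge update can be checked numerically. This is only an illustrative sketch: the function name `ridge_sgd_step` and all numeric values are made up.

```python
import numpy as np

def ridge_sgd_step(theta, x_t, y_t, eta, lam):
    """One regularized update, written in the 'shrink, then correct' form."""
    grad_loss = -(y_t - theta @ x_t) * x_t           # same gradient as without regularization
    return (1.0 - eta * lam) * theta - eta * grad_loss

# Arbitrary values chosen for illustration
theta = np.array([1.0, -2.0])
x_t = np.array([0.5, 1.5])
y_t = 1.0
eta, lam = 0.1, 0.5

# Direct form: theta - eta * (lambda * theta - (y - theta . x) * x)
direct = theta - eta * (lam * theta - (y_t - theta @ x_t) * x_t)
shrunk = ridge_sgd_step(theta, x_t, y_t, eta, lam)

# With lam = 0 the step reduces to the unregularized update
plain = theta - eta * (-(y_t - theta @ x_t) * x_t)
unregularized = ridge_sgd_step(theta, x_t, y_t, eta, 0.0)
```

Both `direct` and `shrunk` hold the same vector, which confirms that regularization only rescales $\theta$ by $(1 - \eta \lambda)$ before the usual correction is applied.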

As $\lambda$ increases, $\theta$ is pushed harder toward zero, which reduces overfitting and can improve generalization, although too large a $\lambda$ leads to underfitting. When $\lambda$ is zero there is no regularization, so the model may overfit and generalize poorly.
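To see the effect of $\lambda$ without the randomness of single-sample updates, one option is to compute the exact minimizer of $J_{\lambda, n}$. The closed form used below is not stated in these notes. It comes from setting the full gradient $\lambda \theta - \frac{1}{n} X^T (Y - X \theta)$ to zero, which gives $\hat \theta = (X^T X + n \lambda I)^{-1} X^T Y$. The function name `ridge_minimizer` and the synthetic data are assumptions.

```python
import numpy as np

def ridge_minimizer(X, y, lam):
    """Minimizer of (lam / 2) * ||theta||^2 + R_n(theta), derived as described above."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Synthetic data (assumed): 50 samples, 3 features, small Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lambdas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_minimizer(X, y, lam)) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:5.1f}   ||theta|| = {norm:.4f}")
```

The printed norms shrink as $\lambda$ grows, matching the intuition that a larger penalty pulls $\theta$ closer to zero.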

<hr>

# Basic code
A `minimal, reproducible example`