# Linear Regression

In this lecture we introduce linear regression, one of the most fundamental models in supervised
learning. Despite its apparent simplicity, linear regression provides a unifying framework for
many key ideas in Machine Learning, including loss minimisation, optimisation, probabilistic
modelling, and feature representation.  

Like all supervised learning models, linear regression starts from a training set of $ n $ input-output pairs $ (x_i, y_i) $. At prediction time, given a new input $ x^* $, we want to compute a prediction  
$ \hat{y}^* = \hat{f}(x^*) $  
that approximates the unknown true output $ y^* $. A good fit on the training set alone is therefore not sufficient: the model must also generalise to inputs it has not seen.  

A common modelling assumption is that the observed output is generated by an unknown deterministic function $ f(x) $, which we want to model, plus a random noise term $ \epsilon $ that is independent of $ x $:  
$ y = f(x) + \epsilon $  
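As a rough illustration of this assumption, the sketch below simulates a dataset from a deterministic function plus independent noise. The specific function `f`, the Gaussian noise and all numeric values are hypothetical choices made for the example; in a real problem $ f $ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical "true" function; in practice f is unknown.
    return 2.0 + 3.0 * x

n = 50
x = rng.uniform(-1.0, 1.0, size=n)      # inputs
epsilon = rng.normal(0.0, 0.5, size=n)  # noise drawn independently of x (assumed Gaussian)
y = f(x) + epsilon                      # observed outputs
```

Only `x` and `y` would be available to the learner; `f` and `epsilon` are visible here because the data are simulated.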

To make the problem tractable, we restrict the class of admissible predictors and assume a model that is *linear in the parameters*.

### Affine predictor

We define the linear regression model as  
$ \hat{y}(x,\theta) = \theta^T x $  
where $ x = [1, x_1, \dots, x_p]^T $ and $ \theta = [\theta_0, \theta_1, \dots, \theta_p]^T $.  
The first component of $ x $ is 1, so the model can learn an intercept term $ \theta_0 $. This makes the predictor *affine* rather than purely linear in the original features.  
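A minimal sketch of the affine predictor for a single sample, assuming NumPy. The function name `predict` and the numeric values are illustrative and not part of the lecture.

```python
import numpy as np

def predict(x_features, theta):
    """Affine prediction theta^T x for one sample, with x = [1, x_1, ..., x_p]."""
    x = np.concatenate(([1.0], x_features))  # prepend the constant 1 for the intercept
    return float(theta @ x)

theta = np.array([0.5, 2.0, -1.0])   # illustrative values: theta_0 = 0.5 is the intercept
x_features = np.array([3.0, 4.0])    # original features x_1, x_2
y_hat = predict(x_features, theta)   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```

With all features set to zero the prediction reduces to the intercept $ \theta_0 $, which is what distinguishes an affine map from a purely linear one.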

### Measurement model
We now model the data-generating process by writing the observed output $ y $ as the sum of a systematic linear component (the parameter vector $ \theta $, transposed and multiplied by the input vector $ x $) and a random noise term $ \epsilon $:  
$ y = \theta^T x + \epsilon $

### Prediction
For a given **new** input $ x^* $, the prediction is $ \hat{y}^* = \hat{\theta}^T x^* $, where $ \hat{\theta} $ denotes the parameter vector learned from the training data.  
We can now define the *residual* $ r_i = \hat{y}_i - y_i $ as the difference between the predicted value $ \hat{y}_i $ for a given sample and its actual output value $ y_i $.  

Remember that for *training* we use matrix notation for both inputs and outputs, stacking the $ n $ samples as the rows of $ X $ and the outputs in the vector $ Y $:  
Measurement model: $ Y = X\theta + \epsilon $  
Affine predictor: $ \hat{Y} = X\theta $  
where the first column of $ X $ is all ones, which makes the predictor affine (the intercept is learned jointly with the other parameters).  
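The matrix form can be sketched as follows, assuming NumPy. The data and the candidate `theta` are made up for the example; the point is only how the column of ones is added and how all predictions and residuals are obtained in one product.

```python
import numpy as np

# Illustrative data: n = 4 samples, p = 2 features (values are made up).
features = np.array([[1.0, 2.0],
                     [0.0, 1.0],
                     [3.0, 1.0],
                     [2.0, 2.0]])
Y = np.array([3.0, 1.0, 8.0, 6.0])

# Design matrix: prepend a column of ones so theta_0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])  # shape (n, p + 1)

theta = np.array([0.0, 2.0, 0.5])  # some candidate parameter vector
Y_hat = X @ theta                  # all n predictions at once
r = Y_hat - Y                      # residuals r_i = theta^T x_i - y_i
```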

# Loss function $ L(\hat{y},y) $ and Cost function $J(\theta)$

To train a linear regression model we need to define what makes a prediction *good* or *bad*.  
We use a **loss function** $ L(\hat{y},y) $ to measure the discrepancy between the *actual* output $ y $ and the *predicted* output $ \hat{y} $ for *a single training example*.  
For regression problems a common choice is the *squared error loss*:  
$ L(\hat{y},y) = (\hat{y} - y)^2 = (\theta^{T}x - y)^2 $  
This loss is 0 when the predicted value equals the observed one, and it grows quadratically with the distance between them, so large errors weigh on learning far more than small ones.  
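A small sketch of the squared error loss on single examples. The function name `squared_error_loss` and the numbers are chosen for illustration only.

```python
def squared_error_loss(y_hat, y):
    """Squared error loss for a single example."""
    return (y_hat - y) ** 2

# Doubling the error quadruples the loss.
loss_exact = squared_error_loss(3.0, 3.0)   # error 0 -> loss 0.0
loss_small = squared_error_loss(4.0, 3.0)   # error 1 -> loss 1.0
loss_large = squared_error_loss(5.0, 3.0)   # error 2 -> loss 4.0
```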

However, we do not train the predictor on a single sample but on the whole dataset. Since the *loss function* is defined for a single sample, we define a **cost function** as the aggregation of the *losses* over all the training data.  
So for a dataset made of $ n $ samples, the cost function is:  
$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}(x_i,\theta),y_i) $  
and if $ L $ is the *squared error loss*:  
$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\theta^{T}x_i - y_i)^2 = \frac{1}{n} || X\theta - Y ||^2 $  
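To check that the sum over samples and the squared-norm expression agree, here is a sketch assuming NumPy. The helper name `cost` and the toy data are illustrative.

```python
import numpy as np

def cost(theta, X, Y):
    """Mean squared error cost J(theta) = (1/n) * ||X theta - Y||^2."""
    residuals = X @ theta - Y
    return float(residuals @ residuals) / len(Y)

# Made-up data; the first column of X is all ones (intercept).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([1.0, 3.0, 4.0])
theta = np.array([1.0, 1.0])

# The same cost written as an explicit average of per-sample losses.
J_sum = sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, Y)) / len(Y)
J_matrix = cost(theta, X, Y)
```

Here the residuals are $ 0, -1, -1 $, so both expressions evaluate to $ 2/3 $.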

### Least squares  
Notice that the cost function for the squared error loss depends on the *sum of squared residuals*: in matrix notation, the $i$-th row of $ X\theta - Y $ is the residual of the $i$-th sample, $ r_i = \theta^{T}x_i - y_i $.  
To build a predictor whose values are as close as possible to the observed outputs, we need to minimise the cost with respect to the parameter vector $ \theta $, which means minimising the sum of the squared residuals. This is why training a linear regressor under these assumptions is called the *least squares problem*. The minimising parameter vector is: $ \hat{\theta} = \arg\min_{\theta}||X\theta - Y||^2 $

### Solution
To compute $\hat{\theta}$ we can derive the *closed-form* solution of the least squares problem using calculus.  
Since the constant factor $ \frac{1}{n} $ does not change the minimiser, we drop it here and write $ J(\theta) = ||X\theta - Y||^2 $. Expanding the squared norm gives $ J(\theta) = Y^{T}Y - 2\theta^{T}X^{T}Y + \theta^{T}X^{T}X\theta $, and taking the gradient with respect to $ \theta $ we obtain:  
$ \nabla_{\theta}J(\theta) = -2X^{T}Y + 2X^{T}X\theta $  
Factoring out the $ 2X^{T} $ term gives $ \nabla_{\theta}J(\theta) = 2X^{T}(X\theta - Y) $: the gradient is the vector pointing in the direction of steepest increase of the cost, and it is determined by the *residuals* $ X\theta - Y $.  
Setting the gradient to zero, we obtain the *normal equations*:  
$X^{T}X\hat{\theta}=X^{T}Y$  
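One way to gain confidence in the gradient formula $ 2X^{T}(X\theta - Y) $ is to compare it with a finite-difference approximation. The sketch below assumes NumPy, uses the cost without the $ \frac{1}{n} $ factor (to match the gradient above), and runs on random made-up data.

```python
import numpy as np

def cost(theta, X, Y):
    # J(theta) = ||X theta - Y||^2, with the 1/n factor dropped
    r = X @ theta - Y
    return float(r @ r)

def gradient(theta, X, Y):
    # Analytic gradient 2 X^T (X theta - Y)
    return 2.0 * X.T @ (X @ theta - Y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # made-up design matrix
Y = rng.normal(size=20)
theta = rng.normal(size=3)

# Central finite differences along each coordinate direction.
h = 1e-6
numeric = np.array([
    (cost(theta + h * e, X, Y) - cost(theta - h * e, X, Y)) / (2 * h)
    for e in np.eye(3)
])
analytic = gradient(theta, X, Y)
```

The two vectors should agree up to small numerical error, which is a cheap sanity check before relying on a hand-derived gradient.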

The normal equations have a unique solution $ \hat{\theta} = (X^{T}X)^{-1}X^{T}Y $ only if $ X^{T}X $ is invertible, which requires the columns of $ X $ to be linearly independent. We also know that this solution is optimal: the Hessian matrix $ \nabla^{2}_{\theta}J = 2X^{T}X $ (up to the positive constant factor $ \frac{1}{n} $) is positive semidefinite, so the cost function $ J $ is convex. Thanks to this property, any *stationary* point, that is, any $ \theta_0 $ with $ \nabla_{\theta}J(\theta_0) = 0 $, is a global minimiser.
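When $ X^{T}X $ is invertible, the normal equations can be solved numerically. The following is a minimal sketch assuming NumPy and simulated data; `theta_true` and the noise level are made-up values. It uses `np.linalg.solve` on the linear system instead of forming the inverse explicitly, which is a common numerical choice rather than something prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])        # first column of ones for the intercept
theta_true = np.array([1.0, 2.0, -3.0])            # made-up parameters for the demo
Y = X @ theta_true + rng.normal(0.0, 0.1, size=n)  # measurement model Y = X theta + eps

# Solve X^T X theta = X^T Y directly rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# At the solution the gradient 2 X^T (X theta - Y) should vanish (up to rounding).
grad_at_solution = 2.0 * X.T @ (X @ theta_hat - Y)
```

Because the noise is small, `theta_hat` lands close to `theta_true`, and the gradient at `theta_hat` is zero up to floating-point error, as expected at a stationary point.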

The columns of $ X $ are linearly independent when no column (the column of ones included) is an exact linear combination of the others. This requires the number of samples $ n $ to be at least as large as the number of parameters $ p + 1 $.  
If $ X^{T}X $ is not invertible, the least squares problem has infinitely many solutions; we can then compute one through the *pseudo-inverse*, using numerical linear algebra methods such as *QR Decomposition* or *Singular Value Decomposition (SVD)*.  
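The rank-deficient case can be sketched with NumPy, whose `np.linalg.pinv` and `np.linalg.lstsq` are both SVD-based. The data are made up so that one feature is an exact multiple of another, and the explicit tolerances are an assumption added to make the rank decision robust to rounding.

```python
import numpy as np

# Made-up rank-deficient design matrix: the third column is twice the second.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x1, 2.0 * x1])
Y = np.array([1.0, 3.0, 5.0, 7.0])   # equals 1 + 2 * x1 exactly

# Explicit tolerance so rounding-level singular values are treated as zero.
rank = np.linalg.matrix_rank(X, tol=1e-10)   # 2 < 3 columns, so X^T X is singular

# Minimum-norm least squares solution via the SVD-based pseudo-inverse.
theta_pinv = np.linalg.pinv(X, rcond=1e-10) @ Y

# np.linalg.lstsq also handles rank-deficient problems.
theta_lstsq, _, rank_lstsq, _ = np.linalg.lstsq(X, Y, rcond=1e-10)
```

Among the infinitely many parameter vectors that fit these data exactly, the pseudo-inverse picks the one with the smallest Euclidean norm, here $ [1, 0.4, 0.8] $.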

It is important to note that *least squares linear regression* is so popular largely because of the previous observations: it admits a **closed-form solution**, meaning that an exact solution is obtained with a finite number of standard operations. Many other loss functions do not admit such an explicit solution and require *iterative optimisation methods* instead.  

The closed-form solution for the **squared error loss** is practical when we are dealing with linear regression problems over small datasets. If the dataset is very large, computing it can be computationally expensive, and if the objective function does not admit an analytic minimiser it is not available at all; in both cases we use other methods, which are not in closed form.
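As one possible example of a method that is not in closed form, the sketch below applies plain gradient descent to the squared error cost. Gradient descent is our choice for illustration and is not named in the text above; the function name `gradient_descent`, the step size, the number of steps and the data are all assumptions. The $ \frac{1}{n} $ factor is kept in the gradient so that a reasonable step size does not depend on the dataset size.

```python
import numpy as np

def gradient_descent(X, Y, step_size=0.1, n_steps=2000):
    """Minimise J(theta) = (1/n) * ||X theta - Y||^2 by plain gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / n) * X.T @ (X @ theta - Y)  # gradient of the averaged cost
        theta = theta - step_size * grad
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # made-up data
Y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0.0, 0.1, size=200)

theta_gd = gradient_descent(X, Y)
theta_closed = np.linalg.solve(X.T @ X, X.T @ Y)  # closed-form reference
```

On this small problem the iterates converge to the same parameters as the normal equations, which is expected since the cost is convex.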