# 2. Linear Regression

## Linear Models
In simple words, linear regression is a method for finding the best linear model for a set of data.

What is a linear model?

A linear model is a mathematical function that makes predictions using a linear combination of the components of its input $x$.
Mathematically, it looks like this:

$$f(x; w, b) = w^T x + b = w_1 x_1 + ... + w_D x_D + b$$

where:
* $
\mathbf{x} = \begin{pmatrix}
x_1 \\
x_2 \\
\vdots \\
x_D
\end{pmatrix} 
\in \mathbb{R}^D$ 
is a vector of $D$ real-valued features representing the input data.

* $w$ and $b$ are the parameters of the linear model:
    * $
    \mathbf{w} = \begin{pmatrix}
    w_1 \\
    w_2 \\
    \vdots \\
    w_D
\end{pmatrix}$ is a vector of $D$ parameters, often called the weights;
    * $b$ is a single parameter (a scalar), often called the bias.


## A concrete example

Imagine our task is to predict the strength of a batch of concrete based on the following features:
* $x_1$: the amount of water (in mL)
* $x_2$: the amount of sand (in kg)

If we believe the relation between these features and the strength of the concrete to be linear, we could use a linear model.

Our linear model would then predict the strength of the resulting concrete to be:

$$f(x_1, x_2; w, b) = w_1 x_1 + w_2 x_2 + b$$

If we set the parameters to $w_1 = w_2 = 1$ and $b = 0$, then <br>
for $x_1 = 100$ mL and $x_2 = 1$ kg the model yields $f(x_1, x_2; w, b) = 100 + 1 = 101$ N.
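
The computation above can be checked numerically. The following is a minimal sketch; the function name `linear_model` is an assumption made for this illustration and is not part of the original notebook.

```python
import numpy as np

def linear_model(x, w, b):
    """Return f(x; w, b) = w^T x + b for a single input vector x."""
    return np.dot(w, x) + b

# Concrete example: x_1 = 100 mL of water, x_2 = 1 kg of sand
x = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])  # w_1 = w_2 = 1
b = 0.0

strength = linear_model(x, w, b)
print(strength)  # 101.0
```

Changing `w` or `b` changes the prediction, which is exactly the freedom that linear regression uses to fit the data.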


However, it is highly unlikely that the parameters we chose are the correct ones. If we were experts in making concrete, we would likely have an idea of how to set these parameters to get the correct result.

If that is not the case, linear regression finds the parameters $w$ and $b$ that give the best model for a given set of data.



Below is an example in 2 dimensions (i.e. one input feature and one output dimension). You can try changing the values of the parameters to see how they influence the model.

In [1]:
%matplotlib inline
from ipywidgets import interactive
import matplotlib.pyplot as plt
import numpy as np
# Grid of input values covering the plotted range
x = np.linspace(-10, 10, num=1000)

def f(w, b):
    # Plot the 1D linear model f(x; w, b) = w * x + b
    plt.plot(x, w * x + b)
    plt.ylim(-5, 5)
    plt.xlim(-10, 10)
    plt.xlabel("x")
    plt.ylabel("f(x;w,b)")
    plt.show()

interactive_plot = interactive(f, w=(-2.0, 2.0), b=(-3, 3, 0.5))
output = interactive_plot.children[-1]
output.layout.height = '450px'
interactive_plot


How do we find the best values for $w$ and $b$?

As explained in the introduction, we need examples to learn from.

Let $\{x^{(n)}, y^{(n)}\}_{n=1}^{N_{train}}$ be our training dataset (i.e. the set of data we want to learn from).

For example, we have $N_{train}$ batches of concrete, each with a different set of characteristics $x^{(n)} \in \mathbb{R}^D$ and a corresponding output strength $y^{(n)}$.

How can we find the linear model that best fits this data?

Assume the training data was generated using a linear model $f(x; w, b)$ to which we added noise. The resulting output is:

$$y^{(n)} = f(x^{(n)}; w, b) + y^{(n)}_{noise}$$

We have generated a set of $N_{train} = 20$ examples this way. 
Try changing the parameters $w$ and $b$ to fit the points. Once you are ready, tick the reveal checkbox and the true line will be displayed.

In [2]:
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interactive
import matplotlib.pyplot as plt
import numpy as np
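N_train = 20
x = np.linspace(-10, 10, num=1000)  # dense grid used to draw the lines
x_train = np.linspace(-10, 10, num=N_train)  # training inputs
sigma_noise = 1  # standard deviation of the added noise
w_true = 0.345
b_true = 1
# Draw the noise once so the points do not move when the sliders change
y_noise = sigma_noise * np.random.randn(N_train)

def f(w, b, reveal):
    # Noisy training points generated from the true model
    plt.plot(x_train, w_true * x_train + b_true + y_noise, 'x')
    # Candidate model chosen with the sliders
    plt.plot(x, w * x + b)
    if reveal:
        # True, noise-free model
        plt.plot(x, w_true * x + b_true)
    plt.ylim(-5, 5)
    plt.xlabel("x")
    plt.ylabel("f(x;w,b)")
    plt.show()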

interactive_plot = interactive(f, w=(-2.0, 2.0), b=(-3, 3, 0.5), reveal=False)


output = interactive_plot.children[-1]
output.layout.height = '450px'

interactive_plot


What you have likely done intuitively is minimize the vertical distance between each point and the corresponding prediction of the linear model. Mathematically, this can be formalized as follows:


$$
\mathcal{L}(w, b) = \frac{1}{N_{train}} \sum_{n=1}^{N_{train}} \left( f(x^{(n)}; w, b) - y^{(n)} \right)^2
$$

This is called the mean squared error: it is the mean of the squared errors of the model over the training input-output pairs.
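
To make the definition concrete, here is a small sketch of the mean squared error in NumPy. The function name `mean_squared_error` and the three data points are made up for this illustration.

```python
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Mean of the squared differences between predictions and targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Tiny 1D example with three training pairs (made-up numbers)
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([1.0, 3.0, 5.0])

# A model that fits the data exactly: f(x) = 2x + 1
loss_exact = mean_squared_error(2.0 * x_train + 1.0, y_train)
# A worse model: f(x) = x + 1, with errors 0, -1 and -2
loss_worse = mean_squared_error(1.0 * x_train + 1.0, y_train)

print(loss_exact, loss_worse)
```

The exact model has zero loss, while the worse one has a loss of $(0 + 1 + 4)/3 = 5/3$: the lower the loss, the better the fit.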

Finding the best parameters $(w^*, b^*)$ can then be written as:

$$
(w^*, b^*) = \underset{w, b}{\text{argmin}} \; \mathcal{L}(w, b)
$$

To solve this in a program we first need to stack: 
* the training input features $x^{(n)}$, as rows, into a single matrix: $X  = \begin{pmatrix}
{x^{(1)}}^T \\
{x^{(2)}}^T \\
\vdots \\
{x^{(N_{train})}}^T
\end{pmatrix} = \begin{pmatrix}
x^{(1)}_1 & \dots & x^{(1)}_D\\
x^{(2)}_1 & \dots & x^{(2)}_D\\
\vdots & & \vdots \\
x^{(N_{train})}_1 & \dots & x^{(N_{train})}_D
\end{pmatrix} \in \mathbb{R}^{N_{train} \times D}$
* the training outputs $y^{(n)}$ into a single vector:  $
\mathbf{y} = \begin{pmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N_{train})}
\end{pmatrix}$

This is because numerical libraries such as NumPy are optimized for operations on arrays and matrices.

We can then write the predictions of the linear model for all training inputs at once, using the input matrix $X$:

$$\hat{y} = \begin{pmatrix}
f(x^{(1)}; w,b) \\
f(x^{(2)}; w,b) \\
\vdots \\
f(x^{(N_{train})}; w,b)
\end{pmatrix} = \begin{pmatrix}
w^T x^{(1)} + b \\
w^T x^{(2)} + b \\
\vdots \\
w^T x^{(N_{train})} + b
\end{pmatrix} = X w + b$$

Strictly speaking, we should write $b \mathbf{1}_{N_{train}} = b \begin{pmatrix} 
1 \\
1 \\
\vdots \\
1
\end{pmatrix}$ instead of just $b$ on the right-hand side of the last equality, where $\mathbf{1}_{N_{train}}$ is a vector of $N_{train}$ ones.

But this is often written as just $b$, with the implicit idea that $b$ is added to every entry of $Xw$.
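
NumPy follows the same convention: adding a scalar to a vector adds it to every entry (broadcasting). The sketch below compares a per-example loop with the vectorized form on random data; the shapes, the seed and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_train, D = 5, 3
X = rng.normal(size=(N_train, D))  # one training input per row
w = rng.normal(size=D)
b = 0.5

# Loop version: apply f(x; w, b) = w^T x + b to each row of X
y_hat_loop = np.array([w @ X[n] + b for n in range(N_train)])

# Vectorized version: the scalar b is broadcast over all N_train entries
y_hat_vec = X @ w + b

# Explicit version with the vector of ones
y_hat_ones = X @ w + b * np.ones(N_train)

print(np.allclose(y_hat_loop, y_hat_vec), np.allclose(y_hat_vec, y_hat_ones))
```

All three give the same predictions; the vectorized form is the one usually written in practice.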

One can further simplify the notation by removing $b$ from the linear model entirely, noticing that:
$$
X w + b = \begin{pmatrix}
x^{(1)}_1 & \dots & x^{(1)}_D & 1\\
x^{(2)}_1 & \dots & x^{(2)}_D & 1\\
\vdots & & \vdots & \vdots\\
x^{(N_{train})}_1 & \dots & x^{(N_{train})}_D & 1
\end{pmatrix} \begin{pmatrix}
w_1 \\
w_2 \\
\vdots \\
w_D\\
b
\end{pmatrix}
$$
In the following we won't refer to $b$, as it will implicitly be part of $w$ (and the column of ones will be part of $X$).
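
This identity can be checked numerically. In the sketch below, the names `X_aug` and `w_aug` (the matrix with an extra column of ones and the weight vector with $b$ appended) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N_train, D = 4, 2
X = rng.normal(size=(N_train, D))
w = rng.normal(size=D)
b = -1.5

# Append a column of ones to X and append b to w
X_aug = np.hstack([X, np.ones((N_train, 1))])  # shape (N_train, D + 1)
w_aug = np.append(w, b)                        # shape (D + 1,)

# Both expressions give the same predictions
print(np.allclose(X @ w + b, X_aug @ w_aug))
```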

The loss function can be rewritten with the matrix $X$ and the vector $y$:

$$
\mathcal{L}(w) = \frac{1}{N_{train}} (y-Xw)^T (y-Xw)
$$

Check that you can see why (don't forget that $b$ is now part of $w$).

### Reminder

In 1D, the minima of a differentiable convex function $g:
\mathbb{R}
\rightarrow
\mathbb{R}
$ are the points $x$ such that (s.t.):

$g'(x) = 
\frac{dg}{dx}
= 0$

In the general case of $D$ dimensions, for a differentiable convex function $g: \mathbb{R}^D
\rightarrow
\mathbb{R}$, the minima $x \in \mathbb{R}^D$ are the points such that

$$
\nabla g(x) = \begin{pmatrix}
\frac{\partial g(x)}{\partial x_1}\\
\frac{\partial g(x)}{\partial x_2}\\
\vdots \\
\frac{\partial g(x)}{\partial x_D}\\
\end{pmatrix} = \begin{pmatrix}
0\\
\vdots \\
0\\
\end{pmatrix}
$$

In our case, the loss is convex in $w$ and we can show that: 

$$
\nabla_w\mathcal{L}(w) = -\frac{2}{N_{train}} X^T(y-Xw)
$$
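
For readers who want the intermediate steps, here is one way to obtain the gradient, assuming the mean squared error in matrix form with $b$ absorbed into $w$. Expanding the product gives:

$$
\mathcal{L}(w) = \frac{1}{N_{train}} (y - Xw)^T (y - Xw) = \frac{1}{N_{train}} \left( y^T y - 2 w^T X^T y + w^T X^T X w \right)
$$

Differentiating term by term with respect to $w$:

$$
\nabla_w \mathcal{L}(w) = \frac{1}{N_{train}} \left( -2 X^T y + 2 X^T X w \right) = -\frac{2}{N_{train}} X^T (y - Xw)
$$

The constant factor $-\frac{2}{N_{train}}$ does not change where the gradient vanishes, so only the term $X^T (y - Xw)$ matters when looking for the minimum.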

Therefore the solution to the problem is a solution of:

$$
 X^T(y-Xw) = 0
$$

that is, $X^T X w = X^T y$. Assuming $X^T X$ is non-singular:

$$
w^* = (X^T X)^{-1} X^T y
$$

This is a classic maths problem, so functions that solve it have already been implemented in Python libraries such as NumPy:

```python
# Least-squares solution; X must include the column of ones for the bias
w_fit = np.linalg.lstsq(X, y, rcond=None)[0]
```
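
As a sanity check, the closed-form solution and the library solver can be compared on synthetic data. The sketch below assumes the same true parameters as the earlier interactive cell (`w_true = 0.345`, `b_true = 1`); the fixed seed and the names `rng` and `w_normal` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_train = 20
w_true, b_true, sigma_noise = 0.345, 1.0, 1.0

# Noisy 1D training data generated from the true linear model
x_train = np.linspace(-10, 10, num=N_train)
y = w_true * x_train + b_true + sigma_noise * rng.standard_normal(N_train)

# Matrix of inputs with a column of ones, so that the bias is part of w
X = np.column_stack([x_train, np.ones(N_train)])

# Closed form: solve (X^T X) w = X^T y without forming the inverse explicitly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver
w_fit = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_normal, w_fit)
```

Both approaches return the same `[slope, bias]` pair up to rounding. Using `np.linalg.solve` or `np.linalg.lstsq` rather than computing $(X^T X)^{-1}$ explicitly is generally the more numerically stable choice.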

## Generalized Linear Regression Model:


The linear assumption we have been making since the beginning is often not true. 
If we transform the input with a function $\phi$, we obtain a new set of features: 

$\tilde{x} =  \phi(x)$

This weakens the linear assumption into the following assumption: 

The output is a linear function of the parameters (but not necessarily of the input $x$).

Common functions used to modify the input data include:

* Radial basis function (RBF):

$$
\phi_{RBF}(x; c, \sigma) = \exp\left(-\frac{\|x - c\|^2}{2\sigma^2}\right)
$$

* Logistic sigmoid:
$$
\sigma(x; v, b) = \frac{1}{1 + e^{-v^T x -b}}
$$
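
The two functions above can be sketched in NumPy as follows. The function names, the choice of 7 RBF centres with width 1, and the sine target are all assumptions made for this illustration; inputs are stored as a matrix with one example per row.

```python
import numpy as np

def rbf(X, c, sigma):
    """RBF exp(-||x - c||^2 / (2 sigma^2)) evaluated for each row x of X."""
    sq_dist = np.sum((X - c) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def logistic_sigmoid(X, v, b):
    """Logistic sigmoid 1 / (1 + exp(-v^T x - b)) for each row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ v) - b))

# 1D inputs stored as a column, shape (N, 1)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = np.sin(X[:, 0])  # a target that is not linear in x

# Hypothetical choice: 7 RBF centres spread over the input range, width 1
centres = np.linspace(-3, 3, 7)
Phi = np.column_stack(
    [rbf(X, np.array([c]), 1.0) for c in centres] + [np.ones(len(X))]
)

# The model is still linear in w, so the same least-squares solver applies
w_fit = np.linalg.lstsq(Phi, y, rcond=None)[0]
mse = np.mean((Phi @ w_fit - y) ** 2)
print(Phi.shape, mse)
```

Each column of `Phi` is one transformed feature $\phi(x)$, plus a final column of ones for the bias. The fit is non-linear in $x$ but remains a linear regression in the parameters.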

Below are plots of these functions. You can play with the parameters to get some intuition for their roles:



[Interactive plot: RBF and logistic sigmoid functions, with sliders for their parameters (e.g. σ for the RBF)]

## What to do next:

This short course on linear regression is just a reformulation of the Machine Learning and Pattern Recognition (MLPR) course notes by Iain Murray: https://mlpr.inf.ed.ac.uk/2022/notes/w0a_welcome_and_advice.html <br>

I would advise reading the following sections:
- Linear regression
- Gradient descent
- Train/test split
- Logistic sigmoid
- Neural networks

You can read the other sections, but they might not be useful for this internship.