<h1>Linear Regression</h1>
<h4>Design Matrix</h4>
The design matrix $\Phi$ has shape $n \times (1 + d)$: one row per sample, with a bias column followed by the $d$ features.

$ \Phi = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_d^{(1)}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_1^{(n)} & x_2^{(n)} & \dots & x_d^{(n)}\end{pmatrix}$
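As a minimal sketch of this layout, the snippet below prepends a column of ones (the bias column) to an array of raw features. The helper name `design_matrix` and the feature values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones (the bias column) to an (n, d) feature array."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Illustrative data: n = 4 samples, d = 2 features
X = np.array([[2, 4], [4, 2], [5, 6], [7, 5]])
Phi = design_matrix(X)
print(Phi.shape)  # (4, 3), i.e. n x (1 + d)
```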

Linear regression models the output as a linear combination of the basis functions:

$ y(w,x) = \sum_{j=0}^{M-1}w_j\cdot\phi_j(x)=\langle w|\phi(x) \rangle = w^T\cdot \phi(x)$
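The prediction is a single dot product between the weights and the feature vector. The sketch below assumes the simplest basis, $\phi_0(x) = 1$ and $\phi_j(x) = x_j$; the function names and the weight values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def phi(x):
    """Assumed basis: phi_0(x) = 1 (bias), phi_j(x) = x_j for j = 1..d."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def predict(w, x):
    """y(w, x) = sum_j w_j * phi_j(x) = w^T phi(x)."""
    return float(np.dot(w, phi(x)))

w = np.array([0.5, 1.0, -2.0])  # arbitrary illustrative weights
x = [3.0, 1.0]
print(predict(w, x))  # 0.5*1 + 1.0*3 - 2.0*1 = 1.5
```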

The loss function is the sum of squared errors (least squares) with an $l_2$ penalty on the weights,

$E(w)=\frac{1}{2}\cdot \sum_{\eta=1}^N(t_{\eta}-w^T\cdot \phi_{\eta})^2 + \frac{\lambda}{2}\lVert w \rVert_2^2$
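To make the two terms concrete, here is a small sketch that evaluates the loss for a given weight vector. The function name `ridge_loss` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def ridge_loss(w, Phi, t, lam):
    """E(w) = 0.5 * sum of squared residuals + 0.5 * lam * ||w||_2^2."""
    residuals = t - Phi @ w
    return 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.dot(w, w)

# Toy values chosen so the result can be checked by hand
Phi = np.array([[1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 2.0])
w = np.array([0.0, 0.5])

# Residuals are [0, 0.5]: data term 0.125, penalty term 0.5 * 2 * 0.25 = 0.25
print(ridge_loss(w, Phi, t, lam=2.0))  # 0.375
```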

Its gradient with respect to $w$ is,

$\nabla E(w)= \nabla \Big(\frac{1}{2}\cdot \sum_{\eta=1}^N(t_{\eta}-w^T\cdot \phi_{\eta})^2 + \frac{\lambda}{2}\lVert w \rVert_2^2\Big) = -\sum_{\eta=1}^N(t_{\eta}-w^T\cdot \phi_{\eta})\cdot \phi_{\eta} + \lambda w $

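Gradient descent uses this gradient as a learning rule for the weights: starting from an initial guess, it repeatedly updates $w \leftarrow w - \alpha \nabla E(w)$, with learning rate $\alpha > 0$, until the weights converge to an optimal solution.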
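The learning rule can be sketched as a short loop. The step size used here (the inverse of the largest eigenvalue of $\Phi^T\Phi + \lambda I$), the zero initialization, the number of steps, and the function names are all assumptions made for this example, not choices taken from the notes.

```python
import numpy as np

def ridge_gradient(w, Phi, t, lam):
    """Gradient of E(w) = 0.5 * ||t - Phi w||^2 + 0.5 * lam * ||w||^2."""
    return -Phi.T @ (t - Phi @ w) + lam * w

def gradient_descent(Phi, t, lam, n_steps=5000):
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    lr = 1.0 / np.linalg.norm(A, 2)  # assumed step size: 1 / largest eigenvalue of A
    w = np.zeros(Phi.shape[1])       # assumed starting point
    for _ in range(n_steps):
        w = w - lr * ridge_gradient(w, Phi, t, lam)
    return w

# Small illustrative problem: bias column plus two features
Phi = np.array([[1, 2, 4], [1, 4, 2], [1, 5, 6], [1, 7, 5]], dtype=float)
t = np.array([1.0, 1.5, 2.0, 2.5])
w_gd = gradient_descent(Phi, t, lam=4.0)
print(w_gd)
```

With a fixed step size the number of iterations needed depends on how well conditioned the problem is, so `n_steps` may need tuning on other data.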



<h3>Ridge Regression</h3>
Ridge regression has a closed-form solution, obtained by finding the minimum of the previous loss function. For $\lambda > 0$ the loss is strictly convex, so the point where the gradient vanishes is its unique global minimum.

Setting the gradient to zero,

$\nabla E(w)= \nabla \Big(\frac{1}{2}\cdot \sum_{\eta=1}^N(t_{\eta}-w^T\cdot \phi_{\eta})^2 + \frac{\lambda}{2}\lVert w \rVert_2^2\Big)=0 $

$\iff \nabla \Big(\frac{1}{2}\cdot (t-\Phi w)^T \cdot (t-\Phi w) + \frac{\lambda}{2} w^Tw\Big)=0 $

$\iff \nabla \Big(\frac{1}{2}\cdot \big(t^Tt - t^T\Phi w - w^T\Phi^Tt + w^T\Phi^T\Phi w + \lambda w^Tw\big)\Big)=0 $

$\iff \frac{1}{2}\cdot \big( 0 - \Phi^Tt - \Phi^Tt + 2 \Phi^T\Phi w + 2 \lambda w\big)=0 $

$\iff \frac{1}{2}\cdot \big( -2 \Phi^Tt + 2 \Phi^T\Phi w + 2 \lambda w\big)=0 $

$\iff  - \Phi^Tt + \Phi^T\Phi w + \lambda w =0 $

$\iff  \Phi^Tt =  \Phi^T\Phi w + \lambda w $

$\iff  \Phi^Tt = \big( \Phi^T\Phi + \lambda \cdot I \big) w $

$\iff  w = \big( \Phi^T\Phi + \lambda \cdot I \big)^{-1} \cdot \Phi^Tt $
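One way to sanity-check the result is to plug the closed-form weights back into the gradient and confirm that it vanishes numerically. The sketch below does this on randomly generated data; the seed, the sizes, and the value of $\lambda$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random illustrative problem: 20 samples, bias column plus 3 features
Phi = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
t = rng.normal(size=20)
lam = 0.5

# Closed-form weights: w = (Phi^T Phi + lam * I)^-1 Phi^T t
w = np.linalg.inv(Phi.T @ Phi + lam * np.eye(Phi.shape[1])) @ Phi.T @ t

# Gradient of the loss at w: -Phi^T (t - Phi w) + lam * w
grad = -Phi.T @ (t - Phi @ w) + lam * w
print(np.linalg.norm(grad))  # close to zero, up to floating-point error
```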

The closed-form solution is the fixed weight vector that minimizes the loss function (obtained by setting the gradient of the regularized squared error to zero).<p>
Lasso does not, in general, have a closed-form solution because it uses an $l_1$ penalty, which is not differentiable at zero.
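The kink in the $l_1$ penalty can be seen numerically: at zero, the slope of $|w|$ measured from the left differs from the slope measured from the right, so no single derivative exists there. The helper `one_sided_slopes` is a hypothetical name introduced only for this illustration.

```python
def one_sided_slopes(f, x0, h=1e-6):
    """Return the (left, right) finite-difference slopes of f at x0."""
    right = (f(x0 + h) - f(x0)) / h
    left = (f(x0) - f(x0 - h)) / h
    return left, right

# Absolute value (the l1 penalty in one dimension): the slopes disagree at 0
left, right = one_sided_slopes(abs, 0.0)
print(left, right)  # -1.0 1.0

# Squared value (the l2 penalty in one dimension): both slopes shrink to 0 with h
sq_left, sq_right = one_sided_slopes(lambda v: v * v, 0.0)
print(sq_left, sq_right)
```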

<h4>Ridge Regression</h4>
Prediction with linear models, with an $l_2$ penalty on the weights to reduce overfitting. The weights are given by:

$ w = \big( \Phi^T\Phi + \lambda \cdot I \big)^{-1} \cdot \Phi^Tt $

In [1]:
import numpy as np

# Design matrix: a bias column of ones followed by the features (edit the values here)
phi = np.array([[1, 2, 4], [1, 4, 2], [1, 5, 6], [1, 7, 5]])
res1 = np.dot(phi.T, phi)           # Phi^T Phi
i = np.identity(res1.shape[0])
l2 = 4 * i                          # lambda * I (edit the lambda value here)
print('(')
print(res1)
print('+')
print(l2)
print(')^-1')

# Target values (edit the values here)
targets = np.array([1, 1.5, 2, 2.5])
print('.')
print('(')
print(phi.T)
print('.')
print(targets)
print('=')
inverse = np.linalg.inv(res1 + l2)  # (Phi^T Phi + lambda * I)^-1
res2 = np.dot(phi.T, targets)       # Phi^T t
print(inverse)
print('.')
print(res2)
print('=')
w = np.dot(inverse, res2)           # ridge weights
print(w)

(
[[ 4 18 17]
 [18 94 81]
 [17 81 81]]
+
[[4. 0. 0.]
 [0. 4. 0.]
 [0. 0. 4.]]
)^-1
.
(
[[1 1 1 1]
 [2 4 5 7]
 [4 2 6 5]]
.
[1.  1.5 2.  2.5]
=
[[ 0.22500636 -0.0194607  -0.02645637]
 [-0.0194607   0.04973289 -0.04350038]
 [-0.02645637 -0.04350038  0.05850929]]
.
[ 7.  35.5 31.5]
=
[0.05081404 0.25903078 0.11358433]
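As a follow-up sketch, the weights printed above can be plugged back into the model to compare predictions against the targets. The weight values are copied from the output and are therefore rounded; the variable names `predictions` and `residuals` are assumptions for this example.

```python
import numpy as np

phi = np.array([[1, 2, 4], [1, 4, 2], [1, 5, 6], [1, 7, 5]])
targets = np.array([1, 1.5, 2, 2.5])
w = np.array([0.05081404, 0.25903078, 0.11358433])  # rounded weights from the output above

predictions = phi @ w               # one prediction per row of the design matrix
residuals = targets - predictions   # what the model does not explain
print(predictions)
print(residuals)
```

Because the penalty shrinks the weights, the residuals are not expected to be zero even on the training data.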
