# Linear Models

In [4]:
import numpy as np
# synthetic data for the rest of the linear models:
np.random.seed(5)
n = 100 # number of samples
p = 5 # number of features
sigma = 0.2 # noise standard deviation
X = np.random.normal(0, 1, size=(n,p))
beta_true = np.random.randint(-4, 2, p)
noise = np.random.normal(0, sigma, size=(n))
y = X @ beta_true + noise


## Ordinary Least Squares (OLS)



### Analytical Method:

$$\begin{align*}\hat{\beta} &= \arg \min_{\beta} \frac{1}{n}\|y-X \beta\|^2_2\\
 &= (X^TX)^{-1}X^T y\end{align*}
 $$

In [5]:
betahat = np.linalg.inv(X.T @ X) @ X.T @ y
print("betahat: ", betahat)
print("beta true:", beta_true)

betahat:  [-2.94946726  0.01589149 -2.004408   -3.97428268 -3.99637663]
beta true: [-3  0 -2 -4 -4]


#### Analysis
Note that `y` is contaminated with noise, so `betahat` only approximates `beta_true`: from a finite noisy sample you will never retrieve it perfectly.
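
To illustrate the point above, here is a minimal sketch that fits the same model twice: once on noise-free targets and once on the noisy targets. It regenerates the synthetic data with the same seed; the helper name `ols` is hypothetical, and it solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

# Same synthetic setup as the first cell (same seed and call order).
np.random.seed(5)
n, p, sigma = 100, 5, 0.2
X = np.random.normal(0, 1, size=(n, p))
beta_true = np.random.randint(-4, 2, p)
noise = np.random.normal(0, sigma, size=n)

def ols(X, y):
    # Hypothetical helper: solve (X^T X) beta = X^T y without an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

betahat_clean = ols(X, X @ beta_true)          # targets without noise
betahat_noisy = ols(X, X @ beta_true + noise)  # targets with noise

print("error without noise:", np.linalg.norm(betahat_clean - beta_true))
print("error with noise:   ", np.linalg.norm(betahat_noisy - beta_true))
```

Without noise the estimate should match `beta_true` up to floating-point error; with noise a small but nonzero error remains.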

### Gradient Descent:


#### Motivations:
Sometimes you cannot solve the `argmin` formulation of the parameters in closed form. Even when you can, computing the analytic solution may be too expensive (here it requires forming and inverting the $p \times p$ matrix $X^TX$), so an iterative method such as gradient descent is a practical alternative.

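Note that there are two flavours. Write the loss as a sum of per-sample terms, $L(\beta) = \sum_{i=1}^{n} L_i(\beta)$ with $L_i(\beta) = (y_i - x_i^T\beta)^2$, where $x_i^T$ is the $i$-th row of $X$.

Stochastic Gradient Descent:
$$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \eta\nabla_\beta L_{i_k} (\hat{\beta}^{(k)})$$

in which each update uses only one sample $i_k$, and,

Batch Gradient Descent:
$$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \eta\nabla_\beta L (\hat{\beta}^{(k)})$$

in which each update uses the entire dataset.

Note, in practice you will usually choose **mini-batch Gradient Descent**, which is a compromise between these two approaches: each update uses a small subset $B_k \subset \{1, \dots, n\}$ of the samples,
$$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \eta \sum_{i \in B_k} \nabla_\beta L_i (\hat{\beta}^{(k)})$$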

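For this problem, take $L(\beta) = \|y - X\beta\|_2^2$ (the factor $\frac{1}{n}$ in the objective is dropped, which does not change the minimiser). Then
$$
\nabla_\beta L(\beta) = -2 X^T y + 2X^TX\beta.
$$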
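
As a sanity check on the gradient formula, the following sketch compares it against a central finite-difference approximation of $\|y - X\beta\|_2^2$. The small random problem, the seed, and the helper names `loss` and `numerical_grad` are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen for this sketch
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
b = rng.normal(size=3)

def loss(b, X, y):
    # Unnormalised squared error ||y - X b||_2^2.
    r = y - X @ b
    return r @ r

def grad_loss(b, X, y):
    # Analytic gradient of the loss above.
    return -2 * X.T @ y + 2 * X.T @ X @ b

def numerical_grad(f, b, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(b)
    for j in range(b.size):
        e = np.zeros_like(b)
        e[j] = h
        g[j] = (f(b + e) - f(b - e)) / (2 * h)
    return g

g_exact = grad_loss(b, X, y)
g_num = numerical_grad(lambda v: loss(v, X, y), b)
print("max abs difference:", np.max(np.abs(g_exact - g_num)))
```

The two gradients should agree up to rounding error, since the loss is quadratic in $\beta$.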


In [6]:
eta = 0.001                         # learning rate (step size)
T = 50                              # rows in the iterate history (T-1 updates)

def grad_loss(b, X, y):
    # gradient of ||y - X b||^2 with respect to b
    return -2 * X.T @ y + 2 * X.T @ X @ b

betas = np.zeros(shape=(T,p))       # row t stores iterate t; row 0 is the all-zeros start

for t in range(1, T):
    # full-batch update: step against the gradient over all n samples
    betas[t,:] = betas[t-1,:] - eta * grad_loss(betas[t-1,:], X, y)

print("beta batch GD: ", betas[T-1,:])
print("beta true: ", beta_true)

beta batch GD:  [-2.94939435  0.01570463 -2.00406754 -3.97354862 -3.99569519]
beta true:  [-3  0 -2 -4 -4]
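
The cell above is the batch variant. The stochastic and mini-batch flavours could be sketched as follows, using the same gradient restricted to the rows in each batch. The function name `minibatch_gd`, the step sizes, the batch sizes, the number of epochs, and the per-epoch shuffling are all assumptions for this illustration, not tuned values.

```python
import numpy as np

# Same synthetic setup as the first cell (same seed and call order).
np.random.seed(5)
n, p, sigma = 100, 5, 0.2
X = np.random.normal(0, 1, size=(n, p))
beta_true = np.random.randint(-4, 2, p)
y = X @ beta_true + np.random.normal(0, sigma, size=n)

def minibatch_gd(X, y, eta, batch_size, epochs, seed=0):
    # Hypothetical helper: batch_size=1 gives stochastic gradient descent,
    # batch_size=len(y) gives batch gradient descent.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the squared error summed over the current batch only.
            grad = -2 * Xb.T @ yb + 2 * Xb.T @ Xb @ beta
            beta = beta - eta * grad
    return beta

beta_sgd = minibatch_gd(X, y, eta=0.01, batch_size=1, epochs=50)
beta_mb = minibatch_gd(X, y, eta=0.005, batch_size=10, epochs=50)

print("beta SGD:        ", beta_sgd)
print("beta mini-batch: ", beta_mb)
print("beta true:       ", beta_true)
```

With a constant step size the stochastic iterates do not settle exactly at the least-squares solution; they fluctuate in a small neighbourhood of it, so the printed values should be close to, but not identical with, the batch result.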


## Ridge Regression

## Lasso Regression