# **Linear Regression Notes**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{1}{N}\Sigma\left(y-ŷ\right)^{2}$$
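* N = number of data points (the sum Σ runs over all N of them)
* y = observed target value
* ŷ = prediction, where $$ŷ=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...$$
* θn = parameters to be learned (θ0 is the bias term)
* xn = inputs, with $$x_{0}=1$$ so that the bias term can be written as θ0·x0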
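As a minimal sketch of the definitions above, the prediction and the loss can be written with NumPy. The function names `predict` and `mse_loss` and the tiny dataset are illustrative assumptions, not part of the notes.

```python
import numpy as np

def predict(theta, X):
    """Return y_hat = theta_0 + theta_1*x_1 + ... for each row of X.

    theta has shape (d + 1,), with theta[0] as the bias term.
    X has shape (N, d) and does not include the constant column x_0 = 1.
    """
    return theta[0] + X @ theta[1:]

def mse_loss(theta, X, y):
    """Mean squared error: (1/N) * sum((y - y_hat)^2)."""
    residual = y - predict(theta, X)
    return np.mean(residual ** 2)

# Made-up data generated by y = 1 + 2*x_1 (an assumption for illustration)
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

loss_exact = mse_loss(np.array([1.0, 2.0]), X, y)  # parameters that fit perfectly
loss_zero = mse_loss(np.array([0.0, 0.0]), X, y)   # all-zero parameters
```

With this data, `loss_exact` is 0 and `loss_zero` is (1 + 9 + 25) / 3 = 35/3.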

## Parameter Update Rule
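We can define the update rule for each parameter at each iteration as: $$\boxed{\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ}$$
* θn on the right-hand side is the current value (iteration t), α is the learning rate, and ∇ℒ here stands for the gradient component belonging to θn, i.e. the partial derivative ∂ℒ/∂θn
* we're subtracting because the gradient points in the direction of steepest ascent
* we want the direction of steepest descent (hence, subtraction)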

## Partial Derivative for each Parameter
Partial with respect to bias term: $$\frac{∂ℒ}{∂\theta_{0}}=-\frac{2}{N}\Sigma\left(y-ŷ\right)$$

Partial with respect to non-bias term: $$\frac{∂ℒ}{∂\theta_{n}}=-\frac{2}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)$$
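The two partial derivatives above can be sketched in NumPy as follows. The name `mse_gradient` and the sample data are assumptions for illustration; the bias entry is stored first in `theta`.

```python
import numpy as np

def mse_gradient(theta, X, y):
    """Gradient of (1/N) * sum((y - y_hat)^2) with respect to theta."""
    N = len(y)
    residual = y - (theta[0] + X @ theta[1:])    # y - y_hat for every data point
    grad = np.empty_like(theta)
    grad[0] = -2.0 / N * residual.sum()          # partial w.r.t. the bias term
    grad[1:] = -2.0 / N * (X.T @ residual)       # partials w.r.t. non-bias terms
    return grad

# Made-up data generated by y = 1 + 2*x_1 (an assumption for illustration)
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

grad_at_zero = mse_gradient(np.array([0.0, 0.0]), X, y)
```

At all-zero parameters the residuals equal `y`, so `grad_at_zero` works out to `[-6, -26/3]`: both entries are negative, meaning the update rule will increase both parameters.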

## Rewriting Update Rule
$$\boxed{\theta_{n}^{t+1}=\theta_{n}+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
* the factor of 2 can be absorbed into α (write α instead of 2α), since a constant times a constant is just another constant learning rate
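A minimal sketch of the rewritten update rule as a training loop, assuming a fixed learning rate and a fixed number of iterations (both values, and the name `fit_linear_regression`, are illustrative choices). Prepending a column of ones for x0 = 1 lets the bias and non-bias updates share one line.

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.05, n_iters=5000):
    """Batch gradient descent on the mean squared error."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])  # prepend x_0 = 1 for the bias term
    theta = np.zeros(d + 1)               # start from all-zero parameters
    for _ in range(n_iters):
        residual = y - Xb @ theta         # y - y_hat
        # theta <- theta + (2*alpha/N) * sum((y - y_hat) * x_n), for every n at once
        theta = theta + (2.0 * alpha / N) * (Xb.T @ residual)
    return theta

# Noise-free made-up data from y = 1 + 2*x_1 (an assumption for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]

theta_hat = fit_linear_regression(X, y)
```

On this noise-free data `theta_hat` converges to approximately `[1, 2]`. A learning rate that is too large for the data would make the loop diverge instead, so α usually needs tuning.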

# **Including Ridge Regression (L2 Regularization)**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{1}{N}\Sigma\left(y-ŷ\right)^{2}\color{red}{+\lambda\Sigma\theta^{2}}$$
* the part in red is the regularization penalty for L2
* notice how the rest of the loss function is identical to the one without L2 regularization
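As a sketch of the intermediate step, assuming the penalty sums over the same parameters θn that are being updated, differentiating the loss term by term gives the gradient that the update rule needs:

$$\frac{∂}{∂\theta_{n}}\left(\lambda\Sigma\theta^{2}\right)=2\lambda\theta_{n}$$
$$\frac{∂ℒ}{∂\theta_{n}}=-\frac{2}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)+2\lambda\theta_{n}$$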

## Parameter Update Rule (w/ Ridge Regression)
$$\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ$$
$$\theta_{n}^{t+1}=\theta_{n}+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)\color{red}{-2\alpha\lambda\theta_{n}}$$
$$\theta_{n}^{t+1}=\theta_{n}\textcolor{red}{-2\alpha\lambda\theta_{n}}+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)$$
$$\boxed{\theta_{n}^{t+1}=\theta_{n}\left(1\textcolor{red}{-2\alpha\lambda}\right)+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
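* each step multiplies θn by (1 − 2αλ) before adding the data term, so L2 regularization makes parameters **approach 0 but generally never actually equal 0**
* unhelpful parameters end up much closer to 0 than helpful parameters, because the data term does little to push them back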
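The boxed ridge update can be sketched as below. The names `ridge_step` and `run`, the data, and the values of α and λ are assumptions for illustration. Following the formula as written, this sketch penalizes every parameter, including the bias.

```python
import numpy as np

def ridge_step(theta, Xb, y, alpha, lam):
    """One gradient-descent step for MSE + lam * sum(theta^2).

    Xb already contains the leading column of ones (x_0 = 1).
    """
    N = len(y)
    residual = y - Xb @ theta
    shrink = 1.0 - 2.0 * alpha * lam      # the (1 - 2*alpha*lambda) factor
    return theta * shrink + (2.0 * alpha / N) * (Xb.T @ residual)

def run(Xb, y, alpha, lam, n_iters=5000):
    """Repeat ridge_step from all-zero parameters."""
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        theta = ridge_step(theta, Xb, y, alpha, lam)
    return theta

# Made-up data from y = 1 + 2*x_1, with the x_0 = 1 column included
Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_plain = run(Xb, y, alpha=0.05, lam=0.0)  # no regularization
theta_ridge = run(Xb, y, alpha=0.05, lam=0.1)  # with the L2 penalty
```

Here `theta_ridge` has a smaller norm than `theta_plain`, yet none of its entries is exactly 0, which matches the shrink-but-not-zero behavior described above.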

# **Including Lasso Regression (L1 Regularization)**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{1}{N}\Sigma\left(y-ŷ\right)^{2}\color{red}{+\lambda\Sigma\left|\theta\right|}$$
* the part in red is the regularization penalty for L1

## Parameter Update Rule (w/ Lasso Regression)
$$\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ$$
$$\theta_{n}^{t+1}=\theta_{n}+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)\color{red}{-\alpha\lambda\operatorname{sign}\left(\theta_{n}\right)}$$
$$\boxed{\theta_{n}^{t+1}=\theta_{n}\textcolor{red}{-\alpha\lambda\operatorname{sign}\left(\theta_{n}\right)}+\frac{2\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
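* L1 regularization pulls every nonzero parameter toward 0 by the same amount αλ per step, regardless of its size, which makes it **possible for some parameters to equal 0**
    * unhelpful parameters get driven to 0
* a parameter at 0 removes its input from the prediction, so L1 regularization performs feature selection automatically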
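The boxed lasso update can be sketched as below. The name `lasso_step` and the data are assumptions for illustration. The derivative of |θn| is sign(θn) for θn ≠ 0; at exactly 0 it is undefined, and this sketch relies on `np.sign(0)` returning 0, which is one common convention. The data is chosen so the residuals are zero, which isolates the effect of the penalty term.

```python
import numpy as np

def lasso_step(theta, Xb, y, alpha, lam):
    """One (sub)gradient-descent step for MSE + lam * sum(|theta|).

    Xb already contains the leading column of ones (x_0 = 1).
    """
    N = len(y)
    residual = y - Xb @ theta
    penalty = alpha * lam * np.sign(theta)  # constant-size pull toward 0
    return theta - penalty + (2.0 * alpha / N) * (Xb.T @ residual)

# Made-up data that the starting parameters fit exactly, so residual = 0
Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
theta = np.array([1.0, -2.0])
y = Xb @ theta

theta_next = lasso_step(theta, Xb, y, alpha=0.1, lam=0.1)
```

With zero residuals, `theta_next` is `[0.99, -1.99]`: each parameter moves toward 0 by αλ = 0.01 no matter how large it is, whereas the ridge update would shrink each one in proportion to its size. In practice a plain step like this tends to hover around 0 rather than land on it exactly, so treating it as a route to exact zeros is an assumption of this sketch.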