# **Logistic Regression Notes**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{-1}{N}\Sigma\left(\textcolor{red}{y\ln\left(ŷ\right)+\left(1-y\right)\ln\left(1-ŷ\right)}\right)$$
* the part in red, negated and averaged over the data points, is the **<font color='red'>Binary Cross Entropy</font>**
* N = number of data points
* ŷ is the predicted probability, given by the sigmoid of z:
$$ŷ=\frac{1}{1+e^{-z}}$$
* z is the linear model: 
$$z=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...$$
$$x_{0}=1$$
* $\theta_{n}$ = parameters to be learned ($\theta_{0}$ is the bias term)
* $x_{n}$ = input features
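
The definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`sigmoid`, `predict`, `bce_loss`) and the toy data are hypothetical, and `theta[0]` is taken to be the bias $\theta_{0}$.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """y_hat for one sample; theta[0] is the bias, x holds x_1, x_2, ..."""
    z = theta[0] + sum(t * x_i for t, x_i in zip(theta[1:], x))
    return sigmoid(z)

def bce_loss(theta, X, y):
    """Mean binary cross entropy over the N rows of X."""
    total = 0.0
    for x_row, y_i in zip(X, y):
        y_hat = predict(theta, x_row)
        total += y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat)
    return -total / len(y)

# Hypothetical toy data: four samples, two features each.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]

theta = [0.0, 0.0, 0.0]       # all-zero parameters give y_hat = 0.5 everywhere
loss = bce_loss(theta, X, y)  # so the loss equals ln(2)
```

With every prediction at 0.5, each sample contributes $\ln(0.5)$, which is why the starting loss is $\ln 2$ regardless of the labels.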

## Parameter Update Rule
At each iteration, every parameter takes a gradient descent step, where $\alpha$ is the learning rate and $\theta_{n}$ on the right-hand side is its current value: $$\boxed{\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ}$$
Each component of ∇ℒ is found with the **chain rule**: $$\frac{∂ℒ}{∂\theta_{n}}=\textcolor{#FA5053}{\frac{∂ℒ}{∂ŷ}}\cdot\textcolor{#50C878}{\frac{∂ŷ}{∂z}}\cdot\textcolor{#6395EE}{\frac{∂z}{∂\theta_{n}}}$$

## Chain Rule
Finding each partial derivative: 
$$\textcolor{#6395EE}{\frac{∂z}{∂\theta_{n}}}=\frac{∂}{∂\theta_{n}}\left(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...\right)=x_{n}$$
* for the bias term $\theta_{0}$, the partial is $x_{0}=1$
$$\textcolor{#50C878}{\frac{∂ŷ}{∂z}}=\frac{∂}{∂z}\left(\frac{1}{1+e^{-z}}\right)=\frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}$$
$$=\frac{e^{-z}}{1+e^{-z}}\cdot\frac{1}{1+e^{-z}}$$
$$=\frac{\textcolor{red}{1}+e^{-z}\textcolor{red}{-1}}{1+e^{-z}}\cdot\frac{1}{1+e^{-z}}$$
$$=\left(\frac{1+e^{-z}}{1+e^{-z}}-\frac{1}{1+e^{-z}}\right)\left(\frac{1}{1+e^{-z}}\right)$$
$$\textcolor{#50C878}{\frac{∂ŷ}{∂z}}=\left(1-ŷ\right)\left(ŷ\right)$$
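
As a quick sanity check on the sigmoid derivative, the closed form $\left(1-ŷ\right)\left(ŷ\right)$ can be compared against a central finite difference. This is a minimal sketch; the helper names and the test points are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Closed form from the derivation: (1 - y_hat) * y_hat."""
    y_hat = sigmoid(z)
    return (1.0 - y_hat) * y_hat

def central_difference(f, z, h=1e-5):
    """Numerical estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Largest disagreement between the two over a few arbitrary points.
max_gap = max(
    abs(sigmoid_grad(z) - central_difference(sigmoid, z))
    for z in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
```
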
$$\textcolor{#FA5053}{\frac{∂ℒ}{∂ŷ}}=-\frac{1}{N}\frac{∂}{∂ŷ}\Sigma\left(y\ln\left(ŷ\right)+\left(1-y\right)\ln\left(1-ŷ\right)\right)$$
$$=-\frac{1}{N}\Sigma\left(\frac{y}{ŷ}-\frac{1-y}{1-ŷ}\right)$$

Putting it all together...
$$\frac{∂ℒ}{∂\theta_{n}}=\textcolor{#FA5053}{\frac{∂ℒ}{∂ŷ}}\cdot\textcolor{#50C878}{\frac{∂ŷ}{∂z}}\cdot\textcolor{#6395EE}{\frac{∂z}{∂\theta_{n}}}=\textcolor{#FA5053}{-\frac{1}{N}\Sigma\left(\frac{y}{ŷ}-\frac{1-y}{1-ŷ}\right)}\textcolor{#50C878}{\left(1-ŷ\right)\left(ŷ\right)}\textcolor{#6395EE}{\left(x_{n}\right)}$$
$$=-\frac{1}{N}\Sigma\left(y\textcolor{red}{-yŷ}-ŷ\textcolor{red}{+yŷ}\right)\left(x_{n}\right)$$
$$\frac{∂ℒ}{∂\theta_{n}}=-\frac{1}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)$$
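
The final gradient formula can be checked numerically. The sketch below, in plain Python with hypothetical names and toy data, computes $-\frac{1}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)$ directly and compares it with a finite-difference estimate of the loss gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # theta[0] is the bias (x_0 = 1); x holds x_1, x_2, ...
    return sigmoid(theta[0] + sum(t * x_i for t, x_i in zip(theta[1:], x)))

def bce_loss(theta, X, y):
    total = 0.0
    for x, y_i in zip(X, y):
        y_hat = predict(theta, x)
        total += y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat)
    return -total / len(y)

def bce_gradient(theta, X, y):
    """dL/dtheta_n = -(1/N) * sum((y - y_hat) * x_n), with x_0 = 1."""
    grad = [0.0] * len(theta)
    for x, y_i in zip(X, y):
        residual = y_i - predict(theta, x)
        for n, x_n in enumerate([1.0] + list(x)):
            grad[n] -= residual * x_n / len(y)
    return grad

def numeric_gradient(theta, X, y, h=1e-6):
    """Central finite difference of the loss, one parameter at a time."""
    grad = []
    for n in range(len(theta)):
        up, down = list(theta), list(theta)
        up[n] += h
        down[n] -= h
        grad.append((bce_loss(up, X, y) - bce_loss(down, X, y)) / (2 * h))
    return grad

# Hypothetical toy data and an arbitrary parameter vector.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]
theta = [0.1, -0.2, 0.3]

analytic = bce_gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

A small `max_gap` suggests the derivation and the code agree; it does not replace the algebra above.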

## Rewriting Update Rule
$$\boxed{\theta_{n}^{t+1}=\theta_{n}+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
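* the update rule has almost the same form as the one from linear regression; the difference is that ŷ is now the sigmoid output rather than z itself
* instead of 2α, the step factor is α
    * to be fair, the factor of 2 can be absorbed into the learning rate, since a constant times a constant is just a constant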
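
To make the boxed rule concrete, here is a minimal batch gradient descent loop in plain Python. The toy data, the learning rate of 0.1, and the 500 iterations are assumptions picked for illustration, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # theta[0] is the bias (x_0 = 1); x holds x_1, x_2, ...
    return sigmoid(theta[0] + sum(t * x_i for t, x_i in zip(theta[1:], x)))

def bce_loss(theta, X, y):
    total = 0.0
    for x, y_i in zip(X, y):
        y_hat = predict(theta, x)
        total += y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat)
    return -total / len(y)

def update(theta, X, y, alpha):
    """One step of theta_n <- theta_n + (alpha / N) * sum((y - y_hat) * x_n)."""
    n_samples = len(y)
    # Residuals use the current parameters, so all theta_n update together.
    residuals = [y_i - predict(theta, x) for x, y_i in zip(X, y)]
    new_theta = []
    for n, theta_n in enumerate(theta):
        # x_0 = 1 for the bias term; otherwise use feature column n - 1.
        column = [1.0 if n == 0 else x[n - 1] for x in X]
        step = sum(r * x_n for r, x_n in zip(residuals, column))
        new_theta.append(theta_n + alpha / n_samples * step)
    return new_theta

# Hypothetical toy data.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]

theta = [0.0, 0.0, 0.0]
initial_loss = bce_loss(theta, X, y)
for _ in range(500):
    theta = update(theta, X, y, alpha=0.1)
final_loss = bce_loss(theta, X, y)
```

Comparing `final_loss` with `initial_loss` is a simple way to confirm that the steps move downhill on this data.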

# **Including Ridge Regression (L2 Regularization)**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{-1}{N}\Sigma\left(y\ln\left(ŷ\right)+\left(1-y\right)\ln\left(1-ŷ\right)\right)\textcolor{red}{+\lambda\Sigma\theta^{2}}$$
* the part in red is the regularization penalty for L2
* notice how the rest of the loss function is identical to the one without L2 regularization

## Parameter Update Rule (w/ Ridge Regression)
$$\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ$$
$$\theta_{n}^{t+1}=\theta_{n}+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)\textcolor{red}{-2\alpha\lambda\theta_{n}}$$
$$\theta_{n}^{t+1}=\theta_{n}\textcolor{red}{-2\alpha\lambda\theta_{n}}+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)$$
$$\boxed{\theta_{n}^{t+1}=\theta_{n}\left(1\textcolor{red}{-2\alpha\lambda}\right)+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
* each step multiplies $\theta_{n}$ by $\left(1-2\alpha\lambda\right)$, so L2 regularization makes parameters **approach 0 but never actually reach 0** (assuming $0<2\alpha\lambda<1$)
* unhelpful parameters end up much closer to 0 than helpful parameters
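
The shrinking effect can be seen in a small sketch. The function name `ridge_update`, the toy data, and the values of `alpha` and `lam` are hypothetical. As in the loss above, the penalty is applied to every parameter, including the bias; whether the bias should be penalized is a separate design choice that these notes do not settle.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # theta[0] is the bias (x_0 = 1); x holds x_1, x_2, ...
    return sigmoid(theta[0] + sum(t * x_i for t, x_i in zip(theta[1:], x)))

def ridge_update(theta, X, y, alpha, lam):
    """theta_n <- theta_n * (1 - 2*alpha*lam) + (alpha / N) * sum((y - y_hat) * x_n)."""
    n_samples = len(y)
    residuals = [y_i - predict(theta, x) for x, y_i in zip(X, y)]
    new_theta = []
    for n, theta_n in enumerate(theta):
        column = [1.0 if n == 0 else x[n - 1] for x in X]
        data_step = alpha / n_samples * sum(r * x_n for r, x_n in zip(residuals, column))
        new_theta.append(theta_n * (1 - 2 * alpha * lam) + data_step)
    return new_theta

alpha, lam = 0.1, 0.5

# Penalty in isolation (data term ignored): repeated scaling by (1 - 2*alpha*lam).
theta_n = 1.0
for _ in range(50):
    theta_n *= 1 - 2 * alpha * lam
# theta_n has shrunk geometrically but is still strictly positive.

# Full update on hypothetical toy data, with and without the penalty.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]
theta = [0.5, -1.0, 2.0]
plain = ridge_update(theta, X, y, alpha, lam=0.0)
ridge = ridge_update(theta, X, y, alpha, lam)
```

The gap between `plain` and `ridge` for each parameter is $2\alpha\lambda\theta_{n}$, so larger parameters are pulled back harder.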

# **Including Lasso Regression (L1 Regularization)**

## Loss Function
Let the loss function be defined as: $$ℒ=\frac{-1}{N}\Sigma\left(y\ln\left(ŷ\right)+\left(1-y\right)\ln\left(1-ŷ\right)\right)\textcolor{red}{+\lambda\Sigma\left|\theta\right|}$$
* the part in red is the regularization penalty for L1

## Parameter Update Rule (w/ Lasso Regression)
$$\theta_{n}^{t+1}=\theta_{n}-\alpha∇ℒ$$
$$\theta_{n}^{t+1}=\theta_{n}+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)\textcolor{red}{-\alpha\lambda\operatorname{sign}\left(\theta_{n}\right)}$$
$$\boxed{\theta_{n}^{t+1}=\theta_{n}\textcolor{red}{-\alpha\lambda\operatorname{sign}\left(\theta_{n}\right)}+\frac{\alpha}{N}\Sigma\left(y-ŷ\right)\left(x_{n}\right)}$$
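* the penalty step has a constant size $\alpha\lambda$ no matter how small $\theta_{n}$ already is, so L1 regularization makes it **possible for some parameters to equal 0**
    * unhelpful parameters get turned to 0
* L1 regularization therefore performs feature selection automatically: a feature whose parameter is 0 drops out of the model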
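
A minimal sketch of the L1 update follows, again with hypothetical names, toy data, and hyperparameters. The isolated-penalty demo uses values chosen so the arithmetic is exact in floating point; with real data and a plain step like this one, a parameter will often hop back and forth around 0 instead of landing on it exactly.

```python
import math

def sign(v):
    """Returns -1, 0, or 1; sign(0) = 0 keeps a zeroed parameter at rest."""
    return (v > 0) - (v < 0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # theta[0] is the bias (x_0 = 1); x holds x_1, x_2, ...
    return sigmoid(theta[0] + sum(t * x_i for t, x_i in zip(theta[1:], x)))

def lasso_update(theta, X, y, alpha, lam):
    """theta_n <- theta_n - alpha*lam*sign(theta_n) + (alpha / N) * sum((y - y_hat) * x_n)."""
    n_samples = len(y)
    residuals = [y_i - predict(theta, x) for x, y_i in zip(X, y)]
    new_theta = []
    for n, theta_n in enumerate(theta):
        column = [1.0 if n == 0 else x[n - 1] for x in X]
        data_step = alpha / n_samples * sum(r * x_n for r, x_n in zip(residuals, column))
        new_theta.append(theta_n - alpha * lam * sign(theta_n) + data_step)
    return new_theta

alpha, lam = 0.25, 0.5  # alpha * lam = 0.125, exactly representable

# Penalty in isolation (data term ignored): fixed-size steps toward 0.
theta_n = 0.25
trajectory = []
for _ in range(3):
    theta_n -= alpha * lam * sign(theta_n)
    trajectory.append(theta_n)
# trajectory is [0.125, 0.0, 0.0]: the parameter hits 0 and stays there.

# Full update on hypothetical toy data, with and without the penalty.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]
theta = [0.5, -1.0, 2.0]
plain = lasso_update(theta, X, y, alpha, lam=0.0)
lasso = lasso_update(theta, X, y, alpha, lam)
```

Unlike the L2 case, the gap between `plain` and `lasso` is $\alpha\lambda$ in magnitude for every nonzero parameter, independent of its size.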