# Regularization Detail 

> - Details of Regularization
> - Approaches to understanding how regularization works

### Understanding how regularization works 

Having worked through some regularization examples, let's examine intuitively how these techniques (Ridge, Lasso, and Elastic Net) interact with modeling.

There are several approaches to this interpretation:

- The **Analytic**  view 
- The **Geometric** view
- The **Probabilistic** view

--- 
### The analytic view

Increasing the L2 or L1 penalty forces the coefficients toward smaller magnitudes, restricting their plausible range.

A model whose coefficients are confined to a smaller range is simpler and has lower variance than a model whose coefficients can take any value.

![image.png](attachment:image.png)
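
As a rough illustration of the shrinkage described above, the following sketch fits ridge regression in closed form for several penalty strengths and prints the size of the coefficient vector. The synthetic data, the helper name `ridge_fit`, and the list of `lambdas` are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept): (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (assumed for illustration): 5 features with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

# lam = 0 is ordinary least squares; larger values penalize large coefficients more.
lambdas = [0.0, 1.0, 100.0, 10000.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>8}: ||beta||_2 = {norm:.4f}")
```

The printed L2 norm of the coefficients should fall as the penalty grows, which is the "restricted range" the analytic view refers to.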

### The Geometric view

Below are mathematically equivalent formulations of the optimization objectives for Ridge and LASSO:

![image-2.png](attachment:image-2.png)
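
For reference, the standard textbook statement of this equivalence is sketched below. The notation ($x_i$, $y_i$, the budget $t$, and the penalty weight $\lambda$) is assumed here and may differ from the symbols used in the figure.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t
$$

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
$$

Each constrained problem corresponds to a penalized problem for a suitable $\lambda \ge 0$:

$$
\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad \text{and} \qquad
\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

A smaller budget $t$ corresponds to a larger penalty weight $\lambda$.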

---
### The Geometric view

Under this geometric formulation, the constrained minimum is found where a contour of the traditional OLS cost function surface first touches the penalty boundary.

The geometry reveals the selection effect of LASSO: the L1 boundary has corners on the axes, and when the contact point falls at a corner, the corresponding coefficients are exactly zero.

![image-3.png](attachment:image-3.png)
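
The selection effect can also be checked numerically. The sketch below is a minimal NumPy illustration, not the notebook's own code: it fits LASSO with a simple coordinate-descent loop (using soft-thresholding) and ridge in closed form on synthetic data, then counts exact zeros. The data, the function names, and the penalty value `lam = 0.5` are all assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it lies within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cycling over coordinates."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            beta[j] = soft_threshold(rho, lam) / z
    return beta

def ridge_closed_form(X, y, lam):
    """Minimize (1/2n) * ||y - X b||^2 + (lam/2) * ||b||_2^2 in closed form."""
    n_samples, n_features = X.shape
    return np.linalg.solve(X.T @ X / n_samples + lam * np.eye(n_features),
                           X.T @ y / n_samples)

# Synthetic data (assumed): two of the five true coefficients are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

lam = 0.5
lasso_beta = lasso_coordinate_descent(X, y, lam)
ridge_beta = ridge_closed_form(X, y, lam)

print("lasso:", np.round(lasso_beta, 3), "zeros:", int(np.sum(lasso_beta == 0)))
print("ridge:", np.round(ridge_beta, 3), "zeros:", int(np.sum(ridge_beta == 0)))
```

The soft-thresholding step is where the corner of the L1 region shows up algebraically: any coordinate whose correlation with the residual falls below `lam` is set to exactly zero, whereas the ridge solution only scales coefficients down.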

---

### The Probabilistic view

![image-4.png](attachment:image-4.png)

Letting $f$ be the likelihood (the probability of the target given the parameter vector $\beta$) and $p(\beta)$ the prior distribution of $\beta$, we can calculate the posterior of $\beta$, which is proportional to the product of the two.

The prior $p(\beta)$ is derived from independent draws of a coefficient density function $g$ that we choose when regularizing.

L2 (Ridge) regularization imposes a **Gaussian prior** on the coefficients, while L1 (Lasso) regularization imposes a **Laplace prior**.
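
To see why these priors give back the familiar penalties, here is a short derivation sketch. It assumes a Gaussian likelihood with noise variance $\sigma^2$, a prior variance $\tau^2$ for the Gaussian case, and a scale $b$ for the Laplace case; these symbols are introduced here for illustration and do not appear in the figures.

The posterior is proportional to likelihood times prior, so the most probable $\beta$ minimizes the negative log posterior:

$$
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \left[ -\log f(y \mid \beta) - \sum_{j=1}^{p} \log g(\beta_j) \right]
$$

With $y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)$ and a Gaussian prior $g(\beta_j) \propto \exp\left(-\beta_j^2 / 2\tau^2\right)$:

$$
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 \right]
$$

which is the ridge objective with $\lambda = \sigma^2 / \tau^2$. With a Laplace prior $g(\beta_j) \propto \exp\left(-|\beta_j| / b\right)$:

$$
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \frac{1}{b} \sum_{j=1}^{p} |\beta_j| \right]
$$

which is the lasso objective with $\lambda = 2\sigma^2 / b$. In both cases a tighter prior (smaller $\tau^2$ or $b$) means a stronger penalty.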

### The probabilistic view

![image-5.png](attachment:image-5.png)

### Regularization Recap

`Complexity Tradeoff`
- Optimizing predictive models is about finding the right bias/variance tradeoff.
- We need models that are sufficiently complex to capture patterns in data, but not so complex that they overfit.

`Regularization`
- Reduces complexity by penalizing it in the cost function.
- Increases bias but reduces variance (often worth the trade-off).
- Options: L2 (Ridge), L1 (Lasso), or both (Elastic Net); the choice and its strength can be picked by validation.

`How it works`
- **Analytically**: the penalty constrains the coefficient range.
- **Geometrically**: the L1/L2 penalty confines the coefficients to a bounded region.
- **Probabilistically**: the penalty imposes a prior on the coefficients.
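
The recap mentions validating the penalty strength. One simple way to do that is a hold-out split, sketched below with NumPy only. The synthetic data, the 40/20 split, the helper `ridge_fit`, and the candidate grid are assumptions for this example; in practice k-fold cross-validation is a common alternative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept): (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data (assumed): many features relative to samples, few truly relevant.
rng = np.random.default_rng(1)
n_samples, n_features = 60, 30
X = rng.normal(size=(n_samples, n_features))
true_beta = np.zeros(n_features)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=1.0, size=n_samples)

# Hold out the last 20 rows for validation.
X_train, X_val = X[:40], X[40:]
y_train, y_val = y[:40], y[40:]

candidate_lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
val_errors = {}
for lam in candidate_lambdas:
    beta = ridge_fit(X_train, y_train, lam)
    # Mean squared error on data the model did not see during fitting.
    val_errors[lam] = float(np.mean((y_val - X_val @ beta) ** 2))

best_lambda = min(val_errors, key=val_errors.get)
for lam, err in val_errors.items():
    print(f"lambda={lam:>7}: validation MSE = {err:.3f}")
print("selected lambda:", best_lambda)
```

Too little penalty tends to overfit the training rows and too much tends to underfit, so the validation error is what guides the choice between them.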
