# Elastic Net Regularization — Complete Mathematical & Conceptual Guide (Arnav Tomar copyright)

---

## 1. Motivation: Why Elastic Net Exists

Regularization adds a penalty on coefficient size to the training loss, which controls model complexity and reduces overfitting.

Two classical approaches:

- Ridge Regression (L2): handles multicollinearity but keeps all features
- Lasso Regression (L1): performs feature selection but is unstable with correlated features

Elastic Net is introduced to **combine the strengths of both**.

---

## 2. Linear Regression (Baseline)

Model:

$$
\hat{y} = Xw + b
$$

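Mean Squared Error loss (the intercept $b$ is dropped from here on; assume centered data or a constant column absorbed into $X$):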

$$
J(w) = \frac{1}{n}\|y - Xw\|_2^2
$$
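As a minimal sketch of the model and loss above, the snippet below fits ordinary least squares with NumPy and evaluates $J(w)$. The synthetic data, the helper name `mse_loss`, and the omission of the intercept are assumptions made for illustration.

```python
import numpy as np

def mse_loss(X, y, w):
    """J(w) = (1/n) * ||y - Xw||_2^2, with the intercept omitted."""
    residual = y - X @ w
    return residual @ residual / len(y)

# Synthetic data (illustrative assumption): 50 samples, 3 features, no offset
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

# Least-squares estimate: the minimizer of J(w)
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated w:", w_ols)
print("training MSE:", mse_loss(X, y, w_ols))
```

Because `w_ols` minimizes the training loss, its MSE can never exceed the MSE of the true coefficients on the same data, which is one way to see how unregularized fitting chases noise.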

Problem: with many or highly correlated features, the least-squares solution has high variance and overfits.

---

## 3. Ridge Regression (L2)

Objective function:

$$
J_{\text{ridge}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_2^2
$$

Key properties:

- Shrinks coefficients
- Never sets coefficients exactly to zero
- Stable under multicollinearity

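Closed-form solution (obtained by setting the gradient of the objective above to zero; the factor $n$ comes from the $\frac{1}{n}$ in the loss):

$$
w = (X^TX + n\lambda I)^{-1}X^Ty
$$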
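The closed form can be checked numerically. Setting the gradient of $J_{\text{ridge}}$ as written above (with its $\frac{1}{n}$ factor) to zero gives $(X^TX + n\lambda I)\,w = X^Ty$; if the loss is written without the $\frac{1}{n}$, the $n$ disappears. The sketch below solves that system and confirms the gradient vanishes at the solution. The function names and synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n)||y - Xw||^2 + lam * ||w||^2 by solving a linear system."""
    n, p = X.shape
    # Solve (X^T X + n*lam*I) w = X^T y instead of forming an explicit inverse
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def ridge_gradient(X, y, w, lam):
    """Gradient of the ridge objective; it should vanish at the minimizer."""
    n = len(y)
    return (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=40)

w_ridge = ridge_closed_form(X, y, lam=0.1)
w_ols = ridge_closed_form(X, y, lam=0.0)

print("ridge coefficients:", w_ridge)
print("gradient norm at solution:", np.linalg.norm(ridge_gradient(X, y, w_ridge, 0.1)))
print("coefficient norm, ridge vs OLS:", np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```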

---

## 4. Lasso Regression (L1)

Objective function:

$$
J_{\text{lasso}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_1
$$

Key properties:

- Produces sparse solutions
- Performs feature selection
- Unstable when features are highly correlated

No general closed-form solution exists because $\|w\|_1$ is not differentiable at zero; the problem is solved iteratively.
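The one-dimensional case does have a closed form, and it shows where the exact zeros come from: the minimizer of $\frac{1}{2}(w - z)^2 + t\,|w|$ is the soft-thresholding operator $S(z, t) = \operatorname{sign}(z)\max(|z| - t, 0)$. The helper name `soft_threshold` and the sample inputs below are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5 * (w - z)**2 + t * |w| over w."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.2, 1.5])
shrunk = soft_threshold(z, t=1.0)

# Inputs with |z| <= t are mapped to exactly zero; the rest move toward zero by t
print(shrunk)
```

Ridge, by contrast, rescales each coefficient by a constant factor, so a nonzero input never becomes exactly zero.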

---

## 5. Limitation of Ridge and Lasso

### Ridge limitation:
- Cannot remove irrelevant features

### Lasso limitation:
- Tends to select only one feature from a correlated group
- Which feature is selected can change with small changes in the data

Elastic Net addresses both limitations.

---

## 6. Elastic Net Objective Function

Elastic Net combines L1 and L2 penalties:

$$
J_{\text{EN}}(w) =
\frac{1}{n}\|y - Xw\|_2^2
+ \lambda \left(
\alpha \|w\|_1
+ (1 - \alpha)\|w\|_2^2
\right)
$$

Where:

- $\lambda \ge 0$ → overall regularization strength
- $0 \le \alpha \le 1$ → mixing parameter (the share of the penalty given to the L1 term)
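The objective translates directly into code. The sketch below evaluates $J_{\text{EN}}(w)$ for a given coefficient vector; the function name `elastic_net_objective` and the toy values are assumptions for illustration.

```python
import numpy as np

def elastic_net_objective(X, y, w, lam, alpha):
    """J_EN(w) = (1/n)||y - Xw||^2 + lam * (alpha*||w||_1 + (1 - alpha)*||w||_2^2)."""
    n = len(y)
    residual = y - X @ w
    data_fit = residual @ residual / n
    l1 = np.sum(np.abs(w))
    l2_squared = w @ w
    return data_fit + lam * (alpha * l1 + (1.0 - alpha) * l2_squared)

# Toy values (illustrative assumptions)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25])

for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_objective(X, y, w, lam=0.1, alpha=alpha))
```

Setting `alpha=0.0` leaves only the squared L2 term and `alpha=1.0` leaves only the L1 term, so the same function also evaluates the ridge and lasso objectives.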

---

## 7. Special Cases

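### When $\alpha = 0$

$$
J_{\text{EN}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_2^2 = J_{\text{ridge}}(w)
$$

### When $\alpha = 1$

$$
J_{\text{EN}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_1 = J_{\text{lasso}}(w)
$$

### When $0 < \alpha < 1$

Both penalties are active: the L1 term can still set coefficients exactly to zero, while the L2 term keeps the solution stable.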

---

## 8. Alternative Parameterization (Library Form)

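An equivalent form specifies the two penalty weights separately, $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1 - \alpha)$, so that: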

$$
\lambda = \lambda_1 + \lambda_2
$$

$$
\alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}
$$

Penalty becomes:

$$
\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2
$$
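The conversions between the two forms are simple arithmetic, sketched below. The last helper also maps this guide's $(\lambda, \alpha)$ to scikit-learn's `ElasticNet(alpha=..., l1_ratio=...)`, whose documented objective is $\frac{1}{2n}\|y - Xw\|_2^2 + a\,r\,\|w\|_1 + \frac{1}{2}a(1 - r)\|w\|_2^2$. Note the naming clash: scikit-learn's `alpha` is the overall strength and `l1_ratio` is the mix. All three function names are hypothetical.

```python
def split_penalty(lam, alpha):
    """(lambda, alpha) -> (lambda_1, lambda_2) for lambda_1*||w||_1 + lambda_2*||w||_2^2."""
    return lam * alpha, lam * (1.0 - alpha)

def merge_penalty(lam1, lam2):
    """(lambda_1, lambda_2) -> (lambda, alpha); requires lam1 + lam2 > 0."""
    lam = lam1 + lam2
    return lam, lam1 / lam

def to_sklearn(lam, alpha):
    """Map this guide's (lambda, alpha) to scikit-learn's (alpha, l1_ratio).

    scikit-learn's objective is half of this guide's objective when
        a * r       = lam * alpha / 2
        a * (1 - r) = lam * (1 - alpha)
    Solving these two equations gives the expressions below (requires lam > 0).
    """
    a = lam * (1.0 - alpha / 2.0)
    r = alpha / (2.0 - alpha)
    return a, r

lam1, lam2 = split_penalty(0.2, 0.25)
print("lambda_1, lambda_2:", lam1, lam2)
print("round trip:", merge_penalty(lam1, lam2))
print("scikit-learn (alpha, l1_ratio):", to_sklearn(0.2, 0.25))
```

Scaling the whole objective by a constant does not change its minimizer, which is why matching the two penalty weights after halving is enough.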

---

## 9. Why Elastic Net Works for Multicollinearity

Given two highly correlated features:

$$
x_1 \approx x_2
$$

- Lasso tends to select **one** and discard the other
- Ridge shrinks both but keeps both
- Elastic Net tends to **keep both, with similar coefficients, and shrink them together**

This is called the **grouping effect**.
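The grouping effect can be observed on synthetic data with two nearly identical features. The sketch below fits scikit-learn's `Lasso` and `ElasticNet` on the same data; the data-generating process and the penalty settings are assumptions chosen for illustration, and the exact coefficients depend on them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Two near-duplicate features that both drive the target (illustrative assumption)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 3.0 * x2 + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
# A tight tolerance lets the two Elastic Net coefficients settle close together
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-8, max_iter=10_000).fit(X, y)

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

With the L2 term present, splitting weight evenly across the two features costs less than concentrating it on one, so the Elastic Net coefficients end up nearly equal. Lasso has no such preference and often puts most of the weight on a single feature.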

---

## 10. Geometry Intuition

### Lasso constraint:

$$
\|w\|_1 \le c
$$

→ diamond-shaped  
→ sharp corners → sparsity

### Ridge constraint:

$$
\|w\|_2^2 \le c
$$

→ circular  
→ smooth → no sparsity

### Elastic Net constraint:

$$
\alpha \|w\|_1 + (1-\alpha)\|w\|_2^2 \le c
$$

→ rounded diamond  
→ sparsity + stability

---

## 11. Optimization Properties

- Convex objective (strictly convex when $\lambda > 0$ and $\alpha < 1$)
- Non-differentiable wherever a coefficient equals zero, due to the L1 term
- Solved using:
  - Coordinate Descent
  - Proximal Gradient Descent
  - SGD with elastic penalty
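As a minimal sketch of the first method in the list, the code below runs cyclic coordinate descent on the objective $\frac{1}{n}\|y - Xw\|_2^2 + \lambda(\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2)$. Holding all other coordinates fixed, the update for $w_j$ is a soft-threshold of $\rho_j = \frac{2}{n}x_j^T r_j$ divided by $\frac{2}{n}\|x_j\|_2^2 + 2\lambda(1-\alpha)$, where $r_j$ is the residual with feature $j$ removed. The function names, the fixed sweep count, and the synthetic data are assumptions; production solvers add convergence checks and intercept handling.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_sweeps=500):
    """Cyclic coordinate descent for
    (1/n)||y - Xw||^2 + lam * (alpha*||w||_1 + (1 - alpha)*||w||_2^2).
    No intercept; assumes centered data and no all-zero columns.
    """
    n, p = X.shape
    w = np.zeros(p)
    residual = y.astype(float)          # residual = y - X @ w, kept up to date
    col_sq = (X ** 2).sum(axis=0)       # ||x_j||^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # (2/n) * x_j^T r_j, where r_j adds feature j's contribution back
            rho = (2.0 / n) * (X[:, j] @ residual + col_sq[j] * w[j])
            denom = (2.0 / n) * col_sq[j] + 2.0 * lam * (1.0 - alpha)
            w_new = soft_threshold(rho, lam * alpha) / denom
            residual += X[:, j] * (w[j] - w_new)   # keep residual consistent
            w[j] = w_new
    return w

# Synthetic sparse problem (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0, 1.5]) + 0.5 * rng.normal(size=100)

lam, alpha = 0.2, 0.9
w_hat = elastic_net_cd(X, y, lam, alpha)
print("coefficients:", w_hat)
print("exact zeros:", int(np.sum(w_hat == 0.0)))
```

Each coordinate update solves its one-dimensional subproblem exactly, so no step size is needed. That is a large part of why coordinate descent is a common choice for this objective.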

---

## 12. Bias–Variance Tradeoff

As $\lambda$ increases:

$$
\text{Bias} \uparrow
$$

$$
\text{Variance} \downarrow
$$

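Elastic Net allows smoother control than pure Lasso: with $\lambda > 0$ and $\alpha < 1$, the L2 term makes the objective strictly convex, so the minimizer is unique even when features are correlated.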
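The trend can be made concrete in the ridge special case ($\alpha = 0$), where the estimator is linear in $y$ and both quantities have closed forms. Assuming $y = Xw^* + \varepsilon$ with noise variance $\sigma^2$ and writing $k = n\lambda$, the bias is $-k(X^TX + kI)^{-1}w^*$ and the covariance is $\sigma^2 (X^TX + kI)^{-1}X^TX(X^TX + kI)^{-1}$. The function name, the true coefficients, and the noise level below are assumptions for illustration; the general Elastic Net has no such closed form.

```python
import numpy as np

def ridge_bias_variance(X, w_true, sigma, lam):
    """Squared bias and total variance of the ridge estimator for a fixed design X.

    Assumes y = X @ w_true + noise with noise variance sigma**2 and the
    (1/n)-scaled ridge objective, so the effective shrinkage is k = n * lam.
    """
    n, p = X.shape
    k = n * lam
    gram = X.T @ X
    inv = np.linalg.inv(gram + k * np.eye(p))
    bias = -k * inv @ w_true                 # E[w_hat] - w_true
    cov = sigma ** 2 * inv @ gram @ inv      # Cov(w_hat)
    return bias @ bias, np.trace(cov)

# Fixed design and true coefficients (illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])

lams = [0.0, 0.01, 0.1, 1.0, 10.0]
results = [ridge_bias_variance(X, w_true, sigma=1.0, lam=lam) for lam in lams]
for lam, (bias_sq, var) in zip(lams, results):
    print(f"lambda={lam:<5} bias^2={bias_sq:.4f} variance={var:.4f}")
```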

---

## 13. Feature Selection Behavior

Elastic Net:

- Can set coefficients exactly to zero
- Keeps correlated important features together
- Reduces the number of active features without arbitrarily dropping one of a correlated pair

---

## 14. When to Use Elastic Net

Use Elastic Net when:

- The number of features far exceeds the number of samples ($p \gg n$)
- Strong multicollinearity exists
- Feature importance is unknown
- Both stability and sparsity are needed

---

## 15. Interview-Ready Comparison

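| Method | Shrinks Coefficients | Feature Selection | Behavior Under Multicollinearity |
|--------|----------------------|-------------------|----------------------------------|
| Ridge | Yes | No | Stable; keeps all correlated features |
| Lasso | Yes | Yes | Unstable; tends to pick one feature per group |
| Elastic Net | Yes | Yes | Stable; tends to keep correlated groups together |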

---

## 16. One-Line Interview Answer

> Elastic Net combines L1 and L2 penalties to achieve both feature selection and coefficient stability, and it is especially effective when features are highly correlated.

---

## 17. Key Takeaways

- Elastic Net generalizes Ridge and Lasso
- Controlled by $\lambda$ and $\alpha$
- Produces sparse yet stable models
- Preferred for high-dimensional correlated data

---

## 18. Memory Formula

$$
\text{Elastic Net} = \text{Lasso} + \text{Ridge}
$$

$$
\alpha \rightarrow \text{L1 vs L2 balance}
$$

$$
\lambda \rightarrow \text{penalty strength}
$$

---

In [None]:
# Load the diabetes dataset and hold out 20% of the rows for testing
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)


In [None]:
# Linear Regression: unregularized baseline, scored by R^2 on the test split
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

In [None]:
# Ridge: scikit-learn's alpha sets the L2 penalty strength
reg = Ridge(alpha=0.1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.45199494197195456

In [None]:
# Lasso: scikit-learn's alpha sets the L1 penalty strength
reg = Lasso(alpha=0.01)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)


0.44111855963110613

In [None]:
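# ElasticNet: in scikit-learn, alpha plays the role of the overall strength (lambda in this guide)
# and l1_ratio plays the role of the L1/L2 mix (alpha in this guide)
reg = ElasticNet(alpha=0.005, l1_ratio=0.9)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)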

0.4531474541554823
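The penalty settings in the cells above were fixed by hand. One possible way to choose them from data is scikit-learn's `ElasticNetCV`, which cross-validates over a grid of `alpha` values for each candidate `l1_ratio`. The candidate list and the fold count below are assumptions for illustration, not tuned recommendations, and the selected values depend on the split.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Candidate mixes are an illustrative choice; the alpha grid is generated automatically
candidate_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 1.0]
cv_model = ElasticNetCV(l1_ratio=candidate_ratios, cv=5, max_iter=10_000)
cv_model.fit(X_train, y_train)

print("selected alpha:", cv_model.alpha_)
print("selected l1_ratio:", cv_model.l1_ratio_)
print("test R^2:", r2_score(y_test, cv_model.predict(X_test)))
```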