
### 1. Ridge Regression (L2 Regularization)

**Concept:** Ridge regression adds a penalty proportional to the sum of the *squared* coefficients to the OLS loss. Large coefficients are therefore penalized much more heavily than small ones.

**Loss Function:**
$J(\beta)_{\text{Ridge}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

Where:
*   $J(\beta)_{\text{Ridge}}$ is the Ridge regression cost function.
*   $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the Residual Sum of Squares (RSS), which is the standard OLS loss term.
*   $\lambda$ (lambda) is the tuning parameter that controls the strength of the penalty. $\lambda \ge 0$.
*   $\sum_{j=1}^{p} \beta_j^2$ is the L2 penalty term (sum of squared coefficients).

**Effect:**
*   **Shrinkage:** Ridge regression shrinks the coefficients towards zero, but it rarely makes them exactly zero. This means that all features typically remain in the model, though their impact is reduced.
*   **Bias-Variance Trade-off:** By introducing a small amount of bias (due to shrinking coefficients), Ridge regression can significantly reduce the variance of the model, leading to better predictions on unseen data, especially in the presence of multicollinearity.
*   **Multicollinearity:** It is particularly effective in handling multicollinearity (highly correlated independent variables) because it distributes the impact of correlated variables among them, instead of picking just one.
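The shrinkage effect described above can be checked numerically. The following is a minimal sketch, assuming scikit-learn's `Ridge` on the diabetes dataset; the list of `alpha` values (scikit-learn's name for $\lambda$) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# Penalty strengths chosen only for illustration (assumption, not tuned)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

norms = []        # L2 norm of the coefficient vector for each alpha
zero_counts = []  # number of coefficients that are exactly zero

for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))
    zero_counts.append(int(np.sum(coef == 0.0)))
    print(f"alpha={alpha:>6}: L2 norm of coef = {norms[-1]:8.2f}, "
          f"exact zeros = {zero_counts[-1]}")
```

The coefficient norm should fall as `alpha` grows, while the count of exactly-zero coefficients is expected to stay at zero.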

**When to Use:**
*   When you have many features and believe that most of them are relevant, but some might be highly correlated.
*   When you want to prevent overfitting without performing explicit feature selection.

### 2. Lasso Regression (L1 Regularization)

**Concept:** Lasso regression adds a penalty proportional to the sum of the *absolute values* of the coefficients to the OLS loss.

**Loss Function:**
$J(\beta)_{\text{Lasso}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Where:
*   $J(\beta)_{\text{Lasso}}$ is the Lasso regression cost function.
*   $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the Residual Sum of Squares (RSS).
*   $\lambda$ (lambda) is the tuning parameter that controls the strength of the penalty. $\lambda \ge 0$.
*   $\sum_{j=1}^{p} |\beta_j|$ is the L1 penalty term (sum of absolute values of coefficients).

**Effect:**
*   **Sparsity / Feature Selection:** Unlike Ridge, Lasso has the property of shrinking some coefficients *exactly to zero*. This means it effectively performs feature selection by excluding less important features from the model.
*   **Interpretability:** Because it drives some coefficients to zero, Lasso can produce simpler, more interpretable models with fewer features.
*   **Bias-Variance Trade-off:** Like Ridge, Lasso introduces bias to reduce variance. Its additional advantage is feature selection, which makes the fitted model easier to interpret.
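The sparsity effect can be seen by counting exact zeros as the penalty grows. This is a rough sketch, assuming scikit-learn's `Lasso` on the diabetes dataset; the `alpha` grid and the raised `max_iter` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Penalty strengths chosen only for illustration (assumption, not tuned)
alphas = [0.001, 0.1, 1.0, 10.0]

lasso_zero_counts = []
for alpha in alphas:
    # max_iter is raised so coordinate descent has room to converge for small alpha
    coef = Lasso(alpha=alpha, max_iter=100_000).fit(X, y).coef_
    lasso_zero_counts.append(int(np.sum(coef == 0.0)))
    print(f"alpha={alpha:>6}: exact zeros = {lasso_zero_counts[-1]} of {coef.size}")
```

With a strong enough penalty every coefficient is driven to zero, which is the limiting case of the feature selection described above.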

**When to Use:**
*   When you suspect that many features are irrelevant or only a small subset of features truly contribute to the prediction.
*   When you need a simpler, more interpretable model by explicitly selecting important features.
*   For high-dimensional datasets where feature selection is crucial.

### Key Differences Summarized

| Feature              | Ridge Regression (L2)                                  | Lasso Regression (L1)                                  |
| :------------------- | :----------------------------------------------------- | :----------------------------------------------------- |
| **Penalty Term**     | Sum of the squared coefficients (squared L2 norm)      | Sum of the absolute values of the coefficients (L1 norm) |
| **Effect on Coefficients** | Shrinks coefficients towards zero, but rarely to exactly zero. | Shrinks coefficients towards zero, often making some exactly zero. |
| **Feature Selection**| Does not perform intrinsic feature selection; all features generally remain. | Performs automatic feature selection by setting some coefficients to zero. |
| **Model Complexity** | Reduces variance, but keeps all features.              | Reduces variance and simplifies the model by selecting features. |
| **Interpretability** | Less interpretable if many features are retained.      | More interpretable due to feature selection (fewer non-zero coefficients). |
| **Multicollinearity**| Good at handling multicollinearity; it tends to shrink the coefficients of correlated features toward each other. | Can be unstable with highly correlated variables; it tends to pick one and zero out the others somewhat arbitrarily. |
| **Geometric Interpretation** | Penalty region is a circle/sphere.                  | Penalty region is a diamond/octahedron (favors axes). |

**In essence:**

*   **Ridge** is for when you want to reduce the magnitude of coefficients and handle multicollinearity without removing any features. It's good when you believe all features have *some* relevance.
*   **Lasso** is for when you want to achieve sparsity (drive some coefficients to zero) and perform feature selection, resulting in a simpler, more interpretable model. It's good when you suspect many features are *irrelevant*.
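The multicollinearity row of the table can be illustrated with an extreme case: two identical feature columns. The sketch below uses synthetic data; the names `X_dup` and `y_dup`, the seed, and the `alpha` values are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
unrelated = rng.normal(size=n)

# Columns 0 and 1 are exact copies: the most extreme form of multicollinearity
X_dup = np.column_stack([x1, x1, unrelated])
y_dup = 3.0 * x1 + rng.normal(scale=0.1, size=n)

ridge_dup = Ridge(alpha=1.0).fit(X_dup, y_dup)
lasso_dup = Lasso(alpha=0.1).fit(X_dup, y_dup)

print("Ridge coefficients:", ridge_dup.coef_)
print("Lasso coefficients:", lasso_dup.coef_)
```

Ridge splits the weight evenly between the two copies, because the squared penalty is smallest when they share it. For Lasso the split between identical columns is not unique, so which copy keeps the weight depends on the solver.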

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

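# No random_state is set, so the split (and the coefficients printed below) changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Ridge: alpha plays the role of lambda in the loss function above
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Lasso: alpha is again the penalty strength, but scikit-learn scales the RSS
# term by 1/(2n) here, so the two alpha values are not directly comparable
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)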

print("Ridge Coefficients:", ridge.coef_)
print("Lasso Coefficients:", lasso.coef_)


Ridge Coefficients: [  27.30180832  -26.83112487  286.57641706  163.05869312    4.92872272
  -28.16034099 -142.87431977  112.29755445  245.59101719  114.15738316]
Lasso Coefficients: [  -0.          -51.02797104  547.77247372  195.16763772 -100.50776624
   -0.         -146.93267869    0.          541.54368898   53.66332285]
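A reproducible variant of the cell above might fix the split and summarize each model instead of printing raw coefficients. This is only a sketch: `random_state=0` and the `summary` dictionary are choices made for this example, and its numbers will not match the output shown above, which came from an unseeded split.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# random_state=0 is an arbitrary seed (assumption) so that reruns give the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {"Ridge": Ridge(alpha=1.0), "Lasso": Lasso(alpha=0.1)}
summary = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)            # R^2 on the held-out data
    n_zero = int(np.sum(model.coef_ == 0.0))    # coefficients that are exactly zero
    summary[name] = {"test_r2": r2, "n_zero": n_zero}
    print(f"{name}: test R^2 = {r2:.3f}, exact zeros = {n_zero}")
```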


### The Core Idea: Constraint Regions and Loss Function

Both Ridge and Lasso add a penalty term to the ordinary least squares (OLS) loss function. This penalty can be thought of as a constraint on the size of the coefficients. We're trying to minimize the sum of squared residuals (the OLS part) subject to this constraint on the coefficients.

### 1. Ridge Regression (L2 Penalty)

*   **Penalty Term:** The L2 penalty term is $\lambda \sum_{j=1}^{p} \beta_j^2$. Minimizing the penalized loss is equivalent to minimizing the RSS subject to the constraint that the sum of the *squares* of the coefficients is at most some constant $s$, where a larger $\lambda$ corresponds to a smaller $s$.

*   **Geometric Interpretation:** In a 2-dimensional coefficient space (for two features, $\beta_1$ and $\beta_2$), the constraint region defined by $\beta_1^2 + \beta_2^2 \le s$ is a **circle** centered at the origin. In higher dimensions, it's a sphere or hypersphere.

*   **Why coefficients are *not* zeroed out:** The OLS loss function (without any penalty) has elliptical contour lines, with the center of the ellipse representing the OLS solution. When we add the L2 penalty, we're looking for the point where these elliptical OLS contours *first touch* the circular constraint region. Because the circular constraint has smooth, rounded boundaries, the optimal point where the ellipse touches the circle is almost always a tangent point where both $\beta_1$ and $\beta_2$ are non-zero (though shrunk towards zero). It's very unlikely for the tangent point to occur exactly on one of the axes unless the OLS solution itself is on an axis or at the origin.

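    In the special case of orthonormal features, the L2 penalty shrinks every coefficient by the same factor, $1/(1+\lambda)$; with correlated features, the amount of shrinkage differs by direction. In both cases it penalizes large coefficients heavily, but it has no mechanism to eliminate them completely.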
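How Ridge shrinks without zeroing can be checked in the simplest setting, where the feature columns are orthonormal. There the minimizer of the Ridge loss above is the OLS solution divided by $1 + \lambda$. The sketch below uses synthetic data and no intercept; `Q`, `lam`, and the true coefficients are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# QR factorization yields Q with orthonormal columns, so Q.T @ Q is the identity
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = Q @ np.array([4.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

lam = 1.5  # penalty strength, arbitrary

beta_ols = Q.T @ y  # OLS solution when the columns are orthonormal
# Closed-form Ridge solution: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)
print("Ratio:", beta_ridge / beta_ols)  # each entry equals 1 / (1 + lam)
```

Every coefficient is scaled by the same factor and none becomes zero. With correlated features the shrinkage is no longer uniform, but it is still multiplicative rather than a hard cut-off.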

### 2. Lasso Regression (L1 Penalty)

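*   **Penalty Term:** The L1 penalty term is $\lambda \sum_{j=1}^{p} |\beta_j|$. Minimizing the penalized loss is equivalent to minimizing the RSS subject to the constraint that the sum of the *absolute values* of the coefficients is at most some constant $s$, where a larger $\lambda$ corresponds to a smaller $s$.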

*   **Geometric Interpretation:** In a 2-dimensional coefficient space, the constraint region defined by $|\beta_1| + |\beta_2| \le s$ is a **diamond** (a square rotated by 45 degrees) centered at the origin. In higher dimensions, it's an octahedron or hyperoctahedron.

*   **Why coefficients *are* zeroed out:** Like Ridge, we're looking for the first point where the elliptical OLS contours touch the diamond-shaped constraint region. The crucial difference is that the diamond-shaped constraint has **sharp corners** at the axes. For example, if you have two coefficients, the corners are at $(s, 0)$, $(-s, 0)$, $(0, s)$, and $(0, -s)$.

    Because of these sharp corners, the elliptical OLS contours are much more likely to *first touch* the constraint region at a corner than at any particular point on a smooth boundary. At a corner, every coefficient except one is exactly zero; for example, touching $(s, 0)$ gives $\beta_2 = 0$.

    The L1 penalty has a tendency to produce sparse models because it prefers solutions where some coefficients are exactly zero, effectively performing feature selection.
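The zeroing mechanism has a simple algebraic counterpart when the feature columns are orthonormal. In that case the minimizer of the Lasso loss above is the OLS solution passed through a soft-threshold at $\lambda/2$: each coefficient moves toward zero by $\lambda/2$, and any coefficient smaller than that in magnitude becomes exactly zero. The sketch below uses synthetic data and no intercept; the helper names `soft_threshold` and `lasso_objective` are hypothetical.

```python
import numpy as np

def soft_threshold(b, t):
    """Move each entry of b toward zero by t; entries with |b| <= t become 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def lasso_objective(beta, X, y, lam):
    """RSS plus lam times the L1 norm, matching the Lasso loss in the text."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
n, p = 50, 3

# Orthonormal feature columns via QR (synthetic setup for illustration)
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = Q @ np.array([4.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

lam = 2.0  # penalty strength, arbitrary

beta_ols = Q.T @ y
beta_lasso = soft_threshold(beta_ols, lam / 2.0)

print("OLS:  ", beta_ols)
print("Lasso:", beta_lasso)
print("Objective at Lasso solution:", lasso_objective(beta_lasso, Q, y, lam))
```

The smallest OLS coefficient falls below the threshold and is set exactly to zero, while the larger ones are reduced by a constant amount. Compare this with Ridge, which rescales coefficients and so never reaches zero.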

### Visual Analogy

Imagine inflating an elliptical balloon (your OLS loss function) until it touches a solid shape (your constraint region).

*   **Ridge (Circular/Spherical constraint):** The balloon will almost always touch the smooth, rounded surface of the circle/sphere at a point *between* the axes, so all coefficients are typically non-zero.

*   **Lasso (Diamond/Octahedral constraint):** The balloon is much more likely to hit one of the sharp corners or edges of the diamond/octahedron first. The corners lie directly on the axes and, in three or more dimensions, the edges lie in the coordinate planes, so at such a point one or more coefficients are exactly zero. This is how Lasso forces some coefficients to zero.

This geometric property is why Lasso is used for feature selection, as it can completely remove the influence of less important features by setting their coefficients to zero, leading to simpler and more interpretable models. Ridge, on the other hand, is effective at shrinking coefficients to handle multicollinearity without outright removing features.
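The geometric picture can be reproduced by brute force in two dimensions. The sketch below takes a made-up elliptical loss (the matrix `A` and its center `b` are assumptions, not fitted to any data), samples points on the boundary of each constraint region, and keeps the one with the lowest loss. Both regions use size 1, and the center `b` lies outside both, so the constrained minimum sits on the boundary.

```python
import numpy as np

# Hypothetical elliptical loss: (beta - b)^T A (beta - b), minimized at b
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([2.0, 0.5])

def loss(beta):
    """Quadratic loss for one point (shape (2,)) or many points (shape (N, 2))."""
    d = beta - b
    return np.einsum("...i,ij,...j->...", d, A, d)

size = 1.0  # radius of the circle and half-diagonal of the diamond
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Circle: beta_1^2 + beta_2^2 = size^2
l2_boundary = size * directions
# Diamond: |beta_1| + |beta_2| = size (rescale each direction to unit L1 norm)
l1_boundary = size * directions / np.abs(directions).sum(axis=1, keepdims=True)

beta_l2 = l2_boundary[np.argmin(loss(l2_boundary))]
beta_l1 = l1_boundary[np.argmin(loss(l1_boundary))]

print("Best point on the circle: ", beta_l2)
print("Best point on the diamond:", beta_l1)
```

For this particular loss, the best point on the diamond is a corner, with the second coefficient exactly zero, while the best point on the circle keeps both coefficients non-zero.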