# Ridge Regression

## What is Ridge Regression?

Ridge Regression is like **regular linear regression with a "safety brake"**. It discourages large coefficients, which keeps the model from fitting the training data too closely.

Think of it this way:

* **Regular linear regression** picks the line with the smallest squared error on the training data
* **Ridge regression** picks a line that still fits well but keeps the coefficients small, so the line does not get too steep

---

## What is L2 Regularization? (The Key Ingredient!)

**L2 regularization** is the special rule used in Ridge Regression.

* It adds a **penalty** based on the **sum of the squares of the model’s weights** (coefficients).
* This penalty is called the **L2 penalty** because it is the squared L2 (Euclidean) norm of the weight vector; the “2” refers to the squares.

### L2 in the Ridge Formula

$$
\text{Total Error} = \text{Prediction Error} + \lambda \cdot \left(\text{sum of squared weights}\right)
$$

Where:

* $\lambda$ (lambda) sets how strong the penalty is (how strict the speed limit is): $\lambda = 0$ gives regular linear regression, and larger values push the weights closer to zero.
* The penalty is the **L2 term**: if your weights are big, the penalty is big; if weights are small, the penalty is small.
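The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library API: the function name `ridge_cost` and the toy data are made up for this example, and the prediction error is assumed to be the mean squared error, matching the cost function used later in this document.

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    """Prediction error (assumed here to be MSE) plus lam * sum of squared weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    prediction_error = np.mean((y - X @ w) ** 2)
    l2_penalty = lam * np.sum(w ** 2)
    return prediction_error + l2_penalty

# Hypothetical toy data: three samples, two features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = [1.0, 2.0]  # these weights predict y exactly, so the prediction error is 0

cost_no_penalty = ridge_cost(X, y, w, lam=0.0)    # only the prediction error
cost_weak_penalty = ridge_cost(X, y, w, lam=0.1)  # 0 + 0.1 * (1 + 4)
cost_strong_penalty = ridge_cost(X, y, w, lam=1.0)  # 0 + 1.0 * (1 + 4)
print(cost_no_penalty, cost_weak_penalty, cost_strong_penalty)
```

The weights are the same in all three calls, so the only thing that changes the total is $\lambda$: a larger $\lambda$ makes the same weights more expensive.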

---

## Why Do We Need Ridge Regression?

### Problem: Overfitting

Imagine you're trying to predict house prices. Regular linear regression might:

* Fit the training data perfectly
* But fail miserably on new houses
* Because it's memorizing noise, not learning patterns

### Solution: Add a Penalty with L2 Regularization

Ridge Regression says: "Make good predictions, BUT don't use huge numbers in your formula."

* **The L2 penalty** is what keeps the numbers small and the model simple.

---

## Key Benefits (Simple Terms)

| Benefit                         | What it means                                    |
| ------------------------------- | ------------------------------------------------ |
| **Prevents Overfitting**        | Works better on new data                         |
| **Handles Correlated Features** | Doesn't break when variables are related         |
| **Stable Results**              | Small data changes don't cause big model changes |
| **Works with Many Features**    | Handles datasets with lots of columns            |
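The "correlated features" and "stable results" rows can be illustrated with a small experiment. The sketch below is an assumption-laden toy: the data is synthetic, the helper `fit` is hypothetical, there is no intercept, and the cost is taken to be mean squared error plus $\lambda$ times the sum of squared weights, which leads to solving $(X^\top X + n\lambda I)\,w = X^\top y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def fit(X, y, lam):
    """Weights minimizing mean((y - X @ w)**2) + lam * sum(w**2), no intercept."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

w_ols = fit(X, y, lam=0.0)    # regular linear regression
w_ridge = fit(X, y, lam=0.1)  # ridge regression

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

With two nearly identical features, regular regression is free to split the effect between them in extreme ways (for example one large positive and one large negative weight). The ridge weights are always smaller in overall size and end up sharing the effect roughly equally.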

---

## When to Use Ridge Regression

**Use it when:**

* You have many features (columns)
* Your features are correlated with each other
* Regular regression gives you huge, unstable coefficients
* Your model works great on training data but poorly on test data

**Don't use it when:**

* You want to completely remove some features (use Lasso/L1 instead)
* You have very few features and no overfitting problems
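The first point, that Ridge shrinks weights but does not remove features, can be checked directly. The following is a rough sketch on synthetic data: the helper `ridge_weights` is hypothetical, the third feature is deliberately irrelevant, and the cost is assumed to be mean squared error plus $\lambda$ times the sum of squared weights with no intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
# Only the first two features influence y; the third is pure noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

def ridge_weights(X, y, lam):
    """Weights minimizing mean((y - X @ w)**2) + lam * sum(w**2), no intercept."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

lambdas = [0.1, 1.0, 10.0]
results = [ridge_weights(X, y, lam) for lam in lambdas]
for lam, w in zip(lambdas, results):
    print(f"lambda = {lam}: weights = {np.round(w, 4)}, exact zeros = {np.sum(w == 0.0)}")
```

As $\lambda$ grows, all three weights move toward zero, but none of them lands exactly on zero, which is why Lasso is the usual choice when you want features dropped entirely.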

---

## Simple Analogy

Think of Ridge Regression like **driving with a speed limit**:

* **Regular regression**: "Get to your destination as fast as possible!" (might crash)
* **Ridge regression**: "Get there fast, but don't exceed the speed limit!" (safer journey)

The **speed limit** is the **L2 penalty** — it keeps things under control.

---

## Bottom Line

**Ridge Regression = Linear Regression + L2 Regularization (L2 penalty)**

It keeps your model's coefficients in check by adding a penalty for large weights.


# Mathematical Example

---

## Recap: What is Ridge Regression?

Ridge Regression is an improved version of linear regression that:

- Fits a line through data points  
- Keeps the model simple  
- Adds a penalty to prevent large coefficient values  

---

## Ridge Regression Formula

The cost function for Ridge Regression is:

$$
\text{Cost} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w \cdot x_i + b)\right)^2 + \lambda \cdot w^2
$$

For simplicity, we'll assume $ b = 0 $, so the equation becomes:

$$
\text{Cost} = \frac{1}{n} \sum_{i=1}^{n} (y_i - w \cdot x_i)^2 + \lambda \cdot w^2
$$

Where:

- $ n $: number of data points  
- $ x_i, y_i $: input and actual output values  
- $ w $: the coefficient (slope)  
- $ \lambda $: regularization strength  
- $ w^2 $: penalty term that discourages large weights  
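Because this cost is a simple quadratic in $ w $, its minimum can be worked out by hand. The short derivation below is a sketch for the one-feature, $ b = 0 $ case only; the symbol $ w^{*} $ (the minimizing weight) is notation introduced here, not something defined elsewhere in this document.

Set the derivative of the cost with respect to $ w $ to zero:

$$
\frac{d\,\text{Cost}}{dw} = -\frac{2}{n} \sum_{i=1}^{n} x_i \,(y_i - w \cdot x_i) + 2 \lambda w = 0
$$

Solving for $ w $ gives:

$$
w^{*} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + n \lambda}
$$

With $ \lambda = 0 $ this is the ordinary least-squares slope through the origin. Any $ \lambda > 0 $ makes the denominator larger, which pulls $ w^{*} $ toward zero.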

---

## Our Simple Data

| x | y_actual |
|---|----------|
| 1 | 2        |
| 2 | 3        |
| 3 | 4        |

We want to find the best value for $ w $ in:

$$
y_{\text{predicted}} = w \cdot x
$$

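We will test different $ w $ values using:

$$
\text{Cost} = \text{Average Squared Error} + \lambda \cdot w^2
$$

Here the average squared error is the mean squared error (MSE): the squared errors summed and divided by $ n $.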

Assume:

- $ \lambda = 1 $

---

## Step-by-Step Calculations

---

### Try $ w = 1 $

$$
y_{\text{predicted}} = 1 \cdot x
$$

| x | y_actual | y_predicted | Error = $ y_{\text{actual}} - y_{\text{predicted}} $ | Error² |
|---|----------|-------------|--------------------------------------------------------|--------|
| 1 | 2        | 1           | 1                                                      | 1      |
| 2 | 3        | 2           | 1                                                      | 1      |
| 3 | 4        | 3           | 1                                                      | 1      |

- Total squared error = $ 1 + 1 + 1 = 3 $
- Average squared error (MSE) = $ \frac{3}{3} = 1 $
- Penalty = $ 1^2 = 1 $
- Total cost = $ 1 + 1 = \mathbf{2} $

---

### Try $ w = 1.5 $

$$
y_{\text{predicted}} = 1.5 \cdot x
$$

| x | y_actual | y_predicted | Error | Error² |
|---|----------|-------------|--------|--------|
| 1 | 2        | 1.5         | 0.5    | 0.25   |
| 2 | 3        | 3.0         | 0      | 0      |
| 3 | 4        | 4.5         | -0.5   | 0.25   |

- Total squared error = $ 0.25 + 0 + 0.25 = 0.5 $
- Average squared error (MSE) = $ \frac{0.5}{3} \approx 0.167 $
- Penalty = $ 1.5^2 = 2.25 $
- Total cost = $ 0.167 + 2.25 = \mathbf{2.417} $

---

### Try $ w = 0.8 $

$$
y_{\text{predicted}} = 0.8 \cdot x
$$

| x | y_actual | y_predicted | Error | Error² |
|---|----------|-------------|--------|--------|
| 1 | 2        | 0.8         | 1.2    | 1.44   |
| 2 | 3        | 1.6         | 1.4    | 1.96   |
| 3 | 4        | 2.4         | 1.6    | 2.56   |

- Total squared error = $ 1.44 + 1.96 + 2.56 = 5.96 $
- Average squared error (MSE) = $ \frac{5.96}{3} \approx 1.987 $
- Penalty = $ 0.8^2 = 0.64 $
- Total cost = $ 1.987 + 0.64 = \mathbf{2.627} $

---

## Final Comparison Table

| $ w $ | Average Squared Error (MSE) | Penalty $ \lambda w^2 $ | Total Cost                 |
|--------|-----------------------------|--------------------------|----------------------------|
| 1.0    | 1.0                         | 1.0                      | **2.0 (lowest of three)**  |
| 1.5    | 0.167          | 2.25               | 2.417          |
| 0.8    | 1.987          | 0.64               | 2.627          |

---

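## Final Answer:

Among the three values tested, the lowest cost is at:

$$
\boxed{w = 1}
\quad \Rightarrow \quad y_{\text{predicted}} = 1 \cdot x
$$

This is the best of the three candidates, not the overall minimum. Minimizing the cost over all values of $ w $ (with $ \lambda = 1 $) gives $ w = \frac{20}{17} \approx 1.176 $, with a total cost of about $ 1.824 $.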

---

## Summary:

- Ridge Regression balances good predictions with small coefficients  
- It uses this cost formula:

$$
\text{Cost} = \frac{1}{n} \sum (y - wx)^2 + \lambda w^2
$$

- We tested $ w = 1, 1.5, 0.8 $  
- Lowest cost among these = **2** when $ w = 1 $

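So Ridge Regression trades a little training error for smaller coefficients, which gives a **simpler and more stable model**.

The notebook cells below reproduce these calculations with NumPy.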

In [39]:
import numpy as np

In [40]:
# Sample data
X = np.array([1, 2, 3])      # input
y = np.array([2, 3, 4])      # actual output

In [41]:
X,y

(array([1, 2, 3]), array([2, 3, 4]))

In [42]:
# Candidate values of w to test
w_val = [0.8, 1.0, 1.5]

In [43]:
w_val

[0.8, 1.0, 1.5]

In [44]:
lambda_reg = 1  # regularization strength

In [45]:
all_costs = []

In [46]:
for w in w_val:
    # Calculate predictions
    y_pred = w * X
    
    # Calculate MSE (Mean Squared Error)
    mse = np.mean((y - y_pred) ** 2)
    
    # Calculate penalty (L2 regularization)
    penalty = lambda_reg * (w ** 2)
    
    # Total cost
    total_cost = mse + penalty
    all_costs.append(total_cost)
    
    print(f"{w}\t{mse:.3f}\t\t{penalty:.3f}\t\t{total_cost:.3f}")

    

0.8	1.987		0.640		2.627
1.0	1.000		1.000		2.000
1.5	0.167		2.250		2.417


In [47]:
all_costs

[np.float64(2.626666666666666),
 np.float64(2.0),
 np.float64(2.4166666666666665)]

In [50]:
min_cost= min(all_costs)

In [51]:
min_cost

np.float64(2.0)

In [52]:
best_weight_idx = all_costs.index(min_cost)

In [53]:
best_weight_idx

1

In [56]:
best_weight = w_val[best_weight_idx]

In [57]:
print(f"\nBest weight: w = {best_weight} (lowest cost = {min_cost:.3f})")


Best weight: w = 1.0 (lowest cost = 2.000)
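Three candidates are a very coarse search, so the lowest-cost weight among them is not necessarily the true minimum of the cost. As a rough sketch, the cell below repeats the search on a dense grid and compares it with the closed-form minimizer for one feature and no intercept, $ \sum x_i y_i / (\sum x_i^2 + n\lambda) $. The grid range and step size are arbitrary choices, and names such as `w_grid` and `w_closed` are introduced here for illustration.

```python
import numpy as np

X = np.array([1, 2, 3])
y = np.array([2, 3, 4])
lambda_reg = 1

def ridge_cost(w):
    """MSE plus L2 penalty for a single weight w (no intercept)."""
    return np.mean((y - w * X) ** 2) + lambda_reg * w ** 2

# Dense grid of candidate weights from 0 to 2 in steps of 0.001
w_grid = np.linspace(0.0, 2.0, 2001)
costs = np.array([ridge_cost(w) for w in w_grid])
w_best_grid = w_grid[np.argmin(costs)]

# Closed-form minimizer for one feature and no intercept
w_closed = np.sum(X * y) / (np.sum(X ** 2) + len(X) * lambda_reg)

print(f"grid search: w = {w_best_grid:.3f}, cost = {costs.min():.3f}")
print(f"closed form: w = {w_closed:.3f}, cost = {ridge_cost(w_closed):.3f}")
```

Both approaches land at roughly $ w \approx 1.176 $ with a cost of about $ 1.824 $, slightly lower than the cost of $ 2.0 $ at $ w = 1 $.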
