# Ridge Regression using Gradient Descent

---

## 1. Recap: Linear Regression Setup

Let the dataset be:

- Number of samples: $m$  
- Number of features: $n$  

Define:

$$
X =
\begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix}
$$

$$
W =
\begin{bmatrix}
w_0 \\
w_1 \\
\vdots \\
w_n
\end{bmatrix}
\quad
y =
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(m)}
\end{bmatrix}
$$

Hypothesis function:

$$
\hat{y} = XW
$$
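As a minimal sketch of the setup above, the design matrix can be built by prepending a column of ones to the raw features. The toy numbers and the name `features` are illustrative assumptions, not values from the dataset used later.

```python
import numpy as np

# Hypothetical toy data: m = 3 samples, n = 2 features.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
m, n = features.shape

# Prepend a column of ones so that w_0 acts as the intercept.
X = np.hstack([np.ones((m, 1)), features])   # shape (m, n + 1)
W = np.array([[0.5], [1.0], [-1.0]])         # shape (n + 1, 1)

y_hat = X @ W                                # shape (m, 1)
print(y_hat.ravel())                         # [-0.5 -0.5 -0.5]
```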

---

## 2. Ordinary Least Squares (OLS) Loss

The standard linear regression loss function is:

$$
J(W) = \frac{1}{2m}(XW - y)^T(XW - y)
$$

Problems OLS can run into, especially when features are correlated or $n$ is large relative to $m$:

- Large coefficients
- High variance
- Overfitting

---

## 3. Ridge Regression Loss Function

To control coefficient magnitude, add **L2 regularization**:

$$
J(W) = \frac{1}{2m}(XW - y)^T(XW - y) + \frac{\lambda}{2m} W^T W
$$

Where:

- $\lambda \ge 0$ is the regularization parameter
- Penalizes large weights
- Controls model complexity
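The ridge loss can be evaluated directly from its two terms. This is a small sketch; the helper name `ridge_loss` and the toy data are assumptions made for illustration.

```python
import numpy as np

def ridge_loss(X, y, W, lam):
    """Ridge loss J(W) = (1/2m)||XW - y||^2 + (lam/2m)||W||^2."""
    m = X.shape[0]
    residual = X @ W - y
    data_term = (residual.T @ residual).item() / (2 * m)
    penalty = lam * (W.T @ W).item() / (2 * m)
    return data_term + penalty

# Toy data that W fits exactly, so only the penalty contributes.
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [2.0]])
W = np.array([[0.0], [1.0]])

print(ridge_loss(X, y, W, lam=0.0))   # 0.0 (no residual, no penalty)
print(ridge_loss(X, y, W, lam=2.0))   # 0.5 (penalty only: 2 * 1 / (2 * 2))
```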

---

## 4. Objective Function

Minimize the ridge loss:

$$
\min_W J(W)
$$

Instead of the closed-form solution, we use **Gradient Descent**.

---

## 5. Gradient Descent Principle

General gradient descent update rule:

$$
W^{(t+1)} = W^{(t)} - \alpha \frac{\partial J(W)}{\partial W}
$$

Where:

- $\alpha$ = learning rate
- $t$ = iteration number

---

## 6. Gradient of the Data Loss Term

Differentiate first term:

$$
\frac{\partial}{\partial W}
\left[
\frac{1}{2m}(XW - y)^T(XW - y)
\right]
=
\frac{1}{m} X^T(XW - y)
$$

---

## 7. Gradient of Regularization Term

Differentiate regularization term:

$$
\frac{\partial}{\partial W}
\left[
\frac{\lambda}{2m} W^T W
\right]
=
\frac{\lambda}{m} W
$$

---

## 8. Total Gradient of Ridge Loss

Combining both gradients:

$$
\nabla_W J(W) =
\frac{1}{m} X^T(XW - y)
+
\frac{\lambda}{m} W
$$
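One way to sanity-check the combined gradient is to compare it with a central finite-difference approximation of the loss. The sketch below uses random data; the helper names `ridge_loss` and `ridge_grad` are assumptions for illustration.

```python
import numpy as np

def ridge_loss(X, y, W, lam):
    m = X.shape[0]
    r = X @ W - y
    return ((r.T @ r).item() + lam * (W.T @ W).item()) / (2 * m)

def ridge_grad(X, y, W, lam):
    m = X.shape[0]
    return (X.T @ (X @ W - y) + lam * W) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=(20, 1))
W = rng.normal(size=(3, 1))
lam, eps = 0.5, 1e-6

# Central differences, one coordinate of W at a time.
numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    step = np.zeros_like(W)
    step[j] = eps
    numeric[j] = (ridge_loss(X, y, W + step, lam)
                  - ridge_loss(X, y, W - step, lam)) / (2 * eps)

max_diff = np.max(np.abs(numeric - ridge_grad(X, y, W, lam)))
print(max_diff)   # should be tiny if the analytic gradient is right
```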

---

## 9. Final Gradient Descent Update Rule

Substitute gradient into update equation:

$$
W^{(t+1)} =
W^{(t)}
-
\alpha
\left[
\frac{1}{m} X^T(XW^{(t)} - y)
+
\frac{\lambda}{m} W^{(t)}
\right]
$$

This is the **core ridge regression gradient descent equation**.

---

## 10. Effect of Regularization on Updates

Rewriting update rule:

$$
W^{(t+1)} =
\left(1 - \frac{\alpha \lambda}{m}\right) W^{(t)}
-
\frac{\alpha}{m} X^T(XW^{(t)} - y)
$$

Key insight:

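- Weights are **shrunk at every step** by the factor $1 - \frac{\alpha \lambda}{m}$ before the data-gradient step is applied
- The shrinkage per step grows with $\lambda$ (and with $\alpha$)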
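The two ways of writing the update are algebraically identical, which can be confirmed numerically. This is a small sketch on random data with arbitrarily chosen values of $\alpha$ and $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
X = rng.normal(size=(m, 4))
y = rng.normal(size=(m, 1))
W = rng.normal(size=(4, 1))
alpha, lam = 0.1, 5.0

data_grad = X.T @ (X @ W - y) / m

# Form 1: subtract the full ridge gradient.
W_direct = W - alpha * (data_grad + (lam / m) * W)

# Form 2: shrink the weights, then take the plain data-gradient step.
shrink = 1 - alpha * lam / m          # about 0.99 for these values
W_decay = shrink * W - alpha * data_grad

print(shrink, np.allclose(W_direct, W_decay))
```

The second form is why L2 regularization is often described as weight decay.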

---

## 11. Bias–Variance Behavior

- Small $\lambda$ → behaves like OLS → high variance  
- Large $\lambda$ → strong shrinkage → high bias  
- Optimal $\lambda$ balances both  
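A rough way to see this trade-off is to sweep $\lambda$ on synthetic data and compare training error with held-out error. The sketch below uses a single hold-out split as a simple stand-in for cross-validation. The data, the grid of values, and the helper `ridge_closed_form` are all assumptions for illustration, and no intercept is used to keep it short.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of the ridge loss for a given lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
m, n = 40, 15
X = rng.normal(size=(m, n))
true_w = np.zeros(n)
true_w[:3] = [2.0, -1.0, 0.5]          # only three features matter
y = X @ true_w + rng.normal(size=m)

X_tr, y_tr = X[:25], y[:25]            # training split
X_va, y_va = X[25:], y[25:]            # held-out split

lams = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_mse, val_mse = [], []
for lam in lams:
    w = ridge_closed_form(X_tr, y_tr, lam)
    train_mse.append(np.mean((X_tr @ w - y_tr) ** 2))
    val_mse.append(np.mean((X_va @ w - y_va) ** 2))
    print(f"lambda={lam:7.1f}  train={train_mse[-1]:.3f}  val={val_mse[-1]:.3f}")

best_lam = lams[int(np.argmin(val_mse))]
print("lambda with lowest held-out error:", best_lam)
```

Training error can only rise as $\lambda$ grows, while held-out error typically falls first and then rises once bias dominates.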

---

## 12. Why Gradient Descent Instead of Closed Form

Closed-form ridge solution:

$$
W = (X^T X + \lambda I)^{-1} X^T y
$$

Limitations:

- Forming $X^T X$ costs $O(mn^2)$ and inverting it costs $O(n^3)$
- Memory expensive for large $n$ (an $n \times n$ matrix must be stored)
- Scales poorly as $n$ grows

Gradient descent:

- Works for large datasets
- Avoids matrix inversion
- Supports online learning
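On a small problem both routes should land on the same weights. The sketch below compares them on synthetic data. It uses `np.linalg.solve` rather than an explicit inverse, and the step size and iteration count are arbitrary choices that happen to work for this data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = X @ rng.normal(size=(n + 1, 1)) + 0.1 * rng.normal(size=(m, 1))
lam = 1.0

# Closed form: solve (X^T X + lam I) W = X^T y.
W_closed = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)

# Gradient descent on the same ridge loss.
W_gd = np.zeros((n + 1, 1))
alpha = 0.1
for _ in range(5000):
    W_gd -= alpha * (X.T @ (X @ W_gd - y) + lam * W_gd) / m

print(np.max(np.abs(W_closed - W_gd)))   # gap between the two solutions
```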

---

## 13. Algorithm (Stepwise)

1. Initialize $W = 0$
2. For each iteration:
   - Compute $\hat{y} = XW$
   - Compute gradient
   - Update $W$
3. Stop when convergence is reached (for example, when the gradient norm falls below a tolerance) or an iteration limit is hit
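The steps above can be sketched as a single function with an explicit stopping test. The gradient-norm tolerance `tol` and the iteration cap `max_iter` are assumptions; the notes do not fix a particular convergence criterion.

```python
import numpy as np

def ridge_gd(X, y, lam, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Batch gradient descent for ridge; X must already contain the ones column."""
    m, d = X.shape
    W = np.zeros((d, 1))                     # step 1: initialize W = 0
    for it in range(1, max_iter + 1):        # step 2: iterate
        y_hat = X @ W
        grad = (X.T @ (y_hat - y) + lam * W) / m
        W -= alpha * grad
        if np.linalg.norm(grad) < tol:       # step 3: stop when the gradient is tiny
            return W, it
    return W, max_iter

# Noise-free toy problem, so lam = 0 should recover the generating weights.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([[1.0], [2.0], [-3.0]])

W, n_iter = ridge_gd(X, y, lam=0.0)
print(W.ravel(), n_iter)
```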

---

## 14. Python Implementation (From Scratch)

```python
import numpy as np

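class RidgeGD:
    """Ridge regression trained with batch gradient descent."""

    def __init__(self, lr=0.01, epochs=1000, lam=0.1):
        self.lr = lr          # learning rate (alpha in the notes)
        self.epochs = epochs  # number of gradient steps
        self.lam = lam        # regularization strength (lambda)

    def fit(self, X, y):
        m, n = X.shape
        X = np.c_[np.ones(m), X]       # prepend the ones column for w_0
        y = y.reshape(-1, 1)
        self.W = np.zeros((n + 1, 1))  # W = 0, intercept included

        for _ in range(self.epochs):
            y_hat = X @ self.W
            # Gradient from Section 8; w_0 is penalized too, as in the loss above.
            gradient = (1/m) * (X.T @ (y_hat - y)) + (self.lam/m) * self.W
            self.W -= self.lr * gradient

    def predict(self, X):
        X = np.c_[np.ones(X.shape[0]), X]
        return X @ self.W              # shape (m, 1)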
```

---

## 15. Convergence Conditions

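For the loss in Section 3, the Hessian is $\frac{1}{m}(X^T X + \lambda I)$, so gradient descent with a constant learning rate converges if:

$$
0 < \alpha < \frac{2m}{\lambda_{\max}(X^T X) + \lambda}
$$

Where:

- $\alpha$ is the learning rate  
- $\lambda_{\max}(X^T X)$ is the largest eigenvalue of $X^T X$ (not to be confused with the regularization parameter $\lambda$)  

Additional notes:

- Normalizing features improves the conditioning of $X^T X$ and speeds up convergence  
- The loss is strictly convex for $\lambda > 0$ because of the L2 term  
- A unique global minimum is therefore guaranteed for $\lambda > 0$  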
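The step-size limit can be probed numerically. For the loss in Section 3 the Hessian is $(X^T X + \lambda I)/m$, and gradient descent on a quadratic is stable when $\alpha$ stays below 2 divided by the largest Hessian eigenvalue. The sketch below runs just under and just over that limit on synthetic data; the helper `final_grad_norm` is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 3))])
y = rng.normal(size=(m, 1))
lam = 1.0

# Largest eigenvalue of the symmetric matrix X^T X.
eig_max = np.linalg.eigvalsh(X.T @ X).max()
alpha_limit = 2 * m / (eig_max + lam)

def final_grad_norm(alpha, steps=200):
    """Run gradient descent from W = 0 and return the final gradient norm."""
    W = np.zeros((X.shape[1], 1))
    for _ in range(steps):
        W -= alpha * (X.T @ (X @ W - y) + lam * W) / m
    return np.linalg.norm((X.T @ (X @ W - y) + lam * W) / m)

stable = final_grad_norm(0.9 * alpha_limit)     # just inside the limit
unstable = final_grad_norm(1.1 * alpha_limit)   # just outside the limit
print(alpha_limit, stable, unstable)
```

Inside the limit the gradient norm shrinks toward zero; outside it the iterates oscillate with growing amplitude.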

---

## 16. Observations

- Ridge regression shrinks coefficients smoothly  
- Coefficients are shrunk toward zero but generally **do not become exactly zero** (unlike L1 regularization)  
- Reduces multicollinearity effects  
- Improves numerical stability  
- Weight shrinkage increases with $\lambda$  

Mathematically:

$$
\lambda \uparrow \;\Rightarrow\; \|W\|_2 \downarrow
$$
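This monotone shrinkage is easy to observe with the closed-form solution. The sketch below uses random data and an arbitrary grid of $\lambda$ values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.normal(size=(60, 1))

norms = []
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    # Ridge solution for this lam, then its Euclidean norm.
    W = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    norms.append(np.linalg.norm(W))
    print(f"lambda={lam:7.1f}  ||W||_2={norms[-1]:.4f}")
```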

---

## 17. Practical Notes

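- Input features should be standardized, because the L2 penalty depends on the scale of each feature  

$$
x' = \frac{x - \mu}{\sigma}
$$

- Intercept term should not be regularized (the derivation above penalizes $w_0$ along with the other weights for simplicity)  
- Optimal $\lambda$ is chosen via cross-validation  
- Too large $\lambda$ causes underfitting  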
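The notes on standardization and the intercept can be combined in a short sketch. The helpers `standardize` and `ridge_grad_free_intercept` are hypothetical names, and the data is synthetic. With centered features and an unpenalized intercept, $w_0$ should settle at the mean of $y$ even for a large $\lambda$.

```python
import numpy as np

def standardize(X_train, X_new):
    """Scale both arrays with the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_new - mu) / sigma

def ridge_grad_free_intercept(X, y, W, lam):
    """Ridge gradient that leaves w_0 (row 0 of W) out of the penalty."""
    m = X.shape[0]
    penalty = lam * W                # new array, W itself is not modified
    penalty[0] = 0.0                 # do not shrink the intercept
    return (X.T @ (X @ W - y) + penalty) / m

# Synthetic features on very different scales, with a large offset in y.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=[1.0, 10.0], size=(200, 2))
y = 100.0 + raw @ np.array([[2.0], [0.3]]) + rng.normal(size=(200, 1))

Z, _ = standardize(raw, raw)
X = np.hstack([np.ones((200, 1)), Z])

W = np.zeros((3, 1))
for _ in range(2000):
    W -= 0.1 * ridge_grad_free_intercept(X, y, W, lam=50.0)

print(W[0, 0], y.mean())   # intercept vs. mean of the targets
```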

---

## 18. Comparison with Ordinary Least Squares

| Property | OLS | Ridge |
|--------|-----|-------|
| Regularization | None | L2 |
| Overfitting risk | Higher | Reduced |
| Coefficient size | Can be large | Shrunk |
| Stability under multicollinearity | Poor | Improved |

---

## 19. Core Update Equation

The ridge regression gradient descent update rule:

$$
W \leftarrow W - \alpha
\left(
\frac{1}{m} X^T(XW - y) + \frac{\lambda}{m} W
\right)
$$

---

## 20. Final Conclusion

Ridge Regression with Gradient Descent:

- Reduces overfitting  
- Optimizes a convex objective (strictly convex for $\lambda > 0$)  
- Scales to large datasets  
- Avoids costly matrix inversion  

Final insight:

$$
\lambda \uparrow \;\Rightarrow\; \text{Model complexity} \downarrow
$$

Generalization improves only while the drop in variance outweighs the added bias; too large a $\lambda$ underfits.


In [1]:
```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
import numpy as np

X, y = load_diabetes(return_X_y=True)
```


In [2]:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
```


In [3]:
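```python
from sklearn.linear_model import SGDRegressor

# In scikit-learn, alpha is the L2 penalty strength (lambda in the notes);
# eta0 is the constant learning rate (alpha in the notes).
reg = SGDRegressor(penalty='l2', max_iter=500, eta0=0.1, learning_rate='constant', alpha=0.001)
reg.fit(X_train, y_train)
```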

In [4]:
```python
y_pred = reg.predict(X_test)
print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)
```

```
R2 score 0.45752130124139156
[  49.3613458  -168.66544146  372.01659224  274.9281426    -7.35598526
  -59.78921244 -164.09902561  135.82098654  336.79422215   88.08811994]
[159.8374821]
```


In [5]:
```python
from sklearn.linear_model import Ridge

reg = Ridge(alpha=0.001, max_iter=500, solver='sparse_cg')
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)
```

```
R2 score 0.46250101619914563
[  34.52192544 -290.84084076  482.40181344  368.0678662  -852.44873179
  501.59160336  180.11115788  270.76333979  759.73534372   37.4913546 ]
151.10198517439466
```


In [6]:
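```python
class MYRidgeGD:

    def __init__(self, epochs, learning_rate, alpha):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.alpha = alpha  # regularization strength (lambda in the notes), not the learning rate
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X_train, y_train):
        # Start from coefficients of 1 and an intercept of 0, stacked as [w_0, w_1, ..., w_n].
        self.coef_ = np.ones(X_train.shape[1])
        self.intercept_ = 0
        theta = np.insert(self.coef_, 0, self.intercept_)

        # Prepend the ones column so theta[0] acts as the intercept.
        X_train = np.insert(X_train, 0, 1, axis=1)

        for i in range(self.epochs):
            # Ridge gradient without the 1/m factor: X^T X theta - X^T y + alpha * theta.
            # The missing 1/m is absorbed into learning_rate; the intercept is penalized too.
            theta_der = np.dot(X_train.T, X_train).dot(theta) - np.dot(X_train.T, y_train) + self.alpha * theta
            theta = theta - self.learning_rate * theta_der

        self.coef_ = theta[1:]
        self.intercept_ = theta[0]

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
```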

In [8]:
```python
reg = MYRidgeGD(epochs=500, alpha=0.001, learning_rate=0.005)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)
```

```
R2 score 0.4738018280260913
[  46.65050914 -221.3750037   452.12080647  325.54248128  -29.09464178
  -96.47517735 -190.90017011  146.32900372  400.80267299   95.09048094]
150.86975316713472
```
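The `MYRidgeGD` update uses the gradient without the $\frac{1}{m}$ factor that appears in the notes, so its `learning_rate` plays the role of $\alpha / m$. The sketch below checks that equivalence on synthetic data, not on the diabetes split; the names `theta_sum` and `theta_mean` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = rng.normal(size=m)
lam, eta = 0.001, 0.005

theta_sum = np.ones(3)    # unscaled gradient, step size eta
theta_mean = np.ones(3)   # (1/m)-scaled gradient, step size eta * m
for _ in range(100):
    grad_sum = X.T @ X @ theta_sum - X.T @ y + lam * theta_sum
    theta_sum = theta_sum - eta * grad_sum

    grad_mean = (X.T @ (X @ theta_mean - y) + lam * theta_mean) / m
    theta_mean = theta_mean - (eta * m) * grad_mean

print(np.allclose(theta_sum, theta_mean))
```

This means a learning rate tuned for one form has to be rescaled by $m$ before it is reused with the other.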
