## Ridge Regression

The explanation of this algorithm is taken from the book *The Elements of Statistical Learning*, chapter 3.

## Importing packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from os.path import join
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

## Explaining the Algorithm

Ridge Regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares, given by the following equation:

$$ \hat{\beta}^{ridge} = \argmin_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} $$

It can also be written in matrix form:

$$ RSS(\lambda) = (\bold{y} - \bold{X} \beta)^T (\bold{y} - \bold{X}  \beta) + \lambda \beta^T \beta$$
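As a minimal sketch, the penalized criterion can be evaluated directly with NumPy. The function name `ridge_objective` and its argument names are illustrative choices rather than anything defined in the book, and `lam` plays the role of $\lambda$. The sketch uses the summation form, keeps the intercept outside the penalty and, following the book, puts no constant factor in front of the residual sum of squares. The data is the toy example used in the cells further down.

```python
import numpy as np


def ridge_objective(beta0, beta, X, y, lam):
    """Penalized residual sum of squares; the intercept beta0 is not penalized."""
    residuals = y - beta0 - X @ beta
    return np.sum(residuals ** 2) + lam * np.sum(beta ** 2)


# Toy data, same values as in the cells below
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Same coefficients, two penalty strengths: only the penalty term changes
loss_small = ridge_objective(0.25, np.array([1.0]), X, y, lam=0.5)
loss_large = ridge_objective(0.25, np.array([1.0]), X, y, lam=5.0)
print(loss_small, loss_large)
```

The residual part is identical in both calls, so the difference between the two values is exactly the change in $\lambda \sum_j \beta_j^2$.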

Denote by $\bold{X}$ the $N \times p$ (not $p + 1$) matrix with each row an input vector, and similarly let $\bold{y}$ be the $N$-vector of outputs in the training set. In this case we **DO NOT WANT TO REGULARIZE THE INTERCEPT**, so we usually do not add an extra column of ones. Penalizing the intercept would make the procedure depend on the origin chosen for $\bold{y}$; that is, adding a constant $c$ to each of the targets $y_i$ would not simply shift the predictions by the same amount $c$. To keep the intercept out of the penalty, one can *center* the inputs around their column means, replacing each $x_{ij}$ with $x_{ij} - \overline{x}_j$, and center the outputs, replacing each $y_i$ with $y_i - \overline{y}$. The intercept is then estimated by $\hat{\beta}_0 = \overline{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$, and the remaining coefficients come from a ridge regression without intercept on the centered data.
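The effect of penalizing the intercept can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper `ridge_fit` and its `penalize_intercept` flag are hypothetical names introduced here, and the shift `c = 10` is an arbitrary choice. It fits the same toy data twice, once with the targets shifted by `c`, and compares the fitted values.

```python
import numpy as np


def ridge_fit(X, y, lam, penalize_intercept=False):
    """Closed-form ridge fit with an explicit column of ones for the intercept.

    Returns the fitted values on the training inputs.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    A = np.identity(X1.shape[1])
    if not penalize_intercept:
        A[0, 0] = 0.0  # leave the intercept out of the penalty
    gamma = np.linalg.solve(X1.T @ X1 + lam * A, X1.T @ y)
    return X1 @ gamma


X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])
c = 10.0  # arbitrary constant added to every target

# Intercept not penalized: the fitted values move by exactly c
shift_free = ridge_fit(X, y + c, 0.5) - ridge_fit(X, y, 0.5)

# Intercept penalized: the fitted values no longer move by c
shift_penalized = ridge_fit(X, y + c, 0.5, True) - ridge_fit(X, y, 0.5, True)

print(shift_free)
print(shift_penalized)
```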


One way to minimize this function is to set its derivative with respect to $\beta$ to zero.

<!-- $$ \frac{\partial RSS}{\partial\beta}  = -2\bold{X}^T (\bold{y} - \bold{X}  \beta)$$
$$ \bold{X}^T (\bold{y} - \bold{X}  \beta) = 0 $$ -->

$$ \hat{\beta} =  (\bold{X}^T \bold{X} + \lambda \bold{I})^{-1} \bold{X}^T \bold{y}$$
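For reference, the intermediate steps behind this closed form can be written out as follows. This is the standard derivation, assuming centered inputs and outputs so that no intercept appears in $\beta$:

$$ \frac{\partial RSS(\lambda)}{\partial \beta} = -2 \bold{X}^T (\bold{y} - \bold{X} \beta) + 2 \lambda \beta = 0 $$

$$ (\bold{X}^T \bold{X} + \lambda \bold{I}) \beta = \bold{X}^T \bold{y} $$

For $\lambda > 0$ the matrix $\bold{X}^T \bold{X} + \lambda \bold{I}$ is positive definite and therefore invertible, even when $\bold{X}^T \bold{X}$ itself is singular.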

Another solution, for the case where the intercept is kept in the model, can be found [here](https://stats.stackexchange.com/questions/602412/what-would-be-the-solution-of-ridge-regression-if-there-is-an-intercept). In this case $\bold{X}$ should include a column of 1s in addition to the "features" or "independent variables", and one considers the following:
$$\gamma = \begin{bmatrix}
\beta_0 \\
\beta 
\end{bmatrix}$$

$$\hat{\gamma} =  (\bold{X}^T \bold{X} + \lambda \bold{A})^{-1} \bold{X}^T \bold{y}$$

$$\bold{A} = \begin{bmatrix}
0 & 0 \\
0 & \bold{I} 
\end{bmatrix}$$

In [None]:
alpha = 0.5
X = np.array([[1], [2], [3], [4]])
y = np.array([1, 3, 2, 5])

In [51]:
# Solution that includes a column of 1s and does not penalize the intercept

# Adding a new column with ones to represent the intercept
_X = np.hstack((np.ones([X.shape[0],1], X.dtype), X))

# This "turns off" the regularization for beta_0
A = np.identity(_X.shape[1])
A[0, 0] = 0

In [52]:
weights = np.linalg.inv(_X.T @ _X + alpha * A) @ _X.T @ y
weights

array([0.25, 1.  ])

In [53]:
_X @ weights

array([1.25, 2.25, 3.25, 4.25])

In [6]:
# Solution that centers the inputs and outputs

# Centering X and y around their means
X_scaled = X - np.mean(X, axis=0)
y_scaled = y - np.mean(y, axis=0)
# With centered inputs, the intercept is the mean of y
w0 = np.mean(y, axis=0)

In [14]:
I = np.identity(X_scaled.shape[1])

weights = np.linalg.inv(X_scaled.T @ X_scaled + alpha * I) @ X_scaled.T @ y_scaled
np.hstack([w0, weights])

array([2.75, 1.  ])

In [9]:
X_scaled  @ weights + w0

array([1.25, 2.25, 3.25, 4.25])
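The two approaches give the same predictions but report different first entries (0.25 versus 2.75). The value 2.75 is the intercept for the *centered* inputs, while 0.25 is the intercept for the original inputs. The two are related by $\beta_0 = \overline{y} - \overline{x}^T \beta$, which follows from the first row of the normal equations for $\gamma$. A minimal sketch of that conversion, where the variable names are illustrative and the data is redefined so the block runs on its own:

```python
import numpy as np

alpha = 0.5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

x_mean = X.mean(axis=0)
y_mean = y.mean()
X_centered = X - x_mean

# Slope from the centered problem (no intercept column needed)
I = np.identity(X.shape[1])
beta = np.linalg.solve(X_centered.T @ X_centered + alpha * I,
                       X_centered.T @ (y - y_mean))

# Intercept expressed in the original, uncentered coordinates
beta0 = y_mean - x_mean @ beta

print(beta0, beta)
```

Here `np.linalg.solve` is used instead of an explicit inverse; it solves the same linear system and is generally the more numerically stable choice.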

## Custom Model

In [71]:
class RidgeRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, alpha=0.5, scale=False):
        self.alpha = alpha
        self.scale = scale

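    def fit(self, X, y):
        # Optionally center the inputs; the column means are kept so that
        # predict() applies the same shift to new data
        if self.scale:
            self.X_mean = np.mean(X, axis=0)
        else:
            self.X_mean = np.zeros(X.shape[1])

        # Design matrix with a leading column of ones for the intercept
        self.X = np.hstack((np.ones((X.shape[0], 1)), X - self.X_mean))
        self.y = y
        self.N, self.p = X.shape

        A = np.identity(self.p+1)
        
        # This "turns off" the regularization for beta_0
        A[0, 0] = 0
        
        self.weights = np.linalg.inv(self.X.T @ self.X + self.alpha*A) @ self.X.T @ self.y
        
        return self

    def predict(self, X):
        # Apply the same centering as in fit() before adding the ones column
        _X = np.hstack((np.ones((X.shape[0], 1)), X - self.X_mean))
        return _X @ self.weights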
    
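    def get_variance(self):
        # self.X already contains the column of ones, so apply the weights
        # directly instead of calling predict(), which would add it again
        y_hat = self.X @ self.weights
        return 1/(self.N - self.p - 1) * np.sum((y_hat - self.y)**2)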
    
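    def get_params_covariance(self) -> np.ndarray:
        """The variance–covariance matrix of the least squares parameter estimates.

        Note that this is the ordinary least squares formula; it does not
        account for the ridge penalty, so it matches the fitted model only
        when alpha is 0.

        Returns:
            np.ndarray: The variance–covariance matrix
        """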
        return np.linalg.inv(self.X.T @ self.X) * self.get_variance()

In [74]:
my_ridge = RidgeRegressor(alpha=0.5).fit(X, y)
y_hat = my_ridge.predict(X)
y_hat

array([1.25, 2.25, 3.25, 4.25])

In [72]:
from sklearn.linear_model import Ridge

In [73]:
sk_ridge = Ridge(alpha=0.5).fit(X, y)
y_hat = sk_ridge.predict(X)
y_hat

array([1.25, 2.25, 3.25, 4.25])