<a href="https://colab.research.google.com/github/alibagheribardi/Regression/blob/main/Theory_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Let $X$ be a numerical dataset comprising $N$ observations, each characterized by $m$ numerical features (so $X$ is an $N \times m$ matrix), and let $y$ be the corresponding target vector. The objective of Ordinary Least Squares (OLS) regression is to determine the coefficient vector $\beta$ that minimizes the Euclidean norm of the residual, $\|X\beta - y\|$. Geometrically, the optimal $\beta$ is obtained by projecting $y$ onto the column space of $X$, so that the residual vector $y - X\beta$ is orthogonal to this subspace. This orthogonality condition is expressed by the normal equations  

## $$X^T(X\beta - y) = 0,$$  

## which, when $X^T X$ is invertible, yields  

## $$\beta = (X^T X)^{-1} X^T y.$$
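The closed form above can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the data, the seed, and the variable names are illustrative assumptions, not part of the original notebook. It solves the normal equations with a linear solve rather than an explicit inverse and confirms that the residual is orthogonal to the columns of $X$.

```python
import numpy as np

# Synthetic data (illustrative assumption): N observations, m features.
rng = np.random.default_rng(0)
N, m = 50, 3
X = rng.normal(size=(N, m))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y.
# A linear solve is used instead of forming (X^T X)^{-1} explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Orthogonality check: X^T (X beta - y) should vanish up to rounding error.
residual = X @ beta - y
print("max |X^T residual| =", np.abs(X.T @ residual).max())
```

The printed value should be at the level of floating-point rounding, which is the numerical counterpart of $X^T(X\beta - y) = 0$.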


## Remarks

## - In this approach, the Euclidean norm defines the notion of distance. Other distance metrics are also commonly used in machine learning and may be more suitable in certain cases. Many of them, such as the $\ell_1$ norm, are not induced by an inner product, so the concept of orthogonality is not available and the projection argument above does not apply. This makes it significantly harder to obtain a closed-form expression for the coefficient vector $\beta$.
    
## - Although the square matrix $X^T X$ is nonsingular whenever the columns of $X$ are linearly independent, it can be ill-conditioned (for example, when features are nearly collinear), which makes the computed inverse numerically unreliable. To mitigate this issue, regularization techniques such as Ridge and Lasso regression are commonly employed; Ridge in particular improves the conditioning of the matrix being inverted.  
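To make the conditioning issue concrete, here is a small sketch with two nearly collinear features. It compares the condition number of $X^T X$ with that of $X^T X + \lambda I$, the matrix inverted in the standard Ridge formulation. The data, the noise scale, and the value of `lam` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x1 = rng.normal(size=N)
# The second column is almost a copy of the first, so X^T X is nearly singular.
x2 = x1 + 1e-6 * rng.normal(size=N)
X = np.column_stack([x1, x2])

gram = X.T @ X
lam = 1e-3  # hypothetical ridge strength, chosen only for illustration
gram_ridge = gram + lam * np.eye(X.shape[1])

# Condition numbers in the 2-norm (ratio of largest to smallest singular value).
cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram_ridge)
print("cond(X^T X)         =", cond_plain)
print("cond(X^T X + lam I) =", cond_ridge)
```

Adding $\lambda I$ shifts every eigenvalue of $X^T X$ up by $\lambda$, so the smallest one is bounded away from zero and the condition number drops by several orders of magnitude in this example.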

## As a methodological refinement, one may introduce a controlled perturbation to the original dataset $X$ by defining $X_{\lambda} = X + \lambda I'$, where $\lambda$ is a nonzero scalar and $I'$ is a fixed matrix of the same shape as $X$. Ordinary Least Squares (OLS) estimation is then applied to this modified dataset, with the corresponding loss function given by


## $$
\tilde{L}(\beta, \lambda) = \| (X \beta - y) + \lambda I'\beta \|^2 = \| X_\lambda \beta - y \|^2.
$$

## Geometrically, this corresponds to projecting the target vector $y$ onto the subspace spanned by the columns of $X + \lambda I'$.

## For a sufficiently small scalar $\lambda$ for which the inverse of $X_{\lambda}^T X_{\lambda}$ can be computed stably, the resulting prediction is  

## $$
y_{\text{pred}} = X_\lambda\beta  ~~~~\text{where}~~~\beta = (X_\lambda^T X_\lambda)^{-1} X_\lambda^Ty
$$
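The perturbed estimator can be sketched as follows. The text does not define $I'$ explicitly, so this example assumes it is the $N \times m$ rectangular identity (`np.eye(N, m)`); that choice, the synthetic data, and the value of `lam` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 40, 3
X = rng.normal(size=(N, m))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=N)

lam = 1e-2               # small nonzero perturbation (illustrative value)
I_prime = np.eye(N, m)   # ASSUMPTION: I' is the N x m rectangular identity
X_lam = X + lam * I_prime

# OLS applied to the perturbed dataset X_lam.
beta = np.linalg.solve(X_lam.T @ X_lam, X_lam.T @ y)
y_pred = X_lam @ beta

# The loss written in the text, ||(X beta - y) + lam I' beta||^2,
# is the same quantity as ||X_lam beta - y||^2.
loss_text = np.linalg.norm((X @ beta - y) + lam * I_prime @ beta) ** 2
loss_direct = np.linalg.norm(X_lam @ beta - y) ** 2
print(loss_text, loss_direct)
```

The two printed losses agree, and the residual `y_pred - y` is orthogonal to the columns of `X_lam`, mirroring the projection interpretation.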





## Applying SVD in Regression


**Singular Value Decomposition Theorem**:  Let $X$ be any $n \times p$ matrix.  Then we can find a factorization

$$X = U \Sigma V^\top$$

where

* $V$ is an orthogonal $p \times p$ matrix (its columns form an orthonormal basis of $\mathbb{R}^p$).
* $U$ is an orthogonal $n \times n$ matrix (its columns form an orthonormal basis of $\mathbb{R}^n$).
* $\Sigma$ is an $n \times p$ matrix which is zero everywhere off the diagonal, with diagonal entries $\Sigma_{jj} = \sigma_j \geq 0$.

In what follows we keep only the $r = \textrm{Rank}(X)$ left and right singular vectors associated with non-zero singular values.  This gives the reduced decomposition

$$X = U_r \Sigma_r V_r^\top$$

where $V_r$ is a $p \times r$ matrix, $\Sigma_r$ is an $r \times r$ diagonal matrix with positive values along the diagonal, and $U_r$ is an $n \times r$ matrix.
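As an illustration, the reduced factors can be obtained from `np.linalg.svd` by truncating at the numerical rank. This is a sketch on a synthetic rank-deficient matrix; the tolerance used to decide which singular values count as non-zero is an assumption (it mirrors the default used by `np.linalg.matrix_rank`).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 4
# Build a rank-deficient matrix as a product of two thin factors.
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))

# Thin SVD: U is n x min(n, p), s holds the singular values, Vt is min(n, p) x p.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Numerical rank: count singular values above a rounding-level tolerance
# (ASSUMPTION: same tolerance rule as np.linalg.matrix_rank).
tol = s.max() * max(n, p) * np.finfo(float).eps
r = int(np.sum(s > tol))

U_r = U[:, :r]            # n x r
Sigma_r = np.diag(s[:r])  # r x r, positive diagonal
V_r = Vt[:r, :].T         # p x r

print(r, U_r.shape, Sigma_r.shape, V_r.shape)
print(np.allclose(U_r @ Sigma_r @ V_r.T, X))
```

The final check confirms that the truncated factors still reproduce $X$ up to rounding error.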

The columns of $U_r$ form an orthonormal basis of the image of $X$.  So it is easy to use it to project a vector $\vec{y}$ onto the image of $X$:

$$\hat{y} = \text{Proj}_{\textrm{Im}(X)} (\vec{y}) = \sum_{j=1}^r (\vec{y} \cdot \vec{u}_j)\, \vec{u}_j = U_r U_r^\top \vec{y}$$

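To find a $\beta$ with $\hat{y} = X \vec{\beta}$, we substitute the reduced decomposition and use $U_r^\top U_r = I_r$. The last line selects the solution lying in the span of the columns of $V_r$, which is valid because $V_r^\top V_r = I_r$; when $r < p$ the solution is not unique, and this choice is the one of minimum norm: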

$$
\begin{align*}
X \beta &= U_{r} U_{r}^\top \vec{y}\\
U_{r} \Sigma_r V_r^\top \beta &= U_{r} U_{r}^\top \vec{y}\\
\Sigma_r V_r^\top \beta &= U_{r}^\top \vec{y}\\
V_r^\top \beta  &=  \Sigma_r^{-1} U_{r}^\top \vec{y}\\
\beta &= V_r \Sigma_r^{-1} U_{r}^\top \vec{y}
\end{align*}
$$
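The derivation can be verified numerically. The sketch below, on synthetic rank-deficient data chosen for illustration, computes $\hat{y} = U_r U_r^\top \vec{y}$ and $\beta = V_r \Sigma_r^{-1} U_r^\top \vec{y}$ and compares $\beta$ with the minimum-norm least-squares solution returned by `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 4
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))  # rank-deficient on purpose
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = np.linalg.matrix_rank(X)
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r, :].T

# Projection of y onto the image of X.
y_hat = U_r @ (U_r.T @ y)

# beta = V_r Sigma_r^{-1} U_r^T y; dividing by s_r applies Sigma_r^{-1}.
beta = V_r @ ((U_r.T @ y) / s_r)

# Reference: minimum-norm least-squares solution from NumPy.
beta_ref = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(X @ beta, y_hat))
print(np.allclose(beta, beta_ref))
```

Because $X$ here does not have full column rank, $(X^\top X)^{-1}$ does not exist, yet the SVD formula still yields a well-defined coefficient vector.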