import sympy
from sympy import Matrix, Rational, sqrt, symbols, zeros
import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt

# Mathematics for Machine Learning

## Session 07: Orthogonal projection; the determinant

## Gerhard Jäger

### November 12, 2024

## Orthogonality

Recall: vectors $\mathbf v$ and $\mathbf w$ are **orthogonal** if and only if

$$
\mathbf v^T \mathbf w = 0
$$

#### Examples

- $\begin{bmatrix}1\\1 \\0\end{bmatrix}$, $\begin{bmatrix}0\\0 \\1\end{bmatrix}$


- $\begin{bmatrix}1\\1 \end{bmatrix}$, $\begin{bmatrix}2\\-2\end{bmatrix}$


- $\begin{bmatrix}1\\1 \\ 2\end{bmatrix}$, $\begin{bmatrix}-2\\-2\\2\end{bmatrix}$

- $\ldots$
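As a quick check, the dot products of the three example pairs can be computed with NumPy. This is only a minimal sketch; the variable names `pairs` and `dot_products` are chosen here for illustration.

```python
import numpy as np

# The three pairs of vectors from the examples above.
pairs = [
    (np.array([1, 1, 0]), np.array([0, 0, 1])),
    (np.array([1, 1]), np.array([2, -2])),
    (np.array([1, 1, 2]), np.array([-2, -2, 2])),
]

# v @ w computes the dot product v^T w; it is 0 exactly when v and w are orthogonal.
dot_products = [int(v @ w) for v, w in pairs]
print(dot_products)  # [0, 0, 0]
```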

### Orthogonal spaces

Two vector spaces $\mathbf V$ and $\mathbf W$ are orthogonal if and only if

$$
\forall \mathbf v\in \mathbf V, \mathbf w \in \mathbf W:\ \mathbf v^T\mathbf w = 0
$$

**Examples**

$$
\begin{aligned}
\mathbf V &= \{\begin{bmatrix}x\\0\end{bmatrix}: x \in \mathbb R\}\\[1em]
\mathbf W &= \{\begin{bmatrix}0\\y\end{bmatrix}: y \in \mathbb R\}\\
\end{aligned}
$$

These are of course the $x$-axis and $y$-axis of a 2d-space.

$$
\begin{aligned}
\mathbf V &= \{\begin{bmatrix}x\\y\\0\end{bmatrix}: x,y \in \mathbb R\}\\[1em]
\mathbf W &= \{\begin{bmatrix}0\\0\\z\end{bmatrix}: z \in \mathbb R\}\\
\end{aligned}
$$

These are the $x$-$y$ plane and the $z$-axis of a 3d-space.

$$
\begin{aligned}
\mathbf V &= \mathrm{span}(
\begin{bmatrix}
1\\
-1\\
0
\end{bmatrix},
\begin{bmatrix}
1\\
1\\
1
\end{bmatrix}
)\\[1em]
\mathbf W &= \mathrm{span}(\begin{bmatrix}-1\\-1\\2\end{bmatrix})\\
\end{aligned}
$$

How do we know whether $\mathbf V$ and $\mathbf W$ are orthogonal?

**Observation** Let $V$ and $W$ be two sets of vectors $\subseteq \mathbb R^n$. 

$\mathrm{span}(V)$ is orthogonal to $\mathrm{span}(W)$ if and only if for all $\mathbf v\in V, \mathbf w \in W$: $\mathbf v$ and $\mathbf w$ are orthogonal.

*Proof*

Suppose $\mathrm{span}(V)$ is orthogonal to $\mathrm{span}(W)$. If $\mathbf v\in V$, then $\mathbf v\in\mathrm{span}(V)$, and likewise for $\mathbf w$. Hence $\mathbf v$ and $\mathbf w$ are orthogonal.

Now suppose for all $\mathbf v\in V, \mathbf w \in W$: $\mathbf v$ and $\mathbf w$ are orthogonal. Let $\mathbf x\in\mathrm{span}(V)$ and $\mathbf y\in\mathrm{span}(W)$.

Then $\mathbf x = \sum_i r_i\mathbf v_i$ and $\mathbf y = \sum_j s_j\mathbf w_j$ for some $\mathbf v_i\in V$, $\mathbf w_j\in W$ and coefficients $r_i, s_j\in \mathbb R$.

$$
\begin{aligned}
\mathbf x^T\mathbf y &= (\sum_i r_i\mathbf v_i)^T(\sum_j s_j\mathbf w_j)\\
        &= \sum_i (r_i\mathbf v_i)^T(\sum_j s_j\mathbf w_j)\\
        &= \sum_i \sum_j(r_i\mathbf v_i)^T( s_j\mathbf w_j)\\
        &= \sum_i \sum_jr_i\mathbf v_i^T( s_j\mathbf w_j)\\
        &= \sum_i \sum_jr_is_j\mathbf v_i^T\mathbf w_j\\
        &= \sum_i \sum_jr_is_j\cdot 0\\
        &= 0\\
\end{aligned}
$$

$\dashv$
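Applied to the third example above, the observation says that it is enough to check the generators. The following sketch does this with SymPy and, as a sanity check, also tests one arbitrary pair of linear combinations; the symbol names `r1`, `r2`, `s` are introduced here only for illustration.

```python
from sympy import Matrix, expand, symbols, zeros

# Generators of V and W from the third example, stored as matrix columns.
V = Matrix([[1, 1],
            [-1, 1],
            [0, 1]])
W = Matrix([[-1],
            [-1],
            [2]])

# Entry (i, j) of V^T W is the dot product of the i-th generator of V
# with the j-th generator of W.
gram = V.T * W
print(gram == zeros(2, 1))  # True: all generator pairs are orthogonal

# An arbitrary element of span(V) and an arbitrary element of span(W).
r1, r2, s = symbols("r1 r2 s")
x = V * Matrix([r1, r2])
y = W * Matrix([s])
combo = expand((x.T * y)[0, 0])
print(combo)  # 0
```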

**Observation**

Let $A$ be an $m\times n$ matrix. Then

- the column space $C(A)$ is orthogonal to the left null space $N(A^T)$, and
- the row space $C(A^T)$ is orthogonal to the null space $N(A)$.

*Proof*

The column space of $A$ is $\mathrm{span}(\{\mathbf a_i \mid 1\leq i \leq n\})$, where $\mathbf a_i$ is the $i$-th column of $A$. If $\mathbf x$ is in the left null space of $A$, this means that

$$
A^T \mathbf x = \mathbf 0
$$

It follows that 

$$
\forall i:\mathbf a_i^T\mathbf x = 0
$$

Due to the previous observation, it follows that $C(A)$ is orthogonal to $N(A^T)$. 

The proof of the second statement is analogous.

$\dashv$
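Both statements can be checked on a concrete matrix. The sketch below uses SymPy's `columnspace()` and `nullspace()` methods; the matrix `A` (of rank 2) and the helper `all_orthogonal` are made up for this illustration.

```python
from sympy import Matrix

# A hypothetical 3x3 matrix of rank 2 (the second row is twice the first).
A = Matrix([[1, 2, 3],
            [2, 4, 6],
            [1, 0, 1]])

col_basis = A.columnspace()        # basis of C(A)
left_null_basis = A.T.nullspace()  # basis of N(A^T)
row_basis = A.T.columnspace()      # basis of C(A^T), written as column vectors
null_basis = A.nullspace()         # basis of N(A)

def all_orthogonal(basis_1, basis_2):
    """True if every vector in basis_1 is orthogonal to every vector in basis_2."""
    return all((u.T * w)[0, 0] == 0 for u in basis_1 for w in basis_2)

print(all_orthogonal(col_basis, left_null_basis))  # True
print(all_orthogonal(row_basis, null_basis))       # True
```

By the earlier observation about spans, checking the basis vectors is enough to conclude that the spaces themselves are orthogonal.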

## Orthogonal projections

Suppose we have two vectors $\mathbf a$ and $\mathbf b$. We want to find the *orthogonal projection of $\mathbf a$ onto the line through $\mathbf b$*. This is a vector $\mathbf p$ with the properties:

- $\mathbf p = x\mathbf b$ ($\mathbf p$ lies on the line defined by $\mathbf b$)
- $\mathbf a - \mathbf p$ is orthogonal to $\mathbf b$

Here is how we find $\mathbf p$:

$$
\begin{aligned}
(\mathbf a - x\mathbf b)^T\mathbf b &= 0\\
(\mathbf a^T - x\mathbf b^T)\mathbf b &= 0\\
\mathbf a^T\mathbf b - x\mathbf b^T\mathbf b &= 0\\
\mathbf a^T\mathbf b &= x\mathbf b^T\mathbf b\\
x &= \frac{\mathbf a^T\mathbf b}{\mathbf b^T\mathbf b}\\
\mathbf p &= \frac{\mathbf a^T\mathbf b}{\mathbf b^T\mathbf b}\mathbf b\\
\end{aligned}
$$

- $\mathbf p$ is called the *projection of $\mathbf a$ onto the line through $\mathbf b$*.
- $\mathbf e = \mathbf a - \mathbf p$ is called the *error*.
- $\mathbf p$ is the point on the line through $\mathbf b$ which is closest to $\mathbf a$, i.e., the point which minimizes the error.
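The formula for $\mathbf p$ translates directly into NumPy. The vectors `a` and `b` below are arbitrary example values, not taken from the lecture.

```python
import numpy as np

a = np.array([2.0, 3.0])  # hypothetical vector to be projected
b = np.array([1.0, 1.0])  # hypothetical vector defining the line

x = (a @ b) / (b @ b)  # scalar coefficient x = a^T b / b^T b
p = x * b              # projection of a onto the line through b
e = a - p              # error vector

print(p)      # [2.5 2.5]
print(e)      # [-0.5  0.5]
print(e @ b)  # 0.0, so the error is orthogonal to b
```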

## Orthogonal projections

Now suppose we have a matrix $A$ and a vector $\mathbf b$, and we want to find the *orthogonal projection of $\mathbf b$ onto the* ***column space*** *of $A$*.

In other words, we want to find the point $\mathbf p$ which

- is in the column space of $A$, and
- minimizes the length of the error $\mathbf b-\mathbf p$.



<img src="_img/projection.svg"  width="1000" style="display: block; margin-left: auto; margin-right: auto;">


(image from https://medium.com/linear-algebra/part-17-projections-122aac21b07c)
    

- assumptions:

$$
\begin{aligned}
A\mathbf x &= \mathbf p\\
\mathbf p + \mathbf e &= \mathbf b\\
A^T\mathbf e &= \mathbf 0
\end{aligned}
$$

- finding the solution

Let us assume that the columns of $A$ are independent. (If this is not the case, we can replace $A$ by a matrix whose columns form a basis of $C(A)$.)





**Observation** $(A^TA)$ is invertible if and only if the columns of $A$ are independent.

*Proof*


Suppose $(A^TA)$ is invertible, and let $A\mathbf x = \mathbf 0$. Then it follows

$$
\begin{aligned}
A^TA\mathbf x &= A^T\mathbf 0\\
A^TA\mathbf x &= \mathbf 0\\
\mathbf x &= (A^TA)^{-1}\mathbf 0\\
&= \mathbf 0
\end{aligned}
$$
This entails that the columns of $A$ are independent.

Now suppose the columns of $A$ are independent, and let $A^TA\mathbf x = \mathbf 0$. Multiplying by $\mathbf x^T$ from the left gives

$$
\begin{aligned}
\mathbf x^TA^TA\mathbf x &= 0\\
(A\mathbf x)^T(A\mathbf x) &= 0\\
||A\mathbf x||^2 &= 0\\
A\mathbf x &= \mathbf 0
\end{aligned}
$$

Since the columns of $A$ are independent, $A\mathbf x = \mathbf 0$ entails $\mathbf x = \mathbf 0$.

So the null space of the square matrix $A^TA$ contains only the zero vector, which means that $A^TA$ is invertible.

$\dashv$
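The equivalence can be illustrated with SymPy on two small matrices, one with independent and one with dependent columns. Both matrices are made up for this sketch; invertibility of $A^TA$ is tested via its determinant.

```python
from sympy import Matrix

# Hypothetical examples: independent columns vs. dependent columns.
A_indep = Matrix([[1, 0],
                  [1, 1],
                  [1, 2]])
A_dep = Matrix([[1, 2],
                [1, 2],
                [1, 2]])  # second column = 2 * first column

results = []
for A in (A_indep, A_dep):
    columns_independent = A.rank() == A.cols
    gram_invertible = (A.T * A).det() != 0
    results.append((columns_independent, gram_invertible))

print(results)  # [(True, True), (False, False)]
```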

- deriving the solution:

$$
\begin{aligned}
A\mathbf x &= \mathbf p\\
\mathbf p + \mathbf e &= \mathbf b\\
A^T\mathbf e &= \mathbf 0\\
A^T\mathbf b &= A^T\mathbf p + A^T\mathbf e\\
A^T\mathbf b &= A^T\mathbf p\\
&= A^T A\mathbf x\\
\mathbf x &= (A^T A)^{-1}A^T\mathbf b\\
\mathbf p &= A(A^T A)^{-1}A^T\mathbf b\\
\end{aligned}
$$
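A small numerical sketch of this derivation, with an example matrix `A` and vector `b` chosen here for illustration. Instead of forming $(A^TA)^{-1}$ explicitly, it solves the linear system $A^TA\mathbf x = A^T\mathbf b$ with `np.linalg.solve`, which yields the same $\mathbf x$.

```python
import numpy as np

# Hypothetical example: A has independent columns, b is not in C(A).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x = np.linalg.solve(A.T @ A, A.T @ b)  # solves A^T A x = A^T b
p = A @ x                              # projection of b onto C(A)
e = b - p                              # error vector

print(x)        # [ 5. -3.]
print(p)        # [ 5.  2. -1.]
print(e)        # [ 1. -2.  1.]
print(A.T @ e)  # approximately [0. 0.]: e is orthogonal to every column of A
```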


### Projection matrix

The matrix

$$
P = A(A^TA)^{-1}A^T
$$

is the **projection matrix** that maps each vector to its projection onto the column space of $A$.

Each projection matrix $P$ has the property that $PP=P$, i.e., projecting a second time changes nothing:

$$
\begin{aligned}
P &= A(A^TA)^{-1}A^T\\
PP &= A(A^TA)^{-1}A^TA(A^TA)^{-1}A^T\\
&= A(A^TA)^{-1}(A^TA)(A^TA)^{-1}A^T\\
&= A(A^TA)^{-1}A^T\\
&= P
\end{aligned}
$$
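With SymPy's exact rational arithmetic, the property $PP=P$ can be verified without rounding errors. The matrix `A` below is an arbitrary example with independent columns.

```python
from sympy import Matrix

# Hypothetical example matrix with independent columns.
A = Matrix([[1, 0],
            [1, 1],
            [1, 2]])

P = A * (A.T * A).inv() * A.T  # projection matrix onto C(A)

print(P * P == P)  # True: projecting twice is the same as projecting once
print(P * A == A)  # True: the columns of A already lie in C(A)

b = Matrix([6, 0, 0])
print(P * b)       # Matrix([[5], [2], [-1]])
```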

### Statistics interlude: Linear regression

**linear regression**

- independent variables: $n\times m$ matrix $X$
    - $n$: number of observations
    - $m$: number of independent variables
- dependent variable: length-$n$ vector $\mathbf y$
- goal: find parameter vector $\beta$ (length $m+1$) such that the *total squared error* is minimized

$$
\hat\beta = \arg\min_\beta \sum_i (\beta_1 + \sum_{j=1}^m \beta_{j+1}x_{i,j} - y_i)^2
$$

Let's rephrase this with linear algebra

$$
\begin{aligned}
X_1 &= [\mathbf 1 X]\\
\hat {\mathbf y} &= X_1\beta\\
\epsilon &= ||\mathbf y - \hat {\mathbf y}||^2\\
\hat\beta &= \arg\min_\beta \epsilon
\end{aligned}
$$

From the second equation we see that $\hat {\mathbf y}$ is in the column space of $X_1$. 

The goal is to find the point $\hat {\mathbf y}$ in the column space of $X_1$ that minimizes the squared distance to $\mathbf y$.

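Since squaring is monotone on non-negative numbers, this is also the point that minimizes the distance $||\mathbf y - \hat {\mathbf y}||$ itself. In other words, $\hat {\mathbf y}$ is the orthogonal projection of $\mathbf y$ onto the column space of $X_1$, so $\hat\beta$ is given by the projection formula with $A = X_1$ and $\mathbf b = \mathbf y$: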

$$
\hat\beta = (X_1^TX_1)^{-1}X_1^T\mathbf y
$$
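The following sketch fits a regression on synthetic data by solving $X_1^TX_1\beta = X_1^T\mathbf y$ and compares the result with NumPy's least-squares routine `np.linalg.lstsq`. The data, the seed, and the coefficients in `true_beta` are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 50, 2                             # observations, independent variables
X = rng.normal(size=(n, m))              # synthetic n x m data matrix
true_beta = np.array([1.0, 2.0, -3.0])   # hypothetical intercept and slopes

X1 = np.column_stack([np.ones(n), X])    # X_1 = [1 X]
y = X1 @ true_beta + 0.1 * rng.normal(size=n)

# Solve X_1^T X_1 beta = X_1^T y rather than inverting X_1^T X_1 explicitly.
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat   # projection of y onto C(X_1)
residual = y - y_hat    # error vector

print(np.allclose(beta_hat, beta_lstsq))     # True
print(np.allclose(X1.T @ residual, 0.0))     # True: residual is orthogonal to C(X_1)
```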

If the columns of $X_1$ are not independent, $X_1^TX_1$ is not invertible, so $\hat\beta$ is not uniquely determined (and your statistics software will complain).

https://book.stat420.org/collinearity.html