# Orthogonality
## The 90 degrees

Orthogonal is another word for perpendicular: two vectors are orthogonal when the angle between them is 90 degrees, which is the same as saying that their dot product is 0. This is important in many areas, and it can also be used to prove the Pythagorean theorem.

### Finding the plane of the linear combinations
Finding the plane of the linear combinations of, for example, 2 vectors in $\mathbb{R}^3$ can be done by thinking in terms of orthogonality. 
Let $L = cv + dw$ be the linear combinations. The resulting plane $P$ lives in 3 dimensions and needs to go through the zero vector. This means that the plane is described by some combination of $x$, $y$ and $z$ that equals 0. The reason is simple: the vectors start at the origin, as they should in linear algebra, so if we choose $c = d = 0$ we still need to end up at zero. 

This means that we need to find some combination $e$ of the components of $L$ that always produces $0$. The combination $e$ holds the coefficients of the plane. The reason is pretty simple: $L$ is a vector with 3 components, and those are the $(x, y, z)$ coordinates of a point. A point or vector is part of the plane only if the combination of its components equals 0 when plugged into the equation. Such an equation, $e_1x + e_2y + e_3z = 0$, can describe any plane in $\mathbb{R}^3$ that goes through $(0, 0, 0)$. 

So what does this have to do with orthogonality? Well, the combination $e$, which can act as an elimination vector, is itself a vector, and applying it to $L$ is a dot product. $L$ holds all the linear combinations of, in this case, two independent vectors. To satisfy the equation of the plane, $e$ needs to be orthogonal to every such $L$, i.e. $e \cdot L = 0$. 

Let's say we have two independent vectors, written here with their multipliers $c$ and $d$, that create a 2 dimensional plane in $\mathbb{R}^3$:

$v = c \begin{bmatrix}
    3\\
    5\\
    2 \end{bmatrix}\;
w = d \begin{bmatrix}
    1\\
    2\\
    -1 \end{bmatrix}
$

All linear combinations:

$L = \begin{bmatrix}
    3c + d\\
    5c + 2d\\
    2c -d
    \end{bmatrix}   
$

Some combination $e$ of the rows/components produces zero regardless of $c$ and $d$. The dot product $e \cdot L = 0$, so they are perpendicular. An interesting fact here: if we think of the nullspace of a matrix $A$ for a second, the nullspace is perpendicular to the row space, because every vector in $N(A)$ has a zero dot product with every row of $A$. When finding the nullspace we simply ask which vectors $x$ give $Ax = 0$, and all the solutions need to be orthogonal to the rows. The same holds for the columns of a matrix, and we can imagine our vectors above as the columns of $A$: the left nullspace $N(A^T)$ is orthogonal to the column space. Any vector from the left nullspace, in this case $e$, satisfies our condition, and therefore $e$ gives us the coefficients of the plane. So the most systematic way of finding the plane is simply to put the vectors into a matrix $A$ and solve for the left nullspace. 

$A = \begin{bmatrix}
    3 & 1\\
    5 & 2\\
    2 & -1
    \end{bmatrix}
    \; A^T = \begin{bmatrix}
    3 & 5 & 2\\
    1 & 2 & -1
    \end{bmatrix}
$

As columns 1 and 2 of $A^T$ are independent, we know that the third column is a combination of those, because this matrix can't have a rank higher than 2. We also know that from the fact that we picked two independent vectors and transposed them. To be really clear, we can reduce $A^T$ to $R$ as far as possible, as that changes neither the column space of $A$ nor the left nullspace. The reason is that elimination on $A^T$ only takes combinations of its rows, which are the columns of $A$, so the results stay in the column space of $A$. The left nullspace is also intact: if the column space is unchanged, the same vectors in the left nullspace are still orthogonal to it, because the two subspaces are orthogonal to each other. 

After some elimination and row exchanges, we reach:
$R =\begin{bmatrix}
    1 & 0 & 9\\
    0 & 1 & -5
    \end{bmatrix}
$

Solve for the special solutions in $N(A^T)$ by standard procedure and we end up with:

$
e = k\begin{bmatrix}
    -9\\
    5\\
    1
    \end{bmatrix}
$

$e \cdot (x, y, z) = 0$ is our plane! If you noticed, $k$ can be any constant in this case, since scaling $e$ gives the same plane. 

To double check, let's check $eL$ and hopefully it is 0:

$ \begin{bmatrix}
    -9\\
    5\\
    1
    \end{bmatrix} \cdot \begin{bmatrix}
    3c + d\\
    5c + 2d\\
    2c -d
    \end{bmatrix}   
$

$-9(3c + d) + 5(5c + 2d) + 1(2c - d) = (-27c + 25c + 2c) + (10d - 9d - d) = 0$

The equation of the plane is:
$-9x + 5y + z = 0$

And sure enough, it's perfectly zero and orthogonal. And of course, to do the same for the rows of $A$, the coefficients would be in the nullspace $N(A)$. In this case the rows are vectors in $\mathbb{R}^2$ and two of them are independent, so they simply fill the whole 2 dimensional space and $N(A)$ only contains the zero vector. Seen from $\mathbb{R}^3$, the only equation is $z = 0$, a flat plane lying on the "ground". The image below shows $-9x + 5y + z = 0$ with our two vectors, beautifully aligned.  

<img src="pngs/orthogonal-plane.png" width="300">
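
The left nullspace can also be found numerically. Below is a minimal sketch, assuming NumPy is available. The helper `left_nullspace` and its tolerance `tol` are my own illustrative names, and it uses the SVD instead of the hand elimination above, so treat it as a cross-check rather than the same method.

```python
import numpy as np

# Columns of A are the two vectors that span the plane.
A = np.array([[3.0, 1.0],
              [5.0, 2.0],
              [2.0, -1.0]])

def left_nullspace(M, tol=1e-10):
    """Orthonormal basis for N(M^T), read off from the SVD of M (hypothetical helper)."""
    U, s, _ = np.linalg.svd(M)      # U is square (3 x 3) by default
    rank = int(np.sum(s > tol))     # count the nonzero singular values
    return U[:, rank:]              # the remaining columns of U are orthogonal to C(M)

N = left_nullspace(A)
e = N[:, 0] / N[2, 0]               # choose k so that the z coefficient is 1
print(e)                            # coefficients of the plane
print(e @ A)                        # dot product with both columns, should be ~0
```

Rescaling by the last component only works here because the $z$ coefficient of this particular plane is nonzero.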

If a nullspace contains only the zero vector, the subspace orthogonal to it (the row space for $N(A)$, the column space for $N(A^T)$) spans the whole space it resides in.

$
A = \begin{bmatrix}
    1 & 0\\
    0 & 1
\end{bmatrix}
$

What does the row space span? Look at the nullspace: it only contains the zero vector, so in this case the row space spans the full 2 dimensional space.



## Projection
Orthogonality is very important for a specific problem called projection, where we want to project something onto a subspace. This is key if we want to "solve" an unsolvable problem by accepting an error. The standard problem to solve with projection is the least squares problem, fitting the best possible line to a number of data points. Before solving that, let's check how projection itself works.

### Formula & Proof
Proving the projection formula is pretty straightforward, and it works basically the same for vectors and matrices apart from a notational difference. I'm going to use the following image to explain projection.

<img src="pngs/projection-example.png" width="400" >

What's our goal here?
We want to find the projection of $b$ onto the plane, which is the column space of $A$, through the vector $e$, which is the shortest path to the plane. This means that we want to solve $Ax = b$ with an added error, and to get the minimal error we take the shortest path, along the vector $e$. Notice that the shortest path always makes a $90^{\circ}$ angle with the target, which follows from the Pythagorean theorem: any other path is the hypotenuse of a right triangle with $e$ as one leg. This is a very important point for the formula. The vector $e$ is always orthogonal to the column space of $A$. 

Some combination of the columns of $A$ gives the vector $p$, simply because $p$ lies in the column space. We call the multipliers of that combination $\hat{x}$. The reason we need $\hat{x}$ is that $Ax = b$ is unsolvable, but $A\hat{x} = p$ is not. Therefore the goal is to find $\hat{x}$. As we see in the picture, $e = b - p$. $p$ is the projected $b$, so $A\hat{x} = p$ and we have the equation $e = b - A\hat{x}$. We know that $e$ is orthogonal to the plane, therefore $A^Te = 0$, which means that every column of $A$ is orthogonal to $e$, and $e$ is of course in $N(A^T)$. Substituting $e = b - A\hat{x}$:

$A^{T}b - A^{T}A\hat{x} = 0$

$A^{T}b = A^{T}A\hat{x} $

$(A^{T}A)^{-1}A^{T}b = \hat{x} $

$A\hat{x} = p$

$A(A^{T}A)^{-1}A^{T}b = p$

In practice it is best to use the following expression and solve for $\hat{x}$ directly: $A^{T}A\hat{x} = A^{T}b $
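
To make the formula concrete, here is a small sketch assuming NumPy. The matrix `A` and vector `b` are hypothetical values picked for illustration, they do not come from the picture above. It solves the normal equations with `np.linalg.solve` instead of forming the inverse.

```python
import numpy as np

# Hypothetical example: the columns of A span a plane in R^3, b is not in that plane.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the normal equations A^T A x_hat = A^T b without forming an inverse.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat          # projection of b onto the column space of A
e = b - p              # error vector

print(x_hat)           # multipliers of the columns of A
print(p)               # the projection
print(A.T @ e)         # should be ~0: e is orthogonal to every column of A
```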

This whole proof relies on the fact that $A^{T}A$ is invertible as long as the columns of $A$ are independent. We can prove this by showing that $A^{T}Ax = 0$ is only possible when $x$ is the zero vector. 

Proof: 

Assume $A^{T}Ax = 0$ and multiply by $x^T$ from the left:

$x^TA^{T}Ax = 0$

$(Ax)^{T}Ax = 0$

$||Ax||^2 = 0$

$Ax = 0$

The columns of $A$ are independent (full column rank), so the only combination of them that gives zero is $x = 0$. Therefore the nullspace of $A^TA$ only contains the zero vector, and since $A^TA$ is square, $(n, m) \times (m, n) = (n, n)$, it is invertible.  
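
A quick numerical illustration of this, assuming NumPy. The two matrices are hypothetical examples of mine: one has independent columns, the other has a second column that is twice the first.

```python
import numpy as np

independent = np.array([[1.0, 1.0],
                        [1.0, 4.0],
                        [1.0, 6.0]])
dependent = np.array([[1.0, 2.0],
                      [1.0, 2.0],
                      [1.0, 2.0]])   # column 2 = 2 * column 1

for name, M in [("independent", independent), ("dependent", dependent)]:
    G = M.T @ M                      # the square matrix A^T A
    # Full rank means invertible; a rank drop means A^T A is singular.
    print(name, "columns -> rank of A^T A:", np.linalg.matrix_rank(G))
```

This is only a spot check on two matrices, not a replacement for the proof.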

## Linear Regression

Let's demonstrate least squares approximation in 2 ways, first purely with the above logic and linear algebra, and secondly with multivariable calculus. For simplicity's sake, let's "find the best line" for 3 data points:

$d_1 = (1,2)$

$d_2 = (4,2)$

$d_3 = (6,4)$

We want to find the best line: $C + Dt = b$.

### Linear Algebra Approach

$b = \begin{bmatrix}
    2\\
    2\\
    4
\end{bmatrix}
$


This would result in the following equations:

$C + D = 2$

$C + 4D = 2$

$C + 6D = 4$

Let's put it in the form of $A\hat{x} = p$ where $\hat{x}$ is an approximation and $p$ is the projection of $b$. 

$e = b - p$

$A^Te = 0$

$A^Tb = A^Tp$

$A^Tb = A^TA\hat{x}$

$A^TA\hat{x} = A^Tb$

$
A = \begin{bmatrix}
 1 & 1 \\
 1 & 4 \\
 1 & 6 \\
\end{bmatrix}
$

$\hat{x} = 
\begin{bmatrix}
    C\\
    D
\end{bmatrix}
$


Below we have $A^TA\hat{x} = A^Tb$

$
\begin{bmatrix}
 1 & 1 & 1\\
 1 & 4 & 6 \\
\end{bmatrix}
\begin{bmatrix}
 1 & 1 \\
 1 & 4 \\
 1 & 6 \\
\end{bmatrix}
\hat{x} = 
\begin{bmatrix}
 1 & 1 & 1\\
 1 & 4 & 6 \\
\end{bmatrix}
\begin{bmatrix}
    2\\
    2\\
    4
\end{bmatrix}
$

$
\begin{bmatrix}
    3 & 11\\
    11 & 53\\
\end{bmatrix}
\hat{x}
=
\begin{bmatrix}
    8\\
    34
\end{bmatrix}
$

Solve for $\hat{x}$ by elimination; for this I used NumPy. 

$\hat{x} = 
\begin{bmatrix}
    1.31578947\\
    0.36842105
\end{bmatrix}
$

$C = 1.31578947$

$D = 0.36842105$
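
The NumPy step could look roughly like the sketch below. The variable names are my own, and it assumes the normal equations are solved with `np.linalg.solve`, which is one of several ways to do it.

```python
import numpy as np

# The three data points (t, b).
t = np.array([1.0, 4.0, 6.0])
b = np.array([2.0, 2.0, 4.0])

# Each row of A is (1, t_i), so that A @ (C, D) = C + D * t_i.
A = np.column_stack([np.ones_like(t), t])

AtA = A.T @ A                     # left side of the normal equations
Atb = A.T @ b                     # right side of the normal equations
x_hat = np.linalg.solve(AtA, Atb)
C, D = x_hat

print(AtA)
print(Atb)
print(C, D)                       # intercept and slope of the best line
```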

Let's verify this solution.

$p_1 = 1.31578947 + 0.36842105 = 1.6842105263157898$

$p_2 = 1.31578947 + 4 * 0.36842105 = 2.7894736842105265$

$p_3 = 1.31578947 + 6 * 0.36842105 = 3.526315789473684$

$p = 
\begin{bmatrix}
    1.6842105263157898\\
    2.7894736842105265\\
    3.526315789473684
\end{bmatrix}
$

$p \perp e$: multiplying $p$ by $e = b - p$ (the dot product) gives $0$.
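
The orthogonality check can be reproduced with a short sketch, assuming NumPy. Here I use `np.linalg.lstsq` as an alternative route to $\hat{x}$, which is a swapped-in solver and not the elimination used above.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 4.0],
              [1.0, 6.0]])
b = np.array([2.0, 2.0, 4.0])

# lstsq returns (solution, residuals, rank, singular values); keep the solution.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

p = A @ x_hat        # projection of b onto the column space of A
e = b - p            # error vector

print(p)
print(p @ e)         # should be ~0, so p is perpendicular to e
print(A.T @ e)       # should be ~0, so e is perpendicular to every column of A
```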

The line also looks really good, and it probably is the best!

<img src="pngs/regression-example.png" width="300" >


### Calculus Approach
As we already know, we can find the maximum or minimum point of a function by taking its derivative and looking where it equals 0. We want to find the point where the error function $E$ is as small as possible. 

We have the unsolvable equations:

$C + D = 2$

$C + 4D = 2$

$C + 6D = 4$

We can view the left side as $p$ and subtract it to leave the hidden error $e_1 = b_1 - p_1$

$e_1 = 2 - C - D $

$e_2 = 2 - C - 4D $

$e_3 = 4 - C - 6D$ 

We square these to make sure they are all positive, as we want to minimise the length of $e$, and minimising the squared length gives the same answer.

$E = (2 - C - D)^2 + (2 - C - 4D)^2 + (4 - C - 6D)^2$

We need to find where $\nabla E = 0$, that is, where the gradient of $E$ equals 0. That is the point of minimum error. 

With respect to $C$:

$\frac{\partial E}{\partial C} = 2(2 - C - D)(-1) + 2(2 - C - 4D)(-1) + 2(4 - C - 6D)(-1)$

$\frac{\partial E}{\partial C} = (-4 + 2C + 2D) + (-4 + 2C + 8D) + (-8 + 2C + 12D)$

$\frac{\partial E}{\partial C} = 6C + 22D - 16$

With respect to $D$:

$\frac{\partial E}{\partial D} = 2(2 - C - D)(-1) + 2(2 - C - 4D)(-4) + 2(4 - C - 6D)(-6)$

$\frac{\partial E}{\partial D} = (-4 + 2C + 2D) + (-16 + 8C + 32D) + (-48 + 12C + 72D)$

$\frac{\partial E}{\partial D} = 22C + 106D - 68$

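Now we have two equations, put them into a vector:

$\nabla E = 
\begin{bmatrix}
    6C + 22D - 16\\
    22C + 106D - 68
\end{bmatrix}
$

Setting $\nabla E = 0$ gives a linear system in $\hat{x} = (C, D)$, which is exactly the normal equations $A^TA\hat{x} = A^Tb$ multiplied by 2:

$
\begin{bmatrix}
    6 & 22\\
    22 & 106
\end{bmatrix}
\hat{x}
=
\begin{bmatrix}
    16\\
    68
\end{bmatrix}
$

Solve the system with elimination (or NumPy)
and you will once again find: 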

$\hat{x} = 
\begin{bmatrix}
    1.31578947\\
    0.36842105
\end{bmatrix}
$

Beautiful!
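
As a sanity check on the calculus, here is a sketch assuming NumPy. The function names `E` and `grad_E` are mine, and they simply encode the error function and the two partial derivatives derived above.

```python
import numpy as np

def E(C, D):
    """Sum of squared errors for the three data points."""
    return (2 - C - D) ** 2 + (2 - C - 4 * D) ** 2 + (4 - C - 6 * D) ** 2

def grad_E(C, D):
    """The two partial derivatives of E, as derived by hand."""
    return np.array([6 * C + 22 * D - 16,
                     22 * C + 106 * D - 68])

# Setting the gradient to zero gives a 2 x 2 linear system.
H = np.array([[6.0, 22.0],
              [22.0, 106.0]])
rhs = np.array([16.0, 68.0])
C, D = np.linalg.solve(H, rhs)

print(C, D)                        # same C and D as the linear algebra approach
print(grad_E(C, D))                # should be ~0 at the minimum
print(E(C, D) < E(C + 0.01, D))    # nudging C away from the solution increases the error
```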



## Gram-Schmidt Approach
Even if the calculations might seem trivial for smaller matrices, they can get a bit complicated with larger matrices, and the general formula $A^TA\hat{x} = A^Tb$ can seem a bit heavy when we also have to invert $A^TA$ to find $\hat{x}$. Fortunately, there exists an approach that avoids inverting the matrix altogether. The idea behind Gram-Schmidt is to avoid multiple steps in calculating a projection by using a matrix $Q$ with orthonormal columns instead of $A$. 

To demonstrate Gram-Schmidt we first need to understand something about orthonormal vectors. Orthonormal vectors are simply unit vectors that are perpendicular to each other. We can think of them as the axes of a space, for example the x, y and z axes in $\mathbb{R}^3$, but they can be any orthonormal basis, not necessarily $(1,0,0), (0,1,0), (0,0,1)$.

Some general rules of orthogonal matrices are:

- $Q^TQ = QQ^T = I$ when $Q$ is square
- $Q^{-1} = Q^T$ when $Q$ is square
- If $Q$ is not square but has orthonormal columns, $Q^TQ = I$ still applies (but $QQ^T \neq I$)
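
These rules are easy to check numerically. The sketch below assumes NumPy and gets a non-square $Q$ from `np.linalg.qr`. Note that NumPy's QR is not computed with Gram-Schmidt internally, it is only used here as a convenient source of orthonormal columns, and the matrix `A` is a hypothetical example.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 4.0],
              [1.0, 6.0]])

# Reduced QR: Q is 3 x 2 with orthonormal columns, R is 2 x 2 upper triangular.
Q, R = np.linalg.qr(A)

print(Q.shape)
print(np.round(Q.T @ Q, 10))   # the 2 x 2 identity
print(np.round(Q @ Q.T, 10))   # a 3 x 3 projection matrix, not the identity
```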

An interesting point regarding projecting $b$ onto the $q_i$: as the $q_i$ are orthonormal, they act as a basis. In $\mathbb{R}^3$, projecting $b$ onto $q_1$, $q_2$ and $q_3$ and then summing the projections will of course just produce $b$, because we end up projecting $b$ onto the whole space. The vector only changes if the subspace is smaller.
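
A small sketch of this, assuming NumPy. The orthonormal basis `Q`, the vector `b` and the helper `project_onto` are hypothetical choices of mine for illustration.

```python
import numpy as np

# Columns of Q are three orthonormal vectors q_1, q_2, q_3 in R^3.
Q = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, -2.0],
              [2.0, -2.0, 1.0]]) / 3.0
b = np.array([1.0, 2.0, 3.0])

def project_onto(b, qs):
    """Sum of the projections (q^T b) q of b onto each orthonormal vector q."""
    return sum((q @ b) * q for q in qs)

full = project_onto(b, Q.T)         # rows of Q.T are the columns q_1, q_2, q_3
partial = project_onto(b, Q.T[:2])  # only q_1 and q_2, a 2 dimensional subspace

print(full)      # reproduces b
print(partial)   # differs from b, the q_3 component is missing
```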

When we move on to the Gram-Schmidt process, we will do the reverse to find orthogonal vectors. Remember that $e = b - p$, and that $e$ is orthogonal to $p$ because $b$ drops a perpendicular line onto the subspace when projected. To create an $e$, which we will later call $B$ or $C$, that is orthogonal to multiple vectors at the same time, we can subtract each projection $p_i$ from $b$. The cool part here is that when we add projections we build up a space of larger dimension, while when we subtract them we create a vector perpendicular to that larger space.

### The Process
The process is already explained above but let's make it more formal. If we have 3 independent vectors, $a, b, c$ and we want to create orthonormal vectors from these, we have a step-by-step method for it. Remember that the vectors are independent but not necessarily orthogonal. You can only create as many orthonormal vectors as the total space they fill out, so if you have 3 vectors but the matrix rank is only 2, you will end up with 2 orthonormal vectors. You can view these as the axes of the targeted space. 

First, we pick a vector, $a$, and keep it as it is: set $A = a$. To create $B$ from $b$ we project $b$ onto $A$ and keep the error, the perpendicular vector $B$. 

$B = b - \frac{A^Tb}{A^TA}A$

The nice thing here is that in order to get $C$, which is orthogonal to both $A$ and $B$, we simply continue subtracting, now starting from $c$:

$C = c - \frac{A^Tc}{A^TA}A - \frac{B^Tc}{B^TB}B$ 

Now we have $A^TB = 0, A^TC = 0, B^TC = 0$, all combinations are perpendicular! 

To get orthonormal vectors and put them in Q we need to divide by the length:

$\frac{A}{||A||}, \frac{B}{||B||}, \frac{C}{||C||}$
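
The whole process can be sketched in a few lines, assuming NumPy. The function `gram_schmidt` and the three input vectors are hypothetical illustrations of mine. Each new vector has its projections onto the already finished orthogonal vectors subtracted, so $C$ is $c$ minus its projections onto $A$ and $B$. The sketch assumes the inputs are independent and does nothing to handle dependent ones.

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: return the orthogonal vectors and the orthonormal matrix Q."""
    orthogonal = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        w = v.copy()
        for u in orthogonal:
            # Subtract the projection of v onto each finished vector u.
            w = w - (u @ v) / (u @ u) * u
        orthogonal.append(w)
    # Divide each vector by its length and stack them as the columns of Q.
    Q = np.column_stack([u / np.linalg.norm(u) for u in orthogonal])
    return orthogonal, Q

# Hypothetical independent vectors a, b, c.
a = [1.0, -1.0, 0.0]
b = [2.0, 0.0, -2.0]
c = [3.0, -3.0, 3.0]

(A, B, C), Q = gram_schmidt([a, b, c])
print(A, B, C)                 # mutually perpendicular vectors
print(np.round(Q.T @ Q, 10))   # the identity, so the columns of Q are orthonormal
```

For larger or nearly dependent inputs, this classical version can lose accuracy to rounding, so a library QR routine is usually the safer choice.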
