### Orthogonal Vectors and Subspaces
A basis is a set of independent vectors that span a space. Geometrically, it is a set of coordinate axes (think about the x- and y-axes in the x-y plane). In choosing a basis, we tend to choose an orthogonal basis.

An **orthonormal basis** is an orthogonal basis in which every vector has length 1.

**Length squared** of vector $x$: $||x||^{2}=x_{1}^{2}+x_{2}^{2}+\cdots+x_{n}^{2}=x^{T}x$. The length squared is the inner product of $x$ with itself.

**Orthogonality test**: The *inner product* $x^Ty=0$ if and only if $x$ and $y$ are orthogonal vectors. If $x^Ty > 0$, their angle is less than 90°. If $x^Ty < 0$, their angle is greater than 90°. The test follows from the Pythagorean relation: $x$ and $y$ are orthogonal exactly when the length squared of $x$ plus the length squared of $y$ equals the length squared of $x + y$, that is, $x^Tx + y^Ty = (x+y)^T(x+y)$. Expanding the right-hand side gives $x^Tx + 2x^Ty + y^Ty$, so the relation holds exactly when $x^Ty=0$.
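
A quick numerical check of the test, as a minimal sketch in NumPy. The vectors `x` and `y` are illustrative values chosen so that their inner product is zero; they do not come from the text.

```python
import numpy as np

# Illustrative vectors (an assumption for this sketch): x . y = 2 + 2 - 4 = 0
x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 1.0, -2.0])

inner = x @ y                      # inner product x^T y
len_sq_x = x @ x                   # ||x||^2 = x^T x
len_sq_y = y @ y                   # ||y||^2 = y^T y
len_sq_sum = (x + y) @ (x + y)     # ||x + y||^2

print(inner)                             # 0.0
print(len_sq_x + len_sq_y, len_sq_sum)   # 18.0 18.0
```

Because the inner product is zero, the two squared lengths add up to the squared length of the sum.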

If nonzero vectors $v_1,..., v_k$ are mutually orthogonal (every vector is perpendicular to every other), then those vectors are linearly independent.
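
A sketch of the standard argument for this fact: suppose some combination of the vectors gives zero, and take the inner product of that combination with any one $v_i$. Orthogonality removes every term except the $i$-th:

$$c_1v_1+\cdots+c_kv_k=0 \;\Longrightarrow\; v_i^T(c_1v_1+\cdots+c_kv_k)=c_i\,v_i^Tv_i=c_i||v_i||^{2}=0 \;\Longrightarrow\; c_i=0$$

The last step uses $||v_i||^2 \neq 0$, which holds because $v_i$ is nonzero. Since this works for every $i$, the only combination giving zero is the trivial one, which is exactly linear independence.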

**Orthogonal subspaces**: Two subspaces $V$ and $W$ of the same space $\mathbb{R}^{n}$ are orthogonal if every vector $v$ in $V$ is orthogonal to every vector $w$ in $W$: $v^Tw = 0$ for all $v$ and $w$. By this definition, the wall and the floor are not orthogonal: they intersect in a line, and a nonzero vector on that line is not orthogonal to itself. Two planes cannot be orthogonal in 3-d space.

**Fundamental theorem of orthogonality**: The row space is orthogonal to the nullspace (in $\mathbb{R}^{n}$: for $x$ in the nullspace, each row of $A$ times $x$ equals 0). The column space is orthogonal to the left nullspace (in $\mathbb{R}^{m}$).

Given a subspace $V$ of $\mathbb{R}^{n}$, the space of all vectors orthogonal to $V$ is called the **orthogonal complement** of $V$. It is denoted by $V^\perp$ = “V perp.”

*The nullspace $N(A)$ is the **orthogonal complement** of the row space $C(A^T)$ in $\mathbb{R}^{n}$*: the row space contains all vectors that are orthogonal to the nullspace.  
*The left nullspace $N(A^T)$ is the **orthogonal complement** of the column space $C(A)$ in $\mathbb{R}^{m}$*: The column space contains all vectors that are orthogonal to the left nullspace.   
For $Ax = b$ to be solvable, $b$ must be in the column space, or equivalently, $b$ must be perpendicular to the left nullspace $N(A^T)$.
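
These orthogonality statements can be checked numerically. The sketch below assumes a small rank-1 matrix `A` chosen for illustration, and uses the SVD as one convenient way (not the only one) to get bases for the nullspace and left nullspace.

```python
import numpy as np

# Illustrative rank-1 matrix (an assumption): m = 2, n = 3, r = 1
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 2x2, Vt is 3x3
rank = int(np.sum(s > 1e-10))        # count the nonzero singular values

null_basis = Vt[rank:].T             # columns span N(A),   shape (3, n - r)
left_null_basis = U[:, rank:]        # columns span N(A^T), shape (2, m - r)

# Row space is orthogonal to nullspace: every row of A times x is 0
print(np.allclose(A @ null_basis, 0))
# Column space is orthogonal to left nullspace
print(np.allclose(A.T @ left_null_basis, 0))

# Solvability test: b must be perpendicular to the left nullspace
b_ok = np.array([1.0, 2.0])          # a multiple of the columns of A
b_bad = np.array([1.0, 0.0])         # not in the column space
print(np.allclose(left_null_basis.T @ b_ok, 0))
print(np.allclose(left_null_basis.T @ b_bad, 0))
```

The dimensions also add up as the complement statement requires: $r + (n - r) = n$ in $\mathbb{R}^n$ and $r + (m - r) = m$ in $\mathbb{R}^m$.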

![subspaces](http://i.imgur.com/AKYbfWa.png)

Figure 3.4 summarizes the fundamental theorem of linear algebra. It illustrates the true effect of a matrix--what is happening inside the multiplication Ax. **Every matrix transforms its row space onto its column space**. The nullspace is carried to the zero vector. Every Ax is in the column space. Nothing is carried to the left nullspace. The real action is between the row space and column space, and you see it by looking at a typical vector x. It has a “row space component” and a “nullspace component,” with $x = x_r + x_n$. When multiplied by A, this is $Ax = Ax_r + Ax_n$:
- The nullspace component goes to zero: $Ax_n = 0$.
- The row space component goes to the column space: $Ax_r = Ax$.

Of course everything goes to the column space; the matrix cannot do anything else.

Every vector b in the column space comes from **exactly one** vector $x_r$ in the row space. On those r-dimensional spaces A is invertible.

$A^T$ goes in the opposite direction, from $\mathbb{R}^{m}$ to $\mathbb{R}^{n}$ and from $C(A)$ back to $C(A^T)$. Of course the transpose is not the inverse! $A^T$ moves the spaces correctly, but not the individual vectors. That honor belongs to $A^{-1}$ if it exists--and it only exists if $r = m = n$ (i.e., a square matrix with full rank). We cannot ask $A^{-1}$ to bring back a whole nullspace out of the zero vector.

When $A^{−1}$ fails to exist, the best substitute is the pseudoinverse $A^+$. This inverts A where that is possible: $A^+Ax = x$ for x in the row space. On the left nullspace, nothing can be done: $A^+y = 0$. Thus $A^+$ inverts A where it is invertible, and has the same rank r.
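
A minimal sketch of the splitting $x = x_r + x_n$ and of the pseudoinverse properties, using `np.linalg.pinv`. The matrix and vectors are illustrative assumptions. The sketch also relies on a standard fact not stated above: $A^+A$ is the projection onto the row space, so $x_r = A^+Ax$.

```python
import numpy as np

# Illustrative rank-1 matrix (an assumption for this sketch)
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
A_plus = np.linalg.pinv(A)            # pseudoinverse A^+, shape (3, 2)

x = np.array([1.0, 0.0, 0.0])
x_r = A_plus @ A @ x                  # row space component of x
x_n = x - x_r                         # nullspace component of x

print(np.allclose(A @ x_n, 0))              # nullspace component goes to zero
print(np.allclose(A @ x_r, A @ x))          # A x_r = A x
print(np.allclose(A_plus @ A @ x_r, x_r))   # A^+ A x = x on the row space

y = np.array([2.0, -1.0])             # A^T y = 0, so y is in the left nullspace
print(np.allclose(A_plus @ y, 0))     # A^+ y = 0

# A^+ has the same rank r as A
print(np.linalg.matrix_rank(A_plus) == np.linalg.matrix_rank(A))
```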

![Imgur](http://i.imgur.com/MEhViPQ.png)

Instead of onto a line, $b$ can also be projected onto any subspace.

### Why projection?
When $Ax = b$ has no solution, the best we can do is solve the closest problem we can solve (this is what the pseudoinverse $A^+$ does). $Ax$ is always in the column space of $A$, but $b$ need not be in the column space (hence no solution). So we choose the vector in the column space that is closest to $b$: we solve $A\hat{x} = p$ instead, where $p$ is the projection of $b$ onto the column space. $\hat{x}$ is the best possible solution.

In linear regression $y \sim x$, this says that it is usually impossible to pass one line through every point; instead we look for the line with the least sum of squared errors.

The **cosine** of the angle between any nonzero vectors a and b is $\cos\theta=\frac{a^{T}b}{||a||||b||}$. Because $|\cos\theta| \leq 1$, this gives the **Schwarz inequality**: $|a^Tb| \leq ||a||||b||$

Law of cosines: $||b-a||^{2}=||b||^{2}+||a||^{2}-2||b||||a||\cos\theta$
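
The cosine formula, the Schwarz inequality, and the law of cosines can be checked together on a small example. The vectors `a` and `b` below are illustrative assumptions.

```python
import numpy as np

# Illustrative vectors (an assumption): both have length 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
cos_theta = (a @ b) / (norm_a * norm_b)      # 24 / 25

# Schwarz inequality: |a^T b| <= ||a|| ||b||
print(abs(a @ b) <= norm_a * norm_b)

# Law of cosines: ||b - a||^2 = ||b||^2 + ||a||^2 - 2 ||b|| ||a|| cos(theta)
lhs = np.linalg.norm(b - a) ** 2
rhs = norm_b**2 + norm_a**2 - 2 * norm_b * norm_a * cos_theta
print(np.isclose(lhs, rhs))
```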

### Projection onto a line
The projection $p$ must be a multiple $\hat{x}a$ of $a$, since it lies on the line through $a$. To get $p$, all we need is to compute $\hat{x}$. We know $a\perp(b-\hat{x}a)$ --> $a^T(b-\hat{x}a)=0$ --> $\hat{x}a^Ta = a^Tb$ --> $\hat{x}=\frac{a^Tb}{a^Ta}$. Thus, **projection onto a line** $p = \hat{x}a = \frac{a^Tb}{a^Ta}a$.

**Projection matrix *P:*** ($p = Pb$) is the matrix that multiplies $b$ and produces $p$: $p=\frac{a^Tb}{a^Ta}a = a\frac{a^Tb}{a^Ta} = \frac{aa^T}{a^Ta}b$, so $P = \frac{aa^T}{a^Ta}$.

- $aa^T$ is a square matrix, $a^Ta$ is a number. So *P* is a square matrix.
- *P* is symmetric.
- $P^2 = P$: $P^2b$ is the projection of $Pb$ and $Pb$ is already on the line.
- Column space of *P* consists of the line through a.
- The rank of *P* is 1.
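
The formulas and the listed properties of *P* can be verified with a short NumPy sketch. The vectors `a` and `b` are illustrative assumptions.

```python
import numpy as np

# Illustrative vectors (an assumption): a^T b = 5, a^T a = 9
a = np.array([1.0, 2.0, 2.0])
b = np.array([1.0, 1.0, 1.0])

x_hat = (a @ b) / (a @ a)           # coefficient x_hat = a^T b / a^T a
p = x_hat * a                       # projection of b onto the line through a
P = np.outer(a, a) / (a @ a)        # projection matrix a a^T / a^T a (3 x 3)

print(np.allclose(P @ b, p))            # p = P b
print(np.isclose(a @ (b - p), 0))       # error b - p is perpendicular to a
print(np.allclose(P, P.T))              # P is symmetric
print(np.allclose(P @ P, P))            # P^2 = P
print(np.linalg.matrix_rank(P))         # rank 1
```

Note that `np.outer(a, a)` builds the square matrix $aa^T$, while `a @ a` is the number $a^Ta$.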

### Projection and least squares

A system of equations $Ax = b$ either has a solution or not. If $b$ is not in the column space $C(A)$, the system is inconsistent and Gaussian elimination fails. This failure is almost certain when there are several equations and only one unknown. For example: $2x = b_1$, $3x = b_2$, $4x=b_3$. In this example, $b$ can be anywhere in 3-d space, but the column space is just the line through $a = (2,3,4)$. The equations are solvable only for $b$ on this line (out of the whole 3-d space!). Thus, chances are that there is no solution for most $b$.

In spite of their unsolvability, inconsistent equations arise all the time in practice. They have to be solved! One possibility is to determine x from part of the system, and ignore the rest; this is hard to justify if all m equations come from the same source. Rather than expecting no error in some equations and large errors in the others, it is much better to choose the x that minimizes an average error E in the m equations.

Squared error: $E^2=(2x-b_1)^2+(3x-b_2)^2+(4x-b_3)^2$. If there is an exact solution, the minimum error will be zero. When there is no solution, the minimum error is at the lowest point of a parabola, where the derivative is zero: $\frac{dE^2}{dx} = 2[2(2x-b_1)+3(3x-b_2)+4(4x-b_3)] = 0$. Solving for $x$ gives $\hat{x} = \frac{2b_1+3b_2+4b_3}{2^2+3^2+4^2}$. In general, **the least-squares solution of the model system $ax=b$ with one unknown is $\hat{x} = \frac{a^Tb}{a^Ta}$** -- the same answer as the projection onto a line above.
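
A small sketch of this one-unknown problem with $a = (2, 3, 4)$. The right-hand side `b` is a hypothetical choice that does not lie on the line through $a$. The result is compared against `np.linalg.lstsq`, which is a general-purpose solver and not the method derived here.

```python
import numpy as np

a = np.array([2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 4.0])        # hypothetical data, not a multiple of a

x_hat = (a @ b) / (a @ a)            # (2 + 6 + 16) / (4 + 9 + 16) = 24/29

def squared_error(x):
    """E^2 = (2x - b1)^2 + (3x - b2)^2 + (4x - b3)^2."""
    return np.sum((a * x - b) ** 2)

# The derivative of E^2 at x_hat is 2 a^T (a x_hat - b), which should vanish
derivative_at_x_hat = 2 * a @ (a * x_hat - b)
print(np.isclose(derivative_at_x_hat, 0))

# Nearby values of x give a larger squared error
print(squared_error(x_hat) < squared_error(x_hat + 0.1))
print(squared_error(x_hat) < squared_error(x_hat - 0.1))

# Cross-check with NumPy's least-squares solver (A is the 3 x 1 matrix with column a)
x_lstsq, *_ = np.linalg.lstsq(a.reshape(-1, 1), b, rcond=None)
print(np.isclose(x_lstsq[0], x_hat))
```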

### Least Squares Problems with Several Variables

Now we are ready for the serious step, to project b onto a subspace—rather than just onto a line. This problem arises from $Ax = b$ when A is an m by n matrix. Instead of one column and one unknown x, the matrix now has n columns. The number m of observations is still larger than the number n of unknowns, so it must be expected that
$Ax = b$ will be inconsistent. *Probably, there will not exist a choice of x that perfectly fits the data b*. In other words, the vector b probably will not be a combination of the columns of A; it will be outside the column space. (think about a line (column space) in 3-d space...)

Again the problem is to choose x so as to minimize the error, and again this minimization will be done in the least-squares sense. The error is $E = ||Ax − b||$ , and this is exactly **the distance from $b$ to the point $Ax$ in the column space**. Searching for the least-squares solution x, which minimizes E, is the same as locating the point $p = A\hat{x}$ that is closer to b than any other point in the column space.

We may use geometry or calculus to determine $\hat{x}$. In n dimensions, we prefer the appeal of geometry: (1) $p$ must be the “projection of $b$ onto the column space,” and (2) the error vector $e = b - A\hat{x}$ must be perpendicular to that space.

![Imgur](http://i.imgur.com/swH0zpq.png)

Three ways to find $\hat{x}$:

1. All vectors perpendicular to the column space lie in the left nullspace. Thus the error vector must be in the nullspace of $A^T$: $A^T(b-A\hat{x})=0$ --> $A^TA\hat{x}=A^Tb$.  
2. Each column vector of $A$ is perpendicular to the error vector: $a_1^T(b-A\hat{x})=0, \ldots, a_n^T(b-A\hat{x})=0$. Stacking these $n$ equations gives $A^T$ again: $A^T(b-A\hat{x})=0$ --> $A^TA\hat{x}=A^Tb$.
3. Multiply the equation $Ax = b$ by $A^T$...

When $Ax = b$ is inconsistent, the least-squares solution that minimizes $||Ax-b||^2$ is given by:

- the **Normal equations** $A^TA\hat{x}=A^Tb$. 
- The best estimate $\hat{x} = (A^TA)^{-1}A^Tb$.
- the projection of b onto the column space is the nearest point $p=A\hat{x}=A(A^TA)^{-1}A^Tb$.
    - the **projection matrix** $P=A(A^TA)^{-1}A^T$, it projects any vector b onto the column space of A.
    - $P^2=P$, when we project again nothing is changed.
    - $P^T=P$
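
The formulas above translate directly into NumPy. This is a minimal sketch that assumes $A$ has independent columns (so $A^TA$ is invertible). The function names `least_squares` and `projection_matrix` and the sample data are hypothetical, introduced only for illustration.

```python
import numpy as np

def least_squares(A, b):
    """Solve the normal equations A^T A x_hat = A^T b; return (x_hat, p)."""
    x_hat = np.linalg.solve(A.T @ A, A.T @ b)
    p = A @ x_hat                     # projection of b onto the column space
    return x_hat, p

def projection_matrix(A):
    """P = A (A^T A)^{-1} A^T, computed with a solve instead of an explicit inverse."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# Hypothetical data: 3 equations, 2 unknowns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_hat, p = least_squares(A, b)
P = projection_matrix(A)

print(x_hat)                               # close to [ 5. -3.]
print(np.allclose(P @ b, p))               # p = P b
print(np.allclose(A.T @ (b - p), 0))       # error is perpendicular to every column
print(np.allclose(P @ P, P), np.allclose(P, P.T))
```

Solving the normal equations with `np.linalg.solve` avoids forming $(A^TA)^{-1}$ explicitly, which is a common design choice for accuracy and speed.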

Example: $A=\begin{bmatrix}1 & 2\\
1 & 3\\
0 & 0
\end{bmatrix},b=\begin{bmatrix}4\\
5\\
6
\end{bmatrix}$. $Ax = b$ has no solution, and $A^TA\hat{x}=A^Tb$ gives the best $\hat{x}$. (Think of $A$ as $X$, $b$ as $Y$, and $\hat{x}$ as the regression coefficients.)
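
Working the numbers of this example through the normal equations (a worked sketch, using the fact that $A^TA$ here has determinant 1):

$$A^{T}A=\begin{bmatrix}2 & 5\\
5 & 13
\end{bmatrix},\quad A^{T}b=\begin{bmatrix}9\\
23
\end{bmatrix},\quad \hat{x}=(A^{T}A)^{-1}A^{T}b=\begin{bmatrix}13 & -5\\
-5 & 2
\end{bmatrix}\begin{bmatrix}9\\
23
\end{bmatrix}=\begin{bmatrix}2\\
1
\end{bmatrix}$$

$$p=A\hat{x}=\begin{bmatrix}4\\
5\\
0
\end{bmatrix},\qquad e=b-p=\begin{bmatrix}0\\
0\\
6
\end{bmatrix}$$

The column space of $A$ is the plane of vectors whose third component is zero, so the projection simply keeps the first two components of $b$, and the error $e$ is perpendicular to both columns of $A$.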

- suppose b is actually in the column space of A, then the projection of b is just b.
- suppose b is perpendicular to every column of A, so $A^Tb = 0$, then b projects to the zero vector (i.e. b is in the left nullspace, which overlaps with the column space only at the zero vector).
- When A is square and invertible, the column space is the whole space. Every vector projects to itself, p equals b, and $\hat{x} = x$: $p=A(A^{T}A)^{-1}A^{T}b=AA^{-1}(A^{T})^{-1}A^{T}b=b$. This is the only case when we can take apart $(A^TA)^{-1}$ and write it as $A^{-1}(A^T)^{-1}$. When A is rectangular, this is not possible.
- When A has only one column, we get the same formula as the projection onto a line.

#### The Cross-Product Matrix $A^TA$

- The matrix is **square and symmetric**: $(A^TA)^T=A^TA^{TT} = A^TA$.
- The nullspace of $A^TA$ equals the nullspace of $A$: $N(A^TA) = N(A)$.
- It is invertible exactly if A has independent columns.
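
A sketch of the standard argument behind the last two properties. If $A^TAx = 0$, multiply on the left by $x^T$:

$$A^{T}Ax=0 \;\Longrightarrow\; x^{T}A^{T}Ax=(Ax)^{T}(Ax)=||Ax||^{2}=0 \;\Longrightarrow\; Ax=0$$

Conversely, $Ax = 0$ immediately gives $A^TAx = 0$, so the two nullspaces are the same. When $A$ has independent columns, $N(A)$ contains only the zero vector, so the square matrix $A^TA$ also has only the zero vector in its nullspace and is therefore invertible.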

### Least-squares fitting of data

Suppose we do a series of experiments, and expect the output $b$ to be a linear function of the input $t$. We look for a straight line $b = C + Dt$.

How to compute C and D? If there is no experimental error, then two measurements of b will determine the line $b = C + Dt$. But if there is error, we must be prepared to “average” the experiments and find an optimal line. **That line is not to be confused with the line through a on which b was projected in the previous section!** In fact, since there are two unknowns C and D to be determined, we now project onto a two-dimensional subspace. A perfect experiment would give a perfect C and D:
$C + Dt_1 = b_1$, ... $C + Dt_m = b_m$. 

This is an overdetermined system, with m equations and only two unknowns. If errors are present, it will have no solution. A has two columns, and x = (C, D).

$$\begin{bmatrix}1 & t_{1}\\
1 & t_{2}\\
\vdots & \vdots\\
1 & t_{m}
\end{bmatrix}\begin{bmatrix}C\\
D
\end{bmatrix}=\begin{bmatrix}b_{1}\\
b_{2}\\
\vdots\\
b_{m}
\end{bmatrix}$$

The best solution ($\hat{C}, \hat{D}$) is the $\hat{x}$ that minimizes the squared error $E^2=(b_1-C-Dt_1)^2+\ldots+(b_m-C-Dt_m)^2$. The vector $p = A\hat{x}$ is as close as possible to $b$. Of all straight lines $b = C + Dt$, we are choosing the one that best fits the data (Figure 3.9). On the graph, **the errors are the vertical distances $b - C - Dt$ to the straight line (not perpendicular distances!)**. It is the **vertical distances** that are squared, summed, and minimized.

![Imgur](http://i.imgur.com/5cwBtDA.png)

**The measurements $b_1 , \ldots, b_m$ are given at distinct points $t_1 , \ldots ,t_m$. Then the straight line $C + Dt$ which minimizes $E^2$ comes from least squares:**

$$A^{T}A\begin{bmatrix}\hat{C}\\
\hat{D}
\end{bmatrix}=A^{T}b \quad\text{or}\quad \begin{bmatrix}m & \sum t_{i}\\
\sum t_{i} & \sum t_{i}^{2}
\end{bmatrix}\begin{bmatrix}\hat{C}\\
\hat{D}
\end{bmatrix}=\begin{bmatrix}\sum b_{i}\\
\sum t_{i}b_{i}
\end{bmatrix}$$
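
The 2 by 2 system of sums can be assembled and solved in a few lines. This is a minimal sketch: the function name `fit_line` and the sample data are hypothetical. The data are chosen to lie exactly on $b = 1 + 2t$, so the fit should recover $C = 1$ and $D = 2$.

```python
import numpy as np

def fit_line(t, b):
    """Return (C, D) of the least-squares line b = C + D t.

    Builds the normal equations from the sums m, sum(t), sum(t^2),
    sum(b), sum(t*b). Assumes at least two distinct values in t.
    """
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float)
    m = len(t)
    ATA = np.array([[m,       t.sum()],
                    [t.sum(), (t ** 2).sum()]])
    ATb = np.array([b.sum(), (t * b).sum()])
    C, D = np.linalg.solve(ATA, ATb)
    return C, D

# Hypothetical measurements lying exactly on the line b = 1 + 2t
C, D = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(C, D)
```

With noisy measurements the same function returns the line that minimizes the sum of squared vertical distances.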

Example: Three measurements $b_1, b_2, b_3$ are marked on Figure 3.9a:  
$b = 1$ at $t = -1$, $b = 1$ at $t = 1$, $b = 3$ at $t = 2$.  
The first step is to write the equations that would hold if a line could go through all three points. Then
every $C + Dt$ would agree exactly with b:   
$C-D=1$  
$C+D=1$  
$C+2D=3$  
The first column of $A$ contains 1s, and the second column contains the times $t_i$. If those equations $Ax = b$ could be solved, there would be no errors. They can't be solved because the points are not on a line. Therefore they are solved by least squares: $\begin{bmatrix}3 & 2\\
2 & 6
\end{bmatrix}\begin{bmatrix}\hat{C}\\
\hat{D}
\end{bmatrix}=\begin{bmatrix}5\\
6
\end{bmatrix}$. The best solution is $\hat{C} = 9/7 , \hat{D} = 4/7$ and the best line is $9/7 + 4t/7$.

Note the beautiful connections between the two figures. The problem is the same but the art shows it differently. In Figure 3.9b, $b$ is not a combination of the columns $(1, 1, 1)$ and $(-1, 1, 2)$. In Figure 3.9a, the three points are not on a line. Least squares replaces points $b$ that are not on a line by points $p$ that are! Unable to solve $Ax = b$, we solve $A\hat{x} = p$.

The line 9/7 + 4t/7 has heights 5/7 , 13/7, 17/7 at the measurement times −1, 1, 2. Those points do lie on a line. Therefore the vector p = (5/7 , 13/7, 17/7) is in the column space. *This vector is the projection*. Figure 3.9b is in three dimensions (or m dimensions if there are m points, [m rows]) and Figure 3.9a is in two dimensions (or n dimensions if there are n parameters).

Subtracting p from b, the errors are e = (2/7 , −6/7 , 4/7). Those are the vertical errors in Figure 3.9a, and they are the components of the dashed vector in Figure 3.9b.
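
As a check on the arithmetic, the error vector should be perpendicular to both columns of $A$, since it lies in the left nullspace:

$$e^{T}\begin{bmatrix}1\\
1\\
1
\end{bmatrix}=\tfrac{2}{7}-\tfrac{6}{7}+\tfrac{4}{7}=0,\qquad e^{T}\begin{bmatrix}-1\\
1\\
2
\end{bmatrix}=-\tfrac{2}{7}-\tfrac{6}{7}+\tfrac{8}{7}=0$$

The minimized squared error is $E^2 = ||e||^2 = \frac{4+36+16}{49} = \frac{8}{7}$.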

