# Overview of Gilbert Strang's 2018 Matrix Methods course

- Linear algebra => Optimisation => Deep learning
- Linear Algebra => Statistics => Deep learning
- ["Learning from data" book](math.mit.edu/learningfromdata)

## Lecture 01: The column space of $A$ contains all vectors $A x$

- Think of the product $A x$ as a linear combination of the columns of $A$: $A x = A_{:,1} x_1 + A_{:,2} x_2 + \ldots + A_{:,n} x_n$
- $A = C R$ where the columns of $C$ form a basis for $C(A)$, and each column in $C$ is a column of $A$; then $R$ is the first $\operatorname{rank}(A)$ rows of (a column-permutation of) $\operatorname{rref}(A)$.
- Given $C$, a matrix formed from $r = \operatorname{rank}(A)$ linearly independent columns of $A$, and $R$, a matrix formed from $r$ linearly independent rows of $A$, there is a matrix $U$ such that $A = C U R$, and $U$ is an $r \times r$ invertible matrix
  - Question: Are there any more properties of $U$?
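
A small numerical sketch of $A = C U R$, assuming NumPy and a rank-2 example matrix chosen here for illustration. One known answer to the question above is that $U$ can be taken as the inverse of the $r \times r$ block where the chosen rows and columns intersect (called `M` below, a name introduced here).

```python
import numpy as np

# Rank-2 example: row 3 = 2 * row 2 - row 1.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

C = A[:, :2]        # two independent columns of A
R = A[:2, :]        # two independent rows of A
M = A[:2, :2]       # the 2x2 block where those rows and columns intersect

U = np.linalg.inv(M)          # candidate for the r x r mixing matrix
A_rebuilt = C @ U @ R

print(np.linalg.matrix_rank(A))      # 2
print(np.allclose(A_rebuilt, A))     # True
```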

## Lecture 02: Multiplying and factoring matrices

### 5 key factorisations

- $A = L U$ -- Elimination
- $A = Q R$ -- Gram-Schmidt decomposition
- $S = Q \Lambda Q^T$ -- Spectral theorem (for symmetric matrices $S$)
- $A = X \Lambda X^{-1}$ -- Eigenvalue decomposition; needs $n$ independent eigenvectors, so it doesn't work for all matrices
- $A = U \Sigma V^T$ -- Singular Value Decomposition; works for all matrices; orthogonal * diagonal * orthogonal

### LU decomposition in rank-1 picture

- $A = l_1 u_1^T + \begin{pmatrix}0 & 0 \\ 0 & l_2 u_2^T \end{pmatrix} + \begin{pmatrix}0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & l_3 u_3^T \end{pmatrix} + \ldots $

### Orthogonality of fundamental spaces of a matrix $A$

- $C(A^T)$ is orthogonal to $N(A)$
  - because $A x = 0$ says that each row of $A$ is orthogonal to every vector $x$ in $N(A)$
- $C(A)$ is orthogonal to $N(A^T)$



## Lecture 03: Orthonormal columns in $Q$ give $Q^T Q = I$

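$Q$ is used to denote an $m \times n$ matrix with orthonormal columns, that is, $q_{:,i}^T q_{:,j} = \delta_{i,j}$.

Thus:
- $Q^T Q = I_n$, and
- $Q Q^T$ is the $m \times m$ projection onto $C(Q)$; it equals $I_m$ only when $Q$ is square. In a basis adapted to $C(Q)$ it looks like $\begin{pmatrix}I_n & 0 \\ 0 & 0\end{pmatrix}$.

If $Q^T Q = Q Q^T = I$ (so $Q$ is square), then $Q$ is 'orthogonal'.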

### Orthogonal matrices preserve length under $l_2$

i.e. $|Q x| = |x|$

**proof**: $|Q x|^2 = |(Q x)^T (Q x)| = |x^T (Q^T Q) x| = |x^T x| = |x|^2$
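
A quick check of these facts, assuming NumPy. The matrix below is random and purely illustrative: a reduced QR factorisation is used only as a convenient way to get a tall $Q$ with orthonormal columns. The proof above only uses $Q^T Q = I$, so length is preserved even when $Q$ is not square.

```python
import numpy as np

rng = np.random.default_rng(0)
# Reduced QR of a random 4x2 matrix gives Q with 2 orthonormal columns.
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))

P = Q @ Q.T                      # 4x4, projects onto the column space of Q
x = rng.standard_normal(2)

print(np.allclose(Q.T @ Q, np.eye(2)))                        # True
print(np.allclose(P @ P, P), np.allclose(P, np.eye(4)))       # True False
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # True
```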

### Examples of orthogonal matrices

#### rotation matrices
$\begin{pmatrix} \cos{\theta} & -\sin{\theta} \\ \sin{\theta} & \cos{\theta}\end{pmatrix}$

Rotates anti-clockwise by $\theta$ around the origin in 2-d
  
#### reflection matrices
$\begin{pmatrix} \cos{\theta} & \sin{\theta} \\ \sin{\theta} & -\cos{\theta}\end{pmatrix}$

Reflects the plane in the line at angle $\theta/2$

#### "Householder reflections"
Given a unit vector $u$, define $H = I - 2 u u^T$. $H$ reflects through the hyperplane orthogonal to $u$.

#### "Hadamard" matrices

$H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

$H_{2^n} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{2^{n-1}} & H_{2^{n-1}} \\ H_{2^{n-1}} & -H_{2^{n-1}} \end{pmatrix}$
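
The recursion can be sketched with a Kronecker product, assuming NumPy. The helper name `hadamard` is introduced here for illustration only.

```python
import numpy as np

H2 = np.array([[1.0, 1.0],
               [1.0, -1.0]]) / np.sqrt(2.0)

def hadamard(k):
    """Return the normalised 2^k x 2^k Hadamard matrix (k >= 1)."""
    H = H2
    for _ in range(k - 1):
        H = np.kron(H2, H)   # block form [[H, H], [H, -H]] / sqrt(2)
    return H

H8 = hadamard(3)
print(H8.shape)                              # (8, 8)
print(np.allclose(H8.T @ H8, np.eye(8)))     # True
```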

**Conjecture** (Hadamard): For every $n$ that is a multiple of $4$ there is an $n \times n$ matrix with entries $1$ and $-1$ and orthogonal columns. Constructions are known for every such $n$ below $668$; $n = 668$ is the smallest open case.

#### Wavelets

$W_4 = \begin{pmatrix}
1 & 1 & 1 & 0 \\
1 & 1 &-1 & 0 \\
1 &-1 & 0 & 1 \\
1 &-1 & 0 &-1
\end{pmatrix}$

(with each column scaled to unit length, so that the columns are orthonormal)

Haar invented these in 1910; in 1988 Ingrid Daubechies found families of wavelets with entries that were not just 1 and -1.

#### Eigenvectors of a symmetric matrix

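Example: the columns of the discrete Fourier transform matrix are the eigenvectors of the cyclic permutation matrix $Q = P_{2,3,\ldots,n-1,n,1}$ (i.e. $Q$ puts row 2 in row 1, row 3 in row 2, etc., and row 1 in row $n$). $Q$ is orthogonal rather than symmetric, but like a symmetric matrix it has orthogonal eigenvectors. (Note that $Q^T Q$ is just $I$, so it is $Q$ itself whose eigenvectors matter here.)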

*I didn't understand this bit*
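
One way to make the example concrete, assuming NumPy and taking $n = 4$: build the cyclic shift $Q$ and the Fourier matrix $F$ with entries $F_{jk} = \omega^{jk}$, $\omega = e^{2 \pi i / n}$, and check that each column of $F$ is an eigenvector of $Q$ with eigenvalue $\omega^k$. The variable names are choices made here.

```python
import numpy as np

n = 4
Q = np.roll(np.eye(n), -1, axis=0)       # cyclic shift: (Q x)_i = x_{i+1}, wrapping round

omega = np.exp(2j * np.pi / n)
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = omega ** (j * k)                      # Fourier matrix, F[j, k] = omega^(j k)
eigvals = omega ** np.arange(n)           # the n-th roots of unity

print(np.allclose(Q @ F, F * eigvals))               # True: Q f_k = omega^k f_k
print(np.allclose(F.conj().T @ F, n * np.eye(n)))    # True: columns are orthogonal
```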

## Lecture 04: Eigenvalues and Eigenvectors

Useful because they allow you to work with powers of matrices.

The eigenvectors of a general matrix $A$ are not necessarily orthogonal to each other. (*Find an example*)
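
A minimal example, assuming NumPy: the triangular matrix below (chosen here) has eigenvectors $(1, 0)$ and $(1, 1)/\sqrt{2}$, which are not orthogonal.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 2.0]])               # not symmetric
eigvals, X = np.linalg.eig(A)            # columns of X are unit eigenvectors

cosine = X[:, 0] @ X[:, 1]               # zero only if the eigenvectors are orthogonal
print(np.allclose(A @ X, X * eigvals))   # True
print(abs(cosine))                       # about 0.707, so not orthogonal
```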

### Similar matrices
**Definition**: $B$ is *similar* to $A$ if there exists an invertible matrix $M$ such that $B = M^{-1} A M$.

If $B$ is similar to $A$, then they have the same eigenvalues. (Easy to prove)

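Corollary: if $B$ is invertible, $A B$ and $B A$ have the same eigenvalues, since $A B = B^{-1} (B A) B$ (just use $M = B$). More generally, $A B$ and $B A$ always have the same non-zero eigenvalues.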
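
A quick numerical check of the corollary, assuming NumPy; the two $2 \times 2$ matrices are arbitrary choices made here.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 1.0]])               # invertible (determinant -1)

ev_AB = np.sort(np.linalg.eigvals(A @ B))
ev_BA = np.sort(np.linalg.eigvals(B @ A))
via_similarity = np.linalg.inv(B) @ (B @ A) @ B    # B^{-1} (B A) B

print(np.allclose(ev_AB, ev_BA))             # True
print(np.allclose(via_similarity, A @ B))    # True
```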

Computing eigenvalues of $A$ --- usually involves picking better and better values of $M$ to find a triangular matrix similar to $A$. (*How does this work?*)

### (Real) Symmetric matrices
- Have real eigenvalues (*prove this*)
- Have orthogonal eigenvectors (*prove this*)
- And thus $Q$, the matrix with (unit-length) eigenvectors as columns, is orthogonal, and
- $S = Q \Lambda Q^T$
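
These properties can be checked numerically, assuming NumPy. `np.linalg.eigh` is the routine for symmetric (Hermitian) input; the matrix is an example chosen here, with eigenvalues 1 and 3.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, Q = np.linalg.eigh(S)      # real eigenvalues in ascending order

print(np.allclose(lam, [1.0, 3.0]))               # True: real eigenvalues
print(np.allclose(Q.T @ Q, np.eye(2)))            # True: orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))     # True: S = Q Lambda Q^T
```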

## Lecture 05: Positive definite and semidefinite matrices

### Symmetric positive definite matrices
Equivalent definitions:
 1. All eigenvalues are real and positive ($\lambda_i > 0$)
 2. Energy $x^T S x > 0$ for any $x \ne 0$
 3. $S = A^T A$ (independent columns in $A$)
 4. All leading determinants (determinants of the top-left $k \times k$ blocks) are > 0
 5. All pivots in elimination are > 0
 
**Example:**
$S = \begin{pmatrix}3 & 4 \\ 4 & 5\end{pmatrix}$ is *not* positive definite (its determinant is $|S| = 15 - 16 = -1$).

$x^T S x$ is a quadratic form. If $S$ is positive definite, then $f(x) = x^T S x$ is strictly convex, with a unique minimum at $x = 0$.
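
A sketch of several of the tests on the example $S$, assuming NumPy. The helper `is_positive_definite` and the comparison matrix `T` are introduced here; the helper relies on the Cholesky factorisation succeeding only for positive definite input.

```python
import numpy as np

def is_positive_definite(S):
    """Test a symmetric matrix via Cholesky, which fails unless S is positive definite."""
    try:
        np.linalg.cholesky(S)
        return True
    except np.linalg.LinAlgError:
        return False

S = np.array([[3.0, 4.0], [4.0, 5.0]])       # the example above: determinant -1
T = np.array([[3.0, 4.0], [4.0, 6.0]])       # determinant 2, leading entry 3

x = np.array([4.0, -3.0])
print(np.linalg.eigvalsh(S))                 # one negative, one positive eigenvalue
print(x @ S @ x)                             # -3.0: negative energy
print(is_positive_definite(S), is_positive_definite(T))   # False True
```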

### Symmetric positive semi-definite matrices
Equivalent definitions:
 1. All eigenvalues are real and positive or zero ($\lambda_i \ge 0$)
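 2. Energy $x^T S x \ge 0$ for all $x$
 3. $S = A^T A$ (dependent columns allowed in $A$)
 4. All principal minors are $\ge 0$ (checking only the leading ones is not enough: $\begin{pmatrix}0 & 0 \\ 0 & -1\end{pmatrix}$ has leading determinants $0, 0$ but is not semi-definite)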
 5. All pivots in elimination are $\ge 0$

## Lecture 06: Singular Value Decomposition

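Like the eigenvalue decomposition, but works for rectangular and singular matrices:

- For a symmetric matrix, eigenvalues and eigenvectors exist and the eigenvectors form a complete (orthogonal) basis.
- For a general square matrix, this is not the case.
- For a rectangular matrix, certainly not: $A x = \lambda x$ doesn't even make sense, because $A x$ and $x$ have different sizes.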

The SVD of the $m \times n$ matrix $A$ is $A = U \Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and

$\Sigma = \begin{bmatrix}
\sigma_1 & 0        & \ldots & 0 \\
0        & \sigma_2 & \ldots & 0 \\
0        & 0        & \ddots & 0 \\
\end{bmatrix}$,

where the $\sigma_i$ are the 'singular values'. $\Sigma$ is $m \times n$. The $\sigma_i$ are all non-negative, and exactly $r = \operatorname{rank}(A)$ of them are positive.

### The details

1. The key observation is that $A^T A$ is symmetric, square, and positive semi-definite.
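2. Thus: $A^T A$ can be decomposed: $A^T A = V \Lambda V^T$, with $V$ orthogonal, and $\Lambda$ diagonal with non-negative entries.
3. We also have $A A^T$ is symmetric and positive semi-definite, so $A A^T = U \Lambda' U^T$ with $U$ orthogonal; $\Lambda'$ has the same non-zero entries as $\Lambda$, but is $m \times m$ rather than $n \times n$.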
4. Now look for $A v_i = \sigma_i u_i$, where the $v_i$ and $u_i$ are sets of orthogonal vectors.

"We're looking for one set of orthogonal vectors in the 'input space' of $A$, and a set of orthogonal vectors in the 'output space' of $A$ that transform to each other via $A$".

We then have $A V = U \Sigma$, which then gives us $A = U \Sigma V^T$.

Now, what are the $V$s and what are the $U$s?

We can see that if $A = U \Sigma V^T$ then $A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T = V \Lambda V^T$. That is, the columns of $V$ are the eigenvectors of $A^T A$, and the diagonal entries of $\Lambda = \Sigma^T \Sigma$, namely the $\sigma_i^2$, are the eigenvalues of $A^T A$. Similarly for $A A^T$.

Haven't quite finished: need to deal with the case of repeated eigenvalues --- and hence have 'eigenspaces'. Need to pick the appropriate eigenvectors from these spaces to satisfy $A v_i = \sigma_i u_i$.

We do this by fixing the $v$s to be a particular orthonormal set of eigenvectors of $A^T A$, and then solving $u_i = A v_i / \sigma_i$ (for each $\sigma_i > 0$). This ensures that whatever choices we make for the $v_i$, we get the appropriate $u_i$ in the degenerate cases.

Finally, we just need to show that the $u_i$'s picked in this way are orthonormal: $u_i^T u_j = \frac{v_i^T A^T A v_j}{\sigma_i \sigma_j} = v_i^T v_j \frac{\sigma_j^2}{\sigma_i \sigma_j} = \delta_{i,j}$
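
The construction can be followed step by step in code. This is a sketch assuming NumPy and a small invertible example matrix chosen here (so every $\sigma_i > 0$); it is not how library SVD routines work internally.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

lam, V = np.linalg.eigh(A.T @ A)     # eigenvalues of A^T A in ascending order
order = np.argsort(lam)[::-1]        # reorder so that sigma_1 >= sigma_2
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)                 # singular values
U = (A @ V) / sigma                  # u_i = A v_i / sigma_i (needs sigma_i > 0)

print(np.allclose(U.T @ U, np.eye(2)))                         # True: u_i orthonormal
print(np.allclose(U @ np.diag(sigma) @ V.T, A))                # True: A = U Sigma V^T
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))  # True
```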

### Geometry of the SVD

"Every matrix factors into a rotation, then 'stretch', then rotation"

![Geometry of the SVD](./images/06-01-svd_geometry.png)

### The SVD as a sum of rank-1 matrices

Can rewrite the SVD in the following form:

$A = U \Sigma V^T = \sum_{i=1}^r \sigma_i u_i v_i^T$ (with $r = \operatorname{rank}(A)$; terms with $\sigma_i = 0$ contribute nothing)

where the singular values $\sigma_i$ are in descending order (that is, $\sigma_i \ge \sigma_j$ when $i \le j$).

Each of the $u_i v_i^T$ is a rank-1 matrix.

## Lecture 07: Eckart-Young: The Closest Rank k Matrix to A

### Eckart-Young theorem:

The rank-$k$ approximation to $A$ you get by keeping only the $k$ highest singular values and their associated rank-1 matrices is in a sense the 'best' rank-$k$ approximation to $A$:

Let $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$

**Theorem**: If $B$ is any rank-$k$ matrix of the same shape as $A$, then $||A - B|| \ge ||A - A_k||$. (NB: *this does not hold for every norm, but it does hold for the three norms below; more generally the norm should be orthogonally invariant*)

**Proof**: *Fill in*
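
A numerical illustration (not a proof), assuming NumPy. The matrix `A` and the competing rank-$k$ matrix `B` are random and purely illustrative. The error of the truncated SVD is $\sigma_{k+1}$ in the $L_2$ norm and $\sqrt{\sigma_{k+1}^2 + \ldots + \sigma_r^2}$ in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is in descending order

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]    # sum of the first k rank-1 pieces

err_2 = np.linalg.norm(A - A_k, 2)      # spectral-norm error
err_F = np.linalg.norm(A - A_k, "fro")  # Frobenius-norm error

# A competing rank-k matrix, chosen at random purely for comparison.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))

print(np.isclose(err_2, s[k]))                            # True: error is sigma_{k+1}
print(np.isclose(err_F, np.sqrt(np.sum(s[k:] ** 2))))     # True
print(np.linalg.norm(A - B, 2) >= err_2)                  # True
```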

### Matrix norms

Examples

1. $L_2$ norm: $||A||_2 = \sigma_1$, the largest singular value
2. Frobenius norm: $||A||_F = \sqrt{\sum_{i,j} |A_{i,j}|^2} $
3. Nuclear norm: $||A||_{n} = \sum_i \sigma_i$
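
All three norms can be read off the singular values. A short check, assuming NumPy and an example matrix chosen here; `np.linalg.norm` accepts `2`, `"fro"` and `"nuc"` as matrix norm orders.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
s = np.linalg.svd(A, compute_uv=False)     # singular values, descending

spectral = np.linalg.norm(A, 2)
frobenius = np.linalg.norm(A, "fro")
nuclear = np.linalg.norm(A, "nuc")

print(np.isclose(spectral, s[0]))                       # True: largest singular value
print(np.isclose(frobenius, np.sqrt(np.sum(s ** 2))))   # True
print(np.isclose(nuclear, np.sum(s)))                   # True
```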

### Principal Component Analysis - PCA

Given data $X$ (one sample per column, $N$ samples), find the best rank-$k$ approximation to the sample covariance matrix $\frac{(X-\bar{X})(X-\bar{X})^T}{N-1}$.

## Lecture 08: Norms of matrices and vectors

A 'norm' is a way to measure the size of a thing.

### Vector norms

#### $||v||_p$

$||v||_p = \left(|v_1|^p + |v_2|^p + \ldots \right)^{1/p}$.

What about p=0 (or really, $p \lt 1$)?

$||v||_0 = \text{number of nonzero components}$, but this is not a norm: scaling a nonzero $v$ by 2 leaves $||v||_0$ unchanged, so $||2 v||_0 \ne 2 ||v||_0$.

$p$-balls of radius 1 in $R^2$:
![p-balls of radius 1](./images/08-01-p-balls.png)

#### $S$-norm

Given a symmetric positive definite matrix $S$, then define $||v||_S = \sqrt{v^T S v}$.

This is a 'weighted norm', and the unit ball is an ellipse in $R^2$.

#### A common problem class

*Problem*: Minimise $||x||$, subject to the constraint that $A x = b$.

When $||\cdot||$ is the $l_1$ norm, this is called basis pursuit. The 'winning $x$' is sparse.

When $||\cdot||$ is the $l_2$ norm, this gives the minimum-norm least-squares solution $x = A^+ b$, which is closely related to ridge regression. The 'winning $x$' is not sparse.

### Matrix norms

Have a matrix $A$.

#### Getting a matrix norm from a vector norm

In general, given a vector norm $||\cdot||$, you can define a related matrix norm via

$||A|| = \max_{x \ne 0} \frac{||A x||}{||x||}$.

**Proof**: *Prove that $||A||$ defined in this way is a valid norm*

#### Example 1: $||A||_2 = \sigma_1$

This comes about by using the $l_2$ vector norm in the above.

#### Example 2: The Frobenius norm $||A||_F = \sqrt{\sum |a_{ij}|^2}$

**Proof**: *Prove how the Frobenius norm is related to the singular values*

#### Example 3: The nuclear norm $||A||_n = \sum |\sigma_i|$

Professor Srebro at U. Chicago has a conjecture that deep learning picks out the weight matrix that minimises the nuclear norm.



## Lecture 09: Four ways to solve Least Squares Problems

1. Solve the normal equations $A^T A \hat{x} = A^T b$
2. Pseudoinverse of $m \times n$ matrix $A$
3. Orthogonalise first - Gram-Schmidt procedure
4. (didn't get to this)

### 1. Solve the normal equations

Gauss's suggestion that when $A x = b$ has no solution, we should find the $\hat{x}$ that minimises $||b - A \hat{x}||_2$ leads to the *normal equations*:

$A^T A \hat{x} = A^T b$

When $A$ has independent columns ($A$ is rank $n$), then 
$A^T A$ is invertible, and it's easy to solve $A^T A \hat{x} = A^T b$ --- we just have $\hat{x} = (A^T A)^{-1} A^T b$.
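
A small worked example, assuming NumPy: fitting a line $c + d t$ through the points $(0, 6)$, $(1, 0)$, $(2, 0)$, data chosen here for illustration. The solution is $\hat{x} = (5, -3)$, and the residual is orthogonal to the columns of $A$.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])           # columns: constant term and t
b = np.array([6.0, 0.0, 0.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)            # normal equations
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]       # library least squares
residual = b - A @ x_hat

print(np.allclose(x_hat, [5.0, -3.0]))       # True
print(np.allclose(x_hat, x_lstsq))           # True
print(np.allclose(A.T @ residual, 0.0))      # True: residual is orthogonal to C(A)
```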

### 2. Pseudoinverse $A^+$ of $A$

Given an $m \times n$ matrix $A$, the pseudoinverse $A^+$ is $n \times m$, and is the matrix that is 'as close to the inverse of $A$ as possible': that is, $A A^+$ is as close to $I$ as possible.

The pseudoinverse inverts the action of $A$ on vectors in its row space (i.e. because that's the bit that *is* invertible):
![Action of the Pseudoinverse](./images/09-01-pseudoinverse-action.png)

**To do**: There are some properties of the four fundamental spaces of $A$ that I'd like to make sure I understand:

1. I *think* that if $x \in C(A^T)$ and $x \ne 0$ then $A x \ne 0$. Yes, this is true, because $R^n = N(A) \oplus C(A^T)$: a vector lying in both $C(A^T)$ and $N(A)$ would be orthogonal to itself, forcing $x = 0$. I'm struggling to find an intuitive way to see this. The best I've come up with so far is to think about projections onto the null space or row space.

The pseudoinverse of $A$ can easily be expressed relative to its SVD, $A = U \Sigma V^T$. That is, $A^+ = V \Sigma^+ U^T$, where the pseudoinverse $\Sigma^+$ of $\Sigma$ is the $n \times m$ matrix:

$\Sigma^+ = \begin{bmatrix}
1/\sigma_1 & 0          & 0      \\
0          & 1/\sigma_2 & 0      \\
0          & 0          & \ddots \\
\vdots     & \vdots     & 
\end{bmatrix}$.

Connecting this back to the solutions of the normal equations, consider the case that $A$ has independent columns (rank $n$, i.e. $N(A) = \{0\}$). Then $A^T A$ is invertible, and we have $A^+ = (A^T A)^{-1} A^T$. *Check this*
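
A numerical check of this formula on one example (not a proof), assuming NumPy and a $3 \times 2$ matrix with independent columns chosen here. `np.linalg.pinv` computes $A^+$ via the SVD.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])            # 3x2 with independent columns

A_plus = np.linalg.pinv(A)                     # computed via the SVD
formula = np.linalg.inv(A.T @ A) @ A.T         # (A^T A)^{-1} A^T

print(A_plus.shape)                                # (2, 3)
print(np.allclose(A_plus, formula))                # True
print(np.allclose(A_plus @ A, np.eye(2)))          # True: a left inverse
print(np.allclose(A @ A_plus, np.eye(3)))          # False: only a projection onto C(A)
```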

### 3. Gram-Schmidt

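This applies when $A$ has rank $n$ (just as for solving the normal equations). Find an orthonormal basis for $C(A)$ by successively orthogonalising (and normalising) the columns of $A$. This gives $A = Q R$ with $R$ upper triangular, and the normal equations reduce to $R \hat{x} = Q^T b$.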
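
A minimal sketch of classical Gram-Schmidt used for least squares, assuming NumPy and independent columns. The function name `gram_schmidt` and the data are choices made here; in practice a library QR routine such as `np.linalg.qr` would normally be preferred for numerical stability.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt: return Q (orthonormal columns) and upper-triangular R with A = Q R."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # component of column j along q_i
            v -= R[i, j] * Q[:, i]        # remove it
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

Q, R = gram_schmidt(A)
x_hat = np.linalg.solve(R, Q.T @ b)       # solves R x = Q^T b

print(np.allclose(Q @ R, A))              # True
print(np.allclose(x_hat, [5.0, -3.0]))    # True
```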


## Lecture 10: Survey of difficulties with $A x = b$

Have received the problem $A x = b$ (the key problem of linear algebra); have to produce an answer...

0. Using the pseudoinverse: this always works $x = A^+ b$ - but may not be computationally feasible / stable
1. Easy case: size of $A$ is ok, and $A$ is well conditioned, then $x = A \backslash b$
2. Too many equations: $m > n$, then $A^T A \hat{x} = A^T b$ still works
3. Underdetermined: $m < n$, then there are many solutions, and we need to have additional condition - e.g. minimise the $l^1$ norm. The key thing in ML applications is that the solution needs to generalise beyond the training data.
4. Columns in bad condition (high correlation between them) - then Gram-Schmidt (orthogonalise the columns; $A = Q R$); Use column-pivoting to improve numerics.
5. Near singular; have an 'inverse problem' - then add a penalty to regularise: $\min{||A x - b||^2 + \delta^2 ||x||^2}$ -- question is: how big to make the penalty $\delta$ (**focus for this lecture**)
6. Too big - i.e. $m$ and $n$ are really big; can't fit in memory. Then use iterative methods (Krylov-subspace methods such as conjugate gradients)
7. Waaay too big - i.e. $m$ and $n$ are really, really big. Then 'randomized linear algebra': sample the columns of $A$, and work with the sample.

Focussing on case 5.

### Regularising $A x = b$ where $A$ is a near-singular matrix

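We solve for $\hat{x} = \textrm{argmin}_x \left( ||A x - b||^2 + \delta^2 ||x||^2 \right)$ with $\delta^2 \gt 0$. This is equivalent to solving the following stacked system in the least-squares sense, whose normal equations are $(A^T A + \delta^2 I) \hat{x} = A^T b$: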

$\begin{bmatrix} A \\ \delta I\end{bmatrix} x = \begin{bmatrix} b \\ 0\end{bmatrix}$

The natural question is which $\delta$ to choose; this lecture doesn't answer it.

Rather: what happens to $\hat{x}$ in the limit as $\delta \rightarrow 0$?  We converge towards $x = A^+ b$ -- that is: for any $A$, $(A^T A + \delta^2 I)^{-1} A^T$ converges to $A^+$ as $\delta \rightarrow 0$.

To prove this, use the SVD. ("If you want to prove anything about matrices, the SVD is your best tool")
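
The limit can be observed numerically before proving it. A sketch assuming NumPy, with a rank-1 matrix chosen here so that $A^T A$ is singular and $(A^T A)^{-1}$ does not exist. In SVD terms, each singular value contributes $\sigma / (\sigma^2 + \delta^2)$, which tends to $1/\sigma$ when $\sigma > 0$ and stays at $0$ when $\sigma = 0$.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # rank 1, so A^T A is singular
A_pinv = np.linalg.pinv(A)

errors = []
for delta in (1.0, 0.1, 0.01):
    # (A^T A + delta^2 I)^{-1} A^T, computed with a linear solve
    ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(2), A.T)
    errors.append(np.linalg.norm(ridge - A_pinv))

print(errors)    # shrinks towards 0 as delta decreases
```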