# SINGULAR VALUE DECOMPOSITION (SVD)

# Content
- four fundamental spaces of linear algebra
- under-determined and over-determined least squares regressions
- principal components analysis (PCA)

The singular value decomposition (SVD) is a work-horse in applications of least squares projection that form foundations for many statistical and machine learning methods. After defining the SVD, we’ll describe how it connects to four fundamental spaces of linear algebra, under-determined and over-determined least squares regressions, and principal components analysis (PCA).

## The Setting

Let $X$ be an $m\times n$ matrix of rank $p$.

Necessarily, $p\leq \min(m,n)$. In much of this notebook, we’ll think of $X$ as a matrix of data in which
- each column is an **individual** – a time period or person, depending on the application
- each row is a **random variable** describing an attribute of a time period or a person, depending on the application

We’ll be interested in two situations
- A short and fat case in which $m << n$, so that there are many more columns (individuals) than rows (attributes).
- A tall and skinny case in which $m >> n$, so that there are many more rows (attributes) than columns (individuals).

We’ll apply a **singular value decomposition** of $X$ in both situations. 

In the $m << n$ case in which there are many more individuals $n$ than attributes $m$, we can calculate sample moments of a joint distribution by taking averages across observations of functions of the observations.
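As a minimal sketch of what "taking averages across observations" means when columns are individuals and rows are attributes, the following computes a sample mean vector and a sample covariance matrix. The random data, the seed, and the sizes `m` and `n` are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 1000                       # m attributes (rows), n individuals (columns)
X = rng.normal(size=(m, n))          # illustrative data only

# First moment: average each attribute (row) across individuals (columns)
mean = X.mean(axis=1, keepdims=True)     # shape (m, 1)

# Second moment: average the outer products of demeaned observations
Xc = X - mean
cov = Xc @ Xc.T / n                      # shape (m, m)

print(mean.ravel())
print(cov)
```

With $n$ large relative to $m$, each entry of `cov` is an average over many observations, which is what makes moment estimates like these usable in the short-fat case.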

In this $m << n$ case, we’ll look for **patterns** by using a **singular value decomposition** to do a **principal components analysis** (PCA).

In the $m >> n$ case, in which there are many more attributes $m$ than individuals $n$, and when we are in a time-series setting in which $n$ equals the number of time periods covered in the data set $X$, we’ll proceed in a different way. We’ll again use a singular value decomposition, but now to construct a **dynamic mode decomposition** (DMD).

## Singular Value Decomposition

A **singular value decomposition** of an $m\times n$ matrix $X$ of rank $p\leq \min(m,n)$ is
$$
\begin{equation}
X = U\Sigma V^T
\end{equation}\tag{5.1}
$$
where
$$
UU^T = I, U^TU = I
$$
$$
VV^T = I, V^TV = I
$$
and
- $U$ is an $m\times m$ orthogonal matrix of **left singular vectors** of $X$
- Columns of $U$ are eigenvectors of $XX^T$
- $V$ is an $n\times n$ orthogonal matrix of **right singular vectors** of $X$
- Columns of $V$ are eigenvectors of $X^TX$
- $\Sigma$ is an $m\times n$ matrix in which the first $p$ places on its main diagonal are positive numbers $\sigma_1, \sigma_2,...,\sigma_p$ called **singular values**; remaining entries of $\Sigma$ are all zero
- The $p$ singular values are positive square roots of the eigenvalues of the $m\times m$ matrix $XX^T$ and also of the $n\times n$ matrix $X^TX$ 
- We adopt a convention that when $U$ is a complex valued matrix, $U^T$ denotes the **conjugate-transpose** or **Hermitian-transpose** of $U$, meaning that $U_{ij}^T$ is the complex conjugate of $U_{ji}$.
- Similarly, when $V$ is a complex valued matrix, $V^T$ denotes the **conjugate-transpose** or **Hermitian-transpose** of $V$
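The properties listed above can be checked numerically. The sketch below assumes a small random real matrix (sizes and seed are arbitrary choices) and uses `np.linalg.svd`, which returns the singular values as a one-dimensional array and returns $V^T$ rather than $V$; the names `VT` and `Sigma` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
X = rng.normal(size=(m, n))

# NumPy returns V transposed (called VT here) and the singular values as a 1-D array
U, s, VT = np.linalg.svd(X, full_matrices=True)

# Embed the singular values in an m x n matrix Sigma
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

# X = U Sigma V^T, with U and V orthogonal
print(np.allclose(X, U @ Sigma @ VT))
print(np.allclose(U.T @ U, np.eye(m)), np.allclose(VT @ VT.T, np.eye(n)))

# Squared singular values are the eigenvalues of X X^T ...
eig_XXT = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
print(np.allclose(s**2, eig_XXT))

# ... and the m largest eigenvalues of X^T X (the remaining n - m are zero)
eig_XTX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(s**2, eig_XTX[:m]))
```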

The matrices $U, \Sigma, V$ entail linear transformations that reshape vectors in the following ways:

- multiplying vectors by the unitary matrices $U$ and $V$ **rotates** them, but leaves **angles between vectors** and **lengths of vectors** unchanged.

- multiplying vectors by the diagonal matrix $\Sigma$ **rescales** them along the coordinate axes; unless all diagonal entries are equal, this generally changes both **lengths of vectors** and **angles between vectors**.

Thus, representation (5.1) asserts that multiplying an $n\times 1$ vector $y$ by the $m\times n$ matrix $X$ amounts to performing the following three multiplications of $y$ sequentially:

- **rotating** $y$ by computing $V^Ty$
- **rescaling** $V^Ty$ by multiplying it by $\Sigma$
- **rotating** $\Sigma V^Ty$ by multiplying it by $U$
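The three steps can be traced in code. This is a sketch on an arbitrary random matrix and vector (the sizes, seed, and the names `step1`, `step2`, `step3` are illustrative assumptions); it confirms that applying the steps in sequence reproduces $Xy$ and that the first rotation leaves the length of $y$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
X = rng.normal(size=(m, n))
y = rng.normal(size=n)

U, s, VT = np.linalg.svd(X, full_matrices=True)   # VT is V transposed
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

step1 = VT @ y          # rotate y
step2 = Sigma @ step1   # rescale (Sigma is m x n, so this also maps R^n into R^m)
step3 = U @ step2       # rotate the rescaled vector

print(np.allclose(step3, X @ y))
print(np.isclose(np.linalg.norm(step1), np.linalg.norm(y)))
```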

This structure of the $m\times n$ matrix $X$ opens the door to constructing systems of data **encoders** and **decoders**.

Thus,

- $V^T$ is an encoder that produces the encoded data $V^Ty$
- $\Sigma$ is an operator to be applied to the encoded data
- $U$ is a decoder to be applied to the output from applying operator $\Sigma$ to the encoded data

We’ll apply this circle of ideas later in this notebook when we study Dynamic Mode Decomposition.

**Road Ahead**

What we have described above is called a **full** SVD.

In a full SVD, the shapes of $U, \Sigma$, and $V$ are $(m,m)$, $(m,n)$, $(n,n)$, respectively. Later we’ll also describe an **economy** or **reduced** SVD.

Before we study a **reduced** SVD we’ll say a little more about properties of a **full** SVD.

## Four Fundamental Subspaces

Let $\mathcal C$ denote a column space, $\mathcal N$ denote a null space, and $\mathcal R$ denote a row space.

Let’s start by recalling the four fundamental subspaces of an $m\times n$ matrix $X$ of rank $p$.
- The **column space** of $X$, denoted $\mathcal C(X)$, is the span of the columns of $X$, i.e., all vectors $y$ that can be written as linear combinations of columns of $X$. Its dimension is $p$.
- The **null space** of $X$, denoted $\mathcal N(X)$, consists of all vectors $y$ that satisfy $Xy = 0$. Its dimension is $n-p$.
- The **row space** of $X$, denoted $\mathcal R(X)$, is the column space of $X^T$. It consists of all vectors $z$ that can be written as linear combinations of rows of $X$. Its dimension is $p$.
- The **left null space** of $X$, denoted $\mathcal N(X^T)$, consists of all vectors $z$ such that $X^Tz = 0$. Its dimension is $m-p$.

For a full SVD of a matrix $X$, the matrix $U$ of **left singular vectors** and the matrix $V$ of **right singular vectors** contain orthogonal bases for all four subspaces.

They form two pairs of orthogonal subspaces that we’ll describe now.

Let $u_i, i=1,...,m$ be the $m$ column vectors of $U$ and let $v_i, i=1,...,n$ be the $n$ column vectors of $V$.

Let’s write the full SVD of $X$ as
$$
\begin{equation}
X = 
\begin{bmatrix}
U_L & U_R
\end{bmatrix}
\begin{bmatrix}
\Sigma_p & 0\\
0 & 0
\end{bmatrix}
\begin{bmatrix}
V_L & V_R
\end{bmatrix}^T 
\end{equation}\tag{5.2}
$$

where $\Sigma_p$ is a $p\times p$ diagonal matrix with the $p$ singular values on the diagonal and

$$
U_L = 
\begin{bmatrix}
u_1 & \dots & u_p
\end{bmatrix}, 
U_R = 
\begin{bmatrix}
u_{p+1} & \dots & u_m
\end{bmatrix}, 
$$
$$
V_L = 
\begin{bmatrix}
v_1 & \dots & v_p
\end{bmatrix}, 
V_R = 
\begin{bmatrix}
v_{p+1} & \dots & v_n
\end{bmatrix}, 
$$

Representation (5.2) implies that

$$
\begin{equation}
X
\begin{bmatrix}
V_L & V_R
\end{bmatrix}
 = 
\begin{bmatrix}
U_L & U_R
\end{bmatrix}
\begin{bmatrix}
\Sigma_p & 0\\
0 & 0
\end{bmatrix}
\end{equation}
$$

or

$$
\begin{align}
XV_L = U_L\Sigma_p\\
XV_R = 0
\end{align}\tag{5.3}
$$
or
$$
\begin{align}
Xv_i = \sigma_i u_i, i=1,...,p\\
Xv_i = 0, i=p+1,...,n
\end{align}\tag{5.4}
$$

Equations (5.4) tell how the transformation $X$ maps a pair of orthonormal vectors $v_i$, $v_j$, for $i$ and $j$ both less than or equal to the rank $p$ of $X$, into a pair of orthogonal vectors $\sigma_i u_i$, $\sigma_j u_j$ that are scaled versions of the orthonormal vectors $u_i$, $u_j$. Equations (5.3) assert that
$$
\begin{align}
\mathcal C(X) = \mathcal C(U_L)\\
\mathcal N(X) = \mathcal C(V_R)
\end{align}
$$

Taking transposes on both sides of representation (5.2) implies

$$
X^T
\begin{bmatrix}
U_L & U_R
\end{bmatrix}
=
\begin{bmatrix}
V_L & V_R
\end{bmatrix}
\begin{bmatrix}
\Sigma_p & 0\\
0 & 0
\end{bmatrix}
$$

or 

$$
\begin{align}
X^TU_L = V_L\Sigma_p\\
X^TU_R = 0
\end{align}\tag{5.5}
$$

or

$$
\begin{align}
X^Tu_i = \sigma_i v_i, i=1,...,p\\
X^Tu_i = 0, i=p+1,...,m
\end{align}\tag{5.6}
$$

Notice how equations (5.6) assert that the transformation $X^T$ maps a pair of distinct orthonormal vectors $u_i$, $u_j$, for $i$ and $j$ both less than or equal to the rank $p$ of $X$, into a pair of orthogonal vectors $\sigma_i v_i$, $\sigma_j v_j$ that are scaled versions of the orthonormal vectors $v_i$, $v_j$. Equations (5.5) assert that

$$
\begin{align}
\mathcal R(X) \equiv \mathcal C(X^T) = \mathcal C(V_L)\\
\mathcal N(X^T) = \mathcal C(U_R)
\end{align}
$$

Thus, taken together, the systems of equations (5.3) and (5.5) describe the four fundamental subspaces of $X$ in the following ways:
$$
\begin{align}
\mathcal C(X) = \mathcal C(U_L)\\
\mathcal N(X^T) = \mathcal C(U_R)\\
\mathcal R(X) \equiv \mathcal C(X^T) = \mathcal C(V_L)\\
\mathcal N(X) = \mathcal C(V_R)\\
\end{align}\tag{5.7}
$$



Since $U$ and $V$ are both orthogonal matrices, collection (5.7) asserts that
- the columns of $U_L$ form an orthonormal basis for the column space of $X$
- the columns of $U_R$ form an orthonormal basis for the null space of $X^T$
- the columns of $V_L$ form an orthonormal basis for the row space of $X$
- the columns of $V_R$ form an orthonormal basis for the null space of $X$

We have verified the four claims in (5.7) simply by performing the multiplications called for by the right side of (5.2) and reading them.

The claims in (5.7) and the fact that $U$ and $V$ are both unitary (i.e., orthonormal) matrices imply that
- the column space of $X$ is orthogonal to the null space of $X^T$
- the null space of $X$ is orthogonal to the row space of $X$

Sometimes these properties are described with the following two pairs of orthogonal complement subspaces:
- $\mathcal C(X)$ is the orthogonal complement of $\mathcal N(X^T)$ 
- $\mathcal R(X)$ is the orthogonal complement of $\mathcal N(X)$ 
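Before turning to a specific matrix, here is a short numerical check of relations (5.3) and (5.5) and of the two orthogonality claims. It is a sketch under stated assumptions: the matrix is built as a product of two random factors so that its rank is $p=2$ by construction, and the names `U_L`, `U_R`, `V_L`, `V_R` simply mirror the blocks in (5.2).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 5, 4, 2
X = rng.normal(size=(m, p)) @ rng.normal(size=(p, n))   # rank p by construction

U, s, VT = np.linalg.svd(X, full_matrices=True)
V = VT.T                                   # NumPy returns V transposed
U_L, U_R = U[:, :p], U[:, p:]
V_L, V_R = V[:, :p], V[:, p:]

# (5.3): X V_L = U_L Sigma_p and X V_R = 0
print(np.allclose(X @ V_L, U_L * s[:p]), np.allclose(X @ V_R, 0))

# (5.5): X^T U_L = V_L Sigma_p and X^T U_R = 0
print(np.allclose(X.T @ U_L, V_L * s[:p]), np.allclose(X.T @ U_R, 0))

# Column space is orthogonal to left null space; row space is orthogonal to null space
print(np.allclose(U_L.T @ U_R, 0), np.allclose(V_L.T @ V_R, 0))
```

Multiplying `U_L` by `s[:p]` scales column $i$ by $\sigma_i$, which is the same as post-multiplying by the diagonal matrix $\Sigma_p$.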

Let’s do an example.

In [1]:
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

In [2]:
np.set_printoptions(precision=2)

# Define the matrix
A = np.array([[1, 2, 3, 4, 5],
              [2, 3, 4, 5, 6],
              [3, 4, 5, 6, 7],
              [4, 5, 6, 7, 8],
              [5, 6, 7, 8, 9]])

# Compute the SVD of the matrix
# Note: np.linalg.svd returns the transpose of V as its third output
U, S, VT = np.linalg.svd(A,full_matrices=True)

# Compute the rank of the matrix
rank = np.linalg.matrix_rank(A)

# Print the rank of the matrix
print("Rank of matrix:\n", rank)
print("S: \n", S)

# Compute the four fundamental subspaces
col_space = U[:, :rank]            # first p columns of U
left_null_space = U[:, rank:]      # remaining columns of U
row_space = VT[:rank, :].T         # first p columns of V
null_space = VT[rank:, :].T        # remaining columns of V


print("U:\n", U)
print("Column space:\n", col_space)
print("Left null space:\n", left_null_space)
print("V.T:\n", VT)
print("Row space:\n", row_space.T)
print("Right null space:\n", null_space.T)

Rank of matrix:
 2
S: 
 [2.69e+01 1.86e+00 8.37e-16 5.83e-16 4.43e-17]
U:
 [[-0.27 -0.73  0.55  0.31  0.04]
 [-0.35 -0.42 -0.29 -0.78 -0.07]
 [-0.43 -0.11 -0.65  0.46  0.42]
 [-0.51  0.19 -0.06  0.22 -0.81]
 [-0.59  0.5   0.44 -0.19  0.41]]
Column space:
 [[-0.27 -0.73]
 [-0.35 -0.42]
 [-0.43 -0.11]
 [-0.51  0.19]
 [-0.59  0.5 ]]
Left null space:
 [[ 0.55  0.31  0.04]
 [-0.29 -0.78 -0.07]
 [-0.65  0.46  0.42]
 [-0.06  0.22 -0.81]
 [ 0.44 -0.19  0.41]]
V.T:
 [[-0.27 -0.35 -0.43 -0.51 -0.59]
 [ 0.73  0.42  0.11 -0.19 -0.5 ]
 [ 0.14 -0.29 -0.25  0.81 -0.41]
 [ 0.53 -0.77  0.2  -0.19  0.23]
 [ 0.32  0.14 -0.84 -0.05  0.42]]
Row space:
 [[-0.27 -0.35 -0.43 -0.51 -0.59]
 [ 0.73  0.42  0.11 -0.19 -0.5 ]]
Right null space:
 [[ 0.14 -0.29 -0.25  0.81 -0.41]
 [ 0.53 -0.77  0.2  -0.19  0.23]
 [ 0.32  0.14 -0.84 -0.05  0.42]]


## Eckart-Young Theorem

Suppose that we want to construct the best rank $r$ approximation of an $m\times n$ matrix $X$. By best, we mean a matrix $X_r$ of rank $r < p$ that, among all rank $r$ matrices, minimizes

$$
\Vert X - X_r\Vert
$$

where $\Vert\cdot\Vert$ denotes a norm of a matrix $X$ and where $X_r$ belongs to the space of all rank $r$ matrices of dimension $m\times n$. Three popular **matrix norms** of an $m\times n$ matrix $X$ can be expressed in terms of the singular values of $X$
- the **spectral** or $l^2$ norm $\Vert X\Vert_2 = \max_{\Vert y\Vert \neq 0}\frac{\Vert Xy\Vert}{\Vert y\Vert} = \sigma_1$
- the **Frobenius** norm $\Vert X\Vert_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}$
- the **nuclear** norm $\Vert X\Vert_N = \sigma_1 + \cdots + \sigma_p$ 
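The three formulas can be compared with NumPy's built-in matrix norms. The sketch below uses an arbitrary random matrix (an illustrative assumption) and the `ord` values `2`, `'fro'`, and `'nuc'` of `np.linalg.norm`.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 6))

# Singular values only, in descending order
s = np.linalg.svd(X, compute_uv=False)

spectral = s[0]                        # largest singular value
frobenius = np.sqrt(np.sum(s**2))      # square root of the sum of squares
nuclear = np.sum(s)                    # sum of singular values

print(np.isclose(spectral, np.linalg.norm(X, 2)))
print(np.isclose(frobenius, np.linalg.norm(X, 'fro')))
print(np.isclose(nuclear, np.linalg.norm(X, 'nuc')))
```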

The Eckart-Young theorem states that for each of these three norms, the same rank $r$ matrix is best and that it equals
$$
\begin{equation}
\hat{X}_r = \sigma_1u_1v^T_1 + \sigma_2u_2v^T_2 + \cdots + \sigma_ru_rv^T_r
\end{equation}\tag{5.8}
$$

This is a very powerful theorem: it says that to best approximate our $m\times n$ matrix $X$ of rank $p$ by a matrix of lower rank $r$, we need only keep the $r$ largest singular values in the SVD of $X$, together with their left and right singular vectors.

Moreover, if some of these $p$ singular values carry more information than others, and if we want to retain as much information as possible with as little data as possible, we can keep the $r$ leading singular values, ordered by magnitude, and discard the rest.
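A truncated SVD of this kind can be sketched as follows. The matrix, seed, target rank `r`, and the competitor matrix `B` are illustrative assumptions. The checks rely on two standard facts about the truncation: its spectral-norm error equals the first discarded singular value, and its Frobenius-norm error equals the square root of the sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 5))
U, s, VT = np.linalg.svd(X, full_matrices=False)   # VT is V transposed

r = 2
# Sum of the r leading rank-one terms sigma_i * u_i * v_i^T
X_r = (U[:, :r] * s[:r]) @ VT[:r, :]

err_spec = np.linalg.norm(X - X_r, 2)
err_frob = np.linalg.norm(X - X_r, 'fro')

print(np.linalg.matrix_rank(X_r))                          # rank of the approximation
print(np.isclose(err_spec, s[r]))                          # first discarded singular value
print(np.isclose(err_frob, np.sqrt(np.sum(s[r:]**2))))     # discarded singular values

# An arbitrary rank-r competitor cannot do better in the Frobenius norm
B = rng.normal(size=(6, r)) @ rng.normal(size=(r, 5))
print(np.linalg.norm(X - B, 'fro') >= err_frob)
```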

We’ll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decomposition (DMD). You can read about the Eckart-Young theorem and some of its uses [here](https://en.wikipedia.org/wiki/Low-rank_approximation).


## Full and Reduced SVD’s

Up to now we have described properties of a **full** SVD in which shapes of $U$, $\Sigma$, and $V$ are $(m,m)$, $(m,n)$, $(n,n)$, respectively.

There is an alternative bookkeeping convention called an **economy** or **reduced** SVD in which the shapes of $U$, $\Sigma$, and $V$  are different from what they are in a full SVD.

Note that because we assume that $X$ has rank $p$, there are only $p$ nonzero singular values, where $p = \textrm{rank}(X)\leq \min(m,n)$.

A **reduced** SVD uses this fact to express $U$, $\Sigma$, and $V$ as matrices with shapes $(m,p)$, $(p,p)$, $(n,p)$.

You can read about reduced and full SVD [here](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html).

For a full SVD,

$$
UU^T = I, U^TU = I
$$
$$
VV^T = I, V^TV = I
$$

But not all these properties hold for a **reduced** SVD. Which properties hold depends on whether we are in a **tall-skinny** case or a **short-fat** case.

- In a tall-skinny case in which $m >> n$, for a **reduced** SVD
$$
UU^T \neq I, U^TU = I
$$
$$
VV^T = I, V^TV = I
$$
 
- In a short-fat case in which $m << n$, for a **reduced** SVD
$$
UU^T = I, U^TU = I
$$
$$
VV^T \neq I, V^TV = I
$$

When we study Dynamic Mode Decomposition below, we shall want to remember these properties when we use a reduced SVD to compute some DMD representations. Let’s do an exercise to compare **full** and **reduced** SVD’s.

To review,

- in a **full** SVD
    - $U$ is $m\times m$ 
    - $\Sigma$ is $m\times n$ 
    - $V$ is $n\times n$ 

- in a **reduced** SVD
    - $U$ is $m\times p$ 
    - $\Sigma$ is $p\times p$ 
    - $V$ is $n\times p$ 

First, let’s study a case in which $m=5>n=2$. (This is a small example of the **tall-skinny** case that will concern us when we study **Dynamic Mode Decompositions** below.)

In [3]:
import numpy as np
X = np.random.rand(5,2)
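# Note: np.linalg.svd returns the transpose of V as its third output,
# so the arrays named V and Vhat below hold V^T
U, S, V = np.linalg.svd(X,full_matrices=True)  # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD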
print('U, S, V =')
U, S, V

U, S, V =


(array([[-0.5 ,  0.58, -0.6 , -0.2 ,  0.15],
        [-0.03, -0.21, -0.12, -0.69, -0.68],
        [-0.66,  0.13,  0.72, -0.15,  0.02],
        [-0.54, -0.49, -0.29,  0.54, -0.32],
        [-0.17, -0.61, -0.14, -0.41,  0.64]]),
 array([1.91, 0.2 ]),
 array([[-0.78, -0.63],
        [-0.63,  0.78]]))

In [4]:
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat

Uhat, Shat, Vhat = 


(array([[-0.5 ,  0.58],
        [-0.03, -0.21],
        [-0.66,  0.13],
        [-0.54, -0.49],
        [-0.17, -0.61]]),
 array([1.91, 0.2 ]),
 array([[-0.78, -0.63],
        [-0.63,  0.78]]))

In [5]:
rr = np.linalg.matrix_rank(X)
print(f'rank of X = {rr}')

rank of X = 2


Properties:

- Where $U$ is constructed via a full SVD, $U^TU = I_{m\times m}$ and $UU^T = I_{m\times m}$
- Where $\hat{U}$ is constructed via a reduced SVD, although $\hat{U}^T\hat{U} = I_{p\times p}$, it happens that $\hat{U}\hat{U}^T \neq I_{m\times m}$

We illustrate these properties for our example with the following code cells.

In [6]:
UTU = U.T@U
UUT = U@U.T
print('UUT, UTU = ')
UUT, UTU

UUT, UTU = 


(array([[ 1.00e+00, -1.74e-16,  5.20e-18, -6.14e-17, -6.59e-17],
        [-1.74e-16,  1.00e+00, -4.26e-17,  1.99e-16,  9.80e-17],
        [ 5.20e-18, -4.26e-17,  1.00e+00, -1.49e-17, -2.86e-17],
        [-6.14e-17,  1.99e-16, -1.49e-17,  1.00e+00,  2.79e-17],
        [-6.59e-17,  9.80e-17, -2.86e-17,  2.79e-17,  1.00e+00]]),
 array([[ 1.00e+00, -1.54e-17, -4.01e-18,  9.81e-18,  3.07e-17],
        [-1.54e-17,  1.00e+00, -1.64e-17,  1.63e-16,  2.37e-16],
        [-4.01e-18, -1.64e-17,  1.00e+00,  5.41e-17,  3.88e-17],
        [ 9.81e-18,  1.63e-16,  5.41e-17,  1.00e+00,  4.53e-17],
        [ 3.07e-17,  2.37e-16,  3.88e-17,  4.53e-17,  1.00e+00]]))

In [7]:
UhatUhatT = Uhat@Uhat.T
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= ')
UhatUhatT, UhatTUhat

UhatUhatT, UhatTUhat= 


(array([[ 0.58, -0.11,  0.4 , -0.01, -0.27],
        [-0.11,  0.05, -0.01,  0.12,  0.13],
        [ 0.4 , -0.01,  0.45,  0.29,  0.03],
        [-0.01,  0.12,  0.29,  0.53,  0.39],
        [-0.27,  0.13,  0.03,  0.39,  0.4 ]]),
 array([[ 1.00e+00, -1.54e-17],
        [-1.54e-17,  1.00e+00]]))

**Remarks:**

The cells above illustrate the application of the `full_matrices=True` and `full_matrices=False` options. Using `full_matrices=False` returns a reduced singular value decomposition.

The **full** and **reduced** SVD’s both accurately decompose an $m\times n$ matrix $X$.

When we study Dynamic Mode Decompositions below, it will be important for us to remember the preceding properties of full and reduced SVD’s in such tall-skinny cases.

Now let’s turn to a short-fat case.

To illustrate this case, we’ll set $m=2<5=n$ and compute both full and reduced SVD’s.

In [8]:
import numpy as np
X = np.random.rand(2,5)
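# Note: np.linalg.svd returns the transpose of V as its third output,
# so the arrays named V and Vhat below hold V^T
U, S, V = np.linalg.svd(X,full_matrices=True)  # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD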
print('U, S, V = ')
U, S, V

U, S, V = 


(array([[-0.8, -0.6],
        [-0.6,  0.8]]),
 array([1.35, 0.58]),
 array([[-0.3 , -0.07, -0.6 , -0.49, -0.55],
        [-0.34,  0.22,  0.01, -0.61,  0.69],
        [-0.52, -0.29,  0.72, -0.17, -0.32],
        [-0.71,  0.32, -0.22,  0.58,  0.06],
        [-0.15, -0.87, -0.27,  0.15,  0.35]]))

In [9]:
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat

Uhat, Shat, Vhat = 


(array([[-0.8, -0.6],
        [-0.6,  0.8]]),
 array([1.35, 0.58]),
 array([[-0.3 , -0.07, -0.6 , -0.49, -0.55],
        [-0.34,  0.22,  0.01, -0.61,  0.69]]))

Let’s verify that our reduced SVD accurately represents $X$.

In [10]:
SShat=np.diag(Shat)
np.allclose(X, Uhat@SShat@Vhat)

True
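The orthogonality properties of a reduced SVD in a short-fat case can be checked in the same way as in the tall-skinny case. One point needs care: `np.linalg.svd` returns the transpose of the $n\times p$ matrix $V$ used in the text, so which product equals an identity matrix depends on which of the two arrays is inspected. The sketch below uses its own random matrix and seed (assumptions for illustration), and the name `VhatT` is introduced here for NumPy's third output.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.random((2, 5))                       # short-fat: m = 2 < n = 5
Uhat, Shat, VhatT = np.linalg.svd(X, full_matrices=False)
Vhat = VhatT.T                               # the n x p matrix V in the text's notation

print(Uhat.shape, Shat.shape, Vhat.shape)

# U is square here, so both products are the identity
print(np.allclose(Uhat @ Uhat.T, np.eye(2)), np.allclose(Uhat.T @ Uhat, np.eye(2)))

# The p columns of Vhat are orthonormal, so Vhat^T Vhat = I_p ...
print(np.allclose(Vhat.T @ Vhat, np.eye(2)))

# ... but Vhat Vhat^T is an n x n projection of rank p, not the identity
print(np.allclose(Vhat @ Vhat.T, np.eye(5)))
```

The last check prints `False` because an $n\times n$ matrix of rank $p < n$ cannot equal $I_{n\times n}$.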

## Polar Decomposition

A **reduced** singular value decomposition (SVD) of $X$ is related to a **polar decomposition** of $X$
$$
X = SQ
$$

where

$$
S = U\Sigma U^T
$$
$$
Q = UV^T
$$

Here
- $S$ is an $m\times m$ **symmetric** matrix
- $Q$ is an $m\times n$ **orthogonal** matrix

and in our reduced SVD
- $U$ is an $m\times p$ orthonormal matrix
- $\Sigma$ is a $p\times p$ diagonal matrix
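- $V$ is an $n\times p$ orthonormal matrix

Because $U^TU = I$ in a reduced SVD, it follows that $SQ = U\Sigma U^TUV^T = U\Sigma V^T = X$.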
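A polar decomposition can be assembled directly from a reduced SVD. The sketch below assumes a random full-rank short-fat matrix (sizes and seed are arbitrary), for which $p = m$ and $Q$ has orthonormal rows.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 5
X = rng.normal(size=(m, n))

# Reduced SVD: U is m x p, VT is p x n, with p = min(m, n) = m here
U, s, VT = np.linalg.svd(X, full_matrices=False)

S = (U * s) @ U.T      # U Sigma U^T, a symmetric m x m matrix
Q = U @ VT             # U V^T, an m x n matrix

print(np.allclose(S, S.T))                 # S is symmetric
print(np.allclose(X, S @ Q))               # X = S Q
print(np.allclose(Q @ Q.T, np.eye(m)))     # rows of Q are orthonormal in this case
```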

## Application: Principal Components Analysis (PCA)

