# Conditioning
Because computers work in finite precision, and for other practical reasons, the input data $x$ carries some error, which in turn leads to an error in the solution $y$.

So for a problem $y = f(x):X \longmapsto Y$, we define its **condition number** as:

$$\kappa = \max_{\Delta x} \left( \frac{\left\| \Delta f \right\|} {\left\| f(x) \right\|} \Big/ \frac{\left\| \Delta x \right\|} {\left\| x \right\|} \right)$$

Here $x$ and $f(x)$ are the exact input and solution, and $\Delta x$ and $\Delta f$ are absolute errors, and $\kappa$ represents the *sensitivity* of the solution to perturbations in the input.

If $\kappa$ is small, say $10^2$ or less, then the solution of the problem is not sensitive to perturbations in the input data, and we say the problem is **well-conditioned**. Otherwise, for instance when $\kappa$ reaches $10^6$ or more, it is **ill-conditioned**.

## Condition number of functions
Now we assume that $y = f(x):\mathbb{R} \longmapsto \mathbb{R}$ and $f(x)$ is differentiable. Then we have

$$\kappa = \max_{\Delta x} \left( \frac{\left\| \Delta f \right\|} {\left\| f(x) \right\|} \Big/ \frac{\left\| \Delta x \right\|} {\left\| x \right\|} \right) = \frac{ \lim_{\Delta x \to 0}\big( \left| \Delta f \right| \big/ \left| \Delta x \right| \big)} {\left| f (x) \right| \big/ \left| \,x\, \right|} = \Bigg|\, \frac{f'(x)} {f(x)/x} \, \Bigg| $$

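Why can the $\max$ be replaced directly by the $\lim$? Strictly speaking, the condition number is defined for infinitesimally small perturbations, $\kappa = \lim_{\delta \to 0} \max_{\left| \Delta x \right| \leq \delta} (\cdots)$, and for a differentiable $f$ the ratio $\left| \Delta f \right| \big/ \left| \Delta x \right|$ tends to $\left| f'(x) \right|$ as $\Delta x \to 0$.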

>**e.g.**
>
>$f(x) = \frac{1} {x}$ on $\mathbb{R} \setminus \{0\}$. Here the condition number is
>
>$$\kappa = \Bigg|\, \frac{f'(x)} {f(x)/x} \, \Bigg| = \Bigg|\, \frac{-x^{-2}} {x^{-1}/x} \, \Bigg| = 1$$
>
>So it is well-conditioned, even near zero. Compare $f(0.001)$ and $f(0.001001)$: the relative error of the input is $10^{-3}$, the relative error of the output is $0.999/1000 \approx 10^{-3}$, and their quotient is approximately the condition number, $1$.

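The formula $\kappa = \left| x f'(x) / f(x) \right|$ can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper `relative_condition` and its `rel_step` parameter are names introduced here for illustration, and the derivative is approximated by a central finite difference.

```python
def relative_condition(f, x, rel_step=1e-6):
    """Estimate kappa = |x f'(x) / f(x)| with a central finite difference."""
    h = abs(x) * rel_step
    derivative = (f(x + h) - f(x - h)) / (2.0 * h)
    return abs(x * derivative / f(x))


def reciprocal(x):
    return 1.0 / x


# The estimate should be close to 1 everywhere, including near zero.
kappa_near_zero = relative_condition(reciprocal, 0.001)
kappa_far = relative_condition(reciprocal, 1000.0)

# Direct check with the two inputs used in the example above.
x, x_perturbed = 0.001, 0.001001
rel_in = abs(x_perturbed - x) / abs(x)
rel_out = abs(reciprocal(x_perturbed) - reciprocal(x)) / abs(reciprocal(x))
ratio = rel_out / rel_in

print(kappa_near_zero, kappa_far, ratio)
```

The ratio of relative errors stays close to $1$ no matter how close $x$ is to zero, which is what the constant condition number predicts.
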
***
>**e.g.**
>
>Another example is the *subtraction* operation. Let the inputs be $x_1, x_2 \in \mathbb{R}$, and the output be $y = f(x_1, x_2) = x_1 - x_2$.
>
>Consider perturbations $\Delta x_1$ and $\Delta x_2$ in the inputs, and define the relative perturbation as $\max \left\{ \left| \frac{\Delta x_1} {x_1} \right|, \left| \frac{\Delta x_2} {x_2} \right| \right\}$, assuming that $x_1, x_2 \neq 0$. The relative output error is then:
>
>$$\left| \frac{\big((x_1 + \Delta x_1) - (x_2 + \Delta x_2) \big) - (x_1 - x_2)} {x_1 - x_2} \right| = \left| \frac{\Delta x_1 - \Delta x_2} {x_1 - x_2} \right|$$
>
>This can be huge if $x_1 \approx x_2$, which implies that the subtraction of two values is an ill-conditioned problem when the two values are very close.

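The amplification in the subtraction example can be made concrete. The sketch below is an illustration added here, with made-up values $x_1 = 1.000001$, $x_2 = 1$ and a relative perturbation of $10^{-10}$ in $x_1$ only; it uses exact rational arithmetic (`fractions.Fraction`) so that rounding does not blur the result. The function name `subtraction_errors` is hypothetical.

```python
from fractions import Fraction


def subtraction_errors(x1, x2, dx1, dx2):
    """Return (relative input error, relative output error) for y = x1 - x2."""
    rel_in = max(abs(dx1 / x1), abs(dx2 / x2))
    rel_out = abs((dx1 - dx2) / (x1 - x2))
    return rel_in, rel_out


x1 = Fraction(1_000_001, 1_000_000)   # 1.000001
x2 = Fraction(1)
dx1 = x1 * Fraction(1, 10**10)        # relative perturbation 1e-10 in x1
dx2 = Fraction(0)                     # x2 is left unperturbed

rel_in, rel_out = subtraction_errors(x1, x2, dx1, dx2)
amplification = rel_out / rel_in

print(float(rel_in), float(rel_out), float(amplification))
```

With these values the relative error grows by a factor of about $10^6$, which is roughly $\left| x_1 \right| / \left| x_1 - x_2 \right|$.
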
## Condition number of finding the roots of polynomials
Finding the roots of a polynomial can be an ill-conditioned problem, in particular near multiple roots.

>**e.g.**
>
>The equation $x^2 - 2x + 1 = 0$ has the double root $x = 1$. A perturbed version is $x^2 - 2x + 0.9999999999 = 0$, whose roots are $0.99999$ and $1.00001$. That is, a relative error of order $10^{-10}$ in a coefficient leads to a relative error of order $10^{-5}$ in the roots.

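The sensitivity of the double root can be reproduced in a few lines. This is a sketch under the assumption that the constant term is perturbed by $\delta = 10^{-10}$; the helper `monic_quadratic_roots` is a name introduced here and uses the textbook quadratic formula for $x^2 + bx + c = 0$.

```python
import math


def monic_quadratic_roots(b, c):
    """Real roots of x^2 + b x + c = 0, assuming a non-negative discriminant."""
    disc = b * b - 4.0 * c
    half_width = math.sqrt(disc) / 2.0
    center = -b / 2.0
    return center - half_width, center + half_width


delta = 1e-10                                   # perturbation of the constant term
lower, upper = monic_quadratic_roots(-2.0, 1.0 - delta)

coefficient_error = delta / 1.0                 # relative to the coefficient 1
root_error = abs(upper - 1.0) / 1.0             # relative to the exact root 1
amplification = root_error / coefficient_error

print(lower, upper, amplification)
```

The roots move by about $\sqrt{\delta}$, so the smaller the perturbation, the larger the amplification factor $\sqrt{\delta}/\delta$.
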
## Condition number of matrices
Given an $m \times m$ nonsingular matrix $A$ and input data $\vec{x}$, consider the problem $\vec{y} = A\vec{x}: \mathbb{R}^m \longmapsto \mathbb{R}^m$.

Then we have


$$\begin{align*}
\kappa &= \max_{\Delta \vec{x}} \left( \frac{\left\| A(\vec{x} + \Delta \vec{x}) - A\vec{x} \right\|} {\left\| A\vec{x} \right\|} \middle/ \frac{\left\| \Delta \vec{x} \right\|} {\left\| \vec{x} \right\|} \right) \\
&= \max_{\Delta \vec{x}} \left( \frac{\left\| A \Delta \vec{x} \right\|} {\left\| A\vec{x} \right\|} \middle/ \frac{\left\| \Delta \vec{x} \right\|} {\left\| \vec{x} \right\|} \right) \\
&= \max_{\Delta \vec{x}} \left( \frac{\left\| A \Delta \vec{x} \right\|} {\left\| \Delta \vec{x} \right\|} \right) \cdot \frac{\left\| \vec{x} \right\|} {\left\| A\vec{x} \right\|} \\
&= \left\| A \right\| \cdot \frac{\left\| A^{-1}A\vec{x} \right\|} {\left\| A\vec{x} \right\|} \leq \| A \| \, \| A^{-1} \|
\end{align*}$$

So we define the condition number of the matrix $A$ as: $\DeclareMathOperator{\cond}{Cond}
\cond(A)= \| A \|_2 \, \| A^{-1} \|_2$

Later we will see that $\left\| A\right\|_2$ is the largest singular value of the matrix $A$, $\sigma_1$, and $\| A^{-1} \|_2$ is the reciprocal of the smallest singular value, $\frac{1} {\sigma_m}$.

*The singular values of $A$ are the square roots of the eigenvalues of $A^{\mathrm{H}}A$, where $A^{\mathrm{H}}$ is the conjugate transpose* (from [MathWorld](http://mathworld.wolfram.com/SingularValue.html))

But for the case that $A$ is a **real symmetric matrix**, the singular values of $A$ are just the absolute values of its eigenvalues. 

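For a $2 \times 2$ matrix the condition number $\sigma_1 / \sigma_2$ can be computed by hand. The sketch below is an illustration added here, not the method of the notes: it relies on the identities $\sigma_1^2 + \sigma_2^2 = a^2 + b^2 + c^2 + d^2$ and $\sigma_1 \sigma_2 = \left| \det A \right|$, which hold for any real $2 \times 2$ matrix, and the function names are hypothetical.

```python
import math


def singular_values_2x2(a, b, c, d):
    """Singular values (largest first) of the real matrix [[a, b], [c, d]]."""
    frob_sq = a * a + b * b + c * c + d * d      # sigma_1^2 + sigma_2^2
    det = a * d - b * c                          # |det| = sigma_1 * sigma_2
    gap = math.sqrt(max(frob_sq * frob_sq - 4.0 * det * det, 0.0))
    sigma_1 = math.sqrt((frob_sq + gap) / 2.0)
    # Getting sigma_2 from the determinant avoids cancellation when it is tiny.
    sigma_2 = abs(det) / sigma_1 if sigma_1 > 0.0 else 0.0
    return sigma_1, sigma_2


def cond_2x2(a, b, c, d):
    """2-norm condition number sigma_1 / sigma_2 (infinite if singular)."""
    sigma_1, sigma_2 = singular_values_2x2(a, b, c, d)
    return math.inf if sigma_2 == 0.0 else sigma_1 / sigma_2


cond_A = cond_2x2(1.0, 2.0, 1.0, 1.0)          # a well-conditioned matrix
cond_singular = cond_2x2(1.0, 2.0, 2.0, 4.0)   # rows are proportional
print(cond_A, cond_singular)
```
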
## Application
If the matrix is singular, then $\sigma_m = 0$, so $\cond(A) = \infty$. The `cond` command in MATLAB reports the same.

When solving the linear system $A\vec{\mathstrut x} = \vec{\mathstrut b}$, the condition number of the problem is likewise bounded by $\| A \|_2 \, \| A^{-1} \|_2$.

The following example demonstrates how the accuracy of multiplying a matrix with a vector depends on the condition number of the matrix.

>**e.g.**
>
>Consider $A = 
\left[ 
\begin{array} {cc}
1 & 2 \\ 
1 & 1
\end{array}
\right], B= 
\left[ 
\begin{array} {cc}
1 & 1 + 10^{-14} \\ 
1 &  1
\end{array}
\right]$
>
>Using MATLAB, we can see that $\cond(A) \approx 6.8541$ and $\cond(B) \approx 4 \times 10^{14}$. Now let us compute $A\vec{x}$ and $B\vec{x}$ for $\vec{x} = \left[ 
\begin{array} {c}
1 \\ 
-1
\end{array}
\right]$ and compare the accuracy of the two results.
>
>Of course the exact answer is
>
>$$ A\vec{x} = \left[ \begin{array} {c}
-1 \\ 
0
\end{array} \right], B\vec{x} = \left[ \begin{array} {c}
-10^{-14} \\ 
0
\end{array} \right]$$
>
>However in MATLAB, we can see that 
>
>$$ \widetilde{ A\vec{x}} = \left[ \begin{array} {c}
-1 \\ 
0
\end{array} \right], \widetilde{ B\vec{x}} = \left[ \begin{array} {c}
-0.9992 \times 10^{-14} \\ 
0
\end{array} \right]$$
>
>Since $A$ has a small condition number, MATLAB gives an accurate result. For $B$, however, even though the relative error of the input is of order $10^{-16}$, the relative error of the output is of order $10^{-3}$.

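The effect in the example above does not depend on MATLAB; any IEEE double-precision arithmetic shows it. The sketch below is an illustration added here: it computes the first component of $B\vec{x}$ in floating point and compares it with the exact value obtained with `fractions.Fraction`. The variable names are chosen for this sketch.

```python
from fractions import Fraction

x = [1.0, -1.0]

# The entry 1 + 1e-14 is rounded as soon as it is stored as a double.
b12_float = 1.0 + 1e-14
computed = 1.0 * x[0] + b12_float * x[1]        # first component of B x

# Exact reference values in rational arithmetic.
b12_exact = 1 + Fraction(1, 10**14)
exact = Fraction(1) * 1 + b12_exact * (-1)      # exactly -1e-14

rel_in = abs(Fraction(b12_float) - b12_exact) / b12_exact
rel_out = abs(Fraction(computed) - exact) / abs(exact)

print(computed, float(rel_in), float(rel_out))
```

The rounding of a single matrix entry, a relative input error below $\varepsilon_{\text{machine}}$, shows up as a relative output error of order $10^{-3}$.
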
# Stability
To solve a problem accurately, first of all the **problem** has to be well-conditioned. Otherwise any input error will be amplified enormously in the solution. But that's not enough. The accuracy also depends on the **algorithm** used to solve the problem. Here we discuss the stability of algorithms used to solve problems.

We learn this through an example. Consider solving

$$A\vec{x} = \left[ \begin{array} {cc}
10^{-20} & 1 \\ 
1 & 1
\end{array} \right] \left[ \begin{array} {c}
x_1 \\ 
x_2
\end{array} \right] = \left[ \begin{array} {c}
1 \\ 
0
\end{array} \right]$$

Here the condition number of the matrix $A$ is about $2.62$, so the problem is well-conditioned.

## Gaussian elimination and Pivoting step
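Applying Gaussian elimination without row exchanges, we subtract $10^{20}$ times equation $\textrm{(1)}$ from the second equation:

$$
\left\{ 
\begin{array}{ccrcll}
10^{-20} \, x_1 &+&  x_2 & =& 1 & (1)\\[1ex]
(1- 10^{20} \times 10^{-20}) \, x_1 &+&  (1- 10^{20} \times 1) \, x_2 & =& 0- 10^{20} \times 1 & (2)
\end{array}
\right. 
$$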

This system is upper triangular, and from $\textrm{(2)}$ the computer obtains $ -10^{20} \, x_2 = -10^{20}$, because in floating-point arithmetic $ 1- 10^{20} = - 10^{20}$, even in double precision. Therefore $x_2 = 1$ and, substituting into $\textrm{(1)}$, $x_1 = 0$. This is far from the true solution, $x_1 \approx -1$, $x_2 \approx 1$.

So we use a **pivoting** step: before each step of Gaussian elimination, the rows are exchanged so that the entry at the pivot (diagonal) position is the largest in absolute value among the entries at and below the pivot position in that column. So now what we are going to solve is:

$$B\vec{x} = \left[ \begin{array} {cc}
1 & 1 \\ 
10^{-20} & 1
\end{array} \right] \left[ \begin{array} {c}
x_1 \\ 
x_2
\end{array} \right] = \left[ \begin{array} {c}
0 \\ 
1
\end{array} \right]$$

Gaussian elimination now gives:

$$
\left\{ 
\begin{array}{ccrcll}
x_1 &+&  x_2 & =& 0 & (3)\\[1ex]
(10^{-20}- 10^{-20} \times 1) \, x_1 &+&  (1- 10^{-20} \times 1) \, x_2 & =& 1- 10^{-20} \times 0 & (4)
\end{array}
\right. 
$$

In floating-point arithmetic $1 - 10^{-20} = 1$, so $\textrm{(4)}$ gives $x_2 = 1$. Plugging $x_2 = 1$ into $\textrm{(3)}$, we get $x_1 = -1$, which is in fact the exact solution of a problem close to the original system.

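Both computations can be replayed in double precision. The following is a minimal sketch of Gaussian elimination with an optional partial-pivoting step; the function `gauss_solve` and its `pivot` flag are names introduced here for illustration, and no attempt is made to handle singular systems.

```python
def gauss_solve(A, b, pivot):
    """Solve A x = b by Gaussian elimination; `pivot` toggles partial pivoting."""
    n = len(b)
    A = [row[:] for row in A]          # work on copies, keep the inputs intact
    b = b[:]
    for k in range(n - 1):
        if pivot:
            # Bring the entry of largest magnitude in column k to the diagonal.
            p = max(range(k, n), key=lambda i: abs(A[i][k]))
            A[k], A[p] = A[p], A[k]
            b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]      # elimination multiplier
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x


A = [[1e-20, 1.0], [1.0, 1.0]]
b = [1.0, 0.0]
x_no_pivot = gauss_solve(A, b, pivot=False)
x_pivot = gauss_solve(A, b, pivot=True)
print(x_no_pivot, x_pivot)
```

Without pivoting the multiplier is $10^{20}$ and the information in the second row is wiped out; with pivoting the multiplier is $10^{-20}$ and nothing is lost.
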
## General condition
When we solve $x \longrightarrow f(x): X \longmapsto Y$ on a computer, the input $x$ is first rounded to, say, $\tilde{x}$, and then an output is computed. We denote this whole process as

$$\tilde{f}(x) : X \longmapsto Y$$

Here, $\tilde{f}(x)$ represents the computed result of the problem in the computer by the given algorithm.

$Def$

We say an algorithm $\tilde{f}(x)$ implemented on a computer to solve the mathematical problem $f(x)$ is **stable** (more precisely, **backward stable**) if for any input $x \in X$ the computed result $\tilde{f}(x)$ is the true solution for a nearby input, $i.e.$, there exists a nearby input $\tilde{x} = x(1 + O(\varepsilon_{\text{machine}} ))$ such that

$$\tilde{f}(x) = f(\tilde{x})$$

A not very surprising result is that the operations $+$, $-$, $\cdot$, $\div$ are all stable. As an example, we check the stability of $x_1 - x_2$.

Given real numbers $x_1, x_2$, we have $fl(x_1) = x_1(1+\varepsilon_1)$ and $fl(x_2) = x_2(1+\varepsilon_2)$, where $\varepsilon_1, \varepsilon_2 = O(\varepsilon_{\text{machine}})$. So on a computer:

$$\begin{align}
fl(x_1) \ominus fl(x_2) &= \big(fl(x_1) - fl(x_2)\big)\cdot(1+ \varepsilon_3) \\
&= x_1\,(1+\varepsilon_1)(1+\varepsilon_3) - x_2\,(1+\varepsilon_2)(1+\varepsilon_3) \\
&= x_1\,(1+\varepsilon_4) - x_2\,(1+\varepsilon_5)
\end{align}$$

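This means that the computed value is the exact difference of two nearby values, $x_1(1+\varepsilon_4)$ and $x_2(1+\varepsilon_5)$ with $\varepsilon_4, \varepsilon_5 = O(\varepsilon_{\text{machine}})$, and therefore this operation is backward stable.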

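The derivation for subtraction can be verified exactly for a concrete pair of inputs. The sketch below is an illustration added here with made-up values of $x_1$ and $x_2$: it recovers $\varepsilon_1, \dots, \varepsilon_5$ in exact rational arithmetic and checks that the computed difference equals $x_1(1+\varepsilon_4) - x_2(1+\varepsilon_5)$. It assumes the two rounded inputs differ, so that $\varepsilon_3$ is well defined.

```python
from fractions import Fraction

# Two real inputs that are not exactly representable as doubles.
x1 = Fraction(1, 3)
x2 = Fraction(1, 3) + Fraction(1, 10**9)

fl_x1, fl_x2 = float(x1), float(x2)     # rounding of the inputs
computed = fl_x1 - fl_x2                # floating-point subtraction

# Recover the rounding errors exactly.
eps1 = Fraction(fl_x1) / x1 - 1
eps2 = Fraction(fl_x2) / x2 - 1
eps3 = Fraction(computed) / (Fraction(fl_x1) - Fraction(fl_x2)) - 1

# Combined perturbations, as in the last line of the derivation.
eps4 = (1 + eps1) * (1 + eps3) - 1
eps5 = (1 + eps2) * (1 + eps3) - 1

# The computed result is the exact difference of the perturbed inputs.
residual = x1 * (1 + eps4) - x2 * (1 + eps5) - Fraction(computed)
unit_roundoff = Fraction(1, 2**53)

print(float(eps4), float(eps5), residual)
```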
An example of an operation that is not backward stable is the outer product. Given $\vec{x} \in \mathbb{R}^n$ and $\vec{y} \in \mathbb{R}^n$, compute $A = \vec{x} \vec{y}^{\mathrm{T}}$. The obvious algorithm is to compute the entries $x_i y_j$ one by one; denote the computed matrix by $\tilde{A}$.

Due to rounding errors, $\tilde{A}$ is very unlikely to have rank exactly $1$, whereas the exact outer product $(\vec{x} + \Delta \vec{x})(\vec{y} + \Delta \vec{y})^{\mathrm{T}}$ of any perturbed inputs always has rank $1$. So $\tilde{A}$ cannot in general be the exact result for nearby inputs.

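A $2 \times 2$ instance makes the rank argument tangible. In the sketch below, an illustration added here, the vectors are chosen by hand so that one product, $3 \cdot fl(1/3)$, is rounded to $1$; the determinant of the computed matrix is then evaluated exactly with `fractions.Fraction`. A rank-one $2 \times 2$ matrix must have determinant exactly zero.

```python
from fractions import Fraction

# Hand-picked inputs: 3 * fl(1/3) rounds to exactly 1.0 in double precision.
x = [1.0, 3.0]
y = [1.0 / 3.0, 1.0]

# Entry-by-entry floating-point outer product.
A_tilde = [[xi * yj for yj in y] for xi in x]

# Exact determinant of the computed matrix (no further rounding).
det_exact = (Fraction(A_tilde[0][0]) * Fraction(A_tilde[1][1])
             - Fraction(A_tilde[0][1]) * Fraction(A_tilde[1][0]))
is_rank_one = det_exact == 0

print(A_tilde, float(det_exact), is_rank_one)
```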
# Accuracy

$Def$

Given a mathematical problem $y = f(x):X \longmapsto Y$ and an algorithm $\tilde{f}(x) : X \longmapsto Y$ that solves it on a computer, we say that the algorithm is **accurate** if for any input $x \in X$, we have

$$\frac{\left\| \tilde{f}(x) - f(x) \right\|} {\left\| f(x) \right\|} = O(\varepsilon_{\text{machine}} )$$

The following theorem shows that when a stable algorithm $\tilde{f}(x)$ is used to solve a problem $f(x)$, the accuracy of the algorithm depends only on the condition number of the problem $f(x)$.

$Theorem$

Let a stable algorithm $\tilde{f}(x)$ be used to solve a problem $f(x)$, whose condition number is $\kappa(x)$. Then, for any input $x$, the relative error of the computed solution satisfies:

$$\frac{\left\| \tilde{f}(x) - f(x) \right\|} {\left\| f(x) \right\|} = O(\kappa(x) \cdot \varepsilon_{\text{machine}} )$$

$Proof$

Since the algorithm is stable, we can know that $\tilde{f}(x) = f(\tilde{x})$, for some $\tilde{x}$ with $\displaystyle \frac{\left\| \tilde{x} - x \right\|} {\left\| x \right\|} = O(\varepsilon_{\text{machine}} )$

And we also have:

$$\kappa = \max_{\Delta x} \left( \frac{\left\| \Delta f \right\|} {\left\| f(x) \right\|} \Big/ \frac{\left\| \Delta x \right\|} {\left\| x \right\|} \right) \geq \left( \frac{\left\| f(\tilde{x}) - f(x) \right\|} {\left\| f(x) \right\|} \Big/ \frac{\left\| \tilde{x} - x \right\|} {\left\| x \right\|} \right)$$

So that we can finally get that:

$$\frac{\left\| \tilde{f}(x) - f(x) \right\|} {\left\| f(x) \right\|} = \frac{\left\| f(\tilde{x}) - f(x) \right\|} {\left\| f(x) \right\|} \leq \kappa(x) \cdot \frac{\left\| \tilde{x} - x \right\|} {\left\| x \right\|} = O(\kappa(x) \cdot \varepsilon_{\text{machine}} )$$

$\square$

So accuracy depends on two things:
1. the condition number of the problem
2. the stability of the algorithm

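The bound $O(\kappa(x) \cdot \varepsilon_{\text{machine}})$ can be observed on a tiny problem. The sketch below is an illustration added here, with a made-up problem $f(x) = x - 1$ at $x = 1 + 10^{-8}$, whose condition number is $\kappa = \left| x / (x - 1) \right| \approx 10^{8}$. The algorithm (round the input, then subtract) is backward stable, and exact values are obtained with `fractions.Fraction`.

```python
import sys
from fractions import Fraction

x = 1 + Fraction(1, 10**8)        # exact input, not representable as a double
exact = x - 1                     # exact solution f(x) = x - 1

computed = float(x) - 1.0         # stable algorithm run in double precision

kappa = abs(x / exact)            # |x f'(x) / f(x)| with f'(x) = 1
rel_err = abs(Fraction(computed) - exact) / abs(exact)
bound = kappa * Fraction(sys.float_info.epsilon)

print(float(kappa), float(rel_err), float(bound))
```

The relative error is many orders of magnitude larger than $\varepsilon_{\text{machine}}$, yet it stays below $\kappa \cdot \varepsilon_{\text{machine}}$, as the theorem predicts.
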

# Singular Value Decomposition of Matrices (SVD)
## Hyperellipse
The singular value decomposition of a matrix $A$ is motivated by the following geometric fact: *the image of the unit sphere in the space $\mathbb{R}^n$ under any $m \times n$ matrix is a hyperellipse in the space $\mathbb{R}^m$.*

Here a *hyperellipse* in $\mathbb{R}^m$ is the generalization of an ellipse: it has $m$ orthogonal directions $\vec{u}_1, \vec{u}_2, \dots,\vec{u}_m \in \mathbb{R}^m$ corresponding to its principal semiaxes, whose lengths are $\sigma_1, \sigma_2 , \dots, \sigma_m$. We assume that $\| \vec{u}_i \|_2 = 1$ for $i = 1,2,\dots,m$ and $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_m \geq 0$. Some $\sigma_i$ can equal $0$: for example, if only $r$ of the principal semiaxes have length greater than $0$, then the hyperellipse lies within the $r$-dimensional subspace of $\mathbb{R}^m$ spanned by the orthogonal vectors $\vec{u}_1, \vec{u}_2, \dots,\vec{u}_r$.

Then, given a matrix $A \in \mathbb{R}^{m \times n}$ ($m \geq n$), the image of the unit sphere in $\mathbb{R}^n$ under the map $\vec{y} = A\vec{x}$ is a hyperellipse lying in a subspace of $\mathbb{R}^m$ of dimension at most $n$. We denote the $n$ principal semiaxes of the hyperellipse as $\sigma_1 \vec{u}_1, \sigma_2 \vec{u}_2, \dots,\sigma_n \vec{u}_n$, which are the images of $n$ orthonormal vectors $\vec{v}_1, \vec{v}_2, \cdots, \vec{v}_n$ in the space $\mathbb{R}^n$, $i.e.$,

$$A\vec{v_i} = \sigma_i \vec{u_i}, \;\; i = 1, 2, 3, \dots, n$$

This is called the singular value decomposition of the matrix $A$, and we call $\sigma_1, \sigma_2 , \dots, \sigma_n$ the singular values of the matrix $A$.

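For a $2 \times 2$ matrix the vectors $\vec{v}_i$, $\vec{u}_i$ and the values $\sigma_i$ can be constructed directly from the relation $A\vec{v}_i = \sigma_i \vec{u}_i$. The sketch below is an illustration added here and not a general SVD algorithm: it assumes a nonsingular real $2 \times 2$ matrix, takes the $\vec{v}_i$ as eigenvectors of $A^{\mathrm{T}}A$ (found with a single rotation angle), and the function name `svd_2x2` is hypothetical.

```python
import math


def svd_2x2(A):
    """Return (sigmas, us, vs) with A v_i = sigma_i u_i for a nonsingular 2x2 A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A^T A = [[p, q], [q, r]].
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    theta = 0.5 * math.atan2(2.0 * q, p - r)    # rotation diagonalising A^T A
    v1 = (math.cos(theta), math.sin(theta))     # direction stretched the most
    v2 = (-math.sin(theta), math.cos(theta))    # orthogonal direction
    sigmas, us = [], []
    for v in (v1, v2):
        image = (a * v[0] + b * v[1], c * v[0] + d * v[1])   # A v
        sigma = math.hypot(*image)                           # semiaxis length
        sigmas.append(sigma)
        us.append((image[0] / sigma, image[1] / sigma))      # unit semiaxis
    return sigmas, us, [v1, v2]


sigmas, us, vs = svd_2x2([[1.0, 2.0], [1.0, 1.0]])
u_dot = us[0][0] * us[1][0] + us[0][1] * us[1][1]   # should vanish
print(sigmas, u_dot)
```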
## Conclusions
1. If the rank of $A$ is $r$ and $r \leq n$, then only the first $r$ singular values $\sigma_1, \sigma_2 , \dots, \sigma_r$ of $A$ are nonzero, and the others vanish: $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_{n} = 0$. The hyperellipse therefore lies in the subspace spanned by $\vec{u_1}, \vec{u_2}, \dots,\vec{u_r}$, which in fact form a **basis of the range space of matrix** $A$.
2. The **kernel space** of $A$ is the space spanned by the vectors $\vec{v}_{r+1}, \vec{v}_{r+2}, \dots,\vec{v}_{n}$.
3. $\vec{v}_1$ is amplified the **most**, and among the directions with nonzero singular values $\vec{v}_r$ is amplified the least, as the ordering $\sigma_1 \geq \cdots \geq \sigma_r$ shows.

We also write the SVD as:

$$ AV = A\cdot\left[ 
\begin{array}{c|c|c|c}
& & & \\[0.3ex]
\vec{v_1} & \vec{v_2} & \cdots & \vec{v_n} \\[0.3ex]
& & &
\end{array}
\right] = \left[ 
\begin{array}{c|c|c|c}
& & & \\[0.3ex]
\vec{u_1} & \vec{u_2} & \cdots & \vec{u_n} \\[0.3ex]
& & &
\end{array}
\right] \cdot \left[ 
\begin{array}{cccc}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n
\end{array}
\right] = \widehat{U}\widehat{\Sigma}$$

where $V \in \mathbb{R}^{n \times n}, \widehat{U} \in \mathbb{R}^{m \times n}, \widehat{\Sigma} \in \mathbb{R}^{n \times n}$.

We can also append $m-n$ additional orthonormal vectors $\vec{u}_{n+1}, \vec{u}_{n+2}, \dots,\vec{u}_{m}$ to $\vec{u}_1, \vec{u}_2, \dots,\vec{u}_{n}$, so that an orthonormal basis of the whole space $\mathbb{R}^m$ is formed. These extra vectors are multiplied by zero rows, so they have no effect on the result. So now we can write:

$$AV = \left[ 
\begin{array}{c|c|c|c|c|c|c}
& & & & & &\\[0.3ex]
\vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n & \vec{u}_{n+1} & \cdots & \vec{u}_m\\[0.3ex]
& & & & & &
\end{array}
\right] \cdot \left[ 
\begin{array}{cccc}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{array}
\right] = U \Sigma$$

Now $V \in \mathbb{R}^{n \times n}, U \in \mathbb{R}^{m \times m}, \Sigma \in \mathbb{R}^{m \times n}$. Since $V$ is orthogonal ($V^{-1} = V^{\mathrm{T}}$), we finally get:

$$\boxed{A = U \Sigma V^{\mathrm{T}}}$$

which is the matrix form of the SVD of the matrix $A$.

$Theorem\ 1$

Let $r$ be the number of nonzero singular values of the matrix $A$. Then the **range space** of $A$ is spanned by the vectors $\vec{u}_1, \vec{u}_2, \dots,\vec{u}_r$, and the **null space** of $A$ is spanned by $\vec{\upsilon}_{r+1}, \vec{\upsilon}_{r+2}, \cdots, \vec{\upsilon}_n$.

$Theorem\ 2$

Let $r$ be the number of **nonzero** singular values of the matrix $A$. Then the *rank* of $A$ is $r$, $i.e.$, the dimension of the **column space** of $A$ is $r$.

$Theorem\ 3$

Let $A$ be an $m \times m$ square matrix. Then the absolute value of the determinant of $A$ equals 
$$ \prod _{i = 1} ^{m} \sigma_i$$

***
Now let $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$ be the nonzero singular values of the matrix $A$. Then the SVD can also be written as 

$$A = \sum _{i = 1} ^{r} \sigma_i \vec{u}_i \vec{\upsilon}_i^{\mathrm{T}}$$

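$\dagger$  
Note that this is a **sum of matrices**: each term $\sigma_i \vec{u}_i \vec{\upsilon}_i^{\mathrm{T}}$ is an $m \times n$ matrix of rank one.

Also, for $\nu = 1,2,\dots,r$, define $\displaystyle A_{\nu} = \sum _{i = 1} ^{\nu} \sigma_i \vec{u}_i \vec{\upsilon}_i^{\mathrm{T}}$. With this definition we have

$Theorem\ 4$

$$\left\| A - A_{\nu}\right\|_2 = \sigma_{\nu + 1}$$

(with the convention that $\sigma_{\nu + 1} = 0$ when $\nu = r$).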

$Proof$

Since $\vec{\upsilon}_1, \vec{\upsilon}_2, \cdots, \vec{\upsilon}_n$ is an orthonormal basis of the space $\mathbb{R}^n$, we can write any unit vector $\vec{x}$ as $x_1 \vec{\upsilon}_1 + x_2 \vec{\upsilon}_2 + \cdots + x_n \vec{\upsilon}_n$, where $x_{1}^{2} + x_{2}^{2} + \cdots +  x_{n}^{2} = 1$ since $\|\vec{x}\|_2 = 1$. So we have

$$(A - A_{\nu})\vec{x} = \left( \sum _{i = \nu + 1} ^{n} \sigma_i \vec{u}_i \vec{\upsilon}_i ^{\mathrm{T}} \right) \big(x_1 \vec{\upsilon}_1 + x_2 \vec{\upsilon}_2 + \cdots + x_n \vec{\upsilon}_n \big) = \sum _{i = \nu + 1} ^{n} \sigma_i \vec{u}_i x_i$$

So that we have

$$\begin{align}
\big\| (A - A_{\nu})\vec{x} \big\|_{2} ^{2} &= \big( (A - A_{\nu})\vec{x} \big)^{\mathrm{T}} (A - A_{\nu})\vec{x} \\
&= \left( \sum _{i = \nu + 1} ^{n} \sigma_i \vec{u}_i^{\mathrm{T}} x_i \right) \cdot \left( \sum _{j = \nu + 1} ^{n} \sigma_j \vec{u}_j x_j \right) \\
&= \sum _{i = \nu + 1} ^{n} \sigma_i^{2}  x_i^{2} \leq \sigma_{\nu + 1}^{2} \sum _{i = \nu + 1} ^{n} x_i^{2} \leq \sigma_{\nu + 1}^{2}
\end{align}$$

$i.e.$, $\big\| (A - A_{\nu})\vec{x} \big\|_{2} \leq \sigma_{\nu + 1}$. Now we have an upper bound. At the same time, if we take $\vec{x} = \vec{\upsilon}_{\nu + 1}$, we have 

$$\begin{align}
\big\| (A - A_{\nu})\vec{x} \big\|^{2}_{2} &= \big\| \sigma _{\nu + 1} \vec{u} _{\nu + 1} \vec{\upsilon} ^{\mathrm{T}} _{\nu + 1} \vec{\upsilon} _{\nu + 1} + \sigma _{\nu + 2} \vec{u} _{\nu + 2} \vec{\upsilon} ^{\mathrm{T}} _{\nu + 2} \vec{\upsilon} _{\nu + 1} + \cdots + \sigma_{r} \vec{u}_{r} \vec{\upsilon} ^{\mathrm{T}} _{r} \vec{\upsilon} _{\nu + 1} \big\|^{2}_{2} \\
&=\big\| \sigma _{\nu + 1} \vec{u} _{\nu + 1} \cdot 1 + \sigma _{\nu + 2} \vec{u} _{\nu + 2} \cdot 0 + \cdots + \sigma_{r} \vec{u}_{r} \cdot 0 \big\|^{2}_{2} = \sigma^{2}_{\nu + 1}
\end{align}$$

$\dagger$  
More generally, take $\vec{x} = \vec{\upsilon}_{i}$ for $i = 1,2,\dots,r$: when $i \leq \nu$, $\big\| (A - A_{\nu})\vec{x} \big\|^{2}_{2} = 0$, and when $i > \nu$, $\big\| (A - A_{\nu})\vec{x} \big\|^{2}_{2} = \sigma^{2}_{i}$.

That is to say, this upper bound is attained. Therefore, $\left\| A - A_{\nu}\right\|_2 = \sigma_{\nu + 1}$.

$\square$
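
As a quick sanity check of $Theorem\ 4$, here is a small worked example added for illustration (the matrix is made up). Take the diagonal matrix below, for which $\sigma_1 = 3$, $\sigma_2 = 2$, $\sigma_3 = 1$ and $\vec{u}_i = \vec{\upsilon}_i = \vec{e}_i$, the standard basis vectors:

$$A = \left[ \begin{array}{ccc}
3 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 1
\end{array} \right], \quad
A_1 = \sigma_1 \vec{u}_1 \vec{\upsilon}_1^{\mathrm{T}} = \left[ \begin{array}{ccc}
3 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{array} \right], \quad
A - A_1 = \left[ \begin{array}{ccc}
0 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 1
\end{array} \right]$$

For a unit vector $\vec{x}$ we get $\big\| (A - A_1)\vec{x} \big\|_2^2 = 4x_2^2 + x_3^2 \leq 4$, with equality for $\vec{x} = \vec{e}_2$, so $\left\| A - A_1 \right\|_2 = 2 = \sigma_2$. In the same way $\left\| A - A_2 \right\|_2 = 1 = \sigma_3$.
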
***
$Theorem\ 5$

For any $m \times n$ matrix $A$,

$$\left\|A\right\|_2 = \sigma_1,$$

$i.e.$, the **largest singular value** of the matrix $A$. If $A$ is a nonsingular $m \times m$ matrix, then 

$$\left\| A^{-1}\right\|_2 = \frac{1} {\sigma_m}$$

$Proof$

The first part is just the special case $\nu = 0$ of $Theorem\ 4$ (with $A_0 = 0$), so we skip it and focus on the second part. Since $A = U \Sigma V^{\mathrm{T}}$ with $U$ and $V$ orthogonal, we have $A^{-1} = V \Sigma^{-1} U^{\mathrm{T}}$, so the singular values of the nonsingular matrix $A^{-1}$ are $\displaystyle \frac{1} {\sigma_m} \geq \frac{1} {\sigma_{m-1}} \geq \cdots \geq \frac{1} {\sigma_2} \geq \frac{1} {\sigma_1} > 0$. Applying the first result to $A^{-1}$ gives $\left\| A^{-1} \right\|_2 = \frac{1} {\sigma_m}$.

$\square$
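
$Theorem\ 5$ can also be checked by brute force in two dimensions. The sketch below is an illustration added here: it approximates $\left\| M \right\|_2$ by sampling unit vectors on the circle and taking the largest stretch, then multiplies the estimates for $A$ and $A^{-1}$ to approximate the condition number. The function `max_stretch`, the sample count and the test matrix are choices made for this sketch.

```python
import math


def max_stretch(M, samples=20_000):
    """Approximate ||M||_2 by the largest |M x| over sampled unit vectors x."""
    (a, b), (c, d) = M
    best = 0.0
    for k in range(samples):
        t = math.pi * k / samples        # half a circle suffices: |M(-x)| = |M x|
        x0, x1 = math.cos(t), math.sin(t)
        best = max(best, math.hypot(a * x0 + b * x1, c * x0 + d * x1))
    return best


A = [[1.0, 2.0], [1.0, 1.0]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
# Explicit inverse of a 2x2 matrix.
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]

norm_A = max_stretch(A)                  # estimate of sigma_1
norm_A_inv = max_stretch(A_inv)          # estimate of 1 / sigma_2
cond_estimate = norm_A * norm_A_inv

print(norm_A, norm_A_inv, cond_estimate)
```

For this matrix the product agrees with the condition number obtained from the singular values, $\sigma_1 / \sigma_2 \approx 6.8541$.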