# Lecture notes for 2020-05-08

Note: I have [review notes](https://www.cs.cornell.edu/courses/cs4220/2017sp/lec/review.pdf)
from previous semesters that do a pretty good job of covering things.
I do not plan to put those into notebook format, at least not right now. 
But I recommend them to you for a written "big picture" view.

## Logistics

- Will close everything by end of May 11
- Final will be May 11-May 21 (*new end date*)
- Early submissions are appreciated!
- Final will be *optional* for S/U (will average rest of grades)
- Final still required for letter grade
- Final covers whole course
  - But this mostly means "the second half relies on the first half!"

## The solver stack

Big numerical codes come in layers, e.g.

- A continuation strategy to do a parameter study
- A Newton iteration (or related iteration) to solve nonlinear subproblems
- A Krylov solver (or related iteration) for Jacobian solves
- A sparse matrix factorization to precondition the Krylov solver
- Dense factorizations used inside the sparse solver

You usually don't have to implement every layer.  But you have to understand accuracy/stability and performance
implications of the choices at each layer in order to make things work correctly.

For those of you who are interested in the high-performance computing and systems aspect of building such solver
stacks, I will teach CS 5220 (Applications of Parallel Computers) in the Fall 2020 semester.

## The nutshell

We did plenty of calculus and linear algebra foundations in this class, but the main "meat" of the course
is in building factorizations and iterations to solve different linear and nonlinear problems.

### Factorizations

We talked about many factorizations, including

- LU factorization / Gaussian elimination
- Cholesky factorization
- QR factorization
- Singular value decomposition
- Schur decomposition and symmetric eigendecomposition

The first three we actually spent a lot of time learning to compute.  We mostly thought about how to use
the latter two.  For most of the linear algebra problems that we talked about in the class, from solving linear systems
and least squares to determinants and data compression, we can use one or more of these factorizations.

#### Questions

How could you compute $\log |\det(A)|$ using each of the factorizations mentioned above?  Why might computing
$\det(A)$ in general be numerically tricky?

### Iterations

Though this is not the only class of iterations we discussed, our chief iteration
building block is the *fixed point iteration*
$$
  x_{k+1} = G(x_k).
$$
The most important of these is probably Newton's iteration.

The general challenges in using iterative methods are:

- Getting good enough initial guesses
- Deciding when to terminate
- Keeping the cost per step low
- Keeping the number of steps low

The goal is "right enough, fast enough."  What meets those goals is often application specific.
Hence, being an informed user of iterative methods often requires more monkeying around than
when we solve standard linear algebra problems.  Of course, if someone else has solved a problem
for you, that's great (and this is often the case for 1D nonlinear equation solving, for example).
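
As a minimal sketch of the fixed point framework and the termination question, the code below iterates $x_{k+1} = G(x_k)$ until successive iterates stop moving. The function name `fixed_point`, the step-size stopping rule, and the test map $G(x) = \cos(x)$ are illustrative assumptions, not something prescribed in class.

```python
import math

def fixed_point(G, x0, tol=1e-12, max_steps=200):
    """Iterate x_{k+1} = G(x_k) until successive iterates differ by less than tol.

    Returns (x, history), where history lists every iterate including x0.
    The step-size test is a heuristic stopping rule, not an error guarantee.
    """
    x = x0
    history = [x]
    for _ in range(max_steps):
        x_new = G(x)
        history.append(x_new)
        if abs(x_new - x) < tol:
            return x_new, history
        x = x_new
    raise RuntimeError("fixed point iteration did not converge")

# Example: G(x) = cos(x) is a contraction near its fixed point (about 0.739).
x_star, history = fixed_point(math.cos, 1.0)
print(x_star, len(history) - 1)
```

The iteration cap guards against the case where the initial guess is not good enough; how tight to make `tol` is exactly the "right enough, fast enough" judgment call.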

#### Questions

Suppose $G$ is Lipschitz with constant $\alpha < 1$ on all of $\mathbb{R}^n$.
How can you argue that the iteration converges to a fixed point $x_*$?  Given an initial bound on $\|x_0-x_*\|$,
how would you decide how many steps you'd probably need to achieve a given (absolute) error tolerance $\tau$?

## Background

### Linear algebra

You should know about

- Concrete and abstract vector spaces
  - Standard examples ($\mathbb{R}^n$, $\mathcal{P}_d$)
  - Norms
  - Inner products
- Matrices and their interpretations
  - Mappings between vector spaces
  - Operators on a vector space
  - Quadratic forms on a vector space
- Block matrices
- Matrix norms and operator norms
- Matrix vector products (and how to do them fast)
- Singular value decomposition and eigenvalues

#### Questions

What is the Frobenius norm in terms of the SVD?  What is the 2-norm?  Why can the Frobenius norm not be a 2-norm?

### Calculus

You should know about

- Taylor's theorem in 1D
- Multivariate Taylor's theorem (out to first or second order)
  - Jacobians, gradients, and Hessians, oh my!
- Variational notation and matrix calculus

#### Questions

What is $\left. \frac{d}{ds} (A+sE)^{-1} \right|_{s=0}$?

### CS background

You should know about

- Order notation and performance
- A little graph theory and how it relates to sparse matrices

And, of course, how to program and debug.

#### Question

What is the graph associated with a tridiagonal matrix?

## Error analysis concepts

You should know different ways of thinking about error

- Forward error
- Backward error / residual error
- Normwise vs componentwise error
- Absolute vs relative error

You should also know about condition numbers and the concept of backward stability.

We are not going to drill down on floating point error analysis, but you should have some
appreciation for how numbers are represented and the types of things that can go wrong
when you compute in floating point.
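
One standard illustration of what can go wrong is cancellation. The sketch below evaluates $(1-\cos x)/x^2$ for small $x$ in two algebraically equivalent ways; the function names `naive` and `stable` and the choice of example are assumptions made for illustration.

```python
import math

def naive(x):
    # (1 - cos x) / x^2 subtracts two nearly equal numbers when x is small.
    return (1.0 - math.cos(x)) / (x * x)

def stable(x):
    # Same function via 1 - cos x = 2 sin^2(x/2), which avoids the subtraction.
    s = math.sin(0.5 * x)
    return 2.0 * s * s / (x * x)

x = 1e-8
naive_value = naive(x)
stable_value = stable(x)
print(naive_value, stable_value)
```

The true value is close to $1/2$ for small $x$. The rewritten formula recovers it to nearly full precision, while the naive formula loses essentially all of its digits to cancellation.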

#### Question

What is the condition number for the problem of solving $f(x) = 0$?

## Linear systems

For this part of the class, we talked about $Ax = b$ where $A \in \mathbb{R}^{n \times n}$ is nonsingular.
We mostly considered *direct* solvers based on factorization.  This was our first real experience using the
concepts of condition numbers to understand error propagation; in particular, we saw
$$
  \frac{\|\hat{x}-x\|}{\|x\|} \leq \kappa(A) \frac{\|E\|}{\|A\|} + O(\|E\|^2),
$$
and that we can understand the rounding in a process like Gaussian elimination in terms of the backward error $\|E\|$.

We saw several ways of thinking about Gaussian elimination, with the Schur complement playing a central role.  We also
saw the related concept of Cholesky factorization.  I talked some about sparse solvers, and a
lot about the importance of re-using factorization work and putting parentheses in the right place.
We also saw iterative refinement, which was our first example of a fixed point iteration in the class.
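
A small sketch of why the placement of parentheses matters: applying a rank-one matrix $uv^T$ to a vector $x$ costs $O(n^2)$ if we form the matrix first, but only $O(n)$ if we compute $u(v^T x)$. The helper names below are hypothetical, and plain Python lists stand in for vectors.

```python
def matvec(A, x):
    """Dense matrix-vector product, O(n^2) work for an n-by-n matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def rank_one_slow(u, v, x):
    """Form the n-by-n matrix u v^T explicitly, then multiply: O(n^2) time and memory."""
    A = [[u_i * v_j for v_j in v] for u_i in u]
    return matvec(A, x)

def rank_one_fast(u, v, x):
    """Compute u (v^T x) instead: one dot product and one scaling, O(n) work."""
    alpha = dot(v, x)
    return [u_i * alpha for u_i in u]

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]
x = [1.0, 0.0, -1.0]
y_slow = rank_one_slow(u, v, x)
y_fast = rank_one_fast(u, v, x)
print(y_slow, y_fast)
```

Both routines return the same vector; only the cost differs, and the gap grows quickly with $n$.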

#### Questions

Suppose $H = R^T R$ is a Cholesky factorization.  How would you add a row/column to $H$ and update the factorization
in $O(n^2)$ time?

## Least squares

The problem of least squares is a natural follow-up for the problem of linear systems.  We talked about different
ways of deriving / thinking about the normal equations and the Moore-Penrose pseudoinverse in the full-rank case;
the QR factorization and the SVD played substantial roles.  We briefly discussed the perturbation theory for least
squares problems, which is much more complicated than in the linear systems case.
We also talked about the idea of regularization, with a particular emphasis on Tikhonov regularization (though
we also discussed factor selection, the lasso, and truncated SVD). 
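
For reference, one common way to write the normal equations and their Tikhonov-regularized counterpart is sketched below. This assumes $A$ has full column rank, and the parametrization of the penalty by $\lambda^2$ is an assumption (other conventions use $\lambda$ or $\mu$ directly).
$$
\begin{aligned}
  \min_x \|Ax-b\|_2^2 &\implies A^T A x = A^T b, \\
  \min_x \|Ax-b\|_2^2 + \lambda^2 \|x\|_2^2 &\implies (A^T A + \lambda^2 I) x_\lambda = A^T b.
\end{aligned}
$$
The regularized problem is itself an ordinary least squares problem with $A$ stacked on top of $\lambda I$ and $b$ stacked on top of a zero vector, so the same factorization tools apply.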

#### Question

How could you write the Moore-Penrose pseudoinverse in terms of the (economy) QR factorization or SVD?
How could you compute the L-curve used to look at $\|x_\lambda\|$ vs $\|r_\lambda\|$ using these factorizations?

## Eigenvalues

We talked a bit about the Jordan form, the Schur form, the spectral mapping theorem, and various applications of
eigenvalue computations before we moved on to algorithms.  In terms of algorithms, we built up

- Power iteration
- Then inverse iteration
- Then shift-invert
- Then Rayleigh quotient iteration
- Then subspace iteration
- Then QR iteration
- Then Hessenberg reduction
- And finally, all the ingredients together in the Francis double-shift QR iteration
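
The first rung of this ladder can be sketched in a few lines. The code below is a minimal power iteration with a Rayleigh quotient eigenvalue estimate, assuming a symmetric matrix with a unique eigenvalue of largest magnitude; the function name, the residual-based stopping rule, and the $2 \times 2$ test matrix are illustrative choices.

```python
import math

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def power_iteration(A, x0, tol=1e-12, max_steps=1000):
    """Power iteration with a Rayleigh quotient eigenvalue estimate.

    Assumes A is symmetric with a unique eigenvalue of largest magnitude
    and that x0 has a nonzero component along that eigenvector.
    """
    norm = math.sqrt(sum(xi * xi for xi in x0))
    x = [xi / norm for xi in x0]
    for _ in range(max_steps):
        y = matvec(A, x)
        # Rayleigh quotient x^T A x (x has unit norm)
        lam = sum(xi * yi for xi, yi in zip(x, y))
        # Residual norm ||A x - lam x|| measures how close (lam, x) is to an eigenpair
        resid = math.sqrt(sum((yi - lam * xi) ** 2 for xi, yi in zip(x, y)))
        if resid < tol * abs(lam):
            return lam, x
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    raise RuntimeError("power iteration did not converge")

A = [[2.0, 1.0], [1.0, 3.0]]
lam, v = power_iteration(A, [1.0, 1.0])
print(lam, v)
```

Replacing the multiply by $A$ with a solve against $A - \sigma I$ turns this into shift-invert iteration, and updating $\sigma$ with the Rayleigh quotient at each step gives Rayleigh quotient iteration.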

We did not talk as much about algorithms for the symmetric eigenvalue problem, except to point out that the
Rayleigh quotient plays a central role.

#### Question

If $f(z)$ is an analytic function and $f(A)$ makes sense (e.g. there is a power series representation),
argue that $f(A) v = 0$ for some $v \neq 0$ iff there is at least one eigenvalue $\lambda$ of $A$ such that
$f(\lambda) = 0$.  If this is an isolated case (a single simple eigenvalue is involved),
how could you extract $\lambda$ knowing only $A$ and $v$ without reference to $f$?

## Iterations for linear systems

We talked about two flavors of iterations.  *Stationary* iterations can be analyzed via a matrix splitting:
$$
  Ax = (M-K)x = b,
$$
and the iteration becomes
$$
  M x_{k+1} = K x_k + b.
$$
The matrix $M$ should be easy to solve with.  Methods like Gauss-Seidel and Jacobi just take pieces of the matrix
(a triangular part or diagonal part).  The solve with $M$ is often "implicit" -- it looks like a so-called sweeping
computation rather than an explicit matrix formation and solve.  Convergence is determined by the spectral radius of
the iteration matrix $R = M^{-1} K$; we
usually have to understand structural features of $A$ and $M$ in order to think about how to actually prove convergence.
We gave a couple examples: Jacobi iteration with diagonally dominant matrices and Gauss-Seidel with SPD matrices.
Often, though, the easiest way to determine convergence on any particular problem is to run the iteration and
see if it converges.  We then briefly discussed CG and Krylov subspace solvers.  These are hard to ask questions about
in a class like this.
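
As a concrete instance of the splitting framework, here is a minimal sketch of Jacobi iteration, where $M$ is the diagonal of $A$ and $K = M - A$. The function name, the infinity-norm relative residual test, and the small diagonally dominant test system are assumptions made for illustration.

```python
def jacobi(A, b, x0, tol=1e-10, max_steps=500):
    """Jacobi iteration: M = diag(A), K = M - A, so M x_{k+1} = K x_k + b.

    Stops when the residual satisfies ||b - A x||_inf <= tol * ||b||_inf.
    """
    n = len(b)
    x = list(x0)
    for step in range(max_steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(ri) for ri in r) <= tol * max(abs(bi) for bi in b):
            return x, step
        # x_{k+1} = x_k + M^{-1} r_k is algebraically the same as M x_{k+1} = K x_k + b
        x = [x[i] + r[i] / A[i][i] for i in range(n)]
    raise RuntimeError("Jacobi iteration did not converge")

# Strictly diagonally dominant test matrix, so Jacobi is guaranteed to converge.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]   # chosen so that the exact solution is [1, 1, 1]
x, steps = jacobi(A, b, [0.0, 0.0, 0.0])
print(x, steps)
```

Note that the "solve with $M$" never forms a matrix inverse; it is just a division by each diagonal entry.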

#### Questions

If $\|R\| \leq \alpha < 1$, how many steps are needed to achieve a relative residual $\|r\|/\|b\| < \tau$ for a given
tolerance $\tau$?

## Solvers in 1D

We have several different strategies in 1D for nonlinear equations:

- Bisection
- Newton iteration (and more general fixed point)
- Secant iteration
- Safeguarded methods (like Brent) that go between bisection and Newton/secant/etc

We also talked briefly about 1D optimization and golden section search.

We spent a long time writing error iterations over and over again and making semi-log plots of
convergence.  These are clearly things that you should know how to do going forward!
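
As a reminder of what such an experiment looks like, the sketch below runs a secant iteration and records the error at each step. The test function $f(x) = x^2 - 2$, the starting points, and the tolerance are hypothetical choices; the printed $\log_{10}$ of the error is the quantity a semi-log convergence plot would show.

```python
import math

def secant(f, x0, x1, tol=1e-14, max_steps=50):
    """Secant iteration for f(x) = 0; returns the list of all iterates."""
    xs = [x0, x1]
    f0, f1 = f(x0), f(x1)
    for _ in range(max_steps):
        if f1 == f0:
            break  # secant line is flat; cannot take another step
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        xs.append(x2)
        if abs(x2 - x1) <= tol * abs(x2):
            break
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    return xs

# Hypothetical test problem: f(x) = x^2 - 2, with root sqrt(2).
xs = secant(lambda x: x * x - 2.0, 1.0, 2.0)
errors = [abs(x - math.sqrt(2.0)) for x in xs]
for k, e in enumerate(errors):
    # log10 of the error is what a semi-log convergence plot would display
    print(k, e, math.log10(e) if e > 0 else float("-inf"))
```

On a semi-log plot, linear convergence shows up as a straight line, while the superlinear convergence of secant or Newton shows up as a curve that bends downward.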

#### Questions

What is the Newton iteration for computing $\sqrt{a}$?  For $a \geq 1$, can you argue that iteration from the
initial guess $(1+a)/2$ always converges?

## Nonlinear equations and optimization in $\mathbb{R}^n$

In more than one dimension, things are harder.  We no longer have bisection, and methods like secant become much more
complex.  Our basic strategies usually involve Newton and Newton-like iterations.  One of the difficulties with Newton
is that forming and factoring Jacobians at every step can be expensive, so we talked about many ways of approximating
the Newton fixed point iteration: chord iterations, Shamanskii, Newton-Krylov.  We also talked about non-fixed-point
iterations like Broyden and various sequence acceleration and extrapolation schemes.  These latter types of iterations
are great for projects, but they are not an easy source of exam questions.

The (unconstrained) optimization picture is closely related to the nonlinear equation solving picture.  We're still
trying to solve a nonlinear equation, $\nabla \phi(x) = 0$.  But the optimization perspective gives us some additional
tools for thinking about things.  In particular, we have additional iterations (e.g. gradient descent and similar
first-order iterations) and different ways of judging the standard iterations (e.g. checking whether a given step
is a descent direction).  That alternate perspective plays a huge role in how we think about *globalization*
strategies, of which we discussed three:

- Backtracking line search methods and things like the Armijo test
- Trust region methods
- Continuation methods
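
The first of these strategies can be sketched as follows: gradient descent with a backtracking line search that shrinks the step until the Armijo sufficient decrease test passes. The function name, the parameters `c` and `shrink`, and the quadratic test problem are illustrative assumptions; the same line search can wrap a Newton direction instead of the steepest descent direction.

```python
def backtracking_descent(phi, grad, x0, c=1e-4, shrink=0.5, tol=1e-8, max_steps=10000):
    """Gradient descent with a backtracking line search and Armijo test.

    c and shrink are conventional but arbitrary choices of the sufficient
    decrease parameter and the step reduction factor.
    """
    x = list(x0)
    for step in range(max_steps):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 < tol:
            return x, step
        p = [-gi for gi in g]   # steepest descent direction, so grad^T p = -||g||^2 < 0
        alpha = 1.0
        phi_x = phi(x)
        # Armijo test: accept alpha once phi(x + alpha p) <= phi(x) + c * alpha * grad^T p
        while phi([xi + alpha * pi for xi, pi in zip(x, p)]) > phi_x - c * alpha * gnorm2:
            alpha *= shrink
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    raise RuntimeError("descent iteration did not converge")

# Hypothetical test problem: a convex quadratic with minimizer (1, -2).
phi = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]
x_min, steps = backtracking_descent(phi, grad, [0.0, 0.0])
print(x_min, steps)
```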

Approximate Newton iterations for particular problems are a *great* source of exam questions, partly because these are
places where there is a lot of room for thoughtful special-purpose tweaks to algorithms.

#### Questions

Describe a fast algorithm to solve $Ax = b(x_n)$ where $A$ is a nonsingular square matrix and
$b : \mathbb{R} \rightarrow \mathbb{R}^n$.

## Constrained problems

The material on methods for dealing with constraints that we covered this week is optional, and will not
be on the exam.  That said, you should absolutely know how to use the method of Lagrange multipliers to
write down the stationary point equations for an equality-constrained optimization problem, and how to
compute with that.
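
For reference, the generic pattern looks like the following, where the objective $\phi : \mathbb{R}^n \rightarrow \mathbb{R}$, the constraint function $c : \mathbb{R}^n \rightarrow \mathbb{R}^m$, and the multiplier vector $\lambda \in \mathbb{R}^m$ are notation assumed here for illustration:
$$
\begin{aligned}
  &\text{minimize } \phi(x) \text{ subject to } c(x) = 0, \qquad
  L(x, \lambda) = \phi(x) + \lambda^T c(x), \\
  &\nabla_x L = \nabla \phi(x) + c'(x)^T \lambda = 0, \qquad
  \nabla_\lambda L = c(x) = 0.
\end{aligned}
$$
This is a system of $n+m$ equations in the $n+m$ unknowns $(x, \lambda)$, so the Newton-type machinery from the previous section applies to it directly.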

#### Questions

How would you solve the problem of minimizing $\|Ax-b\|_2$ subject to $\sum_j x_j = 1$?