Binary file modified Publications_and_Presentations/MPArray.pdf
Binary file not shown.
61 changes: 39 additions & 22 deletions README.md
@@ -8,8 +8,11 @@

# MultiPrecisionArrays.jl v0.1.3

## v0.1.3 will have new features but no breaking changes from v0.1.2, the current release.

## v0.1.3 will have better performance but no breaking changes from v0.1.2, the current release. A performance fix is in the works. Here is a list ...
- The termination criteria will change and I will add some options to make them more or less expensive. Computing __opnorm(A,Inf)__ for the termination criteria costs as much as a couple of IR iterations (see the sketch after this list), so that will become an option and terminating on small residuals will be the default. I will also use $\| A \|_1$ because it is the least expensive norm to compute. Further economies will come in v0.1.4.
- The AppleAccelerate BLAS and OpenBLAS can perform differently on Apple M* machines. It is not clear which is best, and that may depend on the number of cores, BLAS threads, version of Julia ... Play with this if you see poor performance in the solve phase of IR.
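
Here is a quick, hedged sketch of how to measure the norm cost yourself; it assumes BenchmarkTools and the ```Gmat``` test matrix used in the examples below, and the value of ```N``` is only for illustration.

```
using MultiPrecisionArrays
using BenchmarkTools
using LinearAlgebra

# Compare the two operator norms with one matrix-vector product,
# i.e. one IR iteration's worth of O(N^2) work.
N = 4096
A = I + Gmat(N)
x = ones(N)
println("opnorm(A, Inf): ", @belapsed opnorm($A, Inf))
println("opnorm(A, 1):   ", @belapsed opnorm($A, 1))
println("A*x:            ", @belapsed $A * $x)
```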

## [C. T. Kelley](https://ctk.math.ncsu.edu)

This package provides data structures and solvers for several variants of iterative refinement (IR). It will become much more useful when half precision (aka ```Float16```) is fully supported in LAPACK/BLAS. For now, its only general-purpose
@@ -43,6 +46,8 @@ __The half precision LU for Float16 in this package is much faster (more than 10
- Krylov-IR for high precision residuals

- v0.1.3: Still better docs and ..
- Fixing a performance bug.
- Add options to the termination criterion. Change the default to small residuals.

## Can I complain about this package?

@@ -168,6 +173,11 @@ Now we will see how the results look. In this example we compare the result with
As you can see the results are equally good. Note that the factorization object ```MPF``` is the
output of ```mplu```. This is analogous to ```AF=lu(A)``` in LAPACK.

You may not get exactly the same results for this example on
different hardware, with a different BLAS or number of cores, or with
different versions of Julia, your OS, or MultiPrecisionArrays.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that, as could the residual for the converged result.

```
julia> using MultiPrecisionArrays

@@ -192,7 +202,7 @@ julia> ze=norm(z-x,Inf); zr=norm(b-A*z,Inf)/norm(b,Inf);
julia> we=norm(w-x,Inf); wr=norm(b-A*w,Inf)/norm(b,Inf);

julia> println("Errors: $ze, $we. Residuals: $zr, $wr")
Errors: 1.33227e-15, 7.41629e-14. Residuals: 1.33243e-15, 7.40609e-14
Errors: 2.22045e-16, 6.71685e-14. Residuals: 2.22045e-16, 6.71685e-14
```

So the results are equally good.
@@ -206,10 +216,10 @@ is 50% more than ```lu```.

```
julia> @belapsed mplu($A)
8.60945e-02
8.59528e-02

julia> @belapsed lu!(AC) setup=(AC=copy($A))
1.42840e-01
1.42112e-01

```
It is no surprise that the factorization in single precision took roughly half as long as the one in double. In the double-single precision case, iterative refinement is a great
@@ -219,17 +229,18 @@ The advantages of IR increase as the dimension increases. IR is less impressive
julia> N=30; A=I + Gmat(N);

julia> @belapsed mplu($A)
4.19643e-06
5.22217e-06

julia> @belapsed lu!(AC) setup=(AC=copy($A))
3.70825e-06
3.64062e-06
```

### A few details

Look at the docs for things like
- [Memory allocation costs](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/#Memory-Allocations-for-mplu)
- [Terminating the while loop](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/Details/Termination/)
- [Is O(N^2) work really negligible?](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/Details/N2Work)
- [Options and data structures for mplu](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/#Options-and-data-structures-for-mplu)


@@ -270,22 +281,23 @@ julia> # Use \ with reporting=true
julia> mpout=\(MPF, b; reporting=true);

julia> norm(b-A*mpout.sol, Inf)
1.33227e-15
2.22045e-16

julia> # Now look at the residual history

julia> mpout.rhist
5-element Vector{Float64}:
9.99878e-01
1.21892e-04
5.25805e-11
2.56462e-14
1.33227e-15
6-element Vector{Float64}:
1.00000e+00
1.89553e-05
4.56056e-11
3.08642e-14
4.44089e-16
2.22045e-16
```
As you can see, IR does well for this problem. The package uses an initial
iterate of $x = 0$ and so the initial residual is simply $r = b$
and the first entry in the residual history is $|| b ||_\infty$. The
iteration terminates successfully after four matrix-vector products.
iteration terminates successfully after five matrix-vector products.
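
For orientation, here is a simplified sketch of the double/single IR loop that produces this kind of history. It is not the package's implementation: it hard-wires ```Float64```/```Float32```, uses the small-residual tolerance from the termination docs ($C_r u_r \| b \|_\infty$ with $C_r = 20$), and omits the stagnation test and all options.

```
using LinearAlgebra

# Simplified IR sketch: factor in Float32, accumulate in Float64.
function ir_sketch(A, b; maxit = 10)
    AF32 = lu(Float32.(A))            # O(N^3) work in the low precision
    x = zeros(length(b))              # initial iterate x = 0, so r = b
    r = copy(b)
    rhist = [norm(r, Inf)]
    rtol = 20.0 * 0.5 * eps(Float64) * norm(b, Inf)
    while norm(r, Inf) > rtol && length(rhist) <= maxit
        d = AF32 \ Float32.(r)        # downcast r; low precision triangular solves
        x .+= d                       # correct x in the working precision
        r .= b .- A * x               # recompute the residual in Float64
        push!(rhist, norm(r, Inf))
    end
    return (sol = x, rhist = rhist)
end
```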

You may wonder why the residual after the first iteration was so much
larger than single precision roundoff. The reason is that the default
@@ -297,18 +309,23 @@ One can enable interprecision transfers on the fly and see the difference.
```
julia> MPF2=mplu(A; onthefly=true);

julia> @belapsed $MPF2\$b
2.47693e-02

julia> mpout2=\(MPF2, b; reporting=true);

julia> mpout2.rhist
5-element Vector{Float64}:
9.99878e-01
6.17721e-07
3.84581e-13
7.99361e-15
8.88178e-16
5-element Vector{Float64}:
1.00000e+00
6.29385e-07
3.98570e-13
5.21805e-15
2.22045e-16
```
So the second iteration is much better, but the iteration terminated after
four iterations in both cases.
So the second approach took fewer iterations but a bit more time.



There are more examples for this in the [paper](https://github.com/ctkelley/MultiPrecisionArrays.jl/blob/main/Publications_and_Presentations/MPArray.pdf).

4 changes: 2 additions & 2 deletions docs/make.jl
@@ -38,8 +38,8 @@ makedocs(
"Home"=>"index.md",
"Half Precision and Krylov-IR" => Any["Half_1.md",],
"More than you want to know" =>
Any["Details/Termination.md", "Details/Interprecision_1.md",
"Details/Extended.md",],
Any["Details/Termination.md", "Details/N2Work.md",
"Details/Interprecision_1.md", "Details/Extended.md",],
"MPArray Constructors" =>
Any["functions/MPArray.md","functions/MPGArray.md",
"functions/MPBArray.md",],
6 changes: 6 additions & 0 deletions docs/src/Details/Extended.md
@@ -27,6 +27,12 @@ Here is an example with a badly conditioned matrix. You must tell
```mplu``` to factor in the working precision and use the
```kwargs``` in ```mplu``` to set ```TR```.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.


```
julia> using MultiPrecisionArrays

83 changes: 83 additions & 0 deletions docs/src/Details/N2Work.md
@@ -0,0 +1,83 @@
# Is O(N^2) work negligible?

In this section ```TR = TW = Float64``` and
```TF = TS = Float32```, which means that the interprecision transfers
in the triangular solvers are done in place. We terminate on small residuals.

The premise behind IR is that reducing the $O(N^3)$ cost of the
factorization will make the solve faster because everything else
is $O(N^2)$ work. It's worth looking into this.

We will use the old-fashioned definition of a FLOP as an add, a multiply,
and a bit of address computation. So we have $N^2$ flops for any of

- matrix-vector multiply $A x$,
- the two triangular solves with the LU factors $(LU)^{-1} b$, and
- computation of the $\ell^1$ or $\ell^\infty$ matrix operator norms $\| A \|_{1,\infty}$.

A linear solve with an LU factorization and the standard triangular
solves has a cost of $N^3/3 + N^2$ TR-FLOPS. The factorization for IR
has a cost of $N^3/3$ TF-FLOPS, or $N^3/6$ TR-FLOPS if we count a TF FLOP as half a TR FLOP.

A single IR iteration costs a matrix-vector product in precision TR
and the two triangular solves in precision TF, for a total of
$3 N^2/2$ TR-FLOPS. Hence a linear solve with IR that needs $n_I$ iterations
costs
```math
\frac{N^3}{6} + \frac{3 n_I N^2}{2}
```
TR-FLOPS if one terminates on small residuals, and an extra $N^2$ TR-FLOPS
if one computes the norm of $A$ in precision TR.
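
As a rough check on these counts, suppose (hypothetically) that IR needs $n_I = 5$ iterations. The IR count is smaller than the LU count when
```math
\frac{N^3}{6} + \frac{3 n_I N^2}{2} < \frac{N^3}{3} + N^2 ,
\quad \text{i.e.} \quad N > 9 n_I - 6 ,
```
so for $n_I = 5$ the FLOP counts favor IR once $N > 39$. The tables below show that the break-even point in practice is far larger, which is the point of this section: not all $O(N^2)$ work runs at the same speed.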

IR will clearly be better for large values of $N$. How large is that?
In this example we compare the cost of the factorization and solve
using ```lu!``` and ```ldiv!``` (cols 2-4) with the equivalent multiprecision
commands ```mplu``` and ```\``` (cols 5-7). A timing sketch follows the column list.

The operator is ```A = I - Gmat(N)```. We tabulate

- LU: time for ```AF=lu!(A)```

- TS: time for ```ldiv!(AF,b)```

- TOTL = LU+TS

- MPLU: time for ```MPF=mplu(A)```

- MPS: time for ```MPF\b```

- TOT: MPLU+MPS

- OPNORM: the cost of computing $\| A \|_1$, which one needs in order to terminate on a small normwise backward error.
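
Here is a minimal sketch of how one row of such a table could be timed. It is not the script that produced these numbers; it assumes BenchmarkTools and the ```Gmat``` example matrix, and the right-hand side is arbitrary.

```
using MultiPrecisionArrays
using BenchmarkTools
using LinearAlgebra

# Time one row of the table for a given N.
function timerow(N)
    A = I - Gmat(N)
    b = ones(N)
    tlu = @belapsed lu!(AC) setup=(AC=copy($A))          # LU
    AF = lu(A)
    tts = @belapsed ldiv!($AF, bc) setup=(bc=copy($b))   # TS
    tmplu = @belapsed mplu($A)                           # MPLU
    MPF = mplu(A)
    tmps = @belapsed $MPF \ $b                           # MPS
    topn = @belapsed opnorm($A, 1)                       # OPNORM
    return (N = N, LU = tlu, TS = tts, TOTL = tlu + tts,
            MPLU = tmplu, MPS = tmps, TOT = tmplu + tmps, OPNORM = topn)
end
```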

The message from the tables is that the triangular solves are more costly
than operation counts might indicate. One reason for this is that the
LU factorization exploits multi-core computing better than a triangular
solve does. It is also interesting to see how the choice of BLAS affects the
results and how the operator norm of the matrix costs more to compute than
a triangular solve.

With both BLAS libraries, multiprecision arrays perform better when $N \ge 2048$,
with the difference growing as $N$ increases.

OpenBLAS
```
N LU TS TOTL MPLU MPS TOT OPNORM
512 9.87e-04 5.35e-05 1.04e-03 7.74e-04 2.90e-04 1.06e-03 1.65e-04
1024 3.67e-03 2.14e-04 3.88e-03 3.21e-03 8.83e-04 4.10e-03 7.80e-04
2048 2.10e-02 1.20e-03 2.22e-02 1.54e-02 4.72e-03 2.02e-02 3.36e-03
4096 1.46e-01 5.38e-03 1.51e-01 8.93e-02 1.91e-02 1.08e-01 1.43e-02
8192 1.10e+00 2.01e-02 1.12e+00 5.98e-01 6.89e-02 6.67e-01 5.83e-02
```

AppleAccelerate BLAS
```
N LU TS TOTL MPLU MPS TOT OPNORM
512 7.86e-04 1.10e-04 8.96e-04 4.76e-04 3.58e-04 8.35e-04 1.65e-04
1024 3.28e-03 4.54e-04 3.73e-03 2.44e-03 1.30e-03 3.74e-03 7.80e-04
2048 2.27e-02 2.46e-03 2.52e-02 1.32e-02 1.13e-02 2.45e-02 3.36e-03
4096 1.52e-01 1.13e-02 1.63e-01 6.51e-02 5.62e-02 1.21e-01 1.40e-02
8192 1.17e+00 5.35e-02 1.23e+00 6.27e-01 2.48e-01 8.75e-01 5.67e-02
```


5 changes: 5 additions & 0 deletions docs/src/Details/Stats.md
@@ -5,6 +5,11 @@ __reporting__ keyword argument to the solvers. The easiest way
to do this is with the backslash command. When you use this option you
get a data structure with the solution and the residual history.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.

```
julia> using MultiPrecisionArrays

37 changes: 30 additions & 7 deletions docs/src/Details/Termination.md
@@ -1,18 +1,41 @@
# Terminating the while loop

We terminate the loop when
Today's values of the constants used below are
```math
\| r \| < \tau (\| b \| + \| A \| \| x \|)
C_e = 1.0, \quad C_r = 20.0, \quad r_{max} = 0.1
```
where we use $\tau = 0.5 * eps(TW)$, where $eps(TW)$ is the working
precision floating
point machine epsilon. The problem with this criterion is

Floating point roundoff is
```math
u_r = 0.5 * eps(TR)
```

We can terminate on a small residual
```math
\| r \| < C_r u_r \| b \|
```
or a small relative normwise backward error
```math
\| r \| < C_e u_r (\| b \| + \| A \| \| x \|)
```

Termination on a small residual is the default because computing $\| A \|$
is $N^2$ work and costs as much as a few IR iterations. I am using
$\| A \|_1$ because it takes less time (better data locality) than
$\| A \|_\infty$, but it is still a pain. The reason $C_r > C_e$ is
that putting $\| A \|$ in the termination criterion is very useful,
and making $C_r$ larger helps compensate for leaving it out of the small residual test.

The problem with these criteria is
that IR can stagnate, especially for ill-conditioned problems, before
the termination criterion is attained. We detect stagnation by looking
for an unacceptably small decrease (or an increase) in the residual norm. So we will
terminate the iteration if
also terminate the iteration if
```math
\| r_{new} \| \ge .9 \| r_{old} \|
\| r_{new} \| \ge r_{max} \| r_{old} \|
```
even if the small residual condition is not satisfied.
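
Here is a minimal sketch of this logic, not the package's internal code; the argument names are illustrative and the norms are assumed to be computed elsewhere.

```
# Termination test sketch: small residual (the default) or small normwise
# backward error, plus the stagnation test.
function ir_done(rnorm, rnorm_old, bnorm, anorm, xnorm, TR;
                 Ce = 1.0, Cr = 20.0, rmax = 0.1, small_residual = true)
    ur = 0.5 * eps(TR)                             # floating point roundoff u_r
    converged = if small_residual
        rnorm < Cr * ur * bnorm                    # small residual
    else
        rnorm < Ce * ur * (bnorm + anorm * xnorm)  # normwise backward error
    end
    stagnated = rnorm >= rmax * rnorm_old          # not enough decrease
    return converged || stagnated
end
```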

I am still playing with the termination criteria, and the iteration
counts and timings could grow or shrink as I do that.

6 changes: 6 additions & 0 deletions docs/src/Half_1.md
@@ -59,6 +59,12 @@ can have a large error. Here is an example to illustrate this point.
The matrix here is modestly ill-conditioned and you can see that in the
error from a direct solve in double precision.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.


```
julia> A=I - 800.0*G;

9 changes: 7 additions & 2 deletions docs/src/index.md
Expand Up @@ -165,6 +165,11 @@ Now we will see how the results look. In this example we compare the result with
As you can see the results are equally good. Note that the factorization object ```MPF``` is the
output of ```mplu```. This is analogous to ```AF=lu(A)``` in LAPACK.

You may not get exactly the same results for the examples on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.

```
julia> using MultiPrecisionArrays

@@ -187,7 +192,7 @@ julia> ze=norm(z-x,Inf); zr=norm(b-A*z,Inf)/norm(b,Inf);
julia> we=norm(w-x,Inf); wr=norm(b-A*w,Inf)/norm(b,Inf);

julia> println("Errors: $ze, $we. Residuals: $zr, $wr")
Errors: 8.88178e-16, 7.41629e-14. Residuals: 1.33243e-15, 7.40609e-14
Errors: 2.22045e-16, 6.71685e-14. Residuals: 2.22045e-16, 6.71685e-14

```

@@ -216,7 +221,7 @@ The advantages of IR increase as the dimension increases. IR is less impressive
julia> N=30; A=I + Gmat(N);

julia> @belapsed mplu($A)
4.19643e-06
5.22217e-06

julia> @belapsed lu!(AC) setup=(AC=copy($A))
3.70825e-06