Binary file modified Publications_and_Presentations/MPArray.pdf
Binary file not shown.
61 changes: 39 additions & 22 deletions README.md
@@ -8,8 +8,11 @@

# MultiPrecisionArrays.jl v0.1.3

## v0.1.3 will have new features but no breaking changes from v0.1.2, the current release.

## v0.1.3 will have better performance but no breaking changes from v0.1.2, the current release. A performance fix is in the works. Here is a list ...
- The termination criteria will change and I will add some options to make them more or less expensive. Computing __opnorm(A,Inf)__ for the termination criteria costs as much as a couple of IR iterations (see the sketch after this list), so that will become an option and terminating on small residuals will be the default. I will also use $\| A \|_1$ because it is the least expensive norm to compute. Further economies will come in v0.1.4.
- The AppleAccelerate BLAS and OpenBLAS can perform differently on Apple M* machines. It is not clear which is best, and that may depend on the number of cores, BLAS threads, version of Julia ... Play with this if you see poor performance in the solve phase of IR.
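
Here is a quick, hedged sketch of how to measure the norm cost yourself; it assumes BenchmarkTools and the ```Gmat``` test matrix used in the examples below, and the value of ```N``` is only for illustration.

```
using MultiPrecisionArrays
using BenchmarkTools
using LinearAlgebra

# Compare the two operator norms with one matrix-vector product,
# i.e. one IR iteration's worth of O(N^2) work.
N = 4096
A = I + Gmat(N)
x = ones(N)
println("opnorm(A, Inf): ", @belapsed opnorm($A, Inf))
println("opnorm(A, 1):   ", @belapsed opnorm($A, 1))
println("A*x:            ", @belapsed $A * $x)
```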

## [C. T. Kelley](https://ctk.math.ncsu.edu)

This package provides data structures and solvers for several variants of iterative refinement (IR). It will become much more useful when half precision (aka ```Float16```) is fully supported in LAPACK/BLAS. For now, its only general-purpose
@@ -43,6 +46,8 @@ __The half precision LU for Float16 in this package is much faster (more than 10
- Krylov-IR for high precision residuals

- v0.1.3: Still better docs and ..
- Fixing a performance bug.
- Add options to the termination criterion. Change the default to small residuals.

## Can I complain about this package?

@@ -168,6 +173,11 @@ Now we will see how the results look. In this example we compare the result with
As you can see the results are equally good. Note that the factorization object ```MPF``` is the
output of ```mplu```. This is analogous to ```AF=lu(A)``` in LAPACK.

You may not get exactly the same results for this example on
different hardware, with a different BLAS or number of cores, or with
different versions of Julia, your OS, or MultiPrecisionArrays.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that, as could the residual for the converged result.

```
julia> using MultiPrecisionArrays

@@ -192,7 +202,7 @@ julia> ze=norm(z-x,Inf); zr=norm(b-A*z,Inf)/norm(b,Inf);
julia> we=norm(w-x,Inf); wr=norm(b-A*w,Inf)/norm(b,Inf);

julia> println("Errors: $ze, $we. Residuals: $zr, $wr")
Errors: 1.33227e-15, 7.41629e-14. Residuals: 1.33243e-15, 7.40609e-14
Errors: 2.22045e-16, 6.71685e-14. Residuals: 2.22045e-16, 6.71685e-14
```

So the results are equally good.
@@ -206,10 +216,10 @@ is 50% more than ```lu```.

```
julia> @belapsed mplu($A)
8.60945e-02
8.59528e-02

julia> @belapsed lu!(AC) setup=(AC=copy($A))
1.42840e-01
1.42112e-01

```
It is no surprise that the factorization in single precision took roughly half as long as the one in double. In the double-single precision case, iterative refinement is a great
@@ -219,17 +229,18 @@ The advantages of IR increase as the dimension increases. IR is less impressive
julia> N=30; A=I + Gmat(N);

julia> @belapsed mplu($A)
4.19643e-06
5.22217e-06

julia> @belapsed lu!(AC) setup=(AC=copy($A))
3.70825e-06
3.64062e-06
```

### A few details

Look at the docs for things like
- [Memory allocation costs](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/#Memory-Allocations-for-mplu)
- [Terminating the while loop](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/Details/Termination/)
- [Is O(N^2) work really negligible?](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/Details/N2Work)
- [Options and data structures for mplu](https://ctkelley.github.io/MultiPrecisionArrays.jl/dev/#Options-and-data-structures-for-mplu)


@@ -270,22 +281,23 @@ julia> # Use \ with reporting=true
julia> mpout=\(MPF, b; reporting=true);

julia> norm(b-A*mpout.sol, Inf)
1.33227e-15
2.22045e-16

julia> # Now look at the residual history

julia> mpout.rhist
5-element Vector{Float64}:
9.99878e-01
1.21892e-04
5.25805e-11
2.56462e-14
1.33227e-15
6-element Vector{Float64}:
1.00000e+00
1.89553e-05
4.56056e-11
3.08642e-14
4.44089e-16
2.22045e-16
```
As you can see, IR does well for this problem. The package uses an initial
iterate of $x = 0$ and so the initial residual is simply $r = b$
and the first entry in the residual history is $|| b ||_\infty$. The
iteration terminates successfully after four matrix-vector products.
iteration terminates successfully after five matrix-vector products.
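
For orientation, here is a simplified sketch of the double/single IR loop that produces this kind of history. It is not the package's implementation: it hard-wires ```Float64```/```Float32```, uses the small-residual tolerance from the termination docs ($C_r u_r \| b \|_\infty$ with $C_r = 20$), and omits the stagnation test and all options.

```
using LinearAlgebra

# Simplified IR sketch: factor in Float32, accumulate in Float64.
function ir_sketch(A, b; maxit = 10)
    AF32 = lu(Float32.(A))            # O(N^3) work in the low precision
    x = zeros(length(b))              # initial iterate x = 0, so r = b
    r = copy(b)
    rhist = [norm(r, Inf)]
    rtol = 20.0 * 0.5 * eps(Float64) * norm(b, Inf)
    while norm(r, Inf) > rtol && length(rhist) <= maxit
        d = AF32 \ Float32.(r)        # downcast r; low precision triangular solves
        x .+= d                       # correct x in the working precision
        r .= b .- A * x               # recompute the residual in Float64
        push!(rhist, norm(r, Inf))
    end
    return (sol = x, rhist = rhist)
end
```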

You may wonder why the residual after the first iteration was so much
larger than single precision roundoff. The reason is that the default
@@ -297,18 +309,23 @@ One can enable interprecision transfers on the fly and see the difference.
```
julia> MPF2=mplu(A; onthefly=true);

julia> @belapsed $MPF2\$b
2.47693e-02

julia> mpout2=\(MPF2, b; reporting=true);

julia> mpout2.rhist
5-element Vector{Float64}:
9.99878e-01
6.17721e-07
3.84581e-13
7.99361e-15
8.88178e-16
5-element Vector{Float64}:
1.00000e+00
6.29385e-07
3.98570e-13
5.21805e-15
2.22045e-16
```
So the second iteration is much better, but the iteration terminated after
four iterations in both cases.
So the second approach took fewer iterations but a bit more time.



There are more examples for this in the [paper](https://github.com/ctkelley/MultiPrecisionArrays.jl/blob/main/Publications_and_Presentations/MPArray.pdf).

4 changes: 2 additions & 2 deletions docs/make.jl
@@ -38,8 +38,8 @@ makedocs(
"Home"=>"index.md",
"Half Precision and Krylov-IR" => Any["Half_1.md",],
"More than you want to know" =>
Any["Details/Termination.md", "Details/Interprecision_1.md",
"Details/Extended.md",],
Any["Details/Termination.md", "Details/N2Work.md",
"Details/Interprecision_1.md", "Details/Extended.md",],
"MPArray Constructors" =>
Any["functions/MPArray.md","functions/MPGArray.md",
"functions/MPBArray.md",],
6 changes: 6 additions & 0 deletions docs/src/Details/Extended.md
@@ -27,6 +27,12 @@ Here is an example with a badly conditioned matrix. You must tell
```mplu``` to factor in the working precision and use the
```kwargs``` in ```mplu``` to set ```TR```.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.


```
julia> using MultiPrecisionArrays

83 changes: 83 additions & 0 deletions docs/src/Details/N2Work.md
@@ -0,0 +1,83 @@
# Is O(N^2) work negligible?

In this section ```TR = TW = Float64``` and
```TF = TS = Float32```, which means that the interprecision transfers
in the triangular solvers are done in place. We terminate on small residuals.

The premise behind IR is that reducing the $O(N^3)$ cost of the
factorization will make the solve faster because everything else
is $O(N^2)$ work. It's worth looking into this.

We will use the old-fashioned definition of a FLOP as an add, a multiply,
and a bit of address computation. So we have $N^2$ flops for any of

- matrix-vector multiply $A x$,
- the two triangular solves with the LU factors $(LU)^{-1} b$, and
- computation of the $\ell^1$ or $\ell^\infty$ matrix operator norms $\| A \|_{1,\infty}$.

A linear solve with an LU factorization and the standard triangular
solves has a cost of $N^3/3 + N^2$ TR-FLOPS. The factorization for IR
has a cost of $N^3/3$ TF-FLOPS, or $N^3/6$ TR-FLOPS if we count a TF FLOP as half a TR FLOP.

A single IR iteration costs a matrix-vector product in precision TR
and the two triangular solves in precision TF, for a total of
$3 N^2/2$ TR-FLOPS. Hence a linear solve with IR that needs $n_I$ iterations
costs
```math
\frac{N^3}{6} + \frac{3 n_I N^2}{2}
```
TR-FLOPS if one terminates on small residuals, and an extra $N^2$ TR-FLOPS
if one computes the norm of $A$ in precision TR.
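
As a rough check on these counts, suppose (hypothetically) that IR needs $n_I = 5$ iterations. The IR count is smaller than the LU count when
```math
\frac{N^3}{6} + \frac{3 n_I N^2}{2} < \frac{N^3}{3} + N^2 ,
\quad \text{i.e.} \quad N > 9 n_I - 6 ,
```
so for $n_I = 5$ the FLOP counts favor IR once $N > 39$. The tables below show that the break-even point in practice is far larger, which is the point of this section: not all $O(N^2)$ work runs at the same speed.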

IR will clearly be better for large values of $N$. How large is that?
In this example we compare the cost of the factorization and solve
using ```lu!``` and ```ldiv!``` (cols 2-4) with the equivalent multiprecision
commands ```mplu``` and ```\``` (cols 5-7). A timing sketch follows the column list.

The operator is ```A = I - Gmat(N)```. We tabulate

- LU: time for ```AF=lu!(A)```

- TS: time for ```ldiv!(AF,b)```

- TOTL = LU+TS

- MPLU: time for ```MPF=mplu(A)```

- MPS: time for ```MPF\b```

- TOT: MPLU+MPS

- OPNORM: the cost of computing $\| A \|_1$, which one needs in order to terminate on a small normwise backward error.
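
Here is a minimal sketch of how one row of such a table could be timed. It is not the script that produced these numbers; it assumes BenchmarkTools and the ```Gmat``` example matrix, and the right-hand side is arbitrary.

```
using MultiPrecisionArrays
using BenchmarkTools
using LinearAlgebra

# Time one row of the table for a given N.
function timerow(N)
    A = I - Gmat(N)
    b = ones(N)
    tlu = @belapsed lu!(AC) setup=(AC=copy($A))          # LU
    AF = lu(A)
    tts = @belapsed ldiv!($AF, bc) setup=(bc=copy($b))   # TS
    tmplu = @belapsed mplu($A)                           # MPLU
    MPF = mplu(A)
    tmps = @belapsed $MPF \ $b                           # MPS
    topn = @belapsed opnorm($A, 1)                       # OPNORM
    return (N = N, LU = tlu, TS = tts, TOTL = tlu + tts,
            MPLU = tmplu, MPS = tmps, TOT = tmplu + tmps, OPNORM = topn)
end
```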

The message from the tables is that the triangular solves are more costly
than operation counts might indicate. One reason for this is that the
LU factorization exploits multi-core computing better than a triangular
solve does. It is also interesting to see how the choice of BLAS affects the
results and how the operator norm of the matrix costs more to compute than
a triangular solve.

With both BLAS libraries, multiprecision arrays perform better when $N \ge 2048$,
with the difference growing as $N$ increases.

OpenBLAS
```
N LU TS TOTL MPLU MPS TOT OPNORM
512 9.87e-04 5.35e-05 1.04e-03 7.74e-04 2.90e-04 1.06e-03 1.65e-04
1024 3.67e-03 2.14e-04 3.88e-03 3.21e-03 8.83e-04 4.10e-03 7.80e-04
2048 2.10e-02 1.20e-03 2.22e-02 1.54e-02 4.72e-03 2.02e-02 3.36e-03
4096 1.46e-01 5.38e-03 1.51e-01 8.93e-02 1.91e-02 1.08e-01 1.43e-02
8192 1.10e+00 2.01e-02 1.12e+00 5.98e-01 6.89e-02 6.67e-01 5.83e-02
```

AppleAccelerate BLAS
```
N LU TS TOTL MPLU MPS TOT OPNORM
512 7.86e-04 1.10e-04 8.96e-04 4.76e-04 3.58e-04 8.35e-04 1.65e-04
1024 3.28e-03 4.54e-04 3.73e-03 2.44e-03 1.30e-03 3.74e-03 7.80e-04
2048 2.27e-02 2.46e-03 2.52e-02 1.32e-02 1.13e-02 2.45e-02 3.36e-03
4096 1.52e-01 1.13e-02 1.63e-01 6.51e-02 5.62e-02 1.21e-01 1.40e-02
8192 1.17e+00 5.35e-02 1.23e+00 6.27e-01 2.48e-01 8.75e-01 5.67e-02
```


5 changes: 5 additions & 0 deletions docs/src/Details/Stats.md
@@ -5,6 +5,11 @@ __reporting__ keyword argument to the solvers. The easiest way
to do this is with the backslash command. When you use this option you
get a data structure with the solution and the residual history.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.

```
julia> using MultiPrecisionArrays

37 changes: 30 additions & 7 deletions docs/src/Details/Termination.md
@@ -1,18 +1,41 @@
# Terminating the while loop

We terminate the loop when
Today's values of the constants used below are
```math
\| r \| < \tau (\| b \| + \| A \| \| x \|)
C_e = 1.0, \quad C_r = 20.0, \quad r_{max} = 0.1
```
where we use $\tau = 0.5 * eps(TW)$, where $eps(TW)$ is the working
precision floating
point machine epsilon. The problem with this criterion is

Floating point roundoff is
```math
u_r = 0.5 * eps(TR)
```

We can terminate on a small residual
```math
\| r \| < C_r u_r \| b \|
```
or a small relative normwise backward error
```math
\| r \| < C_e u_r (\| b \| + \| A \| \| x \|)
```

Termination on a small residual is the default because computing $\| A \|$
is $N^2$ work and costs as much as a few IR iterations. I am using
$\| A \|_1$ because it takes less time (better data locality) than
$\| A \|_\infty$, but it is still a pain. The reason $C_r > C_e$ is
that putting $\| A \|$ in the termination criterion is very useful,
and making $C_r$ larger helps compensate for leaving it out of the small residual test.

The problem with these criteria is
that IR can stagnate, especially for ill-conditioned problems, before
the termination criterion is attained. We detect stagnation by looking
for an unacceptably small decrease (or an increase) in the residual norm. So we will
terminate the iteration if
also terminate the iteration if
```math
\| r_{new} \| \ge .9 \| r_{old} \|
\| r_{new} \| \ge r_{max} \| r_{old} \|
```
even if the small residual condition is not satisfied.
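
Here is a minimal sketch of this logic, not the package's internal code; the argument names are illustrative and the norms are assumed to be computed elsewhere.

```
# Termination test sketch: small residual (the default) or small normwise
# backward error, plus the stagnation test.
function ir_done(rnorm, rnorm_old, bnorm, anorm, xnorm, TR;
                 Ce = 1.0, Cr = 20.0, rmax = 0.1, small_residual = true)
    ur = 0.5 * eps(TR)                             # floating point roundoff u_r
    converged = if small_residual
        rnorm < Cr * ur * bnorm                    # small residual
    else
        rnorm < Ce * ur * (bnorm + anorm * xnorm)  # normwise backward error
    end
    stagnated = rnorm >= rmax * rnorm_old          # not enough decrease
    return converged || stagnated
end
```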

I am still playing with the termination criteria, and the iteration
counts and timings could grow or shrink as I do that.

6 changes: 6 additions & 0 deletions docs/src/Half_1.md
@@ -59,6 +59,12 @@ can have a large error. Here is an example to illustrate this point.
The matrix here is modestly ill-conditioned and you can see that in the
error from a direct solve in double precision.

You may not get exactly the same results for this example on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.


```
julia> A=I - 800.0*G;

9 changes: 7 additions & 2 deletions docs/src/index.md
Expand Up @@ -165,6 +165,11 @@ Now we will see how the results look. In this example we compare the result with
As you can see the results are equally good. Note that the factorization object ```MPF``` is the
output of ```mplu```. This is analogous to ```AF=lu(A)``` in LAPACK.

You may not get exactly the same results for the examples on
different hardware, with a different BLAS, or with different versions of Julia or this package.
I am still playing with the termination criteria, and the iteration
count could grow or shrink as I do that.

```
julia> using MultiPrecisionArrays

@@ -187,7 +192,7 @@ julia> ze=norm(z-x,Inf); zr=norm(b-A*z,Inf)/norm(b,Inf);
julia> we=norm(w-x,Inf); wr=norm(b-A*w,Inf)/norm(b,Inf);

julia> println("Errors: $ze, $we. Residuals: $zr, $wr")
Errors: 8.88178e-16, 7.41629e-14. Residuals: 1.33243e-15, 7.40609e-14
Errors: 2.22045e-16, 6.71685e-14. Residuals: 2.22045e-16, 6.71685e-14

```

@@ -216,7 +221,7 @@ The advantages of IR increase as the dimension increases. IR is less impressive
julia> N=30; A=I + Gmat(N);

julia> @belapsed mplu($A)
4.19643e-06
5.22217e-06

julia> @belapsed lu!(AC) setup=(AC=copy($A))
3.70825e-06