
SolverGMRES: Implement classical Gram-Schmidt with delayed reorthogonalization #16749

Merged

Conversation

@kronbichler (Member) commented Mar 13, 2024

This PR currently builds on top of #16745, but only the last two commits are relevant for this PR.

This PR implements a new orthogonalization variant in SolverGMRES and SolverFGMRES. It performs reorthogonalization unconditionally, but it does so in a smart way: in order not to increase the number of global reductions and the memory access compared to the classical Gram-Schmidt process, it performs the reorthogonalization one iteration later. This is called delayed reorthogonalization. It is highly robust, and I would like to make it the default GMRES implementation in a later PR. The implementation is essentially Algorithm 4 in the recent contribution by Bielich et al. (2022).
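To illustrate the cost that the delayed variant avoids, here is a minimal sketch of the non-delayed baseline, classical Gram-Schmidt with an unconditional second pass (CGS2). This is illustrative code, not the deal.II implementation; all names are mine. Each pass batches all inner products into one reduction, so CGS2 needs two global reductions per iteration, whereas the delayed variant of this PR folds the second pass of step j into the single reduction of step j+1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec &a, const Vec &b)
{
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    s += a[i] * b[i];
  return s;
}

// Classical Gram-Schmidt with a second, unconditional pass (CGS2).
// Each pass batches all inner products against the existing basis into
// one reduction, i.e. two global reductions per new vector. The delayed
// variant of this PR (Algorithm 4 in Bielich et al., 2022) instead
// merges the second pass of step j into the reduction of step j+1,
// keeping the reduction count of plain classical Gram-Schmidt.
void cgs2_orthonormalize(std::vector<Vec> &basis, Vec w)
{
  for (int pass = 0; pass < 2; ++pass)
    {
      std::vector<double> h(basis.size());
      for (std::size_t i = 0; i < basis.size(); ++i)
        h[i] = dot(basis[i], w); // one batched reduction per pass
      for (std::size_t i = 0; i < basis.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
          w[j] -= h[i] * basis[i][j];
    }
  const double norm = std::sqrt(dot(w, w));
  for (double &x : w)
    x /= norm;
  basis.push_back(w);
}
```

The second pass is what makes CGS2 robust against the loss of orthogonality that plain CGS suffers in finite precision; the point of the PR is to get that robustness without paying for the extra reduction.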

The algorithm implemented here uses the switch originally introduced in #14349 to support matrix-vector operations within the orthogonalization scheme, which is the primary motivation to use classical Gram-Schmidt in the first place; for other vector types, I cannot really do much optimization because our wrappers do not allow it (and one should use PETSc's own or Trilinos' Belos solvers to get the optimized methods). I plan to extend this also to dealii::Vector and dealii::BlockVector in an upcoming PR, because we should really use the optimized path for all of our own vectors.
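For context, selecting the new strategy from user code would look roughly like the following sketch. It is not compilable on its own: it assumes deal.II's existing SolverGMRES/AdditionalData interface and the enumerator name added by this PR, so treat the names (in particular delayed_classical_gram_schmidt and the surrounding objects) as provisional.

```cpp
// Sketch only: assumes the deal.II solver interfaces and the new
// OrthogonalizationStrategy enumerator introduced by this PR.
SolverControl control(1000, 1e-12 * rhs.l2_norm());

SolverGMRES<VectorType>::AdditionalData data;
data.orthogonalization_strategy =
  LinearAlgebra::OrthogonalizationStrategy::delayed_classical_gram_schmidt;

SolverGMRES<VectorType> solver(control, data);
solver.solve(matrix, solution, rhs, preconditioner);
```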

Once we are happy with the direction and the PR #16745 is merged, I will also write a changelog for this PR.

Fixes #16077, fixes #14864.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch 2 times, most recently from e5c008e to b5771a4 Compare March 14, 2024 09:36
@kronbichler (Member, Author)

I have now implemented the orthogonalization in a dedicated class rather than in free functions. This helps me keep all variables local to the context in which they are used and, overall, encapsulates the Arnoldi orthogonalization better. Along the way, I added extensive documentation of the new class to make sure the interface is understandable. I think the concept is now much better, which can be seen in how compact the solve functions of both GMRES solvers have become.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch from a221d61 to 8ca0a3c Compare March 15, 2024 06:12
@kronbichler (Member, Author)

Rebased after the merge of #16745.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch from 8ca0a3c to d4b8132 Compare March 15, 2024 08:17
@peterrum (Member) left a comment


Very nice 👍

const LinearAlgebra::OrthogonalizationStrategy orthogonalization_strategy,
const internal::SolverGMRESImplementation::TmpVectors<VectorType>
&orthogonal_vectors,
ArnoldiProcess::orthonormalize_nth_vector(
Member

Maybe rename dim? I guess it is n?

Member Author

We have many occurrences; I took the chance to change them all from dim (which we use with a different connotation in most places of deal.II, so I agree with your assessment) to n. The one downside is that n is now used in many places where dim was before, so it might be debatable. I left it as a separate commit; please express your opinion so we can see which we like more.

}



template <typename VectorType,
template <bool delayed_reorthogonalization,
Member

How much slower is the code without the template argument?

Member Author

I use fixed-size arrays that depend on this value in order to get register loop blocking as good as reasonably possible: I want to load 12 values in the classical Gram-Schmidt path and 6 values in the new delayed-orthogonalization code path (due to the 2 parallel accumulations for the 2 orthogonalization processes running, the latter again uses 12 registers for partial results and input). I can try with 6 in both cases if you want, and measure a case from cache and a case from main memory (the latter should be memory-bandwidth bound, but the shorter the loop, the more likely visible effects from the translation lookaside buffer or prefetching limitations become). I guess this just made a case for measuring, didn't it?
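The compile-time blocking described above can be sketched as follows. This is illustrative only, not the deal.II kernel; the function and parameter names are mine, and only the idea (a boolean template parameter fixing the block width so partial sums can live in registers) comes from the discussion.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative only: a compile-time flag picks the register-blocking
// width, here 12 basis vectors per sweep for classical Gram-Schmidt and
// 6 for the delayed variant (which carries two accumulators per vector,
// i.e. again 12 registers of state). Computes sums[i] = <vecs[i], w>.
template <bool delayed_reorthogonalization>
void batched_dots(const std::vector<std::vector<double>> &vecs,
                  const std::vector<double>              &w,
                  std::vector<double>                    &sums)
{
  constexpr std::size_t block = delayed_reorthogonalization ? 6 : 12;
  sums.assign(vecs.size(), 0.0);
  std::size_t i = 0;
  for (; i + block <= vecs.size(); i += block)
    {
      std::array<double, block> partial{}; // partial sums stay in registers
      for (std::size_t j = 0; j < w.size(); ++j)
        for (std::size_t k = 0; k < block; ++k)
          partial[k] += vecs[i + k][j] * w[j];
      for (std::size_t k = 0; k < block; ++k)
        sums[i + k] = partial[k];
    }
  for (; i < vecs.size(); ++i) // remainder: unblocked
    for (std::size_t j = 0; j < w.size(); ++j)
      sums[i] += vecs[i][j] * w[j];
}
```

With the block width a compile-time constant, the inner loop has a fixed trip count and the compiler can fully unroll it and assign one accumulator per register; a runtime block width would force either a variable-length inner loop or a branch in the hot path.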

Member

Thank you for the explanation. I am fine with the current status.

Member Author

I ran a short benchmark, using a locally owned size of 1000 (cached) or 1000000 (not cached). Here is the result with the current code on 72 Intel Ice Lake cores:

Timing MGS size 1000: 5.63975e-05 5.63983e-05 s or 326.823 GB/s
Timing MGS size 1000000: 0.11909 0.119101 s or 154.773 GB/s
Timing CGS size 1000: 1.47914e-05 1.47922e-05 s or 1246.13 GB/s
Timing CGS size 1000000: 0.0689767 0.0689878 s or 267.221 GB/s
Timing DCGS2 size 1000: 1.22783e-05 1.2279e-05 s or 1501.19 GB/s
Timing DCGS2 size 1000000: 0.066445 0.0664467 s or 277.402 GB/s

Here CGS is classical Gram-Schmidt, MGS is modified G-S, and DCGS2 is the classical Gram-Schmidt method with delayed reorthogonalization (proposed patch). On this system, the new algorithm runs even faster than the classical Gram-Schmidt approach, whereas I observed the opposite on my notebook. Without the template, I see

Timing MGS size 1000: 5.56948e-05 5.56949e-05 s or 330.947 GB/s
Timing MGS size 1000000: 0.121417 0.121438 s or 151.808 GB/s
Timing CGS size 1000: 1.5118e-05 1.51188e-05 s or 1219.21 GB/s
Timing CGS size 1000000: 0.069091 0.0691066 s or 266.779 GB/s
Timing DCGS2 size 1000: 1.24458e-05 1.24465e-05 s or 1480.98 GB/s
Timing DCGS2 size 1000000: 0.0647302 0.0647319 s or 284.751 GB/s

As you can see, the difference is not big (comparing to MGS gives an impression of the noise level). From L2 cache, the performance is slightly lower, but not spectacularly so. I think we can keep the templates for now; we can always adjust later, as I only had to change 15 lines.

Member Author

A note on why the non-templated variant is so close: if I read the generated assembly correctly, gcc creates two copies of the inner loop, one for the delayed algorithm and one for the default, in order to keep the checks out of the critical path. I think we can simply generate the templated versions (I will refactor the actual inner loop into a .cc file in an upcoming PR, because I want to pre-compile the code).

Member

Thanks for the investigation!

@masterleinad (Member) left a comment


Just a couple typos.

* Use classical Gram-Schmidt algorithm with two orthogonalization
* iterations and delayed orthogonalization using the algorithm described
* in @cite Bielich2022. This approach works on multi-vectors with a
* single global reduction (of multiple elements) and more efficient than
Member

Suggested change:
- * single global reduction (of multiple elements) and more efficient than
+ * single global reduction (of multiple elements) and is more efficient than

* number. Calls the signals eigenvalues_signal and cond_signal with these
* estimates as arguments.
* during the inner iterations for @p n vectors in total. Uses these
* estimate to compute the condition number. Calls the signals
Member

Suggested change:
- * estimate to compute the condition number. Calls the signals
+ * estimates to compute the condition number. Calls the signals

// one where the orthogonalization has finished (i.e., end of inner
// iteration in GMRES) and we can safely overwrite the content of the
// tridiagonal matrix and right hand side, and the case during the inner
// iterations where need to create copies of the matrices in the QR
Member

Suggested change:
- // iterations where need to create copies of the matrices in the QR
+ // iterations where we need to create copies of the matrices in the QR

@masterleinad (Member)

> Once we are happy with the direction and the PR #16745 is merged, I will also write a changelog for this PR.

Do you want to write a changelog entry or update one of the ones in #16745?

@kronbichler (Member, Author)

I will update the other changelog entry, but I would like to postpone this to an upcoming PR in which I want to switch the default GMRES algorithm to the new variant. (I wanted to do that separately to have better control over the test suite and over where changes originate.)

@kronbichler kronbichler merged commit 8a1834b into dealii:master Mar 19, 2024
16 checks passed
@kronbichler kronbichler deleted the gmres_delayed_reorthogonalization branch March 19, 2024 17:13

Successfully merging this pull request may close these issues.

SolverGMRES: Document roundoff issues and orthogonalization strategies
Clean up SolverGMRES and SolverFGMRES
3 participants