
SolverGMRES: Implement classical Gram-Schmidt with delayed reorthogonalization #16749

Merged

Conversation

@kronbichler (Member) commented Mar 13, 2024

This PR currently builds on top of #16745, but only the last two commits are relevant for this PR.

This PR implements a new orthogonalization variant in SolverGMRES and SolverFGMRES. It performs reorthogonalization unconditionally, but it does so in a smart way: in order not to increase the number of global reductions and the memory access compared to the classical Gram-Schmidt process, it performs the reorthogonalization one iteration later. This is called delayed reorthogonalization. It is highly robust, and I would like to make it the default GMRES implementation in a later PR. The implementation is essentially Algorithm 4 in the recent contribution by Bielich et al. (2022).
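To illustrate the cost that the delayed variant avoids, here is a minimal sketch of the non-delayed baseline, classical Gram-Schmidt with an unconditional second pass (CGS2). This is illustrative code, not the deal.II implementation; all names are mine. Each pass batches all inner products into one reduction, so CGS2 needs two global reductions per iteration, whereas the delayed variant of this PR folds the second pass of step j into the single reduction of step j+1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec &a, const Vec &b)
{
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    s += a[i] * b[i];
  return s;
}

// Classical Gram-Schmidt with a second, unconditional pass (CGS2).
// Each pass batches all inner products against the existing basis into
// one reduction, i.e. two global reductions per new vector. The delayed
// variant of this PR (Algorithm 4 in Bielich et al., 2022) instead
// merges the second pass of step j into the reduction of step j+1,
// keeping the reduction count of plain classical Gram-Schmidt.
void cgs2_orthonormalize(std::vector<Vec> &basis, Vec w)
{
  for (int pass = 0; pass < 2; ++pass)
    {
      std::vector<double> h(basis.size());
      for (std::size_t i = 0; i < basis.size(); ++i)
        h[i] = dot(basis[i], w); // one batched reduction per pass
      for (std::size_t i = 0; i < basis.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
          w[j] -= h[i] * basis[i][j];
    }
  const double norm = std::sqrt(dot(w, w));
  for (double &x : w)
    x /= norm;
  basis.push_back(w);
}
```

The second pass is what makes CGS2 robust against the loss of orthogonality that plain CGS suffers in finite precision; the point of the PR is to get that robustness without paying for the extra reduction.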

The algorithm implemented here uses the switch originally introduced in #14349 to support matrix-vector operations within the orthogonalization scheme, which is the primary motivation to use classical Gram-Schmidt in the first place; for other vector types, I cannot really do much optimization because our wrappers do not allow it (and one should use PETSc's own or Trilinos' Belos solvers to get the optimized methods). I plan to extend this also to dealii::Vector and dealii::BlockVector in an upcoming PR, because we should really use the optimized path for all of our own vectors.
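For context, selecting the new strategy from user code would look roughly like the following sketch. It is not compilable on its own: it assumes deal.II's existing SolverGMRES/AdditionalData interface and the enumerator name added by this PR, so treat the names (in particular delayed_classical_gram_schmidt and the surrounding objects) as provisional.

```cpp
// Sketch only: assumes the deal.II solver interfaces and the new
// OrthogonalizationStrategy enumerator introduced by this PR.
SolverControl control(1000, 1e-12 * rhs.l2_norm());

SolverGMRES<VectorType>::AdditionalData data;
data.orthogonalization_strategy =
  LinearAlgebra::OrthogonalizationStrategy::delayed_classical_gram_schmidt;

SolverGMRES<VectorType> solver(control, data);
solver.solve(matrix, solution, rhs, preconditioner);
```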

Once we are happy with the direction and the PR #16745 is merged, I will also write a changelog for this PR.

Fixes #16077, fixes #14864.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch 2 times, most recently from e5c008e to b5771a4 Compare March 14, 2024 09:36
@kronbichler (Member, Author)

I have now implemented the orthogonalization in a dedicated class rather than in free functions. This helps me keep all variables local to the context in which they are used and, overall, encapsulates the Arnoldi orthogonalization better. Along the way, I added extensive documentation of the new class to make sure the interface is understandable. I think the concept is now much better, which can be seen in how compact the solve functions of both GMRES solvers have become.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch from a221d61 to 8ca0a3c Compare March 15, 2024 06:12
@kronbichler (Member, Author)

Rebased after the merge of #16745.

@kronbichler kronbichler force-pushed the gmres_delayed_reorthogonalization branch from 8ca0a3c to d4b8132 Compare March 15, 2024 08:17
@peterrum (Member) left a comment


Very nice 👍

const LinearAlgebra::OrthogonalizationStrategy orthogonalization_strategy,
const internal::SolverGMRESImplementation::TmpVectors<VectorType>
&orthogonal_vectors,
ArnoldiProcess::orthonormalize_nth_vector(
Member

Maybe rename dim? I guess it is n?

Member Author

We have many occurrences; I took the chance to change them all from dim (which we use with a different connotation in most places of deal.II, so I agree with your assessment) to n. The one downside is that n is now used in many places where dim was before, so it might be debatable. I left it as a separate commit; please express your opinion so we can see which we like more.

}



template <typename VectorType,
template <bool delayed_reorthogonalization,
Member

How much slower is the code without the template argument?

Member Author

I use fixed-size arrays that depend on this value in order to get register loop blocking as good as reasonably possible: I want to load 12 values in the classical Gram-Schmidt path and 6 values in the new delayed-orthogonalization code path (due to the 2 parallel accumulations for the 2 orthogonalization processes running, the latter again uses 12 registers for partial results and input). I can try with 6 in both cases if you want, and measure a case from cache and a case from main memory (the latter should be memory-bandwidth bound, but the shorter the loop, the more likely visible effects from the translation lookaside buffer or prefetching limitations become). I guess this just made a case for measuring, didn't it?
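The compile-time blocking described above can be sketched as follows. This is illustrative only, not the deal.II kernel; the function and parameter names are mine, and only the idea (a boolean template parameter fixing the block width so partial sums can live in registers) comes from the discussion.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative only: a compile-time flag picks the register-blocking
// width, here 12 basis vectors per sweep for classical Gram-Schmidt and
// 6 for the delayed variant (which carries two accumulators per vector,
// i.e. again 12 registers of state). Computes sums[i] = <vecs[i], w>.
template <bool delayed_reorthogonalization>
void batched_dots(const std::vector<std::vector<double>> &vecs,
                  const std::vector<double>              &w,
                  std::vector<double>                    &sums)
{
  constexpr std::size_t block = delayed_reorthogonalization ? 6 : 12;
  sums.assign(vecs.size(), 0.0);
  std::size_t i = 0;
  for (; i + block <= vecs.size(); i += block)
    {
      std::array<double, block> partial{}; // partial sums stay in registers
      for (std::size_t j = 0; j < w.size(); ++j)
        for (std::size_t k = 0; k < block; ++k)
          partial[k] += vecs[i + k][j] * w[j];
      for (std::size_t k = 0; k < block; ++k)
        sums[i + k] = partial[k];
    }
  for (; i < vecs.size(); ++i) // remainder: unblocked
    for (std::size_t j = 0; j < w.size(); ++j)
      sums[i] += vecs[i][j] * w[j];
}
```

With the block width a compile-time constant, the inner loop has a fixed trip count and the compiler can fully unroll it and assign one accumulator per register; a runtime block width would force either a variable-length inner loop or a branch in the hot path.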

Member

Thank you for the explanation. I am fine with the current status.

Member Author

I ran a short benchmark, using a locally owned size of 1000 (cached) or 1000000 (not cached). Here is the result with the current code on 72 Intel Ice Lake cores:

Timing MGS size 1000: 5.63975e-05 5.63983e-05 s or 326.823 GB/s
Timing MGS size 1000000: 0.11909 0.119101 s or 154.773 GB/s
Timing CGS size 1000: 1.47914e-05 1.47922e-05 s or 1246.13 GB/s
Timing CGS size 1000000: 0.0689767 0.0689878 s or 267.221 GB/s
Timing DCGS2 size 1000: 1.22783e-05 1.2279e-05 s or 1501.19 GB/s
Timing DCGS2 size 1000000: 0.066445 0.0664467 s or 277.402 GB/s

Here CGS is classical Gram-Schmidt, MGS is modified G-S, and DCGS2 is the classical Gram-Schmidt method with delayed reorthogonalization (proposed patch). On this system, the new algorithm runs even faster than the classical Gram-Schmidt approach, whereas I observed the opposite on my notebook. Without the template, I see

Timing MGS size 1000: 5.56948e-05 5.56949e-05 s or 330.947 GB/s
Timing MGS size 1000000: 0.121417 0.121438 s or 151.808 GB/s
Timing CGS size 1000: 1.5118e-05 1.51188e-05 s or 1219.21 GB/s
Timing CGS size 1000000: 0.069091 0.0691066 s or 266.779 GB/s
Timing DCGS2 size 1000: 1.24458e-05 1.24465e-05 s or 1480.98 GB/s
Timing DCGS2 size 1000000: 0.0647302 0.0647319 s or 284.751 GB/s

As you can see, the difference is not big (comparing to MGS gives an impression of the noise level). From L2 cache, the performance is slightly lower, but not spectacularly so. I think we can keep the templates for now; we can always adjust later, as I only had to change 15 lines.

Member Author

A note on why the non-templated variant is so close: if I read the generated assembly correctly, gcc creates two copies of the inner loop, one for the delayed algorithm and one for the default, in order to keep the checks out of the critical path. I think we can simply generate the templated versions (I will refactor the actual inner loop into a .cc file in an upcoming PR, because I want to pre-compile the code).

Member

Thanks for the investigation!

@masterleinad (Member) left a comment


Just a couple typos.

* Use classical Gram-Schmidt algorithm with two orthogonalization
* iterations and delayed orthogonalization using the algorithm described
* in @cite Bielich2022. This approach works on multi-vectors with a
* single global reduction (of multiple elements) and more efficient than
Member

Suggested change:
- * single global reduction (of multiple elements) and more efficient than
+ * single global reduction (of multiple elements) and is more efficient than

* number. Calls the signals eigenvalues_signal and cond_signal with these
* estimates as arguments.
* during the inner iterations for @p n vectors in total. Uses these
* estimate to compute the condition number. Calls the signals
Member

Suggested change:
- * estimate to compute the condition number. Calls the signals
+ * estimates to compute the condition number. Calls the signals

// one where the orthogonalization has finished (i.e., end of inner
// iteration in GMRES) and we can safely overwrite the content of the
// tridiagonal matrix and right hand side, and the case during the inner
// iterations where need to create copies of the matrices in the QR
Member

Suggested change:
- // iterations where need to create copies of the matrices in the QR
+ // iterations where we need to create copies of the matrices in the QR

@masterleinad (Member)

> Once we are happy with the direction and the PR #16745 is merged, I will also write a changelog for this PR.

Do you want to write a changelog entry or update one of the ones in #16745?

@kronbichler (Member, Author)

I will update the other changelog entry, but I would like to postpone this to an upcoming PR in which I want to switch the default GMRES algorithm to the new variant. (I wanted to do that separately to have better control over the test suite and over where changes originate.)

@kronbichler kronbichler merged commit 8a1834b into dealii:master Mar 19, 2024
16 checks passed
@kronbichler kronbichler deleted the gmres_delayed_reorthogonalization branch March 19, 2024 17:13

Successfully merging this pull request may close these issues.

SolverGMRES: Document roundoff issues and orthogonalization strategies
Clean up SolverGMRES and SolverFGMRES
3 participants