
Provide (user code) switchable vectorization support #11719

Open
3 tasks
tamiko opened this issue Feb 9, 2021 · 9 comments

Comments

@tamiko
Member

tamiko commented Feb 9, 2021

Our current configuration logic for SIMD vectorization support is as follows:

  • We select the level of vectorization support statically at (deal.II library) configuration time and record it via DEAL_II_VECTORIZATION_WIDTH_IN_BITS (the successor of DEAL_II_VECTORIZATION_LEVEL).
  • If we detect that user code includes base/vectorization.h with a lower degree of vectorization support than the library was configured with, we throw an error.

Now all of this works really well when you compile the library and your application code for specific hardware - say a laptop/desktop, small compute server, or cluster. Unfortunately, it doesn't work that well for the binary deal.II library we ship with Debian/Ubuntu. Here, we have to compile the library in a very generic configuration so that it can run on all supported machines for an architecture.

It would be very nice to allow for "dynamically" chosen (i.e., at user project configuration/compilation time) vectorization support, so that all binary-distributed deal.II versions become "first class" vectorization citizens as well.

Achieving this should not be too difficult in principle - after all, most of our explicitly vectorized code is templated and compiled in user code. An approach might be to

  • set DEAL_II_VECTORIZATION_WIDTH_IN_BITS dynamically in config.h depending on the current vectorization support,
  • make the default width of VectorizedArray depend on that value, and
  • ensure that we instantiate all necessary vectorization widths for deal.II-internal instantiations.

This has one drawback, though: different compilation units might have different notions of VectorizedArray, depending on compiler flags.

Or am I having a fever dream and we already support this by just using VectorizedArray with an explicit template width?

Opinions?

@kronbichler ping

@tamiko tamiko added this to the Release 9.3 milestone Feb 9, 2021
@bangerth
Member

I instantly thought what you thought as well: "This is a bad idea, because then the notion of what VectorizedArray is differs between the deal.II library and the executable that links against it." I think that if you want to make this work, you either

  • have to be exceedingly careful about how things work when vectorization is and is not available, or
  • have to make sure that such objects never cross the public library interface.

@tamiko
Member Author

tamiko commented Feb 10, 2021

@bangerth The main issue is really that having a default value for the VectorizedArray width template argument might lead to situations where one misses instantiating code.

Differing default arguments might also lead to API issues - but I think that with the latest change allowing one to "downgrade" vectorization (e.g., using only AVX2 instead of AVX512 even though the latter is configured), we should have everything templated with the SIMD width as well.

I don't think that we would be facing any ABI issues. After all, the SIMD width is a template argument, so it is part of the API (and thus ABI).

I just want to point out that I think this is a real issue. We have gone to considerable effort and length to reliably package deal.II for Debian/Ubuntu and other Linux distributions, and I want binary distributions of the library to be "first class" citizens in terms of features as well.

@peterrum
Member

This is quite an interesting topic.

but I think that with the latest change allowing one to "downgrade" vectorization (e.g., using only AVX2 instead of AVX512 even though the latter is configured), we should have everything templated with the SIMD width as well.

This sounds reasonable, but I am not sure that the issue is limited to VectorizedArray only. You cannot "downgrade" the selected instruction-set architecture in the rest of the code, i.e., force VectorizedArray to be compiled up to AVX512 while the rest of the code is compiled only up to SSE2, can you? (Maybe if we compile the different compilation units with different compiler flags.)

Furthermore, we are using MatrixFree also within the library. E.g.:

std::shared_ptr<MatrixFree<dim, Number>> matrix_free(
  new MatrixFree<dim, Number>());

where we implicitly select the highest ISA. What would be the solution here?

@kronbichler
Member

I agree with the general outline regarding the steps. Overall, we are in a much better state than two years ago thanks to #8342 and some follow-up work we did there, because our ABI now supports other vectorization variants.

You cannot "downgrade" the selected instruction-set architecture in the rest of the code, i.e., force to compile VectorizedArray up to AVX512 and the rest of the code only up to SSE2, can you? (Maybe if we treat the different compilation units with different compilation flags).

I think the problem is not so much the vectorization width (that can be solved, as suggested by @tamiko above, by making sure all visible interfaces are instantiated appropriately), but rather the instruction set support in general that we can expect. If we compile the deal.II library for an AVX-512 target, we cannot use the library on a machine without AVX-512 support: even outside the code in VectorizedArray, the compiler will emit some AVX-512 instructions somewhere, which leads to illegal-instruction errors. The solution to that is to compile multiple versions of the affected functions; I think gcc calls this "target clones". I have not looked into what it would take to produce an appropriate solution, but technically it should be doable.

Of course, we could find some intermediate approach: If we have different binaries with support for a specific instruction set (say, an AVX-512 and an AVX2 target, which should cover most Intel/AMD machines sold today), we could then allow users to go below the vectorization width compiled into deal.II in their codes and simply pick their favorite VectorizedArray with a default argument for the width. A user might still pay some penalty when switching between differently encoded parts (like VEX vs. legacy-SSE encodings, vaddsd vs. addsd instructions) in terms of execution on the CPU, but that is not our primary concern when distributing the library.

Furthermore, we are using MatrixFree also within the library.

That should not be a problem, as long as the user's CPU understands the instruction set extension, because it does not leak to the outside world here; what we need to be careful about is that we instantiate all widths so that a user entering with a different default for the width gets valid code.

@kronbichler
Member

That should not be a problem, as long as the user's CPU understands the instruction set extension, because it does not leak to the outside world here; what we need to be careful about is that we instantiate all widths so that a user entering with a different default for the width gets valid code.

Just to give an example: We template even these classes here which are pure consumers of MatrixFree,
https://dealii.org/developer/doxygen/deal.II/classMatrixFreeOperators_1_1LaplaceOperator.html
with the vectorized array type, which means that we have already come pretty far toward enabling a good user experience.

@bangerth
Member

I think what @tamiko has in mind is compiling the library with SSE2 (the minimal supported instruction set on x86_64) but allowing the user to use higher vectorization levels. I wonder whether that would make a substantial difference in practice -- I'm sure that some percentage of time is actually spent in user code and in inlined functions that would benefit from it, principally in the assembly of matrices or matrix-free operators. But surely a good amount of time is also spent in library functions that would not benefit from this.

The idea of variants could be implemented either by the compiler (very expensive, because everything would have to be compiled more than once) or in the form of building multiple shared libraries that contain all code, and then libdealII.so simply dlopens the correct one for the instruction set we determine at run time.

@kronbichler
Member

kronbichler commented Feb 10, 2021

Ah, that would mean we create a VectorizedArray<double,8> also for SSE2 by concatenating arithmetic operations on several sub-arrays? That would be worth a try while keeping things ABI-compatible (to the extent that an array of two __m128d is ABI-compatible with one __m256). At least as long as the majority of SIMD execution happens in user code, most of the performance could still be retained.

@peterrum
Member

peterrum commented May 1, 2021

This requires more time than we have at the moment. Let's postpone this and revisit the issue during the workshop!

@peterrum peterrum modified the milestones: Release 9.3, Release 10.0 May 1, 2021
@peterrum
Member

peterrum commented Apr 22, 2022

@tamiko @kronbichler I think you talked about this topic recently. Do we have a plan here?
