
Provide (user code) switchable vectorization support #11719

Open
3 tasks
tamiko opened this issue Feb 9, 2021 · 9 comments

Comments

@tamiko
Member

tamiko commented Feb 9, 2021

Our current configuration logic for SIMD vectorization support is as follows:

  • We select the level of vectorization support statically at (deal.II library) configuration time and record it via DEAL_II_VECTORIZATION_WIDTH_IN_BITS (the successor of DEAL_II_VECTORIZATION_LEVEL).
  • If we detect that user code includes base/vectorization.h with a lower degree of vectorization support than the library was configured with, we throw an error.

Now all of this works really well when you compile the library and your application code for specific hardware - say a laptop/desktop, small compute server, or cluster. Unfortunately, it doesn't work that well for the binary deal.II library we ship with Debian/Ubuntu. Here, we have to compile the library in a very generic configuration so that it can run on all supported machines for an architecture.

It would be very nice to allow for "dynamically" chosen (i.e., at user project configuration/compilation time) vectorization support, so that all binary-distributed deal.II versions become "first class" vectorization citizens as well.

Achieving this should not be too difficult in principle - after all, most of our explicitly vectorized code is templated and compiled in user code. An approach might be to

  • set DEAL_II_VECTORIZATION_WIDTH_IN_BITS dynamically in config.h depending on the current vectorization support,
  • make the default width of VectorizedArray depend on that value, and
  • ensure that we instantiate all necessary vectorization widths for deal.II-internal instantiations.

This has one drawback, though: different compilation units might have different notions of VectorizedArray, depending on compiler flags.

Or am I having a fever dream and we already support this by just using VectorizedArray with an explicit template width?

Opinions?

@kronbichler ping

@tamiko tamiko added this to the Release 9.3 milestone Feb 9, 2021
@bangerth
Member

I instantly thought what you thought as well: "This is a bad idea, because then the notion of what VectorizedArray is differs between the deal.II library and the executable that links against it." I think that if you want to make this work, you either

  • have to be exceedingly careful about how things work when vectorization is and is not available, or
  • have to make sure that such objects never cross the public library interface.

@tamiko
Member Author

tamiko commented Feb 10, 2021

@bangerth The main issue is really that having a default value for the VectorizedArray width template argument might lead to situations where one misses instantiating code.

Differing default arguments might also lead to API issues - but I think that with the latest change allowing one to "downgrade" vectorization (e.g., using only AVX2 instead of AVX512 even though the latter is configured), we should have everything templated with the SIMD width as well.

I don't think that we would be facing any ABI issues. After all, the SIMD width is a template argument, so it is part of the API (and thus ABI).

I just want to point out that I think this is a real issue. We have gone to considerable effort and length to reliably package deal.II for Debian/Ubuntu and other Linux distributions, and I want binary distributions of the library to be "first class" citizens in terms of features as well.

@peterrum
Member

This is quite an interesting topic.

but I think that with the latest change allowing one to "downgrade" vectorization (e.g., using only AVX2 instead of AVX512 even though the latter is configured), we should have everything templated with the SIMD width as well.

This sounds reasonable, but I am not sure that the issue is limited to VectorizedArray only. You cannot "downgrade" the selected instruction-set architecture in the rest of the code, i.e., force VectorizedArray to be compiled up to AVX512 while the rest of the code is compiled only up to SSE2, can you? (Maybe if we compile the different compilation units with different compiler flags.)

Furthermore, we are using MatrixFree also within the library. E.g.:

std::shared_ptr<MatrixFree<dim, Number>> matrix_free(
  new MatrixFree<dim, Number>());

where we implicitly select the highest ISA. What would be the solution here?

@kronbichler
Member

I agree with the general outline regarding the steps. Overall, we are in a much better state than two years ago thanks to #8342 and some follow-up work we did there, because our ABI now supports other vectorization variants.

You cannot "downgrade" the selected instruction-set architecture in the rest of the code, i.e., force to compile VectorizedArray up to AVX512 and the rest of the code only up to SSE2, can you? (Maybe if we treat the different compilation units with different compilation flags).

I think the problem is not so much the vectorization width (that can be solved, as suggested by @tamiko above, by making sure all visible interfaces are instantiated appropriately), but rather the instruction set support in general that we can expect. If we compile the deal.II library for an AVX-512 target, we cannot use the library on a machine without AVX-512 support: even outside the code in VectorizedArray, the compiler will emit some AVX-512 instructions somewhere, which leads to illegal-instruction errors. The solution to that is to compile multiple versions of the affected functions; I think gcc calls this "target clones". I have not looked into what it would take to produce an appropriate solution, but technically it should be doable.

Of course, we could find some intermediate approach: If we have different binaries with support for a specific instruction set (say, an AVX-512 and an AVX2 target, which should cover most Intel/AMD machines sold today), we could then allow users to go below the vectorization width compiled into deal.II in their codes and simply pick their favorite VectorizedArray with a default argument for the width. A user might still pay some penalty when switching between differently encoded parts (like VEX vs. legacy-SSE encodings, vaddsd vs. addsd instructions) in terms of execution on the CPU, but that is not our primary concern when distributing the library.

Furthermore, we are using MatrixFree also within the library.

That should not be a problem, as long as the user's CPU understands the instruction set extension, because it does not leak to the outside world here; what we need to be careful about is that we instantiate all widths so that a user entering with a different default for the width gets valid code.

@kronbichler
Member

That should not be a problem, as long as the user's CPU understands the instruction set extension, because it does not leak to the outside world here; what we need to be careful about is that we instantiate all widths so that a user entering with a different default for the width gets valid code.

Just to give an example: We template even these classes here which are pure consumers of MatrixFree,
https://dealii.org/developer/doxygen/deal.II/classMatrixFreeOperators_1_1LaplaceOperator.html
with the vectorized array type, which means that we have already come pretty far toward enabling a good user experience.

@bangerth
Member

I think what @tamiko has in mind is compiling the library with SSE2 (the minimal supported instruction set on x86_64) but allowing the user to use higher vectorization levels. I wonder whether that would make a substantial difference in practice -- I'm sure that some percentage of time is actually spent in user code and in inlined functions that would benefit from it, principally in the assembly of matrices or matrix-free operators. But surely a good amount of time is also spent in library functions that would not benefit from this.

The idea of variants could be implemented either by the compiler (very expensive, because everything would have to be compiled more than once) or in the form of building multiple shared libraries that contain all code, and then libdealII.so simply dlopens the correct one for the instruction set we determine at run time.

@kronbichler
Member

kronbichler commented Feb 10, 2021

Ah, that would mean we create a VectorizedArray<double,8> also for SSE2 by concatenating arithmetic operations on several sub-arrays? That would be worth a try while keeping things ABI-compatible (to the extent that an array of two __m128d is ABI-compatible with one __m256). At least as long as the majority of SIMD execution happens in user code, most of the performance could still be retained.

@peterrum
Member

peterrum commented May 1, 2021

This requires more time than we have at the moment. Let's postpone this and revisit the issue during the workshop!

@peterrum peterrum modified the milestones: Release 9.3, Release 10.0 May 1, 2021
@peterrum
Member

peterrum commented Apr 22, 2022

@tamiko @kronbichler I think you talked about this topic recently. Do we have a plan here?
