MatrixFree: Store indices of Dirichlet constrained DoFs as -1 #14324

Open · wants to merge 3 commits into master from store_dirichlet_indices
Conversation

kronbichler (Member):

Implement the first suggestion in #14098 (comment): With this change, we would also store the Dirichlet constrained indices in MatrixFree/DoFInfo, encoded as numbers::invalid_unsigned_int. We then check at the point of reading the indices whether the value is numbers::invalid_unsigned_int, and set up a mask based on that information to selectively read from the source vector. This sounds like a strange way to do things in the scalar case, but in the vectorized case it turns out to be a real win, because it only adds 1 additional instruction in the AVX-512 case (vpcmpnequd) and only 2 instructions with AVX2. These instructions are all much cheaper than the underlying vgatherdpd / vscatterdpd instructions, so the cost is really not that high.
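
For illustration, a minimal sketch of that masked gather for double on an AVX-512 target (assuming AVX512F plus AVX512VL for the compare; the function name is made up, and the code mirrors the idea rather than the actual deal.II implementation):

#include <immintrin.h>

// Gather 8 doubles from base[index[v]]; lanes whose 32-bit index equals
// numbers::invalid_unsigned_int (i.e., -1 as unsigned) are skipped.
inline __m512d
gather_with_dirichlet_mask(const double *base, const __m256i index)
{
  const __m256i invalid = _mm256_set1_epi32(-1);
  // The one extra instruction (vpcmpnequd) on top of the plain gather:
  const __mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
  // Masked-out lanes keep the value of the first argument (zero here),
  // and their memory is never accessed, so invalid indices cannot fault.
  return _mm512_mask_i32gather_pd(_mm512_setzero_pd(), mask, index, base, 8);
}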

This PR is not complete yet:

  • there are a few cases where I do not yet filter away the diagonal (most notably the compute_diagonal function), so expect tests to fail
  • I should probably not check for invalid_unsigned_int in the regular gather function itself but in a separate function, because I assume we still want to avoid the overhead of building the mask in the common case. Or maybe not, I don't know yet.

Apart from these todos, I wanted to post this for discussion at this point and for possible interaction with @peterrum regarding the second possible improvement listed in the issue mentioned above. I note that performance on an AVX-512 target is around 5% higher for double variables and around 10% higher for float variables.

Comment on lines +1152 to +1167
const __m256i invalid = _mm256_set1_epi32(numbers::invalid_unsigned_int);
__mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
kronbichler (Member, Author):

Just for the record, this function is not in AVX512F but in AVX512VL, so we probably want to guard this with an ifdef.
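
One possible shape of that guard, as a sketch (the scalar fallback is illustrative, not the actual deal.II code):

#ifdef __AVX512VL__
  const __mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
#else
  // No AVX512VL available (e.g., KNL): build the mask lane by lane.
  __mmask8 mask = 0;
  alignas(32) unsigned int idx[8];
  _mm256_store_si256(reinterpret_cast<__m256i *>(idx), index);
  for (unsigned int v = 0; v < 8; ++v)
    if (idx[v] != numbers::invalid_unsigned_int)
      mask |= static_cast<__mmask8>(1 << v);
#endif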

Member:

Is https://en.wikichip.org/wiki/x86/avx-512 correct in that this is only a problem for KNL?

kronbichler force-pushed the store_dirichlet_indices branch 2 times, most recently from 52b278b to 1a9b4b5 on September 29, 2022.
@peterrum (Member) left a comment:

I like the modification, since we no longer store a pointer into the constraints cache (to an empty entry) for homogeneous DBCs. Also, in a follow-up PR we can use these -1 values as markers for inhomogeneities.

Question: do you have any idea how we can prevent the development in ConstraintInfo and the code here from diverging?

touched_first_by[myindex] = chunk;
touched_last_by[myindex] = chunk;
}
if (dof_indices[it] != numbers::invalid_unsigned_int)
Member:

I wrote the same comparison yesterday in another project :D

@peterrum (Member):

I have been playing with this PR a bit (not yet from a performance point of view). I have written a small test:

#include <deal.II/base/aligned_vector.h>
#include <deal.II/base/vectorization.h>

#include <iostream>

using namespace dealii;

int
main()
{
  using VectorizedArrayType = VectorizedArray<double>;

  AlignedVector<double> data(100);

  for (unsigned int i = 0; i < data.size(); ++i)
    data[i] = i + 1.0;

  // All lanes start out with an invalid index ...
  std::vector<unsigned int> indices(VectorizedArrayType::size(),
                                    numbers::invalid_unsigned_int);

  // ... and only the first two lanes get valid indices.
  indices[0] = 1;
  indices[1] = 4;

  if (true) // working
    {
      VectorizedArrayType temp;
      temp.gather(data.data(), indices.data());

      std::cout << temp << std::endl;
    }

  if (false) // not working
    {
      AlignedVector<VectorizedArrayType> temp(4);
      vectorized_load_and_transpose(temp.size(),
                                    data.data(),
                                    indices.data(),
                                    temp.data());

      for (const auto i : temp)
        std::cout << i << std::endl;
    }
}

The first part tries out VA::gather() and the second one vectorized_load_and_transpose(). The first one works nicely, but the second one does not. This is not surprising, since the indices are used directly (also -1). Making this work would allow us to remove some if statements in the code, if I am not mistaken.

@peterrum (Member):

As a reminder: I think the non-vectorized versions also have to be updated.
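
(A minimal sketch of what that scalar guard could look like; all names here are illustrative, not the actual deal.II internals:)

// Hypothetical scalar counterpart of the masked gather: lanes with an
// invalid index read zero instead of touching the source vector.
for (unsigned int v = 0; v < n_lanes; ++v)
  local_values[v] = (dof_indices[v] != numbers::invalid_unsigned_int) ?
                      src[dof_indices[v]] :
                      0.;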

@kronbichler (Member, Author):

The first part tries out VA::gather() and the second one vectorized_load_and_transpose(). The first one works nicely, but the second one does not.

The case with vectorized_load_and_transpose is not very easy to address because of the underlying costs: if we add a check to the inner loops, we introduce additional instructions in code that is already at least partly limited by instruction throughput. More precisely, the shuffle units that are heavily used by this code might also need to take care of the additional branching. We could instead try to make the loads conditional via masks that fill the unused lanes with zeros; one would need to look up the instructions for that.

This is in contrast to the regular gather calls, where the cost is the single gather (or scatter) operation, so an additional instruction does not matter as much.

As I said previously, we could imagine a solution where we have two sets of gather functions (one with checks, one without), and then do the same for vectorized_load_and_transpose. That is a bit of extra code and we need some flag, but maybe it is still worth it.
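
For the zero-filled masked loads mentioned above, AVX512VL does provide a suitable instruction. A sketch of how it could look inside a checked variant of vectorized_load_and_transpose (illustrative only; in and indices stand for the source pointer and index array, and each 256-bit load is assumed to be either fully valid or fully skipped):

// Zero-masked load: masked-out elements are set to zero and their
// memory is never accessed, so an invalid index does not fault.
__m256d t[8];
for (unsigned int v = 0; v < 8; ++v)
  {
    const bool     valid = indices[v] != numbers::invalid_unsigned_int;
    const __mmask8 mask  = valid ? 0xF : 0x0;
    t[v] = _mm256_maskz_loadu_pd(mask, in + (valid ? indices[v] : 0u));
  }
// ... followed by the usual transpose of t into the output lanes.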

@kronbichler (Member, Author):

/rebuild

@kronbichler (Member, Author):

This is too risky for the release, postponing to right after the release.

kronbichler modified the milestones: Release 9.5 → Release 9.6 on Jun 12, 2023.
@bangerth (Member):

I love that there is an instruction vpcmpnequd. They must have constructed these instruction names the same way people tried to encode things in 6-char BLAS names...
