MatrixFree: Store indices of Dirichlet constrained DoFs as -1 #14324

Open · wants to merge 3 commits into master from store_dirichlet_indices
Conversation

kronbichler (Member):

Implement the first suggestion in #14098 (comment): With this change, we would also store the Dirichlet constrained indices in MatrixFree/DoFInfo, encoded as numbers::invalid_unsigned_int. We then check at the point of reading the indices whether the value is numbers::invalid_unsigned_int, and set up a mask based on that information to selectively read from the source vector. This sounds like a strange way to do things in the scalar case, but in the vectorized case it turns out to be a real win, because it only adds 1 additional instruction in the AVX-512 case (vpcmpnequd) and only 2 instructions with AVX2. These instructions are all much cheaper than the underlying vgatherdpd / vscatterdpd instructions, so the cost is really not that high.
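
For illustration, a minimal sketch of that masked gather for double on an AVX-512 target (assuming AVX512F plus AVX512VL for the compare; the function name is made up, and the code mirrors the idea rather than the actual deal.II implementation):

#include <immintrin.h>

// Gather 8 doubles from base[index[v]]; lanes whose 32-bit index equals
// numbers::invalid_unsigned_int (i.e., -1 as unsigned) are skipped.
inline __m512d
gather_with_dirichlet_mask(const double *base, const __m256i index)
{
  const __m256i invalid = _mm256_set1_epi32(-1);
  // The one extra instruction (vpcmpnequd) on top of the plain gather:
  const __mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
  // Masked-out lanes keep the value of the first argument (zero here),
  // and their memory is never accessed, so invalid indices cannot fault.
  return _mm512_mask_i32gather_pd(_mm512_setzero_pd(), mask, index, base, 8);
}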

This PR is not complete yet:

  • there are a few cases where I do not yet filter away the diagonal (most notably the compute_diagonal function), so expect tests to fail
  • I should probably not check for invalid_unsigned_int in the regular gather function itself but in a separate function, because I assume we still want to avoid the overhead of building the mask in the common case. Or maybe not, I don't know yet.

Apart from these todos, I wanted to post this for discussion at this point and for possible interaction with @peterrum regarding the second possible improvement listed in the issue mentioned above. I note that performance on an AVX-512 target is around 5% higher for double variables and around 10% higher for float variables.

Comment on lines +1152 to +1167
const __m256i invalid = _mm256_set1_epi32(numbers::invalid_unsigned_int);
__mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
kronbichler (Member, Author):

Just for the record, this function is not in AVX512F but in AVX512VL, so we probably want to guard this with an ifdef.
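
One possible shape of that guard, as a sketch (the scalar fallback is illustrative, not the actual deal.II code):

#ifdef __AVX512VL__
  const __mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
#else
  // No AVX512VL available (e.g., KNL): build the mask lane by lane.
  __mmask8 mask = 0;
  alignas(32) unsigned int idx[8];
  _mm256_store_si256(reinterpret_cast<__m256i *>(idx), index);
  for (unsigned int v = 0; v < 8; ++v)
    if (idx[v] != numbers::invalid_unsigned_int)
      mask |= static_cast<__mmask8>(1 << v);
#endif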

Member:

Is https://en.wikichip.org/wiki/x86/avx-512 correct in that this is only a problem for KNL?

kronbichler force-pushed the store_dirichlet_indices branch 2 times, most recently from 52b278b to 1a9b4b5 on September 29, 2022.
@peterrum (Member) left a comment:

I like the modification, since we no longer store a pointer into the constraints cache (to an empty entry) for homogeneous DBCs. Also, in a follow-up PR we can use these -1 values as markers for inhomogeneities.

Question: do you have any idea how we can prevent the development in ConstraintInfo and the code here from diverging?

touched_first_by[myindex] = chunk;
touched_last_by[myindex] = chunk;
}
if (dof_indices[it] != numbers::invalid_unsigned_int)
Member:

I wrote the same comparison yesterday in another project :D

@peterrum (Member):

I have been playing with this PR a bit (not yet from a performance point of view). I have written a small test:

#include <deal.II/base/aligned_vector.h>
#include <deal.II/base/vectorization.h>

#include <iostream>

using namespace dealii;

int
main()
{
  using VectorizedArrayType = VectorizedArray<double>;

  AlignedVector<double> data(100);

  for (unsigned int i = 0; i < data.size(); ++i)
    data[i] = i + 1.0;

  // All lanes start out with an invalid index ...
  std::vector<unsigned int> indices(VectorizedArrayType::size(),
                                    numbers::invalid_unsigned_int);

  // ... and only the first two lanes get valid indices.
  indices[0] = 1;
  indices[1] = 4;

  if (true) // working
    {
      VectorizedArrayType temp;
      temp.gather(data.data(), indices.data());

      std::cout << temp << std::endl;
    }

  if (false) // not working
    {
      AlignedVector<VectorizedArrayType> temp(4);
      vectorized_load_and_transpose(temp.size(),
                                    data.data(),
                                    indices.data(),
                                    temp.data());

      for (const auto i : temp)
        std::cout << i << std::endl;
    }
}

The first part tries out VA::gather() and the second one vectorized_load_and_transpose(). The first one works nicely, but the second one does not. This is not surprising, since the indices are used directly (also -1). Making this work would allow us to remove some if statements in the code, if I am not mistaken.

@peterrum (Member):

As a reminder: I think the non-vectorized versions also have to be updated.
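
(A minimal sketch of what that scalar guard could look like; all names here are illustrative, not the actual deal.II internals:)

// Hypothetical scalar counterpart of the masked gather: lanes with an
// invalid index read zero instead of touching the source vector.
for (unsigned int v = 0; v < n_lanes; ++v)
  local_values[v] = (dof_indices[v] != numbers::invalid_unsigned_int) ?
                      src[dof_indices[v]] :
                      0.;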

@kronbichler (Member, Author):

The first part tries out VA::gather() and the second one vectorized_load_and_transpose(). The first one works nicely, but the second one does not.

The case with vectorized_load_and_transpose is not very easy to address because of the underlying costs: if we add a check to the inner loops, we introduce additional instructions in code that is already at least partly limited by instruction throughput. More precisely, the shuffle units that are heavily used by this code might also need to take care of the additional branching. We could instead try to make the loads conditional via masks that fill the unused lanes with zeros; one would need to look up the instructions for that.

This is in contrast to the regular gather calls, where the cost is the single gather (or scatter) operation, so an additional instruction does not matter as much.

As I said previously, we could imagine a solution where we have two sets of gather functions (one with checks, one without), and then do the same for vectorized_load_and_transpose. That is a bit of extra code and we need some flag, but maybe it is still worth it.
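
For the zero-filled masked loads mentioned above, AVX512VL does provide a suitable instruction. A sketch of how it could look inside a checked variant of vectorized_load_and_transpose (illustrative only; in and indices stand for the source pointer and index array, and each 256-bit load is assumed to be either fully valid or fully skipped):

// Zero-masked load: masked-out elements are set to zero and their
// memory is never accessed, so an invalid index does not fault.
__m256d t[8];
for (unsigned int v = 0; v < 8; ++v)
  {
    const bool     valid = indices[v] != numbers::invalid_unsigned_int;
    const __mmask8 mask  = valid ? 0xF : 0x0;
    t[v] = _mm256_maskz_loadu_pd(mask, in + (valid ? indices[v] : 0u));
  }
// ... followed by the usual transpose of t into the output lanes.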

@kronbichler (Member, Author):

/rebuild

@kronbichler (Member, Author):

This is too risky for the release, postponing to right after the release.

kronbichler modified the milestones: Release 9.5 → Release 9.6 on Jun 12, 2023.
@bangerth (Member):

I love that there is an instruction vpcmpnequd. They must have constructed these instruction names the same way people tried to encode things in 6-char BLAS names...
