MatrixFree: Store indices of Dirichlet constrained DoFs as -1
#14324
base: master
Conversation
```cpp
const __m256i invalid = _mm256_set1_epi32(numbers::invalid_unsigned_int);
__mmask8 mask = _mm256_cmpneq_epu32_mask(invalid, index);
```
Just for the record, this function is not in AVX512F but in AVX512VL, so we probably want to guard this with an ifdef.
Is https://en.wikichip.org/wiki/x86/avx-512 correct in that this is only a problem for KNL?
(force-pushed from 52b278b to 1a9b4b5)
I like the modification, since we now no longer store a pointer into the constraints cache (to an empty entry) for homogeneous DBCs. Also, in a follow-up PR we can use these -1 as markers for inhomogeneities.

Question: do you have any idea how we can prevent the development in ConstraintInfo and the code here from diverging?
```cpp
    touched_first_by[myindex] = chunk;
    touched_last_by[myindex]  = chunk;
  }
if (dof_indices[it] != numbers::invalid_unsigned_int)
```
I wrote the same comparison yesterday in another project :D
I have been playing with this PR a bit (not yet from a performance point of view). I have written a small test:

```cpp
#include <deal.II/base/aligned_vector.h>
#include <deal.II/base/vectorization.h>

#include <iostream>
#include <vector>

using namespace dealii;

int
main()
{
  using VectorizedArrayType = VectorizedArray<double>;

  AlignedVector<double> data(100);
  for (unsigned int i = 0; i < data.size(); ++i)
    data[i] = i + 1.0;

  std::vector<unsigned int> indices(VectorizedArrayType::size(),
                                    numbers::invalid_unsigned_int);
  indices[0] = 1;
  indices[1] = 4;

  if (true) // working
    {
      VectorizedArrayType temp;
      temp.gather(data.data(), indices.data());
      std::cout << temp << std::endl;
    }

  if (false) // not working
    {
      AlignedVector<VectorizedArrayType> temp(4);
      vectorized_load_and_transpose(temp.size(),
                                    data.data(),
                                    indices.data(),
                                    temp.data());
      for (const auto i : temp)
        std::cout << i << std::endl;
    }
}
```

The first part tries out `gather` with invalid indices and works; the second part exercises `vectorized_load_and_transpose`, which does not handle them yet.
As a reminder: I think the non-vectorized versions also have to be updated.
(force-pushed from 1a9b4b5 to a0b416d)
The case with […] This is in contrast to the regular […]. As I said previously, we could imagine one solution where we have two sets of […].
/rebuild
This is too risky for the release, postponing to right after the release.
I love that there is an instruction […].
Implement the first suggestion in #14098 (comment): With this change, we would also store the Dirichlet-constrained indices in MatrixFree/DoFInfo, encoded as `numbers::invalid_unsigned_int`. We then check at the point of reading the indices whether the value is `numbers::invalid_unsigned_int`, and set up a mask based on that information to selectively read from the source vector. This sounds like a strange way to do things in the scalar case, but in the vectorized case it turns out to be a real win, because it only adds 1 additional instruction in the AVX-512 case (`vpcmpnequd`) and only 2 instructions with AVX2. These instructions are all much cheaper than the underlying `vgatherdpd`/`vscatterdpd` instructions, so the cost is really not that high.

This PR is not complete yet:

- […] the `compute_diagonal` function), so expect tests to fail
- […] `invalid_unsigned_int` in the regular `gather` function, but do it in a separate function, because I assume we still want to avoid the overhead of building the mask. Or maybe not, I don't know yet.

Apart from these todos, I wanted to post it for discussion at this point and for possible interaction with @peterrum in terms of the second possible improvement listed in the issue mentioned above. I note that performance on an AVX-512 target is around 5% higher for double variables and 10% faster for float variables.