Implement InverseCellwiseMassMatrix with tensor var. coeff. #14850
Conversation
Very nice!
What I would probably do is to have the new function call the old one, i.e.:
apply(&inverse_coefficient[0][0][0], n_actual_components, in_array, out_array, true);
with the last parameter indicating that we are dealing with a tensor. I hope that we can skip the explicit unrolling by passing &inverse_coefficient[0][0][0]. Not sure... The old function would need to be templated so that it can take any vector/array/pointer type.
The parameter indicating whether we are dealing with scalar or tensorial coefficients could be propagated to the old CellwiseInverseMassMatrixImplBasic, where the information is used to interpret the coefficients at run time. Merging CellwiseInverseMassMatrixImplTensorialCoeff with CellwiseInverseMassMatrixImplFlexible/CellwiseInverseMassMatrixImplBasic should be possible if I am not mistaken.
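The idea above can be sketched in a self-contained form. This is a hypothetical illustration, not deal.II's actual code: all names, the flat row-major coefficient layout, and the plain `double` in place of `VectorizedArrayType` are assumptions. It shows one `apply()` taking a raw coefficient pointer, with a flag switching between a scalar coefficient per point and an unrolled rank-2 tensor per point.

```cpp
#include <cassert>
#include <cstddef>

using Number = double;

// Hypothetical sketch: one pointer-based apply() serving both the scalar
// and the tensorial ("dyadic") coefficient case.
inline void apply(const Number      *inverse_coefficients,
                  const unsigned int n_components,
                  const std::size_t  n_points,
                  const Number      *in_array,
                  Number            *out_array,
                  const bool         dyadic_coefficients)
{
  if (!dyadic_coefficients)
    {
      // one scalar coefficient per component and quadrature point
      for (unsigned int d = 0; d < n_components; ++d)
        for (std::size_t q = 0; q < n_points; ++q)
          out_array[q + d * n_points] =
            in_array[q + d * n_points] *
            inverse_coefficients[q + d * n_points];
    }
  else
    {
      // one n_components x n_components matrix per quadrature point,
      // stored row-major in the flat coefficient array; note that this
      // branch requires out_array != in_array (or a temporary buffer)
      const unsigned int n2 = n_components * n_components;
      for (std::size_t q = 0; q < n_points; ++q)
        for (unsigned int d1 = 0; d1 < n_components; ++d1)
          {
            Number sum = 0.;
            for (unsigned int d2 = 0; d2 < n_components; ++d2)
              sum += inverse_coefficients[q * n2 + d1 * n_components + d2] *
                     in_array[q + d2 * n_points];
            out_array[q + d1 * n_points] = sum;
          }
    }
}
```

With this shape, a caller that holds the coefficients as contiguous rank-2 tensors could pass their first element directly, which is exactly the trick discussed in this thread.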
Thanks for checking it out @peterrum!
If I understand correctly, we need the explicit unrolling because of: dealii/include/deal.II/base/tensor.h, lines 186 to 189 (e8569c6).
I am looking into the other aspects you mentioned. I think you are right; I would also like to extract the flattening/unrolling piece of code away from the implementation, if we can't get rid of it.
I am struggling to find out what the best solution is here. The implementation I need is similar to the existing one, so I tried augmenting its interface. It'd be great if you could look at this code again and elaborate on what is wrong/bad and what would make it better. @kronbichler Feedback from you would also be much appreciated. Thanks!
You can add this. As a comment: one does not have to copy the function; you can use tricks like this: dealii/include/deal.II/matrix_free/evaluation_kernels.h, lines 4438 to 4440 (a408f2c).
My favorite at the moment is to extend the existing function.
With your suggestions, I think I got a nice solution @peterrum. Thanks a lot! Can you have a look again?
Looks very good! I left some comments. The most important one is to avoid code duplication for `fe_degree==-1` and `fe_degree!=-1`.
Also could I ask you to add a change-log entry (see the folder: https://github.com/dealii/dealii/tree/master/doc/news/changes/minor) and to add a test into the https://github.com/dealii/dealii/tree/master/tests/matrix_free folder. Just pick out any test that uses CellwiseInverseMassMatrix as a starting point.
AssertDimension(n_desired_components * dofs_per_component,
                inverse_coefficients.size());

if (not dyadic_coefficients)
Suggested change:
- if (not dyadic_coefficients)
+ if (! dyadic_coefficients)
for (unsigned int q = 0; q < dofs_per_component; ++q)
  out[q] *= inv_coefficient[q];
{
  std::vector<Number> tmp(n_desired_components);
Suggested change:
- std::vector<Number> tmp(n_desired_components);
+ Number tmp[20];
and check that n_desired_components is less than 20 in debug mode. Never use std::vector in such a performance-critical place ;) Ask @bergbauer how slow allocation can be!
I see, what about std::array? Should be just as good! I'd really prefer to use that over the raw array 😆
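A minimal standalone illustration of this exchange (the bound of 20 and all names are illustrative, not deal.II's actual code): a fixed-size `std::array` on the stack replaces the heap-allocating `std::vector` for the per-point temporary, with the size checked by an assertion in debug builds.

```cpp
#include <array>
#include <cassert>

constexpr unsigned int max_n_components = 20;

// Gather the component values belonging to one quadrature point q from a
// component-major array, using stack storage instead of a heap allocation.
template <typename Number>
std::array<Number, max_n_components>
gather_point_values(const Number      *out_array,
                    const unsigned int q,
                    const unsigned int n_components,
                    const unsigned int dofs_per_component)
{
  assert(n_components <= max_n_components); // checked in debug mode only
  std::array<Number, max_n_components> tmp{}; // no allocation, zero-filled
  for (unsigned int d = 0; d < n_components; ++d)
    tmp[d] = out_array[q + d * dofs_per_component];
  return tmp;
}
```

Unlike `std::vector`, a `std::array` member-wise copy or construction never touches the allocator, which is the point the reviewer makes about performance-critical inner loops.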
for (unsigned int d1 = 0; d1 < n_desired_components; ++d1)
  {
    Number sum(0.);
    for (unsigned int d2 = 0; d2 < n_desired_components; ++d2)
I would start this from d2 = 1 and initialize the value of sum with inverse_coefficients[q * n_coeff_components + d1 * n_desired_components] * tmp[0]. One addition less^^
Of course, how nice ^^
@@ -1033,8 +1052,12 @@ namespace MatrixFreeOperators
   fill_inverse_JxW_values(
     AlignedVector<VectorizedArrayType> &inverse_jxw) const
   {
-    constexpr unsigned int dofs_per_component_on_cell =
-      Utilities::pow(fe_degree + 1, dim);
+    const unsigned int dofs_per_component_on_cell =
👍
* on a vector holding the dof components. It also uses the inverse of the
* JxW values provided by the `fe_eval` argument passed to the constructor
Does the function use `JxW`? The comments seem very general and don't mention the differences compared to the last function (scalar vs. tensorial coefficients).
Oh no, it doesn't. I forgot to update the documentation from the previous version with the separate implementation. I'll update it accordingly.
  n_components,
  Number,
  VectorizedArrayType>::
  apply(const AlignedVector<Tensor<2, n_components, VectorizedArrayType>>
Would it be possible to have the new function call the old one, i.e., add a parameter to the old function? The reason I am asking is that in production simulations you would pre-unroll the coefficients.
If the coefficients were pre-unrolled before calling apply(), then I might as well completely remove the new function and just add a new parameter to the first function, as you say. This boolean would just be forwarded to the `Flexible` implementation, i.e.:
void
apply(const AlignedVector<VectorizedArrayType> &inverse_coefficient,
const unsigned int n_actual_components,
const VectorizedArrayType * in_array,
VectorizedArrayType * out_array,
const bool dyadic_coefficient = false) const;
Would this be fine?
Personally, I would keep the version with Tensors. Maybe we don't need to unroll, as indicated in #14850 (comment).
std::vector<Number> tmp(n_desired_components);
for (unsigned int d = 0; d < n_desired_components; ++d)
  tmp[d] = out_array[q + d * dofs_per_component];

// This is basically a matrix-vector product
for (unsigned int d1 = 0; d1 < n_desired_components; ++d1)
  {
    Number sum(0.);
    for (unsigned int d2 = 0; d2 < n_desired_components; ++d2)
      {
        sum +=
          inverse_coefficients[q * n_coeff_components +
                               d1 * n_desired_components + d2] *
          tmp[d2];
      }
    out_array[q + d1 * dofs_per_component] = sum;
  }
May I ask you to move this code (matrix-vector product) into a helper function? I would also suggest templating the function on the number of components. At the caller side, I would write:
if (n_coeff_components == 1)
  vmult<1>(...); // scalar
else if (n_coeff_components == dim)
  vmult<dim>(...); // vector Laplace
else
  vmult<-1>(...);
What do you think?
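The suggested helper can be sketched in a self-contained way (names and signatures are illustrative, not deal.II's actual code): a `vmult` templated on the component count, where `-1` selects a run-time loop bound, and a dispatch that picks a compile-time instantiation when possible. The inner loop is seeded with the first term so it can start at `d2 = 1`, as also suggested in this thread.

```cpp
#include <cassert>

// Hypothetical helper: matrix-vector product with the loop bound known at
// compile time when n_components > -1, letting the compiler unroll fully.
template <int n_components>
void vmult(const double      *matrix, // n x n, row-major
           const double      *in,
           double            *out,
           const unsigned int n_runtime)
{
  const unsigned int n = (n_components > -1)
                           ? static_cast<unsigned int>(n_components)
                           : n_runtime;
  for (unsigned int d1 = 0; d1 < n; ++d1)
    {
      // seed with the first term so the loop can start at d2 = 1
      double sum = matrix[d1 * n] * in[0];
      for (unsigned int d2 = 1; d2 < n; ++d2)
        sum += matrix[d1 * n + d2] * in[d2];
      out[d1] = sum;
    }
}

// Caller-side dispatch as proposed in the review.
template <int dim>
void dispatch_vmult(const double      *matrix,
                    const double      *in,
                    double            *out,
                    const unsigned int n)
{
  if (n == 1)
    vmult<1>(matrix, in, out, n);   // scalar coefficient
  else if (n == dim)
    vmult<dim>(matrix, in, out, n); // e.g. vector Laplace
  else
    vmult<-1>(matrix, in, out, n);  // generic run-time fallback
}
```

All three branches compute the same result; the template parameter only controls whether the loop bound is a compile-time constant.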
AlignedVector<VectorizedArrayType> inverse_coefficients(
  dofs_per_component * n_tensor_components);

// Flatten the inverse dyadic coefficients into `inverse_coefficients`
{
  auto begin = inverse_coefficients.begin();
  for (unsigned int q = 0; q < dofs_per_component; ++q)
    {
      const auto end = std::next(begin, n_tensor_components);
      inverse_dyadic_coefficients[q].unroll(begin, end);
      begin = end;
    }
}
Do we need this code? Apart from the fact that I would like to avoid the memory allocation of AlignedVector, this also seems like unnecessary work. I would have simply passed a Number* or VectorizedArray* to the internal run functions. This forces you to also pass in the size, since you can't call inverse_coefficients.size() any more. Of course, working with plain pointers seems less nice from a programming point of view, but these are performance-critical functions, so we should not copy data around (and spill caches etc.). We could of course opt to use ArrayView<const Number> as the argument to run(), which allows querying the size (in fact, it is precisely a pointer to the start plus the size):
dealii/include/deal.II/base/array_view.h, lines 368 to 377 (947199b):

/**
 * A pointer to the first element of the range of locations in memory that
 * this object represents.
 */
value_type *starting_element;

/**
 * The length of the array this object represents.
 */
std::size_t n_elements;
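The ArrayView idea quoted above boils down to a non-owning pointer-plus-length pair. A minimal sketch (deal.II's class has far more machinery; `ArrayViewSketch` and all names here are illustrative) shows why such a view lets internal interfaces accept data from an `AlignedVector`, a `std::vector`, a raw array, or unrolled tensor storage without any copy:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal non-owning view: exactly a pointer to the start and a length.
template <typename T>
class ArrayViewSketch
{
public:
  ArrayViewSketch(T *start, std::size_t n) : start_(start), n_(n) {}
  std::size_t size() const { return n_; }
  T &operator[](std::size_t i) const
  {
    assert(i < n_); // debug-mode bounds check
    return start_[i];
  }
private:
  T          *start_; // first element of the viewed range
  std::size_t n_;     // number of elements
};

// A consumer only sees the view, never the owning container.
inline double sum(const ArrayViewSketch<const double> &view)
{
  double s = 0.;
  for (std::size_t i = 0; i < view.size(); ++i)
    s += view[i];
  return s;
}
```

Because the view stores only a pointer and a size, constructing one is free, and the same `run()`-style function can be fed from any contiguous storage.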
@kronbichler See discussion above (#14850 (review) and #14850 (comment)).
How do we want to proceed here?
for (unsigned int q = 0; q < dofs_per_component; ++q)
  for (unsigned int d = 0; d < n_desired_components; ++d)
    out_array[q + d * dofs_per_component] *=
      inv_coefficient[q + d * shift_coefficient];
I think it is a bit unfortunate to have this function work with indices in a different order than the dyadic coefficients you use below. In your new code, the tensor components are the fastest-running index, whereas here the quadrature points run fastest. This is for backwards compatibility, right? I would have preferred to keep the case that does not couple between components within one single loop over components (in order to reduce the data transfer from caches between the tensor-product evaluator calls and the quadrature-point operation), and then create a second copy of evaluator.template hessians<...>(...) that breaks out the loop as you need it for the case that couples between components. This creates more code, but we always do one or the other, so it won't affect the run-time code paths. If you do as @peterrum suggested, we will not even have a new function that gets instantiated but simply another path here.
With #14850 (comment), I was thinking about something like:
n_comp_outer = dyadic_coefficients ? 1 : n_desired_components;
n_comp_inner = dyadic_coefficients ? n_desired_components : 1;

for (unsigned int d = 0; d < n_comp_outer; ++d)
  {
    for (unsigned int di = 0; di < n_comp_inner; ++di)
      {
        in_  = in + dofs_per_component * di;
        out_ = out + dofs_per_component * di;
        evaluator.template hessians<0, true, false>(in_, out_);
        if (dim > 1)
          evaluator.template hessians<1, true, false>(out_, out_);
        if (dim > 2)
          evaluator.template hessians<2, true, false>(out_, out_);
      }

    if (dyadic_coefficients)
      for (unsigned int q = 0; q < dofs_per_component; ++q)
        vmult(...);
    else
      for (unsigned int q = 0; q < dofs_per_component; ++q)
        out[q] *= inv_coefficient[q];

    for (unsigned int di = 0; di < n_comp_inner; ++di)
      {
        in_  = in + dofs_per_component * di;
        out_ = out + dofs_per_component * di;
        if (dim > 2)
          evaluator.template hessians<2, false, false>(out_, out_);
        if (dim > 1)
          evaluator.template hessians<1, false, false>(out_, out_);
        evaluator.template hessians<0, false, false>(out_, out_);
      }

    in += dofs_per_component;
    out += dofs_per_component;
  }
- ... based on the `Flexible` implementation with its own new interface (vector of rank-2 tensors), see dealii#14843
- Unify the flexible implementation as well as fill_inverse_JxW_values() for all `fe_degree`s.
- Add boolean template parameter `dyadic_coefficients` to the flexible implementation.
- Flexible implementation no longer takes the shape values, but fe_eval.
- Matrix-vector product for dyadic coefficients implemented in vmult.
Force-pushed from 4fed85d to a99365b.
I pushed a new commit, hopefully resolving our issues. I used @peterrum's code from the last conversation, which worked wonders. Let me know what you think :)
{
  if (inverse_coefficients.size() != dofs_per_component)
    AssertDimension(n_desired_components * dofs_per_component,
                    inverse_coefficients.size())
Suggested change:
-                     inverse_coefficients.size())
+                     inverse_coefficients.size());
}
}

apply(inverse_coefficients, n_components, in_array, out_array, true);
Now you should be able to call
Suggested change:
- apply(inverse_coefficients, n_components, in_array, out_array, true);
+ apply(&inverse_dyadic_coefficients[0][0][0], n_components, in_array, out_array, true);
right? That would allow you to skip the rest of the code in this function.
This is connected to the discussion #14850 (comment), right? My understanding is that the explicit unrolling is necessary, because elements of tensors aren't guaranteed to lie in contiguous memory. Do you think it is not necessary?
Let us assume that tensors are contiguous in memory; copying would be too expensive. All of this happens inside internal interfaces of deal.II, so whenever there is a change in deal.II that would break the contiguous memory layout, we would need to address it here as well.
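The assumption being made can be illustrated with a stand-in type (`TensorSketch` below is not deal.II's `Tensor`; the nested plain-array storage is the assumption): if a rank-2 tensor stores its entries as a nested array, all dim*dim values lie back-to-back, so the address of the first entry can feed a flat-pointer interface without any unrolling.

```cpp
#include <cassert>

// Stand-in for a rank-2 tensor with nested plain-array storage.
template <int dim>
struct TensorSketch
{
  double values[dim][dim];
};

// Verify that entry (i, j) sits at offset i * dim + j from the first
// entry, i.e. the storage is contiguous and row-major.
template <int dim>
bool entries_are_contiguous()
{
  TensorSketch<dim> t{};
  const double *flat = &t.values[0][0];
  for (int i = 0; i < dim; ++i)
    for (int j = 0; j < dim; ++j)
      if (flat + i * dim + j != &t.values[i][j])
        return false;
  return true;
}
```

Under this layout, an array of such tensors is also contiguous, which is what makes `&inverse_coefficient[0][0][0]` usable as a flat coefficient pointer.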
Okay, one other problem is that the apply() method expects an AlignedVector<VectorizedArrayType>, so I just can't pass in the address. I couldn't find a way to construct an AlignedVector from a raw array.
This is why I had suggested switching this interface to either plain pointers or to ArrayView, in order to make sure we can pass in all options. Maybe ArrayView is good. All these are internal interfaces, so we can make the change without worrying about compatibility.
Yep, I'm adopting ArrayView for the internal interfaces 👍 Thanks!
Looks good to me. Thanks for the changes! Let's wait for @kronbichler's opinion! @btemuer In the meantime you can write the test. That should only be affected to a small extent.
std::enable_if_t<fe_degree == -1> * = nullptr)
template <int n_components>
static void
vmult(const Number * inverse_coefficients,
Could you make this private?
Number *,
std::enable_if_t<fe_degree == -1> * = nullptr)
template <int n_components>
static void
Could you make this inline?
@@ -648,7 +648,8 @@ namespace MatrixFreeOperators
   apply(const AlignedVector<VectorizedArrayType> &inverse_coefficient,
         const unsigned int n_actual_components,
         const VectorizedArrayType * in_array,
-        VectorizedArrayType * out_array) const;
+        VectorizedArrayType * out_array,
+        const bool dyadic_coefficients = false) const;
Documentation is missing.
Could I also ask you to describe here what happens (in tensor-product notation)?
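The requested tensor-product description could look roughly as follows. This is a sketch under assumptions not stated in the thread: \(\tilde S\) stands for the 1d basis-change matrix applied per coordinate direction by the `hessians<...>` calls, and \(A_q^{-1}\) for the inverse coefficient (a scalar or an `n_components` x `n_components` matrix) at quadrature point \(q\):

```latex
u_{\text{out}} \;=\;
\bigl(\tilde S \otimes \cdots \otimes \tilde S\bigr)\,
\operatorname{diag}_q\!\bigl(A_q^{-1}\bigr)\,
\bigl(\tilde S \otimes \cdots \otimes \tilde S\bigr)^{\mathsf T}
u_{\text{in}}
```

The two Kronecker-product factors correspond to the forward and backward sweeps of the sum-factorization evaluator, and the middle factor is the pointwise (possibly component-coupling) coefficient application.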
AlignedVector<VectorizedArrayType> inverse_coefficients(
  dofs_per_component * n_tensor_components);

// Flatten the inverse dyadic coefficients into `inverse_coefficients`
{
  auto begin = inverse_coefficients.begin();
  for (unsigned int q = 0; q < dofs_per_component; ++q)
    {
      const auto end = std::next(begin, n_tensor_components);
      inverse_dyadic_coefficients[q].unroll(begin, end);
      begin = end;
    }
}
How do we want to proceed here?
- Remove explicit unrolling of tensors
- Use ArrayView in internal functions instead of passing around AlignedVector
- Make vmult private and inline
- Documentation
wrong number of dofs for fe_degree=-1
- inverse_mass_10: tests cellwise inverse mass with random dyadic coefficients
- inverse_mass_11: same as 10, but with fe_degree=-1
Both tests use a GMRES solver (the coefficients are not symmetric) with a reduced accuracy (1e-10) compared to the other tests in this test suite, which use 1e-12. The reason is to make the tests more robust, as the problem can become poorly conditioned due to the random coefficients.
ArrayView<const VectorizedArrayType>(inverse_coefficients.data(),
                                     inverse_coefficients.size()),
I think you can use this one:
Suggested change:
- ArrayView<const VectorizedArrayType>(inverse_coefficients.data(),
-                                      inverse_coefficients.size()),
+ make_array_view(inverse_coefficients),
since we have this function:
dealii/include/deal.II/base/array_view.h, lines 1090 to 1093 (9bd903c):
make_array_view(const AlignedVector<ElementType> &vector)
{
  return ArrayView<const ElementType>(vector.data(), vector.size());
}
Thank you, looks great. 👍 for the two test cases.
Looks very good. A few minor comments from my side; after that: good to merge!
doc/news/changes/minor/20230315Temur (outdated)
Internal interfaces for the coefficients now use ArrayView instead
of AlignedVector.
<br>
(Bugrahan Temur, 2023/03/15)
Could you add a new line?
tests/matrix_free/inverse_mass_11.cc (outdated)
@@ -0,0 +1,260 @@
// ---------------------------------------------------------------------
//
// Copyright (C) 2014 - 2020 by the deal.II authors
Suggested change:
- // Copyright (C) 2014 - 2020 by the deal.II authors
+ // Copyright (C) 2022 by the deal.II authors
Why?
Suggested change:
- // Copyright (C) 2014 - 2020 by the deal.II authors
+ // Copyright (C) 2023 by the deal.II authors
tests/matrix_free/inverse_mass_10.cc (outdated)
@@ -0,0 +1,259 @@
// ---------------------------------------------------------------------
//
// Copyright (C) 2014 - 2020 by the deal.II authors
Suggested change:
- // Copyright (C) 2014 - 2020 by the deal.II authors
+ // Copyright (C) 2022 by the deal.II authors
* The second-rank tensor at each quadrature point defines a linear operator
* on a vector holding the dof components. It is assumed that the passed
* input and output arrays are of correct size, namely
* FEEvaluation::dofs_per_cell long. The `in_array` and `out_array`
Suggested change:
- * FEEvaluation::dofs_per_cell long. The `in_array` and `out_array`
+ * FEEvaluation::dofs_per_cell long. The @p in_array and @p out_array
These two should be equivalent, and we use both in the code base.
Should I keep it? I think the backticks are easier to read in the source code, though @p makes it clear that it is a function parameter.
Personally, I prefer to use @p (for the reason you have provided); but I let you decide. I am fine with any version.
I went with @p.
* input and output arrays are of correct size, namely
* FEEvaluation::dofs_per_cell long. The `in_array` and `out_array`
* arguments may point to the same memory position.
* `inverse_dyadic_coefficients` must be dofs_per_component long, and every
Suggested change:
- * `inverse_dyadic_coefficients` must be dofs_per_component long, and every
+ * @p inverse_dyadic_coefficients must be dofs_per_component long, and every
* arguments may point to the same memory position.
* `inverse_dyadic_coefficients` must be dofs_per_component long, and every
* element must be a second-rank tensor of dimension `n_components`. All
* entries should also contain the inverse JxW values.
Suggested change:
- * entries should also contain the inverse JxW values.
+ * entries should be multiplied with the inverse of the `JxW` values.
/rebuild
Thanks!
Thank you both very much! @peterrum @kronbichler
CI is not happy:
/home/runner/work/dealii/dealii/include/deal.II/base/vectorization.h:527:17: error: ‘tmp[2].dealii::VectorizedArray<double, 1>::data’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  527 | data *= vec.data;
I am not sure where the issue is. Maybe this one!?
const unsigned int n_desired_components =
  (n_components > -1) ? n_components : n_given_components;

std::array<Number, dim + 2> tmp;
Suggested change:
- std::array<Number, dim + 2> tmp;
+ std::array<Number, dim + 2> tmp = {};
I'm looking into it.
I wasn't able to reproduce the error. I can look into it in more detail if your suggestion does not work.
Fix initialization of array of VectorizedArray
Co-authored-by: Peter Munch <peterrmuench@gmail.com>
@btemuer It worked :)
... based on the `Basic` implementation with its own new interface (vector of rank-2 tensors), see #14843