Introduce Utilities::MemorySpace namespace and MemoryBlock class #12821
Conversation
Review of the changelog entry:

Original:
New: New namespace Utilities::MemorySpace that contains functions to help
memory space independent code. New MemoryBlock class. This class allocates a
block of memory on the host or the device. The underlying data can be access
using ArrayView.

Suggested change:
New: The new Utilities::MemorySpace namespace contains functions to help
write memory space independent code. The new MemoryBlock class allocates a
block of memory on the host or the device. The underlying data can be accessed
using ArrayView.
include/deal.II/base/array_view.h (outdated)

Original:
@@ -87,6 +87,11 @@ class ArrayView
   */
  using value_type = ElementType;

  /**
   * An alias the denotes the memory space of this conlainer-like class.

Suggested change:
   * An alias that denotes the memory space of this container-like class.
include/deal.II/base/memory_block.h (outdated)

Original:
 * This class allocates a block of memory on the host or the device. Access to
 * the elements of the block needs to be done using ArrayView. Note that when a
 * reinit() function is called the underlying pointer is changed and thus, one
 * need to call reinit() on the ArrayView associated with the reinitialized
 * MemoryBlock.

Suggested change:
 * This class allocates a block of memory on the host or the device. The elements
 * of the block must be accessed using ArrayView. Note that when a
 * reinit() function is called, the underlying pointer is changed and thus, one
 * needs to call reinit() on the ArrayView associated with the reinitialized
 * MemoryBlock.
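To make the documented invariant concrete, here is a minimal host-only sketch of the idea (not deal.II's actual implementation; the class names are stand-ins and everything beyond the comment's description is an assumption):

```cpp
#include <memory>

// Host-only sketch of a MemoryBlock-like class. The real class dispatches
// between new/delete and cudaMalloc/cudaFree based on a memory-space
// template parameter; this sketch only models the host path.
template <typename ElementType>
class MemoryBlockSketch
{
public:
  explicit MemoryBlockSketch(const unsigned int size)
    : data(new ElementType[size]), n_elements(size)
  {}

  // reinit() replaces the underlying allocation, so any view created
  // earlier now points to freed memory and must itself be re-created.
  void reinit(const unsigned int size)
  {
    data.reset(new ElementType[size]);
    n_elements = size;
  }

  ElementType *begin() { return data.get(); }
  unsigned int size() const { return n_elements; }

private:
  std::unique_ptr<ElementType[]> data;
  unsigned int n_elements;
};

// Bare-bones stand-in for ArrayView: a non-owning pointer plus a length.
template <typename ElementType>
struct ViewSketch
{
  ElementType *ptr;
  unsigned int n;
};

// Hypothetical helper demonstrating the pattern: after reinit(), the old
// view is stale and a fresh one has to be built from the block again.
inline int fill_and_sum(const unsigned int n)
{
  MemoryBlockSketch<int> block(n);
  ViewSketch<int> view{block.begin(), block.size()};
  for (unsigned int i = 0; i < view.n; ++i)
    view.ptr[i] = static_cast<int>(i);

  block.reinit(n); // `view` is now invalid and must not be dereferenced
  ViewSketch<int> fresh{block.begin(), block.size()}; // re-created view
  for (unsigned int i = 0; i < fresh.n; ++i)
    fresh.ptr[i] = static_cast<int>(i);

  int sum = 0;
  for (unsigned int i = 0; i < fresh.n; ++i)
    sum += fresh.ptr[i];
  return sum;
}
```

The key point the comment makes is visible in `fill_and_sum`: the view does not track the block, so reinitializing the block silently invalidates it.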
include/deal.II/base/memory_block.h (outdated)

Original:
{
public:
  /**
   * An alias the denotes the memory space of this conlainer-like class.

Suggested change:
   * An alias that denotes the memory space of this container-like class.
include/deal.II/base/memory_block.h
Outdated
MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other); | ||
|
||
/** | ||
* Copy ther data in @p other and move it to the appropriate memory space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copy ther data in @p other and move it to the appropriate memory space. | |
* Copy the data stored in @p other and move it to the appropriate memory space. |
Original:
#endif

/**
 * Allocate memory on the device.

Suggested change:
 * Allocate memory on the host.
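The host/device split behind these docstrings is tag dispatch on empty memory-space types. A self-contained host-only sketch of that pattern (the tag and function names mirror the PR; the bodies are assumptions, and the CUDA branch is only indicated in a comment since it cannot run here):

```cpp
#include <cstddef>

namespace sketch
{
  // Empty tag types standing in for dealii::MemorySpace::Host / ::CUDA.
  struct Host {};
  struct CUDA {};

  // Host overload: a plain new[] allocation.
  template <typename Number>
  void allocate_data(const Host &, Number *&pointer, const std::size_t size)
  {
    pointer = new Number[size];
  }

  template <typename Number>
  void deallocate_data(const Host &, Number *&pointer)
  {
    delete[] pointer;
    pointer = nullptr;
  }

  // The CUDA overloads would wrap cudaMalloc/cudaFree instead, e.g.
  //   cudaMalloc(reinterpret_cast<void **>(&pointer), size * sizeof(Number));

  // Round-trip check: allocate, store, read back, free.
  template <typename Number>
  Number roundtrip(const Number value)
  {
    Number *p = nullptr;
    allocate_data(Host{}, p, 1);
    p[0] = value;
    const Number result = p[0];
    deallocate_data(Host{}, p);
    return result;
  }
}
```

Because the tag is the first argument, the compiler picks the right allocator by overload resolution, so calling code can stay identical for both memory spaces.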
Code context:
 * Apply the functor @p f to the range [0,size). This function accepts a
 * lambda function instead of a functor. In this case, the code should look

Reviewer: How is a lambda function special here?

Author: You need to compile deal.II with a special flag, you need to add __host__ __device__, you can only capture by copy, and there are other restrictions from CUDA.

Reviewer: Then I would say something like "In case the functor is a lambda, the code should look as follows [...]". To me, it sounded like this function can only be used if the functor is a lambda.

Author: I see what you mean. It was poorly worded. I changed the sentence in both places.
Code context:
inline void
for_each(const dealii::MemorySpace::Host &,
         unsigned int const size,
         Functor f)

Reviewer: Doesn't

         const Functor &f)

work?

Author: It does, but then on the host you pass by reference while on the device you have to pass by value. If you have a bug in your copy constructor, you can catch it on the host with this interface.
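A host-side sketch of such a for_each, with the functor deliberately taken by value as the author argues (the signature follows the PR's host overload; the body and the helper below are assumptions):

```cpp
namespace sketch
{
  struct Host {}; // stand-in tag for dealii::MemorySpace::Host

  // The functor is taken by value on purpose: CUDA kernels receive their
  // arguments by value, so keeping the same convention on the host means
  // a broken copy constructor shows up in host tests, not only on device.
  template <typename Functor>
  inline void for_each(const Host &, const unsigned int size, Functor f)
  {
    for (unsigned int i = 0; i < size; ++i)
      f(i);
  }

  // Usage with a lambda. On the device the capture would have to be by
  // copy, which is why a pointer (copyable) is captured rather than a
  // reference to the local variable.
  inline int sum_indices(const unsigned int n)
  {
    int total = 0;
    int *const total_ptr = &total;
    for_each(Host{}, n, [total_ptr](const unsigned int i) {
      *total_ptr += static_cast<int>(i);
    });
    return total;
  }
}
```

Taking the functor by value means every call copies it once, mirroring exactly what a kernel launch would do with the same functor.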
tests/cuda/memory_block.cu
Outdated
Utilities::MemorySpace::for_each(MemorySpace::Host{}, | ||
memory_block_host.size(), | ||
check_functor_zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need the extra MemorySpace
here? While I can see that that makes some sense for allocate_data
, deallocate_data
, and copy
(since these are inherently memory operations), I would much rather see Utilities::for_each
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then it won't show up on the same doxygen page. All these functions are there to simplify writing code that's independent of the MemorySpace
. I agree that it's technically an execution space not a memory space but we don't have that concept in deal.II
include/deal.II/base/memory_block.h
Outdated
* Constructor. Allocate a block of @p size elements. The data is not | ||
* initialized. | ||
*/ | ||
MemoryBlock(unsigned int size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MemoryBlock(unsigned int size); | |
MemoryBlock(const unsigned int size); |
Code context:
/**
 * Copy constructor.
 */
MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other);

Reviewer: Can you explain what this does? That is, does it just copy the pointer, or does it in fact allocate memory and copy the objects in the memory block?
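The author's summary at the end of the thread says the copy constructor of MemoryBlock is a regular copy constructor, i.e. a deep copy rather than pointer sharing. A host-only sketch of that semantic (the class name is a stand-in; the implementation details are assumptions):

```cpp
#include <algorithm>

// Deep-copying copy constructor: fresh storage is allocated and the
// elements are copied, so the copy never aliases the original block.
template <typename ElementType>
class DeepBlock
{
public:
  explicit DeepBlock(const unsigned int size)
    : data(new ElementType[size]), n(size)
  {}

  DeepBlock(const DeepBlock &other)
    : data(new ElementType[other.n]), n(other.n)
  {
    std::copy(other.data, other.data + n, data); // element-wise copy
  }

  DeepBlock &operator=(const DeepBlock &) = delete; // keep the sketch small

  ~DeepBlock() { delete[] data; }

  ElementType &operator[](const unsigned int i) { return data[i]; }

private:
  ElementType *data;
  unsigned int n;
};

// Mutating the copy must leave the original untouched.
inline bool copies_are_independent()
{
  DeepBlock<int> a(3);
  a[0] = 1; a[1] = 2; a[2] = 3;
  DeepBlock<int> b(a); // deep copy
  b[0] = 99;
  return a[0] == 1 && b[0] == 99;
}
```

A shallow (pointer-sharing) copy would instead require reference counting to know when to free the storage, which is exactly the complication the author says this design avoids.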
include/deal.II/base/memory_block.h (outdated)

Original:
 * initialized.
 */
void
reinit(unsigned int size);

Suggested change:
reinit(const unsigned int size);
include/deal.II/base/memory_block.h (outdated)

Original:
reinit(unsigned int size);

/**
 * Clear the memory block, allocate a new block, and copy the data stored in @p other.

Suggested change:
 * Release the memory block, allocate a new block, and copy the data stored in @p other.
include/deal.II/base/memory_block.h (outdated)

Original:
MemoryBlock(const ArrayView<ElementType, MemorySpaceType2> &array_view);

/**
 * Clear the memory block and allocate a new block. The data is not

Suggested change:
 * Release the memory block and allocate a new block. The data is not
Original:
const ::dealii::MemorySpace::Host &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)

Suggested change:
const std::size_t size)
Original:
const ::dealii::MemorySpace::CUDA &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)

Suggested change:
const std::size_t size)
Original:
 */
template <typename Functor>
__global__ void
for_each_impl(unsigned int size, Functor f)

Suggested change:
for_each_impl(const unsigned int size, Functor f)
Code context:
void
for_each(const dealii::MemorySpace::CUDA &,
         const unsigned int const size,
         Functor f)

Reviewer: This function's argument list seems to be missing something. for_each is typically described as doing an operation for each object in a collection, but there is no collection here: the memory space does not point to any objects, and neither do the second or third arguments. Would a better function name be something like for_each_index?
Code context:
/**
 * Constructor. Allocate a block of @p size elements. The data is not
 * initialized.

Reviewer: "Not initialized" only exists in C++ if the data type is a built-in type -- I think this is std::is_standard_layout<T>, but I forget the details. For all other types, one constructor or another needs to be run. I suspect that you intend this class to only be used for number data types, but you write it as a generic ElementType. It would be worthwhile encoding your assumption via a static_assert.

Author: It is always non-initialized on the device because we need to use the CUDA equivalent of malloc. As for the data types: I don't know. On the one hand, I would like to let users decide which data structure they want to use; on the other hand, if you are not careful, the performance on the GPU will be pretty bad. In the end, I decided to go with a generic ElementType because it's what we do in ArrayView, and the two classes work together.

Reviewer: Right, but if you can't guarantee initialization, then all a user will get is likely corrupted memory. I would much rather avoid that by having a static_assert in place somewhere that checks whether we have a data type that has no constructor anyway.

Author: I don't think that std::is_standard_layout is what we want to use then. Is std::is_trivial (https://en.cppreference.com/w/cpp/types/is_trivial) what you have in mind?

Reviewer: Yes, is_trivial is what I had in mind.
Reviewer: I still don't quite agree with choosing [...] If you named it [...] In the end, I have the feeling that this sort of operation is really what libraries like Kokkos or Raja were made for. This may exceed what you have in mind for this patch, but out of curiosity, have you considered whether we should just build on one of these?

Author: Yes, exactly. That's why I've put that as the first argument, since it's where the executor goes in the STL.

Yes, I did, but I don't know Raja, so I would have to learn it first. With Kokkos, I am worried about the integration with our current code (which shouldn't be too horrible) and the integration with Trilinos. Maybe it works out of the box, but maybe not. Since nobody has been pushing to use Kokkos, I went with the current solution, which I know is compatible with our current code.

Reviewer: OK. For the record, I would not be opposed to moving towards a model where we use Kokkos for these sorts of things (and probably plenty of others). I recognize that it's another dependency and that it may or may not play well with Trilinos. This might be a longer-term issue. As for the issue with executor vs. memory space: I'd be OK with just documenting it: that the argument indicates an executor and that the executor is identified by the memory space it runs in. That's maybe not the most elegant solution, but it works. Or are we expecting that, longer term, things move to a unified (globally addressable) memory space where one could execute a GPU kernel that reads and writes CPU memory or the other way around?

Author: You can already do that, but depending on the GPU you are using, it's pretty slow. I think the consensus is to avoid doing that.
I think that's the right move. We might be able to support [...]

Here is my plan for Intel GPUs: https://www.jlse.anl.gov/projects/exascale-computing-projects-ecp/ecp-2-4-3-05-hip-on-aurora/ I can't wait for the new ICE we'll get.

Sure, going with a backend that doesn't support them natively is surely a good idea.
I agree with the general direction in this PR, and I also agree that we should set up and discuss a longer-term plan for memory management in this kind of CPU/GPU code. I would be in favor of letting Kokkos manage many of our data structures with combined CPU/GPU scope if we keep the ability to work with raw-pointer infrastructure where needed (which is nicely solved by Kokkos with its View concept). My main question is the extent to which Kokkos and the upcoming STL functionality (pushed by some Kokkos people) overlap, and how we most efficiently spend our resources. @masterleinad @Rombur you are closest to the development in these packages; what perspectives should we have from the deal.II side?

The plan for Kokkos and C++ is to push Kokkos functionality into the standard, with the idea that once it's in the standard, vendors will optimize that code. This is why Kokkos is pushing for things like MDSpan and BLAS in the standard. It also works the other way around: Kokkos has its own implementation of many std algorithms so that you can use the same functions on the CPU and the GPU. Does that mean we can just wait for everything to get into the standard and we don't need Kokkos? Unfortunately, no. Getting things into the standard is just extremely slow. MDSpan may make it into C++23, but that's not certain, and BLAS will not make it. If we wait for the C++ standard, we may get all we need in a decade or so. Personally, I think that using Kokkos is the right move long term. We won't have to worry about what new architecture comes up, because Kokkos will take care of it, and we can always get the pointer to the underlying data if we need to. Trilinos and many other codes important to the DOE are built on Kokkos, so it won't disappear suddenly. Even PETSc has an experimental backend using Kokkos (and yes, they download and install Kokkos themselves, like they do with MPI...). I will have some time to work on our GPU code in the next few months. My plan was to make our GPU code easier to use, but I could work on Kokkos integration instead. We would have to discuss in which parts of the library we want to use Kokkos and what the interface to Kokkos should look like.

@Rombur thanks for the detailed answer; this overlaps with what I'm observing. I think it makes a lot of sense to discuss the general data structures we would move to, with the goal of generic capabilities on host/device and with different vendors. As you said, the important thing for us (or at least me) is to be able to keep working with raw pointers and arrays in case we need to (performance-wise or feature-wise).

I agree with @Rombur here in that there is no real point in waiting for the C++ standard to catch up, and that using [...]
Is there anything else blocking this PR? I am fine closing this if we decide that we are moving to Kokkos, but in that case, I would like to have more input on things like design, scope of the work, etc.

We already recognize Kokkos in cmake, though I have to admit that it's not clear to me what for. I see no reason not to require Kokkos for CUDA code already now. I wouldn't be opposed to requiring Kokkos in general if that is necessary to make everyone's life easier. I'm not an expert in Kokkos, so I'm not sure what I can help with, but I'm happy to learn something about it if you'd like everyone interested in it to participate in a Zoom call. When you say "more input", what are you specifically looking for? I assume you're hoping for something you can also tell your management?

I left a comment in #12894 and I'm also happy with having a Zoom call for this.
We use Kokkos for ArborX; the catch is that it only works on the CPU. This avoids some of the ugliness of using Kokkos with nvcc.

There are two things I am looking for: [...]

About 2, let's talk about it tomorrow. I tend to think that we all have to learn something like Kokkos at some point, and this might be the point.

I've done the requested changes.

Since we now require Kokkos, do we still need this patch?
The goal of this PR is to make it easier to write code that works both on the host and the device. For that reason, the PR does the following:

- [...] new on the host but cudaMalloc on the device
- a for_each function that performs a simple for loop on the host and launches a kernel on the device
- [...] LA::distributed::Vector [...]

I have added a new class, MemoryBlock, that allocates a block of memory either on the host or the device. To access the underlying data of this class, you need to use an ArrayView. I've done this because of limitations of lambda functions with CUDA: in particular, you cannot use a private data member in the lambda. Since you have to use an ArrayView to access the data, you can safely have a private MemoryBlock and then create an ArrayView just before calling the lambda. The other advantage is that the copy constructor of MemoryBlock is a regular copy constructor; if MemoryBlock could be used directly in a kernel, we would have to do a shallow copy and then keep track of the number of copies.
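The capture pattern described above (private owning block, non-owning view created just before the lambda so the lambda never touches `this`) might look roughly like this on the host. All names are stand-ins; only the pattern follows the PR's description:

```cpp
// Non-owning pointer + length, cheap and safe to capture by copy.
template <typename T>
struct ViewSketch
{
  T *ptr;
  unsigned int n;
};

class DeviceVectorSketch
{
public:
  explicit DeviceVectorSketch(const unsigned int size)
    : storage(new double[size]), n(size)
  {}
  ~DeviceVectorSketch() { delete[] storage; }
  DeviceVectorSketch(const DeviceVectorSketch &) = delete;
  DeviceVectorSketch &operator=(const DeviceVectorSketch &) = delete;

  void set_all(const double value)
  {
    // Build the view here and capture it by copy: no `this` and no
    // private member inside the lambda, which is what CUDA lambdas
    // require (they capture by copy and cannot reach private members
    // through the enclosing object's pointer).
    ViewSketch<double> view{storage, n};
    auto f = [view, value](const unsigned int i) { view.ptr[i] = value; };
    for (unsigned int i = 0; i < n; ++i) // stand-in for the kernel launch
      f(i);
  }

  double operator()(const unsigned int i) const { return storage[i]; }

private:
  double *storage; // in the PR, this member would be a MemoryBlock
  unsigned int n;
};
```

Only the small view crosses into the lambda, so the owning class never needs kernel-compatible copy semantics.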