
Introduce Utilities::MemorySpace namespace and MemoryBlock class #12821

Closed
wants to merge 6 commits

Conversation

@Rombur (Member) commented Oct 12, 2021

The goal of this PR is to make it easier to write code that works both on the host and the device. To that end, the PR does the following:

  1. Add helper functions to allocate memory, free memory, and copy memory on the host and the device. I use tag dispatching to choose the correct implementation, i.e., allocating memory uses new on the host but cudaMalloc on the device.
  2. Add a for_each function that performs a simple for loop on the host and launches a kernel on the device.
  3. Right now, the only data structure that works on both the host and the device is LA::distributed::Vector. I have added a new class, MemoryBlock, that allocates a block of memory either on the host or the device. To access the underlying data of this class, you need to use an ArrayView. I've done this because of limitations of lambda functions with CUDA. In particular, you cannot use private member data in the lambda. Since you have to use an ArrayView to access the data, you can safely have a private MemoryBlock and then create an ArrayView just before calling the lambda. The other advantage is that the copy constructor of MemoryBlock is a regular copy constructor; if MemoryBlock could be used directly in a kernel, we would have to do a shallow copy and then keep track of the number of copies.

@Rombur Rombur added the GPU label Oct 12, 2021
@Rombur Rombur changed the title Introde Utilities::MemorySpace namespace and MemoryBlock class Introduce Utilities::MemorySpace namespace and MemoryBlock class Oct 18, 2021
Comment on lines 1 to 4
New: New namespace Utilities::MemorySpace that contains functions to help
memory space independent code. New MemoryBlock class. This class allocates a
block of memory on the host or the device. The underlying data can be access
using ArrayView.
Suggested change
New: New namespace Utilities::MemorySpace that contains functions to help
memory space independent code. New MemoryBlock class. This class allocates a
block of memory on the host or the device. The underlying data can be access
using ArrayView.
New: The new Utilities::MemorySpace namespace contains functions to help
memory space independent code. The new MemoryBlock class allocates a
block of memory on the host or the device. The underlying data can be accessed
using ArrayView.

@@ -87,6 +87,11 @@ class ArrayView
*/
using value_type = ElementType;

/**
* An alias the denotes the memory space of this conlainer-like class.
Suggested change
* An alias the denotes the memory space of this conlainer-like class.
* An alias that denotes the memory space of this container-like class.

Comment on lines 30 to 33
* This class allocates a block of memory on the host or the device. Access to
* the elements of the block needs to be done using ArrayView. Note that when a
* reinit() function is called the underlying pointer is changed and thus, one
* need to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.
Suggested change
* This class allocates a block of memory on the host or the device. Access to
* the elements of the block needs to be done using ArrayView. Note that when a
* reinit() function is called the underlying pointer is changed and thus, one
* need to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.
* This class allocates a block of memory on the host or the device. The elements
* of the block must be accessed using ArrayView. Note that when a
* reinit() function is called, the underlying pointer is changed and thus, one
* needs to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.

{
public:
/**
* An alias the denotes the memory space of this conlainer-like class.
Suggested change
* An alias the denotes the memory space of this conlainer-like class.
* An alias that denotes the memory space of this container-like class.

MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other);

/**
* Copy ther data in @p other and move it to the appropriate memory space.
Suggested change
* Copy ther data in @p other and move it to the appropriate memory space.
* Copy the data stored in @p other and move it to the appropriate memory space.

#endif

/**
* Allocate memory on the device.
Suggested change
* Allocate memory on the device.
* Allocate memory on the host.

include/deal.II/base/memory_space_utils.h (outdated comment, resolved)
Comment on lines 120 to 124
* Apply the functor @p f to the range [0,size). This function accepts a
* lambda function instead of a functor. In this case, the code should look
Member:
How is a lambda function special here?

Member Author:

You need to compile deal.II with a special flag, you need to add __host__ __device__, you can only capture by copy, and there are other restrictions from CUDA.

Member:

Then I would say something like "In case the functor is a lambda, the code should look as follows [...]". To me, it sounded like this function can only be used if the functor is a lambda.

Member Author:

I see what you mean. It was poorly worded. I have changed the sentence in both places.

inline void
for_each(const dealii::MemorySpace::Host &,
unsigned int const size,
Functor f)
Member:
Doesn't

Suggested change
Functor f)
const Functor& f)

work?

Member Author:

It does, but then you pass by reference on the host while on the device you have to pass by value. If you have a bug in your copy constructor, this interface lets you catch it on the host.

Comment on lines 69 to 71
Utilities::MemorySpace::for_each(MemorySpace::Host{},
memory_block_host.size(),
check_functor_zero);
Member:
Do we really need the extra MemorySpace here? While I can see that that makes some sense for allocate_data, deallocate_data, and copy (since these are inherently memory operations), I would much rather see Utilities::for_each.

Member Author:

But then it won't show up on the same doxygen page. All these functions are there to simplify writing code that's independent of the MemorySpace. I agree that it's technically an execution space, not a memory space, but we don't have that concept in deal.II.

* Constructor. Allocate a block of @p size elements. The data is not
* initialized.
*/
MemoryBlock(unsigned int size);
Suggested change
MemoryBlock(unsigned int size);
MemoryBlock(const unsigned int size);

/**
* Copy constructor.
*/
MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other);
Member:
Can you explain what this does? That is, does it just copy the pointer, or does it in fact allocate memory and copy the objects in the memory block?

* initialized.
*/
void
reinit(unsigned int size);
Suggested change
reinit(unsigned int size);
reinit(const unsigned int size);

reinit(unsigned int size);

/**
* Clear the memory block, allocate a new block, and copy the data stored in @p other.
Suggested change
* Clear the memory block, allocate a new block, and copy the data stored in @p other.
* Release the memory block, allocate a new block, and copy the data stored in @p other.

MemoryBlock(const ArrayView<ElementType, MemorySpaceType2> &array_view);

/**
* Clear the memory block and allocate a new block. The data is not
Suggested change
* Clear the memory block and allocate a new block. The data is not
* Release the memory block and allocate a new block. The data is not

const ::dealii::MemorySpace::Host &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)
Suggested change
std::size_t size)
const std::size_t size)

const ::dealii::MemorySpace::CUDA &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)
Suggested change
std::size_t size)
const std::size_t size)

*/
template <typename Functor>
__global__ void
for_each_impl(unsigned int size, Functor f)
Suggested change
for_each_impl(unsigned int size, Functor f)
for_each_impl(const unsigned int size, Functor f)

void
for_each(const dealii::MemorySpace::CUDA &,
const unsigned int const size,
Functor f)
Member:
This function's argument list seems to be missing something. for_each is typically described as doing an operation for each object in a collection, but there is no collection here: the memory space does not point to any objects, and neither do the second or third arguments.

Would a better function name be something like for_each_index?


/**
* Constructor. Allocate a block of @p size elements. The data is not
* initialized.
Member:
"Not initialized" only exists in C++ if the data type is a built-in type -- I think this is std::is_standard_layout<T> but I forget the details. For all other types, one constructor or another needs to be run.

I suspect that you intend this class to only be used for number data types, but you write it as a generic ElementType. It would be worthwhile encoding your assumption via a static_assert.

Member Author:
It is always uninitialized on the device because we need to use the CUDA equivalent of malloc.

I suspect that you intend this class to only be used for number data types, but you write it as a generic ElementType. It would be worthwhile encoding your assumption via a static_assert.

Actually, I don't know about the data types. On the one hand, I would like to let users decide which data structure they want to use. On the other hand, if you are not careful, the performance on the GPU will be pretty bad. In the end, I decided to go with ElementType because it's what we do in ArrayView and the two classes work together.

Member:
Right, but if you can't guarantee initialization, then all a user will get is likely corrupted memory. I would much rather avoid that by having a static_assert in place somewhere that checks whether we have a data type that has no constructor anyway.

Member Author:
I don't think that std::is_standard_layout is what we want to use then. Is std::is_trivial https://en.cppreference.com/w/cpp/types/is_trivial what you have in mind?

Member:
Yes, is_trivial is what I had in mind.

@Rombur force-pushed the cuda_misc branch 2 times, most recently from 689b586 to 22483cf, on October 25, 2021 16:06
@masterleinad (Member):

I still don't quite agree with choosing Utilities::MemorySpace::for_each_index instead of Utilities::for_each_index. Does anyone else have an opinion?

@bangerth (Member):

If you named it Utilities::for_each_index then you could make the memory space argument a default argument at the end that most people will just ignore. I haven't quite understood what the argument is supposed to represent, though. Are you expressing what kind of "executor" you are using (CPU or GPU) and that you identify the executor with the memory space it works on?

@bangerth (Member):

In the end, I have the feeling that this sort of operation is really what libraries like Kokkos or Raja were made for. This may exceed what you have in mind for this patch, but out of curiosity, have you considered whether we should just build on one of these?

@Rombur (Member Author) commented Oct 25, 2021

Are you expressing what kind of "executor" you are using (CPU or GPU) and that you identify the executor with the memory space it works on?

Yes, exactly. That's why I've put it as the first argument, since that's where the executor goes in the STL.

In the end, I have the feeling that this sort of operation is really what libraries like Kokkos or Raja were made for. This may exceed what you have in mind for this patch, but out of curiosity, have you considered whether we should just build on one of these?

Yes, I did, but I don't know Raja, so I would have to learn it first. With Kokkos, I am worried about the integration with our current code (which shouldn't be too horrible) and the integration with Trilinos. Maybe it works out of the box, but maybe not. Since nobody has been pushing to use Kokkos, I went with the current solution, which I know is compatible with our current code.

@bangerth (Member):

OK. For the record, I would not be opposed to moving towards a model where we use Kokkos for these sorts (and probably plenty other) things. I recognize that it's another dependency and that may or may not play well with Trilinos. This might be a longer-term issue.

As for the issue with executor vs. memory space: I'd be OK with just documenting the issue, i.e., that the argument indicates an executor and that the executor is identified by the memory space it runs in. That's maybe not the most elegant solution, but it works. Or are we expecting that, longer term, things move to a unified (globally addressable) memory space where one could execute a GPU kernel that reads and writes into CPU memory or the other way around?

@Rombur (Member Author) commented Oct 25, 2021

Or are we expecting that longer term things move to a unified (globally addressable) memory space where one could execute a GPU kernel that reads and writes into CPU memory or the other way around?

You can already do that, but depending on the GPU you are using it's pretty slow. I think the consensus is to avoid doing that.

@masterleinad (Member):

OK. For the record, I would not be opposed to moving towards a model where we use Kokkos for these sorts (and probably plenty other) things. I recognize that it's another dependency and that may or may not play well with Trilinos. This might be a longer-term issue.

I think that's the right move. We might be able to support HIP with our current code, but I don't see how to support Intel GPUs. We just need someone who has the time to do it. 🙂

@Rombur (Member Author) commented Oct 25, 2021

Here is my plan for Intel GPUs: https://www.jlse.anl.gov/projects/exascale-computing-projects-ecp/ecp-2-4-3-05-hip-on-aurora/ I can't wait for the new ICEs we'll get.

@masterleinad (Member):

Here is my plan for Intel GPU jlse.anl.gov/projects/exascale-computing-projects-ecp/ecp-2-4-3-05-hip-on-aurora I can't wait for the new ICE, we'll get.

Sure, going with a backend that doesn't support them natively is surely a good idea.

@kronbichler (Member):

I agree with the general direction in this PR, and I also agree that we should set up and discuss a longer-term plan for memory management in this kind of CPU/GPU code. I would be in favor of letting Kokkos manage many of our data structures with combined CPU/GPU scope if we keep the ability to work with raw pointer infrastructure where needed (which Kokkos solves nicely with its View concept). My main question is the extent to which Kokkos and the upcoming STL functionality (pushed by some Kokkos people) overlap, and how we most efficiently spend our resources. @masterleinad @Rombur, you are closest to the development in these packages; what are the perspectives we should have from the deal.II side?

@Rombur (Member Author) commented Oct 27, 2021

The plan for Kokkos and C++ is to push Kokkos functionalities into the standard, with the idea that once it's in the standard, vendors will optimize that code. This is why Kokkos is pushing for things like MDSpan and BLAS in the standard. It also works the other way around: Kokkos has its own implementation of many std algorithms so that you can use the same functions on the CPU and the GPU. Does that mean that we can just wait for everything to get into the standard and we don't need Kokkos? Unfortunately, no. Getting things into the standard is just extremely slow. MDSpan may make it into C++23, but it's not certain. BLAS will not make it. If we wait for the C++ standard, we may get all we need in a decade or so.

Personally, I think that using Kokkos is the right move long term. We won't have to worry about what new architecture comes up because Kokkos will take care of it, and we can always get the pointer to the underlying data if we need to. Trilinos and many other important codes for the DOE are built on Kokkos, so it won't disappear suddenly. Even PETSc has an experimental backend using Kokkos (and yes, they download and install Kokkos themselves, like they do with MPI...).

I will have some time to work on our GPU code in the next few months. My plan was to make it easier to use our GPU code but I could work on Kokkos integration instead. We would have to discuss in which part of the library we want to use Kokkos and what the interface to Kokkos should look like.

@kronbichler (Member):

@Rombur thanks for the detailed answer; this overlaps with what I'm observing. I think it makes a lot of sense to discuss the general data structures we would move toward, with the goal of generic capabilities on host/device and with different vendors. As you said, the important thing for us (or at least for me) is to be able to keep working with raw pointers and arrays in case we need to (performance-wise or feature-wise).

@masterleinad (Member):

I agree with @Rombur here that there is no real point in waiting for the C++ standard to catch up and that using Kokkos instead makes sense.
The transition to Kokkos from the current CUDA implementation can be done in steps, since Kokkos can seamlessly be mixed with CUDA (as long as we only target CUDA). It probably doesn't make sense to use Kokkos for the host implementation anyway, but it should be easy to evaluate that once the transition to Kokkos is complete. Of course, there is some code that isn't quite portable, like copying to constant memory, but we could probably specialize for the targeted backend.

@Rombur (Member Author) commented Nov 2, 2021

Is there anything else blocking this PR? I am fine closing this if we decide that we are moving to Kokkos, but in that case I would like to have more input on things like design, scope of the work, etc.

@bangerth (Member) commented Nov 3, 2021

We already recognize Kokkos in cmake, though I have to admit that it's not clear to me what for. I see no reason not to already now require Kokkos for CUDA code. I wouldn't be opposed to requiring Kokkos in general if that is necessary to make everyone's life easier.

I'm not an expert in Kokkos, so I'm not sure what I can help with. But I'm happy to learn something about it if you'd like everyone interested in it to participate in a Zoom call. When you say "more input", what are you specifically looking for? I assume you're hoping for something you can also tell your management?

@masterleinad (Member):

I left a comment in #12894 and I'm also happy with having a Zoom call for this.

@Rombur (Member Author) commented Nov 3, 2021

We already recognize Kokkos in cmake, though I have to admit that it's not clear to me what for.

We use Kokkos for ArborX, the catch is that it only works on the CPU. This avoids some of the ugliness of using Kokkos with nvcc.

When you say "more input", what are you specifically looking for?

There are two things I am looking for:

  1. some technical feedback on what the new interface should look like, though I would expect that only @masterleinad knows enough about Kokkos to help
  2. buy-in from the other developers. So far, people who didn't use CUDA could basically pretend that the CUDA code didn't exist. This will be harder to do with Kokkos. I don't want to get into a situation where some code needs to be changed/needs to use Kokkos because of CUDA, but the PR is rejected because someone refuses to have Kokkos in that part of the code. We could keep Kokkos separated the way CUDA is right now, but to me that really decreases the appeal of using Kokkos.

@bangerth (Member) commented Nov 5, 2021

About 2, let's talk about it tomorrow. I tend to think that we all have to learn something like Kokkos at some point, and this might be the point.

@Rombur (Member Author) commented Mar 16, 2022

I've made the requested changes.

@drwells (Member) commented Mar 25, 2023

Since we now require Kokkos, do we still need this patch?

@Rombur Rombur closed this Mar 27, 2023
@Rombur Rombur deleted the cuda_misc branch May 25, 2023 15:11