
Introduce Utilities::MemorySpace namespace and MemoryBlock class #12821

Closed
wants to merge 6 commits

Conversation

@Rombur (Member) commented Oct 12, 2021

The goal of this PR is to make it easier to write code that works both on the host and the device. To that end, the PR does the following:

  1. Add helper functions to allocate memory, free memory, and copy memory on the host and the device. I use tag dispatching to choose the correct implementation, i.e., allocating memory uses new on the host but cudaMalloc on the device.
  2. Add a for_each function that performs a simple for loop on the host and launches a kernel on the device.
  3. Right now, the only data structure that works on both the host and the device is LA::distributed::Vector. I have added a new class, MemoryBlock, that allocates a block of memory either on the host or the device. To access the underlying data of this class, you need to use an ArrayView. I've done this because of limitations of lambda functions with CUDA. In particular, you cannot use private member data in the lambda. Since you have to use an ArrayView to access the data, you can safely have a private MemoryBlock and then create an ArrayView just before calling the lambda. The other advantage is that the copy constructor of MemoryBlock is a regular copy constructor; if MemoryBlock could be used directly in a kernel, we would have to do a shallow copy and then keep track of the number of copies.

@Rombur Rombur added the GPU label Oct 12, 2021
@Rombur Rombur changed the title Introde Utilities::MemorySpace namespace and MemoryBlock class Introduce Utilities::MemorySpace namespace and MemoryBlock class Oct 18, 2021
Comment on lines 1 to 4
New: New namespace Utilities::MemorySpace that contains functions to help
memory space independent code. New MemoryBlock class. This class allocates a
block of memory on the host or the device. The underlying data can be access
using ArrayView.
Suggested change
New: New namespace Utilities::MemorySpace that contains functions to help
memory space independent code. New MemoryBlock class. This class allocates a
block of memory on the host or the device. The underlying data can be access
using ArrayView.
New: The new Utilities::MemorySpace namespace contains functions to help
memory space independent code. The new MemoryBlock class allocates a
block of memory on the host or the device. The underlying data can be accessed
using ArrayView.

@@ -87,6 +87,11 @@ class ArrayView
*/
using value_type = ElementType;

/**
* An alias the denotes the memory space of this conlainer-like class.
Suggested change
* An alias the denotes the memory space of this conlainer-like class.
* An alias that denotes the memory space of this container-like class.

Comment on lines 30 to 33
* This class allocates a block of memory on the host or the device. Access to
* the elements of the block needs to be done using ArrayView. Note that when a
* reinit() function is called the underlying pointer is changed and thus, one
* need to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.
Suggested change
* This class allocates a block of memory on the host or the device. Access to
* the elements of the block needs to be done using ArrayView. Note that when a
* reinit() function is called the underlying pointer is changed and thus, one
* need to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.
* This class allocates a block of memory on the host or the device. The elements
* of the block must be accessed using ArrayView. Note that when a
* reinit() function is called, the underlying pointer is changed and thus, one
* needs to call reinit() on the ArrayView associated with the reinitialized
* MemoryBlock.

{
public:
/**
* An alias the denotes the memory space of this conlainer-like class.
Suggested change
* An alias the denotes the memory space of this conlainer-like class.
* An alias that denotes the memory space of this container-like class.

MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other);

/**
* Copy ther data in @p other and move it to the appropriate memory space.
Suggested change
* Copy ther data in @p other and move it to the appropriate memory space.
* Copy the data stored in @p other and move it to the appropriate memory space.

#endif

/**
* Allocate memory on the device.
Suggested change
* Allocate memory on the device.
* Allocate memory on the host.

include/deal.II/base/memory_space_utils.h (outdated comment, resolved)
Comment on lines 120 to 124
* Apply the functor @p f to the range [0,size). This function accepts a
* lambda function instead of a functor. In this case, the code should look
Member:
How is a lambda function special here?

Member Author:

You need to compile deal.II with a special flag, you need to add __host__ __device__, you can only capture by copy, and there are other restrictions from CUDA.

Member:

Then I would say something like "In case the functor is a lambda, the code should look as follows [...]". To me, it sounded like this function can only be used if the functor is a lambda.

Member Author:

I see what you mean. It was poorly worded. I have changed the sentence in both places.

inline void
for_each(const dealii::MemorySpace::Host &,
unsigned int const size,
Functor f)
Member:
Doesn't

Suggested change
Functor f)
const Functor& f)

work?

Member Author:

It does, but then you pass by reference on the host while on the device you have to pass by value. If you have a bug in your copy constructor, this interface lets you catch it on the host.

Comment on lines 69 to 71
Utilities::MemorySpace::for_each(MemorySpace::Host{},
memory_block_host.size(),
check_functor_zero);
Member:
Do we really need the extra MemorySpace here? While I can see that that makes some sense for allocate_data, deallocate_data, and copy (since these are inherently memory operations), I would much rather see Utilities::for_each.

Member Author:

But then it won't show up on the same doxygen page. All these functions are there to simplify writing code that's independent of the MemorySpace. I agree that it's technically an execution space, not a memory space, but we don't have that concept in deal.II.

* Constructor. Allocate a block of @p size elements. The data is not
* initialized.
*/
MemoryBlock(unsigned int size);
Suggested change
MemoryBlock(unsigned int size);
MemoryBlock(const unsigned int size);

/**
* Copy constructor.
*/
MemoryBlock(const MemoryBlock<ElementType, MemorySpaceType> &other);
Member:
Can you explain what this does? That is, does it just copy the pointer, or does it in fact allocate memory and copy the objects in the memory block?

* initialized.
*/
void
reinit(unsigned int size);
Suggested change
reinit(unsigned int size);
reinit(const unsigned int size);

reinit(unsigned int size);

/**
* Clear the memory block, allocate a new block, and copy the data stored in @p other.
Suggested change
* Clear the memory block, allocate a new block, and copy the data stored in @p other.
* Release the memory block, allocate a new block, and copy the data stored in @p other.

MemoryBlock(const ArrayView<ElementType, MemorySpaceType2> &array_view);

/**
* Clear the memory block and allocate a new block. The data is not
Suggested change
* Clear the memory block and allocate a new block. The data is not
* Release the memory block and allocate a new block. The data is not

const ::dealii::MemorySpace::Host &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)
Suggested change
std::size_t size)
const std::size_t size)

const ::dealii::MemorySpace::CUDA &,
Number *out,
const ::dealii::MemorySpace::CUDA &,
std::size_t size)
Suggested change
std::size_t size)
const std::size_t size)

*/
template <typename Functor>
__global__ void
for_each_impl(unsigned int size, Functor f)
Suggested change
for_each_impl(unsigned int size, Functor f)
for_each_impl(const unsigned int size, Functor f)

void
for_each(const dealii::MemorySpace::CUDA &,
const unsigned int const size,
Functor f)
Member:
This function's argument list seems to be missing something. for_each is typically described as doing an operation for each object in a collection, but there is no collection here: the memory space does not point to any objects, and neither do the second or third arguments.

Would a better function name be something like for_each_index?


/**
* Constructor. Allocate a block of @p size elements. The data is not
* initialized.
Member:
"Not initialized" only exists in C++ if the data type is a built-in type -- I think this is std::is_standard_layout<T> but I forget the details. For all other types, one constructor or another needs to be run.

I suspect that you intend this class to only be used for number data types, but you write it as a generic ElementType. It would be worthwhile encoding your assumption via a static_assert.

Member Author:
It is always uninitialized on the device because we need to use the CUDA equivalent of malloc.

I suspect that you intend this class to only be used for number data types, but you write it as a generic ElementType. It would be worthwhile encoding your assumption via a static_assert.

Actually, I don't know about the data types. On the one hand, I would like to let users decide which data structure they want to use. On the other hand, if you are not careful, the performance on the GPU will be pretty bad. In the end, I decided to go with ElementType because it's what we do in ArrayView and the two classes work together.

Member:
Right, but if you can't guarantee initialization, then all a user will get is likely corrupted memory. I would much rather avoid that by having a static_assert in place somewhere that checks whether we have a data type that has no constructor anyway.

Member Author:
I don't think that std::is_standard_layout is what we want to use then. Is std::is_trivial https://en.cppreference.com/w/cpp/types/is_trivial what you have in mind?

Member:
Yes, is_trivial is what I had in mind.

@Rombur force-pushed the cuda_misc branch 2 times, most recently from 689b586 to 22483cf, on October 25, 2021 16:06
@masterleinad (Member):

I still don't quite agree with choosing Utilities::MemorySpace::for_each_index instead of Utilities::for_each_index. Does anyone else have an opinion?

@bangerth (Member):

If you named it Utilities::for_each_index then you could make the memory space argument a default argument at the end that most people will just ignore. I haven't quite understood what the argument is supposed to represent, though. Are you expressing what kind of "executor" you are using (CPU or GPU) and that you identify the executor with the memory space it works on?

@bangerth (Member):

In the end, I have the feeling that this sort of operation is really what libraries like Kokkos or Raja were made for. This may exceed what you have in mind for this patch, but out of curiosity, have you considered whether we should just build on one of these?

@Rombur (Member Author) commented Oct 25, 2021

Are you expressing what kind of "executor" you are using (CPU or GPU) and that you identify the executor with the memory space it works on?

Yes, exactly. That's why I've put it as the first argument, since that's where the executor goes in the STL.

In the end, I have the feeling that this sort of operation is really what libraries like Kokkos or Raja were made for. This may exceed what you have in mind for this patch, but out of curiosity, have you considered whether we should just build on one of these?

Yes, I did, but I don't know Raja, so I would have to learn it first. With Kokkos, I am worried about the integration with our current code (which shouldn't be too horrible) and the integration with Trilinos. Maybe it works out of the box, but maybe not. Since nobody has been pushing to use Kokkos, I went with the current solution, which I know is compatible with our current code.

@bangerth (Member):

OK. For the record, I would not be opposed to moving towards a model where we use Kokkos for these sorts (and probably plenty other) things. I recognize that it's another dependency and that may or may not play well with Trilinos. This might be a longer-term issue.

As for the issue with executor vs. memory space: I'd be OK with just documenting the issue, i.e., that the argument indicates an executor and that the executor is identified by the memory space it runs in. That's maybe not the most elegant solution, but it works. Or are we expecting that, longer term, things move to a unified (globally addressable) memory space where one could execute a GPU kernel that reads and writes into CPU memory or the other way around?

@Rombur (Member Author) commented Oct 25, 2021

Or are we expecting that longer term things move to a unified (globally addressable) memory space where one could execute a GPU kernel that reads and writes into CPU memory or the other way around?

You can already do that, but depending on the GPU you are using it's pretty slow. I think the consensus is to avoid doing that.

@masterleinad (Member):

OK. For the record, I would not be opposed to moving towards a model where we use Kokkos for these sorts (and probably plenty other) things. I recognize that it's another dependency and that may or may not play well with Trilinos. This might be a longer-term issue.

I think that's the right move. We might be able to support HIP with our current code, but I don't see how to support Intel GPUs. We just need someone who has the time to do it. 🙂

@Rombur (Member Author) commented Oct 25, 2021

Here is my plan for Intel GPUs: https://www.jlse.anl.gov/projects/exascale-computing-projects-ecp/ecp-2-4-3-05-hip-on-aurora/ I can't wait for the new ICEs we'll get.

@masterleinad (Member):

Here is my plan for Intel GPU jlse.anl.gov/projects/exascale-computing-projects-ecp/ecp-2-4-3-05-hip-on-aurora I can't wait for the new ICE, we'll get.

Sure, going with a backend that doesn't support them natively is surely a good idea.

@kronbichler (Member):

I agree with the general direction in this PR, and I also agree that we should set up and discuss a longer-term plan for memory management in this kind of CPU/GPU code. I would be in favor of letting Kokkos manage many of our data structures with combined CPU/GPU scope if we keep the ability to work with raw pointer infrastructure where needed (which Kokkos solves nicely with its View concept). My main question is the extent to which Kokkos and the upcoming STL functionality (pushed by some Kokkos people) overlap, and how we most efficiently spend our resources. @masterleinad @Rombur, you are closest to the development in these packages; what are the perspectives we should have from the deal.II side?

@Rombur (Member Author) commented Oct 27, 2021

The plan for Kokkos and C++ is to push Kokkos functionalities into the standard, with the idea that once it's in the standard, vendors will optimize that code. This is why Kokkos is pushing for things like MDSpan and BLAS in the standard. It also works the other way around: Kokkos has its own implementation of many std algorithms so that you can use the same functions on the CPU and the GPU. Does that mean that we can just wait for everything to get into the standard and we don't need Kokkos? Unfortunately, no. Getting things into the standard is just extremely slow. MDSpan may make it into C++23, but it's not certain. BLAS will not make it. If we wait for the C++ standard, we may get all we need in a decade or so.

Personally, I think that using Kokkos is the right move long term. We won't have to worry about what new architecture comes up because Kokkos will take care of it, and we can always get the pointer to the underlying data if we need to. Trilinos and many other important codes for the DOE are built on Kokkos, so it won't disappear suddenly. Even PETSc has an experimental backend using Kokkos (and yes, they download and install Kokkos themselves, like they do with MPI...).

I will have some time to work on our GPU code in the next few months. My plan was to make it easier to use our GPU code but I could work on Kokkos integration instead. We would have to discuss in which part of the library we want to use Kokkos and what the interface to Kokkos should look like.

@kronbichler (Member):

@Rombur thanks for the detailed answer; this overlaps with what I'm observing. I think it makes a lot of sense to discuss the general data structures we would move toward, with the goal of generic capabilities on host/device and with different vendors. As you said, the important thing for us (or at least for me) is to be able to keep working with raw pointers and arrays in case we need to (performance-wise or feature-wise).

@masterleinad (Member):

I agree with @Rombur here that there is no real point in waiting for the C++ standard to catch up and that using Kokkos instead makes sense.
The transition to Kokkos from the current CUDA implementation can be done in steps, since Kokkos can seamlessly be mixed with CUDA (as long as we only target CUDA). It probably doesn't make sense to use Kokkos for the host implementation anyway, but it should be easy to evaluate that once the transition to Kokkos is complete. Of course, there is some code that isn't quite portable, like copying to constant memory, but we could probably specialize for the targeted backend.

@Rombur (Member Author) commented Nov 2, 2021

Is there anything else blocking this PR? I am fine closing this if we decide that we are moving to Kokkos, but in that case I would like to have more input on things like design, scope of the work, etc.

@bangerth (Member) commented Nov 3, 2021

We already recognize Kokkos in cmake, though I have to admit that it's not clear to me what for. I see no reason not to already now require Kokkos for CUDA code. I wouldn't be opposed to requiring Kokkos in general if that is necessary to make everyone's life easier.

I'm not an expert in Kokkos, so I'm not sure what I can help with. But I'm happy to learn something about it if you'd like everyone interested in it to participate in a Zoom call. When you say "more input", what are you specifically looking for? I assume you're hoping for something you can also tell your management?

@masterleinad (Member):

I left a comment in #12894 and I'm also happy with having a Zoom call for this.

@Rombur (Member Author) commented Nov 3, 2021

We already recognize Kokkos in cmake, though I have to admit that it's not clear to me what for.

We use Kokkos for ArborX, the catch is that it only works on the CPU. This avoids some of the ugliness of using Kokkos with nvcc.

When you say "more input", what are you specifically looking for?

There are two things I am looking for:

  1. some technical feedback on what the new interface should look like, though I would expect that only @masterleinad knows enough about Kokkos to help
  2. buy-in from the other developers. So far, people who didn't use CUDA could basically pretend that the CUDA code didn't exist. This will be harder to do with Kokkos. I don't want to get into a situation where some code needs to be changed/needs to use Kokkos because of CUDA, but the PR is rejected because someone refuses to have Kokkos in that part of the code. We could keep Kokkos separated the way CUDA is right now, but to me that really decreases the appeal of using Kokkos.

@bangerth (Member) commented Nov 5, 2021

About 2, let's talk about it tomorrow. I tend to think that we all have to learn something like Kokkos at some point, and this might be the point.

@Rombur (Member Author) commented Mar 16, 2022

I've made the requested changes.

@drwells (Member) commented Mar 25, 2023

Since we now require Kokkos, do we still need this patch?

@Rombur Rombur closed this Mar 27, 2023
@Rombur Rombur deleted the cuda_misc branch May 25, 2023 15:11