
Add a method that copies memory only if it is in a different memory #28

Open · BenjaminW3 opened this issue Jan 14, 2015 · 12 comments

@BenjaminW3 (Member)

To allow methods that take const memory to prevent double buffering on the host, a method copyIfDifferentMem should be implemented.

@BenjaminW3 BenjaminW3 changed the title Add a method that copies memory only if it is in a different MemorySpace Add a method that copies memory only if it is in a different Memory Apr 6, 2015
@BenjaminW3 BenjaminW3 changed the title Add a method that copies memory only if it is in a different Memory Add a method that copies memory only if it is in a different memory Apr 6, 2015
@j-stephan (Member)

We will try to tackle this for 0.9 since it was mentioned by CMS as a nice-to-have.

@bernhardmgruber (Member) commented Oct 25, 2022

Is this issue about zero-copying between two buffers, if they reside on the same device? If yes, this should be similar to Kokkos::create_mirror_view (https://github.com/kokkos/kokkos/wiki/View).

We definitely should have this functionality in alpaka. I needed it a couple of times already. I would propose an API similar to Kokkos, but using the target device explicitly:

auto buf = alpaka::allocBuf(devA, ...);

auto buf2 = alpaka::mirrorBuf(buf, devA); // buf2 shares the same memory with buf
alpaka::memcpy(buf, buf2); // zero-copy, does nothing

auto buf3 = alpaka::mirrorBuf(buf, devB); // buf3 may share the same memory, depending on whether devA and devB share memory
alpaka::memcpy(buf, buf3); // maybe zero-copy, maybe deep copy

The typical use case would be that devA is the accelerator device and devB is the host device. If the accelerator device devA lives in the same memory as the host device, we would share the allocation. alpaka::memcpy could then simply check whether the two buffers point to the same allocation and avoid the copy in case they do.
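
A minimal sketch of such a check, assuming a hypothetical helper called copyIfDifferent (not existing alpaka API); comparing the native pointers is only illustrative, a real implementation would also have to take the devices into account:

template<typename TQueue, typename TViewDst, typename TViewSrc>
void copyIfDifferent(TQueue& queue, TViewDst& dst, TViewSrc const& src)
{
    // if both views already alias the same allocation, there is nothing to do
    if(static_cast<void const*>(alpaka::getPtrNative(dst)) == static_cast<void const*>(alpaka::getPtrNative(src)))
        return;
    alpaka::memcpy(queue, dst, src); // otherwise fall back to a deep copy
}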

@fwyzard (Contributor) commented Oct 27, 2022

Hi @bernhardmgruber, I'm not familiar with Kokkos' approach, but if you write something like

auto buf3 = alpaka::mirrorBuf(buf, devB); // buf3 may share the same memory, depending on whether devA and devB share memory

I would expect that buf3 is automatically updated whenever buf is modified, and vice versa.

Instead, from the description of the issue I understand that the intended behaviour is that the copy is a one-time action, and that it should simply be elided if the two buffers are on the same device ?

Something like

auto src = alpaka::allocBuf(devA, ...);
auto dst = alpaka::allocBuf(devB, ...);

// if the two buffers are on different devices, copy the content, otherwise alias the buffer
if constexpr(std::is_same_v<decltype(devA), decltype(devB)>) {
  if (devA == devB) {
    dst = src;
  } else {
    alpaka::memcpy(dst, src);
  }
} else {
  alpaka::memcpy(dst, src);
}

If I understood correctly, I think that adding a new kind of buffer or view may not be the best approach, and instead we could just add an alternative to memcpy.

Something like (for lack of a better name)

template<typename TQueue, typename TViewDst, typename TViewSrc>
void convey(TQueue& queue, TViewDst&& dst, TViewSrc const& src) // in namespace alpaka
{
  ...
}

@j-stephan (Member)

Maybe something like make_available? If it already is available on the given device nothing happens, otherwise a copy is performed.

@bernhardmgruber (Member)

@fwyzard That is roughly the implementation I had in mind! Maybe we can also allow e.g. different CPU devices to have this sharing behavior or, once unified shared memory is implemented, have it for CPU/GPU as well.
I just saw Kokkos::create_mirror_view on a slide at ACAT22 and thought that we should really have that now.

You are right that mirrorBuf could imply that the buffers are kept in sync automatically, which is not what is intended. The intent is to allow zero-copies and fall back to deep copies when necessary. This should also be the default API users reach for instead of memcpy when they just want to transfer data to the device for a kernel and then transfer it back. This is why I thought I could add it to memcpy directly, so we do not need to train users to use a different API. I guess this is why Kokkos explicitly names its memcpy equivalent deep_copy.
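
As a rough usage sketch of that default path (makeAvailable is only the name under discussion, not existing alpaka API; Idx and extent are assumed to be defined):

auto hostBuf = alpaka::allocBuf<float, Idx>(devHost, extent);
auto devBuf = alpaka::makeAvailable(queue, devAcc, hostBuf); // zero-copy if devAcc can already see hostBuf, deep copy otherwise
// ... run kernels on devBuf ...
auto resultBuf = alpaka::makeAvailable(queue, devHost, devBuf); // transfer back, again copying only if necessary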

If I understood correctly, I think that adding a new kind of buffer or view may not be the best approach, and instead we could just add an alternative to memcpy.

Yes, no new buffer type. Yes, I want to have a smarter memcpy that may return a buffer that shares storage with the old one or not.

Maybe something like make_available? If it already is available on the given device nothing happens, otherwise a copy is performed.

Valid name! convey is shorter, but I had to look it up in a dictionary quickly. We can have a vote later when I have a PR.

@fwyzard (Contributor) commented Oct 29, 2022

Yes, I want to have a smarter memcpy that may return a buffer that shares storage with the old one or not.

I agree... however we should keep in mind that this would be an API with a different behaviour depending on the accelerator.

For example:

  • if src is a CPU buffer and dst is a GPU buffer, after smart_memcpy(queue, dst, src) has completed the src buffer can be safely freed or modified;
  • if both src and dst are GPU buffers, after smart_memcpy(queue, dst, src) has completed the src buffer should not be modified, because any changes would be reflected in the dst buffer.

So one has to be aware of the difference when writing generic code !

On the other hand, I wouldn't know how to make it more obvious from the API itself, apart from documenting the behaviour :-)
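
A hypothetical illustration of that difference (smart_memcpy stands for whatever the new function ends up being called; Idx and extent are assumed to be defined):

auto src = alpaka::allocBuf<float, Idx>(devA, extent);
auto dst = alpaka::allocBuf<float, Idx>(devB, extent);
smart_memcpy(queue, dst, src); // deep copy if devA and devB differ, otherwise dst is made to alias src
alpaka::wait(queue);
// generic code must not modify or free src here: in the aliasing case (devA == devB)
// any later write to src would also be visible through dst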

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Oct 29, 2022
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@psychocoderHPC (Member)

IMO, to model the ideas described in this issue we need memory visibility information within alpaka.
On Frontier/Crusher at ORNL we have e.g. a fully cache-coherent system where it is not required to use managed memory.
On CUDA, but also on some ROCm devices, you must use managed memory. If you use managed memory, you need to make the user aware that two buffers representing the same memory cannot be used in parallel on two different accelerators, e.g. CUDA and HOST.

My point is that we must first extend the descriptive information of our Buffers/Views to model a solution for this issue.

@fwyzard (Contributor) commented Nov 1, 2022

From what I just read, the Frontier/Crusher approach is not very dissimilar from unified/managed memory. The differences are that

  • malloced memory can be treated as managed memory (IIRC CUDA on Power has a similar capability);
  • device memory can be accessed from the host as "zero copy" over the fabric (similar to a device accessing pinned host memory, but in the opposite direction).

That's an interesting approach, and it should be possible to handle it in the same way as unified/managed memory.

So, between the different backends, the possible cases would be:

  • standard host memory, not accessible by the device;
  • pinned host memory, accessible by the device as "zero copy" over the fabric;
  • standard device memory, not accessible by the host;
  • fancy (?) device memory, accessible by the host as "zero copy" over the fabric;
  • managed memory, accessible by the host and device, may migrate or be "zero copy" or be physically shared, etc.

Standard and pinned host memory are handled by a BufCpu.

Standard device memory is handled by a BufCudaRt/BufHipRt/etc.
Fancy (?) device memory could be handled by extending the BufHipRt class ?
In fact I think that if one uses a BufHipRt buffer and accesses the memory from the host, it would just work.

Managed memory probably needs a new kind of buffer type, in order to expose new methods to migrate the memory across the host and device(s).

Maybe a more general approach could be to extend the interface of the buffer classes to include an additional device (or list of additional devices) from where the memory can be accessed ?
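
A hypothetical sketch, just to make that idea concrete (none of these names exist in alpaka):

#include <algorithm>
#include <vector>

// a view that additionally remembers from which devices its memory can be accessed
template<typename TBuf, typename TDev>
struct MultiDevView
{
    TBuf buf;
    std::vector<TDev> accessibleFrom; // the allocating device plus every device the memory is mapped/visible to

    // a real implementation would have to handle devices of different types (host + GPU), e.g. via type erasure
    bool isAccessibleFrom(TDev const& dev) const
    {
        return std::find(accessibleFrom.begin(), accessibleFrom.end(), dev) != accessibleFrom.end();
    }
};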

But IMHO all this is orthogonal to the method discussed here: we would simply not implement the convey or smart_memcpy method for managed memory buffers (at least, I cannot see a way to implement it based on the way Alpaka buffers are coded).

@bernhardmgruber (Member)

Further comment from @fwyzard (#1820 (comment)):

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); ?
Make a copy, or let SomeGpu access the buffer from OtherGpu memory through nvlink/PCIe/etc. ?

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Nov 11, 2022
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber bernhardmgruber removed this from To do in Release 1.0 Dec 9, 2022
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 13, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 24, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber (Member)

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); ?
Make a copy, or let SomeGpu access the buffer from OtherGpu memory through nvlink/PCIe/etc. ?

Having thought about this for a bit, it should be the latter. If the user intends to have a full copy, they should use alpaka::memcpy. Zero-copying is about making the data available in a different realm, precisely to avoid a copy.
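
At the CUDA level this would essentially rely on peer access; a sketch of the underlying runtime calls, not something alpaka exposes in this form:

cudaSetDevice(someGpu);                  // the device that will do the accessing
cudaDeviceEnablePeerAccess(otherGpu, 0); // flags must be 0; afterwards pointers into otherGpu memory
                                         // can be dereferenced by kernels running on someGpu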

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 25, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 25, 2023
Adds an overload set `makeAvailable` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber (Member)

Standard and pinned host memory are handled by a BufCpu.
Standard device memory is handled by a BufCudaRt/BufHipRt/etc.
Fancy (?) device memory could be handled by extending the BufHipRt class ?
In fact I think that if one uses a BufHipRt buffer and accesses the memory from the host, it would just work.

Managed memory probably needs a new kind of buffer type, in order to expose new methods to migrate the memory across the host and device(s).

I was wondering about this myself today, and it led to the question of what it actually means for a buffer to be e.g. a BufCpu or a BufUniformCudaHip. It seems that those types mean that the buffers are resident in RAM or in CUDA device memory. Funnily, the type does not necessarily say which API is used to manage the buffer. E.g. a BufCpu can use the CUDA API to delete itself if it was allocated as a mapped buffer.

I came across this question when considering what the type of buf2 should be:

auto buf1 = alpaka::allocMappedBuf(devHost, ...);
auto buf2 = alpaka::makeAvailable(queue, devCuda, buf1);

Because makeAvailable needs to be written such that it can allocate a new buffer on the destination device devCuda, its return type is what alpaka::allocBuf returns, which is an alpaka::BufUniformCudaHip. So, because all branches of a C++ function must return the same type (and because whether we can zero-copy or need to allocate is a runtime decision), the type of buf2 needs to be alpaka::BufUniformCudaHip. So now we have buf1 and buf2 being of two different types, but sharing the same allocation internally. Also, alpaka::BufUniformCudaHip then no longer means that the buffer is GPU resident. Furthermore, not all operations that you can typically do on an alpaka::BufUniformCudaHip would make sense anymore if it internally held a buffer mapped from host memory.

Notice that I have not even started talking about managed memory yet and whether it needs its own data type or not. We somewhat have that problem today already. Well, we can largely ignore the problem if we don't merge and evolve #1820, but still: the buffer data type does not tell us where memory is accessible from.

The reason we did not need to address this issue until now is that alpaka can type-erase all that information by calling getPtrNative(buf) -> T* and then being careful about which kernel we pass this T* to, as the T* no longer carries any information.

So, maybe managed memory needs a new kind of buffer type. And maybe such a general "MultiDeviceBuffer" also needs to be returned from makeAvailable to cover the cases of a newly allocated, a mapped, or a managed buffer. I don't know the answer to this and I will probably not have the time to find it either.

Maybe a more general approach could be to extend the interface of the buffer classes to include an additional device (or list of additional devices) from where the memory can be accessed ?

@psychocoderHPC suggested that to me again yesterday as well. This may be a proper solution. It also covers the case that we can add devices to and remove devices from this list when we map/unmap memory to additional devices. However, what is then the difference between a BufCpu accessible from the host and gpu1, and a BufUniformCudaHip accessible from gpu1 and the host? What is the type telling us?

But IMHO all this is orthogonal to the method discussed here: we would simply not implement the convey or smart_memcpy method for managed memory buffers (at least, I cannot see a way to implement it based on the way Alpaka buffers are coded).

Yes and no. makeAvailable(queue, dstDev, srcView) -> dstView should just "do the right thing" and take all extra information from buffers/views and devices into account. So if the source view uses e.g. managed memory, or is already mapped to the target device, then a zero-copy (= adjusting internal handles) will be performed as well.

We can ignore that for now though, and what we have is still useful, as is the state of #1820. There, we just perform a zero-copy if the source and destination device are the same. This already covers the important use case of avoiding the copies in a host->device copy, kernel, device->host copy program. However, going further and allowing makeAvailable to perform a zero-copy in more cases requires us to solve this issue.

@fwyzard (Contributor) commented Mar 4, 2023

Some random ideas about what makeAvailable could or should do...

I think we may need to distinguish "make available for reading once" and "make available for efficiently working with":

  • pinned host memory is available to a GPU; it could be a reasonable solution for reading some sparse data, but I would not consider it a good solution for a memory-intensive kernel
  • accessing GPU memory from another GPU (e.g. after calling cudaDeviceEnablePeerAccess(...)) may be more or less efficient than making a full copy
  • managed/unified memory could be accessed directly, or explicitly migrated

For the first case, here called makeAvailableOnce (it's not a good name), I've come up with:

| from                 | to: host                   | to: device    | to: other device                                          |
|----------------------|----------------------------|---------------|-----------------------------------------------------------|
| host, default memory | same buffer                | device buffer | other device buffer                                       |
| host, pinned memory  | same buffer                | same buffer   | other device buffer                                       |
| device memory        | host buffer, pinned memory | same buffer   | same buffer if peer access, other device buffer otherwise |
| managed memory       | same buffer                | same buffer   | same buffer                                               |

For the second case, here called makeAvailableCached (it's not a good name, either), I've come up with:

| from                 | to: host                        | to: device                      | to: other device                                                 |
|----------------------|---------------------------------|---------------------------------|------------------------------------------------------------------|
| host, default memory | same buffer                     | device buffer                   | other device buffer                                              |
| host, pinned memory  | same buffer                     | device buffer                   | other device buffer                                              |
| device memory        | host buffer, pinned memory      | same buffer                     | same buffer if fast peer access, other device buffer otherwise ? |
| managed memory       | same buffer, request migration? | same buffer, request migration? | same buffer, request migration?                                  |

In this table, makeAvailableCached always has the same return type for accessing host memory from a device, irrespective of whether the host buffer was pinned/mapped or not.

On the other hand, makeAvailableOnce has the same problem mentioned by @bernhardmgruber: the return type should be different depending on whether the host buffer is pinned/mapped or not.
The only solution I can think of is to return an std::variant<BufCpu, BufUniformCudaHip>.
Or, to avoid changing the Alpaka API, to define a new ViewAnyBuf type that can wrap different types of buffers, and forward any call to the right type ?
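
A small sketch of the std::variant idea (EitherBuf and the usage in the comment are placeholders, not alpaka API):

#include <variant>

// makeAvailableOnce(queue, devAcc, hostBuf) could return either the original (pinned/mapped) host buffer
// or a freshly allocated device buffer, depending on a runtime decision
template<typename TBufHost, typename TBufDev>
using EitherBuf = std::variant<TBufHost, TBufDev>;

// callers would then obtain the pointer to pass to a kernel via std::visit, e.g.
// auto* ptr = std::visit([](auto& buf) { return alpaka::getPtrNative(buf); }, eitherBuf);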
