
Add a method that copies memory only if it is in a different memory #28

Open · BenjaminW3 opened this issue Jan 14, 2015 · 12 comments

@BenjaminW3 (Member)

To allow methods that take const memory to prevent double buffering on the host, a method copyIfDifferentMem should be implemented.

@BenjaminW3 BenjaminW3 changed the title Add a method that copies memory only if it is in a different MemorySpace Add a method that copies memory only if it is in a different Memory Apr 6, 2015
@BenjaminW3 BenjaminW3 changed the title Add a method that copies memory only if it is in a different Memory Add a method that copies memory only if it is in a different memory Apr 6, 2015
@j-stephan (Member)

We will try to tackle this for 0.9 since it was mentioned by CMS as a nice-to-have.

@bernhardmgruber (Member) commented Oct 25, 2022

Is this issue about zero-copying between two buffers, if they reside on the same device? If yes, this should be similar to Kokkos::create_mirror_view (https://github.com/kokkos/kokkos/wiki/View).

We definitely should have this functionality in alpaka. I needed it a couple of times already. I would propose an API similar to Kokkos, but using the target device explicitly:

auto buf = alpaka::allocBuf(devA, ...);

auto buf2 = alpaka::mirrorBuf(buf, devA); // buf2 shares the same memory with buf
alpaka::memcpy(buf, buf2); // zero-copy, does nothing

auto buf3 = alpaka::mirrorBuf(buf, devB); // buf3 may share the same memory, depending on whether devA and devB share memory
alpaka::memcpy(buf, buf3); // maybe zero-copy, maybe deep copy

The typical use case would be that devA is the accelerator device and devB is the host device. If the accelerator device devA lives in the same memory as the host device, we would share the allocation. alpaka::memcpy could then simply check whether the two buffers point to the same allocation and avoid the copy in case they do.
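
A minimal sketch of such a check, assuming a hypothetical helper called copyIfDifferent (not existing alpaka API); comparing the native pointers is only illustrative, a real implementation would also have to take the devices into account:

template<typename TQueue, typename TViewDst, typename TViewSrc>
void copyIfDifferent(TQueue& queue, TViewDst& dst, TViewSrc const& src)
{
    // if both views already alias the same allocation, there is nothing to do
    if(static_cast<void const*>(alpaka::getPtrNative(dst)) == static_cast<void const*>(alpaka::getPtrNative(src)))
        return;
    alpaka::memcpy(queue, dst, src); // otherwise fall back to a deep copy
}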

@fwyzard (Contributor) commented Oct 27, 2022

Hi @bernhardmgruber, I'm not familiar with Kokkos' approach, but if you write something like

auto buf3 = alpaka::mirrorBuf(buf, devB); // buf3 may share the same memory, depending on whether devA and devB share memory

I would expect that buf3 is automatically updated whenever buf is modified, and vice versa.

Instead, from the description of the issue I understand that the intended behaviour is that the copy is a one-time action, and that it should simply be elided if the two buffers are on the same device ?

Something like

auto src = alpaka::allocBuf(devA, ...);
auto dst = alpaka::allocBuf(devB, ...);

// if the two buffers are on different devices, copy the content, otherwise alias the buffer
if constexpr(std::is_same_v<decltype(devA), decltype(devB)>) {
  if (devA == devB) {
    dst = src;
  } else {
    alpaka::memcpy(dst, src);
  }
} else {
  alpaka::memcpy(dst, src);
}

If I understood correctly, I think that adding a new kind of buffer or view may not be the best approach, and instead we could just add an alternative to memcpy.

Something like (for lack of a better name)

template<typename TQueue, typename TViewDst, typename TViewSrc>
void convey(TQueue& queue, TViewDst&& dst, TViewSrc const& src) // in namespace alpaka
{
  ...
}

@j-stephan (Member)

Maybe something like make_available? If it already is available on the given device nothing happens, otherwise a copy is performed.

@bernhardmgruber (Member)

@fwyzard That is roughly the implementation I had in mind! Maybe we can also allow e.g. different CPU devices to have this sharing behavior or, once unified shared memory is implemented, have it for CPU/GPU as well.
I just saw Kokkos::create_mirror_view on a slide at ACAT22 and thought that we should really have that now.

You are right that mirrorBuf could imply that the buffers are kept in sync automatically, which is not what is intended. The intent is to allow zero-copies and fall back to deep copies when necessary. This should also be the default API users reach for instead of memcpy when they just want to transfer data to the device for a kernel and then transfer it back. This is why I thought I could add it to memcpy directly, so we do not need to train users to use a different API. I guess this is why Kokkos explicitly names its memcpy equivalent deep_copy.
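
As a rough usage sketch of that default path (makeAvailable is only the name under discussion, not existing alpaka API; Idx and extent are assumed to be defined):

auto hostBuf = alpaka::allocBuf<float, Idx>(devHost, extent);
auto devBuf = alpaka::makeAvailable(queue, devAcc, hostBuf); // zero-copy if devAcc can already see hostBuf, deep copy otherwise
// ... run kernels on devBuf ...
auto resultBuf = alpaka::makeAvailable(queue, devHost, devBuf); // transfer back, again copying only if necessary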

If I understood correctly, I think that adding a new kind of buffer or view may not be the best approach, and instead we could just add an alternative to memcpy.

Yes, no new buffer type. Yes, I want to have a smarter memcpy that may return a buffer that shares storage with the old one or not.

Maybe something like make_available? If it already is available on the given device nothing happens, otherwise a copy is performed.

Valid name! convey is shorter, but I had to look it up in a dictionary quickly. We can have a vote later when I have a PR.

@fwyzard (Contributor) commented Oct 29, 2022

Yes, I want to have a smarter memcpy that may return a buffer that shares storage with the old one or not.

I agree... however we should keep in mind that this would be an API with a different behaviour depending on the accelerator.

For example:

  • if src is a CPU buffer and dst is a GPU buffer, after smart_memcpy(queue, dst, src) has completed the src buffer can be safely freed or modified;
  • if both src and dst are GPU buffers, after smart_memcpy(queue, dst, src) has completed the src buffer should not be modified, because any changes would be reflected in the dst buffer.

So one has to be aware of the difference when writing generic code !

On the other hand, I wouldn't know how to make it more obvious from the API itself, apart from documenting the behaviour :-)
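
A hypothetical illustration of that difference (smart_memcpy stands for whatever the new function ends up being called; Idx and extent are assumed to be defined):

auto src = alpaka::allocBuf<float, Idx>(devA, extent);
auto dst = alpaka::allocBuf<float, Idx>(devB, extent);
smart_memcpy(queue, dst, src); // deep copy if devA and devB differ, otherwise dst is made to alias src
alpaka::wait(queue);
// generic code must not modify or free src here: in the aliasing case (devA == devB)
// any later write to src would also be visible through dst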

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Oct 29, 2022
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@psychocoderHPC (Member)

IMO, to model the ideas described in this issue we need memory visibility information within alpaka.
On Frontier/Crusher at ORNL we have e.g. a fully cache-coherent system where it is not required to use managed memory.
On CUDA, but also on some ROCm devices, you must use managed memory. If you use managed memory, you need to make the user aware that two buffers representing the same memory cannot be used in parallel on two different accelerators, e.g. CUDA and HOST.

My point is that we must first extend the descriptive information of our Buffers/Views to model a solution for this issue.

@fwyzard (Contributor) commented Nov 1, 2022

From what I just read, the Frontier/Crusher approach is not very dissimilar from unified/managed memory. The differences are that

  • malloced memory can be treated as managed memory (IIRC CUDA on Power has a similar capability);
  • device memory can be accessed from the host as "zero copy" over the fabric (similar to a device accessing pinned host memory, but in the opposite direction).

That's an interesting approach, and it should be possible to handle it in the same way as unified/managed memory.

So, between the different backends, the possible cases would be:

  • standard host memory, not accessible by the device;
  • pinned host memory, accessible by the device as "zero copy" over the fabric;
  • standard device memory, not accessible by the host;
  • fancy (?) device memory, accessible by the host as "zero copy" over the fabric;
  • managed memory, accessible by the host and device, may migrate or be "zero copy" or be physically shared, etc.

Standard and pinned host memory are handled by a BufCpu.

Standard device memory is handled by a BufCudaRt/BufHipRt/etc.
Fancy (?) device memory could be handled by extending the BufHipRt class ?
In fact I think that if one uses a BufHipRt buffer and accesses the memory from the host, it would just work.

Managed memory probably needs a new kind of buffer type, in order to expose new methods to migrate the memory across the host and device(s).

Maybe a more general approach could be to extend the interface of the buffer classes to include an additional device (or list of additional devices) from where the memory can be accessed ?
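
A hypothetical sketch, just to make that idea concrete (none of these names exist in alpaka):

#include <algorithm>
#include <vector>

// a view that additionally remembers from which devices its memory can be accessed
template<typename TBuf, typename TDev>
struct MultiDevView
{
    TBuf buf;
    std::vector<TDev> accessibleFrom; // the allocating device plus every device the memory is mapped/visible to

    // a real implementation would have to handle devices of different types (host + GPU), e.g. via type erasure
    bool isAccessibleFrom(TDev const& dev) const
    {
        return std::find(accessibleFrom.begin(), accessibleFrom.end(), dev) != accessibleFrom.end();
    }
};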

But IMHO all this is orthogonal to the method discussed here: we would simply not implement the convey or smart_memcpy method for managed memory buffers (at least, I cannot see a way to implement it based on the way Alpaka buffers are coded).

@bernhardmgruber (Member)

Further comment from @fwyzard (#1820 (comment)):

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); ?
Make a copy, or let SomeGpu access the buffer from OtherGpu memory through nvlink/PCIe/etc. ?

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Nov 11, 2022
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber bernhardmgruber removed this from To do in Release 1.0 Dec 9, 2022
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 13, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 24, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber (Member)

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); ?
Make a copy, or let SomeGpu access the buffer from OtherGpu memory through nvlink/PCIe/etc. ?

Having thought about this for a bit, it should be the latter. If the user intends to have a full copy, they should use alpaka::memcpy. Zero-copying is about making the data available in a different realm, precisely to avoid a copy.
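
At the CUDA level this would essentially rely on peer access; a sketch of the underlying runtime calls, not something alpaka exposes in this form:

cudaSetDevice(someGpu);                  // the device that will do the accessing
cudaDeviceEnablePeerAccess(otherGpu, 0); // flags must be 0; afterwards pointers into otherGpu memory
                                         // can be dereferenced by kernels running on someGpu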

bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 25, 2023
Adds an overload set `zero_memcpy` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
bernhardmgruber added a commit to bernhardmgruber/alpaka that referenced this issue Jan 25, 2023
Adds an overload set `makeAvailable` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
@bernhardmgruber (Member)

Standard and pinned host memory are handled by a BufCpu.
Standard device memory is handled by a BufCudaRt/BufHipRt/etc.
Fancy (?) device memory could be handled by extending the BufHipRt class ?
In fact I think that if one uses a BufHipRt buffer and accesses the memory from the host, it would just work.

Managed memory probably needs a new kind of buffer type, in order to expose new methods to migrate the memory across the host and device(s).

I was wondering about this myself today, and it led to the question of what it actually means for a buffer to be e.g. a BufCpu or a BufUniformCudaHip. It seems that those types mean that the buffers are resident in RAM or in CUDA device memory. Funnily, the type does not necessarily say which API is used to manage the buffer. E.g. a BufCpu can use the CUDA API to delete itself if it was allocated as a mapped buffer.

I came across this question when considering what the type of buf2 should be:

auto buf1 = alpaka::allocMappedBuf(devHost, ...);
auto buf2 = alpaka::makeAvailable(queue, devCuda, buf1);

Because makeAvailable needs to be written such that it can allocate a new buffer on the destination device devCuda, its return type is what alpaka::allocBuf returns, which is an alpaka::BufUniformCudaHip. So, because all branches of a C++ function must return the same type (and because whether we can zero-copy or need to allocate is a runtime decision), the type of buf2 needs to be alpaka::BufUniformCudaHip. So now we have buf1 and buf2 being of two different types, but sharing the same allocation internally. Also, alpaka::BufUniformCudaHip then no longer means that the buffer is GPU resident. Furthermore, not all operations that you can typically do on an alpaka::BufUniformCudaHip would make sense anymore if it internally held a buffer mapped from host memory.

Notice that I have not even started talking about managed memory yet and whether it needs its own data type or not. We somewhat have that problem today already. Well, we can largely ignore the problem if we don't merge and evolve #1820, but still: the buffer data type does not tell us where memory is accessible from.

The reason we did not need to address this issue until now is that alpaka can type-erase all that information by calling getPtrNative(buf) -> T* and then being careful about which kernel we pass this T* to, as the T* no longer carries any information.

So, maybe managed memory needs a new kind of buffer type. And maybe such a general "MultiDeviceBuffer" also needs to be returned from makeAvailable to cover the cases of a newly allocated, a mapped, or a managed buffer. I don't know the answer to this and I will probably not have the time to find it either.

Maybe a more general approach could be to extend the interface of the buffer classes to include an additional device (or list of additional devices) from where the memory can be accessed ?

@psychocoderHPC suggested that to me again yesterday as well. This may be a proper solution. It also covers the case that we can add devices to and remove devices from this list when we map/unmap memory to additional devices. However, what is then the difference between a BufCpu accessible from the host and gpu1, and a BufUniformCudaHip accessible from gpu1 and the host? What is the type telling us?

But IMHO all this is orthogonal to the method discussed here: we would simply not implement the convey or smart_memcpy method for managed memory buffers (at least, I cannot see a way to implement it based on the way Alpaka buffers are coded).

Yes and no. makeAvailable(queue, dstDev, srcView) -> dstView should just "do the right thing" and take all extra information from buffers/views and devices into account. So if the source view uses e.g. managed memory, or is already mapped to the target device, then a zero-copy (= adjusting internal handles) will be performed as well.

We can ignore that for now though, and what we have is still useful, as is the state of #1820. There, we just perform a zero-copy if the source and destination device are the same. This already covers the important use case of avoiding the copies in a host->device copy, kernel, device->host copy program. However, going further and allowing makeAvailable to perform a zero-copy in more cases requires us to solve this issue.

@fwyzard (Contributor) commented Mar 4, 2023

Some random ideas about what makeAvailable could or should do...

I think we may need to distinguish "make available for reading once" and "make available for efficiently working with":

  • pinned host memory is available to a GPU; it could be a reasonable solution for reading some sparse data, but I would not consider it a good solution for a memory-intensive kernel
  • accessing GPU memory from another GPU (e.g. after calling cudaDeviceEnablePeerAccess(...)) may be more or less efficient than making a full copy
  • managed/unified memory could be accessed directly, or explicitly migrated

For the first case, here called makeAvailableOnce (it's not a good name), I've come up with:

| from                 | to: host                   | to: device    | to: other device                                          |
|----------------------|----------------------------|---------------|-----------------------------------------------------------|
| host, default memory | same buffer                | device buffer | other device buffer                                       |
| host, pinned memory  | same buffer                | same buffer   | other device buffer                                       |
| device memory        | host buffer, pinned memory | same buffer   | same buffer if peer access, other device buffer otherwise |
| managed memory       | same buffer                | same buffer   | same buffer                                               |

For the second case, here called makeAvailableCached (it's not a good name, either), I've come up with:

| from                 | to: host                        | to: device                      | to: other device                                                 |
|----------------------|---------------------------------|---------------------------------|------------------------------------------------------------------|
| host, default memory | same buffer                     | device buffer                   | other device buffer                                              |
| host, pinned memory  | same buffer                     | device buffer                   | other device buffer                                              |
| device memory        | host buffer, pinned memory      | same buffer                     | same buffer if fast peer access, other device buffer otherwise ? |
| managed memory       | same buffer, request migration? | same buffer, request migration? | same buffer, request migration?                                  |

In this table, makeAvailableCached always has the same return type for accessing host memory from a device, irrespective of whether the host buffer was pinned/mapped or not.

On the other hand, makeAvailableOnce has the same problem mentioned by @bernhardmgruber: the return type should be different depending on whether the host buffer is pinned/mapped or not.
The only solution I can think of is to return an std::variant<BufCpu, BufUniformCudaHip>.
Or, to avoid changing the Alpaka API, to define a new ViewAnyBuf type that can wrap different types of buffers, and forward any call to the right type ?
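
A small sketch of the std::variant idea (EitherBuf and the usage in the comment are placeholders, not alpaka API):

#include <variant>

// makeAvailableOnce(queue, devAcc, hostBuf) could return either the original (pinned/mapped) host buffer
// or a freshly allocated device buffer, depending on a runtime decision
template<typename TBufHost, typename TBufDev>
using EitherBuf = std::variant<TBufHost, TBufDev>;

// callers would then obtain the pointer to pass to a kernel via std::visit, e.g.
// auto* ptr = std::visit([](auto& buf) { return alpaka::getPtrNative(buf); }, eitherBuf);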
