Issues related to GPU data structures #29297
Comments
A new Issue was created by @fwyzard Andrea Bocci. @Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.
assign heterogeneous
In principle I agree, but IIRC e.g. Eigen matrices do not satisfy that, because the default constructor is non-trivial (and therefore we have cmssw/HeterogeneousCore/CUDAUtilities/interface/host_unique_ptr.h Lines 60 to 62 in 0884d51
etc.). Allowing non-trivial default constructor (i.e. going to TriviallyCopyable) then raises the question whether that constructor should be called on the device side for device allocations (e.g. |
I've come to the same conclusion. Memory-space aware data formats would also lead to an explosion of the ROOT dictionary declarations (N(classes) x N(memory spaces)). A downside is that the framework's input product type check does not catch errors like host consumer reading a device product. In principle we can do a run-time check within the consumer module.
Do I understand correctly that "multiple ... wrappers" means that there are multiple objects for "the same data"? And does "multiple memory space-aware wrappers" essentially mean that the wrapper class holds a pointer for each memory space, and knows internally which deleter to call? Would only one of the memory-space pointers be occupied at a time, or could many of them be? Or in other words, would an EDProducer always be needed for the transfer, or would the wrapper itself be capable of doing the transfer "internally"?
Just to note now (I'll come back with more thoughts later) that currently this approach would lead to the aforementioned explosion of ROOT dictionary declarations.
We have discussed earlier about the use case of a "
Has anything else surfaced that would make "memory space independent unique identifier" useful?
GPU-friendly data structures
Constraints and requirements
I've considered what kind of restrictions and requirements we should impose on the types used for the heterogeneous algorithms, and for the host-device (and potentially host-host and device-device) communications.
C++11 introduced some named requirements that can prove useful; as an example SYCL currently requires that buffers be TriviallyCopyable and StandardLayoutType.
TriviallyCopyable
from cppreference.com:
IMHO these requirements imply that objects of this type behave like C structs, and can be `memcpy`ed and `free`d. However, a non-trivial constructor is allowed.
TrivialType
from cppreference.com:
IMHO these requirements imply that this type behaves like a C struct, and objects of this type can be `memcpy`ed, `malloc`ed and `free`d.
StandardLayoutType
See cppreference.com for a full description.
Basically, it's a type that does not make use of inheritance for its data members: all non-static data members are defined either in a single base class, or in the most derived class. It also has restrictions on multiple inheritance, which should not be a concern for us.
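As a rough illustration of the data-member rule, the standard `<type_traits>` queries can be used to check it at compile time. The struct names below are made up for this sketch; only the `std::is_standard_layout_v` and `std::is_trivial_v` traits (C++17) come from the standard library.

```cpp
#include <cassert>
#include <type_traits>

// All non-static data members live in a single class: standard layout.
struct Hit {
  float x, y, z;
};

// Data members split between a base class and the derived class:
// still trivial, but no longer standard layout.
struct Base {
  int id;
};
struct Derived : Base {
  float charge;
};

// Data members declared only in the base class: standard layout again.
struct Tagged : Base {};

static_assert(std::is_standard_layout_v<Hit>);
static_assert(std::is_trivial_v<Derived>);
static_assert(!std::is_standard_layout_v<Derived>);
static_assert(std::is_standard_layout_v<Tagged>);
```

`Derived` is the interesting case: it can still be `memcpy`ed like a C struct, it just loses the layout guarantees.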
PODType (deprecated since C++20)
from cppreference.com:
It's not clear that the StandardLayoutType requirement is useful for our case (see below). If we relax that requirement, we end up with a TrivialType. If we relax the requirement for trivial constructors, we end up with a TriviallyCopyable type.
Conclusion on requirements
It looks like the minimal requirement we want is for the GPU-friendly data formats to be TriviallyCopyable. Adding the requirement for a trivial constructor (i.e. requiring a TrivialType) could simplify the allocation of objects on the accelerator devices from the host.
It's not clear to me if a PODType or StandardLayoutType would give us any useful guarantee, or if they would prevent us from using any useful constructs. SYCL buffers are currently supposed to be TriviallyCopyable and StandardLayoutType, but there is a proposal to relax the latter and keep only the TriviallyCopyable requirement.
Proposal: all types used for heterogeneous producers and for host-device communication should satisfy the TrivialType requirements.
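A minimal sketch of how such a requirement could be enforced at compile time, assuming C++17. The names `BeamSpotData`, `Counter` and `cloneRaw` are hypothetical and exist only for this illustration; the host-side `malloc()` stands in for whatever allocator a real back-end would use.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>
#include <type_traits>

// Trivial: can be malloc'ed, memcpy'ed and free'd like a C struct.
struct BeamSpotData {
  float x, y, z;
};

// Trivially copyable but not trivial: the default constructor is user-provided.
struct Counter {
  Counter() : n(42) {}
  int n;
};

static_assert(std::is_trivial_v<BeamSpotData>);
static_assert(std::is_trivially_copyable_v<Counter>);
static_assert(!std::is_trivial_v<Counter>);

// Hypothetical helper: copy a value into raw memory without running any
// constructor. The static_assert rejects types that would need one.
template <typename T>
T* cloneRaw(const T& src) {
  static_assert(std::is_trivial_v<T>, "T must satisfy the TrivialType requirements");
  void* raw = std::malloc(sizeof(T));
  if (raw == nullptr)
    throw std::bad_alloc();
  std::memcpy(raw, &src, sizeof(T));
  return static_cast<T*>(raw);  // the caller releases it with std::free()
}
```

With this check, `cloneRaw(Counter{})` fails to compile, which is the kind of early error the TrivialType requirement is meant to give.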
Handling different memory spaces and EDM integration
Over the various Patatrack developments we have used (and are using) at least two different approaches to migrating data from the host to the device (and/or vice versa):
- in one (e.g. `BeamSpotCUDA`) the data format is aware of the CUDA memory space, and contains a `cms::cuda::device::unique_ptr<>` to the concrete payload; constructing an object from host data immediately allocates the device memory and copies the contents there;
- in the other (e.g. `pixelTrack::TrackSoA`) the data format is unaware of the different memory spaces; a generic wrapper (`HeterogeneousSoA`) handles the different memory spaces.
The second approach allows reusing the same underlying type for "SoA producers" running on the host as well as on different devices, and is the obvious choice for a "heterogeneous producer" that can be compiled for multiple back-ends.
Proposal: the underlying types used for heterogeneous producers and for host-device communication should be memory-space agnostic.
This underlying type `T` needs to be wrapped by a memory-space aware container or smart pointer (e.g. `HeterogeneousSoA<T>`).
Such a wrapper can be aware of the different memory spaces, or be itself agnostic of them, delegating the actual allocations and copies to e.g. an `EDProducer`.
Multiple memory space agnostic wrappers
A possible approach is the one used by `std::shared_ptr<T>`: store a single raw pointer, along with (a pointer to) the function that can be used to destroy the pointed-to object. Assuming we restrict the underlying types to be TriviallyCopyable, they have a trivial destructor, so the only information we need to store is how to deallocate the memory (i.e. using `free()`, `cudaFree()`, `cudaFreeHost()`, or returning it to the relevant allocator pool).
Using a `std::shared_ptr`-like wrapper can scale to an arbitrary number of backends (since the information is encoded only at runtime), and allows using an `EDProducer` to schedule copies on demand.
One downside is that, since the same data in different memory spaces is owned by different products, it is not possible to use a single id to uniquely identify it across all memory spaces.
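The idea of storing only "how to deallocate" can be sketched with `std::shared_ptr` itself, which accepts a custom deleter at construction. This is a host-only illustration: `hostFree`, `hostFrees`, `Point` and `makeOnHost` are hypothetical names, and a CUDA back-end would presumably pass a function wrapping `cudaFree()` or `cudaFreeHost()` instead.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>
#include <type_traits>

// Counter used only to make the deallocation observable in this sketch.
inline int hostFrees = 0;

// Deallocation function for plain host memory. Other memory spaces would
// provide their own function with the same void(void*) signature.
inline void hostFree(void* p) {
  std::free(p);
  ++hostFrees;
}

struct Point {
  int x, y;
};

// Allocate raw host memory, copy the value bytewise, and record only the
// deallocation function: no destructor needs to run for such types.
template <typename T>
std::shared_ptr<T> makeOnHost(const T& value) {
  static_assert(std::is_trivially_copyable_v<T>, "T must be TriviallyCopyable");
  void* raw = std::malloc(sizeof(T));
  if (raw == nullptr)
    throw std::bad_alloc();
  std::memcpy(raw, &value, sizeof(T));
  return std::shared_ptr<T>(static_cast<T*>(raw), &hostFree);
}
```

The resulting type is the same `std::shared_ptr<T>` whichever deallocation function is stored, which is what makes the wrapper agnostic of the memory space.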
Multiple memory space-aware wrappers
A `HeterogeneousSoA<T>` can hold a unique pointer to data in three memory spaces:
- host memory (allocated with `malloc()`);
- pinned host memory (allocated with `cudaMallocHost()` or `cudaHostAlloc()`);
- device memory (allocated with `cudaMalloc()` or `cudaMallocManaged()`).
This approach can be extended to additional memory spaces, as long as they are limited and known at runtime, with little overhead (one or two pointers per memory space). It can be implemented as a `class` that lists each memory space explicitly, or as an `std::tuple` or `std::variant`.
Using a memory space-aware wrapper allows a single type to store data in any one of several memory spaces.
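A minimal sketch of the `std::variant` flavour, assuming C++17. All names here (`MemorySpace`, `Deleter`, `SpacePtr`, `AwareWrapper`) are hypothetical and do not reflect the actual `HeterogeneousSoA` implementation; every deleter calls `std::free()` as a host-only stand-in for `cudaFreeHost()` and `cudaFree()`.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <utility>
#include <variant>

enum class MemorySpace { Host, PinnedHost, Device };

// One deleter type per memory space. In this host-only sketch they all call
// std::free(); a CUDA build would specialise them for the CUDA calls.
template <MemorySpace S>
struct Deleter {
  void operator()(void* p) const { std::free(p); }
};

template <typename T, MemorySpace S>
using SpacePtr = std::unique_ptr<T, Deleter<S>>;

// A single type that owns the data in exactly one of the known memory spaces.
template <typename T>
class AwareWrapper {
public:
  template <MemorySpace S>
  explicit AwareWrapper(SpacePtr<T, S> p) : data_(std::move(p)) {}

  // The alternatives are listed in the same order as the enumerators.
  MemorySpace space() const { return static_cast<MemorySpace>(data_.index()); }

  // Returns nullptr if the data does not live in the requested space.
  template <MemorySpace S>
  T* get() const {
    auto const* p = std::get_if<SpacePtr<T, S>>(&data_);
    return p ? p->get() : nullptr;
  }

private:
  std::variant<SpacePtr<T, MemorySpace::Host>,
               SpacePtr<T, MemorySpace::PinnedHost>,
               SpacePtr<T, MemorySpace::Device>>
      data_;
};
```

A `std::tuple` of the same three pointers would instead allow several copies to coexist, at the cost of one extra pointer per memory space.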
An `EDProducer` can be used to schedule the copy from one memory space to another; e.g. consume a `HeterogeneousSoA<T>` on the host and produce a `HeterogeneousSoA<T>` on the device.
This approach has the same downside as the previous option: it is not possible to use a single id to uniquely identify the data across all memory spaces.
Single memory space-aware wrapper
Using a memory space-aware wrapper can also allow for a single EDM product to store multiple copies of the same data in different memory spaces: a host product in CPU memory, a device product in GPU memory, etc. A single identifier (a pointer or `edm::Ref`) can uniquely identify the product, irrespective of whether it is stored on the CPU or on the GPU.
The downside is that the copy from one memory space to another cannot be implemented with an `EDProducer` for the same `Wrapper<T>`, since it would only update the wrapper in-place and not "produce" anything.
Single wrapper with access tokens
A hybrid approach could be to use a single wrapper (e.g. `HeterogeneousWrapper<T>`) to hold pointers (aware or agnostic wrappers) to the same data in multiple memory spaces (e.g. in an `array`, `tuple`, `vector`, `map`, etc.), and to use a set of tokens to identify in which memory spaces the data is available (e.g. `HeterogeneousToken<T, MemorySpace>`).
When a module produces a `HeterogeneousWrapper<T>` it shall also produce one or more `HeterogeneousToken<T, MemorySpace>` to identify all copies of `T` in the different memory spaces; e.g. `HeterogeneousToken<T, HOST>`, `HeterogeneousToken<T, CUDA>`, etc.
When a module consumes a `HeterogeneousWrapper<T>` it shall declare a dependency on the relevant `HeterogeneousToken<T, MemorySpace>`.
When data needs to be copied from one memory space to another, an `EDProducer` can be scheduled to perform the copy and produce the `HeterogeneousToken<T, MemorySpace>` for the new copy.
Note: this approach basically implements in CMSSW the equivalent of CUDA Unified Memory and SYCL buffers. It can also be used to wrap those constructs, while keeping the possibility of scheduling explicit (or on-demand, see below) memory copies.
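The wrapper-plus-token scheme could look roughly like the sketch below. `HeterogeneousWrapper` and `HeterogeneousToken` are the names used in this proposal, but their members, the `MemorySpace` enum and the `copyTo` function are assumptions made for illustration. The "CUDA" copy is simulated with host memory, and the free function `copyTo` stands in for the `EDProducer` that would schedule the transfer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>
#include <stdexcept>
#include <type_traits>
#include <utility>

enum class MemorySpace : std::size_t { HOST = 0, CUDA = 1 };
constexpr std::size_t kSpaces = 2;

// One wrapper holds at most one copy of the data per memory space.
template <typename T>
class HeterogeneousWrapper {
  static_assert(std::is_trivially_copyable_v<T>, "T must be TriviallyCopyable");

public:
  void set(MemorySpace s, std::shared_ptr<T> copy) { copies_[index(s)] = std::move(copy); }

  // Throws if there is no copy in the requested memory space.
  T& get(MemorySpace s) const {
    auto const& p = copies_[index(s)];
    if (!p)
      throw std::runtime_error("no copy in the requested memory space");
    return *p;
  }

private:
  static std::size_t index(MemorySpace s) { return static_cast<std::size_t>(s); }
  std::array<std::shared_ptr<T>, kSpaces> copies_;
};

// Empty tag: holding one signals that the copy in `Space` is available.
template <typename T, MemorySpace Space>
struct HeterogeneousToken {};

// Stand-in for the copy module: consumes the token for the source space,
// adds a copy in the destination space, and produces the token for it.
template <MemorySpace To, MemorySpace From, typename T>
HeterogeneousToken<T, To> copyTo(HeterogeneousWrapper<T>& w, HeterogeneousToken<T, From>) {
  void* raw = std::malloc(sizeof(T));  // a CUDA back-end would allocate device memory here
  if (raw == nullptr)
    throw std::bad_alloc();
  std::memcpy(raw, &w.get(From), sizeof(T));  // and use a CUDA memory copy here
  w.set(To, std::shared_ptr<T>(static_cast<T*>(raw), [](T* p) { std::free(p); }));
  return {};
}
```

Because `copyTo` takes the source token by value and returns the destination token, a consumer of the `CUDA` copy can only be written once the copy step has run, which mirrors the dependency a module would declare on the token.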
Implementation detail #1
The token itself can be an empty object, or hold an `edm::RefProd` to the `HeterogeneousWrapper<T>`. In the first case the module that consumes it can avoid an extra indirection but needs to depend explicitly on the `HeterogeneousWrapper<T>`. In the second case it can depend only on the `HeterogeneousToken<T, MemorySpace>`, and access the `T` object from it.
Implementation detail #2
Access to data on a memory space that is not available (e.g. access from the GPU to data on the CPU) can either raise an exception (safer, requires an explicit `EDProducer` to schedule the transfer), or trigger an on-demand copy (less optimal, makes the explicit `EDProducer` an optimisation rather than a requirement).
Conclusion
If the possibility of uniquely identifying the underlying data is not deemed necessary, the simplest approach seems to be to use a memory space-agnostic wrapper as `EDProduct`.
If instead it is deemed useful, the use of a single wrapper with access tokens should be evaluated.