Issues related to GPU data structures #29297

fwyzard opened this issue Mar 25, 2020 · 6 comments

fwyzard commented Mar 25, 2020

GPU-friendly data structures

Constraints and requirements

I've considered what kind of restrictions and requirements we should impose on the types used for the heterogeneous algorithms, and for the host-device (and potentially host-host and device-device) communications.
C++11 introduced some concepts that can prove useful; as an example, SYCL currently requires that buffers be TriviallyCopyable and StandardLayoutType.

TriviallyCopyable

from cppreference.com:

Requirements:

  • every copy constructor is trivial or deleted
  • every move constructor is trivial or deleted
  • every copy assignment operator is trivial or deleted
  • every move assignment operator is trivial or deleted
  • at least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted
  • trivial non-deleted destructor

This implies that the class has no virtual functions or virtual base classes.
Scalar types and arrays of TriviallyCopyable objects are TriviallyCopyable as well.

IMHO these requirements imply that objects of this type behave like C structs, and can be memcpyed and freed. However, a non-trivial constructor is allowed.
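
For illustration, a minimal sketch (the Hit type and copyHit helper are hypothetical) of a type that satisfies TriviallyCopyable despite having a user-provided, non-trivial constructor:

#include <cstring>
#include <type_traits>

// hypothetical SoA-style element: plain data members plus a user-provided
// (hence non-trivial) constructor
struct Hit {
  float x, y, z;
  Hit(float xx, float yy, float zz) : x(xx), y(yy), z(zz) {}
};

static_assert(std::is_trivially_copyable<Hit>::value, "Hit objects can be memcpyed");
static_assert(not std::is_trivial<Hit>::value, "but Hit is not a TrivialType");

// safe to copy byte-wise, e.g. as part of a host-device transfer
void copyHit(Hit const& src, Hit& dst) { std::memcpy(&dst, &src, sizeof(Hit)); }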

TrivialType

from cppreference.com:

Requirements:

  • TriviallyCopyable
  • has one or more default constructors, all of which are either trivial or deleted, and at least one of which is not deleted.

IMHO these requirements imply that this type behaves like a C struct, and objects of this type can be memcpyed, malloced and freed.
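
A minimal sketch (the HitSoA name is hypothetical) of a TrivialType, which in addition can be allocated without running any constructor:

#include <cstdlib>
#include <type_traits>

// hypothetical TrivialType: no user-provided special member functions at all
struct HitSoA {
  float x[1024];
  float y[1024];
  float z[1024];
};

static_assert(std::is_trivial<HitSoA>::value, "HitSoA can be memcpyed, malloced and freed");

// objects can come straight from a raw allocation, with no constructor call
inline HitSoA* makeUninitialized() { return static_cast<HitSoA*>(std::malloc(sizeof(HitSoA))); }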

StandardLayoutType

See cppreference.com for a full description.

Basically, it's a type that does not make use of inheritance for its data members: all non-static data members are defined either in a single base class, or in the most derived class. It also has restrictions on multiple inheritance, which should not be a concern for us.
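
For example (with hypothetical types), spreading the data members over a base class and a derived class keeps a type TriviallyCopyable but breaks the StandardLayoutType requirement:

#include <type_traits>

struct Base {
  int id;
};

// non-static data members declared in two different classes of the hierarchy
struct Derived : Base {
  float value;
};

static_assert(std::is_trivially_copyable<Derived>::value, "still memcpyable");
static_assert(not std::is_standard_layout<Derived>::value, "but not a StandardLayoutType");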

PODType (deprecated since C++ 20)

from cppreference.com:

Requirements:

  • a scalar type;
  • a class type (class or struct or union) that:
    • is TrivialType;
    • is StandardLayoutType;
    • has no non-static members that are not PODTypes;
  • an array of such type.

It's not clear that the StandardLayoutType requirement is useful for our case (see below). If we relax that requirement, we end up with a TrivialType. If we also relax the requirement for trivial constructors, we end up with a TriviallyCopyable type.

Conclusion on requirements

It looks like the minimal requirement we want is for the GPU-friendly data formats to be TriviallyCopyable. Adding the requirement for a trivial constructor (i.e. requiring a TrivialType) could simplify the allocation of objects on the accelerator devices from the host.

It's not clear to me if a PODType or StandardLayoutType would give us any useful guarantee, or if they would prevent us from any useful constructs. SYCL buffers are currently supposed to be TriviallyCopyable and StandardLayoutType, but there is a proposal to relax the latter and keep only the TriviallyCopyable requirement.

Proposal: all types used for heterogeneous producers and for host-device communication should satisfy the TrivialType requirements.

Handling different memory spaces and EDM integration

Over the various Patatrack developments we have used (and are using) at least two different approaches to migrating data from the host to the device (and/or vice versa):

  • in some cases (e.g. BeamSpotCUDA) the data format is aware of the CUDA memory space, and contains a cms::cuda::device::unique_ptr<> to the concrete payload; constructing an object from a host data will immediately allocate the device memory and copy the contents there;
  • in other cases (e.g. pixelTrack::TrackSoA) the data format is unaware of the different memory spaces; a generic wrapper (HeterogeneousSoA) handles the different memory spaces.

The second approach allows reusing the same underlying type for "SoA producers" running on the host as well as on different devices, and is the obvious choice for a "heterogeneous producer" that can be compiled for multiple back-ends.

Proposal: the underlying types used for heterogeneous producers and for host-device communication should be memory-space agnostic.

This underlying type T needs to be wrapped by a memory-space aware container or smart pointer (e.g. HeterogeneousSoA<T>).

Such a wrapper can be aware of the different memory spaces, or be itself agnostic of them, delegating the actual allocations and copies to e.g. an EDProducer.

Multiple memory space-agnostic wrappers

A possible approach is the one used by std::shared_ptr<T>: store a single raw pointer, along with (a pointer to) the function that can be used to destroy the pointed-to object. Assuming we restrict the underlying types to be TriviallyCopyable, they have a trivial destructor, so the only information we need to store is how to deallocate the memory (i.e. using free(), cudaFree(), cudaFreeHost(), or returning it to the relevant allocator pool).

Using a std::shared_ptr-like wrapper can scale to an arbitrary number of back-ends (since the information is encoded only at runtime), and allows using an EDProducer to schedule copies on demand.
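
A minimal sketch of this idea (all names are hypothetical, not existing CMSSW classes): the wrapper stores a raw pointer and a pointer to the matching deallocation function, so the wrapper type itself carries no memory-space information:

#include <cstdlib>

// hypothetical memory space-agnostic owner: a raw pointer plus the function
// that knows how to release it (free, cudaFree, cudaFreeHost, pool return, ...)
template <typename T>
class AgnosticBuffer {
public:
  AgnosticBuffer() = default;
  AgnosticBuffer(T* data, void (*deallocate)(void*)) : data_(data), deallocate_(deallocate) {}
  AgnosticBuffer(AgnosticBuffer&& other) noexcept : data_(other.data_), deallocate_(other.deallocate_) {
    other.data_ = nullptr;
  }
  AgnosticBuffer(AgnosticBuffer const&) = delete;
  ~AgnosticBuffer() {
    // T is TriviallyCopyable, hence trivially destructible: only the memory
    // needs to be released, using whatever routine allocated it
    if (data_)
      deallocate_(data_);
  }

  T* get() const { return data_; }

private:
  T* data_ = nullptr;
  void (*deallocate_)(void*) = nullptr;
};

// e.g. a buffer in regular pageable host memory
template <typename T>
AgnosticBuffer<T> make_host_buffer() {
  return AgnosticBuffer<T>(static_cast<T*>(std::malloc(sizeof(T))), [](void* p) { std::free(p); });
}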

One downside is that - since the same data in different memory spaces is owned by different products - it is not possible to use a single id to uniquely identify it across all memory spaces.

Multiple memory space-aware wrappers

A HeterogeneousSoA<T> can hold a unique pointer to data in three memory spaces:

  • standard host memory (i.e. memory allocated with malloc());
  • page-locked host memory (i.e. memory allocated with cudaMallocHost() or cudaHostAlloc());
  • CUDA device or unified memory (i.e. memory allocated with cudaMalloc() or cudaMallocManaged()).

This approach can be extended to additional memory spaces, as long as they are limited and known at runtime, with little overhead (one or two pointers per memory space). It can be implemented as a class that lists each memory space explicitly, or as an std::tuple or std::variant.
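
A sketch of the explicit layout (the names mimic those used above, but are hypothetical): one owning pointer per memory space, each with the deleter matching how that memory was allocated:

#include <memory>

// hypothetical wrapper listing each memory space explicitly; only the slots
// that are populated actually own data
template <typename T>
struct HeterogeneousSoA {
  using Deleter = void (*)(T*);

  std::unique_ptr<T, Deleter> host{nullptr, nullptr};    // malloc() / free()
  std::unique_ptr<T, Deleter> pinned{nullptr, nullptr};  // cudaMallocHost() / cudaFreeHost()
  std::unique_ptr<T, Deleter> device{nullptr, nullptr};  // cudaMalloc() or cudaMallocManaged() / cudaFree()
};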

Using a memory space-aware wrapper allows a single type to store data in any of several memory spaces.
An EDProducer can be used to schedule the copy from one memory space to another; e.g. consume a HeterogeneousSoA<T> on the host and produce a HeterogeneousSoA<T> on the device.

This approach has the same downside as the previous option: it is not possible to use a single id to uniquely identify the data across all memory spaces.

Single memory space-aware wrapper

Using a memory space-aware wrapper can also allow for a single EDM product to store multiple copies of the same data in different memory spaces: a host product in CPU memory, a device product in GPU memory, etc. A single identifier (a pointer or edm::Ref) can uniquely identify the product, irrespective of whether it is stored on the CPU or on the GPU.

The downside is that the copy from one memory space to another cannot be implemented with an EDProducer for the same Wrapper<T>, since it would only update the wrapper in-place and not "produce" anything.

Single wrapper with access tokens

A hybrid approach could be to use a single wrapper (e.g. HeterogeneousWrapper<T>) to hold pointers (aware or agnostic wrappers) to the same data in multiple memory spaces (e.g. in an array, tuple, vector, map, etc.), and to use a set of tokens to identify in which memory spaces the data is available (e.g. HeterogeneousToken<T, MemorySpace>).

When a module produces a HeterogeneousWrapper<T> it shall also produce one or more HeterogeneousToken<T, MemorySpace> to identify all copies of T in the different memory spaces; e.g. HeterogeneousToken<T, HOST>, HeterogeneousToken<T, CUDA>, etc.

When a module consumes a HeterogeneousWrapper<T> it shall declare a dependency on the relevant HeterogeneousToken<T, MemorySpace>.

When data needs to be copied from one memory space to another, an EDProducer can be scheduled to perform the copy and produce the HeterogeneousToken<T, MemorySpace> for the new copy.

Note: this approach basically implements in CMSSW the equivalent of CUDA Unified Memory and SYCL buffers. It can also be used to wrap those constructs, while keeping the possibility of scheduling explicit (or on-demand, see below) memory copies.
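
A rough sketch of the moving parts (all names hypothetical, following the ones used above):

// hypothetical memory-space tags
enum class MemorySpace { Host, CUDA };

// the wrapper owns (pointers to) the copies of T in each memory space
template <typename T>
struct HeterogeneousWrapper {
  T* copies[2] = {nullptr, nullptr};  // indexed by MemorySpace
};

// the token carries no data in the simplest variant (see implementation detail #1
// below); producing and consuming it makes the dependency on the copy of T in
// memory space S visible to the framework scheduler
template <typename T, MemorySpace S>
struct HeterogeneousToken {};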

Implementation detail #1

The token itself can be an empty object, or hold an edm::RefProd to the HeterogeneousWrapper<T>. In the first case the module that consumes it can avoid an extra indirection but needs to depend explicitly on the HeterogeneousWrapper<T>. In the second case it can depend only on the HeterogeneousToken<T, MemorySpace>, and access the T object from it.

Implementation detail #2

Access to data on a memory space that is not available (e.g. access from the GPU to data on the CPU) can either raise an exception (safer, requires an explicit EDProducer to schedule the transfer), or trigger an on-demand copy (less optimal, makes the explicit EDProducer an optimisation rather than a requirement).
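
Continuing the hypothetical sketch above, the first policy could look like:

#include <stdexcept>

// access the copy of T in the requested memory space, failing loudly if no
// transfer has been scheduled yet (the on-demand alternative would issue the
// copy here instead of throwing)
template <typename T>
T const* access(HeterogeneousWrapper<T> const& wrapper, MemorySpace space) {
  T const* ptr = wrapper.copies[static_cast<int>(space)];
  if (ptr == nullptr)
    throw std::runtime_error("no copy of the product available in the requested memory space");
  return ptr;
}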

Conclusion

If the possibility of uniquely identifying the underlying data is not deemed necessary, the simplest approach seems to be to use a memory space-agnostic wrapper as EDProduct.

If instead it is deemed useful, the use of a single wrapper with access tokens should be evaluated.

@cmsbuild

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard changed the title from "Issues related to GPU data structures and code organisation" to "Issues related to GPU data structures" on Mar 25, 2020

fwyzard commented Mar 25, 2020

assign heterogeneous

@cmsbuild

New categories assigned: heterogeneous

@makortel,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Proposal: all types used for heterogeneous producers and for host-device communication should satisfy the TrivialType requirements.

In principle I agree, but IIRC e.g. Eigen matrices do not satisfy that, because the default constructor is non-trivial (and therefore we have

// No check for the trivial constructor, make it clear in the interface
template <typename T>
typename host::impl::make_host_unique_selector<T>::non_array make_host_unique_uninitialized(cudaStream_t stream) {

etc.).

Allowing a non-trivial default constructor (i.e. going to TriviallyCopyable) then raises the question of whether that constructor should be called on the device side for device allocations (e.g. Kokkos::View does that by default). I suppose in practice the constructor call could be omitted for most cases, even though it would technically be against the standard.
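
For reference, a sketch (CUDA, with a hypothetical helper kernel) of what running the default constructor on the device side could look like, after the raw device allocation:

#include <cstddef>
#include <new>

// run the default constructor in place for each element of a device array
template <typename T>
__global__ void constructInPlace(T* data, std::size_t n) {
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    new (data + i) T();
}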

@makortel

Proposal: the underlying types used for heterogeneous producers and for host-device communication should be memory-space agnostic.

I've come to the same conclusion. Memory space-aware data formats would also lead to an explosion of the ROOT dictionary declarations (N(classes) x N(memory spaces)). A downside is that the framework's input product type check does not catch errors such as a host consumer reading a device product. In principle we can do a run-time check within the consumer module.

Multiple memory space-aware wrappers

Do I understand correctly that "multiple ... wrappers" means that there are multiple objects for "the same data"? And does this "multiple memory space-aware wrappers" approach essentially mean that the wrapper class holds a pointer for each memory space, and knows internally which deleter to call?

Would only one of the memory-space pointers be occupied at a time, or could many of them be? In other words, would an EDProducer always be needed for the transfer, or would the wrapper itself be capable of doing the transfer "internally"?

Single wrapper with access tokens

...
to use a set of tokens to identify in which memory spaces the data is available (e.g. HeterogeneousToken<T, MemorySpace>).

Just to note now (I'll come back with more thoughts later) that currently this approach would lead to the aforementioned explosion of ROOT dictionary declarations.


makortel commented Apr 6, 2020

One downside is that - since the same data in different memory spaces is owned by different products - it is not possible to use a single id to uniquely identify it across all memory spaces.
...
If the possibility of uniquely identifying the underlying data is not deemed necessary ...
...
If instead it is deemed useful ...

We have discussed earlier the use case of a "Ref", e.g. the following scenario

  • producer A produces a product on the GPU memory
  • producer B consumes A, produces another product on the GPU that contains "Refs" to A (essentially ProductID + index)
  • producer C transfers the product A to CPU memory (plain memcpy())
  • producer D transfers the product B to CPU memory (plain memcpy())
    • including the "Ref" to A, which now should somehow be remapped as a "Ref" to C; essentially changing the ProductID would suffice (see the sketch below)
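
A minimal sketch (hypothetical types) of such a "Ref" and of the remapping step in producer D:

// a "Ref" into another heterogeneous product: the product identifier plus the
// index of the element within that product
struct HeterogeneousRef {
  unsigned int productId;
  unsigned int index;
};

// after product A (GPU) has been copied to product C (CPU), remapping a Ref
// held by the copied product B amounts to replacing the ProductID
inline HeterogeneousRef remapToHost(HeterogeneousRef ref, unsigned int hostProductId) {
  return HeterogeneousRef{hostProductId, ref.index};
}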

Has anything else surfaced that would make "memory space independent unique identifier" useful?
