
Good SoA abstraction #272

Open

makortel opened this issue Feb 18, 2019 · 46 comments
@makortel

We clearly need a good SoA abstraction to work with.

Some feature considerations for explicit memory (a rough owner/view sketch follows after the lists below):

  • device memory owner
  • device memory "view" (arrays without ownership, to be passed to kernels)
  • pinned host memory (for transfers to and from the device)
  • regular host memory (for the output of the CPU producer and of the transfer-from-GPU producer)
  • transfers host->device and device->host with minimal calls to cudaMemcpyAsync()
  • preferably only one definition of the arrays

With unified memory the situation would of course be simpler with

  • unified memory ownership
  • unified memory view
  • regular host memory
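
A minimal sketch of what the owner/view split could look like for explicit device memory (the names Vec3SoAView, Vec3SoADevice and the fixed kCapacity are made up for illustration; error checking is omitted):

#include <cstddef>
#include <cuda_runtime.h>

constexpr std::size_t kCapacity = 1024;  // hypothetical fixed capacity

// non-owning "view": plain pointers plus a size, cheap to copy into a kernel
struct Vec3SoAView {
  float* x;
  float* y;
  float* z;
  std::size_t size;
};

// owner of the device memory; hands out views
class Vec3SoADevice {
public:
  Vec3SoADevice() {
    cudaMalloc(&x_, kCapacity * sizeof(float));
    cudaMalloc(&y_, kCapacity * sizeof(float));
    cudaMalloc(&z_, kCapacity * sizeof(float));
  }
  ~Vec3SoADevice() {
    cudaFree(x_);
    cudaFree(y_);
    cudaFree(z_);
  }
  Vec3SoADevice(const Vec3SoADevice&) = delete;
  Vec3SoADevice& operator=(const Vec3SoADevice&) = delete;

  Vec3SoAView view() { return {x_, y_, z_, kCapacity}; }

  // transfer from pinned host buffers, one call per column
  void copyFromHost(const float* hx, const float* hy, const float* hz, cudaStream_t stream) {
    cudaMemcpyAsync(x_, hx, kCapacity * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(y_, hy, kCapacity * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(z_, hz, kCapacity * sizeof(float), cudaMemcpyHostToDevice, stream);
  }

private:
  float* x_ = nullptr;
  float* y_ = nullptr;
  float* z_ = nullptr;
};

Allocating the columns in one contiguous block instead would allow a single cudaMemcpyAsync() for the whole SoA, which is what the "minimal calls" item above is about.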
@makortel
Author

#100 has an attempt with lots of duplication on the SoA definition side (so rather poor from a programming model point of view).

@makortel
Author

In #100 (comment) @VinInn commented

about SOA "abstractions":
this is the best I managed to get so far:
https://github.com/VinInn/ctest/blob/master/eigen/soaGeom.cpp
previous incarnations (not very satisfactory to my taste):
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV2.cpp
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV1.cpp

they are all CUDA compliant

@makortel
Author

makortel commented Feb 18, 2019

#271 has an approach similar to #272 (comment), i.e. a C struct.

Some downsides regarding general use

  • Size/capacity is defined at compile time
    • which is what we mostly do now anyway
    • but a SoA with a dynamic, run-time-known capacity would be more flexible, so it could be useful in some cases
  • The user is responsible for ensuring 128-byte alignment
    • and on the CPU side the alignment requirement is smaller

@VinInn

VinInn commented Feb 19, 2019

In production there is no need for runtime sizing: if one size does not fit all, we just compile different sizes and template everything. (For HLT I would like to see a REAL USE CASE for runtime sizing first.)
Dynamic allocation (i.e. resizing) for a SoA is simply not an option.

I was thinking of enforcing alignment at the cache-line level (if not at the page level).
These are typically not large blocks of memory...
At the moment one can use "alignas", but in the end one will have to use either the C++17 "new" with alignment or aligned_alloc (AND YES, better that the user knows that).
On the CUDA side my understanding is that everything is taken care of.

@fwyzard

fwyzard commented Feb 19, 2019

I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoAs) with a size determined at construction time that do not support resizing.

I agree that, in general, we do not want to resize structures once the event data has been filled.

But I would say that we want to be able to handle collections whose size we do not know a priori.
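
For illustration, a column whose size is fixed at construction time but that cannot be resized can be as simple as the following sketch (hypothetical names):

#include <cstddef>
#include <memory>

// size chosen at run time, at construction; deliberately no resize()
class Column {
public:
  explicit Column(std::size_t n) : data_(std::make_unique<float[]>(n)), size_(n) {}

  float& operator[](std::size_t i) { return data_[i]; }
  std::size_t size() const { return size_; }

private:
  std::unique_ptr<float[]> data_;
  std::size_t size_;
};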

@makortel
Author

I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoAs) with a size determined at construction time that do not support resizing.

But I would say that we want to be able to handle collections whose size we do not know a priori.

This was exactly the case I meant. I rephrased my earlier comment to avoid further confusion.

@makortel
Author

On the CUDA side my understanding is that everything is taken care of.

For the beginning of the memory block yes, but with

struct Foo {
  float pt[100];
  float eta[100];
};

the element Foo::eta would not be 128-byte aligned.
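
For reference, a sketch of how alignas on the members can fix the intra-struct alignment (with the usual caveat that heap allocations still need the C++17 aligned "new" or aligned_alloc, as discussed below):

#include <cstddef>

struct Foo {
  alignas(128) float pt[100];
  alignas(128) float eta[100];  // the padding added after pt makes eta start on a 128-byte boundary
};

static_assert(alignof(Foo) == 128, "the struct itself carries the 128-byte alignment");
static_assert(offsetof(Foo, eta) % 128 == 0, "each column starts 128-byte aligned");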

@VinInn

VinInn commented Feb 20, 2019

so we need to static assert that N is a multiple of 128 (which is sane anyway)

@makortel
Author

so we need to static assert that N is a multiple of 128 (which is sane anyway)

Yup. But the alignment is still the user's responsibility.

@fwyzard

fwyzard commented Feb 21, 2019

What about std::aligned_storage and/or alignas(...) ?

@VinInn

VinInn commented Feb 21, 2019 via email

@VinInn

VinInn commented Feb 22, 2019

A possible solution to keep the SoA fixed-size and still allow setting the size dynamically is to bucketize (ASOA), see
https://godbolt.org/z/-_aqEP
or
https://github.com/VinInn/ctest/blob/master/eigen/trivialASOA.cpp

Provided the stride is a power of 2, it is quite trivial (and fast) to iterate and do random access.
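
A minimal sketch of the bucketized (AoSoA) idea with a power-of-two stride, where the bucket index comes from a division (a shift) and the slot inside the bucket from a mask (names are hypothetical; see the links above for the real implementations):

#include <cstddef>
#include <vector>

constexpr std::size_t kStride = 256;  // elements per bucket, must be a power of 2

// one fixed-size SoA "bucket"
struct Bucket {
  float x[kStride];
  float y[kStride];
  float z[kStride];
};

// dynamically sized container of fixed-size buckets (AoSoA)
struct ASoA {
  explicit ASoA(std::size_t n) : buckets((n + kStride - 1) / kStride), size(n) {}

  // random access: divide (a shift) to find the bucket, mask to find the slot
  float& x(std::size_t i) { return buckets[i / kStride].x[i & (kStride - 1)]; }
  float& y(std::size_t i) { return buckets[i / kStride].y[i & (kStride - 1)]; }
  float& z(std::size_t i) { return buckets[i / kStride].z[i & (kStride - 1)]; }

  std::vector<Bucket> buckets;
  std::size_t size;
};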

@VinInn

VinInn commented Feb 22, 2019

@fwyzard

fwyzard commented May 15, 2019

See also #344 .

@fwyzard fwyzard pinned this issue May 21, 2019
@fwyzard

fwyzard commented Aug 3, 2019

Here is my attempt to write a SoA that:

  • allows different data types
  • makes sure all arrays are aligned
  • has a friendly syntax for usage and access
  • has a decent syntax for declaration (best I could come up with without real reflection in C++)
  • has value semantics (I guess?)

It does not currently support the following, but I plan to add them later:

  • variable size at construction time
  • optimised constant view

@fwyzard

fwyzard commented Aug 3, 2019

step 1: what it might look like

The fields are something like

struct {
    double x, y, z;
    uint16_t colour;
    int32_t value;
};

definition

#include <cstddef>
#include <cstdint>

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {

  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      x(soa.x[index]),
      y(soa.y[index]),
      z(soa.z[index]),
      colour(soa.colour[index]),
      value(soa.value[index])
    { }

    double & x;
    double & y;
    double & z;
    uint16_t & colour;
    int32_t & value;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // data members
  alignas(ALIGN) double x[SIZE];
  alignas(ALIGN) double y[SIZE];
  alignas(ALIGN) double z[SIZE];
  alignas(ALIGN) uint16_t colour[SIZE];
  alignas(ALIGN) int32_t value[SIZE];
};

sample usage

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;

  // SoA-like access
  soa.z[7] = 42;

  // AoS-like access
  soa[7].z = 42;

  // check that SoA-like and AoS-like access reach the same element
  assert(& soa.z[7] == & (soa[7].z));
}

@fwyzard

fwyzard commented Aug 3, 2019

step 2: use private members and public accessors

Note that we could also use soa.z(7) instead of soa.z()[7] if people think it would be cleaner.

definition

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {

  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      soa_(soa),
      index_(index)
    { }

    double & x() { return soa_.x_[index_]; }
    double & y() { return soa_.y_[index_]; }
    double & z() { return soa_.z_[index_]; }
    uint16_t & colour() { return soa_.colour_[index_]; }
    int32_t & value() { return soa_.value_[index_]; }

    SoA & soa_;
    const size_t index_;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // accessors
  double* x() { return x_; }
  double* y() { return y_; }
  double* z() { return z_; }
  uint16_t* colour() { return colour_; }
  int32_t* value() { return value_; }

private:
  // data members
  alignas(ALIGN) double x_[SIZE];
  alignas(ALIGN) double y_[SIZE];
  alignas(ALIGN) double z_[SIZE];
  alignas(ALIGN) uint16_t colour_[SIZE];
  alignas(ALIGN) int32_t value_[SIZE];
  alignas(ALIGN) const char* name_[SIZE];
};

sample usage

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;

  // SoA-like access
  soa.z()[7] = 42;

  // AoS-like access
  soa[7].z() = 42;

  // check that SoA-like and AoS-like access reach the same element
  assert(& soa.z()[7] == & (soa[7].z()));
}

@fwyzard

fwyzard commented Aug 3, 2019

step 3: make the data members configurable

The syntax for the declaration has some flexibility, if one wants to play a bit with it.

Note: scroll right for the \ continuations.

definition

#include <cstddef>
#include <cstdint>
#include <iostream>

#include <boost/preprocessor/cat.hpp>
#include <boost/preprocessor/facilities/expand.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/variadic/to_seq.hpp>

/* declare SoA data members; these should expand to, for example
 *
 *   alignas(ALIGN) double x_[SIZE];
 *
 */

#define _DECLARE_SOA_DATA_MEMBER_IMPL(TYPE, NAME)                                                                                   \
  alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]);

#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME)                                                                                \
  BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)

#define _DECLARE_SOA_DATA_MEMBERS(...)                                                                                              \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))


/* declare SoA accessors; these should expand to, for example
 *
 *   double* x() { return x_; }
 *
 */

#define _DECLARE_SOA_ACCESSOR_IMPL(TYPE, NAME)                                                                                      \
  TYPE* NAME() { return BOOST_PP_CAT(NAME, _); }

#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME)                                                                                   \
  BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)

#define _DECLARE_SOA_ACCESSORS(...)                                                                                                 \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))


/* declare AoS-like element accessors; these should expand to, for example
 *
 *   double & x() { return * (soa_.x() + index_); }
 *
 */

#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(TYPE, NAME)                                                                              \
  TYPE & NAME() { return * (soa_. NAME () + index_); }

#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME)                                                                           \
  BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)

#define _DECLARE_SOA_ELEMENT_ACCESSORS(...)                                                                                         \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
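
/* declare the per-member lines printed by dump(); the original comment does not show
 * _DECLARE_SOA_DUMP_INFOS, so this is only a guessed sketch of what it could expand to,
 * for example
 *
 *   std::cout << "  double x_[" << SIZE << "]" << std::endl;
 *
 */

#define _DECLARE_SOA_DUMP_INFO_IMPL(TYPE, NAME)                                                                                     \
  std::cout << "  " BOOST_PP_STRINGIZE(TYPE) " " BOOST_PP_STRINGIZE(NAME) "_[" << SIZE << "]" << std::endl;

#define _DECLARE_SOA_DUMP_INFO(R, DATA, TYPE_NAME)                                                                                  \
  BOOST_PP_EXPAND(_DECLARE_SOA_DUMP_INFO_IMPL TYPE_NAME)

#define _DECLARE_SOA_DUMP_INFOS(...)                                                                                                \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DUMP_INFO, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))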


/* compile-time sized SoA
 */

#define DECLARE_SOA_TEMPLATE(CLASS, ...)                                                                                            \
template <size_t SIZE, size_t ALIGN=0>                                                                                              \
struct CLASS {                                                                                                                      \
                                                                                                                                    \
  /* these could be moved to an external type trait to free up the symbol names */                                                  \
  using self_type = CLASS;                                                                                                          \
  static const size_t size = SIZE;                                                                                                  \
  static const size_t alignment = ALIGN;                                                                                            \
                                                                                                                                    \
  /* introspection */                                                                                                               \
  static void dump() {                                                                                                              \
    std::cout << #CLASS "<" << SIZE << ", " << ALIGN << ">: " << std::endl;                                                         \
    std::cout << "  sizeof(...): " << sizeof(CLASS) << std::endl;                                                                   \
    std::cout << "  alignof(...): " << alignof(CLASS) << std::endl;                                                                 \
    _DECLARE_SOA_DUMP_INFOS(__VA_ARGS__)                                                                                            \
    std::cout << std::endl;                                                                                                         \
  }                                                                                                                                 \
                                                                                                                                    \
  /* AoS-like accessor to individual elements */                                                                                    \
  struct element {                                                                                                                  \
    element(CLASS & soa, size_t index) :                                                                                            \
      soa_(soa),                                                                                                                    \
      index_(index)                                                                                                                 \
    { }                                                                                                                             \
                                                                                                                                    \
    _DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__)                                                                                     \
                                                                                                                                    \
  private:                                                                                                                          \
    CLASS & soa_;                                                                                                                   \
    const size_t index_;                                                                                                            \
  };                                                                                                                                \
                                                                                                                                    \
  /* AoS-like accessor */                                                                                                           \
  element operator[](size_t index) { return element(*this, index); }                                                                \
                                                                                                                                    \
  /* accessors */                                                                                                                   \
  _DECLARE_SOA_ACCESSORS(__VA_ARGS__)                                                                                               \
                                                                                                                                    \
private:                                                                                                                            \
  /* data members */                                                                                                                \
  _DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__)                                                                                            \
                                                                                                                                    \
}

sample usage

DECLARE_SOA_TEMPLATE(SoA,
  (double, x),
  (double, y),
  (double, z),
  (uint16_t, colour),
  (int32_t, value)
);

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;

  // SoA-like access
  soa.z()[7] = 42;

  // AoS-like access
  soa[7].z() = 42;

  // check that SoA-like and AoS-like access reach the same element
  assert(& soa.z()[7] == & (soa[7].z()));
}

@fwyzard

fwyzard commented Aug 3, 2019

step 4: add scalar types

I realised that we may want to have scalar values that are unique for the whole SoA, instead of one per element.

definition

#include <boost/preprocessor.hpp>

/* declare "scalars" (one value shared across the whole SoA) and "columns" (one value per element) */

#define SoA_scalar(TYPE, NAME) (0, TYPE, NAME)
#define SoA_column(TYPE, NAME) (1, TYPE, NAME)
 

/* declare SoA data members; these should expand to, for columns:
 *
 *   alignas(ALIGN) double x_[SIZE];
 *
 * and for scalars:
 *
 *   double x_;
 *
 */

#define _DECLARE_SOA_DATA_MEMBER_IMPL(IS_COLUMN, TYPE, NAME)                                                                        \
  BOOST_PP_IIF(IS_COLUMN,                                                                                                           \
    alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]);                                                                                \
  ,                                                                                                                                 \
    TYPE BOOST_PP_CAT(NAME, _);                                                                                                     \
  )

#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME)                                                                                \
  BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)

#define _DECLARE_SOA_DATA_MEMBERS(...)                                                                                              \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))


/* declare SoA accessors; these should expand to, for columns:
 *
 *   double* x() { return x_; }
 *
 * and for scalars:
 *
 *   double& x() { return x_; }
 *
 */

#define _DECLARE_SOA_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME)                                                                           \
  BOOST_PP_IIF(IS_COLUMN,                                                                                                           \
    TYPE* NAME() { return BOOST_PP_CAT(NAME, _); }                                                                                  \
  ,                                                                                                                                 \
    TYPE& NAME() { return BOOST_PP_CAT(NAME, _); }                                                                                  \
  )

#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME)                                                                                   \
  BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)

#define _DECLARE_SOA_ACCESSORS(...)                                                                                                 \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))


/* declare AoS-like element accessors; these should expand to, for columns:
 *
 *   double & x() { return * (soa_.x() + index_); }
 *
 * and for scalars:
 *
 *   double & x() { return soa_.x(); }
 *
 */

#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME)                                                                   \
  BOOST_PP_IIF(IS_COLUMN,                                                                                                           \
    TYPE & NAME() { return * (soa_. NAME () + index_); }                                                                            \
  ,                                                                                                                                 \
    TYPE & NAME() { return soa_. NAME (); }                                                                                         \
  )

#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME)                                                                           \
  BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)

#define _DECLARE_SOA_ELEMENT_ACCESSORS(...)                                                                                         \
  BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))


/* compile-time sized SoA
*/

#define declare_SoA_template(CLASS, ...)                                                                                            \
template <size_t SIZE, size_t ALIGN=0>                                                                                              \
struct CLASS {                                                                                                                      \
                                                                                                                                    \
  /* these could be moved to an external type trait to free up the symbol names */                                                  \
  using self_type = CLASS;                                                                                                          \
  static const size_t size = SIZE;                                                                                                  \
  static const size_t alignment = ALIGN;                                                                                            \
                                                                                                                                    \
  /* AoS-like accessor to individual elements */                                                                                    \
  struct element {                                                                                                                  \
    element(CLASS & soa, size_t index) :                                                                                            \
      soa_(soa),                                                                                                                    \
      index_(index)                                                                                                                 \
    { }                                                                                                                             \
                                                                                                                                    \
    _DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__)                                                                                     \
                                                                                                                                    \
  private:                                                                                                                          \
    CLASS & soa_;                                                                                                                   \
    const size_t index_;                                                                                                            \
  };                                                                                                                                \
                                                                                                                                    \
  /* AoS-like accessor */                                                                                                           \
  element operator[](size_t index) { return element(*this, index); }                                                                \
                                                                                                                                    \
  /* accessors */                                                                                                                   \
  _DECLARE_SOA_ACCESSORS(__VA_ARGS__)                                                                                               \
                                                                                                                                    \
private:                                                                                                                            \
  /* data members */                                                                                                                \
  _DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__)                                                                                            \
                                                                                                                                    \
}

sample usage

// declare a statically-sized SoA, templated on the column size and (optional) alignment
declare_SoA_template(SoA,
  // predefined static scalars
  // size_t size;
  // size_t alignment;

  // columns: one value per element
  SoA_column(double, x),
  SoA_column(double, y),
  SoA_column(double, z),
  SoA_column(uint16_t, colour),
  SoA_column(int32_t, value),

  // scalars: one value for the whole structure
  SoA_scalar(const char *, description)
);


int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;

  // SoA-like access
  soa.z()[7] = 42;

  // AoS-like access
  soa[7].z() = 42;

  // check that SoA-like and AoS-like access reach the same element
  assert(& soa.z()[7] == & (soa[7].z()));
}

@fwyzard

fwyzard commented Aug 3, 2019

@vkhristenko if you are interested

@makortel
Author

I had a chance to discuss with Axel Huebl (a former alpaka developer) at the ATPESC training (he was there as a student as well). He pointed me to the LLAMA (Low Level Abstraction of Memory Access) library that the same team has been developing.

On a cursory look it seems interesting (it might not solve everything, but at least it has interesting constructs):

  • C++11, header only, based on templates
  • Independent from alpaka (but works well together with it); the only dependency is Boost.Mp11 (Boost >= 1.66.0)
  • The user specifies separately the struct-like domain (which can be nested arbitrarily) and the array-like domain. An address (an index in the array-like domain, nested names in the struct-like domain) is mapped to memory by a "view" with the help of a user-defined "mapping".

Nested structs like

struct Pixel {
    struct {
        float r;
        float g;
        float b;
    } color;
    char alpha;
};

are expressed like

struct color {};
struct alpha {};
struct r {};
struct g {};
struct b {};

using Pixel = llama::DatumStruct <
    llama::DatumElement < color, llama::DatumStruct <
        llama::DatumElement < r, float >,
        llama::DatumElement < g, float >,
        llama::DatumElement < b, float >
    > >,
    llama::DatumElement < alpha, char >
>;

(there are shorthands llama::DS and llama::DE for DatumStruct and DatumElement, respectively)
Note how DatumStructs can be included within other DatumStructs (good for composability). A DatumElement is uniquely identified by its tag class, i.e. if different DatumStructs have DatumElements with the same tag class, LLAMA knows that these elements are of the same type (and allows some neat tricks with them).

Assuming a 3D array of Pixel objects the element access

array[1][2][3].color.g = 1.0;

is expressed in LLAMA as

view( 1, 2, 3 )( color(), g() ) = 1.0;
// or
view( 1, 2, 3 ).access<color, g>() = 1.0;

This interface is always the same regardless of the underlying memory structure (AoS, SoA, something more complicated).

For more information I suggest taking a look at the documentation. It's quite short and has nice examples of what it looks like.

@fwyzard

fwyzard commented Aug 18, 2019

For more information I suggest taking a look at the documentation. It's quite short and has nice examples of what it looks like.

@makortel thanks for the pointers. I've had a look and so far I am not overly fond of the syntax and complexity...

@ericcano

ericcano commented Jun 29, 2021

A prototype of a SoA with variable-sized arrays is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax, soa[i].x, which should help porting AoS code to SoA. We do not have array iterator support yet, so this is not a drop-in replacement for all use cases, but iterating over the index works.

A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw).

@VinInn

VinInn commented Jun 29, 2021 via email

@ericcano

The main macro is used to define a SoA class for HGCal. The AoS-like syntax is achieved with operator[] returning a struct of SoAValues, which behave like scalars through cast operators.
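
For readers not following the links, the cast-operator trick boils down to something like this sketch (hypothetical, simplified names; the real code is generated by the macros):

#include <cstddef>

// proxy for one column of one element: it holds the column pointer and the index,
// and behaves like the underlying scalar through a cast operator and an assignment operator
template <typename T>
class SoAValue {
public:
  SoAValue(T* col, std::size_t idx) : col_(col), idx_(idx) {}

  operator T&() { return col_[idx_]; }                               // read or bind like a T
  SoAValue& operator=(const T& v) { col_[idx_] = v; return *this; }  // write like a T

private:
  T* col_;
  std::size_t idx_;
};

// struct returned by soa[i], grouping one SoAValue per column;
// a SoA's operator[] would build it as Element{{x_, i}, {y_, i}, {z_, i}}
struct Element {
  SoAValue<double> x;
  SoAValue<double> y;
  SoAValue<double> z;
};

With this, soa[i].x = 42.; and double v = soa[i].x; both read like plain AoS access, without parentheses.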

@VinInn

VinInn commented Jun 29, 2021

Do you have the PTX (even for some simple examples)?

@ericcano

ericcano commented Jul 1, 2021

I finished a test (which generates a SoA GPU side and checks the contents CPU side). The generated PTX is in this gist.

The assignment takes two steps to compute the address before accessing global memory:

///data/user/cano/HGCAL/CMSSW_12_0_X_2021-06-23-2300/src/CUDADataFormats/Common/interface/SoAmacros.h:28     SOA_HOST_DEVICE T& operator= (const T2& v) { col_[idx_] = v; return col_[idx_]; }
	.loc	2 28 80, function_name Linfo_string0, inlined_at 1 79 5
	shl.b64 	%rd17, %rd1, 3;
	add.s64 	%rd18, %rd16, %rd17;
	st.global.f64 	[%rd18], %fd2;

@VinInn

VinInn commented Jul 1, 2021

It would be interesting to compare with a naive implementation of a cross product:

v3[i].x = v1[i].y * v2[i].z - v1[i].z * v2[i].y;
v3[i].y = v1[i].z * v2[i].x - v1[i].x * v2[i].z;
v3[i].z = v1[i].x * v2[i].y - v1[i].y * v2[i].x;

@ericcano

ericcano commented Jul 8, 2021

I made various flavors of the cross product in this repository. The PTX is not very appealing, as we get tons of integer computations for a few floating point operations. I suspect and hope this is better optimized during the conversion from PTX to assembly (I did not check yet).

On the C++ syntax front, we can note that the same templated function could be used with a mixture of SoA and AoS.

Finally, the C++ language does not allow binding non-const references to temporaries (and all compilers, gcc and nvcc alike, abide by this rule). Workarounds like this are probably unavoidable to call functions that take references (const references are not affected):

    auto ri = r[i];
    crossProduct(ri, a[i], b[i]);

@VinInn

VinInn commented Jul 8, 2021

this is what Eigen is able to do
https://godbolt.org/z/jrGbcq5jE

@VinInn

VinInn commented Jul 8, 2021

and here is a little example that compiles and runs
https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp

which also demonstrates that there is no problem in assigning to a temporary, at least with Eigen.
And NO, I do not think that we should rewrite Eigen.

@ericcano

The issue was simply a compilation mistake: using the -G option disables all optimizations. With this fixed, we get very similar PTX for the Eigen-based cross product and for the hand-made one. At run time, on a T4 (iterating on the same SoA of 10k elements, using doubles), we get 6.9 us for the hand-made version and 6.3 us for the Eigen-based one.

The main difference between the two seems to be the ratio of integer computations to pointer conversions (cvta.to.global.u64). The hand-made version passes one pointer per vector coordinate, while the Eigen one has only one pointer per vector, at the cost of more integer operations: 14 integer operations + 3 pointer conversions for the Eigen version, 10 integer operations + 9 pointer conversions for the hand-made version.

The hand-made version also loads more parameters.

On the floating point side, we have 6 multiplications and 3 subtractions in both cases, as one would expect.

That hints that a SoA scheme using more pointer arithmetic and less pointer storage could gain a bit of performance. I will try another flavor, where we store offsets instead of pointers.

Another direction of investigation is to integrate support for Eigen structures as columns: the current example uses the fact that we declared x, y and z next to each other to make a vector out of the three. This should be automated, to avoid code repetition and caveats (matrices would be impractical to declare in the current fashion).

Finally, this example uses a separate data block and a descriptor class pointing to it. We could merge the two together, further limiting the passing of big parameters to the kernel (from a structure to a single pointer). (I will leave this aside for the moment.)

@VinInn

VinInn commented Jul 12, 2021

Let me note that in my example https://godbolt.org/z/K61q8zMW5 CUDA loads the input through the non-coherent cache (faster and separate from the standard cache).

@VinInn

VinInn commented Jul 12, 2021

You may also note that in your PTX there are 12 loads, while in the Eigen one there are just 6 (as it should be).

@VinInn

VinInn commented Jul 12, 2021

Another unrelated point we may wish to report to NVIDIA is that clang, for instance, generates mul neg fma instead of mul mul sub (https://godbolt.org/z/xjbMsoed7), but is not able to load through non-coherent memory...

@ericcano

Thanks @VinInn, the double load from global memory (with a cache hit on the second one, but still...) was a significant difference. Adding the restrict attributes enabled both the use of the non-coherent cache and the removal of the duplicate load. This leads to a marginal gain in both cases (~50 ns), but the difference remains (now 6.90 us vs 6.25 us).
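
For reference, the change amounts to marking the read-only kernel parameters as const and __restrict__, which is what lets nvcc prove that the loads can go through the read-only (non-coherent) cache; a hypothetical kernel sketch:

__global__ void crossProduct(double* __restrict__ rx, double* __restrict__ ry, double* __restrict__ rz,
                             const double* __restrict__ ax, const double* __restrict__ ay, const double* __restrict__ az,
                             const double* __restrict__ bx, const double* __restrict__ by, const double* __restrict__ bz,
                             int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n)
    return;
  // with const + __restrict__ the compiler can emit ld.global.nc for the a* and b* loads
  rx[i] = ay[i] * bz[i] - az[i] * by[i];
  ry[i] = az[i] * bx[i] - ax[i] * bz[i];
  rz[i] = ax[i] * by[i] - ay[i] * bx[i];
}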

@ericcano

Another unrelated point we may wish to report to NVIDIA is that clang, for instance, generates mul neg fma instead of mul mul sub (https://godbolt.org/z/xjbMsoed7), but is not able to load through non-coherent memory...

Indeed. There are also integer multiplications by constant powers of 2 instead of shifts in the clang-compiled code.

Also, nvcc uses mad on integers (but not fma). It could also be a deliberate choice, though, avoiding the extra neg for performance at the expense of precision.

@ericcano

The offset-based version of the SoA (as opposed to the pointer-based one) did reduce the number of address conversions, but at the cost of many things (more parameters, loss of non-coherent cache use, re-introduction of multiple fetches of the same variables from global memory). It is version 8 in the same repository. The new PTX for the indirectCrossProduct kernel is here.

Here are two measurements of the kernel timings (event-based, for a loop of 20 kernels, and via nvprof) for both versions:

$ nvprof ./test_v7_cuda
Running ..==784533== NVPROF is profiling process 784533, command: ./test_v7_cuda
...indirectCrossProductSoA time=160.352 us.
.eigenCrossProductSoA time=156.032 us.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   21.92%  139.14us        21  6.6250us  6.3360us  7.3920us  indirectCrossProduct
                   20.22%  128.35us        20  6.4170us  6.1120us  6.8160us  eigenCrossProduct

$ nvprof ./test_v8_cuda
Running ..==784574== NVPROF is profiling process 784574, command: ./test_v8_cuda
...indirectCrossProductSoA time=164.832 us.
.eigenCrossProductSoA time=155.84 us.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   22.46%  143.42us        21  6.8290us  6.6240us  7.2960us  indirectCrossProduct
                   19.89%  127.01us        20  6.3500us  6.1440us  6.6240us  eigenCrossProduct

These numbers are slightly different from the previous ones, which I got via an nvprof dump and by looking at them in nvvp (but this new way is faster).

With more statistics, version 7 is actually quite close to the Eigen one. Since the Eigen one is slightly advantaged, as we do some of the pointer computations host side, this is not that far off.

One positive outcome of this exercise is that the SoA layout could be changed without modifying the client code.

The next and more promising step is to add support for fixed-size Eigen types to the pointer-based version (v7).

@VinInn

VinInn commented Jul 13, 2021

do not understand why more parameters:
if you have any "struct" say double x,y,z; int n,m; bool k; the offsets are easy to compute at compile time (a double is 2 int or 8 bools) no difference w/r/t say bool[3*8+2*4+1]; so it is just

auto y(int i) const  {  return *(((double const*)(storage)+ (1*1)*stride +i));} 
auto n(int i) const  {  return *(((int const*)(storage)+ (3*2)*stride +i));} 
auto k(int i) const {  return *(((bool const*)(storage)+ (3*8+2*4)*stride +i));} 

if I got the parentheses right

@ericcano

This is all with the size defined at run time.
v7 keeps one pointer per column; v8 has one global base pointer plus one offset per column.
So in v8 we get one extra parameter per SoA to load (the base pointer), on top of the offsets for any member used. In v7 we directly load the pointers.

The SoA structure is fully defined by a pointer, the number of elements (and an optional byte alignment parameter), so we could also pass just the first two and re-compute the SoA kernel side too.
Here is the generated v7 constructor, where the members x_, y_, etc. are pointers:

    SoA(std::byte *mem, size_t nElements, size_t byteAlignment = defaultAlignment) : mem_(mem), nElements_(nElements), byteAlignment_(byteAlignment) {
      auto curMem = mem_;
      x_ = reinterpret_cast<double *>(curMem);
      curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
      y_ = reinterpret_cast<double *>(curMem);
      curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
      z_ = reinterpret_cast<double *>(curMem);
      curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
      colour_ = reinterpret_cast<uint16_t *>(curMem);
      curMem += (((nElements_ * sizeof(uint16_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
      value_ = reinterpret_cast<int32_t *>(curMem);
      curMem += (((nElements_ * sizeof(int32_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
      py_ = reinterpret_cast<double **>(curMem);
      curMem += (((nElements_ * sizeof(double *) - 1) / byteAlignment_) + 1) * byteAlignment_;
      description_ = reinterpret_cast<const char **>(curMem);
      curMem += (((sizeof(const char *) - 1) / byteAlignment_) + 1) * byteAlignment_;
      if (mem_ + computeDataSize(nElements_, byteAlignment_) != curMem)
        throw std::out_of_range("In "
                                "SoA"
                                "::"
                                "SoA"
                                ": unexpected end pointer.");
    }

In all examples, this computation is done host side, but we could also push it device side, and re-construct the SoA descriptor there. The mem buffer is pre-allocated by the user, with a size automatically computed by the static member function computeDataSize().

@ericcano

@VinInn, yes. I used byte alignment, while you use index alignment (which ensures byte alignment and simplicity, at the cost of a bit more padding).

I will try that too; it is indeed going to solve the parameter problem.

This will require integrating the size of the column members (yColumnPosition = xColumnPosition + sizeof(yElement)), which might be tricky in the current macro-based generation (to be investigated). The macros are written once and for all and work fine, but they are not easy to debug.

Which brings me to this question: what about using an external code generator (like a Python script, or some pre-existing tool)? This would not change the principle of what we have currently: a description of the structure turned into code by an engine.

Right now the description is:

  declare_SoA_template(SoA,
    // predefined static scalars
    // size_t size;
    // size_t alignment;

    // columns: one value per element
    SoA_column(double, x),
    SoA_column(double, y),
    SoA_column(double, z),
    SoA_column(uint16_t, colour),
    SoA_column(int32_t, value),
    SoA_column(double *, py),

    // scalars: one value for the whole structure
    SoA_scalar(const char *, description)
  );

The engine is soa_vX.h, and the generated code is in the preprocessed output.

Does using an external tool sound like a good or a bad idea? It would obviously require some support from the build system.

@VinInn

VinInn commented Jul 14, 2021

I personally dislike both macros and code generators (not that complex template manipulation is better, still).
The questions about code generation are (the usual ones):

  1. scram integration
  2. git integration
  3. lxr/dxr/doxygen integration
  4. debugging
  5. platform dependency
  6. ....

@fwyzard

fwyzard commented Jul 14, 2021

Some random comments, as I'm not fully following the discussion.

I would prefer to avoid an external code generator if C++ is enough. About macros vs templates I have mixed feelings: templates are better integrated in the language and (more) type safe than macros, but not flexible enough for a clean declaration syntax.
Macros turned out to be a bit lacking as well, but less so.

About individual pointers vs. base pointer + deltas: I assumed individual pointers would be better based on some comments Vincenzo made some time ago - but I don't have any evidence either way.
An advantage would be that they would allow slicing a SoA (like the EDM approach); a disadvantage is the number of parameters being passed and/or an extra indirection.

For deltas I would use byte addressing, because I expect SoAs to have mixed types.

@VinInn

VinInn commented Jul 14, 2021

My summary:
Under the assumptions that

  1. optimizing the memory size beyond what the allocation for the AoS would do is not the goal (and I remind you that we allocate memory in powers of 2, so in practice it will be very rare to allocate twice as much memory as needed)
  2. PR review will catch badly padded "structs" (only to optimize memory; there is no issue with the delta computation)
  3. anybody able to code 2K lines of CUDA should also be able to manage a pointer and some offsets
  4. the byte alignment can be decided at compile time (and most probably once and for all)

something like this may work and seems to generate optimal access code (comparable to Eigen):
a) declare the struct as for the AoS: say struct S {double x,y,z; int n,m; bool k; };
b) allocate storage with, say, auto storage = malloc(sizeof(S)*stride); (the AoS and the SoA occupy the same memory)
b2) where stride is nElements padded up so that each column stays byte aligned
c) declare the SoA accessors as

auto y(int i) const { return *((double const*)(storage + offsetof(S,y)*stride) + i); }
auto n(int i) const { return *((int const*)(storage + offsetof(S,n)*stride) + i); }
auto k(int i) const { return *((bool const*)(storage + offsetof(S,k)*stride) + i); }

Pass storage and stride to the kernel (and nElements: even though stride could be computed from nElements on the fly, one may reuse the same storage for different nElements or wish to access only a limited range).

How this is generated is up to you.
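
A self-contained, host-side-only sketch of this layout, under the same assumptions (all names are hypothetical):

#include <cstddef>
#include <cstdlib>

struct S { double x, y, z; int n, m; bool k; };  // declared exactly as for the AoS

struct SoAAccessor {
  const std::byte* storage;
  std::size_t stride;  // element count, padded so that each column stays aligned

  double y(int i) const { return *((const double*)(storage + offsetof(S, y) * stride) + i); }
  int    n(int i) const { return *((const int*)(storage + offsetof(S, n) * stride) + i); }
  bool   k(int i) const { return *((const bool*)(storage + offsetof(S, k) * stride) + i); }
};

int main() {
  constexpr std::size_t stride = 1024;  // hypothetical, already a multiple of the alignment
  auto* storage = static_cast<std::byte*>(std::calloc(stride, sizeof(S)));  // same size as the AoS
  SoAAccessor soa{storage, stride};
  double y7 = soa.y(7);  // the y column starts at storage + offsetof(S, y) * stride
  (void)y7;
  std::free(storage);
}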

@ericcano

ericcano commented Jul 15, 2021

The version with embedded Eigen structures now works and passes the tests. I made a few variants, including one pushing only the strict minimum (data pointer + size) to the kernel and re-constructing the SoA descriptor device side (the local object variant). The resulting PTX is here, while the test timings are:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                    8.92%  139.23us        21  6.6300us  6.4000us  6.8480us  indirectCrossProductSoA
                    8.45%  131.78us        20  6.5880us  6.4320us  7.1040us  embeddedCrossProductLocalObjectSoA
                    8.14%  127.07us        20  6.3530us  6.1120us  6.5920us  eigenCrossProductSoA
                    7.89%  123.14us        20  6.1560us  6.0480us  6.3360us  embeddedCrossProductSoA		

Regarding the PTX, it is pretty well understood, except for the use of the non-coherent cache, which I did not manage to get for the embedded (Eigen vector) versions.
There are no duplicate loads, and the parameter loads are as one would expect, matching the data used in the kernel.
The extreme case of the locally constructed object is not the fastest, in spite of loading only 2 parameters. The extra integer computations probably explain it.
The embedded cross product is the fastest, in spite of not using the NC cache. Saving the construction of the Eigen vectors might explain this (I did not count the integer operations precisely).

The embedding code is generic, so we could throw any fixed-size Eigen type into the SoA. The macro-based declaration looks like this:

  declare_SoA_template(SoA,
    // columns: one value per element
    SoA_column(double, x),
    SoA_column(double, y),
    SoA_column(double, z),
    SoA_eigenColumn(Eigen::Vector3d, a),
    SoA_eigenColumn(Eigen::Vector3d, b),
    SoA_eigenColumn(Eigen::Vector3d, r),
    SoA_column(uint16_t, colour),
    SoA_column(int32_t, value),
    SoA_column(double *, py),

    // scalars: one value for the whole structure
    SoA_scalar(const char *, description)
  );

and the SoA is constructed and used like this (example with the descriptor constructed locally in the kernel; the data block is already populated):

    bool deviceConstructor = true;
    testSoA::SoA soa(deviceConstructor, data, nElements);
    soa[i].r() = soa[i].a().cross(soa[i].b());

I did not manage to get the cast trick to work here, so we might end up with a slightly heavier syntax with parentheses to access members (for uniformity).
The boolean was necessary to get a device-only constructor; it is not otherwise used.

@fwyzard

fwyzard commented Aug 4, 2021

@VinInn , I'm finally getting back to the SoA business.

Since I'm still catching up, could you explain your example?

and here is a little example that compiles and runs
https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp

which also demonstrates that there is no problem in assigning to a temporary, at least with Eigen.

In particular, where is the assignment to a temporary?

After sitting with Eric and looking at his approach, I now understand that the problem is not in "assigning to a temporary", but in "passing a non-const reference to a temporary".
In fact, we don't need to do that - the element we have in our SoA approach already behaves like a reference, so we can just pass it around by value.
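
In other words, something like this sketch (hypothetical names): the proxy element holds a reference to the SoA plus an index, so copying it is cheap and the callee still writes through to the SoA.

#include <cassert>

struct SoA3 {
  double x[16], y[16], z[16];

  struct element {  // lightweight proxy: a reference to the SoA plus an index
    SoA3& soa;
    int i;
    double& x() { return soa.x[i]; }
    double& y() { return soa.y[i]; }
    double& z() { return soa.z[i]; }
  };

  element operator[](int i) { return {*this, i}; }
};

// the result is taken by value: no non-const reference to a temporary is needed
void crossProduct(SoA3::element r, SoA3::element a, SoA3::element b) {
  r.x() = a.y() * b.z() - a.z() * b.y();
  r.y() = a.z() * b.x() - a.x() * b.z();
  r.z() = a.x() * b.y() - a.y() * b.x();
}

int main() {
  SoA3 v1{}, v2{}, v3{};
  v1.x[0] = 1.;
  v2.y[0] = 1.;
  crossProduct(v3[0], v1[0], v2[0]);  // the proxies are temporaries, passed by value
  assert(v3.z[0] == 1.);
}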
