Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Array with/without Umpire #2427

Merged
merged 15 commits into from
Sep 16, 2021

Conversation

bendudson
Copy link
Contributor

@bendudson bendudson commented Sep 14, 2021

When Umpire and CUDA are enabled, needs to be able to run on GPU device (at least limited functionality).

Would like to keep memory pool feature, to minimise allocation overhead.

Working on possible performance issues.

elm-pb

Comparing next, next-hypre-outerloop-cuda-merged and this branch merged-array, with default input from this branch, nout=1 and timestep=0.2

next

2.000e-01 19 5.44e+02 99.6 0.3 0.0 0.0 0.1
Run time : 9 m 37 s

next-hypre-outerloop-cuda-merged

2.000e-01 19 3.41e+02 99.4 0.3 0.0 0.0 0.2
Run time : 6 m 1 s

merged-array

2.000e-01 19 5.51e+02 99.6 0.3 0.0 0.0 0.1
Run time : 9 m 37 s

With CMAKE_BUILD_TYPE=RELEASE:
2.000e-01 19 1.07e+01 96.2 1.7 0.1 0.4 1.6
Run time : 11 s

[elm-pb-outerloop gives
2.000e-01 19 7.14e+00 94.4 2.4 0.1 0.7 2.5
Run time : 8 ]

Note: elm-pb changed significantly between master and next, so can't compare directly. Instead use blob2d.

blob2d

master

2.000e+01 52 1.64e+00 61.6 17.4 0.1 1.2 19.7
Run time : 2 s

merged-array

2.000e+01 52 1.60e+02 97.1 2.1 0.0 0.0 0.8
Run time : 2 m 43 s

(100x slower??)
Both branches using check level 2 and -g flag. master compiling with -O2.

With -DCMAKE_BUILD_TYPE=RELEASE (adds -O3)
2.000e+01 52 8.69e+00 91.4 4.2 0.0 0.3 4.0
Run time : 9 s

** Make this the CMAKE_BUILD_TYPE default? **

Same result when array.hxx is copied from master into the merged-array branch.
=> Slowdown not related to changes to Array.

** Why slowdown in next? **

When Umpire and CUDA are enabled, needs to be able to run on GPU
device (at least limited functionality).

Would like to keep memory pool feature, to minimise allocation overhead.
@bendudson bendudson added the work in progress Not ready for merging label Sep 14, 2021
Some compilers (GCC 11.1.0) don't seem to need it; some do (Github
actions environment).
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/bout/array.hxx Outdated Show resolved Hide resolved
include/bout/array.hxx Outdated Show resolved Hide resolved
include/bout/array.hxx Outdated Show resolved Hide resolved
iterator<T> begin() const { return data; }
iterator<T> end() const { return data + len; }
int size() const { return len; }
void operator=(ArrayData<T>& in) { std::copy(std::begin(in), std::end(in), begin()); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: operator=() should return ArrayData& [misc-unconventional-assign-operator]

iterator<T> begin() const { return data; }
iterator<T> end() const { return data + len; }
int size() const { return len; }
void operator=(ArrayData<T>& in) { std::copy(std::begin(in), std::end(in), begin()); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: operator=() should take ArrayData const&, ArrayData&& or ArrayData [misc-unconventional-assign-operator]

data = new T[len];
counter = new int;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: assigning newly created gsl::owner<> to non-owner int * [cppcoreguidelines-owning-memory]

counter = new int;
^

#endif
}

BOUT_HOST_DEVICE ArrayData(const ArrayData& other) {
/// Move. No need to update counter
BOUT_HOST_DEVICE SharedArrayData(SharedArrayData&& other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: move constructors should be marked noexcept [performance-noexcept-move-constructor]

BOUT_HOST_DEVICE SharedArrayData(SharedArrayData&& other) {
                 ^
                                                           noexcept


BOUT_HOST_DEVICE inline ArrayData<T>& operator=(const ArrayData<T>& in) {
/// Assign, deep copying the underlying data
BOUT_HOST_DEVICE inline SharedArrayData<T>& operator=(const SharedArrayData<T>& in) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: operator=() does not handle self-assignment properly [bugprone-unhandled-self-assignment]

BOUT_HOST_DEVICE inline SharedArrayData<T>& operator=(const SharedArrayData<T>& in) {
                                            ^

/// Return the number of references to this data
/// Name chosen to emulate std::shared_ptr<ArrayData>
int use_count() {
if (!counter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: implicit conversion int * -> bool [readability-implicit-bool-conversion]

if (!counter) {
    ~^
             == nullptr

#else
template <typename T, typename Backing = ArrayData<T>, typename CreationPolicy = CreateSharedPointer>
#endif
class Array : private CreationPolicy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: class Array defines a non-default destructor, a copy constructor, a copy assignment operator and a move constructor but does not define a move assignment operator [cppcoreguidelines-special-member-functions]

class Array : private CreationPolicy {
      ^

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

iterator<T> begin() const { return data; }
iterator<T> end() const { return data + len; }
int size() const { return len; }
ArrayData<T>& operator=(const ArrayData<T>& in) { std::copy(std::begin(in), std::end(in), begin());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: operator=() does not handle self-assignment properly [bugprone-unhandled-self-assignment]

ArrayData<T>& operator=(const ArrayData<T>& in) { std::copy(std::begin(in), std::end(in), begin());
              ^

Previously void, now ArrayData<T>&
Recommended by Clang Tidy
CMAKE_BUILD_TYPE "DEBUG" is much slower (order or orders of
magnitude), so I would prefer that this was not the default
out of the box.
Ambiguous overload with assignment operator
Compiles with nvcc on Cori, but segfault at runtime.
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

iterator<T> begin() const { return data; }
iterator<T> end() const { return data + len; }
int size() const { return len; }
ArrayData<T>& operator=(const ArrayData<T>& in) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: operator=() does not handle self-assignment properly [bugprone-unhandled-self-assignment]

ArrayData<T>& operator=(const ArrayData<T>& in) {
              ^


BOUT_HOST_DEVICE inline T& operator[](int ind) { return data[ind]; };
BOUT_HOST_DEVICE inline const T& operator[](int ind) const { return data[ind]; };
T& operator[](int ind) { return data[ind]; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use pointer arithmetic [cppcoreguidelines-pro-bounds-pointer-arithmetic]

T& operator[](int ind) { return data[ind]; }
                                ^

By passing the Umpire allocator to `std::allocate_shared`, Umpire
should be used for all internal `shared_ptr` storage.  This should
hopefully enable `std::shared_ptr` to be used on both CPU and GPU
builds.

Previous segfaults were due to conflating the data assignment
operator with pointer assignment in the `SharedArrayData` class.
This commit deletes that class and always uses `std::shared_ptr`.

Includes some documentation, links.
Setting CMAKE_BUILD_TYPE here doesn't seem to affect the default
build type.
This is here as a warning to others.

Implemented custom allocator which calls Umpire

1. There is umpire::TypedAllocator which does this already

2. It doesn't work; using an Array in a RAJA loop fails.
   I suspect the copy constructor but am not certain.
Umpire already includes a C++ Allocator which can be
passed to STL containers. Unfortunately there seem to be many
gotchas (see
https://umpire.readthedocs.io/en/develop/sphinx/tutorial/typed_allocators.html)

Gives same error (unspecified launch failure) as previous allocator.

[skip ci]
Ensuring that only a raw pointer which has been allocated with Umpire
is passed into the RAJA lambda function.
These Arrays can't be passed into RAJA loops. Instead
the raw data pointer must be extracted before the loop.

This enables memory pools, shared pointers etc. to be
used in CPU-only code.
Possible future improvements, for later/when have time
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

BOUT_HOST_DEVICE inline T& operator[](int ind) { return data[ind]; };
BOUT_HOST_DEVICE inline const T& operator[](int ind) const { return data[ind]; };

inline T& operator[](int ind) { return data[ind]; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use pointer arithmetic [cppcoreguidelines-pro-bounds-pointer-arithmetic]

inline T& operator[](int ind) { return data[ind]; }
                                       ^

BOUT_HOST_DEVICE inline const T& operator[](int ind) const { return data[ind]; };

inline T& operator[](int ind) { return data[ind]; }
inline const T& operator[](int ind) const { return data[ind]; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use pointer arithmetic [cppcoreguidelines-pro-bounds-pointer-arithmetic]

inline const T& operator[](int ind) const { return data[ind]; }
                                                   ^

@bendudson
Copy link
Contributor Author

Now tentatively declaring victory: This version of Array compiles and runs on Cori GPUs, and is I think a minimal modification from what is in master.

  • (+) Enables pools to be used, so Umpire isn't called repeatedly. If multi-threaded these use a thread-local arena.
  • (-) Arrays can be passed to or used inside RAJA CUDA loops

@bendudson bendudson merged commit 335c2c2 into next-hypre-outerloop-cuda-merged Sep 16, 2021
@bendudson bendudson deleted the merged-array branch September 16, 2021 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
work in progress Not ready for merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant