Should/can array element deinitialization be parallelized? #15215

Closed
bradcray opened this issue Mar 13, 2020 · 13 comments · Fixed by #21670

Comments

@bradcray
Member

While looking through module code today, I noticed that an array of elements requiring deinitialization (e.g., an array of records with a deinit() procedure) deinitializes those elements serially rather than in parallel:

proc dsiDestroyDataHelper(dd, ddiNumIndices) {
  compilerAssert(chpl_isDdata(dd.type));
  for i in 0..ddiNumIndices-1 {
    chpl__autoDestroy(dd[i]);
  }
}

This issue asks whether this can legally be changed into a forall loop (seems like it) and whether that would ever bite us. @ronawho points out that our current registration/deregistration bottlenecks when a large number of arrays share the same domain could be a case that might hurt (though we have longstanding intentions to address that bottleneck in which case this seems like more of a no-brainer).
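
For concreteness, the change in question would look roughly like the following sketch; whether this is legal and profitable is exactly what this issue asks:

proc dsiDestroyDataHelper(dd, ddiNumIndices) {
  compilerAssert(chpl_isDdata(dd.type));
  // hypothetical parallel variant: deinitialize elements with a forall
  forall i in 0..ddiNumIndices-1 {
    chpl__autoDestroy(dd[i]);
  }
}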

For most of our performance benchmarks and critical code, this seems unlikely to have an impact since they're arrays of raw data. But who knows what users will do and we are a parallel language, so deinitializing in parallel seems like a no-brainer.

Related issues:
#6689
#9414
#11333
#10911

bradcray added a commit that referenced this issue Mar 18, 2020
Extend realloc-in-place to arrays of non-POD data

[reviewed by @mppf]

* implement an @mppf observation that we _can_ realloc non-POD data:
  we simply need to call deinit on the elements that are going away before
  invoking realloc; this increases the number of cases that can use realloc

* add a config param to support disabling the realloc code path since
  it turned out not to be quite as simple as originally hoped (in case
  we run into users who have trouble with it in the field).

* rename allocD to reallocD in the array resize code, because I
  could never remember whether "alloc" meant "the domain describing
  the current allocation or the new allocation".  Hopefully the
  new term makes it clearer that it's the latter (?).

* add a comment about a potential opportunity to parallelize array
  element deinitialization in the future (related to issue #15215).
mppf added a commit that referenced this issue May 25, 2020
Adjust array copy initialization to avoid default-init/assign

For #14924 and #9421.

Closes #10374.
Closes #11498.
Closes #13234.
Closes #14925.

Prior to this PR, there were several problems with array
initialization where the compiler emitted a default-initialization and
assignment pattern instead of copy initialization. This behavior is
observable for records because the `=` or `init=` for the record could
have side effects.

Here are some of the problematic cases, assuming A is an array:
 * untyped copy initialization, e.g. `var B = A`
 * typed copy initialization, e.g. `var B:A.type = A`
 * typed or untyped copy initialization for `in` intent arguments
 * initialization from loop expressions with shape, e.g. `var B = for i
   in 1..10 do new R(i)`
 * typed initialization from loop expressions, e.g. `var B:[1..10] int =
   for i in 1..10 do new R(i)`

This PR addresses these problems with a focus on getting the correct
behavior for record elements in arrays. Improving array allocation - so
in some cases a move can "steal" the buffer from an existing array - is
future work (see #9421). So is improving typed copy-initialization for
associative domains.

At a high level, the approach taken in this PR is to add
`chpl__coerceCopy` and `chpl__coerceMove` as siblings for
`chpl__initCopy` / `chpl__autoCopy`. Unlike the previously existing copy
functions, `chpl__coerceCopy` and `chpl__coerceMove` accept a `type t`
argument that stores the runtime type of the destination. These new
functions are only appropriate or useful for types with runtime types
(namely arrays and domains).

The compiler changes are focused on calling `chpl__coerceCopy` /
`chpl__coerceMove` in the right places. A `chpl__coerceCopy` call can turn
into `chpl__coerceMove` in the copy elision pass.

The PR makes significant changes to `wrappers.cpp` in order to get `in`
intent combined with default arguments to work correctly. Since default
args and runtime types used in `in` intent can both refer to previous
arguments, this PR changes the approach in `wrappers.cpp` to handle these
elements one argument at a time (instead of handling all default args,
then all in intents, etc). I think this simplifies the code but the more
important thing is that it produces more correct behavior.

In order to support `chpl__coerceMove` and `chpl__coerceCopy`, adjusted
`_domain.buildArray` to add a `param initElts`. This `param initElts` is
threaded through various DSI calls and in particular `dsiBuildArray`.
This allows the compiler to know at compile-time whether the array
elements need to be default initialized (which will not be allowed for
all types). Refactored `init_elts` into `init_elts_method` and added
`_ddata_allocate_noinit` to help with this.

Similarly, when move initializing one array from another, the elements
are transferred to the new array, and so the old array needs to avoid
deinitializing the elements. For this reason, added `param deinitElts` to
`dsiDestroyArr`. Made dsiDestroyArr a required function since forgetting
it would lead to memory leaks which might be less obvious to debug.
Created some helper routines (`_deinitElementsIsParallel` /
`_deinitElements`) to help with these. 

Generally speaking this PR adjusts the existing distributions to have the
correct record copy behavior. However it does not address making it easy
to write such distributions (and in fact currently distributions need to
use pragmas to get reasonable behavior). Issue #15673 discusses the
future direction to address this problem.

Since the elements in an array might now be initialized separately from
the allocation, the PR also takes steps to ensure that the post-alloc
function needed by the runtime in some cases (e.g.
`chpl_mem_array_postAlloc`) is called after the initialization is
complete. In order to enable this behavior, it adds
`dsiElementInitializationComplete` which domain maps can use to
eventually call a function like `chpl_mem_array_postAlloc`. Made
dsiElementInitializationComplete a required function since forgetting it
would lead to first-touch errors which might be harder to find. Created
some alternative ddata functions to allocate without initing and
separately run the post-allocate. Added a `callPostAlloc` field to
DefaultRectangular to indicate if the post alloc callback was requested
by the runtime so it can be run in that event in
dsiElementInitializationComplete.
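
A rough sketch of the sequencing this enables is below; only `dsiBuildArray`, `initElts`, and `dsiElementInitializationComplete` come from the PR, while the other names are made up for illustration:

// Hypothetical outline of array construction after this PR
proc buildArraySketch(dom, type eltType, param initElts: bool) {
  var arr = dom.dsiBuildArray(eltType, initElts); // may allocate without
                                                  // initializing elements
  if initElts then
    forall i in dom do initElement(arr, i);       // element initialization,
                                                  // possibly in parallel
  arr.dsiElementInitializationComplete();         // only now run post-alloc
                                                  // hooks (first-touch, etc.)
  return arr;
}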

This PR needed to make significant changes to DefaultAssociative to get
tests working because otherwise arrays had a mix of initialized and
uninitialized elements and that was not being tracked appropriately. This
PR takes the strategy of adjusting DefaultAssociative to consider
`chpl__tableEntry`s that are empty/deleted to have uninitialized `idx`/
values. The process of updating DefaultAssociative led to the discovery
and fix of several bugs (for example the bug exercised by this test where
deleted slots were hiding existing elements when adding, leading to
duplicates -- test/associative/ferguson/check-removing-and-adding.chpl ).
One of the bug fixes has the unfortunate effect of undoing PR #14065 -
this is addressed in future work described below involving refactoring
DefaultAssociative. While there, reduced the number of DSI methods that
associative relies on existing for all arrays and adjusted the ones that
do exist to have clearer behavior. Created an `_allSlots` iterator for
iterating over pieces of the hashtable since the parallel iteration
pattern is independent of the data stored. This helps to do the right
first-touch during associative domain allocation. (It can't be done after
index initialization since that might happen at an arbitrary future point
for some slots).

This PR ran into problems with replaceRecordWrappedRefs and the new array
initialization pattern. The reference to the element in
`chpl__transferArray` that was being initialized (say, `aa`) was becoming
a variable instead of a reference, which completely prevented
initialization. So, the PR applies a workaround to `chpl__transferArray` to
avoid the problem for susceptible types by iterating over the LHS domain
instead of array, and then doing array access in the loop.
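
Schematically (illustrative only, not the actual module code), instead of initializing via

forall (aa, bb) in zip(A, B) do
  aa = bb;        // `aa` was being demoted from a reference to a variable

the workaround iterates the LHS domain and indexes into the array:

forall i in A.domain do
  A[i] = B[i];    // array access keeps a usable reference to the element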

Fixed a bug where copy elision was not removing a call to record
postinit. In order to do so, changed initializerRules to add the flag
`FLAG_HAS_POSTINIT` to types that have a postinit so that
`AggregateType::hasPostInitializer` can be called later in resolution.

Improved the wording for error messages to do with setting a non-nilable
variable to `nil` to follow the pattern discussed in #11118.

Improved isConstInitializable to work better with arrays so that it can
be called within the array initCopy, for example.

Removed `PRIM_DEFAULT_INIT_FIELD` and `_createFieldDefault` (no longer
needed) and added `PRIM_STEAL` (to mark a source variable with
`FLAG_NO_AUTO_DESTROY`) to help write some of the move initialization
patterns.

Adjusted formal temps (those that go within function bodies for say
`inout` intent) to be destroyed at the end of the function rather than
after last use. This better matches the rule for when variables become
dead but #15645 discusses more work needed in this area.

Removed `chpl_checkCopyInit` because the check is now handled by
initializers.

Adjusted the unordered forall optimization to work with `PRIM_ASSIGN`
(because it now comes up in `chpl__transferArray`). To get the unordered
forall compilation report tests working, adjusted
findLocationIgnoringInternalInlining to consider functions marked
`FIND_USER_LINE` similarly to inline functions in internal/standard
modules; while there, applied this also to functions beginning with
`chpl__`.

Adjusted Sort.quickSort to call a new insertionSortMoveElts that moves
elements (because insertionSort is also called by sorts that assign
elements) to avoid problems when sorting arrays.

Reviewed by @benharsh - thanks!

- [x] primers pass valgrind and don't leak
- [x] full local testing
- [x] gasnet testing
- [x] multilocale perf testing - (EP, Global, Promoted) streams and
  miniMD are OK

Comm count changes:
 * test/optimizations/remoteValueForwarding/serialization/domains.chpl
   has 1 additional GET for the domain size due to using it in
   dsiElementInitializationComplete during an array initialization with a
   remote domain. DefaultRectangularArr could cache the number of
   elements allocated in its own structure to remove this GET and many
   others for cases like this where the domain is remote.
 * many Cyclic distribution tests in
   test/distributions/robust/arithmetic/performance/multilocale/ had 1
   more GET that seems to have to do with dsiDoms. But dsiDoms is doing
   GETs many many times for Cyclic already which makes me think that
   Cyclic needs some optimization (and this problem isn't made
   significantly worse by this PR).
 * array-assign-block-get.chpl was showing additional on statements that
   are probably related to dsiElementInitializationComplete, so I changed
   it to always use assignment instead of split-init/assign since that is
   what it has historically done. Issue
   Cray/chapel-private#964 tracks the TODO of
   improving dsiElementInitializationComplete.

Future work:
 * Check and tidy up other array initialization patterns - check for
   default-init/assign patterns in the compiler (insertEpilogueTemps, and
   maybe insertFinalGenerate, insertDestructureStatements, fieldToArg) and
   make array literal creation "move" the values into the array literal
   Cray/chapel-private#955
 * improve workaround for replaceRecordWrappedRefs
   Cray/chapel-private#950
 * initialization/deinitialization order of formal temps #15645
 * untyped array formal arguments with in intent and defaults #15628 
 * fix partial reductions sketch tests
   Cray/chapel-private#945
 * #15215 - explore parallel array element deinit (e.g. updating
   `_deinitElementsIsParallel`)
 * #6689 - parallel array initialization for all types
 * Factor a DefaultHashtable out of DefaultAssociative. Add a `value` to
   `chpl_TableEntry` which defaults to `nothing`. Adjust the Map / Set
   implementations to use it.
   Cray/chapel-private#952
 * optimize allocations for array move initialization from another array
   - for same-type arrays it's likely it will be able to steal buffers
   Cray/chapel-private#953 and #9421
 * stop inlining transferArray and use `pragma "find user line"` instead
   Cray/chapel-private#954
 * improve upon the default-init + assign behavior for copying
   associative domains Cray/chapel-private#962
 * make it possible to write domain maps using only user-facing features
   #15673 
 * unify `chpl__coerceMove` and `chpl__unref` and `chpl__unalias`  #15674
 * records and copy initialization across locales #15675
 *  records and move initialization across locales #15676 
 * better optimize post-allocation calls for BlockDist
   Cray/chapel-private#964
@ronawho
Contributor

ronawho commented Jul 23, 2020

> This issue asks whether this can legally be changed into a forall loop (seems like it) and whether that would ever bite us. @ronawho points out that our current registration/deregistration bottlenecks when a large number of arrays share the same domain could be a case that might hurt (though we have longstanding intentions to address that bottleneck in which case this seems like more of a no-brainer).

This bottleneck was addressed in #15895

> For most of our performance benchmarks and critical code, this seems unlikely to have an impact since they're arrays of raw data. But who knows what users will do and we are a parallel language, so deinitializing in parallel seems like a no-brainer.

I agree that for the most part this won't be too important for performance, but it could still save some time for large enough arrays of types with a non-trivial deinit. I wanted to explore that overhead, so I wrote a microbenchmark to look at deinit times for array-of-arrays, array-of-strings, and array-of-records:

use Time, CPtr;
config const nAoA = 5_000_000;
config const innerN = 50;
config const nAoS = nAoA * 50;
config const nAoR = nAoA * 50;

proc timeit(type T, desc: string) {
  var t: Timer;
  {
    t.start();
    var A: T;    // default-initialize an array of type T
    writef("%s create : %.3drs\n", desc, t.elapsed()); t.clear();
  }              // A goes out of scope here, so the next reading times deinit
  writef("%s destroy: %.3drs\n", desc, t.elapsed()); t.clear();
}

record RAlloc {
  var p: c_ptr(int);
  proc init() { p = c_malloc(int, innerN); }
  proc deinit() { c_free(p); }
}

record RNoop {
  var p: c_ptr(int);
  proc init() { }
  proc deinit() { }
}

timeit([1..nAoA][1..innerN] int, "AoA     ");
timeit([1..nAoS] string,         "AoStr   ");
timeit([1..nAoR] RAlloc,         "AoRalloc");
timeit([1..nAoR] RNoop,          "AoRnoop ");

And the following diff to toggle parallel init/deinit with a config const:

diff --git a/modules/internal/ChapelArray.chpl b/modules/internal/ChapelArray.chpl
index f87b81f331..14d46c74c3 100644
--- a/modules/internal/ChapelArray.chpl
+++ b/modules/internal/ChapelArray.chpl
@@ -3298,10 +3298,8 @@ module ChapelArray {
     _do_destroy_arr(array._unowned, array._instance, deinitElts);
   }

-  proc _deinitElementsIsParallel(type eltType) param {
-    // TODO: Would anything be hurt if this always returned true?
-    // one guess: arrays of arrays where all inner arrays share a domain?
-    return false;
+  proc _deinitElementsIsParallel(type eltType) {
+    return parallelInitDeinit && rootLocaleInitialized;
   }

   proc _deinitElements(array: _array) {
diff --git a/modules/internal/ChapelBase.chpl b/modules/internal/ChapelBase.chpl
index 9fb27d3e89..cdc619f9c9 100644
--- a/modules/internal/ChapelBase.chpl
+++ b/modules/internal/ChapelBase.chpl
@@ -27,6 +27,8 @@ module ChapelBase {
   pragma "locale private"
   var rootLocaleInitialized: bool = false;

+  config const parallelInitDeinit = false;
+
   use ChapelStandard;
   private use ChapelEnv, SysCTypes;

@@ -905,7 +907,7 @@ module ChapelBase {
       param heuristicThresh = 2 * 1024 * 1024;
       const heuristicWantsPar = arrsizeInBytes > heuristicThresh;

-      if heuristicWantsPar {
+      if heuristicWantsPar && parallelInitDeinit {
         initMethod = ArrayInit.parallelInit;
       } else {
         initMethod = ArrayInit.serialInit;

deinit times:

| type     | serial | parallel |
| -------- | -----: | -------: |
| AoA      | 1.164s |   2.660s |
| AoStr    | 0.843s |   0.098s |
| AoRalloc | 6.317s |   0.395s |
| AoRnoop  | 0.000s |   0.005s |

Generally what I see is that you really need the deinit to be freeing some memory in order for parallel deinit to be worthwhile. And even then parallel deinit still hurts the array-of-array case, though it's not nearly as impacted as it was before #15895.

My preference for now is probably to just leave deinit serial so we're not hurting the AoA case and revisit that after #10911.

@mppf
Member

mppf commented Jul 23, 2020

The AoA case probably has some nested parallelism. Perhaps that is presenting an issue here?

Could we turn on parallel deinit for all types except for arrays of arrays? I'm worried that if it is not parallel for user records that users will write programs with bugs that will be revealed when we turn it on later.

@ronawho
Contributor

ronawho commented Jul 23, 2020

We squash nested parallelism so that shouldn't be an issue. I think it's contention/serialization from acquiring the domain's lock.

If there is a way to identify array-of-arrays (and really array-of-records/classes-with-arrays) we could conditionally select parallel init.

I'm not as personally motivated to get this case parallel since it doesn't have major impacts on application performance. Parallel init matters a lot because it affects NUMA affinity and init can be really slow in some cases, but deinit speed doesn't impact much beyond deinit itself, and for most types I can think of, deinit will be much faster than init (e.g., allocating is much slower than freeing).

@bradcray
Member Author

The magnitude of the difference between these cases (e.g., AofRnoop and AofAofInt) seems concerning and worth investigating more. Or... Elliot, reading between the lines, are you positing that it's due to contention in the shared inner domain's hash table of linked arrays? (I don't think you said it directly, or else I missed it.)

If so, since Engin is actively working on the issue relating to const domains (I think?), I'm fine waiting a few weeks to see if that lands before doing more here.

For the given results, I'm curious what the behavior of AofAofString or AofAofRAlloc or AofAofRnoop or AofI would be. I.e., how much of this is the AofA structure vs. the element type / complexity?

My first thought was: what if we were to parallel-deinit arrays whose element types were non-POD (by some definition of POD)? E.g., I'm guessing that if we timed parallel dealloc of AofAofString or AofAofRAlloc we'd see a benefit? (i.e., that it's less an AofA issue than a question of what the element type is?) But I think we should only do this if Engin doesn't land his const domain change soon in a way that we can leverage.

@ronawho
Contributor

ronawho commented Jul 23, 2020

During the perf meeting we decided to hold off on this until the const domain change is in, and then revisit.

@ronawho
Contributor

ronawho commented Sep 18, 2020

The const domain optimization was merged in #16218, but I'm still seeing a regression for the AoA case (#15215 (comment))

The serial AoA deinit times seem to have gotten much faster since the last measurements, but the parallel case is only marginally better than before.

Running with 9ef62f4 on chapcs:

| type     | serial | parallel |
| -------- | -----: | -------: |
| AoA      | 0.309s |   2.133s |
| AoStr    | 0.983s |   0.081s |
| AoRalloc | 6.141s |   0.585s |
| AoRnoop  | 0.000s |   0.005s |

FYI @bradcray and @e-kayrakli since this came up in the perf meeting.

This makes me suspect we're still modifying the domain's list of arrays for the AoA case, but it is certainly faster than before so I'm not sure.

@e-kayrakli
Contributor

e-kayrakli commented Sep 18, 2020

Could it be because we still lock, decrement the counter, and unlock, instead of using an atomic int to keep the count?

We can do even more specialization for cases where a const domain is only used as an array-of-array's outer dimension. There may be some complications that I am not seeing there, though.

(We can use an atomic int pretty easily for const domains; however, domain constness is dynamic information, so we'd still have to have a lock but just not use it while removing an array from a domain.)

@ronawho
Contributor

ronawho commented Sep 18, 2020

Yeah good point, the locking isn't helping. Just removing the locks but not making the counter atomic (racy) takes parallel deinit down to 0.6 seconds. Still 2x slower than serial, but not bad.

@bradcray
Member Author

Is there a good reason that we use a lock rather than an atomic? Is it because in the event that we are tracking membership, we use the lock to guard the hash table?

> Still 2x slower than serial

That seems...weird...

@e-kayrakli
Contributor

> Is there a good reason that we use a lock rather than an atomic? Is it because in the event that we are tracking membership, we use the lock to guard the hash table?

No good reason beyond what you suggest.

I didn't want to tinker too much with the array tracking implementation, especially with the thinking that this new definedConst field could become a param in the near future and we might be able to revise that part again. But paramness (or even constness) of that field turned out to be a bit harder than we initially thought.

That being said, it is perplexing that parallel deinit is slower even with a regular int.

@bradcray
Member Author

> No good reason beyond what you suggest.

If the const bit was a param, we could potentially do something really clever like use a lock when we're maintaining the hash table and an atomic counter when we're not. But even with the const field it seems like there's a path forward: What if we stored an atomic, and used it as an atomic counter in the event that we're not using the hash table; but used it as a lock (via testAndSet, having it store 1/0 for full/empty, say) when we are?
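
A sketch of that idea (hypothetical names, not the actual tracking code): a single atomic int that acts as a plain counter when membership isn't tracked, and as a test-and-set-style lock guarding the hash table when it is.

record arrayTrackingSketch {
  var counterOrLock: atomic int;  // count when untracked; 0/1 lock when tracked

  proc ref removeUntracked() {
    counterOrLock.sub(1);                          // pure atomic decrement
  }

  proc ref removeTracked() {
    // "testAndSet": acquire by swinging 0 -> 1 (a real implementation
    // would yield while spinning)
    while !counterOrLock.compareAndSwap(0, 1) { }
    // ... update the hash table of tracked arrays here ...
    counterOrLock.write(0);                        // release
  }
}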

@ronawho
Contributor

ronawho commented Jan 18, 2023

I don't have any concrete timings yet, but deinit for arrays of bigints seems quite slow. I'm hoping to revisit this soon.

@ronawho
Contributor

ronawho commented Feb 17, 2023

During the most recent perf meeting we decided to move forward with this and accept the regression to array-of-arrays. Then longer term we can move to an atomic reference count, at least when we don't have to track domains. This is tracked in https://github.com/Cray/chapel-private/issues/1378. We also discussed just bumping the reference count once instead of once per element. I opened https://github.com/Cray/chapel-private/issues/4362 for that.

In the short term the plan is:

  * Commit a perf test
  * Switch to always using parallel deinit, accepting the hit for AoA

And eventually, or as there's demand for it, optimize the AoA case.

@ronawho ronawho self-assigned this Feb 22, 2023
ronawho added a commit that referenced this issue Feb 23, 2023
Add a performance test to track deinit times for array-of-arrays and
array-of-records (with alloc/dealloc). This is meant to track the impact
of an upcoming patch to parallelize array deinitialization.

Part of #15215
ronawho added a commit that referenced this issue Feb 27, 2023
[reviewed by @benharsh and @e-kayrakli]

Historically, we have deinitialized array elements serially. For many
codes this isn't an issue because POD types don't need any element
deinit, but for types with non-trivial deinit (typically things that
deallocate) like arrays-of-strings or arrays-of-bigints this can cause
significant slowdowns.

This changes array deinit to be parallel using the same heuristics as
parallel init. This ensures we don't create a lot of tasks for small
arrays and that affinity will be similar between init and deinit.

This speeds up deinit for most arrays whose elements require deinit, but
slows down array-of-arrays, since the performance hit from contention on
reference counters exceeds the speedup from parallel deallocation. Future
improvements to speed up reference counting and do bulk counting for AoA
are captured in Cray/chapel-private#1378 and Cray/chapel-private#4362.
And while it is a regression, it just makes AoA deinit as slow as init.

Performance for arrayInitDeinitPerf on chapcs:

| config     | before | after  |
| ---------- | -----: | -----: |
| AoA   init | 12.56s | 12.54s |
| AoA deinit |  2.86s | 10.35s |
| AoR   init |  0.28s |  0.28s |
| AoR deinit |  0.99s |  0.04s |

For an arkouda bigint_conversion test (which most recently motivated
this change) on 16-node-cs-hdr:

| config           | before | after  |
| ---------------- | -----: | -----: |
| bigint_from_uint | 2.19s  | 0.18s  |
| bigint_to_uint   | 2.63s  | 0.27s  |

Resolves #15215