Add static analysis to avoid adding slow followers in some cases #16053

e-kayrakli · 2020-07-09T22:51:38Z

This PR adds a static analysis to detect whether zippered distributed arrays
share the same domain and thus can leverage the fast follower optimization.

This is a small part of what I attempted at #16001.

This PR also rearranges the newly-added autoLocalAccess.cpp to make it more
general purpose.

Summary of changes:

Add staticFastFollowerAnalysis to autoLocalAccess.cpp
Do some refactor on that file:
- Rename to preNormalizeOptimizations.cpp
- Put function declarations in the beginning of the file
- Add a function doPreNormalizeArrayOptimizations to handle automatic local
  access and fast follower analysis right before normalization.

Although the main motivation for this PR was to reduce the compilation time
overheads, in some of the benchmarks I have seen that clones removed by this PR
to constitute only a small partion of those extra clones. #16001 gives an idea
of what needs to be done to remove a larger number of copies, and I hope to
revisit that effort soon.

Test:

add a test to lock the behavior
standard
gasnet

ronawho · 2020-07-13T19:45:11Z

Can you give some examples of which cases you expect to be optimized?

For the simplified stream example:

use BlockDist;

config const m = 1000, alpha = 3.0;
config param promoted = false;

const Space = {1..m} dmapped Block({1..m});
var A, B, C: [Space] real;

if promoted {
  A = B + alpha * C;
} else {
  forall (a, b, c) in zip(A, B, C) do
    a = b + alpha * c;
}

I see the number of generated lines drop ~2.5K from 61,456 to 58,669 for the non-promoted case, which is great. (chpl stream.chpl --fast --savec gen && wc -l gen/* | grep total)

I don't see any change for the promoted case, and curiously I only an 850 line reduction for the real stream.

It's not surprising that the promoted case will require more work, but it did seem odd that the real stream didn't show much change.

e-kayrakli · 2020-07-13T22:19:12Z

I see similar behavior on the snippet. And as you said, reworking promotion is the next step. It was one of the things I had tried in the prior PR but couldn't get it working in a reasonable amount of time.

As for the real stream, this PR removes one dynamic check, but leaves the one in verifyResults. This is because (based on my experience in auto localAccess optimization):

verifyResults is a generic function
The loop duplication happens too late that we instantiate verifyResults first, during which we lose the information that the forall loop is amenable to fast follower optimization, so we go back to the old method of doing dynamic checks.

I have had similar problems with generic function instantiation when working on auto localAccess, but in its current status, all the duplication occurs before normalize, and we use special primitives to mark accesses local. So, we don't really need to carry any information when duplicating FnSymbols during function instantiation.

I still haven't cleared my head about how to refactor fast followers, and my reflex is to have it similar to how --auto-local-access works: do all the cloning prenormalize, prune some stuff out during resolve. I am not sure how easy it is to do that without changing forall to two nested fors with leaders and followers as early as normalization, though.

ronawho · 2020-07-13T22:50:08Z

compiler/AST/foralls.cpp

-                                          new_Expr("'move'(%S, chpl__dynamicFastFollowCheckZip(%S))", T2, iterRec),
-                                          new_Expr("'move'(%S, %S)", T2, gFalse)));
+
+      // override the dynamic check if the compiler can call it


"if the compiler can call it" -> "if the compiler can prove it's safe"?

ronawho · 2020-07-13T22:52:42Z

compiler/include/preNormalizeOptimizations.h


 #include "CallExpr.h"
 #include "symbol.h"

 // interface for normalize
-void autoLocalAccess();
+void doPreNormalizeOptimizations();


Minor, but I don't find the name particularly helpful. Maybe preNormalizeArrayOptimizations?

I think it's worth calling out that it's pre-normalize since source level semantics are important for these optimizations, but wonder if it'd be helpful to specify a little more.

ronawho · 2020-07-13T22:55:20Z

test/optimizations/fastFollowers/staticCheck.prediff

@@ -0,0 +1,3 @@
+#!/bin/sh
+
+grep dynamicFastFollow c_code/staticCheck.c >> $1.exec.out.tmp


Leftover debug code, or is this meant to count the number of dynamic checks in the generated code?

This is my attempt to make sure that there is no dynamic check call in the generated code. Do you have a better approach in mind?

Oh, duh -- gotcha. This makes sense, I was just confused by which file it was redirecting too. $2 is the name of the logfile, so you should just be able to do grep dynamicFastFollow c_code/staticCheck.c >> $2

Longer term, I wonder if there's a more general way to do this that would work for llvm. The only thing that comes to mind is a compilerWarning or something in the slow follower, but that either means adding a test distribution (which seems like a lot of code) or introducing a param protected warning in block, which feels wonky.

I have toyed with two different approaches that I believe can also work with LLVM:

Hijack the distribution's follower in the test code, do a writeln, forward to the actual follower. I implemented something similar to this here

Instead of grepping the generated code, grep something in the AST dump. Likely the dump of codegen.

e-kayrakli · 2020-07-15T21:28:08Z

@ronawho -- I interpreted your feedback as you are fine with the rest of the implementation and merged it to avoid doing that close to the weekend. Let me know if you have any post-merge feedback when you come back. (Including "I totally hated this, just revert it back")

ronawho · 2020-07-16T14:08:09Z

I interpreted your feedback as you are fine with the rest of the implementation and merged it to avoid doing that close to the weekend.

Yup, you interpreted correctly, I meant to approve this before taking off but I forgot.

Let me know if you have any post-merge feedback when you come back. (Including "I totally hated this, just revert it back")

Nope, looks great. This is something I've been wanting to do for a long time, but had no idea how to tackle it myself so I'm really excited to see progress on this.

@ronawho

Reduce loop cloning in fast followers This PR adjusts the module support for fast followers to reduce the amount of loop cloning we do. Initially attempted at #16001 Also see #16053, which has done similar reductions in loop cloning that happens with fast followers Motivation ---------- The static check that we do today is way too lax. Because for a leader that supports fast followers, we always have a fast loop. However that loop is unused if any of the followers don't support fast followers and/or not of same type as the leader. This decision seems to be made in 2010 when fast followers were first implemented. I assume that compiler wasn't this powerful and slow and/or there wasn't param unfolding etc. The commit where this was added is 0dd8331. Citing part of it that alludes to this decision: > This optimization works as follows: When 'parsing' a forall loop, we > generate code that generates two loops in a conditional, a fast > follower loop and a normal follower loop. The conditional is guarded > by or'ing the result of calling the static checks on all of the > followers, and then and'ing the result of calling the dynamic checks > on all of the followers. That is, if any follower has a fast version > that may be called, let's generate the code, but we're only going to > run the fast loop (all fast followers) if every follower that may be > called should be called. Where there is no strong argument for having this behavior. I believe these type of false optimizations are the main reason for the performance overhead of fast followers. Approach -------- The fast follower implementation in general is too complicated. I think the reason for that was we were overloading the meaning of static and dynamic checks probably to reduce the amount of code. These checks return a bool and thus can be used for binary choices. However, we have a more complicated taxonomy of types w.r.t fast followers: - Arrays - Array types that supports fast followers: can lead and generate fast followers - Array types that don't support fast followers: cannot lead or generate fast followers - Domains: Can lead fast followers. Can take place among other fast followers in a zippered loop. - Other iterators: Cannot lead or generate fast followers. However, can take place among other fast followers in a zippered loop This PR adds three helpers (and their overloads) to distinguish these: - `chpl__canHaveFastFollowers`: Returns true for an array, or a zip tuple that has at least one array. (This check doesn't involve the specific type of the array, it is still handled by `chpl__staticFastFollowCheck` - `chpl__canLeadFastFollowers`: Returns true for arrays and domains. - `chpl__hasInertFastFollowers`: Returns false for arrays. For a zip tuple, it returns false if there is one array. These helpers are used in the other existing checks to limit loop cleaning and improve readability. Performance ----------- For benchmarks where fast followers are used heavily (stream, prk-stencil) there isn't any change in compiler or application performance. We see small improvements in benchmarks where we create fast followers unnecessarily (cross-type array assignments/zippering are examples where this may happen). On miniMD: I see 3 seconds faster --fast compilation on XC and locally On SSCA2: I see 6 seconds faster --fast compilation locally, but not so much of an improvement on XC. The reduction in the code size is there but its impact on performance is a bit of a hit-and-miss and depends on where you compile. Readability Advantages ---------------------- - Static and dynamic checks are more symmetrical - Static check is and'ed instead of or'ed. i.e. we only have a clone if it makes sense - Dynamic check no longer returns `true` for generic argument. [Reviewed by @ronawho] Test: - [x] standard - [x] gasnet

e-kayrakli mentioned this pull request Jul 9, 2020

Improve static checks and add new static analysis for fast followers #16001

Closed

5 tasks

e-kayrakli marked this pull request as ready for review July 13, 2020 19:11

ronawho reviewed Jul 13, 2020

View reviewed changes

e-kayrakli added 8 commits July 15, 2020 10:42

Add the analysis for symbolic fast follower check

48601f6

Use the analysis result to override the dynamic check

4b56ed9

Minor cleanup

9d0ba3a

Comment out the debug print

6f77018

Move log output out of a common helper

be2567b

Do a major refactor to support both optimizations cleanly

8ec1a40

Change a filename in the compiler

b0d3b7a

Add test

c4664c7

e-kayrakli force-pushed the static-fast-follower-minimal branch from 418546b to c4664c7 Compare July 15, 2020 18:07

e-kayrakli added 3 commits July 15, 2020 11:11

Address Elliot's feedback

804a2b2

Fix the header file, too

b8ff901

Adjust the log output in tests

0af5b36

e-kayrakli merged commit b95172e into chapel-lang:master Jul 15, 2020

e-kayrakli deleted the static-fast-follower-minimal branch July 15, 2020 21:26

e-kayrakli mentioned this pull request Aug 12, 2020

Reduce loop cloning in fast followers #16229

Merged

2 tasks

ronawho mentioned this pull request Jan 22, 2021

Automatically aggregate some communication in foralls #16965

Merged

8 tasks

e-kayrakli mentioned this pull request Apr 20, 2021

Don't do dynamic fast follower checks for non-zippered foralls #17570

Merged

2 tasks

e-kayrakli mentioned this pull request Jan 5, 2024

Improve performance of GPU zippered iteration over identical domains #24131

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add static analysis to avoid adding slow followers in some cases #16053

Add static analysis to avoid adding slow followers in some cases #16053

e-kayrakli commented Jul 9, 2020 •

edited

ronawho commented Jul 13, 2020

e-kayrakli commented Jul 13, 2020

ronawho Jul 13, 2020

ronawho Jul 13, 2020

ronawho Jul 13, 2020

e-kayrakli Jul 13, 2020

ronawho Jul 13, 2020

e-kayrakli Jul 13, 2020 •

edited

e-kayrakli commented Jul 15, 2020

ronawho commented Jul 16, 2020

		@@ -0,0 +1,3 @@
		#!/bin/sh

		grep dynamicFastFollow c_code/staticCheck.c >> $1.exec.out.tmp

Add static analysis to avoid adding slow followers in some cases #16053

Add static analysis to avoid adding slow followers in some cases #16053

Conversation

e-kayrakli commented Jul 9, 2020 • edited

ronawho commented Jul 13, 2020

e-kayrakli commented Jul 13, 2020

ronawho Jul 13, 2020

Choose a reason for hiding this comment

ronawho Jul 13, 2020

Choose a reason for hiding this comment

ronawho Jul 13, 2020

Choose a reason for hiding this comment

e-kayrakli Jul 13, 2020

Choose a reason for hiding this comment

ronawho Jul 13, 2020

Choose a reason for hiding this comment

e-kayrakli Jul 13, 2020 • edited

Choose a reason for hiding this comment

e-kayrakli commented Jul 15, 2020

ronawho commented Jul 16, 2020

e-kayrakli commented Jul 9, 2020 •

edited

e-kayrakli Jul 13, 2020 •

edited