Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add static analysis to avoid adding slow followers in some cases #16053

Merged
merged 11 commits into from Jul 15, 2020

Conversation

e-kayrakli
Copy link
Contributor

@e-kayrakli e-kayrakli commented Jul 9, 2020

This PR adds a static analysis to detect whether zippered distributed arrays
share the same domain and thus can leverage the fast follower optimization.

This is a small part of what I attempted at #16001.

This PR also rearranges the newly-added autoLocalAccess.cpp to make it more
general purpose.

Summary of changes:

  • Add staticFastFollowerAnalysis to autoLocalAccess.cpp
  • Do some refactor on that file:
    • Rename to preNormalizeOptimizations.cpp
    • Put function declarations in the beginning of the file
    • Add a function doPreNormalizeArrayOptimizations to handle automatic local
      access and fast follower analysis right before normalization.

Although the main motivation for this PR was to reduce the compilation time
overheads, in some of the benchmarks I have seen that clones removed by this PR
to constitute only a small partion of those extra clones. #16001 gives an idea
of what needs to be done to remove a larger number of copies, and I hope to
revisit that effort soon.

Test:

  • add a test to lock the behavior
  • standard
  • gasnet

@e-kayrakli e-kayrakli marked this pull request as ready for review July 13, 2020 19:11
@ronawho
Copy link
Contributor

ronawho commented Jul 13, 2020

Can you give some examples of which cases you expect to be optimized?

For the simplified stream example:

use BlockDist;

config const m = 1000, alpha = 3.0;
config param promoted = false;

const Space = {1..m} dmapped Block({1..m});
var A, B, C: [Space] real;

if promoted {
  A = B + alpha * C;
} else {
  forall (a, b, c) in zip(A, B, C) do
    a = b + alpha * c;
}

I see the number of generated lines drop ~2.5K from 61,456 to 58,669 for the non-promoted case, which is great. (chpl stream.chpl --fast --savec gen && wc -l gen/* | grep total)

I don't see any change for the promoted case, and curiously I only an 850 line reduction for the real stream.

It's not surprising that the promoted case will require more work, but it did seem odd that the real stream didn't show much change.

@e-kayrakli
Copy link
Contributor Author

I see similar behavior on the snippet. And as you said, reworking promotion is the next step. It was one of the things I had tried in the prior PR but couldn't get it working in a reasonable amount of time.

As for the real stream, this PR removes one dynamic check, but leaves the one in verifyResults. This is because (based on my experience in auto localAccess optimization):

  • verifyResults is a generic function
  • The loop duplication happens too late that we instantiate verifyResults first, during which we lose the information that the forall loop is amenable to fast follower optimization, so we go back to the old method of doing dynamic checks.

I have had similar problems with generic function instantiation when working on auto localAccess, but in its current status, all the duplication occurs before normalize, and we use special primitives to mark accesses local. So, we don't really need to carry any information when duplicating FnSymbols during function instantiation.

I still haven't cleared my head about how to refactor fast followers, and my reflex is to have it similar to how --auto-local-access works: do all the cloning prenormalize, prune some stuff out during resolve. I am not sure how easy it is to do that without changing forall to two nested fors with leaders and followers as early as normalization, though.

new_Expr("'move'(%S, chpl__dynamicFastFollowCheckZip(%S))", T2, iterRec),
new_Expr("'move'(%S, %S)", T2, gFalse)));

// override the dynamic check if the compiler can call it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if the compiler can call it" -> "if the compiler can prove it's safe"?


#include "CallExpr.h"
#include "symbol.h"

// interface for normalize
void autoLocalAccess();
void doPreNormalizeOptimizations();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I don't find the name particularly helpful. Maybe preNormalizeArrayOptimizations?

I think it's worth calling out that it's pre-normalize since source level semantics are important for these optimizations, but wonder if it'd be helpful to specify a little more.

@@ -0,0 +1,3 @@
#!/bin/sh

grep dynamicFastFollow c_code/staticCheck.c >> $1.exec.out.tmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover debug code, or is this meant to count the number of dynamic checks in the generated code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my attempt to make sure that there is no dynamic check call in the generated code. Do you have a better approach in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, duh -- gotcha. This makes sense, I was just confused by which file it was redirecting too. $2 is the name of the logfile, so you should just be able to do grep dynamicFastFollow c_code/staticCheck.c >> $2

Longer term, I wonder if there's a more general way to do this that would work for llvm. The only thing that comes to mind is a compilerWarning or something in the slow follower, but that either means adding a test distribution (which seems like a lot of code) or introducing a param protected warning in block, which feels wonky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have toyed with two different approaches that I believe can also work with LLVM:

  • Hijack the distribution's follower in the test code, do a writeln, forward to the actual follower. I implemented something similar to this here
  • Instead of grepping the generated code, grep something in the AST dump. Likely the dump of codegen.

@e-kayrakli e-kayrakli force-pushed the static-fast-follower-minimal branch from 418546b to c4664c7 Compare July 15, 2020 18:07
@e-kayrakli e-kayrakli merged commit b95172e into chapel-lang:master Jul 15, 2020
@e-kayrakli e-kayrakli deleted the static-fast-follower-minimal branch July 15, 2020 21:26
@e-kayrakli
Copy link
Contributor Author

@ronawho -- I interpreted your feedback as you are fine with the rest of the implementation and merged it to avoid doing that close to the weekend. Let me know if you have any post-merge feedback when you come back. (Including "I totally hated this, just revert it back")

@ronawho
Copy link
Contributor

ronawho commented Jul 16, 2020

I interpreted your feedback as you are fine with the rest of the implementation and merged it to avoid doing that close to the weekend.

Yup, you interpreted correctly, I meant to approve this before taking off but I forgot.

Let me know if you have any post-merge feedback when you come back. (Including "I totally hated this, just revert it back")

Nope, looks great. This is something I've been wanting to do for a long time, but had no idea how to tackle it myself so I'm really excited to see progress on this.

e-kayrakli added a commit that referenced this pull request Aug 13, 2020
Reduce loop cloning in fast followers

This PR adjusts the module support for fast followers to reduce the amount of
loop cloning we do.

Initially attempted at #16001
Also see #16053, which has done
similar reductions in loop cloning that happens with fast followers

Motivation
----------

The static check that we do today is way too lax. Because for a leader that
supports fast followers, we always have a fast loop. However that loop is unused
if any of the followers don't support fast followers and/or not of same type as
the leader.

This decision seems to be made in 2010 when fast followers were first
implemented. I assume that compiler wasn't this powerful and slow and/or there
wasn't param unfolding etc. 

The commit where this was added is 0dd8331.

Citing part of it that alludes to this decision:

>   This optimization works as follows: When 'parsing' a forall loop, we
>   generate code that generates two loops in a conditional, a fast
>   follower loop and a normal follower loop.  The conditional is guarded
>   by or'ing the result of calling the static checks on all of the
>   followers, and then and'ing the result of calling the dynamic checks
>   on all of the followers.  That is, if any follower has a fast version
>   that may be called, let's generate the code, but we're only going to
>   run the fast loop (all fast followers) if every follower that may be
>   called should be called.

Where there is no strong argument for having this behavior.

I believe these type of false optimizations are the main reason for the
performance overhead of fast followers.

Approach
--------

The fast follower implementation in general is too complicated. I think the
reason for that was we were overloading the meaning of static and dynamic checks
probably to reduce the amount of code. These checks return a bool and thus can
be used for binary choices. However, we have a more complicated taxonomy of
types w.r.t fast followers:

- Arrays
  - Array types that supports fast followers: can lead and generate fast
    followers
  - Array types that don't support fast followers: cannot lead or generate fast
    followers

- Domains: Can lead fast followers. Can take place among other fast followers in
  a zippered loop.

- Other iterators: Cannot lead or generate fast followers. However, can take
  place among other fast followers in a zippered loop

This PR adds three helpers (and their overloads) to distinguish these:
- `chpl__canHaveFastFollowers`: Returns true for an array, or a zip tuple that
  has at least one array. (This check doesn't involve the specific type of the
  array, it is still handled by `chpl__staticFastFollowCheck`

- `chpl__canLeadFastFollowers`: Returns true for arrays and domains.

- `chpl__hasInertFastFollowers`: Returns false for arrays. For a zip tuple, it
  returns false if there is one array.

These helpers are used in the other existing checks to limit loop cleaning and
improve readability.

Performance
-----------

For benchmarks where fast followers are used heavily (stream, prk-stencil) there
isn't any change in compiler or application performance. We see small
improvements in benchmarks where we create fast followers unnecessarily
(cross-type array assignments/zippering are examples where this may happen).

On miniMD: I see 3 seconds faster --fast compilation on XC and locally
On SSCA2: I see 6 seconds faster --fast compilation locally, but not so much of
an improvement on XC.

The reduction in the code size is there but its impact on performance is a bit
of a hit-and-miss and depends on where you compile.


Readability Advantages
----------------------

- Static and dynamic checks are more symmetrical
- Static check is and'ed instead of or'ed. i.e. we only have a clone if it makes
  sense
- Dynamic check no longer returns `true` for generic argument.

[Reviewed by @ronawho]

Test:
- [x] standard
- [x] gasnet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants