
[concurrency] SILOptimizer: optimize hop_to_executor instructions. #34593

Merged 1 commit on Nov 9, 2020

Conversation

eeckstein (Member)

  • Redundant hop_to_executor elimination: if a hop_to_executor is dominated by another hop_to_executor with the same operand, it is eliminated:
      hop_to_executor %a
      ... // no suspension points
      hop_to_executor %a // can be eliminated
  • Dead hop_to_executor elimination: if a hop_to_executor is not followed by any code that is required to run on its actor's executor, it is eliminated:
      hop_to_executor %a
      ... // no instructions that require running on %a
      return

rdar://problem/70304809

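To make the redundant-hop case concrete, here is a minimal single-block sketch in C++ against a toy instruction model (the `Kind`/`Inst` types and `eliminateRedundantHops` are invented for illustration; the real pass operates on SIL instructions and handles control flow across blocks):

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy instruction model for illustration only; the real pass operates on
// SIL instructions, not these hypothetical types.
enum class Kind { Hop, Suspend, Other };
struct Inst {
  Kind kind;
  int actor = -1; // operand of a Hop, modeled as an index
};

// Within one block: drop a hop_to_executor if we already hopped to the same
// actor and no suspension point intervened.
std::vector<Inst> eliminateRedundantHops(const std::vector<Inst> &block) {
  std::vector<Inst> out;
  std::optional<int> currentActor; // executor we are known to be on
  for (const Inst &inst : block) {
    switch (inst.kind) {
    case Kind::Hop:
      if (currentActor && *currentActor == inst.actor)
        continue; // redundant: already running on this executor
      currentActor = inst.actor;
      break;
    case Kind::Suspend:
      currentActor.reset(); // executor unknown after a suspension point
      break;
    case Kind::Other:
      break; // neither a hop nor a suspension point
    }
    out.push_back(inst);
  }
  return out;
}
```

A suspension point invalidates the tracked executor, which is why the second hop in the `hop / suspend / hop` pattern must be kept.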
@eeckstein (Member, Author)

@swift-ci smoke test

eeckstein merged commit 56928ba into apple:main on Nov 9, 2020
eeckstein deleted the optimize_hte branch on Nov 9, 2020 at 08:23
atrick (Member) commented Jan 11, 2021

@eeckstein post-commit review!

The choice of algorithm seems good, but the only explanation is

if a hop_to_executor is dominated by another hop_to_executor with
the same operand, it is eliminated

That's true but misleading for someone trying to understand the actual algorithm, which has nothing to do with dominance. There should be a statement about the optimistic data-flow approach.

if a hop_to_executor is not followed by any code which requires to run on its actor's executor, it is eliminated

This would be much more approachable if the two independent optimizations were separately defined and documented. The only data structure that is worth reusing across them is a block-to-index DenseMap.

The implementation is concise and nicely structured, but the algorithm is not evident from the code, so I can't reason about correctness in general. I expected a perfectly straightforward dataflow, and instead I'm baffled after spending hours trying to reverse engineer it from the implementation.

It should be possible to understand the dataflow algorithm by reading the definition of

struct BlockState {

But from that, I can't tell what the data flow lattice means and I can't make sense of the merge operation from the definition of BlockState. That's because it is serving all of these purposes without any description or self-explanatory interface:

  • dead-executor lattice

  • dead-executor local effects (block transfer function)

  • redundant-executor lattice

  • redundant-executor actor index

  • redundant-executor local effects

All of this should be clear in the definition of BlockState itself.

The first step is to use a separate class for the two unrelated optimizations. There's no need to reuse the BlockState type across optimizations because 'blockStates' can be a single allocation based on the computed number of blocks. If you free one after the first optimization, and allocate another before the next optimization, the system will reuse the memory for you. (Re)initializing that memory is far more expensive than the allocation.

While enumerating blocks you can populate a block-to-index map that can be reused across optimizations. In fact, I've often wanted this to be an analysis for use in many passes.

Then we can deal with each dataflow on its own. Backward dataflow, for example, only needs two lattice states:

Top:    NoExecutorNeeded (optimistic)

Bottom: ExecutorNeeded   (pessimistic)

While the transfer function needs three states:

- NoEffect

- ExecutorNeeded

- NoExecutorNeeded

You can combine the lattice and local info into a single enum if you want, but then you should explain what you're doing. What does the NotSet value mean in the context of the data flow? Why isn't data flow simply initialized to NoExecutorNeeded? The global data flow state should never have a NotSet value.

 NotSet = -2,

There must be a reason the code chooses specific values for the states, but that isn't explained. Note that it might be more efficient to initialize object bits to zero.
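The two-point lattice described above can be written down directly; this C++ sketch (state names are taken from the comment, the value choices are mine) shows how picking ordered integer values makes the merge a plain max and lets zero-initialized memory start at the optimistic Top:

```cpp
#include <algorithm>
#include <cassert>

// Two-point lattice for the backward "dead hop" dataflow, with explicit
// values so that the meet is just std::max: moving toward the pessimistic
// Bottom can only increase the value, and zero-initialized memory starts
// at the optimistic Top.
enum class Lattice {
  NoExecutorNeeded = 0, // Top (optimistic initial state)
  ExecutorNeeded = 1    // Bottom (pessimistic)
};

// Meet over successors: the executor is needed if any successor needs it.
inline Lattice meet(Lattice a, Lattice b) {
  return static_cast<Lattice>(
      std::max(static_cast<int>(a), static_cast<int>(b)));
}
```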

Later, buried within the code, we find out that these states are actually storing actor indices for one of the optimizations!

int entry = NotSet;
int intra = NotSet;
int exit = NotSet;

It isn't clear what intra means. You only need a single transient variable to hold dataflow state within a block; it doesn't need to be stored in the block vector. This seems to be holding the block's local effects, which should only be initialized and never updated.

It also isn't clear why you want to store exit for backward dataflow, but we'll get to that later.

 if (newExit != state.exit || firstRound) {

I don't understand why there would be a firstRound check. The dataflow initialization handles the first round of dataflow and happens earlier. If global data flow propagation needs to do something different across iterations, then something is broken. I would expect the body of the data flow propagation loop to be:

  do {
    // Iterate over the blocks in an approximation of post-order
    for (BlockState &state : reverse(blockStates)) {
      if (state.effect != NoEffect) {
        continue;
      }
      int newEntry = mergeSuccessors(state.block);
      if (newEntry != state.entry) {
        changed = true;
        state.entry = newEntry;
      }
    }
  } while (changed);

I've been seeing a lot of these 'while(changed)' style data flow loops in the code.
I think that's a good choice in this case because you're just iterating through a vector and the merge is cheap. I just want to point out that the alternative is to do something like this instead:

      if (state.entry != newEntry) {
        blockWorklist.insert(blockIndex);
      }

Then, for straight-line code processed postorder, you never enter the global data flow loop at all and do half as many CFG traversals and merges in the common case. Of course, the worklist itself is more expensive and complicates the code.
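A self-contained sketch of that worklist alternative, on a toy CFG of block indices with the state encoded as an int (1 = executor needed at block entry, 0 = not needed; kill effects are ignored for brevity, and all names here are stand-ins, not the pass's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Worklist-driven fixpoint: only re-merge blocks whose successors changed,
// instead of sweeping the whole block vector until nothing changes.
// localEffect[b] is 1 if block b contains an instruction that needs the
// executor.
std::vector<int> solveBackward(const std::vector<std::vector<int>> &succs,
                               const std::vector<std::vector<int>> &preds,
                               const std::vector<int> &localEffect) {
  size_t n = succs.size();
  std::vector<int> state(n, 0); // optimistic: no executor needed
  std::deque<int> worklist;
  for (size_t b = 0; b < n; ++b)
    worklist.push_back(static_cast<int>(b));
  while (!worklist.empty()) {
    int b = worklist.front();
    worklist.pop_front();
    int merged = localEffect[b];
    for (int s : succs[b])
      merged = std::max(merged, state[s]);
    if (merged != state[b]) {
      state[b] = merged;
      // Only blocks whose merge input changed need revisiting.
      for (int p : preds[b])
        worklist.push_back(p);
    }
  }
  return state;
}
```

On straight-line code seeded in postorder, the initial pass already reaches the fixpoint and the worklist drains without re-queuing anything.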

state.exit = (state.block->getSuccessors().empty() ?
BlockState::NoExecutorNeeded : BlockState::NotSet);

I don't understand why state.exit is needed, or why it would ever be set to NotSet. NotSet doesn't mean anything for global data flow, it only indicates that there is no transfer function for some piece of code. I would expect initial local propagation to process the blocks like this, assuming they are laid out in something close to reverse-postorder:

  for (BlockState &state : reverse(blockStates)) {
    auto localState = mergeSuccessors(state.block);
    for (SILInstruction &inst : llvm::reverse(*state.block)) {
      auto effect = needsExecutor(&inst);
      updateState(effect, localState);
      if (effect != NoEffect) {
        state.effect = effect;
      }
    }
    state.entry = localState;
  }

// Last step: do the transformation.

You should not need to iterate over all the instructions again, only over the blocks that contain a hop instruction, which are very uncommon. You can simply add the blocks that contain a hop instruction to a vector and let that drive the optimization.

Then there is no reason to store state.exit. It's trivial to call mergeSuccessors() for the few blocks that need to be optimized before running the optimization.
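A hypothetical sketch of that bookkeeping: mark hop-containing blocks during the single initialization pass and let the resulting short list drive the rewrite (the `Block` type and `collectHopBlocks` name are toy stand-ins, not the pass's actual code):

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a basic block; the real pass would record
// SILBasicBlock pointers while it scans instructions anyway.
struct Block {
  bool hasHop;
};

// Collect the (rare) blocks that contain a hop instruction so the final
// transformation only visits those instead of re-scanning everything.
std::vector<int> collectHopBlocks(const std::vector<Block> &blocks) {
  std::vector<int> hopBlocks;
  for (int i = 0; i < static_cast<int>(blocks.size()); ++i)
    if (blocks[i].hasHop)
      hopBlocks.push_back(i);
  return hopBlocks;
}
```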

if (auto *load = dyn_cast<LoadInst>(inst)) {
  return isGlobalMemory(load->getOperand());
}
if (auto *store = dyn_cast<StoreInst>(inst)) {

With OSSA we have LoadBorrows and StoreBorrows. For newly written optimizations, we may want to take this into consideration by using some kind of load/store abstraction. @gottesman @meghana does one exist?

@eeckstein (Member, Author)

@atrick thanks for the review!

I agree, the comments could be better.

... is a block-to-index DenseMap.

This is a good idea.

I don't understand why there would be a firstRound check.

Without this, I would need to replicate the transfer function in the dataflow initialization, which is surprisingly easy to get wrong. After hitting several bugs with this (also in other optimizations), I decided to do it with a firstRound check. It's an easy and safe way to avoid such bugs.

I don't understand why state.exit is needed

Yeah, it's probably not needed.

I'll make some improvements the next time I touch this optimization.

atrick (Member) commented Jan 13, 2021

I don't understand why there would be a firstRound check.

Without this, I would need to replicate the transfer function in the dataflow initialization, which is surprisingly easy to get wrong. After hitting several bugs with this (also in other optimizations), I decided to do it with a firstRound check. It's an easy and safe way to avoid such bugs.

You can have a single transfer function, updateState or applyEffectToState, that applies an effect to the current dataflow state; then it doesn't matter whether you're dealing with instructions or blocks. Blocks can store their local effect, as determined during initialization, and a helper function can get the effect for an instruction:

    auto effect = needsExecutor(&inst);
    tempState = applyEffectToState(effect, tempState);

Then you can compute all effects during one initialization pass, the global dataflow propagation only iterates once and makes no changes for straight-line code, and there are no special cases.
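What such a shared transfer function might look like, using the three effect values named earlier in the review (a sketch; the mapping of each effect to a state transition is my reading of the backward dataflow, not the pass's actual code):

```cpp
#include <cassert>

// One shared transfer function, applied per instruction during
// initialization and per block (via its stored local effect) during
// global propagation, so both phases use the same code path.
enum class Effect { NoEffect, ExecutorNeeded, NoExecutorNeeded };
enum class State { NoExecutorNeeded, ExecutorNeeded };

State applyEffectToState(Effect effect, State state) {
  switch (effect) {
  case Effect::NoEffect:
    return state; // transparent: state flows through unchanged
  case Effect::ExecutorNeeded:
    return State::ExecutorNeeded; // e.g. an access that must run on the actor
  case Effect::NoExecutorNeeded:
    return State::NoExecutorNeeded; // e.g. a hop re-establishes the executor
  }
  return state;
}
```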

Otherwise it's hard for me to understand what state.intra really means, and I need to think hard about why the firstRound check is correct. I'm not convinced it's right yet.
