Avoid quadratic behavior in pathological label-alias case in MachBuffer. #3469

cfallin · 2021-10-21T19:17:26Z

If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in MachBuffer would remove them all, as expected.
However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.

When many thousands of this branch-to-next pattern occur, many thousands
of labels will reference the current buffer tail, and this list of
thousands of labels will be shuffled between the branch metadata struct
and the "labels at tail" struct as branches are appended and then
chomped immediately.

It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in the
order they are bound, with ranges of indices in this array used to
represent lists of labels (actually, that seems like a better design in
general); but let's leave that to future optimization work.
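
A rough sketch of how the shared-array idea might look, not code from this PR. All names here (`SharedLabels`, `BranchRecord`, `start_branch`, `chomp`) are hypothetical, and the sketch assumes that the labels recorded at a branch and the labels bound right after it always sit next to each other in bind order, so merging two lists is just widening a range:

```rust
use std::ops::Range;

/// Hypothetical shared store: every label lives once, in bind order.
#[derive(Default)]
struct SharedLabels {
    /// All labels ever bound, in the order they were bound.
    bound: Vec<u32>,
    /// Index range into `bound` of the labels aliasing the buffer tail.
    at_tail: Range<usize>,
}

/// Hypothetical per-branch metadata: a range instead of an owned list.
struct BranchRecord {
    labels_at_branch: Range<usize>,
}

impl SharedLabels {
    /// Bind a label at the current tail.
    fn bind_at_tail(&mut self, label: u32) {
        self.bound.push(label);
        self.at_tail.end = self.bound.len();
    }

    /// Append a branch: remember which labels pointed at its start,
    /// then start an empty tail range just past the branch.
    fn start_branch(&mut self) -> BranchRecord {
        let rec = BranchRecord { labels_at_branch: self.at_tail.clone() };
        let end = self.bound.len();
        self.at_tail = end..end;
        rec
    }

    /// Chomp the branch: its labels alias the tail again. Under the
    /// contiguity assumption this is O(1), with no list copying.
    fn chomp(&mut self, rec: BranchRecord) {
        debug_assert_eq!(rec.labels_at_branch.end, self.at_tail.start);
        self.at_tail = rec.labels_at_branch.start..self.at_tail.end;
    }

    /// View of the labels currently aliasing the tail.
    fn labels_at_tail(&self) -> &[u32] {
        &self.bound[self.at_tail.clone()]
    }
}
```

In this toy, a long run of "goto next; next:" only widens `at_tail` by one per step. A real implementation would still have to update label offsets on each chomp, which this sketch leaves out.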

For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; if it does,
we will leave 1 out of every 100 such branches un-optimized in a long
sequence of thousands.
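
To make the cost concrete, here is a toy model of the behavior described above. It is a simplified illustration under stated assumptions, not the `MachBuffer` code: `ToyBuffer`, `emit_goto_next`, `simplify`, `run`, and the 4-byte `BRANCH_SIZE` are all hypothetical, and the constant name `LABEL_LIST_THRESHOLD` is only assumed to correspond to the "100" mentioned in the text. The model counts how many label entries get shuffled between the branch record and the labels-at-tail list, with and without the cap:

```rust
/// Assumed cap on the alias list (the "100" from the text).
const LABEL_LIST_THRESHOLD: usize = 100;
/// Arbitrary branch size for this toy.
const BRANCH_SIZE: usize = 4;

struct Branch {
    start: usize,
    target: usize,
    labels_at_this_branch: Vec<usize>,
}

struct ToyBuffer {
    len: usize,                 // current tail offset
    label_offsets: Vec<usize>,  // offset of each bound label
    labels_at_tail: Vec<usize>, // labels aliasing the tail
    threshold: Option<usize>,   // None = old behavior, never give up
    label_moves: usize,         // label entries shuffled so far
    kept_branches: usize,       // branches left un-optimized
}

impl ToyBuffer {
    fn new(threshold: Option<usize>) -> Self {
        ToyBuffer {
            len: 0,
            label_offsets: Vec::new(),
            labels_at_tail: Vec::new(),
            threshold,
            label_moves: 0,
            kept_branches: 0,
        }
    }

    /// Emit "goto next; next:" and immediately try to simplify it.
    fn emit_goto_next(&mut self) {
        let start = self.len;
        // The branch record takes over the labels at its start offset.
        let labels_at_this_branch = std::mem::take(&mut self.labels_at_tail);
        self.label_moves += labels_at_this_branch.len();
        self.len += BRANCH_SIZE;
        // Bind the target label at the new tail.
        let target = self.label_offsets.len();
        self.label_offsets.push(self.len);
        self.labels_at_tail.push(target);
        self.simplify(Branch { start, target, labels_at_this_branch });
    }

    fn simplify(&mut self, b: Branch) {
        // Giving up is always valid: the branch simply stays in the buffer.
        let too_long = self
            .threshold
            .map_or(false, |t| b.labels_at_this_branch.len() > t);
        if too_long || self.label_offsets[b.target] != self.len {
            self.kept_branches += 1;
            return;
        }
        // Chomp: truncate to the branch start and re-alias all labels.
        self.len = b.start;
        for &l in &self.labels_at_tail {
            self.label_offsets[l] = b.start;
        }
        self.label_moves += self.labels_at_tail.len() + b.labels_at_this_branch.len();
        self.labels_at_tail.extend(b.labels_at_this_branch);
    }
}

/// Run `n` back-to-back no-op branches; returns (label moves, kept branches).
fn run(n: usize, threshold: Option<usize>) -> (usize, usize) {
    let mut buf = ToyBuffer::new(threshold);
    for _ in 0..n {
        buf.emit_goto_next();
    }
    (buf.label_moves, buf.kept_branches)
}
```

Calling `run(n, None)` shows the move count growing quadratically in `n`, while `run(n, Some(LABEL_LIST_THRESHOLD))` keeps it linear at the price of leaving roughly one branch in a hundred in place.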

This takes total compilation time down on my machine from ~300ms to
~72ms for the foo.wasm case in #3441. For reference, the old backend
(now removed), built from arbitrarily-chosen-1-year-old commit
c7fcc344, takes 158ms, so we're ~twice as fast, which is what I would
expect.

(This PR also switches a few statics to consts just above where I added
a const, as a drive-by change.)

Fixes bytecodealliance#3468.

cranelift/codegen/src/machinst/buffer.rs

alexcrichton

I don't know enough about MachBuffer per se to review this in isolation, but I trust you and your knowledge of MachBuffer, and that the invariants of MachBuffer are checked thoroughly enough internally, so this seems fine by me. Thanks for taking a look at the performance here!

Out of further curiosity, even 70ms for a function like this seems somewhat high. Is that still due to MachBuffer things, or is it the general "too much elbow grease is needed to bring that down further"?

cfallin · 2021-10-21T20:35:44Z

> Out of further curiosity, even 70ms for a function like this seems somewhat high. Is that still due to MachBuffer things, or is it the general "too much elbow grease is needed to bring that down further"?

I think it is mostly in the middle-end (analyses and optimizations), which will see the huge CFG with all the loops before it's reduced. The backend stages that are specifically broken out in the clif-util wasm -T output show: 3ms in CLIF -> VCode lowering; 4ms in regalloc; and 4ms in binary emission (MachBuffer + CPU-specific instruction encoding code). So only 11ms (EDIT: I can add, I promise) in the "backend" and the rest in attempted optimization.

A perf profile of the compilation shows a lot of time in the kernel's pagefault path, so I think that just writing out the data structures has some overhead (for the large function body). I imagine we could be smarter about early optimizations that cut down the amount of work the later stages have to do, but I don't think anything obviously or anomalously bad is happening here.

cfallin requested a review from alexcrichton October 21, 2021 19:17

bjorn3 reviewed Oct 21, 2021


github-actions bot added the cranelift and cranelift:area:machinst labels Oct 21, 2021

alexcrichton approved these changes Oct 21, 2021


cfallin merged commit 54896ac into bytecodealliance:main Oct 21, 2021

cfallin deleted the machbuffer-quadratic-labels branch October 21, 2021 20:37
