
private_append optimization for binaries #6804

Merged: 11 commits merged into erlang:master from frej/private-append on Feb 10, 2023

Conversation

@frej (Contributor, Author) commented Feb 6, 2023

When a binary is grown by appending data to it using bs_create_bin,
a considerable performance improvement can be achieved if the append
can be done using the destructive private_append instead of
append. Using the private_append flavor of bs_create_bin is only
possible when the binary being extended has been created by
bs_writable_binary or by another bs_create_bin using
private_append. As private_append is destructive, an additional
requirement is that there is only one live reference to the binary
being extended.

This set of patches implements the compiler optimization described above together with its dependencies.
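
For illustration, a minimal sketch (not taken from the PR) of the kind of accumulator loop this optimization targets; the accumulator starts as <<>> and is always the leftmost fragment of the construction:

encode_bytes(Bytes) ->
    encode_bytes(Bytes, <<>>).

encode_bytes([B|T], Acc) ->
    %% Acc is the leftmost fragment and has a single live reference,
    %% so this append can be rewritten to private_append.
    encode_bytes(T, <<Acc/binary, B:8>>);
encode_bytes([], Acc) ->
    Acc.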

github-actions bot commented Feb 6, 2023

CT Test Results

2 files · 295 suites · 12m 10s ⏱️
761 tests: 757 passed ✔️, 2 skipped 💤, 2 failed
4 708 runs: 4 702 passed ✔️, 2 skipped 💤, 4 failed

For more details on these failures, see this check.

Results for commit 858fff4.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.


// Erlang/OTP Github Action Bot

@jhogberg jhogberg requested a review from bjorng February 6, 2023 16:39
@jhogberg jhogberg added the team:VM (Assigned to OTP team VM) and enhancement labels Feb 6, 2023
@bjorng bjorng self-assigned this Feb 7, 2023
bs_test_tail ->
    SS1;
build_stacktrace ->
    SS1; % XXXX
Contributor:

What does XXXX mean? If this line still needs attention, please change it to something clearer, such as TODO: ....

Contributor (Author):

It means I should have done a git grep XXXX before pushing :) I use it as a marker for things I should take a second look at.

At this particular location my thinking is that build_stacktrace is special in that it is only executed when we're exiting, and additionally its only argument isn't a term which can contain a binary. I could do an aa_set_aliased([Dst|Args], SS1), but I think that is redundant. Correct? I added a comment about that.

Contributor (Author):

Any verdict on build_stacktrace @bjorng ? The updated comment is in 0af6c67.

Contributor:

build_stacktrace is not only used when exiting. The code that follows build_stacktrace can do anything, for example print the stack trace to a log file and continue executing.

The argument can potentially contain binaries:

/*
* Building a symbolic representation of a saved stack trace. Note that
* the exception object 'exc', unless NIL, points to a cons cell which
* holds the given args and the quick-saved data (encoded as a bignum).
*
* If the bignum is negative, the given args is a complete stacktrace.
*/
Eterm
build_stacktrace(Process* c_p, Eterm exc) {

Contributor (Author):

So it's even worse, anything live at this point can be captured and aliased.

Contributor:

Sort of: a stack trace (currently) only contains the arguments of the operation that crashed, so we only need to consider the arguments of all operations that can crash into the current exception handler.

I'm not sure that tracking it would be worth it though, so maybe the optimization should just give up when encountering these. I don't think that doing so would cost too much since it's pretty rare to use try/catch with captured stack traces in the middle of a hot loop that appends binaries.

Contributor (Author):

That's not too bad. At least it can't alias terms which are not known when the analysis encounters build_stacktrace. Analyzing exactly which operations can crash into the handler sounds complicated, but just aliasing everything is quite simple. Hopefully users will wrap the loop with the try/catch and not have it in the body. For example (with the fix in 63fd18d):

stacktrace() ->
    X = ..., % something unique
    try
        ex:code_that_fails(),
        X % still unique
    catch
        _:_:Stacktrace ->
            io:format("Stacktrace: ~p~n", [Stacktrace]),
            X % now conservatively aliased
    end.

Implementing this correction does not remove any private-append transforms in the set of files compiled by diffable.

Are there any plans to include more in the stack trace, i.e. arguments to functions higher up the stack? That would require the stack-crawler (or whatever it is called) to deep-copy terms.

AlreadyFound = maps:get(Fun, Found0, []),
Found = Found0#{Fun => [{append,Dst,Var}|AlreadyFound]},
find_appends_is(Is, Fun, Found);
_ ->
Contributor:

Nitpick:

Suggested change:
-    _ ->
+    false ->

Contributor (Author):

Fixed.

@bjorng (Contributor) commented Feb 7, 2023

Some benchmark results on my Intel iMac:

Erlang/OTP 25.2:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 16307 ms: 61 it/sec
fun base64:decode/1: 1000 iterations in 18402 ms: 54 it/sec

Current master branch:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 7832 ms: 127 it/sec
fun base64:decode/1: 1000 iterations in 10878 ms: 91 it/sec

This PR:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 5555 ms: 180 it/sec
fun base64:decode/1: 1000 iterations in 6433 ms: 155 it/sec

That makes both encode and decode of Base64 almost three times as fast compared to OTP 25.

Comment on lines 525 to 528
aa_set_status(#b_local{}, _Status, State) ->
State;
aa_set_status(#b_remote{}, _Status, State) ->
State;
Contributor:

Not covered by our tests. Can this happen? Is there a test case missing?

Contributor (Author):

I can't create an example which covers it, so away it goes! (fix in beb1977)

Comment on lines 486 to 492
{{hd,H},{tl,T}} ->
{pair,H,T};
{{tl,T},{hd,H}} ->
{pair,H,T};
Contributor:

Not covered.

Contributor (Author):

Currently the type analyzer does not handle the case when you "misuse" cons cells. For example:

b(L) ->
    b(L, [<<>>|<<>>]).

b([H|T], [A|B]) ->
    b([H|T], [<<A/binary,H:8>>|<<B/binary,H:8>>]);
b([], Acc) ->
    Acc.

Here the type analyzer will conclude that the type of the accumulator is nonempty_improper_list(bitstring(8,appendable), bitstring(8)) and not nonempty_improper_list(bitstring(256,appendable), bitstring(256,appendable)) which would be expected. If the type analysis pass is fixed so both fields of the cons are transformed by private append, we will hit this code with the example above.

Apart from a couple of lines dealing with NIFs (I will fix that), all other non-covered lines in beam_ssa_private_append.erl are due to this type analysis fault, as we never try to transform pairs of binaries.

When I last looked at this, during the autumn, I seem to recall that the type was less specific, i.e. nonempty_improper_list(bitstring(8,appendable), any), so maybe this can now be fixed. I'll have a look.

Contributor (Author) @frej commented Feb 9, 2023:

This is now fixed in d1211f5; beam_ssa_type:type/5 had problems handling bs_create_bin when the first fragment was typed as a #t_union{} containing an appendable bitstring. The same error was in beam_ssa_private_append:is_appendable/2 (fixed in 6952cb6).

Additionally, 6952cb6 adds a test which exercises the hd/tl lines above [not the case; the code-coverage percentage is correct], thus getting the coverage to 99% (the remaining non-covered lines are due to the NIF problem mentioned below).

Contributor (Author):

{{tl,T},{hd,H}} never happens: patches are stored in an orddict, so only {{hd,H},{tl,T}} can occur. The non-reachable code is removed in b33d99e.
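
For reference, a minimal shell session (illustrative, not from the PR) showing why the hd patch always comes first: an orddict keeps its entries sorted in term order, and the atom hd sorts before tl:

1> orddict:to_list(orddict:from_list([{tl,t}, {hd,h}])).
[{hd,h},{tl,t}]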

@michalmuskala (Contributor) commented:

This is an awesome improvement! The results for base64 are really impressive.

Is it possible to "pre-allocate" the binary? In many cases, like base64, you actually know exactly how big the result is going to be - it would be great if there were a way to communicate this information to the runtime system. This could probably improve the results even further.

@bjorng (Contributor) commented Feb 7, 2023

Added to our daily builds along with an updated primary bootstrap.

@bjorng (Contributor) commented Feb 7, 2023

Is it possible to "pre-allocate" the binary?

Yes, it would be possible, and it is already done for binary comprehensions for which the exact size can be calculated beforehand. This PR does not attempt to do any pre-allocation (the existing analysis for pre-allocation used for binary comprehensions is not general enough), but it is something we will consider doing in OTP 27.
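
For example, in a comprehension such as the following the result is exactly one byte per element, so the total size can be computed before any element is built (illustrative sketch, not from the PR):

bytes_to_bin(List) ->
    %% The result size is exactly length(List) bytes, so the runtime can
    %% allocate the whole result binary up front.
    << <<B:8>> || B <- List >>.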

@michalmuskala (Contributor) commented:

I don't think it even needs to be fully automated. I think giving control to the user would also work. Additionally, I don't think getting the size "wrong" would lead to any unsafety - the binary will either be re-allocated, or just remain with some unused bytes at the end, wasting some memory. In particular, an API like binary:empty_appendable(100) could be useful.

There are probably cases where automated analysis wouldn't be useful, but optimising for the most common case could be extremely beneficial. For example, for JSON, when encoding a string we know the result will be at least as long as the input string plus some padding (some implementations have heuristics for that, e.g. over-allocate by 10%), so we can start the allocation much larger and avoid a couple of rounds of re-allocations, even if the final result is a little off.
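
A sketch of that heuristic, assuming the hypothetical binary:empty_appendable/1 proposed above (the escape rules are simplified for illustration):

encode_string(Str) when is_binary(Str) ->
    %% Over-allocate by ~10% plus the two surrounding quotes, so that
    %% most strings are encoded without any round of re-allocation.
    Hint = byte_size(Str) + byte_size(Str) div 10 + 2,
    Acc = binary:empty_appendable(Hint), %% hypothetical API, see above
    escape(Str, <<Acc/binary, $">>).

escape(<<$", T/binary>>, Acc) -> escape(T, <<Acc/binary, $\\, $">>);
escape(<<C, T/binary>>, Acc)  -> escape(T, <<Acc/binary, C>>);
escape(<<>>, Acc)             -> <<Acc/binary, $">>.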

@max-au (Contributor) commented Feb 7, 2023

Binary preallocation is really useful for protocol serialisation. For example, protobuf and Thrift could be sped up significantly, and avoid cumbersome double transformation (term -> iolist -> binary).

@bjorng (Contributor) commented Feb 8, 2023

I don't think it even needs to be fully automated. I think giving control to the user would also work. Additionally, I don't think getting the size "wrong" would lead to any unsafety - the binary will either be re-allocated, or just remain with some unused bytes at the end, wasting some memory.

Yes, getting the size wrong is safe. The compiler's size calculation is extremely conservative to avoid allocating a huge binary that is not actually needed, not because it would be unsafe.

In particular, an API like binary:empty_appendable(100) could be useful.

Yes, that would work.

@garazdawi (Contributor) commented:

In particular, an API like binary:empty_appendable(100) could be useful.

list_to_binary(List) ->
    %% Creates an empty binary with a size hint (hypothetical syntax).
    list_to_binary(List, <<length(List)/size-hint>>).
list_to_binary([H|T], Acc) ->
    %% Keep the accumulator as the leftmost fragment so each step is an
    %% append and the bytes come out in list order.
    list_to_binary(T, <<Acc/binary,H>>);
list_to_binary([], Acc) ->
    Acc.

@frej frej force-pushed the frej/private-append branch 2 times, most recently from 27623e8 to d8cfd8f Compare February 8, 2023 13:24
@frej (Contributor, Author) commented Feb 8, 2023

I have pushed a squashed version of all the fixups on fixups. If you want to pull this branch for nightly testing, @bjorng, now is a good time.

@bjorng (Contributor) commented Feb 8, 2023

Added the updated branch along with an updated bootstrap to our daily builds.

@frej (Contributor, Author) commented Feb 9, 2023

While working to increase the code coverage in beam_ssa_alias.erl, I (re-)discovered that for modules containing NIFs, beam_ssa_opt:module/2 completely gives up on creating a function-db for all functions, not just the ones calling NIFs. As beam_ssa_alias.erl is written to gracefully handle NIFs but needs the caller/callee information provided by the function-db, this of course leads to some dead code (the predicates …), as we won't hit the code paths where we deal with partial information.

Modifying beam_ssa_opt to build a partial function-db seems dangerous less than two days before code freeze, so (@bjorng, @jhogberg) should I excise the code dealing with NIFs in this PR, or should I leave it in? As far as I can tell, this is two lines of code, and I would be replacing them with TODO comments.

@bjorng (Contributor) commented Feb 9, 2023

Don't worry about the dead code for this release candidate.

Can you squash the commits again? I will add it to our daily builds and we plan to merge tomorrow.

@frej (Contributor, Author) commented Feb 9, 2023

Can you squash the commits again? I will add it to our daily builds and we plan to merge tomorrow.

Will do within an hour or so, right now I'm in a meeting (where I'm supposed to be presenting).

@frej frej force-pushed the frej/private-append branch 2 times, most recently from 7d9a6cb to b33d99e Compare February 9, 2023 14:01
@frej (Contributor, Author) commented Feb 9, 2023

It took a bit longer, but it is now squashed and pushed, @bjorng.

@bjorng (Contributor) commented Feb 9, 2023

Thanks! Added to our daily builds.

When a bitstring is used as an accumulator and is being grown by
appending new data to it, the runtime system can make the append
operation very efficient if it is guaranteed that the bitstring being
extended (see the sketch after this list):

1) Originated as a `<<>>`.

2) The append operation is done with `bs_create_bin` with the
   accumulator as the leftmost fragment.

3) The append does not change the unit of the bitstring.

4) There is only one live pointer to the accumulator on the heap or in
   VM registers.
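
A sketch (not from the PR) of a loop that violates condition 2: the accumulator is the rightmost fragment, so each step is a prepend rather than an append, and the optimization cannot apply:

rev_bytes([B|T], Acc) ->
    %% Acc is not the leftmost fragment: B is prepended, so the
    %% accumulator does not stay appendable.
    rev_bytes(T, <<B:8, Acc/binary>>);
rev_bytes([], Acc) ->
    Acc.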

This patch extends the type analysis pass to track 1) and 2). This is
done by extending the `t_bitstring` record with an `appendable` flag
which is updated alongside the `unit` field. The basic principle is
that `<<>>` literals are constructed with the flag set and operations
which capture a pointer to the binary clear it. The functions to
calculate the least upper bound and greatest lower bound are also
updated to account for the appendable status.
Avoid any assumptions on the number and kind of annotations before a
call. Instead just track the last seen result type annotation and
apply it when calls are found.
Annotate ret instructions with the type they return; skip the
annotation if nothing is yet known about the variable. The annotation
will be used by the alias analysis pass.
Make the SSA pretty printer print annotations on terminators. This
helps with debugging optimizations which depend on annotated
terminators.
For debugging in type-dependent optimization passes it is useful to
have a type pretty printer available in order to generate readable
trace and debug printouts.
This commit inserts an empty analysis pass which will eventually be
extended to do a full alias analysis.
Implement kill-set calculation as a prerequisite for alias analysis.
Calculate the killset for all functions. A killset allows us to look
up the variables which die at a code location.
Perform an alias analysis of all reachable functions, alias
information is added as annotations on the SSA code.

Alias analysis is done by an algorithm inspired by Kotzmann and
Mössenböck's 2005 algorithm for Escape Analysis
(https://www.usenix.org/events/vee05/full_papers/p111-kotzmann.pdf),
particularly their escape equivalent sets. But in contrast to
Kotzmann and Mössenböck, instead of just tracking escaping values
we track if a value in a variable is unique and/or aliased.

A variable is said to be unique if it currently is the only live
variable pointing to a particular term on the heap. Literals and
non-boxed terms are considered unique.

A variable is said to be aliased if it points to a heap term which can
be reached by other means than the boxed pointer in the variable.
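
An illustrative sketch (not from the PR) of the two notions:

f(X) ->
    A = [X],    %% unique: A holds the only live reference to the cons cell
    T = {A, A}, %% aliased: the cons cell is now also reachable through T
    {A, T}.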

The alias analysis is performed by traversing the functions in the
module and their code. For each operation the uniqueness and alias
status are updated. The unique/aliased status is maintained in a map
which maps a variable to either a status or another variable, thus
constructing equivalent sets in the same way as Kotzmann and
Mössenböck.

When the analysis finishes each instruction is annotated with
information about which of its arguments are unique or aliased.
Extend the BEAM SSA pretty printer to pretty-print annotations
regarding alias and unique status.
When a binary is grown by appending data to it using `bs_create_bin`,
a considerable performance improvement can be achieved if the append
can be done using the destructive `private_append` instead of
`append`. Using the `private_append` flavor of `bs_create_bin` is only
possible when the binary being extended has been created by
`bs_writable_binary` or by another `bs_create_bin` using
`private_append`. As `private_append` is destructive, an additional
requirement is that there is only one live reference to the binary
being extended.

This commit implements a new SSA optimization pass which finds
suitable code sequences which iteratively grow a binary starting
with `<<>>` and rewrites them to start with an initial value created
by `bs_writable_binary` and to use `private_append` for `bs_create_bin`.
@frej (Contributor, Author) commented Feb 10, 2023

Looking at other recently pushed PRs, it looks like I'm not the only one who gets tripped up by the "Test Erlang/OTP (system)" step of the CI, so unless someone tells me differently, I will consider the CI failure not-my-fault™.

(@bjorng and @jhogberg, you have my phone number in an email if you need my urgent attention this afternoon)

@bjorng bjorng merged commit a092c49 into erlang:master Feb 10, 2023
@frej frej deleted the frej/private-append branch February 13, 2023 11:16