
Rewrite BSM optimizations in the new SSA-based intermediate format #1958

Merged — 9 commits into erlang:master on Sep 28, 2018

Conversation

jhogberg
Contributor

This PR improves the bit-syntax match optimization pass, leveraging the new SSA intermediate format to perform much more aggressive optimizations.

While mostly done, this is still a work in progress, and I'm posting it a bit early because I need help adding HiPE support for a few new instructions. @kostis, @margnus1: could you lend me a hand?

  • bs_start_match3/4: as bs_start_match2, but without "position slots."
  • bs_get_position/3: stores the current position in a register as an integer.
  • bs_set_position/2: restores the position from a register.
  • bs_get_tail/3: returns the tail of a context at its current position, similar to bs_context_to_binary but non-destructive.
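To make the semantics concrete, here is a small Erlang sketch (module and function names are purely illustrative, not the real BEAM instructions) that models a match context as a {Binary, BitPosition} pair:

```erlang
-module(bsm_pos_sketch).
-export([demo/0]).

%% Illustrative model only: a match context is a {Bin, BitPos} pair.
%% get_position reads the position into a plain term ...
get_position({_Bin, Pos}) -> Pos.

%% ... set_position restores it, replacing the old "slot" mechanism ...
set_position({Bin, _Old}, Pos) -> {Bin, Pos}.

%% ... and get_tail returns the rest of the binary from the current
%% position without invalidating the context (unlike bs_context_to_binary).
get_tail({Bin, Pos}) ->
    <<_:Pos/bits, Tail/bits>> = Bin,
    Tail.

demo() ->
    Ctx0 = {<<1,2,3>>, 0},
    Saved = get_position(Ctx0),       %% save a backtrack position (0)
    Ctx1 = set_position(Ctx0, 8),     %% advance past the first byte
    Ctx2 = set_position(Ctx1, Saved), %% backtrack to the saved position
    {get_tail(Ctx1), get_tail(Ctx2)}. %% {<<2,3>>, <<1,2,3>>}
```

Because the saved position is an ordinary integer in a register, it can live in Y registers or be passed around freely, which is what removes the old "slot" limitations.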

Some highlights:

Sub-binary creation can now be delayed within function bodies

Sub-binaries aren't created until they're actually used as binaries, making it significantly easier to write optimized happy paths. The following code used to disable reuse altogether:

    delayed(<<A, Rest/binary>>) ->
        case bar(A) of
            %% Match context will be reused here.
            true -> delayed(Rest);
            %% Sub-binary for 'Rest' will be created here.
            false -> Rest
        end.

This also applies when an alias is taken for the entire match:

    aliased(<<A, Rest/binary>>=Alias) ->
        case bar(A) of
            %% Match context will be reused here.
            true -> aliased(Rest);
            %% Sub-binary for 'Alias' will be created here.
            false -> Alias
        end.

The cost of backtracking is significantly reduced

Match contexts only keep track of their current position, and backtrack positions are now stored in plain registers instead of "slots" on the context. This means that passing them to another function or using them near a try block will no longer disable reuse, improving performance for modules that do a lot of backtracking.

    %% Example taken from the uri_string module
    parse_hier(<<"//"/utf8, Rest/binary>>, URI) ->
        try parse_userinfo(Rest, URI) of
            {T, URI1} ->
                %% Match context will be reused here.
                Userinfo = calculate_parsed_userinfo(Rest, T),
                %% Sub-binary will be created here.
                {Rest, URI1#{userinfo => Userinfo}}
        catch
            throw:{_,_,_} ->
                %% Match context will be reused here.
                {T, URI1} = parse_host(Rest, URI),
                Host = calculate_parsed_host_port(Rest, T),
                %% Sub-binary will be created here.
                {Rest, URI1#{host => remove_brackets(Host)}}
        end;

As an added benefit this also avoids creating a new match context when a function needs more slots than the context it was given, which was a very surprising (but thankfully uncommon) performance sink.

Improved context reuse within function bodies

All matches operating on the same binary are combined into a single match sequence, greatly reducing the number of match contexts created.

    %% Example taken from the string module
    bin_search_str_1(Bin0, Start, First, SearchCPs) ->
        <<_:Start/binary, Bin/binary>> = Bin0,
        case binary:match(Bin, First) of
            nomatch -> {nomatch, byte_size(Bin0), []};
            {Where0, _} ->
                Where = Start+Where0,
                %% We already started a match on Bin0, so its context
                %% gets reused
                <<Keep:Where/binary, Cs0/binary>> = Bin0,
                case prefix_1(Cs0, SearchCPs) of
                    nomatch ->
                        %% This is a continuation on the match on Bin0, so we
                        %% reuse it here too. Note that this happens even though
                        %% it was passed to prefix_1 above.
                        <<_/utf8, Cs/binary>> = Cs0,
                        KeepSz = byte_size(Bin0) - byte_size(Cs),
                        %% ... and of course it's reused here as well.
                        bin_search_str_1(Bin0, KeepSz, First, SearchCPs);
                    [] ->
                        {Keep, [Cs0], <<>>};
                    Rest ->
                        {Keep, [Cs0], Rest}
                end
        end.

Wrappers no longer disable context reuse

Wrapper functions used to disable reuse in even the simplest cases, but no longer:

    example(<<0,1,2,Rest/binary>>) ->
        %% Match context will be reused here.
        sum_bytes(Rest).

    sum_bytes(Bin) ->
        sum_bytes_1(Bin, 0).
    sum_bytes_1(<<N, Rest/binary>>, Acc) ->
        sum_bytes_1(Rest, Acc + N);
    sum_bytes_1(<<>>, Acc) ->
        Acc.

@jhogberg jhogberg added team:VM Assigned to OTP team VM enhancement labels Sep 19, 2018
@jhogberg jhogberg self-assigned this Sep 19, 2018
@jhogberg jhogberg force-pushed the john/compiler/ssa-bsm-opt branch 2 times, most recently from 6e56fcb to 671119e Compare September 19, 2018 11:42
The upcoming beam_ssa_bsm pass allows match contexts to be used
across function calls that take said context as an argument, which
means it's fairly common for them to end up in Y registers.
The upcoming beam_ssa_bsm pass makes this redundant.
  • get_anno/3: as get_anno, but with a default value
  • definitions/1-2: returns a map of variable definitions (#b_set{})
  • uses/1-2: returns a map of all uses of a given variable
  • mapfold_blocks_rpo/4: mapfolds over blocks
@jhogberg
Contributor Author

jhogberg commented Sep 26, 2018

I've fixed a performance regression on 32-bit platforms and cleaned up a few cosmetic issues. It doesn't look like there's much left to do so I'll merge it on Friday if nothing new pops up. HiPE is still not fixed but there's plenty of time until OTP 22.

I've also got some benchmarks now. The old pass did a pretty good job and a lot of our code was written with its limitations in mind (hurting readability in some cases), so most of the benchmarks are unchanged, but the ones I set out to improve have seen substantial gains.

The blue line is this PR and the green line is maint:

[benchmark graphs: blue = this PR, green = maint]

@kostis
Contributor

kostis commented Sep 26, 2018

I would have assumed that there would be a line also for master, in order to also see the baseline of this PR -- and the performance improvement it offers.

Also, since presumably you have an easy way to create these graphs, perhaps it would be a good idea to include HiPE on maint (and maybe on master if it works on these two programs) in order to have points of reference for that too.

It's a super-busy period right now, but today we briefly talked with @margnus1 about at least providing some guidance to what parts of HiPE need to be changed and how. Let's see whether we will be able to do this next week...

@jhogberg
Contributor Author

jhogberg commented Sep 27, 2018

> I would have assumed that there would be a line also for master, in order to also see the baseline of this PR -- and the performance improvement it offers.
>
> Also, since presumably you have an easy way to create these graphs, perhaps it would be a good idea to include HiPE on maint (and maybe on master if it works on these two programs) in order to have points of reference for that too.

The graphs were taken from the nightly builds and are of maint+PRs and master+PRs; there were no other PRs in master that touched related parts, so I simplified the blue line as "this PR." I've included historical benchmarks below, and it's pretty easy to spot when the PR was included:

[historical benchmark graphs]

You can also see that master was a lot slower than maint. This was mainly caused by the SSA rewrite changing how the beam code looks, which confused the old pass a bit. Practically all benchmarks related to binary matching have been improved by this PR compared to master, but compared to maint it's roughly unchanged aside from the above benchmarks.

This is a bit unfortunate. Your code no longer has to be arcane to be fast (or at least less arcane), but our benchmarks won't show that because they tend to look at things we've written with the old pass in mind.

> It's a super-busy period right now, but today we briefly talked with @margnus1 about at least providing some guidance to what parts of HiPE need to be changed and how. Let's see whether we will be able to do this next week...

Sounds good.

jhogberg and others added 6 commits September 28, 2018 11:31
This commit improves the bit-syntax match optimization pass,
leveraging the new SSA intermediate format to perform much more
aggressive optimizations. Some highlights:

* Match contexts can be reused even after being passed to a
  function or being used in a try block.
* Sub-binaries are no longer eagerly extracted, making it far
  easier to keep "happy paths" free from binary creation.
* Trivial wrapper functions no longer disable context reuse.
The beam_ssa_bsm pass welds chained matches together, but the match
expressions themselves are unchanged, and if there's a tail
alignment check it will be done each time. This subpass figures out
the checks we've already done and deletes the redundant ones.
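As a hypothetical illustration (not taken from the commit): both function heads below match tails of the same binary, so once the pass welds them into one match sequence, the alignment check implied by the inner `Rest/binary` tail is already known to hold and can be deleted.

```erlang
-module(weld_sketch).
-export([outer/1]).

%% Hypothetical example. `Rest/binary` requires the tail to be a whole
%% number of bytes; once outer/1 has verified that, repeating the check
%% in inner/2 on the same (welded) match sequence is redundant.
outer(<<A:8, Rest/binary>>) -> inner(A, Rest).

inner(A, <<B:8, Rest/binary>>) -> {A, B, Rest};
inner(A, <<>>) -> {A, done}.
```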
Remove the variable aliasing support that was needed for the
old beam_bsm pass.
This has been superseded by bs_get_tail/3. Note that it is NOT
removed from the emulator or beam_disasm, as old modules are still
legal.
@jhogberg jhogberg merged commit 7d941c5 into erlang:master Sep 28, 2018
@jhogberg
Contributor Author

jhogberg commented Oct 1, 2018

I removed the tests because we already know that they fail hard and we don't want to litter our test result page with them. git rm was the simplest way to make sure they're completely out of the picture.

This is a temporary measure and we always planned to revert it in the branch that reintroduces support, I didn't mention it in the commit message because I figured it was obvious.

In either case I hope this won't remain a problem for long; the faster I get help, the faster this gets resolved.

@kostis
Contributor

kostis commented Oct 1, 2018

Please try to understand what I am complaining about.

The commit that removed these tests (afa36d2) is actually the same that suggested to use the option no_bsm3 to run programs that contain binary matching in native code. How do we know/ensure that this option works (and will continue to work)? Surely not by git removing all tests that exercise the native code compiler on programs containing binary matching, even if that "is the simplest way to do things."

So, independently of whether the HiPE compiler is fixed to work for the new instructions, there needs to be a way to test that the no_bsm3 / r21 path to generating native code works. I kindly ask you to revert this commit now and restore these tests with a no_bsm3 compiler option for them.

@bjorng
Contributor

bjorng commented Oct 1, 2018

The r21 and no_bsm3 options are totally unsupported. The only reason for them being there in the first place is for some test suites that test backward compatibility to previous releases. It was our intention that the single mention of no_bsm3 should be reverted well before the release of OTP 22. The change to the documentation was only intended for the heroes that test out the master branch before the release of OTP 22.

Therefore, we have no interest in having the hipe compiler in OTP 22 handle code that has been compiled with no_bsm3. Unless you have a good reason otherwise, I suggest that you remove the support for the old binary matching instructions in master and only implement support for the new ones.

Also, note that there is at least one test case in hipe_SUITE that will crash the emulator if the old binary optimizations are disabled. (The one I mailed about earlier; see https://gist.github.com/bjorng/3666b4133b0d242ddcf1aa0626a67178.) There could be even more problematic test cases, because no_bsm3 totally disables the old binary match optimizations. So if we reverted the commit now, we would have to add an additional commit to disable at least one test case, and we would be testing a configuration that is never intended to be supported or released in OTP 22. I don't really see the point of doing that.

My suggestion (unless you have very good reasons otherwise), is that you revert the commit yourself as the first thing in your branch that implements the support for the new binary matching instructions.
