
Undesired(?) slow down in ssa_opt_alias #7432

Closed
josevalim opened this issue Jun 22, 2023 · 11 comments

@josevalim
Contributor

Describe the bug

Take the Erlang file from this gist: https://gist.github.com/josevalim/c59e1d706daf6e863f6f2745a04fb810

Now compile it with time erlc +time Elixir.HelloTest.erl. On my machine, it takes 0.430s and ssa_opt_alias reports:

ssa_opt_alias              :      0.111 s  67 %

Now search for whatever:a( (on lines 150 and 183) and replace it with just a( (i.e. convert the remote call into a local call). Compilation now runs much faster, at 0.270s, where:

ssa_opt_alias              :      0.002 s  21 %

This slowdown does not happen on Erlang/OTP 25. From an external glance, I wouldn't expect swapping a local call for a remote one to cause such a slowdown, but I may be wrong.

Also note that the larger the code, the slower it gets. :)

Affected versions

Erlang/OTP 26.

Additional context

Originally reported here: elixir-lang/elixir#12696

@josevalim josevalim added the bug Issue is reported as a bug label Jun 22, 2023
@bjorng bjorng added the team:VM Assigned to OTP team VM label Jun 26, 2023
@jhogberg jhogberg self-assigned this Jun 26, 2023
@josevalim
Contributor Author

josevalim commented Jun 28, 2023

Here is another file whose compilation went from a dozen seconds on Erlang/OTP 25 to minutes on Erlang/OTP 26: https://gist.github.com/josevalim/94bfd65e8eaf892ed700df349838796a

It is an Elixir file, but I can convert it to Erlang if necessary. I suspect it is exactly the same issue, so the file in the issue description should be enough. I will edit this comment once profiling finishes.

EDIT: Yup, also ssa_opt_alias:

 beam_ssa_opt                  :    193.316 s    9243.5 kB
    %% Sub passes of beam_ssa_opt from slowest to fastest:
    ssa_opt_alias              :    187.968 s  97 %

@frej
Contributor

frej commented Jun 28, 2023

@jhogberg, I missed this the first time around. I will take a look.

@frej
Contributor

frej commented Jun 28, 2023

From an external glance, I wouldn't expect swapping a local call for a remote one to cause such a slowdown, but I may be wrong.

@josevalim, that's not strange; the alias analysis pass is fairly costly. When you change whatever:a( to a( in the first example, the compiler is smart enough to notice that anything after the first a( is unreachable (as noop doesn't match {{ok, _subbaz@1}, [_event1@1]}) and will therefore remove it as dead code long before it reaches the alias analysis. If you run erlc -S you'll see that the fast variant produces only around 1/6 of the code.
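
To make that concrete, here is a minimal made-up example (not the gist code, and simplified compared to what the type pass actually proves): with a local call the compiler can infer the callee's return type within the module, while the return type of a remote call is unknown at compile time.

-module(deadcode_sketch).
-export([local/0, remote/0]).

a() -> noop.

%% Local call: the compiler can infer that a/0 returns the atom noop, so the
%% {ok, _} clause can never match and is typically removed as dead code
%% before the alias analysis ever sees it.
local() ->
    case a() of
        {ok, X} -> handle(X);
        noop -> ok
    end.

%% Remote call: whatever:a/0 has an unknown return type at compile time, so
%% the {ok, _} branch (and everything reachable from it) must be kept.
remote() ->
    case whatever:a() of
        {ok, X} -> handle(X);
        noop -> ok
    end.

handle(X) -> {handled, X}.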

You can disable the alias analysis pass by running erlc with +no_ssa_opt_alias. On my machine the compile time goes from 0.8s to 0.5s. I'll investigate further and see what I can do to speed things up.

@josevalim
Contributor Author

@frej that makes perfect sense. We were just short-circuiting part of the analysis then, thanks!

@josevalim
Contributor Author

@frej benchmarking the code points to the sets operations being the source of the slowdown. I assume we build sets with all variables in a function and, for large functions, that's part of the problem?

In any case, in kills_is:

kills_is([I|Is], Live0, KillsMap0, Blk) ->
    {Live, Key} = case I of
                      #b_set{dst=Dst} ->
                          {sets:del_element(Dst, Live0), Dst};
                      _ ->
                          {Live0, {terminator, Blk}}
                  end,
    Uses = sets:from_list(beam_ssa:used(I), [{version,2}]),
    RemainingUses = sets:union(Live0, Uses),
    Killed = sets:subtract(RemainingUses, Live0),
    KillsMap = KillsMap0#{Key => Killed},
    kills_is(Is, sets:union(Live, Killed), KillsMap, Blk);

I think

    RemainingUses = sets:union(Live0, Uses),
    Killed = sets:subtract(RemainingUses, Live0),

is the same as:

    Killed = sets:subtract(Uses, Live0),

And I assume Live0 is always larger than Uses; if that's the case, maybe we could have this instead:

    Killed = sets:from_list([L || L <- beam_ssa:used(I), not sets:is_element(L, Live0)], [{version, 2}]),

?
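
For concreteness, a sketch of the quoted kills_is/4 clause with both simplifications folded in (untested, relying on the identity (A ∪ B) ∖ A = B ∖ A; only this clause is shown):

kills_is([I|Is], Live0, KillsMap0, Blk) ->
    {Live, Key} = case I of
                      #b_set{dst=Dst} ->
                          {sets:del_element(Dst, Live0), Dst};
                      _ ->
                          {Live0, {terminator, Blk}}
                  end,
    %% Killed = Uses \ Live0, computed without building the intermediate
    %% union of Live0 and Uses.
    Killed = sets:from_list([V || V <- beam_ssa:used(I),
                                  not sets:is_element(V, Live0)],
                            [{version, 2}]),
    KillsMap = KillsMap0#{Key => Killed},
    kills_is(Is, sets:union(Live, Killed), KillsMap, Blk);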

@josevalim
Contributor Author

josevalim commented Jun 28, 2023

Thinking about the algorithm, my understanding is that you want to know when a variable/label is no longer used by computing the kill sets. Today that is implemented with a forward pass to compute all variables (that's the expensive one, as it builds a large set), and then an additional pass.

Could it perhaps be done with a single backward pass? In a backward pass, the first time a variable appears is the last time it is used, so it belongs to the kill set of that instruction, and you store it in the set of killed variables (call this the killed-but-not-defined set). As you traverse up and reach the definitions of those variables, you can remove them from the set. That's the benefit of this approach: you don't have to build very large sets for very large programs (unless a variable's lifetime spans the whole program).
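
To make the idea concrete, here is a rough sketch of such a backward pass over a straight-line block (the {Dst, Uses} instruction shape and all names are made up; this is not the beam_ssa representation):

-module(kills_backward_sketch).
-export([backward_kills/2]).

%% Compute per-instruction kill lists for a straight-line block in a single
%% backward pass. Instructions are modeled as {Dst, Uses}; LiveOut is the set
%% of variables known to be live after the block (the "do-not-kill" set), so
%% nothing in it is ever reported as killed here.
backward_kills(Is, LiveOut) ->
    backward_kills(lists:reverse(Is), LiveOut, #{}).

backward_kills([], _Seen, KillsMap) ->
    KillsMap;
backward_kills([{Dst, Uses}|Is], Seen0, KillsMap0) ->
    %% Scanning backwards, a use we have not seen yet is the last use of that
    %% variable, so it dies at this instruction.
    Killed = [V || V <- Uses, not sets:is_element(V, Seen0)],
    Seen1 = sets:union(Seen0, sets:from_list(Uses, [{version, 2}])),
    %% Above its definition the variable cannot occur, so it can be dropped.
    Seen = sets:del_element(Dst, Seen1),
    backward_kills(Is, Seen, KillsMap0#{Dst => Killed}).

For example, backward_kills([{x, []}, {y, [x]}, {z, [x, y]}], sets:new([{version, 2}])) returns #{x => [], y => [], z => [x, y]}: both x and y die at the instruction defining z, and the set of tracked variables never grows beyond what is currently live.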

The issue is branching code. You need to know if a variable is defined within that branch or before it, so now you are back to two passes: the first pass computes the kill sets (alongside the killed-but-not-defined set) and also stores all variables that are defined but not killed in their own set (those are either unused or used within a branch). Those two sets should not get very large, and all of this happens without going inside branches. Then we do another pass whose goal is to traverse the branching code.

Inside the branching code, we have the same process, starting with a backward pass, such that when we see a variable for the first time (i.e. its last use):

  1. if it is in the killed-but-not-defined set, it is going to be killed later, so no-op

  2. if it is in the defined-but-not-killed set, it comes from the parent, we can kill it now, but we need to sync across branches

  3. otherwise it was defined within this branch, so we can kill it now

The first pass and the pass within branches can be unified by initializing the "killed-but-not-defined" set to the variables given as arguments (so perhaps a better name is the "do-not-kill" set).

Perhaps I am over-simplifying or over-complicating the problem, or perhaps this is already what you are doing, but in case it helps somehow, there it goes (if it doesn't, apologies).

@frej
Contributor

frej commented Jun 29, 2023

@josevalim I'll incorporate the equivalence (A ∪ B) ∖ A <=> B ∖ A; unfortunately it doesn't give a noticeable speedup for the example.

The alias pass uses the standard fixpoint algorithm for calculating the liveness of all variables in the function, and the liveness is then used to calculate the kill sets. For this example we have over 800 variables and the liveness calculation needs 248 iterations before it converges. An experiment where I do the kill-set calculation twice [edit: from the liveness information] doesn't change the execution time in any noticeable way, while doing the liveness calculation twice increases the total runtime by around 30%. So your idea to calculate the kill sets directly looks attractive.
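
For reference, the standard backward data-flow fixpoint for liveness looks roughly like this (a generic sketch with a made-up block representation, not the beam_ssa_alias code):

-module(liveness_sketch).
-export([liveness/1]).

%% Blocks :: #{Label => {Uses, Defs, Successors}}, where Uses and Defs are
%% version-2 sets of variables and Successors is a list of labels.
liveness(Blocks) ->
    Empty = sets:new([{version, 2}]),
    liveness(Blocks, maps:map(fun(_L, _B) -> Empty end, Blocks)).

liveness(Blocks, LiveIn0) ->
    LiveIn =
        maps:map(
          fun(_L, {Uses, Defs, Succs}) ->
                  %% live-out(B) is the union of live-in over B's successors.
                  LiveOut = lists:foldl(
                              fun(S, Acc) -> sets:union(Acc, maps:get(S, LiveIn0)) end,
                              sets:new([{version, 2}]), Succs),
                  %% live-in(B) = uses(B) ∪ (live-out(B) ∖ defs(B))
                  sets:union(Uses, sets:subtract(LiveOut, Defs))
          end, Blocks),
    case LiveIn =:= LiveIn0 of
        true -> LiveIn;                 % fixpoint reached
        false -> liveness(Blocks, LiveIn)
    end.

With hundreds of variables and an unfavourable block ordering this can take many iterations to converge, which is consistent with the 248 iterations mentioned above.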

@josevalim
Contributor Author

I also found this paper, which describes an implementation similar to the one I outlined above, without fixpoint computation, and it is also capable of handling loops: https://inria.hal.science/inria-00558509/PDF/RR-7503.pdf

@frej
Contributor

frej commented Jun 29, 2023

I also found this paper

Nice! Section 4 covers the algorithm I'm using for liveness.

The reason I'm going the long way over liveness is that I had a liveness pass already lying around, so I didn't look into a specialized algorithm. I ran profiling on the modules compiled by scripts/diffable and made sure alias analysis was on par with the other passes, but I guess that the Elixir compiler, compared to hand-written code, produces a lot more variables and also larger functions.

frej added a commit to frej/otp that referenced this issue Jul 19, 2023
Variables which die in a basic block cannot influence the alias status
of variables in successor blocks. By pruning dead variables, which are
not part of a parent-child derivation relationship of live variables,
the size of the active sharing state is reduced. The reduction in size
speeds up subsequent aa_merge_ss/3 operations.

Combined with the improved kill-set calculation and the changed data
structure for describing the alias status of variables, this patch
provides a substantial reduction in time required for alias
analysis. For the set of modules compiled by `scripts/diffable` the
time spent in alias analysis is reduced by approximately 55%. For the
example in Issue erlang#7432 [1], provided by José Valim, which has a large
number of variables, the reduction is even more dramatic. The time
spent in the alias analysis pass is reduced by 97%.

[1] erlang#7432
@frej
Contributor

frej commented Jul 19, 2023

@josevalim, a quick update. I've spent some time on this during the last few weeks and have a branch which provides a 55% speedup for the set of modules compiled by scripts/diffable. For your example the speedup is 97%. [edit, clarification: The speedup is only for the time spent in alias analysis, not the whole compiler] As the branch is based on #7448, I'm currently talking to @jhogberg about how to proceed.

@josevalim
Contributor Author

Very exciting! Thanks for the work and for sharing the updates!

frej added a commit to frej/otp that referenced this issue Aug 2, 2023
Variables which die in a basic block cannot influence the alias status
of variables in successor blocks. By pruning dead variables, which are
not part of a parent-child derivation relationship of live variables,
the size of the active sharing state is reduced. The reduction in size
speeds up subsequent `aa_merge_ss/3` operations.

Combined with improved kill-set calculation
(9ad5c5377cb5779f9c581f2549f0f48d224d0664) and an improved data
structure for describing the alias status of
variables (c389665), this patch
provides a substantial reduction in time required for alias
analysis. For the set of modules compiled by `scripts/diffable` the
time spent in alias analysis is reduced by approximately 55%. For the
example in Issue erlang#7432 [1], provided by José Valim, which has a large
number of variables, the reduction is even more dramatic. The time
spent in the alias analysis pass is reduced by 97%.

[1] erlang#7432

Closes: erlang#7432