
Implement BeamAsm - a JIT for Erlang/OTP #2745

Merged 32 commits on Sep 22, 2020

Conversation

@garazdawi (Contributor) commented Sep 11, 2020

This PR introduces BeamAsm, a JIT compiler for the Erlang VM.

Implementation

BeamAsm provides load-time conversion of Erlang BEAM instructions into native code on x86-64. This allows the loader to eliminate all instruction dispatch overhead and to specialize each instruction based on its argument types.

BeamAsm does not do any cross-instruction optimizations, and the X and Y register arrays work the same as when interpreting BEAM instructions. This allows the Erlang run-time system to remain largely unchanged, except in places that need to work with loaded BEAM instructions, such as code loading and tracing.

BeamAsm uses asmjit to generate native code at run time. Only a small part of asmjit's Assembler API is used. At the moment asmjit only supports x86 32/64-bit assembly, but work is ongoing to also support 64-bit ARM.

For a more in-depth description of how the implementation works, see the internal documentation of BeamAsm.

Performance

How much faster is BeamAsm than the interpreter? That will depend a lot on what your application is doing.

For example, the number of Estones computed by the estone benchmark suite increases by about 50%, meaning about 50% more work can be done in the same time period. Individual benchmarks within the estone suite vary from a 170% increase (pattern matching) to no change at all (huge messages). So, not surprisingly, computation-heavy workloads can show quite a large gain, while communication-heavy workloads remain about the same.

If we run the JSON benchmarks found in Poison or Jason, BeamAsm achieves anything from a 30% to a 130% increase (averaging about 70%) in the number of iterations per second across all Erlang/Elixir implementations. For some benchmarks, BeamAsm is even faster than the pure C implementation, jiffy.

More complex applications tend to see a more moderate performance increase; for instance, RabbitMQ is able to handle 30% to 50% more messages per second, depending on the scenario.

Profiling/Debugging

One of the great things about executing native code is that some of the utilities used to profile C/C++/Rust/Go can be used to profile Erlang code. For instance, this is what a run of perf on Linux can look like:

(screenshot: perf-beamasm — perf report of JIT-compiled Erlang code)

There are more details in the internal documentation of BeamAsm on how to achieve this.
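
In short, the workflow is the one spelled out in one of the commit messages further down in this PR; it assumes a perf built with JIT support and an emulator started with the `+JPperf true` flag:

    perf record -k mono erl +JPperf true
    perf inject --jit -i perf.data -o perf.jitted.data
    perf report -M intel -i perf.jitted.data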

Drawbacks

Loading native code uses more memory. We expect the loaded code to be about 10% larger when using BeamAsm than when using the interpreter.

This PR includes a major rewrite of how the Erlang code loader works. The new loader does not include HiPE support, which means that it will not be possible to run HiPE-compiled code in OTP-24.

We are still looking for anyone who wants to maintain HiPE so that it can continue to push the boundary on what high-performance Erlang looks like.

Try it out!

We are looking for any feedback you can provide about the functionality and performance of BeamAsm. To compile it you need a relatively modern C++ compiler and an operating system that allows memory to be both executable and writable at the same time (which most OSs do; OpenBSD is a notable exception).

If you are on Windows, you can download installers here:

Note that these are built from our internal nightly tests, so they contain more changes than this PR includes.

@garazdawi garazdawi changed the title Beamasm Implement BeamAsm - a JIT for Erlang/OTP Sep 11, 2020
@bjorng bjorng added the team:VM Assigned to OTP team VM label Sep 11, 2020
@nox (Contributor) commented Sep 11, 2020

Congratulations, can't believe this is happening! Now RIIR it, jk.

@isubasinghe commented

Wow, thanks for this work, this is awesome to see.

@gilbertwong96 (Contributor) commented

Awesome! 👍

@dqzjh0319 commented

Just can't wait for the release to try it out on my own. Thanks for the great work.

@tschuett commented Sep 12, 2020

Just curious: why asmjit instead of LLVM? A smaller dependency?
Awesome work, btw.

@garazdawi (Contributor, Author) commented

LLVM is much slower at generating code than asmjit. LLVM can do a lot more, but its main purpose is not to be a JIT compiler.

With asmjit we get full control over all the register allocation and can do a lot of simplifications when generating code.

On the downside, we don't get any of LLVM's built-in optimizations.

We also considered using dynasm, but found the tooling that comes with asmjit to be better.

@tschuett commented Sep 12, 2020

My bad. I was referring to ORCJIT:
https://llvm.org/docs/ORCv2.html

@garazdawi (Contributor, Author) commented

So was I. LLVM is still too slow for what we want to do here. We used LLVM in other JIT attempts, but we always had issues with how long the compiler takes to run. The general-purpose IR approach that LLVM uses just can't be as fast as we need it to be, even if you disable all of the optimizations. Maybe if you emit machine instructions instead of LLVM IR, but even then I'm doubtful.

Also, as you mention, the size of the dependency does play a part. Adding tens of megabytes and having to support any faults in LLVM is a huge undertaking.

@tschuett commented Sep 12, 2020

This is great work you did. There is no doubt.

But honestly, my Erlang VMs run for hours. I don't really care about startup time (https://github.com/scalaris-team/scalaris).

@SisMaker commented

very cool

@nox (Contributor) commented Sep 13, 2020

Sorry for the side-tracking, as I could look into this myself, but I have a small question: is the implementation modular, done in a way where we could later experiment with different JIT backends, such as Rust's Cranelift?

@jhogberg (Contributor) commented

> But honestly, my Erlang VMs run for hours. I don't really care about startup time (https://github.com/scalaris-team/scalaris).

Many people do care, and not just those whose instances are short-lived. We don't want to make that trade-off at the moment.

> Sorry for the side-tracking, as I could look into this myself, but I have a small question: is the implementation modular, done in a way where we could later experiment with different JIT backends, such as Rust's Cranelift?

Yep, it was one of our design goals. Changing to a different assembler is pretty straightforward, and going down the IR route shouldn't be too difficult either, albeit tedious, since you'd need to re-implement every instruction.

@nox (Contributor) commented Sep 13, 2020

> Yep, it was one of our design goals. Changing to a different assembler is pretty straightforward, and going down the IR route shouldn't be too difficult either, albeit tedious, since you'd need to re-implement every instruction.

Super cool to know! And I'm super happy to see progress in that part of the Erlang VM. Now if only Ericsson were hiring remote… 😎

@tschuett commented

> > But honestly, my Erlang VMs run for hours. I don't really care about startup time (https://github.com/scalaris-team/scalaris).
>
> Many people do care, and not just those whose instances are short-lived. We don't want to make that trade-off at the moment.

Please don't get me wrong, this is not criticism. I am REALLY happy to see this coming.

@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Sep 14, 2020
@garazdawi garazdawi removed the testing currently being tested, tag is used by OTP internal CI label Sep 14, 2020
@joaohf (Contributor) commented Sep 14, 2020

Hello

First of all, great work!

I'm the maintainer of the meta-erlang project. I'm testing this PR on a set of QEMU machines, checking whether I can run the common Erlang applications that meta-erlang supports (like tsung, yaws, emq, rabbitmq, ...). That is time-consuming, but I will post my results here once I get there.

Thanks

@garazdawi (Contributor, Author) commented

New changes have been force-pushed.

I've renamed the FLAVOR of the JIT to jit instead of asm. I've also expanded the internal documentation to address some of the questions and problems that I've heard about.

I've also added the function erlang:system_info(emu_flavor), which returns jit if the current emulator is a JIT emulator.
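
For example, in a shell started from a JIT-enabled build (an interpreter build returns emu instead):

    1> erlang:system_info(emu_flavor).
    jit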

The branch has also been rebased onto the current latest master.

@garazdawi (Contributor, Author) commented Sep 16, 2020

I've added a description of the files involved in the jit for those interested.

@dominicletz (Contributor) commented Sep 16, 2020

Thanks @garazdawi and team, this is awesome! Some feedback from trying it out:

1. It's super easy to get running on Linux ❤️

       kerl build git https://github.com/garazdawi/otp.git beamasm 24.jit
       kerl install 24.jit 24.jit
       . ./24.jit/activate

2. I've run some tests and system_info(emu_flavor) returns jit 👍

3. I couldn't really measure any performance differences though -- I test-ran my pure-BEAM sha256 implementation, as I thought it must have gotten faster, but 😭

(screenshot: Selection_315 — benchmark timings)

I feel like I've missed something obvious -- e.g., is there a threshold after which the JIT is activated? Can I somehow check whether the executed code is really jitted?

Also, let me know if I should provide this or more detailed feedback somewhere else. I'm happy to help or to dig into this more.

garazdawi and others added 9 commits September 21, 2020 16:40
This is done in preparation for BeamAsm.

Co-authored-by: John Högberg <john@erlang.org>
Co-authored-by: Dan Gudmundsson <dgud@erlang.org>
Co-authored-by: John Högberg <john@erlang.org>
Co-authored-by: Dan Gudmundsson <dgud@erlang.org>
We do this in order to align the C and C++ code to
use similar formatting options when using clang-format.
Compiler enhancements in OTP 22 and later have rendered
the `literal_type_tests/1` test case ineffective.
garazdawi and others added 7 commits September 22, 2020 07:51
Co-authored-by: John Högberg <john@erlang.org>
Co-authored-by: Dan Gudmundsson <dgud@erlang.org>
Co-authored-by: Björn Gustavsson <bjorn@erlang.org>
Co-authored-by: Lukas Larsson <lukas@erlang.org>
Co-authored-by: Björn Gustavsson <bjorn@erlang.org>
Co-authored-by: Dan Gudmundsson <dgud@erlang.org>
Co-authored-by: Lukas Larsson <lukas@erlang.org>
Co-authored-by: John Högberg <john@erlang.org>
Co-authored-by: Dan Gudmundsson <dgud@erlang.org>
Using perf dump is superior to using perf map, as it lets us use `perf annotate`, which means we can see which x86 assembly instruction was using the most CPU.

    perf record -k mono erl +JPperf true
    perf inject --jit -i perf.data -o perf.jitted.data
    perf report -M intel -i perf.jitted.data

The implementation was inspired by the mono repo:

https://github.com/mono/mono/blob/master/mono/mini/mini-runtime.c

It should be easy to add support for Erlang source file and
line mapping if we want to do that.
@garazdawi (Contributor, Author) commented

Merged for release in OTP-24. Please continue testing and reporting issues, either in this PR, on erlang-questions, or at bugs.erlang.org.

We have a fix for the issue experienced by @dominicletz; there will be a new PR with that "soon".

@garazdawi garazdawi deleted the beamasm branch September 22, 2020 06:36
samuelpordeus added a commit to samuelpordeus/asmjit that referenced this pull request Sep 28, 2020
Opening the PR as suggested here: erlang/otp#2745 (comment)
@GeraldXv commented

@garazdawi The JIT is awesome! I tried some benchmarks, which achieved more than a 50% performance increase! The Fibonacci test falls short, though:

-module(fib).
-export([loop_test/1]).

loop_test(Num) ->
    F = fun Loop(N) ->
                case N of
                    0 -> ok;
                    _ ->
                        fib(35),
                        Loop(N - 1)
                end
        end,
    F(Num).

fib(X) when X < 2 ->
    1;
fib(X) when X >= 2 ->
    fib(X - 1) + fib(X - 2).

Erlang/OTP 23 [erts-11.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Eshell V11.1 (abort with ^G)
1> c(fib).
{ok,fib}
2> timer:tc(fun() -> fib:loop_test(100) end).
{29970315,ok}
3> c(fib, [native]).
{ok,fib}
4> timer:tc(fun() -> fib:loop_test(100) end).
{8904689,ok}

Erlang/OTP 24 [DEVELOPMENT] [erts-11.1.3] [source-4981289bf8] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V11.1.3 (abort with ^G)
1> c(fib).
{ok,fib}
2> timer:tc(fun() -> fib:loop_test(100) end).
{15541481,ok}

OTP 23 ~= 30s, OTP 23 (HiPE) ~= 9s, OTP 24 (JIT) ~= 15.5s

Btw, the same test in Go takes 7 seconds.

@peaceful-james commented

> I've added a description of the files involved in the jit for those interested.

This link is broken, JFYI.

@garazdawi (Contributor, Author) commented

> This link is broken, JFYI.

Should be fixed now.

@heri16 commented Aug 21, 2021

> This is great work you did. There is no doubt.
>
> But honestly, my Erlang VMs run for hours. I don't really care about startup time (https://github.com/scalaris-team/scalaris).

What happened to Scalaris, by the way? It looks cool, but there has been no new release for over 5 years. Is it dead?
