
Expected Performance/Throughput #72

Open
billywhizz opened this issue Oct 4, 2022 · 7 comments

Comments

@billywhizz

billywhizz commented Oct 4, 2022

I have been doing some benchmarking of workerd against other JS runtimes and I am not seeing very good results. Do you have any recommendations for config for peak performance for benchmarking, or an expectation of what kind of numbers we should see?

RPS using wrk for a simple hello-world benchmark is currently only about one tenth of what I see with Deno or Bun.sh, and tail latencies are also very high.

Memory usage is also very high compared to Deno - ~220 MB for a simple hello world.

@kentonv
Member

kentonv commented Oct 4, 2022

Hi @billywhizz,

A couple questions:

  • What platform are you testing on? (Mac/Linux? Intel/arm?)
  • Are Deno and Bun configured to use multiple threads in your setup?
  • Did you build the binary yourself or did you use one from npm? If you built it, what flags did you use?

As noted in the readme, workerd really isn't ready for benchmarking yet as we know there is a bunch of low-hanging fruit in terms of performance tuning. Our internal build (which is old, not bazel-based, a huge mess, but does a lot more tuning) actually produces much faster binaries right now. Some of the things we need to do here include:

  • Enable distributing load across multiple threads/cores. Currently workerd uses only a single thread, so to utilize multiple cores you would need to run multiple instances of workerd.
  • Use a better memory allocator. Currently workerd uses the system allocator but we know tcmalloc or jemalloc is likely to produce much better results.
  • Tune compiler flags like LTO (link-time optimization).
  • On Mac, use kqueue instead of poll. I wrote capnproto/capnproto#1555 (Add support for kqueue in UnixEventPort) a few days ago, but this is not yet integrated into workerd.
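To make the first bullet concrete, the multi-instance workaround could be sketched as below. This is illustrative only: the `--socket-addr` override, the socket name `http`, the config path, and the port scheme are all assumptions, not a recommended setup.

```shell
# Sketch: launch one workerd process per core, each bound to its own port.
# A reverse proxy or load balancer (not shown) would fan traffic across them.
ports() {
  cores=$(nproc 2>/dev/null || echo 4)   # fall back to 4 cores if nproc is missing
  seq 8080 $((8080 + cores - 1))
}

if command -v workerd >/dev/null 2>&1; then
  for port in $(ports); do
    # Assumed flag: override the address of the socket named "http" in config.capnp
    workerd serve config.capnp --socket-addr http=127.0.0.1:"$port" &
  done
  wait
fi
```

Each process is an independent event loop, so this gets core-level parallelism at the cost of running N copies of the runtime.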

@billywhizz
Author

Thanks for the detailed response, Kenton! I didn't include details as I wanted to see if there were any recommendations first. The benchmarks I have run are on a Core i5, on Ubuntu 22 Docker in privileged mode, running on an Ubuntu 18 host (i.e. my laptop!).

They are all single-process, serving a very basic hello-world response. I used your example for workerd.
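For reference, the worker being served is essentially the repo's hello-world sample, something along these lines (a rough reconstruction; the published version may differ):

```javascript
// Minimal workerd-style worker (modules syntax): answer every request
// with a fixed plain-text body.
const worker = {
  async fetch(request) {
    return new Response("Hello World\n", {
      headers: { "content-type": "text/plain" },
    });
  },
};

export default worker;
```

In workerd this module gets embedded via a small `config.capnp` that declares the worker service and binds it to an HTTP socket.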

I used the npm workerd, as I was having issues getting workerd to build from scratch, but I can try that too once I have it building.

AFAIK Deno uses a separate thread for IO and Bun is all on one thread. Node.js 16 and 18 also get around 3-4x better throughput on a single thread. I'll see if I can share full results later today.

Congrats on the release! I am looking forward to diving into it in more detail.

@kentonv
Member

kentonv commented Oct 4, 2022

Hmm, your results seem significantly worse than my own tests, though I'm not sure how much it's worth digging in until we've done some more tuning. I wonder if we accidentally published an unoptimized binary to npm. Since we don't have CI set up to do the publishing yet, there could have been some human error here.

@kentonv
Member

kentonv commented Oct 4, 2022

BTW note that there's some inherent overhead from managing multiple isolates and having to enter/exit specific isolates which single-isolate runtimes don't have to deal with. So we shouldn't be expecting parity on this kind of benchmark, but it should be much closer.

The memory issue is a separate problem, but it is something we've noticed and are working on. Basically, V8 isn't garbage-collecting aggressively enough by default. You can tune this with certain V8 flags, but we should make it work better out of the box.
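As an illustration of the kind of tuning meant here, Node.js (which embeds the same V8) exposes heap-limit flags that make the GC collect sooner. How (or whether) workerd forwards such flags to V8 isn't covered in this thread, so treat the mechanism as an assumption and the flag as a Node/V8 flag, not a workerd one.

```shell
# Illustration via Node.js: cap V8's old-generation heap at 64 MB so the GC
# collects more aggressively, then confirm the process stays well under 512 MB.
if command -v node >/dev/null 2>&1; then
  node --max-old-space-size=64 -e \
    'console.log(process.memoryUsage().rss < 512 * 1024 * 1024)'
fi
```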

@kentonv
Member

kentonv commented Oct 4, 2022

Also, as always, note that benchmarks like this may not be telling you anything useful when it comes to real-world use. A "hello world" benchmark is essentially benchmarking the HTTP implementation, but in a real application the HTTP implementation is likely a tiny fraction of overall CPU usage so having a slightly faster or slower HTTP isn't going to make a big difference. In real apps you're going to spend most of your time executing JavaScript, and V8 is what ultimately matters there.

@billywhizz
Author

Thanks for the responses. Yes, it is early days, and these microbenchmarks are not really applicable to real-world scenarios, as you say. But they do tend to flag overhead when done comparatively, and for some use cases that extra latency can be important, especially when you are being billed by the second for it.

When benchmarking with:

wrk -c 256 -t 2 -d 30 http://127.0.0.1:3000/

I get ~12k RPS and ~40ms P99 latency. That is about 0.35x Node.js throughput and 20x Node.js P99 latency. I am on a pretty old kernel, so I'll try to test on a more recent setup.

@Wallacy

Wallacy commented Nov 28, 2022

It is also worth considering https://github.com/microsoft/mimalloc as the allocator, since it provides guard pages, randomized allocation, and encrypted free lists.
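On Linux, an alternate allocator like mimalloc can be tried without rebuilding workerd at all, by preloading its shared library. A minimal sketch; the install path below is a hypothetical, adjust it to wherever mimalloc is actually installed:

```shell
# Sketch: substitute mimalloc for the system malloc at startup via LD_PRELOAD.
MIMALLOC_LIB="/usr/local/lib/libmimalloc.so"   # hypothetical install path

preload_cmd() {
  # Build the env-prefixed command line; LD_PRELOAD makes the dynamic linker
  # resolve malloc/free from mimalloc before the libc versions.
  echo "LD_PRELOAD=$MIMALLOC_LIB workerd serve config.capnp"
}

# Only actually run it when both the library and workerd exist on this machine.
if [ -f "$MIMALLOC_LIB" ] && command -v workerd >/dev/null 2>&1; then
  eval "$(preload_cmd)"
fi
```

The same preload trick works for tcmalloc or jemalloc, which is a cheap way to compare allocators before committing to linking one in.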
