A demo project for benchmarking Phoenix.
- Dependencies
- What to benchmark?
- Dedicated Machines
- Tuning
- Building a release
- Benchmark Results with tuning
- Conclusion
- References
## Dependencies

- Erlang/OTP 23
- Elixir 1.11.2
- wrk
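The post doesn't cover installing wrk; one common way (an assumption here, not part of the original) is to build it from source:

```shell
# Build wrk from source (assumes git, make, and a C toolchain are available)
git clone https://github.com/wg/wrk.git
cd wrk
make
sudo cp wrk /usr/local/bin/
```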
## What to benchmark?

The response time of a request in a standard MVC web application that makes no database calls.
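A quick way to sanity-check the target endpoint before a full run (assuming the server setup described in the following sections, and that the default Phoenix index page is the page being measured):

```shell
# Hypothetical smoke test from the load-generator machine: one request to the
# benchmarked endpoint, printing the HTTP status code and total request time
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://IP_A:PORT/
```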
## Dedicated Machines

This benchmark uses two Vultr $5/month machines in the same data center:

- Machine A - running `phx-benchmark-demo`
- Machine B - running `wrk`

I will hide the machines' IPs in the following content and use `IP_A` or `IP_B` to refer to the real IPs.
Testing network latency from Machine B:
```
$ ping IP_A
PING IP_A (IP_A) 56(84) bytes of data.
64 bytes from IP_A: icmp_seq=1 ttl=61 time=0.413 ms
64 bytes from IP_A: icmp_seq=2 ttl=61 time=0.403 ms
64 bytes from IP_A: icmp_seq=3 ttl=61 time=0.377 ms
64 bytes from IP_A: icmp_seq=4 ttl=61 time=0.412 ms
64 bytes from IP_A: icmp_seq=5 ttl=61 time=0.386 ms
```
CPU and memory info:

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              1
On-line CPU(s) list: 0
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel Xeon Processor (Cascadelake)
Stepping:            6
CPU MHz:             2999.998
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32 KiB
L1i cache:           32 KiB
L2 cache:            4 MiB
L3 cache:            16 MiB

Total online memory: 1G
```
## Tuning

Tune Machine A with the following commands:

```shell
$ ulimit -n 20000000
$ sysctl -w fs.file-max=12000500
$ sysctl -w fs.nr_open=20000500
$ sysctl -w net.ipv4.tcp_mem='10000000 10000000 10000000'
$ sysctl -w net.ipv4.tcp_rmem='1024 4096 16384'
$ sysctl -w net.ipv4.tcp_wmem='1024 4096 16384'
$ sysctl -w net.ipv4.ip_local_port_range='1024 65536'
$ sysctl -w net.core.rmem_max=16384
$ sysctl -w net.core.wmem_max=16384
```
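These settings don't survive a reboot. Persisting them isn't covered in the original post, but if you need to, one common approach is a sysctl drop-in file (a sketch, assuming you want the same values):

```shell
# Persist the kernel tuning across reboots (the ulimit still has to be set per
# shell or via limits.conf); the values mirror the sysctl commands above
cat <<'EOF' | sudo tee /etc/sysctl.d/99-phx-benchmark.conf
fs.file-max = 12000500
fs.nr_open = 20000500
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.ipv4.tcp_rmem = 1024 4096 16384
net.ipv4.tcp_wmem = 1024 4096 16384
net.ipv4.ip_local_port_range = 1024 65536
net.core.rmem_max = 16384
net.core.wmem_max = 16384
EOF
sudo sysctl --system   # reload all sysctl configuration files
```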
Edit `rel/vm.args.eex`:

```
## Increase number of concurrent ports/sockets
+Q 65536
```
Increase `max_keepalive` in order to handle more requests on the same connection:

```elixir
config :hello, HelloWeb.Endpoint,
  http: [
    port: String.to_integer(System.get_env("PORT") || "4000"),
    transport_options: [socket_opts: [:inet6]],
    protocol_options: [max_keepalive: 5_000_000] # added
  ],
  secret_key_base: secret_key_base
```
Suppress logging for each request, since too much logging has a huge impact on performance:

```elixir
config :logger, level: :warn
```
## Building a release

```shell
export LANG=en_US.UTF-8
export MIX_ENV=prod

mix local.hex --force
mix local.rebar --force
mix deps.get --only prod
mix compile
mix phx.digest
mix release
```
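The post doesn't show how the release is started. Assuming the app is named `hello` (as in the endpoint config above), a standard Mix release layout, and the usual `PORT`/`SECRET_KEY_BASE` environment variables, starting it would look roughly like this:

```shell
# Start the release in the foreground on Machine A; the env var names are
# assumptions based on a typical Phoenix runtime configuration
PORT=4000 SECRET_KEY_BASE="$(mix phx.gen.secret)" \
  _build/prod/rel/hello/bin/hello start
```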
## Benchmark Results with tuning

```
Running 1m test @ http://IP_A:PORT
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   491.16us  331.40us  12.86ms   98.53%
    Req/Sec     2.09k   109.18     2.38k    72.33%
  Latency Distribution
     50%  453.00us
     75%  506.00us
     90%  567.00us
     99%    1.08ms
  124536 requests in 1.00m, 303.57MB read
Requests/sec:   2075.43
Transfer/sec:      5.06MB
```
- current average CPU usage is 58% (how this can be watched is sketched below)
- request latency is nearly equal to network latency
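The post doesn't say how CPU usage was measured on Machine A; one simple way to watch it during a run (an assumption, not from the original) is:

```shell
# Sample CPU usage on Machine A every 5 seconds while wrk is running;
# the "id" column is the idle percentage, so usage is roughly 100 - id
vmstat 5
```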
Let's add more connections.
```
Running 1m test @ http://IP_A:PORT
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   569.07us  328.83us  12.36ms   98.21%
    Req/Sec     1.80k   109.76     2.05k    72.75%
  Latency Distribution
     50%  532.00us
     75%  595.00us
     90%  668.00us
     99%    1.26ms
  215057 requests in 1.00m, 524.22MB read
Requests/sec:   3583.06
Transfer/sec:      8.73MB
```
- current average CPU usage is 83%
- request latency is nearly equal to network latency

Let's add more connections.
```
Running 1m test @ http://IP_A:PORT
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.89ms  397.40us  14.71ms   96.99%
    Req/Sec     1.14k    82.97     1.31k    66.83%
  Latency Distribution
     50%    0.86ms
     75%    0.97ms
     90%    1.08ms
     99%    1.70ms
  271907 requests in 1.00m, 662.80MB read
Requests/sec:   4529.75
Transfer/sec:     11.04MB
```
- current average CPU usage is 98%, which means the system is about to exceed its max capacity
- request latency is increasing, but acceptable

Let's add more connections.
```
Running 1m test @ http://IP_A:PORT
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.64ms  426.48us  15.06ms   92.62%
    Req/Sec   616.42     38.11     0.87k    74.50%
  Latency Distribution
     50%    1.60ms
     75%    1.73ms
     90%    1.88ms
     99%    2.71ms
  294693 requests in 1.00m, 718.34MB read
Requests/sec:   4907.09
Transfer/sec:     11.96MB
```
- current average CPU usage is 100%
- request latency is increasing, but acceptable

Let's add more connections.
```
Running 1m test @ http://IP_A:PORT
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.22ms  519.59us  15.42ms   86.47%
    Req/Sec   311.80     22.81   626.00     73.11%
  Latency Distribution
     50%    3.15ms
     75%    3.42ms
     90%    3.68ms
     99%    4.63ms
  298212 requests in 1.00m, 726.92MB read
Requests/sec:   4964.86
Transfer/sec:     12.10MB
```
- current average CPU usage is 100%
- request latency is increasing, but acceptable

Let's add more connections.
```
Running 1m test @ http://IP_A:PORT
  32 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.44ms    0.92ms  23.08ms   84.29%
    Req/Sec   155.45     14.18   210.00     57.47%
  Latency Distribution
     50%    6.28ms
     75%    6.74ms
     90%    7.35ms
     99%    9.77ms
  297768 requests in 1.00m, 725.84MB read
Requests/sec:   4956.40
Transfer/sec:     12.08MB
```
- current average CPU usage is 100%
- request latency is increasing, but acceptable

Let's take a break.
| connections | average latency | requests / second |
|---|---|---|
| 1 | 0.491 ms | 2075.43 |
| 2 | 0.569 ms | 3583.06 |
| 4 | 0.890 ms | 4529.75 |
| 8 | 1.640 ms | 4907.09 |
| 16 | 3.220 ms | 4964.86 |
| 32 | 6.440 ms | 4956.40 |
Starting from 4 connections, the average latency increases roughly linearly with the number of connections. That means we are reaching the system's limits, and the max RPS plateaus just under 5k.

Even though we can handle more connections, it comes at the cost of latency. Let's prove it!

Before running the benchmark with 128 connections, I predict the average latency will be almost 25 ms (by Little's law: 128 connections / ~5,000 requests per second ≈ 26 ms).
```
Running 1m test @ http://IP_A:PORT
  128 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.34ms    3.09ms  96.75ms   84.12%
    Req/Sec    37.98      5.36   232.00     71.99%
  Latency Distribution
     50%   25.90ms
     75%   27.45ms
     90%   29.50ms
     99%   35.96ms
  291769 requests in 1.00m, 711.21MB read
Requests/sec:   4854.76
Transfer/sec:     11.83MB
```
The prediction above was right. Let's confirm it with more tests.
Before running the benchmarks, I predict the average latency will be:

- `benchmark/start.sh 256` - almost 50 ms
- `benchmark/start.sh 512` - almost 100 ms
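The contents of `benchmark/start.sh` aren't included in the post; judging from the wrk output above, it is presumably a thin wrapper along these lines (the exact flags are an assumption):

```shell
#!/bin/sh
# Hypothetical sketch of benchmark/start.sh: run a 1-minute wrk test with
# N threads/connections against Machine A and print the latency distribution.
# Replace IP_A and PORT with Machine A's address and the Phoenix port.
N=${1:-1}
wrk -t"$N" -c"$N" -d1m --latency "http://IP_A:PORT/"
```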
```
Running 1m test @ http://IP_A:PORT
  256 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.35ms    7.05ms 317.90ms   81.94%
    Req/Sec    16.86      4.83   190.00     68.42%
  Latency Distribution
     50%   58.20ms
     75%   62.43ms
     90%   67.08ms
     99%   77.90ms
  259492 requests in 1.00m, 632.54MB read
Requests/sec:   4317.57
Transfer/sec:     10.52MB
```
```
Running 1m test @ http://IP_A:PORT
  512 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   125.54ms   21.18ms 664.84ms   91.12%
    Req/Sec     8.61      2.55    60.00     74.14%
  Latency Distribution
     50%  123.18ms
     75%  130.93ms
     90%  140.51ms
     99%  192.63ms
  245913 requests in 1.00m, 599.44MB read
Requests/sec:   4090.92
Transfer/sec:      9.97MB
```
As we keep increasing the number of connections, the RPS begins to decrease. That means the system is becoming badly overloaded.
## Conclusion

As we have seen, the performance is predictable, just like Saša Jurić said:

> The goal of the platform isn’t to squeeze out as many requests per second as possible, but to keep performance predictable and within limits. The level of performance your Erlang system achieves on a given machine shouldn’t degrade significantly, meaning there shouldn’t be unexpected system hiccups due to, for example, the garbage collector kicking in. Furthermore, as explained earlier, long-running BEAM processes don’t block or significantly impact the rest of the system. Finally, as the load increases, BEAM can use as many hardware resources as possible. If the hardware capacity isn’t enough, you can expect graceful system degradation — requests will take longer to process, but the system won’t be paralyzed. This is due to the preemptive nature of the BEAM scheduler, which performs frequent context switches that keep the system ticking and favors short-running processes. And of course, you can address higher system demand by adding more hardware.
>
> -- *Elixir in Action*, 2nd Edition
A Vultr $5/month machine can handle about 4.5k requests per second, which is impressive and exciting.

I think a Phoenix server combined with a CDN would be sufficient for most startup projects.
But please calm down, as Saša Jurić also said:

> It’s also worth pointing out that synthetic tests can easily be misleading, so be sure to construct an example which resembles the real use case you’re trying to solve.