initial document for benchmark #219

Merged
merged 16 commits on Apr 5, 2024

213 changes: 213 additions & 0 deletions docs/getting-started/benchmark.md
@@ -0,0 +1,213 @@
---
sidebar_position: 3
---

# Benchmarking Dragonfly

Dragonfly is a high-performance, distributed key-value store designed for scalability
and low latency. It's a drop-in replacement for Redis 6.x and memcached servers.
This document outlines a benchmarking methodology and results achieved using Dragonfly
with the `memtier_benchmark` load testing tool.

We benchmarked Dragonfly using the [memtier_benchmark](https://github.com/RedisLabs/memtier_benchmark)
load testing tool; a prebuilt container is also available on [Docker Hub](https://hub.docker.com/r/redislabs/memtier_benchmark/).
Although Redis offers the `redis-benchmark` tool in its repository, it is not as efficient
as `memtier_benchmark` and often becomes the bottleneck itself instead of Dragonfly.
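
For example, the containerized client can be pointed at a remote server without building
anything locally. This is a sketch: the image tag is assumed to be `latest`, and the image's
entrypoint is assumed to be `memtier_benchmark` itself:

```
# run the prebuilt memtier_benchmark image against a remote server;
# all arguments after the image name are passed to memtier_benchmark
docker run --rm redislabs/memtier_benchmark:latest -s $SERVER_PRIVATE_IP --hide-histogram
```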

We also developed our own tool [dfly_bench](https://github.com/dragonflydb/dragonfly/blob/main/src/server/dfly_bench.cc),
which can be built from source in the Dragonfly repository.


## Methodology
- **Remote Deployment:** Dragonfly is a multi-threaded server designed to run remotely.
Therefore, we recommend running the load testing client and server on separate machines for a more accurate representation of real-world performance.
- **Minimizing Latency:** Locate the client and server within the same Availability Zone and use private IPs for optimal network performance. If you benchmark in the AWS cloud, consider an AWS cluster placement group
for the lowest possible latency (see the sketch after this list). The rationale is to remove environmental factors
that might skew the test results.
- **Server vs. Client Resources**: Use a more powerful instance for the load testing client
to avoid client-side bottlenecks.
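
In the AWS cloud, a cluster placement group can be created with the AWS CLI. Here is a
minimal sketch, assuming the CLI is configured; the group name is illustrative:

```
# create a cluster placement group (the name "dfly-bench" is illustrative)
aws ec2 create-placement-group --group-name dfly-bench --strategy cluster

# then launch both instances into it by passing
# --placement "GroupName=dfly-bench" to aws ec2 run-instances
```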

The remainder of this document will discuss how to set up a benchmark in the AWS cloud
to observe millions of QPS from a single instance.

## Load testing configuration
We used Dragonfly v1.15.0 (the latest at the time of writing) with the following arguments:

```
./dragonfly --logtostderr --dbfilename=
```

Please note that Dragonfly uses all available vCPUs on the server by default.
If you want to explicitly control the number of threads, add `--proactor_threads=<N>`.
Both the client and server instances run `Ubuntu 23.04` with kernel version 6.2.
In line with our recommendations above, we connected over internal IPs and
used a stronger `c7gn.16xlarge` instance with 64 vCPUs for the load-testing program (i.e., the client).
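
For example, a launch that pins Dragonfly to 32 threads (the thread count here is illustrative):

```
# log to stderr, disable snapshot files via the empty --dbfilename,
# and set the number of proactor threads explicitly
./dragonfly --logtostderr --dbfilename= --proactor_threads=32
```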

## Dragonfly on `c6gn.12xlarge`

### Write-only test
On the loadtest instance (`c7gn.16xlarge` with 64 vCPUs):

```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

The run ended with the following summary:

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 4195628.23 --- --- 0.39283 0.37500 0.68700 2.54300 323231.06

```

In this test, we reached almost 4.2M queries per second (QPS) with an average latency of
0.4ms between `memtier_benchmark` and Dragonfly. The P50 latency was 0.38ms, the P99 was 0.69ms,
and the P99.9 was 2.54ms. It is a very short and simple test, but it still gives some perspective
on Dragonfly's performance.

### Read-only test
Without flushing the database:
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

Note that the ratio changed to "0:1", meaning only `GET` commands and *no* `SET` commands.

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 4109802.84 4109802.84 0.00 0.40126 0.38300 0.67900 0.90300 296551.68
```

We can observe that `Ops/sec` and `Hits/sec` are identical, meaning every GET request
coming from the load test hit an existing key.
Dragonfly returned a value for each request, averaging 4.1M QPS
with a P99.9 latency of 903us (less than one millisecond).

### Read test with pipelining

Here's another way to load test Dragonfly: sending `GET`s with a pipeline (`--pipeline`)
of batch size 10. Pipelining means that the client sends multiple commands (10 in this case)
and only then waits for the responses.

```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
```

```
ALL STATS
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 7083583.57 7083583.57 0.00 0.45821 0.44700 0.69500 1.53500 511131.14
```

In pipelining mode, `memtier_benchmark` sends `K` requests (10 in this case) in a batch before waiting
for them to complete. Pipelining reduces the CPU time spent
in the networking stack, and as a result, Dragonfly can reach 7M QPS with sub-millisecond latency.
Please note that for real-world use cases, pipelining requires the cooperation of the client-side application,
which must send multiple requests on a single connection before waiting for the server to respond.

Some asynchronous client libraries, such as `StackExchange.Redis` or `ioredis`, multiplex requests
over a single connection. They can still provide a simplified synchronous interface to their users while
benefiting from the performance improvements of pipelining.
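
As a minimal illustration of the idea, assuming `redis-cli` is installed on the client and
using hypothetical key names, pipe mode writes a batch of commands over one connection
before reading any replies:

```
# send 10 GETs over a single connection in one batch;
# replies are read only after the whole batch has been written
for i in $(seq 1 10); do echo "GET key:$i"; done | redis-cli -h $SERVER_PRIVATE_IP --pipe
```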

## Load testing Dragonfly on `c7gn.12xlarge`

Next, we tried running Dragonfly on the next-generation instance (`c7gn`) with the same number of vCPUs (48).
We used the same `c7gn.16xlarge` client to run `memtier_benchmark`, and we used the same commands
to test writes, reads, and pipelined reads:

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 5.2M | 260 | 631 |
| Read-Only | 6M | 271 | 623 |
| Pipelined Read| 8.9M | 323 | 839 |


### Writes
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 5195097.56 --- --- 0.26012 0.24700 0.49500 0.63100 400230.15
```

### Reads
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 6078632.89 6078632.89 0.00 0.27177 0.26300 0.49500 0.62300 438616.86
```

### Pipelined Reads
```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 8975121.86 8975121.86 0.00 0.32325 0.31100 0.52700 0.83900 647619.14
```


## Comparison with Garnet
Microsoft Research recently released [Garnet](https://github.com/microsoft/garnet),
a remote cache store. Due to interest within the Dragonfly community, we decided to compare
Garnet's performance with Dragonfly's. This comparison focuses on performance results
and does not delve into architectural differences or Redis compatibility implications.

*Note:* Unfortunately, Garnet does not provide an aarch64 build,
so we ran both Garnet and Dragonfly on an x86_64 server
(`c6in.12xlarge`). We ran Garnet via Docker with host networking enabled, using the command
`docker run --network=host ghcr.io/romange/garnet:latest --port=6379`.
The Docker container was built using the Garnet Dockerfile for Ubuntu, located in their
repository.

### Garnet on `c6in.12xlarge`

As in the previous tests, we ran `memtier_benchmark` on `c7gn.16xlarge`, with the `cluster`
placement policy applied to both instances. For writes, we used the following command:

```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

Similarly, for reads we used
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

and for pipelined reads we used

```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 2000000 --distinct-client-seed --hide-histogram --pipeline=10
```

Note that we increased the number of requests to `2000000` per client connection in the latter case: the pipelining QPS is so high that `200000` requests finish too quickly.
**Results**:

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 3.5M | 346 | 4287 |
| Read-Only | 3.7M | 327 | 2623 |
| Pipelined Read| 25.4M !!! | 119 | 375 |


The interesting part is the pipelined reads, where Garnet scaled linearly to more than 25M QPS,
which is really impressive performance.

On the other hand, a curious and random finding: a single `dbsize` command took 3 seconds
to run on Garnet.

### Dragonfly on `c6in.12xlarge`

We ran Dragonfly on the same instances with the same test configurations.
Below are the results for Dragonfly.

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 3.6M | 291 | 6815 |
| Read-Only | 5.1M | 299 | 7615 |
| Pipelined Read| 6.9M | 358 | 1127 |

As you can see, Dragonfly shows comparable throughput for non-pipelined access,
but its P99.9 latency was worse. For pipelined commands, Dragonfly had 3.7x less throughput than Garnet.
5 changes: 4 additions & 1 deletion docs/getting-started/getting-started.md
@@ -25,4 +25,7 @@ This helps guarantee the reliability and stability of the software.
# OS Compatibility
Dragonfly is compatible with Linux versions 4.14 or later.
However, to achieve optimal performance, it is recommended to run Dragonfly on kernel version 5.10 or later.
The Dragonfly build environment is based on Ubuntu 20.04.

# Benchmarking Dragonfly
Learn [how to measure the performance of Dragonfly](./benchmark.md) in a cloud environment.