initial document for benchmark #219

Merged
merged 16 commits on Apr 5, 2024

213 changes: 213 additions & 0 deletions docs/getting-started/benchmark.md
@@ -0,0 +1,213 @@
---
sidebar_position: 3
---

# Benchmarking Dragonfly

Dragonfly is a high-performance, distributed key-value store designed for scalability
and low latency. It's a drop-in replacement for Redis 6.x and memcached servers.
This document outlines a benchmarking methodology and results achieved using Dragonfly
with the `memtier_benchmark` load testing tool.

We benchmarked Dragonfly using the [memtier_benchmark](https://github.com/RedisLabs/memtier_benchmark)
load testing tool; a prebuilt container is also available on [Docker Hub](https://hub.docker.com/r/redislabs/memtier_benchmark/).
Although Redis offers the `redis-benchmark` tool in its repository, it is not as efficient
as `memtier_benchmark` and often becomes the bottleneck itself instead of Dragonfly.
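
For example, the containerized client can be pointed at a remote server without building
anything locally. This is a sketch: the image tag is assumed to be `latest`, and the image's
entrypoint is assumed to be `memtier_benchmark` itself:

```
# run the prebuilt memtier_benchmark image against a remote server;
# all arguments after the image name are passed to memtier_benchmark
docker run --rm redislabs/memtier_benchmark:latest -s $SERVER_PRIVATE_IP --hide-histogram
```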

We also developed our own tool [dfly_bench](https://github.com/dragonflydb/dragonfly/blob/main/src/server/dfly_bench.cc),
which can be built from source in the Dragonfly repository.


## Methodology
- **Remote Deployment:** Dragonfly is a multi-threaded server designed to run remotely.
Therefore, we recommend running the load testing client and server on separate machines for a more accurate representation of real-world performance.
- **Minimizing Latency:** Locate the client and server within the same Availability Zone and use private IPs for optimal network performance. If you benchmark in the AWS cloud, consider an AWS cluster placement group
for the lowest possible latency (see the sketch after this list). The rationale is to remove environmental factors
that might skew the test results.
- **Server vs. Client Resources**: Use a more powerful instance for the load testing client
to avoid client-side bottlenecks.
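
In the AWS cloud, a cluster placement group can be created with the AWS CLI. Here is a
minimal sketch, assuming the CLI is configured; the group name is illustrative:

```
# create a cluster placement group (the name "dfly-bench" is illustrative)
aws ec2 create-placement-group --group-name dfly-bench --strategy cluster

# then launch both instances into it by passing
# --placement "GroupName=dfly-bench" to aws ec2 run-instances
```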

The remainder of this document will discuss how to set up a benchmark in the AWS cloud
to observe millions of QPS from a single instance.

## Load testing configuration
We used Dragonfly v1.15.0 (the latest at the time of writing) with the following arguments:

```
./dragonfly --logtostderr --dbfilename=
```

Please note that Dragonfly uses all available vCPUs on the server by default.
If you want to explicitly control the number of threads, add `--proactor_threads=<N>`.
Both the client and server instances run `Ubuntu 23.04` with kernel version 6.2.
In line with our recommendations above, we connected over internal IPs and
used a stronger `c7gn.16xlarge` instance with 64 vCPUs for the load-testing program (i.e., the client).
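
For example, a launch that pins Dragonfly to 32 threads (the thread count here is illustrative):

```
# log to stderr, disable snapshot files via the empty --dbfilename,
# and set the number of proactor threads explicitly
./dragonfly --logtostderr --dbfilename= --proactor_threads=32
```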

## Dragonfly on `c6gn.12xlarge`

### Write-only test
On the loadtest instance (`c7gn.16xlarge` with 64 vCPUs):

```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

The run ended with the following summary:

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 4195628.23 --- --- 0.39283 0.37500 0.68700 2.54300 323231.06

```

In this test, we reached almost 4.2M queries per second (QPS) with an average latency of
0.4ms between `memtier_benchmark` and Dragonfly. The P50 latency was 0.38ms, the P99 was 0.69ms,
and the P99.9 was 2.54ms. It is a very short and simple test, but it still gives some perspective
on Dragonfly's performance.

### Read-only test
Without flushing the database:
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

Note that the ratio changed to "0:1", meaning only `GET` commands and *no* `SET` commands.

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 4109802.84 4109802.84 0.00 0.40126 0.38300 0.67900 0.90300 296551.68
```

We can observe that `Ops/sec` and `Hits/sec` are identical, meaning every GET request
coming from the load test hit an existing key.
Dragonfly returned a value for each request, averaging 4.1M QPS
with a P99.9 latency of 903us (less than one millisecond).

### Read test with pipelining

Here's another way to load test Dragonfly: sending `GET`s with a pipeline (`--pipeline`)
of batch size 10. Pipelining means that the client sends multiple commands (10 in this case)
and only then waits for the responses.

```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
```

```
ALL STATS
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 7083583.57 7083583.57 0.00 0.45821 0.44700 0.69500 1.53500 511131.14
```

In pipelining mode, `memtier_benchmark` sends `K` requests (10 in this case) in a batch before waiting
for them to complete. Pipelining reduces the CPU time spent
in the networking stack, and as a result, Dragonfly can reach 7M QPS with sub-millisecond latency.
Please note that for real-world use cases, pipelining requires the cooperation of the client-side application,
which must send multiple requests on a single connection before waiting for the server to respond.

Some asynchronous client libraries, such as `StackExchange.Redis` or `ioredis`, multiplex requests
over a single connection. They can still provide a simplified synchronous interface to their users while
benefiting from the performance improvements of pipelining.
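
As a minimal illustration of the idea, assuming `redis-cli` is installed on the client and
using hypothetical key names, pipe mode writes a batch of commands over one connection
before reading any replies:

```
# send 10 GETs over a single connection in one batch;
# replies are read only after the whole batch has been written
for i in $(seq 1 10); do echo "GET key:$i"; done | redis-cli -h $SERVER_PRIVATE_IP --pipe
```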

## Load testing Dragonfly on `c7gn.12xlarge`

Next, we tried running Dragonfly on the next-generation instance (`c7gn`) with the same number of vCPUs (48).
We used the same `c7gn.16xlarge` client to run `memtier_benchmark`, and we used the same commands
to test writes, reads, and pipelined reads:

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 5.2M | 260 | 631 |
| Read-Only | 6M | 271 | 623 |
| Pipelined Read| 8.9M | 323 | 839 |


### Writes
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 5195097.56 --- --- 0.26012 0.24700 0.49500 0.63100 400230.15
```

### Reads
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 6078632.89 6078632.89 0.00 0.27177 0.26300 0.49500 0.62300 438616.86
```

### Pipelined Reads
```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
```

```
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 8975121.86 8975121.86 0.00 0.32325 0.31100 0.52700 0.83900 647619.14
```


## Comparison with Garnet
Microsoft Research recently released [Garnet](https://github.com/microsoft/garnet),
a remote cache store. Due to interest within the Dragonfly community, we decided to compare
Garnet's performance with Dragonfly's. This comparison focuses on performance results
and does not delve into architectural differences or Redis compatibility implications.

*Note:* Unfortunately, Garnet does not provide an aarch64 build,
so we ran both Garnet and Dragonfly on an x86_64 server
(`c6in.12xlarge`). We ran Garnet via Docker with host networking enabled, using the command
`docker run --network=host ghcr.io/romange/garnet:latest --port=6379`.
The Docker container was built using the Garnet Dockerfile for Ubuntu, located in their
repository.

### Garnet on `c6in.12xlarge`

As in the previous tests, we ran `memtier_benchmark` on `c7gn.16xlarge`, with the `cluster`
placement policy applied to both instances. For writes, we used the following command:

```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
```

Similarly, for reads we used
```
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
```

and for pipelined reads we used

```
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 2000000 --distinct-client-seed --hide-histogram --pipeline=10
```

Note that we increased the number of requests to `2000000` per client connection in the latter case: the pipelining QPS is so high that `200000` requests finish too quickly.
**Results**:

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 3.5M | 346 | 4287 |
| Read-Only | 3.7M | 327 | 2623 |
| Pipelined Read| 25.4M !!! | 119 | 375 |


The interesting part is the pipelined reads, where Garnet scaled linearly to more than 25M QPS,
which is really impressive performance.

On the other hand, a curious and random finding: a single `dbsize` command took 3 seconds
to run on Garnet.

### Dragonfly on `c6in.12xlarge`

We ran Dragonfly on the same instances with the same test configurations.
Below are the results for Dragonfly.

| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---------------|-----------|-------------------|--------------------|
| Write-Only | 3.6M | 291 | 6815 |
| Read-Only | 5.1M | 299 | 7615 |
| Pipelined Read| 6.9M | 358 | 1127 |

As you can see, Dragonfly shows comparable throughput for non-pipelined access,
but its P99.9 latency was worse. For pipelined commands, Dragonfly had 3.7x less throughput than Garnet.
5 changes: 4 additions & 1 deletion docs/getting-started/getting-started.md
@@ -25,4 +25,7 @@ This helps guarantee the reliability and stability of the software.
# OS Compatibility
Dragonfly is compatible with Linux versions 4.14 or later.
However, to achieve optimal performance, it is recommended to run Dragonfly on kernel version 5.10 or later.
The Dragonfly build environment is based on Ubuntu 20.04.

# Benchmarking Dragonfly
Learn [how to measure the performance of Dragonfly](./benchmark.md) in a cloud environment.