Dragonfly is about 10x slower than Redis when used by JuiceFS #1285
Comments
@zhijian-pro how many CPUs do you have on your machine? Is your memory store supposed to run locally, i.e. colocated with JuiceFS? |
Hi, thanks for the clear reproduction instructions! I've looked at the latency problem and could reproduce it. I saw constant latencies of ~41ms, not spikes. What I saw looked like some kind of network misconfiguration in the TCP socket that causes packets to be delayed until Dragonfly gets an ACK from JuiceFS: when I switched to using unix sockets I saw latencies of ~700us and no spikes.
So that's a temporary workaround and, if you plan to deploy locally, probably a better alternative anyway. We'll try to understand if that's some socket configuration problem we can fix ourselves. |
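For reference, connecting through a unix socket from redis-py looks like the sketch below; the socket path is a placeholder and assumes the server was started listening on a unix socket at that location.

import time
import redis

# Placeholder path: assumes Dragonfly was started listening on this unix socket.
r = redis.Redis(unix_socket_path="/tmp/dragonfly.sock")

start = time.time()
r.set("val1", "whatever")
print(f"SET latency: {1000 * (time.time() - start):.3f} ms")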
Alternatively, giving dragonfly |
@romange In fact, this phenomenon was discovered during a multi-machine test where Redis and Dragonfly were running simultaneously on a 4c8g Linux machine and the JuiceFS test program was running on another 4c8g machine. I simplified the setup for that test to make it easier for everyone to reproduce the problem. But I can confirm that Redis, Dragonfly, and JuiceFS all have enough CPU and memory; this should not be a resource allocation issue. |
@royjacobson I tested it, and after adding
|
For the record, the latency issue is reproducible with the following snippet:

import time, redis

r = redis.Redis()

def test_multi():
    r.execute_command("MULTI")
    r.execute_command(f"SET val1 whatever")
    start = time.time()
    r.execute_command("EXEC")
    lat_ms = 1000 * (time.time() - start)
    print(f"Latency: {lat_ms}") |
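To tell a constant per-request overhead apart from occasional spikes, the same transaction can be run in a loop and the latency distribution printed. This is just an illustrative extension of the snippet above; the port is an assumption, matching the Dragonfly port used in the reproduction steps.

import time
import redis

r = redis.Redis(port=6378)  # assumed Dragonfly port from the repro steps

latencies = []
for _ in range(1000):
    r.execute_command("MULTI")
    r.execute_command("SET val1 whatever")
    start = time.time()
    r.execute_command("EXEC")
    latencies.append(1000 * (time.time() - start))

latencies.sort()
print(f"median: {latencies[500]:.3f} ms, p99: {latencies[990]:.3f} ms, max: {latencies[-1]:.3f} ms")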
Fixes #1285 Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Also pull the latest helio. Fixes #1285 Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange @royjacobson I tested it manually using the main branch, but I don't think it works well; there are still very slow requests.
Maybe I didn't compile it correctly?
|
Thanks @zhijian-pro, it's very helpful. The small latency increase might be explained by lack of link-time-optimization that we use in the release binaries, but the fact that you still see latency spikes is concerning. Since I couldn't reproduce the exact latency patterns locally, I think the difference might be in the networking setup you use. If it's not too much trouble, do you think you could do a network capture of the traffic in the benchmark? I think it should be something like this
|
This removes the "25" limit for batched messages. Turns out the aggregation in #1287 was not aggressive enough, because it's quite possible to reach the specified max capacity of io vectors. For example, each "QUEUED" is actually "+", "QUEUED", "\r\n", so we can reach the limit with about 8 batched commands and then finish aggregating prematurely. Closes #1285
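A quick sanity check of the arithmetic in that commit message (the numbers come from the message itself, not from the code):

# Each "QUEUED" reply is sent as three io vectors: "+", "QUEUED", "\r\n".
IOVECS_PER_QUEUED = 3
OLD_IOVEC_LIMIT = 25  # the batching limit removed by this change

# Number of batched commands whose replies still fit under the old limit:
print(OLD_IOVEC_LIMIT // IOVECS_PER_QUEUED)  # -> 8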
@royjacobson they said it's still not stable |
@romange is it fixed? |
No, because we have not received any new information. Our tests do not show any significant latency. |
@romange Sorry for bothering, but I'm confused by the difference between your test and @zhijian-pro's test. juicedata/juicefs#3363 (comment) |
I'm going to retest this problem. |
2023/08/30 Test results:
Dragonfly without tcp_nodelay:
Dragonfly with tcp_nodelay:
When Dragonfly is used as the metadata engine for JuiceFS, tcp_nodelay must be used. |
Hi @zhijian-pro, thanks again for your great help reporting and debugging this issue. I've tried to reproduce now, both with an up-to-date version and with the release binary from here and unfortunately I couldn't reproduce the latency behavior you describe. Could you maybe record and upload a new |
Also, can you please describe your OS (kernel, distro etc). |
@royjacobson @romange I'm using a c5d.xlarge from AWS.
No network changes have been made. |
I just tested it and found that your pre-compiled 1.8 version works fine without using tcp_nodelay. Is it possible that my main branch compilation is incorrect? |
@zhijian-pro what's |
Huh. I think I just reproduced when the latency issue happens? I don't think it's a bug in the test, but now that I can reproduce it I'm optimistic about solving it soon :) |
git log
There is no log output when the test code is run.
|
That's why I noted the second run. However, when I use your pre-compiled version 1.8, it runs very well without tcp_nodelay. |
This branch is based on the test_dragonfly changes; it disables the Lua script. Use this branch so that the test case can reproduce the high latency problem every time (using my own compiled binary). This may be helpful to you.
|
The 40ms problem might be caused by a bad interaction between Nagle's algorithm and delayed ACKs. |
@pveentjer indeed, that's why we actually flipped the tcp_nodelay default |
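For readers unfamiliar with the mechanism: Nagle's algorithm holds back small writes until outstanding data is ACKed, and the receiver's delayed-ACK timer (typically ~40ms on Linux) can postpone that ACK, which matches the ~40ms latencies seen here. Disabling Nagle with TCP_NODELAY makes small writes go out immediately. A minimal client-side sketch with a raw socket follows; the address and port are illustrative assumptions.

import socket

# Illustrative endpoint; the point is the TCP_NODELAY socket option itself.
sock = socket.create_connection(("127.0.0.1", 6378))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle

# Send PING as a RESP array and read the reply.
sock.sendall(b"*1\r\n$4\r\nPING\r\n")
print(sock.recv(64))  # expected: b"+PONG\r\n"
sock.close()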
JuiceFS supports using Redis as its metadata engine, and we had feedback from the community that they wanted to replace Redis with Dragonfly. After testing, however, I found that Dragonfly is very unstable: it is about 10 times slower than Redis (comparing mean latencies). I'm not sure whether this is a problem with my usage or with Dragonfly itself, so I would like to ask for help determining the cause.
As you can see on this graph, Dragonfly's performance is similar to that of Redis under normal circumstances, but there are often exceptionally slow requests, resulting in very large latency fluctuations.
To Reproduce
./dragonfly --logtostderr -dbnum 10 --bind 127.0.0.1 --port 6378
git clone -b test_for_dragonfly https://github.com/juicedata/juicefs.git && cd juicefs
go mod tidy
addr=redis://127.0.0.1:6379/1 go test -count=1 -v ./pkg/meta/... -run=TestDgfAndRedis
addr=redis://127.0.0.1:6378/1 go test -count=1 -v ./pkg/meta/... -run=TestDgfAndRedis
Expected behavior
Performance on the same order of magnitude as redis or better
Environment (please complete the following information):
Linux bench2 5.4.0-1029-aws #30-Ubuntu SMP Tue Oct 20 10:06:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
dragonfly v1.3.0-f80afca9c23e2f30373437520a162c591eaa2005
build time: 2023-05-18 07:11:31