Improve performance of persistent write-ahead log #57
@potrusil-osi, you were right about |
@sakno, it is unfortunate that I cannot pass the snapshot to the interpreter anymore. I currently use it to store the in-memory state, and the snapshot command is how it is often initialized. It is easy and elegant. Isn't there a way to work around the limitation? |
Could you please provide a simple example of how you use a custom command type for the snapshot? Do you serialize the identifier manually inside of |
@potrusil-osi , support of snapshot entries is back. Mark the handler as a snapshot handler:
[CommandHandler(IsSnapshotHandler = true)]
public ValueTask HandleSnapshot(SnapshotCommand command, CancellationToken token)
{
    // restore the in-memory state from the snapshot command here
    return ValueTask.CompletedTask;
}
Thanks for such a fast response! I guess you don't want me to give you the example anymore... But no, I was not serializing the identifier inside.
After getting the latest changes (and updating the CommandHandler for the snapshot) I'm consistently getting this exception in a follower:
The |
Physical layout of log entries on the disk in 3.1.0 is not compatible with previous versions. As a result, if new |
I delete the logs every time before my tests. I'll keep digging myself to find out what could be causing it... |
Oh, the binary form of data representation is efficient when transferring over the wire but very fragile and requires a lot of testing... I fixed an issue that caused invalid calculation of |
The last fix seems to resolve the error. But I'm getting new errors now. This one occurs periodically on the node that is the very first leader:
And this occurs on startup if I do not clear the logs from the previous run:
|
Fixed. Command id was not passed over the wire correctly. |
Everything works well now. Thanks! |
@sakno, after all the optimizations I'm getting much better numbers... The suggestion to have a buffer with a different size during compaction (described in #56) still holds. I have modified dotNext locally to give me the option, and writing to the snapshot file is much faster thanks to that. I set the buffer to approximately the size of the snapshot. |
Great news! Currently I'm working on buffered network I/O that allows copying log entry content passed over the network before calling |
Did you change the buffer size passed to |
I'm getting about 50% performance increase without that flag! I wonder if I need to test everything again to figure out the ideal buffer sizes. |
I guess that without the WriteThrough flag the size of the buffer has little impact on performance. |
Now |
Buffer sizes are now separated in |
@potrusil-osi , I've finished the buffering API for log entries. It has two use cases:
In case of the Interpreter Framework, you can use it easily as follows:
var entry = interpreter.CreateLogEntry<MyCommand>(...);
using var buffered = await BufferedRaftLogEntry.CopyAsync(entry, options, token);
await log.AppendAsync(buffered, token);
This makes sense only for large log entries. Also, you need to experiment with settings from |
Now I'm halfway to a new version of
At the moment I have implemented one innovative thing: a concurrent sorted linked list for partitions. This hand-written data structure is much better than
The remaining half is about writing a special lock type on top of |
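For illustration, here is a minimal sketch of the idea behind such a structure (illustrative only, not the actual dotNext implementation): readers traverse the list lock-free through volatile links, while writers are serialized and publish only fully initialized nodes, so a reader can never observe a broken list.

sealed class PartitionList
{
    private sealed class Node
    {
        internal readonly long PartitionNumber;
        internal volatile Node? Next;
        internal Node(long number) => PartitionNumber = number;
    }

    private volatile Node? head;
    private readonly object writeLock = new();

    // Lock-free lookup: follows volatile links only; safe because a node
    // becomes reachable only after it is fully constructed.
    public bool Contains(long number)
    {
        for (var current = head; current is not null; current = current.Next)
        {
            if (current.PartitionNumber == number)
                return true;
            if (current.PartitionNumber > number)
                break; // the list is sorted, no need to scan further
        }

        return false;
    }

    // Writers are serialized by a single lock; insertion keeps the list sorted.
    public void Add(long number)
    {
        lock (writeLock)
        {
            Node? previous = null, current = head;
            while (current is not null && current.PartitionNumber < number)
            {
                previous = current;
                current = current.Next;
            }

            var node = new Node(number) { Next = current };
            if (previous is null)
                head = node;
            else
                previous.Next = node;
        }
    }
}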
@potrusil-osi , everything is ready for beta-testing. It was really challenging 😎 There are a few options for background compaction available (see the sketch after this comment):
Option # 1: By default, the worker implements incremental compaction. It means that only one partition will be compacted at a time. It gives you the best performance with minimal interference with writes. However, this approach has a drawback: if there are too many writes, then disk space will grow faster than compaction can reclaim it. To replace incremental compaction, you need to implement
Option # 2:
I would appreciate it if you are able to contribute a good benchmark. Unfortunately, BenchmarkDotNet is not applicable for load testing. Also, it would be great if you could share the final measurements of your application with background compaction. |
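As a rough sketch of how an application might drive background compaction itself (ForceCompactionAsync and its signature reflect the API discussed in this thread, but verify them against the actual PersistentState class):

using System;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class CompactionWorker
{
    // Hypothetical incremental compaction loop: compact one partition per
    // pass and back off between passes to minimize interference with writes.
    internal static async Task RunAsync(PersistentState wal, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            await wal.ForceCompactionAsync(1L, token);
            await Task.Delay(TimeSpan.FromSeconds(5), token);
        }
    }
}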
As an option, EWMA can be used as the basis for an adaptive background compaction algorithm. During the application lifetime, the algorithm can compute an exponentially weighted average of the compaction factor and pass it to |
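A minimal sketch of that idea (the estimator class is hypothetical, not dotNext API): it maintains EWMA(t) = alpha * x(t) + (1 - alpha) * EWMA(t - 1) over the observed compaction factors.

sealed class CompactionFactorEstimator
{
    private readonly double alpha; // smoothing factor in (0, 1]
    private double average;
    private bool initialized;

    public CompactionFactorEstimator(double alpha = 0.3D) => this.alpha = alpha;

    // Feed one observed compaction factor, get the updated weighted average.
    public double Update(double observed)
    {
        average = initialized ? alpha * observed + (1D - alpha) * average : observed;
        initialized = true;
        return average;
    }
}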
hey @sakno, thank you for so much work! I'll try to take advantage of the most recent changes as soon as possible. I have two questions:
|
|
Another compaction mode has been added in the latest commit. Now there are three compaction modes:
|
Also, I have an idea about your second question. I can provide an optional interface called |
I think I didn't explain my use case very well. The biggest pain point is that when the compaction is executed on the background there are many partition files full of add operations (the snapshot file is quite small). When clear is committed and the compaction is triggered, almost all of the partitions (including the previous snapshot) can be dropped, except add entries in the last few partitions (which are squashed into a relatively small new snapshot). To support that specific use case, the state implementation would probably need to be able to set custom metadata for the snapshot and all the partitions, access it during compaction, and influence the compaction algorithm to skip/delete certain partition files based on custom logic. But I'm not sure all this would make any sense for a general purpose WAL. |
Hi @sakno, I like both optimizations! Hopefully I will have time to take advantage of them before the research project this is part of is finished. |
Here is a benchmark of various
You can try this benchmark on your machine. Run |
My results are not as good as yours...
|
The actual result depends on the hardware. In case of |
@potrusil-osi , do you have an estimate of the time needed to complete your research work? I'm ready to publish a release. |
We have two more weeks. After the last two optimizations you provided, the local results on my computer are pretty impressive. In our testing cluster the results are not as great, so we are digging deeper into what the bottlenecks are. |
@sakno, back to your analysis of the benchmark results: I also have a NVMe SSD drive (Samsung PM981 1TB). Any thoughts why my results are so much worse than yours? Are you running the benchmark in Windows? |
No, I ran these tests on Linux. But I have a machine with Windows 10:
The results are close to yours. But the Windows 10 machine has a SATA SSD instead of an NVMe SSD installed on PCIe. I think your laptop has two drives: a system partition running on NVMe and a user partition running on SATA. The benchmark creates the WAL in the temp directory, and with two drives this directory is usually mounted on the SATA SSD by default. Also, in your case |
There is only one drive. But after disabling half the services the results are a bit better:
|
In the benchmark I didn't try to choose optimal configuration parameters. However, I changed the default buffer size to 4096 bytes, which is equal to the filesystem cluster size, and now the results are much better on SATA SSD but remain the same for NVMe. Pull the latest changes from |
After getting the latest
|
Thanks for sharing the results! Also, I managed to shave off a few microseconds by optimizing hot paths in the code within |
Hi @sakno, I am a colleague of @potrusil-osi who is out of the office this week. We have deployed our application to a cluster on dedicated hardware to continue testing. Unfortunately, we are not seeing the same degree of benefit from the optimizations that have been seen when testing locally, as our overall throughput numbers are only about 80% of what was achieved in the local testing. In particular, we are seeing two issues that we would like to get your thoughts on.
We are seeing that the write lock acquisition in
In the |
Hi @afink-osi , |
It depends on the number of parallel writes. For instance, if you have 36 parallel writes and each write operation takes ~1 ms, then the last writer needs to wait ~36 ms. |
Additionally, there is another project, FASTER, that offers a general-purpose, high-performance WAL. It might be a good candidate for Raft but requires a lot of research. |
@afink-osi , @potrusil-osi , let's try again. What I've done: a cached log entry will not be persisted until commit. As a result, |
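Conceptually, "persist on commit" turns the append path into a pure in-memory operation and pays the disk I/O once per commit. A simplified sketch of the idea (not the dotNext code; the class and its members are illustrative):

using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

sealed class DeferredWal
{
    private readonly List<byte[]> entries = new();
    private readonly Stream log;
    private long persistedIndex = -1;

    public DeferredWal(Stream log) => this.log = log;

    // Append is a pure in-memory operation; no disk I/O on the hot path.
    public long Append(byte[] entry)
    {
        entries.Add(entry);
        return entries.Count - 1;
    }

    // Disk I/O happens once, at commit time, for the whole uncommitted range.
    public async ValueTask CommitAsync(long index, CancellationToken token)
    {
        for (var i = persistedIndex + 1; i <= index; i++)
            await log.WriteAsync(entries[(int)i], token);

        await log.FlushAsync(token); // one flush per commit instead of per append
        persistedIndex = index;
    }
}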
Hi @sakno, thanks for the quick response and for the clarification! Yesterday when running tests, we were seeing replication throughput at ~28% of the throughput of our baseline (the application running without replication). Today, we were able to run some tests using the same storage volume as before but with your latest changes, and the results are very impressive! We are now running at about 65% of the baseline, which is much more in line with the local tests that @potrusil-osi was running before, and we are seeing minimal time waiting for locks. We also tested using memory instead of the SSD disk (which may be acceptable for our use case) for log storage and pushed this further to ~72% of the baseline. One more question: In the new method |
I missed a call to |
Also, I added a special |
@potrusil-osi , @afink-osi , I've added support for diagnostic events to the persistent WAL. You can configure them via
You can use the dotnet-counters tool from the .NET SDK to track the counters or submit the values to a monitoring tool. |
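For example, assuming <PID> is the id of your application's process (dotnet-counters ps lists the candidates):

dotnet-counters monitor --process-id <PID>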
@potrusil-osi , @afink-osi , I would like to ask you guys to check |
@sakno The new counters look very helpful. I didn't get a chance to try them yet but will hopefully look at them tomorrow. We were able to test the new changes to |
Nice to hear that, @afink-osi ! Let me know when you check these counters. |
@afink-osi , @potrusil-osi , did you have a chance to finalize your research project? If so, I would like to release a new version of .NEXT. |
@sakno We have completed our research project; unfortunately, we also did not have time to dig into the counters that you added before our time window closed. Thank you for your quick and thorough responses and improvements throughout this work! You may proceed with your release |
Hi @sakno, I also wanted to thank you for all the work. It's been great experience to work with you on improving the WAL performance! |
.NEXT 3.1.0 has been released. Closing issue. |
See suggestions in #56.
Tasks:
- Add SkipAsync to IAsyncBinaryReader
- Refactor PersistentState in a way that allows doing the compaction in parallel with writes or reads
- fsync for every write must be disabled by default (FileOptions.WriteThrough flag)
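For context, a tiny BCL illustration of the WriteThrough flag named in the last task (the file path is a placeholder): opening a stream with FileOptions.WriteThrough forces every write through the OS write-back cache to the device, which is the per-write fsync behavior the task proposes to disable by default.

using System.IO;

class WriteThroughDemo
{
    static void Main()
    {
        // With FileOptions.WriteThrough each Write goes to the device;
        // replace it with FileOptions.None to get cached (faster) writes.
        using var stream = new FileStream(
            "wal.bin", FileMode.Append, FileAccess.Write, FileShare.Read,
            bufferSize: 4096, options: FileOptions.WriteThrough);
        stream.Write(new byte[] { 1, 2, 3 }, 0, 3);
    }
}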