
Allow to set max operation numbers in a single rocksdb batch #4044

Merged (1 commit) on Aug 14, 2023

Conversation

@zymap (Member) commented Aug 7, 2023


Motivation

In RocksDB, memory usage is related to batch size: the more operations in a single batch, the more memory is consumed. Expose the configuration to allow controlling the batch size.

@zymap zymap added this to the 4.17.0 milestone Aug 7, 2023
@zymap zymap self-assigned this Aug 7, 2023
---

## Motivation

In RocksDB, the memory usage is related to the batch size.
The more operations in a single batch, the more memory is consumed.
Expose the configuration to allow controlling the batch size.
@zymap zymap force-pushed the expose-rocksdb-batch-settings branch from 157d701 to 0c20fbe Compare August 7, 2023 10:22
@@ -548,17 +560,31 @@ public void remove(byte[] key) throws IOException {
    @Override
    public void clear() {
        writeBatch.clear();
        batchCount = 0;
Member:

If the user invokes flush() directly, we should also reset the batchCount.

Member Author:

I don't think so. Users can flush the same batch multiple times. Even if you flush the batch, its contents are not cleaned up. If we reset the count after flushing, it would no longer match the batch's contents. So I follow the API: the count is reset where the batch itself is reset.
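For readers less familiar with the RocksDB WriteBatch semantics under discussion, a minimal standalone sketch (illustrative only; the database path and class name are made up and are not part of this PR) shows that writing a batch to the DB does not empty it, which is why an operation counter tied to the batch contents is only reset in clear():

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class WriteBatchFlushDemo {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/write-batch-demo");
             WriteBatch batch = new WriteBatch();
             WriteOptions writeOpts = new WriteOptions()) {
            batch.put("k1".getBytes(), "v1".getBytes());
            db.write(writeOpts, batch);          // "flush": apply the batch to the DB
            System.out.println(batch.count());   // still 1: writing does not empty the batch
            batch.clear();                       // only clear() empties it
            System.out.println(batch.count());   // now 0
        }
    }
}
```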

Member:

If a manual flush has completed and the batchCount then reaches the batchSize, does that mean the operations already in this batch will be applied to the database twice?

batch.put(toArray(6), toArray(6));
assertEquals(1, batch.batchCount());

batch.flush();
Member:

Here flush() is triggered directly, so the batchCount should be 0.

    }

    @Override
    public void deleteRange(byte[] beginKey, byte[] endKey) throws IOException {
        try {
            writeBatch.deleteRange(beginKey, endKey);
            countBatchAndFlushIfNeeded();
@horizonzy (Member) Aug 8, 2023:

We need to discuss whether deleteRange should also be performed within a batch. A deleteRange operation can potentially delete a large number of keys. If we include multiple deleteRange operations within a single batch, I am concerned that it may introduce issues.

The previous core dump #3734 issue could be related to this.

Member Author:

That should be another issue. This PR aims to limit the batch size. We can handle that with another PR.

@zymap (Member Author) commented Aug 8, 2023:

I tested with this code to verify the batch's impact on memory.
https://gist.github.com/zymap/19249ab35bb0f64c55cbf7f2e8356cb3
I found that memory keeps increasing with the batch size. And if the batch is not flushed into the SST files, it is saved into the WAL file, and the WAL file is not limited by max_total_wal_size.
If the bookie is OOM-killed because of a large batch that was saved in the WAL, the only way to reopen RocksDB is to give the bookie more memory.
I also discussed this issue with the RocksDB community; they said:

when the batch size is so large (esp if you run multiple batches together) the wal size may reach (limit + batch-size * number of open batches). We have a project opened by our friends from Kafka streams to handle huge batch size. In the meanwhile can you restrict the size of your batch ?

--
In Pulsar, the compacted ledger has no rollover or retention policy. If the user has tons of keys in the compaction, the compacted ledger grows bigger and bigger. In our environment, a compacted ledger reached 200 GB. It contains lots of entries in a single ledger, which makes the batch very large.
In release 4.14.7 and branch-4.15, we didn't limit the number of delete operations in a single batch:

for (long entryId = firstEntryId; entryId <= lastEntryId; entryId++) {

Finally, when the bookie runs garbage collection and removes the ledger, it gets OOM-killed because of the large batch.

Pulsar already has a proposal about configuring the compacted topic ledger retention, apache/pulsar#19665.
But I think we also need a way to control the batch size, so that we have a way to limit the memory usage.
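To make that concrete, here is a hedged sketch of what such a count-and-flush guard can look like, reusing the countBatchAndFlushIfNeeded() and batchCount names visible in the diff hunks of this PR; the threshold field name is an assumption for illustration, not necessarily the merged code:

```java
// Sketch only: each mutating call (put, remove, deleteRange) invokes this guard
// after appending its operation to the shared writeBatch.
private int batchCount = 0;
private final int maxBatchOperations; // the newly exposed configuration value (name assumed)

private void countBatchAndFlushIfNeeded() throws IOException {
    batchCount++;
    if (batchCount >= maxBatchOperations) {
        flush();  // write all pending operations to RocksDB in one shot
        clear();  // empty the batch, which also resets batchCount to 0
    }
}
```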

@@ -540,6 +551,7 @@ public void put(byte[] key, byte[] value) throws IOException {
    public void remove(byte[] key) throws IOException {
        try {
            writeBatch.delete(key);
            countBatchAndFlushIfNeeded();
Contributor:

When we try to remove a key from the writeBatch, but the writeBatch has already been flushed by the put method, will it impact the correctness of the key-value data stored in RocksDB?

@zymap (Member Author) Aug 9, 2023:

It won't. The batch only takes effect when the operations are actually executed in RocksDB; we are just queuing our operations in the batch to avoid calling the real write method multiple times.
See here: https://github.com/facebook/rocksdb/blob/9a034801cead6421bcf82b506b77e3b2251f1edb/include/rocksdb/write_batch.h#L9
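A small illustration of why ordering keeps this correct (a sketch under the assumption that an auto-flush happened between the put and the remove; db and writeOpts are open handles as in the earlier sketch, and this is not the PR's test code): the delete simply lands in a fresh batch and is applied after the earlier write, so the end state is unchanged.

```java
byte[] key = "k".getBytes();

WriteBatch first = new WriteBatch();
first.put(key, "v".getBytes());
db.write(writeOpts, first);    // the batch hit its limit and was auto-flushed

WriteBatch second = new WriteBatch();
second.delete(key);            // the later remove() lands in a fresh batch
db.write(writeOpts, second);

assert db.get(key) == null;    // same end state as put + delete in one batch
```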

@hangc0276 hangc0276 requested a review from merlimat August 9, 2023 04:19
@hangc0276 (Contributor) left a comment:

Nice catch!

@horizonzy (Member) left a comment:

LGTM.

@zymap zymap merged commit ad0ed21 into apache:master Aug 14, 2023
16 checks passed
hangc0276 pushed a commit that referenced this pull request Aug 17, 2023
(cherry picked from commit ad0ed21)
zymap added a commit that referenced this pull request Aug 29, 2023
(cherry picked from commit ad0ed21)
zymap added a commit that referenced this pull request Dec 7, 2023
(cherry picked from commit ad0ed21)
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024