OOM while flushing memtable with range deletes #11407

Open
yao-xiao-github opened this issue Apr 25, 2023 · 2 comments
@yao-xiao-github

Hi, I'm working on FoundationDB and trying to use RocksDB as the underlying storage engine.

An OOM was observed in two scenarios:

  • flushing a memtable with ~1000 range deletes
  • recovering from a WAL containing ~1000 range deletes

Expected behavior

Flushing the memtable should complete without error.

Actual behavior

Out of memory while flushing a memtable with ~1000 range deletes.

Sample heap profile (using massif) during recovery:

->40.27% (2,164,601,464B) 0x4C538C8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| ->19.79% (1,063,988,352B) 0x433856D: rocksdb::FragmentedRangeTombstoneList::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice>, std::default_delete<rocksdb::InternalIteratorBase<rocksdb::Slice> > >, rocksdb::InternalKeyComparator const&, bool, std::vector<unsigned long, std::allocator<unsigned long> > const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| | ->19.79% (1,063,988,352B) 0x4338E1B: rocksdb::FragmentedRangeTombstoneList::FragmentedRangeTombstoneList(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice>, std::default_delete<rocksdb::InternalIteratorBase<rocksdb::Slice> > >, rocksdb::InternalKeyComparator const&, bool, std::vector<unsigned long, std::allocator<unsigned long> > const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |   ->19.79% (1,063,988,352B) 0x432BC33: rocksdb::CompactionRangeDelAggregator::NewIterator(rocksdb::Slice const*, rocksdb::Slice const*, bool) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |     ->19.79% (1,063,988,352B) 0x45860BF: rocksdb::BuildTable(std::string const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::string const*, rocksdb::BlobFileCompletionCallback*, rocksdb::Version*, unsigned long*, unsigned long*, unsigned long*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |       ->19.79% (1,063,988,352B) 0x4295013: rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         ->19.79% (1,063,988,352B) 0x429804F: rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         | ->19.79% (1,063,988,352B) 0x429A183: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |   ->19.79% (1,063,988,352B) 0x4290707: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |     ->19.79% (1,063,988,352B) 0x4292C95: rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |       ->19.79% (1,063,988,352B) 0x1C5DCFD: (anonymous namespace)::ShardManager::init() (KeyValueStoreShardedRocksDB.actor.cpp:748)
| |         |         ->19.79% (1,063,988,352B) 0x1C60036: action (KeyValueStoreShardedRocksDB.actor.cpp:1863)
| |         |           ->19.79% (1,063,988,352B) 0x1C60036: TypedAction<(anonymous namespace)::ShardedRocksDBKeyValueStore::Writer, (anonymous namespace)::ShardedRocksDBKeyValueStore::Writer::OpenAction>::operator()(IThreadPoolReceiver*) (IThreadPool.h:76)
| |         |             ->19.79% (1,063,988,352B) 0x48EEBC1: dispatch (IThreadPool.cpp:51)
| |         |               ->19.79% (1,063,988,352B) 0x48EEBC1: operator() (IThreadPool.cpp:73)
| |         |                 ->19.79% (1,063,988,352B) 0x48EEBC1: asio_handler_invoke<ThreadPool::ActionWrapper> (handler_invoke_hook.hpp:88)
| |         |                   ->19.79% (1,063,988,352B) 0x48EEBC1: invoke<ThreadPool::ActionWrapper, ThreadPool::ActionWrapper> (handler_invoke_helpers.hpp:54)
| |         |                     ->19.79% (1,063,988,352B) 0x48EEBC1: complete<ThreadPool::ActionWrapper> (handler_work.hpp:512)
| |         |                       ->19.79% (1,063,988,352B) 0x48EEBC1: boost::asio::detail::completion_handler<ThreadPool::ActionWrapper, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (completion_handler.hpp:74)
| |         |                         ->19.79% (1,063,988,352B) 0x48EDB65: complete (scheduler_operation.hpp:40)
| |         |                           ->19.79% (1,063,988,352B) 0x48EDB65: boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) (scheduler.ipp:492)
| |         |                             ->19.79% (1,063,988,352B) 0x48EDF99: run_one (scheduler.ipp:231)
| |         |                               ->19.79% (1,063,988,352B) 0x48EDF99: run_one (io_context.ipp:78)
| |         |                                 ->19.79% (1,063,988,352B) 0x48EDF99: ThreadPool::Thread::run() (IThreadPool.cpp:43)
| |         |                                   ->19.79% (1,063,988,352B) 0x48EE488: ThreadPool::start(void*) (IThreadPool.cpp:54)
| |         |                                     ->19.79% (1,063,988,352B) 0x8095EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
| |         |                                       ->19.79% (1,063,988,352B) 0x83A8B2C: clone (in /usr/lib64/libc-2.17.so)
| |         |                                         
| |         ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
| |         

Steps to reproduce the behavior

I reproduced it in our unit tests, but haven't had a chance to create a RocksDB test yet.
Here is what I did to reproduce the issue.
On the flush path

  1. Open the database
  2. Create a new CF
  3. Generate range deletions, e.g.
for (int i = 0; i < numRangeDeletions; ++i) {
  writeBatch->Put(cf, prefix + std::to_string(i + 1), value);
  // Every range deletion starts at `prefix`, so they all overlap.
  writeBatch->DeleteRange(cf, prefix, prefix + std::to_string(i));
  db->Write(options, writeBatch);
  writeBatch->Clear();  // reset the batch so each write applies one Put/DeleteRange pair
}
  4. Issue a flush to the CF: db->Flush(options, cf);
  5. Depending on the memory limit of the process, you may get an OOM with fewer range deletes

On the recovery path
Steps 1-3 are the same as above.
4. Close the database
5. Reopen the database
6. OOM during recovery

I believe the flush and the recovery actually use the same code path; both OOMed when creating the FragmentedRangeTombstoneList.

I tried issuing a flush to the CF after a certain number of range deletes (e.g. 1000), which seems to resolve the OOM. However, there are no metrics tracking the number of range deletes in a memtable, so I have to count the deletes sent to a CF myself, which is inaccurate because some deletes may have been flushed already. A rough sketch of this workaround is below.
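For illustration, here is a minimal sketch of the counting workaround (the helper class and threshold are mine, not part of our codebase or of RocksDB; the count is approximate for the reason above):

#include <atomic>
#include "rocksdb/db.h"

// Hypothetical helper: count DeleteRange calls per column family and force a
// flush once a threshold is crossed. The counter is only approximate, since a
// background flush may already have persisted some of the counted deletions.
class RangeDeleteTracker {
 public:
  explicit RangeDeleteTracker(size_t threshold) : threshold_(threshold) {}

  rocksdb::Status DeleteRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                              const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
    rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(), cf, begin, end);
    if (s.ok() && ++count_ >= threshold_) {
      count_ = 0;
      // Flushing bounds the number of range tombstones per memtable.
      s = db->Flush(rocksdb::FlushOptions(), cf);
    }
    return s;
  }

 private:
  const size_t threshold_;
  std::atomic<size_t> count_{0};
};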

Another thing I tried was reopening the database using ldb tools; it also OOMed during recovery. The issue was originally discovered in 7.2.6. After that, we tried many newer versions, including 8.1.1, and still see the same issue.

@cbi42 cbi42 self-assigned this Apr 25, 2023
@cbi42 cbi42 added the performance Issues related to performance that may or may not be bugs label Apr 25, 2023
@cbi42
Member

cbi42 commented Apr 25, 2023

Hi, thanks for reporting the issue. Currently, FragmentTombstones() can take more memory if the input range tombstones are overlapping; in your case, where all range tombstones overlap, the memory overhead is roughly O(N^2). I doubt the threshold is 1000 range deletions, though; I tried the following test and it consumes about 40MB of memory. It's probably closer to 10k, depending on the amount of available memory.

...
const int kNumRangeDel = 1000;
for (int i = 0; i < kNumRangeDel; ++i) {
  ASSERT_OK(Put(Key(i + 1), std::to_string(i + 1)));
  // All range deletions are overlapping
  ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
                             Key(0), Key(i)));
}
// Flush will trigger fragmentation, which could OOM
ASSERT_OK(Flush());
...
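To get a feel for the O(N^2) overhead, here is a rough back-of-the-envelope model (not RocksDB code; the per-pair bookkeeping cost and the constants are assumptions): fragmenting N fully overlapping tombstones produces about N non-overlapping fragments, and fragment j is covered by roughly N - j of the original tombstones, each contributing a sequence number plus some key bookkeeping.

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t kNumRangeDel = 10000;  // roughly where the OOM was observed
  const uint64_t kSeqnoBytes = 8;       // one sequence number per covering tombstone
  const uint64_t kPairBytes = 24;       // assumed bookkeeping (e.g. a small heap string) per pair

  uint64_t total = 0;
  for (uint64_t j = 0; j < kNumRangeDel; ++j) {
    // Fragment j is covered by ~(kNumRangeDel - j) tombstones; each covering
    // tombstone contributes a seqno plus the assumed per-pair bookkeeping.
    total += (kNumRangeDel - j) * (kSeqnoBytes + kPairBytes);
  }
  // ~N^2/2 pairs: for N = 10k that's ~5e7 pairs, i.e. on the order of a
  // gigabyte before allocator overhead, the same order as the heap profile.
  printf("estimated fragmentation memory: %.1f GB\n",
         total / (1024.0 * 1024.0 * 1024.0));
  return 0;
}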

There are some potential ways to optimize this that I plan to work on. In the meantime, as a workaround, are you able to issue non-overlapping range deletions?

However, there's no metrics to track the number of range deletes in a memtable. I have to count the deletes sent to a CF, which is inaccurate because some deletes may have been flushed already.

You are right that manually tracking the count can sometimes be inaccurate, but a manual flush should still help. There is an open PR (#11358) that adds this functionality as an option.
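If that PR lands as proposed, usage might look roughly like the sketch below; the option name (memtable_max_range_deletions) is taken from the PR and could change before it ships:

#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Assumed option from #11358: flush a memtable once it accumulates this
  // many range deletions, bounding FragmentTombstones() work at flush time.
  options.memtable_max_range_deletions = 1000;
  return options;
}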

@yao-xiao-github
Author

yao-xiao-github commented Apr 26, 2023

Thanks for the reply.

I doubt the threshold is 1000 range deletions tho, I tried the following test and it consumes about 40MB memory. It's probably closer to 10k depending on the amount of available memory.

Right. In the unit test, it OOMed at around 10k deletes. I misread the number when going through the WAL files; the actual count there is also close to 10k.

There are some potential ways to optimize this that I plan to work on. In the mean time, as a work around, are you able to issue non-overlapping range deletions?

It's hard to issue non-overlapping range deletions. We plan to convert some DeleteRange calls to read-and-delete instead (a sketch follows below).
I think #11358 could help with our situation. I also left some comments there.
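The read-and-delete conversion we have in mind is roughly the following (a sketch, not production code; it trades a range scan plus point deletes for avoiding overlapping range tombstones):

#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status ReadAndDelete(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                              const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
  rocksdb::WriteBatch batch;
  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &end;  // stop the scan at the end of the range
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
  for (it->Seek(begin); it->Valid(); it->Next()) {
    batch.Delete(cf, it->key());  // one point delete per existing key
  }
  if (!it->status().ok()) {
    return it->status();
  }
  return db->Write(rocksdb::WriteOptions(), &batch);
}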

Will #11358 be included in the next RocksDB release?
