OOM while flushing memtable with range deletes #11407

Open
yao-xiao-github opened this issue Apr 25, 2023 · 2 comments
@yao-xiao-github

Hi, I'm working on FoundationDB and trying to use RocksDB as the underlying storage engine.

An OOM was observed in two scenarios:

  • flushing a memtable with ~1000 range deletes
  • recovering from a WAL containing ~1000 range deletes

Expected behavior

Flushing the memtable should complete without error.

Actual behavior

Out of memory while flushing a memtable with ~1000 range deletes.

Sample heap profile (using massif) during recovery:

->40.27% (2,164,601,464B) 0x4C538C8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| ->19.79% (1,063,988,352B) 0x433856D: rocksdb::FragmentedRangeTombstoneList::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice>, std::default_delete<rocksdb::InternalIteratorBase<rocksdb::Slice> > >, rocksdb::InternalKeyComparator const&, bool, std::vector<unsigned long, std::allocator<unsigned long> > const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| | ->19.79% (1,063,988,352B) 0x4338E1B: rocksdb::FragmentedRangeTombstoneList::FragmentedRangeTombstoneList(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice>, std::default_delete<rocksdb::InternalIteratorBase<rocksdb::Slice> > >, rocksdb::InternalKeyComparator const&, bool, std::vector<unsigned long, std::allocator<unsigned long> > const&) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |   ->19.79% (1,063,988,352B) 0x432BC33: rocksdb::CompactionRangeDelAggregator::NewIterator(rocksdb::Slice const*, rocksdb::Slice const*, bool) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |     ->19.79% (1,063,988,352B) 0x45860BF: rocksdb::BuildTable(std::string const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::string const*, rocksdb::BlobFileCompletionCallback*, rocksdb::Version*, unsigned long*, unsigned long*, unsigned long*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |       ->19.79% (1,063,988,352B) 0x4295013: rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         ->19.79% (1,063,988,352B) 0x429804F: rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         | ->19.79% (1,063,988,352B) 0x429A183: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |   ->19.79% (1,063,988,352B) 0x4290707: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |     ->19.79% (1,063,988,352B) 0x4292C95: rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (in /home/mstack/fdbserver.7.2_7.10_DEBUG)
| |         |       ->19.79% (1,063,988,352B) 0x1C5DCFD: (anonymous namespace)::ShardManager::init() (KeyValueStoreShardedRocksDB.actor.cpp:748)
| |         |         ->19.79% (1,063,988,352B) 0x1C60036: action (KeyValueStoreShardedRocksDB.actor.cpp:1863)
| |         |           ->19.79% (1,063,988,352B) 0x1C60036: TypedAction<(anonymous namespace)::ShardedRocksDBKeyValueStore::Writer, (anonymous namespace)::ShardedRocksDBKeyValueStore::Writer::OpenAction>::operator()(IThreadPoolReceiver*) (IThreadPool.h:76)
| |         |             ->19.79% (1,063,988,352B) 0x48EEBC1: dispatch (IThreadPool.cpp:51)
| |         |               ->19.79% (1,063,988,352B) 0x48EEBC1: operator() (IThreadPool.cpp:73)
| |         |                 ->19.79% (1,063,988,352B) 0x48EEBC1: asio_handler_invoke<ThreadPool::ActionWrapper> (handler_invoke_hook.hpp:88)
| |         |                   ->19.79% (1,063,988,352B) 0x48EEBC1: invoke<ThreadPool::ActionWrapper, ThreadPool::ActionWrapper> (handler_invoke_helpers.hpp:54)
| |         |                     ->19.79% (1,063,988,352B) 0x48EEBC1: complete<ThreadPool::ActionWrapper> (handler_work.hpp:512)
| |         |                       ->19.79% (1,063,988,352B) 0x48EEBC1: boost::asio::detail::completion_handler<ThreadPool::ActionWrapper, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (completion_handler.hpp:74)
| |         |                         ->19.79% (1,063,988,352B) 0x48EDB65: complete (scheduler_operation.hpp:40)
| |         |                           ->19.79% (1,063,988,352B) 0x48EDB65: boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) (scheduler.ipp:492)
| |         |                             ->19.79% (1,063,988,352B) 0x48EDF99: run_one (scheduler.ipp:231)
| |         |                               ->19.79% (1,063,988,352B) 0x48EDF99: run_one (io_context.ipp:78)
| |         |                                 ->19.79% (1,063,988,352B) 0x48EDF99: ThreadPool::Thread::run() (IThreadPool.cpp:43)
| |         |                                   ->19.79% (1,063,988,352B) 0x48EE488: ThreadPool::start(void*) (IThreadPool.cpp:54)
| |         |                                     ->19.79% (1,063,988,352B) 0x8095EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
| |         |                                       ->19.79% (1,063,988,352B) 0x83A8B2C: clone (in /usr/lib64/libc-2.17.so)
| |         |                                         
| |         ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
| |         

Steps to reproduce the behavior

I reproduced it in our unit tests, but haven't had a chance to create a RocksDB test yet.
Here is what I did to reproduce the issue.
On the flush path

  1. Open the database
  2. Create a new CF
  3. Generate range deletions, e.g.
for (int i = 0; i < numRangeDeletions; ++i) {
  writeBatch->Put(cf, prefix + std::to_string(i + 1), value);
  // Every range deletion starts at `prefix`, so they all overlap.
  writeBatch->DeleteRange(cf, prefix, prefix + std::to_string(i));
  db->Write(options, writeBatch);
  writeBatch->Clear();  // reset the batch so each write applies one Put/DeleteRange pair
}
  4. Issue a flush to the CF: db->Flush(options, cf);
  5. Depending on the memory limit of the process, you may get an OOM with fewer range deletes

On the recovery path
Steps 1-3 are the same as above.
4. Close the database
5. Reopen the database
6. OOM during recovery

I believe the flush and the recovery actually use the same code path; both OOMed when creating the FragmentedRangeTombstoneList.

I tried issuing a flush to the CF after a certain number of range deletes (e.g. 1000), which seems to resolve the OOM. However, there are no metrics tracking the number of range deletes in a memtable, so I have to count the deletes sent to a CF myself, which is inaccurate because some deletes may have been flushed already. A rough sketch of this workaround is below.
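For illustration, here is a minimal sketch of the counting workaround (the helper class and threshold are mine, not part of our codebase or of RocksDB; the count is approximate for the reason above):

#include <atomic>
#include "rocksdb/db.h"

// Hypothetical helper: count DeleteRange calls per column family and force a
// flush once a threshold is crossed. The counter is only approximate, since a
// background flush may already have persisted some of the counted deletions.
class RangeDeleteTracker {
 public:
  explicit RangeDeleteTracker(size_t threshold) : threshold_(threshold) {}

  rocksdb::Status DeleteRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                              const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
    rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(), cf, begin, end);
    if (s.ok() && ++count_ >= threshold_) {
      count_ = 0;
      // Flushing bounds the number of range tombstones per memtable.
      s = db->Flush(rocksdb::FlushOptions(), cf);
    }
    return s;
  }

 private:
  const size_t threshold_;
  std::atomic<size_t> count_{0};
};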

Another thing I tried was reopening the database using ldb tools; it also OOMed during recovery. The issue was originally discovered in 7.2.6. After that, we tried many newer versions, including 8.1.1, and still see the same issue.

@cbi42 cbi42 self-assigned this Apr 25, 2023
@cbi42 cbi42 added the performance Issues related to performance that may or may not be bugs label Apr 25, 2023
@cbi42
Member

cbi42 commented Apr 25, 2023

Hi, thanks for reporting the issue. Currently, FragmentTombstones() can take more memory if the input range tombstones are overlapping; in your case, where all range tombstones overlap, the memory overhead is roughly O(N^2). I doubt the threshold is 1000 range deletions, though; I tried the following test and it consumes about 40MB of memory. It's probably closer to 10k, depending on the amount of available memory.

...
const int kNumRangeDel = 1000;
for (int i = 0; i < kNumRangeDel; ++i) {
  ASSERT_OK(Put(Key(i + 1), std::to_string(i + 1)));
  // All range deletions are overlapping
  ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
                             Key(0), Key(i)));
}
// Flush will trigger fragmentation, which could OOM
ASSERT_OK(Flush());
...
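To get a feel for the O(N^2) overhead, here is a rough back-of-the-envelope model (not RocksDB code; the per-pair bookkeeping cost and the constants are assumptions): fragmenting N fully overlapping tombstones produces about N non-overlapping fragments, and fragment j is covered by roughly N - j of the original tombstones, each contributing a sequence number plus some key bookkeeping.

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t kNumRangeDel = 10000;  // roughly where the OOM was observed
  const uint64_t kSeqnoBytes = 8;       // one sequence number per covering tombstone
  const uint64_t kPairBytes = 24;       // assumed bookkeeping (e.g. a small heap string) per pair

  uint64_t total = 0;
  for (uint64_t j = 0; j < kNumRangeDel; ++j) {
    // Fragment j is covered by ~(kNumRangeDel - j) tombstones; each covering
    // tombstone contributes a seqno plus the assumed per-pair bookkeeping.
    total += (kNumRangeDel - j) * (kSeqnoBytes + kPairBytes);
  }
  // ~N^2/2 pairs: for N = 10k that's ~5e7 pairs, i.e. on the order of a
  // gigabyte before allocator overhead, the same order as the heap profile.
  printf("estimated fragmentation memory: %.1f GB\n",
         total / (1024.0 * 1024.0 * 1024.0));
  return 0;
}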

There are some potential ways to optimize this that I plan to work on. In the meantime, as a workaround, are you able to issue non-overlapping range deletions?

However, there's no metrics to track the number of range deletes in a memtable. I have to count the deletes sent to a CF, which is inaccurate because some deletes may have been flushed already.

You are right that manually tracking the count can sometimes be inaccurate, but a manual flush should still help. There is an open PR (#11358) that adds this functionality as an option.
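If that PR lands as proposed, usage might look roughly like the sketch below; the option name (memtable_max_range_deletions) is taken from the PR and could change before it ships:

#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Assumed option from #11358: flush a memtable once it accumulates this
  // many range deletions, bounding FragmentTombstones() work at flush time.
  options.memtable_max_range_deletions = 1000;
  return options;
}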

@yao-xiao-github
Author

yao-xiao-github commented Apr 26, 2023

Thanks for the reply.

I doubt the threshold is 1000 range deletions tho, I tried the following test and it consumes about 40MB memory. It's probably closer to 10k depending on the amount of available memory.

Right. In the unit test, it OOMed at around 10k deletes. I misread the number when going through the WAL files; the actual count there is also close to 10k.

There are some potential ways to optimize this that I plan to work on. In the mean time, as a work around, are you able to issue non-overlapping range deletions?

It's hard to issue non-overlapping range deletions. We plan to convert some DeleteRange calls to read-and-delete instead (a sketch follows below).
I think #11358 could help with our situation. I also left some comments there.
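The read-and-delete conversion we have in mind is roughly the following (a sketch, not production code; it trades a range scan plus point deletes for avoiding overlapping range tombstones):

#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status ReadAndDelete(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                              const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
  rocksdb::WriteBatch batch;
  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &end;  // stop the scan at the end of the range
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
  for (it->Seek(begin); it->Valid(); it->Next()) {
    batch.Delete(cf, it->key());  // one point delete per existing key
  }
  if (!it->status().ok()) {
    return it->status();
  }
  return db->Write(rocksdb::WriteOptions(), &batch);
}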

Will #11358 be included in the next RocksDB release?
