
Avoid exception if fail to insert block to block-cache #8670

Open
Myasuka opened this issue Aug 17, 2021 · 9 comments · May be fixed by #8683

@Myasuka
Contributor

Myasuka commented Aug 17, 2021

Expected behavior

If inserting a block into the block cache fails, just abort the insert and return the result as normal.

Actual behavior

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception if the insert fails due to the LRU cache being full.

Steps to reproduce the behavior

This is because the current RetrieveBlock checks the status of MaybeReadBlockAndLoadToCache:

    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      return s;
    }

And if the insert fails, the block cache reports an Incomplete status:

    if ((usage_ + total_charge) > capacity_ &&
        (strict_capacity_limit_ || handle == nullptr)) {
      e->SetInCache(false);
      if (handle == nullptr) {
        // Don't insert the entry but still return ok, as if the entry inserted
        // into cache and get evicted immediately.
        last_reference_list.push_back(e);
      } else {
        if (free_handle_on_fail) {
          delete[] reinterpret_cast<char*>(e);
          *handle = nullptr;
        }
        s = Status::Incomplete("Insert failed due to LRU cache being full.");
      }
    }

As more and more applications move to the cloud, fine-grained memory control is required in many use cases. I think enabling the strict capacity limit of the block cache in production environments would be valuable.
How about this solution:

  1. Introduce a new status, FailInsertCache.
  2. Introduce a new field named fill_cache_if_possible in ReadOptions, meaning that if we cannot fill a block into the cache, we do not treat the FailInsertCache status as a problem and continue the read process as normal (see the sketch below).
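
A minimal sketch of how this could look at the RetrieveBlock call site shown above; fill_cache_if_possible and the FailInsertCache status are only proposed names, not existing RocksDB APIs:

    // Proposed (hypothetical) new ReadOptions field:
    //   bool fill_cache_if_possible = false;
    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      if (ro.fill_cache_if_possible && s.IsFailInsertCache()) {  // hypothetical
        // The block itself was read; only the cache insert was skipped, so
        // keep serving the read instead of propagating the failure.
        s = Status::OK();
      } else {
        return s;
      }
    }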

What do you think of this problem and the proposed solution?

@Myasuka
Contributor Author

Myasuka commented Aug 17, 2021

cc @zhichao-cao , @ltamasi @akankshamahajan15 who might be interested in this topic.

@zhichao-cao
Contributor

Thanks for the suggestion. So there would be a new error code in Status, such as Status::FailInsertCache()?

@Myasuka
Contributor Author

Myasuka commented Aug 18, 2021

@zhichao-cao Yes, this is just a proposal, and I hope to get your response on whether it is practicable.
If so, do you think the RocksDB community could resolve it in this suggested way?

@pdillinger
Contributor

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception if the insert fails due to the LRU cache being full.

Correction: LRUCacheOptions

My concern is that if you are setting strict_capacity_limit=true on Cache, you are probably interested in limiting the amount of memory used to hold SST blocks. But under your proposal, iterators and active Get operations are allowed to allocate as much as they want beyond this limit, because they ignore when insertion into Cache fails. This is worse than setting strict_capacity_limit=false because at least with strict_capacity_limit=false you can cache hit on everything that is already in memory.
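
A minimal sketch of enabling strict_capacity_limit on the block cache via LRUCacheOptions; the 512 MiB capacity is an arbitrary example:

    #include <memory>

    #include "rocksdb/cache.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // Block cache capped at 512 MiB; with strict_capacity_limit=true, a full
    // cache makes Insert() fail instead of letting usage exceed the capacity.
    rocksdb::Options MakeOptionsWithStrictBlockCache() {
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 512 << 20;
      cache_opts.strict_capacity_limit = true;
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = cache;

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));
      return options;
    }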

@ajkr
Contributor

ajkr commented Aug 23, 2021

It is a good question though - since the allocation and read already happened, does it make sense to return the value? I think it's not that straightforward. When using the PinnableSlice APIs we don't just fill a value buffer for the caller -- the whole underlying block comes with it. And this is amplified in MultiGet() or iterators.

One alternative we've been discussing is to pre-reserve cache memory before doing the allocation and read. That would prevent the case where we do an allocation and read only to throw it away and fail the operation. But we would still fail the operation -- it would just be an earlier/more efficient way to fail.
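
To illustrate the shape of that alternative, here is a toy sketch; ToyCache, TryReserve, and the estimated charge are all invented for illustration and are not RocksDB code:

    #include <atomic>
    #include <cstddef>
    #include <string>

    // Toy model, not RocksDB code: reserve the block's charge against the cache
    // capacity *before* allocating and reading it, so a full cache rejects the
    // operation up front instead of after the allocation and I/O were wasted.
    struct ToyCache {
      std::atomic<size_t> usage{0};
      size_t capacity = 0;

      // Returns true if `charge` bytes fit under the capacity and were reserved.
      bool TryReserve(size_t charge) {
        size_t cur = usage.load();
        while (cur + charge <= capacity) {
          if (usage.compare_exchange_weak(cur, cur + charge)) return true;
        }
        return false;
      }
      void Release(size_t charge) { usage.fetch_sub(charge); }
    };

    // Sketch of the read path under pre-reservation. `estimated_charge` would
    // come from the on-disk block size plus some decompression estimate.
    std::string ReadBlockPreReserved(ToyCache& cache, size_t estimated_charge) {
      if (!cache.TryReserve(estimated_charge)) {
        // Fail early: nothing was allocated or read yet.
        return "Incomplete: cache is full";
      }
      // ... allocate the buffer, read and uncompress the block here ...
      // On a successful insert the reservation becomes the entry's charge; if
      // the real size differs from the estimate, the charge would be adjusted.
      return "OK";
    }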

@Myasuka
Contributor Author

Myasuka commented Aug 24, 2021

Thanks for your feedback!

From my understanding, the current memory control solution in RocksDB cannot control memory very well, just as you pointed out. It first lets the memory allocation happen and then decides whether to free it after inserting into the cache. This is not perfect, but it is better than no memory control at all.

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap. On the other hand, the Apache Flink community heavily depends on this feature in container environments, and being able to strictly limit the cache would help improve memory control. Given the current status, I think making strict_capacity_limit=true work without failing the get operation could be the low-hanging fruit, and it does not go against the long-term design. What do you think? @ajkr @pdillinger @zhichao-cao

BTW, waiting for the LRU cache to evict entries before failing the operation might be something better for long-running applications.

@pdillinger pdillinger self-assigned this Jun 23, 2022
@pdillinger
Contributor

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

Note: there are two other "lookup with intent to insert" things we could consider solving along with this:

  • Allocate and insert an "empty" handle after unsuccessful lookup (with intent to insert), to avoid repeating the hashing & probing between Lookup & Insert.
  • Use the same to allow cooperation between threads intending to insert the same entry. All but one thread would wait for the selected one to finish the corresponding insert (a toy sketch follows).
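
A toy illustration of that idea; ToyCacheShard, LookupForInsert, and FinishInsert are invented for this sketch and are not part of RocksDB's Cache API:

    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Toy sketch, not RocksDB's Cache API: a miss installs a placeholder handle,
    // the first caller becomes the inserter, and later callers for the same key
    // block until the insert completes. (Eviction/cleanup omitted; handles leak.)
    struct ToyHandle {
      std::string value;
      bool ready = false;  // false while the placeholder is still being filled
      std::condition_variable cv;
    };

    class ToyCacheShard {
     public:
      // Returns the handle for `key`; sets *inserter to true for the one caller
      // that must perform the read + FinishInsert, false for callers that waited.
      ToyHandle* LookupForInsert(const std::string& key, bool* inserter) {
        std::unique_lock<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) {
          it = map_.emplace(key, new ToyHandle()).first;  // placeholder entry
          *inserter = true;
          return it->second;
        }
        *inserter = false;
        ToyHandle* h = it->second;
        h->cv.wait(lock, [h] { return h->ready; });  // wait for the inserter
        return h;
      }

      void FinishInsert(ToyHandle* h, std::string value) {
        std::lock_guard<std::mutex> lock(mu_);
        h->value = std::move(value);
        h->ready = true;
        h->cv.notify_all();
      }

     private:
      std::mutex mu_;
      std::unordered_map<std::string, ToyHandle*> map_;
    };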

@mrambacher
Contributor

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

I thought the size of compressed blocks is typically known -- that is "compression format 2" which writes the size at the beginning.

The size to which an uncompressed block will compress is, however, unknown.

I wonder if having every Cache use a MemoryAllocator, with a smarter allocator (one that allocates a Handle and could support "reallocate"), would be helpful...
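
Loosely sketching that, a hypothetical allocator interface with a reallocate hook; this is not RocksDB's actual MemoryAllocator interface, just an illustration of the idea:

    #include <cstddef>
    #include <cstdlib>

    // Hypothetical allocator shape for the idea above (not RocksDB's actual
    // MemoryAllocator interface): a Reallocate hook would let the cache grow
    // or shrink a handle once the real uncompressed size is known.
    class ResizableAllocator {
     public:
      virtual ~ResizableAllocator() = default;
      virtual void* Allocate(size_t size) = 0;
      virtual void Deallocate(void* p) = 0;
      // Hypothetical: resize an existing allocation, returning the new pointer.
      virtual void* Reallocate(void* p, size_t new_size) = 0;
    };

    // Trivial libc-backed implementation, just to make the sketch concrete.
    class CRealloc : public ResizableAllocator {
     public:
      void* Allocate(size_t size) override { return std::malloc(size); }
      void Deallocate(void* p) override { std::free(p); }
      void* Reallocate(void* p, size_t new_size) override {
        return std::realloc(p, new_size);
      }
    };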

@pdillinger
Contributor

I thought the size of compressed blocks is typically known -- that is "compression format 2" which writes the size at the beginning.

That doesn't help you predict the uncompressed size when all you have is a BlockHandle to the compressed block.

HeartSaVioR pushed a commit to apache/spark that referenced this issue Aug 23, 2023
Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

### What changes were proposed in this pull request?
Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

### Why are the changes needed?
In some cases, when the memory limit is reached, we see the following exception on insert/get:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 9.0 failed 4 times, most recent failure: Lost task 42.3 in stage 9.0 (TID 2950) (96.104.176.55 executor 0): org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
	at org.rocksdb.RocksDB.get(Native Method)
	at org.rocksdb.RocksDB.get(RocksDB.java:2053)
	at org.apache.spark.sql.execution.streaming.state.RocksDB.get(RocksDB.scala:299)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$RocksDBStateStore.get(RocksDBStateStoreProvider.scala:55)
```

It seems this is thrown when the strict memory limit is set within RocksDB, here - https://github.com/facebook/rocksdb/blob/0fa0c97d3e9ac5dfc2e7ae94834b0850cdef5df7/cache/lru_cache.cc#L394

It seems this issue can only happen with the strict mode as described here - facebook/rocksdb#5048 (comment)

Seems like there is a pending issue for RocksDB around this as well - facebook/rocksdb#8670

There is probably a relevant fix, but not sure whether this addresses the issue completely - facebook/rocksdb#6619
(cc - siying )

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #42567 from anishshri-db/task/SPARK-44878.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>