
Avoid exception if fail to insert block to block-cache #8670

Open
Myasuka opened this issue Aug 17, 2021 · 9 comments · May be fixed by #8683

@Myasuka
Contributor

Myasuka commented Aug 17, 2021

Expected behavior

If inserting a block into the block cache fails, just abort the insert and return the result as normal.

Actual behavior

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception if the insert fails due to the LRU cache being full.

Steps to reproduce the behavior

This is because the current RetrieveBlock checks the status of MaybeReadBlockAndLoadToCache:

    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      return s;
    }

And if the insert fails, the block cache reports an Incomplete status:

    if ((usage_ + total_charge) > capacity_ &&
        (strict_capacity_limit_ || handle == nullptr)) {
      e->SetInCache(false);
      if (handle == nullptr) {
        // Don't insert the entry but still return ok, as if the entry inserted
        // into cache and get evicted immediately.
        last_reference_list.push_back(e);
      } else {
        if (free_handle_on_fail) {
          delete[] reinterpret_cast<char*>(e);
          *handle = nullptr;
        }
        s = Status::Incomplete("Insert failed due to LRU cache being full.");
      }
    }

As more and more applications move to the cloud, fine-grained memory control is required in many use cases. I think enabling the strict capacity limit of the block cache in production environments would be valuable.
How about this solution:

  1. Introduce a new status, FailInsertCache.
  2. Introduce a new field named fill_cache_if_possible in ReadOptions, meaning that if we cannot fill a block into the cache, we do not treat the FailInsertCache status as a problem and continue the read process as normal (see the sketch below).
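
A minimal sketch of how this could look at the RetrieveBlock call site shown above; fill_cache_if_possible and the FailInsertCache status are only proposed names, not existing RocksDB APIs:

    // Proposed (hypothetical) new ReadOptions field:
    //   bool fill_cache_if_possible = false;
    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      if (ro.fill_cache_if_possible && s.IsFailInsertCache()) {  // hypothetical
        // The block itself was read; only the cache insert was skipped, so
        // keep serving the read instead of propagating the failure.
        s = Status::OK();
      } else {
        return s;
      }
    }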

What do you think of this problem and the proposed solution?

@Myasuka
Contributor Author

Myasuka commented Aug 17, 2021

cc @zhichao-cao , @ltamasi @akankshamahajan15 who might be interested in this topic.

@zhichao-cao
Contributor

Thanks for the suggestion. So there would be a new error code in Status, such as Status::FailInsertCache()?

@Myasuka
Contributor Author

Myasuka commented Aug 18, 2021

@zhichao-cao Yes, this is just a proposal, and I hope to get your response on whether it is practicable.
If so, do you think the RocksDB community could resolve it in this suggested way?

@pdillinger
Contributor

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception if the insert fails due to the LRU cache being full.

Correction: LRUCacheOptions

My concern is that if you are setting strict_capacity_limit=true on Cache, you are probably interested in limiting the amount of memory used to hold SST blocks. But under your proposal, iterators and active Get operations are allowed to allocate as much as they want beyond this limit, because they ignore when insertion into Cache fails. This is worse than setting strict_capacity_limit=false because at least with strict_capacity_limit=false you can cache hit on everything that is already in memory.
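
A minimal sketch of enabling strict_capacity_limit on the block cache via LRUCacheOptions; the 512 MiB capacity is an arbitrary example:

    #include <memory>

    #include "rocksdb/cache.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // Block cache capped at 512 MiB; with strict_capacity_limit=true, a full
    // cache makes Insert() fail instead of letting usage exceed the capacity.
    rocksdb::Options MakeOptionsWithStrictBlockCache() {
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 512 << 20;
      cache_opts.strict_capacity_limit = true;
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = cache;

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));
      return options;
    }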

@ajkr
Contributor

ajkr commented Aug 23, 2021

It is a good question though - since the allocation and read already happened, does it make sense to return the value? I think it's not that straightforward. When using the PinnableSlice APIs we don't just fill a value buffer for the caller -- the whole underlying block comes with it. And this is amplified in MultiGet() or iterators.

One alternative we've been discussing is to pre-reserve cache memory before doing the allocation and read. That would prevent the case where we do an allocation and read only to throw it away and fail the operation. But we would still fail the operation -- it would just be an earlier/more efficient way to fail.
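
To illustrate the shape of that alternative, here is a toy sketch; ToyCache, TryReserve, and the estimated charge are all invented for illustration and are not RocksDB code:

    #include <atomic>
    #include <cstddef>
    #include <string>

    // Toy model, not RocksDB code: reserve the block's charge against the cache
    // capacity *before* allocating and reading it, so a full cache rejects the
    // operation up front instead of after the allocation and I/O were wasted.
    struct ToyCache {
      std::atomic<size_t> usage{0};
      size_t capacity = 0;

      // Returns true if `charge` bytes fit under the capacity and were reserved.
      bool TryReserve(size_t charge) {
        size_t cur = usage.load();
        while (cur + charge <= capacity) {
          if (usage.compare_exchange_weak(cur, cur + charge)) return true;
        }
        return false;
      }
      void Release(size_t charge) { usage.fetch_sub(charge); }
    };

    // Sketch of the read path under pre-reservation. `estimated_charge` would
    // come from the on-disk block size plus some decompression estimate.
    std::string ReadBlockPreReserved(ToyCache& cache, size_t estimated_charge) {
      if (!cache.TryReserve(estimated_charge)) {
        // Fail early: nothing was allocated or read yet.
        return "Incomplete: cache is full";
      }
      // ... allocate the buffer, read and uncompress the block here ...
      // On a successful insert the reservation becomes the entry's charge; if
      // the real size differs from the estimate, the charge would be adjusted.
      return "OK";
    }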

@Myasuka
Contributor Author

Myasuka commented Aug 24, 2021

Thanks for your feedback!

From my understanding, the current memory control solution in RocksDB cannot control memory very well, just as you pointed out. It first lets the memory allocation happen and then decides whether to free it after inserting into the cache. This is not perfect, but it is better than no memory control at all.

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap. On the other hand, the Apache Flink community heavily depends on this feature in container environments, and being able to strictly limit the cache would help improve memory control. Given the current status, I think making strict_capacity_limit=true work without failing the get operation could be the low-hanging fruit, and it does not go against the long-term design. What do you think? @ajkr @pdillinger @zhichao-cao

BTW, waiting for the LRU cache to evict entries before failing the operation might be something better for long-running applications.

@pdillinger pdillinger self-assigned this Jun 23, 2022
@pdillinger
Contributor

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

Note: there are two other "lookup with intent to insert" things we could consider solving along with this:

  • Allocate and insert an "empty" handle after unsuccessful lookup (with intent to insert), to avoid repeating the hashing & probing between Lookup & Insert.
  • Use the same to allow cooperation between threads intending to insert the same entry. All but one thread would wait for the selected one to finish the corresponding insert (a toy sketch follows).
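
A toy illustration of that idea; ToyCacheShard, LookupForInsert, and FinishInsert are invented for this sketch and are not part of RocksDB's Cache API:

    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Toy sketch, not RocksDB's Cache API: a miss installs a placeholder handle,
    // the first caller becomes the inserter, and later callers for the same key
    // block until the insert completes. (Eviction/cleanup omitted; handles leak.)
    struct ToyHandle {
      std::string value;
      bool ready = false;  // false while the placeholder is still being filled
      std::condition_variable cv;
    };

    class ToyCacheShard {
     public:
      // Returns the handle for `key`; sets *inserter to true for the one caller
      // that must perform the read + FinishInsert, false for callers that waited.
      ToyHandle* LookupForInsert(const std::string& key, bool* inserter) {
        std::unique_lock<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) {
          it = map_.emplace(key, new ToyHandle()).first;  // placeholder entry
          *inserter = true;
          return it->second;
        }
        *inserter = false;
        ToyHandle* h = it->second;
        h->cv.wait(lock, [h] { return h->ready; });  // wait for the inserter
        return h;
      }

      void FinishInsert(ToyHandle* h, std::string value) {
        std::lock_guard<std::mutex> lock(mu_);
        h->value = std::move(value);
        h->ready = true;
        h->cv.notify_all();
      }

     private:
      std::mutex mu_;
      std::unordered_map<std::string, ToyHandle*> map_;
    };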

@mrambacher
Contributor

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

I thought the size of compressed blocks is typically known -- that is "compression format 2" which writes the size at the beginning.

The size to which an uncompressed block will compress is, however, unknown.

I wonder if having every Cache use a MemoryAllocator, with a smarter allocator (one that allocates a Handle and could support "reallocate"), would be helpful...
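
Loosely sketching that, a hypothetical allocator interface with a reallocate hook; this is not RocksDB's actual MemoryAllocator interface, just an illustration of the idea:

    #include <cstddef>
    #include <cstdlib>

    // Hypothetical allocator shape for the idea above (not RocksDB's actual
    // MemoryAllocator interface): a Reallocate hook would let the cache grow
    // or shrink a handle once the real uncompressed size is known.
    class ResizableAllocator {
     public:
      virtual ~ResizableAllocator() = default;
      virtual void* Allocate(size_t size) = 0;
      virtual void Deallocate(void* p) = 0;
      // Hypothetical: resize an existing allocation, returning the new pointer.
      virtual void* Reallocate(void* p, size_t new_size) = 0;
    };

    // Trivial libc-backed implementation, just to make the sketch concrete.
    class CRealloc : public ResizableAllocator {
     public:
      void* Allocate(size_t size) override { return std::malloc(size); }
      void Deallocate(void* p) override { std::free(p); }
      void* Reallocate(void* p, size_t new_size) override {
        return std::realloc(p, new_size);
      }
    };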

@pdillinger
Contributor

I thought the size of compressed blocks is typically known -- that is "compression format 2" which writes the size at the beginning.

That doesn't help you predict the uncompressed size when all you have is a BlockHandle to the compressed block.

HeartSaVioR pushed a commit to apache/spark that referenced this issue Aug 23, 2023
Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

### What changes were proposed in this pull request?
Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

### Why are the changes needed?
In some cases, when the memory limit is reached, we see the following exception on insert/get:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 9.0 failed 4 times, most recent failure: Lost task 42.3 in stage 9.0 (TID 2950) (96.104.176.55 executor 0): org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
	at org.rocksdb.RocksDB.get(Native Method)
	at org.rocksdb.RocksDB.get(RocksDB.java:2053)
	at org.apache.spark.sql.execution.streaming.state.RocksDB.get(RocksDB.scala:299)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$RocksDBStateStore.get(RocksDBStateStoreProvider.scala:55)
```

It seems this is thrown when the strict memory limit is set within RocksDB, here - https://github.com/facebook/rocksdb/blob/0fa0c97d3e9ac5dfc2e7ae94834b0850cdef5df7/cache/lru_cache.cc#L394

It seems this issue can only happen with the strict mode as described here - facebook/rocksdb#5048 (comment)

Seems like there is a pending issue for RocksDB around this as well - facebook/rocksdb#8670

There is probably a relevant fix, but not sure whether this addresses the issue completely - facebook/rocksdb#6619
(cc - siying )

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #42567 from anishshri-db/task/SPARK-44878.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>