HBASE-26018 Perf improvement in L1 cache - Optimistic call to buffer.retain() #3407

Closed
wants to merge 1 commit into from
LruAdaptiveBlockCache.java

@@ -40,6 +40,7 @@
import org.apache.hadoop.hbase.util.ClassSize;
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;
import org.apache.hadoop.util.StringUtils;
+import org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException;
import org.apache.yetus.audience.InterfaceAudience;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -222,15 +223,7 @@ public class LruAdaptiveBlockCache implements FirstLevelBlockCache {
= "hbase.lru.cache.heavy.eviction.overhead.coefficient";
private static final float DEFAULT_LRU_CACHE_HEAVY_EVICTION_OVERHEAD_COEFFICIENT = 0.01f;

-  /**
-   * Defined the cache map as {@link ConcurrentHashMap} here, because in
-   * {@link LruAdaptiveBlockCache#getBlock}, we need to guarantee the atomicity
-   * of map#computeIfPresent (key, func). Besides, the func method must execute exactly once only
-   * when the key is present and under the lock context, otherwise the reference count will be
-   * messed up. Notice that the
-   * {@link java.util.concurrent.ConcurrentSkipListMap} can not guarantee that.
-   */
-  private transient final ConcurrentHashMap<BlockCacheKey, LruCachedBlock> map;
+  private transient final Map<BlockCacheKey, LruCachedBlock> map;

/** Eviction lock (locked when eviction in process) */
private transient final ReentrantLock evictionLock = new ReentrantLock(true);
@@ -646,14 +639,16 @@ private long updateSizeMetrics(LruCachedBlock cb, boolean evict) {
@Override
public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching, boolean repeat,
boolean updateCacheMetrics) {
-    LruCachedBlock cb = map.computeIfPresent(cacheKey, (key, val) -> {
-      // It will be referenced by RPC path, so increase here. NOTICE: Must do the retain inside
-      // this block. because if retain outside the map#computeIfPresent, the evictBlock may remove
-      // the block and release, then we're retaining a block with refCnt=0 which is disallowed.
-      // see HBASE-22422.
-      val.getBuffer().retain();
-      return val;
-    });
+    LruCachedBlock cb = map.get(cacheKey);
+    if (cb != null) {
+      try {
+        cb.getBuffer().retain();
Contributor (@saintstack):

And if it is purged from cache by a background thread, we'll have a cb w/ a non-zero refcount that is not in the cache? Will that work?

Contributor Author (@virajjasani):

It is purged from cache by:

  protected long evictBlock(LruCachedBlock block, boolean evictedByEvictionProcess) {
    LruCachedBlock previous = map.remove(block.getCacheKey());    =======> removed from map
    if (previous == null) {
      return 0;
    }
    updateSizeMetrics(block, true);
    long val = elements.decrementAndGet();
    if (LOG.isTraceEnabled()) {
      long size = map.size();
      assertCounterSanity(size, val);
    }
    if (block.getBuffer().getBlockType().isData()) {
      dataBlockElements.decrement();
    }
    if (evictedByEvictionProcess) {
      // When the eviction of the block happened because of invalidation of HFiles, no need to
      // update the stats counter.
      stats.evicted(block.getCachedTime(), block.getCacheKey().isPrimary());
      if (victimHandler != null) {
        victimHandler.cacheBlock(block.getCacheKey(), block.getBuffer());
      }
    }
    // Decrease the block's reference count, and if refCount is 0, then it'll auto-deallocate. DO
    // NOT move this up because if do that then the victimHandler may access the buffer with
    // refCnt = 0 which is disallowed.
    previous.getBuffer().release();         ============================> buffer released
    return block.heapSize();
  }

Based on the above eviction code, we have the following possibilities when eviction and getBlock happen for the same block at the same time:

  1. getBlock retrieves the block from the map, eviction removes it from the map, eviction does release(), getBlock does retain() and encounters IllegalReferenceCountException; we handle it with this patch and treat it as a cache miss (sketched below).
  2. getBlock retrieves the block from the map, eviction removes it from the map, getBlock does retain(), eviction does release(). Since getBlock's retain() was successful, it proceeds as a successful cache hit, which happens even today with computeIfPresent. A subsequent getBlock call will return null as the block was evicted previously.
  3. eviction removes the block from the map, getBlock gets null; it's a clear cache miss.

I think we seem good here. WDYT @saintstack?
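
For reference, possibility #1 can be reproduced standalone with plain Netty's ByteBuf (a sketch only; HBase uses the shaded org.apache.hbase.thirdparty variant, and Unpooled stands in for the block's actual allocator):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import io.netty.util.IllegalReferenceCountException;

    public class Possibility1Sketch {
      public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(64);  // the cached block's buffer, refCnt == 1
        buf.release();                      // eviction wins the race: refCnt == 0, buffer deallocated
        try {
          buf.retain();                     // getBlock's optimistic retain on a dead buffer
        } catch (IllegalReferenceCountException e) {
          System.out.println("treat as L1 cache miss: " + e.getMessage());
        }
      }
    }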

Contributor Author (@virajjasani, Jul 8, 2021):

I think for possibility #2 above, we stand a chance that a buffer with a non-zero refCount is not in the cache. I see; let me see what alternatives we have for this case.
Although I still think the same case can happen even today:
getBlock does retain(), which will bring the refCount of the BB to 2; while getBlock is busy updating stats, the eviction thread can evict the block from the cache and call release(), which will bring the refCount of the BB to 1. So even in this case, we can have a positive-refCount buffer which has been evicted from the cache.
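
A standalone illustration of that sequence, again with plain Netty's ByteBuf in place of the HBase types (a sketch, not HBase code):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class Possibility2Sketch {
      public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(64);  // the cached block's buffer, refCnt == 1
        buf.retain();                       // getBlock wins the race: refCnt == 2
        // ... eviction thread removes the map entry, then calls release() ...
        buf.release();                      // refCnt == 1: alive, but detached from the cache
        System.out.println("refCnt after eviction: " + buf.refCnt());
        buf.release();                      // the RPC path must still release the buffer so the
                                            // backing memory can actually be reclaimed
      }
    }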

Contributor (@saintstack):

#1 sounds good.
#2 yeah, it can get interesting. The computeIfPresent made reasoning easier for sure.

Running w/ #get instead of #computeIfPresent -- even though it is incorrect -- changed the locking profile of a loaded process; before the change, the blockage in computeIfPresent was the biggest blockage. After, the biggest locking consumer was elsewhere and at a much smaller percentage.
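
A rough sketch of why the locking profile changes, assuming Java 8+ ConcurrentHashMap semantics (toy types; not the HBase code):

    import java.util.concurrent.ConcurrentHashMap;

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class LockProfileSketch {
      public static void main(String[] args) {
        ConcurrentHashMap<String, ByteBuf> map = new ConcurrentHashMap<>();
        map.put("hot-block", Unpooled.buffer(64));

        // Before: computeIfPresent locks the hash bin holding the key for the duration
        // of the remapping function, so all readers of a hot block serialize on it.
        ByteBuf before = map.computeIfPresent("hot-block", (k, v) -> {
          v.retain();
          return v;
        });

        // After: get() is a lock-free read; the remaining contention is only the CAS
        // on refCnt inside retain() -- hence the much smaller locking footprint.
        ByteBuf after = map.get("hot-block");
        if (after != null) {
          after.retain();
        }
      }
    }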

Contributor Author (@virajjasani):

Thanks @saintstack.

> After, the biggest locking consumer was elsewhere and at a much smaller percentage.

Does this mean we can kind of ignore this case (assuming objects not in the cache will get GC'ed regardless of their netty-based refCount)? Still thinking about this.

Contributor (@saintstack):

@virajjasani That's an interesting idea. Whether onheap or offheap, if there are no references -- i.e. not tied to a pool -- then they should get GC'd. Does the CB get returned to the cache when done?

Contributor (@saintstack):

Looking at this more... I don't think we can do your trick after all.

The refcounting is not for the cache; it is for a backing pool of memory used for reading data in from HDFS into the cache. When we evict a block from the cache, we call #release on the memory. If the refcount drops to zero, the memory is freed and can be reused in the backing pool. If #release is called and the #refcount is not zero, we just decrement the refcount.

A cached buffer item detached from the cache still needs to have its #release called w/ refcount at zero so the backing memory gets re-added to the pool.

So it seems to me. What do you think, @virajjasani?
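
A minimal sketch of that lifecycle, with Netty's pooled allocator standing in for HBase's backing pool (illustrative; not the HBase ByteBuffAllocator API):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class PoolLifecycleSketch {
      public static void main(String[] args) {
        // Reading a block borrows memory from the pool: refCnt == 1.
        ByteBuf block = PooledByteBufAllocator.DEFAULT.buffer(64 * 1024);

        block.retain();   // RPC path takes a reference: refCnt == 2
        block.release();  // cache eviction: refCnt == 1, memory not yet reusable

        // Only the final release() drops refCnt to 0 and returns the chunk to the
        // allocator's arena -- which is why a buffer detached from the cache must
        // still get its last release().
        block.release();
      }
    }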

Contributor Author (@virajjasani):

> Does the CB get returned to the cache when done?

You mean whether the CB gets returned to the L1 cache (CHM) after its buffer has served the read request? Yes, that's the case (unless I misunderstood the question).

Contributor (@saintstack):

(Sorry, forgot to submit my comment from a good while ago)

Contributor Author (@virajjasani):

> A cached buffer item detached from the cache still needs to have its #release called w/ refcount at zero so the backing memory gets re-added to the pool.

Yeah, I think this makes sense. Let me get back to this in case I find some better and obvious way to improve perf, and get some YCSB results.

+      } catch (IllegalReferenceCountException e) {
+        cb = null;
+        LOG.debug("AdaptiveLRU cache block retain caused refCount Exception. Treating this as L1"
+          + " cache miss. Exception: {}", e.getMessage());
+      }
+    }
if (cb == null) {
if (!repeat && updateCacheMetrics) {
stats.miss(caching, cacheKey.isPrimary(), cacheKey.getBlockType());
LruBlockCache.java

@@ -39,6 +39,7 @@
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.ClassSize;
import org.apache.hadoop.util.StringUtils;
+import org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException;
import org.apache.yetus.audience.InterfaceAudience;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -145,14 +146,7 @@ public class LruBlockCache implements FirstLevelBlockCache {
private static final String LRU_MAX_BLOCK_SIZE = "hbase.lru.max.block.size";
private static final long DEFAULT_MAX_BLOCK_SIZE = 16L * 1024L * 1024L;

-  /**
-   * Defined the cache map as {@link ConcurrentHashMap} here, because in
-   * {@link LruBlockCache#getBlock}, we need to guarantee the atomicity of map#computeIfPresent
-   * (key, func). Besides, the func method must execute exactly once only when the key is present
-   * and under the lock context, otherwise the reference count will be messed up. Notice that the
-   * {@link java.util.concurrent.ConcurrentSkipListMap} can not guarantee that.
-   */
-  private transient final ConcurrentHashMap<BlockCacheKey, LruCachedBlock> map;
+  private transient final Map<BlockCacheKey, LruCachedBlock> map;

/** Eviction lock (locked when eviction in process) */
private transient final ReentrantLock evictionLock = new ReentrantLock(true);
@@ -510,14 +504,16 @@ private long updateSizeMetrics(LruCachedBlock cb, boolean evict) {
@Override
public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching, boolean repeat,
boolean updateCacheMetrics) {
-    LruCachedBlock cb = map.computeIfPresent(cacheKey, (key, val) -> {
-      // It will be referenced by RPC path, so increase here. NOTICE: Must do the retain inside
-      // this block. because if retain outside the map#computeIfPresent, the evictBlock may remove
-      // the block and release, then we're retaining a block with refCnt=0 which is disallowed.
-      // see HBASE-22422.
-      val.getBuffer().retain();
-      return val;
-    });
+    LruCachedBlock cb = map.get(cacheKey);
+    if (cb != null) {
+      try {
+        cb.getBuffer().retain();
+      } catch (IllegalReferenceCountException e) {
+        cb = null;
+        LOG.debug("LRU cache block retain caused refCount Exception. Treating this as L1 cache"
+          + " miss. Exception: {}", e.getMessage());
+      }
+    }
if (cb == null) {
if (!repeat && updateCacheMetrics) {
stats.miss(caching, cacheKey.isPrimary(), cacheKey.getBlockType());
TinyLfuBlockCache.java

@@ -38,6 +38,7 @@
import org.apache.hadoop.util.StringUtils;
import org.apache.hbase.thirdparty.com.google.common.base.MoreObjects;
import org.apache.hbase.thirdparty.com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException;
import org.apache.yetus.audience.InterfaceAudience;

import org.slf4j.Logger;
@@ -158,13 +159,16 @@ public boolean containsBlock(BlockCacheKey cacheKey) {
@Override
public Cacheable getBlock(BlockCacheKey cacheKey,
boolean caching, boolean repeat, boolean updateCacheMetrics) {
-    Cacheable value = cache.asMap().computeIfPresent(cacheKey, (blockCacheKey, cacheable) -> {
-      // It will be referenced by RPC path, so increase here. NOTICE: Must do the retain inside
-      // this block. because if retain outside the map#computeIfPresent, the evictBlock may remove
-      // the block and release, then we're retaining a block with refCnt=0 which is disallowed.
-      cacheable.retain();
-      return cacheable;
-    });
+    Cacheable value = cache.getIfPresent(cacheKey);
+    if (value != null) {
+      try {
+        value.retain();
+      } catch (IllegalReferenceCountException e) {
+        value = null;
+        LOG.debug("TinyLfu cache block retain caused refCount Exception. Treating this as L1 cache"
+          + " miss. Exception: {}", e.getMessage());
+      }
+    }
if (value == null) {
if (repeat) {
return null;
@@ -910,7 +910,7 @@ public void testReaderWithAdaptiveLruCombinedBlockCache() throws Exception {
}

/**
-  * Test case for CombinedBlockCache with AdaptiveLRU as L1 cache
+  * Test case for CombinedBlockCache with LRU as L1 cache
*/
@Test
public void testReaderWithLruCombinedBlockCache() throws Exception {