@openinx
Member

@openinx openinx commented May 16, 2019

No description provided.

// Decrease the block's reference count; if refCount reaches 0, the block is deallocated
// automatically. DO NOT move this up: if you do, the victimHandler may access the buffer
// with refCnt = 0, which is disallowed.
previous.getBuffer().release();
Contributor

So this is the problem. Mind explaining more? Why would the victimHandler access the previous block? And is it possible to add a UT?

Member Author

@openinx openinx May 17, 2019

Why in victimHandler we will access the previous? And is it possible to add a UT?

For InclusiveCombinedBlockCache, we move the evicted block from LRUCache to a larger L2 cache (such as MemcachedBlockCache, for longer caching, I think). So if we release at line#596, the victimHandler will cache a block that points to an unknown memory area, because its memory has already been freed.
Yeah, I will provide a UT.
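
To make the ordering concern above concrete, here is a minimal sketch. It assumes a toy reference-counted block and a generic victim handler; `EvictOrderSketch`, `RefCountedBlock`, `evictThenRelease`, and `releaseThenEvict` are hypothetical names for illustration and are not the HBase classes or methods touched by this patch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical, simplified stand-ins; not the HBase implementation. */
final class EvictOrderSketch {

  /** A toy reference-counted block whose memory is "freed" when the count hits 0. */
  static final class RefCountedBlock {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private byte[] data;

    RefCountedBlock(byte[] data) {
      this.data = data;
    }

    /** Reading a block with refCnt = 0 is disallowed, so fail loudly. */
    byte[] read() {
      if (refCnt.get() <= 0) {
        throw new IllegalStateException("refCnt: 0, buffer already freed");
      }
      return data;
    }

    void release() {
      if (refCnt.decrementAndGet() == 0) {
        data = null; // simulate deallocation
      }
    }
  }

  /** Safe order: hand the block to the victim handler first, then release it. */
  static void evictThenRelease(RefCountedBlock evicted, Consumer<RefCountedBlock> victimHandler) {
    victimHandler.accept(evicted);
    evicted.release();
  }

  /** Unsafe order: the victim handler sees a block whose memory is already gone. */
  static void releaseThenEvict(RefCountedBlock evicted, Consumer<RefCountedBlock> victimHandler) {
    evicted.release();
    victimHandler.accept(evicted);
  }
}
```

In this sketch, a handler that copies the block into a second-level store works with `evictThenRelease` and fails with `releaseThenEvict`, which is the reason the release has to stay after the victim handler call.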

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| 0 | reexec | 44 | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 | hbaseanti | 0 | Patch does not have any anti-patterns. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| -0 | test4tests | 0 | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | _ HBASE-21879 Compile Tests _ |
| +1 | mvninstall | 278 | HBASE-21879 passed |
| +1 | compile | 55 | HBASE-21879 passed |
| +1 | checkstyle | 76 | HBASE-21879 passed |
| +1 | shadedjars | 280 | branch has no errors when building our shaded downstream artifacts. |
| -1 | findbugs | 168 | hbase-server in HBASE-21879 has 11 extant Findbugs warnings. |
| +1 | javadoc | 38 | HBASE-21879 passed |
| | | | _ Patch Compile Tests _ |
| +1 | mvninstall | 263 | the patch passed |
| +1 | compile | 54 | the patch passed |
| +1 | javac | 54 | the patch passed |
| -1 | checkstyle | 76 | hbase-server: The patch generated 1 new + 95 unchanged - 1 fixed = 96 total (was 96) |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedjars | 281 | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 539 | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 188 | the patch passed |
| +1 | javadoc | 33 | the patch passed |
| | | | _ Other Tests _ |
| -1 | unit | 13822 | hbase-server in the patch failed. |
| +1 | asflicense | 30 | The patch does not generate ASF License warnings. |
| | | 17308 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hbase.client.TestAsyncRegionAdminApi |
| | hadoop.hbase.replication.TestReplicationDisableInactivePeer |
| | hadoop.hbase.master.procedure.TestSCPWithReplicas |
| | hadoop.hbase.master.procedure.TestSCPWithReplicasWithoutZKCoordinated |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/artifact/out/Dockerfile |
| GITHUB PR | #242 |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux fb443f933cf7 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar 18 15:02:05 UTC 2019 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | HBASE-21879 / ab05d9d |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.11 |
| findbugs | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/artifact/out/branch-findbugs-hbase-server-warnings.html |
| checkstyle | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/artifact/out/diff-checkstyle-hbase-server.txt |
| unit | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/artifact/out/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/testReport/ |
| Max. process+thread count | 5429 (vs. ulimit of 10000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-242/1/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

@openinx
Member Author

openinx commented May 20, 2019

This patch still does not fix the refCnt=0 retain issue. I applied it to my local branch and deployed it to my test cluster; after running for some days, the QPS still dropped from 300000 Get/second to 200 Get/second. Need to find out why.

@ramkrish86
Contributor

So even after this patch, when the QPS drops, do you still get the exception attached in the original JIRA description?

@openinx
Member Author

openinx commented May 20, 2019

bq. So even after this patch if QPS drops - you still get the exception as attached in the original JIRA description?
Yeah, it still has the stack trace. Let me check.

Mon May 20 12:16:36 CST 2019, RpcRetryingCaller{globalStartTime=1558325539029, pause=200, maxAttempts=16}, java.io.IOException: java.io.IOException: org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleException(HRegion.java:6534)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6504)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6473)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2999)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2979)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2961)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2955)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2621)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2548)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:374)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
	at org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.retain0(AbstractReferenceCounted.java:87)
	at org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.retain(AbstractReferenceCounted.java:74)
	at org.apache.hadoop.hbase.nio.SingleByteBuff.retain(SingleByteBuff.java:398)
	at org.apache.hadoop.hbase.nio.SingleByteBuff.retain(SingleByteBuff.java:39)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.retain(HFileBlock.java:449)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.retain(HFileBlock.java:115)
	at org.apache.hadoop.hbase.io.hfile.LruBlockCache.getBlock(LruBlockCache.java:538)
	at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.getBlock(CombinedBlockCache.java:84)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1298)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1464)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:339)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:844)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:794)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:394)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:249)
	at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2063)
	at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2054)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6493)
	... 12 more

@openinx
Member Author

openinx commented May 24, 2019

Updated the patch with the final fix and a UT. FYI @Apache9 @ramkrish86 @anoopsjohn

@openinx
Member Author

openinx commented May 25, 2019

Seems there is no Hadoop QA feedback? That's strange...

// Must reset it to null here: otherwise, if an exception happens in readBlock, we would
// release the previously assigned block a second time in the finally block.
// (See HBASE-22422)
block = null;
Contributor

Good catch.
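
The double release described in that code comment can be reproduced with a small sketch. It assumes a loop that reuses one `block` variable across iterations and releases it in a `finally` block; `DoubleReleaseSketch`, `Block`, and `readAll` are hypothetical names, and the `resetEachIteration` flag exists only to toggle between the behavior before and after the fix.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical, simplified stand-ins; not the HBase implementation. */
final class DoubleReleaseSketch {

  /** A toy block that only tracks its reference count. */
  static final class Block {
    final AtomicInteger refCnt = new AtomicInteger(1);

    void release() {
      refCnt.decrementAndGet();
    }
  }

  /**
   * Reads blocks one by one and releases each in the finally block.
   * With resetEachIteration = false, a failed read leaves the previous
   * block in the variable, so the finally block releases it a second time.
   */
  static void readAll(List<Supplier<Block>> readers, boolean resetEachIteration) {
    Block block = null;
    for (Supplier<Block> reader : readers) {
      try {
        if (resetEachIteration) {
          block = null; // forget the block released in the previous iteration
        }
        block = reader.get(); // may throw before the assignment happens
      } finally {
        if (block != null) {
          block.release();
        }
      }
    }
  }
}
```

In this sketch, when the second reader throws, the first block ends with a reference count of -1 without the reset and 0 with it.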

BlockCacheKey cacheKey, Cacheable newBlock) {
// NOTICE: The getBlock has retained the existingBlock inside.
Cacheable existingBlock = blockCache.getBlock(cacheKey, false, false, false);
if (existingBlock == null) {
Contributor

Oh, does this mean we could have hit an NPE before this change?

Member Author

Yes, see LruBlockCache#cacheBlock:

    LruCachedBlock cb = map.get(cacheKey);
    if (cb != null && !BlockCacheUtil.shouldReplaceExistingCacheBlock(this, cacheKey, buf)) {
      return;
    }

The existence pre-check and the block access in shouldReplaceExistingCacheBlock are not one atomic operation, so if an eviction happens between them, the NPE occurs. I added a UT, testMultiThreadGetAndEvictBlock, to cover this.
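
The check-then-act gap can be illustrated deterministically. The sketch below assumes a plain map of byte arrays, and the `interleaved` callback stands in for an eviction by another thread; `CheckThenActSketch`, `cacheBlock`, and `shouldReplace` are hypothetical names, and both the length comparison and the choice to return `true` for a vanished entry are assumptions made for illustration, not the logic of `BlockCacheUtil`.

```java
import java.util.concurrent.ConcurrentMap;

/** Hypothetical, simplified stand-ins; not the HBase implementation. */
final class CheckThenActSketch {

  /**
   * Looks the entry up again, so it must tolerate the entry having vanished
   * since the caller's pre-check. Without the null check, the dereference of
   * existing.length would throw a NullPointerException.
   */
  static boolean shouldReplace(ConcurrentMap<String, byte[]> cache, String key, byte[] newBlock) {
    byte[] existing = cache.get(key);
    if (existing == null) {
      return true; // entry was evicted in between: nothing to compare against
    }
    return newBlock.length > existing.length; // hypothetical replacement rule
  }

  /** Pre-check, then a simulated concurrent step, then the second lookup. */
  static boolean cacheBlock(ConcurrentMap<String, byte[]> cache, String key, byte[] block,
      Runnable interleaved) {
    boolean present = cache.containsKey(key); // existence pre-check
    interleaved.run(); // stands in for an eviction by another thread
    if (present && !shouldReplace(cache, key, block)) {
      return false; // keep the existing entry
    }
    cache.put(key, block);
    return true;
  }
}
```

Passing `() -> cache.remove(key)` as the interleaved step exercises the path where the pre-check saw an entry but the second lookup does not.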

@openinx openinx merged commit b673000 into apache:HBASE-21879 May 28, 2019
@openinx openinx deleted the HBASE-21879 branch May 28, 2019 02:24
return absent.get() ? null : re;
}

public boolean remove(BlockCacheKey key) {
Contributor

When removing, is atomicity not needed? Because we do computeIfAbsent, and that is already atomically guarded?

absent.set(true);
return entry;
});
return absent.get() ? null : re;
Contributor

I think I got it now. You have changed the get() to computeIfPresent(), so if remove() has already removed the entry, get() cannot do a retain(). So a similar change is needed for putIfAbsent() as well? The rest looks good to me.

Member Author

Yeah, if we don't make the put and retain atomic, and a remove & release happens between the put and the retain, we end up retaining a block with refCnt=0, which is also disallowed. Thanks.
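
The idea discussed in this thread, doing the lookup and the retain as one atomic map operation, can be sketched as follows. It assumes a `ConcurrentHashMap` keyed by strings; `RefCountedMapSketch` and `Entry` are hypothetical names, and the `putIfAbsent` shape with an `AtomicBoolean` is modeled on the snippet quoted above rather than copied from the patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-ins; not the HBase implementation. */
final class RefCountedMapSketch {

  /** A toy entry that refuses to be retained once its count has reached 0. */
  static final class Entry {
    final AtomicInteger refCnt = new AtomicInteger(1);

    Entry retain() {
      for (;;) {
        int current = refCnt.get();
        if (current <= 0) {
          throw new IllegalStateException("refCnt: 0, increment: 1");
        }
        if (refCnt.compareAndSet(current, current + 1)) {
          return this;
        }
      }
    }

    void release() {
      refCnt.decrementAndGet();
    }
  }

  private final ConcurrentHashMap<String, Entry> map = new ConcurrentHashMap<>();

  /** Lookup and retain happen inside one atomic per-key map operation. */
  Entry get(String key) {
    return map.computeIfPresent(key, (k, entry) -> entry.retain());
  }

  /** Insert and retain happen atomically; returns null if the entry was added. */
  Entry putIfAbsent(String key, Entry entry) {
    AtomicBoolean absent = new AtomicBoolean(false);
    Entry current = map.computeIfAbsent(key, k -> {
      entry.retain(); // the map now holds its own reference
      absent.set(true);
      return entry;
    });
    return absent.get() ? null : current;
  }

  /** Drops the map's reference; a later get() finds no entry to retain. */
  boolean remove(String key) {
    Entry previous = map.remove(key);
    if (previous != null) {
      previous.release();
    }
    return previous != null;
  }
}
```

Because `computeIfPresent` and `computeIfAbsent` on `ConcurrentHashMap` run their functions atomically per key, a concurrent `remove` in this sketch either happens before the lookup (so `get` returns null) or after the retain (so the count never passes through 0 while a reader holds the entry).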

openinx added a commit to openinx/hbase that referenced this pull request Jun 25, 2019
infraio pushed a commit to infraio/hbase that referenced this pull request Aug 17, 2020