Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-23181 Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us" #739

Closed
wants to merge 1 commit into from

Conversation

@Apache9
Copy link
Contributor

Apache9 commented Oct 21, 2019

No description provided.

@Apache9 Apache9 requested a review from saintstack Oct 21, 2019
@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 21, 2019

@saintstack @binlijin PTAL. This is another way to fix the problem of HBASE-23157. The idea is simple, just remove everything in SequenceIdAccounting when we write a close region marker.

Why introducing a new parameter instread of checking the edit when appending is that, the append is single threaded so we'd better do less in it, and also passing a protobuf message back from a byte array can throw IOException, which is a bit annoying as we do not expect IOExceptions for lots of methods.

And maybe this could also solve HBASE-23181, but I still think we need to find out the root cause for HBASE-23181 and add a UT for it, maybe there are still other problems.

@Apache-HBase

This comment has been minimized.

Copy link

Apache-HBase commented Oct 21, 2019

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 56s Docker mode activated.
_ Prechecks _
💚 dupname 0m 1s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 24 new or modified test files.
_ master Compile Tests _
💙 mvndep 0m 36s Maven dependency ordering for branch
💚 mvninstall 6m 47s master passed
💚 compile 2m 8s master passed
💚 checkstyle 2m 47s master passed
💚 shadedjars 5m 46s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 1m 24s master passed
💙 spotbugs 4m 28s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 6m 15s master passed
_ Patch Compile Tests _
💙 mvndep 0m 16s Maven dependency ordering for patch
💚 mvninstall 5m 41s the patch passed
💚 compile 1m 51s the patch passed
💚 javac 1m 51s the patch passed
💔 checkstyle 1m 28s hbase-server: The patch generated 2 new + 373 unchanged - 17 fixed = 375 total (was 390)
💔 checkstyle 0m 20s hbase-mapreduce: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2)
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 38s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 15m 42s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 1m 16s the patch passed
💚 findbugs 6m 1s the patch passed
_ Other Tests _
💚 unit 3m 6s hbase-common in the patch passed.
💔 unit 261m 50s hbase-server in the patch failed.
💚 unit 24m 24s hbase-mapreduce in the patch passed.
💚 asflicense 1m 52s The patch does not generate ASF License warnings.
363m 45s
Reason Tests
Failed junit tests hadoop.hbase.master.TestSplitWALManager
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/artifact/out/Dockerfile
GITHUB PR #739
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 5bdec0fb24e6 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-739/out/precommit/personality/provided.sh
git revision master / da9a53e
Default Java 1.8.0_181
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/artifact/out/diff-checkstyle-hbase-server.txt
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/artifact/out/diff-checkstyle-hbase-mapreduce.txt
unit https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/artifact/out/patch-unit-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/testReport/
Max. process+thread count 5445 (vs. ulimit of 10000)
modules C: hbase-common hbase-server hbase-mapreduce U: .
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/1/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 22, 2019

The idea is good, it can solve the problem that make regionservers have too many wals, but the master may still have the warn message:
2019-10-07 21:09:49,101 WARN [B.defaultRpcServer.handler=201,queue=21,port=60000] master.ServerManager: RegionServer 9.46.144.221,60020,1565160409765 indicates a last flushed sequence id (24529281) that is less than the previous last flushed sequence id (24529341) for region gdtdata_20191007,5680,1570302605714.187ad7634f6ca396fa2aab5cc5c83b61. Ignoring.

Looks like the warn do not make any problem?

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 22, 2019

The warning does not lead to actual problem but I think we should still avoid it. Maybe another issue? I think we should find a better way to find the proper flushed sequence id for reporting to master.

@Apache9 Apache9 force-pushed the Apache9:HBASE-23157 branch from ebd8a44 to 8bfcc0a Oct 22, 2019
@saintstack

This comment has been minimized.

Copy link
Contributor

saintstack commented Oct 22, 2019

I thought about doing this but thought it too obnoxious. It also punts on what the actual issue is over in HBASE-23181, on why on close an extra edit can arrive after we've moved aside the Region Map in SequenceIdAccounting.

We could apply as a workaround or patch until we figure real issues?

@binlijin So that is why we get that message? I didn't think of that. Good one.

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 22, 2019

But if we use ASYNC_WAL, the patch on HBASE-23181 can not solve the problem, as by design, ASYNC_WAL will not wait for the SyncFuture to complete...

So we have to manually remove the records in SequenceIdAccounting, after we make sure that the region has been closed and all data have been flushed out.

Copy link
Contributor

saintstack left a comment

One question is if the close will cut off any edits in flight?

@@ -51,4 +51,8 @@ public static ImmutableByteArray wrap(byte[] b) {
public String toStringUtf8() {
return Bytes.toString(b);
}

public String toStringBinary() {

This comment has been minimized.

Copy link
@saintstack

saintstack Oct 22, 2019

Contributor

Just make this a toString rather than add a new method?

@@ -193,7 +193,8 @@ public void testWALRecordReader() throws Exception {
WALEdit edit = new WALEdit();
edit.add(new KeyValue(rowName, family, Bytes.toBytes("1"),
System.currentTimeMillis(), value));
long txid = log.append(info, getWalKeyImpl(System.currentTimeMillis(), scopes), edit, true);
long txid =
log.append(info, getWalKeyImpl(System.currentTimeMillis(), scopes), edit, true, false);

This comment has been minimized.

Copy link
@saintstack

saintstack Oct 22, 2019

Contributor

Would it make sense adding an override for append rather than make change everywhere append is called?

super(key, edit);
this.inMemstore = inMemstore;
this.closeRegion = closeRegion;

This comment has been minimized.

Copy link
@saintstack

saintstack Oct 22, 2019

Contributor

Do we have to add a flag? See WALUtil where we have createFlushWALEdit which uses special family and qualifier to denote an edit a FLUSH edit. Could do same here and skip the extra flag?

.collect(Collectors.joining(",", "{", "}")));
}
}
this.highestSequenceIds.remove(encodedRegionName);

This comment has been minimized.

Copy link
@saintstack

saintstack Oct 22, 2019

Contributor

Good. This is like the purge method over in HBASE-23181.

* so we can finish an unfinished compaction -- it is not an edit for memstore.
* @param closeRegion Whether this is a region close marker, i.e, the last wal edit for this
* region on this region server. The WAL implementation should remove all the related
* stuff, for example, the sequence id accounting.

This comment has been minimized.

Copy link
@saintstack

saintstack Oct 22, 2019

Contributor

Good.

@Apache-HBase

This comment has been minimized.

Copy link

Apache-HBase commented Oct 22, 2019

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 57s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 1s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 24 new or modified test files.
_ master Compile Tests _
💙 mvndep 0m 38s Maven dependency ordering for branch
💚 mvninstall 5m 16s master passed
💚 compile 1m 47s master passed
💚 checkstyle 2m 16s master passed
💚 shadedjars 4m 41s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 1m 21s master passed
💙 spotbugs 4m 7s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 5m 43s master passed
_ Patch Compile Tests _
💙 mvndep 0m 15s Maven dependency ordering for patch
💚 mvninstall 4m 59s the patch passed
💚 compile 1m 46s the patch passed
💚 javac 1m 46s the patch passed
💚 checkstyle 0m 26s The patch passed checkstyle in hbase-common
💚 checkstyle 1m 26s hbase-server: The patch generated 0 new + 373 unchanged - 17 fixed = 373 total (was 390)
💚 checkstyle 0m 20s The patch passed checkstyle in hbase-mapreduce
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 34s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 15m 48s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 1m 17s the patch passed
💚 findbugs 5m 53s the patch passed
_ Other Tests _
💚 unit 3m 6s hbase-common in the patch passed.
💔 unit 280m 19s hbase-server in the patch failed.
💚 unit 33m 1s hbase-mapreduce in the patch passed.
💚 asflicense 1m 46s The patch does not generate ASF License warnings.
385m 42s
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/2/artifact/out/Dockerfile
GITHUB PR #739
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 67466d64b9a5 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-739/out/precommit/personality/provided.sh
git revision master / 2ad62b0
Default Java 1.8.0_181
unit https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/2/artifact/out/patch-unit-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/2/testReport/
Max. process+thread count 5436 (vs. ulimit of 10000)
modules C: hbase-common hbase-server hbase-mapreduce U: .
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/2/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 23, 2019

As proposed on slack in the direct message to @saintstack , let me try to introduce two methods in WAL interfacce, one is for appending data, and another is for appending marker, to avoid too many flags.

@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 23, 2019

@Apache9 Yes, it another problem we need to find the root cause and fix it in another issue.

@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 23, 2019

@saintstack i need to study it and figure out why we get that message.

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 23, 2019

Hi @binlijin , please follow the steps on this page

https://cwiki.apache.org/confluence/display/OPENWHISK/Accessing+Apache+GitHub+as+a+Committer

to combine your github account and the asf account? You should be a member of HBase so you can merge pull requests. Now you are listed as a contributor...

@Apache9 Apache9 force-pushed the Apache9:HBASE-23157 branch from 8bfcc0a to ba369fd Oct 23, 2019
@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 23, 2019

@Apache9 OK.

@Apache9 @saintstack
2019-10-07 21:09:49,101 WARN [B.defaultRpcServer.handler=201,queue=21,port=60000] master.ServerManager: RegionServer 9.46.144.221,60020,1565160409765 indicates a last flushed sequence id (24529281) that is less than the previous last flushed sequence id (24529341) for region gdtdata_20191007,5680,1570302605714.187ad7634f6ca396fa2aab5cc5c83b61. Ignoring.

After debuging i find this message caused by the same problem.
And it can be reproduced, i upload it in HBASE-23157, you can see the TestMaxFlushedSeqId.patch.

@Apache-HBase

This comment has been minimized.

Copy link

Apache-HBase commented Oct 23, 2019

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 3m 32s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 1s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 24 new or modified test files.
_ master Compile Tests _
💙 mvndep 0m 36s Maven dependency ordering for branch
💚 mvninstall 5m 14s master passed
💚 compile 1m 47s master passed
💚 checkstyle 2m 15s master passed
💚 shadedjars 4m 40s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 1m 18s master passed
💙 spotbugs 3m 57s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 5m 29s master passed
_ Patch Compile Tests _
💙 mvndep 0m 16s Maven dependency ordering for patch
💚 mvninstall 4m 57s the patch passed
💚 compile 1m 46s the patch passed
💚 javac 1m 46s the patch passed
💔 checkstyle 1m 29s hbase-server: The patch generated 2 new + 373 unchanged - 17 fixed = 375 total (was 390)
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 33s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 15m 38s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 1m 15s the patch passed
💚 findbugs 5m 58s the patch passed
_ Other Tests _
💚 unit 3m 8s hbase-common in the patch passed.
💔 unit 287m 32s hbase-server in the patch failed.
💚 unit 32m 8s hbase-mapreduce in the patch passed.
💚 asflicense 1m 47s The patch does not generate ASF License warnings.
393m 13s
Reason Tests
Failed junit tests hadoop.hbase.client.TestSnapshotDFSTemporaryDirectory
hadoop.hbase.regionserver.TestHRegion
hadoop.hbase.regionserver.TestHRegionWithInMemoryFlush
hadoop.hbase.client.TestFromClientSideWithCoprocessor
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/3/artifact/out/Dockerfile
GITHUB PR #739
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 1324dd1d2bd8 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-739/out/precommit/personality/provided.sh
git revision master / 2ad62b0
Default Java 1.8.0_181
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/3/artifact/out/diff-checkstyle-hbase-server.txt
unit https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/3/artifact/out/patch-unit-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/3/testReport/
Max. process+thread count 5637 (vs. ulimit of 10000)
modules C: hbase-common hbase-server hbase-mapreduce U: .
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-739/3/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@saintstack

This comment has been minimized.

Copy link
Contributor

saintstack commented Oct 24, 2019

Thanks for bringing up the " indicates a last flushed sequence id" message on the master. Good one. It is related. At least for the HBASE-23181 case, it makes sense because an entry goes into SequenceIdAccounting with a sequenceid that is one less than the flush sequenceid (somehow it is getting in after the flush the flush). Once it goes in, the Master starts complaining.

You added a whole new flushcache method to HRegion in TestMaxFlushedSeqId.patch. How is it different from the original? The test looks good. What do you see as the output? You can make it so a flush gets a sequenceid that is less than the last edit's sequenceid? The last edit does not make it into the flush? Its out of order?

Nice work.

@saintstack

This comment has been minimized.

Copy link
Contributor

saintstack commented Oct 24, 2019

@binlijin See above sir.

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 24, 2019

I think this could be a follow on. It is the same problem, but we need to discuss how to fix it. If we think the current solution, where we remove all the related records when closing a region, is fine, then I think we could add some sequence ids back to the unflushedSequenceIds in completeCacheFlush, if we found that the current sequence id in the unflushedSequenceIds is even smaller then the flushedSequenceId.

As we will write a flush marker and wait for it to complete, so after flush we can make sure that there are no pending WAL entries for this region any more, so I think the solution above can work.

@Apache9 Apache9 force-pushed the Apache9:HBASE-23157 branch from ba369fd to 0c74260 Oct 24, 2019
@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 24, 2019

Or let's change this PR against HBASE-23181, and use HBASE-23157 for addressing the flushed sequence id problem.

@Apache9 Apache9 changed the title HBASE-23157 WAL unflushed seqId tracking may wrong when Durability.AS… HBASE-23181 Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us" Oct 24, 2019
@saintstack

This comment has been minimized.

Copy link
Contributor

saintstack commented Oct 24, 2019

What would be a follow on sir?

Diff between 23181 and this is a async wal. The patch on there at moment is bad I realize now

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 24, 2019

What would be a follow on sir?

Diff between 23181 and this is a sync wal. The patch on there at moment is bad I realize now.

Like I proposed above, we could do something in the completeCacheFlush to prevent the flushed sequence id go backwards. And this does not lead to data loss, only a warning log on master side which makes people a bit confusing, so I think it is fine to address it in a follow on issue.

Issuing a wal sync while holding the writeLock of updateLock is not good, and we have been doing a lot of works to avoid this...

I've already changed the title to HBASE-23181, if we all agree that this is fine, let's merge this, and then let's use Lijin Bin's new UT to address the flushed sequence id problem using HBASE-23157.

Thanks.

@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 24, 2019

@saintstack
I added a whole new flushcache method to HRegion in TestMaxFlushedSeqId.patch, this method is to flush only some HStore and used by the test.
The test do two flush for a HRegion, the HRegion have two HStore, the first flush do flushing two HStore and the HRegion get maxFlushedSeqId=4 and the second flush only flush one HStore, and the HRegion get maxFlushedSeqId=2 which smaller than the flush before.

@Apache9

This comment has been minimized.

Copy link
Contributor Author

Apache9 commented Oct 24, 2019

Let's close this one, will open a new PR against HBASE-23181.

@saintstack

This comment has been minimized.

Copy link
Contributor

saintstack commented Oct 25, 2019

@binlijin we should fix that. It does not seem to be the behavior seen in HBASE-23181 though, right?

@Apache9 you see the comments above?

Ok on continuing on a new PR.

@binlijin

This comment has been minimized.

Copy link
Contributor

binlijin commented Oct 28, 2019

@saintstack Yes, they are not the same behavior, but for the same reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.