HBASE-27871 Meta replication stuck forever if wal it's still reading gets rolled and deleted #5241
Conversation
💔 -1 overall

This message was automatically generated.
@@ -124,7 +124,7 @@ public enum HasNext {
   * You can call {@link #peek()} or {@link #next()} to get the actual {@link Entry} if this method
   * returns {@link HasNext#YES}.
   */
- public HasNext hasNext() {
+ public HasNext hasNext() throws IOException {
The design here is to not throw any exceptions; instead, always use the return value to tell upper layers what to do...
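The return-value-based design mentioned here can be sketched in isolation. This is a standalone illustration, not HBase's actual `WALEntryStream`: the enum values mirror the `HasNext` states shown in the diff above, but the stream internals are hypothetical.

```java
// Minimal sketch of a stream that reports state via a return value instead of
// throwing: callers switch on the result and decide whether to retry or stop.
import java.util.ArrayDeque;
import java.util.Deque;

public class EntryStreamSketch {
  public enum HasNext { YES, RETRY, RETRY_IMMEDIATELY, NO }

  private final Deque<String> entries = new ArrayDeque<>();

  public void add(String e) { entries.add(e); }

  // No checked exception in the signature; the state is the return value.
  public HasNext hasNext() {
    if (!entries.isEmpty()) {
      return HasNext.YES;
    }
    return HasNext.RETRY; // nothing buffered yet; caller should retry later
  }

  public String next() { return entries.poll(); }

  public static void main(String[] args) {
    EntryStreamSketch s = new EntryStreamSketch();
    System.out.println(s.hasNext());
    s.add("edit-1");
    System.out.println(s.hasNext());
    System.out.println(s.next());
  }
}
```

The benefit of this style is that every failure mode the caller must handle is visible in the enum, rather than scattered across catch blocks.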
I was trying to avoid putting the FNFE handling logic in WALEntryStream. To me, that's something specific to the meta replicas, which is why I thought of placing it in this new CatalogReplicationSourceWALReader class. But for that, I had to make WALEntryStream throw the exception to the next upper layer (the reader). WDYT?
Hi, I have a small question here: are we seeing this issue only with meta replicas, or with user table region replicas as well?
Then better to wrap it with UncheckedIOException, or a specific exception type of our own, add more comments to explain the reason, and then handle this exception in CatalogReplicationSourceWALReader?
> Hi, I have a small question here: are we seeing this issue only with meta replicas, or with user table region replicas as well?

Only meta replicas. For user region replicas, we use ZK storage for the WAL queue.
> Then better to wrap it with UncheckedIOException, or a specific exception type of our own, add more comments to explain the reason, and then handle this exception in CatalogReplicationSourceWALReader?

That's indeed better; it makes the fix much simpler, as we don't even need the additional CatalogReplicationSourceWALReader. Wrapping the FNFE in an UncheckedIOException would go all the way up to the replication endpoint, which would then deal with it accordingly. Let me know WDYT, @Apache9.
🎊 +1 overall

This message was automatically generated.
lgtm.
      ((ReplicationPeerImpl) source.replicationPeer).setPeerState(true);
      verifyReplication(tableName, numOfMetaReplica, 0, 5000, HConstants.CATALOG_FAMILY);
    } finally {
      table.close();
Use try-with-resources.
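The suggestion can be shown with a minimal, self-contained example. The `Table` class here is a stand-in for any `AutoCloseable` resource such as an HBase table, not the actual test code:

```java
// Sketch of try-with-resources: the resource is closed automatically when the
// block exits, even if the body throws, so no explicit finally { close() } is
// needed.
public class TryWithResourcesSketch {
  static class Table implements AutoCloseable {
    boolean closed = false;
    @Override public void close() { closed = true; }
  }

  public static void main(String[] args) {
    Table t = new Table();
    try (Table table = t) {
      // ... put data, verify replication ...
      System.out.println("using table");
    } // table.close() runs here automatically
    System.out.println("closed=" + t.closed);
  }
}
```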
    HRegionServer hrs = cluster.getRegionServer(cluster.getServerHoldingMeta());
    ReplicationSource source = (ReplicationSource) hrs.getReplicationSourceService()
      .getReplicationManager().catalogReplicationSource.get();
    ((ReplicationPeerImpl) source.replicationPeer).setPeerState(false);
Is this enough to make sure that the replication is already stopped? The check is in another thread; we'd better add some checks here to confirm that the replication source is stopped.
OK, there's a small chance of a race condition here, although I have run this many times and never hit it.
Nevertheless, I'm adding extra checks to make sure we only proceed with adding data once we are sure the source WAL reader knows the peer is disabled. Please check, @Apache9.
    HTU.getHBaseCluster().getMaster().getLogCleaner().triggerCleanerNow().get(1,
      TimeUnit.SECONDS);
    ((ReplicationPeerImpl) source.replicationPeer).setPeerState(true);
    verifyReplication(tableName, numOfMetaReplica, 0, 5000, HConstants.CATALOG_FAMILY);
Here we just checked whether data can be read? I think the main thing here is that we should make sure the ReplicationSource can still replicate things out, i.e., that it is not stuck forever.
So maybe we should load some data, flush, delete the WAL, enable the replication source, and then load more data, without flushing, to make sure that we can read all the data?
> Here we just checked whether data can be read? I think the main thing here is that we should make sure the ReplicationSource can still replicate things out, i.e., that it is not stuck forever.

That's what we are testing here. We disable the catalog peer before we do the flush and compact. When the replication is stuck forever because of the FNFE, the updates from line #244 never get to the secondary replicas and the verifyReplication call on line #255 fails.
What I mean is that, for meta replication, the reason why we can just drop the WAL is that we can reload the hfiles from the filesystem if the data have already been flushed out. So here, since the data have already been flushed out, even if the replication is stuck forever, it is still possible that we could read the data from the secondary replica, as all the data are in hfiles now. So I think we'd better load some data again, to make sure that even if the data is not in an hfile, we could still read it, i.e., the replication is still alive.
> So here, since the data have already been flushed out, even if the replication is stuck forever, it is still possible that we could read the data from the secondary replica, as all the data are in hfiles now.

Not really. The secondary replicas only know they need to load new files when they get notified that a flush/compaction happened in the primary. This is done via replication, through the WAL markers. With replication stuck, the secondaries never load new files, and it only gets worse: as compacted files eventually get deleted by the compaction discharger, the secondaries start to throw FNFE. This is the scenario this test is simulating.
What if the secondary replica is reassigned?
Yes, reassignment would "fix" the issue. But we are not triggering reassignment in this test.
But we do not disable the balancer in this test, right? And maybe we will add some periodic reloading logic in the future, which could make the test not test what we want. So I still think we should try to load more data, where the new data are only in the WAL file, to make sure that the secondary replica can still load them.
OK, changed it to disable the balancer and also add extra data after the peer is re-enabled, with no flush/compaction triggered.
    // so we are making sure here we only proceed once the reader loop has managed to
    // detect the peer is disabled.
    int retries = 0;
    while (true) {
Just using HBTU.waitFor should be enough? Anyway, not a blocker.
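The waitFor-style helper suggested here polls a condition until it holds or a timeout expires. HBase's HBaseTestingUtility.waitFor has this general shape; the version below is a self-contained sketch, with the signature and names chosen for illustration rather than copied from the HBase API.

```java
// Sketch of a waitFor helper: poll `condition` every intervalMs until it holds
// or timeoutMs elapses, returning whether the condition was ultimately met.
import java.util.function.BooleanSupplier;

public class WaitForSketch {
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(intervalMs);
    }
    return condition.getAsBoolean(); // one last check at the deadline
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.currentTimeMillis();
    // Condition becomes true after roughly 50 ms, well inside the timeout.
    boolean ok = waitFor(5000, 10, () -> System.currentTimeMillis() - start > 50);
    System.out.println("condition met: " + ok);
  }
}
```

Compared to a hand-rolled `while (true)` loop with a retry counter, this centralizes the timeout policy and makes the test fail with a clear signal instead of hanging.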
I don't know why this pre-commit run is reporting spotless failures for a class that's not even in the PR. Nevertheless, I have run spotless on my local branch and no errors are reported.