-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4) #157
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time looking at this change, and I don't see a problem, it looks good. I have one minor request for change and one comment about tests. It would be good to check that our test cases already cover enough, and otherwise add more test cases.
boolean snapshotTaken = false; | ||
boolean isPreZAB1_0 = true; | ||
//If we are not going to take the snapshot be sure the edits are not applied in memory | ||
boolean writeToEditLog = !snapshotNeeded; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes here are using edit
to refer to txns
. I'd rather use txn
to be consistent across the project. Specifically here, you're using EditLog
to refer to the TxnLog
, please change accordingly to have it consistent across the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that I am still a bit new at the internal terminology. I will update it.
@@ -839,6 +839,13 @@ public void converseWithFollower(InputArchive ia, OutputArchive oa, | |||
Assert.assertEquals(1, f.self.getAcceptedEpoch()); | |||
Assert.assertEquals(1, f.self.getCurrentEpoch()); | |||
|
|||
//Wait for the edits to be written out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to think some more whether it makes any sense to add test cases for this. The test cases we already have probably cover this enough given that there is no real change of behavior.
This change here is necessary, though. We don't really care about time in general in our tests because we can never be sure of the timing we will get across runs and with different settings.
Thanks for the patch @revans2 . It makes sense to port this change to both 3.5 and master. |
@fpj Thanks for the review I will update the comments and start porting it to other lines. |
@fpj I addressed your review comments, and I also found another race in the ZAB test that I addressed. Apparently the log line I removed was taking enough time for the transactions to be fully flushed. When I removed it the test would occasionally fail. Please also take a look at #158 and #159 for master and branch 3.5. |
Hey @revans2, FYI. I was able to apply the both #158 and #159 without any explicit conflict on If I am right then putting a comment to apply either #158 or #159 (up to you) to both branch-3.5 and master should be enough, IMHO. |
@eribeiro will do. It is a clean cherry pick between master and 3.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change made sense to me, although I think it would be great to test that we are only snapshotting at the appropriate times
//Wait for the transactions to be written out. The thread that writes them out | ||
// does not send anything back when it is done. | ||
long start = System.currentTimeMillis(); | ||
while (createSessionZxid != f.fzk.getLastProcessedZxid() && (System.currentTimeMillis() - start) < 50) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just an idea, not sure if it is worth the effort and it may be outside the scope of this patch.
we could play with the test infrastructure here a little bit and do some dependency injection in createFollower
that can let us track if db clearing and snapshotting occurs when expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem a bit beyond the scope of this. But if you really want me to I can look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice, a good way of actually validating everything is behaving as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK Will see what I can do
@@ -321,13 +321,16 @@ protected void syncWithLeader(long newLeaderZxid) throws IOException, Interrupte | |||
QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null); | |||
QuorumPacket qp = new QuorumPacket(); | |||
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid); | |||
|
|||
readPacket(qp); | |||
//In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we generally put spaces after the //
// * When a new quorum is established we can still apply the diff | ||
// on top of the same zkDb data | ||
// * If we fetch a new snapshot from leader, the zkDb will be | ||
// cleared anyway before loading the snapshot | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one case we may still want to clear db here - when one of the ZooKeeper critical threads (such as * processors, session trackers) fail, ZooKeeper server will shutdown (see runFromConfig) and consequently invoke ZooKeeper#shutdown. In this case, I don't see a particular reason not to clear the db, though not doing it might be fine (as one could argue the server will be dead anyway), but I tend to lean towards the safe side on cleaning the db. One way to conditionally do that is to add a Boolean parameter to ZooKeeper#shutdown so we can have fine grained control over when to clear db in what code path.
@hanm I addressed your review comments. |
@revans2 the change looks good, thanks. |
Had another look at the patch, specifically the changes on Learner. Looks good to me. +1. |
@afine I updated the test to spy on the |
Thanks @revans2 +1 lgtm |
Is there any more I need to do to get this merged in? |
@revans2 No more work is required, the patch is ready, but I am not sure if this should be included in the upcoming 3.4.10 release. If not we will wait until 3.4.10 is out to merge this into branch-3.4. @rakeshadr Do you think this should be included in 3.4.10? The PR to master #159 could be merged in, I'll have another look and merge it today. |
… DBs (master) This is the master version of #157 Author: Robert (Bobby) Evans <evans@yahoo-inc.com> Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org> Closes #159 from revans2/ZOOKEEPER-2678-master and squashes the following commits: 69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times 742367e [Robert (Bobby) Evans] Addressed review comments f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
@hanm , mostly it makes sense to me as you all have done detailed code reviews and there is no open comments now. I hope the code path is thoroughly discussed/tested? I'd like to freeze the code changes asap. |
@rakeshadr If it makes you feel any better we have been running with an older version of this patch in production for a while. We have used it as part of a rolling upgrade at least 10 times in production where if it were not there we would have had some very painful outages. I have also manually tested it at least 50 times shooting the leader under load (10,000 operations/second) on a 3.4 GB DB, watching it recover, and then validating the integrity of the DB to be sure we didn't get any corruption. |
… DB (3.4) This patch addresses recovery time when a leader is lost on a large DB. It does this by not clearing the DB before leader election begins, and by avoiding taking a snapshot as part of the SYNC phase, specifically for a DIFF sync. It does this by buffering the proposals and commits just like the code currently does for proposals/commits sent after the NEWLEADER and before the UPTODATE messages. If a SNAP is sent we cannot avoid writing out the full snapshot because there is no other way to make sure the disk DB is in sync with what is in memory. So any edits to the edit log before a background snapshot happened could possibly be applied on top of an incorrect snapshot. This same optimization should work for TRUNC too, but I opted not to do it for TRUNC because TRUNC is rare and TRUNC by its very nature already forces the DB to be reread after the edit logs are modified. So it would still not be fast. In practice this makes it so instead of taking 5+ mins for the cluster to recover from losing a leader it now takes about 3 seconds. I am happy to port this to 3.5. if it looks good. Author: Robert (Bobby) Evans <evans@yahoo-inc.com> Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org> Closes #157 from revans2/ZOOKEEPER-2678 and squashes the following commits: d079617 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times dcbf325 [Robert (Bobby) Evans] Addressed review comments f57c384 [Robert (Bobby) Evans] ZOOKEEPER-2678: Fixed another race f705293 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments 5aa2562 [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
… DBs (master) This is the master version of #157 Author: Robert (Bobby) Evans <evans@yahoo-inc.com> Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org> Closes #159 from revans2/ZOOKEEPER-2678-master and squashes the following commits: 69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times 742367e [Robert (Bobby) Evans] Addressed review comments f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
@revans2 Please close this pull request; it's merged. |
… DBs (master) This is the master version of apache#157 Author: Robert (Bobby) Evans <evans@yahoo-inc.com> Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org> Closes apache#159 from revans2/ZOOKEEPER-2678-master and squashes the following commits: 69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times 742367e [Robert (Bobby) Evans] Addressed review comments f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
Suppose there is a situation |
@fregatte123 You should post this question to dev@ or user@ list to reach audience. I'm not sure how many devs are monitoring closed PRs, probably not many. |
… DBs (master) This is the master version of apache#157 Author: Robert (Bobby) Evans <evans@yahoo-inc.com> Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org> Closes apache#159 from revans2/ZOOKEEPER-2678-master and squashes the following commits: 69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times 742367e [Robert (Bobby) Evans] Addressed review comments f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
This patch addresses recovery time when a leader is lost on a large DB.
It does this by not clearing the DB before leader election begins, and by avoiding taking a snapshot as part of the SYNC phase, specifically for a DIFF sync. It does this by buffering the proposals and commits just like the code currently does for proposals/commits sent after the NEWLEADER and before the UPTODATE messages.
If a SNAP is sent we cannot avoid writing out the full snapshot because there is no other way to make sure the disk DB is in sync with what is in memory. So any edits to the edit log before a background snapshot happened could possibly be applied on top of an incorrect snapshot.
This same optimization should work for TRUNC too, but I opted not to do it for TRUNC because TRUNC is rare and TRUNC by its very nature already forces the DB to be reread after the edit logs are modified. So it would still not be fast.
In practice this makes it so instead of taking 5+ mins for the cluster to recover from losing a leader it now takes about 3 seconds.
I am happy to port this to 3.5. if it looks good.