New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to leader latch WIP #1037
Conversation
@keith-turner Everything builds here without any errors, nice work! |
@@ -235,7 +239,7 @@ private synchronized long getTimestampsImpl(String id, int num) throws TExceptio | |||
} | |||
|
|||
if (!isLeader) { | |||
throw new IllegalStateException("Received timestamp request but Oracle is not leader"); | |||
throw new IllegalStateException("Received timestamp request but Oracle is not leader "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a weird whitespace to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I think I added something for debugging and then removed it.
|
||
if (started) { | ||
// if we stopped the server manually, we shouldn't halt | ||
Halt.halt("Oracle has lost leadership unexpectedly and is now halting."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This halt is very unfriendly to running Fluo services in a framework. It makes sense only if it is only ever executed as it's own Java process... and I don't think it's a good idea to make that assumption, since it constrains how it should be run.
Is there an alternative to doing this which might be available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does this when the Oracle loses leadership without stop being called. Given that the Oracle is a multithreaded application and that its very hard to reason about what those threads are doing and have control over them, halting is the best option for correctness of the system. Also, the normal way to run the Oracle is in its own process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we ever want to support running the Oracle in any other execution framework, I think it's worth trying to ensure those threads are properly managed and interrupted. But, we don't need to support that today, I suppose.
|
||
// leaderLatch.close() schedules a background delete, give it a chance to process before | ||
// closing curator... this is done to avoid spurious exceptions, see CURATOR-467 | ||
Thread.sleep(250); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An InterruptedException
here will prevent the curatorFramework.close()
on the next line. Maybe do something like the following?
try {
Thread.sleep(250);
} finally {
curatorFramework.close();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use the guava sleep uninterruptably
private void sleepUntilConnected(OracleServer oserver) throws InterruptedException { | ||
while (!oserver.isConnected()) { | ||
private void sleepWhile(Supplier<Boolean> condition) throws InterruptedException { | ||
while (condition.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how you changed this to a supplier, but I also liked the readability of the sleepUntil
method name.
You could add it back as:
private void sleepUntil(Supplier<Boolean> condition) throws InterruptedException {
sleepWhile(() -> !condition.get());
}
And then, all your statements would like like one of these two:
sleepWhile(oserver::isConnected);
sleepUntil(oserver::isConnected);
That would remove all the excess () ->
stuff a bit, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that suggestion, much more readable.
I am finished working on this PR. Based on the spelunking I have done in the Curator code, I think we may still see some occasional spurious exceptions from Curator. This is caused by some Curator recipes scheduling background operations that may throw exceptions if things are closed. At this point there is not much we can do about these other than fix curator. |
I have opened CURATOR-467 and CURATOR-469 to document the problems I found while trying to avoid spurious error logging from Curator. |
I switched from LeaderSelector to LeaderLatch in order to avoid all of the exception logging. I am not completely finished with this, but just got a full build to run. I am creating a PR to see how it does on travis.