Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to leader latch WIP #1037

Merged
merged 5 commits into from Jun 5, 2018
Merged

Conversation

keith-turner
Copy link
Contributor

I switched from LeaderSelector to LeaderLatch in order to avoid all of the exception logging. I am not completely finished with this, but just got a full build to run. I am creating a PR to see how it does on travis.

@keith-turner keith-turner mentioned this pull request Jun 1, 2018
@kennethmcfarland
Copy link
Contributor

@keith-turner Everything builds here without any errors, nice work!

@@ -235,7 +239,7 @@ private synchronized long getTimestampsImpl(String id, int num) throws TExceptio
}

if (!isLeader) {
throw new IllegalStateException("Received timestamp request but Oracle is not leader");
throw new IllegalStateException("Received timestamp request but Oracle is not leader ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a weird whitespace to add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I think I added something for debugging and then removed it.


if (started) {
// if we stopped the server manually, we shouldn't halt
Halt.halt("Oracle has lost leadership unexpectedly and is now halting.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This halt is very unfriendly to running Fluo services in a framework. It makes sense only if it is only ever executed as it's own Java process... and I don't think it's a good idea to make that assumption, since it constrains how it should be run.

Is there an alternative to doing this which might be available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does this when the Oracle loses leadership without stop being called. Given that the Oracle is a multithreaded application and that its very hard to reason about what those threads are doing and have control over them, halting is the best option for correctness of the system. Also, the normal way to run the Oracle is in its own process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ever want to support running the Oracle in any other execution framework, I think it's worth trying to ensure those threads are properly managed and interrupted. But, we don't need to support that today, I suppose.


// leaderLatch.close() schedules a background delete, give it a chance to process before
// closing curator... this is done to avoid spurious exceptions, see CURATOR-467
Thread.sleep(250);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An InterruptedException here will prevent the curatorFramework.close() on the next line. Maybe do something like the following?

try {
  Thread.sleep(250);
} finally {
  curatorFramework.close();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use the guava sleep uninterruptably

private void sleepUntilConnected(OracleServer oserver) throws InterruptedException {
while (!oserver.isConnected()) {
private void sleepWhile(Supplier<Boolean> condition) throws InterruptedException {
while (condition.get()) {
Copy link
Member

@ctubbsii ctubbsii Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you changed this to a supplier, but I also liked the readability of the sleepUntil method name.

You could add it back as:

private void sleepUntil(Supplier<Boolean> condition) throws InterruptedException {
  sleepWhile(() -> !condition.get());
}

And then, all your statements would like like one of these two:

sleepWhile(oserver::isConnected);
sleepUntil(oserver::isConnected);

That would remove all the excess () -> stuff a bit, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that suggestion, much more readable.

@keith-turner
Copy link
Contributor Author

I am finished working on this PR. Based on the spelunking I have done in the Curator code, I think we may still see some occasional spurious exceptions from Curator. This is caused by some Curator recipes scheduling background operations that may throw exceptions if things are closed. At this point there is not much we can do about these other than fix curator.

@keith-turner
Copy link
Contributor Author

I have opened CURATOR-467 and CURATOR-469 to document the problems I found while trying to avoid spurious error logging from Curator.

@keith-turner keith-turner merged commit 9b5277c into apache:master Jun 5, 2018
@keith-turner keith-turner deleted the leader-latch branch August 8, 2019 21:18
@keith-turner keith-turner added this to the 2.0.0 milestone Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants