Skip to content

Curator 110#9

Merged
asfgit merged 6 commits intoapache:masterfrom
cammckenzie:CURATOR-110
Jun 17, 2014
Merged

Curator 110#9
asfgit merged 6 commits intoapache:masterfrom
cammckenzie:CURATOR-110

Conversation

@cammckenzie
Copy link
Copy Markdown
Contributor

Fix for CURATOR-110

Cameron McKenzie added 2 commits June 2, 2014 16:30
'RECONNECTED' events in the same way. Added a test case for the leader
latch being started before a connection to ZK has been established.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - this suggests that we should have an isLost() method in the ConnectionState no?

Cameron McKenzie and others added 2 commits June 4, 2014 10:24
that can be overridden by each enum instance.
…nnected' method. This allows a race condition between the start() method and the ConnectionStateListener in the LeaderLatch recipe to be avoided. Updated the leader latch start() method to block until a connection is available (in a background thread), before it attempts to setup its state. This means that it will work correctly if started before a connection to ZK is available.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these needed? They already exist in the CuratorZookeeperClient.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blockUntilConnectedOrTimedOut() method doesn't do what is needed in this case, because there's no way of specifying a timeout. We could call it in a loop I guess, but would need a second thread to interrupt it when our specified time out occurred, which is pretty ugly. Alternatively, we could modify the CuratorZookeeperClient to expose a blockUntilConnected() method that takes a time out? And have blockUntilConnectedOrTimedOut call this with the connection time out? Preferences?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, there is a timeout. It's the connectionTimeout used when creating the CuratorFramework instance. Is that not sufficient? If it's still needed, it seems better to push it down to CuratorZookeeperClient with the other method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought it's nicer to give control to the caller as to how long they want to wait for rather than being bound to the connection timeout. So, you would prefer if I exposed a blockUntilConnected() method on the CuratorZookeeperClient that takes a timeout, and then refactor the current blockUntilConnectedOrTimedOut to call that with the connection timeout? The only issue with this is that the current implementation of blockUntilConnectedOrTimedOut is implemented with 1 second sleeps, so the best granularity as far as timeouts you're going to get is a full second. This was partly why I implemented it via a connection state listener in the CuratorFramework. It just seemed cleaner than sitting in a loop sleeping for a second and then checking the state. I don't think that the connection state listeners are available from the CuratorZookeeperClient though.

@Randgalt
Copy link
Copy Markdown
Member

Man - it's getting complicated huh? I guess there's no other way. I wonder if there's some simplification that can be done.

@cammckenzie
Copy link
Copy Markdown
Contributor Author

For what I originally thought was going to be a 1 line fix, yes!

It seems to me that the wait logic is cleaner to implement at the CuratorFramework level, rather than at the CuratorZooKeeperClient, just because you've got a nice ConnectionStateListener framework to deal with. The internalBlockUntilConnectedOrTimedOut() method is a bit ugly in that it has to block in 1 second increments. Is there a reason that this can't just wait until for the entire session timeout in one go?

As an aside, it's a bit inefficient too as it allocates and destroys a CountDownLatch, and Watcher each iteration of the loop.

@Randgalt
Copy link
Copy Markdown
Member

OK - I see your point. I forget why the CuratorZookeeperClient does a spin loop like that. That's some of the oldest code in the lib.

@cammckenzie
Copy link
Copy Markdown
Contributor Author

Ok, so what's the way forward? We can either refactor internalBlockUntilConnectedOrTimedOut() in CuratorZooKeeperClient to allow arbitrary timeouts, and try to remove the 1 second sleep increments. Or, we can move the wait logic to the CuratorFramework, and use the ConnectionStateListener.

Assuming there's not technical reason why the CuratorZooKeeperClient can't block for arbitrary lengths of time, then it's ok to implement there. I don't really have a strong preference either way.

@Randgalt
Copy link
Copy Markdown
Member

I'd appreciate your opinion. I'm OK with either.

@cammckenzie
Copy link
Copy Markdown
Contributor Author

Ok, I've implemented it in the CuratorZookeeperClient. Doesn't seem to be any issues with using an arbitrary sleep length. Still need to do a bit of testing. Will get something sorted early next week hopefully.

@cammckenzie
Copy link
Copy Markdown
Contributor Author

Scratch that, I think that the reason that the CuratorZookeeperClient was doing a spin loop was because there's a race condition with the watchers. It's possible for the local watcher to get a 'connected' event before the ConnectionState gets its 'connected' event. This means that when you call into ConnectionState.isConnected() it returns false, even though we know it's actually true.

So, while we could return the boolean based on what we know the state to be, it's going to be inconsistent for a short period with the ConnectionState, and this has potential for knock on consequences, even though the window of inconsistency is short.

So, I think it's actually better to move this wait logic into the CuratorFramework and use the ConnectionStateListener.

the ExecuteAfterConnectionEstablished utility class. Cleaned up the
blockUntilConnected() logic in the CuratorFrameworkImpl
only notify blocked clients when a connected state is reached, rather
than any state.
@cammckenzie
Copy link
Copy Markdown
Contributor Author

Latest commit has everything implemented at the CuratorFramework level. Have a look and see what you think. I still seem to have random completely unrelated tests failing occasionally which is a bit disconcerting. They work fine when I rerun them though. I'm not sure if these are due to race conditions or flakiness of the TestingServer.

@Randgalt
Copy link
Copy Markdown
Member

The test are flakey because many of them have to wait "for a bit" for things to settle (session to fail, ephemeral to delete, etc.). But, the ZK server has essentially random behavior in terms of when it reconnects, connects, etc. If the VM is gc'ing or your system is slow, etc. it gets worse.

I keep trying to tune things but it's still not perfect.

@Randgalt
Copy link
Copy Markdown
Member

Interestingly, now that we have ExecuteAfterConnectionEstablished, we no longer need ConnectionState.isConnected() nor do we need the change to LeaderLatch.handleStateChange().

If you don't mind, I'll remove those changes.

@cammckenzie
Copy link
Copy Markdown
Contributor Author

The ConnectionState.isConnected() is being used is in the CuratorFrameworkImpl when we're blocking for a connection. I think that leaving it in is probably not a bad idea, as it's still got potential to be useful in other places too.

In regards to the tests, I wonder if there's a way of structuring them to be more event driven rather than just waiting some arbitrary amount of time and hope an event occurs within that window? I'll have a look and see if I can find anything, but that's for another JIRA ticket.

@asfgit asfgit merged commit 59bab73 into apache:master Jun 17, 2014
@cammckenzie cammckenzie deleted the CURATOR-110 branch June 17, 2014 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants