Skip to content

Conversation

@dlmarion
Copy link
Contributor

@dlmarion dlmarion commented Aug 11, 2020

Looking at the issues at #1086 (comment), I believe that just disabling the retries from the server components resolves both. This change disables retries from the server components and uses the default retry factory when being called from non-Server classes.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a well-named static factory method would be better than passing in the RetryFactory everywhere. With the RetryFactory parameter, it's very hard to trace through the code to figure out what was used. But, a well-named static method instead of calling a different constructor, can make it much more readable and easier to understand the intent.

@milleruntime
Copy link
Contributor

FYI the original ticket #1086 was for the 1.9 branch which will probably be a very different fix. We may chose to not fix it in 1.10 but just wanted to give a heads up that this could auto close that ticket.

@dlmarion dlmarion changed the title fixes #1086 - Disable retries from the server components fixes #1086 - Disable Zookeeper retries from TabletServer Aug 12, 2020
@dlmarion
Copy link
Contributor Author

@milleruntime - good catch. I created another PR for the 1.9 branch at #1679

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My overall feedback is that I think this is a bit too aggressive. It swaps in a ZooKeeper utility that behaves completely differently on the TabletServer than it does on other servers.

I don't think that's the right kind of fix we need or want.

First, TabletServer isn't the only code that would be affected... it's anything using ZooLock. So, it doesn't make sense to me to override the retry behavior for ZooReaderWriter in the TabletServer, as a fix.

Second, ZooReaderWriter is a general purpose utility, made available in ServerContext as a convenience. ZooLock does not need to be bound by its retry behavior. It can use its own ZooKeeper instance or a ZK wrapper utility. A more narrow fix would be to address what gets used by ZooLock, without changing the behavior of the rest of the server process (which is pretty risky, and I worry will cause problems worse than the fix).

Comment on lines +1388 to +1397

/**
* For the tablet server we don't want ZooReaderWriter to keep retrying after a error
* communicating with ZooKeeper.
*/
@Override
protected ServerContext createServerContext(SiteConfiguration siteConfig) {
return new ServerContext(siteConfig, false);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that this will have a larger impact on the various ways ZooKeeper is used in a TabletServer, rather than be a narrow fix for ZooLock. The ServerContext is universal for the process. The only ZooReaderWriter instance for which we want to avoid retries is the one used by ZooLock. Is there a reason we need to change the behavior across the entire tserver?

this(keepers, timeout, true);
}

public ZooReader(String keepers, int timeout, boolean enableRetries) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could pass in the RetryFactory rather than have a boolean param.

Comment on lines +72 to +74
private ZooReaderWriter(String keepers, int timeoutInMillis, String secret,
boolean enableRetries) {
super(keepers, timeoutInMillis, enableRetries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could pass in the retry factory rather than pass a boolean around. That would make the code more clear.

Comment on lines +74 to +75
public ServerContext(SiteConfiguration siteConfig, boolean enableZookeeperRetries) {
this(new ServerInfo(siteConfig), enableZookeeperRetries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could pass in the retry factory rather than pass around booleans.

Comment on lines +85 to +86
zooReaderWriter = enableRetries ? ZooReaderWriter.retriesEnabled(info.getSiteConfiguration())
: ZooReaderWriter.retriesDisabled(info.getSiteConfiguration());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me. It seems outside the scope of ServerContext to be handling this kind of thing. ServerContext should merely be a holder of the ZRW that is available to the server for updating/reading from ZK. The affected code is in ZooLock, and is not a problem with ZooReaderWriter's retry behavior more generally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this a good idea, but its something I thought of after reading this comment. Could make every ZooReaderWriter method accept a retry factory and have overloaded versions that do not take a retry factory and use a default. This seems tedious though, but its the best I could think of so far to associate the retry logic with method calls instead of the object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do something like add a retry to every method, I'd like to revisit my effort to clean up the ZooReaderWriter stuff, like what I had started to do in #1459 ; it would be easier to add something like a retry to every method, if the code were cleaned up a bit first. I can revisit my previous cleanup effort over the next week.

In the meantime, I still think it's probably best to simply avoid the problematic ZRW code in the ZooLock. ZooLock does not need to use the general-purpose utility code, when it can use something specific to its needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback on the PR. I read back through the comments on #1086 and the comments above. My previous attempts were to create a code path through the existing objects that didn't retry instead of creating something new that also talks with ZK. It sounds as if you both might be leaning toward a solution that does not use ZooLock for the tablet server lock, but uses something else that does not retry and communicates directly with ZooKeeper. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctubbsii - did you want me to wait for #1459 to be completed before revisiting this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds as if you both might be leaning toward a solution that does not use ZooLock for the tablet server lock, but uses something else that does not retry and communicates directly with ZooKeeper. Is that correct?

I don't think that's quite right. I don't think it makes sense to distance ourselves from using ZooLock for locking (doesn't it exist for that purpose?). Rather, I think we're suggesting that the implementation of ZooLock should be changed to distance itself from the general-purpose utilities of ZooReaderWriter/ZooSession. This can be done by making ZooLock never use those at all and use something of its own, or as Keith suggested, adding special-purpose methods on those utilities for ZooLock to safely use.

@ctubbsii - did you want me to wait for #1459 to be completed before revisiting this?

That depends. #1459 had no intention of touching ZooLock, so if you want to try to make ZooLock work without ZooReaderWriter/ZooSession, then your efforts should not conflict with that and can proceed independently. If, however, you want to try to add special-purpose methods for ZooLock on ZooReaderWriter/ZooSession, then it should wait, and I can prioritize my efforts next week to clean those up to make it easier for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather, I think we're suggesting that the implementation of ZooLock should be changed to distance itself from the general-purpose utilities of ZooReaderWriter/ZooSession. This can be done by making ZooLock never use those at all and use something of its own, or as Keith suggested, adding special-purpose methods on those utilities for ZooLock to safely use.

Understood, I'll work down that path. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlmarion I was thinking of looking into making ZooLock use Zookeeper directly to see if that is workable.

@ctubbsii
Copy link
Member

I have completed a refactor of some of the Zoo-related utilities in #1755 . After that PR is merged, it might be easier to identify alternate ways to address this. (Or, it may not be any help... I'm not sure. It's been awhile since I looked at this issue.)

@dlmarion dlmarion closed this Jan 26, 2021
@dlmarion dlmarion deleted the 1086-zoolock-session-expiration branch January 26, 2021 13:00
@ctubbsii
Copy link
Member

@dlmarion This was closed as marked as "Done" for 2.1.0, but I don't think it was merged. I think it's being addressed via a different approach. Is that correct?

@dlmarion
Copy link
Contributor Author

dlmarion commented Jan 26, 2021

I deleted the previous PR and branch as it's being addressed here. It looks like automation marked it Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants