Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16116: Use apache curator to manage the Solr Zookeeper interactions #760

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

HoustonPutman
Copy link
Contributor

@HoustonPutman HoustonPutman commented Mar 24, 2022

https://issues.apache.org/jira/browse/SOLR-16116

This is still a WIP there are two outstanding issues:

  • The hadoop-auth framework requires a different version of curator
  • The leader election code is slightly broken, because the OnReconnect logic in ZKController expects to be called at different times, with different guarantees of watchers and ephemeral nodes than Zookeeper gives.

Might look into using curator recipes for persistent watchers and leader election, but this is a fairly complete migration to start with.

@HoustonPutman
Copy link
Contributor Author

leader election code is not broken, it was that the sessionTimeout was not being populated in SolrZkClient.

@risdenk
Copy link
Contributor

risdenk commented Apr 12, 2022

@HoustonPutman I haven't looked super closely at this, but I think that Hadoop shades curator when needed. Its possible that the Solr Hadoop classes just need to have the imports updated to use the Hadoop specific shaded curator paths. This is new with the Hadoop auth and HDFS modules in Solr when we upgraded Hadoop. I tried to switch to the Hadoop self contained shaded dependencies to avoid issues like this.

@madrob
Copy link
Contributor

madrob commented Apr 12, 2022

@risdenk I believe the issue currently is that for hadoop-auth we are not able to rely solely on hadoop-client-* dependencies, which are the ones that would use a relocated curator. All of the auth filter and kerberos logic and whatever else we borrow from hadoop is annotated with audience LimitedPrivate I think, so they didn't bother making a consumable client module for people to use.

@HoustonPutman
Copy link
Contributor Author

Thanks for the comment @risdenk . I was talking to @madrob about it, but he hinted that there were other problems with using the shaded dependencies. However that might no longer be applicable, so I will definitely give it a shot in a week or month (things getting a bit busy haha).

@dsmiley
Copy link
Contributor

dsmiley commented Feb 19, 2023

I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.

Comment on lines 210 to 219
var clientBuilder = CuratorFrameworkFactory.builder()
.ensembleProvider(new FixedEnsembleProvider(zkHost))
.namespace(chroot)
.sessionTimeoutMs(zkClientTimeout)
.connectionTimeoutMs(clientConnectTimeout)
.aclProvider(this.aclProvider)
.authorization(zkCredentialsProvider.getCredentials())
.retryPolicy(retryPolicy);

client = clientBuilder.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently worked on an experimental hack for some days on 8x to have the Overseer use Curator for elections. One thing that I got stuck on was that Solr's tests were finding some threads lingering. Eventually I discovered that Curator has an internal executor and a way to set it on the builder runSafeService. Once I did that and tended to ensuring this executor got shot down, I didn't have that problem anymore. Just sharing this with you now in hopes it helps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW #1743 used this runSafeService stuff for the hadoop-auth changes. I think it might only come into play when there is an external zkClient provided but good to know either way.

@risdenk
Copy link
Contributor

risdenk commented Oct 3, 2023

hadoop 3.3.6 upgraded curator to 5.2.0 - #1743. I'm working on getting that in. chatted w/ @HoustonPutman about it as well

Copy link
Contributor

@risdenk risdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great. I know it needs to be brought up to main again but 🥳

@risdenk
Copy link
Contributor

risdenk commented Oct 4, 2023

#1743 is in main and so this PR should be ready to update :D

@risdenk
Copy link
Contributor

risdenk commented Oct 10, 2023

I merged main (after hadoop 3.3.6/curator 5.x) and then fixed up different things after that to get all the tests passing including nightly. Would appreciate another set of eyes @HoustonPutman @dsmiley

I'll look through this again to see if things can be cleaned up more, but hopefully its progress.

public void command();

@Override
default void stateChanged(CuratorFramework client, ConnectionState newState) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't know if this is right. Its just like OnDisconnect which seems wrong but tests pass :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure about any of these classes. It seems like OnDisconnect should do !newState.isConnected() instead of newState == ConnectionState.LOST... I have a slight memory about doing this stuff and there being reasons, but I can't remember why I did this... It definitely warrants a much more thorough look. I'm not even sure having a BeforeReconnect is a good idea. The curator APIs don't really give us a good way of implementing it.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 11, 2024

AFAIK, ever since #1743 landed, we're no longer blocked on Hadoop-Auth

@HoustonPutman
Copy link
Contributor Author

Yeah let's get this in main. Probably best to not introduce it in a late 9x version

@risdenk
Copy link
Contributor

risdenk commented Mar 11, 2024

@HoustonPutman @dsmiley I'm happy to merge main into this again and resolve any conflicts. can you do another review to make sure all is good before merging to main?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is a doozy

@@ -121,6 +121,16 @@ dependencies {
implementation 'org.eclipse.jetty:jetty-io'
implementation 'org.eclipse.jetty.toolchain:jetty-servlet-api'

implementation('org.apache.curator:curator-framework', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather minor but I think moving this below so that we have a "ZooKeeper / Curator" section is more organized.

@@ -366,7 +361,6 @@ public ZkController(
.withUrl(zkServerAddress)
.withTimeout(clientTimeout, TimeUnit.MILLISECONDS)
.withConnTimeOut(zkClientConnectTimeout, TimeUnit.MILLISECONDS)
.withConnStrategy(strat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened with this?

Op.create(
nodePath,
null,
zkClient.getZkACLProvider().getACLsToAdd(nodePath),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering where ACLs are handled

zookeeper.delete(ops.get(j).getPath(), -1, true);
} catch (KeeperException.NoNodeException e2) {
if (log.isDebugEnabled()) {
log.debug("Can not remove node which is not exist : {}", ops.get(j).getPath());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you remove because you saw too much of these?

@@ -398,16 +397,6 @@ private void retryRegisterWatcher() {
return;
} catch (KeeperException e) {
log.warn("Failed watching shard term for collection: {}, retrying!", collection, e);
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove?

}

protected <T> T runWithCorrectThrows(String action, SupplierWithException<T> func)
throws KeeperException, InterruptedException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we catch InterruptedException now

@@ -1048,6 +1125,11 @@ public void downloadFromZK(String zkPath, Path dir) throws IOException {
ZkMaintenanceUtils.downloadFromZK(this, zkPath, dir);
}

@FunctionalInterface
public interface IsClosed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this new interface need to be public?

@@ -43,6 +43,17 @@ dependencies {
var zkExcludes = {
exclude group: "org.apache.yetus", module: "audience-annotations"
}
api('org.apache.curator:curator-client', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debatable if "api" is appropriate here

api('org.apache.curator:curator-framework', {
exclude group: 'org.apache.zookeeper', module: 'zookeeper'
})
api('org.apache.curator:curator-test') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thus all solr-test-framework users (plugin users outside of the project) will need curator-test. Ah; I see this is because ChaosMonkey is here and uses the KillSession thing. Okay; a little comment might be nice.

long sessionId = zkClient.getZkSessionId();
zkServer.expire(sessionId);
try {
KillSession.kill(zkClient.getCuratorFramework().getZookeeperClient().getZooKeeper());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't you add a convenience method for that? If I'm wrong; never mind.

@github-actions github-actions bot removed the stale PR not updated in 60 days label Mar 16, 2024
@dsmiley
Copy link
Contributor

dsmiley commented Apr 25, 2024

This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.

@risdenk
Copy link
Contributor

risdenk commented Apr 30, 2024

This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.

I'm not actively working on it. @HoustonPutman started this and I just tried to push it a bit further. I might get back to it but nothing immediately. Someone else can definitely push it further.

@HoustonPutman
Copy link
Contributor Author

Thanks for putting in the work @risdenk ! I will push to add this for 10x only this summer.

Copy link

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Jun 30, 2024
Copy link

github-actions bot commented Oct 5, 2024

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions bot added the closed-stale Closed after being stale for 60 days label Oct 5, 2024
@github-actions github-actions bot closed this Oct 5, 2024
@dsmiley dsmiley reopened this Oct 5, 2024
@epugh
Copy link
Contributor

epugh commented Oct 5, 2024

Thanks for reopening this @dsmiley and would be great to get it in now that summer is over ;-)

@github-actions github-actions bot removed closed-stale Closed after being stale for 60 days stale PR not updated in 60 days labels Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants