-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify repository before cluster update #108531
Conversation
Pinging @elastic/es-distributed (Team:Distributed) |
Some integ tests in s3 and gcp are failing on new verification call. I assume they worked before because we updated cluster first and repository was there, not verified. I'm looking how to fix this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect
won't pin down the problem very precisely.
Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService
too? That way we'll be left with a PR that just does the behaviour change.
+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replaces |
server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java
Outdated
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
Outdated
Show resolved
Hide resolved
I will create separate PR for the test suite improvements, will keep this one open for behaviour change. |
…108589) I observed that `testRegisterRepositorySuccessAfterCreationFailed` test never invokes assertion blocks, because listener is not invoked. There are 2 problems: 1. Test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned Response or Failure. 2. We silently ignore assertions in listener because it is not invoked. Test pass successfully. PutRepositories method relies on cluster state update. I replace mocked ClusterService and ThreadPool with test implementation of these. Also add blocking call on listener to ensure we get result. Address [comment](#108531 (review)) to break down larger PR into smaller pieces in #108531
@DaveCTurner @ywangd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment. Thanks!
if (request.verify()) { | ||
validatePutRepositoryRequest(request, validationStep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this done together with validateRepositoryCanBeCreated
? I think it feels nature that pre-cluster-state-update validation/verification is done in one place. Also seems to be wasteful to create the repository twice.
While I appreciate that SubscribableListener
generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated
, I think we don't need to touch the code here at all in this PR which would make it much easier to review. We can always have a follow-up which will be pure-refactoring for introucing SubscribableListener
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validateRepositoryCanBeCreated
method is used in other places, I would rather remove it from here. But it breaks unrelated tests. I can try to chase these down in this PR, or following PR.
Another problem is behaviour change, validateRepositoryCanBeCreated runs despite "verify" flag, but repository verification behind "verify". So putting everything behind verify or including new verification logic regardless "verify" breaks another set of tests including bwc.
About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.
I think it is better to keep functional change and style change separate unless it is very trivial. If somehow we need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR. But it feels like a overall good principle.
validateRepositoryCanBeCreated method is used in other places
In that case, we don't have to literally move the new code into it. It can sit besides it as a new method. My main point is to not touch the below steps of listeners in this PR.
Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, the SubscribableListener
change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely right about splitting it apart, I talked with David before about it, but got tunnel-visioned chasing down test failures and forgot about it. I rethink what I said before, and I believe all of what you mentioned above can be addressed.
I created PR with non functional change - #108788
I have some thoughts how to organize validateRepositoryCanBeCreated
better, will update this PR after refactoring merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. That was my main gripe :) Thanks for splitting it out. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated PR with refactoring included
This PR is a syntactic change for `registerRepository` in `RepositoriesService`. I use `SubscribableListener` to display order of events and reduce boilerplate code around failures delegation `listener.delegateFailureAndWrap`. It's a part of larger change for verification logic, which should take advantage of this "sequential" version of code. #108531
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to go except a couple of superficial points, see below.
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> { | ||
try { | ||
final var token = repository.startVerification(); | ||
if (token != null) { | ||
repository.verify(token, clusterService.localNode()); | ||
repository.endVerification(token); | ||
} | ||
resultListener.onResponse(null); | ||
} catch (Exception e) { | ||
resultListener.onFailure(e); | ||
} finally { | ||
closeRepository(repository); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's generally best to avoid submitting a bare Runnable
to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT
doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable
is always a better choice:
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> { | |
try { | |
final var token = repository.startVerification(); | |
if (token != null) { | |
repository.verify(token, clusterService.localNode()); | |
repository.endVerification(token); | |
} | |
resultListener.onResponse(null); | |
} catch (Exception e) { | |
resultListener.onFailure(e); | |
} finally { | |
closeRepository(repository); | |
} | |
}); | |
threadPool.executor(ThreadPool.Names.SNAPSHOT) | |
.execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> { | |
final var token = repository.startVerification(); | |
if (token != null) { | |
repository.verify(token, clusterService.localNode()); | |
repository.endVerification(token); | |
} | |
})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked
which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.
@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) { | |||
} | |||
} | |||
|
|||
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { | |||
return safeAwait(SubscribableListener.newForked(exceptionListener -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary {...}
, can just be an expression lambda
return safeAwait(SubscribableListener.newForked(exceptionListener -> { | |
return safeAwait(SubscribableListener.newForked(exceptionListener -> |
@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) { | |||
} | |||
} | |||
|
|||
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't use the generic type param T
anywhere we can just use a wildcard instead:
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { | |
public static Exception safeAwaitFailure(SubscribableListener<?> listener) { |
@@ -181,6 +173,26 @@ public void testRegisterRejectsInvalidRepositoryNames() { | |||
} | |||
} | |||
|
|||
public void testPutRepositoryVerificationFails() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to also have a test that starts with an existing repository and demostrate a failed update (due to verfication) request does not change the existing repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added new test testPutRepositoryVerificationFailsOnExisting
return createRepository(name, true); | ||
return createRepository(name, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change necessary? Is it because we will reject invalid repo earlier with this change and we still want some invalid repo to be created in tests? Sounds like an exceptional use case. Are there many such usages? If not, I feel it might be better for clarity to have a separate method such as createRepositoryWithoutVerification
for them. The following method String createRepository(final String name, final boolean verify)
is only used here and probably can be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several tests that dont rely on verification, repository creation is a part of the setup but no assertions done. In some tests we create a snapshot thread pool with single thread, new verification code also uses snapshot thread and it causes test to hang. My understanding it happens when we intentionally block some repository calls though mocks. At least thats what I was able to reproduce with debugger.
Also not all tests I was able to catch locally, some of them fail only on CI. Even not on my GCP instance.
I can try to refactor with createRepositoryWithoutVerification
for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some tests we create a snapshot thread pool with single thread, new verification code also uses snapshot thread and it causes test to hang.
The existing code also performs verification in the snapshot thread pool after the repo is added to the cluster state. So it is not immediately clear to me why the new verficiation step would hang?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me reproduce these and post details here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might mess up a bit here, this change is not related to hanging tests.
Another diff in this PR addresses hanging test:
RepositoriesIT -> testRepositoryConflict
blockMasterOnWriteIndexFile(repo);
logger.info("--> start deletion of snapshot");
ActionFuture<AcknowledgedResponse> future = clusterAdmin().prepareDeleteSnapshot(repo, snapshot1).execute();
logger.info("--> waiting for block to kick in on node [{}]", blockedNode);
waitForBlock(blockedNode, repo);
...
logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
RepositoryConflictException e2 = expectThrows(
RepositoryConflictException.class,
clusterAdmin().preparePutRepository(repo)
.setVerify(true) // >>> will hang on verification
.setType("mock")
.setSettings(Settings.builder().put("location", randomRepoPath()))
);
...
logger.info("--> unblocking blocked node [{}]", blockedNode);
unblockNode(repo, blockedNode); // >>> release thread here
So here we use thread pool with single thread that is busy with snapshot deletion, then we issue another call to repository service and it hangs on verification. Test unblocks node after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change related to metrics. We have tests that verify we have exact Repository metrics - get/put/list across nodes. But with new change master node will have few more, and there are several tests fail in s3/gcp/azure. Not consistently thou.
https://gradle-enterprise.elastic.co/s/w4amqdgguyipy
Problem is I cannot reproduce these locally, tried fix them based on gradle scan, then got few more. So I decided a blanket approach. @DaveCTurner suggested to turn verification off, not the blanket approach :)
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> { | ||
try { | ||
final var token = repository.startVerification(); | ||
if (token != null) { | ||
repository.verify(token, clusterService.localNode()); | ||
repository.endVerification(token); | ||
} | ||
resultListener.onResponse(null); | ||
} catch (Exception e) { | ||
resultListener.onFailure(e); | ||
} finally { | ||
closeRepository(repository); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked
which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.
@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception { | |||
logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress"); | |||
RepositoryConflictException e2 = expectThrows( | |||
RepositoryConflictException.class, | |||
clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath())) | |||
clusterAdmin().preparePutRepository(repo) | |||
.setVerify(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will deadlock on snapshot thread pool, we are running with single thread which is busy at the moment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this comment in the code please?
@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception { | |||
} | |||
|
|||
public void testRequestStats() throws Exception { | |||
final String repository = createRepository(randomRepositoryName()); | |||
final String repository = createRepository(randomRepositoryName(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional verification adds few more invocations on master node, failing all related tests in s3/gcp/azure.
Here test asserts that we have exact metrics across all nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here test asserts that we have exact metrics across all nodes.
This is not the case. The test failed because the repository master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts
. Thus making it have lower values compared to the records on the HTTP server side. Can we add a comment to this line to explain why verify
needs to be false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the iterations. I suggest we add comments into places where we explicity set verify
to false
for repository creation so it is easier for future readers to see why.
@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception { | |||
} | |||
|
|||
public void testRequestStats() throws Exception { | |||
final String repository = createRepository(randomRepositoryName()); | |||
final String repository = createRepository(randomRepositoryName(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here test asserts that we have exact metrics across all nodes.
This is not the case. The test failed because the repository master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts
. Thus making it have lower values compared to the records on the HTTP server side. Can we add a comment to this line to explain why verify
needs to be false
?
var resultListener = new SubscribableListener<AcknowledgedResponse>(); | ||
repositoriesService.registerRepository(request, resultListener); | ||
var ackResponse = safeAwait(resultListener); | ||
assertTrue(ackResponse.isAcknowledged()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can directly use PlainActionFuture
instead of going through safeAwait(SubscribableListener)
which internally uses PlainActionFuture
, e.g.:
var resultListener = new SubscribableListener<AcknowledgedResponse>(); | |
repositoriesService.registerRepository(request, resultListener); | |
var ackResponse = safeAwait(resultListener); | |
assertTrue(ackResponse.isAcknowledged()); | |
var future = new PlainActionFuture<AcknowledgedResponse>(); | |
repositoriesService.registerRepository(request, future); | |
assertTrue(safeGet(future).isAcknowledged()); |
@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception { | |||
logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress"); | |||
RepositoryConflictException e2 = expectThrows( | |||
RepositoryConflictException.class, | |||
clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath())) | |||
clusterAdmin().preparePutRepository(repo) | |||
.setVerify(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this comment in the code please?
@@ -2178,6 +2179,14 @@ public static <T> T safeGet(Future<T> future) { | |||
} | |||
} | |||
|
|||
public static Exception safeAwaitFailure(SubscribableListener<?> listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can have a variant of this method that takes PlainActionFuture
in a future PR.
* add verification before cluster update * dont verify repo in testRequestStats
This PR addresses issue when we can create "invalid" repository and persist it
in cluster state. Even thou we return error to caller, repository stays there.
Add repository verification only on master node before cluster update.
Refactored registerRepository code with SubscribableListener to display order
of listener calls.
fixes #107840