Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify repository before cluster update #108531

Merged
merged 23 commits into from
May 31, 2024
Merged

Conversation

mhl-b
Copy link
Contributor

@mhl-b mhl-b commented May 10, 2024

This PR addresses issue when we can create "invalid" repository and persist it
in cluster state. Even thou we return error to caller, repository stays there.

Add repository verification only on master node before cluster update.
Refactored registerRepository code with SubscribableListener to display order
of listener calls.

fixes #107840

@mhl-b mhl-b added >enhancement >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team v8.15.0 labels May 10, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mhl-b
Copy link
Contributor Author

mhl-b commented May 11, 2024

Some integ tests in s3 and gcp are failing on new verification call. I assume they worked before because we updated cluster first and repository was there, not verified.

I'm looking how to fix this.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect won't pin down the problem very precisely.

Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService too? That way we'll be left with a PR that just does the behaviour change.

@ywangd
Copy link
Member

ywangd commented May 13, 2024

+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replaces related with something like fixes so that we get GitHub to associate them?

@mhl-b
Copy link
Contributor Author

mhl-b commented May 13, 2024

@DaveCTurner @ywangd

Could we break out the test suite improvements and merge that change first

I will create separate PR for the test suite improvements, will keep this one open for behaviour change.

elasticsearchmachine pushed a commit that referenced this pull request May 14, 2024
…108589)

I observed that `testRegisterRepositorySuccessAfterCreationFailed` test
never invokes assertion blocks, because listener is not invoked.

There are 2 problems:

1. Test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned Response or Failure.
2. We silently ignore assertions in listener because it is not invoked. Test pass successfully.

PutRepositories method relies on cluster state update. I replace mocked
ClusterService and ThreadPool with test implementation of these. Also
add blocking call on listener to ensure we get result.

Address
[comment](#108531 (review))
to break down larger PR into smaller pieces in #108531
@mhl-b mhl-b changed the title Validate repository before cluster update Unregister repository when PutRepository validation fails May 15, 2024
@mhl-b
Copy link
Contributor Author

mhl-b commented May 17, 2024

@DaveCTurner @ywangd
updated PR, using pre-publish verification now, found and fixed complaining tests

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment. Thanks!

Comment on lines 172 to 173
if (request.verify()) {
validatePutRepositoryRequest(request, validationStep);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this done together with validateRepositoryCanBeCreated? I think it feels nature that pre-cluster-state-update validation/verification is done in one place. Also seems to be wasteful to create the repository twice.

While I appreciate that SubscribableListener generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated, I think we don't need to touch the code here at all in this PR which would make it much easier to review. We can always have a follow-up which will be pure-refactoring for introucing SubscribableListener.

Copy link
Contributor Author

@mhl-b mhl-b May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validateRepositoryCanBeCreated method is used in other places, I would rather remove it from here. But it breaks unrelated tests. I can try to chase these down in this PR, or following PR.

Another problem is behaviour change, validateRepositoryCanBeCreated runs despite "verify" flag, but repository verification behind "verify". So putting everything behind verify or including new verification logic regardless "verify" breaks another set of tests including bwc.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

I think it is better to keep functional change and style change separate unless it is very trivial. If somehow we need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR. But it feels like a overall good principle.

validateRepositoryCanBeCreated method is used in other places

In that case, we don't have to literally move the new code into it. It can sit besides it as a new method. My main point is to not touch the below steps of listeners in this PR.

Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!

Copy link
Member

@ywangd ywangd May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the SubscribableListener change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.

Copy link
Contributor Author

@mhl-b mhl-b May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right about splitting it apart, I talked with David before about it, but got tunnel-visioned chasing down test failures and forgot about it. I rethink what I said before, and I believe all of what you mentioned above can be addressed.

I created PR with non functional change - #108788

I have some thoughts how to organize validateRepositoryCanBeCreated better, will update this PR after refactoring merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. That was my main gripe :) Thanks for splitting it out. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR with refactoring included

elasticsearchmachine pushed a commit that referenced this pull request May 17, 2024
This PR is a syntactic change for `registerRepository` in
`RepositoriesService`. I use `SubscribableListener` to display order of
events and reduce boilerplate code around failures delegation
`listener.delegateFailureAndWrap`.

It's a part of larger change for verification logic, which should take
advantage of this "sequential" version of code. #108531
@mhl-b mhl-b requested review from a team as code owners May 21, 2024 22:41
@mhl-b mhl-b requested review from ywangd and removed request for a team May 22, 2024 00:13
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go except a couple of superficial points, see below.

Comment on lines 359 to 372
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
try {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
resultListener.onResponse(null);
} catch (Exception e) {
resultListener.onFailure(e);
} finally {
closeRepository(repository);
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally best to avoid submitting a bare Runnable to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable is always a better choice:

Suggested change
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
try {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
resultListener.onResponse(null);
} catch (Exception e) {
resultListener.onFailure(e);
} finally {
closeRepository(repository);
}
});
threadPool.executor(ThreadPool.Names.SNAPSHOT)
.execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
}));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
}
}

public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
return safeAwait(SubscribableListener.newForked(exceptionListener -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary {...}, can just be an expression lambda

Suggested change
return safeAwait(SubscribableListener.newForked(exceptionListener -> {
return safeAwait(SubscribableListener.newForked(exceptionListener ->

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
}
}

public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't use the generic type param T anywhere we can just use a wildcard instead:

Suggested change
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
public static Exception safeAwaitFailure(SubscribableListener<?> listener) {

@mhl-b mhl-b requested a review from DaveCTurner May 24, 2024 00:05
@@ -181,6 +173,26 @@ public void testRegisterRejectsInvalidRepositoryNames() {
}
}

public void testPutRepositoryVerificationFails() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to also have a test that starts with an existing repository and demostrate a failed update (due to verfication) request does not change the existing repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added new test testPutRepositoryVerificationFailsOnExisting

Comment on lines 95 to 100
return createRepository(name, true);
return createRepository(name, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary? Is it because we will reject invalid repo earlier with this change and we still want some invalid repo to be created in tests? Sounds like an exceptional use case. Are there many such usages? If not, I feel it might be better for clarity to have a separate method such as createRepositoryWithoutVerification for them. The following method String createRepository(final String name, final boolean verify) is only used here and probably can be dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several tests that dont rely on verification, repository creation is a part of the setup but no assertions done. In some tests we create a snapshot thread pool with single thread, new verification code also uses snapshot thread and it causes test to hang. My understanding it happens when we intentionally block some repository calls though mocks. At least thats what I was able to reproduce with debugger.

Also not all tests I was able to catch locally, some of them fail only on CI. Even not on my GCP instance.

I can try to refactor with createRepositoryWithoutVerification for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some tests we create a snapshot thread pool with single thread, new verification code also uses snapshot thread and it causes test to hang.

The existing code also performs verification in the snapshot thread pool after the repo is added to the cluster state. So it is not immediately clear to me why the new verficiation step would hang?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me reproduce these and post details here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might mess up a bit here, this change is not related to hanging tests.

Another diff in this PR addresses hanging test:
RepositoriesIT -> testRepositoryConflict

        blockMasterOnWriteIndexFile(repo);
        logger.info("--> start deletion of snapshot");
        ActionFuture<AcknowledgedResponse> future = clusterAdmin().prepareDeleteSnapshot(repo, snapshot1).execute();
        logger.info("--> waiting for block to kick in on node [{}]", blockedNode);
        waitForBlock(blockedNode, repo);

        ...

        logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
        RepositoryConflictException e2 = expectThrows(
            RepositoryConflictException.class,
            clusterAdmin().preparePutRepository(repo)
                .setVerify(true) // >>> will hang on verification
                .setType("mock")
                .setSettings(Settings.builder().put("location", randomRepoPath()))
        );

        ...

        logger.info("--> unblocking blocked node [{}]", blockedNode);
        unblockNode(repo, blockedNode); // >>> release thread here

So here we use thread pool with single thread that is busy with snapshot deletion, then we issue another call to repository service and it hangs on verification. Test unblocks node after.

Copy link
Contributor Author

@mhl-b mhl-b May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change related to metrics. We have tests that verify we have exact Repository metrics - get/put/list across nodes. But with new change master node will have few more, and there are several tests fail in s3/gcp/azure. Not consistently thou.

https://gradle-enterprise.elastic.co/s/w4amqdgguyipy

Problem is I cannot reproduce these locally, tried fix them based on gradle scan, then got few more. So I decided a blanket approach. @DaveCTurner suggested to turn verification off, not the blanket approach :)

Comment on lines 359 to 372
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
try {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
resultListener.onResponse(null);
} catch (Exception e) {
resultListener.onFailure(e);
} finally {
closeRepository(repository);
}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.

@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception {
logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
RepositoryConflictException e2 = expectThrows(
RepositoryConflictException.class,
clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath()))
clusterAdmin().preparePutRepository(repo)
.setVerify(false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will deadlock on snapshot thread pool, we are running with single thread which is busy at the moment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this comment in the code please?

@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception {
}

public void testRequestStats() throws Exception {
final String repository = createRepository(randomRepositoryName());
final String repository = createRepository(randomRepositoryName(), false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional verification adds few more invocations on master node, failing all related tests in s3/gcp/azure.
Here test asserts that we have exact metrics across all nodes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here test asserts that we have exact metrics across all nodes.

This is not the case. The test failed because the repository master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts. Thus making it have lower values compared to the records on the HTTP server side. Can we add a comment to this line to explain why verify needs to be false?

@mhl-b mhl-b requested a review from ywangd May 24, 2024 05:56
Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for the iterations. I suggest we add comments into places where we explicity set verify to false for repository creation so it is easier for future readers to see why.

@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception {
}

public void testRequestStats() throws Exception {
final String repository = createRepository(randomRepositoryName());
final String repository = createRepository(randomRepositoryName(), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here test asserts that we have exact metrics across all nodes.

This is not the case. The test failed because the repository master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts. Thus making it have lower values compared to the records on the HTTP server side. Can we add a comment to this line to explain why verify needs to be false?

Comment on lines +190 to +193
var resultListener = new SubscribableListener<AcknowledgedResponse>();
repositoriesService.registerRepository(request, resultListener);
var ackResponse = safeAwait(resultListener);
assertTrue(ackResponse.isAcknowledged());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can directly use PlainActionFuture instead of going through safeAwait(SubscribableListener) which internally uses PlainActionFuture, e.g.:

Suggested change
var resultListener = new SubscribableListener<AcknowledgedResponse>();
repositoriesService.registerRepository(request, resultListener);
var ackResponse = safeAwait(resultListener);
assertTrue(ackResponse.isAcknowledged());
var future = new PlainActionFuture<AcknowledgedResponse>();
repositoriesService.registerRepository(request, future);
assertTrue(safeGet(future).isAcknowledged());

@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception {
logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
RepositoryConflictException e2 = expectThrows(
RepositoryConflictException.class,
clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath()))
clusterAdmin().preparePutRepository(repo)
.setVerify(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this comment in the code please?

@@ -2178,6 +2179,14 @@ public static <T> T safeGet(Future<T> future) {
}
}

public static Exception safeAwaitFailure(SubscribableListener<?> listener) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can have a variant of this method that takes PlainActionFuture in a future PR.

@DaveCTurner DaveCTurner dismissed their stale review May 31, 2024 06:06

all addressed

@mhl-b mhl-b merged commit 18219ca into elastic:main May 31, 2024
15 checks passed
craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Jun 11, 2024
* add verification before cluster update
* dont verify repo in testRequestStats
@mhl-b mhl-b deleted the repo-validation branch June 17, 2024 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.15.0
Projects
None yet
5 participants