Verify repository before cluster update #108531

mhl-b · 2024-05-10T23:22:02Z

This PR addresses an issue where we can create an "invalid" repository and persist it
in the cluster state. Even though we return an error to the caller, the repository stays there.

Add repository verification on the master node only, before the cluster state update.
Refactored the registerRepository code with SubscribableListener to make the order
of listener calls explicit.

fixes #107840
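
The ordering described above can be illustrated with a minimal, self-contained sketch. It does not use Elasticsearch classes: `VerifyBeforeUpdateSketch`, `Repo`, and the in-memory `state` map are hypothetical stand-ins for the repositories service, a repository, and the persisted cluster state. The only point is that verification runs before anything is stored, so a failed verification leaves no trace.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration only, not Elasticsearch classes.
final class VerifyBeforeUpdateSketch {

    /** Minimal stand-in for a repository that can check its own configuration. */
    interface Repo {
        void verify() throws Exception;
    }

    /** Stand-in for the persisted cluster state: repository name -> repository. */
    private final Map<String, Repo> state = new ConcurrentHashMap<>();

    /** Verifies the candidate first; the state is only touched if verification passes. */
    void register(String name, Repo candidate, boolean verify) throws Exception {
        if (verify) {
            candidate.verify(); // throws before anything is persisted
        }
        state.put(name, candidate);
    }

    boolean contains(String name) {
        return state.containsKey(name);
    }

    public static void main(String[] args) {
        var service = new VerifyBeforeUpdateSketch();
        try {
            service.register("broken", () -> { throw new IllegalStateException("cannot reach bucket"); }, true);
        } catch (Exception e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("persisted: " + service.contains("broken"));
    }
}
```

With `verify` set to `false` the same broken repository would still be stored, which matches the "verify" flag discussion later in this thread.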

elasticsearchmachine · 2024-05-10T23:22:27Z

Pinging @elastic/es-distributed (Team:Distributed)

mhl-b · 2024-05-11T00:15:52Z

Some integ tests in s3 and gcp are failing on the new verification call. I assume they worked before because we updated the cluster state first, so the repository was already there, unverified.

I'm looking into how to fix this.

DaveCTurner

Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect won't pin down the problem very precisely.

Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService too? That way we'll be left with a PR that just does the behaviour change.

ywangd · 2024-05-13T05:36:36Z

+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replace "related" with something like "fixes" so that GitHub associates them?

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

mhl-b · 2024-05-13T17:54:55Z

@DaveCTurner @ywangd

> Could we break out the test suite improvements and merge that change first

I will create a separate PR for the test suite improvements and keep this one open for the behaviour change.

…108589)

I observed that the `testRegisterRepositorySuccessAfterCreationFailed` test never invokes its assertion blocks, because the listener is never invoked. There are 2 problems:

1. The test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned a Response or Failure.
2. We silently skip the assertions in the listener because it is not invoked, so the test passes.

The PutRepositories method relies on a cluster state update. I replaced the mocked ClusterService and ThreadPool with test implementations of these, and added a blocking call on the listener to ensure we get a result.

Addresses the [comment](#108531 (review)) in #108531 asking to break the larger PR down into smaller pieces.
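
The failure mode described in this commit message (assertions that live inside a callback which is never called) can be reproduced without any Elasticsearch code. The sketch below is an assumption-laden illustration: `registerWithBrokenChain`, `registerWithWorkingChain`, and `awaitResult` are hypothetical names, and `CompletableFuture` stands in for the listener types used in the real test.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names exist in Elasticsearch.
final class SilentListenerSketch {

    /** A broken collaborator (like a mock) that swallows the callback and never calls it. */
    static void registerWithBrokenChain(Consumer<String> listener) {
        // intentionally never invokes the listener
    }

    /** A working collaborator that completes the callback. */
    static void registerWithWorkingChain(Consumer<String> listener) {
        listener.accept("acknowledged");
    }

    /** Blocks until the callback fires, or fails with a TimeoutException if it never does. */
    static String awaitResult(Consumer<Consumer<String>> action, long timeoutMillis) throws Exception {
        var future = new CompletableFuture<String>();
        action.accept(future::complete);
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // The assertion lives inside the callback: it never runs, so this "test" passes vacuously.
        registerWithBrokenChain(result -> { throw new AssertionError("never reached"); });

        // Blocking on the result turns the missing callback into a visible failure.
        try {
            awaitResult(SilentListenerSketch::registerWithBrokenChain, 100);
        } catch (TimeoutException e) {
            System.out.println("listener was never invoked");
        }
        System.out.println(awaitResult(SilentListenerSketch::registerWithWorkingChain, 100));
    }
}
```

Blocking on the result is what makes a missing callback fail the test instead of letting it pass silently.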

mhl-b · 2024-05-17T04:01:03Z

@DaveCTurner @ywangd
Updated the PR: it now uses pre-publish verification, and I found and fixed the complaining tests.

ywangd

I left a comment. Thanks!

ywangd · 2024-05-17T04:29:38Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java

+                if (request.verify()) {
+                    validatePutRepositoryRequest(request, validationStep);


Can we have this done together with validateRepositoryCanBeCreated? I think it feels natural for pre-cluster-state-update validation/verification to be done in one place. It also seems wasteful to create the repository twice.

While I appreciate that SubscribableListener generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated, I think we don't need to touch the code here at all in this PR, which would make it much easier to review. We can always have a follow-up which will be a pure refactoring for introducing SubscribableListener.

The validateRepositoryCanBeCreated method is used in other places, so I would rather remove it from here. But that breaks unrelated tests. I can try to chase these down in this PR or in a following PR.

Another problem is the behaviour change: validateRepositoryCanBeCreated runs regardless of the "verify" flag, but repository verification sits behind "verify". So putting everything behind "verify", or running the new verification logic regardless of "verify", breaks another set of tests, including bwc.

About SubscribableListener: I found that without a little refactoring it looks worse. The diff may be a bit prettier, but the final code is not. I can do the "ugly" change and then refactor in another PR if that's the preferred way.

> About SubscribableListener: I found that without a little refactoring it looks worse. The diff may be a bit prettier, but the final code is not. I can do the "ugly" change and then refactor in another PR if that's the preferred way.

I think it is better to keep the functional change and the style change separate unless it is very trivial. If we somehow need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR, but it feels like an overall good principle.

> validateRepositoryCanBeCreated method is used in other places

In that case, we don't have to literally move the new code into it. It can sit beside it as a new method. My main point is to not touch the listener steps below in this PR.

Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!

Btw, the SubscribableListener change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.

You are absolutely right about splitting it apart. I talked with David about it before, but got tunnel-visioned chasing down test failures and forgot about it. I have rethought what I said before, and I believe everything you mentioned above can be addressed.

I created a PR with the non-functional change - #108788

I have some thoughts on how to organize validateRepositoryCanBeCreated better; I will update this PR after the refactoring is merged.

Awesome. That was my main gripe :) Thanks for splitting it out. 👍

updated PR with refactoring included

This PR is a syntactic change for `registerRepository` in `RepositoriesService`. I use `SubscribableListener` to make the order of events explicit and to reduce the boilerplate around failure delegation (`listener.delegateFailureAndWrap`). It's part of a larger change to the verification logic, which should take advantage of this "sequential" version of the code. #108531

DaveCTurner

Looks good to go except a couple of superficial points, see below.

DaveCTurner · 2024-05-23T06:54:03Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java

+                threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
+                    try {
+                        final var token = repository.startVerification();
+                        if (token != null) {
+                            repository.verify(token, clusterService.localNode());
+                            repository.endVerification(token);
+                        }
+                        resultListener.onResponse(null);
+                    } catch (Exception e) {
+                        resultListener.onFailure(e);
+                    } finally {
+                        closeRepository(repository);
+                    }
+                });


It's generally best to avoid submitting a bare Runnable to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable is always a better choice:

Suggested change

-                threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
-                    try {
-                        final var token = repository.startVerification();
-                        if (token != null) {
-                            repository.verify(token, clusterService.localNode());
-                            repository.endVerification(token);
-                        }
-                        resultListener.onResponse(null);
-                    } catch (Exception e) {
-                        resultListener.onFailure(e);
-                    } finally {
-                        closeRepository(repository);
-                    }
-                });
+                threadPool.executor(ThreadPool.Names.SNAPSHOT)
+                    .execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> {
+                        final var token = repository.startVerification();
+                        if (token != null) {
+                            repository.verify(token, clusterService.localNode());
+                            repository.endVerification(token);
+                        }
+                    }));

The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.
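
The rejection concern raised in this thread can be shown with plain `java.util.concurrent` types. This is only a sketch under stated assumptions: `runNotifying` and `finish` are hypothetical helpers, `CompletableFuture` stands in for the result listener, and catching `RejectedExecutionException` around `execute` is a stand-in for the rejection handling that the suggested Elasticsearch wrapper provides. It is not how `ActionRunnable` is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical illustration using only JDK classes.
final class RejectionAwareSketch {

    /** Runs the task on the executor; success, failure and rejection all complete the future. */
    static CompletableFuture<Void> runNotifying(ExecutorService executor, Runnable task, Runnable cleanup) {
        var result = new CompletableFuture<Void>();
        try {
            executor.execute(() -> {
                Exception failure = null;
                try {
                    task.run();
                } catch (Exception e) {
                    failure = e;
                }
                finish(result, cleanup, failure);
            });
        } catch (RejectedExecutionException e) {
            // A bare execute(...) call would let this escape and leave the caller's listener unnotified.
            finish(result, cleanup, e);
        }
        return result;
    }

    /** Releases the resource first, then notifies, similar in spirit to a run-before wrapper. */
    private static void finish(CompletableFuture<Void> result, Runnable cleanup, Exception failure) {
        cleanup.run();
        if (failure == null) {
            result.complete(null);
        } else {
            result.completeExceptionally(failure);
        }
    }

    public static void main(String[] args) {
        ExecutorService stopped = Executors.newSingleThreadExecutor();
        stopped.shutdown(); // every later execute(...) is rejected
        var result = runNotifying(stopped, () -> System.out.println("verifying"), () -> System.out.println("closed"));
        System.out.println("notified of rejection: " + result.isCompletedExceptionally());
    }
}
```

The design choice mirrored here is that the caller is always told the outcome, whichever of the three paths is taken, and the cleanup always runs before that notification.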

DaveCTurner · 2024-05-23T06:56:40Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
        }
    }

+    public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
+        return safeAwait(SubscribableListener.newForked(exceptionListener -> {


Nit: unnecessary {...}, can just be an expression lambda

Suggested change

-        return safeAwait(SubscribableListener.newForked(exceptionListener -> {
+        return safeAwait(SubscribableListener.newForked(exceptionListener ->

DaveCTurner · 2024-05-23T06:56:42Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
        }
    }

+    public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {


Since we don't use the generic type param T anywhere we can just use a wildcard instead:

Suggested change

-    public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
+    public static Exception safeAwaitFailure(SubscribableListener<?> listener) {
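
To illustrate why the wildcard is sufficient, here is a small JDK-only sketch of an "await the failure" helper. `awaitFailure` is a hypothetical analogue written against `CompletableFuture`, not the actual `safeAwaitFailure` implementation, and the 10 second timeout is an arbitrary assumption. Because the helper never touches the success value, one `<?>` signature serves futures of any result type.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical analogue built on CompletableFuture, for illustration only.
final class AwaitFailureSketch {

    /** Waits for the future to fail and returns the cause; the result type is irrelevant, hence the wildcard. */
    static Exception awaitFailure(CompletableFuture<?> future) {
        try {
            future.get(10, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            if (e.getCause() instanceof Exception) {
                return (Exception) e.getCause();
            }
            throw new AssertionError("failed with a non-Exception throwable", e.getCause());
        } catch (Exception e) {
            throw new AssertionError("interrupted, cancelled or timed out while waiting for a failure", e);
        }
        throw new AssertionError("expected a failure but the future completed successfully");
    }

    public static void main(String[] args) {
        // The same method accepts futures with unrelated result types.
        CompletableFuture<String> text = CompletableFuture.failedFuture(new IllegalStateException("boom"));
        CompletableFuture<Integer> number = CompletableFuture.failedFuture(new IllegalArgumentException("bad"));
        System.out.println(awaitFailure(text).getMessage());
        System.out.println(awaitFailure(number).getMessage());
    }
}
```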

ywangd · 2024-05-24T02:09:25Z

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java

@@ -181,6 +173,26 @@ public void testRegisterRejectsInvalidRepositoryNames() {
        }
    }

+    public void testPutRepositoryVerificationFails() {


Is it possible to also have a test that starts with an existing repository and demonstrates that a failed update request (due to verification) does not change the existing repo?

added new test testPutRepositoryVerificationFailsOnExisting

ywangd · 2024-05-24T02:25:06Z

...c/main/java/org/elasticsearch/repositories/blobstore/ESBlobStoreRepositoryIntegTestCase.java

-        return createRepository(name, true);
+        return createRepository(name, false);


Why is this change necessary? Is it because we will reject invalid repo earlier with this change and we still want some invalid repo to be created in tests? Sounds like an exceptional use case. Are there many such usages? If not, I feel it might be better for clarity to have a separate method such as createRepositoryWithoutVerification for them. The following method String createRepository(final String name, final boolean verify) is only used here and probably can be dropped.

There are several tests that don't rely on verification: repository creation is part of the setup, but no assertions are made about it. In some tests we create a snapshot thread pool with a single thread; the new verification code also uses a snapshot thread, and that causes the test to hang. My understanding is that it happens when we intentionally block some repository calls through mocks. At least that's what I was able to reproduce with a debugger.

Also, I was not able to catch all of the failing tests locally; some of them fail only on CI, not even on my GCP instance.

I can try to refactor with createRepositoryWithoutVerification for clarity.

> In some tests we create a snapshot thread pool with single thread, new verification code also uses snapshot thread and it causes test to hang.

The existing code also performs verification in the snapshot thread pool after the repo is added to the cluster state. So it is not immediately clear to me why the new verification step would hang.

Let me reproduce these and post details here.

I might have mixed things up a bit here: this change is not related to the hanging tests.

Another diff in this PR addresses the hanging test:
RepositoriesIT -> testRepositoryConflict

        blockMasterOnWriteIndexFile(repo);
        logger.info("--> start deletion of snapshot");
        ActionFuture<AcknowledgedResponse> future = clusterAdmin().prepareDeleteSnapshot(repo, snapshot1).execute();
        logger.info("--> waiting for block to kick in on node [{}]", blockedNode);
        waitForBlock(blockedNode, repo);
        ...
        logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
        RepositoryConflictException e2 = expectThrows(
            RepositoryConflictException.class,
            clusterAdmin().preparePutRepository(repo)
                .setVerify(true) // >>> will hang on verification
                .setType("mock")
                .setSettings(Settings.builder().put("location", randomRepoPath()))
        );
        ...
        logger.info("--> unblocking blocked node [{}]", blockedNode);
        unblockNode(repo, blockedNode); // >>> release thread here

So here we use a thread pool with a single thread that is busy with the snapshot deletion; then we issue another call to the repository service and it hangs on verification. The test only unblocks the node afterwards.
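
The hang described above is ordinary thread starvation and can be reproduced with a single-thread JDK executor. In this sketch all names are hypothetical: the first task stands in for the blocked snapshot deletion, the second for the new verification step, and the 200 ms timeout is an arbitrary choice so that the example terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration using only JDK classes.
final class SingleThreadStarvationSketch {

    /** Returns true if the "verification" task could not run while the only worker was blocked. */
    static boolean verificationStarves() throws Exception {
        ExecutorService snapshotPool = Executors.newFixedThreadPool(1);
        CountDownLatch unblock = new CountDownLatch(1);
        try {
            // Stand-in for the blocked snapshot deletion occupying the only thread.
            snapshotPool.submit(() -> {
                unblock.await();
                return null;
            });
            // Stand-in for the pre-update verification, queued behind the blocked task.
            Future<String> verification = snapshotPool.submit(() -> "verified");
            try {
                verification.get(200, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                // A caller that waits here and only unblocks afterwards would wait forever.
                return true;
            }
        } finally {
            unblock.countDown();
            snapshotPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("verification starved: " + verificationStarves());
    }
}
```

In the test discussed here the way out was to skip verification for that one request (`setVerify(false)`), since the test is about the conflict with the in-progress deletion rather than about verification.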

This change is related to metrics. We have tests that verify the exact Repository metrics (get/put/list) across nodes. But with the new change the master node will have a few more, and several tests fail in s3/gcp/azure, though not consistently.

https://gradle-enterprise.elastic.co/s/w4amqdgguyipy

The problem is that I cannot reproduce these locally. I tried to fix them based on the gradle scan, then got a few more, so I decided on a blanket approach. @DaveCTurner suggested turning verification off, not the blanket approach :)

ywangd · 2024-05-24T02:27:17Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java

+                threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
+                    try {
+                        final var token = repository.startVerification();
+                        if (token != null) {
+                            repository.verify(token, clusterService.localNode());
+                            repository.endVerification(token);
+                        }
+                        resultListener.onResponse(null);
+                    } catch (Exception e) {
+                        resultListener.onFailure(e);
+                    } finally {
+                        closeRepository(repository);
+                    }
+                });


The try-catch block may not be necessary since we are calling this method in a SubscribableListener#newForked which does its own try-catch. That said, I am not too fussed with the additional try-catch. It helps reasoning when just reading this single method.

mhl-b · 2024-05-24T04:25:18Z

server/src/internalClusterTest/java/org/elasticsearch/snapshots/RepositoriesIT.java

@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception {
        logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
        RepositoryConflictException e2 = expectThrows(
            RepositoryConflictException.class,
-            clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath()))
+            clusterAdmin().preparePutRepository(repo)
+                .setVerify(false)


This will deadlock on the snapshot thread pool: we are running with a single thread, which is busy at the moment.

Can we have this comment in the code please?

mhl-b · 2024-05-24T04:26:51Z

...ain/java/org/elasticsearch/repositories/blobstore/ESMockAPIBasedRepositoryIntegTestCase.java

@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception {
    }

    public void testRequestStats() throws Exception {
-        final String repository = createRepository(randomRepositoryName());
+        final String repository = createRepository(randomRepositoryName(), false);


The additional verification adds a few more invocations on the master node, failing all related tests in s3/gcp/azure.
Here the test asserts that we have exact metrics across all nodes.

> Here test asserts that we have exact metrics across all nodes.

This is not the case. The test failed because the repository that the master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts, so the counts are lower than those recorded on the HTTP server side. Can we add a comment to this line to explain why verify needs to be false?
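
The mismatch explained above can be modelled with two counters. The sketch is a simplified assumption, not the test's actual bookkeeping: `Server` stands in for the HTTP fixture, `Repo` for a repository instance with its own client-side counter, and `register` for repository creation that optionally verifies through a temporary instance first.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model for illustration; these are not Elasticsearch test classes.
final class VerificationMetricsSketch {

    /** Stand-in for the HTTP fixture: counts every request it receives. */
    static final class Server {
        final AtomicInteger requests = new AtomicInteger();
    }

    /** Stand-in for a repository instance with its own client-side request counter. */
    static final class Repo {
        final AtomicInteger sdkRequests = new AtomicInteger();
        private final Server server;

        Repo(Server server) {
            this.server = server;
        }

        void request() {
            sdkRequests.incrementAndGet();
            server.requests.incrementAndGet();
        }
    }

    /** Registers a repository and returns the long-lived instance whose stats a test would read. */
    static Repo register(Server server, boolean verify) {
        if (verify) {
            Repo temporary = new Repo(server);
            temporary.request(); // verification traffic reaches the server,
            // but the temporary instance is then closed and its counter is discarded
        }
        Repo registered = new Repo(server);
        registered.request(); // normal traffic from the instance that stays registered
        return registered;
    }

    public static void main(String[] args) {
        Server server = new Server();
        Repo repo = register(server, true);
        System.out.println("server saw " + server.requests.get() + ", repository counted " + repo.sdkRequests.get());
    }
}
```

With `verify` set to `false` both counters agree, which is why the stats test creates its repository without verification.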

ywangd

LGTM

Thanks for the iterations. I suggest we add comments in the places where we explicitly set verify to false for repository creation, so it is easier for future readers to see why.

ywangd · 2024-05-30T03:46:26Z

...ain/java/org/elasticsearch/repositories/blobstore/ESMockAPIBasedRepositoryIntegTestCase.java

@@ -181,7 +181,7 @@ public final void testSnapshotWithLargeSegmentFiles() throws Exception {
    }

    public void testRequestStats() throws Exception {
-        final String repository = createRepository(randomRepositoryName());
+        final String repository = createRepository(randomRepositoryName(), false);


> Here test asserts that we have exact metrics across all nodes.

This is not the case. The test failed because the repository that the master node uses for verification is different from the actual repository that gets created afterwards. The initial repository is closed and not counted towards sdkRequestCounts, so the counts are lower than those recorded on the HTTP server side. Can we add a comment to this line to explain why verify needs to be false?

ywangd · 2024-05-30T03:52:11Z

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java

+        var resultListener = new SubscribableListener<AcknowledgedResponse>();
+        repositoriesService.registerRepository(request, resultListener);
+        var ackResponse = safeAwait(resultListener);
+        assertTrue(ackResponse.isAcknowledged());


Nit: I think we can directly use PlainActionFuture instead of going through safeAwait(SubscribableListener) which internally uses PlainActionFuture, e.g.:

Suggested change

-        var resultListener = new SubscribableListener<AcknowledgedResponse>();
-        repositoriesService.registerRepository(request, resultListener);
-        var ackResponse = safeAwait(resultListener);
-        assertTrue(ackResponse.isAcknowledged());
+        var future = new PlainActionFuture<AcknowledgedResponse>();
+        repositoriesService.registerRepository(request, future);
+        assertTrue(safeGet(future).isAcknowledged());

ywangd · 2024-05-30T04:05:45Z

server/src/internalClusterTest/java/org/elasticsearch/snapshots/RepositoriesIT.java

@@ -302,7 +302,10 @@ public void testRepositoryConflict() throws Exception {
        logger.info("--> try updating the repository, should fail because the deletion of the snapshot is in progress");
        RepositoryConflictException e2 = expectThrows(
            RepositoryConflictException.class,
-            clusterAdmin().preparePutRepository(repo).setType("mock").setSettings(Settings.builder().put("location", randomRepoPath()))
+            clusterAdmin().preparePutRepository(repo)
+                .setVerify(false)


Can we have this comment in the code please?

ywangd · 2024-05-30T04:09:16Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -2178,6 +2179,14 @@ public static <T> T safeGet(Future<T> future) {
        }
    }

+    public static Exception safeAwaitFailure(SubscribableListener<?> listener) {


Nit: I think we can have a variant of this method that takes PlainActionFuture in a future PR.

all addressed

* add verification before cluster update
* dont verify repo in testRequestStats

mhl-b added the >enhancement, >test, :Distributed/Snapshot/Restore, Team:Distributed and v8.15.0 labels May 10, 2024

mhl-b requested review from ywangd, DiannaHohensee and DaveCTurner May 10, 2024 23:22

DaveCTurner reviewed May 12, 2024

idegtiarenko reviewed May 13, 2024

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java (outdated)

idegtiarenko reviewed May 13, 2024

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java (outdated)

DaveCTurner mentioned this pull request May 13, 2024

Ensure listener called in testRegisterRepositorySuccessAfterCreationFailed #108477

Closed

DaveCTurner reviewed May 13, 2024

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java (outdated)

DaveCTurner mentioned this pull request May 13, 2024

Extract SAFE_AWAIT_TIMEOUT constant #108554

Merged

mhl-b mentioned this pull request May 13, 2024

Replace mocks with test implementation in RepositoriesServiceTests #108589

Merged

unregister repository if validation fails

0474f8c

mhl-b force-pushed the repo-validation branch from dc4a5a1 to 0474f8c Compare May 15, 2024 02:50

mhl-b changed the title ~~Validate repository before cluster update~~ Unregister repository when PutRepository validation fails May 15, 2024

mhl-b requested review from idegtiarenko and DaveCTurner May 15, 2024 03:03

mhl-b added 5 commits May 14, 2024 20:10

naming

2d63dc6

naming

bfa15df

fix logger

66eaa22

Merge remote-tracking branch 'upstream/main' into repo-validation

58d6852

revert executor type

4892603

ywangd reviewed May 17, 2024

mhl-b mentioned this pull request May 17, 2024

Refactor registerRepository method #108788

Merged

update

fcd79a6

mhl-b requested review from a team as code owners May 21, 2024 22:41

update

418e167

mhl-b requested review from ywangd and removed request for a team May 22, 2024 00:13

DaveCTurner reviewed May 23, 2024

mhl-b added 2 commits May 23, 2024 11:37

address feedback

7d0696f

fix test deadlock

219ddbc

mhl-b requested a review from DaveCTurner May 24, 2024 00:05

ywangd reviewed May 24, 2024

mhl-b added 3 commits May 23, 2024 20:22

address feedback

b09dbf6

cleanup

89f793f

dont verify repo in testRequestStats

811233b

mhl-b commented May 24, 2024

fix repo metrics tests

d698530

mhl-b requested a review from ywangd May 24, 2024 05:56

ywangd approved these changes May 30, 2024

mhl-b added 2 commits May 31, 2024 10:32

merge main, update comments

f30c966

spotless

ec7a1a9

mhl-b merged commit 18219ca into elastic:main May 31, 2024
15 checks passed

craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Jun 11, 2024

Verify repository before cluster update (elastic#108531)

c3a1581

* add verification before cluster update
* dont verify repo in testRequestStats

mhl-b deleted the repo-validation branch June 17, 2024 17:55
