Contributor

@kvanerum kvanerum commented Aug 16, 2025

Today, S3-backed repositories ignore the failIfAlreadyExists flag and
may therefore overwrite a blob which already exists, potentially
corrupting a repository subject to concurrent writes, rather than
failing the second write.

AWS S3 now supports writes conditional on the non-existence of an object
via the If-None-Match: * HTTP header. This commit adjusts the
S3-backed repository implementation to respect the failIfAlreadyExists
flag using these conditional writes, eliminating the possibility of
overwriting blobs which should not be overwritten.
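
As a rough illustration of the conditional-write mechanism described above, the sketch below builds (but does not send) a PUT request carrying `If-None-Match: *` using the JDK's `java.net.http` API. The class name, helper method, bucket, and key are hypothetical; the actual change goes through the AWS SDK, and a real S3 request would also need authentication, neither of which is shown here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class ConditionalPutSketch {
    // Hypothetical helper: builds a PUT which, when failIfAlreadyExists is set,
    // asks the server to reject the write if an object already exists at the URI.
    static HttpRequest conditionalPut(URI objectUri, byte[] body, boolean failIfAlreadyExists) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(objectUri)
            .PUT(HttpRequest.BodyPublishers.ofByteArray(body));
        if (failIfAlreadyExists) {
            builder.header("If-None-Match", "*");
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Hypothetical bucket and key, for illustration only; nothing is sent over the network.
        URI uri = URI.create("https://example-bucket.s3.amazonaws.com/snapshots/example-blob");
        HttpRequest request = conditionalPut(uri, "data".getBytes(StandardCharsets.UTF_8), true);
        System.out.println(request.headers().firstValue("If-None-Match").orElse("<absent>"));
    }
}
```

When the object already exists, S3 is documented to reject such a request with `412 Precondition Failed`, which is what lets the second of two racing writers fail rather than overwrite.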

Relates #128565

@elasticsearchmachine added the needs:triage, external-contributor, and v9.2.0 labels on Aug 16, 2025
@szybia added the :Distributed Coordination/Snapshot/Restore label and removed the needs:triage label on Aug 18, 2025
@elasticsearchmachine added the Team:Distributed Coordination label on Aug 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@DaveCTurner DaveCTurner left a comment

Looks like a good approach to me. I left a few small comments.

}
}

public void testFailIfAlreadyExists() throws IOException {
Contributor

Excellent, so nice to see this test being added at last.

Could we make it slightly stronger and instead start by performing two writes concurrently, asserting that exactly one of them succeeds, and then follow up with the check that we can overwrite blobs?

Contributor Author

I updated the test to include the changes you requested, but now it fails in HdfsRepositoryTests. Both writes succeed. I also tried using blobStore.writeBlobAtomic(...), without luck.
From what I can tell, TestingFs might not provide atomic renaming. Could you advise on how best to proceed?

Contributor

Ohh HDFS how we love thee. I'd suggest suppressing the test thusly:

diff --git a/plugins/repository-hdfs/src/test/java/org/elasticsearch/repositories/hdfs/HdfsRepositoryTests.java b/plugins/repository-hdfs/src/test/java/org/elasticsearch/repositories/hdfs/HdfsRepositoryTests.java
index 7961ca0257be..3d75d9915bf7 100644
--- a/plugins/repository-hdfs/src/test/java/org/elasticsearch/repositories/hdfs/HdfsRepositoryTests.java
+++ b/plugins/repository-hdfs/src/test/java/org/elasticsearch/repositories/hdfs/HdfsRepositoryTests.java
@@ -62,4 +62,9 @@ public class HdfsRepositoryTests extends AbstractThirdPartyRepositoryTestCase {
             assertThat(response.result().blobs(), equalTo(0L));
         }
     }
+
+    @Override
+    public void testFailIfAlreadyExists() {
+        // HDFS does not implement failIfAlreadyExists correctly
+    }
 }

+ "</Key>\n"
+ "</CompleteMultipartUploadResult>").getBytes(StandardCharsets.UTF_8);

if (isProtectOverwrite(exchange) && blobs.containsKey(request.path())) {
Contributor

Checking containsKey before the put is racy. I think we need a putIfAbsent if If-None-Match: * is specified, and therefore also a test which tries to catch a race here.
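
To make the race concrete, here is a minimal sketch of an in-memory blob map that honours a "protect overwrite" flag with a single atomic `putIfAbsent` rather than a `containsKey` check followed by `put`. The class and method names are hypothetical and are not taken from `S3HttpHandler`; only the `ConcurrentMap` calls are real JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class ConditionalBlobStoreSketch {
    private final ConcurrentMap<String, byte[]> blobs = new ConcurrentHashMap<>();

    /** Returns true if the blob was stored, false if a precondition failure should be reported. */
    boolean put(String path, byte[] contents, boolean protectOverwrite) {
        if (protectOverwrite) {
            // Atomic check-and-insert: a null return means no mapping existed, so this write won.
            // A separate containsKey() followed by put() would let two writers both pass the check.
            return blobs.putIfAbsent(path, contents) == null;
        }
        blobs.put(path, contents); // unconditional overwrite
        return true;
    }

    byte[] get(String path) {
        return blobs.get(path);
    }

    public static void main(String[] args) {
        var store = new ConditionalBlobStoreSketch();
        System.out.println(store.put("blob", "first".getBytes(StandardCharsets.UTF_8), true));  // true
        System.out.println(store.put("blob", "second".getBytes(StandardCharsets.UTF_8), true)); // false
        System.out.println(new String(store.get("blob"), StandardCharsets.UTF_8));              // first
        System.out.println(store.put("blob", "third".getBytes(StandardCharsets.UTF_8), false)); // true
    }
}
```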

// a copy request is a put request with an X-amz-copy-source header
final var copySource = copySourceName(exchange);
if (copySource != null) {
if (isProtectOverwrite(exchange) && blobs.containsKey(request.path())) {
Contributor

Likewise here, we need to use putIfAbsent to avoid the race on this path.

Comment on lines 578 to 579
// initial write blob
writeBlob(container, blobName, new BytesArray(data), true);
Contributor

This would be a good spot to check for potential races in S3HttpHandler, if we initially wrote two blobs concurrently and verified that exactly one of those writes succeeded.

Could we also sometimes write a much larger blob to trigger the multipart upload path?


}

public void testPreventObjectOverwrite() {
Contributor

Actually I'd like us to test for races here too. Could we concurrently issue one or two PutObject requests, and one or two CompleteMultipartUpload requests, and verify that exactly one of them succeeds? And ideally that doing a GetObject afterwards returns the contents of the object which succeeded?
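
One possible shape for such a race check is sketched below, under the assumption that each request can be modelled as a conditional write into a shared map. Several writers are released at once by a latch, and the harness reports which of them succeeded so the caller can assert that there is exactly one winner and that a subsequent read returns the winner's contents. All names are hypothetical, and the map stands in for the real PutObject and CompleteMultipartUpload handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ExactlyOneWinnerSketch {
    /** Races writerCount conditional writes to the key "blob"; returns the indices of the writers that succeeded. */
    static List<Integer> race(int writerCount, ConcurrentMap<String, String> blobs) {
        ExecutorService executor = Executors.newFixedThreadPool(writerCount);
        CountDownLatch start = new CountDownLatch(1);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < writerCount; i++) {
                String contents = "writer-" + i;
                results.add(executor.submit(() -> {
                    start.await(); // hold every writer until all have been submitted
                    return blobs.putIfAbsent("blob", contents) == null;
                }));
            }
            start.countDown(); // release all writers together
            List<Integer> winners = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                // Generous timeout, since a heavily loaded test machine may be slow.
                if (results.get(i).get(30, TimeUnit.SECONDS)) {
                    winners.add(i);
                }
            }
            return winners;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> blobs = new ConcurrentHashMap<>();
        List<Integer> winners = race(4, blobs);
        if (winners.size() != 1) {
            throw new AssertionError("expected exactly one successful write but got " + winners);
        }
        // The read-back must return the contents written by the one writer that succeeded.
        if (!blobs.get("blob").equals("writer-" + winners.get(0))) {
            throw new AssertionError("stored contents do not match the winning write");
        }
        System.out.println("exactly one write succeeded, and the stored contents match it");
    }
}
```

A harness like this cannot prove the absence of a race, but run repeatedly it gives a racy handler a reasonable chance of being caught.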

@kvanerum kvanerum requested a review from DaveCTurner August 27, 2025 06:57
Contributor

@DaveCTurner DaveCTurner left a comment

> This PR solves #128565

Not quite, we also need to verify the failIfAlreadyExists behaviour in repo analysis before we can close this issue. But I'm ok to merge this PR to implement the flag even without the repo analysis changes.

I left a few small comments but structurally this all looks good now.

 final String destinationBlobName,
-final long blobSize
+final long blobSize,
+final boolean failIfAlreadyExists
Contributor

This parameter is always false (in production code anyway) and the true case isn't really tested - I'd rather we inlined it, passing the literal false to executeMultipart, instead.

exchange.sendResponseHeaders(RestStatus.OK.getStatus(), response.length);
exchange.getResponseBody().write(response);
boolean preconditionFailed = false;
if (isProtectOverwrite(exchange)) {
Contributor

I don't think there's a way for this to be true with a request from Elasticsearch as it stands today, nor does it look like this is covered by unit tests, so this is effectively dead code. I'd rather leave this area alone for now, except perhaps to throw something (AssertionError would be fine) to document that If-None-Match: * isn't supported here if ever we change the callers in future.

Contributor Author

Just to make sure I understand, are you suggesting I revert the changes to the request.isPutObjectRequest() case? That would also mean updating S3HttpHandlerTests.testPreventObjectOverwrite().

Contributor

No, the PutObject request handling is doing what we want and is adequately tested. It's just the CopyObject API here where we're not actually using or even testing the If-None-Match: * option.


TestWriteTask(Consumer<TestWriteTask> consumer, Consumer<TestWriteTask> prepare) {
this(consumer);
prepare.accept(this);
Contributor

Could we just call the prepare steps directly in the caller? The flow is kinda hard to follow as written: it looks as if we're doing both the prepare and complete steps concurrently (and in the wrong order too!)

try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
tasks.forEach(task -> executor.submit(task.consumer));
executor.shutdown();
var done = executor.awaitTermination(1, TimeUnit.SECONDS);
Contributor

1s is too short to be a reliable timeout in tests (we push our CI workers pretty hard sometimes):

Suggested change
-var done = executor.awaitTermination(1, TimeUnit.SECONDS);
+var done = executor.awaitTermination(SAFE_AWAIT_TIMEOUT.seconds(), TimeUnit.SECONDS);

ex2 = e;
}

assertTrue("Exactly one of the writes must fail", ex1 != null ^ ex2 != null);
Contributor

Technically the requirement is that exactly one must succeed; it just happens that this is equivalent to exactly one failing when there are only two requests. (Also, ^ is a little unfriendly to readers; I'd recommend != instead.)

Suggested change
-assertTrue("Exactly one of the writes must fail", ex1 != null ^ ex2 != null);
+assertTrue("Exactly one of the writes must succeed", (ex1 == null) != (ex2 == null));
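
For tests with more than two writers, the "exactly one must succeed" condition can be stated directly by counting successes, which avoids the trap that chained `^` on failure flags is also true when all three of three writes fail. The following is only an illustrative sketch with hypothetical names, where each list element is the exception thrown by one write, or `null` if that write succeeded.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class ExactlyOneSuccessSketch {
    /** Each element is the exception thrown by one write, or null if that write succeeded. */
    static boolean exactlyOneSucceeded(List<Exception> outcomes) {
        return outcomes.stream().filter(Objects::isNull).count() == 1;
    }

    public static void main(String[] args) {
        Exception failure = new IllegalStateException("blob already exists");
        // Two writers: this is equivalent to (ex1 == null) != (ex2 == null).
        System.out.println(exactlyOneSucceeded(Arrays.asList(null, failure)));             // true
        System.out.println(exactlyOneSucceeded(Arrays.asList(null, null)));                // false
        // Three writers which all fail: true ^ true ^ true would be true, but nothing succeeded.
        System.out.println(exactlyOneSucceeded(Arrays.asList(failure, failure, failure))); // false
    }
}
```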

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM (assuming CI is happy, I'll take care of that) - thanks for your work on this @kvanerum

@DaveCTurner
Contributor

@elasticmachine test this please

@DaveCTurner DaveCTurner merged commit 025396b into elastic:main Sep 12, 2025
35 checks passed
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
Labels: :Distributed Coordination/Snapshot/Restore, >enhancement, external-contributor, Team:Distributed Coordination, v9.2.0