HDDS-13919. S3 Conditional Writes (PutObject) #9334

peterxcli · 2025-11-20T07:03:50Z

What changes were proposed in this pull request?

The design doc for S3 conditional write

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13919

How was this patch tested?

ivandika3

Thanks @peterxcli for the design, left some initial comments.

hadoop-hdds/docs/content/design/s3-conditional-requests.md

ivandika3 · 2025-11-25T05:25:07Z

@hevinhsu Could you help to take a look as well? Thanks.

…no pre-flight RPC) - Add If-Match implementation: validate ETag in OM validateAndUpdateCache, avoiding GetS3KeyDetails pre-flight check to optimize happy path - Document If-None-Match using EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS=-1 constant for atomic create-if-not-exists semantics - Reorganize spec sections: separate Write/Read/Copy specifications - Clarify OM validation logic: locking, key lookup, ETag comparison, error cases - Update error mapping: add PRECONDITION_FAILED for missing ETag scenarios - Add HDDS-13963 reference for Create-If-Not-Exists capability

peterxcli · 2025-11-30T09:57:37Z

@ivandika3 @chungen0126 I’ve refined the design—please take another look.

Regarding the TODO: I plan to evolve the design and code together across patches:

Initial patch: introduce the design, fully detail “conditional write,” and outline high-level approaches for get/copy.
Conditional get: complete the remaining design details for conditional get and include the corresponding code changes.
Conditional copy: complete the remaining design details for conditional copy and include the corresponding code changes.

Let me know if this workflow sounds feasible.

ivandika3

Thanks for iterating for this, the direction is good. Should we separate the design docs and the actual implementations?

Left one comment.

ivandika3 · 2025-12-14T02:47:01Z

hadoop-hdds/docs/content/design/s3-conditional-requests.md

+3. **Validation**:
+
+    - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412).
+    - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, validation fails. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread. Throws `PRECONDITION_FAILED`.


Let's be more permissive and not fail the precondition if ETag metadata does not exist . IMO S3 Conditional Writes should only be aimed for pure S3 use cases (only S3 users are accessing the bucket). Therefore, if there are mixed user for example OFS and S3A users (e.g. the upstream write to a Hive table using OFS, but the downstream user uses S3A to read same hive table), we don't want PRECONDITION_FAILED to pop up suddenly to users.

peterxcli · 2025-12-14T06:57:18Z

Thanks for iterating for this, the direction is good. Should we separate the design docs and the actual implementations?

Sounds good. I’ll first revert the code changes, then we can merge this design doc first.

Let's be more permissive and not fail the precondition if ETag metadata does not exist

Agreed. will update the doc accordingly.

…anges and ETag validation logic - Change error code from `KEY_ALREADY_EXISTS` to `KEY_GENERATION_MISMATCH` for concurrent key creation failures. - Modify ETag validation logic to allow operations to proceed when no ETag metadata is present, ensuring compatibility with mixed access patterns. - Update error mapping to include `ETAG_MISMATCH` for ETag comparison failures. - Add note regarding the upcoming addition of atomic create-if-not-exists capability linked to HDDS-13963.

…ional-writes

…r S3 API" This reverts commit dc046a1.

hadoop-hdds/docs/content/design/s3-conditional-requests.md

sodonnel · 2025-12-17T11:54:40Z

hadoop-hdds/docs/content/design/s3-conditional-requests.md

+3. **Validation**:
+
+    - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412).
+    - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, skip ETag validation and allow the operation to proceed. This ensures compatibility with mixed access patterns (OFS and S3A) where S3 Conditional Writes are primarily intended for pure S3 use cases. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread.


Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing on of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.

For S3 originated writes, are we always storing etags? Where do the etags come from? Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.

For S3 originated writes, are we always storing etags? Where do the etags come from?

Yes, s3g will always set the etag for object.

ozone/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java

Lines 162 to 170 in acde9e7

static {

E_TAG_PROVIDER = ThreadLocal.withInitial(() -> {

try {

return MessageDigest.getInstance(OzoneConsts.MD5_HASH);

} catch (NoSuchAlgorithmException e) {

throw new RuntimeException(e);

}

});

}

ozone/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java

Lines 342 to 345 in acde9e7

eTag = DatatypeConverter.printHexBinary(

digestInputStream.getMessageDigest().digest())

.toLowerCase();

output.getMetadata().put(ETAG, eTag);

Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.

Yes, the client just need to set the ETAG key metadata field. but the its value should be computed in the application.

Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing on of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.

Agree.

cc @ivandika3, WDYT

or maybe we can introduce a new Error type sth like ETAG_NOT_AVAILABLE?

As it stands now, ETag is not always written when uploading a key (e.g. OFS users or using OzoneClient directly). We can technically support ETag on all write, but the issue are

ETag includes calculating hash (e.g. md5) which adds overhead and if it's never used by users like OFS, it becomes an unnecessary overhead.

Old clients will still upload without setting ETag. So unless we always calculate ETag in OM (which is a bad idea), we cannot ensure that all keys have ETag.

Old keys do not always have ETag (keys created before the ETag feature is deployed): So unless we spin up a new cluster or we run a finalization that calculates every single key ETag (which is expensive), it's not feasible.

In Ozone, the S3 compatibility for LEGACY or FSO buckets is "best-effort" meaning we try to support S3 as much as possible, but there will be limitations (compare to pure S3 object storage). Since for OBS buckets can only be used by S3 users, the conditional write guarantee here is stronger. If LEGACY or FSO buckets are only used by S3 users, then the conditional write is stronger.

In the end, it's a tradeoff between compatibility vs safety. We want S3 users that uses FSO / LEGACY bucket to be able to coexist with OFS users without throwing any unexpected exceptions. On the other hand we also want to ensure the safety contract is respected. That said, if new S3G talks to old OM without any version checks, the OM would also ignore the conditional write behavior without throwing exceptions.

We can simply document this behavior. I'm fine if the community decides to prioritize safety, but personally I prefer compatibility.

Answering @sodonnel questions

For S3 originated writes, are we always storing etags?

Yes, but keys uploaded before ETag feature is deployed will not have ETags

Where do the etags come from?

From S3G

Could non-s3 originated writes also set an etag easily?

Possible, but this requires changing KeyOutputStream to calculate ETag all the time.

It feels like etags could be a summation of the CRC checksums of a block.

ETag technically can be anything that uniquely identify an object, but currently for normal (non-MPU) key it's MD5 hash (since current this is the current AWS S3 behavior) while for MPU key it's not MD5 (IIRC it's hash all the MD5 of all parts).

…pected etag

sodonnel · 2025-12-17T20:51:38Z

@errose28 and @kerneltime You guys had some interest in a fuller implementation of the "conditional write" api when I was doing the atomic rewrite change. Would be good to get your thoughts on this design doc too.

add design doc

589d6c5

peterxcli force-pushed the HDDS-13919-conditional-writes branch from 5262235 to 589d6c5 Compare November 20, 2025 07:08

peterxcli mentioned this pull request Nov 20, 2025

HDDS-13963. Atomic Create-If-Not-Exists #9332

Open

peterxcli requested review from ivandika3 and sodonnel November 20, 2025 07:18

WIP: Implement conditional write support in ObjectEndpoint for S3 API

dc046a1

ivandika3 reviewed Nov 25, 2025

View reviewed changes

hadoop-hdds/docs/content/design/s3-conditional-requests.md Outdated Show resolved Hide resolved

hadoop-hdds/docs/content/design/s3-conditional-requests.md Outdated Show resolved Hide resolved

hadoop-hdds/docs/content/design/s3-conditional-requests.md Outdated Show resolved Hide resolved

ivandika3 added the design label Nov 25, 2025

chungen0126 self-requested a review November 25, 2025 10:24

peterxcli requested a review from ivandika3 November 30, 2025 09:50

ivandika3 reviewed Dec 14, 2025

View reviewed changes

peterxcli added 2 commits December 14, 2025 15:11

Merge remote-tracking branch 'upstream/master' into HDDS-13919-condit…

b2d152c

…ional-writes

peterxcli marked this pull request as ready for review December 14, 2025 07:12

Revert "WIP: Implement conditional write support in ObjectEndpoint fo…

1c12209

…r S3 API" This reverts commit dc046a1.

peterxcli requested a review from ivandika3 December 14, 2025 07:12

Add Apache License header to S3 Conditional Requests documentation

acde9e7

ivandika3 reviewed Dec 16, 2025

View reviewed changes

hadoop-hdds/docs/content/design/s3-conditional-requests.md Show resolved Hide resolved

ivandika3 added the s3 S3 Gateway label Dec 16, 2025

sodonnel reviewed Dec 17, 2025

View reviewed changes

hadoop-hdds/docs/content/design/s3-conditional-requests.md Outdated Show resolved Hide resolved

sodonnel reviewed Dec 17, 2025

View reviewed changes

clarify s3 conditional write will use atomic key rewrite to handle ex…

ad824b2

…pected etag

peterxcli requested review from ivandika3 and sodonnel December 17, 2025 14:48

peterxcli requested a review from kerneltime December 18, 2025 05:07

peterxcli requested a review from errose28 December 18, 2025 05:07

	static {
	E_TAG_PROVIDER = ThreadLocal.withInitial(() -> {
	try {
	return MessageDigest.getInstance(OzoneConsts.MD5_HASH);
	} catch (NoSuchAlgorithmException e) {
	throw new RuntimeException(e);
	}
	});
	}

	eTag = DatatypeConverter.printHexBinary(
	digestInputStream.getMessageDigest().digest())
	.toLowerCase();
	output.getMetadata().put(ETAG, eTag);

HDDS-13919. S3 Conditional Writes (PutObject) #9334

Are you sure you want to change the base?

HDDS-13919. S3 Conditional Writes (PutObject) #9334

Uh oh!

Conversation

peterxcli commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

ivandika3 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ivandika3 commented Nov 25, 2025

Uh oh!

peterxcli commented Nov 30, 2025

Uh oh!

ivandika3 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivandika3 Dec 14, 2025

Choose a reason for hiding this comment

Uh oh!

peterxcli commented Dec 14, 2025

Uh oh!

Uh oh!

Uh oh!

sodonnel Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

peterxcli Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peterxcli Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

peterxcli Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

ivandika3 Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sodonnel commented Dec 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

peterxcli commented Nov 20, 2025 •

edited

Loading

ivandika3 left a comment •

edited

Loading

peterxcli Dec 17, 2025 •

edited

Loading

ivandika3 Dec 18, 2025 •

edited

Loading