Conversation

@peterxcli
Member

@peterxcli peterxcli commented Nov 20, 2025

What changes were proposed in this pull request?

This PR adds the design doc for S3 conditional writes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13919

How was this patch tested?

Contributor

@ivandika3 ivandika3 left a comment

Thanks @peterxcli for the design. I left some initial comments.

@ivandika3
Contributor

@hevinhsu Could you help to take a look as well? Thanks.

@chungen0126 chungen0126 self-requested a review November 25, 2025 10:24
…no pre-flight RPC)

- Add If-Match implementation: validate ETag in OM validateAndUpdateCache,
  avoiding GetS3KeyDetails pre-flight check to optimize happy path
- Document If-None-Match using EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS=-1
  constant for atomic create-if-not-exists semantics
- Reorganize spec sections: separate Write/Read/Copy specifications
- Clarify OM validation logic: locking, key lookup, ETag comparison, error cases
- Update error mapping: add PRECONDITION_FAILED for missing ETag scenarios
- Add HDDS-13963 reference for Create-If-Not-Exists capability
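
As an illustration of the commit message above, here is a minimal Python sketch of checking both conditions inside the same critical section as the update, so that no separate lookup is needed before the write. Only the constant name `EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS` and its value `-1` come from the commit message; `ToyKeyTable`, `ConditionFailed`, and the `put` signature are hypothetical and are not Ozone's actual classes.

```python
import threading

# Constant name and value taken from the commit message above.
EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1


class ConditionFailed(Exception):
    """Raised when a write precondition does not hold (hypothetical type)."""


class ToyKeyTable:
    """In-memory stand-in for a key table; not Ozone code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}  # key name -> {"etag": str, "generation": int}
        self._next_generation = 1

    def get_etag(self, name):
        with self._lock:
            entry = self._keys.get(name)
            return None if entry is None else entry["etag"]

    def put(self, name, etag, expected_generation=None, if_match=None):
        # The checks and the update share one lock, so a concurrent writer
        # cannot slip in between validation and commit.
        with self._lock:
            existing = self._keys.get(name)
            if expected_generation == EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS:
                # If-None-Match style: only create, never overwrite.
                if existing is not None:
                    raise ConditionFailed("key already exists")
            if if_match is not None:
                # If-Match style: overwrite only the expected version.
                if existing is None:
                    raise ConditionFailed("key not found")
                if existing["etag"] != if_match:
                    raise ConditionFailed("etag mismatch")
            generation = self._next_generation
            self._next_generation += 1
            self._keys[name] = {"etag": etag, "generation": generation}
            return generation
```

In this toy model a rejected write raises before any state changes, which is the property the happy-path optimization relies on.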
@peterxcli peterxcli requested a review from ivandika3 November 30, 2025 09:50
@peterxcli
Member Author

@ivandika3 @chungen0126 I’ve refined the design—please take another look.


Regarding the TODO: I plan to evolve the design and code together across patches:

  1. Initial patch: introduce the design, fully detail “conditional write,” and outline high-level approaches for get/copy.
  2. Conditional get: complete the remaining design details for conditional get and include the corresponding code changes.
  3. Conditional copy: complete the remaining design details for conditional copy and include the corresponding code changes.

Let me know if this workflow sounds feasible.

Contributor

@ivandika3 ivandika3 left a comment

Thanks for iterating on this; the direction is good. Should we separate the design doc from the actual implementation?

Left one comment.

3. **Validation**:

- **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412).
- **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, validation fails. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread. Throws `PRECONDITION_FAILED`.
Contributor

Let's be more permissive and not fail the precondition if ETag metadata does not exist. IMO S3 Conditional Writes should only be aimed at pure S3 use cases (only S3 users are accessing the bucket). Therefore, if there are mixed users, for example OFS and S3A users (e.g. the upstream writes to a Hive table using OFS, but the downstream user uses S3A to read the same Hive table), we don't want PRECONDITION_FAILED to suddenly pop up for users.

@peterxcli
Member Author

> Thanks for iterating on this; the direction is good. Should we separate the design doc from the actual implementation?

Sounds good. I’ll revert the code changes so that we can merge this design doc first.

> Let's be more permissive and not fail the precondition if ETag metadata does not exist

Agreed. I will update the doc accordingly.

…anges and ETag validation logic

- Change error code from `KEY_ALREADY_EXISTS` to `KEY_GENERATION_MISMATCH` for concurrent key creation failures.
- Modify ETag validation logic to allow operations to proceed when no ETag metadata is present, ensuring compatibility with mixed access patterns.
- Update error mapping to include `ETAG_MISMATCH` for ETag comparison failures.
- Add note regarding the upcoming addition of atomic create-if-not-exists capability linked to HDDS-13963.
@peterxcli peterxcli marked this pull request as ready for review December 14, 2025 07:12
@peterxcli peterxcli requested a review from ivandika3 December 14, 2025 07:12
@ivandika3 ivandika3 added the s3 S3 Gateway label Dec 16, 2025
3. **Validation**:

- **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412).
- **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, skip ETag validation and allow the operation to proceed. This ensures compatibility with mixed access patterns (OFS and S3A) where S3 Conditional Writes are primarily intended for pure S3 use cases. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread.
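
The validation rule above, next to the stricter variant from the earlier revision, can be sketched as follows. This is a minimal illustration, not the OM implementation: the function name `validate_if_match`, the `strict` flag, the `Outcome` enum, and the `"ETag"` metadata key are all assumptions made for the example. Only the result names `KEY_NOT_FOUND`, `ETAG_MISMATCH`, and `PRECONDITION_FAILED` appear in this PR.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PROCEED = "PROCEED"
    KEY_NOT_FOUND = "KEY_NOT_FOUND"
    ETAG_MISMATCH = "ETAG_MISMATCH"
    PRECONDITION_FAILED = "PRECONDITION_FAILED"


def validate_if_match(existing_key: Optional[dict],
                      expected_etag: str,
                      strict: bool = False) -> Outcome:
    """Decide whether an If-Match write may proceed.

    existing_key is the stored key's metadata dict, or None if absent.
    strict=False models the permissive rule (skip when no ETag is stored);
    strict=True models the earlier revision (fail when no ETag is stored).
    """
    if existing_key is None:
        return Outcome.KEY_NOT_FOUND
    stored_etag = existing_key.get("ETag")  # hypothetical metadata key
    if stored_etag is None:
        # The ETag is never computed on the spot; it is either present or not.
        return Outcome.PRECONDITION_FAILED if strict else Outcome.PROCEED
    if stored_etag != expected_etag:
        return Outcome.ETAG_MISMATCH
    return Outcome.PROCEED
```

For example, `validate_if_match({}, "abc")` proceeds under the permissive rule, while `validate_if_match({}, "abc", strict=True)` is rejected.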
Contributor

Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing one of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.

For S3 originated writes, are we always storing etags? Where do the etags come from? Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.
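
To make the lost-write concern concrete, here is a toy Python model of one possible interleaving: an S3 client remembers ETag `e1`, a non-S3 writer then replaces the key without storing an ETag, and the S3 client finally sends an If-Match write with the stale `e1`. The store layout and the functions `ofs_put`, `s3_put_if_match`, and `run` are hypothetical; this is a sketch of the scenario, not of Ozone's code paths.

```python
def ofs_put(store, name, value):
    """Toy non-S3 write: replaces the key and stores no ETag."""
    store[name] = {"value": value, "etag": None}


def s3_put_if_match(store, name, value, new_etag, if_match, skip_when_no_etag):
    """Toy S3 If-Match write. Returns True if the write was applied."""
    current = store.get(name)
    if current is None:
        return False  # nothing to match against
    stored_etag = current["etag"]
    if stored_etag is None:
        if not skip_when_no_etag:
            return False  # strict: the condition cannot be verified
        # permissive: the condition is silently not checked
    elif stored_etag != if_match:
        return False
    store[name] = {"value": value, "etag": new_etag}
    return True


def run(skip_when_no_etag):
    # The S3 client last saw the key with ETag "e1".
    store = {"k": {"value": "v1", "etag": "e1"}}
    # A non-S3 writer replaces the key; the new version has no ETag.
    ofs_put(store, "k", "v2")
    # The S3 client writes with the stale ETag.
    applied = s3_put_if_match(store, "k", "v3", "e3", "e1", skip_when_no_etag)
    return applied, store["k"]["value"]
```

With `skip_when_no_etag=True` the stale write is applied and `v2` is overwritten without the client noticing; with `skip_when_no_etag=False` the write is rejected and `v2` survives.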

Member Author

@peterxcli peterxcli Dec 17, 2025

> For S3 originated writes, are we always storing etags? Where do the etags come from?

Yes, S3G always sets the ETag for an object.

> Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.

Yes, the client just needs to set the ETAG key metadata field, but its value has to be computed by the application.

Member Author

> Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing one of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.

Agree.

cc @ivandika3, WDYT

Member Author

Or maybe we can introduce a new error type, something like `ETAG_NOT_AVAILABLE`?

Contributor

@ivandika3 ivandika3 Dec 18, 2025

As it stands now, an ETag is not always written when uploading a key (e.g. by OFS users or when using OzoneClient directly). We can technically support ETags on all writes, but the issues are:

  1. Computing an ETag involves calculating a hash (e.g. MD5), which adds overhead. If it's never used by users like OFS, that overhead is unnecessary.
  2. Old clients will still upload without setting an ETag. So unless we always calculate the ETag in OM (which is a bad idea), we cannot ensure that all keys have an ETag.
  3. Old keys do not always have an ETag (keys created before the ETag feature was deployed). So unless we spin up a new cluster or run a finalization that calculates the ETag of every single key (which is expensive), it's not feasible.

In Ozone, S3 compatibility for LEGACY or FSO buckets is "best-effort", meaning we try to support S3 as much as possible, but there will be limitations (compared to pure S3 object storage). Since OBS buckets can only be used by S3 users, the conditional write guarantee there is stronger. Likewise, if LEGACY or FSO buckets are only used by S3 users, the conditional write guarantee is stronger.

In the end, it's a tradeoff between compatibility and safety. We want S3 users that use FSO / LEGACY buckets to be able to coexist with OFS users without throwing any unexpected exceptions. On the other hand, we also want to ensure the safety contract is respected. That said, if a new S3G talks to an old OM without any version checks, the OM would also ignore the conditional write behavior without throwing exceptions.

We can simply document this behavior. I'm fine if the community decides to prioritize safety, but personally I prefer compatibility.

Answering @sodonnel's questions:

> For S3 originated writes, are we always storing etags?

Yes, but keys uploaded before the ETag feature was deployed will not have ETags.

> Where do the etags come from?

From S3G.

> Could non-s3 originated writes also set an etag easily?

Possible, but this requires changing KeyOutputStream to calculate the ETag all the time.

> It feels like etags could be a summation of the CRC checksums of a block.

An ETag can technically be anything that uniquely identifies an object, but currently for a normal (non-MPU) key it is the MD5 hash (since this is the current AWS S3 behavior), while for an MPU key it is not a plain MD5 (IIRC it is a hash over the MD5s of all parts).
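
For reference, the two ETag shapes mentioned above can be sketched with Python's `hashlib`. The single-part case is a plain MD5 hex digest. The multipart case follows the convention commonly observed on AWS S3 (MD5 over the concatenated binary MD5 digests of the parts, followed by `-` and the part count); whether Ozone's S3G produces exactly this form is an assumption here and is not confirmed in this thread. The function names are hypothetical.

```python
import hashlib
from typing import Sequence


def single_part_etag(data: bytes) -> str:
    """ETag of a normal (non-MPU) object: hex MD5 of the content."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(parts: Sequence[bytes]) -> str:
    """ETag of a multipart upload, following the commonly observed
    AWS convention (assumed here, not taken from the design doc)."""
    # Concatenate the raw 16-byte MD5 digest of each part, then hash again.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

Note that a multipart ETag computed this way differs from the plain MD5 of the same bytes, so the same content can carry different ETags depending on how it was uploaded.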

@sodonnel
Contributor

@errose28 and @kerneltime, you had some interest in a fuller implementation of the "conditional write" API when I was doing the atomic rewrite change. It would be good to get your thoughts on this design doc too.

@peterxcli peterxcli requested a review from kerneltime December 18, 2025 05:07
@peterxcli peterxcli requested a review from errose28 December 18, 2025 05:07
Labels

design s3 S3 Gateway

4 participants