-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13919. S3 Conditional Writes (PutObject) #9334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
5262235 to
589d6c5
Compare
ivandika3
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @peterxcli for the design, left some initial comments.
|
@hevinhsu Could you help to take a look as well? Thanks. |
…no pre-flight RPC) - Add If-Match implementation: validate ETag in OM validateAndUpdateCache, avoiding GetS3KeyDetails pre-flight check to optimize happy path - Document If-None-Match using EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS=-1 constant for atomic create-if-not-exists semantics - Reorganize spec sections: separate Write/Read/Copy specifications - Clarify OM validation logic: locking, key lookup, ETag comparison, error cases - Update error mapping: add PRECONDITION_FAILED for missing ETag scenarios - Add HDDS-13963 reference for Create-If-Not-Exists capability
|
@ivandika3 @chungen0126 I’ve refined the design—please take another look. Regarding the TODO: I plan to evolve the design and code together across patches:
Let me know if this workflow sounds feasible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating for this, the direction is good. Should we separate the design docs and the actual implementations?
Left one comment.
| 3. **Validation**: | ||
|
|
||
| - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412). | ||
| - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, validation fails. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread. Throws `PRECONDITION_FAILED`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be more permissive and not fail the precondition if ETag metadata does not exist . IMO S3 Conditional Writes should only be aimed for pure S3 use cases (only S3 users are accessing the bucket). Therefore, if there are mixed user for example OFS and S3A users (e.g. the upstream write to a Hive table using OFS, but the downstream user uses S3A to read same hive table), we don't want PRECONDITION_FAILED to pop up suddenly to users.
Sounds good. I’ll first revert the code changes, then we can merge this design doc first.
Agreed. will update the doc accordingly. |
…anges and ETag validation logic - Change error code from `KEY_ALREADY_EXISTS` to `KEY_GENERATION_MISMATCH` for concurrent key creation failures. - Modify ETag validation logic to allow operations to proceed when no ETag metadata is present, ensuring compatibility with mixed access patterns. - Update error mapping to include `ETAG_MISMATCH` for ETag comparison failures. - Add note regarding the upcoming addition of atomic create-if-not-exists capability linked to HDDS-13963.
…r S3 API" This reverts commit dc046a1.
| 3. **Validation**: | ||
|
|
||
| - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412). | ||
| - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does not have an ETag property, skip ETag validation and allow the operation to proceed. This ensures compatibility with mixed access patterns (OFS and S3A) where S3 Conditional Writes are primarily intended for pure S3 use cases. We do **not** calculate ETag on the spot to avoid performance overhead on the applier thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing on of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.
For S3 originated writes, are we always storing etags? Where do the etags come from? Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For S3 originated writes, are we always storing etags? Where do the etags come from?
Yes, s3g will always set the etag for object.
ozone/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java
Lines 162 to 170 in acde9e7
static { E_TAG_PROVIDER = ThreadLocal.withInitial(() -> { try { return MessageDigest.getInstance(OzoneConsts.MD5_HASH); } catch (NoSuchAlgorithmException e) { throw new RuntimeException(e); } }); } ozone/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java
Lines 342 to 345 in acde9e7
eTag = DatatypeConverter.printHexBinary( digestInputStream.getMessageDigest().digest()) .toLowerCase(); output.getMetadata().put(ETAG, eTag);
Could non-s3 originated writes also set an etag easily? It feels like etags could be a summation of the CRC checksums of a block.
Yes, the client just need to set the ETAG key metadata field. but the its value should be computed in the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just skipping seems a bit risky. If the S3 API has the header that says "only create the new key if an existing on of the same version exists" and we just don't have an etag, then it could cause unexpected lost writes. The front end has asked for a feature the backend cannot support - it doesn't seem correct to just "do it anyway". Returning an error would be much safer.
Agree.
cc @ivandika3, WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe we can introduce a new Error type sth like ETAG_NOT_AVAILABLE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands now, ETag is not always written when uploading a key (e.g. OFS users or using OzoneClient directly). We can technically support ETag on all write, but the issue are
- ETag includes calculating hash (e.g. md5) which adds overhead and if it's never used by users like OFS, it becomes an unnecessary overhead.
- Old clients will still upload without setting ETag. So unless we always calculate ETag in OM (which is a bad idea), we cannot ensure that all keys have ETag.
- Old keys do not always have ETag (keys created before the ETag feature is deployed): So unless we spin up a new cluster or we run a finalization that calculates every single key ETag (which is expensive), it's not feasible.
In Ozone, the S3 compatibility for LEGACY or FSO buckets is "best-effort" meaning we try to support S3 as much as possible, but there will be limitations (compare to pure S3 object storage). Since for OBS buckets can only be used by S3 users, the conditional write guarantee here is stronger. If LEGACY or FSO buckets are only used by S3 users, then the conditional write is stronger.
In the end, it's a tradeoff between compatibility vs safety. We want S3 users that uses FSO / LEGACY bucket to be able to coexist with OFS users without throwing any unexpected exceptions. On the other hand we also want to ensure the safety contract is respected. That said, if new S3G talks to old OM without any version checks, the OM would also ignore the conditional write behavior without throwing exceptions.
We can simply document this behavior. I'm fine if the community decides to prioritize safety, but personally I prefer compatibility.
Answering @sodonnel questions
For S3 originated writes, are we always storing etags?
Yes, but keys uploaded before ETag feature is deployed will not have ETags
Where do the etags come from?
From S3G
Could non-s3 originated writes also set an etag easily?
Possible, but this requires changing KeyOutputStream to calculate ETag all the time.
It feels like etags could be a summation of the CRC checksums of a block.
ETag technically can be anything that uniquely identify an object, but currently for normal (non-MPU) key it's MD5 hash (since current this is the current AWS S3 behavior) while for MPU key it's not MD5 (IIRC it's hash all the MD5 of all parts).
|
@errose28 and @kerneltime You guys had some interest in a fuller implementation of the "conditional write" api when I was doing the atomic rewrite change. Would be good to get your thoughts on this design doc too. |
What changes were proposed in this pull request?
The design doc for S3 conditional write
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13919
How was this patch tested?