[feat](snapshot) Support storage vault for clone instance#63217
Conversation
|
run cloudut |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
794740c to
896dbe2
Compare
896dbe2 to
86aa010
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found blocking correctness issues in the cascade/update paths.
Critical checkpoint conclusions:
- Goal/test: the PR attempts to cascade object-store and storage-vault credential changes to snapshot-derived instances, and adds positive/disabled tests. The tests do not cover retry/partial-failure behavior or non-default credential provider types.
- Scope/focus: the change is mostly localized, but the helper refactor changes existing ALTER_OBJ_INFO semantics for role_arn credential providers.
- Concurrency/transactions: no new shared in-process locking is introduced, but the RPC handlers now perform multiple independent KV transactions after committing the root instance. Several new paths return an error after the root commit, leaving partial state and making MetaServiceProxy retries non-idempotent.
- Lifecycle/static init: no special lifecycle or cross-TU static initialization concern found.
- Config/compatibility: no new config or storage-format compatibility concern found.
- Parallel paths: similar cascade failure handling appears in alter_storage_vault, alter_obj_store_info, and update_ak_sk; the same retry/partial-success issue needs a consistent fix.
- Tests: current tests only cover success and snapshot-disabled cases. Missing negative/retry tests for derived-instance transaction failures and provider-type preservation.
- Observability: logs include instance ids, but returning failure after root commit is the main correctness issue.
- Transaction/persistence: root and derived instance updates are committed separately; failure after the first commit can leave durable partial updates while the RPC reports failure.
- Performance: no major performance issue found in the reviewed diff.
No additional user-provided focus points were supplied.
There was a problem hiding this comment.
I found blocking correctness issues in the cascade/update paths.
Critical checkpoint conclusions:
- Goal/test: the PR adds cascade behavior for storage vault/object credential updates and tests the happy path plus snapshot-disabled path, but failure/retry and role-provider cases are not covered.
- Scope: the factoring is mostly focused, but the extracted helper changes existing role_arn semantics.
- Concurrency/lifecycle: no new locks or static lifecycle concerns found, but the post-commit cascade creates partial-update windows.
- RPC/idempotency: not satisfied. Several error returns happen after the root instance has already committed, and at least rename/update_ak_sk retries cannot complete the cascade.
- Compatibility/persistence: storage vault and instance metadata are persisted, but partial cascade can leave derived instances stale.
- Tests: missing negative/retry tests for cascade failures and missing coverage for non-default cred_provider_type.
- Observability: logs include useful instance ids, but they do not compensate for the partial update correctness issue.
- User focus: no additional user-provided review focus was supplied.
86aa010 to
869f852
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Blocking issues remain from the existing inline threads, so I cannot approve this PR yet. I did not add duplicate inline comments because the current review context already raises the same concrete blockers.
Critical checkpoint conclusions:
- Goal/test: The PR aims to cascade storage vault / object credential changes to clone-derived instances and adds unit coverage for happy-path S3 vault cascading and snapshot-disabled skip behavior. The tests do not cover the already-raised retry/idempotency failures after root commit or role-based provider-type preservation.
- Scope/focus: The implementation is focused on meta-service resource cascade handling, but the cascade error semantics change existing best-effort behavior in a way that can expose partial commits.
- Concurrency/lifecycle: No new shared in-memory concurrency or special static lifecycle issue was found in the changed paths; each cascade instance uses its own transaction.
- RPC/idempotency: Not satisfied. Existing threads already identify that storage-vault, object-store, and update_ak_sk cascades can return errors after the root transaction commits, and retries cannot reliably resume. This violates the meta-service retry contract for retryable/maybe-committed KV failures.
- Compatibility/parallel paths: Storage vault and object-store cascade paths were both updated, but the role_arn ALTER_OBJ_INFO helper regresses cred_provider_type handling, as already raised inline.
- Error handling/observability: Status is generally set on local return paths and logs include instance identifiers, but post-root cascade failures currently report RPC failure despite durable partial success.
- Data correctness/persistence: The main correctness risk is persistent divergence between root and derived instances when a cascade fails after the root commit, plus the existing provider-type persistence regression for role-based object-store updates.
- Performance: No additional clear performance blocker beyond the extra per-derived-instance transactions, which are consistent with the intended transaction-size isolation.
- Tests: Existing new tests cover only successful cascade and snapshot-disabled skip; missing negative/retry/idempotency and role-provider cases correspond to the existing blockers.
User focus: No additional user-provided review focus was specified, and I found no separate issue for focus points.
No description provided.