
fix(common): normalize hudi table base path in StorageBasedLockProvider #18817

Open
Davis-Zhang-Onehouse wants to merge 4 commits into apache:master from Davis-Zhang-Onehouse:normalize-storage-lock-provider-base-path

@Davis-Zhang-Onehouse
Contributor

Describe the issue this Pull Request addresses

Follow-up to #18814, which introduced FSUtils.normalizeBasePathForLocking for the implicit-key lock providers. StorageBasedLockProvider derives lockFilePath directly from the raw `hoodie.base.path` config value:

```java
this.basePath = config.getHudiTableBasePath();
String lockFolderPath = StorageLockClient.getLockFolderPath(basePath); // new StoragePath(basePath, ".locks")
this.lockFilePath = new StoragePath(lockFolderPath, DEFAULT_TABLE_LOCK_FILE_NAME).toString();
```

`StoragePath` absorbs trailing-slash drift, so `s3://b/t` and `s3://b/t/` already produce the same lock file. But two other classes of benign basePath drift do not get absorbed and route the writers to different lock files:

| Drift | Same lock file today? | Reason |
|---|---|---|
| Trailing slash (`s3://b/t` vs `s3://b/t/`) | ✅ yes | `StoragePath` normalizes |
| Scheme case (`s3a://b/t` vs `s3://b/t`) | no | Scheme is preserved verbatim through `StoragePath` |
| Surrounding whitespace | no | Whitespace becomes part of the URI |

Two writers that disagree on those benign formatting details acquire different lock files and lose mutual exclusion. Same class of bug as #18814, lower severity because the divergence is a visible second lock file rather than a silent hash split, but still a correctness issue.
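To make the drift concrete, here is a minimal sketch in plain Java with no Hudi dependency. `LockPathDriftDemo` and `joinLikeStoragePath` are hypothetical stand-ins, not Hudi APIs: the helper only mimics the one `StoragePath` behavior described above (absorbing a trailing slash), and `table_lock.json` is an assumed placeholder for `DEFAULT_TABLE_LOCK_FILE_NAME`, whose real value is not shown in this PR.

```java
final class LockPathDriftDemo {
    // Assumed placeholder; the real DEFAULT_TABLE_LOCK_FILE_NAME value is not shown in the PR.
    static final String LOCK_FILE_NAME = "table_lock.json";

    // Hypothetical stand-in for StoragePath joining: drops trailing slashes
    // from the parent, then appends the child. Nothing else is normalized.
    static String joinLikeStoragePath(String parent, String child) {
        String p = parent;
        while (p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p + "/" + child;
    }

    // Mirrors the pre-fix derivation: raw base path -> ".locks" folder -> lock file.
    static String rawLockFilePath(String rawBasePath) {
        String lockFolder = joinLikeStoragePath(rawBasePath, ".locks");
        return joinLikeStoragePath(lockFolder, LOCK_FILE_NAME);
    }

    public static void main(String[] args) {
        // The first two lines printed are identical; the last two differ from them.
        System.out.println(rawLockFilePath("s3://b/t"));
        System.out.println(rawLockFilePath("s3://b/t/"));
        System.out.println(rawLockFilePath("s3a://b/t"));
        System.out.println(rawLockFilePath(" s3://b/t"));
    }
}
```

Under these assumptions, only the trailing-slash variant lands on the same lock file as `s3://b/t`; the `s3a` and leading-whitespace variants each derive a distinct path.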

Summary and Changelog

Route the raw basePath through `FSUtils.normalizeBasePathForLocking` (the helper introduced in #18814) before building `lockFilePath`. Using the same canonicalization API as the implicit-key LPs keeps the contract for "what counts as the same Hudi table" consistent across all lock providers.

Also rename the stored field `basePath` → `normalizedHudiTableBasePath` so the name reflects what's actually stored.
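The effect of the change can be sketched in isolation as follows. This is not the code in the PR: `NormalizedLockPathSketch` and its `normalize` method are hypothetical approximations of what the description says the helper does (trim, `s3a` to `s3`, one trailing slash), input validation is omitted, and `table_lock.json` is an assumed placeholder file name.

```java
final class NormalizedLockPathSketch {
    // Assumed placeholder for the real lock file name constant.
    static final String LOCK_FILE_NAME = "table_lock.json";

    // Approximation of the canonicalization described in this PR:
    // trim, rewrite s3a:// to s3:// (case-insensitive), end with exactly one '/'.
    // Null and blank inputs are not handled in this sketch.
    static String normalize(String rawBasePath) {
        String p = rawBasePath.trim();
        if (p.regionMatches(true, 0, "s3a://", 0, 6)) {
            p = "s3://" + p.substring(6);
        }
        return p.replaceAll("/+$", "") + "/";
    }

    // Normalize first, then derive the lock file under the ".locks" folder.
    static String lockFilePath(String rawBasePath) {
        return normalize(rawBasePath) + ".locks/" + LOCK_FILE_NAME;
    }

    public static void main(String[] args) {
        // All three variants print the same lock file path.
        System.out.println(lockFilePath("s3://b/t"));
        System.out.println(lockFilePath("s3a://b/t/"));
        System.out.println(lockFilePath("  s3://b/t  "));
    }
}
```

The point of the ordering is that every formatting variant is collapsed before any path is derived from it, so the derivation code itself does not need to know about drift.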

Impact

  • Behavior: `StorageBasedLockProvider` will produce the same `lockFilePath` for a Hudi table regardless of trailing-slash, surrounding whitespace, or `s3a`/`s3` scheme variations. No public API signature change.
  • Performance: Negligible — one extra `trim()` and a short scheme rewrite per LP construction.
  • Compatibility (rollout caveat): Lock files keyed under the previous `s3a://...` form are effectively orphaned at deploy time, because the new code looks under `s3://...` for the same logical table. Operators upgrading should coordinate the deploy across all writers that share a Hudi table using this LP, or briefly quiesce writers while rolling out. Stale lock files can be cleaned up manually after the cutover. There is no rollout impact for callers that already pass `s3://...`.

Risk Level

low

The change only affects how raw config strings are canonicalized before being fed to existing path-derivation code. Existing trailing-slash callers will see no change in derived lockFilePath (`StoragePath` already absorbed that). The only callers whose lock file moves are those that supplied an `s3a://...` basePath or surrounding whitespace.

Mitigation:

  • Existing TestStorageBasedLockProvider suite continues to pass.
  • New tests cover the invariant directly: scheme drift, whitespace, and trailing-slash variants all produce the same lockFilePath.

Documentation Update

none — no user-facing config or website change.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Note: This PR is stacked on top of #18814 — its diff currently includes the parent PR's changes until #18814 merges. The StorageBasedLockProvider-specific commit is the last one in the branch.

…iders

The implicit-key lock providers - DynamoDBBasedImplicitPartitionKeyLockProvider
and ZookeeperBasedImplicitBasePathLockProvider - hash the hudi table base path
to derive a DynamoDB partition key / Zookeeper znode for the lock. The hash
function (XXH64) has the avalanche property: any byte-level difference in the
input produces a completely different output. Today the only normalization
applied before
hashing is s3aToS3 (s3a:// -> s3://). Trailing slashes, repeated slashes, and
surrounding whitespace pass through unchanged.

When two writers for the same hudi table disagree on those benign formatting
details - e.g. one engine supplies "s3://bucket/table" while another supplies
"s3://bucket/table/" - they end up acquiring different lock rows / znodes and
lose mutual exclusion, even though they are targeting the same table. Two
writers that should serialize can then write concurrently and corrupt the
hudi timeline.
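
The hash split described above can be illustrated with a short sketch. XXH64 is not part of the JDK, so this example swaps in SHA-256 from `java.security.MessageDigest`; it is a different hash function and is used here only because it shows the same sensitivity to a single extra byte. `HashSplitDemo` is a hypothetical name, and the way the real providers turn the hash into a partition key or znode is not reproduced.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashSplitDemo {
    // Hex-encoded SHA-256 of the input string (stand-in for XXH64).
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same logical table, one trailing slash apart: the digests are unrelated.
        System.out.println(sha256Hex("s3://bucket/table"));
        System.out.println(sha256Hex("s3://bucket/table/"));
    }
}
```

Because the two digests share nothing recognizable, an operator looking at the lock rows or znodes has no visual hint that they refer to the same table.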

Fix: introduce FSUtils.normalizeBasePathForLocking() as the single source of
truth for canonicalization before hashing:

  1. Reject null / empty / whitespace-only basePath.
  2. trim() surrounding whitespace.
  3. Apply existing s3aToS3 (case-insensitive s3a:// -> s3://).
  4. Strip all trailing '/' then add exactly one.
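
The four steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the actual `FSUtils` implementation: `BasePathNormalizerSketch` and `normalizeSketch` are hypothetical names, the exception type is a guess, and the scheme-only rejection added in a later commit is not included.

```java
final class BasePathNormalizerSketch {
    static String normalizeSketch(String basePath) {
        // 1. Reject null / empty / whitespace-only input (exception type is an assumption).
        if (basePath == null || basePath.trim().isEmpty()) {
            throw new IllegalArgumentException("basePath must not be null or blank");
        }
        // 2. Trim surrounding whitespace.
        String path = basePath.trim();
        // 3. Case-insensitive s3a:// -> s3:// rewrite.
        if (path.regionMatches(true, 0, "s3a://", 0, 6)) {
            path = "s3://" + path.substring(6);
        }
        // 4. Strip all trailing '/' characters, then add exactly one.
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end) + "/";
    }

    public static void main(String[] args) {
        System.out.println(normalizeSketch("s3://bucket/table"));
        System.out.println(normalizeSketch("  S3A://bucket/table///  "));
        // Inner consecutive slashes are left alone in this sketch.
        System.out.println(normalizeSketch("s3://b//x"));
    }
}
```

The first two calls both yield `s3://bucket/table/`, and the third yields `s3://b//x/`, since only trailing slashes are touched.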

Both implicit-key lock providers now route through it. The Dynamo provider
gains a small public static derivePartitionKey(String) so the formula is
testable without a DynamoDB client.

Inner consecutive slashes (s3://b//x vs s3://b/x) are intentionally NOT
collapsed - they can resolve to a legitimate S3 key.

Tests:
  - TestFSUtils#testNormalizeBasePathForLocking exercises the normalization
    rules directly: trailing slash, multi-slash, whitespace, s3a/s3 schemes,
    inner-slash preservation, null/empty rejection.
  - TestDynamoDBBasedImplicitPartitionKeyLockProvider verifies that all
    trailing-slash, multi-slash, whitespace, and s3a variants of the same
    base path produce the same DynamoDB partition key.
  - TestZookeeperBasedImplicitBasePathLockProvider verifies the same
    invariant for the Zookeeper lock base path.

Compatibility note: locks taken under the previous (no-trailing-slash or
whitespace-sensitive) form are effectively orphaned at deploy time, since the
new code looks at a different lock row / znode for the same logical table.
Deploys should be coordinated across all writers that share a hudi table,
or accept a brief writer quiesce while rolling out.

- normalizeBasePathForLocking: reject scheme-only inputs (s3://, s3a:///)
  and all-slash inputs that strip to nothing meaningful to hash
- DynamoDBBasedImplicitPartitionKeyLockProvider#derivePartitionKey:
  switch String.format to SLF4J parameterized logging (consistent with
  the sibling ZK provider); javadoc note explaining the static helper
  accepts raw input (super-constructor ordering precludes using the
  instance field)
- ZookeeperBasedImplicitBasePathLockProvider#getLockBasePath: same
  javadoc note for symmetry
- ITTestDynamoDBBasedLockProvider#testAcquireLock: compare against the
  normalized (trailing-slash) form of the hash input — the previous
  assertion would have failed after the canonicalization change
- TestFSUtils: cover s3:///, s3a:///, all-slash, and a special-char path

The private field stores the post-normalization basePath; rename it so
the name reflects that storage shape, matching what the constructor
actually assigns.

Two writers for the same Hudi table must derive the same lockFilePath
or they lose mutual exclusion. Today the path is built directly from
the raw config basePath; while StoragePath absorbs trailing-slash
drift, scheme-case drift (s3a:// vs s3://) and surrounding whitespace
do route to different lock files.

Route the basePath through FSUtils.normalizeBasePathForLocking (the
same helper used by the implicit-key LPs from the parent commit —
single canonicalization API across all lock providers) before
deriving lockFilePath. Rename the stored field to
normalizedHudiTableBasePath so the name reflects what is stored.

Tests: 3 end-to-end invariance tests on lockFilePath (scheme drift,
whitespace, trailing slash). The normalization helper itself is
covered by TestFSUtils#testNormalizeBasePathForLocking from the
parent commit.

@github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label on May 22, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR completes the basePath canonicalization story by routing StorageBasedLockProvider through FSUtils.normalizeBasePathForLocking, matching the pattern already applied to the implicit-key providers.

The helper is well-tested (edge cases for scheme, whitespace, trailing slashes, and scheme-only/all-slash rejection), derivePartitionKey is a nice testability extraction with clear Javadoc on why the instance field can't be used at that point, and parseBucketAndPath in the underlying S3/GCS lock clients preserves audit-config and lock-object identity across the scheme change. The hash-based rollout concern for DynamoDB/Zookeeper providers was already raised and acknowledged on #18814 so I won't reopen it here.

No correctness issues found. A few style/readability suggestions are in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming suggestion below: the new normalizedHudiTableBasePath field name carries redundant context across the three lock provider classes; otherwise the code is clean and well-documented.

cc @yihua

  private final Option<HoodieLockMetrics> hoodieLockMetrics;
  private Option<AuditService> auditService;
- private final String basePath;
+ private final String normalizedHudiTableBasePath;
Contributor

🤖 nit: normalizedHudiTableBasePath feels a bit verbose — within a class that's exclusively a Hudi table lock provider, the HudiTable infix is redundant context. Could you shorten to normalizedBasePath? Same pattern applies in ZookeeperBasedImplicitBasePathLockProvider and DynamoDBBasedImplicitPartitionKeyLockProvider.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 85.18519% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.92%. Comparing base (e299b84) to head (59be5df).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...DynamoDBBasedImplicitPartitionKeyLockProvider.java | 50.00% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18817   +/-   ##
=========================================
  Coverage     68.92%   68.92%           
- Complexity    29076    29086   +10     
=========================================
  Files          2509     2509           
  Lines        139470   139485   +15     
  Branches      17117    17120    +3     
=========================================
+ Hits          96130    96144   +14     
- Misses        35584    35588    +4     
+ Partials       7756     7753    -3     
| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | 44.43% <59.25%> (-0.01%) | ⬇️ |
| hadoop-mr-java-client | 44.90% <0.00%> (-0.02%) | ⬇️ |
| spark-client-hadoop-common | 48.24% <63.15%> (+<0.01%) | ⬆️ |
| spark-java-tests | 49.35% <0.00%> (+<0.01%) | ⬆️ |
| spark-scala-tests | 45.26% <0.00%> (-0.02%) | ⬇️ |
| utilities | 37.45% <0.00%> (-0.05%) | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...ent/transaction/lock/StorageBasedLockProvider.java | 87.86% <100.00%> (ø) | |
| ...ck/ZookeeperBasedImplicitBasePathLockProvider.java | 92.85% <100.00%> (+1.19%) | ⬆️ |
| ...c/main/java/org/apache/hudi/common/fs/FSUtils.java | 78.87% <100.00%> (+0.87%) | ⬆️ |
| ...DynamoDBBasedImplicitPartitionKeyLockProvider.java | 38.46% <50.00%> (+38.46%) | ⬆️ |

... and 15 files with indirect coverage changes

