[core] chain table support special partition expire. #7643

Open
Stephen0421 wants to merge 1 commit into apache:master from Stephen0421:chain_table_partition_expire

@Stephen0421
Contributor

Purpose

This PR implements partition expiration for chain tables. Chain tables store data across snapshot
and delta branches, where delta partitions depend on their nearest earlier snapshot partition as an
anchor for merge-on-read. Standard partition expiration cannot be applied directly because dropping
a snapshot partition without considering its dependent deltas would break the chain integrity.

Changes

New: Segment-based partition expiration (ChainTablePartitionExpire)

Introduces a segment-based expiration algorithm that preserves chain integrity. A segment consists
of one snapshot partition and all delta partitions whose time falls between that snapshot and the
next snapshot. The segment is the atomic unit of expiration.
Algorithm per group:

  1. Sort snapshot partitions by chain partition time.
  2. Filter to those before the cutoff (now - partition.expiration-time).
  3. If fewer than 2 snapshots fall before the cutoff, nothing is expired (a lone snapshot must be
    kept as the anchor for its dependent deltas).
  4. The most recent snapshot before the cutoff is the anchor (kept). All earlier snapshots form
    expirable segments together with their associated delta partitions.
  5. Orphan deltas before the earliest expired snapshot are also expired.
  6. Delta partitions are dropped before snapshot partitions so that the commit pre-check passes.
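
The steps above can be sketched for a single group as follows. This is a minimal illustration, not the code in ChainTablePartitionExpire: the class name `SegmentExpireSketch`, the `Plan` holder, and the use of plain `long` values for chain partition time are all hypothetical. It also assumes that "before the cutoff" and "before the anchor" are strict comparisons, and it does not model maxExpireNum or the check interval.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the per-group planning step; not Paimon's implementation.
final class SegmentExpireSketch {

    /** Partitions to drop for one group, identified by chain partition time. */
    static final class Plan {
        final List<Long> deltasToDrop = new ArrayList<>();    // dropped first (step 6)
        final List<Long> snapshotsToDrop = new ArrayList<>(); // dropped second
    }

    static Plan plan(List<Long> snapshotTimes, List<Long> deltaTimes, long cutoff) {
        Plan plan = new Plan();

        // Steps 1-2: sorted snapshots strictly before the cutoff (strictness is an assumption).
        TreeSet<Long> beforeCutoff = new TreeSet<>();
        for (long t : snapshotTimes) {
            if (t < cutoff) {
                beforeCutoff.add(t);
            }
        }

        // Step 3: with fewer than two candidates, the only one must stay as anchor.
        if (beforeCutoff.size() < 2) {
            return plan;
        }

        // Step 4: the most recent candidate is the anchor; all earlier snapshots expire.
        long anchor = beforeCutoff.last();
        plan.snapshotsToDrop.addAll(beforeCutoff.headSet(anchor));

        // Steps 4-5: every delta strictly before the anchor either belongs to an
        // expired segment or is an orphan before the earliest expired snapshot.
        for (long d : deltaTimes) {
            if (d < anchor) {
                plan.deltasToDrop.add(d);
            }
        }
        Collections.sort(plan.deltasToDrop);
        return plan;
    }
}
```

A caller would drop `deltasToDrop` before `snapshotsToDrop`, matching step 6. With snapshots at 10, 20, 30, 40, deltas at 5, 12, 25, 35, and a cutoff of 35, the snapshot at 30 is the anchor and the delta at 35 survives because it depends on that anchor.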

Refactored: PartitionExpire interface extraction

Extracted PartitionExpire from a concrete class into an interface with three methods:
expire(long), isValueExpiration(), and isValueAllExpired(Collection<BinaryRow>).
The original implementation is preserved as NormalPartitionExpire. All existing consumers
(Flink/Spark procedures, actions, TableCommitImpl, ConflictDetection) use only the interface
methods — no compatibility impact.
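
For readers unfamiliar with the shape of the extracted interface, here is a rough, self-contained sketch. Only the three method names and their parameter lists come from the description above; the return types, the type parameter `P` (standing in for BinaryRow so the snippet compiles without Paimon), and the toy `CutoffExpire` implementation are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical mirror of the extracted interface; return types are assumed.
interface PartitionExpireSketch<P> {
    /** Runs one expiration pass and returns the partitions that were dropped. */
    List<P> expire(long commitIdentifier);

    /** Whether expiration is decided from partition values rather than update time. */
    boolean isValueExpiration();

    /** Whether every partition in the collection counts as expired. */
    boolean isValueAllExpired(Collection<P> partitions);
}

// Toy implementation: partitions are plain timestamps compared to a fixed cutoff.
final class CutoffExpire implements PartitionExpireSketch<Long> {
    private final List<Long> partitions;
    private final long cutoff;

    CutoffExpire(List<Long> partitionTimes, long cutoff) {
        this.partitions = new ArrayList<>(partitionTimes);
        this.cutoff = cutoff;
    }

    @Override
    public List<Long> expire(long commitIdentifier) {
        List<Long> dropped = new ArrayList<>();
        for (Long p : partitions) {
            if (p < cutoff) {
                dropped.add(p);
            }
        }
        partitions.removeAll(dropped);
        return dropped;
    }

    @Override
    public boolean isValueExpiration() {
        return true;
    }

    @Override
    public boolean isValueAllExpired(Collection<Long> candidates) {
        for (Long p : candidates) {
            if (p >= cutoff) {
                return false;
            }
        }
        return true;
    }
}
```

Callers that hold only a `PartitionExpireSketch` reference do not need to know which implementation they were given, which is the property the refactoring relies on.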

Fixed: ChainTableCommitPreCallback group partition awareness

The commit pre-callback that validates snapshot partition drops was not group-partition aware.
It used full partition comparators and triangular predicates, which could match partitions across
different groups and produce incorrect pre/next snapshot lookups. Refactored to:

  • Filter snapshot partitions to the same group before finding pre/next.
  • Use ChainPartitionProjector to extract group and chain dimensions.
  • Apply group-scoped predicates for delta partition filtering.
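
To make the cross-group problem concrete, the sketch below finds the previous and next snapshot for a target partition while restricting the search to the target's own group. The `Partition` class is a hypothetical stand-in for the group and chain dimensions that ChainPartitionProjector extracts; the method names and the strict comparisons are assumptions, not the callback's actual code.

```java
import java.util.List;

// Hypothetical sketch of a group-scoped pre/next snapshot lookup.
final class GroupScopedLookupSketch {

    /** Stand-in for a partition split into its group and chain dimensions. */
    static final class Partition {
        final String group;
        final long chainTime;

        Partition(String group, long chainTime) {
            this.group = group;
            this.chainTime = chainTime;
        }
    }

    /** Latest snapshot in the target's group strictly before the target, or null. */
    static Partition previous(List<Partition> snapshots, Partition target) {
        Partition best = null;
        for (Partition s : snapshots) {
            if (!s.group.equals(target.group) || s.chainTime >= target.chainTime) {
                continue; // other group, or not earlier than the target
            }
            if (best == null || s.chainTime > best.chainTime) {
                best = s;
            }
        }
        return best;
    }

    /** Earliest snapshot in the target's group strictly after the target, or null. */
    static Partition next(List<Partition> snapshots, Partition target) {
        Partition best = null;
        for (Partition s : snapshots) {
            if (!s.group.equals(target.group) || s.chainTime <= target.chainTime) {
                continue; // other group, or not later than the target
            }
            if (best == null || s.chainTime < best.chainTime) {
                best = s;
            }
        }
        return best;
    }
}
```

With snapshots (A, 10), (B, 15), (A, 30) and a target of (A, 20), a lookup that ignored groups would pick (B, 15) as the previous snapshot; the group filter returns (A, 10) instead.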

Tests

  • Single partition key: expire segments with correct anchor retention
  • No expiration when < 2 snapshots before cutoff
  • Multiple segment expiration
  • No expiration when no snapshots before cutoff
  • Check interval prevents premature expiration
  • maxExpireNum limits number of expired segments
  • Group partition: independent expiration per group
  • isValueAllExpired: anchor partitions not reported as expired
  • isValueAllExpired: groups with < 2 snapshots retain all
  • isValueAllExpired: cross-group mixed scenarios
  • isValueAllExpired: partitions after cutoff not expired

@Stephen0421 force-pushed the chain_table_partition_expire branch from e79ac90 to 408f4cc on April 27, 2026 08:01
@Stephen0421
Contributor Author

Hello @JingsongLi , could you PTAL at this PR when you have a moment? I'd really appreciate your feedback.

Contributor

@leaves12138 leaves12138 left a comment

Thanks for the detailed implementation and tests. I found one correctness issue in the explicit Flink action path:

ExpirePartitionsAction still builds a PartitionExpireStrategy from the user-provided expireStrategy argument and calls FileStore.newPartitionExpire(..., expireStrategy). However, the chain-table branch in AbstractFileStore.newPartitionExpire(String, FileStoreTable, Duration, Duration, PartitionExpireStrategy) ignores that argument and always returns newChainTablePartitionExpire(...).

This means an explicit action call with expireStrategy = update-time (or a custom strategy) against a chain table will silently run the chain table's values-time expiration instead of rejecting the unsupported strategy. The schema validation added in this PR catches persisted/dynamic table options, but it does not cover this action path because the action passes the strategy only as an argument to the overload.

Could we add the same validation in this overload before creating ChainTablePartitionExpire (for example, require the provided strategy to be PartitionValuesTimeExpireStrategy), or route the action through dynamic options so SchemaValidation rejects non-values-time consistently?
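
As a rough illustration of the first option, a guard of this shape could run before the chain-table expirer is created. Every type here is a hypothetical stand-in (`Strategy`, `ValuesTimeStrategy`, `UpdateTimeStrategy`, `StrategyGuardSketch`) so that the snippet compiles on its own; it is not the code in AbstractFileStore.

```java
// Hypothetical stand-ins for the strategy hierarchy; not Paimon classes.
interface Strategy {}

final class ValuesTimeStrategy implements Strategy {}

final class UpdateTimeStrategy implements Strategy {}

final class StrategyGuardSketch {

    /** Rejects any explicitly passed strategy that a chain table cannot honor. */
    static void requireValuesTime(boolean chainTable, Strategy strategy) {
        if (chainTable && !(strategy instanceof ValuesTimeStrategy)) {
            throw new IllegalArgumentException(
                    "Chain tables only support values-time partition expiration, but got: "
                            + strategy.getClass().getSimpleName());
        }
    }
}
```

Failing fast here would turn the silent fallback into an explicit error for the action caller, while leaving non-chain tables unaffected.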

Contributor

@JingsongLi JingsongLi left a comment

This is a significant and well-thought-out feature. The segment-based expiration algorithm is the right approach for preserving chain integrity. Comments:

Design:

  1. Segment abstraction: The "one snapshot + its dependent deltas = atomic unit" model is clean. The documentation in chain-table.md is excellent and clearly explains the algorithm.

  2. Interface extraction: Extracting PartitionExpire to an interface with NormalPartitionExpire as the original implementation is a good refactoring pattern. All existing call sites only use interface methods — no compatibility impact.

  3. Delta-before-snapshot ordering: Dropping deltas before their snapshot partition ensures the commit pre-check always passes. This is a critical correctness detail — well handled.

Concerns:

  1. Anchor selection edge case: "If fewer than 2 snapshots fall before the cutoff, nothing is expired." What about the case where there are 100 snapshots but only 1 is before the cutoff? The algorithm description says we need at least 2 before the cutoff to expire the ones before the anchor. This seems correct, but make sure the test covers the boundary (exactly 2 snapshots before the cutoff: the earliest is expired, the second becomes the anchor).

  2. Group partition awareness: The PR description mentions a fix to ChainTableCommitPreCallback for group-partition awareness. This is a bug fix bundled with a feature — consider whether it should be a separate commit for easier bisection.

  3. Performance: For tables with many partitions, the sorting + filtering per group could be expensive during each commit. Is this cached or does it re-scan manifest state each time?

  4. SchemaValidation changes: The change file list includes SchemaValidation.java. What validation is being added/changed? Ensure partition expiration options are validated at table creation time.

Good work overall. Please confirm test coverage for the boundary conditions (exactly 2 snapshots before cutoff, groups with no snapshot branch partitions).

@JingsongLi
Contributor

Please rebase master.

@Stephen0421 force-pushed the chain_table_partition_expire branch from 408f4cc to 63a35e5 on May 25, 2026 11:07
@Stephen0421
Contributor Author

> Thanks for the detailed implementation and tests. I found one correctness issue in the explicit Flink action path:
>
> ExpirePartitionsAction still builds a PartitionExpireStrategy from the user-provided expireStrategy argument and calls FileStore.newPartitionExpire(..., expireStrategy). However, the chain-table branch in AbstractFileStore.newPartitionExpire(String, FileStoreTable, Duration, Duration, PartitionExpireStrategy) ignores that argument and always returns newChainTablePartitionExpire(...).
>
> This means an explicit action call with expireStrategy = update-time (or a custom strategy) against a chain table will silently run the chain table's values-time expiration instead of rejecting the unsupported strategy. The schema validation added in this PR catches persisted/dynamic table options, but it does not cover this action path because the action passes the strategy only as an argument to the overload.
>
> Could we add the same validation in this overload before creating ChainTablePartitionExpire (for example, require the provided strategy to be PartitionValuesTimeExpireStrategy), or route the action through dynamic options so SchemaValidation rejects non-values-time consistently?

Validation has already been added in AbstractFileStore.newPartitionExpire.

@Stephen0421
Contributor Author

> Please rebase master.

OK

3 participants