[core] Filter side files in BTree global index scans #8109

Merged
JingsongLi merged 1 commit into apache:master from leaves12138:codex/btree-index-read-type-scan on Jun 4, 2026
@leaves12138 leaves12138 commented Jun 3, 2026

Purpose

BTree global index scan planning should skip dedicated side files, such as blob and vector-store files, that the index build does not need. However, pruning by readType is too broad for data-evolution tables: old normal data files may not contain a newly added indexed column, but they still need to be scanned and indexed with a NULL key.

Changes

  • Filter blob and vector-store side files with withManifestEntryFilter during BTree full and incremental scans.
  • Avoid withReadType scan pruning so normal data files missing a newly added indexed column are preserved.
  • Keep the blob side-file regression test and add an end-to-end added-column NULL-key regression for BTree global index scans.
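
The difference between the two pruning strategies can be sketched with a toy model. This is not Paimon's API: `FileEntry`, `FileKind`, `pruneByReadType`, and `dropSideFiles` are hypothetical names chosen for illustration, and the real `withReadType` and `withManifestEntryFilter` operate on manifest entries with more detail than shown here.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model, NOT Paimon's API: every name below is hypothetical.
class SideFileFilterSketch {

    enum FileKind { NORMAL, BLOB, VECTOR_STORE }

    // A file described by its kind and the columns physically stored in it.
    static final class FileEntry {
        final String name;
        final FileKind kind;
        final Set<String> columns;

        FileEntry(String name, FileKind kind, Set<String> columns) {
            this.name = name;
            this.kind = kind;
            this.columns = columns;
        }
    }

    // Read-type style pruning: keeps only files that physically contain the column.
    static List<FileEntry> pruneByReadType(List<FileEntry> files, String indexedColumn) {
        return files.stream()
                .filter(f -> f.columns.contains(indexedColumn))
                .collect(Collectors.toList());
    }

    // Entry-filter style pruning: drops side files only, keeps every normal data file.
    static List<FileEntry> dropSideFiles(List<FileEntry> files) {
        return files.stream()
                .filter(f -> f.kind == FileKind.NORMAL)
                .collect(Collectors.toList());
    }

    // One file written before f3 existed, one written after, and one blob side file.
    static List<FileEntry> sampleFiles() {
        return List.of(
                new FileEntry("old-data", FileKind.NORMAL, Set.of("f1", "f2")),
                new FileEntry("new-data", FileKind.NORMAL, Set.of("f1", "f2", "f3")),
                new FileEntry("blob-side", FileKind.BLOB, Set.of("b1")));
    }

    public static void main(String[] args) {
        List<FileEntry> files = sampleFiles();
        System.out.println("read-type pruning keeps: " + pruneByReadType(files, "f3").size());
        System.out.println("side-file filter keeps:  " + dropSideFiles(files).size());
    }
}
```

In this sketch both strategies exclude the blob side file, but only the side-file filter retains `old-data`, the normal file that lacks the newly added column.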

Tests

  • mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BTreeGlobalIndexBuilderTest,BtreeGlobalIndexTableTest test
  • mvn -pl paimon-core spotless:apply && mvn -pl paimon-core spotless:check

@leaves12138 leaves12138 marked this pull request as ready for review June 3, 2026 13:36
@leaves12138 leaves12138 force-pushed the codex/btree-index-read-type-scan branch from 5998a92 to 9b98df2 June 3, 2026 13:42
  return Optional.empty();
  }
- snapshotReader = snapshotReader.withSnapshot(snapshot);
+ snapshotReader = withReadType(snapshotReader.withSnapshot(snapshot));
Contributor

This pruning changes the semantics for data-evolution tables when the indexed column was added after some data was already written. DataEvolutionFileStoreScan.withReadType filters out manifest entries whose physical file schema does not contain any requested non-system field. For a newly added indexed column, old files do not contain that field, but they should still be scanned and indexed with a NULL key (the reader can project the missing column as null, and the BTree writer/reader already support null keys). With this change those old files are dropped during scan(), so IS NULL queries on the new column miss the old rows after the index is built. I verified this with a small regression: write rows, add column f3, write one new row with f3, build a BTree index on f3, then global-index scan f3 IS NULL; expected the old rows, but the index returned 0. Could we avoid applying this read-type pruning for normal data files that lack the indexed column, while still excluding blob/vector side files?
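
The regression described in this comment can be reproduced in miniature. The sketch below is a toy model under stated assumptions, not Paimon's implementation: `DataFile`, `buildIndex`, `lookupIsNull`, and the `pruneFilesMissingF3` flag are hypothetical, and a `HashMap` stands in for the BTree index because it also accepts a null key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, NOT Paimon's implementation: every name here is hypothetical.
class NullKeyIndexSketch {

    static final class DataFile {
        final List<Long> rowIds;
        final Map<Long, Integer> f3Values; // null when the file predates column f3

        DataFile(List<Long> rowIds, Map<Long, Integer> f3Values) {
            this.rowIds = rowIds;
            this.f3Values = f3Values;
        }
    }

    // Builds a key -> row ids index on f3. Rows from files without f3 get a null key,
    // unless such files are pruned before the build.
    static Map<Integer, List<Long>> buildIndex(List<DataFile> files, boolean pruneFilesMissingF3) {
        Map<Integer, List<Long>> index = new HashMap<>();
        for (DataFile file : files) {
            if (pruneFilesMissingF3 && file.f3Values == null) {
                continue; // mimics read-type pruning dropping the old file
            }
            for (Long rowId : file.rowIds) {
                // A missing column is projected as null.
                Integer key = file.f3Values == null ? null : file.f3Values.get(rowId);
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowId);
            }
        }
        return index;
    }

    // Answers "f3 IS NULL" from the index alone.
    static List<Long> lookupIsNull(Map<Integer, List<Long>> index) {
        return index.getOrDefault(null, List.of());
    }

    // Three rows written before f3 was added, then one row written with f3 = 42.
    static List<DataFile> scenario() {
        DataFile oldFile = new DataFile(List.of(0L, 1L, 2L), null);
        DataFile newFile = new DataFile(List.of(3L), Map.of(3L, 42));
        return List.of(oldFile, newFile);
    }

    public static void main(String[] args) {
        System.out.println("pruned:   " + lookupIsNull(buildIndex(scenario(), true)));
        System.out.println("unpruned: " + lookupIsNull(buildIndex(scenario(), false)));
    }
}
```

With pruning enabled the `IS NULL` lookup finds no rows, which matches the failure reported above; with pruning disabled it returns the three old rows.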

Contributor Author

Fixed

@leaves12138 leaves12138 force-pushed the codex/btree-index-read-type-scan branch from 9b98df2 to c01b94d June 3, 2026 14:08
@leaves12138 leaves12138 changed the title from "[core] Pass read type to BTree global index scans" to "[core] Filter side files in BTree global index scans" Jun 3, 2026
@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit 4a5462d into apache:master Jun 4, 2026
12 checks passed