[core] Filter side files in BTree global index scans by leaves12138 · Pull Request #8109 · apache/paimon

leaves12138 · 2026-06-03T13:27:16Z

Purpose

BTree global index scan planning should skip dedicated side files, such as blob and vector-store files, that the index build does not need. However, pruning by readType is too broad for data-evolution tables: old normal data files may not contain a newly added indexed column, but they still need to be scanned and indexed with a NULL key.

Changes

Filter blob and vector-store side files with withManifestEntryFilter during BTree full and incremental scans.
Avoid withReadType scan pruning so normal data files missing a newly added indexed column are preserved.
Keep the blob side-file regression test and add an end-to-end added-column NULL-key regression for BTree global index scans.
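
To make the difference between the two pruning strategies concrete, here is a minimal, self-contained sketch. It is a toy model, not the code in this PR: `Entry`, `isSideFile`, `plan`, and the `.blob`/`.vector` suffix check are hypothetical names and assumptions made for illustration, and the real filter is whatever predicate the PR passes to withManifestEntryFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of scan planning; none of these types are Paimon classes. */
final class SideFileFilterSketch {

    /** Hypothetical stand-in for a manifest entry: a file name plus the columns it stores. */
    static final class Entry {
        final String fileName;
        final Set<String> columns;

        Entry(String fileName, String... columns) {
            this.fileName = fileName;
            this.columns = new HashSet<>(Arrays.asList(columns));
        }
    }

    /** Assumption for this sketch only: side files are recognised by file suffix. */
    static boolean isSideFile(Entry e) {
        return e.fileName.endsWith(".blob") || e.fileName.endsWith(".vector");
    }

    /** Read-type style pruning: keep only files that physically contain the indexed column. */
    static Predicate<Entry> readTypePruning(String indexedColumn) {
        return e -> e.columns.contains(indexedColumn);
    }

    /** Entry-filter style pruning: drop side files, keep every normal data file. */
    static Predicate<Entry> sideFileFilter() {
        return e -> !isSideFile(e);
    }

    /** Returns the names of the files that survive the given filter, in input order. */
    static List<String> plan(List<Entry> entries, Predicate<Entry> filter) {
        List<String> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (filter.test(e)) {
                kept.add(e.fileName);
            }
        }
        return kept;
    }

    static List<Entry> sampleEntries() {
        return Arrays.asList(
                new Entry("data-old.parquet", "f1", "f2"),       // written before f3 was added
                new Entry("data-new.parquet", "f1", "f2", "f3"), // written after f3 was added
                new Entry("data-new.blob", "b1"));               // dedicated side file
    }

    public static void main(String[] args) {
        List<Entry> entries = sampleEntries();
        System.out.println("read-type pruning: " + plan(entries, readTypePruning("f3")));
        System.out.println("side-file filter:  " + plan(entries, sideFileFilter()));
    }
}
```

In this toy model, read-type pruning on f3 keeps only `data-new.parquet`, while the side-file filter keeps both parquet files and drops only the blob file, which is the behaviour the Changes list describes.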

Tests

mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BTreeGlobalIndexBuilderTest,BtreeGlobalIndexTableTest test
mvn -pl paimon-core spotless:apply && mvn -pl paimon-core spotless:check

JingsongLi · 2026-06-03T13:55:00Z

            return Optional.empty();
        }
-        snapshotReader = snapshotReader.withSnapshot(snapshot);
+        snapshotReader = withReadType(snapshotReader.withSnapshot(snapshot));


This pruning changes the semantics for data-evolution tables when the indexed column was added after some data was already written. DataEvolutionFileStoreScan.withReadType filters out manifest entries whose physical file schema does not contain any requested non-system field. For a newly added indexed column, old files do not contain that field, but they should still be scanned and indexed with a NULL key (the reader can project the missing column as null, and the BTree writer/reader already support null keys). With this change those old files are dropped during scan(), so IS NULL queries on the new column miss the old rows after the index is built.

I verified this with a small regression: write rows, add column f3, write one new row with f3, build a BTree index on f3, then run a global-index scan for f3 IS NULL. I expected the old rows, but the index returned 0 rows.

Could we avoid applying this read-type pruning for normal data files that lack the indexed column, while still excluding blob/vector side files?
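
The regression described above can be modelled in a few lines. This is only a sketch under stated assumptions: `DataFile`, `buildIndex`, and `isNull` are hypothetical names, a `TreeMap` with a null-tolerant comparator stands in for the BTree index, and nothing here calls Paimon APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Toy reproduction of the added-column scenario; not Paimon code. */
final class NullKeyIndexSketch {

    /** Hypothetical data file: its row ids and, if the file stores f3, their f3 values. */
    static final class DataFile {
        final List<Long> rowIds;
        final List<Integer> f3Values; // null when the file predates column f3

        DataFile(List<Long> rowIds, List<Integer> f3Values) {
            this.rowIds = rowIds;
            this.f3Values = f3Values;
        }
    }

    /** Builds a sorted key -> row ids map; the TreeMap stands in for the BTree index. */
    static TreeMap<Integer, List<Long>> buildIndex(
            List<DataFile> files, boolean pruneFilesMissingColumn) {
        TreeMap<Integer, List<Long>> index =
                new TreeMap<>(Comparator.nullsFirst(Comparator.<Integer>naturalOrder()));
        for (DataFile file : files) {
            if (file.f3Values == null && pruneFilesMissingColumn) {
                continue; // the flagged behaviour: old rows never reach the index
            }
            for (int i = 0; i < file.rowIds.size(); i++) {
                // A missing column is projected as null, so old rows get a NULL key.
                Integer key = file.f3Values == null ? null : file.f3Values.get(i);
                List<Long> ids = index.get(key);
                if (ids == null) {
                    ids = new ArrayList<>();
                    index.put(key, ids);
                }
                ids.add(file.rowIds.get(i));
            }
        }
        return index;
    }

    /** Answers "f3 IS NULL" from the index alone. */
    static List<Long> isNull(TreeMap<Integer, List<Long>> index) {
        List<Long> ids = index.get(null);
        return ids == null ? Collections.<Long>emptyList() : ids;
    }

    static List<DataFile> sampleFiles() {
        return Arrays.asList(
                new DataFile(Arrays.asList(0L, 1L, 2L), null),         // rows written before f3 existed
                new DataFile(Arrays.asList(3L), Arrays.asList(42)));   // one row written with f3
    }

    public static void main(String[] args) {
        System.out.println("with pruning:    " + isNull(buildIndex(sampleFiles(), true)));
        System.out.println("without pruning: " + isNull(buildIndex(sampleFiles(), false)));
    }
}
```

With pruning enabled the IS NULL lookup returns an empty list; with pruning disabled it returns rows 0, 1, and 2, matching the expectation stated in the comment.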

JingsongLi · 2026-06-04T03:01:43Z

+1

leaves12138 marked this pull request as ready for review June 3, 2026 13:36

leaves12138 force-pushed the codex/btree-index-read-type-scan branch from 5998a92 to 9b98df2 Compare June 3, 2026 13:42

JingsongLi reviewed Jun 3, 2026

Filter side files in btree global index scans

c01b94d

leaves12138 force-pushed the codex/btree-index-read-type-scan branch from 9b98df2 to c01b94d Compare June 3, 2026 14:08

leaves12138 changed the title ~~[core] Pass read type to BTree global index scans~~ [core] Filter side files in BTree global index scans Jun 3, 2026

JingsongLi merged commit 4a5462d into apache:master Jun 4, 2026
12 checks passed

JingsongLi merged 1 commit into apache:master from leaves12138:codex/btree-index-read-type-scan
