[core] Filter side files in BTree global index scans #8109
      return Optional.empty();
  }
- snapshotReader = snapshotReader.withSnapshot(snapshot);
+ snapshotReader = withReadType(snapshotReader.withSnapshot(snapshot));
This pruning changes the semantics for data-evolution tables when the indexed column was added after some data was already written. DataEvolutionFileStoreScan.withReadType filters out manifest entries whose physical file schema does not contain any requested non-system field. For a newly added indexed column, old files do not contain that field, but they should still be scanned and indexed with a NULL key (the reader can project the missing column as null, and the BTree writer/reader already support null keys). With this change those old files are dropped during scan(), so IS NULL queries on the new column miss the old rows after the index is built. I verified this with a small regression: write rows, add column f3, write one new row with f3, build a BTree index on f3, then global-index scan f3 IS NULL; expected the old rows, but the index returned 0. Could we avoid applying this read-type pruning for normal data files that lack the indexed column, while still excluding blob/vector side files?
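The regression described in this comment can be modeled with a small, self-contained sketch. It is not Paimon code: `DataFile`, `pruneByReadType`, `buildIndex`, and `isNullLookup` are hypothetical names invented for illustration, and the pruning rule is only an approximation of what the comment attributes to DataEvolutionFileStoreScan.withReadType. A sorted map with a null-tolerant comparator stands in for the BTree index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model, not Paimon code: every type and method here is invented.
final class ReadTypePruningSketch {

    // A file records which columns it physically stores, its row ids,
    // and the f3 value per row id (empty when f3 is not stored).
    record DataFile(String name, Set<String> physicalFields,
                    List<Long> rowIds, Map<Long, Integer> f3Values) {}

    // Approximates the pruning described above: keep a file only if it
    // physically contains at least one requested field.
    static List<DataFile> pruneByReadType(List<DataFile> files, Set<String> requested) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : files) {
            if (f.physicalFields().stream().anyMatch(requested::contains)) {
                kept.add(f);
            }
        }
        return kept;
    }

    // Builds key -> row ids; nullsFirst lets the sorted map hold a NULL key.
    static TreeMap<Integer, List<Long>> buildIndex(List<DataFile> files) {
        TreeMap<Integer, List<Long>> index =
                new TreeMap<>(Comparator.nullsFirst(Comparator.<Integer>naturalOrder()));
        for (DataFile f : files) {
            for (Long rowId : f.rowIds()) {
                // A column missing from the file is projected as null.
                Integer key = f.physicalFields().contains("f3") ? f.f3Values().get(rowId) : null;
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowId);
            }
        }
        return index;
    }

    // Models "f3 IS NULL" against the index.
    static List<Long> isNullLookup(TreeMap<Integer, List<Long>> index) {
        return index.getOrDefault(null, List.of());
    }

    // Two old rows written before f3 existed, one new row written with f3.
    static List<DataFile> sampleFiles() {
        DataFile oldFile = new DataFile("old", Set.of("f1", "f2"), List.of(0L, 1L), Map.of());
        DataFile newFile = new DataFile("new", Set.of("f1", "f2", "f3"), List.of(2L), Map.of(2L, 7));
        return List.of(oldFile, newFile);
    }

    public static void main(String[] args) {
        List<DataFile> all = sampleFiles();
        // With pruning, the old file is dropped before indexing.
        System.out.println(isNullLookup(buildIndex(pruneByReadType(all, Set.of("f3"))))); // []
        // Without pruning, the old rows are indexed under the NULL key.
        System.out.println(isNullLookup(buildIndex(all))); // [0, 1]
    }
}
```

In this model the pruned scan yields an empty IS NULL result, which mirrors the "index returned 0" observation, while the unpruned scan returns the two old rows.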
+1
Purpose
BTree global index scan planning should avoid unnecessary dedicated side files such as blob and vector-store files. However, pruning by readType is too broad for data-evolution tables: old normal data files may not contain a newly added indexed column, but they still need to be scanned and indexed with a NULL key.
Changes
- Filter side files with withManifestEntryFilter during BTree full and incremental scans.
- Avoid withReadType scan pruning so normal data files missing a newly added indexed column are preserved.
Tests
mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BTreeGlobalIndexBuilderTest,BtreeGlobalIndexTableTest test
mvn -pl paimon-core spotless:apply && mvn -pl paimon-core spotless:check
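As a rough illustration of the filtering idea in this description, the sketch below drops only side files and keeps every normal data file regardless of its schema. It is a simplified model under stated assumptions, not the actual change: `FileKind`, `Entry`, `NOT_SIDE_FILE`, `plan`, and the file names are all hypothetical, and how Paimon really identifies blob or vector-store files is not shown here.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model, not Paimon's API: FileKind and Entry are invented here.
final class SideFileFilterSketch {

    enum FileKind { DATA, BLOB, VECTOR }

    record Entry(String fileName, FileKind kind, Set<String> physicalFields) {}

    // Entry-level filter: keep normal data files whatever columns they store,
    // and exclude dedicated side files.
    static final Predicate<Entry> NOT_SIDE_FILE = e -> e.kind() == FileKind.DATA;

    // Returns the names of the files that survive the filter, in input order.
    static List<String> plan(List<Entry> entries, Predicate<Entry> filter) {
        return entries.stream()
                .filter(filter)
                .map(Entry::fileName)
                .collect(Collectors.toList());
    }

    // One old data file without f3, one new data file with f3, two side files.
    static List<Entry> sampleEntries() {
        return List.of(
                new Entry("data-old", FileKind.DATA, Set.of("f1", "f2")),
                new Entry("data-new", FileKind.DATA, Set.of("f1", "f2", "f3")),
                new Entry("side-blob", FileKind.BLOB, Set.of("b1")),
                new Entry("side-vector", FileKind.VECTOR, Set.of("v1")));
    }

    public static void main(String[] args) {
        System.out.println(plan(sampleEntries(), NOT_SIDE_FILE)); // [data-old, data-new]
    }
}
```

The point of filtering by file kind rather than by requested fields is that the old data file, which lacks f3, stays in the plan and can still be indexed under a NULL key.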