
[core] Avoid cross-file blob and vector compaction for data evolution #7938

Merged
JingsongLi merged 3 commits into apache:master from leaves12138:codex/fix-de-blob-compact-range
May 23, 2026

@leaves12138 leaves12138 commented May 22, 2026

Purpose

This PR prevents standalone Data Evolution dedicated-file compaction from combining blob or vector-store files that belong to different regular data-file row-id ranges.

Root Cause

The compact planner grouped dedicated files from a data compaction group before planning dedicated compact tasks. If blob or vector-store files were compacted across multiple regular data-file ranges without compacting those regular data files into the same row-id range, the compacted dedicated file could overlap several remaining data files.

Conflict detection groups files by overlapping row-id range and omits blob files from its error message. As a result, the failure surfaced during COMPACT as a conflict between multiple regular data files with different row-id ranges, with no mention of the blob file that linked them.
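To make the failure mode concrete, here is a minimal sketch of grouping files by overlapping row-id range. `FileRange`, `groupByOverlap`, the file names, and the row-id numbers are all hypothetical stand-ins chosen for illustration; this is not Paimon's conflict-detection code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RowIdOverlapSketch {

    /** Hypothetical stand-in for file metadata: a name plus an inclusive row-id range. */
    static final class FileRange {
        final String name;
        final long firstRowId;
        final long lastRowId;

        FileRange(String name, long firstRowId, long lastRowId) {
            this.name = name;
            this.firstRowId = firstRowId;
            this.lastRowId = lastRowId;
        }
    }

    /** Groups files whose row-id ranges overlap, directly or through a shared neighbour. */
    static List<List<FileRange>> groupByOverlap(List<FileRange> files) {
        List<FileRange> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileRange f) -> f.firstRowId));

        List<List<FileRange>> groups = new ArrayList<>();
        List<FileRange> current = new ArrayList<>();
        long currentEnd = Long.MIN_VALUE;
        for (FileRange f : sorted) {
            // A file that starts past the current group's end opens a new group.
            if (!current.isEmpty() && f.firstRowId > currentEnd) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(f);
            currentEnd = Math.max(currentEnd, f.lastRowId);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        FileRange data1 = new FileRange("data-1", 0, 99);
        FileRange data2 = new FileRange("data-2", 100, 199);

        // One blob file per data file: two independent groups.
        List<FileRange> perFile = Arrays.asList(
                data1, new FileRange("blob-1", 0, 99),
                data2, new FileRange("blob-2", 100, 199));
        System.out.println(groupByOverlap(perFile).size()); // 2

        // Blob files merged across both data files: the merged file bridges
        // data-1 and data-2 into a single group.
        List<FileRange> merged = Arrays.asList(
                data1, data2, new FileRange("blob-merged", 0, 199));
        System.out.println(groupByOverlap(merged).size()); // 1
    }
}
```

In the second case the only group contains two regular data files with different row-id ranges. If the blob file is then left out of the report, those two data files are all that appears in the error.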

Changes

  • Keep cross-data-file blob/vector-store compaction only when the corresponding regular data files are compacted in the same task.
  • Plan blob/vector-store compaction per containing data file when no regular data-file compaction is triggered.
  • Update planner tests for both the no-compact and compact-together paths.
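The two planning paths described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR. Files are plain strings and a task is a list of names, whereas the real planner works on DataFileMeta and emits DataEvolutionCompactTask. `planDedicated` is a hypothetical helper, and the sketch assumes the per-data-file path reuses the same compactMinFileNum threshold. It also ignores the per-field grouping that the real blob path performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedicatedFilePlanSketch {

    /**
     * Plans compaction tasks for dedicated (blob or vector-store) files.
     * Hypothetical simplification: each task is just the list of file names to merge.
     */
    static List<List<String>> planDedicated(
            List<String> dataFiles,
            Map<String, List<String>> dataFileToDedicated,
            boolean triggerNormalFile,
            int compactMinFileNum) {
        List<List<String>> tasks = new ArrayList<>();
        if (triggerNormalFile) {
            // The regular data files are compacted in the same plan, so their
            // dedicated files may be merged across data files.
            List<String> all = new ArrayList<>();
            for (String dataFile : dataFiles) {
                all.addAll(dataFileToDedicated.getOrDefault(dataFile, Collections.emptyList()));
            }
            if (all.size() >= compactMinFileNum) {
                tasks.add(all);
            }
        } else {
            // No regular compaction: only merge dedicated files that belong to
            // the same data file, so no output spans several data-file ranges.
            for (String dataFile : dataFiles) {
                List<String> own =
                        dataFileToDedicated.getOrDefault(dataFile, Collections.emptyList());
                if (own.size() >= compactMinFileNum) {
                    tasks.add(new ArrayList<>(own));
                }
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> dataFiles = Arrays.asList("data-1", "data-2");
        Map<String, List<String>> blobs = new LinkedHashMap<>();
        blobs.put("data-1", Collections.singletonList("blob-1"));
        blobs.put("data-2", Collections.singletonList("blob-2"));

        System.out.println(planDedicated(dataFiles, blobs, false, 2)); // []
        System.out.println(planDedicated(dataFiles, blobs, true, 2));  // [[blob-1, blob-2]]
    }
}
```

With one blob file per data file and a threshold of 2, the no-compact path plans nothing, while the compact-together path plans a single task covering both blob files.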

Tests

  • JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core spotless:apply
  • JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core -Dtest=DataEvolutionCompactCoordinatorTest test

@leaves12138 leaves12138 marked this pull request as ready for review May 22, 2026 18:00
@leaves12138 leaves12138 force-pushed the codex/fix-de-blob-compact-range branch from b0abd46 to 6d20e27 Compare May 22, 2026 18:09

@JingsongLi JingsongLi left a comment

Left comments:

  1. Vector store files have the same bug. Lines 374-383 still collect all vector store files from all data files in the group and compact them together, regardless of whether triggerNormalFile is true. The same cross-file compaction problem applies to vector store files. The fix should be applied symmetrically.
  2. Test is a negative-only assertion. The new test testCompactPlannerDoesNotCompactBlobFilesAcrossDataFiles asserts tasks.isEmpty(), but it would be stronger to also verify that when compactMinFileNum=2 (matching the 2 data files), the blob files DO get compacted together. This proves both the "yes-compact" and "no-compact" paths work. The existing testCompactPlannerWithBlobFiles partially covers this, but the boundary is subtle.
  3. Edge case: single data file with multiple blob files per field. When triggerNormalFile == false, the per-data-file blob compaction loop calls blobFileGroupsToCompact() for each data file individually. If a single data file has, say, 3 small blob files for the same field (from prior partial compactions or writes), this correctly compacts them. Good.
  4. Minor: The else branch iterates all dataFiles and plans blob compaction per file. If dataFiles has, say, 5 files but only 2 have blob files, this incurs 5 iterations but getOrDefault(..., emptyList()) returns empty for the others and blobFileGroupsToCompact([]) returns empty; harmless but slightly wasteful. Not worth fixing.

@leaves12138 leaves12138 changed the title [core] Avoid cross-file blob compaction for data evolution [core] Avoid cross-file blob and vector compaction for data evolution May 23, 2026

@JingsongLi JingsongLi left a comment

Review

The fix for blob compaction is correct — preventing cross-data-file blob compaction when the data files themselves aren't being compacted makes sense and the root cause analysis is clear.

Main issue: Vector store files have the same bug.

Lines 374-383 (the compactVector block) still collect all vector store files across all data files in the group unconditionally:

if (compactVector) {
    List<DataFileMeta> vectorStoreFiles = new ArrayList<>();
    for (DataFileMeta dataFile : dataFiles) {
        vectorStoreFiles.addAll(
                dataFileToVectorStoreFiles.getOrDefault(dataFile, Collections.emptyList()));
    }
    if (vectorStoreFiles.size() >= compactMinFileNum) {
        tasks.add(new DataEvolutionCompactTask(partition, vectorStoreFiles, false));
    }
}

This has the exact same cross-file compaction problem. When triggerNormalFile == false, vector store files from different data-file row-id ranges will be compacted together, producing a compacted vector file that overlaps multiple uncompacted data files. The triggerNormalFile guard should be applied symmetrically to vector store compaction.

@JingsongLi

+1

@JingsongLi JingsongLi merged commit 3e91173 into apache:master May 23, 2026
12 checks passed

2 participants