[core] Avoid cross-file blob and vector compaction for data evolution#7938
Conversation
b0abd46 to
6d20e27
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Left comments:
- Vector store files have the same bug. Lines 374-383 still collect all vector store files from all data files in the group and compact them together, regardless of whether triggerNormalFile is true. The same
cross-file compaction problem applies to vector store files. The fix should be applied symmetrically. - Test is a negative-only assertion. The new test testCompactPlannerDoesNotCompactBlobFilesAcrossDataFiles asserts tasks.isEmpty(), but it would be stronger to also verify that when compactMinFileNum=2 (matching
the 2 data files), the blob files DO get compacted together. This proves both the "yes-compact" and "no-compact" paths work. The existing testCompactPlannerWithBlobFiles partially covers this, but the boundary
is subtle. - Edge case: single data file with multiple blob files per field. When triggerNormalFile == false, the per-data-file blob compaction loop calls blobFileGroupsToCompact() for each data file individually. If a
single data file has, say, 3 small blob files for the same field (from prior partial compactions or writes), this correctly compacts them. Good. - Minor: The else branch iterates all dataFiles and plans blob compaction per file. If dataFiles has, say, 5 files but only 2 have blob files, this incurs 5 iterations but getOrDefault(..., emptyList()) returns
empty for the others and blobFileGroupsToCompact([]) returns empty — harmless but slightly wasteful. Not worth fixing.
JingsongLi
left a comment
There was a problem hiding this comment.
Review
The fix for blob compaction is correct — preventing cross-data-file blob compaction when the data files themselves aren't being compacted makes sense and the root cause analysis is clear.
Main issue: Vector store files have the same bug.
Lines 374-383 (the compactVector block) still collect all vector store files across all data files in the group unconditionally:
if (compactVector) {
List<DataFileMeta> vectorStoreFiles = new ArrayList<>();
for (DataFileMeta dataFile : dataFiles) {
vectorStoreFiles.addAll(
dataFileToVectorStoreFiles.getOrDefault(dataFile, Collections.emptyList()));
}
if (vectorStoreFiles.size() >= compactMinFileNum) {
tasks.add(new DataEvolutionCompactTask(partition, vectorStoreFiles, false));
}
}This has the exact same cross-file compaction problem. When triggerNormalFile == false, vector store files from different data-file row-id ranges will be compacted together, producing a compacted vector file that overlaps multiple uncompacted data files. The triggerNormalFile guard should be applied symmetrically to vector store compaction.
|
+1 |
Purpose
This PR prevents standalone Data Evolution dedicated-file compaction from combining blob or vector-store files that belong to different regular data-file row-id ranges.
Root Cause
The compact planner grouped dedicated files from a data compaction group before planning dedicated compact tasks. If blob or vector-store files were compacted across multiple regular data-file ranges without compacting those regular data files into the same row-id range, the compacted dedicated file could overlap several remaining data files.
Conflict detection groups files by overlapping row-id range and filters blob files from the error message, so the failure surfaced as multiple regular data files with different row-id ranges conflicting during COMPACT.
Changes
Tests
JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core spotless:applyJAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core -Dtest=DataEvolutionCompactCoordinatorTest test