Core: Add v4 TrackedFileAdapters to bridge Data/Delete Files by anoopj · Pull Request #16100 · apache/iceberg

anoopj · 2026-04-24T14:22:47Z

The adapter bridges TrackedFile to the existing DataFile/DeleteFile APIs and minimizes the v4-related code changes needed during scan planning and commits.


anoopj · 2026-04-24T15:13:12Z

+    return new TrackedDeleteFile(file, spec);
+  }
+
+  // TODO: TrackedFile will likely get an explicit partition tuple field (using a union partition


This will change once the approach to storing the partition tuple is settled.

anoopj · 2026-04-28T17:11:48Z

+    return result.isEmpty() ? null : result;
+  }
+
+  static Map<Integer, Long> nullValueCounts(ContentStats stats) {


An open question is whether it's worth caching the stats (lazily or eagerly). I don't see a lot of repeated reads, so it may not be worth it.

It's probably fine. I have a comment about this below.

rdblue · 2026-05-04T20:41:18Z

+
+  private TrackedFileAdapters() {}
+
+  static DataFile asDataFile(TrackedFile file, PartitionSpec spec) {


What was your reason not to make this a method of TrackedFile?

Also, PartitionSpec is tracked by the file itself, so it is very strange to pass it in here. At a minimum, I would expect this to validate that the spec's ID matches the file's spec ID. But it would be better still to have some way to look up the spec by ID instead of forcing the caller to do it and then validating that the caller did it correctly.

I intentionally kept the adapters outside of TrackedFile to avoid coupling it with Data/DeleteFile. It seemed like an adapter concern.

The tracked file keeps track of the spec ID, but not the spec itself. I saw some v3 code paths that followed a similar pattern of passing around specsById. Pretty much open to changing it if you have suggestions.

I changed the methods to take a specsById map instead.
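For readers following the thread, here is a minimal sketch of the lookup-by-ID pattern discussed above. `Spec` and `Tracked` are hypothetical stand-ins, not the Iceberg classes, and the sketch assumes the file always carries a spec ID; the actual PR code may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Iceberg classes.
final class SpecLookupSketch {
  static final class Spec {
    final int specId;

    Spec(int specId) {
      this.specId = specId;
    }
  }

  static final class Tracked {
    final int specId; // assumption: the tracked file always carries a spec ID here

    Tracked(int specId) {
      this.specId = specId;
    }
  }

  // Resolve the spec from the file's own spec ID, so a caller cannot pass a mismatched spec.
  static Spec resolveSpec(Tracked file, Map<Integer, Spec> specsById) {
    Spec spec = specsById.get(file.specId);
    if (spec == null) {
      throw new IllegalArgumentException("Unknown spec ID: " + file.specId);
    }
    return spec;
  }

  public static void main(String[] args) {
    Map<Integer, Spec> specsById = new HashMap<>();
    specsById.put(3, new Spec(3));
    System.out.println(resolveSpec(new Tracked(3), specsById).specId);
  }
}
```

Because the adapter performs the lookup itself, no separate check that the caller chose the right spec is needed.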

rdblue · 2026-05-04T23:11:26Z

+
+    @Override
+    public Integer sortOrderId() {
+      return null;


Good context from the spec for a comment:

Position deletes are required to be sorted by file and position, not a table order, and should set sort order id to null

Maybe note that this is from the spec.

rdblue · 2026-05-04T23:16:15Z

+    // each reported the full file size).
+    @Override
+    public long fileSizeInBytes() {
+      return dv.sizeInBytes();


I appreciate the comment. The decision here seems reasonable to me since we don't know the total Puffin file size.

We should also consider whether we want to have a field in the tracked_file struct for this. We originally wanted to use file_size_in_bytes for the Puffin file size so that we could determine when DVs should be compacted. But variance in the footer was a problem that prevented it from being used as intended.

@aokolnychyi should we revisit this?

Even if we know the total Puffin file size, dv.sizeInBytes still seems to be the correct value here. A Puffin file may contain thousands of DVs. A logical DeleteFile should only contain one DV, so the DeleteFile size should be the DV size.
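A small arithmetic sketch of why the per-DV size matters when sizes are aggregated across adapters. The numbers and method names are hypothetical and only illustrate the argument above.

```java
// Illustrative arithmetic only; sizes and names are hypothetical.
final class DvSizeSketch {
  // Total seen when each adapted DeleteFile reports its own DV size.
  static long totalByDvSize(long[] dvSizes) {
    long total = 0;
    for (long size : dvSizes) {
      total += size;
    }
    return total;
  }

  // Total seen if every adapted DeleteFile reported the whole Puffin file size instead.
  static long totalByFileSize(long[] dvSizes, long puffinFileSize) {
    return dvSizes.length * puffinFileSize;
  }

  public static void main(String[] args) {
    long[] dvSizes = {100L, 250L, 50L}; // three DVs stored in one Puffin file
    long puffinFileSize = 500L; // DV blobs plus header and footer
    System.out.println(totalByDvSize(dvSizes)); // 400
    System.out.println(totalByFileSize(dvSizes, puffinFileSize)); // 1500
  }
}
```

With the whole-file size, the total grows with the number of DVs sharing the file, which overstates the bytes actually attributable to deletes.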

rdblue · 2026-05-04T23:24:31Z

+
+    @Override
+    public Long firstRowId() {
+      return tracking != null ? tracking.firstRowId() : null;


From the spec:

The value of first_row_id for delete manifests is always null.

The value of first_row_id for delete files is always null.

This should just be null.

I also wonder if there are some implementations that can be refactored out, to avoid duplication between this and the other adapters? It seems like many of these should be the same as for data file and for v2 delete files that haven't been compacted into DVs yet.

This should just be null.

Great callout. Fixed.

I'll add an abstract base class to reduce the duplication.
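A rough sketch of what such a base class could look like, assuming a simplified `Tracking` holder. All class and field names below are hypothetical stand-ins rather than the PR's actual types.

```java
// Hypothetical stand-ins for illustration only; NOT the classes from the PR.
final class BaseAdapterSketch {
  static final class Tracking {
    final Long firstRowId;
    final Long manifestPos;

    Tracking(Long firstRowId, Long manifestPos) {
      this.firstRowId = firstRowId;
      this.manifestPos = manifestPos;
    }
  }

  // Shared delegation lives in the base class; subclasses supply only what differs.
  abstract static class TrackedContentFile {
    protected final Tracking tracking;

    TrackedContentFile(Tracking tracking) {
      this.tracking = tracking;
    }

    Long pos() {
      return tracking != null ? tracking.manifestPos : null;
    }

    abstract Long firstRowId();
  }

  static final class DataAdapter extends TrackedContentFile {
    DataAdapter(Tracking tracking) {
      super(tracking);
    }

    @Override
    Long firstRowId() {
      return tracking != null ? tracking.firstRowId : null;
    }
  }

  static final class DeleteAdapter extends TrackedContentFile {
    DeleteAdapter(Tracking tracking) {
      super(tracking);
    }

    @Override
    Long firstRowId() {
      return null; // per the spec text quoted above: always null for delete files
    }
  }

  public static void main(String[] args) {
    Tracking tracking = new Tracking(10L, 4L);
    System.out.println(new DataAdapter(tracking).firstRowId());
    System.out.println(new DeleteAdapter(tracking).firstRowId());
  }
}
```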

rdblue · 2026-05-04T23:39:15Z

+
+    @Override
+    public Map<Integer, ByteBuffer> lowerBounds() {
+      return TrackedFileAdapters.lowerBounds(file.contentStats());


This is okay, but a little concerning because the helper method here creates a new map every time it is called. Creating the map is an expensive operation because it allocates buffers to hold each column bound and serializes into that buffer. Then the map is thrown away rather than reused.

The evaluators themselves (for example, InclusiveMetricsEvaluator) only call these methods once to evaluate for the data file, but if multiple evaluators are used then we should expect a performance degradation.

On the other hand, we want to have evaluators that work directly with ContentStats instead of going through these methods. I think I'm fine with leaving this as-is for now, but we need to make sure that we use the right evaluators everywhere.

Agree. Will build a new evaluator that operates on content stats, as a followup. cc @nastra

Filed #16218 to track this.

Let's hold off on doing it right now. I think the ContentStats API is going to change.
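To make the cost concern in this thread concrete, here is a sketch of lazy memoization around a map-building helper. It is only one possible approach, is not what the PR does (the thread settles on leaving the code as-is), and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration; the PR itself leaves the uncached helper as-is.
final class LazyBoundsSketch {
  static int conversions = 0; // counts how often the costly conversion runs

  // Stand-in for a helper that allocates and fills a new map on every call.
  static Map<Integer, String> buildBounds() {
    conversions++;
    Map<Integer, String> bounds = new HashMap<>();
    bounds.put(1, "a");
    return bounds;
  }

  // Runs the delegate on first use, then returns the same result. Not thread-safe.
  static final class Memoized<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private T value;
    private boolean computed;

    Memoized(Supplier<T> delegate) {
      this.delegate = delegate;
    }

    @Override
    public T get() {
      if (!computed) {
        value = delegate.get();
        computed = true;
      }
      return value;
    }
  }

  public static void main(String[] args) {
    Memoized<Map<Integer, String>> cached = new Memoized<>(LazyBoundsSketch::buildBounds);
    cached.get();
    cached.get();
    System.out.println(conversions); // 1: the second call reuses the first map
  }
}
```

The trade-off is that the cached map stays reachable for as long as the adapter does, which is why evaluators that read ContentStats directly are the preferred direction in the discussion.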

rdblue · 2026-05-04T23:44:14Z

+    return new TrackedDataFile(file, spec);
+  }
+
+  static DeleteFile asDVDeleteFile(TrackedFile file, PartitionSpec spec) {


We will also need a way to wrap TrackedFile as a v2 DeleteFile for position deletes.

Ack. I will create a followup PR for this so that the size is manageable.

rdblue · 2026-05-05T23:07:52Z

+    private final Tracking tracking;
+    private final PartitionSpec spec;
+
+    private AbstractTrackedContentFile(TrackedFile file, PartitionSpec spec) {


Rather than Abstract, we typically use Base for the prefix. For example, BaseAction or BaseContentStats.

I think we can even drop the Base here with just TrackedContentFile, because ContentFile is a base class. This is also consistent with other classes like TrackedDataFile, TrackedDVDeleteFile.

rdblue · 2026-05-05T23:13:02Z

+
+    @Override
+    public int specId() {
+      return spec.specId();


If the spec ID is set on file then it should be returned.

The last version was this:

return file.specId() != null ? file.specId() : 0;

The problem wasn't that it was returning file.specId(). When that spec ID is set, it is canonical and is the ID that was used to look up spec. The problem was that it was guessing ID 0 for the unpartitioned spec, which is not correct. The updated version should be this:

return file.specId() != null ? file.specId() : spec.specId();
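The fallback rule can be isolated into a tiny sketch. The method and parameter names are hypothetical; only the ternary mirrors the suggestion above.

```java
// Hypothetical helper isolating the suggested fallback rule.
final class SpecIdSketch {
  // The file's spec ID is canonical when set; otherwise use the resolved spec's ID
  // instead of guessing 0.
  static int specId(Integer fileSpecId, int resolvedSpecId) {
    return fileSpecId != null ? fileSpecId : resolvedSpecId;
  }

  public static void main(String[] args) {
    System.out.println(specId(5, 7)); // 5
    System.out.println(specId(null, 7)); // 7
  }
}
```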

rdblue · 2026-05-05T23:17:21Z

+   * <p>Subclasses provide {@code content()}, {@code firstRowId()}, {@code equalityFieldIds()}, and
+   * the copy methods.
+   */
+  private abstract static class AbstractTrackedContentFile<F extends ContentFile<F>>


It would be helpful to implement the version for v2 position deletes here as well. That would make it easier to evaluate the implementations here, although I suspect it will be fine.

rdblue · 2026-05-05T23:20:25Z

+
+    @Override
+    public Long pos() {
+      return tracking != null ? tracking.manifestPos() : null;


I wonder if the methods that delegate to Tracking could be shared through another base class.

rdblue · 2026-05-05T23:20:44Z

+
+    @Override
+    public int specId() {
+      return spec.specId();


Needs to delegate to file.specId() as well.

stevenzwu · 2026-05-05T22:57:46Z

+  private TrackedFileAdapters() {}
+
+  static DataFile asDataFile(TrackedFile file, Map<Integer, PartitionSpec> specsById) {
+    Preconditions.checkState(


nit: checkState is for invariants on internal state; checkArgument is the right call here since file.contentType() is essentially validating the input. Same applies to asDVDeleteFile (the two checkState calls on L54/L58) and asEqualityDeleteFile (L64).
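A self-contained sketch of the distinction. The local `checkArgument` helper mirrors the behavior of Guava's `Preconditions.checkArgument`, which throws `IllegalArgumentException` (while `checkState` throws `IllegalStateException`); the enum and the adapter method are hypothetical stand-ins.

```java
// Hypothetical stand-ins; the local helper mirrors Guava's Preconditions.checkArgument.
final class PreconditionSketch {
  enum ContentType {
    DATA,
    EQUALITY_DELETES
  }

  // Input validation failures surface as IllegalArgumentException.
  static void checkArgument(boolean ok, String message) {
    if (!ok) {
      throw new IllegalArgumentException(message);
    }
  }

  // The content type comes from the caller's input, so a mismatch is an argument error.
  static String asDataFile(ContentType type) {
    checkArgument(type == ContentType.DATA, "Expected DATA content, got " + type);
    return "data-file-adapter";
  }

  public static void main(String[] args) {
    System.out.println(asDataFile(ContentType.DATA));
  }
}
```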

stevenzwu · 2026-05-05T22:57:46Z

+
+    @Override
+    public DeleteFile copy() {
+      return new TrackedDVDeleteFile(file.copy(), spec);


The DV adapter never exposes the underlying TrackedFile's content stats — valueCounts() / lowerBounds() / etc. all return null directly. So file.copy() here retains content stats that will never be read through this adapter.

Using file.copyWithoutStats() would let copy callers drop them at the source and avoid retaining the dead weight when many DVs are held in memory. Same applies to the other copy variants below.
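A sketch of the suggestion, using hypothetical stand-in classes. Only the method names `copy()` and `copyWithoutStats()` are taken from the comment above; the fields and constructors are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the classes from the PR.
final class CopySketch {
  static final class Tracked {
    final String location;
    final Map<Integer, Long> stats; // null once stats have been dropped

    Tracked(String location, Map<Integer, Long> stats) {
      this.location = location;
      this.stats = stats;
    }

    Tracked copy() {
      return new Tracked(location, stats == null ? null : new HashMap<>(stats));
    }

    Tracked copyWithoutStats() {
      return new Tracked(location, null);
    }
  }

  static final class DvAdapter {
    final Tracked file;

    DvAdapter(Tracked file) {
      this.file = file;
    }

    Map<Integer, Long> valueCounts() {
      return null; // the DV adapter never exposes content stats
    }

    // Stats are unreachable through this adapter, so drop them when copying.
    DvAdapter copy() {
      return new DvAdapter(file.copyWithoutStats());
    }
  }

  public static void main(String[] args) {
    Map<Integer, Long> stats = new HashMap<>();
    stats.put(1, 42L);
    DvAdapter copied = new DvAdapter(new Tracked("s3://bucket/dv.puffin", stats)).copy();
    System.out.println(copied.file.stats == null);
  }
}
```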

stevenzwu · 2026-05-05T22:57:46Z

+    tracking.set(3, 11L);
+    tracking.set(5, 1000L);
+    tracking.setManifestLocation("s3://bucket/manifest.avro");
+    tracking.set(8, 7L);


switch to the builder added recently?

stevenzwu · 2026-05-05T23:51:07Z

+    private final Tracking tracking;
+    private final PartitionSpec spec;
+
+    private AbstractTrackedContentFile(TrackedFile file, PartitionSpec spec) {


I think we can even drop the Base here with just TrackedContentFile, because ContentFile is a base class. This is also consistent with other classes like TrackedDataFile, TrackedDVDeleteFile.

stevenzwu · 2026-05-05T23:57:57Z

+    }
+  }
+
+  /** Adapts a TrackedFile EQUALITY_DELETES entry to the {@link DeleteFile} interface. */


Is this class only for EQUALITY_DELETES, as the content() method override indicates? If so, it could be named TrackedEqualityDeleteFile.

Just to confirm my understanding: this is for new v4 manifest files, where v2 position delete file entries won't exist.

stevenzwu · 2026-05-06T00:07:43Z

+    // each reported the full file size).
+    @Override
+    public long fileSizeInBytes() {
+      return dv.sizeInBytes();


Even if we know the total Puffin file size, dv.sizeInBytes still seems to be the correct value here. A Puffin file may contain thousands of DVs. A logical DeleteFile should only contain one DV, so the DeleteFile size should be the DV size.

[core] v4: Add TrackedFileAdapters: bridge TrackedFile to DataFile/De…

63e03ed

…leteFile APIs This adapter would allow to minimize the v4 related code changes during scan planning and commits.

github-actions Bot added the core label Apr 24, 2026

anoopj moved this to In review in V4: metadata tree Apr 24, 2026

anoopj added this to V4: metadata tree Apr 24, 2026

Clean up tests

9864790

anoopj commented Apr 24, 2026


anoopj changed the title ~~[core] v4: Add TrackedFileAdapters to bridge Data/Delete Files~~ Core: Add v4 TrackedFileAdapters to bridge Data/Delete Files Apr 24, 2026

anoopj added 2 commits April 28, 2026 09:47

Change design such that a DV adapted to DeleteFile

dd86280

Make copy safe

f9d8399

anoopj commented Apr 28, 2026


Reorder

1889ca8

stevenzwu reviewed Apr 28, 2026


Comment thread core/src/main/java/org/apache/iceberg/TrackedFileAdapters.java

anoopj requested a review from stevenzwu April 30, 2026 15:10