Core: Add TrackedFileStruct, implementation of v4 TrackedFile by anoopj · Pull Request #15854 · apache/iceberg

anoopj · 2026-04-01T18:10:33Z

Implements TrackedFile and its nested interfaces (Tracking, DeletionVector, ManifestInfo) as mutable structs.

This is a followup of #15049
Design doc: s.apache.org/iceberg-single-file-commit
Prototype PR: #14533

RussellSpitzer · 2026-04-06T18:32:02Z

+  private DeletionVectorStruct deletionVector;
+  private ManifestInfoStruct manifestInfo;
+  private ByteBuffer keyMetadata;
+  private List<Long> splitOffsets;


Why List here (and below) instead of Array? This is something I've often been a little confused by.

Because it was the simplest option. The interface deals with List<Long>, so we are just storing it as-is. This was probably not a great idea because we are giving up a lot of memory efficiency by not using an array of primitives. I will fix it.

Classes like this typically use arrays instead of lists because arrays are compatible with both Java and Kryo serialization. Our best practice is to use arrays.

Good callout from Russell. Already fixed.
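
For readers following along, here is a minimal sketch of the array-backed approach discussed above. It assumes a hypothetical holder class (`SplitOffsetsHolder` is not a name from the PR) and only illustrates the idea: store a `long[]` internally and expose a read-only `List<Long>` view so the interface can keep returning a list.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical holder, not the PR's actual class. It shows primitive-array
// storage behind a List<Long> accessor.
final class SplitOffsetsHolder {
  // A long[] avoids one boxed Long object per offset and is a plain
  // serializable field type.
  private long[] splitOffsets = null;

  void setSplitOffsets(List<Long> offsets) {
    if (offsets == null) {
      this.splitOffsets = null;
      return;
    }
    long[] copy = new long[offsets.size()];
    for (int i = 0; i < copy.length; i++) {
      copy[i] = offsets.get(i);
    }
    this.splitOffsets = copy;
  }

  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null;
    }
    final long[] snapshot = splitOffsets;
    // AbstractList is read-only unless set/add are overridden, so callers
    // cannot mutate the backing array through this view.
    return new AbstractList<Long>() {
      @Override
      public Long get(int index) {
        return snapshot[index];
      }

      @Override
      public int size() {
        return snapshot.length;
      }
    };
  }
}
```

The view boxes values only when they are read, so the stored representation stays compact.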

stevenzwu

some initial comments

stevenzwu · 2026-04-06T21:33:15Z

+  private TrackingStruct tracking;
+  private int contentType;
+  private String location;
+  private String fileFormat;


we probably should use FileFormat enum here

rdblue · 2026-04-07T23:25:24Z

+  private FileFormat fileFormat;
+  private long recordCount;
+  private long fileSizeInBytes;
+  private Integer specId;


Won't we always have a spec ID?

I think we will. It's even a primitive in BaseFile: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/BaseFile.java#L62

If I remember right, our plan was to make spec ID optional. So unpartitioned tables can omit it instead of having an unpartitioned spec. Is that not accurate?

Yes, you're right. We may still want to return a specific ID in places to avoid API-breaking changes, but these files can have a null ID. Good call.

There's a discrepancy in the modeling between the original proposal and the addition in the latest tab: the spec ID was required in the original proposal, and in the latest tab it's optional. We can discuss on the doc, but since it was brought up here, after a bit of thought I feel like there's a reasonable argument that the spec ID should be required.

The original intent of it being optional was that we're moving away from a model where a given manifest is bound to a particular partition spec. That intent still applies, but from a writer requirement perspective, I think it's best to just require writing the unpartitioned spec ID in case there's a manifest with some "mixed" specs. Sure, we definitely could just leave it as null, but I wonder if it leads to ambiguity between the "null" case and the "unpartitioned" case. It may be better to just have it be explicit.

In other words, it seems reasonable to me to just say "Manifests must have the unpartitioned spec if their entries span multiple partition specs"; if there are multiple entries with different specs, it seems reasonable to say that, logically, the manifest is unpartitioned.

I'm not as strongly opinionated on this one, I'm mostly just biasing towards simplicity of client expectations on the read side of things. We don't have to worry about "What does it mean if it's null"? And secondarily, all the API changes (but yes that's an implementation concern not a spec concern)

It feels a bit cleaner to make it optional, rather than force unpartitioned tables to have a partition spec. Happy to change it, but I'll let @rdblue chime in on this first.

I generally like the idea of null meaning unpartitioned, and I think it fits better with a few issues that we've had. For instance, Puffin files where we store DVs are not associated with a partition spec, the DVs are associated with the partition from the data file. Similarly, equality deletes that apply to the whole table are attached to the unpartitioned spec. There are a few cases where we "default" to the unpartitioned spec, but that spec may not exist in the table and we have to somehow manage it. Using null instead cleans up some of these issues by giving us an option for "not partitioned".

I'd lean into using null for this purpose, unless there are known problems with that approach.
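
To make the "null means not partitioned" option concrete, here is a small sketch. Everything in it is an assumption for illustration: `SpecIdSketch`, `hasSpec`, and `specIdOrDefault` are hypothetical names, and the fallback accessor is just one possible way to keep returning a specific ID where an API needs a primitive.

```java
// Hypothetical sketch; class and method names are illustrative, not from the PR.
final class SpecIdSketch {
  // Boxed so that null can mean "not associated with any partition spec".
  private final Integer specId;

  SpecIdSketch(Integer specId) {
    this.specId = specId;
  }

  // May return null when the file is not tied to a spec.
  Integer specId() {
    return specId;
  }

  boolean hasSpec() {
    return specId != null;
  }

  // Assumed compatibility shim: callers that require a primitive int supply
  // the ID to use when none was recorded.
  int specIdOrDefault(int fallbackSpecId) {
    return specId != null ? specId : fallbackSpecId;
  }
}
```

With this shape, readers can distinguish "no spec recorded" from "a specific spec ID" without the table having to carry a placeholder unpartitioned spec.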

rdblue · 2026-04-07T23:59:55Z

+  }
+
+  @Override
+  public String manifestLocation() {


Should these be handled by Tracking rather than TrackedFile? They are tracking metadata that are not part of the data, so it seems like that is a more consistent place for them.

That makes sense. It requires changes to the interfaces and the design doc. Do you mind if I do this as a followup PR?

I think that's fine, but we should either throw an exception (to ensure we follow up) or delegate to tracking. I think it's probably fine to leave the method here for convenience. If we do, it should delegate to Tracking and return null if tracking is not set.

OK - returning null for now if tracking is not set. Will fix it in a followup PR when we move it to tracking
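
A minimal sketch of the delegation agreed on above, assuming simplified stand-in classes (`TrackingSketch` and `TrackedFileSketch` are hypothetical, not the PR's types): the convenience method stays on the file, forwards to tracking, and returns null when tracking is not set.

```java
// Hypothetical stand-in for the tracking struct; only the field needed here.
final class TrackingSketch {
  private final String manifestLocation;

  TrackingSketch(String manifestLocation) {
    this.manifestLocation = manifestLocation;
  }

  String manifestLocation() {
    return manifestLocation;
  }
}

// Hypothetical stand-in for the tracked file struct.
final class TrackedFileSketch {
  private TrackingSketch tracking = null;

  void setTracking(TrackingSketch tracking) {
    this.tracking = tracking;
  }

  // Convenience accessor: delegates to tracking, or returns null when
  // tracking has not been set.
  String manifestLocation() {
    return tracking != null ? tracking.manifestLocation() : null;
  }
}
```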

rdblue · 2026-04-14T20:58:36Z

+  private static final EntryStatus[] STATUS_VALUES = EntryStatus.values();
+
+  private EntryStatus status = EntryStatus.EXISTING;
+  private Long snapshotId = null;


Note to me so I don't have to think through this again: these values may be null in a file because they are inherited when set to null.
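
As a reminder of what "inherited when set to null" can look like in code, here is a hedged sketch. It assumes the inherited value is supplied by whatever reads the file (the thread does not say where it comes from), and `InheritedTrackingSketch` and `inheritFrom` are hypothetical names.

```java
// Hypothetical sketch of null-means-inherited; names are illustrative,
// not the PR's API.
final class InheritedTrackingSketch {
  // null as stored in the file means "inherit this value".
  private Long snapshotId;

  InheritedTrackingSketch(Long snapshotId) {
    this.snapshotId = snapshotId;
  }

  Long snapshotId() {
    return snapshotId;
  }

  // Fills in the snapshot ID only when the stored value is null; an
  // explicitly written value is left untouched.
  void inheritFrom(long inheritedSnapshotId) {
    if (snapshotId == null) {
      snapshotId = inheritedSnapshotId;
    }
  }
}
```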

github-actions bot added the core label Apr 1, 2026

anoopj marked this pull request as draft April 1, 2026 18:12

anoopj force-pushed the v4-tracked-file-struct branch from 78a2198 to f183289 Compare April 1, 2026 18:15

anoopj marked this pull request as ready for review April 1, 2026 18:50

anoopj changed the title ~~Add TrackedFileStruct: mutable StructLike implementation of v4 TrackedFile~~ Core: Add TrackedFileStruct, implementation of v4 TrackedFile Apr 2, 2026

rdblue reviewed Apr 3, 2026

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java Outdated

anoopj requested a review from rdblue April 3, 2026 18:30

anoopj added this to V4: metadata tree Apr 6, 2026

anoopj moved this to In progress in V4: metadata tree Apr 6, 2026

anoopj removed this from V4: metadata tree Apr 6, 2026

anoopj moved this to In review in V4: metadata tree Apr 6, 2026

anoopj added this to V4: metadata tree Apr 6, 2026

RussellSpitzer reviewed Apr 6, 2026

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated

RussellSpitzer reviewed Apr 6, 2026

stevenzwu reviewed Apr 6, 2026

anoopj requested review from RussellSpitzer and stevenzwu April 6, 2026 23:58