Core: Add TrackedFileStruct, implementation of v4 TrackedFile #15854

Open
anoopj wants to merge 23 commits into apache:main from anoopj:v4-tracked-file-struct

Conversation

@anoopj
Contributor

@anoopj anoopj commented Apr 1, 2026

Implements TrackedFile and its nested interfaces (Tracking, DeletionVector, ManifestInfo) as mutable structs.

This is a followup of #15049
Design doc: s.apache.org/iceberg-single-file-commit
Prototype PR: #14533

@github-actions github-actions bot added the core label Apr 1, 2026
@anoopj anoopj marked this pull request as draft April 1, 2026 18:12
@anoopj anoopj force-pushed the v4-tracked-file-struct branch from 78a2198 to f183289 April 1, 2026 18:15
@anoopj anoopj marked this pull request as ready for review April 1, 2026 18:50
@anoopj anoopj changed the title from "Add TrackedFileStruct: mutable StructLike implementation of v4 TrackedFile" to "Core: Add TrackedFileStruct, implementation of v4 TrackedFile" Apr 2, 2026
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java Outdated
@anoopj anoopj requested a review from rdblue April 3, 2026 18:30
@anoopj anoopj moved this to In progress in V4: metadata tree Apr 6, 2026
@anoopj anoopj moved this to In review in V4: metadata tree Apr 6, 2026
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
private DeletionVectorStruct deletionVector;
private ManifestInfoStruct manifestInfo;
private ByteBuffer keyMetadata;
private List<Long> splitOffsets;
Member

Why List here (and below) instead of Array? This is something I've often been a little confused by.

Contributor Author

Because it was the simplest option. The interface deals with List<Long>, so we are just storing it as-is. This was probably not a great idea because we are giving up a lot of memory efficiency by not using an array of primitives. I will fix it.

Contributor

Classes like this typically use arrays instead of lists because arrays are compatible with both Java and Kryo serialization. Our best practice is to use arrays.

Contributor Author

@anoopj anoopj Apr 16, 2026

Good callout from Russell. Already fixed.
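
The array-backed storage discussed in this thread could look roughly like the sketch below. It is not the code from the PR: the class name SplitOffsetsHolder and its methods are hypothetical, and it only assumes, as stated above, that the interface exposes List<Long> while the struct keeps a primitive long[] internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder, not the PR's TrackedFileStruct: offsets are stored as a
// primitive array and only boxed when a caller asks for a List<Long>.
class SplitOffsetsHolder {
  private long[] splitOffsets = null;

  // Copies the boxed values into a compact primitive array.
  void setSplitOffsets(List<Long> offsets) {
    if (offsets == null) {
      this.splitOffsets = null;
      return;
    }
    long[] copy = new long[offsets.size()];
    for (int i = 0; i < copy.length; i++) {
      copy[i] = offsets.get(i);
    }
    this.splitOffsets = copy;
  }

  // Returns a read-only boxed view for interface callers, or null if unset.
  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null;
    }
    List<Long> boxed = new ArrayList<>(splitOffsets.length);
    for (long offset : splitOffsets) {
      boxed.add(offset);
    }
    return Collections.unmodifiableList(boxed);
  }
}
```

The trade-off in this sketch is that each read of the list allocates boxed values, while the long-lived field stays compact and is an ordinary serializable array.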

Contributor

@stevenzwu stevenzwu left a comment

some initial comments

private TrackingStruct tracking;
private int contentType;
private String location;
private String fileFormat;
Contributor

we probably should use FileFormat enum here

Contributor Author

Done
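
A minimal sketch of the enum-typed field suggested here, under stated assumptions: SketchFileFormat is a stand-in for Iceberg's FileFormat enum so the example compiles on its own, FormatField is a hypothetical class, and the idea that the value crosses the read/write boundary as a lowercase string is an assumption rather than something this thread confirms.

```java
import java.util.Locale;

// Stand-in for Iceberg's FileFormat enum (assumption: only used for this sketch).
enum SketchFileFormat {
  AVRO,
  ORC,
  PARQUET,
  PUFFIN;

  // Case-insensitive lookup; throws IllegalArgumentException for unknown names.
  static SketchFileFormat fromString(String name) {
    return valueOf(name.toUpperCase(Locale.ENGLISH));
  }
}

// Hypothetical field wrapper: typed in memory, string only at the boundary.
class FormatField {
  private SketchFileFormat fileFormat = null;

  // Converts a raw string value into the enum once, when the field is set.
  void setFromString(String value) {
    this.fileFormat = value == null ? null : SketchFileFormat.fromString(value);
  }

  SketchFileFormat fileFormat() {
    return fileFormat;
  }

  // Converts back to a lowercase string for whatever serializes the field.
  String asString() {
    return fileFormat == null ? null : fileFormat.name().toLowerCase(Locale.ENGLISH);
  }
}
```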

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
private FileFormat fileFormat;
private long recordCount;
private long fileSizeInBytes;
private Integer specId;
Contributor

Won't we always have a spec ID?

Contributor Author

If I remember right, our plan was to make spec ID optional. So unpartitioned tables can omit it instead of having an unpartitioned spec. Is that not accurate?

Contributor

Yes, you're right. We may still want to return a specific ID in places to avoid API-breaking changes, but these files can have a null ID. Good call.

Contributor

@amogh-jahagirdar amogh-jahagirdar Apr 14, 2026

There's a discrepancy in the modeling on the original proposal and the addition in the latest tab, where the spec ID in the original proposal was required and in the latest tab it's optional. We can discuss on the doc but since it was brought up here, after a bit of thought I feel like there's a reasonable argument that the spec ID should be required.

The original intent of it being optional was that we're moving away from a model where a given manifest is bound to a particular partition spec. That intent still applies, but from a writer requirement perspective, I think it's best to just require writing the unpartitioned spec ID in case there's a manifest with some "mixed" specs. Sure, we definitely could just leave it as null, but I wonder if it leads to ambiguity between the "null" case and the "unpartitioned" case. It may be better to just have it be explicit.

In other words, it seems reasonable to me to just say "Manifests must have the unpartitioned spec if their entries span multiple partition specs"; if there are multiple entries with different specs, it seems reasonable to just say logically that it is unpartitioned.

I'm not as strongly opinionated on this one, I'm mostly just biasing towards simplicity of client expectations on the read side of things. We don't have to worry about "What does it mean if it's null"? And secondarily, all the API changes (but yes that's an implementation concern not a spec concern)

Contributor Author

It feels a bit cleaner to make it optional, rather than force unpartitioned tables to have a partition spec. Happy to change it, but I'll let @rdblue chime in on this first.

Contributor

I generally like the idea of null meaning unpartitioned, and I think it fits better with a few issues that we've had. For instance, Puffin files where we store DVs are not associated with a partition spec, the DVs are associated with the partition from the data file. Similarly, equality deletes that apply to the whole table are attached to the unpartitioned spec. There are a few cases where we "default" to the unpartitioned spec, but that spec may not exist in the table and we have to somehow manage it. Using null instead cleans up some of these issues by giving us an option for "not partitioned".

I'd lean into using null for this purpose, unless there are known problems with that approach.
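
The "null means not partitioned" option favored in this thread can be sketched as below. This is an illustration only, not code from the PR: SpecIdSketch, hasSpec, and specIdOrDefault are hypothetical names, and the fallback parameter is an assumption modeling the earlier remark that some APIs may still need to return a specific ID to avoid breaking changes.

```java
// Hypothetical struct fragment: a nullable spec ID where null means "no partition spec".
class SpecIdSketch {
  private Integer specId = null;

  void setSpecId(Integer newSpecId) {
    this.specId = newSpecId;
  }

  // Nullable accessor: null signals that the file is not tied to any spec.
  Integer specId() {
    return specId;
  }

  boolean hasSpec() {
    return specId != null;
  }

  // For callers that need a primitive ID; the fallback value is chosen by the caller.
  int specIdOrDefault(int fallbackSpecId) {
    return specId != null ? specId : fallbackSpecId;
  }
}
```

With this shape, readers never have to look up a spec that may not exist in the table; they check hasSpec() first and only resolve a spec when one is actually recorded.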

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
}

@Override
public String manifestLocation() {
Contributor

Should these be handled by Tracking rather than TrackedFile? They are tracking metadata that are not part of the data, so it seems like that is a more consistent place for them.

Contributor Author

That makes sense. It requires changes to the interfaces and the design doc. Do you mind if I do this as a followup PR?

Contributor

I think that's fine, but we should either throw an exception (to ensure we follow up) or delegate to tracking. I think it's probably fine to leave the method here for convenience. If we do, it should delegate to Tracking and return null if tracking is not set.

Contributor Author

OK - returning null for now if tracking is not set. Will fix it in a followup PR when we move it to tracking.
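
The delegation agreed on in this thread might look like the following sketch. TrackingSketch and TrackedFileSketch are hypothetical stand-ins for the PR's classes, and the constructor and setter shown here are assumptions made so the example is self-contained.

```java
// Hypothetical stand-in for the tracking struct; it only carries the manifest location.
class TrackingSketch {
  private final String manifestLocation;

  TrackingSketch(String manifestLocation) {
    this.manifestLocation = manifestLocation;
  }

  String manifestLocation() {
    return manifestLocation;
  }
}

// Hypothetical stand-in for the tracked file struct.
class TrackedFileSketch {
  private TrackingSketch tracking = null;

  void setTracking(TrackingSketch newTracking) {
    this.tracking = newTracking;
  }

  // Convenience accessor: delegates to tracking, or returns null when tracking is not set.
  String manifestLocation() {
    return tracking != null ? tracking.manifestLocation() : null;
  }
}
```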

Comment thread core/src/main/java/org/apache/iceberg/TrackingStruct.java Outdated
private static final EntryStatus[] STATUS_VALUES = EntryStatus.values();

private EntryStatus status = EntryStatus.EXISTING;
private Long snapshotId = null;
Contributor

Note to me so I don't have to think through this again: these values may be null in a file because they are inherited when set to null.
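
As a rough illustration of the inheritance described in this note: a null field is filled in from an enclosing context when the entry is read, and an explicitly written value is left alone. InheritedTrackingSketch and inheritFrom are hypothetical names, and where the inherited snapshot ID comes from is an assumption, since this thread does not say.

```java
// Hypothetical tracking fragment with a nullable, inheritable snapshot ID.
class InheritedTrackingSketch {
  private Long snapshotId = null;

  void setSnapshotId(Long newSnapshotId) {
    this.snapshotId = newSnapshotId;
  }

  Long snapshotId() {
    return snapshotId;
  }

  // Fills in the snapshot ID only when the file left it null; explicit values win.
  void inheritFrom(long inheritedSnapshotId) {
    if (snapshotId == null) {
      snapshotId = inheritedSnapshotId;
    }
  }
}
```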

Comment thread core/src/main/java/org/apache/iceberg/TrackingStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/DeletionVectorStruct.java Outdated
@anoopj anoopj force-pushed the v4-tracked-file-struct branch from d0463cd to a4cf477 April 16, 2026 01:15


5 participants