Core: Add TrackedFileStruct, implementation of v4 TrackedFile#15854
Core: Add TrackedFileStruct, implementation of v4 TrackedFile#15854anoopj wants to merge 23 commits intoapache:mainfrom
Conversation
78a2198 to
f183289
Compare
| private DeletionVectorStruct deletionVector; | ||
| private ManifestInfoStruct manifestInfo; | ||
| private ByteBuffer keyMetadata; | ||
| private List<Long> splitOffsets; |
There was a problem hiding this comment.
Why List here (and below) instead of Array? This is something I've often been a little confused by.
There was a problem hiding this comment.
Because it was the simplest option. The interface deals with List<Long>, so are are just storing it as-is. This was probably not a great idea because we are giving up a lot of memory efficiency by not using an array of primitives. I will fix it.
There was a problem hiding this comment.
Classes like this typically use arrays instead of lists because arrays are compatible with both Java and Kryo serialization. Our best practice is to use arrays.
There was a problem hiding this comment.
Good callout from Russell. Already fixed.
| private TrackingStruct tracking; | ||
| private int contentType; | ||
| private String location; | ||
| private String fileFormat; |
There was a problem hiding this comment.
we probably should use FileFormat enum here
| private FileFormat fileFormat; | ||
| private long recordCount; | ||
| private long fileSizeInBytes; | ||
| private Integer specId; |
There was a problem hiding this comment.
Won't we always have a spec ID?
There was a problem hiding this comment.
I think we will. It's even a primitive in BaseFile: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/BaseFile.java#L62
There was a problem hiding this comment.
If I remember right, our plan was to make spec ID optional. So unpartitioned tables can omit it instead of having an unpartitioned spec. Is that not accurate?
There was a problem hiding this comment.
Yes, you're right. We may still want to return a specific ID in places to avoid API-breaking changes, but these files can have a null ID. Good call.
There was a problem hiding this comment.
There's a discrepancy in the modeling on the original proposal and the addition in the latest tab, where the spec ID in the original proposal was required and in the latest tab it's optional. We can discuss on the doc but since it was brought up here, after a bit of thought I feel like there's a reasonable argument that the spec ID should be required.
The original intent of it being optional was that we're moving away from a model where a given manifest is bound to a particular partition spec. That intent still applies but from a writer requirement perspective, I think it's best to just require writing the unpartitioned spec ID in case there's a manifest with some "mixed" specs. Sure we definitley could just leave it as null, but I wonder if it leads to ambiguity between the "null" case and "unpartitioned" case. It may be better to just have it be explicit.
In other words, it seems reasonable to me to just say "Manifests must have the unpartitioned spec if its entries span 1 or more multiple partition specs"; if there are multiple entries with different specs, it seems reasonable to just say logically that it is unpartitioned.
I'm not as strongly opinionated on this one, I'm mostly just biasing towards simplicity of client expectations on the read side of things. We don't have to worry about "What does it mean if it's null"? And secondarily, all the API changes (but yes that's an implementation concern not a spec concern)
There was a problem hiding this comment.
It feels a bit cleaner to make it optional, rather than force unpartitioned tables to have a partition spec. Happy to change it, but I'll let @rdblue chime in on this first.
There was a problem hiding this comment.
I generally like the idea of null meaning unpartitioned, and I think it fits better with a few issues that we've had. For instance, Puffin files where we store DVs are not associated with a partition spec, the DVs are associated with the partition from the data file. Similarly, equality deletes that apply to the whole table are attached to the unpartitioned spec. There are a few cases where we "default" to the unpartitioned spec, but that spec may not exist in the table and we have to somehow manage it. Using null instead cleans up some of these issues by giving us an option for "not partitioned".
I'd lean into using null for this purpose, unless there are known problems with that approach.
| } | ||
|
|
||
| @Override | ||
| public String manifestLocation() { |
There was a problem hiding this comment.
Should these be handled by Tracking rather than TrackedFile? They are tracking metadata that are not part of the data, so it seems like that is a more consistent place for them.
There was a problem hiding this comment.
That makes sense. It requires changes to the interfaces and the design doc. Do you mind if I do this as a followup PR?
There was a problem hiding this comment.
I think that's fine, but we should either throw an exception (to ensure we follow up) or delegate to tracking. I think it's probably fine to leave the method here for convenience. If we do, it should delegate to Tracking and return null if tracking is not set.
There was a problem hiding this comment.
OK - returning null for now if tracking is not set. Will fix it in a followup PR when we move it to tracking
| private static final EntryStatus[] STATUS_VALUES = EntryStatus.values(); | ||
|
|
||
| private EntryStatus status = EntryStatus.EXISTING; | ||
| private Long snapshotId = null; |
There was a problem hiding this comment.
Note to me so I don't have to think through this again: these values may be null in a file because they are inherited when set to null.
…ckedFile Implements TrackedFile and its nested interfaces (Tracking, DeletionVector, ManifestInfo) as mutable structs.
d0463cd to
a4cf477
Compare
Implements
TrackedFileand its nested interfaces (Tracking,DeletionVector,ManifestInfo) as mutable structs.This is a followup of #15049
Design doc: s.apache.org/iceberg-single-file-commit
Prototype PR: #14533