Spec: Clarify identity partition edge cases. #10835
Conversation
CC @rdblue @RussellSpitzer: an alternative (or perhaps a cleanup) would be to make column projection a subject of scan planning. I think this would flow more nicely, since all of the concepts would have been introduced already (instead of having to forward-link).
Co-authored-by: Ajantha Bhat <ajanthabhat@gmail.com>
Thanks @emkornfield for taking this up.
This is super helpful in streamlining what readers of Hive tables migrated to Iceberg need to do, since those data files don't persist the partition columns themselves; the Hive catalog tracks them.
Previously one had to read the iceberg-core code to understand all the fallback handling, which is easily missed when writing an implementation in a different language. I ran into this when adding Iceberg support for Redshift and had to go through the same exercise.
format/spec.md
* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on the `data_file` object in the manifest.
* Use `schema.name-mapping.default` metadata to map field id to columns without field id as described below and use the column if it is present.
* Return the default value if one is defined in `initial-default` (see the [Default values](#default-values) section for more details).
* Return `null` in all other cases.
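The fallback chain above can be sketched in Python. This is a hypothetical illustration, not the reference implementation; `Field`, `partition_values`, `name_mapping`, and `file_columns` are invented stand-in structures.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Field:
    id: int
    has_identity_transform: bool = False   # is this an identity partition source field?
    initial_default: Optional[Any] = None  # v3 `initial-default`, if any

def resolve_missing_field(field, partition_values, name_mapping, file_columns):
    """Resolve a field id that is not present in a data file."""
    # 1. Identity transform: take the value from the manifest's `partition` struct
    if field.has_identity_transform and field.id in partition_values:
        return partition_values[field.id]
    # 2. Name mapping: locate the column by mapped name when the file lacks field ids
    mapped_name = name_mapping.get(field.id)
    if mapped_name is not None and mapped_name in file_columns:
        return file_columns[mapped_name]
    # 3. Default value: use `initial-default` when one is defined
    if field.initial_default is not None:
        return field.initial_default
    # 4. Otherwise the value is null
    return None
```

Note that the order matters: partition metadata wins over name mapping, and `initial-default` is only consulted when neither resolves the field.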
[minor] should we move these lines after we explain what `schema.name-mapping.default` is?
I referenced that it is described below. I placed it here because I moved text from the following paragraph, and it highlights that the overall logic is non-trivial.
Thanks for clearing this up 👍
Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids. If a field id is missing from a data file, its value for each row should be `null`.
Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids.

Values for field ids which are not present in a data file must be resolved according to the following rules:
Above says "All columns must be written to data file".
Here it says "field ids which are not present in a data file ..."
Besides schema evolution (new columns) case, and multiple partitionings in use, when can the field ids be missing from the data file?
When files are added from Hive tables or other places. If you're writing to an Iceberg table we have stricter standards than if you are adding data files from a converted table.
In the past, we didn't document the expectations for readers, but this is definitely a case where we want to be clear about the requirements across engines.
When migrating from other table formats that don't write this information into data files. Hive is the primary example.
I get it -- files migrated from hive lack some information and we need to deal with it.
If such "partial data files" are legal, why do we expect writers not to create them? The readers still need to support them, so what's the win (from format perspective)?
The argument here is that, especially with default values, there is more room for misinterpretation. I don't have a strong feeling here.
Files are the source of truth for row data. It's convenient to have a copy of identity columns in metadata because we can avoid projecting and doing extra work. But you should still get a full copy of the row as it was written if you read just the file.
format/spec.md
Values for field ids which are not present in a data file must be resolved according to the following rules:

* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on the `data_file` object in the manifest.
If a new column projection is defined (e.g. `day(order_time)`), can the query engine derive the value for that field from the `order_time` column?
Transforms are one way, with the exception of `identity`, which does not modify the value. No other partition transforms support recovering the original value.
If you have a case where the engine is deriving a value that is stored in metadata (which I think is what you're asking) then you can use the metadata value as long as you know that the transform the engine is projecting exactly matches the Iceberg transform.
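To illustrate why only `identity` is recoverable, here is a hypothetical Python sketch of the `day` transform, which the spec defines as the count of days from 1970-01-01. Many distinct timestamps map to one partition value, so the original timestamp cannot be reconstructed from it.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)

def day_transform(ts: datetime) -> int:
    # Iceberg `day` transform: days from the Unix epoch (lossy, not invertible)
    return (ts.date() - EPOCH).days

def identity(value):
    # Identity transform: output equals input, so the source value is recoverable
    return value
```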
I'm not sure I fully understand the question, but if `order_time` is of type `date`, then in this special case the transform is effectively an identity transform.
I was thinking about a situation like this (using Trino syntax):

```sql
-- create a table with some data
CREATE TABLE t AS SELECT 123 AS a;
-- add new partitioning column
ALTER TABLE t SET PROPERTIES partitioning = ARRAY['truncate(a, 10)'];
```

Now we have two fields: the `a` data column and the `a_trunc` projected column. Trino doesn't provide a way to query the `a_trunc` column directly. However, if it did, the value for `a_trunc` could be derived from the data.
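For the example above, an engine that exposed `a_trunc` could in principle compute it from `a` with the integer truncate transform. A minimal hypothetical sketch; Iceberg defines integer truncate as `v - (v % W)` where the remainder is non-negative (Python's floored modulo already behaves this way for positive widths).

```python
def truncate_int(value: int, width: int) -> int:
    # Iceberg integer truncate transform: v - (v % W), remainder always >= 0
    return value - (value % width)
```

So for the row created above (`a = 123`) with `truncate(a, 10)`, the derived partition value is `truncate_int(123, 10)`.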
It seems for this case we probably want a subsection of the specification or an implementation note for projecting partition values? Maybe that would be a good follow-up (I'm not sure which engines actually allow projecting partition values since they aren't technically part of the schema.)
Yes, agreed that partition columns are not part of the schema, so don't show up in table's relational model.
However, the document here talks about reading "field ids" and partition columns are also fields with ids. Or is this section supposed to be only about fields that are plain data fields?
I read it as, at least initially, only about plain data, since transformed partition values would never actually be in the data file, and that is what this topic currently discusses. I think adding a subsection about projecting partition values to this section as a follow-up could make sense, if necessary.
This is allowed, particularly for filters where you can use Iceberg's expression library to filter by partition values directly. You can also use partition values in aggregations (which is under development).
I don't think that changes what is said here. In this case, the value from metadata should be used if it is present.
> the value from metadata should be used if it is present.

Makes sense, agreed. But if it's not present, why require returning `null`? We could require the query engine to compute the value based on the source columns present in the data file.
@findepi, the spec originally stated that all columns that were not present in a file must be interpreted as `null`, because the file is the source of truth for row data: if a column was not present, then it must not have existed at the time the file was written, and it must default to `null` because that was the only possible default for new columns.
This is how identity partition columns were always required to be written because omitting them would create a conflict between the partition value in metadata and the source of truth in the file. Now, the requirement to write all columns is more explicitly stated above.
Over time, we added support for reading Hive files that were missing the identity partitioned columns. This now documents how to do that in the read path but we didn't drop the requirement to write all columns in the write path.
In addition, we've now added column defaults for v3 and didn't realize that this section conflicted with the `initial-default` behavior.
This is just stating more clearly what the expected behavior is and I think it's a good change because it is clear.
add back reasoning for partition columns.
Could a committer merge?
Discussion on mailing list: https://lists.apache.org/thread/hss83r1605r8932b94xv9y2wfb9o0yns.