Spec: Clarify identity partition edge cases. #10835

emkornfield · 2024-08-01T06:05:45Z

Discussion on mailing list: https://lists.apache.org/thread/hss83r1605r8932b94xv9y2wfb9o0yns.

emkornfield · 2024-08-01T06:09:47Z

CC @rdblue @RussellSpitzer: an alternative (or perhaps a cleanup) would be to move column projection under scan planning. I think this would flow more nicely, since all of the concepts would already have been introduced (instead of having to forward-link).

format/spec.md

Co-authored-by: Ajantha Bhat <ajanthabhat@gmail.com>

singhpk234

Thanks @emkornfield for taking this up.

This is super helpful in streamlining what Hive tables migrated to Iceberg need to do, since their data files didn't persist the partition column in the file itself (the Hive catalog tracks it).

Until now, one had to read the code to understand how iceberg-core handles this with all of its fallbacks, which is easy to miss when writing an implementation in a different language. I came across it when I was adding Iceberg support for Redshift and had to go through a similar exercise.

singhpk234 · 2024-08-02T16:37:44Z

format/spec.md

+* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on `data_file` object in the manifest. 
+* Use `schema.name-mapping.default` metadata to map field id to columns without field id as described below and use the column if it is present.
+* Return the default value if it has a defined in `initial-default` (See [Default values](#default-values) section for more details). 
+* Return `null` in all other cases.


[minor] Should we move these lines after we explain what `schema.name-mapping.default` is?

I noted that it is described below. I placed it here because I moved the text from the following paragraph, and it highlights that the overall logic is non-trivial.
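For readers implementing this outside iceberg-core, the four rules quoted in the diff above can be read as an ordered fallback chain. The following is only a minimal sketch under stated assumptions, not the iceberg-core implementation: the function name, its parameters, and the dictionary-based inputs are all hypothetical, and a single row value stands in for a whole column.

```python
from typing import Any, Dict, List


def resolve_missing_field(
    field_id: int,
    identity_partition_values: Dict[int, Any],  # hypothetical: source field id -> value from data_file.partition
    name_mapping: Dict[int, List[str]],         # hypothetical: field id -> names from schema.name-mapping.default
    file_columns_by_name: Dict[str, Any],       # hypothetical: columns found in a file that has no field ids
    initial_defaults: Dict[int, Any],           # hypothetical: field id -> initial-default
) -> Any:
    """Resolve a value for a field id that is not present in the data file."""
    # Rule 1: an identity transform exists and the partition value is in the manifest.
    if field_id in identity_partition_values:
        return identity_partition_values[field_id]

    # Rule 2: fall back to the name mapping and use the column if the file has it.
    for name in name_mapping.get(field_id, []):
        if name in file_columns_by_name:
            return file_columns_by_name[name]

    # Rule 3: use the field's initial-default if one is defined.
    if field_id in initial_defaults:
        return initial_defaults[field_id]

    # Rule 4: null in all other cases.
    return None
```

The order matters: a field with both an identity partition value and an `initial-default` resolves to the partition value in this sketch, because the partition rule is checked first.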

Fokko

Thanks for clearing this up 👍

findepi · 2024-08-02T20:38:05Z

format/spec.md

-Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids. If a field id is missing from a data file, its value for each row should be `null`.
+Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids.
+
+Values for field ids which are not present in a data file must be resolved according the following rules:


Above says "All columns must be written to data file".
Here it says "field ids which are not present in a data file ..."
Besides the schema evolution (new columns) case and multiple partitionings in use, when can field ids be missing from the data file?

When files are added from Hive tables or other places. If you're writing to an Iceberg table we have stricter standards than if you are adding data files from a converted table.

In the past, we didn't document the expectations for readers, but this is definitely a case where we want to be clear about the requirements across engines.

When migrating from other table formats that don't write this information into data files. Hive is the primary example.

I get it -- files migrated from Hive lack some information and we need to deal with it.
If such "partial data files" are legal, why do we expect writers not to create them? Readers still need to support them, so what's the win (from a format perspective)?

The argument here is that, especially with default values, there is more room for misinterpretation. I don't have a strong feeling here.

Files are the source of truth for row data. It's convenient to have a copy of identity columns in metadata because we can avoid projecting and doing extra work. But you should still get a full copy of the row as it was written if you read just the file.

findepi · 2024-08-02T20:39:34Z

format/spec.md

+
+Values for field ids which are not present in a data file must be resolved according the following rules:
+
+* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on `data_file` object in the manifest. 


If a new column projection is defined (e.g. `day(order_time)`), can the query engine derive the value for that field from the `order_time` column?

Transforms are one-way, with the exception of identity, which does not modify the value. No other partition transform supports recovering the original value.

If you have a case where the engine is deriving a value that is stored in metadata (which I think is what you're asking) then you can use the metadata value as long as you know that the transform the engine is projecting exactly matches the Iceberg transform.

I'm not sure I fully understand the question, but if `order_time` is of type date, then in this special case the transform is effectively an identity transform.

I was thinking about a situation like this (using Trino syntax):

    -- create a table with some data
    CREATE TABLE t AS SELECT 123 AS a;
    -- add new partitioning column
    ALTER TABLE t SET PROPERTIES partitioning = ARRAY['truncate(a, 10)'];

Now we have two fields: the `a` data column and the `a_trunc` projected column.
Trino doesn't provide a way to query the `a_trunc` column directly.
However, if it did, the value for `a_trunc` could be derived from the data.
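To make the direction of that derivation concrete: the Iceberg spec defines integer truncation to width `W` as `v - (v % W)` with a non-negative remainder, so `a_trunc` can be computed from `a`, but not the other way around. A minimal sketch follows; the helper name `truncate_int` is hypothetical and this is not taken from any engine's code.

```python
def truncate_int(value: int, width: int) -> int:
    """Truncate an integer to a multiple of width, as in truncate(a, 10)."""
    # Python's % already yields a non-negative remainder for a positive width,
    # which matches the positive-remainder requirement for negative inputs.
    return value - (value % width)


a = 123
a_trunc = truncate_int(a, 10)  # derived from the source column value

# Every value in 120..129 maps to the same partition value, so the original
# value of `a` cannot be recovered from `a_trunc` alone.
same_bucket = {truncate_int(v, 10) for v in range(120, 130)}
```

Here `a_trunc` is 120 and `same_bucket` contains only 120, which is why identity is the only transform whose metadata value can stand in for the source column.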

It seems for this case we probably want a subsection of the specification or an implementation note for projecting partition values? Maybe that would be a good follow-up (I'm not sure which engines actually allow projecting partition values since they aren't technically part of the schema.)

Yes, agreed that partition columns are not part of the schema, so they don't show up in the table's relational model.
However, the document here talks about reading "field ids" and partition columns are also fields with ids. Or is this section supposed to be only about fields that are plain data fields?

I read it as being, at least initially, only about plain data fields, since transformed partition values would never actually be in the data file, and that is what this section currently discusses. I think adding a subsection about projecting partition values to this section, if necessary, could make sense as a follow-up.

This is allowed, particularly for filters where you can use Iceberg's expression library to filter by partition values directly. You can also use partition values in aggregations (which is under development).

I don't think that changes what is said here. In this case, the value from metadata should be used if it is present.

> the value from metadata should be used if it is present.

makes sense, agreed

But if it's not present, why require returning null? We could require the query engine to compute the value based on the source columns present in the data file.

@findepi, the spec originally stated that all columns that were not present in a file must be interpreted as null because the file is the source of truth for row data -- if a column was not present then it must not have existed at the time the file was written and must default to null because that was the only possible default for new columns.

This is how identity partition columns were always required to be written because omitting them would create a conflict between the partition value in metadata and the source of truth in the file. Now, the requirement to write all columns is more explicitly stated above.

Over time, we added support for reading Hive files that were missing the identity partitioned columns. This now documents how to do that in the read path but we didn't drop the requirement to write all columns in the write path.

In addition, we've now added column defaults for v3 and didn't realize that this section conflicted with the initial-default behavior.

This is just stating more clearly what the expected behavior is and I think it's a good change because it is clear.

emkornfield

add back reasoning for partition columns.

emkornfield · 2024-08-06T00:50:54Z

Vote passed

emkornfield · 2024-08-06T00:51:07Z

Could a committer merge?

emkornfield added 2 commits July 31, 2024 23:01

clarify_projection

4614fda

typo

db9f357

github-actions bot added the Specification label (issues that may introduce spec changes) Aug 1, 2024

emkornfield added 2 commits July 31, 2024 23:06

remove whitespace

2b2e595

remove white space

020a800

ajantha-bhat reviewed Aug 1, 2024
