Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Clarify identity partition edge cases. #10835

Merged
merged 11 commits into from
Aug 6, 2024

Conversation

emkornfield
Copy link
Contributor

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Aug 1, 2024
@emkornfield
Copy link
Contributor Author

CC @rdblue @RussellSpitzer an alternative (or perhaps cleanup) would be to move column projection to be a subject of scan-planning. I think this would flow more nicely since all of the concepts would have been introduced already (instead of having to forward link)

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Co-authored-by: Ajantha Bhat <ajanthabhat@gmail.com>
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
@emkornfield emkornfield requested a review from rdblue August 1, 2024 21:34
format/spec.md Outdated Show resolved Hide resolved
Co-authored-by: Ajantha Bhat <ajanthabhat@gmail.com>
Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield for taking this up.

This is super helpful in streamlining what we need to do hive tables migrated to iceberg tables would need to do as the data files didn't persist partition col in the file itself as hive catalog tracks it.

One had to read the code to understand how iceberg-core does this handling with all fallback and can be easily missed when doing diff lang impl, I came across it when i was adding iceberg support for redshift and had to go through the similar exercise.

format/spec.md Outdated
Comment on lines 252 to 255
* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on `data_file` object in the manifest.
* Use `schema.name-mapping.default` metadata to map field id to columns without field id as described below and use the column if it is present.
* Return the default value if it has a defined in `initial-default` (See [Default values](#default-values) section for more details).
* Return `null` in all other cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] should we move these lines after we explain what schema.name-mapping.default is ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referenced that it is described below. I placed it i moved text from the following paragraph and highlights the overall logic is non-trivial.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clearing this up 👍

Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids. If a field id is missing from a data file, its value for each row should be `null`.
Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids.

Values for field ids which are not present in a data file must be resolved according the following rules:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above says "All columns must be written to data file".
Here it says "field ids which are not present in a data file ..."
Besides schema evolution (new columns) case, and multiple partitionings in use, when can the field ids be missing from the data file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When files are added from Hive tables or other places. If you're writing to an Iceberg table we have stricter standards than if you are adding data files from a converted table.

In the past, we didn't document the expectations for readers, but this is definitely a case where we want to be clear about the requirements across engines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When migrating from other table formats that don't write this information into data files. Hive is the primary example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it -- files migrated from hive lack some information and we need to deal with it.
If such "partial data files" are legal, why do we expect writers not to create them? The readers still need to support them, so what's the win (from format perspective)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument here is that especially with default values there is a larger room for misinterpretation. I don't have a strong feeling here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files are the source of truth for row data. It's convenient to have a copy of identity columns in metadata because we can avoid projecting and doing extra work. But you should still get a full copy of the row as it was written if you read just the file.

format/spec.md Outdated

Values for field ids which are not present in a data file must be resolved according the following rules:

* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on `data_file` object in the manifest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a new column projection is defined (eg day(order_time)), can the query engine derive value for that field from the order_time column?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transforms are one way, with the exception of identity that does not modify the value. No other partition transforms support recovering the original value.

If you have a case where the engine is deriving a value that is stored in metadata (which I think is what you're asking) then you can use the metadata value as long as you know that the transform the engine is projecting exactly matches the Iceberg transform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand the question but if order_time is of type date, then in this special case the transform is effectively an identity transform.

Copy link
Member

@findepi findepi Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about situation like this (using Trino syntax)

-- create a table with some data
CREATE TABLE t AS SELECT 123 AS a;

-- add new partitioning column
ALTER TABLE t SET PROPERTIES partitioning = ARRAY['truncate(a, 10)'];

Now we have two fields: the a data column and a_trunc projected column.
Trino doesn't provide a way to query for a_trunc column directly.
However, if it did, the value for a_trunc could be derived from the data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems for this case we probably want a subsection of the specification or an implementation note for projecting partition values? Maybe that would be a good follow-up (I'm not sure which engines actually allow projecting partition values since they aren't technically part of the schema.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed that partition columns are not part of the schema, so don't show up in table's relational model.
However, the document here talks about reading "field ids" and partition columns are also fields with ids. Or is this section supposed to be only about fields that are plain data fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read it as at least initially only about plain data, since for transform partition values these would never actually be in data file, and that is what this topic currently discusses. I think adding a subsection about projecting partition values if necessary to this section as a follow-up could make sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is allowed, particularly for filters where you can use Iceberg's expression library to filter by partition values directly. You can also use partition values in aggregations (which is under development).

I don't think that changes what is said here. In this case, the value from metadata should be used if it is present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the value from metadata should be used if it is present.

makes sense, agreed

but if it's not present -- why require to return null? we could require the query engine to compute the value based on the source columns present in the data file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi, the spec originally stated that all columns that were not present in a file must be interpreted as null because the file is the source of truth for row data -- if a column was not present then it must not have existed at the time the file was written and must default to null because that was the only possible default for new columns.

This is how identity partition columns were always required to be written because omitting them would create a conflict between the partition value in metadata and the source of truth in the file. Now, the requirement to write all columns is more explicitly stated above.

Over time, we added support for reading Hive files that were missing the identity partitioned columns. This now documents how to do that in the read path but we didn't drop the requirement to write all columns in the write path.

In addition, we've now added column defaults for v3 and didn't realize that this section conflicted with the initial-default behavior.

This is just stating more clearly what the expected behavior is and I think it's a good change because it is clear.

Copy link
Contributor Author

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add back reasoning for partition columns.

@emkornfield
Copy link
Contributor Author

Vote passed

@emkornfield
Copy link
Contributor Author

Could a committer merge?

@amogh-jahagirdar amogh-jahagirdar merged commit 525d887 into apache:main Aug 6, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants