feat(rust,python): cast each parquet file to delta schema #2615

Open
wants to merge 2 commits into
base: main

Conversation

HawaiianSpork
Contributor

@HawaiianSpork HawaiianSpork commented Jun 21, 2024

Description

By casting each record batch that is read to the Delta schema, DataFusion can read tables whose underlying parquet files can be cast to the desired schema. Fixes:

  • Errors when querying data where some parquet files lack columns that were added later through schema migration. This includes nested fields of structs that sit inside maps, lists, or other structs.
  • Maps and lists written with different element names.
  • Timestamps of different units.
  • Any other cast supported by arrow-cast.
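
As a rough illustration of the first fix, here is a minimal, library-free sketch of reshaping a row from an older file to a newer table schema. The schema encoding (nested dicts, with `None` marking a leaf column) and the `adapt_row` helper are hypothetical and are not the delta-rs or DataFusion API; the actual change casts Arrow record batches, and this sketch covers only structs, not maps or lists.

```python
from typing import Any, Optional

# Hypothetical schema encoding: a dict maps field names to either None
# (a leaf column) or a nested dict (a struct column).
Schema = dict[str, Optional[dict]]


def adapt_row(row: Optional[dict[str, Any]], schema: Schema) -> Optional[dict[str, Any]]:
    """Reshape `row` to `schema`, filling missing fields with None.

    Fields not present in `schema` are dropped; that is a choice of this
    sketch, not something the PR description states.
    """
    if row is None:
        return None  # a null struct stays null
    adapted: dict[str, Any] = {}
    for name, child in schema.items():
        value = row.get(name)  # missing column or field -> None
        if child is not None:
            value = adapt_row(value, child)  # recurse into struct fields
        adapted[name] = value
    return adapted


# Table schema after a migration added `email` and `address.zip`.
table_schema: Schema = {"id": None, "email": None, "address": {"city": None, "zip": None}}

# Row from a file written before the migration.
old_row = {"id": 1, "address": {"city": "Hilo"}}

new_row = adapt_row(old_row, table_schema)
print(new_row)
# {'id': 1, 'email': None, 'address': {'city': 'Hilo', 'zip': None}}
```

The point of the sketch is that the table schema, not the file schema, drives the output shape, so every file yields rows with the same columns regardless of when it was written.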

This is possible now because DataFusion exposes a SchemaAdapter that can be overridden.

Note that with this change, delta-rs reads all timestamps with microsecond precision, matching the Delta protocol.
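
To make the precision note concrete, the sketch below normalizes raw epoch timestamps of different units to microseconds. It is a plain-Python illustration under stated assumptions, not the arrow-cast implementation: the `to_micros` helper and the unit table are hypothetical, and truncating nanoseconds with floor division is this sketch's choice.

```python
# Ticks per second for each timestamp unit.
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def to_micros(value: int, unit: str) -> int:
    """Convert an epoch timestamp expressed in `unit` to microseconds."""
    ticks = TICKS_PER_SECOND[unit]
    target = TICKS_PER_SECOND["us"]
    if ticks <= target:
        return value * (target // ticks)  # coarser unit: conversion is exact
    return value // (ticks // target)  # finer unit: sub-microsecond digits are dropped


ns_value = 1_718_928_000_123_456_789  # nanoseconds since the epoch
print(to_micros(ns_value, "ns"))           # 1718928000123456
print(to_micros(1_718_928_000_123, "ms"))  # 1718928000123000
```

The trailing `789` nanoseconds in the first value do not survive the conversion, which is the behavior change that files with nanosecond timestamps would see.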

Related Issue(s)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jun 21, 2024

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@HawaiianSpork HawaiianSpork changed the title from "feat(core): Allow schema evolution on read" to "feat(rust,python): allow schema evolution on read" Jun 21, 2024
@ion-elgreco
Collaborator

@HawaiianSpork can you add a test with a Delta table whose parquet files contain nanosecond timestamps? Maybe just create a parquet table and then use convert to delta?

@HawaiianSpork
Contributor Author

> @HawaiianSpork can you add a test with a Delta table whose parquet files contain nanosecond timestamps? Maybe just create a parquet table and then use convert to delta?

@ion-elgreco, I'd be happy to add more tests, but I want to make sure I create the right ones. The data in ../test/tests/data/table_with_edge_timestamps has parquet files with nanosecond timestamp precision, so you can see how this change leads to DataFusion seeing only microsecond precision. Do you think this is fine?

FYI @wjones127 and @roeap, since you made the original commit that reads the schema from the parquet files: #1266.

@rtyler
Member

rtyler commented Jun 26, 2024

This looks promising, but if you don't mind, I would like to update the title for the sake of the future changelog. Schema evolution is typically understood in the Delta context as changes to the Delta schema (i.e. a transaction commit occurs).

If I am understanding this correctly, it's more about schema adaptation of read results.

@HawaiianSpork HawaiianSpork changed the title from "feat(rust,python): allow schema evolution on read" to "feat(rust,python): Cast each Parquet File To Delta Schema" Jun 28, 2024
By casting the read record batch to the delta schema datafusion can read tables where the underlying parquet files can be cast to the desired schema.
@HawaiianSpork HawaiianSpork changed the title from "feat(rust,python): Cast each Parquet File To Delta Schema" to "feat(rust,python): Cast each parquet file to delta schema" Jun 28, 2024
@HawaiianSpork HawaiianSpork changed the title from "feat(rust,python): Cast each parquet file to delta schema" to "feat(rust,python): Cast each parquet file to delta schema" Jun 28, 2024
@HawaiianSpork HawaiianSpork changed the title from "feat(rust,python): Cast each parquet file to delta schema" to "feat(rust,python): cast each parquet file to delta schema" Jun 28, 2024
Labels
binding/rust Issues for the Rust crate
3 participants