Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify partition_values field type in Add/Remove actions. #354

Merged
merged 5 commits into from
Aug 8, 2021

Conversation

zijie0
Copy link
Contributor

@zijie0 zijie0 commented Aug 6, 2021

Description

Modify partition_values to type HashMap<String, Option> according to the delta spec.

Related Issue(s)

Documentation

Detailed discussion here: https://groups.google.com/g/delta-users/c/ZzzrtNRqRyk

@houqp
Copy link
Member

houqp commented Aug 6, 2021

Thank you @zijie0 ! Looks good from a quick glance, I will take a closer look later today.

@zijie0
Copy link
Contributor Author

zijie0 commented Aug 7, 2021

@houqp I'm not sure whether I need to add some write tests. I tried locally changed following test: https://github.com/delta-io/delta-rs/blob/main/rust/tests/write_exploration.rs#L422 with empty string value:

// test data set #2 - updates
let json_rows = vec![
    json!({ "id": "D", "value": 148, "modified": "" }),
    json!({ "id": "E", "value": 150, "modified": "" }),
    json!({ "id": "F", "value": 152, "modified": "" }),
];

The generated files seems to be the same with "native" delta.

rust/src/action.rs Outdated Show resolved Hide resolved
rust/src/checkpoints.rs Outdated Show resolved Hide resolved
col_name: k.to_string(),
})?
.clone()
.unwrap_or("__HIVE_DEFAULT_PARTITION__".to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminded me that we are doing partition filtering wrong in get_files_by_partitions. We are parsing Action.path to reconstruct the partition values, but instead, we should be using Action.partition_values or Action.partition_values_parsed directly.

I will file a separate issue to track this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -150,7 +173,7 @@ pub struct Add {
/// The size of this file in bytes
pub size: DeltaDataTypeLong,
/// A map from partition column to value for this file
pub partition_values: HashMap<String, String>,
pub partition_values: HashMap<String, Option<String>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me that the problem you found has a much bigger impact than just the partition_values field, it also applies to Add.tags, Remove.tags, MetaData.format.options, MetaData.configurations.

Let me know if you prefer to address them or only focus on partition_values in this PR. If you prefer the latter, I can file an issue to track the fix for the other fields.

Copy link
Contributor Author

@zijie0 zijie0 Aug 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey QP, I went through the spec doc but didn't found situations that those structs could also contain null values in the map. Is that because Map[String, String] allows null value in Scala, so that we also need to make the change to HashMap<String, Option<String>>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, based on the discussion in the dev list, all data types are annotated using scala syntax, so all Map[String, String] should map to HashMap<String, Option<String>> in rust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I am perfectly fine with us addressing these other fields in an other PR if you want to keep the scope of this PR small. The fix for partition_values itself is already a big improvement :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a look on this :)

@houqp
Copy link
Member

houqp commented Aug 7, 2021

@houqp I'm not sure whether I need to add some write tests.

@zijie0 it would be great if you could help write some write tests to cover the cases of both empty strings and actual null values. So perhaps something like below:

    json!({ "id": "D", "value": 148, "modified": "" }),
    json!({ "id": "E", "value": 150, "modified": null }),

let files = delta_table.get_files();
assert_eq!(files.len(), 1);
assert_eq!(
files[0].split("/").next().unwrap(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor ask, could we check the partition_values field instead to make sure it's null? delta doesn't care whether we have partition value in the file path or not, so it's perfectly valid to not have modified=__HIVE_DEFAULT_PARTITION_ in the path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I will change the assertion.

@houqp
Copy link
Member

houqp commented Aug 7, 2021

The rest looks good to me, thank you @zijie0 for discovering this big can of worms :)

@zijie0
Copy link
Contributor Author

zijie0 commented Aug 7, 2021

It is weird, cargo clippy passed on my local machine. Could you help me on this build failure? @houqp

@zijie0 zijie0 force-pushed the bugfix/null_partition_value branch from 7d428e4 to 48c0702 Compare August 7, 2021 14:25
@zijie0
Copy link
Contributor Author

zijie0 commented Aug 7, 2021

Sorry, forgot to rebase on main branch...

@houqp houqp merged commit 4a90cd2 into delta-io:main Aug 8, 2021
@houqp
Copy link
Member

houqp commented Aug 8, 2021

Thank you @zijie0 !

@zijie0 zijie0 deleted the bugfix/null_partition_value branch August 9, 2021 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read table failed when containing null partition values
2 participants