feat: Project transform #309
@liurenjie1024 @ZENOTME @sdd PTAL
crates/iceberg/src/spec/transform.rs
Outdated
assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
assert_eq!(format!("{}", result_binary), "projected_name = 0");
assert_eq!(format!("{}", result_set), "projected_name IN (0)");
Just trying to follow what's happening here so that I understand.
So in the case of result_binary, the value of 5 gets truncated to 0, since the truncate transform when applied to an int effectively divides by 10, rounding down for all fractional values (i.e., 1-9 get truncated to 0, 10-19 get truncated to 1, 20-29 get truncated to 2)?
And in the case of result_set, since both 5 and 6 get truncated to the same value of 0, a set of (5, 6) becomes a set of (0)?
Yes, I think your understanding is correct.
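For reference, a minimal sketch of the integer truncate transform being discussed, assuming the definition from the Iceberg table spec (`v - (v % W)` with a non-negative remainder) and the width of 10 used in this test. The names `truncate_i32` and `project_set` are hypothetical and are not taken from this PR. One detail worth pinning down: under this definition a value maps to the lower bound of its width-`W` window rather than to the quotient, so 5 and 6 both map to 0, while 10-19 map to 10.

```rust
use std::collections::BTreeSet;

/// Truncate `v` to the lower bound of its width-`width` window.
/// Assumes `width > 0` (hypothetical helper, not the PR's code).
/// `rem_euclid` keeps the remainder non-negative, so negative inputs
/// round toward negative infinity. Overflow near `i32::MIN` is ignored here.
pub fn truncate_i32(width: i32, v: i32) -> i32 {
    v - v.rem_euclid(width)
}

/// Apply the truncation to every literal of an IN set.
/// Collecting into a set drops literals that collapse to the same value.
pub fn project_set(width: i32, literals: &[i32]) -> BTreeSet<i32> {
    literals.iter().map(|&v| truncate_i32(width, v)).collect()
}
```

With these definitions, `project_set(10, &[5, 6])` contains a single element, which matches the `IN (0)` in the assertion under review.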
I think it is better to have more than one element in the set. In Python and Java, a IN (0) is rewritten to = 0 before being evaluated.
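A rough sketch of the kind of simplification described here, to show why a one-element set makes a weak test case. The `Projected` enum and `simplify` function are hypothetical stand-ins for illustration only; they are not the iceberg-rust, Java, or Python types.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified projected predicate (not the iceberg-rust type).
#[derive(Debug, PartialEq)]
pub enum Projected {
    Eq(i32),
    In(BTreeSet<i32>),
}

/// Collapse a one-element IN set into an equality; leave everything else as-is.
pub fn simplify(p: Projected) -> Projected {
    match p {
        Projected::In(set) if set.len() == 1 => {
            // The guard guarantees exactly one element.
            let v = *set.iter().next().expect("set has exactly one element");
            Projected::Eq(v)
        }
        other => other,
    }
}
```

If an implementation applies such a rewrite, a test that projects to `IN (0)` ends up exercising the equality path, so a set that stays multi-valued after projection covers more of the set logic.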
Example can be found here: https://github.com/apache/iceberg/blob/81b62c78e0c230516090becda7d6040ee03e6a91/api/src/test/java/org/apache/iceberg/transforms/TestTruncatesProjection.java#L189-L190
It would be good to port the other Java tests as well.
LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.
I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.
Thanks for your feedback. If the others approve "the correctness" of the PR, I'll add those comments.
Thanks for picking this up @marvinlanhenke
The most important part of adding projections is making sure that they are absolutely correct. If Rust generated something different from Java or Python, it would lead to data incorrectness, since the partition predicates would not be evaluated correctly.
@Fokko
@Fokko @liurenjie1024 Once we agree on the overall implementation, I'll start porting the test suite.
Thanks @marvinlanhenke for picking this up. In general it looks good, but I have some small suggestions:
- How about we split this PR into smaller PRs where each PR implements project for one transform?
- How about we move the specific logic of each transform into its own file in the transform module?
- I agree with @Fokko that we should port the tests from Java to ensure correctness.
@marvinlanhenke Thanks for this, I also feel that a boundary trait is a little over-designed, and implementing them directly on
@Fokko @liurenjie1024 I ported all the tests and fixed #311. Now all tests are passing and align with the Java implementation.
Hi, @marvinlanhenke Thanks, I'll take a careful look at this PR. About moving the specific logic into the respective transforms, do you plan to do it in this PR or in following PRs?
... I haven't made up my mind yet. I think we can merge this if it's okay, in order to unblock #253? Also, I'm no longer sure we have to move the logic at all, since most of the helper functions handle logic that does not depend on the type of transform. Perhaps you can outline, at a high level, the refactor you have in mind?
Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split the tests into separate modules to make them easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.
Take your time while reviewing. In the meantime, I'll open another draft with a possible refactor and we can compare what's best? I'm thinking of splitting the code into Identity, Bucket, Truncate and Temporal logic, including the tests. This, however, might introduce some minor code duplication (I'll have to try and see for myself though; now that we have all the tests, it should be much easier to verify). EDIT:
Hi, @marvinlanhenke Thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier to review. Really grateful for these tests!
crates/iceberg/src/spec/transform.rs
Outdated
let func = create_transform_function(self)?;

let projection = match predicate {
    BoundPredicate::Unary(expr) => match self {
Would you mind rewriting this as follows:
match self {
    Transform::Identity => {
        match predicate {
            BoundPredicate::Unary(expr) => { ... }
            BoundPredicate::Binary(expr) => { ... }
        }
    }
}
I know the results are the same, but rewriting it this way makes it easier to read and to check against the Java implementation, since they are organized by transform in each file.
I had the structure you suggested in an earlier version. I changed it the other way around since predicate has the smaller cardinality, which allows me to group more transforms into a single predicate match arm. I can change it back, however this would introduce more match arms and some code duplication?
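To make the trade-off concrete, here is a toy sketch of the two nesting orders. Everything in it is hypothetical: the reduced `Transform` and `Predicate` enums, the handler labels, and the way transforms are grouped are invented for illustration and do not reflect how the PR actually groups them.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Transform { Identity, Bucket, Truncate }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Predicate { Unary, Binary, Set }

/// Predicate-first: few outer arms, transforms grouped with `|` patterns.
pub fn handler_predicate_first(t: Transform, p: Predicate) -> &'static str {
    match p {
        // One arm serves every transform.
        Predicate::Unary => "unary_shared",
        Predicate::Binary => match t {
            Transform::Identity => "binary_identity",
            Transform::Bucket | Transform::Truncate => "binary_transformed",
        },
        Predicate::Set => match t {
            Transform::Identity => "set_identity",
            Transform::Bucket | Transform::Truncate => "set_transformed",
        },
    }
}

/// Transform-first: one outer arm per transform, shared handlers repeated.
pub fn handler_transform_first(t: Transform, p: Predicate) -> &'static str {
    match t {
        Transform::Identity => match p {
            Predicate::Unary => "unary_shared",
            Predicate::Binary => "binary_identity",
            Predicate::Set => "set_identity",
        },
        Transform::Bucket => match p {
            Predicate::Unary => "unary_shared",
            Predicate::Binary => "binary_transformed",
            Predicate::Set => "set_transformed",
        },
        Transform::Truncate => match p {
            Predicate::Unary => "unary_shared",
            Predicate::Binary => "binary_transformed",
            Predicate::Set => "set_transformed",
        },
    }
}
```

Both functions dispatch identically. The first is shorter, while the second repeats the shared arms but lets a reader check one transform at a time against a per-transform reference implementation.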
I think some code duplication is worth it so that we can have better readability?
Sure, I'm already implementing it, since I wanted to compare for myself.
@liurenjie1024
I did a refactor changing the structure (matching order). I also extracted common functionality, renamed those helpers and updated the docs. I hope not only the structure but the overall design is more readable and understandable with those changes applied?
Yeah, it looks much better now, thanks! I'll take a careful review later.
Thanks for the review, I'll get to your suggestions - those should be easy to fix.
Thanks @marvinlanhenke for this great pr with the thorough tests, really appreciate it! I have some small suggestions, but it looks great!
cc @Fokko Do you have other comments?
@liurenjie1024 @Fokko
Which issue does this PR close?
Closes #264
Rationale for this change
The ability to project row_filter to Transform partition. This will unblock the Manifest & PartitionEvaluator, which will enable the pruning of manifest files in fn plan_files.
What changes are included in this PR?
fn project(...) on Transform
Are these changes tested?
Yes. Unit tests are included.