Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Project transform #309

Merged
merged 48 commits into from Apr 5, 2024
Merged

Conversation

marvinlanhenke
Copy link
Contributor

Which issue does this PR close?

Closes #264

Rationale for this change

The ability to project row_filter to Transform partition.
This will unblock the Manifest & PartitionEvaluator, which will enable the pruning of manifest files in fn plan_files.

What changes are included in this PR?

  • add fn project(...) on Transform
  • mainly based on Java implementation

Are these changes tested?

Yes. Unit tests are included.

@marvinlanhenke
Copy link
Contributor Author

@liurenjie1024 @ZENOTME @sdd PTAL


assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
assert_eq!(format!("{}", result_binary), "projected_name = 0");
assert_eq!(format!("{}", result_set), "projected_name IN (0)");
Copy link
Contributor

@sdd sdd Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just trying to follow what's happening here so that I understand.

So in the case of result_binary, the value of 5 gets truncated to 0, since the truncate transform when applied to an int effectively divides by 10, rounding down for all fractional values (ie, 1-9 get truncated to 0, 10-19 get truncated to 1, 20-29 get truncated to 2)?

And in the case of result_set, since both 5 and 6 get truncated to the same value of 0, a set of (5, 6) becomes a set of (0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think your understanding is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to have more than one element in the set. In Python and Java a IN (0) is being rewritten to = 0 before evaluated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.

I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.

@marvinlanhenke
Copy link
Contributor Author

LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.

I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.

Thanks for your feedback. If the others approve "the correctness" of the PR - I'll add those comments.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up @marvinlanhenke

The most important part of adding projection is making that they are absolutely correct. If rust would generate something different than Java of Python, leads to data incorrectness since it would not correctly evaluate the partition predicates.


assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
assert_eq!(format!("{}", result_binary), "projected_name = 0");
assert_eq!(format!("{}", result_set), "projected_name IN (0)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to have more than one element in the set. In Python and Java a IN (0) is being rewritten to = 0 before evaluated.

crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved

assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
assert_eq!(format!("{}", result_binary), "projected_name = 0");
assert_eq!(format!("{}", result_set), "projected_name IN (0)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marvinlanhenke
Copy link
Contributor Author

Thanks for picking this up @marvinlanhenke

The most important part of adding projection is making that they are absolutely correct. If rust would generate something different than Java of Python, leads to data incorrectness since it would not correctly evaluate the partition predicates.

@Fokko
Thanks for the extensive review. Not only did I miss some test but the complete implementation for Dates : https://github.com/apache/iceberg/blob/d350c9b8c995a2953aa8b80a0a1fc7cadc4dd16a/api/src/main/java/org/apache/iceberg/transforms/Dates.java#L123 // I was just looking for the transforms for Years, Months, etc. So thank you for hinting me at this - I'll implement this as well - and try to port the Java Testsuite.

@marvinlanhenke marvinlanhenke marked this pull request as draft March 28, 2024 10:55
@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Mar 28, 2024

@Fokko
Is it correct to assume, as far as I understood the Java implementation, that we support dates projection only for year, month, and day? Also adjusting the boundary only works with integer days? I'm asking because in my current implementation I only adjust the boundaries for PrimitiveLiteral::Date(i32) and have no support for PrimitiveLiteral::Time(i64 etc.

@liurenjie1024
I converted this to a draft - since I had to change the design. I'll push this draft, so you can verify. Basically, I got rid of the trait and implemented boundary on Datum itself - which makes more sense to me, now that I have a better overall picture with the dates transformations in mind.

Once, we agree on the overall implementation - I'll start porting the testsuite.

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @marvinlanhenke for picking up this. In general it looks good, but I have some small suggestions:

  1. How about we split this pr into smaller prs where each pr implements project for one transform?
  2. How about we move specific logic of transform into each file in transform module?
  3. I agree with @Fokko that we should port tests from java to ensure correctness.

crates/iceberg/src/transform/mod.rs Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Collaborator

liurenjie1024 commented Mar 28, 2024

Basically, I got rid of the trait and implemented boundary on Datum itself - which makes more sense to me, now that I have a better overall picture with the dates transformations in mind.

@marvinlanhenke Thanks for this, I also feel that an boundary trait is a little over design, and implementing them directly on Datum seems better to me.

@marvinlanhenke marvinlanhenke marked this pull request as ready for review April 1, 2024 08:48
@marvinlanhenke
Copy link
Contributor Author

@Fokko @liurenjie1024
PTAL

I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation.
I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

@liurenjie1024
Copy link
Collaborator

@Fokko @liurenjie1024 PTAL

I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation. I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

Hi, @marvinlanhenke Thanks, I'll take a careful look of this pr. About moving the specific logic into respective transforms, do you plan to do it in this pr or in following prs?

@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Apr 1, 2024

@Fokko @liurenjie1024 PTAL
I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation. I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

Hi, @marvinlanhenke Thanks, I'll take a careful look of this pr. About moving the specific logic into respective transforms, do you plan to do it in this pr or in following prs?

... I haven't made up my mind yet. I think we can merge this if its okay, in order to unblock #253?

Also I'm not so sure anymore we have to move the logic at all? Since most of the helper functions handle logic that is not dependent of the type of transform? Perhaps you can outline the refactor high-level what you have in mind?

@liurenjie1024
Copy link
Collaborator

Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Apr 1, 2024

Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

Take your time while reviewing in the meantime - I'll open another draft with a possible refactor and we can compare whats best? I 'm thinking of splitting the code into Identity, Bucket, Truncate and Temporal logic - including the tests. This however might introduce some minor code duplication (I'll have to try and see for myself though, now that we have all the tests - it should be much easier to verify)

EDIT:
@liurenjie1024
I've taken a look again and I don't think a refactor makes sense here (except moving the tests).
Since so many transforms are handled the same way, implementing project on each transform would lead to code duplication, and a bigger maintenance burden (also I'd have to create a single match arm for each enum variant).
So I think the approach with the helper functions on Transform itself is fine, as it allows for code locallity and an overall more generic approach (combining the same logic).

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @marvinlanhenke Thanks for pr, it looks great! I have some small suggestion to restructure the code to make it easier for review. Really greatful for these tests!

crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Show resolved Hide resolved
let func = create_transform_function(self)?;

let projection = match predicate {
BoundPredicate::Unary(expr) => match self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind to rewrite this as following:

match self {
   Transform::Identity => {
     match predicate => {
       BoundPredicate::Unary(expr) => { ... }
       BoundPredicate::Binary(expr) => {...}
    }
   }
}

I know the results are same, but rewrite it in this approach makes it easier to read, and do check against java implemention, since they are organized by transfrom in feach file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the structure you suggested in an earlier version. I changed it the other way around since predicate has the smaller cardinality, which allows me to group more transforms into a single predicate match arm. I can change it back, however this would introduce more match arms and some code duplication?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some code duplication is worth so that we can have better readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I'm already implementing it - since I wanted to compare for myself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024
I did a refactor changing the structure (matching order). I also extracted common functionality, renamed those helpers and updated the docs. I hope not only the structure but the overall design is more readable and understandable with those changes applied?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks much better now, thanks! I'll take a careful review later.

crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/transform/temporal.rs Show resolved Hide resolved
@marvinlanhenke
Copy link
Contributor Author

Hi, @marvinlanhenke Thanks for pr, it looks great! I have some small suggestion to restructure the code to make it easier for review. Really greatful for these tests!

Thanks for the review, I'll get to your suggestions - those should be easy to fix.

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @marvinlanhenke for this great pr with the thorough tests, really appreciate it! I have some small suggestions, but it looks great!

crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/transform.rs Show resolved Hide resolved
@liurenjie1024
Copy link
Collaborator

cc @Fokko Do you have other comments?

@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Apr 5, 2024

Hi, @marvinlanhenke Thanks for pr, it looks great! I have some small suggestion to restructure the code to make it easier for review. Really greatful for these tests!

@liurenjie1024 @Fokko
Thanks for the review - I did some minor fixes according to your suggestions. If no other comments, I think we're good to go.

@liurenjie1024 liurenjie1024 merged commit 4e89ac7 into apache:main Apr 5, 2024
7 checks passed
@marvinlanhenke marvinlanhenke deleted the project_transform branch April 23, 2024 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement project for Transform.
4 participants