feat: Project transform #309

marvinlanhenke · 2024-03-27T19:47:51Z

Which issue does this PR close?

Closes #264

Rationale for this change

This PR adds the ability to project a row_filter onto a partition Transform.
This will unblock the Manifest & PartitionEvaluator, which in turn enables pruning of manifest files in fn plan_files.

What changes are included in this PR?

add fn project(...) on Transform
mainly based on Java implementation

Are these changes tested?

Yes. Unit tests are included.

marvinlanhenke · 2024-03-27T19:48:23Z

@liurenjie1024 @ZENOTME @sdd PTAL

sdd · 2024-03-28T07:50:09Z

crates/iceberg/src/spec/transform.rs

+
+        assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
+        assert_eq!(format!("{}", result_binary), "projected_name = 0");
+        assert_eq!(format!("{}", result_set), "projected_name IN (0)");


Just trying to follow what's happening here so that I understand.

So in the case of result_binary, the value of 5 gets truncated to 0, since the truncate transform with a width of 10, when applied to an int, rounds down to the nearest multiple of 10 (i.e., 0-9 get truncated to 0, 10-19 get truncated to 10, 20-29 get truncated to 20)?

And in the case of result_set, since both 5 and 6 get truncated to the same value of 0, a set of (5, 6) becomes a set of (0)?

Yes, I think your understanding is correct.
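
For reference, a minimal sketch of the arithmetic in question, assuming a truncate width of 10 (the width is not visible in the snippet above). The Iceberg spec defines integer truncation as v - (v % W) with a non-negative remainder. `truncate_i32` and `project_set` are hypothetical helper names used only for illustration, not functions from this PR.

```rust
use std::collections::BTreeSet;

/// Truncates `v` down to the nearest multiple of `width` (hypothetical helper).
/// `rem_euclid` keeps the remainder non-negative, so negative inputs round
/// towards negative infinity. Assumes `width > 0` and ignores overflow for
/// values close to `i32::MIN`.
fn truncate_i32(width: i32, v: i32) -> i32 {
    v - v.rem_euclid(width)
}

/// Applies the truncation to every literal of an `IN` set.
/// Literals that truncate to the same value collapse into one entry.
fn project_set(width: i32, literals: &[i32]) -> BTreeSet<i32> {
    literals.iter().map(|&v| truncate_i32(width, v)).collect()
}
```

With a width of 10, both 5 and 6 truncate to 0, so the set (5, 6) projects to (0). A value such as 15 truncates to 10: the result stays a multiple of the width rather than becoming a quotient.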

I think it is better to have more than one element in the set. In Python and Java, an IN (0) predicate is rewritten to = 0 before it is evaluated.

Example can be found here: https://github.com/apache/iceberg/blob/81b62c78e0c230516090becda7d6040ee03e6a91/api/src/test/java/org/apache/iceberg/transforms/TestTruncatesProjection.java#L189-L190

It would be good to port the other Java tests as well.
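
To illustrate why a one-element set makes for a weak test, here is a small sketch of the single-element rewrite described for Python and Java. `Projected` is a hypothetical stand-in type, not the crate's predicate type, and whether iceberg-rust performs the same rewrite is not stated in this thread.

```rust
use std::collections::BTreeSet;
use std::fmt;

/// A tiny stand-in for a projected predicate (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Projected {
    Eq(String, i32),
    In(String, BTreeSet<i32>),
}

impl Projected {
    /// Rewrites a single-element `IN` into `=`; everything else is unchanged.
    fn simplify(self) -> Projected {
        match self {
            Projected::In(name, set) if set.len() == 1 => {
                let value = set.into_iter().next().expect("set has exactly one element");
                Projected::Eq(name, value)
            }
            other => other,
        }
    }
}

impl fmt::Display for Projected {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Projected::Eq(name, v) => write!(f, "{} = {}", name, v),
            Projected::In(name, set) => {
                // BTreeSet iterates in ascending order, so the output is stable.
                let items: Vec<String> = set.iter().map(|v| v.to_string()).collect();
                write!(f, "{} IN ({})", name, items.join(", "))
            }
        }
    }
}
```

A test whose projected set ends up with one element would exercise the equality path after such a rewrite, while a set with two or more distinct projected values keeps the IN form.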

sdd

LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.

I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.

marvinlanhenke · 2024-03-28T07:59:18Z

> LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.
>
> I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.

Thanks for your feedback. If the others approve "the correctness" of the PR - I'll add those comments.

Fokko

Thanks for picking this up @marvinlanhenke

The most important part of adding projections is making sure that they are absolutely correct. If Rust generated something different from Java or Python, it would lead to data incorrectness, since the partition predicates would not be evaluated correctly.

Fokko · 2024-03-28T08:08:36Z

crates/iceberg/src/spec/transform.rs

+
+        assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
+        assert_eq!(format!("{}", result_binary), "projected_name = 0");
+        assert_eq!(format!("{}", result_set), "projected_name IN (0)");


I think it is better to have more than one element in the set. In Python and Java, an IN (0) predicate is rewritten to = 0 before it is evaluated.

crates/iceberg/src/spec/transform.rs

Fokko · 2024-03-28T08:16:37Z

crates/iceberg/src/spec/transform.rs

+
+        assert_eq!(format!("{}", result_unary), "projected_name IS NULL");
+        assert_eq!(format!("{}", result_binary), "projected_name = 0");
+        assert_eq!(format!("{}", result_set), "projected_name IN (0)");


Example can be found here: https://github.com/apache/iceberg/blob/81b62c78e0c230516090becda7d6040ee03e6a91/api/src/test/java/org/apache/iceberg/transforms/TestTruncatesProjection.java#L189-L190

It would be good to port the other Java tests as well.

marvinlanhenke · 2024-03-28T08:30:47Z

> Thanks for picking this up @marvinlanhenke
>
> The most important part of adding projections is making sure that they are absolutely correct. If Rust generated something different from Java or Python, it would lead to data incorrectness, since the partition predicates would not be evaluated correctly.

@Fokko
Thanks for the extensive review. Not only did I miss some tests, I also missed the complete implementation for Dates: https://github.com/apache/iceberg/blob/d350c9b8c995a2953aa8b80a0a1fc7cadc4dd16a/api/src/main/java/org/apache/iceberg/transforms/Dates.java#L123. I was only looking at the transforms for Years, Months, etc. Thank you for pointing me to this. I'll implement it as well and try to port the Java test suite.

marvinlanhenke · 2024-03-28T11:22:03Z

@Fokko
Is it correct to assume, as far as I understand the Java implementation, that we support date projection only for year, month, and day? And that adjusting the boundary only works with integer days? I'm asking because my current implementation only adjusts the boundaries for PrimitiveLiteral::Date(i32) and has no support for PrimitiveLiteral::Time(i64), etc.

@liurenjie1024
I converted this to a draft since I had to change the design. I'll push this draft so you can verify. Basically, I got rid of the trait and implemented boundary on Datum itself, which makes more sense to me now that I have a better overall picture with the date transformations in mind.

Once we agree on the overall implementation, I'll start porting the test suite.
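
As a rough sketch of the boundary adjustment under discussion, assuming the inclusive-projection rules that the Java implementation appears to use for integer-like values: a strict bound is first moved by one to the nearest included value, and only then transformed. `Op` and `project_inclusive` are hypothetical names, and the transform is passed as a closure so the sketch does not depend on any particular date transform.

```rust
/// Comparison operators of a binary predicate (hypothetical stand-in).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Lt,
    LtEq,
    Gt,
    GtEq,
    Eq,
}

/// Projects `column <op> literal` through a monotonic `transform`.
/// Strict comparisons become inclusive ones on the adjusted boundary.
/// Returns `None` if adjusting the boundary would overflow.
fn project_inclusive(
    op: Op,
    literal: i32,
    transform: impl Fn(i32) -> i32,
) -> Option<(Op, i32)> {
    let projected = match op {
        // x < 10 is the same as x <= 9 for integers, so transform 9.
        Op::Lt => (Op::LtEq, transform(literal.checked_sub(1)?)),
        Op::LtEq => (Op::LtEq, transform(literal)),
        // x > 9 is the same as x >= 10 for integers, so transform 10.
        Op::Gt => (Op::GtEq, transform(literal.checked_add(1)?)),
        Op::GtEq => (Op::GtEq, transform(literal)),
        Op::Eq => (Op::Eq, transform(literal)),
    };
    Some(projected)
}
```

The same shape applies when the literal is a date stored as an i32 day count, which is why the adjustment is only meaningful for values with a well-defined "next" and "previous" integer.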

liurenjie1024

Thanks @marvinlanhenke for picking this up. In general it looks good, but I have some small suggestions:

How about we split this pr into smaller prs where each pr implements project for one transform?
How about we move specific logic of transform into each file in transform module?
I agree with @Fokko that we should port tests from java to ensure correctness.

crates/iceberg/src/transform/mod.rs

crates/iceberg/src/spec/transform.rs

liurenjie1024 · 2024-03-28T12:11:11Z

> Basically, I got rid of the trait and implemented boundary on Datum itself, which makes more sense to me now that I have a better overall picture with the date transformations in mind.

@marvinlanhenke Thanks for this. I also feel that a boundary trait is a little over-designed, and implementing this directly on Datum seems better to me.

crates/iceberg/src/spec/values.rs

marvinlanhenke · 2024-04-01T08:50:28Z

@Fokko @liurenjie1024
PTAL

I ported all the tests and fixed #311. Now all tests pass and align with the Java implementation.
I think the only remaining improvement to the design/structure would be moving the test suite into the respective transform files (e.g. bucket.rs)? Other than that, I think we are good to go?

liurenjie1024 · 2024-04-01T09:01:11Z

> @Fokko @liurenjie1024 PTAL
>
> I ported all the tests and fixed #311. Now all tests pass and align with the Java implementation. I think the only remaining improvement to the design/structure would be moving the test suite into the respective transform files (e.g. bucket.rs)? Other than that, I think we are good to go?

Hi @marvinlanhenke, thanks, I'll take a careful look at this PR. Regarding moving the specific logic into the respective transforms, do you plan to do it in this PR or in follow-up PRs?

marvinlanhenke · 2024-04-01T09:13:48Z

> > @Fokko @liurenjie1024 PTAL
> > I ported all the tests and fixed #311. Now all tests pass and align with the Java implementation. I think the only remaining improvement to the design/structure would be moving the test suite into the respective transform files (e.g. bucket.rs)? Other than that, I think we are good to go?
>
> Hi @marvinlanhenke, thanks, I'll take a careful look at this PR. Regarding moving the specific logic into the respective transforms, do you plan to do it in this PR or in follow-up PRs?

I haven't made up my mind yet. I think we can merge this, if that's okay, in order to unblock #253?

Also, I'm no longer sure we have to move the logic at all, since most of the helper functions handle logic that does not depend on the type of transform. Perhaps you can outline, at a high level, the refactor you have in mind?

liurenjie1024 · 2024-04-01T09:23:19Z

Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

marvinlanhenke · 2024-04-01T09:27:00Z

> Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

Take your time reviewing. In the meantime, I'll open another draft with a possible refactor so we can compare what's best. I'm thinking of splitting the code into Identity, Bucket, Truncate and Temporal logic, including the tests. This might, however, introduce some minor code duplication (I'll have to try and see for myself, though; now that we have all the tests, it should be much easier to verify).

EDIT:
@liurenjie1024
I've taken another look and I don't think a refactor makes sense here (except for moving the tests).
Since so many transforms are handled the same way, implementing project on each transform would lead to code duplication and a bigger maintenance burden (I'd also have to create a separate match arm for each enum variant).
So I think the approach with the helper functions on Transform itself is fine, as it allows for code locality and an overall more generic approach (combining the same logic).

liurenjie1024

Hi @marvinlanhenke, thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier to review. Really grateful for these tests!

crates/iceberg/src/spec/transform.rs

liurenjie1024 · 2024-04-02T09:42:18Z

crates/iceberg/src/spec/transform.rs

+        let func = create_transform_function(self)?;
+
+        let projection = match predicate {
+            BoundPredicate::Unary(expr) => match self {


Would you mind rewriting this as follows:

    match self {
        Transform::Identity => {
            match predicate {
                BoundPredicate::Unary(expr) => { ... }
                BoundPredicate::Binary(expr) => { ... }
            }
        }
    }

I know the results are the same, but writing it this way makes it easier to read and to check against the Java implementation, since there the logic is organized by transform, each in its own file.

I had the structure you suggested in an earlier version. I changed it the other way around since predicate has the smaller cardinality, which allows me to group more transforms into a single predicate match arm. I can change it back, however this would introduce more match arms and some code duplication?

I think some code duplication is worth it if we get better readability?

sure, I'm already implementing it - since I wanted to compare for myself.

@liurenjie1024
I did a refactor changing the structure (matching order). I also extracted common functionality, renamed those helpers and updated the docs. I hope not only the structure but the overall design is more readable and understandable with those changes applied?

Yeah, it looks much better now, thanks! I'll take a careful review later.
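
For readers following the thread, a simplified sketch of the transform-first matching order discussed above. The `Transform` and `Predicate` enums and the `project` function here are reduced, hypothetical stand-ins (the real code works on BoundPredicate and returns predicates rather than strings), and mapping Void to no projection is an assumption based on the Java behaviour.

```rust
/// Reduced stand-in for the partition transform (hypothetical).
#[derive(Debug, Clone, Copy)]
enum Transform {
    Identity,
    Truncate(i32),
    Void,
}

/// Reduced stand-in for a bound predicate on an int column (hypothetical).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    IsNull,
    Eq(i32),
}

/// Matches on the transform first, then on the predicate, so each outer arm
/// can be compared against the corresponding per-transform Java class.
fn project(transform: Transform, name: &str, predicate: Predicate) -> Option<String> {
    match transform {
        Transform::Identity => match predicate {
            Predicate::IsNull => Some(format!("{} IS NULL", name)),
            Predicate::Eq(v) => Some(format!("{} = {}", name, v)),
        },
        Transform::Truncate(width) => match predicate {
            Predicate::IsNull => Some(format!("{} IS NULL", name)),
            // Truncate the literal to a multiple of `width` (assumes width > 0).
            Predicate::Eq(v) => Some(format!("{} = {}", name, v - v.rem_euclid(width))),
        },
        // Assumption: a void transform cannot be projected.
        Transform::Void => None,
    }
}
```

The IS NULL arm is repeated per transform, which is the duplication mentioned in the thread; the trade-off is that each outer arm reads as a self-contained unit.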

crates/iceberg/src/spec/transform.rs

crates/iceberg/src/transform/temporal.rs

marvinlanhenke · 2024-04-02T10:58:57Z

> Hi @marvinlanhenke, thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier to review. Really grateful for these tests!

Thanks for the review, I'll get to your suggestions - those should be easy to fix.

liurenjie1024

Thanks @marvinlanhenke for this great pr with the thorough tests, really appreciate it! I have some small suggestions, but it looks great!

crates/iceberg/src/spec/transform.rs

liurenjie1024 · 2024-04-05T05:59:50Z

cc @Fokko Do you have other comments?

marvinlanhenke · 2024-04-05T09:42:01Z

> Hi @marvinlanhenke, thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier to review. Really grateful for these tests!

@liurenjie1024 @Fokko
Thanks for the review - I did some minor fixes according to your suggestions. If no other comments, I think we're good to go.

marvinlanhenke added 11 commits March 27, 2024 11:29

add project bucket_unary

8d014b1

add project bucket_binary

507caa2

add project bucket_set

09eda3f

add project identity

41f90f7

add project truncate

73f1e3d

fixed array boundary

fd79c14

add project void

7885483

add project unknown

bb84d2b

add docs + none projections

a5dc6ef

docs

bba3629

docs

066a69c

sdd reviewed Mar 28, 2024


sdd approved these changes Mar 28, 2024


Fokko reviewed Mar 28, 2024


marvinlanhenke marked this pull request as draft March 28, 2024 10:55

marvinlanhenke added 3 commits March 28, 2024 12:35

remove trait + impl boundary on Datum

ac86baa

fix: clippy

4f113b6

fix: test Transform::Unknown

3f99f38

liurenjie1024 reviewed Mar 28, 2024


crates/iceberg/src/spec/values.rs (outdated, resolved)

marvinlanhenke added 5 commits March 28, 2024 13:47

add: transform_literal_result

32aef76

add: transform_literal_result

736bb91

remove: whitespace

d476993

move boundary to transform.rs

9385084

add check if transform can be applied to data_type

9738416

marvinlanhenke mentioned this pull request Mar 31, 2024

Bug: fn day_timestamp_micro produces wrong results #311

Closed

marvinlanhenke added 2 commits April 1, 2024 07:23

basic fix

a123fc1

change to Result<i32>

3483f33

marvinlanhenke mentioned this pull request Apr 1, 2024

Fix day timestamp micro #312

Merged

marvinlanhenke added 3 commits April 1, 2024 07:45

use try_unary

a55be8f

Merge branch 'fix_day_timestamp_micro' into project_transform

7ec4c79

add: java-testsuite Transform::Timestamp Hours

014d793

marvinlanhenke marked this pull request as ready for review April 1, 2024 08:48

marvinlanhenke added 2 commits April 1, 2024 16:24

refactor: split and move tests

ab06022

refactor: move transform tests

d78e269

liurenjie1024 reviewed Apr 2, 2024


marvinlanhenke added 4 commits April 2, 2024 13:14

remove self

4f84a0e

refactor: structure fn project + helpers

eaacaa8

fix: clippy

2bb2f95

fix: typo

976d8c9

liurenjie1024 approved these changes Apr 5, 2024


marvinlanhenke added 2 commits April 5, 2024 11:28

Merge branch 'main' into project_transform

2961f98

fix: naming + generics

82e6244

liurenjie1024 merged commit 4e89ac7 into apache:main Apr 5, 2024
7 checks passed

marvinlanhenke mentioned this pull request Apr 7, 2024

Implement transforms projection #289

Closed

marvinlanhenke deleted the project_transform branch April 23, 2024 04:27

Fokko mentioned this pull request Apr 24, 2024

Tracking issues of iceberg-rust v0.3.0 #348

Open

73 tasks
