Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release #7591
Ah, looks like some tests still reference the deprecated methods. Will fix those.
I need to update the tests so that the Arrow null-checking property is enabled when relevant.
static {
  System.setProperty("arrow.enable_null_check_for_get", "true");
}
This is probably not the right way to handle the failing tests. For context, here we always use the NullCheckingForGet.NULL_CHECKING_ENABLED value, a static final variable that is set once based on the value of the arrow.enable_null_check_for_get property.
Right now most of our tests want to validate the validity buffer, and a few do not. It's not possible to set the property dynamically for these different cases because the variable is static final: once it's set, every read yields the original value.
Previously we had an API for explicitly telling the vectorized reader whether to use the validity buffer, but now we want to deprecate that.
In practice users will set this once for their Spark job, but for testing we want to validate both paths. My implementation here just optimizes for the majority of the existing test cases, but it misses validating the behavior when this is set to false, which is the default because it performs better.
Long story short, I'm thinking we should still expose a method for setting the validity buffer, but make it package-private. That method would be used only for testing, to construct a Parquet reader depending on what we want to test.
Thoughts, @aokolnychyi @singhpk234?
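To illustrate the static final behavior described above, here is a minimal, self-contained sketch. It does not use Arrow: NullCheckFlag and the demo.enable_null_check property are hypothetical stand-ins for NullCheckingForGet and arrow.enable_null_check_for_get, assumed to behave the way this comment describes.

```java
public class NullCheckFlagDemo {

    // Hypothetical stand-in for a class that caches a system property
    // in a static final field. Not the real Arrow class.
    static final class NullCheckFlag {
        // Evaluated exactly once, when the JVM initializes this class.
        static final boolean ENABLED =
            Boolean.parseBoolean(System.getProperty("demo.enable_null_check", "false"));
    }

    // Sets the (hypothetical) property, reads the flag, flips the property,
    // and reads the flag again. Returns {firstRead, secondRead}.
    static boolean[] observe() {
        System.setProperty("demo.enable_null_check", "true");
        boolean first = NullCheckFlag.ENABLED;   // first access triggers class initialization
        System.setProperty("demo.enable_null_check", "false");
        boolean second = NullCheckFlag.ENABLED;  // still the cached value, not re-read
        return new boolean[] {first, second};
    }

    public static void main(String[] args) {
        boolean[] reads = observe();
        System.out.println("first read:  " + reads[0]);
        System.out.println("second read: " + reads[1]);
    }
}
```

Both reads return the same value even though the property changed in between, which is why a single test JVM cannot exercise both settings just by calling System.setProperty between tests.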
I think I'm overcomplicating this. Since Iceberg performs the nullability check anyway, there's not much value in having our test path validate the Arrow validity buffer (which was the premise of https://github.com/apache/iceberg/pull/6550/files). I think we can just remove the assertions related to checkArrowValidityVector from these tests.
I am not very familiar with the vectorized code, but it looks good to me according to the Javadoc.
Thanks, @amogh-jahagirdar! Thanks for reviewing, @szehon-ho!
Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release
cc: @szehon-ho @aokolnychyi @nastra @singhpk234