Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release #7591

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented May 12, 2023

Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release
cc: @szehon-ho @aokolnychyi @nastra @singhpk234

@amogh-jahagirdar
Copy link
Contributor Author

Ah looks like some tests still reference the deprecated methods. Will fix those

@amogh-jahagirdar amogh-jahagirdar force-pushed the 1.3.0-vectorized-reader-deprecation branch from 8785e73 to 9b27b66 Compare May 12, 2023 03:39
@amogh-jahagirdar amogh-jahagirdar force-pushed the 1.3.0-vectorized-reader-deprecation branch 3 times, most recently from 0bbb231 to c7b4c4b Compare May 12, 2023 04:01
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented May 12, 2023

I need to update the tests so that the arrow null checking property is enabled when relevant

@amogh-jahagirdar amogh-jahagirdar force-pushed the 1.3.0-vectorized-reader-deprecation branch 2 times, most recently from e1f2908 to a069e7b Compare May 12, 2023 18:12
Comment on lines 61 to 63
static {
System.setProperty("arrow.enable_null_check_for_get", "true");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably is not the right way to handle the failing tests. So for context what happens is that here we always use the NullCheckingForGet.NULL_CHECKING_ENABLED value which is a final static variable which is set once based on the value of the arrow.enable_null_check_for_get property.

Right now we have some tests (most of which want to validate the validity buffer) and a few which do not. It's not possible to dynamically set the property for these different cases because it's static final, once it's set, every read of the value will just yield the original value.

Before we had an API to explicitly passing in to the vectorized reader if we should use the validity buffer, but now we want to deprecate that.

In practice users will set this once for their Spark job but for the purpose of testing we want to validate both paths (my implementation here just optimizes for the majority of the existing test cases, but misses out on validating the behavior when this is set to false which is the default due to better performance.

Long story short, I'm thinking we should still expose a method but it will be package private, for setting the validity buffer. this package private method would be used for the purpose of testing, and constructing a parquet reader depending on what we want to test.

Thoughts @aokolnychyi @singhpk234 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm overcomplicating this. since Iceberg performs the nullability check anyways there's not much value for our test path to even validate the arrow validity buffer (which was the premise of https://github.com/apache/iceberg/pull/6550/files). I think we can just remove the assertions related to checkArrowValidityVector from these tests

@amogh-jahagirdar amogh-jahagirdar force-pushed the 1.3.0-vectorized-reader-deprecation branch from a069e7b to a0838e2 Compare May 15, 2023 00:26
Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with vectornized code, but it looks good to me according to the javadoc

@aokolnychyi aokolnychyi merged commit 04a79ec into apache:master May 16, 2023
31 checks passed
@aokolnychyi
Copy link
Contributor

Thanks, @amogh-jahagirdar! Thanks for reviewing, @szehon-ho!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants