Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39312][SQL] Use parquet native In predicate for in filter push down #36696

Closed
wants to merge 9 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Since now Parquet supports its native in predicate, we want to simplify the current In filter pushdown using Parquet's native in predicate.

Why are the changes needed?

code enhancement

Does this PR introduce any user-facing change?

No

How was this patch tested?

modify the existing tests

@github-actions github-actions bot added the SQL label May 27, 2022
@huaxingao
Copy link
Contributor Author

cc @wangyum

Comment on lines 730 to 737
if (values.length <= pushDownInFilterThreshold) {
values.distinct.flatMap { v =>
makeEq.lift(fieldType).map(_(fieldNames, v))
}.reduceLeftOption(FilterApi.or)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a regression, For example:

In(id, 1, 3, 5, 100000)

Before this PR, the pushded predicate is: id = 1 or id = 3 or id = 5 or id = 100000,
the current pushded predicate is: id >= 1 and id <= 100000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test coverage for @wangyum 's case, please, @huaxingao ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change back to use equal Or equal if the number of values in the In set <= pushDownInFilterThreshold. Only use parquet In predicate if the number of values in the In set > pushDownInFilterThreshold. The existing tests coverage should be good, just need to change the expected type back to equal Or equal when the number of values in the In set <= pushDownInFilterThreshold.

PartialFunction[ParquetSchemaType, (Array[String], Any) => FilterPredicate] = {
case ParquetBooleanType =>
(n: Array[String], v: Any) =>
val values = Option(v).map(_.asInstanceOf[Array[Object]]).orNull
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If orNull is hit, for (value <- values) { ... } will throw NPE

ornull

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. I don't think v could be null, but this line is actually not needed, so I deleted.

@dongjoon-hyun
Copy link
Member

cc @sunchao

@huaxingao
Copy link
Contributor Author

@wangyum is helping me testing this because he has lots of in/notIn test cases. I will change this PR to draft for now.

@huaxingao huaxingao marked this pull request as draft June 6, 2022 00:04
@wangyum
Copy link
Member

wangyum commented Jun 6, 2022

Please wait me several days.

@wangyum
Copy link
Member

wangyum commented Jun 16, 2022

I have tested it more than 2 weeks and no data issue.

@huaxingao
Copy link
Contributor Author

@wangyum Thank you very much for helping me test this!

@dongjoon-hyun
Copy link
Member

What is the next step for us, @huaxingao ?

@huaxingao
Copy link
Contributor Author

@dongjoon-hyun Thanks for the ping. Give me a couple of more days. I want to check one more time before mark this ready for review.

@dongjoon-hyun
Copy link
Member

No problem~ Thank you for informing that, @huaxingao . Take your time.

@wangyum
Copy link
Member

wangyum commented Jul 6, 2022

Could you update the FilterPushdownBenchmark results?

@dongjoon-hyun
Copy link
Member

Gentle ping, @huaxingao .

@huaxingao huaxingao marked this pull request as ready for review July 11, 2022 04:50
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update. Could you check the relevant failures?

[error] Failed tests:
[error] 	org.apache.spark.sql.execution.datasources.parquet.ParquetV2FilterSuite
[error] 	org.apache.spark.sql.execution.datasources.parquet.ParquetV1FilterSuite

@dongjoon-hyun
Copy link
Member

Could you re-trigger once more, @huaxingao ?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

cc @sunchao and @viirya , too.

@wangyum wangyum closed this in fa8d936 Jul 16, 2022
@wangyum
Copy link
Member

wangyum commented Jul 16, 2022

Merged to master.

@huaxingao
Copy link
Contributor Author

Thanks you all very much!

@huaxingao huaxingao deleted the in_predicate branch July 16, 2022 16:33
@dongjoon-hyun
Copy link
Member

Thank you, @huaxingao and all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants