
Conversation

@viirya
Member

@viirya viirya commented Apr 5, 2015

Currently in DataSourceStrategy, the pushed-down predicates are duplicates of the original Filter conditions, so some predicates are evaluated both by the data source relation and by the Filter plan. I think this is redundant work. Once a predicate has been pushed down and evaluated by the data source relation, it
can be eliminated from the outer Filter plan's condition.
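The duplication described above can be sketched as follows. This is a minimal, illustrative Python sketch of the planning logic only; Spark's actual DataSourceStrategy is written in Scala, and every name here (`plan_scan`, `source_supports`) is hypothetical, not the real API:

```python
# Illustrative sketch: how a planner might split predicates between a
# data source scan and a residual Filter node. All names are hypothetical;
# Spark's real DataSourceStrategy is Scala code.

def plan_scan(predicates, source_supports):
    """Split predicates into those pushed to the source and those the
    outer Filter node must still evaluate."""
    pushed = [p for p in predicates if source_supports(p)]
    # Behaviour this PR complains about: the Filter node keeps ALL
    # predicates, so pushed ones are evaluated twice.
    duplicated_residual = list(predicates)
    # Proposed behaviour: keep only predicates the source cannot handle.
    residual = [p for p in predicates if not source_supports(p)]
    return pushed, duplicated_residual, residual

preds = ["a > 1", "b LIKE '%x%'"]
# Pretend the source handles simple comparisons but not LIKE.
supports = lambda p: "LIKE" not in p
pushed, dup, res = plan_scan(preds, supports)
# pushed keeps "a > 1"; dup keeps both predicates; res keeps only the LIKE.
```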

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29727/

Contributor


How about renaming this to unrecognizedFilters?

@liancheng
Contributor

Thanks for working on this! I left several comments about styling. However, I just realized that adding a new method supportPredicate to PrunedFilteredScan breaks backwards compatibility of the data sources API. This makes me hesitant to take this change.

Incidentally, I just wrote a design doc for improving the data sources API and adding partitioning support to it. This very issue is covered there by adding a new trait rather than a new method to an existing trait. The doc should be made public very soon. Would you mind if we revisit this later? I'll keep you updated.

@viirya
Member Author

viirya commented Apr 6, 2015

Good suggestions. I will address the styling comments first and then wait for your design doc. Thanks.

@marmbrus
Contributor

marmbrus commented Apr 9, 2015

Thanks for your suggestion. As @liancheng mentioned, we cannot add methods to interfaces after they have been made public. Doing so would break libraries and break our compatibility guarantees. As such, I think we need to close this issue until the next version of the API (probably not until Spark 2.0). In general, evaluating filters is a pretty fast operation anyway.

@viirya
Member Author

viirya commented Apr 9, 2015

Okay. However, I think it is still important to keep the query plan simple by removing redundant parts, even when those parts are fast operations.

E.g., for this PR, the redundant filters can cause real problems. First, because the same predicates appear both in the pushed-down set and in the Filter node, you cannot tell which one actually performed the filtering. We have run into this before: pushed-down predicates were silently not working, and the duplicate Filter masked it.

Second, if the data source implements a predicate with different semantics than Filter does, evaluating the duplicate predicate twice can produce wrong results.
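The second problem can be illustrated with a toy example. This is a hypothetical sketch, not Spark code: suppose a source matches strings case-insensitively while the Filter node re-applies the same predicate case-sensitively.

```python
# Illustrative sketch: double-evaluating a predicate whose semantics
# differ between the source and the Filter node changes the result.
# The data and the case-insensitive source behaviour are hypothetical.

rows = [{"name": "Alice"}, {"name": "ALICE"}, {"name": "Bob"}]

# A source that evaluates name = 'alice' case-insensitively keeps 2 rows:
source_scan = [r for r in rows if r["name"].lower() == "alice"]

# A duplicate Filter re-applies the predicate case-sensitively,
# silently dropping both rows the source intended to keep:
double_filtered = [r for r in source_scan if r["name"] == "alice"]
```

With deduplication, the source's answer (2 rows) stands; with the duplicate Filter, 0 rows survive and results silently change.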

The time spent on filtering depends on how many predicates there are, how complicated they are, and the data size, so redundant filtering still carries some performance cost.

As @liancheng said, I think it would be good to revisit this after the new API comes out.

Thank you.

@viirya viirya closed this Apr 9, 2015
@viirya viirya deleted the de_dup_filter branch December 27, 2023 18:31