
Conversation

@viirya
Member

@viirya viirya commented Apr 5, 2015

Currently in DataSourceStrategy, the pushed-down predicates are duplicates of the original Filter conditions, so some predicates are evaluated both by the data source relation and by the Filter plan. I think this is redundant work. Once a predicate has been pushed down and evaluated by the data source relation, it
can be eliminated from the outer Filter plan's condition.
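The duplication described above can be sketched as follows. This is a minimal, illustrative Python sketch of the planning logic only; Spark's actual DataSourceStrategy is written in Scala, and every name here (`plan_scan`, `source_supports`) is hypothetical, not the real API:

```python
# Illustrative sketch: how a planner might split predicates between a
# data source scan and a residual Filter node. All names are hypothetical;
# Spark's real DataSourceStrategy is Scala code.

def plan_scan(predicates, source_supports):
    """Split predicates into those pushed to the source and those the
    outer Filter node must still evaluate."""
    pushed = [p for p in predicates if source_supports(p)]
    # Behaviour this PR complains about: the Filter node keeps ALL
    # predicates, so pushed ones are evaluated twice.
    duplicated_residual = list(predicates)
    # Proposed behaviour: keep only predicates the source cannot handle.
    residual = [p for p in predicates if not source_supports(p)]
    return pushed, duplicated_residual, residual

preds = ["a > 1", "b LIKE '%x%'"]
# Pretend the source handles simple comparisons but not LIKE.
supports = lambda p: "LIKE" not in p
pushed, dup, res = plan_scan(preds, supports)
# pushed keeps "a > 1"; dup keeps both predicates; res keeps only the LIKE.
```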

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29727/

Contributor


How about renaming this to unrecognizedFilters?

@liancheng
Contributor

Thanks for working on this! I left several comments about styling. However, I just realized that adding a new method supportPredicate to PrunedFilteredScan breaks backwards compatibility of the data sources API. This makes me hesitant to take this change.

Incidentally, I just wrote a design doc for improving the data sources API and adding partitioning support to it. This very issue is covered there by adding a new trait rather than a new method to an existing trait. The doc should be made public very soon. Would you mind if we revisit this later? I'll keep you updated.

@viirya
Member Author

viirya commented Apr 6, 2015

Good suggestions. I will address the styling comments first and then wait for your design doc. Thanks.

@marmbrus
Contributor

marmbrus commented Apr 9, 2015

Thanks for your suggestion. As @liancheng mentioned, we cannot add methods to interfaces after they have been made public. Doing so would break libraries and break our compatibility guarantees. As such, I think we need to close this issue until the next version of the API (probably not until Spark 2.0). In general, evaluating filters is a pretty fast operation anyway.

@viirya
Member Author

viirya commented Apr 9, 2015

Okay. However, I think it is still important to keep the query plan simple by removing redundant parts, even when those parts are fast operations.

E.g., for this PR, the redundant filters can cause real problems. First, because the same predicates appear both in the pushed-down set and in the Filter node, you cannot tell which one actually performed the filtering. We have run into this before: pushed-down predicates were silently not working, and the duplicate Filter masked it.

Second, if the data source implements a predicate with different semantics than Filter does, evaluating the duplicate predicate twice can produce wrong results.
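The second problem can be illustrated with a toy example. This is a hypothetical sketch, not Spark code: suppose a source matches strings case-insensitively while the Filter node re-applies the same predicate case-sensitively.

```python
# Illustrative sketch: double-evaluating a predicate whose semantics
# differ between the source and the Filter node changes the result.
# The data and the case-insensitive source behaviour are hypothetical.

rows = [{"name": "Alice"}, {"name": "ALICE"}, {"name": "Bob"}]

# A source that evaluates name = 'alice' case-insensitively keeps 2 rows:
source_scan = [r for r in rows if r["name"].lower() == "alice"]

# A duplicate Filter re-applies the predicate case-sensitively,
# silently dropping both rows the source intended to keep:
double_filtered = [r for r in source_scan if r["name"] == "alice"]
```

With deduplication, the source's answer (2 rows) stands; with the duplicate Filter, 0 rows survive and results silently change.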

The time spent on filtering depends on how many predicates there are, how complicated they are, and the data size, so redundant filtering still carries some performance cost.

As @liancheng said, I think it would be good to revisit this after the new API comes out.

Thank you.

@viirya viirya closed this Apr 9, 2015
@viirya viirya deleted the de_dup_filter branch December 27, 2023 18:31