SPARK-1487 [SQL] Support record filtering via predicate pushdown in Parquet #511

Closed
wants to merge 13 commits

Conversation

AndreSchumacher
Contributor

Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other a NamedExpression, are now pushed down to the underlying ParquetTableScan. Here are some results from a microbenchmark with a simple schema of six fields of different types, where most records failed the filter:

          | Uncompressed | Compressed
----------|--------------|-----------
File size | 10 GB        | 2 GB
Speedup   | 2            | 1.8

Since mileage may vary, I added a new option to SparkConf:

`org.apache.spark.sql.parquet.filter.pushdown`

The default value is `true`; setting it to `false` disables the pushdown. Performance can be better with pushdown disabled when most rows are expected to pass the filter or when there are only a few fields. The default should fit situations with a reasonable number of (possibly nested) fields where, on average, not too many records pass the filter.

Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371)), currently only predicates on non-nullable attributes are pushed down. If one knew that for a given table no optional fields have missing values, one could allow overriding this.
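
For illustration, here is a minimal sketch of how this option could be set from application code. This is an assumption-laden example rather than part of the patch: the key is the one quoted above (it gets renamed later in this PR), and the master and app name are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Disable the Parquet filter pushdown for this application (the default is true).
// NOTE: this key is the name quoted in the description; it is renamed later in the PR.
val conf = new SparkConf()
  .setMaster("local[*]")                 // made-up master for a local sketch
  .setAppName("ParquetPushdownSketch")   // made-up app name
  .set("org.apache.spark.sql.parquet.filter.pushdown", "false")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// With pushdown enabled, simple comparisons such as `age < 30` in a WHERE clause
// would be evaluated inside ParquetTableScan instead of in a separate Filter operator.
```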

@AndreSchumacher
Contributor Author

@marmbrus would be great if you could have a look when you have some time. Thanks!

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AndreSchumacher
Contributor Author

This PR will need to be revised depending on the outcome of #482

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14381/

// Note: we do not actually remove the filters that were pushed down to Parquet from
// the plan, in case some of the predicates cannot be evaluated there because
// they contain complex operations, such as CASTs.
// TODO: rethink whether conjunctions that are handed down to Parquet should be removed
Contributor

How hard would it be to move the logic that determines what filters can be pushed down here, into the planner, so that we can avoid the double evaluation?

@marmbrus
Contributor

Very cool. This should be a pretty big performance win. Only minor comments.

Regarding the question about normalizing expressions: it could be done in a rule. However, I think we can probably greatly simplify all of that logic with code gen (hopefully coming in 1.1). So, given that you have already written out all of the cases, I don't think we need to do further simplification in this PR.

@mateiz
Contributor

mateiz commented Apr 25, 2014

@marmbrus @AndreSchumacher do we really want a SparkConf option for this? I'd rather minimize the number of options and add rules in the optimizer later to decide when to do this. These kinds of options are esoteric and very hard for users to configure.

@marmbrus
Contributor

@mateiz good point. I agree that, long term, this decision should be up to the optimizer. However, in this case I think the right thing is probably to create a section of the SQL config called hints. We don't promise to obey or support hints long term, but they can be used for immediate optimizations. This seems like the typical DB way to offset the cases where the optimization is currently poor.

@mateiz
Contributor

mateiz commented Apr 25, 2014

But are there any realistic workloads where you'd want to turn this on all the time, or turn it off all the time? It seems that in an ad-hoc query workload, you'll have some queries that can use this, and some that can't. You should just pick whether you want it as a default. Personally I'd go for it unless the cost is super high in the cases where it doesn't work, because I imagine filtering is pretty common in large schemas and I hope Parquet itself optimizes this down the line.

@mateiz
Contributor

mateiz commented Apr 25, 2014

BTW if you do add a config setting, a better name would be spark.sql.hints.parquetFilterPushdown; our other setting names don't start with org.apache. An even better option though might be to put it in the SQL statement itself, so users can do it on a per-query basis, though that probably requires nasty changes to the parser. But I'd still prefer to either always turn this on (if the penalty isn't huge) or leave it off for now and not introduce a new setting.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AndreSchumacher
Contributor Author

@marmbrus @mateiz Thanks a lot for the comments and the fast response.

About the config setting: I would feel more comfortable setting a default after there has been some experience with realistic workloads and schemas. But I renamed it now, as suggested by Matei.

The bigger changes in my last commit keep track of what is actually pushed down and why. Predicates that are "completely" pushed down are then removed inside the planner. Note that attempting to push "A & B" can result in only "A" being pushed, when B contains anything other than a simple comparison of a column value against a literal. In this case "A & B" should be kept for now (IMHO). There is still an advantage in pushing A, since hopefully fewer records then pass the filter to the higher level.
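
To make the splitting described above concrete, here is a hedged, self-contained sketch using toy expression types (stand-ins, not Catalyst's actual classes): a conjunction is partitioned into the simple column-vs-literal comparisons that Parquet could evaluate and the conjuncts that cannot be pushed.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class LessThan(left: Expr, right: Expr) extends Expr
case class Cast(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

// A conjunct is pushable only if it is a simple "Attribute cmp Literal" comparison.
def isPushable(e: Expr): Boolean = e match {
  case LessThan(_: Attr, _: Lit) => true
  case LessThan(_: Lit, _: Attr) => true
  case _                         => false // e.g. anything involving a CAST stays in Spark
}

// Partition a conjunction into (pushable, notPushable) conjuncts.
def split(pred: Expr): (Seq[Expr], Seq[Expr]) = pred match {
  case And(l, r) =>
    val (pl, nl) = split(l)
    val (pr, nr) = split(r)
    (pl ++ pr, nl ++ nr)
  case e if isPushable(e) => (Seq(e), Nil)
  case e                  => (Nil, Seq(e))
}

// "A AND B" where B contains a CAST: only A can be handed to Parquet, and per the
// comment above the full "A AND B" filter is still kept at the higher level.
val (pushable, notPushable) =
  split(And(LessThan(Attr("x"), Lit(10)), LessThan(Cast(Attr("y")), Lit(5))))
```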

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14481/

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14482/

// Note: filters cannot be pushed down to Parquet if they contain more complex
// expressions than simple "Attribute cmp Literal" comparisons. Here we remove
// all filters that have been pushed down. Note that a predicate such as
// "A AND B" can result in "A" being pushed down.
Contributor

However, we will never get A AND B here, right? I think any conjunctive predicates will already have been split up by PhysicalOperation.

Contributor Author

Good point, bad example. That's why I initially didn't treat ANDs at all when creating the filters from the expressions. But then I thought one could have expressions such as (A AND B) OR C, which should probably be handled in the planner and turned into (A OR C) AND (B OR C), but currently are not. Please correct me if I am wrong. It may be that the parser doesn't currently allow this kind of parenthesized filter expression, although nothing speaks against it, I guess.
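
For concreteness, a hedged sketch of the (A AND B) OR C rewrite mentioned above, reusing the toy Expr types from the earlier sketch plus an Or node; this is only an illustration of the idea, not how Catalyst implements it.

```scala
case class Or(left: Expr, right: Expr) extends Expr

// Distribute OR over AND so that more conjuncts become individually pushable:
// (A AND B) OR C  ==>  (A OR C) AND (B OR C)
def distribute(e: Expr): Expr = e match {
  case Or(And(a, b), c) => And(distribute(Or(a, c)), distribute(Or(b, c)))
  case Or(a, And(b, c)) => And(distribute(Or(a, b)), distribute(Or(a, c)))
  case And(a, b)        => And(distribute(a), distribute(b))
  case Or(a, b)         => Or(distribute(a), distribute(b))
  case other            => other
}
```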

@marmbrus
Contributor

Cool, thanks for renaming it.

@mateiz I don't think we should even include these hints in the docs (unless we find particularly useful ones), as I agree that presenting too much complexity to users is a bad idea. However, even just for our own benchmarking, recompiling to change these settings is just not feasible, and it's really hard to predict performance without actually running things. Also, when I've talked about building Catalyst with experienced database people, basically everyone said, "No matter how good you think your optimizer is, always make sure you have knobs to control it, because it is going to be wrong."

Having these hints in the language could maybe be nice, but I really don't think that is worth the engineering effort of not only changing the parser, but also making sure they get threaded through analysis, optimization and planning correctly. Language-based hints would also change depending on whether you are using SQL, HQL, or the DSL.

Having a special conf mechanism that lets you set them on a per query basis would be nice. I'm not sure how flexible the SparkConf infrastructure is in this regard, but might be something to consider. I can imagine cases where this might even be useful for standard spark jobs.

@mateiz
Contributor

mateiz commented Apr 26, 2014

Okay, it sounds good then as a hidden parameter.

Changes:
- Predicates that are pushed down to Parquet are now kept track of
- Predicates which are pushed down are removed from the higher-level
  filters
- Filter enable setting renamed to "spark.sql.hints.parquetFilterPushdown"
- Smaller changes, code formatting, imports, etc.
@AmplabJenkins

Merged build triggered.

@AndreSchumacher
Contributor Author

Jenkins, test this please

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15047/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15048/

@AndreSchumacher
Contributor Author

@rxin Thanks for the note. I just rebased it.

AndreSchumacher added a commit that referenced this pull request May 16, 2014
…arquet

Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other one a NamedExpression are now pushed down to the underlying ParquetTableScan. Here are some results for a microbenchmark with a simple schema of six fields of different types where most records failed the test:

          | Uncompressed | Compressed
----------|--------------|-----------
File size | 10 GB        | 2 GB
Speedup   | 2            | 1.8

Since mileage may vary I added a new option to SparkConf:

`org.apache.spark.sql.parquet.filter.pushdown`

Default value would be `true` and setting it to `false` disables the pushdown. When most rows are expected to pass the filter or when there are few fields performance can be better when pushdown is disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter.

Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371)) currently only predicates on non-nullable attributes are pushed down. If one would know that for a given table no optional fields have missing values one could also allow overriding this.

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #511 from AndreSchumacher/parquet_filter and squashes the following commits:

16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase
7b304ca [Andre Schumacher] Fixing formatting
c36d5cb [Andre Schumacher] Scalastyle
3da98db [Andre Schumacher] Second round of review feedback
7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter
a86553b [Andre Schumacher] First round of code review feedback
b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData
85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown
f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR
210e9cb [Andre Schumacher] Adding disjunctive filter predicates
a93a588 [Andre Schumacher] Adding unit test for filtering
6d22666 [Andre Schumacher] Extending ParquetFilters
93e8192 [Andre Schumacher] First commit Parquet record filtering
@AndreSchumacher
Contributor Author

Closing this now since it got merged. Thanks everyone.

asfgit pushed a commit that referenced this pull request Jun 13, 2014
#511 and #863 got left out of branch-1.0 since we were really close to the release.  Now that they have been tested a little I see no reason to leave them out.

Author: Michael Armbrust <michael@databricks.com>
Author: witgo <witgo@qq.com>

Closes #1078 from marmbrus/branch-1.0 and squashes the following commits:

22be674 [witgo]  [SPARK-1841]: update scalatest to version 2.1.5
fc8fc79 [Michael Armbrust] Include #1071 as well.
c5d0adf [Michael Armbrust] Update SparkSQL in branch-1.0 to match master.
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException.  I applied the same fix to the Spark Streaming Java APIs.  The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
(cherry picked from commit c66a2ef)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException.  I applied the same fix to the Spark Streaming Java APIs.  The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
(cherry picked from commit c66a2ef)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Signed-off-by: Melvin Hillsman <mrhillsman@gmail.com>