SPARK-1487 [SQL] Support record filtering via predicate pushdown in Parquet #511

Closed
wants to merge 13 commits

Conversation

AndreSchumacher
Contributor

Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other a NamedExpression, are now pushed down to the underlying ParquetTableScan. Here are some results from a microbenchmark with a simple schema of six fields of different types, where most records failed the filter:

          | Uncompressed | Compressed
----------|--------------|-----------
File size | 10 GB        | 2 GB
Speedup   | 2            | 1.8

Since mileage may vary, I added a new option to SparkConf:

`org.apache.spark.sql.parquet.filter.pushdown`

The default value is `true`; setting it to `false` disables the pushdown. Performance can be better with pushdown disabled when most rows are expected to pass the filter or when there are only a few fields. The default should fit situations with a reasonable number of (possibly nested) fields where, on average, not too many records pass the filter.

Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371)), currently only predicates on non-nullable attributes are pushed down. If one knew that for a given table no optional fields have missing values, one could allow overriding this.
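
For illustration, here is a minimal sketch of how this option could be set from application code. This is an assumption-laden example rather than part of the patch: the key is the one quoted above (it gets renamed later in this PR), and the master and app name are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Disable the Parquet filter pushdown for this application (the default is true).
// NOTE: this key is the name quoted in the description; it is renamed later in the PR.
val conf = new SparkConf()
  .setMaster("local[*]")                 // made-up master for a local sketch
  .setAppName("ParquetPushdownSketch")   // made-up app name
  .set("org.apache.spark.sql.parquet.filter.pushdown", "false")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// With pushdown enabled, simple comparisons such as `age < 30` in a WHERE clause
// would be evaluated inside ParquetTableScan instead of in a separate Filter operator.
```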

@AndreSchumacher
Contributor Author

@marmbrus would be great if you could have a look when you have some time. Thanks!

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AndreSchumacher
Contributor Author

This PR will need to be revised depending on the outcome of #482

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14381/

// Note: we do not actually remove the filters that were pushed down to Parquet from
// the plan, in case some of the predicates cannot be evaluated there because
// they contain complex operations, such as CASTs.
// TODO: rethink whether conjunctions that are handed down to Parquet should be removed
Contributor

How hard would it be to move the logic that determines what filters can be pushed down here, into the planner, so that we can avoid the double evaluation?

@marmbrus
Contributor

Very cool. This should be a pretty big performance win. Only minor comments.

Regarding the question about normalizing expressions: it could be done in a rule. However, I think we can probably greatly simplify all of that logic with code gen (hopefully coming in 1.1). So, given that you have already written out all of the cases, I don't think we need to do further simplification in this PR.

@mateiz
Contributor

mateiz commented Apr 25, 2014

@marmbrus @AndreSchumacher do we really want a SparkConf option for this? I'd rather minimize the number of options and add rules in the optimizer later to decide when to do this. These kinds of options are esoteric and very hard for users to configure.

@marmbrus
Contributor

@mateiz good point. I agree that, long term, this decision should be up to the optimizer. However, in this case I think the right thing is probably to create a section of the SQL config called hints. We don't promise to obey or support hints long term, but they can be used for immediate optimizations. This seems like the typical DB way to offset the cases where the optimization is currently poor.

@mateiz
Contributor

mateiz commented Apr 25, 2014

But are there any realistic workloads where you'd want to turn this on all the time, or turn it off all the time? It seems that in an ad-hoc query workload, you'll have some queries that can use this, and some that can't. You should just pick whether you want it as a default. Personally I'd go for it unless the cost is super high in the cases where it doesn't work, because I imagine filtering is pretty common in large schemas and I hope Parquet itself optimizes this down the line.

@mateiz
Contributor

mateiz commented Apr 25, 2014

BTW if you do add a config setting, a better name would be spark.sql.hints.parquetFilterPushdown; our other setting names don't start with org.apache. An even better option though might be to put it in the SQL statement itself, so users can do it on a per-query basis, though that probably requires nasty changes to the parser. But I'd still prefer to either always turn this on (if the penalty isn't huge) or leave it off for now and not introduce a new setting.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AndreSchumacher
Contributor Author

@marmbrus @mateiz Thanks a lot for the comments and the fast response.

About the config setting: I would feel more comfortable setting a default after there has been some experience with realistic workloads and schemas. But I renamed it now, as suggested by Matei.

The bigger changes in my last commit keep track of what is actually pushed down and why. Predicates that are "completely" pushed down are then removed inside the planner. Note that attempting to push "A & B" can result in only "A" being pushed, when B contains anything other than a simple comparison of a column value against a literal. In this case "A & B" should be kept for now (IMHO). There is still an advantage in pushing A, since hopefully fewer records then pass the filter to the higher level.
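
To make the splitting described above concrete, here is a hedged, self-contained sketch using toy expression types (stand-ins, not Catalyst's actual classes): a conjunction is partitioned into the simple column-vs-literal comparisons that Parquet could evaluate and the conjuncts that cannot be pushed.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class LessThan(left: Expr, right: Expr) extends Expr
case class Cast(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

// A conjunct is pushable only if it is a simple "Attribute cmp Literal" comparison.
def isPushable(e: Expr): Boolean = e match {
  case LessThan(_: Attr, _: Lit) => true
  case LessThan(_: Lit, _: Attr) => true
  case _                         => false // e.g. anything involving a CAST stays in Spark
}

// Partition a conjunction into (pushable, notPushable) conjuncts.
def split(pred: Expr): (Seq[Expr], Seq[Expr]) = pred match {
  case And(l, r) =>
    val (pl, nl) = split(l)
    val (pr, nr) = split(r)
    (pl ++ pr, nl ++ nr)
  case e if isPushable(e) => (Seq(e), Nil)
  case e                  => (Nil, Seq(e))
}

// "A AND B" where B contains a CAST: only A can be handed to Parquet, and per the
// comment above the full "A AND B" filter is still kept at the higher level.
val (pushable, notPushable) =
  split(And(LessThan(Attr("x"), Lit(10)), LessThan(Cast(Attr("y")), Lit(5))))
```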

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14481/

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14482/

// Note: filters cannot be pushed down to Parquet if they contain more complex
// expressions than simple "Attribute cmp Literal" comparisons. Here we remove
// all filters that have been pushed down. Note that a predicate such as
// "A AND B" can result in "A" being pushed down.
Contributor

However, we will never get A AND B here, right? I think any conjunctive predicates will already have been split up by PhysicalOperation.

Contributor Author

Good point, bad example. That's why I initially didn't treat ANDs at all when creating the filters from the expressions. But then I thought one could have expressions such as (A AND B) OR C, which should probably be handled in the planner and turned into (A OR C) AND (B OR C), but currently are not. Please correct me if I am wrong. It may be that the parser doesn't currently allow this kind of parenthesized filter expression, although nothing speaks against it, I guess.
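
For concreteness, a hedged sketch of the (A AND B) OR C rewrite mentioned above, reusing the toy Expr types from the earlier sketch plus an Or node; this is only an illustration of the idea, not how Catalyst implements it.

```scala
case class Or(left: Expr, right: Expr) extends Expr

// Distribute OR over AND so that more conjuncts become individually pushable:
// (A AND B) OR C  ==>  (A OR C) AND (B OR C)
def distribute(e: Expr): Expr = e match {
  case Or(And(a, b), c) => And(distribute(Or(a, c)), distribute(Or(b, c)))
  case Or(a, And(b, c)) => And(distribute(Or(a, b)), distribute(Or(a, c)))
  case And(a, b)        => And(distribute(a), distribute(b))
  case Or(a, b)         => Or(distribute(a), distribute(b))
  case other            => other
}
```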

@marmbrus
Contributor

Cool, thanks for renaming it.

@mateiz I don't think we should even include these hints in the docs (unless we find particularly useful ones), as I agree that presenting too much complexity to users is a bad idea. However, even just for our own benchmarking, recompiling to change these settings is just not feasible, and it's really hard to predict performance without actually running things. Also, when I've talked about building Catalyst with experienced database people, basically everyone said, "No matter how good you think your optimizer is, always make sure you have knobs to control it, because it is going to be wrong."

Having these hints in the language could maybe be nice, but I really don't think that is worth the engineering effort of not only changing the parser, but also making sure they get threaded through analysis, optimization and planning correctly. Language-based hints would also change depending on whether you are using SQL, HQL, or the DSL.

Having a special conf mechanism that lets you set them on a per query basis would be nice. I'm not sure how flexible the SparkConf infrastructure is in this regard, but might be something to consider. I can imagine cases where this might even be useful for standard spark jobs.

@mateiz
Contributor

mateiz commented Apr 26, 2014

Okay, it sounds good then as a hidden parameter.

Changes:
- Predicates that are pushed down to Parquet are now kept track of
- Predicates which are pushed down are removed from the higher-level
  filters
- Filter enable setting renamed to "spark.sql.hints.parquetFilterPushdown"
- Smaller changes, code formatting, imports, etc.
@AmplabJenkins

Merged build triggered.

@AndreSchumacher
Contributor Author

Jenkins, test this please

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15047/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15048/

@AndreSchumacher
Contributor Author

@rxin Thanks for the note. I just rebased it.

AndreSchumacher added a commit that referenced this pull request May 16, 2014
…arquet

Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other one a NamedExpression are now pushed down to the underlying ParquetTableScan. Here are some results for a microbenchmark with a simple schema of six fields of different types where most records failed the test:

          | Uncompressed | Compressed
----------|--------------|-----------
File size | 10 GB        | 2 GB
Speedup   | 2            | 1.8

Since mileage may vary I added a new option to SparkConf:

`org.apache.spark.sql.parquet.filter.pushdown`

Default value would be `true` and setting it to `false` disables the pushdown. When most rows are expected to pass the filter or when there are few fields performance can be better when pushdown is disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter.

Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371)) currently only predicates on non-nullable attributes are pushed down. If one would know that for a given table no optional fields have missing values one could also allow overriding this.

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #511 from AndreSchumacher/parquet_filter and squashes the following commits:

16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase
7b304ca [Andre Schumacher] Fixing formatting
c36d5cb [Andre Schumacher] Scalastyle
3da98db [Andre Schumacher] Second round of review feedback
7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter
a86553b [Andre Schumacher] First round of code review feedback
b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData
85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown
f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR
210e9cb [Andre Schumacher] Adding disjunctive filter predicates
a93a588 [Andre Schumacher] Adding unit test for filtering
6d22666 [Andre Schumacher] Extending ParquetFilters
93e8192 [Andre Schumacher] First commit Parquet record filtering
@AndreSchumacher
Contributor Author

Closing this now since it got merged. Thanks everyone.

asfgit pushed a commit that referenced this pull request Jun 13, 2014
#511 and #863 got left out of branch-1.0 since we were really close to the release.  Now that they have been tested a little I see no reason to leave them out.

Author: Michael Armbrust <michael@databricks.com>
Author: witgo <witgo@qq.com>

Closes #1078 from marmbrus/branch-1.0 and squashes the following commits:

22be674 [witgo]  [SPARK-1841]: update scalatest to version 2.1.5
fc8fc79 [Michael Armbrust] Include #1071 as well.
c5d0adf [Michael Armbrust] Update SparkSQL in branch-1.0 to match master.
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException.  I applied the same fix to the Spark Streaming Java APIs.  The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
(cherry picked from commit c66a2ef)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException.  I applied the same fix to the Spark Streaming Java APIs.  The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
(cherry picked from commit c66a2ef)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Signed-off-by: Melvin Hillsman <mrhillsman@gmail.com>