[SPARK-16010] [SQL] Code Refactoring, Test Case Improvement and Description Updates for SQLConf spark.sql.parquet.filterPushdown #13728

gatorsmile · 2016-06-17T03:52:39Z

What changes were proposed in this pull request?

Starting with Spark 2.0, vectorized decoding was introduced to improve Parquet read performance. This feature changes the filter pushdown behavior of Parquet reads. This PR therefore updates the outdated descriptions of two external SQLConf entries: spark.sql.parquet.filterPushdown and spark.sql.parquet.enableVectorizedReader.

The PR also slightly simplifies the code for building parquetReader. cc @davies @liancheng @marmbrus

Because the current test cases do not verify the behavior when spark.sql.parquet.filterPushdown is set to false, this PR adds a test case to improve coverage. It also improves the test case for Parquet file paths that point to either non-existent files or non-existent hosts.
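For illustration only, here is a minimal sketch of the kind of check described above: the same filtered read should return the same rows whether spark.sql.parquet.filterPushdown is true or false. It is not the test added by this PR. The object name, the temp-directory handling, and the local master setting are assumptions, and it needs Spark 2.x on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical standalone sketch; the real test lives in ParquetFilterSuite.
object FilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: run locally
      .appName("filter-pushdown-sketch")
      .getOrCreate()
    import spark.implicits._

    // Write a tiny three-row table, mirroring the data used in the diff below.
    val path = Files.createTempDirectory("pushdown").resolve("table1").toString
    (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

    // The query result must not depend on whether the filter is pushed down.
    for (pushdown <- Seq("true", "false")) {
      spark.conf.set("spark.sql.parquet.filterPushdown", pushdown)
      val rows = spark.read.parquet(path).where("a = 2").collect()
      assert(rows.length == 1, s"unexpected row count with filterPushdown=$pushdown")
    }

    spark.stop()
  }
}
```

Toggling the setting through spark.conf.set keeps both runs in one session, so the only variable is the pushdown flag.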

How was this patch tested?

Added the related test cases.

SparkQA · 2016-06-17T05:36:01Z

Test build #60680 has finished for PR 13728 at commit ad1f18c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-06-17T06:17:50Z

...e/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

+        withTempPath { dir =>
+          val path = s"${dir.getCanonicalPath}/table1"
+          (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+          // When a filter is pushed to Parquet, Parquet can apply it to every row.


Is this true? I thought the filter is only applied at the row-group level.

If the vectorized reader is disabled, it falls back to Parquet's own reader, which filters row by row as well.

@davies You are right. Sorry, I simply copied this comment from the other test cases. Let me remove all of them. Thanks!

@HyukjinKwon Davies's comment is just about how Parquet prunes rows: it does so at the row-group level.

Based on the test cases, it sounds like each row group contains only one row... Not sure how Parquet implements it?

HyukjinKwon · 2016-06-17T07:07:19Z

@gatorsmile hm.. Doesn't Parquet filter2 filter individual rows as well as prune row groups? I think I wrote the copied test earlier..

I may have misunderstood your comments above, but as far as I remember it filters row by row and also at the row-group level. See here
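To make the distinction in this thread concrete, here is a toy model in plain Scala. It is only a sketch: `RowGroup`, `pruneRowGroups`, and `filterRows` are hypothetical names, not Parquet or Spark APIs, and the min/max check stands in for whatever statistics a real reader consults.

```scala
// Hypothetical model of a row group; assumes `rows` is non-empty.
final case class RowGroup(rows: Vector[Int]) {
  def min: Int = rows.min
  def max: Int = rows.max
}

object PushdownModel {
  // Row-group pruning: drop a group only when its min/max statistics
  // prove that no row in it can equal `target`.
  def pruneRowGroups(groups: Seq[RowGroup], target: Int): Seq[RowGroup] =
    groups.filter(g => g.min <= target && target <= g.max)

  // Row-level filtering: evaluate the predicate against every remaining row.
  def filterRows(groups: Seq[RowGroup], target: Int): Seq[Int] =
    groups.flatMap(_.rows).filter(_ == target)

  def main(args: Array[String]): Unit = {
    val groups = Seq(RowGroup(Vector(1, 2, 3)), RowGroup(Vector(10, 11, 12)))
    val pruned = pruneRowGroups(groups, 2)
    println(pruned.flatMap(_.rows)) // rows that survive pruning alone
    println(filterRows(pruned, 2))  // rows that survive pruning plus row-level filtering
  }
}
```

In this model, pruning alone discards the second group but still returns all three rows of the first one; only the row-level pass narrows the result to the single matching row. That is why a test on a three-row file cannot tell the two stages apart unless each row lands in its own row group.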

gatorsmile · 2016-06-17T07:15:59Z

Uh... I see, @HyukjinKwon. I did not realize this filter could be applied more than once. Let me revert the changes. Thanks!

SparkQA · 2016-06-17T08:46:28Z

Test build #60691 has finished for PR 13728 at commit 9967cc7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-17T09:03:39Z

Test build #60692 has finished for PR 13728 at commit d1b2cbb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-06-22T04:38:43Z

retest this please

SparkQA · 2016-06-22T06:22:58Z

Test build #61006 has finished for PR 13728 at commit d1b2cbb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile added 3 commits June 16, 2016 19:59

fix

a7b89bd

update the comment

a1da798

update the document

ad1f18c

davies reviewed Jun 17, 2016

update the comments.

9967cc7

gatorsmile mentioned this pull request Jun 17, 2016

[SPARK-15639][SPARK-16321][SQL] Push down filter at RowGroups level for parquet reader #13701

Closed

update the comments.

d1b2cbb

gatorsmile closed this Nov 7, 2016
