Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16010] [SQL] Code Refactoring, Test Case Improvement and Description Updates for SQLConf spark.sql.parquet.filterPushdown #13728

Closed
wants to merge 5 commits into from

Conversation

gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented Jun 17, 2016

What changes were proposed in this pull request?

Starting Spark 2.0, vectorized decoding is introduced for improving parquet reading performance. This feature changes the filter pushdown behavior of parquet reading. Thus, this PR updates the out-of-dated description of two external SQLConf: spark.sql.parquet.filterPushdown and spark.sql.parquet.enableVectorizedReader.

The PR also slightly simplifies the code for building parquetReader. cc @davies @liancheng @marmbrus

Because the current test cases do not verify the behavior when spark.sql.parquet.filterPushdown is set to false, added a test case for improving the test case coverage. Also, improved the test case when the parquet file path points to either non-existent files or non-existent hosts.

How was this patch tested?

Added the related test cases.

@SparkQA
Copy link

SparkQA commented Jun 17, 2016

Test build #60680 has finished for PR 13728 at commit ad1f18c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/table1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
// When a filter is pushed to Parquet, Parquet can apply it to every row.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I thought the filter is only applied to row group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If vectorized reader is disabled, then it will fall back to parquet's reader which would filter row by row as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies You are right. Sorry, I just simply copied this comment from the other test cases. Let me remove all of them. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Davies's comment is just about how Parquet prunes the rows. It is on the row group level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the test cases, it sounds like each row group contains only one row... Not sure how Parquet implements it?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jun 17, 2016

@gatorsmile hm.. Doesn't Parquet filter2 filter and also prune the rows as well as row groups? I think the copied test was written by me before..

I think I misunderstood you comments above maybe but it does filter row by row and also at row group level as well.. as far as I remember.. See here

@gatorsmile
Copy link
Member Author

gatorsmile commented Jun 17, 2016

uh... I see @HyukjinKwon I did not realize this filter could be used more than once. Let me revert the changes. Thanks!

@SparkQA
Copy link

SparkQA commented Jun 17, 2016

Test build #60691 has finished for PR 13728 at commit 9967cc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 17, 2016

Test build #60692 has finished for PR 13728 at commit d1b2cbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61006 has finished for PR 13728 at commit d1b2cbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile closed this Nov 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants