[SPARK-16010] [SQL] Code Refactoring, Test Case Improvement and Description Updates for SQLConf spark.sql.parquet.filterPushdown #13728
Conversation
Test build #60680 has finished for PR 13728 at commit
```scala
withTempPath { dir =>
  val path = s"${dir.getCanonicalPath}/table1"
  (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
  // When a filter is pushed to Parquet, Parquet can apply it to every row.
```
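The diff excerpt above only sets up a small Parquet table. A test exercising the pushdown flag typically runs the same query under both settings; the following is a minimal sketch, assuming Spark's SQL test helpers `withSQLConf` and `checkAnswer` and that `spark` and `path` are in scope as above (this is not the diff under review):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.internal.SQLConf

// Sketch only: run the same filtered read with pushdown enabled and
// disabled, and check that the results agree either way.
// `withSQLConf` and `checkAnswer` are Spark test-suite helpers.
Seq("true", "false").foreach { pushdownEnabled =>
  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> pushdownEnabled) {
    val df = spark.read.parquet(path).filter("a > 1")
    checkAnswer(df, Row(2, "2") :: Row(3, "3") :: Nil)
  }
}
```

The point of such a test is that correctness must not depend on the flag: pushdown only changes where the rows are pruned, not which rows survive.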
Is this true? I thought the filter is only applied at the row-group level.
If the vectorized reader is disabled, Spark falls back to Parquet's own reader, which filters row by row as well.
@davies You are right. Sorry, I just simply copied this comment from the other test cases. Let me remove all of them. Thanks!
@HyukjinKwon Davies's comment is just about how Parquet prunes the rows. It is on the row group level.
Based on the test cases, it sounds like each row group contains only one row... Not sure how Parquet implements it?
@gatorsmile hm.. Doesn't Parquet's filter2 API filter and prune individual rows as well as row groups? I think the copied test was written by me before. Maybe I misunderstood your comments above, but as far as I remember it does filter row by row and also at the row-group level. See here
uh... I see. @HyukjinKwon I did not realize this filter could be used more than once. Let me revert the changes. Thanks!
Test build #60691 has finished for PR 13728 at commit
Test build #60692 has finished for PR 13728 at commit
retest this please
Test build #61006 has finished for PR 13728 at commit
What changes were proposed in this pull request?
Starting with Spark 2.0, vectorized decoding is introduced to improve Parquet reading performance. This feature changes the filter pushdown behavior of Parquet reading. Thus, this PR updates the out-of-date descriptions of two external `SQLConf` entries: `spark.sql.parquet.filterPushdown` and `spark.sql.parquet.enableVectorizedReader`. The PR also slightly simplifies the code for building `parquetReader`. cc @davies @liancheng @marmbrus

Because the current test cases do not verify the behavior when `spark.sql.parquet.filterPushdown` is set to `false`, this PR adds a test case to improve coverage. It also improves the test case in which the Parquet file path points to non-existent files or non-existent hosts.

How was this patch tested?
Added the related test cases.
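For reference, the two configurations whose descriptions this PR updates can be toggled per session; a minimal sketch of the settings the new tests exercise (both keys default to `true` in Spark 2.0):

```scala
// Session-level configuration sketch; the keys are the two SQLConf
// entries discussed in this PR, set here to their non-default values.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```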