
Conversation

@dbtsai (Member) commented Nov 9, 2019

What changes were proposed in this pull request?

Enable nested schema pruning and nested pruning on expressions by default. We have been using these features in production at Apple for a couple of months with great success: for some jobs we cut the data read by more than 8x and saw 21x faster wall-clock time.

Why are the changes needed?

Better performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.
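To make the claim concrete, here is a small standalone demo of what nested schema pruning buys (not part of this PR; the schema, path, and object names are made up for illustration): selecting one leaf of a struct from Parquet should shrink the scan's ReadSchema to just that leaf.

    import org.apache.spark.sql.SparkSession

    // Hypothetical nested schema used only for this demo.
    case class Name(first: String, last: String)
    case class Contact(id: Long, name: Name)

    object NestedPruningDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("nested-pruning-demo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val path = "/tmp/contacts_parquet" // hypothetical scratch path
        Seq(Contact(1L, Name("Ada", "Lovelace")))
          .toDF()
          .write.mode("overwrite").parquet(path)

        // With nested schema pruning enabled (the new default), the Parquet
        // scan's ReadSchema in the explain output should list only
        // name.first rather than the whole name struct.
        spark.read.parquet(path)
          .select($"name.first")
          .explain()

        spark.stop()
      }
    }

With pruning disabled, the same plan would instead show the full name struct in ReadSchema, i.e. both leaves would be read from disk.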

…_ENABLED NESTED_PRUNING_ON_EXPRESSIONS by default
@viirya (Member) commented Nov 9, 2019

In the title, it should be [SQL] instead of [Core].

@SparkQA commented Nov 9, 2019

Test build #113483 has finished for PR 26443 at commit 8d4e7d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29805] [Core] Enable nested schema pruning and nested pruning on expressions by default [SPARK-29805][Core] Enable nested schema pruning and nested pruning on expressions by default Nov 9, 2019
@dongjoon-hyun (Member) left a comment

+1, LGTM for 3.0.0.

Do you have any concerns about this, @gatorsmile and @cloud-fan?

@dbtsai dbtsai changed the title [SPARK-29805][Core] Enable nested schema pruning and nested pruning on expressions by default [SPARK-29805][SQL] Enable nested schema pruning and nested pruning on expressions by default Nov 9, 2019

      "executing unnecessary nested expressions.")
    .booleanConf
-   .createWithDefault(false)
+   .createWithDefault(true)
A reviewer (Member) commented:

Hm, these two configs, spark.sql.optimizer.serializer.nestedSchemaPruning.enabled and spark.sql.optimizer.expression.nestedPruning.enabled, were only added as of Spark 3.0 (SPARK-26837 and SPARK-27707). I thought it was rather usual to leave a new feature disabled for one minor release term.

@dbtsai (Member, Author) replied:

Yes, these were added as part of Spark 3.0, but at a really early stage of Spark 3.0 development. We internally cherry-picked them into 2.4.x in our production Spark distributions, and they help a lot in many nested-column use cases.
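For anyone who wants to compare against the old behavior after this change lands, the two config keys named above can simply be flipped back at runtime. A minimal sketch (the session setup and app name are illustrative; the spark.conf.set calls on these keys are the only part specific to this PR):

    import org.apache.spark.sql.SparkSession

    object DisableNestedPruning {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("disable-nested-pruning") // hypothetical app name
          .master("local[*]")
          .getOrCreate()

        // Opt out of the new defaults, restoring the pre-PR behavior.
        spark.conf.set("spark.sql.optimizer.serializer.nestedSchemaPruning.enabled", "false")
        spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")

        // ... run the workload under the old behavior here ...

        spark.stop()
      }
    }

This is the session-level opt-out; setting the same keys in spark-defaults.conf would apply them to every session.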

@dbtsai (Member, Author) commented Nov 11, 2019

Thanks all for reviewing. Merged into master.

@dbtsai dbtsai closed this in a6a2748 Nov 11, 2019
@dbtsai dbtsai deleted the enableNestedSchemaPrunning branch November 11, 2019 23:05
@gatorsmile (Member):

I hope we can enable them in the preview release of Spark 3.0. The community can help us verify the quality.

@jiangxb1987 Shall we have one more preview release next month?

@jiangxb1987 (Contributor):

Sounds good!

@dongjoon-hyun (Member):

Can we give the opportunity to another committer? That would be helpful for Apache community growth, @gatorsmile and @jiangxb1987.

@dongjoon-hyun (Member) commented Nov 12, 2019

Also, cc @HyukjinKwon and @holdenk if you are interested~

@gatorsmile (Member):

I assume Xingbo already has an environment for the preview release. He can do it very quickly.

@dongjoon-hyun (Member) commented Nov 12, 2019

@gatorsmile, that's not a good reason~ :)
Actually, during the last two releases I also built the environment, and I still have it.

@gatorsmile (Member):

I do not care who does the release-manager work for the preview. I only care whether it will delay the 3.0 release. I expect we will have one or two more Spark 3.0 preview releases.

@dongjoon-hyun (Member) commented Nov 12, 2019

And it's a good chance for committers to get more involved in the Apache Spark community.
A PMC member should try to give committers more opportunities to grow into PMC members. That's the reason we waited for @jiangxb1987 to do that, and both of us know that @jiangxb1987 also learned during the process.

@dongjoon-hyun (Member):

We have only a few releases a year, and the number of new Apache Spark committers is larger than that.

@gatorsmile (Member):

@dongjoon-hyun Please do not misunderstand my point. It took @jiangxb1987 more than two weeks to release the Spark 3.0 preview. As long as another committer can finish it quickly, I am totally fine with it. This is just like a new RC for the Spark 3.0 preview.

We need to release the Spark 3.0 preview ASAP so the community can try it and verify the fixes. The quality of the 3.0 release is our top priority; hopefully you agree. Being release manager is labor work, and even if we had a new release manager for each RC, I don't think it would grow the community.

@holdenk (Contributor) commented Nov 13, 2019

Is there a committer who is interested in learning the release process? If so, I think a preview release is a great, lower-stakes-than-usual opportunity for someone to skill up.
