[SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics #13491
Closed
JoshRosen wants to merge 1 commit into apache:master from
Conversation
Contributor
Author
|
@ericl also observed this same perf. bottleneck in his profiling. I'll update the benchmark numbers tomorrow. |
|
Test build #59921 has finished for PR 13491 at commit
|
Contributor
Author
|
Jenkins, retest this please. |
Contributor
|
LGTM provided tests still pass |
|
Test build #59942 has finished for PR 13491 at commit
|
Contributor
Author
|
Hmm, weird; let me investigate what's going on with these tests... |
Contributor
Author
|
Jenkins, retest this please. |
|
Test build #59995 has finished for PR 13491 at commit
|
|
Test build #3065 has finished for PR 13491 at commit
|
Contributor
|
Merging in master/2.0. |
asfgit pushed a commit that referenced this pull request Jun 5, 2016
… in PartitionStatistics

`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13491 from JoshRosen/foldleft-to-flatmap.

(cherry picked from commit 26c1089)
Signed-off-by: Reynold Xin <rxin@databricks.com>
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
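
For readers who want to see the pattern in isolation, here is a minimal sketch of the two ways of flattening described above. It is not the code from this patch: `FlattenSketch`, `statsFor`, and the string-based "statistics" are hypothetical stand-ins for the per-column attributes that `PartitionStatistics` collects.

```scala
// Illustrative sketch only; the names below are hypothetical and not part of Spark.
object FlattenSketch {
  // Hypothetical stand-in for the statistics attributes derived from one column.
  def statsFor(column: String): Seq[String] =
    Seq(s"$column.lowerBound", s"$column.upperBound", s"$column.nullCount")

  // The pattern being replaced: every `++` allocates a new sequence, and for the
  // default List-backed Seq it copies the accumulated elements again, so the
  // total work grows roughly quadratically with the number of columns.
  def flattenWithFoldLeft(columns: Seq[String]): Seq[String] =
    columns.map(statsFor).foldLeft(Seq.empty[String])(_ ++ _)

  // The replacement: flatMap appends each inner sequence to a single builder.
  def flattenWithFlatMap(columns: Seq[String]): Seq[String] =
    columns.flatMap(statsFor)

  def main(args: Array[String]): Unit = {
    val columns = (1 to 1000).map(i => s"col$i")
    val viaFold = flattenWithFoldLeft(columns)
    val viaFlatMap = flattenWithFlatMap(columns)
    // Both approaches yield the same elements in the same order.
    println(s"same result: ${viaFold == viaFlatMap}, elements: ${viaFlatMap.size}")
  }
}
```

Both functions return the same sequence, so the change is purely about allocation cost; the difference should only become noticeable for wide tables, which matches the "many columns" case mentioned in the description.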