Skip to content

[SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics#13491

Closed
JoshRosen wants to merge 1 commit intoapache:masterfrom
JoshRosen:foldleft-to-flatmap
Closed

[SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics#13491
JoshRosen wants to merge 1 commit intoapache:masterfrom
JoshRosen:foldleft-to-flatmap

Conversation

@JoshRosen
Copy link
Contributor

PartitionStatistics uses foldLeft and list concatenation (++) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing flatMap/flatten because it performs many unnecessary object allocations. Simply replacing this foldLeft by a flatMap results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

@JoshRosen
Copy link
Contributor Author

@ericl also observed this same perf. bottleneck in his profiling.

I'll update the benchmark numbers tomorrrow.

@SparkQA
Copy link

SparkQA commented Jun 3, 2016

Test build #59921 has finished for PR 13491 at commit 50a6270.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@ericl
Copy link
Contributor

ericl commented Jun 3, 2016

LGTM provided tests still pass

@SparkQA
Copy link

SparkQA commented Jun 3, 2016

Test build #59942 has finished for PR 13491 at commit 50a6270.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

Hmm, weird; let me investigate what's going on with these tests...

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jun 5, 2016

Test build #59995 has finished for PR 13491 at commit 50a6270.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 5, 2016

Test build #3065 has finished for PR 13491 at commit 50a6270.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jun 5, 2016

Merging in master/2.0.

asfgit pushed a commit that referenced this pull request Jun 5, 2016
… in PartitionStatistics

`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13491 from JoshRosen/foldleft-to-flatmap.

(cherry picked from commit 26c1089)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 26c1089 Jun 5, 2016
@JoshRosen JoshRosen deleted the foldleft-to-flatmap branch August 29, 2016 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants