
[SPARK-13930][SQL] Apply fast serialization on collect limit operator #11759

Closed
wants to merge 5 commits

Conversation

@viirya (Member) commented Mar 16, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13930

Fast serialization was recently introduced for collecting a DataFrame/Dataset (#11664). The same technique can be applied to the collect limit operator too.

How was this patch tested?

Added a benchmark for collect limit to `BenchmarkWholeStageCodegen`.

Without this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
    collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X

With this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
    collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X

@SparkQA commented Mar 16, 2016

Test build #53309 has finished for PR 11759 at commit d1306ad.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 16, 2016

retest this please.

@SparkQA commented Mar 16, 2016

Test build #53310 has finished for PR 11759 at commit d1306ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 16, 2016

Test build #53320 has finished for PR 11759 at commit 2c2055a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 17, 2016

cc @davies @rxin

    * Runs this query returning the result as an array.
    * Packing the UnsafeRows into byte array for faster serialization.
    * The byte arrays are in the following format:
    * [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
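The `[size][bytes]...[-1]` layout described in this doc comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `packRows`/`unpackRows` are hypothetical names, and plain byte arrays stand in for the serialized bytes of Spark's `UnsafeRow`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: write each row's length, then its bytes,
// and terminate the stream with a -1 marker.
def packRows(rows: Seq[Array[Byte]]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val dos = new DataOutputStream(bos)
  rows.foreach { bytes =>
    dos.writeInt(bytes.length)  // [size]
    dos.write(bytes)            // [bytes of UnsafeRow]
  }
  dos.writeInt(-1)              // end-of-stream marker
  dos.flush()
  bos.toByteArray
}

// Hypothetical inverse: read sizes until the -1 marker is seen,
// recovering each row's byte payload.
def unpackRows(packed: Array[Byte]): Seq[Array[Byte]] = {
  val dis = new DataInputStream(new ByteArrayInputStream(packed))
  val rows = ArrayBuffer.empty[Array[Byte]]
  var size = dis.readInt()
  while (size != -1) {
    val buf = new Array[Byte](size)
    dis.readFully(buf)
    rows += buf
    size = dis.readInt()
  }
  rows.toSeq
}
```

Packing many small rows into one contiguous byte array like this amortizes per-object serialization overhead, which is where the benchmark's speedup comes from.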
Contributor commented:

These are implementation details; they have nothing to do with the APIs. I'd like to keep these as comments.

Contributor commented:

nvm, you made it a function.

@SparkQA commented Mar 18, 2016

Test build #53501 has finished for PR 11759 at commit 6752775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2016

Test build #53502 has finished for PR 11759 at commit ed9aa30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 18, 2016

cc @davies The comments are addressed and the tests pass. Please see if this is OK now. Thanks!

@davies (Contributor) commented Mar 18, 2016

LGTM, merging this into master, thanks!

@asfgit asfgit closed this in 750ed64 Mar 18, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes apache#11759 from viirya/execute-take.
@viirya viirya deleted the execute-take branch December 27, 2023 18:19