
[SPARK-13930][SQL] Apply fast serialization on collect limit operator #11759

Closed
wants to merge 5 commits

Conversation

@viirya (Member) commented Mar 16, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13930

Fast serialization was recently introduced for collecting a DataFrame/Dataset (#11664). The same technique can be applied to the collect limit operator too.

How was this patch tested?

Added a benchmark for collect limit to `BenchmarkWholeStageCodegen`.

Without this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
    collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X

With this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
    collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X

@SparkQA commented Mar 16, 2016

Test build #53309 has finished for PR 11759 at commit d1306ad.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 16, 2016

retest this please.

@SparkQA commented Mar 16, 2016

Test build #53310 has finished for PR 11759 at commit d1306ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 16, 2016

Test build #53320 has finished for PR 11759 at commit 2c2055a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 17, 2016

cc @davies @rxin

    * Runs this query returning the result as an array.
    * Packing the UnsafeRows into byte array for faster serialization.
    * The byte arrays are in the following format:
    * [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
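The `[size][bytes]...[-1]` layout described in this doc comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `packRows`/`unpackRows` are hypothetical names, and plain byte arrays stand in for the serialized bytes of Spark's `UnsafeRow`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: write each row's length, then its bytes,
// and terminate the stream with a -1 marker.
def packRows(rows: Seq[Array[Byte]]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val dos = new DataOutputStream(bos)
  rows.foreach { bytes =>
    dos.writeInt(bytes.length)  // [size]
    dos.write(bytes)            // [bytes of UnsafeRow]
  }
  dos.writeInt(-1)              // end-of-stream marker
  dos.flush()
  bos.toByteArray
}

// Hypothetical inverse: read sizes until the -1 marker is seen,
// recovering each row's byte payload.
def unpackRows(packed: Array[Byte]): Seq[Array[Byte]] = {
  val dis = new DataInputStream(new ByteArrayInputStream(packed))
  val rows = ArrayBuffer.empty[Array[Byte]]
  var size = dis.readInt()
  while (size != -1) {
    val buf = new Array[Byte](size)
    dis.readFully(buf)
    rows += buf
    size = dis.readInt()
  }
  rows.toSeq
}
```

Packing many small rows into one contiguous byte array like this amortizes per-object serialization overhead, which is where the benchmark's speedup comes from.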
Contributor commented:

These are implementation details; they have nothing to do with the APIs. I'd like to keep these as comments.

Contributor commented:

nvm, you made it a function.

@SparkQA commented Mar 18, 2016

Test build #53501 has finished for PR 11759 at commit 6752775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2016

Test build #53502 has finished for PR 11759 at commit ed9aa30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 18, 2016

cc @davies The comments are addressed and the tests pass. Please see if this is OK now. Thanks!

@davies (Contributor) commented Mar 18, 2016

LGTM, merging this into master, thanks!

@asfgit asfgit closed this in 750ed64 Mar 18, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes apache#11759 from viirya/execute-take.
@viirya viirya deleted the execute-take branch December 27, 2023 18:19