
[SPARK-2304] tera sort example program for shuffle benchmarks #1242

Closed
wants to merge 14 commits

Conversation

@rxin (Contributor) commented Jun 27, 2014

This pull request adds an example program for benchmarking Spark shuffle. It dynamically generates a set of 100-byte records according to the TeraSort spec and repartitions the data using an evenly spaced range partitioner. By design, it does NOT perform sorting after the range partitioning yet.

Some of the code is copied directly from Hadoop and simplified (the data-generator portion).

I've been using this utility to benchmark Spark at scale, including shuffling 100TB of data in 12 mins and 300TB in 36 mins.
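For readers who just want the shape of the benchmark, here is a minimal sketch of the idea, assuming random bytes stand in for the spec's key/value contents; the class and partitioner names are hypothetical, and this is not the PR's actual code:

```scala
import scala.util.Random
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical evenly spaced partitioner keyed on the first byte of the key.
class EvenRangePartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val firstByte = key.asInstanceOf[Array[Byte]](0) & 0xff
    firstByte * parts / 256
  }
}

object ShuffleBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val Array(numParts, recordsPerPart) = args.map(_.toInt)
    val sc = new SparkContext(new SparkConf().setAppName("ShuffleBenchmarkSketch"))

    // Generate (10-byte key, 90-byte value) pairs: 100 bytes per record.
    val records = sc.parallelize(0 until numParts, numParts).flatMap { part =>
      val rand = new Random(part)
      Iterator.fill(recordsPerPart) {
        val key = new Array[Byte](10)
        val value = new Array[Byte](90)
        rand.nextBytes(key)
        rand.nextBytes(value)
        (key, value)
      }
    }

    // Shuffle via the range partitioner; no sorting within partitions.
    records.partitionBy(new EvenRangePartitioner(numParts)).count()
    sc.stop()
  }
}
```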

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16195/

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished. All automated tests passed.

@AmplabJenkins
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16196/

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16224/

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished. All automated tests passed.

@AmplabJenkins
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16227/

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished. All automated tests passed.

@AmplabJenkins
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16231/

@tgravescs (Contributor)

The Hadoop code for generating the data is out of date. It might not matter for your purposes, but if you want the up-to-date version, look at sortbenchmark.org. I had filed a JIRA to update the Hadoop one but haven't gotten to it.

@mridulm (Contributor) commented Jun 30, 2014

Nice addition, thanks Reynold!

@pwendell (Contributor) commented Sep 2, 2014

@rxin can you close this issue for now? It's been lingering a long time.

@rxin closed this Sep 2, 2014
@jerryshao (Contributor)

Hi @rxin, sorry to bring this up again. Are you planning to merge this TeraSort example into Spark? I think it would be a good standard benchmark for shuffle performance.

Besides, I think the generated records should be copied; otherwise, reusing record objects will lead to errors in sort-based shuffle, as in SPARK-2967.

Also, is it intentional not to do in-partition sorting, or will that be added later?

Thanks a lot.
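To make the record-copying point above concrete, here is a minimal sketch, assuming the generator reuses one mutable buffer per iterator (the helper name is hypothetical, not code from this PR): each record needs its own copy before it enters a sort-based shuffle, because the shuffle buffers records and reads them again during sorting.

```scala
// Hypothetical helper: deep-copy each (key, value) record so that buffered
// records in a sort-based shuffle don't all alias the generator's reused buffers.
def copiedRecords(raw: Iterator[(Array[Byte], Array[Byte])])
    : Iterator[(Array[Byte], Array[Byte])] =
  raw.map { case (key, value) => (key.clone(), value.clone()) }
```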

@rxin (Contributor, Author) commented Sep 12, 2014

I don't think we are going to merge this into Spark, unless there is huge demand from users...

@liuqiyun

@rxin I am confused about the input parameters of GenSort.scala.
It requires three parameters: [num-parts] [records-per-part] [output-path].
If I want to generate and sort 100 GB of data using 4 partitions, is it correct to set the parameters to '4, 268435456, /tmp/sort-output'?

It seems one record (row) equals 100 bytes, so I computed the number of records as follows:
100 GB = 107374182400 bytes = 1073741824 records × 100 bytes/record = 268435456 records × 4 partitions × 100 bytes/record
So each partition should generate 268435456 records, right?

However, if I save the output as a sequence file, the total size of the output files is only 20 GB. If I save the output as a text file instead of a sequence file, the total size is 309.2 GB (77.3 GB × 4 partitions), not 100 GB. Why?
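The parameter arithmetic itself checks out; a quick verification in plain Scala (no Spark needed):

```scala
// Verify the records-per-partition arithmetic for 100 GB across 4 partitions.
val totalBytes = 100L * 1024 * 1024 * 1024   // 100 GB
val recordSize = 100L                        // bytes per record, per the spec
val numParts   = 4L
println(totalBytes / recordSize / numParts)  // 268435456 records per partition
```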

@rxin (Contributor, Author) commented Dec 29, 2014

The size of the data is 100 GB in its uncompressed binary representation. You are probably compressing the data when you save it as a sequence file. When you save it as a text file, the text representation is much larger (i.e., a single byte is rendered as multiple characters of text).
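As a rough illustration (an assumption about the rendering; the exact text format used here isn't specified): printing each byte as decimal digits plus a separator turns a 100-byte binary record into a few hundred characters of text, which is the kind of 3-4x expansion that would turn 100 GB into ~309 GB.

```scala
// One 100-byte record, rendered as comma-separated decimal text.
val record = Array.fill[Byte](100)(200.toByte)
val asText = record.map(b => b & 0xff).mkString(",")
println(record.length) // 100 bytes in binary form
println(asText.length) // 399 characters: ~4x larger in this rendering
```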

@liuqiyun

So how do I save the uncompressed binary representation in the GenSort.scala program? I want to compare it with Hadoop MR, which also uses the uncompressed binary representation.
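One possible approach, as a sketch rather than a confirmed answer (the BytesWritable wiring is an assumption, not code from GenSort.scala): write the pair RDD through Hadoop's SequenceFileOutputFormat with output compression explicitly disabled.

```scala
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Save (key, value) byte-array records as an uncompressed SequenceFile.
def saveUncompressed(records: RDD[(Array[Byte], Array[Byte])], path: String): Unit = {
  val conf = new JobConf(records.sparkContext.hadoopConfiguration)
  conf.setBoolean("mapred.output.compress", false) // force raw, uncompressed output
  records
    .map { case (k, v) => (new BytesWritable(k), new BytesWritable(v)) }
    .saveAsHadoopFile(
      path,
      classOf[BytesWritable],
      classOf[BytesWritable],
      classOf[SequenceFileOutputFormat[BytesWritable, BytesWritable]],
      conf)
}
```

A SequenceFile still adds a small amount of per-record framing, so the on-disk size will come out slightly above the raw 100 GB.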
