[SPARK-2304] tera sort example program for shuffle benchmarks #1242
Conversation
Conflicts: core/src/main/scala/org/apache/spark/Partitioner.scala
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16195/
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
The Hadoop code for generating the data is out of date. It might not matter for your purposes, but if you want the up-to-date one, look at sortbenchmark.org. I had filed a JIRA to update the Hadoop one but haven't gotten to it.
Nice addition, thanks Reynold ! |
@rxin can you close this issue for now? It's been lingering a long time. |
Hi @rxin, sorry to bring this up. Are you planning to merge this terasort example into Spark? I think it would be a good standard for testing shuffle performance. Also, I think the generated records should be copied; otherwise they will lead to errors in sort-based shuffle, as in SPARK-2967. And is it intentional that no in-partition sorting is done, or will that be added later? Thanks a lot.
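The point about copying records can be illustrated with a minimal Python sketch (hypothetical names; this is not the code from GenSort.scala). If a generator reuses one mutable buffer for every record, any consumer that holds on to records across iterations, as a sort-based shuffle buffer does, ends up with many references to the same final record; copying at the point of emission avoids that:

```python
# Hypothetical illustration of the reused-buffer pitfall (SPARK-2967-style).
def gen_records_reused(n):
    buf = bytearray(100)          # one buffer reused for every record
    for i in range(n):
        buf[0] = i                # overwrite in place with the "new" record
        yield buf                 # caller receives the SAME object each time

# A consumer that buffers records (like a sort buffer) sees corruption:
held = list(gen_records_reused(3))
print([r[0] for r in held])       # [2, 2, 2] -- every entry is the last record

# Copying on emission fixes it:
def gen_records_copied(n):
    buf = bytearray(100)
    for i in range(n):
        buf[0] = i
        yield bytes(buf)          # emit an immutable copy

held = list(gen_records_copied(3))
print([r[0] for r in held])       # [0, 1, 2] -- each record is preserved
```

The trade-off is one extra allocation per record, which is why generators often reuse buffers in the first place.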
I don't think we are going to merge this in Spark, unless there is huge demand from users... |
@rxin I am confused about the input parameters of GenSort.scala. It seems 1 row (record) equals 100 bytes, so I computed the number of records (rows) accordingly. However, if I save the output as a sequence file, the total size of the output files is only 20 GB. If I save the output as a text file instead of a sequence file, the size of the output files is 309.2 GB (77.3 GB × 4 partitions), but NOT 100 GB. Why?
The size of the data is 100 GB in its uncompressed binary representation. You are probably compressing the data when you save it as a sequence file. When you save it as a text file, the text representation is much larger (i.e., a single byte is rendered as multiple bytes in text).
So how do I save the output in its uncompressed binary representation with the GenSort.scala program? I want to compare it with Hadoop MR, which also uses the uncompressed binary representation.
This pull request adds an example program for benchmarking Spark shuffle. It dynamically generates a set of 100-byte records according to the terasort spec, and repartitions the data using an evenly spaced range partitioner. By design, it does NOT yet perform sorting after the range partitioning.
Some of the code is copied directly from Hadoop and simplified (the data generator).
I've been using this utility to benchmark Spark at scale, including shuffling 100TB of data in 12 mins and 300TB in 36 mins.
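As a sketch of what "evenly spaced range partitioner" means here (my own illustration under stated assumptions, not the code from this PR): terasort keys are 10 bytes, so the key space can be split into equal-width ranges by scaling a fixed-length prefix of the key into the partition count:

```python
# Sketch: evenly spaced range partitioning over the terasort key space.
# Assumes 10-byte keys; the leading bytes are enough to pick a partition
# when the ranges are equal-width.
def partition_for(key: bytes, num_partitions: int) -> int:
    # Interpret the first 4 bytes as an unsigned big-endian integer and
    # scale it into [0, num_partitions). 2**32 is the prefix space size.
    prefix = int.from_bytes(key[:4], "big")
    return prefix * num_partitions // 2**32

# Small-prefix keys land in low partitions, large-prefix keys in high ones:
print(partition_for(b"\x00" * 10, 4))            # 0
print(partition_for(b"\x40" + b"\x00" * 9, 4))   # 1
print(partition_for(b"\xff" * 10, 4))            # 3
```

Because the ranges are fixed up front rather than sampled from the data, this keeps partitions balanced only when keys are uniformly distributed, which the terasort generator guarantees.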