TeraSort benchmark for Spark
This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.
The default is to link against Spark 2.4.4 jars (released September 2019). If you plan to run using an older version of Spark (e.g. 1.6) you will have to try
-Dspark.version=1.6. If possible, it's probably a better idea to just update to a more recent version or Spark.
cd to your your Spark install.
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 1g file://$HOME/data/terasort_in
Sort the data
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar file://$HOME/data/terasort_in file://$HOME/data/terasort_out
Validate the data
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar file://$HOME/data/terasort_out file://$HOME/data/terasort_validate
This terasort doesn't use the partitioning scheme that Hadoop's Terasort uses. This results in not very good performance. I could copy the Partitioning code from the Hadoop tree verbatim but I thought it would be more appropriate to rewrite more of it in Scala.
I haven't pulled the DaytonaPartitioner from the record holding sort yet because it's pretty intertwined into the rest of the code and AFAIK it's not really idiomatic Spark.
Functionality on native file systems
TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, on Lustre, Panasas, etc). HDFS apparently always returns the files in alphanumeric order so most Hadoop users aren't affected. I thought I fixed this in the TeraInputFormat, but I was able to reproduce it since migrating the code from my Spark terasort branch.
PRs are very welcome!