update rdma shuffle blog
PepperJo committed Nov 8, 2017
1 parent 57cbbef commit 62b467dcb8650d85a946c2109aa07090e7a80416
Showing 9 changed files with 69 additions and 33 deletions.
@@ -11,6 +11,7 @@ set grid ytics mytic
set key left
set style data histogram
set style fill solid border -1
set boxwidth 0.9
unset xtics
set title "Equijoin"

@@ -11,6 +11,7 @@ set grid ytics mytic
set key left
set style data histogram
set style fill solid border -1
set boxwidth 0.9
unset xtics
set title "TeraSort"

@@ -17,7 +17,7 @@ This blog is comparing the performance of different RDMA-based shuffle plugins f
The specific cluster configuration used for the experiments in this blog:

* Cluster
-  * 8 node x86_64 cluster
+  * 8 compute + 1 management node x86_64 cluster
* Node configuration
* CPU: 2 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
* DRAM: 96GB DDR4
@@ -73,9 +73,43 @@ terasort and SQL equijoin.
<br>
<div style="text-align: justify">
<p>
First we run <a href="https://github.com/zrlio/crail-spark-terasort">terasort</a>
on our 8+1 machine cluster (see above). We sort 200GB, i.e. each node gets 25GB
of data (assuming an equal distribution). To get the best possible configuration
for each setup we brute-force its configuration space.
All configurations use 8 executors with 12 cores each. Note that
in a typical Spark run more CPU cores than assigned are engaged because of
garbage collection, etc. In our test runs, assigning 12 cores led to the
best performance.<br/><br/>

The plot above shows the runtimes of the various configurations we ran with terasort.
SparkRDMA with the Wrapper shuffle writer performs slightly better (3-4%) than
vanilla Spark, whereas the Chunked shuffle writer shows a 30% overhead. A quick
inspection found that this overhead stems from the memory allocation and registration
of the shuffle data that is kept in memory between the stages. Crail shows
a performance improvement of around 235%.
</p>
</div>
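
For readers who want a feel for how such a run is wired up, here is a minimal
Scala sketch of the executor layout described above. It is a sketch under
assumptions, not the benchmark's actual launch code: the RDMA shuffle manager
class name reflects our reading of the SparkRDMA plugin and should be checked
against its documentation, and the vanilla or Crail setups would simply use a
different (or no) `spark.shuffle.manager` value.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: 8 executors with 12 cores each, mirroring the numbers above.
// The shuffle manager class name below is an assumption (SparkRDMA plugin);
// verify it against the plugin's README before relying on it.
val spark = SparkSession.builder()
  .appName("terasort-shuffle-comparison")
  .config("spark.executor.instances", "8")   // one executor per compute node
  .config("spark.executor.cores", "12")      // 12 cores per executor worked best here
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.rdma.RdmaShuffleManager") // assumed plugin class name
  .getOrCreate()
```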
<br>
<div style="text-align:center"><img src ="/img/blog/rdma-shuffle/sql.svg" width="750"/></div>
<br>

<div style="text-align: justify">
<p>
For our second workload we chose the
<a href="https://github.com/zrlio/sql-benchmarks">SQL equijoin</a> with a
<a href="https://github.com/zrlio/spark-nullio-fileformat">special file format</a>
that allows data to be generated on the fly, i.e. this benchmark focuses on
shuffle performance. The shuffle data size is around 148GB. Here the
Wrapper shuffle writer is slightly slower than vanilla Spark, while the
Chunked shuffle writer is faster by roughly the same margin. Crail again shows a
large performance improvement over vanilla Spark.<br/><br/>
These benchmarks validate our previous statement that a
tightly integrated design cannot deliver the same performance as a holistic
approach, i.e. one has to look at the whole picture of how to integrate
RDMA into Spark applications. Replacing the data transfer alone does not
lead to the anticipated performance improvement. We learned this the hard
way when we initially started working on Crail.
</p>
</div>
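
As a point of reference, the sketch below shows what an equijoin of this kind
looks like at the Spark SQL level. It is illustrative only: the actual benchmark
generates its input through the nullio file format linked above, which this
sketch does not reproduce, and the table sizes here are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative equijoin: rows are matched on equality of the join key, so all
// rows with the same key must end up in the same partition, making the query
// shuffle-bound -- which is exactly what the benchmark stresses.
val spark = SparkSession.builder().appName("equijoin-sketch").getOrCreate()

val left  = spark.range(0L, 1000000L).toDF("key")   // placeholder size
val right = spark.range(0L, 1000000L).toDF("key")   // placeholder size

val joined = left.join(right, "key")                // equality join on "key"
println(joined.count())
```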

