Skip to content
Permalink
Browse files
updated blog and pictures
  • Loading branch information
animeshtrivedi committed Aug 14, 2018
1 parent d7b09bd commit c5b15ed1b3cd48a987e77b85d3cd5505da80c5fd
Showing 4 changed files with 9 additions and 6 deletions.
@@ -29,14 +29,16 @@ The specific cluster configuration used for the experiments in this blog:
* <a href="https://github.com/apache/incubator-crail/">Apache Crail (incubating) with NVMeF support</a>, commit 64e635e5ce9411041bf47fac5d7fadcb83a84355 (since then Crail has a stable source release v1.0 with a newer NVMeF code-base)

### Overview

In a typical cloud-based relational data processing setup, the input data is stored on an external data storage solution like HDFS or AWS S3. Data tables and their associated schema are converted into a storage-friendly format for optimal performance. Examples of some popular and familiar file formats are [Apache Parquet](https://parquet.apache.org/), [Apache ORC](https://orc.apache.org/), [Apache Avro](https://avro.apache.org/), [JSON](https://en.wikipedia.org/wiki/JSON), etc. More recently, [Apache Arrow](https://arrow.apache.org/) has been introduced to standardize the in-memory columnar data representation between multiple frameworks. There is no one size fits all as all these formats have their own strengths, weaknesses, and features. In this blog, we are specifically interested in the performance of these formats on modern high-performance networking and storage devices.

<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/outline.svg" width="550"/><figcaption>Figure 1: The benchmarking setup with HDFS and file formats on a 100 Gbps network with NVMe flash devices. All formats contains routines for compression, encoding, and value materialization with associated I/O buffer management and data copies routines.<p></p></figcaption></div></figure>

To benchmark the performance of file formats, we wrote a set of micro-benchmarks which are available at [https://github.com/zrlio/fileformat-benchmarks](https://github.com/zrlio/fileformat-benchmarks). We cannot use typical SQL micro-benchmarks because every SQL engine has its own favorite file format, on which it performs the best. Hence, in order to ensure parity, we decoupled the performance of reading the input file format from the SQL query processing by writing simple table reading micro-benchmarks. Our benchmark reads in the store_sales table from the TPC-DS dataset (scale factor 100), and calculates a sum of values present in the table. The table contains 23 columns of integers, doubles, and longs.

<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/performance-all.svg" width="550"/><figcaption>Figure 1: Performance of JSON, Avro, Parquet, ORC, and Arrow on NVMe devices over a 100 Gbps network.<p></p></figcaption></div></figure>
<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/performance-all.svg" width="550"/><figcaption>Figure 2: Performance of JSON, Avro, Parquet, ORC, and Arrow on NVMe devices over a 100 Gbps network.<p></p></figcaption></div></figure>

We evaluate the performance of the benchmark on a 3 node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with a collective aggregate bandwidth of 12.5 GB/sec (equals to 100 Gbps, hence, we have a balanced network and storage performance). Figure 1 shows our results where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput 74.9 Gbps out of possible 100 Gbps). The rest of the performance is lost inside the file format implementation, which needs to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow which is designed for in-memory columnar datasets. The performance of these file formats are bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - [Albis: High-Performance File Format for Big Data Systems (USENIX, ATC’18)](https://www.usenix.org/conference/atc18/presentation/trivedi).
We evaluate the performance of the benchmark on a 3 node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with a collective aggregate bandwidth of 12.5 GB/sec (equals to 100 Gbps, hence, we have a balanced network and storage performance). Figure 2 shows our results where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput 74.9 Gbps out of possible 100 Gbps). The rest of the performance is lost inside the file format implementation, which needs to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow which is designed for in-memory columnar datasets. The performance of these file formats are bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - [Albis: High-Performance File Format for Big Data Systems (USENIX, ATC’18)](https://www.usenix.org/conference/atc18/presentation/trivedi).

### Albis: High-Performance File Format for Big Data Systems

@@ -47,7 +49,7 @@ Based on these findings, we have developed a new file format called Albis. Albis
* Careful object materialization using a binary API: To optimize the runtime representation in managed runtimes like the JVM, only objects which are necessary for SQL processing are materialized. Otherwise, a 4 byte integer can be passed around as a byte array (using the binary API of Albis).


<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/core-scalability.svg" width="550"/><figcaption>Figure 2: Core scalability of JSON, Avro, Parquet, ORC, Arrow, and Albis.<p></p></figcaption></div></figure>
<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/core-scalability.svg" width="550"/><figcaption>Figure 3: Core scalability of JSON, Avro, Parquet, ORC, Arrow, and Albis on HDFS/NVMe.<p></p></figcaption></div></figure>

Using the Albis format, we revise our previous experiment where we read the input store_sales table from HDFS. In the figure above, we show the performance of Albis and other file formats with number of CPU cores involved. At the right hand of the x-axis, we have performance with all 16 cores engaged, hence, representing the peak possible performance. As evident, Albis delivers 59.9 Gbps out of 74.9 Gbps possible bandwidth with HDFS over NVMe. Albis performance is 1.9 - 21.4x better than other file formats. To give an impression where the performance is coming from, in the table below we show some micro-architectural features for Parquet, ORC, Arrow, and Albis. Our previously discussed design ideas in Albis result in a shorter code path (shown as less instructions required for each row), better cache performance (shows as lower cache misses per row), and clearly better performance (shown as nanoseconds required per row for processing). For a detailed evaluation of Albis please refer to our paper.

@@ -86,9 +88,9 @@ Using the Albis format, we revise our previous experiment where we read the inpu

### Apache Crail (Incubating) with Albis

For our final experiment, we try to answer the question what it would take to deliver the full 100 Gbps bandwidth for Albis. Certainly, the first bottleneck is to improve the base storage layer performance. Here we use Apache Crail (Incubating) with its [NVMeF](https://en.wikipedia.org/wiki/NVM_Express#NVMeOF) storage tier. This tier uses [jNVMf library](https://github.com/zrlio/jNVMf) to implement NVMeF stack in Java. As we have shown in a previous blog <a href="{{ site.base }}/blog/2017/08/crail-nvme-fabrics-v1.html">post</a> that Crail's NVMeF tier can deliver performance (97.8 Gbps) very close to the hardware limits. Hence, Albis with Crail is a perfect setup to evaluate on high-performance NVMe and RDMA devices. Before we get there, let's get some calculations right. The store_sales table in the TPC-DS dataset has a data density of 93.9% (out of 100 bytes, only 93.9 is data, others are null values). As we measure the goodput, the expected performance of Albis on Crail is 93.9% of 97.8 Gbps, which calculates to 91.8 Gbps. In our experiments, Albis on Crail delivers 85.5 Gbps. Figure 2 shows more detailed results.
For our final experiment, we try to answer the question what it would take to deliver the full 100 Gbps bandwidth for Albis. Certainly, the first bottleneck is to improve the base storage layer performance. Here we use Apache Crail (Incubating) with its [NVMeF](https://en.wikipedia.org/wiki/NVM_Express#NVMeOF) storage tier. This tier uses [jNVMf library](https://github.com/zrlio/jNVMf) to implement NVMeF stack in Java. As we have shown in a previous blog <a href="{{ site.base }}/blog/2017/08/crail-nvme-fabrics-v1.html">post</a> that Crail's NVMeF tier can deliver performance (97.8 Gbps) very close to the hardware limits. Hence, Albis with Crail is a perfect setup to evaluate on high-performance NVMe and RDMA devices. Before we get there, let's get some calculations right. The store_sales table in the TPC-DS dataset has a data density of 93.9% (out of 100 bytes, only 93.9 is data, others are null values). As we measure the goodput, the expected performance of Albis on Crail is 93.9% of 97.8 Gbps, which calculates to 91.8 Gbps. In our experiments, Albis on Crail delivers 85.5 Gbps. Figure 4 shows more detailed results.

<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/albis-crail.svg" width="550"/><figcaption>Figure 2: Performance of Albis on Crail.<p></p></figcaption></div></figure>
<figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/albis-crail.svg" width="550"/><figcaption>Figure 4: Performance of Albis on Crail.<p></p></figcaption></div></figure>

The left half of the figure shows the performance scalability of Albis on Crail in a setup with 1 core (8.9 Gbps) to 16 cores (85.5 Gbps). In comparison, the right half of the figure shows the performance of Crail on HDFS/NVMe at 59.9 Gbps, and on Crail/NVMe at 85.5 Gbps. The last bar shows the performance of Albis if the benchmark does not materialize Java object values. In this configuration, Albis on Crail delivers 91.3 Gbps, which is very close to the expected peak of 91.8 Gbps.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit c5b15ed

Please sign in to comment.