[ADAM-1661] Add file storage benchmarks.

Resolves bigdatagenomics#1661.
fnothaft · Oct 21, 2017 · f31f914 · f31f914
1 parent 188836a
commit f31f914
Showing 6 changed files with 115 additions and 0 deletions.
diff --git a/docs/source/35_benchmarks.md b/docs/source/35_benchmarks.md
@@ -0,0 +1,51 @@
+# Benchmarks {#benchmarks}
+
+ADAM uses [Apache Parquet](https://parquet.apache.org) to store genomic data,
+in addition to supporting conventional genomic file formats. Parquet is an
+efficient columnar storage format that is widely used in the analytics
+ecosystem and integrates with a variety of data management tools and query
+engines. Parquet files are also smaller than files in several conventional
+genomics formats. Here, we compare the storage cost of aligned reads,
+features, and variants.
+
+## Aligned Reads {#aligned-reads-storage}
+
+In this benchmark, we use reads from NA12878, sequenced at approximately 60x
+coverage across the whole genome and aligned to the GRCh37 reference genome
+using BWA. We store these alignments as BAM, CRAM, and ADAM, using the default
+compression settings for each. The BAM and CRAM files were generated using
+htslib.
+
+![Storage cost of a 60x coverage WGS aligned dataset](source/img/bam.pdf)
+
+ADAM provides a 20% improvement in storage size over BAM, while CRAM achieves
+an approximately 42% improvement. CRAM achieves the higher compression ratio
+because it uses reference-based compression techniques that minimize the
+amount of data stored on disk.
+
+## Features {#feature-storage}
+
+Here, we benchmark both the GFF3 and BED formats. For GFF3, we use the Ensembl
+GRCh38 genome annotation file. For BED, we use genome-wide coverage counts
+generated from the NA12878 dataset used in the [aligned read
+benchmarks](#aligned-reads-storage).
+
+![Storage cost of genome annotations](source/img/gff.pdf)
+
+For the genome annotation file, ADAM provides a 17% improvement in storage
+size relative to the compressed GFF3 file.
+
+![Storage cost of coverage data](source/img/bed.pdf)
+
+For the coverage data, ADAM provides a 43% improvement in storage size relative
+to the compressed BED file.
+
+## Genomic Variants {#variant-storage}
+
+In this benchmark, we use the 1000 Genomes Project phase 3 data release VCFs.
+We compare GZIP-compressed VCF and uncompressed VCF to ADAM.
+
+![Storage cost of variant data](source/img/vcf.pdf)
+
+Compressed VCF is approximately 10% smaller than genotype data stored as
+Parquet.
diff --git a/docs/source/img/bam.pdf b/docs/source/img/bam.pdf
diff --git a/docs/source/img/bed.pdf b/docs/source/img/bed.pdf
diff --git a/docs/source/img/gff.pdf b/docs/source/img/gff.pdf
diff --git a/docs/source/img/source/file_benchmarks.py b/docs/source/img/source/file_benchmarks.py
@@ -0,0 +1,64 @@
+#
+# Licensed to Big Data Genomics (BDG) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The BDG licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import numpy as np
+import matplotlib.pyplot as plt
+
+# sizes in MB
+gff_sizes  = [  29.9,       36, 392.2]
+gff_labels = ['ADAM', 'GFF.GZ', 'GFF']
+
+# sizes in GB
+bed_sizes  = [   9.1,       16, 118.9]
+bed_labels = ['ADAM', 'BED.GZ', 'BED']
+
+# sizes in GB
+bam_sizes  = [    69,   95.1, 119.1]
+bam_labels = ['CRAM', 'ADAM', 'BAM']
+
+# sizes in GB
+vcf_sizes  = [      19,     21,   807]
+vcf_labels = ['VCF.GZ', 'ADAM', 'VCF']
+
+def plot(sizes, labels, filename, title, unit='GB'):
+    """Save a labeled bar chart of file sizes to filename."""
+    ind = np.arange(len(sizes))
+    width = 0.35
+
+    fig, ax = plt.subplots()
+    rects = ax.bar(ind, sizes, width, align='center')
+
+    (ymin, ymax) = plt.ylim()
+    plt.ylim(ymin, ymax + 10)
+    ax.set_ylabel('File size (%s)' % unit)
+    ax.set_title(title)
+    ax.set_xticks(ind)
+    ax.set_xticklabels(labels)
+
+    for (rect, size) in zip(rects, sizes):
+        height = rect.get_height()
+        ax.text(rect.get_x() + rect.get_width() / 2.0, 1.01 * height,
+                '%g %s' % (size, unit),  # %g keeps decimals that %d truncates
+                ha='center', va='bottom')
+
+    fig.savefig(filename)
+
+plot(gff_sizes, gff_labels, 'gff.pdf', 'File sizes for Ensembl GRCh38 GFF3', unit='MB')
+plot(bed_sizes, bed_labels, 'bed.pdf', 'File sizes for WGS Coverage BED')
+plot(bam_sizes, bam_labels, 'bam.pdf', 'File sizes for 60x WGS BAM')
+plot(vcf_sizes, vcf_labels, 'vcf.pdf', 'File sizes for 1,000 Genomes VCF')
diff --git a/docs/source/img/vcf.pdf b/docs/source/img/vcf.pdf
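
The relative savings quoted in `35_benchmarks.md` can be rechecked against the sizes hard-coded in `file_benchmarks.py`. The following is a minimal sketch, not part of the commit; the helper name `percent_smaller` is hypothetical, and the sizes are copied from the plotting script, so the results are only as precise as those rounded values.

```python
# Sizes copied from file_benchmarks.py (GB, except GFF in MB).
bam_sizes = {'CRAM': 69, 'ADAM': 95.1, 'BAM': 119.1}
gff_sizes = {'ADAM': 29.9, 'GFF.GZ': 36, 'GFF': 392.2}
bed_sizes = {'ADAM': 9.1, 'BED.GZ': 16, 'BED': 118.9}
vcf_sizes = {'VCF.GZ': 19, 'ADAM': 21, 'VCF': 807}


def percent_smaller(baseline, candidate):
    """Return how much smaller candidate is than baseline, in percent.

    Hypothetical helper for illustration; both sizes must use the same unit.
    """
    return 100.0 * (baseline - candidate) / baseline


if __name__ == '__main__':
    comparisons = [
        ('ADAM vs. BAM', bam_sizes['BAM'], bam_sizes['ADAM']),
        ('CRAM vs. BAM', bam_sizes['BAM'], bam_sizes['CRAM']),
        ('ADAM vs. GFF.GZ', gff_sizes['GFF.GZ'], gff_sizes['ADAM']),
        ('ADAM vs. BED.GZ', bed_sizes['BED.GZ'], bed_sizes['ADAM']),
        ('VCF.GZ vs. ADAM', vcf_sizes['ADAM'], vcf_sizes['VCF.GZ']),
    ]
    for name, baseline, candidate in comparisons:
        print('%-16s %5.1f%% smaller' % (name, percent_smaller(baseline, candidate)))
```

Keeping the arithmetic next to the plotting data makes it easier to keep the prose and the figures in agreement if the benchmark files are regenerated.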