[ADAM-1630] Overhauled docs introduction and added architecture section.
Resolves bigdatagenomics#1630, bigdatagenomics#1632, bigdatagenomics#1633. Rewrote the introduction to focus on what ADAM
provides and the ADAM ecosystem. Added an architecture section that talks about
ADAM's stack model and schemas, and which introduces the ADAMContext and
GenomicRDDs as implementations of the evidence access layer of the stack.
fnothaft committed Dec 4, 2017
1 parent 5ec9701 commit 26887db
Showing 12 changed files with 4,968 additions and 192 deletions.
19 changes: 19 additions & 0 deletions docs/build.sh
@@ -5,6 +5,7 @@ git_version=$(git log -1 --pretty=format:%H)
output_dir="output"
pdf_output="$output_dir/ADAM_${git_version}.pdf"
html_output="$output_dir/ADAM_${git_version}.html"
md_output="$output_dir/ADAM_${git_version}.rst"
date_str=$(date '+%Y-%m-%d')

mkdir -p ${output_dir}
@@ -20,6 +21,24 @@ if [ $? -ne "0" ]; then
exit 0
fi

# Generate an RST version of the docs
pandoc -N -t rst \
--mathjax \
--filter pandoc-citeproc \
--highlight-style "$highlight_style" \
--variable mainfont="Georgia" \
--variable sansfont="Arial" \
--variable monofont="Andale Mono" \
--variable fontsize=10pt \
--variable version=$git_version \
--variable listings=true \
--variable title="$title" \
--variable date="$date_str" \
--variable author="$author" \
--toc \
--bibliography=source/bibliography.bib \
source/*.md -s -S -o $md_output

# Generate a PDF of the docs
pandoc -N --template=template.tex \
--filter pandoc-citeproc \
229 changes: 40 additions & 189 deletions docs/source/01_intro.md
@@ -1,191 +1,42 @@
# Introduction

ADAM is a genomics analysis platform with specialized file formats, built using [Apache Avro](http://avro.apache.org), [Apache Spark](http://spark.apache.org/) and [Apache Parquet](http://parquet.io/). ADAM is Apache 2 licensed.

* [Follow](https://twitter.com/bigdatagenomics/) our Twitter account
* [Chat](https://gitter.im/bigdatagenomics/adam) with ADAM developers on Gitter
* [Join](http://bdgenomics.org/mail) our mailing list
* [Check out](https://amplab.cs.berkeley.edu/jenkins/view/Big%20Data%20Genomics/) the current build status
* [Download](https://github.com/bigdatagenomics/adam/releases) official releases
* [View](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics) our software artifacts on Maven Central
* [See](https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics) our snapshots
* [Look](https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md) at our CHANGES file

## Apache Spark

[Apache Spark](http://spark.apache.org/) allows developers to write algorithms in succinct code that can run quickly on a local machine, on an in-house cluster, or in the Amazon, Google, or Microsoft clouds.

For example, the following code snippet will print the top ten 21-mers in `NA21144` from 1000 Genomes:

```scala
import org.bdgenomics.adam.projections.{ AlignmentRecordField, Projection }
import org.bdgenomics.adam.rdd.ADAMContext

val ac = new ADAMContext(sc)
// Load alignments from disk, pushing down a row filter (predicate) and
// materializing only the three columns we need (projection)
val reads = ac.loadAlignments(
  "/data/NA21144.chrom11.ILLUMINA.adam",
  predicate = Some(classOf[ExamplePredicate]),
  projection = Some(Projection(
    AlignmentRecordField.sequence,
    AlignmentRecordField.readMapped,
    AlignmentRecordField.mapq)))

// Generate, count and sort 21-mers
val kmers = reads.flatMap { read =>
  read.getSequence.sliding(21).map(k => (k, 1L))
}.reduceByKey((k1: Long, k2: Long) => k1 + k2)
  .map(_.swap)
  .sortByKey(ascending = false)

// Print the top 10 most common 21-mers
kmers.take(10).foreach(println)
```

Executing this Spark job will output the following:

```
(121771,TTTTTTTTTTTTTTTTTTTTT)
(44317,ACACACACACACACACACACA)
(44023,TGTGTGTGTGTGTGTGTGTGT)
(42474,CACACACACACACACACACAC)
(42095,GTGTGTGTGTGTGTGTGTGTG)
(33797,TAATCCCAGCACTTTGGGAGG)
(33081,AATCCCAGCACTTTGGGAGGC)
(32775,TGTAATCCCAGCACTTTGGGA)
(32484,CCTCCCAAAGTGCTGGGATTA)
```

You do not need to be a Scala developer to use ADAM. You could also run the following ADAM CLI command for the same result:

```bash
adam-submit countKmers \
  /data/NA21144.chrom11.ILLUMINA.adam \
  /data/results.txt 21
```

## Apache Parquet

[Apache Parquet](http://parquet.apache.org) is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

- Parquet compresses legacy genomic formats using standard columnar techniques (e.g. RLE, dictionary encoding). ADAM files are typically ~20% smaller than compressed BAM files.
- Parquet integrates with:
- **Query engines**: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- **Frameworks**: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- **Data models**: Avro, Thrift, ProtocolBuffers, POJOs
- Parquet is simply a file format, which makes it easy to sync and share data using tools like `distcp`, `rsync`, etc.
- Parquet provides a command-line tool, `parquet.hadoop.PrintFooter`, which reports useful compression statistics

In the k-mer counting example above, you can see that there is a defined *predicate* and *projection*. The *predicate* allows rapid filtering of rows, while the *projection* allows you to efficiently materialize only specific columns for analysis. For this k-mer counting example, we use a *predicate* to filter out any records that are not mapped or that have a `MAPQ` below 20, and a *projection* to materialize only the `Sequence`, `ReadMapped` flag and `MAPQ` columns, skipping over all other fields like `Reference` or `Start` position, as illustrated in the table and sketch below:

| Sequence    | ReadMapped | MAPQ   | ~~Reference~~ | ~~Start~~  | ... |
|-------------|------------|--------|---------------|------------|-----|
| ~~GGTCCAT~~ | ~~false~~  | -      | ~~chrom1~~    | -          | ... |
| TACTGAA     | true       | 30     | ~~chrom1~~    | ~~34232~~  | ... |
| ~~TTGAATG~~ | ~~true~~   | ~~17~~ | ~~chrom1~~    | ~~309403~~ | ... |
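
In ADAM's Scala API, a projection like the one above is declared with the `Projection` helper. The following is a minimal sketch; the RDD-level `filter` stands in for the `ExamplePredicate` pushdown filter, and it assumes the `readMapped` and `mapq` fields are populated:

```scala
import org.bdgenomics.adam.projections.{ AlignmentRecordField, Projection }

// Materialize only the three columns used for k-mer counting; Parquet
// skips the remaining columns on disk entirely.
val projection = Projection(
  AlignmentRecordField.sequence,
  AlignmentRecordField.readMapped,
  AlignmentRecordField.mapq)

// An RDD-level equivalent of the predicate described above: keep
// mapped reads with a MAPQ of at least 20.
val highQuality = reads.filter(read => read.getReadMapped && read.getMapq >= 20)
```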

## Apache Avro

- Apache Avro is a data serialization system ([http://avro.apache.org](http://avro.apache.org))
- All Big Data Genomics schemas are published at [https://github.com/bigdatagenomics/bdg-formats](https://github.com/bigdatagenomics/bdg-formats)
- Having explicit schemas and self-describing data makes integrating, sharing and evolving formats easier

Our Avro schemas are directly converted into source code using Avro tools. Avro supports a number of programming languages; ADAM uses Java, but you could
just as easily use the Avro IDL description as the basis for a Python project. Avro currently supports C, C++, C#, Java, JavaScript, PHP, Python and Ruby.
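
As a small illustration, the generated Java classes expose Avro's builder API, which can be used directly from Scala (a sketch; the field names follow the published `AlignmentRecord` schema in bdg-formats):

```scala
import org.bdgenomics.formats.avro.AlignmentRecord

// Construct a record through the Avro-generated builder.
val read = AlignmentRecord.newBuilder()
  .setSequence("TACTGAA")
  .setReadMapped(true)
  .setMapq(30)
  .build()
```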

## More than k-mer counting

ADAM does much more than just k-mer counting. Running the ADAM CLI without arguments or with `--help` will display available commands.

```bash
$ adam-submit

       e        888~-_          e                 e    e
      d8b       888   \        d8b               d8b  d8b
     /Y88b      888    |      /Y88b             d888bdY88b
    /  Y88b     888    |     /  Y88b           / Y88Y Y888b
   /____Y88b    888   /     /____Y88b         /  YY   Y888b
  /      Y88b   888_-~     /      Y88b       /        Y888b

Usage: adam-submit [<spark-args> --] <adam-args>

Choose one of the following commands:

ADAM ACTIONS
          countKmers : Counts the k-mers/q-mers from a read dataset.
    countContigKmers : Counts the k-mers/q-mers from a contig dataset.
 transformAlignments : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
   transformFeatures : Convert a file with sequence features into corresponding ADAM format and vice versa
  transformGenotypes : Convert a file with genotypes into corresponding ADAM format and vice versa
   transformVariants : Convert a file with variants into corresponding ADAM format and vice versa
         mergeShards : Merges the shards of a file
      reads2coverage : Calculate the coverage from a given ADAM file

CONVERSION OPERATIONS
          fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences.
          adam2fasta : Convert ADAM nucleotide contig fragments to FASTA files
          adam2fastq : Convert BAM to FASTQ files
     fragments2reads : Convert fragment records into alignment records.
     reads2fragments : Convert alignment records into fragment records.

PRINT
               print : Print an ADAM formatted file
            flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
                view : View certain reads from an alignment-record file.
```

You can learn more about a command by calling it without arguments or with `--help`.

```bash
$ adam-submit transformAlignments
Argument "INPUT" is required
INPUT : The ADAM, BAM or SAM file to apply the transforms to
OUTPUT : Location to write the transformed data in ADAM/Parquet format
-add_md_tags VAL : Add MD Tags to reads based on the FASTA (or equivalent) file passed to this option.
-aligned_read_predicate : Only load aligned reads. Only works for Parquet files.
-cache : Cache data to avoid recomputing between stages.
-coalesce N : Set the number of partitions written to the ADAM output directory
-concat VAL : Concatenate this file with <INPUT> and write the result to <OUTPUT>
-dump_observations VAL : Local path to dump BQSR observations to. Outputs CSV format.
-force_load_bam : Forces Transform to load from BAM/SAM.
-force_load_fastq : Forces Transform to load from unpaired FASTQ.
-force_load_ifastq : Forces Transform to load from interleaved FASTQ.
-force_load_parquet : Forces Transform to load from Parquet.
-force_shuffle_coalesce : Even if the repartitioned RDD has fewer partitions, force a shuffle.
-h (-help, --help, -?) : Print help
-known_indels VAL : VCF file including locations of known INDELs. If none is provided, default
                    consensus model will be used.
-known_snps VAL : Sites-only VCF giving location of known SNPs
-limit_projection : Only project necessary fields. Only works for Parquet files.
-log_odds_threshold N : The log-odds threshold for accepting a realignment. Default value is 5.0.
-mark_duplicate_reads : Mark duplicate reads
-max_consensus_number N : The maximum number of consensus to try realigning a target region to. Default
                          value is 30.
-max_indel_size N : The maximum length of an INDEL to realign to. Default value is 500.
-max_target_size N : The maximum length of a target region to attempt realigning. Default length is
                     3000.
-md_tag_fragment_size N : When adding MD tags to reads, load the reference in fragments of this size.
-md_tag_overwrite : When adding MD tags to reads, overwrite existing incorrect tags.
-paired_fastq VAL : When converting two (paired) FASTQ files to ADAM, pass the path to the second file
                    here.
-parquet_block_size N : Parquet block size (default = 128mb)
-parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec
-parquet_disable_dictionary : Disable dictionary encoding
-parquet_logging_level VAL : Parquet logging level (default = severe)
-parquet_page_size N : Parquet page size (default = 1mb)
-print_metrics : Print metrics to the log on completion
-realign_indels : Locally realign indels present in reads.
-recalibrate_base_qualities : Recalibrate the base quality scores (ILLUMINA only)
-record_group VAL : Set converted FASTQs' record-group names to this value; if empty-string is passed,
                    use the basename of the input file, minus the extension.
-repartition N : Set the number of partitions to map data to
-single : Saves OUTPUT as single file
-sort_fastq_output : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default.
                     Ignored if not saving as FASTQ.
-sort_reads : Sort the reads by referenceId and read position
-storage_level VAL : Set the storage level to use for caching.
-stringency VAL : Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults
                  to LENIENT
```
The ADAM `transformAlignments` command allows you to mark duplicates, run base quality score recalibration (BQSR), and apply other pre-processing steps to your data.
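
The same pre-processing is available programmatically. Below is a minimal sketch of the corresponding Scala API (the input and output paths are hypothetical):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Load a BAM file, mark duplicate reads, and save the result as Parquet.
val reads = sc.loadAlignments("/data/sample.bam")
val deduped = reads.markDuplicates()
deduped.saveAsParquet("/data/sample.markdup.adam")
```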

There are also a number of projects built on ADAM:

- [Avocado](https://github.com/bigdatagenomics/avocado) is a variant caller built on top of ADAM for germline and somatic calling
- [Mango](https://github.com/bigdatagenomics/mango) is a library for visualizing large scale genomics data with interactive latencies

ADAM is a library and command line tool that enables the use of [Apache
Spark](https://spark.apache.org) to parallelize genomic data analysis across
cluster/cloud computing environments. ADAM uses a set of schemas to describe
genomic sequences, reads, variants/genotypes, and features, and can be used
with data in legacy genomic file formats such as SAM/BAM/CRAM or VCF, as well
as data stored in the columnar [Apache Parquet](https://parquet.apache.org)
format. On a single node, ADAM provides competitive performance to optimized
multi-threaded tools, while enabling scale out to clusters with more than a
thousand cores. ADAM's APIs can be used from Scala, Java, Python, R, and SQL.
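
As a quick illustration of the Scala API, a minimal session might look like the following sketch (the paths are hypothetical):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Read genotypes from a legacy VCF file...
val genotypes = sc.loadGenotypes("/data/sample.vcf")
// ...and write them back out in the columnar Parquet format.
genotypes.saveAsParquet("/data/sample.genotypes.adam")
```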

## The ADAM/Big Data Genomics Ecosystem

ADAM builds upon the open source [Apache Spark](https://spark.apache.org),
[Apache Avro](https://avro.apache.org), and [Apache
Parquet](https://parquet.apache.org) projects. Additionally, ADAM can be
deployed for both interactive and production workflows using a variety of
platforms. A diagram of the ecosystem of tools and libraries that ADAM builds on
and the tools that build upon the ADAM APIs can be found below.

![The ADAM ecosystem.](source/img/bdgenomics-stack.pdf)

As the diagram shows, beyond the [ADAM CLI](#cli), there are a number of tools
built using ADAM's core APIs:

- [Avocado](https://github.com/bigdatagenomics/avocado) is a variant caller built
on top of ADAM for germline and somatic calling
- [Cannoli](https://github.com/bigdatagenomics/cannoli) uses ADAM's [pipe](#pipes)
API to parallelize common single-node genomics tools (e.g.,
[BWA](https://github.com/lh3/bwa), bowtie,
[FreeBayes](https://github.com/ekg/freebayes))
- [DECA](https://github.com/bigdatagenomics/deca) is a reimplementation of the
XHMM copy number variant caller on top of ADAM/Apache Spark
- [Gnocchi](https://github.com/bigdatagenomics/gnocchi) provides primitives for
running GWAS/eQTL tests on large genotype/phenotype datasets using ADAM
- [Lime](https://github.com/bigdatagenomics/lime) provides a parallel
implementation of genomic set theoretic primitives using the [region join
API](#join)
- [Mango](https://github.com/bigdatagenomics/mango) is a library for visualizing
large scale genomics data with interactive latencies and serving data using the
[GA4GH schemas](https://github.com/ga4gh/schemas)
