Consolidate documentation into a single location in source.

This system uses `pandoc` to convert Markdown into PDF and HTML for each release. The current docs were taken from our Wiki and README.
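
As a rough illustration of the conversion step described above, the sketch below mirrors the output naming scheme used by `docs/build.sh` (`output/ADAM_v<git version>.<ext>`). The function names `doc_output_path` and `build_html` are hypothetical and not part of this commit, and creating the output directory up front is an added assumption, not something the committed script does.

```shell
#!/usr/bin/env bash
# Sketch only: doc_output_path and build_html are hypothetical helpers,
# not part of the commit. They follow the naming used in docs/build.sh.

# Print the output path for one format, e.g. output/ADAM_v6c1b89f.pdf
doc_output_path() {
  local output_dir="$1" version="$2" ext="$3"
  printf '%s/ADAM_v%s.%s\n' "$output_dir" "$version" "$ext"
}

# Convert every Markdown file under source/ into one standalone HTML file
# with a table of contents.
build_html() {
  local output_dir="$1" version="$2"
  # Assumption: create the directory first in case it does not exist yet.
  mkdir -p "$output_dir"
  pandoc source/*.md -s --toc -o "$(doc_output_path "$output_dir" "$version" html)"
}
```

Run from the `docs` directory, a call such as `build_html output "$(git rev-parse --short HEAD)"` would write the HTML next to where the PDF is expected to land.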
bigdatagenomics · Nov 13, 2014 · 6c1b89f
1 parent 17d1d4d
commit 6c1b89f
Showing 14 changed files with 784 additions and 259 deletions.
diff --git a/README.md b/README.md
@@ -1,266 +1,13 @@
-ADAM
-====
-A genomics processing engine and specialized file format built using [Apache Avro](http://avro.apache.org), 
-[Apache Spark](http://spark.incubator.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.
+# ADAM
 
-# Introduction
+A genome analysis platform built on Apache Hadoop, Spark, Parquet and Avro. Apache 2 licensed.
 
-Current genomic file formats are not designed for
-distributed processing. ADAM addresses this by explicitly defining data
-formats as [Apache Avro](http://avro.apache.org) objects and storing them in 
-[Parquet](http://parquet.io) files. [Apache Spark](http://spark.incubator.apache.org/)
-is used as the cluster execution system.
+[![Build Status](https://amplab.cs.berkeley.edu/jenkins/buildStatus/icon?job=ADAM)](https://amplab.cs.berkeley.edu/jenkins/job/ADAM/)
 
-## Explicitly defined format
+To generate documentation,
 
-The [Sequencing Alignment Map (SAM) and Binary Alignment Map (BAM)
-file specification](http://samtools.sourceforge.net/SAM1.pdf) defines a data format 
-for storing reads from aligners. The specification is well-written but provides
-no tools for developers to implement the format. Developers have to hand-craft 
-source code to encode and decode the records which is error prone and an unneccesary
-hassle.
-
-In contrast, the [ADAM specification for storing reads]
-(https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl)
-is defined in the Avro Interface Description Language (IDL) which is directly converted
-into source code. Avro supports a number of computer languages. ADAM uses Java; you could 
-just as easily use this Avro IDL description as the basis for a Python project. Avro
-currently supports c, c++, csharp, java, javascript, php, python and ruby. 
-
-## Ready for distributed processing
-
-The SAM/BAM format is record-oriented with a single record for each read. However,
-the typical data access pattern is column oriented, e.g. search for bases at a
-specific position in a reference genome. The BAM specification tries to support
-this pattern by defining a format for a separate index file. However, this index
-needs to be regenerated anytime your BAM file changes which is costly. The index
-does help keep the cost down on file seeks but the columnar store ADAM uses reduces
-the cost of seeks even more.
-
-Once you convert your BAM file to ADAM, it can be directly accessed by 
-[Hadoop Map-Reduce](http://hadoop.apache.org), [Spark](http://spark-project.org/), 
-[Shark](http://shark.cs.berkeley.edu), [Impala](https://github.com/cloudera/impala), 
-[Pig](http://pig.apache.org), [Hive](http://hive.apache.org), whatever. Using
-ADAM will unlock your genomic data and make it available to a broader range of
-systems.
-
-# Getting Started
-
-## Installation
-
-You will need to have [Maven](http://maven.apache.org/) installed in order to build ADAM.
-
-> **Note:** The default configuration is for Hadoop 2.2.0. If building against a different
-> version of Hadoop, please edit the build configuration in the `<properties>` section of
-> the `pom.xml` file.
-
-```
-$ git clone https://github.com/bigdatagenomics/adam.git
-$ cd adam
-$ export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
-$ mvn clean package -DskipTests
-...
-[INFO] ------------------------------------------------------------------------
-[INFO] BUILD SUCCESS
-[INFO] ------------------------------------------------------------------------
-[INFO] Total time: 9.647s
-[INFO] Finished at: Thu May 23 15:50:42 PDT 2013
-[INFO] Final Memory: 19M/81M
-[INFO] ------------------------------------------------------------------------
-```
-
-You might want to take a peek at the `scripts/jenkins-test` script and give it a run. It will fetch a mouse chromosome, encode it to ADAM
-reads and pileups, run flagstat, etc. We use this script to test that ADAM is working correctly.
-
-## Running ADAM
-
-ADAM is packaged via [appassembler](http://mojo.codehaus.org/appassembler/appassembler-maven-plugin/) and includes all necessary
-dependencies
-
-You might want to add the following to your `.bashrc` to make running `adam` easier:
-
-```
-alias adam-local="bash ${ADAM_HOME}/adam-cli/target/appassembler/bin/adam"
-alias adam-submit="${ADAM_HOME}/bin/adam-submit"
-alias adam-shell="${ADAM_HOME}/bin/adam-shell"
 ```
-
-`$ADAM_HOME` should be the path to where you have checked ADAM out on your local filesystem. 
-The first alias should be used for running ADAM jobs that operate locally. The latter two aliases 
-call scripts that wrap the `spark-submit` and `spark-shell` commands to set up ADAM. You'll need
-to have the Spark binaries on your system; prebuilt binaries can be downloaded from the
-[Spark website](http://spark.apache.org/downloads.html). Currently, we build for
-[Spark 1.1, and Hadoop 2.3 (CDH5)](http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz).
-
-Once this alias is in place, you can run adam by simply typing `adam-local` at the commandline, e.g.
-
+cd docs
+./build.sh
 ```
-$ adam-local
-
-     e            888~-_              e                 e    e
-    d8b           888   \            d8b               d8b  d8b
-   /Y88b          888    |          /Y88b             d888bdY88b
-  /  Y88b         888    |         /  Y88b           / Y88Y Y888b
- /____Y88b        888   /         /____Y88b         /   YY   Y888b
-/      Y88b       888_-~         /      Y88b       /          Y888b
-
-Choose one of the following commands:
-
-           transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
-            flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
-           reads2ref : Convert an ADAM read-oriented file to an ADAM reference-oriented file
-             mpileup : Output the samtool mpileup text from ADAM reference-oriented data
-               print : Print an ADAM formatted file
-   aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
-            listdict : Print the contents of an ADAM sequence dictionary
-             compare : Compare two ADAM files based on read name
-    compute_variants : Compute variant data from genotypes
-            bam2adam : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
-            adam2vcf : Convert an ADAM variant to the VCF ADAM format
-            vcf2adam : Convert a VCF file to the corresponding ADAM format
-
-```
-
-ADAM outputs all the commands that are available for you to run. To get
-help for a specific command, run `adam-local <command>` without any additional arguments.
-
-````
-$ adam-submit transform
-Argument "INPUT" is required
- INPUT                                                           : The ADAM, BAM or SAM file to apply the transforms to
- OUTPUT                                                          : Location to write the transformed data in ADAM/Parquet format
- -coalesce N                                                     : Set the number of partitions written to the ADAM output directory
- -dump_observations VAL                                          : Local path to dump BQSR observations to. Outputs CSV format.
- -h (-help, --help, -?)                                          : Print help
- -known_indels VAL                                               : VCF file including locations of known INDELs. If none is provided, default
-                                                                   consensus model will be used.
- -known_snps VAL                                                 : Sites-only VCF giving location of known SNPs
- -log_odds_threshold N                                           : The log-odds threshold for accepting a realignment. Default value is 5.0.
- -mark_duplicate_reads                                           : Mark duplicate reads
- -max_consensus_number N                                         : The maximum number of consensus to try realigning a target region to. Default
-                                                                   value is 30.
- -max_indel_size N                                               : The maximum length of an INDEL to realign to. Default value is 500.
- -max_target_size N                                              : The maximum length of a target region to attempt realigning. Default length is
-                                                                   3000.
- -parquet_block_size N                                           : Parquet block size (default = 128mb)
- -parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec
- -parquet_disable_dictionary                                     : Disable dictionary encoding
- -parquet_logging_level VAL                                      : Parquet logging level (default = severe)
- -parquet_page_size N                                            : Parquet page size (default = 1mb)
- -print_metrics                                                  : Print metrics to the log on completion
- -qualityBasedTrim                                               : Trims reads based on quality scores of prefix/suffixes across read group.
- -qualityThreshold N                                             : Phred scaled quality threshold used for trimming. If omitted, Phred 20 is used.
- -realign_indels                                                 : Locally realign indels present in reads.
- -recalibrate_base_qualities                                     : Recalibrate the base quality scores (ILLUMINA only)
- -repartition N                                                  : Set the number of partitions to map data to
- -sort_fastq_output                                              : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default.
-                                                                   Ignored if not saving as FASTQ.
- -sort_reads                                                     : Sort the reads by referenceId and read position
- -trimBeforeBQSR                                                 : Performs quality based trim before running BQSR. Default is to run quality based
-                                                                   trim after BQSR.
- -trimFromEnd N                                                  : Trim to be applied to end of read.
- -trimFromStart N                                                : Trim to be applied to start of read.
- -trimReadGroup VAL                                              : Read group to be trimmed. If omitted, all reads are trimmed.
- -trimReads                                                      : Apply a fixed trim to the prefix and suffix of all reads/reads in a specific read
-                                                                   group.
-
-````
-
-If you followed along above, now try making your first `.adam` file like this:
-
-````
-adam-submit transform $ADAM_HOME/adam-core/src/test/resources/small.sam /tmp/small.adam
-````
-
-... and if you didn't obtain your copy of adam from github, you can [grab `small.sam` from here](https://raw.githubusercontent.com/bigdatagenomics/adam/master/adam-core/src/test/resources/small.sam).
-
-
-# flagstat
-
-Once you have data converted to ADAM, you can gather statistics from the ADAM file using `flagstat`.
-This command will output stats identically to the samtools `flagstat` command.
-
-If you followed along above, now try gathering some statistics:
-
-````
-$ adam-local flagstat /tmp/small.adam
-20 + 0 in total (QC-passed reads + QC-failed reads)
-0 + 0 primary duplicates
-0 + 0 primary duplicates - both read and mate mapped
-0 + 0 primary duplicates - only read mapped
-0 + 0 primary duplicates - cross chromosome
-0 + 0 secondary duplicates
-0 + 0 secondary duplicates - both read and mate mapped
-0 + 0 secondary duplicates - only read mapped
-0 + 0 secondary duplicates - cross chromosome
-20 + 0 mapped (100.00%:0.00%)
-0 + 0 paired in sequencing
-0 + 0 read1
-0 + 0 read2
-0 + 0 properly paired (0.00%:0.00%)
-0 + 0 with itself and mate mapped
-0 + 0 singletons (0.00%:0.00%)
-0 + 0 with mate mapped to a different chr
-0 + 0 with mate mapped to a different chr (mapQ>=5)
-````
-
-In practice, you'll find that the ADAM `flagstat` command takes orders of magnitude less
-time than samtools to compute these statistics. For example, on a MacBook Pro
-`flagstat NA12878_chr20.bam` took 17 seconds to run while `samtools flagstat NA12878_chr20.bam`
-took 55 seconds. On larger files, the difference in speed is even more dramatic. ADAM is faster
-because it's multi-threaded and distributed and uses a columnar storage format (with a
-projected schema that only materializes the read flags instead of the whole read). 
-
-# count_kmers
-
-You can also use ADAM to count all K-mers present across all reads in the
-`.adam` file using `count_kmers`.  Try this:
-
-````
-$ adam-local count_kmers /tmp/small.adam /tmp/kmers.adam 10
-$ head /tmp/kmers.adam/part-*
-TTTTAAGGTT, 1
-TTCCGATTTT, 1
-GAGCAGCCTT, 1
-CCTGCTGTAT, 1
-AATTGGCACT, 1
-GGCCAGGACT, 1
-GCAGTCCCTC, 1
-AACTTTGAAT, 1
-GATGACGTGG, 1
-CTGTCCCTGT, 1
-````
-
-Each line contains part-* file(s) with line-based records that contain two
-comma-delimited values.  The first value is the K-mer itself and the second
-value is the number of times that K-mer occurred in the input file.  
-
-# Running on a cluster
-
-We provide the `adam-submit` and `adam-shell` commands under the `bin` directory. These can
-be used to submit ADAM jobs to a spark cluster, or to run ADAM interactively.
-
-## Running Plugins
-
-ADAM allows users to create plugins via the [ADAMPlugin](https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/plugins/ADAMPlugin.scala)
-trait. These plugins are then imported using the Java classpath at runtime. To add to the classpath when
-using appassembler, use the `$CLASSPATH_PREFIX` environment variable. For an example of how to use
-the plugin interface, please see the [adam-plugins repo](https://github.com/heuermh/adam-plugins).
-
-# Getting In Touch
-
-## Mailing List
-
-[The ADAM mailing list](https://groups.google.com/forum/#!forum/adam-developers) is a good
-way to sync up with other people who use ADAM including the core developers. You can subscribe
-by sending an email to `adam-developers+subscribe@googlegroups.com` or just post using
-the [web forum page](https://groups.google.com/forum/#!forum/adam-developers).
-
-## IRC Channel
-
-A lot of the developers are hanging on the [#adamdev](http://webchat.freenode.net/?channels=adamdev)
-freenode.net channel. Come join us and ask questions.
-
-# License
 
-ADAM is released under an [Apache 2.0 license](LICENSE.txt).
diff --git a/docs/.gitignore b/docs/.gitignore
@@ -0,0 +1 @@
+output
diff --git a/docs/build.sh b/docs/build.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+git_version=$(git rev-parse --short HEAD)
+output_dir="output"
+pdf_output="$output_dir/ADAM_v$git_version.pdf"
+html_output="$output_dir/ADAM_v$git_version.html"
+
+# Generate a PDF of the docs
+pandoc -N --template=template.tex \
+--variable mainfont="Georgia" \
+--variable sansfont="Arial" \
+--variable monofont="Andale Mono" \
+--variable fontsize=10pt \
+--variable version=$git_version \
+--variable listings=true \
+source/*.md -s -S --toc -o $pdf_output \
+--latex-engine=lualatex
+
+# Generate HTML of the docs
+pandoc source/*.md -s -S --toc -o $html_output
diff --git a/docs/source/01_intro.md b/docs/source/01_intro.md
@@ -0,0 +1,42 @@
+# Introduction
+
+ADAM is a genomics analysis platform with specialized file formats built using [Apache Avro](http://avro.apache.org), 
+[Apache Spark](http://spark.incubator.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.
+
+Current genomic file formats are not designed for
+distributed processing. ADAM addresses this by explicitly defining data
+formats as [Apache Avro](http://avro.apache.org) objects and storing them in 
+[Parquet](http://parquet.io) files. [Apache Spark](http://spark.incubator.apache.org/)
+is used as the cluster execution system.
+
+## Explicitly defined format
+
+The [Sequencing Alignment Map (SAM) and Binary Alignment Map (BAM)
+file specification](http://samtools.sourceforge.net/SAM1.pdf) defines a data format 
+for storing reads from aligners. The specification is well-written but provides
+no tools for developers to implement the format. Developers have to hand-craft 
+source code to encode and decode the records which is error prone and an unnecessary
+hassle.
+
+In contrast, the [ADAM specification for storing reads](https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl)
+is defined in the Avro Interface Description Language (IDL) which is directly converted
+into source code. Avro supports a number of computer languages. ADAM uses Java; you could 
+just as easily use this Avro IDL description as the basis for a Python project. Avro
+currently supports c, c++, csharp, java, javascript, php, python and ruby. 
+
+## Ready for distributed processing
+
+The SAM/BAM format is record-oriented with a single record for each read. However,
+the typical data access pattern is column oriented, e.g. search for bases at a
+specific position in a reference genome. The BAM specification tries to support
+this pattern by defining a format for a separate index file. However, this index
+needs to be regenerated anytime your BAM file changes which is costly. The index
+does help keep the cost down on file seeks but the columnar store ADAM uses reduces
+the cost of seeks even more.
+
+Once you convert your BAM file to ADAM, it can be directly accessed by 
+[Hadoop Map-Reduce](http://hadoop.apache.org), [Spark](http://spark-project.org/), 
+[Shark](http://shark.cs.berkeley.edu), [Impala](https://github.com/cloudera/impala), 
+[Pig](http://pig.apache.org), [Hive](http://hive.apache.org), whatever. Using
+ADAM will unlock your genomic data and make it available to a broader range of
+systems.