Cleaned up docs. #1642

Closed
wants to merge 19 commits into from
27 changes: 18 additions & 9 deletions README.md
@@ -5,15 +5,16 @@ ADAM

# Introduction

ADAM is a genomics analysis platform with specialized file formats built using [Apache Avro][Avro], [Apache Spark][Spark] and [Apache Parquet][Parquet]. Apache 2 licensed. Some quick links:
ADAM is a genomics analysis platform with specialized file formats built using [Apache Avro](http://avro.apache.org), [Apache Spark](http://spark.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.

* [Follow our Twitter account](https://twitter.com/bigdatagenomics/).
* [Chat with ADAM developers in Gitter](https://gitter.im/bigdatagenomics/adam).
* [Join our mailing list](http://bdgenomics.org/mail/).
* [Checkout the current build status](https://amplab.cs.berkeley.edu/jenkins/view/Big%20Data%20Genomics/).
* [Download official releases][releases].
* [View our software artifacts on Maven Central](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics) ([…including snapshots](https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics)).
* [Look at our CHANGES file](https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md).
* [Follow](https://twitter.com/bigdatagenomics/) our Twitter account
* [Chat](https://gitter.im/bigdatagenomics/adam) with ADAM developers on Gitter
* [Join](http://bdgenomics.org/mail) our mailing list
* [Check out](https://amplab.cs.berkeley.edu/jenkins/view/Big%20Data%20Genomics/) the current build status
* [Download](https://github.com/bigdatagenomics/adam/releases) official releases
* [View](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics) our software artifacts on Maven Central
* [See](https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics) our snapshots
* [Look](https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md) at our CHANGES file

## Why ADAM?

@@ -186,6 +187,14 @@ Outputs
You might want to take a peek at the `scripts/jenkins-test` script and give it a run. It will fetch a mouse chromosome, encode it to ADAM
reads and pileups, run flagstat, etc. We use this script to test that ADAM is working correctly.

### Homebrew

If you have Homebrew installed, you can install adam via:

```bash
$ brew install adam
```

### Installing Spark

You'll need to have a Spark release on your system and the `$SPARK_HOME` environment variable pointing at it; prebuilt binaries can be downloaded from the
@@ -406,4 +415,4 @@ architecture generalized beyond genomics. To cite this paper, please cite:
```

We prefer that you cite both papers, but if you can only cite one paper, we
prefer that you cite the SIGMOD 2015 manuscript.
prefer that you cite the SIGMOD 2015 manuscript.
43 changes: 20 additions & 23 deletions docs/source/01_intro.md
@@ -1,21 +1,21 @@
# Introduction

* Follow our Twitter account at [https://twitter.com/bigdatagenomics/](https://twitter.com/bigdatagenomics/)
* Chat with ADAM developers at [https://gitter.im/bigdatagenomics/adam](https://gitter.im/bigdatagenomics/adam)
* Join our mailing list at [http://bdgenomics.org/mail](http://bdgenomics.org/mail)
* Checkout the current build status at [https://amplab.cs.berkeley.edu/jenkins/](https://amplab.cs.berkeley.edu/jenkins/view/Big%20Data%20Genomics/)
* Download official releases at [https://github.com/bigdatagenomics/adam/releases](https://github.com/bigdatagenomics/adam/releases)
* View our software artifacts on Maven Central at [http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics)
* See our snapshots at [https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics](https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics)
* Look at our CHANGES file at [https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md](https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md)
ADAM is a genomics analysis platform with specialized file formats built using [Apache Avro](http://avro.apache.org), [Apache Spark](http://spark.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.

ADAM is a genomics analysis platform with specialized file formats built using [Apache Avro](http://avro.apache.org), [Apache Spark](http://spark.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.
* [Follow](https://twitter.com/bigdatagenomics/) our Twitter account
* [Chat](https://gitter.im/bigdatagenomics/adam) with ADAM developers on Gitter
* [Join](http://bdgenomics.org/mail) our mailing list
* [Check out](https://amplab.cs.berkeley.edu/jenkins/view/Big%20Data%20Genomics/) the current build status
* [Download](https://github.com/bigdatagenomics/adam/releases) official releases
* [View](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics) our software artifacts on Maven Central
* [See](https://oss.sonatype.org/index.html#nexus-search;quick~bdgenomics) our snapshots
* [Look](https://github.com/bigdatagenomics/adam/blob/master/CHANGES.md) at our CHANGES file

## Apache Spark

[Apache Spark](http://spark.apache.org/) allows developers to write algorithms in succinct code that can run fast locally, on an in-house cluster or on Amazon, Google or Microsoft clouds.

For example, the following code snippet will print the top 10 21-mers in `NA2114` from 1000 Genomes.
For example, the following code snippet will print the top ten 21-mers in `NA2114` from 1000 Genomes:

```scala
val ac = new ADAMContext(sc)
@@ -51,10 +51,10 @@ Executing this Spark job will output the following:
(32484,CCTCCCAAAGTGCTGGGATTA)
```
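
As a rough illustration of what "top k-mers" means, the sketch below counts k-mers with plain shell tools. It is not ADAM or Spark code: the two sequences and `k=3` are made-up stand-ins for the real reads and the 21-mers above, and the `(count,k-mer)` output format simply mirrors the Spark job output.

```bash
# Hypothetical input: one read sequence per line (not real 1000 Genomes data).
seqs='ACGTAC
ACGT'
k=3   # assumed small k for readability; the example above uses 21

# Slide a window of width k across each sequence, then tally and rank the k-mers.
top=$(printf '%s\n' "$seqs" \
  | awk -v k="$k" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }' \
  | sort | uniq -c | sort -k1,1nr -k2,2 | head -n 10 \
  | awk '{ print "(" $1 "," $2 ")" }')
printf '%s\n' "$top"
```

With these two sequences, `ACG` and `CGT` each occur twice and are ranked first; ties are broken alphabetically by the second sort key.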

You don't need to be Scala developer to use ADAM. You could also run the following ADAM CLI command for the same result:
You do not need to be a Scala developer to use ADAM. You could also run the following ADAM CLI command for the same result:

```bash
$ adam-submit count_kmers \
adam-submit count_kmers \
/data/NA21144.chrom11.ILLUMINA.adam \
/data/results.txt 21
```
@@ -71,7 +71,7 @@ $ adam-submit count_kmers \
- Parquet is simply a file format which makes it easy to sync and share data using tools like `distcp`, `rsync`, etc
- Parquet provides a command-line tool, `parquet.hadoop.PrintFooter`, which reports useful compression statistics

In the counting k-mers example above, you can see there is a defined *predicate* and *projection*. The *predicate* allows rapid filtering of rows while a *projection* allows you to efficiently materialize only specific columns for analysis. For this k-mer counting example, we filter out any records that are not mapped or have a `MAPQ` less than 20 using a `predicate` and only materialize the `Sequence`, `ReadMapped` flag and `MAPQ` columns and skip over all other fields like `Reference` or `Start` position, e.g.
In the counting k-mers example above, you can see that there is a defined *predicate* and *projection*. The *predicate* allows rapid filtering of rows while a *projection* allows you to efficiently materialize only specific columns for analysis. For this k-mer counting example, we filter out any records that are not mapped or have a `MAPQ` less than 20 using a `predicate` and only materialize the `Sequence`, `ReadMapped` flag and `MAPQ` columns and skip over all other fields like `Reference` or `Start` position, e.g.

Sequence| ReadMapped | MAPQ | ~~Reference~~ | ~~Start~~ | ...
--------|------------|------|-----------|-------|-------
@@ -81,21 +81,20 @@ TACTGAA | true | 30 | ~~chrom1~~ | ~~34232~~ | ...
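
The predicate and projection idea can be sketched with `awk` over a small tab-separated file. This is only an analogy, assuming a row-oriented text layout: Parquet applies the same two operations on columnar storage, which is what lets it skip the unread columns entirely. Apart from the `TACTGAA` row from the table above, the records are hypothetical.

```bash
# Hypothetical records: Sequence, ReadMapped, MAPQ, Reference, Start (tab-separated).
reads=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  GGTCCAT true  17 chrom1 807 \
  TACTGAA true  30 chrom1 34232 \
  ACGTACG false 40 chrom2 99)

# Predicate: keep only mapped rows with MAPQ >= 20.
# Projection: emit only the Sequence, ReadMapped and MAPQ columns.
filtered=$(printf '%s\n' "$reads" \
  | awk -F'\t' -v OFS='\t' '$2 == "true" && $3 >= 20 { print $1, $2, $3 }')
printf '%s\n' "$filtered"
```

Only the `TACTGAA` row passes the predicate, and its `Reference` and `Start` fields are dropped by the projection.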

## Apache Avro


- Apache Avro is a data serialization system ([http://avro.apache.org](http://avro.apache.org))
- All Big Data Genomics schemas are published at [https://github.com/bigdatagenomics/bdg-formats](https://github.com/bigdatagenomics/bdg-formats)
- Having explicit schemas and self-describing data makes integrating, sharing and evolving formats easier

Our Avro schemas are directly converted into source code using Avro tools. Avro supports a number of computer languages. ADAM uses Java; you could
just as easily use this Avro IDL description as the basis for a Python project. Avro currently supports c, c++, csharp, java, javascript, php, python and ruby.
just as easily use this Avro IDL description as the basis for a Python project. Avro currently supports C, C++, C#, Java, JavaScript, PHP, Python and Ruby.

## More than k-mer counting

ADAM does much more than just k-mer counting. Running the ADAM CLI without arguments or with `--help` will display available commands, e.g.
ADAM does much more than just k-mer counting. Running the ADAM CLI without arguments or with `--help` will display available commands.

```bash
$ adam-submit

```
e 888~-_ e e e
d8b 888 \ d8b d8b d8b
/Y88b 888 | /Y88b d888bdY88b
@@ -130,9 +129,9 @@ PRINT
view : View certain reads from an alignment-record file.
```

You can learn more about a command, by calling it without arguments or with `--help`, e.g.
You can learn more about a command, by calling it without arguments or with `--help`.

```
```bash
$ adam-submit transformAlignments
Argument "INPUT" is required
INPUT : The ADAM, BAM or SAM file to apply the transforms to
@@ -186,9 +185,7 @@ Argument "INPUT" is required

The ADAM transformAlignments command allows you to mark duplicates, run base quality score recalibration (BQSR) and other pre-processing steps on your data.

There are also a number of projects built on ADAM, e.g.
There are also a number of projects built on ADAM:

- [Avocado](https://github.com/bigdatagenomics/avocado) is a variant caller built on top of ADAM for germline and somatic calling
- [Mango](https://github.com/bigdatagenomics/mango) a library for visualizing large scale genomics data with interactive latencies


- [Mango](https://github.com/bigdatagenomics/mango) is a library for visualizing large scale genomics data with interactive latencies
14 changes: 7 additions & 7 deletions docs/source/02_installation.md
@@ -9,10 +9,10 @@ installed in order to build ADAM.
> 1.6.3. To build for Spark 2, run the `./scripts/move_to_spark2.sh` script.

```bash
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
$ mvn clean package -DskipTests
git clone https://github.com/bigdatagenomics/adam.git
cd adam
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
mvn clean package -DskipTests
```
Outputs
```
@@ -43,16 +43,16 @@ alias adam-shell="${ADAM_HOME}/bin/adam-shell"

`$ADAM_HOME` should be the path to where you have checked ADAM out on your local filesystem.
The first alias should be used for running ADAM jobs that operate locally. The latter two aliases
call scripts that wrap the `spark-submit` and `spark-shell` commands to set up ADAM. You'll need
call scripts that wrap the `spark-submit` and `spark-shell` commands to set up ADAM. You will need
to have the Spark binaries on your system; prebuilt binaries can be downloaded from the
[Spark website](http://spark.apache.org/downloads.html). Our [continuous integration setup](
https://amplab.cs.berkeley.edu/jenkins/job/ADAM/) builds ADAM against Spark versions 1.6.1 and 2.0.0,
Scala versions 2.10 and 2.11, and Hadoop versions 2.3.0 and 2.6.0.

Once this alias is in place, you can run ADAM by simply typing `adam-submit` at the commandline, e.g.
Once this alias is in place, you can run ADAM by simply typing `adam-submit` at the command line.

```bash
$ adam-submit
adam-submit
```

## Building for Python {#python-build}
12 changes: 6 additions & 6 deletions docs/source/30_running_example.md
@@ -2,10 +2,10 @@

Once you have data converted to ADAM, you can gather statistics from the ADAM
file using [`flagstat`](#flagstat). This command will output stats identically
to the samtools `flagstat` command, e.g.
to the samtools `flagstat` command.

```bash
$ ./bin/adam-submit flagstat NA12878_chr20.adam
./bin/adam-submit flagstat NA12878_chr20.adam
```
Outputs:
```
@@ -22,11 +22,11 @@ Outputs:
105812 + 0 with mate mapped to a different chr (mapQ>=5)
```

In practice, you'll find that the ADAM `flagstat` command takes orders of magnitude less
time than samtools to compute these statistics. For example, on a MacBook Pro the command
above took 17 seconds to run while `samtools flagstat NA12878_chr20.bam` took 55 secs.
In practice, you will find that the ADAM `flagstat` command takes orders of magnitude less
time than samtools to compute these statistics. For example, on a MacBook Pro, the command
above took 17 seconds to run while `samtools flagstat NA12878_chr20.bam` took 55 seconds.
On larger files, the difference in speed is even more dramatic. ADAM is faster because
it's multi-threaded and distributed and uses a columnar storage format (with a projected
it is multi-threaded, distributed and uses a columnar storage format (with a projected
schema that only materializes the read flags instead of the whole read).

## Running on a cluster