[ADAM-1630] Overhauled docs introduction and added architecture section. #1653

Merged
merged 3 commits into from Dec 5, 2017

Conversation

@fnothaft
Member

@fnothaft fnothaft commented Aug 2, 2017

WIP towards resolving #1630, #1632, #1633, #1662. Rewrote the introduction to focus on what ADAM provides and the ADAM ecosystem. Adds an architecture section that describes ADAM's stack model and schemas, and introduces the ADAMContext and GenomicRDDs as implementations of the evidence access layer of the stack.

TODO:

  • Explanation of metadata in GenomicRDD
  • Diagram depicting flow of data from disk into GenomicRDD types back out to disk. (I think this is unnecessary.)
  • A discussion of "why Parquet" and "why not BAM/VCF/etc" --> resolved by #1772
@fnothaft fnothaft added this to the 0.23.0 milestone Aug 2, 2017
@fnothaft fnothaft requested review from devin-petersohn and heuermh Aug 2, 2017
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Aug 2, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2306/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1653/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 3ab700e # timeout=10
Checking out Revision 3ab700e (origin/pr/1653/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 3ab700ecb330c84ffb85a6895b2438a351d6008b
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.10,2.1.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.1.0,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.1.0,centos
Triggering ADAM-prb » 2.3.0,2.11,2.1.0,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,2.1.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,2.1.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,2.1.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,2.1.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Member

@heuermh heuermh left a comment

Very good summary sections.

as data stored in the columnar [Apache Parquet](https://parquet.apache.org)
format. On a single node, ADAM provides competitive performance to optimized
multi-threaded tools, while enabling scale out to clusters with more than a
thousand cores. ADAM's APIs can be used from Scala, Java, Python, R, and SQL.


The stack model that ADAM is based upon was introduced in [@massie13] and
further refined in [@nothaft15], and is depicted in the figure below. This
stack model seperates computational patterns from the data model, and the

@heuermh

heuermh Aug 2, 2017
Member

seperates → separates

enables developers to write queries that will run seamlessly on both a single
node, or on a distributed cluster, on legacy genomics data files or on
data stored in a high performance columnar storage format, on sorted or
unsorted data, without making any modifications to their query. Additionally,

@heuermh

heuermh Aug 2, 2017
Member

How about

This enables developers to write queries that run seamlessly on a single
node or on a distributed cluster, on legacy genomics data files or on
data stored in a high performance columnar storage format, and on sorted or
unsorted data, without making any modifications to their query.
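
As a rough illustration of the property this sentence describes, here is a minimal Python sketch. It is not ADAM code: the `Read` record, both loader functions, and the sample data are hypothetical stand-ins, chosen only to show one query running unchanged over a row-oriented text source and a column-oriented source, in different orders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Read:
    # Hypothetical, heavily simplified stand-in for a read record.
    name: str
    contig: Optional[str]
    start: Optional[int]


def load_from_text(lines: Iterable[str]) -> List[Read]:
    """Row-oriented source: one tab-separated record per line."""
    reads = []
    for line in lines:
        name, contig, start = line.rstrip("\n").split("\t")
        reads.append(Read(name, contig, int(start)))
    return reads


def load_from_columns(columns: Dict[str, list]) -> List[Read]:
    """Column-oriented source: one list of values per field."""
    return [
        Read(name, contig, start)
        for name, contig, start in zip(
            columns["name"], columns["contig"], columns["start"]
        )
    ]


def count_reads_on(reads: Iterable[Read], contig: str) -> int:
    """The query: written against the record type, unaware of storage or order."""
    return sum(1 for read in reads if read.contig == contig)


# Same logical data, different layout and different ordering (assumed sample data).
text_source = ["r1\tchr1\t100", "r2\tchr2\t50", "r3\tchr1\t10"]
column_source = {
    "name": ["r3", "r1", "r2"],
    "contig": ["chr1", "chr1", "chr2"],
    "start": [10, 100, 50],
}

print(count_reads_on(load_from_text(text_source), "chr1"))
print(count_reads_on(load_from_columns(column_source), "chr1"))
```

Only the loaders know about the layout; `count_reads_on` is identical for both sources, which is the separation the stack model is aiming for.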


1. The *physical storage* layer is the type of storage media (e.g., hard
disk/solid state drives) that are used to store the data.
2. The *data distribution* layer determines how data is made accessible to all

@heuermh

heuermh Aug 2, 2017
Member

data is → data are

with a parallel collection of genomic data. In ADAM, we implement this layer
through the [GenomicRDD](#genomic-rdd) classes. This layer presents users
with a view of the metadata associated with a collection of genomic data,
and APIs for [transforming](#transforming) and [joining](#join) genomic data.

independence in database systems. We see this approach as an alternative to the
"stack smashing" commonly seen in genomics APIs, such as the GATK's "walker"
interface [@mckenna10]. In these APIs, implementation details about the layout
of data on disk (is the data sorted?) are propegated up to the application layer

@heuermh

heuermh Aug 2, 2017
Member

(is the data sorted?) → (are the data sorted?)
propegated → propagated

[Apache Avro](https://avro.apache.org) schema description language. This schema
definition language automatically generates implementations of this schema for
multiple common languages, including Java, C, C++, and Python. bdg-formats
contains 15 schemas in total, with seven core schemas:

@heuermh

heuermh Aug 2, 2017
Member

Probably best not to give numbers here, so they need to be updated in the future.
contains 15 schemas in total, with seven core schemas → several core schemas

the bdg-formats schemas are nullable, and the schemas themselves do not contain
invariants around valid values for a field. Instead, we validate data on ingress
and egress to/from a conventional genomic file format. This allows users to take
advantage of features such as field projection, which can improve the

@heuermh

heuermh Aug 2, 2017
Member

could field projection be a link here?

@fnothaft

fnothaft Dec 4, 2017
Author Member

What did you want to link to?

@heuermh

heuermh Dec 4, 2017
Member

There was a bit about projections in the old README, it is gone now, no worry
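
For readers unfamiliar with the term, the following Python sketch shows the general idea of field projection over nullable fields, and why validation is deferred to the export boundary. It is an assumption-laden toy, not the bdg-formats schemas or ADAM's API: the `COLUMNS` table, `load_projected`, and `validate_for_export` are all hypothetical names.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical columnar table: each field is stored as its own list of values.
COLUMNS: Dict[str, List[object]] = {
    "name": ["r1", "r2"],
    "contig": ["chr1", "chr2"],
    "start": [100, 50],
    "sequence": ["ACGT", "GGCA"],
}


def load_projected(
    columns: Dict[str, List[object]], fields: Sequence[str]
) -> List[Dict[str, Optional[object]]]:
    """Materialize only the requested fields; every other field stays None.

    This works only because all fields are nullable: a record with
    unread columns is still a valid record.
    """
    row_count = len(next(iter(columns.values())))
    records = []
    for i in range(row_count):
        record: Dict[str, Optional[object]] = {field: None for field in columns}
        for field in fields:
            record[field] = columns[field][i]
        records.append(record)
    return records


def validate_for_export(
    record: Dict[str, Optional[object]],
    required: Sequence[str] = ("name", "contig", "start", "sequence"),
) -> List[str]:
    """Return the required fields that are missing.

    Invariants are checked only here, at the boundary to a conventional
    file format, not in the schema itself.
    """
    return [field for field in required if record[field] is None]


projected = load_projected(COLUMNS, ["contig", "start"])
print(projected[0])
print(validate_for_export(projected[0]))
```

A query that needs only `contig` and `start` never touches the `sequence` column, while an attempt to export the projected record would be flagged as incomplete.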

GenomicRDD is enriched with genomics-specific metadata such as computational
lineage and sample metadata, and optimized genomics-specific query patterns
such as [region joins](#join) and the [auto-parallelizing pipe API](#pipes)
for running legacy tools using Apache Spark.
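
To make "region join" concrete, here is a small Python sketch of an inner join on genomic overlap. It is a naive nested loop over in-memory lists, written only to show the join predicate; it is not how a distributed implementation on Apache Spark would be structured. The `Region` type, the half-open coordinate convention, and the sample data are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    # Hypothetical record: a labelled interval on a contig.
    contig: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive (assumed convention)
    label: str


def overlaps(a: Region, b: Region) -> bool:
    """Two regions overlap if they share a contig and at least one position."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def inner_region_join(
    left: List[Region], right: List[Region]
) -> List[Tuple[Region, Region]]:
    """Emit every (left, right) pair whose regions overlap."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


reads = [
    Region("chr1", 100, 150, "read1"),
    Region("chr1", 400, 450, "read2"),
    Region("chr2", 100, 150, "read3"),
]
features = [
    Region("chr1", 120, 300, "geneA"),
    Region("chr2", 0, 50, "geneB"),
]

for read, feature in inner_region_join(reads, features):
    print(read.label, feature.label)
```

The join key is overlap rather than equality, which is why a generic equi-join is not a direct substitute for this query pattern.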

Member

@devin-petersohn devin-petersohn left a comment

A couple of clarifying questions. Overall, looks good!

@@ -373,6 +371,10 @@ all are called in a similar way:
* Inner join and group by right
* Right outer join and group by right

A subset of these joins are depicted in Figure 2 below.

![Joins Available](source/img/join_examples.png)

@devin-petersohn

devin-petersohn Aug 3, 2017
Member

Is this link accurate? I found that I had to use the relative path from the doc.

@fnothaft

fnothaft Aug 3, 2017
Author Member

This worked OK for me when running ./build.sh

Spark](https://spark.apache.org) to parallelize genomic data analysis across
cluster/cloud computing environments. ADAM uses a set of schemas to describe
genomic sequences, reads, variants/genotypes, and features, and can be used
with data in legacy genomic file formats such as SAM/BAM/CRAM or VCF, as well

@devin-petersohn

devin-petersohn Aug 3, 2017
Member

Do we have a comprehensive list of all supported file formats anywhere?

@fnothaft

fnothaft Aug 3, 2017
Author Member

Yeah, it is down in 55_api.md in the ADAMContext docs. That said, I'm planning to add a graphic in the GenomicRDD section that makes the file format<->schema mapping more clear.

@heuermh

heuermh Aug 3, 2017
Member

I have draft mapping documentation for the formats here
https://github.com/heuermh/bdg-formats/tree/docs/docs/source

We don't currently host docs or javadocs for bdg-formats, so I hadn't found the motivation to complete them. Would those make sense in the ADAM docs?

@fnothaft

fnothaft Aug 3, 2017
Author Member

We should host the javadocs for bdg-formats, but I would agree that schema docs fit better in the ADAM docs.

@devin-petersohn

devin-petersohn Aug 3, 2017
Member

I think it would be useful to link to the supported file types from here. Someone reading the intro docs may be trying to figure out if ADAM will work for them and linking to the supported file types lets them quickly check if their preferred type is supported.

@fnothaft

fnothaft Aug 3, 2017
Author Member

Yeah, as I said, I am planning to add that exact info to this section.

@heuermh
Member

@heuermh heuermh commented Sep 12, 2017

See pull request fnothaft#19 for updated README.md.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Oct 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2431/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse 188836a^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 188836a # timeout=10
Checking out Revision 188836a (origin/pr/1653/head)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 188836a9bbfe82f757a2bfdcebe104cb2a20f782
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.2,2.11,1.6.3,centos
Triggering ADAM-prb » 2.7.3,2.11,1.6.3,centos
Triggering ADAM-prb » 2.7.3,2.11,2.2.0,centos
Triggering ADAM-prb » 2.7.3,2.10,1.6.3,centos
Triggering ADAM-prb » 2.6.2,2.10,1.6.3,centos
Triggering ADAM-prb » 2.6.2,2.10,2.2.0,centos
Triggering ADAM-prb » 2.7.3,2.10,2.2.0,centos
Triggering ADAM-prb » 2.6.2,2.11,2.2.0,centos
ADAM-prb » 2.6.2,2.11,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.7.3,2.11,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.7.3,2.11,2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.7.3,2.10,1.6.3,centos completed with result FAILURE
ADAM-prb » 2.6.2,2.10,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.6.2,2.10,2.2.0,centos completed with result FAILURE
ADAM-prb » 2.7.3,2.10,2.2.0,centos completed with result FAILURE
ADAM-prb » 2.6.2,2.11,2.2.0,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Resolves #1630, #1632, #1633. Rewrote the introduction to focus on what ADAM
provides and the ADAM ecosystem. Adds an architecture section that describes
ADAM's stack model and schemas, and introduces the ADAMContext and
GenomicRDDs as implementations of the evidence access layer of the stack.
@fnothaft fnothaft force-pushed the fnothaft:issues/1630-architecture branch from 188836a to 02ea48b Dec 4, 2017
@fnothaft
Member Author

@fnothaft fnothaft commented Dec 4, 2017

This is good to go from my side.

README.md Outdated

[Apache Parquet][Parquet] is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
(index from readthedocs.io)
(include link to API docs at http://bdgenomics.org/adam/latest/scaladocs/index.html)

@heuermh

heuermh Dec 4, 2017
Member

Sorry, the links here still need to be fleshed out. I assume the main doc link would be to http://adam.readthedocs.io/en/latest/

@fnothaft

fnothaft Dec 4, 2017
Author Member

LOL whoops. I'll clean these for you.

[Avro]: http://avro.apache.org
[Spark]: https://spark.apache.org/
[Parquet]: https://parquet.apache.org/
[releases]: https://github.com/bigdatagenomics/adam/releases

# Citing ADAM

@heuermh

heuermh Dec 4, 2017
Member

The homebrew section in README.md (which doesn't show up here in the diff) should probably be removed, I'll replace it later with new homebrew and conda sections. Homebrew might take some work going forward, in that they go to JDK 9 by default which will break apache-spark and other upstream deps.

@fnothaft

fnothaft Dec 4, 2017
Author Member

JDK9. Nice. Aggressive.

@fnothaft fnothaft force-pushed the fnothaft:issues/1630-architecture branch from 02ea48b to 6bb43f9 Dec 4, 2017
@fnothaft
Member Author

@fnothaft fnothaft commented Dec 4, 2017

@heuermh I've cleaned the README nits.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Dec 4, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2498/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1653/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 6249b76 # timeout=10
Checking out Revision 6249b76 (origin/pr/1653/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 6249b76ee94039e8f92333a1538786f16dc9cd69
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.2,2.11,1.6.3,centos
Triggering ADAM-prb » 2.7.3,2.10,1.6.3,centos
Triggering ADAM-prb » 2.7.3,2.10,2.2.0,centos
Triggering ADAM-prb » 2.7.3,2.11,1.6.3,centos
Triggering ADAM-prb » 2.6.2,2.10,2.2.0,centos
Triggering ADAM-prb » 2.6.2,2.10,1.6.3,centos
Triggering ADAM-prb » 2.6.2,2.11,2.2.0,centos
Triggering ADAM-prb » 2.7.3,2.11,2.2.0,centos
ADAM-prb » 2.6.2,2.11,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.7.3,2.10,1.6.3,centos completed with result FAILURE
ADAM-prb » 2.7.3,2.10,2.2.0,centos completed with result FAILURE
ADAM-prb » 2.7.3,2.11,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.6.2,2.10,2.2.0,centos completed with result FAILURE
ADAM-prb » 2.6.2,2.10,1.6.3,centos completed with result SUCCESS
ADAM-prb » 2.6.2,2.11,2.2.0,centos completed with result SUCCESS
ADAM-prb » 2.7.3,2.11,2.2.0,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Dec 4, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2499/
Test PASSed.

@heuermh
heuermh approved these changes Dec 5, 2017
@heuermh heuermh merged commit 34b6bec into bigdatagenomics:master Dec 5, 2017
2 checks passed
Codacy/PR Quality Review Good work! A positive pull request.
default Merged build finished.
@heuermh
Member

@heuermh heuermh commented Dec 5, 2017

Thank you, @fnothaft!

4 participants