Add biologist targeted section to the README #497

Closed
laserson opened this Issue Nov 21, 2014 · 3 comments

@laserson
Contributor

laserson commented Nov 21, 2014

Someone I talked with at Strata mentioned that biologists who get interested in ADAM and look it up land on the README and find a very computer-science-y intro. We should add a short section, accessible to biologists, about why ADAM is exciting.

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016

This was referenced Jul 20, 2016

@tverbeiren

Contributor

tverbeiren commented Dec 7, 2016

Draft suggestion kindly provided by Joke Reumers:

Over the last decade, DNA and RNA sequencing has evolved from an expensive, labor-intensive method into a cheap commodity. The consequence is the generation of massive amounts of genomic and transcriptomic data. Typically, the tools to process and interpret these data are developed in academic labs, with a focus on the quality of the results rather than on scalability and interoperability. A typical "sequencing pipeline" runs from quality control, through mapping and mapped-read preprocessing, to variant calling or quantification, depending on the application at hand. Concretely, such a pipeline is usually a chain of tools glued together by scripts or workflow engines, with data written to files at each step.

This approach entails three main bottlenecks: 1) scaling the pipeline comes down to scaling each of the individual tools, 2) the stability of the pipeline depends heavily on the consistency of the intermediate file formats, and 3) writing to and reading from disk is a major slowdown.
We propose a transformative solution to these problems: replacing ad hoc pipelines with the ADAM framework, developed in the Apache Spark ecosystem.
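The file-round-trip bottleneck described above can be illustrated with a toy sketch. This is not ADAM or Spark code; the stage functions and data are hypothetical stand-ins for QC, mapping, and variant calling, contrasting a disk-backed chain of tools with in-memory composition:

```python
import os
import tempfile

# Hypothetical toy stages standing in for QC, mapping, and variant calling.
def quality_filter(reads):
    return [r for r in reads if len(r) >= 4]

def map_reads(reads):
    return [(r, r.count("A")) for r in reads]  # toy "alignment score"

def call_variants(mapped):
    return [r for r, score in mapped if score >= 2]

reads = ["ACGT", "AA", "AATA", "GGGG", "ACAA"]

# Classical pipeline: stages round-trip through files on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filtered.txt")
    with open(path, "w") as f:
        f.write("\n".join(quality_filter(reads)))
    with open(path) as f:
        filtered = f.read().splitlines()
    # (a real pipeline would read and write a file at every stage)
    file_based = call_variants(map_reads(filtered))

# In-memory composition, the Spark/ADAM style: no intermediate files.
in_memory = call_variants(map_reads(quality_filter(reads)))

assert file_based == in_memory
print(in_memory)
```

Both routes compute the same result; the difference is that the in-memory form avoids the disk serialization, parsing, and format-consistency issues that each intermediate file introduces.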

ADAM provides specialized file formats, built on Avro and Parquet, for the standard data structures used in genomic analysis: mapped reads (typically stored as .bam files), genomic regions (.bed files), and variants (.vcf files). This makes it possible to use the in-memory cluster computing functionality of Apache Spark, ensuring efficient and fault-tolerant distribution based on data parallelism, without the intermediate disk operations required in classical distributed approaches.
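A key property of Parquet mentioned above is that it is columnar. The following stdlib-only sketch (with made-up read records, not ADAM's actual schema) shows the difference between a row-oriented record stream, as in BAM, and a column-oriented layout, where a query that touches one field only needs to scan that column:

```python
from collections import Counter

# Row-oriented (like a BAM record stream): one record per read.
rows = [
    {"name": "r1", "contig": "chr1", "start": 100},
    {"name": "r2", "contig": "chr1", "start": 104},
    {"name": "r3", "contig": "chr2", "start": 7},
]

# Column-oriented (the Parquet idea): one array per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}
assert columns["start"] == [100, 104, 7]

# Counting reads per contig scans only the 'contig' column,
# never touching names or start positions.
per_contig = Counter(columns["contig"])
print(per_contig)
```

Columnar layouts also compress well, since each column holds values of a single type, which is part of why Parquet works well for large genomic datasets.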

Furthermore, the ADAM-Spark approach comes with an additional benefit. Typically, the endpoint of a sequencing pipeline is a file of processed data for a single sample: e.g. variants for DNA sequencing, or read counts for RNA sequencing. However, the real endpoint of a sequencing experiment initiated by an investigator is the interpretation of these data in a certain context. This usually translates into (statistical) analysis across multiple samples, connection with (clinical) metadata, and interactive visualization, using data science tools such as R, Python, Tableau, and Spotfire. In addition to scalable distributed processing, Spark also enables such interactive data analysis, through notebooks (Spark Notebook or Zeppelin) or direct connections to the data from R and Python.
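The cross-sample analysis the draft describes, connecting per-sample results with clinical metadata, can be sketched in plain Python. The sample IDs, variants, and case/control labels below are entirely hypothetical; the point is only the shape of the analysis one would run interactively from a notebook or from R/Python:

```python
from collections import Counter

# Hypothetical per-sample variant calls (the single-sample pipeline output).
variants = {
    "s1": {"chr1:100A>T", "chr2:200G>C"},
    "s2": {"chr1:100A>T"},
    "s3": {"chr2:200G>C"},
}

# Hypothetical clinical metadata joined in at analysis time.
metadata = {"s1": "case", "s2": "case", "s3": "control"}

# Count, per variant, how many case vs. control samples carry it.
counts = Counter(
    (v, metadata[s]) for s, vs in variants.items() for v in vs
)
print(counts[("chr1:100A>T", "case")])     # carried by both case samples
print(counts[("chr2:200G>C", "control")])  # carried by one control sample
```

In practice these dictionaries would be Spark datasets spanning many samples, and the join and aggregation would run distributed, but the investigator-facing logic stays this simple.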


@fnothaft

Member

fnothaft commented Dec 9, 2016

I think this looks really good! @tverbeiren can you open a PR adding this to the README.md? Let me know if you're short on time and I can do it.


@heuermh

Member

heuermh commented Dec 12, 2016

Fixed by #1310


@heuermh heuermh closed this Dec 12, 2016
