Replaced Contig with ContigName in AlignmentRecord and related changes #988

Merged
merged 1 commit into from Apr 6, 2016

Conversation

Projects
None yet
5 participants
@jpdna
Member

jpdna commented Apr 2, 2016

This PR depends on the related PR #72 here in bigdatagenomics/bdg-formats repo
which supplies version 0.7.2-SNAPSHOT of bdg-formats.
With the above available, this PR should compile and pass all tests

See related discussion in #983

The goal of this PR is to remove from AlignmentRecord for performance reasons the Contig and mateContig metadata details such as MD5 and URL that are repeated for each contig. This contig metadata is referred to now only in the SequenceDictionary and joined using the contigName and mateContigName which have replaced contig and mateContig in AlignmentRecord

Performance Improvement:
Testing on a 3.8 GB BAM file input on a single machine, I see the following improvements based on metrics in Spark web GUI with the changes in this PR as compared to current code as of b8e36b2 :

27% speed up in MarkDuplicates ( 2.3 minutes to 1.7 minutes )
11% reduction in Shuffle Read and Write size in MarkDuplicates ( 10.1GB to 9.0 GB)
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 2, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1128/

Build result: FAILURE

GitHub pull request #988 of commit b32a60b automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/988/merge^{commit} # timeout=10 > git branch -a --contains b350f9c # timeout=10 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10Checking out Revision b350f9c (origin/pr/988/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f b350f9cb2506348e4fd90f36639fdb06edc0ff8fFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1128/

Build result: FAILURE

GitHub pull request #988 of commit b32a60b automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/988/merge^{commit} # timeout=10 > git branch -a --contains b350f9c # timeout=10 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10Checking out Revision b350f9c (origin/pr/988/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f b350f9cb2506348e4fd90f36639fdb06edc0ff8fFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@ryan-williams ryan-williams referenced this pull request in bigdatagenomics/bdg-formats Apr 2, 2016

Closed

Replaced Contig with ContigName in AlignmentRecord #72

adam-apis/pom.xml
@@ -106,6 +106,7 @@
<dependency>
<groupId>org.bdgenomics.bdg-formats</groupId>
<artifactId>bdg-formats</artifactId>
+ <version>0.7.2-SNAPSHOT</version>

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

You don't need the version here, it inherits from the ${ADAM_HOME}/pom.xml.

@fnothaft

fnothaft Apr 3, 2016

Member

You don't need the version here, it inherits from the ${ADAM_HOME}/pom.xml.

adam-cli/pom.xml
@@ -167,6 +167,7 @@
<dependency>
<groupId>org.bdgenomics.bdg-formats</groupId>
<artifactId>bdg-formats</artifactId>
+ <version>0.7.2-SNAPSHOT</version>

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto RE: above.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto RE: above.

adam-core/pom.xml
@@ -118,6 +118,7 @@
<dependency>
<groupId>org.bdgenomics.bdg-formats</groupId>
<artifactId>bdg-formats</artifactId>
+ <version>0.7.2-SNAPSHOT</version>

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto.

@@ -155,8 +154,8 @@ class AlignmentRecordConverter extends Serializable {
// only set alignment flags if read is aligned
if (m) {
// if we are aligned, we must have a reference
- assert(adamRecord.getContig != null, "Cannot have null contig if aligned.")
- builder.setReferenceName(adamRecord.getContig.getContigName)
+ assert(adamRecord.getContigName != null, "Cannot have null contig if aligned.")

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

Small nit: while we're changing this line, we should probably s/assert/require/g.

@fnothaft

fnothaft Apr 3, 2016

Member

Small nit: while we're changing this line, we should probably s/assert/require/g.

@@ -81,7 +81,7 @@ class SAMRecordConverter extends Serializable with Logging {
val readReference: Int = samRecord.getReferenceIndex
if (readReference != SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) {
dict(samRecord.getReferenceName).foreach { (rec) =>
- builder.setContig(SequenceRecord.toADAMContig(rec))
+ builder.setContigName(SequenceRecord.toADAMContig(rec).getContigName)

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

I think you can remove the .foreach and just simplify to:

builder.setContigName(samRecord.getReferenceName)
@fnothaft

fnothaft Apr 3, 2016

Member

I think you can remove the .foreach and just simplify to:

builder.setContigName(samRecord.getReferenceName)
@@ -128,7 +128,7 @@ class SAMRecordConverter extends Serializable with Logging {
if (mateReference != SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) {
dict(samRecord.getMateReferenceName).foreach { (rec) =>
- builder.setMateContig(SequenceRecord.toADAMContig(rec))
+ builder.setMateContigName(SequenceRecord.toADAMContig(rec).getContigName)

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto here RE: .foreach.

@fnothaft

fnothaft Apr 3, 2016

Member

Ditto here RE: .foreach.

import org.bdgenomics.formats.avro.Contig
+// We may want to reintroduce in the future MD5 concordance test for isSameContig as

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

I don't like to leave dead code in. Let's either remove the whole comment + code, or keep the comment but include a commit hash for when the code was removed so that it is easy to track down the diff with the old code.

As an aside, I would remove this class. IMO, it's not very useful to have a function that just calls String.equals.

@fnothaft

fnothaft Apr 3, 2016

Member

I don't like to leave dead code in. Let's either remove the whole comment + code, or keep the comment but include a commit hash for when the code was removed so that it is easy to track down the diff with the old code.

As an aside, I would remove this class. IMO, it's not very useful to have a function that just calls String.equals.

pom.xml
@@ -316,7 +316,7 @@
<dependency>
<groupId>org.bdgenomics.bdg-formats</groupId>
<artifactId>bdg-formats</artifactId>
- <version>0.7.0</version>
+ <version>0.7.2-SNAPSHOT</version>

This comment has been minimized.

@fnothaft

fnothaft Apr 3, 2016

Member

We'll cut a new release of bdg-formats to merge this.

@fnothaft

fnothaft Apr 3, 2016

Member

We'll cut a new release of bdg-formats to merge this.

This comment has been minimized.

@fnothaft

fnothaft Apr 5, 2016

Member

I am cutting the 0.7.1 bdg-formats release right now.

@fnothaft

fnothaft Apr 5, 2016

Member

I am cutting the 0.7.1 bdg-formats release right now.

This comment has been minimized.

This comment has been minimized.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 3, 2016

Member

Aside from a few nits, this LGTM! Thanks for profiling the change, @jpdna!

Member

fnothaft commented Apr 3, 2016

Aside from a few nits, this LGTM! Thanks for profiling the change, @jpdna!

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Apr 5, 2016

Member

Made the changes suggested by @fnothaft above and squashed

Member

jpdna commented Apr 5, 2016

Made the changes suggested by @fnothaft above and squashed

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 5, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1130/

Build result: FAILURE

GitHub pull request #988 of commit c924952 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/988/merge^{commit} # timeout=10 > git branch -a --contains 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 # timeout=10 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10Checking out Revision 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 (origin/pr/988/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1130/

Build result: FAILURE

GitHub pull request #988 of commit c924952 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/988/merge^{commit} # timeout=10 > git branch -a --contains 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 # timeout=10 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10Checking out Revision 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 (origin/pr/988/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft added this to the 0.20.0 milestone Apr 5, 2016

import org.bdgenomics.formats.avro._
import scala.collection.JavaConversions._
-class FragmentRDDFunctions(rdd: RDD[Fragment]) extends ADAMSequenceDictionaryRDDAggregator[Fragment](rdd) {
+class FragmentRDDFunctions(rdd: RDD[Fragment]) extends Serializable with Logging {

This comment has been minimized.

@heuermh

heuermh Apr 5, 2016

Member

Why is this removed? I'm actually not sure what ADAMSequenceDictionaryRDDAggregator was for in the first place.

@heuermh

heuermh Apr 5, 2016

Member

Why is this removed? I'm actually not sure what ADAMSequenceDictionaryRDDAggregator was for in the first place.

This comment has been minimized.

@jpdna

jpdna Apr 5, 2016

Member

ADAMSequenceDictionaryRDDAggregator
seemed to be used to extract SeqDicts from existing object, but didn't seem to make sense anymore in context of having removed Contig from AlignmentRecord, including within Fragmen - and importantly was not actually used in any code or tests. Extending ADAMSequenceDictionaryRDDAggregator here in fact blocked the Contig factoring out as it requires data which is no longer in AlignmentRecord. I searched for any usages of the removed functions and there were none, so I removed it as a superclass.

@jpdna

jpdna Apr 5, 2016

Member

ADAMSequenceDictionaryRDDAggregator
seemed to be used to extract SeqDicts from existing object, but didn't seem to make sense anymore in context of having removed Contig from AlignmentRecord, including within Fragmen - and importantly was not actually used in any code or tests. Extending ADAMSequenceDictionaryRDDAggregator here in fact blocked the Contig factoring out as it requires data which is no longer in AlignmentRecord. I searched for any usages of the removed functions and there were none, so I removed it as a superclass.

This comment has been minimized.

@heuermh

heuermh Apr 5, 2016

Member

Sounds good to me, I'm always in favor of removing unnecessary code.

I wonder if this changes anyone's opinion on this commit ryan-williams@c5a8f51 that adds extension to some of the Functions classes.

@heuermh

heuermh Apr 5, 2016

Member

Sounds good to me, I'm always in favor of removing unnecessary code.

I wonder if this changes anyone's opinion on this commit ryan-williams@c5a8f51 that adds extension to some of the Functions classes.

This comment has been minimized.

@fnothaft

fnothaft Apr 5, 2016

Member

Both this and the commit you pointed at @heuermh LGTM.

@fnothaft

fnothaft Apr 5, 2016

Member

Both this and the commit you pointed at @heuermh LGTM.

- a.setContigMD5("md5")
+ // Prior to removing Contig from AlignmentRecprd the code below

This comment has been minimized.

@heuermh

heuermh Apr 5, 2016

Member

minor typo

@heuermh

heuermh Apr 5, 2016

Member

minor typo

Replaced Contig with ContigName in AlignmentRecord and related changes
Contig factor out project, cleaning up some comments

clean up some unintended whitespace arbitrary diffs

Removed subproject POM changes and other small issues

Updated bdformats to new published maven depedenecy 0.7.1, and fixed a comment typo
@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Apr 5, 2016

Member

bdg-formats dependency has been updated to 0.7.1 and change squashed, so I think this is ready to go now

Member

jpdna commented Apr 5, 2016

bdg-formats dependency has been updated to 0.7.1 and change squashed, so I think this is ready to go now

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Apr 5, 2016

Member

well crud, github webpage on my fork is not updating with my push, though git thinks I pushed. Github was down a few minutes ago - I'll ping this thread again in when it matches.

Member

jpdna commented Apr 5, 2016

well crud, github webpage on my fork is not updating with my push, though git thinks I pushed. Github was down a few minutes ago - I'll ping this thread again in when it matches.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 5, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1131/
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1131/
Test FAILed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 5, 2016

Member

Jenkins, retest this please.

Member

fnothaft commented Apr 5, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 5, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1132/
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1132/
Test FAILed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 5, 2016

Member

Jenkins, retest this please.

Member

fnothaft commented Apr 5, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 5, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1133/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1133/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 5, 2016

Member

LGTM now. Let's discuss on the call tomorrow and make sure everyone is good with merging.

Member

fnothaft commented Apr 5, 2016

LGTM now. Let's discuss on the call tomorrow and make sure everyone is good with merging.

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Apr 6, 2016

Contributor

+1000 from me

Contributor

tdanford commented Apr 6, 2016

+1000 from me

@fnothaft fnothaft merged commit 7823abd into bigdatagenomics:master Apr 6, 2016

1 check passed

default Merged build finished.
Details
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 6, 2016

Member

Thanks @jpdna! Merged.

Member

fnothaft commented Apr 6, 2016

Thanks @jpdna! Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment