New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1599] Add explicit functions for updating GenomicRDD metadata. #1600

Merged
merged 2 commits into from Jul 12, 2017

Conversation

Projects
5 participants
@fnothaft
Member

fnothaft commented Jul 12, 2017

Resolves #1599. Adds functions that replaces or appends to the sequence dictionary/record groups/header lines/sample metadata that is attached to a GenomicRDD. This is necessary as a59b8ed eliminated the copy methods that were attached to all implementations of GenomicRDD by making the exposed implementations abstract classes instead of case classes.

One TODO:

  • Extend to Python API Punted to #1604

@fnothaft fnothaft added this to the 0.23.0 milestone Jul 12, 2017

@fnothaft fnothaft requested a review from heuermh Jul 12, 2017

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 15e40b2 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 15e40b2 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2208/
Test PASSed.

AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2208/
Test PASSed.

@heuermh

LGTM but raises some other issues, see comments

@@ -1335,6 +1398,36 @@ abstract class AvroReadGroupGenomicRDD[T <% IndexedRecord: Manifest, U <: Produc
*/

This comment has been minimized.

@heuermh

heuermh Jul 12, 2017

Member

Slightly unrelated, would it make sense to rename AvroReadGroupGenomicRDDAvroRecordGroupGenomicRDD?

@heuermh

heuermh Jul 12, 2017

Member

Slightly unrelated, would it make sense to rename AvroReadGroupGenomicRDDAvroRecordGroupGenomicRDD?

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017

Member

Marking this as completed --> this was done by @heuermh in 79d26d5.

@fnothaft

fnothaft Jul 12, 2017

Member

Marking this as completed --> this was done by @heuermh in 79d26d5.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala Outdated
@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 9812642 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 9812642 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2209/
Test PASSed.

AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2209/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 12, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2210/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1600/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains f3b5de5 # timeout=10Checking out Revision f3b5de5 (origin/pr/1600/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f3b5de5140b22c3576a905dd01ce90d1a6fab498First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Jul 12, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2210/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1600/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains f3b5de5 # timeout=10Checking out Revision f3b5de5 (origin/pr/1600/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f3b5de5140b22c3576a905dd01ce90d1a6fab498First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 12, 2017

Coverage Status

Coverage increased (+0.3%) to 84.273% when pulling 268dcfd on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.3%) to 84.273% when pulling 268dcfd on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2211/
Test PASSed.

AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2211/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

@heuermh your comment RE: MultisampleAvroGenomicRDD got eaten by Github, so I'm copying it here:

This illustrates something I didn't notice before, the only subclass of MultisampleAvroGenomicRDD is GenotypeRDD.

VariantContextRDD extends MultisampleGenomicRDD and adds its own headerLines and VariantRDD extends AvroGenomicRDD and adds its own headerLines.

If MultisampleAvroGenomicRDD was intended to be a superclass for other data types, an association to VCF headers doesn't really make sense.

To address this, we could remove MultisampleAvroGenomicRDD and move headerLines to GenotypeRDD, or perhaps create something new for VCF headerLines that all of VariantContextRDD, VariantRDD, and GenotypeRDD could use.

Yeah I was noticing that as well. I think that got scoobied when we added support for non-default header lines, as the MultisampleAvroGenomicRDD trait existed prior to that. I agree that we should refactor this, but I'm reticent to take it on now. Would it be OK by you if we created an issue to track this and tabled it for 0.24.0?

Member

fnothaft commented Jul 12, 2017

@heuermh your comment RE: MultisampleAvroGenomicRDD got eaten by Github, so I'm copying it here:

This illustrates something I didn't notice before, the only subclass of MultisampleAvroGenomicRDD is GenotypeRDD.

VariantContextRDD extends MultisampleGenomicRDD and adds its own headerLines and VariantRDD extends AvroGenomicRDD and adds its own headerLines.

If MultisampleAvroGenomicRDD was intended to be a superclass for other data types, an association to VCF headers doesn't really make sense.

To address this, we could remove MultisampleAvroGenomicRDD and move headerLines to GenotypeRDD, or perhaps create something new for VCF headerLines that all of VariantContextRDD, VariantRDD, and GenotypeRDD could use.

Yeah I was noticing that as well. I think that got scoobied when we added support for non-default header lines, as the MultisampleAvroGenomicRDD trait existed prior to that. I agree that we should refactor this, but I'm reticent to take it on now. Would it be OK by you if we created an issue to track this and tabled it for 0.24.0?

@devin-petersohn

Looks great! A few comments regarding the usefulness of being able to have multiple SequenceDictionary, etc. additions.

* @param sequencesToAdd The new sequences to append.
* @return Returns a new GenomicRDD with the sequences appended.
*/
def addSequences(sequencesToAdd: SequenceDictionary): U = {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Would it be useful to have this as def addSequences(sequencesToAdd: SequenceDictionary*): U, allowing multiple SequenceDictionary additions at once?

@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Would it be useful to have this as def addSequences(sequencesToAdd: SequenceDictionary*): U, allowing multiple SequenceDictionary additions at once?

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017

Member

I mean, SequenceDictionary supports the ++ operator, so my preference is not to make the function variadic. Mostly, it adds a bit of error checking that we need to do (WRT being provided no sequencesToAdd).

@fnothaft

fnothaft Jul 12, 2017

Member

I mean, SequenceDictionary supports the ++ operator, so my preference is not to make the function variadic. Mostly, it adds a bit of error checking that we need to do (WRT being provided no sequencesToAdd).

* @param sequenceToAdd The sequence to add.
* @return Returns a new GenomicRDD with this sequence appended.
*/
def addSequence(sequenceToAdd: SequenceRecord): U = {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Same note as above regarding adding multiple SequenceRecord additions.

@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Same note as above regarding adding multiple SequenceRecord additions.

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017

Member

My preference here is that if you have multiple records to add, you either provide them as a SequenceDictionary or chain multiple addSequence calls.

@fnothaft

fnothaft Jul 12, 2017

Member

My preference here is that if you have multiple records to add, you either provide them as a SequenceDictionary or chain multiple addSequence calls.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala Outdated
Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala Outdated
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

From a sequencing perspective, can we prioritize merging this? Once this is merged, then I'll make necessary changes to get #1545 compiling again.

Member

fnothaft commented Jul 12, 2017

From a sequencing perspective, can we prioritize merging this? Once this is merged, then I'll make necessary changes to get #1545 compiling again.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Absolutely. Are we still waiting on an update from Seq to Iterable, or merge as-is?

Member

devin-petersohn commented Jul 12, 2017

Absolutely. Are we still waiting on an update from Seq to Iterable, or merge as-is?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

@devin-petersohn yeah, I need to do Seq->Iterable, Python updates, and @heuermh needs to re-review.

Member

fnothaft commented Jul 12, 2017

@devin-petersohn yeah, I need to do Seq->Iterable, Python updates, and @heuermh needs to re-review.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 12, 2017

Member

Careful, two of my commits were accidentally force pushed to @fnothaft's branch, addressing comment #1600 (comment) above.

The additional commits move the headerLine stuff to GenotypeRDD at risk of breaking DRY and clean up the saveMetadata code path, fixing #1601.

Member

heuermh commented Jul 12, 2017

Careful, two of my commits were accidentally force pushed to @fnothaft's branch, addressing comment #1600 (comment) above.

The additional commits move the headerLine stuff to GenotypeRDD at risk of breaking DRY and clean up the saveMetadata code path, fixing #1601.

fnothaft and others added some commits Jul 12, 2017

[ADAM-1599] Add explicit functions for updating GenomicRDD metadata.
Resolves #1599. Adds functions that replaces or appends to the
sequence dictionary/record groups/header lines/sample metadata that is attached
to a GenomicRDD. This is necessary as a59b8ed
eliminated the `copy` methods that were attached to all implementations of
`GenomicRDD` by making the exposed implementations abstract classes instead of
case classes.
Rename AvroReadGroupGenomicRDD to AvroRecordGroupGenomicRDD and move …
…VCF headers

to GenotypeRDD. Cleans up metadata save in VariantRDD. Resolves #1601.
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

The Python bit needs more work to be useful, so I'm punting that to #1604/0.24.0.

Member

fnothaft commented Jul 12, 2017

The Python bit needs more work to be useful, so I'm punting that to #1604/0.24.0.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

OK, I've cleaned up the history in @heuermh's commits and Seq->Iterable'd. This is good to go from my end.

Member

fnothaft commented Jul 12, 2017

OK, I've cleaned up the history in @heuermh's commits and Seq->Iterable'd. This is good to go from my end.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 12, 2017

Coverage Status

Coverage decreased (-0.4%) to 83.633% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

coveralls commented Jul 12, 2017

Coverage Status

Coverage decreased (-0.4%) to 83.633% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 12, 2017

Coverage Status

Coverage increased (+0.1%) to 84.157% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.1%) to 84.157% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2214/
Test PASSed.

AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2214/
Test PASSed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 12, 2017

Member

@devin-petersohn Protocol question, in cases like this with one review approval and one or more review-but-not-approval (looks like yours was Comment not Request changes), should we wait for all approvals?

Member

heuermh commented Jul 12, 2017

@devin-petersohn Protocol question, in cases like this with one review approval and one or more review-but-not-approval (looks like yours was Comment not Request changes), should we wait for all approvals?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 12, 2017

Member

I'd wait a short bit; @devin-petersohn is briefly OOO this PM but should be back online shortly.

Member

fnothaft commented Jul 12, 2017

I'd wait a short bit; @devin-petersohn is briefly OOO this PM but should be back online shortly.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jul 12, 2017

Member

Sorry about that. Looks good to me.

Member

devin-petersohn commented Jul 12, 2017

Sorry about that. Looks good to me.

@devin-petersohn devin-petersohn merged commit 1702fb2 into bigdatagenomics:master Jul 12, 2017

3 checks passed

codacy/pr Good work! A positive pull request.
Details
coverage/coveralls Coverage increased (+0.1%) to 84.157%
Details
default Merged build finished.
Details

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment