Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1599] Add explicit functions for updating GenomicRDD metadata. #1600

Merged

Conversation

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jul 12, 2017

Resolves #1599. Adds functions that replaces or appends to the sequence dictionary/record groups/header lines/sample metadata that is attached to a GenomicRDD. This is necessary as a59b8ed eliminated the copy methods that were attached to all implementations of GenomicRDD by making the exposed implementations abstract classes instead of case classes.

One TODO:

  • Extend to Python API Punted to #1604
@fnothaft fnothaft added this to the 0.23.0 milestone Jul 12, 2017
@fnothaft fnothaft requested a review from heuermh Jul 12, 2017
@coveralls
Copy link

@coveralls coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 15e40b2 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2208/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

LGTM but raises some other issues, see comments

@@ -1335,6 +1398,36 @@ abstract class AvroReadGroupGenomicRDD[T <% IndexedRecord: Manifest, U <: Produc
*/

This comment has been minimized.

@heuermh

heuermh Jul 12, 2017
Member

Slightly unrelated, would it make sense to rename AvroReadGroupGenomicRDDAvroRecordGroupGenomicRDD?

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017
Author Member

Marking this as completed --> this was done by @heuermh in 79d26d5.

@@ -1369,6 +1462,34 @@ abstract class MultisampleAvroGenomicRDD[T <% IndexedRecord: Manifest, U <: Prod
*/
val headerLines: Seq[VCFHeaderLine]

This comment has been minimized.

@heuermh

heuermh Jul 12, 2017
Member

This illustrates something I didn't notice before, the only subclass of MultisampleAvroGenomicRDD is GenotypeRDD.

VariantContextRDD extends MultisampleGenomicRDD and adds its own headerLines and VariantRDD extends AvroGenomicRDD and adds its own headerLines.

If MultisampleAvroGenomicRDD was intended to be a superclass for other data types, an association to VCF headers doesn't really make sense.

To address this, we could remove MultisampleAvroGenomicRDD and move headerLines to GenotypeRDD, or perhaps create something new for VCF headerLines that all of VariantContextRDD, VariantRDD, and GenotypeRDD could use.

@coveralls
Copy link

@coveralls coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.2%) to 84.244% when pulling 9812642 on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2209/
Test PASSed.

@heuermh heuermh force-pushed the fnothaft:issues/1599-no-copy-no-mo branch from 9812642 to 301ba1f Jul 12, 2017
@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 12, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2210/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1600/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains f3b5de5 # timeout=10Checking out Revision f3b5de5 (origin/pr/1600/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f3b5de5140b22c3576a905dd01ce90d1a6fab498First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

@coveralls coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.3%) to 84.273% when pulling 268dcfd on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2211/
Test PASSed.

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

@heuermh your comment RE: MultisampleAvroGenomicRDD got eaten by Github, so I'm copying it here:

This illustrates something I didn't notice before, the only subclass of MultisampleAvroGenomicRDD is GenotypeRDD.

VariantContextRDD extends MultisampleGenomicRDD and adds its own headerLines and VariantRDD extends AvroGenomicRDD and adds its own headerLines.

If MultisampleAvroGenomicRDD was intended to be a superclass for other data types, an association to VCF headers doesn't really make sense.

To address this, we could remove MultisampleAvroGenomicRDD and move headerLines to GenotypeRDD, or perhaps create something new for VCF headerLines that all of VariantContextRDD, VariantRDD, and GenotypeRDD could use.

Yeah I was noticing that as well. I think that got scoobied when we added support for non-default header lines, as the MultisampleAvroGenomicRDD trait existed prior to that. I agree that we should refactor this, but I'm reticent to take it on now. Would it be OK by you if we created an issue to track this and tabled it for 0.24.0?

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Looks great! A few comments regarding the usefulness of being able to have multiple SequenceDictionary, etc. additions.

* @param sequencesToAdd The new sequences to append.
* @return Returns a new GenomicRDD with the sequences appended.
*/
def addSequences(sequencesToAdd: SequenceDictionary): U = {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017
Member

Would it be useful to have this as def addSequences(sequencesToAdd: SequenceDictionary*): U, allowing multiple SequenceDictionary additions at once?

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017
Author Member

I mean, SequenceDictionary supports the ++ operator, so my preference is not to make the function variadic. Mostly, it adds a bit of error checking that we need to do (WRT being provided no sequencesToAdd).

* @param sequenceToAdd The sequence to add.
* @return Returns a new GenomicRDD with this sequence appended.
*/
def addSequence(sequenceToAdd: SequenceRecord): U = {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017
Member

Same note as above regarding adding multiple SequenceRecord additions.

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017
Author Member

My preference here is that if you have multiple records to add, you either provide them as a SequenceDictionary or chain multiple addSequence calls.

* @param samplesToAdd Zero or more samples to add.
* @return Returns a new RDD with samples added.
*/
def addSamples(samplesToAdd: Seq[Sample]): U = {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017
Member

Same note as above regarding Iterable vs Seq.

* @param newSamples The new sample metadata to attach.
* @return A GenomicRDD with new sample metadata.
*/
def replaceSamples(newSamples: Seq[Sample]): U

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 12, 2017
Member

Is Iterable better than Seq for the public APIs?

This comment has been minimized.

@fnothaft

fnothaft Jul 12, 2017
Author Member

Ah, good point. My preference is that we keep the type of the field samples as a Seq (because Iterable doesn't support indexed lookup, and it is useful to be able to easily query "which sample is sample 100?") but for these functions, Iterable is preferable.

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

From a sequencing perspective, can we prioritize merging this? Once this is merged, then I'll make necessary changes to get #1545 compiling again.

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Jul 12, 2017

Absolutely. Are we still waiting on an update from Seq to Iterable, or merge as-is?

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

@devin-petersohn yeah, I need to do Seq->Iterable, Python updates, and @heuermh needs to re-review.

@heuermh
Copy link
Member

@heuermh heuermh commented Jul 12, 2017

Careful, two of my commits were accidentally force pushed to @fnothaft's branch, addressing comment #1600 (comment) above.

The additional commits move the headerLine stuff to GenotypeRDD at risk of breaking DRY and clean up the saveMetadata code path, fixing #1601.

fnothaft and others added 2 commits Jul 12, 2017
Resolves #1599. Adds functions that replaces or appends to the
sequence dictionary/record groups/header lines/sample metadata that is attached
to a GenomicRDD. This is necessary as a59b8ed
eliminated the `copy` methods that were attached to all implementations of
`GenomicRDD` by making the exposed implementations abstract classes instead of
case classes.
…VCF headers

to GenotypeRDD. Cleans up metadata save in VariantRDD. Resolves #1601.
@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

The Python bit needs more work to be useful, so I'm punting that to #1604/0.24.0.

@fnothaft fnothaft force-pushed the fnothaft:issues/1599-no-copy-no-mo branch from 268dcfd to b251fbb Jul 12, 2017
@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

OK, I've cleaned up the history in @heuermh's commits and Seq->Iterable'd. This is good to go from my end.

@coveralls
Copy link

@coveralls coveralls commented Jul 12, 2017

Coverage Status

Coverage decreased (-0.4%) to 83.633% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@coveralls
Copy link

@coveralls coveralls commented Jul 12, 2017

Coverage Status

Coverage increased (+0.1%) to 84.157% when pulling b251fbb on fnothaft:issues/1599-no-copy-no-mo into 238e044 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2214/
Test PASSed.

@heuermh
Copy link
Member

@heuermh heuermh commented Jul 12, 2017

@devin-petersohn Protocol question, in cases like this with one review approval and one or more review-but-not-approval (looks like yours was Comment not Request changes), should we wait for all approvals?

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 12, 2017

I'd wait a short bit; @devin-petersohn is briefly OOO this PM but should be back online shortly.

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Jul 12, 2017

Sorry about that. Looks good to me.

@devin-petersohn devin-petersohn merged commit 1702fb2 into bigdatagenomics:master Jul 12, 2017
3 checks passed
3 checks passed
codacy/pr Good work! A positive pull request.
Details
coverage/coveralls Coverage increased (+0.1%) to 84.157%
Details
default Merged build finished.
Details
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.