Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge VariantAnnotation and DatabaseVariantAnnotation records #1250

Conversation

@heuermh
Copy link
Member

heuermh commented Nov 8, 2016

Supercedes #1144

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
* @tparam T The type of records to return
* @return An RDD with records of the specified type
*/
private[rdd] def loadParquet[T](
def loadParquet[T](

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

I had to make this public again for unit tests in package o.b.a.projections. It also allows for loading user-defined schema (e.g. extensions to bdg-formats) from Avro-in-Parquet files.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Perhaps let's make it private[adam]?

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

It seems like this would be useful outside of ADAM. I haven't fully thought through the use case though: someone wants to add a new schema record Foo, they extend ADAMKyroRegistrator to register, then extend ADAMContext to add their new loadFoo method, which presumably would call loadParquet for foo.adam paths. If Foo has a sequence dictionary or samples, those avro metadata methods would also be useful, and extending from GenomicRDD and friends needs to be possible.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

I don't disagree, but I'd rather keep these private until someone knocks on our door asking to make them public. My philosophy here is simply that it is easier to make private interfaces public than it is to make public interfaces private. That being said, this is a weak preference: if you feel strongly about it, I'm OK with making it public, esp. since loadParquet has been public previously.

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1564/
Test PASSed.

Copy link
Member

fnothaft left a comment

This looks awesome. I've dropped a variety of suggestions and nits inline. Do we have a VCF with proper ANN fields that we could pull in and load in org.bdgenomics.adam.rdd.ADAMContextSuite and then save back out? I think that's a good round trip test that we should add.

Also, I think we can punt the next thing to a later PR, but I think we could probably autogen the test data (and more tests) for the *FieldSuites. Adding them is a massive step forward though. Thanks for pushing those in as well!

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
// note: annotationImpact is not mapped

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

I don't get this comment; can you flesh it out more?

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

The annotationImpact field (and variable above) is output by SnpEff version 4.2 but is not part of the VCF ANN specification, so I did not include it in our TranscriptEffect schema.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

That makes sense, can you add that inline?

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
// note: annotationImpact is not mapped
if (!effects.isEmpty) te.setEffects(effects.asJava)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

effects.nonEmpty

@@ -132,26 +170,98 @@ object VariantAnnotations extends Serializable with Logging {
Seq(te.build())

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Unrelated to this PR, as this line is unchanged, but whenever possible, I prefer Iterable to Seq unless you need random lookup by index.

variant: Variant,
vc: VariantContext,
stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Instead of returning Option[List[TranscriptEffect]] I would just return List[TranscriptEffect]. If you would return a None, I would just return a List.empty instead.

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

That would make my brain hurt less. The thought is elsewhere it matters whether this field has been set, so checking Option seemed more correct than checking for an empty list.

if (attr == VCFConstants.MISSING_VALUE_v4) {
None
} else {
val filtered = parseAnn(attr, stringency).filter(_.getAlternateAllele == variant.getAlternateAllele)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

If you make the above change, then the if-else clause here just becomes:

if (attr == VCFConstants.MISSING_VALUE_v4) {
  List.empty
} else {
  parseAnn(attr, stringency)
    .filter(_.getAlternateAllele == variant.getAlternateAllele)
}

Also, I would break at the .filter, because that line is a bit long.

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

Will be adding try catch with validation stringency here shortly...

val numOpt = Option(numerator)
val denomOpt = Option(denominator)

val sb = StringBuilder.newBuilder

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

I think this code would be a bit cleaner with a match:

(numOpt, denomOpt) match {
  case (Some(n), Some(d)) => {
    "%d/%d".format(n, d)
  }
  case (None, None) => {
    ""
  }
  case _ => {
     // validate/throw?
     if (validationStringency == ValidationStringency.STRICT) {
       throw new IllegalArgumentException("Incorrect fractional value in %s.".format(te))
     } else if (validationStringency == ValidationStringency.LENIENT) {
       log.warn("Incorrect fractional value in %s.".format(te))
     }
     ""
  }
}

Also, I would either make this package private/private, or move it inside of toAnn, which I think is the only place it is used.

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 8, 2016

Author Member

I thought it was already private since it is nested in convertToVcfInfoAnnValue? Still have some to learn about visibility in Scala. The tuple of options is cleaner. (I can't believe I just said that)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Ah yes, you are right RE: protection; I had missed the nesting.

@@ -17,18 +17,12 @@
*/
package org.bdgenomics.adam.projections

import org.bdgenomics.formats.avro.DatabaseVariantAnnotation
import org.bdgenomics.formats.avro.Contig

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

OOC, why does this show up as a file move? Any thoughts? May be just github being funky.

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
* @tparam T The type of records to return
* @return An RDD with records of the specified type
*/
private[rdd] def loadParquet[T](
def loadParquet[T](

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Perhaps let's make it private[adam]?

@@ -60,20 +60,19 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* @param ann Annotation RDD to join against.
* @return Returns a VariantContextRDD where annotations have been filled in.
*/
def joinDatabaseVariantAnnotation(ann: DatabaseVariantAnnotationRDD): VariantContextRDD = {
def joinVariantAnnotations(ann: VariantAnnotationRDD): VariantContextRDD = {
replaceRdd(rdd.keyBy(_.variant)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

We might want to open a ticket for this, but after #1216 this should probably be implemented using a region join instead of a Spark core leftOuterJoin.

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 10, 2016

Author Member

Created new issue #1259

import com.google.common.collect.ImmutableList
import htsjdk.samtools.ValidationStringency
import htsjdk.variant.vcf.VCFConstants
import htsjdk.variant.variantcontext.VariantContext

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 8, 2016

Member

Nit: htsjdk.variant.vcf after htsjdk.variant.variantcontext

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1565/
Test PASSed.

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1566/
Test PASSed.

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 10, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1587/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1250/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 7eff061 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1250/merge^{commit} # timeout=10Checking out Revision 7eff061 (origin/pr/1250/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 7eff06161dcee656f3c48996818a95cb92e96267First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

@heuermh - is it reasonable / useful for me to try to build the code in this PR locally to test it out at this point? I tried to compile it, but can't seem to find a version bdg-formats that works with it.

I tried both
https://github.com/heuermh/bdg-formats/tree/master as it was a 0.9.1-SNAPSHOT version and then tried changing to current bdg-formats 0.10.1-SNAPSHOT, but bdg-formats compile problems with both.

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

@jpdna As is this branch will not compile due to the filter-related changes in bdg-formats. I've made the code changes locally but they need more unit tests. I'll push these in a commit tomorrow morning.

@@ -143,6 +143,5 @@ class VariantContext(
val position: ReferencePosition,
val variant: RichVariant,
val genotypes: Iterable[Genotype],
val databases: Option[DatabaseVariantAnnotation] = None) {
val databases: Option[VariantAnnotation] = None) {

This comment has been minimized.

Copy link
@jpdna

jpdna Nov 11, 2016

Member

"databases" seems kind of a strange name for this field now to me, I might prefer "annotations".

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 11, 2016

Member

+1, databases was always a kinda strange name, but it's definitely weird now!

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 11, 2016

Author Member

nice catch! fixed

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Pushed new commits that fixes the separate variant and genotype filters issue and updates bdg-formats to the release version 0.10.0. I implemented the filter stuff to the best that htsjdk makes available to us; I could either continue to hack on it so that Genotype.filtersApplied is set correctly or punt until a later issue.

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 11, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1591/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Fixes #194

Copy link
Member

fnothaft left a comment

1 small nit on the filters, otherwise LGTM

val copy = VariantCallingAnnotations.newBuilder(annotations)
// htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
// we might be able to calculate it by querying the FT FORMAT field value directly
copy.setFiltersApplied(true)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 11, 2016

Member

I think this would work:

g.getAnyAttribute("FT") != null

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

Can we create an issue to track the upstream htsjdk issue?

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 15, 2016

Author Member

Created new issue #1269

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

I tried to load this VCF:
https://drive.google.com/open?id=0B6jh69UgixwpTDlSemtreklDNUU
which is the ClinVar VCF with ANN column added by SNPEff

It seems to fail, but without an error message

val x = sc.loadVariantAnnotations("/home/paschallj/ADAM/nov11_annot/run1/test1.200.vcf")

scala> x.rdd.count
res13: Long = 0

This VCF does seem to load into a VariantRDD with loadVCF fine.

I suspect there is something unexpected about the format of my VCF file ANN field, but if this is current snpeff output then it could be problem for some users.

Can you point me to a test VCF with a ANN field that is working properly that I can compare to?

@fnothaft
Copy link
Member

fnothaft commented Nov 11, 2016

@jpdna do you get any error/warning messages in the logs? If you have ValidationStringency.LENIENT set, I'd expect to see error messages there.

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

Where do I set `ValidationStringency.LENIENT" ?
currently I don't see any error messages in adam.log when I run this in adam-shell

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

@jpdna It might be hard to follow, since things are spread over several issues, but this pull request does not yet support populating VariantAnnotation.transcriptEffects from VCF INFO reserved key ANN values. See #1044 (comment)

@fnothaft
Copy link
Member

fnothaft commented Nov 11, 2016

Sure, but even then @jpdna should be getting one VariantAnnotation record per Variant, no?

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Maybe, I don't know how well that part of the code works. Based on this and recent conversations on gitter (same issue apparently), not too well?

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

this pull request does not yet support populating VariantAnnotation.transcriptEffects

ah, thanks for clarifying @heuermh - I'll plan to watch this PR then for the further commits and try my test again when you ping that reading ANN field into transcriptEffects is ready. Perhaps some rows of the VCF I linked to above can be a useful in the test suite - both a VEP and SNPeff derived example annotated VCF would be good.

Copy link
Member

fnothaft left a comment

Just two small changes: parseAndFilter should be private and there's still an Option related NPE issue in convertToVcfInfoAnnValue. Can you clean these two up and I will merge this PR manually?

*/
def convertToVcfInfoAnnValue(effects: Seq[TranscriptEffect]): String = {
def toFraction(numerator: java.lang.Integer, denominator: java.lang.Integer): String = {
val numOpt = Option(numerator)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

This NPE with Option types still needs to be fixed.

stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

This method should be private.

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 15, 2016

Author Member

error: illegal start of statement (no modifiers allowed here) [ERROR] private def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

Ah, sorry, I misread this and didn't notice that it is nested inside another function.

val copy = VariantCallingAnnotations.newBuilder(annotations)
// htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
// we might be able to calculate it by querying the FT FORMAT field value directly
copy.setFiltersApplied(true)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

Can we create an issue to track the upstream htsjdk issue?

loadVcf(filePath).toDatabaseVariantAnnotationRDD
def loadVcfAnnotations(
filePath: String): VariantAnnotationRDD = {
loadVcf(filePath).toVariantAnnotationRDD

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

Just for tracking RE @jpdna's comment about not getting any annotations from a VCF, this line is the culprit. Specifically, loadVcf right now just parses the Genotypes. We should make the VariantContextConverter parse out the annotations by default in the follow on PR.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

No changes necessary in this PR, just an FYI.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 15, 2016

Member

Also, @jpdna it'd be great to add some unit tests that use that file and try to load a few ANN fields. That should be an acceptance test for the release. Would you be able to do that?

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 15, 2016

Author Member

Running SnpEff on the VCF files we are already using for unit tests ends up being not too interesting, with all intragenic variants. It might take a little thinking to generate a more useful VCF file, say with variants right at intron/exon boundaries of a gene with a lot of splice variants, for example.

@heuermh
Copy link
Member Author

heuermh commented Nov 15, 2016

Pushed commit with some additional unit tests. Let me know if I've addressed all the review comments, and thank you for volunteering to merge this manually.

@fnothaft
Copy link
Member

fnothaft commented Nov 15, 2016

LGTM now! I will merge this manually shortly.

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 15, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1601/
Test PASSed.

@fnothaft fnothaft force-pushed the bigdatagenomics:upgrade-to-bdg-formats-0.10.0 branch from 12f245c to c06143b Nov 15, 2016
@fnothaft
Copy link
Member

fnothaft commented Nov 15, 2016

Merged into upgrade-to-0.10.0 as c750830 and c06143b. Thanks @heuermh! It is fantastic to get this refactor in. I will merge in the upgrade-to-0.10.0 branch on tests rerunning and passing.

@fnothaft fnothaft closed this Nov 15, 2016
@heuermh heuermh deleted the heuermh:merge-variant-annotation2 branch Nov 15, 2016
@heuermh
Copy link
Member Author

heuermh commented Nov 15, 2016

Woot! Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.