Merge VariantAnnotation and DatabaseVariantAnnotation records #1250

Closed
wants to merge 12 commits into
from

4 participants
@heuermh
Member

heuermh commented Nov 8, 2016

Supersedes #1144

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
* @tparam T The type of records to return
* @return An RDD with records of the specified type
*/
- private[rdd] def loadParquet[T](
+ def loadParquet[T](

@heuermh

heuermh Nov 8, 2016

Member

I had to make this public again for unit tests in package o.b.a.projections. It also allows for loading user-defined schema (e.g. extensions to bdg-formats) from Avro-in-Parquet files.

@fnothaft

fnothaft Nov 8, 2016

Member

Perhaps let's make it private[adam]?

@heuermh

heuermh Nov 8, 2016

Member

It seems like this would be useful outside of ADAM. I haven't fully thought through the use case though: someone wants to add a new schema record Foo, they extend ADAMKryoRegistrator to register it, then extend ADAMContext to add their new loadFoo method, which presumably would call loadParquet for foo.adam paths. If Foo has a sequence dictionary or samples, those Avro metadata methods would also be useful, and extending from GenomicRDD and friends needs to be possible.

@fnothaft

fnothaft Nov 8, 2016

Member

I don't disagree, but I'd rather keep these private until someone knocks on our door asking to make them public. My philosophy here is simply that it is easier to make private interfaces public than it is to make public interfaces private. That being said, this is a weak preference: if you feel strongly about it, I'm OK with making it public, esp. since loadParquet has been public previously.
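
For readers less familiar with qualified access modifiers, here is a minimal sketch of what `private[adam]` would allow. All names (`adam`, `rdd`, `projections`, `Context`, `loadForTest`) are stand-ins invented for the example, and nested objects are used in place of packages so that it fits in one file; this is not ADAM's actual code.

```scala
// Objects stand in for packages so the sketch fits in one file.
object adam {
  object rdd {
    class Context {
      // Visible anywhere inside `adam`, hidden from code outside it.
      private[adam] def loadParquet(path: String): String = "loaded " + path
    }
  }

  object projections {
    // A sibling of `rdd` under `adam`, like a test suite in another subpackage.
    def loadForTest(path: String): String = new rdd.Context().loadParquet(path)
  }
}
```

Under these assumptions, `adam.projections.loadForTest("foo.adam")` compiles, while calling `loadParquet` from outside `adam` would be rejected by the compiler. Widening the qualifier later is a compatible change; narrowing a public method is not.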

@AmplabJenkins

AmplabJenkins Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1564/
Test PASSed.

@fnothaft

This looks awesome. I've dropped a variety of suggestions and nits inline. Do we have a VCF with proper ANN fields that we could pull in and load in org.bdgenomics.adam.rdd.ADAMContextSuite and then save back out? I think that's a good round trip test that we should add.

Also, I think we can punt the next thing to a later PR, but I think we could probably autogen the test data (and more tests) for the *FieldSuites. Adding them is a massive step forward though. Thanks for pushing those in as well!

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {
val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
+ // note: annotationImpact is not mapped

@fnothaft

fnothaft Nov 8, 2016

Member

I don't get this comment; can you flesh it out more?

@heuermh

heuermh Nov 8, 2016

Member

The annotationImpact field (and variable above) is output by SnpEff version 4.2 but is not part of the VCF ANN specification, so I did not include it in our TranscriptEffect schema.

@fnothaft

fnothaft Nov 8, 2016

Member

That makes sense, can you add that inline?

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {
val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
+ // note: annotationImpact is not mapped
if (!effects.isEmpty) te.setEffects(effects.asJava)

@fnothaft

fnothaft Nov 8, 2016

Member

effects.nonEmpty

@@ -132,26 +170,98 @@ object VariantAnnotations extends Serializable with Logging {
Seq(te.build())

@fnothaft

fnothaft Nov 8, 2016

Member

Unrelated to this PR, as this line is unchanged, but whenever possible, I prefer Iterable to Seq unless you need random lookup by index.

variant: Variant,
vc: VariantContext,
- stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
+ stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

@fnothaft

fnothaft Nov 8, 2016

Member

Instead of returning Option[List[TranscriptEffect]] I would just return List[TranscriptEffect]. If you would return a None, I would just return a List.empty instead.

@heuermh

heuermh Nov 8, 2016

Member

That would make my brain hurt less. The thought was that elsewhere it matters whether this field has been set, so checking Option seemed more correct than checking for an empty list.
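
A small sketch of the distinction under discussion, using a hypothetical `Effect` case class and a comma-separated string in place of the real TranscriptEffect record and ANN parsing (both are assumptions made only for illustration):

```scala
object UnsetVersusEmpty {
  // Hypothetical stand-in for a parsed annotation; not the bdg-formats record.
  final case class Effect(alternateAllele: String)

  // None means the attribute was absent; Some(Nil) means it was present
  // but no entry matched the requested alternate allele.
  def withOption(attr: Option[String], alt: String): Option[List[Effect]] =
    attr.map(a => a.split(",").toList.map(Effect(_)).filter(_.alternateAllele == alt))

  // Collapsing to a plain List loses that distinction: both cases become Nil.
  def withList(attr: Option[String], alt: String): List[Effect] =
    withOption(attr, alt).getOrElse(List.empty)
}
```

With the Option-returning form a caller can tell "never set" from "set, but nothing matched"; with the List-returning form the two are indistinguishable, which is simpler when the caller does not care.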

+ if (attr == VCFConstants.MISSING_VALUE_v4) {
+ None
+ } else {
+ val filtered = parseAnn(attr, stringency).filter(_.getAlternateAllele == variant.getAlternateAllele)

@fnothaft

fnothaft Nov 8, 2016

Member

If you make the above change, then the if-else clause here just becomes:

if (attr == VCFConstants.MISSING_VALUE_v4) {
  List.empty
} else {
  parseAnn(attr, stringency)
    .filter(_.getAlternateAllele == variant.getAlternateAllele)
}

Also, I would break at the .filter, because that line is a bit long.

@heuermh

heuermh Nov 8, 2016

Member

Will be adding try catch with validation stringency here shortly...

+ val numOpt = Option(numerator)
+ val denomOpt = Option(denominator)
+
+ val sb = StringBuilder.newBuilder

@fnothaft

fnothaft Nov 8, 2016

Member

I think this code would be a bit cleaner with a match:

(numOpt, denomOpt) match {
  case (Some(n), Some(d)) => {
    "%d/%d".format(n, d)
  }
  case (None, None) => {
    ""
  }
  case _ => {
    // validate/throw?
    if (validationStringency == ValidationStringency.STRICT) {
      throw new IllegalArgumentException("Incorrect fractional value in %s.".format(te))
    } else if (validationStringency == ValidationStringency.LENIENT) {
      log.warn("Incorrect fractional value in %s.".format(te))
    }
    ""
  }
}

Also, I would either make this package private/private, or move it inside of toAnn, which I think is the only place it is used.
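
As a dependency-free sketch of the same idea, the version below can be compiled on its own. The `Stringency` trait is a stand-in for htsjdk's ValidationStringency, and the error message and `System.err` logging are assumptions for illustration rather than what the PR does.

```scala
object FractionSketch {
  // Stand-in for htsjdk.samtools.ValidationStringency, to avoid the dependency.
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  def toFraction(
    numerator: java.lang.Integer,
    denominator: java.lang.Integer,
    stringency: Stringency = Strict): String = {
    // Option(null) is None, so a null boxed Integer is never unboxed.
    (Option(numerator), Option(denominator)) match {
      case (Some(n), Some(d)) => "%d/%d".format(n, d)
      case (None, None) => ""
      case _ =>
        // Exactly one side is missing: treat as malformed input.
        val msg = "Incorrect fractional value: %s/%s.".format(numerator, denominator)
        stringency match {
          case Strict => throw new IllegalArgumentException(msg)
          case Lenient => System.err.println(msg); ""
          case Silent => ""
        }
    }
  }
}
```

Wrapping the boxed values in `Option` before matching is what keeps a null numerator or denominator from causing a NullPointerException when it would otherwise be unboxed.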

@heuermh

heuermh Nov 8, 2016

Member

I thought it was already private since it is nested in convertToVcfInfoAnnValue? Still have some to learn about visibility in Scala. The tuple of options is cleaner. (I can't believe I just said that)

@fnothaft

fnothaft Nov 8, 2016

Member

Ah yes, you are right RE: protection; I had missed the nesting.

@@ -17,18 +17,12 @@
*/
package org.bdgenomics.adam.projections
-import org.bdgenomics.formats.avro.DatabaseVariantAnnotation
+import org.bdgenomics.formats.avro.Contig

@fnothaft

fnothaft Nov 8, 2016

Member

OOC, why does this show up as a file move? Any thoughts? May be just github being funky.

@@ -60,20 +60,19 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* @param ann Annotation RDD to join against.
* @return Returns a VariantContextRDD where annotations have been filled in.
*/
- def joinDatabaseVariantAnnotation(ann: DatabaseVariantAnnotationRDD): VariantContextRDD = {
+ def joinVariantAnnotations(ann: VariantAnnotationRDD): VariantContextRDD = {
replaceRdd(rdd.keyBy(_.variant)

@fnothaft

fnothaft Nov 8, 2016

Member

We might want to open a ticket for this, but after #1216 this should probably be implemented using a region join instead of a Spark core leftOuterJoin.

@heuermh

heuermh Nov 10, 2016

Member

Created new issue #1259

+import com.google.common.collect.ImmutableList
+import htsjdk.samtools.ValidationStringency
+import htsjdk.variant.vcf.VCFConstants
+import htsjdk.variant.variantcontext.VariantContext

@fnothaft

fnothaft Nov 8, 2016

Member

Nit: htsjdk.variant.vcf after htsjdk.variant.variantcontext

@AmplabJenkins

AmplabJenkins Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1565/
Test PASSed.

@AmplabJenkins

AmplabJenkins Nov 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1566/
Test PASSed.

@AmplabJenkins

AmplabJenkins Nov 10, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1587/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1250/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 7eff061 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1250/merge^{commit} # timeout=10
Checking out Revision 7eff061 (origin/pr/1250/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 7eff06161dcee656f3c48996818a95cb92e96267
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna

jpdna Nov 11, 2016

Member

@heuermh - is it reasonable / useful for me to try to build the code in this PR locally to test it out at this point? I tried to compile it, but can't seem to find a version of bdg-formats that works with it.

I tried both
https://github.com/heuermh/bdg-formats/tree/master, as it was a 0.9.1-SNAPSHOT version, and then the current bdg-formats 0.10.1-SNAPSHOT, but hit bdg-formats compile problems with both.

@heuermh

heuermh Nov 11, 2016

Member

@jpdna As is, this branch will not compile due to the filter-related changes in bdg-formats. I've made the code changes locally but they need more unit tests. I'll push these in a commit tomorrow morning.

@@ -143,6 +143,5 @@ class VariantContext(
val position: ReferencePosition,
val variant: RichVariant,
val genotypes: Iterable[Genotype],
- val databases: Option[DatabaseVariantAnnotation] = None) {
+ val databases: Option[VariantAnnotation] = None) {

@jpdna

jpdna Nov 11, 2016

Member

"databases" seems kind of a strange name for this field now to me, I might prefer "annotations".

@fnothaft

fnothaft Nov 11, 2016

Member

+1, databases was always a kinda strange name, but it's definitely weird now!

@heuermh

heuermh Nov 11, 2016

Member

nice catch! fixed

@heuermh

heuermh Nov 11, 2016

Member

Pushed new commits that fix the separate variant and genotype filters issue and update bdg-formats to the release version 0.10.0. I implemented the filter handling as far as htsjdk makes possible; I could either continue to hack on it so that Genotype.filtersApplied is set correctly or punt until a later issue.

@AmplabJenkins

AmplabJenkins Nov 11, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1591/
Test PASSed.

@heuermh

heuermh Nov 11, 2016

Member

Fixes #194

@fnothaft

1 small nit on the filters, otherwise LGTM

+ val copy = VariantCallingAnnotations.newBuilder(annotations)
+ // htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
+ // we might be able to calculate it by querying the FT FORMAT field value directly
+ copy.setFiltersApplied(true)

@fnothaft

fnothaft Nov 11, 2016

Member

I think this would work:

g.getAnyAttribute("FT") != null

@fnothaft

fnothaft Nov 15, 2016

Member

Can we create an issue to track the upstream htsjdk issue?

@heuermh

heuermh Nov 15, 2016

Member

Created new issue #1269

@jpdna

jpdna Nov 11, 2016

Member

I tried to load this VCF:
https://drive.google.com/open?id=0B6jh69UgixwpTDlSemtreklDNUU
which is the ClinVar VCF with an ANN column added by SnpEff.

It seems to fail, but without an error message:

val x = sc.loadVariantAnnotations("/home/paschallj/ADAM/nov11_annot/run1/test1.200.vcf")

scala> x.rdd.count
res13: Long = 0

This VCF does seem to load into a VariantRDD with loadVCF fine.

I suspect there is something unexpected about the format of the ANN field in my VCF file, but if this is current SnpEff output then it could be a problem for some users.

Can you point me to a test VCF with an ANN field that is working properly that I can compare to?

@fnothaft

fnothaft Nov 11, 2016

Member

@jpdna do you get any error/warning messages in the logs? If you have ValidationStringency.LENIENT set, I'd expect to see error messages there.

@jpdna

jpdna Nov 11, 2016

Member

Where do I set `ValidationStringency.LENIENT`?
Currently I don't see any error messages in adam.log when I run this in adam-shell.

@heuermh

heuermh Nov 11, 2016

Member

@jpdna It might be hard to follow, since things are spread over several issues, but this pull request does not yet support populating VariantAnnotation.transcriptEffects from VCF INFO reserved key ANN values. See #1044 (comment)

@fnothaft

fnothaft Nov 11, 2016

Member

Sure, but even then @jpdna should be getting one VariantAnnotation record per Variant, no?

@heuermh

heuermh Nov 11, 2016

Member

Maybe, I don't know how well that part of the code works. Based on this and recent conversations on gitter (same issue apparently), not too well?

@jpdna

jpdna Nov 11, 2016

Member

> this pull request does not yet support populating VariantAnnotation.transcriptEffects

Ah, thanks for clarifying @heuermh. I'll watch this PR for further commits and try my test again when you ping that reading the ANN field into transcriptEffects is ready. Perhaps some rows of the VCF I linked to above can be useful in the test suite; both a VEP-derived and a SnpEff-derived example annotated VCF would be good.

@fnothaft

Just two small changes: parseAndFilter should be private and there's still an Option related NPE issue in convertToVcfInfoAnnValue. Can you clean these two up and I will merge this PR manually?

+ */
+ def convertToVcfInfoAnnValue(effects: Seq[TranscriptEffect]): String = {
+ def toFraction(numerator: java.lang.Integer, denominator: java.lang.Integer): String = {
+ val numOpt = Option(numerator)

@fnothaft

fnothaft Nov 15, 2016

Member

This NPE with Option types still needs to be fixed.

- stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
+ stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {
+
+ def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

@fnothaft

fnothaft Nov 15, 2016

Member

This method should be private.

@heuermh

heuermh Nov 15, 2016

Member

error: illegal start of statement (no modifiers allowed here)
[ERROR] private def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

@fnothaft

fnothaft Nov 15, 2016

Member

Ah, sorry, I misread this and didn't notice that it is nested inside another function.
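
For context, a minimal sketch of why the modifier is both rejected and unnecessary. The names `NestedDefSketch`, `convert`, and `render` are made up for illustration and are unrelated to the PR's code.

```scala
object NestedDefSketch {
  def convert(values: Seq[Int]): String = {
    // Local method: visible only inside convert, so no access modifier is
    // needed (writing `private def` here is the compile error quoted above).
    def render(v: Int): String = "<" + v + ">"
    values.map(render).mkString(",")
  }
}
```

A method defined inside another method's body is a local definition, so its scope already ends with the enclosing block; access modifiers apply only to members of classes, traits, and objects.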

- loadVcf(filePath).toDatabaseVariantAnnotationRDD
+ def loadVcfAnnotations(
+ filePath: String): VariantAnnotationRDD = {
+ loadVcf(filePath).toVariantAnnotationRDD

@fnothaft

fnothaft Nov 15, 2016

Member

Just for tracking RE @jpdna's comment about not getting any annotations from a VCF, this line is the culprit. Specifically, loadVcf right now just parses the Genotypes. We should make the VariantContextConverter parse out the annotations by default in the follow on PR.

@fnothaft

fnothaft Nov 15, 2016

Member

No changes necessary in this PR, just an FYI.

@fnothaft

fnothaft Nov 15, 2016

Member

Also, @jpdna it'd be great to add some unit tests that use that file and try to load a few ANN fields. That should be an acceptance test for the release. Would you be able to do that?

@heuermh

heuermh Nov 15, 2016

Member

Running SnpEff on the VCF files we are already using for unit tests ends up being not too interesting, with all intragenic variants. It might take a little thinking to generate a more useful VCF file, say with variants right at intron/exon boundaries of a gene with a lot of splice variants, for example.

@heuermh

heuermh Nov 15, 2016

Member

Pushed commit with some additional unit tests. Let me know if I've addressed all the review comments, and thank you for volunteering to merge this manually.

@fnothaft

fnothaft Nov 15, 2016

Member

LGTM now! I will merge this manually shortly.

@AmplabJenkins

AmplabJenkins Nov 15, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1601/
Test PASSed.

@fnothaft

fnothaft Nov 15, 2016

Member

Merged into upgrade-to-0.10.0 as c750830 and c06143b. Thanks @heuermh! It is fantastic to get this refactor in. I will merge in the upgrade-to-0.10.0 branch on tests rerunning and passing.

@fnothaft fnothaft closed this Nov 15, 2016

@heuermh heuermh deleted the heuermh:merge-variant-annotation2 branch Nov 15, 2016

@heuermh

heuermh Nov 15, 2016

Member

Woot! Thank you, @fnothaft!
