[ADAM-909] Refactoring variation RDDs. #1015

Merged
merged 3 commits into bigdatagenomics:master on Jun 3, 2016

5 participants
@fnothaft
Member

fnothaft commented Apr 24, 2016

Resolves #909:

  • Refactors org.bdgenomics.adam.rdd.variation to add GenomicRDDs for Genotype, Variant, and VariantContext. These classes write sequence and sample metadata to disk.
  • Refactors ADAMRDDFunctions to an abstract class in preparation for further refactoring in #1011.
  • Added AvroGenomicRDD trait which consolidates Parquet + Avro metadata writing code across all Avro data models.
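
Roughly, the new hierarchy has the shape sketched below. This is a simplified illustration of the description above; the signatures are illustrative, not the exact code in this PR.

import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.models.SequenceDictionary
import org.bdgenomics.formats.avro.Genotype

trait GenomicRDD[T] {
  // the wrapped Spark RDD plus the sequence metadata that travels with it
  val rdd: RDD[T]
  val sequences: SequenceDictionary
}

trait AvroGenomicRDD[T] extends GenomicRDD[T] {
  // shared Parquet + Avro writing path: save the records, then write the
  // sequence/sample metadata as Avro sidecar files next to the Parquet data
  def saveAsParquet(filePath: String): Unit
}

case class GenotypeRDD(rdd: RDD[Genotype],
                       sequences: SequenceDictionary,
                       samples: Seq[String]) extends AvroGenomicRDD[Genotype] {
  def saveAsParquet(filePath: String): Unit = ???
}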
@AmplabJenkins

AmplabJenkins Apr 24, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1176/

Build result: FAILURE

[...truncated 24 lines...]
Triggering ADAM-prb ? 2.3.0,2.10,1.6.0,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.5.2,centos
ADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft

Member

fnothaft commented Apr 25, 2016

Jenkins, retest this please.

@AmplabJenkins

AmplabJenkins Apr 25, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1177/
Test PASSed.


@@ -270,7 +270,7 @@ class ADAMContextSuite extends ADAMFunSuite {
sparkTest("can read a small .vcf file") {
val path = resourcePath("small.vcf")
- val vcs = sc.loadGenotypes(path).toVariantContext.collect.sortBy(_.position)
+ val vcs = sc.loadGenotypes(path).toVariantContextRDD.rdd.collect.sortBy(_.position)

@heuermh

heuermh Apr 26, 2016

Member

Since we have this

object ADAMContext {
  implicit def genomicRDDToRDD[T](gRdd: GenomicRDD[T]): RDD[T] = gRdd.rdd

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L114

why is the .rdd here and elsewhere necessary? Implicits work fine except for when they don't?


@fnothaft

fnothaft Apr 26, 2016

Member

Oh man. I zoned on the implicit. That's my bad. I'll fix this.


@heuermh

heuermh Apr 26, 2016

Member

Sorry, what I meant was that sometimes the genomicRDDToRDD implicit works for me and sometimes it doesn't. I assumed that because you are using .rdd here, the explicit calls are necessary. The Scala compiler and I are still getting to know each other.


@fnothaft

fnothaft Apr 27, 2016

Member

Nah I was using them because I forgot that we had the implicit.
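
For anyone following along, the two call styles being discussed look like this. A small sketch, assuming a SparkContext sc with the ADAMContext implicits in scope:

import org.bdgenomics.adam.rdd.ADAMContext._

val genotypes = sc.loadGenotypes("small.vcf.adam")

// explicit: unwrap the backing RDD before calling RDD methods
val explicitCount = genotypes.rdd.filter(_.getSampleId == "NA12878").count

// implicit: genomicRDDToRDD converts the GenotypeRDD to RDD[Genotype], so RDD
// methods can be called on the wrapper directly
val implicitCount = genotypes.filter(_.getSampleId == "NA12878").count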


@fnothaft

Member

fnothaft commented May 20, 2016

Just pushed an update with upgrades to bdg-formats 0.8.0 and cleanup of the .rdd bits. This fails unit tests right now, which I'll patch up this AM. CC @erictu @heuermh

The 0.8.0 upgrade is split out into a separate commit. This will need to be squashed down before merge.

@AmplabJenkins

AmplabJenkins May 20, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1239/

Build result: FAILURE

GitHub pull request #1015 of commit 0ecfae9 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1015/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 8a08bb5f7a926e05f9627de97e383f281c935d51 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1015/merge^{commit} # timeout=10
Checking out Revision 8a08bb5f7a926e05f9627de97e383f281c935d51 (origin/pr/1015/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 8a08bb5f7a926e05f9627de97e383f281c935d51
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft

Member

fnothaft commented May 20, 2016

Just pushed another commit to address the failing unit tests. Please review @heuermh @erictu.

@heuermh you may be interested to find that GZIPed VCF loading doesn't work. Alas, the VCFHeaderReader code from Hadoop-BAM seems to fail when trying to load GZIPed VCF. I think there's a fall-through case that wasn't considered properly. I'm going to open an issue here to test more, and then if I can isolate/test/resolve it, I'll go upstream to Hadoop-BAM.

@heuermh

Member

heuermh commented May 20, 2016

you may be interested to find that GZIPed VCF loading doesn't work.

It was working for me, and there is a unit test to verify it.
https://github.com/bigdatagenomics/adam/blame/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L284

@fnothaft

Member

fnothaft commented May 20, 2016

you may be interested to find that GZIPed VCF loading doesn't work.

It was working for me, and there is a unit test to verify it.
https://github.com/bigdatagenomics/adam/blame/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L284

Oh, yeah; didn't mean to imply that it was currently misimplemented. The issue comes up because we now rely on reading the VCF header separately to pull out sequence/sample metadata:

https://github.com/bigdatagenomics/adam/pull/1015/files#diff-d36ea7d0742decd0b040a73a96af06e9R136

The VCFHeaderReader in Hadoop-BAM appears to mishandle GZIP/BGZIPed VCFs. As far as I can tell, it falls through and tries to treat them as BCF files.
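
For reference, the separate header read looks roughly like the sketch below. WrapSeekable and VCFHeaderReader are the Hadoop-BAM helpers; the method name and exact wiring here are illustrative rather than the precise ADAM code.

import scala.collection.JavaConverters._
import htsjdk.variant.vcf.VCFHeader
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.seqdoop.hadoop_bam.util.{ VCFHeaderReader, WrapSeekable }

// read the VCF header once, independently of the record reader, and pull out
// the sample column names; this is the step that trips over gzipped input
def readSamples(sc: SparkContext, filePath: String): Seq[String] = {
  val stream = WrapSeekable.openPath(sc.hadoopConfiguration, new Path(filePath))
  val header: VCFHeader = VCFHeaderReader.readHeaderFrom(stream)
  header.getGenotypeSamples.asScala.toSeq
}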

@AmplabJenkins

AmplabJenkins May 20, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1240/
Test PASSed.


@heuermh

Member

heuermh commented May 20, 2016

Yep, in agreement. There's some duplicated code in Hadoop-BAM that should be shared, and things probably diverged between the VCF input format and VCF header reader in this case.

+ val variantContextRdd = sc.loadVcf(args.vcfPath, sdOpt = dictionary)
+ var variantContextsToSave = if (args.coalesce > 0) {
+ if (args.coalesce > variantContextRdd.partitions.size || args.forceShuffle) {
+ variantContextRdd.transform(_.coalesce(args.coalesce, shuffle = true))

@heuermh

heuermh May 23, 2016

Member

could you explain what the .transform(_. part is doing here? is it not possible to call variantContextRdd.coalesce directly?


@fnothaft

fnothaft May 26, 2016

Member

.transform applies the supplied function to the RDD that underlies the VariantContextRDD and wraps the result in a new VariantContextRDD.


@fnothaft

fnothaft May 26, 2016

Member

If you called variantContextRdd.coalesce directly, that would emit an RDD[VariantContext] and you'd lose the .saveAsVcf etc. functions, which we use later.
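
Concretely, the difference looks like this. A sketch, assuming a SparkContext sc with the ADAMContext implicits in scope and the loadVcf signature from the diff above:

import org.bdgenomics.adam.rdd.ADAMContext._

val variantContexts = sc.loadVcf("sample.vcf", sdOpt = None)

// transform keeps the VariantContextRDD wrapper, so the sequence/sample
// metadata and methods like saveAsVcf remain available afterwards
val coalesced = variantContexts.transform(_.coalesce(4, shuffle = true))

// coalescing the bare RDD yields a plain RDD[VariantContext]; the wrapper,
// its metadata, and its save methods are gone
val bare = variantContexts.rdd.coalesce(4, shuffle = true)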


@heuermh

heuermh May 26, 2016

Member

I'm liking this pattern less as time goes on. Perhaps we should drop RDD from the VariantContextRDD class names for something more generic and remove the implicit conversion. Or at least remove the implicit conversion. What the user gets back from an ADAMContext load method should have an RDD (has-a), not pretend to be an RDD (is-a).


@jpdna

jpdna May 26, 2016

Member

Perhaps we should drop RDD from the VariantContextRDD

I agree there may be some nomenclature concern: our many *RDD container objects for an RDD plus dictionaries are not RDDs in the same true sense as IntervalRDD and IndexedRDD, which actually extend RDD. If I am reading correctly, there is not an implicit conversion that would even allow these to be treated as an RDD directly. The meaning is closer to HolderOfRDDOfVariantContextWithDicts, though I don't advocate that name. What do you think @fnothaft? At this point we have already gone down this path a ways using the suffix "RDD" in a broad sense, so a renaming should perhaps come in the future, after this PR.


@heuermh

heuermh May 26, 2016

Member

There's this implicit conversion from GenomicRDD (which VariantContextRDD and others extend from) to RDD
https://github.com/fnothaft/adam/blob/822c64e3530d7ed8bb86daeb7b900bd7bf2f7854/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L116


@heuermh

heuermh May 26, 2016

Member

Created a new issue #1040, I'm ok with this for now.


+
+@deprecated("Extend ADAMRDDFunctions and mix in GenomicRDD wherever possible in new development.",
+ since = "0.20.0")
+class ConcreteADAMRDDFunctions[T <% IndexedRecord: Manifest] private[rdd] (val rdd: RDD[T]) extends ADAMRDDFunctions[T] {

@heuermh

heuermh May 23, 2016

Member

what is this used for? if just for the implicit conversion could it be private to ADAMContext?


@fnothaft

fnothaft May 26, 2016

Member

This is necessary for the flatten command line tool, IIRC.


@fnothaft

fnothaft May 26, 2016

Member

But yes, I think making it private is a good idea.
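
A sketch of what the tightened visibility might look like; this is illustrative only, not the change made in this PR, and the chosen scope is an assumption.

import org.apache.avro.generic.IndexedRecord
import org.apache.spark.rdd.RDD

// restrict construction to ADAM internals so only ADAMContext (and callers
// going through it, e.g. the flatten tool) can obtain an instance
@deprecated("Extend ADAMRDDFunctions and mix in GenomicRDD wherever possible in new development.",
  since = "0.20.0")
private[bdgenomics] class ConcreteADAMRDDFunctions[T <% IndexedRecord: Manifest] private[rdd] (
  val rdd: RDD[T]) extends ADAMRDDFunctions[T]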


if (Metrics.isRecording) records.instrument() else records
- records.flatMap(p => vcc.convert(p._2.get))
+ // load vcf metadata
+ val (vcfSd, samples) = loadVcfMetadata(filePath)

@heuermh

heuermh May 23, 2016

Member

the header has already been read via VCFInputFormat above, and VCFRecordReader holds a reference to it. Is there some way to pull that out? VCFHeaderReader doesn't contain the decompression bits that VCFRecordReader does.
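
For illustration, the missing decompression handling might look something like the sketch below; this is not the Hadoop-BAM or ADAM code. Note that BGZF files often also use the plain .gz extension, so sniffing the magic bytes would be more robust than checking extensions.

import java.io.InputStream
import java.util.zip.GZIPInputStream
import htsjdk.samtools.util.BlockCompressedInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// pick a decompressor from the extension before handing the stream to
// whatever parses the header lines
def openPossiblyCompressed(path: Path, conf: Configuration): InputStream = {
  val raw = path.getFileSystem(conf).open(path)
  val name = path.getName
  if (name.endsWith(".bgz") || name.endsWith(".bgzf")) new BlockCompressedInputStream(raw)
  else if (name.endsWith(".gz")) new GZIPInputStream(raw)
  else raw
}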


@fnothaft

fnothaft May 25, 2016

Member

Actually, a really convoluted way to do this is to read the VCF via the input format and to then pull the header from a record.

I don't like this for various reasons, but I think it's a good WAR (workaround) for the VCF.GZ/BCF issue. And when I say "good", I mean "passable".


@fnothaft

fnothaft May 26, 2016

Member

Resolved by 0f90302.


val path = resourcePath("test.vcf.gz")
val vcs = sc.loadVcf(path, None)
assert(vcs.count === 6)
}
- sparkTest("can read a BGZF gzipped .vcf file") {
+ ignore("can read a BGZF gzipped .vcf file") {

@heuermh

heuermh May 23, 2016

Member

bummer that a feature just added for users is gone again. Is there any way to continue to support this without waiting for another release of Hadoop-BAM?


@heuermh

Member

heuermh commented May 24, 2016

Created HadoopGenomics/Hadoop-BAM#96 to document missing support for gzipped and BGZF VCF formats.

- rdd.keyBy({ g => RichVariant.variantToRichVariant(g.getVariant) })
- .groupByKey
- .map { case (v: RichVariant, g) => new VariantContext(ReferencePosition(v), v, g, None) }
- }
def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {

@erictu

erictu May 24, 2016

Member

Can we get this operation supported over the GenotypeRDD class?


@fnothaft

fnothaft May 26, 2016

Member

Added.
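
Hypothetical usage once the method is exposed on GenotypeRDD, using the signature from the removed code above; a sketch assuming a SparkContext sc with the ADAMContext implicits in scope:

import org.bdgenomics.adam.models.ReferenceRegion
import org.bdgenomics.adam.rdd.ADAMContext._

val genotypes = sc.loadGenotypes("sample.gt.adam")
// keep only genotypes overlapping the queried region
val overlapping = genotypes.filterByOverlappingRegion(ReferenceRegion("1", 100000L, 200000L))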


+ val sd = loadAvroSequences(filePath)
+
+ // load avro record group dictionary and convert to samples
+ val rgd = loadAvroSampleMetadata(filePath)

@erictu

erictu May 25, 2016

Member

This means that providing the rgdict upon vcf2adam conversion is required. I tried converting a vcf without providing the optional -dict flag, and I'm unable to load it in. Should we require the dict to be provided when converting from vcf to adam?


@heuermh

heuermh May 25, 2016

Member

No, we should generate it from the VCF header


@jpdna

jpdna May 25, 2016

Member

No, we should generate it from the VCF header

I'm going to try to add this functionality and make a PR against this PR


@fnothaft

fnothaft May 25, 2016

Member

I believe this functionality already exists?


@jpdna

jpdna May 25, 2016

Member

@erictu - using this PR as is I was able to run
adam-submit vcf2adam small.vcf out.adam
and it appears to have worked fine without the -dict and produced _seqdict.avro by extracting from the VCF header as we would expect.
Can you describe more about how to reproduce the error you saw when -dict is omitted?
Did your VCF lack a ##contig= line in the header?
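
The fallback under discussion amounts to something like the sketch below; fromSAMSequenceDictionary is the assumed ADAM entry point for the conversion, and the helper name is illustrative.

import htsjdk.variant.vcf.VCFHeader
import org.bdgenomics.adam.models.SequenceDictionary

// with no -dict supplied, derive the sequence dictionary from the ##contig=
// lines of the VCF header; getSequenceDictionary returns null if none exist
def dictionaryFromHeader(header: VCFHeader): Option[SequenceDictionary] = {
  Option(header.getSequenceDictionary).map(SequenceDictionary.fromSAMSequenceDictionary)
}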


@heuermh

heuermh May 26, 2016

Member

Are samples in a VCF really the same as record groups in a SAM/BAM file? Instead of saving _samples.avro as RecordGroupDictionary perhaps we should add something new for samples.


@jpdna

jpdna May 26, 2016

Member

Are samples in a VCF really the same as record groups in a SAM/BAM file?

Good point @heuermh - I also had this concern about conflating SAM/BAM RecordGroup metadata with the concept of a VCF sample; I was just focused so far on fixing the bug in the code as written. I need to study the schema as a whole a bit more before stating an opinion, but if there is even a marginal distinction between a SAM/BAM record/read group and a VCF sample, I'd suggest erring on the side of breaking the concepts apart for clarity and adding a new "GenotypedSampleMetadata" Avro schema and object to hold VCF sample dictionary metadata.


@fnothaft

fnothaft May 26, 2016

Member

I need to study the schema as a whole a bit more before stating an opinion - but if there is even a marginal distinction between SAM/BAM (Record/Read Group) and VCF sample I'd suggest erring on the side of breaking the concepts apart for clarity and adding a new "GenotypedSampleMetadata" avro schema and object to hold VCF sample dict metadata.

I think that sounds reasonable, but maybe we can open that as future work? My general reasoning for starting with a record group is that a VCF sample descends from one or more record groups. I would like to punt this to after this PR for a few reasons:

  • This PR resolves a blocking issue for a downstream tool.
  • Since we present just a Seq[String] with the sample metadata and the actual way that the sample is saved to disk is never visible to the end user, I don't think it is a critical issue.
  • We should do a broader refactoring of the various metadata bits: factor out computational provenance from wet-lab provenance, allow a VCF sample to descend from read group samples, etc.
  • Additionally, this refactor would block on a bdg-formats release.

@jpdna

jpdna May 26, 2016

Member

to punt this to after this PR

+1 - Just created #1039 to hold that thought for future


@heuermh

heuermh May 26, 2016

Member

+1 to later formats changes, -0 on adding any new records with Metadata in the name.


@fnothaft

Member

fnothaft commented May 25, 2016

Just pushed 0f90302 to address the VCF.GZ issues. I will merge @jpdna's PR as well. @heuermh I'm still working on the BCF issue. There's something more going on here. I will ping back with more of an error trace.

@AmplabJenkins

AmplabJenkins May 25, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1243/
Test PASSed.


@heuermh

Member

heuermh commented May 26, 2016

I'm still working on the BCF issue.

Which BCF issue? BCF files probably won't work in ADAM due to issue(s) upstream in HTSJDK, where they don't follow the latest BCF specification.

@fnothaft

Member

fnothaft commented May 26, 2016

I'm still working on the BCF issue.

Which BCF issue? BCF files probably won't work in ADAM due to issue(s) upstream in HTSJDK, where they don't follow the latest BCF specification.

Ah, OK! That is the exact issue I was running into. I won't hassle myself trying to fix that then.

(I tried regenerating the BCF using Picard, which, amusingly enough, doesn't seem to work.)

@AmplabJenkins

AmplabJenkins May 26, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1244/
Test PASSed.


@fnothaft

Member

fnothaft commented May 26, 2016

Just pushed a commit that (I think) addresses the remaining review comments. Can I get a final pass on this? I will then squash this down to two commits.

@AmplabJenkins

AmplabJenkins May 26, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1245/
Test PASSed.


@heuermh

Member

heuermh commented May 26, 2016

+1, thanks!

@jpdna

Member

jpdna commented May 26, 2016

+1 - I have no outstanding concerns

@erictu

Member

erictu commented May 26, 2016

+1, seems good to me!

@fnothaft

Member

fnothaft commented May 26, 2016

Squashed down into three commits and cleaned up the history.

@fnothaft

Member

fnothaft commented May 26, 2016

Hold on the merge; I see a unit test failure in one of the intermediate commits when building locally.

@AmplabJenkins

AmplabJenkins May 26, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1246/
Test PASSed.


@fnothaft

Member

fnothaft commented May 26, 2016

Fixed and pushed.

@AmplabJenkins

AmplabJenkins May 26, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1247/

Build result: FAILURE

[...truncated 24 lines...]
Triggering ADAM-prb ? 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centos
ADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft

Member

fnothaft commented May 26, 2016

Jenkins, retest this please.

@heuermh

Member

heuermh commented May 26, 2016

Do you think our logging related changes might have anything to do with that failure? We removed some of the logging configuration initialization bits in #1028

@AmplabJenkins

AmplabJenkins May 26, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1248/
Test PASSed.


@erictu

Member

erictu commented May 27, 2016

Can we add in the corresponding changes of schema to org.bdgenomics.adam.projections.GenotypeField?
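
For context, the projection enum is what lets callers read only a subset of Genotype columns from Parquet. A sketch of the usage; the chosen field names and the loadParquetGenotypes signature are assumptions, not confirmed by this PR:

import org.bdgenomics.adam.projections.{ GenotypeField, Projection }
import org.bdgenomics.adam.rdd.ADAMContext._

// read only the listed Genotype fields from Parquet
val proj = Projection(GenotypeField.variant, GenotypeField.sampleId, GenotypeField.alleles)
val genotypes = sc.loadParquetGenotypes("sample.gt.adam", projection = Some(proj))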

@heuermh

Member

heuermh commented Jun 1, 2016

@fnothaft do you want to push another commit for GenotypeField and then re-squash? I could pr against this one but it probably isn't worth it for such a small change.

+
+ // due to a bug upstream in Hadoop-BAM, the VCFHeaderReader class errors when reading
+ // headers from .vcf.gz files
+ //

fnothaft added some commits Apr 22, 2016

[ADAM-909] Refactoring variation RDDs.
Resolves #909:

* Refactors `org.bdgenomics.adam.rdd.variation` to add `GenomicRDD`s for
  `Genotype`, `Variant`, and `VariantContext`. These classes write
  sequence and sample metadata to disk.
* Refactors `ADAMRDDFunctions` to an abstract class in preparation for
  further refactoring in #1011.
* Added `AvroGenomicRDD` trait which consolidates Parquet + Avro metadata
  writing code across all Avro data models.
@fnothaft

Member

fnothaft commented Jun 1, 2016

@fnothaft do you want to push another commit for GenotypeField and then re-squash? I could pr against this one but it probably isn't worth it for such a small change.

Sorry about the delay on this @heuermh @erictu! Just pushed the change.

@AmplabJenkins

AmplabJenkins Jun 1, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1249/
Test PASSed.


@heuermh

Member

heuermh commented Jun 1, 2016

+1

@erictu

Member

erictu commented Jun 2, 2016

+1

@heuermh heuermh merged commit 525fda8 into bigdatagenomics:master Jun 3, 2016

1 check passed

default Merged build finished.
@heuermh

Member

heuermh commented Jun 3, 2016

Merged manually as 449a882, 2c07382 and 525fda8. Thanks!
