Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Vcf2ADAM and ADAM2Vcf into TransformGenotypes and TransformVariants #1532

Merged
merged 2 commits into from May 31, 2017

Conversation

@heuermh
Copy link
Member

@heuermh heuermh commented May 17, 2017

Work in progress.

  • Update docs
  • Failing unit tests due to FT Number=. vs. FT Number=1
  • Test on cluster
@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2020/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1532/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ca1492e # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1532/merge^{commit} # timeout=10Checking out Revision ca1492e (origin/pr/1532/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ca1492eca4df1318f9b141d20b1764ce8e7047bcFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2022/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1532/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 88e8596f3faf13b75fb2f7517083f30d0ef4d246 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1532/merge^{commit} # timeout=10Checking out Revision 88e8596f3faf13b75fb2f7517083f30d0ef4d246 (origin/pr/1532/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 88e8596f3faf13b75fb2f7517083f30d0ef4d246First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

@coveralls coveralls commented May 18, 2017

Coverage Status

Coverage increased (+0.2%) to 82.437% when pulling 1a60973 on heuermh:transform-variants into 37b971a on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2025/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Good start, I have a variety of misc suggestions. Specifically, I'd like to remove the saveAsVcf functions and funnel those through VariantContextRDD.

stringency = stringency)

// convert to variant contexts
val variantContexts = genotypes.toVariantContextRDD
Copy link
Member

@fnothaft fnothaft May 19, 2017

Should only convert to variant contexts if saving as VCF. The sort and transform methods are all defined on GenomicRDD so just make the maybeCoalesce functions take [T: GenomicRDD] or something like that.

Copy link
Member Author

@heuermh heuermh May 21, 2017

That is the way it was implemented before the latest commit and failed unit tests. Sorting GenotypeRDD didn't work as expected.

Copy link
Member

@fnothaft fnothaft May 21, 2017

Yeah, sorting GenotypeRDD and then converting to VariantContextRDD currently does a groupBy without a defined partitioner, which causes a hash-partitioned shuffle, and thus un-sorts the VariantContextRDD. Yes, this code does fix that, but is incredibly inefficient if you're going VCF to parquet in which case you flatMap VariantContext to Genotype, then groupBy on Genotypes to reconstruct VariantContexts, which you then flatMap back to Parquet. Doesn't make sense to me.

* @param stringency The validation stringency to use when writing the VCF.
* Defaults to LENIENT.
*/
def saveAsVcf(args: ADAMSaveAnyArgs,
Copy link
Member

@fnothaft fnothaft May 19, 2017

Thoughts on removing this method?

Copy link
Member Author

@heuermh heuermh May 21, 2017

I wouldn't have added it if I thought it should be removed ;)

As mentioned elsewhere, save methods are a bit of a mess and need to be made consistent across subclasses of GenomicRDD. I'm ok if this PR makes it a little worse before it gets better.

I'd like it if the TransformXxx code could start with the format-guessing sc.loadXxx and end with the format-guessing save. I don't want if (outputPath.endsWith(".vcf")) then convert to VariantContextRDD in there.

Copy link
Member

@fnothaft fnothaft May 21, 2017

I think this is inconsistent with the direction we've been going, where we've broken up GenomicRDD methods that transiently change the underlying type of the GenomicRDD (e.g., toCoverage(collapse = true) becoming toCoverage().collapse(). I think that this method is particularly problematic because the changes to the method actually break the utility of the method. Specifically, this change means that the .save method always saves unsorted VCF, which is non-compliant with the VCF spec.

Copy link
Member Author

@heuermh heuermh May 22, 2017

Isn't the fix here and above to address the "groupBy without a defined partitioner, which causes a hash-partitioned shuffle, and thus un-sorts the VariantContextRDD."?

Copy link
Member

@fnothaft fnothaft May 22, 2017

I believe it is a bit more complex than that, and would be best to handle after #1324 merges (since you lose partitioning on "externally sorted" data).

Copy link
Member Author

@heuermh heuermh May 22, 2017

Hmm, yeah I do see that.

So you're suggesting no format guessing save methods for GenotypeRDD, VariantRDD, and VariantContextRDD. GenotypeRDD and VariantRDD have only saveAsParquet and VariantContextRDD has only saveAsVcf. The format guessing happens in the CLI classes. I could support that.

Copy link
Member

@fnothaft fnothaft May 22, 2017

Yup! I +1 that approach.

Copy link
Member Author

@heuermh heuermh May 22, 2017

ok, what are the scala syntax sprinkles necessary for maybeSort to operate on GenotypeRDD and return the specific subclass?

private def maybeSort[U <: GenomicRDD[_, U]](rdd: U): U = {

* @param stringency The validation stringency to use when writing the VCF.
*/
def saveAsVcf(filePath: String,
asSingleFile: Boolean,
stringency: ValidationStringency) {
toVariantContextRDD.saveAsVcf(filePath, asSingleFile, stringency)
deferMerging: Boolean,
Copy link
Member

@fnothaft fnothaft May 19, 2017

Thoughts on removing this method as well?

* Defaults to LENIENT.
*/
def saveAsVcf(args: ADAMSaveAnyArgs,
stringency: ValidationStringency = ValidationStringency.LENIENT): Unit = {
Copy link
Member

@fnothaft fnothaft May 19, 2017

I'd prefer to not add this method.

* @param stringency The validation stringency to use when writing the VCF.
*/
def saveAsVcf(filePath: String,
asSingleFile: Boolean,
stringency: ValidationStringency) {
toVariantContextRDD.saveAsVcf(filePath, asSingleFile, stringency)
deferMerging: Boolean,
Copy link
Member

@fnothaft fnothaft May 19, 2017

I'd prefer to remove this method.

@@ -0,0 +1,74 @@
##fileformat=VCFv4.2
Copy link
Member

@fnothaft fnothaft May 19, 2017

We already have a adam-core/src/test/resources/sorted.vcf. Why is this file needed?

Copy link
Member Author

@heuermh heuermh May 21, 2017

sorted-variants.vcf is the site-only/-only_variants version for TransformVariants

Copy link
Member

@fnothaft fnothaft May 21, 2017

Ah, I see! Nevermind.

@coveralls
Copy link

@coveralls coveralls commented May 22, 2017

Coverage Status

Coverage increased (+0.6%) to 82.845% when pulling bcb3a2f on heuermh:transform-variants into 37b971a on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 22, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2036/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1532/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 900d7dc0591c43a8bf89d87289b2dc7cea4dec47 # timeout=10Checking out Revision 900d7dc0591c43a8bf89d87289b2dc7cea4dec47 (origin/pr/1532/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 900d7dc0591c43a8bf89d87289b2dc7cea4dec47First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

@coveralls coveralls commented May 22, 2017

Coverage Status

Coverage increased (+0.6%) to 82.845% when pulling b14b94e on heuermh:transform-variants into 37b971a on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 22, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2038/
Test PASSed.

@fnothaft fnothaft added this to the 0.23.0 milestone May 24, 2017
@heuermh heuermh force-pushed the transform-variants branch from b14b94e to baf789d May 26, 2017
@coveralls
Copy link

@coveralls coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.9%) to 82.796% when pulling baf789d on heuermh:transform-variants into fc3e5fd on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2068/
Test PASSed.

@heuermh heuermh force-pushed the transform-variants branch from baf789d to 96af178 May 30, 2017
@coveralls
Copy link

@coveralls coveralls commented May 30, 2017

Coverage Status

Coverage increased (+0.9%) to 82.807% when pulling 96af178 on heuermh:transform-variants into b0c5c5f on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 30, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2072/
Test PASSed.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented May 30, 2017

@heuermh Ping me when this is ready for review.

@heuermh
Copy link
Member Author

@heuermh heuermh commented May 30, 2017

The commit history looks a mess, otherwise I think it is ready to review.

Copy link
Member

@fnothaft fnothaft left a comment

Two small bits! Otherwise looks really good. Thanks @heuermh!

optProjection = None,
stringency = stringency)

if (args.outputPath.endsWith(".vcf")) {
Copy link
Member

@fnothaft fnothaft May 31, 2017

+1! I like this refactor. One thing here though is that we lost the check for compressed formats (e.g., ".vcf.gz").

Copy link
Member Author

@heuermh heuermh May 31, 2017

There wasn't a check for compressed formats before, and I don't think we support writing compressed VCF through HadoopBAM. With a code change to use FileExtensions.isVcfExt

$ ./bin/adam-submit transformVariants \
  -single \
  adam-core/src/test/resources/small.vcf \
  small.vcf.gz

$ gzcat small.vcf.gz 
gzcat: small.vcf.gz: not in gzip format

$ head small.vcf.gz 
##fileformat=VCFv4.2
##FILTER=<ID=IndelFS,Description="FS > 200.0">
##FILTER=<ID=IndelQD,Description="QD < 2.0">
...

$ ./bin/adam-submit transformVariants \
  -single \
  adam-core/src/test/resources/small.vcf \
  small.vcf.bgz

$ head small.vcf.bgz 
##fileformat=VCFv4.2
##FILTER=<ID=IndelFS,Description="FS > 200.0">
##FILTER=<ID=IndelQD,Description="QD < 2.0">
...

$ ./bin/adam-submit transformVariants \
  -single \
  adam-core/src/test/resources/small.vcf \
  small.vcf.bgzf

Command body threw exception:
java.lang.AssertionError: assertion failed: BCF not yet supported
Exception in thread "main" java.lang.AssertionError: assertion failed: BCF not yet supported
	at scala.Predef$.assert(Predef.scala:170)
	at org.bdgenomics.adam.rdd.variant.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:164)
	at org.bdgenomics.adam.rdd.variant.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:135)
	at org.bdgenomics.adam.cli.TransformVariants.run(TransformVariants.scala:128)
	at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
	at org.bdgenomics.adam.cli.TransformVariants.run(TransformVariants.scala:75)
...

Do we have to set a configuration parameter or otherwise kick HadoopBAM to write .gz or .bgz/.bgzf?

Copy link
Member

@fnothaft fnothaft May 31, 2017

Ah, sorry about that, you are correct that we haven't been checking that for output.

That said, I read through the source and Hadoop-BAM writes BGZIP if you set the FileOutputFormat.COMPRESS configuration value to true.

Copy link
Member Author

@heuermh heuermh May 31, 2017

How about leaving this pr as-is then, squashed to two commits to keep commit c1b7375 separate, and then a new pr to add support for writing .gz (as BGZIP VCF) and .bgz/.bgzf (also as BGZIP VCF, I believe the exception error message above is incorrect)?

optProjection = None,
stringency = stringency)

if (args.outputPath.endsWith(".vcf")) {
Copy link
Member

@fnothaft fnothaft May 31, 2017

Ditto here RE: compression.

@@ -150,7 +150,7 @@ object DefaultHeaderLines {
VCFHeaderLineType.Float,
"Read-backed phasing quality")
lazy val genotypeFilter = new VCFFormatHeaderLine("FT",
1,
VCFHeaderLineCount.UNBOUNDED,
Copy link
Member

@fnothaft fnothaft May 31, 2017

Can you keep this split out into a different commit so that we have a single commit per issue? No sweat if you've already squashed.

Copy link
Member Author

@heuermh heuermh May 31, 2017

Will do

Copy link
Member

@fnothaft fnothaft left a comment

Moving the compression stuff to a later issue works for me! Thanks @heuermh.

@coveralls
Copy link

@coveralls coveralls commented May 31, 2017

Coverage Status

Coverage increased (+0.6%) to 82.796% when pulling 504e325 on heuermh:transform-variants into bbd695e on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 31, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2075/
Test PASSed.

@fnothaft fnothaft merged commit b7762c2 into bigdatagenomics:master May 31, 2017
2 of 3 checks passed
@fnothaft
Copy link
Member

@fnothaft fnothaft commented May 31, 2017

Merged! Thanks @heuermh!

@heuermh heuermh deleted the transform-variants branch May 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants