Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1554] Support saving BGZF VCF output. #1608

Merged
merged 1 commit into from Jul 17, 2017

Conversation

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jul 17, 2017

Resolves #1554. Additionally:

  • Moves VCF file extension logic into ADAM, since Hadoop-BAM does not recognize .bgzf as a valid BGZF extension
  • Refactored VCF output formats to remove need for ADAMHeaderlessVCFOutputFormat

This isn't completely done... Specifically, it produces output that bcftools thinks is good:

$ bcftools stats bqsr1.vcf.bgz 
# This file was produced by bcftools stats (1.3.1+htslib-1.3.1) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  bqsr1.vcf.bgz
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	bqsr1.vcf.bgz
# SN, Summary numbers:
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	0
SN	0	number of records:	681
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	681
SN	0	number of MNPs:	0
SN	0	number of indels:	0
SN	0	number of others:	0
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0

But, that fails to load in properly when read back from ADAM:

scala> sc.loadVcf("bqsr1.vcf.bgz").rdd.count
res1: Long = 0

I'm still debugging this, but was wondering if anyone else had thoughts?

@coveralls
Copy link

@coveralls coveralls commented Jul 17, 2017

Coverage Status

Coverage decreased (-0.08%) to 84.076% when pulling 5cce59e on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2222/
Test PASSed.

if (stringency == ValidationStringency.STRICT) {
throw new IllegalArgumentException(msg)
} else {
log.warn(msg)

This comment has been minimized.

@heuermh

heuermh Jul 17, 2017
Member

Does it make sense to stringency this? If we warn here, does the file still get written correctly?

This comment has been minimized.

@fnothaft

fnothaft Jul 17, 2017
Author Member

I mean, you'll get a valid VCF file, but the file won't be BGZF'ed. That said, I'd be fine not stringencying this.

Resolves #1554. Additionally:

* Moves VCF file extension logic into ADAM, since Hadoop-BAM does not recognize
  .bgzf as a valid BGZF extension
* Refactored VCF output formats to remove need for ADAMHeaderlessVCFOutputFormat
@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Jul 17, 2017

OK, sorted out the issue; I was writing a BGZF EOF terminator at the end of the header. htslib is OK with this, htsjdk isn't. Also addressed @heuermh's comment RE validation stringency.

@fnothaft fnothaft force-pushed the fnothaft:issues/1554-save-bgzip-vcf branch from 5cce59e to 55d1af7 Jul 17, 2017
@coveralls
Copy link

@coveralls coveralls commented Jul 17, 2017

Coverage Status

Coverage decreased (-0.04%) to 84.112% when pulling 55d1af7 on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2223/
Test PASSed.

@heuermh heuermh merged commit 406d1e3 into bigdatagenomics:master Jul 17, 2017
3 checks passed
3 checks passed
codacy/pr Good work! A positive pull request.
Details
coverage/coveralls Coverage decreased (-0.04%) to 84.112%
Details
default Merged build finished.
Details
@heuermh
Copy link
Member

@heuermh heuermh commented Jul 17, 2017

Thank you, @fnothaft

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.