New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1554] Support saving BGZF VCF output. #1608

Merged
merged 1 commit into from Jul 17, 2017

Conversation

Projects
4 participants
@fnothaft
Member

fnothaft commented Jul 17, 2017

Resolves #1554. Additionally:

  • Moves VCF file extension logic into ADAM, since Hadoop-BAM does not recognize .bgzf as a valid BGZF extension
  • Refactored VCF output formats to remove need for ADAMHeaderlessVCFOutputFormat

This isn't completely done... Specifically, it produces output that bcftools thinks is good:

$ bcftools stats bqsr1.vcf.bgz 
# This file was produced by bcftools stats (1.3.1+htslib-1.3.1) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  bqsr1.vcf.bgz
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	bqsr1.vcf.bgz
# SN, Summary numbers:
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	0
SN	0	number of records:	681
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	681
SN	0	number of MNPs:	0
SN	0	number of indels:	0
SN	0	number of others:	0
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0

But, that fails to load in properly when read back from ADAM:

scala> sc.loadVcf("bqsr1.vcf.bgz").rdd.count
res1: Long = 0

I'm still debugging this, but was wondering if anyone else had thoughts?

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 17, 2017

Coverage Status

Coverage decreased (-0.08%) to 84.076% when pulling 5cce59e on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

coveralls commented Jul 17, 2017

Coverage Status

Coverage decreased (-0.08%) to 84.076% when pulling 5cce59e on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2222/
Test PASSed.

AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2222/
Test PASSed.

[ADAM-1554] Support saving BGZF VCF output.
Resolves #1554. Additionally:

* Moves VCF file extension logic into ADAM, since Hadoop-BAM does not recognize
  .bgzf as a valid BGZF extension
* Refactored VCF output formats to remove need for ADAMHeaderlessVCFOutputFormat
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 17, 2017

Member

OK, sorted out the issue; I was writing a BGZF EOF terminator at the end of the header. htslib is OK with this, htsjdk isn't. Also addressed @heuermh's comment RE validation stringency.

Member

fnothaft commented Jul 17, 2017

OK, sorted out the issue; I was writing a BGZF EOF terminator at the end of the header. htslib is OK with this, htsjdk isn't. Also addressed @heuermh's comment RE validation stringency.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 17, 2017

Coverage Status

Coverage decreased (-0.04%) to 84.112% when pulling 55d1af7 on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

coveralls commented Jul 17, 2017

Coverage Status

Coverage decreased (-0.04%) to 84.112% when pulling 55d1af7 on fnothaft:issues/1554-save-bgzip-vcf into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2223/
Test PASSed.

AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2223/
Test PASSed.

@heuermh heuermh merged commit 406d1e3 into bigdatagenomics:master Jul 17, 2017

3 checks passed

codacy/pr Good work! A positive pull request.
Details
coverage/coveralls Coverage decreased (-0.04%) to 84.112%
Details
default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 17, 2017

Member

Thank you, @fnothaft

Member

heuermh commented Jul 17, 2017

Thank you, @fnothaft

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment