vcf corrupt #911

Closed
dwaggott opened this issue Jun 29, 2015 · 8 comments

@dwaggott

I'm using the GATK joint calling pipeline. It looks like some corrupt VCFs are sneaking through. Similar issue to #771.

# first error
[2015-06-28T06:36Z] scg3-0-1.local: [E::hts_idx_push] unsorted positions
[2015-06-28T06:36Z] scg3-0-1.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/srv/gsfs0/projects/ashley/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/srv/gsfs0/projects/ashley/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /srv/gsfs0/projects/ashley/apps/bin/tabix -f -p vcf /home/dwaggott/scratch/bcbiotx/f
[E::hts_idx_push] unsorted positions
tbx_index_build failed: /home/dwaggott/scratch/bcbiotx/f8f1c431-ede2-40e8-83ff-9a4fa36f0f05/tmpF8B95F/aaa_12888610
' returned non-zero exit status 1
# looks to be a bad vcf
$ bcftools view bcbio/tk_gatk_joint/work/gatk-haplotype/5/aaa-5_128886102_144562715.vcf.gz | grep 136836378
[vcf.c:1723 _vcf_parse_format] Number of columns at 5:136836378 does not match the number of samples (0 vs 1).
# deleted the vcf and restarted
  File "/srv/gsfs0/projects/ashley/apps/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 104, in get
    raise self._exception
IPython.parallel.error.CompositeError: one or more exceptions from call to method: concat_variant_files
[119:apply]: CalledProcessError: Command '/srv/gsfs0/projects/ashley/apps/bin/gatk-framework org.broadinstitute.gatk.tools.CatVariants -R /srv/gsfs0/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /srv/gsfs0/pr
INFO  13:56:15,787 HelpFormatter - ------------------------------------------------------- 
INFO  13:56:15,790 HelpFormatter - Program Name: org.broadinstitute.gatk.tools.CatVariants 
INFO  13:56:15,795 HelpFormatter - Program Args: -R /srv/gsfs0/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /srv/gsfs0/projects/ashley/dwaggott/projects/tk/results/bcbio/tk_gatk_joint/work/gatk-haplotype/LP6
INFO  13:56:15,798 HelpFormatter - Executing as dwaggott@scg3-0-21.local on Linux 2.6.32-504.16.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_03-b04. 
INFO  13:56:15,798 HelpFormatter - Date/Time: 2015/06/28 13:56:15 
INFO  13:56:15,798 HelpFormatter - ------------------------------------------------------- 
INFO  13:56:15,799 HelpFormatter - ------------------------------------------------------- 
.........10.........20.........30.........40.........50.........60.........70.........80.........90.........100.........110.........120.........130.........140.........150.........160.........170.........180.......##### ERROR ----
##### ERROR stack trace 
htsjdk.samtools.FileTruncatedException: Premature end of file
        at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:382)
        at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:127)
# looping over all vcfs I find some additional files
for i in `cat results/bcbio/tk_gatk_joint/work/gatk-haplotype/aaa-files.list`; do echo $i; bcftools view $i | wc -l; done;

[vcf.c:1166 bcf_write] Broken VCF record, the number of columns at X:363136 does not match the number of samples (0 vs 1).
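
For anyone who wants to find such records without bcftools, here is a minimal sketch of a column-count check. It assumes the failure mode reported above (a data line with fewer columns than the `#CHROM` header line) and is not part of bcbio; `find_bad_records` is a hypothetical helper and the demo records are made up.

```python
import io


def find_bad_records(handle):
    """Yield (line_number, chrom, pos, n_columns) for malformed data lines.

    A data line counts as malformed when its number of tab-separated
    columns differs from that of the #CHROM header line.
    """
    expected = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip meta-information lines and empty lines.
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            expected = len(fields)
            continue
        # A data line before any #CHROM header is also reported.
        if expected is None or len(fields) != expected:
            pos = fields[1] if len(fields) > 1 else "?"
            yield line_number, fields[0], pos, len(fields)


# Made-up demo: the second record is cut off before FORMAT and the sample.
demo = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "5\t136836370\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0/1\n"
    "5\t136836378\t.\tC\tT\t50\tPASS\tDP=12\n"
)
bad = list(find_bad_records(io.StringIO(demo)))
```

For a compressed file, the same function can be fed a handle from `gzip.open(path, "rt")`, since bgzip output is readable as multi-member gzip.
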
@chapmanb
Member

Daryl;
Sorry about the issues. This should be separate from #771: you're using GATK, which relies on a completely different set of programs to create and merge the VCFs. It looks like some of the variant call files from GATK are truncated or invalid, but it's hard to tell why from this debugging info. A couple of thoughts and suggestions:

  • Are you using the latest GATK?
  • When you remove the problem files and re-run, do the same files have issues or do they build cleanly?
  • If you look at the problem files manually, can you spot the issues?

If we can get more of a sense of what exactly is going wrong we can try to look for it or work around the issues. Hope this helps some.

@dwaggott
Author

Yup, it should be a separate issue. I wasn't sure if there was a common vcf validation step.

  • GATK is up to date and the version is being recognized by the yaml / logs
  • I removed the problem files and so far it seems to run fine. Stochastic, not deterministic.
  • Yes, bcftools gets all prissy, generally on a single line where the sample column is missing. Could be a cluster I/O thing, but I've got no clue.

How's the vcf verification code? For fault tolerance, this seems like a worthwhile effort.

@chapmanb
Member

Daryl;
Thanks for the additional details. That's a tough one. It sounds like there is some kind of intermittent write error on the filesystem, since you're getting truncated lines with no error messages. Short of running bcftools on every output to double-check it, I'm not sure of a good way to detect and avoid this, and I don't want to add too much overhead for what appears to be a filesystem issue. If you can come up with any way to reproduce or detect this during runs, I'd be happy to add that in. Glad it's running cleanly for you now, and I hope it finished okay.
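
One comparatively cheap check, sketched here as an assumption rather than anything bcbio does, is to stream each compressed output to the end and see whether decompression fails. That catches the "Premature end of file" style of truncation from the CatVariants trace, but not a line that is cut short inside an otherwise intact file. `reads_to_end` is a hypothetical name.

```python
import gzip
import zlib


def reads_to_end(path, chunk_size=1 << 20):
    """Return True if the gzip/bgzip file at `path` decompresses cleanly.

    Returns False for a truncated or corrupt stream, and also for a
    missing or unreadable file (both surface as OSError).
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard in chunks so memory use stays flat.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

The cost is one full decompression pass per file, with no VCF parsing. A file that was cut exactly on a compression block boundary would still pass, so this is a partial check only.
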

@dwaggott
Author

Not sure what a reasonable solution is. However, I just ran into the same problem on a completely different cluster, using freebayes instead of gatk. A single truncated line in a single vcf kills the whole pipeline at the catvariants step.

My vote is for some vcf validation, maybe a fault-tolerant mode.

What about only running the vcf validation if catvariants errors out?
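
A rough sketch of that idea, to make the proposal concrete: run the concatenation as usual and only validate the inputs if it fails. This is an illustration, not bcbio's implementation; `concat_with_diagnosis`, `concat` and `is_valid` are hypothetical names, and the caller supplies both callables.

```python
import subprocess


def concat_with_diagnosis(paths, concat, is_valid):
    """Run concat(paths); validate the inputs only if it fails.

    Returns [] on success, or the list of inputs that fail is_valid()
    when concat raises CalledProcessError. If every input validates,
    the original error is re-raised, since validation did not explain it.
    """
    try:
        concat(paths)
    except subprocess.CalledProcessError:
        bad = [path for path in paths if not is_valid(path)]
        if not bad:
            raise
        return bad
    return []
```

In the normal case this adds no overhead, because `is_valid` is never called unless the concatenation step has already failed. It only reports the offending files; regenerating them is a separate step.
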

@chapmanb
Member

chapmanb commented Jul 1, 2015

Daryl;
Sorry about the problems on other systems as well. I'm not exactly sure what is going on, and will try to reproduce it and see what we can come up with for checking. Unfortunately, by the time we get to the catvariants step it's going to be difficult to unwind and go back to the previous processes. Would you be able to pass along one of the problem files (off list is fine) so I can see if I can come up with a cheap way to validate them? Thanks much.

chapmanb reopened this Jul 1, 2015

@dwaggott
Author

dwaggott commented Jul 1, 2015

Thanks for re-opening, I'll rerun everything to see what pans out. I deleted all the problem vcfs so can't send much for diagnostics.

What do you think: if I delete the vcf that was problematic in *-joint-files.list and restart the pipeline, will the pipeline understand? Or do I need to do more to roll the pipeline back to one of the transactional points?

Thanks!!!

@chapmanb
Member

chapmanb commented Jul 1, 2015

Daryl;
If you delete the files and restart, the pipeline will pick up right where it left off. You do want to remove checkpoints_parallel/full.done so it'll re-parallelize the variant calling, but otherwise you shouldn't need to change anything. If you get any additional details or examples, I'm happy to explore more.
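
The cleanup described above could be scripted roughly as follows. This is a sketch under two assumptions that the thread does not confirm: that `checkpoints_parallel/full.done` is a regular file, and that it sits directly under the work directory. `reset_for_restart` is a hypothetical helper.

```python
import os


def reset_for_restart(work_dir, bad_vcfs):
    """Delete the problem VCFs and the parallel checkpoint marker.

    Returns the paths that were actually removed. Paths that do not
    exist are skipped rather than treated as errors.
    """
    # Assumption: the marker lives at <work_dir>/checkpoints_parallel/full.done.
    marker = os.path.join(work_dir, "checkpoints_parallel", "full.done")
    removed = []
    for path in list(bad_vcfs) + [marker]:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

After this, rerunning the same pipeline command should regenerate the deleted files, per the advice above.
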

@roryk
Collaborator

roryk commented Oct 5, 2015

Thanks Daryl, closing this out for now since it seems like it might have been a one-off issue, and the files that were failing were deleted so we can't do much forensics. Feel free to reopen it if I'm wrong or the problem is still occurring.

roryk closed this as completed Oct 5, 2015