Enable CSI index reading for bgzipped VCF files #6110

Open
tfenne opened this issue Aug 22, 2019 · 8 comments

Comments

@tfenne
Contributor

tfenne commented Aug 22, 2019

Feature request

Tool(s) or class(es) involved

Any tools that read VCF, but specifically GenotypeGVCFs

Description

I'm working with genomes whose chromosomes are too long for both the BAI and tabix index formats. For BAMs, I work around the problem by disabling on-the-fly index generation in Picard/GATK-based tools and then running samtools index --csi to generate a CSI index, which GATK will happily use.

Then I ran into the exact same problem with VCFs. With bgzipped VCFs I have to disable index creation in GATK, because it fails when it hits a feature at a position higher than 512 * 2^20. It is then possible to generate a CSI index using (surprisingly) tabix, but I can't find a way to get GATK to detect and use a CSI index for a bgzipped VCF. I think almost everything that is needed is already there in HTSJDK, and it's just a case of auto-detecting the .csi index.
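As a rough illustration of the indexing step described above, the sketch below only builds the two command lines and does not run them. The function name and file names are hypothetical; `samtools index -c` and `tabix --csi -p vcf` are the CSI options as documented in the samtools and tabix manuals, but check them against the versions you have installed.

```python
import shlex


def csi_index_commands(bam_path, vcf_gz_path):
    """Return argv lists that would build CSI indexes (nothing is executed).

    Hypothetical helper for illustration only; not part of GATK or HTSJDK.
    """
    # samtools: -c asks for a CSI index instead of the default BAI.
    bam_cmd = ["samtools", "index", "-c", bam_path]
    # tabix: --csi asks for a CSI index instead of .tbi; -p vcf selects the VCF preset.
    vcf_cmd = ["tabix", "--csi", "-p", "vcf", vcf_gz_path]
    return bam_cmd, vcf_cmd


if __name__ == "__main__":
    # Placeholder file names, printed in a copy-pasteable shell form.
    for cmd in csi_index_commands("sample.bam", "sample.g.vcf.gz"):
        print(shlex.join(cmd))
```

This assumes index creation was already switched off in the GATK/Picard step that wrote the files, as described above.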

I'm working around this for now by using uncompressed VCFs, since the .idx format doesn't have the same limit. But it's not great having uncompressed VCFs.

Bonus: it would be nice if GATK defaulted index creation for bgzipped VCFs to off whenever any sequence in the sequence dictionary is longer than tabix supports.
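A minimal sketch of the check this bonus request implies, assuming the contig lengths have already been read from the sequence dictionary into a plain dict. The function names and the example contigs are hypothetical, and the limit is the 512 * 2^20 figure quoted above rather than a value taken from GATK or HTSJDK code.

```python
# Largest coordinate tabix/BAI can address, per the figure quoted in this issue.
TABIX_MAX_COORD = 512 * 2**20  # 536,870,912


def contigs_too_long_for_tabix(contig_lengths):
    """Return the names of contigs whose length exceeds the tabix limit, sorted."""
    return sorted(
        name for name, length in contig_lengths.items() if length > TABIX_MAX_COORD
    )


def should_create_tabix_index(contig_lengths):
    """Default on-the-fly indexing to off if any contig is too long."""
    return not contigs_too_long_for_tabix(contig_lengths)


if __name__ == "__main__":
    # Illustrative lengths only; "chrBig" is a made-up contig.
    lengths = {"chr1": 248_956_422, "chrBig": 600_000_000}
    print(contigs_too_long_for_tabix(lengths))
    print(should_create_tabix_index(lengths))
```

Checking the dictionary up front would let a tool skip indexing before it writes any records, instead of failing partway through a long contig.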

@lbergelson
Member

@tfenne I don't think htsjdk supports CSI for vcf. I'm pretty sure it was only wired up for bam.

Defaulting to false when the references are too big is a good idea.

@frabanal

frabanal commented May 5, 2020

Having the same issue here, and I follow essentially the same steps with .csi and .idx indexing for bam and .g.vcf files, respectively.
@tfenne or anyone else, have you figured out a workaround for working with compressed VCF files that are properly indexed for large chromosomes (> 512 * 2^20)?

I would have to feed ~1000 uncompressed *.g.vcf files to GenomicsDBImport, and I simply don't have the disk space for that manoeuvre.

@shenweima

doi: 10.1093/gigascience/giab007

@ClayBirkett

This is still a problem. Is anyone working on it?

@gvarmaslu

I have the same problem with samtools indexing. To use the BAM with GATK I need a .bai index, because the .csi index is not supported in GATK!
Error:
samtools index ${filenm_root}.cutad.sort.bam
[E::hts_idx_push] Region 537233901..537233984 cannot be stored in a bai index. Try using a csi index with min_shift = 14, n_lvls >= 6
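The hint in that error message can be reproduced with a little arithmetic, assuming the CSI binning scheme in which an index with a given min_shift and n_lvls spans 2^(min_shift + 3 * n_lvls) positions. The helper names below are hypothetical, and this is a sketch of the calculation, not htslib's actual code.

```python
def csi_max_span(min_shift=14, n_lvls=5):
    """Coordinate span covered by a binning index with these parameters.

    min_shift=14 with n_lvls=5 corresponds to the fixed BAI/tabix scheme.
    """
    return 1 << (min_shift + 3 * n_lvls)


def min_csi_levels(max_pos, min_shift=14):
    """Smallest n_lvls whose span reaches max_pos for the given min_shift."""
    n_lvls = 0
    while csi_max_span(min_shift, n_lvls) < max_pos:
        n_lvls += 1
    return n_lvls


if __name__ == "__main__":
    print(csi_max_span())             # BAI/tabix span: 536870912, i.e. 512 * 2^20
    print(min_csi_levels(537233984))  # end of the region from the error message: 6
```

With min_shift = 14, five levels stop at 512 * 2^20, so the region ending at 537233984 needs a sixth level, which matches the "n_lvls >= 6" suggestion in the message.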

@shenweima

It is time to solve ...?

@Floating-Element

I have the same problem when I try BaseRecalibrator.
A USER ERROR has occurred: Can not read file://Users/....../file_name.vcf.tbi because no suitable codecs found

@matthdsm

We also have the same issue.
@nvnieuwk
