Enable CSI index reading for bgzipped VCF files #6110

tfenne · 2019-08-22T22:09:06Z

Feature request

Tool(s) or class(es) involved

Any tools that read VCF, but specifically GenotypeGVCFs

Description

I'm working with genomes whose chromosomes are too long for both the BAI and tabix index formats. For BAMs, I work around the problem by disabling on-the-fly index generation in Picard/GATK-based tools and then running samtools index --csi to generate a CSI index, which GATK will happily use.

Then I ran into the exact same problem with VCFs. With bgzipped VCFs I have to disable index creation in the GATK, because it fails when it hits a feature with a position higher than 512 * 2^20. It is then possible to generate a CSI index using (surprisingly) tabix, but I can't find a way to get the GATK to detect and use a CSI index for a bgzipped VCF. I think almost everything needed is already in HTSJDK, and that it's just a case of auto-detecting the .csi index.

I'm working around this for now by using uncompressed VCFs as the .idx format doesn't have the same limit. But it's not great having uncompressed VCFs.

Bonus: it would be nice if the GATK auto-defaulted index creation for bgzipped VCFs to off if any of the sequences in the sequence dictionary is longer than is supported by tabix.
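
The check suggested in the bonus could look roughly like the sketch below. It is not GATK or HTSJDK code: the function name and the example header are made up for illustration, and it assumes contig lengths are declared in `##contig` header lines. The limit constant is the 512 * 2^20 figure from the description.

```python
import re

# Largest coordinate a BAI or tabix index can address (512 * 2^20 = 2^29).
TABIX_MAX_COORD = 512 * 2**20

# Assumption: each contig is declared as ##contig=<ID=...,length=...>.
ID_RE = re.compile(r"ID=([^,>]+)")
LENGTH_RE = re.compile(r"length=(\d+)")


def contigs_too_long_for_tabix(header_lines):
    """Return {contig_id: length} for contigs longer than the tabix limit."""
    too_long = {}
    for line in header_lines:
        if not line.startswith("##contig="):
            continue
        id_match = ID_RE.search(line)
        length_match = LENGTH_RE.search(line)
        if id_match is None or length_match is None:
            continue  # contig line without a usable ID or length
        length = int(length_match.group(1))
        if length > TABIX_MAX_COORD:
            too_long[id_match.group(1)] = length
    return too_long


if __name__ == "__main__":
    # Hypothetical header: one ordinary contig and one oversized contig.
    example_header = [
        "##fileformat=VCFv4.2",
        "##contig=<ID=chr1,length=248956422>",
        "##contig=<ID=chr3H,length=600000000>",
    ]
    oversized = contigs_too_long_for_tabix(example_header)
    if oversized:
        print("Skip tabix index creation; oversized contigs:", oversized)
```

If the returned dictionary is non-empty, a tool could skip tabix index creation and leave CSI indexing to a separate step.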


lbergelson · 2019-08-23T19:01:49Z

@tfenne I don't think htsjdk supports CSI for vcf. I'm pretty sure it was only wired up for bam.

Defaulting to false when the references are too big is a good idea.

frabanal · 2020-05-05T16:16:21Z

Having the same issue here, and I follow essentially the same steps with .csi and .idx indexing for bam and .g.vcf files, respectively.
@tfenne or anyone else, have you figured out a workaround for working with compressed VCF files that are properly indexed for large chromosomes (> 512 * 2^20)?

I would have to feed ~1000 uncompressed *.g.vcf files to GenomicsDBImport, and I simply don't have the disk space for that manoeuvre.

shenweima · 2021-09-17T04:49:41Z

doi: 10.1093/gigascience/giab007

ClayBirkett · 2022-02-11T15:46:29Z

This is still a problem. Is anyone working on it?

gvarmaslu · 2022-02-22T09:25:09Z

I have the same problem with samtools indexing. To use the BAM with GATK I need a .bai index; the .csi index is not supported in GATK!
Error:
samtools index ${filenm_root}.cutad.sort.bam
[E::hts_idx_push] Region 537233901..537233984 cannot be stored in a bai index. Try using a csi index with min_shift = 14, n_lvls >= 6
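
The min_shift and n_lvls hint in that error message can be reproduced with a little arithmetic. The sketch below assumes the usual CSI binning scheme, in which an index covers coordinates up to 2^(min_shift + 3 * n_lvls). The function name is hypothetical and this is not code taken from samtools.

```python
def csi_levels_needed(max_coord, min_shift=14):
    """Smallest n_lvls such that 2**(min_shift + 3 * n_lvls) >= max_coord."""
    n_lvls = 0
    span = 1 << min_shift  # span of the smallest bin: 2**14 = 16 kb by default
    while max_coord > span:
        n_lvls += 1
        span <<= 3  # each extra level multiplies the covered span by 8
    return n_lvls


# BAI and tabix are fixed at min_shift=14 with 5 levels, i.e. 2**29 positions.
bai_limit = 512 * 2**20
print(csi_levels_needed(bai_limit))   # 5: still fits in a BAI index
print(csi_levels_needed(537233984))   # 6: the region end from the error above
```

Under that assumption, the region ending at 537233984 lies just past 2^29 = 536870912, which would explain why the message asks for n_lvls >= 6.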

shenweima · 2022-02-23T00:55:14Z

It is time to solve ...?

Floating-Element · 2022-06-17T05:52:20Z

I have the same problem when I try BaseRecalibrator.
A USER ERROR has occurred: Can not read file://Users/....../file_name.vcf.tbi because no suitable codecs found

matthdsm · 2023-02-17T10:30:08Z

We also have the same issue.
@nvnieuwk

bbimber mentioned this issue May 17, 2023

Using VariantQC with csi index BimberLab/DISCVRSeq#236

Closed
