gVCF output support #76
Is there a specification I can follow? On Wed, May 7, 2014 at 10:08 AM, Fengyuan Hu notifications@github.com wrote:
It seems only Illumina provides gVCF files. Thanks
Right now one can also use GATK to generate gVCFs. Support in On Friday, May 9, 2014, Fengyuan Hu notifications@github.com wrote:
gVCF output would be useful! //PL
Any plans with gVCFs? It would be great.
+1 for gVCF support, or some other way around the n+1 problem
+1 for gVCF support. Having this feature in GATK and not in freebayes is the last reason for us to use GATK; we would definitely prefer to use freebayes.
+1, this would be really beneficial!
gVCF support is something which @dillonl and I are aiming to work on, but it's not necessary for resolving the problems around calling samples independently and uniformly. _We already support a standard way to address the "n+1 problem", which is implemented in freebayes and glia._ _This method has been tested at scale, and the results have been validated by many orthogonal analyses as part of the integration process in 1000 Genomes Phase 3._ We are using the same approach as part of the ICGC Pan-Cancer project, a dataset that includes 2000 very deep germline/tumor pairs and requires a distributed calling design like this. It's straightforward, and crucially it ensures complete cross-sample sensitivity to alleles regardless of their size. This is something gVCF cannot guarantee, and a major reason why I have not pursued gVCF support as a solution to this problem. The approach is like this:
Note that you can use any set of callers to generate the union VCF. They do not have to produce a gVCF to be used this way. Some filtering is in order prior to sending the results into phasing. Also note that at no point do we have to pick up alignments from all the samples at the same time.

From time to time you can re-genotype your entire panel. This is relatively efficient from a variant-calling perspective, but costly for imputation and phasing. Otherwise, sensitivity to variants in new samples will improve only slowly, as the likelihood of having already seen a particular variant in previous samples increases. If you need coverage information, as for clinical purposes, you should either keep compressed versions of the alignments around or capture the per-base coverage of the alignments. I don't see any way around this.

Even if freebayes supported gVCF, I would still recommend this two-pass approach for resolving the n+1 pattern. Why would I not use gVCF? It cannot report putative SVs or indels that don't pass detection or assembly thresholds in the sample the VCF was generated from. This means there is no way to integrate alleles which aren't alignable with this approach, which seriously diminishes its utility. I don't see any way to resolve this besides maintaining a more complete representation of the alignments than is possible in gVCF.

It would be helpful for me to understand what gVCF resolves that can't be solved with this pattern. That might suggest ways we could improve our implementation beyond what is currently available in GATK and Platypus, which also supports it AFAIK.
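To make the union step of this two-pass approach concrete, here is a minimal, hypothetical Python sketch of collecting the union of alleles across per-sample VCFs. The function name and the simplified tab-split parsing are mine, purely for illustration; a real pipeline would use freebayes output and proper VCF tooling.

```python
def union_sites(vcf_texts):
    """Collect the union of (chrom, pos, ref, alt) records across
    several single-sample VCFs, ignoring headers and genotypes."""
    seen = set()
    for text in vcf_texts:
        for line in text.splitlines():
            if line.startswith("#") or not line.strip():
                continue  # skip headers and blank lines
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):  # split multi-allelic records
                seen.add((chrom, int(pos), ref, allele))
    return sorted(seen)

# Two toy single-sample VCFs sharing one site (chr1:100) with overlapping alleles
sample_a = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t250\t.\tT\tC"
sample_b = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG,C\nchr2\t50\t.\tG\tGA"

sites = union_sites([sample_a, sample_b])
```

The resulting site list corresponds to the union VCF that each sample is then re-genotyped against, e.g. with freebayes's --variant-input and --only-use-input-alleles options.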
Thanks for your mail and response on GitHub, very much appreciated. I'm happy to give my 2c, although I am very far from a gVCF expert. I have 30X whole-genome Illumina data for ~3000 Anopheles gambiae (mosquito) samples, and right now I'm actually just looking for any way to call across all of them that will compute in a reasonable time. Anopheles gambiae is particularly diverse - in an initial calling run across ~800 samples we discovered a SNP every 2bp (genome size is ~250Mbp) - so both the scale of the data and the level of diversity are challenging. This is an ongoing project, and more samples are coming, so I also need some way to deal with the n+1 problem. I initially investigated gVCF and GATK's new calling pipeline because I thought it would be more efficient (i.e., quicker) and was the only way to deal with n+1. However, HaplotypeCaller is a non-starter because it is achingly slow on our data, so I've been looking around. Several people have since suggested a two-pass approach, using FreeBayes or something else, and it is reassuring to hear you are taking this approach in your work. I can see how the two-pass approach as you've described it would work well for situations where you are genotyping lots of samples from a small number of populations and/or from populations where a fairly comprehensive catalogue of variation is already available. In those cases, the bias caused by not using all samples for variant discovery is manageable, similar to the bias you have to live with when using a chip. The main project I'm working on is a 1000-genomes-style project for Anopheles gambiae.
Relatively little is known about genome variation and population structure in this species, although we do know that there are multiple factors contributing to population differentiation, and the population genetics are complicated by incomplete reproductive isolation between sibling species, large chromosomal inversions, strong recent and ongoing selective sweeps driven by insecticide pressure, and highly seasonal variations in population size and composition. Basically, it's a complicated and dynamic picture, very different from human populations. Given this, I don't think we can afford to bias ourselves by limiting variant discovery to a subset of samples, although this is open to discussion. The two-pass approach is probably a better option than what we have now, because we could process each sample independently and avoid calling across multiple BAMs, but I think we would have to re-genotype the entire panel every time we add mosquitoes collected from a different time or place, which is basically every time we get new samples, and that would still mean a significant amount of compute.

My crude understanding was that a gVCF stores the reference likelihoods for all non-variant positions for a single sample, binned over contiguous regions without evidence for variation in that sample. I don't understand how these reference likelihoods are calculated or exactly what assumptions are involved, but I understood this to mean that a sample only needs to be called once, period. When new samples are added and new variants and/or alleles are discovered, you can somehow genotype previous samples at the new loci because you have an approximate reference likelihood. Basically, in practice I thought this means you can go back and either call a ref/ref genotype or a missing genotype at any position. Calling across the gVCFs then becomes the headache, so some of the scalability issues remain, but overall it's more efficient and closer to n+1. Does that sound vaguely right?
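The binning of non-variant positions described above can be sketched in a few lines of Python. This is an illustrative simplification, not GATK's actual algorithm: contiguous positions whose genotype quality (GQ) falls in the same band share one reference block, tracked as (start, end, min_GQ); the band thresholds here are invented for the example.

```python
def reference_blocks(gqs, start=1, bands=(0, 20, 60)):
    """Group per-base genotype qualities into (start, end, min_gq) blocks,
    opening a new block whenever the GQ crosses a band boundary."""
    def band(gq):
        b = 0
        for threshold in bands:
            if gq >= threshold:
                b = threshold
        return b

    blocks = []
    for offset, gq in enumerate(gqs):
        pos = start + offset
        if blocks and band(gq) == band(blocks[-1][2]):
            # same band as the open block: extend it and track the minimum GQ
            s, _e, mn = blocks[-1]
            blocks[-1] = (s, pos, min(mn, gq))
        else:
            blocks.append((pos, pos, gq))
    return blocks

# Positions 1-3 sit in the >=60 band, 4-5 in the <20 band, 6 in the 20-59 band
blocks = reference_blocks([99, 85, 70, 15, 10, 45])
```

In a real gVCF each block would become one record carrying an END coordinate and a summarized reference likelihood, which is what makes later joint genotyping at new loci possible.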
I don't get your point about gVCF not being able to report putative SVs or indels that don't pass detection or assembly thresholds; could you elaborate a bit? One last question: currently I rely a lot on various variant-level annotations, such as DP, MQ, and QD, for variant filtering. I assume step 3 gives you one VCF per sample, so if I want to construct a single square VCF I have to combine these annotations somehow. How do you handle this? Thanks again,
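On the annotation-combining question raised above, one common convention is that DP (read depth) is additive across samples, while MQ (a root-mean-square mapping quality) has to be recombined as a depth-weighted RMS. The following sketch is a hypothetical illustration of that convention, not a freebayes feature; the dict-based interface is mine.

```python
import math

def combine_info(per_sample):
    """Combine per-sample INFO annotations into site-level values.
    per_sample: list of dicts with 'DP' (read depth) and 'MQ' (RMS MQ)."""
    total_dp = sum(s["DP"] for s in per_sample)
    # each sample contributes MQ^2 * DP to the pooled sum of squared MQs
    sq_sum = sum(s["MQ"] ** 2 * s["DP"] for s in per_sample)
    mq = math.sqrt(sq_sum / total_dp) if total_dp else 0.0
    return {"DP": total_dp, "MQ": round(mq, 2)}

# A high-depth, high-MQ sample pooled with a low-depth, low-MQ one
combined = combine_info([{"DP": 30, "MQ": 60.0}, {"DP": 10, "MQ": 20.0}])
```

Ratio-style annotations such as QD generally cannot be recombined exactly from per-sample values and would need to be recomputed during the joint genotyping pass.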
I've added basic support for gVCF. It's really not sufficient for merging yet, but I'll be working on it to get it to the point that the output gVCFs can be merged.

Judging by work that @pd3 has been doing on gVCF merging in samtools, gVCF works fine for SNPs, but any time there is ambiguity in allele representation it is effectively impossible to produce a correct representation of a population from only gVCFs. For instance, any indel-containing locus with more than a few different indels is likely to exhibit this kind of problem. As most of the errors that callers make come from such regions, I'd imagine the use of gVCF would only complicate things. I don't know if he's ready to share information about this, but it's really impressive.

That all said, there have been a lot of requests for gVCF, and as of recently I'm even funded to support it! (Details on that to come!) Also, I think it's really important to know whether a region is covered and, if so, at what depth; otherwise we have to go back to the input alignments.

So, gVCF output is now part of freebayes. I'll close this issue, and let's describe improvements to the current format in other issues.
Does freebayes recommend any specific tool for converting .gvcf to .vcf files?
Note that freebayes does have rudimentary support for gvcf. Use --gvcf and On Thu, Mar 17, 2016, 23:11 tantrev notifications@github.com wrote:
I guess what I'm trying to say is that bcftools has a (buggy) command "gvcf2vcf" that's supposed to convert .gvcf files to .vcf files. I was just poking around and couldn't see any obvious way to do the same thing with vcflib.
No; it may be easiest to write a short script to filter out the lines. On Thu, Mar 17, 2016, 23:19 tantrev notifications@github.com wrote:
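Such a filtering script can be very small. Here is a hedged Python sketch, assuming reference blocks are marked by a symbolic ALT allele such as <*> (conventions differ between callers; GATK, for example, uses <NON_REF>, so check what your gVCF actually contains):

```python
def gvcf_to_vcf(lines):
    """Drop gVCF reference-block records, keeping headers and variant lines."""
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)  # keep all header lines
            continue
        alt = line.split("\t")[4]
        # drop pure reference blocks; keep records with at least one concrete ALT
        if all(a in ("<*>", "<NON_REF>", ".") for a in alt.split(",")):
            continue
        out.append(line)
    return out

# Toy gVCF: one reference block, one SNP, one SNP with a trailing symbolic allele
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\t<*>\t0\t.\tEND=199",
    "chr1\t200\t.\tA\tG\t50\t.\t.",
    "chr1\t300\t.\tT\tC,<*>\t40\t.\t.",
]
converted = gvcf_to_vcf(lines)
```

A production version would also rewrite ALT fields like "C,<*>" to strip the symbolic allele and renumber the genotype indices accordingly, which is where tools like bcftools' gvcf2vcf earn their keep.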
Sorry, this may not be the right place to raise these questions; however, I need some clarification related to this issue. I made a brief summary of the way the "n+1 problem" is handled in the freebayes and glia process, based on @ekg's comment from Nov 24, 2014 and @tantrev's last comment. Steps (without dealing with phased haplotypes):
My questions are:
Many thanks in advance for helping clarify this!
Any plans on this? Thanks