
gVCF output support #76

Closed
boboppie opened this issue May 7, 2014 · 16 comments

@boboppie

boboppie commented May 7, 2014

Any plan on this? Thanks

@ekg
Collaborator

ekg commented May 7, 2014

Is there a specification I can follow?


@boboppie
Author

boboppie commented May 9, 2014

It seems only Illumina provides gVCF files:
https://sites.google.com/site/gvcftools/home/about-gvcf
https://github.com/ctsa/gvcftools

Thanks

@dakl

dakl commented May 9, 2014

Right now one can also use GATK to generate gVCFs. Support in freebayes would be great.


@parlar

parlar commented May 12, 2014

gVCF output would be useful!

//PL

@gpcr

gpcr commented Sep 24, 2014

Any plans for gVCFs? It would be great.

@alimanfoo

+1 for gVCF support, or some other way around the n+1 problem

@stawinski

+1 for gVCF support. Having this feature in GATK and not in FreeBayes is the last reason for us to use GATK, when we would much prefer to use FreeBayes.

@drmjc

drmjc commented Nov 24, 2014

+1, this would be really beneficial!

@ekg
Collaborator

ekg commented Nov 24, 2014

@alimanfoo @stawinski

gVCF support is something which @dillonl and I are aiming to work on, but it's not necessary to resolve the problems around calling samples independently and uniformly.

_We already support a standard way to address the "n+1 problem" which is implemented in freebayes and glia._

_This method has been tested at scale, and the results have been validated by many orthogonal analyses as part of the integration process in the 1000 Genomes Phase 3._

We are using the same approach as part of the ICGC Pan-Cancer project. This dataset includes 2000 very-deep germline and tumor pairs, and requires a distributed calling design like this.

It's straightforward and crucially it ensures complete cross-sample sensitivity to alleles regardless of size. This is something which gVCF cannot guarantee, and a major reason why I have not pursued gVCF support as a solution to this problem.

The approach is like this:

  1. call variants in your samples individually, and record a whole-genome coverage map
  2. create a union VCF of your variants
  3. call each sample individually, realigning the reads to the variants in the union VCF with glia, and calling the output with freebayes (which is also fed the union VCF as --variant-input).
  4. as new samples are added, go back to step 2, and repeat step 3 with the new samples
  5. when phased haplotypes are required for the entire population, take the genotype likelihoods from step 4 and put them into a phasing and imputation framework such as Beagle and/or Shapeit2.
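
The steps above (minus phasing) might be sketched roughly as follows. The tool names come from this thread, but the sample names, reference path, and the exact glia invocation are placeholders and assumptions; treat this as an outline, not a recipe, and check each tool's `--help` for the flags your versions actually support.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Bail out gracefully when the tools aren't on PATH (this is only a sketch).
for tool in freebayes glia vcfcombine samtools; do
  command -v "$tool" >/dev/null 2>&1 || { echo "$tool not found; skipping"; exit 0; }
done

REF=ref.fa                    # reference FASTA (placeholder)
SAMPLES=(sampleA sampleB)     # placeholder sample names; BAMs are $name.bam

# Step 1: call each sample independently, keeping monomorphic records so a
# whole-genome coverage map is retained alongside the calls.
for s in "${SAMPLES[@]}"; do
  freebayes -f "$REF" --report-monomorphic "$s.bam" > "$s.vcf"
done

# Step 2: merge the per-sample calls into a single union VCF.
vcfcombine "${SAMPLES[@]/%/.vcf}" > union.vcf

# Step 3: realign each sample's reads against the union alleles with glia,
# then re-call with freebayes, feeding the union VCF back in as input alleles.
# (The glia flags are elided here; its realignment parameters are version
# specific.)
for s in "${SAMPLES[@]}"; do
  samtools view -h "$s.bam" \
    | glia -f "$REF" -v union.vcf \
    | samtools view -b - \
    | freebayes -f "$REF" --stdin --variant-input union.vcf > "new_$s.vcf"
done

# Step 4 (the n+1 case): when a new sample arrives, rebuild union.vcf from
# the new_*.vcf files plus the new sample's calls, and rerun step 3 for it.
```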

Note that you can use any set of callers to generate the union VCF. They do not have to produce a gVCF to be used this way. Some filtering is in order prior to sending the results into phasing. Also note that at no point do we have to pick up alignments from all the samples at the same time.

From time to time you can re-genotype your entire panel. This is relatively efficient from a variant calling perspective, but is costly for imputation and phasing. Otherwise, there will be slow improvement in sensitivity to variants in new samples as the likelihood of seeing a particular variant in previous ones increases.

If you need coverage information, as for clinical purposes, you should either be keeping compressed versions of the alignments around or capturing the per-base coverage of the alignments. I don't see any way around this.

Even if freebayes supported gVCF, I would still recommend this two-pass approach for resolving the n+1 pattern. Why would I not use gVCF? It cannot report putative SVs or indels that don't pass detection or assembly thresholds in the sample the VCF was generated from. This means that there is no way to integrate alleles which aren't alignable with the approach, which seriously diminishes its utility. I don't see any way to resolve this besides maintaining a more complete representation of the alignments than is possible in gVCF.

It would be helpful for me to understand what gVCF resolves that can't be solved with this pattern. This might suggest ways we could improve our implementation of it beyond what is currently available in the GATK and Platypus, which also supports it AFAIK.

@alimanfoo

@ekg

Thanks for your mail and response on GitHub, very much appreciated. I'm happy to give my 2c, although I am very far from a gVCF expert. I have 30X whole-genome Illumina data for ~3000 Anopheles gambiae (mosquito) samples, and right now I'm actually just looking for any way to call across all of them that will compute in a reasonable time. Anopheles gambiae is particularly diverse - in an initial calling run across ~800 samples we discovered a SNP every 2bp (genome size is ~250Mbp) - so both the scale of the data and the level of diversity are challenging. This is an ongoing project, and more samples are coming, so I also need some way to deal with the n+1 problem.

I initially investigated gVCF and GATK's new calling pipeline because I thought it would be more efficient (i.e., quicker) and was the only way to deal with n+1. However, HaplotypeCaller is a non-starter because it is achingly slow on our data, so I've been looking around. Several people have since suggested a two-pass approach, using FreeBayes or something else, and it is reassuring to hear you are taking this approach in your work.

I can see how the two-pass approach as you've described it would work well for situations where you are genotyping lots of samples from a small number of populations and/or from populations where a fairly comprehensive catalogue of variation is already available. In those cases, the bias caused by not using all samples for variant discovery is manageable, similar to the bias you have to live with when using a chip.

The main project I'm working on is a 1000 genomes style project for Anopheles gambiae. Relatively little is known about genome variation and population structure in this species, although we do know that there are multiple factors contributing to population differentiation, and the population genetics are complicated by incomplete reproductive isolation between sibling species, large chromosomal inversions, strong recent and ongoing selective sweeps driven by insecticide pressure, and highly seasonal variations in population size and composition. Basically, it's a complicated and dynamic picture, very different from human populations. Given this, I don't think we can afford to bias ourselves by limiting variant discovery to a subset of samples, although this is open to discussion. The two-pass approach is probably a better option than what we have now, because we could process each sample independently and avoid calling across multiple BAMs, but I think we would have to re-genotype the entire panel every time we add mosquitoes collected from a different time or place, which is basically every time we get new samples, and that would still mean a significant amount of compute.

My crude understanding was that a gVCF stores the reference likelihoods for all non-variant positions for a single sample, binned over contiguous regions without evidence for variation in that sample. I don't understand how these reference likelihoods are calculated or exactly what assumptions are involved, but I understood this to mean that a sample only needs to be called once, period. When new samples are added and new variants and/or alleles are discovered, you can somehow genotype previous samples at the new loci because you have an approximate reference likelihood. Basically, in practice I thought this means you can go back and either call a ref/ref genotype or a missing genotype at any position. Calling across the gVCFs then becomes the headache, so some of the scalability issues remain, but overall it's more efficient and closer to n+1. Does that sound vaguely right?

I don't get your point about gVCF not being able to report putative SVs or indels that don't pass detection or assembly thresholds, could you elaborate a bit?

One last question, currently I rely a lot on various variant-level annotations such as DP, MQ, QD, etc., for variant filtering. I assume step 3 gives you one VCF per sample, so if I want to construct a single square VCF I have to combine these annotations somehow. How do you handle this?

Thanks again,
Alistair

@ekg
Collaborator

ekg commented Nov 14, 2015

I've added basic support for gVCF. It's really not sufficient for merging yet, but I'll be working on it to get it to the point that the output gVCFs can be merged.

Judging by work that @pd3 has been doing on gVCF merging in samtools, gVCF works fine for SNPs, but any time there is ambiguity in allele representation it is effectively impossible to produce a correct representation of a population from only gVCFs. For instance, any indel-containing locus with more than a few different indels is likely to exhibit this kind of problem. As most of the errors that callers make are coming from such regions, I'd imagine the use of gVCF would only complicate things. I don't know if he's ready to share information about this, but it's really impressive.

That all said, there have been a lot of requests for gVCF, and recently I've even been funded to support it! (Details on that to come.) Also, it's critical to know whether a region is covered and, if so, at what depth; otherwise we have to go back to the input alignments. So, gVCF output is now part of freebayes. I'll close this issue, and let's describe improvements to the current format in other issues.

@ekg ekg closed this as completed Nov 14, 2015
@tantrev

tantrev commented Mar 17, 2016

Does Freebayes recommend any specific tool for converting .gvcf to .vcf files?

@ekg
Collaborator

ekg commented Mar 17, 2016

Note that freebayes does have rudimentary support for gVCF. Use --gvcf and --gvcf-chunk-size 1000, for instance.


@tantrev

tantrev commented Mar 17, 2016

I guess what I'm trying to say is that bcftools has a (buggy) command "gvcf2vcf" that's supposed to convert .gvcf files to .vcf files. I was just poking around and couldn't see any obvious way to do the same thing with vcflib?

@ekg
Collaborator

ekg commented Mar 17, 2016

No, it may be easiest to write a short script to filter out the lines. Unfortunately I don't have anything handy.

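
Such a "short script to filter out the lines" might look like the sketch below: keep header lines and records with a concrete ALT allele, drop reference-block records. The symbolic ALT values it strips (`<*>`, `<NON_REF>`) are assumptions about how reference blocks are marked; check what your gVCF actually emits and adjust.

```python
import io

# ALT values assumed to mark non-variant reference blocks (adjust as needed)
REF_BLOCK_ALTS = {"<*>", "<NON_REF>", "."}

def is_variant_line(line: str) -> bool:
    """True for header lines and records whose ALT is a concrete allele."""
    if line.startswith("#"):
        return True
    fields = line.rstrip("\n").split("\t")
    # A VCF record has at least 8 fixed fields; ALT is the fifth (index 4).
    return len(fields) >= 8 and fields[4] not in REF_BLOCK_ALTS

def gvcf_to_vcf(in_stream, out_stream):
    """Copy only headers and variant records from in_stream to out_stream."""
    for line in in_stream:
        if is_variant_line(line):
            out_stream.write(line)

# Usage on an in-memory example (a file or stdin/stdout works the same way):
src = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "chr1\t100\t.\tG\t<*>\t0\t.\tEND=199\n"   # reference block: dropped
    "chr1\t200\t.\tA\tT\t50\tPASS\t.\n"       # real variant: kept
)
dst = io.StringIO()
gvcf_to_vcf(src, dst)
```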

@elinwang

elinwang commented Jun 6, 2016

Sorry, this may not be the right place to raise these questions, but I need some clarification related to this issue.

I made a brief summary of the way to handle the "n+1 problem" with freebayes and glia, based on @ekg 's comment on Nov 24, 2014 and @tantrev 's last comment.

Steps (without dealing with phased haplotypes):

  1. For each sample i, run freebayes --report-monomorphic to output i.vcf, where i = 1..N. This step obtains the whole-genome coverage maps; the resulting VCFs can be very large.

  2. run vcfcombine 1.vcf 2.vcf, ... N.vcf to get union.vcf (a giant vcf)

  3. run glia -v union.vcf | freebayes -@union.vcf to output new_i.vcf, where i = 1..N.

  4. as a new sample N+1 is added,

    run vcfcombine new_1.vcf new_2.vcf ... new_N.vcf to get newunion.vcf

    run glia -v newunion.vcf | freebayes -@newunion.vcf on sample N+1

My questions are:

Are the steps described above correct?
Do I really use the big union.vcf file as a known input?

Many thanks in advance for helping clarify this!
