Support for gVCF merging and genotyping (e.g. CombineGVCFs and GenotypeGVCFs) #1312

Closed
NeillGibson opened this Issue Dec 12, 2016 · 3 comments

NeillGibson commented Dec 12, 2016

Hi,

Are you planning to support gVCF merging and genotyping on Spark / ADAM?

As far as I know, the only way to variant call 100K samples is through creating per-sample gVCF files, followed by gVCF merging and genotyping.

The most well known / production ready implementation of this is from the Broad in GATK:

CombineGVCFs
https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineGVCFs.php

GenotypeGVCFs
https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php

For variant calling of the 100K+ samples in ExAC/gnomAD, the first merge step was replaced with GenomicsDB from Intel. As I understand it, GenomicsDB efficiently stores per-sample gVCF tracks and can then efficiently stream the merged VCF into GenotypeGVCFs.

Broad/Intel GenomicsDB
https://github.com/Intel-HLS/GenomicsDB
https://vimeo.com/194823486/506da42daf (from minute 19)

GenomicsDB is based on Intel TileDB
http://istc-bigdata.org/tiledb/index.html

Something similar to CombineGVCFs/GenotypeGVCFs/GenomicsDB is being developed by DNAnexus; it also supports on-demand joint genotyping from FreeBayes gVCFs:

GLnexus
https://github.com/dnanexus-rnd/GLnexus

Are you also planning scalable gVCF storage and on-demand gVCF merging and joint genotyping on top of Spark / ADAM?

Thank you.

fnothaft added a commit to fnothaft/avocado that referenced this issue Oct 15, 2017

Add code for squaring off genotypes that include reference models.
Resolves bigdatagenomics/adam#1312. Enables gVCF style data to be used with the
joint variant caller by extracting the sites where a variant was called as a
genotype in one of the gVCFs. These variant sites are then joined back against
the original genotypes. If the genotyped allele was not present and a reference
model block is present, the allele is extracted out from the reference model.
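
To make the squaring-off step in the commit message above more concrete, here is a minimal sketch in plain Python. It is not avocado's code: avocado runs on Spark over ADAM genotype records, while this toy version uses in-memory lists. The `VariantCall` and `RefBlock` types, the `square_off` function, and the choice to emit `0/0` for a covered site and `./.` for an uncovered one are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VariantCall:
    """A variant genotype called in one sample's gVCF (hypothetical type)."""
    sample: str
    contig: str
    pos: int
    ref: str
    alt: str
    genotype: str  # e.g. "0/1"


@dataclass(frozen=True)
class RefBlock:
    """A reference model block from one sample's gVCF (hypothetical type)."""
    sample: str
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


Site = Tuple[str, int, str, str]


def square_off(calls: List[VariantCall],
               blocks: List[RefBlock]) -> Dict[Site, Dict[str, str]]:
    # Step 1: collect every site where a variant was called in any sample.
    sites = sorted({(c.contig, c.pos, c.ref, c.alt) for c in calls})
    samples = sorted({c.sample for c in calls} | {b.sample for b in blocks})
    called = {(c.sample, c.contig, c.pos, c.ref, c.alt): c.genotype
              for c in calls}

    # Step 2: join the sites back against each sample's original records.
    matrix: Dict[Site, Dict[str, str]] = {}
    for contig, pos, ref, alt in sites:
        row: Dict[str, str] = {}
        for sample in samples:
            genotype = called.get((sample, contig, pos, ref, alt))
            if genotype is None:
                # Step 3: the allele was not genotyped in this sample, so
                # fall back to a reference block covering the site, if any.
                # Assumption: covered means hom-ref, uncovered means no call.
                covered = any(b.sample == sample and b.contig == contig
                              and b.start <= pos <= b.end for b in blocks)
                genotype = "0/0" if covered else "./."
            row[sample] = genotype
        matrix[(contig, pos, ref, alt)] = row
    return matrix


calls = [VariantCall("s1", "chr1", 100, "A", "G", "0/1")]
blocks = [RefBlock("s2", "chr1", 50, 150), RefBlock("s3", "chr1", 200, 300)]
matrix = square_off(calls, blocks)
```

Here `s1` keeps its called genotype, `s2` is filled in from its reference block, and `s3` has no record covering position 100, so it is left as a no-call. A real joint caller would carry genotype likelihoods out of the reference block rather than a hard `0/0`.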

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

ggittu commented Oct 3, 2018

@heuermh @fnothaft Is there any document that says how to use CombineGVCFs and GenotypeGVCFs in ADAM (spark way)?

heuermh commented Oct 3, 2018

ggittu commented Oct 3, 2018

@heuermh Got it. So I will run something like

avocado-submit jointer -from_gvcf /vcf/*.gvcf /output

Here I think the jointer is for CombineGVCFs. What is the one I can use for GenotypeGVCFs?
