Support for gVCF merging and genotyping (e.g. CombineGVCFs and GenotypeGVCFs) #1312
Are you planning to support gVCF merging and genotyping on Spark / Adam?
As far as I know, the only way to variant call 100K samples is through creating gVCF files per sample, followed by gVCF merging and genotyping.
The best-known, production-ready implementation of this is from the Broad in GATK:
For variant calling of the 100K+ samples in ExAC/gnomAD, the first merge step was replaced with GenomicsDB from Intel. As I understand it, GenomicsDB efficiently stores per-sample gVCF tracks and can then efficiently stream merged VCF into GenotypeGVCFs.
GenomicsDB is based on Intel TileDB.
Something similar to CombineGVCFs/GenotypeGVCFs/GenomicsDB is being developed by DNAnexus, and it also supports on-demand joint genotyping from Freebayes gVCF:
Are you also planning scalable gVCF storage, on-demand gVCF merging, and joint genotyping on top of Spark / Adam?
Resolves bigdatagenomics/adam#1312. Enables gVCF-style data to be used with the joint variant caller by extracting the sites where a variant was called as a genotype in at least one of the gVCFs. These variant sites are then joined back against the original genotypes. If the genotyped allele is not present for a sample and a reference model block is present at that site, the allele is extracted from the reference model.
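To make the join described above concrete, here is a minimal in-memory sketch in plain Python. It is not the code from the pull request and does not use ADAM or Spark APIs. The `GvcfRecord` type, its fields, and the functions `variant_sites`, `genotype_at`, and `join_sites` are all hypothetical names chosen for illustration. It also assumes that a reference model block covering a site can be read as a homozygous-reference call, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class GvcfRecord:
    """Hypothetical, simplified gVCF record (not an ADAM type)."""
    sample: str
    contig: str
    start: int                 # 1-based, inclusive
    end: int                   # 1-based, inclusive; spans many bases for a reference block
    ref: str
    alt: Optional[str]         # None marks a reference model block
    genotype: Tuple[int, int]  # allele indices, 0 = reference


Site = Tuple[str, int, str, str]  # (contig, position, ref allele, alt allele)


def variant_sites(records: List[GvcfRecord]) -> List[Site]:
    """Collect every site where at least one sample carries a non-reference allele."""
    return sorted({
        (r.contig, r.start, r.ref, r.alt)
        for r in records
        if r.alt is not None and any(allele > 0 for allele in r.genotype)
    })


def genotype_at(site: Site, sample_records: List[GvcfRecord]) -> Optional[Tuple[int, int]]:
    """Look up one sample's genotype at a site, falling back to a reference block."""
    contig, pos, _ref, _alt = site
    # First preference: the sample was genotyped for exactly this allele.
    for r in sample_records:
        if r.alt is not None and (r.contig, r.start, r.ref, r.alt) == site:
            return r.genotype
    # Fallback (assumption): a covering reference block implies homozygous reference.
    for r in sample_records:
        if r.alt is None and r.contig == contig and r.start <= pos <= r.end:
            return (0, 0)
    # No call and no covering block: genotype is unknown.
    return None


def join_sites(records: List[GvcfRecord]) -> Dict[Site, Dict[str, Optional[Tuple[int, int]]]]:
    """Join the discovered variant sites back against every sample's records."""
    by_sample: Dict[str, List[GvcfRecord]] = {}
    for r in records:
        by_sample.setdefault(r.sample, []).append(r)
    return {
        site: {sample: genotype_at(site, recs) for sample, recs in by_sample.items()}
        for site in variant_sites(records)
    }
```

In a distributed setting the per-sample scan would be replaced by a region join between the site set and the genotype records. The sketch only shows the lookup order: an explicit call first, then the reference block, and otherwise a missing genotype.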