Parquet storage of VariantContext #1151
Within ADAM we can represent genotypes in a "variant-major" mode in a VariantContextRDD, where each row of the RDD is an array of Genotypes (loaded directly from a multi-sample VCF, for example).
We currently have no way to persist this to Parquet directly; instead we transpose it into a GenotypeRDD and save that, since we can write Genotype records to Parquet. We can of course reload the data as a GenotypeRDD and then reconvert it to a VariantContextRDD, but this requires a big groupBy and sort.
Would there be value in creating a VariantContext Avro object with an array of Genotypes, so that we can persist the equivalent of a multi-sample VCF to Parquet more directly? A sketch of what I have in mind is below.
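For illustration only, here is a minimal sketch of what such a nested record might look like and how it would persist to Parquet. The case classes (`VariantContextRecord` and the simplified `Variant`/`Genotype` shapes), the `NestedParquetSketch` object, and the output path are hypothetical stand-ins, not ADAM's actual Avro schema or API:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, simplified shapes -- not ADAM's actual Avro schema.
case class Variant(contigName: String, start: Long, end: Long,
                   referenceAllele: String, alternateAllele: String)
case class Genotype(sampleId: String, alleles: Seq[String])

// Proposed "variant-major" record: one row per site, carrying all sample
// genotypes at that site -- the analogue of one multi-sample VCF line.
case class VariantContextRecord(variant: Variant, genotypes: Seq[Genotype])

object NestedParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("variant-context-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val contexts = Seq(
      VariantContextRecord(
        Variant("chr1", 100L, 101L, "A", "T"),
        Seq(Genotype("NA12878", Seq("A", "T")),
            Genotype("NA12891", Seq("A", "A")))))

    // Parquet handles the nested genotype array natively, so one row per
    // variant persists without transposing to a flat GenotypeRDD first.
    contexts.toDS().write.parquet("/tmp/variant_contexts.parquet")
    spark.stop()
  }
}
```

Since Parquet supports nested repeated groups natively, the genotype array would round-trip without the flatten-then-groupBy detour described above.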
Thanks @fnothaft - for your further consideration, to give some more context: such a persisted "VariantContext" in Parquet is to me basically the persisted result of a "group by variant position, sort by variant position" over an RDD[Genotype], and not a different data structure, at least in terms of processing within Spark.
It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal, though right now Spark seems to have no knowledge of whether a Parquet file is sorted.
I'd imagine that going from a persisted Parquet representation of RDD[VariantContext] to RDD[Genotype] should be a cheap, local, non-shuffle operation, because you just need to unroll the arrays of Genotypes. Going the other way, from RDD[Genotype] to RDD[VariantContext], requires a shuffle and a sort, as in the sketch below.
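Reusing the hypothetical case classes from the sketch above, the asymmetry looks roughly like this; the function names are made up for illustration:

```scala
import org.apache.spark.rdd.RDD

// Unrolling is a narrow transformation: each nested row simply explodes
// into its per-sample genotypes, keyed by the variant. No shuffle.
def toGenotypes(ctxs: RDD[VariantContextRecord]): RDD[(Variant, Genotype)] =
  ctxs.flatMap(vc => vc.genotypes.map(g => (vc.variant, g)))

// Rebuilding needs groupByKey (a full shuffle) plus a sort by position
// (another wide stage). Even if the Parquet file was written sorted,
// Spark reads it back with no ordering guarantee, so the sort remains.
def toVariantContexts(gts: RDD[(Variant, Genotype)]): RDD[VariantContextRecord] =
  gts.groupByKey()
     .map { case (v, gs) => VariantContextRecord(v, gs.toSeq) }
     .sortBy(vc => (vc.variant.contigName, vc.variant.start))
```

The flatMap stays within a partition, whereas groupByKey moves every genotype across the network to its variant's key - which is exactly the big groupBy and sort mentioned above.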
My use case here is also a bit conflated with thinking about: #651 (comment)
(In my ideal world, ADAM formats would be as good as BAM/CRAM/VCF/BCF for both standalone and cluster usage. I wish we had a tabix-seek equivalent; if we do and I don't realize it, please let me know.)
Does the expanded use-case / scenario make this any more interesting, Frank?
I guess we'd need some demonstrated use case (like better performance on range queries) to show it's worth the complexity of adding - but do you see it as otherwise deleterious?
I agree with your general gist, but I've always grumbled about what an
I think we have a good sense for what the problems with the
I would prefer to fix these problems. I think the HBase work is a good solution to 3*, and a solution to 1 would be 90% of a solution to 2. I think 1 could be solved (or at least "worked around") with clever abuse of metadata; I've been mentally working through an approach to 1 for a while.
* I agree we'd need to solve said problem for Parquet as well. I don't think that'd be impossible, but I don't know how much work it'd involve.
I'll think more on the HBase and partitioning / bucketing stuff.