Parquet storage of VariantContext #1151

Closed
jpdna opened this Issue Sep 6, 2016 · 4 comments

Comments


jpdna commented Sep 6, 2016

Within ADAM we can represent Genotypes in a "variant major" mode in VariantContextRDD, where each row of the RDD is an array of Genotypes (directly loaded from a multi-sample VCF, for example).

We currently have no way to persist this directly to Parquet. Instead we transpose it to a GenotypeRDD and save that, since we can write Genotype records to Parquet. We can of course reload it into a GenotypeRDD and then reconvert to a VariantContextRDD, but this requires a big groupBy and sort.

Would there be value in creating a VariantContext Avro object with an array of Genotype, so that we can more directly persist to Parquet the equivalent of a multi-sample VCF?
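As a sketch of what such a record might look like, here is a hypothetical variant-major Avro schema with the per-sample genotypes nested as an array. The record and field names below are illustrative only, not ADAM's actual schema:

```python
import json

# Hypothetical Avro schema sketch for a variant-major record: one row per
# variant site, with the per-sample genotypes nested as an array. Field and
# record names are invented for illustration, not taken from bdg-formats.
variant_context_schema = json.loads("""
{
  "type": "record",
  "name": "VariantContext",
  "fields": [
    {"name": "contigName", "type": "string"},
    {"name": "start", "type": "long"},
    {"name": "end", "type": "long"},
    {"name": "genotypes", "type": {"type": "array", "items": "Genotype"}}
  ]
}
""")

# The nested array field is what makes the record "variant major".
print(variant_context_schema["fields"][-1]["type"]["type"])  # array
```

Because Parquet supports nested lists, a record shaped like this could be written as a single row per variant site, one row group after another.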

fnothaft commented Sep 6, 2016

Personally, I'd rather not. In my ideal world, we'd actually get rid of VariantContext for everything except converting to/from VCF. I think the VariantContext data structure preserves many of the problems inherent in VCF when working with "wide" cohorts.

jpdna commented Sep 6, 2016

Thanks @fnothaft - for your further consideration, some more context: such a persisted "VariantContext" in Parquet represents, to me, essentially the persisted result of a "group by variant position, sort by variant position" over an RDD[Genotype], and not a different data structure, at least in terms of processing within Spark.

It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal - though right now Spark has no knowledge of the fact that a Parquet file is sorted.

I'd imagine it should be pretty cheap, and a local non-shuffle operation, to go from a persisted Parquet representation of RDD[VariantContext] to RDD[Genotype], because you just need to unroll the arrays of Genotype. Going the other way, from RDD[Genotype] to RDD[VariantContext], requires a shuffle and sort.

My use case here is also a bit conflated with thinking about #651 (comment): enabling more efficient range-based queries that could quickly retrieve all the genotypes for a given variant, in a given genomic range, while avoiding a shuffle/sort. The HBase work is one approach to this - but I'd also like to see it work in Parquet, both to compare and because I think it might be more viable than HBase for smaller clusters, and for "stand-alone" usage of an ADAM file, more equivalent to BCF.

(In my ideal world, ADAM formats would be as good as BAM/CRAM/VCF/BCF for both stand-alone and cluster usage -- I wish we had a tabix-seek equivalent... if we do and I don't realize it, please let me know.)

Does the expanded use case/scenario make this any more interesting, Frank? What limitations or concerns would you see with such a Parquet (variant-major) genotype data object?

I guess we'd need some demonstrated use case (like better performance on the range queries) to show it's worth the complexity of adding - but do you see it as otherwise deleterious?
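The cheap-unroll vs. shuffle-to-regroup asymmetry can be illustrated with plain Python standing in for Spark RDDs; the record shapes here are invented for the example:

```python
from collections import defaultdict

# Toy variant-major rows: one row per position, genotypes nested as an array.
variant_major = [
    {"pos": 100, "genotypes": [("sampleA", "0/1"), ("sampleB", "1/1")]},
    {"pos": 200, "genotypes": [("sampleA", "0/0"), ("sampleB", "0/1")]},
]

# RDD[VariantContext] -> RDD[Genotype]: a local flatMap -- each nested row
# unrolls independently, so no shuffle is needed.
flat = [(row["pos"], sample, gt)
        for row in variant_major
        for sample, gt in row["genotypes"]]

# RDD[Genotype] -> RDD[VariantContext]: requires grouping by position, which
# in Spark means a shuffle (plus a sort, if ordering matters).
grouped = defaultdict(list)
for pos, sample, gt in flat:
    grouped[pos].append((sample, gt))
rebuilt = [{"pos": pos, "genotypes": gts} for pos, gts in sorted(grouped.items())]

assert rebuilt == variant_major
```

The forward direction touches each row once and locally; the reverse direction has to bring all genotypes for a position to the same place, which is exactly the expensive groupBy the comment above describes.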

fnothaft commented Sep 6, 2016

I agree with your general gist, but I've always grumbled about what an RDD[VariantContext] presents. Essentially, the RDD[VariantContext] is the Genotype embodiment of the groupBy "anti-pattern" in Spark. I put scare quotes around anti-pattern because it's not that groupBys are necessarily terrible, but they can be really, really terrible.

I think we have a good sense of what the problems with the RDD[Genotype] programming model are. Specifically:

1. "It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal - though right now Spark seems to have no knowledge as to the fact that a parquet file is sorted."

2. "I'd imagine it should be pretty cheap and a local non-shuffle operation to go from a persisted parquet representation of RDD[VariantContext] to RDD[Genotype] because you just need to unroll the arrays of [Genotype]. However going the other way - from RDD[Genotype]->RDD[VariantContext] is a shuffle and sort."

3. "My use case here is also a bit conflated with thinking about: #651 (comment) trying to enable more efficient range based queries, that could retrieve all the genotypes for a given variant, in a given genomic range, quickly while avoiding shuffle/sort."

I would prefer to fix these problems. I think the HBase work is a good solution to 3*, and a solution to 1 would be 90% of a solution to 2. I think 1 could be solved (or at least worked around) with clever abuse of metadata. I've been mentally working through an approach to 1 for a while.

* I agree we'd need to solve said problem for Parquet as well. I don't think that'd be impossible, but I don't know how much work it'd involve.
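One hypothetical shape for the "clever abuse of metadata" idea: record a min/max position per file (or per row group) when writing sorted genotypes, then consult that index to prune files that cannot overlap a query range, much like the per-row-group column min/max statistics Parquet already keeps. All names and numbers below are invented for illustration:

```python
# Hypothetical side index built at write time over sorted genotype files:
# (file name, minimum position, maximum position). A real implementation
# might store this in Parquet footer metadata or a sidecar file.
file_index = [
    ("part-00000", 0, 9_999),
    ("part-00001", 10_000, 19_999),
    ("part-00002", 20_000, 29_999),
]

def files_overlapping(index, query_start, query_end):
    """Keep only files whose position range intersects [query_start, query_end]."""
    return [name for name, lo, hi in index
            if lo <= query_end and hi >= query_start]

# A range query consults the index first and never opens non-overlapping files.
print(files_overlapping(file_index, 15_000, 25_000))
# ['part-00001', 'part-00002']
```

This is the same interval-pruning trick tabix uses for coordinate-sorted VCF/BCF; the open question in the thread is how to make Spark aware of that sortedness rather than whether the pruning itself works.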

jpdna commented Sep 6, 2016

Interesting discussion!
I do see your point that we don't want to elevate the groupBy when we should be streaming through in one pass with an accumulator (ideally over data that, once sorted, doesn't have to be sorted again) - this is also much better for growing datasets.

I'll think more on the HBase work and the partitioning/bucketing ideas.
Closing this ticket.
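The one-pass streaming alternative can be sketched with `itertools.groupby`, which assembles adjacent records for the same position in a single pass when the input is already sorted - no re-shuffle of already-sorted data. The record shapes are toy data, not ADAM types:

```python
from itertools import groupby
from operator import itemgetter

# Genotypes already sorted by position -- e.g. as read back from a
# coordinate-sorted file -- can be grouped by streaming through once.
sorted_genotypes = [
    (100, "sampleA", "0/1"),
    (100, "sampleB", "1/1"),
    (200, "sampleA", "0/0"),
    (200, "sampleB", "0/1"),
]

# groupby only merges *adjacent* equal keys, so it works in one pass on
# sorted input, accumulating each variant's genotypes as it goes.
contexts = [
    {"pos": pos, "genotypes": [(s, gt) for _, s, gt in group]}
    for pos, group in groupby(sorted_genotypes, key=itemgetter(0))
]

assert [c["pos"] for c in contexts] == [100, 200]
```

Within a Spark partition of sorted data, a `mapPartitions` doing exactly this avoids the shuffle that a full `groupBy` would incur.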

jpdna closed this Sep 6, 2016
