How to convert genotype DataFrame to VariantContext DataFrame / RDD #886
How can I convert a genotype DataFrame to a VariantContext DataFrame / RDD?
In Spark 1.5.2 it is much faster to do queries on a genotype DataFrame than on a genotypes RDD.
Compare 14 s for a query on a genotype DataFrame read directly from the Parquet file with 6.4 minutes for the same query on a genotype RDD built from the same Parquet file.
In the end I would like to run further queries on a VariantContext RDD, or export a VariantContext / VCF-like file.
To convert a Genotype RDD to a VariantContext RDD I can do genotypes.toVariantContext.
This function is not available on the genotype DataFrame. How do I convert the genotype DataFrame to a VariantContext RDD?
DataFrame.rdd converts to an RDD of org.apache.spark.sql.Row, not an RDD of org.bdgenomics.formats.avro.Genotype.
Is it possible to somehow couple the org.bdgenomics.formats.avro.Genotype type to the DataFrame? Or at least specify it in the conversion to an RDD?
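To illustrate the manual route: since each Row only exposes named primitive fields, a typed record could be rebuilt field by field. This is a minimal sketch under stated assumptions, not ADAM's API. Plain Python dicts stand in for Spark Rows, `GenotypeRecord` is a hypothetical stand-in for org.bdgenomics.formats.avro.Genotype, and the field names (`contigName`, `start`, `sampleId`, `alleles`) and the sample data are made up for illustration rather than taken from the real schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class GenotypeRecord:
    """Hypothetical typed record; NOT the real Avro Genotype class."""
    contig_name: str
    start: int
    sample_id: str
    alleles: Tuple[str, ...]


def row_to_genotype(row: Dict[str, Any]) -> GenotypeRecord:
    """Rebuild a typed record from an untyped, Row-like mapping.

    The keys used here are assumed field names, chosen for illustration.
    """
    return GenotypeRecord(
        contig_name=row["contigName"],
        start=int(row["start"]),
        sample_id=row["sampleId"],
        alleles=tuple(row["alleles"]),
    )


# Made-up rows standing in for what DataFrame.rdd would yield.
rows = [
    {"contigName": "1", "start": 100, "sampleId": "sampleA", "alleles": ["Ref", "Alt"]},
    {"contigName": "1", "start": 100, "sampleId": "sampleB", "alleles": ["Ref", "Ref"]},
]

genotypes = [row_to_genotype(r) for r in rows]
```

With Spark, the same mapping function would presumably be applied to each Row of the RDD; the cost is that every field has to be listed by hand.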
Thank you very much!
Thanks for another great question! Sorry the team has been a bit slow to reply the past week, we were preparing for and attending AMPCamp 6 (slides and video available).
After Josh Rosen's presentation about future directions in Spark and conversations with collaborators most interested in access via Python and R, it feels to me that this (RDD <--> DataFrame/Dataset) is something we need to focus on. I've only been working on ADAM a few months and haven't worked with DataFrames yet . . . maybe we could meet up on IRC (#adamdev on freenode) or gitter to work it out?
I don't necessarily want a DataFrame of VariantContext. The main reason I am starting with a DataFrame of Genotypes is the much improved performance for queries.
Projection and predicate pushdown are applied automatically, I think. And DataFrames can make full use of the CPU and explicitly managed memory performance improvements from Project Tungsten. That it runs just as fast in R and Python is an additional benefit. RDDs, as I understand it, can't make full use of those performance improvements.
What I understand is lost with DataFrames is type information. Only the primitive types are still known, and therefore I can't use the toVariantContext functionality.
Will the Dataset API in the upcoming Spark 1.6 release make it possible to couple a specific Java/Scala type back to a DataFrame, and to use user-defined functions?
New Developments in Spark by Matei Zaharia (at IBM Research November 2, 2015)
Should this issue wait until Spark 1.6 is released with the Dataset API?
I am based in Europe, by the way (UTC+2), and I could also communicate via Gitter or IRC if that would be useful.
And thank you for all the help so far. I appreciate all the functionality already available, and that there is a place to ask questions and get an answer.
Ah, OK! Thanks for providing more info. I do agree about the improved query performance. We've been holding off on moving to DataFrames since the Spark SQL APIs are changing so quickly, but I do imagine that we'll start moving some new functionality to Datasets from Spark 1.6 onwards. Actually, for genotype-specific queries, I'm doing work in gnocchi that will accelerate the groupBy/join that is needed to create a VariantContext. It isn't ready yet, but it should be in a couple of weeks. I can post back here once I've got something to share.
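Conceptually, the groupBy mentioned above collects all genotypes that share a variant site into one record. A minimal sketch in plain Python follows; it is not ADAM's or gnocchi's implementation. The grouping key (contig, start, ref, alt), the dict field names, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def group_into_variant_contexts(genotypes):
    """Group per-sample genotype records by the variant site they describe.

    Each input record is a dict with assumed keys: contig, start, ref, alt,
    sample, call. Returns one dict per distinct variant, sorted by site.
    """
    grouped = defaultdict(list)
    for g in genotypes:
        key = (g["contig"], g["start"], g["ref"], g["alt"])
        grouped[key].append(g)
    return [
        {"variant": key, "genotypes": sorted(gs, key=lambda g: g["sample"])}
        for key, gs in sorted(grouped.items())
    ]


# Made-up genotype records: two samples at one site, one sample at another.
genotypes = [
    {"contig": "1", "start": 200, "ref": "C", "alt": "T", "sample": "sampleA", "call": "0/1"},
    {"contig": "1", "start": 100, "ref": "A", "alt": "G", "sample": "sampleB", "call": "0/0"},
    {"contig": "1", "start": 100, "ref": "A", "alt": "G", "sample": "sampleA", "call": "0/1"},
]

contexts = group_into_variant_contexts(genotypes)
```

On a cluster this grouping is a shuffle over all genotypes, which is presumably why it is the step worth accelerating.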