How to convert genotype DataFrame to VariantContext DataFrame / RDD #886

Closed
NeillGibson opened this Issue Nov 23, 2015 · 6 comments

@NeillGibson
Contributor

NeillGibson commented Nov 23, 2015

Hi,

How can I convert a genotype DataFrame to a VariantContext DataFrame / RDD ?

In Spark 1.5.2 it is much faster to do queries on a genotype DataFrame than on a genotypes RDD.

Compare 14 s with a Genotype DataFrame directly on the Parquet file:

val genotypeParquetFile = sqlContext.read.parquet("/user/ec2-user/1kg/chr22.adam")
genotypeParquetFile.registerTempTable("parquetFile")
sqlContext.sql("SELECT * FROM parquetFile WHERE sampleId = 'NA19114'").count

count at <console>:20, took 14.139183 s

to 6.4 minutes with the same query on a genotype RDD based on the same Parquet file:

val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr22.adam")
genotypes.filter(_.sampleId == "NA19114").count

count at <console>:33, finished in 384.664 s (6.4 minutes)

In the end I would like to run further queries on a VariantContext RDD, or export a VariantContext / VCF-like file.

To convert a Genotype RDD to a VariantContext RDD I can do genotypes.toVariantContext.

This function is not available on the genotype DataFrame. How do I convert the genotype DataFrame to a VariantContext RDD?

DataFrame.rdd converts to an RDD of org.apache.spark.sql.Row, not an RDD of org.bdgenomics.formats.avro.Genotype.

genotypesDataframe.rdd.first
res3: org.apache.spark.sql.Row = 
[[100,[22,null,null,null,null,null,null],24002134,24002135,T,C,null,false],[true,WrappedArray(),null,null,null,null,null,null,null,WrappedArray(),WrappedArray(),null,null,Map()],HG00096,null,null,WrappedArray(Ref, Ref),null,null,null,null,null,null,WrappedArray(),WrappedArray(),WrappedArray(),false,true,null,null]


genotypes.first
res5: org.bdgenomics.formats.avro.Genotype = 
{"variant": {"variantErrorProbability": 100, "contig": {"contigName": "22", "contigLength": null, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, 
"start": 16050074, "end": 16050075, "referenceAllele": "A", "alternateAllele": "G", "svAllele": null, "isSomatic": false}, "variantCallingAnnotations": {"variantIsPassing": true, "variantFilters": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": null, "rmsMapQ": null, "mapq0Reads": null, "mqRankSum": null, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {}}, 
"sampleId": "HG00096", "sampleDescription": null, "processingDescription": null, "alleles": ...

Is it possible to somehow couple the org.bdgenomics.formats.avro.Genotype type to the DataFrame? Or to specify it in the conversion to an RDD?
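For illustration, the only workaround I can see is rebuilding each Avro object from the Row by hand. A hypothetical sketch (covering just one field; the real Genotype schema has many nested fields):

```scala
import org.apache.spark.sql.Row
import org.bdgenomics.formats.avro.Genotype

// Map each Row back to an Avro Genotype via its builder (sketch only;
// every nested field would need the same manual treatment).
val genotypeRdd = genotypesDataframe.rdd.map { row =>
  Genotype.newBuilder()
    .setSampleId(row.getAs[String]("sampleId"))
    // ... remaining fields would need the same manual mapping ...
    .build()
}
```

That seems very tedious and fragile, which is why I am hoping there is a built-in way.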

Thank you very much!

Neill

@heuermh
Member

heuermh commented Nov 23, 2015

Thanks for another great question! Sorry the team has been a bit slow to reply the past week, we were preparing for and attending AMPCamp 6 (slides and video available).

After Josh Rosen's presentation about future directions in Spark and conversations with collaborators most interested in access via Python and R, it feels to me that this (RDD <--> DataFrame/Dataset) is something we need to focus on. I've only been working on ADAM a few months and haven't worked with DataFrames yet . . . maybe we could meet up on IRC (#adamdev on freenode) or gitter to work it out?

@fnothaft
Member

fnothaft commented Nov 23, 2015

If you want to have a DataFrame of VariantContexts, I would load an RDD of Genotypes, do _.toVariantContext on the RDD (this is an implicit added by importing org.bdgenomics.adam.rdd.ADAMContext._), and then convert the RDD to a DataFrame.
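A rough sketch of that path, assuming the ADAM implicits and an ADAMContext named ac are in scope (untested):

```scala
// Genotype RDD -> VariantContext RDD via the ADAMContext implicit.
import org.bdgenomics.adam.rdd.ADAMContext._

val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr22.adam")
val variantContexts = genotypes.toVariantContext // implicit from ADAMContext._
// VariantContext is not a flat Avro record, so getting it into a DataFrame
// may need a case-class projection of the fields of interest rather than a
// direct toDF() call.
```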

@NeillGibson
Contributor

NeillGibson commented Nov 23, 2015

I don't necessarily want a DataFrame of VariantContexts. The main reason I am starting with a DataFrame of Genotypes is the much improved query performance.

Projection and predicate pushdown are applied automatically, I think. And DataFrames can make full use of the CPU and explicitly managed memory performance improvements from Project Tungsten. That the same queries run just as fast from R and Python is an additional benefit. RDDs, as I understand it, can't make full use of those performance improvements.

What I understand is lost with DataFrames is type information: only the primitive types are still known, and therefore I can't use the toVariantContext functionality.

The upcoming Spark 1.6 release will have the Dataset API, which should make it possible to couple a specific Java/Scala type back to a DataFrame, and to use user-defined functions. See:

New Developments in Spark by Matei Zaharia (at IBM Research, November 2, 2015)
https://www.youtube.com/watch?feature=player_detailpage&v=zK9u21HX6dM#t=2848

Should this issue wait until Spark 1.6 is released with the Dataset API?

I am based in Europe btw (UTC+2h); I could also communicate via Gitter or IRC if that would be useful.

And thank you for all the help already. I appreciate all the functionality that is already available, and that there is a place to ask questions and get answers.
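Once Spark 1.6 lands, the Dataset idea might look roughly like this. Hypothetical sketch: GenotypeRow is a made-up simplified case class, since Avro classes like Genotype won't have an Encoder out of the box:

```scala
import sqlContext.implicits._

// Made-up stand-in for the full Genotype schema.
case class GenotypeRow(sampleId: String)

val ds = sqlContext.read.parquet("/user/ec2-user/1kg/chr22.adam")
  .select("sampleId")
  .as[GenotypeRow] // couples a Scala type back to the DataFrame
ds.filter(_.sampleId == "NA19114").count
```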

@fnothaft
Member

fnothaft commented Nov 23, 2015

Ah, OK! Thanks for providing more info. I do agree about the improved query performance. We've been holding off on moving to DataFrames since the Spark SQL APIs are moving so quickly, but I do imagine that we'll start moving some new functionality to DataSets from Spark 1.6 and onwards. Actually, for genotype-specific queries, I'm doing work in gnocchi that will accelerate the groupBy/join that is needed to create a VariantContext. I don't have it ready yet, but should have it ready in a couple of weeks. I can post back here once I've got that ready to share.
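The groupBy step being described could be sketched as follows (hypothetical; field accessors are based on the Genotype record shown earlier, and the real conversion does more than this):

```scala
// Group genotype calls that share a variant site: the core of assembling a
// VariantContext from per-sample Genotype records (sketch only).
val genotypesByVariant = genotypes
  .keyBy(g => (g.getVariant.getContig.getContigName, g.getVariant.getStart))
  .groupByKey()
```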

@NeillGibson
Contributor

NeillGibson commented Nov 29, 2015

Ok, thank you also for the information. Good to hear that conversion to VariantContext is something you are working on. I look forward to your work from gnocchi.

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016

@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016

@fnothaft
Member

fnothaft commented Mar 3, 2017

This is resolved by WIP PR #1391. Closing in favor of #1018.

@fnothaft fnothaft closed this Mar 3, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
