Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to convert genotype DataFrame to VariantContext DataFrame / RDD #886

Closed
NeillGibson opened this issue Nov 23, 2015 · 6 comments
Closed

How to convert genotype DataFrame to VariantContext DataFrame / RDD #886

NeillGibson opened this issue Nov 23, 2015 · 6 comments

Comments

@NeillGibson
Copy link
Contributor

@NeillGibson NeillGibson commented Nov 23, 2015

Hi,

How can I convert a genotype DataFrame to a VariantContext DataFrame / RDD ?

In Spark 1.5.2 it is much faster to do queries on a genotype DataFrame than on a genotypes RDD.

Compare 14s with a Genotype DataFrame direct on the parquet file:

val genotypeParquetFile = sqlContext.read.parquet("/user/ec2-user/1kg/chr22.adam")
genotypeParquetFile.registerTempTable("parquetFile")
sqlContext.sql("SELECT * FROM parquetFile WHERE sampleId = 'NA19114'").count

count at <console>:20, took 14.139183 s

to 6.4 minutes with the same query on a genotypeRDD based on the same parquet file:

val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr22.adam")
genotypes.filter(_.sampleId =="NA19114" ).count

count at <console>:33) finished in 384.664 s  (6.4 minutes)

In the end I would like to do further queries a VariantContext RDD, or export a VariantContext / vcf like file.

To convert a Genotype RDD to a VariantContext RDD I can do genotypes.toVariantContext.

This function is not available on the genotype DataFrame. How do I convert the genotype DataFrame to a VariantContext RDD?

Dataframe.rdd converts to a RDD of org.apache.spark.sql.Row not a RDD of org.bdgenomics.formats.avro.Genotype.

genotypesDataframe.rdd.first
res3: org.apache.spark.sql.Row = 
[[100,[22,null,null,null,null,null,null],24002134,24002135,T,C,null,false],[true,WrappedArray(),null,null,null,null,null,null,null,WrappedArray(),WrappedArray(),null,null,Map()],HG00096,null,null,WrappedArray(Ref, Ref),null,null,null,null,null,null,WrappedArray(),WrappedArray(),WrappedArray(),false,true,null,null]


genotypes.first
res5: org.bdgenomics.formats.avro.Genotype = 
{"variant": {"variantErrorProbability": 100, "contig": {"contigName": "22", "contigLength": null, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, 
"start": 16050074, "end": 16050075, "referenceAllele": "A", "alternateAllele": "G", "svAllele": null, "isSomatic": false}, "variantCallingAnnotations": {"variantIsPassing": true, "variantFilters": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": null, "rmsMapQ": null, "mapq0Reads": null, "mqRankSum": null, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {}}, 
"sampleId": "HG00096", "sampleDescription": null, "processingDescription": null, "alleles": ...

Is it possible to somehow couple the org.bdgenomics.formats.avro.Genotype type to the DataFrame? Or at specify it in the conversion to RDD?

Thank you very much!

Neill

@heuermh
Copy link
Member

@heuermh heuermh commented Nov 23, 2015

Thanks for another great question! Sorry the team has been a bit slow to reply the past week, we were preparing for and attending AMPCamp 6 (slides and video available).

After Josh Rosen's presentation about future directions in Spark and conversations with collaborators most interested in access via Python and R, it feels to me that this (RDD <--> DataFrame/Dataset) is something we need to focus on. I've only been working on ADAM a few months and haven't worked with DataFrames yet . . . maybe we could meet up on IRC (#adamdev on freenode) or gitter to work it out?

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Nov 23, 2015

If you want to have a Dataframe of VariantContexts, I would load an RDD of Genotypes, do _.toVariantContext on the RDD (this is an implicit added by importing org.bdgenomics.adam.rdd.ADAMContext._), and then convert the RDD to a Dataframe.

@NeillGibson
Copy link
Contributor Author

@NeillGibson NeillGibson commented Nov 23, 2015

I don't necessarily want a DataFrame of VariantContext. The main reason I am starting with a DataFrame of Genotypes is the much improved performance for queries.

Projection and Predicates are done automatically I think. And DataFrames can make full use of the cpu and explicitly managed memory performance improvements from project Tungsten. That it runs just as fast in R and Python is an additional benefit. RDD's as I understand can't make full use of those performance improvements.

What I understand that is lost with DataFrames is type information. Only the primitive types are still known. And therefore I cant use the toVariantContext functionality.

The upcoming Spark 1.6 release will have the DataSet API which will make it possible to couple a specific Java/Scala type back to a DataFrame? And to use user defined functions?

New Developments in Spark by Matei Zaharia (at IBM Research November 2, 2015)
https://www.youtube.com/watch?feature=player_detailpage&v=zK9u21HX6dM#t=2848

Should this issue wait until the Spark 1.6 is released with the DataSet API?

I am based in Europe btw , UTC+2h, I could also communicate via Gitter or IRC if that would be useful.

And thank you for all help already, I appreciate all the functionality already available and that there is a place to ask question and to get an answer.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Nov 23, 2015

Ah, OK! Thanks for providing more info. I do agree about the improved query performance. We've been holding off on moving to DataFrames since the Spark SQL APIs are moving so quickly, but I do imagine that we'll start moving some new functionality to DataSets from Spark 1.6 and onwards. Actually, for genotype-specific queries, I'm doing work in gnocchi that will accelerate the groupBy/join that is needed to create a VariantContext. I don't have it ready yet, but should have it ready in a couple of weeks. I can post back here once I've got that ready to share.

@NeillGibson
Copy link
Contributor Author

@NeillGibson NeillGibson commented Nov 29, 2015

Ok thank you also for the information. Good to hear that conversion to VariantContext is something you are working on. I look forward to your work from gnocchi.

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016
@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Mar 3, 2017

This is resolved by WIP PR #1391. Closing in favor of #1018.

@fnothaft fnothaft closed this Mar 3, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.