bucketing strategy #1553

Closed
jpdna opened this Issue May 30, 2017 · 1 comment

@jpdna
Member

jpdna commented May 30, 2017

An issue that we discussed in the past, but I am not sure if we ever prototyped:

I'd like to try writing Parquet files from the soon-to-be-ready ADAM dataset API, bucketed by 10 megabase genomic regions using the df.write.bucketBy() call available in Spark 2.1, and then compare "random access" performance of lookups and joins (hopefully through partition discovery, see https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery) using a "chr_bin" filter in a query, as compared with normal Parquet predicate pushdown.
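
Roughly, a minimal sketch of what I have in mind (the column names contigName and start, the bucket count, and the file paths are all placeholders for whatever the ADAM dataset API ends up exposing; note that in Spark 2.1 bucketBy only works with saveAsTable, not a bare path write):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.floor

val spark = SparkSession.builder().appName("bucketing-experiment").getOrCreate()
import spark.implicits._

// 10 megabase bins; derive a chr_bin column from the read start position.
val binSize = 10000000L

val binned = spark.read.parquet("reads.parquet")
  .withColumn("chr_bin", floor($"start" / binSize))

// bucketBy in Spark 2.1 requires saveAsTable; the metastore records
// the bucketing spec so later joins can avoid a shuffle.
binned.write
  .bucketBy(256, "contigName", "chr_bin")
  .sortBy("start")
  .format("parquet")
  .saveAsTable("reads_bucketed")

// partitionBy, by contrast, writes one directory per (contigName, chr_bin)
// pair, which is what partition discovery picks up when reading the path.
binned.write
  .partitionBy("contigName", "chr_bin")
  .parquet("reads_partitioned.parquet")

// A "random access" lookup then prunes down to a single bin's files
// instead of scanning row-group statistics across the whole dataset.
val hit = spark.read.parquet("reads_partitioned.parquet")
  .filter($"contigName" === "chr1" && $"chr_bin" === 12)
```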

I'll go ahead and experiment with this, but wanted to get anyone else's thoughts on whether this seems viable or worthwhile. I'm hoping to achieve the effect of a very coarse index. The baseline to beat is sketched below.
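
For comparison, the predicate-pushdown baseline would just be a plain range filter on the unbucketed files, e.g. (same placeholder column names as above):

```scala
// Baseline: plain Parquet with predicate pushdown, no bucketing or
// partitioning, filtering the same hypothetical 10 Mb region directly.
val baseline = spark.read.parquet("reads.parquet")
  .filter($"contigName" === "chr1" &&
          $"start" >= 120000000L && $"start" < 130000000L)
```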

@fnothaft fnothaft added the duplicate label Jun 22, 2017

@fnothaft
Member

fnothaft commented Jun 22, 2017

Hi @jpdna! I think this is a dupe of #651. I'm closing it as a dupe, but please reopen if you disagree.

@fnothaft fnothaft closed this Jun 22, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
