bucketing strategy #1553

jpdna opened this issue May 30, 2017 · 1 comment

@jpdna jpdna commented May 30, 2017

An issue that we discussed in the past, but I am not sure if we ever prototyped:

I'd like to try writing Parquet files from the soon-to-be-ready ADAM Dataset API, bucketed by 10-megabase genomic regions using df.write.bucketBy(), available in Spark 2.1, and then compare the "random access" performance of lookups and joins (hopefully through partition discovery here: ) that use a "chr_bin" filter in the query against normal Parquet predicate pushdown.

I'll go ahead and experiment with this, but wanted to get anyone else's thoughts on whether this seems viable or worthwhile. I'm hoping to achieve the effect of a very coarse index.
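
To make the binning idea concrete, here is a minimal sketch in plain Python of how a "chr_bin" key could be derived and how a region query could be mapped to the bins it touches. The `contig_index` key format, the function names, and the 0-based half-open coordinates are assumptions for illustration only, not anything in ADAM. In Spark, a column computed this way would be the one handed to the writer and used in the filter.

```python
BIN_SIZE = 10_000_000  # 10 megabases, as proposed above


def chr_bin(contig, position, bin_size=BIN_SIZE):
    """Return a coarse bin key for a 0-based position (hypothetical format)."""
    return f"{contig}_{position // bin_size}"


def bins_for_region(contig, start, end, bin_size=BIN_SIZE):
    """Return every bin key overlapped by the half-open region [start, end)."""
    first = start // bin_size
    last = (end - 1) // bin_size  # end is exclusive, so step back one base
    return [f"{contig}_{i}" for i in range(first, last + 1)]


if __name__ == "__main__":
    # A read at chr1:25,000,000 falls in the third 10 Mb bin (index 2).
    print(chr_bin("chr1", 25_000_000))
    # A query straddling bin boundaries must filter on all overlapped bins.
    print(bins_for_region("chr1", 9_999_990, 20_000_010))
```

A region lookup would then filter on the (usually one or two) keys returned by `bins_for_region` before applying the exact start/end predicate, which is what gives the coarse-index effect.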

@fnothaft fnothaft added the duplicate label Jun 22, 2017

@fnothaft fnothaft commented Jun 22, 2017

Hi @jpdna! I think this is a dupe of #651. I'm closing it as a dupe, but please reopen if you disagree.

@fnothaft fnothaft closed this Jun 22, 2017
@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018