Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to count genotypes with a 10 node Spark/Adam cluster faster than with BCFTools on a single machine? #879

NeillGibson opened this issue Nov 10, 2015 · 4 comments


Copy link

@NeillGibson NeillGibson commented Nov 10, 2015


How can I count genotypes in a Adam Genotype parquet file, using a Spark and Adam cluster, faster than doing the same thing with a BCF file and BCFTools on a single machine?

I am looking into Spark/Adam as a potential back end for a scalable Variant and Genotype store.

As a baseline for performance I used BCFTools to decompress, inflate and count variant/genotype records from the Chr_22 1000 genomes BCF file.
This took 5m7s with a single thread on a single machine. Chr_22 is just a small part of the full 1000 genomes Variant/Genotype data set, which of course includes the other chromosomes.

My expectation was that with a small Spark and Adam cluster this could be done at least an order of magnitude faster. Say below a minute.

To my surprise the best result I can get with a 10 node Spark/Adam cluster is slower than my single machine BCFTools baseline, and that is with using at least 10 times as much resources.
The best results I could get with Spark and Adam is 8m8s. This doesn't even take in to account that I really would like to count (query) variants for which a expensive groupByKey (variant id/pos) is needed.

Am I mistaken to look in to Spark and Adam as potential (future) scalable back end for storing and querying Variants and Genotypes?
Should I look into related technology for setting up a scalable variant and genotype store? For instance SparkSQL or Impala, or something completely else?
Is in memory caching (of compressed Variant&Genotype data?) needed to get good (interactive) query performance? And would that be possible for the 1000 genomes vcf file which is raw 1TB over all chromosomes? z

Or did I just make some mistakes with the commands somewhere?
My time to count genotypes looks slower than from the performance results mentioned here
Count chr22 genotypes (from HDFS): 10 min (2nodes) 1.4 min(20nodes)

The exact code used to start up a Spark and Adam cluster and do the performance comparison with BCF tools is here:

Any comments and insights would be much appreciated.

Best Regards,

Neill Gibson

Copy link

@heuermh heuermh commented Nov 11, 2015

Thank you for documenting your use case, will take a look.

Copy link
Contributor Author

@NeillGibson NeillGibson commented Nov 12, 2015

Thank you!

Copy link
Contributor Author

@NeillGibson NeillGibson commented Nov 18, 2015

Hi Michael I had some time to look at the comments and test the changes. For readability I split the results I found in multiple comment here:

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016
@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016
@heuermh heuermh added this to Triage in Release 0.23.0 Mar 8, 2017
Copy link

@fnothaft fnothaft commented Jul 12, 2017

Using the chr22.vcf from 1kg, count times on 5 16-core nodes are:

  • Using RDD API ~4.5min
  • Using Dataset API <10sec

Marking this as closed.

@fnothaft fnothaft closed this Jul 12, 2017
@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.