How to count genotypes with a 10 node Spark/Adam cluster faster than with BCFTools on a single machine? #879
How can I count genotypes in an Adam Genotype parquet file, using a Spark and Adam cluster, faster than doing the same thing with a BCF file and BCFTools on a single machine?
I am looking into Spark/Adam as a potential back end for a scalable Variant and Genotype store.
As a performance baseline, I used BCFTools to decompress, inflate and count variant/genotype records from the Chr_22 1000 Genomes BCF file.
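For illustration, a single-machine baseline of this kind could be timed with a small wrapper like the one below. This is a minimal sketch, not the exact code used for the comparison: the helper name `count_output_lines`, the `bcftools view -H` invocation and the file name in the usage comment are all assumptions.

```python
import subprocess
import time


def count_output_lines(cmd):
    """Run cmd, stream its stdout, and return (line_count, wall_clock_seconds).

    Lines are counted as they arrive, so the decompressed output is never
    held in memory. The helper name is hypothetical.
    """
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    line_count = sum(1 for _ in proc.stdout)
    returncode = proc.wait()
    elapsed = time.perf_counter() - start
    if returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with status {returncode}")
    return line_count, elapsed


# Assumed usage (file name is a placeholder):
#   records, seconds = count_output_lines(["bcftools", "view", "-H", "chr22.bcf"])
# With -H the header is suppressed, so each output line is one variant record;
# the number of genotype calls is then records * number_of_samples.
```

Timing the whole pipeline end to end, including decompression, keeps the comparison with a cluster job fair, since the cluster timing also includes reading and decoding the Parquet data.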
My expectation was that a small Spark and Adam cluster could do this at least an order of magnitude faster, say in under a minute.
To my surprise, the best result I can get with a 10-node Spark/Adam cluster is slower than my single-machine BCFTools baseline, despite using at least 10 times as many resources.
Am I mistaken to look into Spark and Adam as a potential (future) scalable back end for storing and querying variants and genotypes?
Or did I just make a mistake somewhere in the commands?
The exact code used to start up a Spark and Adam cluster and do the performance comparison with BCF tools is here:
Any comments and insights would be much appreciated.