How to count genotypes with a 10-node Spark/Adam cluster faster than with BCFTools on a single machine? #879
How can I count genotypes in an Adam Genotype Parquet file, using a Spark and Adam cluster, faster than doing the same thing with a BCF file and BCFTools on a single machine?
I am looking into Spark/Adam as a potential back end for a scalable Variant and Genotype store.
As a performance baseline, I used BCFTools to decompress, inflate and count variant/genotype records from the Chr_22 1000 Genomes BCF file.
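For illustration only, a baseline of this kind could be timed roughly as follows. This is a minimal sketch, not the code linked from this issue: the helper name `count_lines_timed` and the file name `chr22.bcf` are hypothetical, and it assumes `bcftools` is on the `PATH`. It relies on `bcftools view -H` printing one line per variant record without the header.

```python
import subprocess
import time


def count_lines_timed(command: str) -> tuple[int, float]:
    """Run a shell command, count its stdout lines, return (lines, seconds).

    Hypothetical helper for illustration; not taken from the original post.
    """
    start = time.perf_counter()
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    # Stream the output line by line so the decoded records are never
    # held in memory all at once.
    lines = sum(1 for _ in proc.stdout)
    proc.wait()
    elapsed = time.perf_counter() - start
    if proc.returncode != 0:
        raise RuntimeError(f"command failed with exit code {proc.returncode}")
    return lines, elapsed


if __name__ == "__main__":
    # Assumed file name; `-H` suppresses the header so each output line
    # corresponds to one variant record.
    records, seconds = count_lines_timed("bcftools view -H chr22.bcf")
    print(f"{records} variant records in {seconds:.1f} s")
```

Timing the whole pipeline end to end, including decompression, keeps the number comparable to a cluster job that also has to read and decode its input before counting.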
My expectation was that with a small Spark and Adam cluster this could be done at least an order of magnitude faster, say in under a minute.
To my surprise, the best result I can get with a 10-node Spark/Adam cluster is slower than my single-machine BCFTools baseline, while using at least 10 times as many resources.
Am I mistaken to look into Spark and Adam as a potential (future) scalable back end for storing and querying Variants and Genotypes?
Or did I just make some mistakes with the commands somewhere?
The exact code used to start up a Spark and Adam cluster and run the performance comparison with BCFTools is here:
Any comments and insights would be much appreciated.