How to read glob of multiple parquet Genotype #1179
Is it possible to load the data of several parquet files storing bdg-formats Genotype into a single GenotypeRDD?
See discussions below from email:
On Tue, Sep 20, 2016 at 11:19 AM, Justin Paschall firstname.lastname@example.org wrote:
Similarly - is there any sense in converting single-sample VCFs to single-sample Parquet files of [Genotype] as a reference data source?
However, I am not sure there is a way to load a set of many individual ADAM Parquet genotype files with a glob, in the same way that we can load a bunch of VCFs with a glob.
It is tricky to get the globs in correctly without having them expanded by the shell; it appears that even my tricks don't work for .adam files
$ ./bin/adam-submit transform "*.adam" -single combined.sam
$ ./bin/adam-submit transform "file:///
I will have to dig into the new code changes to look (getFsAndFilesWithFilter).
The VCF glob works, but it is slower than I hoped: 10 minutes to load and count 100 samples of chr22 on my workstation (I will test on a cluster shortly). I want to make sure that I am not missing out on a one-time conversion to Parquet on a per-sample-file basis, if that would be better.
Read the unit tests ;)
I jest. This works, but it's not obvious. Specifically, I created two Parquet files:
Then I globbed them like so:
This isn't a great example (since it correctly throws an error for dupe sample names) but it shows how to do the glob. Specifically, you need to glob inside the directories that you're globbing. Long story short, Hadoop treats globs on files and on directories differently. If you want to dig into the implementation details, grep for
That being said, since this isn't obvious, I'll update the exception to include a hint.
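To illustrate the point about globbing inside the directories, here is a minimal local-filesystem sketch. It uses Python's `glob` module, not Hadoop's glob implementation, so it only shows the general idea: a Parquet "file" written by Spark is really a directory of part files, and a pattern such as `*.adam` matches the directories, while a pattern that reaches inside them matches the files. The sample names, part file names, and the `build_layout` helper are hypothetical and are not taken from ADAM.

```python
import glob
import os
import tempfile


def build_layout(root, names):
    """Create hypothetical Parquet-style output directories under root."""
    for name in names:
        out_dir = os.path.join(root, name)
        os.makedirs(out_dir)
        # Assumed part file names; real writers may name them differently.
        for part in ("part-r-00000.gz.parquet", "part-r-00001.gz.parquet"):
            open(os.path.join(out_dir, part), "w").close()
        # Marker file commonly left next to the part files.
        open(os.path.join(out_dir, "_SUCCESS"), "w").close()


root = tempfile.mkdtemp()
build_layout(root, ["sample1.adam", "sample2.adam"])

# Globbing the outputs themselves matches directories only.
dir_matches = sorted(glob.glob(os.path.join(root, "*.adam")))

# Globbing inside the directories matches the Parquet part files.
file_matches = sorted(glob.glob(os.path.join(root, "*.adam", "*.parquet")))

for path in dir_matches:
    print("directory:", os.path.basename(path))
for path in file_matches:
    print("file:", os.path.relpath(path, root))
```

By analogy, the path passed to ADAM would need a glob component for the contents of each `.adam` directory rather than just the directory names. The exact pattern that Hadoop accepts is not shown in this thread, so treat the form above as an assumption.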