[ADAM-1793] Adding vcf2adam and adam2vcf that handle separate variant and genotype data. #1794
Conversation
Merged build finished. Test FAILed.
Test FAILed. Build result: FAILURE. All eight ADAM-prb matrix builds (Hadoop 2.6.2/2.7.3 × Scala 2.10/2.11 × Spark 1.6.3/2.2.0, centos) completed with result FAILURE.
One small change, otherwise LGTM.
args.inputPath,
stringency = stringency)

// todo: cache variantContexts? sort variantContexts first?
fnothaft
Nov 7, 2017
Member
I would present this as an option. It's fine to convert twice if you're working with a small dataset, but you'll want to persist if you're working with a large dataset.
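A minimal sketch of what this option could look like. Note the flag name (`args.cache`) and the `variantContexts` value are assumptions for illustration, not the actual ADAM CLI code:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical: if the user opts in via a -cache flag, persist the
// loaded variant contexts so the two downstream conversions (to
// variants and to genotypes) do not each recompute the VCF parse.
val maybeCached =
  if (args.cache) variantContexts.persist(StorageLevel.MEMORY_AND_DISK)
  else variantContexts
```

`MEMORY_AND_DISK` is a reasonable default here since large cohort VCFs may not fit in executor memory.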
stringency = stringency)

val join = variants.shuffleRegionJoin(genotypes)
val updatedGenotypes = join.rdd.map(pair => {
fnothaft
Nov 7, 2017
Member
Reading through this a second time, what I'd do here is not a shuffleRegionJoin (think of the multiallelic sites) but instead a join on contig/start/end/ref/alt. You might be able to do this through the dataset API for better (?) performance.
heuermh
Nov 7, 2017
Author
Member
Agreed, will push this post-0.23.0 so as not to have to conditionally build against the Spark 1.x vs 2.x Dataset API.
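For reference, a hedged sketch of the join fnothaft suggests, keyed on contig/start/end/ref/alt rather than on genomic overlap, so multiallelic sites at the same locus pair only with their matching allele. The column names here are assumptions, not necessarily the exact ADAM schema:

```scala
// Hypothetical Dataset-API version of the suggested join. Both sides
// are joined on the full (contig, start, end, ref, alt) key so that
// two alternate alleles at the same position do not cross-match, as
// they would under a pure region (interval-overlap) join.
val joined = variants.toDF()
  .join(genotypes.toDF(),
    Seq("contigName", "start", "end",
        "referenceAllele", "alternateAllele"))
```

As noted above, this only builds cleanly once the project targets a single Spark major version's Dataset API.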
Move to close this in favor of #1865.
Closing in favor of #1865.
Fixes #1793
Rough first cut, needs review.