Optimize RemoveDuplicateReadVariantPairs #633

Closed
davidadamsphd opened this issue Jul 10, 2015 · 1 comment

Comments

@davidadamsphd

Right now we need 9 transforms to remove duplicates :(
We do:
(1) pTransform to get PCollection<KV<UUID, UUID>>
(2) remove dupes of (1)
(3) GroupByKey of (2) (produces PCollection<UUID,Iterable>)
(4) create PCollection<UUID, KV<UUID, Beta>>
(5) create KeyedPCollectionTuple of (3) and (4)
(6) use (5) to create PCollection<UUID, Iterable>
(7) use pKvAB to create PCollection<KV<UUID, Alpha>>
(8) create KeyedPCollectionTuple of (6) and (7)
(9) join Alpha back in (using (8)) to get PCollection<Alpha, Iterable>

Frances suggested roughly the following (which has fewer steps):
(1) pTransform to get PCollection<KV<KV<UUID,UUID>, KV<A,B>>>
(2) GroupByKey of (1) to get PCollection<KV<KV<UUID,UUID>, Iterable<KV<A,B>>>>
(3) Make PCollection<KV<UUID, Iterable<KV<A,B>>>>
(4) GroupByKey of (3) to get PCollection<KV<UUID, Iterable<Iterable<KV<A,B>>>>>
(5) Now back out to get the final result PCollection<KV<A, Iterable>>
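
To make the data flow of the five steps concrete, here is a rough in-memory sketch using plain Java collections rather than the Dataflow SDK. It is only an illustration under assumptions: `DedupSketch`, `dedup`, `idOfA` and `idOfB` are hypothetical names, `Map.Entry<A, B>` stands in for `KV<A,B>`, and ordinary maps stand in for GroupByKey. The first element of each duplicate group is kept as the representative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

// In-memory illustration only; not Dataflow SDK code.
class DedupSketch {
    // idOfA / idOfB are hypothetical accessors that return the UUID of an A or a B.
    static <A, B> Map<A, List<B>> dedup(List<Map.Entry<A, B>> pairs,
                                        Function<A, UUID> idOfA,
                                        Function<B, UUID> idOfB) {
        // Steps (1) and (2): key every pair by (UUID of A, UUID of B) and group.
        // Duplicate pairs share a key, so they collapse into one group.
        Map<List<UUID>, List<Map.Entry<A, B>>> byPairKey = new LinkedHashMap<>();
        for (Map.Entry<A, B> pair : pairs) {
            List<UUID> key = Arrays.asList(idOfA.apply(pair.getKey()),
                                           idOfB.apply(pair.getValue()));
            byPairKey.computeIfAbsent(key, k -> new ArrayList<>()).add(pair);
        }

        // Steps (3) and (4): re-key each group by the UUID of A alone and group again,
        // which yields the Iterable<Iterable<KV<A,B>>> shape.
        Map<UUID, List<List<Map.Entry<A, B>>>> byAId = new LinkedHashMap<>();
        for (Map.Entry<List<UUID>, List<Map.Entry<A, B>>> group : byPairKey.entrySet()) {
            UUID aId = group.getKey().get(0);
            byAId.computeIfAbsent(aId, k -> new ArrayList<>()).add(group.getValue());
        }

        // Step (5): back out to A -> list of B, taking one pair per inner group.
        Map<A, List<B>> result = new LinkedHashMap<>();
        for (List<List<Map.Entry<A, B>>> groups : byAId.values()) {
            A a = groups.get(0).get(0).getKey();
            List<B> bs = new ArrayList<>();
            for (List<Map.Entry<A, B>> duplicates : groups) {
                bs.add(duplicates.get(0).getValue());
            }
            result.put(a, bs);
        }
        return result;
    }
}
```

In a real pipeline each loop would be a ParDo or GroupByKey, so this only shows where the duplicates disappear (the first grouping), not the cost of the shuffles.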

Perhaps a better solution would be to avoid having to remove duplicates entirely. The idea: find the length of the longest read, then add each variant to every shard it overlaps (plus a margin equal to the length of the longest read). Each read would be mapped only to the shard containing its start position. Since every read then lands in exactly one shard, each read/variant pair is produced at most once and there is nothing to deduplicate.
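
A minimal sketch of that shard assignment, under several assumptions that are not in the issue: a single contig, 0-based positions, fixed-size shards, and an inclusive variant interval. `ShardSketch`, `readShard` and `variantShards` are hypothetical names. The sketch pads the variant only toward lower coordinates, on the reasoning that a read keyed by its start can reach a variant only from the left; that is one reading of the "margin", not something the issue spells out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: single contig, 0-based positions, fixed shard size.
class ShardSketch {
    // A read goes to exactly one shard: the one containing its start position.
    static long readShard(long readStart, long shardSize) {
        return readStart / shardSize;
    }

    // A variant goes to every shard overlapped by [variantStart - maxReadLength, variantEnd].
    // The left padding covers reads that start before the variant but still reach it.
    static List<Long> variantShards(long variantStart, long variantEnd,
                                    long maxReadLength, long shardSize) {
        long paddedStart = Math.max(0L, variantStart - maxReadLength);
        long firstShard = paddedStart / shardSize;
        long lastShard = variantEnd / shardSize;
        List<Long> shards = new ArrayList<>();
        for (long shard = firstShard; shard <= lastShard; shard++) {
            shards.add(shard);
        }
        return shards;
    }
}
```

For example, with a shard size of 100 and a longest read of 50, a variant at position 120 is sent to shards 0 and 1, so a read starting at 90 (shard 0) still finds it locally.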

@akiezun

akiezun commented Oct 7, 2015

This issue was moved to broadinstitute/gatk-dataflow#27
