Optimize RemoveDuplicateReadVariantPairs #633

davidadamsphd · 2015-07-10T14:33:27Z

Right now we use 9 transforms to remove duplicates :(
The current steps are:
(1) pTransform to get PCollection<KV<UUID, UUID>>
(2) remove dupes of (1)
(3) GroupByKey of (2) (produces PCollection<UUID,Iterable>)
(4) create PCollection<UUID, KV<UUID, Beta>>
(5) create KeyedPCollectionTuple of (3) and (4)
(6) use (5) to create PCollection<UUID, Iterable>
(7) use pKvAB to create PCollection<KV<UUID, Alpha>>
(8) create KeyedPCollectionTuple of (6) and (7)
(9) join Alpha back in (using (8)) to get PCollection<Alpha, Iterable

Frances suggested roughly the following, which has fewer steps:
(1) pTransform to get PCollection<KV<KV<UUID,UUID>, KV<A,B>>>
(2) GroupByKey of (1) to get PCollection<KV<KV<UUID,UUID>, Iterable<KV<A,B>>>>
(3) Make PCollection<KV<UUID, Iterable<KV<A,B>>>>
(4) GroupByKey of (3) and get PCollection<KV<UUID, Iterable<Iterable<KV<A,B>>>>>
(5) Now back out to get the final result PCollection<KV<A, Iterable>>
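
To make the data flow of the five steps above concrete, here is a minimal sketch that mimics them on plain in-memory collections rather than on PCollections. It is not the pipeline code: the class name `DedupSketch`, the helper `kv`, and the UUID extractor functions `idA` and `idB` are all hypothetical, and each GroupByKey is stood in for by an ordinary map of lists.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

class DedupSketch {
    /** Hypothetical helper: builds one (A, B) pair, standing in for KV<A,B>. */
    static <A, B> Map.Entry<A, B> kv(A a, B b) {
        return new SimpleEntry<>(a, b);
    }

    /** Steps (1)-(5) on in-memory lists; idA and idB are assumed UUID extractors. */
    static <A, B> Map<A, List<B>> dedup(List<Map.Entry<A, B>> pairs,
                                        Function<A, UUID> idA,
                                        Function<B, UUID> idB) {
        // (1) + (2): key every pair by (UUID of A, UUID of B) and group by that key,
        // so all copies of the same pair land in a single group.
        Map<List<UUID>, List<Map.Entry<A, B>>> byPair = new LinkedHashMap<>();
        for (Map.Entry<A, B> p : pairs) {
            List<UUID> key = Arrays.asList(idA.apply(p.getKey()), idB.apply(p.getValue()));
            byPair.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
        }

        // (3) + (4): re-key each group by the UUID of A alone and group again.
        Map<UUID, List<List<Map.Entry<A, B>>>> byA = new LinkedHashMap<>();
        for (Map.Entry<List<UUID>, List<Map.Entry<A, B>>> e : byPair.entrySet()) {
            byA.computeIfAbsent(e.getKey().get(0), k -> new ArrayList<>()).add(e.getValue());
        }

        // (5): back out to A -> list of B, keeping one representative per inner group.
        Map<A, List<B>> result = new LinkedHashMap<>();
        for (List<List<Map.Entry<A, B>>> groups : byA.values()) {
            A a = groups.get(0).get(0).getKey();
            List<B> bs = new ArrayList<>();
            for (List<Map.Entry<A, B>> group : groups) {
                bs.add(group.get(0).getValue());
            }
            result.put(a, bs);
        }
        return result;
    }

    public static void main(String[] args) {
        // Strings stand in for A and B; a name-based UUID stands in for their real ids.
        Function<String, UUID> id =
            s -> UUID.nameUUIDFromBytes(s.getBytes(StandardCharsets.UTF_8));
        List<Map.Entry<String, String>> pairs = Arrays.asList(
            kv("read1", "variantX"), kv("read1", "variantX"),
            kv("read1", "variantY"), kv("read2", "variantX"));
        System.out.println(dedup(pairs, id, id));
    }
}
```

The duplicate pair collapses in the first grouping, which is why no separate "remove dupes" transform is needed in this formulation.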

Perhaps a better solution would be to avoid having to remove duplicates entirely. The idea would be to find the length of the longest read and add each variant to every shard it overlaps (plus a margin equal to the length of the longest read). Each read would then be mapped only to the shard containing its start position, so each read-variant pair should be produced only once.
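
The shard assignment described above could look roughly like the following sketch. Several things here are assumptions, not taken from the issue: fixed-size shards, 0-based inclusive coordinates, the names `ShardSketch`, `shardOfRead` and `shardsOfVariant`, and applying the margin only on the low-coordinate side (a read keyed by its start can overlap a variant only if it starts at most one maximum read length before the variant's start).

```java
import java.util.ArrayList;
import java.util.List;

class ShardSketch {
    /** A read goes to exactly one shard: the one containing its start position. */
    static long shardOfRead(long readStart, long shardSize) {
        return Math.floorDiv(readStart, shardSize);
    }

    /**
     * A variant spanning [start, end] (inclusive) goes to every shard it overlaps,
     * after padding its start by maxReadLength (assumed margin direction).
     */
    static List<Long> shardsOfVariant(long start, long end,
                                      long maxReadLength, long shardSize) {
        long paddedStart = Math.max(0, start - maxReadLength);
        long first = Math.floorDiv(paddedStart, shardSize);
        long last = Math.floorDiv(end, shardSize);
        List<Long> shards = new ArrayList<>();
        for (long s = first; s <= last; s++) {
            shards.add(s);
        }
        return shards;
    }
}
```

For example, with a shard size of 1000 and a longest read of 100 bases, a variant at position 1050 is copied to shards 0 and 1, so a read starting at 980 (shard 0) that reaches past 1050 still finds it locally, and no read is ever emitted from more than one shard.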

akiezun · 2015-10-07T00:51:49Z

This issue was moved to broadinstitute/gatk-dataflow#27

davidadamsphd added Engine Dataflow labels Jul 10, 2015

davidadamsphd self-assigned this Jul 10, 2015

jean-philippe-martin mentioned this issue Jul 16, 2015

Skeleton of the reads preprocessing pipeline #655

Merged

droazen added this to the alpha milestone Jul 22, 2015

droazen mentioned this issue Jul 22, 2015

Profile and optimize the ReadsPreprocessingPipeline #696

Closed

droazen added the performance label Jul 27, 2015

akiezun removed this from the alpha milestone Oct 5, 2015

akiezun mentioned this issue Oct 7, 2015

Profile and optimize the ReadsPreprocessingPipeline broadinstitute/gatk-dataflow#24

Open

akiezun closed this as completed Oct 7, 2015

akiezun mentioned this issue Oct 7, 2015

Optimize RemoveDuplicateReadVariantPairs broadinstitute/gatk-dataflow#27

Open

lbergelson pushed a commit that referenced this issue May 31, 2017

Merge pull request #633 from broadinstitute/as_wgs_pon_wdl

a99a09c

Created PoN workflow, and added WGS option, closes #516 #523

droazen mentioned this issue Jun 5, 2017

ClusteringGenomicHMMSegmenter can fail if random memory length exceeds maximum allowed. #2877

Closed
