Right now we have 9 transforms to remove duplicates :(
We do:
(1) pTransform to get PCollection<KV<UUID, UUID>>
(2) remove dupes of (1)
(3) GroupByKey of (2) (produces PCollection<UUID,Iterable>)
(4) create PCollection<UUID, KV<UUID, Beta>>
(5) create KeyedPCollectionTuple of (3) and (4)
(6) use (5) to create PCollection<UUID, Iterable>
(7) use pKvAB to create PCollection<KV<UUID, Alpha>>
(8) create KeyedPCollectionTuple of (6) and (7)
(9) join Alpha back in (using (8)) to get PCollection<Alpha, Iterable>
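To make the shape of the nine steps concrete, here is a plain-Python sketch that models the same dedup-and-join with lists and dicts standing in for PCollections (the function name `nine_step_shape` and the use of strings for Alpha/Beta values are illustrative assumptions, not Dataflow SDK code):

```python
from collections import defaultdict

def nine_step_shape(pairs, betas, alphas):
    """Model of the 9-transform pipeline: `pairs` are (a_id, b_id) tuples
    (with duplicates), `betas` maps b_id -> Beta, `alphas` maps a_id -> Alpha."""
    # (1)-(2): produce the KV<UUID, UUID> pairs and remove duplicates
    unique_pairs = set(pairs)
    # (3): GroupByKey -> one entry per a_id with all its b_ids
    grouped = defaultdict(list)
    for a_id, b_id in unique_pairs:
        grouped[a_id].append(b_id)
    # (4)-(6): join the Beta values in via the shared b_id key
    with_betas = {a_id: [betas[b] for b in b_ids]
                  for a_id, b_ids in grouped.items()}
    # (7)-(9): join Alpha back in to get KV<Alpha, Iterable<Beta>>
    return {alphas[a_id]: bs for a_id, bs in with_betas.items()}
```

The point of the sketch is that the dedup happens as a separate pass over the keys (steps 1-2) before any grouping, which is what forces the two extra CoGroupByKey-style joins to reattach Beta and Alpha afterwards.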
Frances suggested roughly the following (which has fewer steps):
(1) pTransform to get PCollection<KV<KV<UUID,UUID>, KV<A,B>>>
(2) GroupByKey of (1) to get PCollection<KV<KV<UUID,UUID>, Iterable<KV<A,B>>>>
(3) Make PCollection<KV<UUID, Iterable<KV<A,B>>>>
(4) GroupByKey of (3) and get PCollection<KV<UUID, Iterable<Iterable<KV<A,B>>>>>
(5) Now back out to get the final result PCollection<KV<A, Iterable>>
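The same plain-Python modeling for the five-step version (the function name `five_step_shape` and the record layout are assumptions; `records` carries the whole KV<KV<UUID,UUID>, KV<A,B>> from step 1):

```python
from collections import defaultdict

def five_step_shape(records):
    """`records` are ((a_id, b_id), (alpha, beta)) tuples, possibly duplicated."""
    # (1)-(2): key by the (a_id, b_id) pair and GroupByKey; grouping on the
    # full pair collapses duplicates without a separate dedup transform
    by_pair = defaultdict(list)
    for (a_id, b_id), ab in records:
        by_pair[(a_id, b_id)].append(ab)
    # (3)-(4): re-key by a_id alone and group again
    # -> KV<UUID, Iterable<Iterable<KV<A,B>>>>
    by_a = defaultdict(list)
    for (a_id, _), abs_ in by_pair.items():
        by_a[a_id].append(abs_)
    # (5): back out the final result KV<A, Iterable<B>> by taking one
    # representative from each inner iterable of duplicates
    result = {}
    for groups in by_a.values():
        alpha = groups[0][0][0]  # every copy carries the same Alpha
        result[alpha] = [group[0][1] for group in groups]
    return result
```

The trick this version exploits is that grouping on the composite (UUID, UUID) key does the deduplication and the grouping in one GroupByKey, so no CoGroupByKey joins are needed to reattach A and B: they ride along in the value the whole way.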
Perhaps a better solution would be to avoid having to remove duplicates entirely: find the length of the longest read, and add each variant to every shard it overlaps, extended by a margin of that longest-read length. Each read would then be mapped only to the shard containing its start position, so no (read, variant) pair is ever produced twice.
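A minimal sketch of that shard-assignment rule, assuming fixed-width shards over genomic positions (the function names and parameters here are hypothetical, not part of the existing pipeline):

```python
def shards_for_variant(start, end, shard_size, longest_read):
    """A variant is added to every shard that its interval, extended left by
    the longest read length, overlaps: [start - longest_read, end]. Any read
    that starts in an earlier shard but reaches the variant still sees it."""
    first = max(0, start - longest_read) // shard_size
    last = end // shard_size
    return list(range(first, last + 1))

def shard_for_read(start, shard_size):
    """A read is mapped only to the shard of its start position, so each
    read appears exactly once and no dedup pass is needed."""
    return start // shard_size
```

For example, with 100-base shards and a longest read of 50 bases, a variant spanning [120, 130] is copied into shards 0 and 1, because a read starting at position 95 (shard 0) can still overlap it; the read itself lands only in shard 0.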