adamGetReferenceString doesn't reduce pairs correctly #967

Closed
erictu opened this Issue Mar 17, 2016 · 0 comments

erictu commented Mar 17, 2016

adamGetReferenceString throws the error "Regions being joined must be adjacent." when it reduces with the reducePairs function.

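This issue shows up when reducing multiple pairs of NucleotideContigFragments. Spark's reduce operation requires the function to be commutative and associative, and reducePairs is not: it only accepts regions that are adjacent. Because Spark reduces each partition separately and may combine the partial results in any order, sorting the RDD first does not guarantee that every merge joins neighbors. For example, with 10 fragments that each cover a 10 kb ReferenceRegion, the reduce will sometimes try to join nonadjacent regions.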
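To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Region` and `mergeAdjacent` are hypothetical stand-ins for ReferenceRegion and reducePairs, not ADAM's actual code, and the two-element groups only imitate partitions.

```scala
import scala.util.Try

object ReduceOrderDemo {
  // Hypothetical stand-in for ReferenceRegion: a half-open interval [start, end).
  final case class Region(start: Long, end: Long)

  // Hypothetical stand-in for reducePairs: joins two (region, sequence) pairs,
  // but only when the left region ends exactly where the right one starts.
  def mergeAdjacent(a: (Region, String), b: (Region, String)): (Region, String) = {
    require(a._1.end == b._1.start, "Regions being joined must be adjacent.")
    (Region(a._1.start, b._1.end), a._2 + b._2)
  }

  // Four adjacent fragments, already sorted by start position.
  val fragments: Seq[(Region, String)] = Seq(
    (Region(0L, 2L), "AC"),
    (Region(2L, 4L), "GT"),
    (Region(4L, 6L), "TT"),
    (Region(6L, 8L), "GA")
  )

  // Imitate a distributed reduce: each group of two plays the role of a
  // partition and is reduced on its own, giving partials [0, 4) and [4, 8).
  val partials: Seq[(Region, String)] =
    fragments.grouped(2).map(_.reduceLeft(mergeAdjacent)).toSeq

  // Combining the partials in positional order succeeds.
  val inOrder: Try[(Region, String)] = Try(mergeAdjacent(partials(0), partials(1)))

  // Combining them in the opposite order fails, because the merge is not commutative.
  val swapped: Try[(Region, String)] = Try(mergeAdjacent(partials(1), partials(0)))
}
```

In this sketch `inOrder` is a success and `swapped` is a failure carrying the adjacency message, which is the same kind of order dependence a distributed reduce can hit.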

To solve this, I implemented a fix that first collects the sorted data to the driver and then applies Scala Array's reduceLeft operator, which merges strictly left to right, so the merged regions are always adjacent (shown below in NucleotideContigFragmentRDDFunctions). I'll submit a PR for it soon.

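      // Key each fragment by its region, drop fragments without one, keep those
      // that overlap the query region, and sort them by region.
      val refPairRDD: RDD[(ReferenceRegion, String)] = rdd.keyBy(ReferenceRegion(_))
        .filter(kv => kv._1.isDefined)
        .map(kv => (kv._1.get, kv._2))
        .filter(kv => kv._1.overlaps(region))
        .sortByKey()
        .map(kv => getString(kv))

      // Collect to the driver and merge left to right so each merge joins neighbors.
      val pair: (ReferenceRegion, String) = refPairRDD.collect().reduceLeft(reducePairs)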
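The sort-then-reduceLeft idea can be sketched on a local collection as follows. This is an illustration under assumptions rather than the ADAM implementation: `Region`, `mergeAdjacent`, and `referenceString` are hypothetical names, and `reduceLeftOption` is used in place of `reduceLeft` so that an empty input yields `None` instead of throwing.

```scala
object SortedReduceDemo {
  // Hypothetical stand-in for ReferenceRegion: a half-open interval [start, end).
  final case class Region(start: Long, end: Long)

  // Hypothetical stand-in for reducePairs: only adjacent regions may be joined.
  def mergeAdjacent(a: (Region, String), b: (Region, String)): (Region, String) = {
    require(a._1.end == b._1.start, "Regions being joined must be adjacent.")
    (Region(a._1.start, b._1.end), a._2 + b._2)
  }

  // Sort by start position, then merge strictly left to right so that every
  // merge joins the accumulated region with its immediate right-hand neighbor.
  def referenceString(fragments: Seq[(Region, String)]): Option[String] =
    fragments
      .sortBy(_._1.start)
      .reduceLeftOption(mergeAdjacent)
      .map(_._2)
}
```

For example, `SortedReduceDemo.referenceString` applied to the fragments `[4, 6) -> "TT"`, `[0, 2) -> "AC"`, `[2, 4) -> "GT"` in that shuffled order yields `Some("ACGTTT")`. The trade-off of this approach is that all overlapping fragments must fit in driver memory after the collect.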