Top level WrappedRDD or similar abstraction #1173

heuermh · 2016-09-15T16:14:28Z

Thinking about implementing phenotype support, or sequence/slice/read as proposed in bdg-formats, it appears that GenomicRDD is not quite the top level abstraction that it should be.

I propose a new top level trait/interface/abstract class WrappedRDD (or a better name) that includes only

  val rdd: RDD[T]

  lazy val jrdd: JavaRDD[T] = {
    rdd.toJavaRDD()
  }

  def transform(tFn: RDD[T] => RDD[T]): U = {
    replaceRdd(tFn(rdd))
  }

  protected def replaceRdd(newRdd: RDD[T]): U

and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?) Then GenomicRDD adds the sequence dictionary, and so on.

This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.
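
For readers following along, here is a minimal sketch of how the proposed layering could look. It is an illustration under stated assumptions, not ADAM's actual code: `RDD` is aliased to `Vector` so the snippet runs without Spark (which is also why `jrdd` is left out), and `PhenotypeRDD`, `ReadRDD`, and the `sequences` field are hypothetical stand-ins.

```scala
object WrappedRddSketch {
  // Assumption: stand-in for org.apache.spark.rdd.RDD so the sketch runs without Spark.
  type RDD[T] = Vector[T]

  // Proposed top level: only the wrapped RDD plus transform/replaceRdd.
  abstract class WrappedRDD[T, U <: WrappedRDD[T, U]] {
    val rdd: RDD[T]

    // Apply a function to the underlying RDD and rewrap the result.
    def transform(tFn: RDD[T] => RDD[T]): U = replaceRdd(tFn(rdd))

    protected def replaceRdd(newRdd: RDD[T]): U
  }

  // Hypothetical wrapper that needs no sequence dictionary.
  case class PhenotypeRDD(rdd: RDD[String])
      extends WrappedRDD[String, PhenotypeRDD] {
    protected def replaceRdd(newRdd: RDD[String]): PhenotypeRDD =
      copy(rdd = newRdd)
  }

  // The GenomicRDD layer adds the sequence dictionary on top.
  abstract class GenomicRDD[T, U <: GenomicRDD[T, U]] extends WrappedRDD[T, U] {
    // Assumption: Seq[String] stands in for ADAM's SequenceDictionary.
    val sequences: Seq[String]
  }

  // Hypothetical genomic wrapper; the dictionary survives every transform.
  case class ReadRDD(rdd: RDD[String], sequences: Seq[String])
      extends GenomicRDD[String, ReadRDD] {
    protected def replaceRdd(newRdd: RDD[String]): ReadRDD =
      copy(rdd = newRdd)
  }
}
```

With this shape, `PhenotypeRDD(Vector("a", "b")).transform(_.filter(_ == "a"))` returns a `PhenotypeRDD`, and the same call on a `ReadRDD` returns a `ReadRDD` with its `sequences` unchanged.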

fnothaft · 2016-09-15T16:34:39Z

I don't think the wrapped RDD pattern makes sense unless you have metadata that you want to package up with the RDD. That's the sole reason I wanted to move to the GenomicRDD pattern; we had lots of metadata we needed to access (Sequence/RecordGroupDictionary) that was expensive to aggregate and messy to pass around as tuples. Since then, there have been lots of other nice things that have dropped out of said pattern (being able to remove boilerplate around doing the region joins in the common case), but that all derives from having said SequenceDictionary metadata around. If you don't need the metadata wrapped up, then the wrapped RDD is cumbersome, since you need to wrap all RDD transformations in the transform function.
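
The trade-off described above can be sketched as follows. This is a simplified illustration, not ADAM code: `Wrapped` is a hypothetical class, `Vector` stands in for a Spark RDD, and `sequences` stands in for the SequenceDictionary metadata.

```scala
object TransformOverheadSketch {
  // Assumption: Vector stands in for a Spark RDD.
  type RDD[T] = Vector[T]

  // Hypothetical wrapper that carries metadata alongside the data.
  case class Wrapped(rdd: RDD[Int], sequences: Seq[String]) {
    def transform(tFn: RDD[Int] => RDD[Int]): Wrapped = copy(rdd = tFn(rdd))
  }

  val wrapped = Wrapped(Vector(1, 2, 3, 4), Seq("chr1"))

  // Plain RDD: operations chain directly, but the metadata is left behind.
  val plain: RDD[Int] = wrapped.rdd.filter(_ % 2 == 0).map(_ * 10)

  // Wrapped RDD: every step goes through transform; the metadata rides along.
  val viaTransform: Wrapped =
    wrapped.transform(_.filter(_ % 2 == 0)).transform(_.map(_ * 10))
}
```

Both paths produce the same elements. The extra `transform` calls only pay for themselves when there is metadata worth keeping attached.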

> and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?)

Indeed; that comes in from here, but the concrete non-GenomicRDD implementation of that class is deprecated and is only intended to support the Flatten CLI. In the GenomicRDD hierarchy, that comes in from AvroGenomicRDD, which extends ADAMRDDFunctions.

> This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.

If it's a trait and not an abstract class, you'll still be struggling with #1092. Well, specifically if implementing classes make use of multiple trait inheritance, that is.

heuermh · 2016-09-15T16:50:36Z

Couldn't the methods in ADAMRDDFunctions go into a top level abstract class, combined with the wrapping methods above? We kind of gave up on the idea of importing from Functions classes.

akmorrow13 · 2016-09-28T17:49:03Z

Right now we can only support RDDs under the GenomicRDD wrapper. It would be nice to swap this out for IntervalRDDs.

fnothaft · 2016-09-28T17:51:13Z

IntervalRDD extends RDD, so GenomicRDD already supports IntervalRDD, no? That being said, GenomicRDD could probably be specialized for IntervalRDD (e.g., getReferenceRegion).

akmorrow13 · 2016-09-29T18:57:51Z

The main thing we do with IntervalRDD is filterByInterval(), for which there is no good query in GenomicRDD.
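
As a rough sketch of what such a query could look like on a wrapper: everything here is hypothetical. `Region` is a simplified stand-in for ADAM's ReferenceRegion (assumed half-open), `Wrapped` and `regionFn` are invented for illustration, and `Vector` again stands in for an RDD.

```scala
object FilterByIntervalSketch {
  // Assumption: simplified stand-in for ReferenceRegion, half-open [start, end).
  case class Region(name: String, start: Long, end: Long) {
    def overlaps(other: Region): Boolean =
      name == other.name && start < other.end && other.start < end
  }

  // Hypothetical wrapper that knows how to map each element to a region.
  case class Wrapped[T](rdd: Vector[T], regionFn: T => Region) {
    // Keep only the elements whose region overlaps the query.
    def filterByInterval(query: Region): Wrapped[T] =
      copy(rdd = rdd.filter(elem => regionFn(elem).overlaps(query)))
  }
}
```

Note that this is a linear scan over every element. An interval-indexed structure would presumably avoid that, which is the gap being pointed out here.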

akmorrow13 · 2016-10-02T17:06:21Z

So we could modify IntervalRDD so it accepts [T:ClassTag] and we hard code ReferenceRegion instead of allowing it to be a ClassTag, which I am fine with. The only problem that still remains is Mango has no way to reference things by sample name/file. We need this to store multiple samples and differentiate them in the GUI.

fnothaft · 2017-06-22T20:11:28Z

@heuermh With #1505 in flight, what's your current thought on this proposal?

heuermh · 2017-06-22T20:26:22Z

Since reporting this, I've gone with having a sequence dictionary even when one doesn't necessarily make sense, so GenomicRDD as the top level is fine.

Note regarding your comment above, we've since removed the Flatten CLI, so perhaps more cleanup can be done. That concrete class moved to https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala#L175

heuermh added the discussion label Sep 15, 2016

heuermh mentioned this issue Sep 26, 2016

Swapping out IntervalRDD under GenomicRDD's #1184

Closed

fnothaft closed this as completed Jun 22, 2017

heuermh mentioned this issue Jun 28, 2017

Adding read, sequence, and slice RDDs #1505

Closed

heuermh modified the milestone: 0.23.0 Jul 22, 2017
