New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Top level WrappedRDD or similar abstraction #1173

Closed
heuermh opened this Issue Sep 15, 2016 · 8 comments

Comments

Projects
3 participants
@heuermh
Copy link
Member

heuermh commented Sep 15, 2016

Thinking about implementing phenotype support, or sequence/slice/read as proposed in bdg-formats, it appears that GenomicRDD is not quite the top level abstraction that it should be.

I propose a new top level trait/interface/abstract class WrappedRDD (or a better name) that includes only

  val rdd: RDD[T]

  lazy val jrdd: JavaRDD[T] = {
    rdd.toJavaRDD()
  }

  def transform(tFn: RDD[T] => RDD[T]): U = {
    replaceRdd(tFn(rdd))
  }

  protected def replaceRdd(newRdd: RDD[T]): U

and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?) Then GenomicRDD adds the sequence dictionary, and so on.

This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.

@heuermh heuermh added the discussion label Sep 15, 2016

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Sep 15, 2016

I don't think the wrapped RDD pattern makes sense unless you have metadata that you want to package up with the RDD. That's the sole reason I wanted to move to the GenomicRDD pattern; we had lots of metadata we needed to access (Sequence/RecordGroupDictionary) that was expensive to aggregate and messy to pass around as tuples. Since then, there have been lots of other nice things that have dropped out of said pattern (being able to remove boilerplate around doing the region joins in the common case), but that all derives from having said SequenceDictionary metadata around. If you don't need the metadata wrapped up, then the wrapped RDD is cumbersome, since you need to wrap all RDD transformations in the transform function.

and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?)

Indeed; that comes in from here, but the concrete non-GenomicRDD implementation of that class is deprecated and is only intended to support the Flatten CLI. In the GenomicRDD hierarchy, that comes in from AvroGenomicRDD, which extends ADAMRDDFunctions.

This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.

If it's a trait and not an abstract class, you'll still be struggling with #1092. Well, specifically if implementing classes make use of multiple trait inheritance, that is.

@heuermh

This comment has been minimized.

Copy link
Member Author

heuermh commented Sep 15, 2016

Couldn't the methods in ADAMRDDFunctions could go into a top level abstract class, combined with the wrapping methods above? We kind of gave up on the idea of importing from Functions classes.

@akmorrow13

This comment has been minimized.

Copy link
Contributor

akmorrow13 commented Sep 28, 2016

Right now we can only support RDD's under the GenomicRDD wrapper. It would be nice to swap this out for IntervalRDD's.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Sep 28, 2016

IntervalRDD extends RDD, so GenomicRDD already supports IntervalRDD, no? That being said, GenomicRDD could probably be specialized for IntervalRDD (e.g., getReferenceRegion).

@akmorrow13

This comment has been minimized.

Copy link
Contributor

akmorrow13 commented Sep 29, 2016

The main thing we do with IntervalRDD is filterByInterval(), which there is no good query for in GenomicRdd.

@akmorrow13

This comment has been minimized.

Copy link
Contributor

akmorrow13 commented Oct 2, 2016

So we could modify IntervalRDD so it accepts [T:ClassTag] and we hard code ReferenceRegion instead of allowing it to be a ClassTag, which I am fine with. The only problem that still remains is Mango has no way to reference things by sample name/file. We need this to store multiple samples and differentiate them in the GUI.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jun 22, 2017

@heuermh With #1505 in flight, what's your current thought on this proposal?

@heuermh

This comment has been minimized.

Copy link
Member Author

heuermh commented Jun 22, 2017

Since reporting this, I've gone with having a sequence dictionary even when one doesn't necessarily make sense, so GenomicRDD as the top level is fine.

Note regarding your comment above, we've since removed the Flatten CLI, so perhaps more cleanup can be done. That concrete class moved to https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala#L175

@fnothaft fnothaft closed this Jun 22, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment