Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Top level WrappedRDD or similar abstraction #1173

Closed
heuermh opened this issue Sep 15, 2016 · 8 comments
Closed

Top level WrappedRDD or similar abstraction #1173

heuermh opened this issue Sep 15, 2016 · 8 comments
Milestone

Comments

@heuermh
Copy link
Member

heuermh commented Sep 15, 2016

Thinking about implementing phenotype support, or sequence/slice/read as proposed in bdg-formats, it appears that GenomicRDD is not quite the top level abstraction that it should be.

I propose a new top level trait/interface/abstract class WrappedRDD (or a better name) that includes only

  val rdd: RDD[T]

  lazy val jrdd: JavaRDD[T] = {
    rdd.toJavaRDD()
  }

  def transform(tFn: RDD[T] => RDD[T]): U = {
    replaceRdd(tFn(rdd))
  }

  protected def replaceRdd(newRdd: RDD[T]): U

and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?) Then GenomicRDD adds the sequence dictionary, and so on.

This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.

@fnothaft
Copy link
Member

I don't think the wrapped RDD pattern makes sense unless you have metadata that you want to package up with the RDD. That's the sole reason I wanted to move to the GenomicRDD pattern; we had lots of metadata we needed to access (Sequence/RecordGroupDictionary) that was expensive to aggregate and messy to pass around as tuples. Since then, there have been lots of other nice things that have dropped out of said pattern (being able to remove boilerplate around doing the region joins in the common case), but that all derives from having said SequenceDictionary metadata around. If you don't need the metadata wrapped up, then the wrapped RDD is cumbersome, since you need to wrap all RDD transformations in the transform function.

and perhaps some of the saveAsParquet stuff (not exactly sure, does that come in from ADAMRDDFunctions?)

Indeed; that comes in from here, but the concrete non-GenomicRDD implementation of that class is deprecated and is only intended to support the Flatten CLI. In the GenomicRDD hierarchy, that comes in from AvroGenomicRDD, which extends ADAMRDDFunctions.

This would give us something to extend from when a sequence dictionary is not required and a place to put code level documentation about the wrapped-RDD pattern that we've established. It may also help address #1092.

If it's a trait and not an abstract class, you'll still be struggling with #1092. Well, specifically if implementing classes make use of multiple trait inheritance, that is.

@heuermh
Copy link
Member Author

heuermh commented Sep 15, 2016

Couldn't the methods in ADAMRDDFunctions could go into a top level abstract class, combined with the wrapping methods above? We kind of gave up on the idea of importing from Functions classes.

@akmorrow13
Copy link
Contributor

Right now we can only support RDD's under the GenomicRDD wrapper. It would be nice to swap this out for IntervalRDD's.

@fnothaft
Copy link
Member

IntervalRDD extends RDD, so GenomicRDD already supports IntervalRDD, no? That being said, GenomicRDD could probably be specialized for IntervalRDD (e.g., getReferenceRegion).

@akmorrow13
Copy link
Contributor

The main thing we do with IntervalRDD is filterByInterval(), which there is no good query for in GenomicRdd.

@akmorrow13
Copy link
Contributor

So we could modify IntervalRDD so it accepts [T:ClassTag] and we hard code ReferenceRegion instead of allowing it to be a ClassTag, which I am fine with. The only problem that still remains is Mango has no way to reference things by sample name/file. We need this to store multiple samples and differentiate them in the GUI.

@fnothaft
Copy link
Member

@heuermh With #1505 in flight, what's your current thought on this proposal?

@heuermh
Copy link
Member Author

heuermh commented Jun 22, 2017

Since reporting this, I've gone with having a sequence dictionary even when one doesn't necessarily make sense, so GenomicRDD as the top level is fine.

Note regarding your comment above, we've since removed the Flatten CLI, so perhaps more cleanup can be done. That concrete class moved to https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala#L175

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants