Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ReferenceRegion with a trait #513

Closed
tdanford opened this issue Nov 28, 2014 · 5 comments
Closed

Replace ReferenceRegion with a trait #513

tdanford opened this issue Nov 28, 2014 · 5 comments

Comments

@tdanford
Copy link
Contributor

I think we've talked about this on the ADAM call before, but I don't see a ticket for it...

This suggestion here is to replace the ReferenceRegion case class (and the corresponding ReferenceMapping trait and implementations) with a ReferenceRegion trait, and to drop the reference mapping stuff altogether.

On the plus side:

  • makes it easier to define new model classes, and use them for spatial calculations (coverage regions, joins, etc) without having to define a ReferenceMapping and take the GC hit associated with basically doubling the number of objects you're dealing with
  • avoid operations where you key each model class by a reference region, etc.

Negatives:

  • it might be hard (impossible?) to have model classes defined in Avro (e.g. AlignmentRecord, Genotype, Variant) extend this trait

Thoughts? (Ping @massie @fnothaft @laserson @carlyeks)

@fnothaft
Copy link
Member

fnothaft commented Feb 7, 2015

So, I've been thinking about this for a while. A few thoughts:

  • IMO, ReferenceMapping is really inconvenient to use. E.g., let's say I want to do a region join where one of my RDDs is RDD[(AlignmentRecord, Long)]. Instead of just doing rdd.keyBy(kv => ReferenceRegion(kv._1)), I have to define an implicit reference mapping for (AlignmentRecord, Long).
  • You can't actually get rid of the GC penalty anyways...? You still wind up using the ReferenceMapping[T] to create a ReferenceRegion object. This is especially true in the ShuffleRegionJoin, since you need a key to repartition.

I've been using the region join code a lot in avocado recently and I actually don't see any positives to ReferenceMapping. What was the original reason we decided to go down this route?

@fnothaft
Copy link
Member

fnothaft commented Feb 7, 2015

Also, the ReferenceMapping[T] approach is pretty restrictive if you're dealing with objects that map to multiple genomic regions (e.g., a chimeric read pair).

@laserson
Copy link
Contributor

laserson commented Feb 8, 2015

+1 overall

@fnothaft, your first issue is something I've been complaining about for a while, but expressed very clearly. Haven't thought about the read pair issue at all...

@tdanford, replacing ReferenceRegion with a trait won't make the Avro => ReferenceRegion mapping any worse. It already has to be done manually in the code (with the ReferenceMapping).

@fnothaft
Copy link
Member

fnothaft commented Feb 8, 2015

@laserson ja, I thought you felt similarly... I'm not sure how strongly @tdanford feels about it, but I think we're not too invested in the ReferenceMapping route to rewrite the relevant code (which is mostly the two region join implementations).

I'll also add that the implicits in the ReferenceMappingContext are cumbersome to use. When updating bigdatagenomics/avocado#127 for ADAM 0.15.1-SNAPSHOT, it took several hours to chase down diverging implicits, which is suboptimal. Part of this is because the relevant ReferenceMapping[T] classes are defined as implicit objects (and not pure objects), which means you can't explicitly set the implicit parameter on the region join calls if you do have diverging implicits. This is partly specific to the ReferenceMappingContext implementation, but I think it's preferable to strip out the ReferenceMapping[T] pattern instead of fixing the ReferenceMappingContext.

Also, IIRC, I think that rolling ReferenceRegion into a ReferenceMappable trait was inspired by the HasReferenceRegion trait from hammerlab/guacamole. That pattern makes sense in guacamole because they define their read/variant models differently. Since their read/variant models are Scala case classes, by tacking on the HasReferenceMapping trait they can largely bypass the need to have a ReferenceRegion model. @arahuja @ryan-williams, would my understanding be correct? We're "constrained" by what @laserson was pointing out:

@tdanford, replacing ReferenceRegion with a trait won't make the Avro => ReferenceRegion mapping any worse. It already has to be done manually in the code (with the ReferenceMapping).

@fnothaft
Copy link
Member

fnothaft commented Mar 3, 2015

Closed by #592.

@fnothaft fnothaft closed this as completed Mar 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants