New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate bdg-convert external conversion library proposal #1197

Closed
heuermh opened this Issue Oct 5, 2016 · 5 comments

Comments

Projects
2 participants
@heuermh
Member

heuermh commented Oct 5, 2016

@tomwhite @ryan-williams @fnothaft @massie

Ping for review of a proposal for a new external bdg-formats <--> { htsjdk, ga4gh, string, etc.} conversion library. I've migrated the repo from heuermh to the bigdatagenomics organization here

https://github.com/bigdatagenomics/bdg-convert

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 7, 2016

Member

Generally looks OK to me. I'm +0. A variety of nits:

  • I imagine that there's going to be lots of common logic that is duplicated across converters (e.g., the logic that is shared between AlignmentRecordToSamRecord and AlignmentRecordToFastqLine, etc). Since the classes here are specific to a single conversion sink/source pair, is the plan to factor those out into some private static classes?
  • Why Java? I'm weakly anti-Java, esp. since all of our other implementation code is in Scala.
  • I'm not entirely sure that splitting it out into a separate repo makes sense, since it is ultimately going to rely on some of our models (e.g., SAMRecord <-> AlignmentRecord needs RecordGroupDictionary access, IIRC).
  • There's a bit of weird package naming going on, e.g., org.bdgenomics.convert.bdgenomics, org.bdgenomics.convert.htsjdk. What's the reasoning behind this scheme? It reads really weird.
  • Can't we just use HTSJDK's validation stringency instead of this? https://github.com/bigdatagenomics/bdg-convert/blob/master/src/main/java/org/bdgenomics/convert/ConversionStringency.java
Member

fnothaft commented Oct 7, 2016

Generally looks OK to me. I'm +0. A variety of nits:

  • I imagine that there's going to be lots of common logic that is duplicated across converters (e.g., the logic that is shared between AlignmentRecordToSamRecord and AlignmentRecordToFastqLine, etc). Since the classes here are specific to a single conversion sink/source pair, is the plan to factor those out into some private static classes?
  • Why Java? I'm weakly anti-Java, esp. since all of our other implementation code is in Scala.
  • I'm not entirely sure that splitting it out into a separate repo makes sense, since it is ultimately going to rely on some of our models (e.g., SAMRecord <-> AlignmentRecord needs RecordGroupDictionary access, IIRC).
  • There's a bit of weird package naming going on, e.g., org.bdgenomics.convert.bdgenomics, org.bdgenomics.convert.htsjdk. What's the reasoning behind this scheme? It reads really weird.
  • Can't we just use HTSJDK's validation stringency instead of this? https://github.com/bigdatagenomics/bdg-convert/blob/master/src/main/java/org/bdgenomics/convert/ConversionStringency.java
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 7, 2016

Member

Thanks for the review.

There are more implementations here, including some with nested converters
https://github.com/heuermh/dishevelled-bio/tree/master/convert/src/main/java/org/dishevelled/bio/convert

For example, the VcfRecordToGenotype converter delegates to VcfRecordToVariant (see VcfRecordToGenotypes.java#L85).

Other shared code could be extracted to (package) private static classes.

I proposed to write it in Java because I have found calling Java from Scala to be less troublesome than calling Scala from Java, and some potential clients of this library are implemented in Java (e.g. my stuff above, GATK4, UCSD's https://github.com/biojava/biojava-spark, etc.)

A dependency on our Scala models may answer the question though, and the separate repo question as well.

I don't like the package naming either. Suggestions are welcome.

For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow say clients of the biojava package to exclude the htsjdk transitive dependency.

Member

heuermh commented Oct 7, 2016

Thanks for the review.

There are more implementations here, including some with nested converters
https://github.com/heuermh/dishevelled-bio/tree/master/convert/src/main/java/org/dishevelled/bio/convert

For example, the VcfRecordToGenotype converter delegates to VcfRecordToVariant (see VcfRecordToGenotypes.java#L85).

Other shared code could be extracted to (package) private static classes.

I proposed to write it in Java because I have found calling Java from Scala to be less troublesome than calling Scala from Java, and some potential clients of this library are implemented in Java (e.g. my stuff above, GATK4, UCSD's https://github.com/biojava/biojava-spark, etc.)

A dependency on our Scala models may answer the question though, and the separate repo question as well.

I don't like the package naming either. Suggestions are welcome.

For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow say clients of the biojava package to exclude the htsjdk transitive dependency.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 7, 2016

Member

A dependency on our Scala models may answer the question though, and the separate repo question as well.

We may be able to defer to the Avro companions to said models. E.g., org.bdgenomics.adam.models.SequenceRecord <-> org.bdgenomics.formats.avro.Contig, org.bdgenomics.adam.models.RecordGroup <-> org.bdgenomics.formats.avro.RecordGroupMetadata.

I don't like the package naming either. Suggestions are welcome.

What's the general goal of the present naming scheme? I would suggest something along the lines of org.bdgenomics.convert.<recordname>, similar to what we do with org.bdgenomics.adam.rdd.*. I know that has it's own problems, but I think it's a bit cleaner to organize by datatype.

For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow say clients of the biojava package to exclude the htsjdk transitive dependency.

I get your point, and agree in spirit, but it's pretty hard to avoid public third party API dependencies in a library that converts to/from third party formats. ;)

Member

fnothaft commented Oct 7, 2016

A dependency on our Scala models may answer the question though, and the separate repo question as well.

We may be able to defer to the Avro companions to said models. E.g., org.bdgenomics.adam.models.SequenceRecord <-> org.bdgenomics.formats.avro.Contig, org.bdgenomics.adam.models.RecordGroup <-> org.bdgenomics.formats.avro.RecordGroupMetadata.

I don't like the package naming either. Suggestions are welcome.

What's the general goal of the present naming scheme? I would suggest something along the lines of org.bdgenomics.convert.<recordname>, similar to what we do with org.bdgenomics.adam.rdd.*. I know that has it's own problems, but I think it's a bit cleaner to organize by datatype.

For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow say clients of the biojava package to exclude the htsjdk transitive dependency.

I get your point, and agree in spirit, but it's pretty hard to avoid public third party API dependencies in a library that converts to/from third party formats. ;)

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 9, 2016

Member

I would suggest something along the lines of org.bdgenomics.convert., similar to what we do with org.bdgenomics.adam.rdd.*.

While I don't care for the awkward package names, the current packaging is essential.

The only proper public API is in the convert package. Extensibility is provided by the public *Module classes in each third party dependency-specific package. The packages could be split out into separate Maven modules. Clients pick and choose which dependencies to enable by assembling modules when the injector is instantiated.

If packaging were done by recordname, then the module would have to refer to all the different third party classes, and there would be no extensibility.

Member

heuermh commented Oct 9, 2016

I would suggest something along the lines of org.bdgenomics.convert., similar to what we do with org.bdgenomics.adam.rdd.*.

While I don't care for the awkward package names, the current packaging is essential.

The only proper public API is in the convert package. Extensibility is provided by the public *Module classes in each third party dependency-specific package. The packages could be split out into separate Maven modules. Clients pick and choose which dependencies to enable by assembling modules when the injector is instantiated.

If packaging were done by recordname, then the module would have to refer to all the different third party classes, and there would be no extensibility.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 1, 2017

Member

bdg-convert version 0.1 was released to Maven Central on May 26 2017.
https://github.com/bigdatagenomics/bdg-convert/releases

Member

heuermh commented Jun 1, 2017

bdg-convert version 0.1 was released to Maven Central on May 26 2017.
https://github.com/bigdatagenomics/bdg-convert/releases

@heuermh heuermh closed this Jun 1, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment