New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ADAMContext loadXxx methods for consistency #1481

Closed
heuermh opened this Issue Apr 10, 2017 · 3 comments

Comments

Projects
1 participant
@heuermh
Member

heuermh commented Apr 10, 2017

Suggestions:

@param doc strings should be consistent
@return doc strings should be consistent
log.info messages across load methods should be consistent
Method parameters of type Path should be path
Method parameters of type String should be pathName
Method parameters of type Option should be named optXxx
Include StorageLevel parameter if cache() is called
All load methods should include stringency parameter with STRICT default
All load methods should handle lenient and silent stringency consistently
All load methods should instrument the RDD if Metrics.isRecording
All load methods should be wrapped in LoadXxx.time { metrics
Parquet load methods should include optPredicate and optProjection parameters with None defaults
Type-guessing methods should use methods similar to isVcfExt for consistency
Type-guessing methods should explicitly document support for compressed files
Type-guessing load methods should include optPredicate and optProjection parameters with None defaults

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Apr 12, 2017

Member
$ javap -classpath adam-core/target/adam-core_2.10-0.23.0-SNAPSHOT.jar org.bdgenomics.adam.rdd.ADAMContext | grep load | grep -v default | sort
  public <T extends org.apache.avro.specific.SpecificRecordBase> scala.collection.Seq<T> org$bdgenomics$adam$rdd$ADAMContext$$loadAvro(java.lang.String, org.apache.avro.Schema, scala.reflect.ClassTag<T>);
  public <T> org.apache.spark.rdd.RDD<T> loadParquet(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>, scala.Function1<T, org.apache.avro.specific.SpecificRecord>, scala.reflect.Manifest<T>);
  public org.bdgenomics.adam.models.RecordGroupDictionary loadAvroReadGroupMetadata(java.lang.String);
  public org.bdgenomics.adam.models.RecordGroupDictionary loadBamReadGroups(htsjdk.samtools.SAMFileHeader);
  public org.bdgenomics.adam.models.RecordGroupDictionary org$bdgenomics$adam$rdd$ADAMContext$$loadAvroReadGroupMetadataFile(java.lang.String);
  public org.bdgenomics.adam.models.SequenceDictionary loadAvroSequences(java.lang.String);
  public org.bdgenomics.adam.models.SequenceDictionary loadBamDictionary(htsjdk.samtools.SAMFileHeader);
  public org.bdgenomics.adam.models.SequenceDictionary org$bdgenomics$adam$rdd$ADAMContext$$loadAvroSequencesFile(java.lang.String);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadFasta(java.lang.String, long);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadParquetContigFragments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadSequences(java.lang.String, scala.Option<org.apache.avro.Schema>, long);
  public org.bdgenomics.adam.rdd.feature.CoverageRDD loadCoverage(java.lang.String);
  public org.bdgenomics.adam.rdd.feature.CoverageRDD loadParquetCoverage(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadBed(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadFeatures(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<org.apache.avro.Schema>, scala.Option<java.lang.Object>);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadGff3(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadGtf(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadIntervalList(java.lang.String, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadNarrowPeak(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadParquetFeatures(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadFragments(java.lang.String);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadInterleavedFastqAsFragments(java.lang.String);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadParquetFragments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadAlignments(java.lang.String, scala.Option<org.apache.avro.Schema>, scala.Option<java.lang.String>, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadBam(java.lang.String, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadFastq(java.lang.String, scala.Option<java.lang.String>, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadIndexedBam(java.lang.String, org.bdgenomics.adam.models.ReferenceRegion);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadIndexedBam(java.lang.String, scala.collection.Iterable<org.bdgenomics.adam.models.ReferenceRegion>, scala.Predef$DummyImplicit);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadInterleavedFastq(java.lang.String);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadPairedFastq(java.lang.String, java.lang.String, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadParquetAlignments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadUnpairedFastq(java.lang.String, scala.Option<java.lang.String>, boolean, boolean, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.GenotypeRDD loadGenotypes(java.lang.String, scala.Option<org.apache.avro.Schema>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.GenotypeRDD loadParquetGenotypes(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadIndexedVcf(java.lang.String, org.bdgenomics.adam.models.ReferenceRegion);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadIndexedVcf(java.lang.String, scala.collection.Iterable<org.bdgenomics.adam.models.ReferenceRegion>, htsjdk.samtools.ValidationStringency, scala.Predef$DummyImplicit);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadVcf(java.lang.String, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.VariantRDD loadParquetVariants(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.variant.VariantRDD loadVariants(java.lang.String, scala.Option<org.apache.avro.Schema>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.util.ReferenceFile loadReferenceFile(java.lang.String, long);
  public scala.Tuple3<org.bdgenomics.adam.models.SequenceDictionary, scala.collection.Seq<org.bdgenomics.formats.avro.Sample>, scala.collection.Seq<htsjdk.variant.vcf.VCFHeaderLine>> loadVcfMetadata(java.lang.String);
  public scala.Tuple3<org.bdgenomics.adam.models.SequenceDictionary, scala.collection.Seq<org.bdgenomics.formats.avro.Sample>, scala.collection.Seq<htsjdk.variant.vcf.VCFHeaderLine>> org$bdgenomics$adam$rdd$ADAMContext$$loadSingleVcfMetadata(java.lang.String);
  public scala.collection.Seq<org.bdgenomics.formats.avro.Sample> loadAvroSampleMetadata(java.lang.String);
Member

heuermh commented Apr 12, 2017

$ javap -classpath adam-core/target/adam-core_2.10-0.23.0-SNAPSHOT.jar org.bdgenomics.adam.rdd.ADAMContext | grep load | grep -v default | sort
  public <T extends org.apache.avro.specific.SpecificRecordBase> scala.collection.Seq<T> org$bdgenomics$adam$rdd$ADAMContext$$loadAvro(java.lang.String, org.apache.avro.Schema, scala.reflect.ClassTag<T>);
  public <T> org.apache.spark.rdd.RDD<T> loadParquet(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>, scala.Function1<T, org.apache.avro.specific.SpecificRecord>, scala.reflect.Manifest<T>);
  public org.bdgenomics.adam.models.RecordGroupDictionary loadAvroReadGroupMetadata(java.lang.String);
  public org.bdgenomics.adam.models.RecordGroupDictionary loadBamReadGroups(htsjdk.samtools.SAMFileHeader);
  public org.bdgenomics.adam.models.RecordGroupDictionary org$bdgenomics$adam$rdd$ADAMContext$$loadAvroReadGroupMetadataFile(java.lang.String);
  public org.bdgenomics.adam.models.SequenceDictionary loadAvroSequences(java.lang.String);
  public org.bdgenomics.adam.models.SequenceDictionary loadBamDictionary(htsjdk.samtools.SAMFileHeader);
  public org.bdgenomics.adam.models.SequenceDictionary org$bdgenomics$adam$rdd$ADAMContext$$loadAvroSequencesFile(java.lang.String);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadFasta(java.lang.String, long);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadParquetContigFragments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.contig.NucleotideContigFragmentRDD loadSequences(java.lang.String, scala.Option<org.apache.avro.Schema>, long);
  public org.bdgenomics.adam.rdd.feature.CoverageRDD loadCoverage(java.lang.String);
  public org.bdgenomics.adam.rdd.feature.CoverageRDD loadParquetCoverage(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadBed(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadFeatures(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<org.apache.avro.Schema>, scala.Option<java.lang.Object>);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadGff3(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadGtf(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadIntervalList(java.lang.String, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadNarrowPeak(java.lang.String, scala.Option<org.apache.spark.storage.StorageLevel>, scala.Option<java.lang.Object>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.feature.FeatureRDD loadParquetFeatures(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadFragments(java.lang.String);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadInterleavedFastqAsFragments(java.lang.String);
  public org.bdgenomics.adam.rdd.fragment.FragmentRDD loadParquetFragments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadAlignments(java.lang.String, scala.Option<org.apache.avro.Schema>, scala.Option<java.lang.String>, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadBam(java.lang.String, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadFastq(java.lang.String, scala.Option<java.lang.String>, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadIndexedBam(java.lang.String, org.bdgenomics.adam.models.ReferenceRegion);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadIndexedBam(java.lang.String, scala.collection.Iterable<org.bdgenomics.adam.models.ReferenceRegion>, scala.Predef$DummyImplicit);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadInterleavedFastq(java.lang.String);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadPairedFastq(java.lang.String, java.lang.String, scala.Option<java.lang.String>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadParquetAlignments(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.read.AlignmentRecordRDD loadUnpairedFastq(java.lang.String, scala.Option<java.lang.String>, boolean, boolean, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.GenotypeRDD loadGenotypes(java.lang.String, scala.Option<org.apache.avro.Schema>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.GenotypeRDD loadParquetGenotypes(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadIndexedVcf(java.lang.String, org.bdgenomics.adam.models.ReferenceRegion);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadIndexedVcf(java.lang.String, scala.collection.Iterable<org.bdgenomics.adam.models.ReferenceRegion>, htsjdk.samtools.ValidationStringency, scala.Predef$DummyImplicit);
  public org.bdgenomics.adam.rdd.variant.VariantContextRDD loadVcf(java.lang.String, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.rdd.variant.VariantRDD loadParquetVariants(java.lang.String, scala.Option<org.apache.parquet.filter2.predicate.FilterPredicate>, scala.Option<org.apache.avro.Schema>);
  public org.bdgenomics.adam.rdd.variant.VariantRDD loadVariants(java.lang.String, scala.Option<org.apache.avro.Schema>, htsjdk.samtools.ValidationStringency);
  public org.bdgenomics.adam.util.ReferenceFile loadReferenceFile(java.lang.String, long);
  public scala.Tuple3<org.bdgenomics.adam.models.SequenceDictionary, scala.collection.Seq<org.bdgenomics.formats.avro.Sample>, scala.collection.Seq<htsjdk.variant.vcf.VCFHeaderLine>> loadVcfMetadata(java.lang.String);
  public scala.Tuple3<org.bdgenomics.adam.models.SequenceDictionary, scala.collection.Seq<org.bdgenomics.formats.avro.Sample>, scala.collection.Seq<htsjdk.variant.vcf.VCFHeaderLine>> org$bdgenomics$adam$rdd$ADAMContext$$loadSingleVcfMetadata(java.lang.String);
  public scala.collection.Seq<org.bdgenomics.formats.avro.Sample> loadAvroSampleMetadata(java.lang.String);
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Apr 12, 2017

Member

Per https://github.com/databricks/scala-style-guide#default-parameter-values perhaps it would be better to not use default parameter values at all.

Member

heuermh commented Apr 12, 2017

Per https://github.com/databricks/scala-style-guide#default-parameter-values perhaps it would be better to not use default parameter values at all.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Apr 12, 2017

Member

These methods both need default values, and that is disallowed by the scala compiler

error: in class ADAMContext, multiple overloaded alternatives of loadIndexedVcf define default arguments

def loadIndexedVcf(
  pathName: String,
  viewRegion: ReferenceRegion,
  stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD = {
  loadIndexedVcf(pathName, Iterable(viewRegion))
}

def loadIndexedVcf(
  pathName: String,
  viewRegions: Iterable[ReferenceRegion],
  stringency: ValidationStringency = ValidationStringency.STRICT)(implicit s: DummyImplicit): VariantContextRDD = {
  // ...
Member

heuermh commented Apr 12, 2017

These methods both need default values, and that is disallowed by the scala compiler

error: in class ADAMContext, multiple overloaded alternatives of loadIndexedVcf define default arguments

def loadIndexedVcf(
  pathName: String,
  viewRegion: ReferenceRegion,
  stringency: ValidationStringency = ValidationStringency.STRICT): VariantContextRDD = {
  loadIndexedVcf(pathName, Iterable(viewRegion))
}

def loadIndexedVcf(
  pathName: String,
  viewRegions: Iterable[ReferenceRegion],
  stringency: ValidationStringency = ValidationStringency.STRICT)(implicit s: DummyImplicit): VariantContextRDD = {
  // ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment