New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified CalculateDepth to calcuate coverage from alignment files #1108

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
4 participants
@akmorrow13
Contributor

akmorrow13 commented Aug 9, 2016

  • Calculates coverage from reads files on command line
  • Creates CoverageRDD from FeatureRDD
Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @param position Specifies position of coverage
* @param count Specifies count of coverage at location
*/
case class Coverage(referenceName: String, position: Long, count: Double)

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

There is already a Coverage class. See here. We do not seem to use this/ these are very different. Is there anything difference I can name my Coverage?

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

There is already a Coverage class. See here. We do not seem to use this/ these are very different. Is there anything difference I can name my Coverage?

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
val t: List[Long] = List.range(r.getStart, r.getEnd)
val m: List[(ReferenceRegion, Int)] = t.map(n => (ReferenceRegion(r.getContigName, n, n + 1), 1))
m
}).reduceByKey(_ + _).sortByKey()

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

Currently, this is storing an individual record for each position in the genome. I would like to run length encode this using ReferenceRegion. What would be the most efficient way to group by ReferenceRegion and count to merge adjacent regions?

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

Currently, this is storing an individual record for each position in the genome. I would like to run length encode this using ReferenceRegion. What would be the most efficient way to group by ReferenceRegion and count to merge adjacent regions?

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member
import scala.annotation.tailrec
@tailrec def merge(iter: Iterator[(ReferenceRegion, Int)],
                                 lastRegion: ReferenceRegion = <nonsense_val>,
                                 lastCoverage: Int = -1,
                                 condensed: List[(ReferenceRegion, Int)]): Iterator[(ReferenceRegion, Int)] = {
  if (!iter.hasNext) {
    condensed.toIterator
  } else {
    val (rr, cov) = iter.next
    val (nextRegion, nextCoverage, nextCondensed) = if (rr.adjacent(lastRegion) && lastCoverage == nextCoverage) {
     (rr.collapseAdjacent(lastRegion), lastCoverage, condensed)
    } else {
      (rr, cov, (lastRegion, lastCoverage) :: condensed)
    }
    merge(iter, nextRegion, nextCoverage, nextCondensed)
  }
}

coverage.mapPartitions(merge)
@fnothaft

fnothaft Aug 10, 2016

Member
import scala.annotation.tailrec
@tailrec def merge(iter: Iterator[(ReferenceRegion, Int)],
                                 lastRegion: ReferenceRegion = <nonsense_val>,
                                 lastCoverage: Int = -1,
                                 condensed: List[(ReferenceRegion, Int)]): Iterator[(ReferenceRegion, Int)] = {
  if (!iter.hasNext) {
    condensed.toIterator
  } else {
    val (rr, cov) = iter.next
    val (nextRegion, nextCoverage, nextCondensed) = if (rr.adjacent(lastRegion) && lastCoverage == nextCoverage) {
     (rr.collapseAdjacent(lastRegion), lastCoverage, condensed)
    } else {
      (rr, cov, (lastRegion, lastCoverage) :: condensed)
    }
    merge(iter, nextRegion, nextCoverage, nextCondensed)
  }
}

coverage.mapPartitions(merge)

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Also, .cache() before sortByKey().

@fnothaft

fnothaft Aug 10, 2016

Member

Also, .cache() before sortByKey().

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 9, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1373/

Build result: FAILURE

GitHub pull request #1108 of commit c773387 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1108/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0c70d48af29b2cc5f12301fad0c6323369b7cfa0 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1108/merge^{commit} # timeout=10Checking out Revision 0c70d48af29b2cc5f12301fad0c6323369b7cfa0 (origin/pr/1108/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0c70d48af29b2cc5f12301fad0c6323369b7cfa0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Aug 9, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1373/

Build result: FAILURE

GitHub pull request #1108 of commit c773387 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1108/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0c70d48af29b2cc5f12301fad0c6323369b7cfa0 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1108/merge^{commit} # timeout=10Checking out Revision 0c70d48af29b2cc5f12301fad0c6323369b7cfa0 (origin/pr/1108/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0c70d48af29b2cc5f12301fad0c6323369b7cfa0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* equal size.
*
* @param region ReferenceRegion to fetch overlapping coverage from
* @param bpPerBin base pairs per bin, number of bases to combine to one bin

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

This function does a mod for bpPerBin instead of aggregating the bin. However, I couldn't think of how to computing the average size of coverage bins without shuffling. Is this ok?

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

This function does a mod for bpPerBin instead of aggregating the bin. However, I couldn't think of how to computing the average size of coverage bins without shuffling. Is this ok?

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

I have added the option for either aggregation or taking the first element in the bin. There are use cases for both. The use case for no aggregation is visualization, because aggregation provides more information than is required and incurs high overhead. Aggregation of coverage is useful for algorithms that bin coverage by some heuristic.

@akmorrow13

akmorrow13 Aug 9, 2016

Contributor

I have added the option for either aggregation or taking the first element in the bin. There are use cases for both. The use case for no aggregation is visualization, because aggregation provides more information than is required and incurs high overhead. Aggregation of coverage is useful for algorithms that bin coverage by some heuristic.

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

I don't think you can do the binning without shuffling, so that's OK to me.

@fnothaft

fnothaft Aug 10, 2016

Member

I don't think you can do the binning without shuffling, so that's OK to me.

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Can you document aggregatePerBin? I don't intuitively grok what that parameter does.

@fnothaft

fnothaft Aug 10, 2016

Member

Can you document aggregatePerBin? I don't intuitively grok what that parameter does.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 9, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1374/
Test PASSed.

AmplabJenkins commented Aug 9, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1374/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 9, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1375/
Test PASSed.

AmplabJenkins commented Aug 9, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1375/
Test PASSed.

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
var inputPath: String = null
@Argument(required = true, metaVar = "OUTPUT", usage = "Location to write the coverage data in ADAM/Parquet format", index = 1)
var outputPath: String = null
@Args4jOption(required = false, name = "-negativeStrands", usage = "Compute coverage for negative strands")

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Is there a big use case for stranded coverage info?

@fnothaft

fnothaft Aug 10, 2016

Member

Is there a big use case for stranded coverage info?

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Also add "only" to both of these options.

@fnothaft

fnothaft Aug 10, 2016

Member

Also add "only" to both of these options.

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 10, 2016

Contributor

The use case for stranded coverage is for featurization of coverage for DNase in many algorithms. These require separate featurization for negative and positive strands.

@akmorrow13

akmorrow13 Aug 10, 2016

Contributor

The use case for stranded coverage is for featurization of coverage for DNase in many algorithms. These require separate featurization for negative and positive strands.

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
// if negative strand, save coverage of only negative strands
if (args.negativeStrands) {
// count sites for all strands
val negativeReadsRDD: AlignmentRecordRDD = AlignedReadRDD(readsRDD.rdd.filter(_.getReadNegativeStrand), readsRDD.sequences, readsRDD.recordGroups)

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

prefer readsRDD.transform() instead

@fnothaft

fnothaft Aug 10, 2016

Member

prefer readsRDD.transform() instead

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
val readsRDD: AlignmentRecordRDD = sc.loadAlignments(args.inputPath, projection = Some(proj))
// if strand direction is not specified, save unified coverage
if (!args.negativeStrands && !args.positiveStrands) {

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Mayhaps refactor to:

val readsRdd = sc.loadAlignments(args.inputPath, projection = Some(proj))

val finalReads = if (args.negativeStrands && !args.positiveStrands) {
  readsRdd.transform(rdd => rdd.filter(_.getReadNegativeStrand)
} else if (!args.negativeStrands && args.positiveStrands) {
  readsRdd.transform(rdd => rdd.filter(!_.getReadNegativeStrand)
} else {
  readsRdd
}

readsRdd.toCoverage
  .save(args.outputPath)

Tis a bit more concise.

@fnothaft

fnothaft Aug 10, 2016

Member

Mayhaps refactor to:

val readsRdd = sc.loadAlignments(args.inputPath, projection = Some(proj))

val finalReads = if (args.negativeStrands && !args.positiveStrands) {
  readsRdd.transform(rdd => rdd.filter(_.getReadNegativeStrand)
} else if (!args.negativeStrands && args.positiveStrands) {
  readsRdd.transform(rdd => rdd.filter(!_.getReadNegativeStrand)
} else {
  readsRdd
}

readsRdd.toCoverage
  .save(args.outputPath)

Tis a bit more concise.

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Also, in ADAM, we generally prefer lowercase [rR]dd in variable names.

@fnothaft

fnothaft Aug 10, 2016

Member

Also, in ADAM, we generally prefer lowercase [rR]dd in variable names.

Show outdated Hide outdated adam-cli/src/test/scala/org/bdgenomics/adam/cli/Reads2CoverageSuite.scala
val pointCoverage = coverage.rdd.filter(_.position == 30).first
assert(pointCoverage.count == 5)
}
}

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Add newline at EOF

@fnothaft

fnothaft Aug 10, 2016

Member

Add newline at EOF

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
}
}
}
}

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Add newline at EOF

@fnothaft

fnothaft Aug 10, 2016

Member

Add newline at EOF

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
}).filter(r => r.position >= region.start && r.position < region.end) // filter out positions from flanking regions
if (bpPerBin == 1) {

This comment has been minimized.

@fnothaft

fnothaft Aug 10, 2016

Member

Nit: Extra whitespace

@fnothaft

fnothaft Aug 10, 2016

Member

Nit: Extra whitespace

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
var outputPath: String = null
@Args4jOption(required = false, name = "-negativeStrands", usage = "Compute coverage for negative strands")
var negativeStrands: Boolean = false
@Args4jOption(required = false, name = "-positiveStrands", usage = "Compute coverage for positive strands")

This comment has been minimized.

@heuermh

heuermh Aug 10, 2016

Member

We're using -positive_strands (lowercase with underscores) elsewhere

@heuermh

heuermh Aug 10, 2016

Member

We're using -positive_strands (lowercase with underscores) elsewhere

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
/**
* Java friendly save function. Automatically detects the output format.
*
* If the filename ends in ".bed", we write a BED file. If the file name ends

This comment has been minimized.

@heuermh

heuermh Aug 10, 2016

Member

This doc is maybe not so useful, a link to FeatureRDD.save might be more appropriate.
Or a table of examples, showing what the coverage features look like in each format.

@heuermh

heuermh Aug 10, 2016

Member

This doc is maybe not so useful, a link to FeatureRDD.save might be more appropriate.
Or a table of examples, showing what the coverage features look like in each format.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @param bpPerBin base pairs per bin, number of bases to combine to one bin
* @return RDD of Coverage Records
*/
def getCoverage(region: ReferenceRegion, bpPerBin: Int = 1, aggregateBins: Boolean = false): RDD[Coverage] = {

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 10, 2016

Contributor

split into 2 functions

@akmorrow13

akmorrow13 Aug 10, 2016

Contributor

split into 2 functions

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 10, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1378/
Test PASSed.

AmplabJenkins commented Aug 10, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1378/
Test PASSed.

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
/**
* Reads2Coverage (accessible as the command 'coverage' through the CLI) takes two arguments,
* an Read file and an output file, and calculates the number of reads from the Read file
* at every location in the file. Optional arguments are negativeStrands, and positiveStrands,

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

argument names should match below

@heuermh

heuermh Aug 11, 2016

Member

argument names should match below

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
}
/**
* Coverage Recordfor CoverageRDD

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

Recordfor → Record for, end with period

@heuermh

heuermh Aug 11, 2016

Member

Recordfor → Record for, end with period

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Lowercase record, as well.

@fnothaft

fnothaft Aug 11, 2016

Member

Lowercase record, as well.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, thoughts on moving the Coverage case class out to org.bdgenomics.adam.models?

@fnothaft

fnothaft Aug 11, 2016

Member

Also, thoughts on moving the Coverage case class out to org.bdgenomics.adam.models?

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

Sure, if there were conversions in both directions (Feature <--> Coverage)

@heuermh

heuermh Aug 11, 2016

Member

Sure, if there were conversions in both directions (Feature <--> Coverage)

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Yeah I feel like there should be an object Coverage with methods private[adam] def apply(feature: Feature) and private[adam] def apply(rdd: RDD[Feature]) and then a def toFeature: Feature method in the case class.

@fnothaft

fnothaft Aug 11, 2016

Member

Yeah I feel like there should be an object Coverage with methods private[adam] def apply(feature: Feature) and private[adam] def apply(rdd: RDD[Feature]) and then a def toFeature: Feature method in the case class.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @param end Specifies end position of coverage
* @param count Specifies count of coverage at location
*/
case class Coverage(contigName: String, start: Long, end: Long, count: Double)

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

I'm thinking Coverage here and CoverageRDD cover (sorry) your use case appropriately.

This case class and an RDD of continuous values over regions could be made more generic though, if only in name, and in turn support wiggle/bigWig/bedGraph formats. I'm not feeling very creative at the moment, there must be something better than ContinuousValuedFeatureRDD. Something for a future feature request?

@heuermh

heuermh Aug 11, 2016

Member

I'm thinking Coverage here and CoverageRDD cover (sorry) your use case appropriately.

This case class and an RDD of continuous values over regions could be made more generic though, if only in name, and in turn support wiggle/bigWig/bedGraph formats. I'm not feeling very creative at the moment, there must be something better than ContinuousValuedFeatureRDD. Something for a future feature request?

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

cover (sorry)

rimshot

This case class and an RDD of continuous values over regions could be made more generic though, if only in name, and in turn support wiggle/bigWig/bedGraph formats. I'm not feeling very creative at the moment, there must be something better than ContinuousValuedFeatureRDD. Something for a future feature request?

+1! Let's open a ticket after merging?

@fnothaft

fnothaft Aug 11, 2016

Member

cover (sorry)

rimshot

This case class and an RDD of continuous values over regions could be made more generic though, if only in name, and in turn support wiggle/bigWig/bedGraph formats. I'm not feeling very creative at the moment, there must be something better than ContinuousValuedFeatureRDD. Something for a future feature request?

+1! Let's open a ticket after merging?

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

Yep

@heuermh
Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @param featureRdd FeatureRDD holding records for coverage and corresponding
* referenceRegion
*/
case class CoverageRDD(featureRdd: FeatureRDD) {

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

I feel a little weird having this wrap a FeatureRDD. My preference would be to have the CoverageRDD be a GenomicRDD that wraps a RDD[Coverage]. I feel like this would be more consistent with our recent refactoring. Thoughts?

@fnothaft

fnothaft Aug 11, 2016

Member

I feel a little weird having this wrap a FeatureRDD. My preference would be to have the CoverageRDD be a GenomicRDD that wraps a RDD[Coverage]. I feel like this would be more consistent with our recent refactoring. Thoughts?

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Then, where you currently return a RDD[Coverage], you'd return a new CoverageRDD.

@fnothaft

fnothaft Aug 11, 2016

Member

Then, where you currently return a RDD[Coverage], you'd return a new CoverageRDD.

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

+1, with toFeatureRDD method on CoverageRDD and toCoverageRDD method on FeatureRDD.

@heuermh

heuermh Aug 11, 2016

Member

+1, with toFeatureRDD method on CoverageRDD and toCoverageRDD method on FeatureRDD.

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
}).reduceByKey(_ + _)
.cache()
.sortByKey()
.mapPartitions(i => merge(i)) // merge adjacent regions with the same coverage

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

IMO, this (cache --> sort --> merge) should get factored out to CoverageRDD, and the CoverageRDD should have some sort of a collapse method? There are places where it doesn't necessarily make things simpler to have coverage collapsed down, e.g., https://github.com/bigdatagenomics/quinine/blob/master/quinine-core/src/main/scala/org/bdgenomics/quinine/metrics/targeted/PanelStats.scala#L64-L101.

Also,

@fnothaft

fnothaft Aug 11, 2016

Member

IMO, this (cache --> sort --> merge) should get factored out to CoverageRDD, and the CoverageRDD should have some sort of a collapse method? There are places where it doesn't necessarily make things simpler to have coverage collapsed down, e.g., https://github.com/bigdatagenomics/quinine/blob/master/quinine-core/src/main/scala/org/bdgenomics/quinine/metrics/targeted/PanelStats.scala#L64-L101.

Also,

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Sorry, forgot to finish writing the thought after the orphan Also,. Wherever you have a cache, that should be paired with an unpersist.

@fnothaft

fnothaft Aug 11, 2016

Member

Sorry, forgot to finish writing the thought after the orphan Also,. Wherever you have a cache, that should be paired with an unpersist.

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
* @param iter partition iterator of ReferenceRegion and coverage values
* @return merged tuples of adjacent ReferenceRegions and coverage
*/
private def merge(iter: Iterator[(ReferenceRegion, Int)]): Iterator[(ReferenceRegion, Int)] = {

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Make private[rdd] and write unit tests. We'll really want to make sure this one is right...!

@fnothaft

fnothaft Aug 11, 2016

Member

Make private[rdd] and write unit tests. We'll really want to make sure this one is right...!

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I'd advise against the wrapper. With "reasonable" defaults for the arguments to the tail call function, you can eliminate this function.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I'd advise against the wrapper. With "reasonable" defaults for the arguments to the tail call function, you can eliminate this function.

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
* @return merged tuples of adjacent ReferenceRegions and coverage
*/
@tailrec private def merge(iter: Iterator[(ReferenceRegion, Int)],
lastRegion: ReferenceRegion,

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

E.g., lastRegion = _

@fnothaft

fnothaft Aug 11, 2016

Member

E.g., lastRegion = _

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
*/
@tailrec private def merge(iter: Iterator[(ReferenceRegion, Int)],
lastRegion: ReferenceRegion,
lastCoverage: Int,

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

lastCoverage = -1

@fnothaft

fnothaft Aug 11, 2016

Member

lastCoverage = -1

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
@tailrec private def merge(iter: Iterator[(ReferenceRegion, Int)],
lastRegion: ReferenceRegion,
lastCoverage: Int,
condensed: List[(ReferenceRegion, Int)]): Iterator[(ReferenceRegion, Int)] = {

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

condensed = List.empty

would do the trick

@fnothaft

fnothaft Aug 11, 2016

Member

condensed = List.empty

would do the trick

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 11, 2016

Member

Hate to be a PITA, but would it be possible to split the deletion of dead code out into a separate commit? LMK if that'd be too much of a hassle.

Member

fnothaft commented Aug 11, 2016

Hate to be a PITA, but would it be possible to split the deletion of dead code out into a separate commit? LMK if that'd be too much of a hassle.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
@@ -757,6 +757,11 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
FragmentRDD.fromRdd(records.map(fastqRecordConverter.convertFragment))
}
def loadCoverage(filePath: String, minPartitions: Option[Int] = None): CoverageRDD = {

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Add scaladoc. Drop unused minPartitions argument?

@fnothaft

fnothaft Aug 11, 2016

Member

Add scaladoc. Drop unused minPartitions argument?

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* a feature.
*
* @param featureRdd FeatureRDD holding records for coverage and corresponding
* referenceRegion

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Nit: indent 2 spaces

@fnothaft

fnothaft Aug 11, 2016

Member

Nit: indent 2 spaces

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
flattenedCoverage // no binning, return raw results
} else {
// subtract region.start to shift mod to start of ReferenceRegion
flattenedCoverage.filter(r => (r.start - region.start) % bpPerBin == 0)

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

So, when I hear "binning", I think "averaging" not "sampling".

@fnothaft

fnothaft Aug 11, 2016

Member

So, when I hear "binning", I think "averaging" not "sampling".

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @param bpPerBin base pairs per bin, number of bases to combine to one bin
* @return RDD of Coverage Records
*/
def getCoverage(region: ReferenceRegion, bpPerBin: Int = 1): RDD[Coverage] = {

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

What exactly is the use case for this, as distinct from filterByOverlappingRegion?

@fnothaft

fnothaft Aug 11, 2016

Member

What exactly is the use case for this, as distinct from filterByOverlappingRegion?

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 11, 2016

Contributor

filterByOverlappingRegion returns pure records. getCoverage and getAggregatedCoverage return flattened coverage that can be binned through sampling and aggregation, respectively.

@akmorrow13

akmorrow13 Aug 11, 2016

Contributor

filterByOverlappingRegion returns pure records. getCoverage and getAggregatedCoverage return flattened coverage that can be binned through sampling and aggregation, respectively.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

I'd prefer to chain filterByOverlappingRegion with binCoverage or aggregateCoverage than have them rolled together.

@fnothaft

fnothaft Aug 11, 2016

Member

I'd prefer to chain filterByOverlappingRegion with binCoverage or aggregateCoverage than have them rolled together.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
// subtract region.start to shift mod to start of ReferenceRegion
val key = r.start - ((r.start - region.start) % bpPerBin)
(key, r)
}).groupByKey

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Avoid groupByKey. I think this works as a reduce by key with the following changes:

def reduceFn(a: (Int, Int), b: (Int, Int)): (Int, Int) = {
  (a._1 + b._1, a._2 + b._2)
}
flattenedCoverage.map(r => {
         val key = r.start - ((r.start - region.start) % bpPerBin)
         (key, (r, 1))
      }).reduceByKey(reduceFn)
  .map(r => Coverage(r._1.contigName, r._1.pos, r._1.pos + bpPerBin, r._2._1.toDouble / r._2._2.toDouble))
@fnothaft

fnothaft Aug 11, 2016

Member

Avoid groupByKey. I think this works as a reduce by key with the following changes:

def reduceFn(a: (Int, Int), b: (Int, Int)): (Int, Int) = {
  (a._1 + b._1, a._2 + b._2)
}
flattenedCoverage.map(r => {
         val key = r.start - ((r.start - region.start) % bpPerBin)
         (key, (r, 1))
      }).reduceByKey(reduceFn)
  .map(r => Coverage(r._1.contigName, r._1.pos, r._1.pos + bpPerBin, r._2._1.toDouble / r._2._2.toDouble))

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I don't like keying here with just the start position. Generally, I prefer keying with a ReferencePosition/ReferenceRegion.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I don't like keying here with just the start position. Generally, I prefer keying with a ReferencePosition/ReferenceRegion.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I'd like to break this out into filterByOverlappingRegion and aggregateCoverage.

Also, avoid get in function names unless they are a setter/getter.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, I'd like to break this out into filterByOverlappingRegion and aggregateCoverage.

Also, avoid get in function names unless they are a setter/getter.

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

How could we fit getCoverage to filterByOverlappingRegion? I believe these are different functions because filterByOverlappingRegion returns raw records. For example, if the coverage records are not merged to begin with, then filterByOverlappingRegion will return chunks of coverage. Also, filterByOverlappingRegion cannot specify binning.

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

How could we fit getCoverage to filterByOverlappingRegion? I believe these are different functions because filterByOverlappingRegion returns raw records. For example, if the coverage records are not merged to begin with, then filterByOverlappingRegion will return chunks of coverage. Also, filterByOverlappingRegion cannot specify binning.

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

We could always merge getCoverage() and aggregatedCoverage() back into 1 function. Then there would be filterByOverlappingRegion and aggregatedCoverage, with the option to reduce by key or sample.

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

We could always merge getCoverage() and aggregatedCoverage() back into 1 function. Then there would be filterByOverlappingRegion and aggregatedCoverage, with the option to reduce by key or sample.

This comment has been minimized.

@fnothaft

fnothaft Aug 12, 2016

Member

Ah, my earlier comment was unclear. What I mean is, currently, the method signature is:

def getAggregatedCoverage(region: ReferenceRegion, bpPerBin: Int = 1)

I don't like having the region: ReferenceRegion in here. I'd like the signature to be:

def aggregateCoverage(bpPerBin: Int = 1)

and if the user wants to filter out coverage by a specific region, they call filterByOverlappingRegion before calling aggregateCoverage.

@fnothaft

fnothaft Aug 12, 2016

Member

Ah, my earlier comment was unclear. What I mean is, currently, the method signature is:

def getAggregatedCoverage(region: ReferenceRegion, bpPerBin: Int = 1)

I don't like having the region: ReferenceRegion in here. I'd like the signature to be:

def aggregateCoverage(bpPerBin: Int = 1)

and if the user wants to filter out coverage by a specific region, they call filterByOverlappingRegion before calling aggregateCoverage.

This comment has been minimized.

@heuermh

heuermh Aug 12, 2016

Member

+1

@heuermh

This comment has been minimized.

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

Ahh I see. Much cleaner!

@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

Ahh I see. Much cleaner!

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
// flatmap features in region to coverage
features.rdd.flatMap(r => {
val positions: List[Long] = List.range(r.getStart, r.getEnd)

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Prefer r.getStart to r.getEnd

@fnothaft

fnothaft Aug 11, 2016

Member

Prefer r.getStart to r.getEnd

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Actually, until, not to.

@fnothaft

fnothaft Aug 11, 2016

Member

Actually, until, not to.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* Gets flattened RDD of coverage, with coverage at each base pair
* @return RDD[Coverage] from underlying FeatureRDD
*/
def flattenedRdd: RDD[Coverage] = {

This comment has been minimized.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, for methods that "do heavy work", prefer () even if there are no parameters.

@fnothaft

fnothaft Aug 11, 2016

Member

Also, for methods that "do heavy work", prefer () even if there are no parameters.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1381/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Aug 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1381/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1382/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Aug 12, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1382/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Show comment
Hide comment
@akmorrow13

akmorrow13 Aug 12, 2016

Contributor

Jenkins, retest this please.

Contributor

akmorrow13 commented Aug 12, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 12, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1383/
Test PASSed.

AmplabJenkins commented Aug 12, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1383/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1384/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Aug 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1384/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 17, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1388/
Test PASSed.

AmplabJenkins commented Aug 17, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1388/
Test PASSed.

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Reads2Coverage.scala
import org.kohsuke.args4j.{ Argument, Option => Args4jOption }
/**
* Reads2Coverage (accessible as the command 'coverage' through the CLI) takes two arguments,

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Nit: should be "accessible as the command 'reads2coverage'".

@fnothaft

fnothaft Aug 18, 2016

Member

Nit: should be "accessible as the command 'reads2coverage'".

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
}
/**
* Merges adjacent ReferenceRegions with the same value. This is used for merging coverage.

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

This is used for merging coverage.

This isn't really detailed. Split it out of the summary line and expand upon the explanation.

@fnothaft

fnothaft Aug 18, 2016

Member

This is used for merging coverage.

This isn't really detailed. Split it out of the summary line and expand upon the explanation.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
collapse(iter, first, List.empty)
} else iter
})
this.replaceRdd(newRDD)

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Don't need the this.

@fnothaft

fnothaft Aug 18, 2016

Member

Don't need the this.

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Also, prefer whitespace before the line.

@fnothaft

fnothaft Aug 18, 2016

Member

Also, prefer whitespace before the line.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @see merge
*
* @param iter partition iterator of ReferenceRegion and coverage values
* @param lastCoverage: the last coverage from a sorted Iterator that has been considered to merge

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

No colons here and in the line below after param name.

@fnothaft

fnothaft Aug 18, 2016

Member

No colons here and in the line below after param name.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
val nextCondensed =
if (condensed.map(r => ReferenceRegion(r)).filter(_.overlaps(ReferenceRegion(lastCoverage))).isEmpty) {
lastCoverage :: condensed
} else condensed

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Be consistent with braces. If the leading if clause has braces, the else clause (and any else ifs) should as well.

@fnothaft

fnothaft Aug 18, 2016

Member

Be consistent with braces. If the leading if clause has braces, the else clause (and any else ifs) should as well.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
if (rr.isAdjacent(lastRegion) && lastCoverage.count == cov.count) {
(Coverage(rr.merge(lastRegion), lastCoverage.count), condensed)
} else {
(cov, (lastCoverage) :: condensed)

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Nit: remove parens around lastCoverage.

@fnothaft

fnothaft Aug 18, 2016

Member

Nit: remove parens around lastCoverage.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
} else condensed
nextCondensed.toIterator
} else {
val cov: Coverage = iter.next

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Stylistically, I don't love the explicit typing here. Is it needed, or can we let scalac infer the types?

@fnothaft

fnothaft Aug 18, 2016

Member

Stylistically, I don't love the explicit typing here. Is it needed, or can we let scalac infer the types?

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
}
/**
* Gets coverage overlapping specified ReferenceRegion. For large, ReferenceRegions,

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Split after first sentence.

@fnothaft

fnothaft Aug 18, 2016

Member

Split after first sentence.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
*/
def coverage(bpPerBin: Int = 1): CoverageRDD = {
val flattened = this.flatten

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

No this., flatten -> flatten()

@fnothaft

fnothaft Aug 18, 2016

Member

No this., flatten -> flatten()

flattened // no binning, return raw results
} else {
// subtract region.start to shift mod to start of ReferenceRegion
val newRDD = flattened.rdd.filter(r => r.start % bpPerBin == 0)

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Prefer .transform(_.filter...) to call to .replaceRdd

@fnothaft

fnothaft Aug 18, 2016

Member

Prefer .transform(_.filter...) to call to .replaceRdd

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
}
/**
* Gets coverage overlapping specified ReferenceRegion. For large, ReferenceRegions,

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Split at For large, no comma before ReferenceRegion.

@fnothaft

fnothaft Aug 18, 2016

Member

Split at For large, no comma before ReferenceRegion.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
*/
def aggregatedCoverage(bpPerBin: Int = 1): CoverageRDD = {
val flattened = this.flatten

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

No this., flatten -> flatten().

@fnothaft

fnothaft Aug 18, 2016

Member

No this., flatten -> flatten().

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
// compute average coverage in bin
Coverage(r._1.referenceName, r._1.start, r._1.end, r._2._1 / r._2._2)
})
flattened.replaceRdd(newRDD)

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

In general, I prefer .transform. I would factor L173–L184 into a function that is called by .transform.

@fnothaft

fnothaft Aug 18, 2016

Member

In general, I prefer .transform. I would factor L173–L184 into a function that is called by .transform.

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

The way I look at the replaceRdd method is that it exists solely so that the transform method could be implemented in the GenomicRDD trait. Ideally, it's never called elsewhere. However, I don't think this intention is documented. I'll go back and doc this.

@fnothaft

fnothaft Aug 18, 2016

Member

The way I look at the replaceRdd method is that it exists solely so that the transform method could be implemented in the GenomicRDD trait. Ideally, it's never called elsewhere. However, I don't think this intention is documented. I'll go back and doc this.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
* @return CoverageRDD of flattened Coverage records
*/
def flatten: CoverageRDD = {
this.replaceRdd(flatMapCoverage(rdd))

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

No this., same comment above RE: transform --> this becomes transform(flatMapCoverage)

@fnothaft

fnothaft Aug 18, 2016

Member

No this., same comment above RE: transform --> this becomes transform(flatMapCoverage)

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/CoverageRDD.scala
/**
* Flat maps coverage into ReferenceRegion and counts for each base pair.
* @param rdd
* @return

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Complete scaladoc.

@fnothaft

fnothaft Aug 18, 2016

Member

Complete scaladoc.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureRDD.scala
@@ -246,6 +246,14 @@ case class FeatureRDD(rdd: RDD[Feature],
}
/**
* @return Returns CoverageRDD build from FeatureRDD

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

s/build/built/g

@fnothaft

fnothaft Aug 18, 2016

Member

s/build/built/g

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
@@ -41,6 +41,7 @@ import org.bdgenomics.adam.converters.AlignmentRecordConverter
import org.bdgenomics.adam.instrumentation.Timers._
import org.bdgenomics.adam.models._
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.features.CoverageRDD

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

org.bdgenomics.adam.rdd.features should come after org.bdgenomics.adam.rdd and before org.bdgenomics.adam.rdd.read

@fnothaft

fnothaft Aug 18, 2016

Member

org.bdgenomics.adam.rdd.features should come after org.bdgenomics.adam.rdd and before org.bdgenomics.adam.rdd.read

val t: List[Long] = List.range(r.getStart, r.getEnd)
t.map(n => (ReferenceRegion(r.getContigName, n, n + 1), 1))
}).reduceByKey(_ + _)
.cache()

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

You don't want the .cache() call here. I mean, you do, but you don't want the code after it. This should be broken up:

val covCounts = rdd.rdd
         .flatMap(r => {
           val t: List[Long] = List.range(r.getStart, r.getEnd)
           t.map(n => (ReferenceRegion(r.getContigName, n, n + 1), 1))
         }).reduceByKey(_ + _)
         .cache()

val coverage = covCounts.sortByKey()
        .map(r => Coverage(r._1, r._2.toDouble))
@fnothaft

fnothaft Aug 18, 2016

Member

You don't want the .cache() call here. I mean, you do, but you don't want the code after it. This should be broken up:

val covCounts = rdd.rdd
         .flatMap(r => {
           val t: List[Long] = List.range(r.getStart, r.getEnd)
           t.map(n => (ReferenceRegion(r.getContigName, n, n + 1), 1))
         }).reduceByKey(_ + _)
         .cache()

val coverage = covCounts.sortByKey()
        .map(r => Coverage(r._1, r._2.toDouble))

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Also, why are we sorting here? If we don't collapse, there's no reason to do the sort. I think it would make sense to move the sortByKey into collapse, since the data being sorted is an invariant of collapse.

@fnothaft

fnothaft Aug 18, 2016

Member

Also, why are we sorting here? If we don't collapse, there's no reason to do the sort. I think it would make sense to move the sortByKey into collapse, since the data being sorted is an invariant of collapse.

Show outdated Hide outdated ...ore/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala
CoverageRDD(coverage, sequences).collapse
else CoverageRDD(coverage, sequences)
coverage.unpersist()

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

You don't want to unpersist coverage. You'd want to unpersist covCounts from https://github.com/bigdatagenomics/adam/pull/1108/files#r75248157. Calling unpersist on coverage should cause everything done to the CoverageRDD returned by this method to throw an exception...

@fnothaft

fnothaft Aug 18, 2016

Member

You don't want to unpersist coverage. You'd want to unpersist covCounts from https://github.com/bigdatagenomics/adam/pull/1108/files#r75248157. Calling unpersist on coverage should cause everything done to the CoverageRDD returned by this method to throw an exception...

Show outdated Hide outdated ...e/src/test/scala/org/bdgenomics/adam/rdd/features/CoverageRDDSuite.scala
class CoverageRDDSuite extends ADAMFunSuite {
def tempLocation(suffix: String = ".adam"): String = {

This comment has been minimized.

@fnothaft

fnothaft Aug 18, 2016

Member

Don't copy this; factor it out of ADAMContextSuite into ADAMFunSuite.