Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient Joins and (re)Partitioning #1324

Merged
merged 4 commits into from Jun 5, 2017

Conversation

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Dec 23, 2016

Ready for review.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Dec 23, 2016

Can one of the admins verify this patch?

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Dec 23, 2016

Jenkins, add to whitelist.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Dec 23, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1700/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 1bbc8ab # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 1bbc8ab (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1bbc8abae508530dab55ae88be11935a74b594e0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

val i = z.leftOuterShuffleRegionJoin(x).rdd.collect
assert(h.length == i.length)

val t = sc.loadParquetAlignments("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/sortedAlignments.parquet.txt")

This comment has been minimized.

@devin-petersohn

devin-petersohn Dec 23, 2016
Author Member

I will fix this path.

sparkTest("testing partitioner") {
time {
//val x = sc.loadBam("/data/recompute/alignments/NA12878.bam.aln.bam")
val x = sc.loadBam("/Users/DevinPetersohn/software_builds/adam/adam-core/src/test/resources/bqsr1.sam")

This comment has been minimized.

@devin-petersohn

devin-petersohn Dec 23, 2016
Author Member

I will fix this path.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Dec 23, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1701/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 58cfe15 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 58cfe15 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 58cfe152a16c84931ba087d38ffa53a9397b5bbfFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Copy link
Member

@fnothaft fnothaft left a comment

Generally LGTM! Just made a first review pass. For more detailed review, I need more inline docs.

In addition to the changes I suggested, please remove the .parquet test resource files from this PR. If needed for tests, they should be generated inside the test. We try to avoid checking in binary files whenever possible.

Linking to #1216.

@@ -62,8 +62,11 @@ import org.bdgenomics.utils.io.LocalFileByteAccess
import org.bdgenomics.utils.misc.{ HadoopUtil, Logging }
import org.seqdoop.hadoop_bam._
import org.seqdoop.hadoop_bam.util._

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Nit: remove space.

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The (possibly globbed) filepath to load a VCF from.
* @return Returns a tuple of metadata from the VCF header, including the
* sequence dictionary and a list of the samples contained in the VCF.
*

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Nit: I'd prefer to keep these spaces.

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Yeah, there are a lot of these whitespace changes and there shouldn't be any of them. ;)

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

This comment has been minimized.

@devin-petersohn

devin-petersohn Jan 17, 2017
Author Member

I will refix these. They were fixed at some point but my IDE is not cooperating :(

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

This should be (package) private.

// this unfortunately seems to be the only way to do this
// avro does not seem to support getting metadata fields out once you have the from the string
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]
//we want this for the use case

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Can you expand this comment a bit? Actually, what might be preferable, is to have a longer inline comment that documents the format of what we are parsing. I think that would make this section of the code much easier to review.

// parsing the json from the metadata header
// this unfortunately seems to be the only way to do this
// avro does not seem to support getting metadata fields out once you have the from the string
val metaDataMap = JSON.parseFull(fr.getMetaString("avro.schema")).get.asInstanceOf[Map[String, String]]

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Why do we need the cast here?

@@ -308,7 +308,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
filePath: String,
asType: Option[SAMFormat] = None,
asSingleFile: Boolean = false,
isSorted: Boolean = false,
isSorted: Boolean = SortedTrait.isSorted,

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Ditto RE: eliminating parameter.

@@ -583,7 +583,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
*/
def realignIndels(
consensusModel: ConsensusGenerator = new ConsensusGeneratorFromReads,
isSorted: Boolean = false,
isSorted: Boolean = SortedTrait.isSorted,

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Ditto RE: eliminating parameter.

@@ -20,7 +20,7 @@ package org.bdgenomics.adam.models
import htsjdk.samtools.SAMReadGroupRecord
import org.scalatest.FunSuite

class RecordGroupDictionarySuite extends FunSuite {
class moRecordGroupDictionarySuite extends FunSuite {

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Revert mo

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.utils.misc.SparkFunSuite

/**

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

No author comments.

@@ -0,0 +1,72 @@
package org.bdgenomics.adam.rdd

This comment has been minimized.

@fnothaft

fnothaft Dec 23, 2016
Member

Add license header.

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Running ./scripts/format-source should add the license header.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Dec 28, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1709/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 86150cf # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 86150cf (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 86150cfa21599654d53606727c562e3ae5c93ec3First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
* and saves to disk as VCF.
*
* @param filePath The filepath to save to.
* @param args The arguments for saving the data

This comment has been minimized.

@devin-petersohn

devin-petersohn Dec 28, 2016
Author Member

Updated docs in a few places where they were not correct.

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

use full sentences for method parameter docs

Copy link
Member

@fnothaft fnothaft left a comment

Made an additional review pass. Let's chat about the boundary case where we have a record that is duplicated across partition boundaries.

@@ -296,8 +298,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The filepath to load a single Avro file of sequence
* dictionary info from.
* @return Returns the SequenceDictionary representing said reference build.
*
* @see loadAvroSequences
* @see loadAvroSequences

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Spaced one space too far.

@@ -330,8 +331,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The filepath to load a single Avro file containing read
* group metadata.
* @return Returns a RecordGroupDictionary.
*
* @see loadAvroReadGroupMetadata
* @see loadAvroReadGroupMetadata

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Spaced one space too far.

*
* @throws FileNotFoundException if the path does not match any files.
* @see getFsAndFiles
* @throws FileNotFoundException if the path does not match any files.

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Spaced one space too far.

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Also, lost whitespace.

* @see getFiles
*
* @throws FileNotFoundException if the path does not match any files.
* @see getFiles

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Spaced one space too far.
Also, lost whitespace.

*
* @throws FileNotFoundException if the path does not match any files.
* @see getFiles
* @throws FileNotFoundException if the path does not match any files.

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Spaced one space too far.
Also, lost whitespace.

replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
}

private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Move to GenomicPartitioners.scala

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

GenomicPosition is used in GenomicPartitioners.scala.

if (isSorted) {
sampleSchema.addProp("sorted", "true".asInstanceOf[Any])
sampleSchema.addProp("partitionMap", partitionMap.mkString(",").asInstanceOf[Any])
}

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Add whitespace after.

@@ -251,7 +251,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
*
* @return Returns a SAM/BAM formatted RDD of reads, as well as the file header.
*/
def convertToSam(isSorted: Boolean = false): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {
def convertToSam(isSorted: Boolean = isSorted): (RDD[SAMRecordWritable], SAMFileHeader) = ConvertToSAM.time {

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

There's going to be more logic that is needed here WRT dealing with records that are duplicated by the flanking process.

@@ -0,0 +1,72 @@
package org.bdgenomics.adam.rdd

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Running ./scripts/format-source should add the license header.

*/
class SortedGenomicRDDSuite extends SparkFunSuite {

def time[R](block: => R): R = {

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Dropping a note here to remove this block before we merge.

@@ -186,7 +189,6 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
* @param filePath The (possibly globbed) filepath to load a VCF from.
* @return Returns a tuple of metadata from the VCF header, including the
* sequence dictionary and a list of the samples contained in the VCF.
*

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

Right, check yer IDE settings, or use one that doesn't make unwanted changes on your behalf :)

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
private[rdd] def determineIsSortedAndExtractPartitionMap(filename: String): Seq[(ReferenceRegion, ReferenceRegion)] = {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

determineIsSortedAndExtractPartitionMapextractPartitionMap

val maybePartitionMap = metaDataMap.get("partitionMap")
// we didn't write a partition map, which means this was not sorted at write
// or at least we didn't have information that it was sorted
if(maybePartitionMap.isEmpty) {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

whitespace if(if (

// Seq with commas separating each tuple of (ReferenceRegion, ReferenceRegion)
// The first split is breaking up the tuples. Each tuple starts with a
// "(" then a ReferenceRegion, so we are simply pulling out the tuples
// by using the start of each tuple as the indicator

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

How about adding ReferenceRegion to the bdg-formats schemas and using Avro to write out the partition map? I don't see why we should be writing some stuff out as Parquet, some as Avro, some as JSON, and some as Strings or byte-serialized JVM objects.

VariantRDD(rdd, sd, headers, maybePartitionMapRdd = Some(sc.parallelize(pMap, pMap.length)))
// if we have no information about partition map we assume unsorted
} else {
// default to isSorted = false

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

isSortedsorted here and everywhere else

* @return A SortedGenomicRDDMixIn that contains the sorted and partitioned
* RDD
*/
def repartitionAndSortByGenomicCoordinate(partitions: Int = rdd.partitions.length)(implicit c: ClassTag[T]): U = {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

we don't use the word coordinate elsewhere, how about repartitionAndSortByReferenceRegion?

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Or just repartitionAndSort to be consistent with the sort/sortLexicographically methods on GenomicRDD.

This comment has been minimized.

replaceRdd(partitionedRDD.values, Some(newPartitionMapRdd))
}

private[rdd] class GenomicPositionRangePartitioner[V](partitions: Int, elements: Int = 0) extends Partitioner {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

Again, "genomic position" isn't used elsewhere, why not ReferenceRegionRangePartitioner?

@@ -70,7 +70,10 @@ private[rdd] object NucleotideContigFragmentRDD extends Serializable {
*/
case class NucleotideContigFragmentRDD(
rdd: RDD[NucleotideContigFragment],
sequences: SequenceDictionary) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {
sequences: SequenceDictionary,
maybePartitionMapRdd: Option[RDD[(ReferenceRegion, ReferenceRegion)]] = None) extends AvroGenomicRDD[NucleotideContigFragment, NucleotideContigFragmentRDD] with ReferenceFile {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

maybePartitionMapRddpartitionMap or partitions

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

+1. I'd prefer partitionMap over partitions as partitions sounds more harmonious with Spark's abstract idea of what a Partition is, while we are abusing that notion.

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

If partitionMap remains an RDD (which you've argued against above), then it shouldn't really be called Map. :)

@@ -145,7 +145,7 @@ sealed trait AlignmentRecordRDD extends AvroReadGroupGenomicRDD[AlignmentRecord,
* file was saved.
*/
private[rdd] def maybeSaveBam(args: ADAMSaveAnyArgs,
isSorted: Boolean = false): Boolean = {
isSorted: Boolean = isSorted): Boolean = {

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

isSortedsorted, as elsewhere

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

Where do we have sorted? I've generally been preferring isSorted.

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

We're not writing Java Beans. if (sorted) reads much better to me.

This comment has been minimized.

@fnothaft

fnothaft Dec 28, 2016
Member

You're right that sorted is the prevailing usage, BTW:

adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
      23     158    1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
      48     463    3178

This disparity grows if you include test sources:

adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
      30     179    1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
     135     839    7236

This comment has been minimized.

@fnothaft

fnothaft Dec 29, 2016
Member

You're right that sorted is the prevailing usage, BTW:

adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep isSorted {} \; | wc
      23     158    1295
adam fnothaft$ find adam-*/src/main -name "*.scala" -exec grep sorted {} \; | wc
      48     463    3178

This disparity grows if you include test sources:

adam fnothaft$ find adam-*/src -name "*.scala" -exec grep isSorted {} \; | wc
      30     179    1459
adam fnothaft$ find adam-*/src -name "*.scala" -exec grep sorted {} \; | wc
     135     839    7236

This comment has been minimized.

@heuermh

heuermh Jan 4, 2017
Member

Created #1341 to fix this later

This comment has been minimized.

@devin-petersohn

devin-petersohn May 25, 2017
Author Member

Leaving a note here to say that we decided to use isSorted because of the potential collision with sorted in scala.collection.

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Our isSorted is not 100% the same as the SAM/BAM definition of coordinate sorted. We'd need to filter out any replicated records.

@@ -105,7 +108,7 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* Converts an RDD of ADAM VariantContexts to HTSJDK VariantContexts
* and saves to disk as VCF.
*
* @param filePath The filepath to save to.
* @param args The arguments for saving the data

This comment has been minimized.

@heuermh

heuermh Dec 28, 2016
Member

use full sentences for method parameter docs

@heuermh
Copy link
Member

@heuermh heuermh commented Jan 4, 2017

ok to set the milestone for this as 0.22.0?

@fnothaft fnothaft added this to the 0.22.0 milestone Jan 4, 2017
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jan 4, 2017

+1 @heuermh, triaged to 0.22.0.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1729/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains e244fcfc34d1580be73e9c76e6dc72ceced60383 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision e244fcfc34d1580be73e9c76e6dc72ceced60383 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e244fcfc34d1580be73e9c76e6dc72ceced60383First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1730/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8f803afa8e862dff1626eb9002de0cb27175465a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8f803afa8e862dff1626eb9002de0cb27175465a (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8f803afa8e862dff1626eb9002de0cb27175465aFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented Jan 11, 2017

@fnothaft @heuermh How do we handle cross strand joins in the shuffleRegionJoin code? From what I can tell, nothing special is done about it, but the bug I encountered involved the ReferenceRegion.overlap() skipping things that were not on the same strand.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jan 11, 2017

Cross-strand joins are handled by

.

ReferenceRegion.overlap() skipping things that were not on the same strand.

Ah yes, things that are not on the same strand do not overlap. ;) This is intentional. IIRC, we clarified a lot of this in e98ee2d. I think the question here is less "what behavior is correct in ReferenceRegion.overlaps" and more "should stranded or unstranded reference regions be input into the region join"? Would you agree? CC @laserson who might have opinions as well.

@laserson
Copy link
Contributor

@laserson laserson commented Jan 11, 2017

Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jan 11, 2017

Seems like we should support both stranded and un-stranded. Could this simply be passed through to the operation that determines whether there is an overlap?

I think there's a couple of ways we could do it. I think the simplest would be to generate unstranded reference regions as the keys to the join (after e98ee2d, this is v. easy). Perhaps we'd just pass this as a switch on the GenomicRDD methods?

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1731/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 8ccd18912301e91573c5245dc5c09edcf11fc337 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 8ccd18912301e91573c5245dc5c09edcf11fc337 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 8ccd18912301e91573c5245dc5c09edcf11fc337First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1733/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 2b68245 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 2b68245 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2b682452ac6d6453870022697b535357b023771dFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 16, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1736/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision c9aac611a0d99b8f3e2e6266369d1fcf6467ef46 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c9aac611a0d99b8f3e2e6266369d1fcf6467ef46First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1737/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b7204f839e6a69208114c630cf6cd025c13a6096 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision b7204f839e6a69208114c630cf6cd025c13a6096 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b7204f839e6a69208114c630cf6cd025c13a6096First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


val c = jRdd0.rdd.collect
val c = jRdd.rdd.collect

This comment has been minimized.

@devin-petersohn

devin-petersohn Jan 17, 2017
Author Member

This seemed to be incorrect previously. Let me know if I should revert.

@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented Jan 17, 2017

Pinging @fnothaft @heuermh for review. Please ignore the whitespace issues for now, I will push something shortly that fixes that.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1738/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ec88406f303d32bcdaefa004bfd2c31ad748d080 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ec88406f303d32bcdaefa004bfd2c31ad748d080 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ec88406f303d32bcdaefa004bfd2c31ad748d080First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1739/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 46af637fa97d2727a0077052d50d55270b18b707 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision 46af637fa97d2727a0077052d50d55270b18b707 (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 46af637fa97d2727a0077052d50d55270b18b707First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jan 18, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1740/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1324/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains ab4cefe # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1324/merge^{commit} # timeout=10Checking out Revision ab4cefe (origin/pr/1324/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f ab4cefeabf657914e554f7acd41e735144cfd5b2First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

* we are only sorting a number of elements equal to the number of
* partitions written.
*/
private[rdd] class ADAMInputFormat[T] extends ParquetInputFormat[T] {

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Can we rename ADAMInputFormat to ADAMParquetInputFormat, à la ADAMBAMInputFormat, ADAMVCFInputFormat, etc?

Some(partitionMapBuilder.toArray)
} catch {
case e: FileNotFoundException => None
// TODO: Log Exception

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

TODO ;)

Should we log this or rethrow?

This comment has been minimized.

@devin-petersohn

devin-petersohn May 26, 2017
Author Member

I think rethrowing is correct here.

* @param filename the filename for the metadata
* @return a partition map if the data was written sorted, or an empty Seq if unsorted
*/
private[rdd] def extractPartitionMap(

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

IMO, we should move this into an avro schema or something more structured, instead of parsing JSON. That said, let's not do it here and now. Can you open a ticket to refactor this into avroland post 0.23.0?

This comment has been minimized.

* Load a path name in Parquet + Avro format into an AlignmentRecordRDD.
*
* @note The sequence dictionary is read from an Avro file stored at
* pathName/_seqdict.avro and the record group dictionary is read from an
* Avro file stored at pathName/_rgdict.avro. These files are pure Avro,
* not Parquet + Avro.
*

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Keep space.

key match {
case (_, f2: Int) => f2
case _ => {
throw new Exception("Unable to partition without destination assignment")

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

We should include the offending key in the exception message. Otherwise, the exception is impossible to debug if thrown.

}).filter(f => f._1.nonEmpty).map(f => (f._1.get, f._2))
.sortBy(elem => elem._1, ascending = true, numPartitions = partitions)

partitionedRDD.cache()

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

I thought we were going to pull the partition mapping out of the sort code, so that people could sort without computing the partition map? Also, storage level should be parametrizable.

"Cannot copartition with an unsorted rdd!")

val destinationPartitionMap = rddToCoPartitionWith.optPartitionMap.get
//number of partitions we will have after repartition

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Nits for the code in this function:

  • there should be whitespace before a comment
  • space after //
// the zipWithIndex gives us the destination partition ID
destinationPartitionMap.flatten.zipWithIndex.map(g => {
val (firstRegion, secondRegion, index) = (g._1._1, g._1._2, g._2)
// in the case where we span multiple referenceNames

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Can you add a bit more documentation on this codepath? E.g., what is necessary to represent a partition, why is the first region sometimes enough, when isn't it, etc.

// includes any extremely long regions. we include the firstRegion for
// the case that the first region is extremely long
(iter ++ Iterator(firstRegion)).maxBy(f => (f._1.referenceName, f._1.end, f._1.start))
// only one record on this partition, so this is the extent of the bounds

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Nit: this comment should be in the else clause.

(lo to hi).map(i => ((region, i), y))
})
def compute(): RDD[(RT, RU)] = {
leftRdd.zipPartitions(rightRdd, preservesPartitioning = true)(makeIterator)

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Technically, this doesn't "preserve partitioning" as defined by Spark, right?

This comment has been minimized.

@devin-petersohn

devin-petersohn May 26, 2017
Author Member

In this case, I doubt that it matters. It just lets Spark know that the partitioner it used (in this case ManualRegionPartitioner) to partition the data is still valid after this operation.

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

That's my point: isn't the ManualRegionPartitioner invalid after the zipPartitions, since the "key" now has type RT and the partitioner expects type (_, Int)?

This comment has been minimized.

@devin-petersohn

devin-petersohn May 26, 2017
Author Member

+1 I will remove it.

*
* @param rdd The underlying AlignmentRecord RDD.
* @return A new AlignmentRecordRDD.
*/
def unaligned(rdd: RDD[AlignmentRecord]): AlignmentRecordRDD = {
AlignmentRecordRDD(rdd,
SequenceDictionary.empty,

This comment has been minimized.

@heuermh

heuermh May 26, 2017
Member

This could call AlignmentRecordRDD(rdd, sequences, recordGroupDictionary, None) directly

This comment has been minimized.

@fnothaft

fnothaft May 26, 2017
Member

Or even better, it could call the constructor.

This comment has been minimized.

@heuermh

heuermh May 26, 2017
Member

That's what I meant. Or is there a different syntax?

This comment has been minimized.

@devin-petersohn

devin-petersohn May 26, 2017
Author Member

See next 2 lines for solution

@heuermh
Copy link
Member

@heuermh heuermh commented May 26, 2017

All the apply refactoring looks good to me, thanks!

@@ -196,7 +196,7 @@ class VariantRDDSuite extends ADAMFunSuite {
// we can't guarantee that we get exactly the number of partitions requested,
// we get close though
assert(jRdd.rdd.partitions.length === 1)
assert(jRdd0.rdd.partitions.length === 5)
assert(jRdd0.rdd.partitions.length === 4)

val c = jRdd0.rdd.collect

This comment has been minimized.

@devin-petersohn

devin-petersohn May 26, 2017
Author Member

This is a bug. I will fix it.

@coveralls
Copy link

@coveralls coveralls commented May 26, 2017

Coverage Status

Coverage decreased (-0.07%) to 81.967% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@coveralls
Copy link

@coveralls coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.4%) to 82.418% when pulling 7805a7f on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2067/
Test PASSed.

@coveralls
Copy link

@coveralls coveralls commented May 26, 2017

Coverage Status

Coverage decreased (-0.08%) to 81.96% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@coveralls
Copy link

@coveralls coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.6%) to 82.599% when pulling f15fe55 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2069/
Test PASSed.

@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented May 30, 2017

@fnothaft, @heuermh, I think we're good to go for another review. At your earliest convenience of course.

Copy link
Member

@heuermh heuermh left a comment

Boom!

@coveralls
Copy link

@coveralls coveralls commented May 31, 2017

Coverage Status

Coverage increased (+0.6%) to 82.599% when pulling 536b936 on devin-petersohn:partitioner into 3ea4f18 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 31, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2076/
Test PASSed.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented May 31, 2017

@devin-petersohn you've got a conflict in adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantContextRDD.scala. Can you rebase and clear the conflict?

Addressing reviewer comments

Performance improvements related to decreasing compute

Major refactor to clean up ShuffleRegionJoin

Fixing serialization issue

Cleaning up tests and code

More cleanup

Addressing reviewer comments

Addressing reviewer comments, cleaning up after intellij

Testing that data stays sorted on Jenkins

Still debugging the sorted persistance

Testing that we can persist sort

Checking output to hopefully see what is happening

Testing more related to sort persist

Adding ADAMInputFormat class to ensure ordering is maintained

Adding docs and cleanup

Cleaning up code a bit

Cleaning up after IntelliJ

Addressing reviewer comments

Addressing reviewer comments

Addressing reviewer comments

Clean up docs, cutting at 80 characters.

Addressing reviewer comments

Addressing reviewer comments. sorted to isSorted

Fixing some spacing issues

Addressing reviwer comments
@devin-petersohn devin-petersohn force-pushed the devin-petersohn:partitioner branch from 536b936 to 7a9503c May 31, 2017
@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented May 31, 2017

@fnothaft Rebased, thanks!

@coveralls
Copy link

@coveralls coveralls commented May 31, 2017

Coverage Status

Coverage decreased (-0.2%) to 82.569% when pulling 7a9503c on devin-petersohn:partitioner into b7762c2 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 31, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2077/
Test PASSed.

@fnothaft fnothaft merged commit e5ae270 into bigdatagenomics:master Jun 5, 2017
1 of 3 checks passed
1 of 3 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage decreased (-0.2%) to 82.569%
Details
default Merged build finished.
Details
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 5, 2017

Merged! Thanks @devin-petersohn!

@devin-petersohn devin-petersohn mentioned this pull request Jul 7, 2017
4 of 4 tasks complete
@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects