[ADAM-1141] Add support for saving/loading AlignmentRecords to/from CRAM. #1145

Merged
merged 1 commit into from Sep 13, 2016

Conversation

Projects
None yet
3 participants
@fnothaft
Member

fnothaft commented Sep 1, 2016

Resolves #1141. Changes the signature of AlignmentRecordRDD.saveAsSAM to take an Option[SAMFormat] parameter, since asSam is now no longer a binary choice.

-1 for now, as I need to make a pass back and write some more tests. Depends on #1104, #1117.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 1, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1451/

Build result: FAILURE

GitHub pull request #1145 of commit e18db5a automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1145/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f5465f8 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1145/merge^{commit} # timeout=10Checking out Revision f5465f8 (origin/pr/1145/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f5465f81f3d35a033c626b652ccd157936e8f3d0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1451/

Build result: FAILURE

GitHub pull request #1145 of commit e18db5a automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1145/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f5465f8 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1145/merge^{commit} # timeout=10Checking out Revision f5465f8 (origin/pr/1145/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f5465f81f3d35a033c626b652ccd157936e8f3d0First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

* @param asSingleFile If true, saves output as a single file.
* @param isSorted If the output is sorted, this will modify the header.
*/
def saveAsSam(
filePath: String,
- asSam: Boolean = true,
+ asType: Option[SAMFormat] = None,

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

Is there a generic term for these formats? Otherwise I think the optional asType is reasonable.

@heuermh

heuermh Sep 6, 2016

Member

Is there a generic term for these formats? Otherwise I think the optional asType is reasonable.

- log.info("Saving data in BAM format")
+ if (args.outputPath.endsWith(".sam") ||
+ args.outputPath.endsWith(".bam") ||
+ args.outputPath.endsWith(".cram")) {

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

Would it be worth throwing all these file extension checks in the same utility class? Or perhaps on a trait with a check method and a descriptive one (getSupportedFileExtensions or somthing similar)?

@heuermh

heuermh Sep 6, 2016

Member

Would it be worth throwing all these file extension checks in the same utility class? Or perhaps on a trait with a check method and a descriptive one (getSupportedFileExtensions or somthing similar)?

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

Yeah, I think I did something similar for VCF in #1104, although a private method and not a utility class. I'll get that on a review pass.

@fnothaft

fnothaft Sep 6, 2016

Member

Yeah, I think I did something similar for VCF in #1104, although a private method and not a utility class. I'll get that on a review pass.

@@ -453,47 +454,57 @@ class AlignmentRecordRDDSuite extends ADAMFunSuite {
val inputPath = resourcePath("small.sam")
val reads: AlignmentRecordRDD = sc.loadAlignments(inputPath)
val outputPath = tmpLocation(".sam")
- reads.saveAsSam(outputPath, true)
+ reads.saveAsSam(outputPath, asType = Some(SAMFormat.SAM))

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

asType isn't strictly necessary here, right?

@heuermh

heuermh Sep 6, 2016

Member

asType isn't strictly necessary here, right?

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

Yeah, it isn't strictly necessary, but I wanted to have the unit tests hit both the case where SAM was identified by extension and where the SAM format was explicitly set.

@fnothaft

fnothaft Sep 6, 2016

Member

Yeah, it isn't strictly necessary, but I wanted to have the unit tests hit both the case where SAM was identified by extension and where the SAM format was explicitly set.

@heuermh heuermh referenced this pull request Sep 7, 2016

Closed

Release ADAM version 0.20.0 #1048

47 of 61 tasks complete
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 11, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1478/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1478/
Test PASSed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 12, 2016

Member

@fnothaft did you want to take off the -1 here?

What is here LGTM, though I would like to see some CRAM-specific unit tests and a small CRAM test file to read.

Member

heuermh commented Sep 12, 2016

@fnothaft did you want to take off the -1 here?

What is here LGTM, though I would like to see some CRAM-specific unit tests and a small CRAM test file to read.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 12, 2016

Member

Nah this is still -1 pending the CRAM specific tests.

Member

fnothaft commented Sep 12, 2016

Nah this is still -1 pending the CRAM specific tests.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 13, 2016

Member

OK! Removing my -1 here. I've added a commit (b2d40c7) with CRAM specific tests. Can I get a review pass of said commit? Once it looks good to everyone, I'll squash down and we can merge this.

Member

fnothaft commented Sep 13, 2016

OK! Removing my -1 here. I've added a commit (b2d40c7) with CRAM specific tests. Can I get a review pass of said commit? Once it looks good to everyone, I'll squash down and we can merge this.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 13, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1488/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1488/
Test PASSed.

+ val readsA = rddA.rdd.collect()
+ val readsB = rddB.rdd.collect()
+
+ readsA.indices.foreach {

This comment has been minimized.

@heuermh

heuermh Sep 13, 2016

Member

Hmm... this may be a more robust way to validate than the zip I've been using (with various problems) in FeatureRDDSuite

@heuermh

heuermh Sep 13, 2016

Member

Hmm... this may be a more robust way to validate than the zip I've been using (with various problems) in FeatureRDDSuite

This comment has been minimized.

@fnothaft

fnothaft Sep 13, 2016

Member

The two are equivalent, no?

@fnothaft

fnothaft Sep 13, 2016

Member

The two are equivalent, no?

This comment has been minimized.

@heuermh

heuermh Sep 13, 2016

Member

Yeah, the zip fails inconsistently for me with SparkException: Can only zip RDDs with same number of elements in each partition. Sometimes less clever is better.

@heuermh

heuermh Sep 13, 2016

Member

Yeah, the zip fails inconsistently for me with SparkException: Can only zip RDDs with same number of elements in each partition. Sometimes less clever is better.

This comment has been minimized.

@fnothaft

fnothaft Sep 13, 2016

Member

Ah, I typically do a collect before the zip, which eliminates said issue (and we need to collect to use asserts anyways).

@fnothaft

fnothaft Sep 13, 2016

Member

Ah, I typically do a collect before the zip, which eliminates said issue (and we need to collect to use asserts anyways).

asSingleFile = true)
+ println("Saved file to %s/%s".format(tempFile.toAbsolutePath.toString, filename))

This comment has been minimized.

@heuermh

heuermh Sep 13, 2016

Member

use logger or remove

@heuermh

heuermh Sep 13, 2016

Member

use logger or remove

This comment has been minimized.

@fnothaft

fnothaft Sep 13, 2016

Member

Dur, had that in for debugging. I'll remove this, then squash and rebase. Thanks for catching this @heuermh !

@fnothaft

fnothaft Sep 13, 2016

Member

Dur, had that in for debugging. I'll remove this, then squash and rebase. Thanks for catching this @heuermh !

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 13, 2016

Member

LGTM

Member

heuermh commented Sep 13, 2016

LGTM

[ADAM-1141] Add support for saving/loading AlignmentRecords to/from C…
…RAM.


Resolves #1141. Changes the signature of `AlignmentRecordRDD.saveAsSAM` to take
an `Option[SAMFormat]` parameter, since `asSam` is now no longer a binary
choice.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 13, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1489/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1489/
Test PASSed.

@heuermh heuermh merged commit 0b7e03e into bigdatagenomics:master Sep 13, 2016

1 check passed

default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 13, 2016

Member

Merged as commit 0b7e03e.

Thank you, @fnothaft!

Member

heuermh commented Sep 13, 2016

Merged as commit 0b7e03e.

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment