Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1449] Add loadSequenceDictionary to ADAM context. #1557

Merged
merged 2 commits into from Jul 11, 2017

Conversation

@heuermh
Copy link
Member

@heuermh heuermh commented Jun 5, 2017

Fixes #1449
Fixes #1534

Implements ADAMContext.loadSequenceDictionary that supports HTSJDK sequence dictionary (.dict), Bedtools genome file (.genome), and UCSC Genome Browser chromInfo (.txt) files.

Taking suggestions on how best to add this to other load methods.

@coveralls
Copy link

@coveralls coveralls commented Jun 5, 2017

Coverage Status

Coverage increased (+0.4%) to 83.18% when pulling d1d84dc on heuermh:sequence-dictionaries into b7762c2 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jun 5, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2083/
Test PASSed.

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Jun 5, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

Copy link
Member

@fnothaft fnothaft left a comment

Few small nits. Looks great otherwise! Thanks @heuermh!

* @param pathName The path name to load a sequence dictionary from.
* @return Returns a sequence dictionary.
* @throws IllegalArgumentException if pathName file extension not one of .dict,
* .genome, or .txt

This comment has been minimized.

@fnothaft

fnothaft Jun 5, 2017
Member

Hard to tell in the web viewer, but I think you need another space here to match our typical scaladoc indent conventions.

* @param pathName The path name to match.
* @return Returns true if the path name matches an HTSJDK sequence dictionary (.dict) extension.
*/
private[adam] def isDictExt(pathName: String): Boolean = {

This comment has been minimized.

@fnothaft

fnothaft Jun 5, 2017
Member

Small nit, but instead of making all the methods private[adam], can we make the object private[adam].

This comment has been minimized.

@heuermh

heuermh Jun 5, 2017
Author Member

+1

// even though we are just using the codec to read, it requires us to give
// a writer...? setting the writer to stderr seems like a reasonable choice.
val codec = new SAMSequenceDictionaryCodec(
new BufferedWriter(new OutputStreamWriter(System.err)))

This comment has been minimized.

@fnothaft

fnothaft Jun 5, 2017
Member

I agree that this is a reasonable choice. That said, what happens if we provide a null pointer?

This comment has been minimized.

@heuermh

heuermh Jun 6, 2017
Author Member

It passes unit tests. I'd lean towards keeping this as-is, given how little I trust htsjdk

This comment has been minimized.

@heuermh
Copy link
Member Author

@heuermh heuermh commented Jun 5, 2017

Suggestions from meeting today:

  • Bedtools documentation often refers to .genome.txt files
  • Check for sequence dictionary file with matching base name when loading features
@heuermh heuermh force-pushed the heuermh:sequence-dictionaries branch from d1d84dc to 95a8b6f Jun 6, 2017
assert(sequences("chr17_gl000206_random").get.length === 41001L)
}

sparkTest("load BED features with Bedtools .genome file as sequence dictionary, no matching features") {

This comment has been minimized.

@heuermh

heuermh Jun 6, 2017
Author Member

What should happen here?
Similar with features that match on reference name but not on coordinates (e.g. start > reference length).

This comment has been minimized.

@heuermh

heuermh Jun 14, 2017
Author Member

Will add ValidationStringency to ignore or throw exception.

This comment has been minimized.

@heuermh

heuermh Jun 15, 2017
Author Member

On further consideration, since checking at load time might significantly slow down the common case unnecessarily, where the sequence dictionary is created from the features themselves, I'm leaning towards a separate pull request to implement

def filterBySequenceDictionary(sd: SequenceDictionary): U = {
  ...
}

and

def filterBySequenceDictionary(): U = {
  filterBySequenceDictionary(this.sequences)
}

methods on GenomicRDD.

This comment has been minimized.

@heuermh

heuermh Jul 6, 2017
Author Member

Per discussion on #1575, will leave this test case as is

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jun 6, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2084/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1557/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains e19278fe3d9a6979f7549ae163fa8c74b1a91730 # timeout=10Checking out Revision e19278fe3d9a6979f7549ae163fa8c74b1a91730 (origin/pr/1557/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e19278fe3d9a6979f7549ae163fa8c74b1a91730First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

@coveralls coveralls commented Jun 6, 2017

Coverage Status

Coverage increased (+0.2%) to 83.29% when pulling 0727a61 on heuermh:sequence-dictionaries into ad5ae6d on bigdatagenomics:master.

@heuermh
Copy link
Member Author

@heuermh heuermh commented Jun 6, 2017

I am liking the suggestion that we check for sequence dictionary file with matching base name when loading features less now that I've thought about it.

What if pathName is a glob that matches more than one feature file? I'd rather the user be more explicit, calling

val sd = sc.loadSequenceDictionary("hg19.genome")
val features = sc.loadFeatures("**.bed", optSequenceDictionary = Some(sd))

to share one sequence dictionary across a bunch of BED files.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jun 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2085/
Test PASSed.

*/
def isGenomeExt(pathName: String): Boolean = {
pathName.endsWith(".genome") ||
pathName.endsWith(".genome.txt")

This comment has been minimized.

@devin-petersohn

devin-petersohn Jun 9, 2017
Member

We can probably get rid of this and be strict about it being a *.genome file, as long as we give good feedback about our requirements. I haven't seen people use genome.txt extensions, it's been mostly either .txt or .genome.

This comment has been minimized.

@heuermh

heuermh Jun 14, 2017
Author Member

All three (.genome, .genome.txt, and .txt) go through the same code path. The only difference is the log message.

@heuermh heuermh changed the title Add loadSequenceDictionary to ADAM context. [ADAM-1449] Add loadSequenceDictionary to ADAM context. Jun 14, 2017
@heuermh
Copy link
Member Author

@heuermh heuermh commented Jun 14, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 14, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

I mean, if anyone wants to do that, they can load the SequenceDictionary manually and attach it to the GenomicRDD. E.g.:

val reads = sc.loadParquetAlignments("my/reads.adam")
val sequences = sc.loadSequenceDictionary("hs37d5.dict")

val readsWithANewDictionary = reads.copy(sequences = sequences)
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 15, 2017

@heuermh filterBySequenceDictionary SGTM! Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known? How would it handle unmapped items?

@heuermh
Copy link
Member Author

@heuermh heuermh commented Jun 21, 2017

See #1575

Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known?

Yes

How would it handle unmapped items?

Depends on how getReferenceRegions is implemented, I suppose

@heuermh
Copy link
Member Author

@heuermh heuermh commented Jul 6, 2017

Rebased and squashed down to two commits. Fixes #1588.

@coveralls
Copy link

@coveralls coveralls commented Jul 6, 2017

Coverage Status

Coverage decreased (-0.05%) to 83.213% when pulling 52258fd on heuermh:sequence-dictionaries into 8572fb7 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2175/
Test PASSed.

@heuermh heuermh force-pushed the heuermh:sequence-dictionaries branch from 52258fd to 1a0dd3f Jul 11, 2017
@heuermh heuermh force-pushed the heuermh:sequence-dictionaries branch from 1a0dd3f to 61f45be Jul 11, 2017
@coveralls
Copy link

@coveralls coveralls commented Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2191/
Test PASSed.

@coveralls
Copy link

@coveralls coveralls commented Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2192/
Test PASSed.

sc: SparkContext): SequenceDictionary = {

val records = sc
.textFile(filePath)

This comment has been minimized.

@fnothaft

fnothaft Jul 11, 2017
Member

There's nothing inherently wrong with using sc.textFile to load this, but it seems a bit... aggressive to use it to load such a small file.

@fnothaft fnothaft merged commit 01b0dc7 into bigdatagenomics:master Jul 11, 2017
2 of 3 checks passed
2 of 3 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage increased (+0.08%) to 84.097%
Details
default Merged build finished.
Details
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jul 11, 2017

Merged! Thanks @heuermh!

@heuermh heuermh deleted the heuermh:sequence-dictionaries branch Jul 11, 2017
@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.