New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1449] Add loadSequenceDictionary to ADAM context. #1557

Merged
merged 2 commits into from Jul 11, 2017

Conversation

Projects
5 participants
@heuermh
Member

heuermh commented Jun 5, 2017

Fixes #1449
Fixes #1534

Implements ADAMContext.loadSequenceDictionary that supports HTSJDK sequence dictionary (.dict), Bedtools genome file (.genome), and UCSC Genome Browser chromInfo (.txt) files.

Taking suggestions on how best to add this to other load methods.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jun 5, 2017

Coverage Status

Coverage increased (+0.4%) to 83.18% when pulling d1d84dc on heuermh:sequence-dictionaries into b7762c2 on bigdatagenomics:master.

coveralls commented Jun 5, 2017

Coverage Status

Coverage increased (+0.4%) to 83.18% when pulling d1d84dc on heuermh:sequence-dictionaries into b7762c2 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 5, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2083/
Test PASSed.

AmplabJenkins commented Jun 5, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2083/
Test PASSed.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jun 5, 2017

Member

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

Member

devin-petersohn commented Jun 5, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

@fnothaft

Few small nits. Looks great otherwise! Thanks @heuermh!

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala Outdated
Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/util/FileExtensions.scala Outdated
// even though we are just using the codec to read, it requires us to give
// a writer...? setting the writer to stderr seems like a reasonable choice.
val codec = new SAMSequenceDictionaryCodec(
new BufferedWriter(new OutputStreamWriter(System.err)))

This comment has been minimized.

@fnothaft

fnothaft Jun 5, 2017

Member

I agree that this is a reasonable choice. That said, what happens if we provide a null pointer?

@fnothaft

fnothaft Jun 5, 2017

Member

I agree that this is a reasonable choice. That said, what happens if we provide a null pointer?

This comment has been minimized.

@heuermh

heuermh Jun 6, 2017

Member

It passes unit tests. I'd lean towards keeping this as-is, given how little I trust htsjdk

@heuermh

heuermh Jun 6, 2017

Member

It passes unit tests. I'd lean towards keeping this as-is, given how little I trust htsjdk

This comment has been minimized.

@fnothaft

fnothaft Jul 11, 2017

Member

SGTM

@fnothaft

fnothaft Jul 11, 2017

Member

SGTM

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 5, 2017

Member

Suggestions from meeting today:

  • Bedtools documentation often refers to .genome.txt files
  • Check for sequence dictionary file with matching base name when loading features
Member

heuermh commented Jun 5, 2017

Suggestions from meeting today:

  • Bedtools documentation often refers to .genome.txt files
  • Check for sequence dictionary file with matching base name when loading features
assert(sequences("chr17_gl000206_random").get.length === 41001L)
}
sparkTest("load BED features with Bedtools .genome file as sequence dictionary, no matching features") {

This comment has been minimized.

@heuermh

heuermh Jun 6, 2017

Member

What should happen here?
Similar with features that match on reference name but not on coordinates (e.g. start > reference length).

@heuermh

heuermh Jun 6, 2017

Member

What should happen here?
Similar with features that match on reference name but not on coordinates (e.g. start > reference length).

This comment has been minimized.

@heuermh

heuermh Jun 14, 2017

Member

Will add ValidationStringency to ignore or throw exception.

@heuermh

heuermh Jun 14, 2017

Member

Will add ValidationStringency to ignore or throw exception.

This comment has been minimized.

@heuermh

heuermh Jun 15, 2017

Member

On further consideration, since checking at load time might significantly slow down the common case unnecessarily, where the sequence dictionary is created from the features themselves, I'm leaning towards a separate pull request to implement

def filterBySequenceDictionary(sd: SequenceDictionary): U = {
  ...
}

and

def filterBySequenceDictionary(): U = {
  filterBySequenceDictionary(this.sequences)
}

methods on GenomicRDD.

@heuermh

heuermh Jun 15, 2017

Member

On further consideration, since checking at load time might significantly slow down the common case unnecessarily, where the sequence dictionary is created from the features themselves, I'm leaning towards a separate pull request to implement

def filterBySequenceDictionary(sd: SequenceDictionary): U = {
  ...
}

and

def filterBySequenceDictionary(): U = {
  filterBySequenceDictionary(this.sequences)
}

methods on GenomicRDD.

This comment has been minimized.

@heuermh

heuermh Jul 6, 2017

Member

Per discussion on #1575, will leave this test case as is

@heuermh

heuermh Jul 6, 2017

Member

Per discussion on #1575, will leave this test case as is

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 6, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2084/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1557/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains e19278fe3d9a6979f7549ae163fa8c74b1a91730 # timeout=10Checking out Revision e19278fe3d9a6979f7549ae163fa8c74b1a91730 (origin/pr/1557/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e19278fe3d9a6979f7549ae163fa8c74b1a91730First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Jun 6, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2084/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1557/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains e19278fe3d9a6979f7549ae163fa8c74b1a91730 # timeout=10Checking out Revision e19278fe3d9a6979f7549ae163fa8c74b1a91730 (origin/pr/1557/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f e19278fe3d9a6979f7549ae163fa8c74b1a91730First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jun 6, 2017

Coverage Status

Coverage increased (+0.2%) to 83.29% when pulling 0727a61 on heuermh:sequence-dictionaries into ad5ae6d on bigdatagenomics:master.

coveralls commented Jun 6, 2017

Coverage Status

Coverage increased (+0.2%) to 83.29% when pulling 0727a61 on heuermh:sequence-dictionaries into ad5ae6d on bigdatagenomics:master.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 6, 2017

Member

I am liking the suggestion that we check for sequence dictionary file with matching base name when loading features less now that I've thought about it.

What if pathName is a glob that matches more than one feature file? I'd rather the user be more explicit, calling

val sd = sc.loadSequenceDictionary("hg19.genome")
val features = sc.loadFeatures("**.bed", optSequenceDictionary = Some(sd))

to share one sequence dictionary across a bunch of BED files.

Member

heuermh commented Jun 6, 2017

I am liking the suggestion that we check for sequence dictionary file with matching base name when loading features less now that I've thought about it.

What if pathName is a glob that matches more than one feature file? I'd rather the user be more explicit, calling

val sd = sc.loadSequenceDictionary("hg19.genome")
val features = sc.loadFeatures("**.bed", optSequenceDictionary = Some(sd))

to share one sequence dictionary across a bunch of BED files.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2085/
Test PASSed.

AmplabJenkins commented Jun 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2085/
Test PASSed.

*/
def isGenomeExt(pathName: String): Boolean = {
pathName.endsWith(".genome") ||
pathName.endsWith(".genome.txt")

This comment has been minimized.

@devin-petersohn

devin-petersohn Jun 9, 2017

Member

We can probably get rid of this and be strict about it being a *.genome file, as long as we give good feedback about our requirements. I haven't seen people use genome.txt extensions, it's been mostly either .txt or .genome.

@devin-petersohn

devin-petersohn Jun 9, 2017

Member

We can probably get rid of this and be strict about it being a *.genome file, as long as we give good feedback about our requirements. I haven't seen people use genome.txt extensions, it's been mostly either .txt or .genome.

This comment has been minimized.

@heuermh

heuermh Jun 14, 2017

Member

All three (.genome, .genome.txt, and .txt) go through the same code path. The only difference is the log message.

@heuermh

heuermh Jun 14, 2017

Member

All three (.genome, .genome.txt, and .txt) go through the same code path. The only difference is the log message.

@heuermh heuermh changed the title from Add loadSequenceDictionary to ADAM context. to [ADAM-1449] Add loadSequenceDictionary to ADAM context. Jun 14, 2017

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 14, 2017

Member

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

Member

heuermh commented Jun 14, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 14, 2017

Member

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

I mean, if anyone wants to do that, they can load the SequenceDictionary manually and attach it to the GenomicRDD. E.g.:

val reads = sc.loadParquetAlignments("my/reads.adam")
val sequences = sc.loadSequenceDictionary("hs37d5.dict")

val readsWithANewDictionary = reads.copy(sequences = sequences)
Member

fnothaft commented Jun 14, 2017

Do we want to allow users to provide SequenceDictionary when they've loaded in a *.adam file?

No, I don't really understand what that would mean.

I mean, if anyone wants to do that, they can load the SequenceDictionary manually and attach it to the GenomicRDD. E.g.:

val reads = sc.loadParquetAlignments("my/reads.adam")
val sequences = sc.loadSequenceDictionary("hs37d5.dict")

val readsWithANewDictionary = reads.copy(sequences = sequences)
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 15, 2017

Member

@heuermh filterBySequenceDictionary SGTM! Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known? How would it handle unmapped items?

Member

fnothaft commented Jun 15, 2017

@heuermh filterBySequenceDictionary SGTM! Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known? How would it handle unmapped items?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 21, 2017

Member

See #1575

Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known?

Yes

How would it handle unmapped items?

Depends on how getReferenceRegions is implemented, I suppose

Member

heuermh commented Jun 21, 2017

See #1575

Would that filter out all items that were mapped to contigs that were not present, and all items that mapped off the end of a contig that was known?

Yes

How would it handle unmapped items?

Depends on how getReferenceRegions is implemented, I suppose

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 6, 2017

Member

Rebased and squashed down to two commits. Fixes #1588.

Member

heuermh commented Jul 6, 2017

Rebased and squashed down to two commits. Fixes #1588.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 6, 2017

Coverage Status

Coverage decreased (-0.05%) to 83.213% when pulling 52258fd on heuermh:sequence-dictionaries into 8572fb7 on bigdatagenomics:master.

coveralls commented Jul 6, 2017

Coverage Status

Coverage decreased (-0.05%) to 83.213% when pulling 52258fd on heuermh:sequence-dictionaries into 8572fb7 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2175/
Test PASSed.

AmplabJenkins commented Jul 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2175/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

coveralls commented Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2191/
Test PASSed.

AmplabJenkins commented Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2191/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

coveralls commented Jul 11, 2017

Coverage Status

Coverage increased (+0.08%) to 84.097% when pulling 61f45be on heuermh:sequence-dictionaries into 97a8498 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2192/
Test PASSed.

AmplabJenkins commented Jul 11, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2192/
Test PASSed.

sc: SparkContext): SequenceDictionary = {
val records = sc
.textFile(filePath)

This comment has been minimized.

@fnothaft

fnothaft Jul 11, 2017

Member

There's nothing inherently wrong with using sc.textFile to load this, but it seems a bit... aggressive to use it to load such a small file.

@fnothaft

fnothaft Jul 11, 2017

Member

There's nothing inherently wrong with using sc.textFile to load this, but it seems a bit... aggressive to use it to load such a small file.

@fnothaft fnothaft merged commit 01b0dc7 into bigdatagenomics:master Jul 11, 2017

2 of 3 checks passed

codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage increased (+0.08%) to 84.097%
Details
default Merged build finished.
Details
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 11, 2017

Member

Merged! Thanks @heuermh!

Member

fnothaft commented Jul 11, 2017

Merged! Thanks @heuermh!

@heuermh heuermh deleted the heuermh:sequence-dictionaries branch Jul 11, 2017

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment