Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supports access to indexed fa and fasta files #1320

Merged
merged 1 commit into from Dec 23, 2016

Conversation

@akmorrow13
Copy link
Contributor

akmorrow13 commented Dec 19, 2016

No description provided.

@AmplabJenkins
Copy link

AmplabJenkins commented Dec 19, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1685/
Test PASSed.

}

// Get sequence dictionary. If sequence dictionary is not defined,
// generate sequence dictionary from dile

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 21, 2016

Author Contributor

spelling

Copy link
Member

fnothaft left a comment

Looks good over all! I've dropped a bunch of comments inline. Most are nits, but a few are proposals to remove away from the Picard bits.

@@ -185,5 +185,10 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
<dependency>

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

@heuermh and I would like to delete this dependency


package org.bdgenomics.adam.util

import java.net.URI

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Nit: Move java after htsjdk


import java.net.URI
import java.nio.file.Paths

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Nit: no space

case class IndexedFastaFile(sc: SparkContext, filePath: String) extends ReferenceFile with Logging {

// get absolute path and scheme to create URI
val path = new Path(filePath).toString

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Let's make all these vals private.


// get absolute path and scheme to create URI
val path = new Path(filePath).toString
val scheme: String = new Path(path).getFileSystem(sc.hadoopConfiguration).getScheme

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Nit: Do you need the : String type definition here?

This comment has been minimized.

Copy link
@ryan-williams

ryan-williams Dec 22, 2016

Member

I often leave it off in these situations, but FWIW IntelliJ has an inspection that warns against not explicitly denoting types of public members, for one reason or another.

val scheme: String = new Path(path).getFileSystem(sc.hadoopConfiguration).getScheme
val pathWithScheme = s"${scheme}://${path}"

val uri: URI = new URI(pathWithScheme)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Nit: Ditto here RE : URI


val uri: URI = new URI(pathWithScheme)

val file = Paths.get(uri).toFile

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

We require the dependency on https://github.com/damiencarol/jsr203-hadoop, for this, right? If so, let's document that here. Also, I believe we pull that in right now as a transient dependency through Hadoop-BAM. Let's make that an explicit dependency in the POM.

This comment has been minimized.

Copy link
@heuermh

heuermh Dec 21, 2016

Member

Explicit dependency on something for runtime scope? Might not be worth the headache of keeping our dependency version number in sync with the one provided by Hadoop-BAM.

try {
SequenceDictionary(ref.getSequenceDictionary)
} catch {
case e: Exception => {

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

I'd rather not generate the Sequence Dictionary file, my preference would be to fall back on a validation stringency check here. I'm thinking it'd be best to:

} catch {
  case e: Exception => {
    if (stringency == ValidationStringency.STRICT) {
      throw e
    } else {
      if (stringency == ValidationStringency.LENIENT) {
        log.warn("Caught exception %s when loading FASTA sequence dictionary. Using empty dictionary instead.".format(e))
      }
      SequenceDictionary.empty
    }
  }
}

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

For Mango, we require a sequence dictionary. So in this case, for any fa or fasta file, we will not have a sequence dictionary and thus Mango will fail. If this is a problem and we can't find another way to get the sequence dictionary, I can move this code to Mango.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

In that case, you'd pass stringency = ValidationStringency.STRICT from Mango. There's several reasons (in no specific order) I'd like to fail(/validate) instead of regenerate the .dict file:

  1. Consistency with other tools. All of the Picard CLI's (and the GATK) will fail by default if the .dict file is missing. They won't regenerate the file.
  2. Simplicity of handling edge cases. If loading the sequence dictionary fails, this could've happened because the file is missing (and then regenerating the file is a reasonable answer) but it could also be a file permission issue, corrupt data on file system, bad Hadoop configuration causes JSR-203-hadoop to barf, etc. In all of those cases, regenerating the sequence dictionary is not the correct answer.
  3. Packaging. I don't want to add Picard as a dependency.

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

Ok, that makes sense, Mango can require a .dict file.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

Yeah, I would just pass stringency = ValidationStringency.STRICT and try/catch for the error.

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

Where does the stringency get passed? Do we let this be a parameter in the case class?

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

Choice is yours, but yes, sounds reasonable.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

Also, small aside: I think folks in the Scala ecosystem prefer to catch Throwable as opposed to Exception.

def extract(region: ReferenceRegion): String = {
// Note: index is one based
val start = Math.max(1L, region.start)
val end = Math.min(sequences(region.referenceName).get.length, region.end)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

I'd rather not do the Math.min here, at least not by default. Do we get a validation error if region.end is longer than the sequence length? If so, let's handle that with validation stringency --> perhaps we do the Math.min in the SILENT/LENIENT cases?

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

yes, there is an error if region end is longer than the sequence.

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

However, if we are going to allow SequenceDictionary.empty mentioned in the above request, we can't call Math.min for getting the sequence length because sequences will be null.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

I would rather us not Math.max/.min here, and throw the error instead, so I see that as a feature and not a bug.

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Dec 22, 2016

Author Contributor

What I don't like about that is that twobit and NucleotideContig extract() functions don't throw errors when you pass the contig end, so this will be the only ReferenceFile implementation that throws an error for that.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

I don't think we define what the function returns if the provided query reference region is invalid.

pom.xml Outdated
@@ -564,6 +565,17 @@
<artifactId>scala-guice_${scala.version.prefix}</artifactId>
<version>4.0.1</version>
</dependency>
<dependency>

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 21, 2016

Member

Would like to remove.

@heuermh
Copy link
Member

heuermh commented Dec 21, 2016

I didn't notice during our conversation this afternoon that this is for ReferenceFile, not something that we might want to use to populate an RDD of Sequences or Slices.

Thus I'm generally +1 after @fnothaft's requested changes are in.

<dependency>
<groupId>com.github.broadinstitute</groupId>
<artifactId>picard</artifactId>
<scope>compile</scope>

This comment has been minimized.

Copy link
@ryan-williams

ryan-williams Dec 22, 2016

Member

small drive-by comment but any reason to explicitly declare compile scope?

@akmorrow13 akmorrow13 force-pushed the akmorrow13:indexedFasta branch 3 times, most recently from a5ee162 to b024779 Dec 22, 2016
Copy link
Member

fnothaft left a comment

2 small nits, otherwise LGTM!

try {
SequenceDictionary(ref.getSequenceDictionary)
} catch {
case e: Exception => {

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

Also, small aside: I think folks in the Scala ecosystem prefer to catch Throwable as opposed to Exception.

pom.xml Outdated
@@ -30,6 +30,7 @@
<bdg-formats.version>0.10.0</bdg-formats.version>
<bdg-utils.version>0.2.10</bdg-utils.version>
<htsjdk.version>2.5.0</htsjdk.version>
<picard.version>2.8.0</picard.version>

This comment has been minimized.

Copy link
@fnothaft

fnothaft Dec 22, 2016

Member

Left this one in.

@AmplabJenkins
Copy link

AmplabJenkins commented Dec 22, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1691/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1320/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains dd19fd05abd735171eeef5c9fe7d8ed1c67e65df # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1320/merge^{commit} # timeout=10Checking out Revision dd19fd05abd735171eeef5c9fe7d8ed1c67e65df (origin/pr/1320/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f dd19fd05abd735171eeef5c9fe7d8ed1c67e65dfFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

AmplabJenkins commented Dec 22, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1692/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1320/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 29744002139861f0fc75b2c1c15b2e2e15d9c445 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1320/merge^{commit} # timeout=10Checking out Revision 29744002139861f0fc75b2c1c15b2e2e15d9c445 (origin/pr/1320/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 29744002139861f0fc75b2c1c15b2e2e15d9c445First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member

fnothaft commented Dec 22, 2016

@akmorrow13 can you run ./scripts/format-source?

@akmorrow13 akmorrow13 force-pushed the akmorrow13:indexedFasta branch from b024779 to 3fd8e67 Dec 23, 2016
@AmplabJenkins
Copy link

AmplabJenkins commented Dec 23, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1697/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1320/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 151b77cc00b2d925a021ed5b4d90304c81994b77 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1320/merge^{commit} # timeout=10Checking out Revision 151b77cc00b2d925a021ed5b4d90304c81994b77 (origin/pr/1320/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 151b77cc00b2d925a021ed5b4d90304c81994b77 > /home/jenkins/git2/bin/git rev-list 29744002139861f0fc75b2c1c15b2e2e15d9c445 # timeout=10Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13 akmorrow13 force-pushed the akmorrow13:indexedFasta branch from 3fd8e67 to dbdac69 Dec 23, 2016
@AmplabJenkins
Copy link

AmplabJenkins commented Dec 23, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1698/
Test PASSed.

@akmorrow13 akmorrow13 force-pushed the akmorrow13:indexedFasta branch from dbdac69 to 1a52fb6 Dec 23, 2016
Copy link
Member

fnothaft left a comment

LGTM! @akmorrow13 let me know if this is good to go from your side, and I will merge (when tests finish).

@AmplabJenkins
Copy link

AmplabJenkins commented Dec 23, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1699/
Test PASSed.

@fnothaft fnothaft merged commit 63077b6 into bigdatagenomics:master Dec 23, 2016
1 check passed
1 check passed
default Merged build finished.
Details
@fnothaft
Copy link
Member

fnothaft commented Dec 23, 2016

Merged! Thanks @akmorrow13!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.