Fastq record converter #1185

Closed
wants to merge 41 commits into
from

Conversation

Projects
None yet
4 participants
@zyxue
Contributor

zyxue commented Sep 28, 2016

There are still problems:

  • The description for convertRead does don't match with what it does (e.g. the return type). The method name is kind of vague, not sure what behavior is actually intended. The refactoring tries not to change its behavior.
  • I am also a bit confused for what default values should be used when creating a AlignmentRecord instance based on a Fastq entry. The original code set many fields to null. I thought leaving them as default would make more sense. e.g. ReadNegativeStrand is set to null while the default is false. ProperPair is set to true, which I don't think make sense since you can't really tell just based on Fastq, the default is also fasle. It would be helpful if someone can clarify what's the most sensible values for the following fields for both paired-end and single-end fastq entries.
.setReadPaired(readPaired)
.setReadInFragment(readInFragment)
.setReadNegativeStrand(null)
.setMateNegativeStrand(null)
.setPrimaryAlignment(null)
.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)
  • Also, what to do when the read length and that of qualities don't match? There is a stringency parameter involved in convertRead. If it's STRICT, then qualities must exist (CANNOT be *) and also match the length of reads. When it's NOT STRICT, qualities will be padded with B if not exist or shorter than read length. If it's longer than read length, NotImplementedError will be thrown. Such behavior seems quite arbitrary and doesn't make much sense to me, and it doesn't apply to convertPair and convertFragment, in which the read length and qualities must match without consideration for stringency.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 28, 2016

Can one of the admins verify this patch?

Can one of the admins verify this patch?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 28, 2016

Member

Jenkins, test this please.

Member

heuermh commented Sep 28, 2016

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 28, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/

Build result: FAILURE

GitHub pull request #1185 of commit d4c5ad6 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f280ecc19296a5841548d95e94b1bc3b986b0012 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision f280ecc19296a5841548d95e94b1bc3b986b0012 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f280ecc19296a5841548d95e94b1bc3b986b0012First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/

Build result: FAILURE

GitHub pull request #1185 of commit d4c5ad6 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f280ecc19296a5841548d95e94b1bc3b986b0012 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision f280ecc19296a5841548d95e94b1bc3b986b0012 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f280ecc19296a5841548d95e94b1bc3b986b0012First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

+
+ private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = {
+ val lines = input.toString.split('\n')
+ require(lines.length == 8,

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be new line characters at any place in the record.

@heuermh

heuermh Sep 28, 2016

Member

I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be new line characters at any place in the record.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

Can you give an example for hard line wrapping? What does it look like?

@zyxue

zyxue Sep 28, 2016

Contributor

Can you give an example for hard line wrapping? What does it look like?

This comment has been minimized.

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

@fnothaft

fnothaft Sep 30, 2016

Member

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

This comment has been minimized.

@zyxue

zyxue Oct 1, 2016

Contributor

I tried to implement parsing for wrapped lines. Then I found that it would require exact match of sequence length and quality length. Otherwise, it's ambiguous to tell when the quality lines stop. This makes padding with B when length(qual line) < length(seq line), is that right? e.g. error_short_qual.fastq from biojava is an error.

@zyxue

zyxue Oct 1, 2016

Contributor

I tried to implement parsing for wrapped lines. Then I found that it would require exact match of sequence length and quality length. Otherwise, it's ambiguous to tell when the quality lines stop. This makes padding with B when length(qual line) < length(seq line), is that right? e.g. error_short_qual.fastq from biojava is an error.

+ s"Input must have 4 lines (${lines.length.toString} found):\n${input}")
+
+ val readName = lines(0).drop(1)
+ if (readName.endsWith("/1") && setSecondOfPair)

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

I've seen files in the wild that use 1 and 2 (with space) instead of /1 and /2. Should we add that here?
See e.g. PairedEndFastqReader.java#L59

@heuermh

heuermh Sep 28, 2016

Member

I've seen files in the wild that use 1 and 2 (with space) instead of /1 and /2. Should we add that here?
See e.g. PairedEndFastqReader.java#L59

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.

@zyxue

zyxue Sep 28, 2016

Contributor

I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137

@heuermh

heuermh Sep 28, 2016

Member

There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

+1, I've also seen _1/2

@fnothaft

fnothaft Sep 30, 2016

Member

+1, I've also seen _1/2

This comment has been minimized.

@zyxue

zyxue Oct 1, 2016

Contributor

addressed by regex like [/ +_]1$

@zyxue

zyxue Oct 1, 2016

Contributor

addressed by regex like [/ +_]1$

+ else {
+ if (readQualitiesRaw == "*") "B" * readSequence.length
+ else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length))
+ else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented")

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

NotImplementedError doesn't seem right, should be IllegalArgumentException. These length checks should also happen with strict stringency.

@heuermh

heuermh Sep 28, 2016

Member

NotImplementedError doesn't seem right, should be IllegalArgumentException. These length checks should also happen with strict stringency.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

Also, what's the reason for padding B in case qualities information is incomplete?

@zyxue

zyxue Sep 28, 2016

Contributor

Also, what's the reason for padding B in case qualities information is incomplete?

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

IIRC, B is the code for "unknown" quality. CC @ryan-williams

@fnothaft

fnothaft Sep 30, 2016

Member

IIRC, B is the code for "unknown" quality. CC @ryan-williams

This comment has been minimized.

@zyxue

zyxue Sep 30, 2016

Contributor

https://en.wikipedia.org/wiki/FASTQ_format

Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower the mean observed score of a given run.

@zyxue

zyxue Sep 30, 2016

Contributor

https://en.wikipedia.org/wiki/FASTQ_format

Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower the mean observed score of a given run.

This comment has been minimized.

@zyxue

zyxue Oct 1, 2016

Contributor

fixed NotImplementedError => IllegalArgumentException

@zyxue

zyxue Oct 1, 2016

Contributor

fixed NotImplementedError => IllegalArgumentException

+ .setSequence(sequence)
+ .setQual(qual)
+ .setReadPaired(readPaired)
+ .setProperPair(null)

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

I'm not sure why these are explicitly set to null.

@heuermh

heuermh Sep 28, 2016

Member

I'm not sure why these are explicitly set to null.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

explicitly setting null is dropped in newer commits. If you rerun the tests, it should all pass.

@zyxue

zyxue Sep 28, 2016

Contributor

explicitly setting null is dropped in newer commits. If you rerun the tests, it should all pass.

+import org.scalatest.FunSuite
+
+/**
+ * Created by zyxue on 2016-09-27.

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

Unit test suite for FastqRecordConverter.

@heuermh

heuermh Sep 28, 2016

Member

Unit test suite for FastqRecordConverter.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

What does this comment mean?

@zyxue

zyxue Sep 28, 2016

Contributor

What does this comment mean?

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

Sorry, am suggesting a doc comment change.

@heuermh

heuermh Sep 28, 2016

Member

Sorry, am suggesting a doc comment change.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

ok, removed the unnecessary comment.

@zyxue

zyxue Sep 28, 2016

Contributor

ok, removed the unnecessary comment.

+/**
+ * Created by zyxue on 2016-09-27.
+ */
+class FastqConverterSuite extends FunSuite {

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

Should be FastqRecordConverterSuite.

@heuermh

heuermh Sep 28, 2016

Member

Should be FastqRecordConverterSuite.

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

Are you sure? I followed convention in FastaConverterSuite

@zyxue

zyxue Sep 28, 2016

Contributor

Are you sure? I followed convention in FastaConverterSuite

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

OK, you're right. I realized it's

FastaConverter.scala & FastaConverterSuite

so it should be

FastqRecordConverter & FastqRecordConverterSuite

@zyxue

zyxue Sep 28, 2016

Contributor

OK, you're right. I realized it's

FastaConverter.scala & FastaConverterSuite

so it should be

FastqRecordConverter & FastqRecordConverterSuite

This comment has been minimized.

@heuermh

heuermh Sep 28, 2016

Member

Yep, although I find the word "record" redundant almost everywhere, so we could possibly drop it here

@heuermh

heuermh Sep 28, 2016

Member

Yep, although I find the word "record" redundant almost everywhere, so we could possibly drop it here

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

I will stick to the original convention for now.

@zyxue

zyxue Sep 28, 2016

Contributor

I will stick to the original convention for now.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 28, 2016

Member

Thanks @zyxue! I've left some review comments but they may not completely answer your questions. Hopefully others will chime in.

Note there are two unit test failures in AlignmentRecordRDDSuite, see https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/HADOOP_VERSION=2.6.0,SCALAVER=2.10,SPARK_VERSION=1.5.2,label=centos/testReport/

Member

heuermh commented Sep 28, 2016

Thanks @zyxue! I've left some review comments but they may not completely answer your questions. Hopefully others will chime in.

Note there are two unit test failures in AlignmentRecordRDDSuite, see https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/HADOOP_VERSION=2.6.0,SCALAVER=2.10,SPARK_VERSION=1.5.2,label=centos/testReport/

renaming a test suite
FastqConverterSuite.scala => FastqRecordConverterSuite.scala
@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 28, 2016

Contributor

I have located the reason at least for the first test failure. As mentioned, it's due to inconsistent default values. In the original convertRead, it'ssetReadNegativeStrand(false), setting it to null, which is done in convertPair and convertFragment, would cause java.lang.NullPointerException when getReadNegativeStrand. I suggest dropping all the explicitly set null, which seems irrelevant to a fastq file, and just use default values instead, what do you think?

Contributor

zyxue commented Sep 28, 2016

I have located the reason at least for the first test failure. As mentioned, it's due to inconsistent default values. In the original convertRead, it'ssetReadNegativeStrand(false), setting it to null, which is done in convertPair and convertFragment, would cause java.lang.NullPointerException when getReadNegativeStrand. I suggest dropping all the explicitly set null, which seems irrelevant to a fastq file, and just use default values instead, what do you think?

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 28, 2016

Contributor

Is it easy to rerun all tests on jenkins?

Contributor

zyxue commented Sep 28, 2016

Is it easy to rerun all tests on jenkins?

+ * @see parseReadPairInFastq
+ * *
+ */
+ private def parseReadInFastq(input: String,

This comment has been minimized.

@zyxue

zyxue Sep 28, 2016

Contributor

Is the use of private like this a good practice? It will be used in other methods in FastqRecordConverter via this.parseReadInFastq

@zyxue

zyxue Sep 28, 2016

Contributor

Is the use of private like this a good practice? It will be used in other methods in FastqRecordConverter via this.parseReadInFastq

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 28, 2016

Member

Is it easy to rerun all tests on jenkins?

Jenkins should re-run when you push new commits. If not, let me know when you've pushed everything you want, and I'll explicitly ask for a retest.

./scripts/jenkins-test lets you run the Jenkins tests locally, although you may have to mess with some path definitions.

Member

heuermh commented Sep 28, 2016

Is it easy to rerun all tests on jenkins?

Jenkins should re-run when you push new commits. If not, let me know when you've pushed everything you want, and I'll explicitly ask for a retest.

./scripts/jenkins-test lets you run the Jenkins tests locally, although you may have to mess with some path definitions.

zyxue added some commits Sep 28, 2016

renamed class
FastqConverterSuite => FastqRecordConverterSuite
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 28, 2016

Member

Jenkins, retest this please.

Member

heuermh commented Sep 28, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 28, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1507/

Build result: FAILURE

GitHub pull request #1185 of commit fc8db4e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 7c99373 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision 7c99373 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 7c99373154de4939570142573401cb93d93874ddFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1507/

Build result: FAILURE

GitHub pull request #1185 of commit fc8db4e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 7c99373 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision 7c99373 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 7c99373154de4939570142573401cb93d93874ddFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 29, 2016

Contributor

I think the error is due to unintended change of 3 pom.xml files, I have reverted the changes.

Contributor

zyxue commented Sep 29, 2016

I think the error is due to unintended change of 3 pom.xml files, I have reverted the changes.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 29, 2016

Member

Jenkins, retest this please.

Member

heuermh commented Sep 29, 2016

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 29, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1509/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1509/
Test PASSed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 29, 2016

Member

Looks like that was it, tests pass now. I'll give this another round of review tomorrow, and hopefully we can get another set of eyes on it.

Member

heuermh commented Sep 29, 2016

Looks like that was it, tests pass now. I'll give this another round of review tomorrow, and hopefully we can get another set of eyes on it.

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 29, 2016

Contributor

Passing all current tests is one thing. I read your paper, http://dx.doi.org/10.1093/nar/gkp1137, and I think it would be a good idea to add more tests for irregular FASTQ formats. My confusion in the original code from that it's not obvious what the original author intended to do in certain cases. I guess I will just decide the scope accordingly, and do some more refactoring later.

Contributor

zyxue commented Sep 29, 2016

Passing all current tests is one thing. I read your paper, http://dx.doi.org/10.1093/nar/gkp1137, and I think it would be a good idea to add more tests for irregular FASTQ formats. My confusion in the original code from that it's not obvious what the original author intended to do in certain cases. I guess I will just decide the scope accordingly, and do some more refactoring later.

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 29, 2016

Contributor

@heuermh, have you tested a private function before? I am trying to test parseReadInFastq, directly. Not sure how to do so. Or do you have any suggestion, please? I followed the last comment in this post, but still haven't made it work yet. I am wondering if I used private properly.

Contributor

zyxue commented Sep 29, 2016

@heuermh, have you tested a private function before? I am trying to test parseReadInFastq, directly. Not sure how to do so. Or do you have any suggestion, please? I followed the last comment in this post, but still haven't made it work yet. I am wondering if I used private properly.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 29, 2016

Member

Hi @zyxue! Thank you so much for taking this on; I'm really excited to see this PR and will take a closer look at the code later today.

@heuermh, have you tested a private function before? I am trying to test parseReadInFastq, directly. Not sure how to do so. Or do you have any suggestion, please? I followed the last comment in this post, but still haven't made it work yet. I am wondering if I used private properly.

What I typically do is relax the protection from private to package private. E.g., in this case, I'd change:

private def parseReadInFastq(input: String,

to:

private[converters] def parseReadInFastq(input: String,

Then, you can call the function from the test suite, since the test suite is in the same package. Ideally, we'd be able to keep the function private, but (IIRC) the only way to keep the function fully private and to call it from test code is through the Java reflection API, which is a real hassle. I think this is a reasonable compromise.

Member

fnothaft commented Sep 29, 2016

Hi @zyxue! Thank you so much for taking this on; I'm really excited to see this PR and will take a closer look at the code later today.

@heuermh, have you tested a private function before? I am trying to test parseReadInFastq, directly. Not sure how to do so. Or do you have any suggestion, please? I followed the last comment in this post, but still haven't made it work yet. I am wondering if I used private properly.

What I typically do is relax the protection from private to package private. E.g., in this case, I'd change:

private def parseReadInFastq(input: String,

to:

private[converters] def parseReadInFastq(input: String,

Then, you can call the function from the test suite, since the test suite is in the same package. Ideally, we'd be able to keep the function private, but (IIRC) the only way to keep the function fully private and to call it from test code is through the Java reflection API, which is a real hassle. I think this is a reasonable compromise.

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 29, 2016

Contributor

Cool, @fnothaft. It looks that I can access the method now. I am not familiar with Java reflection API, but in Scalatest, there is a PrivateMethodTester trait that can be used to test a private method. It's also via reflection, but it also looks quite a bit of hassle, too. I tried it, but haven't made it work. I prefer your solution for now. Thanks!

Contributor

zyxue commented Sep 29, 2016

Cool, @fnothaft. It looks that I can access the method now. I am not familiar with Java reflection API, but in Scalatest, there is a PrivateMethodTester trait that can be used to test a private method. It's also via reflection, but it also looks quite a bit of hassle, too. I tried it, but haven't made it work. I prefer your solution for now. Thanks!

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 29, 2016

Member

I just made a review pass, @zyxue. I've highlighted a few small syntax nits, but overall, this looks like a very solid PR; thank you for the great contribution!

Member

fnothaft commented Sep 29, 2016

I just made a review pass, @zyxue. I've highlighted a few small syntax nits, but overall, this looks like a very solid PR; thank you for the great contribution!

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 30, 2016

Member

@fnothaft I don't see any of your review comments. Were they lost or resolved?

Member

heuermh commented Sep 30, 2016

@fnothaft I don't see any of your review comments. Were they lost or resolved?

+ s"Input must have 4 lines (${lines.length.toString} found):\n${input}")
+
+ val readName = lines(0).drop(1)
+ if (readName.endsWith("/1") && setSecondOfPair)

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

+1, I've also seen _1/2

@fnothaft

fnothaft Sep 30, 2016

Member

+1, I've also seen _1/2

+ else {
+ if (readQualitiesRaw == "*") "B" * readSequence.length
+ else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length))
+ else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented")

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

IIRC, B is the code for "unknown" quality. CC @ryan-williams

@fnothaft

fnothaft Sep 30, 2016

Member

IIRC, B is the code for "unknown" quality. CC @ryan-williams

+
+ private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = {
+ val lines = input.toString.split('\n')
+ require(lines.length == 8,

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

@fnothaft

fnothaft Sep 30, 2016

Member

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

+ secondReadName,
+ secondReadSequence,
+ secondReadQualities
+ ) = this.parseReadPairInFastq(element._2.toString)

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

You don't need the this. here or below.

@fnothaft

fnothaft Sep 30, 2016

Member

You don't need the this. here or below.

- .setSecondaryAlignment(null)
- .setSupplementaryAlignment(null)
- .build()
+ this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

+ secondReadName,
+ secondReadSequence,
+ secondReadQualities
+ ) = this.parseReadPairInFastq(element._2.toString)

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

firstReadName == secondReadName,
"Reads %s and %s in Fragment have different names.".format(
@@ -156,17 +195,16 @@ private[adam] class FastqRecordConverter extends Serializable with Logging {
)
)
+ val alignments = List(
+ this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),
+ this.makeAlignmentRecord(secondReadName, secondReadSequence, secondReadQualities, 1)

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

- val readName = trimTrailingReadNumber(lines(0).drop(1))
- val readSequence = lines(1)
+ val (readName, readSequence, readQualities) =
+ this.parseReadInFastq(element._2.toString, setFirstOfPair, setSecondOfPair, stringency)

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

- recordGroupOpt.foreach(builder.setRecordGroupName)
-
- builder.build()
+ this.makeAlignmentRecord(

This comment has been minimized.

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

@fnothaft

fnothaft Sep 30, 2016

Member

Don't need the this..

This comment has been minimized.

@zyxue

zyxue Oct 1, 2016

Contributor

I have removed all this., wondering when will this. ever be necessary?

@zyxue

zyxue Oct 1, 2016

Contributor

I have removed all this., wondering when will this. ever be necessary?

This comment has been minimized.

@heuermh

heuermh Oct 4, 2016

Member

this is only necessary if there is a collision between say a field and a local variable with the same name

@heuermh

heuermh Oct 4, 2016

Member

this is only necessary if there is a collision between say a field and a local variable with the same name

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Sep 30, 2016

Contributor

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

Contributor

zyxue commented Sep 30, 2016

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

@zyxue

This comment has been minimized.

Show comment
Hide comment
@zyxue

zyxue Oct 4, 2016

Contributor

@heuermh, Can you test it again, please? If there is no further request, I think it's ready to be merged.

Contributor

zyxue commented Oct 4, 2016

@heuermh, Can you test it again, please? If there is no further request, I think it's ready to be merged.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 4, 2016

Member

Jenkins, test this please.

Member

heuermh commented Oct 4, 2016

Jenkins, test this please.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 4, 2016

Member

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

We're borrowing the enum and concept from htsjdk. Briefly, on errors strict should throw exceptions, lenient should log warnings, and silent should not complain.

Member

heuermh commented Oct 4, 2016

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

We're borrowing the enum and concept from htsjdk. Briefly, on errors strict should throw exceptions, lenient should log warnings, and silent should not complain.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 4, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1516/

Build result: FAILURE

GitHub pull request #1185 of commit ce5e3a0 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 5a913a0e1c5cb9874266ae2a108db3822c19c7b1First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1516/

Build result: FAILURE

GitHub pull request #1185 of commit ce5e3a0 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 5a913a0e1c5cb9874266ae2a108db3822c19c7b1First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 5, 2016

Member

Jenkins, retest this please

Member

heuermh commented Oct 5, 2016

Jenkins, retest this please

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 5, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1520/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1520/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 7, 2016

Member

LGTM! Ping @heuermh for a review pass.
I'm thinking that we should merge this manually, since the commit history is pretty long. @heuermh let me know if/when it looks good to you, and I will merge it.

Member

fnothaft commented Oct 7, 2016

LGTM! Ping @heuermh for a review pass.
I'm thinking that we should merge this manually, since the commit history is pretty long. @heuermh let me know if/when it looks good to you, and I will merge it.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 7, 2016

Member

@heuermh is this still pending changes from your side?

Member

fnothaft commented Oct 7, 2016

@heuermh is this still pending changes from your side?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 8, 2016

Member

Will complete review on Monday

Member

heuermh commented Oct 8, 2016

Will complete review on Monday

@@ -41,6 +41,114 @@ import scala.collection.JavaConversions._
private[adam] class FastqRecordConverter extends Serializable with Logging {
+ /**
+ * Parse 4 lines at a time

This comment has been minimized.

@heuermh

heuermh Oct 13, 2016

Member

This doc comment doesn't match the method.

Perhaps something like Return true if the read name suffix and flags match.

@heuermh

heuermh Oct 13, 2016

Member

This doc comment doesn't match the method.

Perhaps something like Return true if the read name suffix and flags match.

+ val match2 = secondReadSuffix.findAllIn(readName)
+
+ if (match1.nonEmpty && isSecondOfPair)
+ throw new IllegalArgumentException(

This comment has been minimized.

@heuermh

heuermh Oct 13, 2016

Member

These exceptions are thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

@heuermh

heuermh Oct 13, 2016

Member

These exceptions are thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

+ val readName = lines(0).drop(1)
+ if (setFirstOfPair || setSecondOfPair) readNameSuffixAndIndexOfPairMustMatch(readName, setFirstOfPair)
+
+ val suffix = """([/ +_]1$)|([/ +_]2$)""".r

This comment has been minimized.

@heuermh

heuermh Oct 13, 2016

Member

I wonder if this regex and those above on line 50 and 51 might be combined as static private fields so that the two methods don't get out of sync

@heuermh

heuermh Oct 13, 2016

Member

I wonder if this regex and those above on line 50 and 51 might be combined as static private fields so that the two methods don't get out of sync

- val readName = trimTrailingReadNumber(lines(0).drop(1))
- val readSequence = lines(1)
+ if (setFirstOfPair && setSecondOfPair)
+ throw new IllegalArgumentException("setFirstOfPair and setSecondOfPair cannot be true at the same time")

This comment has been minimized.

@heuermh

heuermh Oct 13, 2016

Member

This exception is also thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

@heuermh

heuermh Oct 13, 2016

Member

This exception is also thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

+ .setReadPaired(readPaired)
+ .setReadInFragment(readInFragment)
+
+ if (recordGroupOpt != None)

This comment has been minimized.

@heuermh

heuermh Oct 13, 2016

Member

With opt.foreach you don't also need to check against None

@heuermh

heuermh Oct 13, 2016

Member

With opt.foreach you don't also need to check against None

@heuermh heuermh modified the milestone: 0.20.0 Oct 13, 2016

@fnothaft fnothaft referenced this pull request Oct 13, 2016

Closed

Fastq record converter #1208

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 13, 2016

Member

Moved over to #1208.

Member

fnothaft commented Oct 13, 2016

Moved over to #1208.

@fnothaft fnothaft closed this Oct 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment