New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export valid fastq #856

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@allenday
Contributor

allenday commented Oct 8, 2015

The adam2fastq exported in the case of a sam with "" (unknown read quality) just exports the "". This is invalid fastq.

This PR addresses the issue by assigning quality value of "B" for each base - this is conventionally used for reads of unknown quality.

Additionally, reads loaded in from fastq are also assigned "B"-string for malformed records containing "*" for read quality.

allenday added some commits Oct 7, 2015

ADAM currently exports malformed FASTQ when the quality in AlignmentR…
…ecord is null or "*". This gracefully fixes these records on import
ADAM currently exports malformed FASTQ when the quality in AlignmentR…
…ecord is null or "*". This gracefully fixes these records on import
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 8, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/987/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Oct 8, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/987/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 8, 2015

Member

@allenday Jenkins is reporting a test failure but only with Hadoop 2.6.0 and Spark 1.4.1

org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctionsSuite.convert malformed FASTQ (no quality scores) => SAM => well-formed FASTQ => SAM

Error Message

Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
 at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

Stacktrace

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
    at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Member

heuermh commented Oct 8, 2015

@allenday Jenkins is reporting a test failure but only with Hadoop 2.6.0 and Spark 1.4.1

org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctionsSuite.convert malformed FASTQ (no quality scores) => SAM => well-formed FASTQ => SAM

Error Message

Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
 at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

Stacktrace

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.Exception: Fastq quality must be defined
    at org.bdgenomics.adam.converters.FastqRecordConverter.convertRead(FastqRecordConverter.scala:150)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadUnpairedFastq$1.apply(ADAMContext.scala:424)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 8, 2015

Member

Jenkins, retest this please.

Member

fnothaft commented Oct 8, 2015

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 8, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/990/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Oct 8, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/990/

Build result: FAILURE

GitHub pull request #856 of commit a380305 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/856/merge^{commit} # timeout=10 > git branch -a --contains c882546 # timeout=10 > git rev-parse remotes/origin/pr/856/merge^{commit} # timeout=10Checking out Revision c882546 (origin/pr/856/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f c8825467fdc55ad18cebac641804db2749da1011First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@allenday

This comment has been minimized.

Show comment
Hide comment
@allenday

allenday Oct 9, 2015

Contributor

Yes, I see it. I'll investigate this coming week. Should be minor, all my
tests pass locally.

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Fri, Oct 9, 2015 at 3:45 AM, Frank Austin Nothaft <
notifications@github.com> wrote:

Jenkins, retest this please.


Reply to this email directly or view it on GitHub
#856 (comment).

Contributor

allenday commented Oct 9, 2015

Yes, I see it. I'll investigate this coming week. Should be minor, all my
tests pass locally.

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Fri, Oct 9, 2015 at 3:45 AM, Frank Austin Nothaft <
notifications@github.com> wrote:

Jenkins, retest this please.


Reply to this email directly or view it on GitHub
#856 (comment).

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 12, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/998/
Test PASSed.

AmplabJenkins commented Oct 12, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/998/
Test PASSed.

@allenday

This comment has been minimized.

Show comment
Hide comment
@allenday

allenday Oct 12, 2015

Contributor

@fnothaft test passing now

Contributor

allenday commented Oct 12, 2015

@fnothaft test passing now

else
adamRecord.getOrigQual
else if (adamRecord.getQual == null)
"B" * seqLength

This comment has been minimized.

@heuermh

heuermh Oct 20, 2015

Member

The wikipedia link above is dead, and it reads to me like B as unknown quality score is not universal across FASTQ records produced by different versions of Illumina software. Should this be a parameter presented to the user?

@heuermh

heuermh Oct 20, 2015

Member

The wikipedia link above is dead, and it reads to me like B as unknown quality score is not universal across FASTQ records produced by different versions of Illumina software. Should this be a parameter presented to the user?

else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
throw new Exception(s"Fastq sequence and quality strings must have the same length")
val readQualities =

This comment has been minimized.

@heuermh

heuermh Oct 20, 2015

Member

I think rather than fixing invalid FASTQ records when reading them in with non-strict validation stringency, I would prefer they get logged as invalid and dropped. Then for strict validation stringency, encountering an invalid FASTQ record would throw an exception that causes the read operation to stop.

@heuermh

heuermh Oct 20, 2015

Member

I think rather than fixing invalid FASTQ records when reading them in with non-strict validation stringency, I would prefer they get logged as invalid and dropped. Then for strict validation stringency, encountering an invalid FASTQ record would throw an exception that causes the read operation to stop.

This comment has been minimized.

@allenday

allenday Oct 20, 2015

Contributor

here's how HTSJDK handles it, I propose doing the same:
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/ValidationStringency.java

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Tue, Oct 20, 2015 at 10:08 PM, Michael L Heuer notifications@github.com
wrote:

In
adam-core/src/main/scala/org/bdgenomics/adam/converters/FastqRecordConverter.scala
#856 (comment):

@@ -143,10 +145,25 @@ class FastqRecordConverter extends Serializable with Logging {
// get fields for first read in pair
val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)

  • val readQualities = lines(3)
  • if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1)
  •  throw new Exception(s"Fastq quality must be defined")
    
  • else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
  •  throw new Exception(s"Fastq sequence and quality strings must have the same length")
    
  • val readQualities =

I think rather than fixing invalid FASTQ records when reading them in with
non-strict validation stringency, I would prefer they get logged as invalid
and dropped. Then for strict validation stringency, encountering an invalid
FASTQ record would throw an exception that causes the read operation to
stop.


Reply to this email directly or view it on GitHub
https://github.com/bigdatagenomics/adam/pull/856/files#r42508048.

@allenday

allenday Oct 20, 2015

Contributor

here's how HTSJDK handles it, I propose doing the same:
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/ValidationStringency.java

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Tue, Oct 20, 2015 at 10:08 PM, Michael L Heuer notifications@github.com
wrote:

In
adam-core/src/main/scala/org/bdgenomics/adam/converters/FastqRecordConverter.scala
#856 (comment):

@@ -143,10 +145,25 @@ class FastqRecordConverter extends Serializable with Logging {
// get fields for first read in pair
val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)

  • val readQualities = lines(3)
  • if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1)
  •  throw new Exception(s"Fastq quality must be defined")
    
  • else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
  •  throw new Exception(s"Fastq sequence and quality strings must have the same length")
    
  • val readQualities =

I think rather than fixing invalid FASTQ records when reading them in with
non-strict validation stringency, I would prefer they get logged as invalid
and dropped. Then for strict validation stringency, encountering an invalid
FASTQ record would throw an exception that causes the read operation to
stop.


Reply to this email directly or view it on GitHub
https://github.com/bigdatagenomics/adam/pull/856/files#r42508048.

This comment has been minimized.

@heuermh

heuermh Nov 2, 2015

Member

Emit warnings but keep going if possible and fix invalid input data are slightly different things. I think with a parameter or parameters presented to the user I'd support this change.

@heuermh

heuermh Nov 2, 2015

Member

Emit warnings but keep going if possible and fix invalid input data are slightly different things. I think with a parameter or parameters presented to the user I'd support this change.

This comment has been minimized.

@allenday

allenday Nov 10, 2015

Contributor

Can you be more specific with your proposal? Are you suggesting adding
another parameter globally, or just for FQ?

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Tue, Nov 3, 2015 at 4:12 AM, Michael L Heuer notifications@github.com
wrote:

In
adam-core/src/main/scala/org/bdgenomics/adam/converters/FastqRecordConverter.scala
#856 (comment):

@@ -143,10 +145,25 @@ class FastqRecordConverter extends Serializable with Logging {
// get fields for first read in pair
val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)

  • val readQualities = lines(3)
  • if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1)
  •  throw new Exception(s"Fastq quality must be defined")
    
  • else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
  •  throw new Exception(s"Fastq sequence and quality strings must have the same length")
    
  • val readQualities =

Emit warnings but keep going if possible and fix invalid input data are
slightly different things. I think with a parameter or parameters presented
to the user I'd support this change.


Reply to this email directly or view it on GitHub
https://github.com/bigdatagenomics/adam/pull/856/files#r43675511.

@allenday

allenday Nov 10, 2015

Contributor

Can you be more specific with your proposal? Are you suggesting adding
another parameter globally, or just for FQ?

Sent with MailTrack
https://mailtrack.io/install?source=signature&lang=en&referral=allenday@allenday.com&idSignature=22

On Tue, Nov 3, 2015 at 4:12 AM, Michael L Heuer notifications@github.com
wrote:

In
adam-core/src/main/scala/org/bdgenomics/adam/converters/FastqRecordConverter.scala
#856 (comment):

@@ -143,10 +145,25 @@ class FastqRecordConverter extends Serializable with Logging {
// get fields for first read in pair
val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)

  • val readQualities = lines(3)
  • if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1)
  •  throw new Exception(s"Fastq quality must be defined")
    
  • else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
  •  throw new Exception(s"Fastq sequence and quality strings must have the same length")
    
  • val readQualities =

Emit warnings but keep going if possible and fix invalid input data are
slightly different things. I think with a parameter or parameters presented
to the user I'd support this change.


Reply to this email directly or view it on GitHub
https://github.com/bigdatagenomics/adam/pull/856/files#r43675511.

This comment has been minimized.

@heuermh

heuermh Nov 10, 2015

Member

I think for this pull request having validation stringency and base quality to use when original quality is missing as parameters would be sufficient. They would need to be surfaced to the cli commands that call this code.

@heuermh

heuermh Nov 10, 2015

Member

I think for this pull request having validation stringency and base quality to use when original quality is missing as parameters would be sufficient. They would need to be surfaced to the cli commands that call this code.

This comment has been minimized.

@fnothaft

fnothaft Nov 10, 2015

Member

I agree with @heuermh, but perhaps we should get this merged in, and then add the further checking as a later pull request? The validation stuff would look something like #781 or #830. Nothing heavyweight, but could push this out another week.

@fnothaft

fnothaft Nov 10, 2015

Member

I agree with @heuermh, but perhaps we should get this merged in, and then add the further checking as a later pull request? The validation stuff would look something like #781 or #830. Nothing heavyweight, but could push this out another week.

This comment has been minimized.

@fnothaft

fnothaft Nov 10, 2015

Member

Actually, nevermind. I read the thread of comments, but not the code. This looks good enough to me for now. @heuermh I do think the validation stringency code here is wired up to the CLI. I'm ambivalent about whether the user should be able to specify the value for the missing base, but we can sort that out later.

@fnothaft

fnothaft Nov 10, 2015

Member

Actually, nevermind. I read the thread of comments, but not the code. This looks good enough to me for now. @heuermh I do think the validation stringency code here is wired up to the CLI. I'm ambivalent about whether the user should be able to specify the value for the missing base, but we can sort that out later.

This comment has been minimized.

@heuermh

heuermh Nov 10, 2015

Member

Sounds good to me, thanks

@heuermh

heuermh Nov 10, 2015

Member

Sounds good to me, thanks

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 10, 2015

Member

Thanks @allenday! I've merged this as 54d5095. Let's continue the discussion on #818.

Member

fnothaft commented Nov 10, 2015

Thanks @allenday! I've merged this as 54d5095. Let's continue the discussion on #818.

@fnothaft fnothaft closed this Nov 10, 2015

@heuermh heuermh added this to the 0.18.2 milestone Nov 10, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment