Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fastq record converter #1185

Closed
wants to merge 41 commits into from
Closed
Show file tree
Hide file tree
Changes from 30 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
8065a64
added FastqConverterSuite.scala
zyxue Sep 28, 2016
ede941e
refactored FastqRecordConverter.convertPair
zyxue Sep 28, 2016
6597b63
improved exception message
zyxue Sep 28, 2016
a944139
added tests for invalid fastq input
zyxue Sep 28, 2016
f2736e3
added parseReadPairInFastq private method
zyxue Sep 28, 2016
1b92e6e
updated a testcase for FastqRecordConverter.convertPair
zyxue Sep 28, 2016
eae9ca7
removed assignment for tests with invalid input
zyxue Sep 28, 2016
0bff9dc
added 3 tests for FastqRecordConverter.convertFragment
zyxue Sep 28, 2016
790ea31
refactored FastqRecordConverter.convertFragment
zyxue Sep 28, 2016
a931187
improved format of makeAlignmentRecord in convertPair
zyxue Sep 28, 2016
914190f
test convertFragment in more detail
zyxue Sep 28, 2016
07e6c78
factored out makeAlignmentRecord
zyxue Sep 28, 2016
8e7c029
use string Interpolation for exception message
zyxue Sep 28, 2016
f604a9b
refactored out parseReadInFastq
zyxue Sep 28, 2016
48c0f40
updated parseReadInFastq with param stringency
zyxue Sep 28, 2016
ec95050
refactored convertRead
zyxue Sep 28, 2016
63013b8
added tests for convertRead
zyxue Sep 28, 2016
7782ab3
updated default values in tests
zyxue Sep 28, 2016
d4c5ad6
[ADAM-1172] formatting and run format-source
zyxue Sep 28, 2016
2c055a8
renaming a test suite
zyxue Sep 28, 2016
f124fa5
drop explicitly set null in makeAlignmentRecord
zyxue Sep 28, 2016
bb3f26b
changed default setFirstOfPair to false
zyxue Sep 28, 2016
2c4f7cb
remove null value assertions in tests
zyxue Sep 28, 2016
6194ad4
test default getReadPaired in convertRead
zyxue Sep 28, 2016
11df4d0
setFirstOfPair && setSecondOfPair cannot be both true
zyxue Sep 28, 2016
8195510
added tests for setFirstOfPair && setSecondOfPair
zyxue Sep 28, 2016
fc8db4e
removed unnecessary comment
zyxue Sep 28, 2016
37b38bb
renamed class
zyxue Sep 28, 2016
efbc811
inspected error messages in a few tests
zyxue Sep 28, 2016
82230d6
revert unintended changes on pom.xml
zyxue Sep 29, 2016
4d9b2f6
make methods private to package only
zyxue Sep 30, 2016
449a517
change exception raised
zyxue Sep 30, 2016
698f015
test read length > quality length lenient mode
zyxue Sep 30, 2016
14c654a
replaced ValidationStringency.LENIENT with lenient
zyxue Sep 30, 2016
3ac85d6
added test case when no read quality is *
zyxue Sep 30, 2016
a797e26
enhanced read suffix removal
zyxue Sep 30, 2016
d11edf8
added method readNameSuffixAndIndexOfPairMustMatch
zyxue Sep 30, 2016
45d5150
refactored readNameSuffixAndIndexOfPairMustMatch
zyxue Oct 1, 2016
e35648a
removed unnecessary this.
zyxue Oct 1, 2016
ce5e3a0
simplified requirement()
zyxue Oct 1, 2016
272f637
rerun ./scripts/format-source
zyxue Oct 5, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -40,6 +40,96 @@ import scala.collection.JavaConversions._
*/
private[adam] class FastqRecordConverter extends Serializable with Logging {

/**
* Parse 4 lines at a time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment doesn't match the method.

Perhaps something like Return true if the read name suffix and flags match.

* @see parseReadPairInFastq
* *
*/
private def parseReadInFastq(input: String,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the use of private like this a good practice? It will be used in other methods in FastqRecordConverter via this.parseReadInFastq

setFirstOfPair: Boolean = false,
setSecondOfPair: Boolean = false,
stringency: ValidationStringency = ValidationStringency.STRICT): (String, String, String) = {
val lines = input.split('\n')
require(lines.length == 4,
s"Input must have 4 lines (${lines.length.toString} found):\n${input}")

val readName = lines(0).drop(1)
if (readName.endsWith("/1") && setSecondOfPair)
Copy link
Member

@heuermh heuermh Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen files in the wild that use 1 and 2 (with space) instead of /1 and /2. Should we add that here?
See e.g. PairedEndFastqReader.java#L59

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I've also seen _1/2

Copy link
Contributor Author

@zyxue zyxue Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed by regex like [/ +_]1$

throw new Exception(
s"Found read name $readName ending in '/1' despite second-of-pair flag being set"
)
else if (readName.endsWith("/2") && setFirstOfPair)
throw new Exception(
s"Found read name $readName ending in '/2' despite first-of-pair flag being set"
)
val suffix = """(\/1$)|(\/2$)""".r
val readNameNoSuffix = suffix.replaceAllIn(readName, "")

val readSequence = lines(1)
val readQualitiesRaw = lines(3)

val readQualities =
if (stringency == ValidationStringency.STRICT) readQualitiesRaw
else {
if (readQualitiesRaw == "*") "B" * readSequence.length
else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length))
else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotImplementedError doesn't seem right, should be IllegalArgumentException. These length checks should also happen with strict stringency.

Copy link
Contributor Author

@zyxue zyxue Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what's the reason for padding B in case qualities information is incomplete?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, B is the code for "unknown" quality. CC @ryan-williams

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.wikipedia.org/wiki/FASTQ_format

Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower the mean observed score of a given run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed NotImplementedError => IllegalArgumentException

else readQualitiesRaw
}

if (stringency == ValidationStringency.STRICT) {
if (readQualitiesRaw == "*" && readSequence.length > 1)
throw new IllegalArgumentException(s"Fastq quality must be defined for\n $input")
}

require(
readSequence.length == readQualities.length,
s"The first read: ${readName}, has different sequence and qual length."
)

(readNameNoSuffix, readSequence, readQualities)
}

private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = {
val lines = input.toString.split('\n')
require(lines.length == 8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be new line characters at any place in the record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example for hard line wrapping? What does it look like?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

Copy link
Contributor Author

@zyxue zyxue Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to implement parsing for wrapped lines. Then I found that it would require exact match of sequence length and quality length. Otherwise, it's ambiguous to tell when the quality lines stop. This makes padding with B when length(qual line) < length(seq line), is that right? e.g. error_short_qual.fastq from biojava is an error.

s"Record must have 8 lines (${lines.length.toString} found):\n${input}")

val (firstReadName, firstReadSequence, firstReadQualities) =
this.parseReadInFastq(lines.take(4).mkString("\n"), setFirstOfPair = true, setSecondOfPair = false)

val (secondReadName, secondReadSequence, secondReadQualities) =
this.parseReadInFastq(lines.drop(4).mkString("\n"), setFirstOfPair = false, setSecondOfPair = true)

(
firstReadName,
firstReadSequence,
firstReadQualities,
secondReadName,
secondReadSequence,
secondReadQualities
)
}

private def makeAlignmentRecord(readName: String,
sequence: String,
qual: String,
readInFragment: Int,
readPaired: Boolean = true,
recordGroupOpt: Option[String] = None): AlignmentRecord = {
val builder = AlignmentRecord.newBuilder
.setReadName(readName)
.setSequence(sequence)
.setQual(qual)
.setReadPaired(readPaired)
.setReadInFragment(readInFragment)

if (recordGroupOpt != None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With opt.foreach you don't also need to check against None

recordGroupOpt.foreach(builder.setRecordGroupName)

builder.build
}

/**
* Converts a read pair in FASTQ format into two AlignmentRecords.
*
Expand All @@ -59,57 +149,19 @@ private[adam] class FastqRecordConverter extends Serializable with Logging {
* @see convertFragment
*/
def convertPair(element: (Void, Text)): Iterable[AlignmentRecord] = {
val lines = element._2.toString.split('\n')
require(lines.length == 8, "Record has wrong format:\n" + element._2.toString)

// get fields for first read in pair
val firstReadName = lines(0).drop(1)
val firstReadSequence = lines(1)
val firstReadQualities = lines(3)

require(
firstReadSequence.length == firstReadQualities.length,
"Read " + firstReadName + " has different sequence and qual length."
)

// get fields for second read in pair
val secondReadName = lines(4).drop(1)
val secondReadSequence = lines(5)
val secondReadQualities = lines(7)

require(
secondReadSequence.length == secondReadQualities.length,
"Read " + secondReadName + " has different sequence and qual length."
)
val (
firstReadName,
firstReadSequence,
firstReadQualities,
secondReadName,
secondReadSequence,
secondReadQualities
) = this.parseReadPairInFastq(element._2.toString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the this. here or below.


// build and return iterators
Iterable(
AlignmentRecord.newBuilder()
.setReadName(firstReadName)
.setSequence(firstReadSequence)
.setQual(firstReadQualities)
.setReadPaired(true)
.setProperPair(true)
.setReadInFragment(0)
.setReadNegativeStrand(null)
.setMateNegativeStrand(null)
.setPrimaryAlignment(null)
.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)
.build(),
AlignmentRecord.newBuilder()
.setReadName(secondReadName)
.setSequence(secondReadSequence)
.setQual(secondReadQualities)
.setReadPaired(true)
.setProperPair(true)
.setReadInFragment(1)
.setReadNegativeStrand(null)
.setMateNegativeStrand(null)
.setPrimaryAlignment(null)
.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)
.build()
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the this..

this.makeAlignmentRecord(secondReadName, secondReadSequence, secondReadQualities, 1)
)
}

Expand All @@ -126,28 +178,15 @@ private[adam] class FastqRecordConverter extends Serializable with Logging {
* @see convertPair
*/
def convertFragment(element: (Void, Text)): Fragment = {
val lines = element._2.toString.split('\n')
require(lines.length == 8, "Record has wrong format:\n" + element._2.toString)

// get fields for first read in pair
val firstReadName = lines(0).drop(1)
val firstReadSequence = lines(1)
val firstReadQualities = lines(3)

require(
firstReadSequence.length == firstReadQualities.length,
"Read " + firstReadName + " has different sequence and qual length."
)

// get fields for second read in pair
val secondReadName = lines(4).drop(1)
val secondReadSequence = lines(5)
val secondReadQualities = lines(7)
val (
firstReadName,
firstReadSequence,
firstReadQualities,
secondReadName,
secondReadSequence,
secondReadQualities
) = this.parseReadPairInFastq(element._2.toString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the this..


require(
secondReadSequence.length == secondReadQualities.length,
"Read " + secondReadName + " has different sequence and qual length."
)
require(
firstReadName == secondReadName,
"Reads %s and %s in Fragment have different names.".format(
Expand All @@ -156,17 +195,16 @@ private[adam] class FastqRecordConverter extends Serializable with Logging {
)
)

val alignments = List(
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the this..

this.makeAlignmentRecord(secondReadName, secondReadSequence, secondReadQualities, 1)
)

// build and return record
Fragment.newBuilder()
Fragment.newBuilder
.setReadName(firstReadName)
.setAlignments(List(AlignmentRecord.newBuilder()
.setSequence(firstReadSequence)
.setQual(firstReadQualities)
.build(), AlignmentRecord.newBuilder()
.setSequence(secondReadSequence)
.setQual(secondReadQualities)
.build()))
.build()
.setAlignments(alignments)
.build
}

/**
Expand All @@ -193,77 +231,24 @@ private[adam] class FastqRecordConverter extends Serializable with Logging {
setFirstOfPair: Boolean = false,
setSecondOfPair: Boolean = false,
stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecord = {
val lines = element._2.toString.split('\n')
require(lines.length == 4, "Record has wrong format:\n" + element._2.toString)

def trimTrailingReadNumber(readName: String): String = {
if (readName.endsWith("/1")) {
if (setSecondOfPair) {
throw new Exception(
s"Found read name $readName ending in '/1' despite second-of-pair flag being set"
)
}
readName.dropRight(2)
} else if (readName.endsWith("/2")) {
if (setFirstOfPair) {
throw new Exception(
s"Found read name $readName ending in '/2' despite first-of-pair flag being set"
)
}
readName.dropRight(2)
} else {
readName
}
}

// get fields for first read in pair
val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)
val (readName, readSequence, readQualities) =
this.parseReadInFastq(element._2.toString, setFirstOfPair, setSecondOfPair, stringency)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the this..


lazy val suffix = s"\n=== printing received Fastq record for debugging ===\n${lines.mkString("\n")}\n=== End of debug output for Fastq record ==="
if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1)
throw new IllegalArgumentException(s"Fastq quality must be defined. $suffix")
else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length)
throw new IllegalArgumentException(s"Fastq sequence and quality strings must have the same length.\n Fastq quality string of length ${lines(3).length}, expected ${readSequence.length} from the sequence length. $suffix")
// default to 0
val readInFragment =
if (setSecondOfPair) 1
else 0

val readQualities =
if (lines(3) == "*")
"B" * readSequence.length
else if (lines(3).length < lines(1).length)
lines(3) + ("B" * (lines(1).length - lines(3).length))
else if (lines(3).length > lines(1).length)
throw new NotImplementedError("Not implemented")
else
lines(3)
val readPaired = setFirstOfPair || setSecondOfPair

require(
readSequence.length == readQualities.length,
List(
s"Read $readName has different sequence and qual length:",
s"sequence=$readSequence",
s"qual=$readQualities"
).mkString("\n\t")
(setFirstOfPair && setSecondOfPair) == false,
"setFirstOfPair and setSecondOfPair cannot be true at the same time"
)

val builder = AlignmentRecord.newBuilder()
.setReadName(readName)
.setSequence(readSequence)
.setQual(readQualities)
.setReadPaired(setFirstOfPair || setSecondOfPair)
.setProperPair(null)
.setReadInFragment(
if (setFirstOfPair) 0
else if (setSecondOfPair) 1
else null
)
.setReadNegativeStrand(false)
.setMateNegativeStrand(null)
.setPrimaryAlignment(null)
.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)

recordGroupOpt.foreach(builder.setRecordGroupName)

builder.build()
this.makeAlignmentRecord(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the this..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed all this., wondering when will this. ever be necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only necessary if there is a collision between say a field and a local variable with the same name

readName, readSequence, readQualities,
readInFragment, readPaired, recordGroupOpt)
}
}