Fastq record converter #1185
@@ -40,6 +40,96 @@ import scala.collection.JavaConversions._ | |
*/ | ||
private[adam] class FastqRecordConverter extends Serializable with Logging { | ||
|
||
/** | ||
* Parse 4 lines at a time | ||
* @see parseReadPairInFastq | ||
* * | ||
*/ | ||
private def parseReadInFastq(input: String, | ||
Review comment: Is the use of
||
setFirstOfPair: Boolean = false, | ||
setSecondOfPair: Boolean = false, | ||
stringency: ValidationStringency = ValidationStringency.STRICT): (String, String, String) = { | ||
val lines = input.split('\n') | ||
require(lines.length == 4, | ||
s"Input must have 4 lines (${lines.length.toString} found):\n${input}") | ||
|
||
val readName = lines(0).drop(1) | ||
if (readName.endsWith("/1") && setSecondOfPair) | ||
Review comment: I've seen files in the wild that use
Review comment: I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.
Review comment: There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137
Review comment: +1, I've also seen
Review comment: addressed by regex like
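To make the regex idea in this thread concrete, here is a minimal sketch that covers only the `/1` and `/2` suffixes the diff already handles; it does not guess at any other convention. `ReadNameSuffixSketch` and its method names are hypothetical and are not part of the PR.

```scala
object ReadNameSuffixSketch {
  // Matches a trailing "/1" or "/2" only; other separators are deliberately not handled.
  private val PairSuffix = """/([12])$""".r

  /** Some(1) or Some(2) when the name ends in "/1" or "/2", otherwise None. */
  def readNumber(readName: String): Option[Int] =
    PairSuffix.findFirstMatchIn(readName).map(_.group(1).toInt)

  /** The read name with a trailing "/1" or "/2" removed, if present. */
  def stripSuffix(readName: String): String =
    PairSuffix.replaceFirstIn(readName, "")

  /** True when the suffix, if any, does not contradict the pair flags. */
  def suffixMatchesFlags(readName: String,
                         setFirstOfPair: Boolean,
                         setSecondOfPair: Boolean): Boolean =
    readNumber(readName) match {
      case Some(1) => !setSecondOfPair // "/1" must not be flagged as second of pair
      case Some(2) => !setFirstOfPair  // "/2" must not be flagged as first of pair
      case _       => true             // no suffix: nothing to contradict
    }
}
```

Keeping the pattern in one place means further suffix conventions could be added gradually, as suggested above, without touching the validation logic.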
||
throw new Exception( | ||
s"Found read name $readName ending in '/1' despite second-of-pair flag being set" | ||
) | ||
else if (readName.endsWith("/2") && setFirstOfPair) | ||
throw new Exception( | ||
s"Found read name $readName ending in '/2' despite first-of-pair flag being set" | ||
) | ||
val suffix = """(\/1$)|(\/2$)""".r | ||
val readNameNoSuffix = suffix.replaceAllIn(readName, "") | ||
|
||
val readSequence = lines(1) | ||
val readQualitiesRaw = lines(3) | ||
|
||
val readQualities = | ||
if (stringency == ValidationStringency.STRICT) readQualitiesRaw | ||
else { | ||
if (readQualitiesRaw == "*") "B" * readSequence.length | ||
else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length)) | ||
else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented") | ||
Review comment: Also, what's the reason for padding
Review comment: IIRC,
Review comment: https://en.wikipedia.org/wiki/FASTQ_format
Review comment: fixed
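For reference, the non-strict branch under discussion can be isolated as a small pure function. This is a sketch that mirrors the branches shown in the diff (including the `"B"` padding character and the `NotImplementedError` for over-long qualities); `QualityPaddingSketch` and `padQualities` are hypothetical names.

```scala
object QualityPaddingSketch {
  /**
   * Lenient quality handling, mirroring the branches in the diff:
   * "*" or a too-short quality string is padded with 'B' up to the sequence length.
   */
  def padQualities(sequence: String, rawQualities: String): String =
    if (rawQualities == "*") "B" * sequence.length
    else if (rawQualities.length < sequence.length)
      rawQualities + ("B" * (sequence.length - rawQualities.length))
    else if (rawQualities.length > sequence.length)
      // The diff leaves this case unimplemented; the sketch does the same.
      throw new NotImplementedError("Quality string longer than sequence")
    else rawQualities
}
```

Isolating the branch this way makes each case easy to unit test without constructing a full FASTQ record.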
||
else readQualitiesRaw | ||
} | ||
|
||
if (stringency == ValidationStringency.STRICT) { | ||
if (readQualitiesRaw == "*" && readSequence.length > 1) | ||
throw new IllegalArgumentException(s"Fastq quality must be defined for\n $input") | ||
} | ||
|
||
require( | ||
readSequence.length == readQualities.length, | ||
s"The first read: ${readName}, has different sequence and qual length." | ||
) | ||
|
||
(readNameNoSuffix, readSequence, readQualities) | ||
} | ||
|
||
private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = { | ||
val lines = input.toString.split('\n') | ||
require(lines.length == 8, | ||
Review comment: I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be newline characters at any place in the record.
Review comment: Can you give an example of hard line wrapping? What does it look like?
Review comment: There are all kinds of correct and incorrect examples here: https://github.com/biojava/biojava/tree/master/biojava-sequencing/src/test/resources/org/biojava/nbio/sequencing/io/fastq
Review comment: I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow-on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.
Review comment: I tried to implement parsing for wrapped lines. Then I found that it would require an exact match of sequence length and quality length. Otherwise, it's ambiguous when the quality lines stop. This makes padding with
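The ambiguity described in this thread can be made concrete. Below is a minimal sketch, not the PR's implementation, of parsing hard-wrapped records: sequence lines are read up to the `+` separator line, then quality lines are consumed until their combined length equals the sequence length. Because `@` and `+` are themselves valid quality characters, length is the only stopping rule used here, which is why mismatched lengths cannot be tolerated in this mode. `WrappedFastqSketch`, `FastqRead`, and `parseAll` are hypothetical names.

```scala
import scala.annotation.tailrec

object WrappedFastqSketch {
  final case class FastqRead(name: String, sequence: String, qualities: String)

  // Consume quality lines until exactly `needed` characters have been collected.
  private def takeQualities(lines: List[String], needed: Int): (String, List[String]) = {
    val sb = new StringBuilder
    var rest = lines
    while (sb.length < needed && rest.nonEmpty) {
      sb.append(rest.head)
      rest = rest.tail
    }
    require(sb.length == needed,
      s"Quality length ${sb.length} does not match sequence length $needed")
    (sb.toString, rest)
  }

  /** Parses zero or more possibly hard-wrapped FASTQ records from one string. */
  def parseAll(text: String): List[FastqRead] = {
    @tailrec
    def loop(lines: List[String], acc: List[FastqRead]): List[FastqRead] = lines match {
      case Nil => acc.reverse
      case header :: tail =>
        require(header.startsWith("@"), s"Expected '@' header, found: $header")
        // Sequence lines run until the separator line, which starts with '+'.
        val (seqLines, afterSeq) = tail.span(!_.startsWith("+"))
        require(afterSeq.nonEmpty, s"Missing '+' separator for ${header.drop(1)}")
        val sequence = seqLines.mkString
        val (qualities, remaining) = takeQualities(afterSeq.tail, sequence.length)
        loop(remaining, FastqRead(header.drop(1), sequence, qualities) :: acc)
    }
    loop(text.split('\n').toList.filter(_.nonEmpty), Nil)
  }
}
```

In the wrapped case a quality line may begin with `@`, so a parser that looks for the next `@` header instead of counting characters would split the record in the wrong place.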
||
s"Record must have 8 lines (${lines.length.toString} found):\n${input}") | ||
|
||
val (firstReadName, firstReadSequence, firstReadQualities) = | ||
this.parseReadInFastq(lines.take(4).mkString("\n"), setFirstOfPair = true, setSecondOfPair = false) | ||
|
||
val (secondReadName, secondReadSequence, secondReadQualities) = | ||
this.parseReadInFastq(lines.drop(4).mkString("\n"), setFirstOfPair = false, setSecondOfPair = true) | ||
|
||
( | ||
firstReadName, | ||
firstReadSequence, | ||
firstReadQualities, | ||
secondReadName, | ||
secondReadSequence, | ||
secondReadQualities | ||
) | ||
} | ||
|
||
private def makeAlignmentRecord(readName: String, | ||
sequence: String, | ||
qual: String, | ||
readInFragment: Int, | ||
readPaired: Boolean = true, | ||
recordGroupOpt: Option[String] = None): AlignmentRecord = { | ||
val builder = AlignmentRecord.newBuilder | ||
.setReadName(readName) | ||
.setSequence(sequence) | ||
.setQual(qual) | ||
.setReadPaired(readPaired) | ||
.setReadInFragment(readInFragment) | ||
|
||
if (recordGroupOpt != None) | ||
Review comment: With
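A note on the guard at this point in the diff: `Option.foreach` runs its function only for `Some(_)` and does nothing for `None`, so the `recordGroupOpt.foreach(...)` call behaves the same with or without a preceding `if (recordGroupOpt != None)`. The sketch below illustrates this with a hypothetical stand-in builder (`NameBuilder`), which is not the ADAM `AlignmentRecord` builder.

```scala
object OptionForeachSketch {
  // Hypothetical stand-in for a builder; not the ADAM AlignmentRecord builder.
  final class NameBuilder {
    private var recordGroupName: Option[String] = None
    def setRecordGroupName(name: String): NameBuilder = {
      recordGroupName = Some(name)
      this
    }
    def build: Option[String] = recordGroupName
  }

  def buildWith(recordGroupOpt: Option[String]): Option[String] = {
    val builder = new NameBuilder
    // No explicit None check: foreach is a no-op when the option is empty.
    recordGroupOpt.foreach(builder.setRecordGroupName)
    builder.build
  }
}
```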
||
recordGroupOpt.foreach(builder.setRecordGroupName) | ||
|
||
builder.build | ||
} | ||
|
||
/** | ||
* Converts a read pair in FASTQ format into two AlignmentRecords. | ||
* | ||
|
@@ -59,57 +149,19 @@ private[adam] class FastqRecordConverter extends Serializable with Logging { | |
* @see convertFragment | ||
*/ | ||
def convertPair(element: (Void, Text)): Iterable[AlignmentRecord] = { | ||
val lines = element._2.toString.split('\n') | ||
require(lines.length == 8, "Record has wrong format:\n" + element._2.toString) | ||
|
||
// get fields for first read in pair | ||
val firstReadName = lines(0).drop(1) | ||
val firstReadSequence = lines(1) | ||
val firstReadQualities = lines(3) | ||
|
||
require( | ||
firstReadSequence.length == firstReadQualities.length, | ||
"Read " + firstReadName + " has different sequence and qual length." | ||
) | ||
|
||
// get fields for second read in pair | ||
val secondReadName = lines(4).drop(1) | ||
val secondReadSequence = lines(5) | ||
val secondReadQualities = lines(7) | ||
|
||
require( | ||
secondReadSequence.length == secondReadQualities.length, | ||
"Read " + secondReadName + " has different sequence and qual length." | ||
) | ||
val ( | ||
firstReadName, | ||
firstReadSequence, | ||
firstReadQualities, | ||
secondReadName, | ||
secondReadSequence, | ||
secondReadQualities | ||
) = this.parseReadPairInFastq(element._2.toString) | ||
Review comment: You don't need the
||
|
||
// build and return iterators | ||
Iterable( | ||
AlignmentRecord.newBuilder() | ||
.setReadName(firstReadName) | ||
.setSequence(firstReadSequence) | ||
.setQual(firstReadQualities) | ||
.setReadPaired(true) | ||
.setProperPair(true) | ||
.setReadInFragment(0) | ||
.setReadNegativeStrand(null) | ||
.setMateNegativeStrand(null) | ||
.setPrimaryAlignment(null) | ||
.setSecondaryAlignment(null) | ||
.setSupplementaryAlignment(null) | ||
.build(), | ||
AlignmentRecord.newBuilder() | ||
.setReadName(secondReadName) | ||
.setSequence(secondReadSequence) | ||
.setQual(secondReadQualities) | ||
.setReadPaired(true) | ||
.setProperPair(true) | ||
.setReadInFragment(1) | ||
.setReadNegativeStrand(null) | ||
.setMateNegativeStrand(null) | ||
.setPrimaryAlignment(null) | ||
.setSecondaryAlignment(null) | ||
.setSupplementaryAlignment(null) | ||
.build() | ||
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0), | ||
Review comment: Don't need the
||
this.makeAlignmentRecord(secondReadName, secondReadSequence, secondReadQualities, 1) | ||
) | ||
} | ||
|
||
|
@@ -126,28 +178,15 @@ private[adam] class FastqRecordConverter extends Serializable with Logging { | |
* @see convertPair | ||
*/ | ||
def convertFragment(element: (Void, Text)): Fragment = { | ||
val lines = element._2.toString.split('\n') | ||
require(lines.length == 8, "Record has wrong format:\n" + element._2.toString) | ||
|
||
// get fields for first read in pair | ||
val firstReadName = lines(0).drop(1) | ||
val firstReadSequence = lines(1) | ||
val firstReadQualities = lines(3) | ||
|
||
require( | ||
firstReadSequence.length == firstReadQualities.length, | ||
"Read " + firstReadName + " has different sequence and qual length." | ||
) | ||
|
||
// get fields for second read in pair | ||
val secondReadName = lines(4).drop(1) | ||
val secondReadSequence = lines(5) | ||
val secondReadQualities = lines(7) | ||
val ( | ||
firstReadName, | ||
firstReadSequence, | ||
firstReadQualities, | ||
secondReadName, | ||
secondReadSequence, | ||
secondReadQualities | ||
) = this.parseReadPairInFastq(element._2.toString) | ||
Review comment: Don't need the
||
|
||
require( | ||
secondReadSequence.length == secondReadQualities.length, | ||
"Read " + secondReadName + " has different sequence and qual length." | ||
) | ||
require( | ||
firstReadName == secondReadName, | ||
"Reads %s and %s in Fragment have different names.".format( | ||
|
@@ -156,17 +195,16 @@ private[adam] class FastqRecordConverter extends Serializable with Logging { | |
) | ||
) | ||
|
||
val alignments = List( | ||
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0), | ||
Review comment: Don't need the
||
this.makeAlignmentRecord(secondReadName, secondReadSequence, secondReadQualities, 1) | ||
) | ||
|
||
// build and return record | ||
Fragment.newBuilder() | ||
Fragment.newBuilder | ||
.setReadName(firstReadName) | ||
.setAlignments(List(AlignmentRecord.newBuilder() | ||
.setSequence(firstReadSequence) | ||
.setQual(firstReadQualities) | ||
.build(), AlignmentRecord.newBuilder() | ||
.setSequence(secondReadSequence) | ||
.setQual(secondReadQualities) | ||
.build())) | ||
.build() | ||
.setAlignments(alignments) | ||
.build | ||
} | ||
|
||
/** | ||
|
@@ -193,77 +231,24 @@ private[adam] class FastqRecordConverter extends Serializable with Logging { | |
setFirstOfPair: Boolean = false, | ||
setSecondOfPair: Boolean = false, | ||
stringency: ValidationStringency = ValidationStringency.STRICT): AlignmentRecord = { | ||
val lines = element._2.toString.split('\n') | ||
require(lines.length == 4, "Record has wrong format:\n" + element._2.toString) | ||
|
||
def trimTrailingReadNumber(readName: String): String = { | ||
if (readName.endsWith("/1")) { | ||
if (setSecondOfPair) { | ||
throw new Exception( | ||
s"Found read name $readName ending in '/1' despite second-of-pair flag being set" | ||
) | ||
} | ||
readName.dropRight(2) | ||
} else if (readName.endsWith("/2")) { | ||
if (setFirstOfPair) { | ||
throw new Exception( | ||
s"Found read name $readName ending in '/2' despite first-of-pair flag being set" | ||
) | ||
} | ||
readName.dropRight(2) | ||
} else { | ||
readName | ||
} | ||
} | ||
|
||
// get fields for first read in pair | ||
val readName = trimTrailingReadNumber(lines(0).drop(1)) | ||
val readSequence = lines(1) | ||
val (readName, readSequence, readQualities) = | ||
this.parseReadInFastq(element._2.toString, setFirstOfPair, setSecondOfPair, stringency) | ||
Review comment: Don't need the
||
|
||
lazy val suffix = s"\n=== printing received Fastq record for debugging ===\n${lines.mkString("\n")}\n=== End of debug output for Fastq record ===" | ||
if (stringency == ValidationStringency.STRICT && lines(3) == "*" && readSequence.length > 1) | ||
throw new IllegalArgumentException(s"Fastq quality must be defined. $suffix") | ||
else if (stringency == ValidationStringency.STRICT && lines(3).length != readSequence.length) | ||
throw new IllegalArgumentException(s"Fastq sequence and quality strings must have the same length.\n Fastq quality string of length ${lines(3).length}, expected ${readSequence.length} from the sequence length. $suffix") | ||
// default to 0 | ||
val readInFragment = | ||
if (setSecondOfPair) 1 | ||
else 0 | ||
|
||
val readQualities = | ||
if (lines(3) == "*") | ||
"B" * readSequence.length | ||
else if (lines(3).length < lines(1).length) | ||
lines(3) + ("B" * (lines(1).length - lines(3).length)) | ||
else if (lines(3).length > lines(1).length) | ||
throw new NotImplementedError("Not implemented") | ||
else | ||
lines(3) | ||
val readPaired = setFirstOfPair || setSecondOfPair | ||
|
||
require( | ||
readSequence.length == readQualities.length, | ||
List( | ||
s"Read $readName has different sequence and qual length:", | ||
s"sequence=$readSequence", | ||
s"qual=$readQualities" | ||
).mkString("\n\t") | ||
(setFirstOfPair && setSecondOfPair) == false, | ||
"setFirstOfPair and setSecondOfPair cannot be true at the same time" | ||
) | ||
|
||
val builder = AlignmentRecord.newBuilder() | ||
.setReadName(readName) | ||
.setSequence(readSequence) | ||
.setQual(readQualities) | ||
.setReadPaired(setFirstOfPair || setSecondOfPair) | ||
.setProperPair(null) | ||
.setReadInFragment( | ||
if (setFirstOfPair) 0 | ||
else if (setSecondOfPair) 1 | ||
else null | ||
) | ||
.setReadNegativeStrand(false) | ||
.setMateNegativeStrand(null) | ||
.setPrimaryAlignment(null) | ||
.setSecondaryAlignment(null) | ||
.setSupplementaryAlignment(null) | ||
|
||
recordGroupOpt.foreach(builder.setRecordGroupName) | ||
|
||
builder.build() | ||
this.makeAlignmentRecord( | ||
Review comment: Don't need the
Review comment: I have removed all
|
||
readName, readSequence, readQualities, | ||
readInFragment, readPaired, recordGroupOpt) | ||
} | ||
} |
Review comment: This doc comment doesn't match the method. Perhaps something like
Return true if the read name suffix and flags match.