Various updates #421
Codecov Report
@@ Coverage Diff @@
## master #421 +/- ##
==========================================
- Coverage 96.27% 96.26% -0.02%
==========================================
Files 90 90
Lines 5099 5109 +10
Branches 659 652 -7
==========================================
+ Hits 4909 4918 +9
- Misses 190 191 +1
Sorry for being so picky :/
def record(rec: SamRecord): Boolean = record(rec.asSam)

/** Logs the last record if it wasn't already logged. */
def logLast(): Boolean = super.log()
It's really ugly that we have both log() and log(String... message) in the superclass. So when you refer to log() it's not very clear what's being called :/ Can you maybe add a trailing comment here, something like:
def logLast(): Boolean = super.log() // Calls the super's log() method, not log(message: String*)
Done
}

case class OffsetAndLength(offset: Int, length: Int)
Should have at least a single line scaladoc if it's a public class.
Done
 * @return the offset and length of the longest homopolymer
 */
def maxPolyX(s: String) : OffsetAndLength = {
  val bs = s.getBytes
Doing this vs. using charAt() causes the creation of a whole new byte array every time this function is called.
Done
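The allocation point can be illustrated with a small sketch. String.getBytes returns a freshly allocated array on every call, whereas charAt reads the string in place. The object and the helper name countBase are hypothetical and not part of this PR.

```scala
object CharAtSketch {
  /** Counts occurrences of `base` in `s`, ignoring case, without copying the string. */
  def countBase(s: String, base: Char): Int = {
    val target = Character.toUpperCase(base)
    var count = 0
    var i = 0
    while (i < s.length) {
      // charAt indexes into the existing string; no new array is created here.
      if (Character.toUpperCase(s.charAt(i)) == target) count += 1
      i += 1
    }
    count
  }

  def main(args: Array[String]): Unit = {
    val s = "ACGTacgt"
    // Each getBytes call yields a distinct array object, so reference equality fails.
    println(s.getBytes ne s.getBytes) // true
    println(countBase(s, 'a'))        // 2
  }
}
```

Whether the extra allocation matters depends on how often the function is called; for a utility invoked once per read, it is likely worth avoiding.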
val bs = s.getBytes
var (bestStart, bestLength) = (0,0)
var start = 0
while (start < bs.length) {
Why switch to using a while loop instead of the forloop that was used previously? It automatically makes me think you're doing something other than incrementing by 1 each iteration.
Done
var (bestStart, bestLength) = (0,0)
var start = 0
while (start < bs.length) {
  val firstBase = s.charAt(start).toByte
Odd to use s.charAt(start) here and bs(i) below.
Done
forloop(0, bs.length-1) { start =>
  val (b1, b2) = (bs(start), bs(start+1))
  var i = start
  while (i < s.length-1 && SequenceUtil.basesEqual(b1, bs(i)) && SequenceUtil.basesEqual(b2, bs(i+1))) i += 2
Two thoughts:
- If you upper case like you're doing on 105 you don't really need to use SequenceUtil.basesEqual() here.
- If you continue to use SequenceUtil.basesEqual() I'd be tempted to define and use the following within this class:
/** Checks to see if the two characters are identical, ignoring case. */
@inline final private def same(a: Char, b: Char): Boolean = {
  Character.toUpperCase(a) == Character.toUpperCase(b)
}
This has the advantage that a) it'll work for all characters, so if you try to use it on proteins it won't fail, and b) it'll make the usage more compact and easier to read.
Done, but "I did it my way"
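For reference, a minimal sketch of how the suggested same() helper could slot into a dinucleotide scan. This is not the code that was merged: the object name and longestDinuc are hypothetical, OffsetAndLength is redefined locally so the block is self-contained, and plain while loops stand in for the project's forloop helper.

```scala
object DinucSketch {
  final case class OffsetAndLength(offset: Int, length: Int)

  /** True if the two characters are identical, ignoring case. */
  @inline private def same(a: Char, b: Char): Boolean =
    Character.toUpperCase(a) == Character.toUpperCase(b)

  /** Offset and length (in bases) of the longest run of a repeated two-base unit.
    * Returns (0, 0) when `s` has fewer than two characters. Ties keep the earliest run. */
  def longestDinuc(s: String): OffsetAndLength = {
    var bestStart = 0
    var bestLength = 0
    var start = 0
    while (start < s.length - 1) {
      val b1 = s.charAt(start)
      val b2 = s.charAt(start + 1)
      var i = start
      // Advance two bases at a time while the same two-base unit repeats.
      while (i < s.length - 1 && same(b1, s.charAt(i)) && same(b2, s.charAt(i + 1))) i += 2
      val length = i - start
      if (length > bestLength) { bestStart = start; bestLength = length }
      start += 1
    }
    OffsetAndLength(bestStart, bestLength)
  }
}
```

Because same() only upper-cases characters, the scan works on any alphabet and needs neither s.toUpperCase nor a byte array.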
/** Returns the sequence that is the complement of the provided sequence. */
def complement(s: String): String = {
  val bs = s.getBytes
  for (i <- 0 until bs.length) bs(i) = SequenceUtil.complement(bs(i))
I think forloop(from=0, until=bs.length) will be much faster here.
It is, fixed.
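Some background on why this can matter: in Scala, for (i <- 0 until n) desugars to a foreach call on a Range with a closure, while a forloop-style helper is typically just a while loop. The sketch below is an assumption about the shape of such a helper, not the project's implementation; indexLoop and complementBase are hypothetical stand-ins for forloop and SequenceUtil.complement, and complementBase only handles A, C, G and T.

```scala
object ComplementSketch {
  /** Hypothetical stand-in for a forloop-style helper: a while loop over [from, until). */
  @inline def indexLoop(from: Int, until: Int)(f: Int => Unit): Unit = {
    var i = from
    while (i < until) { f(i); i += 1 }
  }

  /** Simplified stand-in for a base complement: A/C/G/T in either case, others unchanged. */
  def complementBase(c: Char): Char = c match {
    case 'A' => 'T'
    case 'C' => 'G'
    case 'G' => 'C'
    case 'T' => 'A'
    case 'a' => 't'
    case 'c' => 'g'
    case 'g' => 'c'
    case 't' => 'a'
    case other => other
  }

  /** Returns the complement of `s` without reversing it. */
  def complement(s: String): String = {
    val cs = s.toCharArray // one copy, mutated in place below
    indexLoop(from = 0, until = cs.length) { i => cs(i) = complementBase(cs(i)) }
    new String(cs)
  }
}
```

Any actual speed difference depends on the Scala version and the JIT, so it is worth measuring before relying on it.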
}

it should "return zero when passed a zero length string" in {
  Sequences.maxPolyX("") shouldBe OffsetAndLength(0, 0)
Why would you want this behavior? Also, I don't think it's reasonable to deprecate one method in favor of another if the behavior is different.
1. Sometimes I have case classes with an empty sequence, so I don't want it to blow up when calling maxPolyX:
things.map(t => maxPolyX(t.sequence))
The longest homopolymer in an empty sequence is the empty string, isn't it?
2. I don't think you should be using this method to catch the case (i.e. throw an exception) when you have an empty string. I think it's wrong, so I'd like to fix it. I really don't want to have to write guards everywhere in my code to avoid feeding empty strings to longestHomopolymer. One could argue that asking for the longest homopolymer implies that there is at least one homopolymer, but I think that's just an argument for argument's sake.
Any objections to fixing this? Do you have code that will break (i.e. relies on the exception being thrown here)?
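To make the proposed behavior concrete, here is a minimal sketch of a homopolymer scan that returns OffsetAndLength(0, 0) for an empty string instead of throwing. It is an illustration under assumptions, not the merged implementation: the object name is hypothetical and OffsetAndLength is redefined locally.

```scala
object HomopolymerSketch {
  final case class OffsetAndLength(offset: Int, length: Int)

  /** Offset and length of the longest run of a single base, ignoring case.
    * An empty input yields OffsetAndLength(0, 0). Ties keep the earliest run. */
  def longestHomopolymer(s: String): OffsetAndLength = {
    var bestStart = 0
    var bestLength = 0
    var start = 0
    while (start < s.length) {
      val base = Character.toUpperCase(s.charAt(start))
      var end = start + 1
      while (end < s.length && Character.toUpperCase(s.charAt(end)) == base) end += 1
      if (end - start > bestLength) { bestStart = start; bestLength = end - start }
      start = end // jump past the run just measured, hence a while loop
    }
    OffsetAndLength(bestStart, bestLength)
  }

  def main(args: Array[String]): Unit = {
    // No guard needed for the empty sequence.
    println(Seq("", "AACCCGT").map(longestHomopolymer))
  }
}
```

With this shape, callers that map over collections containing empty sequences need no special casing; a caller that does want a failure on empty input can check s.isEmpty itself.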
 * @param s a DNA or RNA sequence
 * @return the offset and length of the longest homopolymer
 */
def maxPolyX(s: String) : OffsetAndLength = {
Two thoughts on naming (apply to maxDinuc as well):
- I think longest is clearer than max. Longest can only really be interpreted one way. Max could be interpreted to mean "longest single" or "most abundant", such that in a sequence like ACACGTGTGTACAC you might say AC is the "max dinuc" because there is more of it.
- Homopolymer is a well-defined term whereas polyX isn't really.
Agreed, fixed.
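The ambiguity in "max" can be checked on the example sequence. In ACACGTGTGTACAC the longest single run is GTGTGT, yet AC is the most frequent dinucleotide when every overlapping two-base window is counted. The sketch below only does the counting; the object and dinucCounts are hypothetical names used for illustration.

```scala
object DinucCountSketch {
  /** Counts every overlapping two-base window in `s`, ignoring case. */
  def dinucCounts(s: String): Map[String, Int] =
    s.toUpperCase
      .sliding(2)
      .filter(_.length == 2) // a one-character input yields a single short window
      .toSeq
      .groupBy(identity)
      .map { case (dinuc, hits) => dinuc -> hits.size }

  def main(args: Array[String]): Unit = {
    val counts = dinucCounts("ACACGTGTGTACAC")
    println(counts("AC")) // 4
    println(counts("GT")) // 3
  }
}
```

So "most abundant" picks AC while "longest" picks the GT run, which is why the more specific name avoids confusion.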
  Sequences.complement("AAA") shouldBe "TTT"
  Sequences.complement("ACAC") shouldBe "TGTG"
  Sequences.complement("") shouldBe ""
  Sequences.complement("AACCGGTGTG") shouldBe "TTGGCCACAC"
}
I know it's only a call to a htsjdk function, but could you add a few basic tests for reverseComplement() please? That way if the library function gets broken/changed unexpectedly we'll catch it, or if we decide to reimplement we won't be caught short.
Done.
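As an illustration of the kind of basic checks requested, here is a sketch that uses plain assert calls rather than the project's test framework. The local reverseComplement is a hypothetical stand-in so the block runs on its own; the real tests would call the project's wrapper around the htsjdk function. The inputs mirror the complement tests.

```scala
object ReverseComplementSketch {
  // Simplified complement table: A/C/G/T in either case; anything else is left unchanged.
  private val Complements: Map[Char, Char] = Map(
    'A' -> 'T', 'C' -> 'G', 'G' -> 'C', 'T' -> 'A',
    'a' -> 't', 'c' -> 'g', 'g' -> 'c', 't' -> 'a'
  )

  /** Hypothetical stand-in: reverses `s` and complements each base. */
  def reverseComplement(s: String): String =
    s.reverse.map(c => Complements.getOrElse(c, c))

  def main(args: Array[String]): Unit = {
    assert(reverseComplement("AAA") == "TTT")
    assert(reverseComplement("ACAC") == "GTGT")
    assert(reverseComplement("") == "")
    assert(reverseComplement("AACCGGTGTG") == "CACACCGGTT")
  }
}
```

Pinning a handful of expected values like these is enough to notice if the underlying library behavior changes or if the function is later reimplemented.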
@tfenne I made all your suggested fixes with the exception of the longestHomopolymer method failing when an empty string is given. Let's chat on Slack if you object to my reasoning for allowing an empty string.
 * @return the offset and length of the longest dinucleotide sequence
 */
def maxDinuc(s: String) : OffsetAndLength = {
  val bs = s.toUpperCase.getBytes
Done
}

/** Reverse complements a string of bases. */
def reverseComplement(s: String): String = SequenceUtil.reverseComplement(s)
Feeling good, you? Fixed.
This is useful if you want to log the final record (or count).
In some cases, the metric file would not be fully written.
27ae815 to 81dea48
81dea48 to 3397e55
@tfenne here are some various updates I have been accumulating.