Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various updates #421

Merged
merged 4 commits into from
Aug 13, 2018
Merged

Various updates #421

merged 4 commits into from
Aug 13, 2018

Conversation

nh13
Copy link
Member

@nh13 nh13 commented Jul 28, 2018

@tfenne here are some various updates I have been accumulating.

@nh13 nh13 requested a review from tfenne July 28, 2018 02:40
@codecov-io
Copy link

codecov-io commented Jul 28, 2018

Codecov Report

Merging #421 into master will decrease coverage by 0.01%.
The diff coverage is 96.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
- Coverage   96.27%   96.26%   -0.02%     
==========================================
  Files          90       90              
  Lines        5099     5109      +10     
  Branches      659      652       -7     
==========================================
+ Hits         4909     4918       +9     
- Misses        190      191       +1
Impacted Files Coverage Δ
...ava/com/fulcrumgenomics/util/PickLongIndices.scala 94.49% <100%> (ø) ⬆️
...ain/scala/com/fulcrumgenomics/util/Sequences.scala 100% <100%> (ø) ⬆️
...c/main/scala/com/fulcrumgenomics/util/Metric.scala 93.75% <100%> (+0.2%) ⬆️
...cala/com/fulcrumgenomics/util/ProgressLogger.scala 50% <66.66%> (-16.67%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 431b695...3397e55. Read the comment docs.

tfenne
tfenne previously requested changes Jul 30, 2018
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being so picky :/

def record(rec: SamRecord): Boolean = record(rec.asSam)

/** Logs the last record if it wasn't already logged. */
def logLast(): Boolean = super.log()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really ugly that we have both log() and log(String... message) in the superclass. So when you refer to log() it's not very clear what's being called :/ Can you maybe add a trailing comment here something like:

def logLast(): Boolean = super.log()  // Calls the super's log() method, not log(message: String*)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}


case class OffsetAndLength(offset: Int, length: Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have at least a single line scaladoc if it's a public class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* @return the offset and length of the longest homopolymer
*/
def maxPolyX(s: String) : OffsetAndLength = {
val bs = s.getBytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this vs. using charAt() causes the creation of a whole new byte array every time this function is called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

val bs = s.getBytes
var (bestStart, bestLength) = (0,0)
var start = 0
while (start < bs.length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why switch to using a while loop instead of the forloop that was used previously? It automatically makes me think you're doing something other than incrementing by 1 each iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

var (bestStart, bestLength) = (0,0)
var start = 0
while (start < bs.length) {
val firstBase = s.charAt(start).toByte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd to use s.charAt(start) here and bs(i) below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

forloop(0, bs.length-1) { start =>
val (b1, b2) = (bs(start), bs(start+1))
var i = start
while (i < s.length-1 && SequenceUtil.basesEqual(b1, bs(i)) && SequenceUtil.basesEqual(b2, bs(i+1))) i += 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts:

  1. If you upper case like you're doing on 105 you don't really need to use SequenceUtil.basesEqual() here.
  2. If you continue to use SequenceUtil.basesEqual() I'd be tempted to define and use the following within this class:
/** Checks to see if the two characters are identical, ignoring case. */
@inline final private def same(a: Char, b: Char): Boolean = {
  Character.toUpperCase(a) == Character.toUpperCase(b)
}

This has the advantage that a) it'll work for all characters so if you try to use it on proteins it won't fail, and also will just make the usage more compact and easier to see what's going on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but "I did it my way"

/** Returns the sequence that is the complement of the provided sequence. */
def complement(s: String): String = {
val bs = s.getBytes
for (i <- 0 until bs.length) bs(i) = SequenceUtil.complement(bs(i))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think forloop(from=0, until=bs.length) will be much faster here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, fixed.

}

it should "return zero when passed a zero length string" in {
Sequences.maxPolyX("") shouldBe OffsetAndLength(0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you want this behavior? Also, I don't think it's reasonable to deprecate one method in favor of another if the behavior is different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Sometimes I have case classes with an empty sequence, so I don't want it to blow up when calling maxPolyX:
things.map(maxPolyX(_.sequence))

The longest homopolymer in an empty sequence is the empty string isn't it?
2. I don't think you should be using this method to catch the case (i.e. throw an exception) when you have an empty string. I think it's wrong, so I'd like to fix it. I really don't want to have to write guards everywhere in my code to ignore feeding empty strings to longestHomopolymer. One could argue that asking for the longest homopolymer implies that there is at least one homopolymer, but I think that's just an argument for argument's sake.

Any objections to fixing this? Do you have code that will break (i.e. relies on the exception being thrown here)?

* @param s a DNA or RNA sequence
* @return the offset and length of the longest homopolymer
*/
def maxPolyX(s: String) : OffsetAndLength = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts on naming (apply to maxDinux as well):

  1. I think longest is clearer than max. Longest can only really be interpreted one way. Max could be interpreted to mean "longest single", or "most abundant" such that in a sequence like ACACGTGTGTACAC you might say AC is the "max dinuc" because there is more of it.
  2. Homopolymer is a well defined term whereas polyX isn't really.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, fixed.

Sequences.complement("AAA") shouldBe "TTT"
Sequences.complement("ACAC") shouldBe "TGTG"
Sequences.complement("") shouldBe ""
Sequences.complement("AACCGGTGTG") shouldBe "TTGGCCACAC"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's only a call to a htsjdk function, but could you add a few basic tests for reverseComplement() please? That way if the library function gets broken/changed unexpectedly we'll catch it, or if we decide to reimplement we won't be caught short.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tfenne tfenne assigned nh13 and unassigned tfenne Jul 30, 2018
Copy link
Member Author

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne I made all your suggested fixes with the exception of the longestHomopolymer method failing when an empty string is given. Lets slack chat if you object to my reasoning to allow an empty string.

}

it should "return zero when passed a zero length string" in {
Sequences.maxPolyX("") shouldBe OffsetAndLength(0, 0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Sometimes I have case classes with an empty sequence, so I don't want it to blow up when calling maxPolyX:
things.map(maxPolyX(_.sequence))

The longest homopolymer in an empty sequence is the empty string isn't it?
2. I don't think you should be using this method to catch the case (i.e. throw an exception) when you have an empty string. I think it's wrong, so I'd like to fix it. I really don't want to have to write guards everywhere in my code to ignore feeding empty strings to longestHomopolymer. One could argue that asking for the longest homopolymer implies that there is at least one homopolymer, but I think that's just an argument for argument's sake.

Any objections to fixing this? Do you have code that will break (i.e. relies on the exception being thrown here)?

Sequences.complement("AAA") shouldBe "TTT"
Sequences.complement("ACAC") shouldBe "TGTG"
Sequences.complement("") shouldBe ""
Sequences.complement("AACCGGTGTG") shouldBe "TTGGCCACAC"
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

def record(rec: SamRecord): Boolean = record(rec.asSam)

/** Logs the last record if it wasn't already logged. */
def logLast(): Boolean = super.log()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}


case class OffsetAndLength(offset: Int, length: Int)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* @param s a DNA or RNA sequence
* @return the offset and length of the longest homopolymer
*/
def maxPolyX(s: String) : OffsetAndLength = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, fixed.

var (bestStart, bestLength) = (0,0)
var start = 0
while (start < bs.length) {
val firstBase = s.charAt(start).toByte
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* @return the offset and length of the longest dinucleotide sequence
*/
def maxDinuc(s: String) : OffsetAndLength = {
val bs = s.toUpperCase.getBytes
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

forloop(0, bs.length-1) { start =>
val (b1, b2) = (bs(start), bs(start+1))
var i = start
while (i < s.length-1 && SequenceUtil.basesEqual(b1, bs(i)) && SequenceUtil.basesEqual(b2, bs(i+1))) i += 2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but "I did it my way"

/** Returns the sequence that is the complement of the provided sequence. */
def complement(s: String): String = {
val bs = s.getBytes
for (i <- 0 until bs.length) bs(i) = SequenceUtil.complement(bs(i))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, fixed.

}

/** Reverse complements a string of bases. */
def reverseComplement(s: String): String = SequenceUtil.reverseComplement(s)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feeling good, you? Fixed.

@nh13 nh13 assigned tfenne and unassigned nh13 Aug 11, 2018
This is useful if you want to log the final record (or count).
In some cases, the metric file would not be fully written.
@nh13 nh13 force-pushed the nh_various_updates branch 2 times, most recently from 27ae815 to 81dea48 Compare August 13, 2018 21:59
@nh13 nh13 dismissed tfenne’s stale review August 13, 2018 22:05

Approval over Slack

@nh13 nh13 merged commit f2d5d5e into master Aug 13, 2018
@nh13 nh13 deleted the nh_various_updates branch August 13, 2018 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants