Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in MannWhitneyU on some JVMs. #5371

Merged
merged 2 commits into from Oct 30, 2018
Merged

Fix bug in MannWhitneyU on some JVMs. #5371

merged 2 commits into from Oct 30, 2018

Conversation

davidbenjamin
Copy link
Contributor

Closes #4290.

Since the Mann-Whitney U statistic is always integer or half-integer, we don't need to store a histogram of Doubles, which causes issues on some JVMs. Instead we can multiply U by two and store an integer key for the histogram, which avoids the issue.

@droazen I'll assign you and also @meganshand to sign off on the statistics.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -525,7 +529,7 @@ public double permutationTest(final double[] series1, final double[] series2, fi
assert (series2End == n2);

double newU = MathUtils.sum(newSeries1) - ((n1 * (n1 + 1)) / 2.0);
histo.increment(newU);
histo.increment(FastMath.round(2 * newU));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, but you shouldn't need to round, is that right? 2 * newU should always be an integer so casting or floor would also work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that with finite precision 2 * newU would end up slightly below the correct integer, in which case casting or floor would give the wrong value. And maybe this isn't supposed to happen, but I'm paranoid because this whole bug is from somebody's JVM doing things that the JVM isn't supposed to do.

Co-Authored-By: davidbenjamin <davidben@broadinstitute.org>
@codecov-io
Copy link

Codecov Report

Merging #5371 into master will decrease coverage by 0.001%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5371       +/-   ##
===============================================
- Coverage     86.931%   86.929%   -0.001%     
+ Complexity     30322     30320        -2     
===============================================
  Files           1849      1849               
  Lines         140743    140743               
  Branches       15476     15476               
===============================================
- Hits          122349    122347        -2     
- Misses         12786     12788        +2     
  Partials        5608      5608
Impacted Files Coverage Δ Complexity Δ
.../broadinstitute/hellbender/utils/MannWhitneyU.java 93.694% <100%> (ø) 48 <8> (ø) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 81.098% <0%> (+0.61%) 42% <0%> (ø) ⬇️

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbenjamin One question for you, otherwise looks good.

for (final Histogram.Bin<Double> bin : histo.values()) {
if (bin.getId() < testStatU) sumOfAllSmallerBins += bin.getValue();
for (final Histogram.Bin<Long> bin : histo.values()) {
if (bin.getId() < FastMath.round(2 * testStatU)) sumOfAllSmallerBins += bin.getValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something here, but can you explain why bin.getValue() is not divided by 2.0 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the value is just the count, and we only modified the key.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I got confused because the code above is purposefully taking "half of the count in the observed bin".

👍 merge away

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants