-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix right candidate selection for large cluster refinement #20
Conversation
@@ -139,7 +139,7 @@ private static double scoreClustering(final byte[] partitions, final double[][] | |||
.mapToObj(i -> { | |||
// reverse of Gaussian | |||
int leftIndex = (int) (Math.sqrt(i + 0.25) - 0.5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First error, it should be: int leftIndex = (int) (Math.sqrt(2 * i + 0.25) - 0.5);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation: Gaussian formula is i = n * (n + 1) / 2, with n is leftIndex
To solve: 0 = n^2 + n - 2 * i
@@ -139,7 +139,7 @@ private static double scoreClustering(final byte[] partitions, final double[][] | |||
.mapToObj(i -> { | |||
// reverse of Gaussian | |||
int leftIndex = (int) (Math.sqrt(i + 0.25) - 0.5); | |||
int rightIndex = i - getNumEdges(leftIndex) + leftIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second error, it should just be int rightIndex = i - getNumEdges(leftIndex);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation: leftIndex and rightIndex should take all possible values where leftIndex > rightIndex and leftIndex < numNodes and rightIndex < numNodes.
i is a random edge < numEdges
leftIndex is the largest node, such that leftIndex * (leftIndex + 1) / 2 <= i
rightIndex is then the remainder.
Examples:
i | leftIndex | rightIndex |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 1 | 1 |
3 | 2 | 0 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 0 |
7 | 3 | 1 |
8 | 3 | 2 |
9 | 3 | 3 |
10 | 4 | 0 |
11 | 4 | 1 |
12 | 4 | 2 |
13 | 4 | 3 |
14 | 4 | 4 |
15 | 5 | 0 |
16 | 5 | 1 |
17 | 5 | 2 |
18 | 5 | 3 |
19 | 5 | 4 |
20 | 5 | 5 |
21 | 6 | 0 |
22 | 6 | 1 |
23 | 6 | 2 |
24 | 6 | 3 |
25 | 6 | 4 |
26 | 6 | 5 |
27 | 6 | 6 |
After reverse-engineering the algorithm and trying to understand the equation for computing the right candidate, we came to the conclusion the index need to be halved. Before it was causing issues due to IndexOutOfBoundsExceptions for larger clusters.
Now the equation seems to be sound, at least a quick check with Wolfram Alpha prooves that we cannot run out of bounds anymore.
@AHeise Can you give it a short review, if we haven't broken something unintentionally?