-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks #16483
Conversation
Test build #70969 has finished for PR 16483 at commit
|
ping @srowen @ankurdave can you take a look at this? |
@rxin can you take a look? |
cc @ankurdave |
@rxin can anyone else review this? It would be nice to get this correctness fix into 2.2. |
@@ -162,7 +162,15 @@ object PageRank extends Logging { | |||
iteration += 1 | |||
} | |||
|
|||
rankGraph | |||
// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the name of the ticket as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few small comments.
vp, sendMessage, messageCombiner) | ||
.mapVertices((vid, attr) => attr._1) | ||
} // end of deltaPageRank | ||
|
||
// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same code as above, please factor it into a function.
|
||
// Static PageRank should only take 3 iterations to converge | ||
val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) => | ||
// Static PageRank should only take 2 iterations to converge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it take only two iterations to converge now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't change, were still comparing the output of the 2nd and 3rd iteration.
val newPR = teleport + (1.0 - resetProb) * msgSum | ||
val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR | ||
(newPR, newDelta) | ||
val newPR = if (lastDelta == Double.NegativeInfinity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My memory of the algorithm is a bit rusty. Why don't you need to check for self-loops here anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing you mean the if (src==id)
check? I'm honestly not sure what was going on with this code its just wrong. The results do not match up with igraph/networkx at all. Furthermore the code is just nonsensical -- definition of var teleport = oldPR
that is then unconditionally set two lines down to teleport = oldPR*delta
without being used prior.
This revised implementation is much easier to follow and is now tested against 3 sets of reference values computed by igraph/networkx. Please let me know if you thing I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the new code is easier to follow in that respect.
In addition, this introduces an extra step reduction at each iteration. I am fine with that since it is for correctness, but @jkbradley may want to comment as well. |
Test build #74693 has finished for PR 16483 at commit
|
@thunterdb The extra step -- as implemented -- is only at the end as that gives the same result as doing it after every iteration but without the extra overhead. |
It looks good to me. cc @jkbradley or @mengxr for final approval |
Merging in master. Thanks! |
What changes were proposed in this pull request?
Graphs with sinks (vertices with no outgoing edges) don't have the expected rank sum of n (or 1 for personalized). We fix this by normalizing to the expected sum at the end of each implementation.
Additionally this fixes the dynamic version of personal pagerank which gave incorrect answers that were not detected by existing unit tests.
How was this patch tested?
Revamped existing and additional unit tests with reference values (and reproduction code) from igraph and NetworkX.
Note that for comparison on personal pagerank we use the arpack algorithm in igraph as prpack (the current default) redistributes rank to all vertices uniformly instead of just to the personalization source. We could take the alternate convention (redistribute rank to all vertices uniformly) but that would involve more extensive changes to the algorithms (the dynamic version would no longer be able to use Pregel).