[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence #16271
Conversation
… first iteration which then changes the center in the second iteration
Test build #70100 has finished for PR 16271 at commit
Test build #70101 has finished for PR 16271 at commit
Test build #70139 has finished for PR 16271 at commit
I am not sure who, if anyone, would review GraphX at this point, and I am not so familiar with the implementation here. If it converges to the same answer faster, that's good. It would be nice to understand why this initialization is better; is there a paper or a similar implementation?
I just emailed @ankurdave and he is going to look at this tonight.
References
Since they are more focused on updating values for one evolving graph (the internet), they don't really talk about starting from scratch. But this does emphasize that there is no change to the answers, just to the rate of convergence. A more direct statement would be Wikipedia:
Note that there are two variants of PageRank that differ by a constant multiple in their outputs but use the same damping factor; we use the version whose ranks sum to N (most other implementations use the other). More from Wikipedia:
Essentially, starting with the correct sum is closer to the actual fixed point and thus gets you faster convergence. The NetworkX implementation uses the variant that sums to 1, hence their initialization values are all 1/N. igraph is unfortunately not comparable, as they use a more complex linear-solver approach. Additional credentials (if it matters): PhD in Mathematics with a dissertation in graph theory.
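To make the convergence argument concrete, here is a plain-Python power-iteration sketch of the sum-to-N PageRank variant (an illustration only, not the GraphX implementation; the function name and toy graph are made up). On a graph whose fixed point is rank 1.0 everywhere, starting all ranks at 1.0 is already the answer, while starting at resetProb has to climb toward it geometrically:

```python
# Power-iteration PageRank in the variant whose ranks sum to N.
# Illustrative sketch only; not the GraphX implementation.

def pagerank(adj, init, reset_prob=0.15, tol=1e-10, max_iter=1000):
    n = len(adj)
    ranks = [init] * n
    out_deg = [len(nbrs) for nbrs in adj]
    for it in range(1, max_iter + 1):
        # Each vertex distributes its rank evenly along its out-edges.
        contrib = [0.0] * n
        for u, nbrs in enumerate(adj):
            for v in nbrs:
                contrib[v] += ranks[u] / out_deg[u]
        new_ranks = [reset_prob + (1.0 - reset_prob) * c for c in contrib]
        delta = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if delta < tol:
            return ranks, it
    return ranks, max_iter

# A 4-cycle: every vertex has in- and out-degree 1, so the fixed point
# is rank 1.0 at every vertex (and the ranks sum to N = 4).
cycle = [[1], [2], [3], [0]]
ranks_from_one, iters_from_one = pagerank(cycle, init=1.0)
ranks_from_rp, iters_from_rp = pagerank(cycle, init=0.15)
```

Initializing at 1.0 converges immediately here, while initializing at resetProb takes on the order of a hundred iterations to reach the same tolerance, because each iteration only closes the gap to the fixed point by a factor of (1 - resetProb).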
I mostly plead ignorance, but this seems reasonable. Is the improvement mostly coming from a better magnitude of the initial weights, then?
I'm just trying to figure out whether the current implementation is also a different, decent idea, and just scaling resetProb would be even better. No idea.
```diff
-      rankGraph = rankGraph.joinVertices(rankUpdates) {
-        (vid, oldRank, msgSum) =>
-          val popActivations: BV[Double] = msgSum :* (1.0 - resetProb)
+      rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
```
Is this a related but slightly separate fix?
It's an intertwined bug in the implementation that was introduced at the same time (when moving away from Pregel in 15a5645). The only vertices not included in the original joinVertices are source vertices (those with no incoming edges). Normally (in the absence of sinks) source vertices would have a PageRank equal to the reset probability. Since source vertices were not included in the join, their rank was fixed at their initial value, which fortunately was the correct value. When we change the initial value of all vertices to 1, it exposes this error.
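The join semantics behind this can be sketched with plain dicts (names and shapes here are illustrative, not the Spark API): a joinVertices-style update only touches vertices that received a message, so a source vertex stays frozen at its initial rank, while an outerJoinVertices-style update touches every vertex and treats a missing message sum as zero.

```python
# Plain-dict sketch of joinVertices vs outerJoinVertices semantics.
# Illustrative only; not the actual GraphX API.
reset_prob = 0.15

def update_join(ranks, msg_sums):
    # joinVertices-style: vertices absent from msg_sums keep their old rank.
    return {v: (reset_prob + (1.0 - reset_prob) * msg_sums[v])
               if v in msg_sums else r
            for v, r in ranks.items()}

def update_outer_join(ranks, msg_sums):
    # outerJoinVertices-style: every vertex is updated; a missing message
    # sum is treated as 0.0, so a source correctly gets rank resetProb.
    return {v: reset_prob + (1.0 - reset_prob) * msg_sums.get(v, 0.0)
            for v in ranks}

# Vertex 0 is a source feeding vertex 1; only vertex 1 receives messages.
ranks = {0: 1.0, 1: 1.0}
msgs = {1: 1.0}
```

With the old initial value of resetProb, the frozen source rank happened to be correct; once the initial value is 1.0, the join-style update leaves the source at 1.0 forever, while the outer-join-style update moves it to resetProb in the first iteration.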
```diff
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
     val resetProb = 0.15
     val errorTol = 1.0e-5

-    val staticRanks1 = starGraph.staticPageRank(numIter = 1, resetProb).vertices
-    val staticRanks2 = starGraph.staticPageRank(numIter = 2, resetProb).vertices.cache()
+    val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
```
A few extra iterations make the test more robust?
Not really more robust, since the graph has a sink and is thus still wrong pending SPARK-18847. But it is needed with this change, to fully propagate the changed rank of source vertices in the first iteration, as explained above.
Yes, the improvement comes from the sum of the magnitudes of the initial values being closer to the (known) sum of the solution. Fiddling with resetProb controls a completely different thing. The current implementation has no advantage (excluding finding the incorrect solution to a star graph one iteration faster).
Merged into master.
What changes were proposed in this pull request?

Change the initial value in all PageRank implementations to be `1.0` instead of `resetProb` (default `0.15`), and use `outerJoinVertices` instead of `joinVertices` so that source vertices get updated in each iteration.

This seems to have been introduced a long time ago in 15a5645#diff-b2bf3f97dcd2f19d61c921836159cda9L90

With the exception of graphs with sinks (which currently give incorrect results, see SPARK-18847), this gives faster convergence, as the sum of ranks is already correct (the sum of ranks should be the number of vertices).

Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
Code for benchmark: https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd

How was this patch tested?

(corrected) Existing unit tests, and an additional test that verifies against the results of igraph and NetworkX on a loop with a source.
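The "loop with a source" scenario from the test description can be reconstructed as a toy sketch (this is not the actual Spark test; the graph and code here are made up for illustration). In the sum-to-N variant, on a source vertex feeding a 3-cycle, the source's rank is exactly resetProb and the ranks sum to the number of vertices:

```python
# Toy "loop with a source": vertex 0 feeds a 3-cycle 1 -> 2 -> 3 -> 1.
# There are no sinks, so in the sum-to-N variant the total rank stays at N.
# Illustrative sketch only; not the actual Spark test.
reset_prob = 0.15
adj = {0: [1], 1: [2], 2: [3], 3: [1]}
ranks = {v: 1.0 for v in adj}

for _ in range(200):  # plain power iteration, more than enough to converge
    contrib = {v: 0.0 for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            contrib[v] += ranks[u] / len(nbrs)
    ranks = {v: reset_prob + (1.0 - reset_prob) * contrib[v] for v in adj}

total = sum(ranks.values())  # ~4.0, the number of vertices
source_rank = ranks[0]       # ~0.15, i.e. resetProb
```

A cross-check against NetworkX would additionally have to scale its output by N, since NetworkX returns the variant whose ranks sum to 1.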