
[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence #16271

Closed
aray wants to merge 6 commits into master from aray:pagerank-initial-value

Conversation

@aray (Contributor) commented Dec 13, 2016

What changes were proposed in this pull request?

Change the initial value in all PageRank implementations to be `1.0` instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of `joinVertices` so that source vertices get updated in each iteration.
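In GraphX terms, the change amounts to something like the following condensed sketch of the static PageRank loop (simplified; caching, personalization, and convergence checking omitted, so this is not the verbatim Spark source):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Condensed sketch of static PageRank with this PR's two changes applied.
def staticPageRank[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], numIter: Int, resetProb: Double = 0.15): Graph[Double, Double] = {
  // Weight each edge by 1 / outDegree(src), and start every rank at 1.0
  // (previously resetProb), so the total mass is already N, the fixed point's sum.
  var rankGraph: Graph[Double, Double] = graph
    .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
    .mapTriplets(e => 1.0 / e.srcAttr)
    .mapVertices((vid, attr) => 1.0)

  for (_ <- 1 to numIter) {
    val rankUpdates = rankGraph.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)
    // outerJoinVertices (previously joinVertices) so that source vertices,
    // which receive no messages, are still re-scored to resetProb each round.
    rankGraph = rankGraph.outerJoinVertices(rankUpdates) { (vid, oldRank, msgSumOpt) =>
      resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
    }
  }
  rankGraph
}
```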

This seems to have been introduced a long time ago in 15a5645#diff-b2bf3f97dcd2f19d61c921836159cda9L90

With the exception of graphs with sinks (which currently give incorrect results; see SPARK-18847), this gives faster convergence because the sum of ranks is already correct (the sum of ranks should equal the number of vertices).

Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
Code for the benchmark: https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd

How was this patch tested?

Existing unit tests (corrected) and an additional test that verifies the results against igraph and NetworkX on a loop with a source.

@SparkQA commented Dec 13, 2016

Test build #70100 has finished for PR 16271 at commit 7ea03a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 13, 2016

Test build #70101 has finished for PR 16271 at commit 33cd794.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Dec 14, 2016

Updated the above benchmark code with a log-normal random graph on 10,000 vertices; the difference is much more drastic.

(Take the very bottom of the graph with a grain of salt, as it's measured in comparison to g.pageRank(0.00001); the actual error continues to drop.)

@SparkQA commented Dec 14, 2016

Test build #70139 has finished for PR 16271 at commit 8be9a97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Dec 14, 2016

I am not sure who, if anyone, would review GraphX at this point, and I am not so familiar with the implementation here. If it converges to the same answer faster, that's good. It might be nice to understand why this initialization is better, e.g. from a paper or a similar implementation.

@rxin (Contributor) commented Dec 14, 2016

I just emailed @ankurdave and he is going to look at this tonight.

@aray (Contributor, Author) commented Dec 14, 2016

References

The PageRank paper:

We need to make an initial assignment of the ranks. This assignment can be made by one of several strategies. If it is going to iterate until convergence, in general the initial values will not affect final values, just the rate of convergence. But we can speed up convergence by choosing a good initial assignment.

Since they are more focused on updating values for one evolving graph (the internet), they don't really talk about starting from scratch. But this does emphasize that there is no change to the answers, just the rate of convergence.

A more direct statement comes from Wikipedia:

PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.

Note that there are two variants of PageRank that differ by a constant multiple in their outputs but use the same damping factor; we use the version that sums to N (most other implementations use the other). More from Wikipedia:

The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.
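For concreteness (these are the standard formulas, not quoted from the thread): with damping factor $d = 1 - \text{resetProb}$, in-neighbor set $B(v)$, and out-degree $L(u)$, the two variants are

$$\mathrm{PR}_1(v) = \frac{1-d}{N} + d \sum_{u \in B(v)} \frac{\mathrm{PR}_1(u)}{L(u)}, \qquad \mathrm{PR}_N(v) = (1-d) + d \sum_{u \in B(v)} \frac{\mathrm{PR}_N(u)}{L(u)},$$

so $\mathrm{PR}_N(v) = N \cdot \mathrm{PR}_1(v)$ at every vertex, and (absent sinks) summing the second form over all vertices at the fixed point gives $\sum_v \mathrm{PR}_N(v) = N(1-d) + d \sum_v \mathrm{PR}_N(v)$, i.e. $\sum_v \mathrm{PR}_N(v) = N$.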

Essentially, starting with the correct sum is closer to the actual fixed point and thus gives faster convergence.

The NetworkX implementation uses the variant that sums to 1; hence their initialization values are all 1/N.

igraph is unfortunately not comparable, as it uses a more complex linear-solver approach.

Additional credentials (if it matters): PhD in Mathematics with a dissertation in graph theory.

@srowen (Member) left a comment
I mostly plead ignorance, but this seems reasonable. Is the improvement mostly coming from a better magnitude of the initial weights, then?

I'm just trying to figure out whether the current implementation is also a different, decent idea, and just scaling resetProb is even better. No idea.

```diff
-    rankGraph = rankGraph.joinVertices(rankUpdates) {
-      (vid, oldRank, msgSum) =>
-        val popActivations: BV[Double] = msgSum :* (1.0 - resetProb)
+    rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
```
@srowen (Member):
Is this a related but slightly separate fix?

@aray (Contributor, Author):

It's an intertwined bug in the implementation that was introduced at the same time (when moving away from Pregel in 15a5645). The only vertices not included in the original joinVertices are source vertices (those with no incoming edges). Normally (in the absence of sinks) source vertices would have a PageRank equal to the reset probability. Since source vertices were not included in the join, their rank was fixed at their initial value, which fortunately was the correct value. When we change the initial value of all vertices to 1, this error is exposed.
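A minimal sketch of the semantic difference, using a hypothetical three-vertex chain (illustrative only, not code from the PR):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Hypothetical chain 1 -> 2 -> 3: vertex 1 is a source (no in-edges).
def demo(sc: SparkContext): Unit = {
  val graph = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 1.0)
  val resetProb = 0.15

  // Only vertices with in-edges receive messages, so vertex 1 is absent here.
  val msgs: VertexRDD[Double] =
    graph.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.srcAttr), _ + _)

  // joinVertices: vertex 1 is missing from msgs, so it silently keeps its
  // old rank forever -- the frozen-source behavior described above.
  val withJoin = graph.joinVertices(msgs) { (vid, oldRank, msgSum) =>
    resetProb + (1.0 - resetProb) * msgSum
  }

  // outerJoinVertices: vertex 1 is visited with None and re-scored to
  // resetProb, its correct rank in a sink-free graph.
  val withOuter = graph.outerJoinVertices(msgs) { (vid, oldRank, msgSumOpt) =>
    resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
  }
}
```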

```diff
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
     val resetProb = 0.15
     val errorTol = 1.0e-5

-    val staticRanks1 = starGraph.staticPageRank(numIter = 1, resetProb).vertices
-    val staticRanks2 = starGraph.staticPageRank(numIter = 2, resetProb).vertices.cache()
+    val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
```
@srowen (Member):
A few extra iterations make the test more robust?

@aray (Contributor, Author):

Not really more robust, since it has a sink and is thus still wrong pending SPARK-18847. But the extra iteration is needed with this change, to fully propagate the changed rank of the source vertices in the first iteration, as explained above.

@aray (Contributor, Author) commented Dec 15, 2016

Yes, the improvement is from the sum of magnitudes of the initial values being closer to the (known) sum of the solution: with the default resetProb = 0.15 the initial total is only 0.15N, while the fixed point's total is N. Fiddling with resetProb controls a completely different thing. The current implementation has no advantage (other than finding the incorrect solution to a star graph one iteration faster).

@ankurdave (Contributor)
Thanks @aray for the explanation. I agree with @srowen - this looks reasonable to me. I'm going to merge it.

@asfgit asfgit closed this in 78062b8 Dec 16, 2016
@ankurdave (Contributor)
Merged into master.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence

Author: Andrew Ray <ray.andrew@gmail.com>

Closes apache#16271 from aray/pagerank-initial-value.