
[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence #16271

Closed
aray wants to merge 6 commits into master from aray:pagerank-initial-value

Conversation

@aray (Contributor) commented Dec 13, 2016

What changes were proposed in this pull request?

Change the initial value in all PageRank implementations to be `1.0` instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of `joinVertices` so that source vertices get updated in each iteration.
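In GraphX terms, the change amounts to something like the following condensed sketch of the static PageRank loop (simplified; caching, personalization, and convergence checking omitted, so this is not the verbatim Spark source):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Condensed sketch of static PageRank with this PR's two changes applied.
def staticPageRank[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], numIter: Int, resetProb: Double = 0.15): Graph[Double, Double] = {
  // Weight each edge by 1 / outDegree(src), and start every rank at 1.0
  // (previously resetProb), so the total mass is already N, the fixed point's sum.
  var rankGraph: Graph[Double, Double] = graph
    .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
    .mapTriplets(e => 1.0 / e.srcAttr)
    .mapVertices((vid, attr) => 1.0)

  for (_ <- 1 to numIter) {
    val rankUpdates = rankGraph.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)
    // outerJoinVertices (previously joinVertices) so that source vertices,
    // which receive no messages, are still re-scored to resetProb each round.
    rankGraph = rankGraph.outerJoinVertices(rankUpdates) { (vid, oldRank, msgSumOpt) =>
      resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
    }
  }
  rankGraph
}
```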

This seems to have been introduced a long time ago in 15a5645#diff-b2bf3f97dcd2f19d61c921836159cda9L90

With the exception of graphs with sinks (which currently give incorrect results; see SPARK-18847), this gives faster convergence because the sum of ranks is already correct (the sum of ranks should equal the number of vertices).

Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
Code for the benchmark: https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd

How was this patch tested?

Existing unit tests (corrected) and an additional test that verifies the results against igraph and NetworkX on a loop with a source.

@SparkQA commented Dec 13, 2016

Test build #70100 has finished for PR 16271 at commit 7ea03a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 13, 2016

Test build #70101 has finished for PR 16271 at commit 33cd794.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Dec 14, 2016

Updated the above benchmark code with a log-normal random graph on 10,000 vertices; the difference is much more drastic.

(Take the very bottom of the graph with a grain of salt, as it's measured in comparison to g.pageRank(0.00001); the actual error continues to drop.)

@SparkQA commented Dec 14, 2016

Test build #70139 has finished for PR 16271 at commit 8be9a97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Dec 14, 2016

I am not sure who, if anyone, would review GraphX at this point, and I am not so familiar with the implementation here. If it converges to the same answer faster, that's good. It might be nice to understand why this initialization is better, e.g. from a paper or a similar implementation.

@rxin (Contributor) commented Dec 14, 2016

I just emailed @ankurdave and he is going to look at this tonight.

@aray (Contributor, Author) commented Dec 14, 2016

References

The PageRank paper:

We need to make an initial assignment of the ranks. This assignment can be made by one of several strategies. If it is going to iterate until convergence, in general the initial values will not affect final values, just the rate of convergence. But we can speed up convergence by choosing a good initial assignment.

Since they are more focused on updating values for one evolving graph (the internet), they don't really talk about starting from scratch. But this does emphasize that there is no change to the answers, just the rate of convergence.

A more direct statement comes from Wikipedia:

PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.

Note that there are two variants of PageRank that differ by a constant multiple in their outputs but use the same damping factor; we use the version that sums to N (most other implementations use the other). More from Wikipedia:

The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.
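For concreteness (these are the standard formulas, not quoted from the thread): with damping factor $d = 1 - \text{resetProb}$, in-neighbor set $B(v)$, and out-degree $L(u)$, the two variants are

$$\mathrm{PR}_1(v) = \frac{1-d}{N} + d \sum_{u \in B(v)} \frac{\mathrm{PR}_1(u)}{L(u)}, \qquad \mathrm{PR}_N(v) = (1-d) + d \sum_{u \in B(v)} \frac{\mathrm{PR}_N(u)}{L(u)},$$

so $\mathrm{PR}_N(v) = N \cdot \mathrm{PR}_1(v)$ at every vertex, and (absent sinks) summing the second form over all vertices at the fixed point gives $\sum_v \mathrm{PR}_N(v) = N(1-d) + d \sum_v \mathrm{PR}_N(v)$, i.e. $\sum_v \mathrm{PR}_N(v) = N$.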

Essentially, starting with the correct sum is closer to the actual fixed point and thus gives faster convergence.

The NetworkX implementation uses the variant that sums to 1; hence their initialization values are all 1/N.

igraph is unfortunately not comparable, as it uses a more complex linear-solver approach.

Additional credentials (if it matters): PhD in Mathematics with a dissertation in graph theory.

@srowen (Member) left a comment
I mostly plead ignorance, but this seems reasonable. Is the improvement mostly coming from a better magnitude of the initial weights, then?

I'm just trying to figure out whether the current implementation is also a different, decent idea, and just scaling resetProb is even better. No idea.

```diff
-    rankGraph = rankGraph.joinVertices(rankUpdates) {
-      (vid, oldRank, msgSum) =>
-        val popActivations: BV[Double] = msgSum :* (1.0 - resetProb)
+    rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
```
@srowen (Member):
Is this a related but slightly separate fix?

@aray (Contributor, Author):

It's an intertwined bug in the implementation that was introduced at the same time (when moving away from Pregel in 15a5645). The only vertices not included in the original joinVertices are source vertices (those with no incoming edges). Normally (in the absence of sinks) source vertices would have a PageRank equal to the reset probability. Since source vertices were not included in the join, their rank was fixed at their initial value, which fortunately was the correct value. When we change the initial value of all vertices to 1, this error is exposed.
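A minimal sketch of the semantic difference, using a hypothetical three-vertex chain (illustrative only, not code from the PR):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Hypothetical chain 1 -> 2 -> 3: vertex 1 is a source (no in-edges).
def demo(sc: SparkContext): Unit = {
  val graph = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 1.0)
  val resetProb = 0.15

  // Only vertices with in-edges receive messages, so vertex 1 is absent here.
  val msgs: VertexRDD[Double] =
    graph.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.srcAttr), _ + _)

  // joinVertices: vertex 1 is missing from msgs, so it silently keeps its
  // old rank forever -- the frozen-source behavior described above.
  val withJoin = graph.joinVertices(msgs) { (vid, oldRank, msgSum) =>
    resetProb + (1.0 - resetProb) * msgSum
  }

  // outerJoinVertices: vertex 1 is visited with None and re-scored to
  // resetProb, its correct rank in a sink-free graph.
  val withOuter = graph.outerJoinVertices(msgs) { (vid, oldRank, msgSumOpt) =>
    resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
  }
}
```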

```diff
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
     val resetProb = 0.15
     val errorTol = 1.0e-5

-    val staticRanks1 = starGraph.staticPageRank(numIter = 1, resetProb).vertices
-    val staticRanks2 = starGraph.staticPageRank(numIter = 2, resetProb).vertices.cache()
+    val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
```
@srowen (Member):
A few extra iterations make the test more robust?

@aray (Contributor, Author):

Not really more robust, since it has a sink and is thus still wrong pending SPARK-18847. But the extra iteration is needed with this change, to fully propagate the changed rank of the source vertices in the first iteration, as explained above.

@aray (Contributor, Author) commented Dec 15, 2016

Yes, the improvement is from the sum of magnitudes of the initial values being closer to the (known) sum of the solution: with the default resetProb = 0.15 the initial total is only 0.15N, while the fixed point's total is N. Fiddling with resetProb controls a completely different thing. The current implementation has no advantage (other than finding the incorrect solution to a star graph one iteration faster).

@ankurdave (Contributor)
Thanks @aray for the explanation. I agree with @srowen - this looks reasonable to me. I'm going to merge it.

@asfgit asfgit closed this in 78062b8 Dec 16, 2016
@ankurdave (Contributor)
Merged into master.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence

Author: Andrew Ray <ray.andrew@gmail.com>

Closes apache#16271 from aray/pagerank-initial-value.