[SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence #16271
Changes from all commits: 9149ca2, b145376, d39d2f0, 7ea03a8, 33cd794, 8be9a97
```diff
@@ -41,7 +41,7 @@ object GridPageRank
       }
     }
     // compute the pagerank
-    var pr = Array.fill(nRows * nCols)(resetProb)
+    var pr = Array.fill(nRows * nCols)(1.0)
     for (iter <- 0 until nIter) {
       val oldPr = pr
       pr = new Array[Double](nRows * nCols)
```
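This hunk changes the reference implementation the tests compare against: every vertex now starts with rank 1.0 instead of `resetProb`, so the initial rank mass sums to the number of vertices, matching GraphX's un-normalized PageRank convention. Below is a minimal sketch of such a power iteration (the names `pageRankSketch`, `n`, and `outEdges` are illustrative, not the GridPageRank code itself):

```scala
// Minimal power-iteration sketch illustrating the fixed initialization.
def pageRankSketch(n: Int, outEdges: Map[Int, Seq[Int]],
                   resetProb: Double, nIter: Int): Array[Double] = {
  var pr = Array.fill(n)(1.0) // was Array.fill(n)(resetProb): too little initial mass
  for (_ <- 0 until nIter) {
    val next = Array.fill(n)(resetProb) // every vertex keeps the reset mass
    for ((src, dsts) <- outEdges; dst <- dsts) {
      // each vertex spreads (1 - resetProb) of its rank over its out-edges
      next(dst) += (1.0 - resetProb) * pr(src) / dsts.size
    }
    pr = next
  }
  pr
}
```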
```diff
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext
     val resetProb = 0.15
     val errorTol = 1.0e-5

-    val staticRanks1 = starGraph.staticPageRank(numIter = 1, resetProb).vertices
-    val staticRanks2 = starGraph.staticPageRank(numIter = 2, resetProb).vertices.cache()
+    val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
+    val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache()

-    // Static PageRank should only take 2 iterations to converge
+    // Static PageRank should only take 3 iterations to converge
     val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
       if (pr1 != pr2) 1 else 0
     }.map { case (vid, test) => test }.sum()
```

Reviewer: A few extra iterations makes the test more robust?

Author: Not really more robust: the star graph has a sink, so the result is still wrong pending SPARK-18847. The extra iteration is needed, though, because with the new initial value the first-iteration change in the rank of source vertices must be fully propagated, as explained above.
```diff
@@ -203,4 +203,30 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext
       assert(compareRanks(staticRanks, parallelStaticRanks) < errorTol)
     }
   }
+
+  test("Loop with source PageRank") {
+    withSpark { sc =>
+      val edges = sc.parallelize((1L, 2L) :: (2L, 3L) :: (3L, 4L) :: (4L, 2L) :: Nil)
+      val g = Graph.fromEdgeTuples(edges, 1)
+      val resetProb = 0.15
+      val tol = 0.0001
+      val numIter = 50
+      val errorTol = 1.0e-5
+
+      val staticRanks = g.staticPageRank(numIter, resetProb).vertices
+      val dynamicRanks = g.pageRank(tol, resetProb).vertices
+      assert(compareRanks(staticRanks, dynamicRanks) < errorTol)
+
+      // Computed in igraph 1.0 w/ R bindings:
+      // > page_rank(graph_from_literal( A -+ B -+ C -+ D -+ B))
+      // Alternatively, in NetworkX 1.11:
+      // > nx.pagerank(nx.DiGraph([(1,2),(2,3),(3,4),(4,2)]))
+      // We multiply by the number of vertices to account for the difference in normalization.
+      val igraphPR = Seq(0.0375000, 0.3326045, 0.3202138, 0.3096817).map(_ * 4)
+      val ranks = VertexRDD(sc.parallelize(1L to 4L zip igraphPR))
+      assert(compareRanks(staticRanks, ranks) < errorTol)
+      assert(compareRanks(dynamicRanks, ranks) < errorTol)
+    }
+  }
 }
```
Reviewer: Is this a related but slightly separate fix?

Author: It's an intertwined bug in the implementation, introduced at the same time (when moving away from Pregel in 15a5645). The only vertices not included in the original `joinVertices` are source vertices (those with no incoming edges). Normally (in the absence of sinks), a source vertex's PageRank equals the reset probability. Since source vertices were not included in the join, their rank stayed fixed at the initial value, which fortunately was the correct value. Changing the initial value of all vertices to 1.0 exposes this error.
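Below is a self-contained sketch of the pitfall described above (the three-vertex chain, the local SparkContext setup, and the simplified update rule are illustrative assumptions, not Spark's PageRank code): vertices with no in-edges never appear in the aggregated messages, so `joinVertices` leaves their attribute at whatever it was initialized to.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object JoinVerticesPitfall {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("pitfall"))
    val resetProb = 0.15
    // Chain 1 -> 2 -> 3: vertex 1 is a source (no incoming edges).
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))
    val graph = Graph.fromEdges(edges, defaultValue = 1.0) // initial rank 1.0

    // Sum of incoming ranks; only vertices 2 and 3 receive anything.
    val msgs = graph.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.srcAttr), _ + _)

    // joinVertices only updates vertices present in `msgs`: vertex 1 keeps
    // its initial attribute 1.0 instead of being reset to resetProb.
    val ranks = graph.joinVertices(msgs) {
      (_, _, msgSum) => resetProb + (1.0 - resetProb) * msgSum
    }
    ranks.vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

With the old initialization the stranded value happened to equal `resetProb`, which is why the bug stayed hidden until the initial value was changed to 1.0.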