[SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks #16483

aray · 2017-01-06T06:48:16Z

What changes were proposed in this pull request?

Graphs with sinks (vertices with no outgoing edges) don't have the expected rank sum of n (or 1 for personalized). We fix this by normalizing to the expected sum at the end of each implementation.
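
To make the sink problem concrete, here is a minimal plain-Scala sketch. It does not use Spark or GraphX, and the object name, the three-vertex graph, and the helper names are illustrative assumptions rather than code from this patch. It applies the unnormalized update `rank = resetProb + (1 - resetProb) * incoming` and then rescales the result so the ranks sum to n, which is the kind of final normalization described above.

```scala
object SinkRankSketch {
  // Hypothetical example graph: directed edges as (src, dst).
  // Vertex 2 has no outgoing edges, so it is a sink.
  val edges: Seq[(Int, Int)] = Seq((0, 1), (0, 2), (1, 2))
  val numVertices = 3
  val resetProb = 0.15

  // One unnormalized update. Each vertex splits its rank evenly over its
  // out-edges; a sink sends nothing, so its share of the rank is lost.
  def step(ranks: Vector[Double]): Vector[Double] = {
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    val incoming = Array.fill(numVertices)(0.0)
    for ((src, dst) <- edges) incoming(dst) += ranks(src) / outDegree(src)
    incoming.map(m => resetProb + (1.0 - resetProb) * m).toVector
  }

  // Start every vertex at rank 1.0 and apply `step` numIter times.
  def iterate(numIter: Int): Vector[Double] =
    (1 to numIter).foldLeft(Vector.fill(numVertices)(1.0))((r, _) => step(r))

  // Rescale the ranks so that they sum to `target` (n, or 1 if personalized).
  def normalize(ranks: Vector[Double], target: Double): Vector[Double] = {
    val correction = target / ranks.sum
    ranks.map(_ * correction)
  }

  def main(args: Array[String]): Unit = {
    val raw = iterate(20)
    val fixed = normalize(raw, numVertices.toDouble)
    println(f"raw sum = ${raw.sum}%.4f, normalized sum = ${fixed.sum}%.4f")
  }
}
```

On this graph the raw ranks sum to less than 3 because vertex 2 absorbs rank without passing it on; after `normalize` the sum is 3 again while the relative ordering of the vertices is unchanged.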

Additionally, this fixes the dynamic version of personalized PageRank, which gave incorrect answers that the existing unit tests did not detect.

How was this patch tested?

Revamped the existing unit tests and added new ones, with reference values (and reproduction code) from igraph and NetworkX.

Note that for comparison on personalized PageRank we use the arpack algorithm in igraph, because prpack (the current default) redistributes rank to all vertices uniformly instead of just to the personalization source. We could adopt that alternate convention (redistribute rank to all vertices uniformly), but it would involve more extensive changes to the algorithms (the dynamic version would no longer be able to use Pregel).
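
The two conventions can be contrasted in a small plain-Scala sketch. This is an illustration under assumptions, not the GraphX implementation (which, per the description above, normalizes at the end instead): the object name, the three-vertex chain graph, and the `sinkToSourceOnly` flag are all hypothetical. Each step hands the rank held by sinks either to the personalization source only or to all vertices uniformly.

```scala
object PersonalizedSinkSketch {
  // Hypothetical chain graph 0 -> 1 -> 2; vertex 2 is a sink.
  val edges: Seq[(Int, Int)] = Seq((0, 1), (1, 2))
  val numVertices = 3
  val resetProb = 0.15
  val source = 0 // personalization source

  // One personalized update with explicit redistribution of sink rank.
  def step(ranks: Vector[Double], sinkToSourceOnly: Boolean): Vector[Double] = {
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    val incoming = Array.fill(numVertices)(0.0)
    for ((src, dst) <- edges) incoming(dst) += ranks(src) / outDegree(src)
    // Total rank currently held by vertices with no outgoing edges.
    val sinkMass = (0 until numVertices).filterNot(outDegree.contains).map(ranks).sum
    Vector.tabulate(numVertices) { v =>
      val teleport = if (v == source) 1.0 else 0.0
      // Convention switch: all sink rank to the source, or an equal share to everyone.
      val sinkShare = if (sinkToSourceOnly) teleport else 1.0 / numVertices
      resetProb * teleport + (1.0 - resetProb) * (incoming(v) + sinkMass * sinkShare)
    }
  }

  // Start with all rank on the source and apply `step` numIter times.
  def run(numIter: Int, sinkToSourceOnly: Boolean): Vector[Double] = {
    val start = Vector.tabulate(numVertices)(v => if (v == source) 1.0 else 0.0)
    (1 to numIter).foldLeft(start)((r, _) => step(r, sinkToSourceOnly))
  }
}
```

Both variants keep the rank sum at 1, but they assign different ranks to individual vertices, which is why the reference values have to come from an implementation that uses the same convention.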

SparkQA · 2017-01-06T07:16:59Z

Test build #70969 has finished for PR 16483 at commit 41178a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

aray · 2017-01-06T14:41:07Z

ping @srowen @ankurdave can you take a look at this?

aray · 2017-01-17T17:16:02Z

@rxin can you take a look?

rxin · 2017-01-17T18:51:13Z

cc @ankurdave

aray · 2017-03-16T14:25:40Z

@rxin can anyone else review this? It would be nice to get this correctness fix into 2.2.

thunterdb · 2017-03-16T20:51:09Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

@@ -162,7 +162,15 @@ object PageRank extends Logging {
      iteration += 1
    }

-    rankGraph
+    // If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct


put the name of the ticket as well

thunterdb

I have a few small comments.

thunterdb · 2017-03-16T20:57:04Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

      vp, sendMessage, messageCombiner)
      .mapVertices((vid, attr) => attr._1)
-  } // end of deltaPageRank
+
+    // If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct


This is the same code as above, please factor it into a function.

thunterdb · 2017-03-16T21:09:25Z

graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala


-      // Static PageRank should only take 3 iterations to converge
-      val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
+      // Static PageRank should only take 2 iterations to converge


Why does it take only two iterations to converge now?

It didn't change; we're still comparing the output of the 2nd and 3rd iterations.

thunterdb · 2017-03-16T21:26:11Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

-      val newPR = teleport + (1.0 - resetProb) * msgSum
-      val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
-      (newPR, newDelta)
+      val newPR = if (lastDelta == Double.NegativeInfinity) {


My memory of the algorithm is a bit rusty. Why don't you need to check for self-loops here anymore?

I'm guessing you mean the if (src==id) check? I'm honestly not sure what was going on with this code; it's just wrong. The results do not match igraph/NetworkX at all. Furthermore, the code is nonsensical: it defines var teleport = oldPR, which is then unconditionally set two lines down to teleport = oldPR*delta without being used in between.

This revised implementation is much easier to follow and is now tested against 3 sets of reference values computed by igraph/NetworkX. Please let me know if you think I'm missing something.

I agree that the new code is easier to follow in that respect.

thunterdb · 2017-03-16T21:30:03Z

In addition, this introduces an extra reduction step at each iteration. I am fine with that since it is for correctness, but @jkbradley may want to comment as well.

SparkQA · 2017-03-16T22:54:37Z

Test build #74693 has finished for PR 16483 at commit ac5d0ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

aray · 2017-03-16T23:59:50Z

@thunterdb The extra step, as implemented, runs only once at the end, since that gives the same result as doing it after every iteration but without the extra overhead.

thunterdb · 2017-03-17T21:12:14Z

It looks good to me.

cc @jkbradley or @mengxr for final approval

rxin · 2017-03-17T21:22:49Z

Merging in master. Thanks!

page rank sink fixes and unit tests

41178a3

thunterdb reviewed Mar 16, 2017

aray added 2 commits March 16, 2017 16:41

Merge branch 'master' of https://github.com/apache/spark into pageran…

9beca7d

…k-sink2

@thunterdb review items

ac5d0ce

asfgit closed this in bfdeea5 Mar 17, 2017

felixcheung mentioned this pull request Aug 1, 2017

Add support for Spark 2.2.0 graphframes/graphframes#223

Merged

felixcheung mentioned this pull request Sep 15, 2017

Personalized PageRank broken in Spark 2.2+ graphframes/graphframes#235

Open

srowen mentioned this pull request Jan 10, 2018

Update PageRank.scala #20220

Closed
