Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks #16483

Closed
wants to merge 3 commits into from

Conversation

aray
Copy link
Contributor

@aray aray commented Jan 6, 2017

What changes were proposed in this pull request?

Graphs with sinks (vertices with no outgoing edges) don't have the expected rank sum of n (or 1 for personalized). We fix this by normalizing to the expected sum at the end of each implementation.

Additionally this fixes the dynamic version of personal pagerank which gave incorrect answers that were not detected by existing unit tests.

How was this patch tested?

Revamped existing and additional unit tests with reference values (and reproduction code) from igraph and NetworkX.

Note that for comparison on personal pagerank we use the arpack algorithm in igraph as prpack (the current default) redistributes rank to all vertices uniformly instead of just to the personalization source. We could take the alternate convention (redistribute rank to all vertices uniformly) but that would involve more extensive changes to the algorithms (the dynamic version would no longer be able to use Pregel).

@SparkQA
Copy link

SparkQA commented Jan 6, 2017

Test build #70969 has finished for PR 16483 at commit 41178a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Copy link
Contributor Author

aray commented Jan 6, 2017

ping @srowen @ankurdave can you take a look at this?

@aray
Copy link
Contributor Author

aray commented Jan 17, 2017

@rxin can you take a look?

@rxin
Copy link
Contributor

rxin commented Jan 17, 2017

cc @ankurdave

@aray
Copy link
Contributor Author

aray commented Mar 16, 2017

@rxin can anyone else review this? It would be nice to get this correctness fix into 2.2.

@@ -162,7 +162,15 @@ object PageRank extends Logging {
iteration += 1
}

rankGraph
// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the name of the ticket as well

Copy link
Contributor

@thunterdb thunterdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few small comments.

vp, sendMessage, messageCombiner)
.mapVertices((vid, attr) => attr._1)
} // end of deltaPageRank

// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same code as above, please factor it into a function.


// Static PageRank should only take 3 iterations to converge
val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
// Static PageRank should only take 2 iterations to converge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it take only two iterations to converge now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't change, were still comparing the output of the 2nd and 3rd iteration.

val newPR = teleport + (1.0 - resetProb) * msgSum
val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
(newPR, newDelta)
val newPR = if (lastDelta == Double.NegativeInfinity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My memory of the algorithm is a bit rusty. Why don't you need to check for self-loops here anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you mean the if (src==id) check? I'm honestly not sure what was going on with this code its just wrong. The results do not match up with igraph/networkx at all. Furthermore the code is just nonsensical -- definition of var teleport = oldPR that is then unconditionally set two lines down to teleport = oldPR*delta without being used prior.

This revised implementation is much easier to follow and is now tested against 3 sets of reference values computed by igraph/networkx. Please let me know if you thing I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the new code is easier to follow in that respect.

@thunterdb
Copy link
Contributor

In addition, this introduces an extra step reduction at each iteration. I am fine with that since it is for correctness, but @jkbradley may want to comment as well.

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74693 has finished for PR 16483 at commit ac5d0ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Copy link
Contributor Author

aray commented Mar 16, 2017

@thunterdb The extra step -- as implemented -- is only at the end as that gives the same result as doing it after every iteration but without the extra overhead.

@thunterdb
Copy link
Contributor

It looks good to me.

cc @jkbradley or @mengxr for final approval

@rxin
Copy link
Contributor

rxin commented Mar 17, 2017

Merging in master. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants