Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in Page Rank Computation in PageRank.scala #2100

Closed
wants to merge 2 commits into from
Closed

Error in Page Rank Computation in PageRank.scala #2100

wants to merge 2 commits into from

Conversation

pfontana3w2
Copy link

I saw something strange in the Page Rank computation for runUntilConverge() in PageRank.scala. It uses the oldPR instead of the resetProb. Note that the run() Method in PageRank.scala uses resetProb as my correction does here (see Lines 95–96 of PageRank.scala).

Here is the diff that I see (in case it is hidden later):

-      val newPR = oldPR + (1.0 - resetProb) * msgSum
+      // Equation: resetProb * (1-resetProb)*msgSum
+      val newPR = resetProb + (1.0 - resetProb) * msgSum

This might not be the right correction, but I brought a pull request to see if I have found an error or not. If it is not correct, just deny the pull request.

Best Wishes

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@pfontana3w2
Copy link
Author

Do take a look. I noticed that with this patch, dangling nodes no longer have page ranks equal to their reset probabilities, so there may not be an error, or my solution may not be the right one. Here is a small test that I did.

Edge Table:

Node1 Node2
1 2
1 3
3 2
3 4
5 3
6 7
7 8
8 9
9 7

Node Table:

NodeID NodeName
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
j.longaddress.com 10

Page Ranks Before Patch:

(4,0.29503124999999997)
(1,0.15)
(6,0.15)
(3,0.34124999999999994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.35878124999999994)

Page Ranks After Patch:

(4,0.2488125)
(1,0.3)
(6,0.3)
(3,0.5325)
(7,0.23925000000000002)
(9,0.19335000000000002)
(8,0.45600000000000007)
(10,0.3)
(5,0.3)
(2,0.2488125)

Note that node 10 is not in the edge table

@ankurdave
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Aug 23, 2014

QA tests have started for PR 2100 at commit 18eb231.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2100 at commit 18eb231.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pfontana3w2
Copy link
Author

It looks like I may be mistaken (since the unit tests failed), so I am closing this pull request.

@ankurdave
Copy link
Contributor

To answer your concern, the runUntilConvergence version of PageRank uses delta messages, where the msgSum operates on the delta graph where the rank of each vertex is the change in PageRank from one iteration to the next. You can see that on lines 144 and 149: the sendMessage function uses edge.srcAttr._2 as the source rank, which was set to newPR - oldPR.

As a result, we have to add in oldPR at each step to obtain the PageRank of the input graph rather than the delta graph.

@pfontana3w2
Copy link
Author

Thank you for taking the time to explain the code. I did another test and noticed that the two methods seem to give different outputs for a test case. See Issue: https://issues.apache.org/jira/browse/SPARK-3206?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants