[SPARK-27295][GraphX] Provision to provide the initial scores for source nodes while running Personalized Page Rank #24230

EshwarSR · 2019-03-27T17:49:49Z

What changes were proposed in this pull request?

The current implementation of the parallel personalized PageRank algorithm accepts only node IDs as the starting (source) nodes, and it assigns an initial value of 1.0 to each of those source nodes.

However, a user might also want to specify the initial value for each source node.

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

EshwarSR · 2019-03-28T06:05:51Z

I have added a new API with the same name. I tried avoiding the private method by simply calling the new API from the old API method. Is that okay?

srowen · 2019-03-28T14:17:48Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

@@ -196,11 +196,45 @@ object PageRank extends Logging {
    require(sources.nonEmpty, s"The list of sources must be non-empty," +


You can remove these checks if they are now checked in the other method.
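The delegation discussed above can be sketched roughly as follows: the existing ids-only entry point forwards to the new score-aware one, so the argument checks live in a single place. This is a simplified, Spark-free illustration and not the PR's code; `initialScores`, its return type, and the object name are hypothetical stand-ins for the real `runParallelPersonalizedPageRank` overloads.

```scala
object DelegationSketch {
  type VertexId = Long // GraphX also aliases vertex ids to Long

  // Hypothetical new variant: every source carries its own initial score.
  // All argument validation lives here.
  def initialScores(sources: Array[(VertexId, Double)]): Map[VertexId, Double] = {
    require(sources.nonEmpty, "The list of sources must be non-empty")
    sources.toMap
  }

  // Existing-style variant: only ids are given, so each source defaults to 1.0.
  // No checks are repeated here; the call below performs them.
  def initialScores(sources: Array[VertexId]): Map[VertexId, Double] =
    initialScores(sources.map((_, 1.0)))
}
```

Because the two parameter types erase to different JVM array types, the overloads can share one name, which is what lets the old call sites keep compiling unchanged.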

srowen · 2019-03-28T14:18:40Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

@@ -196,11 +196,45 @@ object PageRank extends Logging {
    require(sources.nonEmpty, s"The list of sources must be non-empty," +
      s" but got ${sources.mkString("[", ",", "]")}")

+    val sourcesWithScores = sources zip Array.fill(sources.size)(1.0)


Let's use explicit '.' notation.
But this can be sources.map((_, 1.0)) instead.
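As a quick check of the suggestion above, the infix zip, the dotted zip, and the map form all produce the same pairs. This is a minimal, Spark-free sketch; the sample ids are made up for illustration.

```scala
object ZipVersusMap {
  type VertexId = Long

  val sources: Array[VertexId] = Array(1L, 4L, 7L) // made-up sample ids

  // Original form from the diff: infix notation plus a throwaway array of ones.
  val viaInfixZip: Array[(VertexId, Double)] = sources zip Array.fill(sources.size)(1.0)

  // Same expression written with explicit '.' notation.
  val viaDotZip: Array[(VertexId, Double)] = sources.zip(Array.fill(sources.size)(1.0))

  // Suggested form: pair each id with 1.0 directly, no second array needed.
  val viaMap: Array[(VertexId, Double)] = sources.map((_, 1.0))
}
```

The map form avoids allocating the intermediate array and reads as a single step.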

srowen · 2019-03-28T14:19:40Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

    val zero = Vectors.sparse(sources.size, List()).asBreeze
-    // map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0
+    // map of vid -> vector where for each vid, the _position of vid in source_ is set to provided score
    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>


While you're here, replace vid with (vid, score) so that you can avoid ._1 syntax
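The destructuring suggested above might look like the sketch below. To keep it runnable without Spark, a dense `Array[Double]` stands in for the sparse Breeze vector used in the actual diff, and the object and method names are hypothetical.

```scala
object SourcesInitMapSketch {
  type VertexId = Long

  // For each (vid, score) pair, build a vector that holds `score` at the
  // position of that source in the list and 0.0 everywhere else.
  def sourcesInitMap(sources: Array[(VertexId, Double)]): Map[VertexId, Array[Double]] =
    sources.zipWithIndex.map { case ((vid, score), i) =>
      val vec = Array.fill(sources.length)(0.0)
      vec(i) = score
      vid -> vec
    }.toMap
}
```

Matching on `((vid, score), i)` names both tuple fields up front, so the body never needs `._1` or `._2`.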

srowen

One last thing: can you add a test for the new method? You can just follow any existing tests for the existing method.

srowen · 2019-03-28T17:23:24Z

@EshwarSR Ah, OK, we have a little more work here. This method is also, or 'really', exposed by GraphOps.staticParallelPersonalizedPageRank. And that is tested by PageRankSuite. I think we need the same sort of new method in GraphOps, just one that calls the new variation on the method you added. Then see about refactoring the existing tests to add in one simple new test of the new functionality.

srowen · 2019-04-01T14:04:07Z

@EshwarSR if you can make a few more changes here per the last comment, I think this can be merged.

EshwarSR · 2019-04-02T05:38:45Z

Hi @srowen, I got held up with other work. Will get it done ASAP. Thanks!

EshwarSR · 2019-04-03T18:23:40Z

Hi @srowen, I've made the changes mentioned in your comment.
Hi @shahidki31, I have added @since 3.0.0 to the documentation too.

srowen · 2019-04-03T20:05:12Z

graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

+   *         indexed by the position of nodes in the sources list) and
+   *         edge attributes the normalized edge weight
+   *
+   * @since 3.0.0


Oh, this has to be an annotation like @Since("3.0.0") on the method, not within the scaladoc

EshwarSR · 2019-04-04

I got confused after seeing this: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
I've fixed it now.
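For illustration, the difference between the scaladoc tag and the annotation can be sketched as below. The `Since` class here is a local stand-in so the snippet compiles on its own, and `withDefaultScores` is a hypothetical helper, not a method from the PR.

```scala
import scala.annotation.StaticAnnotation

// Local stand-in so this sketch compiles without Spark on the classpath;
// code inside Spark would import org.apache.spark.annotation.Since instead.
class Since(version: String) extends StaticAnnotation

object SinceSketch {
  /**
   * Pairs every source id with a default score of 1.0.
   * (Hypothetical helper, used only to show where the annotation goes.)
   */
  @Since("3.0.0")
  def withDefaultScores(sources: Array[Long]): Array[(Long, Double)] =
    sources.map((_, 1.0))
}
```

The version marker sits on the method itself, outside the doc comment, rather than as an `@since` line inside it.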

srowen · 2019-04-03T20:05:45Z

graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala

@@ -115,7 +115,7 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
      assert(compareRanks(staticRanks, dynamicRanks) < errorTol)

      val parallelStaticRanks = starGraph
-        .staticParallelPersonalizedPageRank(Array(0), numIter, resetProb).mapVertices {
+        .staticParallelPersonalizedPageRank(Array((0, 1.0)), numIter, resetProb).mapVertices {


Rather than modify the existing tests, is it possible to create a new test that uses initial values that aren't 1? That would help verify the behavior is correct.

EshwarSR · 2019-04-04

I thought logically it would be the same. So should I still create a separate test for it?
If yes, I need a little guidance here.

srowen · 2019-04-04

I suppose you aren't testing that the implementation correctly passes the initial scores and uses them, nor testing that the original method, which makes 1 the default, still works (unless some tests still cover this). There's no reason to make elaborate tests, but is there any simple test case you can copy/paste that shows the result is different and basically correct with initial scores that aren't 1?

srowen · 2019-04-08

@EshwarSR if you can add any minimal test of initial scores that aren't 1, and probably leave the existing tests, this can be merged.

EshwarSR · 2019-04-08

Hi @srowen, I just tried comparing the scores between networkx and our new implementation. The scores are not aligned, so I was debugging to see whether there is an issue with the code. Do you see any place where there is an error?

srowen · 2019-04-08

I don't know this code much at all. Yeah, that's the kind of thing that's important to test: do scores that aren't 1 work as intended?

EshwarSR · 2019-04-22

I tried running the old implementation with 1.0 hardcoded as the initial scores. Even the scores from that and networkx don't match. I'm confused about what is going wrong. I guess I need to look at the networkx implementation to understand the exact difference between the implementations.

AmplabJenkins · 2019-11-21T01:08:07Z

Can one of the admins verify this patch?

github-actions · 2020-03-01T00:13:07Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Provision to provide the initial scores for source nodes while runnin…

afb74cd

…g Personalized Page Rank - SPARK-27295

srowen reviewed Mar 27, 2019

Adding a new API so that the existing API doesnt brake

b6847c4

Changed the comment

03b0f5a

srowen requested changes Mar 28, 2019

EshwarSR added 2 commits March 28, 2019 22:07

Removed unecessary checks. Used map instead of zip. Better naming in …

7341fb3

…case statement

Fixing style issue

0b55d9b

srowen reviewed Mar 28, 2019

shahidki31 reviewed Apr 2, 2019

EshwarSR added 2 commits April 3, 2019 23:48

Added staticParallelPersonalizedPageRank for the new variant of runPa…

08ffef4

…rallelPersonalizedPageRank

Modified the test cases to test new functionality

0fda5ff

Added since in documentation

b845130

EshwarSR force-pushed the SPARK-27295 branch from a664166 to b845130 on April 3, 2019 18:29

srowen requested changes Apr 3, 2019

EshwarSR added 2 commits April 4, 2019 09:38

Changed to annotation

8cafe04

Added annotation

463457e

dongjoon-hyun added the GRAPHX label Jun 14, 2019

github-actions bot added the Stale label Mar 1, 2020

github-actions bot closed this Mar 2, 2020
