Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12363][Mllib] Remove setRun and fix PowerIterationClustering failed test #10539

Closed
wants to merge 6 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Dec 31, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by @yanboliang. When setRuns is removed from PowerIterationClustering, one of the tests will be failed. I found that some dstAttrs of the normalized graph are not correct values but 0.0. By setting TripletFields.All in mapTriplets it can work.

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48543 has finished for PR 10539 at commit 4c7623f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

@viirya Thanks for you patch ! Should we also to set TripletFields.All at here?

@viirya
Copy link
Member Author

viirya commented Dec 31, 2015

@yanboliang Because that test doesn't fail so this commit doesn't set TripletFields.All there. However, I tried it and it also passes the test. Do you think we should set it also?

@yanboliang
Copy link
Contributor

@viirya I'm not very familiar with GraphX and I just found they are do the same thing for different interface. Maybe we can hear others' opinions.

@jkbradley
Copy link
Member

@ankurdave Would you mind checking out this issue? It's really weird to me. Here's what's happening:

  • Background: PowerIterationClustering (PIC) initializes a Graph[Double, Double] which represents a (symmetric) similarity matrix. It then runs power iteration on the graph (i.e., on the matrix) to get an approximate first eigenvector. It then clusters the vertices (based on their 1 Double value in the eigenvector) using KMeans.
  • On the surface: This PR changes KMeans in PIC to run 1 time instead of 5 times. That changes the results of PIC.
  • Test failure: There are 2 comparable tests in PowerIterationClusteringSuite: test("power iteration clustering") and test("power iteration clustering on graph"). The second one is failing, even though it seems to begin with exactly the same graph and values.
  • If you look more closely, it seems to be an issue with partitioning:
    • Same initialization: One test gives PIC a list of similarities (which PIC converts to a graph), and the other gives PIC the graph. However, if you print the graphs PIC begins with, they are exactly the same, except for the partitioning.
    • The PIC runs in the 2 tests diverge on the first iteration.
    • The 2 "fixes" I've seen are:
      • Changing TripletFields.Src to TripletFields.All somehow fixes the problem, even though only the edge and src attributes are used. (done in this PR)
      • Changing the number of partitions from 2 to 1 also fixes the problem. (tested locally)

Is it possible that GraphX is not shipping the correct data between partitions? Thanks a lot in advance for your help!

@jkbradley
Copy link
Member

@ankurdave Maybe I should ask you: What does Graph.fromEdges require of the given edge RDD? Is it just that identical edges need to be in the same partition? And does "identical" respect directionality? Or should we explicitly repartition using one of GraphX's strategies?

@ankurdave
Copy link
Contributor

@jkbradley I'm not completely sure of the cause, but one thing I noticed was that PowerIterationClustering calls GraphImpl.fromExistingRDDs in several places without preparing the vertices properly. That might be resulting in a corrupted routing table and causing confusion here.

Each VertexRDD contains routing tables corresponding to a particular EdgeRDD, and GraphImpl.fromExistingRDDs assumes the routing tables match the provided EdgeRDD. If that's not true, the returned graph will give incorrect results on triplet-based operations like mapTriplets and aggregateMessages.

I fixed this in PowerIterationClustering by replacing calls to GraphImpl.fromExistingRDDs(vertices, edges) with Graph(vertices, edges) and the two tests now fail in identical ways.

@jkbradley
Copy link
Member

@ankurdave Thanks a lot for taking a look!

@viirya Would you mind updating this PR according to @ankurdave 's suggestion? After that, we can look into what is causing the failure and create a fix for the test. Thanks!

@viirya
Copy link
Member Author

viirya commented Jan 12, 2016

@jkbradley Sure. I will go to update this.

@viirya
Copy link
Member Author

viirya commented Jan 12, 2016

@ankurdave @jkbradley I tried to replace calls to GraphImpl.fromExistingRDDs(vertices, edges) with Graph(vertices, edges) as @ankurdave suggested. And the two tests fail.

I did some experiments and it is interesting to me that the following has different results:

val normalized = GraphImpl.fromExistingRDDs(vD, gA.edges)
  .mapTriplets(
    e => e.attr / math.max(e.srcAttr, MLUtils.EPSILON),
    TripletFields.Src)

val normalized2 = Graph(normalized.vertices, normalized.edges)

If I return normalized from the method normalize, the first test succeeds. But if I return normalized2, the test fails. Because I create new graph with the vertices and edges from the same graph normalized, I suppose that normalized2 should be as same as normalized. Routing tables should not be the problem here because the edges and vertices are from the same graph, so the routing table should be correct.

So I am interested in why normalized succeeds and normalized2 fails.

@jkbradley
Copy link
Member

@viirya About your above comments:

Routing tables should not be the problem here because the edges and vertices are from the same graph, so the routing table should be correct.

I don't think that's true because normalized2 is created using Graph.apply, which will recompute routing tables.

I'd recommend using Graph.apply, and then exploring what PIC is doing in those 2 tests (e.g., by printing out the vertex and edge attributes on each iteration, and seeing how values are propagated around the graph).

@mengxr
Copy link
Contributor

mengxr commented Feb 12, 2016

@viirya I sent you an update at viirya#2. Could you review and merge it if it looks good to you? That should fix the PIC implementation. Thanks!

use Graph instead of GraphImpl and update tests/example based on PIC paper
@viirya
Copy link
Member Author

viirya commented Feb 12, 2016

@mengxr The update looks good, I've merged it. Thanks. As you modified the test cases in PowerIterationClusteringSuite, a minor question is the original test cases in the suite is not working even with using Graph.apply to replace GraphImpl.fromExistingRDDs?

@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51186 has finished for PR 10539 at commit d749f6d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51189 has finished for PR 10539 at commit 2860025.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Feb 12, 2016

The implementation was wrong: 1) As @ankurdave mentioned, GraphImpl.fromExistingRDDs might result a wrong routing table, 2) TripletFields were not properly set in mapTriplets. So I don't think the previous tests were correct.

Sorry for Scala style issues and failed Python tests! Could you help fix them? Thanks!

@SparkQA
Copy link

SparkQA commented Feb 13, 2016

Test build #51237 has finished for PR 10539 at commit 926fa56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 13, 2016

Test build #51240 has finished for PR 10539 at commit 926fa56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 13, 2016
…ailed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor

mengxr commented Feb 13, 2016

LGTM. Merged into master, branch-1.6, and branch-1.5. Thanks!

@asfgit asfgit closed this in e3441e3 Feb 13, 2016
asfgit pushed a commit that referenced this pull request Feb 13, 2016
…ailed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor

mengxr commented Feb 14, 2016

@viirya If it doesn't take you much time, could you prepare backports for branch-1.4 and branch-1.3? I guess we only need to remove the Python changes. This is not high-priority since we don't have release plan for 1.4.x and 1.3.x.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2016

@mengxr ok. I will do the backports.

@viirya viirya deleted the fix-poweriter branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants