[SPARK-2429] [MLlib] Hierarchical Implementation of KMeans #2906

Closed
wants to merge 28 commits into from

yu-iskw
Contributor

@yu-iskw yu-iskw commented Oct 23, 2014

I want to add a divisive hierarchical clustering algorithm implementation to MLlib. It does not yet support distance metrics other than Euclidean distance; it would be nice to add those in a separate issue.
Could you review it?

Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

val treeRoot = this.clusterTree
val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
data.sparkContext.broadcast(closestClusterIndexFinder)
val predicted = data.map(point => (closestClusterIndexFinder(point), point))
Contributor

I don't think you're using the broadcast variable correctly:

http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
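
For context, the pattern the programming guide describes is to keep the handle returned by `sc.broadcast(...)` and read it through `.value` inside the closure; calling `broadcast` and discarding the result has no effect on what the closure captures. The following is a minimal sketch, not the PR's code: `BroadcastSketch`, `euclidean`, `closestIndex`, the sample data, and the `local[2]` master are all assumptions made for illustration, and it needs Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Euclidean distance between two dense points (illustrative helper, not from the PR).
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the center closest to `point` (illustrative helper, not from the PR).
  def closestIndex(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => euclidean(centers(i), point))

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
      val data = sc.parallelize(Seq(Array(1.0, 1.0), Array(9.0, 8.0), Array(-1.0, 0.5)))

      // Keep the handle returned by broadcast(); discarding it changes nothing.
      val bcCenters = sc.broadcast(centers)

      // Inside the closure, read the broadcast data through the handle's value.
      val predicted = data.map(p => (closestIndex(bcCenters.value, p), p))

      predicted.collect().foreach { case (i, p) => println(s"$i <- ${p.mkString(",")}") }
    } finally {
      sc.stop()
    }
  }
}
```

The helpers live on a top-level object so the closure passed to `map` only captures the broadcast handle.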

Contributor Author

I changed how the broadcast variable is used:
yu-iskw@290d492

@srowen
Member

srowen commented Oct 23, 2014

I just gave this a quick read-through, and the structure makes sense. I left several small comments. I see the chunks of logic I would expect, but did not evaluate it in detail. The existence of some tests suggests this probably basically works :) I am wondering about performance too as this relies on Scala idioms in many places; it might be worth a quick look with jprofiler if you can to see if there are any easy-win optimizations.

@mengxr
Contributor

mengxr commented Oct 24, 2014

Jenkins, add to whitelist.

@mengxr
Contributor

mengxr commented Oct 24, 2014

ok to test

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22177 has started for PR 2906 at commit 91a38e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22179 has started for PR 2906 at commit 91a38e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22177 has finished for PR 2906 at commit 91a38e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaHierarchicalClustering
    • class HierarchicalClusteringConf(
    • class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    • class ClusterTree(
    • class ClusteringModel(object):
    • class KMeansModel(ClusteringModel):
    • class HierarchicalClusteringModel(ClusteringModel):
    • class HierarchicalClustering(object):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22177/
Test FAILed.

@mengxr
Contributor

mengxr commented Oct 24, 2014

@yu-iskw I added you to the whitelist. Future commits from you should trigger Jenkins automatically. Just took a very brief scan over the code and really appreciate the fact that more than half of the code is doc/test/example. I will check the implementation after the feature freeze. Some high-level questions for now:

  1. Is there a paper that you used as reference? If so, please cite it in the doc.
  2. Could you send some performance testing results on dense and sparse datasets?

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22179 has finished for PR 2906 at commit 91a38e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaHierarchicalClustering
    • class HierarchicalClusteringConf(
    • class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    • class ClusterTree(
    • class ClusteringModel(object):
    • class KMeansModel(ClusteringModel):
    • class HierarchicalClusteringModel(ClusteringModel):
    • class HierarchicalClustering(object):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22179/
Test FAILed.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22267 has started for PR 2906 at commit 1a08510.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22268 has started for PR 2906 at commit b014f50.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22270 has started for PR 2906 at commit 8dbbacd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22268 has finished for PR 2906 at commit b014f50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaHierarchicalClustering
    • class HierarchicalClusteringConf(
    • class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    • class ClusterTree(
    • class ClusteringModel(object):
    • class KMeansModel(ClusteringModel):
    • class HierarchicalClusteringModel(ClusteringModel):
    • class HierarchicalClustering(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22268/
Test PASSed.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22267 has finished for PR 2906 at commit 1a08510.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22267/
Test PASSed.

// TODO: Support distance metrics other than Euclidean distance
val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
val treeRoot = this.clusterTree
sc.broadcast(metric)
Contributor

The result of sc.broadcast is not used here; see my other note about sc.broadcast.

@freeman-lab
Contributor

Hi @yu-iskw and @rnowling, I've spent time reviewing the code and using it in both Python and Scala. Overall great work, terrific to see my little gist turned into something so refined and performant! =) I left lots of comments, most of them minor, though documenting the caching behavior seems quite important.

The one significant addition I'd suggest is exposing another model output: a list of the centers at all nodes in the learned tree. This would be in addition to just the centers of the leaves, which is currently returned by getCenters (or clusterCenters in Python). Maybe call it getTreeCenters. It's basically given by model.clusterTree.toSeq().map(_.center). But we should make sure it's sorted so that it can be indexed using the values from the merge list. In other words, if Z is the merge list, and row i indicates that Z[i,0] and Z[i,1] were merged, we want to be able to get the centers associated with those nodes by calling, for example, model.treeCenters[Z[i,0]] and model.treeCenters[Z[i,1]]. What do you think?
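
To make the indexing requirement concrete, here is a rough sketch of one way the tree centers and the merge list could share the same node indexes. It is not the PR's `ClusterTree`: `Node`, `Indexed`, `index`, `flatten`, `treeCenters`, and `merges` are hypothetical names, and the pre-order numbering is an assumption chosen only so that both outputs come from one numbering pass.

```scala
object TreeCentersSketch {
  // Hypothetical stand-in for a cluster tree node: a center plus child nodes.
  final case class Node(center: Seq[Double], children: Seq[Node] = Nil)

  // The same tree with an explicit index attached to every node.
  final case class Indexed(index: Int, center: Seq[Double], children: Vector[Indexed])

  // Number the nodes in pre-order (a node gets its index before its children).
  def index(root: Node): Indexed = {
    var next = 0
    def go(n: Node): Indexed = {
      val id = next
      next += 1
      Indexed(id, n.center, n.children.toVector.map(go))
    }
    go(root)
  }

  // All nodes of the tree, root first.
  def flatten(n: Indexed): Vector[Indexed] = n +: n.children.flatMap(flatten)

  // Centers of every node, positioned so that treeCenters(i) is node i's center.
  def treeCenters(root: Indexed): Vector[Seq[Double]] =
    flatten(root).sortBy(_.index).map(_.center)

  // One row per binary split: the indexes of the two children that were merged.
  def merges(root: Indexed): Vector[(Int, Int)] =
    flatten(root).collect {
      case n if n.children.length == 2 => (n.children(0).index, n.children(1).index)
    }

  def main(args: Array[String]): Unit = {
    val tree = index(
      Node(Seq(5.0, 5.0), Seq(
        Node(Seq(0.0, 0.0), Seq(Node(Seq(-1.0, 0.0)), Node(Seq(1.0, 0.0)))),
        Node(Seq(10.0, 10.0)))))
    val centers = treeCenters(tree)
    // Look up both merged centers for every row of the merge list.
    merges(tree).foreach { case (a, b) => println(s"${centers(a)} + ${centers(b)}") }
  }
}
```

Because the merge rows and the center list are both derived from the same `Indexed` tree, a lookup such as `centers(a)` cannot drift out of sync with the merge list, whichever numbering convention is finally chosen.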

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class JavaHierarchicalClustering {
Member

The other example code I see foregoes a lot of the boilerplate here of declaring a class, main method, System.out, etc. The indentation here is also significantly deeper than the 2-space indent in the code. Addressing these might make it easier to scan as an example on the web page.

@rnowling
Contributor

rnowling commented Jan 8, 2015

@freeman-lab @srowen @mengxr many thanks!

@yu-iskw
Contributor Author

yu-iskw commented Mar 11, 2015

@freeman-lab, @srowen, I apologize for the delay in replying. I will modify the code ASAP.
I also have a question about the implementation. I think the current implementation is very slow, and it is difficult to pass a large number of clusters as an argument. So I tried writing a new one that is more scalable and faster. The new one is 1000 times faster than the current one.

https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark

Should we continue with this PR, or replace the current implementation with the new one? Thanks!

@yu-iskw
Contributor Author

yu-iskw commented Mar 19, 2015

I've spoken with @freeman-lab. I am going to send a new PR after replacing the algorithm with the new one and adding wrapper classes for the ml package.

@yu-iskw yu-iskw closed this Mar 19, 2015
asfgit pushed a commit that referenced this pull request Nov 9, 2015
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later.
https://issues.apache.org/jira/browse/SPARK-6517

- This implementation is based on bisecting k-means clustering.
    - It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version (#2906).
    - However, it is 1000x faster than the previous version thanks to parallel processing.

Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>

Closes #5267 from yu-iskw/new-hierarchical-clustering.
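
As a rough illustration of the bisecting k-means idea mentioned in the commit message, the sketch below repeatedly splits the cluster with the largest squared-error cost in two until `k` clusters exist. It is a single-machine toy, not the merged Spark implementation: `BisectingKMeansSketch`, its helper names, the deterministic seeding, and the "split the costliest cluster" rule are all assumptions made for the example.

```scala
object BisectingKMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // Component-wise mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point = {
    val sum = new Array[Double](ps.head.length)
    ps.foreach(p => for (i <- sum.indices) sum(i) += p(i))
    sum.map(_ / ps.size)
  }

  // Sum of squared distances from each point to the cluster mean.
  def cost(ps: Seq[Point]): Double = { val c = mean(ps); ps.map(sqDist(_, c)).sum }

  // Split one cluster in two with a few Lloyd iterations (plain 2-means).
  // Seeding is deterministic: the first point and the point farthest from it.
  def bisect(ps: Seq[Point], iterations: Int = 10): (Seq[Point], Seq[Point]) = {
    var c0 = ps.head
    var c1 = ps.maxBy(sqDist(_, c0))
    var parts = ps.partition(p => sqDist(p, c0) <= sqDist(p, c1))
    var i = 0
    while (i < iterations && parts._1.nonEmpty && parts._2.nonEmpty) {
      c0 = mean(parts._1)
      c1 = mean(parts._2)
      parts = ps.partition(p => sqDist(p, c0) <= sqDist(p, c1))
      i += 1
    }
    parts
  }

  // Divisive clustering: keep bisecting the costliest cluster until k clusters exist
  // or the chosen cluster cannot be split any further.
  def run(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    require(points.nonEmpty && k >= 1, "need at least one point and k >= 1")
    var clusters = Vector(points)
    var splittable = true
    while (clusters.size < k && splittable) {
      val target = clusters.maxBy(cost)
      val (left, right) = bisect(target)
      if (left.isEmpty || right.isEmpty) splittable = false
      else clusters = clusters.filterNot(_ eq target) ++ Vector(left, right)
    }
    clusters
  }
}
```

Calling `BisectingKMeansSketch.run(points, 3)` on a small in-memory `Seq[Array[Double]]` returns the points grouped into at most three clusters; a distributed version would compute the assignments and means over an RDD instead of local collections.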
asfgit pushed a commit that referenced this pull request Nov 9, 2015
(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
8 participants