[SPARK-2429] [MLlib] Hierarchical Implementation of KMeans #2906

yu-iskw · 2014-10-23T09:29:20Z

I want to add a divisive hierarchical clustering algorithm implementation to MLlib. It does not yet support distance metrics other than Euclidean distance; it would be nice to add that support in a separate issue.
Could you review it?

Thanks!

AmplabJenkins · 2014-10-23T09:32:11Z

Can one of the admins verify this patch?

rnowling · 2014-10-23T09:49:09Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala

+    val treeRoot = this.clusterTree
+    val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
+    data.sparkContext.broadcast(closestClusterIndexFinder)
+    val predicted = data.map(point => (closestClusterIndexFinder(point), point))


I don't think you're using the broadcast variable correctly:

http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Modify the way to use broadcast
yu-iskw@290d492
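
For readers following the broadcast discussion above, here is a minimal sketch of the pattern the Spark programming guide describes: keep the handle returned by `sc.broadcast(...)` and read it through `.value` inside the closure. This is not the code from the PR or from the fixing commit. The object name `BroadcastSketch`, the helper `closestIndex`, and the sample centers and points are all hypothetical, and the sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Hypothetical helper: index of the center nearest to `point`
  // under squared Euclidean distance.
  def closestIndex(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy { i =>
      centers(i).zip(point).map { case (c, p) => (c - p) * (c - p) }.sum
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-ins for the leaf centers and the input data.
      val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
      val data = sc.parallelize(Seq(Array(1.0, 1.0), Array(9.0, 8.0)))

      // Keep the handle. Calling broadcast() and ignoring its result
      // leaves the closure capturing the original value directly.
      val bcCenters = sc.broadcast(centers)

      // Read the shared value through .value inside the closure.
      val predicted = data.map(point => (closestIndex(bcCenters.value, point), point))

      predicted.collect().foreach { case (idx, p) =>
        println(s"$idx <- ${p.mkString("[", ", ", "]")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The snippet quoted in the review calls `broadcast` but then uses `closestClusterIndexFinder` itself inside `map`, so the broadcast variable is never read; the sketch differs only in going through the returned handle.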

srowen · 2014-10-23T16:37:34Z

I just gave this a quick read-through, and the structure makes sense. I left several small comments. I see the chunks of logic I would expect, but did not evaluate it in detail. The existence of some tests suggests this probably basically works :) I am wondering about performance too as this relies on Scala idioms in many places; it might be worth a quick look with jprofiler if you can to see if there are any easy-win optimizations.

mengxr · 2014-10-24T23:01:14Z

Jenkins, add to whitelist.

mengxr · 2014-10-24T23:02:09Z

ok to test

SparkQA · 2014-10-24T23:04:53Z

Test build #22177 has started for PR 2906 at commit 91a38e3.

This patch merges cleanly.

SparkQA · 2014-10-24T23:10:09Z

Test build #22179 has started for PR 2906 at commit 91a38e3.

This patch merges cleanly.

SparkQA · 2014-10-24T23:54:22Z

Test build #22177 has finished for PR 2906 at commit 91a38e3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaHierarchicalClustering
- class HierarchicalClusteringConf(
- class HierarchicalClustering(val conf: HierarchicalClusteringConf)
- class ClusterTree(
- class ClusteringModel(object):
- class KMeansModel(ClusteringModel):
- class HierarchicalClusteringModel(ClusteringModel):
- class HierarchicalClustering(object):

AmplabJenkins · 2014-10-24T23:54:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22177/

mengxr · 2014-10-24T23:58:55Z

@yu-iskw I added you to the whitelist. Future commits from you should trigger Jenkins automatically. Just took a very brief scan over the code and really appreciate the fact that more than half of the code is doc/test/example. I will check the implementation after the feature freeze. Some high-level questions for now:

Is there a paper that you used as reference? If so, please cite it in the doc.
Could you send some performance testing results on dense and sparse datasets?

SparkQA · 2014-10-25T00:03:37Z

Test build #22179 has finished for PR 2906 at commit 91a38e3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaHierarchicalClustering
- class HierarchicalClusteringConf(
- class HierarchicalClustering(val conf: HierarchicalClusteringConf)
- class ClusterTree(
- class ClusteringModel(object):
- class KMeansModel(ClusteringModel):
- class HierarchicalClusteringModel(ClusteringModel):
- class HierarchicalClustering(object):

AmplabJenkins · 2014-10-25T00:03:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22179/

SparkQA · 2014-10-27T02:04:47Z

Test build #22267 has started for PR 2906 at commit 1a08510.

This patch merges cleanly.

SparkQA · 2014-10-27T02:09:48Z

Test build #22268 has started for PR 2906 at commit b014f50.

This patch merges cleanly.

SparkQA · 2014-10-27T02:39:46Z

Test build #22270 has started for PR 2906 at commit 8dbbacd.

This patch merges cleanly.

SparkQA · 2014-10-27T03:20:06Z

Test build #22268 has finished for PR 2906 at commit b014f50.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaHierarchicalClustering
- class HierarchicalClusteringConf(
- class HierarchicalClustering(val conf: HierarchicalClusteringConf)
- class ClusterTree(
- class ClusteringModel(object):
- class KMeansModel(ClusteringModel):
- class HierarchicalClusteringModel(ClusteringModel):
- class HierarchicalClustering(object):

AmplabJenkins · 2014-10-27T03:20:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22268/

SparkQA · 2014-10-27T03:21:32Z

Test build #22267 has finished for PR 2906 at commit 1a08510.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-27T03:21:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22267/

freeman-lab · 2015-01-08T02:45:12Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala

+    // TODO Supports distance metrics other Euclidean distance metric
+    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+    val treeRoot = this.clusterTree
+    sc.broadcast(metric)


Not output, see my other note about sc.broadcast.

freeman-lab · 2015-01-08T04:24:47Z

Hi @yu-iskw and @rnowling , I've spent time reviewing the code and using it in both Python and Scala. Overall great work, terrific to see my little gist turned into something so refined and performant! =) I left lots of comments, most minor, though documenting the caching behavior seems quite important.

The one significant addition I'd suggest is exposing another model output: a list of the centers at all nodes in the learned tree. This would be in addition to just the centers of the leaves, which is currently returned by getCenters (or clusterCenters in Python). Maybe call it getTreeCenters. It's basically given by model.clusterTree.toSeq().map(_.center). But we should make sure it's sorted so that it can be indexed using the values from the merge list. In other words, if Z is the merge list, and row i indicates that Z[i,0] and Z[i,1] were merged, we want to be able to get the centers associated with those nodes by calling, for example, model.treeCenters[Z[i,0]] and model.treeCenters[Z[i,1]]. What do you think?
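
To make the indexing requirement concrete, here is a minimal sketch in plain Scala. It is not the PR's `ClusterTree` API: `Node`, `TreeCentersSketch`, and `index` are hypothetical names, and the numbering convention (leaves first, then internal nodes in merge order, similar to SciPy's linkage matrices) is an assumption about what "sorted so that it can be indexed using the values from the merge list" could mean.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical binary tree node; `children` is None for a leaf.
// scala.Vector is used for centers here, not MLlib's Vector.
final case class Node(center: Vector[Double], children: Option[(Node, Node)] = None)

object TreeCentersSketch {
  private def countLeaves(n: Node): Int = n.children match {
    case None         => 1
    case Some((l, r)) => countLeaves(l) + countLeaves(r)
  }

  /** Returns (treeCenters, merges). Leaves get ids 0 until numLeaves;
    * the internal node created by merge row i gets id numLeaves + i. */
  def index(root: Node): (Vector[Vector[Double]], Vector[(Int, Int)]) = {
    val numLeaves = countLeaves(root)
    val leafCenters = ArrayBuffer.empty[Vector[Double]]
    val innerCenters = ArrayBuffer.empty[Vector[Double]]
    val merges = ArrayBuffer.empty[(Int, Int)]

    // Post-order walk: children are numbered before their parent.
    def visit(n: Node): Int = n.children match {
      case None =>
        leafCenters += n.center
        leafCenters.size - 1
      case Some((left, right)) =>
        val l = visit(left)
        val r = visit(right)
        merges += ((l, r))
        innerCenters += n.center
        numLeaves + innerCenters.size - 1
    }
    visit(root)
    ((leafCenters ++ innerCenters).toVector, merges.toVector)
  }

  def main(args: Array[String]): Unit = {
    val a = Node(Vector(0.0, 0.0))
    val b = Node(Vector(2.0, 0.0))
    val c = Node(Vector(9.0, 9.0))
    val ab = Node(Vector(1.0, 0.0), Some((a, b)))
    val root = Node(Vector(5.0, 4.5), Some((ab, c)))

    val (treeCenters, z) = index(root)
    // Look up the centers of the two nodes joined in each merge row.
    z.foreach { case (i, j) => println(s"${treeCenters(i)} + ${treeCenters(j)}") }
  }
}
```

With this numbering, `treeCenters(z(i)._1)` and `treeCenters(z(i)._2)` give the centers of the two nodes joined in row `i`, which is the lookup described above.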

srowen · 2015-01-08T09:08:22Z

docs/mllib-clustering.md

+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+public class JavaHierarchicalClustering {


The other example code I see forgoes a lot of the boilerplate here: declaring a class, a main method, System.out, etc. The indentation here is also significantly deeper than the 2-space indent used in the code. Addressing these might make it easier to scan as an example on the web page.

rnowling · 2015-01-08T15:16:33Z

@freeman-lab @srowen @mengxr many thanks!

yu-iskw · 2015-03-11T06:58:36Z

@freeman-lab, @srowen, I apologize for the delay in replying. I will modify the code ASAP.
I also have a question about the implementation. I think this implementation is very slow, and it is difficult to run it with a large number of clusters. So I implemented a new version that is more scalable; it is about 1000 times faster than the current one.

https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark

Should we continue with this PR, or replace the current implementation with the new one? Thanks!

yu-iskw · 2015-03-19T20:50:17Z

I've spoken with @freeman-lab. I am going to send a new PR after replacing the algorithm with the new one and adding wrapper classes for the ml package.

I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I am going to send further PRs later.
https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting K-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version (#2906).
- However, it is 1000x faster than the previous version through parallel processing.
Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes #5267 from yu-iskw/new-hierarchical-clustering.
(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

rnowling reviewed Oct 23, 2014

yu-iskw force-pushed the hierarchical branch from 1a08510 to b014f50 Compare October 27, 2014 02:02

freeman-lab reviewed Jan 8, 2015

srowen reviewed Jan 8, 2015

yu-iskw closed this Mar 19, 2015

yu-iskw mentioned this pull request Mar 30, 2015

[SPARK-6517][mllib] Implement the Algorithm of Hierarchical Clustering #5267

Closed
