[SPARK-6517][mllib] Implement the Algorithm of Hierarchical Clustering #5267
Conversation
Test build #29402 has finished for PR 5267 at commit
Test build #29403 has finished for PR 5267 at commit
Hi @jkbradley, @mengxr,
@yu-iskw great work putting this new version together! I'd be happy to do a review (especially re: the algorithm); I should be able to get to it in the next few days!
@freeman-lab, thank you for your attention to this matter.
/**
 * Top-level methods for calling the hierarchical clustering algorithm
 */
object HierarchicalClustering extends Logging {
I don't think we need static train() methods since users can use the builder pattern from the HierarchicalClustering class.
Should we remove the static train() methods of other algorithms too? For example, KMeans has a static train() method. I'd like to understand the difference. I think we should keep the design concept consistent for users. It is not good that some algorithms support a static train() method while others don't.
Those static train() methods used to be added everywhere, but at some point, we realized that they require a lot of duplicated code (since Java does not recognize default arguments). We're trying to just add builder methods from now on, but we have to keep the old static train() methods for API stability.
You're right about consistency. I've been thinking about whether we should start deprecating the old static train() methods.
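To illustrate the trade-off being discussed, here is a minimal sketch in Python (hypothetical class and method names, not MLlib's actual API) contrasting the builder pattern with static train() overloads:

```python
# Builder-style estimator: each setter returns self, so callers chain the
# parameters they care about and defaults need no extra overloads.
class HierarchicalClustering:
    def __init__(self):
        self.num_clusters = 2
        self.max_iterations = 20

    def set_num_clusters(self, k):
        self.num_clusters = k
        return self

    def set_max_iterations(self, n):
        self.max_iterations = n
        return self

    def run(self, data):
        # ... the actual clustering would happen here ...
        return {"k": self.num_clusters, "iters": self.max_iterations}


# Static-train style: in Java/Scala without default arguments, every default
# combination needs its own overload, which is why such methods accumulate
# duplicated signatures over time.
def train(data, k, max_iterations=20):
    return (HierarchicalClustering()
            .set_num_clusters(k)
            .set_max_iterations(max_iterations)
            .run(data))
```

With the builder, adding a new parameter means adding one setter; with static train(), it can mean adding a whole new overload for each existing signature.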
I understand. I agree that we should start deprecating the old ones.
I am removing the static train() method from the HierarchicalClustering object.
Thanks!
Sorry, one more question. How about consistency between Scala and Python?
MLlib in PySpark doesn't support builder methods. The way a train() method is called in PySpark should be similar to the way it is called in Scala, and I think that means a static train() method.
Yes, that's an issue, but one where there isn't a great solution. Let's discuss it on the JIRA: https://issues.apache.org/jira/browse/SPARK-6682
Alright. Thank you for letting me know.
Thanks @freeman-lab in advance for reviewing! I'd be happy to make a pass too, but will wait to avoid duplicate passes over the code.
@jkbradley, thank you for your quick reply. I understand you are waiting for @freeman-lab's review. I am working on an example and writing the documentation in parallel.
Sorry for modifying the code before your feedback. The main differences are as follows. Thank you for your continuous support.
Test build #29855 has finished for PR 5267 at commit
@freeman-lab, do you know any good evaluations of a hierarchical clustering algorithm besides Within Set Sum of Squared Errors (WSSSE)? For example, I know the Silhouette Coefficient is one such evaluation. However, I think it is hard to implement as a distributed computation.
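For reference, WSSSE itself is easy to express, and each point's contribution is independent, which is why it distributes well (each partition can sum its own points' errors). A minimal non-distributed Python sketch (hypothetical helper, not part of this PR):

```python
# Within Set Sum of Squared Errors: each point contributes the squared
# Euclidean distance to the center of the cluster it is assigned to.
def wssse(points, centers, assignments):
    total = 0.0
    for point, cluster_id in zip(points, assignments):
        center = centers[cluster_id]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total
```

Because the outer sum is associative, in Spark this becomes a simple map (per-point squared error) followed by a reduce (sum), whereas the Silhouette Coefficient needs pairwise distances between points, which is much harder to distribute.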
this
}

def getSubIterations: Int = this.maxIterations
Why the name swap? Shouldn't this be getMaxIterations?
@yu-iskw I'm not familiar with any other self-contained metrics (there are a bunch of metrics for relating estimated clusters to some known ground-truth clustering, but I don't think that's what you mean). Are you wanting to provide other outputs to the user to assess clustering quality?
}
}

def getClusters(): Array[ClusterTree] = this.tree.getLeavesNodes()
Remove parentheses after getClusters
@yu-iskw I'm still going through the patch, but so far it's looking good! I've also been testing it locally. Is there a reason you removed the
Test build #31035 has started for PR 5267 at commit
@freeman-lab thank you for reviewing. I will modify it soon. The reason for removing
@yu-iskw that makes sense! I do think the linkage matrix / merge list is a general enough data structure for this algorithm that it's definitely worth having as an output, and it doesn't actually depend on scipy. The way you had it before is, I think, basically the same thing used in scipy, R, and MATLAB. I would call the method
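For readers unfamiliar with the linkage-matrix format mentioned here: each row records one merge as `[left_id, right_id, distance, cluster_size]`, where original points get ids `0..n-1` and the cluster formed by row `i` gets id `n + i`. A small pure-Python sketch of how such a matrix is assembled from a merge sequence (hypothetical helper, for illustration only):

```python
# Build a scipy/R/MATLAB-style linkage matrix from a list of merges.
# merges: list of (left_id, right_id, distance) in merge order.
# n: number of original data points (leaf clusters).
def build_linkage(merges, n):
    linkage = []
    sizes = {i: 1 for i in range(n)}  # every original point is a singleton
    for i, (left, right, dist) in enumerate(merges):
        size = sizes[left] + sizes[right]
        sizes[n + i] = size  # the new cluster gets id n + i
        linkage.append([left, right, dist, size])
    return linkage

# Three points: merge points 0 and 1 first (new cluster id 3),
# then merge cluster 3 with point 2.
linkage = build_linkage([(0, 1, 0.5), (3, 2, 1.2)], n=3)
```

This row format is exactly what scipy's `scipy.cluster.hierarchy` dendrogram utilities consume, which is why exposing it makes the output easy to plot downstream without any scipy dependency in Spark itself.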
Test build #33218 has finished for PR 5267 at commit
Sorry for the delay in my response. I understand the linkage matrix is the common data structure. I supported the
Test build #33221 has finished for PR 5267 at commit
Test build #33227 has finished for PR 5267 at commit
Test build #33240 has finished for PR 5267 at commit
Test build #33316 has finished for PR 5267 at commit
@freeman-lab how is it going? Thank you for your great support.
Test build #44615 has finished for PR 5267 at commit
Test build #44642 has finished for PR 5267 at commit
Test build #44646 has finished for PR 5267 at commit
Refactor bisecting k-means
val random = new Random(seed)
var numLeafClustersNeeded = k - 1
var level = 1
while (activeClusters.nonEmpty && numLeafClustersNeeded > 0 && level < 63) {
We should use val levelLimit = log10(Long.MaxValue) / log10(2) instead of the hard-coded 63. It's a minor issue.
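The bound being suggested can be checked numerically. The level counter indexes nodes of a binary tree whose ids are packed into a signed 64-bit Long, so the depth must stay below log2(Long.MaxValue). A quick Python check (Long.MaxValue is 2^63 - 1):

```python
import math

# Depth limit for packing binary-tree node ids into a signed 64-bit Long:
# 2**level must not exceed Long.MaxValue, i.e. level < log2(2**63 - 1) ~ 63.
LONG_MAX = 2 ** 63 - 1
level_limit = math.log10(LONG_MAX) / math.log10(2)
```

So the expression and the literal 63 agree; computing it from Long.MaxValue just documents where the constant comes from.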
Test build #45396 has finished for PR 5267 at commit
Test build #45406 has finished for PR 5267 at commit
LGTM. Merged into master and branch-1.6. Thanks!
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting k-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version. (#2906)
- However, it is 1000x faster than the previous version through parallel processing.
Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes #5267 from yu-iskw/new-hierarchical-clustering.
(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Thank you for merging it!!
Awesome, nice job all!
@freeman-lab thank you for your great support!
@yu-iskw Thanks for persevering and getting this merged!
@jkbradley thanks!
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later.
https://issues.apache.org/jira/browse/SPARK-6517
Thank you for your great cooperation, RJ Nowling(@rnowling), Jeremy Freeman(@freeman-lab), Xiangrui Meng(@mengxr) and Sean Owen(@srowen).