[SPARK-6517][mllib] Implement the Algorithm of Hierarchical Clustering #5267

Closed
wants to merge 77 commits into from

Conversation

yu-iskw
Contributor

@yu-iskw yu-iskw commented Mar 30, 2015

I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I will send those in separate PRs later.
https://issues.apache.org/jira/browse/SPARK-6517

Thank you for your great cooperation, RJ Nowling(@rnowling), Jeremy Freeman(@freeman-lab), Xiangrui Meng(@mengxr) and Sean Owen(@srowen).

@SparkQA

SparkQA commented Mar 30, 2015

Test build #29402 has finished for PR 5267 at commit af0f65b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClustering(
    • class ClusterTree(
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Mar 30, 2015

Test build #29403 has finished for PR 5267 at commit 3df7f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClustering(
    • class ClusterTree(
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):
  • This patch does not change any dependencies.

@yu-iskw
Contributor Author

yu-iskw commented Mar 31, 2015

Hi @jkbradley, @mengxr,
Would you review this PR, especially from a design point of view? It supports saving and loading models in both Scala and Python. I want to make sure that my implementation fits the save/load concept.
Thanks

@freeman-lab
Contributor

@yu-iskw great putting this new version together, I'd be happy to do a review (especially re: the algorithm), should be able to get to it in the next few days!

@yu-iskw
Contributor Author

yu-iskw commented Apr 1, 2015

@freeman-lab, Thank you for your attention to this matter.

/**
* Top-level methods for calling the hierarchical clustering algorithm
*/
object HierarchicalClustering extends Logging {
Member

I don't think we need static train() methods since users can use the builder pattern from the HierarchicalClustering class.

Contributor Author

Should we remove the static train() methods from the other algorithms as well? For example, KMeans has a static train() method. I'd like to understand the difference. I think we should keep the design consistent for users; it is not good that some algorithms support a static train() method and others don't.

Member

Those static train() methods used to be added everywhere, but at some point, we realized that they require a lot of duplicated code (since Java does not recognize default arguments). We're trying to just add builder methods from now on, but we have to keep the old static train() methods for API stability.

You're right about consistency. I've been thinking about whether we should start deprecating the old static train() methods.
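
To make the trade-off concrete, here is a minimal sketch of the two API styles discussed above. `SketchClustering` and its parameters are hypothetical stand-ins, not the classes from this PR; the point is only that each optional parameter forces another static overload, while a builder needs one setter per parameter.

```scala
// Hypothetical estimator used only to illustrate the two API styles.
class SketchClustering {
  private var k: Int = 2
  private var maxIterations: Int = 20
  private var seed: Long = 1L

  // Builder style: each setter returns this.type so calls can be chained.
  def setK(value: Int): this.type = {
    require(value > 0, s"k must be positive but got $value")
    k = value
    this
  }

  def setMaxIterations(value: Int): this.type = {
    require(value > 0, s"maxIterations must be positive but got $value")
    maxIterations = value
    this
  }

  def setSeed(value: Long): this.type = {
    seed = value
    this
  }

  // Getters are declared without parentheses because they have no side effects.
  def getK: Int = k
  def getMaxIterations: Int = maxIterations
  def getSeed: Long = seed
}

object SketchClustering {
  // Static style: Java callers cannot use Scala default arguments, so every
  // combination of defaults needs its own overload. A real train() would also
  // take the data and return a model; here it only returns the configuration.
  def train(k: Int): SketchClustering = train(k, 20)

  def train(k: Int, maxIterations: Int): SketchClustering = train(k, maxIterations, 1L)

  def train(k: Int, maxIterations: Int, seed: Long): SketchClustering =
    new SketchClustering().setK(k).setMaxIterations(maxIterations).setSeed(seed)
}
```

With the builder, a caller writes `new SketchClustering().setK(4).setMaxIterations(10)` and leaves the remaining parameters at their defaults.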

Contributor Author

I understand. I agree that we should start deprecating the old ones.
I am removing the static train() method from the HierarchicalClustering object.
Thanks!

Contributor Author

Sorry, one more question. How about consistency between Scala and Python?
MLlib in PySpark doesn't support builder methods. The way to call training in PySpark should be similar to that in Scala, and I think that means a static train() method.

Member

Yes, that's an issue but one where there isn't a great solution. Let's discuss it on the JIRA: [https://issues.apache.org/jira/browse/SPARK-6682]

Contributor Author

Alright. Thank you for letting me know.

@jkbradley
Member

Thanks @freeman-lab in advance for reviewing! I'd be happy to make a pass too, but will wait to avoid duplicate passes over the code.

@yu-iskw
Contributor Author

yu-iskw commented Apr 2, 2015

@jkbradley, thank you for your quick reply. I understand you are waiting for @freeman-lab's review. In the meantime, I am writing an example and the documentation in parallel.

@yu-iskw
Contributor Author

yu-iskw commented Apr 8, 2015

Sorry for modifying the code before your feedback. The main changes are listed below. Thank you for your continuous support.

  • Remove the static train() method from HierarchicalClustering object
  • Remove the parentheses from the setter/getter methods
  • Add a method to calculate Within Set Sum of Squared Error (WSSSE) into HierarchicalClusteringModel
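
For readers unfamiliar with the metric in the last item, WSSSE is the sum, over all points, of the squared distance from each point to its nearest cluster center. The following is a minimal local sketch, assuming points and centers are plain arrays rather than RDDs; `WssseSketch` and its method names are illustrative and are not the method added in this PR.

```scala
object WssseSketch {
  type Point = Array[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** Sum over all points of the squared distance to the closest center. */
  def wssse(points: Seq[Point], centers: Seq[Point]): Double = {
    require(centers.nonEmpty, "at least one center is required")
    points.map(p => centers.map(c => squaredDistance(p, c)).min).sum
  }
}
```

A lower value means tighter clusters, but the value always shrinks as the number of clusters grows, so it is most useful for comparing runs with the same number of clusters.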

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29855 has finished for PR 5267 at commit 9d32110.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):
  • This patch does not change any dependencies.

@yu-iskw
Contributor Author

yu-iskw commented Apr 14, 2015

@freeman-lab, do you know of any good evaluation metrics for a hierarchical clustering algorithm other than Within Set Sum of Squared Error (WSSSE)? For example, I know the Silhouette Coefficient is one such metric. However, I think it is hard to implement as distributed processing.
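
To illustrate why the Silhouette Coefficient is awkward to distribute, here is a minimal local sketch using the standard definition: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The names are illustrative assumptions. Every point is compared against every other point, so the cost grows quadratically with the number of points.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Mean silhouette score over all points; clusters are given as groups of points. */
  def meanSilhouette(clusters: Seq[Seq[Point]]): Double = {
    require(clusters.size >= 2, "the silhouette needs at least two clusters")
    require(clusters.forall(_.nonEmpty), "clusters must not be empty")
    val scores = for {
      (cluster, ci) <- clusters.zipWithIndex
      (p, pi) <- cluster.zipWithIndex
    } yield {
      if (cluster.size == 1) {
        0.0 // by convention a singleton cluster scores 0
      } else {
        // a: mean distance to the other members of the same cluster
        val a = cluster.zipWithIndex.collect {
          case (q, qi) if qi != pi => distance(p, q)
        }.sum / (cluster.size - 1)
        // b: smallest mean distance to the members of any other cluster
        val b = clusters.zipWithIndex.collect {
          case (other, oi) if oi != ci => other.map(distance(p, _)).sum / other.size
        }.min
        val denominator = math.max(a, b)
        if (denominator == 0.0) 0.0 else (b - a) / denominator
      }
    }
    scores.sum / scores.size
  }
}
```

Scores close to 1 indicate well-separated clusters, and scores near 0 or below indicate overlapping ones.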

this
}

def getSubIterations: Int = this.maxIterations
Contributor

Why the name swap? Shouldn't this be getMaxIterations?

@freeman-lab
Contributor

@yu-iskw I'm not familiar with any other self-contained metrics (there are a bunch of metrics for relating estimated clusters to some known ground-truth clustering, but I don't think that's what you mean). Are you wanting to provide other outputs to the user to assess clustering quality?

}
}

def getClusters(): Array[ClusterTree] = this.tree.getLeavesNodes()
Contributor

Remove parentheses after getClusters

@freeman-lab
Contributor

@yu-iskw I'm still going through the patch, but so far it's looking good! I've also been testing it locally.

Is there a reason you removed the toMergeList method from the previous version of this submission? That seemed quite useful to me, as it's a common way to describe the output of hierarchical clustering, both in formal treatments as well as in other analysis libraries (though I do suggest naming it toLinkageMatrix). What do you think about bringing it back?

@SparkQA

SparkQA commented Apr 27, 2015

Test build #31035 has started for PR 5267 at commit 9d32110.

@yu-iskw
Contributor Author

yu-iskw commented Apr 30, 2015

@freeman-lab thank you for reviewing. I will modify it soon.

I removed toMergeList from the old version because I thought it depended too much on scipy. However, as you suggest, we should support some function for getting a dendrogram. How about an adjacency list?
Or, if there is a general data structure for a linkage matrix, we should support it. Do you have any ideas?

@freeman-lab
Contributor

@yu-iskw that makes sense! I do think the linkage matrix / merge list is a general enough data structure for this algorithm that it's definitely worth having as an output, and doesn't actually depend on scipy. The way you had it before is, I think, basically the same thing used in scipy, R, and matlab. I would call the method toLinkageMatrix, that seems to be the most common name.
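
As a rough illustration of the data structure under discussion, the sketch below flattens a binary cluster tree into linkage-matrix rows of the form (left id, right id, height, size), following the common convention that the n leaves get ids 0 to n - 1 and the i-th merge creates cluster id n + i. The `Tree`, `Leaf`, `Branch`, and `Merge` types are hypothetical and are not the `ClusterTree` class from this PR; the leaves here are the bottom-level clusters of the tree, not individual observations.

```scala
object LinkageSketch {
  sealed trait Tree
  final case class Leaf(label: String) extends Tree
  final case class Branch(left: Tree, right: Tree, height: Double) extends Tree

  /** One row of the linkage matrix: the two merged cluster ids, the merge height, and the leaf count. */
  final case class Merge(left: Int, right: Int, height: Double, size: Int)

  def countLeaves(tree: Tree): Int = tree match {
    case Leaf(_) => 1
    case Branch(l, r, _) => countLeaves(l) + countLeaves(r)
  }

  def toLinkageMatrix(root: Tree): Vector[Merge] = {
    val rows = Vector.newBuilder[Merge]
    var nextLeafId = 0
    var nextBranchId = countLeaves(root)

    // Post-order walk: both children receive their ids before the parent,
    // so every row only refers to ids that were assigned earlier.
    def visit(tree: Tree): (Int, Int) = tree match {
      case Leaf(_) =>
        val id = nextLeafId
        nextLeafId += 1
        (id, 1)
      case Branch(l, r, height) =>
        val (leftId, leftSize) = visit(l)
        val (rightId, rightSize) = visit(r)
        val id = nextBranchId
        nextBranchId += 1
        rows += Merge(leftId, rightId, height, leftSize + rightSize)
        (id, leftSize + rightSize)
    }

    visit(root)
    rows.result()
  }
}
```

The sketch emits rows in traversal order. Tools that draw dendrograms often expect rows sorted by increasing height, which would require renumbering the merged clusters after sorting.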

@SparkQA

SparkQA commented May 21, 2015

Test build #33218 has finished for PR 5267 at commit c3ce8ca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):

@yu-iskw
Contributor Author

yu-iskw commented May 21, 2015

Sorry for the delay in my response. I understand that the linkage matrix is the common data structure, so I added the toLinkageMatrix method. There are three big changes in this update, listed below. Could you review them?

  • Remove unnecessary parentheses
  • Add the methods to convert a model to a linkage matrix and adjacency list
  • Add a Java test file

@SparkQA

SparkQA commented May 21, 2015

Test build #33221 has finished for PR 5267 at commit e990f1e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):

@SparkQA

SparkQA commented May 21, 2015

Test build #33227 has finished for PR 5267 at commit 957d3e2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):

@SparkQA

SparkQA commented May 21, 2015

Test build #33240 has finished for PR 5267 at commit ba5c208.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):

@SparkQA

SparkQA commented May 22, 2015

Test build #33316 has finished for PR 5267 at commit ca54fe3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HierarchicalClusteringModel(val tree: ClusterTree)
    • class HierarchicalClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
    • class HierarchicalClustering(object):

@yu-iskw
Contributor Author

yu-iskw commented Jun 3, 2015

@freeman-lab How is it going? Thank you for your great support.

@SparkQA

SparkQA commented Oct 29, 2015

Test build #44615 has finished for PR 5267 at commit a876ba2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BisectingKMeansModel @Since("1.6.0") (

@SparkQA

SparkQA commented Oct 29, 2015

Test build #44642 has finished for PR 5267 at commit 5da05d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BisectingKMeansModel @Since("1.6.0") (

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44646 has finished for PR 5267 at commit a50689a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BisectingKMeansModel @Since("1.6.0") (

val random = new Random(seed)
var numLeafClustersNeeded = k - 1
var level = 1
while (activeClusters.nonEmpty && numLeafClustersNeeded > 0 && level < 63) {
Contributor Author

We should use val levelLimit = log10(Long.MaxValue) / log10(2) instead of the literal 63. It's a minor issue.
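
A short sketch of where the number 63 plausibly comes from, assuming the cluster tree numbers its nodes heap-style in a `Long` (root 1, children 2i and 2i + 1); that indexing scheme is an assumption, since the excerpt above does not show it. Under that assumption the index doubles at every level, so a signed 64-bit value overflows past level 63. All names below are illustrative.

```scala
object LevelLimitSketch {
  // Assumed heap-style numbering: the root is 1 and the children of i are 2i and 2i + 1.
  def leftChild(index: Long): Long = 2L * index
  def rightChild(index: Long): Long = 2L * index + 1L

  // The expression from the review comment: log base 2 of Long.MaxValue.
  // It is a Double, so it is only approximately 63 and needs rounding before use.
  val levelLimitFromLogs: Double = math.log10(Long.MaxValue.toDouble) / math.log10(2.0)

  // An integer-only alternative: a signed Long has 64 bits, one of which is the sign.
  val levelLimitFromBits: Int = java.lang.Long.SIZE - 1

  /** Smallest node index on a level, with the root on level 1. */
  def firstIndexAtLevel(level: Int): Long = {
    require(level >= 1 && level <= levelLimitFromBits, s"level out of range: $level")
    1L << (level - 1)
  }
}
```

Either constant documents the intent better than a bare literal. The bit-based one avoids floating-point rounding when it is compared with an integer loop counter.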

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45396 has finished for PR 5267 at commit 75ca2a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * Sets the random seed (default: hash value of the class name).
    • class BisectingKMeansModel @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45406 has finished for PR 5267 at commit 29ccdf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * Sets the random seed (default: hash value of the class name).
    • class BisectingKMeansModel @Since("1.6.0") (

@mengxr
Contributor

mengxr commented Nov 9, 2015

LGTM. Merged into master and branch-1.6. Thanks!

@asfgit asfgit closed this in 8a23368 Nov 9, 2015
asfgit pushed a commit that referenced this pull request Nov 9, 2015
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I will send those in separate PRs later.
https://issues.apache.org/jira/browse/SPARK-6517

- This implementation is based on bisecting k-means clustering.
    - It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version (#2906).
    - However, it is 1000x faster than the previous version through parallel processing.

Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>

Closes #5267 from yu-iskw/new-hierarchical-clustering.

(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
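
The bisecting k-means idea named in the commit message can be sketched locally as follows. This is a simplified, single-machine illustration, not the merged implementation: it splits one cluster at a time (the one with the largest sum of squared errors) using plain 2-means, and all names and defaults are assumptions made for the example.

```scala
import scala.util.Random

object BisectingKMeansSketch {
  type Point = Vector[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(points: Seq[Point]): Point =
    Vector.tabulate(points.head.length)(j => points.map(_(j)).sum / points.size)

  /** Sum of squared distances from each point to the cluster mean. */
  def sse(points: Seq[Point]): Double = {
    val center = mean(points)
    points.map(sqDist(_, center)).sum
  }

  /** Splits one cluster in two with a few iterations of plain 2-means. */
  def bisect(points: Seq[Point], iterations: Int, random: Random): (Seq[Point], Seq[Point]) = {
    // Seed with two distinct points so that both halves start non-empty.
    val seeds = random.shuffle(points.distinct).take(2)
    require(seeds.size == 2, "need at least two distinct points to bisect")
    var split = points.partition(p => sqDist(p, seeds(0)) <= sqDist(p, seeds(1)))
    var i = 0
    var changed = true
    while (i < iterations && changed) {
      val c1 = mean(split._1)
      val c2 = mean(split._2)
      val next = points.partition(p => sqDist(p, c1) <= sqDist(p, c2))
      // Keep the previous split if the update would empty one side.
      changed = next._1.nonEmpty && next._2.nonEmpty && next != split
      if (changed) split = next
      i += 1
    }
    split
  }

  /** Repeatedly bisects the cluster with the largest SSE until k clusters exist. */
  def run(points: Seq[Point], k: Int, iterations: Int = 10, seed: Long = 42L): Vector[Seq[Point]] = {
    require(k >= 1 && points.nonEmpty, "need k >= 1 and at least one point")
    val random = new Random(seed)
    var clusters: Vector[Seq[Point]] = Vector(points)
    var done = false
    while (clusters.size < k && !done) {
      // Only clusters with at least two distinct points can be divided.
      val divisible = clusters.zipWithIndex.filter { case (c, _) => c.distinct.size >= 2 }
      if (divisible.isEmpty) {
        done = true
      } else {
        val (target, index) = divisible.maxBy { case (c, _) => sse(c) }
        val (left, right) = bisect(target, iterations, random)
        clusters = clusters.patch(index, Seq(left, right), 1)
      }
    }
    clusters
  }
}
```

Recording each split as a parent with two children yields the cluster tree. A distributed version would assign points to clusters and update centers in parallel rather than in local loops.
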
@yu-iskw
Contributor Author

yu-iskw commented Nov 9, 2015

Thank you for merging it!!

@freeman-lab
Contributor

awesome, nice job all!

@yu-iskw
Contributor Author

yu-iskw commented Nov 9, 2015

@freeman-lab thank you for your great support!

@jkbradley
Member

@yu-iskw Thanks for persevering and getting this merged!

@yu-iskw
Contributor Author

yu-iskw commented Nov 10, 2015

@jkbradley thanks!
