[SPARK-6517][mllib] Implement the Algorithm of Hierarchical Clustering #5267
Conversation
Test build #29402 has finished for PR 5267 at commit
Test build #29403 has finished for PR 5267 at commit
Hi @jkbradley, @mengxr,
@yu-iskw great work putting this new version together! I'd be happy to do a review (especially re: the algorithm); I should be able to get to it in the next few days!
@freeman-lab, thank you for your attention to this matter.
/**
 * Top-level methods for calling the hierarchical clustering algorithm
 */
object HierarchicalClustering extends Logging {
I don't think we need static train() methods since users can use the builder pattern from the HierarchicalClustering class.
Should we remove the static train() methods of other algorithms too? For example, KMeans has a static train() method. I'd like to understand the difference. I think we should keep the design concept consistent for users. It is not good that some algorithms support a static train() method while others don't.
Those static train() methods used to be added everywhere, but at some point, we realized that they require a lot of duplicated code (since Java does not recognize default arguments). We're trying to just add builder methods from now on, but we have to keep the old static train() methods for API stability.
You're right about consistency. I've been thinking about whether we should start deprecating the old static train() methods.
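To illustrate the trade-off being discussed, here is a minimal sketch in Python (hypothetical class and method names, not MLlib's actual API) contrasting the builder pattern with static train() overloads:

```python
# Builder-style estimator: each setter returns self, so callers chain the
# parameters they care about and defaults need no extra overloads.
class HierarchicalClustering:
    def __init__(self):
        self.num_clusters = 2
        self.max_iterations = 20

    def set_num_clusters(self, k):
        self.num_clusters = k
        return self

    def set_max_iterations(self, n):
        self.max_iterations = n
        return self

    def run(self, data):
        # ... the actual clustering would happen here ...
        return {"k": self.num_clusters, "iters": self.max_iterations}


# Static-train style: in Java/Scala without default arguments, every default
# combination needs its own overload, which is why such methods accumulate
# duplicated signatures over time.
def train(data, k, max_iterations=20):
    return (HierarchicalClustering()
            .set_num_clusters(k)
            .set_max_iterations(max_iterations)
            .run(data))
```

With the builder, adding a new parameter means adding one setter; with static train(), it can mean adding a whole new overload for each existing signature.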
I understand. I agree that we should start deprecating the old ones.
I am removing the static train() method from the HierarchicalClustering object.
Thanks!
Sorry, one more question. How about consistency between Scala and Python?
MLlib in PySpark doesn't support builder methods. The way a train() method is called in PySpark should be similar to the way it is called in Scala, and I think that means a static train() method.
Yes, that's an issue, but one where there isn't a great solution. Let's discuss it on the JIRA: https://issues.apache.org/jira/browse/SPARK-6682
Alright. Thank you for letting me know.
Thanks @freeman-lab in advance for reviewing! I'd be happy to make a pass too, but will wait to avoid duplicate passes over the code.
@jkbradley, thank you for your quick reply. I understand you are waiting for @freeman-lab's review. I am working on an example and writing the documentation in parallel.
Sorry for modifying the code before your feedback. The main differences are as follows. Thank you for your continuous support.
Test build #29855 has finished for PR 5267 at commit
@freeman-lab, do you know any good evaluations of a hierarchical clustering algorithm besides Within Set Sum of Squared Errors (WSSSE)? For example, I know the Silhouette Coefficient is one such evaluation. However, I think it is hard to implement as a distributed computation.
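For reference, WSSSE itself is easy to express, and each point's contribution is independent, which is why it distributes well (each partition can sum its own points' errors). A minimal non-distributed Python sketch (hypothetical helper, not part of this PR):

```python
# Within Set Sum of Squared Errors: each point contributes the squared
# Euclidean distance to the center of the cluster it is assigned to.
def wssse(points, centers, assignments):
    total = 0.0
    for point, cluster_id in zip(points, assignments):
        center = centers[cluster_id]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total
```

Because the outer sum is associative, in Spark this becomes a simple map (per-point squared error) followed by a reduce (sum), whereas the Silhouette Coefficient needs pairwise distances between points, which is much harder to distribute.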
this
}

def getSubIterations: Int = this.maxIterations
Why the name swap? Shouldn't this be getMaxIterations?
@yu-iskw I'm not familiar with any other self-contained metrics (there are a bunch of metrics for relating estimated clusters to some known ground-truth clustering, but I don't think that's what you mean). Are you wanting to provide other outputs to the user to assess clustering quality?
}
}

def getClusters(): Array[ClusterTree] = this.tree.getLeavesNodes()
Remove parentheses after getClusters
@yu-iskw I'm still going through the patch, but so far it's looking good! I've also been testing it locally. Is there a reason you removed the
Test build #31035 has started for PR 5267 at commit
@freeman-lab thank you for reviewing. I will modify it soon. The reason for removing
@yu-iskw that makes sense! I do think the linkage matrix / merge list is a general enough data structure for this algorithm that it's definitely worth having as an output, and it doesn't actually depend on scipy. The way you had it before is, I think, basically the same thing used in scipy, R, and MATLAB. I would call the method
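For readers unfamiliar with the linkage-matrix format mentioned here: each row records one merge as `[left_id, right_id, distance, cluster_size]`, where original points get ids `0..n-1` and the cluster formed by row `i` gets id `n + i`. A small pure-Python sketch of how such a matrix is assembled from a merge sequence (hypothetical helper, for illustration only):

```python
# Build a scipy/R/MATLAB-style linkage matrix from a list of merges.
# merges: list of (left_id, right_id, distance) in merge order.
# n: number of original data points (leaf clusters).
def build_linkage(merges, n):
    linkage = []
    sizes = {i: 1 for i in range(n)}  # every original point is a singleton
    for i, (left, right, dist) in enumerate(merges):
        size = sizes[left] + sizes[right]
        sizes[n + i] = size  # the new cluster gets id n + i
        linkage.append([left, right, dist, size])
    return linkage

# Three points: merge points 0 and 1 first (new cluster id 3),
# then merge cluster 3 with point 2.
linkage = build_linkage([(0, 1, 0.5), (3, 2, 1.2)], n=3)
```

This row format is exactly what scipy's `scipy.cluster.hierarchy` dendrogram utilities consume, which is why exposing it makes the output easy to plot downstream without any scipy dependency in Spark itself.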
Test build #33218 has finished for PR 5267 at commit
Sorry for the delay in my response. I understand the linkage matrix is the common data structure. I supported the
Test build #33221 has finished for PR 5267 at commit
Test build #33227 has finished for PR 5267 at commit
Test build #33240 has finished for PR 5267 at commit
Test build #33316 has finished for PR 5267 at commit
@freeman-lab how is it going? Thank you for your great support.
Test build #44615 has finished for PR 5267 at commit
Test build #44642 has finished for PR 5267 at commit
Test build #44646 has finished for PR 5267 at commit
Refactor bisecting k-means
val random = new Random(seed)
var numLeafClustersNeeded = k - 1
var level = 1
while (activeClusters.nonEmpty && numLeafClustersNeeded > 0 && level < 63) {
We should use val levelLimit = log10(Long.MaxValue) / log10(2) instead of the hard-coded 63. It's a minor issue.
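The bound being suggested can be checked numerically. The level counter indexes nodes of a binary tree whose ids are packed into a signed 64-bit Long, so the depth must stay below log2(Long.MaxValue). A quick Python check (Long.MaxValue is 2^63 - 1):

```python
import math

# Depth limit for packing binary-tree node ids into a signed 64-bit Long:
# 2**level must not exceed Long.MaxValue, i.e. level < log2(2**63 - 1) ~ 63.
LONG_MAX = 2 ** 63 - 1
level_limit = math.log10(LONG_MAX) / math.log10(2)
```

So the expression and the literal 63 agree; computing it from Long.MaxValue just documents where the constant comes from.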
Test build #45396 has finished for PR 5267 at commit
Test build #45406 has finished for PR 5267 at commit
LGTM. Merged into master and branch-1.6. Thanks!
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting k-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version. (#2906)
- However, it is 1000x faster than the previous version through parallel processing.
Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes #5267 from yu-iskw/new-hierarchical-clustering.
(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Thank you for merging it!!
Awesome, nice job all!
@freeman-lab thank you for your great support!
@yu-iskw Thanks for persevering and getting this merged!
@jkbradley thanks!
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later.
https://issues.apache.org/jira/browse/SPARK-6517
Thank you for your great cooperation, RJ Nowling(@rnowling), Jeremy Freeman(@freeman-lab), Xiangrui Meng(@mengxr) and Sean Owen(@srowen).