[SPARK-2429] [MLlib] Hierarchical Implementation of KMeans #2906
Conversation
Can one of the admins verify this patch?
```scala
val treeRoot = this.clusterTree
val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
data.sparkContext.broadcast(closestClusterIndexFinder)
val predicted = data.map(point => (closestClusterIndexFinder(point), point))
```
I don't think you're using the broadcast variable correctly:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
Modified the way the broadcast variable is used: yu-iskw@290d492
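For readers following along, the fix hinges on keeping the handle returned by `sc.broadcast` and reading its `.value` inside the closure. Below is a minimal stand-alone sketch of that pattern; it uses a toy `Bc` wrapper in place of Spark's real `Broadcast[T]` so it runs without a cluster, and `finder` is an illustrative stand-in for `closestClusterIndexFinder`.

```scala
// Toy stand-in for Spark's Broadcast[T]: the point is the access pattern,
// not the distribution mechanics.
final case class Bc[T](value: T)
def broadcast[T](v: T): Bc[T] = Bc(v)

// Illustrative stand-in for closestClusterIndexFinder: point => cluster index.
val finder: Array[Double] => Int = p => if (p.sum < 0.0) 0 else 1

// Wrong: calling broadcast and discarding the handle achieves nothing;
// a closure written afterwards would still capture `finder` directly.
// broadcast(finder)
// data.map(p => (finder(p), p))

// Right: keep the returned handle and read .value inside the closure.
val bcFinder = broadcast(finder)
val data = Seq(Array(-1.0, -2.0), Array(2.0, 3.0))
val predicted = data.map(p => (bcFinder.value(p), p))
```

In real Spark code, reading `bcFinder.value` inside the task closure is what lets executors fetch the broadcast copy instead of shipping the object with every task.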
I just gave this a quick read-through, and the structure makes sense. I left several small comments. I see the chunks of logic I would expect, but did not evaluate them in detail. The existence of some tests suggests this probably basically works :) I am wondering about performance too, as this relies on Scala idioms in many places; it might be worth a quick look with JProfiler, if you can, to see if there are any easy-win optimizations.
Jenkins, add to whitelist.
ok to test
Test build #22177 has started for PR 2906 at commit
Test build #22179 has started for PR 2906 at commit
Test build #22177 has finished for PR 2906 at commit
Test FAILed.
@yu-iskw I added you to the whitelist. Future commits from you should trigger Jenkins automatically. Just took a very brief scan over the code and really appreciate the fact that more than half of the code is doc/test/example. I will check the implementation after the feature freeze. Some high-level questions for now:
Test build #22179 has finished for PR 2906 at commit
Test FAILed.
Test build #22267 has started for PR 2906 at commit
Test build #22268 has started for PR 2906 at commit
Test build #22270 has started for PR 2906 at commit
Test build #22268 has finished for PR 2906 at commit
Test PASSed.
Test build #22267 has finished for PR 2906 at commit
Test PASSed.
```scala
// TODO: Support distance metrics other than the Euclidean distance metric
val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
val treeRoot = this.clusterTree
sc.broadcast(metric)
```
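The Breeze expression above computes the ordinary Euclidean (L2) distance between two vectors. A plain-Scala sketch (no Breeze dependency; the `Metric` alias and function names are illustrative, not the PR's actual API) shows the same computation, and makes the TODO about pluggable metrics concrete: supporting another metric is just supplying another function.

```scala
// A metric is any function from two points to a non-negative distance.
type Metric = (Array[Double], Array[Double]) => Double

// Euclidean (L2) distance: equivalent to breezeNorm(bv1 - bv2, 2.0).
val euclidean: Metric = (a, b) => {
  require(a.length == b.length, "dimension mismatch")
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}

// A second metric, e.g. Manhattan (L1), could then be dropped in unchanged.
val manhattan: Metric = (a, b) =>
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
```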
The return value of sc.broadcast is not captured here; see my other note about sc.broadcast.
Hi @yu-iskw and @rnowling , I've spent time reviewing the code and using it in both Python and Scala. Overall great work, terrific to see my little gist turned into something so refined and performant! =) I left lots of comments, most minor, though documenting the caching behavior seems quite important. The one significant addition I'd suggest is exposing another model output: a list of the centers at all nodes in the learned tree. This would be in addition to just the centers of the leaves, which is currently returned by
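The suggestion above (expose the centers at all nodes, not just the leaves) amounts to a tree traversal. A small sketch with a hypothetical `ClusterNode` type (the names here are illustrative, not the PR's actual classes):

```scala
// Hypothetical node type: every node carries a center; children is empty at leaves.
final case class ClusterNode(center: Array[Double], children: List[ClusterNode])

// Centers of every node in the tree, root first (the proposed extra output).
def allCenters(node: ClusterNode): List[Array[Double]] =
  node.center :: node.children.flatMap(allCenters)

// Centers of the leaves only (what the model currently returns).
def leafCenters(node: ClusterNode): List[Array[Double]] =
  if (node.children.isEmpty) List(node.center)
  else node.children.flatMap(leafCenters)
```

For a tree with a root and two leaves, `allCenters` yields three centers while `leafCenters` yields two, which is exactly the distinction being requested.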
```java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class JavaHierarchicalClustering {
```
The other example code I see forgoes much of the boilerplate here (declaring a class, a main method, System.out, etc.). The indentation here is also significantly deeper than the code's 2-space standard. Addressing these would make the example easier to scan on the web page.
@freeman-lab @srowen @mengxr many thanks!
@freeman-lab, @srowen, I apologize for the delay in replying. I will modify the code ASAP. https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark Should we continue with this PR, or replace it with the new one? Thanks!
I've spoken with @freeman-lab. I am going to send a new PR after replacing the algorithm with the new one and adding wrapper classes for the ml package.
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting k-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version (#2906).
- However, it is 1000x faster than the previous version thanks to parallel processing.
Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes #5267 from yu-iskw/new-hierarchical-clustering.
(cherry picked from commit 8a23368)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
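The "bisecting k-means" idea mentioned in the commit message: start with all points in one cluster, then repeatedly pick the largest cluster and split it in two with a small 2-means step until k clusters exist. The sketch below is a drastically simplified, driver-side toy on 1-D points (the actual implementation runs the splits in parallel over an RDD); all names here are illustrative, not the PR's API.

```scala
// Split one 1-D cluster into two via a few Lloyd iterations with k = 2.
def bisect(points: Vector[Double]): (Vector[Double], Vector[Double]) = {
  var c1 = points.min  // seed the two centers at the extremes
  var c2 = points.max
  var parts = (Vector.empty[Double], Vector.empty[Double])
  for (_ <- 0 until 10) {
    // Assign each point to its nearer center, then recompute the centers.
    parts = points.partition(p => math.abs(p - c1) <= math.abs(p - c2))
    if (parts._1.nonEmpty) c1 = parts._1.sum / parts._1.size
    if (parts._2.nonEmpty) c2 = parts._2.sum / parts._2.size
  }
  parts
}

// Bisecting loop: always split the currently largest cluster.
def bisectingKMeans(points: Vector[Double], k: Int): Vector[Vector[Double]] = {
  var clusters = Vector(points)
  while (clusters.size < k) {
    val largest = clusters.maxBy(_.size)
    val (a, b) = bisect(largest)
    if (a.isEmpty || b.isEmpty) return clusters  // cannot split further
    clusters = clusters.filterNot(_ eq largest) :+ a :+ b
  }
  clusters
}
```

The divisive structure also explains the "centers at all nodes" discussion earlier: each split adds an internal node whose center is a coarser summary of its two children.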
I want to add a divisive hierarchical clustering algorithm implementation to MLlib. Distance metrics other than the Euclidean distance metric are not supported yet; it would be nice to add them in a follow-up issue.
Could you review it?
Thanks!