[SPARK-8402][MLLIB] DP Means Clustering #6880
Test build #35125 timed out for PR 6880 at commit
Is there any difference between this and a KMeansModel? Might we be able to consolidate them into something like a ClusterCentersModel?
@sryza Thanks for the comment. @mengxr @jkbradley Could you please give your opinion on this?
@sryza The DpMeansModel class was designed following KMeansModel and GaussianMixtureModel. Do you know whether there is any plan to consolidate the cluster model classes into something like ClusterCentersModel?
I don't know if there are plans, just thought it might be a good idea now that three's a crowd. Probably best to wait for @mengxr or @jkbradley to weigh in before making changes.
OK, @sryza.
NAVER - http://www.naver.com/
The mail <Re: [spark] [SPARK-8402][MLLIB] DP Means Clustering (#6880)> that you sent to sujkh@naver.com could not be delivered for the following reason.
The recipient has blocked mail from your address.
I don't expect users to call clustering models via a generic interface, at least for now. So we don't need to address this in this PR.
@mengxr Could you please share your comments on this PR?
Please follow the Spark code style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
@FlytxtRnD I haven't checked the implementation yet. Some high-level comments:
On the algorithm part, could you list a few success stories comparing k-means and DP-means? Some benchmark results would also help.
Test build #36766 has finished for PR 6880 at commit
@mengxr I have reduced the PR length so that it would be easier for you to review. The style issues have been fixed wherever they were observed.
@mengxr Could you please tell me how to generate the API docs? I ran build/sbt unidoc as mentioned in https://github.com/apache/spark/blob/master/docs/README.md, but it ends in an assertion error. Please help.
@mengxr @jkbradley Gentle reminder.
@FlytxtRnD To generate the docs, I've always used jekyll (following the instructions on that same page). I know that builds more than you want, but does that at least work? Sorry this PR is having to wait a bit for full review!
@jkbradley To generate the docs, I installed jekyll. The jekyll build command fails with an error, but SKIP_API=1 jekyll build completes successfully. Could you please help me solve this?
@FlytxtRnD You might need
Thank you @mengxr. We will take a look at the PR you mentioned. We are looking forward to having DP-means in the 1.6 release. Thanks a lot for your kind support.
@mengxr We have updated the JIRA ticket to include the benchmark results as well. Could you please take a look and give your suggestions?
DP means -> DP-means, which is the form used in the original paper, similar to k-means
Test build #42899 has finished for PR 6880 at commit
@mengxr @jkbradley I have incorporated the suggestions and changes and updated the PR. Could you please take another look?
@mengxr Could you please have a look at this?
I think we could also implement this with a single aggregateByKey. That would make it simpler and more efficient.
@yu-iskw Could you please elaborate on this suggestion?
@FlytxtRnD Sure! It would work something like the commit below, though I haven't verified it carefully yet.
yu-iskw@4070ae6
Thank you @yu-iskw for the review comments. We will update the PR ASAP.
Test build #44917 has finished for PR 6880 at commit
@FlytxtRnD thank you for the update. We should add
@yu-iskw @jkbradley @mengxr Do you have any other review comments?
Test build #54751 has finished for PR 6880 at commit
Test build #55098 has finished for PR 6880 at commit
Test build #55108 has finished for PR 6880 at commit
Test build #69330 has finished for PR 6880 at commit
I'm closing this few-year-old PR because the corresponding JIRA (https://issues.apache.org/jira/browse/SPARK-8402) was closed as "Won't Fix".
DP-means is a nonparametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters. It clusters data points without requiring the number of clusters to be specified in advance.
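
For readers unfamiliar with the algorithm, here is a minimal single-machine sketch of how 'lambda' drives cluster creation. It is not the code from this PR: the names `dp_means` and `lam` are hypothetical, and comparing `lam` against the squared Euclidean distance is an assumption based on the usual formulation of DP-means. A point that is farther than `lam` from every existing center starts a new cluster; otherwise it joins the nearest one, and centers are then recomputed as in k-means.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def dp_means(
    points: List[Point], lam: float, max_iter: int = 100
) -> Tuple[List[List[float]], List[int]]:
    """Illustrative serial DP-means (hypothetical helper, not the PR's API).

    `lam` is assumed to be a threshold on squared distance.
    """
    centers = [mean(points)]       # start with one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:
                # Too far from every center: open a new cluster at this point.
                centers.append(list(p))
                best = len(centers) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True
        # Recompute centers as cluster means, dropping clusters left empty.
        members: Dict[int, List[Point]] = {}
        for label, p in zip(labels, points):
            members.setdefault(label, []).append(p)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean(members[k]) for k in kept]
        labels = [remap[label] for label in labels]
        if not changed:
            break
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = dp_means(data, lam=9.0)
    print(len(centers), labels)
```

A smaller `lam` yields more clusters and a larger one yields fewer, so the parameter plays the role that k plays in k-means. A distributed version would replace the inner loop with per-partition assignment and aggregation, which this sketch does not attempt.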