[SPARK-8402][MLLIB] DP Means Clustering #6880

FlytxtRnD · 2015-06-18T11:00:29Z

DP-means is a nonparametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters. It clusters data points without requiring the number of clusters to be specified in advance.
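
For readers unfamiliar with the algorithm, the following is a minimal single-machine sketch of the serial DP-means procedure as I understand it from the original paper. It is not the code in this PR. It assumes squared Euclidean distance, a single initial center at the global mean, and a fixed iteration cap; the name dp_means and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def dp_means(points: List[Point], lam: float, max_iter: int = 100):
    """Hypothetical helper: returns (centers, assignments)."""
    centers = [mean(points)]  # assumption: start with one center at the global mean
    assign = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:
                # Farther than lambda from every center: open a new cluster here.
                centers.append(p)
                best = len(centers) - 1
            if assign[i] != best:
                assign[i] = best
                changed = True
        # Recompute each center as the mean of its members and drop empty
        # clusters, renumbering the remaining ones consecutively.
        members: Dict[int, List[Point]] = {}
        for i, c in enumerate(assign):
            members.setdefault(c, []).append(points[i])
        old_ids = sorted(members)
        remap = {old: new for new, old in enumerate(old_ids)}
        centers = [mean(members[old]) for old in old_ids]
        assign = [remap[c] for c in assign]
        if not changed:
            break
    return centers, assign
```

With a small lambda, well-separated groups of points each end up with their own center; with a very large lambda, no point is ever far enough away to open a new cluster and the result collapses to the single initial cluster.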

SparkQA · 2015-06-18T15:43:17Z

Test build #35125 timed out for PR 6880 at commit 0c0a478 after a configured wait of 175m.

sryza · 2015-06-19T00:54:49Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala

Is there any difference between this and a KMeansModel? Might we be able to consolidate them into something like a ClusterCentersModel?

@sryza Thanks for the comment. @mengxr @jkbradley Could you please give your opinion on this?

@sryza The DpMeansModel class was designed following KMeansModel and GaussianMixtureModel. Do you know whether there is any plan to consolidate the cluster model classes into something like ClusterCentersModel?

I don't know if there are plans, just thought it might be a good idea now that three's a crowd. Probably best to wait for @mengxr or @jkbradley to weigh in before making changes.

OK @sryza.

I don't expect users to call clustering models via a generic interface, at least for now. So we don't need to address this in this PR.

FlytxtRnD · 2015-06-29T12:46:27Z

@mengxr Could you please share your comments on this PR?

mengxr · 2015-07-01T23:41:55Z

examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala

Please follow Spark code style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

mengxr · 2015-07-01T23:47:18Z

@FlytxtRnD I haven't checked the implementation yet. Some high-level comments:

- Please follow the code style guide. I saw wrong indentation, extra spacing, and vertical alignment in your code.
- Move save/load and the example code to follow-up PRs. Keep this PR small to speed up the code review.
- Check the generated API doc. Usually this is the simplest way to find public APIs that should be private.

On the algorithm part, could you list a few success stories comparing k-means and DP-means? Some benchmark results would also help.

SparkQA · 2015-07-08T08:08:45Z

Test build #36766 has finished for PR 6880 at commit 907f4f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(

FlytxtRnD · 2015-07-09T10:46:31Z

@mengxr I have reduced the PR length so that it would be easier for you to review. The style issues have been fixed wherever they were observed.
I will change the paper name in the next update and the benchmark results will also be ready asap.
Could you please review this updated PR and give suggestions?

FlytxtRnD · 2015-07-09T11:49:14Z

@mengxr Could you please tell me how to generate the API docs? I ran build/sbt unidoc as described in https://github.com/apache/spark/blob/master/docs/README.md, but it ends with an assertion error. Please help.

FlytxtRnD · 2015-07-19T15:14:10Z

@mengxr @jkbradley Gentle reminder.

jkbradley · 2015-07-21T05:48:24Z

@FlytxtRnD To generate the docs, I've always used jekyll (following the instructions on that same page). I know that builds more than you want, but does that at least work?

Sorry this PR is having to wait a bit for full review!

FlytxtRnD · 2015-07-28T11:46:34Z

@jkbradley To generate the docs, I installed jekyll. The jekyll build command fails with the following error:
[info] Done updating.
[error] (catalyst/compile:compile) Compilation failed
[error] Total time: 601 s, completed 28 Jul, 2015 4:25:49 PM
Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
jekyll 2.5.3 | Error: No such file or directory - ../target/scala-2.10/unidoc/.

However, SKIP_API=1 jekyll build completes successfully. Could you please help me solve this?

mengxr · 2015-07-28T19:48:12Z

@FlytxtRnD You might need to run build/sbt clean first. Given the review bandwidth, we may not be able to get this into 1.5, so I will make another pass after the 1.5 feature freeze. In the meantime, it would be super helpful if you could help review some other PRs that are on the 1.5 roadmap, e.g. #5267 (bisecting k-means). Thanks!

FlytxtRnD · 2015-07-29T04:45:59Z

Thank you @mengxr. We will take a look at the PR you mentioned. We are looking forward to having DP-means in the 1.6 release. Thanks a lot for your kind support.

FlytxtRnD · 2015-09-02T07:00:02Z

@mengxr We have updated the JIRA ticket to include the benchmark results as well. Could you please take a look and give your suggestions?

mengxr · 2015-09-08T17:09:27Z

examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala

DP means -> DP-means which is used in the original paper, similar to k-means

SparkQA · 2015-09-23T09:02:18Z

Test build #42899 has finished for PR 6880 at commit e796866.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- case class WeightedPoint(vector: Vector, count: Long)
- class DpMeansModel(

FlytxtRnD · 2015-09-28T05:21:11Z

@mengxr @jkbradley I have incorporated the suggested changes and updated the PR. Could you please take another look?

FlytxtRnD · 2015-10-19T09:25:15Z

@mengxr Could you please take a look at this?

yu-iskw · 2015-11-02T00:06:49Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala

I think we could also implement this with a single aggregateByKey. That would make it simpler and more efficient.

@yu-iskw Could you please elaborate on this?

@FlytxtRnD Sure! It would work something like this, though I haven't verified it carefully yet.
yu-iskw@4070ae6
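
To illustrate the suggestion, here is a plain-Python sketch of the aggregateByKey pattern applied to the center-update step. It is not the code in the linked commit. It emulates, on in-memory lists standing in for partitions, the shape of Spark's aggregateByKey(zeroValue)(seqOp, combOp): seq_op folds one point into a per-cluster (sum, count) accumulator inside a partition, and comb_op merges accumulators from different partitions. The helper names aggregate_by_key, seq_op, comb_op, and update_centers, and the (cluster id, point) data layout, are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

Vec = Tuple[float, ...]
Acc = Tuple[Vec, int]  # (component-wise sum of points, number of points)


def aggregate_by_key(
    partitions: Iterable[Iterable[Tuple[int, Vec]]],
    zero: Callable[[], Acc],
    seq_op: Callable[[Acc, Vec], Acc],
    comb_op: Callable[[Acc, Acc], Acc],
) -> Dict[int, Acc]:
    """Local stand-in for Spark's aggregateByKey (hypothetical helper)."""
    merged: Dict[int, Acc] = {}
    for part in partitions:
        local: Dict[int, Acc] = {}
        for key, value in part:
            # Fold each value into this partition's accumulator for its key.
            local[key] = seq_op(local.get(key, zero()), value)
        for key, acc in local.items():
            # Merge the per-partition accumulators key by key.
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


def seq_op(acc: Acc, point: Vec) -> Acc:
    """Add one point to a (sum, count) accumulator."""
    total, count = acc
    return tuple(s + x for s, x in zip(total, point)), count + 1


def comb_op(a: Acc, b: Acc) -> Acc:
    """Merge two (sum, count) accumulators."""
    return tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]


def update_centers(
    partitions: Iterable[Iterable[Tuple[int, Vec]]], dim: int
) -> Dict[int, Vec]:
    """New center per cluster id, computed in a single aggregation pass."""
    stats = aggregate_by_key(
        partitions, lambda: ((0.0,) * dim, 0), seq_op, comb_op
    )
    return {k: tuple(s / n for s in total) for k, (total, n) in stats.items()}
```

The point of the pattern is that each partition emits one small (sum, count) pair per cluster rather than every point, so only those pairs need to be combined across partitions before dividing to get the new centers.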

FlytxtRnD · 2015-11-02T05:44:49Z

Thank you @yu-iskw for the review comments. Will update the PR ASAP.

FlytxtRnD · 2015-11-03T11:52:19Z

@yu-iskw The PR is updated. Shall I add @Since tags to the methods, or is that done after the PR is merged? Please share any other suggestions you have.

SparkQA · 2015-11-03T12:27:01Z

Test build #44917 has finished for PR 6880 at commit b088e46.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- class DpMeansModel(

yu-iskw · 2015-11-03T16:36:52Z

@FlytxtRnD thank you for the update. We should add @Since tags in the first commit.
Btw, I haven't read the original paper carefully yet. I'll review this PR in terms of the algorithm.

FlytxtRnD · 2015-11-04T03:57:34Z

@yu-iskw I didn't understand your comment on @Since tags. We will be waiting for further review comments.

FlytxtRnD · 2015-11-13T11:09:59Z

@yu-iskw @jkbradley @mengxr any other review comments, please.

SparkQA · 2016-04-02T02:38:52Z

Test build #54751 has finished for PR 6880 at commit b088e46.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T07:12:31Z

Test build #55098 has finished for PR 6880 at commit 23316d4.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T12:32:26Z

Test build #55108 has finished for PR 6880 at commit c25eae2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T15:16:49Z

Test build #69330 has finished for PR 6880 at commit c25eae2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-07-15T01:20:23Z

I'm closing this few-year-old PR because the corresponding JIRA (https://issues.apache.org/jira/browse/SPARK-8402) was closed as "Won't Fix".

FlytxtRnD added 3 commits June 18, 2015 13:24

Add DP means clustering

99ed7de

check whether input iterator is empty

bcbfc1d

style check corrections

0c0a478

sryza reviewed Jun 19, 2015

mengxr reviewed Jul 1, 2015

style corrections,removed model save/load

907f4f1

Merge remote-tracking branch 'upstream/master' into DpMeans

ed44a7c

mengxr reviewed Sep 8, 2015

FlytxtRnD added 3 commits September 16, 2015 11:20

Merge remote-tracking branch 'upstream/master' into DpMeans

97a25da

Merge remote-tracking branch 'upstream/master' into DpMeans

f8d6937

first review changes

e796866

yu-iskw reviewed Nov 2, 2015

FlytxtRnD added 2 commits November 2, 2015 12:09

Merge remote-tracking branch 'upstream/master' into DpMeans

94a95a9

replaced mapPartitions with aggregate

b088e46

FlytxtRnD mentioned this pull request Dec 31, 2015

[SPARK-6724][WIP][MLLIB]Model import/export for FPGrowth #7320

Closed

FlytxtRnD added 2 commits April 5, 2016 15:03

Merge remote-tracking branch 'upstream/master' into DpMeans

fd90f2b

scalastyle changes

23316d4

logging import corrected

c25eae2

dongjoon-hyun added the MLLIB label Jun 14, 2019

JoshRosen closed this Jul 15, 2019
