Skip to content

Conversation

@FlytxtRnD
Copy link
Contributor

DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters. This algorithm helps to cluster data points without specifying the number of clusters in advance.

@SparkQA
Copy link

SparkQA commented Jun 18, 2015

Test build #35125 timed out for PR 6880 at commit 0c0a478 after a configured wait of 175m.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference between this and a KMeansModel? Might we be able to consolidate them into something like a ClusterCentersModel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sryza-Thanks for the comment. @mengxr @jkbradley Could you please give your opinon on the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sryza DpMeansModel class was designed following the KMeansModel and GaussianMixtureModel. Do you know whether there is any plan to consolidate the clustermodel classes to something like ClusterCentersModel ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there are plans, just thought it might be a good idea now that three's a crowd. Probably best to wait for @mengxr or @jkbradley to weigh in before making changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok @sryza ..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAVER - http://www.naver.com/

sujkh@naver.com 님께 보내신 메일 <Re: [spark] [SPARK-8402][MLLIB] DP Means Clustering (#6880)> 이 다음과 같은 이유로 전송 실패했습니다.


받는 사람이 회원님의 메일을 수신차단 하였습니다.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect users call clustering models via a generic interface, as least for now. So we don't need to address this in this PR.

@FlytxtRnD
Copy link
Contributor Author

@mengxr Could you please say your comments on this PR ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mengxr
Copy link
Contributor

mengxr commented Jul 1, 2015

@FlytxtRnD I haven't checked the implementation yet. Some high-level comments:

  1. Please follow the code style guide. I saw wrong indentation, extra spacing, vertical alignment in your code.
  2. Move save/load and the example code to follow-up PRs. Keep this PR small to accelerate the code review.
  3. Check the generated API doc. Usually this is the simplest way to find public APIs that should be private.

On the algorithm part, could you list a few successful stories about k-means vs. kp-means? Some benchmark result also helps.

@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #36766 has finished for PR 6880 at commit 907f4f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(

@FlytxtRnD
Copy link
Contributor Author

@mengxr I have reduced the PR length so that it would be easier for you to review. The style issues have been fixed wherever they were observed.
I will change the paper name in the next update and the benchmark results will also be ready asap.
Could you please review this updated PR and give suggestions?

@FlytxtRnD
Copy link
Contributor Author

@mengxr Could you please tell me how to generate the API docs? I run build/sbt unidoc as mentioned in https://github.com/apache/spark/blob/master/docs/README.md. But it ends in an assertion error. Please help.

@FlytxtRnD
Copy link
Contributor Author

@mengxr @jkbradley Gentle remainder.

@jkbradley
Copy link
Member

@FlytxtRnD To generate the docs, I've always used jekyll (following the instructions on that same page). I know that builds more than you want, but does that at leasdt work?

Sorry this PR is having to wait a bit for full review!

@FlytxtRnD
Copy link
Contributor Author

@jkbradley To generate docs, I installed jekyll. In jekyll build command, it is showing error.
[info] Done updating. [error] (catalyst/compile:compile) Compilation failed [error] Total time: 601 s, completed 28 Jul, 2015 4:25:49 PM Moving back into docs dir. Making directory api/scala cp -r ../target/scala-2.10/unidoc/. api/scala jekyll 2.5.3 | Error: No such file or directory - ../target/scala-2.10/unidoc/.

But SKIP_API=1 jekyll build is successfully completed. Could you please help me to solve this?

@mengxr
Copy link
Contributor

mengxr commented Jul 28, 2015

@FlytxtRnD You might need build/sbt clean first. Given the review bandwidth, we may not be able to make this into 1.5. So I will make another pass after the 1.5 feature freeze. In the meantime, it would be super helpful if you can help review some other PRs that are on the 1.5 roadmap, e.g. #5267 (bisecting k-means). Thanks!

@FlytxtRnD
Copy link
Contributor Author

Thank you @mengxr . We will take a look into the PR you mentioned.We are looking forward to have DP-Means in the 1.6 release. Thanks a lot for your kind support.

@FlytxtRnD
Copy link
Contributor Author

@mengxr We have updated the JIRA ticket to include the benchmark results as well..Could you please take a look and give your suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DP means -> DP-means which is used in the original paper, similar to k-means

@SparkQA
Copy link

SparkQA commented Sep 23, 2015

Test build #42899 has finished for PR 6880 at commit e796866.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • case class WeightedPoint(vector: Vector, count: Long)
    • class DpMeansModel(

@FlytxtRnD
Copy link
Contributor Author

@mengxr @jkbradley I have incorporated the suggestions and changes and updated the PR. Could you please take another look ?

@FlytxtRnD
Copy link
Contributor Author

@mengxr Could you please have a look into this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also implement this with a single aggregateByKey. It makes this more simple and more efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yu-iskw Could you please give more inputs on this note?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FlytxtRnD sure! This would work like this. However, I haven't confirmed it carefully yet.
yu-iskw@4070ae6

@FlytxtRnD
Copy link
Contributor Author

Thank you @yu-iskw for the review comments.. Will update the PR asap

@FlytxtRnD
Copy link
Contributor Author

@yu-iskw PR is updated. Shall I include @SInCE to the methods? Or is it done after getting merged? Please provide any other suggestions, if any.

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44917 has finished for PR 6880 at commit b088e46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Params(\n * class DpMeansModel(\n

@yu-iskw
Copy link
Contributor

yu-iskw commented Nov 3, 2015

@FlytxtRnD thank you for the update. We should add @Since tags in the first commit.
Btw, I haven't read the original paper carefully yet. I'll review this PR in terms of the algorithm.

@FlytxtRnD
Copy link
Contributor Author

@yu-iskw I didn't get your comment on @SInCE tags. We will be waiting for further review comments.

@FlytxtRnD
Copy link
Contributor Author

@yu-iskw @jkbradley @mengxr any other review comments, please.

@SparkQA
Copy link

SparkQA commented Apr 2, 2016

Test build #54751 has finished for PR 6880 at commit b088e46.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55098 has finished for PR 6880 at commit 23316d4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55108 has finished for PR 6880 at commit c25eae2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 29, 2016

Test build #69330 has finished for PR 6880 at commit c25eae2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

I'm closing this few-year-old PR because the corresponding JIRA (https://issues.apache.org/jira/browse/SPARK-8402) was closed as "Won't Fix".

@JoshRosen JoshRosen closed this Jul 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants