SPARK-1215 [MLLIB]: Clustering: Index out of bounds error #1407

jkbradley · 2014-07-14T17:56:45Z

Bug fix for JIRA SPARK-1215: Clustering: Index out of bounds error

https://issues.apache.org/jira/browse/SPARK-1215

Solution: When kMeansPlusPlus initialization runs out of distinct points, print a warning and reuse a point as a duplicate cluster center, so that exactly k centers are returned.
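
A minimal sketch of this approach, written in plain Scala without Spark. The object name `KMeansPlusPlusSketch`, the `Array[Double]` point type, the unweighted costs, and the stderr warning are assumptions for illustration; the actual change lives in LocalKMeans.scala, works on MLlib vectors, and calls `logWarning`.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cost of a point: squared distance to its nearest already-chosen center.
  def pointCost(centers: Seq[Point], p: Point): Double =
    centers.map(sqDist(_, p)).min

  // k-means++ style initialization that always returns exactly k centers.
  def initCenters(points: Array[Point], k: Int, rand: Random): Array[Point] = {
    require(points.nonEmpty && k > 0, "need at least one point and k > 0")
    val centers = new Array[Point](k)
    centers(0) = points(rand.nextInt(points.length)).clone()
    for (i <- 1 until k) {
      val current = centers.take(i).toSeq
      val costs = points.map(pointCost(current, _))
      // Sample a threshold proportional to the total remaining cost.
      val r = rand.nextDouble() * costs.sum
      var cumulative = 0.0
      var j = 0
      while (j < points.length && cumulative < r) {
        cumulative += costs(j)
        j += 1
      }
      if (j == 0) {
        // The loop never advanced (for example, every point already coincides
        // with a chosen center, so the total cost is zero). Indexing
        // points(j - 1) would be out of bounds, so reuse a point instead.
        Console.err.println(
          s"Ran out of distinct points for centers. Using duplicate point for center k = $i.")
        centers(i) = points(0).clone()
      } else {
        centers(i) = points(j - 1).clone()
      }
    }
    centers
  }
}
```

With three identical points and k = 5, `KMeansPlusPlusSketch.initCenters` returns five copies of the same point instead of throwing.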

Parameterizing max memory.

Fixing scalastyle issue.

mengxr · 2014-07-15T10:06:07Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala

+      if (j == 0) {
+        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
+          s" Using duplicate point for center k = $i.")
+        j = 1


The code may be clearer if written in this way:

    centers(i) = if (j == 0) {
      logWarning("...")
      points(0).toDense
    } else {
      points(j - 1).toDense
    }

or

    if (j == 0) {
      logWarning("...")
      centers(i) = points(0).toDense
    } else {
      centers(i) = points(j - 1).toDense
    }

jkbradley · 2014-07-17

I'll go with the second suggestion.
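
To see why the `j == 0` branch matters, here is a small sketch of the sampling loop in isolation. The names `SampleIndexSketch`, `walk`, and `safeIndex`, and the plain `Array[Double]` of per-point costs, are hypothetical and not taken from LocalKMeans.scala; the sketch only assumes that the loop accumulates costs until they reach a random threshold `r`.

```scala
object SampleIndexSketch {
  // Returns the loop counter j after accumulating costs up to the threshold r.
  def walk(costs: Array[Double], r: Double): Int = {
    var cumulative = 0.0
    var j = 0
    while (j < costs.length && cumulative < r) {
      cumulative += costs(j)
      j += 1
    }
    j
  }

  // Index of the point to use as the next center. When the walk never
  // advances (j == 0), j - 1 would be -1, so fall back to index 0.
  def safeIndex(costs: Array[Double], r: Double): Int = {
    val j = walk(costs, r)
    if (j == 0) 0 else j - 1
  }
}
```

When every cost is zero, the threshold is zero as well, so `walk` returns 0 and an unguarded `points(j - 1)` would read index -1. `safeIndex` returns 0 in that case and `j - 1` otherwise.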

mengxr · 2014-07-15T10:09:02Z

@jkbradley The fix looks good to me except for some minor style issues. Thanks for fixing it! Btw, please add [MLLIB] to the title so this is easy to find.

SparkQA · 2014-07-15T20:13:00Z

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull

SparkQA · 2014-07-15T21:54:06Z

QA results for PR 1407:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull

SparkQA · 2014-07-16T05:18:01Z

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull

SparkQA · 2014-07-16T06:56:45Z

QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull

SparkQA · 2014-07-17T18:43:00Z

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

SparkQA · 2014-07-17T18:43:38Z

QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

jkbradley · 2014-07-17T19:25:24Z

I tangled stuff in this PR, so I am closing it and resubmitting (with updates per mengxr's suggestions) as PR 1468: #1468

Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.
Added two related unit tests to KMeansSuite.
(Re-submitting PR after tangling commits in PR 1407 #1407)

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite.

manishamde and others added 30 commits April 20, 2014 13:33

adding support for very deep trees

50b143a

Parameterizing max memory.

abc5a23

Merge pull request #5 from etrain/deep_tree

2f6072c

Parameterizing max memory.

minor: added doc for maxMemory parameter

2f1e093

Fixing scalastyle issue.

0287772

Merge pull request #6 from etrain/deep_tree

fecf89a

Fixing scalastyle issue.

updating user documentation

719d009

merge from master

9dbdabe

updated documentation

1517155

added unit test

718506b

renamed parameter

e0426ee

removed unused imports

dad9652

modified scala.math to math

cbd9f14

added documentation, fixed off by 1 error in max level calculation

5e82202

formatting

4731cda

grammar

5eca9e4

more formatting

8053fed

programming guide blurb

426bb28

formatting

b27ad2c

minor formatting

ce004a1

added docs

7fc9545

merged master

968ca9d

added weighted point class

a1a6e09

changing instance format to weighted labeled point

14aea48

fixed tests

455bea9

todo for multiclass support

46f909c

added multiclass support for find splits bins

4d5f70c

tests for multiclass classification

3f85a17

minor mods

46e06ee

prepared for multiclass without breaking binary classification

6c7af22

manishamde added 2 commits July 14, 2014 16:11

removed label weights support

afced16

fixing weird multiline bug

c8428c4

mengxr reviewed Jul 15, 2014

manishamde and others added 2 commits July 15, 2014 12:53

adding developer api annotation for overridden methods

45e767a

Merge remote-tracking branch 'upstream/master'

66e72d7

jkbradley added 2 commits July 15, 2014 22:16

Merge remote-tracking branch 'upstream/master'

4dd8782

Merge remote-tracking branch 'upstream/master' into manishamde-multiclass

f60f22d

Added DecisionTree.print() method for human-readable model description. Added temp DTRunnerJKB, eventually to merge with DecisionTreeRunner

eb5700f

jkbradley changed the title ~~SPARK-1215: Clustering: Index out of bounds error~~ SPARK-1215 [MLLIB]: Clustering: Index out of bounds error Jul 17, 2014

jkbradley added 2 commits July 17, 2014 10:53

updating DT API, but not done yet

8725f7b

Updated DT API in wrong branch. Moving commit to my branch.

182511f

jkbradley mentioned this pull request Jul 17, 2014

SPARK-1215 [MLLIB]: Clustering: Index out of bounds error (2) #1468

Closed

jkbradley closed this Jul 17, 2014
