Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1215 [MLLIB]: Clustering: Index out of bounds error #1407

Closed
wants to merge 89 commits into from

Conversation

jkbradley
Copy link
Member

Bug fix for JIRA SPARK 1215: Clustering: Index out of bounds error

https://issues.apache.org/jira/browse/SPARK-1215

Solution: Print warning, and use duplicate cluster centers so that exactly k centers are returned.

if (j == 0) {
logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
s" Using duplicate point for center k = $i.")
j = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code may be clearer if written in this way

centers(i) = 
  if (j == 0) {
    logWarning("...")
    points(0).toDense
  } else {
    points(j - 1).toDense
  }

or

if (j == 0) {
  logWarning("...")
  centers(i) = points(0).toDense
} else {
  centers(i) = points(j - 1).toDense
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go with the second suggestion.

@mengxr
Copy link
Contributor

mengxr commented Jul 15, 2014

@jkbradley The fix looks good to me except some minor style issues. Thanks for fixing it! Btw, please add [MLLIB] to the title so this is easy to find.

@SparkQA
Copy link

SparkQA commented Jul 15, 2014

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 15, 2014

QA results for PR 1407:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 16, 2014

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull

…n. Added temp DTRunnerJKB, eventually to merge with DecisionTreeRunner
@SparkQA
Copy link

SparkQA commented Jul 16, 2014

QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull

@jkbradley jkbradley changed the title SPARK-1215: Clustering: Index out of bounds error SPARK-1215 [MLLIB]: Clustering: Index out of bounds error Jul 17, 2014
@SparkQA
Copy link

SparkQA commented Jul 17, 2014

QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 17, 2014

QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16785/consoleFull

@jkbradley
Copy link
Member Author

I tangled stuff in this PR, so I am closing it and resubmitting (with updates per mengxr's suggestions) as PR 1468: #1468

@jkbradley jkbradley closed this Jul 17, 2014
asfgit pushed a commit that referenced this pull request Jul 17, 2014
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.  (Re-submitting PR after tangling commits in PR 1407 #1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 18, 2014
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.  (Re-submitting PR after tangling commits in PR 1407 apache#1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes apache#1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.  (Re-submitting PR after tangling commits in PR 1407 apache#1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes apache#1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants