SPARK-1215 [MLLIB]: Clustering: Index out of bounds error #1407
Parameterizing max memory.
Fixing scalastyle issue.
if (j == 0) {
  logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
    s" Using duplicate point for center k = $i.")
  j = 1
The code may be clearer if written this way:
centers(i) =
  if (j == 0) {
    logWarning("...")
    points(0).toDense
  } else {
    points(j - 1).toDense
  }
or
if (j == 0) {
  logWarning("...")
  centers(i) = points(0).toDense
} else {
  centers(i) = points(j - 1).toDense
}
I'll go with the second suggestion.
@jkbradley The fix looks good to me except for some minor style issues. Thanks for fixing it! Btw, please add
QA tests have started for PR 1407. This patch merges cleanly.
QA results for PR 1407:
QA tests have started for PR 1407. This patch merges cleanly.
…n. Added temp DTRunnerJKB, eventually to merge with DecisionTreeRunner
QA results for PR 1407:
QA tests have started for PR 1407. This patch merges cleanly.
QA results for PR 1407:
I tangled stuff in this PR, so I am closing it and resubmitting (with updates per mengxr's suggestions) as PR 1468: #1468
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite. (Re-submitting PR after tangling commits in PR 1407 #1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite.
Bug fix for JIRA SPARK-1215: Clustering: Index out of bounds error
https://issues.apache.org/jira/browse/SPARK-1215
Solution: When kMeansPlusPlus initialization runs out of distinct points, log a warning and use duplicate cluster centers so that exactly k centers are returned.
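For illustration, here is a minimal standalone sketch of the guard discussed in this thread. It is not the code in LocalKMeans.scala: the object name `KMeansPlusPlusSketch`, the `chooseCenters` and `pointCost` helpers, the use of plain `Array[Double]` points without weights, and the `System.err` warning are all assumptions made to keep the example self-contained. When every point already coincides with a chosen center, the total cost is zero, the sampling loop never advances, and `j` stays at 0, so indexing `points(j - 1)` would go out of bounds without the check.

```scala
import scala.util.Random

// Hypothetical standalone sketch; names and types are illustrative only.
object KMeansPlusPlusSketch {

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var d = 0
    while (d < a.length) {
      val diff = a(d) - b(d)
      sum += diff * diff
      d += 1
    }
    sum
  }

  // Cost of a point: squared distance to its nearest already-chosen center.
  def pointCost(centers: Array[Array[Double]], p: Array[Double]): Double =
    centers.map(c => squaredDistance(c, p)).min

  // k-means++ style seeding that always returns exactly k centers.
  def chooseCenters(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    require(points.nonEmpty && k > 0, "need at least one point and k > 0")
    val rand = new Random(seed)
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length)).clone()
    for (i <- 1 until k) {
      val curCenters = centers.take(i)
      val costs = points.map(p => pointCost(curCenters, p))
      // Sample the next center with probability proportional to its cost.
      val r = rand.nextDouble() * costs.sum
      var cumulative = 0.0
      var j = 0
      while (j < points.length && cumulative < r) {
        cumulative += costs(j)
        j += 1
      }
      if (j == 0) {
        // r is zero, typically because all points duplicate existing centers,
        // so points(j - 1) would be out of bounds. Reuse a point instead.
        System.err.println(s"Ran out of distinct points; using duplicate point for center k = $i.")
        centers(i) = points(0).clone()
      } else {
        centers(i) = points(j - 1).clone()
      }
    }
    centers
  }
}
```

Calling `KMeansPlusPlusSketch.chooseCenters(Array.fill(3)(Array(1.0, 1.0)), 2, 42L)` exercises the duplicate-center branch: the input has one distinct point, yet two centers come back.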