-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29967][ML][FOLLOWUP] KMeans Cleanup #27014
Conversation
@@ -258,7 +258,6 @@ class KMeans private ( | |||
val distanceMeasureInstance = DistanceMeasure.decodeFromString(this.distanceMeasure) | |||
|
|||
val dataVectorWithNorm = data.map(d => d._1) | |||
val weights = data.map(d => d._2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this var is unused
@@ -284,7 +283,6 @@ class KMeans private ( | |||
// Execute iterations of Lloyd's algorithm until converged | |||
while (iteration < maxIterations && !converged) { | |||
val costAccum = sc.doubleAccumulator | |||
val countAccum = sc.longAccumulator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just found that costAccum
also records the count
@@ -232,7 +232,7 @@ class KMeans private ( | |||
val zippedData = data.zip(norms).map { case ((v, w), norm) => | |||
(new VectorWithNorm(v, norm), w) | |||
} | |||
zippedData.persist() | |||
zippedData.persist(StorageLevel.MEMORY_AND_DISK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in line with other intermediate RDDs persisted in KMeans
Test build #115805 has finished for PR 27014 at commit
|
@@ -42,10 +42,10 @@ class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector], | |||
private[spark] val numIter: Int) | |||
extends Saveable with Serializable with PMMLExportable { | |||
|
|||
private val distanceMeasureInstance: DistanceMeasure = | |||
@transient private lazy val distanceMeasureInstance: DistanceMeasure = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason that this not serializable? if it's just to not serialize it, it might not be worth the overhead of lazy, but won't matter much. I can understand not serializing the cluster centers twice, effectively
### What changes were proposed in this pull request? 1, remove unused imports and variables 2, remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count 3, mark `clusterCentersWithNorm` in KMeansModel trasient and lazy, since it is only used in transformation and can be directly generated from the centers. ### Why are the changes needed? 1,remove unused codes 2,avoid repeated computation 3,reduce broadcasted size ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes apache#27014 from zhengruifeng/kmeans_clean_up. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
What changes were proposed in this pull request?
1, remove unused imports and variables
2, remove
countAccum: LongAccumulator
, sincecostAccum: DoubleAccumulator
also records the count3, mark
clusterCentersWithNorm
in KMeansModel trasient and lazy, since it is only used in transformation and can be directly generated from the centers.Why are the changes needed?
1,remove unused codes
2,avoid repeated computation
3,reduce broadcasted size
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing testsuites