
[SPARK-29967][ML][FOLLOWUP] KMeans Cleanup #27014

Closed
wants to merge 1 commit

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

1. Remove unused imports and variables.
2. Remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count.
3. Mark `clusterCentersWithNorm` in `KMeansModel` as `@transient lazy`, since it is only used in transformation and can be derived directly from the centers (see the sketch below).
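
A minimal sketch of change 3, using a simplified stand-in class (the real `KMeansModel` wraps each center in a `VectorWithNorm`; names here are illustrative):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Simplified stand-in for KMeansModel: the norms are derived lazily from clusterCenters
// and the derived field is @transient, so only the plain centers are serialized and
// broadcast; the norms are recomputed once, on first use, from the deserialized centers.
class KMeansModelSketch(val clusterCenters: Array[Vector]) extends Serializable {
  @transient private lazy val clusterCentersWithNorm: Array[(Vector, Double)] =
    clusterCenters.map(c => (c, Vectors.norm(c, 2.0)))

  // Example use during transformation: norms are available without having serialized them.
  def normOfFirstCenter: Double = clusterCentersWithNorm.head._2
}
```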

Why are the changes needed?

1. Remove unused code.
2. Avoid repeated computation.
3. Reduce the broadcast size.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

init
@@ -258,7 +258,6 @@ class KMeans private (
val distanceMeasureInstance = DistanceMeasure.decodeFromString(this.distanceMeasure)

val dataVectorWithNorm = data.map(d => d._1)
val weights = data.map(d => d._2)
Contributor Author


this var is unused

@@ -284,7 +283,6 @@ class KMeans private (
// Execute iterations of Lloyd's algorithm until converged
while (iteration < maxIterations && !converged) {
val costAccum = sc.doubleAccumulator
val countAccum = sc.longAccumulator
Contributor Author


Just found that costAccum also records the count
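
A small standalone sketch of that observation, assuming a local `SparkSession` and placeholder cost values (not the actual KMeans loop):

```scala
import org.apache.spark.sql.SparkSession

// A DoubleAccumulator keeps the running sum *and* the number of values added,
// so a separate LongAccumulator just for counting points is redundant.
object CostAccumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cost-accum-sketch").getOrCreate()
    val sc = spark.sparkContext

    val costAccum = sc.doubleAccumulator("cost")
    sc.parallelize(Seq(1.5, 2.0, 0.5)).foreach(cost => costAccum.add(cost))

    // sum = 4.0, count = 3, avg = 4.0 / 3
    println(s"sum = ${costAccum.sum}, count = ${costAccum.count}, avg = ${costAccum.avg}")
    spark.stop()
  }
}
```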

@@ -232,7 +232,7 @@ class KMeans private (
val zippedData = data.zip(norms).map { case ((v, w), norm) =>
(new VectorWithNorm(v, norm), w)
}
zippedData.persist()
zippedData.persist(StorageLevel.MEMORY_AND_DISK)
Contributor Author


Keep in line with other intermediate RDDs persisted in KMeans
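
For context, a hedged sketch of what the explicit storage level changes; the helper name is made up for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // RDD.persist() with no argument defaults to MEMORY_ONLY, so partitions that do not
  // fit in memory are recomputed from lineage; MEMORY_AND_DISK spills them to disk
  // instead, matching the storage level used for the other intermediate RDDs in KMeans.
  def persistLikeKMeans[T](intermediate: RDD[T]): RDD[T] =
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)
}
```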


SparkQA commented Dec 26, 2019

Test build #115805 has finished for PR 27014 at commit cf5d705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -42,10 +42,10 @@ class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
private[spark] val numIter: Int)
extends Saveable with Serializable with PMMLExportable {

private val distanceMeasureInstance: DistanceMeasure =
@transient private lazy val distanceMeasureInstance: DistanceMeasure =
Member


Is the reason that this is not serializable? If it's just to avoid serializing it, it might not be worth the overhead of lazy, but it won't matter much. I can understand not serializing the cluster centers twice, effectively.
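
A self-contained toy sketch (not the `KMeansModel` code) of the trade-off being discussed: a `@transient lazy val` is skipped by Java serialization and recomputed from the remaining fields on first access, at the cost of a lazy-initialization check:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Toy class, not Spark code: `instance` stands in for the decoded DistanceMeasure.
class Holder(val measure: String) extends Serializable {
  @transient lazy val instance: String = measure.toUpperCase
}

object TransientLazyRoundTrip {
  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(new Holder("cosine"))
    out.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    val copy = in.readObject().asInstanceOf[Holder]
    println(copy.instance) // recomputed lazily after deserialization: prints COSINE
  }
}
```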

zhengruifeng deleted the kmeans_clean_up branch on December 29, 2019 04:44
fqaiser94 pushed a commit to fqaiser94/spark that referenced this pull request on Mar 30, 2020
### What changes were proposed in this pull request?
1. Remove unused imports and variables.
2. Remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count.
3. Mark `clusterCentersWithNorm` in `KMeansModel` as `@transient lazy`, since it is only used in transformation and can be derived directly from the centers.

### Why are the changes needed?
1. Remove unused code.
2. Avoid repeated computation.
3. Reduce the broadcast size.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes apache#27014 from zhengruifeng/kmeans_clean_up.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>