[SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB #21561

zhengruifeng · 2018-06-14T07:06:36Z

What changes were proposed in this pull request?

Log the number of training examples (logNumExamples) in KMeans, BisectingKMeans, GaussianMixture, AFTSurvivalRegression, and NaiveBayes.

How was this patch tested?

Existing tests.

SparkQA · 2018-06-14T08:16:54Z

Test build #91823 has finished for PR 21561 at commit 61b95a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-01T03:29:06Z

Test build #93864 has finished for PR 21561 at commit 96e8425.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-01T03:50:16Z

Test build #93865 has finished for PR 21561 at commit 2e48282.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-01T05:06:50Z

Test build #93866 has finished for PR 21561 at commit 1a93c34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-08-10T13:21:04Z

mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala

    // Aggregates term frequencies per label.
    // TODO: Calling aggregateByKey and collect creates two stages, we can implement something
    // TODO: similar to reduceByKeyLocally to save one stage.
    val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
-      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
+      .map { row =>
+        countAccum.add(1L)


Is this guaranteed to work correctly, given that it runs inside a map operation? I'm wondering whether this introduces a correctness issue, or whether this number is available elsewhere.

zhengruifeng · 2018-08-13

This should work correctly. However, to guarantee correctness, I updated the PR to compute the number without an Accumulator.
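For context on the concern above: Spark only guarantees that an accumulator update is applied exactly once when it happens inside an action. An update made inside a transformation such as map may be applied again if a task or stage is re-executed. One way to avoid the accumulator is to carry the row count inside the aggregate itself. The sketch below illustrates that idea with plain Scala collections rather than RDDs; `Instance`, `CountInAggregate`, and the shape of the aggregate are hypothetical and are not taken from the PR.

```scala
// Hypothetical record type standing in for (label, weight, features) rows.
final case class Instance(label: Double, weight: Double, features: Array[Double])

object CountInAggregate {
  // Per-label summary: (sum of weights, weighted feature sums, number of rows seen).
  type Agg = (Double, Array[Double], Long)

  // Groups rows by label and folds each group into an Agg.
  // The row count travels with the aggregate, so no side-effecting counter is needed.
  def aggregate(data: Seq[Instance], numFeatures: Int): Map[Double, Agg] =
    data.groupBy(_.label).map { case (label, rows) =>
      val zero: Agg = (0.0, Array.fill(numFeatures)(0.0), 0L)
      val agg = rows.foldLeft(zero) { case ((weightSum, featureSums, count), inst) =>
        val updated = featureSums.zip(inst.features).map { case (s, f) => s + inst.weight * f }
        (weightSum + inst.weight, updated, count + 1L)
      }
      label -> agg
    }

  // Total number of examples, recovered from the per-label counts.
  def numExamples(aggregated: Map[Double, Agg]): Long =
    aggregated.values.map(_._3).sum
}
```

Because the count is part of the value being reduced, it is recomputed along with everything else if a partition is retried, which is why this approach does not share the double-counting risk.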

srowen · 2018-08-10T13:22:11Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

-  def run(input: RDD[Vector]): BisectingKMeansModel = {
+
+  private[spark] def run(input: RDD[Vector],
+                         instr: Option[Instrumentation]): BisectingKMeansModel = {


Elsewhere I see the instrumentation made available with "instrumented". Is this different?

zhengruifeng · 2018-08-13

instrumented creates a new Instrumentation, and it is only used in ml.
When an mllib implementation is called, the Instrumentation is passed in as a parameter, as KMeans does (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L362).
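The pattern described in this reply can be sketched roughly as follows: the public method keeps its signature and delegates to an internal overload that accepts an optional Instrumentation. This is a minimal illustration only. The `Instrumentation` class here is a hypothetical stand-in that just records messages (Spark's real class is private to Spark), and `Algo` and `Model` are invented names.

```scala
// Hypothetical stand-in for Spark's Instrumentation; it only records messages.
final class Instrumentation {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]
  def logNumExamples(n: Long): Unit = { buffer += s"numExamples=$n"; () }
  def messages: Seq[String] = buffer.toSeq
}

// Hypothetical model type.
final case class Model(numPoints: Long)

class Algo {
  // Public entry point: signature unchanged, runs without instrumentation.
  def run(input: Seq[Array[Double]]): Model = run(input, None)

  // Internal entry point (this would be private[spark] inside Spark):
  // the caller on the ml side supplies its own Instrumentation.
  def run(input: Seq[Array[Double]], instr: Option[Instrumentation]): Model = {
    val n = input.size.toLong
    // Log only when an Instrumentation was actually passed in.
    instr.foreach(_.logNumExamples(n))
    Model(n)
  }
}
```

Using `Option[Instrumentation]` lets existing callers of the one-argument method behave exactly as before, while the ml wrapper can pass `Some(instr)` to have the count logged.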

SparkQA · 2018-08-13T07:01:49Z

Test build #94669 has finished for PR 21561 at commit fb3ff2b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-13T09:21:18Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

-  def run(input: RDD[Vector]): BisectingKMeansModel = {
+
+  private[spark] def run(input: RDD[Vector],
+                         instr: Option[Instrumentation]): BisectingKMeansModel = {


nit: indentation

SparkQA · 2018-08-14T02:54:14Z

Test build #94718 has finished for PR 21561 at commit 5f403fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-14T08:02:28Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

@@ -299,7 +299,7 @@ class KMeans private (
      val bcCenters = sc.broadcast(centers)

      // Find the new centers
-      val newCenters = data.mapPartitions { points =>
+      val collected = data.mapPartitions { points =>


nit: can we find a better name than collected?

zhengruifeng · 2018-08-15

I am neutral on this.

mgaido91 · 2018-08-14T08:04:14Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

+      }.collectAsMap()
+
+      if (iteration == 0) {
+        val numSamples = collected.values.map(_._2).sum


What about moving this into the foreach, so it is computed only if needed?
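A rough sketch of what this suggestion amounts to: the sum is evaluated inside the `foreach` on the optional logger, so it is skipped entirely when no logger is present. The names `LazyCount` and `logIfNeeded` are hypothetical, and the assumption that the second tuple element is a per-cluster point count is inferred from the `_._2` in the diff rather than confirmed by it.

```scala
object LazyCount {
  // collected: cluster index -> (sum vector, number of points assigned).
  // log: optional callback, standing in for an optional Instrumentation.
  def logIfNeeded(
      iteration: Int,
      collected: Map[Int, (Array[Double], Long)],
      log: Option[Long => Unit]): Unit = {
    if (iteration == 0) {
      // The sum is computed only if a logger exists.
      log.foreach { f => f(collected.values.map(_._2).sum) }
    }
  }
}
```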

mgaido91 · 2018-08-14T08:04:25Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

@@ -171,6 +169,8 @@ class BisectingKMeans private (
    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
    var assignments = vectors.map(v => (ROOT_INDEX, v))
    var activeClusters = summarize(d, assignments, dMeasure)
+    val numSamples = activeClusters.values.map(_.size).sum


SparkQA · 2018-08-15T04:01:06Z

Test build #94781 has finished for PR 21561 at commit ecab85c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91

LGTM apart from the two comments still to address (naming and extra newline)

mgaido91 · 2018-08-15T08:42:13Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

-   */
-  @Since("1.6.0")
-  def run(input: RDD[Vector]): BisectingKMeansModel = {
+


nit: extra newline

srowen · 2018-08-15T19:01:57Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")


Nit: should this be since 2.4?

zhengruifeng · 2018-08-16

This API has existed since 1.6.0, so shouldn't we keep the existing Since annotation?

srowen · 2018-08-16

You couldn't call BisectingKMeans.run(...) before this, right? It wasn't in a superclass or anything. In that sense, I think this method needs to be marked as new as of 2.4.0.

zhengruifeng · 2018-08-16

def run(input: RDD[Vector]): BisectingKMeansModel has been a public API since 1.6, and users can already call it.

srowen · 2018-08-16

Oh right, I get it now: this isn't a new method, it's 'replacing' the definition above. 👍

srowen · 2018-08-16T22:23:49Z

Merged to master

zhengruifeng force-pushed the alg_logNumExamples branch from 96e8425 to 2e48282 Compare August 1, 2018 03:45

srowen reviewed Aug 10, 2018

mgaido91 reviewed Aug 13, 2018

zhengruifeng force-pushed the alg_logNumExamples branch from fb3ff2b to 5f403fa Compare August 14, 2018 01:40

mgaido91 reviewed Aug 14, 2018

zhengruifeng added 5 commits August 15, 2018 10:45

92f1052 nit
208c2a3 nit
b7c62dd update nb
3790b6b nit
ecab85c nit

zhengruifeng force-pushed the alg_logNumExamples branch from 5f403fa to ecab85c Compare August 15, 2018 02:47

mgaido91 reviewed Aug 15, 2018

srowen reviewed Aug 15, 2018

srowen approved these changes Aug 16, 2018

asfgit closed this in e501924 Aug 16, 2018

zhengruifeng deleted the alg_logNumExamples branch August 17, 2018 02:02
