Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB #21561

Closed
wants to merge 5 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

logNumExamples in KMeans/BiKM/GMM/AFT/NB

How was this patch tested?

existing tests

@SparkQA
Copy link

SparkQA commented Jun 14, 2018

Test build #91823 has finished for PR 21561 at commit 61b95a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 1, 2018

Test build #93864 has finished for PR 21561 at commit 96e8425.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 1, 2018

Test build #93865 has finished for PR 21561 at commit 2e48282.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 1, 2018

Test build #93866 has finished for PR 21561 at commit 1a93c34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Aggregates term frequencies per label.
// TODO: Calling aggregateByKey and collect creates two stages, we can implement something
// TODO: similar to reduceByKeyLocally to save one stage.
val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
.map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
.map { row =>
countAccum.add(1L)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to work correctly, given that this is in a map operation? wondering if this introduces a correctness issue or whether this number is available elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work correctly, however, to guarantee the correctness, I update the pr to compute the number without Accumulator

def run(input: RDD[Vector]): BisectingKMeansModel = {

private[spark] def run(input: RDD[Vector],
instr: Option[Instrumentation]): BisectingKMeansModel = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere I see the instrumentation made available with "insrumented" -- is this different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instrumented will create a new Instrumentation, and instrumented is only used in ml
When mllib's impls is called, the Instrumentation will be passed as a parameters, like what KMeans does (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L362).

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94669 has finished for PR 21561 at commit fb3ff2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def run(input: RDD[Vector]): BisectingKMeansModel = {

private[spark] def run(input: RDD[Vector],
instr: Option[Instrumentation]): BisectingKMeansModel = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94718 has finished for PR 21561 at commit 5f403fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -299,7 +299,7 @@ class KMeans private (
val bcCenters = sc.broadcast(centers)

// Find the new centers
val newCenters = data.mapPartitions { points =>
val collected = data.mapPartitions { points =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we find a better name than collected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this.

}.collectAsMap()

if (iteration == 0) {
val numSamples = collected.values.map(_._2).sum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about moving this in the foreach, so it is computed only id needed?

@@ -171,6 +169,8 @@ class BisectingKMeans private (
val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))
var activeClusters = summarize(d, assignments, dMeasure)
val numSamples = activeClusters.values.map(_.size).sum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94781 has finished for PR 21561 at commit ecab85c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from the two comments still to address (naming and extra newline)

*/
@Since("1.6.0")
def run(input: RDD[Vector]): BisectingKMeansModel = {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra newline

* @param input RDD of vectors
* @return model for the bisecting kmeans
*/
@Since("1.6.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be since 2.4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this api was already existing since 1.6.0, so we should keep the since annotation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You couldn't call BisectingKMeans.run(...) before this, right? it wasn't in a superclass or anything. In that sense I think this method needs to be marked as new as of 2.4.0, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def run(input: RDD[Vector]): BisectingKMeansModel is a public api since 1.6, and users can call it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right I get it now, this isn't a new method, it's 'replacing' the definition above. 👍

@srowen
Copy link
Member

srowen commented Aug 16, 2018

Merged to master

@asfgit asfgit closed this in e501924 Aug 16, 2018
@zhengruifeng zhengruifeng deleted the alg_logNumExamples branch August 17, 2018 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants