
[SPARK-13097][ML] Binarizer allowing Double AND Vector input types #10976

Closed
wants to merge 5 commits

Conversation

seddonm1
Contributor

This enhancement extends the existing Spark ML Binarizer [SPARK-5891] to accept Vector input columns in addition to the existing Double input column type.

A use case for this enhancement is when a user wants to binarize many similar features at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
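For illustration, a minimal usage sketch against the Spark 1.6-era API (the column names, data, and 0.5 threshold are made up for this example):

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical data: each row carries a vector of pixel intensities.
val df = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(0.1, 0.8, 0.2)),
  (1, Vectors.dense(0.9, 0.5, 0.7))
)).toDF("id", "pixels")

// With a Vector input column, the same threshold is applied element-wise,
// so many similar features are binarized in a single pass.
val binarizer = new Binarizer()
  .setInputCol("pixels")
  .setOutputCol("binarized_pixels")
  .setThreshold(0.5)

binarizer.transform(df).show()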

This contribution is my original work and I license the work to the project under the project's open source license.

@viirya @mengxr

This change extends the existing SparkML Binarizer [SPARK-5891] to allow
Vector in addition to the existing Double input columns.
@sethah
Contributor

sethah commented Jan 29, 2016

The Jira in the PR title is for adding a Binarizer and was already resolved. Is there a Jira for extending Binarizer to Vector types? Typically a Jira is created to discuss the merits and use cases and then a PR can be made against that Jira ticket.

@seddonm1
Contributor Author

Thanks @sethah. I was unsure whether this was sufficiently different from the original ticket, where it was noted that 'We need to discuss whether we should process multiple columns or a vector column.'

I have raised SPARK-13097 and renamed this issue.

seddonm1 changed the title from [SPARK-5891][ML] Binarizer allowing Double AND Vector input types to [SPARK-13097][ML] Binarizer allowing Double AND Vector input types on Jan 29, 2016
@mengxr
Contributor

mengxr commented Feb 11, 2016

ok to test

@mengxr
Contributor

mengxr commented Feb 12, 2016

test this please

@@ -62,28 +65,57 @@ final class Binarizer(override val uid: String)
def setOutputCol(value: String): this.type = set(outputCol, value)

override def transform(dataset: DataFrame): DataFrame = {
-   transformSchema(dataset.schema, logging = true)
+   val outputSchema = transformSchema(dataset.schema)
Contributor


Please keep logging on, which uses DEBUG level.
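For reference, a minimal sketch of the requested change, which keeps the DEBUG-level schema logging while still capturing the returned schema:

// Keep schema-validation logging enabled (logged at DEBUG level) and
// still bind the output schema for later use in transform().
val outputSchema = transformSchema(dataset.schema, logging = true)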

@mengxr
Contributor

mengxr commented Feb 12, 2016

Some minor comments. Overall it looks good to me.

@seddonm1
Contributor Author

Thanks @mengxr.

I have updated the pull request with your suggestions. I believe we still need to call .compressed even after moving the if (value > td) { check, so that we 'Return a vector in either dense or sparse format, whichever uses less storage.' That case arises when all values in a vector exceed the threshold, making the dense representation the smaller one.

Vectors.sparse(data.size, indices.result(), values.result()).compressed
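For context, a sketch of the vector branch being discussed, assuming td is the threshold value bound in transform() and using scala.collection.mutable.ArrayBuilder to accumulate the result (names beyond the quoted line are illustrative):

import scala.collection.mutable.ArrayBuilder
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Binarize a Vector column: record only the indices whose value exceeds the
// threshold, then let .compressed return whichever of the dense or sparse
// representations uses less storage (dense wins when most values pass).
val binarizerVector = udf { (data: Vector) =>
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  data.foreachActive { (index, value) =>
    if (value > td) {
      indices += index
      values += 1.0
    }
  }
  Vectors.sparse(data.size, indices.result(), values.result()).compressed
}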

@mengxr
Contributor

mengxr commented Feb 12, 2016

LGTM pending Jenkins

@mengxr
Contributor

mengxr commented Feb 12, 2016

test this please

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51209 has finished for PR 10976 at commit ea07001.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Feb 14, 2016

test this please

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51255 has finished for PR 10976 at commit 6a6988c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51261 has finished for PR 10976 at commit 8a8dc67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-   val attr = BinaryAttribute.defaultAttr.withName(outputColName)
-   val outputFields = inputFields :+ attr.toStructField()
-   StructType(outputFields)
+   var outCol: StructField = null
Member


nit: we can use val here like:

val outCol: StructField = inputType match {
  case DoubleType =>
    BinaryAttribute.defaultAttr.withName(outputColName).toStructField()
  case _: VectorUDT =>
    new StructField(outputColName, new VectorUDT, true)
  case other =>
    throw new IllegalArgumentException(s"Data type $other is not supported.")
}

@viirya
Member

viirya commented Feb 14, 2016

LGTM, besides a minor comment.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51278 has finished for PR 10976 at commit 9750238.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Feb 16, 2016

Merged into master. Thanks!

asfgit closed this in cbeb006 on Feb 16, 2016