[SPARK-23578][ML] Add multicolumn support for Binarizer #20732

tengpeng · 2018-03-04T18:57:20Z

What changes were proposed in this pull request?

[SPARK-20542] added an API that lets Bucketizer bin multiple columns. Based on that change, this PR adds multi-column support to Binarizer.

How was this patch tested?

Added test cases.

zhengruifeng · 2019-09-18T08:33:34Z

--- Bucketizer has supported multiple columns since SPARK-20542 ---

tengpeng · 2019-09-18T08:40:20Z

This PR is for Binarizer not Bucketizer.

…

On Wed, Sep 18, 2019 at 4:35 PM Ruifeng Zheng wrote: Closed #20732.

zhengruifeng · 2019-09-18T08:53:07Z

@tengpeng Sorry for this stupid mistake!

AmplabJenkins · 2019-09-18T08:58:52Z

Can one of the admins verify this patch?

zhengruifeng · 2019-09-18T09:09:55Z

I think supporting multiple columns in Binarizer is reasonable.
@tengpeng, will you continue with this work? I can help review it.

tengpeng · 2019-09-18T09:46:34Z

I think this PR is ready for review, but I am not sure whether I am missing anything.

zhengruifeng · 2019-09-18T10:54:54Z

@tengpeng I just found a potential bug in the current Binarizer implementation (#25829). If other committers confirm it, I may need to backport the fix.
After that, I will review this PR, but I guess you will need to rebase it.

zhengruifeng

@tengpeng I left some comments. You need to rebase the PR now, since the bug fix I mentioned above has been merged.

zhengruifeng · 2019-09-23T08:29:38Z

examples/src/main/scala/org/apache/spark/examples/ml/BinarizerExample.scala

@@ -45,7 +45,23 @@ object BinarizerExample {
    binarizedDataFrame.show()
    // $example off$

+    // $example on$
+    val data2 = Array((0, 0.1), (1, 0.8), (2, 0.2))
+    val dataFrame2 = spark.createDataFrame((0 until data.length).map { idx =>


what about adding a vector column?

zhengruifeng · 2019-09-23T08:31:31Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+
+  /** @group param */
+  @Since("2.3.1")
+  val thresholds: DoubleArrayParam =


what about extending HasThresholds?

zhengruifeng · 2019-09-23T08:32:30Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+  @Since("2.3.1")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  @Since("2.3.1")


We don't need to add a @Since annotation for a private method.

Please refer to method ParamValidators.checkSingleVsMultiColumnParams added in SPARK-22799, which is used in Bucketizer/QuantileDiscretizer/StringIndexer.

I guess we do not need this method.

zhengruifeng · 2019-09-23T08:39:44Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+      td => udf { (in: Double) => if (in > td) 1.0 else 0.0 }
+    }
+
+    val binarizerVector = td.map { td =>


This UDF was updated to handle vectors when threshold < 0. Please refer to the latest master branch.
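To make the threshold < 0 case concrete, here is a minimal sketch in plain Scala, without Spark. The `Sparse` case class and `binarize` helper are hypothetical stand-ins, not the code on master. The assumption being illustrated is that a sparse vector stores only its non-zero entries, so when the threshold is negative the implicit zeros also exceed it and must map to 1.0.

```scala
// Hypothetical stand-in for a sparse vector: only non-zero entries are stored.
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

object SparseBinarizeSketch {
  // Binarize every logical entry of `v`, including the implicit zeros.
  def binarize(v: Sparse, threshold: Double): Array[Double] = {
    // An implicit zero maps to 1.0 exactly when 0.0 > threshold, i.e. threshold < 0.
    val fill = if (0.0 > threshold) 1.0 else 0.0
    val out = Array.fill(v.size)(fill)
    // Stored entries are compared against the threshold individually.
    v.indices.zip(v.values).foreach { case (i, x) =>
      out(i) = if (x > threshold) 1.0 else 0.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val v = Sparse(4, Array(1, 3), Array(-2.0, 0.5))
    println(binarize(v, 0.0).mkString(", "))
    println(binarize(v, -1.0).mkString(", "))
  }
}
```

A UDF that only visits the stored entries would leave the implicit zeros at 0.0 in the second call, which is the kind of discrepancy a negative threshold can expose.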

zhengruifeng · 2019-09-23T08:45:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+
+    val inputType = inputColName.map { col => schema(col).dataType }
+
+    val binarizerDouble: Seq[UserDefinedFunction] = td.map {


I prefer not to create binarizerDouble and binarizerVector up front, since some of their entries are never used.
I would rather build the new Columns like this:

val outputCols = inputColNames.zip(outputColNames).zip(thresholds).map { case ((inputColName, outputColName), threshold) => schema(inputColName).dataType match { ... } }
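The per-column dispatch suggested here can be sketched in plain Scala, without Spark. `ColType`, `Plan`, and `plan` are hypothetical names used only for illustration, and the real code would build a Column in each branch instead of a `Plan`. Note that chaining two `zip` calls yields nested pairs, so the pattern has the shape `((in, out), td)`.

```scala
object PerColumnDispatchSketch {
  // Hypothetical stand-ins for the two column data types discussed in the review.
  sealed trait ColType
  case object DoubleCol extends ColType
  case object VectorCol extends ColType

  // Records which transformation was chosen for one (input, output, threshold) triple.
  final case class Plan(input: String, output: String, kind: String, threshold: Double)

  def plan(
      schema: Map[String, ColType],
      inputCols: Seq[String],
      outputCols: Seq[String],
      thresholds: Seq[Double]): Seq[Plan] = {
    require(
      inputCols.length == outputCols.length && inputCols.length == thresholds.length,
      "inputCols, outputCols and thresholds must have the same length")
    // Decide per column, so nothing is built for a type a column does not have.
    inputCols.zip(outputCols).zip(thresholds).map { case ((in, out), td) =>
      schema(in) match {
        case DoubleCol => Plan(in, out, "double", td)
        case VectorCol => Plan(in, out, "vector", td)
      }
    }
  }
}
```

Because the type match happens inside the `map`, each column produces exactly one transformation, rather than one Double UDF and one Vector UDF per threshold.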

zhengruifeng · 2019-09-23T08:50:13Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+  @Since("2.3.1")
+  private[feature] def isBinarizerMultipleColumns(): Boolean = {
+    if (isSet(inputCols) && isSet(inputCol)) {
+      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +


Under the current ML convention, both of them cannot be set at the same time.
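As a rough illustration of that convention, the sketch below rejects conflicting settings instead of logging a warning and ignoring one of them. It is plain Scala and is not the ParamValidators helper itself. `resolveInputCols` is a hypothetical name, and the two `Option` arguments are an assumed model of whether each param is set.

```scala
object ExclusiveParamsSketch {
  // Returns the effective input columns, or throws if the settings conflict.
  def resolveInputCols(
      inputCol: Option[String],
      inputCols: Option[Seq[String]]): Seq[String] =
    (inputCol, inputCols) match {
      case (Some(_), Some(_)) =>
        // Fail fast rather than silently preferring one of the two params.
        throw new IllegalArgumentException(
          "Only one of inputCol and inputCols may be set.")
      case (Some(c), None) => Seq(c)
      case (None, Some(cs)) => cs
      case (None, None) =>
        throw new IllegalArgumentException(
          "One of inputCol or inputCols must be set.")
    }
}
```

Failing fast keeps the single-column and multi-column code paths unambiguous for callers.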

zhengruifeng · 2019-09-23T08:51:30Z

mllib/src/test/scala/org/apache/spark/ml/feature/BinarizerSuite.scala

@@ -101,6 +163,23 @@ class BinarizerSuite extends MLTest with DefaultReadWriteTest {
    }
  }

+  test("multiple column: Binarize vector of continuous features with setter") {


I think one suite with both double and vector columns is enough.

zhengruifeng · 2019-09-23T08:51:59Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala

+  new DoubleParam(this, "threshold", "threshold used to binarize continuous features")
+
+  /** @group param */
+  @Since("2.3.1")


@Since("3.0.0")

zhengruifeng · 2019-10-08T03:04:22Z

@tengpeng Hi Peng, will you continue working on this?

Add multicolumn support for Binarizer

a0cc9f1

dongjoon-hyun added the ML label Jun 14, 2019

zhengruifeng closed this Sep 18, 2019

zhengruifeng reopened this Sep 18, 2019

zhengruifeng reviewed Sep 23, 2019


zhengruifeng closed this Oct 16, 2019

