Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23578][ML] Add multicolumn support for Binarizer #20732

Closed
wants to merge 1 commit into from

Conversation

tengpeng
Copy link
Contributor

@tengpeng tengpeng commented Mar 4, 2018

What changes were proposed in this pull request?

[Spark-20542] added an API that Bucketizer that can bin multiple columns. Based on this change, a multicolumn support is added for Binarizer.

How was this patch tested?

Added test cases.

@zhengruifeng
Copy link
Contributor

zhengruifeng commented Sep 18, 2019

--- Bucketizer already support multi-columns since SPARK-20542 ---

@tengpeng
Copy link
Contributor Author

tengpeng commented Sep 18, 2019 via email

@zhengruifeng zhengruifeng reopened this Sep 18, 2019
@zhengruifeng
Copy link
Contributor

@tengpeng Sorry for this stupid mistake!

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@zhengruifeng
Copy link
Contributor

I think supporting multi-column in Binarizer is reasonable.
@tengpeng will you go on with this work? I can help reviewing.

@tengpeng
Copy link
Contributor Author

I think this PR is ready for review but I am not sure I am missing anything.

@zhengruifeng
Copy link
Contributor

@tengpeng I just find a potential bug in current Binarizer impl #25829, if it is confirmed by other committers, I may need to make a backporting.
After that, I will review this PR, but I guess you will need to rebase it.

Copy link
Contributor

@zhengruifeng zhengruifeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tengpeng I leave some comments. You need to rebase the PR now, since the bugfix I noticed above was merged.

@@ -45,7 +45,23 @@ object BinarizerExample {
binarizedDataFrame.show()
// $example off$

// $example on$
val data2 = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame2 = spark.createDataFrame((0 until data.length).map { idx =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding a vector column?


/** @group param */
@Since("2.3.1")
val thresholds: DoubleArrayParam =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about extending HasThresholds?

@Since("2.3.1")
def setOutputCols(value: Array[String]): this.type = set(outputCols, value)

@Since("2.3.1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need to add since annotation for a private method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to method ParamValidators.checkSingleVsMultiColumnParams added in SPARK-22799, which is used in Bucketizer/QuantileDiscretizer/StringIndexer.

I guess we do not need this method.

td => udf { (in: Double) => if (in > td) 1.0 else 0.0 }
}

val binarizerVector = td.map { td =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This processing UDF is updated for vectors with threshold<0. Please refer to the lastest master brance.


val inputType = inputColName.map { col => schema(col).dataType }

val binarizerDouble: Seq[UserDefinedFunction] = td.map {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to create the binarizerDouble & binarizerVector in the beginning, since a part of the entries are never used.
I prefer initialize new Columns like this:

val outputCols = inputColNames.zip(outputColNames).zip(thresholds).map{ case (inputColName, outputColName, threshold) =>
   schema(inputColName).dataType match{
   ...
  }
}

@Since("2.3.1")
private[feature] def isBinarizerMultipleColumns(): Boolean = {
if (isSet(inputCols) && isSet(inputCol)) {
logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not set both of them, according to current ML's convention.

@@ -101,6 +163,23 @@ class BinarizerSuite extends MLTest with DefaultReadWriteTest {
}
}

test("multiple column: Binarize vector of continuous features with setter") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one suite with both double and vector columns is enough.

new DoubleParam(this, "threshold", "threshold used to binarize continuous features")

/** @group param */
@Since("2.3.1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@since("3.0.0")

@zhengruifeng
Copy link
Contributor

@tengpeng Hi, Peng, will u go on working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants