New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23578][ML] Add multicolumn support for Binarizer #20732
Conversation
--- |
This PR is for Binarizer not Bucketizer.
…On Wed, Sep 18, 2019 at 4:35 PM Ruifeng Zheng ***@***.***> wrote:
Closed #20732 <#20732>.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#20732?email_source=notifications&email_token=AAUZHMS56RWQEK6XVRTMAMDQKHR57A5CNFSM4ETNZUDKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOTWBUA7Q#event-2642624638>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAUZHMTZBK3ETLV65NWCN5TQKHR57ANCNFSM4ETNZUDA>
.
|
@tengpeng Sorry for this stupid mistake! |
Can one of the admins verify this patch? |
I think supporting multi-column in Binarizer is reasonable. |
I think this PR is ready for review but I am not sure I am missing anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tengpeng I leave some comments. You need to rebase the PR now, since the bugfix I noticed above was merged.
@@ -45,7 +45,23 @@ object BinarizerExample { | |||
binarizedDataFrame.show() | |||
// $example off$ | |||
|
|||
// $example on$ | |||
val data2 = Array((0, 0.1), (1, 0.8), (2, 0.2)) | |||
val dataFrame2 = spark.createDataFrame((0 until data.length).map { idx => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about adding a vector column?
|
||
/** @group param */ | ||
@Since("2.3.1") | ||
val thresholds: DoubleArrayParam = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about extending HasThresholds
?
@Since("2.3.1") | ||
def setOutputCols(value: Array[String]): this.type = set(outputCols, value) | ||
|
||
@Since("2.3.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont need to add since annotation for a private method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to method ParamValidators.checkSingleVsMultiColumnParams
added in SPARK-22799
, which is used in Bucketizer
/QuantileDiscretizer
/StringIndexer
.
I guess we do not need this method.
td => udf { (in: Double) => if (in > td) 1.0 else 0.0 } | ||
} | ||
|
||
val binarizerVector = td.map { td => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This processing UDF is updated for vectors with threshold<0. Please refer to the lastest master brance.
|
||
val inputType = inputColName.map { col => schema(col).dataType } | ||
|
||
val binarizerDouble: Seq[UserDefinedFunction] = td.map { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to create the binarizerDouble
& binarizerVector
in the beginning, since a part of the entries are never used.
I prefer initialize new Columns like this:
val outputCols = inputColNames.zip(outputColNames).zip(thresholds).map{ case (inputColName, outputColName, threshold) =>
schema(inputColName).dataType match{
...
}
}
@Since("2.3.1") | ||
private[feature] def isBinarizerMultipleColumns(): Boolean = { | ||
if (isSet(inputCols) && isSet(inputCol)) { | ||
logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can not set both of them, according to current ML's convention.
@@ -101,6 +163,23 @@ class BinarizerSuite extends MLTest with DefaultReadWriteTest { | |||
} | |||
} | |||
|
|||
test("multiple column: Binarize vector of continuous features with setter") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one suite with both double and vector columns is enough.
new DoubleParam(this, "threshold", "threshold used to binarize continuous features") | ||
|
||
/** @group param */ | ||
@Since("2.3.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@since("3.0.0")
@tengpeng Hi, Peng, will u go on working on this? |
What changes were proposed in this pull request?
[Spark-20542] added an API that Bucketizer that can bin multiple columns. Based on this change, a multicolumn support is added for Binarizer.
How was this patch tested?
Added test cases.