[SPARK-13097][ML] Binarizer allowing Double AND Vector input types #10976
This change extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input columns.
The Jira in the PR title is for adding a Binarizer and was already resolved. Is there a Jira for extending Binarizer to
Thanks @sethah. I was unsure if this was sufficiently different from the original code where they mentioned 'We need to discuss whether we should process multiple columns or a vector column.' I have raised SPARK-13097 and renamed this issue.
ok to test
test this please
@@ -62,28 +65,57 @@ final class Binarizer(override val uid: String)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    val outputSchema = transformSchema(dataset.schema)
Please keep logging on, which uses DEBUG level.
Some minor comments. Overall it looks good to me.
Thanks @mengxr. I have updated the pull request with your suggestions. I believe that we still need to use the .compressed even after moving the
LGTM pending Jenkins
test this please
Test build #51209 has finished for PR 10976 at commit
test this please
Test build #51255 has finished for PR 10976 at commit
Test build #51261 has finished for PR 10976 at commit
    val attr = BinaryAttribute.defaultAttr.withName(outputColName)
    val outputFields = inputFields :+ attr.toStructField()
    StructType(outputFields)
    var outCol: StructField = null
nit: we can use val here like:
    val outCol: StructField = inputType match {
      case DoubleType =>
        BinaryAttribute.defaultAttr.withName(outputColName).toStructField()
      case _: VectorUDT =>
        new StructField(outputColName, new VectorUDT, true)
      case other =>
        throw new IllegalArgumentException(s"Data type $other is not supported.")
    }
LGTM, besides a minor comment.
Test build #51278 has finished for PR 10976 at commit
Merged into master. Thanks!
This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
A use case for this enhancement is when a user wants to binarize many similar feature columns at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
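As a rough illustration of the semantics described above (not the code in this PR), the plain-Scala sketch below applies one threshold to a single Double and to every element of a vector. The names `BinarizeSketch`, `binarize`, and `binarizeAll` are hypothetical, and a `Seq[Double]` stands in for Spark's Vector type. It assumes Binarizer's documented rule that values strictly greater than the threshold map to 1.0 and all others to 0.0.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
// All names here are hypothetical and not part of the Spark API.
object BinarizeSketch {
  // Double case: values strictly greater than the threshold become 1.0,
  // values equal to or below it become 0.0.
  def binarize(value: Double, threshold: Double): Double =
    if (value > threshold) 1.0 else 0.0

  // Vector case: the same threshold is applied to every element.
  // Seq[Double] is a stand-in for Spark's Vector type.
  def binarizeAll(values: Seq[Double], threshold: Double): Seq[Double] =
    values.map(binarize(_, threshold))

  def main(args: Array[String]): Unit = {
    // Hypothetical pixel intensities sharing one threshold.
    val pixels = Seq(0.1, 0.5, 0.9)
    println(binarizeAll(pixels, 0.5).mkString("[", ", ", "]"))
  }
}
```

In the sketch, the vector case is just the scalar rule mapped over the elements, which is why a single threshold parameter can serve both input types.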
This contribution is my original work and I license the work to the project under the project's open source license.
@viirya @mengxr