
[SPARK-13097][ML] Binarizer allowing Double AND Vector input types #10976

Closed
wants to merge 5 commits

Conversation

seddonm1
Contributor

This enhancement extends the existing Spark ML Binarizer [SPARK-5891] to accept Vector input columns in addition to the existing Double input column type.

A use case for this enhancement is when a user wants to binarize many similar features at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
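For illustration, a minimal usage sketch against the Spark 1.6-era API (the column names, data, and 0.5 threshold are made up for this example):

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical data: each row carries a vector of pixel intensities.
val df = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(0.1, 0.8, 0.2)),
  (1, Vectors.dense(0.9, 0.5, 0.7))
)).toDF("id", "pixels")

// With a Vector input column, the same threshold is applied element-wise,
// so many similar features are binarized in a single pass.
val binarizer = new Binarizer()
  .setInputCol("pixels")
  .setOutputCol("binarized_pixels")
  .setThreshold(0.5)

binarizer.transform(df).show()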

This contribution is my original work and I license the work to the project under the project's open source license.

@viirya @mengxr

This change extends the existing SparkML Binarizer [SPARK-5891] to allow
Vector in addition to the existing Double input columns.
@sethah
Contributor

sethah commented Jan 29, 2016

The Jira in the PR title is for adding a Binarizer and was already resolved. Is there a Jira for extending Binarizer to Vector types? Typically a Jira is created to discuss the merits and use cases and then a PR can be made against that Jira ticket.

@seddonm1
Contributor Author

Thanks @sethah. I was unsure whether this was sufficiently different from the original ticket, where it was noted that 'We need to discuss whether we should process multiple columns or a vector column.'

I have raised SPARK-13097 and renamed this issue.

seddonm1 changed the title from [SPARK-5891][ML] Binarizer allowing Double AND Vector input types to [SPARK-13097][ML] Binarizer allowing Double AND Vector input types on Jan 29, 2016
@mengxr
Contributor

mengxr commented Feb 11, 2016

ok to test

@mengxr
Contributor

mengxr commented Feb 12, 2016

test this please

@@ -62,28 +65,57 @@ final class Binarizer(override val uid: String)
def setOutputCol(value: String): this.type = set(outputCol, value)

override def transform(dataset: DataFrame): DataFrame = {
-   transformSchema(dataset.schema, logging = true)
+   val outputSchema = transformSchema(dataset.schema)
Contributor


Please keep logging on, which uses DEBUG level.
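For reference, a minimal sketch of the requested change, which keeps the DEBUG-level schema logging while still capturing the returned schema:

// Keep schema-validation logging enabled (logged at DEBUG level) and
// still bind the output schema for later use in transform().
val outputSchema = transformSchema(dataset.schema, logging = true)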

@mengxr
Contributor

mengxr commented Feb 12, 2016

Some minor comments. Overall it looks good to me.

@seddonm1
Contributor Author

Thanks @mengxr.

I have updated the pull request with your suggestions. I believe we still need to call .compressed even after moving the if (value > td) { check, so that we 'Return a vector in either dense or sparse format, whichever uses less storage.' That case arises when all values in a vector exceed the threshold, making the dense representation the smaller one.

Vectors.sparse(data.size, indices.result(), values.result()).compressed
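For context, a sketch of the vector branch being discussed, assuming td is the threshold value bound in transform() and using scala.collection.mutable.ArrayBuilder to accumulate the result (names beyond the quoted line are illustrative):

import scala.collection.mutable.ArrayBuilder
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Binarize a Vector column: record only the indices whose value exceeds the
// threshold, then let .compressed return whichever of the dense or sparse
// representations uses less storage (dense wins when most values pass).
val binarizerVector = udf { (data: Vector) =>
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  data.foreachActive { (index, value) =>
    if (value > td) {
      indices += index
      values += 1.0
    }
  }
  Vectors.sparse(data.size, indices.result(), values.result()).compressed
}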

@mengxr
Contributor

mengxr commented Feb 12, 2016

LGTM pending Jenkins

@mengxr
Contributor

mengxr commented Feb 12, 2016

test this please

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51209 has finished for PR 10976 at commit ea07001.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Feb 14, 2016

test this please

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51255 has finished for PR 10976 at commit 6a6988c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51261 has finished for PR 10976 at commit 8a8dc67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-   val attr = BinaryAttribute.defaultAttr.withName(outputColName)
-   val outputFields = inputFields :+ attr.toStructField()
-   StructType(outputFields)
+   var outCol: StructField = null
Member


nit: we can use val here like:

val outCol: StructField = inputType match {
  case DoubleType =>
    BinaryAttribute.defaultAttr.withName(outputColName).toStructField()
  case _: VectorUDT =>
    new StructField(outputColName, new VectorUDT, true)
  case other =>
    throw new IllegalArgumentException(s"Data type $other is not supported.")
}

@viirya
Member

viirya commented Feb 14, 2016

LGTM, besides a minor comment.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51278 has finished for PR 10976 at commit 9750238.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Feb 16, 2016

Merged into master. Thanks!

asfgit closed this in cbeb006 on Feb 16, 2016