Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11569] [ML] Fix StringIndexer to handle null value properly #9920

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
5 participants
@jliwork
Copy link
Contributor

commented Nov 23, 2015

I was having some problem with rebase on #9709, so I had to close that PR and creating a new pull request with my latest fix.

Thanks to @jkbradley and @holdenk for your comments. I have updated my fix so that it will allow user to config either to filter out null values or throw an error with StringIndexer.setHandleInvalid("skip") API. The default is StringIndexer.setHandleInvalid("error").

Please let me know what you think. Thanks again!

@jliwork

This comment has been minimized.

Copy link
Contributor Author

commented Nov 30, 2015

@jkbradley @holdenk Could you please take a look my latest fix and let me know of any further comments? thanks!

@@ -164,7 +175,7 @@ class StringIndexerModel (
val filteredDataset = (getHandleInvalid) match {
case "skip" => {
val filterer = udf { label: String =>
labelToIndex.contains(label)
label != null

This comment has been minimized.

Copy link
@holdenk

holdenk Dec 3, 2015

Contributor

This no longer filters out values that are not present in labaelToIndex

@@ -255,7 +255,8 @@ private[ml] trait HasFitIntercept extends Params {
private[ml] trait HasHandleInvalid extends Params {

/**
* Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an errror). More options may be added later..
* Param for how to handle invalid entries. Options are skip (which will filter out rows with null values), or error

This comment has been minimized.

Copy link
@holdenk

holdenk Dec 3, 2015

Contributor

You've changed the meaning of an existing parameter - this is generally something we want to avoid doing.

@holdenk

This comment has been minimized.

Copy link
Contributor

commented Dec 3, 2015

Sorry for my slow reply - looking at this it seems like you've updated the meaning of handleInvalid - it no longer serves its original purposes (unless I've missed something). This is probably not quite the best path forward - maybe something for handleNulls and keep the old handle invalid?

I really like the thoroughness of the tests & I think the logic is pretty solid (just changing the meaning of things in the API is to avoided).

@srowen

This comment has been minimized.

Copy link
Member

commented May 6, 2016

@jliwork would you like to continue working on this? or else please close the PR

@AmplabJenkins

This comment has been minimized.

Copy link

commented May 12, 2016

Can one of the admins verify this patch?

@jliwork

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2016

Yes. I'd like to continue working on this. But since this PR is obsolete. I will close it and open a new one instead.

@jliwork jliwork closed this May 12, 2016

@holdenk

This comment has been minimized.

Copy link
Contributor

commented May 12, 2016

Great - can you ping me on your new PR when its ready? :)

@jliwork

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2016

@holdenk Definitely :-) Thanks!

@imatiach-msft

This comment has been minimized.

Copy link
Contributor

commented Jan 6, 2017

@jliwork @srowen are you currently working on this in-progress JIRA 11569? If not, I would be interested in continuing the initial pull request that was closed. Please let me know, thank you!

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 14, 2017

[SPARK-11569][ML] Fix StringIndexer to handle null value properly
## What changes were proposed in this pull request?

This PR is to enhance StringIndexer with NULL values handling.

Before the PR, StringIndexer will throw an exception when encounters NULL values.
With this PR:
- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as unseen labels

BTW, I noticed someone was trying to solve the same problem ( apache#9920 ) but seems getting no progress or response for a long time. Would you mind to give me a chance to solve it ? I'm eager to help. :-)

## How was this patch tested?

new unit tests

Author: Menglong TAN <tanmenglong@renrenche.com>
Author: Menglong TAN <tanmenglong@gmail.com>

Closes apache#17233 from crackcell/11569_StringIndexer_NULL.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.