[Spark-7446][MLLIB] Add inverse transform for string indexer #6339
Test build #33308 has finished for PR 6339 at commit
|
Test build #33312 has finished for PR 6339 at commit
|
This will be useful. But should we have an inverse transform API in the parent class? |
@mengxr - should we put this in the parent class or maybe just add a trait for transformers which are invertible? |
12c13f6 to 88779c1
Test build #36733 has finished for PR 6339 at commit
|
…so we can use it in a pipeline
Test build #36740 has finished for PR 6339 at commit
|
@mengxr : Does the inverse() method returning a separate transformer seem ok? |
jenkins, retest this please. |
Test build #40 has finished for PR 6339 at commit
|
Test build #37919 has finished for PR 6339 at commit
|
This will be very useful. For the design, I don't think it actually needs to be linked to StringIndexer at all since everything it needs is stored in the metadata (by StringIndexer). It should be able to take an input column of type Double (or other numerics too?), look at the metadata for a mapping, and convert the column to the corresponding string labels. It could still be called InverseStringIndexer for clarity, though. |
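The design described above can be sketched without Spark. This is a minimal illustration, not the PR's implementation: the function names and the plain `dict` standing in for column metadata are hypothetical, and the alphabetical tie-break is an assumption made only to keep the ordering deterministic.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct strings by descending frequency.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def index_column(values, labels):
    """Map each string to a double index and record the labels as metadata."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    indexed = [lookup[v] for v in values]
    metadata = {"labels": list(labels)}  # hypothetical stand-in for column metadata
    return indexed, metadata


def inverse_index_column(indices, metadata):
    """Convert numeric indices back to strings using only the metadata."""
    labels = metadata["labels"]
    return [labels[int(i)] for i in indices]


values = ["a", "b", "c", "a", "a", "c"]
labels = fit_labels(values)
indexed, metadata = index_column(values, labels)
restored = inverse_index_column(indexed, metadata)
```

The inverse step never touches the indexer itself; it needs only the label list carried alongside the column, which is the point of the suggestion.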
@holdenk Can you please add a description (e.g. copied from the JIRA description)? We're trying to enforce that for PRs. Thanks |
Oh sure. I will add that. I am about to get on a flight to a conference so |
OK no problem |
Updated. |
/**
 * Return a model to perform the inverse transformation.
 * Note: by default we keep the original columns during this transformation
This is confusing. How about writing:
Note: By default we keep the original columns during this transformation, so the inverse should only be used on new columns such as predicted labels.
Please make the PR description match the PR. ("We can add a parameter to StringIndexer/StringIndexModel for this." is no longer quite what the PR is doing.) |
Are we going to have |
I don't think there are enough use cases for a general invert/unapply. If more real use cases come up, we can revisit this. |
@holdenk The remaining issues are:
Thanks! |
Ok! Instead of null, would it be ok to use an Option and have the setter set it as Some()? I feel like using an empty array as the default value is pretty ugly, but I understand if that's not a good enough reason to use an Option (even if we hide it from the user). |
Since the public API can't use Option, I'd prefer we just stick with an empty array. Users should not really even need to set it to an empty array since that will be the default, so hopefully it won't bother them much. |
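A rough sketch of the empty-array default discussed here, in plain Python rather than the Spark API. The class and method names are hypothetical, and the fallback to metadata when no labels are set is an assumption based on the earlier design comment, not something this thread confirms.

```python
class InverseIndexer:
    """Hypothetical transformer whose `labels` parameter defaults to empty."""

    def __init__(self):
        # An empty list means "not set"; no null or Option is exposed to users.
        self._labels = []

    def set_labels(self, labels):
        self._labels = list(labels)
        return self

    def transform(self, indices, metadata):
        # Assumption: explicitly set labels win, otherwise use the metadata.
        labels = self._labels if self._labels else metadata.get("labels", [])
        if not labels:
            raise ValueError("no labels set and none found in metadata")
        return [labels[int(i)] for i in indices]


metadata = {"labels": ["a", "c", "b"]}
from_metadata = InverseIndexer().transform([0.0, 2.0], metadata)
from_param = InverseIndexer().set_labels(["x", "y", "z"]).transform([0.0, 2.0], metadata)
```

With this shape, a user who never calls the setter gets the metadata-driven behavior, so the empty default stays out of sight in the common case.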
ok, will do. |
Test build #39166 has finished for PR 6339 at commit
|
Test build #39182 has finished for PR 6339 at commit
|
Test build #39183 has finished for PR 6339 at commit
|
Test build #39253 has finished for PR 6339 at commit
|
@jkbradley should have the changes you asked for now. |
Test build #39255 has finished for PR 6339 at commit
|
LGTM once we get the tests working. Jenkins, test this please. |
Test build #1266 has finished for PR 6339 at commit
|
Merging with master |
I built the master branch on my cluster yesterday and was trying to use the feature added in this pull request in PySpark, but I got an error message from the following code. Is there anything else I need to do to use this in PySpark? Or am I using it in the wrong way?
labelIndexer = StringIndexer(inputCol="labelStr", outputCol="label")
AttributeError: 'StringIndexer' object has no attribute 'invert'
Not sure if this is the best place for the question. Let me know if there is a better place to ask this. |
It's failing because invert is only in StringIndexer, not StringIndexerModel. |
@jkbradley for sure :) |
Wait @jkbradley, it's actually only implemented on StringIndexerModel, which is where it belongs, but it just was not added to the PySpark API. I could add it to StringIndexer directly as well (provided the labels were supplied). I've created a JIRA for adding this to PySpark as well. |
Ohh, right. I need to read more carefully... |
It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.