[Spark-7446][MLLIB] Add inverse transform for string indexer #6339

holdenk · 2015-05-22T01:29:55Z

It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.

SparkQA · 2015-05-22T01:36:05Z

Test build #33308 has finished for PR 6339 at commit 06e1e35.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-22T03:49:36Z

Test build #33312 has finished for PR 6339 at commit 12c13f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2015-06-13T01:40:01Z

This will be useful. But should we have the inverse transform API in the parent class?

holdenk · 2015-06-16T02:54:13Z

@mengxr - should we put this in the parent class or maybe just add a trait for transformers which are invertible?

SparkQA · 2015-07-08T00:07:36Z

Test build #36733 has finished for PR 6339 at commit 88779c1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-08T00:53:39Z

Test build #36740 has finished for PR 6339 at commit 557bef8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2015-07-10T01:29:28Z

@mengxr : Does the inverse() method returning a separate transformer seem ok?
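
For anyone skimming the thread, here is a toy sketch of the shape being asked about: the fitted model hands back a separate transformer instead of inverting in place, so the result can be dropped into a pipeline as its own stage. This is plain Python, not the Spark code in this PR; `ToyStringIndexerModel` and `ToyIndexToString` are made-up names for illustration only.

```python
from typing import Dict, List


class ToyIndexToString:
    """Separate transformer that maps indices back to labels."""

    def __init__(self, labels: List[str]) -> None:
        self.labels = list(labels)

    def transform(self, column: List[int]) -> List[str]:
        return [self.labels[i] for i in column]


class ToyStringIndexerModel:
    """Fitted model: knows the label ordering learned from the data."""

    def __init__(self, labels: List[str]) -> None:
        self.labels = list(labels)
        self._index: Dict[str, int] = {s: i for i, s in enumerate(self.labels)}

    def transform(self, column: List[str]) -> List[int]:
        return [self._index[s] for s in column]

    def invert(self) -> ToyIndexToString:
        # Hand back a new, independent transformer rather than inverting in place.
        return ToyIndexToString(self.labels)


model = ToyStringIndexerModel(["a", "b", "c"])
encoded = model.transform(["c", "a", "b"])
decoded = model.invert().transform(encoded)
print(encoded, decoded)
```

Because the inverse is its own object, it only needs the label ordering and not the rest of the model's state.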

holdenk · 2015-07-21T06:32:00Z

jenkins, retest this please.

SparkQA · 2015-07-21T07:14:12Z

Test build #40 has finished for PR 6339 at commit 557bef8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-21T08:07:48Z

Test build #37919 has finished for PR 6339 at commit 557bef8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-24T02:20:25Z

This will be very useful. For the design, I don't think it actually needs to be linked to StringIndexer at all since everything it needs is stored in the metadata (by StringIndexer). It should be able to take an input column of type Double (or other numerics too?), look at the metadata for a mapping, and convert the column to the corresponding string labels.

It could still be called InverseStringIndexer for clarity, though.
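
A rough sketch of that idea in plain Python, assuming the metadata boils down to an ordered list of labels where position i is the string encoded as index i. This is not Spark code; `index_to_label` and `metadata_labels` are hypothetical names.

```python
from typing import List, Sequence


def index_to_label(indices: Sequence[float], labels: List[str]) -> List[str]:
    """Map numeric indices back to string labels.

    `labels` is assumed to be ordered so that labels[i] is the string
    that was encoded as index i (a stand-in for the column metadata).
    """
    result = []
    for value in indices:
        position = int(value)
        # Reject values that are not whole numbers or fall outside the mapping.
        if position != value or not 0 <= position < len(labels):
            raise ValueError(f"no label for index {value!r}")
        result.append(labels[position])
    return result


# Stand-in for the metadata written by the indexer.
metadata_labels = ["a", "b", "c"]
predictions = [0.0, 2.0, 1.0, 0.0]
restored = index_to_label(predictions, metadata_labels)
print(restored)
```

Nothing here refers to the indexer that produced the column, which is the point: the mapping alone is enough.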

jkbradley · 2015-07-24T02:21:34Z

@holdenk Can you please add a description (e.g. copied from the JIRA description)? We're trying to enforce that for PRs. Thanks

holdenk · 2015-07-24T02:23:48Z

Oh sure. I will add that. I am about to get on a flight to a conference so it might take a bit of time.

jkbradley · 2015-07-24T02:51:00Z

OK no problem

holdenk · 2015-07-26T20:48:17Z

Updated.

jkbradley · 2015-07-27T20:23:11Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+
+  /**
+   * Return a model to perform the inverse transformation.
+   * Note: by default we keep the original columns during this transformation


This is confusing. How about writing:

Note: By default we keep the original columns during this transformation, so the inverse should only be used on new columns such as predicted labels.

jkbradley · 2015-07-27T20:24:52Z

Please make the PR description match the PR. ("We can add a parameter to StringIndexer/StringIndexModel for this." is no longer quite what the PR is doing.)

dbtsai · 2015-07-30T23:24:53Z

Are we going to have invert or unapply in the Transformer trait? In Scala, there is an extractor pattern http://docs.scala-lang.org/tutorials/tour/extractor-objects.html , and unapply will return Option since some transformations are one-way. But Option will not be friendly for Java.
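
To make the trade-off concrete, here is a loose Python analogue of an Option-returning inverse. Python has no `unapply` extractor, so `Optional` stands in for Scala's `Option`, and both classes are invented for illustration rather than taken from Spark.

```python
from typing import Dict, List, Optional


class LabelIndexer:
    """Toy invertible transformation: string label <-> index."""

    def __init__(self, labels: List[str]) -> None:
        self._labels = list(labels)
        self._index: Dict[str, int] = {s: i for i, s in enumerate(self._labels)}

    def apply(self, label: str) -> int:
        return self._index[label]

    def unapply(self, index: int) -> Optional[str]:
        # Returns None when the index has no corresponding label.
        if 0 <= index < len(self._labels):
            return self._labels[index]
        return None


class Lowercase:
    """Toy one-way transformation: the original casing cannot be recovered."""

    def apply(self, text: str) -> str:
        return text.lower()

    def unapply(self, text: str) -> Optional[str]:
        # No inverse exists, so every call reports "nothing".
        return None


indexer = LabelIndexer(["a", "b"])
print(indexer.unapply(indexer.apply("b")))
print(indexer.unapply(5))
print(Lowercase().unapply("spark"))
```

Every caller has to handle the "nothing" case, even for transformations that are always invertible, which is part of why a general-purpose inverse on the base trait is awkward.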

jkbradley · 2015-07-31T00:19:18Z

I don't think there are enough use cases for a general invert/unapply. If more real use cases come up, we can revisit this.

jkbradley · 2015-07-31T00:21:48Z

@holdenk The remaining issues are:

Add explanation to doc which is built into the labels Param. (Explain that, if labels are not given, they will be taken from the inputCol metadata.)
Switch away from using null. I confirmed we're trying to avoid the usage.
The scala style nit above

Thanks!

holdenk · 2015-07-31T00:24:24Z

Ok! Instead of null, would it be OK to use Option and have the setter wrap the value in Some()? I feel like using an empty array as the default value is pretty ugly, but I understand if that's not a good enough reason to use an Option (even if we hide it from the user).

jkbradley · 2015-07-31T00:31:49Z

Since the public API can't use Option, I'd prefer we just stick with an empty array. Users should not really even need to set it to an empty array since that will be the default, so hopefully it won't bother them much.
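
As a minimal sketch of the empty-array convention (plain Python, not the Spark Param API; `resolve_labels` and its arguments are hypothetical names): an empty `labels` value is read as "not set", and the labels are then taken from the input column's metadata.

```python
from typing import List, Optional, Sequence


def resolve_labels(param_labels: Sequence[str],
                   metadata_labels: Optional[Sequence[str]]) -> List[str]:
    """Pick the labels to use for the inverse mapping.

    An empty `param_labels` means "not set by the user", so fall back to
    the labels found in the input column's metadata (hypothetical argument).
    """
    if len(param_labels) > 0:
        return list(param_labels)
    if metadata_labels:
        return list(metadata_labels)
    raise ValueError("labels not set and no labels found in column metadata")


# Default case: the user never touched the param, so metadata wins.
print(resolve_labels([], ["x", "y"]))
# Explicit labels take priority over whatever the metadata says.
print(resolve_labels(["p", "q"], ["x", "y"]))
```

One consequence of this convention is that a user cannot explicitly ask for "no labels"; an empty array always means "use the metadata".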

holdenk · 2015-07-31T00:34:40Z

ok, will do.

SparkQA · 2015-07-31T06:29:16Z

Test build #39166 has finished for PR 6339 at commit 64dd3a3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-31T07:36:59Z

Test build #39182 has finished for PR 6339 at commit 9e241d8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-31T07:46:25Z

Test build #39183 has finished for PR 6339 at commit 6a38edb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-31T19:17:22Z

Test build #39253 has finished for PR 6339 at commit b9cffb6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2015-07-31T19:19:33Z

@jkbradley should have the changes you asked for now.

SparkQA · 2015-07-31T21:52:12Z

Test build #39255 has finished for PR 6339 at commit 7cdf915.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-08-01T03:50:51Z

LGTM once we get the tests working

Jenkins test this please

SparkQA · 2015-08-01T04:52:03Z

Test build #1266 has finished for PR 6339 at commit 7cdf915.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-08-01T08:09:13Z

Merging with master

billhao · 2015-08-05T19:41:15Z

I built the master branch on my cluster yesterday and was trying to use the feature added in this pull request in PySpark. But I got an error message from the following code. Is there anything else I need to do to use this in PySpark? Or am I using it in a wrong way?

labelIndexer = StringIndexer(inputCol="labelStr", outputCol="label")
indexer = labelIndexer.fit(df)
indexer.invert(inputCol="prediction", outputCol="predictionStr")

AttributeError: 'StringIndexer' object has no attribute 'invert'

Not sure if this is the best place for the question. Let me know if there is a better place to ask this.

jkbradley · 2015-08-05T19:55:22Z

It's failing because invert is only in StringIndexer, not StringIndexerModel.
That's a good point though; we should probably add invert() to StringIndexerModel as well. @holdenk Would you mind sending a quick PR to do that? Thanks!

holdenk · 2015-08-05T20:08:19Z

@jkbradley for sure :)

holdenk · 2015-08-05T20:21:04Z

Wait @jkbradley, it's actually only implemented on StringIndexerModel, which is where it belongs; it just was not added to the PySpark API. I could add it to the StringIndexer directly as well (provided the labels were supplied). I've created a JIRA for adding this to PySpark as well.

jkbradley · 2015-08-05T21:21:30Z

Ohh, right. I need to read more carefully...
I'll watch that JIRA.

holdenk changed the title ~~[Spark-7446[MLLIB] Add inverse transform for string indexer~~ [Spark-7446][MLLIB] Add inverse transform for string indexer May 22, 2015

holdenk added 3 commits July 7, 2015 16:35

Some progress

bb16a6a

Finish reverse part and add a test :)

78b28c1

fix long line

88779c1

holdenk force-pushed the SPARK-7446-inverse-transform-for-string-indexer branch from 12c13f6 to 88779c1 Compare July 7, 2015 23:35

Instead of using a private inverse transform, add an invert function so we can use it in a pipeline

557bef8

jkbradley reviewed Jul 27, 2015
holdenk added 2 commits July 30, 2015 22:54

Fix comment styles, use empty array as the default, etc.

4f06c59

Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer

64dd3a3

holdenk added 3 commits July 30, 2015 23:56

Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer

21c8cfa

use Array.empty

9e241d8

Setting the default needs to come after the value gets defined

6a38edb

Update the labels param to have the metadata note

b9cffb6

scala style comment fix

7cdf915

asfgit closed this in 6503897 Aug 1, 2015
