Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-7446][MLLIB] Add inverse transform for string indexer #6339

Conversation

holdenk
Copy link
Contributor

@holdenk holdenk commented May 22, 2015

It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.

@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33308 has finished for PR 6339 at commit 06e1e35.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33312 has finished for PR 6339 at commit 12c13f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [Spark-7446[MLLIB] Add inverse transform for string indexer [Spark-7446][MLLIB] Add inverse transform for string indexer May 22, 2015
@dbtsai
Copy link
Member

dbtsai commented Jun 13, 2015

This will be useful. But should be have inverse transform api in the parent class?

@holdenk
Copy link
Contributor Author

holdenk commented Jun 16, 2015

@mengxr - should we put this in the parent class or maybe just add a trait for transformers which are invertible?

@holdenk holdenk force-pushed the SPARK-7446-inverse-transform-for-string-indexer branch from 12c13f6 to 88779c1 Compare July 7, 2015 23:35
@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #36733 has finished for PR 6339 at commit 88779c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #36740 has finished for PR 6339 at commit 557bef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Jul 10, 2015

@mengxr : Does the inverse() method returning a separate transformer seem ok?

@holdenk
Copy link
Contributor Author

holdenk commented Jul 21, 2015

jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #40 has finished for PR 6339 at commit 557bef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #37919 has finished for PR 6339 at commit 557bef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

This will be very useful. For the design, I don't think it actually needs to be linked to StringIndexer at all since everything it needs is stored in the metadata (by StringIndexer). It should be able to take an input column of type Double (or other numerics too?), look at the metadata for a mapping, and convert the column to the corresponding string labels.

It could still be called InverseStringIndexer for clarity, though.

@jkbradley
Copy link
Member

@holdenk Can you please add a description (e.g. copied from the JIRA description)? We're trying to enforce that for PRs. Thanks

@holdenk
Copy link
Contributor Author

holdenk commented Jul 24, 2015

Oh sure. I will add that. I am about to get on a flight to a conference so
might take a bit of time.

On Thursday, July 23, 2015, jkbradley notifications@github.com wrote:

@holdenk https://github.com/holdenk Can you please add a description
(e.g. copied from the JIRA description)? We're trying to enforce that for
PRs. Thanks


Reply to this email directly or view it on GitHub
#6339 (comment).

Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau

@jkbradley
Copy link
Member

OK no problem

@holdenk
Copy link
Contributor Author

holdenk commented Jul 26, 2015

Updated.


/**
* Return a model to perform the inverse transformation.
* Note: by default we keep the original columns during this transformation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing. How about writing:

Note: By default we keep the original columns during this transformation, so the inverse should only be used on new columns such as predicted labels.

@jkbradley
Copy link
Member

Please make the PR description match the PR. ("We can add a parameter to StringIndexer/StringIndexModel for this." is no longer quite what the PR is doing.)

@dbtsai
Copy link
Member

dbtsai commented Jul 30, 2015

Are we going to have invert or unapply in the Transformer trait? In scala, there is an extractor pattern http://docs.scala-lang.org/tutorials/tour/extractor-objects.html , and unapply will return Option since some of the transformation is one way. But Option will not be friendly for Java.

@jkbradley
Copy link
Member

I don't think there are enough use cases for a general invert/unapply. If more real use cases come up, we can revisit this.

@jkbradley
Copy link
Member

@holdenk The remaining issues are:

  • Add explanation to doc which is built into the labels Param. (Explain that, if labels are not given, they will be taken from the inputCol metadata.)
  • Switch away from using null. I confirmed we're trying to avoid the usage.
  • The scala style nit above

Thanks!

@holdenk
Copy link
Contributor Author

holdenk commented Jul 31, 2015

Ok! If instead of null would it be ok to use option and have the setter set as Some()? I feel like using an empty array as the default value is pretty ugly, but I understand if thats not a good enough reason to use an option (even if we hide it from the user).

@jkbradley
Copy link
Member

Since the public API can't use Option, I'd prefer we just stick with an empty array. Users should not really even need to set it to an empty array since that will be the default, so hopefully it won't bother them much.

@holdenk
Copy link
Contributor Author

holdenk commented Jul 31, 2015

ok, will do.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39166 has finished for PR 6339 at commit 64dd3a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39182 has finished for PR 6339 at commit 9e241d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39183 has finished for PR 6339 at commit 6a38edb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39253 has finished for PR 6339 at commit b9cffb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Jul 31, 2015

@jkbradley should have the changes you asked for now.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39255 has finished for PR 6339 at commit 7cdf915.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM once we get the tests working

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Aug 1, 2015

Test build #1266 has finished for PR 6339 at commit 7cdf915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merging with master

@asfgit asfgit closed this in 6503897 Aug 1, 2015
@billhao
Copy link

billhao commented Aug 5, 2015

I built the master branch on my cluster yesterday and was trying to use the feature added in this pull request in PySpark. But I got an error message from the following code. Is there anything else I need to do to use this in PySpark? Or am I using it in a wrong way?

labelIndexer = StringIndexer(inputCol="labelStr", outputCol="label")
indexer = labelIndexer.fit(df)
indexer.invert(inputCol="prediction", outputCol="predictionStr")

AttributeError: 'StringIndexer' object has no attribute 'invert'

Not sure if this is the best place for the question. Let me know if there is a best place to ask this.

@jkbradley
Copy link
Member

It's failing because invert is only in StringIndexer, not StringIndexerModel.
That's a good point though; we should probably add invert() to StringIndexerModel as well. @holdenk Would you mind sending a quick PR to do that? Thanks!

@holdenk
Copy link
Contributor Author

holdenk commented Aug 5, 2015

@jkbradley for sure :)

@holdenk
Copy link
Contributor Author

holdenk commented Aug 5, 2015

Wait @jkbradley its actually only implemented on StringIndexerModel which is where it belongs, but it just was not add it to the PySpark API. I could add it to the StringIndexer directly as well (provided the labels were supplied). I've created a JIRA for adding this to PySpark as well.

@jkbradley
Copy link
Member

Ohh, right. I need to read more carefully...
I'll watch that JIRA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants