[SPARK-11215] [ML] Add multiple columns support to StringIndexer #9183

Closed
wants to merge 5 commits

Conversation

yanboliang
Contributor

Add multiple-column support to StringIndexer so that users can transform multiple input columns into multiple output columns simultaneously.
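
For illustration, here is a minimal sketch of the intended usage, assuming the multi-column variant exposes setInputCols / setOutputCols alongside the existing single-column setInputCol / setOutputCol (the exact parameter names in this PR may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("MultiColStringIndexer").getOrCreate()
import spark.implicits._

// Toy data with two categorical columns; both are indexed in one fit/transform
// instead of chaining a separate StringIndexer per column.
val df = Seq(("red", "US"), ("blue", "DE"), ("red", "DE")).toDF("color", "country")

val indexer = new StringIndexer()
  .setInputCols(Array("color", "country"))          // assumed multi-column params
  .setOutputCols(Array("colorIndex", "countryIndex"))

indexer.fit(df).transform(df).show()
```

Fitting all columns with one estimator avoids a separate fitting pass per column, which is the main appeal for wide datasets.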

@SparkQA

SparkQA commented Oct 20, 2015

Test build #43987 has finished for PR 9183 at commit 10ec734.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44069 has finished for PR 9183 at commit 24ad0fd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44075 has finished for PR 9183 at commit d039f9a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44077 has finished for PR 9183 at commit d039f9a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2015

Test build #44146 has finished for PR 9183 at commit a64f71d.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 23, 2015

Test build #44196 has finished for PR 9183 at commit a64f71d.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@SparkQA

SparkQA commented Oct 23, 2015

Test build #44220 has finished for PR 9183 at commit 2e8bc28.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@yanboliang
Contributor Author

Because the multiple-column StringIndexer uses Aggregate rather than countByValue to compute the distinct value counts, the resulting order is indeterminate when two or more values have the same count. As a result:

  1. a binary classification label column may be indexed to different results (0, 1 or 1, 0);
  2. OneHotEncoder drops the last category in the encoded vector by default, so when more than one value could be dropped, which one is dropped is indeterminate under this proposal (see the sketch below).

I don't think we need to keep the ordering guarantee produced by countByValue, which would hurt performance in the Aggregate implementation, so I disabled some test cases. If this proposal works well, I can re-enable and update those test cases.
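
A minimal, hypothetical sketch (plain Scala, not this PR's code) of why equal counts make the assigned indices indeterminate, and how a secondary sort key would make them deterministic:

```scala
// Toy frequency map: two labels tie at the same count.
val counts = Map("yes" -> 2L, "no" -> 2L)

// Ordering only by descending frequency leaves the tie unresolved, so either
//   Array("yes", "no")  -> "yes" = 0.0, "no" = 1.0
//   Array("no", "yes")  -> "no"  = 0.0, "yes" = 1.0
// is a valid outcome, and OneHotEncoder would then drop a different category.
val byFrequency = counts.toSeq.sortBy { case (_, c) => -c }.map(_._1).toArray

// Adding the value itself as a secondary sort key breaks ties deterministically.
val deterministic = counts.toSeq.sortBy { case (v, c) => (-c, v) }.map(_._1).toArray
```

For reference, StringIndexer later gained a stringOrderType parameter that makes this ordering choice explicit.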

@@ -56,14 +56,3 @@ test_that("feature interaction vs native glm", {
rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
})

test_that("summary coefficients match with native glm", {
Member


why is this removed?

Contributor Author


It's not removed, just temporarily disabled. This PR changes the semantics of StringIndexer slightly, which makes this test case produce an indeterminate result. We should first discuss whether the semantic change is necessary, and then we can update the test case so it produces a deterministic result.

@SparkQA

SparkQA commented Dec 10, 2015

Test build #47507 has started for PR 9183 at commit 2e8bc28.

@BenFradet
Contributor

+1, I've been meaning to request this transformer for a while.

@BenFradet
Contributor

LGTM, I think a good follow-up would be to do the same with IndexToString.

@rxin
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
@pkch
Contributor

pkch commented Jun 28, 2016

What needs to happen to move this forward? This was a PR that would have been the first iteration of a significant improvement in handling of wide datasets.

@rxin
Contributor

rxin commented Jun 28, 2016

I think @yanboliang just needs to push this forward and get people to review it.

@MLnick
Contributor

MLnick commented Aug 22, 2016

@yanboliang will you be reviving this PR?

@yanboliang
Contributor Author

@MLnick I think this is an important feature for making ML pipelines handle large datasets elegantly. I will update this PR or send a new one soon, and I'm looking forward to your review. Thanks!

@pramitchoudhary

@yanboliang This is a very helpful initiative. Thanks for taking it up. Let me know if you need any help with this PR.

@yanboliang
Contributor Author

@pramitchoudhary Yeah, lots of users have voted for this feature. I will update this PR to match master in the next release cycle and look forward to your reviews/comments. Since Spark 2.1 is in code freeze, we can only merge bug fixes and doc changes right now. Thanks.

@WeichenXu123
Contributor

@yanboliang I will take over this feature and create a new PR soon.

@minixalpha
Contributor

@WeichenXu123 Any update on the new PR?

@yanboliang yanboliang deleted the spark-11215 branch September 11, 2017 04:07
@WeichenXu123
Contributor

@minixalpha Sorry for the delay. I've been too busy recently, but I will try to finish and submit my new PR once I get time. Thanks!

@minixalpha
Contributor

Thanks for your work! @WeichenXu123
