[SPARK-11215] [ML] Add multiple columns support to StringIndexer #9183

Closed
wants to merge 5 commits

Conversation

yanboliang
Contributor

Add multiple-column support to StringIndexer so that users can transform multiple input columns into multiple output columns simultaneously.
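
For illustration, here is a minimal sketch of the intended usage, assuming the multi-column variant exposes setInputCols / setOutputCols alongside the existing single-column setInputCol / setOutputCol (the exact parameter names in this PR may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("MultiColStringIndexer").getOrCreate()
import spark.implicits._

// Toy data with two categorical columns; both are indexed in one fit/transform
// instead of chaining a separate StringIndexer per column.
val df = Seq(("red", "US"), ("blue", "DE"), ("red", "DE")).toDF("color", "country")

val indexer = new StringIndexer()
  .setInputCols(Array("color", "country"))          // assumed multi-column params
  .setOutputCols(Array("colorIndex", "countryIndex"))

indexer.fit(df).transform(df).show()
```

Fitting all columns with one estimator avoids a separate fitting pass per column, which is the main appeal for wide datasets.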

@SparkQA

SparkQA commented Oct 20, 2015

Test build #43987 has finished for PR 9183 at commit 10ec734.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44069 has finished for PR 9183 at commit 24ad0fd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44075 has finished for PR 9183 at commit d039f9a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44077 has finished for PR 9183 at commit d039f9a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2015

Test build #44146 has finished for PR 9183 at commit a64f71d.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 23, 2015

Test build #44196 has finished for PR 9183 at commit a64f71d.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@SparkQA

SparkQA commented Oct 23, 2015

Test build #44220 has finished for PR 9183 at commit 2e8bc28.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class HasOutputCols(Params):

@yanboliang
Contributor Author

Because the multiple-column StringIndexer uses Aggregate rather than countByValue to compute the distinct value counts, the resulting order is indeterminate when two or more values have the same count. As a result:

  1. a binary classification label column may be indexed to different results (0, 1 or 1, 0);
  2. OneHotEncoder drops the last category in the encoded vector by default, so when more than one value could be dropped, which one is dropped is indeterminate under this proposal (see the sketch below).

I don't think we need to keep the ordering guarantee produced by countByValue, which would hurt performance in the Aggregate implementation, so I disabled some test cases. If this proposal works well, I can re-enable and update those test cases.
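
A minimal, hypothetical sketch (plain Scala, not this PR's code) of why equal counts make the assigned indices indeterminate, and how a secondary sort key would make them deterministic:

```scala
// Toy frequency map: two labels tie at the same count.
val counts = Map("yes" -> 2L, "no" -> 2L)

// Ordering only by descending frequency leaves the tie unresolved, so either
//   Array("yes", "no")  -> "yes" = 0.0, "no" = 1.0
//   Array("no", "yes")  -> "no"  = 0.0, "yes" = 1.0
// is a valid outcome, and OneHotEncoder would then drop a different category.
val byFrequency = counts.toSeq.sortBy { case (_, c) => -c }.map(_._1).toArray

// Adding the value itself as a secondary sort key breaks ties deterministically.
val deterministic = counts.toSeq.sortBy { case (v, c) => (-c, v) }.map(_._1).toArray
```

For reference, StringIndexer later gained a stringOrderType parameter that makes this ordering choice explicit.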

@@ -56,14 +56,3 @@ test_that("feature interaction vs native glm", {
rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
})

test_that("summary coefficients match with native glm", {
Member


why is this removed?

Contributor Author


It's not removed, just temporarily disabled. This PR changes the semantics of StringIndexer slightly, which makes this test case produce an indeterminate result. We should first discuss whether the semantic change is necessary, and then we can update the test case so it produces a deterministic result.

@SparkQA

SparkQA commented Dec 10, 2015

Test build #47507 has started for PR 9183 at commit 2e8bc28.

@BenFradet
Contributor

+1, I've been meaning to request this transformer for a while.

@BenFradet
Contributor

LGTM, I think a good follow-up would be to do the same with IndexToString.

@rxin
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
@pkch
Contributor

pkch commented Jun 28, 2016

What needs to happen to move this forward? This was a PR that would have been the first iteration of a significant improvement in handling of wide datasets.

@rxin
Contributor

rxin commented Jun 28, 2016

I think @yanboliang just needs to push this forward and get people to review it.

@MLnick
Contributor

MLnick commented Aug 22, 2016

@yanboliang will you be reviving this PR?

@yanboliang
Contributor Author

@MLnick I think this is an important feature for making ML pipelines handle large datasets elegantly. I will update this PR or send a new one soon, and I'm looking forward to your review. Thanks!

@pramitchoudhary

@yanboliang This is a very helpful initiative. Thanks for taking it up. Let me know if you need any help with this PR.

@yanboliang
Contributor Author

@pramitchoudhary Yeah, lots of users have voted for this feature. I will update this PR to match master in the next release cycle and look forward to your reviews/comments. Since Spark 2.1 is in code freeze, we can only merge bug fixes and doc changes right now. Thanks.

@WeichenXu123
Contributor

@yanboliang I will take over this feature and create a new PR soon.

@minixalpha
Contributor

@WeichenXu123 Any update on the new PR?

@yanboliang yanboliang deleted the spark-11215 branch September 11, 2017 04:07
@WeichenXu123
Contributor

@minixalpha Sorry for the delay. I've been too busy recently, but I will try to finish and submit my new PR once I get time. Thanks!

@minixalpha
Contributor

Thanks for your work! @WeichenXu123
