[SPARK-11215] [ML] Add multiple columns support to StringIndexer #9183
Test build #43987 has finished for PR 9183 at commit
|
Test build #44069 has finished for PR 9183 at commit
|
Test build #44075 has finished for PR 9183 at commit
|
Test build #44077 has finished for PR 9183 at commit
|
Test build #44146 has finished for PR 9183 at commit
|
Jenkins, test this please. |
Test build #44196 has finished for PR 9183 at commit
|
Test build #44220 has finished for PR 9183 at commit
|
Because of the multiple columns
|
@@ -56,14 +56,3 @@ test_that("feature interaction vs native glm", {
  rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
})

test_that("summary coefficients match with native glm", {
why is this removed?
It's not removed, just temporarily disabled. This PR changes the semantics of StringIndexer slightly, which causes this test case to produce nondeterministic results. We should first discuss whether the semantic change is necessary, and then we can update the test case to produce deterministic results.
Test build #47507 has started for PR 9183 at commit |
+1, I've been meaning to request this transformer for a while. |
LGTM, I think a good follow up would be to do the same with |
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. |
What needs to happen to move this forward? This was a PR that would have been the first iteration of a significant improvement in handling of wide datasets. |
I think @yanboliang just needs to push this forward and get people to review it. |
@yanboliang will you be reviving this PR? |
@MLnick I think this is an important feature for making ML pipelines handle large datasets elegantly. I will update this PR or send a new one soon, and I look forward to your help reviewing it. Thanks! |
@yanboliang This is a very helpful initiative. Thanks for taking it up. Let me know if you need any help with this PR. |
@pramitchoudhary Yeah, lots of users have voted for this feature. I will update this PR to match master in the next release cycle, and I look forward to reviews and comments. Since Spark 2.1 is in code freeze, we can only merge bug fixes and doc changes right now. Thanks. |
@yanboliang I will take over this feature and create a new PR soon. |
@WeichenXu123 Any activity for the new PR? |
@minixalpha Sorry for the delay. I've been too busy recently, but I will try to finish and commit my new PR once I get time. Thanks! |
Thanks for your work! @WeichenXu123
Add multiple-column support to StringIndexer so that users can transform multiple input columns into multiple output columns simultaneously.
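
To make the idea concrete, here is a minimal plain-Python sketch of what indexing several string columns in one pass could look like. It does not use Spark and is not the API proposed in this PR (the thread does not show it). The function names `fit_label_indices` and `index_columns` are hypothetical. It follows StringIndexer's convention of giving the most frequent label index 0.0. The alphabetical tie-break is an assumption added here so that the output is deterministic.

```python
from collections import Counter


def fit_label_indices(rows, column):
    """Map each distinct label in `column` to an index, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # Descending frequency; ties broken alphabetically (an assumption made
    # here so that equal-frequency labels always get the same index).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_columns(rows, input_cols, output_cols):
    """Hypothetical helper: index every input column into its paired output column."""
    if len(input_cols) != len(output_cols):
        raise ValueError("input_cols and output_cols must have the same length")
    # Fit one label-to-index mapping per input column.
    mappings = {col: fit_label_indices(rows, col) for col in input_cols}
    indexed = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        for in_col, out_col in zip(input_cols, output_cols):
            new_row[out_col] = mappings[in_col][row[in_col]]
        indexed.append(new_row)
    return indexed


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "M"},
    {"color": "green", "size": "M"},
]
result = index_columns(rows, ["color", "size"], ["color_idx", "size_idx"])
for row in result:
    print(row)
```

Each input column gets its own independent mapping, so the same index value in `color_idx` and `size_idx` refers to different labels.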