Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31309][ML] Migrate the ChiSquareTest from MLlib to ML #28078

Closed
wants to merge 4 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

1, Move the impl of ChiSq from .mllib to the .ml side;
2, in .mllib.ChiSqTest, call the impl in .ml.ChiSquareTest

Why are the changes needed?

We should migrate the algs from MLlib to ML

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing testsuites

@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120626 has finished for PR 28078 at commit 9e708d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120627 has finished for PR 28078 at commit ee2248f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

I have a question, @zhengruifeng . According to the new rubric, mllib will survive like HiveContext forever?

We should migrate the algs from MLlib to ML

@srowen
Copy link
Member

srowen commented Mar 31, 2020

I think .mllib will be around forever, yes. It's not that crazy; it still provides the RDD API for many of the same operations, and they share mostly one code base under the hood. And RDDs are not going away.

@dongjoon-hyun
Copy link
Member

Thanks, @srowen .

@zhengruifeng
Copy link
Contributor Author

@srowen @dongjoon-hyun OK, we will keep the .mllib, so I will close this PR. Thanks for pointing it out.

@zhengruifeng zhengruifeng deleted the mv_chisq_ml branch April 1, 2020 02:38
@srowen
Copy link
Member

srowen commented Apr 1, 2020

I don't think that means we can't refactor and improve the relationship between .ml and .mllib; I'm just saying I don't think we will deprecate or remove the latter. If this is just moving code, OK, maybe not worth it. If we can reduce duplication, that's OK.

@zhengruifeng
Copy link
Contributor Author

OK, I did not get the point.

I think it worthwhile to move some impls to the .ml side, since in .ml we can use both Dataset and RDD in implementation. Also beacuse that we will still need to maintain those impls, moving them to .ml may help keeping consistence

@zhengruifeng zhengruifeng restored the mv_chisq_ml branch April 1, 2020 06:00
@zhengruifeng zhengruifeng reopened this Apr 1, 2020
@SparkQA
Copy link

SparkQA commented Apr 1, 2020

Test build #120666 has finished for PR 28078 at commit 35c6db2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Apr 1, 2020

It's not a bad change and I don't mind some cleanup personally; some are just formatting changes.
The general pattern is to let .ml reuse code in .mllib where they need to share, rather than both point to each other. I'm sure there are cases of both unfortunately (? I'd have to check). I might not move the code just to move it though unless it buys some simplification.

But yet the logic here doesn't depend on whether .mllib will be removed. That isn't an end goal as far as I can tell.

@zhengruifeng
Copy link
Contributor Author

@srowen This PR is mainly moving the code without bringing more simplification. But it maybe the first step to futher improvements, a possible point may be returning results in rows instead a single row (in this PR the methods chiSquaredDenseFeatures and chiSquaredSparseFeatures are added to directly return rdd of test results, I thought migration is enough to open a PR, so I did not point it out).

For other impls, if I want to make an improvements based on DF or DS, should I just use DF or DS in the .mllib side (since .mllib side are almost impled on RDD) or move it to .ml at first?

I'm sure there are cases of both unfortunately (? I'd have to check).

Yes, at least GMM was implemented in both side.

@srowen
Copy link
Member

srowen commented Apr 8, 2020

Given the other discussion, I'm not sure we want to change the requests of the chi-squared test method. That doesn't mean there aren't other reasons to do this.

Yes the general idea is to call .mllib from .ml and not the other way around. If that's becoming hard or impossible, that could justify moving the logic.

If there isn't a strong reason I suppose I just wouldn't do it. I don't feel strongly.

@zhengruifeng
Copy link
Contributor Author

OK, I will close this PR.

@zhengruifeng zhengruifeng deleted the mv_chisq_ml branch April 9, 2020 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants