[SPARK-31309][ML] Migrate the ChiSquareTest from MLlib to ML #28078
Test build #120626 has finished for PR 28078 at commit
Test build #120627 has finished for PR 28078 at commit
I have a question, @zhengruifeng . According to the new rubric,
I think .mllib will be around forever, yes. It's not that crazy; it still provides the RDD API for many of the same operations, and they share mostly one code base under the hood. And RDDs are not going away.
Thanks, @srowen.
@srowen @dongjoon-hyun OK, we will keep the .mllib, so I will close this PR. Thanks for pointing it out.
I don't think that means we can't refactor and improve the relationship between .ml and .mllib; I'm just saying I don't think we will deprecate or remove the latter. If this is just moving code, OK, maybe not worth it. If we can reduce duplication, that's OK.
OK, I did not get the point. I think it is worthwhile to move some impls to the .ml side, since in .ml we can use both Dataset and RDD in the implementation. Also, because we will still need to maintain those impls, moving them to .ml may help keep them consistent.
ee2248f to 35c6db2
Test build #120666 has finished for PR 28078 at commit
It's not a bad change and I don't mind some cleanup personally; some are just formatting changes. But the logic here doesn't depend on whether .mllib will be removed. That isn't an end goal as far as I can tell.
@srowen This PR mainly moves the code without simplifying it. But it may be the first step toward further improvements; one possibility is returning results as multiple rows instead of a single row (in this PR the methods For other impls, if I want to make an improvement based on DF or DS, should I just use DF or DS on the .mllib side (since the .mllib side is almost entirely implemented on RDDs) or move it to .ml first?
Yes, at least GMM was implemented on both sides.
Given the other discussion, I'm not sure we want to change the requests of the chi-squared test method. That doesn't mean there aren't other reasons to do this. Yes, the general idea is to call .mllib from .ml and not the other way around. If that's becoming hard or impossible, that could justify moving the logic. If there isn't a strong reason, I suppose I just wouldn't do it. I don't feel strongly.
OK, I will close this PR.
What changes were proposed in this pull request?
1, Move the impl of ChiSq from .mllib to the .ml side;
2, in .mllib.ChiSqTest, call the impl in .ml.ChiSquareTest
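The delegation described in the two steps above can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this PR: the object names `MlChiSquareSketch` and `MllibChiSqSketch` are hypothetical stand-ins for the .ml and .mllib sides, and the implementation only computes the Pearson chi-squared statistic and degrees of freedom for a small in-memory contingency table (no p-value, no RDD or Dataset handling).

```scala
// Hypothetical stand-in for the .ml side: after the move, the implementation lives here.
object MlChiSquareSketch {
  /** Returns (Pearson chi-squared statistic, degrees of freedom) for a contingency table.
    * Assumes a rectangular table in which every row and column has a non-zero sum.
    */
  def test(observed: Array[Array[Double]]): (Double, Int) = {
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val total = rowSums.sum
    val cellTerms = for {
      i <- observed.indices
      j <- observed(i).indices
    } yield {
      // Expected count under independence of the row and column variables.
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      diff * diff / expected
    }
    (cellTerms.sum, (rowSums.length - 1) * (colSums.length - 1))
  }
}

// Hypothetical stand-in for the .mllib side: the public entry point is unchanged,
// but it now forwards to the .ml-side implementation instead of owning the logic.
object MllibChiSqSketch {
  def chiSquared(observed: Array[Array[Double]]): (Double, Int) =
    MlChiSquareSketch.test(observed)
}
```

With this layout, callers of the legacy entry point see the same results as before, while only one copy of the logic has to be maintained. Note that this is the reverse of the direction described in the discussion as the general idea, where .ml calls .mllib.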
Why are the changes needed?
We should migrate the algs from MLlib to ML
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites