[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR #27570
Conversation
Test build #118378 has finished for PR 27570 at commit
Test build #118485 has finished for PR 27570 at commit
mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala (outdated; resolved)
)
prediction1 <- predict(model1, df)
expect_is(prediction1, "SparkDataFrame")
Can we also check the predict result here?
I am not sure such checks are really useful here. In practice, fitting is an unlikely failure point, and the most likely problems are related to parameter passing.
I looked at the other classification tests. It seems they check the typeof and the result of the prediction. I guess it might be better to be consistent with the other tests?
typeof is not applicable here. typeof is an S-compatibility mechanism and can only be used to distinguish between core storage types (here it could only determine whether the value is an S4 object).
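A minimal plain-R sketch of the point above (no Spark required; the class name is made up for illustration) showing why typeof cannot replace a class check:

```r
# typeof() reports only the low-level (S-compatible) storage type.
# For every S4 object it returns the same string, "S4", so it cannot
# distinguish one S4 class (e.g. a SparkDataFrame) from another.
setClass("Dummy", representation(x = "numeric"))  # hypothetical class
obj <- new("Dummy", x = 1)

typeof(obj)       # "S4" for any S4 object, regardless of class
typeof(1L)        # "integer" -- core types, by contrast, are distinguishable
is(obj, "Dummy")  # TRUE -- a class-aware check, which is what expect_is() relies on
```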
Test build #118498 has finished for PR 27570 at commit
Test build #118500 has finished for PR 27570 at commit
We can combine this and SPARK-30819, but it doesn't matter much. They might cause a merge conflict with each other.
#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', '.', ':', '+', and '-'.
#' @param factorSize dimensionality of the factors.
#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula?
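The listed operators follow base R formula semantics; a quick plain-R illustration with a hypothetical data frame (no Spark required):

```r
# Small synthetic frame: numeric predictors a, b and a two-level factor g.
df <- data.frame(y = rnorm(6), a = rnorm(6), b = rnorm(6), g = gl(2, 3))

# '~' separates the response from the predictors; '+' adds a term
colnames(model.matrix(y ~ a + b, df))  # "(Intercept)" "a" "b"

# '.' stands for all remaining columns; '-' removes a term
colnames(model.matrix(y ~ . - b, df))  # "(Intercept)" "a" "g2"

# ':' denotes an interaction between two terms
colnames(model.matrix(y ~ a:b, df))    # "(Intercept)" "a:b"
```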
Have you checked this TODO yet?
I think it's more a point for discussion. Adding custom formula components is not very hard; the question is whether it makes sense to add that complexity for such a thing.
mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala (outdated; resolved)
cc @felixcheung
Test build #118603 has finished for PR 27570 at commit
I don't know the R part well, but looks plausible.
If anyone felt strongly about putting it into 3.0 to match Python/Scala, I think that could be OK, but OK to put it into 3.1 too.
expect_equal(summary(model1)$factorSize, 3)

# Test model save/load
if (windows_with_hadoop()) {
Out of curiosity, why this check?
This is used to avoid failures in case of missing winutils. If I recall correctly, the primary target was CRAN tests (and these shouldn't run here anyway), but I think it is still applicable to AppVeyor.
@zero323 Could you please add an item for R in the FMClassifier section of ml-classification-regression.md? Please also update sparkr.md to include FMClassifier.
Test build #118821 has finished for PR 27570 at commit
Force-pushed from 815bdf4 to 2131c96.
Test build #118832 has finished for PR 27570 at commit
Test build #118834 has finished for PR 27570 at commit
Test build #118862 has finished for PR 27570 at commit
Any more comments, @huaxingao? This will conflict with #27571 once merged, so @zero323, would you be able to update quickly after that? I think it's valid to merge this into 3.0, as the underlying functionality is in 3.0.
)
prediction1 <- predict(model1, df)
expect_is(prediction1, "SparkDataFrame")
Seems to me that all the other ML R tests check the prediction result. For example, in LinearSVM:

# Test prediction with string label
prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica",
              "virginica", "virginica", "virginica", "virginica", "virginica")
expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected)

Is it OK if we do something similar here?
Sure, we can. The question is what we are really trying to test in such cases. What types of implementation mistakes can we detect here that are not already covered by the JVM tests and/or the SparkR data frame tests?
These checks involve additional jobs, and many tests have already been rejected to keep things manageable, so unless they serve a specific purpose, I'd prefer to keep things lean here.
In contrast, there are many SparkR ML failure modes that are real and could be tested, but such tests are crippled by the lack of the required API. That's way beyond the scope of this PR, though.
I am OK with this.
@zero323
@zero323 Sorry, one more thing:
If you check lines 492-495, this is already handled. Honestly, I am not aware of any dataset that is good for binary classification, won't require any transformations, and comes from the core datasets (so it doesn't create an annoying dependency).
I can, but it will require some additional checking, as I am pretty much limited to vim and what Jenkins outputs at the moment. Though honestly, I'd rather see #27593 in 3.0 ‒ it is, for whatever reason, long overdue. One way or another, I'll try to make another sweep later today or tomorrow and see where I can get with this.
Test build #119237 has finished for PR 27570 at commit
retest this please
Test build #119334 has finished for PR 27570 at commit
retest this please
Test build #119342 has finished for PR 27570 at commit
Test build #119519 has finished for PR 27570 at commit
Test build #119525 has finished for PR 27570 at commit
Retest this please.
Test build #119536 has finished for PR 27570 at commit
Test build #119541 has finished for PR 27570 at commit
Not closely, but looks good.
mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala (outdated; resolved)
@zero323 if you want to take a look at the final small comments, I think we can finish this out.
I believe we're left with this one ‒ #27570 (comment) ‒ but I am still not convinced that adding such tests provides any practical value here.
Test build #120820 has finished for PR 27570 at commit
OK, if there are no objections, I'm going to start merging these for 3.1.
Merged to master.
Thanks @huajianmao @srowen @viirya
### What changes were proposed in this pull request?

This pull request adds a SparkR wrapper for `FMClassifier`:
- Supporting `org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes apache#27570 from zero323/SPARK-30820.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
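For reference, a minimal usage sketch of the API added by this PR (not runnable standalone: it assumes a working SparkR installation with an active Spark session, and the iris-based columns and output path are purely illustrative):

```r
library(SparkR)
sparkR.session()  # assumes SPARK_HOME is configured

# createDataFrame() replaces '.' in column names with '_'
df <- createDataFrame(iris)
# derive a binary string label for illustration
df$label <- ifelse(df$Species == "setosa", "setosa", "other")

# fit a factorization machines classifier (parameter values are illustrative)
model <- spark.fmClassifier(df, label ~ Sepal_Length + Sepal_Width,
                            factorSize = 4, fitLinear = TRUE)
summary(model)

preds <- predict(model, df)
head(select(preds, "label", "prediction"))

# persist and reload (hypothetical path)
write.ml(model, "/tmp/fm_classifier_model")
model2 <- read.ml("/tmp/fm_classifier_model")
```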