[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR #27570

zero323 · 2020-02-13T21:57:23Z

What changes were proposed in this pull request?

This pull request adds a SparkR wrapper for FMClassifier:

- Supporting org.apache.spark.ml.r.FMClassifierWrapper.
- FMClassificationModel S4 class.
- Corresponding spark.fmClassifier, predict, summary and write.ml generics.
- Corresponding docs and tests.

Why are the changes needed?

Feature parity.

Does this PR introduce any user-facing change?

No (new API).

How was this patch tested?

New unit tests.

SparkQA · 2020-02-13T23:19:04Z

Test build #118378 has finished for PR 27570 at commit 42df01f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class FMClassifierWrapperWriter(instance: FMClassifierWrapper) extends MLWriter
class FMClassifierWrapperReader extends MLReader[FMClassifierWrapper]

SparkQA · 2020-02-16T02:14:40Z

Test build #118485 has finished for PR 27570 at commit e2c6b87.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class FMClassifierWrapperWriter(instance: FMClassifierWrapper) extends MLWriter
class FMClassifierWrapperReader extends MLReader[FMClassifierWrapper]

examples/src/main/r/ml/fmClassifier.R

mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala

R/pkg/tests/fulltests/test_mllib_classification.R

huaxingao · 2020-02-16T07:12:25Z

R/pkg/tests/fulltests/test_mllib_classification.R

+  )
+
+  prediction1 <- predict(model1, df)
+  expect_is(prediction1, "SparkDataFrame")


Can we also check the predict result here?

I am not sure such checks are really useful here. In practice, fitting is not a likely failure point, and the most likely problems are related to parameter passing.

I looked at the other classification tests. It seems they check both the typeof and the result of the prediction. I guess it might be better to be consistent with the other tests?

typeof is not applicable here. It is an S compatibility thingy and can only be used to distinguish between core types (here it could only determine whether the value is an S4 object).
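
To illustrate the point about typeof, here is a minimal base-R sketch. The `Demo` class is hypothetical and only stands in for an S4 class such as SparkDataFrame; no Spark session is needed to run it.

```r
library(methods)

# Hypothetical S4 class standing in for a SparkR class such as SparkDataFrame.
setClass("Demo", representation(x = "numeric"))
obj <- new("Demo", x = 1)

# typeof() reports only the low-level storage type of a value.
type_of_obj <- typeof(obj)        # the same answer for every plain S4 instance
type_of_chr <- typeof("setosa")   # a core type

# is() checks class membership, which is what a class-based expectation needs.
is_demo <- is(obj, "Demo")
```

In other words, a typeof assertion on a prediction result would say nothing about whether it is the expected class, which is why a class check was used in the test.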

R/pkg/R/mllib_classification.R

mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala

SparkQA · 2020-02-16T11:47:57Z

Test build #118498 has finished for PR 27570 at commit 31842d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-16T12:04:00Z

Test build #118500 has finished for PR 27570 at commit 1e2b879.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-02-17T15:13:28Z

We can combine this and SPARK-30819, but it doesn't matter much. They might cause a merge conflict with each other.
@huaxingao are you OK with this one?

huaxingao · 2020-02-17T23:30:19Z

R/pkg/R/mllib_classification.R

+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term.  # TODO Can we express this with formula?


Have you checked this TODO yet?

I think it is more a matter for discussion. Adding custom formula components is not very hard; the question is whether it makes sense to complicate things for this.

huaxingao · 2020-02-17T23:43:42Z

R/pkg/tests/fulltests/test_mllib_classification.R

+  )
+
+  prediction1 <- predict(model1, df)
+  expect_is(prediction1, "SparkDataFrame")


I looked at the other classification tests. It seems they check both the typeof and the result of the prediction. I guess it might be better to be consistent with the other tests?

examples/src/main/r/ml/fmClassifier.R

mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala

huaxingao · 2020-02-18T01:56:08Z

cc @felixcheung

SparkQA · 2020-02-18T02:40:24Z

Test build #118603 has finished for PR 27570 at commit f1851a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I don't know the R part well, but looks plausible.
If anyone felt strongly about putting it into 3.0 to match Python/Scala, I think that could be OK, but OK to put it into 3.1 too.

srowen · 2020-02-20T14:35:38Z

R/pkg/tests/fulltests/test_mllib_classification.R

+  expect_equal(summary(model1)$factorSize, 3)
+
+  # Test model save/load
+  if (windows_with_hadoop()) {


Out of curiosity, why this check?

This is used to avoid failures when winutils is missing. If I recall correctly, the primary target was CRAN tests (and these shouldn't run here anyway), but I think it is still applicable to AppVeyor.
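
A guard of this kind could look roughly like the sketch below. It is an illustration only, not the SparkR helper itself: the function name `can_test_save_load` is hypothetical, and the sketch assumes that winutils availability is signalled by the `HADOOP_HOME` environment variable.

```r
# Hypothetical guard: TRUE on non-Windows systems, and on Windows only when
# HADOOP_HOME is set (assumed here to indicate that winutils is available).
can_test_save_load <- function(os_type = .Platform$OS.type,
                               hadoop_home = Sys.getenv("HADOOP_HOME", unset = NA)) {
  !identical(os_type, "windows") || !is.na(hadoop_home)
}

if (can_test_save_load()) {
  # Save/load assertions would go here.
  message("running save/load checks")
}
```

Wrapping only the save/load part keeps the rest of the test running on every platform.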

huaxingao · 2020-02-20T18:47:41Z

@zero323 Could you please add an item for R in the FMClassifier section of ml-classification-regression.md? Please also update sparkr.md to include FMClassifier.

SparkQA · 2020-02-23T00:50:35Z

Test build #118821 has finished for PR 27570 at commit 653b0dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence
In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities);

docs/ml-classification-regression.md

docs/sparkr.md

examples/src/main/r/ml/fmClassifier.R

SparkQA · 2020-02-23T11:15:26Z

Test build #118832 has finished for PR 27570 at commit 815bdf4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-23T11:30:41Z

Test build #118834 has finished for PR 27570 at commit 2131c96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

docs/ml-classification-regression.md

SparkQA · 2020-02-24T11:29:59Z

Test build #118862 has finished for PR 27570 at commit 27800b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-03-02T14:52:20Z

Any more comments @huaxingao ? this will conflict with #27571 once merged, so @zero323 would you be able to update quickly after that? I think it's valid to merge this into 3.0 as the underlying functionality is in 3.0.

huaxingao · 2020-03-02T16:41:13Z

R/pkg/tests/fulltests/test_mllib_classification.R

+  )
+
+  prediction1 <- predict(model1, df)
+  expect_is(prediction1, "SparkDataFrame")


Seems to me that all the other ML R tests check the prediction result. For example, in LinearSVM,

# Test prediction with string label
prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica",
              "virginica", "virginica", "virginica", "virginica", "virginica")
expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected)

Is it OK if we do something similar here?

Sure we can. The question is what we are really trying to test in such cases? What types of implementation mistakes can we detect here, that are not already covered by JVM tests and / or SparkR data frames tests?

These checks involve additional jobs and many tests are already rejected to keep things manageable, so unless these serve specific purpose, I'd prefer to keep things lean here.

In contrast, there are many SparkR ML failure modes that are real and could be tested, but testing them is crippled by the lack of the required API. But that's way beyond the scope of this PR.

I am OK with this.

R/pkg/tests/fulltests/test_mllib_classification.R

huaxingao · 2020-03-02T17:02:42Z

@zero323
I left a couple of more inline comments. I think we are almost there. You did a good job on this PR. Thanks a lot for bearing with my nitpicking.
Also, if this goes into 3.0, all the version info needs to be updated.

huaxingao · 2020-03-02T17:26:30Z

@zero323 Sorry, one more thing: FMClassifier currently only supports binary classification, so the labels must be 0 and 1. The dataset iris you are using in the test has string label, right? Maybe change the dataset?

zero323 · 2020-03-03T15:09:45Z

> @zero323 Sorry, one more thing: FMClassifier currently only supports binary classification, so the labels must be 0 and 1. The dataset iris you are using in the test has string label, right? Maybe change the dataset?

If you check lines 492-495, this is already handled. Honestly, I am not aware of any dataset that is good for binary classification, doesn't require any transformations, and comes from the core datasets (so it doesn't create an annoying dependency).
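
For readers without the test file at hand, the kind of transformation meant here can be sketched in base R as below. This only illustrates collapsing the three iris species into two classes; it is not the code at lines 492-495, and the `label` column name and the choice of setosa as the positive class are assumptions.

```r
# iris ships with base R (package datasets), so no extra dependency is needed.
binary_iris <- iris

# Collapse the three species into a binary label: 1 for setosa, 0 otherwise.
binary_iris$label <- as.integer(binary_iris$Species == "setosa")
binary_iris$Species <- NULL

# Class balance of the derived label.
label_counts <- table(binary_iris$label)
```

The same idea carries over to a SparkDataFrame, where the relabelling would be done with column expressions instead of base-R indexing.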

zero323 · 2020-03-03T15:33:21Z

> Any more comments @huaxingao ? this will conflict with #27571 once merged, so @zero323 would you be able to update quickly after that? I think it's valid to merge this into 3.0 as the underlying functionality is in 3.0.

I can, but it will require some additional checking, as I am pretty much limited to vim and what Jenkins outputs at the moment. Though honestly I'd rather see #27593 in 3.0 ‒ it is, for whatever reason, long overdue.

One way or another I'll try to make another sweep later today or tomorrow and see where I can get this.

SparkQA · 2020-03-03T18:05:43Z

Test build #119237 has finished for PR 27570 at commit 6a62bf6.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-03-04T19:50:44Z

retest this please

SparkQA · 2020-03-04T21:22:53Z

Test build #119334 has finished for PR 27570 at commit 2cdc769.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-03-04T21:56:29Z

retest this please

SparkQA · 2020-03-04T23:22:47Z

Test build #119342 has finished for PR 27570 at commit 2cdc769.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-07T18:25:35Z

Test build #119519 has finished for PR 27570 at commit 2cdc769.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-08T03:12:48Z

Test build #119525 has finished for PR 27570 at commit 0541f04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zero323 · 2020-03-08T13:15:44Z

Retest this please.

SparkQA · 2020-03-08T14:32:28Z

Test build #119536 has finished for PR 27570 at commit 0541f04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-08T22:28:03Z

Test build #119541 has finished for PR 27570 at commit 6e56263.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

Not closely, but looks good.

R/pkg/R/mllib_classification.R

mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala

srowen · 2020-03-31T21:49:12Z

@zero323 if you want to take a look at the final small comments I think we can finish this out

zero323 · 2020-04-04T23:54:43Z

> @zero323 if you want to take a look at the final small comments I think we can finish this out

I believe we're left with this one ‒ #27570 (comment) ‒ but I am still not convinced that adding such tests provides any practical value here.

SparkQA · 2020-04-05T00:51:02Z

Test build #120820 has finished for PR 27570 at commit 7126bbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-04-06T14:15:27Z

OK if there are no objections I'm going to start merging these for 3.1

srowen · 2020-04-07T14:02:10Z

Merged to master

zero323 · 2020-04-07T14:47:32Z

Thanks @huajianmao @srowen @viirya

### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMClassifier`:

- Supporting `org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes apache#27570 from zero323/SPARK-30820.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

zero323 force-pushed the SPARK-30820 branch from 42df01f to e2c6b87 Compare February 15, 2020 20:14

huaxingao reviewed Feb 16, 2020

zero323 force-pushed the SPARK-30820 branch from 31842d0 to 1e2b879 Compare February 16, 2020 10:47

huaxingao reviewed Feb 17, 2020

srowen reviewed Feb 20, 2020

huaxingao reviewed Feb 23, 2020

docs/ml-classification-regression.md (outdated, resolved)

docs/ml-classification-regression.md (outdated, resolved)

docs/sparkr.md (outdated, resolved)

examples/src/main/r/ml/fmClassifier.R (outdated, resolved)

zero323 force-pushed the SPARK-30820 branch 2 times, most recently from 815bdf4 to 2131c96 Compare February 23, 2020 09:59

huaxingao reviewed Feb 23, 2020

docs/ml-classification-regression.md (resolved)

dongjoon-hyun added ML SPARKR labels Feb 28, 2020

huaxingao reviewed Mar 2, 2020

zero323 force-pushed the SPARK-30820 branch from 2cdc769 to 0541f04 Compare March 8, 2020 01:51

zero323 added 8 commits March 8, 2020 21:55

First draft of FMClassifier

d44582f

Address comments

a018ce9

Address the comments

2d6af0d

Update docs

7c5fcf0

Update docs

a41ae58

Remove section

c16e8be

Check if summaries of orginal and loaded are equal

607b66d

Unlink

6e56263

zero323 force-pushed the SPARK-30820 branch from 0541f04 to 6e56263 Compare March 8, 2020 20:56

viirya reviewed Mar 26, 2020

R/pkg/R/mllib_classification.R (resolved)

mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala (outdated, resolved)

Remove obsolete load

7126bbf

srowen closed this in 0d37f79 Apr 7, 2020

zero323 deleted the SPARK-30820 branch April 7, 2020 14:47
