[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR #27570

Closed
zero323 wants to merge 9 commits

Member

@zero323 zero323 commented Feb 13, 2020

What changes were proposed in this pull request?

This pull request adds a SparkR wrapper for FMClassifier:

  • Supporting org.apache.spark.ml.r.FMClassifierWrapper.
  • FMClassificationModel S4 class.
  • Corresponding spark.fmClassifier, predict, summary and write.ml generics.
  • Corresponding docs and tests.

Why are the changes needed?

Feature parity.

Does this PR introduce any user-facing change?

No (new API).

How was this patch tested?

New unit tests.

SparkQA commented Feb 13, 2020

Test build #118378 has finished for PR 27570 at commit 42df01f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FMClassifierWrapperWriter(instance: FMClassifierWrapper) extends MLWriter
  • class FMClassifierWrapperReader extends MLReader[FMClassifierWrapper]

SparkQA commented Feb 16, 2020

Test build #118485 has finished for PR 27570 at commit e2c6b87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FMClassifierWrapperWriter(instance: FMClassifierWrapper) extends MLWriter
  • class FMClassifierWrapperReader extends MLReader[FMClassifierWrapper]

R/pkg/tests/fulltests/test_mllib_classification.R
)

prediction1 <- predict(model1, df)
expect_is(prediction1, "SparkDataFrame")
Contributor

Can we also check the predict result here?

Member Author

I am not sure such checks are really useful here. In practice, fitting is an unlikely failure point, and the most likely problems are related to parameter passing.

Contributor

I looked at the other classification tests. They seem to check both the typeof and the result of the prediction. Might it be better to be consistent with them?

Member Author

typeof is not applicable here. It is an S compatibility thingy and can only distinguish between core types (here it could only determine whether the value is an S4 object).
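
To make the distinction concrete, here is a minimal base R sketch (no Spark required). The class name Dummy is hypothetical and only stands in for an S4 class such as SparkDataFrame; the point is that typeof reports the storage type, while is() checks the class.

```r
library(methods)

# Hypothetical S4 class standing in for something like SparkDataFrame.
setClass("Dummy", representation(n = "numeric"))
obj <- new("Dummy", n = 1)

# typeof() only reports the internal storage type.
type_of_obj <- typeof(obj)   # "S4" for any S4 instance
type_of_chr <- typeof("a")   # "character"

# Class-based checks are what tell one S4 class from another.
is_dummy <- is(obj, "Dummy") # TRUE
```

This is why a class check in the style of expect_is says more about a prediction result than typeof would.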

SparkQA commented Feb 16, 2020

Test build #118498 has finished for PR 27570 at commit 31842d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 16, 2020

Test build #118500 has finished for PR 27570 at commit 1e2b879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Feb 17, 2020

We can combine this and SPARK-30819, but it doesn't matter much. They might cause a merge conflict with each other.
@huaxingao are you OK with this one?

#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', '.', ':', '+', and '-'.
#' @param factorSize dimensionality of the factors.
#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula?
Contributor

Have you checked this TODO yet?

Member Author

I think it is more a matter for discussion. Adding custom formula components is not very hard; the question is whether it makes sense to complicate things for this.

@huaxingao
Contributor

cc @felixcheung

SparkQA commented Feb 18, 2020

Test build #118603 has finished for PR 27570 at commit f1851a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

I don't know the R part well, but looks plausible.
If anyone felt strongly about putting it into 3.0 to match Python/Scala, I think that could be OK, but OK to put it into 3.1 too.

expect_equal(summary(model1)$factorSize, 3)

# Test model save/load
if (windows_with_hadoop()) {
Member

Out of curiosity, why this check?

Member Author

This is used to avoid failures in case of missing winutils. If I recall correctly, the primary target was CRAN tests (and these shouldn't run here anyway), but I think it is still applicable to AppVeyor.
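
For readers unfamiliar with the guard, a rough sketch of what such a check can look like follows. The function name can_write_model_files and its arguments are hypothetical and this is not SparkR's actual helper; it assumes the check boils down to "not on Windows, or HADOOP_HOME is set".

```r
# Hypothetical stand-in for the guard; not SparkR's actual helper.
# Assumption: save/load is only a problem on Windows without the Hadoop
# binaries (winutils), which are located via HADOOP_HOME.
can_write_model_files <- function(os_type = .Platform$OS.type,
                                  hadoop_home = Sys.getenv("HADOOP_HOME")) {
  !identical(os_type, "windows") || nzchar(hadoop_home)
}

if (can_write_model_files()) {
  message("model save/load tests would run here")
}
```

Taking the OS type and HADOOP_HOME as arguments keeps the sketch easy to exercise on any platform.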

@huaxingao
Contributor

@zero323 Could you please add an item for R in the FMClassifier section of ml-classification-regression.md? Please also update sparkr.md to include FMClassifier.

SparkQA commented Feb 23, 2020

Test build #118821 has finished for PR 27570 at commit 653b0dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence
  • In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities);

@zero323 zero323 force-pushed the SPARK-30820 branch 2 times, most recently from 815bdf4 to 2131c96 Compare February 23, 2020 09:59

SparkQA commented Feb 23, 2020

Test build #118832 has finished for PR 27570 at commit 815bdf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 23, 2020

Test build #118834 has finished for PR 27570 at commit 2131c96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 24, 2020

Test build #118862 has finished for PR 27570 at commit 27800b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Mar 2, 2020

Any more comments, @huaxingao? This will conflict with #27571 once merged, so @zero323, would you be able to update quickly after that? I think it's valid to merge this into 3.0 as the underlying functionality is in 3.0.

)

prediction1 <- predict(model1, df)
expect_is(prediction1, "SparkDataFrame")
Contributor

Seems to me that all the other ML R tests check the prediction result. For example, in LinearSVM,

  # Test prediction with string label
  prediction <- predict(model, training)
  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
  expected <- c("versicolor", "versicolor", "versicolor", "virginica",  "virginica",
                "virginica",  "virginica",  "virginica",  "virginica",  "virginica")
  expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected)

Is it OK if we do something similar here?

Member Author

Sure, we can. The question is what we are really trying to test in such cases. What types of implementation mistakes can we detect here that are not already covered by JVM tests and / or SparkR data frame tests?

These checks involve additional jobs, and many tests are already rejected to keep things manageable, so unless these serve a specific purpose, I'd prefer to keep things lean here.

In contrast, there are many SparkR ML failure modes that are real and could be tested, but testing them is crippled by the lack of the required API. But that's way beyond the scope of this PR.

Contributor

I am OK with this.

@huaxingao
Contributor

@zero323
I left a couple more inline comments. I think we are almost there. You did a good job on this PR. Thanks a lot for bearing with my nitpicking.
Also, if this goes into 3.0, all the version info needs to be updated.

@huaxingao
Contributor

@zero323 Sorry, one more thing: FMClassifier currently only supports binary classification, so the labels must be 0 and 1. The iris dataset you are using in the test has a string label, right? Maybe change the dataset?

Member Author

zero323 commented Mar 3, 2020

> @zero323 Sorry, one more thing: FMClassifier currently only supports binary classification, so the labels must be 0 and 1. The iris dataset you are using in the test has a string label, right? Maybe change the dataset?

If you check lines 492-495, this is already handled. Honestly, I am not aware of any dataset that is good for binary classification, doesn't require any transformations, and comes from the core datasets (so it doesn't create an annoying dependency).
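
As an illustration of the kind of transformation being discussed, this base R sketch collapses the three iris species into two classes. It is not the code at the lines referenced above; the column name label and the choice of setosa as the positive class are assumptions made for the example.

```r
# Base R only; `iris` ships with the core datasets package.
df <- iris

# Assumed encoding: setosa -> 1, every other species -> 0.
df$label <- as.numeric(df$Species == "setosa")
df$Species <- NULL

# Count how many rows fall into each of the two classes.
label_counts <- table(df$label)
```

A data frame prepared like this has a numeric 0/1 label and keeps the four numeric measurement columns as features.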

Member Author

zero323 commented Mar 3, 2020

> Any more comments, @huaxingao? This will conflict with #27571 once merged, so @zero323, would you be able to update quickly after that? I think it's valid to merge this into 3.0 as the underlying functionality is in 3.0.

I can, but it will require some additional checking, as I am pretty much limited to vim and what Jenkins outputs at the moment. Though honestly I'd rather see #27593 in 3.0 ‒ it is, for whatever reason, long overdue.

One way or another I'll try to make another sweep later today or tomorrow and see where I can get this.

SparkQA commented Mar 3, 2020

Test build #119237 has finished for PR 27570 at commit 6a62bf6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor

retest this please

SparkQA commented Mar 4, 2020

Test build #119334 has finished for PR 27570 at commit 2cdc769.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor

retest this please

SparkQA commented Mar 4, 2020

Test build #119342 has finished for PR 27570 at commit 2cdc769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 7, 2020

Test build #119519 has finished for PR 27570 at commit 2cdc769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 8, 2020

Test build #119525 has finished for PR 27570 at commit 0541f04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

zero323 commented Mar 8, 2020

Retest this please.

SparkQA commented Mar 8, 2020

Test build #119536 has finished for PR 27570 at commit 0541f04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 8, 2020

Test build #119541 has finished for PR 27570 at commit 6e56263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

Not closely, but looks good.

Member

srowen commented Mar 31, 2020

@zero323 if you want to take a look at the final small comments I think we can finish this out

Member Author

zero323 commented Apr 4, 2020

> @zero323 if you want to take a look at the final small comments I think we can finish this out

I believe we're left with this one ‒ #27570 (comment) ‒ but I am still not convinced that adding such tests provides any practical value here.

SparkQA commented Apr 5, 2020

Test build #120820 has finished for PR 27570 at commit 7126bbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Apr 6, 2020

OK if there are no objections I'm going to start merging these for 3.1

@srowen srowen closed this in 0d37f79 Apr 7, 2020
Member

srowen commented Apr 7, 2020

Merged to master

Member Author

zero323 commented Apr 7, 2020

Thanks @huajianmao @srowen @viirya

@zero323 zero323 deleted the SPARK-30820 branch April 7, 2020 14:47
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMClassifier`:

- Supporting `org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes apache#27570 from zero323/SPARK-30820.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
6 participants