[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes by yanboliang · Pull Request #15826 · apache/spark

yanboliang · 2016-11-09T08:42:14Z

What changes were proposed in this pull request?

Refactor out trainWithLabelCheck and make mllib.NaiveBayes call into it.
Avoid capturing the outer object for modelType.
Move requireNonnegativeValues and requireZeroOneBernoulliValues to companion object.
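
The second item can be illustrated with a minimal sketch, assuming a hypothetical `Estimator` class that stands in for the real estimator (it is not Spark's `NaiveBayes`). A function literal that reads a field directly closes over the enclosing instance; copying the field into a local `val` first means the literal closes over that value only.

```scala
// Hypothetical stand-in for an estimator; not Spark's actual NaiveBayes class.
class Estimator(val modelType: String) {

  // Reads the field inside the function literal, so the closure holds a
  // reference to the enclosing Estimator instance (`this`).
  def capturingCheck: Double => Boolean =
    (x: Double) => if (modelType == "bernoulli") x == 0.0 || x == 1.0 else x >= 0.0

  // Copies the field into a local val first, so the closure holds only a String.
  def localCopyCheck: Double => Boolean = {
    val localModelType = modelType
    (x: Double) => if (localModelType == "bernoulli") x == 0.0 || x == 1.0 else x >= 0.0
  }
}
```

Both functions return the same results; they differ only in what the closure keeps alive and, in a distributed setting, in what has to be serialized along with it.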

How was this patch tested?

Existing tests.

…te extra pass on data

## What changes were proposed in this pull request?

[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, mllib and ml handle labels differently:

* mllib allows input labels such as {-1, +1}, whereas ml assumes the input labels are in the range [0, numClasses).
* The mllib ```NaiveBayesModel``` exposes ```labels```, but the ml one does not, because of the assumption mentioned above.

During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used ```val labels = data.map(_.label).distinct().collect().sorted``` to get the distinct labels first and then encoded the labels for training. This involves an extra Spark job compared with the original implementation. Since ```NaiveBayes``` does only one aggregation pass during training, adding another one seems inefficient. We can get the labels in the same single pass as ```NaiveBayes``` training and send them to the MLlib side.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15402 from yanboliang/spark-17835.
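
The single-pass idea in that commit message can be sketched with plain Scala collections. This is an illustration under stated assumptions, not the Spark implementation: `Point`, `SinglePassSketch`, `aggregate`, and `labelsOf` are hypothetical names, and `Vector` here is the Scala collection, not Spark's ML vector type. The point is that the per-label aggregate already contains every distinct label as a key, so no separate `distinct` pass is needed.

```scala
// Hypothetical record type for this sketch; Spark's own classes are not used.
final case class Point(label: Double, features: Vector[Double])

object SinglePassSketch {

  // One pass over the data: per-label instance count and per-feature sums.
  def aggregate(data: Seq[Point]): Map[Double, (Long, Vector[Double])] =
    data.foldLeft(Map.empty[Double, (Long, Vector[Double])]) { (acc, p) =>
      val (count, sums) =
        acc.getOrElse(p.label, (0L, Vector.fill(p.features.length)(0.0)))
      val newSums = sums.zip(p.features).map { case (s, x) => s + x }
      acc.updated(p.label, (count + 1L, newSums))
    }

  // The distinct labels are the keys of the aggregate; sorting them gives the
  // same ordering that a separate distinct-and-sort pass would produce.
  def labelsOf(agg: Map[Double, (Long, Vector[Double])]): Vector[Double] =
    agg.keys.toVector.sorted
}
```

Labels such as -1.0 and +1.0 work here without any prior encoding, because they are only used as map keys.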

yanboliang · 2016-11-09T08:57:39Z

-  private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
+  private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli)
+
+  private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => {


@thunterdb I still define these two functions with val, which should be more appropriate, since val and def behave differently when defining a function: http://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function


yanboliang · 2016-11-09T08:59:50Z

cc @thunterdb @zhengruifeng

SparkQA · 2016-11-09T09:48:55Z

Test build #68398 has finished for PR 15826 at commit 48f7e86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

thunterdb

@yanboliang thanks! I have a few small comments.

thunterdb · 2016-11-09T19:07:38Z

-  private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
+  private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli)
+
+  private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => {


@yanboliang indeed, it makes a difference when defining methods inside another method. In the current case, you made the function a value of the companion object, so you can safely turn it into a method.

thunterdb · 2016-11-09T19:08:12Z

+      s"Naive Bayes requires nonnegative feature values but found $v.")
+  }
+
+  private[NaiveBayes] val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {


same thing here
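
A minimal sketch of the two forms discussed in this thread, assuming a hypothetical `CheckSketch` object and using `Seq[Double]` in place of Spark's `Vector` type so that the example is self-contained. The `val` form stores a function object in a field; the `def` form is an ordinary method that the compiler converts to a function (eta-expansion) wherever a function value is expected.

```scala
// Hypothetical object for illustration; Seq[Double] stands in for Spark's Vector.
object CheckSketch {

  // Function value: a Function1 instance held in a field of the object.
  val requireNonnegativeValuesFn: Seq[Double] => Unit = (v: Seq[Double]) => {
    require(v.forall(_ >= 0.0),
      s"Naive Bayes requires nonnegative feature values but found $v.")
  }

  // Method: no field is allocated, and it can still be passed where a
  // Seq[Double] => Unit is expected, e.g. rows.foreach(requireNonnegativeValues).
  def requireNonnegativeValues(v: Seq[Double]): Unit = {
    require(v.forall(_ >= 0.0),
      s"Naive Bayes requires nonnegative feature values but found $v.")
  }
}
```

Call sites such as `rows.foreach(CheckSketch.requireNonnegativeValues)` look the same for either form, which is why the switch from `val` to `def` is a local change.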

thunterdb · 2016-11-09T19:09:22Z

-    if (isML) {
+  private[spark] def trainWithLabelCheck(
+      dataset: Dataset[_],
+      positiveLabel: Boolean = true): NaiveBayesModel = {


Since it is an internal method, I suggest not using a default argument and instead passing the value explicitly at the call site above.
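
A sketch of that suggestion, under loudly stated assumptions: the meaning given to `positiveLabel` here (when true, labels must be nonnegative) is assumed for illustration only, the `Int` return value is a stand-in for a trained model, and `trainMl` and `trainMllib` are hypothetical entry points. Without a default, every caller has to state which behavior it wants.

```scala
// Hypothetical sketch; names other than trainWithLabelCheck and positiveLabel
// are invented for illustration, and the flag's semantics are an assumption.
object LabelCheckSketch {

  // No default value: callers cannot silently fall back to one behavior.
  def trainWithLabelCheck(labels: Seq[Double], positiveLabel: Boolean): Int = {
    if (positiveLabel) {
      require(labels.forall(_ >= 0.0),
        s"Labels must be nonnegative but found ${labels.filter(_ < 0.0)}.")
    }
    labels.distinct.size // stand-in for a trained model: number of distinct labels
  }

  // Each entry point names its choice explicitly.
  def trainMl(labels: Seq[Double]): Int =
    trainWithLabelCheck(labels, positiveLabel = true)

  def trainMllib(labels: Seq[Double]): Int =
    trainWithLabelCheck(labels, positiveLabel = false)
}
```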


SparkQA · 2016-11-10T05:48:13Z

Test build #68443 has finished for PR 15826 at commit 89d0c77.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-11-11T12:16:19Z

@thunterdb Any more comments? Thanks!

thunterdb · 2016-11-11T18:59:17Z

@yanboliang that looks great, thank you. LGTM.

yanboliang · 2016-11-12T14:11:54Z

Merged into master and branch-2.1. Thanks for reviewing.

## What changes were proposed in this pull request?

* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15826 from yanboliang/spark-14077-2.

(cherry picked from commit 22cb3a0)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>


yanboliang added 3 commits November 8, 2016 23:16

Refactor out trainWithLabelCheck.

d0f95fb

Avoid capturing the outer object.

8cf2a80

Update some variables to more private.

48f7e86

yanboliang changed the title ~~[SPARK-14077][ML][FOLLOW-UP] Minor cleanups for NaiveBayes~~ [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes Nov 9, 2016

yanboliang mentioned this pull request Nov 9, 2016

[SPARK-14077][ML] Refactor NaiveBayes to support weighted instances #12819

Closed


thunterdb suggested changes Nov 9, 2016


Turn requireNonnegativeValues and requireZeroOneBernoulliValues from …

89d0c77

…val into def.

asfgit closed this in 22cb3a0 Nov 12, 2016

yanboliang deleted the spark-14077-2 branch November 12, 2016 14:21

yanboliang wants to merge 4 commits into apache:master from yanboliang:spark-14077-2
